I have it in mind to build a Multilayer Perceptron for predicting financial time series. I understand the algorithm concepts (linear combiner, activation function, etc). But while trying to build the input, hidden and output layers, I'm running into some questions about the basics.
A) I'm confused about the input vs. hidden layer. Is it the input-layer or the hidden-layer neurons that have i) the linear combiner and ii) the activation function?
Example: for 1 tick of USDEUR trade data, say my input vector is [ time, bid, ask, bid-volume, ask-volume ]. If my input layer has 5 neurons, does that entire set go to each neuron in the input layer? Or is "bid" the input to one neuron, "ask" the input to another neuron, and so on?
B) What are the values of the hidden-layer neurons? If we have a fired value from our input neurons, how do we translate that into the values that we ultimately want to predict, [ bid-percent-change, ask-percent-change ]?
I.e., ultimately I'd like the output layer to give me a vector of 2: [ bid-percent-change, ask-percent-change ].
C) For a financial time series, a "sliding window" is suggested as an input vector. How would that input vector fit in with the other inputs: bid, ask, time, etc.?
Answer
First, let's speak about perceptrons in general:
- their input $X(t)$ is a $K$-dimensional vector, so if you want to use $(P_{bid}(t),P_{ask}(t), Q_{bid}(t),Q_{ask}(t))$, then without any effort (though we will see later that, as usual, it is better to put in some effort) you would take: $$X(t)=(P_{bid}(t),P_{ask}(t), Q_{bid}(t),Q_{ask}(t))'\in\mathbb{R}^4$$
- then the hidden layer is made of $N$ hidden neurons $(Z_n)_{1\leq n\leq N}$, each of them associated with weights $(w^h_{k,n})_{1\leq k\leq K}$ and a bias $b^h_n$: the activation of one hidden unit $Z_n$ is $$Z_n(t)=\Phi\left( \sum_k w^h_{k,n}\cdot X_k(t) + b^h_n\right)$$ where $\Phi(\cdot)$ is the activation function of the perceptron; if you want to do something fancy, you can use different activation functions, but the usual one is a sigmoid (typically $\tanh$). Note that $Z=(Z_n)_{1\leq n\leq N}$ is in $\mathbb{R}^N$.
It means that each hidden unit $Z_n(t)$ will receive a combination of all the inputs. In your example: $$Z_n(t)=\Phi\left( w^h_{1,n} \,P_{bid}(t) + w^h_{2,n} \,P_{ask}(t)+w^h_{3,n} \, Q_{bid}(t)+ w^h_{4,n} \,Q_{ask}(t) + b^h_n\right)$$
- last but not least, the output layer $Y$ is made of what you want to predict; let's say that you target $(\rho_{ask}(t+1),\rho_{bid}(t+1))$ (i.e. the percentage changes you want to predict), then you take: $$Y(t)=(\rho_{ask}(t+1),\rho_{bid}(t+1))'\in\mathbb{R}^2$$ You also need weights and biases for the output, so that (with $U$ the number of outputs): $$Y_u(t)=\Phi\left( \sum_{n=1}^N w^o_{n,u}\cdot Z_n(t) + b^o_u\right)$$ Sometimes people do not use any activation function for the output, but I would not recommend that.
All this put together, you can express the outputs as a function of the inputs with: $$Y_u(t)=\Phi\left( \sum_{n=1}^N w^o_{n,u}\cdot \Phi\left( \sum_k w^h_{k,n}\cdot X_k(t) + b^h_n\right) + b^o_u\right)$$
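As a sanity check on the indices, here is a minimal NumPy sketch of that forward pass (the layer sizes, initialization and sample tick are made-up assumptions, and $\tanh$ plays the role of $\Phi$):

```python
import numpy as np

def mlp_forward(x, W_h, b_h, W_o, b_o):
    """One-hidden-layer perceptron: x in R^K -> y in R^U."""
    z = np.tanh(W_h @ x + b_h)   # hidden layer: each z_n combines ALL inputs (question A)
    y = np.tanh(W_o @ z + b_o)   # output layer: each y_u combines ALL hidden units (question B)
    return y

K, N, U = 4, 8, 2                                           # inputs, hidden units, outputs
rng = np.random.default_rng(0)
W_h, b_h = 0.1 * rng.standard_normal((N, K)), np.zeros(N)   # w^h_{k,n}, b^h_n
W_o, b_o = 0.1 * rng.standard_normal((U, N)), np.zeros(U)   # w^o_{n,u}, b^o_u

# one tick (P_bid, P_ask, Q_bid, Q_ask) -- raw and un-normalized here; see the remark on normalization below
x_t = np.array([1.1012, 1.1014, 250_000.0, 180_000.0])
print(mlp_forward(x_t, W_h, b_h, W_o, b_o))                 # two numbers: the predicted (rho_ask, rho_bid)
```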
How does it work?
The perceptron has to be trained on a database of $T$ associations of inputs and outputs $(X(t),Y(t))_{1\leq t\leq T}$. Its training consists of finding the weights and biases minimizing the $L^2$ distance between the desired outputs and the obtained ones: $$\left\vert\begin{array}{ll} \mbox{Minimize}& \mathbb{E}_t \sum_{u=1}^U \left( Y_u(t) - \Phi\left( \sum_{n=1}^N w^o_{n,u}\cdot \Phi\left( \sum_k w^h_{k,n}\cdot X_k(t) + b^h_n\right) + b^o_u\right)\right)^2\\ \mbox{Variables}& (w^h_{k,n},b^h_n,w^o_{n,u},b^o_u)_{1\leq u\leq U,\, 1\leq n\leq N,\, 1\leq k\leq K} \end{array}\right.$$
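For illustration only, here is a small NumPy sketch of that minimization by plain gradient descent (backpropagation) with $\tanh$ as $\Phi$; the learning rate, number of epochs and initialization scale are arbitrary assumptions, not recommendations:

```python
import numpy as np

def train_mlp(X, Y_target, n_hidden=8, lr=0.05, n_epochs=2000, seed=0):
    """Fit a one-hidden-layer perceptron to the L2 objective above.

    X        : (T, K) array of input vectors
    Y_target : (T, U) array of desired outputs
    """
    T, K = X.shape
    U = Y_target.shape[1]
    rng = np.random.default_rng(seed)
    W_h, b_h = 0.1 * rng.standard_normal((n_hidden, K)), np.zeros(n_hidden)
    W_o, b_o = 0.1 * rng.standard_normal((U, n_hidden)), np.zeros(U)

    for _ in range(n_epochs):
        # forward pass
        Z = np.tanh(X @ W_h.T + b_h)              # (T, N) hidden activations
        Y = np.tanh(Z @ W_o.T + b_o)              # (T, U) network outputs

        # backward pass: gradients of 0.5 * mean squared error (the 0.5 only rescales lr)
        d_out = (Y - Y_target) * (1.0 - Y**2)     # (T, U)
        d_hid = (d_out @ W_o) * (1.0 - Z**2)      # (T, N)

        # gradient-descent update of weights and biases
        W_o -= lr * (d_out.T @ Z) / T
        b_o -= lr * d_out.mean(axis=0)
        W_h -= lr * (d_hid.T @ X) / T
        b_h -= lr * d_hid.mean(axis=0)

    return W_h, b_h, W_o, b_o
```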
What people like about perceptrons is that, as long as you use standard quadratic minimization methods, the training is quite fast and the associated computations are easy to do.
After the training, you obtain an estimate for $Y$ that is optimal in the sense of your minimization program (i.e. in the statistical $L^2$-distance sense).
How to use perceptrons?
I like your question because I think what you are asking is close to: "OK, but can it be that simple? Just throw my inputs into $X$, ask for my outputs in $Y$, and it will all work in a few seconds?" Of course the answer is no.
First, note that you need to normalize your inputs a little (center and reduce them, for instance: $X_k(t)\rightarrow (X_k(t) - \mathbb{E}_t(X_k))/\sqrt{\mathbb{V}_t(X_k)}$). Beyond that, you can preprocess the inputs using your understanding of the modelling problem. For instance, in your case you could use: $$X(t)=\left(\frac{P_{ask}(t)-P_{bid}(t)}{P_{ask}(t)+P_{bid}(t)}, \frac{Q_{bid}(t)}{Q_{bid}(t)+Q_{ask}(t)}\right)'$$ so that your inputs are somewhat homogeneous with your outputs (I am not guaranteeing any good result; it is just an improvement on your toy example).
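In code, the centering/reduction and the toy feature pair above could look like this (a sketch under the assumption that you have aligned arrays of prices and volumes; the function names are mine):

```python
import numpy as np

def standardize(X):
    """Center and reduce each input column: (x - E[x]) / sqrt(V[x])."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def toy_features(p_bid, p_ask, q_bid, q_ask):
    """The two hand-crafted inputs suggested above: relative spread and bid-volume fraction."""
    spread = (p_ask - p_bid) / (p_ask + p_bid)
    imbalance = q_bid / (q_bid + q_ask)
    return np.column_stack([spread, imbalance])   # shape (T, 2)
```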
You also need to pay attention to the outputs you ask the perceptron to predict. Ask yourself: "why not use $U$ separate perceptrons if I have $U$ outputs?" The only good reason to share one network is that you are convinced that sharing the same hidden units to predict the $U$ outputs simultaneously will stabilize the weights: the outputs should be deeply linked to the same underlying phenomenon.
What about time in all this? I have proceeded as if the association $(X,Y)$ were i.i.d. with respect to time. If that is not the case, you can try a more sliding approach like a TDNN (Time-Delay Neural Network). It is to the perceptron what GARCH models are to linear regressions (more or less)...
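The crudest version of such a sliding window (short of a real TDNN) is simply to concatenate the last few ticks of features into one larger input vector, which also answers question C about how the window coexists with bid, ask, etc.: each lagged copy of those features just becomes extra coordinates of $X(t)$. A sketch:

```python
import numpy as np

def sliding_window(X, n_lags):
    """Turn a (T, K) series of per-tick features into (T - n_lags + 1, n_lags * K) inputs;
    row t is the concatenation (X(t), X(t-1), ..., X(t-n_lags+1))."""
    T, K = X.shape
    rows = [X[t - n_lags + 1 : t + 1][::-1].ravel() for t in range(n_lags - 1, T)]
    return np.array(rows)
```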
Of course you need to take care of over-fitting and all the other VC-dimension-like topics of statistical learning.
Last remark: you seem to be trying to use perceptrons for intraday prediction. In the context of high-frequency trading, never forget that you will interact with the order book dynamics, so you will be in a control-oriented framework rather than a prediction one. What I mean is that by sending orders into the LOB (limit order book), you will change it, yet you will continue to use the changed state of the order book for the next step of your prediction... This is called market impact, and it implies that you are in effect trying to control the LOB.
Some special techniques have been developed to use perceptrons in the scope of control, like in "How piecewise affine neural networks can generate a stable nonlinear control" by Lehalle and Azencott, in Proceedings of the 1999 IEEE International Symposium on Intelligent Control/Intelligent Systems and Semiotics, 1999.