My goal is to find clusters of stocks. The "affinity" matrix will define the "closeness" of points. This article gives a bit more background. The ultimate purpose is to investigate the "cohesion" within ETFs, and between similar ETFs, for arbitrage possibilities. If everything goes well, this could eventually lead to a tool for risk modelling or valuation. The project is currently in the proposal/proof-of-concept phase, so resources are limited.
I found this Python example for clustering, with related docs. The code uses correlations of the difference between open and close prices as values for the affinity matrix. I would prefer to use the average return and the standard deviation of returns. This can be visualised as a two-dimensional space with the average and the standard deviation as dimensions. Instead of correlation, I would then calculate the "distance" between data points (stocks) and fill the affinity matrix with these distances. The choice of distance function is still an open issue. Is calculating the distance between data points, instead of correlations, valid?
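A minimal sketch of this idea, using only NumPy and synthetic returns (all data and the RBF conversion are my assumptions, not part of the original example): compute the two features per stock, take pairwise Euclidean distances, and convert them to similarities, since an affinity matrix encodes closeness rather than distance.

```python
import numpy as np

# Hypothetical daily returns: rows are days, columns are stocks.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.02, size=(250, 5))

# Two features per stock: average return and standard deviation of returns.
features = np.column_stack([returns.mean(axis=0), returns.std(axis=0)])

# Pairwise Euclidean distances between stocks in this 2-D feature space.
diff = features[:, None, :] - features[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# An affinity matrix should encode closeness, so convert distance to
# similarity, e.g. via an RBF kernel (the choice of gamma is a tuning knob).
gamma = 1.0 / features.var(axis=0).sum()
affinity = np.exp(-gamma * distances ** 2)

print(affinity.shape)                        # (5, 5)
print(np.allclose(np.diag(affinity), 1.0))   # True: each stock is closest to itself
```

The RBF step is one common convention; any monotone decreasing transform of distance would serve the same purpose.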
If it is can I extend this approach with more dimensions, such as dividend yield or ratios such as price/earnings?
I did a few experiments with different numbers of parameters and different distance functions, resulting in cluster counts ranging from 1 to more than 300 for a sample of 900 stocks. The sample consists of large- and mid-cap stocks listed on the NYSE and NASDAQ. Is there a rule of thumb for the number of clusters one should expect?
Answer
You should consider a distance-based unsupervised clustering algorithm such as k-means (note that k-nearest neighbours, despite the similar name, is a supervised method).
Such an algorithm measures the distances amongst the observations in your space. You can, and probably should, consider alternative distance functions (besides Euclidean), particularly if you are clustering on features such as returns, which have outliers. There are quite a few unsupervised clustering algorithms out there - see here. You can certainly include features such as stock characteristics with these algorithms. You can also include the betas of the securities with respect to various risk factors. This would allow you to capture the distances in correlation space, since a security-based covariance matrix can be expressed as the cross-product: (betas for factors) * (covariance matrix of factor returns) * transposed(betas for factors).
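The factor-model decomposition in the last sentence can be sketched in a few lines of NumPy (the betas and factor covariance here are made-up numbers, and idiosyncratic variance is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 4 stocks, 2 risk factors.
betas = rng.normal(size=(4, 2))          # factor loadings per stock
factor_cov = np.array([[0.04, 0.01],
                       [0.01, 0.09]])    # covariance of factor returns

# Security covariance implied by the factor model:
# Sigma = B @ F @ B.T   (idiosyncratic variance omitted)
security_cov = betas @ factor_cov @ betas.T

# The result is a symmetric positive semi-definite matrix, so the betas
# alone carry the correlation-space geometry the answer refers to.
assert np.allclose(security_cov, security_cov.T)
```

In other words, clustering on the betas is (up to the factor covariance) clustering in correlation space.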
I would spend time thinking about the appropriate choice of features (which features are stable? which features predict risk or return? which sets of features are contributing unique sources of information? what are the invariants?) and choice of distance function.
Also, if you are mixing features with different unit scales (e.g. returns, betas, variances), then you need to normalize/pre-process your inputs; otherwise the features with the highest variance will be the primary basis for clustering. Alternatively, you can stick to one class of features for your clustering so you have more intuition when interpreting the results.
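A short sketch of that normalization step, with invented feature scales to make the point (returns around 1e-3, betas around 1, variances around 1e-4): z-score each column so no single scale dominates the distance metric.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical feature matrix for 100 stocks, three columns on wildly
# different scales: mean return, beta, variance of returns.
features = np.column_stack([
    rng.normal(0.0005, 0.0002, 100),   # mean return
    rng.normal(1.0, 0.3, 100),         # beta
    rng.normal(0.0004, 0.0001, 100),   # variance
])

# Z-score each column: subtract the mean, divide by the standard deviation.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

# Every column now has zero mean and unit variance, so Euclidean
# distances weight each feature comparably.
print(standardized.std(axis=0))
```

Libraries such as scikit-learn wrap this in `StandardScaler`, but the operation itself is just the two-line z-score above.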