I have a bunch of time series that I need to clean before modelling. So far the only approach I know is filtering/smoothing. For example, with a moving average: filter the data with a moving average, compute the noise (the series minus the filtered series), and remove the data points where the noise is high, i.e. above a chosen threshold.
A simple example of the moving average filter method with three outliers:
[Figures: data and filter; noise and threshold; cleaned data]
Do you recommend a specific filter? Do you know a better automatic method?
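For concreteness, the moving average procedure described above could be sketched roughly like this in Python with NumPy. This is only an illustration: the function name, the window of 5, the absolute threshold of 3.0 and the toy sine data are all assumptions made for the example, not part of the original question.

```python
import numpy as np

def remove_outliers_moving_average(series, window=5, threshold=3.0):
    """Drop points that deviate too far from a centered moving average.

    `window` (odd) and `threshold` (absolute, in data units) are
    illustrative defaults, not recommendations.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so the average is centered")
    x = np.asarray(series, dtype=float)
    pad = window // 2
    # Repeat the end values so the filtered series has the same length as x.
    padded = np.pad(x, pad, mode="edge")
    kernel = np.ones(window) / window
    smooth = np.convolve(padded, kernel, mode="valid")
    # "Noise" is the series minus the filter.
    noise = x - smooth
    is_outlier = np.abs(noise) > threshold
    return x[~is_outlier], is_outlier

# Toy data (hypothetical): a smooth curve with three injected spikes.
t = np.arange(100)
x = np.sin(t / 10.0)
x[[20, 50, 80]] += 8.0
cleaned, is_outlier = remove_outliers_moving_average(x, window=5, threshold=3.0)
print(np.flatnonzero(is_outlier))  # → [20 50 80]
```

Note that a spike also pulls the moving average at its neighbours (here by 8/5 = 1.6), so a threshold that is too tight would remove the neighbouring points as well.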
Answer
Not so fast! I think it is of the utmost importance to first examine whether the data points are real outliers, i.e. noise that is contaminating the data, or perhaps the most important pieces of the time series!
For example, if you take US stock market data from the last 50 years and remove just the ten biggest moves because they look like outliers, you get a completely different time series!
See page 276 of The Black Swan by Nassim Taleb.
So you have to be extremely careful and double-check every data point you remove, whatever method you use!
In general, what you consider an outlier also depends very much on the model you are using. What looks like an outlier under one model (e.g. a linear model) may be part of the package in a more complex one (e.g. a non-linear model). So how to proceed is also a matter of experience.
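To make the model dependence concrete, here is a small sketch in Python with NumPy. The toy data, the helper name flag_by_model and the threshold of 12 are made up for the illustration: the same points are flagged under a straight-line fit but not under a quadratic fit.

```python
import numpy as np

def flag_by_model(t, y, degree, threshold):
    """Flag points whose residual from a polynomial fit exceeds `threshold`.

    Hypothetical helper: `degree` selects the model (1 = linear, 2 = quadratic).
    """
    coeffs = np.polyfit(t, y, degree)
    residuals = y - np.polyval(coeffs, t)
    return np.abs(residuals) > threshold

# Toy data that is exactly quadratic in t.
t = np.arange(11, dtype=float)
y = t ** 2

linear_flags = flag_by_model(t, y, degree=1, threshold=12.0)
quadratic_flags = flag_by_model(t, y, degree=2, threshold=12.0)

print(np.flatnonzero(linear_flags))     # → [ 0 10]
print(np.flatnonzero(quadratic_flags))  # → []
```

Under the linear model the two end points sit 15 units off the fitted line and get flagged, while the quadratic model explains every point, so nothing is flagged.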
So all in all, I think there is no easy answer to your question. A good starting point may be the first chapter of the following new book (2013), which is available online:
Outlier Analysis by C. Aggarwal
On a more practical note, you can use the forecast package for R by Rob Hyndman, in its new version 5.0. This version was just released (27/01/2014) and has upgraded functionality for preprocessing time series and handling outliers:
http://robjhyndman.com/hyndsight/forecast5/