Detecting time series outliers

The tsoutliers() function in the forecast package for R is useful for identifying anomalies in a time series. However, it is not properly documented anywhere. This post is intended to fill that gap.

The function began as an answer on CrossValidated and was later added to the forecast package because I thought it might be useful to other people. It has since been updated and made more reliable.

The procedure decomposes the time series into trend, seasonal and remainder components: \[y_t = T_t + S_t + R_t.\] The seasonal component is optional, and it may containing several seasonal patterns corresponding to the seasonal periods in the data. The idea is to first remove any seasonality and trend in the data, and then find outliers in the remainder series, \(R_t\).

For data observed more frequently than annually, we use a robust approach to estimate \(T_t\) and \(S_t\) by first applying the MSTL method to the data. MSTL will iteratively estimate the seasonal component(s).

Then the strength of seasonality is measured using \[F_s = 1-\frac{\text{Var}(y_t - \hat{T}_t - \hat{S}_t)}{\text{Var}(y_t - \hat{T}_t)}.\] If \(F_s>0.6\), a seasonally adjusted series is computed: \[y_t^* = y_t - \hat{S}_t.\] A seasonal strength threshold is used here because the estimate of \(\hat{S}_t\) is likely to be overfitted and very noisy if the underlying seasonality is too weak (or non-existent), potentially masking any outliers by having them absorbed into the seasonal component.

If \(F_s \le 0.6\), or if the data is observed annually or less frequently, we simply set \(y_t^*=y_t\).

Next, we re-estimate the trend component from the \(y_t^*\) values. For non-seasonal time series such as annual data, this is necessary as we don’t have the trend estimate from the STL decomposition. But even if we have computed an STL decomposition, we may not have used it if \(F_s \le 0.6\).

The trend component \(T_t\) is estimated by applying Friedman’s super smoother (via supsmu()) to the \(y_t^*\) data. This function has been tested on lots of data and tends to work well on a wide range of problems.

We look for outliers in the estimated remainder series \[\hat{R}_t = y^*_t - \hat{T}_t.\] If \(Q1\) denotes the 25th percentile and \(Q3\) denotes the 75th percentile of the remainder values, then the interquartile range is defined as \(IQR = Q3-Q1\). Observations are labelled as outliers if they are less than \(Q1 - 3\times IQR\) or greater than \(Q3 + 3\times IQR\). This is the definition used by Tukey (1977, p44) in his original boxplot proposal for “far out” values.

If the remainder values are normally distributed, then the probability of an observation being identified as an outlier is approximately 1 in 427000.

Any outliers identified in this manner are replaced with linearly interpolated values using the neighbouring observations, and the process is repeated.

Example: Gold data

The gold price data contains daily morning gold prices in US dollars from 1 January 1985 to 31 March 1989. The data was given to me by a client who wanted me to forecast the gold price. (I told him it would be almost impossible to beat a naive forecast). The data are shown below.


There are periods of missing values, and one obvious outlier which is about $100 greater than what would be expected. This was simply a typo, with someone typing 593.70 rather than 493.70. Let’s see if the tsoutliers() function can spot it.

## $index
## [1] 770
## $replacements
## [1] 495

Sure enough, it is easily found and the suggested replacement (linearly interpolated) is close to the true value.

The tsclean() function removes outliers identified in this way, and replaces them (and any missing values) with linearly interpolated replacements.

autoplot(tsclean(gold), series="clean", color='red', lwd=0.9) +
  autolayer(gold, series="original", color='gray', lwd=1) +
  geom_point(data = tsoutliers(gold) %>%, 
             aes(x=index, y=replacements), col='blue') +
  labs(x = "Day", y = "Gold price ($US)")

The blue dot shows the replacement for the outlier, the red lines show the replacements for the missing values.

comments powered by Disqus