A new R package for detecting unusual time series

The anomalous package provides some tools to detect unusual time series in a large collection of time series. This is joint work with Earo Wang (an honours student at Monash) and Nikolay Laptev (from Yahoo Labs). Yahoo is interested in detecting unusual patterns in server metrics.

The package is based on this paper with Earo and Nikolay.

The basic idea is to measure a range of features of each time series (such as strength of seasonality, an index of spikiness, first-order autocorrelation, etc.). Then a principal component decomposition of the feature matrix is calculated, and outliers are identified in the two-dimensional space of the first two principal component scores.
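
As a rough illustration of this pipeline (not the package's own code: the feature definitions below are simplified stand-ins, `series_list` is a placeholder name, and the real package computes many more features), the steps might look like this in R:

```r
# Sketch of the feature-matrix + PCA step, with simplified stand-in features
measures <- function(x) {
  dec  <- stl(x, s.window = "periodic")     # seasonal-trend decomposition
  rem  <- dec$time.series[, "remainder"]
  seas <- dec$time.series[, "seasonal"]
  c(seasonality = 1 - var(rem) / var(rem + seas),         # strength of seasonality
    acf1        = acf(x, plot = FALSE)$acf[2],            # first-order autocorrelation
    spikiness   = var(sapply(seq_along(rem),              # variance of leave-one-out
                             function(i) var(rem[-i]))))  #   variances of the remainder
}

# series_list: a list of seasonal ts objects (frequency > 1)
feat   <- t(sapply(series_list, measures))   # one row of features per series
pc     <- prcomp(feat, scale. = TRUE)        # principal component decomposition
scores <- pc$x[, 1:2]                        # first two PC scores
```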

We use two methods to identify outliers.

  1. A bivariate kernel density estimate of the first two PC scores is computed, and the points are ordered based on the value of the density at each observation. This gives us a ranking from most outlying (lowest density) to least outlying (highest density); a short code sketch of this ranking follows the list.
  2. A series of $\alpha$-convex hulls is computed on the first two PC scores with decreasing $\alpha$, and points are classified as outliers when they become singletons separated from the main hull. This gives an alternative ranking: the most outlying point separates at the largest value of $\alpha$, and the remaining outliers are ranked by the successively smaller values of $\alpha$ at which they separate.
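
A minimal sketch of the first (density-based) ranking, using the ks package for the bivariate kernel density estimate; the package's own hdr method works through the hdrcde machinery, so treat this purely as an illustration of the idea rather than the implementation:

```r
library(ks)

# scores: n x 2 matrix of the first two PC scores (as in the sketch above)
H    <- Hpi(scores)                                        # plug-in bandwidth matrix
dens <- kde(scores, H = H, eval.points = scores)$estimate  # density at each observation

# Rank from most outlying (lowest density) to least outlying (highest density)
ranking <- order(dens)
head(ranking, 10)   # indices of the ten most unusual series
```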

I explained the ideas in a talk given last Tuesday at a joint meeting of the Statistical Society of Australia and the Melbourne Data Science Meetup Group. Slides are available here. A link to a video of the talk will also be added there when it is ready.

The density-ranking of PC scores was also used in my work on detecting outliers in functional data. See my 2010 JCGS paper and the associated rainbow package for R.

There are two versions of the package: one under an ACM licence, and a limited version under a GPL licence. Eventually we hope to make the GPL version contain everything, but we are currently dependent on the alphahull package, which has an ACM licence.


  • Ankur

    Hi Rob,

    I have gone through this paper of yours and I am trying to see if I can fit this into my current project. The problem that I have is to detect the under-performing (or faulty) solar inverter from among a dozen solar inverters. Every inverter throws out ten-minute interval data from a lot of sensors, the most important of which is the temperature reading. So for one sensor, we have as many time series as the number of inverters, and I would like to separate out the one which is the least similar to the others (i.e. the outlier time series). Since I have daily data, I am planning to use the latest 7-day data (or, if required, the latest 14-day data).

    As such, it will be great if you could clarify some points from your paper.

    1. Referring to the example you have mentioned in your paper, you have considered one common variable from 100 different servers, so that you have 100 univariate time series for 100 servers. Is this correct, or have you considered more than one variable from every server?
    2. Again, in the example, you have created 16 features for every time series. Instead, can I just use 4 or 5 features? Also, is there a way to decide which features are more important than others?

    Eagerly looking forward to your reply.

    Thanks,
    Ankur

    • 1. We have 3 time series from each server, and we have applied the method to many thousands of time series. Fig 1 is just a small sample of 100 series. We actually use many more.
      2. Use as many or as few as you like. We haven’t developed tools for determining which features to use yet. But this is a standard problem in machine learning — treat it like a feature selection problem for unsupervised clustering.

      • Ankur

        Thanks Rob!

        On point 1, when you have 3 time series from each server, is the whole algorithm run three times, so that one run finds the outlier(s) for the 1st set of time series, the next run finds the outlier(s) for the 2nd set, and so on, with inferences then made about the outlier server(s)? Or are all the time series (300 in the same example) analysed together, leading to a 300 x 16 feature matrix, before moving on to PCA?

        Thanks,
        Ankur

        • We’ve done both — running the algorithm on each of the three sets of series, and also running it on the features from all series together.

          • Ankur

            Thank you very much Rob! Time now to implement this on my problem.

          • Ankur

            Hi Rob,

            When running your code with method = “hdr”, hdrinfo$fxy gives the density score for every time series and tmp.idx sorts the scores in increasing order, so that the first time series is in the lowest density region, the second time series is in a slightly denser region, followed by the third time series in an even denser region, and so on. However, is there a way to determine where to draw the line to find the outlier time series? When there are thousands of time series, I guess we can safely assume that the top 10/20/n time series may be anomalous, but when there are only a dozen or so time series, there may not be any outlier at all, or at most one.

            Thanks,
            Ankur

          • That depends on the particular problem. You can set the threshold wherever you think is appropriate. One approach is to select the number of “outliers” you want to identify.

          • Ankur

            Thank you Rob!

          • Ankur

            Hi Rob,

            I have gone through the documentation of the ‘hdrcde’ package but can’t figure out how to evaluate the kernel function at 0. Does it mean finding the density for the coordinates [0,0] by forcing these as PC1 and PC2 values (along with the rest of the PC1 and PC2 coordinates), and then selecting a threshold? Maybe I am missing something?

            Thanks,

            Ankur

          • No. Just find the value of the density that would arise if there were a distant outlier. It will be K(0)/(nab), where K() is the kernel function and a and b are the two bandwidths. Set your threshold slightly higher than that value.
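
            In R, a rough translation of that rule might look like the following; the choice of ks::Hpi.diag for the two bandwidths and the 10% margin are just one possible choice, not part of the package:

            ```r
            library(ks)

            # scores: n x 2 matrix of PC scores; dens: KDE value at each score
            n  <- nrow(scores)
            Hd <- Hpi.diag(scores)    # diagonal (squared) bandwidth matrix
            a  <- sqrt(Hd[1, 1])      # bandwidth for PC1
            b  <- sqrt(Hd[2, 2])      # bandwidth for PC2

            # Density a single distant point receives from its own kernel only:
            # with a bivariate Gaussian product kernel, K(0) = 1/(2*pi)
            floor_density <- (1 / (2 * pi)) / (n * a * b)

            threshold <- 1.1 * floor_density   # "slightly higher": arbitrary 10% margin
            outliers  <- which(dens <= threshold)
            ```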

          • Ankur

            Thanks Rob. Let me figure this out.

          • Ankur

            Thanks so much Rob for your guidance on this!! This is the first time I have ventured into machine learning (for my project), and that’s why so many questions. I have done some reading, and here is how I am thinking of finding K(0) / (n*a*b), along with a few more questions.

            1. n -> will this be the # of rows in the PC space (i.e. as many as the # of inverters in my case), or the # of samples that the ‘kde’ function uses to find the density estimates? So will n be 16 (as per the former) or 151 for two-dimensional data (as per the latter)?

            2. a and b -> can I directly use the bandwidths determined by the function ‘Hpi’ (in the ‘ks’ package), so that ‘a’ will be element (1,1) and ‘b’ will be element (2,2) of the bandwidth matrix? Or should I find a and b separately, treating PC1 as one univariate distribution and PC2 as another?

            3. Taking the Gaussian kernel function, K(0) will simply be 1/sqrt(2*pi), i.e. 0.3989.
            4. Evaluate the density from K(0) / (n*a*b).

            Or, chuck all the above steps and simply evaluate the density by kde(scores, eval.points = c(0,0)). Then, to select the threshold, any suggestion on how much larger “slightly higher” should be, maybe 10% or 20%? And then any density value lower than this threshold will imply an outlier time series, correct?

            Truly appreciate all your help.

            Thanks,
            Ankur

  • Hua yang

    Hi Rob,
    if a time series is detected as anomalous, what kind of transformation should be done before we use it to build a model?

    Thanks

  • Alex Lavin

    Hi Rob,
    It’s unclear to me whether this anomaly detection algorithm runs on streaming or batch data. That is, does your method look at a time series one data point at a time, outputting whether or not that point is anomalous before getting the next data point? Or does it look at the full time series to detect whether there are anomalies anywhere in it? An example of the former would be getting a stock price every five seconds during the trading day, and of the latter, looking at five years of the stock’s price history. Thanks!

    Alex