Measuring time series characteristics

A few years ago, I was working on a project where we measured various characteristics of a time series and used the information to determine what forecasting method to apply or how to cluster the time series into meaningful groups. The two main papers to come out of that project were:

  1. Wang, Smith and Hyndman (2006) "Characteristic-based clustering for time series data", Data Mining and Knowledge Discovery, 13(3), 335-364.
  2. Wang, Smith-Miles and Hyndman (2009) "Rule induction for forecasting method selection: meta-learning the characteristics of univariate time series", Neurocomputing, 72, 2581-2594.

I’ve since had a lot of requests for the code, which one of my coauthors has been helpfully emailing to anyone who asked. To make things easier, I’m posting some updated code here. This is not the same as the R code we used in the paper, as I’ve improved it in several ways (so it will give different results). If you just want the code, skip to the bottom of the post.

Finding the period of the data

Usually in time series work, we know the period of the data (if the observations are monthly, the period is 12, for example). But in this project, some of our data was of unknown period and we wanted a method to determine the appropriate period automatically. The method we used in the papers was based on local peaks and troughs in the ACF, but I’ve since devised a better approach based on an estimate of the spectral density:

find.freq <- function(x)
{
  n <- length(x)
  x <- as.ts(x)
  # Remove trend from data
  x <- residuals(tslm(x ~ trend))
  # Compute spectrum by fitting ar model to largest section of x
  n.freq <- 500
  spec <- spec.ar(c(na.contiguous(x)), plot=FALSE, n.freq=n.freq)
  if(max(spec$spec) > 10) # Arbitrary threshold chosen by trial and error.
  {
    period <- floor(1/spec$freq[which.max(spec$spec)] + 0.5)
    if(period==Inf) # Find next local maximum
    {
      j <- which(diff(spec$spec) > 0)
      if(length(j) > 0)
      {
        nextmax <- j[1] + which.max(spec$spec[(j[1]+1):n.freq])
        if(nextmax < length(spec$freq))
          period <- floor(1/spec$freq[nextmax] + 0.5)
        else
          period <- 1L
      }
      else
        period <- 1L
    }
  }
  else
    period <- 1L

  return(as.integer(period))
}

The function is called find.freq because time series people often call the period of seasonality the “frequency” (which is of course highly confusing).
[Update: This function is now part of the forecast package as findfrequency().]
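As a quick sanity check, here is a minimal sketch using the packaged findfrequency() on a simulated series (the series and seed are invented for illustration):

```r
library(forecast)

# Simulated series with period 12 plus a little noise
set.seed(123)
x <- ts(sin(2 * pi * (1:240) / 12) + rnorm(240, sd = 0.2))

findfrequency(x)  # should detect a period of 12 here
```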

Decomposing the data into trend and seasonal components

We needed a measure of the strength of trend and the strength of seasonality, and to do this we decomposed the data into trend, seasonal and error terms.

Because not all data could be decomposed additively, we first needed to apply an automated Box-Cox transformation. We tried a range of Box-Cox parameters on a grid, and selected the one which gave the most normal errors. That worked ok, but I’ve since found some papers that provide quite good automated Box-Cox algorithms that I’ve implemented in the forecast package. So this code uses Guerrero’s (1993) method instead.

For seasonal time series, we decomposed the transformed data using an stl decomposition with periodic seasonality.

For non-seasonal time series, we estimated the trend of the transformed data using penalized regression splines via the mgcv package.

decomp <- function(x, transform=TRUE)
{
  require(forecast)
  # Transform series
  if(transform & min(x, na.rm=TRUE) >= 0)
  {
    lambda <- BoxCox.lambda(na.contiguous(x))
    x <- BoxCox(x, lambda)
  }
  else
  {
    lambda <- NULL
    transform <- FALSE
  }
  # Seasonal data
  if(frequency(x) > 1)
  {
    x.stl <- stl(x, s.window="periodic", na.action=na.contiguous)
    trend <- x.stl$time.series[,2]
    season <- x.stl$time.series[,1]
    remainder <- x - trend - season
  }
  else # Nonseasonal data
  {
    require(mgcv)
    tt <- 1:length(x)
    trend <- rep(NA, length(x))
    trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
    season <- NULL
    remainder <- x - trend
  }
  return(list(x=x, trend=trend, season=season, remainder=remainder,
    transform=transform, lambda=lambda))
}


Putting everything on a [0,1] scale

We wanted to measure a range of characteristics such as strength of seasonality, strength of trend, level of nonlinearity, skewness, kurtosis, serial correlatedness, self-similarity, level of chaoticity (is that a word?) and the periodicity of the data. But we wanted all these on the same scale which meant mapping the natural range of each measure onto [0,1]. The following two functions were used to do this.

# f1 maps [0,infinity) to [0,1]
f1 <- function(x, a, b)
{
  eax <- exp(a*x)
  if (eax == Inf)
    f1eax <- 1
  else
    f1eax <- (eax-1)/(eax+b)
  return(f1eax)
}

# f2 maps [0,1] onto [0,1]
f2 <- function(x, a, b)
{
  eax <- exp(a*x)
  ea <- exp(a)
  return((eax-1)/(eax+b)*(ea+b)/(ea-1))
}

The values of $a$ and $b$ in each function were chosen so the measure had a 90th percentile of 0.10 when the data were iid standard normal, and a value of 0.9 using a well-known benchmark time series.
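The calibration itself is not shown in the post; as an illustration of the idea (a sketch under assumptions: the choice of raw measure, series length and simulation settings are mine, and the $b$ value is the one used for the skewness measure below), one can simulate iid standard normal series and solve for $a$ numerically:

```r
# Sketch: choose 'a' in f1 (for a fixed b) so that the 90th percentile of a
# raw measure over iid N(0,1) series maps to 0.10.
f1 <- function(x, a, b) {
  eax <- exp(a * x)
  if (is.infinite(eax)) 1 else (eax - 1) / (eax + b)
}

set.seed(42)
# Raw measure: absolute skewness of simulated iid normal series of length 100
raw <- replicate(1000, {
  z <- rnorm(100)
  abs(mean((z - mean(z))^3) / sd(z)^3)
})
q90 <- quantile(raw, 0.9)

b <- 5.993  # the b used for the skewness measure in this post
a <- uniroot(function(a) f1(q90, a, b) - 0.10, c(1e-3, 100), tol = 1e-10)$root

f1(q90, a, b)  # 0.10 by construction
```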

Calculating the measures

Now we are ready to calculate the measures on the original data, as well as on the adjusted data (after removing trend and seasonality).

measures <- function(x)
{
  require(forecast)

  N <- length(x)
  freq <- find.freq(x)
  fx <- c(frequency=(exp((freq-1)/50)-1)/(1+exp((freq-1)/50)))
  x <- ts(x, f=freq)

  # Decomposition
  decomp.x <- decomp(x)

  # Adjust data
  if(freq > 1)
    fits <- decomp.x$trend + decomp.x$season
  else # Nonseasonal data
    fits <- decomp.x$trend
  adj.x <- decomp.x$x - fits + mean(decomp.x$trend, na.rm=TRUE)

  # Backtransformation of adjusted data
  if(decomp.x$transform)
    tadj.x <- InvBoxCox(adj.x, decomp.x$lambda)
  else
    tadj.x <- adj.x

  # Trend and seasonal measures
  v.adj <- var(adj.x, na.rm=TRUE)
  if(freq > 1)
  {
    detrend <- decomp.x$x - decomp.x$trend
    deseason <- decomp.x$x - decomp.x$season
    trend <- ifelse(var(deseason, na.rm=TRUE) < 1e-10, 0,
      max(0, min(1, 1 - v.adj/var(deseason, na.rm=TRUE))))
    season <- ifelse(var(detrend, na.rm=TRUE) < 1e-10, 0,
      max(0, min(1, 1 - v.adj/var(detrend, na.rm=TRUE))))
  }
  else # Nonseasonal data
  {
    trend <- ifelse(var(decomp.x$x, na.rm=TRUE) < 1e-10, 0,
      max(0, min(1, 1 - v.adj/var(decomp.x$x, na.rm=TRUE))))
    season <- 0
  }

  m <- c(fx, trend, season)

  # Measures on original data
  xbar <- mean(x, na.rm=TRUE)
  s <- sd(x, na.rm=TRUE)

  # Serial correlation
  Q <- Box.test(x, lag=10)$statistic/(N*10)
  fQ <- f2(Q, 7.53, 0.103)

  # Nonlinearity
  p <- terasvirta.test(na.contiguous(x))$statistic
  fp <- f1(p, 0.069, 2.304)

  # Skewness
  sk <- abs(mean((x-xbar)^3, na.rm=TRUE)/s^3)
  fs <- f1(sk, 1.510, 5.993)

  # Kurtosis
  k <- mean((x-xbar)^4, na.rm=TRUE)/s^4
  fk <- f1(k, 2.273, 11567)

  # Hurst = d + 0.5 where d is the fractional difference.
  H <- fracdiff(na.contiguous(x), 0, 0)$d + 0.5

  # Lyapunov Exponent
  if(freq > (N-10))
    stop("Insufficient data")
  Ly <- numeric(N-freq)
  for(i in 1:(N-freq))
  {
    idx <- order(abs(x[i] - x))
    idx <- idx[idx < (N-freq)]
    j <- idx[2]
    Ly[i] <- log(abs((x[i+freq] - x[j+freq])/(x[i]-x[j])))/freq
    if(is.na(Ly[i]) | Ly[i]==Inf | Ly[i]==-Inf)
      Ly[i] <- NA
  }
  Lyap <- mean(Ly, na.rm=TRUE)
  fLyap <- exp(Lyap)/(1+exp(Lyap))

  m <- c(m, fQ, fp, fs, fk, H, fLyap)

  # Measures on adjusted data
  xbar <- mean(tadj.x, na.rm=TRUE)
  s <- sd(tadj.x, na.rm=TRUE)

  # Serial correlation
  Q <- Box.test(adj.x, lag=10)$statistic/(N*10)
  fQ <- f2(Q, 7.53, 0.103)

  # Nonlinearity
  p <- terasvirta.test(na.contiguous(adj.x))$statistic
  fp <- f1(p, 0.069, 2.304)

  # Skewness
  sk <- abs(mean((tadj.x-xbar)^3, na.rm=TRUE)/s^3)
  fs <- f1(sk, 1.510, 5.993)

  # Kurtosis
  k <- mean((tadj.x-xbar)^4, na.rm=TRUE)/s^4
  fk <- f1(k, 2.273, 11567)

  m <- c(m, fQ, fp, fs, fk)
  names(m) <- c("frequency", "trend", "seasonal",
    "autocorrelation", "non-linear", "skewness", "kurtosis",
    "Hurst", "Lyapunov",
    "dc autocorrelation", "dc non-linear", "dc skewness", "dc kurtosis")

  return(m)
}


Here is a quick example applied to Australian monthly gas production:

     frequency              trend           seasonal    autocorrelation 
        0.1096             0.9989             0.9337             0.9985 
    non-linear           skewness           kurtosis              Hurst 
        0.4947             0.1282             0.0055             0.9996 
      Lyapunov dc autocorrelation      dc non-linear        dc skewness 
        0.5662             0.1140             0.0538             0.1743 
   dc kurtosis 

The function is far from perfect, and it is not hard to find examples where it fails. For example, it doesn’t work with multiple seasonality: try measures(taylor) and check the seasonality. Also, I’m not convinced the kurtosis provides anything useful here, or that the skewness measure is done in the best way possible. But it was really a proof of concept, so we will leave it to others to revise and improve the code.

In our papers, we took the measures obtained using R and produced self-organizing maps using Viscovery. There is now a som package in R for that, so it might be possible to integrate that step into R as well. The dendrogram was generated in Matlab, although that could now also be done in R, using the ggdendro package for example.
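For readers who want to stay entirely within R for the clustering step, here is a hedged sketch using base R hierarchical clustering in place of a self-organizing map (the feature matrix below is random placeholder data standing in for rows of measures() output):

```r
# Placeholder feature matrix: one row per series, one column per measure.
# In practice each row would be measures() applied to one time series.
set.seed(1)
feat <- matrix(runif(10 * 13), nrow = 10,
               dimnames = list(paste0("series", 1:10), NULL))

d <- dist(feat)                      # distances between feature vectors
hc <- hclust(d, method = "ward.D2")  # hierarchical clustering
groups <- cutree(hc, k = 3)          # cut the dendrogram into 3 clusters

plot(hc)  # base-R dendrogram of the series
```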

Download the code in a single file.


  • Francis Markham

    Thanks for posting your code Rob. This is very similar to a project I envisaged a couple of years ago, before I had the skills to put the ideas into practice. I’ll let you know if I find the time to publish a paper based on the code.

  • Guest

    Is there a way to use the forecast function on the output of the measures function?

    • I don’t understand what you mean. What do you want to forecast?

      • Guest

        I was looking at one of the many forecast package example:

            fit <- Arima(WWWusage,order=c(3,1,0))
            plot(forecast(fit,h=20))

        Is it possible to imagine doing something similar to:

            fit2 <- measures(WWWusage)
            plot(forecast(fit2,h=20))

        • forecast(fit) produces forecasts from an ARIMA model. How can you forecast from the measures? These are just summary statistics of the time series.

          • Guest

            Sorry, my question may not make any sense. From the measures function output, time series can be clustered. By linking (how?) clusters with appropriate models, a forecast could then be performed. For instance, a cluster with strong trend and seasonality components could be fitted with fit <- tslm(y ~ trend + season) and then forecasted.

  • Abhijith B

    How can we automate a procedure to find the orders of an ARIMA model in R (not with auto.arima()), for example when more than one time series is available?

    • What’s wrong with using auto.arima()? If you have more than one time series, use auto.arima() on each one.

  • Paul Prew

    Hello, this looked very promising for a project I’m working on. I tried to run the code, but got an error message. Maybe it’s because I never specified the (a,b) arguments? I ran the code in blocks as grouped in your post. I know R a bit, mostly interacting through R Commander. I’ll be the first to admit my data is strange, but it’s actual pricing history for a raw material.

    > x <- srn.dat$srn172163.100
    > summary(x)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.6250  0.6350  0.6850  0.7089  0.7550  0.8810
    > dput(x)
    structure(c(0.705, 0.705, 0.7077561766, 0.725, 0.725, 0.789008964,
    0.8664955774, 0.868, 0.881, 0.8706588748, 0.8199582478, 0.725,
    0.675, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625,
    0.635, 0.635, 0.635, 0.635, 0.635, 0.7920070187, 0.8376571637,
    0.755, 0.755, 0.685, 0.685, 0.685, 0.685, 0.685, 0.7180954833
    ), .Tsp = c(2009, 2044, 1), class = "ts")
    > f1 summary(f1eax)
    ERROR: object 'f1eax' not found

    measures m
    [7] ERROR:  object ‘m’ not found

    > ls()
     [1] "decomp"         "f1"             "f2"             "find.freq"
     [5] "snr.dat"        "snr.datsrn.dmx" "srn.dat"        "srn.dmx"
     [9] "srn.dmy"        "srn.tslars"     "x"

  • Paul Prew

    Thank you, Rob.  Worked perfectly.  Next month I start an R programming class, maybe I can avoid these elementary errors in the future.

    I see there’s auto.arima, and this automated measures script; is there any automated procedure for transfer functions? My project involves forecasting raw material cost from price indices of certain commodities. There are a couple of thousand raw materials, and I thought the time series clustering might shed some light. But an automated transfer function procedure seems even more valuable.

    Thanks again for making this code available.  regards, Paul

    • Automated transfer functions would be cool, but it isn’t available in R at this stage. The commercial package Autobox will do it, I think. 

  • Umer Zeeshan Ijaz

    Thanks for sharing the scripts Rob. They are quite helpful! Installing forecast package on Ubuntu (11.04 Natty) was a big pain though as I found out that I can only install “forecast” when the current version of R >=2.14. In my case, I was running 2.12 and so I had to not only install the newer version of R but also update the GCC to version 4.6. In my case, one of the dependencies, “quadprog” wasn’t compiling with the current version (I think it was 4.5) of GCC.

    Having fixed all these problems, I am writing what I did to resolve these issues so that those of us who have debian distros and are facing the same problem can save some hassle.

    ###### If GCC is a problem ######

    The following steps are more or less similar to what is mentioned in the following blogpost.

    Following this, I got the GCC from      (gcc-4.6-20120504.tar.bz2)

    Untar it:
    tar xjvf gcc-4.6-20110401.tar.bz2

    This version of gcc requires the following libraries: gmp-4.3.2, mpfr-2.4.2, mpc-0.8.1. To get these libraries, run the following script in the untarred folder :

    Go to each folders for the above libraries that are created in gcc-4.6-20120427/contrib and install using either:
    make install

    Now we are ready to compile gcc. Come out of gcc-4.6-20120427 and make a folder called build:
    mkdir build

    Now you should have two folders in the current directory i.e.

    Go inside the build folder and run the following
    sudo make install

    Update your Ubuntu to use the new GCC:
    cd /usr/bin
    sudo rm g++
    sudo ln -s /usr/local/bin/g++-4.6 g++
    sudo rm gcc
    sudo ln -s /usr/local/bin/gcc-4.6 gcc

    ##### If using an older version of R #####

    Update R as:
    sudo apt-get install r-base
    sudo apt-get install r-recommended

    Now inside R check if you have the latest version

    Install forecast package now

    If you are accessing internet through your proxy server, set the following environmental variable in R before you run the above command

    I was still unable to get quadprog to install. It required libgfortran to be in /usr/lib/ whereas in my Lenovo thinkpad’s ubuntu distro, by default it was located at /usr/lib/x86_64-linux-gnu/
    I used the following command to find it:
    ldconfig -p | grep -i "libgfortran"

    I then made a soft symbolic link to it in /usr/lib using
    ln -s /usr/lib/x86_64-linux-gnu/ /usr/lib/

    ##### If RKward stopped working after upgrading R #####

    I use RKward ( as a GUI editor to program in R. After updating R, it stopped working properly. It gave the message “The ‘rkward’ R-library either could not be loaded at all, or not in the correct version” and then crashed as soon as I ran any R commands. Here is the fix for it:

    Remove RKward using any of the following commands:
    sudo apt-get purge rkward / sudo apt-get autoremove rkward / sudo apt-get remove rkward

    Make sure that the library is not deinstalled/deselected in dpkg
    dpkg --get-selections | grep -i "rkward"

    If found then purge it using
    dpkg -P rkward

    Make sure you remove the following directory as well
    rm -r ~/.rkward

    Install rkward using this ppa:
    sudo add-apt-repository ppa:rkward-devel/rkward-stable-cran
    sudo apt-get update
    sudo apt-get install rkward

    Umer

    • The latest versions of forecast now require R 2.14 so they can use byte compiling and RcppArmadillo to speed things up.

  • Yanfei Kang

    Hi Rob, 

    Thanks for making the code public. It works perfectly except for the kurtosis and dc kurtosis. You may find in your example on Australian monthly gas production that both the kurtosis and dc kurtosis are 1.0000, which is kind of weird. By trying the code on several other time series, I think the reason is as follows. In

      # Kurtosis
      k <- mean((tadj.x-xbar)^4,na.rm=TRUE)/s^4
      fk <- f1(k,2.273,11567)

    the $s$ is actually the value taken from

      # Skewness
      s <- abs(mean((tadj.x-xbar)^3,na.rm=TRUE)/s^3)
      fs <- f1(s,1.510,5.993)

    but not the wanted $s$ value from

      s <- sd(x,na.rm=TRUE)

    In your example, the skewness value before transformation should be very small, which causes $k$ to be infinity. That’s why the transformed kurtosis $fk$ is 1.0000. The same applies to the dc kurtosis. Thanks again for making the code available.

    Regards, Yanfei

    • Oops. Good catch. That error was introduced when I modified the code for the blog post — it wasn’t in the code we used for the paper. Anyway, now fixed. Thanks.

  • Pingback: Time Series Decomposition: Box Cox for Additive Decomp | Q&A System

  • Scott Locklin

    Thanks kindly for that source code. While I love SOMs, I couldn’t get it to work that way. It does work pretty well with proxy::dist and stats::hclust + stats::cutree.
    This is a great trick: I used something similar to it + knn to detect regimes in subsegments of very large time series on another problem.

    Your version of the technique is better for shorter/lower-frequency time series. The lack of adequate clustering algorithms on CRAN cries out for a remedy, even if it is a pastiche of many packages. Maybe it’s time for me to write one…

  • John Williams

    Thanks heaps Rob. Just a quick note: on my system (R 3.0.1) I had to add a couple of requires: tseries and fracdiff. But that’s trivial. Thanks again! 🙂

    • Thanks. When I wrote this, the forecast package automatically loaded those packages. More recent versions of forecast now only import the functions needed, so there is now a need to add those requires.

  • SriV

    Dear Rob,
    Your code is very informative, thanks. Regarding the ‘normalization’, particularly the function “f1” which maps [0, infinity) to [0, 1]: can you please explain that function a bit more? I am unable to understand it. Also, what are the three arguments to the function (x, a and b)? I looked at its usage further down, but still could not understand.


  • Baman

    Thank you for publishing this helpful code. I’m trying to use the measures() function for my data set, but it gives me an error for some of the series:

    > x
         Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
    2001   9  20  13  33  25   9   5  34  20  11   7   8
    2002   5   4  10  46  71  32  10  15   4  18   7   5
    2003   2   5   5   9  26  18  16  16   6   8   1  13
    2004   5   9   7  32  33  10  24  16  42  33  18  38
    2005  17  13   2  10  12  46  23  37  11  20   9  23
    2006  33  14  31  17  39  20  18  31  26  48  30  53
    2007  10  32  27  27  29  46  33  22  32  30  29  51
    2008  44  20  42  35  42  14  30  26  62  36  34  29
    2009  36  37  25  42  32  48  20

    Error in stl(x, s.window = "periodic", na.action = na.contiguous) :
      only univariate series are allowed

    Could you please give me any suggestion to correct it?

    • It looks like your data is in a dataframe or a matrix object, rather than a ts object. It must be a univariate ts object.

  • Pingback: Hyndsight - Detecting seasonality

  • Geraldine

    Dear Rob
    Thank you so much for posting your code. It is exactly what I was looking for.
    I’m currently working on implementing the second block of your code. Am I wrong that it doesn’t work with time series of fewer than 10 periods? It seems the gam function cannot handle these short time series. I replaced it with an lm function and that seems to work OK. Do you have any idea what else I can do to fix that problem? Around 40% of my time series have fewer than 10 periods.

    Best Geraldine

  • A

    Hi Rob,

    Thanks for posting this R example, it’s exactly what I was looking for. Along the same lines as above, could you use something similar for pattern detection in transactions? I have a sparse transaction data set and would like to generate some summary statistics like periodicity, the dollar amount that is periodic, and others. Any thoughts?

  • Christopher

    Excellent blog I must say!

    The find.freq function works brilliantly. On the daily data set I am using, it correctly worked out the frequency to be 7.

    When I tried it on only the week days, it mentioned the frequency is 23, which is remarkably close to 21.42857 = 30*5/7, the average number of work days in a month. (Or conversely, 23*7/5 is 32.2.)

    Is there a function that gives the list of all the seasonal.periods for use in msts(…)?

    I experimented with a hunch of taking the first period, averaging by that and then finding that period, etc.

    # period.apply() is from the xts package
    find.freq.all <- function(x){
      f <- find.freq(x); freqs <- f
      while(f > 1){
        start <- 1  # also try start=f
        x <- period.apply(x, seq(start, length(x), f), mean)
        f <- find.freq(x); freqs <- c(freqs, f)
      }
      if(length(freqs)==1){ return(freqs); }
      for(i in 2:length(freqs)){ freqs[i] <- freqs[i]*freqs[i-1] }
      freqs[1:(length(freqs)-1)]
    }
    find.freq.all(dailyts) #using daily data

    The above gives (7,28) or (7,35) depending on whether the seq starts with 1 or f. (See the comment in the code above.)

    Which would imply that the seasonal periods for msts should be (7,28) or (7,35).

    The logic appears sensitive to initial conditions given the sensitivity of the algorithm parameters. The mean of 28 and 35 is 31.5 which is close to the average length of a month.

    I suspect I reinvented the wheel, what is the name of this algorithm?

    Any thoughts or comments?

    P.S. Later, I ran the above code in trying all starts of 1:7 and I got 35,35,28,28,28,28,28 for the second period. The average works out to 30. Interesting…

    • I think you would be better off trying to remove the seasonality of the first frequency before looking for the next one. I’m not sure what your period.apply() function does here. You could use stl to estimate the first seasonal pattern, remove it, and then use find.freq() to look at what’s left.

      • Christopher

        The period.apply in the above code computes the weekly average of the daily data.
        Changing that line to x=period.apply(x,seq(start,length(x),f),sum); computes weekly totals and gives the same results.

        When I tried stl() it gives “series is not periodic or has less than two periods.”

        However, I know that the first of the month (that is a weekday) there is additional web traffic over and above the usual. Thus it determines that there is an additional spike every 4th or 5th week.
        Overall there is a growth pattern. Doesn’t that beginning of the month activity count as a seasonal pattern?

  • Pingback: New in forecast 6.0 | Hyndsight

  • sasuke

    Should the result of the function be used as the frequency for my time series?
    In my case I have daily data with an observation every 15 minutes (a total of 96 per day), covering 3 months’ worth of data from last year and this year, with nothing in between.

    • I assume you mean the find.freq() function. If you know the frequency, you don’t need to estimate it. In your case, you have potential seasonal periods of 96, 96*7 and 96*365. Or use all three with an msts object.

  • Thanks. Yes, I was off by one there although it makes almost no difference as the resulting frequency is rounded to an integer. This function has been improved slightly and is now available as findfrequency() in the forecast package.

  • Kathrine Makhnatch

    Thanks for sharing the script. It’s very helpful!
    But in the function mapping [0,infinity) to [0,1], it would be better to replace eax == Inf with is.infinite(eax).

  • Rakesh Tripathi

    How to judge the accuracy of time series decomposition provided by R?

    Because at different level of data aggregation and seasonality different repetitive patterns appear in seasonality index.

    How to know that what is the best suited seasonality and granularity of data on which the decomposition is reliable?

    I am trying to find a signal-to-noise ratio, where the signal is the trend plus seasonality and the noise is the remainder component.

    Is there any theoretical backing through which we can be confident that we can rely on this decomposition for the underlying patterns?

    • This is not a help site. Please ask general questions on

  • Fabrizio Maccallini

    Thank you, it’s people with your attitude that will make our future great! I have a question: in the first block of code, when you compute the spectrum with the spec.ar() function, is there a reason the method should be Yule-Walker (the default) rather than Burg?

  • Uttara Ananthakrishnan

    Thank you so much for this code. I was wondering if there was a reason to set the lag at 10 for the Box test (for the Q statistic). Is that standard practice for time series data, or was it specific to your data?