Time series cross-validation: an R example

I was recently asked how to implement time series cross-validation in R. Time series people would normally call this “forecast evaluation with a rolling origin” or something similar, but it is the natural and obvious analogue to leave-one-out cross-validation for cross-sectional data, so I prefer to call it “time series cross-validation”.

Here is some example code applying time series CV and comparing 1-step, 2-step, …, 12-step forecasts using the Mean Absolute Error (MAE). Here I compare (1) a linear model containing trend and seasonal dummies applied to the log data; (2) an ARIMA model applied to the log data; and (3) an ETS model applied to the original data. The code is slow because I am estimating an ARIMA and ETS model for each iteration. (I’m also estimating a linear model, but that doesn’t take long.)

library(fpp) # To load the data set a10
plot(a10, ylab="$ million", xlab="Year", main="Antidiabetic drug sales")
plot(log(a10), ylab="", xlab="Year", main="Log Antidiabetic drug sales")
k <- 60 # minimum data length for fitting a model
n <- length(a10)
mae1 <- mae2 <- mae3 <- matrix(NA,n-k,12)
st <- tsp(a10)[1]+(k-2)/12
for(i in 1:(n-k))
  xshort <- window(a10, end=st + i/12)
  xnext <- window(a10, start=st + (i+1)/12, end=st + (i+12)/12)
  fit1 <- tslm(xshort ~ trend + season, lambda=0)
  fcast1 <- forecast(fit1, h=12)
  fit2 <- Arima(xshort, order=c(3,0,1), seasonal=list(order=c(0,1,1), period=12), 
      include.drift=TRUE, lambda=0, method="ML")
  fcast2 <- forecast(fit2, h=12)
  fit3 <- ets(xshort,model="MMM",damped=TRUE)
  fcast3 <- forecast(fit3, h=12)
  mae1[i,1:length(xnext)] <- abs(fcast1[['mean']]-xnext)
  mae2[i,1:length(xnext)] <- abs(fcast2[['mean']]-xnext)
  mae3[i,1:length(xnext)] <- abs(fcast3[['mean']]-xnext)
plot(1:12, colMeans(mae1,na.rm=TRUE), type="l", col=2, xlab="horizon", ylab="MAE",
lines(1:12, colMeans(mae2,na.rm=TRUE), type="l",col=3)
lines(1:12, colMeans(mae3,na.rm=TRUE), type="l",col=4)

This yields the following figure.

A useful variation on this procedure is to keep the training window of fixed length. In that case, replace the first line in the for loop with

xshort <- window(a10, start=st+(i-k+1)/12, end=st+i/12)

Then the training set always consists of k observations.

Another variation is to compute one-step forecasts in the test set. Then the body of the for loop should be replaced with the following.

  xshort <- window(a10, end=st + i/12)
  xnext <- window(a10, start=st + (i+1)/12, end=st + (i+12)/12)
  xlong <- window(a10, end=st + (i+12)/12)
  fit1 <- tslm(xshort ~ trend + season, lambda=0)
  fcast1 <- forecast(fit1, h=12)$mean
  fit2 <- Arima(xshort, order=c(3,0,1), seasonal=list(order=c(0,1,1), period=12), 
                include.drift=TRUE, lambda=0, method="ML")
  fit2a <- Arima(xlong, model=fit2, lambda=0)
  fcast2 <- fitted(fit2a)[-(1:length(xshort))]
  fit3 <- ets(xshort,model="MMM",damped=TRUE)
  fit3a <- ets(xlong, model=fit3)
  fcast3 <- fitted(fit3a)[-(1:length(xshort))]
  mae1[i,1:length(xnext)] <- abs(fcast1-xnext)
  mae2[i,1:length(xnext)] <- abs(fcast2-xnext)
  mae3[i,1:length(xnext)] <- abs(fcast3-xnext)

Here the models are fitted to the training set (xshort), and then applied to the longer data set (xlong) without re-estimating the parameters. So the fitted values from the latter are one-step forecasts on the whole data set. Therefore, the last part of the fitted values vector are one-step forecasts on the test set.

Yet another variation which is useful for large data sets is to use a form of k-fold cross-validation where the training sets increment by several values at a time. For example, instead of incrementing by one observation in each iteration, we could shift the training set forward by 12 observations.

k <- 60 # minimum data length for fitting a model
n <- length(a10)
mae1 <- mae2 <- mae3 <- matrix(NA,12,12)
st <- tsp(a10)[1]+(k-1)/12
for(i in 1:12)
  xshort <- window(a10, end=st + (i-1))
  xnext <- window(a10, start=st + (i-1) + 1/12, end=st + i)
  fit1 <- tslm(xshort ~ trend + season, lambda=0)
  fcast1 <- forecast(fit1, h=12)
  fit2 <- Arima(xshort, order=c(3,0,1), seasonal=list(order=c(0,1,1), period=12), 
      include.drift=TRUE, lambda=0, method="ML")
  fcast2 <- forecast(fit2, h=12)
  fit3 <- ets(xshort,model="MMM",damped=TRUE)
  fcast3 <- forecast(fit3, h=12)
  mae1[i,] <- abs(fcast1[['mean']]-xnext)
  mae2[i,] <- abs(fcast2[['mean']]-xnext)
  mae3[i,] <- abs(fcast3[['mean']]-xnext)
plot(1:12, colMeans(mae1), type="l", col=2, xlab="horizon", ylab="MAE",
lines(1:12, colMeans(mae2), type="l",col=3)
lines(1:12, colMeans(mae3), type="l",col=4)

However, because this is based on fewer estimation steps, the results are much more volatile. It may be best to average over the forecast horizon as well:

> mean(mae1)
[1] 0.8053712
> mean(mae2)
[1] 0.7118831
> mean(mae3)
[1] 0.792813

The above code assumes you are using v3.02 or later of the forecast package.

Related Posts:

  • Udo Sglavo shows how to do this in SAS here: Part 1 and Part 2.

    • Nice work! How fast is it compared to my direct approach?

      • This was all run on my laptop, under windows.  I used the doRedis backend and all 4 cores for the parallel runs.
        Linear model (non-parallel): Rob: 3.51, Zach: 2.06
        Arima model: Rob: 184 Zach: 100
        ETS model: Rob: 535 Zach: 410
        It looks like I’m getting a slightly different answer than you for the error at the 1st forecast horizon.  Other than that, the results appear identical.

      • I also ran the Arima model sequentially under my code, and it took 188 seconds.

  • In the first block of code, you setup the for loop as follows: ‘for(i in 1:(n-k-1))’.  This seems to skip the last possible value of xnext, which is a vector of length 1, with a value of 19.43174, for June 2008.  Why note use ‘for(i in 1:(n-k))’ to setup your for loop, and include this last value in your test set?  (If you change the MAE matrices to matrix(NA,n-k,12) the code runs fine) 

    • Thanks Zach. I’m not sure why I left that one out. In any case, I’ve now updated the code as you suggest.

  • Pingback: Time Series Cross Validation | Jeffrey Chin Tang Wong()

  • Anusha Anusha

    Dear Sir, Thank you for a useful post. Would you please give a reference for the above method of cross validating time series? Also I have not found many papers discussing time series cross validation. Is the method suggested above widely acceptable or most used method for carrying out cross validation for time series and are there other alternatives that must be considered along with the above method?

    Thanking you

  • auto.arima does not use cross-validation. You would have to write something similar to auto.arima which uses cv to select the model.

  • Roy Sanderson

    Dear Rob

    Thanks for the excellent description. I’ve been wondering whether this will help me crack an apparently simple problem: I’ve a seasonal Arima model (2,1,1)x(1,0,1) 24hrs, with hrly records over 78 days, and 5 significant xreg variables. I’m trying to determine the “effect size” of each xreg variable, and have been setting 4 of them to their means whilst leaving 1 at is original, calculating predicted values, and doing this in turn for each xreg. Is this correct (whilst it’s the usual way for lm, perhaps not for ts)? Even if it is correct, I’ve no idea how to calculate 95% prediction intervals for xreg effect sizes from an Arima. So I was wondering whether to calculate lots of Arimas, in a ‘leave-k-out’ approach for all but a subset of the data, and forecasting values for each small subset. As this will take a very long time to run on my slow PC I thought I’d better check with you first!
    Many thanks for any hints.


    • Your approach will give you estimated effect sizes. However, I don’t know what you mean by prediction intervals for effect sizes. A prediction interval is for an unobserved random variable, whereas an effect size is not random. Perhaps you mean a confidence interval? In that case, the ci can be estimated as x times the ci for beta.

      • Roy Sanderson

        Dear Rob
        Many thanks for your reply – apologies, I should have said 95% CI. What I’m after is plots similar to those obtained via the allEffects function, except for the xreg variables used in an arima rather than a lm. Many thanks if you think the method I’ve suggested should be OK.

  • Isabella Ghement

    Dear Prof. Hyndman,

    In practice, would the entire time series be used to guess that the ARIMA model should be specified as ARIMA(p,d,q)(P,D,Q)s? where p=3, d=0, q=1, P=0, D=1,Q=1 and s=12?

    Once the structure of the ARIMA model is identified from the entire time series, is this same structure used when producing rolling-window forecasts?

    Thank you in advance for considering my question.

    • You would normally be trying several models and selecting the best based on a cross-validation procedure.

      • Isabella Ghement

        Thank you so much for your prompt response. I am still unclear on how this is supposed to work in practice, so I will elaborate below on my confusion.

        Let’s say that I would like to predict the unknown value of a future observation y_{t+1} on the basis of the observations y_{1}, …, y_{n}.

        Assuming that y_{1}, …, y_{n} follow an ARIMA(p,d,q) process, I would proceed to identify sensible values for p,d,q from the data y_{1}, …, y_{n} . For simplicity, let’s denote these values by p*, d*, q*.

        Given the values p*, d*, q*, I would estimate the parameters of the ARIMA(p*,d*,q*) process and use them to produce a forecast of the future observation y_{t+1}.

        Next, I would try to produce a measure of accuracy to accompany this forecast using the “rolling forecast origin” procedure described at http://otexts.com/fpp/2/5/.

        To this end, I would fit “forecast” ARIMA models to the following rolling training sets (where m < n):

        1) y_{1}, …, y_{m};
        2) y_{1}, …, y_{m+1};
        3) y_{1}, …, y_{m+2};

        My confusion is this: What is the connection between these "forecast" ARIMA models and the ARIMA(p*,d*,q*) model identified from the entire series?

        Do we assume that the "forecast" ARIMA models have the same order structure as the ARIMA(p*,d*,q*) model?

        • Cross-validation is both a method of measuring accuracy *and* a method for choosing a model. The model you finally use for forecasting is the one that gives the best cross-validation accuracy.

          • qoheleth

            This might or might not be Isabella’s question. You don’t normally start the modelling with CV. You would have first made a rough guess on (p,d,q) maybe using your auto.arima function or some other procedure that uses all the data. Then you feed this into a CV routine, possibly with some variants of this model for comparison. The question is, in this case, the test set is no longer unseen, so maybe the test error will still be biased downwards? (this is a question)

  • Maria Froes

    Dear Prof. Hyndman, one thing I don’t understand is, in the following command:

    st <- tsp(a10)[1]+(k-2)/12

    why do you subtract k-2?
    Thank you

    • First, you copied it incorrectly. The post said

      st <- tsp(a10)[1]+(k-1)/12

      This is the time that the shortest possible training set ends. We want k values in the training set, so there are k-1 values after the starting time.

  • Anon

    How would this approach be applied in a nested fashion (i.e. what is the time series equivalence to nested cross validation)?

  • brian

    I have recently tried something similar. I’ve set up a loop to fit a model on a rolling subset (fixed number of obs). I then compute a 1-step ahead forecast. Collecting the individual forecasts into a data frame, I then regress the actual results against the 1-step ahead forecasts. How should I interpret the parameters of this second regression? If I want to make a 1-step ahead forecast, am I better off using the raw model predictions (from the latest rolling regression) or plug in the model predictions into my second regression?

    • The coefficients measure bias in the forecasts. Unless something is very wrong, the intercept should be insignificantly different from zero and the slope should be insignificantly different from one. Assuming that is true, just use the original predictions.

  • qoheleth

    If sample size n =100, but frequency is 35, would it make sense align h to be 35? i.e. use first 30 to be test 31…65, then use first 65 to test 66..100? The problem is of course the error becomes very variable. Or would you ignore frequency when doing CV?

  • Pingback: Hyndsight - ARIMA models with long lags()

  • ubergooroo

    Can I choose a different value for K? what if my entire data length is just 60 periods?

    • Choose whatever value of k you think is reasonable.

  • Affy

    Dear Professor Hyndman,

    Thank you very much for your very useful blog.

    I’m doing one step ahead forecast for time series data set using a univariate regression model. I tried to adjust these codes for my work. I have 232 observations for each variable.I consider 152 as in-sample and 80 observations as out-of-sample. When I run my program surprisingly, I receive 80 list that each list includes 152 forecasts.However, I expect to get just 80 forecast as output of my program. I’ve checked and really cannot understand why it happens!

    #Out-of-Sample forecast
    train= window(data, start=1995,end=2007.66)
    n=nrow(test) #number of out-of-sample observation which is 80
    fc = list()

    for(i in 1:n)
    x=window(data, start= 1995, end=2007.66+(i-1)/12)

    refit= lm(x[,2]~x[,3])


    Also, I don't know how can I change the code to regress x[,2] on lag of x[,3] and use that for my forecast.I really tried to solve these problems but I couldn't. I really appreciate if you help me to find the problem with my codes.

  • Pingback: Distilled News | Data Analytics & R()

  • Jake

    Dear Dr. Hyndman,

    When I tried to do the one-step forecast method using your code listed above, I believe I found an error. The second forecast runs into a problem trying to compute the mae. Specifically at this point

    mae2[i,1:length(xnext)] <- abs(fcast2[['mean']]-xnext)

    At this point R gives the right values, but doesn't spit them out in a time series format, which throws off the the for loop and results in an error. Any idea on how to fix this?


  • Pingback: Selecting ARIMA Order using Rolling Forecast | Question and Answer()

  • Pingback: Rolling forecasts: training versus forecast accuracy evaluation | Question and Answer()