Time series cross-validation: an R example

I was recently asked how to implement time series cross-validation in R. Time series people would normally call this “forecast evaluation with a rolling origin” or something similar, but it is the natural and obvious analogue to leave-one-out cross-validation for cross-sectional data, so I prefer to call it “time series cross-validation”.

Here is some example code applying time series CV, comparing 1-step, 2-step, …, 12-step forecasts using the Mean Absolute Error (MAE). I compare (1) a linear model containing trend and seasonal dummies applied to the log data; (2) an ARIMA model applied to the log data; and (3) an ETS model applied to the original data. The code is slow because an ARIMA model and an ETS model are estimated at each iteration. (A linear model is also estimated each time, but that doesn’t take long.)

library(fpp) # To load the data set a10
plot(a10, ylab="$million", xlab="Year", main="Antidiabetic drug sales")
plot(log(a10), ylab="", xlab="Year", main="Log Antidiabetic drug sales")

k <- 60 # minimum data length for fitting a model
n <- length(a10)
mae1 <- mae2 <- mae3 <- matrix(NA, n-k, 12)
st <- tsp(a10)[1] + (k-2)/12
for(i in 1:(n-k))
{
  xshort <- window(a10, end=st + i/12)
  xnext <- window(a10, start=st + (i+1)/12, end=st + (i+12)/12)
  fit1 <- tslm(xshort ~ trend + season, lambda=0)
  fcast1 <- forecast(fit1, h=12)
  fit2 <- Arima(xshort, order=c(3,0,1), seasonal=list(order=c(0,1,1), period=12),
    include.drift=TRUE, lambda=0, method="ML")
  fcast2 <- forecast(fit2, h=12)
  fit3 <- ets(xshort, model="MMM", damped=TRUE)
  fcast3 <- forecast(fit3, h=12)
  mae1[i, 1:length(xnext)] <- abs(fcast1[['mean']] - xnext)
  mae2[i, 1:length(xnext)] <- abs(fcast2[['mean']] - xnext)
  mae3[i, 1:length(xnext)] <- abs(fcast3[['mean']] - xnext)
}
plot(1:12, colMeans(mae1, na.rm=TRUE), type="l", col=2, xlab="horizon", ylab="MAE",
  ylim=c(0.65, 1.05))
lines(1:12, colMeans(mae2, na.rm=TRUE), type="l", col=3)
lines(1:12, colMeans(mae3, na.rm=TRUE), type="l", col=4)
legend("topleft", legend=c("LM","ARIMA","ETS"), col=2:4, lty=1)

This yields the following figure.

A useful variation on this procedure is to keep the training window of fixed length. In that case, replace the first line in the for loop with

xshort <- window(a10, start=st + (i-k+1)/12, end=st + i/12)

Then the training set always consists of k observations. Another variation is to compute one-step forecasts in the test set. Then the body of the for loop should be replaced with the following.
xshort <- window(a10, end=st + i/12)
xnext <- window(a10, start=st + (i+1)/12, end=st + (i+12)/12)
xlong <- window(a10, end=st + (i+12)/12)
fit1 <- tslm(xshort ~ trend + season, lambda=0)
fcast1 <- forecast(fit1, h=12)$mean
fit2 <- Arima(xshort, order=c(3,0,1), seasonal=list(order=c(0,1,1), period=12),
  include.drift=TRUE, lambda=0, method="ML")
fit2a <- Arima(xlong, model=fit2, lambda=0)
fcast2 <- fitted(fit2a)[-(1:length(xshort))]
fit3 <- ets(xshort, model="MMM", damped=TRUE)
fit3a <- ets(xlong, model=fit3)
fcast3 <- fitted(fit3a)[-(1:length(xshort))]
mae1[i, 1:length(xnext)] <- abs(fcast1 - xnext)
mae2[i, 1:length(xnext)] <- abs(fcast2 - xnext)
mae3[i, 1:length(xnext)] <- abs(fcast3 - xnext)

Here the models are fitted to the training set (xshort) and then applied to the longer data set (xlong) without re-estimating the parameters. The fitted values from the latter are therefore one-step forecasts on the whole data set, so the last part of the fitted values vector consists of one-step forecasts on the test set.

Yet another variation, useful for large data sets, is a form of k-fold cross-validation in which the training set increments by several observations at a time. For example, instead of incrementing by one observation in each iteration, we could shift the training set forward by 12 observations.

k <- 60 # minimum data length for fitting a model
n <- length(a10)
mae1 <- mae2 <- mae3 <- matrix(NA, 12, 12)
st <- tsp(a10)[1] + (k-1)/12
for(i in 1:12)
{
  xshort <- window(a10, end=st + (i-1))
  xnext <- window(a10, start=st + (i-1) + 1/12, end=st + i)
  fit1 <- tslm(xshort ~ trend + season, lambda=0)
  fcast1 <- forecast(fit1, h=12)
  fit2 <- Arima(xshort, order=c(3,0,1), seasonal=list(order=c(0,1,1), period=12),
    include.drift=TRUE, lambda=0, method="ML")
  fcast2 <- forecast(fit2, h=12)
  fit3 <- ets(xshort, model="MMM", damped=TRUE)
  fcast3 <- forecast(fit3, h=12)
  mae1[i,] <- abs(fcast1[['mean']] - xnext)
  mae2[i,] <- abs(fcast2[['mean']] - xnext)
  mae3[i,] <- abs(fcast3[['mean']] - xnext)
}
plot(1:12, colMeans(mae1), type="l", col=2, xlab="horizon", ylab="MAE",
  ylim=c(0.35, 1.5))
lines(1:12, colMeans(mae2), type="l", col=3)
lines(1:12, colMeans(mae3), type="l", col=4)
legend("topleft", legend=c("LM","ARIMA","ETS"), col=2:4, lty=1)

However, because this is based on fewer estimation steps, the results are much more volatile. It may be best to average over the forecast horizon as well:

> mean(mae1)
[1] 0.8053712
> mean(mae2)
[1] 0.7118831
> mean(mae3)
[1] 0.792813

The above code assumes you are using v3.02 or later of the forecast package.

Related Posts:

• Udo Sglavo shows how to do this in SAS here: Part 1 and Part 2.

• Nice work! How fast is it compared to my direct approach?

• This was all run on my laptop, under Windows. I used the doRedis backend and all 4 cores for the parallel runs.
Linear model (non-parallel): Rob: 3.51, Zach: 2.06
Arima model: Rob: 184 Zach: 100
ETS model: Rob: 535 Zach: 410
It looks like I’m getting a slightly different answer from yours for the error at the first forecast horizon. Other than that, the results appear identical.

• I also ran the Arima model sequentially under my code, and it took 188 seconds.

• In the first block of code, you set up the for loop as follows: ‘for(i in 1:(n-k-1))’. This seems to skip the last possible value of xnext, which is a vector of length 1, with a value of 19.43174, for June 2008. Why not use ‘for(i in 1:(n-k))’ to set up your for loop, and include this last value in your test set? (If you change the MAE matrices to matrix(NA,n-k,12) the code runs fine.)

• Thanks Zach. I’m not sure why I left that one out. In any case, I’ve now updated the code as you suggest.

• Pingback: Time Series Cross Validation | Jeffrey Chin Tang Wong

• Anusha

Dear Sir, Thank you for a useful post. Would you please give a reference for the above method of cross-validating time series? I have not found many papers discussing time series cross-validation. Is the method suggested above a widely accepted or commonly used way of carrying out cross-validation for time series, and are there alternatives that should be considered alongside it?

Thanking you
Anusha

• It is widely used in forecasting research, but is not normally called cross-validation. I was pointing out that this approach is analogous to cross-validation used for cross-sectional data. For references, see my book at http://otexts.com/fpp/2/5/ or Fildes (1992) vol 8, p81-98.

• auto.arima does not use cross-validation. You would have to write something similar to auto.arima which uses cv to select the model.
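For instance, a minimal sketch of such a routine, using base R’s arima() and predict() so it runs without extra packages (the candidate orders, the minimum training length k, and the built-in lh series are arbitrary illustrative choices, not a recommendation):

```r
# Hypothetical sketch: choose an ARIMA order by rolling-origin CV accuracy
# rather than by AIC, as auto.arima does.
cv_mae <- function(y, ord, k, h = 1) {
  n <- length(y)
  errs <- sapply(k:(n - h), function(i) {
    fit <- arima(y[1:i], order = ord)              # fit on observations 1..i
    abs(predict(fit, n.ahead = h)$pred[h] - y[i + h])
  })
  mean(errs)
}
orders <- list(c(1, 0, 0), c(2, 0, 0), c(1, 0, 1)) # candidate (p,d,q) orders
scores <- sapply(orders, function(ord) cv_mae(lh, ord, k = 30))
best <- orders[[which.min(scores)]]                # order with lowest CV MAE
```

In practice you would use the rolling-origin loop from the post with the forecast package’s Arima(), and pick candidate orders from the ACF/PACF or from auto.arima().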

• Roy Sanderson

Dear Rob

Thanks for the excellent description. I’ve been wondering whether this will help me crack an apparently simple problem: I have a seasonal ARIMA model, (2,1,1)x(1,0,1) with period 24 hours, fitted to hourly records over 78 days, with 5 significant xreg variables. I’m trying to determine the “effect size” of each xreg variable, and have been setting 4 of them to their means whilst leaving 1 at its original values, calculating predicted values, and doing this in turn for each xreg. Is this correct (whilst it’s the usual way for lm, perhaps not for ts)? Even if it is correct, I’ve no idea how to calculate 95% prediction intervals for xreg effect sizes from an ARIMA model. So I was wondering whether to calculate lots of ARIMA models, in a ‘leave-k-out’ approach for all but a subset of the data, and forecast values for each small subset. As this will take a very long time to run on my slow PC, I thought I’d better check with you first!
Many thanks for any hints.

Roy

• Your approach will give you estimated effect sizes. However, I don’t know what you mean by prediction intervals for effect sizes. A prediction interval is for an unobserved random variable, whereas an effect size is not random. Perhaps you mean a confidence interval? In that case, the CI for the effect can be estimated as x times the CI for beta.
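As a rough sketch of that calculation on simulated data (the series, the regressor z, and the evaluation point x0 are all made up for illustration; the coefficient covariance matrix of an arima fit is stored in its var.coef component):

```r
# Sketch with simulated data: a 95% CI for beta from an ARIMA fit with a
# regressor, and the implied CI for the effect x0 * beta at a chosen x0.
set.seed(1)
xr <- cbind(z = rnorm(100))                       # made-up regressor
y <- arima.sim(list(ar = 0.5), n = 100) + 2 * xr[, "z"]
fit <- arima(y, order = c(1, 0, 0), xreg = xr)
beta <- coef(fit)["z"]                            # estimated coefficient
se <- sqrt(diag(fit$var.coef))["z"]               # its standard error
beta_ci <- beta + c(-1.96, 1.96) * se             # 95% CI for beta
x0 <- 1.5                                         # regressor value of interest
effect_ci <- x0 * beta_ci                         # CI for the effect at z = x0
```

The same idea applies per-regressor when there are several xreg columns: take the relevant diagonal entry of var.coef for each.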

• Roy Sanderson

Dear Rob
Many thanks for your reply – apologies, I should have said 95% CI. What I’m after is plots similar to those obtained via the allEffects function, but for the xreg variables used in an arima rather than an lm. Many thanks for confirming that the method I’ve suggested should be OK.
Roy

• Isabella Ghement

Dear Prof. Hyndman,

In practice, would the entire time series be used to decide that the ARIMA model should be specified as ARIMA(p,d,q)(P,D,Q)s, where p=3, d=0, q=1, P=0, D=1, Q=1 and s=12?

Once the structure of the ARIMA model is identified from the entire time series, is this same structure used when producing rolling-window forecasts?

Thank you in advance for considering my question.
Isabella

• You would normally be trying several models and selecting the best based on a cross-validation procedure.

• Isabella Ghement

Thank you so much for your prompt response. I am still unclear on how this is supposed to work in practice, so I will elaborate below on my confusion.

Let’s say that I would like to predict the unknown value of a future observation y_{t+1} on the basis of the observations y_{1}, …, y_{n}.

Assuming that y_{1}, …, y_{n} follow an ARIMA(p,d,q) process, I would proceed to identify sensible values for p,d,q from the data y_{1}, …, y_{n} . For simplicity, let’s denote these values by p*, d*, q*.

Given the values p*, d*, q*, I would estimate the parameters of the ARIMA(p*,d*,q*) process and use them to produce a forecast of the future observation y_{t+1}.

Next, I would try to produce a measure of accuracy to accompany this forecast using the “rolling forecast origin” procedure described at http://otexts.com/fpp/2/5/.

To this end, I would fit “forecast” ARIMA models to the following rolling training sets (where m < n):

1) y_{1}, …, y_{m};
2) y_{1}, …, y_{m+1};
3) y_{1}, …, y_{m+2};
etc.

My confusion is this: What is the connection between these "forecast" ARIMA models and the ARIMA(p*,d*,q*) model identified from the entire series?

Do we assume that the "forecast" ARIMA models have the same order structure as the ARIMA(p*,d*,q*) model?

• Cross-validation is both a method of measuring accuracy *and* a method for choosing a model. The model you finally use for forecasting is the one that gives the best cross-validation accuracy.

• qoheleth

This might or might not be Isabella’s question. You don’t normally start the modelling with CV. You would have first made a rough guess on (p,d,q) maybe using your auto.arima function or some other procedure that uses all the data. Then you feed this into a CV routine, possibly with some variants of this model for comparison. The question is, in this case, the test set is no longer unseen, so maybe the test error will still be biased downwards? (this is a question)

• Maria Froes

Dear Prof. Hyndman, one thing I don’t understand is, in the following command:

st <- tsp(a10)[1]+(k-2)/12

why do you subtract k-2?
Thank you

• First, you copied it incorrectly. The post said

st <- tsp(a10)[1]+(k-1)/12

This is the time at which the shortest possible training set ends. We want k values in the training set, so its end is k-1 time steps after the starting time.
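A quick way to check this arithmetic, on a made-up monthly series:

```r
# The shortest training set should contain exactly k observations when it
# ends at start + (k-1)/12 on a monthly series.
y <- ts(rnorm(100), start = c(1991, 7), frequency = 12)  # arbitrary series
k <- 60
st <- tsp(y)[1] + (k - 1) / 12
length(window(y, end = st))  # gives k
```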

• Maria Froes

Got it! Thank you!
But I think I haven’t copied it incorrectly. In your first set of code (line 6) you have k-2. But I understood the k-1. Thank you again.

• Sorry. I was looking at the last set. In that first set, I subtracted one more because I added i/12 in the loop, starting with i=1.

• Michael Lovell

Would that essentially be like using the leave-one-out method of cross-validation such as in the following example?…

• No. The LOO CV described there is for non-time-series data. See http://robjhyndman.com/hyndsight/crossvalidation/ for discussion of that problem.

• Michael Lovell

Thank you for your explanation. However, per this illustration: http://i.imgur.com/Som6cwz.png

if the dataset in the illustration was time series data and was sorted from past to present from left to right, I’m still a bit confused on how Leave-One-Out wouldn’t be identical to your explanation on proper Time Series CV.
Thanks again!

• No. “Leave one out” means leave ONE observation out. Time series CV means leave ALL future observations out.

• Anon

How would this approach be applied in a nested fashion (i.e. what is the time series equivalent of nested cross-validation)?

• brian

I have recently tried something similar. I’ve set up a loop to fit a model on a rolling subset (fixed number of obs). I then compute a 1-step ahead forecast. Collecting the individual forecasts into a data frame, I then regress the actual results against the 1-step ahead forecasts. How should I interpret the parameters of this second regression? If I want to make a 1-step ahead forecast, am I better off using the raw model predictions (from the latest rolling regression) or plug in the model predictions into my second regression?

• The coefficients measure bias in the forecasts. Unless something is very wrong, the intercept should be insignificantly different from zero and the slope should be insignificantly different from one. Assuming that is true, just use the original predictions.
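A small sketch of that diagnostic regression on simulated values (this is sometimes called a Mincer-Zarnowitz regression; the fcast and actual vectors here are stand-ins for the forecasts and outcomes collected from the rolling fits):

```r
# Simulated stand-ins: nearly unbiased 1-step forecasts of 'actual'.
set.seed(2)
fcast <- rnorm(50, mean = 10)
actual <- fcast + rnorm(50, sd = 0.5)
mz <- lm(actual ~ fcast)           # regress actuals on forecasts
coef(summary(mz))                  # intercept should be near 0, slope near 1
```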

• qoheleth

If the sample size is n = 100 but the frequency is 35, would it make sense to align h with the frequency, i.e. use the first 30 observations to forecast 31…65, then use the first 65 to forecast 66…100? The problem of course is that the error becomes very variable. Or would you ignore the frequency when doing CV?

• Pingback: Hyndsight - ARIMA models with long lags

• ubergooroo

Can I choose a different value for k? What if my entire data length is just 60 periods?

• Choose whatever value of k you think is reasonable.

• Affy

Dear Professor Hyndman,

Thank you very much for your very useful blog.

I’m doing one-step-ahead forecasts for a time series data set using a univariate regression model, and I tried to adjust this code for my work. I have 232 observations for each variable. I use 152 observations as the in-sample set and 80 observations as the out-of-sample set. When I run my program, surprisingly, I receive a list of 80 elements, each containing 152 forecasts. However, I expect just 80 forecasts as the output of my program. I’ve checked and really cannot understand why this happens!

#Out-of-Sample forecast
library(forecast)
h<-1
train= window(data, start=1995,end=2007.66)
test=window(data,start=2007.66)
n=nrow(test) #number of out-of-sample observation which is 80
fc = list()
fit=lm(train[,2]~train[,3])

for(i in 1:n)
{
x=window(data, start= 1995, end=2007.66+(i-1)/12)

refit= lm(x[,2]~x[,3])

fc[[i]]=forecast(refit,newdata=x,h)$mean
}

Also, I don't know how I can change the code to regress x[,2] on a lag of x[,3] and use that for my forecast. I really tried to solve these problems but I couldn't. I would really appreciate it if you could help me find the problem with my code.

• Pingback: Distilled News | Data Analytics & R

• Jake

Dear Dr. Hyndman,

When I tried to do the one-step forecast method using your code listed above, I believe I found an error. The second forecast runs into a problem trying to compute the mae. Specifically at this point

mae2[i,1:length(xnext)] <- abs(fcast2[['mean']]-xnext)

At this point R gives the right values, but doesn't return them in a time series format, which throws off the for loop and results in an error. Any idea how to fix this?

Thanks,
Jake

• Sorry. I’ve now fixed the code.

• Nguyễn Ngọc Đài Trang

Dear Prof. Hyndman,

Could you please advise me for the case of a quarterly time series? I tried to modify your code (the first block) as follows, but I am not sure it is correct.

k <- 60 # minimum data length for fitting a model
n <- length(GVA_tf)
mae <- matrix(NA,n-k,4)
st <- tsp(GVA_tf)[1]+(k-2)/4

for(i in 1:(n-k)) {
xshort <- window(GVA_tf, end=st + i/4)
xnext <- window(GVA_tf, start=st + (i+1)/4, end=st + (i+4)/4, extend = TRUE)
fit <- ar.ols(xshort, aic = TRUE, order.max = 4, demean = FALSE)
fcast <- forecast(fit, h=4)
mae[i,1:length(xnext)] <- abs(fcast[['mean']]-xnext)
}

I got a warning about the end value of xnext, so I set extend = TRUE. I guess something is not correct, because the out-of-sample set should lie within the length of the time series.
I would appreciate your help very much.

Thanks and best regards,
Trang Nguyen

• Nguyễn Ngọc Đài Trang

Dear Prof. Hyndman,

Could you please advise me how to adjust your code (first block) for a quarterly time series? I tried the following, but something is not correct because I get a warning that the end value of xnext is being extended. I need this code for my Master’s thesis, so I would very much appreciate your help.

GVA <- data[,1]
GVA_ts <- ts(log(GVA), start=c(1991,1), end=c(2010,4), frequency=4)
GVA_tf <- diff(GVA_ts, differences=ndiffs(GVA_ts))

k <- 60 # minimum data length for fitting a model
n <- length(GVA_tf)
mae <- matrix(NA,n-k,4)
st <- tsp(GVA_tf)[1]+(k-2)/4

for(i in 1:(n-k)) {
xshort <- window(GVA_tf, end=st + i/4)
xnext <- window(GVA_tf, start=st + (i+1)/4, end=st + (i+4)/4, extend = TRUE)
fit <- ar.ols(xshort, aic = TRUE, order.max = 4, demean = FALSE)
fcast <- forecast(fit, h=4)
mae[i,1:length(xnext)] <- abs(fcast[['mean']]-xnext)
}

Thanks and best regards,
Trang Nguyen

• mukund

When should one use each of the following ARIMA modelling strategies?

1. Estimate the model order on the training data set and use the same order to forecast future values (updating only the parameter estimates).
2. Use a rolling window (e.g. 30 days) to make each new forecast, re-estimating both the model order and the parameters every time.

I tried the above strategies to forecast the daily downtime of a machine as a percentage of scheduled time in manufacturing. Using the cross-validation methods, I found that the rolling window with re-estimation of the model order (p,d,q) and parameters for every new forecast gives the better model (MAPE of 3.34% compared to 10.25% for the first model).

I am confused about why method 2 performs better. Am I doing something wrong? How do we explain this in the manufacturing case? The breakdown pattern of the machine changes whenever repair work is done, since breakdowns also depend on the “quality” of the repair work, so it is hard to fix a model order (p,d,q) for the machine. Could this be a reason for model 2 to perform better? Any thoughts? Also, is it practical and advisable to re-estimate the model order every time?