Batch forecasting in R

I sometimes get asked about forecasting many time series automatically. Here is a recent email, for example:

I have looked but cannot find any info on generating forecasts on multiple data sets in sequence. I have been using analysis services for sql server to generate fitted time series but it is too much of a black box (or I don’t know enough to tweak/manage the inputs). In short, what package should I research that will allow me to load data, generate a forecast (presumably best fit), export the forecast then repeat for a few thousand items. I have read that R does not like ‘loops’ but not sure if the current cpu power offsets that or not. Any guidance would be greatly appreciated. Thank you!!

My response

Loops are fine in R. They are frowned upon because people often use them inappropriately when much more efficient vectorized alternatives are available. But for this task, a loop is the only approach.

Reading data and exporting forecasts is standard R and requires no additional packages. To generate the forecasts, use the forecast package: either the ets() function or the auto.arima() function, depending on the type of data you are modelling. If it is high-frequency data (frequency greater than 24), then you will need the tbats() function, but that is very slow.

Some sample code

In the following example, there are many columns of monthly data in a csv file with the first column containing the month of observation (beginning with April 1982). Forecasts have been generated by applying forecast() directly to each time series. That will select an ETS model using the AIC, estimate the parameters, and generate forecasts. Although it returns prediction intervals, in the following code, I’ve simply extracted the point forecasts (named mean in the returned forecast object because they are usually the mean of the forecast distribution).

library(forecast)

# The first column of the csv file is the month of observation; the remaining
# columns are the individual monthly series, beginning April 1982.
retail <- read.csv("http://robjhyndman.com/data/ausretail.csv", header=FALSE)
retail <- ts(retail[,-1], frequency=12, start=c(1982,4))

ns <- ncol(retail)    # number of series
h <- 24               # forecast horizon
fcast <- matrix(NA, nrow=h, ncol=ns)
for(i in 1:ns)
  fcast[,i] <- forecast(retail[,i], h=h)$mean

write(t(fcast), file="retailfcasts.csv", sep=",", ncolumns=ncol(fcast))
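
If an ARIMA model is preferred, only the loop body changes. A minimal sketch, under the same setup as above:

for(i in 1:ns)
  fcast[,i] <- forecast(auto.arima(retail[,i]), h=h)$mean
# For high-frequency data, tbats() could replace auto.arima() in the same way.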

Note that the transpose of the fcast matrix is used in write() because the file is written row-by-row rather than column-by-column.

This code does not do exactly what the questioner asked, as it writes all the forecasts at once rather than exporting them at each iteration; exporting inside the loop would be much less efficient.

If ns is large, this could probably be more efficiently coded using the parallel package.
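
For example, a minimal sketch using parallel::mclapply, assuming a Unix-like machine (on Windows, parLapply() with a cluster would be needed) and the retail, ns and h objects defined above:

library(forecast)
library(parallel)

# Forecast each series on a separate core; the core count here is only an assumption.
fc_list <- mclapply(1:ns, function(i) forecast(retail[,i], h=h)$mean, mc.cores = 4)
fcast <- do.call(cbind, fc_list)
write(t(fcast), file="retailfcasts.csv", sep=",", ncolumns=ncol(fcast))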


Comments


  • FD

    Thanks for the post. Rather than for loops I collect my time series into lists and then use plyr::llply, with a custom function that extracts what I need. In the case above, where the series are all of equal length, you could use plyr::alply. I’ve also been playing around with using the ‘Characteristic Based Clustering Approach’ (from your paper with Wang) and then making forecasts based on a model fit on a cluster, rather than each individual series.
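
    A minimal sketch of this list-based approach, assuming the retail and h objects from the post above:

    library(forecast)
    library(plyr)

    series_list <- lapply(seq_len(ncol(retail)), function(i) retail[, i])  # one list element per series
    fc_list <- llply(series_list, function(x) forecast(x, h = h)$mean)
    fcast <- do.call(cbind, fc_list)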

  • Han

    Another possibility is to use the doMC or foreach package. First, write a function that obtains forecasts for one series. Then loop that function over all the series, storing the results in a list. In doing so, you make use of the multiple cores on a standard Mac or PC.
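
    A minimal sketch of this suggestion, assuming the retail, ns and h objects from the post and a multi-core machine (the core count below is only an assumption):

    library(forecast)
    library(doMC)        # also attaches foreach
    registerDoMC(cores = 4)

    fc_one <- function(x, h) forecast(x, h = h)$mean   # forecasts for one series
    fc_list <- foreach(i = 1:ns) %dopar% fc_one(retail[, i], h)
    fcast <- do.call(cbind, fc_list)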

  • Mike P

    Looking at the retail data after reading in the csv file it appears that it is daily, yet when creating the ts object you define the frequency=12. Wouldn’t this impede your ability to correctly use many of ts and forecast methods (i.e. cycle, fourier, etc)? What is the correct ts frequency setting for daily data? And is there a way to keep the actual starting date (i.e. in this case 1/04/1982) instead of using the generic form of s=1982?

    • It is monthly data. The date column gives the first day in each month.

  • Joseph J

    Thank you for the example. Very helpful.

  • Herimanitra

    Hi Prof. Hyndman!

    have you ever tried to blend a set of ARIMA models? Is it more efficient than choosing the “best” ARIMA?

    Thks

    • No, I haven’t tried that. In general, combining forecasts from different models works best when the models are different. So combining different ARIMA models may not be so effective.

      • Herimanitra

        Mmh!Thks

      • Jim Melichar

        I’ve been experimenting with your TBATS model recently, using time series of varying lengths, e.g. 2-year, 3-year and 4-year data sets. In this case, combining models to try to account for varying seasonality (I’m working with manufacturing data) and macro-economics seems to be effective in producing a slightly better forecast. Does your comment above for ARIMA models also hold for the TBATS model?

        • The model should account for varying seasonality regardless of the length of the series. In any case, the comment is general. Combining forecasts works best when the models are not too similar. But even for very similar models, combining gives some benefit — see the bagging literature for example.

  • Ramesh

    Dear Prof.

    I am a research scholar from India, currently working on time series modelling in R.

    I want to forecast multiple data series in R.

    The data sheet has multiple columns, from T1 to T48.

    Please advise me on how to select the best forecasting method and how to forecast all the columns at once.

    Yours sincerely,
    Ramesh
    India

  • Mila

    Hi! First of all, THANKS a lot for your hard/good work, it makes our life easier 🙂 Now, I also want to automatically forecast, let’s say, a thousand time series, but as they might have significantly different patterns, I apply many of the methods in the forecast package (arfima, auto.arima, ets, with and without Box-Cox transformation, tbats, etc.), then calculate their accuracies and choose the one with the minimum MASE for each time series. Of course, there are some cases I have to analyse separately, but for the large amount of data I have, I thought this would be a reasonable way of doing ‘automatic’ forecasting… Am I on the right path, or do you have any suggestions? Thanks in advance for your time/reply. Kind regards, Mila.

    • You can’t select a model based on in-sample accuracy. Use out-of-sample accuracy instead. Better still, use a rolling forecasting origin, as explained at http://otexts.com/fpp/2/5/
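
      A minimal sketch of a rolling forecasting origin for a single series, assuming a recent version of the forecast package (which provides tsCV()) and the retail object from the post:

      library(forecast)
      y <- retail[, 1]
      # One-step errors from a rolling forecast origin, re-fitting an ETS model at each origin
      e <- tsCV(y, function(x, h) forecast(ets(x), h = h), h = 1)
      sqrt(mean(e^2, na.rm = TRUE))   # out-of-sample RMSE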

      • Mila

        Thanks a lot!

    • A Shigri

      hi,

      I have a similar challenge: forecasting 33,000 time series, where each series is the last 12 months of sales for a PRODUCTID. I want to predict the next month’s sales for each PRODUCTID using automatic forecasting (or auto.arima). Dear Mila, if you found a solution to your problem, could you please share it here or email me at ashigri7@gmail.com

      • If any of your sales data show seasonality (and almost all sales data do), then you need several years of data. One year is woefully inadequate.

  • AY

    I’m not sure if this is the right place to post this; I have a different, but related problem. Given 50k individual time series representing revenue per customer, I would like to forecast the average (or total) revenue for this group, together with a prediction interval. I thought of a few approaches:

    1. Forecast each time series and then aggregate. (user-level)

    2. Aggregate the data first, and then forecast the one aggregated time series. (group-level)

    3. Forecast at the user-level, but cluster the data first, and share parameters across clusters. (hybrid)

    For my first pass at the problem, I tried the user-level approach. To obtain the prediction interval for the mean, would it be reasonable to average the individual prediction intervals (assuming the individual time series are independent)?

    It seems like the hybrid approach should yield the best predictions. Could you point me to any related papers in this regard?

    Thanks,

    AY

  • Amateur TS modeler

    Professor, when you say ns is large (for using the parallel package), what is your ballpark number? I have 100 variables with 700 data points in each. Should the above method work, or would I have to use the parallel package?

  • Dan

    Professor, is there a way to get it to output the model selected/parameters?

  • RK

    This is helpful. I am wondering, however, about the case where the time series data are stacked rather than separated by columns. So if I have three columns (product, date and volume) and I’d like to forecast each product individually, can the forecast package separate each time series, similar to the BY variable in SAS?

    • It is easy to reformat such data into the format I’ve used above. Or to modify my code to use the 3-column matrix directly.
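
      For example, a minimal sketch of that reshaping, assuming a data frame called sales with (hypothetical) columns product, date and volume holding monthly data:

      library(forecast)
      sales <- sales[order(sales$date), ]     # ensure each series is in date order
      # one ts per product (the start date is omitted here for brevity)
      series_list <- lapply(split(sales$volume, sales$product), ts, frequency = 12)
      fc_list <- lapply(series_list, function(x) forecast(x, h = 24)$mean)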

  • Joan Arus

    Excuse my newbie question, but could you please copy the syntax of that last paragraph of R code? Written like this, it doesn’t seem to work in my RStudio. Thanks!

    for(i in 1:ns)
    fcast[,i] <- forecast(retail[,i],h=h)$mean

    write(t(fcast),file="retailfcasts.csv",sep=",",ncol=ncol(fcast))

    • Joan Arus

      sorry! already solved.

  • Thisara Watawana

    How do I measure the forecast accuracy for each time series?

  • Harshad Madhamshettiwar

    Along the same lines as the original question: how do we decompose the retail dataset using the decompose() function and save all the output and metrics for all columns? This is a real-life requirement for me. I am trying to decompose 16 products, all of which have the same periodicity and can be saved as time series. But somehow I don’t think it is working, or perhaps the decompose() function needs each product to be stored as an individual ts object and passed in a loop.

    • Use a list
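
      A minimal sketch of the list approach, assuming the 16 product series are the columns of a ts matrix called products (a hypothetical name):

      products_list <- lapply(seq_len(ncol(products)), function(i) products[, i])
      decomps <- lapply(products_list, decompose)
      trends <- sapply(decomps, function(d) d$trend)   # e.g. collect all the trend components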

      • Harshad Madhamshettiwar

        Thank you for the reply. Yes, I used a list and it works as desired.

  • Jason

    Hello Professor,

    First off, you are my hero and I thank you for your work in this field and creating the forecast package!

    If I want to use a for loop like this and average the point forecasts of the auto.arima and stlf methods, how would I need to change the code? I got as far as the modified for loop below, but it returns NAs for all the forecasts. Any ideas?

    for(i in 1:ns)
    fit5 <- stlf(retail[,i])
    fcast[,i] <- forecast(fit5, h=h)$mean

    • Try this:

      for(i in 1:ns)
      {
        fc1 <- stlf(retail[,i], h=h)                 # stlf() already returns a forecast object
        fc2 <- forecast(auto.arima(retail[,i]), h=h)
        fcast[,i] <- 0.5*(fc1$mean + fc2$mean)       # average the two sets of point forecasts
      }

      • Jason

        That worked perfectly! I did not think to set it up that way. I very much appreciate your help.

  • Priyanka Jha

    Hi Professor,

    I am trying to run ARIMA for a large number of time series (around 100,000 customers) using a for loop. Stationarizing and selecting the (p,d,q) order by lowest AIC value happen inside the loop. The code runs for some of the initial time series (some customers) and then throws the following error:
    solve.default(res$hessian * n.used, A) : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
    I have tried normalizing the data but the same error persists. Could you please let me know whether it is feasible to run ARIMA in this case, and suggest a solution?

  • nj

    Hi professor,
    I have multiple time series in columns 5 to 145 of a csv file. I want to write the ARIMA model parameters p, d, q, phi, theta and delta, one row per column, to another csv file.

    data<-read.csv('ts_columns.csv',header=TRUE,skip = 2, sep=",");

    df1 = data.frame();
    print(df1);

    for (k in 5:145){
    datats<-data[[k]];
    data1 = datats[1:100]; # first 100 points of data – training data

    mdl<-auto.arima(data1,max.p=3, max.q=3, max.P = 3,max.Q = 3, max.d=1, max.D=1, allowmean=TRUE);
    # model constraints

    df1 = rbind(df1, data.frame(mdl));
    }
    write.csv(df1, file='mdl_list.csv',append=TRUE,row.names=FALSE)

    But I am getting an error. I am new to R, so the code is surely written the wrong way. Could you guide me on how to do this?