Batch forecasting in R

I some­times get asked about fore­cast­ing many time series auto­mat­i­cally. Here is a recent email, for example:

I have looked but can­not find any info on gen­er­at­ing fore­casts on mul­ti­ple data sets in sequence. I have been using analy­sis ser­vices for sql server to gen­er­ate fit­ted time series but it is too much of a black box (or I don’t know enough to tweak/​manage the inputs). In short, what pack­age should I research that will allow me to load data, gen­er­ate a fore­cast (pre­sum­ably best fit), export the fore­cast then repeat for a few thou­sand items. I have read that R does not like ‘loops’ but not sure if the cur­rent cpu power off­sets that or not. Any guid­ance would be greatly appre­ci­ated. Thank you!!

My response

Loops are fine in R. The are frowned upon because peo­ple use them inap­pro­pri­ately when there are often much more effi­cient vec­tor­ized ver­sions avail­able. But for this task, a loop is the only approach.

Read­ing data and export­ing fore­casts is stan­dard R and does not require any addi­tional pack­ages to load. To gen­er­ate the fore­casts, use the fore­cast pack­age. Either the ets() func­tion or the auto.arima() func­tion depend­ing on what type of data you are mod­el­ling. If it’s high fre­quency data (fre­quency greater than 24) than you would need the tbats() func­tion but that is very slow.

Some sam­ple code

In the fol­low­ing exam­ple, there are many columns of monthly data in a csv file with the first col­umn con­tain­ing the month of obser­va­tion (begin­ning with April 1982). Fore­casts have been gen­er­ated by apply­ing forecast() directly to each time series. That will select an ETS model using the AIC, esti­mate the para­me­ters, and gen­er­ate fore­casts. Although it returns pre­dic­tion inter­vals, in the fol­low­ing code, I’ve sim­ply extracted the point fore­casts (named mean in the returned fore­cast object because they are usu­ally the mean of the fore­cast distribution).

retail <- read.csv("",header=FALSE)
retail <- ts(retail[,-1],f=12,s=1982+3/12)
ns <- ncol(retail)
h <- 24
fcast <- matrix(NA,nrow=h,ncol=ns)
for(i in 1:ns)
  fcast[,i] <- forecast(retail[,i],h=h)$mean

Note that the trans­pose of the fcast matrix is used in write() because the file is writ­ten row-​​by-​​row rather than column-​​by-​​column.

This code does not actu­ally do what the ques­tioner asked as I am writ­ing all fore­casts at once rather than export­ing them at each iter­a­tion. The lat­ter is much less efficient.

If ns is large, this could prob­a­bly be more effi­ciently coded using the par­al­lel pack­age.

Related Posts:

  • FD

    Thanks for the post. Rather than for loops I col­lect my time series into lists and then use plyr::llply, with a cus­tom func­tion that extracts what I need. In the case above, where the series are all of equal length, you could use plyr::alply. I’ve also been play­ing around with using the ‘Char­ac­ter­is­tic Based Clus­ter­ing Approach’ (from your paper with Wang) and then mak­ing fore­casts based on a model fit on a clus­ter, rather than each indi­vid­ual series.

  • Han

    Another pos­si­bil­ity is to use the doMC or fore­ach pack­age. First of all, write a func­tion to obtain fore­casts for one sam­ple. Then, loop the func­tion for all repli­ca­tions and results are stored in a list nor­mally. In doing so, you uti­lize multi-​​core fea­tures in stan­dard Mac or PC.

  • Mike P

    Look­ing at the retail data after read­ing in the csv file it appears that it is daily, yet when cre­at­ing the ts object you define the frequency=12. Wouldn’t this impede your abil­ity to cor­rectly use many of ts and fore­cast meth­ods (i.e. cycle, fourier, etc)? What is the cor­rect ts fre­quency set­ting for daily data? And is there a way to keep the actual start­ing date (i.e. in this case 1÷04÷1982) instead of using the generic form of s=1982?

    • Rob J Hyndman

      It is monthly data. The date col­umn gives the first day in each month.

  • Joseph J

    Thank you for the exam­ple. Very helpful.

  • Her­i­man­i­tra

    Hi Prof. Hyndman!

    have you ever tried to blend a set of ARIMA mod­els? Is it more effi­cient than choos­ing the “best” ARIMA?


    • Rob J Hyndman

      No, I haven’t tried that. In gen­eral, com­bin­ing fore­casts from dif­fer­ent mod­els works best when the mod­els are dif­fer­ent. So com­bin­ing dif­fer­ent ARIMA mod­els may not be so effective.

      • Her­i­man­i­tra


  • Ramesh

    Dear Prof.

    I am a research scholar from India. Cur­rent work­ing on Time series mod­el­ing of R.

    I want to fore­cast mul­ti­ple data series in R .

    the data sheet hav­ing mul­ti­ple col­umn fromT1 to T48.

    Please advice me how can I select a best fore­cast­ing method and also fore­cast all the columns at a time.

    Your’s Sin­cerely

  • Mila

    Hi! First of all, THANKS a lot for your hard/​good work, it makes our life eas­ier :) Now, I also want to auto­mat­i­cally fore­cast let’s say a thou­sand time series, but as they might have sig­nif­i­cant pat­tern dif­fer­ences, I apply many of the meth­ods in the fore­cast pack­age (arfima, auto.arima, ets, with and with­out Box Cox transf., tbats, etc.), then I cal­cu­late their accu­ra­cies and choose the one with min­i­mum MASE for each time series; of course, there are some cases that I have to anal­ize sep­a­rately, but for the large amount of data I have I thought this would be a rea­son­able way of doing ‘auto­matic’ fore­cast­ing… Am I on the right path, or do you have any sug­ges­tions…? Thanks in advance for your time/​reply. Kind regards, Mila.

    • Rob J Hyndman

      You can’t select a model based on in-​​sample accu­racy. Use out-​​of-​​sample accu­racy instead. Bet­ter still, a rolling fore­cast­ing ori­gin as explained at http://​otexts​.com/​f​p​p​/2/5/

      • Mila

        Thanks a lot!

  • AY

    I’m not sure if this is the right place to post this; I have a dif­fer­ent, but related prob­lem. Given 50k indi­vid­ual time series rep­re­sent­ing rev­enue per cus­tomer, I would like to fore­cast the aver­age (or total) rev­enue for this group, together with a pre­dic­tion inter­val. I thought of a few approaches:

    1. Fore­cast each time series and then aggre­gate. (user-​​level)

    2. Aggre­gate the data first, and then fore­cast the one aggre­gated time series. (group-​​level)

    3. Fore­cast at the user-​​level, but clus­ter the data first, and share para­me­ters across clus­ters. (hybrid)

    For my first pass at the prob­lem, I tried the user-​​level approach. To obtain the pre­dic­tion inter­val for the mean, would it be rea­son­able to aver­age the indi­vid­ual pre­dic­tion inter­vals (assum­ing the indi­vid­ual time series are independent) ?

    It seems like the hybrid approach should yield the best pre­dic­tions. Could you point me to any related papers in this regard?



  • Ama­teur TS modeler

    Pro­fes­sor, When you say ns is large(to use the par­al­lel pack­age) what is your ball park num­ber??
    I have 100 vari­ables and 700 date points in each. Should the above method work or would I have to use the par­al­lel package?

    • Rob J Hyndman

      Exper­i­ment. See what works best.

  • Dan

    Pro­fes­sor, is there a way to get it to out­put the model selected/​parameters?

  • RK

    This is help­ful, I am won­der­ing how­ever if the time series data are stacked instead of sep­a­rated by columns. So, if I have three columns prod­uct, date, and vol­ume and I’d like to fore­cast each prod­uct ind­di­vid­u­ally, can the fore­cast pack­age sep­a­rate each time series… sim­i­lar to by vari­able in SAS.

    • Rob J Hyndman

      It is easy to refor­mat such data into the for­mat I’ve used above. Or to mod­ify my code to use the 3-​​column matrix directly.