Posts tagged forecasting

Benchmarks for forecasting

Every week I reject papers sub­mit­ted to the Inter­na­tional Jour­nal of Fore­cast­ing because they present new meth­ods with­out ever attempt­ing to demon­strate that the new meth­ods are bet­ter than exist­ing meth­ods. It is a pol­icy of the jour­nal that every new method must be com­pared to stan­dard bench­marks and exist­ing meth­ods before the paper will even be con­sid­ered for publication.

For univariate time series methods, this is not difficult. As a minimum, comparisons should be made against a naive method and a standard method such as an ARIMA model.

  1. The naive method for non-seasonal data is based on a ran­dom walk — all fore­casts are equal to the last obser­va­tion. For sea­sonal data, the best naive method is to use the last obser­va­tion from the same sea­son. That is, for monthly data, fore­casts for Feb­ru­ary are all equal to the last Feb­ru­ary observation.
  2. Com­par­isons with ARIMA mod­els used to be prob­lem­atic because some authors did not have suf­fi­cient exper­tise to fit a good ARIMA model, and so com­par­isons were some­times made, for exam­ple, against a non-seasonal AR model when the data were obvi­ously sea­sonal. This should no longer be a prob­lem as there are now good auto­matic ARIMA algo­rithms such as auto.arima() in the fore­cast pack­age for R.
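The two naive benchmarks described above can be sketched in a few lines. This is only an illustration: the function names and the quarterly series are hypothetical, and the seasonal version assumes the history ends at the end of a known seasonal period.

```python
def naive_forecast(history, horizon):
    """Random-walk benchmark: every forecast equals the last observation."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, period):
    """Each forecast equals the last observation from the same season."""
    if len(history) < period:
        raise ValueError("need at least one full seasonal cycle")
    last_cycle = history[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


# Hypothetical quarterly series covering two years.
sales = [10, 20, 30, 40, 12, 22, 33, 44]
print(naive_forecast(sales, 3))
print(seasonal_naive_forecast(sales, 6, period=4))
```

Any proposed method should at least be compared with forecasts like these on the same test period.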

For multivariate time series, the same univariate benchmarks can be used, applied to each series separately.

For methods involving covariates, a standard linear regression can often provide a basic benchmark. Authors sometimes argue that linear regression is not appropriate for their data (e.g., because of non-linear relationships or correlations), but that is not the point. I don’t care whether the linear regression is appropriate; I just want them to show that their method provides better predictions than a standard and simple benchmark. If it can’t beat a simple standard regression, especially one that is inappropriate for the data, there is not much point proceeding.
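As a rough sketch of that comparison, the code below fits a one-covariate least-squares line on training data and checks whether a candidate method's test-set predictions have a smaller mean absolute error. The function names, the single covariate and the choice of MAE are assumptions made for illustration; a real comparison would use whatever covariates and accuracy measure suit the problem.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def beats_regression_benchmark(x_train, y_train, x_test, y_test, candidate):
    """True if the candidate predictions have lower test MAE than the regression."""
    intercept, slope = fit_simple_ols(x_train, y_train)
    benchmark = [intercept + slope * xi for xi in x_test]
    return mean_absolute_error(y_test, candidate) < mean_absolute_error(y_test, benchmark)


# Hypothetical data with an obviously non-linear relationship.
x_train, y_train = [1, 2, 3, 4], [1, 4, 9, 16]
x_test, y_test = [5, 6], [25, 36]
print(beats_regression_benchmark(x_train, y_train, x_test, y_test, [24, 37]))
```

The regression is clearly the wrong model for these data, which is exactly why a new method ought to beat it comfortably.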

The best benchmarks are those that are already published. For example, new univariate time series methods can be compared with the M-competition or M3 competition data where there are already published evaluations on large numbers of series. In this case, authors do not even have to implement the benchmarks themselves. All they have to do is use the same test sets and compare their MAPE or sMAPE values with those published for other methods.
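For reference, here is a minimal sketch of the two accuracy measures mentioned above. The sMAPE shown is one common definition (with absolute values in the denominator); definitions vary between papers, so when comparing with published results it is worth checking that the formula matches the one used in the published evaluation. The numbers are made up.

```python
def mape(actual, forecast):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE, using the mean of |actual| and |forecast| as the scale."""
    total = sum(abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))
    return 200 * total / len(actual)


# Hypothetical test set and forecasts.
actual = [100, 200]
forecast = [110, 180]
print(mape(actual, forecast), smape(actual, forecast))
```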

Just beat­ing the bench­marks is not, of itself, jus­ti­fi­ca­tion for pub­li­ca­tion, but it helps. It is also nec­es­sary to be able to describe your new method in enough detail and clar­ity that oth­ers could imple­ment it. It is usu­ally also nec­es­sary to show that the method works on more than one data set. It is rel­a­tively easy to find a method that out­per­forms the bench­marks on a sin­gle data set; but that is no rea­son to think it will be use­ful on other data sets. The M-competitions are use­ful as they pro­vide a large set of data for com­par­isons. If a method does well on 1001 or 3003 time series, then I know it is not a fluke.

Sim­i­larly, not being able to beat the bench­marks does not, of itself, mean the paper is dead. It may be that the new method is not far behind the bench­marks but has other advan­tages. Or the new method may be par­tic­u­larly good in some cir­cum­stances or for a small sub­set of problems.

The job of the author is to care­fully and per­sua­sively present the case for their pro­posed method. As an edi­tor, I am look­ing for authors to con­vince me of the value of their ideas. Papers propos­ing new fore­cast­ing meth­ods must include com­par­isons with stan­dard bench­marks, and should involve large scale empir­i­cal evaluations.

The tourism forecasting competition

Recently I wrote a paper enti­tled “The tourism fore­cast­ing com­pe­ti­tion” in which we (i.e., George Athana­sopou­los, Haiyan Song, Doris Wu and I) com­pared var­i­ous fore­cast­ing meth­ods on a rel­a­tively large set of tourism-related time series. The paper has been accepted for pub­li­ca­tion in the Inter­na­tional Jour­nal of Fore­cast­ing. (When I sub­mit a paper to the IJF it is always han­dled by another edi­tor. In this case, Mike Clements han­dled the paper and it went through sev­eral revi­sions before it was finally accepted. Just to show the process is unbi­ased, I have had a paper rejected by the jour­nal dur­ing the period I have been Editor-in-Chief.)

We are now open­ing up the com­pe­ti­tion to any­one who thinks they can do bet­ter than the best meth­ods we imple­mented in the paper. Meth­ods will be eval­u­ated based on the small­est MASE (Mean Absolute Scaled Error) — see Hyn­d­man & Koehler (2006) for details of this statistic.

To make it interesting, there is a prize. The overall winner will collect $AUD500 and will be invited to contribute a discussion paper to the International Journal of Forecasting describing their methodology and giving their results, provided that the monthly MASE results are better than 1.38, the quarterly results are better than 1.43, or the yearly results are better than 2.28. These thresholds are the results achieved by the best performing methods in the analysis of these data described in Athanasopoulos et al (2010). In other words, the winner has to beat the best results in this paper for at least one of the three sets of series. It will also be necessary that the winner be able to describe their method clearly, in sufficient detail to enable replication and in a form suitable for the International Journal of Forecasting. The paper would appear in the April 2011 issue of the IJF.

The com­pe­ti­tion is being hosted by the inno­v­a­tive folks at kaggle.com. Head over to kaggle.com/tourism1 to get the data and enter the competition.

The com­pe­ti­tion will be in two stages. Stage 1 involves only the annual data — 518 time series. You need to sub­mit fore­casts of the next four obser­va­tions for each series before 20 Sep­tem­ber 2010. Stage 2 will involve the monthly and quar­terly data and will begin after Stage 1 closes.

Good luck!

Academic citations in the popular press

It is very unusual for a news­pa­per arti­cle to cite an aca­d­e­mic paper, unless it is in Nature, Sci­ence or the Lancet. Mostly, what we write is too tech­ni­cal and assumes too much back­ground knowl­edge for it to be acces­si­ble to any­one but spe­cial­ists. So I was pleas­antly sur­prised to find a ref­er­ence to the Inter­na­tional Jour­nal of Fore­cast­ing in a recent Wall Street Jour­nal arti­cle. It is a cita­tion of a 1996 arti­cle, so in terms of sci­en­tific research it is a bit like quot­ing the Magna Carta, but a cita­tion nevertheless.

I once tried to get newspaper coverage of a special issue of the IJF on forecasting the US Presidential election. It was published about four months before the 2008 elections. If anything was going to attract the attention of the popular press, surely this was the topic! Alas, all we managed was a short piece on a research news website although there were copious articles on predicting the election result based on less valid methods.

Even forecasting the recent World Cup didn’t get any serious attention, despite some excellent (albeit unpublished) work over at kaggle.com. Paul the Octopus had tens of thousands of news articles, but the careful statistical modelling at Kaggle had none at all that I could find.

All of which goes to show that news­pa­pers are not good sources of infor­ma­tion about fore­cast­ing (or any­thing else?).

Use fake data and real data

When devel­op­ing new sta­tis­ti­cal meth­ods, it is very use­ful to test them on both fake data (i.e., sim­u­la­tions) and real data.

Test­ing on fake data is use­ful because then you know the “true” answer and can check the pro­ce­dure under ideal con­di­tions. If your method doesn’t work when the data are designed for the task, it is unlikely to work in real con­di­tions. Fake data also enables you to test the robust­ness of your method when the con­di­tions aren’t per­fect — for exam­ple, try adding some nasty out­liers and see if the method still works. With fake data, you can gen­er­ate as many sam­ples as you need, thus ensur­ing that what you see is real (sta­tis­ti­cally sig­nif­i­cant) rather than just an odd example.
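A small simulation along these lines might look like the following. Everything here is an illustrative assumption: the AR(1) process, the simple least-squares estimator of its coefficient, the number and size of the outliers, and the seed. The point is only the workflow: generate many samples from a known process, check the method recovers the truth, then contaminate the data and see what breaks.

```python
import random
import statistics


def simulate_ar1(n, phi, rng, burn_in=100):
    """Simulate y[t] = phi * y[t-1] + e[t] with standard normal errors."""
    y, series = 0.0, []
    for t in range(n + burn_in):
        y = phi * y + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            series.append(y)
    return series


def estimate_phi(series):
    """Least-squares estimate of the AR(1) coefficient (zero mean assumed)."""
    num = sum(a * b for a, b in zip(series[1:], series[:-1]))
    den = sum(b * b for b in series[:-1])
    return num / den


def add_outliers(series, rng, count=3, size=15.0):
    """Return a copy with a few large additive outliers at random positions."""
    contaminated = list(series)
    for index in rng.sample(range(len(series)), count):
        contaminated[index] += size
    return contaminated


rng = random.Random(42)
true_phi = 0.7
clean, dirty = [], []
for _ in range(200):
    series = simulate_ar1(200, true_phi, rng)
    clean.append(estimate_phi(series))
    dirty.append(estimate_phi(add_outliers(series, rng)))

print("true value:", true_phi)
print("mean estimate, clean data:", round(statistics.mean(clean), 3))
print("mean estimate, with outliers:", round(statistics.mean(dirty), 3))
```

Because the generating process is fully specified, anyone can rerun this and check the results.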

A fur­ther advan­tage of fake data is that any­one can repro­duce your work and check (or extend) your results. Some­times real data can­not be dis­trib­uted due to restric­tions imposed by the owner of the data. But there are never restric­tions on fake data. You just have to make sure you explain the data gen­er­at­ing process suf­fi­ciently clearly that other peo­ple can repli­cate what you’ve done.

Test­ing on real data is use­ful because it gives some indi­ca­tion of whether your method will be use­ful in real­ity and not just in theory.

Yeas­min Khan­dakar and I once devel­oped a neat method for select­ing the order of an ARIMA model which worked won­der­fully well on fake data that were gen­er­ated from ARIMA processes, but failed on any real data. The prob­lem seemed to be that it was par­tic­u­larly sen­si­tive to model mis-specification. So when the data had any fea­tures that were not typ­i­cal of ARIMA processes, the method failed. No real data are gen­uinely ARIMA processes, and so the method is not par­tic­u­larly use­ful (and has never been published).

On the other hand, damped expo­nen­tial smooth­ing works bet­ter than you would expect, even on data that come from processes for which damped expo­nen­tial smooth­ing is far from the­o­ret­i­cally opti­mal. In chap­ter 7 of my expo­nen­tial smooth­ing book, we showed (with real data) that using a damped expo­nen­tial smooth­ing model for all series gives results that are almost as good as those obtained after a com­pu­ta­tion­ally inten­sive search for an opti­mal model over the entire model space.

Help for forecasting practitioners

I often get email from fore­cast­ers want­ing assis­tance. As much as I’d like to pro­vide a free fore­cast­ing advice ser­vice to the world, that’s not what I’m paid to do, and I choose to spend my unpaid time on other things. How­ever, there are some very help­ful resources avail­able for fore­cast­ing practitioners.

First, every prac­tic­ing fore­caster should be read­ing Fore­sight. It is far and away the best jour­nal or mag­a­zine for fore­cast prac­ti­tion­ers. Sub­scribe, read it, buy the back issues. You won’t be dis­ap­pointed. Please pass this on to every fore­caster you know.

Next, get on the IIF email list. It is designed for peo­ple to ask ques­tions and there’s usu­ally some­one out there who might be able to answer. You could also try ask­ing your fore­cast­ing ques­tions at ask lokad. While that is pri­mar­ily a sup­port ser­vice for a com­mer­cial fore­cast­ing com­pany, they do say they wel­come any ques­tions on fore­cast­ing. So there’s no harm in trying.

There is also an IIF forum that has been set up but is inac­tive. If every­one started using it, it might be useful.

Attend the Inter­na­tional Fore­cast­ing Sym­po­sium. A mix of aca­d­e­mics and prac­ti­tion­ers attend, and it is a great oppor­tu­nity to find out what oth­ers are doing, and to learn some new tech­niques. The next one is in San Diego in June 2010.

Read the best books. I usu­ally rec­om­mend that prac­ti­tion­ers get hold of the fol­low­ing two books.
  

These books won first and sec­ond prizes, respec­tively, for the best fore­cast­ing books to be writ­ten dur­ing the first 25 years of the IIF. (Yes, I did co-author the sec­ond one so my rec­om­men­da­tion is biased.)

Finally, make sure you are using some decent soft­ware. The major­ity of ques­tions I’m asked are eas­ily solved by just get­ting hold of some good fore­cast­ing soft­ware. The best stand-alone fore­cast­ing pack­age I know of is Fore­cast­Pro. If you must use Excel (ugh), at least get a decent fore­cast­ing add-in such as Peer­Fore­caster. But best of all, learn R and use the fore­cast pack­age. If you are using the fore­cast pack­age for R, I may even be will­ing to pro­vide free help.

How good are economic forecasts?

I wrote last week that “macroeconomic forecasts are little better than shooting blindfolded”. I don’t know if it was connected or not, but on the same day a journalist (Richard Pullin) from Reuters phoned me to ask about assessing some economic forecasts. He wanted to compare the accuracy of several economic forecasts for Japan and he wasn’t sure how to go about it. I helped him to calculate the MASE for the different forecasts and the results have now been published.

Some of these forecasts look pretty good, with MASE values for one-month-ahead forecasts as low as 0.25 for industrial output and 0.29 for CPI. However, industrial output is relatively easy to predict one month ahead because of the availability of industrial input data such as export orders, electricity usage and steel production. And, according to the published article, National CPI in Japan “tends to track Tokyo CPI, which is released a month in advance and forms the basis for forecast numbers.” The other series considered are machinery orders (MASE=0.43) and household spending (MASE=0.78), which are slightly better results than I expected.

It would be inter­est­ing to see MASE fig­ures for other coun­tries, for other series and for longer fore­cast horizons.

For those unfamiliar with the MASE, it is the mean absolute scaled error, introduced in Hyndman and Koehler (IJF, 2006). It divides the mean absolute error (MAE) of a forecast method by the in-sample one-step MAE obtained from the naive method. So for one-step forecasts, the MASE should be less than 1; otherwise the method is no better than the naive method and is useless. The advantage of the MASE is that it can be used in all situations including with non-stationary data, and when the observed values can be zero or negative. There is no other accuracy measure that I know of which is scale-free and can handle both of those situations.
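A minimal sketch of that calculation is given below, assuming non-seasonal data so that the scaling uses the one-step naive forecast. The function name, argument names and numbers are illustrative only.

```python
def mase(train, actual, forecast):
    """Mean absolute scaled error.

    The forecast MAE on the test period is divided by the in-sample MAE of
    one-step naive forecasts (each value predicted by the previous one).
    """
    naive_errors = [abs(b - a) for a, b in zip(train, train[1:])]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is constant, so the scale is zero")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Hypothetical training data, test observations and forecasts.
train = [1, 3, 2, 5]
print(mase(train, actual=[6, 7], forecast=[5, 9]))
```

Note that nothing in the calculation divides by an observed value, which is why zeros and negative observations cause no trouble.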

Why I don’t like statistical tests

It may come as a shock to dis­cover that a sta­tis­ti­cian does not like sta­tis­ti­cal tests. Isn’t that what sta­tis­tics is all about? Unfor­tu­nately, in some dis­ci­plines sta­tis­ti­cal analy­sis does seem to con­sist almost entirely of hypoth­e­sis test­ing, and therein lies the problem.

The standard practice is to construct a hypothesis test to determine if some attribute of the data is “significant” or not, using the conventional p-value threshold of 5%. The analysis is perceived to be complete when the p-value comes in under 5%. However, any non-trivial null hypothesis will be rejected if enough data are collected. As George Box said, “all models are wrong, but some are useful”. So collecting more data will eventually demonstrate that the proposed hypothesis is wrong, but that doesn’t make it useless.
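The effect of sample size can be illustrated with a simple z-test on a mean. The sketch below assumes a known standard deviation and, to keep things deterministic, sets the observed effect equal to a tiny true effect (0.01 standard deviations) and ignores sampling noise. The numbers are hypothetical.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_statistic(effect, sd, n):
    """z statistic for a sample mean equal to `effect` when the null mean is zero."""
    return effect / (sd / math.sqrt(n))


# The same tiny effect looks more and more "significant" as n grows.
for n in (100, 10_000, 1_000_000):
    z = z_statistic(effect=0.01, sd=1.0, n=n)
    print(n, two_sided_p_value(z))
```

The effect is just as unimportant at a million observations as at a hundred; only the p-value has changed.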

Then there is the com­mon con­fu­sion between sta­tis­ti­cally sig­nif­i­cant and prac­ti­cally sig­nif­i­cant. Just because some­thing is sig­nif­i­cant, doesn’t mean it is impor­tant. And just because a p-value is larger than 0.05 does not mean the null hypoth­e­sis is true. Sta­tis­ti­cians learn all this in first year, but still the research lit­er­a­ture is rid­dled with papers that imply otherwise.

The next prob­lem is that p-values are extremely sen­si­tive to collinear­ity. Con­se­quently, to use p-values based on t-tests to deter­mine the sig­nif­i­cance of terms in a regres­sion is silly. Often terms will appear insignif­i­cant, yet they should be included as they improve the pre­dic­tions. Yet this approach is prob­a­bly the most com­mon method for deter­min­ing what vari­ables to include in a regres­sion, even in some stan­dard text­books. The sit­u­a­tion is even worse in autore­gres­sion, where the collinear­ity is often very strong.
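One way to see the size of the problem is the variance inflation factor. In a regression with two predictors whose correlation is r, the variance of each coefficient estimate is 1/(1 - r²) times what it would be with uncorrelated predictors of the same variance, so the t-statistic for a coefficient of a given size shrinks by a factor of sqrt(1 - r²). The sketch below just evaluates these two quantities; the two-predictor setting and the function names are simplifying assumptions.

```python
import math


def variance_inflation_factor(r):
    """VIF for either predictor in a two-predictor regression with correlation r."""
    if not -1 < r < 1:
        raise ValueError("correlation must be strictly between -1 and 1")
    return 1 / (1 - r * r)


def t_statistic_shrinkage(r):
    """Factor by which the t-statistic shrinks relative to uncorrelated predictors."""
    return math.sqrt(1 - r * r)


for r in (0.0, 0.5, 0.9, 0.99):
    print(r, variance_inflation_factor(r), t_statistic_shrinkage(r))
```

With strongly correlated predictors, a term that genuinely improves predictions can easily have a t-statistic well below the usual threshold.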

Another thing I dis­like about sta­tis­ti­cal tests is the alter­na­tive hypoth­e­sis. This was not orig­i­nally part of hypoth­e­sis test­ing as pro­posed by Fisher. It was intro­duced by Ney­man and Pear­son. Frankly, the alter­na­tive hypoth­e­sis is unnec­es­sary. It is not used in the com­pu­ta­tion of p-values or for deter­min­ing sta­tis­ti­cal sig­nif­i­cance. The only prac­ti­cal use for the alter­na­tive hypoth­e­sis that I can see is in deter­min­ing the power of a test.

Finally, I hate one-sided tests even more than two-sided tests. It’s lit­tle bet­ter than cheat­ing. You claim that a para­me­ter can only pos­si­bly move in one direc­tion, and thereby cut your p-value in half. I sus­pect it is usu­ally done to obtain sig­nif­i­cant results in order to increase the chances of pub­li­ca­tion. In real­ity, can we ever be really sure that a para­me­ter can only be zero or positive?
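The halving is easy to see with a normal test statistic. In this sketch the value z = 1.8 is a made-up example, chosen because it is not significant at 5% with a two-sided test but becomes so with a one-sided test.

```python
import math


def upper_tail_p_value(z):
    """One-sided p-value: probability a standard normal exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


z = 1.8  # hypothetical test statistic
print("two-sided:", two_sided_p_value(z))
print("one-sided:", upper_tail_p_value(z))
```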

Now a good sta­tis­ti­cian can avoid all of these errors and use sta­tis­ti­cal tests hon­estly and appro­pri­ately. And I do occa­sion­ally use tests in my papers, hope­fully avoid­ing the above prob­lems. But I strongly pre­fer the pre­dic­tive mod­el­ling approach. That is, if you have two poten­tial mod­els, choose the one that pre­dicts best. Infor­ma­tion cri­te­ria, such as the AIC, are per­fect for this task.
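As a rough illustration of choosing between two models with the AIC, the sketch below uses the Gaussian least-squares form, n log(RSS/n) + 2k, which equals the AIC up to a constant that is the same for every model fitted to the same data. The residual sums of squares and parameter counts are invented for the example.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC (up to an additive constant) for a least-squares fit with Gaussian errors."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


# Hypothetical candidates fitted to the same 100 observations: (RSS, number of parameters).
n_obs = 100
candidates = {"small model": (120.0, 2), "large model": (118.0, 6)}

scores = {name: gaussian_aic(rss, n_obs, k) for name, (rss, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("preferred:", best)
```

Here the larger model fits slightly better, but not by enough to justify four extra parameters, so the smaller model is preferred.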

In fore­cast­ing, the only place in which I find test­ing use­ful is in deter­min­ing the order of inte­gra­tion of a time series; i.e., choos­ing d in an ARIMA(p,d,q) model. If I could come up with some way of doing this effec­tively with­out using a unit-root test, I would gladly do so. But so far, I have not found a reli­able alternative.

For more on this topic, see my work­ing paper with Andrey Kostenko.

Forecasting the recession

Fore­cast­ers are under the pump with a reces­sion that many didn’t see com­ing. As I don’t do any macro­eco­nomic fore­cast­ing, I can sit back and smile smugly at some of my col­leagues while I work on sim­pler prob­lems such as fore­cast­ing in epi­demi­ol­ogy, demog­ra­phy and energy demand.

Some of those col­leagues are cited in the Wall Street Jour­nal today. The fol­low­ing quo­ta­tion is inter­est­ing:

The spate of cloudy crys­tal balls high­lighted an uncom­fort­able real­ity about telling the future: It is hard­est when it is most important.

Initially it sounds profound: just when you need to forecast, the data conspire against you and make it difficult. But in hindsight I don’t think it is like that at all.

When it is easy to fore­cast (e.g., when there is a steady increas­ing trend and lit­tle volatil­ity), no-one is think­ing about the fore­cast­ing because it is obvi­ous what is going to hap­pen. And so fore­cast­ing doesn’t seem impor­tant because it doesn’t get much atten­tion. But when there is a lot of volatil­ity, then peo­ple look to fore­cast­ing for answers, just when it is hard to do it accu­rately. Con­se­quently, it is hard­est when peo­ple are think­ing about it, because they only think about it when it is hard.

That said, macro­eco­nomic fore­cast­ing has a bad name for a good rea­son. Far too many con­fi­dent fore­casts are made with­out dis­cus­sion of the uncer­tainty. If only every fore­caster pro­duced pre­dic­tion inter­vals every time they made a fore­cast, the users would realise that macro­eco­nomic fore­casts are lit­tle bet­ter than shoot­ing blindfolded.
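Producing an interval need not be hard. As a minimal sketch, the code below gives approximate 95% prediction intervals for the naive (random walk) forecast, assuming normally distributed one-step errors with no drift, so that the h-step forecast standard deviation is the one-step standard deviation times sqrt(h). The function name and the data are hypothetical.

```python
import math


def naive_prediction_intervals(history, horizon, z=1.96):
    """Approximate prediction intervals around the naive forecast.

    sigma is estimated as the root mean square of the in-sample one-step
    naive errors; the interval half-width grows with sqrt(h).
    """
    diffs = [b - a for a, b in zip(history, history[1:])]
    sigma = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    last = history[-1]
    return [
        (last - z * sigma * math.sqrt(h), last + z * sigma * math.sqrt(h))
        for h in range(1, horizon + 1)
    ]


# Hypothetical series.
series = [102, 99, 104, 101, 107, 103]
for h, (lower, upper) in enumerate(naive_prediction_intervals(series, 4), start=1):
    print(h, round(lower, 1), round(upper, 1))
```

Even for this simplest of methods the intervals widen quickly with the horizon, which is the uncertainty that point forecasts alone hide.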

See my talk on Fore­cast­ing and the impor­tance of being uncer­tain where I argue for manda­tory pre­dic­tion inter­vals for every point forecast.

Clive Granger (1934–2009)

Sir Clive Granger has died at the age of 74. There are some nice obit­u­ar­ies in the New York Times and the Daily Tele­graph. Also, his Wikipedia page has some good infor­ma­tion. I met Clive on sev­eral occa­sions and he was “a scholar and a gen­tle­man”, a remark­ably hum­ble man given his out­stand­ing achieve­ments and some­one who was always will­ing to help young researchers. The world of fore­cast­ing will miss him.

Prediction markets

Andrew Leigh has a nice piece in today’s AFR on forecasting via prediction markets.
