Statistical modelling and analysis of big data

There is a one-day workshop on this topic on 23 February 2015 at QUT in Brisbane. I will be speaking on “Visualizing and forecasting big time series data”.

OVERVIEW

Big data is now endemic in business, industry, government, environmental management, medical science, social research and so on. One of the commensurate challenges is how to effectively model and analyse these data.

This workshop will bring together national and international experts in statistical modelling and analysis of big data, to share their experiences, approaches and opinions about future directions in this field.

The workshop programme will commence at 8.30am and close at 5pm. Registration is free; however, numbers are strictly limited, so please ensure you register when you receive your invitation via email. Morning and afternoon tea will be provided; participants will need to purchase their own lunch.

Further details will be made available in early January.

New R package for electricity forecasting

Shu Fan and I have developed a model for electricity demand forecasting that is now widely used in Australia for long-term forecasting of peak electricity demand. It has become known as the “Monash Electricity Forecasting Model”. We have decided to release an R package that implements our model so that other people can easily use it. The package is called “MEFM” and is available on github. We will probably also put it on CRAN eventually.
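
For anyone who wants to try it, the package can be installed from github with devtools; this sketch assumes the repository is at robjhyndman/MEFM-package, so check the github page for the current location:

```r
# Install the development version from github (not yet on CRAN).
# install.packages("devtools")  # if devtools is not already installed
devtools::install_github("robjhyndman/MEFM-package")
library(MEFM)
```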

The model was first described in Hyndman and Fan (2010). We are continually improving it, and the latest version is described in the model documentation, which will be updated from time to time.

The package is being released under a GPL licence, so anyone can use it. All we ask is that our work is properly cited.

Naturally, we are not able to provide free technical support, although we welcome bug reports. We are available to undertake paid consulting work in electricity forecasting.


A time series classification contest

Amongst today’s email was one from someone running a private competition to classify time series. Here are the essential details.

The data are measurements from a medical diagnostic machine which takes one measurement every second; after 32–1000 seconds, the time series must be classified into one of two classes. Some pre-classified training data is provided. It is not necessary to classify all the test data, but you do need relatively high accuracy on what is classified. So you could find a subset of more easily classifiable test time series, and leave the rest of the test data unclassified.
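
For concreteness, here is a minimal sketch (my own illustration, not part of the contest) of one way to exploit that rule: summarise each series with a few simple features, fit a logistic regression, and classify only the test series with extreme predicted probabilities. The objects train_series, train_class and test_series are hypothetical.

```r
# Hypothetical data: lists of numeric vectors plus 0/1 training labels.
features <- function(x) {
  c(mean = mean(x), sd = sd(x), trend = cor(x, seq_along(x)))
}
Xtrain <- data.frame(t(sapply(train_series, features)))
fit <- glm(train_class ~ ., data = Xtrain, family = binomial)
Xtest <- data.frame(t(sapply(test_series, features)))
p <- predict(fit, newdata = Xtest, type = "response")
# Classify only the confident cases; leave the rest unclassified (NA).
pred <- ifelse(p > 0.9, 1L, ifelse(p < 0.1, 0L, NA))
```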

Honoring Herman Stekler

The first issue of the IJF for 2015 has just been published, and I’m delighted that it includes a special section honoring Herman Stekler. It includes articles covering a range of his forecasting interests, although not all of them (sports forecasting is missing). Herman himself wrote a paper for it looking at “Forecasting—Yesterday, Today and Tomorrow”.

He is in a unique position to write such a paper as he has been doing forecasting research longer than anyone else on the planet — his first published paper on forecasting appeared in 1959. Herman is now 82 years old, and is still very active in research. Only a couple of months ago, he wrote to me with some new research ideas he had been thinking about, asking me for some feedback. He is also an extraordinarily conscientious and careful associate editor of the IJF and a delight to work with. He is truly “a scholar and a gentleman” and I am very happy that we can honor Herman in this manner. Thanks to Tara Sinclair, Prakash Loungani and Fred Joutz for putting this tribute together.

We also published an interview with Herman in the IJF in 2010, which contains some information about his early years, graduate education and first academic jobs.

Prediction competitions

Competitions have a long history in forecasting and prediction, and have been instrumental in forcing research attention on methods that work well in practice. In the forecasting community, the M competition and M3 competition have been particularly influential. The data mining community have the annual KDD Cup, which has generated attention on a wide range of prediction problems and associated methods. Recent KDD Cups are hosted on Kaggle.

In my research group meeting today, we discussed our (limited) experiences in competing in some Kaggle competitions, and we reviewed the following two papers which describe two prediction competitions:

  1. Athanasopoulos and Hyndman (IJF 2011). The value of feedback in forecasting competitions. [preprint version]
  2. Roy et al. (2013). The Microsoft Academic Search Dataset and KDD Cup 2013.


Visualization of probabilistic forecasts

This week my research group discussed Adrian Raftery’s recent paper on “Use and Communication of Probabilistic Forecasts”, which provides a fascinating but brief survey of some of his work on modelling and communicating uncertain futures. Coincidentally, today I was also sent a copy of David Spiegelhalter’s paper on “Visualizing Uncertainty About the Future”. Both are well worth reading.

It made me think about my own efforts to communicate future uncertainty through graphics. Of course, for time series forecasts I normally show prediction intervals. I prefer to use more than one interval at a time because it helps convey a little more information. The default in the forecast package for R is to show both an 80% and a 95% interval, like this:
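
For example, a minimal sketch using a built-in dataset in place of the original figure:

```r
library(forecast)
fit <- ets(USAccDeaths)
plot(forecast(fit, h = 24))  # level = c(80, 95) is the default
# More intervals can be requested explicitly:
# plot(forecast(fit, h = 24, level = c(50, 80, 95)))
```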

IJF review papers

Review papers are extremely useful for new researchers such as PhD students, or when you want to learn about a new research field. The International Journal of Forecasting produced a whole review issue in 2006, and it contains some of the most highly cited papers we have ever published. Now, beginning with the latest issue of the journal, we have started publishing occasional review articles on selected areas of forecasting. The first two articles are:

  1. Electricity price forecasting: A review of the state-of-the-art with a look into the future by Rafał Weron.
  2. The challenges of pre-launch forecasting of adoption time series for new durable products by Paul Goodwin, Sheik Meeran, and Karima Dyussekeneva.

Both tackle very important topics in forecasting. Weron’s paper contains a comprehensive survey of work on electricity price forecasting, coherently bringing together a large body of diverse research — at 50 pages, I think it is the longest paper I have ever approved. Goodwin, Meeran and Dyussekeneva review research on new product forecasting, a problem faced by every company that produces goods or services: when there are no historical data available, how do you forecast the sales of your product?

We have a few other review papers in progress, so keep an eye out for them in future issues.


Seasonal periods

I get questions about this almost every week. Here is an example from a recent comment on this blog:

I have two large time series data. One is separated by seconds intervals and the other by minutes. The length of each time series is 180 days. I’m using R (3.1.1) for forecasting the data. I’d like to know the value of the “frequency” argument in the ts() function in R, for each data set. Since most of the examples and cases I’ve seen so far are for months or days at the most, it is quite confusing for me when dealing with equally separated seconds or minutes. According to my understanding, the “frequency” argument is the number of observations per season. So what is the “season” in the case of seconds/minutes? My guess is that since there are 86,400 seconds and 1440 minutes a day, these should be the values for the “freq” argument. Is that correct?

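The commenter’s guess is right for a daily pattern: the frequency argument is the number of observations per seasonal cycle, so 86,400 for secondly data and 1440 for minutely data when the season is one day. A minimal sketch (minute_data and second_data are hypothetical vectors):

```r
# Minutely data with a daily seasonal pattern: 1440 observations per day.
x.minutes <- ts(minute_data, frequency = 1440)
# Secondly data with a daily seasonal pattern: 86400 observations per day.
x.seconds <- ts(second_data, frequency = 86400)
# With 180 days of data there may also be a weekly pattern;
# msts() from the forecast package handles multiple seasonal periods.
library(forecast)
y <- msts(minute_data, seasonal.periods = c(1440, 10080))
```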

Prediction intervals too narrow

Almost all prediction intervals from time series models are too narrow. This is a well-known phenomenon and arises because they do not account for all sources of uncertainty. In my 2002 IJF paper, we measured the size of the problem by computing the actual coverage percentage of the prediction intervals on hold-out samples. We found that for ETS models, nominal 95% intervals may only provide coverage between 71% and 87%. The difference is due to missing sources of uncertainty.

There are at least four sources of uncertainty in forecasting using time series models:

  1. The random error term;
  2. The parameter estimates;
  3. The choice of model for the historical data;
  4. The continuation of the historical data generating process into the future.
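
As a rough sketch of the coverage calculation described above (simplified to a single training/test split, rather than the more careful scheme of the paper):

```r
library(forecast)
# Empirical coverage of nominal 95% ETS intervals on a hold-out sample.
coverage <- function(y, h = 12, level = 95) {
  n <- length(y)
  train <- window(y, end = time(y)[n - h])
  test  <- window(y, start = time(y)[n - h + 1])
  fc <- forecast(ets(train), h = h, level = level)
  mean(test >= fc$lower[, 1] & test <= fc$upper[, 1])
}
coverage(USAccDeaths)  # often well below the nominal 0.95
```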
