Statistical modelling and analysis of big data

There is a one-day workshop on this topic on 23 February 2015 at QUT in Brisbane. I will be speaking on “Visualizing and forecasting big time series data”.

OVERVIEW

Big data is now endemic in business, industry, government, environmental management, medical science, social research and so on. One of the commensurate challenges is how to effectively model and analyse these data.

This workshop will bring together national and international experts in statistical modelling and analysis of big data, to share their experiences, approaches and opinions about future directions in this field.

The workshop programme will commence at 8.30am and close at 5pm. Registration is free; however, numbers are strictly limited, so please ensure you register when you receive your invitation via email. Morning and afternoon tea will be provided; participants will need to purchase their own lunch.

Further details will be made available in early January. Continue reading →

Prediction competitions

Competitions have a long history in forecasting and prediction, and have been instrumental in forcing research attention on methods that work well in practice. In the forecasting community, the M competition and M3 competition have been particularly influential. The data mining community has the annual KDD Cup, which has generated attention on a wide range of prediction problems and associated methods. Recent KDD Cups are hosted on Kaggle.

In my research group meeting today, we discussed our (limited) experiences in competing in some Kaggle competitions, and we reviewed the following two papers which describe two prediction competitions:

  1. Athanasopoulos and Hyndman (IJF 2011). The value of feedback in forecasting competitions. [preprint version]
  2. Roy et al. (2013). The Microsoft Academic Search Dataset and KDD Cup 2013.

Continue reading →

Visualization of probabilistic forecasts

This week my research group discussed Adrian Raftery’s recent paper on “Use and Communication of Probabilistic Forecasts”, which provides a fascinating but brief survey of some of his work on modelling and communicating uncertain futures. Coincidentally, today I was also sent a copy of David Spiegelhalter’s paper on “Visualizing Uncertainty About the Future”. Both are well worth reading.

It made me think about my own efforts to communicate future uncertainty through graphics. Of course, for time series forecasts I normally show prediction intervals. I prefer to use more than one interval at a time because it helps convey a little more information. The default in the forecast package for R is to show both an 80% and a 95% interval like this: Continue reading →
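For anyone wanting to reproduce that kind of plot, here is a minimal sketch using the forecast package; the series and model choice (USAccDeaths with an ETS model) are placeholders, not the example from the post.

```r
library(forecast)

# Fit an ETS model to a built-in monthly series and plot the forecasts
# with the default 80% and 95% prediction intervals shaded.
fit <- ets(USAccDeaths)
fc <- forecast(fit, h = 24, level = c(80, 95))
plot(fc)
```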

Seasonal periods

I get questions about this almost every week. Here is an example from a recent comment on this blog:

I have two large time series data. One is separated by seconds intervals and the other by minutes. The length of each time series is 180 days. I’m using R (3.1.1) for forecasting the data. I’d like to know the value of the “frequency” argument in the ts() function in R, for each data set. Since most of the examples and cases I’ve seen so far are for months or days at the most, it is quite confusing for me when dealing with equally separated seconds or minutes. According to my understanding, the “frequency” argument is the number of observations per season. So what is the “season” in the case of seconds/​minutes? My guess is that since there are 86,400 seconds and 1440 minutes a day, these should be the values for the “freq” argument. Is that correct?
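As a sketch of the guess made in the question, treating the day as the season gives the frequencies below; the data vectors are hypothetical.

```r
# Assuming a daily seasonal pattern, frequency = observations per day.
# 'secondly_data' and 'minutely_data' are hypothetical numeric vectors.
x_sec <- ts(secondly_data, frequency = 60 * 60 * 24)  # 86,400 seconds per day
x_min <- ts(minutely_data, frequency = 60 * 24)       # 1,440 minutes per day
```

If more than one seasonal pattern matters (say daily and weekly), the msts() function in the forecast package accepts several seasonal periods at once.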

Continue reading →

ABS seasonal adjustment update

Since my last post on the seasonal adjustment problems at the Australian Bureau of Statistics, I’ve been working closely with people within the ABS to help them resolve the problems in time for tomorrow’s release of the October unemployment figures.

Now that the ABS has put out a statement about the problem, I thought it would be useful to explain the underlying methodology for those who are interested. Continue reading →

Prediction intervals too narrow

Almost all prediction intervals from time series models are too narrow. This is a well-known phenomenon and arises because they do not account for all sources of uncertainty. In my 2002 IJF paper, we measured the size of the problem by computing the actual coverage percentage of the prediction intervals on hold-out samples. We found that for ETS models, nominal 95% intervals may only provide coverage between 71% and 87%. The difference is due to missing sources of uncertainty.
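A rough sketch of that kind of coverage calculation for a single series is below; the 2002 paper did this across many series, and the train/test split and series here are arbitrary.

```r
library(forecast)

# Hold out the last two years of a monthly series and check how often the
# held-out observations fall inside the nominal 95% ETS prediction interval.
train <- window(USAccDeaths, end = c(1976, 12))
test  <- window(USAccDeaths, start = c(1977, 1))
fc <- forecast(ets(train), h = length(test), level = 95)
coverage <- mean(test >= fc$lower & test <= fc$upper)
coverage  # empirical coverage of the nominal 95% interval
```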

There are at least four sources of uncertainty in forecasting using time series models:

  1. The random error term;
  2. The parameter estimates;
  3. The choice of model for the historical data;
  4. The continuation of the historical data generating process into the future.

Continue reading →

hts with regressors

The hts package for R allows for forecasting hierarchical and grouped time series data. The idea is to generate forecasts for all series at all levels of aggregation without imposing the aggregation constraints, and then to reconcile the forecasts so they satisfy the aggregation constraints. (An introduction to reconciling hierarchical and grouped time series is available in this Foresight paper.)

The base forecasts can be generated using any method, with ETS models and ARIMA models provided as options in the forecast.gts() function. As ETS models do not allow for regressors, you will need to choose ARIMA models if you want to include regressors. Continue reading →
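As a minimal sketch of the setup being described: the hierarchy, series and regressor below are all made up, and passing the regressors via xreg/newxreg arguments to forecast() is an assumption about how they reach the underlying ARIMA fitting, not a statement of the post’s method.

```r
library(hts)

# A made-up hierarchy: 5 bottom-level monthly series grouped into 2 nodes (3 + 2).
set.seed(1)
bts <- ts(matrix(rnorm(60 * 5), ncol = 5), frequency = 12)
y <- hts(bts, nodes = list(2, c(3, 2)))

xreg    <- matrix(rnorm(60), ncol = 1)  # historical regressor values
newxreg <- matrix(rnorm(12), ncol = 1)  # future values over the forecast horizon

# ARIMA base forecasts are needed when regressors are involved; xreg/newxreg
# are assumed here to be passed through to the ARIMA modelling of each series.
fc <- forecast(y, h = 12, fmethod = "arima", xreg = xreg, newxreg = newxreg)
```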

Explaining the ABS unemployment fluctuations

Although the Guardian claimed yesterday that I had explained “what went wrong” in the July and August unemployment figures, I made no attempt to do so as I had no information about the problems. Instead, I just explained a little about the purpose of seasonal adjustment.

However, today I learned a little more about the ABS unemployment data problems, including what may be the explanation for the fluctuations. This explanation was offered by Westpac’s chief economist, Bill Evans (see here for a video of him explaining the issue). Continue reading →