Statistical modelling and analysis of big data

I’m currently attending the one-day workshop on this topic at QUT in Brisbane. This morning I spoke on “Visualizing and forecasting big time series data”. My slides are here.

The talks are being streamed.

OVERVIEW

Big data is now endemic in business, industry, government, environmental management, medical science, social research and so on. One of the commensurate challenges is how to effectively model and analyse these data.

This workshop will bring together national and international experts in statistical modelling and analysis of big data, to share their experiences, approaches and opinions about future directions in this field.

IASC Data Analysis Competition 2015

The International Association for Statistical Computing (IASC) is holding a Data Analysis Competition. Winners will be invited to present their work at the Joint Meeting of IASC-ABE Satellite Conference for the 60th ISI WSC 2015 to be held at Atlântico Búzios Convention & Resort in Búzios, RJ, Brazil (August 2–4, 2015). They will also be invited to submit a manuscript for possible publication (following peer review) to IASC’s official journal, Computational Statistics & Data Analysis.

RSS feeds for statistics and related journals

I’ve now resurrected the collection of research journals that I follow, and set it up as a shared collection in feedly. So anyone can easily subscribe to all of the same journals, or select a subset of them, to follow on feedly.

Prediction competitions

Competitions have a long history in forecasting and prediction, and have been instrumental in focusing research attention on methods that work well in practice. In the forecasting community, the M competition and M3 competition have been particularly influential. The data mining community has the annual KDD Cup, which has drawn attention to a wide range of prediction problems and associated methods. Recent KDD Cups are hosted on Kaggle.

In my research group meeting today, we discussed our (limited) experiences in competing in some Kaggle competitions, and we reviewed the following two papers, each of which describes a prediction competition:

  1. Athanasopoulos and Hyndman (IJF 2011). The value of feedback in forecasting competitions. [preprint version]
  2. Roy et al. (2013). The Microsoft Academic Search Dataset and KDD Cup 2013.


Visualization of probabilistic forecasts

This week my research group discussed Adrian Raftery’s recent paper on “Use and Communication of Probabilistic Forecasts”, which provides a fascinating but brief survey of some of his work on modelling and communicating uncertain futures. Coincidentally, today I was also sent a copy of David Spiegelhalter’s paper on “Visualizing Uncertainty About the Future”. Both are well worth reading.

It made me think about my own efforts to communicate future uncertainty through graphics. Of course, for time series forecasts I normally show prediction intervals. I prefer to use more than one interval at a time because it helps convey a little more information. The default in the forecast package for R is to show both an 80% and a 95% interval.
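To make the idea of two nested intervals concrete, here is a minimal sketch in Python. It is not how the forecast package computes its intervals: it simply assumes a normal forecast distribution, and the point forecast and standard deviation are made-up illustrative numbers. The function name `prediction_interval` is hypothetical.

```python
from statistics import NormalDist


def prediction_interval(point, sd, level):
    """Two-sided interval, assuming a normal forecast distribution."""
    # For an 80% interval we need the 90th percentile, for 95% the 97.5th.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return point - z * sd, point + z * sd


# Illustrative values only, not taken from any real forecast.
point_forecast, forecast_sd = 100.0, 10.0

intervals = {
    level: prediction_interval(point_forecast, forecast_sd, level)
    for level in (0.80, 0.95)
}

for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} interval: {lower:.1f} to {upper:.1f}")
```

The 80% interval always sits inside the 95% interval, so showing both gives a rough sense of how quickly the probability thins out away from the point forecast.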

Seasonal periods

I get ques­tions about this almost every week. Here is an exam­ple from a recent com­ment on this blog:

I have two large time series data. One is separated by seconds intervals and the other by minutes. The length of each time series is 180 days. I’m using R (3.1.1) for forecasting the data. I’d like to know the value of the “frequency” argument in the ts() function in R, for each data set. Since most of the examples and cases I’ve seen so far are for months or days at the most, it is quite confusing for me when dealing with equally separated seconds or minutes. According to my understanding, the “frequency” argument is the number of observations per season. So what is the “season” in the case of seconds/minutes? My guess is that since there are 86,400 seconds and 1440 minutes a day, these should be the values for the “freq” argument. Is that correct?
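The arithmetic in that comment is easy to check. Here is a small sketch in Python that counts the observations per hour, day and week for a regularly sampled series. The helper `candidate_periods` is a hypothetical name, and the choice of hour, day and week as candidate cycles is my assumption. Which of these is appropriate depends on which cycles are actually present in the data; the code only does the counting.

```python
# Length of each candidate cycle, in seconds.
SECONDS_PER_CYCLE = {"hour": 3_600, "day": 86_400, "week": 604_800}


def candidate_periods(interval_seconds):
    """Observations per cycle for data sampled every `interval_seconds` seconds.

    Cycles that are not a whole number of sampling intervals are skipped.
    """
    return {
        name: seconds // interval_seconds
        for name, seconds in SECONDS_PER_CYCLE.items()
        if seconds % interval_seconds == 0
    }


per_second = candidate_periods(1)    # data observed every second
per_minute = candidate_periods(60)   # data observed every minute

print(per_second)
print(per_minute)
```

The daily values match the 86,400 and 1440 guessed in the comment. Any one of these counts could be passed as the “frequency” argument of ts() if the corresponding cycle is the one of interest.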


ABS seasonal adjustment update

Since my last post on the seasonal adjustment problems at the Australian Bureau of Statistics, I’ve been working closely with people within the ABS to help them resolve the problems in time for tomorrow’s release of the October unemployment figures.

Now that the ABS has put out a statement about the problem, I thought it would be useful to explain the underlying methodology for those who are interested.