Specifying complicated groups of time series in hts

With the latest version of the hts package for R, it is now possible to specify rather complicated grouping structures relatively easily.

All aggregation structures can be represented as hierarchies or as cross-products of hierarchies. For example, a hierarchical time series may be based on geography: country, state, region, store. Often there is also a separate product hierarchy: product groups, product types, packet size. Forecasts of all the different types of aggregation are required; e.g., product type A within region X. The aggregation structure is a cross-product of the two hierarchies.

This framework includes even apparently non-hierarchical data: consider the simple case of a time series of deaths split by sex and state. We can consider sex and state as two very simple hierarchies with only one level each. Then we wish to forecast the aggregates of all combinations of the two hierarchies.

Any number of separate hierarchies can be combined in this way. Non-hierarchical factors such as sex can be treated as single-level hierarchies.
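To make the sex-by-state example concrete, here is a minimal base-R sketch that builds every aggregate in that grouping structure by hand. It does not use the hts interface; the toy counts, the state labels and the helper `indicator()` are all made up for illustration.

```r
# Hypothetical toy data: 4 bottom-level monthly series (sex x state).
set.seed(1)
sex   <- c("F", "F", "M", "M")
state <- c("NSW", "VIC", "NSW", "VIC")
bottom <- matrix(rpois(24 * 4, lambda = 50), ncol = 4,
                 dimnames = list(NULL, paste(sex, state, sep = "_")))

# Hypothetical helper: one 0/1 row per level of a grouping factor,
# marking which bottom-level series belong to that level.
indicator <- function(f) {
  m <- t(sapply(unique(f), function(level) as.numeric(f == level)))
  rownames(m) <- unique(f)
  m
}

# Summing matrix: total, each sex, each state, then the bottom level itself.
S <- rbind(Total = rep(1, 4), indicator(sex), indicator(state), diag(4))
rownames(S)[6:9] <- colnames(bottom)

# Every series in the grouping structure (24 months x 9 series).
all_series <- bottom %*% t(S)
head(all_series, 3)
```

Each column of `all_series` is one series to forecast: the total, one per sex, one per state, and the four sex-by-state combinations.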

Interview for the Capital of Statistics

Earo Wang recently interviewed me for the Chinese website Capital of Statistics. The English transcript of the interview is on Earo’s personal website.

This is the third interview I’ve done in the last 18 months. The others were for:


Forecasting annual totals from monthly data

This question was posed on crossvalidated.com:

I have a monthly time series (for 2009–2012 non-stationary, with seasonality). I can use ARIMA (or ETS) to obtain point and interval forecasts for each month of 2013, but I am interested in forecasting the total for the whole year, including prediction intervals. Is there an easy way in R to obtain interval forecasts for the total for 2013?

I’ve come across this problem before in my consulting work, although I don’t think I’ve ever published my solution. So here it is.
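One possible approach, sketched here under stated assumptions and not necessarily the solution referred to above, is to model the rolling 12-month sum of the series. The 12-step-ahead forecast of that sum is then the total for the next calendar year, and its standard error gives an approximate interval. The built-in `USAccDeaths` series stands in for the questioner's data, and the ARIMA orders are placeholders rather than a recommendation.

```r
# Built-in monthly series (1973-1978) used as a stand-in for real data.
x <- USAccDeaths

# Rolling 12-month totals: y[t] = x[t] + x[t-1] + ... + x[t-11].
y <- stats::filter(x, rep(1, 12), sides = 1)
y <- window(y, start = c(1973, 12))  # drop the 11 leading NAs

# Illustrative model orders only; select them properly for real data.
fit <- arima(y, order = c(0, 2, 1),
             seasonal = list(order = c(0, 0, 1), period = 12))

# y ends in December, so the 12-step-ahead forecast of y
# is the forecast total for the following calendar year.
fc <- predict(fit, n.ahead = 12)
total <- as.numeric(fc$pred[12])
se <- as.numeric(fc$se[12])

# Approximate 95% prediction interval, assuming normal forecast errors.
c(lower = total - 1.96 * se, point = total, upper = total + 1.96 * se)
```

The interval relies on the usual normal approximation; simulating future sample paths from a fitted monthly model and summing each path would be an alternative when that assumption is doubtful.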

Seeking help

Every day I receive emails, or comments on this blog, asking for help with R, forecasting, LaTeX, possible research topics, how to install software, or some other thing I’m supposed to know something about. Unfortunately, I cannot provide a one-man help service to the rest of the world. I used to reply to all the requests explaining where to go for help, but I stopped replying a while ago as it took too much time to do even that.

If you want help, please ask at either stats.stackexchange.com (for R or statistics questions) or tex.stackexchange.com (for LaTeX questions).

Unless you are one of my students, the only questions I will answer are ones that concern my R packages or research papers. And even then, I won’t reply if the answer is in the help files. I write those help files for a reason, so please read them.

I’m sorry I can’t do more, but if I did everything people ask me to do, I’d never write any papers or produce any R packages, and I think that’s a better use of my time.

Academia StackExchange

There’s a new StackExchange site that might be useful to readers: Academia. It is a Q&A site for academics and those enrolled in higher education.

The draft FAQ says it will cover:

  • Life as a graduate student, postdoctoral researcher, university professor
  • Transitioning from undergraduate to graduate researcher
  • Inner workings of research departments
  • Requirements and expectations of academicians

Judging from the first 89 questions, this is going to be extremely helpful, especially for PhD students.


CrossValidated Journal Club

Journal Clubs are a great way to learn new research ideas and to keep up with the literature. The idea is that a group of people get together every week or so to discuss a paper of joint interest. This can happen within your own research group or department, or virtually online.

There is now a virtual journal club operating in conjunction with CrossValidated.com. The first paper discussed was on text data mining. It appears that the next paper may be on collaborative filtering.

The emphasis is on Open Access papers, preferably with associated software that is freely available. Some of the discussion tends to centre on how to implement the ideas in R.

For those of us in Australia, the timing is tricky. The first discussion took place at 3am local time!

If you can’t make the CrossValidated Journal Club chats, why not start your own local club?

CrossValidated launched!

The CrossValidated Q&A site is now out of beta, and the new design and site name are live.

New design

The new design looks great, thanks to Jin Yang, our designer-in-residence. Note the normal density icon for accepted answers and the site icon depicting a 5-fold cross-validation (light green for the test set and dark green for the training set). There is a faint background graphic in the header and footer from a program that tracks and plots a person’s mouse movement. This gives the suggestion of randomness as well as the idea of data visualization (another topic covered on the site).

Name and URL

The URL crossvalidated.com will work, but redirects to stats.stackexchange.com. The StackExchange team (who host the site and provide all the architecture) wanted the site to be a subdomain of stackexchange.com. However, at least we got the name CrossValidated.


The site is intended for use by statisticians, data miners, and anyone else doing data analysis. It covers questions about

  • statistical analysis
  • data mining and machine learning
  • data visualization
  • probability theory
  • statistical and data-driven computing (e.g., questions about R, SAS, SPSS, Stata and Minitab)

The inclusion of data mining and machine learning along with statistics and probability was a deliberate attempt to get these two communities to talk. We work on similar problems, but often with different tools and different perspectives. I hope the site comes to be widely used within both communities. In fact, I hope that we can eventually stop talking about two communities and just refer to the “data science community”.

My original idea was that this would be helpful to researchers who are struggling with data analysis issues but have no statistician to ask for help. University-based statisticians are often inundated with requests for help from researchers in other disciplines who have no quantitative training but need to apply some statistical techniques.


For those who haven’t been reading this blog, I proposed this site on 15 April 2010. The scope of the site was determined via a community process, then we went through a phase of building a sufficient community. The beta site was launched on 19 July 2010 with the first question on “Eliciting priors from experts”.

The site was officially launched today (5 November 2010). So it took just over 200 days from proposal to launch. I had no idea what I was starting, but I’m glad it worked out! There are now 1048 questions and 1763 users, which is a great start. But there must be hundreds of thousands of people doing data analysis who would really benefit from a site like this. So please spread the word about CrossValidated.com.

How to avoid annoying a referee

It’s not a good idea to annoy the referees of your paper. They make recommendations to the editor about your work and it is best to keep them happy. There is an interesting discussion on stats.stackexchange.com on this subject. This inspired my own list below.

  • Explain what you’ve done clearly, avoiding unnecessary jargon.
  • Don’t claim your paper contributes more than it actually does. (I refereed a paper this week where the author claimed to have invented principal component analysis!)
  • Ensure all figures have clear captions and labels.
  • Include citations to the referee’s own work. Obviously you don’t know who is going to referee your paper, but you should aim to cite the main work in the area. It places your work in context, and keeps the referees happy if they are the authors.
  • Make sure the cited papers say what you think they say. Sight what you cite!
  • Include proper citations for all software packages. If you are unsure how to cite an R package, try the command citation("packagename").
  • Never plagiarise from other papers, not even sentence fragments. Use your own words. I’ve refereed a thesis which had slabs taken from my own lecture notes including the typos.
  • Don’t plagiarise from your own papers. Either reference your earlier work, or provide a summary in new words.
  • Provide enough detail so your work can be replicated. Where possible, provide the data and code. Make sure the code works.
  • When responding to referee reports, make sure you answer everything asked of you. (See my earlier post “Always listen to reviewers”.)
  • If you’ve revised the paper based on referees’ comments, then thank them in the acknowledgements section.
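As a small illustration of the software-citation point in the list above, the following sketch retrieves the citation for R itself and converts it to BibTeX. Calling `citation()` with no argument is used here so the example runs without any add-on packages; for a contributed package you would pass its name instead.

```r
# Citation for R itself; pass a package name to cite that package instead.
cit <- citation()
print(cit)

# BibTeX version of the same entry, ready to paste into a .bib file.
bib <- toBibtex(cit)
bib
```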

For some applied papers, there are specific statistical issues that need attention:

  • Give effect sizes with confidence intervals, not just p-values.
  • Don’t describe data using the mean and standard deviation without indicating whether the data were more-or-less symmetric and unimodal.
  • Don’t split continuous data into groups.
  • Make sure your data satisfy the assumptions of the statistical methods used.
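The first point in this list can be sketched in a few lines of base R. The two samples below are simulated purely for illustration (the group names, sizes and means are assumptions), and the effect size reported is a simple difference in means with its 95% confidence interval alongside the p-value.

```r
# Simulated data for two hypothetical groups.
set.seed(42)
control   <- rnorm(30, mean = 10, sd = 2)
treatment <- rnorm(30, mean = 11, sd = 2)

# Welch two-sample t-test; conf.int is for mean(treatment) - mean(control).
tt <- t.test(treatment, control)
effect <- unname(tt$estimate[1] - tt$estimate[2])
ci <- as.numeric(tt$conf.int)

# Report the effect size and its interval, not only the p-value.
cat(sprintf("Mean difference: %.2f (95%% CI %.2f to %.2f), p = %.3f\n",
            effect, ci[1], ci[2], tt$p.value))
```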

More tongue-in-cheek advice is provided by Stratton and Neil (2005), “How to ensure your paper is rejected by the statistical reviewer”. Diabetic Medicine, 22(4), 371–373.

Feel free to add your own suggestions over at stats.stackexchange.com.

Happy World Statistics Day!

The United Nations has declared today “World Statistics Day”. I’ve no idea what that means, or why we need a WSD. Perhaps it is because the date is 20.10.2010 (except in North America where it is 10.20.2010). But then, what happens from 2013 to 2099? And do we just forget the whole idea after 3112?

In any case, if we are going to have a WSD, let’s use it to do something useful. Patrick Burns has some ideas over at Portfolio Probe. Here are some of my own:

  • Learn R. The time has come when it is not really possible to be a well-informed applied statistician if you are not a regular R user.
  • Get involved on Stats.StackExchange.com. It’s only 3 months old, but we already have over 1600 users and it has quickly become the best place to ask and answer questions about statistics, data mining, data visualization and everything else to do with analysing data.
  • If you’re in London, head to the RSS getstats launch, held appropriately at 20:10 on 20.10.2010.
  • Learn some new statistical techniques. If you have never used the bootstrap, EM algorithm, mixed models or the Kalman filter, now is a great day to start.
  • Stop using hypothesis tests and p-values! Instead, use confidence intervals and the AIC.
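For anyone wondering what the last suggestion looks like in practice, here is a minimal sketch using the built-in `cars` data. The two candidate models are arbitrary choices made for illustration; the point is only that models are compared by AIC and the estimate is reported with an interval.

```r
# Two candidate models for stopping distance (illustrative choices only).
fit_linear    <- lm(dist ~ speed, data = cars)
fit_quadratic <- lm(dist ~ poly(speed, 2), data = cars)

# Compare the models by AIC (smaller is better).
aic_table <- AIC(fit_linear, fit_quadratic)
aic_table

# 95% confidence interval for the slope in the linear model.
slope_ci <- confint(fit_linear)["speed", ]
slope_ci
```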

Feel free to add your own ideas in the comments … unless you’re too busy celebrating this auspicious occasion.