With the latest version of the hts package for R, it is now possible to specify rather complicated grouping structures relatively easily.
All aggregation structures can be represented as hierarchies or as cross-products of hierarchies. For example, a hierarchical time series may be based on geography: country, state, region, store. Often there is also a separate product hierarchy: product groups, product types, packet size. Forecasts of all the different types of aggregation are required; e.g., product type A within region X. The aggregation structure is a cross-product of the two hierarchies.
This framework includes even apparently non-hierarchical data: consider the simple case of a time series of deaths split by sex and state. We can consider sex and state as two very simple hierarchies with only one level each. Then we wish to forecast the aggregates of all combinations of the two hierarchies.
Any number of separate hierarchies can be combined in this way. Non-hierarchical factors such as sex can be treated as single-level hierarchies. Continue reading →
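To make the sex-by-state example concrete, here is a minimal sketch in base R. It does not use the hts package itself, the data are made up, and the helper `sum_by` is a hypothetical name introduced only for this illustration. It builds every aggregate implied by crossing two single-level hierarchies: the grand total, the totals by sex, the totals by state, and the bottom-level series.

```r
# Toy bottom-level data: 6 series (2 sexes x 3 states) over 4 time periods.
# All values are simulated purely for illustration.
set.seed(1)
sex   <- rep(c("F", "M"), each = 3)
state <- rep(c("NSW", "VIC", "QLD"), times = 2)
bottom <- matrix(rpois(4 * 6, lambda = 50), nrow = 4,
                 dimnames = list(NULL, paste(sex, state, sep = "_")))

# Hypothetical helper: sum the columns of `bottom` that share a level of `g`.
sum_by <- function(g) t(rowsum(t(bottom), group = g))

# All series in the grouped structure: 1 total + 2 sexes + 3 states + 6 bottom.
all_series <- cbind(Total = rowSums(bottom), sum_by(sex), sum_by(state), bottom)
print(colnames(all_series))
```

Each extra factor simply adds another call to `sum_by`, plus further calls for any interactions that need forecasting. The hts package has its own functions for specifying such structures; see its help files for the actual interface.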
Earo Wang recently interviewed me for the Chinese website Capital of Statistics. The English transcript of the interview is on Earo’s personal website.
This is the third interview I’ve done in the last 18 months. The others were for:
This question was posed on crossvalidated.com:
I have a monthly time series (for 2009-2012 non-stationary, with seasonality). I can use ARIMA (or ETS) to obtain point and interval forecasts for each month of 2013, but I am interested in forecasting the total for the whole year, including prediction intervals. Is there an easy way in R to obtain interval forecasts for the total for 2013?
I’ve come across this problem before in my consulting work, although I don’t think I’ve ever published my solution. So here it is. Continue reading →
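One possible approach, sketched here in base R, is to forecast a rolling 12-month total rather than the monthly values. This is not necessarily the solution the full post describes. The data set (`USAccDeaths`, which ships with R) and the ARIMA(1,1,0) order are assumptions chosen only for illustration, and the interval relies on approximately normal forecast errors.

```r
# Monthly series shipped with base R (US accidental deaths, 1973-1978).
x <- USAccDeaths

# Rolling total of the most recent 12 months; the first 11 values are NA.
y <- na.omit(stats::filter(x, rep(1, 12), sides = 1))

# Assumed model for the rolling totals, chosen only for illustration.
# Note that diff(y) equals the seasonal difference x[t] - x[t-12].
fit <- arima(y, order = c(1, 1, 0))

# The 12-step-ahead forecast of the rolling total is the total of the
# next 12 months, i.e. the following calendar year for this series.
fc <- predict(fit, n.ahead = 12)
point <- fc$pred[12]
interval <- point + c(-1, 1) * qnorm(0.975) * fc$se[12]

print(c(lower = interval[1], forecast = point, upper = interval[2]))
```

Because the interval comes from a model fitted to the totals directly, it accounts for the correlation between the monthly forecast errors, which simply adding up twelve monthly intervals would not.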
Every day I receive emails, or comments on this blog, asking for help with R, forecasting, LaTeX, possible research topics, how to install software, or some other thing I’m supposed to know something about. Unfortunately, I cannot provide a one-man help service to the rest of the world. I used to reply to all the requests explaining where to go for help, but I stopped replying a while ago as it took too much time to do even that.
If you want help, please ask at either stats.stackexchange.com (for R or statistics questions) or tex.stackexchange.com (for LaTeX questions).
Unless you are one of my students, the only questions I will answer are ones that concern my R packages or research papers. And even then, I won’t reply if the answer is in the help files. I write those help files for a reason, so please read them.
I’m sorry I can’t do more, but if I did everything people ask me to do, I’d never write any papers or produce any R packages, and I think that’s a better use of my time.
There’s a new StackExchange site that might be useful to readers: Academia. It is a Q&A site for academics and those enrolled in higher education.
The draft FAQ says it will cover:
- Life as a graduate student, postdoctoral researcher, university professor
- Transitioning from undergraduate to graduate researcher
- Inner workings of research departments
- Requirements and expectations of academicians
Judging from the first 89 questions, this is going to be extremely helpful, especially for PhD students.
It would be nice to have a place to share ideas, links, comments in a very informal way with others involved in research in statistical methodology and data science. CrossValidated.com is great for specific questions, but is not suitable for commenting on papers or sharing ideas and links. Continue reading →
Journal Clubs are a great way to learn new research ideas and to keep up with the literature. The idea is that a group of people get together every week or so to discuss a paper of joint interest. This can happen within your own research group or department, or virtually online.
There is now a virtual journal club operating in conjunction with CrossValidated.com. The first paper discussed was on text data mining. It appears that the next paper may be on collaborative filtering.
The emphasis is on Open Access papers, preferably with associated software that is freely available. Some of the discussion tends to centre on how to implement the ideas in R.
For those of us in Australia, the timing is tricky. The first discussion took place at 3am local time!
If you can’t make the CrossValidated Journal Club chats, why not start your own local club?
The CrossValidated Q&A site is now out of beta and the new design and site name are live.
The new design looks great, thanks to Jin Yang, our designer-in-residence. Note the normal density icon for accepted answers and the site icon depicting a 5-fold cross-validation (light green for the test set and dark green for the training set). There is a faint background graphic in the header and footer from a program that tracks and plots a person’s mouse movement. This gives the suggestion of randomness as well as the idea of data visualization (another topic covered on the site).
Name and URL
The URL crossvalidated.com will work, but re-directs to stats.stackexchange.com. The StackExchange team (who host the site and provide all the architecture) wanted the site to be a subdomain of stackexchange.com. However, at least we got the name CrossValidated.
The site is intended for use by statisticians, data miners, and anyone else doing data analysis. It covers questions about
- statistical analysis
- data mining and machine learning
- data visualization
- probability theory
- statistical and data-driven computing (e.g., questions about R, SAS, SPSS, Stata and Minitab)
The inclusion of data mining and machine learning along with statistics and probability was a deliberate attempt to get these two communities to talk. We work on similar problems, but often with different tools and different perspectives. I hope the site comes to be widely used within both communities. In fact, I hope that we can eventually stop talking about two communities and just refer to the “data science community”.
My original idea was that this would be helpful to researchers who are struggling with data analysis issues but have no statistician to ask for help. University-based statisticians are often inundated with requests for help from researchers in other disciplines who have no quantitative training but need to apply some statistical techniques.
For those who haven’t been reading this blog, I proposed this site on 15 April 2010. The scope of the site was determined via a community process, then we went through a phase of building a sufficient community. The beta site was launched on 19 July 2010 with the first question on “Eliciting priors from experts”.
The site was officially launched today (5 November 2010). So it took just over 200 days from proposal to launch. I had no idea what I was starting, but I’m glad it worked out! There are now 1048 questions and 1763 users, which is a great start. But there must be hundreds of thousands of people doing data analysis who would really benefit from a site like this. So please spread the word about CrossValidated.com.
It’s not a good idea to annoy the referees of your paper. They make recommendations to the editor about your work and it is best to keep them happy. There is an interesting discussion on stats.stackexchange.com on this subject. This inspired my own list below.
- Explain what you’ve done clearly, avoiding unnecessary jargon.
- Don’t claim your paper contributes more than it actually does. (I refereed a paper this week where the author claimed to have invented principal component analysis!)
- Ensure all figures have clear captions and labels.
- Include citations to the referee’s own work. Obviously you don’t know who is going to referee your paper, but you should aim to cite the main work in the area. It places your work in context, and keeps the referees happy if they are the authors.
- Make sure the cited papers say what you think they say. Sight what you cite!
- Include proper citations for all software packages. If you are unsure how to cite an R package, try the command citation("pkgname"), replacing pkgname with the name of the package.
- Never plagiarise from other papers — not even sentence fragments. Use your own words. I’ve refereed a thesis which had slabs taken from my own lecture notes including the typos.
- Don’t plagiarise from your own papers. Either reference your earlier work, or provide a summary in new words.
- Provide enough detail so your work can be replicated. Where possible, provide the data and code. Make sure the code works.
- When responding to referee reports, make sure you answer everything asked of you. (See my earlier post “Always listen to reviewers”.)
- If you’ve revised the paper based on referees’ comments, then thank them in the acknowledgements section.
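For the software-citation item in the list above, base R has a `citation()` function that returns the preferred reference for R or for an installed package. A minimal sketch follows, using the `stats` package only because it is always installed; substitute the package you actually used.

```r
# Citation for R itself.
r_cit <- citation()

# Citation for a specific installed package ("stats" ships with R).
pkg_cit <- citation("stats")
print(pkg_cit)

# BibTeX version, ready to paste into a .bib file.
bib <- toBibtex(pkg_cit)
print(bib)
```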
For some applied papers, there are specific statistical issues that need attention:
- Give effect sizes with confidence intervals, not just p-values.
- Don’t describe data using the mean and standard deviation without indicating whether the data were more-or-less symmetric and unimodal.
- Don’t split continuous data into groups.
- Make sure your data satisfy the assumptions of the statistical methods used.
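As a small illustration of the first point, the sketch below reports a difference in means together with its confidence interval instead of a bare p-value. It uses the `sleep` data set that ships with R and treats the two groups as independent purely for illustration.

```r
# Built-in data: extra hours of sleep under two drugs (10 observations each).
# The groups are treated as independent here only to keep the sketch simple.
tt <- t.test(extra ~ group, data = sleep)

# Effect size: difference in group means (group 1 minus group 2).
effect <- unname(tt$estimate[1] - tt$estimate[2])

# 95% confidence interval for that difference.
ci <- tt$conf.int

cat(sprintf("Difference in means: %.2f (95%% CI %.2f to %.2f)\n",
            effect, ci[1], ci[2]))
```

Reporting the estimate and its interval shows both the size of the effect and how precisely it has been measured, which a p-value alone cannot do.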
More tongue-in-cheek advice is provided by Stratton and Neil (2005), “How to ensure your paper is rejected by the statistical reviewer”. Diabetic Medicine, 22(4), 371-373.
Feel free to add your own suggestions over at stats.stackexchange.com.
The United Nations has declared today “World Statistics Day”. I’ve no idea what that means, or why we need a WSD. Perhaps it is because the date is 20.10.2010 (except in North America where it is 10.20.2010). But then, what happens from 2013 to 2099? And do we just forget the whole idea after 3112?
In any case, if we are going to have a WSD, let’s use it to do something useful. Patrick Burns has some ideas over at Portfolio Probe. Here are some of my own:
- Learn R. The time has come when it is not really possible to be a well-informed applied statistician if you are not a regular R user.
- Get involved on Stats.StackExchange.com. It’s only 3 months old, but we already have over 1600 users and it has quickly become the best place to ask and answer questions about statistics, data mining, data visualization and everything else to do with analysing data.
- If you’re in London, head to the RSS getstats launch, held appropriately at 20:10 on 20.10.2010.
- Learn some new statistical techniques. If you have never used the bootstrap, EM algorithm, mixed models or the Kalman filter, now is a great day to start.
- Stop using hypothesis tests and p-values! Instead, use confidence intervals and the AIC.
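As a rough sketch of what the last suggestion can look like in practice, the following base R code compares two candidate models by AIC and reports confidence intervals for the coefficients. The `cars` data set and the two models are arbitrary choices made for illustration only.

```r
# Two candidate models for stopping distance as a function of speed.
fit1 <- lm(dist ~ speed, data = cars)           # linear
fit2 <- lm(dist ~ poly(speed, 2), data = cars)  # quadratic

# Compare the models by AIC (smaller is better) instead of testing
# whether the quadratic term is "significant".
aic <- AIC(fit1, fit2)
best <- rownames(aic)[which.min(aic$AIC)]
print(aic)

# Report interval estimates for the coefficients rather than p-values.
ci <- confint(fit1)
print(ci)
```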
Feel free to add your own ideas in the comments … unless you’re too busy celebrating this auspicious occasion.