Earo Wang recently interviewed me for the Chinese website Capital of Statistics. The English transcript of the interview is on Earo’s personal website. This is the third interview I’ve done in the last 18 months. The others were for Data Mining Research (republished in Amstat News) and DecisionStats.

## Posts Tagged ‘StackExchange’:

## Forecasting annual totals from monthly data

This question was posed on crossvalidated.com:

> I have a monthly time series (for 2009–2012, non-stationary, with seasonality). I can use ARIMA (or ETS) to obtain point and interval forecasts for each month of 2013, but I am interested in forecasting the total for the whole year, including prediction intervals. Is there an easy way in R to obtain interval forecasts for the total for 2013?

I’ve come across this problem before in my consulting work, although I don’t think I’ve ever published my solution. So here it is.

## Seeking help

Every day I receive emails, or comments on this blog, asking for help with R, forecasting, LaTeX, possible research topics, how to install software, or some other thing I’m supposed to know something about. Unfortunately, I cannot provide a one-man help service to the rest of the world. I used to reply to all the requests explaining where to go for help, but I stopped replying a while ago as it took too much time to do even that. If you want help, please ask at either stats.stackexchange.com (for R or statistics questions) or tex.stackexchange.com (for LaTeX questions). Unless you are one of my students, the only questions I will answer are ones that concern my R packages or research papers. And even then, I won’t reply if the answer is in the help files. I write those help files for a reason, so please read them. I’m sorry I can’t do more, but if I did everything people ask me to do, I’d never write any papers or produce any R packages, and I think that’s a better use of my time.

## Academia StackExchange

There’s a new StackExchange site that might be useful to readers: Academia. It is a Q&A site for academics and those enrolled in higher education. The draft FAQ says it will cover:

- Life as a graduate student, postdoctoral researcher, university professor
- Transitioning from undergraduate to graduate researcher
- Inner workings of research departments
- Requirements and expectations of academicians

Judging from the first 89 questions, this is going to be extremely helpful, especially for PhD students.

## Social networking for researchers

It would be nice to have a place to share ideas, links, comments in a very informal way with others involved in research in statistical methodology and data science. CrossValidated.com is great for specific questions, but is not suitable for commenting on papers or sharing ideas and links.

## CrossValidated Journal Club

Journal Clubs are a great way to learn new research ideas and to keep up with the literature. The idea is that a group of people get together every week or so to discuss a paper of joint interest. This can happen within your own research group or department, or virtually online. There is now a virtual journal club operating in conjunction with CrossValidated.com. The first paper discussed was on text data mining. It appears that the next paper may be on collaborative filtering. The emphasis is on Open Access papers, preferably with associated software that is freely available. Some of the discussion tends to centre on how to implement the ideas in R. For those of us in Australia, the timing is tricky. The first discussion took place at 3am local time! If you can’t make the CrossValidated Journal Club chats, why not start your own local club?

## CrossValidated launched!

The CrossValidated Q&A site is now out of beta and the new design and site name are live.

### New design

The new design looks great, thanks to Jin Yang, our designer-in-residence. Note the normal density icon for accepted answers and the site icon depicting a 5-fold cross-validation (light green for the test set and dark green for the training set). There is a faint background graphic in the header and footer from a program that tracks and plots a person’s mouse movement. This gives the suggestion of randomness as well as the idea of data visualization (another topic covered on the site).

### Name and URL

The URL crossvalidated.com will work, but redirects to stats.stackexchange.com. The StackExchange team (who host the site and provide all the architecture) wanted the site to be a subdomain of stackexchange.com. However, at least we got the name CrossValidated.

### Scope

The site is intended for use by statisticians, data miners, and anyone else doing data analysis. It covers questions about:

- statistical analysis
- data mining and machine learning
- data visualization
- probability theory
- statistical and data-driven computing (e.g., questions about R, SAS, SPSS, Stata and Minitab)

The inclusion of data mining and machine learning along with statistics and probability was a deliberate attempt to get these two communities

## How to avoid annoying a referee

It’s not a good idea to annoy the referees of your paper. They make recommendations to the editor about your work and it is best to keep them happy. There is an interesting discussion on stats.stackexchange.com on this subject. This inspired my own list below.

- Explain what you’ve done clearly, avoiding unnecessary jargon.
- Don’t claim your paper contributes more than it actually does. (I refereed a paper this week where the author claimed to have invented principal component analysis!)
- Ensure all figures have clear captions and labels.
- Include citations to the referee’s own work. Obviously you don’t know who is going to referee your paper, but you should aim to cite the main work in the area. It places your work in context, and keeps the referees happy if they are the authors.
- Make sure the cited papers say what you think they say. Sight what you cite!
- Include proper citations for all software packages. If you are unsure how to cite an R package, try the command `citation("packagename")`.
- Never plagiarise from other papers, not even sentence fragments. Use your own words. I’ve refereed a thesis which had slabs taken from my own lecture notes, including the typos.
- Don’t plagiarise from your own papers. Either reference

## Happy World Statistics Day!

The United Nations has declared today “World Statistics Day”. I’ve no idea what that means, or why we need a WSD. Perhaps it is because the date is 20.10.2010 (except in North America where it is 10.20.2010). But then, what happens from 2013 to 2099? And do we just forget the whole idea after 3112? In any case, if we are going to have a WSD, let’s use it to do something useful. Patrick Burns has some ideas over at Portfolio Probe. Here are some of my own:

- Learn R. The time has come when it is not really possible to be a well-informed applied statistician if you are not a regular R user.
- Get involved on Stats.StackExchange.com. It’s only 3 months old, but we already have over 1600 users and it has quickly become the best place to ask and answer questions about statistics, data mining, data visualization and everything else to do with analysing data.
- If you’re in London, head to the RSS getstats launch, held appropriately at 20:10 on 20.10.2010.
- Learn some new statistical techniques. If you have never used the bootstrap, EM algorithm, mixed models or the Kalman filter, now is a great day to start.
- Stop using hypothesis tests and p-values! Instead, use confidence intervals

## Why every statistician should know about cross-validation

Surprisingly, many statisticians see cross-validation as something data miners do, but not a core statistical technique. I thought it might be helpful to summarize the role of cross-validation in statistics, especially as it is proposed that the Q&A site at stats.stackexchange.com should be renamed CrossValidated.com. Cross-validation is primarily a way of measuring the predictive performance of a statistical model. Every statistician knows that the model fit statistics are not a good guide to how well a model will predict: high $R^2$ does not necessarily mean a good model. It is easy to over-fit the data by including too many degrees of freedom and so inflate $R^2$ and other fit statistics. For example, in a simple polynomial regression I can just keep adding higher order terms and so get better and better fits to the data. But the predictions from the model on new data will usually get worse as higher order terms are added. One way to measure the predictive ability of a model is to test it on a set of data not used in estimation. Data miners call this a “test set” and the data used for estimation is the “training set”. For example, the predictive accuracy of a model can be measured by the mean squared error on the test set. This will
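The training/test comparison described above can be sketched in a few lines. This is only an illustration, written in Python with the standard library: the simulated quadratic data, the noise level, the sample sizes and the helper names (`fit_polynomial`, `mse`, `simulate`) are all assumptions made for the example, not something taken from the discussion itself.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first),
    obtained by solving the normal equations X'X b = X'y."""
    n = degree + 1
    # Augmented system [X'X | X'y] for the monomial basis 1, x, x^2, ...
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution.
    coefs = [0.0] * n
    for i in reversed(range(n)):
        s = a[i][n] - sum(a[i][j] * coefs[j] for j in range(i + 1, n))
        coefs[i] = s / a[i][i]
    return coefs


def predict(coefs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coefs))


def mse(coefs, xs, ys):
    """Mean squared error of the fitted polynomial on (xs, ys)."""
    return sum((y - predict(coefs, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Hypothetical data: a quadratic trend plus Gaussian noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1 + 2 * x - 1.5 * x ** 2 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(1)
train_x, train_y = simulate(20, rng)   # training set: used for estimation
test_x, test_y = simulate(200, rng)    # test set: never used for estimation

results = []
for degree in range(1, 7):
    coefs = fit_polynomial(train_x, train_y, degree)
    results.append((degree,
                    mse(coefs, train_x, train_y),
                    mse(coefs, test_x, test_y)))

for degree, train_mse, test_mse in results:
    print(f"degree {degree}: training MSE = {train_mse:.3f}, "
          f"test MSE = {test_mse:.3f}")
```

Because the models are nested, the training MSE can only fall as the degree rises. The test MSE has no such guarantee, and it usually stops improving once the polynomial is more flexible than the underlying curve.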