A very useful way of keeping up with blogs in a particular area is to subscribe to a blog aggregator. These will syndicate posts from a large number of blogs and provide links back to the original sources. So you only need to subscribe once to get all the good stuff in that area. There are now several blog aggregators available that might be of interest to readers here. And this blog is now syndicated on several other sites including those listed below.
Posts Tagged ‘statistics’:
Measuring time series characteristics
A few years ago, I was working on a project where we measured various characteristics of a time series and used the information to determine what forecasting method to apply or how to cluster the time series into meaningful groups. The two main papers to come out of that project were: Wang, Smith and Hyndman (2006) “Characteristic-based clustering for time series data”, Data Mining and Knowledge Discovery, 13(3), 335–364; and Wang, Smith-Miles and Hyndman (2009) “Rule induction for forecasting method selection: meta-learning the characteristics of univariate time series”, Neurocomputing, 72, 2581–2594. I’ve since had a lot of requests for the code, which one of my coauthors has been helpfully emailing to anyone who asked. But to make it easier, we thought it might be helpful if I posted some updated code here. This is not the same as the R code we used in the papers, as I’ve improved it in several ways (so it will give different results). If you just want the code, skip to the bottom of the post.
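As a rough illustration of what measuring “characteristics” of a time series can look like, here is a minimal Python sketch. It is not the R code mentioned above, and the three measures it computes (lag-1 autocorrelation, autocorrelation at the seasonal lag, and a linear trend slope) are illustrative choices rather than the exact features used in the papers. The function names `autocorrelation`, `trend_slope` and `characteristics` are hypothetical.

```python
from statistics import mean


def autocorrelation(x, lag=1):
    """Sample autocorrelation of the sequence x at the given lag."""
    m = mean(x)
    denom = sum((v - m) ** 2 for v in x)
    if denom == 0:
        return 0.0  # constant series: define the autocorrelation as zero
    num = sum((x[t] - m) * (x[t - lag] - m) for t in range(lag, len(x)))
    return num / denom


def trend_slope(x):
    """Least-squares slope of x against the time index 0, 1, 2, ...

    Needs at least two observations.
    """
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = mean(x)
    sxy = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    return sxy / sxx


def characteristics(x, period):
    """Collect a small feature vector describing the series x."""
    return {
        "acf1": autocorrelation(x, 1),
        "seasonal_acf": autocorrelation(x, period),
        "trend_slope": trend_slope(x),
    }
```

A feature vector like this, computed for each series in a collection, could then be passed to a clustering or rule-learning method in place of the raw observations.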
Data visualization
For those who have not read the seminal works of Tufte and Cleveland, please hang your heads in shame. To salvage some sense of self-worth, you can then head over to Solomon Messing’s blog where he is starting a series on data visualization based on the principles developed by Tufte and Cleveland (with R examples). The classics are also worth reading, and remain relevant despite the 20 or 30 years that have elapsed since they appeared.
Internet surveys
I received the following email today: “I am preparing a thesis … I need to conduct the widest possible poll, and it occurred to me that perhaps you could guide me toward an internet-based way in which this can be done easily. I have a ten-question questionnaire prepared, that I wish to have a random sample of the population respond to. I have no budget for this, so I hope you can suggest a way in which a good number of responses can be harvested using blogs or sites you may be aware of.” Here is my response.
Cyclic and seasonal time series
These terms get confused all the time (e.g., this question on CrossValidated.com), and so I thought it might be helpful to try to summarize the distinction and some of the associated models.
What you wish you knew before you started a PhD
I asked my research group recently what they wished they had learned before they started work on a PhD. Here are some of the responses.
Learn Machine Learning at Stanford for free
Andrew Ng’s machine learning course at Stanford is being offered free to anyone online in the (northern) fall of 2011. I’ve seen some of the notes from this course and it looks to be an excellent broad introduction to machine learning and data mining, including, for example, support vector machines, neural networks, kernels, clustering and dimension reduction.
Ten rules for data analysis
Peter Kennedy was an associate editor of the International Journal of Forecasting and a superb applied econometrician. He died unexpectedly in August 2010. He was best known for his excellent book A Guide to Econometrics as well as his “Ten Commandments of Applied Econometrics”. He provided a variation on his ten commandments in advice to his students in the form of the following ten rules:
Statistical tests for variable selection
I received an email today with the following comment: “I’m using ARIMA with Intervention detection and was planning to use your package to identify my initial ARIMA model for later iteration, however I found that sometimes the auto.arima function returns a model where AR/MA coefficients are not significant. So my question is: Is there a way to filter the search for ARIMA models that only have significant coefficients? I can remove the non-significant coefficients but I think it would be better to search for those models that only have significant coefficients.” Statistical significance is not usually a good basis for determining whether a variable should be included in a model, despite the fact that many people who should know better use it for exactly this purpose. Even some textbooks discuss variable selection using statistical tests, thus perpetuating bad statistical practice. Statistical tests were designed to test hypotheses, not select variables. A test on a coefficient answers a different question from whether the variable is useful in forecasting. It is possible to have an insignificant coefficient associated with a variable that is useful for forecasting. It is also possible to have a significant coefficient associated with a variable that is better omitted when forecasting. To see why the first situation occurs, think about two highly
(More)…
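The distinction between “is the coefficient significant?” and “does the variable help forecasting?” can be made concrete with a small Python sketch. This is only an illustration under simplifying assumptions: it uses a simple linear regression with one predictor rather than an ARIMA model, and it judges usefulness by hold-out forecast error, which is my choice for the example and not a method taken from the post. The function names `fit_line` and `holdout_comparison` are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least squares fit of y = a + b*x.

    Returns (a, b, t_stat), where t_stat is the usual t-statistic for b.
    Needs at least three observations and a non-constant x.
    """
    n = len(x)
    xm, ym = mean(x), mean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    a = ym - b * xm
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = (rss / (n - 2) / sxx) ** 0.5
    # A zero standard error means a perfect fit; report None in that case.
    t_stat = b / se_b if se_b > 0 else None
    return a, b, t_stat


def holdout_comparison(x, y, n_test):
    """Answer two separate questions about the predictor x.

    1. How large is the t-statistic of its coefficient on the training data?
    2. Does including it reduce mean squared error on the last n_test points?
    """
    x_tr, y_tr = x[:-n_test], y[:-n_test]
    x_te, y_te = x[-n_test:], y[-n_test:]
    a, b, t_stat = fit_line(x_tr, y_tr)
    mse_with = mean((yi - a - b * xi) ** 2 for xi, yi in zip(x_te, y_te))
    baseline = mean(y_tr)  # forecast from the model that omits x
    mse_without = mean((yi - baseline) ** 2 for yi in y_te)
    return {"t_stat": t_stat, "mse_with": mse_with, "mse_without": mse_without}
```

The two outputs need not agree: a small `t_stat` does not guarantee that `mse_with` exceeds `mse_without`, and a large one does not guarantee the reverse.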
Lies, damn lies and statistics
There’s a nice article with this title by Stephan Lewandowsky on the ABC website today, exploring the difference between anecdotes and data, and the dangers of cherry-picking evidence.

Rob J Hyndman