There is a one-day workshop on this topic on 23 February 2015 at QUT in Brisbane. I will be speaking on “Visualizing and forecasting big time series data”.
Big data is now endemic in business, industry, government, environmental management, medical science, social research and so on. One of the commensurate challenges is how to effectively model and analyse these data.
This workshop will bring together national and international experts in statistical modelling and analysis of big data, to share their experiences, approaches and opinions about future directions in this field.
The workshop programme will commence at 8.30am and close at 5pm. Registration is free; however, numbers are strictly limited, so please ensure you register when you receive your invitation via email. Morning and afternoon tea will be provided; participants will need to purchase their own lunch.
Further details will be made available in early January.
Shu Fan and I have developed a model for electricity demand forecasting that is now widely used in Australia for long-term forecasting of peak electricity demand. It has become known as the “Monash Electricity Forecasting Model”. We have decided to release an R package that implements our model so that other people can easily use it. The package is called “MEFM” and is available on GitHub. We will probably also put it on CRAN eventually.
The model was first described in Hyndman and Fan (2010). We are continually improving it, and the latest version is described in the model documentation, which will be updated from time to time.
The package is being released under a GPL licence, so anyone can use it. All we ask is that our work is properly cited.
Naturally, we are not able to provide free technical support, although we welcome bug reports. We are available to undertake paid consulting work in electricity forecasting.
Amongst today’s email was one from someone running a private competition to classify time series. Here are the essential details.
The data are measurements from a medical diagnostic machine which takes one measurement every second, and after 32–1000 seconds, the time series must be classified into one of two classes. Some pre-classified training data are provided. It is not necessary to classify all the test data, but you do need to have relatively high accuracy on what is classified. So you could find a subset of more easily classifiable test time series, and leave the rest of the test data unclassified.
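To make the idea of classifying only the easier series concrete, here is a minimal sketch in R. It is not the competition's data or anyone's actual method: the simulated series, the single summary feature, the logistic regression and the 0.9 threshold are all assumptions chosen purely for illustration.

```r
# Hypothetical illustration only: simulated data, made-up names and settings.
set.seed(1)

# One series of random length (32 to 1000 seconds).
# Class 1 has a slightly higher mean than class 0.
simulate_series <- function(class) {
  n <- sample(32:1000, 1)
  rnorm(n, mean = 0.1 * class, sd = 1)
}

# Reduce each series to a single summary feature (its mean).
make_data <- function(n_series) {
  class <- rbinom(n_series, 1, 0.5)
  feature <- vapply(class, function(k) mean(simulate_series(k)), numeric(1))
  data.frame(class = class, feature = feature)
}

train <- make_data(300)
test <- make_data(200)

# Logistic regression on the pre-classified training series.
fit <- glm(class ~ feature, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Classify only when the predicted probability is far from 0.5;
# otherwise abstain and leave the series unclassified (NA).
threshold <- 0.9
pred <- ifelse(prob >= threshold, 1, ifelse(prob <= 1 - threshold, 0, NA))

classified <- !is.na(pred)
coverage <- mean(classified)
accuracy <- mean(pred[classified] == test$class[classified])
cat(sprintf("Classified %.0f%% of test series; accuracy on those: %.2f\n",
            100 * coverage, accuracy))
```

Raising the threshold trades coverage for accuracy: fewer series are classified, but the ones that remain are those the model is most sure about.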
The first issue of the IJF for 2015 has just been published, and I’m delighted that it includes a special section honoring Herman Stekler. It includes articles covering a range of his forecasting interests, although not all of them (sports forecasting is missing). Herman himself wrote a paper for it looking at “Forecasting—Yesterday, Today and Tomorrow”.
He is in a unique position to write such a paper as he has been doing forecasting research longer than anyone else on the planet — his first published paper on forecasting appeared in 1959. Herman is now 82 years old, and is still very active in research. Only a couple of months ago, he wrote to me with some new research ideas he had been thinking about, asking me for some feedback. He is also an extraordinarily conscientious and careful associate editor of the IJF and a delight to work with. He is truly “a scholar and a gentleman” and I am very happy that we can honor Herman in this manner. Thanks to Tara Sinclair, Prakash Loungani and Fred Joutz for putting this tribute together.
We also published an interview with Herman in the IJF in 2010, which contains some information about his early years, graduate education and first academic jobs.
Competitions have a long history in forecasting and prediction, and have been instrumental in focusing research attention on methods that work well in practice. In the forecasting community, the M competition and M3 competition have been particularly influential. The data mining community has the annual KDD Cup, which has generated attention on a wide range of prediction problems and associated methods. Recent KDD Cups are hosted on Kaggle.
In my research group meeting today, we discussed our (limited) experiences in competing in some Kaggle competitions, and we reviewed the following two papers which describe two prediction competitions:
- Athanasopoulos and Hyndman (IJF 2011). The value of feedback in forecasting competitions. [preprint version]
- Roy et al (2013). The Microsoft Academic Search Dataset and KDD Cup 2013.
The Human Mortality Database is a wonderful resource for anyone interested in demographic data. It is a carefully curated collection of high-quality deaths and population data from 37 countries, all in a consistent format with consistent definitions. I have used it many times and never cease to be amazed at the care taken to maintain such a great resource.
The data are continually being revised and updated. Today the Australian data have been updated to 2011. There is a time lag because of lagged death registrations, which result in undercounts, so only data that are likely to be complete are included.
Tim Riffe from the HMD has provided the following information about the update:
- All death counts since 1964 are now included by year of occurrence, up to 2011. We have 2012 data but do not publish them because they are likely a 5% undercount due to lagged registration.
- Death count inputs for 1921 to 1963 are now in single ages. Previously they were in 5-year age groups. Rather than having an open age group of 85+ in this period, counts usually go up to the maximum observed (stated) age. This change (i) introduces minor heaping in early years and (ii) implies different apparent old-age mortality than before, since previously anything above 85 was modeled according to the Methods Protocol.
- Population denominators have been swapped out for years 1992 to the present, owing to new ABS methodology and intercensal estimates for the recent period.
Some of the data can be read into R using the hmd.e0 functions from the demography package. Tim has his own package on github that provides a more extensive interface.
This week my research group discussed Adrian Raftery’s recent paper on “Use and Communication of Probabilistic Forecasts” which provides a fascinating but brief survey of some of his work on modelling and communicating uncertain futures. Coincidentally, today I was also sent a copy of David Spiegelhalter’s paper on “Visualizing Uncertainty About the Future”. Both are well worth reading.
It made me think about my own efforts to communicate future uncertainty through graphics. Of course, for time series forecasts I normally show prediction intervals. I prefer to use more than one interval at a time because it helps convey a little more information. The default in the forecast package for R is to show both an 80% and a 95% interval.
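As a rough sketch of what showing two intervals at once looks like, the following uses only base R rather than the forecast package's own plotting code. The data set, the ARIMA model and the normal-theory intervals are all illustrative assumptions, not a recommendation.

```r
# Fit a simple seasonal ARIMA model to a built-in monthly series
# (an illustrative choice of data and model).
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc <- predict(fit, n.ahead = 24)

# Normal-theory prediction intervals at a given percentage level.
interval <- function(level) {
  z <- qnorm(0.5 + level / 200)
  list(lower = fc$pred - z * fc$se, upper = fc$pred + z * fc$se)
}
pi80 <- interval(80)
pi95 <- interval(95)

# Plot the history, then shade the wide 95% band
# with the narrower 80% band drawn on top of it.
h <- time(fc$pred)
plot(log(AirPassengers), xlim = c(1949, 1963),
     ylim = range(log(AirPassengers), pi95$lower, pi95$upper),
     ylab = "log(passengers)")
polygon(c(h, rev(h)), c(pi95$lower, rev(pi95$upper)),
        col = "grey85", border = NA)
polygon(c(h, rev(h)), c(pi80$lower, rev(pi80$upper)),
        col = "grey65", border = NA)
lines(fc$pred, col = "blue")
```

The darker inner band shows where the future values are more likely to fall, while the lighter outer band conveys how far they could plausibly stray.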
Review papers are extremely useful for new researchers such as PhD students, or when you want to learn about a new research field. The International Journal of Forecasting produced a whole review issue in 2006, and it contains some of the most highly cited papers we have ever published. Now, beginning with the latest issue of the journal, we have started publishing occasional review articles on selected areas of forecasting. The first two articles are:
- Electricity price forecasting: A review of the state-of-the-art with a look into the future by Rafał Weron.
- The challenges of pre-launch forecasting of adoption time series for new durable products by Paul Goodwin, Sheik Meeran, and Karima Dyussekeneva.
Both tackle very important topics in forecasting. Weron’s paper contains a comprehensive survey of work on electricity price forecasting, coherently bringing together a large body of diverse research. I think it is the longest paper I have ever approved, at 50 pages. Goodwin, Meeran and Dyussekeneva review research on new product forecasting, a problem every company that produces goods or services has faced: when there are no historical data available, how do you forecast the sales of your product?
We have a few other review papers in progress, so keep an eye out for them in future issues.
I get questions about this almost every week. Here is an example from a recent comment on this blog:
I have two large time series data. One is separated by seconds intervals and the other by minutes. The length of each time series is 180 days. I’m using R (3.1.1) for forecasting the data. I’d like to know the value of the “frequency” argument in the ts() function in R, for each data set. Since most of the examples and cases I’ve seen so far are for months or days at the most, it is quite confusing for me when dealing with equally separated seconds or minutes. According to my understanding, the “frequency” argument is the number of observations per season. So what is the “season” in the case of seconds/minutes? My guess is that since there are 86,400 seconds and 1440 minutes a day, these should be the values for the “freq” argument. Is that correct?
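The reader's guess can be written out as a short sketch, assuming the daily cycle is the season of interest. The data below are simulated stand-ins (7 days rather than 180, to keep things small) and the variable names are placeholders. A different seasonal period, such as a weekly one, would imply a different value of the frequency argument.

```r
# Simulated stand-ins for the reader's two data sets.
set.seed(1)
days <- 7
minutes_per_day <- 24 * 60        # 1440 observations per day
seconds_per_day <- 24 * 60 * 60   # 86400 observations per day

minute_values <- rnorm(days * minutes_per_day)
second_values <- rnorm(days * seconds_per_day)

# Assumption: the seasonal pattern of interest repeats every day,
# so "frequency" is the number of observations per day.
minute_ts <- ts(minute_values, frequency = minutes_per_day)
second_ts <- ts(second_values, frequency = seconds_per_day)

# If a weekly pattern were the season of interest instead,
# the frequency would be the number of observations per week.
minute_weekly_ts <- ts(minute_values, frequency = 7 * minutes_per_day)

print(frequency(minute_ts))
print(frequency(second_ts))
print(frequency(minute_weekly_ts))
```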