# Rmarkdown template for a Monash working paper

This is only directly relevant to my Monash students and colleagues, but the same idea might be useful for adapting to other institutions.

Some recent changes in the rmarkdown and bookdown packages mean that it is now possible to produce working papers in exactly the same format as we previously used with LaTeX.

# Tourism time series repository

A few years ago, I wrote a paper with George Athanasopoulos and others about a tourism forecasting competition. We originally made the data available as an online supplement to the paper, but that has unfortunately since disappeared although the paper itself is still available.

So I am posting the data here in case anyone wants to use it for replicating our results, or for other research purposes. The data are split into monthly, quarterly and yearly data. There are 366 monthly series, 427 quarterly series and 518 yearly series. Each group of series is further split into training data and test data. Further information is provided in the paper.

If you use the data in a publication, please cite the IJF paper as the source, along with a link to this blog post.

# The latest IJF issue with GEFCom2014 results

The latest issue of the IJF is a bumper issue with over 500 pages of forecasting insights.

The GEFCom2014 papers are included in a special section on probabilistic energy forecasting, guest edited by Tao Hong and Pierre Pinson. This is a major milestone in energy forecasting research, with the focus on probabilistic forecasting and forecast evaluation using a quantile scoring method. Only a few years ago I was having to explain to energy professionals why you couldn’t use a MAPE to evaluate a percentile forecast. With this special section, we now have a tutorial review on probabilistic electric load forecasting by Tao Hong and Shu Fan, which should help everyone get up to speed with current forecasting approaches, evaluation methods and common misunderstandings. The section also contains a large number of very high quality articles showing how to do state-of-the-art density forecasting for electricity load, electricity price, and solar and wind power. Moreover, there are some benchmarks on publicly available data sets, so future researchers can easily compare their methods against these published results.
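The quantile scoring method used in the special section can be illustrated with the pinball loss, the standard score for evaluating a quantile forecast. This is a minimal sketch (in Python, for illustration), not code from the papers themselves:

```python
def pinball_loss(y, q, tau):
    """Pinball (quantile) score for one observation y against a
    forecast q of the tau-quantile. Lower is better."""
    if y >= q:
        return tau * (y - q)        # actual fell above the quantile forecast
    return (1 - tau) * (q - y)      # actual fell below the quantile forecast

# A 90th-percentile forecast is penalized much more for being too low
# than for being too high by the same amount:
pinball_loss(10, 8, 0.9)   # 1.8
pinball_loss(10, 12, 0.9)  # about 0.2
```

Averaging this score over observations and quantiles rewards a forecaster for getting the whole distribution right, which is exactly what a MAPE on a percentile forecast cannot do.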

In addition to the special section on probabilistic energy forecasting, there is an invited review paper on call centre forecasting by Ibrahim, Ye, L’Ecuyer and Shen. This is an important area in practice, and this paper provides a helpful review of the literature, a summary of the key issues in building good models, and suggests some possible future research directions.

There is also an invited paper from Blasques, Koopman, Łasak and Lucas on “In-sample confidence bands and out-of-sample forecast bands for time-varying parameters in observation-driven models” with some great discussion from Catherine Forbes and Pierre Perron. This was the subject of Siem Jan Koopman’s ISF talk in 2014.

Finally, there are 18 regular contributed papers, more than we normally publish in a whole issue, on topics ranging from forecasting excess stock returns to the demographics of households, and from forecasting food prices to evaluating forecasts of counts and intermittent demand. Check them all out on ScienceDirect.

# rOpenSci unconference in Brisbane, 21-22 April 2016

The first rOpenSci unconference in Australia will be held on Thursday and Friday (April 21-22) in Brisbane, at the Microsoft Innovation Centre.

This event will bring together researchers, developers, data scientists and open data enthusiasts from industry, government and universities. The aim is to conceptualise and develop R-based tools that address current challenges in data science, open science and reproducibility.

Examples of past projects can be found here, here, and here. Also here.

You can view more details, see who else is attending, and most importantly, apply to attend at the website.

# Reproducibility in computational research

Jane Frazier spoke at our research team meeting today on “Reproducibility in computational research”. We had a very stimulating and lively discussion about the issues involved. One interesting idea was that reproducibility is on a scale, and we can all aim to move further along the scale towards making our own research more reproducible. For example

• Can you reproduce your results tomorrow on the same computer with the same software installed?
• Could someone else on a different computer reproduce your results with the same software installed?
• Could you reproduce your results in 3 years’ time, after some of your software environment may have changed?
• etc.

Think about what changes you need to make to move one step further along the reproducibility continuum, and do it.

Jane’s slides and handout are below.

# R vs Autobox vs ForecastPro vs …

Every now and then a commercial software vendor claims on social media that its software is much better than the forecast package for R, but provides no details.

There are lots of reasons why you might select a particular software solution, and R isn’t for everyone. But anyone claiming superiority should at least provide some evidence rather than make unsubstantiated claims.

# A new open source data set for detecting time series outliers

Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

# New Australian data on the HMD

The Human Mortality Database is a wonderful resource for anyone interested in demographic data. It is a carefully curated collection of high quality deaths and population data from 37 countries, all in a consistent format with consistent definitions. I have used it many times and never cease to be amazed at the care taken to maintain such a great resource.

The data are continually being revised and updated. Today the Australian data were updated to 2011. There is a time lag because lagged death registrations result in undercounts, so only data that are likely to be complete are included.

Tim Riffe from the HMD has provided the following information about the update:

1. All death counts since 1964 are now included by year of occurrence, up to 2011. We have 2012 data but do not publish them because they are likely a 5% undercount due to lagged registration.
2. Death count inputs for 1921 to 1963 are now in single ages. Previously they were in 5-year age groups. Rather than having an open age group of 85+ in this period counts usually go up to the maximum observed (stated) age. This change (i) introduces minor heaping in early years and (ii) implies different apparent old-age mortality than before, since previously anything above 85 was modeled according to the Methods Protocol.
3. Population denominators have been swapped out for years 1992 to the present, owing to new ABS methodology and intercensal estimates for the recent period.

Some of the data can be read into R using the hmd.mx and hmd.e0 functions from the demography package. Tim has his own package on GitHub that provides a more extensive interface.

# Errors on percentage errors

The MAPE (mean absolute percentage error) is a popular measure for forecast accuracy and is defined as
$$\text{MAPE} = 100\,\text{mean}(|y_t - \hat{y}_t|/|y_t|)$$
where $y_t$ denotes an observation and $\hat{y}_t$ denotes its forecast, and the mean is taken over $t$.
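The definition translates directly into code. A minimal sketch (in Python, for illustration; the helper name is my own):

```python
def mape(actual, forecast):
    """MAPE = 100 * mean(|y_t - yhat_t| / |y_t|).
    Undefined when any actual value is zero."""
    n = len(actual)
    return 100 * sum(abs(y - f) / abs(y) for y, f in zip(actual, forecast)) / n

# Each forecast here is off by 10% of its actual, so the MAPE is about 10:
mape([100, 200, 50], [110, 180, 55])
```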

Armstrong (1985, p.348) was the first (to my knowledge) to point out the asymmetry of the MAPE, saying that “it has a bias favoring estimates that are below the actual values”.
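The asymmetry is easy to see with a pair of forecasts that have the same absolute error. A small sketch (in Python, for illustration), using the definition above applied to a single observation:

```python
def ape(actual, forecast):
    # absolute percentage error for a single observation
    return 100 * abs(actual - forecast) / abs(actual)

# Same absolute error of 50, but the denominator is the actual value,
# so the error looks worse when the forecast is above the actual:
ape(100, 150)  # 50.0  (forecast above the actual)
ape(150, 100)  # about 33.3 (forecast below the actual)
```

Minimizing the MAPE therefore pulls forecasts downward, which is Armstrong’s bias favoring estimates below the actual values.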