Melbourne Data Science Initiative 2016

In just over three weeks, the inaugural MeDaScIn event will take place. This is an initiative to grow the talent pool of local data scientists and to promote Melbourne as a world city of excellence in Data Science.

The main event takes place on Friday 6th May, with lots of interesting-sounding titles and speakers from business and government. I’m the only academic speaker on the program, giving the closing talk on “Automatic FoRecasting”. Earlier in the day I am running a forecasting workshop where I will discuss forecasting issues and answer questions for about 90 minutes. There are still a few places left for the main event and for the workshops. Book soon if you want to attend.

All the details are here.

Plotting overlapping prediction intervals

I often see figures with two sets of prediction intervals plotted on the same graph using different line types to distinguish them. The results are almost always unreadable. A better way to do this is to use semi-transparent shaded regions. Here is an example showing two sets of forecasts for the Nile River flow.

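library(forecast)
f1 = forecast(auto.arima(Nile, lambda=0), h=20, level=95)
f2 = forecast(ets(Nile), h=20, level=95)
plot(f1, shadecol=rgb(0,0,1,.4), flwd=1,
     main="Forecasts of Nile River flow",
     xlab="Year", ylab="Billions of cubic metres")
# Shade the ETS prediction interval as a semi-transparent red region
polygon(c(time(f2$mean), rev(time(f2$mean))),
        c(f2$lower, rev(f2$upper)),
        col=rgb(1,0,0,.4), border=FALSE)
lines(f2$mean, col=rgb(.7,0,0))
legend("bottomleft", legend=c("ARIMA","ETS"),
       col=c("blue","red"), lty=1)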


The blue region shows 95% prediction intervals for the ARIMA forecasts, while the red region shows 95% prediction intervals for the ETS forecasts. Where they overlap, the colors blend to make purple. In this case, the point forecasts are quite close, but the prediction intervals are relatively different.
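
The same shading idea can be wrapped in a small helper for use with more than two sets of intervals. What follows is only a sketch: `band_outline()` and `shade_band()` are hypothetical names, not functions from the forecast package, and the usage part assumes an ARIMA(1,1,1) model on the logged series with normal-theory 95% limits from base R's `arima()` and `predict()`.

```r
# Hypothetical helper (not part of the forecast package): build the outline
# of an interval band in the form that polygon() expects.
band_outline <- function(time, lower, upper) {
  time <- as.numeric(time)
  lower <- as.numeric(lower)
  upper <- as.numeric(upper)
  stopifnot(length(time) == length(lower), length(time) == length(upper))
  # Trace the lower limit left to right, then the upper limit right to left
  list(x = c(time, rev(time)), y = c(lower, rev(upper)))
}

# Hypothetical helper: add a semi-transparent band to an existing plot
shade_band <- function(time, lower, upper, col) {
  b <- band_outline(time, lower, upper)
  polygon(b$x, b$y, col = col, border = NA)
}

# Usage with base R only. The model order is an assumption for illustration.
fit <- arima(log(Nile), order = c(1, 1, 1))
p <- predict(fit, n.ahead = 20)
lo <- exp(p$pred - 1.96 * p$se)   # approximate 95% limits, back-transformed
hi <- exp(p$pred + 1.96 * p$se)
plot(Nile, xlim = c(1871, 1990), ylim = range(Nile, lo, hi),
     xlab = "Year", ylab = "Billions of cubic metres")
shade_band(time(p$pred), lo, hi, col = rgb(0, 0, 1, 0.4))
lines(exp(p$pred), col = "blue")
```

With the forecast objects from the example above, a call such as `shade_band(time(f2$mean), f2$lower, f2$upper, col=rgb(1,0,0,.4))` should draw the red ETS region. Each extra set of intervals is then one more call with its own semi-transparent colour.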

Model variance for ARIMA models

From today’s email:

I wanted to ask you about your R forecast package, in particular the Arima() function. We are using this function to fit an ARIMAX model and produce model estimates and standard errors, which in turn can be used to get p-values and later model forecasts. To double check our work, we are also fitting the same model in SAS using PROC ARIMA and comparing model coefficients and output.

Electricity price forecasting competition

The GEFCom competitions have been a great success in generating good research on forecasting methods for electricity demand, and in enabling a comprehensive comparative evaluation of various methods. But they have only considered price forecasting in a simplified setting. So I’m happy to see this challenge is being taken up as part of the European Energy Market Conference for 2016, to be held from 6-9 June at the University of Porto in Portugal.

The hidden benefits of open-source software

I’ve been having discussions with colleagues and university administration about the best way for universities to manage home-grown software.

The traditional business model for software is that we build software and sell it to everyone willing to pay. Very often, that leads to a software company spin-off that has little or nothing to do with the university that nurtured the development. Think MATLAB, S-Plus, Minitab, SAS and SPSS, all of which grew out of universities or research institutions. This model has repeatedly been shown to stifle research development, channel funds away from the institutions where the software was born, and add to research costs for everyone.

I argue that the open-source model is a much better approach both for research development and for university funding. Under the open-source model, we build software, and make it available for anyone to use and adapt under an appropriate licence. This approach has many benefits that are not always appreciated by university administrators.

Big Data for Official Statistics Competition

This is a new competition being organized by Eurostat. The first phase involves nowcasting economic indicators at national and European level, including unemployment, HICP, Tourism and Retail Trade and some of their variants.

The main goal of the competition is to discover promising methodologies and data sources that could, now or in the future, be used to improve the production of official statistics in the European Statistical System.

The organizers seem to have been encouraged by the success of Kaggle and other data science competition platforms. Unfortunately, they have chosen not to give any prizes other than an invitation to give a conference presentation or poster, which hardly seems likely to attract many good participants.

The deadline for registration is 10 January 2016. The duration of the competition is roughly a year (including about a month for evaluation).

See the call for participation for more information.