Melbourne Data Science Initiative 2016

In just over three weeks, the inaugural MeDaScIn event will take place. This is an initiative to grow the talent pool of local data scientists and to promote Melbourne as a world city of excellence in Data Science.

The main event takes place on Friday 6th May, with lots of interesting-sounding titles and speakers from business and government. I’m the only academic speaker on the program, giving the closing talk on “Automatic FoRecasting”. Earlier in the day I am running a forecasting workshop where I will discuss forecasting issues and answer questions for about 90 minutes. There are still a few places left for the main event and for the workshops. Book soon if you want to attend.

All the details are here.

Plotting overlapping prediction intervals

I often see figures with two sets of prediction intervals plotted on the same graph using different line types to distinguish them. The results are almost always unreadable. A better way to do this is to use semi-transparent shaded regions. Here is an example showing two sets of forecasts for the Nile River flow.

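library(forecast)
f1 = forecast(auto.arima(Nile, lambda=0), h=20, level=95)
f2 = forecast(ets(Nile), h=20, level=95)
plot(f1, shadecol=rgb(0,0,1,.4), flwd=1,
     main="Forecasts of Nile River flow",
     xlab="Year", ylab="Billions of cubic metres")
# Shade the ETS prediction interval in semi-transparent red
polygon(c(time(f2$mean), rev(time(f2$mean))),
        c(f2$lower, rev(f2$upper)),
        col=rgb(1,0,0,.4), border=FALSE)
lines(f2$mean, col=rgb(.7,0,0))
legend("bottomleft", fill=c(rgb(0,0,1,.4), rgb(1,0,0,.4)),
       col=c("blue","red"), lty=1,
       legend=c("ARIMA","ETS"))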


The blue region shows 95% prediction intervals for the ARIMA forecasts, while the red region shows 95% prediction intervals for the ETS forecasts. Where they overlap, the colors blend to make purple. In this case, the point forecasts are quite close, but the prediction intervals are relatively different.
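The same kind of shading can be drawn for any pair of lower and upper bounds using only base R graphics. The following is a minimal sketch, not code from the forecast package: band_outline() is a hypothetical helper name, and the numbers are made up purely for illustration (they are not Nile forecasts).

```r
# Hypothetical helper: trace the outline of an interval band
# (along the lower bound, then back along the upper bound)
# so that polygon() can fill it.
band_outline <- function(t, lower, upper) {
  stopifnot(length(t) == length(lower), length(t) == length(upper))
  list(x = c(t, rev(t)),
       y = c(as.numeric(lower), rev(as.numeric(upper))))
}

# The fourth argument of rgb() is alpha (opacity); 0.4 gives a see-through fill
red_fill <- rgb(1, 0, 0, alpha = 0.4)

# Made-up bounds for illustration only
t     <- 1971:1975
lower <- c(600, 580, 570, 560, 555)
upper <- c(1000, 1040, 1060, 1075, 1085)
b     <- band_outline(t, lower, upper)

# Draw the midpoint as a line, then overlay the shaded band
plot(t, (lower + upper) / 2, type = "l", ylim = range(lower, upper),
     xlab = "Year", ylab = "Value")
polygon(b$x, b$y, col = red_fill, border = NA)
```

Because the fill is semi-transparent, a second band drawn with a different colour will blend with the first wherever the two overlap.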

rOpenSci unconference in Brisbane, 21-22 April 2016

The first rOpenSci unconference in Australia will be held on Thursday and Friday (April 21-22) in Brisbane, at the Microsoft Innovation Centre.

This event will bring together researchers, developers, data scientists and open data enthusiasts from industry, government and university. The aim is to conceptualise and develop R-based tools that address current challenges in data science, open science and reproducibility.

Examples of past projects can be found here, here, and here. Also here.

You can view more details, see who else is attending, and most importantly, apply to attend at the website.

Model variance for ARIMA models

From today’s email:

I wanted to ask you about your R forecast package, in particular the Arima() function. We are using this function to fit an ARIMAX model and produce model estimates and standard errors, which in turn can be used to get p-values and later model forecasts. To double check our work, we are also fitting the same model in SAS using PROC ARIMA and comparing model coefficients and output. Continue reading →
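For readers unfamiliar with the workflow described in the email, here is a rough sketch of getting standard errors and approximate p-values from a fitted regression with ARIMA errors. It uses base R's arima() on simulated data so that it runs without extra packages; Arima() in the forecast package is a wrapper around that function and returns the same coef and var.coef components. The data, the model order and the normal approximation for the p-values are all assumptions made for illustration.

```r
# Simulated data (assumption): one regressor plus AR(1) errors
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + arima.sim(list(ar = 0.5), n)

# Regression with AR(1) errors
fit <- arima(y, order = c(1, 0, 0), xreg = x)

est <- coef(fit)
se  <- sqrt(diag(fit$var.coef))  # standard errors from the estimated covariance matrix
z   <- est / se
p   <- 2 * pnorm(-abs(z))        # two-sided p-values, normal approximation

round(cbind(estimate = est, se = se, z = z, p = p), 4)
```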

Starting a career in data science

I received this email from one of my undergraduate students:

I’m writing to you asking for advice on how to start a career in Data Science. Other professions seem a bit more straight forward, in that accountants for example simply look for Internships and ways into companies from there. From my understanding, the nature of careers in data science seem to be on a project-to-project basis. I’m not sure how to get my foot stuck in the door.

I am expecting to finish degree by Semester 1 2016. In my job searching so far, I have only encountered positions which require 3+ years of previous data analysis experience and have not seen any “entry-level” data analysis positions or graduate data positions. What is the nature of entry level recruitment in this industry?

Any help would be greatly appreciated.


Continue reading →

Making data analysis easier

Di Cook and I are organizing a workshop on “Making data analysis easier” for 18-19 February 2016.

We are calling it WOMBAT2016, which is an acronym for Workshop Organized by the Monash Business Analytics Team. Appropriately, it will be held at the Melbourne Zoo. Our plan is to make these workshops an annual event.

Some details are available on the workshop website. Key features are:

  • Hadley Wickham is our keynote speaker. He has been instrumental in changing the way we think about data analysis, and providing new tools for tidying, rearranging, summarising and plotting data. His R packages (including tidyr, dplyr, ggplot2, and ggvis) are very widely used.
  • Other speakers include Phil Brierley, Eugene Dubossarsky, Heike Hofmann, Thomas Lumley, Andrew Robinson, Elle Saber, Carson Sievert, Zoe van Havre, Geoff Webb, Yanchang Zhao, as well as Di and me.
  • The numbers are limited to a total of 100 with a quota on students, academics and people from business/industry. The aim is to have a good mix of people from different backgrounds to encourage productive discussions and mutual learning.
  • Register on Eventbrite.
  • We also have some places available for contributing speakers (15 minute talks). If you would like to give a contributed talk, you will need to email us a title and abstract by 15 January. Abstracts will be peer-reviewed, and we will notify you by 29 January whether yours has been accepted.

If you miss out on the workshop, you can still hear Hadley speak. Data Science Melbourne will host a meetup featuring him on the evening of Monday 22 February 2016.