Omitting outliers

Someone sent me this email today:

One of my colleagues said that you once said/wrote that you had encountered very few real outliers in your work, and that normally the “outlier-looking” data points were proper data points that should not have been treated as outliers. Have you discussed this in writing? If so, I would love to read it.

I don’t think I’ve ever said or written anything quite like that, and I see lots of outliers in real data. But I have counselled against omitting apparent outliers.

Often the most interesting part of a data set is in the unusual or unexpected observations, so I’m strongly opposed to automatic omission of outliers. The most famous case of that is the non-detection of the hole in the ozone layer by NASA. The way I was told the story was that outliers had been automatically filtered from the data obtained from Nimbus-7. It was only when the British Antarctic Survey observed the phenomenon in the mid 1980s that scientists went back and found the problem could have been detected a decade earlier if automated outlier filtering had not been applied by NASA. In fact, that is also how the story was told on the NASA website for a few years. But in a letter to the editor of the IMS bulletin, Pukelsheim (1990) explains that the reality was more complicated. In the corrected story, scientists were investigating the unusual observations to see if they were genuine, or the result of instrumental error, but still didn’t detect the problem until quite late.

Whatever actually happened, outliers need to be investigated, not omitted. Try to understand what caused some observations to be different from the bulk of the observations. If you understand the reasons, you are then in a better position to judge whether the points can legitimately be removed from the data set, or whether you’ve just discovered something new and interesting. Never remove a point just because it is weird.
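As a minimal sketch of this flag-and-inspect approach: the toy data and the 1.5×IQR fence below are purely illustrative (any screening rule would do), and the point is that flagged values are collected with their positions for investigation, not silently deleted.

```python
# Flag suspicious points for inspection rather than silently dropping them.
# The data and the 1.5*IQR fence rule are illustrative, not a universal recipe.
import statistics

x = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 1.95, 2.15, 9.7]  # one odd reading

q1, q2, q3 = statistics.quantiles(x, n=4)  # quartiles
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep the flagged points with their positions, so each one can be traced
# back to its source and investigated -- not deleted.
suspects = [(i, v) for i, v in enumerate(x) if not lo <= v <= hi]
print(suspects)  # -> [(8, 9.7)]
```

Only after tracing a flagged point back to its source (a transcription error, a failed sensor, or a genuine phenomenon) should you decide what to do with it.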

Electricity price forecasting competition

The GEFCom competitions have been a great success in generating good research on forecasting methods for electricity demand, and in enabling a comprehensive comparative evaluation of various methods. But they have only considered price forecasting in a simplified setting. So I’m happy to see this challenge is being taken up as part of the European Energy Market Conference for 2016, to be held 6–9 June at the University of Porto in Portugal.

What’s your Hall number?

Today I attended the funeral of Peter Hall, one of the finest mathematical statisticians ever to walk the earth and easily the best from Australia. One of the most remarkable things about Peter was his astonishing productivity, with over 600 papers. As I sat in the audience I realised that many of the people there were probably coauthors of papers with Peter, and I wondered how many statisticians in the world would have been his coauthors or second-degree co-authors.

In mathematics, people calculate Erdős numbers — the “collaborative distance” between Paul Erdős and another person, as measured by authorship of mathematical papers. An Erdős number of 1 means you wrote a paper with Erdős; an Erdős number of 2 means you wrote a paper with someone who has an Erdős number of 1; and so on. My Erdős number is 3, measured in two different ways:

  • via Peter Brockwell / Kai-Lai Chung / Paul Erdős
  • via J. Keith Ord / Peter C Fishburn / Paul Erdős
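The “collaborative distance” is just the shortest-path length in the coauthorship graph, which a breadth-first search computes directly. The toy graph below encodes only the Brockwell/Chung chain above; real coauthorship data would be far denser.

```python
# Collaborative distance = shortest-path length in the coauthorship graph.
# This toy graph encodes only one chain, for illustration.
from collections import deque

coauthors = {
    "Erdos": {"Chung"},
    "Chung": {"Erdos", "Brockwell"},
    "Brockwell": {"Chung", "Hyndman"},
    "Hyndman": {"Brockwell"},
}

def collab_distance(graph, start, target):
    """Breadth-first search: number of coauthorship hops from start to target."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == target:
            return dist[v]
        for w in graph.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return None  # not connected

print(collab_distance(coauthors, "Hyndman", "Erdos"))  # -> 3
```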

It seems appropriate that we should compute Hall numbers in statistics. Mine is 1, as I was lucky enough to have coauthored two papers with Peter Hall. You can compute your own Hall number here. Just put your own surname in the second author field.



ACEMS Business Analytics Prize 2016

We have established a new annual prize for research students at Monash University in the general area of business analytics, funded by the Australian Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS). The rules of the award are listed below.

  1. The student must have submitted a paper to a high quality journal or refereed conference on some topic in the general area of business analytics, computational statistics or data visualization.
  2. Up to $3000 will be awarded to the student to assist with research expenses subject to the approval of the relevant supervisor.
  3. Applications should include the submitted paper, along with a brief statement (no more than 200 words) on how they intend to spend the money. Applications should be emailed to by 31 March 2016.
  4. The winning student will be selected by a panel consisting of Di Cook, Rob Hyndman, Catherine Forbes and Geoff Webb.
  5. Any HDR student currently enrolled at Monash University is eligible to apply.

Questions about the award can be asked in the comments section below.

Starting a career in data science

I received this email from one of my undergraduate students:

I’m writing to you asking for advice on how to start a career in Data Science. Other professions seem a bit more straightforward, in that accountants for example simply look for internships and ways into companies from there. From my understanding, careers in data science seem to run on a project-to-project basis. I’m not sure how to get my foot stuck in the door.

I am expecting to finish my degree by Semester 1 2016. In my job searching so far, I have only encountered positions which require 3+ years of previous data analysis experience and have not seen any “entry-level” data analysis positions or graduate data positions. What is the nature of entry level recruitment in this industry?

Any help would be greatly appreciated.


Making data analysis easier

Di Cook and I are organizing a workshop on “Making data analysis easier” for 18-19 February 2016.

We are calling it WOMBAT2016, which is an acronym for Workshop Organized by the Monash Business Analytics Team. Appropriately, it will be held at the Melbourne Zoo. Our plan is to make these workshops an annual event.

Some details are available on the workshop website. Key features are:

  • Hadley Wickham is our keynote speaker. He has been instrumental in changing the way we think about data analysis, and providing new tools for tidying, rearranging, summarising and plotting data. His R packages (including tidyr, dplyr, ggplot2, and ggvis) are very widely used.
  • Other speakers include Phil Brierley, Eugene Dubossarsky, Heike Hofmann, Thomas Lumley, Andrew Robinson, Elle Saber, Carson Sievert, Zoe van Havre, Geoff Webb, Yanchang Zhao, as well as Di and me.
  • The numbers are limited to a total of 100 with a quota on students, academics and people from business/industry. The aim is to have a good mix of people from different backgrounds to encourage productive discussions and mutual learning.
  • Register on Eventbrite.
  • We also have some places available for contributing speakers (15 minute talks). If you would like to do a contributed talk, you will need to email us a title and abstract by 15 January. We will notify you if your peer-reviewed abstract is successful by 29 January.

If you miss out on the workshop, you can still hear Hadley speak. Data Science Melbourne will host a meetup featuring him in the evening of Monday 22 February 2016.


RStudio just keeps getting better

RStudio has been a life-changer for the way I work, and for how I teach data analysis. I still have a couple of minor frustrations with it, but they are slowly disappearing as RStudio adds features.

I use dual monitors and I like to code on one monitor and have the console and plots on the other. Otherwise I see too little context, and long lines get wrapped, making the code harder to read. So I was very excited to see that RStudio has provided a great Christmas present this year, with source code panes that can be split off into separate windows.

You need the preview version as the feature hasn’t yet found its way into the release version. The features are explained in this help file, in which I also discovered the amazing shortcut Ctrl + . to jump to a function definition. I’ve no idea how long that has been in RStudio, but I’ll be using it a lot.

Now if they would only introduce the ability to select columns for cut/copy/paste …