The hidden benefits of open-source software

I’ve been having discussions with colleagues and university administration about the best way for universities to manage home-grown software.

The traditional business model for software is that we build software and sell it to everyone willing to pay. Very often, that leads to a software company spin-off that has little or nothing to do with the university that nurtured the development. Think MATLAB, S-Plus, Minitab, SAS and SPSS, all of which grew out of universities or research institutions. This model has repeatedly been shown to stifle research development, channel funds away from the institutions where the software was born, and add to research costs for everyone.

I argue that the open-source model is a much better approach both for research development and for university funding. Under the open-source model, we build software, and make it available for anyone to use and adapt under an appropriate licence. This approach has many benefits that are not always appreciated by university administrators.

Travelling Thilaksha

One of my PhD students, Thilaksha Tharanganie, has been very successful in getting travel funding to attend conferences. She was the subject of a write-up in today’s Monash News.

We encourage students to attend conferences, and provide funding for them to attend one international conference and one local conference during their PhD candidature. Thilaksha was previously funded to attend last year’s COMPSTAT in Geneva, Switzerland, and the IMS conference in Sydney. Having exhausted local funding, she has now convinced several other organizations to support her conference habit.

Now she just has to finish that thesis…

Paperpile makes me more productive

One of the first things I tell my new research students is to use a reference management system to help them keep track of the papers they read, and to assist in creating bib files for their bibliography. Most of them use Mendeley, one or two use Zotero. Both do a good job and both are free.

I use neither. I did use Mendeley for several years (and blogged about it a few years ago), but it became slower and slower to sync as my reference collection grew. Eventually it simply couldn’t handle the load. I have over 11,000 papers in my collection, and I was spending several minutes every day waiting for Mendeley just to update the database.

Then I came across Paperpile, which is not as well known as some of its competitors, but is truly awesome. I’ve now been using it for over a year, and I have grown to depend on it every day to keep track of all the papers I read, and to create my bib files.

Di Cook is moving to Monash

I’m delighted that Professor Dianne Cook will be joining Monash University in July 2015 as a Professor of Business Analytics. Di is an Australian who has worked in the US for the past 25 years, mostly at Iowa State University. She is moving back to Australia and joining the Department of Econometrics and Business Statistics in the Monash Business School, as part of our initiative in Business Analytics.

Di is a world leader in data visualization, and is well-known for her work on interactive graphics. She is also the academic supervisor of several leading data scientists including Hadley Wickham and Yihui Xie, both of whom work for RStudio.

Di has a great deal of energy and enthusiasm for computational statistics and data visualization, and will play a key role in developing and teaching our new subjects in business analytics.

The Monash Business School is already exceptionally strong in econometrics (ranked 7th in the world on RePEc), and forecasting (ranked 11th on RePEc), and we have recently expanded into actuarial science. With Di joining the department, we will be extending our expertise in the area of data visualization as well.



Congratulations to Dr Souhaib Ben Taieb

Souhaib Ben Taieb has been awarded his doctorate at the Université libre de Bruxelles and so he is now officially Dr Ben Taieb! Although Souhaib lives in Brussels, and was a student at the Université libre de Bruxelles, I co-supervised his doctorate (along with Professor Gianluca Bontempi). Souhaib is the 19th PhD student of mine to graduate.

His thesis was on “Machine learning strategies for multi-step-ahead time series forecasting” and is now available online. The prior research in this area has largely centred around two strategies (recursive and direct), and which one works better in certain circumstances. Recursive forecasting is the standard approach where a model is designed to predict one step ahead, and is then iterated to obtain multi-step-ahead forecasts. Direct forecasting involves using a separate forecasting model for each forecast horizon. Souhaib took a very different perspective from the prior research and has developed new strategies that are either hybrids of these two strategies, or completely different from either of them. The resulting forecasts are often significantly better than those obtained using the more traditional approaches.
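To make the two baseline strategies concrete, here is a minimal Python sketch. It is not taken from Souhaib’s thesis: the function names, the simulated series, and the choice of a simple linear regression on the most recent observation as the forecasting model are all assumptions made purely for illustration.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def recursive_forecast(series, horizon):
    """Fit ONE one-step model y[t+1] ~ y[t], then iterate it forward."""
    a, b = fit_line(series[:-1], series[1:])
    forecasts, last = [], series[-1]
    for _ in range(horizon):
        last = a + b * last  # feed each forecast back in as the next input
        forecasts.append(last)
    return forecasts


def direct_forecast(series, horizon):
    """Fit a SEPARATE model y[t+h] ~ y[t] for each horizon h."""
    forecasts = []
    for h in range(1, horizon + 1):
        a, b = fit_line(series[:-h], series[h:])
        forecasts.append(a + b * series[-1])
    return forecasts


# Hypothetical example data: a simulated autoregressive series.
random.seed(42)
series = [0.0]
for _ in range(199):
    series.append(0.8 * series[-1] + random.gauss(0, 1))

print("recursive:", recursive_forecast(series, 3))
print("direct:   ", direct_forecast(series, 3))
```

With this toy model the two strategies agree at the first horizon, because both fit the same one-step regression. They differ from the second horizon onwards, where the recursive approach compounds its one-step model and the direct approach estimates each horizon separately.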

Some of the papers to come out of Souhaib’s thesis are already available on his Google Scholar page.

Well done Souhaib, and best wishes for the future.




Visit of Di Cook

Next week, Professor Di Cook from Iowa State University is visiting my research group at Monash University. Di is a world leader in data visualization, and is especially well-known for her work on interactive graphics and the XGobi and GGobi software. See her book with Deb Swayne for details.

For those wanting to hear her speak, read on.

Varian on big data

Last week my research group discussed Hal Varian’s interesting new paper on “Big data: new tricks for econometrics”, Journal of Economic Perspectives, 28(2): 3-28.

It’s a nice introduction to trees, bagging and forests, plus a very brief entrée to the LASSO and the elastic net, and to spike-and-slab regression. Not enough to be able to use them, but OK if you’ve no idea what they are.

To explain or predict?

Last week, my research group discussed Galit Shmueli’s paper “To explain or to predict?”, Statistical Science, 25(3), 289-310. (See her website for further materials.) This is a paper everyone doing statistics and econometrics should read as it helps to clarify a distinction that is often blurred. In the discussion, the following issues were covered amongst other things.

  1. The AIC is better suited to model selection for prediction as it is asymptotically equivalent to leave-one-out cross-validation in regression, or one-step cross-validation in time series. On the other hand, it might be argued that the BIC is better suited to model selection for explanation, as it is consistent.
  2. P-values are associated with explanation, not prediction. It makes little sense to use p-values to determine the variables in a model that is being used for prediction. (There are problems in using p-values for variable selection in any context, but that is a different issue.)
  3. Multicollinearity has a very different impact when your goal is prediction than it does when your goal is estimation. When predicting, multicollinearity is not really a problem provided the values of your predictors lie within the hyper-region of the predictors used when estimating the model.
  4. An ARIMA model has no explanatory use, but is great at short-term prediction.
  5. Missing values in regression need to be handled differently in a predictive context than in an explanatory context. For example, when building an explanatory model, you could just use all the data for which you have complete observations (assuming there is no systematic nature to the missingness). But when predicting, you need to be able to predict using whatever data you have. So you might have to build several models, with different numbers of predictors, to allow for different variables being missing.
  6. Many statistics and econometrics textbooks fail to observe these distinctions. In fact, a lot of statisticians and econometricians are trained only in the explanation paradigm, with prediction an afterthought. That is unfortunate as most applied work these days requires predictive modelling, rather than explanatory modelling.
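
The first point in the list can be illustrated with a small Python sketch that compares two regression models using both the AIC and leave-one-out cross-validation. Everything here is an assumption for illustration only: the simulated data, the function names, the Gaussian form of the AIC (computed up to an additive constant that is the same for both models), and the use of the standard leverage shortcut for linear models to obtain the leave-one-out residuals without refitting.

```python
import math
import random


def fit_mean(x, y):
    """Intercept-only model; returns (fitted values, leverages, n_params)."""
    n = len(y)
    mean_y = sum(y) / n
    return [mean_y] * n, [1 / n] * n, 1


def fit_linear(x, y):
    """Simple linear regression; returns (fitted values, leverages, n_params)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    fitted = [a + b * xi for xi in x]
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return fitted, leverage, 2


def aic(y, fitted, k):
    """Gaussian AIC up to an additive constant: n * log(RSS / n) + 2k."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return n * math.log(rss / n) + 2 * k


def loo_cv_mse(y, fitted, leverage):
    """Leave-one-out MSE via the shortcut e_i / (1 - h_ii) for linear models."""
    n = len(y)
    return sum(((yi - fi) / (1 - hi)) ** 2
               for yi, fi, hi in zip(y, fitted, leverage)) / n


# Hypothetical data: a linear trend plus Gaussian noise.
random.seed(1)
x = [float(i) for i in range(30)]
y = [1 + 0.5 * xi + random.gauss(0, 1) for xi in x]

for name, fit in [("mean only", fit_mean), ("linear", fit_linear)]:
    fitted, leverage, k = fit(x, y)
    print(name, "AIC:", aic(y, fitted, k),
          "LOO-CV MSE:", loo_cv_mse(y, fitted, leverage))
```

On data like this, where the trend is strong relative to the noise, both criteria prefer the linear model. The asymptotic equivalence mentioned above is a statement about large samples, so the two criteria need not agree in every small-sample comparison.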