Reproducibility in computational research

Jane Frazier spoke at our research team meeting today on “Reproducibility in computational research”. We had a very stimulating and lively discussion about the issues involved. One interesting idea was that reproducibility is on a scale, and we can all aim to move further along the scale towards making our own research more reproducible. For example:

  • Can you reproduce your results tomorrow on the same computer with the same software installed?
  • Could someone else on a different computer reproduce your results with the same software installed?
  • Could you reproduce your results in 3 years' time, after some of your software environment may have changed?
  • etc.

Think about what changes you need to make to move one step further along the reproducibility continuum, and do it.

Jane’s slides and handout are below.

R vs Autobox vs ForecastPro vs …

Every now and then a commercial software vendor claims on social media that its software is far better than the forecast package for R, without providing any details.

There are lots of reasons why you might select a particular software solution, and R isn’t for everyone. But anyone claiming superiority should at least provide some evidence rather than make unsubstantiated claims.

New Australian data on the HMD

The Human Mortality Database is a wonderful resource for anyone interested in demographic data. It is a carefully curated collection of high-quality death and population data from 37 countries, all in a consistent format with consistent definitions. I have used it many times and never cease to be amazed at the care taken to maintain such a great resource.

The data are continually being revised and updated. Today the Australian data have been updated to 2011. There is a time lag because lagged death registrations result in undercounts, so only data that are likely to be complete are included.

Tim Riffe from the HMD has provided the following information about the update:

  1. All death counts since 1964 are now included by year of occurrence, up to 2011. We have 2012 data but do not publish them because they are likely a 5% undercount due to lagged registration.
  2. Death count inputs for 1921 to 1963 are now in single ages. Previously they were in 5-year age groups. Rather than having an open age group of 85+ in this period, counts usually go up to the maximum observed (stated) age. This change (i) introduces minor heaping in early years and (ii) implies different apparent old-age mortality than before, since previously anything above 85 was modeled according to the Methods Protocol.
  3. Population denominators have been swapped out for years 1992 to the present, owing to new ABS methodology and intercensal estimates for the recent period.

Some of the data can be read into R using the hmd.mx and hmd.e0 functions from the demography package. Tim has his own package on GitHub that provides a more extensive interface.
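
For anyone who wants to try this, here is a minimal sketch using the demography package; the username and password are placeholders, since a (free) HMD account is required to download the data.

    # Sketch: reading the Australian HMD data into R with the demography package.
    # An HMD account is required; the credentials below are placeholders.
    library(demography)

    aus.mx <- hmd.mx("AUS", "username", "password", label = "Australia")
    aus.e0 <- hmd.e0("AUS", "username", "password")

    plot(aus.mx, series = "female")  # log death rates by age and year
    plot(aus.e0)                     # life expectancy at birth over time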

Errors on percentage errors

The MAPE (mean absolute percentage error) is a popular measure for forecast accuracy and is defined as

    \[\text{MAPE} = 100\,\text{mean}(|y_t - \hat{y}_t|/|y_t|)\]

where $y_t$ denotes an observation and $\hat{y}_t$ denotes its forecast, and the mean is taken over $t$.
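
As a quick illustration (not part of the original definition), the MAPE can be computed directly in R. The mape() helper and the naive forecast below are purely for demonstration; the accuracy() function in the forecast package also reports MAPE among its error measures.

    # Sketch: computing the MAPE for a vector of observations y and forecasts yhat.
    mape <- function(y, yhat) {
      100 * mean(abs(y - yhat) / abs(y))
    }

    # Illustration: a naive forecast of the last two years of AirPassengers
    train <- window(AirPassengers, end = c(1958, 12))
    test  <- window(AirPassengers, start = c(1959, 1))
    fc    <- rep(tail(train, 1), length(test))  # repeat the last observed value
    mape(as.numeric(test), fc)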

Armstrong (1985, p.348) was the first (to my knowledge) to point out the asymmetry of the MAPE, saying that “it has a bias favoring estimates that are below the actual values”.

Job at Center for Open Science

This looks like an interesting job.

Dear Dr. Hyndman,

I write from the Center for Open Science, a non-profit organization based in Charlottesville, Virginia in the United States, which is dedicated to improving the alignment between scientific values and scientific practices. We are dedicated to open source and open science.

We are reaching out to you to find out if you know anyone who might be interested in our Statistical and Methodological Consultant position.

The position is a unique opportunity to consult on reproducible best practices in data analysis and research design; the consultant will make short visits to provide lectures and training at universities, laboratories, conferences, and through virtual mediums. An especially unique part of the job involves collaborating with the White House’s Office of Science and Technology Policy on matters relating to reproducibility.

If you know someone with substantial training and experience in scientific research, quantitative methods, reproducible research practices, and some programming experience (at least R, ideally Python or Julia), might you please pass this along to them?

Anyone may find out more about the job or apply via our website:

The position is full-time and located at our office in beautiful Charlottesville, VA.

Thanks in advance for your time and help.

More time series data online

Earlier this week I had coffee with Ben Fulcher, who told me about his online collection of about 30,000 time series, mostly medical series such as ECG measurements, along with meteorological series, birdsong, etc. There are some finance series, but not much else from a business or economic context, although he does include my Time Series Data Library. In addition, he provides Matlab code to compute a large number of characteristics. Anyone wanting to test time series algorithms on a large collection of data should take a look.

Unfortunately there is no R code, and no R interface for downloading the data.

Reflections on UseR! 2013

This week I’ve been at the R Users conference in Albacete, Spain. These conferences are a little unusual in that they are not really about research, unlike most conferences I attend. They provide a place for people to discuss and exchange ideas on how R can be used.

Here are some thoughts and highlights of the conference, in no particular order.