Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.
The Human Mortality Database is a wonderful resource for anyone interested in demographic data. It is a carefully curated collection of high-quality death and population data from 37 countries, all in a consistent format with consistent definitions. I have used it many times and never cease to be amazed at the care taken to maintain such a great resource.
The data are continually being revised and updated, and today the Australian data were updated to 2011. There is a time lag because late death registrations lead to undercounts in recent years, so only data that are likely to be complete are included.
Tim Riffe from the HMD has provided the following information about the update:
- All death counts since 1964 are now included by year of occurrence, up to 2011. We have 2012 data but do not publish them because they are likely a 5% undercount due to lagged registration.
- Death count inputs for 1921 to 1963 are now in single ages. Previously they were in 5-year age groups. Rather than having an open age group of 85+, counts in this period usually go up to the maximum observed (stated) age. This change (i) introduces minor heaping in early years and (ii) implies different apparent old-age mortality than before, since previously anything above 85 was modelled according to the Methods Protocol.
- Population denominators have been swapped out for years 1992 to the present, owing to new ABS methodology and intercensal estimates for the recent period.
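If you want to pull the updated Australian data straight into R, the hmd.mx() function in my demography package can download it directly from the HMD. Here is a minimal sketch; you need a (free) HMD account, and the credentials below are placeholders:

```r
# Sketch: download Australian mortality data from the Human Mortality
# Database using the demography package. Requires a free HMD account;
# replace the username and password placeholders with your own.
library(demography)
aus <- hmd.mx("AUS", username = "your.email@example.com",
              password = "yourpassword", label = "Australia")
plot(aus, series = "male")  # age-specific male death rates, one line per year
```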
The MAPE (mean absolute percentage error) is a popular measure for forecast accuracy and is defined as
$$\text{MAPE} = 100\,\text{mean}\left(\left|\frac{y_t - \hat{y}_t}{y_t}\right|\right),$$
where $y_t$ denotes an observation and $\hat{y}_t$ denotes its forecast, and the mean is taken over $t$.
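The definition translates directly into R. A minimal sketch (the function name is mine, and it assumes none of the observations are zero, since the MAPE is undefined otherwise):

```r
# MAPE as defined above: 100 * mean(|y_t - yhat_t| / |y_t|).
# Assumes y contains no zeros.
mape <- function(y, yhat) {
  100 * mean(abs((y - yhat) / y))
}

y    <- c(102, 98, 110, 105)   # observations
yhat <- c(100, 100, 100, 100)  # forecasts
mape(y, yhat)  # approximately 4.46
```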
Armstrong (1985, p.348) was the first (to my knowledge) to point out the asymmetry of the MAPE, saying that “it has a bias favoring estimates that are below the actual values”.
This looks like an interesting job.
Dear Dr. Hyndman,
I write from the Center for Open Science, a non-profit organization based in Charlottesville, Virginia, in the United States, which is dedicated to improving the alignment between scientific values and scientific practices. We are committed to open source and open science.
We are reaching out to you to find out if you know anyone who might be interested in our Statistical and Methodological Consultant position.
The position is a unique opportunity to consult on reproducible best practices in data analysis and research design; the consultant will make short visits to provide lectures and training at universities, laboratories, and conferences, as well as virtually. An especially distinctive part of the job involves collaborating with the White House’s Office of Science and Technology Policy on matters relating to reproducibility.
If you know someone with substantial training and experience in scientific research, quantitative methods, and reproducible research practices, along with some programming experience (at least R, and ideally Python or Julia), might you please pass this along to them?
Anyone may find out more about the job or apply via our website:
The position is full-time and located at our office in beautiful Charlottesville, VA.
Thanks in advance for your time and help.
Earlier this week I had coffee with Ben Fulcher, who told me about his online collection of about 30,000 time series: mostly medical series such as ECG measurements, along with meteorological series, birdsong, and so on. There are some finance series, but not many other data from a business or economic context, although he does include my Time Series Data Library. In addition, he provides Matlab code to compute a large number of time series characteristics. Anyone wanting to test time series algorithms on a large collection of data should take a look.
Unfortunately there is no R code, and no R interface for downloading the data.
I recently co-authored a chapter on “Prospective Life Tables” for this book, edited by Arthur Charpentier. R code to reproduce the figures and to complete the exercises for our chapter is now available on GitHub. Code for the other chapters should also be available soon. The book can be pre-ordered on Amazon.
This week I’ve been at the R Users conference in Albacete, Spain. These conferences are a little unusual in that they are not really about research, unlike most conferences I attend. They provide a place for people to discuss and exchange ideas on how R can be used.
Here are some thoughts and highlights of the conference, in no particular order.
Updated: 21 November 2012
Make is a marvellous tool used by programmers to build software, but it can be used for much more than that. I use make whenever I have a large project involving R files and LaTeX files, which means I use it for almost all of the papers I write, and almost all of the consulting reports I produce.
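To give the flavour of it, here is a minimal sketch of a Makefile for a paper consisting of a LaTeX document and an R script that generates a figure. All file names are hypothetical placeholders, and recall that recipe lines must begin with a tab:

```make
# Rebuild paper.pdf whenever the LaTeX source or the figure changes.
paper.pdf: paper.tex figure1.pdf
	pdflatex paper
	bibtex paper
	pdflatex paper
	pdflatex paper

# Regenerate the figure whenever the R script changes.
figure1.pdf: figs.R
	Rscript figs.R

# Remove generated files.
clean:
	rm -f paper.pdf figure1.pdf *.aux *.bbl *.blg *.log
```

Running make then rebuilds only the pieces whose dependencies have changed, which is exactly what you want when the R computations are slow.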
This week I’m in Cyprus attending the COMPSTAT2012 conference. There’s been the usual interesting collection of talks, and interactions with other researchers. But I was struck by two side comments in talks this morning that I’d like to mention.
Stephen Pollock: Don’t imagine your model is the truth
Actually, Stephen said something like “economists (or was it econometricians?) have a bad habit of imagining their models are true”. He gave the example of people asking whether GDP “has a unit root”. GDP is an economic measurement; it no more has a unit root than I do. But the models used to approximate the dynamics of GDP may have a unit root. This is an example of confusing your data with your model, or, to put it the other way around, imagining that the model is true rather than an approximation. A related thing that tends to annoy me is referring to a model as the “data generating process”. No model is a data generating process, unless the data were obtained by simulation from the model. Models are only ever approximations, and imagining that they are data generating processes only leads to over-confidence and bad science.
Matías Salibián-Barrera: Make all your code public
After giving an interesting survey of the robustbase and rrcov packages for R, Matías spent the last ten minutes of his talk presenting the case for reproducible research and arguing for making R code public as much as possible. The benefits of making our code public are obvious:
- The research can be reproduced and checked by others. This is simply good science.
- Your work will be cited more frequently. Other researchers are much less likely to refer to your work if they have to implement your methods themselves. But if you make it easy, then people will use your methods and consequently cite your papers.
He also said something like this: “Don’t wait until journals require you to submit code and data; start now by putting your code and data on a website.” I agree. Every methodological paper should have an R package as a complement. If that’s too much work, at least put some code on a website so that other people can implement your method. What’s the point of hiding your code? In some ways, the code is more important than the written paper, as it represents a precise description of the method, whereas the paper may not include all the necessary details.