New Australian data on the HMD

The Human Mortality Database is a wonderful resource for anyone interested in demographic data. It is a carefully curated collection of high quality deaths and population data from 37 countries, all in a consistent format with consistent definitions. I have used it many times and never cease to be amazed at the care taken to maintain such a great resource.

The data are continually being revised and updated. Today the Australian data have been updated to 2011. There is a time lag because lagged death registrations result in undercounts, so only data that are likely to be complete are included.

Tim Riffe from the HMD has provided the following information about the update:

  1. All death counts since 1964 are now included by year of occurrence, up to 2011. We have 2012 data but do not publish them because they are likely a 5% undercount due to lagged registration.
  2. Death count inputs for 1921 to 1963 are now in single ages. Previously they were in 5-year age groups. Rather than having an open age group of 85+ in this period, counts usually go up to the maximum observed (stated) age. This change (i) introduces minor heaping in early years and (ii) implies different apparent old-age mortality than before, since previously anything above 85 was modeled according to the Methods Protocol.
  3. Population denominators have been swapped out for years 1992 to the present, owing to new ABS methodology and intercensal estimates for the recent period.

Some of the data can be read into R using the hmd.mx and hmd.e0 functions from the demography package. Tim has his own package on GitHub that provides a more extensive interface.
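For example, the Australian data can be downloaded directly within R. The following is a minimal sketch using the demography package; it assumes you have registered for a (free) HMD account, and the username and password shown are placeholders.

    # Minimal sketch: download Australian HMD data with the demography package.
    # Assumes an HMD account; replace the credentials with your own.
    library(demography)

    # Mortality rates and exposures by single year of age
    aus.mx <- hmd.mx("AUS", username = "your@email.com",
                     password = "yourpassword", label = "Australia")

    # Period life expectancy at birth
    aus.e0 <- hmd.e0("AUS", username = "your@email.com",
                     password = "yourpassword")

    plot(aus.mx, series = "female")  # log mortality rates by age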

Errors on percentage errors

The MAPE (mean absolute percentage error) is a popular measure for forecast accuracy and is defined as

    \[\text{MAPE} = 100\,\text{mean}(|y_t - \hat{y}_t|/|y_t|)\]

where $y_t$ denotes an observation and $\hat{y}_t$ denotes its forecast, and the mean is taken over $t$.

Armstrong (1985, p.348) was the first (to my knowledge) to point out the asymmetry of the MAPE, saying that “it has a bias favoring estimates that are below the actual values”.
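A minimal R sketch (the mape function and the numbers are illustrative, not from the post) makes the definition concrete and shows the asymmetry: for the same absolute error, a forecast above the actual value produces a larger percentage error than one below it, because the actual value sits in the denominator.

    # Illustrative sketch of the MAPE and its asymmetry (hypothetical values).
    mape <- function(y, yhat) {
      100 * mean(abs(y - yhat) / abs(y))
    }

    mape(y = 100, yhat = 150)  # actual 100, forecast 50 too high: APE = 50%
    mape(y = 150, yhat = 100)  # actual 150, forecast 50 too low:  APE = 33.3%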

Job at Center for Open Science

This looks like an interesting job.

Dear Dr. Hyndman,

I write from the Center for Open Science, a non-profit organization based in Charlottesville, Virginia in the United States, which is dedicated to improving the alignment between scientific values and scientific practices. We are dedicated to open source and open science.

We are reaching out to you to find out if you know anyone who might be interested in our Statistical and Methodological Consultant position.

The position is a unique opportunity to consult on reproducible best practices in data analysis and research design; the consultant will make short visits to provide lectures and training at universities, laboratories, conferences, and through virtual mediums. An especially unique part of the job involves collaborating with the White House's Office of Science and Technology Policy on matters relating to reproducibility.

If you know someone with substantial training and experience in scientific research, quantitative methods, reproducible research practices, and some programming experience (at least R, ideally Python or Julia), might you please pass this along to them?

Anyone may find out more about the job or apply via our website:

http://centerforopenscience.org/jobs/#stats

The position is full-time and located at our office in beautiful Charlottesville, VA.

Thanks in advance for your time and help.

More time series data online

Earlier this week I had coffee with Ben Fulcher, who told me about his online collection comprising about 30,000 time series: mostly medical series such as ECG measurements, along with meteorological series, birdsong, etc. There are some finance series, but not much other data from a business or economic context, although he does include my Time Series Data Library. In addition, he provides Matlab code to compute a large number of characteristics. Anyone wanting to test time series algorithms on a large collection of data should take a look.

Unfortunately there is no R code, and no R interface for downloading the data.

Reflections on UseR! 2013

This week I've been at the R Users conference in Albacete, Spain. These conferences are a little unusual in that they are not really about research, unlike most conferences I attend. They provide a place for people to discuss and exchange ideas on how R can be used.

Here are some thoughts and highlights of the conference, in no particular order.

Makefiles for R/LaTeX projects

Updated: 21 November 2012

Make is a marvellous tool used by programmers to build software, but it can be used for much more than that. I use make whenever I have a large project involving R files and LaTeX files, which means I use it for almost all of the papers I write, and almost all of the consulting reports I produce.

COMPSTAT2012

This week I'm in Cyprus attending the COMPSTAT2012 conference. There's been the usual interesting collection of talks, and interactions with other researchers. But I was struck by two side comments in talks this morning that I'd like to mention.

Stephen Pollock: Don't imagine your model is the truth

Actually, Stephen said something like “economists (or was it econometricians?) have a bad habit of imagining their models are true”. He gave the example of people asking whether GDP “has a unit root”. GDP is an economic measurement. It no more has a unit root than I do. But the models used to approximate the dynamics of GDP may have a unit root. This is an example of confusing your data with your model. Or, to put it the other way around, imagining that the model is true rather than an approximation. A related thing that tends to annoy me is to refer to the model as the “data generating process”. No model is a data generating process, unless the data were obtained by simulation from the model. Models are only ever approximations, and imagining that they are data generating processes only leads to over-confidence and bad science.

Matías Salibián-Barrera: Make all your code public

After giving an interesting survey of the robustbase and rrcov packages for R, Matías spent the last ten minutes of his talk presenting the case for reproducible research and arguing for making R code public as much as possible. The benefits of making our code public are obvious:

  • The research can be reproduced and checked by others. This is simply good science.
  • Your work will be cited more frequently. Other researchers are much less likely to refer to your work if they have to implement your methods themselves. But if you make it easy, then people will use your methods and consequently cite your papers.

He also said something like this: “Don't wait until journals require you to submit code and data; start now by putting your code and data on a website.” I agree. Every methodological paper should have an R package as a complement. If that's too much work, at least put some code on a website so that other people can implement your method. What's the point of hiding your code? In some ways, the code is more important than the accompanying paper, as it represents a precise description of the method whereas the written paper may not include all the necessary details.

How to avoid annoying a referee

It's not a good idea to annoy the referees of your paper. They make recommendations to the editor about your work and it is best to keep them happy. There is an interesting discussion on stats.stackexchange.com on this subject. This inspired my own list below.

  • Explain what you've done clearly, avoiding unnecessary jargon.
  • Don't claim your paper contributes more than it actually does. (I refereed a paper this week where the author claimed to have invented principal component analysis!)
  • Ensure all figures have clear captions and labels.
  • Include citations to the referee's own work. Obviously you don't know who is going to referee your paper, but you should aim to cite the main work in the area. It places your work in context, and keeps the referees happy if they are the authors.
  • Make sure the cited papers say what you think they say. Sight what you cite!
  • Include proper citations for all software packages. If you are unsure how to cite an R package, try the command citation("packagename") (see the example after this list).
  • Never plagiarise from other papers — not even sentence fragments. Use your own words. I've refereed a thesis which had slabs taken from my own lecture notes including the typos.
  • Don't plagiarise from your own papers. Either reference your earlier work, or provide a summary in new words.
  • Provide enough detail so your work can be replicated. Where possible, provide the data and code. Make sure the code works.
  • When responding to referee reports, make sure you answer everything asked of you. (See my earlier post “Always listen to reviewers”)
  • If you've revised the paper based on referees' comments, then thank them in the acknowledgements section.
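As an aside on the software-citation point above, R will generate citation details (and a BibTeX entry) for any installed package; the demography package is used below only as an example.

    # Citation details for an installed package (package name is just an example)
    citation("demography")

    # The same citation as a BibTeX entry, ready for your reference list
    toBibtex(citation("demography"))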

For some applied papers, there are specific statistical issues that need attention:

  • Give effect sizes with confidence intervals, not just p-values (see the sketch after this list).
  • Don't describe data using the mean and standard deviation without indicating whether the data were more-or-less symmetric and unimodal.
  • Don't split continuous data into groups.
  • Make sure your data satisfy the assumptions of the statistical methods used.
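To illustrate the first point in this list, a simple two-group comparison in R already returns an effect size with a confidence interval alongside the p-value; the data below are simulated purely for illustration.

    # Report an effect size with a confidence interval, not just a p-value.
    set.seed(1)
    x <- rnorm(30, mean = 10, sd = 2)   # simulated treatment group
    y <- rnorm(30, mean = 11, sd = 2)   # simulated control group

    fit <- t.test(x, y)
    fit$estimate   # group means; the effect size is their difference
    fit$conf.int   # 95% confidence interval for the difference in means
    fit$p.value    # the p-value alone says much less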

More tongue-in-cheek advice is provided by Stratton and Neil (2005), “How to ensure your paper is rejected by the statistical reviewer”, Diabetic Medicine, 22(4), 371–373.

Feel free to add your own suggestions over at stats.stackexchange.com.