A blog by Rob J Hyndman 


Use fake data and real data

Published on 11 June 2010

When developing new statistical methods, it is very useful to test them on both fake data (i.e., simulations) and real data.

Testing on fake data is useful because then you know the “true” answer and can check the procedure under ideal conditions. If your method doesn’t work when the data are designed for the task, it is unlikely to work in real conditions. Fake data also enable you to test the robustness of your method when the conditions aren’t perfect: for example, try adding some nasty outliers and see if the method still works. With fake data, you can generate as many samples as you need, thus ensuring that what you see is real (statistically significant) rather than just an odd example.
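As a rough illustration of this idea, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for the example: the true mean of 10, the sample size, the outlier value of 100, and the choice of the sample mean and median as the "methods" under test. It checks each estimator under ideal conditions, then again after planting a few gross outliers, and averages the error over many simulated samples so the comparison does not rest on one odd example.

```python
import random
import statistics


def simulate_sample(rng, n=50, true_mean=10.0, sd=1.0,
                    n_outliers=0, outlier_value=100.0):
    """Draw n points from N(true_mean, sd), then overwrite the first
    n_outliers points with a gross outlier (illustrative contamination)."""
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    for i in range(n_outliers):
        sample[i] = outlier_value
    return sample


def mean_abs_error(estimator, n_reps=500, seed=1, **kwargs):
    """Average absolute error of an estimator of the true mean,
    taken over n_reps independently simulated samples."""
    rng = random.Random(seed)
    true_mean = kwargs.get("true_mean", 10.0)
    errors = [
        abs(estimator(simulate_sample(rng, **kwargs)) - true_mean)
        for _ in range(n_reps)
    ]
    return statistics.fmean(errors)


# Ideal conditions: the data are exactly what the estimators expect.
clean_mean = mean_abs_error(statistics.fmean)
clean_median = mean_abs_error(statistics.median)

# Imperfect conditions: three nasty outliers in every sample.
dirty_mean = mean_abs_error(statistics.fmean, n_outliers=3)
dirty_median = mean_abs_error(statistics.median, n_outliers=3)

print("clean data: mean", clean_mean, "median", clean_median)
print("with outliers: mean", dirty_mean, "median", dirty_median)
```

Because the true mean is known, the error of each estimator can be measured directly, which is exactly what real data cannot offer.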

A further advantage of fake data is that anyone can reproduce your work and check (or extend) your results. Sometimes real data cannot be distributed due to restrictions imposed by the owner of the data. But there are never restrictions on fake data. You just have to make sure you explain the data generating process sufficiently clearly that other people can replicate what you’ve done.
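One way to make a data generating process unambiguous is to publish it as code with a fixed random seed. The sketch below is only an example: the AR(1) process, the parameter values, the seed, and the function name `generate_ar1` are all hypothetical choices, not taken from any particular paper.

```python
import random


def generate_ar1(n, phi=0.7, sigma=1.0, seed=2010):
    """Simulate n observations from y[t] = phi * y[t-1] + e[t],
    with e[t] ~ N(0, sigma). The first value is simply e[0]; it is
    not drawn from the stationary distribution (a simplification)."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the output
    y = [rng.gauss(0.0, sigma)]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, sigma))
    return y


first = generate_ar1(100)
second = generate_ar1(100)
print(first == second)  # same seed and parameters give the same series
```

Anyone who runs this with the same Python version and the same arguments should get the same series, so the description of the process and the data themselves cannot drift apart.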

Testing on real data is useful because it gives some indication of whether your method will be useful in reality and not just in theory.

Yeasmin Khandakar and I once developed a neat method for selecting the order of an ARIMA model which worked wonderfully well on fake data that were generated from ARIMA processes, but failed on any real data. The problem seemed to be that it was particularly sensitive to model mis-specification. So when the data had any features that were not typical of ARIMA processes, the method failed. No real data are genuinely ARIMA processes, and so the method is not particularly useful (and has never been published).

On the other hand, damped exponential smoothing works better than you would expect, even on data that come from processes for which damped exponential smoothing is far from theoretically optimal. In chapter 7 of my exponential smoothing book, we showed (with real data) that using a damped exponential smoothing model for all series gives results that are almost as good as those obtained after a computationally intensive search for an optimal model over the entire model space.
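For readers who have not met the method, here is a minimal sketch of the additive damped trend recursions in Python. It is not the procedure used in the book: the smoothing parameters are fixed illustrative values rather than estimates, the initial level and trend come from a simple heuristic (the first observation and the first difference), and the function name `damped_trend_forecast` is hypothetical.

```python
def damped_trend_forecast(y, horizon, alpha=0.8, beta=0.2, phi=0.9):
    """Point forecasts from additive damped trend exponential smoothing.

    level[t] = alpha * y[t] + (1 - alpha) * (level[t-1] + phi * trend[t-1])
    trend[t] = beta * (level[t] - level[t-1]) + (1 - beta) * phi * trend[t-1]
    forecast at h steps ahead = level + (phi + phi**2 + ... + phi**h) * trend
    """
    if len(y) < 2:
        raise ValueError("need at least two observations")

    # Heuristic initial state (an assumption of this sketch).
    level = y[0]
    trend = y[1] - y[0]

    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend

    forecasts = []
    damp_sum = 0.0
    for h in range(1, horizon + 1):
        damp_sum += phi ** h  # cumulative damping: phi + phi^2 + ... + phi^h
        forecasts.append(level + damp_sum * trend)
    return forecasts


series = [0.0, 2.0, 4.0, 6.0, 8.0]
undamped = damped_trend_forecast(series, 3, phi=1.0)  # phi = 1 removes the damping
damped = damped_trend_forecast(series, 3, phi=0.9)
print(undamped)
print(damped)
```

With `phi` below 1, each successive forecast step adds a smaller share of the trend, so the forecast path flattens out instead of extrapolating the trend indefinitely.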

