A blog by Rob J Hyndman 


Why I don’t like statistical tests

Published on 24 August 2009

It may come as a shock to discover that a statistician does not like statistical tests. Isn’t that what statistics is all about? Unfortunately, in some disciplines statistical analysis does seem to consist almost entirely of hypothesis testing, and therein lies the problem.

The standard practice is to construct a hypothesis test to determine whether some attribute of the data is “significant”, using the conventional p-value threshold of 5%. The analysis is perceived to be complete when the p-value comes in under 5%. However, almost any non-trivial null hypothesis will be rejected if enough data are collected. As George Box said, “all models are wrong, but some are useful”. So collecting more data will demonstrate that the proposed hypothesis is wrong, but that doesn’t make it useless.
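
To illustrate how sample size alone can drive significance, here is a minimal sketch. It assumes a two-sided z-test of a zero mean with known unit variance, and a hypothetical true effect of 0.01 standard deviations; the function name and the numbers are made up for illustration.

```python
import math

def expected_p_value(effect, n):
    """Two-sided z-test p-value when the sample mean equals `effect`.

    Assumes unit variance and a null hypothesis of zero mean (illustrative).
    """
    z = effect * math.sqrt(n)                 # the statistic grows like sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))   # P(|Z| > |z|) for a standard normal

# Hypothetical effect: 1% of a standard deviation, practically negligible.
for n in (100, 10_000, 1_000_000):
    print(n, expected_p_value(0.01, n))
```

The effect never changes, yet the p-value falls from far above 5% to far below it as n grows.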

Then there is the common confusion between statistically significant and practically significant. Just because something is statistically significant doesn’t mean it is important. And just because a p-value is larger than 0.05 does not mean the null hypothesis is true. Statisticians learn all this in first year, but the research literature is still riddled with papers that imply otherwise.

The next problem is that p-values are extremely sensitive to collinearity. Consequently, to use p-values based on t-tests to determine the significance of terms in a regression is silly. Often terms will appear insignificant, yet they should be included as they improve the predictions. Yet this approach is probably the most common method for determining what variables to include in a regression, even in some standard textbooks. The situation is even worse in autoregression, where the collinearity is often very strong.
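
A rough sketch of the collinearity effect, under stated assumptions: a regression with two unit-variance predictors whose correlation is r, where the standard error of each coefficient is approximately sigma / sqrt(n (1 - r^2)). The function name and all numbers are hypothetical.

```python
import math

def approx_t_statistic(beta, sigma, n, r):
    """Approximate t-statistic for one of two unit-variance predictors.

    Uses se = sigma / sqrt(n * (1 - r**2)), where r is the correlation
    between the predictors; 1 / (1 - r**2) is the variance inflation factor.
    """
    se = sigma / math.sqrt(n * (1.0 - r ** 2))
    return beta / se

# Same hypothetical coefficient and noise level, increasing collinearity.
for r in (0.0, 0.9, 0.99):
    print(r, approx_t_statistic(beta=0.3, sigma=1.0, n=100, r=r))
```

With the coefficient held fixed, the t-statistic drops below the usual 1.96 cut-off once the predictors are strongly correlated, even though the term is just as real as before.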

Another thing I dislike about statistical tests is the alternative hypothesis. This was not originally part of hypothesis testing as proposed by Fisher. It was introduced by Neyman and Pearson. Frankly, the alternative hypothesis is unnecessary. It is not used in the computation of p-values or for determining statistical significance. The only practical use for the alternative hypothesis that I can see is in determining the power of a test.

Finally, I hate one-sided tests even more than two-sided tests. They are little better than cheating. You claim that a parameter can only possibly move in one direction, and thereby cut your p-value in half. I suspect it is usually done to obtain significant results in order to increase the chances of publication. In reality, can we ever be really sure that a parameter can only be zero or positive?
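
The halving is easy to see numerically. This sketch assumes a standard normal test statistic; the value z = 1.8 is hypothetical, chosen because it sits between the one-sided and two-sided 5% cut-offs.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.8  # hypothetical test statistic
p_one_sided = normal_upper_tail(z)
p_two_sided = 2 * normal_upper_tail(abs(z))
print(p_two_sided, p_one_sided)
```

The same data give a two-sided p-value above 0.05 and a one-sided p-value below it.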

Now a good statistician can avoid all of these errors and use statistical tests honestly and appropriately. And I do occasionally use tests in my papers, hopefully avoiding the above problems. But I strongly prefer the predictive modelling approach. That is, if you have two potential models, choose the one that predicts best. Information criteria, such as the AIC, are perfect for this task.
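
As a minimal sketch of choosing between two models by AIC, the code below compares a constant-mean model with a linear trend. It assumes Gaussian errors, so that the AIC equals n log(RSS/n) + 2k up to a constant shared by both models; the data and helper names are made up for illustration.

```python
import math

def gaussian_aic(residuals, n_params):
    """AIC for a Gaussian model, dropping constants shared by all models.

    n_params counts the estimated coefficients plus the error variance.
    """
    n = len(residuals)
    rss = sum(e * e for e in residuals)
    return n * math.log(rss / n) + 2 * n_params

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Made-up data with a mild upward trend.
x = list(range(10))
y = [2.1, 1.8, 2.6, 2.4, 3.1, 2.7, 3.3, 3.0, 3.8, 3.5]

y_bar = sum(y) / len(y)
a, b = fit_line(x, y)

aic_mean = gaussian_aic([yi - y_bar for yi in y], n_params=2)
aic_line = gaussian_aic([yi - (a + b * xi) for xi, yi in zip(x, y)], n_params=3)

best = "trend" if aic_line < aic_mean else "mean only"
print(aic_mean, aic_line, best)
```

The model with the smaller AIC is preferred; no significance threshold is involved.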

In forecasting, the only place in which I find testing useful is in determining the order of integration of a time series; i.e., choosing d in an ARIMA(p,d,q) model. If I could come up with some way of doing this effectively without using a unit-root test, I would gladly do so. But so far, I have not found a reliable alternative.

For more on this topic, see my working paper with Andrey Kostenko.


1 Comment
  • Brett Inder

    A very provocative piece, particularly given the one-sided research agenda of some of our colleagues!
    Re use of criteria like AIC vs testing for the purpose of deciding which variables to include in a model, it could be argued that they are more similar than different. AIC-type criteria are just penalised likelihoods, so comparing two AICs is like performing a LR test, with critical values that are related to the penalties. So the differences just come down to how the critical value / p-value varies with sample size, etc.