Statistical tests for variable selection

I received an email today with the following comment:

I’m using ARIMA with intervention detection and was planning to use your package to identify my initial ARIMA model for later iteration. However, I found that sometimes the auto.arima function returns a model where AR/MA coefficients are not significant. So my question is: is there a way to filter the search for ARIMA models that only have significant coefficients? I can remove the non-significant coefficients, but I think it would be better to search for those models that only have significant coefficients.

Statistical significance is not usually a good basis for determining whether a variable should be included in a model, despite the fact that many people who should know better use significance tests for exactly this purpose. Even some textbooks discuss variable selection using statistical tests, thus perpetuating bad statistical practice.

Statistical tests were designed to test hypotheses, not to select variables. A test on a coefficient answers a different question from whether the variable is useful in forecasting. It is possible to have an insignificant coefficient on a variable that is useful for forecasting. It is also possible to have a significant coefficient on a variable that is better omitted when forecasting.

To see why the first situation occurs, think about two highly correlated predictor variables. It may be that the model including both gives the best forecasts, yet statistical tests on the coefficients can give insignificant values because it is hard to distinguish their separate contributions (which makes the standard errors on their coefficients large). This is almost always a problem with AR coefficients, because the corresponding predictors are lagged versions of each other and so are often highly correlated.
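This can be seen in a small regression simulation. The sketch below (base R, with made-up data; the variable names and numbers are mine, not from the post) builds two nearly identical predictors, both of which genuinely drive y:

```r
# Two highly correlated predictors, both truly in the data-generating
# process, yet individually "insignificant" in the joint fit.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is almost a copy of x1
y  <- x1 + x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
summary(fit)$coefficients
# The standard errors on x1 and x2 are inflated by the collinearity
# (their correlation is about 0.999), so the individual t-statistics
# are typically small -- even though the pair of predictors together
# is clearly useful for prediction.
```

The same mechanism operates for AR terms: the lagged series used as predictors are strongly correlated with one another, so individual coefficient tests can look insignificant in a model that forecasts well.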

The second situation occurs, for example, when a predictor has high variability and a small coefficient. When the sample size is large enough, the estimated coefficient may be statistically significant. But for forecasting purposes, including the predictor increases the variance of the forecasts without contributing much additional information.
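A minimal base-R sketch of the large-sample half of this point (my own simulated numbers, not from the post): with enough data, even a coefficient that explains almost none of the variation is statistically significant.

```r
# A tiny true effect becomes "significant" once n is large enough,
# while contributing almost nothing to prediction.
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)  # effect is real but practically negligible

fit <- lm(y ~ x)
summary(fit)$coefficients["x", ]  # t-statistic around 6: highly significant
summary(fit)$r.squared            # on the order of 0.0004
# Keeping x in a forecasting model buys almost no information, but the
# estimated coefficient still adds estimation variance to every forecast.
```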

See Harrell’s book Regression Modeling Strategies for further discussion of the misuse of statistical tests for variable selection.

A much more reliable guide to selecting terms in any model, including ARIMA models, is to use cross-validation or an approximation to it such as the AIC. The auto.arima() function from the forecast package in R uses the AIC by default and usually chooses a reasonably good model for forecasting. If users wish to experiment with other models, use the AIC for comparison, not significance tests of the coefficients.
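As a sketch of what AIC-based comparison looks like (base R's arima() on the built-in lh series, chosen purely for illustration; auto.arima() automates this search over a much larger model space):

```r
# Fit a few candidate ARIMA models and compare them by AIC,
# rather than by the significance of individual coefficients.
fit1 <- arima(lh, order = c(1, 0, 0))
fit2 <- arima(lh, order = c(2, 0, 0))
fit3 <- arima(lh, order = c(1, 0, 1))
c(AR1 = AIC(fit1), AR2 = AIC(fit2), ARMA11 = AIC(fit3))
# Choose the model with the smallest AIC; a term that lowers the AIC
# earns its place even if its individual z-statistic looks "insignificant".
```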

Comments:

  • Vinh Nguyen

    I couldn’t agree more! Rarely do I come by a post that I agree with completely.

  • Peter Cahusac

Excellent post, sensible comments, thanks.

  • Ken

Nice post. Could you perhaps talk more about:

1. AIC vs AICc. When should one use the AICc? When n/k < 30, where n is the sample size and k is the number of estimated parameters?

2. Is the AICc valid only for linear models with exogenous regressors?

3. Does CV suffer from small-sample issues (especially in the case of the linear regression model)?

4. At what sample size (relative, I guess, to the number of estimated parameters) does the AIC become more or less equivalent to CV?

5. When you have a lot of regressors, you can’t estimate your model. Does it matter whether you add one variable at a time, or start from h regressors and remove one at a time until you reach the minimum-CV or minimum-AIC model, and then test the remaining variables one at a time to see whether any reduces the CV or AIC further?

I believe that many undergrad and grad students would benefit greatly from hearing your response to the above questions.

Thanks much, and keep up your excellent work.

  • Tarmo Leinonen

“If users wish to experiment with other [sub] models, use the AIC [or something] for comparison”

This http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=29 might do the trick. Note the link to the code on the page.

(A related bug https://stat.ethz.ch/pipermail/r-devel/2009-June/053694.html seems to still be present in R version 2.13.1.)

“a problem with AR coefficients … the corresponding predictors are lagged variations of each other and often highly correlated”

And MA coefficients and AR coefficients are correlated too. The estimation frequently produces AR and MA coefficients that cancel each other out, even when the confidence interval estimates claim that the coefficients are very significantly different from zero. To detect these mirage coefficients, re-estimate with either max.p=0 or max.q=0.

cov2cor(vcov(fit)) (for a fitted model fit) is supposed to reveal (unwanted) correlations between coefficients.

  • mike

Excellent post!
Rob, how do we know whether a time series depends on t or not? I mean, for example, how do we know whether product demand is correlated over time or not?

  • Adrian

Thanks for the excellent post. Would your answer change if the p-value were constructed using Newey and West’s HAC estimator?