A blog by Rob J Hyndman 


Facts and fallacies of the AIC

Published on 4 July 2013

Akaike’s Information Criterion (AIC) is a very useful model selection tool, but it is not as well understood as it should be. I frequently read papers, or hear talks, which demonstrate misunderstandings or misuse of this important tool. The following points should clarify some aspects of the AIC, and hopefully reduce its misuse.

  1. The AIC is a penalized likelihood, AIC = -2 log(L) + 2k where L is the maximized likelihood and k is the number of estimated parameters, and so it requires the likelihood to be maximized before it can be calculated. It makes little sense to compute the AIC if estimation is done using something else (e.g., minimizing MAPE). Normally, the residuals are assumed to be Gaussian, and then ML estimates are often (but not always) equivalent to LS estimates. In these cases, computing the AIC after minimizing the MSE is OK.
  2. A model selected by the AIC after Gaussian MLE will give predictions equal to the conditional mean. If you then compare the predictions using MAE, or MAPE, or some other criterion, they may not perform well because these other criteria are not optimal for the conditional mean. Match the error measure to the estimation method.
  3. The AIC does not assume the residuals are Gaussian. It is just that the Gaussian likelihood is most frequently used. But if you want to use some other distribution, go ahead. The AIC is the penalized likelihood, whichever likelihood you choose to use.
  4. The AIC does not require nested models. One of the neat things about the AIC is that you can compare very different models. However, make sure the likelihoods are computed on the same data. For example, you cannot compare an ARIMA model with differencing to an ARIMA model without differencing, because you lose one or more observations via differencing. That is why auto.arima uses a unit root test to choose the order of differencing, and only uses the AIC to select the orders of the AR and MA components. (A short R sketch illustrating this appears after this list.)
  5. For a similar reason, you cannot compare the AIC from an ETS model with the AIC from an ARIMA model. The two models treat initial values differently. For example, after differencing, an ARIMA model is computed on fewer observations, whereas an ETS model is always computed on the full set of data. Even when the models are equivalent (e.g., an ARIMA(0,1,1) and an ETS(A,N,N)), the AIC values will be different. Effectively, the likelihood of an ETS model is conditional on the initial state vector, whereas the likelihood of a non-stationary ARIMA model is conditional on the first few observations, even when a diffuse prior is used for the nonstationary components.
  6. Beware of AIC values computed using conditional likelihoods, because the conditioning may be different for different models. Then the AIC values are not comparable.
  7. Frequently, the constant term in the AIC is omitted. That is fine for model selection, as the constant is the same for all models. But be careful comparing the AIC value between software packages, or between model classes, as they may treat the constant term differently, and then the AIC values are not comparable.
  8. The AIC is not really an “in-sample” measure. Yes, it is computed using the training data. But asymptotically, minimizing the AIC is equivalent to minimizing the leave-one-out cross-validation MSE for cross-sectional data, and equivalent to minimizing the out-of-sample one-step forecast MSE for time series models. This property is what makes it such an attractive criterion for use in selecting models for forecasting.
  9. The AIC is not a measure of forecast accuracy. Although it has the above cross-validation property, comparing AIC values across data sets is essentially meaningless. If you really want to measure the cross-validated MSE, then you will need to calculate it directly; a sketch of how to do so appears after this list.
  10. The AIC is not a consistent model selection method. That does not bother me, as I don’t believe there is a true model to be selected. The AIC is optimal (in some senses) for forecasting, and that is much more important in my opinion.
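
To make points 4 and 5 concrete, here is a minimal sketch in R using the forecast package. The simulated series and the particular model orders are arbitrary, chosen only for illustration.

    library(forecast)
    set.seed(1)
    y <- arima.sim(model = list(ar = 0.6), n = 100)   # simulated series, purely illustrative

    # Non-nested models fitted by Gaussian ML on the same data: the AICs are comparable.
    fit1 <- Arima(y, order = c(2, 0, 0))   # AR(2)
    fit2 <- Arima(y, order = c(0, 0, 3))   # MA(3)
    fit1$aic
    fit2$aic

    # Differencing changes the data on which the likelihood is computed,
    # so this AIC is not comparable with the two above.
    fit3 <- Arima(y, order = c(1, 1, 1))
    fit3$aic

    # Equivalent models, different AIC values: ETS(A,N,N) corresponds to ARIMA(0,1,1),
    # but the two likelihoods are conditioned differently, so the AICs are not comparable.
    fit4 <- ets(y, model = "ANN")
    fit5 <- Arima(y, order = c(0, 1, 1))
    fit4$aic
    fit5$aic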
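
And to illustrate point 9, here is a similarly rough sketch (reusing y and the forecast package from above) of how the cross-validated one-step MSE can be computed directly from a rolling forecast origin, with an arbitrary AR(1) model standing in for whatever model you want to assess.

    # Rolling one-step forecasts: refit the model at each origin and record the error.
    e <- rep(NA, length(y))
    for (i in 30:(length(y) - 1)) {
      fit <- Arima(window(y, end = time(y)[i]), order = c(1, 0, 0))
      e[i + 1] <- y[i + 1] - forecast(fit, h = 1)$mean[1]
    }
    mean(e^2, na.rm = TRUE)   # cross-validated (out-of-sample) one-step MSE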

There are some further helpful clarifications in AIC Myths and Misunderstandings by Anderson and Burnham.


Comments:
  • Luis Apiolaza (http://apiolaza.net)

    ML doesn’t require nested models; however, some of us use REML (Restricted or Residual ML) which involves a projection of the data using the fixed effects. In that case the models to be compared require the same set of fixed effects to use AIC.

  • Stephan Kolassa

    Another reference on AIC which I use and refer to frequently is Burnham & Anderson’s monograph “Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach”. I especially like its approach (similar to your point 10 above) that there is no “true model” but an infinity of influences with what they call tapering effect sizes.
