Facts and fallacies of the AIC

Akaike’s Information Criterion (AIC) is a very useful model selection tool, but it is not as well understood as it should be. I frequently read papers, or hear talks, which demonstrate misunderstandings or misuse of this important tool. The following points should clarify some aspects of the AIC, and hopefully reduce its misuse.

  1. The AIC is a penalized likelihood, and so it requires the likelihood to be maximized before it can be calculated. It makes little sense to compute the AIC if estimation is done using something else (e.g., minimizing MAPE). Normally, the residuals are assumed to be Gaussian, and then ML estimates are often (but not always) equivalent to LS estimates. In these cases, computing the AIC after minimizing the MSE is ok.
  2. A model selected by the AIC after Gaussian MLE will give predictions equal to the conditional mean. If you then compare the predictions using MAE, or MAPE, or some other criterion, they may not perform well because these other criteria are not optimal for the conditional mean. Match the error measure to the estimation method.
  3. The AIC does not assume the residuals are Gaussian. It is just that the Gaussian likelihood is most frequently used. But if you want to use some other distribution, go ahead. The AIC is the penalized likelihood, whichever likelihood you choose to use.
  4. The AIC does not require nested models. One of the neat things about the AIC is that you can compare very different models. However, make sure the likelihoods are computed on the same data. For example, you cannot compare an ARIMA model with differencing to an ARIMA model without differencing, because you lose one or more observations via differencing. That is why auto.arima uses a unit root test to choose the order of differencing, and only uses the AIC to select the orders of the AR and MA components.
  5. For a similar reason, you cannot compare the AIC from an ETS model with the AIC from an ARIMA model. The two models treat initial values differently. For example, after differencing, an ARIMA model is computed on fewer observations, whereas an ETS model is always computed on the full set of data. Even when the models are equivalent (e.g., an ARIMA(0,1,1) and an ETS(A,N,N)), the AIC values will be different. Effectively, the likelihood of an ETS model is conditional on the initial state vector, whereas the likelihood of a non-stationary ARIMA model is conditional on the first few observations, even when a diffuse prior is used for the nonstationary components.
  6. Beware of AIC values computed using conditional likelihoods, because the conditioning may be different for different models. Then the AIC values are not comparable.
  7. Frequently, the constant term in the AIC is omitted. That is fine for model selection, as the constant is the same for all models. But be careful comparing AIC values between software packages, or between model classes, as they may treat the constant term differently, and then the AIC values are not comparable.
  8. The AIC is not really an “in-sample” measure. Yes, it is computed using the training data. But asymptotically, minimizing the AIC is equivalent to minimizing the leave-one-out cross-validation MSE for cross-sectional data, and equivalent to minimizing the out-of-sample one-step forecast MSE for time series models. This property is what makes it such an attractive criterion for use in selecting models for forecasting.
  9. The AIC is not a measure of forecast accuracy. Although it has the above cross-validation property, comparing AIC values across data sets is essentially meaningless. If you really want to measure the cross-validated MSE, then you will need to calculate it directly.
  10. The AIC is not a consistent model selection method. That does not bother me, as I don’t believe there is a true model to be selected. The AIC is optimal (in some senses) for forecasting, and that is much more important in my opinion.
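To make points 1 and 7 concrete, here is a minimal Python sketch on invented data: a Gaussian linear model fitted by least squares (which coincides with the MLE here), with AIC = 2k − 2 log L computed once from the full log-likelihood and once with the Gaussian constant dropped, as many packages report it. The data and variable names are purely illustrative, not any particular package's implementation.

```python
import numpy as np

# Invented data: y depends linearly on x, Gaussian noise.
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Least squares fit; for a Gaussian likelihood this is also the MLE.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / n          # ML estimate of the error variance

# Full Gaussian log-likelihood at the MLE.
loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
k = X.shape[1] + 1                  # two coefficients + the variance
aic_full = 2 * k - 2 * loglik

# Constant-dropped version often reported by software (point 7).
aic_short = n * np.log(sigma2) + 2 * k

# The two differ only by n*(log(2*pi) + 1), which is the same for every
# model fitted to this data, so model rankings agree.
print(aic_full - aic_short)
```

Because the difference is a constant for a fixed data set, either version ranks models identically; the danger is only in mixing the two conventions across packages or model classes.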
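Point 9 says that if you want the cross-validated MSE, you must calculate it directly. A minimal sketch of what that means for an ordinary least-squares model, again on invented data: brute-force leave-one-out refitting, checked against the standard leverage shortcut e_i / (1 − h_ii) that makes LOO cheap for linear models.

```python
import numpy as np

# Invented data for illustration.
rng = np.random.default_rng(42)
n = 50
x = rng.uniform(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), x])

# Brute force: refit with each observation held out in turn.
loo_errors = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo_errors[i] = y[i] - X[i] @ b
loo_mse = np.mean(loo_errors ** 2)

# Shortcut: e_i / (1 - h_ii), with h_ii the leverages
# (diagonal of the hat matrix X (X'X)^{-1} X').
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_full
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
shortcut_mse = np.mean((e / (1 - h)) ** 2)

# Both routes give the same cross-validated MSE for linear LS.
print(loo_mse, shortcut_mse)
```

The asymptotic equivalence in point 8 links minimizing the AIC to minimizing this quantity across models on the same data; the AIC value itself is still not an estimate of forecast error.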

There are some further helpful clarifications in AIC Myths and Misunderstandings by Anderson and Burnham.

  • http://apiolaza.net Luis Apiolaza

    ML doesn’t require nested models; however, some of us use REML (Restricted or Residual ML), which involves a projection of the data using the fixed effects. In that case the models to be compared require the same set of fixed effects to use the AIC.

  • Stephan Kolassa

    Another reference on the AIC which I use and refer to frequently is Burnham & Anderson’s monograph “Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach”. I especially like its approach (similar to your point 10 above) that there is no “true model” but an infinity of influences with what they call tapering effect sizes.

  • Pingback: AIC, Kullback-Leibler and a more general Information Criterion | Thiago G. Martins

  • Pingback: Hyndsight - Fitting models to short time series

  • Jesse Papenburg

    Thanks for highlighting some key points about the AIC. In point 4, “The AIC does not require nested models”, you note that the likelihoods must be estimated from the same data in order to compare AICs. Observations are lost with differencing; therefore, we should not compare the AIC of an ARIMA model with differencing to one without differencing.

    Is it then also true, because observations are lost with lagging, that one should not compare the AIC of a bivariate ARIMA model that used a lagged transfer function to the AIC of a bivariate ARIMA model that did not incorporate a lag (both models included the same input and response series)?

    • http://robjhyndman.com/ Rob J Hyndman

      That depends on whether the conditional likelihood or the full likelihood is being used.

      • http://www.thechildren.com/departments-and-staff/staff/jesse-papenburg-md-frcpc-pediatric-infectious-disease-specialist-and Jesse Papenburg

        Thanks for the prompt reply!
        I believe that I have been using the full likelihood; I have set the estimation method in SAS to the ML option. From the SAS manual: “The METHOD=ML option produces maximum likelihood estimates. The likelihood function is maximized via nonlinear least squares using Marquardt’s method.” This is in contrast to the Unconditional Least Squares and Conditional Least Squares options.

  • Christopher Waldeck

    Dr. Hyndman,

    I’m trying to synthesize your 5th point here and Anderson and Burnham’s second bullet point. From the context, I believe they mean “data set” in the traditional way — as a set of realizations of a data generation process. However, if this is the case, wouldn’t comparing a truncated (in the case of differencing) data set with the entire set still be a comparison of the same underlying process? Additionally, if you could expand on the effect of the initial (random) state vector of ETS models on the AIC calculation, it would be much appreciated. I believe you’re saying that the initial random state of the model is necessarily not equivalent to a maximum likelihood fit of the true, underlying random process, but I don’t have confidence in my interpretation.

    • http://robjhyndman.com/ Rob J Hyndman

      The issue is not in the truncation, but in the conditioning. In ETS models, the initial state is not considered random, but a vector of estimable parameters. This is unlike standard state space models, where the initial state is usually treated as random. There is some discussion of this in my 2008 Springer book (www.exponentialsmoothing.net).

  • Daumantas

    Dear Prof. Hyndman,
    Could you provide a reference for the statement “asymptotically, minimizing the AIC is equivalent to minimizing … the out-of-sample one-step forecast MSE for time series models”? The closest I could find is by taking together the section “Cross Validation” and the statement “For large values of N, minimizing the AIC is equivalent to minimizing the CV value.” from your forecasting textbook https://www.otexts.org/fpp/5/3, but that’s perhaps not direct enough. I could cite this blog post, of course, but I thought the referees would rather prefer an academic paper or a textbook.

    • http://robjhyndman.com/ Rob J Hyndman

      Try Konishi and Kitagawa.

  • Pingback: Tudo o que você queria saber sobre o AIC mas nunca te contaram | De Gustibus Non Est Disputandum

  • Pingback: Linear Regression – How To Do It Properly | Likelihood Log