Facts and fallacies of the AIC

Akaike’s Information Criterion (AIC) is a very useful model selection tool, but it is not as well understood as it should be. I frequently read papers, or hear talks, which demonstrate misunderstandings or misuse of this important tool. The following points should clarify some aspects of the AIC, and hopefully reduce its misuse.

  1. The AIC is a penalized likelihood: AIC = −2 log L + 2k, where L is the maximized likelihood and k is the number of estimated parameters. So it requires the likelihood to be maximized before it can be calculated. It makes little sense to compute the AIC if estimation is done some other way (e.g., by minimizing the MAPE). Normally the residuals are assumed to be Gaussian, in which case ML estimates are often (but not always) equivalent to LS estimates. In these cases, computing the AIC after minimizing the MSE is fine.
  2. A model selected by the AIC after Gaussian MLE will give predictions equal to the conditional mean. If you then compare the predictions using the MAE, or MAPE, or some other criterion, they may not perform well, because the conditional mean is not the optimal forecast under those criteria (the MAE, for example, is minimized by the conditional median). Match the error measure to the estimation method.
  3. The AIC does not assume the residuals are Gaussian. It is just that the Gaussian likelihood is most frequently used. But if you want to use some other distribution, go ahead. The AIC is the penalized likelihood, whichever likelihood you choose to use.
  4. The AIC does not require nested models. One of the neat things about the AIC is that you can compare very different models. However, make sure the likelihoods are computed on the same data. For example, you cannot compare an ARIMA model with differencing to an ARIMA model without differencing, because you lose one or more observations via differencing. That is why auto.arima uses a unit root test to choose the order of differencing, and only uses the AIC to select the orders of the AR and MA components.
  5. For a similar reason, you cannot compare the AIC from an ETS model with the AIC from an ARIMA model. The two models treat initial values differently. For example, after differencing, an ARIMA model is computed on fewer observations, whereas an ETS model is always computed on the full set of data. Even when the models are equivalent (e.g., an ARIMA(0,1,1) and an ETS(A,N,N)), the AIC values will be different. Effectively, the likelihood of an ETS model is conditional on the initial state vector, whereas the likelihood of a non-stationary ARIMA model is conditional on the first few observations, even when a diffuse prior is used for the nonstationary components.
  6. Beware of AIC values computed using conditional likelihoods because the conditioning may be different for different models. Then the AIC values are not comparable.
  7. Frequently, the constant term in the AIC is omitted. That is fine for model selection as the constant is the same for all models. But be careful comparing the AIC value between software packages, or between model classes, as they may treat the constant term differently, and then the AIC values are not comparable.
  8. The AIC is not really an “in-sample” measure. Yes, it is computed using the training data. But asymptotically, minimizing the AIC is equivalent to minimizing the leave-one-out cross-validation MSE for cross-sectional data, and equivalent to minimizing the out-of-sample one-step forecast MSE for time series models. This property is what makes it such an attractive criterion for use in selecting models for forecasting.
  9. The AIC is not a measure of forecast accuracy. Although it has the above cross-validation property, comparing AIC values across data sets is essentially meaningless. If you really want to measure the cross-validated MSE, then you will need to calculate it directly.
  10. The AIC is not a consistent model selection method. That does not bother me as I don’t believe there is a true model to be selected. The AIC is optimal (in some senses) for forecasting, and that is much more important in my opinion.
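To make points 1 and 7 concrete, here is a minimal Python sketch (my own illustration, not from the post; the function names and toy residuals are invented) computing the Gaussian AIC with and without the constant term. The two versions always differ by the same model-independent constant, which is why either is fine for ranking models on the same data but the values are not comparable across software that makes different choices.

```python
import math

def aic_gaussian(residuals, k):
    """AIC from a Gaussian likelihood maximized over the error variance.

    k counts all estimated parameters, including the variance.
    """
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    sigma2 = sse / n  # ML (not the unbiased) variance estimate
    # Maximized Gaussian log-likelihood, constant term included
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return -2 * loglik + 2 * k

def aic_gaussian_no_constant(residuals, k):
    """Same AIC up to an additive constant: n*log(SSE/n) + 2k."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return n * math.log(sse / n) + 2 * k

res = [0.5, -1.2, 0.3, 0.9, -0.4, -0.1]  # toy residuals
n = len(res)
full = aic_gaussian(res, k=2)
short = aic_gaussian_no_constant(res, k=2)
# The two versions differ by n*(log(2*pi) + 1), which is the same for
# every model fitted to these n observations, so rankings agree.
gap = full - short
```

Note that k must count the error variance as an estimated parameter for the values to match what most software reports.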
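The asymptotic equivalence in point 8 can be checked numerically. The following sketch (my own illustration, not from the post; all names and the simulated data are invented) fits a mean-only model and a simple linear trend to data with a genuine trend, then verifies that the AIC ranking agrees with the leave-one-out cross-validation ranking.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b * xb, b

def loocv_mse(x, y, use_slope):
    """Leave-one-out cross-validated MSE, refitting for each held-out point."""
    n, errs = len(x), []
    for i in range(n):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        if use_slope:
            a, b = fit_line(xs, ys)
            pred = a + b * x[i]
        else:
            pred = sum(ys) / len(ys)
        errs.append(y[i] - pred)
    return sum(e * e for e in errs) / n

def aic(x, y, use_slope):
    """Gaussian AIC (constant omitted): n*log(SSE/n) + 2k."""
    n = len(x)
    if use_slope:
        a, b = fit_line(x, y)
        res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        k = 3  # intercept, slope, variance
    else:
        m = sum(y) / n
        res = [yi - m for yi in y]
        k = 2  # mean, variance
    return n * math.log(sum(e * e for e in res) / n) + 2 * k

random.seed(1)
x = [i / 10 for i in range(80)]
y = [2 + 1.5 * xi + random.gauss(0, 1) for xi in x]
models = [False, True]  # without / with a slope term
best_aic = min(models, key=lambda s: aic(x, y, s))
best_cv = min(models, key=lambda s: loocv_mse(x, y, s))
# With a genuine linear signal, both criteria pick the slope model.
```

Remember point 9, though: agreement of rankings within one data set does not make the AIC value itself a measure of forecast accuracy.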

There are some further helpful clarifications in “AIC Myths and Misunderstandings” by Anderson and Burnham.

Comments:

  • Luis Apiolaza

    ML doesn’t require nested models; however, some of us use REML (Restricted or Residual ML), which involves a projection of the data using the fixed effects. In that case, the models being compared must have the same set of fixed effects for their AICs to be comparable.

  • Stephan Kolassa

    Another reference on AIC which I use and refer to frequently is Burnham & Anderson’s monograph “Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach”. I especially like its approach (similar to your point 10 above) that there is no “true model” but an infinity of influences with what they call tapering effect sizes.

  • Pingback: AIC, Kullback-Leibler and a more general Information Criterion | Thiago G. Martins

  • Pingback: Hyndsight - Fitting models to short time series

  • Jesse Papenburg

    Thanks for highlighting some key points about the AIC. In point 4, “The AIC does not require nested models”, you note that the likelihoods must be estimated from the same data in order to compare AICs. Observations are lost with differencing; therefore, we should not compare the AIC of an ARIMA model with differencing to one without differencing.

    Is it then also true, because observations are lost with lagging, that one should not compare the AIC of a bivariate ARIMA model that used a lagged transfer function to the AIC of bivariate ARIMA model that did not incorporate a lag (both models included the same input and response series)?

    • Rob J Hyndman

      That depends on whether the conditional likelihood or full likelihood is being used.

      • Jesse Papenburg

        Thanks for the prompt reply!
        I believe that I have been using the full likelihood; I specified the ML estimation method in SAS. From the SAS manual: “The METHOD= ML option produces maximum likelihood estimates. The likelihood function is maximized via nonlinear least squares using Marquardt’s method.” This is in contrast to the Unconditional Least Squares and Conditional Least Squares options.

  • Christopher Waldeck

    Dr. Hyndman,

    I’m trying to synthesize your 5th point here and Anderson and Burnham’s second bullet point. From the context, I believe they mean “data set” in the traditional way – as a set of realizations of a data generation process. However, if this is the case, wouldn’t comparing a truncated (in the case of differencing) data set with the entire set still be a comparison of the same underlying process? Additionally, if you could expand on the effect of the initial (random) state vector of ETS models on the AIC calculation, it would be much appreciated. I believe you’re saying that the initial random state of the model is necessarily not equivalent to a maximum likelihood fit of the true, underlying random process, but I don’t have confidence in my interpretation.

    • Rob J Hyndman

      The issue is not in the truncation, but in the conditioning. In ETS models, the initial state is not considered random, but a vector of estimable parameters. This is unlike standard state space models where the initial state is usually treated as random. There is some discussion of this in my 2008 Springer book (www.exponentialsmoothing.net).

  • Daumantas

    Dear prof. Hyndman,
    could you provide a reference for the statement “asymptotically, minimizing the AIC is equivalent to minimizing … the out-of-sample one-step forecast MSE for time series models”? The closest I could find is to take together the section “Cross Validation” and the statement “For large values of N, minimizing the AIC is equivalent to minimizing the CV value.” from your forecasting textbook https://www.otexts.org/fpp/5/3, but that is perhaps not direct enough. I could cite this blog post, of course, but I thought the referees would prefer an academic paper or a textbook.

    • Rob J Hyndman

      Try Konishi and Kitagawa.

  • Pingback: Tudo o que você queria saber sobre o AIC mas nunca te contaram | De Gustibus Non Est Disputandum

  • Pingback: Linear Regression – How To Do It Properly | Likelihood Log