Facts and fallacies of the AIC

Akaike’s Information Criterion (AIC) is a very useful model selection tool, but it is not as well understood as it should be. I frequently read papers, or hear talks, which demonstrate misunderstandings or misuse of this important tool. The following points should clarify some aspects of the AIC, and hopefully reduce its misuse.

  1. The AIC is a penalized likelihood, and so it requires the likelihood to be maximized before it can be calculated. It makes little sense to compute the AIC if estimation is done using something else (e.g., minimizing MAPE). Normally, the residuals are assumed to be Gaussian, and then ML estimates are often (but not always) equivalent to LS estimates. In these cases, computing the AIC after minimizing the MSE is ok.
  2. A model selected by the AIC after Gaussian MLE will give predictions equal to the conditional mean. If you then compare the predictions using MAE, or MAPE, or some other criterion, they may not perform well because these other criteria are not optimal for the conditional mean. Match the error measure to the estimation method.
  3. The AIC does not assume the residuals are Gaussian. It is just that the Gaussian likelihood is most frequently used. But if you want to use some other distribution, go ahead. The AIC is the penalized likelihood, whichever likelihood you choose to use.
  4. The AIC does not require nested models. One of the neat things about the AIC is that you can compare very different models. However, make sure the likelihoods are computed on the same data. For example, you cannot compare an ARIMA model with differencing to an ARIMA model without differencing, because you lose one or more observations via differencing. That is why auto.arima uses a unit root test to choose the order of differencing, and only uses the AIC to select the orders of the AR and MA components.
  5. For a similar reason, you cannot compare the AIC from an ETS model with the AIC from an ARIMA model. The two models treat initial values differently. For example, after differencing, an ARIMA model is computed on fewer observations, whereas an ETS model is always computed on the full set of data. Even when the models are equivalent (e.g., an ARIMA(0,1,1) and an ETS(A,N,N)), the AIC values will be different. Effectively, the likelihood of an ETS model is conditional on the initial state vector, whereas the likelihood of a non-stationary ARIMA model is conditional on the first few observations, even when a diffuse prior is used for the non-stationary components.
  6. Beware of AIC values computed using conditional likelihoods because the conditioning may be different for different models. Then the AIC values are not comparable.
  7. Frequently, the constant term in the AIC is omitted. That is fine for model selection as the constant is the same for all models. But be careful comparing the AIC value between software packages, or between model classes, as they may treat the constant term differently, and then the AIC values are not comparable.
  8. The AIC is not really an “in-sample” measure. Yes, it is computed using the training data. But asymptotically, minimizing the AIC is equivalent to minimizing the leave-one-out cross-validation MSE for cross-sectional data, and equivalent to minimizing the out-of-sample one-step forecast MSE for time series models. This property is what makes it such an attractive criterion for use in selecting models for forecasting.
  9. The AIC is not a measure of forecast accuracy. Although it has the above cross-validation property, comparing AIC values across data sets is essentially meaningless. If you really want to measure the cross-validated MSE, then you will need to calculate it directly.
  10. The AIC is not a consistent model selection method. That does not bother me as I don’t believe there is a true model to be selected. The AIC is optimal (in some senses) for forecasting, and that is much more important in my opinion.
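
Points 1 and 7 can be made concrete with a small numerical sketch. The Python below is illustrative only: the function name `gaussian_aic`, the simulated data and the parameter count (regression coefficients plus the error variance) are assumptions of this example, not taken from any particular package. It fits two regressions by least squares, treats each fit as a Gaussian MLE, and computes the AIC as -2 log L + 2k both with and without the terms that depend only on the sample size.

```python
import numpy as np

def gaussian_aic(y, X, drop_constant=False):
    """AIC of a least-squares fit, treating it as a Gaussian MLE."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n          # ML (not unbiased) estimate of the variance
    k = p + 1                 # regression coefficients plus the variance
    if drop_constant:
        # Omit the terms that depend only on n: -n/2 * (log(2*pi) + 1)
        loglik = -0.5 * n * np.log(sigma2)
    else:
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * k

# Simulated data (an assumption of this sketch): a linear signal plus noise
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])          # linear model
X2 = np.column_stack([np.ones(n), x, x ** 2])  # adds a quadratic term

full = [gaussian_aic(y, X) for X in (X1, X2)]
short = [gaussian_aic(y, X, drop_constant=True) for X in (X1, X2)]

print("AIC difference, full likelihood: ", full[0] - full[1])
print("AIC difference, constant dropped:", short[0] - short[1])
```

Both conventions give the same AIC difference between the two models, so they select the same model. The raw values differ by n(log 2π + 1), which is why values reported under different conventions should not be compared directly.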

There are some further helpful clarifications in “AIC Myths and Misunderstandings” by Anderson and Burnham.
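
The conditioning problem in points 4 and 6 can be seen in a short sketch. The Python below is a hypothetical illustration, not how any particular package fits its models: `ar_conditional_aic` fits an AR(p) by conditional least squares, so the Gaussian likelihood is conditional on the first p observations and is evaluated on only n - p values. Passing a common `start` forces every order to condition on the same initial observations, which is one simple way to make conditional AIC values comparable; using the full likelihood avoids the issue altogether.

```python
import numpy as np

def ar_conditional_aic(y, p, start=None):
    """AIC of an AR(p) fitted by conditional least squares.

    The Gaussian likelihood is conditional on the observations before
    `start`. By default start = p, so each order conditions on a
    different set of initial values and uses a different sample size.
    Returns (aic, number of observations in the likelihood).
    """
    start = p if start is None else start
    if start < p:
        raise ValueError("start must be at least p")
    target = y[start:]
    # Columns: intercept, then lags 1..p aligned with the target values
    cols = [np.ones(len(target))]
    cols += [y[start - j:len(y) - j] for j in range(1, p + 1)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    m = len(target)
    sigma2 = float(resid @ resid) / m
    loglik = -0.5 * m * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 2  # intercept, p AR coefficients, error variance
    return -2 * loglik + 2 * k, m

# Simulated AR(2) series (an assumption of this sketch)
rng = np.random.default_rng(1)
n = 300
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

max_p = 4
orders = range(1, max_p + 1)
default = {p: ar_conditional_aic(y, p) for p in orders}
aligned = {p: ar_conditional_aic(y, p, start=max_p) for p in orders}

for p in orders:
    print(p, "default n =", default[p][1], " aligned n =", aligned[p][1])
```

With the default conditioning, each order's likelihood covers a different number of observations, so the AIC values are not on the same footing. With a common `start`, every model is scored on the same n - max_p observations.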

  • ML doesn’t require nested models; however, some of us use REML (Restricted or Residual ML) which involves a projection of the data using the fixed effects. In that case the models to be compared require the same set of fixed effects to use AIC.

  • Stephan Kolassa

    Another reference on AIC which I use and refer to frequently is Burnham & Anderson’s monograph “Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach”. I especially like its approach (similar to your point 10 above) that there is no “true model” but an infinity of influences with what they call tapering effect sizes.

  • Jesse Papenburg

    Thanks for highlighting some key points about the AIC. In point 4 (“The AIC does not require nested models”), you note that the likelihoods must be estimated from the same data in order to compare AICs. Observations are lost with differencing; therefore, we should not compare the AIC of an ARIMA model with differencing to one without differencing.

    Is it then also true, because observations are lost with lagging, that one should not compare the AIC of a bivariate ARIMA model that used a lagged transfer function to the AIC of a bivariate ARIMA model that did not incorporate a lag (both models included the same input and response series)?

    • That depends on whether the conditional likelihood or full likelihood is being used.

      • Thanks for the prompt reply!
        I believe that I have been using the full likelihood; I have set the estimation method in SAS to the ML option. From the SAS manual: “The METHOD= ML option produces maximum likelihood estimates. The likelihood function is maximized via nonlinear least squares using Marquardt’s method.” This is in contrast to the Unconditional Least Squares and the Conditional Least Squares options.

  • Christopher Waldeck

    Dr. Hyndman,

    I’m trying to synthesize your 5th point here and Anderson and Burnham’s second bullet point. From the context, I believe they mean “data set” in the traditional way – as a set of realizations of a data generation process. However, if this is the case, wouldn’t comparing a truncated (in the case of differencing) data set with the entire set still be a comparison of the same underlying process? Additionally, if you could expand on the effect of the initial (random) state vector of ETS models on the AIC calculation, it would be much appreciated. I believe you’re saying that the initial random state of the model is necessarily not equivalent to a maximum likelihood fit of the true, underlying random process, but I don’t have confidence in my interpretation.

    • The issue is not in the truncation, but in the conditioning. In ETS models, the initial state is not considered random, but a vector of estimable parameters. This is unlike standard state space models where the initial state is usually treated as random. There is some discussion of this in my 2008 Springer book (www.exponentialsmoothing.net).

  • Daumantas

    Dear prof. Hyndman,
    could you provide a reference for the statement “asymptotically, minimizing the AIC is equivalent to minimizing … the out-of-sample one-step forecast MSE for time series models”? The closest I could find is by taking together the section “Cross Validation” and the statement “For large values of N, minimizing the AIC is equivalent to minimizing the CV value.” from your forecasting textbook https://www.otexts.org/fpp/5/3, but that’s perhaps not direct enough. I could cite this blog post, of course, but I thought the referees would prefer an academic paper or a textbook.

  • Joe

    Dr. Hyndman,

    I am fitting conditional (or matched case-control) logistic regression models that use conditional maximum likelihood. Based on # 6, I should not use AIC for model selection, correct? Or does that (perhaps) not apply if the same program is used to fit each model? (For the sake of completeness: I am using clogit in the survival package in R). AIC model selection packages in R do work on the clogit models, but of course that does not necessarily imply it is valid.

    “Beware of AIC values computed using conditional likelihoods because the conditioning may be different for different models. Then the AIC values are not comparable.”


    • Joe

      I think I figured it out. Looks like folks use QIC. If you have additional thoughts, I’d still love to hear them.
      Sorry for the hasty post.

      Craiu, R. V., Duchesne, T. & Fortin, D. (2008) Inference Methods for the Conditional Logistic Regression Model with Longitudinal Data. Biometrical Journal, 50, 97–109.

  • Timothy Lyons

    I realize this is an old question, but given a dataset with n observations, how can either information criterion be used to compare autoregressive orders? Won’t an AR(1) model have only n-1 observations, and an AR(3) model n-3 observations? It seems like calculating an AIC for each order results in the use of a different data set each time, unless you truncate the data sets for smaller order models beforehand, correct?

    • No. It is possible to compute the likelihood on the full data set. The problem you mention only arises when you use the conditional likelihood.

      • Timothy Lyons

        Thank you for the clarification!

  • alexT

    Dear Rob, thanks for this illuminating post!

    By any chance, do you know a reference where one can read a proof that the AIC for time series is asymptotically equivalent to “minimizing the out-of-sample one-step forecast MSE”?


    • Try either the Burnham-Anderson book, or the Konishi-Kitagawa book.

      • alexT

        Will do, thanks!

  • Surnjani Djoko

    Hi Rob,

    I am using auto.arima with xreg, where xreg is a set of 0/1 dummy variables. I got “No suitable ARIMA model found”, but I did get a result when xreg is NULL. I am wondering if I am misunderstanding the use of xreg. Thank you in advance for any insight. I did not find any prior discussion of this; I apologize if it has been discussed before.



  • Sean

    Dear prof. Hyndman,

    Bullet point 3 says one can use other likelihoods than the Gaussian. However, can one also compare between likelihoods? For example, imagine one estimates the same model on the data, namely an AR(1), but estimates this model by both Gaussian MLE [1] and t-MLE [2]. Can I compare specifications [1] and [2] by AIC? I see that in this setup only the value of the log-likelihoods matters. So more generally can I compare AR(1) up to AR(p) based on both Gaussian and t-MLE? Many thanks in advance.

  • Cagdas Ozgenc

    Hello Prof. Hyndman,

    I went through the entire derivation of the AIC. Two things attracted my attention.

    1) The derivation is based on unconditional distributions. The entropy of the true density is ignored on the assumption that it will be the same for all candidate models. It gets interesting once we try to apply the AIC to conditional models like regression. At this point, if the conditioning set changes while the target variable stays the same, it becomes unclear which true density we are talking about and which entropy is being ignored. This is related to point #6 above. Normally it is widely accepted that one can use the AIC on nested models. But even in nested models the conditioning is changing.

    2) In the last step, when the TIC expression is simplified to the AIC by taking the trace of a matrix, the only way that matrix becomes the identity, so that the trace is simply the dimension of the matrix, is when the model is equivalent to the truth. This makes the AIC questionable when the truth is not among the candidates.

    What’s your opinion?