Why every statistician should know about cross-validation

Surprisingly, many statisticians see cross-validation as something data miners do, but not a core statistical technique. I thought it might be helpful to summarize the role of cross-validation in statistics, especially as it is proposed that the Q&A site at stats.stackexchange.com should be renamed CrossValidated.com.

Cross-validation is primarily a way of measuring the predictive performance of a statistical model. Every statistician knows that the model fit statistics are not a good guide to how well a model will predict: high $R^2$ does not necessarily mean a good model. It is easy to over-fit the data by including too many degrees of freedom and so inflate $R^2$ and other fit statistics. For example, in a simple polynomial regression I can just keep adding higher order terms and so get better and better fits to the data. But the predictions from the model on new data will usually get worse as higher order terms are added.

One way to measure the predictive ability of a model is to test it on a set of data not used in estimation. Data miners call this a “test set” and the data used for estimation is the “training set”. For example, the predictive accuracy of a model can be measured by the mean squared error on the test set. This will generally be larger than the MSE on the training set because the test data were not used for estimation.

However, there is often not enough data to allow some of it to be kept back for testing. A more sophisticated version of training/test sets is leave-one-out cross-validation (LOOCV), in which the accuracy measures are obtained as follows. Suppose there are $n$ independent observations, $y_1,\dots,y_n$.

  1. Let observation $i$ form the test set, and fit the model using the remaining data. Then compute the error $(e_{i}^*=y_{i}-\hat{y}_{i})$ for the omitted observation. This is sometimes called a “predicted residual” to distinguish it from an ordinary residual.
  2. Repeat step 1 for $i=1,\dots,n$.
  3. Compute the MSE from $e_{1}^*,\dots,e_{n}^*$. We shall call this the CV.

This is a much more efficient use of the available data, as you only omit one observation at each step. However, it can be very time consuming to implement (except for linear models — see below).

Other statistics (e.g., the MAE) can be computed similarly. A related measure is the PRESS statistic (predicted residual sum of squares), which equals $n\times$CV.
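
To make the three steps concrete, here is a minimal R sketch of naive LOOCV for a linear regression; the data set (`mtcars`) and the model formula are placeholders rather than anything from the discussion above.

```r
# Naive leave-one-out cross-validation for a linear regression.
# mtcars and the formula mpg ~ wt + hp are illustrative placeholders.
n <- nrow(mtcars)
e_star <- numeric(n)
for (i in seq_len(n)) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-i, ])   # fit without observation i
  yhat <- predict(fit, newdata = mtcars[i, ])      # predict the omitted observation
  e_star[i] <- mtcars$mpg[i] - yhat                # predicted residual e_i^*
}
CV    <- mean(e_star^2)   # the CV statistic (MSE of the predicted residuals)
PRESS <- sum(e_star^2)    # PRESS = n * CV
```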

Variations on cross-validation include leave-$k$-out cross-validation (in which $k$ observations are left out at each step) and $k$-fold cross-validation (where the original sample is randomly partitioned into $k$ subsamples and one is left out in each iteration). Another popular variant is the .632+ bootstrap of Efron & Tibshirani (1997), which has better properties but is more complicated to implement.
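
For comparison with the leave-one-out loop above, here is a minimal sketch of $k$-fold cross-validation for the same placeholder regression; the choice of five folds and the random seed are arbitrary.

```r
# 5-fold cross-validation for the same placeholder regression.
set.seed(1)                                          # arbitrary seed for the random partition
k <- 5
n <- nrow(mtcars)
fold <- sample(rep(1:k, length.out = n))             # random fold assignment
sq_err <- numeric(n)
for (j in 1:k) {
  test <- which(fold == j)
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-test, ])  # fit on the other k - 1 folds
  yhat <- predict(fit, newdata = mtcars[test, ])
  sq_err[test] <- (mtcars$mpg[test] - yhat)^2
}
CV_kfold <- mean(sq_err)
```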

Minimizing a CV statistic is a useful way to do model selection such as choosing variables in a regression or choosing the degrees of freedom of a nonparametric smoother. It is certainly far better than procedures based on statistical tests and provides a nearly unbiased measure of the true MSE on new observations.

However, as with any variable selection procedure, it can be misused. Beware of looking at statistical tests after selecting variables using cross-validation — the tests do not take account of the variable selection that has taken place and so the p-values can mislead.

It is also important to realise that it doesn’t always work. For example, if there are exact duplicate observations (i.e., two or more observations with equal values for all covariates and for the $y$ variable) then leaving one observation out will not be effective.

Another problem is that a small change in the data can cause a large change in the model selected. Many authors have found that k-fold cross-validation works better in this respect.

In a famous paper, Shao (1993) showed that leave-one-out cross-validation does not lead to a consistent estimate of the model. That is, if there is a true model, then LOOCV will not always find it, even with very large sample sizes. In contrast, certain kinds of leave-$k$-out cross-validation, where $k$ increases with $n$, are consistent. Frankly, I don’t consider this a very important result as there is never a true model. In reality, every model is wrong, so consistency is not really an interesting property.

Cross-validation for linear models

While cross-validation can be computationally expensive in general, it is very easy and fast to compute LOOCV for linear models. A linear model can be written as
$$
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}.
$$
Then
$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}
$$
and the fitted values can be calculated using
$$
\mathbf{\hat{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y},
$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is known as the “hat-matrix” because it is used to compute $\mathbf{\hat{Y}}$ (“Y-hat”).

If the diagonal values of $\mathbf{H}$ are denoted by $h_{1},\dots,h_{n}$, then the cross-validation statistic can be computed using
$$
\text{CV} = \frac{1}{n}\sum_{i=1}^n [e_{i}/(1-h_{i})]^2,
$$
where $e_{i}$ is the ordinary residual obtained from fitting the model to all $n$ observations. (See Christensen’s book Plane Answers to Complex Questions for a proof.) Thus it is not necessary to fit $n$ separate models when computing the CV statistic for a linear model: this remarkable result allows leave-one-out cross-validation to be carried out while fitting the model only once, to all available observations.
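
In R, the diagonal of the hat matrix is returned by `hatvalues()`, so the shortcut is a one-liner; the model below is again a placeholder, and the result matches the brute-force leave-one-out loop sketched earlier.

```r
# LOOCV for a linear model without refitting, via the hat matrix diagonal.
fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model
h <- hatvalues(fit)                       # diagonal elements h_1, ..., h_n
e <- residuals(fit)                       # ordinary residuals from the full fit
CV <- mean((e / (1 - h))^2)               # identical to the leave-one-out loop
```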

Relationships with other quantities

Several quantities in wide statistical use are closely related to cross-validation, although the connections have not always been made clear.

Jackknife

A jackknife estimator is obtained by recomputing an estimate leaving out one observation at a time from the estimation sample. The $n$ estimates allow the bias and variance of the statistic to be calculated.

Akaike’s Information Criterion

Akaike’s Information Criterion is defined as
$$
\text{AIC} = -2\log {\cal L}+ 2p,
$$
where ${\cal L}$ is the maximized likelihood using all available data for estimation and $p$ is the number of free parameters in the model. Asymptotically, minimizing the AIC is equivalent to minimizing the CV value. This is true for any model (Stone 1977), not just linear models. It is this property that makes the AIC so useful in model selection when the purpose is prediction.
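
The definition is easy to check numerically; in R, `logLik()` returns the maximized log-likelihood and its `df` attribute counts the free parameters (for `lm` this includes the error variance). The model below is only a placeholder.

```r
# AIC computed from its definition and checked against R's built-in AIC().
fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model
ll  <- logLik(fit)
p   <- attr(ll, "df")                     # number of free parameters, including sigma^2
aic_manual <- -2 * as.numeric(ll) + 2 * p
all.equal(aic_manual, AIC(fit))           # TRUE
```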

Schwarz Bayesian Information Criterion

A related measure is Schwarz’s Bayesian Information Criterion:
$$
\text{BIC} = -2\log {\cal L}+ p\log(n),
$$
where $n$ is the number of observations used for estimation. Because of the heavier penalty, the model chosen by BIC is either the same as that chosen by AIC, or one with fewer terms. Asymptotically, for linear models minimizing BIC is equivalent to leave-$v$-out cross-validation when $v = n[1-1/(\log(n)-1)]$ (Shao 1997).

Many statisticians like to use the BIC because it is consistent: if there is a true underlying model, then with enough data the BIC will select that model. However, in reality there is rarely, if ever, a true underlying model, and even if there were one, selecting it would not necessarily give the best forecasts (because the parameter estimates may not be accurate).
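
The BIC can be checked in the same way as the AIC, and Shao’s equivalence gives a feel for how much data leave-$v$-out cross-validation sets aside; the model and the sample sizes below are only illustrative.

```r
# BIC from its definition, plus the leave-v-out proportion implied by Shao (1997).
fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model
ll  <- logLik(fit)
bic_manual <- -2 * as.numeric(ll) + attr(ll, "df") * log(nobs(fit))
all.equal(bic_manual, BIC(fit))           # TRUE

n <- c(50, 500, 5000)                     # arbitrary sample sizes
v <- n * (1 - 1 / (log(n) - 1))
round(v / n, 2)                           # proportion left out: about 0.66, 0.81, 0.87
```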

Cross-validation for time series

When the data are not independent, cross-validation becomes more difficult, as leaving out an observation does not remove all of the associated information, due to correlations with the other observations. For time series forecasting, a cross-validation statistic is obtained as follows (a short code sketch is given after the steps).

  1. Fit the model to the data $y_1,\dots,y_t$ and let $\hat{y}_{t+1}$ denote the forecast of the next observation. Then compute the error $(e_{t+1}^*=y_{t+1}-\hat{y}_{t+1})$ for the forecast observation.
  2. Repeat step 1 for $t=m,\dots,n-1$ where $m$ is the minimum number of observations needed for fitting the model.
  3. Compute the MSE from $e_{m+1}^*,\dots,e_{n}^*$.
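
Here is a minimal R sketch of this rolling-origin procedure; the `lynx` series, the AR(2) model fitted with `arima()`, and the minimum training length $m=30$ are all placeholders standing in for “the data” and “the model”.

```r
# Time series cross-validation with an expanding training window.
# lynx, the AR(2) model and m = 30 are illustrative placeholders.
y <- as.numeric(lynx)
n <- length(y)
m <- 30                                      # minimum observations needed to fit the model
e_star <- rep(NA_real_, n)
for (t in m:(n - 1)) {
  fit  <- arima(y[1:t], order = c(2, 0, 0))  # fit to y_1, ..., y_t
  yhat <- predict(fit, n.ahead = 1)$pred     # one-step-ahead forecast of y_{t+1}
  e_star[t + 1] <- y[t + 1] - yhat           # forecast error e*_{t+1}
}
MSE <- mean(e_star^2, na.rm = TRUE)
```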

References

An excellent and comprehensive recent survey of cross-validation results is Arlot and Celisse (2010), “A survey of cross-validation procedures for model selection”, Statistics Surveys, 4, 40–79.


Comments

  • Stephan Kolassa

    Very nice article, thanks! The Arlot and Celisse paper is even freely available from Project Euclid… as if I didn’t have enough to read already… 😉

    Any thoughts on using cross-validation with mixed linear models, e.g., with repeated measurements on each participant in clinical studies? It seems as if Arlot & Celisse don’t explicitly treat this case.

  • Stephan Kolassa

    Hm, this looks interesting. Thanks!

  • inwit

    Hi, Rob! This post of yours brought me back one old question about time series and cross-validation. Instead of posting it here, I’ve sent it to StackExchange. This is a great blog and a good source for inspiration! Keep rocking! 🙂

  • Vishal Belsare

    Rob, thanks for a nice post! and for pointing out the paper by Arlot & Celisse.

  • Abhijit

    Hi Rob,

    Very nicely done, indeed.

    I’ll bite and ask about your comment on consistency. I would agree that any model we create is wrong, but it doesn’t follow that there is no underlying true model, however complex, that we’re trying to approximate. I would posit that at least locally, and perhaps globally, there is a true regression function E(Y|X). We can discuss this offline, if you like.

    • I think we are using consistency in two different ways. I agree that consistency of an estimator of E(Y|X) is important when we know X. And we get that with any decent estimator provided the form of Y = f(X, error) is correctly specified. (E.g., if f is linear and the error is iid, then OLS is consistent.)

      But the problem of variable selection is finding E(Y|Z) where Z is unknown and probably unknowable. The BIC provides consistency only when Z is contained within our set of potential predictor variables, but we can never know if that is true. That is why I suggest that consistency is not a useful concept in this context.

  • Another issue I have with consistency is that it addresses the infinite-sample case, but you may never see enough data for the infinite-sample properties to matter. For instance, the Contrastive Divergence estimator is consistent, but for a dense model over n variables it takes on the order of 2^n samples for the estimate to approach the true value.

    • Yes, so a lot of my colleagues look at rates of convergence rather than just consistency. We’ve also been playing with simulation strategies to see how well a “consistent” method does in the finite sample situations.

  • Once you have an estimate of a method’s performance for a finite sample size, why does consistency matter? I.e., would you ever have a reason to prefer a consistent estimator over an inconsistent one with better finite-sample properties?

  • I think the biggest difference between practitioners of stats and machine learning is what inferences they care about.

    Suppose we have data on the region you live in, education, sex, age, ethnicity, price of home, and mortgage on-time payment status, say in a time series over a decade.

    With my “machine learning” hat on, I want to predict whether an individual will default in some time frame given the value of the predictors. My imagined “customer” is a bank. I might throw the data through an SVM with a complex kernel if I only care about 0/1 outcome, or through a decision forest, or use K nearest neighbors. All of these are likely to produce reasonable default predictions, and a committee of such even better predictions.

    With my “applied statistician” hat on, I might want to estimate the effect of age on mortgage defaults, controlling for the other predictors.

    It’s just a very different game. Cross-validation makes much more sense in the former game. But as you point out, one needs to be careful. The biggest mistake I see, in practice, is the one you mention — tuning using cross-validation over all folds then assuming you’ll get the same performance on new data.

    Both groups tend to forget that neither the population nor the effects are stationary (in the statistical sense). That is, the population over which we’re predicting isn’t the same as the one from which we collected the data. Sure, we can do things like post-stratification to adjust for sampling bias, but there are underlying changes in attitudes, wealth, and so on in this example.

  • Hi Rob,

    I really enjoyed the article. Do you have a reference for time series cross-validation technique that you mention at the end?

    Thanks, Chandler

    • That is common practice in forecasting evaluation studies, but I’ve never seen it in a textbook. I’ve put it in my new book (incomplete) at http://robjhyndman.com/fpp/2/5/.

      • Thomas

        Dear Rob,
        Many thanks for the article (2 years later…). Do you by any chance have any reference of this technique being used in a published article?
        Thanks!
        Thomas

  • Torsten Seemann

    Mention should be made of the more general information-theoretic approaches such as MML (minimum message length) and MDL (minimum description length), of which AIC and BIC are restricted instances:

    MML: http://en.wikipedia.org/wiki/Minimum_Message_Length
    MDL: http://en.wikipedia.org/wiki/Minimum_description_length

    • Owen

      I agree. I think of MDL as “AIC new and improved — now with consistency!” 🙂
      The asymptotics on MDL have been proven up to two order terms beyond where AIC breaks down, i.e. down to o(1) on the total message length.
      Papers by A. Barron and others.

  • Any comments on relative merits of cross-validation and 0.632+ bootstrap, especially for time series?

    • I don’t know much about this. Efron and Tibshirani (http://www.jstor.org/stable/2965703) argue for the 0.632 bootstrap over cross-validation, but I don’t think it has any real theoretical support. I’ve not thought about how the 0.632 bootstrap would work in the time series context.

  • The bootstrap itself has plenty of theoretical support (*), both in an independent and a dependent data context. (References below.) However, I have not seen much in terms of generalizing the 0.632 (which Hall, at least, argues should really be 0.667). I did read around after posting my question, to see what I could find:

    R.M.Kunst, “Cross validation of prediction models for seasonal time series by parametric bootstrapping,” Austrian Journal of Statistics, 37(3&4), 2008, 271-284.

    D.N.Politis, J.P.Romano, “The stationary bootstrap”, JASA, 89(428), 1994, 1303-1313.

    (*) S.N.Lahiri, RESAMPLING METHODS FOR DEPENDENT DATA, Springer, 2010.

    P.Hall, “On the biases of error estimators in prediction problems”, Statistics and Probability Letters 24(3), 15 Aug 1995, 

    There’s some work reported over at IEEE Transactions on 0.632 bootstrap approaches to time series, but I’m not a member and don’t have ready access to a library to look.

    Also, there’s a reference for cross-validation to dependent data, namely, 

    P.Burman, E.Chow, D.Nolan, “A cross-validatory method for dependent data”, BIOMETRIKA 1994, 81(2), 351-358.

    The entire area of resampling is pretty well developed, in practice as well as theory, e.g., 

    A.C.Davison, D.V.Hinkley, BOOTSTRAP METHODS AND THEIR APPLICATION, Cambridge University Press, 1997.

    M.R.Chernick, BOOTSTRAP METHODS: A GUIDE FOR PRACTITIONERS AND RESEARCHERS, 2nd edition, 2008.

    P.Hall, THE BOOTSTRAP AND EDGEWORTH EXPANSION, Springer, 1992.

    Happy to bring your readership up to date.

    • Thanks for the references. I meant that Efron and Tibshirani’s 0.632 bootstrap idea was empirically rather than theoretically based. Of course, there are many variations of the bootstrap that have been thoroughly studied from a theoretical perspective.

  • Fabio Goncalves

    Hi Rob, thanks for the article! The link to Shao (1995) below actually points to a 1993 paper, which doesn’t seem to mention Schwarz’s BIC.
    “Asymptotically, for linear models minimizing BIC is equivalent to leave-$v$-out cross-validation when $v = n[1-1/(\log(n)-1)]$ (Shao 1995).”

    Would you be able to confirm this reference?

    Thanks!

    • Thanks for spotting that error. I’ve fixed the link to point to Shao (1997).

  • Matt Schneider

    To the group: I read various machine learning papers on prediction that select a tuning parameter or number of iterations (let’s say for boosting or trees) based on k-fold cross-validation. I can see how it makes sense if either of those parameters (tuning, iterations) is chosen on each of the k training sets and then we look at the average prediction error (let’s say MSE) on the k test sets. However, am I misunderstanding something, or is it a misuse when the parameters are chosen based on the aggregate prediction results after doing all the k folds (arg min over the parameters of the total MSE on the test sets)?

    Two thoughts: 1) For inference this may be OK, because those “best” tuning parameters model the population well and give a typical error for a withheld observation. 2) To call this forecasting seems off. The optimal parameters depend on all the data; none of the data was completely withheld. “Prediction error” isn’t necessarily “forecast error”?

     Agree? Disagree? 

  • Dear Professor Rob J Hyndman

    I am Chong Wu from China and I will be finishing a BS degree in Applied Math at the Huazhong University of Science & Technology (Top 10 in China) next year. I like this clear and enlightening article.
    I was wondering if you have any plans to recruit new PhD candidates in fall 2013.

  • Great overview. Thanks.

  • Econstudent

    Great post, Professor,

    I am a little surprised that for time series (or dependent data in general) you did not mention the pertinent reference

    P.Burman, E.Chow, D.Nolan, “A cross-validatory method for dependent data”, BIOMETRIKA 1994, 81(2), 351-358.

    And a more recent contribution is

    @article{
    Author = {Racine, Jeff},
    Title = {Consistent cross-validatory model-selection for dependent data: hv-block cross-validation},
    Journal = {Journal of Econometrics},
    Volume = {99},
    Pages = {39-61},
    Year = {2000} }

  • Vishal Ugle

    Thanks for a great article! For the time series part, is it better to calculate the MSE or the mean absolute deviation? Since we are just forecasting one value, would the absolute deviation be a better measure?

    • If you want to forecast the mean, use the MSE. If you want to forecast the median, use the MAE.

  • S Spaniard

    Hi, I have a question about time series cross-validation; it may be a stupid question, but I’ll try: there is a known (observed) time series $y_1,\dots,y_t$. The model is fitted using these data and $\hat{y}_{t+1}$ is the forecast.

    But for the error computation, $e_{t+1}^* = y_{t+1} - \hat{y}_{t+1}$, how are you finding the value of $y_{t+1}$?

    $y_{t+1}$ is unknown, right?

    If $y_{t+1}$ is part of the known (observed) data, how is it different from any other leave-one-out method?

  • Daumantas

    Cross validation for time series will use training samples that will be considerably shorter than the full original sample. Suppose we have two samples from the same population, small one *s* and large one *S*. In *s*, the model which will minimize the forecast MSE will likely be more parsimonious than the forecast-MSE-minimizing model in *S*. Does that mean that time series cross validation will systematically yield more parsimonious models than, say, AIC applied on the original sample? If so, isn’t that a problem? Should we always use AIC instead of time series cross validation?

    • Yes, that is a problem when the time series are relatively short. In that case, AIC is to be preferred.

  • David Tseng

    Really nice concept.
    I came up with a question: how do I use time series cross-validation for “classification”?
    For my problem, I want to classify whether a purchase order will be delayed or not, and I want to use a random forest to solve it.
    And how can I do cross-validation?

    Thanks in advance.

  • Gabriel Card

    My question is: what about comparing models with different numbers of variables? And what about different distributions, like comparing a binomial to a negative binomial? There are two parameters to estimate for the negative binomial and only one for the binomial.

    • That’s the whole point of cross-validation — it does not matter how many parameters each model has.

  • Adwaith Gupta

    For cross-validation for time series, in Step 2 do I need to use $y_1, y_2, \dots, y_m$? That would mean that I am using more inputs from the past every time. Or should I just use a fixed length of inputs, from $y_t$ to $y_{t-m}$?

    • I’ve said nothing at all about inputs here. You are assuming some model.

      The length of the training data is increasing with each iteration. In some circumstances, it is preferable to have a fixed window length for the training data.

      • Adwaith Gupta

        If the length of the training set is increasing, then this is not pure cross-validation; what is also being taken into account is the so-called autocorrelation part. Is that correct?

        • What is taken into account is that the past is used to predict the future, and not vice-versa. Accounting for autocorrelation is one feature of that, but not the only one. There are some circumstances when true cross-validation will work with time series data as explained in this paper: http://robjhyndman.com/working-papers/cv-time-series/

  • Volker Hadamschek

    Hi,

    thanks a lot for this article.

    One question with regard to CV for time series:
    If I get it right, no information from the test data points must be used to train the model, in order to estimate its performance properly.

    But what about the case when $y_{t+1}$ is not independent of $y_t$ (and other earlier data points), which is in general the case? Is the method you describe valid then?

    Many thanks and best wishes

    • That’s when you use time series cross-validation.

      • Volker Hadamschek

        “Fit the model to the data $y_1,\dots,y_t$ and let $\hat{y}_{t+1}$ denote the forecast of the next observation.”

        But if $y_{t+1}$ has been exploited indirectly in our model training (because there is dependence on $y_t$), how fair is it then to estimate the performance of our model using $y_{t+1}$?

  • Derek Brown

    Great article! Thank you. Professor.
    I was wondering if you have any information on cross-validation for the logistic regression model with L2 regularization (ridge regression)?

  • Joshua Loftus

    “However, as with any variable selection procedure, it can be misused. Beware of looking at statistical tests after selecting variables using cross-validation — the tests do not take account of the variable selection that has taken place and so the p-values can mislead.”

    No more! See arxiv.org/abs/1511.08866 and the literature therein. (Software forthcoming)