A blog by Rob J Hyndman 

Twitter Gplus RSS

Why every statistician should know about cross-​​validation

Published on 4 October 2010

Sur­pris­ingly, many sta­tis­ti­cians see cross-​​validation as some­thing data min­ers do, but not a core sta­tis­ti­cal tech­nique. I thought it might be help­ful to sum­ma­rize the role of cross-​​validation in sta­tis­tics, espe­cially as it is pro­posed that the Q&A site at stats​.stack​ex​change​.com should be renamed Cross​Val​i​dated​.com.

Cross-​​validation is pri­mar­ily a way of mea­sur­ing the pre­dic­tive per­for­mance of a sta­tis­ti­cal model. Every sta­tis­ti­cian knows that the model fit sta­tis­tics are not a good guide to how well a model will pre­dict: high R^2 does not nec­es­sar­ily mean a good model. It is easy to over-​​fit the data by includ­ing too many degrees of free­dom and so inflate R^2 and other fit sta­tis­tics. For exam­ple, in a sim­ple poly­no­mial regres­sion I can just keep adding higher order terms and so get bet­ter and bet­ter fits to the data. But the pre­dic­tions from the model on new data will usu­ally get worse as higher order terms are added.

One way to mea­sure the pre­dic­tive abil­ity of a model is to test it on a set of data not used in esti­ma­tion. Data min­ers call this a “test set” and the data used for esti­ma­tion is the “train­ing set”. For exam­ple, the pre­dic­tive accu­racy of a model can be mea­sured by the mean squared error on the test set. This will gen­er­ally be larger than the MSE on the train­ing set because the test data were not used for estimation.

How­ever, there is often not enough data to allow some of it to be kept back for test­ing. A more sophis­ti­cated ver­sion of training/​​test sets is leave-​​one-​​out cross-​​​​validation (LOOCV) in which the accu­racy mea­sures are obtained as fol­lows. Sup­pose there are n inde­pen­dent obser­va­tions, y_1,\dots,y_n.

  1. Let obser­va­tion i form the test set, and fit the model using the remain­ing data. Then com­pute the error (e_{i}^*=y_{i}-\hat{y}_{i}) for the omit­ted obser­va­tion. This is some­times called a “pre­dicted resid­ual” to dis­tin­guish it from an ordi­nary residual.
  2. Repeat step 1 for i=1,\dots,n.
  3. Com­pute the MSE from e_{1}^*,\dots,e_{n}^*. We shall call this the CV.

This is a much more effi­cient use of the avail­able data, as you only omit one obser­va­tion at each step. How­ever, it can be very time con­sum­ing to imple­ment (except for lin­ear mod­els — see below).

Other sta­tis­tics (e.g., the MAE) can be com­puted sim­i­larly. A related mea­sure is the PRESS sta­tis­tic (pre­dicted resid­ual sum of squares) equal to n\timesMSE.

Vari­a­tions on cross-​​validation include leave-​​k-​​out cross-​​validation (in which k obser­va­tions are left out at each step) and k-​​fold cross-​​validation (where the orig­i­nal sam­ple is ran­domly par­ti­tioned into k sub­sam­ples and one is left out in each iter­a­tion). Another pop­u­lar vari­ant is the .632+bootstrap of Efron & Tib­shi­rani (1997) which has bet­ter prop­er­ties but is more com­pli­cated to implement.

Min­i­miz­ing a CV sta­tis­tic is a use­ful way to do model selec­tion such as choos­ing vari­ables in a regres­sion or choos­ing the degrees of free­dom of a non­para­met­ric smoother. It is cer­tainly far bet­ter than pro­ce­dures based on sta­tis­ti­cal tests and pro­vides a nearly unbi­ased mea­sure of the true MSE on new observations.

How­ever, as with any vari­able selec­tion pro­ce­dure, it can be mis­used. Beware of look­ing at sta­tis­ti­cal tests after select­ing vari­ables using cross-​​validation — the tests do not take account of the vari­able selec­tion that has taken place and so the p-​​values can mislead.

It is also impor­tant to realise that it doesn’t always work. For exam­ple, if there are exact dupli­cate obser­va­tions (i.e., two or more obser­va­tions with equal val­ues for all covari­ates and for the y vari­able) then leav­ing one obser­va­tion out will not be effective.

Another prob­lem is that a small change in the data can cause a large change in the model selected. Many authors have found that k-​​fold cross-​​validation works bet­ter in this respect.

In a famous paper, Shao (1993) showed that leave-​​one-​​out cross val­i­da­tion does not lead to a con­sis­tent esti­mate of the model. That is, if there is a true model, then LOOCV will not always find it, even with very large sam­ple sizes. In con­trast, cer­tain kinds of leave-​​k-​​out cross-​​validation, where k increases with n, will be con­sis­tent. Frankly, I don’t con­sider this is a very impor­tant result as there is never a true model. In real­ity, every model is wrong, so con­sis­tency is not really an inter­est­ing property.

Cross-​​validation for lin­ear models

While cross-​​validation can be com­pu­ta­tion­ally expen­sive in gen­eral, it is very easy and fast to com­pute LOOCV for lin­ear mod­els. A lin­ear model can be writ­ten as

    \[ \mathbf{Y} = \mathbf{X}\mbox{\boldmath$\beta$} + \mathbf{e}. \]

Then

    \[ \hat{\mbox{\boldmath$\beta$}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \]

and the fit­ted val­ues can be cal­cu­lated using

    \[ \mathbf{\hat{Y}} = \mathbf{X}\hat{\mbox{\boldmath$\beta$}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y}, \]

where \mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' is known as the “hat-​​matrix” because it is used to com­pute \mathbf{\hat{Y}} (“Y-​​hat”).

If the diag­o­nal val­ues of \mathbf{H} are denoted by h_{1},\dots,h_{n}, then the cross-​​validation sta­tis­tic can be com­puted using

    \[ \text{CV} = \frac{1}{n}\sum_{i=1}^n [e_{i}/(1-h_{i})]^2, \]

where e_{i} is the resid­ual obtained from fit­ting the model to all n obser­va­tions. See Christensen’s book Plane Answers to Com­plex Ques­tions for a proof. Thus, it is not nec­es­sary to actu­ally fit n sep­a­rate mod­els when com­put­ing the CV sta­tis­tic for lin­ear mod­els. This remark­able result allows cross-​​validation to be used while only fit­ting the model once to all avail­able observations.

Rela­tion­ships with other quantities

Cross-​​validation sta­tis­tics and related quan­ti­ties are widely used in sta­tis­tics, although it has not always been clear that these are all con­nected with cross-​​validation.

Jack­knife

A jack­knife esti­ma­tor is obtained by recom­put­ing an esti­mate leav­ing out one obser­va­tion at a time from the esti­ma­tion sam­ple. The n esti­mates allow the bias and vari­ance of the sta­tis­tic to be calculated.

Akaike’s Infor­ma­tion Criterion

Akaike’s Infor­ma­tion Cri­te­rion is defined as

    \[ \text{AIC} = -2\log {\cal L}+ 2p, \]

where {\cal L} is the max­i­mized like­li­hood using all avail­able data for esti­ma­tion and p is the num­ber of free para­me­ters in the model. Asymp­tot­i­cally, min­i­miz­ing the AIC is equiv­a­lent to min­i­miz­ing the CV value. This is true for any model (Stone 1977), not just lin­ear mod­els. It is this prop­erty that makes the AIC so use­ful in model selec­tion when the pur­pose is prediction.

Schwarz Bayesian Infor­ma­tion Criterion

A related mea­sure is Schwarz’s Bayesian Infor­ma­tion Criterion:

    \[ \text{BIC} = -2\log {\cal L}+ p\log(n), \]

where n is the num­ber of obser­va­tions used for esti­ma­tion. Because of the heav­ier penalty, the model cho­sen by BIC is either the same as that cho­sen by AIC, or one with fewer terms. Asymp­tot­i­cally, for lin­ear mod­els min­i­miz­ing BIC is equiv­a­lent to leave–v–out cross-​​validation when v = n[1-1/(\log(n)-1)] (Shao 1997).

Many sta­tis­ti­cians like to use BIC because it is con­sis­tent — if there is a true under­ly­ing model, then with enough data the BIC will select that model. How­ever, in real­ity there is rarely if ever a true under­ly­ing model, and even if there was a true under­ly­ing model, select­ing that model will not nec­es­sar­ily give the best fore­casts (because the para­me­ter esti­mates may not be accurate).

Cross-​​validation for time series

When the data are not inde­pen­dent cross-​​validation becomes more dif­fi­cult as leav­ing out an obser­va­tion does not remove all the asso­ci­ated infor­ma­tion due to the cor­re­la­tions with other obser­va­tions. For time series fore­cast­ing, a cross-​​validation sta­tis­tic is obtained as follows

  1. Fit the model to the data y_1,\dots,y_t and let \hat{y}_{t+1} denote the fore­cast of the next obser­va­tion. Then com­pute the error (e_{t+1}^*=y_{t+1}-\hat{y}_{t+1}) for the fore­cast observation.
  2. Repeat step 1 for t=m,\dots,n-1 where m is the min­i­mum num­ber of obser­va­tions needed for fit­ting the model.
  3. Com­pute the MSE from e_{m+1}^*,\dots,e_{n}^*.

Ref­er­ences

An excel­lent and com­pre­hen­sive recent sur­vey of cross-​​validation results is Arlot and Celisse (2010)


Related Posts:


 
29 Comments  comments 
  • Stephan Kolassa

    Very nice arti­cle, thanks! The Arlot and Celisse paper is even freely avail­able from Project Euclid… as if I didn’t have enough to read already… ;-)

    Any thoughts on using cross-​​validation with mixed lin­ear mod­els, e.g., with repeated mea­sure­ments on each par­tic­i­pant in clin­i­cal stud­ies? It seems as if Arlot & Celisse don’t explic­itly treat this case.

  • Stephan Kolassa

    Hm, this looks inter­est­ing. Thanks!

  • inwit

    Hi, Rob! This post of yours brought me back one old ques­tion about time series and cross-​​validation. Instead of post­ing it here, I’ve sent it to Stack­Ex­change. This is a great blog and a good source for inspi­ra­tion! Keep rock­ing! :)

  • Vishal Bel­sare

    Rob, thanks for a nice post! and for point­ing out the paper by Arlot & Celisse.

  • Abhi­jit

    Hi Rob,

    Very nicely done, indeed.

    I’ll bite and ask about your com­ment on con­sis­tency. I would agree that any model we cre­ate is wrong, but it doesn’t fol­low that there is no under­ly­ing true model, how­ever com­plex, that we’re try­ing to approx­i­mate. I would posit that
    at least locally, and per­haps glob­ally, there is a true regres­sion func­tion E(Y¦X). We can dis­cuss this offline, if you like.

    • http://robjhyndman.com Rob J Hyndman

      I think we are using con­sis­tency in two dif­fer­ent ways. I agree that con­sis­tency of an esti­ma­tor of E(Y|X) is impor­tant when we know X. And we get that with any decent esti­ma­tor pro­vided the form of Y = f(X,error) is cor­rectly spec­i­fied. (e.g., if f is lin­ear and error is iid than OLS is consistent.)

      But the prob­lem of vari­able selec­tion is find­ing E(Y|Z) where Z is unknown and prob­a­bly unknow­able. The BIC pro­vides con­sis­tency only when Z is con­tained within our set of poten­tial pre­dic­tor vari­ables, but we can never know if that is true. That is why I sug­gest that con­sis­tency is not a use­ful con­cept in this context.

  • http://yaroslavvb.blogspot.com Yaroslav Bula­tov

    Another issue I have with con­sis­tency is that it addresses infi­nite sam­ple case, but you may never see enough data for the infi­nite sam­ple prop­er­ties to mat­ter. For instance, Con­trastive Diver­gence esti­ma­tor is con­sis­tent but for a dense model over n vari­ables it takes in the order of 2n sam­ples for the esti­mate to approach true value

    • http://statbandit.wordpress.com Abhi­jit

      Yes, so a lot of my col­leagues look at rates of con­ver­gence rather than just con­sis­tency. We’ve also been play­ing with sim­u­la­tion strate­gies to see how well a “con­sis­tent” method does in the finite sam­ple situations.

  • http://yaroslavvb.blogspot.com Yaroslav Bula­tov

    Once you have an esti­mate of method’s per­for­mance for finite sam­ple size, why does con­sis­tency mat­ter? IE, would you ever have a rea­son to pre­fer a con­sis­tent esti­ma­tor over an incon­sis­tent one with bet­ter finite-​​sample properties?

  • http://lingpipe-blog.com/ Bob Car­pen­ter

    I think the biggest dif­fer­ence between prac­ti­tion­ers of stats and machine learn­ing is what infer­ences they care about.

    Sup­pose we have data on region of the loca­tion you live in, edu­ca­tion, sex, age, eth­nic­ity, price of home, and mort­gage
    on-​​time pay­ment sta­tus, say in a time series over a decade.

    With my “machine learn­ing” hat on, I want to pre­dict whether an indi­vid­ual will default in some time frame given the value of the pre­dic­tors. My imag­ined “cus­tomer” is a bank. I might throw the data through an SVM with a com­plex ker­nel if I only care about 01 out­come, or through a deci­sion for­est, or use K near­est neigh­bors. All of these are likely to pro­duce rea­son­able default pre­dic­tions, and a com­mit­tee of such even bet­ter predictions.

    With my “applied sta­tis­ti­cian” hat on, I might want to esti­mate the effect of age on mort­gage defaults, con­trol­ling for the other predictors.

    It’s just a very dif­fer­ent game. Cross-​​validation makes much more sense in the for­mer game. But as you point out, one needs to be care­ful. The biggest mis­take I see, in prac­tice, is the one you men­tion — tun­ing using cross-​​validation over all folds then assum­ing you’ll get the same per­for­mance on new data.

    Both groups tend to for­get that nei­ther the pop­u­la­tion nor effects are sta­tion­ary (in the sta­tis­ti­cal sense). That is, the pop­u­la­tion over which we’re pre­dict­ing isn’t the same as the one over which we col­lected the data. Sure, we can do things like post-​​stratification to adjust for sam­pling bias, but there’s the under­ly­ing change in atti­tudes, wealth, and so on that are chang­ing in this example.

  • http://chandlerlutz.com Chan­dler Lutz

    Hi Rob,

    I really enjoyed the arti­cle. Do you have a ref­er­ence for time series cross-​​validation tech­nique that you men­tion at the end?

    Thanks, Chan­dler

    • http://robjhyndman.com Rob J Hyndman

      That is com­mon prac­tice is fore­cast­ing eval­u­a­tion stud­ies, but I’ve never seen it in a text­book. I’ve put it in my new book (incom­plete) at http://​rob​jhyn​d​man​.com/​f​p​p​/2/5/.

      • Thomas

        Dear Rob,
        Many thanks for the arti­cle (2 years later…). Do you by any chance have any ref­er­ence of this tech­nique being used in a pub­lished arti­cle?
        Thanks!
        Thomas

  • Torsten See­mann

    Men­tion should be made of the more gen­eral information-​​theoretic approaches such as MML (min­i­mum mes­sage length) and MDL (min­i­mum descrip­tion length) of which AIC and BIC are restricted instances of:

    MML: http://​en​.wikipedia​.org/​w​i​k​i​/​M​i​n​i​m​u​m​_​M​e​s​s​a​g​e​_​L​ength
    MDL: http://​en​.wikipedia​.org/​w​i​k​i​/​M​i​n​i​m​u​m​_​d​e​s​c​r​i​p​t​i​o​n​_​l​ength

  • http://profiles.google.com/bayesianlogic.1 Jan Galkowski

    Any com­ments on rel­a­tive mer­its of cross-​​validation and 0.632+ boot­strap, espe­cially for time series?

    • http://robjhyndman.com Rob J Hyndman

      I don’t know much about this. Efron and Tib­shi­rani (
      http://​www​.jstor​.org/​s​t​a​b​l​e​/​2​9​65703 ) argue for the 0.632 boot­strap over cross-​​validation but I don’t think it has any real the­o­ret­i­cal sup­port. I’ve not thought about how the 0.632 boot­strap would work in the time series context.

  • http://profiles.google.com/bayesianlogic.1 Jan Galkowski

    The boot­strap itself has plenty of the­o­ret­i­cal sup­port (*) both in an inde­pen­dent and depen­dent data con­tex. (Ref­er­ences below.) How­ever, I have not seen much in terms of gen­er­al­iz­ing the 0.632 (which Hall, at least, argues should really be 0.667). I did read about after post­ing my ques­tion, to see what I could find:

    R.M.Kunst, “Cross val­i­da­tion of pre­dic­tion mod­els for sea­sonal time series by para­met­ric boot­strap­ping,” Aus­trian Jour­nal of Sta­tis­tics, 37(3&4), 2008, 271–284.

    D.N.Politis, J.P.Romano, “The sta­tion­ary boot­strap”,  JASM, 89(428), 1994, 1303–1313.

    (*) S.N.Lahiri, RESAMPLING METHODS FOR DEPENDENT DATA, Springer, 2010.

    P.Hall, “On the biases of error esti­ma­tors in pre­dic­tion prob­lems”, Sta­tis­tics and Prob­a­bil­ity Let­ters 24(3), 15 Aug 1995, 

    There’s some work reported over at IEEE Trans­ac­tions on 0.632 boot­strap approaches to time series, but I’m not a mem­ber and don’t have ready access to a library to look.

    Also, there’s a ref­er­ence for cross-​​validation to depen­dent data, namely, 

    P.Burman, E.Chow, D.Nolan, “A cross-​​validatory method for depen­dent data”, BIOMETRIKA 1994, 81(2), 351–358.

    The entire area of resam­pling is pretty well devel­oped, in prac­tice as well as the­ory, e.g., 

    A.C.Davison, D.V.Hinkley, BOOTSTRAP METHODS AND THEIR APPLICATION, Cam­bridge Uni­ver­sity Press, 1997.

    M.R.Chernick, BOOTSTRAP METHODS: A GUIDE FOR PRACTITIONERS AND RESEARCHERS, 2nd edi­tion, 2008.

    P.Hall, THE BOOTSTRAP AND EDGEWORTH EXPANSION, Springer, 1992.

    Happy to bring your read­er­ship up to date.

    • http://robjhyndman.com Rob J Hyndman

      Thanks for the ref­er­ences. I meant that Efron and Tibshirani’s 0.632 boot­strap idea was empir­i­cally rather than the­o­ret­i­cally based. Of course, there are many vari­a­tions of the boot­strap that have been thor­oughly stud­ied from a the­o­ret­i­cal perspective.

  • Pingback: Calculating forecast error with time series cross-validation | Q&A System

  • Fabio Goncalves

    Hi Rob, thanks for the arti­cle! The link to Shao (1995) below actu­ally points to a 1993 paper, which doesn’t seem to men­tion Schwarz’s BIC
     “Asymp­tot­i­cally, for lin­ear mod­els min­i­miz­ing BIC is equiv­a­lent to leave––out cross-​​validation when  (Shao 1995).”

    Would you be able to con­firm this reference?

    Thanks!

    • http://robjhyndman.com Rob J Hyndman

      Thanks for spot­ting that error. I’ve fixed the link to point to Shao (1997).

  • Matt Schnei­der

    To the group: I read var­i­ous machine learn­ing papers on pre­dic­tion that select a tun­ing para­me­ter or num­ber of iter­a­tions (let’s say for boost­ing or trees) based on k-​​fold cross val­i­da­tion.  I can see how it makes sense if either of those para­me­ters (tun­ing, iter­a­tions) are cho­sen on each of k train­ing sets and then we look at the aver­age pre­dic­tion error (let’s say MSE) of the k test sets. How­ever, am I mis­un­der­stand­ing some­thing or is it a mis­use when the para­me­ters are cho­sen based on the aggre­gate pre­dic­tion results after doing all the k-​​folds (arg min (para­me­ters) { total MSE on test sets} ) ? 

    Two thoughts: 1) For infer­ence this may be OK because those “best” tun­ing para­me­ters accu­rately model the pop­u­la­tion well and a typ­i­cal error of a with­held obser­va­tion.  2) To call this fore­cast­ing, it seems off. Opti­mal para­me­ters depend on all the data. None of the data was com­pletely with­held.  “Pre­dic­tion error” isn’t nec­es­sar­ily “fore­cast error?”

     Agree? Disagree? 

  • Pingback: Research tips - Major changes to the forecast package

  • Pingback: R Binomial Regression | GH Powell, D.I.

  • http://www.facebook.com/chongwu.math Chong Wu

    Dear pro­fes­sor Rob J Hyndman

    I am Chong Wu from China and I will be fin­ish­ing a BS degree in Applied Math at the Huazhong Uni­ver­sity of Sci­ence & Tech­nol­ogy (Top 10 in China) next year. I like this clear and enlight­ened arti­cle.
    I was won­der­ing if you have any plan to recruit new PhD can­di­date in 2013 fall.

  • http://twitter.com/IstvanHajnal Ist­van Hajnal

    Great overview. Thanks.