Initializing the Holt-​​Winters method

The Holt-​​Winters method is a pop­u­lar and effec­tive approach to fore­cast­ing sea­sonal time series. But dif­fer­ent imple­men­ta­tions will give dif­fer­ent fore­casts, depend­ing on how the method is ini­tial­ized and how the smooth­ing para­me­ters are selected. In this post I will dis­cuss var­i­ous ini­tial­iza­tion methods.

Sup­pose the time series is denoted by y_1,\dots,y_n and the sea­sonal period is m (e.g., m=12 for monthly data). Let \hat{y}_{t+h|t} be the h–step fore­cast made using data to time t. Then the addi­tive for­mu­la­tion of Holt-​​Winters’ method is given by the fol­low­ing equations

    \begin{align*} \ell_t & = \alpha(y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1}+b_{t-1})\\ b_t & = \beta(\ell_t - \ell_{t-1}) + (1-\beta)b_{t-1}\\ s_t &= \gamma(y_t-\ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-m}\\ \hat{y}_{t+h|t} &= \ell_t + b_th + s_{t+h-m}, \end{align*}

and the mul­ti­plica­tive ver­sion is given by

    \begin{align*} \ell_t & = \alpha(y_t/ s_{t-m}) + (1-\alpha)(\ell_{t-1}+b_{t-1})\\ b_t & = \beta(\ell_t - \ell_{t-1}) + (1-\beta)b_{t-1}\\ s_t &= \gamma(y_t/(\ell_{t-1} + b_{t-1})) + (1-\gamma)s_{t-m}\\ \hat{y}_{t+h|t} &= (\ell_t + b_th)s_{t+h-m}. \end{align*}

In many books, the sea­sonal equa­tion (with s_t on the LHS) is slightly dif­fer­ent from these, but I pre­fer the ver­sion above because it makes it eas­ier to write the sys­tem in state space form. In prac­tice, the mod­i­fied form makes very lit­tle dif­fer­ence to the forecasts.

In my 1998 text­book, the fol­low­ing ini­tial­iza­tion was pro­posed. Set

    \begin{align*} \ell_m & = (y_1+\cdots+y_m)/m\\ b_m &= \left[(y_{m+1}+y_{m+2}+\cdots+y_{m+m})-(y_1+y_2+\cdots+y_{m})\right]/m^2. \end{align*}

The level is obvi­ously the aver­age of the first year of data. The slope is set to be the aver­age of the slopes for each period in the first two years:

    \[ (y_{m+1}-y_1)/m,\quad (y_{m+2}-y_2)/m,\quad \dots,\quad (y_{m+m}-y_m)/m. \]

Then, for addi­tive sea­son­al­ity set s_i=y_i-\ell_m and for mul­ti­plica­tive sea­son­al­ity set s_i=y_i/\ell_m, where i=1,\dots,m. This works pretty well, and is easy to imple­ment, but for short and noisy series it can give occa­sional dodgy results. It also has the dis­ad­van­tage of pro­vid­ing state esti­mates for period m, so that the first fore­cast is for period m+1 rather than period 1.

In some books (e.g., Bow­er­man, O’Connell and Koehler, 2005), a regression-​​based pro­ce­dure is used instead. They sug­gest fit­ting a regres­sion with lin­ear trend to the first few years of data (usu­ally 3 or 4 years are used). Then the ini­tial level \ell_0 is set to the inter­cept, and the ini­tial slope b_0 is set to the regres­sion slope. The ini­tial sea­sonal val­ues s_{-m+1},\dots,s_0 are com­puted from the detrended data. This is a very poor method that should not be used as the trend will be biased by the sea­sonal pat­tern. Imag­ine a sea­sonal pat­tern, for exam­ple, where the last period of the year is always the largest value for the year. Then the trend will be biased upwards. Unfor­tu­nately, Bow­er­man, O’Connell and Koehler (2005) are not alone in rec­om­mend­ing bad meth­ods. I’ve seen sim­i­lar, and worse, pro­ce­dures rec­om­mended in other books.

While it would be pos­si­ble to imple­ment a rea­son­ably good regres­sion method, a much bet­ter pro­ce­dure is based on a decom­po­si­tion. This is what was rec­om­mended in my 2008 Springer book and is imple­mented in the HoltWinters and ets func­tions in R.

  1. First fit a 2\times m mov­ing aver­age smoother to the first 2 or 3 years of data (HoltWinters uses 2 years, ets uses 3 years). Here is a quick intro to mov­ing aver­age smooth­ing.
  2. Then sub­tract (for addi­tive HW) or divide (for mul­ti­plica­tive HW) the smooth trend from the orig­i­nal data to get de-​​trended data. The ini­tial sea­sonal val­ues are then obtained from the aver­aged de-​​trended data. For exam­ple, the ini­tial sea­sonal value for Jan­u­ary is the aver­age of the de-​​trended Januaries.
  3. Next sub­tract (for addi­tive HW) or divide (for mul­ti­plica­tive HW) the sea­sonal val­ues from the orig­i­nal data to get sea­son­ally adjusted data.
  4. Fit a lin­ear trend to the sea­son­ally adjusted data to get the ini­tial level \ell_0 (the inter­cept) and the ini­tial slope b_0.

This is gen­er­ally quite good and fast to imple­ment and allows “fore­casts” to be pro­duced from period 1. (Of course, they are not really fore­casts as the data to be fore­cast has been used in con­struct­ing them.) How­ever, it does require 2 or 3 years of data. For very short time series, an alter­na­tive (imple­mented in the ets func­tion in R from v4.07) is to use a sim­ple lin­ear model with time trend and first order Fourier approx­i­ma­tion to the sea­sonal com­po­nent. Use the lin­ear trend in place of the mov­ing aver­age smoother, then pro­ceed with steps 2–4 as above.

Whichever method is used, these ini­tial val­ues should be seen as rough esti­mates only. They can be improved by opti­miz­ing them along with the smooth­ing para­me­ters using max­i­mum like­li­hood esti­ma­tion, for exam­ple. The only imple­men­ta­tion of the Holt-​​Winters’ method that does that, to my knowl­edge, is the ets func­tion in R. In that func­tion, the above pro­ce­dure is used to find start­ing val­ues for the estimation.

Some recent work (De Liv­era, Hyn­d­man and Sny­der, 2010) shows that all of the above may soon be redun­dant for the addi­tive case (but not for the mul­ti­plica­tive case). In Sec­tion 3.1, we show that for lin­ear mod­els, the ini­tial state val­ues can be con­cen­trated out of the like­li­hood and esti­mated directly using a regres­sion pro­ce­dure. Although we present the idea in the con­text of com­plex sea­son­al­ity, it is valid for any lin­ear expo­nen­tial smooth­ing model. I am plan­ning on mod­i­fy­ing the ets func­tion to imple­ment this idea, but it will prob­a­bly have to a wait for a cou­ple of months as my “to do” list is rather long.


Related Posts:


  • http://www.duke-energy.com Scott Haag

    I am an engi­neer, not a math­e­mati­cian. But is the equa­tion for mul­ti­plica­tive St cor­rect? Shouldn’t “b” be pre­ceded by a plus sign, not a neg­a­tive sign? We want to advance to the cur­rent period, not go back to a prior period? Oth­er­wise, a good explanation.

    • http://robjhyndman.com Rob J Hyndman

      Thanks. Now fixed.

  • http://www.xkcd.com John

    Great post. I’d been using the method rec­om­mended by Bow­er­man, O’Connel, and Koehler, so it’s nice to see a critique.

    A ques­tion related to HW but of a dif­fer­ent nature. I’ve been opti­miz­ing alpha, delta, and gamma for HW using a non­lin­ear opti­miza­tion solver to min­i­mize the sum of the squared errors between the actual data and the pre­vi­ous period’s fore­cast of the next period. Are there other approaches for opti­miz­ing these para­me­ters quickly if my goal is to cre­ate a HW model using 3 years worth of data to fore­cast a fourth year? I have 6000 dif­fer­ent time series fore­casts that I need para­me­ters for, which is why I use the word “quickly”.

    Thanks.

    • http://robjhyndman.com Rob J Hyndman

      That method works pretty well. There are sev­eral vari­a­tions on it (as dis­cussed in this IJF (2002) paper, but none that will be quicker than what you are doing if you want to opti­mize the para­me­ters for every time series.

      One alter­na­tive to non­lin­ear opti­miza­tion for each time series is to use the same smooth­ing para­me­ters for all time series that you are fore­cast­ing (or divide the time series into homo­ge­neous groups and use the same para­me­ters in each group).

      Another is to use fixed, pre-​​specified, val­ues of the para­me­ters and don’t tune them at all. This is actu­ally a very com­mon approach and works sur­pris­ingly well.

  • Abhishek

    Hello Rob
    Great post. I was build­ing Winter’s model in MS excel. I am using Solver add in to find opti­mal val­ues for alpha, beta and gamma. It gives alpha and gamma as 1 and beta as zero. I am try­ing to min­i­mize the Mean Absolute Error.
    I am skep­ti­cal about this solu­tion given by solver. Can you sug­gest if the solver is the best tool to find the opti­mal val­ues of the para­me­ter that fits the data or is there some bet­ter tool available.

    • http://robjhyndman.com Rob J Hyndman

      Solver is a non-​​linear opti­mizer. It does a rea­son­able job, but when there are local max­ima it will often con­verge to a non-​​optimal solu­tion. There are bet­ter non-​​linear opti­miz­ers that are more robust to local max­ima (e.g., PSO), but they will take a lot longer. You will have to use some­thing other than MS-​​Excel if you want to explore bet­ter tools. I sug­gest you use R.

  • Sebas­t­ian

    This is a great post and has been extremely use­ful to me (I work with bio­sur­veil­lance data) as well as your for­mu­las for pre­dic­tion inter­vals of the mul­ti­plica­tive HW (Hyn­d­man, Koehler, Ord and Sny­der, 2005)

    Regard­ing these for­mu­las, I have a ques­tion (sorry to ask it in this post):
    As far as I under­stand, they are only valid when the resid­u­als have no auto­cor­re­la­tion. How­ever, I have some data for which HW with opti­miza­tion of the MSE gives highly cor­re­lated resid­u­als (r1 \approx 0.5) Is there a way to mod­ify the equa­tions of the pre­dic­tion inter­vals to take this into account? Or maybe they can be used as lower (or even upper) bounds for the pre­dic­tion intervals?

    Many thanks in advance!

    • http://robjhyndman.com Rob J Hyndman

      Yes, the pre­dic­tion inter­vals assume uncor­re­lated resid­u­als. It would be pos­si­ble, but com­pli­cated, to derive inter­vals allow­ing for auto­cor­re­lated errors. How­ever, if auto­cor­re­la­tion occurs, I sug­gest you use an ARIMA model instead.

  • Lak­shmi

    Dear Rob,

    I am work­ing on holt winter’s method for a FMCG product.I need to fore­cast for future 2 years and i have the his­tory data for 3 years. Are there any thumb rules sug­gest­ing that alpha should not be greater than 0.3, beta should not be greater than 0.3 and gamma should not be greater than 0.1?
    If yes, then at what sit­u­a­tions should we use higher alpha, beta, gamma?

    • http://robjhyndman.com Rob J Hyndman

      If you don’t have enough data to opti­mize the para­me­ters, then a rea­son­able rule is to set alpha=0.2, beta=0.1, gamma=0.1.

  • Tithi Sen

    Hi Rob,

    I have data points for 49 months, while for­mu­lat­ing SI for HW in excel, how many months of data shall Intake to cal­cu­late the ini­tial dou­ble mov­ing average

  • Hamada zakaria

    good morn­ing­please, if you can give me the method of search­ing for alpha beta gamma with Excel or other method,

  • Elif

    Thanks for info. But I’m con­fused about the seasonalities.

    Con­sider hav­ing 24 months of data, I take their aver­ages for each month and find 12 sea­son­al­i­ties. Accord­ing to your for­mula using S(t+h-m = 24+1−12=13) in order to fore­cast 25. month, I should use 13. sea­son­al­ity which is also a fore­cast for jan. (not an aver­age). This sounds very improper to me, but I inves­ti­gated also some other books and they give the same formulation.

    Can you please explain it to me…

    • http://robjhyndman.com Rob J Hyndman

      If your data start in Jan­u­ary, then obs 25 is a Jan­u­ary. So you should use the best esti­mate of the Jan sea­son­al­ity when fore­cast­ing it. The best esti­mate is from the pre­vi­ous Jan which hap­pens to be obs 13.

  • Guest

    I am try­ing to use holt win­ters method to fore­cast my next year’s monthly demand…I have two years monthly data…how would i get the pro­jected num­bers for next year mon­th­wise as this method requires actual data to com­pute the for­cast which i dont have…

  • Guest

    I am try­ing to use holt win­ters method to fore­cast my next year’s monthly demand…I have two years monthly data…how would i get the pro­jected num­bers for next year mon­th­wise as this method requires actual data to com­pute the for­cast which i dont have.

  • Girondist

    Awe­some post, and I’ve been try­ing out these dif­fer­ent meth­ods on some data with a pos­i­tive lin­ear trend. The ini­tial level you get from your ’98 text­book is MUCH HIGHER than the other 2 meth­ods given here for my data since it’s the aver­age value of the first 12 months rather than a y-​​intercept. It’s strange to me that the ’98 method is legit.

    On to my ques­tion, per your 3rd method listed I’m using the 2×12 smoother to get my ini­tial sea­son­al­ity fac­tors on 3 years worth of data. Should I throw away my edges? In other words, 3 years worth of data would turn into 2 years worth of smoothed data, and I’d aver­age 2 smoothed points to get a sin­gle ini­tial sea­son­al­ity factor.

    Thanks. Keep this stuff com­ing. There’s are so few good time series sources on the internet.

    • http://robjhyndman.com Rob J Hyndman

      Note that the level from my 98 book is for the mth time period, not the 0th time period as for the other meth­ods. That is why it will be much larger when there is a pos­i­tive trend.

      Yes, the smooth­ing will cause you to lose a year.

    • sqrt

      I dont get it. If you smooth 3 years (36 peri­ods) with a mov­ing aver­age of 2×12=24 peri­ods you’ll loose 12 peri­ods on each edge .. what am I missing?

      • http://robjhyndman.com/ Rob J Hyndman

        No, a 2×12 MA means do a 12MA fol­lowed by a 2MA. It is equiv­a­lent to a weighted 13MA and you lose 6 peri­ods on each edge. See http://​otexts​.com/​f​p​p​/6/2/

        • sqrt

          So it’s a cen­tred MA, isnt it?

          Could you also please con­cre­tise the third ini­tial­i­sa­tion vari­ant (“For very short time series, an alter­na­tive ..”)? I found the cor­re­spond­ing part in the R source code, but don’t quite under­stand it. Sub­tract­ing the mean of the first year to get sea­sonal val­ues just sounds like your first approach.

          • http://robjhyndman.com/ Rob J Hyndman

            Yes, a 2x12-​​MA is also called a cen­tered 12-​​MA.

            For short series, I use the same as my 1998 approach for the sea­sonal com­po­nent. But the other com­po­nents are han­dled dif­fer­ently. In R, the ini­tial sea­sonal com­po­nent is simply

            switch(seasontype, A = y-mean(y), M = y/mean(y))

  • Erik P.

    Thanks for this excel­lent post — it’s a cou­ple years old but still very use­ful. I am try­ing to write code to ini­tial­ize the opti­miza­tion process for a gen­eral ETS(., ., .) process and this is help­ing me a lot. I have four questions:

    A) If m is odd, I sup­pose that instead of a 2xm MA, we use a plain m item MA, right? Which means we can use one data point less to get the same num­ber of data points for each “month”.

    B) Would it make sense to take a mov­ing *geo­met­ric* mean for the mul­ti­plica­tive case? This seems to lead to bet­ter results, at least for a low noise sit­u­a­tion: in the limit for epsilon=0, it gives the exact answer.

    C) Things seem to get a lit­tle more com­pli­cated for the other ETS(., ., .) mod­els. My cur­rent plan, fol­low­ing this blog post, is:

    1. Com­pute aver­ages per sea­son using mov­ing aver­ages — 2xm if m is even, m if m is odd; arith­metic mean if sea­son­al­ity is addi­tive, geo­met­ric mean if it is multiplicative.

    1.5 Fit a trend line through the result­ing aver­ages; the type of trend line is lin­ear for addi­tive trend, expo­nen­tial for mul­ti­plica­tive trend, and com­pli­cated for damped trends (I believe I have a han­dle on this).

    2. Remove the fit­ted trend line (not the aver­ages) from the orig­i­nal data by sub­trac­tion (A /​ Ad trends) or divi­sion (M /​ Md trends) to obtain detrended data. Ini­tial sea­sonal val­ues are obtained as aver­ages of the aver­ages that cor­re­spond to their “month”. I’m not yet sure if these aver­ages should also be geo­met­ric in some cases, but it doesn’t seem to make a whole lot of difference.

    3. Sub­tract (for “A” sea­son­al­ity) or divide (for “M” sea­son­al­ity) the sea­sonal val­ues from the orig­i­nal data to obtain desea­son­al­ized data.

    4. Do the same as step 1.5 but now to desea­son­al­ized data to obtain ini­tial val­ues for l0, b0, and if appro­pri­ate, phi. (Tak­ing care to

    Does any of this seem obvi­ously flawed?

    D) I haven’t really thought about how to ini­tial­ize alpha, beta, and gamma yet; are there obvi­ous strate­gies apart from just using default values?

    Let me know if this is too involved; the ques­tions got more com­pli­cated as I was typ­ing. I sup­pose I could ask this as a num­ber of sep­a­rate cross​val​i​dated​.com questions.

    • http://robjhyndman.com/ Rob J Hyndman

      A) Yes use an m-​​MA when m is odd.

      B) Yes, that would also work.

      C) That sounds ok.

      D) You can see my pref­er­ences in the forecast:::initparam func­tion. The defaults are alpha=0.5, beta=0.1, gamma=0.01, phi=0.978.

      • Erik P.

        Thanks Rob!

  • sqrt

    We have a very short time series (25 val­ues). Now that I’ve tested your three meth­ods I also used an evo­lu­tion­ary algo­rithm to opti­mize all para­me­ters at once (10 sea­sonal indices, 3 smooth­ing para­me­ters, ini­tial level and slope). The result’s SSE is min­i­mal but the curve just looks too smoothed (kind of sinu­soidal). I don’t trust these results. What’s wrong with this approach? I need to com­pare my HW fore­casts with an ARIMA model but whether HW fore­casts bet­ter or not depends on which init approach I used .. so it’s not an objec­tive choice any­more. Thx in advance ..

    • http://robjhyndman.com/ Rob J Hyndman

      You can­not expect to esti­mate 15 para­me­ters with 25 observations.

      • sqrt

        Thought so. Guess I’ll use your third method then since you’ve rec­om­mended it espe­cially for short series. How­ever, the com­par­i­son with the ARIMA model still depends a lot on how HW was initialized ..

  • Steve

    Your Post:
    “1.First fit a mov­ing aver­age smoother to the first 2 or 3 years of data (HoltWin­ters uses 2 years, ets uses 3 years).

    2.Then sub­tract (for addi­tive HW) or divide (for mul­ti­plica­tive HW)
    the smooth trend from the orig­i­nal data to get de-​​​​trended data. The
    ini­tial sea­sonal val­ues are then obtained from the aver­aged
    de-​​​​trended data. For exam­ple, the ini­tial sea­sonal value for
    Jan­u­ary is the aver­age of the de-​​​​trended Januaries.

    3.Next sub­tract (for addi­tive HW) or divide (for mul­ti­plica­tive HW) the sea­sonal val­ues from the orig­i­nal data to get sea­son­ally adjusted data.

    4.Fit a lin­ear trend to the sea­son­ally adjusted data to get the ini­tial level (the inter­cept) and the ini­tial slope ”

    Could you please give exam­ple how to cal­cu­late the ini­tial value for H-​​W method as the step you dis­cussed above? I am really stuck at this stage.