# Initializing the Holt-Winters method

The Holt-Winters method is a popular and effective approach to forecasting seasonal time series. But different implementations will give different forecasts, depending on how the method is initialized and how the smoothing parameters are selected. In this post I will discuss various initialization methods.

Suppose the time series is denoted by $y_1,\dots,y_n$ and the seasonal period is $m$ (e.g., $m=12$ for monthly data). Let $\hat{y}_{t+h|t}$ be the $h$-step forecast made using data to time $t$. Then the additive formulation of Holt-Winters’ method is given by the following equations

\begin{align}
\ell_t & = \alpha(y_t – s_{t-m}) + (1-\alpha)(\ell_{t-1}+b_{t-1})\\
b_t & = \beta(\ell_t – \ell_{t-1}) + (1-\beta)b_{t-1}\\
s_t &= \gamma(y_t-\ell_{t-1} – b_{t-1}) + (1-\gamma)s_{t-m}\\
\hat{y}_{t+h|t} &= \ell_t + b_th + s_{t+h-m},
\end{align}

and the multiplicative version is given by

\begin{align}
\ell_t & = \alpha(y_t/ s_{t-m}) + (1-\alpha)(\ell_{t-1}+b_{t-1})\\
b_t & = \beta(\ell_t – \ell_{t-1}) + (1-\beta)b_{t-1}\\
s_t &= \gamma(y_t/(\ell_{t-1} + b_{t-1})) + (1-\gamma)s_{t-m}\\
\hat{y}_{t+h|t} &= (\ell_t + b_th)s_{t+h-m}.
\end{align}

In many books, the seasonal equation (with $s_t$ on the LHS) is slightly different from these, but I prefer the version above because it makes it easier to write the system in state space form. In practice, the modified form makes very little difference to the forecasts.

In my 1998 textbook, the following initialization was proposed. Set
\begin{align}
\ell_m & = (y_1+\cdots+y_m)/m\\
b_m &= \left[(y_{m+1}+y_{m+2}+\cdots+y_{m+m})-(y_1+y_2+\cdots+y_{m})\right]/m^2.
\end{align}
The level is obviously the average of the first year of data. The slope is set to be the average of the slopes for each period in the first two years:
$$(y_{m+1}-y_1)/m,\quad (y_{m+2}-y_2)/m,\quad \dots,\quad (y_{m+m}-y_m)/m.$$
Then, for additive seasonality set $s_i=y_i-\ell_m$ and for multiplicative seasonality set $s_i=y_i/\ell_m$, where $i=1,\dots,m$. This works pretty well, and is easy to implement, but for short and noisy series it can give occasional dodgy results. It also has the disadvantage of providing state estimates for period $m$, so that the first forecast is for period $m+1$ rather than period 1.

In some books (e.g., Bowerman, O’Connell and Koehler, 2005), a regression-based procedure is used instead. They suggest fitting a regression with linear trend to the first few years of data (usually 3 or 4 years are used). Then the initial level $\ell_0$ is set to the intercept, and the initial slope $b_0$ is set to the regression slope. The initial seasonal values $s_{-m+1},\dots,s_0$ are computed from the detrended data. This is a very poor method that should not be used as the trend will be biased by the seasonal pattern. Imagine a seasonal pattern, for example, where the last period of the year is always the largest value for the year. Then the trend will be biased upwards. Unfortunately, Bowerman, O’Connell and Koehler (2005) are not alone in recommending bad methods. I’ve seen similar, and worse, procedures recommended in other books.

While it would be possible to implement a reasonably good regression method, a much better procedure is based on a decomposition. This is what was recommended in my 2008 Springer book and is implemented in the HoltWinters and ets functions in R.

1. First fit a $2\times m$ moving average smoother to the first 2 or 3 years of data (HoltWinters uses 2 years, ets uses 3 years). Here is a quick intro to moving average smoothing.
2. Then subtract (for additive HW) or divide (for multiplicative HW) the smooth trend from the original data to get de-trended data. The initial seasonal values are then obtained from the averaged de-trended data. For example, the initial seasonal value for January is the average of the de-trended Januaries.
3. Next subtract (for additive HW) or divide (for multiplicative HW) the seasonal values from the original data to get seasonally adjusted data.
4. Fit a linear trend to the seasonally adjusted data to get the initial level $\ell_0$ (the intercept) and the initial slope $b_0$.

This is generally quite good and fast to implement and allows “forecasts” to be produced from period 1. (Of course, they are not really forecasts as the data to be forecast has been used in constructing them.) However, it does require 2 or 3 years of data. For very short time series, an alternative (implemented in the ets function in R from v4.07) is to use a simple linear model with time trend and first order Fourier approximation to the seasonal component. Use the linear trend in place of the moving average smoother, then proceed with steps 2-4 as above.

Whichever method is used, these initial values should be seen as rough estimates only. They can be improved by optimizing them along with the smoothing parameters using maximum likelihood estimation, for example. The only implementation of the Holt-Winters’ method that does that, to my knowledge, is the ets function in R. In that function, the above procedure is used to find starting values for the estimation.

Some recent work (De Livera, Hyndman and Snyder, 2010) shows that all of the above may soon be redundant for the additive case (but not for the multiplicative case). In Section 3.1, we show that for linear models, the initial state values can be concentrated out of the likelihood and estimated directly using a regression procedure. Although we present the idea in the context of complex seasonality, it is valid for any linear exponential smoothing model. I am planning on modifying the ets function to implement this idea, but it will probably have to a wait for a couple of months as my “to do” list is rather long.

### Related Posts:

• I am an engineer, not a mathematician. But is the equation for multiplicative St correct? Shouldn’t “b” be preceded by a plus sign, not a negative sign? We want to advance to the current period, not go back to a prior period? Otherwise, a good explanation.

• Thanks. Now fixed.

• Great post. I’d been using the method recommended by Bowerman, O’Connel, and Koehler, so it’s nice to see a critique.

A question related to HW but of a different nature. I’ve been optimizing alpha, delta, and gamma for HW using a nonlinear optimization solver to minimize the sum of the squared errors between the actual data and the previous period’s forecast of the next period. Are there other approaches for optimizing these parameters quickly if my goal is to create a HW model using 3 years worth of data to forecast a fourth year? I have 6000 different time series forecasts that I need parameters for, which is why I use the word “quickly”.

Thanks.

• That method works pretty well. There are several variations on it (as discussed in this IJF (2002) paper, but none that will be quicker than what you are doing if you want to optimize the parameters for every time series.

One alternative to nonlinear optimization for each time series is to use the same smoothing parameters for all time series that you are forecasting (or divide the time series into homogeneous groups and use the same parameters in each group).

Another is to use fixed, pre-specified, values of the parameters and don’t tune them at all. This is actually a very common approach and works surprisingly well.

• Abhishek

Hello Rob
Great post. I was building Winter’s model in MS excel. I am using Solver add in to find optimal values for alpha, beta and gamma. It gives alpha and gamma as 1 and beta as zero. I am trying to minimize the Mean Absolute Error.
I am skeptical about this solution given by solver. Can you suggest if the solver is the best tool to find the optimal values of the parameter that fits the data or is there some better tool available.

• Solver is a non-linear optimizer. It does a reasonable job, but when there are local maxima it will often converge to a non-optimal solution. There are better non-linear optimizers that are more robust to local maxima (e.g., PSO), but they will take a lot longer. You will have to use something other than MS-Excel if you want to explore better tools. I suggest you use R.

• Sebastian

This is a great post and has been extremely useful to me (I work with biosurveillance data) as well as your formulas for prediction intervals of the multiplicative HW (Hyndman, Koehler, Ord and Snyder, 2005)

Regarding these formulas, I have a question (sorry to ask it in this post):
As far as I understand, they are only valid when the residuals have no autocorrelation. However, I have some data for which HW with optimization of the MSE gives highly correlated residuals (r1 \approx 0.5) Is there a way to modify the equations of the prediction intervals to take this into account? Or maybe they can be used as lower (or even upper) bounds for the prediction intervals?

• Yes, the prediction intervals assume uncorrelated residuals. It would be possible, but complicated, to derive intervals allowing for autocorrelated errors. However, if autocorrelation occurs, I suggest you use an ARIMA model instead.

• Lakshmi

Dear Rob,

I am working on holt winter’s method for a FMCG product.I need to forecast for future 2 years and i have the history data for 3 years. Are there any thumb rules suggesting that alpha should not be greater than 0.3, beta should not be greater than 0.3 and gamma should not be greater than 0.1?
If yes, then at what situations should we use higher alpha, beta, gamma?

• If you don’t have enough data to optimize the parameters, then a reasonable rule is to set alpha=0.2, beta=0.1, gamma=0.1.

• Tithi Sen

Hi Rob,

I have data points for 49 months, while formulating SI for HW in excel, how many months of data shall Intake to calculate the initial double moving average

good morningplease, if you can give me the method of searching for alpha beta gamma with Excel or other method,

• Elif

Thanks for info. But I’m confused about the seasonalities.

Consider having 24 months of data, I take their averages for each month and find 12 seasonalities. According to your formula using S(t+h-m = 24+1-12=13) in order to forecast 25. month, I should use 13. seasonality which is also a forecast for jan. (not an average). This sounds very improper to me, but I investigated also some other books and they give the same formulation.

Can you please explain it to me…

• If your data start in January, then obs 25 is a January. So you should use the best estimate of the Jan seasonality when forecasting it. The best estimate is from the previous Jan which happens to be obs 13.

• Guest

I am trying to use holt winters method to forecast my next year’s monthly demand…I have two years monthly data…how would i get the projected numbers for next year monthwise as this method requires actual data to compute the forcast which i dont have…

• Guest

I am trying to use holt winters method to forecast my next year’s monthly demand…I have two years monthly data…how would i get the projected numbers for next year monthwise as this method requires actual data to compute the forcast which i dont have.

• Girondist

Awesome post, and I’ve been trying out these different methods on some data with a positive linear trend. The initial level you get from your ’98 textbook is MUCH HIGHER than the other 2 methods given here for my data since it’s the average value of the first 12 months rather than a y-intercept. It’s strange to me that the ’98 method is legit.

On to my question, per your 3rd method listed I’m using the 2×12 smoother to get my initial seasonality factors on 3 years worth of data. Should I throw away my edges? In other words, 3 years worth of data would turn into 2 years worth of smoothed data, and I’d average 2 smoothed points to get a single initial seasonality factor.

Thanks. Keep this stuff coming. There’s are so few good time series sources on the internet.

• Note that the level from my 98 book is for the mth time period, not the 0th time period as for the other methods. That is why it will be much larger when there is a positive trend.

Yes, the smoothing will cause you to lose a year.

• sqrt

I dont get it. If you smooth 3 years (36 periods) with a moving average of 2×12=24 periods you’ll loose 12 periods on each edge .. what am I missing?

• No, a 2×12 MA means do a 12MA followed by a 2MA. It is equivalent to a weighted 13MA and you lose 6 periods on each edge. See http://otexts.com/fpp/6/2/

• sqrt

So it’s a centred MA, isnt it?

Could you also please concretise the third initialisation variant (“For very short time series, an alter­na­tive ..”)? I found the corresponding part in the R source code, but don’t quite understand it. Subtracting the mean of the first year to get seasonal values just sounds like your first approach.

• Yes, a 2×12-MA is also called a centered 12-MA.

For short series, I use the same as my 1998 approach for the seasonal component. But the other components are handled differently. In R, the initial seasonal component is simply

switch(seasontype, A = y-mean(y), M = y/mean(y))

• Erik P.

Thanks for this excellent post – it’s a couple years old but still very useful. I am trying to write code to initialize the optimization process for a general ETS(., ., .) process and this is helping me a lot. I have four questions:

A) If m is odd, I suppose that instead of a 2xm MA, we use a plain m item MA, right? Which means we can use one data point less to get the same number of data points for each “month”.

B) Would it make sense to take a moving *geometric* mean for the multiplicative case? This seems to lead to better results, at least for a low noise situation: in the limit for epsilon=0, it gives the exact answer.

C) Things seem to get a little more complicated for the other ETS(., ., .) models. My current plan, following this blog post, is:

1. Compute averages per season using moving averages – 2xm if m is even, m if m is odd; arithmetic mean if seasonality is additive, geometric mean if it is multiplicative.

1.5 Fit a trend line through the resulting averages; the type of trend line is linear for additive trend, exponential for multiplicative trend, and complicated for damped trends (I believe I have a handle on this).

2. Remove the fitted trend line (not the averages) from the original data by subtraction (A / Ad trends) or division (M / Md trends) to obtain detrended data. Initial seasonal values are obtained as averages of the averages that correspond to their “month”. I’m not yet sure if these averages should also be geometric in some cases, but it doesn’t seem to make a whole lot of difference.

3. Subtract (for “A” seasonality) or divide (for “M” seasonality) the seasonal values from the original data to obtain deseasonalized data.

4. Do the same as step 1.5 but now to deseasonalized data to obtain initial values for l0, b0, and if appropriate, phi. (Taking care to

Does any of this seem obviously flawed?

D) I haven’t really thought about how to initialize alpha, beta, and gamma yet; are there obvious strategies apart from just using default values?

Let me know if this is too involved; the questions got more complicated as I was typing. I suppose I could ask this as a number of separate crossvalidated.com questions.

• A) Yes use an m-MA when m is odd.

B) Yes, that would also work.

C) That sounds ok.

D) You can see my preferences in the forecast:::initparam function. The defaults are alpha=0.5, beta=0.1, gamma=0.01, phi=0.978.

• Erik P.

Thanks Rob!

• sqrt

We have a very short time series (25 values). Now that I’ve tested your three methods I also used an evolutionary algorithm to optimize all parameters at once (10 seasonal indices, 3 smoothing parameters, initial level and slope). The result’s SSE is minimal but the curve just looks too smoothed (kind of sinusoidal). I don’t trust these results. What’s wrong with this approach? I need to compare my HW forecasts with an ARIMA model but whether HW forecasts better or not depends on which init approach I used .. so it’s not an objective choice anymore. Thx in advance ..

• You cannot expect to estimate 15 parameters with 25 observations.

• sqrt

Thought so. Guess I’ll use your third method then since you’ve recommended it especially for short series. However, the comparison with the ARIMA model still depends a lot on how HW was initialized ..

• Steve

“1.First fit a mov­ing aver­age smoother to the first 2 or 3 years of data (HoltWinters uses 2 years, ets uses 3 years).

2.Then sub­tract (for addi­tive HW) or divide (for mul­ti­plica­tive HW)
the smooth trend from the orig­i­nal data to get de-​​trended data. The
ini­tial sea­sonal val­ues are then obtained from the aver­aged
de-​​trended data. For exam­ple, the ini­tial sea­sonal value for
Jan­u­ary is the aver­age of the de-​​trended Januaries.

3.Next sub­tract (for addi­tive HW) or divide (for mul­ti­plica­tive HW) the sea­sonal val­ues from the orig­i­nal data to get sea­son­ally adjusted data.

4.Fit a lin­ear trend to the sea­son­ally adjusted data to get the ini­tial level (the inter­cept) and the ini­tial slope ”

Could you please give example how to calculate the initial value for H-W method as the step you discussed above? I am really stuck at this stage.

• sergio

well explained the theme of winter but method to generate the seeds for winter additive method of initializing in F0 as would the equations would use thank you very much

• Terry Ziemer

I have a question related to Holt-Winters, but about prediction intervals, specifically for the multiplicative model. Ord’s paper in the International Journal of Forecasting from 2001 contains 4 different models for MAM, with different prediction intervals for each model. The Hyndman, Koehler, Ord, and Snyder paper from 2003 uses a single model for MAM, which I believe is Ord’s model 1 from the earlier paper. The question I have is that it appears that the prediction intervals in these 2 papers are not the same. I have put the formulas into Excel to calculate prediction intervals for a set of data and I get different results using the Ord intervals and the Hyndman, et al intervals. So, the question I have is which of these is correct? Also, do you have a test set of data with correct prediction intervals, parameter values, etc that I could use to check my calculations in Excel?

• This is a simple change of notation. See chapter 2 of my Springer book where we discuss the two notations.

• Guest

Thanks for this post very informative. For series short enough that the moving-average method cannot be used, I wonder what you think of using initial values chosen by grid search or a PRNG among numbers near the 0th level and (0,1)th slope. (I’m not sure what value of “near” to use, though, and would love your input on that also.)