Forecasting: principles and practice (NYC)
Date: 25-27 June 2018
Location: eBay, 625 6th Ave, New York, NY 10011, USA.
This page is for people enrolled in my NYC 3-day workshop.
Prerequisites
Please bring your own laptop with a recent version of R and RStudio installed, along with the following packages and their dependencies:
fpp2
readxl
thief
knitr
Participants will be assumed to be familiar with basic statistical tools such as multiple regression, but no knowledge of time series or forecasting will be assumed.
Reference
Need help with R?
Approximate Program
Monday 25 June
09.00 - 9.30 | Registration and introduction | Slides 1-1 |
9.30 - 10.30 | Time series graphics | Slides 1-2 |
10.30 - 11.00 | Break | |
11.00 - 12.30 | Forecast evaluation | Slides 1-3 |
12.30 - 1.30 | Lunch | |
1.30 - 3.00 | Exponential Smoothing | Slides 1-4 |
3.00 - 3.30 | Break | |
3.30 - 5.00 | ETS and state space models | Slides 1-5 |
Tuesday 26 June
9.00 - 10.30 | Transformations and STL |
Slides 2-1 Slides 2-2 |
10.30 - 11.00 | Break | |
11.00 - 12.30 | Non-seasonal ARIMA models | Slides 2-3 |
12.30 - 1.30 | Lunch | |
1.30 - 3.00 | Non-seasonal ARIMA models | Slides 2-4 |
3.00 - 3.30 | Break | |
3.30 - 5.00 | Seasonal ARIMA models | Slides 2-5 |
Wednesday 27 June
9.00 - 10.30 | Dynamic regression | Slides 3-1 |
10.30 - 11.00 | Break | |
11.00 - 12.30 | Forecasting with multiple seasonality | Slides 3-2 |
12.30 - 1.30 | Lunch | |
1.30 - 3.00 | Hierarchical forecasting | Slides 3-3 |
3.00 - 3.30 | Break | |
3.30 - 4.15 | Extras | Slides 3-4 |
4.15 - 5.00 | Final comments, questions and discussion | Slides 3-5 |
Lab sessions
Download the following Rmarkdown file for recording your lab session exercises: Labs.Rmd
Lab Session 1
Download the monthly Australian retail data. These represent retail sales in various categories for different Australian states.
Read the data into R and choose one of the series. This time series will be used throughout the short course in lab sessions.
Please script this, don’t just use the Rstudio click-and-point interface. That way you can save the results for easy replication later.
<- read.csv("retail.csv") retaildata <- ts(retaildata[,4], frequency=12, start=c(1982,4)) mytimeseries autoplot(mytimeseries)
[Replace the column number with your own chosen column.]
Lab Session 2
Apply the following functions to your chosen retail time series.
ggseasonplot, ggsubseriesplot
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
Lab Session 3
The following graphics functions have been introduced:
autoplot, ggseasonplot, ggsubseriesplot, gglagplot, ggAcf
Apply these functions to the following series:
bicoal, chicken, dole, usdeaths, bricksq, lynx, ibmclose
Use the help files to find out what the series are.
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
Lab Session 4
Calculate the residuals from a seasonal naïve forecast applied to quarterly Australian beer production from 1992.
<- window(ausbeer, start=1992) beer <- snaive(beer) fc autoplot(fc) <- residuals(fc) res autoplot(res)
Test if the residuals are white noise.
checkresiduals(fc)
What do you conclude?
Lab Session 5
For your retail time series:
Split the data into two parts using
<- window(mytimeseries, end=c(2010,12)) train <- window(mytimeseries, start=2011) test
Check that your data have been split appropriately by producing the following plot.
autoplot(cbind(Training=train,Test=test))
Calculate forecasts using
snaive
applied totrain
.Compare the accuracy of your forecasts against the actual values stored in
test
.<- snaive(train) f1 accuracy(f1, test)
Check the residuals
checkresiduals(f1)
Do the residuals appear to be uncorrelated and normally distributed?
How sensitive are the accuracy measures to the training/test split?
Lab Session 6
Use time series cross-validation to compare the four benchmark methods on your retail data.
Lab Session 7
For this exercise, use the price of a dozen eggs in the United States from 1900–1993 (data set
eggs
).- Use SES and Holt’s method (with and without damping) to forecast “future” data.
[Hint: useh=100
so you can clearly see the differences between the options when plotting the forecasts.] - Which method gives the best training RMSE?
- Are these RMSE values comparable?
- Do the residuals from the best fitting method look like white noise?
- Use SES and Holt’s method (with and without damping) to forecast “future” data.
Lab Session 8
Apply Holt-Winters’ multiplicative method to your retail time series data.
- Why is multiplicative seasonality necessary here?
- Experiment with making the trend damped.
- Compare the RMSE of the one-step forecasts from the two methods. Which do you prefer?
- Check that the residuals from the best method look like white noise.
Now use the same training/test split you did in Lab Session 5 to test your Holt-Winters’ forecasts on the retail data. Can you beat the seasonal naïve approach from Lab Session 5?
Lab Session 9
Use
ets()
to find the best ETS model for your retail data.- How does this model compare to the Holt-Winters’ method you applied in the previous lab session?
- What do the smoothing parameters tell you about the trend and seasonality?
Use
ets()
on some of these series:bicoal, chicken, dole, usdeaths, bricksq, lynx, ibmclose, eggs, ausbeer
Does it always give good forecasts?
Find an example where it does not work well. Can you figure out why?
Lab Session 10
To finish Day 1, we will compare the various models we have fitted to your retail data. First we will use the training/test split for the comparison. Second, we will use time series cross-validation.
The code for the forecasts for the training/test split you have already written in previous labs. Copy and paste it to do the full comparison. Which method did best?
For the time series cross-validation, use
h=12
. For example,<- tsCV(mytimeseries, meanf, h=12) e1
A similar approach can be used for all the benchmark functions:
meanf()
,naive()
,rwf()
,hw()
. But forets()
, we first we need to write a new function that combinesets()
withforecast()
.First, use
ets()
to find the best model for these data.Then, use the following code to fit an ETS model and return forecasts. Replace ??? with the relevant details of your model.
<- function(y,h,...) etsfc {ets(y,...) %>% forecast(h=h) }<- tsCV(mytimeseries, etsfc, h=12, model="???", damped=???) e6
Note that we are “cheating” a little bit here by fixing the model based on the entire time series, even though we are estimating the parameters of that model on every training set.
A more realistic version would be to allow
etsfc
to select a different model every time. Just remove themodel
anddamped
arguments when you calltsCV()
. This will take much longer, of course. Try it, and see what difference it makes to the results.
Lab Session 11
For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.
usnetelec
mcopper
enplanements
a10
Why is a Box-Cox transformation unhelpful for the
cangas
data?What Box-Cox transformation would you select for your retail data?
Lab Session 12
We will use the bricksq
data for this exercise.
- Use an STL decomposition to calculate the trend-cycle and seasonal indices. (Experiment with having fixed or changing seasonality.)
- Compute and plot the seasonally adjusted data.
- Repeat with a robust STL decomposition. Does it make much difference?
Lab Session 13
Continuing with the bricksq
data:
- Use a naïve method to produce forecasts of the seasonally adjusted data obtained in the last lab session.
- Reseasonalize the results to give forecasts on the original scale.
[Hint: use thestlf
function withmethod="naive"
.] - Do the residuals look uncorrelated?
Lab Session 14
- For your retail data, try an STL decomposition applied to the Box-Cox transformed series, followed by ETS on the seasonally adjusted data. How does that compare with your best previous forecasts on the test set?
Lab Session 15
For the following series, find an appropriate differencing (after transformation if necessary) to obtain stationary data.
usnetelec
usgdp
mcopper
enplanements
visitors
For your retail data, find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.
Lab Session 16
Select a time series from wmurders
, usgdp
and mcopper
.
- if necessary, find a suitable Box-Cox transformation for the data;
- fit a suitable ARIMA model to the transformed data using
auto.arima()
; - try some other plausible models by experimenting with the orders chosen;
- choose what you think is the best model and check the residual diagnostics;
- produce forecasts of your fitted model. Do the forecasts look reasonable?
- compare the results with what you would obtain using
ets()
(with no transformation).
Lab Session 17
For your retail time series, develop an appropriate seasonal ARIMA model, and compare the forecasts with those you obtained earlier.
Choose one of the following seasonal time series:
condmilk
,hsales
,uselec
- Do the data need transforming? If so, find a suitable transformation.
- Are the data stationary? If not, find an appropriate differencing which yields stationary data.
- Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AIC values?
- Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
- Forecast the next 24 months of data using your preferred model.
- Compare the forecasts obtained using
ets()
.
Lab Session 18
For this exercise, we will continue to use your retail data and use time series cross-validation to compare ETS and ARIMA models.
The calculations for ETS were done in Lab Session 10.
Now we will repeat the exercise using an ARIMA model.
Use
auto.arima()
to find the best model for these data and record the training set RMSE.We will now check how much larger the one-step RMSE is on out-of-sample data using time series cross-validation. The following code will compute the result, beginning with four years of data in the training set. Replace
???
with the appropriate values for your ARIMA model.<- function(y,h) arimafc {%>% Arima(order=???, seasonal=???, lambda=??) %>% forecast(h=h) y }<- tsCV(mytimeseries, arimafc) e sqrt(mean(e^2, na.rm=TRUE))
Plot the errors. Do they look uncorrelated and homoscedastic?
If the first few values have high variance, repeat the RMSE calculation after omitting them.
In practice, we will not know the best model on the whole data set until we observe all the data. So a more realistic analysis would be to allow
auto.arima()
to select a different model each time through the loop.- Calculate the RMSE values using this approach. (Warning: it will take a while as there are a lot of models to fit.)
- How do the RMSE values compare? Does the re-selection of a model at each step make much difference?
Lab Session 19
This exercise concerns motel
: the total monthly takings from accommodation and the total room nights occupied at hotels, motels, and guest houses in Victoria, Australia, between January 1980 and June 1995. Total monthly takings are in thousands of Australian dollars; total room nights occupied are in thousands.
- Use the data to calculate the average cost of a night’s accommodation in Victoria each month.
- Use
cpimel
to estimate the monthly CPI. - Produce time series plots of both variables and explain why logarithms of both variables need to be taken before fitting any models.
- Fit an appropriate regression model with ARIMA errors. Explain your reasoning in arriving at the final model.
- Forecast the average price per room for the next twelve months using your fitted model. (Hint: You will need to produce forecasts of the CPI figures first.)
Lab Session 20
For your retail time series:
- Develop an appropriate dynamic regression model with Fourier terms for the seasonality. Use the AIC to select the number of Fourier terms to include in the model. (You will probably need to use the same Box-Cox transformation you identified previously.)
- Check the residuals of the fitted model. Does the residual series look like white noise?
- Compare the forecasts with those you obtained earlier using alternative models.
Lab Session 21
Use the
tbats
function to model your retail time series.- Check the residuals and produce forecasts.
- Does this completely automated approach work for these data?
- Have you saved any degrees of freedom by using Fourier terms rather than seasonal differencing?
Consider the US finished motor gasoline products supplied (thousands of barrels per day) in the
gasoline
data set.- Create a training set consisting of all but the last two years of data.
- Fit a
tbats
model to the training data. - Check the residuals and produce forecasts.
- Compute the accuracy of the forecasts using the test set.
- Compare the results to those obtained using STL and dynamic harmonic regression. Which method does best for this series?
Lab Session 22
We will reconcile the forecasts for the infant deaths data. The following code can be used. Check that you understand what each step is doing.
library(hts)
plot(infantgts)
smatrix(infantgts)
# Forecast 10-steps-ahead and reconcile the forecasts
<- forecast(infantgts, h=10)
infantforecast
# Plot the forecasts including only the last ten historical years
plot(infantforecast, include=10)
# set up training and testing sets
<- window(infantgts, end=1993)
training <- window(infantgts, start=1994)
test
# Compute forecasts on training data
<- forecast(training, h=10)
forecast
# calculate ME, RMSE, MAE, MAPE, MPE and MASE
accuracy.gts(forecast, test)
- How would you measure overall forecast accuracy across the whole collection of time series?
- Repeat the training/test set analysis using “bottom-up” forecasts (set
method="bu"
in theforecast
function). - Does the “optimal” reconciliation method work best here?
Lab Session 23
Use thief()
on your retail data, using ARIMA models for each aggregation. Does the temporal reconciliation lead to improved forecasts on your test set?
Lab Session Notes
Here are the Rmd and html files I created during the workshop