This is a one-day workshop given as part of the Melbourne Data Science Week.
Date: 29 May 2017
Presenters: Rob J Hyndman and Earo Wang
Location: KPMG, Tower Two, Collins Square, 727 Collins St, Melbourne
Prerequisites
Please bring your own laptop with a recent version of R installed, along with the following packages and their dependencies:
devtools
fpp2
knitr
plotly
shiny
tidyverse
Participants will be assumed to be familiar with basic statistical tools such as multiple regression, but no knowledge of time series or forecasting will be assumed.
Need help with R?
Program
08.30 - 09.00 | Registration and welcome | Slides |
09.00 - 10.30 | Time series and R, Time series graphics (Lab Sessions 1-2) | Slides |
10.30 - 11.00 | Morning tea | |
11.00 - 12.30 | Visualising temporal data (Lab Sessions 3-4) | Slides |
12.30 - 13.30 | Lunch | |
13.30 - 15.00 | Some automatic forecasting algorithms (Lab Sessions 5-6) | Slides |
15.00 - 15.30 | Afternoon tea | |
15.30 - 16.45 | Forecast evaluation (Lab Sessions 7-8) | Slides |
16.45 - 17.00 | Wrap up | Slides |
Lab sessions
Lab Session 1
- Download the `Retail.Rmd` file. This will be used for all analysis of the retail data.
- Download the monthly Australian retail data. These represent retail sales in various categories for different Australian states.
- Read the data into R and choose one of the series. This time series will be used throughout the workshop in lab sessions 1–2 and 5–8.

Please script this; don't just use the RStudio point-and-click interface. That way you can save the results for easy replication later.

You will need the `read_excel` function from the `readxl` package:

    library(fpp2)  # loads forecast and ggplot2, which provide autoplot()
    # skip = 1 skips the first row of the spreadsheet before reading
    retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
    mytimeseries <- ts(retaildata[["A3349873A"]], frequency = 12, start = c(1982, 4))
    autoplot(mytimeseries)

[Replace the column name with your own chosen column.]
Lab Session 2
The following graphics functions have been introduced:
autoplot, ggseasonplot, ggmonthplot, gglagplot, ggAcf, ggtsdisplay
- Explore your chosen retail time series using these functions.
- Can you spot any seasonality, cyclicity and trend?
- What do you learn about the series?
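As a rough starting point, the sketch below runs each of the listed graphics functions on one series. It uses the built-in `AirPassengers` data purely as a stand-in (an assumption, so that the code runs without the retail file); substitute `mytimeseries` from Lab Session 1. It calls `ggsubseriesplot`, which to my understanding is the function that `ggmonthplot` wraps in the forecast package.

```r
library(forecast)  # ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf(), ggtsdisplay()
library(ggplot2)   # ggtitle() and the plotting backend

# Stand-in series (an assumption): replace with `mytimeseries` from Lab Session 1.
y <- AirPassengers

p_time   <- autoplot(y) + ggtitle("Time plot")      # overall trend and changing variance
p_season <- ggseasonplot(y, year.labels = TRUE)     # one line per year, by month
p_sub    <- ggsubseriesplot(y)                      # one mini time plot per month
p_lag    <- gglagplot(y)                            # y[t] against y[t-k] for several lags
p_acf    <- ggAcf(y)                                # autocorrelation function

print(p_time)
print(p_season)
print(p_sub)
print(p_lag)
print(p_acf)

# Time plot, ACF and PACF in a single display
ggtsdisplay(y)
```

Viewing the seasonal and subseries plots side by side is often the quickest way to tell a stable seasonal pattern from one that changes over the years.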
Lab Session 3
Download the Rmd file for this lab session.
- Download the billboard data. The `billboard` dataset contains the date a song first entered the Billboard Top 100 in 2000 and its rank over 76 weeks.
- Read the dataset into R and take a look at the data.
- Transform the data to the long data form named as `billboard_long`.
- [Bonus] Split `billboard_long` into two separate datasets, `song` and `rank`. The `song` data will include `artist`, `track`, `time` and a new column called `id` assigning a unique identifier for each song. The `rank` data will include the `id`, `date`, `week` and `rank` columns. The `id` column is the key variable that maintains the link between the two datasets.
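One possible shape for the reshaping and the bonus split is sketched below. The toy data frame is hypothetical and only mimics the layout described above. In particular, the column names `date.entered` and `wk1`, `wk2`, ... are assumptions, as is the rule that `date` is the entry date plus one week per chart week. Adjust these to match the real file.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the real billboard file); column names are assumptions.
billboard <- tibble(
  artist       = c("Artist A", "Artist B"),
  track        = c("Track 1", "Track 2"),
  time         = c("3:45", "4:10"),
  date.entered = as.Date(c("2000-01-08", "2000-02-12")),
  wk1          = c(87, 91),
  wk2          = c(82, 87),
  wk3          = c(72, NA)
)

# Wide to long: one row per song per week on the chart
billboard_long <- billboard %>%
  pivot_longer(
    cols           = starts_with("wk"),
    names_to       = "week",
    names_prefix   = "wk",
    values_to      = "rank",
    values_drop_na = TRUE          # drop weeks in which the song was off the chart
  ) %>%
  mutate(week = as.integer(week))

# Song-level table with a unique identifier per song
song <- billboard_long %>%
  distinct(artist, track, time) %>%
  mutate(id = row_number())

# Rank-level table, linked back to `song` through `id`
rank <- billboard_long %>%
  inner_join(song, by = c("artist", "track", "time")) %>%
  mutate(date = date.entered + 7 * (week - 1L)) %>%  # assumed definition of `date`
  select(id, date, week, rank)

print(song)
print(rank)
```

Splitting the data this way stores each song's descriptive fields once, while the weekly observations carry only the `id` key.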
Lab Session 4
Download the Rmd file for this lab session.
- Download the weather data.
- Read the dataset into R and tidy it up for visualising with `ggplot2` later.
- Write some `ggplot2` code to reproduce the plot shown on the slides.
Lab Session 5
- Use `ets()` to find the best ETS model for your retail data.
- What does the model choice tell you about the data?
- What do the smoothing parameters tell you about the trend and seasonality?
- Do the forecasts look reasonable?
- Obtain up-to-date retail data from the ABS website (Cat. 8501.0, Table 11), and compare your forecasts with the actual numbers. How good were the forecasts from the various models?
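A minimal sketch of the workflow for the first few questions follows. It uses the built-in `AirPassengers` series as a stand-in (an assumption); replace it with `mytimeseries`. The 24-month horizon is likewise only an illustrative choice.

```r
library(forecast)
library(ggplot2)

# Stand-in series (an assumption): replace with `mytimeseries`.
y <- AirPassengers

# Let ets() choose the error, trend and seasonal components automatically
fit <- ets(y)
print(fit$method)   # selected model, written as ETS(error, trend, seasonal)
print(fit$par)      # smoothing parameters and initial states

# Forecast two years ahead and plot with prediction intervals
fc <- forecast(fit, h = 24)
print(autoplot(fc))
```

Smoothing parameters near zero suggest the corresponding component (level, trend or seasonality) changes slowly over time, while values near one suggest it adapts quickly to recent observations.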
Lab Session 6
We will now fit an ARIMA model for your retail data.
- What Box-Cox transformation would you select to stabilize the variance?
- Use `auto.arima` to obtain a seasonal ARIMA model, and compare the forecasts with those you obtained earlier, and with the latest retail data.
- Experiment with different Box-Cox transformations to see their effect on the chosen model and forecasts.
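The steps above might be sketched as follows, again with `AirPassengers` standing in for `mytimeseries` (an assumption). `BoxCox.lambda()` gives an automatic suggestion for the transformation parameter; it is only a starting point, and trying a few values by hand is part of the exercise.

```r
library(forecast)
library(ggplot2)

# Stand-in series (an assumption): replace with `mytimeseries`.
y <- AirPassengers

# Automatic suggestion for the Box-Cox parameter
lambda <- BoxCox.lambda(y)
print(lambda)

# Check visually that the transformed series has roughly constant variance
print(autoplot(BoxCox(y, lambda)))

# Seasonal ARIMA chosen on the transformed scale; forecasts are back-transformed
fit_arima <- auto.arima(y, lambda = lambda)
print(fit_arima)

fc_arima <- forecast(fit_arima, h = 24)
print(autoplot(fc_arima))

# Try another transformation, e.g. lambda = 0 (logs), and compare the chosen model
fit_log <- auto.arima(y, lambda = 0)
print(fit_log)
```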
Lab Session 7
For your retail time series:
- Use the `accuracy` function to compare the forecasts obtained from your ETS and ARIMA models. Which is giving the best forecasts?
- Repeat with forecasts obtained using `stlf` (with the same Box-Cox transformation as you used for the ARIMA model).
- Repeat with forecasts obtained using `snaive` (there’s no need for a transformation).
- Which approach gives the best forecasts?
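One way to organise this comparison is sketched below. It assumes a training/test split, which the exercise does not prescribe: the last two years are held out here purely for illustration. It also uses `AirPassengers` as a stand-in for `mytimeseries`.

```r
library(forecast)

# Stand-in series (an assumption): replace with `mytimeseries`.
y <- AirPassengers

# Hold out the last two years as a test set (an illustrative choice)
train <- window(y, end = c(1958, 12))
test  <- window(y, start = c(1959, 1))
h     <- length(test)

# Box-Cox parameter estimated on the training data only
lambda <- BoxCox.lambda(train)

# Forecasts from the four approaches
fc <- list(
  ets    = forecast(ets(train), h = h),
  arima  = forecast(auto.arima(train, lambda = lambda), h = h),
  stlf   = stlf(train, lambda = lambda, h = h),
  snaive = snaive(train, h = h)
)

# Collect test-set accuracy measures into one table (rows = methods)
results <- t(sapply(fc, function(f) {
  accuracy(f, test)["Test set", c("RMSE", "MAE", "MAPE", "MASE")]
}))
print(round(results, 2))
```

Comparing on the "Test set" row rather than the "Training set" row avoids rewarding models that simply fit the historical data more closely.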
Lab Session 8
- Use `ets` to find the best model for your retail data and record the training set MAPE.
- We will now check how much larger the MAPE is on out-of-sample data using time series cross-validation. The following code will compute the result. Replace `???` with the appropriate values for your ETS model.
```r
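# Forecast function for tsCV(): fit an ETS model to x and forecast h steps ahead
fets <- function(x, h, model = "ZZZ", damped = NULL, ...) {
  forecast(ets(x, model = model, damped = damped), h = h)
}
# Cross-validated one-step errors; replace ??? with your model's settings
e <- tsCV(mytimeseries, fets, model = ???, damped = ???)
# Percentage errors
pe <- 100 * e / mytimeseries
# MAPE: mean absolute percentage error
mean(abs(pe), na.rm = TRUE)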
```
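- Plot `pe` using `autoplot` and `ggAcf`. Do they look uncorrelated and homoskedastic?
- In practice, we will not know the best model on the whole data set until we observe all the data. So a more realistic analysis would be to allow `ets` to select a different model each time through the loop. Calculate the MAPE using this approach. (Warning: there are a lot of models to fit, so this will take a while.)
- How do the MAPE values compare? Does the re-selection of a model at each step make much difference?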
to select a different model each time through the loop. Calculate the MAPE using this approach. (Warning: there a lot of models to fit, so this will take a while.)How do the MAPE values compare? Does the re-selection of a model at each step make much difference?