Time series analysis & forecasting using R
Two-day workshop at ANU, 9-10 November 2022
Venue
Room 5.02, Marie Reay Teaching Building, Australian National University, Canberra.
Course description
It is becoming increasingly common for organizations to collect huge amounts of data over time, and existing time series analysis tools are not always suitable to handle the scale, frequency and structure of the data collected. In this workshop, we will look at some new packages and methods that have been developed to handle the analysis of large collections of time series.
On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will cover data wrangling, visualization and exploratory analysis of time series, and use feature-based methods to explore time series data in high dimensions. A similar feature-based approach can be used to identify anomalous series within a collection, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feasts (along with the tidyverse, of course).
Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package. We will look at creating ensemble forecasts and hybrid forecasts, as well as some new forecasting methods that have performed well in large-scale forecasting competitions. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related.
Learning objectives
Attendees will learn:
- How to wrangle time series data with familiar tidy tools.
- How to compute time series features and visualize large collections of time series.
- How to select a good forecasting algorithm for your time series.
- How to ensure forecasts of a large collection of time series are coherent.
Is this course for me?
This course will be appropriate for you if you answer yes to these questions:
- Do you already use the tidyverse packages in R such as dplyr, tidyr, tibble and ggplot2?
- Do you need to analyse large collections of related time series?
- Would you like to learn how to use some tidy tools for time series analysis including visualization, decomposition and forecasting?
Prework
If you don’t use R regularly, or don’t know the tidyverse packages, we recommend working through the tutorials at learnr.numbat.space beforehand.
Please bring your own laptop with a recent version of R and RStudio installed. The following code will install the main packages needed for the workshop.
install.packages(c("tidyverse", "fpp3", "GGally", "sugrrants"))
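Once the packages are installed, loading fpp3 attaches the main packages used throughout the workshop (tsibble, tsibbledata, feasts, fable and the core tidyverse packages):

library(fpp3)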
Schedule
| Time          | Activity     |
|---------------|--------------|
| 09:00 - 10:30 | Session 1    |
| 10:30 - 11:00 | Coffee break |
| 11:00 - 12:30 | Session 2    |
| 12:30 - 13:30 | Lunch break  |
| 13:30 - 15:00 | Session 3    |
| 15:00 - 15:30 | Coffee break |
| 15:30 - 17:00 | Session 4    |
Slides
- Day 1
- Day 2
Lab Sessions
Lab Session 1
- Download tourism.xlsx from http://robjhyndman.com/data/tourism.xlsx, and read it into R using read_excel() from the readxl package.
- Create a tsibble which is identical to the tourism tsibble from the tsibble package.
- Find what combination of Region and Purpose had the maximum number of overnight trips on average.
- Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
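One possible starting point for the first two steps (a sketch: the local file name, and the assumption that the Quarter column needs converting with yearquarter(), are ours):

library(fpp3)
library(readxl)

# Read the spreadsheet, convert the quarter column to a quarterly index,
# and declare the variables that identify each series as keys.
my_tourism <- read_excel("tourism.xlsx") |>
  mutate(Quarter = yearquarter(Quarter)) |>
  as_tsibble(index = Quarter, key = c(Region, State, Purpose))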
Lab Session 2
- Create time plots of the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, and Demand from vic_elec.
- Use help() to find out about the data in each series.
- For the last plot, modify the axis labels and title.
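A sketch of the first and third steps (the label text is illustrative; vic_elec records demand in MWh):

aus_production |> autoplot(Bricks)
pelt |> autoplot(Lynx)
gafa_stock |> autoplot(Close)
vic_elec |> autoplot(Demand) +
  labs(
    x = "Time",
    y = "Demand (MWh)",
    title = "Half-hourly electricity demand: Victoria"
  )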
Lab Session 3
Look at the quarterly tourism data for the Snowy Mountains.

snowy <- tourism |>
  filter(Region == "Snowy Mountains")

- Use autoplot(), gg_season() and gg_subseries() to explore the data.
- What do you learn?
- Produce a calendar plot for the pedestrian data from one location and one year.
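A sketch for both parts; the sensor and year chosen for the calendar plot are arbitrary, and frame_calendar() comes from the sugrrants package installed in the prework:

snowy |> autoplot(Trips)
snowy |> gg_season(Trips)
snowy |> gg_subseries(Trips)

library(sugrrants)
pedestrian |>
  filter(Sensor == "Southern Cross Station", year(Date) == 2016) |>
  frame_calendar(x = Time, y = Count, date = Date) |>
  ggplot(aes(x = .Time, y = .Count, group = Date)) +
  geom_line()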
Lab Session 4
We have introduced the following functions: gg_lag() and ACF(). Use these functions to explore the four time series: Bricks from aus_production, Lynx from pelt, the Close price of Amazon from gafa_stock, and Demand from vic_elec. Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
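For example, for the brick production data (the Amazon series first needs filtering with Symbol == "AMZN"):

aus_production |> gg_lag(Bricks, geom = "point")
aus_production |> ACF(Bricks) |> autoplot()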
Lab Session 5
You can compute the daily changes in the Google stock price in 2018 using

dgoog <- gafa_stock |>
  filter(Symbol == "GOOG", year(Date) >= 2018) |>
  mutate(trading_day = row_number()) |>
  update_tsibble(index = trading_day, regular = TRUE) |>
  mutate(diff = difference(Close))

Does diff look like white noise?
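One way to check (a sketch): plot the autocorrelations of diff and see whether they stay within the significance bounds.

dgoog |> ACF(diff) |> autoplot()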
Lab Session 6
Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
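A sketch of one approach, suppressing the legend since global_economy contains over 200 countries:

global_economy |>
  autoplot(GDP / Population, show.legend = FALSE)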
Lab Session 7
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.
- United States GDP from global_economy
- Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock
- Victorian electricity demand from vic_elec
- Gas production from aus_production

Why is a Box-Cox transformation unhelpful for the canadian_gas data?
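The guerrero feature suggests a value of lambda; for example, for US GDP:

lambda <- global_economy |>
  filter(Country == "United States") |>
  features(GDP, features = guerrero) |>
  pull(lambda_guerrero)

global_economy |>
  filter(Country == "United States") |>
  autoplot(box_cox(GDP, lambda))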
Lab Session 8
Produce the following decomposition.

canadian_gas |>
  STL(Volume ~ season(window = 7) + trend(window = 11)) |>
  autoplot()

- What happens as you change the values of the two window arguments?
- How does the seasonal shape change over time? [Hint: Try plotting the seasonal component using gg_season().]
- Can you produce a plausible seasonally adjusted series? [Hint: season_adjust is one of the variables returned by STL().]
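For the hints, the components can be extracted and plotted as follows (a sketch using the model() interface):

dcmp <- canadian_gas |>
  model(STL(Volume ~ season(window = 7) + trend(window = 11))) |>
  components()

# Seasonal shape over time
dcmp |> gg_season(season_year)

# Seasonally adjusted series
dcmp |>
  as_tsibble() |>
  autoplot(season_adjust)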
Lab Session 9
- Use GGally::ggpairs() to look at the relationships between the STL-based features. You might wish to change seasonal_peak_year and seasonal_trough_year to factors.
- Which is the peak quarter for holidays in each state?
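A sketch for the first part, assuming the features come from applying feat_stl to the tourism data:

tourism_features <- tourism |>
  features(Trips, feat_stl)

tourism_features |>
  mutate(
    seasonal_peak_year = factor(seasonal_peak_year),
    seasonal_trough_year = factor(seasonal_trough_year)
  ) |>
  select(trend_strength, seasonal_strength_year,
         seasonal_peak_year, seasonal_trough_year) |>
  GGally::ggpairs()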
Lab Session 10
- Use a feature-based approach to look for outlying series in PBS.
- What is unusual about the series you identify as “outliers”?
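One common approach (a sketch; the choice of feat_stl and the handling of degenerate series are assumptions): compute features for every series, then look for points that sit far from the rest in principal component space.

library(broom)

pbs_features <- PBS |>
  features(Cost, feat_stl) |>
  na.omit()  # some PBS series are too short or too sparse for every feature

pbs_pcs <- pbs_features |>
  select(where(is.numeric)) |>
  prcomp(scale = TRUE) |>
  augment(pbs_features)

pbs_pcs |>
  ggplot(aes(x = .fittedPC1, y = .fittedPC2)) +
  geom_point()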
Lab Session 11
- Produce forecasts using an appropriate benchmark method for household wealth (hh_budget). Plot the results using autoplot().
- Produce forecasts using an appropriate benchmark method for Australian takeaway food turnover (aus_retail). Plot the results using autoplot().
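For the first part, the drift method is one defensible benchmark for a trending annual series (a sketch; the takeaway food series would be handled similarly after filtering aus_retail):

hh_budget |>
  model(RW(Wealth ~ drift())) |>
  forecast(h = "5 years") |>
  autoplot(hh_budget)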
Lab Session 12
- Compute seasonal naïve forecasts for quarterly Australian beer production from 1992.
- Test if the residuals are white noise. What do you conclude?
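A sketch of one approach, using a Ljung-Box test on the innovation residuals (lag = 8 is a common choice for quarterly data):

beer_fit <- aus_production |>
  filter(year(Quarter) >= 1992) |>
  model(SNAIVE(Beer))

beer_fit |> gg_tsresiduals()

augment(beer_fit) |>
  features(.innov, ljung_box, lag = 8)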
Lab Session 13
- Create a training set for household wealth (hh_budget) by withholding the last four years as a test set.
- Fit all the appropriate benchmark methods to the training set and forecast the periods covered by the test set.
- Compute the accuracy of your forecasts. Which method does best?
- Repeat the exercise using the Australian takeaway food turnover data (aus_retail) with a test set of four years.
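A sketch for the household wealth part:

train <- hh_budget |>
  filter(Year <= max(Year) - 4)

fit <- train |>
  model(
    mean = MEAN(Wealth),
    naive = NAIVE(Wealth),
    drift = RW(Wealth ~ drift())
  )

fit |>
  forecast(h = 4) |>
  accuracy(hh_budget)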
Lab Session 14
Try forecasting the Chinese GDP from the global_economy data set using an ETS model. Experiment with the various options in the ETS() function to see how much the forecasts change with a damped trend, or with a Box-Cox transformation. Try to develop an intuition of what each is doing to the forecasts.

[Hint: use h = 20 when forecasting, so you can clearly see the differences between the various options when plotting the forecasts.]
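A sketch comparing a few variants (lambda = 0.2 is purely illustrative):

china <- global_economy |>
  filter(Country == "China")

china |>
  model(
    ets = ETS(GDP),
    ets_damped = ETS(GDP ~ trend("Ad")),
    ets_boxcox = ETS(box_cox(GDP, 0.2))
  ) |>
  forecast(h = 20) |>
  autoplot(china, level = NULL)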
Lab Session 15
Find an ETS model for the Gas data from aus_production and forecast the next few years.
- Why is multiplicative seasonality necessary here?
- Experiment with making the trend damped. Does it improve the forecasts?
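A sketch comparing the automatically selected model with a damped variant:

aus_production |>
  model(
    auto = ETS(Gas),
    damped = ETS(Gas ~ error("M") + trend("Ad") + season("M"))
  ) |>
  forecast(h = "3 years") |>
  autoplot(aus_production)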
Lab Session 16
For the United States GDP data (from global_economy):
- Fit a suitable ARIMA model for the logged data.
- Produce forecasts of your fitted model. Do the forecasts look reasonable?
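A sketch, letting ARIMA() select the orders for the logged series:

us_economy <- global_economy |>
  filter(Country == "United States")

fit <- us_economy |>
  model(ARIMA(log(GDP)))
report(fit)

fit |>
  forecast(h = 10) |>
  autoplot(us_economy)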
Lab Session 17
For the Australian tourism data (from tourism):
- Fit a suitable ARIMA model for all data.
- Produce forecasts of your fitted models.
- Check the forecasts for the “Snowy Mountains” and “Melbourne” regions. Do they look reasonable?
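A sketch (fitting a separate ARIMA model to every series takes a minute or two):

fit <- tourism |>
  model(arima = ARIMA(Trips))

fit |>
  filter(Region %in% c("Snowy Mountains", "Melbourne")) |>
  forecast(h = "3 years") |>
  autoplot(tourism)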
Lab Session 18
Repeat the daily electricity example, but instead of using a quadratic function of temperature, use a piecewise linear function with the “knot” around 20 degrees Celsius (use predictors Temperature and Temp2). How can you optimize the choice of knot?
The data can be created as follows.
vic_elec_daily <- vic_elec |>
  filter(year(Time) == 2014) |>
  index_by(Date = date(Time)) |>
  summarise(
    Demand = sum(Demand) / 1e3,
    Temperature = max(Temperature),
    Holiday = any(Holiday)
  ) |>
  mutate(
    Temp2 = I(pmax(Temperature - 20, 0)),
    Day_Type = case_when(
      Holiday ~ "Holiday",
      wday(Date) %in% 2:6 ~ "Weekday",
      TRUE ~ "Weekend"
    )
  )
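A sketch of the piecewise model as a dynamic regression (the exact specification is an assumption; to optimize the knot, refit with different thresholds in Temp2 and compare AICc values):

fit <- vic_elec_daily |>
  model(ARIMA(Demand ~ Temperature + Temp2 + (Day_Type == "Weekday")))
report(fit)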
Lab Session 19
Repeat Lab Session 18 but using all available data, and handling the annual seasonality using Fourier terms.
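A sketch of how Fourier terms can enter the model formula (K = 4 is a tuning choice, and PDQ(0, 0, 0) switches off the seasonal ARIMA terms, which cannot handle the long annual period):

vic_elec_full <- vic_elec |>
  index_by(Date = date(Time)) |>
  summarise(
    Demand = sum(Demand) / 1e3,
    Temperature = max(Temperature),
    Holiday = any(Holiday)
  ) |>
  mutate(
    Temp2 = I(pmax(Temperature - 20, 0)),
    Day_Type = case_when(
      Holiday ~ "Holiday",
      wday(Date) %in% 2:6 ~ "Weekday",
      TRUE ~ "Weekend"
    )
  )

fit <- vic_elec_full |>
  model(ARIMA(Demand ~ Temperature + Temp2 + (Day_Type == "Weekday") +
                fourier(period = "year", K = 4) + PDQ(0, 0, 0)))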
Lab Session 20
- Prepare aggregations of the PBS data by Concession, Type, and ATC1.
- Use forecast reconciliation with the PBS data, using ETS, ARIMA and SNAIVE models, applied to all but the last 3 years of data.
- Which type of model works best?
- Does the reconciliation improve the forecast accuracy?
- Why doesn’t the reconciliation make any difference to the SNAIVE forecasts?
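A sketch of the aggregation and reconciliation steps (estimating ETS and ARIMA models for every aggregate is slow; min_trace() is one of several reconciliation methods available in fabletools):

pbs_agg <- PBS |>
  aggregate_key(Concession * Type * ATC1, Cost = sum(Cost))

fit <- pbs_agg |>
  filter(Month <= max(Month) - 36) |>
  model(
    ets = ETS(Cost),
    arima = ARIMA(Cost),
    snaive = SNAIVE(Cost)
  ) |>
  reconcile(
    ets_adj = min_trace(ets),
    arima_adj = min_trace(arima),
    snaive_adj = min_trace(snaive)
  )

fit |>
  forecast(h = "3 years") |>
  accuracy(pbs_agg)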