It is becoming increasingly common for organizations to collect huge amounts of data over time, and existing time series analysis tools are not always suitable to handle the scale, frequency and structure of the data collected. In this workshop, we will look at some new packages and methods that have been developed to handle the analysis of large collections of time series.
On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to explore time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feast (along with the tidyverse of course).
Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package. We will look at creating ensemble forecasts and hybrid forecasts, as well as some new forecasting methods that have performed well in large-scale forecasting competitions. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related.
Attendees will learn:
- How to wrangle time series data with familiar tidy tools.
- How to compute time series features and visualize large collections of time series.
- How to select a good forecasting algorithm for your time series.
- How to ensure forecasts of a large collection of time series are coherent.
Is this course for me?
This course will be appropriate for you if you answer yes to these questions:
- Do you already use the tidyverse packages in R such as dplyr, tidyr, tibble and ggplot2?
- Do you need to analyse large collections of related time series?
- Would you like to learn how to use some new tidy tools for time series analysis including visualization, decomposition and forecasting?
Participants should be familiar with the use of R, at least to the point where they can fit a linear regression model, and work with data frames.
Please bring your own laptop with a recent version of R and RStudio installed. The following code will install the main packages needed for the workshop.
install.packages("tidyverse", dependencies=TRUE) install.packages("fpp3", dependencies=TRUE)
|09:00 - 10:30||Session 1|
|10:30 - 11:00||Coffee break|
|11:00 - 12:30||Session 2|
|12:30 - 13:30||Lunch break|
|13:30 - 15:00||Session 3|
|15:00 - 15:30||Coffee break|
|15:30 - 17:00||Session 4|
Lab Session 1
http://robjhyndman.com/data/tourism.xlsx, and read it into R using
- Create a tsibble which is identical to the
tourismtsibble from the
- Find what combination of
Purposehad the maximum number of overnight trips on average.
- Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
Lab Session 2
- Create time plots of the following four time series:
help()to find out about the data in each series.
- For the last plot, modify the axis labels and title.
Lab Session 3
Look at the quarterly tourism data for the Snowy Mountains
snowy <- filter(tourism, Region == "Snowy Mountains")
gg_subseries()to explore the data.
- What do you learn?
Produce a calendar plot for the
pedestriandata from one location and one year.
Lab Session 4
We have introduced the following functions:
ACF. Use these functions to explore the four time series:
vic_elec. Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
Lab Session 5
You can compute the daily changes in the Google stock price in 2018 using
dgoog <- gafa_stock %>% filter(Symbol == "GOOG", year(Date) >= 2018) %>% mutate(trading_day = row_number()) %>% update_tsibble(index=trading_day, regular=TRUE) %>% mutate(diff = difference(Close))
diff look like white noise?
Lab Session 6
Consider the GDP information in
global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
Lab Session 7
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.
- United States GDP from
- Slaughter of Victorian “Bulls, bullocks and steers” in
- Gas production from
- Tobacco from
- Economy class passengers between Melbourne and Sydney from
- Victorian Electricity Demand from
- United States GDP from
Why is a Box-Cox transformation unhelpful for the
Lab Session 8
Produce the following decomposition
expsmooth::cangas %>% as_tsibble() %>% STL(value ~ season(window=7) + trend(window=11)) %>% autoplot()
What happens as you change the values of the two
How does the seasonal shape change over time? [Hint: Try plotting the seasonal component using
Can you produce a plausible seasonally adjusted series? [Hint:
season_adjustis one of the variables returned by
Lab Session 9
GGally::ggpairs()to look at the relationships between the STL-based features. You might wish to change
- Which is the peak quarter for holidays in each state?
Lab Session 10
- Use a feature-based approach to look for outlying series in
- What is unusual about the series you identify as “outliers”.
Lab Session 11
- Produce forecasts using an appropriate benchmark method for household wealth (
hh_budget). Plot the results using
- Produce forecasts using an appropriate benchmark method for Australian takeaway food turnover (
aus_retail). Plot the results using
Lab Session 12
- Compute seasonal naïve forecasts for quarterly Australian beer production from 1992.
- Test if the residuals are white noise. What do you conclude?
Lab Session 13
- Split the household wealth (
hh_budget) into a training and test set, with the last four years in the test set.
- Fit all the appropriate benchmark methods to the training set and forecast the periods covered by the test set.
- Compute the accuracy of your forecasts. Which method does best?
- Repeat the exercise using the Australian takeaway food turnover data (
aus_retail) with a test set of four years.
Lab Session 14
Consider the data set
eggs, the price of a dozen eggs in the United States from 1900-1993. Experiment with the various options in the
ETS() function to see how much the forecasts change with damped trend, or with a Box-Cox transformation. Try to develop an intuition of what each is doing to the forecasts. How can we stop the forecasts becoming negative?
h=100 when forecasting, so you can clearly see the differences between the various options when plotting the forecasts.]
Lab Session 15
Find an ETS model for the Gas data from
aus_production and forecast the next few years.
- Why is multiplicative seasonality necessary here?
- Experiment with making the trend damped. Does it improve the forecasts?
Lab Session 16
For the United States GDP data (from
- Fit a suitable ARIMA model for the logged data.
- Produce forecasts of your fitted model. Do the forecasts look reasonable?
Lab Session 17
For the Australian tourism data (from
- Fit a suitable ARIMA model for all data.
- Produce forecasts of your fitted models.
- Check the forecasts for the “Snowy Mountains” and “Melbourne” regions. Do they look reasonable?
Lab Session 18
Repeat the daily electricity example, but instead of a using a linear function of temperature, use a piecewise linear function with the “knot” around 20 degrees Celsius. How can you optimize the choice of knot?
The data can be created as follows.
vic_elec_daily <- vic_elec %>% filter(year(Time) == 2014) %>% index_by(Date = date(Time)) %>% summarise( Demand = sum(Demand)/1e3, Temperature = max(Temperature), Holiday = any(Holiday) ) %>% mutate(Day_Type = case_when( Holiday ~ "Holiday", wday(Date) %in% 2:6 ~ "Weekday", TRUE ~ "Weekend" ))
Lab Session 19
Repeat Lab Session 16 but using all available data, and handling the annual seasonality using Fourier terms.
Lab Session 20
- Prepare aggregations of the PBS data by Concession, Type, and ATC1.
- Use forecast reconciliation with the PBS data, using ETS, ARIMA and SNAIVE models, applied to all but the last 3 years of data.
- Which type of model works best?
- Does the reconciliation improve the forecast accuracy?
- Why doesn’t the reconcililation make any difference to the SNAIVE forecasts?