Recently I spent a few weeks visiting Professor Tommaso Di Fonzo at the University of Padova (Italy), and one of the things we discussed was finding a notation we were both happy with so we could be more consistent in our future papers.
This is what we came up with. Hopefully others will agree and use it too!
For readers new to forecast reconciliation, Chapter 11 of FPP3 provides an introduction.
We observe $n$ time series at time $t$, written as $\boldsymbol{y}_t$. The base forecasts of $\boldsymbol{y}_{T+h}$ given data to time $T$ are denoted by $\hat{\boldsymbol{y}}_{T+h|T}$.
This was the original formulation of the problem due to Hyndman et al. (2011), but presented here in our new notation.
Let $\boldsymbol{b}_t$ be a vector of $n_b$ “bottom-level” time series at time $t$, and let $\boldsymbol{a}_t$ be a corresponding vector of $n_a = n - n_b$ aggregated time series, where $\boldsymbol{a}_t = \boldsymbol{A}\boldsymbol{b}_t$ and $\boldsymbol{A}$ is the $n_a \times n_b$ “aggregation” matrix specifying how the bottom-level series $\boldsymbol{b}_t$ are to be aggregated to form $\boldsymbol{a}_t$. The full vector of time series is given by $\boldsymbol{y}_t = \begin{bmatrix}\boldsymbol{a}_t \\ \boldsymbol{b}_t\end{bmatrix}$. This leads to the “summing” or “structural” matrix given by $\boldsymbol{S} = \begin{bmatrix}\boldsymbol{A} \\ \boldsymbol{I}_{n_b}\end{bmatrix}$ such that $\boldsymbol{y}_t = \boldsymbol{S}\boldsymbol{b}_t$.
All bottom-up, middle-out, top-down and linear reconciliation methods can be written as $\tilde{\boldsymbol{y}}_{T+h|T} = \boldsymbol{S}\boldsymbol{G}\hat{\boldsymbol{y}}_{T+h|T}$ for different $n_b \times n$ matrices $\boldsymbol{G}$.
Optimal reconciled forecasts are obtained with $\boldsymbol{G} = (\boldsymbol{S}'\boldsymbol{W}_h^{-1}\boldsymbol{S})^{-1}\boldsymbol{S}'\boldsymbol{W}_h^{-1}$, or $\tilde{\boldsymbol{y}}_{T+h|T} = \boldsymbol{M}\hat{\boldsymbol{y}}_{T+h|T}$, where the “mapping” matrix is given by
$$\boldsymbol{M} = \boldsymbol{S}(\boldsymbol{S}'\boldsymbol{W}_h^{-1}\boldsymbol{S})^{-1}\boldsymbol{S}'\boldsymbol{W}_h^{-1}, \tag{1}$$
$\hat{\boldsymbol{y}}_{T+h|T}$ are the $h$-step forecasts of $\boldsymbol{y}_{T+h}$ given data to time $T$, and $\boldsymbol{W}_h$ is an $n \times n$ positive definite matrix. Different choices for $\boldsymbol{W}_h$ lead to different solutions such as OLS, WLS and MinT (Wickramasuriya, Athanasopoulos, and Hyndman 2019).
There is actually no reason for $\boldsymbol{a}_t$ to be restricted to aggregates of $\boldsymbol{b}_t$. They can include any linear combination of the bottom-level series $\boldsymbol{b}_t$, so the corresponding $\boldsymbol{A}$ and $\boldsymbol{S}$ matrices may contain any real values, not just 0s and 1s. Nevertheless, we will use the same notation for this more general setting.
This representation is more efficient and was used by Di Fonzo and Girolimetto (2021). It was also discussed in Wickramasuriya, Athanasopoulos, and Hyndman (2019). Here it is in the new notation.
We can express the structural representation using the constraint matrix $\boldsymbol{C} = \begin{bmatrix}\boldsymbol{I}_{n_a} & -\boldsymbol{A}\end{bmatrix}$ so that $\boldsymbol{C}\boldsymbol{y}_t = \boldsymbol{0}$. Then we can write the mapping matrix as
$$\boldsymbol{M} = \boldsymbol{I}_n - \boldsymbol{W}_h\boldsymbol{C}'(\boldsymbol{C}\boldsymbol{W}_h\boldsymbol{C}')^{-1}\boldsymbol{C}. \tag{2}$$
Note that Equation 2 involves inverting an $n_a \times n_a$ matrix, rather than the $n_b \times n_b$ matrix in Equation 1. For most practical problems, $n_a < n_b$, so Equation 2 is more efficient.
This form of the mapping matrix also allows us to interpret the reconciliation as an additive adjustment to the base forecasts. If the base forecasts are already reconciled, then $\boldsymbol{C}\hat{\boldsymbol{y}}_{T+h|T} = \boldsymbol{0}$ and so $\tilde{\boldsymbol{y}}_{T+h|T} = \hat{\boldsymbol{y}}_{T+h|T}$.
The most general way to express the problem is not to denote individual series as bottom-level or aggregated, but to define the linear constraints $\boldsymbol{C}\boldsymbol{y}_t = \boldsymbol{0}$, where $\boldsymbol{C}$ is an $n_a \times n$ matrix, not necessarily full rank, which may contain any real values.
If $\boldsymbol{C}$ is full rank, then Equation 2 holds.
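As a concrete illustration (my own sketch, not from the original post), here is a tiny R example for a three-series hierarchy with Total = A + B. It checks numerically that the two forms of the mapping matrix (Equation 1 and Equation 2) agree, and that the reconciled forecasts satisfy the constraints. The choice of $\boldsymbol{W}$ here is an arbitrary assumption for illustration.

```r
# Hypothetical 3-series hierarchy: y = (Total, A, B)', with Total = A + B
A <- matrix(c(1, 1), nrow = 1)   # aggregation matrix
S <- rbind(A, diag(2))           # structural "summing" matrix
C <- cbind(diag(1), -A)          # zero constraints: C %*% y = 0
W <- diag(c(2, 1, 1))            # an assumed positive definite W

# Equation 1: M = S (S' W^{-1} S)^{-1} S' W^{-1}
Winv <- solve(W)
M1 <- S %*% solve(t(S) %*% Winv %*% S) %*% t(S) %*% Winv
# Equation 2: M = I - W C' (C W C')^{-1} C
M2 <- diag(3) - W %*% t(C) %*% solve(C %*% W %*% t(C)) %*% C

all.equal(M1, M2)    # TRUE: the two forms coincide
yhat <- c(10, 4, 5)  # incoherent base forecasts (10 != 4 + 5)
ytilde <- M1 %*% yhat
C %*% ytilde         # ~0: reconciled forecasts satisfy the constraints
```

The equivalence relies on the rows of $\boldsymbol{C}$ spanning the null space of $\boldsymbol{S}'$, which holds by construction here since $\boldsymbol{C}\boldsymbol{S} = \boldsymbol{0}$.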
Temporal reconciliation was proposed by Athanasopoulos et al. (2017). Here it is in our new notation.
For simplicity we will assume the original (scalar) time series $y_t$ is observed with a single seasonality of period $m$ (e.g., $m = 12$ for monthly data), and the total length $T$ of the series is an integer multiple of $m$. We will denote the original series by $y^{[1]}_t$, and the various temporally aggregated series by $y^{[k]}_j$.
Let $k_1 < k_2 < \dots < k_K$ denote the factors of $m$ in ascending order, where $k_1 = 1$ and $k_K = m$. For each factor $k$ of $m$, we can construct a temporally aggregated series
$$y^{[k]}_j = \sum_{t = (j-1)k+1}^{jk} y_t, \qquad j = 1,\dots,T/k.$$
Of course, $y^{[1]}_t = y_t$.
Since the observation index $j$ varies with each aggregation level, we define $i = 1,\dots,T/m$ as the observation index of the most aggregated level (e.g., annual), so that $j = i$ at that level.
For each aggregation level, we stack the observations in the column vectors
$$\boldsymbol{y}^{[k]}_i = \big(y^{[k]}_{M_k(i-1)+1}, \dots, y^{[k]}_{M_k i}\big)',$$
where $M_k = m/k$, $k = k_1,\dots,k_K$, and $i = 1,\dots,T/m$. Collecting these in one column vector, we obtain
$$\boldsymbol{y}_i = \big(\boldsymbol{y}^{[k_K]\prime}_i, \boldsymbol{y}^{[k_{K-1}]\prime}_i, \dots, \boldsymbol{y}^{[1]\prime}_i\big)'.$$
The structural representation of this formulation is $\boldsymbol{y}_i = \boldsymbol{S}\boldsymbol{y}^{[1]}_i$, where
$$\boldsymbol{S} = \begin{bmatrix} \boldsymbol{1}'_m \\ \boldsymbol{I}_{m/k_{K-1}} \otimes \boldsymbol{1}'_{k_{K-1}} \\ \vdots \\ \boldsymbol{I}_{m/k_2} \otimes \boldsymbol{1}'_{k_2} \\ \boldsymbol{I}_m \end{bmatrix}$$
and $\boldsymbol{1}_k$ denotes a $k$-vector of ones.
The zero-constrained representation is $\boldsymbol{C}\boldsymbol{y}_i = \boldsymbol{0}$.
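To make the temporal structure concrete, here is a small R sketch (my own illustration, not from the post) that builds the structural matrix for quarterly data, $m = 4$, with factors $k = 4, 2, 1$, and applies it to one year of bottom-level observations.

```r
# Temporal structural matrix for m = 4 (annual, semi-annual, quarterly)
m <- 4
S_annual     <- matrix(1, nrow = 1, ncol = m)        # 1'_4
S_semiannual <- kronecker(diag(2), matrix(1, 1, 2))  # I_2 (x) 1'_2
S_quarterly  <- diag(m)                              # I_4
S <- rbind(S_annual, S_semiannual, S_quarterly)

y_quarters <- c(10, 20, 30, 40)  # one year of quarterly data
S %*% y_quarters                 # stacked: 100; 30, 70; 10, 20, 30, 40
```

The first row sums the year, the next two rows sum each half-year, and the identity block returns the quarters themselves.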
If there are multiple seasonalities that are not integer multiples of each other, the resulting additional temporal aggregations can simply be stacked in $\boldsymbol{y}_i$, and $\boldsymbol{S}$ can be extended accordingly.
Now consider the case where we have both cross-sectional and temporal aggregations, as discussed in Di Fonzo and Girolimetto (2021).
Suppose we have $\boldsymbol{y}_t$ observed at the most temporally disaggregated level, including all the cross-sectionally disaggregated and aggregated (or constrained) series. Let $y_{j,t}$ be the $j$th element of the vector $\boldsymbol{y}_t$, $j = 1,\dots,n$. For each $j$, we can expand $y_{j,t}$ to include all the temporally aggregated variants, giving a vector $\boldsymbol{y}_{j,i}$ of length $\sum_{k} m/k$ (summing over the factors $k$ of $m$), constructed as in the temporal case above. These can then be stacked into a long vector:
$$\boldsymbol{y}_i = \big(\boldsymbol{y}'_{1,i}, \boldsymbol{y}'_{2,i}, \dots, \boldsymbol{y}'_{n,i}\big)'.$$
If $\boldsymbol{S}_{\text{cs}}$ denotes the structural matrix for the cross-sectional reconciliation, and $\boldsymbol{S}_{\text{te}}$ denotes the structural matrix for the temporal reconciliation, then the cross-temporal structural matrix is $\boldsymbol{S}_{\text{ct}} = \boldsymbol{S}_{\text{cs}} \otimes \boldsymbol{S}_{\text{te}}$, so that $\boldsymbol{y}_i = \boldsymbol{S}_{\text{ct}}\boldsymbol{b}_i$, where $\boldsymbol{b}_i$ contains the bottom-level series at the most temporally disaggregated level.
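Assuming the cross-temporal structural matrix is the Kronecker product of the cross-sectional and temporal structural matrices (as in Di Fonzo and Girolimetto 2021), here is a toy R sketch; the two-series hierarchy and the two-period temporal structure are illustrative assumptions, not from the post.

```r
# Hypothetical cross-temporal structural matrix via a Kronecker product
A <- matrix(c(1, 1), nrow = 1)            # cross-sectional: Total = A + B
S_cs <- rbind(A, diag(2))                 # cross-sectional structural matrix
S_te <- rbind(matrix(1, 1, 2), diag(2))   # temporal: m = 2 (e.g. half-years -> year)
S_ct <- kronecker(S_cs, S_te)             # cross-temporal structural matrix

b <- c(1, 2, 3, 4)  # bottom level: series A and B at the two sub-periods
y <- S_ct %*% b     # all cross-temporal aggregates, stacked
dim(S_ct)           # 9 x 4
```

Each of the three cross-sectional series (Total, A, B) appears at all three temporal levels (annual, the two half-years), giving the nine rows.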
The focus this year is on communicating with data. As with all WOMBAT events, the purpose is to bring together analysts from academia, industry and government to learn and discuss new open source tools for business analytics and data science.
The first day will be virtual with 8 tutorials to choose from. I will be giving one on “Exploratory time series analysis using R”. Each tutorial has limited places, so register early!
The second day will be an in-person workshop, limited to 60 people. The keynote speaker is Amanda Cox, Head of Special Data Projects at USAFacts. She is well-known for the sixteen years she spent at The New York Times producing some amazing data visualizations. She will speak on “Charts and Words: Being more influential with your data graphics”.
Other invited speakers will talk about data communication in environment, health and sport.
The workshop on December 7 will be held at the Royal South Yarra Lawn Tennis Club, located near the Yarra River, at 310 Williams Rd N, Toorak.
The 6 Dec online tutorials are each limited to 20 participants. Register for tutorials only here. Registering for the 7 Dec workshop provides a 30% discount on tutorial registration. Your discount code will be sent in the confirmation email after you have first registered for 7 Dec.
The 7 Dec in-person event is limited to 60 attendees. Register here. Registration includes lunch, morning and afternoon tea.
For more details, see the event website.
But now there are several alternatives, and I’ve opted for giscus which is very lightweight, is built on Github Discussions, and is open source with no tracking or advertising. The other system I considered was utterances which is also hosted on Github, but uses issues rather than discussions. Consequently, comments on utterances can’t be threaded (with replies to previous comments). Also, giscus appears to have a much more active development team behind it.
The first step was to set up giscus on my blog. With quarto, this simply requires adding a few lines to the `_metadata.yml` file in the relevant folder. Here is what it looks like for me:
comments:
  giscus:
    repo: robjhyndman/robjhyndman.com
    repo-id: "R_kgDOH5G3Uw"
    category: "Announcements"
    category-id: "DIC_kwDOH5G3U84CRUp9"
    mapping: "pathname"
    reactions-enabled: true
    loading: lazy
    input-position: "bottom"
    theme: "light"
Then I needed to set up giscus on the Github repo that hosts the website (robjhyndman/robjhyndman.com). The instructions on the giscus website make it very simple.
The last step was the hardest – how to migrate 4000 comments from Disqus to giscus. Here I followed the nice blog post of Maëlle Salmon to download the Disqus comments as an xml file, and wrangle them into a tibble. Then I needed to use the GraphQL API for Github Discussions to generate all the comments on the Github repo. Fortunately, Mitch O’Hara-Wild came to my rescue (as usual), and helped with some of this code. The resulting code is here if anyone wants to try to do the same. You will need to change some specific details in lines 9-13. Everything else should work as it is.
On day 1, we will look at the `tsibble` data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to analyse time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be `tsibble`, `lubridate` and `feasts` (along with the tidyverse of course).
Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the `fable` package. We will look at creating ensemble forecasts and hybrid forecasts, as well as some new forecasting methods that have performed well in large-scale forecasting competitions. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related.
Places are limited, so sign up early if you’re interested.
For the blogdown site, I had to (painfully) hack my own hugo theme to make it look the way I wanted. This one is pretty much straight out of the Quarto box other than some css styling, and some tweaking of Quarto templates. In case anyone wants to create something similar for themselves, I’ve set up a template version with just the bare minimum so you don’t need to wade through the extra folders I’ve kept to ensure existing links continue to work.
Actually, setting up a website in Quarto is extremely easy when following the online instructions. The hard part for me was the migration. There are about 800 pages that make up this site, and about 4000 comments on my blog. I didn’t want to break any existing links, so retaining the same structure was important.
I also decided to convert the commenting system from Disqus to giscus, which is built on Github Discussions. I’ll describe that conversion in a separate post in case anyone else wants to do something similar.
There are almost certainly things that are still broken, so please let me know in the comments below if you find anything that doesn’t work as it should.
The cricketdata package has been around for a few years on github, and it has been on CRAN since February 2022. There are only four functions:

- `fetch_cricinfo()`: Fetch team data on international cricket matches provided by ESPNCricinfo.
- `fetch_player_data()`: Fetch individual player data on international cricket matches provided by ESPNCricinfo.
- `find_player_id()`: Search for the player ID on ESPNCricinfo.
- `fetch_cricsheet()`: Fetch ball-by-ball, match and player data from Cricsheet.

Jacquie Tran wrote the first version of the `fetch_cricsheet()` function, and the vignette which demonstrates it.
Here are some examples demonstrating the Cricinfo functions.
library(cricketdata)
library(tidyverse)
The `fetch_cricinfo()` function downloads data for international T20, ODI or Test matches, for men or women, and for batting, bowling or fielding. By default, it downloads career-level statistics for individual players. Here is an example for women T20 bowlers.
# Fetch all Women's T20 data
wt20 <- fetch_cricinfo("T20", "Women", "Bowling")
wt20 %>%
select(Player, Country, Matches, Runs, Wickets, Economy, StrikeRate)
#> # A tibble: 1,798 × 7
#> Player Country Matches Runs Wickets Economy StrikeRate
#> <chr> <chr> <int> <int> <int> <dbl> <dbl>
#> 1 A Mohammed West Indies 117 2206 125 5.58 19.0
#> 2 S Ismail South Africa 105 2153 115 5.81 19.3
#> 3 EA Perry Australia 126 2237 115 5.87 19.9
#> 4 KH Brunt England 104 2019 108 5.50 20.4
#> 5 M Schutt Australia 84 1685 108 6.05 15.5
#> 6 Nida Dar Pakistan 114 1951 106 5.35 20.6
#> 7 SFM Devine New Zealand 107 1822 104 6.36 16.5
#> 8 A Shrubsole England 79 1587 102 5.96 15.7
#> 9 Poonam Yadav India 72 1495 98 5.75 15.9
#> 10 SR Taylor West Indies 111 1639 98 5.66 17.7
#> # … with 1,788 more rows
We can plot each bowler’s strike rate (balls per wicket) against their bowling average (runs per wicket). Each observation represents one player who has taken at least 50 international wickets.
wt20 %>%
filter(Wickets >= 50) %>%
ggplot(aes(y = StrikeRate, x = Average)) +
geom_point(alpha = 0.3, col = "blue") +
ggtitle("Women International T20 Bowlers") +
ylab("Balls per wicket") + xlab("Runs per wicket")
The extraordinary result on the bottom left is due to the Thai all-rounder, Nattaya Boochatham, who has taken 59 wickets, with a strike rate of 13.475, an average of 8.78, and an economy rate of 3.909.
The next example shows Australian men’s ODI batting results by innings.
# Fetch all Australian Men's ODI data by innings
menODI <- fetch_cricinfo("ODI", "Men", "Batting", type = "innings", country = "Australia")
menODI %>%
select(Date, Player, Runs, StrikeRate, NotOut)
#> # A tibble: 10,675 × 5
#> Date Player Runs StrikeRate NotOut
#> <date> <chr> <int> <dbl> <lgl>
#> 1 2011-04-11 SR Watson 185 193. TRUE
#> 2 2007-02-20 ML Hayden 181 109. TRUE
#> 3 2017-01-26 DA Warner 179 140. FALSE
#> 4 2015-03-04 DA Warner 178 134. FALSE
#> 5 2001-02-09 ME Waugh 173 117. FALSE
#> 6 2016-10-12 DA Warner 173 127. FALSE
#> 7 2004-01-16 AC Gilchrist 172 137. FALSE
#> 8 2019-06-20 DA Warner 166 113. FALSE
#> 9 2006-03-12 RT Ponting 164 156. FALSE
#> 10 2016-12-04 SPD Smith 164 104. FALSE
#> # … with 10,665 more rows
menODI %>%
ggplot(aes(y = Runs, x = Date)) +
geom_point(alpha = 0.2, col = "#D55E00") +
geom_smooth() +
ggtitle("Australia Men ODI: Runs per Innings")
The average number of runs per innings slowly increased until about 2000, after which it has remained largely constant at about 35.1. This is a little higher than the smooth line shown on the plot, which has not taken account of not-out results.
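To see the effect of the not-out adjustment mentioned above, we can compare the raw mean with the traditional batting average (runs per dismissal) for these innings. This is my own illustrative sketch, using the `menODI` data downloaded earlier; it is not from the original post.

```r
# Compare mean runs per innings with the batting average (runs per dismissal)
# for Australian men's ODI innings since 2000
menODI %>%
  filter(Date >= as.Date("2000-01-01"), !is.na(Runs)) %>%
  summarise(
    mean_per_innings = mean(Runs),
    batting_average = sum(Runs) / sum(!NotOut)  # not-outs excluded from divisor
  )
```

Because not-out innings contribute runs but no dismissal, the batting average is always at least as large as the raw mean.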
Next, we demonstrate some of the fielding data available, using Test match fielding from Indian men’s players.
Indfielding <- fetch_cricinfo("Test", "Men", "Fielding", country = "India")
Indfielding
#> # A tibble: 303 × 11
#> Player Start End Matches Innings Dismis…¹ Caught Caugh…² Caugh…³
#> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 MS Dhoni 2005 2014 90 166 294 256 0 256
#> 2 R Dravid 1996 2012 163 299 209 209 209 0
#> 3 SMH Kirmani 1976 1986 88 151 198 160 0 160
#> 4 VVS Laxman 1996 2012 134 248 135 135 135 0
#> 5 KS More 1986 1993 49 90 130 110 0 110
#> 6 RR Pant 2018 2022 31 61 122 111 0 111
#> 7 SR Tendulkar 1989 2013 200 366 115 115 115 0
#> 8 SM Gavaskar 1971 1987 125 216 108 108 108 0
#> 9 NR Mongia 1994 2001 44 77 107 99 0 99
#> 10 M Azharuddin 1984 2000 99 177 105 105 105 0
#> # … with 293 more rows, 2 more variables: Stumped <int>,
#> # MaxDismissalsInnings <dbl>, and abbreviated variable names
#> # ¹Dismissals, ²CaughtFielder, ³CaughtBehind
We can plot the number of dismissals against the number of matches for all Indian male Test players. Because wicket keepers typically have many more dismissals than other players, they are shown in a different colour.
Indfielding %>%
mutate(wktkeeper = (CaughtBehind > 0) | (Stumped > 0)) %>%
ggplot(aes(x = Matches, y = Dismissals, col = wktkeeper)) +
geom_point() +
ggtitle("Indian Men Test Fielding")
The high number of dismissals, close to 300, is of course due to MS Dhoni. Another interesting one here is the non-wicketkeeper with over 200 dismissals, which is Rahul Dravid who took 209 catches during his career.
Finally, let’s look at individual player data. The `fetch_player_data()` function requires the Cricinfo player ID, which you can either look up on their website, or find using the `find_player_id()` function. We will look at the ODI results of Australia’s captain, Meg Lanning.
meg_lanning_id <- find_player_id("Lanning")$ID
MegLanning <- fetch_player_data(meg_lanning_id, "ODI") %>%
mutate(NotOut = (Dismissal == "not out"))
MegLanning
#> # A tibble: 100 × 14
#> Date Innings Opposition Ground Runs Mins BF X4s X6s SR
#> <date> <int> <chr> <chr> <dbl> <dbl> <int> <int> <int> <dbl>
#> 1 2011-01-05 1 ENG Women Perth 20 60 38 2 0 52.6
#> 2 2011-01-07 2 ENG Women Perth 104 148 118 8 1 88.1
#> 3 2011-06-14 2 NZ Women Brisb… 11 15 14 2 0 78.6
#> 4 2011-06-16 1 NZ Women Brisb… 5 8 8 1 0 62.5
#> 5 2011-06-30 1 NZ Women Chest… 17 24 20 3 0 85
#> 6 2011-07-02 2 India Wom… Chest… 23 40 32 3 0 71.9
#> 7 2011-07-05 2 ENG Women Lord's 43 40 33 9 0 130.
#> 8 2011-07-07 2 ENG Women Worms… 0 2 3 0 0 0
#> 9 2012-03-12 1 India Wom… Ahmed… 45 61 44 7 0 102.
#> 10 2012-03-14 1 India Wom… Wankh… 128 125 104 19 1 123.
#> # … with 90 more rows, and 4 more variables: Pos <int>, Dismissal <chr>,
#> # Inns <int>, NotOut <lgl>
We can plot her runs per innings on the vertical axis over time on the horizontal axis.
# Compute batting average
MLave <- MegLanning %>%
filter(!is.na(Runs)) %>%
summarise(Average = sum(Runs) / (n() - sum(NotOut))) %>%
pull(Average)
names(MLave) <- paste("Average =", round(MLave, 2))
# Plot ODI scores
ggplot(MegLanning) +
geom_hline(aes(yintercept = MLave), col="gray") +
geom_point(aes(x = Date, y = Runs, col = NotOut)) +
ggtitle("Meg Lanning ODI Scores") +
scale_y_continuous(sec.axis = sec_axis(~., breaks = MLave))
She has shown amazing consistency over her career, with centuries scored in every year of her career except for 2021, when her highest score from 6 matches was 53.
Some of these data sets have been made available in R packages previously, based on `ts` objects, which worked ok for annual, quarterly and monthly data, but are not a good format for daily and sub-daily data.

The `tsibbledata` package provides the function `monash_forecasting_repository()` to download the data and return it as a `tsibble` object. These can be analysed and plotted using the `feasts` package, and modelled and forecast using the `fable` package. It is convenient to simply load the `fpp3` package, which will then load all the necessary packages.
library(fpp3)
── Attaching packages ────────────────────────────────── fpp3 0.4.0.9000 ──
✔ tibble 3.1.8 ✔ tsibble 1.1.2
✔ dplyr 1.0.9 ✔ tsibbledata 0.4.1.9000
✔ tidyr 1.2.0 ✔ feasts 0.3.0.9000
✔ lubridate 1.8.0 ✔ fable 0.3.2.9000
✔ ggplot2 3.3.6 ✔ fabletools 0.3.2
── Conflicts ──────────────────────────────────────────── fpp3_conflicts ──
✖ lubridate::date() masks base::date()
✖ dplyr::filter() masks stats::filter()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval() masks lubridate::interval()
✖ dplyr::lag() masks stats::lag()
✖ tsibble::setdiff() masks base::setdiff()
✖ tsibble::union() masks base::union()
To download the M3 data, we need to know the unique zenodo identifiers for each data set. From the forecastingdata.org page, find the M3 links (there are four, one for each observational frequency). For example, the Yearly link takes you to https://zenodo.org/record/4656222, so the Zenodo identifier for this data set is 4656222. Similarly, the Quarterly, Monthly and Other links have identifiers 4656262, 4656298 and 4656335 respectively.
m3_yearly <- monash_forecasting_repository(4656222)
m3_quarterly <- monash_forecasting_repository(4656262)
m3_monthly <- monash_forecasting_repository(4656298)
m3_other <- monash_forecasting_repository(4656335)
The first three data sets are stored with a date index, so they are read as daily data. Therefore we first need to convert them to yearly, quarterly and monthly data.
m3_yearly <- m3_yearly %>%
mutate(year = year(start_timestamp)) %>%
as_tsibble(index=year) %>%
select(-start_timestamp)
m3_quarterly <- m3_quarterly %>%
mutate(quarter = yearquarter(start_timestamp)) %>%
as_tsibble(index=quarter) %>%
select(-start_timestamp)
m3_monthly <- m3_monthly %>%
mutate(month = yearmonth(start_timestamp)) %>%
as_tsibble(index=month) %>%
select(-start_timestamp)
The resulting monthly data set is shown below.
m3_monthly
# A tsibble: 167,562 x 3 [1M]
# Key: series_name [1,428]
series_name value month
<chr> <dbl> <mth>
1 T1 2640 1990 Jan
2 T1 2640 1990 Feb
3 T1 2160 1990 Mar
4 T1 4200 1990 Apr
5 T1 3360 1990 May
6 T1 2400 1990 Jun
7 T1 3600 1990 Jul
8 T1 1920 1990 Aug
9 T1 4200 1990 Sep
10 T1 4560 1990 Oct
# … with 167,552 more rows
The series names are `T1`, `T2`, …. The M3 data included both training and test data. These have been combined in this data set.
This data set contains total half-hourly electricity demand by state from 1 January 2002 to 1 April 2015, for five states of Australia: New South Wales, Queensland, South Australia, Tasmania, and Victoria. A subset of this data (one state and only three years) is provided as `tsibbledata::vic_elec`.
aus_elec <- monash_forecasting_repository(4659727)
aus_elec
# A tsibble: 1,155,264 x 4 [30m] <UTC>
# Key: series_name, state [5]
series_name state start_timestamp value
<chr> <chr> <dttm> <dbl>
1 T1 NSW 2002-01-01 00:00:00 5714.
2 T1 NSW 2002-01-01 00:30:00 5360.
3 T1 NSW 2002-01-01 01:00:00 5015.
4 T1 NSW 2002-01-01 01:30:00 4603.
5 T1 NSW 2002-01-01 02:00:00 4285.
6 T1 NSW 2002-01-01 02:30:00 4075.
7 T1 NSW 2002-01-01 03:00:00 3943.
8 T1 NSW 2002-01-01 03:30:00 3884.
9 T1 NSW 2002-01-01 04:00:00 3878.
10 T1 NSW 2002-01-01 04:30:00 3838.
# … with 1,155,254 more rows
aus_elec %>%
filter(state=="VIC") %>%
autoplot(value) +
labs(x = "Time", y="Electricity demand (MWh)")
We also provide some accuracy measures of the performance of 13 baseline forecasting methods applied to the data sets in the repository. This makes it easy for anyone proposing a new method to compare against some standard existing methods, without having to do all the calculations themselves.
The data can be loaded as a Pandas dataframe by following this example in the github repository. Download the `.tsf` files as required from Zenodo and put them into the `tsf_data` folder.
…the `fable` package, and the facilities will be added there. But in the meantime, if you are using the forecast package and want to simulate from a fitted TBATS model, here is how to do it.
Doing it efficiently would require a more complicated approach, but this is super easy if you are willing to sacrifice some speed. The trick is to realise that a simulation can be handled easily for almost any time series model using residuals and one-step forecasts. Note that a residual is given by $e_t = y_t - \hat{y}_{t|t-1}$, so we can write $y_t = \hat{y}_{t|t-1} + e_t$.
Therefore, given data to time $T$, we can simulate iteratively using
$$y^*_{T+i} = \hat{y}_{T+i|T+i-1} + e^*_{T+i}, \qquad i = 1,\dots,h,$$
where $e^*_{T+i}$ is randomly generated from the error distribution, or bootstrapped by randomly sampling from past residuals. The value of $\hat{y}_{T+i|T+i-1}$ can be obtained by applying the model to the series $\{y_1,\dots,y_T,y^*_{T+1},\dots,y^*_{T+i-1}\}$ (without re-estimating the parameters) and forecasting one step ahead. This is the same trick we use to get prediction intervals for neural network models.
Because `simulate()` is an S3 method in R, we have to make sure the corresponding `simulate.tbats()` function has all the necessary arguments to match other `simulate` methods. It’s also good to make it as close as possible to the other `simulate` functions in the forecast package, to make it easier for users when switching between them. The real work is done in the last few lines.
simulate.tbats <- function(object, nsim = length(object$y),
                           seed = NULL, future = TRUE,
                           bootstrap = FALSE, innov = NULL, ...) {
  if (is.null(innov)) {
    if (!exists(".Random.seed", envir = .GlobalEnv)) {
      runif(1)
    }
    if (is.null(seed)) {
      RNGstate <- .Random.seed
    } else {
      R.seed <- .Random.seed
      set.seed(seed)
      RNGstate <- structure(seed, kind = as.list(RNGkind()))
      on.exit(assign(".Random.seed", R.seed, envir = .GlobalEnv))
    }
  } else {
    nsim <- length(innov)
  }
  # Generate the errors: bootstrapped, Gaussian, or user-supplied
  if (bootstrap) {
    res <- residuals(object)
    res <- na.omit(res - mean(res, na.rm = TRUE))
    e <- sample(res, nsim, replace = TRUE)
  } else if (is.null(innov)) {
    e <- rnorm(nsim, 0, sqrt(object$variance))
  } else {
    e <- innov
  }
  x <- getResponse(object)
  y <- numeric(nsim)
  if (future) {
    dataplusy <- x
  } else {
    # Start somewhere in the original series
    dataplusy <- ts(sample(x, 1), start = -1 / frequency(x),
                    frequency = frequency(x))
  }
  # Iteratively append one-step forecasts plus errors to the series
  fitplus <- object
  for (i in seq_along(y)) {
    y[i] <- forecast(fitplus, h = 1)$mean + e[i]
    dataplusy <- ts(c(dataplusy, y[i]),
                    start = start(dataplusy), frequency = frequency(dataplusy))
    fitplus <- tbats(dataplusy, model = fitplus)
  }
  return(tail(dataplusy, nsim))
}
I’ve added this to the forecast package for the next version.

Something similar could be written for any other forecasting function that doesn’t already have a `simulate` method. Just swap the `tbats()` call for the relevant modelling function.
library(forecast)
library(ggplot2)
fit <- tbats(USAccDeaths)
p <- USAccDeaths %>% autoplot() +
labs(x = "Year", y = "US Accidental Deaths",
title = "TBATS simulations")
for (i in seq(9)) {
p <- p + autolayer(simulate(fit, nsim = 36), series = paste("Sim", i))
}
p
General online job sites such as seek or careerjet are ok, but job-seekers can find it hard to find the relevant openings because job titles are so varied. In the general area of statistics, a job can appear under the titles “statistician”, “analyst”, “data miner”, “data manager”, “financial engineer” and a few dozen other labels. Many employers don’t place the job in the best category, often because they don’t understand what skills are required to do the job. Nevertheless, if I was looking for a job, I would certainly set up some automated searches on these sites.
In statistics, there are well-established job websites that are the best places for both employers and potential employees to meet up.
I do not know what is provided in other countries, but check with your national statistical association.
There are also e-mail lists and web forums that are widely subscribed and often contain job postings.
If I’ve missed any good places to advertise jobs, please add them in the comments.
The `tsoutliers()` function in the forecast package for R is useful for identifying anomalies in a time series. However, it is not properly documented anywhere. This post is intended to fill that gap.
The function began as an answer on CrossValidated and was later added to the forecast package because I thought it might be useful to other people. It has since been updated and made more reliable.
The procedure decomposes the time series into trend, seasonal and remainder components:
$$y_t = T_t + S_t + R_t.$$
The seasonal component is optional, and it may contain several seasonal patterns corresponding to the seasonal periods in the data. The idea is to first remove any seasonality and trend in the data, and then find outliers in the remainder series, $R_t$.
For data observed more frequently than annually, we use a robust approach to estimate $T_t$ and $S_t$ by first applying the MSTL method to the data. MSTL will iteratively estimate the seasonal component(s).
Then the strength of seasonality is measured using
$$F_s = \max\left(0,\ 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_t + R_t)}\right).$$
If $F_s > 0.6$, a seasonally adjusted series is computed: $y^*_t = y_t - S_t$. A seasonal strength threshold is used here because the estimate of $S_t$ is likely to be overfitted and very noisy if the underlying seasonality is too weak (or non-existent), potentially masking any outliers by having them absorbed into the seasonal component.
If $F_s \le 0.6$, or if the data are observed annually or less frequently, we simply set $y^*_t = y_t$.
Next, we re-estimate the trend component from the $y^*_t$ values. For non-seasonal time series such as annual data, this is necessary as we don’t have the trend estimate from the STL decomposition. But even if we have computed an STL decomposition, we may not have used it if $F_s \le 0.6$.
The trend component $T_t$ is estimated by applying Friedman’s super smoother (via `supsmu()`) to the $y^*_t$ data. This function has been tested on lots of data and tends to work well on a wide range of problems.
We look for outliers in the estimated remainder series
$$R^*_t = y^*_t - T_t.$$
If $Q_1$ denotes the 25th percentile and $Q_3$ denotes the 75th percentile of the remainder values, then the interquartile range is defined as $\text{IQR} = Q_3 - Q_1$. Observations are labelled as outliers if they are less than $Q_1 - 3\,\text{IQR}$ or greater than $Q_3 + 3\,\text{IQR}$. This is the definition used by Tukey (1977, p44) in his original boxplot proposal for “far out” values.
If the remainder values are normally distributed, then the probability of an observation being identified as an outlier is approximately 1 in 427000.
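That probability can be checked with a quick calculation (my own sketch, not from the post): under normality, $Q_3 = \Phi^{-1}(0.75) \approx 0.6745$ standard deviations, so the upper cutoff $Q_3 + 3\,\text{IQR}$ sits about 4.72 standard deviations above the mean.

```r
# Probability of a normal observation falling beyond the "far out" cutoffs
q3 <- qnorm(0.75)
iqr <- q3 - qnorm(0.25)
cutoff <- q3 + 3 * iqr    # approximately 4.72 standard deviations
1 / (2 * pnorm(-cutoff))  # roughly one observation in 427,000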
Any outliers identified in this manner are replaced with linearly interpolated values using the neighbouring observations, and the process is repeated.
The gold price data contains daily morning gold prices in US dollars from 1 January 1985 to 31 March 1989. The data was given to me by a client who wanted me to forecast the gold price. (I told him it would be almost impossible to beat a naive forecast). The data are shown below.
library(fpp2)
autoplot(gold)
There are periods of missing values, and one obvious outlier which is about $100 greater than what would be expected. This was simply a typo, with someone typing 593.70 rather than 493.70. Let’s see if the `tsoutliers()` function can spot it.
tsoutliers(gold)
$index
[1] 770
$replacements
[1] 495
Sure enough, it is easily found and the suggested replacement (linearly interpolated) is close to the true value.
The `tsclean()` function removes outliers identified in this way, and replaces them (and any missing values) with linearly interpolated replacements.
autoplot(tsclean(gold), series="clean", color='red', lwd=0.9) +
autolayer(gold, series="original", color='gray', lwd=1) +
geom_point(data = tsoutliers(gold) %>% as.data.frame(),
aes(x=index, y=replacements), col='blue') +
labs(x = "Day", y = "Gold price ($US)")
The blue dot shows the replacement for the outlier, the red lines show the replacements for the missing values.
Hi, I’m an MSc student and am shortly starting my project/dissertation on time series data. I’ve started reading Version 3 of your book and improving my R skills but am wondering if there’s any way I can read V3 that will allow annotation? Thanks
For personal annotation of websites, the Hypothesis extension is very useful. You can highlight, annotate and discuss with other readers. You will need to set up a (free) account at https://web.hypothes.is/start/
Thanks you so much for putting out this book! … would it be possible to add OpenDyslexic (https://opendyslexic.org/) to your list of available type face on your website? I am attempting to make my way through your text book, but access to this font would make my life immeasurably easier.
The simplest approach here is to install the OpenDyslexic Font extension. When installed, the fpp3 book looks like this:
The only issue is that the equations are not rendered properly by default. But these can be fixed. First, right click on an equation and choose Math Settings/Math Renderer/HTML-CSS
. Then right click again and choose Math Settings/Scale all math/50%
. You only need to do these steps once.
By the way, a print version of the third edition is now available.
Time series cross-validation is handled in the `fable` package using the `stretch_tsibble()` function to generate the data folds. In this post I will give two examples of how to use it, one without covariates and one with covariates.
Here is a simple example using quarterly Australian beer production from 1956 Q1 to 2010 Q2. First we create a data object containing many training sets starting with 3 years (12 observations), and adding one quarter at a time until all data are included.
library(fpp3)
beer <- aus_production %>%
select(Beer) %>%
stretch_tsibble(.init = 12, .step=1)
beer
# A tsibble: 23,805 x 3 [1Q]
# Key: .id [207]
Beer Quarter .id
<dbl> <qtr> <int>
1 284 1956 Q1 1
2 213 1956 Q2 1
3 227 1956 Q3 1
4 308 1956 Q4 1
5 262 1957 Q1 1
6 228 1957 Q2 1
7 236 1957 Q3 1
8 320 1957 Q4 1
9 272 1958 Q1 1
10 233 1958 Q2 1
# … with 23,795 more rows
This gives 207 training sets of increasing size. We fit an ETS model to each training set and produce one year of forecasts from each model. Because I want to compute the RMSE for each forecast horizon, I will add the horizon `h` to the resulting object.
fc <- beer %>%
model(ETS(Beer)) %>%
forecast(h = "1 year") %>%
group_by(.id) %>%
mutate(h = row_number()) %>%
ungroup() %>%
as_fable(response="Beer", distribution=Beer)
Finally, we compare the forecasts against the actual values and average over the folds.
fc %>%
accuracy(aus_production, by=c("h",".model")) %>%
select(h, RMSE)
# A tibble: 4 × 2
h RMSE
<int> <dbl>
1 1 17.1
2 2 16.7
3 3 18.1
4 4 19.2
Forecasts of 1 and 2 quarters ahead both have about the same accuracy here, but then the error increases for horizons 3 and 4.
Things are a little more complicated when we want to use covariates in the model. Here is an example of monthly quotations issued by a US insurance company modelled as a function of the TV advertising expenditure in the same month.
The first step is the same, where we stretch the tsibble. This time we will start with one year of data.
stretch <- insurance %>%
  stretch_tsibble(.step = 1, .init = 12)
stretch
# A tsibble: 754 x 4 [1M]
# Key: .id [29]
Month Quotes TVadverts .id
<mth> <dbl> <dbl> <int>
1 2002 Jan 13.0 7.21 1
2 2002 Feb 15.4 9.44 1
3 2002 Mar 13.2 7.53 1
4 2002 Apr 13.0 7.21 1
5 2002 May 15.4 9.44 1
6 2002 Jun 11.7 6.42 1
7 2002 Jul 10.1 5.81 1
8 2002 Aug 10.8 6.20 1
9 2002 Sep 13.3 7.59 1
10 2002 Oct 14.6 8.00 1
# … with 744 more rows
Next we fit a regression model with AR(1) errors to each fold.
fit <- stretch %>%
  model(ARIMA(Quotes ~ 1 + pdq(1, 0, 0) + TVadverts))
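Before forecasting, it can be worth inspecting what has been estimated on at least one fold; for example, the final fold, which uses all the available data. This inspection step is an addition to the workflow above, and assumes the `fit` mable just created.

```r
# Report the dynamic regression model fitted to the final (largest) fold.
# (An inspection step added here; assumes the `fit` mable from above.)
library(fpp3)
fit %>%
  filter(.id == max(.id)) %>%
  report()
```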
Before we forecast, we need to provide the advertising expenditure for the future periods. We will forecast up to 3 steps ahead, so the test data needs to have 3 observations per fold.
test <- new_data(stretch, n = 3) %>%
  # Add in covariates from the corresponding month
  left_join(insurance, by = "Month")
test
# A tsibble: 87 x 4 [1M]
# Key: .id [29]
Month .id Quotes TVadverts
<mth> <int> <dbl> <dbl>
1 2003 Jan 1 17.0 9.53
2 2003 Feb 1 16.9 9.39
3 2003 Mar 1 16.5 8.92
4 2003 Feb 2 16.9 9.39
5 2003 Mar 2 16.5 8.92
6 2003 Apr 2 15.3 8.37
7 2003 Mar 3 16.5 8.92
8 2003 Apr 3 15.3 8.37
9 2003 May 3 15.9 9.84
10 2003 Apr 4 15.3 8.37
# … with 77 more rows
The actual value in each month is also included, but that will be ignored when forecasting.
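As a quick sanity check (an addition to the original workflow, assuming the `test` object above), every fold should contribute exactly three future rows; folds whose future months extend past the end of the sample will have missing TVadverts values, and those forecasts cannot be evaluated against actuals anyway.

```r
# Sanity check (an addition): count rows and missing covariates per fold,
# then summarise the distinct patterns. Assumes `test` from above.
library(fpp3)
test %>%
  as_tibble() %>%
  group_by(.id) %>%
  summarise(rows = n(), missing = sum(is.na(TVadverts))) %>%
  count(rows, missing)
```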
fc <- forecast(fit, new_data = test) %>%
  group_by(.id) %>%
  mutate(h = row_number()) %>%
  ungroup() %>%
  as_fable(response = "Quotes", distribution = Quotes)
Finally, we can compare the forecasts against the actual values, averaged across each forecast horizon.
fc %>%
  accuracy(insurance, by = c("h", ".model")) %>%
  select(h, RMSE)
# A tibble: 3 × 2
h RMSE
<int> <dbl>
1 1 0.761
2 2 1.20
3 3 1.49
(Updated: 17 Nov 2021)
Date | Podcast | Episode |
---|---|---|
17 November 2021 | The Random Sample | Software as a first class research output |
24 May 2021 | Data Skeptic | Forecasting principles and practice |
12 April 2021 | Seriously Social | Forecasting the future: the science of prediction |
6 February 2021 | Forecasting Impact | Rob Hyndman |
19 July 2020 | The Curious Quant | Forecasting COVID, time series, and why causality doesn't matter as much as you think |
27 May 2020 | The Random Sample | Forecasting the future & the future of forecasting |
9 October 2019 | Thought Capital | Forecasts are always wrong (but we need them anyway) |
Guest editors: George Athanasopoulos, Rob J Hyndman, Anastasios Panagiotelis, and Nikolaos Kourentzes.
Submission deadline: 31 August 2021.
Areas of interest include, but are not limited to:
For further details, see bit.ly/ijfhierarchical
Dear Hyndman, Rob J.
Hope you are doing well.
I write this letter on behalf of authors seeking to co-publish. We have seen your previous works (https://www.scopus.com/authid/detail.uri?authorId=7006914313&eid=2-s2.0-85063573156 ) and they were considered to be of high quality. Therefore, I offer you a co-publishing partnership.
Our clients wish to buy positions in scientific articles that are in line with their research interests. As our partner, you can offer us a position or two in your work. In this way, we develop a network of scientists with whom we would like to partner. We hope you will agree that this type of partnership can be mutually beneficial, and beneficial for authors too!
If you are interested in this, please, let me know. I will forward all required information to you and answer all your questions.
P.S. Sorry for bothering you if you find this letter useless and not interesting.
Respectfully,
Dr. Stutaluk Vladimir
Contraceptive access is vital to safe motherhood, healthy families, and prosperous communities. Greater access to contraceptives enables couples and individuals to determine whether, when, and how often to have children. In low- and middle-income countries (LMIC) around the world, health systems are often unable to accurately predict the quantity of contraceptives necessary for each health service delivery site, in part due to insufficient data, limited staff capacity, and inadequate systems.
With this competition, USAID seeks to identify and test more accurate methods of predicting future contraceptive use at health service delivery sites. Our goal is to ensure appropriate stocking of contraceptives and family planning supplies and to better understand the benefits of intelligent forecasting models for improving contraceptive availability and supply chain efficiency.
There are two overlapping phases. First, a Forecasting Prize Competition to develop an intelligent forecasting model to predict the consumption of contraceptives over three months. Second, a Field Implementation Grant to customize and test a high-performing intelligent forecasting model in Côte d’Ivoire. Competitors can apply for the prize, or for both the prize and the grant. To read the full evaluation criteria and learn more about aspects that apply to the Forecasting Prize and aspects that apply to the Field Implementation Grant, please visit https://competitions4dev.org/forecastingprize/.
The prizes are significant: US$20K for the winning model and up to US$200K for the in-country implementation. So everyone who missed out on the M5 competition prize, here’s your chance to try again and do some good in the process!
I was supposed to be attending the Forecasting for Social Good workshop last month, but it has now been postponed until July 2021. It would be great if the organizers and winners of this competition could present the results there.
The problem arose due to clashes in terminology being used in different fields. Sometimes two fields use the same terminology for different things, and sometimes two fields use different terminology for the same thing. Both situations are involved here.
In machine learning, “time series classification” has been very widely studied. A Google Scholar search brings up over 3000 articles on the topic since 2019. The problem here is to classify a collection of time series into distinct groups; in other words, the space of time series is mapped to the space of a categorical variable. Unfortunately, almost none of the researchers involved seem to be aware of the (much smaller) parallel statistical literature on functional data classification, which includes time series classification as a special case (where the data are functions of time). And the statisticians working on functional data classification also appear unaware of the extensive work done on this special case in the machine learning literature. Hence we have two fields using different terminology for essentially the same thing.
My colleagues wanted to extend this idea to using a whole time series to predict a numerical value. So it involves mapping the space of time series onto the space of real numbers. Naturally, they thought “time series regression” was a good name, as regression is like classification but with a real valued output rather than a categorical output. However, “time series regression” is widely used in statistics and econometrics to mean modelling one value of a time series given the past and present values of other time series. So here is a case of two fields using the same terminology for very different things.
In any case, the problem has received considerable coverage in the statistics literature but under another name. There it is part of “functional data analysis” and involves predicting a scalar output from a functional input. This is called “scalar-on-function” regression; see, for example, this 2016 review article by Reiss et al.
It would also be possible to think of “time series classification” to mean a model for a categorical time series, where each element of the series is a category rather than a number. This would be the natural analogue of “time series regression” as used by statisticians and econometricians. However, this problem tends to be called “categorical time series analysis” instead.
To summarise, here is a table of the terminologies being used for the different problems.
Output/Response | Input/Predictor | Terminology | Field |
---|---|---|---|
Numerical | Function/time series | Scalar-on-function regression | Statistics |
Numerical | Function/time series | Time series regression (proposed) | Machine learning |
Categorical | Function/time series | Classification of functional data | Statistics |
Categorical | Function/time series | Time series classification | Machine learning |
Numerical element of time series | Past values of same and/or other time series | Time series regression/forecasting | Statistics |
Categorical element of time series | Past values of same and/or other time series | Categorical time series analysis | Statistics |
The latter two don’t seem to have their own terminology in machine learning, both being part of supervised learning in general, although specific methods such as LSTM have been developed for time series forecasting.
Clashing terminology arises whenever researchers don’t read outside their own discipline area. There are lots of other examples of clashes between statistics and machine learning, and between statistics and econometrics.
I’m sure readers can provide some additional examples.
The weekly mortality data recently published by the Human Mortality Database can be used to explore seasonality in mortality rates. Mortality rates are known to be seasonal due to temperatures and other weather-related effects (Healy 2003).
library(dplyr)
library(tidyr)
library(ggplot2)
library(tsibble)
library(feasts)
We will first grab the latest data, using similar code to what I used in my recent post on “excess deaths”. However, this time we will keep the mortality rates rather than the numbers of deaths.
stmf <- readr::read_csv("https://www.mortality.org/Public/STMF/Outputs/stmf.csv", skip=1)
mrates <- stmf %>%
  janitor::clean_names() %>%
  select(country_code:sex, r0_14:r_total) %>%
  pivot_longer(5:10,
    names_to = "age", values_to = "mxt",
    names_pattern = "[r_]*([a-z0-9_p]*)"
  ) %>%
  filter(age == "total", sex == "b") %>%
  mutate(
    country = recode(country_code,
      AUT = "Austria",
      BEL = "Belgium",
      DEUTNP = "Germany",
      DNK = "Denmark",
      ESP = "Spain",
      FIN = "Finland",
      GBRTENW = "England & Wales",
      ISL = "Iceland",
      NLD = "Netherlands",
      NOR = "Norway",
      PRT = "Portugal",
      SWE = "Sweden",
      USA = "United States"
    )
  ) %>%
  select(year, week, country, mxt)
First let’s plot the mortality rate against the week of the year for two countries with interesting data features.
mrates %>%
  filter(country == "England & Wales") %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(x = week, y = mxt, group = year)) +
  geom_line(aes(col = year))
Here we see an annual seasonal pattern, with higher rates in winter, and also a few sudden dips in mortality rates. The latter are almost certainly due to recording discrepancies, where deaths are not recorded until the following week. Note that the dips are generally followed by a higher than usual mortality rate in the following week. Those between weeks 12 and 17 are probably due to Easter; bank holiday effects are seen in weeks 18-19, 22-23 and 35-36 (depending on which week the holiday falls in); the Christmas effect is seen in week 52.
The second week of the year always has increased mortality. This is a reporting issue: delayed deaths from the previous week(s) are included in the statistics for the second week of the year.
Other than the obvious pandemic effect in 2020, this graph also shows increased mortality rates at the start of 2015 and 2018, and in the first half of March 2018. These are probably due to flu epidemics.
A similar plot for Spain shows a jump in mortality from weeks 31–34 (August) in 2003. This was due to an extreme heat wave (see Robine et al, 2008).
mrates %>%
  filter(country == "Spain") %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(x = week, y = mxt, group = year)) +
  geom_line(aes(col = year))
We can compare the seasonal patterns from all countries by using an STL decomposition to estimate the seasonality. There will be some differences because some countries provide weekly data by date of registration (instead of the date of occurrence). This is why, for example, we see sudden dips in England & Wales but do not see similar dips in Germany. Information about the type of data available is in the metadata.
First we have to convert the data to a tsibble object.
mrates <- mrates %>%
  mutate(date = yearweek(paste0(year, "W", week))) %>%
  as_tsibble(index = date, key = country)
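Before modelling, we can check which series have implicit gaps; the tsibble package makes this easy with count_gaps(). This check is an addition here, and assumes the `mrates` tsibble just created.

```r
# List the missing weeks for each country (an added check; assumes the
# `mrates` tsibble from above). Any gaps found will be filled before the
# STL decomposition.
library(tsibble)
gaps <- mrates %>% count_gaps()
gaps
```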
Now we estimate the seasonal components using an STL decomposition. The robust argument is used to prevent the unusual years affecting the results, and the seasonal window is set to periodic as we don’t expect the seasonal pattern to change over the last decade or two. STL decompositions are additive, but it would be more interpretable to look at the percentage increase in mortality rates across the year, so I will decompose the log rates and then compute the seasonal effect as a percentage increase (relative to the mean death rate) for each week of the year. There are a couple of missing values, so I will replace them with something small and the robust estimation should ignore them.
stl_season <- mrates %>%
  fill_gaps(mxt = 0.0001) %>%
  model(STL(log(mxt) ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  mutate(
    pc_increase = 100 * (exp(season_year) - 1),
    week = lubridate::week(date),
    year = lubridate::year(date)
  ) %>%
  filter(year == 2019) %>%
  as_tibble() %>%
  select(country, week, pc_increase)
The smoothed components can now be plotted.
stl_season %>%
  ggplot(aes(x = week, y = pc_increase, group = country)) +
  geom_smooth(aes(col = country), span = 0.4, se = FALSE, size = 0.5)
Most countries are very similar, probably due to them all being in the northern hemisphere and all having well-developed health services. The two that stand out as different from the rest are Portugal (in purple) and Iceland (in green). Let’s just plot these two with confidence intervals, along with Spain for comparison.
stl_season %>%
  filter(country %in% c("Iceland", "Portugal", "Spain")) %>%
  ggplot(aes(x = week, y = pc_increase, group = country)) +
  geom_smooth(aes(col = country), span = 0.4, size = 0.5) +
  ggthemes::scale_color_colorblind()
The Icelandic rates show less seasonality than other countries, but with a dip in weeks 25-35 (mid-June to the end of August). The Iceland mortality pattern is possibly due to the weather, with a very short summer and cold conditions for the rest of the year.
The Portuguese mortality rates are much higher in winter than other countries, and much lower in summer. This is strange, as Portugal has very similar weather to Spain but very different mortality rates. Some Twitter discussion suggested that part of the January increase could be delayed reporting, as few deaths are reported in the last week of the year. Healy (2003) suggested it was due to poor insulation in Portuguese houses, along with relatively high income poverty and inequality, and relatively low public health expenditure, compared to other European countries. However, it is not clear that this is still true 17 years after he wrote that article.