This is a Quarto template that assists you in creating a letter on Monash University letterhead.
This is a Quarto template that assists you in creating a memo, with optional Monash University branding.
This is a Quarto template that assists you in creating a working paper for the Department of Econometrics & Business Statistics, Monash University.
This is a Quarto template that assists you in creating a Monash University report.
This is a Quarto template that assists you in creating a Monash University thesis.
Either fork or download the repository to get started.
These are all based on my Rmarkdown templates, which are distributed via the monash R package.
Australia has a problem with government data. Actually it has three problems with government data:
1. It is often kept secret.
2. If it is available, it is often out-of-date.
3. If it is available and timely, it is often in a form that makes any analysis difficult.
I think it would be better for the country if government data was available freely, immediately, and in a form that is useful for analysis. Of course, we should make an exception if there are privacy issues, or some other harm that would be caused by releasing it. Let me explain using some examples.
Take mortality data. During the pandemic, it has been important to know how many people died of any cause, so we could know the effect of the pandemic overall. Obviously some people were dying of COVID-19, but others might have been dying because they were unable to get treated when medical staff were overwhelmed by COVID patients. On the other hand, lockdowns may have reduced deaths due to road crashes, but perhaps they also affected deaths due to suicide. If we could compare the total deaths each week during the pandemic, with the corresponding totals in previous years, we could determine the overall effect of the pandemic on Australian mortality.
You would think that, during a global pandemic, having good mortality data would be important. But in June 2020, nearly six months after the start of COVID-19, the most recent available mortality data in Australia was from 2018. Eighteen months out of date! Think about that. For the first six months of the biggest public health event in 100 years, we had no official data on the effect of COVID-19 on Australian mortality. Eventually the Australian Bureau of Statistics got their act together and started producing provisional mortality data more frequently, but only after several of us complained loudly and publicly. Even now, the provisional mortality data available from the ABS is more than three months out of date. Contrast that with other countries: of the 38 countries for which I could find mortality data, Australia was the fifth worst for producing timely mortality data.
Another example concerns COVID-19 case numbers. There is still no reliable Australian government repository of daily COVID-19 cases by state. Some states are now producing historical data, but for most of 2020, when we really needed reliable information, the public information was incomplete. For much of the first two years of the pandemic, the state health departments were putting out their little dashboard images containing the numbers, but these were preliminary figures that did not include late-registered cases or other data revisions. To do any serious analysis, you needed daily case numbers from the beginning of the pandemic, but these were not available on government websites until relatively recently. Some media organizations, and some individuals, were collating the case numbers from the dashboard images and publishing them online as spreadsheets, and people were using these to do analysis, but the data were often inaccurate and subject to revision. The state health departments generally didn’t update the initial numbers that were released, even though they had more reliable information. So the public data was inaccurate, and most people wanting to do any data analysis were relying on media outlets, or a few 14-year-old boys running https://covidlive.com.au, to get even that.
For nearly three years, I have been part of the forecasting team appointed to provide advice to all of the Chief Health Officers of the states and territories of Australia. Every week, we produce forecasts of COVID daily case numbers for all states and territories. For that purpose, we were able to put together a relatively good data set of case numbers for all states, but we were explicitly forbidden to make the data publicly available, even though our data was more accurate than what was appearing in the media.
Similarly, our forecasts were kept secret even though they were being used to make policy decisions. Premiers would justify their policies by vaguely referring to “the modelling”, or occasionally “the Doherty modelling” (even though most of us are not at the Doherty Institute), but we would have preferred to have our forecasts publicly available. So the good data and the forecasts are kept secret, and what is available is of poorer quality, or out-of-date.
Why? There are no privacy issues here. No harm would be done by working more transparently. On the contrary, if everyone had access to the best available data, then the independent modelling that was being done would have been of a higher quality.
We use a forecasting ensemble, where we have several forecasting models, and we combine them to produce the final forecasts that are submitted to the various state governments each week. Because we can’t share the data, the only forecasts that are included are those from members of our team. Generally in forecasting, it is better to use a wide range of models, not rely on a select few. But we can’t do that in Australia because of government obsession with secrecy.
Compare that to the United States, where an official repository of data was set up early in the pandemic; anyone could download it, produce forecasts, and submit those forecasts to the Centers for Disease Control and Prevention for inclusion in their analysis. Therefore, the US forecasting ensemble that was being used for policy decisions was based on a much larger range of models, and anyone could contribute to it. The resulting forecasts are then published publicly, so anyone can see what is being forecast, and what information the government has available when making policy decisions.
I’ve focused on COVID, but similar problems arise in many other areas in Australia. We have a culture of secrecy around data that is damaging to our public discourse, it leads to worse analysis, it means less transparency in government, and it feeds distrust of government because it is not clear why decisions are being made. Making more data publicly available leads to a better society.
We assume that the residuals from the method are uncorrelated and homoscedastic, with mean 0 and variance $\sigma^2$. Let $y_1,\dots,y_T$ denote the time series observations, and let $\hat{y}_{T+h|T}$ be the estimated forecast mean (or point forecast) of $y_{T+h}$. Then we can write $y_{T+h} = \hat{y}_{T+h|T} + e_{T+h}$, where $\{e_t\}$ is a white noise process. Let $\sigma_h^2$ be the estimated $h$-step forecast variance.
For a random walk, Equation 1 suggests that the appropriate model is $y_t = y_{t-1} + e_t$. Therefore $\hat{y}_{T+h|T} = y_T$. Consequently $\sigma_h^2 = h\sigma^2$.
Here the model is $y_t = y_{t-m} + e_t$, where $m$ is the seasonal period. Thus $\hat{y}_{T+h|T} = y_{T+h-m(k+1)}$, where $k$ is the integer part of $(h-1)/m$ (i.e., the number of complete years in the forecast period prior to time $T+h$). Therefore $\sigma_h^2 = (k+1)\sigma^2$.
The model underpinning the mean method is $y_t = c + e_t$ for some constant $c$ to be estimated. The least-squares estimate of $c$ is the mean, $\hat{c} = \bar{y} = \frac{1}{T}\sum_{t=1}^T y_t$. Thus, $\hat{y}_{T+h|T} = \bar{y}$. Therefore $\sigma_h^2 = \sigma^2\left(1 + \frac{1}{T}\right)$.
For a random walk with drift, $y_t = y_{t-1} + c + e_t$. Therefore $\hat{y}_{T+h|T} = y_T + h\hat{c}$. Now the least squares estimate of $c$ is $\hat{c} = (y_T - y_1)/(T-1)$. Therefore $\sigma_h^2 = \sigma^2\left(h + \frac{h^2}{T-1}\right)$.
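These variance results can be checked exactly. Each forecast error above is a linear combination of white-noise terms, so its variance is $\sigma^2$ times the sum of squared weights. Here is a short sketch (Python with numpy, purely my illustration) verifying the mean and drift formulas this way:

```python
import numpy as np

# Drift method: forecast error = sum_{j=1}^{h} e_{T+j} - h*(c_hat - c),
# where c_hat - c = (1/(T-1)) * sum_{t=2}^{T} e_t.
# So the error is a weighted sum of e_2,...,e_T and e_{T+1},...,e_{T+h}.
T, h = 50, 6
weights = np.concatenate([
    np.full(T - 1, -h / (T - 1)),  # weights on e_2, ..., e_T (via estimated drift)
    np.ones(h),                    # weights on e_{T+1}, ..., e_{T+h}
])
var_factor = np.sum(weights**2)
# Matches sigma_h^2 / sigma^2 = h + h^2/(T-1)
assert np.isclose(var_factor, h + h**2 / (T - 1))

# Mean method: error = e_{T+h} - (1/T) * sum_{t=1}^{T} e_t,
# giving sigma_h^2 = sigma^2 * (1 + 1/T).
w_mean = np.concatenate([np.full(T, -1 / T), [1.0]])
assert np.isclose(np.sum(w_mean**2), 1 + 1 / T)
```

The same weight-vector argument reproduces the random walk ($h$ unit weights, giving $h\sigma^2$) and seasonal naive results.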
The best Python implementations of my time series methods are available from Nixtla. Here are some of their packages related to my work, all compatible with scikit-learn.
They have also produced a lot of other great time series tools that are fast (optimized using numba) and perform well compared to various alternatives.
GluonTS from Amazon is excellent and provides lots of probabilistic time series forecasting models, with wrappers to some of my R code, and to statsforecast from Nixtla. The other models in GluonTS are also well worth exploring.
Merlion from Salesforce is another interesting Python library which includes both my automatic ARIMA and automatic ETS algorithms, along with other forecasting methods. It also has some anomaly detection methods for time series.
The first attempt to port my auto.arima() function to Python was pmdarima. This is also behind the AutoARIMA() function in sktime.
sktime has the most complete set of time series methods for Python. These are also compatible with scikit-learn.
Recently, Kate Buchhorn has ported some of my anomaly detection algorithms to Python and made them available in sktime including:
The statsmodels collection includes a few functions based on my work:
Bohan Zhang has produced pyhts, a re-implementation of the hts package in Python, based on Hyndman et al. (2011), Hyndman et al. (2016) and Wickramasuriya et al. (2019).
Darts is a Python library for wrangling and forecasting time series. It includes wrappers for ETS and ARIMA models from statsforecast and pmdarima, as well as an implementation of TBATS and some reconciliation functionality.
Recently I spent a few weeks visiting Professor Tommaso Di Fonzo at the University of Padova (Italy), and one of the things we discussed was finding a notation we were both happy with so we could be more consistent in our future papers.
This is what we came up with. Hopefully others will agree and use it too!
For readers new to forecast reconciliation, Chapter 11 of FPP3 provides an introduction.
We observe $n$ time series at time $t$, written as $\mathbf{y}_t$. The base forecasts of $\mathbf{y}_{T+h}$ given data $\mathbf{y}_1,\dots,\mathbf{y}_T$ are denoted by $\hat{\mathbf{y}}_h$.
This was the original formulation of the problem due to Hyndman et al. (2011), but presented here in our new notation.
Let $\mathbf{b}_t$ be a vector of $n_b$ “bottom-level” time series at time $t$, and let $\mathbf{a}_t$ be a corresponding vector of $n_a = n - n_b$ aggregated time series, where $\mathbf{a}_t = \mathbf{A}\mathbf{b}_t$ and $\mathbf{A}$ is the $n_a\times n_b$ “aggregation” matrix specifying how the bottom-level series are to be aggregated to form $\mathbf{a}_t$. The full vector of time series is given by $\mathbf{y}_t = \begin{bmatrix}\mathbf{a}_t \\ \mathbf{b}_t\end{bmatrix}$. This leads to the “summing” or “structural” matrix given by $\mathbf{S} = \begin{bmatrix}\mathbf{A} \\ \mathbf{I}_{n_b}\end{bmatrix}$ such that $\mathbf{y}_t = \mathbf{S}\mathbf{b}_t$.
All bottom-up, middle-out, top-down and linear reconciliation methods can be written as $\tilde{\mathbf{y}}_h = \mathbf{S}\mathbf{G}\hat{\mathbf{y}}_h$ for different matrices $\mathbf{G}$.
Optimal reconciled forecasts are obtained with $\mathbf{G} = (\mathbf{S}'\mathbf{W}_h^{-1}\mathbf{S})^{-1}\mathbf{S}'\mathbf{W}_h^{-1}$, or $\tilde{\mathbf{y}}_h = \mathbf{M}\hat{\mathbf{y}}_h$, where the “mapping” matrix is given by
$$\mathbf{M} = \mathbf{S}(\mathbf{S}'\mathbf{W}_h^{-1}\mathbf{S})^{-1}\mathbf{S}'\mathbf{W}_h^{-1}, \tag{1}$$
$\hat{\mathbf{y}}_h$ are the $h$-step forecasts of $\mathbf{y}_{T+h}$ given data to time $T$, and $\mathbf{W}_h$ is an $n\times n$ positive definite matrix. Different choices for $\mathbf{W}_h$ lead to different solutions such as OLS, WLS and MinT (Wickramasuriya, Athanasopoulos, and Hyndman 2019).
There is actually no reason for $\mathbf{a}_t$ to be restricted to aggregates of $\mathbf{b}_t$. They can include any linear combination of the bottom-level series $\mathbf{b}_t$, so the corresponding $\mathbf{A}$ and $\mathbf{S}$ matrices may contain any real values, not just 0s and 1s. Nevertheless, we will use the same notation for this more general setting.
This representation is more efficient and was used by Di Fonzo and Girolimetto (2021). It was also discussed in Wickramasuriya, Athanasopoulos, and Hyndman (2019). Here it is in the new notation.
We can express the structural representation using the constraint matrix $\mathbf{C} = \begin{bmatrix}\mathbf{I}_{n_a} & -\mathbf{A}\end{bmatrix}$ so that $\mathbf{C}\mathbf{y}_t = \mathbf{0}$. Then we can write the mapping matrix as
$$\mathbf{M} = \mathbf{I}_n - \mathbf{W}_h\mathbf{C}'(\mathbf{C}\mathbf{W}_h\mathbf{C}')^{-1}\mathbf{C}. \tag{2}$$
Note that Equation 2 involves inverting an $n_a\times n_a$ matrix, rather than the $n_b\times n_b$ matrix in Equation 1. For most practical problems, $n_a < n_b$, so Equation 2 is more efficient.
This form of the mapping matrix also allows us to interpret the reconciliation as an additive adjustment to the base forecasts: $\tilde{\mathbf{y}}_h = \hat{\mathbf{y}}_h - \mathbf{W}_h\mathbf{C}'(\mathbf{C}\mathbf{W}_h\mathbf{C}')^{-1}\mathbf{C}\hat{\mathbf{y}}_h$. If the base forecasts are already reconciled, then $\mathbf{C}\hat{\mathbf{y}}_h = \mathbf{0}$ and so $\tilde{\mathbf{y}}_h = \hat{\mathbf{y}}_h$.
The most general way to express the problem is not to denote individual series as bottom-level or aggregated, but to define the linear constraints $\mathbf{C}\mathbf{y}_t = \mathbf{0}$, where $\mathbf{C}$ is an $n_a\times n$ matrix, not necessarily full rank, which may contain any real values.
If $\mathbf{C}$ is full rank, then Equation 2 holds with this more general constraint matrix.
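The equivalence of the two forms of the mapping matrix can be checked numerically. Here is a small sketch (Python/numpy; the toy hierarchy and the choice of $\mathbf{W}_h$ are made up for illustration) verifying that Equations 1 and 2 give the same $\mathbf{M}$, and that the reconciled forecasts are coherent:

```python
import numpy as np

# Toy hierarchy: one total aggregating three bottom-level series (illustrative).
A = np.array([[1.0, 1.0, 1.0]])          # n_a x n_b aggregation matrix
n_a, n_b = A.shape
n = n_a + n_b
S = np.vstack([A, np.eye(n_b)])          # structural matrix: y_t = S b_t
C = np.hstack([np.eye(n_a), -A])         # constraint matrix: C y_t = 0
W = np.diag([4.0, 1.0, 2.0, 1.5])        # any positive definite W_h (made up)

Winv = np.linalg.inv(W)
# Equation 1: M = S (S' W^-1 S)^-1 S' W^-1  (inverts an n_b x n_b matrix)
M1 = S @ np.linalg.inv(S.T @ Winv @ S) @ S.T @ Winv
# Equation 2: M = I - W C' (C W C')^-1 C   (inverts an n_a x n_a matrix)
M2 = np.eye(n) - W @ C.T @ np.linalg.inv(C @ W @ C.T) @ C

assert np.allclose(M1, M2)               # the two forms agree

yhat = np.array([10.0, 2.0, 3.0, 4.0])   # incoherent base forecasts (10 != 2+3+4)
ytilde = M1 @ yhat
assert np.isclose(ytilde[0], ytilde[1:].sum())  # reconciled forecasts are coherent
```

Both expressions are the same oblique projection onto the range of $\mathbf{S}$, so the agreement holds for any positive definite $\mathbf{W}_h$, not just the diagonal example here.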
Temporal reconciliation was proposed by Athanasopoulos et al. (2017). Here it is in our new notation.
For simplicity we will assume the original (scalar) time series is observed with a single seasonality of period $m$ (e.g., $m = 12$ for monthly data), and the total length $T$ of the series is an integer multiple of $m$. We will denote the original series by $\{y_t\}$, and the various temporally aggregated series by $\{x_j^{(k)}\}$.
Let $k_1 < k_2 < \dots < k_p$ denote the factors of $m$ in ascending order, where $k_1 = 1$ and $k_p = m$. For each factor $k$ of $m$, we can construct a temporally aggregated series $x_j^{(k)} = \sum_{t=(j-1)k+1}^{jk} y_t$, for $j = 1,\dots,T/k$. Of course, $x_j^{(1)} = y_j$.
Since the observation index $j$ varies with each aggregation level, we define $\tau = 1,\dots,T/m$ as the observation index of the most aggregated level (e.g., annual), so that $j = \tau$ at that level.
For each aggregation level, we stack the observations in the column vectors $\mathbf{x}_\tau^{(k)} = \left[x_{(\tau-1)M_k+1}^{(k)},\dots,x_{\tau M_k}^{(k)}\right]'$, where $M_k = m/k$ is the number of level-$k$ observations per most-aggregated period, so that $\mathbf{x}_\tau^{(m)} = x_\tau^{(m)}$ and $\mathbf{x}_\tau^{(1)}$ contains $m$ of the original observations. Collecting these in one column vector, we obtain $\mathbf{x}_\tau = \left[\mathbf{x}_\tau^{(k_p)\prime},\mathbf{x}_\tau^{(k_{p-1})\prime},\dots,\mathbf{x}_\tau^{(k_1)\prime}\right]'$.
The structural representation of this formulation is $\mathbf{x}_\tau = \mathbf{S}\mathbf{x}_\tau^{(1)}$, where $\mathbf{S} = \begin{bmatrix}\mathbf{A} \\ \mathbf{I}_m\end{bmatrix}$ and $\mathbf{A}$ is the matrix of 0s and 1s that forms each temporal aggregate from the $m$ most disaggregated observations.
The zero-constrained representation is $\mathbf{C}\mathbf{x}_\tau = \mathbf{0}$, where $\mathbf{C} = \begin{bmatrix}\mathbf{I} & -\mathbf{A}\end{bmatrix}$.
If there are multiple seasonalities that are not integer multiples of each other, the resulting additional temporal aggregations can simply be stacked in $\mathbf{x}_\tau$, and $\mathbf{S}$ can be extended accordingly.
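As a concrete illustration of the temporal aggregates $x_j^{(k)}$ defined above, here is a short sketch (Python/numpy, with made-up data) for a monthly series, checking that the aggregation levels are mutually consistent:

```python
import numpy as np

# Monthly series (m = 12); the factors of 12 are k = 1, 2, 3, 4, 6, 12,
# giving bi-monthly, quarterly, four-monthly, half-yearly and annual aggregates.
m = 12
rng = np.random.default_rng(1)
T = 5 * m                                   # five complete years (made-up data)
y = rng.poisson(100, size=T).astype(float)

def temporal_aggregate(y, k):
    """x_j^(k): sums of y over consecutive non-overlapping blocks of length k."""
    return y.reshape(-1, k).sum(axis=1)

aggregates = {k: temporal_aggregate(y, k) for k in (1, 2, 3, 4, 6, 12)}

assert np.allclose(aggregates[1], y)        # k = 1 recovers the original series
assert len(aggregates[12]) == T // 12       # one value per year
# Consistency: summing the four quarters of each year gives the annual value
assert np.allclose(aggregates[3].reshape(-1, 4).sum(axis=1), aggregates[12])
```

Stacking these aggregates (most aggregated first) for each year $\tau$ gives the vector $\mathbf{x}_\tau$ described above.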
Now consider the case where we have both cross-sectional and temporal aggregations, as discussed in Di Fonzo and Girolimetto (2021).
Suppose we have $\{\mathbf{y}_t\}$ observed at the most temporally disaggregated level, including all the cross-sectionally disaggregated and aggregated (or constrained) series. Let $y_{i,t}$ be the $i$th element of the vector $\mathbf{y}_t$, $i = 1,\dots,n$. For each $i$, we can expand $\{y_{i,t}\}$ to include all the temporally aggregated variants, giving the vector $\mathbf{x}_{i,\tau}$ defined as in the temporal case above. These can then be stacked into a long vector: $\mathbf{x}_\tau = \left[\mathbf{x}_{1,\tau}',\dots,\mathbf{x}_{n,\tau}'\right]'$.
If $\mathbf{S}_{\text{cs}}$ denotes the structural matrix for the cross-sectional reconciliation, and $\mathbf{S}_{\text{te}}$ denotes the structural matrix for the temporal reconciliation, then the cross-temporal structural matrix is $\mathbf{S}_{\text{ct}} = \mathbf{S}_{\text{cs}} \otimes \mathbf{S}_{\text{te}}$, so that $\mathbf{x}_\tau = \mathbf{S}_{\text{ct}}\mathbf{b}_\tau$, where the bottom-level series $\mathbf{b}_\tau$ contains the cross-sectionally and temporally most disaggregated observations.
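The Kronecker structure can be illustrated with a tiny example (Python/numpy; the two-series hierarchy, period $m = 2$, and the stacking order of the bottom-level vector are my assumptions for illustration):

```python
import numpy as np

# Cross-sectional side: a total aggregating two bottom-level series.
A_cs = np.array([[1.0, 1.0]])            # total = series 1 + series 2
S_cs = np.vstack([A_cs, np.eye(2)])      # cross-sectional structural matrix (3 x 2)

# Temporal side: period m = 2, so the only non-trivial aggregate is annual.
A_te = np.array([[1.0, 1.0]])            # annual = sum of the two sub-periods
S_te = np.vstack([A_te, np.eye(2)])      # temporal structural matrix (3 x 2)

# Cross-temporal structural matrix as a Kronecker product (9 x 4).
S_ct = np.kron(S_cs, S_te)

# Bottom level: the two sub-annual observations of each bottom series,
# stacked series by series: [y1_t1, y1_t2, y2_t1, y2_t2].
b = np.array([3.0, 4.0, 5.0, 6.0])
x = S_ct @ b

# The first element is the fully aggregated value: total series, annual total.
assert np.isclose(x[0], b.sum())
assert x.shape == (6,) or x.shape == (9,)  # 9 rows: 3 cross-sectional x 3 temporal
```

Every cross-sectional aggregate at every temporal aggregation level appears as one row of $\mathbf{S}_{\text{ct}}$, which is what makes the Kronecker form so compact.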
The focus this year is on communicating with data. As with all WOMBAT events, the purpose is to bring together analysts from academia, industry and government to learn and discuss new open source tools for business analytics and data science.
The first day will be virtual with 8 tutorials to choose from. I will be giving one on “Exploratory time series analysis using R”. Each tutorial has limited places, so register early!
The second day will be an in-person workshop, limited to 60 people. The keynote speaker is Amanda Cox, Head of Special Data Projects at USAFacts. She is well-known for the sixteen years she spent at The New York Times producing some amazing data visualizations. She will speak on “Charts and Words: Being more influential with your data graphics”.
Other invited speakers will talk about data communication in environment, health and sport.
The workshop on December 7 will be held at the Royal South Yarra Lawn Tennis Club, located near the Yarra River, at 310 Williams Rd N, Toorak.
The 6 Dec online tutorials are each limited to 20 participants. Register for tutorials only here. Registering for the 7 Dec workshop provides a 30% discount on tutorial registration. Your discount code will be sent in the confirmation email after you have first registered for 7 Dec.
The 7 Dec in-person event is limited to 60 attendees. Register here. Registration includes lunch, morning and afternoon tea.
For more details, see the event website.
I’ve been using Disqus for more than 13 years, largely because it was the only available solution at the time I added comments. To make the Disqus interface a little cleaner, I disabled all the advertising and as much of the other noise as possible, but it still looked like something from MySpace (for those of you who remember the 20th century).
But now there are several alternatives, and I’ve opted for giscus which is very lightweight, is built on Github Discussions, and is open source with no tracking or advertising. The other system I considered was utterances which is also hosted on Github, but uses issues rather than discussions. Consequently, comments on utterances can’t be threaded (with replies to previous comments). Also, giscus appears to have a much more active development team behind it.
The first step was to set up giscus on my blog. With quarto, this simply requires adding a few lines to the _metadata.yml
file in the relevant folder. Here is what it looks like for me:
comments:
  giscus:
    repo: robjhyndman/robjhyndman.com
    repo-id: "R_kgDOH5G3Uw"
    category: "Announcements"
    category-id: "DIC_kwDOH5G3U84CRUp9"
    mapping: "pathname"
    reactions-enabled: true
    loading: lazy
    input-position: "bottom"
    theme: "light"
Then I needed to set up giscus on the GitHub repo that hosts the website (robjhyndman/robjhyndman.com). The instructions on the giscus website make it very simple.
The last step was the hardest: how to migrate 4000 comments from Disqus to giscus. Here I followed the nice blog post of Maëlle Salmon to download the Disqus comments as an XML file and wrangle them into a tibble. Then I needed to use the GraphQL API for GitHub Discussions to generate all the comments on the GitHub repo. Fortunately, Mitch O’Hara-Wild came to my rescue (as usual), and helped with some of this code. The resulting code is here if anyone wants to try to do the same. You will need to change some specific details in lines 9-13. Everything else should work as it is.
On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to analyse time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feasts (along with the tidyverse of course).
Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package. We will look at creating ensemble forecasts and hybrid forecasts, as well as some new forecasting methods that have performed well in large-scale forecasting competitions. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related.
Places are limited, so sign up early if you’re interested.
For the blogdown site, I had to (painfully) hack my own Hugo theme to make it look the way I wanted. This one is pretty much straight out of the Quarto box other than some css styling, and some tweaking of Quarto templates. In case anyone wants to create something similar for themselves, I’ve set up a template version with just the bare minimum so you don’t need to wade through the extra folders I’ve kept to ensure existing links continue to work.
Actually, setting up a website in Quarto is extremely easy when following the online instructions. The hard part for me was the migration. There are about 800 pages that make up this site, and about 4000 comments on my blog. I didn’t want to break any existing links, so retaining the same structure was important.
I also decided to convert the commenting system from Disqus to giscus, which is built on Github Discussions. I’ll describe that conversion in a separate post in case anyone else wants to do something similar.
There are almost certainly things that are still broken, so please let me know in the comments below if you find anything that doesn’t work as it should.
The cricketdata package has been around for a few years on GitHub, and it has been on CRAN since February 2022. There are only four functions:
fetch_cricinfo(): Fetch team data on international cricket matches provided by ESPNCricinfo.
fetch_player_data(): Fetch individual player data on international cricket matches provided by ESPNCricinfo.
find_player_id(): Search for the player ID on ESPNCricinfo.
fetch_cricsheet(): Fetch ball-by-ball, match and player data from Cricsheet.
Jacquie Tran wrote the first version of the fetch_cricsheet() function, and the vignette which demonstrates it.
Here are some examples demonstrating the Cricinfo functions.
library(cricketdata)
library(tidyverse)
The fetch_cricinfo() function downloads data for international T20, ODI or Test matches, for men or women, and for batting, bowling or fielding. By default, it downloads career-level statistics for individual players. Here is an example for women T20 bowlers.
# Fetch all Women's T20 data
wt20 <- fetch_cricinfo("T20", "Women", "Bowling")
wt20 %>%
select(Player, Country, Matches, Runs, Wickets, Economy, StrikeRate)
#> # A tibble: 1,798 × 7
#> Player Country Matches Runs Wickets Economy StrikeRate
#> <chr> <chr> <int> <int> <int> <dbl> <dbl>
#> 1 A Mohammed West Indies 117 2206 125 5.58 19.0
#> 2 S Ismail South Africa 105 2153 115 5.81 19.3
#> 3 EA Perry Australia 126 2237 115 5.87 19.9
#> 4 KH Brunt England 104 2019 108 5.50 20.4
#> 5 M Schutt Australia 84 1685 108 6.05 15.5
#> 6 Nida Dar Pakistan 114 1951 106 5.35 20.6
#> 7 SFM Devine New Zealand 107 1822 104 6.36 16.5
#> 8 A Shrubsole England 79 1587 102 5.96 15.7
#> 9 Poonam Yadav India 72 1495 98 5.75 15.9
#> 10 SR Taylor West Indies 111 1639 98 5.66 17.7
#> # … with 1,788 more rows
We can plot a bowler’s strike rate (balls per wicket) against their average (runs per wicket). Each observation represents one player who has taken at least 50 international wickets.
wt20 %>%
filter(Wickets >= 50) %>%
ggplot(aes(y = StrikeRate, x = Average)) +
geom_point(alpha = 0.3, col = "blue") +
ggtitle("Women International T20 Bowlers") +
ylab("Balls per wicket") + xlab("Runs per wicket")
The extraordinary result on the bottom left is due to the Thai all-rounder, Nattaya Boochatham, who has taken 59 wickets, with a strike rate of 13.475, an average of 8.78, and an economy rate of 3.909.
The next example shows Australian men’s ODI batting results by innings.
# Fetch all Australian Men's ODI data by innings
menODI <- fetch_cricinfo("ODI", "Men", "Batting", type = "innings", country = "Australia")
menODI %>%
select(Date, Player, Runs, StrikeRate, NotOut)
#> # A tibble: 10,675 × 5
#> Date Player Runs StrikeRate NotOut
#> <date> <chr> <int> <dbl> <lgl>
#> 1 2011-04-11 SR Watson 185 193. TRUE
#> 2 2007-02-20 ML Hayden 181 109. TRUE
#> 3 2017-01-26 DA Warner 179 140. FALSE
#> 4 2015-03-04 DA Warner 178 134. FALSE
#> 5 2001-02-09 ME Waugh 173 117. FALSE
#> 6 2016-10-12 DA Warner 173 127. FALSE
#> 7 2004-01-16 AC Gilchrist 172 137. FALSE
#> 8 2019-06-20 DA Warner 166 113. FALSE
#> 9 2006-03-12 RT Ponting 164 156. FALSE
#> 10 2016-12-04 SPD Smith 164 104. FALSE
#> # … with 10,665 more rows
menODI %>%
ggplot(aes(y = Runs, x = Date)) +
geom_point(alpha = 0.2, col = "#D55E00") +
geom_smooth() +
ggtitle("Australia Men ODI: Runs per Innings")
The average number of runs per innings slowly increased until about 2000, after which it has remained largely constant at about 35.1. This is a little higher than the smooth line shown on the plot, which has not taken account of not-out results.
Next, we demonstrate some of the fielding data available, using Test match fielding from Indian men’s players.
Indfielding <- fetch_cricinfo("Test", "Men", "Fielding", country = "India")
Indfielding
#> # A tibble: 303 × 11
#> Player Start End Matches Innings Dismis…¹ Caught Caugh…² Caugh…³
#> <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 MS Dhoni 2005 2014 90 166 294 256 0 256
#> 2 R Dravid 1996 2012 163 299 209 209 209 0
#> 3 SMH Kirmani 1976 1986 88 151 198 160 0 160
#> 4 VVS Laxman 1996 2012 134 248 135 135 135 0
#> 5 KS More 1986 1993 49 90 130 110 0 110
#> 6 RR Pant 2018 2022 31 61 122 111 0 111
#> 7 SR Tendulkar 1989 2013 200 366 115 115 115 0
#> 8 SM Gavaskar 1971 1987 125 216 108 108 108 0
#> 9 NR Mongia 1994 2001 44 77 107 99 0 99
#> 10 M Azharuddin 1984 2000 99 177 105 105 105 0
#> # … with 293 more rows, 2 more variables: Stumped <int>,
#> # MaxDismissalsInnings <dbl>, and abbreviated variable names
#> # ¹Dismissals, ²CaughtFielder, ³CaughtBehind
We can plot the number of dismissals against the number of matches for all Indian male Test players. Because wicketkeepers typically have many more dismissals than other players, they are shown in a different colour.
Indfielding %>%
mutate(wktkeeper = (CaughtBehind > 0) | (Stumped > 0)) %>%
ggplot(aes(x = Matches, y = Dismissals, col = wktkeeper)) +
geom_point() +
ggtitle("Indian Men Test Fielding")
The high number of dismissals, close to 300, is of course due to MS Dhoni. Another interesting one here is the non-wicketkeeper with over 200 dismissals, which is Rahul Dravid who took 209 catches during his career.
Finally, let’s look at individual player data. The fetch_player_data() function requires the Cricinfo player ID, which you can either look up on their website, or find using the find_player_id() function. We will look at the ODI results of Australia’s captain, Meg Lanning.
meg_lanning_id <- find_player_id("Lanning")$ID
MegLanning <- fetch_player_data(meg_lanning_id, "ODI") %>%
mutate(NotOut = (Dismissal == "not out"))
MegLanning
#> # A tibble: 100 × 14
#> Date Innings Opposition Ground Runs Mins BF X4s X6s SR
#> <date> <int> <chr> <chr> <dbl> <dbl> <int> <int> <int> <dbl>
#> 1 2011-01-05 1 ENG Women Perth 20 60 38 2 0 52.6
#> 2 2011-01-07 2 ENG Women Perth 104 148 118 8 1 88.1
#> 3 2011-06-14 2 NZ Women Brisb… 11 15 14 2 0 78.6
#> 4 2011-06-16 1 NZ Women Brisb… 5 8 8 1 0 62.5
#> 5 2011-06-30 1 NZ Women Chest… 17 24 20 3 0 85
#> 6 2011-07-02 2 India Wom… Chest… 23 40 32 3 0 71.9
#> 7 2011-07-05 2 ENG Women Lord's 43 40 33 9 0 130.
#> 8 2011-07-07 2 ENG Women Worms… 0 2 3 0 0 0
#> 9 2012-03-12 1 India Wom… Ahmed… 45 61 44 7 0 102.
#> 10 2012-03-14 1 India Wom… Wankh… 128 125 104 19 1 123.
#> # … with 90 more rows, and 4 more variables: Pos <int>, Dismissal <chr>,
#> # Inns <int>, NotOut <lgl>
We can plot her runs per innings on the vertical axis over time on the horizontal axis.
# Compute batting average
MLave <- MegLanning %>%
filter(!is.na(Runs)) %>%
summarise(Average = sum(Runs) / (n() - sum(NotOut))) %>%
pull(Average)
names(MLave) <- paste("Average =", round(MLave, 2))
# Plot ODI scores
ggplot(MegLanning) +
geom_hline(aes(yintercept = MLave), col="gray") +
geom_point(aes(x = Date, y = Runs, col = NotOut)) +
ggtitle("Meg Lanning ODI Scores") +
scale_y_continuous(sec.axis = sec_axis(~., breaks = MLave))
She has shown amazing consistency over her career, with centuries scored in every year of her career except for 2021, when her highest score from 6 matches was 53.
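As an aside, the batting-average convention used in the code above (total runs divided by the number of dismissals, where a not-out innings contributes runs but no dismissal) can be mirrored outside R with a tiny sketch (Python, toy numbers, not Lanning’s actual record):

```python
# Batting average = total runs / number of dismissals.
# A "not out" innings adds runs but no dismissal, so it inflates the average.
innings = [
    (20, False),    # (runs, not_out) -- toy scores for illustration only
    (104, False),
    (11, True),
    (45, False),
]
runs = sum(r for r, _ in innings)
dismissals = sum(1 for _, not_out in innings if not not_out)
average = runs / dismissals

assert runs == 180 and dismissals == 3
assert abs(average - 60.0) < 1e-9
```

This is the same calculation as `sum(Runs) / (n() - sum(NotOut))` in the R chunk above.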
Some of these data sets have been made available in R packages previously, based on ts objects, which worked OK for annual, quarterly and monthly data, but are not a good format for daily and sub-daily data.
The tsibbledata package provides the function monash_forecasting_repository() to download the data and return it as a tsibble object. These can be analysed and plotted using the feasts package, and modelled and forecast using the fable package. It is convenient to simply load the fpp3 package, which will then load all the necessary packages.
library(fpp3)
── Attaching packages ────────────────────────────────── fpp3 0.4.0.9000 ──
✔ tibble 3.1.8 ✔ tsibble 1.1.2
✔ dplyr 1.0.10 ✔ tsibbledata 0.4.1.9000
✔ tidyr 1.2.1 ✔ feasts 0.3.0.9000
✔ lubridate 1.8.0 ✔ fable 0.3.2.9000
✔ ggplot2 3.3.6 ✔ fabletools 0.3.2.9000
── Conflicts ──────────────────────────────────────────── fpp3_conflicts ──
✖ lubridate::date() masks base::date()
✖ dplyr::filter() masks stats::filter()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval() masks lubridate::interval()
✖ dplyr::lag() masks stats::lag()
✖ tsibble::setdiff() masks base::setdiff()
✖ tsibble::union() masks base::union()
To download the M3 data, we need to know the unique zenodo identifiers for each data set. From the forecastingdata.org page, find the M3 links (there are four, one for each observational frequency). For example, the Yearly link takes you to https://zenodo.org/record/4656222, so the Zenodo identifier for this data set is 4656222. Similarly, the Quarterly, Monthly and Other links have identifiers 4656262, 4656298 and 4656335 respectively.
m3_yearly <- monash_forecasting_repository(4656222)
m3_quarterly <- monash_forecasting_repository(4656262)
m3_monthly <- monash_forecasting_repository(4656298)
m3_other <- monash_forecasting_repository(4656335)
The first three data sets are stored with a date index, so they are read as daily data. Therefore we first need to convert them to yearly, quarterly and monthly data.
m3_yearly <- m3_yearly %>%
mutate(year = year(start_timestamp)) %>%
as_tsibble(index=year) %>%
select(-start_timestamp)
m3_quarterly <- m3_quarterly %>%
mutate(quarter = yearquarter(start_timestamp)) %>%
as_tsibble(index=quarter) %>%
select(-start_timestamp)
m3_monthly <- m3_monthly %>%
mutate(month = yearmonth(start_timestamp)) %>%
as_tsibble(index=month) %>%
select(-start_timestamp)
The resulting monthly data set is shown below.
m3_monthly
# A tsibble: 167,562 x 3 [1M]
# Key: series_name [1,428]
series_name value month
<chr> <dbl> <mth>
1 T1 2640 1990 Jan
2 T1 2640 1990 Feb
3 T1 2160 1990 Mar
4 T1 4200 1990 Apr
5 T1 3360 1990 May
6 T1 2400 1990 Jun
7 T1 3600 1990 Jul
8 T1 1920 1990 Aug
9 T1 4200 1990 Sep
10 T1 4560 1990 Oct
# … with 167,552 more rows
The series names are T1, T2, …. The M3 data included both training and test data; these have been combined in this data set.
This data set contains total half-hourly electricity demand by state from 1 January 2002 to 1 April 2015, for five states of Australia: New South Wales, Queensland, South Australia, Tasmania, and Victoria. A subset of this data (one state and only three years) is provided as tsibbledata::vic_elec.
aus_elec <- monash_forecasting_repository(4659727)
aus_elec
# A tsibble: 1,155,264 x 4 [30m] <UTC>
# Key: series_name, state [5]
series_name state start_timestamp value
<chr> <chr> <dttm> <dbl>
1 T1 NSW 2002-01-01 00:00:00 5714.
2 T1 NSW 2002-01-01 00:30:00 5360.
3 T1 NSW 2002-01-01 01:00:00 5015.
4 T1 NSW 2002-01-01 01:30:00 4603.
5 T1 NSW 2002-01-01 02:00:00 4285.
6 T1 NSW 2002-01-01 02:30:00 4075.
7 T1 NSW 2002-01-01 03:00:00 3943.
8 T1 NSW 2002-01-01 03:30:00 3884.
9 T1 NSW 2002-01-01 04:00:00 3878.
10 T1 NSW 2002-01-01 04:30:00 3838.
# … with 1,155,254 more rows
aus_elec %>%
filter(state=="VIC") %>%
autoplot(value) +
labs(x = "Time", y="Electricity demand (MWh)")
We also provide some accuracy measures of the performance of 13 baseline forecasting methods applied to the data sets in the repository. This makes it easy for anyone proposing a new method to compare against some standard existing methods, without having to do all the calculations themselves.
The data can be loaded as a Pandas dataframe by following this example in the GitHub repository. Download the .tsf files as required from Zenodo and put them into the tsf_data folder.
fable package, and the facilities will be added there. But in the meantime, if you are using the forecast package and want to simulate from a fitted TBATS model, here is how to do it.
Doing it efficiently would require a more complicated approach, but this is super easy if you are willing to sacrifice some speed. The trick is to realise that a simulation can be handled easily for almost any time series model using residuals and one-step forecasts. Note that a residual is given by $e_t = y_t - \hat{y}_{t|t-1}$, so we can write $y_t = \hat{y}_{t|t-1} + e_t$.
Therefore, given data to time $T$, we can simulate iteratively using $y_{T+i} = \hat{y}_{T+i|T+i-1} + e_{T+i}$, for $i = 1,2,\dots$, where $e_{T+i}$ is randomly generated from the error distribution, or bootstrapped by randomly sampling from past residuals. The value of $\hat{y}_{T+i|T+i-1}$ can be obtained by applying the model to the series $\{y_1,\dots,y_{T+i-1}\}$ (without re-estimating the parameters) and forecasting one step ahead. This is the same trick we use to get prediction intervals for neural network models.
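To see the trick independently of TBATS, here is a toy sketch (Python; the AR(1) model and its parameter are my own illustration, not part of the forecast package) that simulates future sample paths using only one-step forecasts plus bootstrapped residuals:

```python
import numpy as np

# Toy "fitted" model: AR(1) with known coefficient phi (assumed for illustration).
rng = np.random.default_rng(42)
phi = 0.7
y = [10.0]
for _ in range(200):                  # generate some "observed" data
    y.append(phi * y[-1] + rng.normal())
y = np.array(y)

# In-sample one-step residuals: e_t = y_t - yhat_{t|t-1}.
residuals = y[1:] - phi * y[:-1]

def simulate_future(y_last, h, rng):
    """Iterate: one-step forecast from the model, plus a bootstrapped residual."""
    path = []
    level = y_last
    for _ in range(h):
        e = rng.choice(residuals)     # bootstrap from past residuals
        level = phi * level + e       # y_{T+i} = yhat_{T+i|T+i-1} + e_{T+i}
        path.append(level)
    return np.array(path)

sim = simulate_future(y[-1], h=12, rng=rng)
assert sim.shape == (12,)
assert np.all(np.isfinite(sim))
```

The R code below does exactly this for TBATS, with the one-step forecast obtained by refitting the model (without re-estimating parameters) to the extended series at each step.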
Because simulate() is an S3 method in R, we have to make sure the corresponding simulate.tbats() function has all the necessary arguments to match other simulate methods. It’s also good to make it as close as possible to the other simulate functions in the forecast package, to make it easier for users when switching between them. The real work is done in the last few lines.
simulate.tbats <- function(object, nsim=length(object$y),
seed = NULL, future=TRUE,
bootstrap=FALSE, innov = NULL, ...) {
if (is.null(innov)) {
if (!exists(".Random.seed", envir = .GlobalEnv)) {
runif(1)
}
if (is.null(seed)) {
RNGstate <- .Random.seed
}
else {
R.seed <- .Random.seed
set.seed(seed)
RNGstate <- structure(seed, kind = as.list(RNGkind()))
on.exit(assign(".Random.seed", R.seed, envir = .GlobalEnv))
}
}
else {
nsim <- length(innov)
}
if (bootstrap) {
res <- residuals(object)
res <- na.omit(res - mean(res, na.rm = TRUE))
e <- sample(res, nsim, replace = TRUE)
}
else if (is.null(innov)) {
e <- rnorm(nsim, 0, sqrt(object$variance))
} else {
e <- innov
}
x <- getResponse(object)
y <- numeric(nsim)
if(future) {
dataplusy <- x
} else {
# Start somewhere in the original series
dataplusy <- ts(sample(x, 1), start=-1/frequency(x),
frequency = frequency(x))
}
fitplus <- object
for(i in seq_along(y)) {
y[i] <- forecast(fitplus, h=1)$mean + e[i]
dataplusy <- ts(c(dataplusy, y[i]),
start=start(dataplusy), frequency=frequency(dataplusy))
fitplus <- tbats(dataplusy, model=fitplus)
}
return(tail(dataplusy, nsim))
}
I’ve added this to the forecast
package for the next version.
Something similar could be written for any other forecasting function that doesn’t already have a simulate
method. Just swap the tbats
call to the relevant modelling function.
library(forecast)
library(ggplot2)
fit <- tbats(USAccDeaths)
p <- USAccDeaths %>% autoplot() +
labs(x = "Year", y = "US Accidental Deaths",
title = "TBATS simulations")
for (i in seq(9)) {
p <- p + autolayer(simulate(fit, nsim = 36), series = paste("Sim", i))
}
p
General online job sites such as Seek or CareerJet are OK, but job-seekers can find it hard to locate the relevant openings because job titles are so varied. In the general area of statistics, a job can appear under the titles “statistician”, “analyst”, “data miner”, “data manager”, “financial engineer” and a few dozen other labels. Many employers don’t place the job in the best category, often because they don’t understand what skills are required to do the job. Nevertheless, if I were looking for a job, I would certainly set up some automated searches on these sites.
In statistics, there are well-established job websites that are the best places for both employers and potential employees to meet up.
I do not know what is provided in other countries, but check with your national statistical association.
There are also e-mail lists and web forums that are widely subscribed and often contain job postings.
If I’ve missed any good places to advertise jobs, please add them in the comments.
The tsoutliers()
function in the forecast package for R is useful for identifying anomalies in a time series. However, it is not properly documented anywhere. This post is intended to fill that gap.
The function began as an answer on CrossValidated and was later added to the forecast package because I thought it might be useful to other people. It has since been updated and made more reliable.
The procedure decomposes the time series $y_t$ into trend, seasonal and remainder components: $$y_t = T_t + S_t + R_t.$$ The seasonal component is optional, and it may contain several seasonal patterns corresponding to the seasonal periods in the data. The idea is to first remove any seasonality and trend in the data, and then find outliers in the remainder series, $R_t$.
For data observed more frequently than annually, we use a robust approach to estimate $T_t$ and $S_t$ by first applying the MSTL method to the data. MSTL will iteratively estimate the seasonal component(s).
Then the strength of seasonality is measured using $$F_s = \max\left(0,\ 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_t + R_t)}\right).$$ If $F_s > 0.6$, a seasonally adjusted series is computed: $$y_t^* = y_t - \hat{S}_t.$$ A seasonal strength threshold is used here because the estimate of $\hat{S}_t$ is likely to be overfitted and very noisy if the underlying seasonality is too weak (or non-existent), potentially masking any outliers by having them absorbed into the seasonal component.
If $F_s \le 0.6$, or if the data are observed annually or less frequently, we simply set $y_t^* = y_t$.
Next, we re-estimate the trend component $T_t$ from the $y_t^*$ values. For non-seasonal time series such as annual data, this is necessary as we don’t have the trend estimate from the STL decomposition. But even if we have computed an STL decomposition, we may not have used it if $F_s \le 0.6$.
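To make the seasonal strength calculation concrete, here is a small sketch using base R’s stl() as a stand-in for the MSTL step (illustrative only, not the function’s actual code), applied to a clearly seasonal built-in series:

```r
# Illustrative sketch: base R's stl() stands in for the MSTL step.
fit <- stl(log(AirPassengers), s.window = "periodic")
seasonal  <- fit$time.series[, "seasonal"]
remainder <- fit$time.series[, "remainder"]
# Seasonal strength: F_s = max(0, 1 - Var(R_t) / Var(S_t + R_t))
Fs <- max(0, 1 - var(remainder) / var(seasonal + remainder))
Fs  # close to 1 for this strongly seasonal series
```

For a series this seasonal, the strength comfortably exceeds the threshold, so the seasonally adjusted series would be used.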
The trend component is estimated by applying Friedman’s super smoother (via supsmu()
) to the data. This function has been tested on lots of data and tends to work well on a wide range of problems.
We look for outliers in the estimated remainder series $\hat{R}_t = y_t^* - \hat{T}_t$. If $Q_1$ denotes the 25th percentile and $Q_3$ denotes the 75th percentile of the remainder values, then the interquartile range is defined as $\text{IQR} = Q_3 - Q_1$. Observations are labelled as outliers if they are less than $Q_1 - 3\times\text{IQR}$ or greater than $Q_3 + 3\times\text{IQR}$. This is the definition used by Tukey (1977, p44) in his original boxplot proposal for “far out” values.
If the remainder values are normally distributed, then the probability of an observation being identified as an outlier is approximately 1 in 427000.
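This figure is easy to check with base R (a quick verification, not part of the tsoutliers() code). For a normal distribution the quartiles sit at about $\pm 0.6745$ standard deviations, which puts the “far out” fences near $\pm 4.72$ standard deviations:

```r
# Tail probability of exceeding Tukey's "far out" fences under normality
q     <- qnorm(c(0.25, 0.75))   # quartiles of the standard normal
iqr   <- diff(q)                # about 1.349
fence <- q[2] + 3 * iqr         # upper fence, about 4.72 std. deviations
p <- 2 * pnorm(-fence)          # probability of falling outside either fence
round(1 / p)                    # about 427,000
```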
Any outliers identified in this manner are replaced with linearly interpolated values using the neighbouring observations, and the process is repeated.
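The replacement step can be sketched with base R’s approx() (a toy illustration, not the package internals):

```r
# Linear interpolation replacement for a single detected outlier
y   <- c(10, 11, 12, 100, 14, 15)  # position 4 is an obvious outlier
idx <- 4
y[idx] <- approx(seq_along(y)[-idx], y[-idx], xout = idx)$y
y  # the outlier is replaced by 13, midway between its neighbours
```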
The gold price data contains daily morning gold prices in US dollars from 1 January 1985 to 31 March 1989. The data was given to me by a client who wanted me to forecast the gold price. (I told him it would be almost impossible to beat a naive forecast). The data are shown below.
library(fpp2)
autoplot(gold)
There are periods of missing values, and one obvious outlier which is about $100 greater than what would be expected. This was simply a typo, with someone typing 593.70 rather than 493.70. Let’s see if the tsoutliers()
function can spot it.
tsoutliers(gold)
$index
[1] 770
$replacements
[1] 495
Sure enough, it is easily found and the suggested replacement (linearly interpolated) is close to the true value.
The tsclean()
function removes outliers identified in this way, and replaces them (and any missing values) with linearly interpolated replacements.
autoplot(tsclean(gold), series="clean", color='red', lwd=0.9) +
autolayer(gold, series="original", color='gray', lwd=1) +
geom_point(data = tsoutliers(gold) %>% as.data.frame(),
aes(x=index, y=replacements), col='blue') +
labs(x = "Day", y = "Gold price ($US)")
The blue dot shows the replacement for the outlier, the red lines show the replacements for the missing values.
Hi, I’m an MSc student and am shortly starting my project/dissertation on time series data. I’ve started reading Version 3 of your book and improving my R skills but am wondering if there’s any way I can read V3 that will allow annotation? Thanks
For personal annotation of websites, the Hypothesis extension is very useful. You can highlight, annotate and discuss with other readers. You will need to set up a (free) account at https://web.hypothes.is/start/
Thank you so much for putting out this book! … would it be possible to add OpenDyslexic (https://opendyslexic.org/) to your list of available typefaces on your website? I am attempting to make my way through your textbook, but access to this font would make my life immeasurably easier.
The simplest approach here is to install the OpenDyslexic Font extension. When installed, the fpp3 book looks like this:
The only issue is that the equations are not rendered properly by default. But these can be fixed. First, right click on an equation and choose Math Settings/Math Renderer/HTML-CSS
. Then right click again and choose Math Settings/Scale all math/50%
. You only need to do these steps once.
By the way, a print version of the third edition is now available.
Time series cross-validation is handled in the fable
package using the stretch_tsibble()
function to generate the data folds. In this post I will give two examples of how to use it, one without covariates and one with covariates.
Here is a simple example using quarterly Australian beer production from 1956 Q1 to 2010 Q2. First we create a data object containing many training sets starting with 3 years (12 observations), and adding one quarter at a time until all data are included.
library(fpp3)
beer <- aus_production %>%
select(Beer) %>%
stretch_tsibble(.init = 12, .step=1)
beer
# A tsibble: 23,805 x 3 [1Q]
# Key: .id [207]
Beer Quarter .id
<dbl> <qtr> <int>
1 284 1956 Q1 1
2 213 1956 Q2 1
3 227 1956 Q3 1
4 308 1956 Q4 1
5 262 1957 Q1 1
6 228 1957 Q2 1
7 236 1957 Q3 1
8 320 1957 Q4 1
9 272 1958 Q1 1
10 233 1958 Q2 1
# … with 23,795 more rows
This gives 207 training sets of increasing size. We fit an ETS model to each training set and produce one year of forecasts from each model. Because I want to compute RMSE for each forecast horizon, I will add the horizon h
to the resulting object.
fc <- beer %>%
model(ETS(Beer)) %>%
forecast(h = "1 year") %>%
group_by(.id) %>%
mutate(h = row_number()) %>%
ungroup() %>%
as_fable(response="Beer", distribution=Beer)
Finally, we compare the forecasts against the actual values and average over the folds.
fc %>%
accuracy(aus_production, by=c("h",".model")) %>%
select(h, RMSE)
# A tibble: 4 × 2
h RMSE
<int> <dbl>
1 1 17.1
2 2 16.7
3 3 18.1
4 4 19.2
Forecasts of 1 and 2 quarters ahead both have about the same accuracy here, but then the error increases for horizons 3 and 4.
Things are a little more complicated when we want to use covariates in the model. Here is an example of monthly quotations issued by a US insurance company modelled as a function of the TV advertising expenditure in the same month.
The first step is the same, where we stretch the tsibble. This time we will start with one year of data.
stretch <- insurance %>%
stretch_tsibble(.step=1, .init=12)
stretch
# A tsibble: 754 x 4 [1M]
# Key: .id [29]
Month Quotes TVadverts .id
<mth> <dbl> <dbl> <int>
1 2002 Jan 13.0 7.21 1
2 2002 Feb 15.4 9.44 1
3 2002 Mar 13.2 7.53 1
4 2002 Apr 13.0 7.21 1
5 2002 May 15.4 9.44 1
6 2002 Jun 11.7 6.42 1
7 2002 Jul 10.1 5.81 1
8 2002 Aug 10.8 6.20 1
9 2002 Sep 13.3 7.59 1
10 2002 Oct 14.6 8.00 1
# … with 744 more rows
Next we fit a regression model with AR(1) errors to each fold.
fit <- stretch %>%
model(ARIMA(Quotes ~ 1 + pdq(1,0,0) + TVadverts))
Before we forecast, we need to provide the advertising expenditure for the future periods. We will forecast up to 3 steps ahead, so the test data needs to have 3 observations per fold.
test <- new_data(stretch, n=3) %>%
# Add in covariates from corresponding month
left_join(insurance, by="Month")
test
# A tsibble: 87 x 4 [1M]
# Key: .id [29]
Month .id Quotes TVadverts
<mth> <int> <dbl> <dbl>
1 2003 Jan 1 17.0 9.53
2 2003 Feb 1 16.9 9.39
3 2003 Mar 1 16.5 8.92
4 2003 Feb 2 16.9 9.39
5 2003 Mar 2 16.5 8.92
6 2003 Apr 2 15.3 8.37
7 2003 Mar 3 16.5 8.92
8 2003 Apr 3 15.3 8.37
9 2003 May 3 15.9 9.84
10 2003 Apr 4 15.3 8.37
# … with 77 more rows
The actual value in each month is also included, but that will be ignored when forecasting.
fc <- forecast(fit, new_data = test) %>%
group_by(.id) %>%
mutate(h = row_number()) %>%
ungroup() %>%
as_fable(response = "Quotes", distribution=Quotes)
Finally, we can compare the forecasts against the actual values, averaged across each forecast horizon.
fc %>% accuracy(insurance, by=c("h",".model")) %>%
select(h, RMSE)
# A tibble: 3 × 2
h RMSE
<int> <dbl>
1 1 0.761
2 2 1.20
3 3 1.49
(Updated: 17 Nov 2021)
Date | Podcast | Episode |
---|---|---|
17 November 2021 | The Random Sample | Software as a first class research output |
24 May 2021 | Data Skeptic | Forecasting principles and practice |
12 April 2021 | Seriously Social | Forecasting the future: the science of prediction |
6 February 2021 | Forecasting Impact | Rob Hyndman |
19 July 2020 | The Curious Quant | Forecasting COVID, time series, and why causality doesn’t matter as much as you think |
27 May 2020 | The Random Sample | Forecasting the future & the future of forecasting |
9 October 2019 | Thought Capital | Forecasts are always wrong (but we need them anyway) |