All Hyndsight posts by date
Here’s an interesting new forecasting competition that came via my inbox this week.
Contraceptive access is vital to safe motherhood, healthy families, and prosperous communities. Greater access to contraceptives enables couples and individuals to determine whether, when, and how often to have children. In low- and middle-income countries (LMIC) around the world, health systems are often unable to accurately predict the quantity of contraceptives necessary for each health service delivery site, in part due to insufficient data, limited staff capacity, and inadequate systems.
I was reminded again this week that getting the right terminology is important. Some of my colleagues who work in machine learning wrote a paper entitled “Time series regression” which began with “This paper introduces Time Series Regression (TSR): a little-studied task …”. Statisticians and econometricians have done time series regression for many decades, so this beginning led to the paper being lampooned on Twitter.
The problem arose due to clashes in terminology being used in different fields.
The weekly mortality data recently published by the Human Mortality Database can be used to explore seasonality in mortality rates. Mortality rates are known to be seasonal due to temperatures and other weather-related effects (Healy 2003).
The reported COVID-19 deaths in each country are often undercounts due to different reporting practices, or people dying of COVID-19-related causes without ever being tested. One way to explore the true mortality effect of the pandemic is to look at “excess deaths”: the difference between death rates this year and the same time in previous years.
The Financial Times (and other media outlets) have been collecting data from many countries to try to measure this effect.
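As a rough illustration of the "excess deaths" idea, here is a minimal base R sketch. The weekly numbers are made up purely for illustration (they are not Human Mortality Database or Financial Times data), and it uses death counts rather than rates to keep things simple. The baseline is just the average for the same week in earlier years; real analyses often use more careful baselines.

```r
# Hypothetical weekly death counts (illustrative numbers only, not real data)
weekly <- data.frame(
  year = rep(2017:2020, each = 4),
  week = rep(1:4, times = 4),
  deaths = c(
    100, 102,  98, 101,  # 2017
    104, 100, 101,  99,  # 2018
     99, 101, 104, 100,  # 2019
    103, 110, 125, 140   # 2020
  )
)

# Baseline: average deaths for each week over the earlier years
baseline <- aggregate(deaths ~ week, data = weekly[weekly$year < 2020, ], FUN = mean)
names(baseline)[2] <- "expected"

# Excess deaths: observed deaths in 2020 minus the baseline for the same week
current <- weekly[weekly$year == 2020, c("week", "deaths")]
excess <- merge(current, baseline, by = "week")
excess$excess <- excess$deaths - excess$expected
excess
```

Positive values in the excess column indicate weeks where deaths were above what the earlier years would suggest.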
There have been some great data visualizations produced of COVID-19 case and deaths data, the best known of which is the graph from John Burn-Murdoch in the Financial Times. To my knowledge, it was first used by Matt Cowgill from the Grattan Institute, and has been widely copied. This is a great visualization and has helped introduce log-scale graphics to a wide audience.
Reproducing the Financial Times cumulative confirmed cases graph
To produce something like it, we can use the tidycovid19 package from Joachim Gassen:
What makes forecasting hard?
Forecasting pandemics is harder than many people think. In my book with George Athanasopoulos, we discuss the contributing factors that make forecasts relatively accurate. We identify three major factors:
1. how well we understand the factors that contribute to it;
2. how much data is available;
3. whether the forecasts can affect the thing we are trying to forecast.
For example, tomorrow’s weather can be forecast relatively accurately using modern tools because we have good models of the physical atmosphere, there is tons of data, and our weather forecasts cannot possibly affect what actually happens.
The tsibbledata package contains the vic_elec data set, containing half-hourly electricity demand for the state of Victoria, along with corresponding temperatures from the capital city, Melbourne. These data cover the period 2012-2014.
Other similar data sets are also available, and these may be of interest to researchers in the area.
For people new to tsibbles, please read my introductory post.
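To give a feel for working with vic_elec, here is a small sketch that aggregates the half-hourly data to daily values. It assumes the dplyr, lubridate, tsibble and tsibbledata packages are installed; the choice of daily total demand and daily maximum temperature is only for illustration.

```r
library(dplyr)
library(lubridate)
library(tsibble)
library(tsibbledata)

# Aggregate half-hourly observations to one row per day:
# total demand and maximum temperature for each date
vic_daily <- vic_elec %>%
  index_by(Day = date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
vic_daily
```

Here index_by() plays the role of group_by() for the time index, so summarise() collapses the 48 half-hourly rows in each day to a single daily row.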
Australian state-level demand
The raw data for other states are also stored in the tsibbledata GitHub repository (under the data-raw folder), but these are not included in the package to satisfy CRAN space constraints.
library(tidyverse)
library(tsibble)
library(readabs)
library(raustats)

Australian data analysts will know how frustrating it is to work with time series data from the Australian Bureau of Statistics. They are stored as multiple ugly Excel files (each containing multiple sheets) with inconsistent formatting, embedded comments, metadata stored along with the actual data, dates stored in a painful Excel format, and so on.
Fortunately there are now a couple of R packages available to make this a little easier.
library(tidyverse)
library(tsibble)
library(lubridate)
library(feasts)
library(fable)

In my previous post about the new fable package, we saw how fable can produce forecast distributions, not just point forecasts. All my examples used Gaussian (normal) distributions, so in this post I want to show how non-Gaussian forecasting can be done.
As an example, we will use eating-out expenditure in my home state of Victoria.
vic_cafe <- tsibbledata::aus_retail %>%
  filter(
    State == "Victoria",
    Industry == "Cafes, restaurants and catering services"
  ) %>%
  select(Month, Turnover)
vic_cafe %>%
  autoplot(Turnover) +
  ggtitle("Monthly turnover of Victorian cafes")

Forecasting with transformations
Clearly the variance is increasing with the level of the series, so we will consider modelling a Box-Cox transformation of the data.
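For readers who have not met the Box-Cox transformation, the following base R sketch shows what it does. The function names bc_transform and bc_inverse are hypothetical helpers written for this illustration (they are not the functions used in the post), and the series is simulated rather than the retail data.

```r
# Box-Cox transformation: log(y) when lambda = 0, otherwise (y^lambda - 1) / lambda
bc_transform <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transformation, used to return to the original scale
bc_inverse <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

# Simulated positive series whose spread grows with its level
set.seed(1)
level <- seq(10, 100, length.out = 120)
y <- level * exp(rnorm(120, sd = 0.1))

# The log transformation is the special case lambda = 0
w <- bc_transform(y, lambda = 0)
y_back <- bc_inverse(w, lambda = 0)
```

Smaller values of lambda shrink large observations more strongly, which is why the transformation can help when the variance increases with the level of the series.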
The fable package for doing tidy forecasting in R is now on CRAN. Like tsibble and feasts, it is also part of the tidyverts family of packages for analysing, modelling and forecasting many related time series (stored as tsibbles).
For a brief introduction to tsibbles, see this post from last month.
Here we will forecast Australian tourism data by state/region and purpose. This data is stored in the tourism tsibble where Trips contains domestic visitor nights in thousands.
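As a minimal sketch of what tidy forecasting with fable looks like, the code below totals trips by state and fits an ETS model to each state's series. It assumes the dplyr, tsibble and fable packages are installed; aggregating to state level and using ETS are choices made here for brevity, not necessarily what the full post does.

```r
library(dplyr)
library(tsibble)
library(fable)

# Total trips by state, fit one ETS model per state,
# then forecast two years (eight quarters) ahead
fc <- tsibble::tourism %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips)) %>%
  model(ets = ETS(Trips)) %>%
  forecast(h = "2 years")
fc
```

The result is a fable (forecast table) with one row per state and future quarter, holding both a forecast distribution and a point forecast.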