library(tidyverse)
library(tsibble)
library(readabs)
ABS time series as tsibbles
Australian data analysts will know how frustrating it is to work with time series data from the Australian Bureau of Statistics. They are stored as multiple ugly Excel files (each containing multiple sheets) with inconsistent formatting, embedded comments, meta data stored along with the actual data, dates stored in a painful Excel format, and so on.
Fortunately there is an R package available to make this a little easier. The readabs
package is maintained by Matt Cowgill from the Grattan Institute — which means it has probably had a very thorough workout!
To illustrate it, I will recreate the tsibbledata::aus_retail
data, containing monthly Australian retail trade turnover for different combinations of industry and state. Here is the data as provided in the tsibbledata package
.
::aus_retail tsibbledata
# A tsibble: 64,532 x 5 [1M]
# Key: State, Industry [152]
State Industry Serie…¹ Month Turno…²
<chr> <chr> <chr> <mth> <dbl>
1 Australian Capital Territory Cafes, restaurant… A33498… 1982 Apr 4.4
2 Australian Capital Territory Cafes, restaurant… A33498… 1982 May 3.4
3 Australian Capital Territory Cafes, restaurant… A33498… 1982 Jun 3.6
4 Australian Capital Territory Cafes, restaurant… A33498… 1982 Jul 4
5 Australian Capital Territory Cafes, restaurant… A33498… 1982 Aug 3.6
6 Australian Capital Territory Cafes, restaurant… A33498… 1982 Sep 4.2
7 Australian Capital Territory Cafes, restaurant… A33498… 1982 Oct 4.8
8 Australian Capital Territory Cafes, restaurant… A33498… 1982 Nov 5.4
9 Australian Capital Territory Cafes, restaurant… A33498… 1982 Dec 6.9
10 Australian Capital Territory Cafes, restaurant… A33498… 1983 Jan 3.8
# … with 64,522 more rows, and abbreviated variable names ¹`Series ID`,
# ²Turnover
There are 152 combinations of State and Industry (each corresponding to a Series ID
).
The main function is read_abs()
which will download the data, read it into R, and tidy it. While it will do this for all spreadsheets in a given catalogue number, I would not recommend that. Choose the spreadsheet you want. In this case, it is Cat 8501.0, Table 11.
<- read_abs("8501.0", tables = 11) retail1
Finding URLs for tables corresponding to ABS catalogue 8501.0
Attempting to download files from catalogue 8501.0, Retail Trade, Australia
Downloading https://www.abs.gov.au/statistics/industry/retail-and-wholesale-trade/retail-trade-australia/latest-release/8501011.xlsx
Extracting data from downloaded spreadsheets
Tidying data from imported ABS spreadsheets
retail1
# A tibble: 91,476 × 12
table_no sheet…¹ table…² date series value serie…³ data_…⁴ colle…⁵
<chr> <chr> <chr> <date> <chr> <dbl> <chr> <chr> <chr>
1 8501011 Data1 TABLE … 1982-04-01 Turno… 303. Origin… FLOW 1
2 8501011 Data1 TABLE … 1982-04-01 Turno… 41.7 Origin… FLOW 1
3 8501011 Data1 TABLE … 1982-04-01 Turno… 63.9 Origin… FLOW 1
4 8501011 Data1 TABLE … 1982-04-01 Turno… 409. Origin… FLOW 1
5 8501011 Data1 TABLE … 1982-04-01 Turno… 65.8 Origin… FLOW 1
6 8501011 Data1 TABLE … 1982-04-01 Turno… 91.8 Origin… FLOW 1
7 8501011 Data1 TABLE … 1982-04-01 Turno… 53.6 Origin… FLOW 1
8 8501011 Data1 TABLE … 1982-04-01 Turno… 211. Origin… FLOW 1
9 8501011 Data1 TABLE … 1982-04-01 Turno… 94 Origin… FLOW 1
10 8501011 Data1 TABLE … 1982-04-01 Turno… 32.7 Origin… FLOW 1
# … with 91,466 more rows, 3 more variables: frequency <chr>,
# series_id <chr>, unit <chr>, and abbreviated variable names ¹sheet_no,
# ²table_title, ³series_type, ⁴data_type, ⁵collection_month
Some of those columns are not particularly useful (containing a single unique value), so we will remove them. We also need to fix the date to be a Month (rather than Day), and we will match the names to tsibbledata::aus_retail
to make comparisons easier.
<- retail1 %>%
retail1 mutate(Month = yearmonth(date)) %>%
rename(Turnover = value, `Series ID` = series_id) %>%
select(Month, `Series ID`, series, Turnover)
retail1
# A tibble: 91,476 × 4
Month `Series ID` series Turno…¹
<mth> <chr> <chr> <dbl>
1 1982 Apr A3349335T Turnover ; New South Wales ; Supermarket… 303.
2 1982 Apr A3349627V Turnover ; New South Wales ; Liquor reta… 41.7
3 1982 Apr A3349338X Turnover ; New South Wales ; Other speci… 63.9
4 1982 Apr A3349398A Turnover ; New South Wales ; Food retail… 409.
5 1982 Apr A3349468W Turnover ; New South Wales ; Furniture, … 65.8
6 1982 Apr A3349336V Turnover ; New South Wales ; Electrical … 91.8
7 1982 Apr A3349337W Turnover ; New South Wales ; Hardware, b… 53.6
8 1982 Apr A3349397X Turnover ; New South Wales ; Household g… 211.
9 1982 Apr A3349399C Turnover ; New South Wales ; Clothing re… 94
10 1982 Apr A3349874C Turnover ; New South Wales ; Footwear an… 32.7
# … with 91,466 more rows, and abbreviated variable name ¹Turnover
The series
column contains information about the state and industry, so we will need to extract the relevant details. Also, totals are included in addition to the disaggregated data, so let’s remove them.
<- retail1 %>%
retail1 separate(series, c("Category", "State", "Industry"),
sep = ";", extra = "drop") %>%
mutate(
State = trimws(State),
Industry = trimws(Industry),
%>%
) select(-Category) %>%
filter(
!= "Total (Industry)",
Industry != "Total (State)"
State )
Next, we turn it into a tsibble by identifying the index and key variables, and removing any missing combinations.
<- retail1 %>%
retail1 as_tsibble(index = Month, key = c(State, Industry)) %>%
filter(!is.na(Turnover))
retail1
# A tsibble: 70,878 x 5 [1M]
# Key: State, Industry [156]
Month `Series ID` State Industry Turno…¹
<mth> <chr> <chr> <chr> <dbl>
1 1982 Apr A3349849A Australian Capital Territory Cafes, restau… 4.4
2 1982 May A3349849A Australian Capital Territory Cafes, restau… 3.4
3 1982 Jun A3349849A Australian Capital Territory Cafes, restau… 3.6
4 1982 Jul A3349849A Australian Capital Territory Cafes, restau… 4
5 1982 Aug A3349849A Australian Capital Territory Cafes, restau… 3.6
6 1982 Sep A3349849A Australian Capital Territory Cafes, restau… 4.2
7 1982 Oct A3349849A Australian Capital Territory Cafes, restau… 4.8
8 1982 Nov A3349849A Australian Capital Territory Cafes, restau… 5.4
9 1982 Dec A3349849A Australian Capital Territory Cafes, restau… 6.9
10 1983 Jan A3349849A Australian Capital Territory Cafes, restau… 3.8
# … with 70,868 more rows, and abbreviated variable name ¹Turnover
The additional rows here compared to tsibbledata::aus_retail
are because the data now extend to November 2019.
There’s a helpful vignette demonstrating other facilities and features of the readabs
package.