# Feature-based time series analysis

Date

16 September 2019

Topics
time series
graphics
statistics
R
tidyverts
anomalies
data science

In my last post, I showed how the feasts package can be used to produce various time series graphics.

The feasts package also includes functions for computing FEatures And Statistics from Time Series (hence the name). In this post I will give three examples of how these might be used.

library(tidyverse)
library(tsibble)
library(feasts)

## Exploring Australian tourism data

I used this example in my talk at useR!2019 in Toulouse, and it is also the basis of a vignette in the package, and a recent blog post by Mitchell O’Hara-Wild. The data set contains domestic tourist visitor nights in Australia, disaggregated by State, Region and Purpose.

tourism
# A tsibble: 24,320 x 5 [1Q]
# Key:       Region, State, Purpose [304]
Quarter Region   State           Purpose  Trips
<qtr> <chr>    <chr>           <chr>    <dbl>
# … with 24,310 more rows

An example of a feature would be the autocorrelation function at lag 1 — it is a numerical summary capturing some aspect of the time series. Autocorrelations at other lags are also features, as are the autocorrelations of the first differenced series, or the seasonally differenced series, etc. Another example of a feature is the strength of seasonality of a time series, as measured by 1-\text{Var}(R_t)/\text{Var}(S_t+R_t) where S_t is the seasonal component and R_t is the remainder component in an STL decomposition. Values close to 1 indicate a highly seasonal time series, while values close to 0 indicate a time series with little seasonality.

The feasts package has some inbuilt feature calculation functions. For example coef_hurst will calculate the Hurst coefficient of a time series and feat_spectral will compute the spectral entropy of a time series. To apply these to all series in the tourism data set, we can use the features function like this.

tourism %>% features(Trips, list(coef_hurst, feat_spectral))
# A tibble: 304 × 5
Region         State              Purpose  coef_hurst spectral_entropy
<chr>          <chr>              <chr>         <dbl>            <dbl>
2 Adelaide       South Australia    Holiday       0.558            0.753
3 Adelaide       South Australia    Other         0.862            0.793
4 Adelaide       South Australia    Visiting      0.583            0.768
6 Adelaide Hills South Australia    Holiday       0.651            0.928
7 Adelaide Hills South Australia    Other         0.672            0.913
8 Adelaide Hills South Australia    Visiting      0.653            1
9 Alice Springs  Northern Territory Business      0.713            0.944
10 Alice Springs  Northern Territory Holiday       0.500            0.544
# … with 294 more rows

There are also functions which produce collections of features based on tags. For example, feature_set(tags="acf") will produce 7 features based on various autocorrelation coefficients, while feature_set(tags="stl") produces 7 features based on an STL decomposition of a time series, including the strength of seasonality mentioned above.

tourism %>% features(Trips, feature_set(tags="stl"))
# A tibble: 304 × 12
Region  State Purpose trend…¹ seaso…² seaso…³ seaso…⁴ spiki…⁵ linea…⁶ curva…⁷
<chr>   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Adelai… Sout… Busine…   0.464   0.407       3       1 1.58e+2  -5.31   71.6
2 Adelai… Sout… Holiday   0.554   0.619       1       2 9.17e+0  49.0    78.7
3 Adelai… Sout… Other     0.746   0.202       2       1 2.10e+0  95.1    43.4
4 Adelai… Sout… Visiti…   0.435   0.452       1       3 5.61e+1  34.6    71.4
5 Adelai… Sout… Busine…   0.464   0.179       3       0 1.03e-1   0.968  -3.22
6 Adelai… Sout… Holiday   0.528   0.296       2       1 1.77e-1  10.5    24.0
7 Adelai… Sout… Other     0.593   0.404       2       2 4.44e-4   4.28    3.19
8 Adelai… Sout… Visiti…   0.488   0.254       0       3 6.50e+0  34.2    -0.529
9 Alice … Nort… Busine…   0.534   0.251       0       1 1.69e-1  23.8    19.5
10 Alice … Nort… Holiday   0.381   0.832       3       1 7.39e-1 -19.6    10.5
# … with 294 more rows, 2 more variables: stl_e_acf1 <dbl>, stl_e_acf10 <dbl>,
#   and abbreviated variable names ¹​trend_strength, ²​seasonal_strength_year,
#   ³​seasonal_peak_year, ⁴​seasonal_trough_year, ⁵​spikiness, ⁶​linearity,
#   ⁷​curvature

We can then use these features in plots to identify what type of series are heavily trended and what are most seasonal.

tourism %>% features(Trips, feature_set(tags="stl")) %>%
ggplot(aes(x=trend_strength, y=seasonal_strength_year, col=Purpose)) +
geom_point() + facet_wrap(vars(State))

Clearly, holiday series are most seasonal which is unsurprising. The strongest trends tend to be in Western Australia.

The most seasonal series can also be easily identified and plotted.

tourism %>%
features(Trips, feature_set(tags="stl")) %>%
filter(seasonal_strength_year == max(seasonal_strength_year)) %>%
left_join(tourism, by = c("State","Region","Purpose")) %>%
ggplot(aes(x = Quarter, y = Trips)) + geom_line() +
facet_grid(vars(State,Region,Purpose))

This is holiday trips to the most popular ski region of Australia.

The feasts package currently includes 42 features which can all be computed in one line like this.

# Compute features
tourism_features <- tourism %>%
features(Trips, feature_set(pkgs="feasts"))

Unusual time series can be identified by first doing a principal components decomposition

pcs <- tourism_features %>%
select(-State, -Region, -Purpose) %>%
prcomp(scale=TRUE) %>%
augment(tourism_features)
pcs %>%
ggplot(aes(x=.fittedPC1, y=.fittedPC2, col=Purpose)) +
geom_point() + theme(aspect.ratio=1)

We can then identify some unusual series.

pcs %>%
filter(.fittedPC1 > 10.5) %>%
select(Region, State, Purpose, .fittedPC1, .fittedPC2)
# A tibble: 4 × 5
Region                 State             Purpose  .fittedPC1 .fittedPC2
<chr>                  <chr>             <chr>         <dbl>      <dbl>
1 Australia's North West Western Australia Business       13.4    -11.3
2 Australia's South West Western Australia Holiday        10.9      0.880
3 Melbourne              Victoria          Holiday        12.3    -10.4
4 South Coast            New South Wales   Holiday        11.9      9.42 

## Replicating Hyndman, Wang and Laptev (2015)

I used a similar approach to identifying outliers in my ICDM 2015 paper with Earo Wang and Nikolay Laptev. In that paper, we used hourly data on thousands of mail servers from Yahoo.

The data is stored in the tsfeatures package as an mts object. Unfortunately, mts and ts objects are particularly bad at handling hourly and other sub-daily data, as time is stored numerically and the precision is not sufficient to always work properly. So we need to do a little work to create a properly formulated tsibble.

yahoo <- tsfeatures::yahoo_data() %>%
as_tsibble() %>%
mutate(
Time = hms::hms(day = trunc(index) - 1L,
hour = as.integer((round(24*(index-1))) %% 24))
) %>%
as_tsibble(index = Time, key=key) %>%
select(Time, key, value)

The key variable here identifies the particular server and measurement variable on that server. There are 1748 such time series, each with 1437 (almost 60 days) observations.

Now we create the features on all series, matching the original paper as closely as possible. There are a few small differences due to different choices in how features are computed, but they make little difference to the results.

The features are computed in two groups — the first group uses the original data, while the second group uses the scaled data.

yahoo_features <- bind_cols(
yahoo %>% features(value, features=list(
mean = ~ mean(., na.rm = TRUE),
var = ~ var(., na.rm = TRUE)
)),
yahoo %>% features(scale(value), features = list(
~ feat_acf(.),
~ feat_spectral(.),
~ longest_flat_spot(.),
~ n_crossing_points(.),
~ var_tiled_var(., .period = 24, .size = 24),
~ shift_level_max(., .period = 24, .size = 24),
~ shift_var_max(., .period = 24, .size = 24),
~ shift_kl_max(., .period = 24, .size = 48),
~ feat_stl(., .period = 24, s.window = "periodic", robust = TRUE)
))
) %>%
rename(
lumpiness = var_tiled_var,
key = key...1
) %>%
select(key, mean, var, acf1, trend_strength,
seasonal_strength_24, linearity, curvature,
seasonal_peak_24, seasonal_trough_24,
spectral_entropy, lumpiness, spikiness,
shift_level_max, shift_var_max,
longest_flat_spot, n_crossing_points,
shift_kl_max, shift_kl_index)

Now we can compute the principal components.

hwl_pca <- yahoo_features %>%
select(-key) %>%
na.omit() %>%
prcomp(scale=TRUE) %>%
augment(na.omit(yahoo_features))
hwl_pca %>%
as_tibble() %>%
ggplot(aes(x=.fittedPC1, y=.fittedPC2)) +
geom_point()

A useful plot for identifying the outlying observations is an HDR scatterplot which highlights regions of high density, and identifies outliers as those observations in low density regions. The red dots are the 1% highest density observations, the orange dots are in the 50% region, the yellow dots are in the 99% regions, while those shown in black are the 1% most “unusual” observations (having the lowest bivariate density on this plot.) The five most outlying points are highlighted by their row numbers.

hdrcde::hdrscatterplot(hwl_pca$.fittedPC1, hwl_pca$.fittedPC2, noutliers=5)

## Replicating Kang, Hyndman and Smith-Miles (2017)

Another paper that uses features is this IJF paper from 2017 in which we explored the feature space of the M3 time series data.

First we need to create several tsibbles, corresponding to the different seasonal periods in the M3 data.

m3totsibble <- function(z) {
bind_rows(
as_tsibble(z$x) %>% mutate(Type="Training") %>% as_tibble(), as_tsibble(z$xx) %>% mutate(Type="Test") %>% as_tibble()
) %>%
mutate(
st = z$st, type = z$type,
period = z$period, description = z$description,
sn = z\$sn,
)
}
m3yearly <- Mcomp::M3 %>%
subset("yearly") %>%
purrr::map_dfr(m3totsibble) %>%
as_tsibble(index=index, key=c(sn,period,st))
m3quarterly <- Mcomp::M3 %>%
subset("quarterly") %>%
purrr::map_dfr(m3totsibble) %>%
mutate(index = yearquarter(index)) %>%
as_tsibble(index=index, key=c(sn,period,st))
m3monthly <- Mcomp::M3 %>%
subset("monthly") %>%
purrr::map_dfr(m3totsibble) %>%
mutate(index = yearmonth(index)) %>%
as_tsibble(index=index, key=c(sn,period,st))
m3other <- Mcomp::M3 %>%
subset("other") %>%
purrr::map_dfr(m3totsibble) %>%
as_tsibble(index=index, key=c(sn,period,st))

There are some bespoke features used by Kang et al (IJF 2017), so rather than use the inbuilt feasts functions, we can create our own feature calculation function.

khs_stl <- function(x, period) {
output <- c(frequency=period, feat_spectral(x))
lambda <- forecast::BoxCox.lambda(ts(x, frequency=period),
lower=0, upper=1, method='loglik')
stlfeatures <- feat_stl(box_cox(x, lambda), .period=period,
s.window='periodic', robust=TRUE)
output <- c(output, stlfeatures, lambda=lambda)
if(period==1L)
output <- c(output, seasonal_strength=0)
return(output)
}
m3_features <- bind_rows(
m3yearly %>% features(value, features=list(~ khs_stl(.,1))),
m3quarterly %>% features(value, features=list(~ khs_stl(.,4))) %>%
rename(seasonal_strength = seasonal_strength_4),
m3monthly %>% features(value, features=list(~ khs_stl(.,12))) %>%
rename(seasonal_strength = seasonal_strength_12),
m3other %>% features(value, features=list(~ khs_stl(.,1)))
) %>%
select(sn, spectral_entropy, trend_strength, seasonal_strength,
frequency, stl_e_acf1, lambda)

Finally we plot the first two principal components of the feature space.

m3_features %>%
select(-sn) %>%
prcomp(scale=TRUE) %>%
augment(m3_features) %>%
ggplot(aes(x=.fittedPC1, y=.fittedPC2)) +
geom_point(aes(color = factor(frequency)))