A forecast ensemble benchmark


24 June 2018

time series

Forecasting benchmarks are very important when testing new forecasting methods, to see how well they perform against some simple alternatives. Every week I get sent papers proposing new forecasting methods that fail to do better than even the simplest benchmark. They are rejected without review.

Typical benchmarks include the naïve method (especially for finance and economic data), the seasonal naïve method (for seasonal data), an automatically selected ETS model, and an automatically selected ARIMA model. These are easily calculated using the naive(), snaive(), ets() and auto.arima() functions in the forecast package. Note that the seasonal naïve method is equivalent to the naïve method for annual data (or any other data with frequency=1).

Another benchmark worth considering is the Theta method which did very well in the M3 forecasting competition. The thetaf() function provides a convenient implementation.

At the International Symposium on Forecasting held in Boulder last week, Srihari Jaganathan proposed that a simple average of these forecasts should be used as a standard forecast combination (or ensemble) benchmark. This is a good idea as it is very easy to do, relatively fast to compute, and often provides an excellent forecast. (For other combinations, the forecastHybrid package might be useful.)

In this post, I will test various subsets of the four benchmark methods mentioned above to see which performs best. Here is some R code to compute all possible subset ensembles from these four benchmarks. That gives 15 (=2^4-1) possible combinations.

The code is general enough that it is easy to add additional methods if anyone wants to include other possible benchmark methods. Just name them with a unique letter in the following function; no other code in this post needs to be changed.

benchmarks <- function(y, h) {
  # Compute four benchmark methods
  fcasts <- rbind(
    N = snaive(y, h)$mean,
    E = forecast(ets(y), h)$mean,
    A = forecast(auto.arima(y), h)$mean,
    T = thetaf(y, h)$mean)
  colnames(fcasts) <- seq(h)
  method_names <- rownames(fcasts)
  # Compute all possible combinations
  method_choice <- rep(list(0:1), length(method_names))
  names(method_choice) <- method_names
  combinations <- expand.grid(method_choice) %>% tail(-1) %>% as.matrix()
  # Construct names for all combinations
  for (i in seq(NROW(combinations))) {
    rownames(combinations)[i] <- paste0(method_names[which(combinations[i, ] > 0)],
      collapse = "")
  # Compute combination weights
  combinations <- sweep(combinations, 1, rowSums(combinations), FUN = "/")
  # Compute combinations of forecasts
  return(combinations %*% fcasts)

Let’s try it out on the M3 competition data.

# Compute "symmetric" percentage errors and scaled errors
errors <- map(M3, function(u) {
  train <- u$x
  test <- u$xx
  f <- benchmarks(train, u$h)
  error <- -sweep(f, 2, test)
  pcerror <- (200 * abs(error) / sweep(abs(f), 2, abs(test), FUN = "+")) %>%
    as_tibble() %>%
    mutate(Method = rownames(f)) %>%
    gather(key = h, value = sAPE, -Method)
  scalederror <- (abs(error) / mean(abs(diff(train, lag = frequency(train))))) %>%
    as_tibble() %>%
    mutate(Method = rownames(f)) %>%
    gather(key = h, value = ASE, -Method)
  return(list(pcerror = pcerror, scalederror = scalederror))
# Construct a tibble with all results
M3errors <- tibble(
    Series = names(M3),
    Period = map_chr(M3, "period"),
    se = map(errors, "scalederror"),
    pce = map(errors, "pcerror")) %>%
  unnest() %>%
  select(-h1, -Method1) %>%
  mutate(h = as.integer(h),
         Period = factor(str_to_title(Period),
                         levels = c("Monthly","Quarterly","Yearly","Other")))

We need to average the accuracy measures for each period, horizon and method.

accuracy <- M3errors %>%
  group_by(Period, Method, h) %>%
  summarize(MASE=mean(ASE), sMAPE=mean(sAPE)) %>%
`summarise()` has grouped output by 'Period', 'Method'. You can override using
the `.groups` argument.

To keep the table of results of a manageable size, I have summarised the accuracy statistics over all forecast horizons, and included only the best combination approach plus the original benchmark methods. The table is ordered by MASE value with the best method at the top for each period.

# Find names of original methods
original_methods <- unique(accuracy[["Method"]])
original_methods <- original_methods[str_length(original_methods)==1L]
# Compute summary table of accuracy measures
accuracy_table <- accuracy %>%
  group_by(Method,Period) %>%
    sMAPE = mean(sMAPE, na.rm = TRUE),
    MASE = mean(MASE, na.rm = TRUE) ) %>%
  arrange(MASE) %>% ungroup()
`summarise()` has grouped output by 'Method'. You can override using the
`.groups` argument.
best <- accuracy_table %>% filter(MASE==min(MASE))
accuracy_table <- accuracy_table %>%
  filter(Method %in% original_methods | Method %in% best$Method) %>%
  arrange(Period, MASE) %>%
  select(Period, Method, sMAPE, MASE)
Period Method sMAPE MASE
Monthly EAT 13.53 0.828
Monthly T 13.86 0.864
Monthly E 14.14 0.865
Monthly A 15.02 0.868
Monthly N 17.23 1.146
Quarterly EAT 9.01 1.064
Quarterly T 9.20 1.117
Quarterly E 9.68 1.170
Quarterly A 10.01 1.189
Quarterly N 11.06 1.425
Yearly EAT 16.03 2.688
Yearly T 16.76 2.774
Yearly E 17.00 2.860
Yearly A 17.10 2.959
Yearly N 17.88 3.172
Other EAT 4.32 1.807
Other E 4.37 1.814
Other A 4.51 1.841
Other T 4.92 2.271
Other N 6.30 3.089

The best performing ensemble method is EAT – a combination of an ETS model, an ARIMA model and the Theta method.

Next, we can plot the accuracy statistics against the forecast horizon. The legend has been ordered to correspond roughly to the order of the series in the MASE plots.

order <- accuracy_table %>% group_by(Method) %>%
  summarise(MASE = mean(MASE)) %>% arrange(MASE) %>%
  pull("Method") %>% rev()
accuracy %>%
  gather(key = "Measure", value = "accuracy", sMAPE, MASE) %>%
  filter(Method %in% unique(accuracy_table$Method)) %>%
  mutate(Method = factor(Method, levels=order)) %>%
  ggplot(aes(x = h, y = accuracy), group = Method) +
    geom_line(aes(col = Method)) +
    facet_grid(rows = vars(Measure), cols=vars(Period), scale = "free") +
    xlab("Forecast horizon") + ylab("Forecast accuracy")

From the graph we can see that the EAT ensemble works well for all periods and almost all horizons. To finish, here is a function which produces the EAT ensemble as a forecast object.

eat_ensemble <- function(y, h = ifelse(frequency(y) > 1, 2*frequency(y), 10)) {
  fets <- forecast(ets(y), h)
  farima <- forecast(auto.arima(y), h)
  ftheta <- thetaf(y, h)
  comb <- fets
  comb$mean <-(fets$mean + farima$mean + ftheta$mean)/3
  comb$method <- "ETS-ARIMA-Theta Combination"
USAccDeaths %>% eat_ensemble() %>% autoplot()

The prediction intervals shown are simply those from the ETS model. Combining prediction intervals is more difficult, and I’ll leave that to another post.