```
<- function(y, h) {
benchmarks require(forecast)
# Compute four benchmark methods
<- rbind(
fcasts N = snaive(y, h)$mean,
E = forecast(ets(y), h)$mean,
A = forecast(auto.arima(y), h)$mean,
T = thetaf(y, h)$mean)
colnames(fcasts) <- seq(h)
<- rownames(fcasts)
method_names # Compute all possible combinations
<- rep(list(0:1), length(method_names))
method_choice names(method_choice) <- method_names
<- expand.grid(method_choice) %>% tail(-1) %>% as.matrix()
combinations # Construct names for all combinations
for (i in seq(NROW(combinations))) {
rownames(combinations)[i] <- paste0(method_names[which(combinations[i, ] > 0)],
collapse = "")
}# Compute combination weights
<- sweep(combinations, 1, rowSums(combinations), FUN = "/")
combinations # Compute combinations of forecasts
return(combinations %*% fcasts)
}
```

# A forecast ensemble benchmark

Forecasting benchmarks are very important when testing new forecasting methods, to see how well they perform against some simple alternatives. Every week I get sent papers proposing new forecasting methods that fail to do better than even the simplest benchmark. They are rejected without review.

Typical benchmarks include the naïve method (especially for finance and economic data), the seasonal naïve method (for seasonal data), an automatically selected ETS model, and an automatically selected ARIMA model. These are easily calculated using the `naive()`

, `snaive()`

, `ets()`

and `auto.arima()`

functions in the forecast package. Note that the seasonal naïve method is equivalent to the naïve method for annual data (or any other data with frequency=1).

Another benchmark worth considering is the Theta method which did very well in the M3 forecasting competition. The `thetaf()`

function provides a convenient implementation.

At the International Symposium on Forecasting held in Boulder last week, Srihari Jaganathan proposed that a simple average of these forecasts should be used as a standard forecast combination (or ensemble) benchmark. This is a good idea as it is very easy to do, relatively fast to compute, and often provides an excellent forecast. (For other combinations, the forecastHybrid package might be useful.)

In this post, I will test various subsets of the four benchmark methods mentioned above to see which performs best. Here is some R code to compute all possible subset ensembles from these four benchmarks. That gives 15 (=2^4-1) possible combinations.

The code is general enough that it is easy to add additional methods if anyone wants to include other possible benchmark methods. Just name them with a unique letter in the following function; no other code in this post needs to be changed.

Let’s try it out on the M3 competition data.

```
library(Mcomp)
library(tidyverse)
# Compute "symmetric" percentage errors and scaled errors
<- map(M3, function(u) {
errors <- u$x
train <- u$xx
test <- benchmarks(train, u$h)
f <- -sweep(f, 2, test)
error <- (200 * abs(error) / sweep(abs(f), 2, abs(test), FUN = "+")) %>%
pcerror as_tibble() %>%
mutate(Method = rownames(f)) %>%
gather(key = h, value = sAPE, -Method)
<- (abs(error) / mean(abs(diff(train, lag = frequency(train))))) %>%
scalederror as_tibble() %>%
mutate(Method = rownames(f)) %>%
gather(key = h, value = ASE, -Method)
return(list(pcerror = pcerror, scalederror = scalederror))
})# Construct a tibble with all results
<- tibble(
M3errors Series = names(M3),
Period = map_chr(M3, "period"),
se = map(errors, "scalederror"),
pce = map(errors, "pcerror")) %>%
unnest() %>%
select(-h1, -Method1) %>%
mutate(h = as.integer(h),
Period = factor(str_to_title(Period),
levels = c("Monthly","Quarterly","Yearly","Other")))
```

We need to average the accuracy measures for each period, horizon and method.

```
<- M3errors %>%
accuracy group_by(Period, Method, h) %>%
summarize(MASE=mean(ASE), sMAPE=mean(sAPE)) %>%
ungroup()
```

```
`summarise()` has grouped output by 'Period', 'Method'. You can override using
the `.groups` argument.
```

To keep the table of results of a manageable size, I have summarised the accuracy statistics over all forecast horizons, and included only the best combination approach plus the original benchmark methods. The table is ordered by MASE value with the best method at the top for each period.

```
# Find names of original methods
<- unique(accuracy[["Method"]])
original_methods <- original_methods[str_length(original_methods)==1L]
original_methods # Compute summary table of accuracy measures
<- accuracy %>%
accuracy_table group_by(Method,Period) %>%
summarise(
sMAPE = mean(sMAPE, na.rm = TRUE),
MASE = mean(MASE, na.rm = TRUE) ) %>%
arrange(MASE) %>% ungroup()
```

```
`summarise()` has grouped output by 'Method'. You can override using the
`.groups` argument.
```

```
<- accuracy_table %>% filter(MASE==min(MASE))
best <- accuracy_table %>%
accuracy_table filter(Method %in% original_methods | Method %in% best$Method) %>%
arrange(Period, MASE) %>%
select(Period, Method, sMAPE, MASE)
::kable(accuracy_table) knitr
```

Period | Method | sMAPE | MASE |
---|---|---|---|

Monthly | EAT | 13.53 | 0.828 |

Monthly | T | 13.86 | 0.864 |

Monthly | E | 14.14 | 0.865 |

Monthly | A | 15.02 | 0.868 |

Monthly | N | 17.23 | 1.146 |

Quarterly | EAT | 9.01 | 1.064 |

Quarterly | T | 9.20 | 1.117 |

Quarterly | E | 9.68 | 1.170 |

Quarterly | A | 10.01 | 1.189 |

Quarterly | N | 11.06 | 1.425 |

Yearly | EAT | 16.03 | 2.688 |

Yearly | T | 16.76 | 2.774 |

Yearly | E | 17.00 | 2.860 |

Yearly | A | 17.10 | 2.959 |

Yearly | N | 17.88 | 3.172 |

Other | EAT | 4.32 | 1.807 |

Other | E | 4.37 | 1.814 |

Other | A | 4.51 | 1.841 |

Other | T | 4.92 | 2.271 |

Other | N | 6.30 | 3.089 |

The best performing ensemble method is EAT – a combination of an ETS model, an ARIMA model and the Theta method.

Next, we can plot the accuracy statistics against the forecast horizon. The legend has been ordered to correspond roughly to the order of the series in the MASE plots.

```
<- accuracy_table %>% group_by(Method) %>%
order summarise(MASE = mean(MASE)) %>% arrange(MASE) %>%
pull("Method") %>% rev()
%>%
accuracy gather(key = "Measure", value = "accuracy", sMAPE, MASE) %>%
filter(Method %in% unique(accuracy_table$Method)) %>%
mutate(Method = factor(Method, levels=order)) %>%
ggplot(aes(x = h, y = accuracy), group = Method) +
geom_line(aes(col = Method)) +
facet_grid(rows = vars(Measure), cols=vars(Period), scale = "free") +
xlab("Forecast horizon") + ylab("Forecast accuracy")
```

From the graph we can see that the EAT ensemble works well for all periods and almost all horizons. To finish, here is a function which produces the EAT ensemble as a forecast object.

```
<- function(y, h = ifelse(frequency(y) > 1, 2*frequency(y), 10)) {
eat_ensemble require(forecast)
<- forecast(ets(y), h)
fets <- forecast(auto.arima(y), h)
farima <- thetaf(y, h)
ftheta <- fets
comb $mean <-(fets$mean + farima$mean + ftheta$mean)/3
comb$method <- "ETS-ARIMA-Theta Combination"
combreturn(comb)
}%>% eat_ensemble() %>% autoplot() USAccDeaths
```

The prediction intervals shown are simply those from the ETS model. Combining prediction intervals is more difficult, and I’ll leave that to another post.