New in forecast 5.0

Last week, version 5.0 of the forecast package for R was released. It includes several new functions and a number of changes to existing ones, which is why I increased the version number to 5.0. Thanks to Earo Wang for helping with this new version.

Handling missing values and outliers

Data cleaning is often the first step that data scientists and analysts take to ensure statistical modelling is supported by good data. Some new and extended functions have been added to the forecast package to make this job easier, and to automate some steps.

The existing na.interp function has been upgraded to handle seasonal series much better. It now fits a seasonal model to the data, and then interpolates the seasonally adjusted series, before re-seasonalizing. I’ve tested it on a lot of data and I think it works pretty well, although I’m sure users will come up with some test cases that cause problems.
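As a minimal sketch of the seasonal interpolation described above, the following knocks a few values out of the built-in AirPassengers series and fills them back in with na.interp. The choice of series and of which observations to remove is purely illustrative, and the exact interpolated values will depend on the installed version of the package.

```r
library(forecast)

# Monthly series with a strong seasonal pattern (ships with base R)
y <- AirPassengers

# Remove a few observations to simulate missing data (illustrative positions)
y_missing <- y
y_missing[c(30, 31, 75)] <- NA

# Interpolate; for seasonal data this works on the seasonally adjusted series
y_filled <- na.interp(y_missing)

# Compare the interpolated values with the ones that were removed
cbind(actual = y[c(30, 31, 75)], interpolated = y_filled[c(30, 31, 75)])
```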

tsoutliers is a new function for identifying outliers and suggesting reasonable replacements. Residuals are identified by fitting a loess curve for non-seasonal data and via a periodic STL decomposition for seasonal data. Residuals are labelled as outliers if they lie outside the range $\pm 2(q_{0.9}-q_{0.1})$ where $q_p$ is the $p$-quantile of the residuals. This is a little experimental. For a Gaussian distribution, it will identify fewer than 1 point in 3 million as an outlier. In comparison, when boxplots are used, an outlier is shown if it lies outside $\pm 1.5(q_{0.75}-q_{0.25})$ and extreme outliers are outside $\pm 3.0(q_{0.75}-q_{0.25})$. By these rules, under a Gaussian distribution, about 4% of points will be identified as outliers and about 1 in 20000 as extreme outliers.
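The Gaussian figures quoted above can be checked with a few lines of base R. This sketch takes the thresholds exactly as written, that is, as symmetric limits around zero applied to standard normal residuals; the variable names are my own.

```r
# Threshold used by tsoutliers as described above: 2 * (q_0.9 - q_0.1)
tso_limit <- 2 * (qnorm(0.9) - qnorm(0.1))
p_tso <- 2 * pnorm(-tso_limit)      # two-sided tail probability

# Boxplot-style thresholds as described above: 1.5 and 3 times the IQR
iqr <- qnorm(0.75) - qnorm(0.25)
p_box <- 2 * pnorm(-1.5 * iqr)      # "outlier"
p_extreme <- 2 * pnorm(-3 * iqr)    # "extreme outlier"

# Express the rare events as "1 in N"
c(tsoutliers_1_in = 1 / p_tso,
  boxplot_share = p_box,
  extreme_1_in = 1 / p_extreme)
```

The first number comes out a little above 3 million, the second a little above 0.04, and the third a little under 20000, which matches the rounded figures in the text.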

Real data are often not as well-behaved as a Gaussian distribution, and outliers can be present. For example, the weekly air passenger traffic data between Melbourne and Sydney (melsyd in the fpp package) contain seven consecutive weeks of zero traffic, and one week of partial traffic, due to a pilots’ strike. The tsoutliers function can replace those with estimates:

library(fpp)
tsoutliers(melsyd[,3])
$index
[1] 113 114 115 116 117 118 119 120
 
$replacements
[1] 17.57579 16.94973 16.15192 16.12787 16.20718 14.88098 14.46360 12.56176

A more general function is tsclean which is a combination of na.interp and tsoutliers, so it handles both missing values and outliers. It will return a cleaned version of a time series with outliers and missing values replaced by estimated values.

library(fpp)
plot(melsyd[,3], main="Economy class passengers: Melbourne-Sydney")
lines(tsclean(melsyd[,3]), col='red')

These three functions have one common argument lambda (a Box-Cox transformation parameter). If present, the time series is transformed before the outliers are identified and replaced, or missing values are estimated.
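Here is a small sketch of the lambda argument in use. It assumes a strictly positive series so that lambda = 0 (a log transformation) is valid; the missing values and the inflated observation are planted artificially for illustration.

```r
library(forecast)

y <- AirPassengers

# Plant two missing values and one gross outlier (illustrative positions)
y_dirty <- y
y_dirty[c(20, 21)] <- NA
y_dirty[80] <- 10 * y_dirty[80]

# Clean on the log scale: lambda = 0 is the Box-Cox log transformation
y_clean <- tsclean(y_dirty, lambda = 0)

# Inspect what happened at the contaminated positions
cbind(original = y[c(20, 21, 80)],
      dirty = y_dirty[c(20, 21, 80)],
      cleaned = y_clean[c(20, 21, 80)])
```

Working on the transformed scale is mainly useful when the size of the seasonal swings grows with the level of the series, as it does here.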

These functions are also now used when robust=TRUE in forecast.ts. The idea is that forecast.ts can take any time series and return something reasonable, even if the original series has missing values and outliers.
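A minimal sketch of the robust option, again using an artificially contaminated copy of AirPassengers. Which model forecast() ends up choosing depends on the series and the package version, so treat the output as illustrative only.

```r
library(forecast)

# Contaminate a series with a missing value and an outlier (illustrative)
y <- AirPassengers
y[50] <- NA
y[100] <- 5 * y[100]

# robust = TRUE cleans the series before a model is selected and fitted
fc <- forecast(y, h = 12, robust = TRUE)

plot(fc)
```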

Calendar variables

We’ve added two functions, bizdays and easter, into the package; they can be used when adjusting for calendar effects. Like the function monthdays, both functions work for monthly and quarterly data.

bizdays, as its name suggests, returns the number of business days in each month or quarter of the observed time series. Along with a time series input, it has an argument FinCenter referring to the “Financial Center” (equivalent to the finCenter in the timeDate package). It is also assumed that weekdays are from Monday to Friday.

The Easter holiday isn’t fixed in relation to the civil calendar, which can make it challenging to forecast a time series with Easter effects. The function easter will return a dummy variable indicating if Easter is present in each month or quarter. Easter is defined as the days between Good Friday and Easter Sunday inclusively, plus optionally Easter Monday if easter.mon = TRUE. The function will return 0 for all months or quarters except those containing some of the days of Easter. A fractional result is returned if Easter spans March and April; otherwise 1 indicates that Easter falls entirely within the month or quarter.

These two functions are intended to give output that can be used as regression variables in auto.arima or tslm.
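As a rough sketch of that usage, the two calendar variables can be bound into a regressor matrix and passed to auto.arima through xreg. The series (AirPassengers), the choice of "New York" as the financial centre, and the column names are illustrative assumptions; whether these regressors actually help depends entirely on the data.

```r
library(forecast)

y <- AirPassengers  # monthly series

# Calendar regressors aligned with the observed series
bdays <- bizdays(y, FinCenter = "New York")  # business days per month
east <- easter(y)                            # share of Easter in each month

xreg <- cbind(bizdays = bdays, easter = east)

# Regression with ARIMA errors using the calendar variables
fit <- auto.arima(y, xreg = xreg)
summary(fit)
```

To forecast from such a model, the same two functions would need to be evaluated over the future period and supplied as the xreg argument of forecast().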

Changes to ARIMA modelling

The biggest change is actually not part of the forecast package. When a regression variable is present (including when a drift term is used), the estimation was very poorly initialized in the stats::arima function. I proposed a fix to the R core team, and this became part of R v3.0.2. As stats::arima is the engine behind the Arima and auto.arima functions in the forecast package, this means that the package can now sometimes return results that differ from those obtained in older versions of R.

Changes to the forecast package itself include:

  • Added arguments max.D and max.d to auto.arima(), ndiffs() and nsdiffs().
  • Removed drift term in Arima() when $d+D>1$.
  • Added bootstrap option to forecast.Arima().

The last of these options now makes it possible to forecast from an ARIMA model without making the assumption of Gaussian errors.
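A short sketch of the bootstrap option, using the WWWusage series that ships with R as an arbitrary example. The intervals are built from simulated paths with resampled residuals, so they will vary slightly from run to run unless a seed is set.

```r
library(forecast)

set.seed(123)  # bootstrap intervals are simulation-based

fit <- auto.arima(WWWusage)

# Intervals based on normally distributed errors (the default)
fc_normal <- forecast(fit, h = 10)

# Intervals based on resampled residuals instead
fc_boot <- forecast(fit, h = 10, bootstrap = TRUE)

# Compare the upper 95% limits from the two approaches
cbind(normal = fc_normal$upper[, 2], bootstrap = fc_boot$upper[, 2])
```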

Minor changes and bug fixes

Other changes include:

  • Added argument model to dshw() to enable an estimated model to be applied to a new time series.
  • Made several functions more robust to zoo objects.
  • Corrected an error in the calculation of AICc when using CV().
  • Made minimum default p in nnetar equal to 1 so it can no longer return a null model.
  • Improved output from snaive() and naive() to better reflect user expectations.
  • Allowed Acf() to handle missing values by using na.contiguous. I might change this to na.interp in a future release.
  • Changed default information criterion in ets() to AICc. For short time series, it may choose a different model from previous versions.

Bugs?

If any user thinks they have found a bug, please report it on the github page and include a minimal reproducible example. If I can’t reproduce it, I can’t fix it.


  • Jeff

    Thank you Dr. Hyndman for all you do – incredible resources (teaching and software)!

    • Nilabhra Banerjee

      Dr Hyndman’s new book (and blog) goes a million times ahead of the earlier book on forecasting by Makridakis et al., which was once the sole source of knowledge on forecasting. Not only that, his fpp package is a brilliant addition. Dr. Hyndman has made life a lot easier for management science people like us who don’t have deep insight into stats but who are often compelled to do stats for a livelihood.

  • Brajesh Singh

    Dear Dr. Hyndman, thanks a lot for these wonderful resources. I’m a novice at R and forecasting, but your blog, online textbook and R package have helped me learn some aspects of forecasting.
    When I was starting out I also explored Python, and I wanted to know if you’ll ever consider providing software for Python, which seems to be really lacking in terms of forecasting.
    Hoping to continue learning and developing my knowledge of forecasting.
    Thanks Again.

    • No, python is not on my agenda at this stage. Maybe one day.

  • Jhonatas Kleinkauff

    Hey Rob. I recently came to R and am digging into some forecast examples; most of my reading list is articles by you. So, in this example, when I call the tsoutliers function I get

    tsoutliers(melsyd[,3])

    $index

    integer(0)

    $replacements

    numeric(0)

    So, did I miss something?

    • There are no outliers reported.

      • Jhonatas Kleinkauff

        But Rob, isn’t this the same data set that you are using in your example?

        • Oops. Sorry. Some changes I made in the way the function works must have destroyed that example. I’ll investigate.

          • I’ve tweaked the algorithm so this now works again. New version is on github, and will be pushed to CRAN in the next few weeks. Thanks for letting me know.

          • Jhonatas Kleinkauff

            Thank you Rob! Like I said, I’m new to this world and your material is helping me a lot.

  • Nilabhra Banerjee

    Hi Dr Hyndman, can this be extended to multiple seasonality?
    For example, I have hourly sales data for each day. On weekdays, sales are lower than on weekends. But a problem occurs when there is a holiday on a weekday: the hourly sales increase substantially, which influences the forecast for the next few days’ sales. So I need to put in some meaningful values as replacements for the holiday sales to continue forecasting.

  • Meri Andani

    Hi Dr. Hyndman. I am new to this area, and thanks a lot for your articles, from which I have learned a lot. Regarding the forecast package, is its procedure for detecting outliers the same as the procedure in the tsoutliers package (Chen and Liu, 1993)?
    I mean, can it detect IO, AO, TC, LS and their combinations (using regARIMA), or does it apply only to some types of outliers? Thank you.

    • No. The tsoutliers package has much more extensive facilities for outliers.

      • Meri Andani

        Thank you, Dr. Hyndman