Forecast estimation, evaluation and transformation

I’ve had a few emails lately about forecast evaluation and estimation criteria. Here is one I received today, along with some comments.

I have a rather simple question regarding the use of MSE as opposed to MAD and MAPE. If the parameters of a time series model are estimated by minimizing the MSE, why do we evaluate the model using some other metric, e.g., MAD or MAPE? I can see that MAPE is not scale-dependent. But MAPE is a percentage version of MAD. So why don't we use the percentage version of MSE?

MSE (mean squared error) is not scale-free. If your data are in dollars, then the MSE is in squared dollars. Often you will want to compare forecast accuracy across a number of time series having different units. In this case, MSE makes no sense. MAE (mean absolute error) is also scale-dependent and so cannot be used for comparisons across series of different units. The MAD (mean absolute deviation) is just another name for the MAE.

The MAPE (mean absolute percentage error) is not scale-dependent and is often useful for forecast evaluation. However, it has a number of limitations. For example,

  1. If the data contain zeros, the MAPE can be infinite as it involves division by zero. If the data contain very small numbers, the MAPE can be huge (see the sketch after this list).
  2. The MAPE assumes that percentages make sense; that is, that the zero on the scale of the data is meaningful. When forecasting widgets, this is ok. But when forecasting temperatures in degrees Celsius or Fahrenheit it makes no sense. The zero on these temperature scales is relatively arbitrary, and so percentages are meaningless.
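
To make the first limitation concrete, here is a small sketch with made-up numbers (mine, not from any real series): forecasts that are each off by one unit give a harmless MAPE on data around 100, but an enormous one on data near zero.

actual1 <- c(100, 102, 98, 101)
fc1     <- c(101, 101, 99, 102)              # each forecast off by 1
mean(abs((actual1 - fc1) / actual1)) * 100   # MAPE of about 1%

actual2 <- c(0.10, 0.20, 0.15, 0.10)         # near-zero data
fc2     <- c(1.10, 1.20, 1.15, 1.10)         # again off by 1 each time
mean(abs((actual2 - fc2) / actual2)) * 100   # MAPE of almost 800%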

It is possible to define a percentage version of the MSE, the mean squared percentage error (MSPE), but it isn't used very often.

The MASE (mean absolute scaled error) was intended to avoid these problems.
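
To recall its definition for non-seasonal data: each error is scaled by the in-sample MAE of the naive forecast, q_t = e_t \big/ \left( \frac{1}{n-1} \sum_{i=2}^{n} |y_i - y_{i-1}| \right), and \text{MASE} = \text{mean}(|q_t|). Because the numerator and denominator are on the same scale, the result is unit-free, and it remains finite when the data contain zeros.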

For further discussion on these and related points, see Hyndman & Koehler (IJF, 2006). A preprint version is also available.

Also, suppose we have a lognormal model, where the estimation is done on the log-transformed scale and the prediction is done on the original, untransformed scale. One could either predict with the conditional mean or the conditional median. It seems to me that you would predict with the mean if the MSE is your metric, but you would predict with the median if the MAD is your metric. My thought is that the mean would minimize MSE, while the median would minimize MAD. So whether you use the mean or the median depends on which metric you use for evaluating the model.

In most cases, the mean and the median will coincide on the transformed scale because the transformation should have produced a roughly symmetric error distribution. I would usually estimate by minimizing the MSE because it is more efficient (assuming the errors look normal). Minimizing the MAD can help if there are outliers, but I would prefer to deal with them explicitly.

When forecasting on the original, untransformed scale, the simple thing to do is to back-transform the forecasts (and the prediction interval limits). The point forecasts will then be the conditional median (assuming symmetry on the transformed scale), and the prediction interval will still have the desired coverage.

To get the conditional mean on the original scale, it is necessary to adjust the point forecasts. If X is the variable on the log-scale and Y = e^X is the variable on the original scale, then \text{E}(Y) = e^{\mu + \sigma^2/2} where \mu is the point forecast on the log-scale and \sigma^2 is the forecast variance on the log-scale. The prediction interval remains unchanged whether you use a conditional mean or conditional median for the point forecast.
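
Here is a minimal sketch of the adjustment in R; the data set, model and 95% level are illustrative assumptions, and the forecast variance is backed out of the interval width assuming approximately normal errors on the log scale.

library(forecast)

fit <- ets(log(AirPassengers))             # estimate on the log scale
fc  <- forecast(fit, h = 24, level = 95)

mu     <- fc$mean                          # point forecasts on the log scale
sigma2 <- ((fc$upper[, 1] - mu) / qnorm(0.975))^2   # implied forecast variance

median_fc <- exp(mu)                       # conditional median on the original scale
mean_fc   <- exp(mu + sigma2 / 2)          # bias-adjusted conditional mean

lower <- exp(fc$lower[, 1])                # interval limits back-transform directly
upper <- exp(fc$upper[, 1])                # and keep their coverage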

Occasionally, there may be some reason to prefer a conditional mean point forecast; for example, if you are forecasting a number of related products and you need the point forecasts to sum to the forecast of the total (means are additive across series, whereas medians generally are not). But in most situations, the conditional median will be suitable.

In R, the plot.forecast() function (from the forecast package) will back-transform point forecasts and prediction intervals using an inverse Box-Cox transformation. Just include the argument lambda. For example:

fit <- ar(BoxCox(lynx, 0.5))               # fit an AR model to the transformed series
plot(forecast(fit, h = 20), lambda = 0.5)  # back-transform forecasts and intervals for plotting

Comments:

  • devi

    Respected Sir,
    I need to know whether nonlinear models are suitable for forecasting or not, and how to identify nonlinearities in a time series using mutual information.

    • Rob J Hyndman

      Some nonlinear models provide good forecasts for some data sets. Identification of nonlinearity is a complicated topic. See Fan and Yao (2005) for a good survey of the area.

  • devi

    Respected Sir,
    My question is: what is mutual information? (I understood that it gives some idea about linear as well as nonlinear dependencies.) But how can I interpret it in a plot? That is, how can I differentiate nonlinear from linear dependencies?

  • Chewyraver

    Thank you for sharing! When I first started exploring the forecasting world, I was very confused about which loss function to use; it almost seemed that it didn't matter which one was used. Showing the difference between loss functions and providing a simple method of selection was part of my honours thesis.

  • zbicyclist

    Thanks. This is the clearest explanation of the log adjustment I’ve ever read.

    A bit of humor: when I first ran into this, I saw it written as e^(u+1/2s^2), which is ambiguous. Since at the time I was running a large analytic group (>75 professionals), I went to the subgroup that typically did log models and scenarios (more what-ifs than forecasts) and asked them what the expression meant. A sampling of opinions:

    (1) what?
    (2) (s^2) / 2
    (3) 1 / (2s^2)
    (4) I don’t use that. I compute the average bias ratio above/below the mean on the modeled observations and use those factors to correct the what-if forecasts.

    Answer #4 turned out to work pretty well in practice, and since that answer came from the statistician who’d written the production code, we stayed with that.

  • Dr. Abed

    I do not know why version 2.16 of the forecast package in R does not produce Theil's U. I really appreciate your efforts.

    • Rob J Hyndman

      It does include it. Use the accuracy() function.
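
      For example, with a held-out test set (traindata and testdata here are just placeholders):

      fc <- forecast(ets(traindata), h = length(testdata))
      accuracy(fc, testdata)   # the test-set row of the output includes Theil's U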

  • Tom Shelton

    Dear Mr. Hyndman

    I am a novice in R but somewhat knowledgeable about forecasting (however, I am not a mathematician). I am attempting to forecast from 3 years of historic traffic data that has strong day-of-week as well as weekly seasonal patterns. I've been able to generate reasonable forecasts with the stlf() function using a frequency of 52 (weekly), with relatively good MAPE values of 6-16%. However, last week I re-ran some scripts (no script changes) on the same data set that I used for the previous runs several weeks ago. I am now finding that the decimal point in the MAPE from the accuracy output has shifted two places. For instance, a previous run gave me a MAPE of 9.xxxx%; another example went from 16.xxxx% in a previous run to 1666.xxxx for the current run. All the values in both the summary and accuracy output are the same, except the decimal point seems to have shifted in the MAPE. What am I doing wrong, or has there been a change in the forecast package? Are there other programs/packages that could potentially be interfering?

    Thank you

    Tom Shelton/Berlin

    • Rob J Hyndman

      It looks like a bug. In version 4.05, I completely rewrote the accuracy() function. Unfortunately, the MAPE and MPE are now 100 times too large. I'll fix it in the next version.

      • Tom Shelton

        Thank you very much for your quick reply. When do you think the next version will be released?

        • Rob J Hyndman

          Hopefully today or tomorrow.