Errors on percentage errors

The MAPE (mean absolute percentage error) is a popular measure for forecast accuracy and is defined as
$$\text{MAPE} = 100\,\text{mean}(|y_t - \hat{y}_t|/|y_t|)$$
where $y_t$ denotes an observation and $\hat{y}_t$ denotes its forecast, and the mean is taken over $t$.
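
The definition translates directly to code; here is a minimal sketch (Python for illustration, though the surrounding discussion is mostly about R):

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error, as defined above."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return 100 * np.mean(np.abs(y - yhat) / np.abs(y))

print(mape([150], [100]))  # 33.33...: an absolute error of 50 on an actual of 150
print(mape([100], [150]))  # 50.0: the same absolute error of 50 on an actual of 100
```

The two calls preview the asymmetry discussed below: the same absolute error yields a different percentage error depending on which side of the actual the forecast falls.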

Armstrong (1985, p.348) was the first (to my knowledge) to point out the asymmetry of the MAPE, saying that “it has a bias favoring estimates that are below the actual values”. A few years later, Armstrong and Collopy (1992) argued that the MAPE "puts a heavier penalty on forecasts that exceed the actual than those that are less than the actual". Makridakis (1993) took up the argument, saying that "equal errors above the actual value result in a greater APE than those below the actual value". He provided an example where $y_t=150$ and $\hat{y}_t=100$, so that the relative error is $50/150=0.33$, in contrast to the situation where $y_t=100$ and $\hat{y}_t=150$, when the relative error would be $50/100=0.50$.

Thus, the MAPE puts a heavier penalty on negative errors (when $y_t < \hat{y}_t$) than on positive errors. This is what is stated in my textbook. Unfortunately, Anne Koehler and I got it the wrong way around in our 2006 paper on measures of forecast accuracy, where we said the heavier penalty was on positive errors. We were probably thinking that a forecast that is too large is a positive error. However, forecast errors are defined as $y_t - \hat{y}_t$, so positive errors arise only when the forecast is too small.

To avoid the asymmetry of the MAPE, Armstrong (1985, p.348) proposed the "adjusted MAPE", which he defined as
$$\overline{\text{MAPE}} = 100\,\text{mean}\big(2|y_t - \hat{y}_t|/(y_t + \hat{y}_t)\big)$$
By that definition, the adjusted MAPE can be negative (if $y_t+\hat{y}_t < 0$), or infinite (if $y_t+\hat{y}_t=0$), although Armstrong claims that it has a range of (0,200). Presumably he never imagined that data and forecasts can take negative values. Strangely, there is no reference to this measure in Armstrong and Collopy (1992).
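
A numerical check of that claim (a hypothetical Python sketch; the negative values are artificial, chosen only to break the claimed (0,200) range):

```python
import numpy as np

def adjusted_mape(y, yhat):
    """Armstrong's adjusted MAPE: signed (not absolute) denominator."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return 100 * np.mean(2 * np.abs(y - yhat) / (y + yhat))

# A negative observation makes the "percentage" negative,
# contradicting the claimed (0, 200) range:
print(adjusted_mape([-2.0], [1.0]))  # 100 * 2*3 / (-1) = -600
```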

Makridakis (1993) proposed almost the same measure, calling it the “symmetric MAPE” (sMAPE), but without crediting Armstrong (1985), defining it as
$$\text{sMAPE} = 100\,\text{mean}\big(2|y_t - \hat{y}_t|/|y_t + \hat{y}_t|\big)$$
However, in the M3 competition paper by Makridakis and Hibon (2000), sMAPE is defined equivalently to Armstrong’s adjusted MAPE (without the absolute values in the denominator), again without reference to Armstrong (1985). Makridakis and Hibon claim that this version of sMAPE has a range of (-200,200).

Flores (1986) proposed a modified version of Armstrong’s measure, defined as exactly half of the adjusted MAPE defined above. He claimed (again incorrectly) that it had an upper bound of 100.

Of course, the true range of the adjusted MAPE is $(-\infty,\infty)$ as is easily seen by considering the two cases $y_t+\hat{y}_t = \varepsilon$ and $y_t+\hat{y}_t = -\varepsilon$, where $\varepsilon>0$, and letting $\varepsilon\rightarrow0$. Similarly, the true range of the sMAPE defined by Makridakis (1993) is $(0,\infty)$. I’m not sure that these errors have previously been documented, although they have surely been noticed.
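
The $\varepsilon$ argument is easy to verify numerically (a Python sketch; the values are artificial, holding $|y_t-\hat{y}_t|=2$ fixed while the denominator shrinks):

```python
import numpy as np

def adjusted_mape(y, yhat):
    """Armstrong's adjusted MAPE: signed denominator."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return 100 * np.mean(2 * np.abs(y - yhat) / (y + yhat))

for eps in (1.0, 0.1, 0.01):
    y, yhat = 1 + eps / 2, -1 + eps / 2        # so that y + yhat = eps
    print(eps, adjusted_mape([y], [yhat]))     # equals 400/eps, diverging to infinity
# Flipping the signs (so y + yhat = -eps) sends the measure to minus infinity instead.
```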

Goodwin and Lawton (1999) point out that on a percentage scale, the MAPE is symmetric and the sMAPE is asymmetric. For example, if $y_t =100$, then $\hat{y}_t=110$ gives a 10% error, as does $\hat{y}_t=90$. Either would contribute the same increment to MAPE, but a different increment to sMAPE.
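
The Goodwin–Lawton point can be checked directly (Python sketch, using their example):

```python
# Actual is 100; forecasts of 90 and 110 are both 10% off on the MAPE scale.
y = 100.0
for f in (90.0, 110.0):
    ape = 100 * abs(y - f) / abs(y)           # contribution to MAPE
    sape = 100 * 2 * abs(y - f) / abs(y + f)  # contribution to sMAPE
    print(f, ape, sape)
# f = 90  -> APE 10.0, sAPE 10.53...  (the under-forecast is penalised more)
# f = 110 -> APE 10.0, sAPE  9.52...
```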

Anne Koehler (2001), in a commentary on the M3 competition, made the same point, but without reference to Goodwin and Lawton.

Whether symmetry matters or not, and whether we want to work on a percentage or absolute scale, depends entirely on the problem, so these discussions over (a)symmetry don’t seem particularly useful to me.

Chen and Yang (2004), in an unpublished working paper, defined the sMAPE as
$$\text{sMAPE} = \text{mean}\big(2|y_t - \hat{y}_t|/(|y_t| + |\hat{y}_t|)\big).$$
They still called it a measure of "percentage error" even though they dropped the multiplier 100. At least they got the range correct, stating that this measure has a maximum value of two when either $y_t$ or $\hat{y}_t$ is zero, but is undefined when both are zero. The range of this version of sMAPE is (0,2). Perhaps this is the definition that Makridakis and Armstrong intended all along, although neither has ever managed to include it correctly in one of their papers or books.
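
A quick check of the Chen and Yang version's bounds (Python sketch; the values are artificial):

```python
import numpy as np

def smape_cy(y, yhat):
    """Chen & Yang's sMAPE: absolute values in the denominator, no factor of 100."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return np.mean(2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

print(smape_cy([0.0], [5.0]))     # 2.0: the maximum, reached when one value is zero
print(smape_cy([-3.0], [3.0]))    # also 2.0, even with a negative observation
print(smape_cy([100.0], [90.0]))  # 0.105..., comfortably inside (0, 2)
```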

As will be clear by now, the literature on this topic is littered with errors. The Wikipedia page on sMAPE contains several as well, which a reader might like to correct.

If all data and forecasts are non-negative, then the same values are obtained from all three definitions of sMAPE. But more generally, the last definition above from Chen and Yang is clearly the most sensible, if the sMAPE is to be used at all. In the M3 competition, all data were positive, but some forecasts were negative, so the differences are important. However, I can’t match the published results for any definition of sMAPE, so I’m not sure how the calculations were actually done.
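
To see how far the three definitions can disagree when a forecast goes negative, a toy example (Python sketch; the numbers are artificial, not from M3):

```python
import numpy as np

y, f = np.array([10.0]), np.array([-12.0])  # positive actual, negative forecast

armstrong = 100 * np.mean(2 * np.abs(y - f) / (y + f))            # signed denominator
mak1993 = 100 * np.mean(2 * np.abs(y - f) / np.abs(y + f))        # |y + yhat| denominator
chen_yang = np.mean(2 * np.abs(y - f) / (np.abs(y) + np.abs(f)))  # |y| + |yhat|, no 100

print(armstrong, mak1993, chen_yang)  # -2200.0, 2200.0, 2.0
```

With non-negative data and forecasts, all three lines would agree (up to the factor of 100); one negative forecast is enough to separate them completely.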

Personally, I would much prefer that either the original MAPE be used (when it makes sense), or the mean absolute scaled error (MASE) be used instead. There seems little point using the sMAPE except that it makes it easy to compare the performance of a new forecasting algorithm against the published M3 results. But even there, it is not necessary, as the forecasts submitted to the M3 competition are all available in the Mcomp package for R, so a comparison can easily be made using whatever measure you prefer.
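
For completeness, the MASE scales the out-of-sample MAE by the in-sample MAE of a (seasonal) naive forecast; a minimal sketch (the function name and interface are my own, not from any package):

```python
import numpy as np

def mase(y_train, y_test, yhat, m=1):
    """Mean absolute scaled error: out-of-sample MAE divided by the
    in-sample MAE of the seasonal naive forecast (m = 1 gives the naive)."""
    y_train = np.asarray(y_train, dtype=float)
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(np.asarray(y_test) - np.asarray(yhat))) / scale

# scale = mean(|2-1|, |4-2|) = 1.5; test error = |5-2| = 3; MASE = 3/1.5 = 2.0
print(mase([1, 2, 4], [5], [2]))
```

Being scale-free without dividing by individual observations, it avoids the zero-denominator problems that plague all the percentage-based measures above.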

Thanks to Andrey Kostenko for alerting me to the different definitions of sMAPE in the literature.

  • Matt

    I’d like a better understanding of how the heavier penalty that MAPE puts on over-forecasting is relevant for forecast evaluation and model selection.

    In some sense, I don’t see the asymmetry: if we hold the actual value fixed, MAPE for over-forecasting and under-forecasting of the same absolute magnitude will be the same. E.g. for an actual value of 100, forecasts of 50 and 150 give equivalent MAPE (50%). Doesn’t this imply that, given an expected value for the actual observation over the forecast horizon, MAPE treats over- and under-forecasting equally whenever the magnitude of the forecast error is the same?

    We only get the asymmetry, it seems, if we hold the magnitude of forecast error the same and vary the expected value for the actuals, which doesn’t seem practically relevant.

    It’s not true, in other words, that you can “cheat” by low-balling a forecast in order to improve forecast MAPE; as long as that’s the case, what is the problem with using it, as it’s not going to favor models that under forecast over those that over forecast? (I’m assuming here that we don’t need to worry about intermittent demand.)

    Any direction here would be most appreciated; your blog has been an invaluable resource in my business forecasting education.

    • I agree that it makes more sense to consider the case where the actual stays the same and the forecasts vary, because we can’t change the actuals, only the forecasts.

      • Matt

        Thanks, good to get some clarity here. It would be a shame to avoid a simple metric like MAPE based on a misunderstanding. MASE is helpful too, though in some cases one won’t have a naive forecast to work with (e.g. for the first period of a new product’s sales).

  • Matt

    I should add (and this is from your Armstrong reference) that it’s true that under-forecasting has a maximum MAPE of 100% (in the case where the forecast is always zero), whereas over-forecasting has no upper bound; this is assuming that the forecast is always positive, of course. This still seems to have limited significance for the question of whether one should use MAPE in assessing forecasts, provided that zero forecasts are not common in practice.

    • It’s zero (or very small) actuals that are the issue, not zero forecasts. They come up a lot; e.g., if you are trying to predict stock returns.

      • Matt

        Absolutely right, that was a slip on my part.

  • Chad Scherrer

    For most applications of this, the values are positive, and it makes sense to either use a model with a log link (as in a GLM) or to just log-transform the response. So is there any reason to prefer MAPE over some statistic (MSE or MAE, perhaps) of the residuals on the log scale? If the big deal is having them as percentages, I guess you could do something weird like use a base 1.01 for the log. Still seems more sensible and less arbitrary than MAPE, which has no connection to the loss function of any model I’ve ever seen.

  • edyhsgr

    Why is MAPE typically used instead of Median Absolute Percent Error? Is MAPE better?

  • Luis

    Hi Rob, I would like to know if the function accuracy() works with bats(). I’m trying to use it but I got some errors.

  • Simon

    I am no mathematician, but some time ago, in wrestling with this problem, I modified this statistic to nMAPE (for normalised MAPE), where the divisor becomes the maximum of the actual and the forecast.

    • Adam

      I recently started thinking about doing this as well. From what I can tell, this is also symmetric (using the example above, abs(150-100)/150 = 0.33 and abs(100-150)/150 = 0.33), and what I like about it is that it is bounded between (0,1), or (0,100) if you multiply by 100 (for positive measurements, such as in my use case). For me this is an intuitive bound for error. What has been your experience with this? Is there any literature to support this?

  • cmos

    In the original paper by Makridakis and also in the M-3 paper, the denominator of the sMAPE is multiplied by 2, whereas in your blog post the numerator is multiplied by 2. Additionally, Makridakis (1993) nowhere mentions the term “sMAPE”. This term is only used in the M-3 paper.

    • 1. No it isn’t. In the two papers you mention, the denominator is DIVIDED by 2 which is equivalent to multiplying the numerator by 2.
      2. Yes, Makridakis didn’t use the acronym “sMAPE” in 1993. That came later.

      • cmos

        You’re right. I read that wrong.

  • Himani Wadhwa

    Hi Rob!

    What can be the expected value of MAPE for a dataset having nearly 50 observations? I am getting an error of around 33%. Is that fine? Also, how should I proceed if I want to reduce the error?

    • The number of observations is not as important as how predictable the thing you are interested in is. A good MAPE is one that is better than what everyone else gets for the same forecast objective.

  • randomdude

    Hi Rob,
    could you give me some advice on how to calculate the MASE for time-series with multiple seasonalities.
    Thanks a lot for your input!

    • The only issue is how to choose the base forecast method used in the scaling factor. I suggest you pick the shortest of the seasonal periods and use it with a seasonal naive scaling factor.

  • quantweb

    Rob, what is your position on metaselection (“selection of model selection methods”)? Especially if one can only calculate data-dependent measures like MAPE or MASE (not being able to calculate BIC or AIC because the models are from different classes). Thanks!

    • When AIC is unavailable, I tend to use time series cross-validation.

      • quantweb

        Thanks Rob. When I said MAPE or MASE, I meant as out-of-sample errors. So I was thinking of using them as a model selection strategy and making them “compete”. I am trying to improve model selection before using any out-of-sample forecast error bound.

  • Vikram Murthy

    Mr Hyndman, thanks for the post, but the accuracy calculation (for MAPE, MAE et al.) ends with an “Inf” even if one of the values in the data series is a 0. I do understand that, according to the formula, one of the errors would result in a divide by 0, but by adding this Inf to all the other errors, using MAPE becomes quite problematic. This is handled well for MASE, but try explaining MASE to a management that’s been using MAPE for 10 years and swears by it 🙂. Can we have some sort of mechanism to handle 0’s in the time series? Appreciate all the work you have put into the packages!

  • Branko Radovanovic

    The trouble with MAPE is not only that it’s asymmetric, but also that it may distort the evaluation by making a worse forecast appear better. An example:

    A (Actual observations) = 100, 1000
    Fa (Forecast A) = 200, 1100
    Fb (Forecast B) = 100, 500

    MAPE(A, Fa) = 55
    MAPE(A, Fb) = 25

    By MAPE alone, it seems that forecast B is better than forecast A. However, in most cases (load forecasting, sales forecasting), the real-life cost of a forecast error is proportional to the absolute value of the residual, and forecast A is actually much better.

    If we adopt a different metric:

    $\text{APE} = 100 \sum_t |y_t - \hat{y}_t| / \sum_t y_t$

    …it’s an entirely different story:

    APE(A, Fa) = 200 / 1100 = 18.2
    APE(A, Fb) = 500 / 1100 = 45.5

    Now, I’m calling it APE, but I don’t really know what it’s called – it’s like WAPE without the weights (i.e. with all weights set to 1), so APE seems reasonable. It’s simple and easy to interpret, scale-invariant, and works fine with zeroes, values close to zero, and negative values too (if the denominator is changed to $\sum_t |y_t|$).

    There’s MASE, of course, but I’d hate to explain how MASE works to a client – I’d much rather go with APE. (For that purpose, I’d skip over RMSE too – if one manages to reduce RMSE by, say, 10%, how much will it save in the actual costs? Well, if the cost is more or less directly proportional to absolute error – and it typically is – then it’s anyone’s guess.)

    A strange thing about MASE is that, if the observations are shuffled (i.e. if they occur in a different order), and the forecasts are rearranged accordingly, then MASE changes, possibly by a large margin. That seems somewhat counter-intuitive: if we have two forecasts with the same pairs of observations and predictions $\{y_t, \hat{y}_t\}$, only in a different order, then one would expect them to be evaluated as the same (although, to be fair, this is really a theoretical rather than practical consideration).

    I understand that when percentage (rather than absolute) errors are important, MAPE may be better – but in other cases (that’s nine times out of ten, I’d say), why isn’t e.g. APE more widely used? (In accuracy() too, perhaps?) Is there a reason to pick MASE over APE?