
The best forecast error metric

Published by Joannes Vermorel.

Many metrics are available to assess the performance of a forecast.

In this post, we will try to address the question of the 'best' forecasting metric. It turns out to be simpler than most practitioners would expect.

Among those, MAE and MAPE are probably the most widely used metrics by practitioners, both in retail and in manufacturing. Let's start by having a look at the graphs of those two metrics.

Figure: plot of the Mean Absolute Error. X = actual value (forecast fixed at 1), Y = error.

The behavior of the MAE is reasonably straightforward. The one tricky aspect - from a mathematical viewpoint - is that the function is not differentiable everywhere (not at x=1 in the example above).
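As a quick illustration of the plot above, here is a minimal sketch in Python, with the forecast fixed at 1 as in the figure:

```python
def absolute_error(actual: float, forecast: float = 1.0) -> float:
    """Absolute error for a single point; averaging over many points gives the MAE."""
    return abs(actual - forecast)

# The error grows linearly on both sides of the forecast, with a kink
# (the non-differentiable point) exactly at actual == forecast == 1.
print(absolute_error(0.5))  # 0.5
print(absolute_error(1.0))  # 0.0
print(absolute_error(2.0))  # 1.0
```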

Figure: plot of the Mean Absolute Percentage Error. X = actual value (forecast fixed at 1), Y = error.

The MAPE, however, is a lot more convoluted. Indeed, the behavior differs sharply between over- and under-forecasts: the under-forecast error is capped at 1 (that is, 100%), whereas the over-forecast error tends to infinity as the actual value approaches zero.
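The asymmetry is easy to reproduce with a minimal sketch (again with the forecast fixed at 1):

```python
def absolute_percentage_error(actual: float, forecast: float = 1.0) -> float:
    """Absolute percentage error for a single point; averaging gives the MAPE."""
    return abs(actual - forecast) / actual

# Under-forecast (actual above the forecast): the error stays below 1 (100%).
print(absolute_percentage_error(10.0))   # 0.9
print(absolute_percentage_error(100.0))  # 0.99
# Over-forecast (actual close to zero): the error explodes.
print(absolute_percentage_error(0.01))   # 99.0
```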

This latter aspect in particular tends to wreak havoc when combined with out-of-stock (OOS) events. Indeed, OOS events generate very low actual sales values, hence potentially very high MAPE values.

In practice, we suggest thinking twice before opting for the MAPE, as interpreting the results is likely to be a challenge in itself.

The best metric should be expressed in Dollars or Euros

From a mathematical perspective, some metrics (such as the L2 norm) are considered more practical for statistical analysis (because they are differentiable, for example); however, we believe that this viewpoint is moot when facing real business situations.

The one and only unit to be used to assess the performance of a forecast should be money. Forecasts are always wrong, and the only reasonable way to quantify the error consists of assessing how much money the delta between forecast and reality cost the company.

Modeling business costs

In practice, defining such an ad-hoc cost function requires a careful examination of the business, triggering questions such as the following (a rough sketch of such a cost function is given after the list):

  • How much does inventory cost?
  • How much inventory obsolescence should be expected?
  • How much does stock-out cost?
  • ...
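As a rough illustration, here is a minimal sketch of such a cost function; the cost parameters below are hypothetical placeholders, not recommendations:

```python
def dollar_error(forecast_units: float, actual_units: float,
                 stockout_cost_per_unit: float = 12.0,  # hypothetical lost margin per missing unit
                 holding_cost_per_unit: float = 2.0     # hypothetical carrying + obsolescence cost per excess unit
                 ) -> float:
    """Forecast error expressed in money rather than in units or percents."""
    if forecast_units >= actual_units:
        # Over-forecast: the excess units sit in inventory and may become obsolete.
        return (forecast_units - actual_units) * holding_cost_per_unit
    # Under-forecast: the missing units translate into stockouts and lost margin.
    return (actual_units - forecast_units) * stockout_cost_per_unit

print(dollar_error(forecast_units=120, actual_units=100))  # 40.0 (overstock cost)
print(dollar_error(forecast_units=80, actual_units=100))   # 240.0 (stockout cost)
```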

As far as company politics are concerned, modeling the forecast error as, say, a percentage, hence ignoring all those troublesome questions, has the one advantage of being neutral - leaving the rest of the company with the burden of actually translating the forecast into a course of action.

The process of establishing a sensible cost function is not rocket science; however, it forces the entity in charge of the forecasts to write down all those costs explicitly. By doing so, choices are made that do not benefit every division of the company, but that clearly benefit the company itself.

Shameless plug: Lokad can help your company in this process.


Sparsity: when accuracy measure goes wrong

Published by Joannes Vermorel.

Three years ago, we published Overfitting: when accuracy measure goes wrong; however, overfitting is far from being the only situation where simple accuracy measurements can be very misleading. Today, we focus on another very error-prone situation: intermittent demand, which is typically encountered when looking at sales at the store level (or in ecommerce).

We believe that this single problem alone has prevented most retailers from moving toward advanced forecasting systems at the store level. As with most forecasting problems, it's subtle, it's counterintuitive, and some companies charge a lot to bring poor answers to the question.

Illustration of intermittent sales

The most popular error metrics in sales forecasting are the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE). As a general guideline, we suggest sticking with the MAE, as the MAPE behaves very poorly whenever time-series are not smooth - that is, all the time, as far as retailers are concerned. However, there are situations where the MAE too behaves poorly. Low sales volumes fall into those situations.

Let's review the illustration above. We have an item sold over 3 days. The number of units sold over the first two days is zero. On the third day, one unit gets sold. Let's assume that the demand is, in fact, exactly 1 unit every 3 days on average. Technically speaking, it's a Poisson distribution with λ=1/3.

In the following, we compare two forecasting models:

  • a flat model M at 1/3 every day (the mean).
  • a flat model Z at zero every day.

As far as inventory optimization is concerned, the model zero (Z) is downright harmful. Assuming that a safety stock analysis will be used to compute a reorder point, a zero forecast is very likely to produce a reorder point at zero too, causing frequent stockouts. An accuracy metric that favors the model zero over more reasonable forecasts is behaving rather poorly.

Let's review our two models against the MAPE (*) and the MAE.

  • M has a MAPE of 44%.
  • Z has a MAPE of 33%.
  • M has a MAE of 0.44.
  • Z has a MAE of 0.33.

(*) The classic definition of the MAPE involves a division by zero when the actual value is zero. We assume here that the actual value is replaced by 1 when it is zero. Alternatively, we could also have divided by the forecast (instead of the actual value), or used the sMAPE. Those changes make no difference: the conclusion of the discussion remains the same.
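These figures can be reproduced with a few lines of Python (a minimal sketch; zero actuals are replaced by 1 for the MAPE, as per the footnote above):

```python
actuals = [0, 0, 1]          # one unit sold every 3 days
model_m = [1/3, 1/3, 1/3]    # flat forecast at the mean
model_z = [0, 0, 0]          # flat forecast at zero

def mae(forecasts, actuals):
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)

def mape(forecasts, actuals):
    # Zero actuals are replaced by 1, as per the footnote above.
    return sum(abs(f - a) / max(a, 1) for f, a in zip(forecasts, actuals)) / len(actuals)

print(mae(model_m, actuals), mae(model_z, actuals))    # 0.44... vs 0.33...
print(mape(model_m, actuals), mape(model_z, actuals))  # 0.44... vs 0.33...
```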

In conclusion, according to both the MAPE and the MAE, the model zero prevails here.

However, one might argue that this is a simplistic situation which does not reflect the complexity of a real store. This is not entirely true. We have performed benchmarks over dozens of retail stores, and usually the winning model (according to MAE or MAPE) is the model zero - the model that always returns zero. Furthermore, this model typically wins by a comfortable margin over all the other models.

In practice, at the store level, relying either on the MAE or the MAPE to evaluate the quality of forecasting models is asking for trouble: the metric favors models that return zeroes; the more zeroes, the better. This conclusion holds for just about every store we have analyzed so far (minus the few high-volume items that do not suffer from this problem).

Readers who are familiar with accuracy metrics might propose to go instead for the Mean Square Error (MSE), which does not favor the model zero. This is true; however, the MSE, when applied to erratic data - and sales at the store level are erratic - is not numerically stable. In practice, any outlier in the sales history will vastly skew the final results. This sort of problem is THE reason why statisticians have been working so hard on robust statistics in the first place. No free lunch here.
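On the same three-day example, the MSE indeed ranks the two models the other way around (a quick check, reusing the lists defined in the sketch above):

```python
def mse(forecasts, actuals):
    return sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)

# The MSE favors the mean model over the zero model...
print(mse(model_m, actuals))  # 0.22...
print(mse(model_z, actuals))  # 0.33...
# ...but squaring the errors also means that a single outlier day
# (say, one bulk purchase) can dominate the final score.
```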

How to assess store level forecasts then?

It took us a long, long time to figure out a satisfying solution to the problem of quantifying the accuracy of forecasts at the store level. Back in 2011 and before, we were essentially cheating. Instead of looking at daily data points, when the sales data was too sparse, we were typically switching to weekly aggregates (or even to monthly aggregates for extremely sparse data). By switching to longer aggregation periods, we were artificially increasing the sales volumes per period, hence making the MAE usable again.
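As an illustration of that workaround, here is a minimal sketch assuming a pandas Series of daily sales indexed by date (the sales values below are made up):

```python
import pandas as pd

# Hypothetical daily sales history: sparse, mostly zeroes.
daily_sales = pd.Series(
    [0, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 1],
    index=pd.date_range("2012-01-02", periods=14, freq="D"),
)

# Aggregating to weekly totals inflates the per-period volumes,
# which makes the MAE usable again - at the cost of temporal resolution.
weekly_sales = daily_sales.resample("W").sum()
print(weekly_sales)
```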

The breakthrough came only a few months ago through quantiles. In essence, the enlightenment was: forget the forecasts, only reorder points matter. By trying to optimize our classic forecasts against metrics X, Y or Z, we were trying to solve the wrong problem.

Wait! Since reorder points are computed based on the forecasts, how could you say forecasts are irrelevant?

We are not saying that forecasts and forecast accuracy are irrelevant. However, we are stating that only the accuracy of the reorder points themselves matters. The forecast, or whatever other variable is used to compute the reorder points, cannot be evaluated on its own. Only the accuracy of the reorder points needs to be - and should be - evaluated.

It turns out that a metric to assess reorder points exists: it's the pinball loss function, a function that has been known to statisticians for decades. The pinball loss is vastly superior not because of its mathematical properties, but simply because it fits the inventory tradeoff: too much stock vs. too many stockouts.
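For reference, here is a minimal sketch of the pinball loss for a single observation; the quantile level tau=0.95 below is an illustrative choice, not a recommendation:

```python
def pinball_loss(actual: float, reorder_point: float, tau: float) -> float:
    """Pinball (quantile) loss at quantile level tau. Understocking and
    overstocking are penalized asymmetrically, mirroring the inventory tradeoff."""
    if actual >= reorder_point:
        return tau * (actual - reorder_point)        # demand exceeded the stock: stockout side
    return (1.0 - tau) * (reorder_point - actual)    # stock exceeded the demand: overstock side

# With tau=0.95, missing one unit of demand costs 19x more than holding one extra unit.
print(pinball_loss(actual=3, reorder_point=2, tau=0.95))  # 0.95
print(pinball_loss(actual=2, reorder_point=3, tau=0.95))  # 0.05
```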


Measuring forecast accuracy

Published by Joannes Vermorel.

Most engineers will tell you that:

You can't optimize what you don't measure

It turns out that forecasting is no exception. Measuring forecast accuracy is one of the cornerstones of any forecasting technology.

A frequent misconception about accuracy measurement is that Lokad has to wait for the forecast horizon to pass before comparing the forecasts with what really happened.

Although this approach works to some extent, it comes with severe drawbacks:

  • It's painfully slow: a 6-month-ahead forecast takes 6 months to be validated.
  • It's very sensitive to overfitting. Overfitting should not be taken lightly, as it is one of the few things very likely to wreak havoc in your accuracy measurements.

Measuring the accuracy of delivered forecasts is a tough piece of work for us. Accuracy measurement accounts for roughly half of the complexity of our forecasting technology: the more advanced the forecasting technology, the greater the need for robust accuracy measurements.

In particular, Lokad returns the forecast accuracy associated with every single forecast that we deliver (for example, our Excel add-in reports forecast accuracy). The metric used for accuracy measurement is the MAPE (Mean Absolute Percentage Error).

In order to compute an estimated accuracy, Lokad proceeds (roughly) through cross-validation tuned for time-series forecasts. Cross-validation is simpler than it sounds. If we consider a weekly forecast 10 weeks ahead with 3 years (roughly 150 weeks) of history, then the cross-validation looks like the following (a rough sketch of the procedure in code is given after the list):

  1. Take the 1st week, forecast 10 weeks ahead, and compare results to original.
  2. Take the 2 first weeks, forecast 10 weeks ahead, and compare.
  3. Take the 3 first weeks, forecast 10 weeks ahead, and compare.
  4. ...
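Here is a rough sketch of this rolling-origin process, assuming a hypothetical forecast_10_weeks_ahead function and a plain list of weekly sales totals; the MAE is used below for simplicity, whereas Lokad reports the MAPE:

```python
def rolling_origin_errors(weekly_sales, forecast_10_weeks_ahead, horizon=10):
    """One error value per forecast origin; aggregate the list as needed."""
    errors = []
    for cutoff in range(1, len(weekly_sales) - horizon + 1):
        history = weekly_sales[:cutoff]                   # take the first N weeks
        actual = weekly_sales[cutoff:cutoff + horizon]    # the 10 weeks that follow
        forecast = forecast_10_weeks_ahead(history)       # hypothetical forecasting model
        errors.append(sum(abs(f - a) for f, a in zip(forecast, actual)) / horizon)
    return errors
```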

The process is rather tedious, as we end up recomputing forecasts about 150 times for only 3 years of history. Obviously, cross-validation screams for automation, and there is little hope of going through such a process without computer support. Yet, computers typically cost less than business forecast errors, and Lokad relies on cloud computing to deliver such compute-intensive calculations.

Attempts to "simplify" the process outlined here are very likely to end up with overfitting problems. We suggest staying very careful, as overfitting isn't a problem to be taken lightly. When in doubt, stick to a complete cross-validation.
