The limited applicability of backtesting

Backtesting is an experimental design in which one truncates the historical data at a point in the past and applies a learning algorithm, or an optimization algorithm, to this truncated dataset in order to assess how well the algorithm would have performed under those historical conditions. The approach is simple and elegant, and thus frequently quite appealing to supply chain practitioners. However, backtesting is far from a silver bullet, and when its limitations are misunderstood, focusing on backtesting usually does more harm than good.
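The mechanics described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation; the `fit` and `error` functions below are hypothetical placeholders.

```python
# Minimal backtesting sketch: truncate the history at a cutoff, fit a
# model on the truncated past only, then score its forecasts against
# the data that was held back.

def backtest(history, cutoff, fit, error):
    """history: list of (period, value) pairs, ordered by period."""
    past = [v for t, v in history if t <= cutoff]
    future = [v for t, v in history if t > cutoff]
    model = fit(past)  # the model never sees the held-back data
    forecasts = [model(h) for h in range(1, len(future) + 1)]
    return error(forecasts, future)  # how well would we have done?

# Example with a naive "last observed value" model and mean absolute error:
fit = lambda past: (lambda horizon: past[-1])
mae = lambda f, a: sum(abs(x - y) for x, y in zip(f, a)) / len(a)

history = [(1, 10), (2, 12), (3, 11), (4, 15), (5, 14)]
print(backtest(history, cutoff=3, fit=fit, error=mae))  # -> 3.5
```

The essential discipline is that nothing after the cutoff may leak into the fitting step; most of the pitfalls discussed below stem from violating this rule in subtle ways.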

In our experience of running Quantitative Supply Chain initiatives, the main threats to the success of the initiative are:

With traditional forecasting approaches, inaccurate forecasts would have been part of this list; with probabilistic forecasts, however, this is a much lesser concern: not because probabilistic forecasts are more accurate - they are not - but because the quality of the decisions degrades much more gracefully as the forecasts lose accuracy.

Indeed, in practice, an “inaccurate” probabilistic forecast is primarily characterized by a probability distribution spread over an exceedingly large range of values. While this behavior is undesirable, it is usually not nearly as damaging as an inaccurate traditional (i.e. non-probabilistic) forecast, where the company goes all-in on a single possible future that turns out not to be the correct one. Inaccurate probabilistic forecasts translate into exceedingly conservative and cautious decisions. Money still gets wasted, but since many supply chain situations involve highly asymmetric costs, erring on the side of caution is far from the worst strategy.
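The newsvendor logic makes this degradation concrete: the decision is a quantile of the demand distribution chosen to match the cost asymmetry, so a wide ("inaccurate") distribution merely shifts the decision toward the cautious side rather than betting on a single wrong future. A minimal sketch, with illustrative costs and distributions:

```python
# Sketch of a probabilistic forecast driving a decision (newsvendor
# logic): order up to the demand quantile matching the cost asymmetry.

def order_quantity(demand_probs, understock_cost, overstock_cost):
    """demand_probs: dict mapping demand level -> probability."""
    critical = understock_cost / (understock_cost + overstock_cost)
    cumulative = 0.0
    for demand in sorted(demand_probs):
        cumulative += demand_probs[demand]
        if cumulative >= critical:
            return demand  # smallest quantity covering the critical quantile
    return max(demand_probs)

# A sharp forecast vs one spreading over a large range of values:
sharp = {9: 0.2, 10: 0.6, 11: 0.2}
wide = {d: 0.1 for d in range(5, 15)}
print(order_quantity(sharp, understock_cost=3, overstock_cost=1))  # -> 10
print(order_quantity(wide, understock_cost=3, overstock_cost=1))   # -> 12
```

With stockouts costing three times as much as overstocks, the wide distribution yields a larger, more cautious order; money is wasted on extra stock, but the decision does not collapse onto a single incorrect scenario.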

Proper backtesting is non-trivial to execute in real-world situations. Naive backtesting implementations are easily misled by overfitting, as a few hidden co-variables can explain the bulk of the business growth. Proceeding by trial and error with a backtesting process invariably ends up producing a model that has “memorized” the past evolutions of the market, but that remains incapable of anticipating the market.

At Lokad, we have found that the only reliable way to backtest a given statistical model is to leverage datasets from dozens of companies facing very diverse situations. While this approach does not eliminate overfitting, it does mitigate it significantly.
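This cross-company approach can be sketched as follows. The model and error metric below are illustrative placeholders, not Lokad's actual pipeline; the point is that the aggregate score rewards models that generalize across diverse situations rather than memorizing any single business.

```python
# Sketch of backtesting a candidate model across many companies'
# datasets and averaging the error, so that no single business's
# quirks (hidden co-variables, growth trends) dominate the score.

def backtest_one(history, cutoff, fit, error):
    past, future = history[:cutoff], history[cutoff:]
    model = fit(past)
    return error([model() for _ in future], future)

def cross_company_score(datasets, fit, error):
    # Each dataset is backtested independently; the average rewards
    # models that perform decently across very diverse situations.
    scores = [backtest_one(h, len(h) // 2, fit, error) for h in datasets]
    return sum(scores) / len(scores)

fit = lambda past: (lambda: sum(past) / len(past))  # mean model
mae = lambda f, a: sum(abs(x - y) for x, y in zip(f, a)) / len(a)

companies = [[10, 12, 11, 15], [3, 4, 2, 3, 5, 4]]
print(cross_company_score(companies, fit, mae))  # -> 1.5
```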

For a Quantitative Supply Chain initiative - assuming that the forecasting tools are adequate and do not require supply chain practitioners to manually parameterize models - an early focus on backtesting typically results in premature optimization, distracting the team in charge of the initiative’s implementation from risk factors that dwarf the benefits that can be expected from a backtesting process.

Some forecasting tools happen to be improperly designed, and require their end-users to conjure statistical parameters before any work can start. For example, exponential smoothing, a simple forecasting model, requires a smoothing factor to be provided. As end-users cannot conjure these parameters out of thin air, they end up resorting to a backtesting process just to get the models working in the first place. However, the desirability of backtesting should not be confused with requirements imposed by the accidental design mistakes of some forecasting tools.
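The exponential smoothing case makes the problem concrete: the smoothing factor has no business meaning to the end-user, so it ends up being chosen by scoring candidates on held-back history. A minimal sketch, with illustrative data:

```python
# Why such tools push users toward backtesting: simple exponential
# smoothing needs a smoothing factor `alpha` that an end-user cannot
# guess, so it gets picked by scoring candidates on held-back history.

def exp_smooth_forecast(series, alpha):
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level  # one-step-ahead forecast

def pick_alpha(series, holdout):
    # Score each candidate alpha against the held-back observations.
    def error(alpha):
        return sum(abs(exp_smooth_forecast(series[:i], alpha) - series[i])
                   for i in range(len(series) - holdout, len(series)))
    return min((error(a / 10), a / 10) for a in range(1, 10))[1]

series = [12, 10, 14, 11, 25, 27, 26, 28]
print(pick_alpha(series, holdout=3))
```

On this series, which shifts to a higher level midway, the search favors a large alpha that adapts quickly. The user has been forced into a backtesting exercise merely to make the tool usable, which is the design flaw being criticized here.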

As a rule of thumb, it is appropriate to start considering backtesting when:

When these conditions are met, backtesting can be rolled out as an additional angle to further improve the performance of the Quantitative Supply Chain initiative.