Ranked 6th out of 909 teams in the M5 forecasting competition

A team of Lokad employees, namely Rafael de Rezende (leader), Ignacio Marín Eiroa, Katharina Egert and Guilherme Thompson [1], have come in 6th out of 909 competing teams in the M5 forecasting competition. It’s an impressive feat, and I am proud of what this team has achieved. Building a culture oriented toward quantitative results has been a long-standing goal for Lokad, and the result of this competition demonstrates just how far we have progressed on this journey.

To my knowledge, this is the first time that a public demand forecasting competition has involved quantile forecasts, an approach that relates directly to Lokad’s work back in 2012. Whilst it has taken 8 years for academia to catch up with quantiles, this doesn’t make this achievement any less significant. Naked “classic” forecasts are pretty much broken by design as far as supply chain is concerned. Quantile forecasts are not the endgame, but they nevertheless work where safety stocks do not. I see this as a major step in the right direction.

Results-wise, the contenders from rank 1 to 6 are incredibly close. The first placed team [2] managed to leap a few percent ahead. However, my own experience indicates that, even for a very large retail network like Walmart, a 5% reduction of pinball loss - a metric that can be used to assess the accuracy of quantile forecasts - would be almost unnoticeable as far as dollars of error are concerned. Indeed, at this level of accuracy, the forecasting models are essentially equivalent, and other angles (which were not covered by the M5 competition) dominate, such as the capacity to cope with stockouts, varying assortments, cannibalizations, erratic lead times, etc. These concerns make a far greater difference than a handful of percent of pinball loss.

Model-wise, the Lokad team used a low dimensional parametric model which included the relevant cyclicities (day-of-week, day-of-month, month-of-year) at the store/category level, a baseline eliminating cyclicities and stockout noise, and a 2-parameter state-space model to turn the baseline into daily trajectories (with multiplicative contributions of the cyclicities). Also, like the winning team, Lokad used neither price data nor any external data. The biggest technicality for the Lokad team turned out to be dealing with stockouts that had to be forecast: it was a sales forecasting exercise, not a demand forecasting one. This will be discussed in greater detail later on when we revisit the fine print of this model.
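To make "multiplicative contributions of the cyclicities" concrete, here is a rough sketch of how three cyclicity profiles could scale a baseline into daily expected sales. This is not the model the team used: the function name, the factor values and the baseline are made up for illustration, and the 2-parameter state-space component is left out entirely.

```python
from datetime import date, timedelta


def daily_expectation(baseline, day, dow_factor, dom_factor, moy_factor):
    """Scale a baseline level by three multiplicative cyclicity factors."""
    return (
        baseline
        * dow_factor[day.weekday()]   # index 0 = Monday ... 6 = Sunday
        * dom_factor[day.day - 1]     # index 0 = 1st ... 30 = 31st
        * moy_factor[day.month - 1]   # index 0 = January ... 11 = December
    )


# Hypothetical profiles for one store/category pair (illustrative values).
dow = [0.9, 0.9, 0.95, 1.0, 1.1, 1.3, 1.2]  # busier weekends
dom = [1.0] * 31                            # flat day-of-month profile
moy = [1.0] * 12                            # flat month-of-year profile

start = date(2016, 5, 23)  # a Monday
trajectory = [
    daily_expectation(2.0, start + timedelta(days=h), dow, dom, moy)
    for h in range(7)
]
print(trajectory)
```

With only 7 + 31 + 12 factors per store/category plus a baseline per item, the parameter count stays small, which is what "low dimensional" means in practice.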

Overall, if a well-chosen low dimensional parametric model, like the one Lokad used in the M5 competition, can get you within a handful of percent of the accuracy of the state-of-the-art method - which happens to be range-augmented gradient boosted trees - then in production, this model is guaranteed to be much better behaved than nonparametric or hyperparametric models, and much easier to tweak structurally [3] when the need arises.

Also, the computing performance of the model tends to be a not-so-subtle operational killer. The first placed team reported that running their prediction took “a couple hours” (sic) on a 10+10 CPU workstation setup. This may seem fast, but keep in mind that the M5 dataset was only 30k SKUs (a few categories over a few stores), which is very small compared to the number of SKUs in most retail networks. I guesstimate that Walmart has over 100M SKUs to manage globally, so we are talking about tens of thousands of compute hours per prediction [4]. The retail networks that Lokad serves typically give us a window of about 2 hours every day to refresh our forecasts, so whatever models we pick need to be compatible with this schedule for both training and forecasting [5]. Deploying the model of the first placed team is certainly possible at Walmart’s scale, but managing the compute cluster alone would take a team of its own.
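The order of magnitude above can be checked with a back-of-the-envelope calculation. The sketch below rests on assumptions of mine: "a couple hours" is read as 2 hours, "10+10 CPU" as 20 cores, and compute is taken to scale linearly with the number of SKUs, which a production setup would not guarantee.

```python
import math

M5_SKUS = 30_000           # size of the M5 dataset, as stated above
TARGET_SKUS = 100_000_000  # the guesstimate for Walmart, as stated above
HOURS_PER_RUN = 2.0        # assumption: "a couple hours" taken as 2 hours
CORES = 20                 # assumption: "10+10 CPU" taken as 20 cores
WINDOW_HOURS = 2.0         # daily refresh window of about 2 hours


def scale_linearly(hours, skus_from, skus_to):
    """Scale a run time assuming cost grows linearly with the SKU count."""
    return hours * skus_to / skus_from


# Hours on a single workstation of the same kind, then the same in core-hours.
workstation_hours = scale_linearly(HOURS_PER_RUN, M5_SKUS, TARGET_SKUS)
core_hours = workstation_hours * CORES
# Identical workstations needed to fit the prediction in the daily window.
workstations_needed = math.ceil(workstation_hours / WINDOW_HOURS)

print(f"{workstation_hours:,.0f} workstation-hours per prediction")
print(f"{core_hours:,.0f} core-hours per prediction")
print(f"{workstations_needed:,} workstations for a {WINDOW_HOURS:.0f}-hour window")
```

Depending on whether one counts workstation-hours or core-hours, the figure lands between several thousand and over a hundred thousand hours per prediction, and fitting the run into the daily window would take thousands of such workstations.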

The M5 competition was a major improvement upon its previous iterations. However, the dataset is still a far cry from a real retail situation. For example, the pricing information was only available for the past. In practice, promotions don’t just happen randomly: they are planned. As such, if the price data had been provided for the time period to be forecast, the competition would have been steered toward models actually making use of this information instead of dismissing it straight away.

Besides future prices, two major pieces of data happened to be missing from the M5 competition: stock levels and disaggregated transactions, both of which are nearly always available in retail chains. Stock levels matter because obviously without stock there are no sales (censorship bias). Disaggregated transactions matter because, in my experience, it’s nearly impossible to assess any kind of cannibalization or substitution without them - whereas a casual observation of the retail shelves clearly indicates that they do play a big role. The model that the Lokad team used to rank sixth did not have anything in this regard, and the model that ranked first did not either.

In conclusion, it’s a fantastic result for Lokad. There is definitely progress to be made in making forecasting competitions more realistic, however, and I would urge my readers not to take these results too literally: M5 is a forecasting competition. In the real world, stockouts, product launches, product promotions, assortment changes, supplier problems and delivery schedules all need to be factored into the picture. The biggest challenge is not to shave off a few percent of error left or right, but to ensure that the end-to-end numerical recipe doesn’t have dumb blind spots that end up ruining the whole supply chain optimization initiative.

  1. Technically an ex-Lokad employee at the time of the competition.

  2. The winning team included Northquay (pseudonym) and Russ Wolfinger. Their team was named Everyday Low SPLices for this M5 competition. For the sake of clarity, I am simply referring to them here as the first placed team.

  3. Crises happen routinely in supply chain. Covid-19 is just the latest worldwide crisis, but localized crises happen all the time. Historical data does not always reflect the events that unfold in supply chain. Frequently, the high-level insight of the supply chain scientist is the only way to steer models toward sensible decisions.

  4. The first placed team used LightGBM, a C++ library capable of delivering state-of-the-art algorithmic performance for this class of models. Furthermore, the team used somewhat advanced numerical performance tricks such as using half-precision numbers. When transitioning towards a production setup, the per-SKU compute performance would most likely decrease due to the extra complexity / heterogeneity imposed by an actual production environment.

  5. Not all models are equally suitable for isolating training from evaluation (forecasting). Mileage may vary. Data problems happen once in a while; in these situations, models need to be retrained, and this needs to happen fast.