Lokad’s supply chain practice is to refresh all the data pipelines - forecasts included - at least once a day, even when dealing with monthly or quarterly calculations. To many practitioners, this practice may seem counterintuitive. Why refresh quarterly forecasts more than once per quarter? What about numerical instabilities? What about the friction involved? This goes against most established supply chain practices, especially those of companies that have an S&OP process in place. Yet, a decade of experience has taught us that not abiding by this “refresh all the things” practice is a recipe for an unending stream of production problems. At its core, this issue must be addressed head-on - and in depth - through a versioned stateless design of the software in charge of the supply chain’s predictive optimization. This point is revisited in greater detail below, but let’s start by having a closer look at the problem itself.
In the enterprise software world, problems / glitches / bugs / issues happen all the time. This is especially true for supply chains where the applicative landscape is invariably some haphazard collection of software pieces (ERP, EDI, MRP, WMS, OMS, ecommerce, etc.) put together over decades of activity: every app is a potential source of problems as it ends up interacting with many other apps [1]. Every change brought to any of those apps has a chance to break something elsewhere, i.e. not just breaking the app itself. Not all companies are equal when it comes to the management of their applicative landscape. However, beyond 1000 employees, even the best-run companies get more than one software-driven supply chain “glitch” per day.
Thus, large-ish companies face a never-ending stream of software issues to be addressed. The ownership for resolving such issues varies. The responsibility may lie with the IT department, a third-party vendor, an operational team, a supplier, etc. Yet, once any of those issues gets “supposedly” fixed, it takes 5 to 10 rounds [2] to make sure the issue really is fixed. Indeed, most issues are edge cases, which may or may not present themselves at any given point in time. Thus, while an issue may appear to be resolved, because it went away after applying a fix of some kind, this may not yet be the case. Supply chains are complex systems involving many interdependent pieces, some of them not even in the full control of the company (e.g. EDI with suppliers). People routinely fail at delivering definitive fixes not because they are lazy or incompetent, but merely because of the irreducible ambient complexity.
As a consequence, if a data pipeline is run daily after a production incident, it takes 5 to 10 days to get it back to stability. If the data pipeline is run monthly, the same resolution process takes 5 to 10 months. It’s the number of rounds that matters, not the wall-clock time. It takes multiple rounds to positively assess that the edge case is addressed. By way of anecdotal evidence, at Lokad, we had a job scheduling bug related to time changes that took two years to fix, precisely because the conditions triggering the problem were so rare - twice a year per time zone. Yet, while certain issues - like time changes - are inherently infrequent, most issues can be reproduced “at will” by merely cranking up the frequency of the “rounds”.
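The arithmetic behind this claim can be made explicit with a minimal sketch. The function name `time_to_confidence` is made up for illustration, and a month is approximated as 30 days; only the 5 to 10 rounds figure comes from the text above.

```python
from datetime import timedelta


def time_to_confidence(rounds_needed: int, period: timedelta) -> timedelta:
    """Wall-clock time before a fix has been exercised `rounds_needed` times."""
    return rounds_needed * period


daily = timedelta(days=1)
monthly = timedelta(days=30)  # assumption: a month approximated as 30 days

for rounds in (5, 10):
    days_if_daily = time_to_confidence(rounds, daily).days
    days_if_monthly = time_to_confidence(rounds, monthly).days
    print(f"{rounds} rounds: {days_if_daily} days (daily) vs {days_if_monthly} days (monthly)")
```

The number of rounds is the same in both cases; only the refresh period changes how long the company waits before trusting the fix.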
Constantly re-challenging the end-to-end data pipelines is the only way to ensure that issues get fixed within a reasonable time frame. Infrequent execution invariably leads to broken being the default state of the system. Companies that operate intermittent data pipelines - say quarterly ones - invariably end up with large bureaucracies that are only there to bring the pipeline back to life once per quarter. Worse, the whole thing is usually so dysfunctional that the bureaucracy ends up spending most of the quarter ensuring the “refresh” of the next quarter. Conversely, real-time pipelines - like serving web pages for the corporate website - barely need anybody to keep working.
At Lokad, we opted for daily refreshes (or more) out of sheer necessity more than a decade ago. Since that time, we still haven’t identified any other way to achieve a decent quality of service from a supply chain perspective. There is probably none. Predictive analytics are complex and, thus, prone to “bugs” of all kinds. Daily refreshes ensure that problems get addressed promptly instead of lingering forever in limbo [3]. In this regard, our findings are far from being original. Netflix pioneered the whole field of chaos engineering along similar lines of thought: to engineer a robust system, stress must be applied; without stress, robustness never makes its way into the engineering mindset. Most serious software companies - notably the GAFAM - have adopted even more stringent flavors of this approach.
Furthermore, infrequent data pipeline refreshes do not only lead to production woes. From a specific supply chain perspective, they also encourage a whole series of bad practices and bad technologies.
Whenever forecasts are infrequently refreshed, it becomes highly tempting to manually adjust them. Indeed, forecasts are stale most of the time by design, precisely due to the infrequent refresh. Thus, by merely having a look at the data from yesterday, the demand planner can improve upon a forecast that was produced by the system three weeks ago. Yet, this work from the demand planner does not deliver any lasting added value for the company: it’s not accretive. If the numerical recipes generating the forecasts are so poor they need manual overrides, then those numerical recipes must be fixed. If the software vendor can’t deliver the fix, then the company needs a better vendor.
Frequent forecast refreshes exacerbate the numerical instabilities of the underlying statistical models, i.e. run the forecast twice and get two distinct results. This is a good thing [4]. Unstable numerical models are harmful for the supply chain due to ratchet effects: once a production order or a purchase order is placed, the company is stuck with the consequences of this order. If an order is placed, it had better be for better reasons than a matter of numerical instability. The sooner the company eliminates unstable numerical recipes from its supply chain practice, the better. Attempting to obfuscate the underlying problem by reducing the frequency of the forecast refresh is nonsense: numerical instability doesn’t go away because the company decides to stop looking at it. If the numerical recipes are so poor that they can’t maintain a strong consistency [5] from one day to the next, better numerical recipes are needed. Again, if a software vendor happens to be in the middle of the problem and can’t deliver a deep fix, then the company also needs a better vendor.
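One way to make day-to-day consistency observable is to compare the figures of two consecutive runs and flag the items that moved too much. The sketch below is only an illustration, not Lokad’s method: the symmetric relative change metric, the 10% tolerance, the SKU names, and the function names are all assumptions.

```python
def relative_change(previous: float, current: float) -> float:
    """Symmetric relative change between two figures, in [0, 2]."""
    denom = (abs(previous) + abs(current)) / 2
    return 0.0 if denom == 0 else abs(current - previous) / denom


def unstable_items(yesterday: dict, today: dict, tolerance: float = 0.10) -> list:
    """Items present in both runs whose figure moved by more than `tolerance`."""
    shared = yesterday.keys() & today.keys()
    return sorted(
        sku for sku in shared
        if relative_change(yesterday[sku], today[sku]) > tolerance
    )


# Hypothetical figures from two consecutive daily runs.
yesterday = {"SKU-A": 100.0, "SKU-B": 40.0, "SKU-C": 0.0}
today = {"SKU-A": 102.0, "SKU-B": 65.0, "SKU-C": 0.0}

print(unstable_items(yesterday, today))  # ['SKU-B']
```

A check of this kind only works if the pipeline runs often enough to have a “yesterday” to compare against. A large jump is not proof of instability on its own, since the input data may have genuinely changed, but it tells the team where to look.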
Daily refreshes of all data pipelines may seem extravagant in terms of computing resources. However, considering modern computing hardware and properly designed software, this cost is small even when considering sophisticated recipes such as probabilistic forecasting [6]. Furthermore, supply chains routinely face exceptional conditions that require immediate large-scale correction. If the company can’t refresh all its supply chain figures in less than 60 minutes when it needs to, then emergencies are guaranteed to remain unaddressed every now and then, wreaking havoc on the ground. The 5 to 10 rounds rule - previously discussed - still applies: once a fix is uncovered, it takes multiple runs - possibly with varied settings - to gain confidence that this emergency correction is working. Thus, if the data pipeline is too expensive to be run “at will”, production will be used as a testing ground and chaos will ensue.
From a productivity perspective, daily refreshes eliminate the tolerance for bad setups that keep generating garbage results. Again, it’s a good thing. There is no reason to be tolerant of a numerical recipe that keeps generating garbage. Demand planning isn’t some kind of artistic creation that defies numerical analysis. Dysfunctional numerical recipes should be treated as software bugs and fixed accordingly. However, delivering the deep fix frequently requires a lot more thinking than defaulting to an ad hoc manual override. Most companies wonder why their supply chain teams keep coming back to their spreadsheets. It turns out that frequently, the spreadsheets are the only place where the numerical fixes - which should already be part of the systems - ever get implemented, precisely because iterating quickly over a spreadsheet is a non-issue (unlike iterating over the underlying enterprise system).
However, daily (or more) refreshes are only the operational side of the problem. In terms of software design, this approach must be supplemented with a first key ingredient: statelessness. A data pipeline should not use any precomputed data and should start anew from the raw transactional data every single time. Indeed, every single bug or glitch is likely to corrupt the precomputed data, holding back the company for an indefinite period of time: the logic may be fixed but the faulty data remains. The solution is straightforward: the data pipeline should not have any state, i.e. no precomputed data of any kind. Starting fresh ensures that all the fixes are immediately leveraged to the greatest extent possible.
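As a rough illustration of statelessness, the sketch below recomputes its output from the raw transactions on every call, with no cache or intermediate table. The `Transaction` layout, the `run_pipeline` name, and the 28-day averaging window are assumptions made for this example; a real pipeline computes far more than an average.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transaction: (day, sku, quantity sold)
Transaction = tuple


def run_pipeline(transactions: list, as_of: date, window_days: int = 28) -> dict:
    """Recompute average daily sales per SKU from raw transactions only.

    Nothing is read from or written to a cache: the output depends solely
    on the arguments, so a logic fix takes full effect on the next run.
    """
    totals = defaultdict(int)
    for day, sku, qty in transactions:
        # Keep only the transactions inside the trailing window.
        if 0 <= (as_of - day).days < window_days:
            totals[sku] += qty
    return {sku: total / window_days for sku, total in totals.items()}


raw = [
    (date(2024, 1, 1), "SKU-A", 14),
    (date(2024, 1, 15), "SKU-A", 14),
    (date(2023, 11, 1), "SKU-A", 99),  # outside the window, ignored
]

print(run_pipeline(raw, as_of=date(2024, 1, 28)))  # {'SKU-A': 1.0}
```

Because the function keeps no state between calls, running it twice on the same inputs gives the same result, and a corrected version of the logic never has to contend with leftovers from a previous faulty run.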
In turn, versioning, another software design principle and the second ingredient of interest, supplements the statelessness of the system: more specifically, data versioning. Indeed, if the data pipeline itself has no state, then merely combining the logic of the pipeline - which exists as versioned source code - and the input data should be sufficient to exactly reproduce any problem encountered during the execution of the pipeline. In other words, it makes problems reproducible. However, achieving this requires preserving an exact copy of the data as it stood during the execution of the data pipeline. In other words, the data should be versioned alongside the code. Data versioning ensures that fixes can be tested in the exact same conditions that triggered whatever problem was encountered in the first place.
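A minimal sketch of the idea, under stated assumptions: each run records a manifest pairing a code version with a content hash of the exact input data, and the data itself is kept in an immutable snapshot store. The in-memory `store` dictionary, the function names, and the commit identifier `"a1b2c3d"` are hypothetical; a production system would use durable storage.

```python
import hashlib
import json


def snapshot_id(raw_rows: list) -> str:
    """Content hash of the input data: identical data gives an identical id."""
    payload = json.dumps(raw_rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record_run(store: dict, raw_rows: list, code_version: str) -> dict:
    """Keep a private copy of the inputs and return the run's manifest."""
    sid = snapshot_id(raw_rows)
    # Deep copy via a JSON round trip, so later edits to raw_rows don't leak in.
    store.setdefault(sid, json.loads(json.dumps(raw_rows)))
    return {"code_version": code_version, "data_snapshot": sid}


def load_inputs(store: dict, manifest: dict) -> list:
    """Retrieve the exact inputs a past run saw."""
    return store[manifest["data_snapshot"]]


store = {}
rows = [{"sku": "SKU-A", "qty": 3}]
manifest = record_run(store, rows, code_version="a1b2c3d")  # hypothetical commit id

rows.append({"sku": "SKU-B", "qty": 7})  # the live data moves on
print(load_inputs(store, manifest))      # [{'sku': 'SKU-A', 'qty': 3}]
```

With the manifest in hand, checking out the recorded code version and feeding it the recorded snapshot reproduces the run; the same snapshot can then be fed to a candidate fix to test it under the conditions that triggered the problem.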
Lokad has been engineered around these principles. We promote an end-to-end daily refresh of everything. Our data pipelines are stateless and versioned - both the logic and the data. What about your company?
[1] The One ERP strategy is so tempting precisely because - in theory - it would make all this many-app friction go away through one fully unified system. Unfortunately, this is not the case, and One ERP tends to backfire badly. Indeed, software complexity - and costs - grow super-linearly with the number of features involved. Thus, the “One” becomes some unmaintainable software monster collapsing under its own weight. See all our ERP knowledge base entries. There is a balance to be found between fragmenting the IT landscape (too many apps) and the curse of the monolith (unmaintainable app).
[2] Here, a “round” is casually referring to the end-to-end execution of the mundane processes driving the supply chain through its underlying software systems. It’s the series of steps that are needed to generate production orders, purchase orders, dispatch orders, etc.
[3] Many of Lokad’s competing vendors never came to terms with this “chaos engineering” perspective. Instead of addressing the production “gradiness” of their system head-on by adding stress to the system, they did the opposite: reduce the stress through less frequent refreshes. Yet, a decade down the road, those systems invariably need a team of sysadmins to even run at all. In contrast, Lokad does not even have a nightly team (yet) while we serve companies in every time zone on earth.
[4] The ABC analysis is a notoriously unstable numerical recipe. For this reason alone, it has no place in a modern supply chain. In practice, ABC is so bad, the instability problem barely registers when compared to the other problems that this method entails.
[5] There is absolutely no limit to the degree of numerical stability that can be achieved with numerical recipes. Unlike forecasting accuracy, which is limited by the irreducible uncertainty of the future, there is nothing that prevents a figure from being arbitrarily close to the same figure generated the day before. This isn’t magic: the numerical recipe can and should be precisely engineered to behave that way.
[6] Some software vendors vastly inflate the computing requirements due to dramatically bad software design. While this angle alone warrants a post of its own, the primary antipattern at play is usually some ultra-layered design where data is channelled through dozens - if not hundreds - of hops. For example, the data may go through: SQL database → ETL → NoSQL database → Java app → Flat Files → Another NoSQL database → Flat Files → Python → NumPy → Flat Files → PyTorch → Flat Files → Another SQL database → Another Java app → Another NoSQL database → Flat Files → ETL → SQL database. In those situations, nearly all of the computing resources are wasted shuffling the data around, with little or no value added at each step. Software vendors who suffer from this problem are easy to spot because, usually, they can’t resist putting a “technology” slide in their presentation with two dozen logos listing the incredible collection of (open source) software pieces that accidentally ended up in their solution.