On the accidental complexity of supply chain systems

December 21, 2020

technology

Joannes Vermorel

Modern computing hardware is extremely capable. A modest smartphone delivers billions of FLOPS (floating point operations per second) while storing hundreds of gigabytes of data. A single smartphone could technically run a predictive inventory allocation for a very large retail network. The historical data would require a suitable representation¹ and, on the data crunching side, leaner techniques like differentiable programming would have to be used. Thus, high-performance supply chain systems should be a given. Surely, companies can afford something a notch better than a smartphone to run and optimize their supply chains. Yet, a casual observation of our clients’ supply chain systems at Lokad indicates the exact opposite: these systems are almost always slow, and frequently torturously so.


Present-day supply chain software leaders (ERP, WMS, MRP, etc.) have a hard time even sustaining 1 request per second on their API backend. At Lokad, we are painfully reminded of such horrid performance on a daily basis, as we are on the front line of the data retrieval process. For a dozen clients or so, the initial data retrieval took almost a month². The sluggishness of the various APIs accounts for 99.9% of the problem. Systems capable of sustaining 1MB/second for their data extraction are few and far between. Systems that don’t force us to needlessly re-extract the same data over and over - to reach the freshest parts - are even rarer. Those systems typically have more than 100 times the computing resources they had two decades ago, and yet, they are not fundamentally faster³, nor are they doing things radically better. Some of the most progressive vendors leveraging in-memory computing require several terabytes of RAM to deal with retail networks, which is an appallingly large⁴ amount of RAM considering what is being done with those resources.
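To put these figures in perspective, here is a back-of-the-envelope sketch in Python. The dataset size (250 million transaction lines), the page size (100 lines per request) and the 40 bytes per line are purely illustrative assumptions, not measurements from any client system; the function names are hypothetical as well.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def paged_extraction_days(total_records: int,
                          records_per_request: int,
                          requests_per_second: float) -> float:
    """Days needed to page through a dataset at a sustained request rate."""
    requests = math.ceil(total_records / records_per_request)
    return requests / requests_per_second / SECONDS_PER_DAY


def bulk_extraction_days(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to stream a dataset at a sustained byte throughput."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


# Illustrative assumptions only (not client data).
LINES = 250_000_000      # transaction lines in the history
LINES_PER_REQUEST = 100  # page size imposed by the API
BYTES_PER_LINE = 40      # compact representation of one line

slow = paged_extraction_days(LINES, LINES_PER_REQUEST, requests_per_second=1.0)
fast = bulk_extraction_days(LINES * BYTES_PER_LINE, bytes_per_second=1_000_000)

print(f"1 request/second, paged: {slow:.1f} days")
print(f"1MB/second, bulk:        {fast * 24:.1f} hours")
```

Under these assumptions, the paged API needs roughly 29 days, while the same data streamed at 1MB/second would take under 3 hours. The numbers are only as good as the assumptions, but the gap of two orders of magnitude shows how an initial retrieval can stretch to almost a month without any hardware limit being involved.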

This “accidental complexity” of many (most?) supply chain systems can be traced back to two root causes: first, incorrect expectations about the progress of computing hardware itself; second, a lack of care for the solution’s internal design.

On the progress of computing hardware, until one decade ago, there wasn’t a single (large) company where the first Moore’s law hadn’t been pitched dozens of times (usually incorrectly). There was a sense that computers were getting ridiculously faster all the time. Unfortunately, this has mostly ceased to be trivially true since the early 2000s. This incorrect perspective of indefinite progress led many software companies, well beyond the supply chain world, to make massive mistakes. Many of the woes associated with Windows Vista (released in 2006) could be traced back to the original expectations - back in 2001 when Windows XP was released - of the Microsoft engineers that CPUs would be clocked at 6 GHz by 2006. We are nearing the end of 2020, and high-end gaming CPUs are barely scratching 5 GHz. Computing hardware never stopped progressing; it merely stopped progressing in a trivial manner, at least as far as the software companies were concerned.

Back in the 1980s and 1990s, as soon as a piece of software was working, even if it was somewhat slow at the date of the release, it was a given that next year its speed would be decent, and the year after, its speed would be excellent. Aggressive software companies like Microsoft played this card very well: their engineers were given (and still are) the best computing hardware that money can buy, and they were systematically pushing the software performance to the limit of what remained acceptable, knowing that the hardware would essentially solve the performance problem within a year or two. After the Vista debacle, the engineering teams at Microsoft realized the extent of the problem and changed their ways - Windows 7 being a major improvement. Yet, it took a decade for Windows to truly recover on the performance front. Nowadays, the outlook is nearly the exact opposite: the best Microsoft teams are not banking on future hardware any more, and focus nearly exclusively on delivering immediate superior performance via superior software⁵.

However, the enterprise software world turned out to be much slower at noticing the problem, and kept building software during the 2010s as if future computing hardware were on the verge of solving all their problems, as had happened many times in the past. Unfortunately for most enterprise software vendors, while computing hardware is still progressing, it stopped a decade ago progressing in the trivial manner⁶ where the vendor can merely wait for the performance to happen. Software tends to accumulate cruft over time (more features, more options, more screens, etc.). Thus, the natural tendency of complex software is to slow down over time, not to improve - unless there is an intense dedicated effort on this front.

Sadly, from a sales perspective, performance is mostly a non-issue. Demos are done with toy accounts that only include a vanishingly small fraction of the data workload that the system would face in production. Also, screens of interest to the top management get a disproportionate amount of polish compared to those intended for corporate grunts. Yet, the latter are exactly the screens that will be used thousands of times per day, and thus, the ones that deserve the most attention. I suspect that APIs frequently offer terrible performance because few buyers investigate whether the APIs actually deliver performance aligned with their intended purpose. Software vendors know this, and they align their engineering investments accordingly.

This brings me to the performance problem’s second aspect: the lack of care for the solution’s internal design. At present, it takes strategic software design decisions to leverage the bulk of the ongoing hardware improvements. Yet, a design decision is a double-edged sword: it empowers as much as it limits. It takes strong leadership to commit to a design decision, both on the business side and on the technical side. Indecision is easier but, as illustrated by the vast majority of enterprise software, performance (and UX in general) suffers greatly as a result.

One pitfall of modern software (not just the enterprise kind) is the overabundance of layers. Data is copied, piped, pooled, synced, and so on, through dozens of inner layers within the software. As a result, the bulk of the computing resources is wasted on “internal plumbing” which does not, in itself, deliver any added value. In terms of design, the remediation is both simple to conceive and difficult to execute: one must make frugal use of third-party components, especially those that entail a layer of some kind⁷. From a software vendor’s perspective, adding one more layer is the quickest way to add more “cool features” to the product, never mind the bloat.

At Lokad, we have opted for an extensively integrated stack by designing our whole platform around a compiler core. On the downside, we lose the option of easily plugging any random open source project into our design. Integrations remain possible but usually require deeper changes in the compiler itself. On the upside, we achieve “bare metal” performance that is usually considered unthinkable as far as enterprise software is concerned. Overall, considering that open source components tend to age badly, this approach has proved particularly effective over the last decade⁸.

Mutualized multi-tenancy is another design choice that radically impacts performance, at least from a “bang for the buck” perspective. Most enterprise software - supply chain software included - has heavily intermittent requirements for computing resources. For example, at the extreme end of the spectrum, we have the forecasting numerical recipe, which is only run once per day (or so) but has to crunch the entire historical data every single time. Having a static set of computing resources⁹ dedicated to a client is highly inefficient.
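The inefficiency can be made concrete with a small Python sketch. It assumes 100 clients that each need 32 cores for 1 hour per day, and an idealized scheduler able to stagger the runs perfectly across the day. All of these figures and function names are hypothetical and chosen for illustration; they are not Lokad’s actual numbers.

```python
import math


def dedicated_cores(tenants: int, peak_cores: int) -> int:
    """Each tenant gets its own static allocation sized for its peak."""
    return tenants * peak_cores


def pooled_cores(tenants: int, peak_cores: int, busy_hours_per_day: float) -> int:
    """Shared pool sized for the tenants that overlap, assuming the runs
    can be perfectly staggered over 24 hours (an idealized assumption)."""
    concurrent = math.ceil(tenants * busy_hours_per_day / 24)
    return concurrent * peak_cores


def utilization(tenants: int, peak_cores: int, busy_hours_per_day: float,
                provisioned_cores: int) -> float:
    """Fraction of the provisioned core-hours that do useful work."""
    used = tenants * peak_cores * busy_hours_per_day
    return used / (provisioned_cores * 24)


# Illustrative assumptions only.
TENANTS, PEAK, BUSY = 100, 32, 1.0

static = dedicated_cores(TENANTS, PEAK)
shared = pooled_cores(TENANTS, PEAK, BUSY)

print(f"dedicated: {static} cores, "
      f"{utilization(TENANTS, PEAK, BUSY, static):.1%} utilized")
print(f"pooled:    {shared} cores, "
      f"{utilization(TENANTS, PEAK, BUSY, shared):.1%} utilized")
```

In this toy model, the dedicated setup provisions 3200 cores that sit idle about 96% of the time, while the pool gets by with 160 cores. Real workloads cannot be staggered perfectly, since many clients want their results before the morning, so the actual gain is smaller, but the order of magnitude explains why static allocations are so wasteful.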

Again, at Lokad, we have opted for a fully mutualized infrastructure. This approach reduces our cloud operating costs while delivering a performance that would not be economically feasible otherwise (cloud costs would outweigh supply chain benefits). In order to ensure a smooth overall orchestration of all the workloads, we have engineered a high degree of “predictability” for our own consumption of computing resources. Lokad’s DSL (domain-specific programming language), named Envision, has been engineered to support this undertaking. This is why entire classes of programming constructs - like arbitrary loops - do not exist in Envision: those constructs are not compatible with the “high predictability” requirements that supply chain data crunching entails.
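The link between loops and predictability can be illustrated with a toy Python sketch. This is not Envision and not how Lokad’s platform is implemented; the `Step` type, its `cost_per_row` units and the Collatz loop are hypothetical illustrations. The point is only that a pipeline of column-wise steps has a cost that can be bounded from the row count alone, whereas the cost of an arbitrary loop depends on the data itself.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Step:
    """One column-wise operation; cost_per_row is in arbitrary, made-up units."""
    name: str
    cost_per_row: float


def estimate_cost(steps: Sequence[Step], rows: int) -> float:
    """Total work for the pipeline, known before any data is touched."""
    return sum(step.cost_per_row * rows for step in steps)


def collatz_steps(n: int) -> int:
    """An arbitrary loop: the iteration count depends on the value itself."""
    count = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        count += 1
    return count


pipeline = [Step("filter", 1.0), Step("aggregate", 2.0)]
print(estimate_cost(pipeline, rows=1_000_000))   # cost fixed by the row count
print(collatz_steps(26), collatz_steps(27))      # neighbors, very different costs
```

Two neighboring inputs, 26 and 27, take 10 and 111 iterations respectively in the loop, while the loop-free pipeline costs the same for any dataset of a given size. A scheduler that has to pack many workloads onto shared hardware can plan around the second kind of program, but not the first.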

In conclusion, don’t expect an obese supply chain system to get fit anytime soon if it isn’t fit already. While computing hardware is still progressing, it’s plenty fast already. If the system is sluggish, most likely it’s because it antagonizes its underlying hardware - not because the hardware is lacking. Fixing the problem is possible, but it’s mostly a matter of design. Unfortunately, core software design is one of the things that tends to be near-impossible to fix in enterprise software past the design stage of the product. Recovery is possible, as demonstrated by Microsoft, but not every company (vendor and client alike) can afford the decade it takes to do so.

¹ Back in 2012, I published ReceiptStream, a small open source project to demonstrate that storing about 1 year’s worth of Walmart transaction history at the basket level on an SD card was not only feasible, but could be done with a few hundred lines of code.
² We try to perform incremental data retrieval if the systems let us do so. Yet, the initial data retrieval typically goes 3 to 5 years back, as having a bit of historical depth really helps when it comes to seasonality analysis.
³ Console terminals might look dated, but if those systems managed to stick around for several decades, it means they probably had quite a few redeeming qualities, such as low latency responses. There is nothing more infuriating than having a cool looking modern web interface where every page refresh takes multiple seconds.
⁴ I am not saying that terabytes of RAM can’t be handy when it comes to supply chain optimization - repeating the fictitious quote incorrectly pinned on Bill Gates that “640K ought to be enough for anybody”. My point is that an unreasonable use of computing resources is a wasted opportunity to put them to better use. As of December 2020, I fail to see any reason why such an amount of memory is required considering the (lack of) sophistication of the numerical recipes involved with the so-called “in-memory” computing paradigm.
⁵ The performance improvements, almost exclusively software-driven, brought by .NET Core 1, .NET Core 2, .NET Core 3 and .NET 5 are exemplary in this regard. Some speedups rely on SIMD instructions (single instruction, multiple data); however, those instructions hardly qualify as “future” hardware, as most of the CPUs sold over the last decade already have those capabilities.
⁶ Hardware vulnerabilities such as Meltdown turned out to negatively impact the performance of existing computing hardware. More problems of this kind can be expected in the future.
⁷ Layers come in all shapes and forms. Docker, HDFS, Python, NumPy, Pandas, TensorFlow, Postgres, Jupyter … are all components of prime interest, and yet, each one of those components introduces a software layer of its own.
⁸ When I started Lokad back in 2008, I decided to roll my very own forecasting engine. Yet, at the time, R was all the rage. In 2012, it was Hadoop. In 2014, it was Python and SciPy. In 2016, it was Scikit. In 2018, it was TensorFlow. In 2020, it’s Julia.
⁹ The litmus test to identify whether a supposed SaaS (Software as a Service) offering is leveraging a mutualized multitenant architecture consists of checking whether it’s possible to register for a free account of some kind. If the vendor can’t provide a free account, then it’s a near certainty that the vendor is merely doing ASP (Application Service Provider) instead of SaaS.
