Integers and uncertainty in differentiable programming

Markus Leopoldseder (Director of Knowledge - Global Manufacturing and Supply Chain Practice at McKinsey) raised two relevant questions concerning the applicability of Differentiable Programming (DP) for supply chain purposes. In this fairly technical post, we’ll try to provide some insights on how integer’s constraints and how uncertainty are respectively dealt with in DP. This is only a first glance at Lokad’s approach. We are planning to post at a later date extended recipes on docs.lokad.com.

How does DP apply to integer variables? Indeed, if the objective function takes integers as inputs, it might not be differentiable.

From a supply chain perspective, most decisions are discrete: we can’t order 4.2 units of a product, it’s either 4 or 5. Thus, we are seeking - both from a learning and from an optimization perspective, methods that work well with integers. As acutely pointed out by our reader, parameters in DP are numbers, and those parameters cannot be constrained to integers, because this perspective is not compatible with the stochastic gradient descent which lies at the core of DP.

There are several ways of obtaining discrete decisions and their corresponding discrete objective functions through DP, ranging from naive but straightforward, to downright complicated but unparalleled as far as numerical optimization goes.

Integers and uncertainty in differentiable programming

The easiest approach consists of interpolating the objective function at training time, and rounding the results at evaluation time. For example, if we seek a purchase order quantity - expected to be an integer - the objective function can be extended to arbitrary numbers through interpolation. This works well when considering systems that, albeit being discrete, exhibit fairly linear behaviors. However, when facing a strong non-linearity such as an MOQ (minimal order quantity) constraint, this does not work so well.

In order to deal with such situations, the objective function can be replaced by a surrogate function, a function that approximates the original objective function, but in a smooth differentiable way. For example, the step function typically associated to the penalty cost of an MOQ constraint can be replaced by a sigmoid function. Epochs after epochs, the surrogate function is progressively deformed to become numerically closer to the original objective function.

From a supply chain perspective, in our experience, surrogate functions work surprisingly well. Indeed, situations where it’s not possible to smoothly iterate toward good solutions are rare. Supply chain problems are not cryptographic puzzles where changing a single bit from a solution derails the solution entirely. For example, if passing a purchase order at 490 units is profitable while there is an MOQ at 500, odds are high that the purchase order of 500 units is profitable too.

Then, there are more sophisticated approaches inspired by variational autoencoders: the fractional outputs of a layer (as in “deep learning layers”) are converted into a random integer deviate obtained from an arbitrary distribution, for example a Poisson distribution. Through this mechanism the program, while operating only with fractional parameters, does produce integer outputs that can then be injected into the objectives function. The stochastic gradient descent repeats the process a large number of times, ala Monte-Carlo, ensuring that the laws governing the generation of random integer deviates are properly tuned.

Finally, the most complex approaches, such as AlphaZero, rely on the introduction of a complex list of layers (e.g. a “deep” computational network) typically ending with a Softmax-like layer in order to generate discrete decisions. These approaches offer state-of-the-art results on highly non-linear optimization problems, as demonstrated by the AlphaGo (later redefined as AlphaZero) winning against the Lee Sedol. DP can also take advantage of these methods - merely tuning down the depth and the complexity of the network to keep computational overhead under control. Fortunately in practice, while solving a MOQ problem is relatively tough, it’s nowhere near as tough as beating a world champion at Go as far as optimization problems go, and “shallow” networks (by deep learning’s standards) already achieve a lot.

Offering more direct and practical ways of dealing with discrete situations, as commonly found in supply chain situations, was one of the core reasons that lead us at Lokad to engineer our own DP software stack as opposed to bending an existing framework; precisely in order to have more latitude to introduce special constructs engineered to fit those situations.

Some of those constructs are nothing more than a “standard” library of functions written through the DP language itself - as a way to mitigate repetitive tasks and avoid classes of mistakes. Some of those constructs however are more subtle and interleaved with the stochastic gradient descent in order to offer specialized capabilities that could not have been implemented through “pure” automatic differentiation.

How do you deal with uncertainty with DP? Indeed, DP can optimize any complex objective function by tuning parameters using a stochastic gradient descent optimizing a neural network which is nothing else than a special function as well. How are probability distributions considered?

The dominant approach that we adopt to deal with uncertainty in DP consists of generating trajectories - generated time-series reflecting streams of future events - from underlying probability distributions, and then letting the objective function operate following a Monte Carlo pattern. This approach offers the possibility to model complex discrete interactions within a system - such as the consequence of cascading BOMs (bills of materials) even if the probability distributions - used as inputs - were only available for the future demand of finished products.

This approach isn’t novel. Variational autoencoders - although engineered from a rather different perspective - employ a similar strategy, taking the parametrization of a distribution (a Gaussian for VAE) as input and producing deviates as outputs. As discussed above when we already mentioned variational autoencoders, we are frequently using counting distributions that produce integer deviates, rather than continuous distribution like Gaussians.

This generative approach - where distributions are turned into observations (i.e. trajectories) - can also be fully internalized within the DP process. Instead of taking probability distributions as inputs, and optimizing decisions from them, the distributions themselves can be learned while optimization is performed at the same time. For example forecasting demand can be performed jointly with pricing optimization.,as the pricing strategy has a strong impact on the demand, the two aspects can’t really be analyzed in isolation from one another.

The generative part of the recipe mentioned above is already found in generative adversarial networks that have been tremendously successful at generating photo-realistic images. When considering time-series, the LSTM and GRU (a simpler more modern equivalent of the LSTM) offer ways to generate complex time-series that could not have been explicitly modeled through probability distributions.

DP offers more flexibility to leverage those capabilities from a supply chain perspective while dealing with more heterogeneous objects (graphs, time-series, relational data) compared to the scenarios usually considered from a deep learning perspective. Again, most of Lokad’s engineering efforts have been concentrated on the supply chain perspective, in order to make sure that the tooling was aligned with the specific requirements that arise when optimizing stocks, prices, purchases, productions, assortments, etc - instead of focusing on images, voices and natural language processing.