Entropy analysis for supply chain IT system discovery
The IT landscape of supply chains is nearly always complex. Indeed, by nature, supply chains involve multiple actors, multiple sites, multiple systems, etc. As a result, building data-driven insights in supply chains is a challenge due to the sheer heterogeneity of the IT landscape. Too frequently, supply chain analytics deliver nonsensical results precisely because of underlying garbage in, garbage out problems.
At Lokad, we have not only developed a practice that thoroughly surveys the IT landscape and the datasets inhabiting it, but we have also created some bits of technology to facilitate the surveying operations themselves. In this post, we detail one step of our surveying methodology, which is based on Shannon entropy. We successfully leveraged entropy analysis for several large scale supply chain initiatives.
Our surveying process starts by reviewing all the database tables that are understood to be relevant for the supply chain initiative. Supply chains are complex and, consequently, the IT systems that operate the supply chains reflect this complexity. Moreover, the IT systems might have been evolving for several decades, and layers of complexity tend to accumulate in those systems. As a result, it’s not uncommon to identify dozens of database tables, with each table having dozens of columns, i.e. fields in database terminology.
For large supply chains, we have observed situations where the total number of distinct fields is above 10,000. By leveraging entropy analysis, we are able to remove half of the columns from the picture immediately and consequently reduce the remaining workload significantly.
Dealing with that many columns is a major undertaking. The problem is not data-processing capabilities: with cloud computing and adequate data storage, it’s relatively straightforward to process thousands of columns. The real challenge is to make sense of all those fields. As a rule of thumb, we estimate that well-written documentation of a field takes up about one page, once interesting use cases and edge cases are covered. Without proper documentation of the data, data semantics are lost, and odds are very high that any complex analysis performed on the data will suffer from massive garbage-in, garbage-out headaches. Thus, with 10,000 fields, we are faced with the production of a 10,000-page manual, which demands a truly monumental effort.
Yet, we have observed that those large IT systems also carry a massive amount of dead weight. While the raw number of fields appears very high, in practice it does not mean that every column found in the system contains meaningful data. At the extreme, a column might be entirely empty or constant and thus contain no information whatsoever. A few fields can be immediately discarded because they are truly empty; however, we have observed that fully empty fields are actually pretty rare. Sometimes, the only non-constant entries in a column date from the day the system went live, and the field was never used again afterward. While truly empty fields are relatively rare, degenerate fields are usually extremely numerous: columns that contain almost no data, well below any reasonable threshold for leveraging this data in production.
For example, a PurchaseOrders table containing over one million rows might have an Incoterms column that is non-empty in only 100 rows; furthermore, all those rows are more than five years old, and 90 of them contain the entry thisisatest. In this case, the Incoterms field is clearly degenerate, and there is no point in even trying to make sense of this data. Yet, a naive SQL filter - say, counting non-null rows - will fail to identify such a column as degenerate.
Thus, a tool to identify degenerate columns is needed. It turns out that Shannon entropy is an excellent candidate. Shannon entropy is a mathematical tool to measure the quantity of information that is contained in a message. The entropy is measured in Shannons, which is a unit of measurement somewhat akin to bits of information. By treating the values found in the column itself as the message, Shannon entropy gives us a measure of the information contained in the column expressed in Shannons.
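To make this concrete, here is a small Python sketch of the underlying computation - a hypothetical stand-in, not the actual Envision implementation. Each distinct value found in the column is treated as a symbol, and we compute the entropy of the empirical distribution of those symbols; the column data below is made up, mimicking the degenerate Incoterms example:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Entropy, in shannons (bits), of the empirical
    distribution of the values found in a column."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A column that is blank except for a handful of stale test entries,
# mimicking the degenerate Incoterms column described above.
incoterms = [""] * 999_900 + ["thisisatest"] * 90 + ["FOB"] * 10

print(shannon_entropy(incoterms))  # close to zero: degenerate
```

Note that the column is 99.99% constant, so even though it technically holds a few distinct values, its entropy collapses toward zero - which is exactly what a naive non-null count fails to capture.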
While all of this might sound highly theoretical, putting this insight into practice is extremely straightforward. All it takes is to use the entropy() aggregator provided by Envision. The tiny script below illustrates how we can use Envision to produce the entropy analysis of a table with 3 fields.
read "data.csv" as T[*]

show table "List of entropies" with
  entropy(T.Field1)
  entropy(T.Field2)
  entropy(T.Field3)
An entropy lower than 0.1 is a very good indicator of a degenerate column. If the entropy is lower than 0.01, the column is all but guaranteed to be degenerate.
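The filtering step itself can be sketched as follows - again in Python rather than Envision, with a made-up table, so the column names and data are purely illustrative:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Entropy, in shannons (bits), of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical table: column name -> list of values.
table = {
    "OrderId":   [str(i) for i in range(1000)],  # highly informative
    "Incoterms": [""] * 999 + ["thisisatest"],   # degenerate
    "Flag":      ["Y"] * 1000,                   # constant, zero entropy
}

# Keep only the columns whose entropy clears the 0.1 shannon threshold.
kept = {name for name, col in table.items() if shannon_entropy(col) > 0.1}

print(kept)  # only OrderId survives the filter
```

The surviving columns are the ones worth documenting and investigating; the rest can be set aside without further effort.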
Our experience indicates that performing an initial filtering based on entropy measurements reliably eliminates between one-third and two-thirds of the initial fields from the scope of interest. The savings in time and effort are very substantial: for large supply chain projects, we are talking about man-years being saved through this analysis.
We unintentionally discovered a positive side effect of entropy filtering: it lowers the IT fatigue associated with the (re)discovery of the IT systems. Indeed, investigating a degenerate field typically proves to be an exhausting task. As the field is unused - sometimes merely no longer used - nobody is quite sure whether the field is truly degenerate or whether it plays a critical, but obscure, role in the supply chain processes. Because of the complexity of supply chains, there is frequently nobody who can positively affirm that a given field is not used. Entropy filtering immediately eliminates the worst offenders, the ones guaranteed to lead us on a wild-goose chase.