Wikipedia lists seven steps for a data analysis process: data requirements, data collection, data processing, data cleaning, exploratory data analysis, data modeling, and finally the generation of production results. Whenever Lokad forecasts inventory, optimizes prices, or tackles any other kind of commerce optimization, our process is very similar to the one described above. However, there is another vital step, one that typically accounts for more than half of the effort applied by Lokad’s team and that is not even part of the list above. This step is data qualification.

Now that “Big Data” has become a buzzword, myriads of companies are trying to do more with their data. Data qualification is probably the second largest cause of project failures, right after unclear or unwise business goals - which happen whenever an initiative starts from the “solution” rather than from the “problem”. Let’s shed some light on this mysterious “data qualification” step.

Data as a by-product of business apps

The vast majority of business software is designed to help operate companies: the Point-Of-Sale system is there to let clients pay; the Warehouse Management System is there to pick and store products; the Web Conferencing software lets people carry out their meetings online, etc. Such software may produce data too, but that data is only a secondary by-product of the software’s primary purpose.

The systems mentioned above are designed to operate the business, and as a result, whenever a practitioner has to choose between better operations and better data, better operations will always be favored. For example, if a barcode fails to scan at the point of sale of your local hypermarket, the cashier will invariably pick a product that happens to have the same price and scan it twice; sometimes they even have their cheat sheet of barcodes gathered on a piece of paper. The cashier is right: the No. 1 priority is to let the client pay, no matter what. Generating accurate stock records is not an immediate goal when compared to the urgent need of serving a line of clients.

One might argue that the barcode scanning issue is actually a data cleaning issue. However, the situation is quite subtle: the records remain accurate to some extent, since the amount charged to the client is correct and so is the count of items in the basket. Naively filtering out all the suspicious records would do more harm than good for most analyses.
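To make this concrete, here is a minimal sketch in Python, built on hypothetical point-of-sale lines rather than any actual client data, of a naive “cleaning” rule that drops duplicate scans within a basket as suspicious. The basket totals were never wrong in the first place, yet the rule discards a perfectly legitimate double purchase along with the barcode substitution, skewing the per-product counts.

```python
from collections import Counter

# Hypothetical point-of-sale lines: each tuple is (basket_id, barcode, unit_price).
# In basket 1, "B2" was scanned twice as a substitute for a product whose barcode
# failed to scan; basket 2 genuinely contains two identical items.
pos_lines = [
    (1, "B1", 2.50), (1, "B2", 4.00), (1, "B2", 4.00),
    (2, "B3", 4.00), (2, "B3", 4.00),
]

def naive_clean(lines):
    """Drop every line whose barcode appears more than once within its basket."""
    counts = Counter((basket, code) for basket, code, _ in lines)
    return [line for line in lines if counts[(line[0], line[1])] == 1]

cleaned = naive_clean(pos_lines)

# The amounts charged were already correct before any "cleaning"...
print(sum(price for _, _, price in pos_lines))  # 18.5, exactly what the clients paid
# ...but the naive rule throws away 4 of the 5 lines, including the
# perfectly legitimate double purchase in basket 2.
print(len(cleaned))                             # 1
```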

Yet, we observe that too often, companies – and their software vendors too – enthusiastically ignore this fundamental pattern for nearly all the business data they generate, jumping straight from data processing to data cleaning.

Data qualification relates to the semantics of the data

The goal of the data qualification step is to clarify and thoroughly document the semantics of the data. Most of the time, when (large) companies send tabular data files to Lokad, they also send us an Excel sheet where each column found in the files gets a short line of documentation, typically like: Price: the price of the product. However, such a brief documentation line leaves a myriad of questions open (a sketch of how the answers might be recorded follows the list below):

  • what is the currency applicable for the product?
  • is it a price with or without tax?
  • is there some other variable (like a discount) that impacts the actual price?
  • is it really the same price for the product across all channels?
  • is the price value supposed to make sense for products that are not yet sold?
  • are there edge-case situations like zeros to reflect missing values?
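As a hedged illustration, and not a depiction of Lokad’s own tooling, the answers to such questions can be pinned down in a small machine-readable documentation record per field. Every attribute name below is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FieldDoc:
    """Hypothetical documentation record for a single tabular field."""
    name: str
    description: str
    currency: Optional[str] = None        # e.g. "EUR"; None if not a monetary field
    tax_included: Optional[bool] = None   # is the price recorded with or without tax?
    discounts_applied: bool = False       # is the value net of discounts?
    uniform_across_channels: bool = True  # same value on every sales channel?
    missing_value_convention: str = ""    # e.g. "0 means the price is not yet defined"

price_doc = FieldDoc(
    name="Price",
    description="Unit selling price of the product at the time of the order.",
    currency="EUR",
    tax_included=False,
    discounts_applied=True,
    uniform_across_channels=False,
    missing_value_convention="0 denotes products not yet sold; treat as unknown.",
)
```

The particular structure does not matter; what matters is that every ambiguity gets an explicit, written answer.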

Dates are also excellent candidates for semantic ambiguity. When an orders table contains a date column, the date-time can refer to the time of:

  • the basket validation
  • the payment entry
  • the payment clearance
  • the creation of the order in the accounting package
  • the dispatch
  • the delivery
  • the closure of the order

However, such a shortlist hardly covers the actual oddities encountered in real-life situations. Recently, for example, while working for one of the largest European online businesses, we realized that the dates associated with purchase orders did not have the same meaning depending on the originating country of the supplier factories. European suppliers were shipping by truck, and the date reflected the arrival at the warehouse; Asian suppliers were shipping by, well, ship, and the date reflected the arrival at the port. This little twist typically accounted for more than 10 days of difference in the lead time calculation.
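The sketch below, with hypothetical figures and column meanings rather than the client’s actual data, shows how the lead time computation diverges once the meaning of the purchase-order date is made explicit: for the sea-freight suppliers, the recorded date only marks arrival at the port, so a port-to-warehouse delay has to be added before the two supplier populations become comparable.

```python
from datetime import date

# Hypothetical purchase-order records: (supplier_region, order_date, recorded_arrival_date)
purchase_orders = [
    ("EU",   date(2015, 3, 1), date(2015, 3, 15)),  # date = arrival at the warehouse
    ("ASIA", date(2015, 3, 1), date(2015, 4, 10)),  # date = arrival at the *port*
]

# Assumed extra delay from port to warehouse for sea-freight suppliers.
PORT_TO_WAREHOUSE_DAYS = 12

def lead_time_days(region, ordered, arrived):
    """Lead time up to the warehouse, once the date semantics are made explicit."""
    days = (arrived - ordered).days
    if region == "ASIA":
        days += PORT_TO_WAREHOUSE_DAYS  # the recorded date stops at the port
    return days

for region, ordered, arrived in purchase_orders:
    print(region, lead_time_days(region, ordered, arrived))
# EU   14
# ASIA 52  (naively computed as 40 if the date semantics are ignored)
```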

For business-related datasets, the semantics of the data are nearly always dependent on the underlying company processes and practices. Documentation relating to such processes, when it exists at all, typically focuses on what is of interest to management or to auditors, but very rarely on the myriad of tiny details that live within the company’s IT landscape. Yet, the devil is in the details.

Data qualification is not data cleaning

Data cleaning (or cleansing) makes the most sense in experimental sciences, where certain data points (outliers) need to be removed because they would incorrectly “bend” the experiments. For example, some measurements in an optics experiment might simply reflect a defect in the optical sensor rather than anything actually relevant to the study.
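For contrast, here is a minimal sketch of what such a cleaning step often looks like: a crude filter that discards readings attributable to a faulty sensor. The readings and the threshold are made up for illustration.

```python
import statistics

# Hypothetical optical sensor readings; 940.0 reflects a sensor glitch,
# not the phenomenon being measured.
readings = [5.1, 5.3, 4.9, 5.0, 5.2, 940.0, 5.1]

# Classic cleaning step: discard points far from the median (a robust center),
# here anything more than 10 units away, a threshold picked purely for illustration.
center = statistics.median(readings)
cleaned = [r for r in readings if abs(r - center) <= 10.0]
print(cleaned)  # [5.1, 5.3, 4.9, 5.0, 5.2, 5.1] -- the glitch is gone; the data itself was altered
```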

However, this process does not reflect what is typically needed when analyzing business data. Outliers might be encountered when dealing with the leftovers of a botched database recovery, but mostly, outliers are marginal. The (business-wise) integrity of the vast majority of databases currently in production is excellent. Erroneous entries exist, but most modern systems do a good job of preventing the most frequent ones, and are quite supportive when it comes to fixing them afterwards as well. Data qualification is very different in the sense that the goal is neither to remove nor to correct data points, but rather to shed light on the data as a whole, so that the subsequent analysis truly makes sense. The only thing that gets “altered” by the data qualification process is the original data documentation.

Data qualification is the bulk of the effort

Having worked on dozens of data-driven projects related to commerce, aerospace, hospitality, bioinformatics, and energy, we have observed that data qualification has always been the most demanding step of the project. Machine learning algorithms might appear sophisticated, but as long as the initiative remains within the well-known boundaries of regression or classification problems, success in machine learning is mostly a matter of prior domain knowledge. The same goes for Big Data processing.

Data qualification problems are insidious because you don’t know what you’re missing: there is a semantic gap between the “true” semantics, as they should be understood given the data produced by the systems in place, and the “actual” semantics, as perceived by the people carrying out the data analysis. What you don’t know can hurt you. Sometimes, the semantic gap completely invalidates the entire analysis.

We observe that most IT practitioners vastly underestimate the depth of the peculiarities that come with most real-life business datasets. Most businesses don’t even have a full line of documentation per table field. Yet, we typically find that even with half a page of documentation per field, the documentation is still far from thorough.

One of the (many) challenges faced by Lokad is that it is difficult to charge for something that is not even perceived as a need in the first place. Thus, we frequently smuggle data qualification work in under the guise of more noble tasks like “statistical algorithm tuning” or similar scientific-sounding labels.

The reality of the work, however, is that data qualification is not only intensive from a manpower perspective, it is also a truly challenging task in itself. It is a mix of understanding the business, understanding how processes spread over many systems (some of them invariably of the legacy kind), and bridging the gap between the data as it exists and the expectations of the machine learning pipeline.

Most companies vastly underinvest in data qualification. In addition to being an underestimated challenge, investing talent in data qualification does not result in a flashy demo, or even in actual numbers. As a result, companies rush to the later stages of the data analysis process, only to find themselves swimming in molasses because nothing really works as expected. There is no quick fix for an actual understanding of the data.