Overfitting: when the accuracy measure goes wrong

Published by Joannes Vermorel.

As we already said, the whole point of forecasting is to build models that are accurate on the data you don't have. Yet, at first glance, this looks like yet another crazy mathematician's idea: both weird and utterly impractical.

But in our experience, measuring the real forecast accuracy is a real business problem. Failing at this costs money. Actually, the larger the company, the larger the cost.

Still clueless about the real forecast error?

Check out our latest 9-minute slidecast (the script is pasted below).

Slidecast scripts:

Overfitting: your forecasts may not be as good as the measure tells you

Forecasting accuracy is critical for many industries such as retail, manufacturing or services. If you over-forecast your customer demand, your costs explode because you will have too much staff and too much inventory. But if you under-forecast your customer demand, your customers get angry because they can't buy your product or because they have to wait too long to be served.

In this slidecast, I am going to introduce a little-known problem in forecasting called overfitting.

This problem is little known for two reasons. First, it's a subtle problem - non-obvious and counter-intuitive in many aspects. Second, it's a problem that has been puzzling mathematicians since the 19th century. It's only at the end of the nineties, a little more than 10 years ago, that the scientific community started to really comprehend this problem, both at the theoretical level and at the practical level.

Before getting any further, let me jump to the conclusion. Overfitting has a very strong impact on your forecasts. Overfitting can make you believe that you have a 10% forecast error while your real forecast error is 20%, and that would not even be a worst-case situation.

Overfitting is a very real business problem. Overfitting costs money.

Moreover, there is no work-around for overfitting. Modern statistical theories are built on top of this very concept, and overfitting plays a central part in them; no matter which way you approach statistics, overfitting is here to stay.

The key problem is to define what forecasting accuracy actually means.

Intuitively, the easiest way to measure the forecasting accuracy consists in making a forecast and waiting for the forecasted event to happen; so that we can compare the forecast with its corresponding outcome.

Yet, this method has a big drawback: it only tells you about the accuracy of past forecasts. From a business perspective, it matters little to know that past forecasts were good or bad, since you can’t change them anyway. What really matters is to make sure that forecasts that are yet to come are truly accurate.

Then, there is another problem: unless the method used to produce the forecasts is strictly identical from one forecast to the next, there is no reason to even believe that past accuracy could be used as a reliable indicator for future accuracy.

Since the 18th century, mathematicians have introduced the notion of statistical model. The primary aspect of statistical models is not, despite popular belief, to provide good or bad forecasts, but to provide repeatable forecasts.

With a statistical model, you get a process that automates the production of forecasts. It does not guarantee that forecasts will be good, but at least, if forecasts are poor, you can analyze the model further.

Let’s consider the following sample time-series. We will illustrate the overfitting problem by considering successive statistical models.

Let's start with a simple linear model. As you can see, the line does not fit the points of the time-series very well. As a result, we have a large error, over 50%. This model underfits the data.

Then, we can increase the complexity of the model. We now have a model that roughly follows the local average of the time-series. This new model looks much better than the previous one, and indeed the error has been divided by 5, now reaching 10%.

We have a good model here, but can we reduce the error even further? Well, the answer is simple: yes, we can produce a model that achieves less than 1% error.

As you can see, it’s rather easy; we just have to design a model that goes through nearly all the points of the time-series.
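
To make this progression concrete, here is a minimal Python sketch with made-up data (not the series from the slides): a straight line, a moving average, and a "connect the dots" model that passes through every historical point. The empirical error shrinks mechanically as the model is allowed to stick closer to the points.

    import numpy as np

    rng = np.random.default_rng(0)

    # An illustrative noisy time-series: a slow cycle plus random noise.
    t = np.arange(60)
    y = 20 + 8 * np.sin(t / 9) + rng.normal(scale=2.0, size=t.size)

    def empirical_error(fitted):
        """Mean absolute percentage error on the historical points."""
        return np.mean(np.abs((y - fitted) / y)) * 100

    # Model 1: straight line.
    line = np.polyval(np.polyfit(t, y, 1), t)
    # Model 2: moving average, i.e. a smoothed local mean.
    smooth = np.convolve(y, np.ones(9) / 9, mode="same")
    # Model 3: "connect the dots" - goes through every historical point.
    exact = y.copy()

    for name, fitted in [("line", line), ("smooth", smooth), ("exact", exact)]:
        print(f"{name:7s} empirical error: {empirical_error(fitted):5.1f}%")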

But can we really trust this model to be 1% accurate on future forecasts? Obviously, we can't! This model is just micro-optimizing tiny fluctuations of the past that are nothing but random variations. Intuitively, we can't forecast true randomness; we can only forecast patterns such as trend, seasonality, etc.

Now, if we compare the last two models, we have an obvious problem: according to our error measure, the model on the right - the one with 1% error - is ten times better than the model on the left.

Yet it is obvious that the best model is the one on the left: this model smooths out the random fluctuations of the time-series.

Thus, there is something wrong with the way we are measuring the error. This error, as illustrated in the previous graphics, is known as the empirical error. It’s the error that you get through measures on your historical data.

Yet, what we are really interested in is known as the real error. The real error is defined as the error of your forecasting model on the data you don’t have, that is to say: future data.

Admittedly, this definition looks like a complete paradox: how can you measure anything if, precisely, you don't have the data?

Since the 19th century, statisticians have been looking at this problem through an approach known as the bias-variance tradeoff.

If we look at the three models that we have, we can say that the linear model has a high bias: no matter which linear model we choose, it won't ever succeed in really fitting the data unless, of course, the data itself is linearly distributed; but in most situations, the linear model will just approximate the data distribution.

Then, the linear model has a low variance: intuitively, adding or removing one point in the time-series isn't going to affect the resulting model much. This model is fairly stable.

At the other extreme, the model on the right has a very low bias: it fits, overfits actually, most of the points of the time-series. Yet, the variance is very high: adding or removing a single point is likely to cause major changes in this model. There is no stability at all.
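
Continuing the earlier sketch, a crude way to probe the variance is to perturb a single historical point and watch how much each fitted model moves (again with illustrative data and models, not the ones from the slides):

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.arange(60)
    y = 20 + 8 * np.sin(t / 9) + rng.normal(scale=2.0, size=t.size)

    # Perturb ONE historical point and measure how much each model shifts.
    y2 = y.copy()
    y2[30] += 5.0  # one extra unit of noise on a single observation

    for name, fit in [
        ("line",   lambda s: np.polyval(np.polyfit(t, s, 1), t)),
        ("smooth", lambda s: np.convolve(s, np.ones(9) / 9, mode="same")),
        ("exact",  lambda s: s),
    ]:
        shift = np.abs(fit(y2) - fit(y)).max()
        print(f"{name:7s} max model shift: {shift:5.2f}")

The line barely moves, the moving average moves a little, and the "connect the dots" model swallows the perturbation whole.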

In the center, we have a model that balances both bias and variance; and this looks exactly the way to go.

Yet, the main issue with the bias versus variance approach is that we still have no clue about what is really happening with the real error, that is to say, the error on the data we don’t have.

This tradeoff is frequently more a heuristic than a true statistical criterion.

Since the late nineties, the field of statistical learning, a broader theory that encompasses statistical forecasting, has made a significant breakthrough both at the theoretical and practical levels.

This theory is complex, but a simple equation gives us major insight into the results. This theory tells us that the real error is upper bounded by the sum of the empirical error and another value called the structural risk.
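
In schematic form (the exact constants and confidence terms vary across formulations of such bounds, so take this as the general shape rather than a precise theorem):

    $$ E_{\mathrm{real}} \;\leq\; E_{\mathrm{empirical}} + R_{\mathrm{structural}} $$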

As we have seen previously, the empirical error is just the error measured on the historical data.

The structural risk is a theoretical criterion that can be explicitly computed for certain classes of models, and estimated for most of the other classes.

Back on our initial example, we can see that the structural risk increases with the model complexity.

Now if we quantify those structural risks, it gives us the following outlook.

We still do not know the real error - that value can't be measured directly anyway - but we see that the model in the center has the lowest upper bound on the real error.

The statistical learning theory does not give us the value of the real error, but it gives us instead an upper bound; and the whole point is to choose the model that achieves the lowest upper bound.

This upper bound acts as a maximal value for the real error.
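
Model selection then reduces to minimizing that bound. The numbers below are purely illustrative (roughly matching the magnitudes quoted in the slides), but they show the mechanics:

    # Empirical error plus an assumed structural-risk estimate per model.
    models = {
        "linear":  {"empirical": 0.50, "structural": 0.02},
        "smooth":  {"empirical": 0.10, "structural": 0.05},
        "complex": {"empirical": 0.01, "structural": 0.40},
    }

    # Pick the model with the lowest UPPER BOUND on the real error,
    # not the one with the lowest empirical error.
    bound = lambda name: models[name]["empirical"] + models[name]["structural"]
    print(min(models, key=bound))  # -> "smooth"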

Compared to the bias-variance tradeoff, we see that statistical learning theory gives us a quantitative bound on the real error.

The structural risk is difficult to estimate in practice. Yet, at this time, it’s still the best known solution to avoid overfitting.

We have seen previously that an obvious way of ending up with overfitting problems was to increase the model complexity.

But there is also another way, a more subtle way, of ending up with overfitting problems: this can happen by increasing the complexity of the data.

Adding extra points to the data typically reduces overfitting problems, but if you start adding extra dimensions to the data, then you are likely to end up with overfitting problems even if the models themselves stay unchanged.

In our experience at Lokad, this situation is frequently encountered by organizations that refine, year after year, their own forecasting models with ever-increasing data inputs, without explicitly taking care of the structural risk that lurks within their models.

In high dimensions, even linear models are subject to overfitting problems.
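
This is easy to reproduce. In the sketch below, a plain least-squares linear model fits 50 observations of pure noise perfectly once it is given 100 input dimensions, while remaining useless on fresh data:

    import numpy as np

    rng = np.random.default_rng(2)

    # 50 observations, 100 input dimensions, and a target that is PURE noise.
    X = rng.normal(size=(50, 100))
    y = rng.normal(size=50)

    # With more dimensions than observations, least squares reproduces the
    # noise exactly: zero empirical error...
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"historical error: {np.abs(X @ w - y).max():.2e}")

    # ...but on fresh data from the same process, the model predicts nothing.
    X_new = rng.normal(size=(50, 100))
    y_new = rng.normal(size=50)
    print(f"future error:     {np.abs(X_new @ w - y_new).mean():.2f}")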

This concludes this short presentation about overfitting. If you have to remember one thing, remember that without taking the structural risk into account, your measure of the forecast error is likely to be highly deceptive; and the bigger your company, the more money it will cost you.

Thank you very much for your interest. We will be happy to address your questions in our forums.


Better promotion forecasts in retail

Published by Joannes Vermorel.

Since our major Tags+Events upgrade last fall, we have been very actively working on promotion forecasting for retail. We now have thousands of promotional events in our databases; and the analysis of those events has led us to very interesting findings.

Although it's hardly surprising, we have found that:

  • promotion forecasts, when performed manually by practitioners, usually involve forecast errors above 60% on average. Your mileage may vary, but typical sales forecast errors in retail are usually closer to 20%.
  • including promotion data through tags and events reduces the average forecast error by roughly 50%. Again, your mileage may vary depending on the amount of data that you have on your promotional events.

As a less intuitive result, we have also found that rule-based methods and linear methods, although widely advertised by some experts and some software tools, are very weak against overfitting, and can distort the evaluation of the forecast error, leading to a false impression of performance in promotion forecasting.

Also, note that this 50% improvement has been achieved with quite a limited amount of information, usually no more than 2 or 3 binary descriptors per promotion.

Even crude data about your promotions lead to significant forecast improvements, which turn into significant working capital savings.

The first step to improve your promotion forecasts consists in gathering accurate promotion data. In our experience, this step is the most difficult and the most costly one. If you do not have accurate records of your promotions, then there is little hope of getting accurate forecasts. As people say: Garbage In, Garbage Out.

Yet, we did notice that even a single promotion descriptor, a binary variable that just indicates whether the article is currently promoted or not, can lead to a significant forecast improvement. Thus, although your records need to be accurate, they don’t need to be detailed to improve your forecasts.

Thus, we advise you to keep track precisely of the timing of your promotions: when did each one start? When did it end? Note that for eCommerce, a front-page display often has an effect comparable to a product promotion, so you also need to keep track of the evolution of your front page.

Then, article description matters. Indeed, in our experience, even the most frequently promoted articles are not going to have more than a dozen promotions in their market lifetime. On average, the number of known past promotions for a given article is ridiculously low, ranging from zero to one. As a result, you can't expect any reliable results by focusing on the past promotions of a single product at a time, because most of the time there aren't any.

So instead, you have to focus on articles that are similar to the article that you are planning to promote. With Lokad, you can do that by associating tags with your sales. Typically, retailers are using a hierarchy to organize their catalog. Think of an article hierarchy with families, sub-families, articles, variants, etc.

Translating a hierarchical catalog into tags can be done quite simply following the process illustrated below for a fictitious candy reseller:

The tags associated with the sales history of medium lemon lollipops would be LOLLIPOPS, LEMON, MEDIUM

This process will typically create 2 to 6 tags per article, depending on the complexity of your catalog.
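
As a hypothetical sketch (the catalog below is invented for the fictitious candy reseller, not a real data model), the translation is little more than emitting one tag per hierarchy level:

    # family -> sub-family -> variants, as in the illustration above.
    hierarchy = {
        "LOLLIPOPS": {
            "LEMON": ["SMALL", "MEDIUM", "LARGE"],
            "MINT":  ["SMALL", "MEDIUM"],
        },
    }

    def article_tags(family, subfamily, variant):
        """Each article inherits one tag per level of the catalog hierarchy."""
        assert variant in hierarchy[family][subfamily]
        return [family, subfamily, variant]

    print(article_tags("LOLLIPOPS", "LEMON", "MEDIUM"))
    # -> ['LOLLIPOPS', 'LEMON', 'MEDIUM']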

We have said that even very limited information about your promotions could be used to improve your sales forecasts right away. Yet, more detailed promotion information clearly improves the forecast accuracy.

We have found that two items are very valuable to improve the forecast accuracy:

  • the mechanism, which describes the nature of the discount offered to your customers. A typical mechanism is the flat discount (e.g. -20%), but there are many other mechanisms such as free shipping or discounts on larger quantities (e.g. buy one, get one free).
  • the communication, which describes how your customers get notified of the promotional event. Typically, communication includes marketing operations such as radio, newspaper or local ads, but also the custom packaging (if any) and the visibility of promoted articles within the point of sale.

In the case of larger distribution networks, the overall availability of the promotion should also be described if articles aren't promoted everywhere. Such a situation typically arises when point-of-sale managers can opt out of promotional operations.

Discussing with professionals, we have found that many retailers expect a set of rules to be produced by Lokad; those rules are expected to explain promotions, such as:

IF TV_ADS AND PERCENT25_DISCOUNT 
THEN PROMO_SALES = 5 * REGULAR_SALES;

Basically, those expected rules always follow more or less the same pattern:

  • A set of binary conditions that defines the scope of the rule.
  • A set of linear coefficients to estimate the effect of the rule.

We have found that many tools on the software market are available to help you discover those rules in your data; which, seemingly, has led many people to believe that this approach is the only one available.

Yet, according to our experiments, rule-based methods are far from optimal. Worse, those rules are really weak against overfitting. This weakness frequently leads to painful situations where there is a significant gap between the estimated forecast accuracy and the real forecast accuracy.

Overfitting is a very subtle, and yet very important, phenomenon in statistical forecasting. Basically, the central issue in forecasting is that you want to build a model that is very accurate against the data you don't have.

In particular, statistical theory indicates that it is possible to build models that happen to be very accurate when applied to the historical data, and still very inaccurate at predicting the future. The problem is that, in practice, if you do not carefully think of the overfitting problem beforehand, building such a model is not a mere possibility, but the most probable outcome of your process.

Thus, you really need to optimize your model against the data you don't have. Yet, this problem looks like a complete paradox, because, by definition, you can't measure anything if you don't have the corresponding data. And we have found that many professionals gave up on this issue, because it doesn't look like a tractable problem anyway.

Our advice is: DON’T GIVE UP

The core issue with those rules is that they perform too well on historical data. Each rule you add mechanically reduces the forecast error that you are measuring on your historical data. If you add enough rules, you end up with an apparent near-zero forecasting error. Yet, the empirical error that you measure on your historical data is an artifact of the process used to build the rules in the first place. Zero forecast error on historical data does not translate into zero forecast error on future promotions. Quite the opposite in fact, as such models tend to perform very poorly on future promotions.
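
The sketch below pushes rule discovery to its extreme to make the mechanism visible. The descriptors and uplift numbers are invented, and real rule-mining tools are more subtle than this, but the failure mode is the same:

    import numpy as np

    rng = np.random.default_rng(3)

    # 40 past promotions described by 12 binary flags (TV_ADS, DISCOUNT_25, ...)
    # and an uplift factor that is mostly noise around a true mean of 2.0.
    X_hist = rng.integers(0, 2, size=(40, 12))
    uplift_hist = 2.0 + rng.normal(scale=1.0, size=40)

    # One rule per observed descriptor pattern, each memorizing the average
    # uplift of the promotions that matched it.
    rules = {}
    for x, u in zip(X_hist, uplift_hist):
        rules.setdefault(tuple(x), []).append(u)

    def predict(x):
        # Fall back to the global average when no rule matches.
        return np.mean(rules.get(tuple(x), [uplift_hist.mean()]))

    def error(X, uplift):
        return np.mean([abs(predict(x) - u) for x, u in zip(X, uplift)])

    # Near-zero error on history, because the rules memorized it...
    print(f"historical error: {error(X_hist, uplift_hist):.2f}")

    # ...but no better than the plain average on future promotions.
    X_fut = rng.integers(0, 2, size=(40, 12))
    uplift_fut = 2.0 + rng.normal(scale=1.0, size=40)
    print(f"future error:     {error(X_fut, uplift_fut):.2f}")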

Although optimizing for the data you don't have is hard, statistical learning theory offers both a theoretical understanding of this problem and practical solutions to it. The central idea consists of introducing the notion of structural risk minimization, which balances the empirical error.

This will be discussed in a later post, stay tuned.

(Shameless plug) Many of those modern solutions, i.e. mathematical models that happen to be careful about the overfitting issue, have been implemented by Lokad, so that you don’t have to hire a team of experts to benefit from them.


Gentle introduction of Lokad: the slidecast

Published by Joannes Vermorel.

Lokad is a bit of a one-of-a-kind company, with a complete focus on forecasting. Want the big picture? Check out our new slidecast that aims to be a gentle introduction to what we do.


Slidecast Scripts

Hello, in this slidecast, I am going to give you a small overview of Lokad.

Lokad is an online statistical forecasting provider.

In short, companies are sending their data to us, and we give them forecasts back.

But let's start with the big picture.

Achieving good forecasts is a cornerstone of profitability for many industries such as retail, manufacturing and services.

If you happen to be a retailer, optimizing inventory levels is critical. Too little inventory and you end up with nothing to sell.

But on the other hand, too much inventory and your costs explode.

Then again, if you happen to be a service company, such as a bank or an insurance company, optimizing your staff levels is also critical.

Too much staff and you end up wasting money on idle employees. Too little staff and your customers get mad because of the long waiting queues.

In summary, forecasting can be used to achieve substantial savings in many industries.

Yet, to achieve those savings, you need good forecasts, and in our experience truly good forecasts are truly hard to obtain.

And this is the reason why Lokad exists in the first place: we take care of the forecasts so that you don't have to do it yourself.

I believe that there are two key benefits to using Lokad: first, it's way easier; second, it's more accurate.

Easier because you don't have to deal with statistical forecasting yourself.

We handle the process entirely for you, and once the setup is done, you can forecast in one click, or even fully automate the forecasting process if you need to.

Then, it's also more accurate because of the Lokad forecasting technology, which I believe to be pretty unique.

Sales, call volumes, cash flows and market prices can all be represented as time-series.

Traditionally, when people are trying to forecast a time-series, they build a statistical model for this particular time-series, one time-series at a time. But what happens if there is not enough data to reflect the future in this time-series? Well, the answer is simple: the forecasts produced by the statistical model are not accurate.

So, what Lokad is doing instead is that we are not looking only at this particular time-series, but we are also taking into account the other time-series of the company. You can think of it that way: instead of looking at the sales of a single product, Lokad is looking at the sales of all the products of the company.

If the sales of a single product are going up, it might be a trend, but it might also be a random effect of the market that does not indicate anything in particular. Yet, if the sales of 100 similar products are going up, the probability that those sales are just a random effect of the market is very low. It's clearly a trend, and that's exactly the type of correlation that Lokad is analyzing.
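
Here is a minimal sketch of this effect, with invented numbers and a plain least-squares fit standing in for Lokad's actual technology:

    import numpy as np

    rng = np.random.default_rng(4)

    # 100 similar products over 12 months, all sharing a weak upward trend
    # that is buried in product-level noise.
    months = np.arange(12)
    sales = 0.3 * months + rng.normal(scale=3.0, size=(100, 12))

    # Trend estimated from one product alone: swamped by noise.
    slope_single = np.polyfit(months, sales[0], 1)[0]

    # Trend estimated across all 100 products: the shared pattern stands out.
    slope_pooled = np.polyfit(months, sales.mean(axis=0), 1)[0]

    print(f"single product: {slope_single:+.2f} / pooled: {slope_pooled:+.2f}")
    # The pooled estimate lands close to the true +0.30; the single-product
    # estimate can be far off in either direction.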

But Lokad goes further. We are not only using data from your company, we are using all the data from all the companies that are also using Lokad to improve every single forecast that we deliver.

Basically, if you are looking at a single company that does not happen to be a super-large retailer, the amount of data available is usually quite limited. As a result, it is usually very hard to tell whether the company history reflects true patterns such as trend and seasonality, or nothing but noise and randomness.

In short, the more business data you have, the more accuracy you can get on your forecasts. And Lokad is taking this simple principle to the next stage by taking into account the data from many companies instead of a single one.

Then there is another subtle issue in forecasting: your historical sales data or your call volume data might not accurately represent the real historical demand of your customers.

For example, if a supplier is encountering a shortage, your sales go down. Yet, it doesn't mean that your customers don't want your products any more; it just reflects that there are fewer products to buy. Following the same idea, a promotion giving a product away with a large discount is likely to increase the sales. But this increase should not be considered as a trend.

The Lokad framework is capable of handling such situations. Basically, with Lokad, you can decorate your time-series with tags and events. Tags and events are just keywords that can be used to tell Lokad that two products are similar or that a past marketing event is impacting your historical data.

Yet, note that you just have to tell Lokad that a promotion took place. It's Lokad that will figure out the actual impact of the promotion and how it will influence future customer demand.
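
As a purely hypothetical illustration of the idea (this is not Lokad's actual data format or API):

    # A sales history "decorated" with tags and events.
    series = {
        "sales":  [120, 135, 128, 410, 131, 140],    # monthly sales history
        "tags":   ["LOLLIPOPS", "LEMON", "MEDIUM"],  # which articles look alike
        "events": [{"period": 3, "name": "PROMO"}],  # the spike was a promotion
    }
    # You only flag WHEN the promotion happened; estimating its actual impact
    # on future demand is left to the forecasting engine.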

We provide forecasts as a service so that you don’t need any actual knowledge about forecasting to use Lokad.

At this point, you might be wondering: how do I get started with Lokad?

Well, it's quite simple actually.

First you need to go on our website at lokad.com and open your Lokad account. You give us an email address, you choose a password, and that's pretty much it.

Then, you need to install a Lokad add-on. This client application will be used to send your data toward Lokad, and to retrieve the corresponding forecasts. For example, you can use "Lokad Excel Sales Forecasting" to perform forecasts directly from within Microsoft Excel.

Then once the add-on is installed, enter some data, click refresh and you're done.

Microsoft Excel is not always the most appropriate way to manage your data. Thus, Lokad also provides two specialized applications.

"Lokad Safety Stock Calculator" is designed for retailers and manufacturers. This application lets you optimize your inventory levels with sales forecasts.

"Lokad Call Center Calculator" is designed for call centers or contact centers. This application lets you optimize your staff levels with call volume forecasts.

Then, if those applications do not fit your needs, Lokad offers a web API, an Application Programming Interface that can be used to access our forecasting technology from any 3rd party application, as long as you have an internet connection available.

This concludes this short presentation of Lokad. Do not hesitate to drop questions on our forums; the Lokad team, including myself, is doing its best to address them all.
