# Overfitting: when the accuracy measure goes wrong

Published on by Joannes Vermorel.

As we already said, the whole point of forecasting is to build models that are accurate on the data you don't have. Yet, at first glance, this looks like yet another crazy mathematician's idea: both weird and utterly impractical.

But in our experience, measuring the real forecast accuracy is a real business problem. Failing at this costs money. Actually, the larger the company, the larger the cost.

Still clueless about the real forecast error?

Check out our latest 9min slidecast (scripts are pasted below).

Slidecast scripts:

Overfitting: your forecasts may not be as good as the measure tells you

Forecasting accuracy is critical for many industries such as retail, manufacturing or services. If you over-forecast your customer demand, your costs explode because you will have too much staff, too much inventory. But if you under-forecast your customer demand, your customers get angry because they can’t buy your product or because they have to wait for too long to get served.

In this slidecast, I am going to introduce a little-known problem in forecasting called overfitting.

This problem is little known for two reasons. First, it’s a subtle problem - non-obvious and counter-intuitive in many aspects. Second, it’s a problem that has been puzzling mathematicians since the 19th century. It’s only at the end of the nineties, a little more than 10 years ago, that the scientific community started to really comprehend this problem, both at the theoretical level and at the practical level.

Before getting any further, let me jump to the conclusion. Overfitting has a very strong impact on your forecasts. Overfitting can make you believe that you have a 10% forecast error while your real forecast error is 20%, and that would not even be a worst-case situation.

Overfitting is a very real business problem. Overfitting costs money.

Moreover, there is no work-around for overfitting. Modern statistical theories are built on top of this very concept. Overfitting plays a central part in those theories; no matter which way you approach statistics, overfitting is here to stay.

The key problem is to define what forecasting accuracy actually means.

Intuitively, the easiest way to measure the forecasting accuracy consists in making a forecast and waiting for the forecasted event to happen; so that we can compare the forecast with its corresponding outcome.

Yet, this method has a big drawback: it only tells you about the accuracy of past forecasts. From a business perspective, it matters little to know whether past forecasts were good or bad, since you can’t change them anyway. What really matters is to make sure that forecasts that are yet to come are truly accurate.

Then, there is another problem: unless the method used to produce the forecasts is strictly identical from one forecast to the next, there is no reason to even believe that past accuracy could be used as a reliable indicator for future accuracy.

Since the 18th century, mathematicians have introduced the notion of a statistical model. The primary aspect of statistical models is not, despite popular belief, to provide good or bad forecasts, but to provide repeatable forecasts.

With a statistical model, you get a process that automates the production of forecasts. It does not guarantee that forecasts will be good, but at least, if forecasts are poor, you can analyze the model further.

Let’s consider the following sample time-series. We will illustrate the overfitting problem by considering successive statistical models.

Let’s start with a simple linear model. As you can see, the line does not fit the points of the time-series well. As a result, we have a large error, over 50%. This model does not really fit the data.

Then, we can increase the complexity of the model. We now have a model that roughly follows what looks like the local time-series average. This new model looks much better than the previous one, and indeed the error has been divided by 5, now reaching 10%.

We have a good model here, but can we still reduce the error further? Well, the answer is simple: yes, we can produce a model that achieves less than 1% error.

As you can see, it’s rather easy; we just have to design a model that goes through nearly all the points of the time-series.

But can we really trust this model to be 1% accurate on future forecasts? Obviously, we can’t! This model is just micro-optimizing tiny fluctuations of the past that are nothing but random variations. Intuitively, we can’t forecast true randomness; we can only forecast patterns such as trend, seasonality, etc.

Now, if we compare the last two models, we have an obvious problem: according to our error measure, the model on the right - the one with 1% error - is ten times better than the model on the left.

Yet it is obvious that the best model is the one on the left: it smooths out the random fluctuations of the time-series.

Thus, there is something wrong with the way we are measuring the error. This error, as illustrated in the previous graphics, is known as the empirical error. It’s the error that you get through measures on your historical data.
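To make the empirical error concrete, here is a minimal Python sketch. The twelve data points are made up, and the three stand-in models (a least-squares line, a centered moving average and a model that replays every point) are assumptions chosen to mirror the three slides; they are not the models actually used in the slidecast. The error measure is a mean absolute percentage error.

```python
# Made-up monthly history with a seasonal hump plus small irregularities.
history = [23, 27, 49, 56, 73, 73, 73, 56, 49, 27, 23, 13]


def mape(actual, fitted):
    """Mean absolute percentage error between two equally long sequences."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, fitted)) / len(actual)


def fit_line(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * x for x in range(n)]


def local_average(ys, radius=1):
    """Centered moving average; the window shrinks at both ends."""
    fitted = []
    for i in range(len(ys)):
        window = ys[max(0, i - radius): i + radius + 1]
        fitted.append(sum(window) / len(window))
    return fitted


def memorize(ys):
    """A 'model' that simply replays every historical point."""
    return list(ys)


# All three errors are measured on the very data the models were fitted on.
empirical_errors = {
    "linear": mape(history, fit_line(history)),
    "local average": mape(history, local_average(history)),
    "through every point": mape(history, memorize(history)),
}

for name, error in empirical_errors.items():
    print(f"{name}: {error:.1%}")
```

On these numbers the three errors come out at roughly 58%, 12% and 0%: the same ordering as in the slides, and the same trap, since the model that merely replays the past looks perfect.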

Yet, what we are really interested in is known as the real error. The real error is defined as the error of your forecasting model on the data you don’t have, that is to say: future data.

Yet this definition looks like a complete paradox: how can you measure anything if, precisely, you don’t have the data?
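One way to get a feel for the gap between the two errors is to pretend that the last known points have not happened yet. The sketch below does exactly that on a made-up series: the first 8 points are treated as history and the last 2 are hidden. Hiding recent data is a common proxy for the real error, but it is an assumption of this sketch and not something the slidecast describes. The Lagrange polynomial stands in for "a model that goes through all the points".

```python
# Made-up series: the first 8 points play the role of the known history,
# the last 2 points are hidden and stand in for the future.
series = [100, 104, 101, 107, 103, 110, 106, 113, 109, 116]
history, hidden = series[:8], series[8:]


def mape(actual, predicted):
    """Mean absolute percentage error."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def linear_model(ys):
    """Return x -> least-squares line fitted on (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys)) / sxx
    return lambda x: y_mean + slope * (x - x_mean)


def interpolating_model(ys):
    """Return the Lagrange polynomial that goes through every historical point."""
    def predict(x):
        total = 0.0
        for i, yi in enumerate(ys):
            weight = 1.0
            for j in range(len(ys)):
                if j != i:
                    weight *= (x - j) / (i - j)
            total += yi * weight
        return total
    return predict


def errors(model):
    """Error on the history the model was fitted on, then on the hidden points."""
    past = range(len(history))
    future = range(len(history), len(series))
    empirical = mape(history, [model(x) for x in past])
    held_out = mape(hidden, [model(x) for x in future])
    return empirical, held_out


linear_empirical, linear_held_out = errors(linear_model(history))
poly_empirical, poly_held_out = errors(interpolating_model(history))

print(f"linear:     empirical {linear_empirical:.1%}, held-out {linear_held_out:.1%}")
print(f"polynomial: empirical {poly_empirical:.1%}, held-out {poly_held_out:.1%}")
```

The polynomial has a 0% empirical error, yet its forecasts for the two hidden points are about 782 and 4804 against actual values of 109 and 116. The line, whose empirical error is worse, stays within a few percent on the hidden points.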

Since the 19th century, statisticians have been looking at this problem through an approach known as the bias-variance tradeoff.

If we look at the three models that we have, we can say that the linear model has a high bias: no matter which linear model we choose, it will never really succeed in fitting the data, unless, of course, the data itself is linearly distributed; but in most situations, the linear model will just approximate the data distribution.

Then, the linear model has a low variance: intuitively, adding or removing one point in the time-series isn’t going to affect the resulting model much. This model is fairly stable.

At the other extreme, the model on the right has a very low bias: it fits, overfits actually, most of the points of the time-series. Yet, the variance is very high: adding or removing a single point is likely to cause major changes in this model. There is no stability at all.
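The stability argument can be checked numerically. In the following sketch, which uses made-up data and a Lagrange polynomial as an assumed stand-in for the low-bias model, each model is refitted eight times, each time with a different single point removed, and we look at how far the forecast for the next period moves.

```python
history = [100, 104, 101, 107, 103, 110, 106, 113]  # made-up values
next_x = len(history)  # the period we want to forecast


def linear_forecast(points, x):
    """Least-squares line fitted on (x, y) pairs, evaluated at x."""
    n = len(points)
    x_mean = sum(px for px, _ in points) / n
    y_mean = sum(py for _, py in points) / n
    sxx = sum((px - x_mean) ** 2 for px, _ in points)
    slope = sum((px - x_mean) * (py - y_mean) for px, py in points) / sxx
    return y_mean + slope * (x - x_mean)


def interpolating_forecast(points, x):
    """Lagrange polynomial through every (x, y) pair, evaluated at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        weight = 1.0
        for j, (xj, _) in enumerate(points):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def forecast_spread(forecast):
    """Refit with each single point removed; return max minus min forecast."""
    points = list(enumerate(history))
    forecasts = [
        forecast(points[:i] + points[i + 1:], next_x)
        for i in range(len(points))
    ]
    return max(forecasts) - min(forecasts)


linear_spread = forecast_spread(linear_forecast)
polynomial_spread = forecast_spread(interpolating_forecast)

print(f"linear model:        forecasts move by {linear_spread:.1f}")
print(f"interpolating model: forecasts move by {polynomial_spread:.1f}")
```

With these numbers the linear forecasts stay within a few units of each other, while the interpolating forecasts swing by more than two thousand units: low variance on one side, very high variance on the other.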

In the center, we have a model that balances both bias and variance; and this looks exactly like the way to go.

Yet, the main issue with the bias versus variance approach is that we still have no clue about what is really happening with the real error, that is to say, the error on the data we don’t have.

This tradeoff is frequently more a heuristic than a true statistical criterion.

Since the late nineties, the field of statistical learning, a broader theory that encompasses statistical forecasting, has made a significant breakthrough both at the theoretical and practical levels.

This theory is complex, but a simple equation gives us major insights into the results. This theory tells us that the real error is upper bounded by the sum of the empirical error and another value called the structural risk.
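Written as an inequality, with symbols that are this note's own notation rather than the slidecast's ($R$ for the real error, $R_{\text{emp}}$ for the empirical error, $\Phi$ for the structural risk of a model $f$):

$$
R(f) \;\le\; R_{\text{emp}}(f) + \Phi(f)
$$

As an illustration of what such a term can look like, one classical bound from Vapnik's work on binary classification holds with probability at least $1 - \eta$ for a model class of VC dimension $h$ fitted on $n$ data points:

$$
R(f) \;\le\; R_{\text{emp}}(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
$$

The slidecast does not say which bound is used for forecasting, so this formula is only an example. It does show the two behaviors that matter here: the second term grows with the complexity $h$ of the model and shrinks as the number of data points $n$ grows.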

As we have seen previously, the empirical error is just the error measured on the historical data.

The structural risk is a theoretical criterion that can be explicitly computed for certain classes of models, and estimated for most of the other classes.

Back to our initial example, we can see that the structural risk increases with the model complexity.

Now if we quantify those structural risks, it gives us the following outlook.

We still do not know the real error - that value can’t be measured directly anyway - but we see that the model in the center has the lowest upper bound on the real error.

The statistical learning theory does not give us the value of the real error, but it gives us instead an upper bound; and the whole point is to choose the model that achieves the lowest upper bound.

This upper bound acts as a maximal value for the real error.
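The selection rule itself fits in a few lines of Python. In this sketch the empirical errors are the approximate figures quoted for the three models earlier in the script, while the structural risk values are hypothetical numbers invented only to show a penalty that grows with complexity; the slides' actual values are not reproduced in the script.

```python
# Approximate empirical errors quoted for the three models in the script.
empirical_error = {
    "linear": 0.50,
    "local average": 0.10,
    "through every point": 0.01,
}

# Structural risks: hypothetical numbers chosen only to illustrate the idea
# that the penalty grows with model complexity.
structural_risk = {
    "linear": 0.02,
    "local average": 0.05,
    "through every point": 0.40,
}

# Upper bound on the real error = empirical error + structural risk.
upper_bound = {
    name: empirical_error[name] + structural_risk[name]
    for name in empirical_error
}

# Pick the model whose bound on the real error is the lowest.
best_model = min(upper_bound, key=upper_bound.get)

for name, bound in upper_bound.items():
    print(f"{name}: real error <= {bound:.0%}")
print("selected:", best_model)
```

Under these assumed penalties the bounds are 52%, 15% and 41%, so the model in the center is selected even though its empirical error is ten times that of the most complex model.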

Compared to the bias-variance tradeoff, we see that the statistical learning theory gives us something quantitative: an explicit upper bound on the real error.

The structural risk is difficult to estimate in practice. Yet, at this time, it’s still the best known solution to avoid overfitting.

We have seen previously that an obvious way of ending up with overfitting problems was to increase the model complexity.

But there is also another way, a more subtle way, of ending up with overfitting problems: this can happen by increasing the complexity of the data.

Adding extra points to the data typically reduces overfitting problems, but if you start adding extra dimensions to the data, then you are likely to end up with overfitting problems even if the models themselves stay unchanged.

In our experience at Lokad, this situation is frequently encountered by organizations that refine, year after year, their own forecasting models with ever increasing data inputs; without explicitly taking care of the structural risk that lurks within their models.

In high dimensions, even linear models are subject to overfitting problems.
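A small experiment illustrates this. The sketch below is an assumed setup, not taken from the slidecast: both the inputs and the values to predict are pure random noise, so there is nothing to learn, and the linear model is given as many input columns as there are historical points. The linear system is solved with a plain Gaussian elimination to keep the example free of external libraries.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n_points = 8
n_inputs = 8  # as many input columns as historical points


def random_rows(count):
    """Input columns that are pure noise, unrelated to the target."""
    return [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(count)]


def random_targets(count):
    """Targets that are pure noise as well: there is nothing to learn."""
    return [rng.gauss(0, 1) for _ in range(count)]


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    weights = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(a[row][c] * weights[c] for c in range(row + 1, n))
        weights[row] = (a[row][n] - known) / a[row][row]
    return weights


def mean_abs_error(rows, targets, weights):
    """Mean absolute error of the linear model defined by the weights."""
    predictions = [sum(w * x for w, x in zip(weights, row)) for row in rows]
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


train_x, train_y = random_rows(n_points), random_targets(n_points)
test_x, test_y = random_rows(500), random_targets(500)

# With as many columns as points, the linear model has enough freedom
# to reproduce the training targets exactly.
weights = solve(train_x, train_y)

empirical = mean_abs_error(train_x, train_y, weights)
fresh = mean_abs_error(test_x, test_y, weights)
print(f"empirical error: {empirical:.6f}, error on fresh data: {fresh:.3f}")
```

The empirical error is zero up to rounding, although the model has captured nothing but noise; on fresh data drawn the same way, the error is of the order of the noise itself or worse. The model never changed from "linear"; only the number of input dimensions did.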

This concludes this short presentation about overfitting. If you have to remember one thing, remember that without taking into account the structural risk, your measure of the forecast error is likely to be highly deceptive; and the bigger your company, the more money it will cost you.

Thank you very much for your interest. We will be happy to address your questions in our forums.

Categories: accuracy, forecasting, insights Tags: forecasting insights measure overfitting risk slidecast vapnik

# Gentle introduction of Lokad: the slidecast

Published on by Joannes Vermorel.

Lokad is a bit of a one-of-a-kind company with a complete focus on forecasting. Want the big picture? Check out our new slidecast that aims to be a gentle introduction to what we do.

Slidecast Scripts

Hello, in this slidecast, I am going to give you a small overview of Lokad.

Lokad is an online statistical forecasting provider.

In short, companies are sending their data to us, and we give them forecasts back.

Achieving good forecasts is a cornerstone of profitability for many industries such as retail, manufacturing and services.

If you happen to be a retailer, optimizing inventory levels is critical. Too little inventory and you end up with nothing to sell.

But on the other hand, too much inventory and your costs explode.

Then again, if you happen to be a service company, such as a bank or an insurance company, optimizing your staff levels is also critical.

Too much staff and you end up wasting money on idle employees. Too little staff and your customers get mad because of the long waiting queues.

In summary, forecasting can be used to achieve substantial savings in many industries.

Yet, to achieve those savings, you need good forecasts, and in our experience truly good forecasts are truly hard to obtain.

And this is the reason why Lokad exists in the first place: we take care of the forecasts so that you don't have to do it yourself.

I believe that there are two key benefits of using Lokad: first, it's way easier, and second, it's more accurate.

Easier because you don't have to deal with statistical forecasting yourself.

We handle the process entirely for you, and once the setup is done, you can forecast in one click, or even fully automate the forecasting process if you need to.

Then, it's also more accurate because of the Lokad forecasting technology that I believe to be pretty unique.

Sales, call volumes, cash flows, market prices can be represented as time-series.

Traditionally, when people are trying to forecast a time-series, they build a statistical model for this particular time-series, one time-series at a time. But what happens if there is not enough data to reflect the future in this time-series? Well, the answer is simple: the forecasts produced by the statistical model are not accurate.

So, what Lokad is doing instead is that we are not looking only at this particular time-series, but we are also taking into account the other time-series of the company. You can think of it that way: instead of looking at the sales of a single product, Lokad is looking at the sales of all the products of the company.

If the sales of a single product are going up, it might be a trend, but it might also be a random effect of the market that does not indicate anything in particular. Yet, if the sales of 100 similar products are going up, the probability that those sales are just a random effect of the market is very low. It's clearly a trend, and that's exactly the type of correlation that Lokad is analyzing.
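A back-of-the-envelope figure shows how low that probability can be. Assume, purely for illustration, that in the absence of any trend each product has an independent 50% chance of selling more than in the previous period. The probability that all 100 products go up by chance alone is then:

$$
\left(\tfrac{1}{2}\right)^{100} \approx 7.9 \times 10^{-31}
$$

Real products are rarely independent of one another, so this is an idealized figure, but it conveys why a movement shared by many similar time-series is much stronger evidence than the same movement in a single one.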

But Lokad goes further. We are not only using data from your company, we are using all the data from all the companies that are also using Lokad to improve every single forecast that we deliver.

Basically, if you are looking at a single company that does not happen to be a super-large retailer, the amount of data available is usually quite limited. As a result, it is usually very hard to tell whether the company history reflects true patterns such as trend, seasonality, or if it’s not patterns but noise and randomness.

In short, the more business data you have, the more accuracy you can get on your forecasts. And Lokad is taking this simple principle to the next stage by taking into account the data from many companies instead of a single one.

Then there is another subtle issue in forecasting: your historical sales data or your call volume data might not accurately represent the real historical demand of your customers.

For example, if a supplier is encountering a shortage, your sales are going down. Yet, it doesn't mean that your customers no longer want your products; it just reflects that there are fewer products to buy. Following the same idea, a promotion that offers a product at a large discount is likely to increase the sales. But this increase should not be considered as a trend.

The Lokad framework is capable of handling such situations. Basically, with Lokad, you can decorate your time-series with tags and events. Tags and events are just keywords that can be used to tell Lokad that two products are similar or that a past marketing event is impacting your historical data.

Yet, note that you just have to tell Lokad that a promotion took place. It's Lokad that will figure out the actual impact of the promotion and how it will influence future customer demand.

We provide forecasts as a service so that you don’t need any actual knowledge about forecasting to use Lokad.

At this point, you might be wondering: how do I get started with Lokad?

Well, it's quite simple actually.

First you need to go to our website at lokad.com and open your Lokad account. You give us an email address, you choose a password, and that's pretty much it.

Then, you need to install a Lokad add-on. This client application will be used to send your data to Lokad, and to retrieve the corresponding forecasts. For example, you can use "Lokad Excel Sales Forecasting" to perform forecasts directly from within Microsoft Excel.

Then once the add-on is installed, enter some data, click refresh and you're done.

Microsoft Excel is not always the most appropriate way to manage your data. Thus, Lokad also provides two specialized applications.

"Lokad Safety Stock Calculator" is designed for retailers and manufacturers. This application lets you optimize your inventory levels with sales forecasts.

"Lokad Call Center Calculator" is designed for call centers or contact centers. This application lets you optimize your staff levels with call volume forecasts.

Then, if those applications do not fit your needs, Lokad offers a web API, an Application Programming Interface that can be used to access our forecasting technology from any 3rd party application as long as you have an internet connection available.

This concludes this short presentation of Lokad. Do not hesitate to post questions on our forums; the Lokad team, including myself, is doing its best to address them all.

Categories: docs, insights Tags: forecasting insights slidecast