Demand, Sales and Workload Forecasting Software

Entries in time series (9)

Choosing the right forecast period

Forecasting consists of producing figures that are supposed to reflect the future. But those figures depend heavily on the period chosen for data aggregation. Lokad supports the most frequently used periods: quarter-hour, half-hour, hour, day, week, month, quarter, semester, year, etc.

Intuitively, the longer the considered period, the easier it is to make an accurate forecast. For example, yearly forecasts eliminate seasonal variations. However, a short forecasting period might provide a false sense of accuracy (e.g. forecasting daily candy sales over the next two years), whereas a long period might be unsuited to operational decisions (e.g. trying to optimize the weekly worker schedules of the candy manufacturing unit based on yearly forecasts).

A careful choice of the forecasting period is essential to make the most of forecasting. Yet, surprisingly, this question is frequently left mostly unanswered in books treating the subject of forecasting for practitioners (usually focusing on sales or demand forecasting). Typical answers are "most of the manufacturing industry is using monthly forecasts" and "many large retailers are using weekly forecasts".

Yet, simple assumptions can lead to practical quantitative clues to make this choice. If we just assume that forecast errors are independent and follow a normal distribution, then the relative forecast error is expected to increase as follows when switching to a shorter period:

  • year → month: √(12/1) ≈ 3.5 (i.e. error multiplied by 3.5)
  • month → week: √(31/7) ≈ 2.1 (assuming a month with 31 days)
  • week → day: √(6/1) ≈ 2.4 (assuming 6 business days per week)
  • hour → quarter-hour: √(4/1) = 2

Although the normal distribution assumption rarely holds exactly, those figures are quite representative of most situations. They can be used to evaluate whether changing the forecasting period is worthwhile when the forecast error is too high or the forecast period is too long.
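The factors above can be reproduced in a few lines of Python. This is only a sketch under the assumption stated above, plus the assumption that errors are independent with the same variance in every short period: summing n short periods multiplies the volume by n but the error only by √n, so the relative error of the short period is √n times the one of the long period. The function name `error_increase_factor` is made up for this illustration.

```python
import math


def error_increase_factor(short_periods_per_long_period: float) -> float:
    """Factor by which the relative forecast error grows with a shorter period.

    Assumes independent errors with the same variance in every short period.
    """
    return math.sqrt(short_periods_per_long_period)


# Ratios taken from the list above
transitions = {
    "year -> month": 12 / 1,
    "month -> week": 31 / 7,        # assuming a month with 31 days
    "week -> day": 6 / 1,           # assuming 6 business days per week
    "hour -> quarter-hour": 4 / 1,
}

for name, ratio in transitions.items():
    print(f"{name}: error multiplied by {error_increase_factor(ratio):.1f}")
```

The same function gives a quick estimate for any other pair of periods, for example day → hour with the number of opening hours per day as the ratio.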

Posted on Monday, January 14, 2008 at 10:16AM by Joannes Vermorel

The No1 never asked question about forecasting

There is a question of utmost importance when it comes to statistical forecasting: what is the error function used during the learning process? Indeed, it's based on the error function that you can evaluate whether a forecast is good or bad. It's also the very same error function that drives your learning process when building a statistical model.

Finding an error function isn't hard. Quite the opposite, there are plenty of error functions available: Mean Squared Error (MSE), Mean Absolute Deviation (MAD), Median Absolute Deviation (also abbreviated MAD), Mean Absolute Percentage Error (MAPE), etc.

Yet, in almost 1 year of existence for Lokad, the question of the choice of error function has never been raised by any customer. Well, this situation is very natural, as Lokad is precisely taking charge of the whole forecasting process.

For those who might be interested, the answer is, unfortunately, not simple. Lokad uses several error functions depending on the context. We often use a bounded version of the MAPE (identical to the classical MAPE, but the function is upper bounded to 1) for the benchmarks. The upper bound is used to make the process more robust against pathological time-series that would otherwise have had huge errors.

Yet, if the data is not too noisy (i.e. not too many outliers), then we often use the MSE function, which tends to be much more practical from a computational viewpoint.
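For readers who want to see the two functions side by side, here is a minimal sketch in Python. It is not Lokad's actual code: the post does not say whether the bound applies to each point or to the final average, so the sketch assumes that each individual percentage error is capped at 1. The handling of zero actual values is also an assumption, since the classical MAPE is undefined there. The names `bounded_mape` and `mse` are chosen for the illustration.

```python
def bounded_mape(actuals, forecasts, bound=1.0):
    """Mean absolute percentage error, each term capped at `bound` (assumption)."""
    terms = []
    for actual, forecast in zip(actuals, forecasts):
        if actual == 0:
            # Classical MAPE is undefined here; assume the worst unless exact
            term = 0.0 if forecast == 0 else bound
        else:
            term = min(abs(actual - forecast) / abs(actual), bound)
        terms.append(term)
    return sum(terms) / len(terms)


def mse(actuals, forecasts):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals)


# Made-up figures: the second point is "pathological" (forecast 50 for an actual of 1)
actuals = [100, 1]
forecasts = [110, 50]
print(bounded_mape(actuals, forecasts))  # the 4900% error only counts as 100%
print(mse(actuals, forecasts))
```

With the cap, one pathological point cannot dominate the benchmark, whereas the MSE of the same data is driven almost entirely by that single point.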

Posted on Monday, November 19, 2007 at 07:05AM by Joannes Vermorel

Perceived quality issues in forecasting

Software quality is a challenge. As a software developer, you have to make sure that your code keeps working in all sorts of unexpected situations. Plenty of methods, tools and processes are available to improve quality. Lokad makes intensive use of those.

But there is another aspect: perceived quality. The user's opinion depends on many (many) purely subjective aspects of your product. For B2C, product design and aesthetics are probably among the top factors in perceived quality (think iPod).

So far, so good: as a product developer, it simply means that you need to invest a certain amount of effort in your product design. But what happens when perceived quality conflicts with actual quality? (Think of the devilish procedure required to change your iPod battery.)

In the case of Lokad, where we are delivering time-series forecasts, the situation is even more complicated because statistical forecasting is so counterintuitive.

For example, we have many customers who try out a couple of points to see what they get. Yet, this is really not the way to evaluate Lokad. The right way involves a proper training dataset and a testing dataset built from your own actual business data (plus many other considerations, which are beyond the scope of this post).

Unfortunately for us, many customers judge Lokad on the forecast they get after entering a dozen points generated by some function like Cos(x) or Sin(x). Actually, it would be possible to hard-code a few heuristics in Lokad just to detect those attempts (and their underlying mathematical functions). But those heuristics would actually decrease the overall accuracy for the users having real business data in their accounts.

Then, we have another issue: our forecasts are not exactly real time. You can retrieve your forecasts any time, but if you retrieve your forecasts 0.1s after finishing the upload of your data (through our Web Services API), Lokad won't have had the time to try complex/advanced statistical models. As a result, you will get real-time but naive forecasts.

Lokad does its best to provide an end-to-end forecasting service, but to some extent it can't escape the Law of Leaky Abstractions: in order to make the most of Lokad, one needs to understand, at least a little, how statistical forecasting works and how the constraints are handled by Lokad.

Posted on Monday, November 12, 2007 at 02:55PM by Joannes Vermorel | 2 Comments

Convert your web logs into time-series

I have already illustrated how to use Microsoft PowerShell to import data from Swivel.com into Lokad. Let's see how to convert W3C web server log files into a time-series and then import this time-series into Lokad. Note that the W3C log format is the default for IIS 2003 (Apache servers also support the W3C log format, but the default seems to be the common log format).

Disclaimer: I am not exactly sure what the purpose of time-series forecasting would be when applied to hit volume for web servers. If you had a pool of web servers, web traffic forecasting could be used to increase system responsiveness in a proactive manner. Yet, this is likely to be quite an uncommon need except for very high traffic websites. I have been toying with web traffic forecasting mostly for internal benchmarking purposes.

In order to convert the logs into a time-series, the following PowerShell line can be used

 ls *.log | get-hits | out-file 'my-logs.txt'

where ls *.log enumerates all the log files, which are typically contained within a single directory. Get-Hits is a custom function (given below) that parses the log files and aggregates the web server hits into quarter-hour periods. Obviously, most of the work here is done by the Get-Hits function:

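# Get-Hits
# By Joannes Vermorel, 2007
# Convert hits of W3C server logs to time-series (aggregated by quarter-hour)
# Usage: ls *.log | get-hits | out-file 'my-logs.txt'

function Get-Hits
{
    param( )
    begin { }

    process
    {
        $tab = [System.Char]::ConvertFromUtf32(9)
        $file = [System.IO.File]::OpenText( $_.FullName )
        $previousDate = [System.DateTime]::MinValue
        $hitCount = 0

        try
        {
            # Compare against $null so that an empty line does not end the loop
            while(($line = $file.ReadLine()) -ne $null)
            {
                # Skip W3C directives ('#' lines) and lines too short to hold a timestamp
                if(!$line.StartsWith('#') -and ($line.Length -gt 19))
                {
                    # Assumes that the 'date' and 'time' fields come first in each entry
                    $d = [System.DateTime]::Parse( $line.Substring(0, 19) )
                    $minutes = [int] ([System.Math]::Floor($d.Minute / 15) * 15)
                    $d = new-object System.DateTime $d.Year, $d.Month, $d.Day, $d.Hour, $minutes, 0

                    # A new quarter-hour starts: emit the completed period first
                    if( $d -ne $previousDate )
                    {
                        if($hitCount -gt 0)
                        {
                            write-output ($previousDate.ToString("yyyy-MM-dd HH:mm:ss") + $tab + $hitCount)
                        }
                        $previousDate = $d
                        $hitCount = 0
                    }

                    # Count the current hit in its own quarter-hour
                    $hitCount += 1
                }
            }

            # Emit the last period of the file
            if($hitCount -gt 0)
            {
                write-output ($previousDate.ToString("yyyy-MM-dd HH:mm:ss") + $tab + $hitCount)
            }
        }
        finally
        {
            # Release the file handle even if parsing fails
            $file.Close()
        }
    }
    end { }
}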

The output file contains a list of time-values in TSV format (i.e. "Time TAB Value"). Then you can import this TSV data through a direct cut-and-paste in your Lokad account (go Analytics => My Data => Insert a new series). Note that if you have several months of web server logs, accounting for several hundred MB, then the process can take a few minutes to complete.

Posted on Sunday, September 9, 2007 at 09:41AM by Joannes Vermorel

Data importing made easy

Swivel is a one-of-a-kind Web 2.0 application: upload, share and visualize your data. Among all the publicly available datasets, time-series probably represent more than half of the published datasets (estimate based on the 2h that I have spent browsing Swivel, accuracy not guaranteed).

In particular, I have noticed that Digg Daily Statistics for 2006 had been published. Thanks to the PowerShell Forecasting SnapIn, and the following commands

PS docs:\> add-pssnapin LokadOpenShell
PS docs:\> connect-lokad 'digg-sample@lokad.com' 'myPassword'
PS docs:\> Import-CsvToLokad (convert-path 'data_set_1002641.csv') -verbose
VERBOSE: Column: front page articles
VERBOSE: Column: total comments
VERBOSE: Column: total diggs
VERBOSE: Column: comments per article
VERBOSE: Column: diggs per article
VERBOSE: Column: stdev comments per article
VERBOSE: Column: stdev diggs per article
VERBOSE: Total rows: 363
VERBOSE: Uploading serie named: front page articles
VERBOSE: Uploading serie named: total comments
VERBOSE: Uploading serie named: total diggs
VERBOSE: Uploading serie named: comments per article
VERBOSE: Uploading serie named: diggs per article
VERBOSE: Uploading serie named: stdev comments per article
VERBOSE: Uploading serie named: stdev diggs per article

I have been able to import the Digg data into a Lokad account. Then, I created daily forecasting tasks associated with those newly imported time-series from the Lokad account. Finally, I updated the Tour section of Lokad with the resulting screenshot.

For 3 lines of PowerShell, the result isn't bad.

Posted on Sunday, August 19, 2007 at 08:29PM by Joannes Vermorel

Forecasting accuracy and data you don't have

Statistical forecasting is something deeply tricky and counterintuitive. I have already discussed why Lokad "must" fail against cos(x) and sin(x) and also why you should definitely not sum your forecasts. The key question of statistical forecasting is: how accurate are your forecasts? Although the question might appear simple, there are many untold subtleties in it.

Indeed, how do you define the notion of accuracy? The best definition of accuracy would be the difference between forecasted values and the "real future" values. Yet, there is a big issue: future values are unknown, otherwise what would be the point of making a forecast? Then, you might say that not knowing the future value is not an issue; let's do the following:

  • Every week I am making a forecast about next week.
  • Then, I wait one week. Now the "future" value is known and I can compare the two values.
  • Repeat the process for 6 months and compute the average forecasting accuracy.
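The three steps above can be sketched in a few lines of Python. Everything here is illustrative: `naive_forecast` is a hypothetical stand-in model (next week equals last week), the sales figures are made up, and the accuracy measure is a plain average of absolute errors.

```python
def naive_forecast(history):
    """Hypothetical stand-in model: next week equals the last observed week."""
    return history[-1]


def rolling_weekly_error(series, model, warmup=1):
    """Average absolute error of one-week-ahead forecasts, made week after week."""
    errors = []
    for week in range(warmup, len(series)):
        forecast = model(series[:week])   # forecast made with the data known so far
        actual = series[week]             # one week later, the "future" value is known
        errors.append(abs(actual - forecast))
    return sum(errors) / len(errors)


weekly_sales = [120, 130, 125, 140, 150, 145]  # made-up figures
print(rolling_weekly_error(weekly_sales, naive_forecast))  # 9.0
```

Note that the function scores exactly one model, passed as an argument.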

This little scheme looks nice, but unfortunately it is not an operational scheme. Indeed, when you are trying to do forecasting, the problem is not to "benchmark" a single forecasting model, it's to choose the best forecasting model among a large space of possible forecasting models. Indeed, the forecasting model is not something that is known a priori, it's a particular mathematical function chosen among many other similar functions by a "statistical learning" algorithm.

The 6-month scheme presented above works to evaluate a single model. But then, what happens if you compare the accuracy of 1 million models over the same 6-month period? If you start trying a lot of models, then one model is going to be a perfect fit for your historical data, i.e. forecasts perfectly matching the data. Yet, since you've tried so many models, you can't be sure it's a good model; it might be just pure luck. 1 million might look very large to you, but just consider that you're not going to do it by hand, a computer is going to do it; and nowadays computers are making billions of operations per second.

You can think of this phenomenon as lottery forecasting: each model represents a lottery ticket. Trying models is like buying lottery tickets. If you start buying millions of tickets, then the probability of winning the lottery becomes very high. Yet, it has nothing to do with being able to forecast the winning ticket, it's just because you bought so many tickets.
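The lottery effect is easy to reproduce. The sketch below is only an illustration with made-up ingredients: the "sales" are pure random noise (so no model can genuinely forecast them), each "model" is nothing more than a list of random guesses, and the names `lottery_experiment` and `mean_abs_error` are hypothetical. The model that best fits 26 weeks of history is selected, then scored on the 26 following weeks.

```python
import random


def mean_abs_error(actuals, forecasts):
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def lottery_experiment(model_count=2000, weeks=26, seed=42):
    """Pick the best of many random 'models' on pure noise.

    Returns (in-sample error of the winner, out-of-sample error of the winner,
    average in-sample error over all models).
    """
    rng = random.Random(seed)
    history = [rng.gauss(100, 10) for _ in range(weeks)]  # known past
    future = [rng.gauss(100, 10) for _ in range(weeks)]   # data you don't have
    best_in, best_out, total_in = None, None, 0.0
    for _ in range(model_count):
        # One lottery ticket: random guesses for the past and for the future
        guesses = [rng.gauss(100, 10) for _ in range(2 * weeks)]
        err_in = mean_abs_error(history, guesses[:weeks])
        err_out = mean_abs_error(future, guesses[weeks:])
        total_in += err_in
        if best_in is None or err_in < best_in:
            best_in, best_out = err_in, err_out
    return best_in, best_out, total_in / model_count


winner_in, winner_out, average_in = lottery_experiment()
print(f"winner on history: {winner_in:.2f}")
print(f"winner on future:  {winner_out:.2f}")
print(f"average model:     {average_in:.2f}")
```

Comparing the three numbers typically shows the winner far ahead of the average on the history, while its error on the unseen weeks falls back toward the average: the winner was lucky, not accurate.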

If you ever have to choose forecasting software, make sure you don't fall into that trap (shameless plug: I suggest going for Lokad; since we completely handle the design of forecasting models, we handle this burden as well).

In the end, we are pretty much stuck with our initial problem: the accuracy of a forecasting algorithm is defined against data you don't have, no matter the way you look at the problem. OK, this is not a very helpful conclusion since it looks like a dead-end. Fortunately, modern statistics do propose solutions to this problem. Stay tuned...

Posted on Tuesday, July 31, 2007 at 09:23PM by Joannes Vermorel

Demand Forecasting vs. Sales Forecasting

If you start browsing the web about business forecasting (shameless plug: we suggest starting with the Forecasting for Business forums), you might come across both sales forecasting and demand forecasting. The two concepts are tightly related but not completely identical. I will try to outline the differences in this post, the distinction being somewhat subtle.

Sales forecasting is the most straightforward: you take your sales history as input in order to produce a sales forecast. This is the bread and butter of most Lokad sales forecasting add-ons. For most retail products, this approach is already fairly efficient. Indeed, sales are the only reliable quantitative indicator available about the customer demand for products.

Yet, it happens that sales data end up biased, for example:

  • No products are left on the shelf: this is a stock-out (inventory rupture). Sales go to zero, although there is certainly a demand for the product. In that situation, sales data are under-estimating the demand.
  • A temporary promotion is applied to the product. Sales go up, but mostly due to the promotion. Although there might be a residual effect after the end of promotion, sales are going to decrease afterward. In this situation, sales data are over-estimating the demand.

eCommerce businesses have their own specific biases as well:

  • Temporary front-page display that might significantly increase the product exposure to the customers.
  • Upgrade of the product picture and/or description that suddenly increases the demand because the product looks more attractive.

If you want to produce a demand forecast, then you need to use the demand history as input. In practice, it means that you need to correct (probably manually) your sales history to reflect the demand. For example, you might replace the zeroes caused by stock-outs with the sales amounts that "would have been expected" if the product had been available. Since demand biases are very business specific, such corrections usually require human expertise to be carried out.
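As a very rough illustration of such a correction, here is a Python sketch that replaces the sales of stocked-out periods with the average sales of the periods where the product was available. This rule is only an assumption made for the example, and a crude one: it is no substitute for the business expertise mentioned above. The function name `correct_stockouts` and the figures are made up.

```python
def correct_stockouts(sales, stocked_out):
    """Turn a sales history into a rough demand history.

    sales: list of sold quantities per period.
    stocked_out: list of booleans, True when the shelf was empty.
    Stocked-out periods get the average of the in-stock periods (assumption).
    """
    in_stock = [s for s, out in zip(sales, stocked_out) if not out]
    if not in_stock:
        return list(sales)  # nothing to learn from, leave the history untouched
    baseline = sum(in_stock) / len(in_stock)
    return [baseline if out else s for s, out in zip(sales, stocked_out)]


daily_sales = [10, 12, 0, 0, 14]
shelf_empty = [False, False, True, True, False]
print(correct_stockouts(daily_sales, shelf_empty))  # [10, 12, 12.0, 12.0, 14]
```

A zero that is not flagged as a stock-out is left untouched, because it is a genuine observation of low demand.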

We have only scratched the surface of the topic, stay tuned ...

Posted on Thursday, July 19, 2007 at 09:31PM by Joannes Vermorel

Notes about concurrent time-series (i.e. multi-parameters)

An interesting question has been raised on the ADempiere wiki page about Lokad: does Lokad support multi-parameter time-series? Well, the answer is: somewhat.

In order to answer the question, we first need to clarify the situation. Intuitively, a multi-parameter series can be represented as a list of values [A(t), B(t) ... Z(t)], the variables A, B ... Z being dependent on the time value t but also on the other variables. Thus, we can have, for example, a relationship such as A(t)=B(t)+C(t). This kind of situation is usually called a concurrent time-series model with implicit or explicit relationships between the variables.


Implicit concurrent time-series

 
Intuitively, the notion of variable relationship is important because if two variables are correlated then the information of the first variable can be exploited to improve the forecast of the second variable. This insight is basically the heart of the Lokad technology. Yet, it must be noted that the Lokad user interfaces do not provide any way to "tell the system" that two particular time-series are correlated. For example, even if you know that A(t)=B(t)+C(t) in your data, you can't tell Lokad that such a relationship is true.

Instead, Lokad automatically attempts to detect implicit relationships between time-series. Thus, as soon as you enter several time-series in your account, Lokad attempts to exploit the possible correlations between your time-series to increase the overall forecasting accuracy. You do not need to do anything in particular, this is the default behavior of our forecasting systems.

In the future, Lokad might provide some ways for users to make the relationships between their time-series explicit, because users might have some expert knowledge that could be used to improve the forecast accuracy as well. But at this time, there is no such feature provided by Lokad.

The usage of correlated data


Let's say that you have two time-series

  • S = the count of ice-creams sold on a daily basis.
  • T = the local daily average temperature around your ice-cream retail point.

Let's assume that the two time-series are correlated, i.e. that people tend to eat more ice-cream when it's hot. As a matter of fact, the only time-series that you are interested in, from a forecasting viewpoint, is the ice-cream sales (S). Indeed, an ice-cream retailer does not really care about producing weather forecasts (T). But, in the case of Lokad, if the data of T is entered into the Lokad account along with S, then Lokad will be able to implicitly exploit the temperature information (T) in order to increase the ice-cream sales forecast accuracy (S).
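If you are curious whether two of your own series are correlated at all, a quick check is the Pearson correlation coefficient. The Python sketch below is not how Lokad detects relationships (the post does not describe that), it is just a standard formula applied to made-up temperature and sales figures.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two series of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


temperature = [18, 21, 25, 30, 28, 22]      # T, made-up daily averages
ice_creams_sold = [40, 52, 70, 95, 88, 55]  # S, made-up daily counts
print(pearson_correlation(temperature, ice_creams_sold))
```

A value close to 1 suggests that T carries information about S; a value close to 0 suggests that uploading T is unlikely to help much.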

This kind of situation is precisely the reason why Lokad is charging by the forecasting task instead of charging by the time-series. Indeed, we do not want to overcharge our customers for simply adding the data that is required to improve the forecast accuracy of the time-series that really matter to them.

 

Posted on Monday, February 26, 2007 at 09:17PM by Joannes Vermorel

Missing time-series vs. Empty time-series

Lokad is about time-series forecasting, but as simple as the time-series model may seem to be (after all a time-series is nothing more than a list of time-value pairs), there are several subtleties in the way time-series are managed. In this post, we will see how the Lokad time-series model distinguishes missing time-value pairs from empty time-value pairs. Since the topic is slightly complex, I would suggest, if you're not familiar with the Lokad technology, having a look at our User Guide (in particular, the Forecasting tasks section).

A practical situation


Let's start with a practical real-life situation; let's assume that we have a time-series that includes 12 time-values, one value for each month of the year 2005 (starting January 2005, ending December 2005). We can imagine that this time-series represents the monthly sales of a web shop. At the time I am writing this post, it's the beginning of January 2007. What happens if I now insert this time-series into my Lokad account and ask for a monthly forecast? Well, there is an ambiguity in the time-series model, because there would be two possibilities:

  • Returning a forecast for January 2007 (let's call it the clock-centric approach). In this case, we would be considering that the 12 values for the year 2006 are simply missing. Thus, we skip them and produce a forecast nonetheless, but based on the data of the year 2005.
  • Returning a forecast for January 2006 (let's call it the data-centric approach). The forecast is based on the last time-value pair available (i.e. December 2005 in the present situation), which is equivalent to the assumption that there are no missing values. In this case, the delivered forecast might refer to a period already part of the past.

Let's make things clear: Lokad has chosen the data-centric approach. If you ask for a monthly forecast for your 12 time-values ranging from January 2005 to December 2005, you will get a forecast for January 2006, no matter whether you request the forecast at the beginning of 2006 or in a distant future. Lokad takes the last time-value pair of your time-series as a reference to compute the forecasts. This option has been chosen because we believe it's closer to the business requirements.
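The data-centric rule boils down to "the forecast starts right after the last time-value pair, whatever today's date is". Here is a minimal Python sketch of that rule for monthly data; the function name `next_monthly_forecast_period` and the (year, month) representation are choices made for this example, not part of the Lokad API.

```python
def next_monthly_forecast_period(months):
    """Return the (year, month) that follows the last month found in the data.

    months: iterable of (year, month) pairs present in the time-series.
    The current date is deliberately not used (data-centric approach).
    """
    last_year, last_month = max(months)
    if last_month == 12:
        return (last_year + 1, 1)
    return (last_year, last_month + 1)


sales_months = [(2005, month) for month in range(1, 13)]  # January to December 2005
print(next_monthly_forecast_period(sales_months))  # (2006, 1)
```

A clock-centric variant would take today's date as an extra input, which is exactly what makes its result change over time for the same data.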

Some arguments supporting the data-centric approach

 

Let's review the arguments in favor of the data-centric approach:

  • The data-centric approach has persistent semantics. If the input time-series data do not change, the forecast time-range does not change either (yet the actual values of the forecast may change over time).
  • The data-centric approach offers the possibility to benchmark the Lokad forecast services. You can import your 2005 product sales data in your Lokad account, get the forecast for 2006, and see how much difference lies between our forecasts and your historical record for 2006.
  • The data-centric approach assumes that there is no missing data in your time-series data after the initial time-value pair. This assumption has one strong advantage: its simplicity. Indeed, in some data mining fields, missing data are very frequent (think medical surveys for example), but when it comes to time-series, it's quite rare.

Yet, this approach involves a minor drawback: you need to handle the lack of data explicitly. For example, in the previous web shop situation, some products of the catalog may not have been sold even once in a given month. In such a case, you must explicitly add a zero time-value to your time-series to represent this lack of sales.
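Adding those zeros by hand quickly becomes tedious, so here is a small Python sketch of how it could be automated before uploading the data. The function name `fill_missing_months` and the dictionary keyed by (year, month) are assumptions made for this example.

```python
def fill_missing_months(sales_by_month, first, last):
    """Return a complete monthly series with explicit zeros for months without sales.

    sales_by_month: dict mapping (year, month) to the quantity sold.
    first, last: (year, month) bounds of the series, both included.
    """
    filled = []
    year, month = first
    while (year, month) <= last:
        # A month absent from the records means "no sales", hence an explicit 0
        filled.append(((year, month), sales_by_month.get((year, month), 0)))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return filled


sparse_sales = {(2005, 1): 3, (2005, 4): 1}  # made-up figures for one product
for period, quantity in fill_missing_months(sparse_sales, (2005, 1), (2005, 6)):
    print(period, quantity)
```

The upper bound matters: with the data-centric approach, trailing zeros up to the last month you actually observed are what tells Lokad where the history ends.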

Posted on Friday, January 5, 2007 at 03:11PM by Joannes Vermorel | 2 Comments