Entries in tips (14)
Perceived quality issues in forecasting
Software quality is a challenge. As a software developer, you have to make sure that your code will keep working in all sorts of unexpected situations. Plenty of methods, tools and processes are available to improve quality, and Lokad makes intensive use of them.
But there is another aspect: perceived quality. The user's opinion depends on many (many) purely subjective aspects of your product. For B2C, product design and aesthetics are probably among the top factors in perceived quality (think iPod).
So far, so good: as a product developer, it simply means that you need to invest a certain amount of effort in your product design. But what happens when perceived quality conflicts with actual quality? (Think of the devilish method required to change your iPod battery.)
In the case of Lokad, where we are delivering time-series forecasts, the situation is even more complicated because statistical forecasting is so counter-intuitive.
For example, we have many customers who actually try out a couple of points to see what they get. Yet, this is really not the way to go to evaluate Lokad. The right way involves a proper training dataset and a testing dataset of your own actual business data (plus many other considerations, but it's beyond the scope of this post).
Unfortunately for us, many customers judge Lokad on the forecast they get after entering a dozen points generated by some function like Cos(x) or Sin(x). It would actually be possible to hard-code a few heuristics in Lokad just to detect those attempts (and their underlying mathematical functions). But those heuristics would decrease the overall accuracy for the users who have real business data in their accounts.
Then, we have another issue: our forecasts are not exactly real time. You can retrieve your forecasts any time, but if you retrieve your forecasts 0.1s after finishing the upload of your data (through our Web Services API), Lokad won't have had the time to try complex/advanced statistical models. As a result, you will get real-time but naive forecasts.
Lokad does its best to provide an end-to-end forecasting service, but to some extent it can't escape the Law of Leaky Abstractions: in order to make the most of Lokad, one needs to understand, at least a little, how statistical forecasting works and how the constraints are handled by Lokad.
Would you pay for moving average?
Many customers are asking THE question: which forecasting models are you using? Indeed, our technology page isn't very specific on the subject.
Disclaimer: I am not really going to answer this question in this post, so please, don't be too disappointed.
Actually, there are two main reasons why we do not disclose this information:
- it's a proprietary technology (like Google search).
- it's a super counter-intuitive technology.
Yet, in order to clarify the situation, I can say that Lokad is not using any silver-bullet forecasting model (i.e. a super-model that would fit all situations), but tons of models instead.
For example, we do use the simple moving average (among others, naturally), which is probably the most naive forecasting method. Intuitively, a simple moving average says: if you want to know the total sales next month, just take the average monthly sales over the last 6 months.
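To make the idea concrete, here is a minimal sketch in Python. The function name, the default window of 6 and the sales figures are made up for illustration; this is not Lokad's code.

```python
def moving_average_forecast(history, window=6):
    """Forecast the next value as the mean of the last `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = history[-window:]  # fewer values are used if the history is short
    if not recent:
        raise ValueError("history is empty")
    return sum(recent) / len(recent)


# Made-up monthly sales, oldest first.
monthly_sales = [120, 135, 128, 140, 150, 145, 160]
print(moving_average_forecast(monthly_sales))  # mean of the last 6 months
```

Note that the oldest month (120) is ignored: only the last 6 values enter the average.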
At first sight, it might appear shocking to sell forecasts if, in the end, it's a moving average model that gets used. But, in my opinion, it is not.
Indeed, producing forecasts through a statistical model is only the last step of a complicated process. Before that, you need to choose the model to be used. And, this step is very complicated.
Thus, Lokad can indeed produce a forecast based on a moving average model, if we detect the moving average model as being the best available model for this particular situation (in practice, this situation arises for very short or very erratic time-series).
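As a rough illustration of what "choosing the model" can mean, the Python sketch below compares two naive candidates on the most recent points of a series and keeps the one with the lowest error. This is a textbook holdout comparison under my own assumptions (the candidate names, the holdout size and the error measure are all illustrative), not a description of how Lokad actually selects models.

```python
def one_step_errors(history, forecast, holdout):
    """Absolute one-step-ahead errors of `forecast` on the last `holdout` points."""
    errors = []
    for t in range(len(history) - holdout, len(history)):
        # Only the values before position t are visible to the model.
        errors.append(abs(forecast(history[:t]) - history[t]))
    return errors


def select_model(history, candidates, holdout=4):
    """Return the name of the candidate with the lowest mean absolute error."""
    def mean_abs_error(name):
        errors = one_step_errors(history, candidates[name], holdout)
        return sum(errors) / len(errors)
    return min(candidates, key=mean_abs_error)


# Two hypothetical candidate models.
candidates = {
    "last value": lambda h: h[-1],
    "moving average (3)": lambda h: sum(h[-3:]) / len(h[-3:]),
}

erratic = [10, 14, 9, 13, 8, 14, 9, 13, 8, 14]
print(select_model(erratic, candidates))
```

On an erratic series like this one the moving average comes out ahead, while on a steadily increasing series the "last value" candidate wins.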
Batteries Included. Python motto.
But the key difficulty of the problem is to understand why the moving average model has been selected. With regular statistical packages, choosing the right model is the user's burden. With Lokad, it's part of the service.
PS: there are more complex variants of the moving average where decreasing coefficients (also called weights) are applied to the time-series, but that's beyond the scope of this discussion.
Convert your web logs into time-series
I have already illustrated how to use Microsoft PowerShell to import data from Swivel.com into Lokad. Let's see how to convert W3C web server log files into time-series and then import these time-series into Lokad. Note that the W3C log format is the default for IIS 2003 (Apache servers also support the W3C log format, but the default seems to be the common log format).
Disclaimer: I am not exactly sure what the purpose of time-series forecasting would be when applied to hit volume for web servers. If you had a pool of web servers, web traffic forecasting could be used to increase system responsiveness in a proactive manner. Yet, this is likely to be quite an uncommon need except for very high traffic websites. I have been toying with web traffic forecasting mostly for internal benchmarking purposes.
In order to convert the logs into a time-series, the following PowerShell line can be used:
ls *.log | get-hits | out-file 'my-logs.txt'
where ls *.log enumerates all the log files, which are typically contained within a single directory. Get-Hits is a custom function (given below) that parses each log file and aggregates the web server hits into quarter-hour periods. Obviously, most of the work here is done by the Get-Hits function:
# Get-Hits
# By Joannes Vermorel, 2007
# Convert hits of W3C server logs to time-series (aggregated by quarter-hour)
# Usage: ls *.log | get-hits | out-file 'my-logs.txt'
function Get-Hits
{
    param( )
    begin { }
    process
    {
        $tab = [System.Char]::ConvertFromUtf32(9)
        $file = [System.IO.File]::OpenText( $_.fullname )
        $previousDate = [System.DateTime]::MinValue
        $hitCount = 0
        # Compare against $null so that an empty line does not end the loop
        while($null -ne ($line = $file.ReadLine()))
        {
            # Skip W3C directives ('#') and lines too short to hold a timestamp
            if(!$line.StartsWith('#') -and ($line.Length -gt 19))
            {
                $d = [System.DateTime]::Parse( $line.Substring(0, 19) )
                # Round the timestamp down to the start of its quarter-hour
                $minutes = [int] ([System.Math]::Floor(($d.Minute / 15)) * 15)
                $d = new-object System.DateTime $d.Year, $d.Month, $d.Day, $d.Hour, $minutes, 0
                if( !($d -eq $previousDate) )
                {
                    # A new period starts: emit the period that just ended, if any
                    if($hitCount -gt 0)
                    {
                        write-output ($previousDate.ToString("yyyy-MM-dd HH:mm:ss") + $tab + $hitCount)
                    }
                    $previousDate = $d
                    $hitCount = 0
                }
                # Count the hit in its own period
                $hitCount += 1
            }
        }
        $file.Close()
        # Emit the last period of the file
        if($hitCount -gt 0)
        {
            write-output ($previousDate.ToString("yyyy-MM-dd HH:mm:ss") + $tab + $hitCount)
        }
    }
    end { }
}
The output file contains a list of time-values in TSV format (i.e. "Time TAB Value"). Then you can import this TSV data through a direct cut-and-paste in your Lokad account (go Analytics => My Data => Insert a new series). Note that if you have several months of web server logs, accounting for several hundred MB, then the process can take a few minutes to complete.
Top 8 reasons why sales forecasting doesn't work
Geoffrey James has a very interesting post where he points out the top 8 reasons why sales forecasting doesn't work. In a nutshell, manual sales forecasts are quite subjective, and the more experts you pile up (especially if they happen to have different agendas), the more random the forecasts become.
In order to avoid the pitfalls of manual forecasting, G. James suggests hiring a mathematician to build a computerized forecasting model. Yet, this approach has a major drawback: you need to recruit an excellent mathematician (or, more precisely, a data miner).
Indeed, in order to do that, you must overcome two large obstacles:
- distinguish the good mathematician from the not-so-good mathematician.
- motivate the mathematician to join your company.
Paying an extra salary might also be an issue, but most probably it's a small one compared to those two problems. In my experience, large non-tech-oriented companies usually fail dramatically at both. This is not very surprising considering the tremendous difficulties encountered even by attractive tech-oriented companies when recruiting talented people.
Thus, if you do not happen to be a large tech-oriented company, Lokad might represent a low-cost outsourced mathematician: much cheaper and hassle-free.
Data importing made easy
Swivel is a one-of-a-kind Web 2.0 application: upload, share and visualize your data. Among all the publicly available datasets, time-series probably represent more than half of the published datasets (estimate based on the 2 hours that I have spent browsing Swivel, accuracy not guaranteed).
In particular, I have noticed that Digg Daily Statistics for 2006 had been published. Thanks to the PowerShell Forecasting SnapIn, and the following commands:
PS docs:\> add-pssnapin LokadOpenShell
PS docs:\> connect-lokad 'digg-sample@lokad.com' 'myPassword'
PS docs:\> Import-CsvToLokad (convert-path 'data_set_1002641.csv') -verbose
VERBOSE: Column: front page articles
VERBOSE: Column: total comments
VERBOSE: Column: total diggs
VERBOSE: Column: comments per article
VERBOSE: Column: diggs per article
VERBOSE: Column: stdev comments per article
VERBOSE: Column: stdev diggs per article
VERBOSE: Total rows: 363
VERBOSE: Uploading serie named: front page articles
VERBOSE: Uploading serie named: total comments
VERBOSE: Uploading serie named: total diggs
VERBOSE: Uploading serie named: comments per article
VERBOSE: Uploading serie named: diggs per article
VERBOSE: Uploading serie named: stdev comments per article
VERBOSE: Uploading serie named: stdev diggs per article
I have been able to import the Digg data into a Lokad account. Then, I have created daily forecasting tasks associated to those newly imported time-series from the Lokad account. Finally, I have updated the Tour section of Lokad with the screenshot

For 3 lines of PowerShell, the result isn't bad.
Want more accuracy? Start uploading now!
Lokad does not only provide hosted forecasting services; we also continuously monitor the forecast accuracy for every single account that has been populated with time-series data (such as the sales data provided by the Lokad add-ons).
The purpose of this ongoing monitoring is first to detect any issue with our algorithms early. Ultimately, the purpose is also to improve the overall forecasting accuracy, either by tweaking our algorithms to better match our customers' data or by introducing new algorithms that handle specific situations more accurately.
Care a lot about accuracy but not ready to integrate forecasting in your daily operations? Upload your data now and revert your Lokad account to Free (that way you won't get charged). By uploading your data now, you make it possible for the Lokad staff to start improving our forecasting technology taking into account the specific needs expressed by your business data.
Forecasting accuracy and data you don't have
Statistical forecasting is deeply tricky and counterintuitive. I have already discussed why Lokad "must" fail against cos(x) and sin(x) and also why you should definitely not sum your forecasts. The key question of statistical forecasting is: how accurate are your forecasts? Although the question might appear simple, it hides many subtleties.
Indeed, how do you define the notion of accuracy? The best definition of accuracy would be the difference between forecasted values and the "real future" values. Yet, there is a big issue: future values are unknown, otherwise what would be the point of making a forecast? Then, you might say "not knowing the future value is not an issue", let's do the following:
- Every week I am making a forecast about next week.
- Then, I wait one week. Now the "future" value is known and I can compare the two values.
- Repeat the process for 6 months and compute the average forecasting accuracy.
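The three steps above can be sketched in a few lines of Python. The `naive` model (repeat the last observed value) and the use of the mean absolute error as the accuracy measure are assumptions of mine, chosen only to keep the example short.

```python
def backtest(series, forecast, n_rounds):
    """Mean absolute error of one-step-ahead forecasts over the last n_rounds points."""
    if not 0 < n_rounds < len(series):
        raise ValueError("n_rounds must be between 1 and len(series) - 1")
    total = 0.0
    for t in range(len(series) - n_rounds, len(series)):
        predicted = forecast(series[:t])  # forecast made with the past only
        actual = series[t]                # value that becomes known one week later
        total += abs(predicted - actual)
    return total / n_rounds


def naive(past):
    """Hypothetical stand-in model: next week equals last week."""
    return past[-1]


weekly_values = [10, 12, 11, 15, 14]  # made-up numbers
print(backtest(weekly_values, naive, n_rounds=3))
```

With 6 months of weekly data, `n_rounds` would be about 26.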
This little scheme looks nice, but unfortunately it is not an operational scheme. Indeed, when you are trying to do forecasting, the problem is not to "benchmark" a single forecasting model; it's to choose the best forecasting model among a large space of possible forecasting models. The forecasting model is not something that is known a priori: it's a particular mathematical function chosen among many other similar functions by a "statistical learning" algorithm.
The 6-month scheme presented above works to evaluate a single model. But what happens if you compare the accuracy of 1 million models over the same 6-month period? If you start trying a lot of models, then one model is going to be a perfect fit for your historical data, i.e. its forecasts perfectly match the data. Yet, since you've tried so many models, you can't be sure it's a good model; it might be just pure luck. 1 million might look very large to you, but consider that you're not going to do it by hand: a computer is going to do it, and nowadays computers perform billions of operations per second.
You can think of this phenomenon as lottery forecasting: each model represents a lottery ticket. Trying models is like buying lottery tickets. If you start buying millions of tickets, then the probability of winning the lottery becomes very high. Yet, it has nothing to do with being able to forecast the winning ticket; it's just because you bought so many tickets.
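The lottery effect is easy to reproduce with a toy simulation. In the Python sketch below, the "series" is pure noise (it goes up or down at random) and every "model" is just a sequence of random guesses; all names and sizes are arbitrary choices of mine. The best model over the past looks impressive, yet its score over the following points is no better than a coin flip.

```python
import random


def lottery_experiment(n_models=2000, n_past=20, n_future=500, seed=0):
    """Pick the random 'model' that best fits the past; report its past and future hit rates."""
    rng = random.Random(seed)
    n = n_past + n_future
    # The series is pure noise: each step goes up (True) or down (False).
    series = [rng.random() < 0.5 for _ in range(n)]
    best_past, best_future = -1.0, 0.0
    for _ in range(n_models):
        # Each model is a lottery ticket: one random guess per step.
        guesses = [rng.random() < 0.5 for _ in range(n)]
        hits = [g == s for g, s in zip(guesses, series)]
        past = sum(hits[:n_past]) / n_past
        future = sum(hits[n_past:]) / n_future
        if past > best_past:
            best_past, best_future = past, future
    return best_past, best_future


past_score, future_score = lottery_experiment()
print(f"best model, past hit rate:   {past_score:.2f}")
print(f"same model, future hit rate: {future_score:.2f}")
```

The selection only ever looks at the past score, which is exactly why the winning ticket tells you nothing about the next draw.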
If you ever have to choose forecasting software, make sure you don't fall into that trap (shameless plug: I suggest going for Lokad; since we completely handle the design of forecasting models, we handle this burden as well).
In the end, we are pretty much stuck with our initial problem: the accuracy of a forecasting algorithm is defined against data you don't have, no matter the way you look at the problem. OK, this is not a very helpful conclusion since it looks like a dead-end. Fortunately, modern statistics do propose solutions to this problem. Stay tuned...
Demand Forecasting vs. Sales Forecasting
If you start browsing the web about business forecasting (shameless plug: we suggest starting with the Forecasting for Business forums), you might come across both sales forecasting and demand forecasting. The two concepts are tightly related but not completely identical. I will try to outline the differences in this post, the distinction being somewhat subtle.
Sales forecasting is the most straightforward: you take your sales history as input in order to produce a sales forecast. This is the bread and butter of most Lokad sales forecasting add-ons. For most retail products, this approach is already fairly efficient. Indeed, sales are the only reliable quantitative indicator available about the customer demand for products.
Yet, sales data can end up biased, for example:
- No products are left on the shelf: that's a stock-out (inventory rupture). Sales go to zero, although there is certainly a demand for the product. In that situation, sales data under-estimate the demand.
- A temporary promotion is applied to the product. Sales go up, but mostly due to the promotion. Although there might be a residual effect after the end of promotion, sales are going to decrease afterward. In this situation, sales data are over-estimating the demand.
eCommerce sites have their own specific biases as well:
- Temporary front-page display that might significantly increase the product exposure to the customers.
- Upgrade of the product picture and/or description that suddenly increases the demand because the product looks more attractive.
If you want to produce a demand forecast, then you need to use the demand history as input. In practice, it means that you need to correct (probably manually) your sales history to reflect the demand. For example, you might replace the zeroes caused by stock-outs with the sales amounts that "would have been expected" if the product had been available. Since demand biases are very business specific, such corrections usually require human expertise to be carried out.
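As a deliberately crude illustration of such a correction, the Python sketch below replaces the sales of the days flagged as stock-outs with the average of the other days. The function, the flags and the numbers are hypothetical, and a real correction would need business knowledge (seasonality, promotions and so on) that this sketch ignores.

```python
def correct_stockouts(sales, stockout):
    """Replace sales on stock-out days by the mean sales of the other days."""
    observed = [s for s, out in zip(sales, stockout) if not out]
    if not observed:
        raise ValueError("every day is flagged as a stock-out")
    baseline = sum(observed) / len(observed)
    return [baseline if out else s for s, out in zip(sales, stockout)]


# Made-up daily sales; the product was unavailable on days 3 and 4.
sales = [10, 12, 0, 0, 14]
stockout = [False, False, True, True, False]
print(correct_stockouts(sales, stockout))
```

The stock-out flags have to come from somewhere else (inventory records, for instance), because a zero in the sales history can also mean that nobody wanted the product that day.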
We have only scratched the surface of the topic, stay tuned ...
Failing at forecasting Sin(x) and Cos(x)
We have been asked whether Lokad is capable of forecasting a time-series defined by a simple mathematical function, for example f(x) = 1/x Sin(x+1). The answer is NO; I will even say that Lokad will miserably fail at forecasting polynomials and trigonometric expressions.
When succeeding on toy maths means failing on real-world
Lokad fails on toy mathematical expressions because they are a totally different situation compared to real-world business time-series such as sales, call volumes or market prices. Actually, any forecasting method that would be highly efficient on toy mathematical expressions would also miserably fail on real-world data. Unless you've got a good statistical background, this result will probably seem counter-intuitive. If it does not work on simple mathematical expressions, how could it work on highly complex real-world time-series?
Actually, the true explanation is really complicated (if you're ready for that, then go on and start reading the Vapnik books). Intuitively, the key issue of statistical forecasting is not to build a model that accurately fits your past data, but to build a model that accurately fits the data that you don't have (i.e. future values). This definition is tricky: how could you know that your model is good if the quality criterion is precisely based on data that does not exist yet?
In the toy maths situation, since you already know the mathematical function, you expect the forecasting algorithm to be able to guess this function too. Yet, it is not mathematically possible because there are an infinite number of mathematical functions that could have produced the very same time-series values. Additionally, have you ever encountered any real-world situation with perfectly clean and de-noised time-series? We haven't. And if you assume that noise exists, then there is no reason to even assume that a simple function exists to explain the observed data.
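The "infinite number of functions" argument can be checked directly. In the Python sketch below (the sample points and function names are arbitrary choices of mine), g agrees with f at every sampled point because the extra term vanishes there, yet the two functions differ wildly at the next point. Every value of the constant c gives yet another function matching the same samples.

```python
import math

SAMPLE_POINTS = [0, 1, 2, 3, 4, 5]  # the points "entered in the account"


def f(x):
    """The toy function the user has in mind."""
    return math.sin(x)


def g(x, c=1.0):
    """A different function with the very same values at the sample points."""
    bump = c
    for p in SAMPLE_POINTS:
        bump *= (x - p)  # the product is zero at every sample point
    return math.sin(x) + bump


for x in SAMPLE_POINTS + [6]:
    print(x, f(x), g(x))
```

From the six sampled values alone, nothing distinguishes f from g, so no algorithm can know which of the two "future" values at x = 6 is the right one.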
In conclusion, Lokad does not assume that any toy maths expression even exists to explain the observed business data, because empirical evaluations indicate that this kind of assumption is totally wrong. As a consequence, Lokad fails on toy maths expressions. Yet, this is the price to pay to perform accurate forecasts on real-world business time-series.
Notes about concurrent time-series (i.e. multi-parameters)
An interesting question has been raised on the ADempiere wiki page about Lokad: does Lokad support multi-parameter time-series? Well, the answer is: somewhat.
In order to answer the question, we first need to clarify the situation. Intuitively, a multi-parameter series can be represented as a list of values [A(t), B(t) .... Z(t)], the variables A, B ... Z being dependent on the time value t but also on the other variables. Thus, we can have, for example, a relationship such as A(t)=B(t)+C(t). This kind of situation is usually called a concurrent time-series model with implicit or explicit relationships between the variables.
Implicit concurrent time-series
Intuitively, the notion of a relationship between variables is important because, if two variables are correlated, then the information of the first variable can be exploited to improve the forecast of the second variable. This insight is basically the heart of the Lokad technology. Yet, it must be noted that the Lokad user interfaces do not provide any way to "tell the system" that two particular time-series are correlated. For example, even if you know that A(t)=B(t)+C(t) in your data, you can't tell Lokad that such a relationship holds.
Instead, Lokad automatically attempts to detect implicit relationships between time-series. Thus, as soon as you enter several time-series in your account, Lokad attempts to exploit the possible correlations between your time-series to increase the overall forecasting accuracy. You do not need to do anything in particular; this is the default behavior of our forecasting systems.
In the future, Lokad might provide some ways for users to make the relationships between their time-series explicit, because users might have expert knowledge that could be used to improve the forecast accuracy as well. But at this time, there is no such feature provided by Lokad.
The usage of correlated data
Let's say that you have two time-series:
- S = the count of ice-creams sold on a daily basis.
- T = the local daily average temperature around your ice-cream retail point.
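To give a feel for what "correlated" means for S and T, the Python sketch below computes the Pearson correlation coefficient between two made-up series. The numbers are invented for illustration, and this is only one textbook way to measure a relationship between series; it says nothing about how Lokad detects correlations internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two series of equal length."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up data for one week.
temperature = [18, 21, 25, 30, 27, 22, 16]   # T, in degrees Celsius
ice_creams = [40, 55, 80, 120, 95, 60, 30]   # S, units sold per day
print(pearson(temperature, ice_creams))
```

A coefficient close to 1 means that hot days and high sales tend to go together, which is the kind of information a forecast of S could benefit from.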
This kind of situation is precisely the reason why Lokad is charging by the forecasting task instead of charging by the time-series. Indeed, we do not want to overcharge our customers for simply adding the data that is required to improve the forecast accuracy of the time-series that really matter to them.