Forever Learning

Forever learning and helping machines do the same.

Archive for the ‘Statistics’ Category

Simulating Repeated Significance Testing

with 2 comments

My colleague Mats has an excellent piece on the topic of repeated significance testing on his blog.

To demonstrate how much [repeated significance testing] matters, I’ve ran a simulation of how much impact you should expect repeat testing errors to have on your success rate.

The simulation simply runs a series of A/A conversion experiments (e.g. there is no difference in conversion between two variants being compared) and shows how many experiments ended with a significant difference, as well as how many were ever significant somewhere along the course of the experiment. To correct for wild swings at the start of the experiment (when only a few visitors have been simulated) a cutoff point (minimum sample size) is defined before which no significance testing is performed.

Although the post includes a link to the Perl code used for the simulation, I figured that for many people downloading and tweaking a script would be too much of a hassle, so I’ve ported the simulation to a simple web-based implementation.

Repeated Significance Testing Simulation Screenshot

You can tweak the variables and run your own simulation in your browser here, or fork the code yourself on Github.

Advertisements

Written by Lukas Vermeer

August 23, 2013 at 15:47

Restaurant Reviews and the Availability Heuristic

with 3 comments

You could say fine dining is a bit of a hobby of mine; and as I’ve mentioned before, I’ve composed quite a few restaurant reviews over the years. I enjoy writing about food almost as much as I love eating it.

Whilst fantasising about fancy food with a colleague the other day, we wondered whether there is any relation between the lengthiness of my reviews and the associated score. In some strange way it made intuitive sense to me that I would devote more words to describe why a particular restaurant did not live up to my expectations.

Thinking about this, the first negative review that came to my mind was one I wrote for “The Good View” in Chiang Mai, Thailand.

If service were any slower than it already is, cobwebs would certainly overrun the place. When food and drinks eventually do arrive they’re hardly worth the wait.

Fruit juice contained more sugar than a Banglamphu brothel and cocktails had less alcohol in them than a Buddhist monk. The mixed Northern specialties appetizer revealed itself to be three kinds of sausage and some raw chillies; very special indeed.

The spicy papaya salad probably tasted alright, but I was unable to tell because my taste buds were destroyed on the first bite. (Yes, I see the irony in complaining a spicy papaya salad was too spicy, but in my mind there’s a difference between spicy food and napalm.)

Also, the view is terribly overrated.

Conversely, the first positive review that popped into my brain was this rather terse piece for “Opium” in Utrecht, the Netherlands.

Om nom nom.

Judging by this tiny sample there might indeed be something to the hypothesis that review length and review score are negatively correlated. To confirm my hunch, I decided to load my reviews into R for a proper statistical analysis.

> cor.test(nn_reviews$char_count, nn_reviews$score)

Pearson's product-moment correlation
data: nn_reviews$char_count and nn_reviews$score
t = 0.2246, df = 121, p-value = 0.8227
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1571892 0.1967366

sample estimates:
   cor
0.02041319

To my surprise, the analysis shows there is practically no relation between length and score. Contrary to what the two reviews above seem to suggest I do not require more letters to describe an unpleasant dining experience as opposed to a pleasant one.

A simple plot of the two variables gives some insight into a possible cause for my misconception.

Review scores vs review length

Review scores vs review length

The outlier in the bottom right happens to represent my review for the Good View. All my other reviews are much shorter in length and seem to be quite evenly distributed over the different scores.

My misjudgement is an excellent example of the availability heuristic. The pair of examples that presented themselves to me upon initial reflection were not representative of the complete set, but that did not stop me from drawing overarching, and incorrect, conclusions based on a sample of two.

This is why I use statistics, because I am a fallible human being; just like everyone else

Written by Lukas Vermeer

March 22, 2013 at 18:11

Snake Oil and Tiger Repellant

leave a comment »

The Wall Street Journal has an interesting article explaining how companies are starting to use (big) data to support their recruiting efforts. It provides a good example of the more general trend in businesses towards evidence-based decisioning and data science, but it also shows how some crucial aspects of these techniques are easily overlooked or oversimplified.

My big-data-science-bogus-alarm started ringing upon reading the last sentence in this short paragraph.

Applicants for the job take a 30-minute test that screens them for personality traits and puts them through scenarios they might encounter on the job. Then the program spits out a score: red for low potential, yellow for medium potential or green for high potential. Xerox accepts some yellows if it thinks it can train them, but mostly hires greens.

Sounds smart, right? Well, maybe.

If Xerox never hires any “reds” and only very few “yellows”, how will they know the program is actually working? How will they know that all that complicated math is doing something more than simply returning random colour values? An evidence-based approach should always include some form of scientific control. If it doesn’t, it might as well be snake oil.

Of course, this is probably just a simple journalistic crime of omission of a trivial implementation detail, but it reminded me of that old chestnut “the tiger repellant”. For your convenience, this blogpost has been equipped with some very strong Tiger Repellant tonic. If you do not see any tigers around you right now, you will know it is working.

See? No tigers?

Proven to work like a charm. Order yours today! Great prices! Limited availability! Now taking applications in the comments.

[ Disclaimer: Tiger Repellant is not certified for use in South-East Asia or zoological parks. Tiger Repellant inc. and its employees and subsidiaries cannot be held liable for any damage caused to your person in the event of being eaten by a tiger. ]

Written by Lukas Vermeer

September 26, 2012 at 14:34

Understanding

with one comment

Derek Jones posits that “success does not require understanding“.

In my line of work I am constantly trying to understand what is going on (the purpose of this understanding is to control and make things better) and consider anybody who uses machine learning as being clueless, dim witted or just plain lazy; the problem with machine learning is that it gives answers without explanations (ok decision trees do provide some insights).

Problem solving versus solving problems.

As one who specializes in using machine learning, I obviously resent being called “clueless, dim witted or just plain lazy”. However, I feel a larger point should be made here. Success does most definitely require understanding, but not necessarily of how one particular instance of a solution came about.

To be successful in any machine learning effort, one needs to have intricate understanding of what the problem is and how techniques can be applied to find solutions. This is a more general form of understanding which puts more emphasis on the process of finding workable models, rather than on applying these models to individual instances of a problem. Comprehension of problem solving over understanding a particular solution.

Driving a black box.

Consider the following example. To me, the engine of my car is a black box; I have very little idea how it works. My mechanic does know how engines work in general, but he is unable to know the exact internal state of the engine in my car as I am cruising down the highway at 100 miles per hour. None of this “lack of understanding” prevents me from getting from A to B. I turn the wheel, I push the peddel and off we go.

In essence, my mechanic and I have different levels of understanding of my car. But importantly, at different levels of precision, the thing becomes a black box to each of us; in the sense that there is a point where our otherwise perfectly practical models break down and no longer are able to reflect reality. In the end, it’s black boxes all the way down.

Chasing shadows.

Models are merely tools to help you navigate a vastly complex world. Very much like machine learning models, a scientific model might work in many cases, but so does Newton’s law of universal gravitation. We know for a fact that that particular model is definitely wrong; and I sincerely hope many others are just as incorrect.

There will always be limits to our understanding. The fact that we have a model that can help us predict does not necessarily mean we have correctly understood the nature of the universe. All models are wrong, but some are useful.

Reality is simply much too complicated to be captured in a manageable set of rules, but even incomplete (or incorrect) models can provide insight and help us better navigate this world. Machine learning is successful, precisely because it can help us find such models.

[ Peter Norvig has written an excellent piece on this subject in relation to language models. ]

Written by Lukas Vermeer

August 15, 2012 at 11:01

Extrapolation

leave a comment »

There are two types of people in this world:

  1. Those who can extrapolate from incomplete data.

[ Via @professorkitteh. Original source unknown. ]

Written by Lukas Vermeer

April 12, 2012 at 11:15

Why Metrics Matter

with one comment

Daring Fireball quotes some interesting research findings related to what Barry Schwartz dubbed The Paradox Of Choice.

About 60% of the people stopped when we had 24 jams on display and then at the times when we had 6 different flavors of jam out on display only 40% of the people actually stopped, so more people were clearly attracted to the larger varieties of options, but then when it came down to buying, so the second thing we looked at is in what case were people more likely to buy a jar of jam.

What we found was that of the people who stopped when there were 24 different flavors of jam out on display only 3% of them actually bought a jar of jam whereas of the people who stopped when there were 6 different flavors of jam 30% of them actually bought a jar of jam.  So, if you do the math, people were actually 6 times more likely to buy a jar of jam if they had encountered 6 than if they encountered 24, so what we learned from this study was that while people were more attracted to having more options, that’s what sort of got them in the door or got them to think about jam, when it came to choosing time they were actually less likely to make a choice if they had more to choose from than if they had fewer to choose from.

A fascinating psychological effect with clear implications for display advertising, but there is a lesson here for online marketeers and analysts as well.

In this study, fewer people stopped when there was less choice, but more people actually bought something. If we were only measuring the former (i.e. attention), and not the latter (i.e. sales), we would be led to think more choice would be about 50% more effective at bringing in customers. And boy, would we be wrong!

Metrics matter; especially when you are using a system which can automatically optimize your process in order to maximize those metrics.

Don’t get yourself in a jam; remember this next time you decide to measure click acceptance instead of actual sales to drive your online marketing effort. Clickthrough rates are useful as a measure by proxy, but they can be misleading.

Written by Lukas Vermeer

January 3, 2012 at 16:58

Predicting Pi

leave a comment »

A few weeks ago, I showed a colleague* my little visual demo of my favorite algorithm for estimating the number Pi. His immediate response was so obvious I am almost ashamed it did not occur to me before he mentioned it.

“RTD could do that!”

Turns out, he is of course right. Although Oracle Real-Time Decisions was certainly not designed for the task, it does a pretty good job of predicting Pi, given the right inputs and some mathemagic.

To reiterate, the idea behind the original algorithm is that we throw a bunch (well actually quite a lot) of darts at a square board (uniformly distributed, of course). We then count the number of darts that land in the largest circle we can fit in this square. The ratio between the darts in the circle and the total number of darts thrown, multiplied by four happens to approximate Pi for reasons explained in an earlier post.

The key thing to understand when using RTD to implement this algorithm is that the ratio described above also represents the odds that a single dart will land in the circle. More easily put, the number of darts that land in the circle divided by the total number of darts thrown equals the chance of a dart landing in the circle. If we can predict the odds of hitting the circle we can predict Pi; and RTD is pretty good at making predictions.

The Setup.

I’ll run through the RTD setup briefly. If you’re interested in a more detailed explanation how I did this, feel free to drop me a line.

RTD can only predict the likelihoods related to choices, so we’ll need a couple of those. For this experiment, we’ll pretend we have a choice of landing the dart inside or outside the circle. We’ll then diligently record whichever of those two happened to be the case each time we throw a dart so that RTD will learn to predict how likely either outcome is.

[ You can click on the screenshots below for a closer look. ]

Two choices modeled in RTD representing a dart landing inside or outside the circle.

The "In Circle" choice

As you can see, I’ve already included the likelihood as a choice attribute to be returned to the calling application. This attribute will be populated by a model aptly named Pi.

RTD model configuration screen for the "Pi" model.

The "Pi" model

The Pi model could hardly be simpler. It will predict the likelihood of mutually exclusive choices (hence the checkbox at the bottom of the configuration) from the choice group Choices.

Note that Pi is configured as a simple so-called Choice Model, not the more common Choice Event Model. The latter, more complex model is more frequently used in practice, because it allows RTD to predict multi-step likelihoods (customer clicked an add, then put the product in his or her shopping basket, then proceeded to checkout, then completed the payment) for positive as well as negative events (e.g. the customer actively declines the offer).

We’ll use a client application similar to the one used in the earlier Pi experiments to simulate the throwing of darts. This application will also show the predicted value of Pi. For this client to tell RTD about darts thrown and RTD to respond with the predicted value we will use a single advisor. The RTD server will generate a web service based on this configuration.

RTD advisor configuration screen with one boolean input parameter.

The "Get Prediction" advisor request

Each request to this advisor will tell RTD about a single dart thrown. The Dart Was In Circle boolean input parameter seems pretty self-explanatory. To process this input and feed the model we will need some simple logic.

Advisor logic to process the boolean input

Advisor logic

This is all that is needed to allow RTD to build a model that can predict the likelihoods for the darts landing inside or outside the circle.

[ Of course we also need so performance goal and decisions to allow the advisor request to return the two choices created, but we’re not really interested in that here. As long as the request returns the In Circle choice and its Likelihood attribute we’re happy. I’ll leave this part of the configuration as an exercise for the reader. ]

The Result.

It takes RTD a while (about a thousand darts or so) to come up with a prediction at all. But when it does, it is pretty much spot on.

The prediction RTD makes seems to be less sensitive to fluctuations resulting from the randomized input. At times, RTD’s guess was even closer to the actual number Pi than the value calculated mathematical function I’d used in previous experiments (and of course, I couldn’t resist taking a screenshot when it was).

A screenshot of the RTD logs and the client application

Results

RTD not only good for 986% ROI, but also for predicting Pi.

And this cake is no lie.

[* I forget who exactly suggested this. Either Alan, Simon or Tim. ]

Written by Lukas Vermeer

October 30, 2011 at 18:34

%d bloggers like this: