Forever Learning

Forever learning and helping machines do the same.

Posts Tagged ‘data science’

Predictive Analytics World London


In October I’ll speak at Predictive Analytics World in London. Once again, I’ll be talking about Data Science.

[Image: Special Featured Sessions at Predictive Analytics World London]

You can register for the event on the site. Slides are already available online.


Written by Lukas Vermeer

September 9, 2013 at 17:24

Posted in Data Science, Meta


Simulating Repeated Significance Testing


My colleague Mats has an excellent piece on the topic of repeated significance testing on his blog.

To demonstrate how much [repeated significance testing] matters, I’ve run a simulation of how much impact you should expect repeat testing errors to have on your success rate.

The simulation simply runs a series of A/A conversion experiments (i.e. there is no true difference in conversion between the two variants being compared) and shows how many experiments ended with a significant difference, as well as how many were significant at any point during the course of the experiment. To correct for wild swings at the start of an experiment (when only a few visitors have been simulated), a cutoff point (minimum sample size) is defined before which no significance testing is performed.
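Mats’s original code is in Perl; the following is a minimal Python sketch of the same idea, not his actual implementation. The function names and parameters are my own, and I assume a two-proportion z-test checked after every simulated visitor once the minimum sample size is reached.

```python
import math
import random

def simulate_aa_experiment(n_visitors, base_rate, min_sample, z_crit=1.96, rng=None):
    """Simulate one A/A experiment (both variants share the same true
    conversion rate). Return whether the difference was significant at
    the end, and whether it was *ever* significant after min_sample."""
    rng = rng or random.Random()
    conv_a = conv_b = 0
    ever = final = False
    for n in range(1, n_visitors + 1):
        conv_a += rng.random() < base_rate
        conv_b += rng.random() < base_rate
        if n < min_sample:
            continue  # cutoff: no testing on wildly swinging early data
        pooled = (conv_a + conv_b) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        if se == 0:
            continue
        z = abs(conv_a - conv_b) / (n * se)  # two-proportion z statistic
        final = z > z_crit   # overwritten each step; holds the last check
        ever = ever or final
    return final, ever

def run_simulation(n_experiments, n_visitors, base_rate, min_sample, seed=42):
    """Return (fraction significant at the end, fraction ever significant)."""
    rng = random.Random(seed)
    end_hits = ever_hits = 0
    for _ in range(n_experiments):
        final, ever = simulate_aa_experiment(n_visitors, base_rate, min_sample, rng=rng)
        end_hits += final
        ever_hits += ever
    return end_hits / n_experiments, ever_hits / n_experiments
```

With parameters in this ballpark, the fraction of experiments that end significant stays near the nominal 5%, while the fraction that were *ever* significant along the way is several times larger; that gap is the repeated-testing error the post is about.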

Although the post includes a link to the Perl code used for the simulation, I figured that for many people downloading and tweaking a script would be too much of a hassle, so I’ve ported the simulation to a simple web-based implementation.

[Screenshot: repeated significance testing simulation]

You can tweak the variables and run your own simulation in your browser here, or fork the code yourself on GitHub.

Written by Lukas Vermeer

August 23, 2013 at 15:47

Data Science: for Fun and for Profit


In the next few weeks I’ll be giving two talks on the topic of Data Science at Xebicon and another event affiliated with Xebia. There is an abstract of my spiel available on the Xebicon site.

Data Science is one of the most exciting developing fields in technology today. Ever-expanding data sets and increasing computing power allow statisticians and computer scientists to explore business opportunities that were simply not possible merely a few years ago. Although their applications are new, the ideas and techniques that form the underpinnings of this evidence-oriented discipline have a solid foundation in hundreds of years of scientific development. To understand the new science of data, then, one must first understand the science of science.

The Scientific Method, the unintended effects of repeated significance testing and Simpson’s paradox: this talk will focus on the practical applications of the theoretical constructs that lie at the heart of Data Science, and expand on some potential pitfalls of statistical analysis that you are likely to encounter when venturing into the field.

If you’re interested, feel free to sign up for either event. I’ll also post slides and additional thoughts here afterwards.

Written by Lukas Vermeer

May 17, 2013 at 10:37

Posted in Data Science, Meta


A New Kind Of Science?


I am no longer a Corporate Ninja. As of a few weeks ago, I can call myself “Data Scientist at Booking.com”.

Although I am really excited about the new challenges and opportunities that await me in the sexiest job of the 21st century, I must say this new title bothers me ever so slightly. It somehow seems so redundant.

If you’re not using data, is it really science?

Written by Lukas Vermeer

May 2, 2013 at 17:20

Posted in Meta


Snake Oil and Tiger Repellant


The Wall Street Journal has an interesting article explaining how companies are starting to use (big) data to support their recruiting efforts. It provides a good example of the more general trend in business towards evidence-based decision-making and data science, but it also shows how some crucial aspects of these techniques are easily overlooked or oversimplified.

My big-data-science-bogus-alarm started ringing upon reading the last sentence in this short paragraph.

Applicants for the job take a 30-minute test that screens them for personality traits and puts them through scenarios they might encounter on the job. Then the program spits out a score: red for low potential, yellow for medium potential or green for high potential. Xerox accepts some yellows if it thinks it can train them, but mostly hires greens.

Sounds smart, right? Well, maybe.

If Xerox never hires any “reds” and only very few “yellows”, how will they know the program is actually working? How will they know that all that complicated math is doing something more than simply returning random colour values? An evidence-based approach should always include some form of scientific control. If it doesn’t, it might as well be snake oil.
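To make the point concrete, here is a toy simulation. Everything in it is my own invention, not Xerox’s actual system: two screening “programs”, one whose colour genuinely tracks a hidden aptitude and one that assigns colours at random. The only way to tell them apart is to also observe outcomes for “reds”, which is exactly the control group a greens-only hiring policy never collects.

```python
import random

def hiring_simulation(n_applicants=20000, seed=7):
    """Compare an informative screen against a bogus (random) one.
    Returns observed success rates per colour for each screen."""
    rng = random.Random(seed)
    results = {"informative": {"green": [], "red": []},
               "bogus": {"green": [], "red": []}}
    for _ in range(n_applicants):
        quality = rng.random()              # hidden aptitude, never observed directly
        succeeds = rng.random() < quality   # whether this hire would work out
        # informative screen: colour actually tracks aptitude
        if quality > 0.6:
            informative = "green"
        elif quality < 0.4:
            informative = "red"
        else:
            informative = "yellow"
        # bogus screen: colours are assigned completely at random
        bogus = rng.choice(["red", "yellow", "green"])
        for name, colour in (("informative", informative), ("bogus", bogus)):
            if colour in results[name]:     # track greens AND reds (the control)
                results[name][colour].append(succeeds)
    return {name: {colour: sum(v) / len(v) for colour, v in groups.items()}
            for name, groups in results.items()}
```

In this sketch the informative screen’s greens succeed far more often than its reds, while the bogus screen’s greens and reds succeed at the same rate. A company that hires only greens sees a single success rate either way and has no means of knowing which screen it bought.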

Of course, this is probably just a simple journalistic crime of omission of a trivial implementation detail, but it reminded me of that old chestnut “the tiger repellant”. For your convenience, this blog post has been equipped with some very strong Tiger Repellant tonic. If you do not see any tigers around you right now, you will know it is working.

See? No tigers?

Proven to work like a charm. Order yours today! Great prices! Limited availability! Now taking applications in the comments.

[ Disclaimer: Tiger Repellant is not certified for use in South-East Asia or zoological parks. Tiger Repellant inc. and its employees and subsidiaries cannot be held liable for any damage caused to your person in the event of being eaten by a tiger. ]

Written by Lukas Vermeer

September 26, 2012 at 14:34
