Forever Learning

Forever learning and helping machines do the same.

Simulating Repeated Significance Testing

My colleague Mats has an excellent piece on the topic of repeated significance testing on his blog.

To demonstrate how much [repeated significance testing] matters, I’ve run a simulation of how much impact you should expect repeat testing errors to have on your success rate.

The simulation simply runs a series of A/A conversion experiments (i.e. there is no difference in conversion between the two variants being compared) and shows how many experiments ended with a significant difference, as well as how many were ever significant somewhere along the course of the experiment. To correct for wild swings at the start of the experiment (when only a few visitors have been simulated), a cutoff point (minimum sample size) is defined before which no significance testing is performed.
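
For readers who would like to see the idea in code, here is a minimal Python sketch of such a simulation. It is not the original Perl script or the web port. It assumes a two-sided two-proportion z-test at roughly the 5% level, equal traffic to both variants, and one significance check after every pair of visitors once the cutoff is reached. The function names and default parameters are made up for illustration.

```python
import math
import random


def is_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-sided two-proportion z-test (assumed test; ~5% level)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    var = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if var == 0:
        return False  # no conversions (or only conversions) so far: nothing to test
    z = (conv_a / n_a - conv_b / n_b) / math.sqrt(var)
    return abs(z) > z_crit


def simulate(experiments=200, visitors=2000, rate=0.1, cutoff=100, seed=1):
    """Run A/A experiments and return (share ever significant, share significant at end)."""
    rng = random.Random(seed)
    ever = 0
    at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        seen_significant = False
        significant = False
        for n in range(1, visitors + 1):
            # Both variants convert at the same rate, so any "winner" is noise.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n >= cutoff:  # no testing before the minimum sample size
                significant = is_significant(conv_a, n, conv_b, n)
                seen_significant = seen_significant or significant
        ever += seen_significant
        at_end += significant
    return ever / experiments, at_end / experiments


if __name__ == "__main__":
    ever, at_end = simulate()
    print(f"ever significant: {ever:.1%}, significant at end: {at_end:.1%}")
```

Lowering `cutoff` or raising `visitors` gives each experiment more chances to cross the threshold at some point, which is the effect the simulation is meant to show.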

Although the post includes a link to the Perl code used for the simulation, I figured that for many people downloading and tweaking a script would be too much of a hassle, so I’ve ported the simulation to a simple web-based implementation.

[Screenshot: Repeated Significance Testing Simulation]

You can tweak the variables and run your own simulation in your browser here, or fork the code yourself on GitHub.

Written by Lukas Vermeer

August 23, 2013 at 15:47

2 Responses

  1. Lukas,

    Thanks for this.

    Is the familywise error rate not applicable to this? (I don’t come from a stats background, so forgive me if this is a stupid question).

    I’ve computed P(Making at least 1 error in m tests) = 1 – (1-a)^m but roughly speaking, I’m not seeing anywhere near similar results from that calculation and your simulation.

    Where have I gone wrong? Thanks.

    Brian

    March 27, 2014 at 16:24

    • I don’t think the FWER is applicable, because we are running multiple significance tests on the same hypothesis. The probability of finding a significant result somewhere during the experiment depends more on how long you wait until you start testing and how long you run the experiment and keep testing.

      Try setting the cutoff point to 1 and sample size to 1,000,000 and see what happens. :-)

      Lukas Vermeer

      March 28, 2014 at 13:57
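
To put numbers on the formula from the question above, here is a small Python sketch that evaluates 1 – (1-a)^m for a few values of m. The function name `fwer` and the chosen values of m are only for illustration. The formula treats the m tests as independent. Successive looks at the same growing data set are strongly correlated, which is one reason its output will not line up with what the simulation reports.

```python
def fwer(alpha, m):
    """Chance of at least one false positive in m independent tests."""
    return 1 - (1 - alpha) ** m


if __name__ == "__main__":
    # alpha = 0.05 is an assumed significance level for this illustration.
    for m in (1, 10, 100, 1000):
        print(m, round(fwer(0.05, m), 4))
```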

