WHY A/A TESTING IS A WASTE OF TIME

I've had a few discussions with people and read stuff posted by @distilled on this topic over the last week and felt compelled to write.  I've been doing split and multivariate tests since 2006, and have watched nearly every one like a hawk.  

I want to give some practical advice, as I've personally made every test cockup in the book and wasted many days on fruitless effort - all in the name of getting better and continuing to improve.

I don't want to come across as saying these methods are wrong - just that my experience tells me there are better ways to use your time when testing.  The volume of tests you start is important, but even more important is how many you finish every month.

And the trick to getting that number up is to keep your testing throughput high by removing wastage from the process.

I've got 6 parts to cover here:

* A/A and variants
* The Dirty Secret of testing
* Triangulate your data
* Watch the test like a Chef
* Segmentation
* Summary

A/A AND VARIANTS

A/A - 50%/50% split

A/A is about validating the test setup.  People use this to test the site to see if the numbers line up.  The problem is that this takes time that would normally be used to run a split test.  If you have a high traffic site, you might think this is a cool thing to do - but in my opinion, you're just using up valuable test insight time.

I've tried this and trust me - it's a lot quicker just to properly test your experiments before going live.  It also gives you confidence in your test where A/A frippery may inject doubt.

What do I recommend then?  Well - this stuff:

* Cross browser testing - Crossbrowsertesting.com, Browsercam
* Ask friends and family - Get them to QA and screenshot from outside the office
* Check the numbers - Have both split testing AND analytics package instrumentation on every test - you can check the QA figures for both to see they line up.
* Watch the test obsessively - I'll come back to this later.

That approach is a lot quicker and has always worked best for me.  Instead of A/A test cycles, use triangulated data and solid QA to pick up the instrumentation, flow or compatibility errata that would bias your results.

A/A/B/B - 25/25/25/25% split

OK - what's this one then?  It looks just like an A/B test, except it isn't.  We've now split the traffic into 4 samples of 25% each, so both A and B appear in duplicate buckets.

So what's this supposed to solve?  Ah - it's to check the instrumentation again (like A/A) but also to confirm whether there are oddities in the outcomes.  I get the A/A validation part (which I've covered already), but what about the results looking different between your two A samples or your two B samples?  What if they don't line up perfectly?  Who cares - you're looking at the wrong thing anyway: the average.

Let's imagine you have 20 people come to the site, and 5 of them end up in each sample bucket.  What if 5 of these are repeat visitors and all land in one bucket?  Won't that skew the results?  Hell yes.  But that's why you should never look at small sample sizes for insight.

So what have people found using this?  That the sample performance does indeed move around, especially early in the test or if you have small numbers of conversions.  I tend not to trust anything until I've hit 350 outcomes in a sample and at least two business cycles (e.g. two full weeks), among other factors.

The problem with this method is that you've split A and B into 4 buckets, so each bucket's effective sample size is smaller and the error rate (the fuzziness, the +/-) on each individual measurement is higher.  Simply put, the chances that you'll see skew are greater than if you're just measuring one A and one B bucket.
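To put a rough number on that fuzziness, here's a minimal sketch - the 5% conversion rate and 10,000 visitors are purely illustrative assumptions, not figures from any real test - comparing the error bars on a half-traffic A/B bucket with a quarter-traffic A/A/B/B bucket:

```python
# Why quarter-sized buckets are fuzzier than half-sized ones. The 5% conversion
# rate and 10,000 visitors are purely illustrative assumptions.
import math

def standard_error(conversion_rate, sample_size):
    """Standard error of a conversion-rate estimate for a simple binomial sample."""
    return math.sqrt(conversion_rate * (1 - conversion_rate) / sample_size)

rate = 0.05
visitors = 10_000

se_ab = standard_error(rate, visitors // 2)     # A/B: each creative gets half the traffic
se_aabb = standard_error(rate, visitors // 4)   # A/A/B/B: each bucket gets a quarter

print(f"A/B bucket:     +/- {1.96 * se_ab:.4f} (95% interval half-width)")
print(f"A/A/B/B bucket: +/- {1.96 * se_aabb:.4f}")
# Halving the bucket size widens the error bars by roughly sqrt(2), about 40%.
```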

If you tried A/A/A/B/B/B, you'd just magnify the effect.  The real problem is knowing when the samples have stopped moving around - partly a numbers thing, partly a feel for the movements in the test samples.  The big prize is not how test results fluctuate between identical samples (A/A/B/B) - it's how visitor segments fluctuate (covered below).

A/B/A - 25/50/25% split

This is one suggested by @danbarker and has the merit of helping identify instrumentation issues (like A/A) but without eating into as much test time.

This has the same problem as A/A/B/B in that the two A samples are smaller and therefore have higher error rates.  Your reporting interface is also going to be more complex, as you now have 3 lines of numbers to crunch (or 4 in A/A/B/B).

You also have the issue that, because the samples are smaller, it will take longer for the two A variants to settle than in a straight A/B test.  Again, a tradeoff of time versus validation - but not one I'd like to take.

If you really want to do this kind of test validation, I think Dan's suggestion is the best one.  I still think there is a bigger prize though - and that's segmentation.

THE DIRTY SECRET OF TESTING

Every business I've tested with has a different pattern, randomness or cycle to it - and that's part of the fun.  Watching and learning from the site and test data as it's running is one of the best parts for me.  But there is a dirty secret in testing - that 15% lift you got in January?  You might not have it any more!

Why?  Well, you might have cut your PPC budget since then, driving fewer warm leads into your business.  You might have run some TV ads that put off the very people who previously responded well to your creative.

Wait a minute though.  It might be performing much better than you thought.  But YOU DO NOT KNOW.  It's the Schrödinger's Cat of split testing - unless you retest it, you don't know whether it's still driving the same lift.

To get around the fact that creative performance moves, I typically leave a stub running (say 5-10%) to keep tracking the old control (loser) against the new variant (winner) for a few weeks after the test finishes.

If the CFO shows me figures disputing the uplift - I can show that it's far higher than the old creative would have delivered, had I left it running.  This has been very useful, at least when bedding a new tool in with someone who distrusts the lift until they 'see it coming through' the other end!
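If you want to see the shape of that stub arrangement, here's a minimal sketch - the 90/10 weights, bucket names and hashing scheme are my own illustrative assumptions, not any particular tool's implementation:

```python
# A minimal sketch of the post-test "stub": keep ~10% of traffic on the old
# control so you can keep comparing it with the winner for a few weeks.
# The 90/10 weights, bucket names and hashing scheme are illustrative
# assumptions, not a specific tool's implementation.
import hashlib

ALLOCATION = [("winner", 0.90), ("old_control_stub", 0.10)]

def assign_bucket(visitor_id):
    """Deterministically map a visitor to a bucket so they always see the same creative."""
    digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform value in [0, 1]
    cumulative = 0.0
    for bucket, weight in ALLOCATION:
        cumulative += weight
        if point <= cumulative:
            return bucket
    return ALLOCATION[-1][0]

print(assign_bucket("visitor-12345"))
```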

However, if you're just continually testing and improving - all this worry about the creative changes becomes rather academic - because you're continually making incremental improvements or big leaps.  The problem is where people test something and then STOP - this is why there are some pages I worked on that are still under test 4 years later - there is still continual improvement to be wrought even after all that time.

Google Content Experiments now utilises the multi-armed bandit algorithm to get round this obvious difference between what the creative did back then vs. now (by adjusting the creatives shown to visitors dynamically).

I postulated back in 2006 that this was the kind of tool we needed - something that dynamically mines the web data, visit history, backend data and tracking into a personalised and dynamic split test serving system.  Allowing this to self tune (with my orchestration and fresh inputs) looks like the future of testing to me:

http://support.google.com/analytics/bin/answer.py?hl=en&answer=2844870
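For a feel of the bandit idea, here's a minimal epsilon-greedy sketch.  Google's own implementation is more sophisticated (it uses a Bayesian approach); this version just shows the principle, and the creative names, the 10% exploration rate and the 5% conversion rate in the usage line are illustrative assumptions only:

```python
# A minimal epsilon-greedy sketch of the multi-armed bandit idea: send most
# traffic to whichever creative is currently converting best, while still
# exploring the others. Creative names, the 10% exploration rate and the 5%
# conversion rate in the usage example are illustrative assumptions only.
import random

creatives = {"A": {"visitors": 0, "conversions": 0},
             "B": {"visitors": 0, "conversions": 0}}
EPSILON = 0.1  # fraction of traffic reserved for exploration

def choose_creative():
    """Pick a creative: usually the current best performer, sometimes a random one."""
    if random.random() < EPSILON or all(c["visitors"] == 0 for c in creatives.values()):
        return random.choice(list(creatives))                      # explore
    return max(creatives,
               key=lambda k: creatives[k]["conversions"] / max(creatives[k]["visitors"], 1))  # exploit

def record_outcome(name, converted):
    creatives[name]["visitors"] += 1
    creatives[name]["conversions"] += int(converted)

# Usage: serve a visitor and record whether they converted.
picked = choose_creative()
record_outcome(picked, converted=random.random() < 0.05)
```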

TRIANGULATE YOUR DATA

One thing that's really helped me avoid instrumentation and test-running issues is to instrument at least two analytics sources.  At a minimum, make sure you use your split testing software's capabilities to integrate with a second analytics package.

This gives you two sources of performance data to triangulate or cross-check against each other.  If they don't line up proportionally, or look biased to an analyst's eye, you can pick up reporting issues before you've even started your test.  Don't weep later about lost data - just make sure it doesn't happen.  It also acts as a belt-and-braces monitoring system once you start testing - again, so you can keep watching and checking the data.
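As a sketch of what that cross-check can look like (the counts and the 5% tolerance below are invented for illustration):

```python
# A sketch of the triangulation check: compare visitor counts per variant from
# the split-testing tool against the analytics package and flag big gaps.
# The counts and the 5% tolerance are invented for illustration.
split_tool_counts = {"A": 4980, "B": 5020}   # visitors per variant (testing tool)
analytics_counts = {"A": 4890, "B": 4310}    # same variants (analytics package)
TOLERANCE = 0.05                             # flag anything off by more than 5%

for variant, tool_count in split_tool_counts.items():
    analytics_count = analytics_counts.get(variant, 0)
    drift = abs(tool_count - analytics_count) / tool_count
    status = "OK" if drift <= TOLERANCE else "CHECK INSTRUMENTATION"
    print(f"{variant}: tool {tool_count}, analytics {analytics_count}, "
          f"drift {drift:.1%} -> {status}")
```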

WATCH THE TEST LIKE A CHEF

You need to approach every test like a labour intensive meal, prepared by a Chef.  You need to be constantly looking, tasting, checking, stirring and rechecking things as it starts, cooks and gets ready to finish.  This is a big insight that I got from watching lots of tests intensely - you get a better feel for what's happening and what might be going wrong.

Sometimes I will look at a test hundreds of times a week - for no reason other than to get a feel for fluctuations, patterns or solidification of results.  You HAVE TO resist the temptation to be drawn in by the pretty graphs during the early cycle of a test.

If you're less than one business cycle (e.g. a week) into your test - ignore the results.  If you have fewer than 350 outcomes (and certainly fewer than 250) in each sample - ignore the results.  If the samples are still moving around a lot - ignore the results.  It's not cooked yet.
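Those rules of thumb are easy to encode as a quick sanity check - a minimal sketch, using the thresholds from this post rather than universal constants:

```python
# A minimal "is it cooked yet?" check encoding the rules of thumb above.
# Thresholds are the rules of thumb from this post, not universal constants.
def results_worth_reading(days_running, outcomes_per_sample, business_cycle_days=7):
    """Return True only when the test has run long enough and has enough outcomes."""
    if days_running < business_cycle_days:
        return False                      # less than one business cycle: ignore
    if min(outcomes_per_sample) < 350:
        return False                      # samples still too small: ignore
    # Sample stability (are the numbers still moving around?) needs a human eye.
    return True

print(results_worth_reading(days_running=5, outcomes_per_sample=[180, 195]))   # False
print(results_worth_reading(days_running=14, outcomes_per_sample=[410, 385]))  # True
```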

SEGMENTATION

@distilled and many others have spotted what I've seen - your data is moving around constantly.  All the random visitors coming into the site and seeing creatives constantly change the precision and nature of the data you see.

The big problem with site averages for testing is that you're not looking inside the average - to the segments.  A poorly performing experiment might have absolutely rocked - but just for returning visitors.  Not looking at segments will mean you miss that insight.

Having a way to cross-instrument your analytics tool (with a custom variable, say, in GA) will then allow you to segment performance at the creative level.  One big warning here - if you split the sample up, you'll get small segments.

Time for an analogy (think of the fuzzy position of an electron).  If you have an A and a B creative, imagine them as two large yellow space hoppers sitting above a tennis court.  You are in the audience seating, trying to measure how far apart they are.  They aren't solid shapes but fuzzy ones - you can't see precisely where the centre is, just an indistinct area in space.

Now, as your test runs, these space hoppers shrink and their positions firm up, so you can be more confident about their location and, say, their difference in height.  By the time they're down to the size of a tennis ball, you're much more confident about their precise location and can measure far more precisely how far apart they are.

If you split your A and B results into segments, you hugely increase how fuzzy your data is.  So be careful not to segment into tiny samples - or at the very least, be careful about trusting what the data tells you at small numbers of conversions or outcomes.
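Here's a minimal sketch of segment-level reporting with a guard against trusting tiny segments - the figures are invented purely to show the shape of the check:

```python
# A sketch of segment-level reporting with a guard against trusting tiny
# segments. The figures are invented purely to show the shape of the check.
MIN_OUTCOMES = 350   # same rule of thumb as for whole samples

segments = {
    ("B", "new visitors"): {"visitors": 6200, "conversions": 310},
    ("B", "returning visitors"): {"visitors": 900, "conversions": 81},
}

for (creative, segment), data in segments.items():
    rate = data["conversions"] / data["visitors"]
    caveat = "" if data["conversions"] >= MIN_OUTCOMES else " (too few outcomes - a hint, not a result)"
    print(f"{creative} / {segment}: {rate:.1%} on {data['conversions']} conversions{caveat}")
```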

Other than that, segmentation will tell you much more useful stuff about A/B split testing than any sample splitting activity - because it works at the level of known attributes about visitors, not just the fuzziness of numbers.

SUMMARY

Is the future of split testing in part automation?  Yes - I think so.  I think these tools will help me run more personalised and segment-driven tests, rather than trying to raise the 'average' visitor performance.  The tools will simply extend the area I can cover, a bit like going from a plough to a tractor.  I don't think it reduces the need for human orchestration of tests - it just helps us do more.

Is there any useful purpose for A/A and A/A/B/B?  I can't see one, but I may have missed something - let me know.  I can't see anything that outweighs the loss of testing time and precision.

What's the best way to avoid problems?  Watching the test like a hawk and opening up your segmentation will work wonders.  QA time is also a wise investment - it beats hands down the existential angst of having to scrap useless data.

And thanks for having the questions and insights that made me write this.  Nods to +Tim Leighton-Boyce, +Will Critchlow and +dan barker.