This was kind of a cool trick.

I have a program that has a clear bug.  Unfortunately the easiest test case that I had only showed the bug at the end of a 40-minute run, processing half a million records.  But if I ran my program again, it was able to notice things it could prove it had clearly done wrong the first time.  (It probably did more things wrong, but it only detected some of them.)

I decided to evolve a faster test case.

I first wrote a script to run the program twice, and report how many things it found wrong on the second run.
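In rough outline, that first script did something like this (a Python sketch, not the real code; "./analyze" and the "ERROR:" output format are stand-ins for the actual program, which isn't shown here):

    #!/usr/bin/env python3
    # Sketch of the "run it twice, count what it caught" harness.
    # ./analyze and the ERROR: line format are assumptions, not the real program.
    import subprocess
    import sys

    def count_errors(record_file):
        # First run builds up whatever state the bug depends on.
        subprocess.run(["./analyze", record_file], capture_output=True, check=True)
        # Second run reports what it can prove the first run got wrong.
        second = subprocess.run(["./analyze", record_file], capture_output=True,
                                check=True, text=True)
        return sum(1 for line in second.stdout.splitlines()
                   if line.startswith("ERROR:"))

    if __name__ == "__main__":
        print(count_errors(sys.argv[1]))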

I wrote a second script that would take a record set, sample half of it randomly, and then run the first script.
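A sketch of the second script, under the same assumptions (one record per line; the first script's harness is imported under a made-up module name):

    #!/usr/bin/env python3
    # Sketch of the sampler: keep each record with probability 1/2, write the
    # sample to a file, and score it with the first script's harness.
    import random
    import sys
    import tempfile

    from run_twice import count_errors   # hypothetical module name for script one

    def sample_and_test(record_file):
        with open(record_file) as f:
            records = f.readlines()          # assumes one record per line
        sample = [r for r in records if random.random() < 0.5]
        with tempfile.NamedTemporaryFile("w", suffix=".records",
                                         delete=False) as tmp:
            tmp.writelines(sample)
            sample_path = tmp.name
        return sample_path, len(sample), count_errors(sample_path)

    if __name__ == "__main__":
        print(sample_and_test(sys.argv[1]))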

I wrote a third script that would take all records, run 4 copies of my test program, and then save the smaller record set that best demonstrated the bug.  Wash, rinse, and repeat.  Once it was no longer able to demonstrate the bug, it would run through its list of "best small examples" from smallest to biggest, trying again; any time it made progress, it would repeat the process.
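That third script was essentially a greedy shrinking loop.  A simplified, single-process sketch of the idea (the 4-way parallelism is left out, and the module names are the hypothetical ones from the sketches above):

    #!/usr/bin/env python3
    # Simplified sketch of the shrinking loop described above.
    # Parallelism is omitted; module names are hypothetical.
    import sys

    from run_twice import count_errors
    from sample_half import sample_and_test

    def shrink(start_file, rounds=1000):
        best = [start_file]      # "best small examples", smallest at the end
        current = start_file
        for _ in range(rounds):
            path, size, errors = sample_and_test(current)
            if errors > 0:
                # A smaller record set still demonstrates the bug; keep it.
                current = path
                best.append(path)
            else:
                # The sample lost the bug: retry earlier small examples,
                # smallest first, and continue from one that still works.
                for candidate in reversed(best):
                    if count_errors(candidate) > 0:
                        current = candidate
                        break
        return best

    if __name__ == "__main__":
        print(shrink(sys.argv[1])[-1])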

All three scripts combined come to 150 lines.  I left them running for a few hours and just checked how they were doing.

I now have a dozen acceptably small reproduction cases for my bug, including one with 362 records that runs in a fraction of a second.  (The program is doing statistical analysis, so there is never going to be a single record that shows the bug.  362 records to reproduce it is tiny.)
 
That's pretty slick.  How much of this can you generalize?  I am surprised you didn't get more votes on yc news.  
 
I have not thought about generalizing.  This isn't a kind of problem that I've hit often enough to know what generalization to make.

You know the rule: the first time you do something, note it.  The second time, keep track of what was in common with the first.  And the third time you can try to generalize.

This was the first time that I've used this trick.  I've noted it. :-)

As for yc, it is a slow day.  And traffic from there is always hit or miss.  (Unless you use a voting ring, at which point pg will be trying to catch you.)  If people see it from there, great.  If it sinks without a trace, that is life.
 
This sounds a little like property-based testing with randomly generated data, as done by QuickCheck or SmallCheck in Haskell. How does your approach compare to those two tools?
 
I am not a Haskell programmer, so I do not know.  But from reading about them on Wikipedia, it sounds like they build up test cases, while I'm trying to strip an observed failure down.  I'm sure that when they do come up with a failing case, they probably use a similar idea to strip it down to something amenable to study.
 
SmallCheck takes a domain and generates all values in that domain progressively to find which ones break the test case.

QuickCheck generates a sampling of random values, and once it finds one that breaks the test case it tries to provide a minimal counter example that you can use to further investigate.

ScalaCheck goes further and generates interactions over an imperative program (sequences of steps), and tries to find the minimal sequence of steps that breaks the test case.
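For comparison, here is a rough Python analogue of the QuickCheck style using the hypothesis library (not one of the tools named above; the property below is just a toy example):

    # hypothesis generates random inputs and, when a property fails, it
    # automatically shrinks the failing input toward a minimal counterexample.
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers(), min_size=1))
    def test_mean_is_bounded(xs):
        # A simple statistical property: the mean lies between min and max.
        mean = sum(xs) / len(xs)
        assert min(xs) <= mean <= max(xs)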
 
Then you have your answer.  I just happened to encounter a large failure case, and threw up some one-off code to strip it down to something manageable.  The latter two tools you name do that AND try to generate the failure case in the first place.
 
BTW for amusement I let my scripts continue to run.  I now have a 195 record example.

I would have been satisfied with a reproduction case 100x that size.
 
There's a family of testcase subsetting algorithms often called delta debugging.  Here's an implementation that is commonly used:

http://delta.tigris.org/
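The core idea (often called ddmin) fits in a few lines.  A simplified Python sketch, leaving out the subset-testing half of the full algorithm; still_fails is a hypothetical predicate that returns True when a record list still reproduces the bug:

    # Simplified ddmin-style reduction: repeatedly try deleting chunks of the
    # input, keeping any smaller input that still fails, and increase the
    # granularity when no chunk can be removed.
    def ddmin(records, still_fails):
        n = 2
        while len(records) >= 2:
            chunk = len(records) // n
            reduced = False
            for i in range(n):
                # Delete the i-th chunk and test the remaining records.
                complement = records[:i * chunk] + records[(i + 1) * chunk:]
                if complement and still_fails(complement):
                    records = complement
                    n = max(n - 1, 2)
                    reduced = True
                    break
            if not reduced:
                if n >= len(records):
                    break      # already at single-record granularity
                n = min(n * 2, len(records))
        return records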
 
+John Regehr That does look like the right generalization of what I did.  Much nicer than my hack.
 
awesome post! you just discovered delta debugging in the wild & serendipitously! 
 
Testing statistical analysis algorithms is not like testing any old enterprise software. You need different approaches.