Profile

Laurent Bossavit
806 followers | 143,468 views

Stream

Laurent Bossavit

Destroying the entire US economy

I've been trying to wrap my head around a concept in economics, and you should know that I've no background in economics. Like, at all. My econ teacher back in high school would have been unanimously voted Worst Teacher if we'd had an election; he was so bad we skipped his classes with total impunity.

Anyway, my question was: when you hear "X costs the economy N billions of dollars per year", what specifically do you take that to mean? X is variously given as "disengaged employees", "preventable heart disease", "software bugs", and so on. It's entirely possible that claims of that sort make sense for some X, and not for others.

Does it mean, for instance, "in the absence of X there would be Y $Bn more wealth to share around"? That doesn't quite compute for me, because (in some of the cases I gave, such as software bugs) those Y billions are salaries or fees paid out to people, so they're already in the economy.

Someone on Twitter suggested it means "people/companies/the gov't spend Y but it doesn't produce useful returns". But then what is gained by not saying that, and instead calling it "a cost to the economy"?

Alternately, can anyone provide an example of an X that was eliminated, where we were then able to measure the former costs of X being recovered by the economy?

Being who I am, my hunch was that "X costs the economy Y" is actually a snowclone: it means "X is bad" for any value of Y, and is otherwise empirically meaningless. What you do when you find a snowclone is look for examples, and I was able to find plenty.

What I found was interesting. I tabulated the results in a spreadsheet. If you sum all "costs to the US economy" you get a number larger than the economy is to start with.
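The check itself is trivial. Here's a sketch of it in Python, with made-up figures standing in for the ones I collected (see the spreadsheet link below for the real list):

# Entirely made-up "X costs the US economy Y" claims, in $billions per year,
# standing in for the figures collected in the spreadsheet.
claimed_costs = {
    "disengaged employees": 500,
    "preventable heart disease": 450,
    "software bugs": 60,
    "routine weather variability": 485,
    # ... dozens more of the same kind ...
}

us_gdp_billions = 16_000  # rough order of magnitude for annual US GDP
total_claimed = sum(claimed_costs.values())
print(f"sum of claimed costs: ${total_claimed}Bn out of ~${us_gdp_billions}Bn GDP")
# With enough such claims collected, the sum overtakes GDP itself.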

Of course there's no sensible reason to count down from the total, and subtract these costs. It's obvious that a bunch of these are counterfactual: "if we stopped X it would add Y dollars to the overall bottom line". But just as obviously that framing is far less potent, because it makes clear its estimate is counterfactual, speculative, uncertain; whereas a "cost" is implied to be a solid figure.

There's no sensible reason for calling these things "costs" either; I don't go around bemoaning the cost I suffered by not becoming President of France, and thus not being able to get paid $100K for speaking at a conference - a total "loss" to me, so far, of over $6M.

Anyway, if you happen to make up a number for what your favorite problem costs the economy, I've got a handy spreadsheet of items to compare it to. For instance, it's more urgent to address "routine weather variability" than software bugs. You can all relax about using PHP or whatever.

Here's the link, for your convenience or amusement.
Drive
Costs to the US economy (Sheet 1): Wherein various "costs to the US economy" (as found via the helpful Google) are found to sum to more than the economy itself. Not included: temporary costs such as the Iraq war, etc.; estimated future costs (e.g. climate change). Disclaimer: yes, this is rough work, and no, I'm not tak...
13 comments
 
+Laurent Bossavit, I don't know what your "particular form of ignorance" is.

Laurent Bossavit

I received notification yesterday that two of my abstracts were accepted to the Toward a Science of Consciousness 2015​ conference:

* Sentient companions predicted and modeled into existence: explaining the tulpa phenomenon. Accepted as a contributed poster. http://kajsotala.fi/Papers/Tulpa.pdf

Takes a stab at explaining so-called "tulpas", or intentionally created imaginary friends, based on some of the things we know about the brain's cognitive architecture.

* Coalescing Minds and Personal Identity. Accepted as a contributed paper; co-authored with Harri Valpola. http://kajsotala.fi/Papers/CoalescingPersonalIdentity.pdf

Summarizes our earlier paper, Coalescing Minds (2012), which argued that it would in principle not require enormous technological breakthroughs to connect two minds together and possibly even have them merge. Then says a few words about the personal identity implications. Due to the word limit, we could only briefly summarize those implications: will have to cover the details in the actual talk.
1 comment on original post

Laurent Bossavit

Can't we all just get along? A fable on ISO 29119

In an alternate world, a different group of testers petitioned ISO first, and formed WG666.

A few years later, the ISO 29666 standard was published which, among other things, stipulated that the testing of software was to be accomplished through a recently deceased member of the species Gallus gallus domesticus being swept in a roughly circular motion over any physical embodiment of the software in question.

Although this approach, derided by one faction as "waving a dead chicken over it", was publicized in conference talks and articles over the span of a few years, most of the testing community remained poorly informed as to its advantages, owing to the fees legitimately charged by ISO to cover the high costs of developing the standard. (In this alternate world, unlike in the real world, the working group had decided to use their own procedures to test the standard itself, and thus a great many fowl went into making the standard.)

One PhD level thesis that had examined how software testing was actually done at a number of corporations, and noted some passing similarities with chicken waving, was widely quoted as providing empirical support for the effectiveness of the concepts in the standard. (Clearly this alternate world was not much removed from our own.)

Some testers refused to sign the Stop29666 petition, on the basis of "keeping an open mind to ALL approaches" in software testing.

Some further reproached their testing colleagues who did sign the petition, because their opposition to waving dead chickens could after all only be explained as a result of knee-jerk political affiliation (rather than because, well, dead chickens).

And further insisted that the opposition of a vocal minority to the ISO29666 standard was damaging the testing community by "polarizing" it, and called for all sincere professionals to try to "relax a bit" and "get along with each other".

All of these judgments were wrong. Even though this was an alternate world, it was still one where dead chickens didn't help much with testing.

Back in the real world...

On either side of the "contentious rift" opened up by ISO 29119 are human beings. One of the things we tend to do is justify our own beliefs using different standards than we apply to other people - particularly when they disagree with us.

It's too easy to think, when you disagree with someone, "My beliefs are grounded in facts and observations, but your beliefs are only to score points with your social circle and feel good about yourself."

Various commenters on the ISO 29119 debate, both for and against, are guilty of this, some more egregiously than others. Appeals to "take the right attitude" or "just relax a bit" are transparent attempts at painting the opposition as partial and subjective. One might argue that the accusations of "rent seeking" leveled at the authors of the standard are of a similar kind, insofar as they frame the debate as a matter of intent (the authors of the standard, the argument goes, want to secure revenue through regulation rather than through providing superior service). However, the argument based on "rent seeking" is eminently more testable than one based on "not having the right attitude": there is, factually, such a thing as regulatory capture; there is such a thing as manipulation of the ISO processes for private gain, as became painfully apparent in the case of the Microsoft-backed ISO standard for Office Open XML.

The point of the above fable is to encourage anyone reading up on the debate to apply a "dead chicken test". Cross out anything that you read which does not refer to a verifiable fact; anything that speculates on someone's intent or frame of mind, or expresses motherhood-and-apple-pie sentiments such as "we would like everyone to get along".

Does the approach to testing outlined in the standard yield better results than waving dead chickens around? Does any testing approach demonstrably work better than dead chickens, and what yardstick do you consider appropriate when answering that question? Anything that doesn't contribute to answering these, either at the scale of an individual tester or at broader scales (company-wide, industry-wide), you can safely ignore.

For instance, the article at this URL: http://xbosoft.com/iso-29119-useful/ boils down to the following:

 The standard "gives a starting point to add context and customization".

Pretty much everything else is a red herring, or a manifestly false statement. For instance, the claim that "all ISO standards explicitly state that they need to be tailored to the situation and organization". There is an ISO standard determining the paper sizes for A and B series paper; you can bet that this doesn't "state that it needs to be tailored". Or "usually standards are born from nebulous concepts that we need to try to understand better" - this is again a completely baseless generalization. Paper size isn't a nebulous concept, it is simply a matter of reaching agreement, even a somewhat arbitrary one, on something where the details don't matter. In software development, not only do the details matter, they sometimes seem to be all that does.

Does the standard provide a useful starting point? I've actually read the document, dissected it, and from my perspective as an expert software developer but, technically, a newbie to the world of professional testing, I find it worse than useless. The parts on "dynamic testing processes" - the parts that touch on actually getting on with the testing itself, as opposed to burying it under layers of managing or planning or documenting - are a thicket of confusing terminology. Where a simple notion of "test idea" would have sufficed, they introduce "test conditions", "test coverage items" and "test sets". The only purpose these appear to serve is to generate copious amounts of documentation, essentially for the purpose of management oversight.

If you are determined, there are ways of finding the actual text of the standard. It can be a matter of finding yourself in the right place. For instance, universities or large corporations that have subscribed to IEEE's digital library on an "all you can eat" basis. If you happen to be connected to the wireless network of one such institution, as I was while attending the CAST conference, you'll be able to download the documents at no charge.

Get the facts, judge on the facts, ignore as best you can whoever does otherwise.
19 comments
 
When you're lawfully inside a university library, you can read any book on the shelves. There's nothing unethical about that, it's what a library is for.

If you'd rather the market decide, then you should oppose the standard, since it will skew the market, not to the companies delivering the most valued results, but to those with a say in how the standard is defined. That is in a nutshell the "regulatory capture" argument.

Finally, you're raising the attitude objection again; "upset" is a red herring. ISO can standardize paper sizes because nobody cares about differences of a fraction of an inch in a sheet of paper, and everyone benefits from agreeing upon a size. Procedural "consensus" among a technical committee may then happen to reflect a larger consensus among users of paper.

When ISO lets a small group of people (one with an obvious vested interest, at that) standardize, on behalf of an entire specialization in software development, on something which isn't proven better than waving dead chickens, it is abusing the commonsense meaning of "consensus". There is no well-established benefit accruing to the community from agreeing upon the particular contents of ISO 29119 as the "standard" way to test. The differences being investigated by practicing testers as they go about their jobs do matter, quite a bit.

The Humpty Dumpty Principle applies: ISO can define "consensus" to mean whatever the hell it wants; the rest of us are free to demand common sense and consistency in how the word is used.

ISO boasts in its marketing materials that its standards "are based on global expert opinion" and that "comments from stakeholders are taken into account" and other such phrases. These imply that whatever the strictly technical meaning of "consensus" (within working groups, et cetera), ISO does intend the word to carry the same connotations as the everyday meaning.

Laurent Bossavit

People try mailing a number of unpackaged items through the US Postal Service and record the results. Some of my favorites:

> Football. Days to delivery, 6. Male postal carrier was talkative and asked recipient about the scores of various current games. Carrier noted that mail must be wrapped.

> Pair of new, expensive tennis shoes. Strapped together with duct tape. Days to delivery, 7. When shoes were picked up at station, laces were tied tightly together with difficult-to-remove knot. Clerk noted that mail must be wrapped. [...]

> Helium balloon. The balloon was attached to a weight. The address was written on the balloon with magic marker; no postage was affixed. Our operative argued strongly that he should be charged a negative postage and refunded the postal fees, because the transport airplane would actually be lighter as a result of our postal item. This line of reasoning merely received a laugh from the clerk. The balloon was refused; reasons given: transportation of helium, not wrapped. [...]

> Box of sand. Packaged in transparent plastic box to be visible to postal employees. Sent to give an impression of potentially hiding something. The plastic box had obviously been opened before delivery and then securely taped shut again. Delivery without comment at doorstep, 7 days. [...]

> Deer tibia. Our mailing specialist received many strange looks from both postal clerks and members of the public in line when he picked it up at the station, 9 days. The clerk put on rubber gloves before handling the bone, inquired if our researcher were a "cultist," and commented that mail must be wrapped. 
1 comment on original post

Laurent Bossavit

The cost of defects curve - a matter of politics or evidence?

This started out as a reply to +George Paci's question on an earlier post. It got long, and it got me to a surprising conclusion, as alluded to in the title, so I'm making it a separate post.

The question was: "does "cost to fix" include "cost to find"?  That is, figuring out where in the code the problem is, as opposed to finding out there is a problem. If it does, then I'm willing to believe that there's a big jump in cost somewhere between an hour and a month."

I would answer "yes", because I'd want to know what caused a bug, before fixing it... lest I make things even worse.

But I'm not convinced this amounts to "a big jump in cost" driven only by time.

To see why, let's break down this part of the cost model even further. There is:
- the time needed to reproduce the bug (IMNSHO a prerequisite to pinpointing the code location(s) involved),
- the time to analyze the faulty behaviour,
- the time to come up with a model of the defect injection (i.e. was it a programmer mistake at an isolated code location, was it the correct implementation of a misunderstood "specification", was it the correct implementation of a flawed "specification", was it a "perfect storm" combination of factors hard to foresee, and so on),
- the time to come up with a remedial plan (including considering the risk of introducing a secondary bug),
- perhaps the time to code a unit test exposing the bug,
- and finally the time to actually perform the fix.
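To make that breakdown concrete, here is a minimal sketch of the cost model it implies, in Python; the component names and durations are invented placeholders, not measurements from any project.

# Hypothetical breakdown of the cost of fixing one defect, in person-hours.
# The numbers are invented placeholders, purely to illustrate the decomposition.
fix_cost_components = {
    "reproduce": 4.0,        # getting a reliable repro
    "analyze": 2.0,          # understanding the faulty behaviour
    "model_injection": 1.0,  # working out how the defect got in
    "plan_remedy": 1.5,      # weighing the risk of secondary bugs
    "write_test": 1.0,       # a unit test exposing the bug
    "apply_fix": 0.5,        # the actual code change
}

total_cost = sum(fix_cost_components.values())
print(f"total fix cost: {total_cost} person-hours")

# Note: none of these components mentions "time since the defect was injected";
# in this view the drivers are code complexity, tooling and communication.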

If you look at it this way, I'd be strongly inclined to say that the difficulty in these various factors is seldom related to time between injection and detection. Rather, it is about cognitive complexity of the code, effectiveness of the tooling (for instance, reliable repro has been my main technical contribution at a recent client's), quality of communication among devs and between devs and product owners, and so on.

None of these are strongly affected by time per se, though they might be related to a general "entropy factor". When your code base starts getting really crufty, a line of code you changed last week can come back to bite you in nasty ways. Conversely, if you keep your code small, terse, well factored, mission-focused, well-supported by appropriate tooling, most bugs will be quick to reproduce, find, fix and verify.

Now let's go back to the phase-wise model. The thing to notice is that in my comments above I'm only talking about coding defects. Even in-phase the above factors can make for big variances in what it costs to fix a defect.

(Whereas the phase-wise cost of bugs curve strongly implies that there is such a thing as "the average cost" to fix a defect, which leads us to imagine a nicely behaved bell curve rather than a more quirky distribution.)

What the phase-wise model requires us to contemplate is how, for instance, the cost of fixing a "design defect" stays small if you find the defect during design, and blows up if you find it after the design is coded. But "finding" the bug - in the sense of "noticing there is a problem in the first place" - is a very different activity depending on design vs coding.

For instance, when I think of a "design defect" I might think that a design called for algorithm A, and it later turns out that was a poor decision and algorithm B would have been more suited. If you're "detecting defects" in the design stage, that basically means having hypothetical discussions, like "Algorithm A would be much slower if conditions XYZ apply". After coding, it's more a matter of "testers (or users) have noticed operation X takes 30 minutes instead of the 3 seconds that would be acceptable".

It might seem that "fixing" the defect in the design stage takes little time (because we envision striking the words Algorithm A and replacing them with Algorithm B), but is it that simple? For instance, to have a grounded preference for one algorithm over the other might require having some knowledge about how many users will actually use the app, or it might require prototyping to benchmark the two algorithms, or some such.

At the coding phase, the "fix" is to replace Algo A with Algo B. This should cost about the same irrespective of how much time has passed, but is definitely going to be costlier if your coding practices tend to "spread around" the implementation of an algorithm among the entire code base: duplicated code, implementation-dependent assumptions made by callers of the algorithm, and so on.

And yet the claim is not framed as "the cost of bugs rises quickly with bad programming practices". That would keep the focus on programming, and it would leave the programmers in charge of improving.

Framing the issue as "we need to thoroughly review the requirements and the design and only then will we trust the programmers with the relatively straightforward task of faithfully translating this thinking into code"... leads to a rather different managerial approach.

This is why, I suspect, people say that the "cost of defects curve" must be true because it matches their experience, in the face of repeated failure to find good evidence supporting it.

It's not about reality. It's about the politics of our profession.

-------

If you liked this post, consider supporting me by buying my book: http://leanpub.com/leprechauns
3 comments
 
It's theoretically possible that ample documentation could alleviate the problems with memory issues. I've yet to see documentation that likely would do so. It's very hard to communicate clearly in prose. The publishing industry has groups of people and elaborate processes to increase the likelihood, but our technical documentation doesn't get that level of attention. Also, the documentation cannot be tested to see if it's factually correct, as test cases can.

Of course, the problem is even worse when the documents must carry information not only through time, but through the air-gap between people.

Laurent Bossavit

What does it mean to look at data?

Here's two months' worth of defect data from a group I've worked with. This includes both production and development defects (I don't yet know how to break out production defects reliably).

What it is: "fix dates" extracted from commit logs in version control, correlated with "detection dates" extracted from the defect tracking tool, by defect ID in ascending order.
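For anyone curious about the mechanics, here is a minimal sketch of that correlation in Python; the file names, column names and date format are invented for the example, since the real extraction depends on your version control system and tracker.

import csv
from datetime import datetime

def load_dates(path, id_col, date_col):
    """Map defect ID -> date, from a CSV export (hypothetical format)."""
    out = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            out[row[id_col]] = datetime.strptime(row[date_col], "%Y-%m-%d")
    return out

# "Detection dates" exported from the defect tracker,
# "fix dates" extracted from commit messages mentioning a defect ID.
detected = load_dates("tracker_export.csv", "defect_id", "created")
fixed = load_dates("commit_log.csv", "defect_id", "commit_date")

# Days between detection and the commit that fixes the defect,
# for defects present in both sources, in ascending ID order.
delays = {d: (fixed[d] - detected[d]).days
          for d in sorted(detected) if d in fixed}
for defect_id, days in delays.items():
    print(defect_id, days)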

The group is composed of several Scrum teams, has been using Scrum for years. This is one of those cases where a change of perspective turns out to be useful, specifically opening the Lean toolbox to complement Agile.

What jumps out is that calendar times to fix defects are all over the map, and in some cases surprisingly long.

What isn't as obvious is that these recorded schedules underestimate how long defects actually stay in the system.

I've looked at individual examples more closely, such as one where the first evidence that the bug exists in the system predates the "official" defect ticket by several weeks and can still be observed in production logs for some weeks after the defect is officially marked as "closed".

This right here is a big hint as to how much you can trust reports on empirical data from software projects, even by such big names as Capers Jones or Barry Boehm: not a whole lot.

To me, being "empirical" means starting with the individual data points. It means understanding just how much and just how little insight is conveyed by one instance of a defect ID in the commit logs matching up with a defect marked as closed in the bug tracking tool.

These are social events: they don't say "a bug exists", they say "a person with standing to do so has decided to create a bug item" or "a person with standing has decided to close the bug".

Or alternately, you could describe them as "a player in the Testing Game has played the Report Bug card". Describing what people do in software projects as moves in a game may make a lot more sense than speaking as if bugs were "real" things with measurable properties. If nothing else, it would remind us of the possibility of "gaming" in the pejorative sense. If the rules of the game are such that the Testing Clan wins if the overall Bug Count score is low, then it's less surprising to see a lot of lag between the time a bug could be detected and the time when it's finally entered in the bug tracking system.

Overall, to be sure, people would prefer if the program "really" had fewer bugs and users "really" were happier. But these "realities" may not loom as large as the imperatives of everyday work, and since people only have about 8 hours of that each day to spend, and must make decisions on what to allocate these hours, they may trade off the external "realities" against the more pressing ones of looking good to other teams and to management.

To me, as an advisor to the group, the chart is actionable only because it allows me to take a step back from a random set of distinct stories about one defect, and weave a larger story about how the group as a whole responds to defects.

My role, if you'll excuse me for lapsing into coach-speak, is to "reveal the system to itself". It is to make apparent the gap between the espoused goals (to minimize production bugs, to maximize user satisfaction) and the day-to-day decisions of the group in practice.

The operations I choose to perform on the data aggregated into this chart only make sense or not relative to this purpose.

I could take the average of the delays between detection and correction. It's around 16 days. That number is just about meaningless, as are most averages.

What the Lean thinking toolkit suggests is to extract from the chart a "best case" number. That tells us how fast it's possible to process most defects. Eyeballing, I see that some defects are processed the same day as they are reported, and that perhaps as many as half of the defects take less than about 5 days, give or take.
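To illustrate why I distrust the average here, a small sketch with an invented, skewed set of turn-around times (not the group's actual data):

# Invented turn-around times in days, for illustration only -- a skewed
# distribution like the one on the chart, not the group's actual data.
delays = [0, 0, 1, 1, 2, 3, 3, 4, 5, 6, 8, 12, 20, 35, 60, 90]

mean = sum(delays) / len(delays)
median = sorted(delays)[len(delays) // 2]
same_day = sum(1 for d in delays if d == 0) / len(delays)
under_5 = sum(1 for d in delays if d < 5) / len(delays)

print(f"mean: {mean:.1f} days, median: {median} days")
print(f"fixed same day: {same_day:.0%}, under 5 days: {under_5:.0%}")
# The mean is dragged up by a few long tails; the "best case" and the
# bulk of the distribution tell a more actionable story.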

My next step would be to go back to individual data points. Not the ones in the past, because personal recollection is unreliable and I can't expect developers and testers to tell me an accurate story of what happened with defect 3848 that it took 30 days to fix.

Rather, I might sit down with key people in the group and suggest setting 5 days as a target turn-around time for fixing defects. The idea is that we may sometimes do worse, but when we do it's useful looking into why, what causes this large variance. What we find in such cases is likely to help us understand how our system works and how to improve it.

This is being "locally empirical". That's where the value is.

I'm unconvinced there is any value in taking this data, rolling it up with data from thousands of projects (assumed to be similar but which really aren't, assumed to have equivalent methods of collecting that data but really not because of all the uncertainties noted above), and summarizing all of that into "industry wide averages".

The practice of software development should be a lot more empirically based, but I think large parts of the so-called "empirically based software engineering" movement are just barking up the wrong tree.

--------

If you liked this post, consider supporting me by buying my book: http://leanpub.com/leprechauns
3 comments
 
You nailed it. Variances or GTFO.

Laurent Bossavit

The Myth of the Myth of the Myth of 10x

Alan, over at Tooth of the Weasel, has a blog post on "The Myth of the Myth of 10x", defending the old idea of "10x programmers". (It's not recent, but it was new to me; I was pointed to it after Steve McConnell commented to say, basically, "Hell yeah.")

As anyone knows who knows me a bit, I don't think the 10x concept has any credibility. But I'm open to new data and reasoning on the topic.

Also, Alan's post gave me a good opportunity to write up a bit of old history, as I like to do, that few people are aware of. So even if you're done with the whole "someone's wrong on the Internet" thing, read on for that juicy tidbit at least.

Alan's reasoning, as far as I could tell, appears to be "the 10x concept is not a myth because I define it differently from the way it was defined in the studies that are claimed to support the 10x concept".

I don't think this works. What would work for me: Alan's describing someone he's actually met (or has reliable information about), who fits his definition of "someone whose aptitudes allow them to deliver significantly higher output and quality", and some explanation of why they are that way. That would still be anecdotal evidence, but better than no evidence at all.

In a blog post a few years back (http://www.construx.com/10x_Software_Development/Chief_Programmer_Team_Update/) Steve has described what some might call "the original 10x programmer", Harlan Mills. According to Steve, "Harlan Mills personally wrote 83,000 lines of production code in one year" on a project for the New York Times in the early 70s.

I think this qualifies Mills as a "very prolific" programmer. One issue with that descriptor is that, as Alan acknowledged, "prolific" isn't the same as "productive" (and it's one of the tragedies of our profession that we consistently fail to distinguish the two). We all know people who churn out reams of code that turns out to be worthless.

It turns out Mills was one of those people.

At least he was on the particular project Steve describes as "one of the most successful projects of its time". By the way, you don't have to claim that "all programmers are about the same" to make a counter claim to the 10x concept; you can for instance merely point out that if programmers are extremely inconsistent in their performance, that would explain the data in the 10x studies just as well.

Maybe Mills was a 10x on some other project, but my research suggests he wasn't a 10x in Alan's sense of "significantly higher output and quality" on the Times project.

Stuart Shapiro, in his 1997 article "Splitting the Difference", described the same project somewhat differently:

"As evidence, the authors pointed to the development of an information bank for the New York Times, a project characterized by high productivity and very low error rates. Questions were raised, however, concerning the extent to which the circumstances surrounding the project were in fact typical. Moreover, it seems the system eventually proved unsatisfactory and was replaced some years later by a less ambitious system."

Source: http://sunnyday.mit.edu/16.355/shapiro-history.pdf

Shapiro is quoting from a much, much older article that appeared in Datamation in May 1977, "Data for Rent" by Laton McCartney:

"Unfortunately for The Times, the IBM designed system didn't prove to be the answer either. 'They touted us on top down structured programming', says Gordon H. Runner, a VP with The Information Bank, 'but what they delivered was not what they promised.' When the FSD system proved unsatisfactory, the TImes got rid of its IBM 370/148 and brought in a 360/67 and a DEC PDP-11/70. Further, Runner and his staff designed a system that was less ambitious than its predecessor but feasible and less costly. [...] 'With the new approach we're not trying to bite off the state of the art,' Runner explains. 'We're trying to deliver a product.'"

(The PDF for the Datamation article isn't available online, but I'm happy to provide it upon request.)

I find it ironic and funny that "the original 10x programmer" left behind such a bitter taste in his customer's mouth. It reminds me of the ultimate fate of the Chrysler C3 project that was the poster boy for Extreme Programming.

Our profession has long been driven by fad and fashion, with its history written not by the beneficiaries or victims of the projects on which we try new approaches, but by the people most biased to paint those projects and approaches in a good light. Our only way out of this rut is to cultivate a habit of critical thinking.

(I've written a lot more about the 10x myth, and my reasoning for branding it a myth, in my book: https://leanpub.com/leprechauns - if you found the above informative, check it out for more of that.)
I first heard of “10x” software development through the writings of Steve McConnell. Code Complete remains one of my favorite books about writing good software (The Pragmatic Programmer, Writing So...

Laurent Bossavit

Check out the new Rationalists in Tech podcast - I was one of the first interviewees.
I'll appreciate feedback on a new podcast, Rationalists in Tech.  I'm interviewing founders, executives, CEOs, consultants, and other people in the tech sector, mostly software. Thanks to Laurent Bossavit, Daniel Reeves, an...

Laurent Bossavit

Old but good paper on abuse of p-values.
 
One of the advantages of reading old papers is that you can find some hilarious insults. Here's one from Bakan, David, "The test of significance in psychological research," Psychological Bulletin, Vol. 66 (1966), pp. 423-437:

"I playfully once conducted the following "experiment": Suppose, I said, that every coin has associated with it a "spirit"; and suppose, furthermore, that if the spirit is implored properly, the coin will veer head or tail as one requests of the spirit. I thus invoked the spirit to make the coin fall head. I threw it once, it came up head. I did it again, it came up head again. I did this six times, and got six heads. Under the null hypothesis the probability of occurrence of six heads is (1/2)^6 =.016, significant at the 2% level of significance. I have never repeated the experiment. But, then, the logic of the inference model does not really demand that I do! It may be objected that the coin, or my tossing, or even my observation was biased. But I submit that such things were in all likelihood not as involved in the result as corresponding things in most psychological research."

This is an even better burn than it looks because Bakan is also illustrating optional stopping (he would have broken off the flipping if he hadn't kept getting heads), which is routine among psychologists and makes his p-value incorrect; naturally, no one computes their p-value correctly to account for optional stopping...
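To see why optional stopping wrecks the nominal error rate, here is a small simulation sketch (mine, of a general "peek after every flip" rule, not Bakan's exact protocol): flip a fair coin, test after every flip, and stop as soon as the running count of heads looks "significant" at the 5% level. The false-positive rate comes out far above 5%.

import random
from math import comb

def p_value(heads, n):
    """One-sided binomial p-value: P(at least `heads` heads in n fair flips)."""
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n

def optional_stopping_trial(max_flips=100, alpha=0.05):
    """Flip a fair coin, peeking after every flip; report 'significant' if
    the running result ever crosses the alpha threshold."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += random.random() < 0.5
        if n >= 5 and p_value(heads, n) < alpha:
            return True
    return False

trials = 2_000
false_positives = sum(optional_stopping_trial() for _ in range(trials))
print(f"nominal alpha: 5%, actual rate with peeking: {false_positives / trials:.0%}")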
3 comments on original post

Laurent Bossavit

Can we bury the NIST study once and for all now?

The NIST study concluded that "the impact of inadequate software testing infrastructure on the US economy was between 22.2 and 59.5 billion dollars".

As usual, people mention this figure as if it were undisputed fact (for instance, you can find it on a couple of Wikipedia pages). It's a good bet that they haven't read the original document carefully and critically. If they had, they might have noticed some red flags in the "study" and would at the very least hedge by emphasizing that it is an estimate.

There are two important aspects to any estimate: precision and accuracy.

Precision is the size of the error bars around the estimate. "Between $50Bn and $70Bn" isn't at all the same as "somewhere between a few hundred million and a few hundred billion, with sixty billion being our best guess". With a narrow spread, it's much easier to justify investing some proportionate amount of money in attempting to solve the problem. If your uncertainty is large, there's a greater risk you'll be wasting money.

Accuracy is about whether we even have reason to believe that the estimate has landed anywhere near the "true" value. Are we over-estimating? Under-estimating? Giving an answer that doesn't have anything to do with the question being asked?

The NIST procedure, as I was able to reconstruct it, went something like this (I'm actually simplifying a bit):
- ask survey respondents the question "how much did minor bugs cost you last year"
- average this across all respondents
- divide total expense by number of employees at respondent, to get a "cost of bugs per employee"
- multiply the cost of bugs per employee by total employment in that sector, based on BLS employment data

(Except that to extrapolate the results of their financial services survey, instead of employees they scaled by "million dollars in transaction volume".)

Then they "normalized" all that again into a per-employee cost for both automotive and financial sectors... and scaled it all up again to the entire economy, again by multiplying by X million employees. Now, whatever one thinks of this procedure (I think the heterogeneous scaling factors are at best bizarre), it can't escape the laws of physics.

Specifically, that any measurement is subject to uncertainties, including the measurements from "number of employees". And these uncertainties add up as you add together estimates, or multiply one estimate by another.
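To make the shape of that procedure concrete, here is a minimal sketch with invented numbers (my reconstruction of the general approach, not figures from the report):

# Invented survey responses: (reported annual cost of minor bugs in $,
# number of employees at the respondent). Not NIST's actual data.
responses = [(2_000_000, 400), (500_000, 150), (8_000_000, 2_000)]

# Average per-employee cost across respondents.
per_employee = sum(cost / employees for cost, employees in responses) / len(responses)

# Scale up by total sector employment (a BLS-style figure, also invented).
sector_employment = 1_100_000
sector_cost = per_employee * sector_employment
print(f"extrapolated sector cost: ${sector_cost / 1e9:.2f}Bn")

# Rough uncertainty propagation: if the per-employee figure is good to +/-50%
# and the employment figure to +/-10%, the relative errors compound when the
# two are multiplied -- and that is before adding up several such estimates.
relative_error = ((0.5 ** 2) + (0.1 ** 2)) ** 0.5
print(f"relative uncertainty on the product: about {relative_error:.0%}")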

To get a grip on the uncertainties involved, I tried to replicate the work of the NIST authors: that is, I tried to reproduce their derivation of the final estimate based on survey responses and estimates from BLS.

For instance, about half of NIST's total estimate can be accounted for by the costs directly incurred in paying developers and testers; the other half by the cost to end users as a consequence of software bugs. These are two distinct estimates which are added up to get the final answer. The sub-estimates are further subdivided into estimates for the automotive sector and for the financial services sector (the two sectors that were surveyed), and subdivided again into estimates for the costs from "major errors" and "minor errors" and other categories, and so on.

I eventually gave up because after a few steps I just couldn't find any way to get their numbers to add up. (A link to the spreadsheet is attached; readers are more than welcome to copy, check and improve upon my work.)

Though ultimately fruitless, insofar as I wasn't able to reproduce all the steps in the derivation of the final estimate, the exercise was worthwhile. I got quite familiar with their numbers, in the process of trying to understand their derivation. I learned new things.

For instance, the study breaks down costs incurred through bad testing into various categories, including major errors and minor errors.

Apparently, for "minor errors", and in the automotive sector, the average cost of one bug in that category was four million dollars.

(Yes, they seem to be claiming an average cost per bug of $4M. This from table 6-11. I'm actually hoping someone tells me I'm interpreting that wrong, it's such an embarrassingly absurd result.)

Also, whereas "major" errors cost 16 times as much as "minor errors" in small automotive companies, this reverses in large ones, with "minor errors" having a substantially higher cost than "major errors".

So someone who believes the $60Bn number would also have to believe some very counter-intuitive things - since these numbers are inputs to the overall estimate.

The alternative is to believe there are serious problems with the study. Which opens up the question of its accuracy. On that score, two major aspects in academic research tend to be sample size and methodology. NIST's research was survey-based.

How many people did NIST ask? Paragraph 6.2.2 informs us that "four developers completed substantial portions of the entire survey". Section 7 is a bit vaguer about how many people responded for the "developer" portion of the costs, but it looks as if the total sample size was less than 15, which seems like a direly inadequate basis on which to place half of the total estimate.

The surveys of end users seem to have had a more reasonable sample size: 179 respondents in the automotive sector and 98 in financial services. (However, it must be noted that the surveys had rather dismal response rates, 20% and 7% respectively.)

What did NIST ask? They asked for a few people's opinion of how much they spent on bugs, and when. The inputs to the model are quite literally educated guesses. One survey is about 40 questions long, and respondents were told that they could answer the survey in 25 minutes including time to research the data.

I would argue that most people have no idea how much bugs cost other than the "exponential rise" model which largely predates 2002. If you have less than a minute to answer a question about how much bugs cost, you're probably going to reach for the answer you remember from school or that you read in articles.

So, this "survey" about the cost of bugs would predictably be largely self-fulfilling. You get the numbers you expect to get. The numbers' connection with reality is tenuous at best.

If you are quoting the $60 billion estimate, you are basically endorsing:
- odd findings such as a cost of $4M per minor error
- the idea that minor errors may cost more than major ones
- the statistical validity of unreasonably small sample sizes
- most problematically, the validity of opinion over actual measurement

Think about this before spreading the NIST numbers any further.

-------

If you liked this post, consider supporting me by buying my book: http://leanpub.com/leprechauns
3 comments
 
+Laurent Bossavit OK, cool. I responded without reading the underlying study. :)

Laurent Bossavit

Production defects

I've been writing a lot about the "Great Leprechaun" lately, that is, the "cost of defects curve" which claims that the cost of fixing a defect rises exponentially with the phase-wise separation (and, some also claim, separation in time in general) between introducing and fixing a defect. I want to clarify some of my thinking and background assumptions.

The original, "phase wise" version of the claim feels suspect to me, largely because a "requirements defect", so-called, is a totally different animal from a "coding defect".

A basic taxonomy of human error distinguishes between faulty intention and faulty execution. A "mistake" is a faulty intention, whether or not correctly executed: trying to book a train from France to Corsica is a mistake, as Corsica is an island. A "slip" is a correct intention with faulty execution: booking a flight to Porto, Portugal instead of to Porto, Corsica.

Lumping all defects, including so-called "requirements defects", together with "coding defects" in the same category is therefore as unhelpful as failing to distinguish between actions and intentions when studying human error: it is basically a non-starter for thoughtful investigation.

When I think of "code defect" I mean a programmer creating or changing code, such that the program's behavior after the change unintentionally inconveniences one or more users. This is often a programmer's mistake, but can also result from the programmer correctly carrying out a change requested by someone else, where the change is a mistake. (A programmer's "mistake" is a "slip" on the part of the larger system, insofar as it makes sense to apply these labels to systems consisting of more than one individual.)

Additionally, the phrase "cost of defects" lumps together several categories of costs that have nearly nothing to do with one another: the costs (largely dependent on the hourly rates of testers and developers) of detecting, locating and correcting the defect, or whatever we want to call it (slip, anomaly, mistake, bug) on the one hand, and the total costs that accrue to the business as a consequence of the bug: lost custom, lost reputation, reimbursements paid to customers, lawsuit settlements, and so on.

The big difference is that there is an upper bound on the first category of costs, namely, the cost of doing the project over. There is no such limit on the second category - it can potentially grow large enough to put the owner of the software out of business.

It seems to me, therefore, that the most important cost driver is whether a defect reaches production.

If we look closer at the way software development is carried out, there are two main scenarios; here is the optimistic one:
- programmer pushes the defect to version control
- defect is detected, via testing or otherwise
- programmer fixes the defect
- the fix is verified and pushed to version control

In this scenario the defect merely wastes some of everybody's time.

Here is the scenario everyone dreads, "production defects":
- programmer pushes the defect to version control
- a version of the program with the defect goes live to users ("in production")
- defect is detected, often by users, sometimes otherwise
- programmer putatively fixes the defect in development
- the fix is verified and pushed to version control
- a version of the program with the fix goes live to users
- the fix turns out to be appropriate for users

(Let's not overlook the possibility that the first scenario turns into the second. It does happen, more frequently than I'm comfortable thinking about.)

This scenario is necessarily longer, more complex, has more potential for nasty surprises. For instance, development and production environments may differ enough that developers think the defect they are fixing is the same as the defect the users are experiencing, but this is frequently not the case, resulting in several loops through the above steps.

In addition to wasting time, users are inconvenienced. This is a big deal because there is no necessary proportion between the development effort and the user population.

In the first scenario, the costs accrue when the testers or developers are actually working. If they leave the defect aside temporarily, to work on something else, and resume later, the costs more or less stop rising during that pause. (There may be costs due to context-switching, and in practice these are far from negligible, but we can ignore that for the time being.) Let's call these "active costs".

In the second scenario, users may keep encountering the bug even when the developers aren't actively working on it.

In the second scenario, costs start to accrue as soon as the defect is released; there may be quite some time between release and detection.

In the second scenario, costs keep mounting after the defect is "fixed", particularly in organizations that have even somewhat rigid schedules and procedures for releasing software to production.

Organizations that practice "continuous deployment" are spared these costs, which we may call "passive costs", but not the rest - the earlier "windows" for passive costs remain.
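To make the active/passive distinction concrete, here is a toy cost model in Python (the function, the rates and the durations are all made up for illustration):

def defect_cost(active_hours, hourly_rate, days_in_production, daily_user_impact):
    """Toy model: active costs accrue only while people work on the defect;
    passive costs accrue for every day the defect sits in a production
    window, whether or not anyone is working on it."""
    active_cost = active_hours * hourly_rate
    passive_cost = days_in_production * daily_user_impact
    return active_cost, passive_cost

# Scenario 1: caught before release -- only active costs.
print(defect_cost(active_hours=6, hourly_rate=80, days_in_production=0, daily_user_impact=0))

# Scenario 2: reaches production and waits out two release windows -- the
# passive term can dwarf the active one, and it keeps running while nobody
# is actively working on the fix.
print(defect_cost(active_hours=10, hourly_rate=80, days_in_production=45, daily_user_impact=500))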

It seems to me that production defects are a bigger deal than other kinds, yet my perception (possibly biased) is that many teams out there tend to focus overmuch on things like regression testing, and very little on early detection of production defects and reduction of the "passive cost windows".

However, I don't expect that these phenomena will be studied empirically much, if at all: the doctrine of the "exponentially rising cost of defects" is so widely accepted and takes the place of actual knowledge in the minds of so many, there seems to be little interest in thinking and investigating closely and carefully.

Software engineering, as a discipline, is very much like the Zen student with a full cup of tea.

--------

If you liked this post, consider supporting me by buying my book: http://leanpub.com/leprechauns
11 comments
 
A good question. Googling for "examples of requirements defects" turns up this very interesting paper: http://www.itu.dk/~slauesen/Papers/PrevDefectsProceedings.pdf

Laurent Bossavit

The NIST report on "Economic Impacts of Inadequate Infrastructure for Software Testing"

This is a rather famous document from 2002, much cited, the source for the assertion that (at the time) software bugs were costing the US economy upwards of $60 billion. (This is in fact, surprise surprise, a misquotation, since the report itself specifies that its estimate is a range of $20 to $60 billion. But what's $40Bn among friends, eh?)

It also appears to be a nest of Leprechauns, if my initial impressions are correct.

We can see all the usual patterns of sloppy citation practice, such as citations of unverifiable documents: the NIST estimates for allocation of effort in projects are sourced to an unpublished master's thesis by students at the Lund Institute of Technology, on the topic of "Formalizing Use Cases with Message Sequence Charts". This thesis is nowhere to be found, not even on the Lund University website. Given the topic, you can bet that this isn't the primary source of the "data", such as there is, on allocation of effort in software projects.

Even more questionable is the inclusion of the chart below (linked picture), which purports to summarize the economic and human impacts of software failures in aerospace at the turn of the century.

It seems likely to me that the authors of the NIST study never saw the original document (which might have provided some context), rather "lifting" the citation from a secondary source. Infuriatingly, the only reference provided is "NASA IV&V Center, Fairmount, West Virginia. 2000". (This is a misspelling of the Fairmont WV facility's name.) No title, no author, no nothing.

It is in any case clear that they never bothered to fact-check it. I wasn't able to find the primary source, but did find an independently copied version of the same table (slide #23), dating the original to 1999 and not 2000: http://www.nasa.gov/centers/ivv/ppt/172489main_Nancy_Eickelmann_Motorola.ppt

For instance, the first column refers to the Lufthansa Flight 2904 crash at Warsaw. You can read a detailed account on Wikipedia: http://en.wikipedia.org/wiki/Lufthansa_Flight_2904

The loss of life figure, to start with, is simply incorrect - two people, not three, were killed in the crash.

The Wikipedia page also states that "the main cause of the accident was incorrect decisions and actions of the flight crew". The flight control software was indeed involved in the crash, but not due to a software error or "bug": the crash was a consequence of the software working as designed! In fact, one of the remedial actions afterwards consisted of changing a parameter to the software - the amount of pressure needed to consider that a wheel is "on the ground".

There is no indication that a software error of the kind that testing could have prevented was to blame for the crash. None.

Things get even worse in the second column, where the "loss of life" figure jumps out at first sight. For some reason (this was not immediately clear on reading the heading) this lumps together several unrelated incidents. One of them is the famous "Ariane 5 bug", about which much has already been written (no loss of life, monetary losses around $500M) and was definitely due to a software issue (though not necessarily one that testing could have prevented).

But what of the rest? I've only been able to find reliable information on "Flight 965" (on which more shortly). But a (moderately thorough) search for problems in 1996 with missions named Galileo or Poseidon turned up... nothing. Maybe I'm not being patient enough.

As for 965, Wikipedia has once again all the sad details:
http://en.wikipedia.org/wiki/American_Airlines_Flight_965

It is in fact the case that 160 people died. (So the entire "loss of life" cell in that column comes from the inclusion of that tragedy.) Again, the official story about the causes of the crash doesn't mention software at all. However, it's worth keeping in mind that big accidents rarely have a single cause; rather, they are the outcome of a complex web of interacting causes, often involving a mix of human error, design deficiencies, and whatnot.

Finding what role software is even alleged to have played in the crash takes some legwork. Here is what I was able to find:
- A 1996 story in which the airline was seen trying to deflect the responsibility for the incident on to the software provider, Jeppesen-Sanderson of Englewood, CO.
http://articles.sun-sentinel.com/1996-08-30/news/9608290592_1_american-airlines-flight-crew-flight-management-computer
- A transcript of the voice recording which does bear out the story as interpreted by the airline, namely that the crew was trying to set an autopilot course for "Rozo", but this led them on an incorrect flight path because that particular airstrip was not present in the database:
http://www.avweb.com/news/safety/183057-1.html

Again, the question to ask here is, "is this the kind of software error that testing would have found"? Even if you give total credence to the airline's causal interpretation - and the official record, after all the dust had settled, seems to have dismissed it entirely - we are again dealing with a database issue, an issue of the software's operation and data rather than its design or its implementation.

My conclusion here is that the NIST report could quite possibly keep me busy for months, if not years. As I have other things to do today, I leave it as an exercise for the reader to check the rest of the table, and if you're so inclined, the rest of the report.

--------

If you liked this post, consider supporting me by buying my book: http://leanpub.com/leprechauns
 
So the question is: how do today's software projects compare to those cited?
Basic Information
Gender
Male