This is a great effort. They're naming names and taking no prisoners.
EDIT: See instead.
Reproducibility in Computer Science. Reproducibility is a cornerstone of the scientific process: only if my colleagues can reproduce my work should they trust its veracity. Excepting special cases, in applied Computer Science reproducing published work should be as simple as going to the ...
My guess is this is improving slowly over time, albeit from a terribly low base. But, clearly, we as a community have a huge amount of work yet to do.
11% of the builds ended with Internal Compiler Error? +John Regehr, deliver us from madness!

(Coincidentally, I remember someone claiming that if every car in the US worked mechanically perfectly, all the time, we'd still have something like 90% of the traffic fatalities we do)
The only place on the internet where "hardware" and "theoretical" would be in the same bucket :). CS is a circle!!
section 4.3 ("So, What Were Their Excuses? (Or, The Dog Ate My Program)") in the tech report is really quite damning and disappointing. :-(
But argh, they could not reproduce our results because they couldn't install flex??
+Úlfar Erlingsson +Carlos Scheidegger I don't know the actual number, but would guess something much greater than 99%.

I think the paper misuses "reproducibility": this is really measuring how willing researchers are to make useful distributions of their source code and data. Encouraging people to distribute the software they build in an open and useful way is certainly valuable and something researchers should be much better at (mostly, I think, by moving to a model of distributing VM images, not just source code that is unlikely to build on different platforms). But the fact that the code builds correctly on your platform doesn't make the result reproducible, and the fact that it requires a lot of effort to build or re-implement the idea doesn't mean it isn't reproducible.

Reproducibility should mean: if someone is willing to put in the effort to re-implement what is described in the paper and re-execute the experiment, do they get the same results, or are key parts of the result sensitive to details of a particular environment or to some implementation detail (too irrelevant to mention in the paper, but essential for achieving similar results)?

On that score, I think CS probably does quite badly compared to other areas, but there are very few data points since it is so expensive and unrewarding to actually reproduce experiments. 
"Reproducibility" is clearly the wrong word here, but I think the authors know that. Maybe they're building up to it. Or something. I suggest ignoring the title.

On Facebook, however, several people are complaining about poor communication by the authors (echoed here by +John Regehr), which is a much more disconcerting complaint. I would really like these numbers to mean something, and not be a hatchet job (which I don't expect of these authors).
Well, I don't think it's a hatchet job; they have to draw the line somewhere, but maybe 30 minutes by a given student wasn't enough in some cases.
And certainly their overall point is correct.
Arguably the study does not check for reproducibility. I think the study reports on the willingness of published authors to support a specific form of reproducibility.

The question in my mind is how far published authors need to go out of their way to support the reproduction of their results. Would someone who provides a VM image for download, containing binaries obfuscated using fully homomorphic encryption, be in the clear for reproducibility?
The problem with "naming names" is if you don't do a very good job...

So I don't recall a request from them, but they do list +Wyatt Lloyd's Eiger as having a build fail. Again, not sure to what extent they actually tried (as +Nate Foster pointed out, all their build instructions are 404'ing).
Saying that a build failed in MS Visual Studio with compiler errors "namespace __gnu_cxx not found" and "cannot find ext/hash_map": now that's just mean! Sure, you should make your code portable to any OS, if you're shipping professional software! But why should we waste our time accounting for MS's endless quirks in order to make a cross-platform implementation for a paper? Good thing that C++11 put a stop to some of this nonsense.
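(To make the C++11 point concrete, here's a minimal, purely hypothetical sketch, not taken from any artifact in the study: <ext/hash_map> and the __gnu_cxx namespace are GCC libstdc++ extensions, which is exactly why MSVC can't find them; the standard C++11 std::unordered_map does the same job portably.)

    // Purely illustrative snippet: pre-C++11, GCC-only code often used the libstdc++ extension
    //     #include <ext/hash_map>
    //     __gnu_cxx::hash_map<std::string, int> counts;   // MSVC: "namespace __gnu_cxx not found"
    // The portable C++11 equivalent builds under GCC, Clang, and Visual Studio:
    #include <iostream>
    #include <string>
    #include <unordered_map>   // standard replacement for <ext/hash_map>

    int main() {
        std::unordered_map<std::string, int> counts;   // portable hash table
        counts["builds"] += 1;
        counts["build fails"] += 1;
        for (const auto& kv : counts)
            std::cout << kv.first << ": " << kv.second << "\n";
        return 0;
    }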
+Sarang Joshi Surely everything that could be reproduced should be reproducible? Otherwise we're expecting authors to say "we're not making our work reproducible because it's not good enough to make it worth your while". Which would be admirably honest, but seems unlikely.
+Sarang Joshi, sorry, that question doesn't make sense. First of all, note that what these authors mean by "reproduce" (a terrible choice of term on their part given its usual meaning in science) is simply "can we build the code for the system". Second, by that standard, everything. Because we never know which paper is going to be useful when. I view what they're doing as a "kick-the-wheels" test: before you buy the car and go on a long trip, you want to make sure the wheels are at least inflated. It feels like a basic test that a system should pass.
+Magda Procopiuc, not so fast. If your goal is to run a program to get some results and put them in a paper, sure, you're right. But if your goal is to write something that a year or two from now someone else can run either to (a) incorporate into their system or (b) check their results against yours, then it is reasonable for your code to meet a slightly higher standard. Right now there is no incentive to do so (other than personal pride, etc.), and that could use some fixing.

For instance, if someone is claiming a certain optimization, how do they know it wasn't because of one of "MS's endless quirks" as opposed to what they did? Presumably they actually ran the thing on a few different compilers...
+Sarang Joshi, thanks — was not aware of that journal, and that does indeed seem to be a kind of reductio! (Going too far, IMO, but it is an interesting effort. I wonder how it stacks up in terms of impact factors and prestige.)
Yep, CDE is linked as a suggested tool from the AEC instructions. He may be all growed up and an Internet celebrity now, but even +Philip Guo was a grad student once, and during that time he was a stalwart on the very first AEC.
I just read section 4.3, man that is priceless stuff.
Wait till you get to appendix A.2. Anyone who's dealt with trying to build on someone else's work has had such an experience, and I'm really glad they've had the guts to write it all down.

My only frustration is that they didn't pursue the last step. I would gladly have chipped in some cash to pay for it, as I imagine would have several others. Then we could have taken this to its logical conclusion.
Again, to clarify my original statement: I don't think much of the fundamental quality of published work in general, even in the most respected conferences, so worrying about whether it's reproducible (or whatever else you want to call it) is really looking where the light is shining. There is an even larger debate in the scientific community beyond CS: e.g., check out
Yes, every bit of 4.3 and A.2 rings true. My group is having exactly this problem right now as a matter of fact. Time for a kickstarter for Christian's project?  I'll throw in $25 or more.
I just asked Christian about this over email. Fingers crossed.
+Sarang Joshi, yep, I've been following this debate because I've been pushing for a process called artifact evaluation for conferences. However, there may be a bit of a chicken-and-egg issue here. More vigorous evaluation will, I think, push papers into two camps: those that are actually rigorously evaluated and can be built upon, and those that are "idea" papers and can then more freely speculate and take risks. Right now we seem to be in a trough between the two.
well... one of my papers, with +Yossi Gilad, LOT: A Defense Against IP Spoofing and Flooding Attacks, is marked `bad' here. Turns out an unnecessary file, added at the last minute by a student, broke the make; and they didn't ask us about it (although our emails are there). Oh well, nice to get it fixed, but...
I don't understand the element of (some) academic culture where if something is not "usable" then there is a belief it should not be released. Clearly it should just be an issue of how you label the code, what expectations are attached to it, etc. (i.e. some of the concerns in 4.3.10 seem valid to me, but no bar to release as such).

Inspectability feels like the key criterion here, and the reluctance to embrace it feels odd. On the other hand, build breaks and bitrot are such a common feature of the software world in general that it hardly seems to me as though we need to hold academia to a higher standard.
+amir herzberg, +Gershom B, +Mike Freedman, et al: I don't really support the methodology here. Building is neither necessary nor sufficient. But I do value someone trying to keep the community honest. I personally, not surprisingly, prefer the AEC methodology, which is painstaking but I think really valuable (and does not rely on "gotcha" methods such as the ones that seem to have tripped up some people here). Inasmuch as this gets us talking, it's good.

Also, to be clear, I did not realize they were in quite such a strong gotcha mode. Either I didn't read their document carefully, or they didn't spell it out clearly. Posting this here and on Fb has revealed quite how unfair their method is to some of the people named, and that's certainly something that makes me uncomfortable.
Some of the other comments, from papers deemed to build:

"BUILD:COMMENT[string] Assumed running on merit, since compiling would take half to an hour to complete, and testing would require even more time."
 "BUILD:COMMENT[string] Decided that runs on merit of popular usage, since build is extremely time-consuming. "
"BUILD:COMMENT[string] assumed runs, since running requires writing a source file in a new langueage."
I suspect the "gotcha" methodology is really just the consequence of attempting to do a first-order approximation on a large dataset. I can't imagine the effort it would take to sort through that many even trivial build errors.

It's just that, as far as the "meaning" of the data goes, it would probably be better to distinguish "conditional" yeses from absolute "no"s in the email responses, and also to disregard the build results almost entirely.

Also, of course, given tools such as Agda, Coq, and PLT-Redex, it is not necessarily the case that purely theoretical results should not have verifiable artifacts :-)
One spot check on their methodology:  I went back in my email and checked - I sent them links to two different source code tarballs in the same email reply (since we got 2 requests).  Both are hosted on the same server, both links are correct, neither has been updated since the email was sent.  One project is listed as building, the other is listed as "link broken".   Possibly, a transient server or network failure?  But, no follow-up email?  Nor even a retry some time later?  I'm not sure what this proves...

(follow up): After reading their report, it looks like they sent no more than 1 email to any author. So, my guess is that my replying on behalf of the co-author who actually received the request messed up their process. Sheesh.
from reading their paper, their methodology is:
- if there's no URL to the code, they send (2?) emails to ask for the code.
- if there is a URL to the code, they depend on 30 min of student time to build it, do NOT send any email to address issues, and flag the result as `non-reproducible`.

This raises the question: are their results reproducible? I think they should publish their work at an ACM conference!
About the "other fields do it better than CS" notion alluded to above:

I've had at least two separate exchanges with people from different fields about this problem. Strikingly, both conversations went like this:

they: "I'm sure CS is better than what we do in our field right now".
me: "what? You have no idea how bad it is in CS"
they: "what? you have no idea how bad it is in $field_X"

Brian Nosek, a psych prof at Virginia, built a non-profit to tackle the problem in experimental psychology.

(This is another way to say "there really should be a kickstarter for this kind of project")
I did not mean to allude to that at all. I simply think the whole problem goes away if we stop taking conferences (or, for that matter, just the act of publishing at any venue) so seriously. But that would mean that hiring committees, tenure committees, etc. would actually have to read the papers and make their own judgments about how good the work is. What a concept!

Here is an interesting write-up:
I'm curious where this is being submitted, anyone know?
My extremely biased sample of two OOPSLA papers by people I know (Dave Herman, Sukyoung Ryu) suggests that in the first case, there was an undeclared dependency on the 'menhir' package, and that in the second case (far more amusing), the evaluators were trying to run a "Setup.hs" file, and wrote: "No readme has been provided. There is a .hs file (google says it is a haskell file but no download for haskell available). Tried to run the above using javascript." Strangely, javascript was unable to run a file written in Haskell. Darn Haskell!

This suggests to me that at this point, all this paper has discovered is that we don't yet have reliable frameworks for specifying and installing dependencies.
ok the Setup.hs one is hilarious, so thanks for that!
+Sarang Joshi, I've had that paper linked from my page for years (Don Geman's brother Stu is a colleague). But you're setting up an either-or. There's no reason for a department to not trust its own judgment, but there's also no reason for a department to not let outside experts help it shape that judgment. And I don't think we would be better off going back to a time when nobody published anything with any seriousness — that would just bring back a good ol' boys club.
+John Clements, we need a URL for the Haskell link. If that's really what someone wrote, then they're an idiot and this discredits at least that evaluator significantly.
Lots wrong with this. For starters, they are confusing repeatability and reproducibility (reproducibility requires independence). Kind of a big mistake.

If the point is to say "most research SW sucks", I think the answer is "duh?". Not sure you can conclude much else.
+Mark Reitblatt, as I said yesterday, ""Reproducibility" is clearly the wrong word here, but I think the authors know that. Maybe they're building up to it. Or something. I suggest ignoring the title."

Their point is not that it sucks. They didn't say that, and it's a cheap rhetorical device to impute it. They said that the artifacts are difficult to find and build, and though there are several questions about their methods, it does confirm what I believe is a problem, and the fact that people are not very responsive is also problematic.
I don't see how their approach could possibly answer any other kind of question. Research artifacts are rarely built for the purpose of wider use, and that's fine. There's nothing in the scientific method that requires that they be.

From an engineering perspective, it would be highly preferable if we could build directly upon each other's systems. That would be great, and I support work in that direction.

But they're not claiming this is for engineering purposes, they are dressing it up in this cargo cult science "we need reproducibility!" without understanding what that even means. And as a result, they are heading firmly in the wrong direction: trying to re-run the original code, which is neither necessary nor sufficient for reproducing research.
+Sam Tobin-Hochstadt wow, thanks for finding the attempt to run Haskell code with node. from my own spot-checks, some of the students seem unfamiliar with the basics of building software.  given their other assumptions, I think they should have assumed all software they could download builds and runs. it would probably be closer to the truth.
+Shriram Krishnamurthi I don't think this "confirm[s]" the problem you mention. Like +Andrew Ferguson, I haven't seen a reason to think that these results are better than "assume all software works" or "flip a coin".

I do agree that the results on "does the software exist" are much more interesting.
+Sam Tobin-Hochstadt, they didn't make up these artifacts out of thin air. These are things the authors put out, and presumably the authors didn't put them out for jollies: they put them out with some implication that someone else could use them. I've reviewed enough NSF proposals where the authors promise to make their work available, and Appendix A lists exactly such an instance.

Is "builds" just the right criterion? Maybe not, and certainly if they'd spent more time on fewer artifacts, we'd have more credible and useful data. But they didn't do something absurd as an overall process (which isn't to excuse the individual absurdities like passing Haskell code to JavaScript).

+Andrew Ferguson, could you point out two or three more of these?
+Shriram Krishnamurthi I agree with most of that, and I'd even be happy with "builds" as a criterion. I'm just not sure that P(builds) is any different than P(builds | they say it builds). For example, +Asumu Takikawa found at least 3 instances of OOPSLA'12 papers where no attempt was made to build the software, yet it was marked as "builds". I saw another instance where the core software built, but the examples failed -- marked "builds".

Looking just at the first 5 OOPSLA papers in their corpus, I see:

1 that is described as running successfully online but is marked as not building
1 where the student failed to install the dependency (marked build failed)
1 where the student failed to install a dependency (marked build succeeded)
2 where it seemed that the students couldn't figure out how to use the software due to lack of instruction (both marked build failed)

So from those I don't think I'd alter my probability assessment based on the data provided.
+Sam Tobin-Hochstadt, good. I think it's better to fight data with data, and this is just the right kind. Would be nice if these are accompanied by links for others to quickly visit and double-check the judgment.

It does make it look like a not very skilled set of people were put on what is fundamentally a skilled task.

When I first read their document, I wondered why they had not contacted all authors of papers. Now one starts to think this effort was designed to induce failure as a reaction to the (justifiable) frustration they must have felt after the run-around in the appendix. I hope not; the authors are respectable and surely know better.
here are some more examples of not trying to overcome build problems which are totally common (and which I, as a programmer, would expect a moderately capable user to be able to overcome):

Ubuntu 11.04 listed as a dependency, student used 12.04, which didn't have all the right packages, and went ahead anyway:

not using Bash variables properly:

not setting paths properly:

not able to find GCC (again setting paths, presumably):

more path errors:

trying to build code with a GCC dependency when using Visual Studio:

wasn't able to install boost:

didn't satisfy hardware prereqs, so marked as failed:

anyway, there are a bunch more with Java path problems as well, etc, etc.
+Andrew Ferguson I have some sympathy for the first example in your list: expecting people to install a particular point release of an OS (and not a particularly new one when the paper was published, I believe) is waaaaay more work than a normal "install a package". That paper would probably have been better served by publishing a VM image. [Unless it was benchmarks, in which case VM images are a little problematic. I don't know a good solution in that case.]
+Laurence Tratt, very true. However, that is still not grounds for listing it as "Build fails". A more reasonable thing might have been to put it in an "unreasonable/difficult" bin, along with things that require an Android phone or supercomputer or whatever else. It's fine to try to build it under 12.04, but not fair to dock it for failing to comply. [Also, "internal compiler error" in 12.04 can hardly be held against the authors, can it?]
+Shriram Krishnamurthi Yes, I think "impractical" (which is my polite way of saying "unreasonably difficult") is quite a reasonable classification. [The compiler error is hard to judge. It could be caused by user issues, but I tend to agree with you that it's "innocent until proven guilty".]
I would prefer to say that by default it's neither innocent nor guilty — just as one presumably doesn't assume that all papers submitted to a PC are by default all accepted or rejected — and it's perfectly fine to leave a few things in an inconclusive state. Especially in this case, that state can be resolved without too much trouble (unless detailed performance was indeed at stake, which it probably wasn't, because the paper's title is API compilation for image hardware accelerators).
+Laurence Tratt perhaps on hardware, but not in a virtual machine. with VMWare, for instance, installing a fresh Ubuntu image requires ~15 minutes and ~5 basic decisions (hd size, RAM size, hostname, account info, where to save the image) -- the VM software does the rest. most of the time is spent grabbing the packages.... at any rate, using Docker images makes this process even simpler.
+Andrew Ferguson The last time I tried installing Ubuntu, it took me 10 minutes to find the right link for a specific version, 30 minutes to download it, and 30 minutes to install (reminding me again why I abandoned Linux for OpenBSD in 1999). I agree that none of these steps taxes the brain too much, but compared to a normal "install package X" it's a huge ask. I further agree that when authors provide VM images, it pretty much bypasses this problem.

[Although it was a bit more work, for our OOPSLA paper last year, we provided both a VM image and a traditional "install it on your own machine" package. It felt like duplication in one way, but it was the only way we could think of that solved the "we're presenting benchmarks, so a VM image alone isn't sufficient" problem. We're open to suggestions!]
Just to inject a note of warning, relying purely on VMs is not ideal either. We have had situations on multiple AEC reviews where the reviewers had a VM, but were very frustrated that they couldn't check something in the source. Unsurprisingly, both static and dynamic views are valuable.
There are problems in this paper, but I support the idea that authors should make their code available at minimum. Ideally they would provide something like a README and a build script that pulls in required dependencies. Super excellent would be using technology like vagrant or docker, and providing a Vagrantfile or Dockerfile with a bootstrapping script that installs required dependencies. Their certified images could be published on the docker index (if they use docker) or listed in or (if they use vagrant).

Not everything is going to be amenable to containers or vms, of course. But it would be great to see that type of work where it makes sense.

Researchers could learn a lot from "devops" practices in industry. People in industry want reliable, repeatable and reproducible environments and have refined toolchains to be able to orchestrate these things.



Disclosure: I work for Victoria Stodden and +Jennifer Seiler on an open source reboot of RunMyCode called Research Compendia, and we will have executability soon. I don't currently have a stack built out that uses docker, but I have been experimenting with it as a proof of concept. Previously I worked in industry at Orbitz on backend web services.
And I got so into writing my comment that I forgot to mention PLOS ONE's new sharing guidelines. I was very happy to see the PLOS ONE sharing guidelines for software that came into effect recently.

They request that authors share code with information about parameters called, test data, etc. Yes!
Thank you, +Sheila Miguez! I am a big fan of Victoria Stodden's work, and the ICERM workshop at Brown last winter was very interesting. Great to have a RunMyCode person weigh in here.
Thanks, +Shriram Krishnamurthi, and my group is ResearchCompendia. The three of us felt passionately about creating an OSS platform, and have continued with our efforts apart from the RunMyCode group. They didn't want to open source their project.
Understood, +Sheila Miguez. I'll try to remember that there are two distinct efforts there. If there's a Web page about this OSS issue, please include a pointer here.
+Sheila Miguez I would love to have those PLoS ONE guidelines in CS:

PLOS journals will not consider manuscripts for which the following factors influence ability to share data:
* Authors will not share data because of personal interests, such as patents or potential future publications.
* The conclusions depend solely on the analysis of proprietary data (e.g., data owned by commercial interests, or copyrighted data). If proprietary data are used, the manuscript must include an analysis of public data that validates the conclusions so that others can reproduce the analysis and build on the findings.
+Eric Eide, it's clear I need to start cataloging. Need to go think about the software part of it. Next week is spring break, and I'm afraid I see where it might go...
Would it make sense for some of us to work together on a followup piece, perhaps just a blog entry? If we split up the work it shouldn't take all that long for some people who know what they're doing to look again at all 108 instances of "build fails".
Some random guidelines that come to mind:
- we don't want people working on their own software
- authors should be contacted when problems can't be resolved
- detailed notes should be kept
+Shriram Krishnamurthi How about opening this up? We could do some coordinated blog posts soliciting help. This would open up the possibility of having multiple people try to build each artifact, in which case we could do some statistics. We'd have to ask people to quantify their levels of experience and effort.
Stuff is getting in place. I will post tonight. Please watch this space!
Another interesting example:

It's marked as "theoretical" in the database, with this comment. "VERIFY:COMMENT[string] Since proofs exist for the effectiveness of the system, I'm not sure either if it is necessary to ask for implementaitons."

But there's also: "PI:COMMENT_CC[string] They measure performance. So, we need access to their code. Proof of correctness means nothing, by the way!"

If you look at the paper, it has performance numbers for an implementation. For the overall numbers, though, it still is counted as "theoretical".

I couldn't find the implementation with about 10 minutes of looking.
I have a personal zen engineering koan:
  Sometimes the fastest way to the solution is posting the wrong answer.
Many times in internal and customer-facing mailing lists I've seen questions go unanswered for days, and then the minute an incorrect answer shows up, 2 or 3 corrections immediately get sent.  I suspect it's a case of "I don't want to get bogged down by question X" but when answer Y comes by people feel compelled to correct it.