Richard Smith-Unna
Plant scientist, geek

Richard's posts

All welcome at our hackday today. We'll be hacking on our tools for the system. Come along to learn about web crawling and scraping, how to extract facts from PDFs, and how to reverse-engineer data from images.

#contentmine   #datamining   #hackday   #hackathon   #openscience   #citizenscience
We'll be holding a very informal hackday in Makespace Cambridge (16 Mill Lane, Cambridge, CB2 1RX).
This is open to anyone interested in developing tools and expertise in automatic mining of information from the scholarly literature. Please register so we can anticipate numbers (probably max 20) and alert security. Topics include chemistry, crystallography, metadata and licences, biological species, phylogenetics, and clinical trials - or bring your own wishlist. No formal experience, but bring your own laptop and a sense of adventure. Details will be posted and updated on

It seems to me that the conjunction of (a) perjuring oneself after (b) illegally breaking into Senate computers to (c) cover up an unimaginably brutal torture program is the sort of thing that one might get fired for.*

For those of you that haven't been following the story, a short summary of what happened: 

(1) The CIA, in offering full access to its files on torture, "inadvertently" included a memorandum submitted to Leon Panetta about the Bush-era torture program. That memorandum concluded that torture of detainees was far more widespread, and far less useful, than previous judgments on the subject had noted.

(2) The CIA attempted to cover its tracks by deleting the memorandum.

(3) Staffers for the Senate Select Committee on Intelligence noticed the deletion, and somehow averted it well enough that they managed to produce an incomplete printout of the memorandum. That printout was brought out of the SCIF and transported to Dianne Feinstein's office to prevent the CIA from further interfering with it.

(4) The CIA began an investigation which included spying on Senatorial staffers' emails and monitoring their usage of that computer. That investigation -- and the attempt at prosecution which followed -- was initiated by the same Assistant Counsel of the CIA that ordered that videotapes of the torture of terror suspects be destroyed in 2005. To make it very clear, this was not a disinterested party: this is a person who might face criminal prosecution if the facts which he was attempting to scrub were not scrubbed.

(5) That associate counsel submitted a false statement to the Justice Department in an attempt to start prosecution against Dianne Feinstein's aides. This did not go over particularly well with the Senator.

The CIA and its apologists have attempted to justify this as a routine classified-materials handling investigation. This is nonsense. The security clearances of Senate Select Intelligence staffers (and the need to know) are granted by the Committee itself, upon nonbinding consultation with the DCI. In other words, the committee overseeing the CIA can grant arbitrary clearance to investigators; they do not need to receive permission from the clearing agency. As a result, clearance itself is not an adequate reason for an investigation: if the Committee says its staffers are cleared, then its staffers are cleared, full stop.

But, of course, all of that is irrelevant, because the CIA may not investigate Senate staffers engaged in their official duties. After Gravel v. United States, it's clear that Senatorial aides engaged in legislative business enjoy the same absolute immunity that Senators do. And even if we presume that classified documents had been mishandled, I might suggest that attempting to delete files in order to obstruct a Congressional investigation is mishandling of a far fucking higher order than trying to ensure that the legacy of America's idiotic, brutal, shameful and wasteful torture program was actually brought to light.

* In a just world, using an intelligence agency to intentionally subvert democratic processes is the sort of thing that you would get hanged for. Democratic traditions are fragile and fleeting, and spending those traditions on covering up crimes is the sort of thing that we should take very fucking seriously indeed. (But which we don't.)

Google Scholar metrics of scientific journals

Includes, interestingly enough, also "journals" like, which ranks rather high.


+Mike Bostock, one of two people I can think of who have reached total data enlightenment, takes us all on a beautiful journey through algorithms using stunning visualisations.

Spot on
Elsevier's 50-day tease.

From +Elsevier: "The new Share Link service...allows authors and their network to access their final published articles on ScienceDirect for free for a 50-day period."

Comment. OA is better. It's not limited to 50 days. You can get OA by publishing your work in an OA journal ("gold OA") or by publishing in a non-OA journal, including an Elsevier journal, and depositing a copy of your peer-reviewed manuscript in an OA repository ("green OA"). If you haven't done this before, here's how. 

Here's one of Elsevier's arguments for the new offer: "Researchers who publish in academic journals understand the necessity to expose their papers to the widest audience possible." That's true. But it's an argument for real OA, not a 50-day tease. A more precise formulation makes Elsevier's true statement false: "Researchers who publish in academic journals understand the necessity to expose their papers to the widest audience possible for 50 days, and then keep them locked behind a paywall." 

Here's another Elsevier argument for the new offer: "The new Share Link service makes it easy for authors to share their articles so they can get more exposure and more citations." That's also true. But it's also an argument for real OA, not a 50-day tease. If you really want more exposure and citations, do you want to stop with a 50-day window onto a global audience, or do you want an ongoing global audience?

Elsevier is right that 50 days of free online access is better than no free online access. But watch it try to make that case without making the case for full-bore OA. 

#oa #openaccess #elsevier

Don't buy the IEEE edition of my book.

In January 2013 the IEEE Xplore digital library began hosting digital copies of 400+ books published by MIT Press. The IEEE copies are not OA. They're digital copies sold by the chapter. I'm sure that's convenient for some users, and it may also be economical for some chapters of some books. So I'm not criticizing the program as a whole.

My latest book (Open Access, MIT Press, 2012) is one of those available in an IEEE edition. The problem is that IEEE charges $15 per chapter (PDF-only). By contrast, at Amazon today you can buy the whole paperback edition for $11.72 or the whole Kindle edition for $9.99. Moreover, the whole book is open access in eight different file formats.

There are other problems here too. My chapters use endnotes, which are all collected in the back of the book and sold by IEEE as a separate chapter. Hence, you could pay $15 for a given chapter and not get any of its notes unless you paid another $15 for them. Moreover, the IEEE flat fee of $15 per chapter applies even to very short sections of the book, like my two-page glossary. The PDF from IEEE is packaged in 15 separate files and will cost you $225.

If you want to buy the whole book, to support the cause, buy the paperback or Kindle edition from MIT Press. If you want to read the book in PDF format, read the open-access PDF from MIT Press. It's not only free, it's packaged in one file for continuous reading.

I can't think of a single reason to buy a single chapter of my book from IEEE.

In fairness, IEEE made its edition available five months before my book became OA. At that time, you couldn't save $225 by choosing an OA edition. You could only save $213 by buying the MIT paperback or save $215 by buying the MIT Kindle edition.
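
The arithmetic behind these figures can be checked with a short sketch. It uses only the prices quoted in this post ($15 per IEEE file, 15 files, $11.72 paperback, $9.99 Kindle). The function names are illustrative, and the prices are a snapshot from the time of writing rather than current ones.

```python
# Prices as quoted in the post; they are a snapshot, not live data.
IEEE_PRICE_PER_FILE = 15.00   # flat fee per chapter-sized PDF
IEEE_FILE_COUNT = 15          # number of separate files in the IEEE edition
PAPERBACK_PRICE = 11.72       # whole paperback at Amazon
KINDLE_PRICE = 9.99           # whole Kindle edition at Amazon


def ieee_total(files=IEEE_FILE_COUNT, price=IEEE_PRICE_PER_FILE):
    """Cost of buying every file of the IEEE edition."""
    return files * price


def savings(alternative_price):
    """Money saved by choosing an alternative over the full IEEE edition."""
    return ieee_total() - alternative_price


if __name__ == "__main__":
    print(f"IEEE edition, all files: ${ieee_total():.2f}")
    print(f"Saved with paperback:    ${savings(PAPERBACK_PRICE):.2f}")
    print(f"Saved with Kindle:       ${savings(KINDLE_PRICE):.2f}")
    print(f"Saved with OA edition:   ${savings(0.0):.2f}")
```

Rounded to the nearest dollar, the paperback and Kindle savings come out to the $213 and $215 mentioned above.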

#oa #openaccess +The MIT Press

The good ol' Fermi Paradox. I read about this a few years ago - still a trip to think about!

Interesting project for archiving reproducible computational experiments. Now there's no excuse not to :)
Sharing Research Artifacts

My research group is building a facility, Apt (, with the aim of making it easier to share research artifacts (code, datasets, etc.).

The basic idea is that (1) we give you virtual and/or physical machines (completely yours, with root) (2) you get your work environment set up: install your software, get all the dependencies set up, copy in your datasets, etc. (3) you take a snapshot of your setup, called a "profile" (4) we give you a URL - anyone with this URL can get an instance of this "profile" on our hardware, giving them an exact copy of your environment to work in. You can share this URL with your collaborators, publish it in your paper, etc.

I think a good way to illustrate this is to walk through a working example:


In a paper we have appearing at NSDI next month, we included this URL: . Go ahead, try it out, it works.

With a minimal amount of account setup (Apt just verifies your email address) you can create an instance of this profile; in this case, it's a Linux VM containing a MySQL database pre-loaded with the data we used in our paper, plus all of the code used to analyze that data. Once you've verified your email address, Apt starts booting the VM in its cluster.

Once the VM finishes booting, you can log into it - you can either use SSH or a (surprisingly functional) terminal built right into Apt's website. You get a message on login (we just replaced the MOTD) that explains where to find the scripts, which ones to run to produce the numbers in our paper, and a URL (simple webserver hosted in the VM) where you can go to download the graphs once they are generated.

Of course, simply producing the same graphs is not the real end goal here, so there is also a pointer to a README that describes the database schema and basics of the scripts.


Apt is built on Emulab, so (to a first approximation), anything you can run on Emulab can be shared this way. We are also hoping that the streamlined interface makes it more appealing to research communities outside Emulab's traditional userbase of networking and distributed systems.

Having successfully put together a profile ourselves, I think we're about ready to help a few other early adopters try building their own. Let me know if you're interested.

Now that's a form of taxation I can get on board with. I'm off to... race my stone

The Seriousness of the Kessler Syndrome

The Kessler Syndrome recently starred in the hit movie "Gravity," but what is it? Well, it all starts with our trash. The problem with space debris isn't simply that all this trash is floating around, or that it will have an environmental impact (since anything that re-enters the atmosphere typically disintegrates). The real problem with space debris is the speed of that debris, and the possibility that said debris will impact other (more valuable) objects in orbit. And we're not talking about a fender-bender here; we're talking about two rather fragile thousand-pound objects colliding at speeds of tens of thousands of miles an hour.

In the event that two objects collide, the collision creates a massive debris cloud, which is also traveling at thousands of miles an hour. Anything from a stray solar panel to a screw could obliterate another spacecraft (imagine a screw traveling 20,000 miles an hour). That debris would then hit other objects in orbit, which creates more debris, which hits more objects... you get the picture.
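
To put a rough number on that screw, here is a minimal sketch of the kinetic energy involved. The 10-gram mass is an assumption chosen for illustration, and the 20,000 miles an hour figure from above is treated as the relative impact speed; real closing speeds depend on the two orbits.

```python
# Back-of-the-envelope kinetic energy of a small piece of orbital debris.
MPH_TO_MPS = 0.44704          # metres per second in one mile per hour (exact)
TNT_JOULES_PER_KG = 4.184e6   # conventional energy content of 1 kg of TNT

SCREW_MASS_KG = 0.010         # ASSUMPTION: a 10-gram screw, for illustration only
IMPACT_SPEED_MPH = 20_000     # figure quoted in the text, taken as relative speed


def kinetic_energy_joules(mass_kg, speed_mph):
    """Classical kinetic energy, E = 1/2 * m * v^2, with v converted to m/s."""
    speed_mps = speed_mph * MPH_TO_MPS
    return 0.5 * mass_kg * speed_mps ** 2


def tnt_equivalent_kg(energy_joules):
    """Mass of TNT that releases the same amount of energy."""
    return energy_joules / TNT_JOULES_PER_KG


if __name__ == "__main__":
    energy = kinetic_energy_joules(SCREW_MASS_KG, IMPACT_SPEED_MPH)
    print(f"Kinetic energy: {energy / 1000:.0f} kJ")
    print(f"TNT equivalent: {tnt_equivalent_kg(energy) * 1000:.0f} g")
```

Under these assumptions the screw carries roughly 400 kJ, on the order of a tenth of a kilogram of TNT, which is why even tiny fragments are a threat.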

To read the full article, see:
