The charges against Aaron Swartz hit home. I know little about it other than what was published yesterday. But every aspect of the case touches my professional career directly, if only conceptually.
JSTOR was developed at UMich when I was a library school student there; I was an early user, and somewhat knew people working on it and working at JSTOR later. Ten years ago I was a librarian at MIT Libraries; I somewhat know the people there who were probably involved in dealing with this when it happened. I somewhat know Aaron; I first met him at an O’Reilly conference in 2001, then again in 2007, and enjoyed his talk on Open Library at code4lib 2008 (see http://code4lib.org/conference/2008/swartz). I currently work at a large library where among other things I’m somewhat involved in a project that gives away everything about it: all the data, all the metadata, all the software necessary to run exactly the same service anywhere else in the world. Indeed when Aaron and I crossed paths at Science Foo Camp in 2007 we talked about just that project; he noted the conversation on his blog (see http://www.aaronsw.com/weblog/scifoo07). I’ve served on a federal grand jury; the burden of proof is low for indictment (probable cause), and most cases with indictments never make it to trial, where the burden of proof is high for a guilty verdict (beyond a reasonable doubt).
When Aaron and I had that conversation at SciFoo ’07, I was excited to tell him we had 10 TB of data in NDNP/Chronicling America to give away. As of the latest update to the site -- today -- we now manage a half-petabyte of live data (including backup copies) for NDNP alone, roughly 40 TB of which is publicly available and free, with many modes of simple, no-registration-required web access available via a lovely web API (see http://chroniclingamerica.loc.gov/about/api/). My main contributions to NDNP were writing the first version of that API document and being project manager for the backend system that manages the data inventory and workflow, but even having had such a minor hand in the NDNP efforts, which give away so much, my involvement in the project has made me prouder to be a librarian than anything else I’ve done in the past fifteen years.
In 2006 I wrote a piece about “greedy librarians” that somewhat anticipated this trying-to-download-it-all situation (see http://onebiglibrary.net/story/greedy-librarian-moonshot). In it I argued that when, say, a new student arrives on a university campus, they should be able to download the whole library and take it with them, get regular updates, and share their navigatings through it with their friends over their college years. I stand by that argument. Having worked on projects like NDNP since then, I’m all the more convinced that it’s the right way to go whenever possible. We should want to enable people who want to work with entire collections, to smooth their paths, to provide update mechanisms for them and their copies of data we make available.
For NDNP and similar large data projects where I work, occasionally somebody fires up bots that try to pull down all the data using brute force methods. Sometimes this exceeds our capacity to serve the data, so when possible, we attempt to contact the remote person and talk them down into other methods. Some of our data may be purchased in bulk, more efficiently; some of it may be downloaded in bulk, more patiently. In the best case, on a project like NDNP, the person on the other side of the bots reacts well to us saying “it’s okay to take this data, but you’re preventing everyone else from using the site; please slow down your bots” and they slow down their bots. Then we watch the server’s load drop back down, and then get back to our wonderful daily drudge work of moving more data through the pipeline and making it available, happy to know that somebody wants the data. Like I said, this is a best case example, and it’s a real example.
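For what it's worth, the "more patiently" kind of bot doesn't take much code. Here is a minimal sketch in Python of what I mean by slowing down; it is an illustration only, not anything NDNP publishes or requires. The function name `polite_fetch_all`, the injected `fetch` callable (assumed to return a status code and a body), and the delay values are all hypothetical choices of mine, and treating HTTP 429 and 503 as "please slow down" signals is an assumption about how a busy server might respond.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, bytes]],
    delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, body) pairs one request at a time, pausing between them.

    `fetch` is a hypothetical callable returning (status_code, body);
    `delay` and `max_delay` are illustrative values in seconds.
    """
    for url in urls:
        wait = delay
        for _ in range(max_attempts):
            status, body = fetch(url)
            if status == 200:
                yield url, body
                break
            if status in (429, 503):
                # Assumed "slow down" signals: double the wait, up to a cap,
                # then try the same URL again.
                wait = min(wait * 2, max_delay)
                sleep(wait)
                continue
            # Any other status: skip this URL rather than hammer the server.
            break
        # Always pause before moving on to the next URL.
        sleep(delay)
```

The point of the design is that requests go out one at a time with a fixed pause between them, and the pause grows whenever the server looks overloaded. In practice `fetch` could be a thin wrapper around whatever HTTP client you already use.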
I don’t know what Aaron was trying to accomplish (other than getting a local copy of lots of JSTOR); I don’t know what the people at JSTOR or MIT Libraries think about it, and I don’t know what the US Attorney’s office is thinking. I’m not signing any petitions, though. At every step of the process, it’s easy for me to imagine what might be on the minds of the people involved: I could see a grand jury member seeing probable cause in the charges as written; I could see the staff at JSTOR or MIT Libraries with mixed emotions about seeing services to their communities limited by somebody who wants to make extensive use of the service; I can visualize Aaron holding a bike helmet over his face as he accesses a network switch. It isn’t my place to judge any of these actions or reactions. I’m not qualified.
Instead, I’m going to stay focused on making data available, at scale, for free, along with software that makes it easy to access and use. I’m a librarian. This is what I do. I hope to do it for thirty more years.