Hugh Cayless
Digital Classicist / Scholar-Programmer

Hugh's posts

New Year's resolutions:

* start my own business
* run a marathon
* think of other ways to terrify myself :-)


If we were going to reimagine the structure of the TEI Consortium, what might it look like? I'm going to jot down a few ideas/complaints/musings. Some of this is in reaction to Martin Mueller's and Stuart Yeates' thoughts.

1. I think the current setup, where institutions with paid memberships are the only enfranchised members of the community, is self-limiting. It means that the TEI leadership is closed to outsiders and that therefore new ideas and new blood don't get into the mix as much as they should. I don't see why the Council at least shouldn't have representatives elected by the community as a whole.

2. Declining institutional subscriptions are a related problem. When the leadership is this closed off, there aren't sufficient incentives to attract new subscribers (i.e. there's no good way to hook new paying members in). There are also no real incentives to attract individual members.

3. I think the separation of the Board and Council is a good idea and should stand. Maintenance of the standard is not the same sort of task as maintenance of the organization. And I think having a small number of experts be the gatekeepers for the standard works well.

4. Further, I think the fact that Council in-person meetings are subsidized is a very good thing, as it allows people whose organizations wouldn't support their travel to attend. But note that "outsiders" don't presently have a shot at getting onto the Council because of the voting structure.

5. I think some method should be found to recognize and reward major technical contributors to the TEI. In open source projects, they are made "committers"—which is rather like being on the Council. My worry is that people like this don't really stand much chance now of being elected, and might not even if elections are opened up.

6. The Council is pretty open, but I've more than once had the experience of making feature requests and then never hearing about their fate until I happen to think of asking a Council member. The system of handling and reporting back on feature requests could be improved.

7. The level of technical sophistication isn't always what I'd want it to be: there are, for example, chunks of the Guidelines that don't work (I'm looking at you, TEI XPointer Schemes) and have yet to be either revised or implemented.

Post has shared content
The charges against Aaron Swartz hit home. I know little about it other than what was published yesterday. But every aspect of the case touches my professional career directly, if only conceptually.

JSTOR was developed at UMich when I was a library school student there; I was an early user, and somewhat knew people working on it and working at JSTOR later. Ten years ago I was a librarian at MIT Libraries; I somewhat know the people there who were probably involved in dealing with this when it happened. I somewhat know Aaron; I first met him at an O’Reilly conference in 2001, then again in 2007, and enjoyed his talk on Open Library at code4lib 2008. I currently work at a large library where among other things I’m somewhat involved in a project that gives away everything about it: all the data, all the metadata, all the software necessary to run exactly the same service anywhere else in the world. Indeed when Aaron and I crossed paths at Science Foo Camp in 2007 we talked about just that project; he noted the conversation on his blog. I’ve served on a federal grand jury; the burden of proof is low for indictment (probable cause), and most cases with indictments never make it to trial, where the burden of proof is high for a guilty verdict (beyond a reasonable doubt).

When Aaron and I had that conversation at SciFoo ’07, I was excited to tell him we had 10 TB of data in NDNP/Chronicling America to give away. As of the latest update to the site -- today -- we now manage a half-petabyte of live data (including backup copies) for NDNP alone, roughly 40 TB of which is publicly available and free, with many modes of simple, no-registration-required web access available via a lovely web API. My main contributions to NDNP were writing the first version of that API document and being project manager for the backend system that manages the data inventory and workflow, but even having had such a minor hand in the NDNP efforts, which give away so much, my involvement in the project has made me prouder to be a librarian than anything else I’ve done in the past fifteen years.

In 2006 I wrote a piece about “greedy librarians” that somewhat anticipated this trying-to-download-it-all situation. In it I argued that when, say, a new student arrives on a university campus, they should be able to download the whole library and take it with them, get regular updates, and share their navigatings through it with their friends over their college years. I stand by that argument. Having worked on projects like NDNP since then, I’m all the more convinced that it’s the right way to go whenever possible. We should want to enable people who want to work with entire collections, to smooth their paths, to provide update mechanisms for them and their copies of data we make available.

For NDNP and similar large data projects where I work, occasionally somebody fires up bots that try to pull down all the data using brute force methods. Sometimes this exceeds our capacity to serve the data, so when possible, we attempt to contact the remote person and talk them down into other methods. Some of our data may be purchased in bulk, more efficiently; some of it may be downloaded in bulk, more patiently. In the best case, on a project like NDNP, the person on the other side of the bots reacts well to us saying “it’s okay to take this data, but you’re preventing everyone else from using the site; please slow down your bots” and they slow down their bots. Then we watch the server’s load drop back down, and then get back to our wonderful daily drudge work of moving more data through the pipeline and making it available, happy to know that somebody wants the data. Like I said, this is a best case example, and it’s a real example.

I don’t know what Aaron was trying to accomplish (other than getting a local copy of lots of JSTOR); I don’t know what the people at JSTOR or MIT Libraries think about it, and I don’t know what the US Attorney’s office is thinking. I’m not signing any petitions, though. At every step of the process, it’s easy for me to imagine what might be on the minds of the people involved: I could see a grand jury member seeing probable cause in the charges as written; I could see the staff at JSTOR or MIT Libraries with mixed emotions about seeing services to their communities limited by somebody who wants to make extensive use of the service; I can visualize Aaron holding a bike helmet over his face as he accesses a network switch. It isn’t my place to judge any of these actions or reactions. I’m not qualified.

Instead, I’m going to stay focused on making data available, at scale, for free, along with software that makes it easy to access and use. I’m a librarian. This is what I do. I hope to do it for thirty more years.

Post has attachment
A glimpse at what Humanities Higher Ed should do, rather than (or at least in addition to) preparing students for a career most of them will never have.