After questioning the sanity of grafting large amounts of metadata onto HTML in recent posts by +Bruce Lawson and +Jeni Tennison, I decided to have a look at one example: Queen's album Hot Space on MusicBrainz, which publishes RDFa.

The original size was 26481 bytes; after removing (http://pastebin.com/LPEv3KHk) all RDFa attributes together with the then-redundant elements, it was 19087 bytes, a difference of 7394 bytes. The RDFa is thus ~28% of the document size, or adds ~39% on top of the stripped size. Given RDFa's URI compression feature/bug (CURIEs), it's unlikely that the microdata equivalent would be any smaller.
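
To make the overhead concrete, here is a hypothetical sketch of the kind of per-element markup being stripped; the subject URI, vocabulary prefixes, and attribute values are illustrative, not copied from the actual MusicBrainz source:

  <!-- With RDFa grafted on: -->
  <td about="/recording/123#_" typeof="mo:Track"
      property="dct:title" datatype="xsd:string">Under Pressure</td>

  <!-- Stripped equivalent: -->
  <td>Under Pressure</td>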

MusicBrainz has an XML API and legislation.gov.uk publishes both XML and RDF, which is much better than any (lossy) HTML encoding. Given that, is it worth the developer time to try to graft all of this data onto HTML? Is it worth the extra (say) 10-50% of markup for the benefit of the hypothetical triple/item crawler that comes along? (None have been confirmed to exist for MusicBrainz yet.)

Because this kind of complex example has come up in the discussion around microdata, I must ask: Is all of this extra markup only for good measure, or are there actual, concrete benefits? It seems to me that HTML is not the best tool for data interchange on a large scale and that no one who actually cares about data quality would use it for that purpose.

(Dear MusicBrainz developers: Please take no offense at this post; I love you all, and thanks for NGS.)

(Original speculation in http://www.brucelawson.co.uk/2011/microdata-help-please/comment-page-1/#comment-779035 and http://www.jenitennison.com/blog/node/160#comment-11178)
20 comments
 
There really is very little chance that we'd ever use microdata or RDFa or anything like that for the hovercards. Those things are interactive (you can add people to your circles from them) and personalised (they show pictures specific to the person hovering), and we want to load them asynchronously (so that we don't have to load the data up when we serve the page — which itself is sometimes done dynamically from JS and not from the server).
 
RDFa and microdata can't even begin to do what you'd need to do something like this in an actually useful and competitive way, IMHO.
 
There are people doing work with microdata and RDFa, sure. There might also be people doing what you describe above using those technologies, as research projects. But I do not believe these technologies would make sense as the basis for a competitive solution to the problem space you describe above.

Of course, I'm happy to be proved wrong.
 
+Aryeh Gregor [[We accept wikitext input and can do arbitrary processing very easily, including splitting out parts of the data into separate streams that will stay in sync. ]]

If this is true, why hasn't Wikipedia started to provide an API for the data encoded in infoboxes? Of the two examples you mentioned, categories are not templates and hence can't really be counted as wikitext input, and I don't see how Web developers would ever be able to build a new product from an API that merely tells you which templates are used on a page.

I am not super familiar with wikitext, and I don't know the syntax for dumping data into a separate stream, nor do I know how to create a new URL endpoint for a new API corresponding to a new template. (I mean, how do you control the arbitrary processing you mentioned from within wikitext?) Supporting such new syntax might make wikitext too complex, but it might be worth trying (and then we can figure out whether there's a security hole or not). Hooking data (either microdata or RDFa) into the HTML part of a wiki template is way simpler and safer, and can be done as long as the HTML sanitizer allows.
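
Purely as a hypothetical sketch (the template parameters, the type, and the assumption that the sanitizer passes these attributes through are all mine), an infobox template's HTML could carry microdata directly:

  <table class="infobox" itemscope
         itemtype="http://schema.org/MusicGroup">
    <tr><th>Name</th><td itemprop="name">{{{name}}}</td></tr>
    <tr><th>Origin</th>
        <td itemprop="foundingLocation">{{{origin}}}</td></tr>
  </table>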

Having said that, I prefer out-of-band data as well.
 
+Nathan Rixham: That doesn't necessarily hold. If someone with use case X uses a technology Q intended for a completely different use case Y, that doesn't mean we should adjust Q to be appropriate for X. It might instead mean we should provide a new technology for X.

In particular, it would be bad to adjust Q to be appropriate for X at the cost of it being less appropriate for Y.
 
Well, perhaps for comparison, look at all the possible markup for the handicapped - blind users etc. That can quickly add significant bloat to HTML size, yet what tiny percentage of users will ever use it? Is there even anyone using devices that actually DO use markup <x>, or is it there just for good measure? And should someone actually use a device that can read <x>, will they realize it, given that so few sites actually support markup-for-the-handicapped feature <x>?
 
Brian, users with disabilities do exist in large numbers and need to use the web just like the rest of us. Fortunately, one can often serve them "for free" just by writing good, semantic HTML. For example, with nested <section> and <h1> elements, screen reader users can get something resembling the overview that sighted users get from seeing bold headings. Similarly with <nav>, <header>, <footer>, etc. There is some markup that is only there for users with disabilities, and one should use it where appropriate. However, it will never "bloat" the markup anywhere near as much as what's being discussed here.
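
A minimal sketch of what "for free" means here (the content is made up, but these are all plain HTML5 elements):

  <nav><!-- announced as navigation by screen readers --></nav>
  <section>
    <h1>Hot Space</h1> <!-- contributes to the document outline -->
    <section>
      <h1>Track listing</h1>
      <!-- ... -->
    </section>
  </section>
  <footer><!-- identified as the page footer --></footer>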

I'm genuinely interested to learn about the actual, concrete benefits of exporting data en masse as RDFa or microdata for something like MusicBrainz or legislation.gov.uk. I have nothing but respect for the MusicBrainz developers and assume that there is a good reason, thus the question.
 
Nathan, microdata's design optimizes for simpler/practical use cases at the expense of some more complex/exotic use cases. That being said, I think we should be encouraging people to experiment at this early stage, that's the best way of learning what makes sense and what doesn't. They just shouldn't expect the spec to come to the rescue if they get themselves into a messy situation it was never intended to handle.
 
On a related note, I've also been somewhat enthusiastic about browsers (or extensions) integrating vEvent with the user's calendar, vCard with the address book, etc. However, it seems more likely that users will simply use web apps for mail/calendar/etc and that things like registerProtocolHandler will solve this problem.
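
For example, registerProtocolHandler lets a web calendar claim calendar-ish links; a sketch with a made-up handler URL (the scheme has to be on the browser's whitelist, and webcal is):

  <script>
    // Hypothetical: a web calendar app registers itself for webcal:
    // links, so clicking an event subscription opens in the app.
    navigator.registerProtocolHandler(
        "webcal", "https://calendar.example/add?url=%s",
        "Example Calendar");
  </script>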
 
Nathan, is there anything except datatypes you'd want to copy from RDFa to microdata?

Regarding datatypes... is there any point in knowing them unless you understand the vocabulary? If you do understand the vocabulary, then certainly the datatype is implicit? The only exception I could come up with is where a single property can take different types and the syntax of those types is overlapping. Do such cases exist in practice? Can't you just tweak the syntax until it becomes unambiguous?
 
What's your beef with itemref? It is the most complicated part of microdata, but have a look at the structure and source of http://www.2gc.co.uk/a2gc-people to see why you'd want to use it. Also, I had fun implementing it :)
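
For anyone following along, a minimal sketch of the pattern (names made up): a property that several items share can be written once and pulled into each item by id:

  <!-- Shared property, written once: -->
  <p id="office" itemprop="worksFor">2GC</p>

  <!-- Each person item pulls it in via itemref: -->
  <section itemscope itemtype="http://schema.org/Person"
           itemref="office">
    <h1 itemprop="name">Alice Example</h1>
  </section>
  <section itemscope itemtype="http://schema.org/Person"
           itemref="office">
    <h1 itemprop="name">Bob Example</h1>
  </section>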

Can I interpret "vocab" as "better and simpler mapping to RDF"? Is it not enough to publish mappings like http://schema.rdfs.org/ or is it critical that it be attached to the markup and machine readable?
 
Hehe, a lot of people seem to be losing what they've written in Google+. Try Opera, we never crash. (jk, we crash too.)
 
The interpretation of itemid is left to the vocabulary, so a vocabulary could do what you suggest and simply merge together all properties on items with the same itemid (sketched below). Problems:

1. Since itemid is a global identifier you would have to do the same with all items that share the same itemid, which may not be what you want.
2. The DOM API would still see these as separate items, making it a lot less useful. To merge them would require traversing the entire document, which isn't going to happen.
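
To illustrate both points, here is a sketch with a made-up type and itemid. A vocabulary could define that these two fragments are one item, but the microdata DOM API as currently specified would still report two:

  <div itemscope itemtype="http://example.com/Album"
       itemid="urn:example:hot-space">
    <span itemprop="name">Hot Space</span>
  </div>
  <!-- ...much later in the document... -->
  <div itemscope itemtype="http://example.com/Album"
       itemid="urn:example:hot-space">
    <span itemprop="released">1982</span>
  </div>
  <script>
    // getItems() returns two separate items here; merging them
    // would be left to whatever consumes the data.
    var n = document.getItems("http://example.com/Album").length; // 2
  </script>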

Still, itemref is non-trivial, so if parser implementors and authors make a mess of it then it might be chopped. I hope not, but time will tell.
 
+Ian Hickson, instead of arguing ourselves into oblivion, why not just let the world decide what they can, cannot, or won't do with structured data delivered via HTML-based data islands? It's always dangerous to make blanket statements on behalf of broad communities. Anyway, the day G+ releases an API this will happen, outside Google. That's the way of the Web (an Open Platform).
 
Note also that the spec once did have an itemfor="" attribute, but that it was replaced with itemref="" after http://blog.whatwg.org/usability-testing-html5, precisely because "it was pointed out on the WHATWG list that itemfor makes it impossible to find the properties of an item without scanning the whole document."
 
+Ian Hickson please digest your comment: "If someone with use case X uses a technology Q intended for a completely different use case Y, that doesn't mean we should adjust Q to be appropriate for X. It might instead mean we should provide a new technology for X."

Then apply this to the WWW and G+. Hopefully, you see the fundamental contradiction!

Is G+ new technology or Web Technology?
 
Philip, I agree that there are large numbers of handicapped users - that's why I mentioned a small percentage, rather than a small number. Regardless, yes, semantic structuring can benefit, be it MB or the handicapped, as I think Nathan is suggesting above. However, where I'd disagree with Nathan, and agree with you, is that itemref and other such attribute-based detail beyond the mere structure of the HTML lets that same meaning survive even when you take the data to a different context. It may be a screen reader, a super-MB-aware mp3 player, or an XSLT-produced something-else, but without that level of detail, the usefulness of the semantic meaning that could perhaps be assumed from the basic structure alone is degraded.

So all of that is a long way, I guess, of saying that I think the discussion of itemref/no itemref, and the like, is missing the real question: is it worth any of it being there? If you decide it's not, then the level of detail doesn't matter. If you decide it should be there, then imho it should be done fully, rather than without the detail - after all, if the data isn't useful when extracted and placed into a different semantic context, then having a basic level of data isn't really any benefit over not having it at all.
 
Brian, please let me know if you learn of any such software actually existing, and whether it scrapes RDFa instead of using the XML API.

There are certain things we (MusicBrainz) could do that don't require marking up everything and that do have a visible effect. For example, adding http://schema.org/CreativeWork and aggregateRating on our releases and recordings would probably get our ratings to show up directly in Google search results. There may be more little perks like that; I'm not sure what search engines are going to do with http://schema.org/MusicRecording and related types.
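
A sketch of what that might look like (the rating numbers are invented; property names are from schema.org):

  <div itemscope itemtype="http://schema.org/MusicRecording">
    <span itemprop="name">Under Pressure</span>
    <div itemprop="aggregateRating" itemscope
         itemtype="http://schema.org/AggregateRating">
      <span itemprop="ratingValue">4.5</span> out of
      <span itemprop="bestRating">5</span>
      (<span itemprop="ratingCount">12</span> ratings)
    </div>
  </div>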
 
Nathan, if itemid were the only way to solve the problem that itemref now solves, then people would be forced to use a new URI for every such item. They'd make something up to get the job done, not particularly knowing or caring that it's global and might collide with the URIs others picked at random. If I were scraping several sites in such a set-up, I certainly wouldn't trust the itemid to actually be global, defeating its current point.

As for the DOM API, the problem would be similar to what I described in http://lists.w3.org/Archives/Public/public-rdfa-wg/2011Jul/0001.html. The item would no longer be represented by a single element, so you'd need some extra level of indirection in the API and an internal list of all elements with itemid to make it perform reasonably well. It's doable, but even more complicated than itemref.
 
+Aryeh Gregor Yeah, having an API for template invocations, including parameters, might solve a big portion of the real-world use cases that DBpedia is trying to solve without the RDF stack (I know you are speaking as a MediaWiki developer, by the way), but if doing that requires a new database table, I wouldn't say it's easy at all. Also, I am not sure providing this API is more effective than tweaking the sanitizer in a day and waiting for template authors to hack the HTML part of the templates.

Going back to why it might be easier to publish RDFa, both in general and in the case of MediaWiki, I would say:
1) There just aren't enough good page generation tools out there, while lots of Web application frameworks have a template system, and changing HTML templates is pretty trivial. I once tried to make an existing project hosting site expose RDF (with the DOAP vocabulary). I decided to go with the RDFa approach after looking at the source code (note that I prefer out-of-band RDF), which is in RoR, because modifying the HTML templates seemed pretty trivial, and in the end it took only 2 days and less than 50 lines of real Ruby code (for HTML fragments that don't use HTML templates). I think this is the biggest reason. (So I think Manu's point wasn't that syncing is hard but that changing HTML is super easy, and people are lazy.)
2) For MediaWiki, creating an API for template invocations doesn't solve use cases that involve "semantics", or vocabulary alignment, or whatever you call it. In other words, there is no data on which parameter corresponds to which microdata/RDF vocabulary term. (I can't tell whether these use cases are concrete enough, because a Web application using this API could have this information hard-coded in it.) That is, when you are talking about "new types of metadata" you are not talking about the types of metadata that correspond to templates and apply to only a small portion of all Wikipedia entries. All the current APIs apply to all pages (structured data), and I don't know how that works for semi-structured data.
3) In general, exposing out-of-band RDF requires a new set of URLs, and the Linked Data requirement means this new URL set has to be well designed and dereferenceable so that RDF crawlers/browsers can follow the links; one common linking pattern is sketched below. Giving a person an "RDF URL" has an obvious usability problem. (This is pretty much the point +Philip Jägenstedt mentioned as hypothetical.)
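
The usual autodiscovery pattern for out-of-band RDF is a <link> in the HTML head pointing at the alternate representation; a sketch with a made-up URL:

  <link rel="alternate" type="application/rdf+xml"
        href="http://musicbrainz.example/release/hot-space.rdf">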

A related question: does anyone know whether Google Search scrapes Wikipedia data from the database dumps or from the HTML pages? And does that provide any insight into this question?


For social networking standards, I would encourage people to read the "Distributed Social Network"[1] wiki entry and appreciate the fact that we have two working global standards for communication, namely email and the Web. The fact that Google+ doesn't provide RSS/Atom is just frustrating and no one should use this "nonstandard API works great" argument as an excuse for this very basic feature.

[1] http://en.wikipedia.org/wiki/Distributed_social_network