After questioning the sanity of grafting large amounts of metadata onto HTML in recent posts by +Bruce Lawson and +Jeni Tennison, I decided to have a look at one example: Queen's album Hot Space on MusicBrainz, which publishes RDFa.

The original size was 26481 bytes; after removing all RDFa attributes together with the then-redundant elements it was 19087 bytes. The RDFa is ~28% of the document size, or adds ~39% to a hypothetical original size. Given RDFa's URI compression feature/bug (CURIEs), it's unlikely that the microdata equivalent would be any smaller.
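(The arithmetic, for the record: 26481 − 19087 = 7394 bytes of RDFa and then-redundant markup; 7394 / 26481 ≈ 28% of the full document, and 7394 / 19087 ≈ 39% on top of the stripped version.)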

MusicBrainz has an XML API and publishes both XML and RDF, which is much better than any (lossy) HTML encoding. Given that, is it worth the developer time to try to graft all of this data onto HTML? Is it worth the extra (say) 10-50% of markup for the benefit of the hypothetical triple/item crawler that comes along? (None have been confirmed to exist for MusicBrainz yet.)

Because this kind of complex example has come up in the discussion around microdata, I must ask: Is all of this extra markup only for good measure, or are there actual, concrete benefits? It seems to me that HTML is not the best tool for data interchange on a large scale and that no one who actually cares about data quality would use it for that purpose.

(Dear MusicBrainz developers: Please take no offense to this post; I love you all and thanks for NGS.)

I fully understand where you're coming from, as I think the same :) So why do I spend time focusing on microdata and RDFa, you may wonder...

I'll explain by example. You know +Manu Sporny: if you hover over his name you'll see a nice little info box appear, with an image, a list of mutual friends, and some very basic information. At the moment this information is augmented onto the view by some code deep down in the depths of Google's application(s). However, if just a simple bit of metadata was added to the HTML, a single property and Manu's personal URI, then any open script or browser extension could read that metadata, consult Manu's open data by looking up his personal URI (in whatever format) and then display the same info box. The difference being that it would be open and decentralized, standards based and available on any website.

To me, metadata in HTML isn't about merging semantic or raw data in with HTML for (insert reason here), it's about simply exposing just enough data such that other applications and scripts can hook in and augment our browsing experience with all manner of interesting and useful things. In order to do that, all it takes is a bit of javascript, some form of decentralized identification (like WebID), and some light metadata mixed in to the HTML.
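A minimal sketch of that pattern, for concreteness (the vocabulary URI, profile address, and response format below are all invented for illustration): RDFa's property/resource attributes carry the hook, and a few lines of script do the lookup, CORS permitting.

    <!-- The name as it appears in the page, annotated with a personal URI. -->
    <span property="http://example.org/vocab#account"
          resource="https://manu.example/profile#me">Manu Sporny</span>

    <script>
      // Any open script or extension could hook in like this: find the
      // annotated names, dereference the personal URI, build an info box.
      document.querySelectorAll('[property][resource]').forEach(async (el) => {
        const uri = el.getAttribute('resource');
        const res = await fetch(uri, {headers: {Accept: 'application/ld+json'}});
        const profile = await res.json();
        el.title = profile.name || uri; // stand-in for a real hover card
      });
    </script>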

... and that's why I want to see microdata, rdfa, or a simplification and merge of the two, do well.

aside: it's also why I strongly feel sites like this one need machine-readable profiles, such that any old developer could use a hook-based framework to provide functionality, registering a function to run when a certain property or type is encountered on the page, then doing an action.
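Such a hook-based framework could be tiny; a sketch, with all names invented and microdata's itemtype used as the dispatch key:

    <script>
      // Registry: run callbacks whenever an item of a given type is found.
      const hooks = new Map();
      function registerHook(itemtype, fn) {
        if (!hooks.has(itemtype)) hooks.set(itemtype, []);
        hooks.get(itemtype).push(fn);
      }

      // "Any old developer" registers a function for a type...
      registerHook('https://schema.org/Person', (el) => {
        el.classList.add('person'); // ...and does some action with it.
      });

      // ...and the framework walks the page once, dispatching on itemtype.
      document.querySelectorAll('[itemscope][itemtype]').forEach((el) => {
        (hooks.get(el.getAttribute('itemtype')) || []).forEach((fn) => fn(el));
      });
    </script>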
There really is very little chance that we'd ever use microdata or RDFa or anything like that for the hovercards. Those things are interactive (you can add people to your circles from them) and personalised (they show pictures specific to the person hovering), and we want to load them asynchronously (so that we don't have to load the data up when we serve the page — which itself is sometimes done dynamically from JS and not from the server).
+Ian Hickson Sorry, I must not have explained properly, I didn't mean to say that Google may or may not do this. I meant that I, and others from various corners of the web (like Harry Halpin, +Kingsley Idehen, +Michael Hausenblas, +Alexandre Passant or perhaps +Evan Prodromou), would like to do this, in an open decentralized way, backed by the web of data. That said, the use-cases really aren't limited to the social web; there's a whole world of networks and information that can benefit. Sure it'll be slower to start, but as caching is leveraged more and best practice patterns emerge, things will improve; and thanks to the great improvements across the web, from the WebApps and HTML side of things through to the semantic web side, the tipping point where it's actually possible to do and implement this has been reached.

It's all outlined quite well in Design Issues and elsewhere in various forms :)

TBH, if I didn't have to do the paid work, I'd stop what I was doing right now and set up a demo that would work anywhere, on any site. (Well, CORS-enabled sites, of course; thanks to that there are a few more hoops to jump through and it's not quite as out of the box for most people.)
RDFa and microdata can't even begin to do what you'd need to do something like this in an actually useful and competitive way, IMHO.
RDFa and Microdata don't need to do it all, they just bootstrap a way in to the web of data from the web of documents. Ask +Thomas Steiner (also from google) who is already making extensions which read RDFa and augment with other data from the web. There are quite a few people working on this, and regardless of whether Microdata, RDFa or RDF are used, or something else, the patterns are there, and it will be done.
There are people doing work with microdata and RDFa, sure. There might also be people doing what you describe above using those technologies, as research projects. But I do not believe these technologies would make sense as the basis for a competitive solution to the problem space you describe above.

Of course, I'm happy to be proved wrong.
Great, so let's just make sure that Microdata and/or RDFa (or something else) works as well as possible for the people who use it, and go from there.

Time will tell, that's for sure.
+Aryeh Gregor [[We accept wikitext input and can do arbitrary processing very easily, including splitting out parts of the data into separate streams that will stay in sync. ]]

If this is true, why hasn't Wikipedia started to provide an API for data encoded in infoboxes? For the two examples you mentioned, categories are not templates and hence can't really be counted as wikitext input, and I don't see how Web developers would ever be able to build new products from an API that gives you what templates are used in a page.

I am not super familiar with wikitext, and I don't know the syntax for dumping data into a separate stream, nor do I know how to create a new URL endpoint for a new API corresponding to a new template. (I mean, how do you manipulate the arbitrary processing you mentioned in wikitext?) Supporting such new syntax might make wikitext too complex, but it might be worth trying (and then we can figure out whether there's a security hole or not). Hooking data (either microdata or RDFa) into the HTML part of a wiki template is way simpler and safer, and can be done as long as the HTML sanitizer allows it.
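For what it's worth, the wiki-template route might look like this: a hypothetical infobox template whose HTML carries microdata, with {{{...}}} being MediaWiki's usual parameter syntax. Only the itemscope/itemtype/itemprop attributes are new, and only if the sanitizer lets them through:

    <table class="infobox" itemscope itemtype="https://schema.org/MusicAlbum">
      <tr><th>Album</th><td itemprop="name">{{{name}}}</td></tr>
      <tr><th>Artist</th><td itemprop="byArtist">{{{artist}}}</td></tr>
      <tr><th>Released</th><td itemprop="datePublished">{{{released}}}</td></tr>
    </table>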

Having said that, I prefer out-of-band data as well.
+Nathan Rixham: That doesn't necessarily hold. If someone with use case X uses a technology Q intended for a completely different use case Y, that doesn't mean we should adjust Q to be appropriate for X. It might instead mean we should provide a new technology for X.

In particular, it would be bad to adjust Q to be appropriate for X at the cost of it being less appropriate for Y.
Well, perhaps for comparison, look at all the possible markup for the handicapped - blind/etc. That can quickly add a significant bloat to HTML size, yet what tiny % will ever use it? Is there even anyone using devices that actually DO use markup <x>, or is it there just for good measure - and should someone actually use a device that can read <x>, will they realize it, given that so few sites actually do support markup-for-the-handicapped feature <x>?
Brian, users with disabilities do exist in large numbers and do need to use the web just like the rest of us. Fortunately, one can often serve them "for free" by just writing good, semantic HTML. For example, by using nested <section> and <h1>, screen reader users could get something resembling the overview that other users get from seeing bold headings. Similarly with <nav>, <header>, <footer>, etc... There is some markup that is only there for users with disabilities and one should use that where appropriate. However, it will never "bloat" the markup anywhere nearly as much as what's being discussed here.
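For instance, structure like this costs nothing extra and already gives screen reader users headings and landmarks to navigate by (the content is invented):

    <header>Queen: Hot Space</header>
    <nav><a href="/artist/queen">Back to Queen</a></nav>
    <section>
      <h1>Track list</h1>
      <section>
        <h1>Side one</h1>
        <!-- tracks... -->
      </section>
    </section>
    <footer><a href="/about">About MusicBrainz</a></footer>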

I'm genuinely interested to learn about the actual, concrete benefits of exporting data en masse as RDFa or microdata for something like MusicBrainz. I have nothing but respect for the MusicBrainz developers and assume that there is a good reason, thus the question.
+Ian Hickson what is this completely different use case Y? What is use case X? What is technology Q, how does technology Q need adjusted for use case X, at what stage does catering for X make Q less appropriate for Y, and why is use case Y more important than use case X?

I've simply said let's make sure Microdata/RDFa works as well as possible for the people who use it, regardless of their use case(s), to which you've replied "That doesn't necessarily hold". If we aren't catering for the people who want/need/use the tech and their use cases, then frankly, what the hell /are/ we doing?
Nathan, microdata's design optimizes for simpler/practical use cases at the expense of some more complex/exotic use cases. That being said, I think we should be encouraging people to experiment at this early stage, that's the best way of learning what makes sense and what doesn't. They just shouldn't expect the spec to come to the rescue if they get themselves into a messy situation it was never intended to handle.
On a related note, I've also been somewhat enthusiastic about browsers (or extensions) integrating vEvent with the user's calendar, vCard with the address book, etc. However, it seems more likely that users will simply use web apps for mail/calendar/etc and that things like registerProtocolHandler will solve this problem.
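For example, a web calendar could claim webcal: links for itself; the URL and title below are invented, and the exact arguments have varied between spec versions:

    <script>
      // After this call, clicking a webcal: link can open the web calendar,
      // with %s replaced by the link target.
      navigator.registerProtocolHandler(
          'webcal',
          'https://calendar.example/open?url=%s',
          'Example Web Calendar');
    </script>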
+Philip Jägenstedt I completely agree :) Just because I'm part of the RDFa working group (primarily to handle the RDF Interfaces), doesn't mean I'm the biggest fan of it - personally I'm gunning for something in the middle, something simple and usable which is familiar in style to most developers (OO style classes and properties) and something that can easily be used to handle the common RDF cases (which in many respects, are the same use cases as the OO style generic ones).

All in, I feel RDFa hasn't learned enough from the general developers of the world (a bit of an unfair statement; it has, it's just got them mixed in with complicated stuff and features that can't be removed for BC reasons). Similarly, I feel microdata hasn't learned enough from the RDF(a) world, and the years of practical experience of that community.

The best bits of both specifications could easily be merged together, using only five core attributes (+ one optional datatype for RDF compatibility); it would cater for all use cases, though some of the more complex ones may entail more complex markup.

Should probably stop discussing in random snippets, and actually write the thing up.
Nathan, is there anything except datatypes you'd want to copy from RDFa to microdata?

Regarding datatypes... is there any point in knowing them unless you understand the vocabulary? If you do understand the vocabulary, then certainly the datatype is implicit? The only exception I could come up with is where a single property can take different types and the syntax of those types is overlapping. Do such cases exist in practice? Can't you just tweak the syntax until it becomes unambiguous?
Philip, personally I'd specifically like to see the "vocab" attribute in there such that each property can have a URI defined for it easily, and I'd like to see itemref dropped.
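For reference, this is what RDFa 1.1's vocab gives, using schema.org as the example vocabulary: bare property and type names that still expand to full URIs.

    <!-- "name" is read as https://schema.org/name and the type as
         https://schema.org/MusicAlbum; no URIs in the property attributes. -->
    <div vocab="https://schema.org/" typeof="MusicAlbum">
      <span property="name">Hot Space</span>
    </div>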

As for datatypes, personally I have the same view point as you, and think that datatyping should be punted off to the vocabularies, using simple D-entailment where necessary - it's one of my biggest bugbears with RDF tbh.
What's your beef with itemref? It is the most complicated part of microdata, but have a look at the structure and source of to see why you'd want to use it. Also, I had fun implementing it :)

Can I interpret "vocab" as "better and simpler mapping to RDF"? Is it not enough to publish such mappings separately, or is it critical that they be attached to the markup and machine readable?
Philip, itemref is unneeded and complicates things; having the two elements carry an @itemid or @about with the same identifier would give the same results.

As for vocab, partially better mapping, partially because it keeps URIs out of attributes in the common case, and partially because it would also create full URIs for properties as well as types.

I did just spend two hours writing it all up with examples, but my chrome crashed :( will have to do it again now I guess.
Hehe, a lot of people seem to be losing what they've written in Google+. Try Opera, we never crash. (jk, we crash too.)
The interpretation of itemid is left to the vocabulary, so a vocabulary could do what you suggest and simply merge together all properties on items with the same itemid. Problems:

1. Since itemid is a global identifier you would have to do the same with all items that share the same itemid, which may not be what you want.
2. The DOM API would still see these as separate items, making it a lot less useful. To merge them would require traversing the entire document, which isn't going to happen.

Still, itemref is non-trivial, so if parser implementors and authors make a mess of it then it might be chopped. I hope not, but time will tell.
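For the record, the shape of the feature under discussion (names invented): itemref pulls properties in from outside the item's subtree, e.g. markup shared by several items on the page.

    <p id="artist">By <span itemprop="byArtist">Queen</span></p>

    <div itemscope itemtype="https://schema.org/MusicAlbum" itemref="artist">
      <span itemprop="name">Hot Space</span>
    </div>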
+Ian Hickson Instead of arguing ourselves into oblivion, why not just let the world decide what they can, cannot, or won't do with structured data delivered via HTML-based data islands? It's always dangerous to make blanket statements on behalf of broad communities. Anyway, the day G+ releases an API this will happen, outside Google. That's the way of the Web (an Open Platform).
Note also that the spec once did have an itemfor="" attribute, but that it was replaced with itemref="" precisely because "it was pointed out on the WHATWG list that itemfor makes it impossible to find the properties of an item without scanning the whole document."
+Ian Hickson please digest your comment: "If someone with use case X uses a technology Q intended for a completely different use case Y, that doesn't mean we should adjust Q to be appropriate for X. It might instead mean we should provide a new technology for X."

Then apply this to the WWW and G+. Hopefully, you see the fundamental contradiction!

Is G+ new technology or Web Technology?
Philip, I agree that there are large numbers of handicapped users - that's why I mentioned a small percentage, rather than a small number. Regardless, yes, semantic structuring can benefit, be it MB or the handicapped, as I think Nathan is suggesting above. However, where I'd disagree with Nathan, and agree with you, w/r/t itemref and other such attribute-based detail beyond the mere structure of the HTML, is that it lets that same meaning exist even when you take the data to a different context. It may be a screen reader, a super-MB-aware mp3 player, or an XSLT-produced something-else, but without that level of detail, the usefulness of the semantic meaning that perhaps could be assumed from the basic structure alone is degraded.

So all of that is a long way, I guess, of saying that I think discussion of itemref/no itemref, and the like, is missing the real question: is it worth any of it being there? If you decide it's not, then the level of detail doesn't matter. If you decide it should be there, then imho it should be done fully, rather than done without the detail - after all, if it's not useful data when extracted and placed into a different semantic context, then having a basic level of data isn't really any benefit over not having it at all.
+Philip Jägenstedt re (1) surely that's what they want? To associate more properties with the same item. As for (2), why would the DOM see these as separate items? That's an artificial bit of functionality surely; it could easily merge them into one item.
Brian, please let me know if you learn of any such software actually existing and if they scrape RDFa instead of using the XML API.

There are certain things we (MusicBrainz) could do that don't require marking up everything and that do have a visible effect. For example, adding aggregateRating on our releases and recordings would probably get our ratings to show up directly in Google search results. There may be more little perks like that; I'm not sure what search engines are going to do with this and related types.
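Concretely, that would be something along these lines (the numbers are invented):

    <div itemscope itemtype="https://schema.org/MusicAlbum">
      <span itemprop="name">Hot Space</span>
      <div itemprop="aggregateRating" itemscope
           itemtype="https://schema.org/AggregateRating">
        rated <span itemprop="ratingValue">4</span> out of
        <span itemprop="bestRating">5</span> by
        <span itemprop="ratingCount">123</span> users
      </div>
    </div>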
Nathan, if itemid were the only way to solve the problem that itemref now solves, then people would be forced to use a new URI for every such item. They'd make something up to get the job done, not particularly knowing or caring that it's global and might collide with the URIs others picked at random. If I were scraping several sites in such a set-up, I certainly wouldn't trust the itemid to actually be global, defeating its current point.

As for the DOM API, the problem would be similar to the one I described earlier: the item would no longer be represented by a single element, so you'd need to have some extra level of indirection in the API and maintain an internal list of all elements with itemid to make it perform reasonably well. It's doable, but even more complicated than itemref.
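For context, this is the single-element model the DOM API is built on. document.getItems and .properties come from the microdata DOM API as specified at the time, so treat this as a spec-era sketch rather than something today's browsers run:

    <script>
      // Each item is one element, and its properties hang off it directly.
      for (const album of document.getItems('https://schema.org/MusicAlbum')) {
        const names = album.properties.namedItem('name').getValues();
        console.log(names[0]);
      }
    </script>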
+Aryeh Gregor Yeah, having an API for template invocations including parameters might solve a big portion of real-world use cases that dbpedia is trying to solve without the RDF stack (I know you are speaking as a MediaWiki developer by the way), but if doing that requires a new database table I wouldn't say it's easy at all. Also, I am not sure providing this API is more effective than tweaking the sanitizer in a day and waiting for template authors to hack the HTML part of the templates.

Going back to why it might be easier to publish RDFa, both in general and in the case of MediaWiki. I would say:
1) There are just not enough good page generation tools out there, while lots of Web application frameworks have a template system and changing HTML templates is pretty trivial (see the sketch after this list). I tried once to make an existing project hosting site expose RDF (with the DOAP vocabulary). I decided to go with the RDFa approach after looking at the source code (notice that I prefer out-of-band RDF), which is in RoR, because modifying the HTML templates seemed pretty trivial, and in the end it took only 2 days and less than 50 lines of real Ruby code (for HTML fragments that don't use HTML templates). I think this is the biggest reason. (So I think Manu's point wasn't about syncing being hard but about changing HTML being super easy, and people being lazy.)
2) For MediaWiki, creating an API for template invocations doesn't solve use cases that involve "semantics", or vocabulary alignment or whatever you call it. In other words, there is no data on which parameter corresponds to which microdata/RDF vocabulary term. (I can't tell whether these use cases are concrete enough, because a Web application using this API could have this information hard-coded in it.) That is, when you are talking about "new types of metadata" you are not talking about the types of metadata that correspond to templates and only apply to a small portion of all Wikipedia entries. All the current APIs apply to all pages (structured data) and I don't know how that works for semi-structured data.
3) In general, exposing out-of-band RDF requires a new set of URLs, and the Linked Data requirement requires this new URL set to be well designed and dereferenceable so that RDF crawlers/browsers can follow the links. Giving a person an "RDF URL" has obvious usability problems. (This is pretty much the point +Philip Jägenstedt mentioned as hypothetical.)
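A sketch of what point 1) amounts to in practice: the same HTML template with RDFa attributes dropped in. The {{...}} template syntax is invented; the DOAP vocabulary is the real one mentioned above.

    <div prefix="doap: http://usefulinc.com/ns/doap#"
         about="/projects/{{slug}}" typeof="doap:Project">
      <h1 property="doap:name">{{name}}</h1>
      <p property="doap:description">{{description}}</p>
    </div>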

A related question: does anyone know if Google Search scrapes Wikipedia data from the database dumps or from HTML pages? And does this provide insight into this question?

For social networking standards, I would encourage people to read the "Distributed Social Network"[1] wiki entry and appreciate the fact that we have two working global standards for communication, namely email and the Web. The fact that Google+ doesn't provide RSS/Atom is just frustrating, and no one should use the "nonstandard API works great" argument as an excuse for omitting this very basic feature.
