Referencing things in the real world — is structured content missing a trick?

[Edit: I've copied this post over to my new blog, where it's more readable and comments are consolidated: ]

These thoughts have been floating around my head for some weeks now. It's time to put them out to face the cold light of day. Not sure whether they're old hat, architecture astronautics, or potentially useful. Please comment and let me know!

1. One of the most promising and least used aspects of structured content is the ability to associate inline elements unambiguously with the real-world entities they describe.
2. When marking up elements for this purpose, it's better if the markup format is not tied to specific applications or actions that should be taken by the system.
3. The best place to unambiguously reference an entity is in the source content.

On point 1, here are some of the many possible uses for this kind of semantic tagging:

For names which are trademarked in some countries but not others, you can get the appropriate trademark symbol (or none) showing up for each language or country output, on the first mention, on all mentions or according to any other requirement.

For strings such as UI text or programming language keywords, you can keep them in sync with the canonical source.

You can ensure coverage of all relevant product features, or compare feature mentions with customer searches related to those features.

You can enable more effective search for relevant content (including allowances for synonyms and misspellings). This works for content creators and consumers alike.

For terms where links to definitions or other relevant content would be useful, you can enable such links. For one vision of this, see +Mark Baker's article and the discussion in the comments here: (though note that my point 3 diverges from this on the best place to disambiguate entity mentions).

You could perhaps base conditional filtering on the features mentioned in a block element. If a feature isn't in the current product configuration, the whole block could be filtered out.

You could expose this kind of metadata in web content, allowing external applications to use it (and preparing for the increasing capabilities of search engines).
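
To make the first of these uses a bit more concrete, here's a rough sketch of the trademark case. Everything in it is an assumption for illustration: the `[[entity-id|visible text]]` markup, the `widgetpro` entity and the `TRADEMARKS` table are all made up, and a real system would use its own markup and registry.

```python
import re

# Hypothetical entity registry: which symbol (if any) each locale requires.
TRADEMARKS = {
    "widgetpro": {"en-US": "\u2122", "de-DE": ""},
}

# Hypothetical inline markup: [[entity-id|visible text]]
MENTION = re.compile(r"\[\[([a-z0-9-]+)\|([^\]]+)\]\]")


def render(source: str, locale: str) -> str:
    """Replace each tagged mention with its visible text, adding the
    locale's trademark symbol on the first mention of each entity only."""
    seen = set()

    def replace(match):
        entity_id, text = match.group(1), match.group(2)
        if entity_id in seen:
            return text
        seen.add(entity_id)
        symbol = TRADEMARKS.get(entity_id, {}).get(locale, "")
        return text + symbol

    return MENTION.sub(replace, source)


source = "Install [[widgetpro|WidgetPro]]. Then start [[widgetpro|WidgetPro]]."
print(render(source, "en-US"))
print(render(source, "de-DE"))
```

The source only says which entity each mention refers to. Whether a symbol appears, and on which mention, is decided at output time.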

It surprises me a little that inline markup for these purposes isn't more common. Inline markup is very common for formatting purposes — bold and italics being the classic examples! This is obviously rather limited. Inline markup's also commonly used to indicate types of information or types of entity (much of the "out of the box" DITA vocabulary is like this). But not every structured content implementation attempts to refer to specific real-world entities at the inline term or "mention" level. 

Moving on to point 2, there are of course plenty of examples where inline markup is used for these purposes, especially in medium or large organizations. Conditions represent features; custom implementations link keywords to their canonical sources; and tricks with attribute values get trademarks displayed correctly. But the markup is often different depending on the specific application for which it's used, limiting later extension. For example, there may be a conditional attribute to mark information about a feature that is present in some product configurations and absent in others. Yet if the feature name is trademarked, there may be a different attribute to ensure that only the first mention of the name gets the TM symbol. And if the feature name is used as a UI string, the markup may be different again. Once the content is being localized, still different markup may be used for terminology management. This seems to be an unfortunate intermingling of concerns. In an ideal world, it seems that the source content should describe what it's about, without locking this description to a particular action to be taken by the system. That leaves the system free to take any action (or multiple actions) based on the rules for the particular context.
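
As a minimal sketch of that separation (the `Block` type, the feature IDs and both functions are hypothetical, not taken from any real system): the content records only which entities each block mentions, and two unrelated actions are then driven from that same description.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    entities: tuple  # IDs of the entities mentioned in this block


# Hypothetical content: the source says only what each block is about.
BLOCKS = [
    Block("Use SmartSync to mirror folders.", ("feature:smartsync",)),
    Block("Use Export to save a copy.", ("feature:export",)),
]


def filter_for_config(blocks, enabled):
    """Action 1: drop blocks that mention a feature absent from this build."""
    return [b for b in blocks if all(e in enabled for e in b.entities)]


def coverage_gaps(blocks, all_features):
    """Action 2: report features that no block mentions yet."""
    mentioned = {e for b in blocks for e in b.entities}
    return sorted(set(all_features) - mentioned)


print([b.text for b in filter_for_config(BLOCKS, {"feature:export"})])
print(coverage_gaps(BLOCKS, ["feature:export", "feature:smartsync", "feature:undo"]))
```

Neither function needed its own attribute in the source, so a third action (say, terminology checks for localization) could be added later without touching the content.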

Point 3 is about where to disambiguate terms / mentions. There's a valid argument that specific attribute values are not needed to indicate real-world entities; the content of an element is enough. For example, <organizationname>Acme Corporation</organizationname> refers to the organization named "Acme Corporation", and can be processed by a system accordingly. If a term is misspelled or an alternate term is used, the system can prompt authors to use the approved version; in the case of a synonym, perhaps the system can silently do the appropriate matching, leaving the synonym visible in the output. During localization, terminology management systems can match the strings from the source with their approved, localized counterparts (or skip translation where the term is global). This certainly works for many organizations and I wouldn't say it's wrong. But here's why I think it's preferable to keep the full, unambiguous meaning of the content in its source:

It avoids messy ambiguity, for example where two companies have the same name (there have been a number of Acme Corporations), or where a name is very close and an author accidentally gets the wrong one. It also avoids the need to create specific elements solely for disambiguation, for example if there was an "Exit" UI string on a software tool's File menu, and another "Exit" string on a particular dialog, and in the canonical string resource these were maintained separately and liable to be updated independently.
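
Here's a small sketch of the "Exit" case, assuming a made-up string resource keyed by hypothetical IDs such as `menu.file.exit`. Matching on the text alone can't tell the two strings apart, whereas a reference by ID keeps working when one of them changes.

```python
# Hypothetical canonical string resource, keyed by ID rather than by text.
UI_STRINGS = {
    "menu.file.exit": "Exit",
    "dialog.setup.exit": "Exit",
}


def ids_for_text(text: str, strings: dict) -> list:
    """Text-only matching: return every ID whose string equals the text."""
    return sorted(key for key, value in strings.items() if value == text)


def resolve(mention_id: str, strings: dict) -> str:
    """ID-based reference: return the current canonical text for one string."""
    return strings[mention_id]


# Matching on content alone is ambiguous: two candidates come back.
print(ids_for_text("Exit", UI_STRINGS))

# Later the dialog button is renamed; the File menu item is not.
updated = {**UI_STRINGS, "dialog.setup.exit": "Close"}
print(resolve("menu.file.exit", updated))
print(resolve("dialog.setup.exit", updated))
```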

It means the content is not quite so dependent on a specific system or implementation. It's easier to make use of the content in other contexts: whether in different tools used by the same team or teams, in other groups' tools, or in external teams entirely. This isn't the same as saying that we should create content in an interchange format. I think most people have found that as soon as they start doing anything interesting with their content, they pretty much lose the ability to import that content easily and blindly into a different system without missing something. What I mean, though, is that attaching the full semantics to the source creates the possibility to interpret the source in another system without having to recreate large parts of the original system.

It gives authors confidence. Authors like unambiguity. I believe some of the affection for WYSIWYG is because of the desire to "get things right". The appearance of the content in a WYSIWYG tool gives a sense of confidence, one that isn't always justified. A better way to get it right is to be confident that your source can be used appropriately. It seems that a true WYSIWYM (what you see is what you mean) tool would give authors confidence that the semantics of their work were correct and would be preserved.

The questions then arise as to what format this markup should take, and how authors should work with it. For the format question, consistency and extensibility seem very important. RDFa comes to mind. It's machine-readable, extensible, and has the advantage of already working with the major search engines, possibly simplifying the system somewhat. But raw RDFa is not something that most authors really want to get into inserting manually, or that would be productive if they did. Here is where authoring tools can really come into their own. Where there's a canonical list of external entities or another searchable resource, designs such as the following can be used:

After working on a piece of content, you could press a button and get suggested matches for the terms you'd mentioned.

During writing, a keyboard shortcut could bring up a dynamic list of items, filtered as you typed.

For a basic implementation, something like the interface here could be enough:

It's worth mentioning that sharing a canonical list of entities doesn't have to imply anything about power and authority in a team. With the right design, new entities could be added by anyone. And the technical architecture need not even be centralized — it could work with a distributed repository as with Git or Mercurial.
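
To sketch the type-ahead idea and the RDFa output together: the entity list, the `example.com` URIs and both function names below are invented for illustration, and the schema.org `Organization` type is just one plausible vocabulary choice. The author only picks from a filtered list; the tool writes the markup.

```python
from html import escape

# Hypothetical canonical list: (entity URI, schema.org type, label).
ENTITIES = [
    ("https://example.com/id/acme-tools", "Organization", "Acme Tools"),
    ("https://example.com/id/acme-corp", "Organization", "Acme Corporation"),
    ("https://example.com/id/apex-corp", "Organization", "Apex Corporation"),
]


def suggest(typed: str, entities=ENTITIES):
    """Return entities whose label starts with the typed text (case-insensitive)."""
    prefix = typed.casefold()
    return [e for e in entities if e[2].casefold().startswith(prefix)]


def to_rdfa(entity) -> str:
    """Wrap a mention in an RDFa Lite span that points at the chosen entity."""
    uri, type_, label = entity
    return (
        f'<span vocab="https://schema.org/" typeof="{escape(type_)}" '
        f'resource="{escape(uri)}">'
        f'<span property="name">{escape(label)}</span></span>'
    )


matches = suggest("acme")
print([label for _, _, label in matches])
print(to_rdfa(matches[1]))
```

Two organizations match "acme" here, which is exactly the point at which the author disambiguates, once, in the source.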

At the beginning of the post, I wondered whether I was indulging in architecture astronautics. Based on what I've written so far, there is certainly plenty of room to get carried away. But experience tells us that we shouldn't get so hung up on a particular concept or way of doing things that we lose sight of the practical costs and benefits. I really like +Joe Gollner's slides on these kinds of concerns:
As Joe puts it, we should use just enough semantic technology, and no more.

Of course, just enough for now may not always be enough for the future. And one of the great things about RDF's use of a graph model (including RDFa of course) is that it is really extensible. Extending a semantic structure or modifying existing semantics is a lot easier than if everything is buried in a relational or hierarchical form. (So it's ironic that the term RDF has connotations of idealistic, over-ambitious implementations that try to do everything and rarely realize practical value. For a good look at this misconception, see: .)
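
A toy illustration of that extensibility, using a plain Python set of (subject, predicate, object) tuples rather than a real RDF library. The `ex:` identifiers and the `ex:trademarkedIn` property are made up for the example.

```python
# A tiny stand-in for a graph: a set of (subject, predicate, object) triples.
graph = {
    ("ex:acme-corp", "rdf:type", "schema:Organization"),
    ("ex:acme-corp", "schema:name", "Acme Corporation"),
}


def objects(graph, subject, predicate):
    """Return all objects for a given subject and predicate, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Extending the model later is just a matter of adding triples that use a
# new predicate; the existing triples and queries are left untouched.
graph |= {
    ("ex:acme-corp", "ex:trademarkedIn", "US"),
    ("ex:acme-corp", "ex:trademarkedIn", "CA"),
}

print(objects(graph, "ex:acme-corp", "schema:name"))
print(objects(graph, "ex:acme-corp", "ex:trademarkedIn"))
```

In a relational form, the same change would typically mean a new column or table before any data could be stored.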

So, as I see it, inline semantics connected to specific real-world entities have huge and somewhat untapped potential; they should be separated from specific applications where possible; and they are probably best kept in the source. I have no dire predictions for people who don't do these things, but I think structured content could perhaps be enhanced with them!

Have you used inline markup in this way? Is it something you're thinking of? Or wouldn't it be worth it for your situation? It would be great to hear opinions and experiences related to these ideas.

#structuredcontent   #intelligentcontent   #authoringtool   #rdf   #rdfa       #semanticweb  