Enough Breadcrumbs to Feed a Hungry Search Expert.
223,814,124 of them to be more precise.

The WebDataCommons RDFa, Microdata, and Microformat data sets extracted from the November 2013 Common Crawl have been released.

Of the 2.24 billion HTML pages crawled, 26% contained some form of structured data, representing 13% of the domains covered. That is an increase of just over 20% on the 2012 stats, despite the 2013 crawl parsing 7.8M fewer URLs.

Since the 2012 crawl there has been an increase in Microdata adoption of slightly over 330%, with the majority of webmasters favouring Schema.org. No surprise there, just as it is no surprise that the most widely used RDFa vocabulary is Facebook's Open Graph protocol.

The overview of the data sets can be found at http://webdatacommons.org/structureddata/2013-11/stats/stats.html
Among the aspects covered are Top Domains by Extracted Triples, Top Domains by URLs with Triples, and Top Classes, which include 178,334,394 schema:Product entities and 13,813 og:"product" domains.

It's also interesting to see new domains appearing in the top 1000 listings, and to see the growth of sites like Wikipedia, up from 1,668,194 URLs with triples to 2,122,209.
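For anyone who wants to put a percentage on that Wikipedia growth, here is a minimal Python sketch. The function name `percent_growth` and the variable names are my own, used for illustration only; the two figures are the ones quoted above.

```python
def percent_growth(old: int, new: int) -> float:
    """Return the percentage change going from old to new."""
    return (new - old) / old * 100


# Wikipedia URLs with triples, as quoted in this post.
wikipedia_2012 = 1_668_194
wikipedia_2013 = 2_122_209

growth = percent_growth(wikipedia_2012, wikipedia_2013)
print(f"Wikipedia URLs with triples grew by {growth:.1f}%")
```

That works out to roughly 27% more Wikipedia URLs carrying triples than in the 2012 crawl. The same helper can be reused on any other pair of figures from the two stats pages.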

For those wanting to do fuller comparisons between the Common Crawl data sets, the August 2012 stats are here:

This new data probably comes too late for inclusion in +David Amerland's upcoming book, but it does show that business owners really do need to start including, or at least plan to include, structured data in their sites.