About those 20 billion triples
Web Data Commons has released its RDFa, Microdata, and Microformat data sets extracted from the December 2014 Common Crawl. The extraction "found structured data within 620 million HTML pages out of the 2.01 billion pages contained in the crawl (30%)." http://bit.ly/1xXzCuM
This is up from 26% in the November 2013 data set. http://bit.ly/1kXpYix
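For anyone who wants to check the arithmetic behind those percentages, here is a minimal sketch in Python. The page counts and the 26% figure are the ones quoted above; the variable names are my own and purely illustrative.

```python
# Page counts quoted from the December 2014 Web Data Commons summary.
# Variable names are illustrative, not taken from Web Data Commons.
pages_with_structured_data = 620_000_000
pages_in_crawl = 2_010_000_000

# Share reported for the November 2013 data set, in percent (already rounded).
share_2013_pct = 26.0

# Share of crawled pages that contained structured data, in percent.
share_2014_pct = 100 * pages_with_structured_data / pages_in_crawl

# Year-over-year change, in percentage points.
change_points = share_2014_pct - share_2013_pct

print(f"December 2014 share: {share_2014_pct:.1f}%")
print(f"Change since November 2013: {change_points:.1f} percentage points")
```

The computed share comes out slightly above the quoted 30%, and because the 2013 figure is itself rounded, the change should be read as approximate.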
I've shown some comparative statistics in the image attached to this post. I've highlighted some microdata statistics as kind of a proxy for schema.org use, but of course these triples could also be marked up using other syntaxes.
There are statistical problems with using these data sets as a means of assessing adoption and/or relative use of the formats and entities enumerated, but once those shortcomings have been acknowledged I think there's trending data of interest to be uncovered.
For example, it makes sense that schema.org/Product is ascendant, given that it's now the go-to entity employed in enterprise ecommerce structured data markup. It also makes sense to see the relative appearance of microformats waning, as the least flexible of the formats listed (though microformats are very much hanging in there).
It would be interesting to know whether future extractions might look at JSON-LD data as well. I invite +Chris Bizer to comment if he cares to.
#commoncrawl #structureddata #webdatacommons #schemaorg #rdfa #microdata #microformats