RDFa, Microdata, JSON-LD, and Microformats data sets for Nov. 2015 Common Crawl release now available

As announced by +Robert Meusel (http://bit.ly/1XSuFvy), the Web Data Commons has just released a new RDFa, Microdata, embedded JSON-LD and Microformats data corpus, extracted from the November 2015 version of the Common Crawl (1.8B pages, 14.4M websites).

This is the first Web Data Commons corpus to report on embedded JSON-LD discovered in the crawl, so I took a quick look at which classes and properties were most commonly declared.
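For anyone who wants to repeat that quick look, here is a minimal sketch of how the class counts could be tallied. It assumes the extracted data is available as N-Quads-style lines (subject, predicate, object, graph); the sample lines and the function name `count_classes` are made up for illustration and are not taken from the corpus.

```python
import re
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Matches one N-Quads line whose object is an IRI:
# subject, predicate, <object IRI>, graph label, terminating dot.
# Lines with literal objects do not match and are skipped.
QUAD_RE = re.compile(r'^(\S+)\s+(\S+)\s+<([^>]*)>\s+(\S+)\s*\.\s*$')


def count_classes(lines):
    """Count how often each class IRI appears as the object of rdf:type."""
    counts = Counter()
    for line in lines:
        match = QUAD_RE.match(line.strip())
        if match and match.group(2) == RDF_TYPE:
            counts[match.group(3)] += 1
    return counts


# Hypothetical sample data, not real lines from the corpus.
sample = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/WebSite> <http://example.com/> .',
    '_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/SearchAction> <http://example.com/> .',
    '_:b2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/WebSite> <http://example.org/> .',
    '_:b2 <http://schema.org/url> <http://example.org/> <http://example.org/> .',
]

for class_iri, count in count_classes(sample).most_common():
    print(count, class_iri)
```

The same loop works on a file handle, so a downloaded extract could be streamed line by line rather than loaded into memory. Counting properties instead of classes would mean tallying `match.group(2)` for every line, with a looser pattern that also accepts literal objects.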

As per the call-out graphic, there's a pretty direct relationship between what webmasters have been encoding and the search result features in Google that support JSON-LD.

Clearly many sites - on the order of a half-million or more - have opted to provide Google with information about their internal search in order to generate an in-SERP sitelinks search box, first announced by Google in September of 2014 (http://bit.ly/1XSvXqt).
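For reference, this is roughly what that sitelinks search box markup looks like: a WebSite with a potentialAction of type SearchAction. The sketch below builds it in Python and prints the JSON-LD that would go inside a script element of type application/ld+json. The example.com URLs and the helper name `sitelinks_searchbox_markup` are placeholders of my own.

```python
import json


def sitelinks_searchbox_markup(site_url, search_url_template):
    """Build schema.org WebSite + SearchAction markup as a Python dict.

    search_url_template must contain the {search_term_string} placeholder
    that the query-input value refers to.
    """
    return {
        "@context": "http://schema.org",
        "@type": "WebSite",
        "url": site_url,
        "potentialAction": {
            "@type": "SearchAction",
            "target": search_url_template,
            "query-input": "required name=search_term_string",
        },
    }


# Placeholder URLs for illustration only.
markup = sitelinks_searchbox_markup(
    "https://www.example.com/",
    "https://www.example.com/search?q={search_term_string}",
)
print(json.dumps(markup, indent=2))
```

A page carrying this block contributes one WebSite and one SearchAction class declaration, plus url, potentialAction, target and query-input properties, which is the kind of footprint that shows up in the top JSON-LD lists.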

Markup in support of social media profile links, announced in January 2015 (http://bit.ly/1CrX6pr), is also much in evidence, and probably accounts for a good chunk of the sameAs declarations observed.

Finally, JSON-LD support for logos and corporate contact information, both announced in 2014, is probably responsible for some of the other top JSON-LD-encoded classes and properties in the extraction, including Organization and its sub-classes, ContactPoint and logo.

It's interesting to compare this list against the equivalent top lists for microdata, which are quite different and skew much more heavily toward types with long-standing rich snippet support, such as CreativeWork classes, Product and Offer. It will be interesting to see how this distribution changes over time, with JSON-LD support for these types all having been introduced since this crawl.

Finally - because we need mystery, right? - how is it that theclothdiaperwhisperer.com has more triples (570B) than all of blogspot.com (538B)? :)

