Robert Meusel
data juggler, programmer, scientist, cineast, real-life addict

Robert's posts

Post has attachment
We have just released a new Web Data Commons RDFa,
Microdata, Embedded JSON-LD and Microformat data corpus.

The data corpus has been extracted from the November 2015 version of the
Common Crawl covering 1.77 billion HTML pages which originate from 14.4
million websites (pay-level domains).

Altogether we discovered structured data within 541 million HTML pages out
of the 1.77 billion pages contained in the crawl (30%). These pages
originate from 2.7 million different pay-level domains out of the 14.4
million pay-level domains covered by the crawl (19%).

Approximately 521 thousand of these websites use RDFa, while 1.1 million
websites use Microdata. Microformats are also used by over 1 million
websites within the crawl. For the first time, we have also extracted
embedded JSON-LD, which we can report is used by more
than 596 thousand websites.

Visit the website:

Read more:!topic/web-data-commons/VlqZef_5gDg

Post has attachment
Would you like to collect large datasets from the Web? If so, then there is good news! Anthelion, a focused crawler for semantic annotations in Web pages, was recently released to the public.

Post has attachment
Hi Everybody,

The WebDataCommons team is happy to announce that we have released several class-specific subsets of the data contained in our Winter 2013 Microdata corpus [1]. We hope that providing these topic-specific subsets for over 50 different classes (like product, event, or address) will make it easier for the community to explore and work with the data.

The different datasets, along with some statistics about the data can be found here:

The subsets contain all instances of a specific class as well as all other data that is found on the webpages containing these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in the event. The data was originally extracted using Any23 [2] from the Winter 2013 crawl provided by the Common Crawl Foundation [3]. The extracted data is represented in N-Quads [4] format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted.
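
Because the fourth element of each quad holds the source page, quads can be grouped by the page they came from. The following is only a minimal sketch of that idea, not the team's own tooling: the helper names `source_page` and `quads_per_page` and the sample lines are made up for illustration, and the regular expression assumes that every line ends with a graph IRI followed by " ." rather than being a full N-Quads parser.

```python
import re
from collections import Counter

# Matches an IRI in angle brackets that is directly followed by the
# terminating " ." of a statement, i.e. the last term on the line.
# Assumption: every line carries a graph IRI as its fourth element.
_LAST_IRI = re.compile(r"<([^<>\s]*)>\s*\.\s*$")


def source_page(quad_line):
    """Return the IRI of the last term on an N-Quads line, or None."""
    match = _LAST_IRI.search(quad_line)
    return match.group(1) if match else None


def quads_per_page(lines):
    """Count how many quads were extracted from each source page."""
    counts = Counter()
    for line in lines:
        page = source_page(line)
        if page is not None:
            counts[page] += 1
    return counts


# Hypothetical sample data; real files would be read line by line.
sample = [
    '_:n1 <http://schema.org/Product/name> "Blue <b>Widget</b>"@en <http://example.com/shop/widget> .',
    '_:n1 <http://schema.org/Product/url> <http://example.com/w> <http://example.com/shop/widget> .',
    '_:n2 <http://schema.org/Event/name> "Meetup" <http://example.org/events/42> .',
]

for page, count in quads_per_page(sample).items():
    print(page, count)
```

For anything beyond a quick look at the data, a proper RDF library is the safer choice, since literals and escapes in N-Quads have more corner cases than this pattern covers.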

We thank the Common Crawl Foundation for providing their Web corpora.

Post has shared content
never ever before
Share if you've never seen a peeled lemon until now...

Post has attachment

ANN: Large hyperlink graph published, covering 3.5 billion web pages and 128 billion hyperlinks 

The Web Data Commons team has just announced the publication of a new large hyperlink graph.
The graph has been extracted from the Common Crawl 2012 web corpus [1] and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public.
The graph can be downloaded in various formats from
We provide initial statistics about the topology of the graph at
We hope that the graph will be useful for:
• Researchers who develop search algorithms that rank results based on the hyperlinks between pages.
• Researchers who develop spam detection methods that identify networks of web pages published in order to trick search engines.
• Researchers who develop graph analysis algorithms and can use the hyperlink graph for testing the scalability and performance of their tools.
• Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.
We want to thank the Common Crawl project for providing their great web crawl and thus enabling the creation of the WDC Hyperlink Graph.
The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by Amazon Web Services. We thank our sponsors a lot.
Best Regards,
Chris, Oliver & Robert