- I know I couldn't keep up with you because English isn't my native language. That was why I took photos of all his slides. :) Sep 4, 2014
- Awesome - thanks! Sep 4, 2014
- Can someone tell me what type of sample was used in this instance? Sep 5, 2014
- You mean how the 12,000,000,000 page sample was selected? I've no idea, and I'm pretty sure, alas, that didn't say. Sep 5, 2014
- AJ Kohn +1 Sep 5, 2014
- Heh .. no need to be sorry #gooddataisgood.
I can believe, though, that there's a chance you'd see these numbers from a relatively random crawl ... or extremely different numbers from another random crawl. I don't know what a statistically significant sample is for web pages these days, but given the scale of the web it must be huge.
And a lot of this hinges on domain selection. If you include walmart.com in your domain scope - voilà, 14M pages with schema.org markup! amazon.com? Voilà, 50M pages without significant structured data markup of any type.
I've seen adoption numbers for ... whatever ... many times over my years in search, and have tried to assess these myself from time to time. The problem ultimately becomes one of persistent apples-to-apples comparisons.
This is difficult because, of course, domains come and go, but I don't think some sort of persistent framework is impossible, one that would allow us to look at adoption data for schema.org, rel="canonical", Google authorship (ha), or any other thing that can be based on code extraction. Without going into detail here, I can think of several selection criteria - each with its own problems, but I think at the end of the day better for monitoring code use over time than "'k, we pulled another few billion URLs outta a hat!"
That's always been the Achilles' heel of Common Crawl SD extractions, of course (http://bit.ly/Wpjkc8). Love the data, but whenever I've tried to make inferences I've been told (reasonably enough) that one can't do that, because Corpus A isn't equal to Corpus B. Infuriating. :) Sep 5, 2014