Today Majestic has launched a new version of Site Explorer, which has categorized the whole web into about 800 categories. That’s about 700 billion pages right now, with our full historic index to follow when we can.
Rather than totally repeat the announcement over at http://www.stateofdigital.com/majestic-trust-flow/ (thanks +Bas van den Beld), I thought I’d talk a little on G+ about how we have done it… without giving away the whole secret sauce, of course.
Any object on the web can be categorized… not just traditional web pages. Here’s +John Mueller's Google+ profile, for example, which shows that he is primarily about Computers / Internet / Web Design and Development and secondarily about Computers / Internet / Searching. Hope that’s a fair assessment of you, John? Check out a few friends while you are here!
When we started looking at categorizing the web, I have to confess that I was mildly skeptical as to whether (a) we would pull it off, (b) the data would hold up to scrutiny, and (c) people would care. Fortunately for us, +Matt Cutts did a spiffing video on the 2nd of April about “Topical PageRank”. Thanks Matt! I am glad that we were able to get this out before Google’s topic update though – at least we can show that we worked this all out ourselves and did not use third-party data.
Actually – that is not entirely true. One of the first things we looked at was which topics we should use to categorize the web. Options on the table included Dublin Core and some sort of tag-based system, but in the end we decided that the Open Directory Project had an open-source categorization structure, and if it was good enough for AOL it was not a bad place for us to start. That said, we found the categories slightly problematic – especially when it came to (say) electronics manufacturers in the UK vs electronics manufacturers in the US… when companies get large, localization only serves to disrupt categorization. So we took the ODP categories as a framework, but then changed them for the sake of our algorithms.
The next step was the actual categorization part. This was really interesting and I can’t say too much, but we decided that one of the biggest weaknesses of human-edited directories was that two pages on the same site could be about totally different topics. Some sites are very specialist… others very generalized. So we decided that the only solution was to categorize pages, and then do some roll-up calculations to help give an overview at a subdomain or root-domain level. This is something we are now quite practiced at, so that part was OK.

The part that IS technically challenging is that many of these categories overlap dramatically in meaning, and differentiating between them is really hard to do, because finding patterns for one category that set it apart from a lower or parallel subcategory is not at all easy. We have spoken with a number of researchers, practitioners, university lecturers and all manner of people about this, and we did not find a usable solution handed to us on a plate. I guess that explains why it has never been openly attempted before. I know Google have Topical PageRank, but they don’t share that knowledge, and neither would I if I were them. Also challenging: with a little under 1,000 categories, we have to score every web page against all of them to know which categories are most important.
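To give a feel for the roll-up idea (and only the idea – our actual algorithm stays in the secret sauce), here is a minimal Python sketch: page-level category scores averaged up to the root domain. The URLs, scores, naive root-domain extraction and simple averaging are all illustrative assumptions, not how we really do it.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def roll_up(page_scores):
    """Average page-level category scores up to root-domain level."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for url, scores in page_scores.items():
        host = urlsplit(url).hostname or ""
        root = ".".join(host.split(".")[-2:])  # naive root-domain extraction
        counts[root] += 1
        for category, score in scores.items():
            totals[root][category] += score
    # Averaging stops big sites dominating purely by page count
    return {root: {cat: t / counts[root] for cat, t in cats.items()}
            for root, cats in totals.items()}

pages = {  # hypothetical page-level scores
    "http://example.com/a": {"Computers/Internet": 1.0, "Business": 0.25},
    "http://example.com/b": {"Computers/Internet": 0.5, "Business": 0.75},
}
print(roll_up(pages))
# {'example.com': {'Computers/Internet': 0.75, 'Business': 0.5}}
```

Two specialist pages average out to a domain-level picture – which is exactly why page-level categorization has to come first.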
So our Fresh index has about 700 billion URLs.
Multiply that by nearly 1000 categories to see the scale of the challenge.
Now do that on every Fresh Index update (daily if we can).
That’s quite a calculation. Excel won’t cut it… and it is also one of the major reasons why we take a little longer than some to show new links. We think the extra day’s wait is well worth the extra insight we can bring.
Once we have scored every URL against all the categories, we keep only the top few for the index. Not every page has a Topical Trust Flow for every category, of course, and this means that the resulting table would have lots of blank cells if it were laid out as one massive spreadsheet, so we are able to reduce this data down a lot.
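A toy version of that reduction – keep only each URL’s strongest categories and drop the blanks – might look like this (the cut-off of three is an assumption; I’m not revealing our real one):

```python
import heapq

def top_categories(scores, n=3):
    """Keep only the n highest-scoring categories for a URL (sparse storage)."""
    return dict(heapq.nlargest(n, scores.items(), key=lambda kv: kv[1]))

page = {"Computers/Internet": 0.9, "Business": 0.4,
        "Science": 0.1, "Arts": 0.05}
print(top_categories(page))
# {'Computers/Internet': 0.9, 'Business': 0.4, 'Science': 0.1}
```

Everything below the cut simply never gets stored, which is what turns a mostly-blank table into something we can rebuild daily.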
The reason for doing all this every day is really so that we do not have to try to calculate it all for you on the fly every time. I would imagine that it is possible to come up with a clunky “categorization algorithm” that starts churning on demand, but it is really unlikely to be effective at the URL level, and at the domain level we have already seen that this is not granular enough to be useful.
I’m pretty excited that we have categorized the whole web and would love this news to go a bit more mainstream than the SEO community… so if anyone knows any reporters on mainstream papers or TV news programs who think this might be a little bit more than “just another SEO tool”, I’d love to hear from them.
In the meantime – if you get a chance to publicly talk about what we’ve achieved, I would really like the exposure.