Web server log analysis for SEO purposes

If you extract from your web server logs all the rows generated by Googlebot, you can find some useful information about how the spider analyzes the website and whether there are crawling issues.
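The first step is usually just filtering the raw access log for Googlebot's user-agent. A minimal sketch in Python, where the sample log lines are invented for illustration (in a real analysis you should also verify the client IP via reverse DNS, since the user-agent string can be spoofed):

```python
# Filter Apache/Nginx "combined" log lines down to Googlebot requests.
# The sample lines are hypothetical; real checks should also validate
# the client IP by reverse DNS, since the user-agent can be spoofed.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /products/42 HTTP/1.1" '
    '200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2023:13:55:40 +0000] "GET / HTTP/1.1" '
    '200 1043 "-" "Mozilla/5.0"',
]

def googlebot_lines(lines):
    """Keep only the rows whose user-agent string mentions Googlebot."""
    return [line for line in lines if "Googlebot" in line]

hits = googlebot_lines(SAMPLE_LOG)
```

Once you have only the Googlebot rows, all the analyses described below become simple aggregations over URLs and timestamps.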

First, some basics.

Google uses different spiders for different purposes or, at least, the search engine seems to assign different tasks to its spiders.

So, by analyzing Googlebot's accesses to your website, you can distinguish different behaviours:

DISCOVERY SPEED - Google is able to discover new URLs quickly, using more than one methodology. New URLs are discovered through links, pings and signals produced by users.

If the only way your site can be discovered is through links from other websites, then the speed at which Google discovers your new URLs can be considered a rough indirect signal of how important the search engine considers your site.

As a general rule of thumb, if your URLs are quickly discovered by the search engine, that means nothing. But if your new URLs consistently have a hard time getting discovered within a few days, that can actually mean the search engine is not particularly interested in crawling your website often enough.
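One way to put a number on discovery speed is to compare, per URL, the publication time with the timestamp of Googlebot's first request in the log. A minimal sketch, where the URL and timestamps are made up for illustration:

```python
from datetime import datetime

# Hypothetical data: when each URL was published, and when Googlebot
# requested it (timestamps as parsed out of the access log).
published = {"/news/launch": datetime(2023, 10, 10, 9, 0)}
crawls = [
    ("/news/launch", datetime(2023, 10, 10, 11, 30)),
    ("/news/launch", datetime(2023, 10, 12, 8, 0)),
]

def discovery_lag_hours(url):
    """Hours between publication and Googlebot's first request."""
    first_hit = min(t for u, t in crawls if u == url)
    return (first_hit - published[url]).total_seconds() / 3600

lag = discovery_lag_hours("/news/launch")  # 2.5 hours in this sample
```

Tracked over time, a consistently growing lag is exactly the "lousy signal" described above.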

FRESHNESS - Some URLs are (re)crawled more often than others, maybe because the search engine thinks it is important to detect updates.

This often happens for any form of news and articles, especially if the website is included in Google News. These articles usually receive many Googlebot visits in the hours after publication.

For all kinds of content not associated with feeds or news, the recrawl rate is mainly related to four things: 1) PageRank, 2) how frequently you change the contents of the page, 3) how much of the content is changed, and 4) the changefreq tag in your XML sitemap.
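The recrawl rate itself is easy to measure from the logs: look at the gaps between successive Googlebot requests for the same URL. A minimal sketch with invented timestamps:

```python
from datetime import datetime

# Hypothetical Googlebot hits for one article, extracted from the log.
bot_hits = [
    ("/article/1", datetime(2023, 10, 1)),
    ("/article/1", datetime(2023, 10, 3)),
    ("/article/1", datetime(2023, 10, 7)),
]

def mean_recrawl_days(url, hits):
    """Average number of days between successive crawls of a URL."""
    times = sorted(t for u, t in hits if u == url)
    gaps = [(b - a).days for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

rate = mean_recrawl_days("/article/1", bot_hits)  # 3.0 days here
```

Comparing this per-URL rate before and after content changes is a direct way to see point 2 and point 3 in action.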

NATURAL DEPTH - Websites with more global PageRank are usually crawled more deeply than websites with less global PageRank.

It's important to highlight that the term "PageRank" is used here in a very loose way and is not a reference to the original formula published by Google; it is safe to replace the word "PageRank" with "importance".

The more your website is considered important, the more the crawler is encouraged to request deep URLs. But what is a deep URL?

Let's simplify this a lot and say that if 100% of your backlinks pointed to the home page, then the depth of a given page could simply be considered the number of links the spider has to follow in order to reach that page starting from the home page.

In the real world, no website receives backlinks only to the home page. So, if you receive backlinks directly to category pages or product pages, the average depth of the pages of your website is lower and the spider is encouraged to crawl deeper pages.

Natural crawling depth is certainly a signal quite closely related to the overall PageRank of the website, but why do I specify "natural"? That's because crawling depth can also be influenced by other channels, for example feeds, pings or social signals. So when I say "natural depth" I mean "the depth that the spider is willing to reach after evaluating only the overall PageRank of the website".
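True link depth can only be computed from the internal link graph, but as a quick and admittedly rough proxy you can bucket Googlebot's requested URLs by the number of path segments. A minimal sketch (the paths are invented):

```python
from collections import Counter

# Hypothetical paths requested by Googlebot, taken from the log.
requested = [
    "/",
    "/category/shoes",
    "/category/shoes/product-42",
    "/category/shoes/product-43",
    "/about",
]

def depth_histogram(paths):
    """Count path segments as a rough stand-in for crawl depth.
    Real depth depends on internal linking, not URL structure."""
    return Counter(len([s for s in p.split("/") if s]) for p in paths)

histogram = depth_histogram(requested)
```

If the histogram shows that requests fall off sharply after one or two levels, the deeper sections of the site are not getting the spider's attention.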


Sometimes Googlebot goes a lot deeper than the total amount of PageRank would justify.

It happens when the spider tries to crawl GET forms or when the website creates "infinite URL spaces", for example when a calendar widget creates links that point to an infinite (or extremely high) number of future years.

This kind of depth cannot be related to the importance that Google assigns to the website and such deep crawls should be considered a "special task" assigned to the spider.

(the "infinite spaces" concept comes from +Piersante Paneghel)


Sometimes, especially when the content or the URLs of the website have changed a lot, Google may try to perform a full recrawl of the website.

Since this temporary behaviour is usually a consequence of some specific phenomenon, it should not be considered something related to the importance that the search engine has assigned to the website.


If you learn to analyze your logs and to distinguish these several spider behaviours, you will be able to understand more quickly and at a deeper level:

1) Whether there is something wrong with the internal navigation of the website

2) How quickly the spider reacts to the publication of new resources

3) Whether the URLs that you want indexed are quickly requested by the spider

4) Whether the URLs that you don't want or care to be indexed are being requested by the spider (don't waste your precious bandwidth; Googlebot doesn't assign your website an infinite amount of time and resources)

5) Which sections of your website are more easily crawled and which sections are somewhat ignored by the spider

6) How much importance Google actually assigns to your website and how eager the search engine is to archive your contents. This perception is not an exact science, and you can learn to acquire it only when you have a clear understanding of all the different kinds of Googlebot behaviours (explained above).
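Point 5 in particular falls out of a simple aggregation: group Googlebot's requests by the first path segment and count them. A minimal sketch with invented paths:

```python
from collections import Counter

# Hypothetical Googlebot request paths extracted from the log.
bot_paths = [
    "/blog/post-1", "/blog/post-2", "/blog/post-3",
    "/shop/item-9", "/about",
]

def hits_per_section(paths):
    """Count Googlebot requests per top-level site section."""
    return Counter(p.split("/")[1] for p in paths if p != "/")

sections = hits_per_section(bot_paths)
```

Sections that get a disproportionately small share of Googlebot's requests are the ones to investigate first for internal-linking problems.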

Final questions and unscientific poll:

Do you sometimes analyze your web server logs for SEO purposes? Do you just use the information that Google Webmaster Tools provides to you? Did you ever find an SEO problem by analyzing your web logs?
