Don't forget the basics: block those search result pages

It's right there in Google's guidelines:
"Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
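As a minimal sketch, assuming the internal search results live under a hypothetical /search path (adjust to your own URL structure), the robots.txt rule would look like:

User-agent: *
Disallow: /search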

Take a read of this article; one of the SEO consultants eventually figured it out and the site recovered ;)

seomarketinggoddess.blogspot.co.uk/2013/01/seo-issues-it-is-penguin-is-it-panda-or.html
 
Wasting your crawl budget, for one, and in this example they also had parameter-generating URLs ;(
 
I suppose the choice between blocking in robots.txt and adding a meta noindex tag depends on the website, on how the website is crawled by search engines, etc. But it is important that the pages blocked by robots.txt have informative, relevant content and a good site navigation menu, because people can still find them from search results (when they appear in search results, the snippet is replaced by the "blocked by robots.txt" message).
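For illustration, the meta-tag alternative is a robots meta tag in the page's <head> - note this only works if the page is left crawlable (i.e. not disallowed in robots.txt), otherwise the tag is never seen:

<meta name="robots" content="noindex, follow">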

 
Hi +Giacomo Pelagatti I agree with you agreeing with me :) But there are sometimes good reasons to block pages in robots.txt, for example when they contain forms that could make search engines like Google generate URLs with various parameters for crawling, or when they include JavaScript code that might make search engines generate malformed URLs and collect them for further crawling, etc.


 
Yes, but better still, fix it upstream: limit the number of links to those search pages and apply nofollow to them.

In this particular example they had a ton of links to these search pages from the home page, and perhaps site-wide as well :(

Those search pages then linked to 200+ URLs with various parameter combinations attached to each.

Sort orders on e-commerce sites are quite distinct from internal search results and should not be blocked at all; rel=canonical is your friend here.
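A hedged example, with placeholder URLs: if a category page can also be reached as a sorted variant via a parameter, the variant's <head> would point back to the main URL:

<!-- on https://www.example.com/shoes?sort=price (hypothetical parameter) -->
<link rel="canonical" href="https://www.example.com/shoes">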
 
Totally agree that robots.txt should not be used to block crawling of pages with little or zero unique content, and more specifically for dealing with on-site duplication issues.
 
I generally advise against using internal nofollows; however, when one points to a search result page that you don't actually want crawled or indexed, it does no harm, as you're going to lose internal PR through those links anyway.
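For instance (the /search path and query parameter are placeholders), such an internal link would be marked up as:

<a href="/search?q=widgets" rel="nofollow">More results for widgets</a>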
 
Yes. I use rel=canonical in addition to noindex. :)

Be careful not to use both at the same time though ;)
 
Nothing strange with John's reply; it's what I would expect could potentially happen when you're providing mixed signals.

"Generally speaking I would avoid [...] using the rel=canonical together with the noindex because it can happen that sometimes we take the noindex and also apply it to the canonical [URL]"
 
G cannot do things like that - as they have to account for User Error (and you'd be amazed how often they have to account for that).
 
So a page has both NoIndex and Rel=Canonical.
That means:
1) Don't index "this" URL
2) Merge values of "this" URL to the designated URL

As far as I know, G can handle that - though technically the NoIndex shouldn't be required (the page shouldn't show in the index due to the CLE (canonical link element) - and yes, I know, it's a fail-safe in case G opts to ignore the CLE).

The problem G faces is the huge number of sites out there that screw this sort of thing up.
Multiple CLEs, malformed CLE URLs, multiple robots meta tags, conflicting directives, etc.
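A typical example of that kind of screw-up (placeholder URLs) is two canonical link elements in one <head> pointing at different URLs:

<link rel="canonical" href="https://www.example.com/page-a">
<link rel="canonical" href="https://example.com/page-a?ref=nav">

Faced with that, a search engine pretty much has to pick one or ignore both.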

So G has to draw the line somewhere.
 
It boils down to several aspects:
1) Misuse/mistakes
2) G changing their definitions
3) Control

At the end of the day, as a site owner (or SEO etc.), you want to control how the site is perceived and how it shows in the SERPs.
G fights against that more often than not, though - they want it their way, which often runs against our logic.
Then throw in some confusion as definitions/behaviours change, and a dose of misunderstandings/misapplications,
and we get a mess :(
 
Not specifically the CLE, but things like "NoFollow" etc.
G introduced it as a way to tell the search engines to ignore a link - then opted to treat it as a request, not a directive.
They do that sort of thing every so often - and it causes more confusion.