How to find and fix infinite crawling issues
Google recently published a post on its Webmaster Tools blog about “infinite” crawling.
The post gives some background on what “infinite” crawling is and its possible causes. It also explains how to check your Google Webmaster Tools dashboard for messages indicating that Google has found a problem.
Whether or not Google flags an issue here, it is a good idea to examine your website for “infinite”, or “excess”, crawling issues. By “excess” I mean the same problem on a smaller scale: crawlers have to follow lots of different URLs that lead to the same or similar pages, even if the number of URLs is not literally infinite.
Common examples are search/sort/selection functions that generate URLs with parameters like “page=x”, “type=x”, “price=x-y”, etc., especially when the page links to various combinations of these. As spiders follow these links, they often get very similar results back, along with yet more permutations of URL parameters. Spiders can spend a lot of time on these URLs without finding any new content. As a result, they may stop crawling your site, or never reach pages you really want them to find because they are too busy following the redundant ones.
To find out if you have problems with infinite or excess crawling, a good place to start is your website analytics. Do a report on one month’s worth of traffic, sort by the URL, and look through for areas where you see lots and lots of similar URLs. It is usually pretty easy to spot these — you will see page after page of similar URLs. You can also use website crawling tools such as Xenu to get a list of crawlable URLs.
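If your analytics tool or crawler can export the URL list, the "look for lots of similar URLs" step can be roughed out in a few lines of Python. This is only a sketch: the function name `summarize_url_variants` and the example.com URLs are made up for illustration, and it assumes you already have the URLs as a plain list of strings. It groups URLs by path and counts how many distinct query strings appear under each path.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def summarize_url_variants(urls):
    """Group URLs by path; report distinct query strings and parameter names.

    Returns a list of (path, variant_count, parameter_names) tuples,
    sorted so the paths with the most variants come first.
    """
    variants = defaultdict(set)      # path -> set of distinct query strings
    param_names = defaultdict(set)   # path -> set of parameter names seen

    for url in urls:
        parts = urlsplit(url)
        variants[parts.path].add(parts.query)
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            param_names[parts.path].add(name)

    rows = [
        (path, len(queries), sorted(param_names[path]))
        for path, queries in variants.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


if __name__ == "__main__":
    # Hypothetical sample data standing in for an analytics or crawler export.
    sample = [
        "http://example.com/catalog/search?page=1&sort=price",
        "http://example.com/catalog/search?page=2&sort=price",
        "http://example.com/catalog/search?page=1&sort=name",
        "http://example.com/about",
    ]
    for path, count, names in summarize_url_variants(sample):
        print(path, count, names)
```

Paths that show up with hundreds or thousands of variants, driven by a handful of parameter names, are the likely candidates for excess crawling.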
Fortunately, you can address issues with infinite or excess crawling. What you want to do is identify the set of pages that have unique content or lead to unique pages (e.g., the default sort/selection for your product catalog, listings, etc), and use robots exclusion protocols to keep spiders from following and/or indexing the rest. Thus the spiders will follow a limited set of URLs, index your good content, and not waste time on URLs that serve redundant content.
The main tools to use here are as follows:
- Robots.txt: With robots.txt, you can exclude entire sets of URLs with one simple command. Used carefully, this can be safe and powerful. For example, this command would stop spiders from following any non-default sort lists in your product catalog:
    Disallow: /mycatalog/search?*&sort=
- Robots meta tag: With robots meta tag, you can tell spiders not to index certain pages, or to not follow links on pages. For example, you could use this to tell spiders not to index “next” pages in your catalog search, or to not follow links on pages that use non-default sorting, etc.
- rel="nofollow" attribute: Adding rel="nofollow" to selected links in a page tells the spiders not to follow those links. You could use it on all links within a page that lead to duplicate content or excess crawling. For example, if your catalog search page has links for different sort options, price selections, etc., you could use rel="nofollow" on all links except those that lead to the default settings.
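Because a robots.txt rule can exclude a whole set of URLs at once, it is worth checking what a pattern actually matches before deploying it. The sketch below is one way to do that in Python. It assumes the wildcard behavior Google documents for robots.txt (a rule matches from the start of the path, `*` matches any run of characters, and a trailing `$` anchors the end of the URL); other crawlers may interpret patterns differently. The function name `disallow_matches` and the test URLs are made up for illustration.

```python
import re


def disallow_matches(pattern: str, url_path: str) -> bool:
    """Return True if url_path would be matched by a Disallow pattern.

    Assumes Google-style semantics: match from the start of the path,
    '*' matches any sequence of characters, trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only matches at the beginning of the string.
    return re.match(regex, url_path) is not None


if __name__ == "__main__":
    rule = "/mycatalog/search?*&sort="
    for path in (
        "/mycatalog/search?type=shoes&sort=price",
        "/mycatalog/search?type=shoes",
        "/mycatalog/search?sort=price",
    ):
        print(path, disallow_matches(rule, path))
```

Under these assumptions, the example rule matches URLs where "sort" follows another parameter, but not a URL where "sort" is the first parameter (there is no "&" in front of it). If your site generates URLs like that, you would need an additional rule to cover them.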
By trimming which links spiders will follow, you can avoid infinite/excess crawling issues and focus both the search engine spiders and your internal page rank on those pages that have unique, useful content. This gives your unique content a much better chance of being crawled and ranking higher.
John Erickson
LeadQual
