The ultimate guide to bot herding and spider wrangling — Part Two
Next up in our series on bots and crawl budgets, Columnist Stephan Spencer explains how to direct search engine bots to what's important on your site and how to avoid common coding issues.
In Part One of our three-part series, we learned what bots are and why crawl budgets are important. Let’s take a look at how to let the search engines know what’s important and some common coding issues.
How to let search engines know what’s important
When a bot crawls your site, there are a number of cues that direct it through your files.
Like humans, bots follow links to get a sense of the information on your site. But they’re also looking through your code and directories for specific files, tags and elements. Let’s take a look at a number of these elements.
The first thing a bot will look for on your site is your robots.txt file.
For complex sites, a robots.txt file is essential. For smaller sites with just a handful of pages, a robots.txt file may not be necessary — without it, search engine bots will simply crawl everything on your site.
There are two main ways you can guide bots using your robots.txt file.
1. You can use the “disallow” directive. This instructs bots to ignore specific uniform resource locators (URLs), files, file extensions, or even whole sections of your site.
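For illustration, a disallow block might look like the following. The paths here are hypothetical, and note that wildcard patterns like `*` and `$` are extensions honored by major crawlers such as Googlebot and Bingbot rather than part of the original robots.txt standard:

```
User-agent: *
Disallow: /cgi-bin/            # a whole section (directory) of the site
Disallow: /private-page.html   # a single file
Disallow: /*.pdf$              # every URL ending in a given file extension
```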
Although the disallow directive will stop bots from crawling particular parts of your site (thereby saving crawl budget), it will not necessarily stop those pages from being indexed and showing up in search results.
The cryptic and unhelpful “no information is available for this page” message is not something that you’ll want to see in your search listings.
The above example came about because of a disallow directive in census.gov/robots.txt.
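You can test how a given set of disallow rules will be interpreted using Python's standard-library robots.txt parser. The rules and URLs below are hypothetical examples, not census.gov's actual file:

```python
# Check whether a crawler may fetch a URL under a given robots.txt,
# using Python's standard-library parser.
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt blocking one directory for all user agents.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Blocked by the Disallow rule:
print(parser.can_fetch("Googlebot", "https://example.com/private/report.html"))  # False
# Not covered by any rule, so crawling is allowed:
print(parser.can_fetch("Googlebot", "https://example.com/public/index.html"))    # True
```

Remember that `can_fetch` only tells you whether crawling is allowed; as discussed above, a disallowed URL can still end up indexed if other pages link to it.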
2. Another way is to use the noindex directive. Noindexing a page or file will not stop it from being crawled; however, it will stop it from being indexed (or remove it from the index). This robots.txt directive is unofficially supported by Google and is not supported at all by Bing (so be sure to have a `User-agent: *` set of disallow rules for Bingbot and other bots besides Googlebot).
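Put together, a robots.txt following this advice might look like the snippet below. The path is hypothetical, and since only Google unofficially honors `Noindex:` in robots.txt, the `User-agent: *` disallow block is there for every other crawler:

```
User-agent: Googlebot
Noindex: /thank-you-pages/

User-agent: *
Disallow: /thank-you-pages/
```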
Obviously, since these pages are stil…