Drupal 7 - robots.txt

As the web continues to expand, users rely increasingly on search engines to find relevant sites quickly.  Search engines in turn rely on "bots" (also known as "crawlers" or "spiders") to automatically browse the Internet and index the content they find.  Over time the web development community has standardized a method for telling these bots which parts of a site should or should not be crawled.  The result is the "robots.txt" convention: site owners put a file named "robots.txt" in their site's root folder, and well-behaved search engine bots review its contents before they crawl the site.  In this way, the bots can index the site as intended (e.g. skipping irrelevant content).  In addition to improving the quality of indexed content, this protocol saves bandwidth and resources for both the site owner and the search engine.
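Because the file always sits at the site root, a bot can work out where to look from any page URL on the site. Here is a minimal Python sketch of that step; the function name `robots_txt_url` and the `example.com` addresses are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://example.com/node/1?page=2"))
# prints: http://example.com/robots.txt
```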

As a popular content management system, Drupal ships with a robots.txt file in the site's root directory.  With a clean install of Drupal 7 beta 3, the contents are:

User-agent: *
Crawl-delay: 10
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/

The "User-agent: *" line applies the rules that follow to all bots, and the "Crawl-delay" line asks bots to wait 10 seconds between successive requests (this is a non-standard directive, so not every crawler honors it).  After these initial settings, the first order of business is to have bots ignore Drupal's core directories (e.g. includes, misc, etc.).  Next, the file instructs bots to ignore the text and PHP files that reside in the site root (e.g. CHANGELOG.txt, INSTALL.txt, etc.).  Finally, bots are told to ignore paths like the default search page, login/logout pages, comment form pages, etc.  These pages are mostly blank forms and would serve very little purpose in a search engine's index.  Note that the file covers both "clean" and "unclean" (with '?q=' at the start) paths.
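To see how a crawler might read these rules, here is a small sketch using Python's standard-library `urllib.robotparser` (the `crawl_delay` method needs Python 3.6 or later). It parses only an excerpt of the file; the bot name `ExampleBot`, the host `example.com`, and the helper `allowed` are assumptions for illustration. Real search engine bots use their own parsers and may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the Drupal 7 rules (not the full file).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /includes/
# Paths (clean URLs)
Disallow: /admin/
Disallow: /search/
Disallow: /user/login/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Hypothetical helper: may ExampleBot fetch this path on example.com?"""
    return parser.can_fetch("ExampleBot", "http://example.com" + path)


print(parser.crawl_delay("ExampleBot"))  # 10
print(allowed("/node/1"))                # True: no rule matches
print(allowed("/admin/people"))          # False: starts with /admin/
print(allowed("/search/node/drupal"))    # False: starts with /search/
```

Each Disallow rule is matched as a path prefix, which is why everything under /admin/ is excluded while ordinary content pages such as /node/1 remain crawlable.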

That's about it for the Drupal 7 robots.txt file.  Short and sweet.
