Monday, July 23, 2012

Instruct Search Engine Robots to Skip Parts of the Site When Crawling - the robots.txt File

In Web site implementations, there can be a requirement that some files and directories of a web site should not be indexed by any of the search engines. For this purpose we can use the Robots Exclusion Standard, also known as the Robots Exclusion Protocol (http://www.robotstxt.org/orig.html). Here we use a robots.txt file, a text file placed in the root of a site that tells search engine robots which files and directories of the web site they should not access (crawl).

Important:
  • When a robot wants to visit our Web site (http://myserver.com/default.aspx), it first checks for http://myserver.com/robots.txt. Robots do not search the whole site for a file named robots.txt; they look only in the main directory. They strip the path component from the URL (everything from the first single slash) and put "/robots.txt" in its place (see the sketch after this list).
  • The file name should be all lower case: "robots.txt", not "Robots.TXT".
  • Robots can ignore our robots.txt file (especially malware robots), which means we cannot rely 100% on robots.txt to keep data from being indexed and displayed in search results. Because of this, it should not be used as a way to protect sensitive data.
  • The robots.txt file is publicly available, so anyone can browse it and see which sections of our web site we do not want robots to access.
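
As a quick illustration of the first point above, here is a minimal sketch of what a well-behaved crawler does before fetching a page, using the robots.txt parser from Python's standard library (urllib.robotparser in Python 3; the module is named robotparser in Python 2). The host myserver.com is the hypothetical server used in this post.

   from urllib.robotparser import RobotFileParser

   # A polite robot fetches robots.txt from the site root before crawling.
   parser = RobotFileParser()
   parser.set_url("http://myserver.com/robots.txt")
   parser.read()  # download and parse the file

   # Then it asks whether its user agent may fetch a given URL.
   if parser.can_fetch("Googlebot", "http://myserver.com/default.aspx"):
       print("Allowed to crawl")
   else:
       print("Disallowed by robots.txt")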

robots.txt file syntax:

A robots.txt file uses two rules:
• User-agent: the robot the following rules apply to
• Disallow: the URL path we want to block

Sample 1: Entire server content is excluded from all robots.

   User-agent: *
   Disallow: /

Sample 2: Two directories are excluded from all robots.

   User-agent: *
   Disallow: /archive/
   Disallow: /tmp/

Sample 3: A single file is excluded from the Googlebot crawler.

   User-agent: Googlebot
   Disallow: /myFile.aspx
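
These records can also be combined in a single robots.txt file, separated by blank lines (lines beginning with '#' are comments). One subtlety worth noting: a robot obeys only the record whose User-agent matches it most specifically, so in the sketch below Googlebot follows its own record and ignores the '*' record.

   # Googlebot: only this page is off limits
   User-agent: Googlebot
   Disallow: /myFile.aspx

   # All other robots: stay out of these directories
   User-agent: *
   Disallow: /archive/
   Disallow: /tmp/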

Not all search engines support pattern matching or regular expressions in the User-agent or Disallow lines. The '*' in the User-agent field is a special case that means "any robot".
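
That said, Googlebot and some other major crawlers do support '*' as a wildcard and '$' as an end-of-URL anchor inside Disallow lines; this is a crawler-specific extension, not part of the original standard. For example, the following hypothetical rule would block Googlebot from all PDF files:

   User-agent: Googlebot
   Disallow: /*.pdf$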

There are a few third-party tools available to validate a robots.txt file:
• Robots.txt Checker at http://tool.motoricerca.info/robots-checker.phtml
• Google Webmaster Tools at http://www.google.com/webmasters/tools/

A list of robot software implementations and operators can be found here: http://www.robotstxt.org/db.html

There are many robots.txt generation tools on the web. One of them is Mavention Robots.txt, which creates and manages robots.txt files on websites built on the SharePoint platform.
