Dev Help: Instruct Search Engine Robots to Skip Part of the Site from Crawling

Monday, July 23, 2012

Instruct Search Engine Robots to Skip Part of the Site from Crawling - robots.txt file

In web site implementations, there can be a requirement that some files and directories of a site should not be crawled and indexed by search engines. For this purpose we can use the Robots Exclusion Standard, also known as the Robots Exclusion Protocol (http://www.robotstxt.org/orig.html). It relies on a robots.txt file, a plain text file placed in the root of a site that tells search engine robots which files and directories of the site they should not access (crawl).

Important:

When a robot wants to visit our web site (http://myserver.com/default.aspx), it first checks for http://myserver.com/robots.txt. Robots do not search the whole site for a file named robots.txt; they look only in the root directory. They strip the path component from the URL (everything from the first single slash) and put "/robots.txt" in its place.
The file name should be all lower case: "robots.txt", not "Robots.TXT".
Robots can ignore our robots.txt (especially malware robots), so we cannot rely on robots.txt to keep data from being indexed and displayed in search results. For that reason it should not be used as a way to protect sensitive data.
The robots.txt file is publicly available, so anyone can browse it and see which sections of our web site we do not want robots to access.
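
The lookup rule described above (keep the scheme and host, replace the path with "/robots.txt") can be sketched in Python. This is only an illustration under simple assumptions, not how any particular robot is implemented; the function name robots_txt_url is made up for this example, and the second URL is a hypothetical page on the same sample server.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL a robot would check before fetching page_url.

    Hypothetical helper: keeps the scheme and host (and port, if any),
    drops the path, query and fragment, and puts /robots.txt in their place.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Both pages live on the same host, so both map to the same robots.txt.
print(robots_txt_url("http://myserver.com/default.aspx"))
print(robots_txt_url("http://myserver.com/archive/2012/page.aspx?id=7"))
# Both lines print: http://myserver.com/robots.txt
```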

robots.txt file syntax:

A robots.txt file uses two rules:

· User-agent: The robot that the following rules apply to

· Disallow: The URL path we want to block
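
To make the layout concrete, here is a small Python sketch that writes these two rules out as text. It assumes the record layout from the original standard: one User-agent line followed by its Disallow lines, with a blank line between records. The helper name render_robots_txt and its input shape are invented for this illustration.

```python
def render_robots_txt(records):
    """Build robots.txt text from (user_agent, disallowed_paths) pairs.

    Hypothetical helper for illustration. Each record is one User-agent
    line followed by its Disallow lines; records are separated by a
    blank line.
    """
    blocks = []
    for user_agent, paths in records:
        lines = [f"User-agent: {user_agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# One record for a specific robot, one record for every other robot.
print(render_robots_txt([
    ("Googlebot", ["/myFile.aspx"]),
    ("*", ["/archive/", "/tmp/"]),
]))
```

The result would be saved as robots.txt in the root of the site.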

Sample 1: Entire server content is excluded from all robots.

User-agent: *
Disallow: /

Sample 2: Two directories are excluded from all robots.

User-agent: *
Disallow: /archive/
Disallow: /tmp/

Sample 3: A file is excluded from the Googlebot robot only.

User-agent: Googlebot
Disallow: /myFile.aspx
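
One way to check how such rules are interpreted is Python's standard urllib.robotparser module. The sketch below feeds Sample 2 and Sample 3 to it as strings, so no network access is needed. The helper build_parser, the robot names "AnyBot" and "OtherBot", and the page /archive/old.aspx are made up for the example. Real search engine robots may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (hypothetical helper, no network)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Sample 2: two directories are excluded from all robots.
sample2 = build_parser("User-agent: *\nDisallow: /archive/\nDisallow: /tmp/\n")
print(sample2.can_fetch("AnyBot", "http://myserver.com/archive/old.aspx"))  # False
print(sample2.can_fetch("AnyBot", "http://myserver.com/default.aspx"))      # True

# Sample 3: one file is excluded for Googlebot only.
sample3 = build_parser("User-agent: Googlebot\nDisallow: /myFile.aspx\n")
print(sample3.can_fetch("Googlebot", "http://myserver.com/myFile.aspx"))    # False
print(sample3.can_fetch("OtherBot", "http://myserver.com/myFile.aspx"))     # True
```

This parser treats a blank line as the end of a record, so a User-agent line and its Disallow lines need to sit on adjacent lines for the rules to take effect.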

Not all search engines support pattern matching or regular expressions in either the User-agent or Disallow lines. The '*' in the User-agent field is a special case that means "any robot".

There are a few third-party tools available to validate a robots.txt file.

· Robots.txt Checker at http://tool.motoricerca.info/robots-checker.phtml

· Webmaster Tools at http://www.google.com/webmasters/tools/

A list of robot software implementations and operators can be found here: http://www.robotstxt.org/db.html

There are many robots.txt generation tools on the web. One of them is Mavention Robots.txt, which is for creating and managing robots.txt files on websites built on the SharePoint platform.
