In web site implementations, there can be a requirement that some files and directories of a web site should not be indexed by any search engine. For this purpose we can use the Robots Exclusion Standard, also known as the Robots Exclusion Protocol (http://www.robotstxt.org/orig.html).
Here we use a robots.txt file, which is a text file placed in the root of a site to tell search engine robots which files and directories of the web site they should not access (crawl).
Important:
- When a robot wants to visit our web site (e.g. http://myserver.com/default.aspx), it first checks for http://myserver.com/robots.txt. Robots do not search the whole site for a file named robots.txt; they look only in the main directory. They strip the path component from the URL (everything from the first single slash) and put "/robots.txt" in its place (see the sketch after this list).
- The file name should be all lower case: "robots.txt", not "Robots.TXT".
- Robots can ignore our robots.txt (especially malware robots), which means we cannot rely 100% on robots.txt to keep data from being indexed and displayed in search results. Because of that, it should not be used as a way to protect sensitive data.
- The robots.txt file is publicly available, so anyone can browse it and see which sections of our web site we do not want robots to access.
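As an illustration of the first point, here is a minimal Python sketch (not part of the protocol itself) of how a crawler derives the robots.txt location from any page URL by replacing the path component; the URL is the sample one used above.

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Replace the path (and query/fragment) of page_url with /robots.txt,
    # which is what well-behaved crawlers do before visiting a page.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://myserver.com/default.aspx"))
# Output: http://myserver.com/robots.txt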
robots.txt file syntax:
The robots.txt file uses two rules:
- User-agent: the robot the following rule applies to
- Disallow: the URL we want to block
Sample 1: Entire server content is excluded from all robots.
User-agent: *
Disallow: /
Sample 2: Two directories are excluded from all robots.
User-agent: *
Disallow: /archive/
Disallow: /tmp/
Sample 3: A file is excluded from Google's crawler, Googlebot.
User-agent: Googlebot
Disallow: /myFile.aspx
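To see how such rules are evaluated, here is a minimal sketch using Python's standard urllib.robotparser module against the rules from Samples 2 and 3 combined (myserver.com and SomeBot are just placeholders):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /myFile.aspx

User-agent: *
Disallow: /archive/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is blocked from the file named in its own rule group.
print(rp.can_fetch("Googlebot", "http://myserver.com/myFile.aspx"))   # False
# Any other robot falls under the "*" group.
print(rp.can_fetch("SomeBot", "http://myserver.com/archive/a.html"))  # False
print(rp.can_fetch("SomeBot", "http://myserver.com/default.aspx"))    # True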
Not all search engines support pattern matching or regular expressions in either the User-agent or Disallow lines. The '*' in the User-agent field is a special case which means "any robot".
There are a few third-party tools available to validate a robots.txt file:
- Robots.txt Checker at http://tool.motoricerca.info/robots-checker.phtml
- Webmaster Tools at http://www.google.com/webmasters/tools/
A list of robot software implementations and operators can be found here: http://www.robotstxt.org/db.html
There are many robots.txt generation tools on the web. One of them is Mavention Robots.txt, which is used for creating and managing robots.txt files on websites built on the SharePoint platform.
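Since the format is so simple, a basic robots.txt can also be generated with a few lines of code. Below is a minimal Python sketch (the rule set is just the samples from above) that writes a robots.txt to the current directory; in practice the file must end up in the site root:

rules = {
    "*": ["/archive/", "/tmp/"],      # all robots: block two directories
    "Googlebot": ["/myFile.aspx"],    # Googlebot only: block one file
}

lines = []
for agent, paths in rules.items():
    lines.append("User-agent: " + agent)
    lines.extend("Disallow: " + path for path in paths)
    lines.append("")                  # blank line separates rule groups

with open("robots.txt", "w") as f:
    f.write("\n".join(lines))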