The robots exclusion protocol is a convention that asks cooperating web spiders and robots not to access parts of a website. Compliance is voluntary; the protocol does not technically block access.
To use this protocol, place a file named robots.txt in the root directory of the website (for example, http://example.com/robots.txt).
The file is a plain text file containing one or more groups, each made up of a "User-agent" clause and one or more "Disallow" clauses. Each group describes how specific crawlers may access the site.
A group begins with a "User-agent" clause specifying which robots the group applies to. You may use the asterisk (*) as a wildcard.
- To specify a bot named "Wahoo", use:
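```
User-agent: Wahoo
```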
- To mark all bots, use:
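```
User-agent: *
```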
- To specify bots starting with "Gougle", use:
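```
User-agent: Gougle*
```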
After a User-agent clause, you may use one or more "Disallow" clauses to specify the directories you want to "hide" from the bot.
- To allow the bot to access all files, specify a blank Disallow clause:
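```
Disallow:
```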
- To disallow the bot from all files, use:
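```
Disallow: /
```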
- Disallow the bot from specific directories:
Disallow: /cgi-bin/
Disallow: /private/
- Files may also be excluded.
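For example, to exclude a single file (the path below is only illustrative):

```
Disallow: /private/secret.html
```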
- Comments are preceded by the number sign (#):
# A line may start with a comment
User-agent: Wahougle*  # You can also place comments here
Disallow: /private  # This keeps matching bots out of the /private directory
- Groups are processed in order from top to bottom. For example, if a bot matches the first group, later groups do not apply to that bot. (Thus you would usually put the catch-all "User-agent: *" group at the bottom.)
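For example, the following gives a bot named "Wahoo" its own rules, while all other bots fall through to the final catch-all group (the paths are only illustrative):

```
User-agent: Wahoo
Disallow: /cgi-bin/

User-agent: *
Disallow: /
```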
- A "Crawl-delay" clause asks the bot to wait the specified number of seconds between successive requests to the same website:
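A sketch of such a group (the ten-second value is only illustrative):

```
User-agent: *
Crawl-delay: 10
```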
- An "Allow" clause makes an exception to a following "Disallow" clause. The following allows access to the musicfiles.html file but not to other files in the music directory:
Allow: /music/musicfiles.html
Disallow: /music