The robots.txt file consists of groups of rules that determine the behavior of robots on the site.
The file must be named exactly robots.txt and must be encoded in UTF-8. It must not be larger than 32 KB, and it must be located in the root directory of the site, that is, it must be accessible in a browser at an address of the form http://www.example.com/robots.txt.
Each group can contain several rules of the same kind; this is useful, for example, for specifying multiple robots or multiple pages.
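For example, a single group that applies to two robots at once and denies them access to two sections of the site might look like this (the robot names and paths here are only placeholders):
User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin/
Disallow: /tmp/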
A rule group must consist of the following directives, in the order given:
User-agent — an obligatory directive; it can be specified multiple times in one rule group.
Disallow and Allow — obligatory directives; at least one of them must be present in each rule group.
Host, Crawl-delay, Sitemap — optional directives.
To specify patterns, two wildcard characters are used:
* — matches any sequence of characters, of any length.
$ — marks the end of the URL.
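For example, the following rules (the file type and URL parameter are only illustrations) use these characters to deny access to any URL ending in .pdf and to any URL containing a print parameter:
User-agent: *
# any URL that ends in .pdf
Disallow: /*.pdf$
# any URL that contains ?print= anywhere after the first /
Disallow: /*?print=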
The User-agent directive defines the name of the robot that the rule group applies to. To address all robots, you can use:
User-agent: *
If a group is specified for a particular robot by name, that robot will ignore the group with *.
The following directives allow access for the robot named Googlebot and deny it to all other robots:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
The Disallow directive defines pages to which robots are denied access.
You can deny access to the entire site by specifying:
Disallow: /
Access to individual pages can be denied as follows:
Disallow: /admin
The Allow directive defines pages to which robots are allowed access. It is used to make exceptions to rules specified with Disallow.
The following rules block the entire site for the robot Googlebot, except for the /pages/ directory:
User-agent: Googlebot
Disallow: /
Allow: /pages/
The Host directive defines the main domain of the site. It is useful when several domain names are bound to the site: for correct search indexing you can specify which domain is the main one, so that the remaining domains are treated as mirrors, technical addresses, and so on.
An example of using the directive on a site with the domains example.com and domain.com, where example.com will be the main domain for all robots:
User-agent: *
Disallow:
Host: example.com
The Crawl-delay directive defines the interval between the end of loading one page and the beginning of loading the next. It is useful for reducing the number of requests to the site and thereby the load on the server. The interval is specified in seconds.
Usage example:
User-agent: *
Disallow:
Crawl-delay: 3
The Sitemap directive defines the URL of the sitemap file on the site. It can be specified multiple times. The address must be given in the form protocol://address/path/to/sitemap.
Usage example:
Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
If robots.txt is to be generated dynamically, the static robots.txt file must be removed; in addition, in the site settings either the parameter "Send requests to the backend if the file is not found" must be enabled, or the txt extension must be removed from the list of static files.
If the site uses several domains, for example through aliases, the settings specified in robots.txt may need to differ for each of them because of SEO optimization or other tasks. To implement a dynamic robots.txt, do the following:
1. Create a file named domain.com-robots.txt in the root directory of the site, where instead of domain.com you specify the domain to which the rules will apply.
2. Add the following rules to the .htaccess file:
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/robots\.txt$
RewriteRule ^robots\.txt$ %{HTTP_HOST}-robots.txt [L]
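With these rules, a request for http://example.com/robots.txt is internally rewritten to the file example.com-robots.txt in the site root. Note that %{HTTP_HOST} contains the host name exactly as it was requested, so a www variant would need its own file (for example, www.example.com-robots.txt) or an additional rewrite. For a site answering on the domains example.com and domain.com from the example above, the root directory would then contain:
example.com-robots.txt
domain.com-robots.txt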