What Is robots.txt And How To Create It?
robots.txt is one of the most important files that can be on your website. Its main task is to communicate with search engine robots such as Googlebot and instruct them which pages or parts of the site can or cannot be crawled and indexed. In this article, we will discuss how to write a robots.txt file, the benefits of a properly configured robots.txt file, its capabilities, and provide useful tools and links to documentation.
What is robots.txt?
robots.txt is a file located in the site's root directory that tells search engine robots which parts of the site they can scan and which they should avoid. It is the first file that crawlers check when they visit your site.
How does robots.txt work?
When a search engine robot visits a site, the first thing it does is check the robots.txt file. If the file exists, the crawler reads its instructions to find out which pages can be indexed and which should be skipped. If the file doesn't exist, the crawler assumes it can crawl the entire site.
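For instance, a crawler might find nothing more than a single rule; a minimal sketch, in which the /private/ path is only a placeholder chosen for illustration:
```txt
# Applies to every crawler
User-agent: *
# Keep this directory out of the crawl
Disallow: /private/
```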
Where should robots.txt be?
The robots.txt file should be available at the URL /robots.txt of your domain, e.g. https://syki.dev/robots.txt
How to write a robots.txt file?
Basic properties
The robots.txt file consists of directives that tell robots what to do. Here are the most important ones (each is illustrated in the sketch after this list):
- User-agent: Specifies the search engine robot to which the instructions are directed. This can be the name of a specific bot (e.g. Googlebot) or * for all bots.
- Disallow: Tells the crawler not to index the specified page or directory.
- Allow: Tells the crawler that it can index the specified page or directory, even if it sits inside a directory that has been excluded by the Disallow directive.
- Sitemap: Specifies the location of the site's sitemap, an XML file that lists all the pages of your site that should be indexed by search engine robots.
- Host: Specifies the preferred domain (main mirror) of the site.
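A minimal sketch putting these directives together; the bot name, paths, sitemap URL and domain below are only placeholder values chosen for illustration:
```txt
# Rules for Google's crawler only
User-agent: Googlebot
Disallow: /private/

# Rules for all other crawlers
User-agent: *
Disallow: /tmp/
Allow: /tmp/public-report.html

# Location of the XML sitemap
Sitemap: https://www.example.com/sitemap.xml

# Preferred domain (main mirror)
Host: www.example.com
```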
Advanced properties
These directives are rarely used because most bots do not honour them, but they can be useful in some cases (see the sketch after this list):
- Clean-param: Specifies URL parameters that should be removed (ignored) before indexing the page.
- Noindex: Specifies that the page or directory should not be indexed by search engine robots.
- Crawl-delay: Specifies the delay in seconds that the crawler should wait between successive requests to the site.
- Request-rate: Specifies the maximum number of requests that the robot can make within a certain period of time.
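A sketch of how these advanced directives might look in practice; the parameter names, paths and values are only illustrative, and most major crawlers will simply ignore these lines:
```txt
User-agent: *
# Treat URLs that differ only by these tracking parameters as one page
Clean-param: ref&sessionid /catalog/
# Wait 10 seconds between requests
Crawl-delay: 10
# At most 1 request every 5 seconds
Request-rate: 1/5
# Ask robots not to index this directory
Noindex: /drafts/
```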
Formatting rules
The robots.txt file must follow a specific format:
- Each property must be on a separate line.
- Property names are not case-sensitive (Disallow and disallow are treated the same), but the path values are case-sensitive, so /Admin/ and /admin/ refer to different paths.
- The Disallow and Allow properties may contain paths to specific pages or directories that should or should not be indexed.
Here is an example of a robots.txt file:
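The file below is only an illustrative sketch; the /admin/ paths and the sitemap address are placeholder values:
```txt
# Applies to every crawler
User-agent: *
# Keep the admin section out of the crawl
Disallow: /admin/
# ...except for this one public page inside it
Allow: /admin/help.html
# Location of the XML sitemap
Sitemap: https://www.example.com/sitemap.xml
```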
Robots.txt usage examples
The robots.txt file can be configured in many different ways, depending on the needs of your website. Here are some examples; an illustrative sketch of each one follows the list:
- Allow all bots to scan all pages.
- Block the entire site for all robots. Use such a file when the site should never be displayed in Google or scanned by bots, for example an admin panel.
- Block a specific directory for all robots.
- Block a specific robot.
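The sketches below show, in order, what each of these configurations typically looks like; the directory name /private-directory/ and the bot name BadBot are placeholders chosen for illustration:
```txt
# 1. Allow all bots to scan all pages
User-agent: *
Disallow:
```
```txt
# 2. Block the entire site for all robots
User-agent: *
Disallow: /
```
```txt
# 3. Block a specific directory for all robots
User-agent: *
Disallow: /private-directory/
```
```txt
# 4. Block a specific robot from the entire site
User-agent: BadBot
Disallow: /
```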
Benefits of using robots.txt correctly
A properly configured robots.txt file can bring many benefits:
- Control over which parts of your site are indexed by search engines.
- Prevention of indexing of pages or directories that should not be public.
- Savings in server resources by limiting the crawling of irrelevant pages.
- Improved SEO by focusing robots' attention on the most important pages.
Robots.txt development and testing tools
Creating and testing the robots.txt file can be facilitated by various tools:
- Google's Robots Testing Tool - a tool for testing robots.txt files, available in Google Search Console.
- Google Robots.txt Documentation - Google's official documentation on how its crawlers interpret the robots.txt file.
In addition, the full specification of robots.txt is available at The Web Robots Pages.
Summary
The robots.txt file is an essential element of any website that allows you to control search engine robots' access to different parts of the site. A properly configured robots.txt file helps focus robots' attention on the most important pages, saves server resources and improves SEO.
Note that while the robots.txt file is a powerful tool, not all robots respect it - some malicious bots may deliberately ignore it to scan and index pages that should be kept private. Therefore, robots.txt should not be the only means of protecting private pages - always use appropriate security and access permissions to protect your data.
In this article, I covered the basics of creating and using a robots.txt file, but it is a much more complex topic with many nuances and possibilities. It's always a good idea to consult the official documentation and other trusted sources to learn more.