What Is robots.txt And How To Create It?
robots.txt is one of the most important files that can be on your website. Its main task is to communicate with search engine robots such as Googlebot and instruct them which pages or parts of the site can or cannot be crawled and indexed. In this article, we will discuss how to write a robots.txt file, the benefits of a properly configured robots.txt file, its capabilities, and provide useful tools and links to documentation.
What is robots.txt?
robots.txt is a file located in the site's root directory that tells search engine robots which parts of the site they can scan and which they should avoid. It is the first file that crawlers check when they visit your site.
How does robots.txt work?
When a search engine robot visits a site, the first thing it does is check the robots.txt file. If the file exists, the crawler reads its instructions to find out which pages can be indexed and which should be skipped. If the file doesn't exist, the crawler assumes it can crawl the entire site.
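For instance, a crawler might find nothing more than a single rule; a minimal sketch, in which the /private/ path is only a placeholder chosen for illustration:
```txt
# Applies to every crawler
User-agent: *
# Keep this directory out of the crawl
Disallow: /private/
```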
Where should robots.txt be?
The robots.txt file should be available at the URL /robots.txt of your domain, e.g. https://syki.dev/robots.txt
How to write a robots.txt file?
Basic properties
The robots.txt file consists of directives that tell robots what to do. Here are the most important ones (each is illustrated in the sketch after this list):
- User-agent: Specifies the search engine robot to which the instructions are directed. This can be the name of a specific bot (e.g. Googlebot) or * for all bots.
- Disallow: Tells the crawler not to index the specified page or directory.
- Allow: Tells the crawler that it can index the specified page or directory, even if it sits inside a directory that has been excluded by the Disallow directive.
- Sitemap: Specifies the location of the site's sitemap, an XML file that lists all the pages of your site that should be indexed by search engine robots.
- Host: Specifies the preferred domain (main mirror) of the site.
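A minimal sketch putting these directives together; the bot name, paths, sitemap URL and domain below are only placeholder values chosen for illustration:
```txt
# Rules for Google's crawler only
User-agent: Googlebot
Disallow: /private/

# Rules for all other crawlers
User-agent: *
Disallow: /tmp/
Allow: /tmp/public-report.html

# Location of the XML sitemap
Sitemap: https://www.example.com/sitemap.xml

# Preferred domain (main mirror)
Host: www.example.com
```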
Advanced properties
These directives are rarely used because most bots do not honour them, but they can be useful in some cases (see the sketch after this list):
- Clean-param: Specifies URL parameters that should be removed (ignored) before indexing the page.
- Noindex: Specifies that the page or directory should not be indexed by search engine robots.
- Crawl-delay: Specifies the delay in seconds that the crawler should wait between successive requests to the site.
- Request-rate: Specifies the maximum number of requests that the robot can make within a certain period of time.
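A sketch of how these advanced directives might look in practice; the parameter names, paths and values are only illustrative, and most major crawlers will simply ignore these lines:
```txt
User-agent: *
# Treat URLs that differ only by these tracking parameters as one page
Clean-param: ref&sessionid /catalog/
# Wait 10 seconds between requests
Crawl-delay: 10
# At most 1 request every 5 seconds
Request-rate: 1/5
# Ask robots not to index this directory
Noindex: /drafts/
```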
Formatting rules
The robots.txt file must follow a specific format:
- Each property must be on a separate line.
- Property names are not case-sensitive (Disallow and disallow are treated the same), but the path values are case-sensitive, so /Admin/ and /admin/ refer to different paths.
- The Disallow and Allow properties may contain paths to specific pages or directories that should or should not be indexed.
Here is an example of a robots.txt file:
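The file below is only an illustrative sketch; the /admin/ paths and the sitemap address are placeholder values:
```txt
# Applies to every crawler
User-agent: *
# Keep the admin section out of the crawl
Disallow: /admin/
# ...except for this one public page inside it
Allow: /admin/help.html
# Location of the XML sitemap
Sitemap: https://www.example.com/sitemap.xml
```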
Robots.txt usage examples
The robots.txt file can be configured in many different ways, depending on the needs of your website. Here are some examples; an illustrative sketch of each one follows the list:
- Allow all bots to scan all pages.
- Block the entire site for all robots. Use such a file when the site should never be displayed in Google or scanned by bots, for example an admin panel.
- Block a specific directory for all robots.
- Block a specific robot.
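The sketches below show, in order, what each of these configurations typically looks like; the directory name /private-directory/ and the bot name BadBot are placeholders chosen for illustration:
```txt
# 1. Allow all bots to scan all pages
User-agent: *
Disallow:
```
```txt
# 2. Block the entire site for all robots
User-agent: *
Disallow: /
```
```txt
# 3. Block a specific directory for all robots
User-agent: *
Disallow: /private-directory/
```
```txt
# 4. Block a specific robot from the entire site
User-agent: BadBot
Disallow: /
```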
Benefits of using robots.txt correctly
A properly configured robots.txt file can bring many benefits:
- Control over which parts of your site are indexed by search engines.
- Prevention of indexing of pages or directories that should not be public.
- Savings in server resources by limiting the crawling of irrelevant pages.
- Improved SEO by focusing robots' attention on the most important pages.
Robots.txt development and testing tools
Creating and testing the robots.txt file can be facilitated by various tools:
- Google's Robots Testing Tool - a tool for testing robots.txt files, available in Google Search Console.
- Google Robots.txt Documentation - Google's official documentation on how its crawlers interpret the robots.txt file.
In addition, the full specification of robots.txt is available at The Web Robots Pages.
Summary
The robots.txt file is an essential element of any website that allows you to control search engine robots' access to different parts of the site. A properly configured robots.txt file helps focus robots' attention on the most important pages, saves server resources and improves SEO.
Note that while the robots.txt file is a powerful tool, not all robots respect it - some malicious bots may deliberately ignore it to scan and index pages that should be kept private. Therefore, robots.txt should not be the only means of protecting private pages - always use appropriate security and access permissions to protect your data.
In this article, I covered the basics of creating and using a robots.txt file, but it is a much more complex topic with many nuances and possibilities. It's always a good idea to consult the official documentation and other trusted sources to learn more.