Robots.txt: Understanding Commands With Examples


Robots.txt: maybe you’ve heard of it, maybe you haven’t. Either way, this guide will show you how to use the robots exclusion standard, or protocol, on your website.

So, what is it?

It’s a file that webmasters use to communicate with web crawlers and bots (typically search engine robots). The robots exclusion standard defines the rules, or language, used to regulate how bots access, crawl, and index information from a website.

Using this file, you can limit access to the information on your website, such as preventing the crawling of specific directories, subdirectories, and web pages. You can even prevent certain user-agents (another term for bot software) from crawling your whole site. Neat, right?

How to Use Robots.txt With Examples

Imagine that you’ve got a massive website with millions of web pages and plenty of regular visitors. Your content changes frequently and thousands of new pages are generated by your users daily. Feeling happy? Good, you should.

The only problem is that when web servers handle too many requests at the same time, they can become overwhelmed and temporarily take your website offline. Not good!

Luckily, you can reduce the risk of overwhelming your server by using the robots.txt file. Create a new plain-text file and name it robots.txt (lowercase, as we’ll see later). Then enter the following.

User-agent: Bingbot
Crawl-delay: 60

The user-agent directive is where you specify the bot name, which you can get from this list of crawlers. Crawl-delay tells the bot to wait a minimum amount of time between crawl requests. Essentially, your website is saying something like ‘hey robot, please wait 60 seconds before crawling another web page’.

Using an asterisk as the user agent means that your commands should apply to every robot.

User-agent: *
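
Keep in mind that a user-agent line on its own doesn’t do anything; it needs at least one directive underneath it to form a complete group. A minimal example might look like this (the 120-second delay is just an illustrative value):

User-agent: *
Crawl-delay: 120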

But what if we also wanted to prevent access to specific web pages or directories?

In this case, we would use the Disallow and Allow commands. The former refuses access while the latter grants it. So if we wanted to deny access to a particular directory or web page, we would add the following to our robots.txt file.

Disallow: /directory
Disallow: /dir/web-page.html

To allow crawling:

Allow: /directory
Allow: /dir/web-page.html
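
And as mentioned earlier, a single Disallow rule with just a forward slash blocks a bot from your entire site. For example, to turn away a hypothetical crawler called BadBot:

User-agent: BadBot
Disallow: /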

Additionally, you can list your sitemaps in the robots.txt file. A sitemap lists the pages on a given site and can also pass extra details about those pages to bots, such as a page’s priority (importance), last modified date, and change frequency.

To specify one or more in your robots.txt file, use the Sitemap command like so:

Sitemap: https://domain.com/sitemap.xml
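
If your site uses more than one sitemap, simply list each one on its own line (the file names below are just placeholders):

Sitemap: https://domain.com/sitemap-posts.xml
Sitemap: https://domain.com/sitemap-pages.xml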

We’ll dive deeper into sitemaps in a later article in our on-page series, but in general, sitemaps are no longer as important for SEO, provided that your web pages are properly interlinked.

So our complete robots.txt file would look somewhat like this. Note that a bot such as Bingbot follows only the most specific group that names it, so it will read the Bingbot rules and skip the generic * group:

# Specific to Bingbot (use a # to write comments in your file).

User-agent: Bingbot
Crawl-delay: 60
Disallow: /dir/web-page.html
Allow: /directory

# Any other bot that reads and respects the directives within this file.

User-agent: *
Crawl-delay: 120
Disallow: /directory
Allow: /dir/web-page.html
Sitemap: https://domain.com/sitemap.xml

For a live example, please see Twitter’s robots.txt file. And if you have questions, drop them in the comments.

Important Things to Know About Robots.txt

As with many things in technology, there are a few things you should keep in mind when using robots.txt.

  1. For your file to be found, it needs to be in your site’s root directory (e.g. domain.com/robots.txt).
  2. Each domain and subdomain uses its own separate robots.txt file.
  3. Don’t get creative with the file name because it’s case sensitive.
  4. Malicious bot software or some user agents may ignore your robots.txt.
  5. The file is publicly accessible, meaning that anyone can view it.
  6. Some search engines use multiple user-agents, so make sure you’re targeting all of them (see the example after this list).
  7. Always triple check to ensure that you’re not blocking any content that you want to be crawled.
  8. Links from pages that you’ve blocked via robots.txt will not be followed and no link equity (or authority) will be passed.
  9. Don’t use robots.txt to prevent access to sensitive data, because all a malicious bot has to do is ignore it. It’s not a security measure; use proper access controls instead.
  10. The Allow command is not part of the original standard; major search engines such as Google and Bing support it, but some crawlers may ignore it.
  11. Every directive should be written on a separate line.
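
To illustrate point 6, Google alone crawls with several user-agents, such as Googlebot for web search and Googlebot-Image for image search. One way to cover them with the same rules is to stack the user-agent lines above a shared group (the blocked directory here is just an example):

User-agent: Googlebot
User-agent: Googlebot-Image
Disallow: /example-directory/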

Do I Need a Robots.txt File?

If you’re thinking that it’s easier to just let the bots crawl everything, then yes, you’re right. But sometimes it’s not that simple. There are several reasons why you may want to use a robots.txt file.

  • You can prevent duplicate content from appearing on the SERPs (search engine results pages).
  • It can be a quick way of submitting your sitemaps or ensuring that search engine bots always read them.
  • For the most part, it can keep certain sections of your website private.
  • It can help keep certain documents, such as PDFs, from showing up on a public SERP (though a blocked URL can still be indexed if other sites link to it).
  • The crawl-delay command is useful for reducing the likelihood of server overload.
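
For instance, a short file covering a couple of these points might keep a private area out of the crawl and point bots at the sitemap (the path and URL below are just placeholders):

User-agent: *
Disallow: /admin/
Sitemap: https://domain.com/sitemap.xml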

A robots.txt file is still an important tool for SEO.


>> RELATED: Duplicate Content And How It Affects SEO

Want a heads-up whenever a new article drops? Subscribe here