What is a robots.txt file and how can it help improve your SEO strategy? I’m glad I pretended you asked, because the aim of this post is to demystify robots.txt while offering insight into how this simple file can be used to help control and limit access to your website.
In the world of the World Wide Web, robots are less exciting than they sound – unless the notion of software bots visiting websites sets your soul on fire.
The most common type of web robot is the search engine crawler. We like these types of bots and do everything in our power to welcome them into our homes. Bots of this nature are responsible for crawling and indexing the billions of websites on the Internet.
But much like we don’t give every guest in our home free rein of the house, we also don’t necessarily want to give web bots free rein of our websites.
What is robots.txt?
The robots exclusion standard was developed in the 1990s in an effort to control the ways that web bots could interact with websites. Robots.txt is the part of this standard that allows you to dictate which bots can access specific pages on a website. Akin to a bouncer at a nightclub, robots.txt enables you to restrict access to certain areas of the site or block certain bots entirely.
The robots.txt file accomplishes this feat by specifying which user agents (web bots) can crawl the pages on your site, providing instructions that “allow” or “disallow” their behavior. Simply put, robots.txt tells web bots which pages to crawl and which pages not to crawl.
Search engines exist to crawl and index web content in order to offer it up to searchers based on implied intent. They accomplish this by following links that spider from site to site, spanning the breadth and relevance of a particular subject. This has proven to be a remarkably effective feat.
When a web bot is about to crawl a site, it first checks for a robots.txt file to see if there are any specific crawling instructions. Sort of a virtual knock on the door. From there, the search engine bot will proceed accordingly.
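That "knock on the door" can be sketched with Python's standard-library urllib.robotparser. This is a minimal illustration, not how any particular search engine is implemented: the robots.txt content, the ExampleBot name, and the example.com URLs are all made up, and the file is parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this from
# https://<host>/robots.txt before requesting any other page on the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite bot asks first, then proceeds accordingly.
print(may_crawl(ROBOTS_TXT, "ExampleBot", "https://example.com/blog/post"))
print(may_crawl(ROBOTS_TXT, "ExampleBot", "https://example.com/private/report"))
```

The first URL falls outside the disallowed folder and the second falls inside it, so a well-behaved bot would fetch only the first.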
The 5 Most Common Terms Used in robots.txt Syntax
Robots.txt syntax is essentially the language of robots.txt files. Here are the five most common terms you’ll encounter in a robots.txt file. Having an awareness of these terms will give you a better idea of what you’re working with.
- User-agent: This term refers to the specific web crawler being given instructions.
- Disallow: This command instructs the specified user-agent not to crawl a particular URL path. Each Disallow line takes a single path, so use one line per path you want blocked.
- Allow: This command lets a crawler access pages or subfolders whose parent pages or subfolders are disallowed. It began as a Googlebot extension, but it is now part of the formal standard (RFC 9309) and is honored by the major search engine crawlers.
- Crawl-delay: This command tells a crawler how many seconds it should wait between requests. It is not part of the formal standard and support varies by crawler. Googlebot ignores it and manages its crawl rate on its own.
- Sitemap: This directive discloses the location of the domain’s XML sitemap.
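To see the five terms working together, here is a small sketch that feeds a made-up robots.txt to Python's urllib.robotparser. The ExampleBot name, the paths, and the sitemap URL are assumptions chosen for illustration. Note one quirk of this particular parser: it applies the first matching rule in file order, whereas Google and RFC 9309 pick the most specific match, so the Allow line is written first here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file exercising all five terms. "ExampleBot" and the
# example.com URLs are placeholders, not taken from a real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /archive/public/
Disallow: /archive/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# The Allow line carves an exception out of the broader Disallow below it.
public_ok = parser.can_fetch(
    "ExampleBot", "https://example.com/archive/public/2020.html"
)
archive_ok = parser.can_fetch(
    "ExampleBot", "https://example.com/archive/2020.html"
)
delay = parser.crawl_delay("ExampleBot")  # seconds, or None if not set
sitemaps = parser.site_maps()  # requires Python 3.8 or newer

print(public_ok, archive_ok, delay, sitemaps)
```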
How Does robots.txt Work?
To illustrate how a robots.txt file works in a real-world application, we’ll show rather than tell. Here’s an example of a common WordPress robots.txt file:
This is a clean robots.txt file that blocks very little.
- User-agent: * (applies the rules that follow to all web bots)
- Disallow: /wp-admin/ (prevents the admin page from being crawled)
This simple robots.txt file gets the job done, and is fine in most cases. However, robots.txt files can get far more specific depending on the demands of various websites. Instructions can be included for multiple user-agents, allows, disallows, crawl-delays, sitemap location, etc. When multiple user-agents exist, each user-agent will only follow the allows and disallows that are specific to them.
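The two-line WordPress file described above can be tested directly, along with the point about multiple user-agents. In this sketch the first group is the WordPress example; the second group, the ExampleBot name, and the /drafts/ path are hypothetical additions to show that a bot with its own group follows only that group.

```python
from urllib.robotparser import RobotFileParser

# First group: the common WordPress rules described in the text.
# Second group: a hypothetical bot-specific group added for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

User-agent: ExampleBot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def check(agent: str, path: str) -> bool:
    """Shortcut: can this agent fetch this path on example.com?"""
    return parser.can_fetch(agent, "https://example.com" + path)


# Bots without their own group fall back to the "*" rules.
print(check("SomeOtherBot", "/wp-admin/options.php"))
print(check("SomeOtherBot", "/drafts/post"))

# ExampleBot has its own group, so the "*" rules no longer apply to it.
print(check("ExampleBot", "/wp-admin/options.php"))
print(check("ExampleBot", "/drafts/post"))
```

If ExampleBot should also stay out of /wp-admin/, that Disallow has to be repeated inside its own group.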
The decision to disallow certain pages can be advantageous when it comes to SEO. But we’ll get to that later. Generally speaking, you absolutely want web bots to crawl (and index) your site. But use extreme caution. Botching a robots.txt file can result in your site being deindexed, which would be a disaster.
It is worth noting that not all web robots adhere to robots.txt crawl instructions – not unlike the National Do Not Call Registry. These bots tend to fall under the sinister bot category that includes malware, email harvesting, and spambots – again, not unlike the National Do Not Call Registry. Robots.txt files act more as suggestions than laws. If your site is having security issues involving bots, there are companies that offer security features in this realm.
As with all things SEO, it’s best to stick to best practices while the search engine overlords do as they please. We must do what we can. For a masterclass in how specific and comprehensive a robots.txt file can be, check out the Wikipedia.org robots.txt file.
How To Check Any Website for robots.txt?
Pro Tip: you can see any website’s robots.txt file by simply adding the path /robots.txt after the domain name. This will give you an idea of how different sites implement robots.txt files to suit their strategies. Looking under the hood is a valuable training exercise.
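If you want to do the same thing in a script, the robots.txt address can be derived from any page URL by keeping the scheme and host and swapping the path. A minimal sketch using only the standard library follows; the helper name robots_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://en.wikipedia.org/wiki/Robots.txt"))
print(robots_url("https://blog.example.com/2024/post?ref=home"))
```

Because the host is preserved, a page on blog.example.com maps to blog.example.com/robots.txt rather than to the file on example.com.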
Why is a robots.txt File Important?
The most important benefits of creating a robots.txt file have to do with what Google refers to as Crawl Rate Limit and Crawl Demand.
Crawl Rate Limit
- A well-structured robots.txt file can help optimize a search engine bot’s crawl resources by telling it not to spend crawl budget on unimportant pages that don’t need to be indexed. This increases the likelihood that search engines will allocate crawl resources to your most important pages.
- A well-structured robots.txt file can help optimize your server resources by blocking web bots that strain them.
Using robots.txt in the right way allows you to suggest to search engine bots how you want to allocate your site’s crawl budget. This makes the robots.txt file a useful tactic in the technical SEO playbook.
Robots.txt Best Practices
- The robots.txt file must be placed in a website’s top-level directory. This is where web bots begin their crawling.
- The filename is case sensitive: when creating a robots.txt file, be sure to name it “robots.txt”, all lowercase.
- Do not use robots.txt files in an attempt to hide private user information. These files are publicly available. Anyone can see what pages you want to allow or disallow. Use discretion.
- When creating a robots.txt file, go to the source. While Google isn’t the only player in the search engine game, they dominate it. It’s best to follow their processes.
- Each subdomain on a root domain uses separate robots.txt files. This means that both blog.example.com and example.com should have their own robots.txt files (at blog.example.com/robots.txt and example.com/robots.txt).
- Include the location of your domain’s sitemap. By convention, this goes at the bottom of the robots.txt file. Don’t rely on this alone; submit the sitemap directly to Google Search Console as well. Cover all your bases.
- View the robots.txt database to see the most common user-agents.
Using robots.txt for SEO
Before tweaking an existing robots.txt file or creating a new one, this caveat is worth another mention: do not use robots.txt to block pages from search engines. It only serves to keep pages from being crawled, not from being indexed. If your goal is to prevent pages from being indexed, use another method such as password protection or a noindex directive.
Using robots.txt for SEO will largely depend on your site’s specific content and layout. There are several ways of using robots.txt to your advantage.
One of the best ways to use robots.txt to improve your SEO is in optimizing your crawl budget through disallows. This is especially relevant for pages that aren’t displayed publicly. A “thank you” page for a newsletter signup is a prime example. Another example would be an admin login page. There’s no reason for pages of this nature to be crawled, indexed, or prioritized. Instructing bots to not crawl unimportant pages on your site can free up resources that have the potential to be used crawling more important pages.
Another SEO-related use of robots.txt is in intentional duplicate content. Generally speaking, duplicate content is an SEO no-no. But there are certain circumstances in which it is acceptable, such as a printer-friendly version of a web page. In this case, you should instruct web bots to not crawl the printer-friendly version to avoid a duplicate content issue.
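The disallows discussed here (a thank-you page, an admin login, printer-friendly duplicates) can be generated and sanity-checked in a few lines. This is only a sketch: the three paths, the sitemap URL, and the build_robots_txt helper are hypothetical, so substitute whatever your site actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical low-value paths; adjust to match your own site layout.
LOW_VALUE_PATHS = ["/thank-you/", "/admin/", "/print/"]


def build_robots_txt(disallowed_paths, sitemap_url=None):
    """Build a robots.txt that blocks the given paths for all bots."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(LOW_VALUE_PATHS, "https://example.com/sitemap.xml")
print(robots_txt)

# Check the result before deploying it: a typo such as "Disallow: /"
# would block the whole site.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("AnyBot", "https://example.com/print/pricing"))
print(parser.can_fetch("AnyBot", "https://example.com/pricing"))
```

Testing a handful of important URLs against the generated file is a cheap guard against the deindexing disaster mentioned earlier.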
To put a bow on this robots.txt primer, there are two other directives you should be familiar with: noindex and nofollow. As mentioned above, the disallow command doesn’t prevent pages from being indexed, just crawled. So it is possible to disallow a page and still have it end up being indexed.
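Stress not, that’s what the noindex directive is here for. There is one catch: a crawler has to fetch a page to see its noindex, whether that is a robots meta tag or an X-Robots-Tag HTTP header, so a page that is also disallowed in robots.txt never gets its noindex read. Google also stopped honoring noindex lines placed inside robots.txt itself in 2019. If your site has pages that you don’t want to be indexed by search engines, leave them crawlable and apply noindex to the page itself. This will drastically increase the likelihood that the page won’t show up in the SERPs.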
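A noindex signal usually lives on the page itself, either as a robots meta tag in the HTML or as an X-Robots-Tag response header. The sketch below checks a page for either one. The has_noindex helper is hypothetical, and it takes the HTML and headers as plain arguments so it can run without fetching anything.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html: str, headers: dict) -> bool:
    """Return True if the page asks search engines not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_tokens = {token.strip().lower() for token in header_value.split(",")}
    return "noindex" in finder.directives or "noindex" in header_tokens


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page, {}))
print(has_noindex("<html></html>", {"X-Robots-Tag": "noindex"}))
print(has_noindex("<html></html>", {}))
```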
Robots.txt can be used as another weapon in your optimization arsenal. Taking the time to properly set it up can improve your SEO. Sustainable SEO is all about the small gains that accumulate over time. Making attempts to influence the way that search bots allocate your crawl budget can ultimately lead to your content ranking higher in the SERPs. Creating and implementing a robots.txt file requires a (mostly) one-time effort, but can pay dividends to your long-term rankings.