When a search engine like Google comes to a website, it tries to crawl as much information as possible and then index all the pages it finds in its results listings. However, sometimes we want certain sections and pages to stay out of those results and not be indexed. This is where the robots.txt file comes into play.
Also called a robots exclusion file, the robots.txt file of a website allows us to give a series of instructions to search engines so that they do not crawl or index certain parts of our page.
Due to its importance, and because it is a powerful tool for improving SEO, today we will see what the robots.txt file is, what exactly it is for, the main instructions we can give to search engines, and how to create a robots.txt for WordPress and for any other type of website.
In addition, I will also give you an example of robots.txt for WordPress and I will explain what the difference is between using this file and the Meta Robots tag.
What is the robots.txt file and what is it for?
All search engines have a series of spiders or crawling robots that roam the Internet looking for new pages. When they come to a website, the first thing these spiders do is visit its robots.txt file. Depending on the directions we give them, those spiders will crawl our site or go back where they came from.
Therefore, the robots.txt file will allow you to keep some sections of your website out of search engines, or if you prefer, your entire website.
Thanks to this robot exclusion standard, you will be able to do the following:
- Tell search engines like Google not to index parts of your website that are private or that you don’t want to appear in the results lists.
- If you are still building your web page, you can prevent search engines from indexing it until you have finished it.
- You can prevent Google from indexing duplicate content and thus avoid possible penalties. Because, in case you didn’t know, Google doesn’t like duplicate content.
- If you have a private area or a client area, you can hide it so that it does not appear in Google and can only be reached from within your page or by knowing the exact URL.
- Tell Google where your sitemap is so that it can reach the important pages of your website more quickly.
Keep in mind that the robots.txt file is public. That is, anyone can see which content and sections you have blocked simply by visiting domain.com/robots.txt. This is not a flaw or a disadvantage, since robots.txt is only intended to give instructions to search engines.
Therefore, if you have private pages that you do not want a user to know or access, protect them with a password, for example.
Main commands you can use in your robots.txt file
In addition to knowing what robots.txt is, it is essential that you know how to use its commands. The good thing about these commands is that they are standard. That is, they are understood by most search engines. However, in the robots.txt file we can give different indications to each of them.
The first thing to write when creating a robots.txt is the search engine we want to address. For this we use the User-agent command, indicating which robot the following directives apply to:
- User-agent: With this command we indicate the search engine we want to affect with the following indications.
- Disallow: It allows us to prohibit access to certain pages or directories of our site.
- Allow: It is the opposite command to Disallow. If we want to give search engines access to a specific page within a directory on our website that we have previously denied access using Disallow, the Allow command will be the appropriate option.
- Sitemap: It tells search engines where our site map is. This way, the search engine can easily find all the pages of our website, since the main ones are listed there.
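Putting these four commands together, a minimal robots.txt could look like the sketch below. The paths and the sitemap URL are placeholders for illustration, not real recommendations:

```text
User-agent: *
Disallow: /private/
Allow: /private/visible-page.html
Sitemap: https://www.abc.com/sitemap_index.xml
```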
Examples and cases in which to use the different commands
- Specify which robot we are addressing: This is the first directive we must give in our robots.txt file. The usual thing is to give the same indications to all of them:
User-agent: * → for all robots.
User-agent: Googlebot → for Google’s robot.
User-agent: Bingbot → for Bing’s robot.
- Website under construction: If your website is under construction and you don’t want search engines to index it in their listings yet, you should use the Disallow command to block the entire site.
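For a site under construction, the usual pattern is to address all robots and block everything from the root. Remember to remove this once the site goes live:

```text
User-agent: *
Disallow: /
```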
- Deny a specific page: If what you want is not to index a specific page of your website, such as https://www.abc.com/xyz, you must use the Disallow command as follows:
Disallow: /xyz
- Deny a directory on your website: In this case, if you want to deny access to an entire section of your website, such as https://www.abc.com/xyz/, you should use the Disallow command like this:
Disallow: /xyz/
- Deny all pages that start in a certain way: For example, if you want to deny Googlebot, or any other crawler, access to all the URLs of your website that start with /category, you would again use the Disallow command like this:
Disallow: /category*
- Indicate the sitemap: This tells the robots where our sitemap is, which makes it easier for them to crawl and index our website:
Sitemap: https://www.abc.com/sitemap_index.xml
- Deny access to a file type: If, for example, we have downloadable PDF files that we do not want to appear in search engines because they are private and only accessible to certain users, we can use the Disallow command in robots.txt to indicate that no file ending with the extension “.pdf” should be crawled or indexed. And this works not only for PDF files, but for any other extension we want:
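A sketch of such a rule. Note that the * wildcard and the $ end-of-URL anchor are supported by the major crawlers such as Googlebot and Bingbot, although they are not part of the original robots exclusion standard, so other robots may ignore them:

```text
User-agent: *
Disallow: /*.pdf$
```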
How to generate a robots.txt file
You already know what the robots.txt file is, but now let’s see how it is created. First I will explain how to create a robots.txt in WordPress and then I will explain it for web pages created with other platforms.
In either case, generating a robots.txt is a very simple task, since it is a simple text file.
How to create a robots.txt file in WordPress
Creating a robots.txt in WordPress is very simple. You probably already have a plugin to control the SEO of your website, such as Yoast SEO. If you don’t have it installed, I highly recommend that you install and configure it to improve the web positioning of your page.
This free plugin, in addition to allowing you to control and optimize the SEO of your entire website, has a function that automatically generates a robots.txt file for WordPress. And the best of all is that you can modify it directly from the plugin without having to create it separately and upload it to the hosting.
To create your robots.txt file in WordPress, locate the SEO section in the menu on the left and click on the subsection called Tools. Once here, click on File Editor and some important files from your website will appear, such as the .htaccess.
If you have never created your robots.txt file before, you will see a specific button for this that says Create robots.txt file. With this, you will have created your robots.txt in WordPress.
Now you should only use the commands that I explained to you at the beginning or use the example of robots.txt for WordPress that I show you a little later.
How to create a robots.txt file on any other type of website
If you don’t use WordPress, you will have to create a robots.txt by hand and then upload it to the root of your hosting. To do this, open a text editor such as Notepad and write the instructions that you want to give to the search engines using the commands that I have named at the beginning of this post.
Once you have created it, save it with the name robots.txt, go to your hosting and upload it to the root directory where all the files of your website live. Normally this is the public_html folder.
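The steps above can be sketched from the command line. The directives and the /private/ path here are just placeholders; after creating the file, upload it to public_html with FTP or your hosting panel’s file manager:

```shell
# Create a minimal robots.txt locally (placeholder directives)
printf 'User-agent: *\nDisallow: /private/\n' > robots.txt

# Check the file before uploading it to the root of your hosting
cat robots.txt
```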
Example robots.txt for WordPress
Now that you know what robots.txt is and how to create it, I’ll give you a standard example that works for most sites created with WordPress. However, each website is a different world, and I recommend that you customize it according to your needs.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.abc.com/sitemap_index.xml
Don’t forget to change my sitemap for yours!
Differences between Robots.txt and Meta Robots tag
Before ending this article, I want to clarify a question that often arises when we talk about page indexing.
With the robots.txt file we can prevent some parts of our website from being indexed in Google results. However, it is also possible to do this using the Meta Robots tag with the noindex value on each page that we do not want indexed, either from Yoast SEO or by hand.
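At page level, the tag goes inside the page’s <head>. With noindex the page can still be crawled but is kept out of the results, and its links are still followed unless you also add nofollow:

```html
<head>
  <!-- Ask search engines not to index this page, but still follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
```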
So what is the difference?
If we use the robots.txt file, we prevent the blocked pages from passing PageRank to the pages they link to. That is, if I block a page via robots.txt to prevent it from being indexed, but that page contains links to other parts of my website, we cut off the authority that those links could have transmitted.
The advantage of doing it this way is that we avoid Google wasting time crawling pages that are not important to us and, therefore, we save the Crawl Budget.
If we use the Meta Robots tag, the page will be crawled, and so will its links (unless we also add nofollow). This passes on authority, but Google spends more crawling time.
Extra resource: Learn to get links to your website that really boost your SEO
Therefore, we must analyze in which case one method or another compensates us and act accordingly based on our objectives.
To see the Crawl Budget assigned to our website, and to check whether the robot has time to crawl it, we must go to Search Console > Crawl > Crawl Stats.