Using the Robots.txt File

November 17, 2020 02:00

Search engines like Google use web crawlers (also known as bots or spiders) to collect information from websites across the internet. Googlebot comes from Google, as you would expect, but many other search engines such as Bing and Yahoo, and even sites like YouTube, Amazon and Facebook, have their own bots. Most bots come from search engines, but other sites also send bots for reasons such as verification.

When these bots visit a website, they try to "crawl" or index every page of that site so that when a visitor uses their search engine, the needed page can be retrieved when it's requested. Typically, when a bot starts to work on a site, it looks for the robots.txt file first. This file tells the bot what it should and should not crawl, which in turn shapes what appears in the public search results.
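As a rough illustration of that first step, the sketch below shows how a well-behaved bot might work out where a site's robots.txt lives and then check a page against it. It uses Python's standard urllib modules; the domain example.com, the bot name ExampleBot, and the helper names robots_url and may_crawl are assumptions made for this example, not part of any real crawler.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt sits at the root of the scheme + host, whatever the page path is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_crawl(robots_lines, bot_name: str, page_url: str) -> bool:
    """Check page_url against robots.txt rules that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse the rules locally, no network access
    return parser.can_fetch(bot_name, page_url)


if __name__ == "__main__":
    # Hypothetical rules for a hypothetical site.
    rules = ["User-agent: *", "Disallow: /private/"]
    print(robots_url("https://example.com/blog/post?id=1"))
    print(may_crawl(rules, "ExampleBot", "https://example.com/private/page"))
    print(may_crawl(rules, "ExampleBot", "https://example.com/public/page"))
```

A real crawler would download the file from the address returned by robots_url before parsing it; the rules are passed in as a list here so the example runs without a network connection.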

#              __________
#   __,_,     |Dear Bots |
#  [_|_/      | Be Nice! |
#   //        |__________|
# _//    __  /
#(_|)   |@@|
# \ \__ \--/ __
#  \o__|----|  |   __
#      \ }{ /\ )_ / _\
#      /\__/\ \__O (__
#     (--/\--)    \__/
#     _)(  )(_
#    `---''---`

Creating a Robots.txt file

The robots.txt file always goes in the document root folder. To access this folder, log into your cPanel and go to your File Manager. Once you're in the document root folder for the domain you're working on, you can create a blank file by clicking the + File option in the top toolbar and naming it robots.txt. You can then add rules such as the following:

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /trackback/
Disallow: /index.php
Disallow: /xmlrpc.php
Disallow: /wp-login.php
Disallow: /wp-content/plugins/
Disallow: /comments/feed/

User-agent: Yandex
Disallow: /

User-agent: Baiduspider
Disallow: /

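Once added, these rules let most search engines crawl the public pages of your site while keeping them out of the listed WordPress system paths, which helps reduce crawl errors. The last two groups block the Yandex and Baiduspider bots from the entire site.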
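One way to sanity-check rules like these before uploading them is to run them through Python's built-in urllib.robotparser module. The sketch below uses an abbreviated copy of the rules; the domain example.com, the sample paths, and the helper name allowed are assumptions for illustration only, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the rules above; example.com is a placeholder domain.
RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php

User-agent: Yandex
Disallow: /

User-agent: Baiduspider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(bot: str, path: str) -> bool:
    """Return True if the named bot may fetch the path under RULES."""
    return parser.can_fetch(bot, "https://example.com" + path)


if __name__ == "__main__":
    for bot, path in [
        ("Googlebot", "/blog/hello-world/"),  # ordinary page, not listed
        ("Googlebot", "/wp-admin/"),          # listed under User-agent: *
        ("Yandex", "/blog/hello-world/"),     # Yandex is blocked everywhere
    ]:
        print(bot, path, allowed(bot, path))
```

Bots that have their own User-agent group, such as Yandex here, follow that group instead of the general User-agent: * group.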

Blocking Bots from Crawling

If you no longer want bots to crawl your site or search engines to rank it, this code asks all bots to stay away from the entire site:

#Code to not allow any search engines!
User-agent: *
Disallow: /

If you do need your website indexed for SEO purposes but only want bots to crawl certain parts of your site, that's possible as well. The following code blocks all bots from accessing the listed directories:

# Blocks robots from specific folders / directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
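Disallow rules match by path prefix, so the trailing slash matters. The short sketch below feeds the directory rules above to Python's urllib.robotparser to show which paths end up blocked; example.com, ExampleBot, the sample paths, and the helper name is_blocked are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

RULES = [
    "# Blocks robots from specific folders / directories",
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /junk/",
]

parser = RobotFileParser()
parser.parse(RULES)


def is_blocked(path: str, bot: str = "ExampleBot") -> bool:
    """Return True if the rules above tell the bot to stay away from path."""
    return not parser.can_fetch(bot, "https://example.com" + path)


if __name__ == "__main__":
    for path in ["/tmp/cache.html", "/tmpfiles/index.html",
                 "/junk/old/page.html", "/about/"]:
        print(path, "blocked" if is_blocked(path) else "allowed")
```

Here /tmp/cache.html is blocked because it starts with /tmp/, while /tmpfiles/index.html is not, since it only shares the letters and not the full /tmp/ prefix.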

Note that this code works as house rules for your website, telling bots what you do or do not want them to access. Legitimate bots, which are the majority, will respect the house rules, but rogue bots will ignore the robots.txt file. Major search engines such as Google and Bing do honor Disallow rules, but not every setting is controlled from robots.txt: Googlebot, for example, ignores the Crawl-delay directive, so some crawl settings have to be managed through their webmaster tools.

The Google & Bing Network

Google and Bing apply the robots.txt rules in their own way and do not support every directive, but you can still manage their respective bots and control how they crawl your site. Using the webmaster tools allows you to set most of the parameters for Googlebot. For an in-depth understanding of Google's official stance on using the robots.txt file, see the following documentation:

Google Documentation on Robots.txt File

To help manage your account resources, we always recommend still using the robots.txt file. Not only does this reduce the resources used by the system, but it also reduces the rate at which other web crawlers access your site. To access or learn more about the Google and Bing networks, visit their respective webmaster tools sites.
