After 25 years, the Robots Exclusion Protocol (REP) will finally become a standard. If you are a webmaster or search specialist and haven't heard of the protocol by that name, you're almost certainly familiar with robots.txt files – small text files that sit at the root of a domain and tell automated crawlers and search engines what they should and shouldn't access.
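For readers who haven't looked inside one, a typical robots.txt is just a handful of plain-text directives. The domain and paths below are placeholders for illustration:

```text
User-agent: *
Disallow: /admin/
Allow: /admin/help.html

Sitemap: https://www.example.com/sitemap.xml
```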
Because the protocol has been around for so long without formal documentation, webmasters have interpreted its rules differently, and there has never been a 'standard' way of using the REP – sometimes creating confusion amongst webmasters. Until now.
Google has officially teamed up with Martijn Koster (creator of the REP), various webmasters and other search engines to properly document how the REP should be used for the modern web.
Google is contributing its 20 years of experience to the project, including data from the roughly half a billion websites that rely on robots.txt to ensure they are crawled efficiently.
What do we need to know?
- Where the REP was originally limited to HTTP, it can now be applied to FTP and other protocols.
- A maximum cache time of 24 hours will apply to robots.txt files, so webmasters can update them whenever they want while crawlers avoid overloading web servers with robots.txt requests.
- When a previously available robots.txt file becomes inaccessible due to server failures, known disallowed pages will not be crawled for a reasonably long period of time. This alleviates concerns about search engines suddenly crawling and indexing pages they weren't supposed to while the server issues are occurring.
Google has also announced that it is retiring all code that handles unsupported rules (such as noindex) as of 1st September, 2019.
Retiring Unsupported Rules
The newly proposed internet draft for the REP standard has an extensible architecture for rules that are not part of the standard, meaning that if crawlers wanted to support their own custom lines such as "Unicorns: allowed", they could. This is quite similar to the "Sitemap:" line most search specialists would be familiar with.
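In practice that means a robots.txt file can carry non-standard lines alongside standard ones, and crawlers that don't recognise a line simply skip it. A sketch (paths and domain are placeholders):

```text
User-agent: *
Disallow: /private/

# A non-standard line: crawlers that don't support it simply ignore it
Unicorns: allowed

Sitemap: https://www.example.com/sitemap.xml
```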
While open-sourcing its robots.txt parser library, Google analysed how robots.txt files were used in the wild and assessed unsupported rules such as crawl-delay, nofollow and noindex. As these were never official and are rarely used (Google claims only 0.001% of all robots.txt files on the internet use them), Google will officially retire unsupported robots rules such as "noindex" on 1st September, 2019.
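You can see the effect of an unsupported rule with Python's standard-library robots.txt parser, which, much like Googlebot after the change, simply ignores lines it doesn't understand. The file contents and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt mixing a standard rule with an
# unsupported "Noindex" line.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Noindex: /old-page.html",  # unsupported: the parser skips this line
]

parser = RobotFileParser()
parser.parse(robots_lines)

# The standard Disallow rule is honoured...
print(parser.can_fetch("*", "https://example.com/private/secret.html"))

# ...but the unsupported Noindex line has no effect on crawling.
print(parser.can_fetch("*", "https://example.com/old-page.html"))
```

Blocking crawling via Disallow is not the same as keeping a page out of the index, which is why retired rules like noindex need a supported replacement.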
What Should We Do?
If you've been relying on the unsupported rules, it's important to update your robots.txt files as soon as possible and to use the correct methods for blocking and removing content, detailed at the end of this article.
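As one illustration of a supported alternative (Google's announcement lists several), the usual replacement for a robots.txt noindex line is a robots meta tag in the page's HTML, or the equivalent X-Robots-Tag HTTP response header for non-HTML files such as PDFs:

```html
<!-- In the <head> of a page that should stay out of the index -->
<meta name="robots" content="noindex">
```

Note that for either method to work, the page must remain crawlable – a page blocked by Disallow can't have its noindex tag read.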
Need help with your robots.txt files or just want to know what Google thinks of your site? Get in touch with our expert team and learn how we can help!