New robots.txt control in MODX Cloud

by Ryan Thrash, Mike Schell

Published on October 19, 2017

Today we’re announcing a change to how MODX Cloud handles robots.txt files, including a new capability: unique robots.txt files per hostname for multisite installations.

What is robots.txt?

A /robots.txt file is an optional file that lets a webmaster explicitly tell well-behaved web robots, such as search engine spiders, how they should crawl a website. If no robots.txt file is present, most robots will proceed with crawling and indexing the site.

This is useful when site owners use a dev or staging site for ongoing work, and an isolated production site where changes and updates are deployed. You can tell web robots to ignore the dev site, while allowing indexing on the production site.
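To make the dev-versus-production distinction concrete, here is a minimal sketch using Python's standard-library urllib.robotparser to show how a well-behaved robot would read two robots.txt bodies. The helper name can_crawl, the user agent "ExampleBot", and the example.com URLs are illustrative assumptions, not part of MODX Cloud.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value permits everything; "Disallow: /" blocks everything.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def can_crawl(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Report whether a rule-following robot may fetch url under robots_txt.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Production site: indexing allowed.
    print(can_crawl(ALLOW_ALL, "https://www.example.com/about"))
    # Dev site: indexing blocked.
    print(can_crawl(BLOCK_ALL, "https://dev.example.com/about"))
```

Keep in mind that robots.txt is advisory: it only affects robots that choose to honor it.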

Robots.txt in MODX Cloud

Previously, a toggle in the MODX Cloud Dashboard controlled whether a custom robots.txt file was served. While this was useful, it made it easy to accidentally allow indexing on staging or dev sites, or to disallow indexing on a production site, simply by flipping the option.

Today, we’re removing this interface completely and relying on the presence of robots.txt files on the filesystem, with one exception: any domain that ends in modxcloud.com will be served a Disallow: / directive for all user agents, regardless of whether a robots.txt file is present.

For production sites (ones that get real visitor traffic), you’ll need to use a custom domain if you want your site to be indexed.

Serve unique robots.txt files per hostname in MODX Cloud

Some organizations use MODX Revolution to run multiple websites from a single installation using Contexts. For example, one installation might serve a public-facing marketing site, landing page microsites, and possibly a non-public intranet.

Most site owners want their sites indexed. In MODX Cloud, all sites with custom hostnames fall back to serving any robots.txt file uploaded to the web root, usually with the following content:

User-agent: *
Disallow:

However, you wouldn’t want a hypothetical intranet at intranet.example.com indexed. Traditionally, this was tricky to accomplish on multisite installs because all sites shared the same web root. In MODX Cloud, it’s easy: upload an additional file to your web root named robots-intranet.example.com.txt with the following content. Well-behaved robots will then skip the intranet, while every other hostname falls back to the standard robots.txt file unless it has a hostname-specific file of its own:

User-agent: *
Disallow: /
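The lookup order described in this post can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not MODX Cloud's actual implementation. The function name resolve_robots is hypothetical, and treating "ends in modxcloud.com" as "modxcloud.com or any subdomain of it" is an assumption.

```python
from pathlib import Path
from typing import Optional

BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def resolve_robots(hostname: str, web_root: Path) -> Optional[str]:
    """Return the robots.txt body to serve for hostname, or None if no file applies.

    Hypothetical sketch of the rules described in this post.
    """
    host = hostname.lower().rstrip(".")

    # Rule 1: modxcloud.com hostnames are always blocked, whatever is on disk.
    # Assumption: "ends in modxcloud.com" means the domain or any subdomain.
    if host == "modxcloud.com" or host.endswith(".modxcloud.com"):
        return BLOCK_ALL

    # Rule 2: a hostname-specific file, e.g. robots-intranet.example.com.txt.
    specific = web_root / f"robots-{host}.txt"
    if specific.is_file():
        return specific.read_text()

    # Rule 3: fall back to the shared robots.txt in the web root.
    default = web_root / "robots.txt"
    if default.is_file():
        return default.read_text()

    # No file at all: nothing to serve, so most robots will crawl freely.
    return None
```

With both robots.txt and robots-intranet.example.com.txt in the web root, intranet.example.com gets the blocking file and every other custom hostname gets the shared one.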

Do I need to do anything?

All new Clouds will work as described above starting now. Please refer to our note on robots.txt behavior for Clouds created prior to October 19, 2017.

Learn more

Understanding how robots.txt affects your sites in search engines is an important aspect of website management. Learn more about robots.txt at robotstxt.org, and bookmark our documentation on robots.txt handling in MODX Cloud. If you want to start using this new capability in MODX Cloud, log in to your Dashboard or create an account today.

