Today we’re announcing a new feature in MODX Cloud that streamlines how robots.txt files are handled and adds support for a unique robots.txt file per hostname on multisite installations.
What is robots.txt?
A /robots.txt is an optional file that lets a webmaster explicitly tell well-behaved web robots, such as search index spiders, how they should crawl a website. If no robots.txt file is present, most robots will proceed with crawling and indexing the site.
This is useful when site owners use a dev or staging site for ongoing work, and an isolated production site where changes and updates are deployed. You can tell web robots to ignore the dev site, while allowing indexing on the production site.
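To see how a compliant crawler reads these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The hostnames dev.example.com and www.example.com and the bot name ExampleBot are placeholders for illustration, not anything MODX Cloud defines.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a compliant robot may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Rules a dev or staging site might serve: block everything.
DEV_RULES = "User-agent: *\nDisallow: /\n"
# Rules a production site might serve: an empty Disallow allows everything.
PROD_RULES = "User-agent: *\nDisallow:\n"

print(may_crawl(DEV_RULES, "https://dev.example.com/about"))   # False
print(may_crawl(PROD_RULES, "https://www.example.com/about"))  # True
```

Keep in mind that robots.txt is advisory: it only affects robots that choose to honor it.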
Robots.txt in MODX Cloud
Previously, a toggle in the MODX Cloud Dashboard controlled whether a custom robots.txt file was served. While this was useful, it made it possible to accidentally allow indexing on staging/dev sites by flipping the toggle. It was just as easy to accidentally disallow indexing on a production site.
Today, we’re removing this interface completely and relying on the presence of robots.txt files on the filesystem, with the following exception: any domain that ends in modxcloud.com will be served a Disallow: / directive to all user agents, irrespective of the presence or absence of a robots.txt file.
For production sites (ones that get real visitor traffic), you’ll need to use a custom domain if you want your site to be indexed.
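The modxcloud.com exception can be pictured roughly as follows. This is only an illustrative sketch, not MODX Cloud's actual implementation: the function names are hypothetical, and matching on a dot boundary (so that a lookalike such as notmodxcloud.com is not caught) is an assumption about what "ends in modxcloud.com" means.

```python
from typing import Optional

# The response every modxcloud.com hostname receives, per the rule above.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def is_platform_hostname(hostname: str) -> bool:
    """Hypothetical check: does this hostname end in modxcloud.com?"""
    host = hostname.strip().lower().rstrip(".")
    # Assumption: match the bare domain or any subdomain of it.
    return host == "modxcloud.com" or host.endswith(".modxcloud.com")


def forced_robots_body(hostname: str) -> Optional[str]:
    """Return the forced block-all body, or None for custom domains."""
    return BLOCK_ALL if is_platform_hostname(hostname) else None


print(is_platform_hostname("mysite.modxcloud.com"))  # True
print(is_platform_hostname("www.example.com"))       # False
```

For a custom domain the function returns None, meaning whatever robots.txt file exists on the filesystem decides the outcome.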
Serve unique robots.txt files per hostname in MODX Cloud
Some organizations use MODX Revolution to run multiple websites from a single installation using Contexts. One example is a public-facing marketing site combined with landing-page microsites and possibly a non-public intranet.
Most site owners want their sites indexed. In MODX Cloud, all sites with custom hostnames will fall back to serving any uploaded robots.txt file in the web root, usually with the following content:
User-agent: *
Disallow:
However, for a hypothetical intranet using intranet.example.com as its hostname, you wouldn’t want it indexed. Traditionally, this was tricky to accomplish on multisite installs because all the sites shared the same web root. In MODX Cloud, it’s easy: upload an additional file named robots-intranet.example.com.txt to your web root with the following content. It will block indexing of that hostname by well-behaved robots, while every other hostname falls back to the standard robots.txt file unless it has a hostname-specific file of its own:
User-agent: *
Disallow: /
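The lookup order described above (hostname-specific file first, then the standard robots.txt) can be sketched like this. It is a rough illustration under stated assumptions, not MODX Cloud's actual code: the function name pick_robots_file is hypothetical, and the sketch leaves out the modxcloud.com override described earlier.

```python
from pathlib import Path
from typing import Optional


def pick_robots_file(hostname: str, webroot: Path) -> Optional[Path]:
    """Hypothetical helper: choose which robots file to serve for a hostname."""
    host = hostname.strip().lower()
    # Ignore anything that could not be a plain hostname (path separators).
    if host and "/" not in host and "\\" not in host:
        specific = webroot / f"robots-{host}.txt"
        if specific.is_file():
            return specific
    # No hostname-specific file: fall back to the shared robots.txt, if any.
    default = webroot / "robots.txt"
    return default if default.is_file() else None
```

With both robots.txt and robots-intranet.example.com.txt in the web root, a request for intranet.example.com would resolve to the hostname-specific file, and a request for www.example.com would resolve to robots.txt.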
Do I need to do anything?
All new Clouds will work as described above starting now. Please refer to our note on robots.txt behavior for Clouds created prior to October 19, 2017.
Understanding how robots.txt affects your sites in the search engines is an important aspect of website management. Learn more about robots.txt at robotstxt.org, and bookmark our documentation on robots.txt handling in MODX Cloud. If you want to start using this new capability in MODX Cloud, log in to your Dashboard or create an account today.