ReadTheDocs robots.txt and sitemap.xml

ReadTheDocs auto-generates a robots.txt and sitemap.xml for projects. Each time I deploy a new minor version of my project (ex. 4.1.10), I hide previous minor versions (ex. 4.1.9). ReadTheDocs adds entries for all versions to sitemap.xml, and hidden versions are also added as Disallow entries in robots.txt. The result is that sitemaps previously submitted to Google Search Console now produce "Submitted URL blocked by robots.txt" errors, since entries from the earlier sitemap are blocked by the newly generated robots.txt.
ReadTheDocs generates a sitemap entry for each version, so for 4.1.9, for example, we have an entry like this:
<url>
<loc>https://pyngrok.readthedocs.io/en/4.1.9/</loc>
<lastmod>2020-08-12T18:57:47.140663+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.7</priority>
</url>
And when 4.1.10 is released and the previous minor version is hidden, the newly generated robots.txt gets:
Disallow: /en/4.1.9/ # Hidden version
I believe this Disallow is what then causes the Google crawler to throw the error.
Realistically, all I want in the sitemap.xml are latest, develop, and stable; I don't much care to have every version crawled. But as I understand it from the ReadTheDocs docs, all I'm able to configure is a static robots.txt.
What I want is to publish a static sitemap.xml of my own instead of using the auto-generated one. Any way to accomplish this?

After playing around with a few ideas, here is the solution I came up with. Since this question is asked frequently and often opened as a bug against ReadTheDocs on GitHub (which it isn't; it just appears to be poorly supported and/or documented), I'll share my workaround here for others to find.
As mentioned above and in the docs, while ReadTheDocs allows you to override the auto-generated robots.txt and publish your own, you can't do the same with sitemap.xml. Unclear why. Regardless, you can simply publish a differently named sitemap (I called mine sitemap-index.xml) and then point your robots.txt at that custom sitemap.
For my custom sitemap-index.xml, I only include the pages I care about rather than every generated version (since stable and latest are really what I want search engines crawling, not versioned pages):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://pyngrok.readthedocs.io/en/stable/</loc>
<changefreq>weekly</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://pyngrok.readthedocs.io/en/latest/</loc>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://pyngrok.readthedocs.io/en/develop/</loc>
<changefreq>monthly</changefreq>
<priority>0.1</priority>
</url>
</urlset>
I created my own robots.txt that tells Google not to crawl anything except my main branches and points to my custom sitemap-index.xml.
User-agent: *
Disallow: /
Allow: /en/stable
Allow: /en/latest
Allow: /en/develop
Sitemap: https://pyngrok.readthedocs.io/en/latest/sitemap-index.xml
I put these two files under /docs/_html, and to my Sphinx conf.py file (which is in /docs) I added:
html_extra_path = ["_html"]
This setup is also shown in the pyngrok repo, for reference.
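If you want to sanity-check this before pushing, a local Sphinx build should copy both files into the root of the generated HTML. Here is a minimal sketch in Python, assuming the default docs/ layout and a local build run with sphinx-build -b html docs docs/_build/html (adjust the paths to your project):
from pathlib import Path

# Assumed local Sphinx output directory; html_extra_path = ["_html"] should
# have copied robots.txt and sitemap-index.xml verbatim into its root.
build_root = Path("docs/_build/html")

for name in ("robots.txt", "sitemap-index.xml"):
    status = "found" if (build_root / name).is_file() else "MISSING"
    print(f"{name}: {status}")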
After ReadTheDocs rebuilds the necessary branches, give /en/latest/sitemap-index.xml to Google Search Console instead of the default one and ask Google to reprocess your robots.txt. Not only will the crawl errors be resolved, but Google will also properly index a site that hides previous minor versions.
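To confirm the files are actually being served before handing the sitemap to Search Console, you can fetch them directly. A minimal sketch using only the standard library (the URLs are from my project, so substitute your own; ReadTheDocs serves the robots.txt of the default version from the domain root):
from urllib.request import urlopen

# Both files should return 200 once ReadTheDocs has rebuilt the branches.
for url in (
    "https://pyngrok.readthedocs.io/robots.txt",
    "https://pyngrok.readthedocs.io/en/latest/sitemap-index.xml",
):
    with urlopen(url) as response:
        print(url, response.status)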

Related

Does modifying robots.txt take effect immediately?

I am trying to solve an issue where Googlebot seems to be eating up my CPU usage. To confirm my guess, I modified robots.txt in my website's root folder, adding
Disallow: /
to it. I have two websites on different servers, and both of them have this issue. For one of them, after I edited robots.txt the CPU usage dropped to a normal level; for the other, I can see from the Apache access log that Googlebot is still coming in.
So I went to Google Search Console to test robots.txt. For the first site I see that Google has already discovered the latest robots.txt and stopped crawling my website; for the second, Google is still using an old version of robots.txt. So modifying robots.txt doesn't always take effect immediately, am I right? And if so, how do I notify Google that I have a new robots.txt?
You need to use this to disallow all user agents though:
User-agent: *
Disallow: /
For search engine re-indexing, it might take between a few days and four weeks before Googlebot indexes a new site (reference).
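To verify which robots.txt your server is actually serving (as opposed to whatever copy Google last fetched), grabbing it directly is a quick check. A minimal sketch with a placeholder URL:
from urllib.request import urlopen

# Hypothetical URL; substitute your own site's robots.txt.
with urlopen("https://www.example.com/robots.txt") as response:
    print(response.headers.get("Last-Modified"))  # what the server reports, if anything
    print(response.read().decode("utf-8"))        # the rules a crawler would fetch right now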

Joomla 3.6.4 home page does not use correct robots meta data

Whilst in development on a temporary subdomain I had robots set to No Index, No Follow in the Joomla 3.6.4 Global Configuration.
After moving the site to its permanent home I changed the settings to Index, Follow. All the pages are now fine except the home page, which (when the source is viewed) still shows meta name="robots" content="noindex, nofollow".
I have changed the menu item to 'Index, Follow' rather than 'use global' and, when that made no difference, also did the same for the single article which is assigned to that page. Still no change. It doesn't change on the original development site version, either.
This is a template that I have used many times before: I keep a basic copy with some global tweaks that I have made and then adapt it for each new site. I can change the robots meta tags on this original template with no problem.
I have no SEF plugins. The only SEF tool is the System - SEF.
I have checked all the template files over and over and I am now about to tear my hair out.
Any suggestions gratefully received.

Adobe AEM relative links

How do you use relative links within the text editor component? By default, Adobe AEM doesn't like it when I use relative links to external pages. It strips them out and shows the broken-link symbol.
I strongly recommend that you do not check Disable Checking in the Day CQ Link Checker Transformer. If you do disable it, be aware that it is then your responsibility to ensure all links are valid; the option's description reads:
Completely disable all link checking. All links are handled as valid.
This is something you want to check with your team (devs, TAs, ...). It may work on your local environment and then fail in QA, UAT, and PROD if the option is not checked there. Disabling the link checker might not be a good idea, as content authors may add broken links, which will break user navigation throughout the site if that isn't caught during testing and regression testing.
Regarding paths: relative paths are those within the environment you're in. For instance,
/content/dam/geometrixx/banners/banner-mono.png
is a relative path; a path to Stack Overflow, however, is outside your environment and is therefore external. To be valid, an external URL needs the full URL including the scheme (http, https, ftp, ftps, and so on). A valid external URL would be:
http://www.stackoverflow.com
More info about URLs can be found here.
While disabling the link checker will work, I'm not sure how you are referencing external websites by relative link. Relative links are on the same domain by their nature. Can you give us an example of what you mean?
The other problem with disabling the link checker is that the production deployment will likely have the link checker turned on. In this case your code will break again. You probably don't want your client/boss/whatever upset about that.
Relative links can be made to work just fine with the linkchecker. Can you post some example links? I can help you make things work properly.
This issue is quite common if you have URLs (paths) in your domain that are NOT served by AEM. These can be files served directly by Apache (e.g. robots.txt), servlets creating dynamic redirects (e.g. a language switcher), or another application (e.g. a web shop under /shop).
The first solution is to mark an individual link as valid for the link checker. To do this, add one of the following attributes to the link tag:
x-cq-linkchecker="valid" - the link is marked as valid, without any check
x-cq-linkchecker="skip" - the link is ignored by the link checker and remains as is
e.g. <a x-cq-linkchecker="valid" href="...">Shopping Basket</a>
The second solution is to configure specially treated link patterns in the OSGi config of the "Day CQ Link Checker Service". If you have a second application in the same domain, you can specify regex patterns matching the links to that second application. Use either "Link Check Override Patterns" (not checked, but rewritten) or "Special Link Patterns" (not checked and not rewritten).
Example configuration in which only links to /content/* are verified; links not matching ^/content/.*$ are treated as valid:
<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
jcr:primaryType="sling:OsgiConfig"
service.special_link_prefix="[javascript:,data:,mailto:,#,<!--,${,tel:]"
service.check_override_patterns="[^system/,^(?!/content/).*$]"
/>
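To see how that override pattern behaves, a quick check against a couple of made-up paths may help (the paths below are purely illustrative):
import re

# "^(?!/content/).*$" matches anything that does NOT start with /content/,
# so those links are overridden (treated as valid) and only /content/* links are checked.
pattern = re.compile(r"^(?!/content/).*$")

print(bool(pattern.match("/shop/basket.html")))           # True  -> skipped by the checker
print(bool(pattern.match("/content/site/en/page.html")))  # False -> still checked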
Fixed my issue in /system/console/configMgr# > Day CQ Link Checker Transformer > Check "Disable Checking" box.
Relative paths now work.

Robots.txt Allow sub folder but not the parent

Can anybody please explain the correct robots.txt command for the following scenario.
I would like to allow access to:
/directory/subdirectory/..
But I would also like to restrict access to /directory/, notwithstanding the above exception.
Be aware that there is no real official standard and that any web crawler may happily ignore your robots.txt.
According to a Google Groups post, the following works at least with Googlebot:
User-agent: Googlebot
Disallow: /directory/
Allow: /directory/subdirectory/
I would recommend using Google's robots.txt tester in Google Webmaster Tools - https://support.google.com/webmasters/answer/6062598?hl=en
You can edit and test URLs right in the tool, and you get a wealth of other tools as well.
If these are truly directories then the accepted answer is probably your best choice. But if you're writing an application and the directories are dynamically generated paths (a.k.a. contexts, routes, etc.), then you might want to use meta tags instead of defining the rules in robots.txt. This gives you the advantage of not having to worry about how different crawlers may interpret/prioritize access to the subdirectory path.
You might try something like this in the code:
if is_parent_directory_path
<meta name="robots" content="noindex, nofollow">
end
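For instance, in Python the same idea might look like the sketch below; the path test and the tag are illustrative, and you would adapt them to your framework's request object and templates:
# Emit a robots meta tag only for the parent /directory/ page, so it stays out
# of the index while its subdirectories remain indexable.
def robots_meta_tag(path: str) -> str:
    if path.rstrip("/") == "/directory":
        return '<meta name="robots" content="noindex, nofollow">'
    return ""

print(robots_meta_tag("/directory/"))                # emits the noindex tag
print(robots_meta_tag("/directory/subdirectory/x"))  # empty string: indexable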

Prevent Google from indexing

Hi, what's the best way to prevent Google from showing a folder in the search engine? For example, www.example.com/support - what should I do if I want the support folder to disappear from Google?
The first thing I did was place a robots.txt file and include this code:
User-agent: *
Disallow: /support/etc
but the result is a total disaster: I am not able to use the support page anymore unless I remove the robots.txt.
What's the best thing to do?
robots.txt shouldn't affect the way your page functions. If in doubt, you can use a tool to generate one, like http://www.searchenginepromotionhelp.com/m/robots-text-creator/simple-robots-creator.php or http://www.seochat.com/seo-tools/robots-generator/
When disallowing in the robots file, you can explicitly specify a file or subfolder rather than just a folder.
You can also use a meta tag in your document to tell crawlers not to index it:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
What's the best way to prevent Google from showing a folder in the search engine?
A robots.txt file is the right way to do this. Your example is correct for blocking the /support/etc directory and its descendants.
I am not able to use the support page anymore unless I remove the robots.txt.
It doesn't make sense that a robots.txt file would affect the way your site functions, and certainly it should never affect which pages can be accessed by a human. I suspect something else is awry -- check your server logs to see what kinds of errors are being recorded.
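If it helps to convince yourself, the rules from the question can be checked locally with Python's standard-library robots.txt parser; /support itself is not blocked, only /support/etc and everything under it (using the example.com domain from the question):
from urllib.robotparser import RobotFileParser

# The exact rules from the question, parsed locally.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /support/etc",
])

print(rp.can_fetch("*", "https://www.example.com/support"))           # True: not blocked
print(rp.can_fetch("*", "https://www.example.com/support/etc/page"))  # False: blocked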
While not the preferred method of limiting robot access, Google talks about using a noindex meta tag here. This will also prevent the various pages from showing up if they are linked to by a site other than your own.
A good discussion of limiting bots that visit your site can be found here.