I came across a site that uses the following in its robots.txt file:
User-agent: *
Disallow: /*.php$
So what does it do?
Will it prevent web crawlers from crawling the following URLs?
https://example.com/index.php
https://example.com/index.php?page=Events&action=Upcoming
Will it block subdomains too?
https://subdomain.example.com/index.php
So what does it do?
By spec it means "URLs starting with /*.php$", which isn't very useful. There might be engines out there that support some custom syntax for it. I know some support wildcards, but that looks like regular expression syntax, and I've not heard of anything that supports that in robots.txt.
Will it prevent web crawlers from crawling the following URLs?
By spec: No.
If anything supports regexes, then it will block the first one but not the second one.
Will it block subdomains too?
No. Each origin is independent when it comes to robots.txt. The subdomain site would need its own copy of the resource.
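To make the per-origin point concrete, here is a minimal sketch that derives the robots.txt location governing a given page. The helper name robots_txt_url is made up for illustration; the URLs are the ones from the question.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/index.php"))
print(robots_txt_url("https://subdomain.example.com/index.php"))
```

The two calls yield different files, which is why rules published on example.com say nothing about subdomain.example.com.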
It looks like a regular expression, but regular expressions are not in the spec. However, Google and Bing both honour wildcards (*) and end-of-URL markers ($). You can try your robots.txt rules here.
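As a rough illustration of how a crawler that honours * and $ might evaluate the rule from the question, the sketch below translates a pattern into a regular expression and tests it against the path plus query string. The function names are invented for this example, and matching against path plus query is an assumption based on how Google describes its own behaviour; other crawlers may differ.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern using * and $ into a regex (sketch)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern: str, url: str) -> bool:
    """True if a Disallow rule with this pattern would match the URL."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, which mirrors robots.txt prefix matching.
    return pattern_to_regex(pattern).match(target) is not None


print(is_blocked("/*.php$", "https://example.com/index.php"))
print(is_blocked("/*.php$", "https://example.com/index.php?page=Events&action=Upcoming"))
```

Under these assumptions the first URL is blocked and the second is not, because the query string means the URL no longer ends in .php.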
Related
I am trying to solve an issue where Googlebot seems to be eating up my CPU. To confirm my guess, I modified robots.txt in my website's root folder, adding
Disallow: /
to it. I have two websites on different servers, and both of them have this issue. For one of them, CPU usage dropped to a normal level after I edited robots.txt; for the other, I can see from the Apache access log that Googlebot is still coming in.
So I went to Google Search Console to test robots.txt. For the first site, I see that Google has already discovered the latest robots.txt and stopped crawling my website; for the second one, Google is still using an old version of robots.txt. So modifying robots.txt doesn't always take effect immediately, am I right? And if so, how do I notify Google that I have a new robots.txt?
You need to use this to disallow all user agents though:
User-agent: *
Disallow: /
As for search engine re-indexing, it might take anywhere from a few days to four weeks before Googlebot indexes a new site (reference).
According to this page
globbing and regular expression are not supported in either the User-agent or Disallow lines
However, I noticed that the stackoverflow robots.txt includes characters like * and ? in the URLs. Are these supported or not?
Also, does it make any difference whether a URL includes a trailing slash, or are these two equivalent?
Disallow: /privacy
Disallow: /privacy/
As for your second question, the two are not equivalent. /privacy will block anything that starts with /privacy, including something like /privacy_xyzzy. /privacy/, on the other hand, would not block that.
The original robots.txt specification did not support globbing or wildcards. However, many robots do. Google, Microsoft, and Yahoo agreed on a standard a few years back. See http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html for details.
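The prefix behaviour can be checked with Python's standard urllib.robotparser, which does plain prefix matching. This is only a sketch: the blocked helper and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked(rule: str, url: str) -> bool:
    """True if a lone 'Disallow: <rule>' for all user agents blocks the URL."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return not parser.can_fetch("*", url)


for rule in ("/privacy", "/privacy/"):
    for path in ("/privacy", "/privacy/policy", "/privacy_xyzzy"):
        print(rule, path, blocked(rule, "https://example.com" + path))
```

With /privacy all three paths are blocked; with /privacy/ only /privacy/policy is.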
Most major robots that I know of support that "standard."
Can anybody please explain the correct robots.txt command for the following scenario?
I would like to allow access to:
/directory/subdirectory/..
But I would also like to restrict access to /directory/, notwithstanding the above exception.
Be aware that there is no real official standard and that any web crawler may happily ignore your robots.txt.
According to a Google Groups post, the following works, at least with Googlebot:
User-agent: Googlebot
Disallow: /directory/
Allow: /directory/subdirectory/
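To see why this combination works for Googlebot, here is a small sketch of the "most specific rule wins" evaluation that Google documents: the longest matching path decides, and a tie goes to Allow. The is_allowed function and the sample paths are illustrative assumptions, and wildcards are left out to keep it short.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", path_prefix) pairs."""
    best_kind, best_len = "allow", -1  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        # The longer prefix is more specific; on a tie, prefer allow.
        if len(prefix) > best_len or (len(prefix) == best_len and kind == "allow"):
            best_kind, best_len = kind, len(prefix)
    return best_kind == "allow"


rules = [("disallow", "/directory/"), ("allow", "/directory/subdirectory/")]

print(is_allowed("/directory/subdirectory/page.html", rules))
print(is_allowed("/directory/other.html", rules))
print(is_allowed("/elsewhere.html", rules))
```

Parsers that instead apply the first matching rule in file order, as Python's urllib.robotparser does, would block the subdirectory with the rules in this order, so listing the Allow line first is the safer layout if you care about more than Googlebot.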
I would recommend using Google's robots.txt tester, which is part of Google Webmaster Tools: https://support.google.com/webmasters/answer/6062598?hl=en
You can edit and test URLs right in the tool, plus you get a wealth of other tools as well.
If these are truly directories then the accepted answer is probably your best choice. But if you're writing an application and the directories are dynamically generated paths (a.k.a. contexts, routes, etc.), then you might want to use meta tags instead of defining it in the robots.txt. This gives you the advantage of not having to worry about how different crawlers may interpret or prioritize the access to the subdirectory path.
You might try something like this in the code:
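<%# ERB-style template sketch; is_parent_directory_path is a hypothetical helper you would define %>
<% if is_parent_directory_path %>
<meta name="robots" content="noindex, nofollow">
<% end %>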
http://plus.google.com/robots.txt has the following contents:
User-agent: *
Disallow: /_/
I'm assuming this means search engines are allowed to index anything in the first level off the root and nothing further?
I believe those lines are used to deny robots access to URLs like
https://plus.google.com/_/apps-static/_/ss/landing/...
and
https://plus.google.com/_/apps-static/_/js/landing/....
These URLs mainly appear to be CSS, JavaScript, and JSON, but there might be other things (that are more valuable to search engines) which aren't immediately obvious.
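A quick check with Python's urllib.robotparser suggests the rule is a prefix on the literal /_/ path rather than a depth limit. The crawlable helper and the sample paths below are invented for illustration and are not real pages on that site.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /_/"])


def crawlable(path: str) -> bool:
    """True if the rule above lets any user agent fetch this path."""
    return parser.can_fetch("*", "https://plus.google.com" + path)


# Hypothetical paths: only those under /_/ are disallowed, however deep the others go.
for path in ("/_/example/file.js", "/some/deep/nested/page", "/_"):
    print(path, crawlable(path))
```

So anything whose path begins with /_/ is off limits, while deeper paths elsewhere on the site remain crawlable.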
Hi, what's the best way to prevent Google from showing a folder in its search results? For example, with www.example.com/support, what should I do if I want the support folder to disappear from Google?
The first thing I did was place a robots.txt file on the site containing this code:
User-agent: *
Disallow: /support/etc
But the result was a total disaster: I am not able to use the support page anymore unless I remove the robots.txt.
What's the best thing to do?
robots.txt shouldn't affect the way your pages function. If in doubt, you can use a generator tool such as http://www.searchenginepromotionhelp.com/m/robots-text-creator/simple-robots-creator.php or http://www.seochat.com/seo-tools/robots-generator/
When disallowing in the robots file, you can explicitly specify a file or subfolder rather than just a folder.
You can also use a meta tag in your document to tell the crawler not to index it:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
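If you go the meta tag route, it can help to confirm that the tag actually appears in the HTML your server sends. Below is a minimal sketch using Python's standard html.parser; the class and function names are made up for this example, and fetching the page is left out.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag (sketch)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def robots_directives(html: str) -> set[str]:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
print(robots_directives(page))
```

Keep in mind that a crawler only sees the tag if it is allowed to fetch the page, so a page that is disallowed in robots.txt will not have its meta tag read.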
What's the best way to prevent Google from showing a folder in its search results?
A robots.txt file is the right way to do this. Your example is correct for blocking the /support/etc directory and its descendants.
I am not able to use the support page anymore unless I remove the robots.txt
It doesn't make sense that a robots.txt file would affect the way your site functions, and certainly it should never affect which pages can be accessed by a human. I suspect something else is awry -- check your server logs to see what kinds of errors are being recorded.
While not the preferred method of limiting robot access, Google talks about using a noindex meta tag here. This will also prevent the various pages from showing up if they are linked to by a site other than your own.
A good discussion of limiting bots that visit your site can be found here.