According to this page, globbing and regular expressions are not supported in either the User-agent or Disallow lines.
However, I noticed that the Stack Overflow robots.txt includes characters like * and ? in its URLs. Are these supported or not?
Also, does it make any difference whether a URL includes a trailing slash, or are these two equivalent?
Disallow: /privacy
Disallow: /privacy/
To your second question: no, the two are not equivalent. /privacy will block anything that starts with /privacy, including something like /privacy_xyzzy. /privacy/, on the other hand, would not block that.
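To make that concrete (the extra example paths are hypothetical):
# blocks anything whose path starts with /privacy:
# /privacy, /privacy/, /privacy/cookies, /privacy_xyzzy, /privacy-policy
Disallow: /privacy
# blocks only paths that start with /privacy/ (the directory and its contents),
# not /privacy_xyzzy and not the bare /privacy
Disallow: /privacy/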
The original robots.txt did not support globbing or wildcards. However, many robots do. Google, Microsoft, and Yahoo agreed on a standard a few years back. See http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html for details.
Most major robots that I know of support that "standard."
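For illustration, the extensions described there let you write rules like these (the paths are made up):
User-agent: *
# * matches any sequence of characters, so this blocks e.g. /a/temp/ and /a/b/temp/
Disallow: /*/temp/
# $ anchors the match to the end of the URL, so this blocks only URLs ending in .pdf
Disallow: /*.pdf$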
Hello, I want to disallow URLs like this one in robots.txt: "/2018/11/razones-para-ver-fallet.html?m=0". I mean the ones that end with "?m=0".
These URLs belong to the Blogger mobile view (I have now migrated to WordPress), and Googlebot is still indexing them (the sitemap in Search Console is the new one, but...), causing some CPU problems.
I have tried Disallow: /*?m=0 but I'm still seeing them in the visit log.
Many thanks
What about this: Disallow: /*?*m=0
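If the crawler supports the wildcard extension, the extra * lets m=0 appear anywhere in the query string instead of only immediately after the ? (the second URL below is hypothetical):
User-agent: *
# matches /2018/11/razones-para-ver-fallet.html?m=0
# and also /2018/11/razones-para-ver-fallet.html?page=2&m=0
Disallow: /*?*m=0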
I came across a site that uses the following in its robots.txt file:
User-agent: *
Disallow: /*.php$
So what does it do?
Will it prevent web crawlers from crawling the following URLs?
https://example.com/index.php
https://example.com/index.php?page=Events&action=Upcoming
Will it block subdomains too?
https://subdomain.example.com/index.php
So what does it do?
By spec it means "URLs starting with /*.php$", which isn't very useful. There might be engines out there that support some custom syntax for it. I know some support wildcards, but that looks like regular expression syntax and I've not heard of anything that supports that in robots.txt.
Will it prevent web crawlers from crawling the following URLs?
By spec: No.
If anything supports regexes, then it will block the first one but not the second one.
Will it block subdomains too?
No. Each origin is independent when it comes to robots.txt. The subdomain site would need its own copy of the resource.
It looks like a regular expression, but regular expressions are not in the spec. However, Google and Bing both honour wildcards (*) and end-of-URL markers ($). You can try your robots.txt rules here.
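For crawlers that do honour those extensions, the rule from the question behaves roughly like this:
User-agent: *
# blocks https://example.com/index.php (the URL ends in .php)
# does not block https://example.com/index.php?page=Events&action=Upcoming
# (that URL ends in the query string, not in .php)
Disallow: /*.php$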
I'm working with an e-commerce system at the moment that is throwing up hundreds of potential duplicate page URLs, and I'm trying to work out how to hide them via robots.txt until the developers are able to sort their ...... out.
I have managed to block most of them, but I got stuck on the last type, so the question is:
I have 4 URLs to the same product page with the structure below. How do I block the first one but not the others?
www.example.com/ProductPage
www.example.com/category/ProductPage
www.example.com/category/subcategory/ProductPage
www.example.com/category/subcategory/ProductPage/assessorypage
So far the only idea I can come up with is using:
Disallow: /*?id=*/
This, however, blocks everything…
EDIT: I believe I may have found a way to do it: set up the robots.txt file to disallow everything, then allow just the specific paths I want below that, and then… once again disallow any specific paths after that.
Does anyone know if using disallow > allow > disallow like this has a negative effect on SEO?
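In robots.txt terms, that structure would look something like this (the paths are just placeholders):
User-agent: *
# disallow everything by default
Disallow: /
# re-allow the specific paths I do want crawled
Allow: /category/
# then disallow specific paths under those again
Disallow: /category/private/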
You could set a link tag with the rel="canonical" attribute. This will help search engines know which URL is the 'right' one and avoid having more than one URL per product in search results.
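For example, each duplicate version of the product page could point at the one you want indexed, something like:
<!-- in the <head> of every duplicate URL for this product -->
<link rel="canonical" href="https://www.example.com/category/subcategory/ProductPage" />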
Read here for more information
Can anybody please explain the correct robots.txt command for the following scenario?
I would like to allow access to:
/directory/subdirectory/..
But I would also like to restrict access to /directory/ notwithstanding the above exception.
Be aware that there is no real official standard and that any web crawler may happily ignore your robots.txt.
According to a Google groups post, the following works, at least with Googlebot:
User-agent: Googlebot
Disallow: /directory/
Allow: /directory/subdirectory/
I would recommend using Google's robots.txt tester. It is part of Google Webmaster Tools - https://support.google.com/webmasters/answer/6062598?hl=en
You can edit and test URLs right in the tool, plus you get a wealth of other tools as well.
If these are truly directories then the accepted answer is probably your best choice. But, if you're writing an application and the directories are dynamically generated paths (a.k.a. contexts, routes, etc.), then you might want to use meta tags instead of defining the rules in robots.txt. This gives you the advantage of not having to worry about how different crawlers may interpret/prioritize access to the subdirectory path.
You might try something like this in the code:
if is_parent_directory_path
<meta name="robots" content="noindex, nofollow">
end
http://plus.google.com/robots.txt has the following contents:
User-agent: *
Disallow: /_/
I'm assuming this means search engines are allowed to index anything in the first level off the root and nothing further?
I believe those lines are used to deny robots access to URLs like
https://plus.google.com/_/apps-static/_/ss/landing/...
and
https://plus.google.com/_/apps-static/_/js/landing/....
These URLs mainly appear to be CSS, JavaScript, and JSON, but there might be other things (that are more valuable to search engines) which aren't immediately obvious.
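In other words, this is a simple prefix match, so only paths beginning with /_/ are affected:
User-agent: *
# blocks /_/apps-static/... and anything else whose path starts with /_/;
# everything else on plus.google.com remains crawlable, however deep it is
Disallow: /_/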