How to disallow a certain action? - robots.txt

I'd like to disallow /questions/{ID}/foo but not /questions/{ID}.
Is the syntax Disallow: /questions/*/foo?

A good place to start looking for the proper syntax is the original robots.txt documentation at robotstxt.org:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
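If it helps to see that behaviour concretely, here is a minimal check using Python's standard library urllib.robotparser, which implements the original prefix-matching rules (the example.com URL is just an illustration):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /questions/*/foo
""".splitlines())

# An original-spec parser treats "*" literally, so this URL is NOT blocked:
# the path does not literally start with "/questions/*/foo".
print(rp.can_fetch("*", "https://example.com/questions/123/foo"))  # True

Bots that implement the wildcard extension (covered in the related questions below) would read the same rule as blocking /questions/123/foo.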

Related

robots.txt: Does Wildcard mean no characters too?

I have the following example robots.txt and questions about the wildcard:
User-agent: *

Disallow: /*/admin/*
Does this rule now apply to both of these pages:
http://www.example.org/admin
and http://www.example.org/es/admin
So can the Wildcard stand for no characters?
In the original robots.txt specification, * in Disallow values has no special meaning; it’s just a character like any other. So bots following the original spec would crawl http://www.example.org/admin as well as http://www.example.org/es/admin.
Some bots support "extensions" of the original robots.txt spec, and a popular extension is interpreting * in Disallow values as a wildcard. However, these extensions aren’t standardized anywhere, so each bot could interpret them differently.
The most popular definition is arguably the one from Google Search (Google says that Bing, Yahoo, and Ask use the same definition):
* designates 0 or more instances of any valid character
Your example
When interpreting the * according to the above definition, both of your URLs would still be allowed to be crawled, though.
Your /*/admin/* requires three slashes in the path, but http://www.example.org/admin has only one, and http://www.example.org/es/admin has only two.
(Also note that the empty line between the User-agent and the Disallow lines is not allowed.)
You might want to use this:
User-agent: *
Disallow: /admin
Disallow: /*/admin
This would block at least the same URLs, but possibly more than you want to block (depending on your URLs):
User-agent: *
Disallow: /*admin
Keep in mind that bots that follow the original robots.txt spec would ignore it, as they interpret * literally. If you want to cover both kinds of bots, you would have to add multiple records: a record with User-agent: * for the bots that follow the original spec, and a record listing (in User-agent) all the user agents that support the wildcard.
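As a rough sanity check of that reasoning, here is a sketch of the Google-style interpretation, assuming "*" means zero or more of any character and that a rule matches as a prefix of the URL path (the /es/admin/x path is only an illustration):

import re

def google_style_match(rule, path):
    # Escape everything except "*", then turn each "*" into ".*".
    # A trailing "$" anchor is not handled in this sketch.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None

print(google_style_match("/*/admin/*", "/admin"))       # False - only one slash in the path
print(google_style_match("/*/admin/*", "/es/admin"))    # False - no slash after "admin"
print(google_style_match("/*/admin/*", "/es/admin/x"))  # True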

Allow only one file of a directory in robots.txt?

I want to allow only one file in the directory /minsc, but I would like to disallow the rest of the directory.
My robots.txt currently looks like this:
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
The file that I want to allow is /minsc/menu-leaf.png
I'm afraid of doing damage, so I don't know whether I should use:
A)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
Allow: /minsc/menu-leaf.png
or
B)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/*   # added "*"
Allow: /minsc/menu-leaf.png
?
Thanks and sorry for my English.
According to the robots.txt website:
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The
easy way is to put all files to be disallowed into a separate
directory, say "stuff", and leave the one file in the level above this
directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
According to Wikipedia, if you are going to use the Allow directive, it should go before the Disallow for maximum compatibility:
Allow: /directory1/myfile.html
Disallow: /directory1/
Furthermore, you should put Crawl-delay last, according to Yandex:
To maintain compatibility with robots that may deviate from the
standard when processing robots.txt, the Crawl-delay directive needs
to be added to the group that starts with the User-Agent record right
after the Disallow and Allow directives.
So, in the end, your robots.txt file should look like this:
User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
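As a rough way to double-check that ordering, Python's standard library urllib.robotparser (3.6+ for Crawl-delay) applies the first Allow/Disallow rule whose path is a prefix of the URL, so with Allow listed first the single file stays crawlable (the other.png URL is hypothetical):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
""".splitlines())

print(rp.can_fetch("*", "https://example.com/minsc/menu-leaf.png"))  # True  - the one allowed file
print(rp.can_fetch("*", "https://example.com/minsc/other.png"))      # False - rest of the directory
print(rp.crawl_delay("*"))                                           # 10

Wildcard-aware crawlers such as Googlebot use longest-match rather than first-match precedence, but they reach the same result here because the Allow rule is more specific.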
Robots.txt is sort of an 'informal' standard that can be interpreted differently. The only interesting 'standard' is really how the major players are interpreting it.
I found this source saying that globbing ('*'-style wildcards) are not supported:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
http://www.robotstxt.org/robotstxt.html
So, according to this source, you should stick with your option (A).

Is wildcard in Robots.txt in middle of string recognized?

I need a rule in robots.txt like:
Disallow: /article/*/
but I don't know whether this is the proper way to do it.
For example, I want:
/article/hello
/article/123
to be followed, but:
/article/hello/edit
/article/123/768&goshopping
not to be followed.
Wildcards are not part of the original robots.txt specification, but they are supported by all of the major search engines. If you just want to keep Google/Bing/Yahoo from crawling these pages, then the following should do it:
User-agent: *
Disallow: /article/*/
Older crawlers that do not support wildcards will simply ignore this line.
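If you want to see which of your example paths such a rule would catch under the Google-style interpretation (the rule translated to a regular expression and matched as a path prefix), here is a tiny sketch:

import re

pattern = re.compile("/article/.*/")   # translated from "Disallow: /article/*/"

for path in ["/article/hello", "/article/123",
             "/article/hello/edit", "/article/123/768&goshopping"]:
    print(path, "blocked" if pattern.match(path) else "allowed")

The first two paths print "allowed" and the last two print "blocked", because the rule requires a "/" after whatever the "*" matched.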

How To Use a Wildcard in robots.txt

Is it possible to use:
User-agent: *
Disallow: /apps/abc*/
in a robots.txt file to disallow abc123, abc-xyz, etc.?
Quoting Wikipedia:
The Robot Exclusion Standard does not mention anything about the "*" character in the Disallow: statement. Some crawlers like Googlebot and Slurp recognize strings containing "*", while MSNbot and Teoma interpret it in different ways.
Additional research may be found in the cited source.
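For what it's worth, under the wildcard interpretation (again assuming "*" means any run of characters and the rule matches as a path prefix), the trailing slash in /apps/abc*/ still has to appear in the URL; the example paths below are made up:

import re

pattern = re.compile("/apps/abc.*/")   # translated from "Disallow: /apps/abc*/"

for path in ["/apps/abc123/", "/apps/abc-xyz/page", "/apps/abc123"]:
    print(path, "blocked" if pattern.match(path) else "allowed")

The first two print "blocked"; "/apps/abc123" without a trailing slash does not, so drop the trailing "/" from the rule if you want to block that form as well.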

Can I use robots.txt to block certain URL parameters?

Before you tell me 'what have you tried' and 'test this yourself', I would like to note that robots.txt updates awfully slowly for my site/any site on search engines, so if you could share theoretical experience, that would be appreciated.
For example, is it possible to allow:
http://www.example.com
And block:
http://www.example.com/?foo=foo
I'm not very sure.
Help?
According to Wikipedia, "The robots.txt patterns are matched by simple substring comparisons", and as the query string is part of the URL you should be able to just add:
Disallow: /?foo=foo
or something more fancy like
Disallow: /*?*
to block all query strings. The asterisk is a wildcard, so it matches zero or more characters of any kind.
Example of a robots.txt with dynamic urls.
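If it helps, here is what an original-spec (prefix-matching) parser does with the simpler rule, using Python's standard library and the example.com URLs from the question:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /?foo=foo
""".splitlines())

print(rp.can_fetch("*", "http://www.example.com/"))          # True  - homepage still crawlable
print(rp.can_fetch("*", "http://www.example.com/?foo=foo"))  # False - blocked

Disallow: /*?*, on the other hand, relies on the wildcard extension, so only bots that support "*" (Googlebot, Bingbot, and the like) would apply it to every URL with a query string.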