robots.txt: Does Wildcard mean no characters too?

I have the following example robots.txt and questions about the wildcard:
User-agent: *
Disallow: /*/admin/*
Does this rule now apply to both of these pages:
http://www.example.org/admin
and http://www.example.org/es/admin
So can the wildcard stand for zero characters?

In the original robots.txt specification, * in Disallow values has no special meaning; it's just a character like any other. So bots following the original spec would crawl http://www.example.org/admin as well as http://www.example.org/es/admin.
Some bots support extensions of the original robots.txt spec, and a popular extension is interpreting * in Disallow values as a wildcard. However, these extensions aren't standardized anywhere; each bot may interpret the * differently.
The most popular definition is arguably the one from Google Search (Google says that Bing, Yahoo, and Ask use the same definition):
* designates 0 or more instances of any valid character
Your example
When interpreting the * according to the above definition, though, both of your URLs would still be allowed to be crawled.
Your /*/admin/* requires three slashes in the path, but http://www.example.org/admin has only one, and http://www.example.org/es/admin has only two.
(Also note that an empty line between the User-agent and Disallow lines of a record is not allowed.)
You might want to use this:
User-agent: *
Disallow: /admin
Disallow: /*/admin
The following would block at least the same URLs, but possibly more than you want to block (depending on your URLs):
User-agent: *
Disallow: /*admin
Keep in mind that bots that follow the original robots.txt spec would ignore it, as they interpret * literally. If you want to cover both kinds of bots, you would have to add multiple records: a record with User-agent: * for the bots that follow the original spec, and a record listing (in User-agent lines) all bots that support the wildcard.
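
To make the wildcard matching concrete, here is a minimal Python sketch of the Google-style interpretation described above. google_style_match is a hypothetical helper written for this answer, not any bot's actual code:

import re

def google_style_match(pattern: str, path: str) -> bool:
    """Match a Disallow pattern against a URL path, treating *
    as 'zero or more of any character' (Google-style extension).
    Matching is anchored at the start of the path, mirroring the
    prefix matching of the original spec."""
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex, path) is not None

# /*/admin/* needs three slashes, so neither example URL is blocked:
print(google_style_match("/*/admin/*", "/admin"))     # False
print(google_style_match("/*/admin/*", "/es/admin"))  # False

# The suggested pair of rules covers both:
print(google_style_match("/admin", "/admin"))         # True
print(google_style_match("/*/admin", "/es/admin"))    # True

# /*admin blocks both, but also any other path containing "admin":
print(google_style_match("/*admin", "/administrator/login"))  # True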

Related

Why is googlebot blocking all my urls if the only disallow I selected in robots.txt was for iisbot?

I have had the following robots.txt for over a year, seemingly without issues:
User-Agent: *
User-Agent: iisbot
Disallow: /
Sitemap: http://iprobesolutions.com/sitemap.xml
Now I'm getting an error from the robots.txt Tester.
Why is googlebot blocking all my URLs if the only disallow I selected was for iisbot?
Consecutive User-Agent lines are added together, so the Disallow will apply to User-Agent: * as well as to User-Agent: iisbot. If you only want to block iisbot, use:
Sitemap: http://iprobesolutions.com/sitemap.xml
User-Agent: iisbot
Disallow: /
You actually don't need the User-Agent: *.
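
As an aside, Python's standard-library urllib.robotparser groups consecutive User-Agent lines into one record the same way, so it can be used to reproduce the effect (a quick sketch; "Googlebot" stands in for any bot other than iisbot):

import urllib.robotparser

lines = [
    "User-Agent: *",
    "User-Agent: iisbot",
    "Disallow: /",
    "Sitemap: http://iprobesolutions.com/sitemap.xml",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)

# Both User-Agent lines belong to the same record, so the
# Disallow: / applies to every bot, not just iisbot:
print(rp.can_fetch("Googlebot", "http://iprobesolutions.com/"))  # False
print(rp.can_fetch("iisbot", "http://iprobesolutions.com/"))     # False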
Your robots.txt is not valid (according to the original robots.txt specification).
You can have multiple records.
Records are separated by empty lines.
Each record must have at least one User-agent line and at least one Disallow line.
The spec doesn’t define how invalid records should be treated. So user-agents might either interpret your robots.txt as having one record (ignoring the empty line), or they might interpret the first record as allowing everything (at least that would be the likely assumption).
If you want to allow all bots (except "iisbot") to crawl everything, you should use:
User-Agent: *
Disallow:
User-Agent: iisbot
Disallow: /
Alternatively, you could omit the first record, as allowing everything is the default anyway. But I’d prefer to be explicit here.
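
Checked with the same standard-library parser, the explicit two-record version behaves as intended (a sketch; the URL is just an example):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow:",
    "",
    "User-Agent: iisbot",
    "Disallow: /",
])

# Everything is allowed for other bots, blocked for iisbot:
print(rp.can_fetch("Googlebot", "http://iprobesolutions.com/page"))  # True
print(rp.can_fetch("iisbot", "http://iprobesolutions.com/page"))     # False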

Using "Disallow: /*?" in robots.txt file

I used
Disallow: /*?
in the robots.txt file to disallow all pages that might contain a "?" in the URL.
Is that syntax correct, or am I blocking other pages as well?
It depends on the bot.
Bots that follow the original robots.txt specification don’t give the * any special meaning. These bots would block any URL whose path starts with /*, directly followed by ?, e.g., http://example.com/*?foo.
Some bots, including the Googlebot, give the * character a special meaning. It typically stands for any sequence of characters. These bots would block what you seem to intend: any URL with a ?.
Google’s robots.txt documentation includes this very case:
To block access to all URLs that include question marks (?). For example, the sample code blocks URLs that begin with your domain name, followed by any string, followed by a question mark, and ending with any string:
User-agent: Googlebot
Disallow: /*?
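
Under that wildcard interpretation, /*? translates to "a path starting with /, then anything, then a literal ?". A quick regex sketch of that reading (the URLs are made up for illustration):

import re

# Google-style reading of "Disallow: /*?": * is any run of
# characters, ? is literal, and matching starts at the path root.
blocked_by_rule = re.compile(r"/.*\?")

for path in ["/page?id=1", "/?", "/search?q=x", "/page", "/faq"]:
    verdict = "blocked" if blocked_by_rule.match(path) else "allowed"
    print(f"{path}: {verdict}")
# Only the paths containing a literal ? are blocked; /page and /faq
# stay crawlable.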

Is wildcard in Robots.txt in middle of string recognized?

I need a rule in robots.txt like:
Disallow: /article/*/
but I don't know whether this is the proper way to do it.
For example, these URLs:
/article/hello
/article/123
should still be crawled, but these:
/article/hello/edit
/article/123/768&goshopping
should not be.
Wildcards are not part of the original robots.txt specification, but they are supported by all of the major search engines. If you just want to keep Google/Bing/Yahoo from crawling these pages, then the following should do it:
User-agent: *
Disallow: /article/*/
Older crawlers that do not support wildcards will treat the * literally; since no real URL path starts with /article/*/, the line is effectively ignored by them.
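
Under the wildcard interpretation, /article/*/ requires a second slash after /article/, which is exactly what separates your two groups of URLs. A small Python sketch of that matching (same regex translation as in the first answer):

import re

# /article/*/ --> "/article/", then anything, then another "/"
rule = re.compile(r"/article/.*/")

for path in ["/article/hello", "/article/123",
             "/article/hello/edit", "/article/123/768&goshopping"]:
    verdict = "blocked" if rule.match(path) else "crawlable"
    print(f"{path}: {verdict}")
# /article/hello and /article/123 stay crawlable; the deeper paths
# contain a second slash and are blocked.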

Multiple User-agents: * in robots.txt

Related question: Multiple User Agents in Robots.txt
I'm reading a robots.txt file on a certain website and it seems to be contradictory to me (but I'm not sure).
User-agent: *
Disallow: /blah
Disallow: /bleh
...
...
...several more Disallows
User-agent: *
Allow: /
I know that you can exclude certain robots by specifying multiple User-agents, but this file seems to be saying that all robots are disallowed from a bunch of files but also allowed to access all files? Or am I reading this wrong?
This robots.txt is invalid, as there may be only one record with User-agent: *. If we fix it, we have:
User-agent: *
Disallow: /blah
Disallow: /bleh
Allow: /
Allow is not part of the original robots.txt specification, so not all parsers will understand it (they have to ignore the line).
For parsers that understand Allow, this line simply means: allow everything (else). But that is the default anyway, so this robots.txt has the same meaning:
User-agent: *
Disallow: /blah
Disallow: /bleh
Meaning: Everything is allowed except URLs whose paths start with /blah or /bleh.
If the Allow line came before the Disallow lines, some parsers might ignore the Disallow lines. But as Allow is not part of the original specification, this might differ from parser to parser.
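
Python's urllib.robotparser is one parser that understands Allow, and it applies the first rule that matches, so rule order changes the outcome there. A small sketch (bot name and URL are arbitrary):

import urllib.robotparser

def allowed(lines, url="http://example.com/blah/page"):
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch("SomeBot", url)

# Allow after the Disallows: /blah is still blocked.
print(allowed(["User-agent: *",
               "Disallow: /blah",
               "Disallow: /bleh",
               "Allow: /"]))        # False

# Allow first: the first matching rule wins in this parser,
# so the Disallow lines are never reached.
print(allowed(["User-agent: *",
               "Allow: /",
               "Disallow: /blah",
               "Disallow: /bleh"])) # True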

How To Use a Wildcard in robots.txt

Is it possible to use:
User-agent: *
Disallow: /apps/abc*/
in a robots.txt file to disallow abc123, abc-xyz, etc.?
Quoting Wikipedia:
The Robot Exclusion Standard does not mention anything about the "*" character in the Disallow: statement. Some crawlers like Googlebot and Slurp recognize strings containing "*", while MSNbot and Teoma interpret it in different ways.
More details can be found in the cited source.
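
For the crawlers that do treat * as a wildcard, /apps/abc*/ would require a slash after the abc… segment. A regex sketch of that reading (URLs invented for illustration):

import re

# /apps/abc*/ --> "/apps/abc", then anything, then a "/"
rule = re.compile(r"/apps/abc.*/")

for path in ["/apps/abc123/", "/apps/abc-xyz/settings",
             "/apps/abc123", "/apps/other/"]:
    verdict = "blocked" if rule.match(path) else "crawlable"
    print(f"{path}: {verdict}")
# Note: /apps/abc123 without a trailing slash is NOT matched by
# this pattern under that reading.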