Using "Disallow: /*?" in robots.txt file - robots.txt

I used
Disallow: /*?
in the robots.txt file to disallow all pages that might contain a "?" in the URL.
Is that syntax correct, or am I blocking other pages as well?

It depends on the bot.
Bots that follow the original robots.txt specification don’t give the * any special meaning. These bots would only block URLs whose path literally starts with /*?, e.g., http://example.com/*?foo.
Some bots, including the Googlebot, give the * character a special meaning. It typically stands for any sequence of characters. These bots would block what you seem to intend: any URL with a ?.
Google’s robots.txt documentation includes this very case:
To block access to all URLs that include question marks (?). For example, the sample code blocks URLs that begin with your domain name, followed by any string, followed by a question mark, and ending with any string:
User-agent: Googlebot
Disallow: /*?
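For illustration, here is a minimal Python sketch of how that wildcard behavior can be approximated by translating the rule into a regular expression and matching it against the path plus query string. This is only an approximation of the documented behavior, not Google's actual matcher, and the helper names (rule_to_regex, is_blocked) are invented for this example:
import re
from urllib.parse import urlsplit

def rule_to_regex(rule):
    # Treat * as "any sequence of characters" and anchor the rule at the start of the path.
    return re.compile(re.escape(rule).replace(r"\*", ".*"))

def is_blocked(url, rule):
    parts = urlsplit(url)
    target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
    return rule_to_regex(rule).match(target) is not None

print(is_blocked("http://example.com/page", "/*?"))          # False
print(is_blocked("http://example.com/page?foo=bar", "/*?"))  # True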

Related

robots.txt: Does Wildcard mean no characters too?

I have the following example robots.txt and questions about the wildcard:
User-agent: *
Disallow: /*/admin/*
Does this rule now apply on both pages:
http://www.example.org/admin
and http://www.example.org/es/admin
So can the Wildcard stand for no characters?
In the original robots.txt specification, * in Disallow values has no special meaning; it’s just a character like any other. So bots following the original spec would crawl http://www.example.org/admin as well as http://www.example.org/es/admin.
Some bots support "extensions" of the original robots.txt spec, and a popular extension is interpreting * in Disallow values as a wildcard. However, these extensions aren’t standardized anywhere, so each bot may interpret them differently.
The most popular definition is arguably the one from Google Search (Google says that Bing, Yahoo, and Ask use the same definition):
* designates 0 or more instances of any valid character
Your example
When interpreting the * according to the above definition, both of your URLs would still be allowed to be crawled, though.
Your /*/admin/* requires three slashes in the path, but http://www.example.org/admin has only one, and http://www.example.org/es/admin has only two.
(Also note that the empty line between the User-agent and the Disallow lines is not allowed.)
You might want to use this:
User-agent: *
Disallow: /admin
Disallow: /*/admin
This would block at least as much, but possibly more than you want to block (depending on your URLs):
User-agent: *
Disallow: /*admin
Keep in mind that bots that follow the original robots.txt spec would ignore it, as they interpret * literally. If you want to cover both kinds of bots, you would have to add multiple records: one record with User-agent: * for the bots that follow the original spec, and another record listing (in User-agent) all the user agents that support the wildcard.
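To make the difference between the two kinds of bots concrete, here is a small Python sketch. It assumes that a plain prefix comparison approximates the original spec and that a regex translation of * approximates the wildcard extension; the function names are made up for this illustration:
import re

def blocked_original(path, rule):
    # Original spec: the Disallow value is a literal path prefix; * is just a character.
    return path.startswith(rule)

def blocked_wildcard(path, rule):
    # Wildcard extension: * matches zero or more characters, anchored at the start of the path.
    return re.match(re.escape(rule).replace(r"\*", ".*"), path) is not None

paths = ["/admin", "/es/admin", "/es/admin/users"]
for rule in ["/*/admin/*", "/admin", "/*/admin"]:
    for path in paths:
        print(rule, path,
              "original:", blocked_original(path, rule),
              "wildcard:", blocked_wildcard(path, rule))
# Under either interpretation, /*/admin/* blocks neither /admin nor /es/admin;
# for wildcard-aware bots, /admin plus /*/admin covers both of those URLs.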

robots.txt: how are ill-formed disallow lines treated

What happens when a Disallow line includes more than one URI? Example:
Disallow: / tmp/
The white space was introduced by mistake.
Is there a standard way for crawlers to deal with this? Do they ignore the whole line, or just ignore the second URI and treat it like:
Disallow: /
Google, at least, seems to treat the first non-space character as the beginning of the path, and the last non-space character as the end. Anything in-between is counted as part of the path, even if it's a space. Google also silently percent-encodes certain characters in the path, including spaces.
So the following:
Disallow: / tmp/
will block:
http://example.com/%20tmp/
but it will not block:
http://example.com/tmp/
I have verified this on Google's robots.txt tester. YMMV for crawlers other than Google.
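Here is a rough Python sketch of that behavior (an approximation of what was observed in the tester, not Google's actual parser): the value after the colon is trimmed, and any space left inside it is percent-encoded before a simple prefix comparison; the helper names are invented for this example:
from urllib.parse import quote

def parse_disallow(line):
    # Take everything after the first colon, trim surrounding whitespace,
    # then percent-encode the spaces that remain inside the value.
    value = line.split(":", 1)[1].strip()
    return quote(value, safe="/?&=*$%")

def blocked(path, rule):
    # Simple prefix comparison against the already-encoded request path.
    return path.startswith(rule)

rule = parse_disallow("Disallow: / tmp/")
print(rule)                           # /%20tmp/
print(blocked("/%20tmp/page", rule))  # True
print(blocked("/tmp/page", rule))     # False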

Is wildcard in Robots.txt in middle of string recognized?

I need some string for robots.txt like:
Disallow: /article/*/
but I don't know if this is the proper way to do it.
I need that for example:
/article/hello
/article/123
may be followed; BUT:
/article/hello/edit
/article/123/768&goshopping
the last lines would not be followed....
Wildcards are not part of the original robots.txt specification, but they are supported by all of the major search engines. If you just want to keep Google/Bing/Yahoo from crawling these pages, then the following should do it:
User-agent: *
Disallow: /article/*/
Older crawlers that do not support wildcards will treat the * as a literal character, so the line effectively does nothing for them.
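If you want to sanity-check the pattern, here is a short Python sketch assuming the usual wildcard interpretation (* stands for zero or more characters, matched from the start of the path); the helper name is invented for this example:
import re

def blocked_by_wildcard_rule(path, rule):
    # * matches zero or more characters; the rule is anchored at the start of the path.
    return re.match(re.escape(rule).replace(r"\*", ".*"), path) is not None

rule = "/article/*/"
for path in ["/article/hello", "/article/123",
             "/article/hello/edit", "/article/123/768&goshopping"]:
    print(path, blocked_by_wildcard_rule(path, rule))
# /article/hello False
# /article/123 False
# /article/hello/edit True
# /article/123/768&goshopping True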

Can I use robots.txt to block certain URL parameters?

Before you tell me 'what have you tried' and 'test this yourself', I would like to note that robots.txt updates awfully slowly for my site (or any site) on search engines, so if you could provide theoretical experience, that would be appreciated.
For example, is it possible to allow:
http://www.example.com
And block:
http://www.example.com/?foo=foo
I'm not very sure.
Help?
According to Wikipedia, "The robots.txt patterns are matched by simple substring comparisons," and since the query string is part of the URL, you should be able to just add:
Disallow: /?foo=foo
or something more fancy like
Disallow: /*?*
to block all query strings. The asterisk is a wildcard, so it matches zero or more characters of any kind.
Example of a robots.txt with dynamic urls.
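As a rough illustration (assuming wildcard-aware matching against the path plus query string, as in the answers above), a quick Python check shows the homepage staying crawlable while the parameterized URL is blocked by either rule; the blocked helper is invented for this example:
import re
from urllib.parse import urlsplit

def blocked(url, rule):
    # Match the rule against path + query, anchored at the start,
    # with * standing for zero or more characters.
    parts = urlsplit(url)
    target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
    return re.match(re.escape(rule).replace(r"\*", ".*"), target) is not None

print(blocked("http://www.example.com", "/?foo=foo"))           # False -> still crawlable
print(blocked("http://www.example.com/?foo=foo", "/?foo=foo"))  # True  -> blocked
print(blocked("http://www.example.com", "/*?*"))                # False
print(blocked("http://www.example.com/?foo=foo", "/*?*"))       # True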

Spaces in folder names

I am a bit new at search engine optimisation, and I am just trying to write a robots.txt file. I want to disallow pages in the terms and conditions folder, which has a space in its name.
Should I write:
Disallow: /Page TermsAndConditions/
Or:
Disallow: /Page%20TermsAndConditions/
Or both, or does it not matter?
(I want to disallow this folder because this is full of technical jargon that I think is messing up the content keywords found by the googlebot - or do I have this wrong as well?).
Edit:
I found Google's Robots.txt Specifications page, which says: "Non-7-bit ASCII characters in a path may be included as UTF-8 characters or as percent-escaped UTF-8 encoded characters per RFC 3986". So I guess the answer to this question is that it doesn't matter.
Rename the folder without a space and avoid the issue.
Disallow: /PageTermsAndConditions/
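For what it's worth, the percent-encoding point can be checked in a couple of lines of Python: a literal space encodes to %20, so after the kind of normalization described in the earlier answer about "Disallow: / tmp/", the two spellings refer to the same path. Whether a particular crawler normalizes this way is not guaranteed, which is another argument for simply renaming the folder:
from urllib.parse import quote

with_space  = quote("/Page TermsAndConditions/", safe="/%")  # space becomes %20
already_enc = "/Page%20TermsAndConditions/"

print(with_space)                 # /Page%20TermsAndConditions/
print(with_space == already_enc)  # True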