Why is Googlebot blocking all my URLs if the only disallow I selected in robots.txt was for iisbot? - robots.txt

I have had the following robots.txt for over a year, seemingly without issues:
User-Agent: *

User-Agent: iisbot
Disallow: /
Sitemap: http://iprobesolutions.com/sitemap.xml
Now I'm getting an error from the robots.txt Tester saying that my URLs are blocked for Googlebot.
Why is Googlebot blocking all my URLs if the only Disallow I selected was for iisbot?

Consecutive User-Agent lines are added together, so the Disallow: / applies to User-Agent: * as well as to User-Agent: iisbot. If you only want to block iisbot, use:
Sitemap: http://iprobesolutions.com/sitemap.xml
User-Agent: iisbot
Disallow: /
You actually don't need the User-Agent: *.
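If it helps to see that grouping concretely, here is a minimal sketch using Python's standard-library robots.txt parser, feeding the two User-Agent lines and the Disallow as one record (the host is the one from your file; nothing is fetched over the network, and this is only an approximation of how Google's tester groups the lines):
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Both User-Agent lines and the Disallow form one record here, mirroring
# the "added together" interpretation described above.
parser.parse("""\
User-Agent: *
User-Agent: iisbot
Disallow: /
""".splitlines())

# The Disallow: / applies to every bot, not just iisbot.
print(parser.can_fetch("Googlebot", "http://iprobesolutions.com/"))  # False
print(parser.can_fetch("iisbot", "http://iprobesolutions.com/"))     # False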

Your robots.txt is not valid (according to the original robots.txt specification).
You can have multiple records.
Records are separated by empty lines.
Each record must have at least one User-agent line and at least one Disallow line.
Your first record consists only of the User-Agent: * line, with no Disallow line. The spec doesn't define how such invalid records should be treated, so user-agents might either interpret your robots.txt as having one record (ignoring the empty line), or they might interpret the first record as allowing everything (at least that would be the likely assumption).
If you want to allow all bots (except "iisbot") to crawl everything, you should use:
User-Agent: *
Disallow:
User-Agent: iisbot
Disallow: /
Alternatively, you could omit the first record, as allowing everything is the default anyway. But I’d prefer to be explicit here.
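If you want to double-check the two-record version locally, a rough sketch with Python's standard-library parser (which follows the original spec, not Google's exact behavior) shows the intended result; the host is the one from the question:
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-Agent: *
Disallow:

User-Agent: iisbot
Disallow: /
""".splitlines())

# Every other bot may crawl everything; iisbot is blocked everywhere.
print(parser.can_fetch("Googlebot", "http://iprobesolutions.com/"))  # True
print(parser.can_fetch("iisbot", "http://iprobesolutions.com/"))     # False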

Related

Prevent indexing of images containing a given string

I am a photographer and I need to prevent the indexing (and thus the finding) of my clients' images, which are displayed in a password-protected shop.
I could include a specific string such as WWWWW in the file names to mark the files I want to hide.
Would this robots.txt do the job?
User-agent: *
Disallow: /*WWWWW*
How can I test whether it does?
Thanks
For example, you can target Google's image bot and block a specific file type:
User-agent: Googlebot-Image
Disallow: /*.gif$
Or you can block access to the files entirely with an .htaccess rule:
Deny from all
You can test your existing robots.txt file by using, for example, https://en.ryte.com/free-tools/robots-txt/ or even Google's own tester: https://support.google.com/webmasters/answer/6062598?hl=en
The following will disallow a specific directory:
User-agent: *
Disallow: /path/to/images/
You can also use a wildcard *:
User-agent: *
Disallow: /*.jpg # Disallows any JPEG images
Disallow: /*/images/ # Disallows parsing of all */images/* directories
There's no need for trailing wildcards; they are ignored: /*/path/* equals /*/path/.
You don't want to make an extensive list of every single file to disallow, because the contents of the robots.txt file are publicly available. Therefore it is good practice to prioritize directories over file paths.
See https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values for examples of paths/wildcards, and what they actually match.
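If you want a quick local sanity check of the directory-based rule, a small sketch with Python's standard-library parser works; note that this parser follows the original spec and does not expand * wildcards, so the /*WWWWW* and /*.jpg variants above can only be verified with a wildcard-aware tester such as Google's. The host and file names below are placeholders:
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /path/to/images/
""".splitlines())

# Anything under the protected directory is disallowed by prefix match ...
print(parser.can_fetch("Googlebot-Image", "http://example.com/path/to/images/client1.jpg"))  # False
# ... while pages outside it stay crawlable.
print(parser.can_fetch("Googlebot-Image", "http://example.com/portfolio.html"))              # True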

robots.txt: Does Wildcard mean no characters too?

I have the following example robots.txt and questions about the wildcard:
User-agent: *

Disallow: /*/admin/*
Does this rule now apply to both pages:
http://www.example.org/admin
and http://www.example.org/es/admin
So can the wildcard stand for no characters?
In the original robots.txt specification, * in Disallow values has no special meaning, it’s just a character like any other. So, bots following the original spec would crawl http://www.example.org/admin as well as http://www.example.org/es/admin.
Some bots support "extensions" of the original robots.txt spec, and a popular extension is interpreting * in Disallow values as a wildcard. However, these extensions aren't standardized anywhere, so each bot could interpret it differently.
The most popular definition is arguably the one from Google Search (Google says that Bing, Yahoo, and Ask use the same definition):
* designates 0 or more instances of any valid character
Your example
When interpreting the * according to the above definition, both of your URLs would still be allowed to be crawled, though.
Your /*/admin/* requires three slashes in the path, but http://www.example.org/admin has only one, and http://www.example.org/es/admin has only two.
(Also note that the empty line between the User-agent and the Disallow lines is not allowed.)
You might want to use this:
User-agent: *
Disallow: /admin
Disallow: /*/admin
This would block at least the same, but possibly more than you want to block (depends on your URLs):
User-agent: *
Disallow: /*admin
Keep in mind that bots who follow the original robots.txt spec would ignore it, as they interpret * literally. If you want to cover both kinds of bots, you would have to add multiple records: a record with User-agent: * for the bots that follow the original spec, and a record listing all user agents (in User-agent) that support the wildcard.
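To make that concrete, here is a small sketch with Python's standard-library parser, which behaves like an original-spec consumer (it does no wildcard expansion); the host is the one from your question, and SomeBot stands in for any bot name:
from urllib.robotparser import RobotFileParser

# The wildcard rule is taken literally, so neither URL is blocked.
parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /*/admin/*
""".splitlines())
print(parser.can_fetch("SomeBot", "http://www.example.org/admin"))     # True
print(parser.can_fetch("SomeBot", "http://www.example.org/es/admin"))  # True

# With the literal-prefix alternative, /admin is blocked even for such
# parsers; /es/admin is still only covered by the wildcard-aware rule.
parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /admin
Disallow: /*/admin
""".splitlines())
print(parser.can_fetch("SomeBot", "http://www.example.org/admin"))     # False
print(parser.can_fetch("SomeBot", "http://www.example.org/es/admin"))  # True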

Allow only one file of a directory in robots.txt?

I want to allow only one file in the directory /minsc, but I would like to disallow the rest of the directory.
Right now my robots.txt looks like this:
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
The file that I want to allow is /minsc/menu-leaf.png
I'm afraid of doing damage, so I don't know whether I should use:
A)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
Allow: /minsc/menu-leaf.png
or
B)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/*   # added "*"
Allow: /minsc/menu-leaf.png
?
Thanks and sorry for my English.
According to the robots.txt website:
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
According to Wikipedia, if you are going to use the Allow directive, it should go before the Disallow for maximum compatibility:
Allow: /directory1/myfile.html
Disallow: /directory1/
Furthermore, you should put Crawl-delay last, according to Yandex:
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent record, right after the Disallow and Allow directives.
So, in the end, your robots.txt file should look like this:
User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
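As a rough local check, Python's standard-library parser applies rules in file order (first match wins), which is exactly the kind of simple parser the Allow-before-Disallow advice is aimed at; Google is documented to pick the most specific matching rule instead, so for Google the order would not matter here. The host and the second file name are placeholders:
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
""".splitlines())

# The single whitelisted file stays crawlable; the rest of /minsc/ does not.
print(parser.can_fetch("Googlebot", "http://example.com/minsc/menu-leaf.png"))   # True
print(parser.can_fetch("Googlebot", "http://example.com/minsc/other-image.png")) # False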
Robots.txt is sort of an 'informal' standard that can be interpreted differently. The only interesting 'standard' is really how the major players are interpreting it.
I found this source saying that globbing ('*'-style wildcards) is not supported:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
http://www.robotstxt.org/robotstxt.html
So according to this source you should stick with your alternative (A).

Disallow robots for different purposes with one directive

Can I combine the two records below into one, as shown under them, and will the Google or Bing bot still follow my robots.txt? I have recently seen Bingbot not following the second record, and I'm thinking that if I combine them, it might follow it.
Original
User-agent:*
Disallow: /folder1/
Disallow: /folder2/
User-agent: *
Disallow: /*.png
Disallow: /*.jpg
Wanted to change to this
User-agent:*
Disallow: /folder1/
Disallow: /folder2/
Disallow: /*.png
Disallow: /*.jpg
You may only have one record with User-agent: *:
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
If you have more than one of these records, bots (that are not matched by a more specific record) might only follow the first one in the file.
So you have to use this record:
User-agent: *
Disallow: /folder1/
Disallow: /folder2/
Disallow: /*.png
Disallow: /*.jpg
Note that the * in a Disallow value has no special meaning in the original robots.txt specification, but some consumers use it as a wildcard.
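A quick illustration of that caveat, using Python's standard-library parser as a stand-in for an original-spec consumer (the host, page, and image names are placeholders; SomeBot stands in for any bot name):
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /folder1/
Disallow: /folder2/
Disallow: /*.png
Disallow: /*.jpg
""".splitlines())

# The folder rules work by simple prefix match ...
print(parser.can_fetch("SomeBot", "http://example.com/folder1/page.html"))  # False
# ... but /*.png is taken literally here, so this stays crawlable for an
# original-spec parser (wildcard-aware bots like Googlebot would block it).
print(parser.can_fetch("SomeBot", "http://example.com/photo.png"))          # True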

Multiple User-agent: * records in robots.txt

Related question: Multiple User Agents in Robots.txt
I'm reading a robots.txt file on a certain website and it seems to be contradictory to me (but I'm not sure).
User-agent: *
Disallow: /blah
Disallow: /bleh
...
...
...several more Disallows
User-agent: *
Allow: /
I know that you can exclude certain robots by specifying multiple User-agents, but this file seems to be saying that all robots are disallowed from a bunch of files but also allowed to access all the files? Or am I reading this wrong?
This robots.txt is invalid, as there must only be one record with User-agent: *. If we fix it, we have:
User-agent: *
Disallow: /blah
Disallow: /bleh
Allow: /
Allow is not part of the original robots.txt specification, so not all parsers will understand it (those parsers have to ignore the line).
For parsers that understand Allow, this line simply means: allow everything (else). But that is the default anyway, so this robots.txt has the same meaning:
User-agent: *
Disallow: /blah
Disallow: /bleh
Meaning: everything is allowed except those URLs whose paths start with /blah or /bleh.
If the Allow line came before the Disallow lines, some parsers might ignore the Disallow lines. But, as Allow is not part of the original specification, this can differ from parser to parser.
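That order sensitivity is easy to reproduce with a first-match parser such as the one in Python's standard library (a sketch only; /blah and /bleh are from the question, the host and page names are placeholders):
from urllib.robotparser import RobotFileParser

RULES_DISALLOW_FIRST = """\
User-agent: *
Disallow: /blah
Disallow: /bleh
Allow: /
"""

RULES_ALLOW_FIRST = """\
User-agent: *
Allow: /
Disallow: /blah
Disallow: /bleh
"""

for rules in (RULES_DISALLOW_FIRST, RULES_ALLOW_FIRST):
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # With Disallow first, /blah/page is blocked; with Allow: / first, this
    # first-match parser stops at Allow: / and everything stays crawlable.
    print(parser.can_fetch("SomeBot", "http://example.com/blah/page"),
          parser.can_fetch("SomeBot", "http://example.com/other"))
# Prints: False True, then True True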