How to disallow bots from a single page or file - robots.txt

How do I disallow bots from a single page while allowing all other content to be crawled?
It's so important not to get this wrong that I'm asking here; I can't find a definitive answer elsewhere.
Is this correct?
User-agent: *
Disallow: /dir/mypage.html
Allow: /

The Disallow line is all that's needed. It will block access to anything that starts with "/dir/mypage.html".
The Allow line is superfluous. The default for robots.txt is Allow: /. In general, Allow is not required. It's there so that you can override access to something that would be disallowed. For example, say you want to disallow access to the "/images" directory, except for images in the "public" subdirectory. You would write:
Allow: /images/public
Disallow: /images
Note that order is important here. Crawlers are supposed to use a "first match" algorithm. If you wrote the Disallow first, then a crawler would assume that access to "/images/public" was blocked.
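If you want to sanity-check that ordering locally, here is a minimal sketch using Python's standard urllib.robotparser, which (as far as I know) follows the original first-match rule; I've added a User-agent: * line to make the snippet a complete record, and the hostname and image paths are just placeholders:

from urllib import robotparser

rules = """\
User-agent: *
Allow: /images/public
Disallow: /images
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# First match wins: the Allow line is checked before the Disallow line.
print(rp.can_fetch("*", "http://example.com/images/public/logo.png"))   # True
print(rp.can_fetch("*", "http://example.com/images/private/photo.png")) # False

If you swap the two rules, the Disallow matches first and both URLs come back as blocked, which is the point being made above.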

Related

How to allow URLs in robots.txt but disallow other ones similar to allowed

This is what I'm using now:
User-agent: *
Allow: /
Allow: /video/funny-dogs/index.html
Allow: /video/funny-cats/index.html
Allow: /video/funny-dolphins/index.html
Disallow: /video/
But it seems like all other "/video/" URLs are also being crawled.
What's wrong with that?
Your robots.txt file should definitely work for Google, and I believe it will work for Bing. However, for many other robots it probably won't work, because not all robots prioritize competing Allows & Disallows the same way. Also, some robots don't support Allow at all.
For robots other than Google/Bing, you can increase the chances of success by removing the "Allow: /" line. Many older robots look for the first directive that can be applied to the current URL and then stop looking. For these robots, the allow will always be applied, and the other directives will always be ignored. Removing the "Allow: /" should fix this.
If Google or Bing are not obeying your robots.txt file, then something may be broken. You might check for the following things:
Was the robots.txt file added/changed very recently? Google can often take as much as a week to notice a new robots.txt file.
Is the robots.txt in the site's root directory? (e.g. in http://somesite.com/robots.txt, NOT http://somesite.com/subdir/robots.txt)
Do requests for the robots.txt file return anything funny in the response headers, like X-Robots-Tag:noindex, or a status code other than 200?
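To check that last point quickly, here is a minimal sketch using Python's standard urllib.request (the URL is just the placeholder host from above); it prints the status code, the X-Robots-Tag header if one is present, and the file body:

import urllib.request

# Fetch the robots.txt itself and inspect the response headers and status.
req = urllib.request.Request("http://somesite.com/robots.txt",
                             headers={"User-Agent": "robots-txt-check"})
with urllib.request.urlopen(req) as resp:
    print("Status:", resp.status)
    print("X-Robots-Tag:", resp.headers.get("X-Robots-Tag"))
    print(resp.read().decode("utf-8", errors="replace"))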
The original robots.txt specification said that the bot should read robots.txt and take the first rule that applies. When Allow was added, that wasn't changed, and many bots still use that rule. Other bots use the most permissive rule.
In the first case, Allow: / on the first line of the file will cause the bot to think that it can crawl everything. In the second case, the existence of Allow: / anywhere in the file will cause the bot to assume that it can crawl anything.
There is never a good reason to include Allow: /. The assumption in robots.txt is that if a file isn't specifically disallowed, then crawling is allowed. Allow is intended to be an override or exception to a Disallow.
Remove the Allow: /. Things should work then.
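To see the first-match problem concretely, here is a small check with Python's urllib.robotparser, which (as far as I know) implements the original first-match rule rather than Google's longest-match rule; the hostname and the "other-video" path are placeholders:

from urllib import robotparser

with_allow = """\
User-agent: *
Allow: /
Allow: /video/funny-dogs/index.html
Disallow: /video/
"""

without_allow = """\
User-agent: *
Allow: /video/funny-dogs/index.html
Disallow: /video/
"""

def allowed(rules, url):
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("*", url)

url = "http://example.com/video/other-video/index.html"
print(allowed(with_allow, url))     # True  - "Allow: /" matches first, so nothing is ever blocked
print(allowed(without_allow, url))  # False - the "Disallow: /video/" line now gets a chance to match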

robots.txt allow root only, disallow everything else?

I can't seem to get this to work but it seems really basic.
I want the domain root to be crawled
http://www.example.com
But I want nothing else to be crawled; all subdirectories are dynamic:
http://www.example.com/*
I tried
User-agent: *
Allow: /
Disallow: /*/
but the Google webmaster test tool says all subdirectories are allowed.
Anyone have a solution for this? Thanks :)
According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow and Disallow directives doesn't matter. So changing the order really won't help you.
Instead, use the $ operator to indicate the end of your path. $ means 'the end of the URL' (i.e. don't match anything from this point on).
Test this robots.txt. I'm certain it should work for you (I've also verified in Google Search Console):
user-agent: *
Allow: /$
Disallow: /
This will allow http://www.example.com and http://www.example.com/ to be crawled, but block everything else.
Note that the Allow directive satisfies your particular use case, but if you have index.html or default.php, those URLs will not be crawled.
Side note: I'm only really familiar with Googlebot and bingbot behaviors. If there are any other engines you are targeting, they may or may not have specific rules on how the directives are listed out. So if you want to be "extra" sure, you can always swap the positions of the Allow and Disallow directives; I just set them that way to debunk some of the comments.
When you look at the Google robots.txt specification, you can see that:
Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:
* designates 0 or more instances of any valid character
$ designates the end of the URL
see https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#example-path-matches
Then, as eywu said, the solution is:
user-agent: *
Allow: /$
Disallow: /
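Python's standard robots.txt parser doesn't understand the * and $ extensions, so here is a rough sketch of how that matching works, translating a single pattern into a regular expression. The function name is my own, and the decision logic below simply checks the Allow exception before the blanket Disallow, which mirrors the intended precedence of this particular pair of rules (Google itself picks the most specific, i.e. longest, matching rule):

import re

def robots_pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end of the URL path.
    # Everything else is matched literally against the start of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchored else ""))

allow = robots_pattern_to_regex("/$")    # Allow: /$
disallow = robots_pattern_to_regex("/")  # Disallow: /

for path in ["/", "/index.html", "/subdir/page.html"]:
    ok = bool(allow.match(path)) or not disallow.match(path)
    print(path, "->", "allowed" if ok else "blocked")

Running this prints "/" as allowed and the other two paths as blocked, which is exactly the behavior described above.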

Block Google robots for URLs containing a certain word

My client has a load of pages which they don't want indexed by Google. They are all called
http://example.com/page-xxx
so they are /page-123, /page-2, /page-25, etc.
Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt?
Would something like this work?
Disallow: /post-*
Thanks
In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"?
Disallow says, in essence, "disallow URLs that start with this text". So your example line will disallow any URL that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.
Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, then things are a bit more difficult.
As @user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.
For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:
Disallow: /*page-
That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:
Disallow: */page-
It looks like the * will work as a Google wildcard, so your approach will keep Google from crawling; however, wildcards are not supported by many other spiders. You can search Google for robots.txt wildcards for more info, or see http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.
Then I pulled this from Google's documentation:
Pattern matching
Googlebot (but not all search engines) respects some pattern matching.
To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
Note: from what I read, this is a Google-only approach. Officially, there are no wildcards allowed in robots.txt for Disallow.
You could put all the pages that you don't want to get visited in a folder and then use disallow to tell bots not to visit pages in that folder.
Disallow: /private/
I don't know very much about robots.txt, so I'm not sure how to use wildcards like that.
Here it says "you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines."
http://www.robotstxt.org/faq/robotstxt.html

How to configure robots.txt to allow everything?

My robots.txt in Google Webmaster Tools shows the following values:
User-agent: *
Allow: /
What does it mean? I don't have enough knowledge about it, so I'm looking for your help. I want to allow all robots to crawl my website; is this the right configuration?
That file will allow all crawlers access:
User-agent: *
Allow: /
This basically allows all user agents (the *) to all parts of the site (the /).
If you want to allow every bot to crawl everything, this is the best way to specify it in your robots.txt:
User-agent: *
Disallow:
Note that the Disallow field has an empty value, which means according to the specification:
Any empty value, indicates that all URLs can be retrieved.
Your way (with Allow: / instead of Disallow:) works, too, but Allow is not part of the original robots.txt specification, so it’s not supported by all bots (many popular ones support it, though, like the Googlebot). That said, unrecognized fields have to be ignored, and for bots that don’t recognize Allow, the result would be the same in this case anyway: if nothing is forbidden to be crawled (with Disallow), everything is allowed to be crawled.
However, formally (per the original spec) it’s an invalid record, because at least one Disallow field is required:
At least one Disallow field needs to be present in a record.
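As a quick local check with Python's urllib.robotparser (which does support Allow, even though the original spec didn't), the empty-Disallow record, the Allow: / record, and an empty file all come out as "everything allowed"; example.com is just a placeholder:

from urllib import robotparser

def allows_everything(lines):
    # Parse a robots.txt given as a list of lines and test an arbitrary URL.
    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch("*", "http://example.com/any/page.html")

print(allows_everything(["User-agent: *", "Disallow:"]))  # True
print(allows_everything(["User-agent: *", "Allow: /"]))   # True
print(allows_everything([]))                              # True (no rules at all)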
I understand that this is a fairly old question and has some pretty good answers, but here are my two cents for the sake of completeness.
As per the official documentation, there are four ways to allow complete access for robots to your site.
Clean:
Specify a global matcher with an empty Disallow segment, as mentioned by @unor. So your /robots.txt looks like this:
User-agent: *
Disallow:
The hack:
Create a /robots.txt file with no content in it, which will default to allowing everything for all types of bots.
The "I don't care" way:
Do not create a /robots.txt at all, which should yield the exact same results as the above two.
The ugly:
From the robots documentation for meta tags, you can use the following meta tag on all the pages of your site to let the bots know that these pages are supposed to be indexed:
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
In order for this to be applied to your entire site, you will have to add this meta tag to all of your pages, and the tag should be placed inside the HEAD tag of each page.
It means you allow every (*) user-agent/crawler to access the root (/) of your site. You're okay.
I think you are good; you're allowing all pages to be crawled.
User-agent: *
Allow: /

Robots.txt to disallow everything and allow only specific parts of the site/pages. Is "allow" supported by crawlers like Ultraseek and FAST?

Just wanted to know if it is possible to disallow the whole site for crawlers and allow only specific webpages or sections?
Is "allow" supported by crawlers like FAST and Ultraseek?
There is an Allow directive; however, there's no guarantee that a particular bot will support it (much like there's no guarantee a bot will even check your robots.txt to begin with). You could probably tell by examining your weblogs whether or not specific bots were indexing only the parts of your website that you allow.
The format for allowing just a particular page or section of your website might look like:
Allow: /public/section1/
Disallow: /
This (should) prevent bots from crawling or indexing anything except content under /public/section1.
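For what it's worth, here is a minimal check with Python's urllib.robotparser (a first-match parser, so the Allow line must come before the Disallow line, as in the answer above); I've added a User-agent: * line to make it a complete record, and the hostname and paths are placeholders:

from urllib import robotparser

rules = [
    "User-agent: *",
    "Allow: /public/section1/",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://example.com/public/section1/page.html"))  # True
print(rp.can_fetch("*", "http://example.com/private/page.html"))          # False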