block google robots for URLS containing a certain word - robots.txt

my client has a load of pages which they dont want indexed by google - they are all called
http://example.com/page-xxx
so they are /page-123 or /page-2 or /page-25 etc
Is there a way to stop google indexing any page that starts with /page-xxx using robots.txt
would something ike this work?
Disallow: /page-*
Thanks

In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"?
Disallow says, in essence, "disallow urls that start with this text". So your example line will disallow any url that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.
Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, then things are a bit more difficult.
As #user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.
For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:
Disallow: /*page-
That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:
Disallow: */page-

It looks like the * will work as a Google wild card, so your answer will keep Google from crawling, however wildcards are not supported by other spiders. You can search google for robot.txt wildcards for more info. I would see http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.
Then I pulled this from Google's documentation:
Pattern matching
Googlebot (but not all search engines) respects some pattern matching.
To match a sequence of characters, use an asterisk (*). For instance, to block access to all >subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
User-agent: *
Allow: /?$
Disallow: /?
The Disallow: / *? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
Note: From what I read this is a Google only approach. Officially there is no Wildcard allowed in robots.txt for disallow.

You could put all the pages that you don't want to get visited in a folder and then use disallow to tell bots not to visit pages in that folder.
Disallow: /private/
I don't know very much about robots.txt so I'm not sure how to use wildcards like that
Here, it says "you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines."
http://www.robotstxt.org/faq/robotstxt.html

Related

robots.txt - Disallow folder but allow files within folder

I seem to have a conflict between my sitemap.xml and my robots.txt
All the images on my site are stored in the folder /pubstore
When google crawls that folder it finds nothing because I an not including a listing of files in that folder.
This in turn generates hundreds of 404 errors in google search console.
What I decided to do, is block google from crawling the folder by adding:
Disallow: '/pubstore/'
What now happens is that files within that folder or in a sub-directory in that folder are block for google and thus Google is not indexing my images.
So an example scenario,
I have a page that uses the image /pubstore/12345/image.jpg
Google doesn't fetch it because /pubstore is blocked.
My end result is that I want the actual files to be crawlable but not the folder or its subdirectories.
Allow:
/pubstore/file.jpg
/pubstore/1234/file.jpg
/pubstore/1234/543/file.jpg
/pubstore/1234/543/132/file.jpg
Disallow:
/pubstore/
/pubstore/1234/
/pubstore/1234/543/
/pubstore/1234/543/132/
How can this be achieved?
If you don’t link to /pubstore/ and /pubstore/folder/ on your site, there is typically no reason to care about 404s for them. It’s the correct response for such URLs (as there is no content).
If you still want to use robots.txt to prevent any crawling for these, you have to use Allow, which is not part of the original robots.txt specification, but supported by Google.
For example:
User-agent: Googlebot
Disallow: /pubstore/
Allow: /pubstore/*.jpg$
Allow: /pubstore/*.JPG$
Or in case you want to allow many different file types, maybe just:
User-agent: Googlebot
Disallow: /pubstore/
Allow: /pubstore/*.
This would allow all URLs whose path starts with /pubstore/, followed by any string, followed by a ., followed by any string.

*/link in robots.txt - Does this block all or just url ending with /link?

I have a Rails-application with products where the products can be found at:
mydomain.com/thisproduct
if the user clicks on the link that leads to the manufacturers website, this is done by using a function "link" with the following url:
mydomain.com/thisproduct/link
Google seems to index this quite peculiarly by indexing that page as my page but with the content of the manufacturers website. So, I want to block this from being indexed in robots.txt.
This is my robots.txt:
# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-Agent: *
# Disallow: /
Disallow: /sokresultat/*
Disallow: */link
Where the last line is what my question relates to:
Do this block all urls that ends with link? And, more importantly, does it block anything else? I am afraid this will de-index my entire site, through that wildcard.
It seems like, after some additional, research that wildcards are supported differently between search engines. This works for Google and could be verified in Google Webmaster Tools.

Disallow dynamic pages in robots.txt

How would I disallow all dynamic pages within my robots.txt?
E.g.
page.php?hello=there
page.php?hello=everyone
page.php?thank=you
I would like page.php AND all possible dynamic versions to be disallowed.
At the moment I have
User-Agent: *
Disallow: /page.php
But this still allows e.g. page.php?hello=there
Thanks
What you've already got should block all access to /page.php for all search engines which respect robots.txt (no matter whether there are any query string parameters provided)
Don't forget robots.txt is only for robots :-) If you're trying to block users from accessing the page you'll need to use .htaccess or similar

How to disallow bots from a single page or file

How to disallow bots from a single page and allow allow all other content to be crawled.
Its so important not to get wrong so I am asking here, cant find a definitive answer elsewhere.
Is this correct?
User-Agent:*
Disallow: /dir/mypage.html
Allow: /
The Disallow line is all that's needed. It will block access to anything that starts with "/dir/mypage.html".
The Allow line is superfluous. The default for robots.txt is Allow: /. In general, Allow is not required. It's there so that you can override access to something that would be disallowed. For example, say you want to disallow access to the "/images" directory, except for images in the "public" subdirectory. You would write:
Allow: /images/public
Disallow: /images
Note that order is important here. Crawlers are supposed to use a "first match" algorithm. If you wrote the 'Disallow` first, then a crawler would assume that access to "/images/public" was blocked.

How to configure robots.txt to allow everything?

My robots.txt in Google Webmaster Tools shows the following values:
User-agent: *
Allow: /
What does it mean? I don't have enough knowledge about it, so looking for your help. I want to allow all robots to crawl my website, is this the right configuration?
That file will allow all crawlers access
User-agent: *
Allow: /
This basically allows all user agents (the *) to all parts of the site (the /).
If you want to allow every bot to crawl everything, this is the best way to specify it in your robots.txt:
User-agent: *
Disallow:
Note that the Disallow field has an empty value, which means according to the specification:
Any empty value, indicates that all URLs can be retrieved.
Your way (with Allow: / instead of Disallow:) works, too, but Allow is not part of the original robots.txt specification, so it’s not supported by all bots (many popular ones support it, though, like the Googlebot). That said, unrecognized fields have to be ignored, and for bots that don’t recognize Allow, the result would be the same in this case anyway: if nothing is forbidden to be crawled (with Disallow), everything is allowed to be crawled.
However, formally (per the original spec) it’s an invalid record, because at least one Disallow field is required:
At least one Disallow field needs to be present in a record.
I understand that this is fairly old question and has some pretty good answers. But, here is my two cents for the sake of completeness.
As per the official documentation, there are four ways, you can allow complete access for robots to access your site.
Clean:
Specify a global matcher with a disallow segment as mentioned by #unor. So your /robots.txt looks like this.
User-agent: *
Disallow:
The hack:
Create a /robots.txt file with no content in it. Which will default to allow all for all type of Bots.
I don't care way:
Do not create a /robots.txt altogether. Which should yield the exact same results as the above two.
The ugly:
From the robots documentation for meta tags, You can use the following meta tag on all your pages on your site to let the Bots know that these pages are not supposed to be indexed.
<META NAME="ROBOTS" CONTENT="NOINDEX">
In order for this to be applied to your entire site, You will have to add this meta tag for all of your pages. And this tag should strictly be placed under your HEAD tag of the page. More about this meta tag here.
It means you allow every (*) user-agent/crawler to access the root (/) of your site. You're okay.
I think you are good, you're allowing all pages to crawling
User-agent: *
allow:/