robots.txt - Disallow folder but allow files within folder

I seem to have a conflict between my sitemap.xml and my robots.txt.
All the images on my site are stored in the folder /pubstore.
When Google crawls that folder it finds nothing, because I am not including a listing of the files in that folder.
This in turn generates hundreds of 404 errors in Google Search Console.
What I decided to do is block Google from crawling the folder by adding:
Disallow: /pubstore/
What now happens is that files within that folder, or in a subdirectory of that folder, are blocked for Google, and thus Google is not indexing my images.
For example:
I have a page that uses the image /pubstore/12345/image.jpg
Google doesn't fetch it because /pubstore is blocked.
The end result I want is for the actual files to be crawlable, but not the folder or its subdirectories.
Allow:
/pubstore/file.jpg
/pubstore/1234/file.jpg
/pubstore/1234/543/file.jpg
/pubstore/1234/543/132/file.jpg
Disallow:
/pubstore/
/pubstore/1234/
/pubstore/1234/543/
/pubstore/1234/543/132/
How can this be achieved?

If you don’t link to /pubstore/ and /pubstore/folder/ on your site, there is typically no reason to care about 404s for them. It’s the correct response for such URLs (as there is no content).
If you still want to use robots.txt to prevent any crawling of these, you have to use Allow, which is not part of the original robots.txt specification but is supported by Google.
For example:
User-agent: Googlebot
Disallow: /pubstore/
Allow: /pubstore/*.jpg$
Allow: /pubstore/*.JPG$
Or in case you want to allow many different file types, maybe just:
User-agent: Googlebot
Disallow: /pubstore/
Allow: /pubstore/*.
This would allow all URLs whose path starts with /pubstore/, followed by any string, followed by a ., followed by any string.
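To see why the Allow lines win over the shorter Disallow line, here is a minimal Python sketch of Google-style matching. It assumes the precedence Google documents (the longest matching pattern wins, and Allow wins a tie); the helper names `to_regex` and `is_allowed` are made up for this illustration and are not part of any library.

```python
import re

def to_regex(pattern):
    # Translate a Google-style robots.txt path pattern: '*' matches any
    # run of characters, and a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules):
    # rules: (directive, pattern) pairs taken from one user-agent group.
    # Assumed precedence: longest matching pattern wins, Allow wins a tie,
    # and a path that matches no rule is allowed.
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

RULES = [
    ("Disallow", "/pubstore/"),
    ("Allow", "/pubstore/*.jpg$"),
    ("Allow", "/pubstore/*.JPG$"),
]

for path in ["/pubstore/", "/pubstore/1234/", "/pubstore/12345/image.jpg"]:
    print(path, is_allowed(path, RULES))
```

The folder URLs only match the Disallow pattern, while the image URL also matches the longer Allow pattern, so only the image stays crawlable. Other crawlers may resolve conflicts differently, so treat this as a model of Googlebot's documented behaviour only.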

Related

Robots.txt: Allow everything but the root directory

I have a site that is meant to have http://domain.com/blog as the root directory, and any traffic to http://domain.com is redirected to http://domain.com/blog.
This causes a problem, because when I go to Google and do site:domain.com, I see the root directory with the title of one of the first articles on the page. How can I block the root from being crawled, so that it does not show up in search?
In Webmaster Tools I added the site as http://domain.com, but I only use Fetch as Google on the /blog directory and other static pages. Is that correct?
I usually know how to do this but this time the site has a sub-directory as the intended root so it's a bit different.
Can someone verify if this will do what I am trying to achieve?
User-agent: *
Allow: /$
Disallow: /
Robots.txt does NOT block a crawler from crawling certain webpages. It is simply a text file with a set of guidelines that you ask the crawler to follow; it does not at any time block a crawler. If you want to block a certain webpage from being crawled or visited, you will have to block all access to that page, which includes users that are not crawlers. But since you already have it redirecting, I see no issue.
Also, the $ is not a unified standard, and technically neither is Allow. Try to make it focused on specific bots. Google and Bing recognise the Allow keyword, but many other bots do not.
Also, your current robots.txt says this: do not crawl any pages except the root.
I recommend this as your robots.txt:
User-agent: *
Disallow: /
User-agent: googlebot
Disallow: /$
This tells all bots other than Google not to crawl your site. And it tells the Google crawler not to crawl the root, while everything else is allowed.
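The reason Googlebot ignores the `Disallow: /` line here is that a crawler follows only the group that names it most specifically, and falls back to the `*` group otherwise. A rough Python sketch of that group selection follows; `parse_groups` and `rules_for` are hypothetical helpers written for this example, and real crawlers may match user-agent tokens more strictly than the simple substring test used here.

```python
def parse_groups(robots_txt):
    # Return a list of (agents, rules) groups. Consecutive User-agent
    # lines share the Allow/Disallow rules that follow them.
    groups, agents, rules = [], [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # rules already seen, so this line starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups

def rules_for(robots_txt, crawler):
    # Prefer a group that names the crawler; otherwise use the '*' group.
    groups = parse_groups(robots_txt)
    crawler = crawler.lower()
    for agents, rules in groups:
        if any(a and a != "*" and a in crawler for a in agents):
            return rules
    for agents, rules in groups:
        if "*" in agents:
            return rules
    return []

ROBOTS = """\
User-agent: *
Disallow: /
User-agent: googlebot
Disallow: /$
"""

print(rules_for(ROBOTS, "Googlebot"))
print(rules_for(ROBOTS, "bingbot"))
```

Googlebot ends up with only the `/$` rule, while any other bot gets the blanket `Disallow: /`.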

How do I disallow search robots from www.example.com and exsample.com

I would like to know if it is possible to block all robots from my site. I am having some trouble because I redirect exsample.com to www.exsample.com. The robots.txt checker tool says I don't have a robots.txt file on exsample.com but do have one on www.exsample.com.
Hope someone can help me out :)
Just make a text file named robots.txt, and in this file write the following:
User-agent: *
Disallow: /
and put it in your www folder or public_html folder.
This asks all search engines not to crawl any content on the website. Not every search engine will obey this protocol, but the most important ones will read it and do as you asked.
Robots.txt works per host.
So if you want to block URLs on http://www.example.com, the robots.txt must be accessible at http://www.example.com/robots.txt.
Note that the subdomain matters, so you can’t block URLs on http://example.com with a robots.txt only available on http://www.example.com/robots.txt.
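As a small illustration of "per host", this Python sketch derives the robots.txt location that applies to a given page URL. The function name `robots_txt_url` is an assumption made for this example; it only uses the standard library's `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # The robots.txt that governs a URL lives at the root of the same
    # scheme and host; the path, query and fragment are discarded.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/some/page?x=1"))
print(robots_txt_url("http://example.com/some/page"))
```

The two calls give different locations, which is why a file served only on the www host does nothing for the bare domain.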

Remove multiples urls of same type from google webmaster

I accidentally kept some URLs of the type www.example.com/abc/?id=1, in which the value of id can vary from 1 to 200. I don't want these to appear in search, so I am using the remove URL feature of Google Webmaster Tools. How can I remove all URLs of this type in one shot? I tried www.example.com/abc/?id=* but this didn't work!
Just block them using robots.txt, i.e.:
User-agent: *
Disallow: /junk.html
Disallow: /foo.html
Disallow: /bar.html
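The file names above are placeholders. Because a Disallow value matches by prefix, a single line could cover every id value in the question. The sketch below assumes a rule of `Disallow: /abc/?id=`, which is not given in the answer, and uses a hypothetical helper `is_blocked` to model plain prefix matching.

```python
def is_blocked(path_and_query, disallow_prefixes):
    # Original robots.txt semantics: a URL is blocked when its path
    # (including the query string) starts with any Disallow value.
    return any(path_and_query.startswith(prefix) for prefix in disallow_prefixes)

DISALLOW = ["/abc/?id="]  # assumed rule, i.e. "Disallow: /abc/?id="

urls = [f"/abc/?id={n}" for n in range(1, 201)]
print(all(is_blocked(u, DISALLOW) for u in urls))
print(is_blocked("/abc/", DISALLOW))
```

All 200 URLs match the one prefix, while /abc/ itself remains crawlable.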

Stopping Google's crawl of my site

Google has started crawling my site, but from a temporary domain (beta.mydomain instead of just mydomain), and I also only want it to crawl some of my pages. Therefore, I want to stop the crawl and only let Google crawl pages I specify in a sitemap. How can I do that? (I know how to add a sitemap, but how can I stop the current crawl and request that they crawl just the sitemap?)
Update: If I kill the subdomain beta.mydomain, will that be "fine" by them, or will they continue to go over all the killed pages and "not like" them? Can I specify that in each page's header?
Create a single text file called 'robots.txt' in the root folder for your site. Inside...
User-agent: *
Disallow: /thisfolder/
Disallow: /foo.html
Disallow: /andthisfoldertoo/
Disallow: /andthisfile.html
I use this for project files. In fact, as I write this I think I'll change the way I work on projects and always put them in a sub-directory called /projects/project1/ so one line will do...
Disallow: /projects/
AND I also add a line for my image files. I don't like my images all over the web...
Disallow: /imgs/
You could start with a robots.txt file.
See google's info here
I presume you have already looked at webmaster tools and sitemaps from what you say? Do be aware that while a sitemap will help tell google WHAT to crawl, it won't work very well for telling them what NOT to crawl.
For that you will want to use the robots.txt file to block certain pages / folders.
Use a robots.txt, see this site.

block google robots for URLS containing a certain word

My client has a load of pages which they don't want indexed by Google. They are all called
http://example.com/page-xxx
so they are /page-123 or /page-2 or /page-25 etc.
Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt?
Would something like this work?
Disallow: /page-*
Thanks
In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"?
Disallow says, in essence, "disallow urls that start with this text". So your example line will disallow any url that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.
Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, then things are a bit more difficult.
As #user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.
For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:
Disallow: /*page-
That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:
Disallow: */page-
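To compare these patterns without waiting for a crawl, here is a rough Python model of the wildcard semantics described above ('*' matches any run of characters, a trailing '$' anchors the end). The `matches` helper is hypothetical, and the `/*/page-` pattern is a variant chosen for this sketch because it starts with a slash; how a given crawler treats a pattern that begins with `*` is not something this code settles.

```python
import re

def matches(pattern, path):
    # Model of Google-style pattern matching against the start of a path.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None

PATHS = ["/page-123", "/blog/page-2", "/test/thispage-123.html"]

for pattern in ["/page-", "/*page-", "/*/page-"]:
    print(pattern, [p for p in PATHS if matches(pattern, p)])
```

In this model `/*page-` catches all three paths, including the unwanted thispage-123.html, whereas the pair `/page-` plus `/*/page-` covers pages in the root and in subdirectories without that false positive.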
It looks like the * will work as a Google wildcard, so your rule will keep Google from crawling; however, wildcards are not supported by all other spiders. You can search Google for robots.txt wildcards for more info. I would see http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.
Then I pulled this from Google's documentation:
Pattern matching
Googlebot (but not all search engines) respects some pattern matching.
To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
Note: From what I read this is a Google only approach. Officially there is no Wildcard allowed in robots.txt for disallow.
You could put all the pages that you don't want to get visited in a folder and then use disallow to tell bots not to visit pages in that folder.
Disallow: /private/
I don't know very much about robots.txt so I'm not sure how to use wildcards like that
Here, it says "you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines."
http://www.robotstxt.org/faq/robotstxt.html