Disallow only pages that end with a number in robots.txt

Is it possible to tell Google not to crawl these pages?
/blog/page/10
/blog/page/20
…
/blog/page/100
These are basically Ajax calls that return blog post data.
I created this in robots.txt:
User-agent: *
Disallow: /blog/page/*
But now I have another page that I want to allow, which is
/blog/page/start
Is there a way to tell robots to disallow only pages that end with a number? For example:
User-agent: *
Disallow: /blog/page/(:num)
I also got the error below when I tried to validate the robots.txt file:

Following the original robots.txt specification, this would work (for all conforming bots, including Google’s):
User-agent: *
Disallow: /blog/page/0
Disallow: /blog/page/1
Disallow: /blog/page/2
Disallow: /blog/page/3
Disallow: /blog/page/4
Disallow: /blog/page/5
Disallow: /blog/page/6
Disallow: /blog/page/7
Disallow: /blog/page/8
Disallow: /blog/page/9
This blocks all URLs whose path begins with /blog/page/ followed by any number (/blog/page/9129831823, /blog/page/9.html, /blog/page/5/10/foo, etc.).
So you should not append the * character (it's not a wildcard in the original robots.txt specification, and it's not even needed in your case for bots that do interpret it as a wildcard).
Google supports some robots.txt features which are not part of the original robots.txt specification, and which are therefore not supported by (all) other bots, e.g., the Allow field. But as the above robots.txt would work, there is no need to use it.
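If you want to sanity-check these rules before deploying them, Python's standard-library robots.txt parser (which implements the original specification's prefix matching) can do it locally. A minimal sketch; example.com is just a placeholder domain:
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /blog/page/0
Disallow: /blog/page/1
Disallow: /blog/page/2
Disallow: /blog/page/3
Disallow: /blog/page/4
Disallow: /blog/page/5
Disallow: /blog/page/6
Disallow: /blog/page/7
Disallow: /blog/page/8
Disallow: /blog/page/9
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paginated URLs are blocked because they start with a disallowed prefix;
# /blog/page/start matches none of the prefixes and stays crawlable.
for path in ("/blog/page/10", "/blog/page/100", "/blog/page/start"):
    print(path, parser.can_fetch("*", "http://example.com" + path))
# /blog/page/10 False
# /blog/page/100 False
# /blog/page/start True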

Related

How to disallow multiple folders in robots.txt

I want to disallow robots from crawling certain folders and their subfolders.
I want to disallow the following:
http://example.com/staging/
http://example.com/test/
And this is the code inside my robots.txt
User-agent: *
Disallow: /staging/
Disallow: /test/
Is this right, and will it work?
Yes, it is right!
You have to add a separate Disallow line for each path.
Like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /img/
Disallow: /docs/
A good trick is to use a robots.txt generator.
Another tip is to test your robots.txt with Google's robots.txt testing tool.
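For what it's worth, the core of such a generator is just one Disallow line per blocked path. A minimal Python sketch; the path list and output filename are only examples:
# Build a simple robots.txt from a list of paths to block.
blocked_paths = ["/staging/", "/test/"]

lines = ["User-agent: *"] + [f"Disallow: {path}" for path in blocked_paths]
robots_txt = "\n".join(lines) + "\n"

with open("robots.txt", "w") as f:
    f.write(robots_txt)

print(robots_txt)
# User-agent: *
# Disallow: /staging/
# Disallow: /test/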

Check for specific text in Robots.txt

My URL ends with &content=Search. I want to block all URLs that end with this. I have added the following in robots.txt:
User-agent: *
Disallow:
Sitemap: http://local.com/sitemap.xml
Sitemap: http://local.com/en/sitemap.xml
Disallow: /*&content=Search$
But it's not working when I test /en/search?q=terms#currentYear=2015&content=search in https://webmaster.yandex.com/robots.xml. It seems not to work because content=search comes after the # character.
The Yandex Robots.txt analysis will block your example if you test for Search instead of search, as Robots.txt Disallow values are case-sensitive.
If your site uses case-insensitive URLs, you might want to use:
User-agent: *
Disallow: /*&content=Search$
Disallow: /*&content=search$
# and possibly also =SEARCH, =SEarch, etc.
Having said that, I don’t know if Yandex really supports this for URL fragments (it would be unusual, I guess), although their tool gives this impression.
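To illustrate why nothing after the # can be matched: the fragment is never sent to the server, and robots.txt patterns are compared against the path and query string only. A small demonstration with Python's standard library, using the URL from the question:
from urllib.parse import urldefrag, urlsplit

url = "http://local.com/en/search?q=terms#currentYear=2015&content=search"

# The fragment (everything after '#') is a client-side construct and is not
# part of the request a crawler makes, so robots.txt rules never see it.
without_fragment, fragment = urldefrag(url)
print(without_fragment)  # http://local.com/en/search?q=terms
print(fragment)          # currentYear=2015&content=search

# What a Disallow pattern is actually matched against:
parts = urlsplit(url)
print(parts.path + "?" + parts.query)  # /en/search?q=terms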

robots.txt to disallow all pages except one? Do they override and cascade?

I want one page of my site to be crawled and no others.
Also, if it's any different from the answer above, I would like to know the syntax for disallowing everything but the root (index) of the website.
# robots.txt for http://example.com/
User-agent: *
Disallow: /style-guide
Disallow: /splash
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Or can I do like this?
# robots.txt for http://example.com/
User-agent: *
Disallow: /
Allow: /under-construction
Also I should mention that this is a WordPress install, so "under-construction," for example, is set to the front page. So in that case it acts as the index.
I think what I need is to have http://example.com crawled, but no other pages.
# robots.txt for http://example.com/
User-agent: *
Disallow: /*
Would this mean disallow anything after the root?
The easiest way to allow access to just one page would be:
User-agent: *
Allow: /under-construction
Disallow: /
The original robots.txt specification says that crawlers should read robots.txt from top to bottom, and use the first matching rule. If you put the Disallow first, then many bots will see it as saying they can't crawl anything. By putting the Allow first, those that apply the rules from top to bottom will see that they can access that page.
The expression rules are simple: the expression Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.
Your Disallow: /* means the same thing to Googlebot and Bingbot, but bots that don't support wildcards could see the /* and think that you meant a literal *. So they could assume that it was okay to crawl /*foo/bar.html.
If you just want to crawl http://example.com, but nothing else, you might try:
Allow: /$
Disallow: /
The $ means "end of string," just like in regular expressions. Again, that'll work for Google and Bing, but won't work for other crawlers if they don't support wildcards.
If you log into Google Webmaster Tools, go to Crawl in the left panel, then Fetch as Google. There you can test how Google will crawl each page.
In the case of blocking everything but the homepage:
User-agent: *
Allow: /$
Disallow: /
will work.
You can use either of the following; both will work:
User-agent: *
Allow: /$
Disallow: /
or
User-agent: *
Allow: /index.php
Disallow: /
The Allow must come before the Disallow, because the file is read from top to bottom.
Disallow: / says "disallow anything that starts with a slash," which means everything on the site.
The $ means "end of string," as in regular expressions, so the result of Allow: /$ is that only your homepage (/index) is allowed.
http://en.wikipedia.org/wiki/Robots.txt#Allow_directive
The order is only important to robots that follow the standard; in the case of the Google or Bing bots, the order is not important.
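For anyone who wants to experiment with this, here is a rough sketch of Google-style matching (* as a wildcard, $ as an end anchor, the longest matching rule wins and Allow wins a tie). This is only an approximation of the documented behavior, not Google's actual implementation:
import re

def google_style_match(pattern, path):
    # '*' matches any sequence of characters; a trailing '$' anchors the
    # pattern to the end of the URL. Everything else is matched literally.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += "$"
    return re.search(regex, path) is not None

def allowed(rules, path):
    # Among all matching rules, the longest pattern wins; if an Allow and a
    # Disallow of equal length both match, Allow wins. 'rules' is a list of
    # (directive, pattern) pairs; their order does not matter.
    matches = [(len(pattern), directive == "allow")
               for directive, pattern in rules
               if google_style_match(pattern, path)]
    if not matches:
        return True  # nothing matches: crawling is allowed by default
    return max(matches)[1]

rules = [("disallow", "/"), ("allow", "/$")]
for path in ("/", "/under-construction", "/style-guide"):
    print(path, allowed(rules, path))
# / True
# /under-construction False
# /style-guide False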

"URL blocked by robots.txt" message in Google Webmaster Tools

I have a WordPress site in the root domain. Now I've added a forum in a subfolder, mydomain/forum,
which generates a sitemap at mydomain/forum/sitemap_index.xml.
After submitting that sitemap to Google, it seems Google can't access the sub-sitemaps, with the message "URL blocked by robots.txt" - Value: mydomain/forum/sitemap-forums.xml?page=1 --- Value: mydomain/forum/sitemap-index.xml?page=1.
This is my robots.txt:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
Sitemap: mydomain/sitemap_index.xml
Sitemap: mydomain/forum/sitemap_index.xml
What should I add to robots.txt? Any help would be greatly appreciated.
Thanks in advance
Just to clarify, I'm assuming 'mydomain' in your example is a stand-in for the scheme plus fully qualified domain name, correct? (e.g. "http://www.whatever.com", not "whatever.com" or "www.whatever.com") I figure this must be the case because you have it in the Google error message in the same format.
The error message suggests that Google is getting the URL from somewhere other than your robots.txt file. The robots.txt file lists the sitemap URL as:
mydomain/forum/sitemap_index.xml
but the error message shows that Google is trying to load the URL:
mydomain/forum/sitemap-index.xml?page=1
This second URL is getting blocked, because your robots.txt file blocks any URL that contains a question mark:
Disallow: /*?*
Disallow: /*?
(Incidentally, these two lines do exactly the same thing; you can safely delete the first one.) Google should still be able to read the sitemap file using the simpler URL, however, so the pages will probably still be crawled. If you really want to get rid of the error message, you could always add:
Allow: /forum/sitemap-index.xml?page=1
This will override the disallows for just the sitemap URL. (This will work on Google at least - YMMV for any other search engines)

My robots.txt is "User-agent: * Disallow: /feeds/ Disallow: /*/_/" - what does it mean?

I have a Google site, and today I found that Google had generated a new robots.txt file:
User-agent: *
Disallow: /feeds/
Disallow: /*/_/
What does it mean? Your kind reply is appreciated.
Here is the breakdown:
User-agent: * -- Apply to all robots
Disallow: /feeds/ -- Do not crawl the /feeds/ directory
Disallow: /*/_/ -- Do not crawl any subdirectory that is named _
For more information, see www.robotstxt.org.
The User-Agent header is how browsers and robots identify themselves.
The Disallow lines define the rules the robots are supposed to follow - in this case what they shouldn't crawl.