Robots.txt blocking all pages except selected ones - robots.txt

User-agent: *
Sitemap: https://somedomain.com/sitemap.xml
Disallow: /
Allow: /sitemap.xml
Allow: /some-page
Allow: /some-other-page
After submitting the sitemap manually via Google Webmaster Tools, it says that it can't read the allowed pages because they are blocked by robots.txt.
How do I modify robots.txt so that these pages can be indexed while leaving the rest of the portal's pages non-indexed?

It’s probably just a matter of time until Google recognizes the new/updated robots.txt.
You can "ask Google to more quickly crawl and index a new robots.txt file for your site" in the Google Webmaster Tools: Submit your updated robots.txt to Google.
Side note: As the Sitemap field does not belong to a single record (the protocol defines it as "independent of the user-agent line"), you might want to structure your robots.txt like this:
User-agent: *
Disallow: /
Allow: /sitemap.xml
Allow: /some-page
Allow: /some-other-page
Sitemap: https://somedomain.com/sitemap.xml
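If you want to sanity-check which URLs such a record blocks, here is a minimal Python sketch of Google-style "most specific rule wins" matching (an illustration only; a real crawler also handles wildcards like * and $ and percent-encoding):
RULES = [
    ("Disallow", "/"),
    ("Allow", "/sitemap.xml"),
    ("Allow", "/some-page"),
    ("Allow", "/some-other-page"),
]

def is_allowed(path: str) -> bool:
    # Collect every rule whose value is a prefix of the requested path and let
    # the longest (most specific) one decide; Allow wins a tie in length.
    matches = [(len(value), field == "Allow") for field, value in RULES if path.startswith(value)]
    if not matches:
        return True  # no rule applies, so crawling is allowed
    return sorted(matches)[-1][1]

print(is_allowed("/some-page"))    # True  - explicitly allowed
print(is_allowed("/about-us"))     # False - only "Disallow: /" matches
print(is_allowed("/sitemap.xml"))  # True  - the sitemap itself stays fetchable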

Related

Why does Google not index my "robots.txt"?

I am trying to allow the Googlebot webcrawler to index my site. My robots.txt initially looked like this:
User-agent: *
Disallow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
And I changed it to:
User-agent: *
Allow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
But Google is still not indexing my links.
I am trying to allow the Googlebot webcrawler to index my site.
Robots rules have nothing to do with indexing! They are ONLY about crawling. A page can be indexed even if it is forbidden to be crawled!
The Host directive is supported only by Yandex.
If you want all bots to be able to crawl your site, your robots.txt file should be placed at https://www.sitename.com/robots.txt, be served with status code 200, and contain:
User-agent: *
Disallow:
Sitemap: https://www.sitename.com/sitemap.xml
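To verify the first two requirements (location and status code), you can simply request the file yourself; a quick sketch, with www.sitename.com standing in for the real domain:
import urllib.request

# urlopen() raises an HTTPError if the server does not answer with a 2xx status,
# so reaching the print lines already confirms the file is reachable.
with urllib.request.urlopen("https://www.sitename.com/robots.txt") as response:
    print(response.status)           # should be 200
    print(response.read().decode())  # should show the record above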
From the docs:
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:
User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
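To get a feel for how these fields are read by a parser, here is a small sketch using Python's standard-library urllib.robotparser with a made-up robots.txt that combines them (note that this parser applies rules in order, so the Allow line is listed before the Disallow it overrides, whereas Google keeps the most specific match):
from urllib import robotparser

robots_txt = """\
User-agent: *
Allow: /private/public-page
Disallow: /private/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "/private/secret"))       # False - disallowed
print(rp.can_fetch("Googlebot", "/private/public-page"))  # True  - the Allow exception
print(rp.crawl_delay("*"))                                # 10
print(rp.site_maps())                                     # sitemap URL(s), Python 3.8+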
Try specifically mentioning Googlebot in your robots.txt directives, such as:
User-agent: Googlebot
Allow: /
Or allow all web crawlers access to all content:
User-agent: *
Disallow:

GWT says robots.txt is blocking resources - mine or theirs?

In Google Webmaster Tools when I 'Fetch as Google' it tells me there are 2 blocked resources which are blocked by robots.txt:
https://dash.reviews.co.uk/[cut]
https://googleads.g.doubleclick.net/[cut]
But I cannot see how these are blocked in my robots.txt, which contains the following:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /category/
Disallow: /tag/
Disallow: /tools/
Any clues?
You don't have to worry about those resources being blocked, because they are on domains you don't control and are being blocked by those domains' own robots.txt files.
Google Webmaster Tools is showing you that, for the page you had it fetch, it can't see all the resources, which is fairly common. Google and many large sites use robots.txt to block many of their own resources. (DoubleClick is a Google-owned property.)
As long as you can see the entirety of your content when you "Fetch and Render", you're in good shape.
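If you want to confirm that the block really comes from the third party's file rather than yours, you can point Python's robots.txt parser at their host (the blocked URLs are cut off above, so the path below is just a hypothetical stand-in):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://googleads.g.doubleclick.net/robots.txt")
rp.read()  # fetches and parses the third party's robots.txt

# Hypothetical resource path; substitute one of the URLs GWT reported as blocked.
blocked_resource = "https://googleads.g.doubleclick.net/some/ad-script.js"
print(rp.can_fetch("Googlebot", blocked_resource))
# False here means the block is enforced by DoubleClick's own file, not by yours.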

Google robots.txt file

I want to allow Google's robot:
1) to see only the main page
2) to show a description in the search results for the main page
I have the following code, but it seems that it doesn't work:
User-agent: *
Disallow: /feed
Disallow: /site/terms-of-service
Disallow: /site/rules
Disallow: /site/privacy-policy
Allow: /$
Am I missing something, or do I just need to wait for Google's robot to visit my site?
Or is some action required from the Google Webmaster panel?
Thanks in advance!
Your robots.txt should work (and yes, it takes time), but you might want to make the following changes:
It seems you want to target only Google’s bot, so you should use User-agent: Googlebot instead of User-agent: * (which targets all bots that don’t have a specific record in your robots.txt).
It seems that you want to disallow crawling of all pages except the home page, so there is no need to list individual path beginnings in Disallow.
So it could look like this:
User-agent: Googlebot
Disallow: /
Allow: /$
Google’s bot may only crawl your home page, nothing else. All other bots may crawl everything.
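If it helps to see why /$ only matches the home page, here is a rough sketch of the pattern matching (the $ means "end of the URL path" and * matches any characters; this is an illustration, not Google's actual implementation):
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    regex = re.escape(pattern).replace(r"\*", ".*")  # * is a wildcard
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"                     # trailing $ anchors to the end of the path
    return re.compile("^" + regex)

allow_home = pattern_to_regex("/$")   # matches "/" and nothing else
disallow_all = pattern_to_regex("/")  # matches every path

for path in ["/", "/feed", "/site/rules"]:
    # "/$" is the longer, more specific rule, so it wins whenever it matches.
    allowed = bool(allow_home.match(path)) or not disallow_all.match(path)
    print(path, "allowed" if allowed else "blocked")
# Output: only "/" is allowed; /feed and /site/rules are blocked.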

robots.txt disallow google bot on root domain but allow google image bot?

Would having the following robot.txt work?
User-agent: *
Disallow: /
User-agent: Googlebot-Image
Allow: /
My idea is to prevent Google from crawling my CDN domain while still allowing Google's image bot to crawl and index my images.
The file has to be called robots.txt, not robot.txt.
Note that User-agent: * targets all bots (that are not matched by another User-agent record), not only Googlebot. So if you want to allow other bots to crawl your site, you would want to use User-agent: Googlebot instead.
So this robots.txt would allow "Googlebot-Image" everything, and disallow everything for all other bots:
User-agent: Googlebot-Image
Disallow:
User-agent: *
Disallow: /
(Note that Disallow: with an empty string value is equivalent to Allow: /, but the Allow field is not part of the original robots.txt specification, although some parsers support it, among them Google’s).
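You can check the behaviour of that record with Python's standard-library parser, assuming it is served from the CDN host (the paths below are made up):
from urllib import robotparser

robots_txt = """\
User-agent: Googlebot-Image
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot-Image", "/images/photo.jpg"))  # True  - empty Disallow allows everything
print(rp.can_fetch("Googlebot", "/images/photo.jpg"))        # False - falls into the * group
print(rp.can_fetch("Bingbot", "/anything"))                  # False - also the * group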

Url blocked by robots.txt message in Google Webmaster

I have a WordPress site in the root domain. Now I've added a forum in a subfolder, mydomain/forum,
which produces a sitemap at mydomain/forum/sitemap_index.xml.
After submitting that sitemap to Google, it seems Google can't access the sub-sitemaps, reporting "Url blocked by robots.txt" - Value: mydomain/forum/sitemap-forums.xml?page=1 --- Value: mydomain/forum/sitemap-index.xml?page=1.
This is my robots.txt:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
Sitemap: mydomain/sitemap_index.xml
Sitemap: mydomain/forum/sitemap_index.xml
What should I add to robots.txt? Any help would be greatly appreciated.
Thanks in advance
Just to clarify, I'm assuming 'mydomain' in your example is a stand-in for the scheme plus fully qualified domain name, correct? (e.g. "http://www.whatever.com", not "whatever.com" or "www.whatever.com") I figure this must be the case because you have it in the Google error message in the same format.
The error message suggests that Google is getting the URL from somewhere other than your robots.txt file. The robots.txt file lists the sitemap URL as:
mydomain/forum/sitemap_index.xml
but the error message shows that Google is trying to load the URL:
mydomain/forum/sitemap-index.xml?page=1
This second URL is getting blocked, because your robots.txt file blocks any URL that contains a question mark:
Disallow: /*?*
Disallow: /*?
(Incidentally, these two lines do exactly the same thing; you can safely delete the first one.) Google should still be able to read the sitemap file using the simpler URL, however, so the pages will probably still be crawled. If you really want to get rid of the error message, you could always add:
Allow: /forum/sitemap-index.xml?page=1
This will override the disallows for just the sitemap URL. (This will work on Google at least - YMMV for any other search engines)
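For what it's worth, here is a rough sketch of why the ?page=1 URL trips the wildcard rule and why the extra Allow line wins under Google's longest-match behaviour (illustration only, with * treated as a wildcard):
import re

def to_regex(pattern: str) -> re.Pattern:
    # robots.txt patterns treat * as a wildcard; everything else is literal
    return re.compile("^" + re.escape(pattern).replace(r"\*", ".*"))

url_path = "/forum/sitemap-index.xml?page=1"
disallow_rule = "/*?*"
allow_rule = "/forum/sitemap-index.xml?page=1"

print(bool(to_regex(disallow_rule).match(url_path)))  # True - the wildcard rule matches (hence the error)
print(bool(to_regex(allow_rule).match(url_path)))     # True - the proposed Allow also matches
print(len(allow_rule) > len(disallow_rule))           # True - it is more specific, so Google lets it win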