GWT says robots.txt is blocking resources - mine or theirs?

In Google Webmaster Tools, when I 'Fetch as Google', it tells me there are 2 resources which are blocked by robots.txt:
https://dash.reviews.co.uk/[cut]
https://googleads.g.doubleclick.net/[cut]
But I cannot see how these are blocked in my robots.txt, which contains the following:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /category/
Disallow: /tag/
Disallow: /tools/
Any clues?

You don't have to worry about those resources being blocked: they are on domains that you don't control, and they are being blocked by those domains' own robots.txt files.
Google Webmaster Tools is showing you that, for the page you had it fetch, it can't see all of the resources, which is fairly common. Google and many other large sites use robots.txt to block many of their resources. (DoubleClick is a Google-owned property.)
As long as you can see the entirety of your content when you "fetch and render" you're in good shape.
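If you want to see for yourself whose robots.txt is doing the blocking, you can point a parser at the third-party file instead of your own. A minimal sketch using Python's standard-library robotparser; the path in the example is a made-up placeholder, since GWT truncates the blocked URLs:

from urllib.robotparser import RobotFileParser

# Parse the third party's robots.txt, not yours.
rp = RobotFileParser("https://googleads.g.doubleclick.net/robots.txt")
rp.read()

# "/pagead/ads" is a hypothetical path standing in for the truncated URL shown in GWT.
blocked_url = "https://googleads.g.doubleclick.net/pagead/ads"
print(rp.can_fetch("Googlebot", blocked_url))  # prints False if their file disallows it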

Related

robots.txt content / selenium web scraping

I am trying to run web scraping using Selenium.
What does this robots.txt content mean?
User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/
Can I run web scraping in all folders except go and launch-announcement?
What is a robots.txt file?
Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
The Disallow: directive tells the robot that it should not visit the mentioned page on the site.
Can I run web scraping in all folders except go and launch-announcement?
Yes, you can scrape all the other pages except those two directories.
According to the basic robots.txt guide, the following rule
User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/
means crawling /go/ and /launch-announcement/ (and their subdirectories) is disallowed for all user agents.
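If you want to check this programmatically before pointing Selenium at a URL, Python's standard-library robotparser can evaluate the rules for you. A minimal sketch, with example.com and the sample paths standing in for the real site:

from urllib.robotparser import RobotFileParser

# Load the site's robots.txt (example.com is a placeholder for the site being scraped).
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Check a few paths before visiting them with Selenium; "my-scraper" is a placeholder user agent.
for path in ("/pricing", "/go/somewhere", "/launch-announcement/2024"):
    allowed = rp.can_fetch("my-scraper", "https://example.com" + path)
    print(path, "allowed" if allowed else "disallowed")

With the rules quoted above, anything under /go/ or /launch-announcement/ comes back disallowed, and every other path comes back allowed.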

Why does Google not index my "robots.txt"?

I am trying to allow the Googlebot webcrawler to index my site. My robots.txt initially looked like this:
User-agent: *
Disallow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
And I changed it to:
User-agent: *
Allow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
But Google is still not indexing my links.
I am trying to allow the Googlebot webcrawler to index my site.
Robots rules have nothing to do with indexing! They are ONLY about crawling. A page can be indexed even if it is forbidden to be crawled!
The Host directive is supported only by Yandex.
If you want all bots to be able to crawl your site, your robots.txt file should be placed at https://www.sitename.com/robots.txt, be available with status code 200, and contain:
User-agent: *
Disallow:
Sitemap: https://www.sitename.com/sitemap.xml
From the docs:
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:
User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
Try to specifically mention Googlebot in your robots.txt directives, such as:
User-agent: Googlebot
Allow: /
or allow all web crawlers access to all content:
User-agent: *
Disallow:
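To check that the corrected file is actually reachable and permissive, you can fetch and parse it. A minimal sketch with Python's standard library, keeping the www.sitename.com placeholder from the question:

from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

robots_url = "https://www.sitename.com/robots.txt"  # placeholder domain from the question

# The file should come back with status code 200.
with urlopen(robots_url) as resp:
    print(resp.status)

# With an empty "Disallow:" (or "Allow: /"), Googlebot may crawl the home page.
rp = RobotFileParser(robots_url)
rp.read()
print(rp.can_fetch("Googlebot", "https://www.sitename.com/"))

As noted above, this only confirms that crawling is allowed; indexing is a separate step.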

Robots.txt: Allow everything but the root directory

I have a site that is meant to have http://domain.com/blog as the root directory, and any traffic to http://domain.com is redirected to http://domain.com/blog.
This causes a problem because when I go to Google and do site:domain.com, I see the root directory with the title of one of the first articles on the page. How can I block the root from being crawled, so it doesn't show up in search?
In webmaster tools I added the site as http://domain.com but I only fetch as google on the /blog directory and other static pages. Is that correct?
I usually know how to do this but this time the site has a sub-directory as the intended root so it's a bit different.
Can someone verify if this will do what I am trying to achieve?
User-agent: *
Allow: /$
Disallow: /
Robots.txt does NOT block a crawler from crawling certain webpages. Robots.txt is simply a text file with a set of guidelines that you ask the crawler to follow; it does not at any time block a crawler. If you want to block a certain webpage from being crawled/visited, you have to block all access to that page, including for users who are not crawlers. But since you already have the root redirecting, I see no issue.
Also, the $ is not a unified standard, and neither is Allow (technically). Try to make it focused on specific bots. Google and Bing recognise the Allow keyword, but many other bots do not.
Also, your current robots.txt says: do not crawl any pages except the root.
I recommend this as your robots.txt
User-agent: *
Disallow: /
User-agent: googlebot
Disallow: /$
This tells all bots other than Google not to crawl your site, and it tells the Google crawler not to crawl the root; everything else is allowed.
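The behaviour of the trailing $ is easy to test. Below is a rough, unofficial sketch of Google-style pattern matching in Python, just to show why Disallow: /$ blocks only the root while Disallow: / blocks every path:

import re

def rule_matches(pattern, path):
    # '*' matches any run of characters; a trailing '$' anchors the match to the end of the URL.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

print(rule_matches("/$", "/"))           # True  -> only the root matches
print(rule_matches("/$", "/blog/post"))  # False -> deeper pages do not
print(rule_matches("/", "/blog/post"))   # True  -> a bare "/" matches everything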

Google robots.txt file

I want to allow google robot:
1) to only see the main page
2) to see the description in search results for the main page
I have the following code, but it seems that it doesn't work:
User-agent: *
Disallow: /feed
Disallow: /site/terms-of-service
Disallow: /site/rules
Disallow: /site/privacy-policy
Allow: /$
Am I missing something, or do I just need to wait for the Google robot to visit my site?
Or maybe some action is required from the Google webmaster panel?
Thanks in advance!
Your robots.txt should work (and yes, it takes time), but you might want to make the following changes:
It seems you want to target only Google’s bot, so you should use User-agent: Googlebot instead of User-agent: * (which targets all bots that don’t have a specific record in your robots.txt).
It seems that you want to disallow crawling of all pages except the home page, so there is no need to specify a few specific path beginnings in Disallow.
So it could look like this:
User-agent: Googlebot
Disallow: /
Allow: /$
Google’s bot may only crawl your home page, nothing else. All other bots may crawl everything.
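For what it's worth, the reason Allow: /$ can win over Disallow: / for the home page is Google's documented precedence: the most specific (longest) matching rule wins, and on a tie Allow beats Disallow. A rough, unofficial sketch of that logic, with a simplified matcher that ignores '*' wildcards:

def rule_matches(pattern, path):
    # Simplified: a trailing '$' means an exact match; otherwise a prefix match.
    if pattern.endswith("$"):
        return path == pattern[:-1]
    return path.startswith(pattern)

def is_allowed(rules, path):
    # The longest matching pattern wins; on equal length, Allow beats Disallow.
    best = None
    for directive, pattern in rules:
        if rule_matches(pattern, path):
            key = (len(pattern), directive == "allow")
            if best is None or key > best[0]:
                best = (key, directive)
    return True if best is None else best[1] == "allow"

rules = [("disallow", "/"), ("allow", "/$")]
print(is_allowed(rules, "/"))            # True  -> home page may be crawled
print(is_allowed(rules, "/site/rules"))  # False -> everything else is blocked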

Will googlebot index my site?

In my robots.txt file, I have the following lines:
User-agent: Googlebot-Mobile
Disallow: /
User-agent:GoogleBot
Disallow: /
Sitemap: http://mydomain.com/sitemapindex.xml
I know that if I put the first 4 lines, googlebot won't index the site, but if I keep the last line, Sitemap: http://mydomain.com/sitemapindex.xml, will googlebot be able to index the site?
Thanks,
I tested your robots.txt against my own domain (which has a sitemap entry for every page) and Googlebot and Googlebot-Mobile returned that they were Disallowed access.
Based on this - I would say the robots.txt file takes precedence over any sitemaps.
Plus, logically speaking - if you block the entire domain, the bot is disallowed access to the sitemap. The sitemap entry just tells crawlers where to find your sitemap - not their authorization to access it.
Even if you allowed the sitemap, I don't think bots would crawl your site - sitemaps are designed more for telling the bot how often to crawl your site, not what they are allowed to crawl.
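That test is easy to reproduce with Python's standard-library robotparser. A minimal sketch, feeding it the file from the question (mydomain.com is the asker's placeholder domain):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot-Mobile
Disallow: /

User-agent: Googlebot
Disallow: /

Sitemap: http://mydomain.com/sitemapindex.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("Googlebot", "http://mydomain.com/"))                  # False
print(rp.can_fetch("Googlebot", "http://mydomain.com/sitemapindex.xml"))  # False: even the sitemap URL is off-limits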
No, I don't think Google will do that. It's really a question of good bots and bad bots: even if you add a robots.txt file to restrict some area, bots can still crawl it if they choose to. robots.txt is just a warning sign, not a security wall.
Googlebot will not even be able to touch the sitemapindex.xml.
The robots.txt is a crawler directive, and the sitemap.xml is fetched via the Googlebot crawler, so Googlebot will not access the sitemapindex.xml.
No crawl coverage, no indexing, no SERP listing.
You can test this with the Google Webmaster Tools robots.txt verification tool and the Fetch as Googlebot feature (in the Labs section).