"/recaptcha/api2/logo_48.png" blocked by Google - robots.txt

I have a contact form on my site, it all works fine. I also using Google Captcha for that form.
When I go to my G search console to make sure all is fine, I see I get one error stating:
Googlebot couldn't get all resources for this page. Here's a list:
https://www.gstatic.com/recaptcha/api2/logo_48.png << Blocked
I have gone to my robots.txt file and added the following but that didnt help
Allow: https://www.gstatic.com/recaptcha/api2/logo_48.png
Allow: /recaptcha/api2/logo_48.png

Your own robots.txt is only for URLs from your host.
The message is about a URL from a different host (www.gstatic.com). This host would have to edit its robots.txt file to allow crawling of /recaptcha/api2/logo_48.png. It’s currently disallowed.
In other words: You can’t control the crawling of files that are hosted by someone else.

Related

Stop web.archive.org to save the site pages

I tried accessing facebook.com webpages from previous time.
And the site showed me an error that it can not save pages because of the site robots.txt/
Can anyone tell which statements in the robots.txt are making the site inaccessible to web.archive.org
I guess it is because of the #permission statement as mentioned here (http://facebook.com/robots.txt)
Is there any other way i can do this for my site as well.
I also dont want woorank.com or builtwith.com to analyze my site.
Note : search engine bots should face no problems while crawling my site and indexing it if i add some statements to robots.txt in order to achieve results which are mentioned above.
The Internet Archive (archive.org) crawler uses the User-Agent value ia_archiver (see their documentation).
So if you want to target this bot in your robots.txt, use
User-agent: ia_archiver
And this is exactly what Facebook does in its robots.txt:
User-agent: ia_archiver
Allow: /about/privacy
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
Disallow: /
If you would like to submit a request for archives of your site or
account to be excluded from web.archive.org, send us a request to
info#archive.org and indicate:
https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/

how to set Robots.txt files for subdomains?

I have a subdomain eg blog.example.com and i want this domain not to index by Google or any other search engine. I put my robots.txt file in 'blog' folder in the server with following configuration:
User-agent: *
Disallow: /
Would it be fine to not to index by Google?
A few days before my site:blog.example.com shows 931 links but now it is displaying 1320 pages. I am wondering if my robots.txt file is correct then why Google is indexing my domain.
If i am doing anything wrong please correct me.
Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.
This question is actually about how to prevent indexing of a subdomain, here your robots file is actually preventing your site from being noindexed.
Don’t use a robots.txt file as a means to hide your web pages from Google search results.
Introduction to robots.txt: What is a robots.txt file used for? Google Search Central Documentation
For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Block Search indexing with noindex Google Search Central Documentation

Why robots.txt doesn't work for when I do redirection from http to https

Today I experience the problem with search in the google.
When I type "trakopolis" in the google in shows me my page (so it is indexed by google robots), but the description of the page is not available. It is very important to have a description on my website.
the website is:
https://trakopolis.com
the robots txt file is, so I allow everything:
User-agent: *
Allow: /
https://www.google.com.ua/?gws_rd=cr#gs_rn=23&gs_ri=psy-ab&tok=O7cIXclKCSxtMd3uDVRVhg&cp=2&gs_id=h&xhr=t&q=trakopolis&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=tr&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.50165853,d.bGE&fp=d3f611552977418f&biw=1680&bih=949
but as you see the description is not available. I confused :( Sorry if the questio is stupid.
As I see from the google webmaster tools. Google use this robots.txt file, so maybe the issue with redirection from http to https? The website doesn't allow http and we use https. And on main page I use redirection to Login.aspx page in case if user didn't authenticate.
Google shows a description when searching for "trakopolis":
It seems that your robots.txt disallowed crawling of your site some time ago, as some other search engines still display that they are not allowed to show your description, e.g. DuckDuckGo.
Note that your robots.txt uses Allow, which is not part of the original robots.txt specification (but many parsers understand it anyway). It’s equivalent to:
User-agent: *
Disallow:
(But because parsers have to ignore unknown fields, you should have no problem using Allow. An empty or no existent robots.txt always allows crawling of everything.)

How come when I block a directory in robots.txt, its contents are still coming up?

This is what I've got in my robots.txt, placed in the base directory, of course:
User-Agent: *
Disallow: /foo/
But then, in Google, I have no index of /foo/, but for some reason, I still have /foo/foo.php showing up as a link in Google.
How come? Did I write something incorrectly? Do I need to write something else?
When you put robots.txt after your site went live, Google could already index files under /foo/.
You can remove already indexed files via Google Webmaster Tools - removal request.
robots.txt does not prevent Google to link to your blocked pages. Google won't index your blocked pages (so it won't show the page title/description/snippet), but if it finds a link to any blocked page, it might still link it from their search results.
If you want to also forbid this linking, you could use the meta element with robots and noindex.

robots.txt: user-agent: Googlebot disallow: / Google still indexing

Look at the robots.txt of this site:
fr2.dk/robots.txt
The content is:
User-Agent: Googlebot
Disallow: /
That ought to tell google not to index the site, no?
If true, why does the site appear in google searches?
Besides having to wait, because Google's index updates take some time, also note that if you have other sites linking to your site, robots.txt alone won't be sufficient to remove your site.
Quoting Google's support page "Remove a page or site from Google's search results":
If the page still exists but you don't want it to appear in search results, use robots.txt to prevent Google from crawling it. Note that in general, even if a URL is disallowed by robots.txt we may still index the page if we find its URL on another site. However, Google won't index the page if it's blocked in robots.txt and there's an active removal request for the page.
One possible alternative solution is also mentioned in above document:
Alternatively, you can use a noindex meta tag. When we see this tag on a page, Google will completely drop the page from our search results, even if other pages link to it. This is a good solution if you don't have direct access to the site server. (You will need to be able to edit the HTML source of the page).
If you just added this, then you'll have to wait - it's not instantaenous - until Googlebot comes back to respider the site and sees the robots.txt, the site'll still be in their database.
I doubt it's relevant, but you might want to change your "Agent" to "agent" - Google's most likely not case sensitive for this, but can't hurt to follow the standard exactly.
I can confirm Google doesn't respect the Robots Exclusion File. Here's my file, which I created before putting this origin online:
https://git.habd.as/robots.txt
And the full contents of the file:
User-agent: *
Disallow:
User-agent: Google
Disallow: /
And Google still indexed it.
I don't use Google after cancelling my account last March and never had this site added to a webmaster console outside Yandex which leaves me with two assumptions:
Google is scraping Yandex
Google doesn't respect the Robots Exclusion Standard
I haven't grepped my logs yet but I will and my assumption is I'll find Google spiders in there misbehaving.