Block search engines from indexing local search results but not search page - robots.txt

Bingbot keeps indexing my site's internal search result pages, so I want to:
Allow search engines to access everything in general.
Allow search engines to index the search/ URL.
Disallow only search queries (search/?q=example) without blocking the search/ URL itself.
Are there any conflicts with the following code in relation to my three stated goals?
User-Agent: *
Allow: /
Disallow: /search/?

Set a canonical link on your search page pointing to the URL of the search page itself. Then give Google some time to clean up the mess (up to 6 months if there are many pages, or rarely visited pages, to work through). All indexed pages with query parameters will be regrouped under the search page and removed from the rankings little by little.
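In the search page template, that canonical would look something like this (a sketch; example.com is a placeholder for your own domain and search path):
<!-- every search/?q=... variant points back to the plain search page -->
<link rel="canonical" href="https://www.example.com/search/">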

Related

Disallow search engines from indexing the entire web site, while allowing them to save meta title and description

We are using the following robots.txt on our site:
User-agent: *
Disallow: /
We'd like to keep the functionality (not allow crawlers to index any part of the site), but we would like search engines to save the meta title and description, so that these texts show up beautifully when someone enters the domain name into a search engine.
As far as I can see, the only workaround is to create a separate indexable page with only meta tags. Is this the only way to achieve our goal? Will it have any side effects?
With this robots.txt, you disallow bots from crawling documents on your host. Bots are still allowed to index the URLs of your documents (e.g., if they find links on external sites), but they aren't allowed to access your documents' head elements, so they can't use that content to provide a title or description in their SERP.
There’s no standard way to allow bots to access the head but not the body.
Some search engines might display metadata from other sources, e.g., from the Open Directory Project (you could disallow this with the noodp value for the meta-robots element) or the Yahoo Directory (you could disallow this with the noydir value).
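If you wanted to use those values, they go in a single meta-robots element, for example:
<!-- ask engines not to pull titles/descriptions from ODP or the Yahoo Directory -->
<meta name="robots" content="noodp, noydir">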
If you were to create a document that only contains metadata in the head, and allow bots to crawl it in your robots.txt, bots might crawl and index it, but the metadata will of course be shown for this page only, not for other pages on your host.
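As a rough sketch, such a metadata-only document could be as small as this (title and description are placeholders):
<!DOCTYPE html>
<html>
<head>
  <title>Example Site</title>
  <meta name="description" content="One or two sentences describing the site.">
</head>
<body></body>
</html>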

How to deny Googlebot only for a specific set of page variables?

I have a page at https://www.somedomain.com, and under that page users have the option to change the language, like
https://www.somedomain.com/?change_language=en&random_id=123
https://www.somedomain.com/?change_language=de&random_id=123
https://www.somedomain.com/?change_language=fr&random_id=123
etc.
Is it possible to deny Googlebot from crawling these links, but still crawl the https://www.somedomain.com/ main page?
You can use robots.txt to target just the query parameter:
User-agent: *
Disallow: /?change_language
This will prevent Google and other well-behaved bots from crawling the language options on the homepage. If you want it to apply to all pages, use a wildcard (Google and Bing support * in robots.txt paths):
User-agent: *
Disallow: /*?change_language
However, you might want to consider letting those language versions be crawled and instead using the rel="alternate" hreflang annotations that Google and Bing support.
This way you can indicate to the engines that the content is available in multiple languages, allowing your site to get indexed in all the different country-specific versions of Google, Bing, and Yahoo.
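A sketch of what that markup might look like in the head of the homepage, assuming each language version has a stable URL to point at (the random_id parameter is left out here):
<link rel="alternate" hreflang="en" href="https://www.somedomain.com/?change_language=en" />
<link rel="alternate" hreflang="de" href="https://www.somedomain.com/?change_language=de" />
<link rel="alternate" hreflang="fr" href="https://www.somedomain.com/?change_language=fr" />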

How to set robots.txt files for subdomains?

I have a subdomain, e.g. blog.example.com, and I want this domain not to be indexed by Google or any other search engine. I put my robots.txt file in the 'blog' folder on the server with the following configuration:
User-agent: *
Disallow: /
Will this be enough to keep it from being indexed by Google?
A few days ago, site:blog.example.com showed 931 results, but now it is displaying 1320 pages. I am wondering, if my robots.txt file is correct, why Google is still indexing my domain.
If I am doing anything wrong, please correct me.
Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.
This question is actually about how to prevent indexing of a subdomain, yet here your robots.txt file is preventing your site from being noindexed.
Don’t use a robots.txt file as a means to hide your web pages from Google search results.
Introduction to robots.txt: What is a robots.txt file used for? Google Search Central Documentation
For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Block Search indexing with noindex Google Search Central Documentation
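Applied to the blog.example.com case, that means leaving the subdomain crawlable in robots.txt and carrying the noindex directive on every page instead, roughly:
User-agent: *
Disallow:
and, in the head of each page:
<meta name="robots" content="noindex">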

robots.txt: user-agent: Googlebot disallow: / Google still indexing

Look at the robots.txt of this site:
fr2.dk/robots.txt
The content is:
User-Agent: Googlebot
Disallow: /
That ought to tell Google not to index the site, no?
If true, why does the site appear in Google searches?
Besides having to wait, because Google's index updates take some time, also note that if other sites link to your site, robots.txt alone won't be sufficient to remove your site.
Quoting Google's support page "Remove a page or site from Google's search results":
If the page still exists but you don't want it to appear in search results, use robots.txt to prevent Google from crawling it. Note that in general, even if a URL is disallowed by robots.txt we may still index the page if we find its URL on another site. However, Google won't index the page if it's blocked in robots.txt and there's an active removal request for the page.
One possible alternative solution is also mentioned in the above document:
Alternatively, you can use a noindex meta tag. When we see this tag on a page, Google will completely drop the page from our search results, even if other pages link to it. This is a good solution if you don't have direct access to the site server. (You will need to be able to edit the HTML source of the page).
If you just added this, then you'll have to wait - it's not instantaneous. Until Googlebot comes back to re-spider the site and sees the robots.txt, the site will still be in their database.
I doubt it's relevant, but you might want to change your "Agent" to "agent" - Google is most likely not case-sensitive for this, but it can't hurt to follow the standard exactly.
I can confirm Google doesn't respect the Robots Exclusion Standard. Here's my file, which I created before putting this origin online:
https://git.habd.as/robots.txt
And the full contents of the file:
User-agent: *
Disallow:
User-agent: Google
Disallow: /
And Google still indexed it.
I haven't used Google since cancelling my account last March, and I never added this site to a webmaster console other than Yandex, which leaves me with two assumptions:
Google is scraping Yandex
Google doesn't respect the Robots Exclusion Standard
I haven't grepped my logs yet, but I will, and my assumption is that I'll find Google spiders in there misbehaving.

How to configure robots.txt to allow everything?

My robots.txt in Google Webmaster Tools shows the following values:
User-agent: *
Allow: /
What does it mean? I don't have enough knowledge about it, so I'm looking for your help. I want to allow all robots to crawl my website; is this the right configuration?
That file will allow all crawlers access:
User-agent: *
Allow: /
This basically allows all user agents (the *) to all parts of the site (the /).
If you want to allow every bot to crawl everything, this is the best way to specify it in your robots.txt:
User-agent: *
Disallow:
Note that the Disallow field has an empty value, which means according to the specification:
Any empty value, indicates that all URLs can be retrieved.
Your way (with Allow: / instead of Disallow:) works, too, but Allow is not part of the original robots.txt specification, so it’s not supported by all bots (many popular ones support it, though, like the Googlebot). That said, unrecognized fields have to be ignored, and for bots that don’t recognize Allow, the result would be the same in this case anyway: if nothing is forbidden to be crawled (with Disallow), everything is allowed to be crawled.
However, formally (per the original spec) it’s an invalid record, because at least one Disallow field is required:
At least one Disallow field needs to be present in a record.
I understand that this is a fairly old question and has some pretty good answers. But here are my two cents for the sake of completeness.
As per the official documentation, there are four ways you can allow robots complete access to your site.
Clean:
Specify a global matcher with an empty Disallow field, as mentioned by @unor. So your /robots.txt looks like this:
User-agent: *
Disallow:
The hack:
Create a /robots.txt file with no content in it, which will default to allowing everything for all types of bots.
I don't care way:
Do not create a /robots.txt at all, which should yield the exact same results as the above two.
The ugly:
From the robots documentation for meta tags, you can use the following meta tag on all the pages of your site to let bots know that these pages are allowed to be indexed and their links followed.
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
In order for this to apply to your entire site, you will have to add this meta tag to all of your pages, and the tag must be placed inside the HEAD element of each page. More about this meta tag here.
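A minimal sketch of that placement on a single page:
<!DOCTYPE html>
<html>
<head>
  <title>Any page on the site</title>
  <!-- explicitly tells bots this page may be indexed and its links followed -->
  <meta name="robots" content="index, follow">
</head>
<body>...</body>
</html>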
It means you allow every (*) user agent/crawler to access everything under the root (/) of your site. You're okay.
I think you are good; you're allowing all pages to be crawled:
User-agent: *
Allow: /