robots.txt: disallow Googlebot on the root domain but allow the Google image bot?

Would having the following robot.txt work?
User-agent: *
Disallow: /
User-agent: Googlebot-Image
Allow: /
My idea is to stop Google from crawling my CDN domain while still allowing Google Images to crawl and index my images.

The file has to be called robots.txt, not robot.txt.
Note that User-agent: * targets all bots (that are not matched by another User-agent record), not only the Googlebot. So if you want to allow other bots to crawl your site, you would want to use User-agent: Googlebot instead.
So this robots.txt would allow "Googlebot-Image" everything, and disallow everything for all other bots:
User-agent: Googlebot-Image
Disallow:
User-agent: *
Disallow: /
(Note that Disallow: with an empty string value is equivalent to Allow: /, but the Allow field is not part of the original robots.txt specification, although some parsers support it, among them Google’s).
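If you want to sanity-check which bots such a file blocks before deploying it, a minimal sketch with Python's urllib.robotparser (standard library) can evaluate the rules locally; the CDN host name below is just a placeholder:

from urllib import robotparser

# The robots.txt suggested above: Googlebot-Image may crawl everything,
# every other bot is blocked. cdn.example.com is a placeholder host.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "https://cdn.example.com/images/photo.jpg"
print(rp.can_fetch("Googlebot-Image", url))  # True  - the image bot may crawl
print(rp.can_fetch("Googlebot", url))        # False - falls back to the * record
print(rp.can_fetch("Bingbot", url))          # False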

Related

Why does Google not index my "robots.txt"?

I am trying to allow the Googlebot webcrawler to index my site. My robots.txt initially looked like this:
User-agent: *
Disallow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
And I changed it to:
User-agent: *
Allow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
But Google is still not indexing my links.
I am trying to allow the Googlebot webcrawler to index my site.
robots.txt rules have nothing to do with indexing! They are ONLY about crawling. A page can be indexed even if crawling it is forbidden!
The Host directive is supported only by Yandex.
If you want all bots to be able to crawl your site, your robots.txt file should be placed at https://www.sitename.com/robots.txt, be available with status code 200, and contain:
User-agent: *
Disallow:
Sitemap: https://www.sitename.com/sitemap.xml
From the docs:
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file (a short parsing sketch follows the list). They include:
User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
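To see how a parser actually reads these five fields, here is a small sketch using Python's urllib.robotparser; the paths and sitemap URL are made up for illustration, and the Allow line is placed before the Disallow line so that a first-match parser honors it:

from urllib import robotparser

# Illustrative robots.txt exercising all five terms listed above.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit.pdf
Disallow: /private/
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
rp.modified()  # record a check time; crawl_delay() consults it

print(rp.can_fetch("*", "https://www.example.com/private/secret.html"))    # False
print(rp.can_fetch("*", "https://www.example.com/private/press-kit.pdf"))  # True
print(rp.crawl_delay("*"))  # 10
print(rp.site_maps())       # ['https://www.example.com/sitemap.xml'] (Python 3.8+)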
Try to specifically mention Googlebot in your robots.txt directives, such as:
User-agent: Googlebot
Allow: /
or allow all web crawlers access to all content
User-agent: *
Disallow:
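If you want to verify both conditions from this answer, that the file is reachable with a 200 status and that Googlebot is not blocked, a short Python check can do it; www.sitename.com is the placeholder domain from the question:

from urllib import request, robotparser

ROBOTS_URL = "https://www.sitename.com/robots.txt"

# 1. The file must answer with status 200 (urlopen raises HTTPError on 4xx/5xx).
resp = request.urlopen(ROBOTS_URL)
print(resp.status)  # expect 200

# 2. The rules it contains must not block Googlebot.
rp = robotparser.RobotFileParser(ROBOTS_URL)
rp.read()
print(rp.can_fetch("Googlebot", "https://www.sitename.com/"))  # expect True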

Robots.txt different from the one in directory

I can't figure out why Google reads my robots.txt file as Disallow: /.
This is what I have in my robots.txt file that is in the main root directory:
User-agent: *
Allow: /
But if I type it into the browser it shows Disallow: /: http://revita.hr/robots.txt
I tried everything: submitted the sitemap, added a robots meta tag with index, follow into the <head>, but it's always the same.
Any ideas?
You seem to have a different robots.txt file if accessing it via HTTPS (→ Allow) instead of HTTP (→ Disallow).
By the way, you don’t need to state
User-agent: *
Allow: /
because allowing everything is the default. As Allow is not part of the original robots.txt specification, you might want to use this instead:
User-agent: *
Disallow:
Also note that you should not have a blank line inside a record.
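A quick way to spot this kind of mismatch is to fetch the file over both schemes and compare; here is a small Python sketch (plain curl works just as well), using the revita.hr host from the question:

from urllib import request

# Fetch robots.txt over HTTP and HTTPS and print both, so any
# difference between the two responses is obvious.
for scheme in ("http", "https"):
    url = f"{scheme}://revita.hr/robots.txt"
    body = request.urlopen(url).read().decode("utf-8", errors="replace")
    print(f"--- {url} ---")
    print(body)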

Website Description not showing in Google search engine

My website's description is not showing in the Google search engine.
I wrote the description meta tag inside the head tag,
and my robots.txt file is as below:
User-agent: *
Disallow: /
When I search in Google I get the message below:
A description for this result is not available because of this site's robots.txt – learn more.
User-agent: *
Disallow: /
You're blocking all robots from crawling your website.
Simply remove the Disallow: / line to let robots crawl your site, or use:
User-agent: *
Disallow:
robots.txt examples:
To exclude all robots from the entire server (what you have!)
User-agent: *
Disallow: /
To allow all robots complete access (or just create an empty "/robots.txt" file, or don't use one at all)
User-agent: *
Disallow:
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot (and exclude all others)
User-agent: Google
Disallow:

User-agent: *
Disallow: /
Learn more about robots.txt at http://www.robotstxt.org
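Any of these examples can be checked locally before going live; here is a sketch with Python's urllib.robotparser for the "exclude a single robot" case (BadBot is the placeholder name from the list above, example.com stands in for your site):

from urllib import robotparser

ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("BadBot", "https://example.com/page.html"))     # False - matched by the BadBot record
print(rp.can_fetch("Googlebot", "https://example.com/page.html"))  # True  - no record applies, so crawling is allowed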

Disallow only pages that end with a number in robots.txt

Is it possible to tell Google not to crawl these pages
/blog/page/10
/blog/page/20
…
/blog/page/100
These are basically Ajax calls that return blog post data.
I created this in robots.txt:
User-agent: *
Disallow: /blog/page/*
But now I have another page that I want to allow, which is
/blog/page/start
Is there a way to tell robots to disallow only pages that end with a number,
e.g.
User-agent: *
Disallow: /blog/page/(:num)
I also got the error below when I tried to validate the robots.txt file:
Following the original robots.txt specification, this would work (for all conforming bots, including Google’s):
User-agent: *
Disallow: /blog/page/0
Disallow: /blog/page/1
Disallow: /blog/page/2
Disallow: /blog/page/3
Disallow: /blog/page/4
Disallow: /blog/page/5
Disallow: /blog/page/6
Disallow: /blog/page/7
Disallow: /blog/page/8
Disallow: /blog/page/9
This blocks all URLs whose path begins with /blog/page/ followed by any number (/blog/page/9129831823, /blog/page/9.html, /blog/page/5/10/foo, etc.).
So you should not append the * character (it’s not a wildcard in the original robots.txt specification, and it's not even needed in your case for bots that do interpret it as a wildcard).
Google supports some features for robots.txt which are not part of the original robots.txt specification, and therefore are not supported by (all) other bots, e.g., the Allow field. But as the above robots.txt would work, there is no need for using it.
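To confirm these ten lines behave as intended, without relying on Google-specific wildcards, you can run them through Python's urllib.robotparser (example.com stands in for your domain):

from urllib import robotparser

# Build the ten Disallow lines from the answer instead of typing them out.
rules = "\n".join(f"Disallow: /blog/page/{digit}" for digit in range(10))
ROBOTS_TXT = "User-agent: *\n" + rules + "\n"

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "https://example.com/blog/page/10"))     # False - starts with /blog/page/1
print(rp.can_fetch("*", "https://example.com/blog/page/100"))    # False
print(rp.can_fetch("*", "https://example.com/blog/page/start"))  # True  - no digit prefix matches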

robots.txt to disallow all pages except one? Do they override and cascade?

I want one page of my site to be crawled and no others.
Also, if it's any different from the answer above, I would also like to know the syntax for disallowing everything but the root (index) of the website.
# robots.txt for http://example.com/
User-agent: *
Disallow: /style-guide
Disallow: /splash
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Or can I do like this?
# robots.txt for http://example.com/
User-agent: *
Disallow: /
Allow: /under-construction
Also I should mention that this is a WordPress install, so "under-construction," for example, is set to the front page. So in that case it acts as the index.
I think what I need is to have http://example.com crawled, but no other pages.
# robots.txt for http://example.com/
User-agent: *
Disallow: /*
Would this mean disallow anything after the root?
The easiest way to allow access to just one page would be:
User-agent: *
Allow: /under-construction
Disallow: /
The original robots.txt specification says that crawlers should read robots.txt from top to bottom, and use the first matching rule. If you put the Disallow first, then many bots will see it as saying they can't crawl anything. By putting the Allow first, those that apply the rules from top to bottom will see that they can access that page.
The expression rules are simple: the expression Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.
Your Disallow: /* means the same thing to Googlebot and Bingbot, but bots that don't support wildcards could see the /* and think that you meant a literal *. So they could assume that it was okay to crawl /*foo/bar.html.
If you just want http://example.com itself to be crawled, but nothing else, you might try:
Allow: /$
Disallow: /
The $ means "end of string," just like in regular expressions. Again, that'll work for Google and Bing, but won't work for other crawlers if they don't support wildcards.
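You can see the ordering effect with a first-match parser; Python's urllib.robotparser applies rules top to bottom, so it matches the behaviour described above (example.com is a placeholder):

from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Allow: /under-construction
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The Allow line is reached before the blanket Disallow, so only
# /under-construction stays crawlable.
print(rp.can_fetch("*", "https://example.com/under-construction"))  # True
print(rp.can_fetch("*", "https://example.com/blog/"))               # False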
If you log into Google Webmaster Tools, from the left panel go to crawling, then go to Fetch as Google. Here you can test how Google will crawl each page.
In the case of blocking everything but the homepage:
User-agent: *
Allow: /$
Disallow: /
will work.
You can use either of the two below; both will work:
User-agent: *
Allow: /$
Disallow: /
or
User-agent: *
Allow: /index.php
Disallow: /
The Allow must come before the Disallow, because the file is read from top to bottom.
Disallow: / says "disallow anything that starts with a slash," which means everything on the site.
The $ means "end of string," as in regular expressions, so the result of Allow: /$ is that only your homepage (/index) is allowed.
http://en.wikipedia.org/wiki/Robots.txt#Allow_directive
The order is only important to robots that follow the standard; in the case of the Google or Bing bots, the order is not important.
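To illustrate the caveat that Allow: /$ only works for wildcard-aware crawlers such as Googlebot and Bingbot, here is what a parser that sticks to the original specification (Python's urllib.robotparser) makes of it:

from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Allow: /$
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# With no $ support, "/$" is treated as a literal path prefix and never
# matches the homepage, so everything falls through to Disallow: /.
print(rp.can_fetch("*", "https://example.com/"))      # False - even the homepage looks blocked
print(rp.can_fetch("*", "https://example.com/blog"))  # False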