Robots.txt different from the one in directory - robots.txt

I can't figure out why Google reads my robots.txt file as Disallow: /.
This is what I have in my robots.txt file that is in the main root directory:
User-agent: *
Allow: /
But if I open it in a browser, it shows Disallow: / instead: http://revita.hr/robots.txt
I tried everything: I submitted the sitemap and added a meta robots tag with index, follow to <head>, but it's always the same.
Any ideas?

You seem to have a different robots.txt file if accessing it via HTTPS (→ Allow) instead of HTTP (→ Disallow).
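One way to confirm this kind of mismatch is to fetch the file over both schemes and compare the bodies. The following is only a minimal sketch using Python's standard library; the helper names `robots_urls`, `fetch`, and `compare_schemes` are made up for illustration, and redirects or caching on the server may change what you actually see.

```python
from urllib.request import urlopen


def robots_urls(host):
    """Return the HTTP and HTTPS robots.txt URLs for a host name."""
    return [f"{scheme}://{host}/robots.txt" for scheme in ("http", "https")]


def fetch(url, timeout=10):
    """Download a URL and return its body as text (urlopen follows redirects)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def compare_schemes(host, fetcher=fetch):
    """Fetch robots.txt over both schemes and report whether the bodies match."""
    bodies = {url: fetcher(url).strip() for url in robots_urls(host)}
    same = len(set(bodies.values())) == 1
    return bodies, same
```

Calling `compare_schemes("revita.hr")` performs two live requests; passing your own `fetcher` lets you try the comparison offline.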
By the way, you don’t need to state
User-agent: *
Allow: /
because allowing everything is the default. As Allow is not part of the original robots.txt specification, you might want to use this instead:
User-agent: *
Disallow:
Also note that you should not have a blank line inside a record.
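To see how a parser reads these variants, you can feed them to Python's `urllib.robotparser`. This is just a sketch of how that one parser behaves, as far as I understand its implementation; other crawlers may be more forgiving. The agent name `AnyBot` and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, agent, url):
    """Parse robots.txt lines and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


url = "http://example.com/page.html"  # placeholder URL

# An empty Disallow value blocks nothing.
print(allowed(["User-agent: *", "Disallow:"], "AnyBot", url))

# Disallow: / blocks everything.
print(allowed(["User-agent: *", "Disallow: /"], "AnyBot", url))

# A blank line right after User-agent ends the record early, so this
# parser never attaches the Disallow rule to any user agent.
print(allowed(["User-agent: *", "", "Disallow: /"], "AnyBot", url))
```

In the last case the rule is silently dropped, which is why a stray blank line inside a record can make a file behave differently from what it appears to say.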

Related

Why does Google not index my "robots.txt"?

I am trying to allow the Googlebot webcrawler to index my site. My robots.txt initially looked like this:
User-agent: *
Disallow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
And I changed it to:
User-agent: *
Allow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
But Google is still not indexing my links.
"I am trying to allow the Googlebot webcrawler to index my site."
Robots rules have nothing to do with indexing! They are ONLY about whether a page may be crawled. A page can be indexed even if crawling it is forbidden!
The Host directive is supported only by Yandex.
If you want all bots to be able to crawl your site, your robots.txt file should be placed at https://www.sitename.com/robots.txt, be available with status code 200, and contain:
User-agent: *
Disallow:
Sitemap: https://www.sitename.com/sitemap.xml
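If you want to check the location and status code yourself, something like the following could work. It is a rough sketch with Python's standard library; `robots_url` and `robots_status` are names I made up, and the status check needs network access.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url):
    """Return the robots.txt URL for the scheme and host of any page URL."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def robots_status(page_url, timeout=10):
    """Return the final HTTP status code of the site's robots.txt."""
    try:
        with urlopen(robots_url(page_url), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still useful.
        return err.code


print(robots_url("https://www.sitename.com/blog/post?id=1"))
# → https://www.sitename.com/robots.txt
```

Note that `urlopen` follows redirects, so `robots_status` reports the status of the final response, not of the first hop.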
From the docs:
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:
User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
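The terms above can be tried out together with Python's `urllib.robotparser`. This is only an illustration: the paths, the delay value, and the agent name `AnyBot` are invented, and `site_maps()` needs Python 3.8 or later. The Allow line comes first because this parser applies the first matching rule.

```python
from urllib.robotparser import RobotFileParser

lines = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Allow: /private/public-page.html",
    "Disallow: /private/",
    "Sitemap: https://www.sitename.com/sitemap.xml",
]

parser = RobotFileParser()
parser.parse(lines)

base = "https://www.sitename.com"
blocked = parser.can_fetch("AnyBot", base + "/private/secret.html")
open_page = parser.can_fetch("AnyBot", base + "/private/public-page.html")

print(blocked)                        # False: covered by Disallow: /private/
print(open_page)                      # True: the Allow line matches first
print(parser.crawl_delay("AnyBot"))   # 10
print(parser.site_maps())             # ['https://www.sitename.com/sitemap.xml']
```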
Try mentioning Googlebot specifically in your robots.txt directives, for example:
User-agent: Googlebot
Allow: /
or allow all web crawlers access to all content:
User-agent: *
Disallow:

Check for specific text in Robots.txt

My URL ends with &content=Search. I want to block all URLs that end with this. I have added the following to robots.txt.
User-agent: *
Disallow:
Sitemap: http://local.com/sitemap.xml
Sitemap: http://local.com/en/sitemap.xml
Disallow: /*&content=Search$
But it's not working when I test /en/search?q=terms#currentYear=2015&content=search in https://webmaster.yandex.com/robots.xml. It is not working for me because content=search comes after the # character.
The Yandex Robots.txt analysis will block your example if you test for Search instead of search, as Robots.txt Disallow values are case-sensitive.
If your site uses case-insensitive URLs, you might want to use:
User-agent: *
Disallow: /*&content=Search$
Disallow: /*&content=search$
# and possibly also =SEARCH, =SEarch, etc.
Having said that, I don’t know if Yandex really supports this for URL fragments (it would be unusual, I guess), although their tool gives this impression.
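To make the two points concrete (case sensitivity and the fragment), here is a small sketch of wildcard matching in Python. It is my own approximation of the `*` and `$` extensions, not Yandex's actual matcher, and the function names are invented. It also shows that the part after # is not part of the path and query a client requests from the server.

```python
import re
from urllib.parse import urlsplit


def rule_matches(pattern, path_and_query):
    """Match a robots.txt path pattern with `*` and trailing `$` support."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # `*` matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = re.compile(body + ("$" if anchored else ""))
    return regex.match(path_and_query) is not None


def request_target(url):
    """Return path plus query: the URL without scheme, host, or fragment."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


pattern = "/*&content=Search$"
print(rule_matches(pattern, "/en/search?q=terms&content=Search"))  # True
print(rule_matches(pattern, "/en/search?q=terms&content=search"))  # False

target = request_target(
    "http://local.com/en/search?q=terms#currentYear=2015&content=search"
)
print(target)                          # /en/search?q=terms
print(rule_matches(pattern, target))   # False
```

So under this reading, a rule can only match the example URL if the crawler keeps the fragment, which is exactly the part I am unsure about.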

robots.txt disallow google bot on root domain but allow google image bot?

Would having the following robot.txt work?
User-agent: *
Disallow: /
User-agent: Googlebot-Image
Allow: /
My idea is to keep Google from crawling my CDN domain while still allowing Google Images to crawl and index my images.
The file has to be called robots.txt, not robot.txt.
Note that User-agent: * targets all bots (that are not matched by another User-agent record), not only the Googlebot. So if you want to allow other bots to crawl your site, you would want to use User-agent: Googlebot instead.
So this robots.txt would allow "Googlebot-Image" everything, and disallow everything for all other bots:
User-agent: Googlebot-Image
Disallow:
User-agent: *
Disallow: /
(Note that Disallow: with an empty string value is equivalent to Allow: /, but the Allow field is not part of the original robots.txt specification, although some parsers support it, among them Google’s).
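You can check which record a given bot ends up with by using Python's `urllib.robotparser`. This is a sketch of one parser's behaviour, not a statement about how Google selects records; the CDN URL and `SomeOtherBot` are placeholders.

```python
from urllib.robotparser import RobotFileParser

lines = [
    "User-agent: Googlebot-Image",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(lines)

url = "http://cdn.example.com/images/photo.jpg"  # placeholder URL
for agent in ("Googlebot-Image", "Googlebot", "SomeOtherBot"):
    print(agent, parser.can_fetch(agent, url))
# Googlebot-Image True
# Googlebot False
# SomeOtherBot False
```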

robots.txt to disallow all pages except one? Do they override and cascade?

I want one page of my site to be crawled and no others.
Also, if it's any different from the answer above, I would like to know the syntax for disallowing everything but the root (index) of the website.
# robots.txt for http://example.com/
User-agent: *
Disallow: /style-guide
Disallow: /splash
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Or can I do like this?
# robots.txt for http://example.com/
User-agent: *
Disallow: /
Allow: /under-construction
Also I should mention that this is a WordPress install, so "under-construction," for example, is set to the front page. So in that case it acts as the index.
I think what I need is to have http://example.com crawled, but no other pages.
# robots.txt for http://example.com/
User-agent: *
Disallow: /*
Would this mean disallow anything after the root?
The easiest way to allow access to just one page would be:
User-agent: *
Allow: /under-construction
Disallow: /
The original robots.txt specification says that crawlers should read robots.txt from top to bottom, and use the first matching rule. If you put the Disallow first, then many bots will see it as saying they can't crawl anything. By putting the Allow first, those that apply the rules from top to bottom will see that they can access that page.
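The first-match behaviour can be sketched in a few lines of Python. This is a simplified model (plain prefix matching, a single record, no wildcards), and `first_match_allows` is a name I made up, so treat it as an illustration rather than any crawler's real code.

```python
def first_match_allows(rules, path):
    """Apply rules top to bottom; the first rule whose prefix matches decides."""
    for directive, prefix in rules:
        if not prefix:
            continue  # an empty value matches nothing
        if path.startswith(prefix):
            return directive.lower() == "allow"
    return True  # no rule matched, so crawling is allowed


allow_first = [("Allow", "/under-construction"), ("Disallow", "/")]
disallow_first = [("Disallow", "/"), ("Allow", "/under-construction")]

print(first_match_allows(allow_first, "/under-construction"))     # True
print(first_match_allows(disallow_first, "/under-construction"))  # False
print(first_match_allows(allow_first, "/other-page"))             # False
```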
The expression rules are simple: the expression Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.
Your Disallow: /* means the same thing to Googlebot and Bingbot, but bots that don't support wildcards could see the /* and think that you meant a literal *. So they could assume that it was okay to crawl /*foo/bar.html.
If you just want to crawl http://example.com, but nothing else, you might try:
Allow: /$
Disallow: /
The $ means "end of string," just like in regular expressions. Again, that'll work for Google and Bing, but won't work for other crawlers if they don't support wildcards.
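The difference between a crawler that understands `$` and one that reads it literally can be sketched like this in Python. Both functions are simplified stand-ins I wrote for illustration (no `*` handling), not real crawler code.

```python
def literal_prefix_match(pattern, path):
    """How a parser without wildcard support compares a rule to a path."""
    return path.startswith(pattern)


def dollar_aware_match(pattern, path):
    """Treat a trailing `$` as an end-of-path anchor (no `*` handling here)."""
    if pattern.endswith("$"):
        return path == pattern[:-1]
    return path.startswith(pattern)


for path in ("/", "/about", "/index.php"):
    print(path, dollar_aware_match("/$", path), literal_prefix_match("/$", path))
# / True False
# /about False False
# /index.php False False
```

With the literal reading, `Allow: /$` matches nothing at all, so the following `Disallow: /` would apply to the homepage too.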
If you log into Google Webmaster Tools, from the left panel go to crawling, then go to Fetch as Google. Here you can test how Google will crawl each page.
In the case of blocking everything but the homepage:
User-agent: *
Allow: /$
Disallow: /
will work.
You can use either of the following; both will work:
User-agent: *
Allow: /$
Disallow: /
or
User-agent: *
Allow: /index.php
Disallow: /
The Allow must come before the Disallow because the file is read from top to bottom.
Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.
The $ means "end of string," like in regular expressions, so Allow: /$ matches only your homepage (/).
http://en.wikipedia.org/wiki/Robots.txt#Allow_directive
The order matters only to robots that follow the original standard. For the Google or Bing bots, the order is not important; Google, for one, documents that the most specific (longest) matching rule wins.
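A longest-match evaluation, where the order of the lines does not matter, could look roughly like this in Python. It is a simplified sketch under my own assumptions (plain prefixes only, no wildcards, Allow winning a tie), and `longest_match_allows` is an invented name, not Google's or Bing's code.

```python
def longest_match_allows(rules, path):
    """Pick the matching rule with the longest prefix; Allow wins a tie."""
    best = None
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        # Tuples compare element by element: length first, then Allow (True).
        candidate = (len(prefix), directive.lower() == "allow")
        if best is None or candidate > best:
            best = candidate
    return True if best is None else best[1]


allow_first = [("Allow", "/index.php"), ("Disallow", "/")]
disallow_first = [("Disallow", "/"), ("Allow", "/index.php")]

for rules in (allow_first, disallow_first):
    print(longest_match_allows(rules, "/index.php"),
          longest_match_allows(rules, "/contact"))
# True False
# True False
```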

robots.txt disallow: spider

I'm looking at the robots.txt file of a site that I would like to do a one-off scrape of, and it contains these lines:
User-agent: spider
Disallow: /
Does this mean they don't want any spiders? I was under the impression that * was used for all spiders. If true, this would of course stop spiders such as Google's.
This just asks agents that call themselves spider to be polite enough not to crawl the site.
This has no special meaning.
robots.txt files are used only by robots, so a way to exclude all robots is to use a *:
User-Agent: *
Disallow: /
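As a quick check of the difference, Python's `urllib.robotparser` treats a record for spider as applying only to agents whose name contains that token. This shows one parser's matching, and the URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: spider", "Disallow: /"])

url = "http://example.com/page.html"  # placeholder URL
print(parser.can_fetch("spider", url))     # False: the named agent is asked to stay out
print(parser.can_fetch("Googlebot", url))  # True: no record applies to this agent
```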