robots.txt: my site has over 700 pages indexed, but it will not rank at all

My robots.txt:
User-agent: *
Disallow: /Templates/
Allow: /
I don't want the 'Templates' folder to be crawled.
Is the above correct?
What else may be causing my site not to rank, even though Google Webmaster Tools is giving green lights for everything?
Thank you in advance.

Your robots.txt is fine. You could also use the following one, which has the same meaning (because Allow is only supported by some parsers, and by default everything is allowed anyway):
User-agent: *
Disallow: /Templates/
Note that Disallow: /Templates/ blocks, for example, the following URLs:
http://example.com/Templates/
http://example.com/Templates/foobar
http://example.com/Templates/foobar/index.html
But the following URLs are still allowed:
http://example.com/Templates
http://example.com/Templates.html
http://example.com/templates/
http://example.com/templates/foobar.html
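You can verify exactly these cases with Python's stdlib `urllib.robotparser`, which implements the standard prefix-matching rules (`example.com` is a placeholder host, as above):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /Templates/",
])

# Blocked: any URL whose path starts with the /Templates/ prefix
print(rp.can_fetch("*", "http://example.com/Templates/"))        # False
print(rp.can_fetch("*", "http://example.com/Templates/foobar"))  # False

# Still allowed: no trailing slash, different extension, different case
# (robots.txt path matching is case-sensitive)
print(rp.can_fetch("*", "http://example.com/Templates"))         # True
print(rp.can_fetch("*", "http://example.com/Templates.html"))    # True
print(rp.can_fetch("*", "http://example.com/templates/"))        # True
```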
What else may be causing my site not to rank, even though Google Webmaster Tools is giving green lights for everything?
This question is off-topic here. You should try it at Webmasters SE, but you’d need to give more details there.

Related

Why does Google not index my "robots.txt"?

I am trying to allow the Googlebot webcrawler to index my site. My robots.txt initially looked like this:
User-agent: *
Disallow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
And I changed it to:
User-agent: *
Allow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
But Google is still not indexing my links.
I am trying to allow the Googlebot webcrawler to index my site.
Robots rules have nothing to do with indexing! They are ONLY about crawling. A page can be indexed even if crawling it is forbidden!
The Host directive is supported only by Yandex.
If you want all bots to be able to crawl your site, your robots.txt file should be available at https://www.sitename.com/robots.txt, return status code 200, and contain:
User-agent: *
Disallow:
Sitemap: https://www.sitename.com/sitemap.xml
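To sanity-check that file, Python's stdlib `urllib.robotparser` can parse it directly (`site_maps()` needs Python 3.8+; `www.sitename.com` is the asker's placeholder host):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "",
    "Sitemap: https://www.sitename.com/sitemap.xml",
])

# An empty Disallow means "nothing is disallowed": every path is crawlable.
print(rp.can_fetch("Googlebot", "https://www.sitename.com/any/page.html"))  # True

# The sitemap location is exposed as well.
print(rp.site_maps())  # ['https://www.sitename.com/sitemap.xml']
```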
From the docs:
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:
User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.
Disallow: The command used to tell a user agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
Try specifically mentioning Googlebot in your robots.txt directives, such as:
User-agent: Googlebot
Allow: /
or allow all web crawlers access to all content
User-agent: *
Disallow:

Google robots.txt file

I want to allow the Google robot:
1) to only see the main page
2) to see the description in search results for the main page
I have the following code, but it seems that it doesn't work:
User-agent: *
Disallow: /feed
Disallow: /site/terms-of-service
Disallow: /site/rules
Disallow: /site/privacy-policy
Allow: /$
Am I missing something, or do I just need to wait for the Google robot to visit my site?
Or is some action required from the Google webmaster panel?
Thanks in advance!
Your robots.txt should work (and yes, it takes time), but you might want to make the following changes:
It seems you want to target only Google’s bot, so you should use User-agent: Googlebot instead of User-agent: * (which targets all bots that don’t have a specific record in your robots.txt).
It seems that you want to disallow crawling of all pages except the home page, so there is no need to list specific path beginnings in Disallow.
So it could look like this:
User-agent: Googlebot
Disallow: /
Allow: /$
Google’s bot may only crawl your home page, nothing else. All other bots may crawl everything.
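Google-style wildcards (`*` and `$`) are not part of the original robots.txt syntax, so here is a minimal, simplified sketch of how Google documents path matching: the longest matching rule wins, and Allow wins ties. The function names are my own, not from any library:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # Translate robots.txt wildcards into a regex:
    # '*' matches any sequence of characters, '$' anchors the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    parts = [re.escape(p) for p in pattern.split("*")]
    regex = "^" + ".*".join(parts)
    if anchored:
        regex += "$"
    return re.compile(regex)

def is_allowed(rules, path):
    # Google picks the most specific (longest) matching rule; Allow wins ties.
    best = None  # (pattern length, is_allow)
    for directive, pat in rules:
        if robots_pattern_to_regex(pat).match(path):
            candidate = (len(pat), directive == "Allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# The record recommended above for Googlebot:
rules = [("Disallow", "/"), ("Allow", "/$")]
print(is_allowed(rules, "/"))            # True: only the home page is crawlable
print(is_allowed(rules, "/feed"))        # False
print(is_allowed(rules, "/site/rules"))  # False
```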

robots.txt disallow all with crawl-delay

I would like to get information from a certain site, and checked to see if I were allowed to crawl it. The robots.txt file had considerations for 15 different user agents and then for everyone else. My confusion comes from the everyone else statement (which would include me). It was
User-agent: *
Crawl-delay: 5
Disallow: /
Disallow: /sbe_2020/pdfs/
Disallow: /sbe/sbe_2020/2020_pdfs
Disallow: /newawardsearch/
Disallow: /ExportResultServlet*
If I read this correctly, the site is asking that no unauthorized user agents crawl it. However, the fact that they included a Crawl-delay seems odd. If I'm not allowed to crawl it, why would there even be a crawl-delay consideration? And why would they need to include any specific directories at all? Or perhaps I've read the "Disallow: /" incorrectly?
Yes, this record would mean the same if it were reduced to this:
User-agent: *
Disallow: /
A bot matched by this record is not allowed to crawl anything on this host (having an unneeded Crawl-delay doesn’t change this).
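You can confirm this with Python's stdlib `urllib.robotparser` (`example.com` stands in for the real host):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /",
])

# "Disallow: /" blocks every path; the narrower Disallow lines are redundant.
print(rp.can_fetch("*", "http://example.com/newawardsearch/"))  # False
print(rp.can_fetch("*", "http://example.com/anything"))         # False

# The crawl delay is still parsed, but it only matters for bots that
# are allowed to crawl something in the first place.
print(rp.crawl_delay("*"))  # 5
```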

Remove multiple URLs of the same type from Google Webmaster

I accidentally kept some URLs of the form www.example.com/abc/?id=1, where the value of id can vary from 1 to 200. I don't want these to appear in search, so I am using the Remove URL feature of Google Webmaster Tools. How can I remove all URLs of this type in one shot? I tried www.example.com/abc/?id=* but this didn't work!
Just block them using robots.txt, i.e.:
User-agent: *
Disallow: /junk.html
Disallow: /foo.html
Disallow: /bar.html
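Assuming everything under /abc/ should be blocked, Disallow's prefix matching means a single rule covers all 200 id values; a quick check with Python's stdlib `urllib.robotparser` (`www.example.com` as in the question):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /abc/",  # one prefix rule covers /abc/?id=1 ... /abc/?id=200
])

print(rp.can_fetch("*", "http://www.example.com/abc/?id=1"))    # False
print(rp.can_fetch("*", "http://www.example.com/abc/?id=200"))  # False
print(rp.can_fetch("*", "http://www.example.com/other/"))       # True
```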

robots.txt allow root only, disallow everything else?

I can't seem to get this to work but it seems really basic.
I want the domain root to be crawled
http://www.example.com
But nothing else to be crawled and all subdirectories are dynamic
http://www.example.com/*
I tried
User-agent: *
Allow: /
Disallow: /*/
but the Google webmaster test tool says all subdirectories are allowed.
Anyone have a solution for this? Thanks :)
According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow and Disallow directives doesn't matter. So changing the order really won't help you.
Instead, use the $ operator to indicate the end of your path. $ means 'the end of the URL' (i.e. don't match anything from this point on).
Test this robots.txt. I'm certain it should work for you (I've also verified in Google Search Console):
user-agent: *
Allow: /$
Disallow: /
This will allow http://www.example.com and http://www.example.com/ to be crawled but everything else blocked.
Note: the Allow directive satisfies your particular use case, but if you have index.html or default.php, those URLs will not be crawled.
Side note: I'm only really familiar with Googlebot and bingbot behaviors. If there are any other engines you are targeting, they may or may not have specific rules on how the directives are listed out. So if you want to be "extra" sure, you can always swap the positions of the Allow and Disallow directive blocks; I just set them that way to debunk some of the comments.
When you look at the google robots.txt specifications, you can see that:
Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:
* designates 0 or more instances of any valid character
$ designates the end of the URL
see https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#example-path-matches
Then, as eywu said, the solution is:
user-agent: *
Allow: /$
Disallow: /