robots.tx disallow all with crawl-delay - robots.txt

I would like to get information from a certain site, and checked to see if I were allowed to crawl it. The robots.txt file had considerations for 15 different user agents and then for everyone else. My confusion comes from the everyone else statement (which would include me). It was
User-agent: *
Crawl-delay: 5
Disallow: /
Disallow: /sbe_2020/pdfs/
Disallow: /sbe/sbe_2020/2020_pdfs
Disallow: /newawardsearch/
Disallow: /ExportResultServlet*
If I read this correctly, the site is asking that no unauthorized user-agents crawl it. However, the fact that they included a Crawl-delay seems odd. If I'm not allowed to crawl it, why would there even be a crawl delay consideration? And why would they need to include any specific directories at all? Or, perhaps I've read the " Disallow: /" incorrectly?

Yes, this record would mean the same if it were reduced to this:
User-agent: *
Disallow: /
A bot matched by this record is not allowed to crawl anything on this host (having an unneeded Crawl-delay doesn’t change this).

Related

How To Disallow URLs in Robots.txt ending with numeric values

I need help removing or DisAllowing some malicious URLs that Google has configured with my main domain. I didn't really pay attention until my site broke down and I found hundreds of pages like website dot com / 10588msae28bdem12b84
Now I want to Disallow: all of them in Robots.txt
I need your help thanks And I also want to remove them from the Google index. Any excellent advice would be appreciated. Thanks
You could use ten rules with wildcards like this:
User-agent: *
Disallow: /*0
Disallow: /*1
Disallow: /*2
Disallow: /*3
Disallow: /*4
Disallow: /*5
Disallow: /*6
Disallow: /*7
Disallow: /*8
Disallow: /*9
However, this probably not the best solution to your problem. It is likely that your site was hacked. You should clean that up by following this guide: Help, I think I've been hacked!  |  Web Fundamentals  |  Google Developers
You need to find a way of removing the hack so that these URLs return a 404 or 410 status code which Google won't index. Then you should actualy let Google crawl those URLs to know they are gone.

Why does Google not index my "robots.txt"?

I am trying to allow the Googlebot webcrawler to index my site. My robots.txt initially looked like this:
User-agent: *
Disallow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
And I changed it to:
User-agent: *
Allow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
Only Google is still not indexing my links.
I am trying to allow the Googlebot webcrawler to index my site.
Robots rules has nothing to do with indexing! They are ONLY about crawling ability. A page can be indexed, even if it is forbidden to be crawled!
host directive is supported only by Yandex.
If you want all bots are able to crawl your site, your robots.txt file should be placed under https://www.sitename.com/robots.txt, be available with status code 200, and contain:
User-agent: *
Disallow:
Sitemap: https://www.sitename.com/sitemap.xml
From the docs:
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely come across in a robots file. They include:
User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.
Disallow: The command used to tell a user-agent not to crawl particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
Try to specifically mention Googlebot in your robots.txt-directives such as:
User-agent: Googlebot
Allow: /
or allow all web crawlers access to all content
User-agent: *
Disallow:

Check for specific text in Robots.txt

My URL ends with &content=Search. I want to block all URLs that end with this. I have added following in robots.txt.
User-agent: *
Disallow:
Sitemap: http://local.com/sitemap.xml
Sitemap: http://local.com/en/sitemap.xml
Disallow: /*&content=Search$
But it's not working when testing /en/search?q=terms#currentYear=2015&content=search in https://webmaster.yandex.com/robots.xml. It is not working for me because content=search is after # character.
The Yandex Robots.txt analysis will block your example if you test for Search instead of search, as Robots.txt Disallow values are case-sensitive.
If your site uses case-insensitive URLs, you might want to use:
User-agent: *
Disallow: /*&content=Search$
Disallow: /*&content=search$
# and possibly also =SEARCH, =SEarch, etc.
Having said that, I don’t know if Yandex really supports this for URL fragments (it would be unusual, I guess), although their tool gives this impression.

Google robots.txt file

I want to allow google robot:
1) to only see the main page
2) to see description in a search results for main page
I have the following code but it seems that it doesn't work
User-agent: *
Disallow: /feed
Disallow: /site/terms-of-service
Disallow: /site/rules
Disallow: /site/privacy-policy
Allow: /$
Am I missing something or I just need to wait the google robot to visit my site?
Or maybe it is some action required from google webmaster panel?
Thanks in advance!
Your robots.txt should work (and yes, it takes time), but you might want to make the following changes:
It seems you want to target only Google’s bot, so you should use User-agent: Googlebot instead of User-agent: * (which targets all bots that don’t have a specific record in your robots.txt).
It seems that you want to disallow crawling of all pages except the home page, so there is no need to specify a few specific path beginnings in Disallow.
So it could look like this:
User-agent: Googlebot
Disallow: /
Allow: /$
Google’s bot may only crawl your home page, nothing else. All other bots may crawl everything.

robots.txt to disallow all pages except one? Do they override and cascade?

I want one page of my site to be crawled and no others.
Also, if it's any different than the answer above, I would also like to know the syntax for disallowing everything but the root (index) of the website is.
# robots.txt for http://example.com/
User-agent: *
Disallow: /style-guide
Disallow: /splash
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Or can I do like this?
# robots.txt for http://example.com/
User-agent: *
Disallow: /
Allow: /under-construction
Also I should mention that this is a WordPress install, so "under-construction," for example, is set to the front page. So in that case it acts as the index.
I think what I need is to have http://example.com craweld, but no other pages.
# robots.txt for http://example.com/
User-agent: *
Disallow: /*
Would this mean disallow anything after the root?
The easiest way to allow access to just one page would be:
User-agent: *
Allow: /under-construction
Disallow: /
The original robots.txt specification says that crawlers should read robots.txt from top to bottom, and use the first matching rule. If you put the Disallow first, then many bots will see it as saying they can't crawl anything. By putting the Allow first, those that apply the rules from top to bottom will see that they can access that page.
The expression rules are simple: the expression Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.
Your Disallow: /* means the same thing to Googlebot and Bingbot, but bots that don't support wildcards could see the /* and think that you meant a literal *. So they could assume that it was okay to crawl /*foo/bar.html.
If you just want to crawl http://example.com, but nothing else, you might try:
Allow: /$
Disallow: /
The $ means "end of string," just like in regular expressions. Again, that'll work for Google and Bing, but won't work for other crawlers if they don't support wildcards.
If you log into Google Webmaster Tools, from the left panel go to crawling, then go to Fetch as Google. Here you can test how Google will crawl each page.
In the case of blocking everything but the homepage:
User-agent: *
Allow: /$
Disallow: /
will work.
you can use this below both will work
User-agent: *
Allow: /$
Disallow: /
or
User-agent: *
Allow: /index.php
Disallow: /
the Allow must be before the Disallow because the file is read from top to bottom
Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.
The $ means "end of string," like in regular expressions. so the result of Allow : /$ is your homepage /index
http://en.wikipedia.org/wiki/Robots.txt#Allow_directive
The order is only important to robots that follow the standard; in the case of the Google or Bing bots, the order is not important.