How can I exclude crawlers from indexing certain pages of my website using robots.txt? [duplicate]

This question already has an answer here:
Robots.txt: Is this wildcard rule valid?
(1 answer)
Closed 5 years ago.
I tried this on my root robots.txt:
User-agent: *
Allow: /
Disallow: /*&action=surprise
Sitemap: https://example.com/sitemap.php
I would like to exclude URLs like this from crawling:
https://example.com/track&id=13&action=surprise&autoplay
From the access.log file I still see some bots hitting those URLs.
Am I doing anything wrong, or is it just that some bots are not following my robots.txt settings?

I have to say, not all bots will obey the rules in your robots.txt.
You need to add some anti-crawler measures on the server side to block that access, such as:
checking the User-Agent header
counting requests per IP to spot bots (a rough sketch of this is shown below)
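For illustration, here is a sketch of that second idea in Python: scan a combined-format access log, count requests per IP, and list the user-agents that identify themselves as bots. The log path, threshold, and keywords are assumptions, not anything from the question.

import re
from collections import Counter

LOG_PATH = "access.log"                       # assumed path to a combined-format access log
REQUEST_THRESHOLD = 1000                      # assumed cutoff for "too many requests from one IP"
BOT_KEYWORDS = ("bot", "crawler", "spider")   # assumed substrings that mark self-identified bots

# Combined log format: IP ... "METHOD path HTTP/x.x" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) .*?"[A-Z]+ (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

hits_per_ip = Counter()
bot_agents = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, _path, user_agent = match.groups()
        hits_per_ip[ip] += 1
        if any(keyword in user_agent.lower() for keyword in BOT_KEYWORDS):
            bot_agents[user_agent] += 1

print("IPs over the request threshold:")
for ip, count in hits_per_ip.most_common():
    if count < REQUEST_THRESHOLD:
        break
    print(f"  {ip}: {count} requests")

print("Self-identified bots:")
for agent, count in bot_agents.most_common(10):
    print(f"  {count:6d}  {agent}")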

Related

Why does Google not index my "robots.txt"?

I am trying to allow the Googlebot webcrawler to index my site. My robots.txt initially looked like this:
User-agent: *
Disallow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
And I changed it to:
User-agent: *
Allow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
However, Google is still not indexing my links.
I am trying to allow the Googlebot webcrawler to index my site.
Robots rules have nothing to do with indexing! They are ONLY about crawling. A page can be indexed even if it is forbidden to be crawled!
The Host directive is supported only by Yandex.
If you want all bots to be able to crawl your site, your robots.txt file should be placed at https://www.sitename.com/robots.txt, be available with status code 200, and contain:
User-agent: *
Disallow:
Sitemap: https://www.sitename.com/sitemap.xml
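A quick way to confirm the file is actually reachable at that location with a 200 status is something like this (a sketch using only Python's standard library; the URL is the example domain from the question):

import urllib.request

# urlopen raises HTTPError for 4xx/5xx responses, so reaching the prints means the file is served.
url = "https://www.sitename.com/robots.txt"
with urllib.request.urlopen(url) as response:
    print(response.status)                                    # expect 200
    print(response.read().decode("utf-8", errors="replace"))  # the rules crawlers actually see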
From the docs:
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:
User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
Try to specifically mention Googlebot in your robots.txt-directives such as:
User-agent: Googlebot
Allow: /
or allow all web crawlers access to all content
User-agent: *
Disallow:
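One way to sanity-check how a parser reads these directives is Python's built-in urllib.robotparser (note it does not implement the wildcard extensions, only plain Allow/Disallow prefixes; the URL below is a placeholder):

from urllib.robotparser import RobotFileParser

# Parse the rules directly instead of fetching them, so this runs offline.
rules = """
User-agent: Googlebot
Allow: /

User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Both agents should be allowed to fetch any path under these rules.
print(parser.can_fetch("Googlebot", "https://www.sitename.com/some/page"))     # True
print(parser.can_fetch("SomeOtherBot", "https://www.sitename.com/some/page"))  # True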

robots.txt disallow all with crawl-delay

I would like to get information from a certain site, and checked to see if I were allowed to crawl it. The robots.txt file had considerations for 15 different user agents and then for everyone else. My confusion comes from the everyone else statement (which would include me). It was
User-agent: *
Crawl-delay: 5
Disallow: /
Disallow: /sbe_2020/pdfs/
Disallow: /sbe/sbe_2020/2020_pdfs
Disallow: /newawardsearch/
Disallow: /ExportResultServlet*
If I read this correctly, the site is asking that no unauthorized user-agents crawl it. However, the fact that they included a Crawl-delay seems odd. If I'm not allowed to crawl it, why would there even be a crawl delay consideration? And why would they need to include any specific directories at all? Or, perhaps I've read the "Disallow: /" incorrectly?
Yes, this record would mean the same if it were reduced to this:
User-agent: *
Disallow: /
A bot matched by this record is not allowed to crawl anything on this host (having an unneeded Crawl-delay doesn’t change this).
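You can see this concretely with Python's urllib.robotparser, which reports both the Disallow result and the Crawl-delay for the reduced record (the host below is a placeholder):

from urllib.robotparser import RobotFileParser

# The reduced record from above, parsed offline.
rules = """
User-agent: *
Crawl-delay: 5
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# crawl_delay() is available in Python 3.6+.
print(parser.can_fetch("MyResearchBot", "https://example.com/anything"))  # False: nothing may be crawled
print(parser.crawl_delay("MyResearchBot"))                                # 5, but moot since crawling is disallowed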

robots.txt deny access to specific URL parameters

I have been trying to get an answer to this question on various Google forums, but no one answers, so I'll try here at SO.
I had an old site that used different URL parameters like
domain.com/index.php?showimage=166
domain.com/index.php?x=googlemap&showimage=139
How can I block access to the pages with these parameters, without blocking my domain.com/index.php page itself?
Can this be done in robots.txt?
EDIT: I found a post here: Ignore urls in robot.txt with specific parameters? Note that robots.txt path matching is case-sensitive, so the parameter name has to match the URL exactly:
Allow: /
Disallow: /index.php?showimage=*
Disallow: /index.php?x=*

robots.txt allow root only, disallow everything else?

I can't seem to get this to work but it seems really basic.
I want the domain root to be crawled
http://www.example.com
But nothing else to be crawled and all subdirectories are dynamic
http://www.example.com/*
I tried
User-agent: *
Allow: /
Disallow: /*/
but the Google webmaster test tool says all subdirectories are allowed.
Anyone have a solution for this? Thanks :)
According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow and Disallow directives doesn't matter. So changing the order really won't help you.
Instead, use the $ operator to indicate the end of your path. $ means 'the end of the URL' (i.e. nothing is allowed to follow that point).
Test this robots.txt. I'm certain it should work for you (I've also verified in Google Search Console):
user-agent: *
Allow: /$
Disallow: /
This will allow http://www.example.com and http://www.example.com/ to be crawled but everything else blocked.
Note that the Allow directive satisfies your particular use case, but if you have index.html or default.php, those URLs will not be crawled.
Side note: I'm only really familiar with Googlebot and bingbot behaviors. If there are any other engines you are targeting, they may or may not have specific rules on how the directives are listed out. So if you want to be "extra" sure, you can always swap the positions of the Allow and Disallow directive blocks; I just set them that way to debunk some of the comments.
When you look at Google's robots.txt specifications, you can see that:
Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:
* designates 0 or more instances of any valid character
$ designates the end of the URL
see https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#example-path-matches
Then as eywu said, the solution is
user-agent: *
Allow: /$
Disallow: /
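For intuition about how the * and $ wildcards behave, here is a rough approximation in Python that translates a robots.txt path pattern into a regular expression (an illustrative sketch, not Google's actual implementation, and it ignores the separate longest-match precedence rule):

import re

def rule_matches(pattern: str, path: str) -> bool:
    # Approximate Google-style robots.txt path matching:
    # * matches any run of characters, $ anchors the end of the URL,
    # everything else is a literal prefix match.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex + ("$" if anchored else ""), path) is not None

# The Allow: /$ plus Disallow: / pair from the answer above:
print(rule_matches("/$", "/"))           # True  -> the root itself is allowed
print(rule_matches("/$", "/some/page"))  # False -> longer URLs fall through to Disallow: /
print(rule_matches("/", "/some/page"))   # True  -> matched by Disallow: /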

Is noindex valid in robots.txt? [duplicate]

This question already has answers here:
Noindex in a robots.txt
(2 answers)
Closed 1 year ago.
Is noindex an optional directive in a robots.txt file, or are user-agent, disallow, allow and crawl-delay the only options?
For example, is this valid for the contents of a robots.txt file?
user-agent: *
disallow: /
noindex: /
noindex is not a valid directive for a robots.txt file. It is a valid directive for a META robots tag, though.
The only standard directives for robots.txt are "User-agent" and "Disallow". Some crawlers support an extended set of directives including "Crawl-delay", "Allow" and "Sitemap". http://rield.com/cheat-sheets/robots-exclusion-standard-protocol seems to have a thorough explanation of the standard and extended directives.
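For completeness, the noindex signal lives on the page itself rather than in robots.txt, either as a robots meta tag or as an X-Robots-Tag HTTP header, and the page has to stay crawlable for search engines to see it. A minimal sketch of a server sending both (the handler name and port are made up for illustration):

from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The meta robots tag in the markup and the X-Robots-Tag header carry the same noindex hint.
        body = b'<!doctype html><html><head><meta name="robots" content="noindex"></head><body>Keep me out of the index</body></html>'
        self.send_response(200)
        self.send_header("X-Robots-Tag", "noindex")
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), NoIndexHandler).serve_forever()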