Allowing certain URLs and denying the rest with robots.txt

I need to allow only some particular directories and deny the rest. It is my understanding that you should allow first and then disallow the rest. Is what I have set up below correct?
Allow: /word-lists/words-that-start-with/letter/z/
Allow: /word-lists/words-that-end-with/letter/z/
Disallow: /word-lists/words-that-start-with/letter/
Disallow: /word-lists/words-that-end-with/letter/

Your snippet looks OK; just don't forget to add a User-agent line at the top.
The order of the Allow/Disallow lines doesn't currently matter, but it's up to the client to make the correct choice. See the "Order of precedence for group-member records" section in our robots.txt documentation.
[...] for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule.
The original RFC does state that clients should evaluate rules in the order they're found, but I don't recall any crawler that actually does that; instead they play it safe and follow the most restrictive rule.
To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.
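If you want to sanity-check the snippet locally before relying on it, Python's built-in urllib.robotparser is handy. As far as I can tell it evaluates rules in the order they appear (first match wins), like the original spec, so treat this as a rough check rather than a simulation of Googlebot; the zebra/apple paths are just made-up examples:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /word-lists/words-that-start-with/letter/z/
Allow: /word-lists/words-that-end-with/letter/z/
Disallow: /word-lists/words-that-start-with/letter/
Disallow: /word-lists/words-that-end-with/letter/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The /z/ URL hits the more specific Allow rule first, so it stays crawlable.
print(parser.can_fetch("*", "/word-lists/words-that-start-with/letter/z/zebra"))  # True
# Any other letter falls through to the Disallow rule.
print(parser.can_fetch("*", "/word-lists/words-that-start-with/letter/a/apple"))  # False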

Related

Disallow /*foo but allow /*bar?foo=foo (i.e. how to disallow an API if query string might contain the same name?)

I want to disallow the /*foo endpoint regardless of its query string, but allow /*bar regardless of its query string.
A robots.txt like the one below would also disallow /*bar?foo=foo (because of its query string) and any deeper path containing foo, such as /foo/bar.
User-agent: *
Disallow: /*foo
How should I set robots.txt in this case? Does putting $ at the end work in this scenario?
The "standard" robots.txt doesn't accept wildcards, so I'm talking about the ones like used by Google.

Disallow dynamic URL in robots.txt

Our URL is:
http://example.com/kitchen-knife/collection/maitre-universal-cutting-boards-rana-parsley-chopper-cheese-slicer-vegetables-knife-sharpening-stone-ham-stand-ham-stand-riviera-niza-knives-block-benin.html
I want to disallow crawling of everything after collection, but the categories that appear before collection are generated dynamically.
How would I disallow URLs in robots.txt after /collection?
This is not possible in the original robots.txt specification.
But some (!) parsers extend the specification and define a wildcard character (typically *).
For those parsers, you could use:
Disallow: /*/collection
Parsers that understand * as a wildcard will stop crawling any URL whose path starts with /, followed by anything (which may be nothing), followed by /collection, e.g.,
http://example.com/foo/collection/
http://example.com/foo/collection/bar
(Note that http://example.com/collection/ itself would not match this pattern, because there is no path segment between the first / and /collection; add Disallow: /collection as well if you also need to block that.)
Parsers that don't understand * as a wildcard (i.e., they follow the original specification) will stop crawling any URL whose path literally starts with /*/collection, e.g.,
http://example.com/*/collection/
http://example.com/*/collection/bar
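If you want a rough way to check locally which URLs a wildcard-aware parser would block, you can translate the pattern into a regular expression the same way Google-style parsers are documented to treat * (any run of characters) and $ (end of URL). This is only a sketch of that matching logic, not any crawler's actual implementation:

import re

def wildcard_rule_to_regex(rule):
    """Turn a Google-style robots.txt path pattern into an anchored regex:
    '*' matches any run of characters, a trailing '$' pins the end of the URL."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.compile(pattern + ("$" if anchored else ""))

blocked = wildcard_rule_to_regex("/*/collection")

for path in ["/foo/collection/", "/foo/collection/bar", "/collection/"]:
    print(path, bool(blocked.match(path)))
# /foo/collection/     True
# /foo/collection/bar  True
# /collection/         False (no segment between the first / and /collection)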

ModSecurity whitelist, multiple conditions

I've set up mod_security on my server with the predefined OWASP ModSecurity rules (the Core Rule Set).
However, I'm getting a lot of false positives, so I've started to set up whitelist rules.
I have a false positive on this url:
http://example.com/fr/share/?u=http%3A%2F%2Fwww.example.com%2Fen%2Ffiles%2Fimgs%2F%3Fpage%3D100%2
with "Multiple URL Encoding Detected","OWASP_CRS/PROTOCOL_VIOLATION/EVASION"
due to the rule:
SecRule ARGS "\%((?!$|\W)|[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})" "phase:2,rev:'2',ver:'OWASP_CRS/2.2.9',maturity:'6',accuracy:'8',t:none,block,msg:'Multiple URL Encoding Detected',id:'1',tag:'OWASP_CRS/PROTOCOL_VIOLATION/EVASION',severity:'4',setvar:'tx.msg=%{rule.msg}',setvar:tx.anomaly_score=+%{tx.warning_anomaly_score},setvar:tx.%{rule.id}-OWASP_CRS/PROTOCOL_VIOLATION/EVASION-%{matched_var_name}=%{matched_var}"
So the main idea is to create a rule that still performs the check, except for the "u" parameter on URLs starting with /fr/share/?.
I have a hint with:
SecRule ARGS|!ARGS:u ...
but how can I combine that with a condition that REQUEST_URI does not match "/fr/share?.*"?
So there are several options here.
You could rewrite the rule, and use chaining, to test for multiple conditions (note I've stripped off some of the rule actions for formatting reasons):
SecRule ARGS "\%((?!$|\W)|[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})" \
"phase:2,rev:'2',ver:'OWASP_CRS/2.2.9',maturity:'6',accuracy:'8', \
t:none,block,msg:'Multiple URL Encoding Detected',id:'1',chain"
SecRule REQUEST_URI "!#beginsWith /fr/share/" "t:none"
The "chain" action means the rule on the next line must also pass before the actions are taken, so in this case it's checking the REQUEST_URI does not begin with /fr/share.
However this means you've got your own copy of this rule and makes upgrading to future versions of the Core Rule Set more difficult. It's much preferred to leave the original rule in place (which I've looked up and is actually rule id 950109 rather than rule id 1 that you've given so I presume that rule 1 is your copy).
So, to leave the original rule in place, but not have it false alerting you've a few options, detailed below in increasing complexity:
You could disable the whole rule:
SecRuleRemoveById 950109
This should be specified AFTER the rule is defined.
Obviously that's a bit extreme if it's only giving a false positive for one particular URL and parameter combination, and it means you lose the protection that rule gives you for every other URL and parameter.
You could disable that rule for just that 'u' parameter:
SecRuleUpdateTargetById 950109 !ARGS:'u'
I think this can be specified before or after that rule is defined but not 100% sure on that.
But this will disable the check for ALL 'u' parameters, and you only want to disable it for this particular call; so this is slightly better, but still not what you are looking for.
Therefore the best way is to use the ctl action, on a rule which matches the URL, to alter the original rule for that parameter:
SecRule REQUEST_URI "#beginsWith /fr/share/" \
"t:none,id:1,nolog,pass,ctl:ruleRemoveTargetById=950109;ARGS:u"
An almost identical request to what you are asking for, for rule 981260, is documented here:
https://github.com/SpiderLabs/ModSecurity/wiki/Reference-Manual#ctl

Disallow URLs with empty parameters in robots.txt

Normally I have this URL structure:
http://example.com/team/name/16356**
But sometimes my CMS generates URLs without the name:
http://example.com/team//16356**
and then it returns a 404.
How can I disallow such URLs when the name segment is empty?
It could probably be done with some regex for the empty segment, but I don't want to mess things up for Googlebot; better to get it right from the beginning.
If you want to block URLs like http://example.com/team//16356**, where the number part can be different, you could use the following robots.txt:
User-agent: *
Disallow: /team//
This will block crawling of any URL whose path starts with /team//.
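If you'd like to double-check that this blocks only the broken URLs, you can run the rule through Python's built-in urllib.robotparser (a quick local sanity check with a made-up ID, not Googlebot itself):

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /team//",
])

print(parser.can_fetch("*", "/team/name/16356"))  # True: normal URLs stay crawlable
print(parser.can_fetch("*", "/team//16356"))      # False: the empty name segment is blocked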

Help with correctly creating robots.txt

I have dynamic URLs like these:
mydomain.com/?pg=login
mydomain.com/?pg=reguser
mydomain.com/?pg=aboutus
mydomain.com/?pg=termsofuse
When a page is requested, for example mydomainname.com/?pg=login, index.php includes the login.php file.
Some of the URLs are converted to static URLs, like:
mydomain.com/aboutus.html
mydomain.com/termsofuse.html
I need to allow indexing of mydomainname.com/aboutus.html and mydomainname.com/termsofuse.html,
and disallow mydomainname.com/?pg=login and mydomainname.com/?pg=reguser. Please help me manage my robots.txt file.
I also have mydomainname.com/posted.php?details=50 (details can be any number), which I converted to mydomainname.com/details/50.html.
I need to allow all URLs of this type as well.
If you wish to only index your static pages, you can use this:
Disallow: /*?
This will disallow all URLs which contain a question mark.
If you wish to keep indexing posted.php?details=50 URLs, and you have a finite set of params you wish to disallow, you can create a disallow entry for each, like this:
Disallow: /?pg=login
Or just prevent everything starting with /?
Disallow: /?*
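Putting that together, a minimal robots.txt for this setup could look like the following, assuming you are happy for every URL containing a query string to be blocked (the static .html pages and the /details/50.html URLs contain no ?, so they stay crawlable by default):

User-agent: *
Disallow: /*?

Remember the User-agent line; without it the Disallow rules don't belong to any group.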
You can use a tool like this one to test a sample of URLs and see whether they would be matched or not:
http://tools.seobook.com/robots-txt/analyzer/