Disallow dynamic URLs using robots.txt

I have URLs like example.com/post/alai-fm-sri-lanka-listen-online-1467/
I want to block all URLs that have the word "post" in them, using robots.txt.
So which is the correct format?
Disallow: /post-*
Disallow: /?page=post
Disallow: /*page=post

(Note that the file has to be called robots.txt.)
You only included one example URL, where "post" is the first path segment. If all your URLs look like that, the following robots.txt should work:
User-agent: *
Disallow: /post/
It would block the following URLs:
http://example.com/post/
http://example.com/post/foobar
http://example.com/post/foo/bar
…
The following URLs would still be allowed:
http://example.com/post
http://example.com/foo/post/
http://example.com/foo/bar/post
http://example.com/foo?page=post
http://example.com/foo?post=1
…
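If you want to sanity-check a plain prefix rule like this before deploying it, Python's standard urllib.robotparser is enough, since it does simple prefix matching with no wildcards; the bot name below is just a placeholder:
from urllib.robotparser import RobotFileParser

# The rule suggested above; parse() accepts an iterable of lines.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /post/",
])

urls = [
    "http://example.com/post/",          # blocked
    "http://example.com/post/foobar",    # blocked
    "http://example.com/post",           # still allowed (no trailing slash)
    "http://example.com/foo/post/",      # still allowed (not the first segment)
    "http://example.com/foo?page=post",  # still allowed (query string only)
]
for url in urls:
    print(url, "->", "allowed" if rp.can_fetch("MyBot", url) else "blocked")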

Googlebot and Bingbot both handle limited wildcarding, so this will work:
Disallow: /*post
Of course, that will also disallow any URL that contains words like "compost", "outpost", or "poster": anything that contains the substring "post".
You could try to make it a little better. For example:
Disallow: /*/post   # any path segment that starts with "post"
Disallow: /*?post=  # the "post" query parameter
Disallow: /*=post   # any query value that starts with "post"
Understand, though, that not all bots support wildcards, and of those that do, some are buggy. Bing and Google handle them correctly, but there's no guarantee that other bots will.
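As a rough way to see what a wildcard rule will and won't catch, you can translate the pattern into a regex. The helper below is only a sketch of the matching rules Google documents (match from the start of the path, "*" as any run of characters, an optional trailing "$" anchor); it is not the behaviour of any particular crawler:
import re

def google_style_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt pattern into a regex (sketch only)."""
    anchored = pattern.endswith("$")
    body = ".*".join(re.escape(part) for part in pattern.rstrip("$").split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

rule = google_style_to_regex("/*post")
for path in [
    "/post/alai-fm-sri-lanka-listen-online-1467/",  # the intended target
    "/compost-guide",                               # collateral damage
    "/outpost",                                     # collateral damage
    "/gallery/poster-123",                          # collateral damage
    "/about",                                       # unaffected
]:
    print(path, "->", "blocked" if rule.search(path) else "allowed")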

Related

Disallow /*foo but allow /*bar?foo=foo (i.e. how to disallow an API endpoint if the query string might contain the same name?)

I want to disallow the /*foo endpoint regardless of its query string, but allow /*bar regardless of its query string.
A robots.txt like the one below would also disallow /*bar?foo=foo (because the query string contains foo) and any URL with foo earlier in the path, such as /foo/bar:
User-agent: *
Disallow: /*foo
How should I set robots.txt in this case? Does putting $ at the end work in this scenario?
The "standard" robots.txt doesn't accept wildcards, so I'm talking about the ones like used by Google.

Robots.txt: disallow a page but not other pages starting with it

I want to disallow a specific page:
example.com/10
but not other pages starting with /10:
example.com/101
example.com/102
example.com/103
How can I do this?
You can use the Allow keyword to achieve it:
User-agent: *
Allow: /10*
Disallow: /10$
Results from http://tools.seobook.com/robots-txt/analyzer/:
Url: /10
Multiple robot rules found
Robots disallowed: All robots
Url: /101
Robots allowed: All robots
Url: /102
Robots allowed: All robots
Url: /103
Robots allowed: All robots
However, older robots may not interpret it correctly, for example by acting only on the first matching line.
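Using the same pattern-to-regex sketch as earlier, you can see why the $ anchor singles out /10 exactly while /10* covers everything that begins with /10; how a given crawler resolves the Allow/Disallow overlap on /10 itself is up to that crawler, as the analyzer output above suggests:
import re

def to_regex(pattern: str) -> re.Pattern:
    # Sketch: '*' matches any characters, a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    body = ".*".join(re.escape(p) for p in pattern.rstrip("$").split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

allow_rule = to_regex("/10*")     # Allow: /10*
disallow_rule = to_regex("/10$")  # Disallow: /10$

for path in ["/10", "/101", "/102", "/103"]:
    print(path,
          "| Allow /10* matches:", bool(allow_rule.search(path)),
          "| Disallow /10$ matches:", bool(disallow_rule.search(path)))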

Robots.txt: disallowing a particular type of URL

I want to exclude this URL from bots:
test.com/p/12345/qwerty
But allow this URL:
test.com/p/12345
Will this line work?
User-Agent: *
Disallow: */p/*/*
Thanks
According to this tutorial there is no need for the * at the beginning. It should be:
User-Agent: *
Disallow: /p/*/*
Note that this will have the side effect of blocking bots on an address like test.com/p/abc/def
In order to do the exact functionality you are asking for and nothing more (i.e. no side effects), use this:
User-Agent: *
Disallow: /p/12345/qwerty
test.com/p/12345 will be allowed by default.
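To check which of the two URLs a wildcard rule like /p/*/* would catch, the same pattern-to-regex sketch works; as before, this only approximates Google-style matching and says nothing about bots that ignore wildcards:
import re

def to_regex(pattern: str) -> re.Pattern:
    # Sketch: '*' matches any characters, a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    body = ".*".join(re.escape(p) for p in pattern.rstrip("$").split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

rule = to_regex("/p/*/*")
for path in ["/p/12345/qwerty", "/p/12345", "/p/abc/def"]:
    print(path, "->", "blocked" if rule.search(path) else "allowed")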

Disallow URLs with empty parameters in robots.txt

Normally I have this URL structure:
http://example.com/team/name/16356**
But sometimes my CMS generates URLs without the name segment:
http://example.com/team//16356**
and then it's a 404.
How can I disallow such URLs when the name segment is empty?
It could probably be done with a regex for the empty segment, but I don't want to mess things up with Googlebot; better to get it right from the beginning.
If you want to block URLs like http://example.com/team//16356**, where the number part can be different, you could use the following robots.txt:
User-agent: *
Disallow: /team//
This will block crawling of any URL whose path starts with /team//.
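Since /team// is a plain prefix with no wildcards, you can check the behaviour with Python's standard urllib.robotparser; the bot name and the example ID are placeholders:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /team//",
])

for url in [
    "http://example.com/team//16356",      # empty name segment: blocked
    "http://example.com/team/name/16356",  # normal URL: still crawlable
]:
    print(url, "->", "allowed" if rp.can_fetch("MyBot", url) else "blocked")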

Help to create robots.txt correctly

I have dynamic URLs like these:
mydomain.com/?pg=login
mydomain.com/?pg=reguser
mydomain.com/?pg=aboutus
mydomain.com/?pg=termsofuse
When a page is requested, for example mydomainname.com/?pg=login, index.php includes the login.php file.
Some of the URLs are converted to static URLs, like:
mydomain.com/aboutus.html
mydomain.com/termsofuse.html
I need to allow indexing of mydomainname.com/aboutus.html and mydomainname.com/termsofuse.html,
and disallow mydomainname.com/?pg=login and mydomainname.com/?pg=reguser. Please help me manage my robots.txt file.
I also have mydomainname.com/posted.php?details=50 (details can be any number), which I converted to mydomainname.com/details/50.html.
I also need to allow all URLs of this type.
If you wish to only index your static pages, you can use this:
Disallow: /*?
This will disallow all URLs which contain a question mark.
If you wish to keep indexing posted.php?details=50 URLs, and you have a finite set of params you wish to disallow, you can create a disallow entry for each, like this:
Disallow: /?pg=login
Or just prevent everything starting with /?
Disallow: /?*
You can use a tool like the one linked below to test a sample of URLs and see whether the rules match them:
http://tools.seobook.com/robots-txt/analyzer/
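If you'd rather check the rules locally than paste URLs into an online analyzer, the same pattern-to-regex sketch from earlier shows which of these URLs /*? would catch; this approximates Google-style matching and is not a guarantee for every bot:
import re

def to_regex(pattern: str) -> re.Pattern:
    # Sketch: '*' matches any characters, a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    body = ".*".join(re.escape(p) for p in pattern.rstrip("$").split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

rule = to_regex("/*?")  # from the answer above: block anything with a query string
for path in [
    "/?pg=login",             # blocked
    "/?pg=reguser",           # blocked
    "/aboutus.html",          # allowed
    "/termsofuse.html",       # allowed
    "/details/50.html",       # allowed
    "/posted.php?details=50", # blocked, hence the static /details/... rewrite
]:
    print(path, "->", "blocked" if rule.search(path) else "allowed")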