Common rule in robots.txt

How can I disallow URLs like 1.html, 2.html, ..., [0-9]+.html (in terms of regexp) with robots.txt?

The original robots.txt specification doesn't support regex/wildcards. However, you could block URLs like these:
example.com/1.html
example.com/2367123.html
example.com/3
example.com/4/foo
example.com/5/1
example.com/6/
example.com/7.txt
example.com/883
example.com/9to5
…
with:
User-agent: *
Disallow: /0
Disallow: /1
Disallow: /2
Disallow: /3
Disallow: /4
Disallow: /5
Disallow: /6
Disallow: /7
Disallow: /8
Disallow: /9
If you want to block only URLs starting with a single numeral followed by .html, just append .html, like:
User-agent: *
Disallow: /0.html
Disallow: /1.html
…
However, this wouldn't block, for example, example.com/12.html, because that path does not start with any of the listed values.
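The prefix matching behind these rules can be sketched in Python. This is only an illustration, assuming a crawler that follows the original specification, where a Disallow value blocks every path that starts with it. The helper name is_disallowed and the path /about.html are hypothetical.

```python
# Disallow values from the first record above: one rule per leading digit.
DIGIT_PREFIXES = ["/{}".format(digit) for digit in range(10)]

# The narrower variant: each rule has ".html" appended.
HTML_PREFIXES = ["/{}.html".format(digit) for digit in range(10)]


def is_disallowed(path, prefixes):
    """Return True if the path starts with any Disallow value (plain prefix match)."""
    return any(path.startswith(prefix) for prefix in prefixes)


# Paths from the examples above, plus one hypothetical path without a leading digit.
for path in ["/1.html", "/2367123.html", "/4/foo", "/9to5", "/about.html"]:
    print(path, is_disallowed(path, DIGIT_PREFIXES))

# "/12.html" does not start with "/1.html", so the narrower rules miss it.
print("/12.html", is_disallowed("/12.html", HTML_PREFIXES))
```

The second check shows why the .html variant only covers single-digit names: a prefix match cannot express "one or more digits".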

Related

Robots.txt file to allow all root php files except one and disallow all subfolders content

I seem to be struggling with a robots.txt file in the following scenario. I would like all root folder *.php files to be indexed except for one (exception.php) and would like all content from all subdirectories of the root folder not to be indexed.
I have tried the following, but it still allows access to PHP files in subdirectories, even though subdirectories in general are not indexed.
....
# robots.txt
User-agent: *
Allow: /*.php
disallow: /*
disallow: /exceptions.php
....
Can anyone help with this?
For crawlers that interpret * in Disallow values as a wildcard (it’s not part of the original robots.txt spec, but many crawlers support it anyway), this should work:
User-agent: *
Disallow: /exceptions.php
Disallow: /*/
This disallows URLs like:
https://example.com/exceptions.php
https://example.com//
https://example.com/foo/
https://example.com/foo/bar.php
And it allows URLs like:
https://example.com/
https://example.com/foo.php
https://example.com/bar.html
For crawlers that don’t interpret * in Disallow values as a wildcard, you would have to list all subfolders (on the first level):
User-agent: *
Disallow: /exceptions.php
Disallow: /foo/
Disallow: /bar/
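To check which paths the wildcard record catches, the matching can be approximated in Python. This is a rough sketch, assuming the common extension where * stands for any run of characters and the rule still only has to match the start of the path. Real crawlers may differ in details, and the function names are hypothetical.

```python
import re

# The two Disallow values from the wildcard record above.
RULES = ["/exceptions.php", "/*/"]


def rule_matches(rule, path):
    """Match one Disallow value against a path, treating * as a wildcard.

    re.match anchors at the start only, so the rule still acts as a prefix.
    """
    regex = re.escape(rule).replace(r"\*", ".*")
    return re.match(regex, path) is not None


def is_disallowed(path):
    """Return True if any rule in RULES matches the path."""
    return any(rule_matches(rule, path) for rule in RULES)


for path in ["/exceptions.php", "//", "/foo/", "/foo/bar.php",
             "/", "/foo.php", "/bar.html"]:
    print(path, is_disallowed(path))
```

The rule /*/ needs a second slash somewhere after the first one, which is why root-level files stay allowed while anything inside a folder is blocked.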

Need to stop indexing the URL parameters for custom build CMS

I would like for Google to ignore URLs like this:
https://www.example.com/blog/category/web-development?page=2
These links are getting indexed in Google, and I need to stop that. What rule should I use to keep them out of the index?
This is my current robots.txt file:
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /privacy
Disallow: /404.html
Disallow: /500.html
Disallow: /tweets
Disallow: /tweet/
Can I use this to disallow them?
Disallow: /blog/category/*?*
With robots.txt, you can prevent crawling, not necessarily indexing.
If you want to prevent Google from crawling URLs
whose paths start with /blog/category/, and
that contain a query component (e.g., ?, ?page, ?page=2, ?foo=bar&page=2 etc.),
then you can use this:
Disallow: /blog/category/*?
You don’t need another * at the end because Disallow values represent the start of the URL (beginning from the path).
But note that this is not supported by all bots. According to the original robots.txt spec, the * has no special meaning. Conforming bots would interpret the above line literally (* as part of the path). If you were to follow only the rules from the original specification, you would have to list every occurrence:
Disallow: /blog/category/c1?
Disallow: /blog/category/c2?
Disallow: /blog/category/c3?
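The difference between the two interpretations can be illustrated with a small Python sketch. It assumes the rule is compared against the path plus query string, and that wildcard-aware crawlers treat * as any run of characters. The function name rule_matches and its wildcards flag are hypothetical.

```python
import re


def rule_matches(rule, target, wildcards=True):
    """Check a Disallow value against a path-plus-query string.

    wildcards=True:  * stands for any run of characters (common extension).
    wildcards=False: * is an ordinary character (original specification).
    Either way, the rule only has to match the start of the target.
    """
    if not wildcards:
        return target.startswith(rule)
    regex = re.escape(rule).replace(r"\*", ".*")
    return re.match(regex, target) is not None


page2 = "/blog/category/web-development?page=2"
plain = "/blog/category/web-development"

print(rule_matches("/blog/category/*?", page2))    # True
print(rule_matches("/blog/category/*?*", page2))   # True: the trailing * adds nothing
print(rule_matches("/blog/category/*?", plain))    # False: no query component
print(rule_matches("/blog/category/*?", page2, wildcards=False))  # False: * taken literally
```

The last line shows why a bot that follows only the original specification would ignore the rule in practice: no real URL contains a literal * at that position.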

Disable some URLs on robots.txt

I have a site with URLs of the style:
https://www.example.com/16546/slug-title
What is the rule to add in robots.txt to disable these URLs?
I want to keep the public URLs https://www.example.com/terms.
You can use wildcards in your robots.txt, but that will not work for your URL format /<id>/<slug>.
If you use the format /article/<id>/<slug> it could work with (not tested):
Disallow: /article
If you are fine with blocking all URLs whose path starts with 0-9, you can use:
Disallow: /0
Disallow: /1
Disallow: /2
Disallow: /3
Disallow: /4
Disallow: /5
Disallow: /6
Disallow: /7
Disallow: /8
Disallow: /9
This will block URLs like
https://www.example.com/1
https://www.example.com/16
https://www.example.com/165/foo

Disallow robots for different purposes with 1 directive

Can I combine the 2 records below into one, as shown under them, and will Googlebot and Bingbot still follow my robots.txt? I have recently seen Bingbot not following the second record, and I am thinking that if I combine them it might follow it.
Original
User-agent:*
Disallow: /folder1/
Disallow: /folder2/
User-agent: *
Disallow: /*.png
Disallow: /*.jpg
Wanted to change to this
User-agent:*
Disallow: /folder1/
Disallow: /folder2/
Disallow: /*.png
Disallow: /*.jpg
You may only have one record with User-agent: *:
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
When having more than one of these records, bots (that are not matched by a more specific record) might only follow the first one in the file.
So you have to use this record:
User-agent: *
Disallow: /folder1/
Disallow: /folder2/
Disallow: /*.png
Disallow: /*.jpg
Note that the * in a Disallow value has no special meaning in the original robots.txt specification, but some consumers use it as a wildcard.
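One concrete parser that behaves this way is Python's standard urllib.robotparser, which keeps only the first User-agent: * record. The sketch below shows that behavior for this one parser only. Other crawlers may handle duplicate records differently. Because this parser does not treat * in Disallow values as a wildcard, the example uses only the two folder rules, and the URL is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, url):
    """Parse robots.txt content and ask whether any bot ('*') may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


# Two separate records for User-agent: * (not allowed by the specification).
SPLIT = """\
User-agent: *
Disallow: /folder1/

User-agent: *
Disallow: /folder2/
"""

# The same rules merged into a single record.
MERGED = """\
User-agent: *
Disallow: /folder1/
Disallow: /folder2/
"""

url = "https://www.example.com/folder2/page.html"
print(can_fetch(SPLIT, url))   # True: the second record is ignored
print(can_fetch(MERGED, url))  # False: the rule is now part of the only record
```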

robot.txt to block directory showing

Few questions
How can you effectively block directories and their contents using robots.txt?
Is it ok to do:
User-agent: *
Disallow: /group
Disallow: /home
Do you have to put a trailing slash, for example:
User-agent: *
Disallow: /group/
Disallow: /home/
Also what is the difference between Disallow in robots.txt and adding ?
If I want Google not to show specific pages and folders in a directory, what should I do?
Is it ok to do:
User-agent: * Disallow: /group Disallow: /home
You must place these on separate lines.
It is highly recommended that you put a trailing slash if you are trying to exclude the directories home and group.
I would do something like this:
User-agent: *
Disallow: /group/
Disallow: /home/
About the trailing slash, yes, you should add it according to http://www.thesitewizard.com/archive/robotstxt.shtml:
Remember to add the trailing slash ("/") if you are indicating a directory. If you simply add
User-agent: *
Disallow: /privatedata
the robots will be disallowed from accessing privatedata.html as well as privatedataandstuff.html as well as the directory tree beginning from /privatedata/ (and so on). In other words, there is an implied wildcard character following whatever you list in the Disallow line.
If you do not want Google to show specific pages or directories, add a Disallow line for each of these pages or directories.
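The effect of the trailing slash can be checked with Python's standard urllib.robotparser, which applies plain prefix matching. This is a small sketch using the /privatedata example from the quote. The helper name allowed, the host example.com, and the file report.html are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(disallow_value, path):
    """Return True if a bot may fetch the path under a single Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_value])
    return parser.can_fetch("*", "https://example.com" + path)


paths = [
    "/privatedata/",
    "/privatedata/report.html",
    "/privatedata.html",
    "/privatedataandstuff.html",
]

# Columns: path, allowed without trailing slash, allowed with trailing slash.
for path in paths:
    print(path, allowed("/privatedata", path), allowed("/privatedata/", path))
```

Without the slash, all four paths are blocked. With the slash, only the two paths inside the directory are blocked, and the similarly named files stay crawlable.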