Do related subfolders need to be disallowed separately in robots.txt?

Will disallowing a certain folder in robots.txt also disallow its subfolders?
Example:
Disallow: /folder/
Will match:
/folder/page
/folder/subfolder/page
Or will it match only:
/folder/page
So if the second case is true, do I need to disallow the second and subsequent subfolders separately?
Disallow: /folder/
Disallow: /folder/subfolder/
Disallow: /folder/subfolder/onemorefolder

Robots.txt has no concept of "folders"; it just works with strings. Whatever you specify in Disallow is matched against the beginning of the URL path.
Disallow: / blocks any URL whose path starts with / (= all pages).
Disallow: /foo blocks any URL whose path starts with /foo:
/foo
/foobar
/foo.html
/foo/bar
/foo/bar/doe
Disallow: /foo/ blocks any URL whose path starts with /foo/:
/foo/
/foo/bar.html
/foo/bar
/foo/bar/doe
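If you want to double-check this prefix behaviour, Python's standard-library robots.txt parser makes for a quick illustrative test (the rules and URLs below are made up for the example):

from urllib.robotparser import RobotFileParser

def blocked(rules: str, url: str) -> bool:
    """Return True if the given robots.txt rules block the URL for all bots."""
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return not rp.can_fetch("*", url)

# "Disallow: /foo" is a plain prefix, so it also catches /foobar and /foo.html:
print(blocked("User-agent: *\nDisallow: /foo", "https://example.com/foobar"))    # True
print(blocked("User-agent: *\nDisallow: /foo", "https://example.com/foo.html"))  # True

# "Disallow: /folder/" catches everything below it, subfolders included...
print(blocked("User-agent: *\nDisallow: /folder/", "https://example.com/folder/subfolder/page"))  # True
# ...but not a sibling path that merely shares the first characters:
print(blocked("User-agent: *\nDisallow: /folder/", "https://example.com/folderx"))  # False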

Related

Robots.txt file to allow all root php files except one and disallow all subfolders content

I seem to be struggling with a robots.txt file in the following scenario. I would like all *.php files in the root folder to be indexed except for one (exceptions.php), and I would like all content in all subdirectories of the root folder not to be indexed.
I have tried the following, but it still allows access to PHP files in subdirectories, even though the subdirectories themselves are not supposed to be indexed:
# robots.txt
User-agent: *
Allow: /*.php
disallow: /*
disallow: /exceptions.php
Can anyone help with this?
For crawlers that interpret * in Disallow values as a wildcard (it’s not part of the original robots.txt spec, but many crawlers support it anyway), this should work:
User-agent: *
Disallow: /exceptions.php
Disallow: /*/
This disallows URLs like:
https://example.com/exceptions.php
https://example.com//
https://example.com/foo/
https://example.com/foo/bar.php
And it allows URLs like:
https://example.com/
https://example.com/foo.php
https://example.com/bar.html
For crawlers that don’t interpret * in Disallow values as a wildcard, you would have to list all subfolders (on the first level):
User-agent: *
Disallow: /exceptions.php
Disallow: /foo/
Disallow: /bar/
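If you want to see what a wildcard-aware crawler would do with Disallow: /*/, here is a rough sketch of the matching (illustrative only: it just turns * into "any run of characters" and ignores the longest-match and Allow-precedence rules that real crawlers such as Googlebot also apply):

import re

def wildcard_rule_matches(pattern: str, path: str) -> bool:
    # '*' matches any sequence of characters; everything else is literal.
    # Rules are anchored at the start of the path (prefix match).
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex, path) is not None

# "Disallow: /*/" hits anything that sits inside a subfolder...
print(wildcard_rule_matches("/*/", "/foo/bar.php"))  # True  -> blocked
print(wildcard_rule_matches("/*/", "/foo/"))         # True  -> blocked
# ...but not PHP files directly in the root:
print(wildcard_rule_matches("/*/", "/foo.php"))      # False -> crawlable
print(wildcard_rule_matches("/exceptions.php", "/exceptions.php"))  # True -> blocked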

Need to stop indexing of URL parameters for a custom-built CMS

I would like for Google to ignore URLs like this:
https://www.example.com/blog/category/web-development?page=2
These links are getting indexed by Google, and I need to stop that. What should I add to keep them from being indexed?
This is my current robots.txt file:
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /privacy
Disallow: /404.html
Disallow: /500.html
Disallow: /tweets
Disallow: /tweet/
Can I use this to disallow them?
Disallow: /blog/category/*?*
With robots.txt, you can prevent crawling, not necessarily indexing.
If you want to disallow Google from crawling URLs
whose paths start with /blog/category/, and
that contain a query component (e.g., ?, ?page, ?page=2, ?foo=bar&page=2, etc.),
then you can use this:
Disallow: /blog/category/*?
You don’t need another * at the end because Disallow values represent the start of the URL (beginning from the path).
But note that this is not supported by all bots. According to the original robots.txt spec, the * has no special meaning. Conforming bots would interpret the above line literally (* as part of the path). If you were to follow only the rules from the original specification, you would have to list every occurrence:
Disallow: /blog/category/c1?
Disallow: /blog/category/c2?
Disallow: /blog/category/c3?
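If it helps to see exactly which part of the URL the Disallow value is compared against, here is a small sketch using the URL from the question (illustrative only):

from urllib.parse import urlsplit

url = "https://www.example.com/blog/category/web-development?page=2"
parts = urlsplit(url)

# Crawlers match Disallow values against the path plus the query string,
# never against the scheme or hostname:
target = parts.path + ("?" + parts.query if parts.query else "")
print(target)  # /blog/category/web-development?page=2

# So "Disallow: /blog/category/*?" means: starts with /blog/category/,
# then anything, then a literal "?".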

Allow only one file of a directory in robots.txt?

I want to allow only one file in the directory /minsc, but I would like to disallow the rest of that directory.
Now in the robots.txt is this:
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
The file that I want to allow is /minsc/menu-leaf.png
I'm afraid of doing damage, so I don't know whether I should use:
A)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
Allow: /minsc/menu-leaf.png
or
B)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/*   (added "*")
Allow: /minsc/menu-leaf.png
?
Thanks and sorry for my English.
According to the robots.txt website:
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
According to Wikipedia, if you are going to use the Allow directive, it should go before the Disallow for maximum compatibility:
Allow: /directory1/myfile.html
Disallow: /directory1/
Furthermore, you should put Crawl-delay last, according to Yandex:
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent record, right after the Disallow and Allow directives.
So, in the end, your robots.txt file should look like this:
User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
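If you want to sanity-check that ordering before deploying it, Python's standard-library parser applies rules in file order (first match wins), so it mirrors the "Allow first" advice; this is only an illustrative check, your site does not need it:

from urllib.robotparser import RobotFileParser

robots = """\
User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots.splitlines())

# The Allow line comes first, so the one image stays crawlable,
# while everything else under /minsc/ is blocked:
print(rp.can_fetch("*", "https://example.com/minsc/menu-leaf.png"))        # True
print(rp.can_fetch("*", "https://example.com/minsc/some-other-file.png"))  # False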
Robots.txt is sort of an 'informal' standard that can be interpreted differently. The only interesting 'standard' is really how the major players are interpreting it.
I found this source saying that globbing ('*'-style wildcards) is not supported:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
http://www.robotstxt.org/robotstxt.html
So according to this source you should stick with your alternative (A).

robots.txt to block directories from showing

A few questions:
How can you effectively block directories and their contents using robots.txt?
Is it ok to do:
User-agent: *
Disallow: /group
Disallow: /home
Do you have to put a trailing slash, for example:
User-agent: *
Disallow: /group/
Disallow: /home/
Also what is the difference between Disallow in robots.txt and adding ?
If I want google not to show specific pages and folders in a directory, what should I do?
Is it ok to do:
User-agent: * Disallow: /group Disallow: /home
You must place these on separate lines.
It is highly recommended that you put a trailing slash if you are trying to exclude the directories home and group.
I would do something like this:
User-agent: *
Disallow: /group/
Disallow: /home/
About the trailing slash, yes, you should add it according to http://www.thesitewizard.com/archive/robotstxt.shtml:
Remember to add the trailing slash ("/") if you are indicating a directory. If you simply add
User-agent: *
Disallow: /privatedata
the robots will be disallowed from accessing privatedata.html as well as privatedataandstuff.html as well as the directory tree beginning from /privatedata/ (and so on). In other words, there is an implied wildcard character following whatever you list in the Disallow line.
If you do not want google to show specific pages or directories, add a Disallow line for each of these pages or directories.

How to disallow search pages from robots.txt

I need to disallow http://example.com/startup?page=2 search pages from being indexed.
I want http://example.com/startup to be indexed but not http://example.com/startup?page=2 and page3 and so on.
Also, startup can be random, e.g., http://example.com/XXXXX?page
Something like this works, as confirmed by Google Webmaster Tools "test robots.txt" function:
User-Agent: *
Disallow: /startup?page=
Disallow: The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.
However, if the first part of the URL changes, you must use wildcards:
User-Agent: *
Disallow: /startup?page=
Disallow: *page=
Disallow: *?page=
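For crawlers that honour the * wildcard, Disallow: *page= is roughly the regex below; a small illustrative check (with made-up paths) shows it covers the case where the first URL segment keeps changing:

import re

# Rough regex equivalent of "Disallow: *page=" for wildcard-aware bots:
# '*' stands for any run of characters, anchored at the start of the path.
rule = re.compile(r".*page=")

for path in ("/startup?page=2", "/XXXXX?page=5", "/startup", "/blog/archive"):
    verdict = "blocked" if rule.match(path) else "crawlable"
    print(f"{path} -> {verdict}")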
You can put this on the pages you do not want indexed:
<META NAME="ROBOTS" CONTENT="NONE">
This tells robots not to index the page.
On a search page, it may be more interesting to use:
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
This instructs robots to not index the current page, but still follow the links on this page, allowing them to get to the pages found in the search.
Create a text file and name it: robots.txt
Add user agents and disallow sections (see sample below)
Place the file in the root of your site
Sample:
###############################
#My robots.txt file
#
User-agent: *
#
#list directories robots are not allowed to index
#
Disallow: /testing/
Disallow: /staging/
Disallow: /admin/
Disallow: /assets/
Disallow: /images/
#
#
#list specific files robots are not allowed to index
#
Disallow: /startup?page=2
Disallow: /startup?page=3
#
#
#End of robots.txt file
#
###############################
Here's a link to Google's actual robots.txt file: https://www.google.com/robots.txt
You can also find good information in Google's Webmaster Help topic on blocking or removing pages using a robots.txt file.