Excluding URLs with special characters - robots.txt

Looking for a recommendation regarding robots.txt; we don't want to go wrong and exclude the entire site.
Is the directive below appropriate for excluding all URLs with a backslash encoded in the URL?
Disallow: /*\
Will it only exclude URLs that have a backslash or %22 in the URL path? Some pages containing a backslash have been indexed and are showing up as duplicates in Google Webmaster Tools.
Does the above directive hinder or block the site from search engines in any other way, beyond URLs that contain a backslash?

To update: we resolved this by applying a 301 redirect through .htaccess rather than blocking through robots.txt.
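For reference, a minimal .htaccess sketch of that kind of fix, assuming Apache with mod_rewrite and that the problem URLs contain a literal (or %5C-encoded) backslash in the path; the exact pattern is only an illustration and would need testing against the real URLs:
RewriteEngine On
# Redirect any request whose path contains a backslash to the same path with that backslash removed (one backslash per redirect hop)
RewriteRule ^(.*)\\(.*)$ /$1$2 [R=301,L]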

Related

How to set robots.txt files for subdomains?

I have a subdomain, e.g. blog.example.com, and I don't want it to be indexed by Google or any other search engine. I put my robots.txt file in the 'blog' folder on the server with the following configuration:
User-agent: *
Disallow: /
Will that be enough to keep it from being indexed by Google?
A few days ago a site:blog.example.com search showed 931 results, but now it is showing 1,320 pages. I am wondering, if my robots.txt file is correct, why Google is still indexing my domain.
If I am doing anything wrong, please correct me.
Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.
This question is actually about how to prevent indexing of a subdomain, and here your robots.txt file is actually preventing your pages from being noindexed: a crawler that is blocked from a page can never see a noindex on it.
Don’t use a robots.txt file as a means to hide your web pages from Google search results.
Introduction to robots.txt: What is a robots.txt file used for? Google Search Central Documentation
For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Block Search indexing with noindex Google Search Central Documentation
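So to get the subdomain deindexed, the usual approach is the opposite of the robots.txt block: allow crawling and serve a noindex instead. A minimal sketch of the two standard forms (the header line assumes Apache with mod_headers enabled):
<meta name="robots" content="noindex">
or, sent as an HTTP response header for every page on the subdomain:
Header set X-Robots-Tag "noindex"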

How do I disallow search robots from www.example.com and exsample.com?

I would like to know if it is possible to block all robots from my site. I am having some trouble because I redirect exsample.com to www.exsample.com. The robots.txt checker tools say I don't have a robots.txt file on exsample.com, but that I do have one on www.exsample.com.
Hope someone can help me out :)
Just make a text file named robots.txt and write the following in it:
User-agent: *
Disallow: /
and put it in your www or public_html folder.
This asks all search engines to keep away from all content of the website. Not all search engines will obey the protocol, but the most important ones will read it and do as you asked.
Robots.txt works per host.
So if you want to block URLs on http://www.example.com, the robots.txt must be accessible at http://www.example.com/robots.txt.
Note that the subdomain matters, so you can’t block URLs on http://example.com with a robots.txt only available on http://www.example.com/robots.txt.
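Given the redirect described in the question, the checker is probably failing because a request for http://exsample.com/robots.txt gets redirected to the www host. One hedged sketch of a workaround, assuming the host redirect is done in Apache with mod_rewrite, is to exempt robots.txt from the redirect so each host answers for its own file:
RewriteEngine On
# Send the bare domain to www, but leave /robots.txt directly reachable on both hosts
RewriteCond %{HTTP_HOST} ^exsample\.com$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^(.*)$ http://www.exsample.com/$1 [R=301,L]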

Will search engines honor robots.txt for a separate site that is a virtual directory under another site?

I have a website (Ex: www.examplesite.com), and I am creating another site as a separate, stand-alone site in IIS. This second site's URL will make it look like it's part of my main site: www.examplesite.com/anothersite. This is accomplished by creating a virtual directory under my main site that points to the second site.
I am allowing my main site (www.examplesite.com) to be indexed in search engines, but I do not want my second, virtual directory site to be seen by search engines. Can I allow my second site to have its own robots.txt file, and disallow all pages for that site there? Or do I need to modify my main site's robots.txt file and tell it to disallow the virtual directory?
You can't have a separate robots.txt for directories. Only a "host" can have its own robots.txt: example.com, www.example.com, sub.example.com, sub.sub.example.com, …
So if you want to set rules for www.example.com/anothersite, you have to use the robots.txt at www.example.com/robots.txt.
If you want to block all pages of the sub-site, simply add:
User-agent: *
Disallow: /anothersite
This will block all URL paths that start with "anothersite". E.g., these URLs would all be blocked:
www.example.com/anothersite
www.example.com/anothersite.html
www.example.com/anothersitefoobar
www.example.com/anothersite/foobar
www.example.com/anothersite/foo/bar/
…
Note: If your robots.txt already contains User-agent: *, you'd have to add the Disallow line to this block instead of adding a new block (bots stop reading the robots.txt as soon as they find a block that matches them).
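For example, assuming the existing file already had a (purely illustrative) /private/ rule, the merged block would look like this:
User-agent: *
Disallow: /private/
Disallow: /anothersite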

Google couldn't follow your URL because it redirected too many times

I was fixing URLs on a website. One of the problems was that the URLs contained characters that were sometimes upper-case and other times lower-case; the server did not care, but Google did, and indexed the pages as duplicates.
Some URLs also contained characters that are simply not allowed in that part of the URL, like commas "," and brackets "()". Although round brackets are technically not reserved,
I still decided to get rid of them by encoding them.
I added a check that tests whether the URL is valid and, if not, does a 301 redirect to the correct URL.
for example
http://www.example.com/articles/SomeGreatArticle(2012).html
would do a 301 redirect to
http://www.example.com/articles/somegreatarticle%282012%29.html
It works, and it does a single redirect to the correct URL.
But for a small fraction of the pages (which are possibly the only pages Google has indexed so far), Google Webmaster Tools started giving me the following error under the Crawl errors > Not followed tab:
Google couldn't follow your URL because it redirected too many times.
Googling for this error in quotes gives me 0 results, and I'm sure I'm not the only one to ever get it, so I would like to know more about it. For example:
How many redirects can a single page do before Google decides it's too many?
What are the other possible causes for such an error?
SOLUTION
According to this experiment: http://www.monperrus.net/martin/google+url+encoding
Google has its own character encoding rules: it will always encode some characters and always decode others.
The following characters are never encoded
-,.#~_*)!$'(
So even if you give Google this URL
http://www.example.com/articles/somegreatarticle%282012%29.html
where the round brackets () are encoded, Google will transform this URL, decode the brackets, and follow this URL instead:
http://www.example.com/articles/somegreatarticle(2012).html
What happened in my situation:
http://www.example.com/articles/somegreatarticle(2012).html
my server would do a 301 redirect to
http://www.example.com/articles/somegreatarticle%282012%29.html
while Googlebot would ignore the encoded brackets and follow:
http://www.example.com/articles/somegreatarticle(2012).html
get redirected to
http://www.example.com/articles/somegreatarticle%282012%29.html
follow
http://www.example.com/articles/somegreatarticle(2012).html
get redirected to
http://www.example.com/articles/somegreatarticle%282012%29.html
and Googlebot gives up after a couple of tries and shows the "Google couldn't follow your URL because it redirected too many times" error.
I don't know about Google Webmaster Tools, but I have seen a similar error in PHP when there is an infinite loop of redirection. Make sure that none of the pages redirects to itself.
OK, first of all I would remove the () and , characters from the URLs; Googlebot has a harder time working with them, and they don't bring any SEO benefit either.
Readability for the visitor isn't an issue here, so if I were you I would just use a - or _ instead.
Try not to use any other characters in your file/folder names.
You should also clean up your HTML; there are quite a few errors and issues to resolve.
A cleaner source is better for Google, browsers, and your visitors.
Beyond that, I couldn't find any definitive problem that Google would have an issue with.

Block Google robots for URLs containing a certain word

My client has a load of pages which they don't want indexed by Google - they are all called
http://example.com/page-xxx
so they are /page-123, /page-2, /page-25, etc.
Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt?
Would something like this work?
Disallow: /page-*
Thanks
In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"?
Disallow says, in essence, "disallow URLs that start with this text". So your example line will disallow any URL that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.
Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, then things are a bit more difficult.
As #user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.
For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:
Disallow: /*page-
That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:
Disallow: */page-
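Putting that together: if the pages are indeed all in the root directory, the complete file can be as simple as the sketch below (no wildcards needed, so it also works for bots that don't support them):
User-agent: *
Disallow: /page-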
It looks like the * will work as a Google wildcard, so your rule will keep Google from crawling; however, wildcards are not supported by all other spiders. You can search Google for robots.txt wildcards for more info, or see http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.
Then I pulled this from Google's documentation:
Pattern matching
Googlebot (but not all search engines) respects some pattern matching.
To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
Note: from what I read, this is a Google-only approach. Officially, no wildcards are allowed in robots.txt for Disallow.
You could put all the pages that you don't want visited into a folder and then use Disallow to tell bots not to visit pages in that folder.
Disallow: /private/
I don't know very much about robots.txt, so I'm not sure how to use wildcards like that.
Here, it says "you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines."
http://www.robotstxt.org/faq/robotstxt.html