Will search engines honor robots.txt for a separate site that is a virtual directory under another site? - robots.txt

I have a website (Ex: www.examplesite.com), and I am creating another site as a separate, stand-alone site in IIS. This second site's URL will make it look like it's part of my main site: www.examplesite.com/anothersite. This is accomplished by creating a virtual directory under my main site that points to the second site.
I am allowing my main site (www.examplesite.com) to be indexed in search engines, but I do not want my second, virtual directory site to be seen by search engines. Can I allow my second site to have its own robots.txt file, and disallow all pages for that site there? Or do I need to modify my main site's robots.txt file and tell it to disallow the virtual directory?

You can't have a separate robots.txt for a directory. Only a "host" can have its own robots.txt: example.com, www.example.com, sub.example.com, sub.sub.example.com, …
So if you want to set rules for www.example.com/anothersite, you have to use the robots.txt at www.example.com/robots.txt.
If you want to block all pages of the sub-site, simply add:
User-agent: *
Disallow: /anothersite
This will block all URL paths that start with "/anothersite". For example, these URLs are all blocked then:
www.example.com/anothersite
www.example.com/anothersite.html
www.example.com/anothersitefoobar
www.example.com/anothersite/foobar
www.example.com/anothersite/foo/bar/
…
Note: If your robots.txt already contains a User-agent: * block, add the Disallow line to that block instead of adding a new block (some bots stop reading the robots.txt as soon as they find a block that matches them).
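As a rough way to check the prefix matching described above, the rule can be fed to Python's standard urllib.robotparser module. This is only a sketch: the https:// scheme and the /contact page are assumptions added for illustration, and real crawlers may differ from this parser in details.

```python
from urllib.robotparser import RobotFileParser

# The two lines suggested above for www.example.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /anothersite
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# URLs from the list above; the https:// scheme is an assumption.
blocked_examples = [
    "https://www.example.com/anothersite",
    "https://www.example.com/anothersite.html",
    "https://www.example.com/anothersitefoobar",
    "https://www.example.com/anothersite/foobar",
    "https://www.example.com/anothersite/foo/bar/",
]

for url in blocked_examples:
    # can_fetch() returns False when the rules disallow the URL.
    print(url, parser.can_fetch("*", url))

# A hypothetical page outside the "/anothersite" prefix stays crawlable.
print(parser.can_fetch("*", "https://www.example.com/contact"))
```

Each of the five listed URLs should print False, and the last line should print True, because only paths beginning with /anothersite match the rule.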

Related

robots.txt content / selenium web scraping

I am trying to do web scraping with Selenium.
What does this robots.txt content mean?
User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/
Can I scrape all folders except go and launch-announcement?
What is a robots.txt file?
Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
The Disallow: directive tells the robot that it should not visit the given path on the site.
Can I scrape all folders except go and launch-announcement?
Yes, you can scrape the other pages, just not those under these two paths.
According to the basic robots.txt guide, the rule:
User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/
means crawling /go/ and /launch-announcement/ (and their subdirectories) is disallowed for all user agents.
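Before pointing Selenium at a list of pages, the same rules can be used to filter the list. The following is a minimal sketch using Python's standard urllib.robotparser; the example.com URLs, the "my-scraper" user agent name, and the allowed_urls helper are all hypothetical and only serve to illustrate the idea.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the question.
RULES = """\
User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/
"""


def allowed_urls(urls, user_agent="my-scraper"):
    """Return only the URLs that the rules above permit for user_agent."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical candidate pages on an assumed host.
candidates = [
    "https://example.com/",
    "https://example.com/products/1",
    "https://example.com/go/partner",
    "https://example.com/launch-announcement/2020",
]

print(allowed_urls(candidates))
```

Only the first two candidates should remain in the printed list; the pages under /go/ and /launch-announcement/ are dropped before any browser automation starts.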

robots.txt allows & disallows few pages, what does it mean for other pages?

I was going through many websites' robots.txt files to check if I could scrape some specific pages. When I see the following pattern:
User-agent: *
Allow: /some-page
Disallow: /some-other-page
There is nothing else in the robots.txt file. Does it mean that all other remaining pages on the given website are available to be scraped?
P.S. - I tried googling this specific case but no luck.
According to this website, Allow is used to allow a directory when its parent may be disallowed. I found this website quite useful as well.
Disallow: The command used to tell a user-agent not to crawl particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Regarding your question: if the remaining pages aren't covered by a Disallow rule, you should be okay.
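The default-allow behaviour can be checked with a short sketch based on Python's standard urllib.robotparser. The host example.com and the /anything-else path are assumptions made up for this illustration; the three rule lines are the ones from the question.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Allow: /some-page",
    "Disallow: /some-other-page",
])

# "/anything-else" is a hypothetical path that no rule mentions.
for path in ("/some-page", "/some-other-page", "/anything-else"):
    url = "https://example.com" + path
    print(path, parser.can_fetch("*", url))
```

With this parser, a path that matches no rule is treated as allowed, so only /some-other-page should come back as False.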

Robots.txt: Allow everything but the root directory

I have a site that is meant to have http://domain.com/blog as the root directory, and any traffic to http://domain.com is redirected to http://domain.com/blog.
This causes a problem because when I go to Google and search site:domain.com, I see the root directory listed with the title of one of the first articles on the page. How can I block the root from being crawled, so that it doesn't show up in search?
In Webmaster Tools I added the site as http://domain.com, but I only use Fetch as Google on the /blog directory and other static pages. Is that correct?
I usually know how to do this, but this time the site has a sub-directory as the intended root, so it's a bit different.
Can someone verify if this will do what I am trying to achieve?
User-agent: *
Allow: /$
Disallow: /
Robots.txt does NOT block a crawler from crawling certain webpages. It is simply a text file with a set of guidelines that you ask the crawler to follow; it does not at any time block a crawler. If you want to block a certain webpage from being crawled or visited, you have to block all access to that page, which includes users that are not crawlers. But since you already have the redirect in place, I see no issue.
Also, the $ is not a unified standard, and neither is Allow (technically). Try to target specific bots. Google and Bing recognise the Allow keyword, but many other bots do not.
Also, your current robots.txt says this: do not crawl any pages except the root.
I recommend this as your robots.txt
User-agent: *
Disallow: /
User-agent: googlebot
Disallow: /$
This tells all bots other than Google not to crawl your site. And it tells the Google crawler not to crawl the root, while everything else is allowed.
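To see how the two rule sets differ, here is a small hand-written matcher. It is a sketch under stated assumptions, not any crawler's actual implementation: it assumes the matching behaviour Google documents (and RFC 9309 describes), where * matches any run of characters, a trailing $ anchors the end of the path, the longest matching pattern wins, and Allow wins ties. The function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any sequence of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs for one user agent.

    Assumed precedence: longest matching pattern wins, Allow wins ties.
    A path that matches no rule is allowed.
    """
    best = None
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            key = (len(pattern), directive == "allow")
            if best is None or key > best:
                best = key
    return True if best is None else best[1]


# Rules from the question: only the root is crawlable.
question_rules = [("allow", "/$"), ("disallow", "/")]
# Googlebot group from the answer: only the root is blocked.
answer_rules = [("disallow", "/$")]

for path in ("/", "/blog", "/blog/post-1"):
    print(path, is_allowed(question_rules, path), is_allowed(answer_rules, path))
```

Under these assumptions the question's rules allow only "/" and block everything else, while the answer's Googlebot group does the opposite, which matches the explanation above.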

how to set Robots.txt files for subdomains?

I have a subdomain, e.g. blog.example.com, and I don't want this domain to be indexed by Google or any other search engine. I put my robots.txt file in the 'blog' folder on the server with the following configuration:
User-agent: *
Disallow: /
Is this enough to keep Google from indexing it?
A few days ago, site:blog.example.com showed 931 links, but now it is displaying 1320 pages. If my robots.txt file is correct, why is Google indexing my domain?
If I am doing anything wrong, please correct me.
Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.
This question is actually about how to prevent indexing of a subdomain. Here, your robots.txt file is actually preventing your site from being noindexed, because a crawler that is blocked from fetching a page never sees a noindex directive on it.
Don’t use a robots.txt file as a means to hide your web pages from Google search results.
Introduction to robots.txt: What is a robots.txt file used for? Google Search Central Documentation
For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Block Search indexing with noindex Google Search Central Documentation
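Besides the robots meta tag, a noindex directive can also be sent as an X-Robots-Tag HTTP response header. The following is a minimal sketch of a WSGI wrapper that adds that header to every response; noindex_middleware and demo_app are hypothetical names, and how such a wrapper would be wired into the actual blog.example.com server is not covered by the source. For this to have any effect, the pages must not be blocked in robots.txt.

```python
def noindex_middleware(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noindex."""

    def wrapped(environ, start_response):
        def start_with_noindex(status, headers, exc_info=None):
            # Append the header without mutating the caller's list.
            headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, start_with_noindex)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


application = noindex_middleware(demo_app)
```

The wrapped application can be served by any WSGI server, for example wsgiref.simple_server from the standard library for local testing.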

How do I disallow search robots from www.example.com and exsample.com

I would like to know if it is possible to block all robots from my site. I'm having some trouble because I redirect exsample.com to www.exsample.com. The robots.txt checker tool says I don't have a robots.txt file on exsample.com, but that I do have one on www.exsample.com.
Hope someone can help me out :)
Just make a text file named robots.txt and write the following in it:
User-agent: *
Disallow: /
and put it in your www or public_html folder.
This asks all search engines not to crawl any content of the website. Not all search engines obey this protocol, but the most important ones will read it and do as you asked.
Robots.txt works per host.
So if you want to block URLs on http://www.example.com, the robots.txt must be accessible at http://www.example.com/robots.txt.
Note that the subdomain matters, so you can’t block URLs on http://example.com with a robots.txt only available on http://www.example.com/robots.txt.
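The per-host rule can be illustrated with a small helper that derives the robots.txt location for any page URL. This is a sketch using Python's standard urllib.parse; the function name robots_txt_url and the sample page paths are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical pages on the two hosts discussed above.
print(robots_txt_url("http://www.example.com/some/page.html"))
print(robots_txt_url("http://example.com/some/page.html"))
```

The two calls print http://www.example.com/robots.txt and http://example.com/robots.txt respectively, which shows why a file served only on the www host does not cover the bare domain.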