How do I disallow search robots from www.example.com and exsample.com - redirect

I would like to know if it is possible to block all robots from my site. I get some trouble because I redirect exsample.com to www.exsample.com. The robots.txt checker tools says I don't have a robots.txt file on exsample.com but have it on www.exsample.com.
Hope someone can help me out :)

just make a text file named robots.txt and in this file you write the following
User-agent: *
Disallow: /
and put it in your www folder or public_html folder
this would ask all the search engines to disallow all content of the website but not all the search engines would obbay to this protocol, but the most important search engines would read it and do as you asked

Robots.txt works per host.
So if you want to block URLs on http://www.example.com, the robots.txt must be accessible at http://www.example.com/robots.txt.
Note that the subdomain matters, so you can’t block URLs on http://example.com with a robots.txt only available on http://www.example.com/robots.txt.

Related

Is our robots.txt file formatted correctly?

I'm trying to make sure our robots.txt file is correct and would greatly appreciate some info. We want all bots to be able to crawl and index the homepage and the 'sample triallines' but that's it. Here's the file:
User-agent: *
Allow: /$
Allow: /sample-triallines$
Disallow: /
Can anyone please let me know if this is correct?
Thanks in advance.
You can test your XML sitemap directly with a robots testing tool or within the webmaster tools of most major search engines (e.g. Google Search Console). Your current robots.txt file will work for most crawlers for the exact URLs you mentioned (e.g. https://www.example/ and https://www.example/sample-triallines).
However, just to note, if your URLs deviate from these exact URLs they will be blocked to crawlers (e.g. tracking parameters). For example, the below URLs will be blocked with the current robots.txt setup, which may or may not be acceptable for what you're working on.
https://www.example/index.html
https://www.example/?marketing=promo
https://www.example/sample-triallines/
https://www.example/sample-triallines?marketing=promo
If any of these above URLs need to be crawled you'll just need to add additional directives into the robots.txt file as needed and test them within the robots testing tools. Additional information on robots directives can be found here.
Hope this helps

Robots.txt: Allow everything but the root directory

I have a site that is meant to have http://domain.com/blog as the root directory, and any traffic to http://domain.com is redirected to http://domain.com/blog.
This causes a problem cause when I go to Google and do site:domain.com, I see the root directory with the title of one of the first articles on the page. How can I block the root from being crawled, thus not showing up in search?
In webmaster tools I added the site as http://domain.com but I only fetch as google on the /blog directory and other static pages. Is that correct?
I usually know how to do this but this time the site has a sub-directory as the intended root so it's a bit different.
Can someone verify if this will do what I am trying to achieve?
User-agent: *
Allow: /$
Disallow: /
Robots.txt does NOT block a crawler from crawling certain webpages. Robots.txt is simply a text file with a set of guidelines that you ask the crawler to follow it does not at any time block a crawler. If you want to block a certain webpage from being crawl/visited - you will then have to block all access to that page, this includes other users that are not crawlers. But since you have already have it to redirect I see no issue.
Also the $ is not a unified standard, neither is Allow(technically). Try to make it focused on specific bots. Google and Bing recognise the Allow keyword, but many other bots does not.
Also your current robots.txt says this: Do not crawl any pages, but the root
I recommend this as your robots.txt
User-agent: *
Disallow: /
User-agent: googlebot
Disallow: /$
This tells all other bots, but google to not crawl your webpage. And it tells the google crawler not to crawl in root, but everything else is allowed.

how to set Robots.txt files for subdomains?

I have a subdomain eg blog.example.com and i want this domain not to index by Google or any other search engine. I put my robots.txt file in 'blog' folder in the server with following configuration:
User-agent: *
Disallow: /
Would it be fine to not to index by Google?
A few days before my site:blog.example.com shows 931 links but now it is displaying 1320 pages. I am wondering if my robots.txt file is correct then why Google is indexing my domain.
If i am doing anything wrong please correct me.
Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.
This question is actually about how to prevent indexing of a subdomain, here your robots file is actually preventing your site from being noindexed.
Don’t use a robots.txt file as a means to hide your web pages from Google search results.
Introduction to robots.txt: What is a robots.txt file used for? Google Search Central Documentation
For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Block Search indexing with noindex Google Search Central Documentation

Will search engines honor robots.txt for a separate site that is a virtual directory under another site?

I have a website (Ex: www.examplesite.com), and I am creating another site as a separate, stand-alone site in IIS. This second site's URL will make it look like it's part of my main site: www.examplesite.com/anothersite. This is accomplished by creating a virtual directory under my main site that points to the second site.
I am allowing my main site (www.examplesite.com) to be indexed in search engines, but I do not want my second, virtual directory site to be seen by search engines. Can I allow my second site to have its own robots.txt file, and disallow all pages for that site there? Or do I need to modify my main site's robots.txt file and tell it to disallow the virtual directory?
You can't have an own robots.txt for directories. Only a "host" can have it's own robots.txt: example.com, www.example.com, sub.example.com, sub.sub.example.com, …
So if you want to set rules for www.example.com/anothersite, you have to use the robots.txt at www.example.com/robots.txt.
If you want to block all pages of the sub-site, simply add:
User-agent: *
Disallow: /anothersite
This will block all URL paths that start with "anothersite". E.g. these links are all blocked then:
www.example.com/anothersite
www.example.com/anothersite.html
www.example.com/anothersitefoobar
www.example.com/anothersite/foobar
www.example.com/anothersite/foo/bar/
…
Note: If your robots.txt already contains User-agent: *, you'd have to add the Disallow line in this block instead of adding a new block (bots will stop reading the robots.txt as soon as they found a block that matches for them).

Stopping Google's crawl of my site

Google has started crawling my site, but from a temporary domain (beta.mydomain instead of just mydomain) and also I only want him to crawl just some of my pages. Therefore, I want to stop their crawl and only let them crawl pages I specify in a sitemap. How can I do that? (I know how to add a sitemap, but how can I stop their current crawling and request that they'll crawl just the sitemap)
Update: If I kill the subdomain beta.mydomain - will that be "fine" by them or will they continue go over all killed pages and "not like" them? Can I specify that in each page's header?
Create a single text file called 'robots.txt' in the root folder for your site. Inside...
User-agent: *
Disallow: /thisfolder/
Disallow: /foo.html
Disallow: /andthisfoldertoo/
Disallow: /andthisfile.html
I use this for project files. In fact, as I write this I think I'll change the way I work on projects and always put them in a sub-directory called /projects/project1/ so one line will do...
Disallow: /projects/
AND I also add a line for my image files. I don't like my images all over the web...
Disallow: /imgs/
You could start with a robots.txt file.
See google's info here
I presume you have already looked at webmaster tools and sitemaps from what you say? Do be aware that while a sitemap will help tell google WHAT to crawl, it won't work very well for telling them what NOT to crawl.
For that you will want to use the robots.txt file to block certain pages / folders.
Use a robots.txt, see this site.