robots.txt allows & disallows few pages, what does it mean for other pages? - robots.txt

I was going through many websites' robots.txt files to check if I could scrape some specific pages. When I see following pattern -
User-agent: *Allow: /some-pageDisallow: /some-other-page
There is nothing else on robots.txt file. Does it mean that all other remaining pages on the given website are available to be scraped?
P.S. - I tried googling this specific case but no luck.

According to this website, Allow is used to a allow a directory when it's parent may be disallowed. I found this website quite useful as well.
Disallow: The command used to tell a user-agent not to crawl particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Regarding your question, if the remaining pages aren't included in a Disallow directory, you should be okay.

Related

how to set Robots.txt files for subdomains?

I have a subdomain eg blog.example.com and i want this domain not to index by Google or any other search engine. I put my robots.txt file in 'blog' folder in the server with following configuration:
User-agent: *
Disallow: /
Would it be fine to not to index by Google?
A few days before my site:blog.example.com shows 931 links but now it is displaying 1320 pages. I am wondering if my robots.txt file is correct then why Google is indexing my domain.
If i am doing anything wrong please correct me.
Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.
This question is actually about how to prevent indexing of a subdomain, here your robots file is actually preventing your site from being noindexed.
Don’t use a robots.txt file as a means to hide your web pages from Google search results.
Introduction to robots.txt: What is a robots.txt file used for? Google Search Central Documentation
For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Block Search indexing with noindex Google Search Central Documentation

How come when I block a directory in robots.txt, its contents are still coming up?

This is what I've got in my robots.txt, placed in the base directory, of course:
User-Agent: *
Disallow: /foo/
But then, in Google, I have no index of /foo/, but for some reason, I still have /foo/foo.php showing up as a link in Google.
How come? Did I write something incorrectly? Do I need to write something else?
When you put robots.txt after your site went live, Google could already index files under /foo/.
You can remove already indexed files via Google Webmaster Tools - removal request.
robots.txt does not prevent Google to link to your blocked pages. Google won't index your blocked pages (so it won't show the page title/description/snippet), but if it finds a link to any blocked page, it might still link it from their search results.
If you want to also forbid this linking, you could use the meta element with robots and noindex.

Stopping Google's crawl of my site

Google has started crawling my site, but from a temporary domain (beta.mydomain instead of just mydomain) and also I only want him to crawl just some of my pages. Therefore, I want to stop their crawl and only let them crawl pages I specify in a sitemap. How can I do that? (I know how to add a sitemap, but how can I stop their current crawling and request that they'll crawl just the sitemap)
Update: If I kill the subdomain beta.mydomain - will that be "fine" by them or will they continue go over all killed pages and "not like" them? Can I specify that in each page's header?
Create a single text file called 'robots.txt' in the root folder for your site. Inside...
User-agent: *
Disallow: /thisfolder/
Disallow: /foo.html
Disallow: /andthisfoldertoo/
Disallow: /andthisfile.html
I use this for project files. In fact, as I write this I think I'll change the way I work on projects and always put them in a sub-directory called /projects/project1/ so one line will do...
Disallow: /projects/
AND I also add a line for my image files. I don't like my images all over the web...
Disallow: /imgs/
You could start with a robots.txt file.
See google's info here
I presume you have already looked at webmaster tools and sitemaps from what you say? Do be aware that while a sitemap will help tell google WHAT to crawl, it won't work very well for telling them what NOT to crawl.
For that you will want to use the robots.txt file to block certain pages / folders.
Use a robots.txt, see this site.

block google robots for URLS containing a certain word

my client has a load of pages which they dont want indexed by google - they are all called
http://example.com/page-xxx
so they are /page-123 or /page-2 or /page-25 etc
Is there a way to stop google indexing any page that starts with /page-xxx using robots.txt
would something ike this work?
Disallow: /page-*
Thanks
In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"?
Disallow says, in essence, "disallow urls that start with this text". So your example line will disallow any url that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.
Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, then things are a bit more difficult.
As #user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.
For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:
Disallow: /*page-
That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:
Disallow: */page-
It looks like the * will work as a Google wild card, so your answer will keep Google from crawling, however wildcards are not supported by other spiders. You can search google for robot.txt wildcards for more info. I would see http://seogadget.co.uk/wildcards-in-robots-txt/ for more information.
Then I pulled this from Google's documentation:
Pattern matching
Googlebot (but not all search engines) respects some pattern matching.
To match a sequence of characters, use an asterisk (*). For instance, to block access to all >subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
User-agent: *
Allow: /?$
Disallow: /?
The Disallow: / *? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
Note: From what I read this is a Google only approach. Officially there is no Wildcard allowed in robots.txt for disallow.
You could put all the pages that you don't want to get visited in a folder and then use disallow to tell bots not to visit pages in that folder.
Disallow: /private/
I don't know very much about robots.txt so I'm not sure how to use wildcards like that
Here, it says "you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines."
http://www.robotstxt.org/faq/robotstxt.html

robots.txt: user-agent: Googlebot disallow: / Google still indexing

Look at the robots.txt of this site:
fr2.dk/robots.txt
The content is:
User-Agent: Googlebot
Disallow: /
That ought to tell google not to index the site, no?
If true, why does the site appear in google searches?
Besides having to wait, because Google's index updates take some time, also note that if you have other sites linking to your site, robots.txt alone won't be sufficient to remove your site.
Quoting Google's support page "Remove a page or site from Google's search results":
If the page still exists but you don't want it to appear in search results, use robots.txt to prevent Google from crawling it. Note that in general, even if a URL is disallowed by robots.txt we may still index the page if we find its URL on another site. However, Google won't index the page if it's blocked in robots.txt and there's an active removal request for the page.
One possible alternative solution is also mentioned in above document:
Alternatively, you can use a noindex meta tag. When we see this tag on a page, Google will completely drop the page from our search results, even if other pages link to it. This is a good solution if you don't have direct access to the site server. (You will need to be able to edit the HTML source of the page).
If you just added this, then you'll have to wait - it's not instantaenous - until Googlebot comes back to respider the site and sees the robots.txt, the site'll still be in their database.
I doubt it's relevant, but you might want to change your "Agent" to "agent" - Google's most likely not case sensitive for this, but can't hurt to follow the standard exactly.
I can confirm Google doesn't respect the Robots Exclusion File. Here's my file, which I created before putting this origin online:
https://git.habd.as/robots.txt
And the full contents of the file:
User-agent: *
Disallow:
User-agent: Google
Disallow: /
And Google still indexed it.
I don't use Google after cancelling my account last March and never had this site added to a webmaster console outside Yandex which leaves me with two assumptions:
Google is scraping Yandex
Google doesn't respect the Robots Exclusion Standard
I haven't grepped my logs yet but I will and my assumption is I'll find Google spiders in there misbehaving.