Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
I have a site with the following robots.txt in the root:
User-agent: *
Disabled: /
User-agent: Googlebot
Disabled: /
User-agent: Googlebot-Image
Disallow: /
Yet pages within this site are being crawled by Googlebot all day long. Is there something wrong with my file, or with Google?
It should be Disallow:, not Disabled:.
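A corrected version of the file (strictly, the single `User-agent: *` record with `Disallow: /` would already cover Googlebot and Googlebot-Image, but keeping the explicit records preserves the original intent):

```text
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Image
Disallow: /
```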
Maybe give Google's robots.txt checker a try.
Google has an analysis tool for checking robots.txt entries; read about it here.
You might also want to check the IP addresses of the "rogue" robots to see if they really are owned by Google.
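That IP check can be sketched in Python. Google's documented method is a reverse-DNS lookup on the visiting IP, a check that the resulting name belongs to a Google crawler domain, and a forward lookup to confirm it maps back to the same IP. The function names below are my own, and `is_real_googlebot` needs network access:

```python
import socket

# Domains Google publishes for genuine Googlebot reverse-DNS names.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def looks_like_google_host(hostname):
    """Pure check: does a reverse-DNS name fall under Google's crawler domains?"""
    return hostname.rstrip(".").lower().endswith(GOOGLE_SUFFIXES)

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-resolve the
    name and make sure it maps back to the same IP (requires DNS access)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not looks_like_google_host(hostname):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
    return ip in forward_ips
```

The suffix check alone is not enough, since anyone can name a host `something.googlebot.com.evil.example`; the forward-confirmation step is what makes the verification trustworthy.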
Also, I believe the bot goes down the file and takes the first directive that applies to it. In your case, Googlebot and Googlebot-Image would never see their specific directives because they would respect the "User-agent: *" record first.
Disregard this answer: I found information indicating this is not the case. The bot should find the record specific to it and respect that.
On a /robots.txt page, what does this mean?
User-agent: *
Disallow: /
Does this mean that you cannot search and get results of this website on a search engine? For example does it block Google?
It blocks well-behaved bots (e.g., Googlebot) from crawling any page.
From this page:
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.

the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
See the robots.txt specification.
User-agent: * matches every bot that supports robots.txt (and that doesn't have a more specific record in the same file, e.g. User-agent: BotWithAName).
Disallow: / forbids those bots from crawling anything on your host.
Note that not all bots support and respect a robots.txt file.
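You can see the effect of this record with Python's standard-library robots.txt parser, for instance:

```python
from urllib.robotparser import RobotFileParser

# The two-line file from the question: every user agent, everything disallowed.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Any compliant bot is refused any URL on the host.
print(rp.can_fetch("Googlebot", "http://example.com/"))             # False
print(rp.can_fetch("SomeOtherBot", "http://example.com/a/b.html"))  # False
```

This models what a well-behaved crawler does; as noted above, a bot that ignores robots.txt simply never runs such a check.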
I have an app whose content should not be publicly indexed. I've therefore disallowed access to all crawlers.
robots.txt:
# Robots shouldn't index a private app.
User-agent: *
Disallow: /
However, Bing has been ignoring this and requests a /trafficbasedsspsitemap.xml file daily, which I have no need to create.
I also have no need to receive daily 404 error notifications for this file. I'd like to just make the bingbot go away, so what do I need to do to forbid it from making requests?
According to this answer, this is Bingbot checking for an XML sitemap generated by the Bing Sitemap Plugin for IIS and Apache. It apparently cannot be blocked by robots.txt.
For those coming from Google:
You could block bots via Apache user-agent detection/rewrite directives, which would let you keep Bingbot out entirely.
https://superuser.com/questions/330671/wildcard-blocking-of-bots-in-apache
Block all bots/crawlers/spiders for a special directory with htaccess
etc.
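A sketch of that Apache approach (mod_rewrite must be enabled; the 403 response and the case-insensitive bingbot pattern are choices, not requirements):

```apache
# .htaccess: refuse any request whose User-Agent contains "bingbot".
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bingbot [NC]
RewriteRule ^ - [F,L]
```

Unlike robots.txt, this is enforced by the server, so it works even against bots that ignore crawl directives.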
My site has profiles, and then pages beyond those profiles. (Example: http://www.site.com/profile, http://www.site.com/profile/settings)
I would like to block Google's crawlers from the subfolders. I want Google to index /profile/ but not anything beyond it.
Another example:
- http://twitter.com/bmull <-- Allow
- http://twitter.com/bmull/favorites <-- Block
You could also use <meta name="robots" content="noindex, nofollow" /> on the pages you don't want robots to index or follow. However, remember that all of these directives are voluntary and robots can choose not to follow them, so I recommend IP or user-agent blocking as a more reliable route.
This will work with Google, but isn't guaranteed to work with other spiders. As secretformula suggested, your best bet is IP or user-agent blocking in your server-side logic.
User-agent: *
Disallow: /*/settings
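If the goal is to block everything below a profile rather than only /settings, a broader wildcard pattern should cover it. Note that * wildcards are an extension supported by Google and Bing, not part of the original robots.txt standard, and this is a sketch rather than a tested rule:

```text
User-agent: Googlebot
# /profile itself stays crawlable; any URL with a second path segment
# (e.g. /profile/settings, /profile/favorites) is blocked.
Disallow: /*/
```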
We implemented a rating system on a site a while back that involves a link to a script. However, with the vast majority of ratings on the site at 3/5 and the ratings very even across 1-5 we're beginning to suspect that search engine crawlers etc. are getting through. The urls used look like this:
http://www.thesite.com/path/to/the/page/rate?uid=abcdefghijk&value=3
When we started, we added the following to our robots.txt:
User-agent: *
Disallow: /rate
Is this incorrect or are googlebot and others simply ignoring our robots.txt?
You should use POST for actions that change things, as search engines usually do not submit forms. Additionally, this will prevent users who download your website recursively (e.g. with wget) from submitting tons of votes.
Depending on your site, handling voting through JavaScript might be a solution, too.
Regarding your robots.txt:
It has to be in the root path, i.e. http://www.thesite.com/robots.txt, and if your rating system is at /blah/rate you need to use Disallow: /blah/rate instead of Disallow: /rate.
Looks incorrect to me. You're only disallowing access to http://www.thesite.com/rate (and pages below it IIRC). Plus some crawlers ignore robots.txt!
Better to make it so that ratings are only ever altered in response to a POST, rather than a GET. Search engines generally do not issue POST requests.
User-agent: *
Disallow: /path/to/the/page/rate
You have to use the full path.
Might want to read up here a bit: http://www.javascriptkit.com/howto/robots.shtml
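The POST-only suggestion above can be sketched as a framework-agnostic handler. The function name, status codes, and `record_rating` hook are illustrative, not from the question's codebase:

```python
def handle_rate(method, uid, value):
    """Accept a rating only via POST; crawlers following GET links are rejected.

    Returns an (http_status, body) pair for illustration.
    """
    if method != "POST":
        # GET (what crawlers use when following links) must never mutate state.
        return 405, "Method Not Allowed: use POST to submit a rating"
    if value not in {1, 2, 3, 4, 5}:
        return 400, "Rating must be an integer from 1 to 5"
    # record_rating(uid, value) would persist the vote here (hypothetical hook).
    return 200, "Rating recorded"
```

With this shape, a crawler fetching http://www.thesite.com/path/to/the/page/rate?uid=abcdefghijk&value=3 gets a 405 and never skews the averages, regardless of whether it honors robots.txt.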
What would the syntax be to block all bot access to https:// pages? I have an old site that no longer has an SSL certificate, and I want to block access to all https:// pages.
I don't know whether this works, i.e. whether robots request different robots.txt files for different protocols. But you could deliver a different robots.txt for requests over HTTPS.
So when http://example.com/robots.txt is requested, you deliver the normal robots.txt. And when https://example.com/robots.txt is requested, you deliver the robots.txt that disallows everything.
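Under Apache with mod_rewrite, serving a separate file over HTTPS could look like this. The filename robots_https.txt is my own invention; it would contain the blocking User-agent: * / Disallow: / record:

```apache
# Serve a blocking robots.txt only for HTTPS requests.
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ /robots_https.txt [L]
```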