Googlebot not respecting Robots.txt [closed]

For some reason, when I use Google Webmaster Tools' "Analyze robots.txt" to check which URLs are blocked by our robots.txt file, I don't get the results I expect. Here is a snippet from the beginning of our file:
Sitemap: http://[omitted]/sitemap_index.xml
User-agent: Mediapartners-Google
Disallow: /scripts
User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Anything in the scripts folder is correctly blocked for both Googlebot and Mediapartners-Google. I can see that both robots are picking up the right directive, because Googlebot reports the scripts blocked by line 7 while Mediapartners-Google reports line 4. And yet ANY other URL I test from the disallowed URLs under the second user-agent directive is NOT blocked!
I'm wondering if my comment or my use of absolute URLs is screwing things up...
Any insight is appreciated. Thanks.

The reason they are ignored is that you have fully qualified URLs in the Disallow entries, while the specification doesn't allow that. (A Disallow value should be a URL path relative to the site root, starting with /.) Try the following:
Sitemap: http://[omitted]/sitemap_index.xml
User-agent: Mediapartners-Google
Disallow: /scripts
User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
As for caching, Google tries to fetch a fresh copy of the robots.txt file roughly every 24 hours on average.
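If you want to sanity-check the corrected rules locally before Google re-fetches the file, one rough option is Python's built-in urllib.robotparser. This is only an illustrative sketch: example.com stands in for the omitted host, only a couple of the Disallow lines are reproduced, and the test paths are made up.

# Local check of path-based Disallow rules with the standard-library parser.
from urllib import robotparser

rules = """\
User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
Disallow: /Living/books/book-review-not-stupid.aspx
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Path-based rules are honored for Googlebot (via the * group)...
print(rp.can_fetch("Googlebot", "http://example.com/scripts/main.js"))  # False
print(rp.can_fetch("Googlebot",
                   "http://example.com/Living/books/book-review-not-stupid.aspx"))  # False
# ...while unlisted paths remain crawlable.
print(rp.can_fetch("Googlebot", "http://example.com/Living/books/some-other-article.aspx"))  # True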

It's the absolute URLs. robots.txt is only supposed to contain URL paths relative to the root; the host is inferred from the domain the robots.txt was fetched from.

It's been up for at least a week, and Google says it was last downloaded 3 hours ago, so I'm sure it's recent.

Did you recently make this change to your robots.txt file? In my experience it seems that Google caches that stuff for a really long time.

Related

Are local robots.txt files read by Facebook and Google? [closed]

I have a folder which is half public: the URL is not linked, only a few friends know it (and they will not link to it), and it is cryptic enough that nobody lands there by accident.
However, the link is sent via Googlemail and Facebook messages. Is there a way to tell Facebook and Google in a local robots.txt file not to index the page?
If I add it to the "global" robots.txt file, everybody who takes a look there will see that there might be something interesting in my /secret-folder-12argoe22v4, so I will not do that. But will Facebook / Google look at /secret-folder-12argoe22v4/robots.txt?
The content would be
User-agent: *
Disallow: .
or
User-agent: *
Disallow: /secret-folder-12argoe22v4/
As CBroe mentioned, a robots.txt file must always be at the top level of the site. If you put it in a subdirectory, it will be ignored. One way you can block a directory without publicly revealing its full name is to block just part of it, like this:
User-agent: *
Disallow: /secret
This will block any URL that starts with "/secret", including "/secret-folder-12argoe22v4/".
I should point out that the above is not a 100% reliable way to keep the files out of the search engines. It will keep the search engines from directly crawling the directory, but they can still show it in search results if some other site links to it. You may consider using robots meta tags instead, but even this won't prevent someone from directly following an off-site link. The only really reliable way to keep a directory private is to put it behind a password.
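To illustrate the prefix matching, here is a quick local check with Python's urllib.robotparser; an illustrative sketch only, with example.com standing in for your host and a made-up public page.

# "Disallow: /secret" is a prefix rule, so it also covers the longer folder name.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /secret
""".splitlines())

print(rp.can_fetch("*", "http://example.com/secret-folder-12argoe22v4/"))  # False
print(rp.can_fetch("*", "http://example.com/some-public-page.html"))       # True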

Can I use robots.txt to block any directory tree that starts with numbers?

I'm not even sure if this is the best way to handle this, but I made a temporary mistake with my rewrites and Google (possibly others) picked up on it; now it has those URLs indexed and keeps coming up with errors.
Basically, I'm generating URLs based on a variety of factors, one being the id of an article, which is automatically generated. These then redirect to the correct spot.
I had first accidentally set up stuff like this:
/2343/news/blahblahblah
/7645/reviews/blahblahblah
Etc.
This was a problem for a lot of reasons, the main one being that there would be duplicates and stuff wasn't pointing to the right places, yada yada. I have now fixed them to this:
/news/2343/blahblahblah
/reviews/7645/blahblahblah
Etc.
And that's all good. But I want to block anything that falls into the pattern of the first. In other words, anything that looks like this:
** = any numerical pattern
/**/anythingelsehere
So that Google (and any others who have maybe indexed the wrong stuff) stops trying to look for these URLs that were all messed up and that don't even exist anymore. Is this possible? Should I even be doing this through robots.txt?
You don't need to set up a robots.txt for that; just return 404 errors for those URLs and Google and other search engines will eventually drop them.
Google also has Webmaster Tools, which you can use to de-index URLs. I'm pretty sure other search engines have similar tools.
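For illustration only, here is roughly what "just return 404 for the old URL shape" could look like. This is a minimal standard-library Python sketch, not your actual stack; a real site would do the same thing in its own framework, rewrite rules, or server configuration.

# Minimal sketch: answer the old /<id>/section/... URLs with 404 so crawlers drop them.
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

OLD_PATTERN = re.compile(r"^/\d+/")  # matches the old shape, e.g. /2343/news/..., /7645/reviews/...

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if OLD_PATTERN.match(self.path):
            self.send_error(404)  # old, mistakenly generated URL shape
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"normal page\n")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()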
To answer the question: Yes, you can block any URLs that start with a number.
User-agent: *
Disallow: /0
Disallow: /1
Disallow: /2
Disallow: /3
Disallow: /4
Disallow: /5
Disallow: /6
Disallow: /7
Disallow: /8
Disallow: /9
It would block URLs like:
example.com/1
example.com/2.html
example.com/3/foo
example.com/4you
example.com/52347612
These URLs would still be allowed:
example.com/foo/1
example.com/foo2.html
example.com/bar/3/foo
example.com/only4you
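You can confirm that behaviour locally; a small sketch with Python's urllib.robotparser, checking the example URLs listed above against the digit-prefix rules.

# Check the digit-prefix rules against the example URLs above.
from urllib import robotparser

rules = ["User-agent: *"] + ["Disallow: /" + d for d in "0123456789"]

rp = robotparser.RobotFileParser()
rp.parse(rules)

blocked = ["/1", "/2.html", "/3/foo", "/4you", "/52347612"]
allowed = ["/foo/1", "/foo2.html", "/bar/3/foo", "/only4you"]

for path in blocked:
    assert not rp.can_fetch("*", "http://example.com" + path)
for path in allowed:
    assert rp.can_fetch("*", "http://example.com" + path)
print("all example URLs behave as described")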

Joomla articles showing up as PDFs in Google search results

I have certain articles which I do not want Google to index. I have set them to noindex, nofollow in the metadata settings in the backend for the article. However, now even though the article itself does not show up, a PDF version of it does.
Any thoughts on how to prevent them from being indexed?
You have two choices:
disable the PDF button: go to articles -> configuration and look for the PDF option button.
add the following lines to the robots.txt file located in the root folder of your Joomla installation:
User-agent: Googlebot
Disallow: /index.php?view=article*&format=pdf
Disallow: /index.php?view=article*&print=1*
Disallow: /index.php?option=com_mailto*
Disallow: /component/mailto/*
User-agent: Slurp
Disallow: /index.php?view=article*&format=pdf
Disallow: /index.php?view=article*&print=1*
Disallow: /index.php?option=com_mailto*
Disallow: /component/mailto/*
With either of these options you should solve the problem; the latter also gets rid of duplicate content from the send-to-friend and print views.
Also bear in mind that it will take some time to change the way Google or other bots are indexing your site.
Regards
It depends on where the PDF files are stored. You can use the robots.txt file to exclude the folder where the PDF files are located. You could also add rel="nofollow" to the links to the PDF files.
Some links for reading:
http://www.useit.com/alertbox/20030728_spidering.html
http://www.webmasterworld.com/robots_txt/3741109.htm

Robots.txt, how to allow access only to domain root, and no deeper? [closed]

I want to allow crawlers to access my domain's root directory (i.e. the index.html file), but nothing deeper (i.e. no subdirectories). I do not want to have to list and deny every subdirectory individually within the robots.txt file. Currently I have the following, but I think it is blocking everything, including stuff in the domain's root.
User-agent: *
Allow: /$
Disallow: /
How can I write my robots.txt to accomplish what I am trying for?
Thanks in advance!
There's nothing that will work for all crawlers. There are two options that might be useful to you.
Robots that allow wildcards should support something like:
Disallow: /*/
The major search engine crawlers understand the wildcards, but unfortunately most of the smaller ones don't.
If you have relatively few files in the root and you don't often add new files, you could use Allow to allow access to just those files, and then use Disallow: / to restrict everything else. That is:
User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Allow: /morecoolstuff.html
Disallow: /
The order here is important. Crawlers are supposed to take the first match. So if your first rule was Disallow: /, a properly behaving crawler wouldn't get to the following Allow lines.
If a crawler doesn't support Allow, then it's going to see the Disallow: / and not crawl anything on your site. Providing, of course, that it ignores things in robots.txt that it doesn't understand.
All the major search engine crawlers support Allow, and a lot of the smaller ones do, too. It's easy to implement.
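As an illustration of the first-match behaviour, here is a local check with Python's urllib.robotparser (which happens to evaluate rules in order as well); example.com stands in for your domain and the deep path is made up.

# Allow lines listed before "Disallow: /" keep only the named root files crawlable.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Allow: /morecoolstuff.html
Disallow: /
""".splitlines())

print(rp.can_fetch("*", "http://example.com/index.html"))        # True  (explicit Allow)
print(rp.can_fetch("*", "http://example.com/coolstuff.jpg"))     # True
print(rp.can_fetch("*", "http://example.com/deeper/page.html"))  # False (falls through to Disallow: /)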
In short, no, there is no way to do this nicely using the robots.txt standard. Remember that Disallow specifies a path prefix; wildcards and Allow are non-standard.
So the following approach (a kludge!) will work.
User-agent: *
Disallow: /a
Disallow: /b
Disallow: /c
...
Disallow: /z
Disallow: /A
Disallow: /B
Disallow: /C
...
Disallow: /Z
Disallow: /0
Disallow: /1
Disallow: /2
...
Disallow: /9
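If you do go this route, you don't have to type the whole list by hand. A small Python sketch can generate it and sanity-check the result with the standard-library parser; note that /index.html itself gets caught too, since it starts with a letter.

# Generate the full a-z / A-Z / 0-9 kludge instead of typing ~62 lines by hand.
import string
from urllib import robotparser

chars = string.ascii_lowercase + string.ascii_uppercase + string.digits
lines = ["User-agent: *"] + ["Disallow: /" + c for c in chars]
print("\n".join(lines))  # paste this into robots.txt

rp = robotparser.RobotFileParser()
rp.parse(lines)
print(rp.can_fetch("*", "http://example.com/"))            # True  (the bare root is not matched)
print(rp.can_fetch("*", "http://example.com/about/team"))  # False (starts with "a")
print(rp.can_fetch("*", "http://example.com/index.html"))  # False (starts with "i" -- a limitation of the kludge)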

how to disallow all dynamic urls robots.txt [closed]

How do I disallow all dynamic URLs in robots.txt?
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
I want to disallow all things that start with /?q=
The answer to your question is to use
Disallow: /?q=
The best (currently accessible) source on robots.txt I could find is on Wikipedia. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.)
According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards; each "disallowed" value is actually a path prefix, i.e. it matches any path that starts with the specified value.
The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow would be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.
This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a "Disallow:" would be bad for (hypothetical) robots.txt files where those characters were intended to be interpreted literally.
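To see that prefix matching in action on the /?q= rule, here is a local sketch with Python's urllib.robotparser; example.com stands in for the real site and /about is a made-up static path.

# "Disallow: /?q=" is a prefix rule, so it covers every URL whose path-plus-query starts with /?q=.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /?q=
""".splitlines())

print(rp.can_fetch("*", "http://example.com/?q=admin/"))          # False
print(rp.can_fetch("*", "http://example.com/?q=user/register/"))  # False
print(rp.can_fetch("*", "http://example.com/about"))              # True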
As Paul said, a lot of robots.txt interpreters are not too bright and might not interpret wildcards in the path the way you intend them.
That said, some crawlers try to skip dynamic pages on their own, worrying they might get caught in infinite loops on links with varying URLs. I am assuming you are asking this question because you are facing a courageous crawler that is trying hard to access those dynamic paths.
If you have issues with a specific crawler, you can investigate how it behaves by looking up its robots.txt capabilities and adding a robots.txt section specifically for it.
If you generally just want to disallow such access to your dynamic pages, you might want to rethink your robots.txt design.
More often than not, dynamic parameter-handling "pages" live under a specific directory or set of directories. This is why it is normally enough to simply Disallow: /cgi-bin or /app and be done with it.
In your case you seem to have mapped the root to an area that handles parameters. You might want to reverse the logic of robots.txt and say something like:
User-agent: *
Allow: /index.html
Allow: /offices
Allow: /static
Disallow: /
This way your Allow list will override your Disallow list by listing specifically what crawlers should index. Note that not all crawlers are created equal, and you may want to refine that robots.txt later, adding a specific section for any crawler that still misbehaves.
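For what it's worth, here is a local sanity check of that whitelist approach with Python's urllib.robotparser; the paths are the illustrative ones from the example above, with made-up file names.

# Explicit Allow lines listed before "Disallow: /" keep only the named areas crawlable,
# so the ?q= URLs fall through to the final Disallow.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /index.html
Allow: /offices
Allow: /static
Disallow: /
""".splitlines())

print(rp.can_fetch("*", "http://example.com/offices/paris.html"))  # True
print(rp.can_fetch("*", "http://example.com/static/logo.png"))     # True
print(rp.can_fetch("*", "http://example.com/?q=user/login/"))      # False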