Joomla articles showing up as PDFs in Google search results - joomla1.5

I have certain articles which I do not want Google to index. I have set them to noindex, nofollow in the metadata settings in the back end for each article. However, even though the article itself no longer shows up, a PDF version of it does.
Any thoughts on how to prevent them from being indexed?

You have two choices:
Disable the PDF button: go to Articles -> Configuration and look for the PDF icon option.
Edit the robots.txt file located in the root folder of your Joomla installation and add the following lines:
User-agent: Googlebot
Disallow: /index.php?view=article*&format=pdf
Disallow: /index.php?view=article*&print=1*
Disallow: /index.php?option=com_mailto*
Disallow: /component/mailto/*
User-agent: Slurp
Disallow: /index.php?view=article*&format=pdf
Disallow: /index.php?view=article*&print=1*
Disallow: /index.php?option=com_mailto*
Disallow: /component/mailto/*
Either of these options should solve the problem; the latter also gets rid of the duplicate content created by the "send to a friend" and print views.
Also bear in mind that it will take some time for Google and other bots to update the way they have indexed your site.
Regards

It depends on where the PDF files are stored. You can use the robots.txt file to exclude the folder where the PDF files are located. You could also add rel="nofollow" to the links to the PDF files.
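For example, if the PDF files lived in a hypothetical /pdfs/ folder, the robots.txt rule would be:
User-agent: *
Disallow: /pdfs/
and a nofollow hint on a link would look like:
<a href="/pdfs/some-article.pdf" rel="nofollow">Download PDF</a>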
Some links for reading:
http://www.useit.com/alertbox/20030728_spidering.html
http://www.webmasterworld.com/robots_txt/3741109.htm

Related

Robots.txt - prevent indexing of .html files

I want to prevent indexing of *.html files on our site, so that only the clean URLs are indexed.
So I would like www.example.com/en/login indexed but not www.example.com/en/login/index.html
Currently I have:
User-agent: *
Disallow: /
Disallow: /**.html - not working
Allow: /$
Allow: /*/login*
I know I can just disallow individual files, e.g. Disallow: /*/login/index.html, but my issue is that I have a number of these .html files that I do not want indexed, so I wondered if there was a way to disallow them all instead of listing them individually.
First of all, you keep using the word "indexed", so I want to make sure you're aware that the robots.txt convention is only about suggesting to automated crawlers that they avoid certain URLs on your domain; pages listed in a robots.txt file can still show up in search engine indexes if the engines have other data about them. For instance, Google explicitly states that they may still index and list a URL even if they're not allowed to crawl it. I just wanted you to be aware of that in case you are using the word "indexed" to mean "listed in a search engine" rather than "getting crawled by an automated program".
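If the goal really is to keep a page out of the results, the usual approach is a robots meta tag in the page's head rather than a robots.txt rule, since it only works when crawlers are allowed to fetch the page and see it:
<meta name="robots" content="noindex">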
Secondly, there's no standard way to accomplish what you're asking for. Per "The Web Robots Pages":
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
That being said, it's a common extension that many crawlers do support. For example, in Google's documentation of the directives they support, they describe pattern-matching support that does handle using * as a wildcard. So you could add a Disallow: /*.html$ directive and Google would then not crawl URLs ending with .html, though they could still end up in search results.
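A minimal sketch of that rule, for crawlers that understand the wildcard extension:
User-agent: Googlebot
Disallow: /*.html$
The $ anchors the match to the end of the URL, so /en/login is unaffected while /en/login/index.html is disallowed.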
But if your primary goal is telling search engines which URL you consider "clean" and preferred, then what you're actually looking for is canonical URLs. You can put a link rel="canonical" element on each page with your preferred URL for that page, and search engines that honor that element will use it to determine which URL to show for that page.
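For example, www.example.com/en/login/index.html would carry an element pointing at the clean URL:
<link rel="canonical" href="http://www.example.com/en/login">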

How to block multiple links in robots.txt with one line?

I have many pages whose links are as follows:
http://site.com/school_flower/
http://site.com/school_rose/
http://site.com/school_pink/
etc.
I can't block them manually.
How can I block these kinds of pages when I have hundreds of links of the above type and don't want to write a separate line for each one?
You can't.
robots.txt is a very simple format, but you can create a tool that will generate that file for you. That should be fairly easy: if you have a list of URLs to be blocked, one per line, you just have to prepend Disallow: to each one.
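A minimal sketch of such a generator in PHP, assuming the paths to block sit one per line in a hypothetical urls.txt:
<?php
// Read a plain list of paths (one per line) and turn each of them
// into a Disallow rule that applies to every crawler.
$paths = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$rules = "User-agent: *\n";
foreach ($paths as $path) {
    $rules .= 'Disallow: ' . trim($path) . "\n";
}

file_put_contents('robots.txt', $rules);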
That said, the fact that you want to block many URLs is a warning sign; you are probably doing something wrong. You could ask a question about your ultimate goal and get a better solution.
Continuing from my comment:
user-agent: *
Disallow: /folder/
Of course you'll have to place all files you don't want robots to access under a single directory, unless you block the entire site by Disallow: /
In response to your comment: kirelagin has provided the correct answer.

Can I use robots.txt to block any directory tree that starts with numbers?

I'm not even sure if this is the best way to handle this, but I made a temporary mistake with my rewrites and Google (possibly other search engines) picked up on it; now it has those URLs indexed and keeps reporting errors.
Basically, I'm generating URLs based on a variety of factors, one being the id of an article, which is automatically generated. These then redirect to the correct spot.
I had first accidentally set up stuff like this:
/2343/news/blahblahblah
/7645/reviews/blahblahblah
Etc.
This was a problem for a lot of reasons, the main one being that there were duplicates and stuff wasn't pointing to the right places and yada yada. I have now fixed them to this:
/news/2343/blahblahblah
/reviews/7645/blahblahblah
Etc.
And that's all good. But I want to block anything that falls into the pattern of the first. In other words, anything that looks like this:
** = any numerical pattern
/**/anythingelsehere
So that Google (and any others who have maybe indexed the wrong stuff) stops trying to look for these URLs that were all messed up and that don't even exist anymore. Is this possible? Should I even be doing this through robots.txt?
You don't need to set up robots.txt rules for that; just return 404 errors for those URLs and Google and other search engines will eventually drop them.
Google also has Webmaster Tools, which you can use to remove URLs from the index. I'm pretty sure other search engines have similar tools.
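If the URLs go through a PHP front controller, a minimal sketch of that check might look like this (the pattern and setup are assumptions, not something from the question):
<?php
// The old, broken scheme put a numeric id first (/2343/news/...), so
// answer those requests with 410 Gone (404 works too) and crawlers
// will eventually drop them from their indexes.
if (preg_match('#^/\d+/#', $_SERVER['REQUEST_URI'])) {
    http_response_code(410);
    exit;
}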
To answer the question: Yes, you can block any URLs that start with a number.
User-agent: *
Disallow: /0
Disallow: /1
Disallow: /2
Disallow: /3
Disallow: /4
Disallow: /5
Disallow: /6
Disallow: /7
Disallow: /8
Disallow: /9
It would block URLs like:
example.com/1
example.com/2.html
example.com/3/foo
example.com/4you
example.com/52347612
These URLs would still be allowed:
example.com/foo/1
example.com/foo2.html
example.com/bar/3/foo
example.com/only4you

Make a PHP web crawler respect the robots.txt file of any website

I have developed a web crawler and now I want to respect the robots.txt files of the websites that I am crawling.
I see that this is the robots.txt file structure:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
I can read it line by line and then use explode with the space character as a delimiter to find the data.
Is there any other way to load the entire file?
Do these kinds of files have a query language, like XPath for XML?
Or do I have to interpret the entire file myself?
Any help is welcome, even links or pointers to duplicate questions.
The structure is very simple, so the best thing you can do is probably parse the file on your own. I would read it line by line and, as you said, look for keywords like User-agent, Disallow, etc.
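A rough sketch along those lines, simplified to collect only the Disallow rules that apply to a given agent (it ignores Allow, wildcards, Crawl-delay and groups with several User-agent lines):
<?php
// Minimal robots.txt reader: returns the Disallow paths that apply to
// $agent, or to the "*" group. Comments and blank lines are skipped.
function disallowedPaths($robotsTxt, $agent)
{
    $paths   = array();
    $applies = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // A group applies if its token is "*" or appears in our agent name.
            $applies = ($value === '*' || stripos($agent, $value) !== false);
        } elseif ($field === 'disallow' && $applies && $value !== '') {
            $paths[] = $value;
        }
    }
    return $paths;
}

// Usage: fetch the file once per host, then check each URL path against
// the returned prefixes before crawling it.
$blocked = disallowedPaths(file_get_contents('http://example.com/robots.txt'), 'MyCrawler');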

Robots.txt Disallow Certain Folder Names

I want to disallow robots from crawling any folder named this-folder, wherever it appears in the URL.
Examples to disallow:
http://mysite.com/this-folder/
http://mysite.com/houses/this-folder/
http://mysite.com/some-other/this-folder/
http://mysite.com/no-robots/this-folder/
This is my attempt:
Disallow: /.*this-folder/
Will this work?
Officially, globbing and regular expressions are not supported (see http://www.robotstxt.org/robotstxt.html), but apparently some search engines do support wildcard patterns.
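For crawlers that do support wildcards (Googlebot, for example), the pattern is a plain * rather than the regex-style .*, so a closer sketch for the URLs above would be:
User-agent: *
Disallow: /this-folder/
Disallow: /*/this-folder/
The first rule covers this-folder at the root; the second covers it at any deeper position.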