Where to place the robots.txt file with G-WAN?

I want to disallow robots from crawling the csp folder and plan to use the following robots.txt file:
User-agent: *
Disallow: /csp
So my question is twofold:
Is the syntax correct for G-WAN?
With G-WAN, where should I place this file?

The well-documented robots.txt file should be placed in the /www G-WAN folder - if you want to use this feature. robots.txt is only a hint for robots, and many of them do not respect your will (so it's much safer to define file-system permissions, or to put an index.html file in the folders that you don't want to be browsed).
The /csp directory cannot be crawled by any HTTP client (including robots). Only the /www directory can.
This separation has worked pretty well in terms of simplicity, design and security so far, avoiding the pitfall of deciding what is executable and what is the presentation layer.
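So, with the robots.txt from the question, the layout would look roughly like this (paths abbreviated; your_servlet.c is just a placeholder name, assuming the usual setup where /csp and /www are sibling folders of the same host):
.../csp/your_servlet.c   <- scripts, never served to HTTP clients
.../www/robots.txt       <- static content, served to HTTP clients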

Related

Robots.txt - prevent index of .html files

I want to prevent indexing of *.html files on our site, so that only clean URLs are indexed.
So I would like www.example.com/en/login indexed but not www.example.com/en/login/index.html
Currently I have:
User-agent: *
Disallow: /
Disallow: /**.html - not working
Allow: /$
Allow: /*/login*
I know I can disallow individual pages, e.g. Disallow: /*/login/index.html, but my issue is that I have a number of these .html files that I do not want indexed - so I wondered whether there was a way to Disallow them all instead of listing them individually.
First of all, you keep using the word "indexed", so I want to ensure that you're aware that the robots.txt convention is only about suggesting to automated crawlers that they avoid certain URLs on your domain, but pages listed in a robots.txt file can still show up on search engine indexes if they have other data about the page. For instance, Google explicitly states they will still index and list a URL, even if they're not allowed to crawl it. I just wanted you to be aware of that in case you are using the word "indexed" to mean "listed in a search engine" rather than "getting crawled by an automated program".
Secondly, there's no standard way to accomplish what you're asking for. Per "The Web Robots Pages":
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
That being said, it's a common extension that many crawlers do support. For example, in Google's documentation of the directives they support, they describe pattern-matching support that does handle using * as a wildcard. So you could add a Disallow: /*.html$ directive and then Google would not crawl URLs ending with .html, though they could still end up in search results.
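For instance, a minimal sketch of such a file (this drops the blanket Disallow: / from the question, since that would also block the clean URLs, and it relies on the non-standard $ anchor, so verify it in Google's robots.txt tester before relying on it):
User-agent: *
Disallow: /*.html$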
But, if your primary goal is telling search engines what URL you consider "clean" and preferred, then what you're actually looking for is specifying Canonical URLs. You can put a link rel="canonical" element on each page with your preferred URL for that page, and search engines that use that element will use it in order to determine which path to prefer when displaying that page.
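For example, the page at www.example.com/en/login/index.html could carry this element in its head (URL taken from the question, scheme assumed):
<link rel="canonical" href="https://www.example.com/en/login">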

Allow all files in webroot, and disallow all directories unless specifically allowed

I'd like to disallow everything except:
All files in the web root
Specified directories in the web root.
I have seen this example in this answer:
Allow: /public/section1/
Disallow: /
But does the above allow crawling of all files in the web root?
I want to allow all files in the web root.
If you want to disallow directories without disallowing files, you will need to use wildcards:
User-agent: *
Allow: /public/section1/
Disallow: /*/
The above will allow all of the following:
http://example.com/
http://example.com/somefile
http://example.com/public/section1/
http://example.com/public/section1/somefile
http://example.com/public/section1/somedir/
http://example.com/public/section1/somedir/somefile
And it will disallow all of the following:
http://example.com/somedir/
http://example.com/somedir/somefile
http://example.com/somedir/otherdir/somefile
Just be aware that wildcards are not part of the original robots.txt specification, and are not supported by all crawlers. They are supported by all of the major search engines, but there are many other crawlers out there that don't support them.

How can I override robots.txt in a sub-folder?

I have a sub-domain for testing purposes. I have set robots.txt to disallow this folder.
Some of the results are still showing for some reason. I thought it may be because I hadn't set up the robots.txt originally and Google hadn't removed some of them yet.
Now I'm worried that the robots.txt files within the individual Joomla sites in this folder are causing Google to keep indexing them. Ideally I would like to stop that from happening, because I don't want to have to remember to switch robots.txt back to allow crawling when they go live (just in case).
Is there a way to override these explicitly with a robots.txt in a folder above this folder?
As far as a crawler is concerned, robots.txt exists only in the site's root directory. There is no concept of a hierarchy of robots.txt files.
So if you have http://example.com and http://foo.example.com, then you would need two different robots.txt files: one for example.com and one for foo.example.com. When Googlebot reads the robots.txt file for foo.example.com, it does not take into account the robots.txt for example.com.
When Googlebot is crawling example.com, it will not under any circumstances interpret the robots.txt file for foo.example.com. And when it's crawling foo.example.com, it will not interpret the robots.txt for example.com.
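So, to keep crawlers out of the whole test sub-domain, that sub-domain itself would need to serve a file like the following at its own root, e.g. http://foo.example.com/robots.txt (the hostname is just the placeholder used above):
User-agent: *
Disallow: /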
Does that answer your question?
More info
When Googlebot crawls foo.com, it will read foo.com/robots.txt and use the rules in that file. It will not read and follow the rules in foo.com/portfolio/robots.txt or foo.com/portfolio/mydummysite.com/robots.txt. See the first two sentences of my original answer.
I don't fully understand what you're trying to prevent, probably because I don't fully understand your site hierarchy. But you can't change a crawler's behavior on mydummysite.com by changing the robots.txt file at foo.com/robots.txt or foo.com/portfolio/robots.txt.

How to block multiple links in robots.txt with one line?

I have many pages whose links are as follows:
http://site.com/school_flower/
http://site.com/school_rose/
http://site.com/school_pink/
etc.
I can't block them manually.
How can I block these kinds of pages when I have hundreds of links of the above type and don't want to write a separate line for each one?
You can't.
robots.txt is a very simple format. But you can create a tool that will generate that file for you. That should be fairly easy: if you have a list of URLs to be blocked, one per line, you just have to prepend Disallow: to each line.
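As a rough sketch of such a tool (Python is used here purely for illustration, and the input file name urls.txt is an assumption - any language and input source will do):
# Minimal sketch: turn a list of URL paths (one per line in urls.txt, name assumed)
# into a robots.txt with one Disallow rule per path.
with open("urls.txt") as src, open("robots.txt", "w") as out:
    out.write("User-agent: *\n")
    for line in src:
        path = line.strip()
        if path:  # skip blank lines
            out.write("Disallow: " + path + "\n")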
That said, the fact that you want to block so many URLs is a warning sign: you are probably doing something wrong. You could ask a question about your ultimate goal and we would give you a better solution.
Continuing from my comment:
user-agent: *
Disallow: /folder/
Of course you'll have to place all the files you don't want robots to access under a single directory, unless you block the entire site with Disallow: /
In response to your comment, kirelagin has provided the correct answer.

Make a PHP web crawler respect the robots.txt file of any website

I have developed a web crawler and now I want it to respect the robots.txt files of the websites that I am crawling.
I see that this is the robots.txt file structure:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
I can read it line by line and then use explode with the space character as a delimiter to find the data.
Is there any other way to load the entire file?
Does this kind of file have a query language, the way XPath does for XML?
Or do I have to interpret the entire file myself?
Any help is welcome, even links or duplicates if found...
The structure is very simple, so the best thing you can do is probably parse the file on your own. I would read it line by line and, as you said, look for keywords like User-agent and Disallow.
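As a rough illustration of that line-by-line approach (sketched in Python; the same logic ports directly to PHP), assuming you only care about the rules in the User-agent: * group and ignoring wildcards and Allow lines:
# Minimal sketch: collect the Disallow rules that apply to all user agents ("*").
# A real crawler would also need Allow lines, wildcard handling, per-bot groups, etc.
def parse_robots(text):
    disallowed = []
    applies = False  # are we currently inside a "User-agent: *" group?
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or ":" not in line:
            continue
        field, value = [part.strip() for part in line.split(":", 1)]
        if field.lower() == "user-agent":
            applies = (value == "*")
        elif field.lower() == "disallow" and applies and value:
            disallowed.append(value)
    return disallowed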