Check if a URL is blocked by robots.txt using Perl

Can anybody show me sample code to check whether a URL has been blocked by robots.txt?
A robots.txt file can specify either a full URL or a directory.
Is there any helper function in Perl?

Check out WWW::RobotRules:
The following methods are provided:
$rules = WWW::RobotRules->new($robot_name)
This is the constructor for WWW::RobotRules objects. The first
argument given to new() is the name of the robot.
$rules->parse($robot_txt_url, $content, $fresh_until)
The parse() method takes as arguments the URL that was used to
retrieve the /robots.txt file, and the contents of the file.
$rules->allowed($uri)
Returns TRUE if this robot is allowed to retrieve this URL.

WWW::RobotRules is the standard class for parsing robots.txt files and then checking URLs to see if they're blocked.
You may also be interested in LWP::RobotUA, which integrates that into LWP::UserAgent, automatically fetching and checking robots.txt files as needed.
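For example, a minimal sketch that puts those three calls together (LWP::Simple is used here just to fetch the robots.txt; the robot name and the URLs are placeholders):

use strict;
use warnings;
use WWW::RobotRules;
use LWP::Simple qw(get);

# Placeholder robot name; use whatever identifies your crawler.
my $rules = WWW::RobotRules->new('MyBot/1.0');

# Fetch and parse the site's robots.txt.
my $robots_url = 'http://example.com/robots.txt';
my $content    = get($robots_url);
$rules->parse($robots_url, $content) if defined $content;

# Ask whether a given URL may be retrieved.
my $url = 'http://example.com/cgi-bin/somwhatelse.pl';
print $rules->allowed($url)
    ? "$url is allowed\n"
    : "$url is blocked by robots.txt\n";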

Load the robots.txt file and search for "Disallow:" lines in it. Then check whether the pattern that follows the Disallow: appears at the start of your URL's path. If so, the URL is banned by robots.txt.
Example -
You find the following line in the robots.txt:
Disallow: /cgi-bin/
Now remove the "Disallow: " prefix and check whether "/cgi-bin/" (the remaining part) appears directly after the domain.
If your URL looks like:
www.stackoverflow.com/cgi-bin/somwhatelse.pl
it is banned.
If your URL looks like:
www.stackoverflow.com/somwhatelse.pl
it is ok. You'll find the complete set of rules at http://www.robotstxt.org/. This is the way to go if you cannot install additional modules for any reason.
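A rough Perl sketch of that manual check (it ignores User-agent sections, Allow lines and wildcards, so treat it as an illustration only):

use strict;
use warnings;
use URI;

# $robots_txt would normally be fetched from the site first; this is a stand-in.
my $robots_txt = "User-agent: *\nDisallow: /cgi-bin/\n";
my $url        = 'http://www.stackoverflow.com/cgi-bin/somwhatelse.pl';

# Collect every path prefix that appears after "Disallow:".
my @disallowed;
for my $line (split /\n/, $robots_txt) {
    push @disallowed, $1 if $line =~ /^\s*Disallow:\s*(\S+)/i;
}

# The URL is banned if its path starts with any of those prefixes.
my $path   = URI->new($url)->path;
my $banned = grep { index($path, $_) == 0 } @disallowed;
print $banned ? "banned by robots.txt\n" : "ok\n";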
Better would be to use a module from CPAN.
There is a great module on CPAN that I use to deal with this: LWP::RobotUA. LWP (libwww-perl) is, imho, the standard for web access in Perl, and this module is part of it and ensures your crawler behaves nicely.
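For instance (the agent name and e-mail address are placeholders; a URL disallowed by robots.txt should come back as a 403 "Forbidden by robots.txt" response):

use strict;
use warnings;
use LWP::RobotUA;

my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');
$ua->delay(1);    # minutes to wait between requests to the same host

my $response = $ua->get('http://example.com/cgi-bin/somwhatelse.pl');
if ($response->is_success) {
    print "Fetched OK\n";
} else {
    print "Not fetched: ", $response->status_line, "\n";
}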

Hum, you don't seem to have even looked! On the first page of search results, I see various download engines that handle robots.txt automatically for you, and at least one that does exactly what you asked.

Note that WWW::RobotRules skips wildcard ("substring") rules. Given:
User-agent: *
Disallow: *anytext*
the URL http://example.com/some_anytext.html is passed (not banned).

Related

Disallow dynamic URL in robots.txt

Our URL is:
http://example.com/kitchen-knife/collection/maitre-universal-cutting-boards-rana-parsley-chopper-cheese-slicer-vegetables-knife-sharpening-stone-ham-stand-ham-stand-riviera-niza-knives-block-benin.html
I want to disallow URLs from being crawled after /collection, but the categories that come before collection are generated dynamically.
How would I disallow URLs in robots.txt after /collection?
This is not possible in the original robots.txt specification.
But some (!) parsers extend the specification and define a wildcard character (typically *).
For those parsers, you could use:
Disallow: /*/collection/
Parsers that understand * as a wildcard will stop crawling any URL whose path starts with /, followed by anything, followed by /collection/, followed by anything, e.g.,
http://example.com/foo/collection/
http://example.com/foo/collection/bar
Parsers that don’t understand * as a wildcard (i.e., they follow the original specification) will stop crawling any URL whose path starts with the literal string /*/collection/, e.g.,
http://example.com/*/collection/
http://example.com/*/collection/bar
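For the wildcard-aware parsers, a complete minimal robots.txt using this rule might look like the following (the User-agent: * line applies it to all crawlers):

User-agent: *
Disallow: /*/collection/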

How do I know what to name a file downloaded using HTTP?

I am creating an HTTP client downloader in Python. I am able to correctly download a file such as http://www.google.com/images/srpr/logo11w.png just fine. However, I'm not sure what to actually name the thing.
There is of course the filename at the end of the URL, but is this always reliable?
If I recall correctly, wget uses the following heuristic:
If a Content-Disposition header exists, get the filename from there.
If the filename component of the URL exists (e.g. http://myserver/filename), use that.
If there is no filename component (e.g. http://www.google.com), derive the filename from the Content-Type header (such as index.html for text/html).
In all cases, if this filename is already present in the directory, use a numerical suffix, such as index (1).html, or overwrite, depending on configuration.
There are plenty of other flags that control other heuristics, such as creating .html for ASP/DHTML content-types.
In short, it really depends how far you want to go. For most people, doing the first two + basic Content-Type->name mapping should be enough.
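A rough sketch of the first three steps (in Perl; pick_filename is a made-up helper and the Content-Disposition parsing is deliberately naive):

use strict;
use warnings;
use URI;

sub pick_filename {
    my ($url, $content_disposition, $content_type) = @_;

    # 1. A Content-Disposition header carrying a filename wins.
    if (defined $content_disposition
        && $content_disposition =~ /filename="?([^";]+)"?/i) {
        return $1;
    }

    # 2. Otherwise take the last path segment of the URL, if there is one.
    my $last = (URI->new($url)->path_segments)[-1];
    return $last if defined $last && length $last;

    # 3. Fall back to a name derived from the Content-Type.
    return 'index.html' if defined $content_type && $content_type =~ m{^text/html};
    return 'download';
}

# Prints "logo11w.png".
print pick_filename('http://www.google.com/images/srpr/logo11w.png',
                    undef, 'image/png'), "\n";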

Does wget check whether the specified user agent is allowed in robots.txt?

If I specify a custom user agent for wget, e.g. "MyBot (info#mybot...)", will wget check for that name in robots.txt as well (in case the bot is banned there), or only for the general robot exclusions?
No, if you specify your own user agent, Wget does not check for it in the robots.txt file. In fact, I believe I've found another bug in Wget while trying to answer your question. Even if you specify a custom User Agent, Wget seems to adhere to its own User Agent rules when parsing robots.txt. I have created a test case for this and will fix the implementation in Wget ASAP.
Now for the authoritative answer to your original question. The answer is no, because in the source of Wget you see the following comment preceding the function that parses the robots.txt file for rules:
/* Parse textual RES specs beginning with SOURCE of length LENGTH.
   Return a specs objects ready to be fed to res_match_path.

   The parsing itself is trivial, but creating a correct SPECS object
   is trickier than it seems, because RES is surprisingly byzantine if
   you attempt to implement it correctly.

   A "record" is a block of one or more `User-Agent' lines followed by
   one or more `Allow' or `Disallow' lines.  Record is accepted by Wget
   if one of the `User-Agent' lines was "wget", or if the user agent
   line was "*".

   After all the lines have been read, we examine whether an exact
   ("wget") user-agent field was specified.  If so, we delete all the
   lines read under "User-Agent: *" blocks because we have our own
   Wget-specific blocks.  This enables the admin to say:

       User-Agent: *
       Disallow: /

       User-Agent: google
       User-Agent: wget
       Disallow: /cgi-bin

   This means that to Wget and to Google, /cgi-bin is disallowed,
   whereas for all other crawlers, everything is disallowed.
   res_parse is implemented so that the order of records doesn't
   matter.  In the case above, the "User-Agent: *" could have come
   after the other one.  */

Disallow URLs with empty parameters in robots.txt

Normally I have this URL structure:
http://example.com/team/name/16356**
But sometimes my CMS generates URLs without name:
http://example.com/team//16356**
and then it’s 404.
How can I disallow such URLs when the name segment is empty?
It could probably be done with some pattern for the empty segment, but I don’t want to mess things up with Googlebot, so it’s better to get it right from the beginning.
If you want to block URLs like http://example.com/team//16356**, where the number part can be different, you could use the following robots.txt:
User-agent: *
Disallow: /team//
This will block crawling of any URL whose path starts with /team//.

Help to correctly create robots.txt

I have dynamic URLs like this:
mydomain.com/?pg=login
mydomain.com/?pg=reguser
mydomain.com/?pg=aboutus
mydomain.com/?pg=termsofuse
When a page is requested, for example mydomainname.com/?pg=login, index.php includes the login.php file.
Some of the URLs have been converted to static URLs, like:
mydomain.com/aboutus.html
mydomain.com/termsofuse.html
I need to allow indexing of mydomainname.com/aboutus.html and mydomainname.com/termsofuse.html,
and disallow mydomainname.com/?pg=login and mydomainname.com/?pg=reguser. Please help me manage my robots.txt file.
I also have mydomainname.com/posted.php?details=50 (details can be any number), which I converted to mydomainname.com/details/50.html.
I also need to allow all URLs of this type.
If you wish to only index your static pages, you can use this:
Disallow: /*?
This will disallow all URLs which contain a question mark.
If you wish to keep indexing posted.php?details=50 URLs, and you have a finite set of params you wish to disallow, you can create a disallow entry for each, like this:
Disallow: /?pg=login
Or just prevent everything starting with /?
Disallow: /?*
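Putting it together, a complete robots.txt for the strict "static pages only" variant would be something like:

User-agent: *
Disallow: /*?

or, for the conservative per-parameter variant:

User-agent: *
Disallow: /?pg=login
Disallow: /?pg=reguser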
You can use a tool like this to test a sampling of URLs to see if it will match them or not.
http://tools.seobook.com/robots-txt/analyzer/