Handling 404 and internal server errors with Perl WWW::Mechanize

I am using WWW::Mechanize to crawl sites, and it works great except that sometimes it hits a page that returns error code 404 or 500 (not found or internal server error), and then my script just exits and stops running. This is really messing with my data collection, so is there any way to make WWW::Mechanize catch these errors and tell me what kind of error code was returned (i.e. 404, 500, etc.)? Thanks for the help!

You need to disable autocheck:
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 stops Mechanize from dying on 4xx/5xx responses
my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->get("http://somedomain.com");
if ( $mech->success() ) {
    ...
}
else {
    print "status is: " . $mech->status . "\n";
}
Also, as an aside, have a look at WWW::Mechanize::Cached::GZip and WWW::Mechanize::Cached to speed up your development when testing your mech scripts.
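Since WWW::Mechanize::Cached is a drop-in subclass of WWW::Mechanize, trying it is typically a one-line change (a minimal sketch):
use WWW::Mechanize::Cached;

# Same interface as WWW::Mechanize, but responses are cached, so
# repeated test runs against the same pages are much faster.
my $mech = WWW::Mechanize::Cached->new( autocheck => 0 );
$mech->get("http://somedomain.com");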

Turn off autocheck and manually check status(), which returns the HTTP status code of the response.
This is a 3-digit number like 200 for OK, 404 for Not Found, and so on.
use strict;
use warnings;
use WWW::Mechanize;
my $url = 'http://...';
my $mech = WWW::Mechanize->new(autocheck => 0);
$mech->get($url);
print $mech->status();
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for Status Code Definitions.
If the status code is 400 or above, then you got an error.
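Putting the two together, a sketch that branches on the common cases (the URL is a placeholder):
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->get('http://somedomain.com/some-page');

my $status = $mech->status();
if ( $status == 404 ) {
    warn "Not found - skip this page\n";
}
elsif ( $status == 500 ) {
    warn "Server error - maybe retry later\n";
}
elsif ( $status >= 400 ) {
    warn "Other HTTP error: $status\n";
}
else {
    # 2xx/3xx: safe to process the content
    print $mech->content();
}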

Related

Capture specific Internal Server Errors in Mod_perl

My Perl application currently has basic logging. I am using mod_perl and CGI.
When an error occurs in a script, the user gets an internal server error page. I know that CPAN's CGI module has an error reporting feature, but that feature seems to be for an individual script. What I want is to capture a stack trace globally and then report it to the user on a webpage, so that when this happens the user gets a nice 'something went wrong' page rather than a clueless Internal Server Error page.
I handled this nicely by creating a bootstrap.pl script. Using mod_rewrite, I redirected all scripts to bootstrap.pl and passed the actual URL as a parameter. In bootstrap.pl I handle sessions, logging, etc., and at the end I eval() the script named in the URL. The part most interesting for you is the error handling:
use Devel::StackTrace;
use Data::Dumper;

use ex::override GLOBAL_die => sub {
    my $stackTrace = Devel::StackTrace->new( no_refs => 1 )->as_string;
    if ( $stackTrace =~ /eval \{/ ) {
        # die() inside an eval block: let it propagate normally
        CORE::die @_;
    }
    else {
        local *__ANON__ = "Exception";
        select STDOUT;
        # getError500 is our helper that renders the error page from
        # the template stored after __DATA__
        print getError500( sprintf(
            join( "\n", <DATA> ),
            @_,
            $stackTrace,
            Dumper( \%ENV )
        ));
        CORE::exit 1;
    }
};
This overrides die and prints a nice error 500 page with the full stack trace on any uncaught exception. I tried many things and this looks like the best approach for me.
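For readers wondering what the surrounding bootstrap.pl looks like, here is a rough sketch of the dispatch described above; the 'script' parameter name and the bare eval are illustrative assumptions, not the author's exact code:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Hypothetical sketch: assume mod_rewrite passed the original script
# path in the 'script' parameter (the parameter name is an assumption).
my $q      = CGI->new;
my $script = $q->param('script');

# ... sessions, logging, etc. would be set up here ...

# Slurp and run the real script; any uncaught die() is then turned
# into the friendly 500 page by the GLOBAL_die override above.
my $code = do {
    open my $fh, '<', $script or die "Cannot open $script: $!";
    local $/;
    <$fh>;
};
eval $code;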

Processing external page with perl CGI or act as a reverse proxy

There is a page residing on a local server running Apache. I would like to submit a form via a GET request with a single name/value pair, like:
id=item1234
This GET request has to be processed by another server, which I don't control, and which then returns a page that I would like to transform with a CGI script. In other words:
User submits form
MY apache proxies to external resource
EXTERNAL resource throws back a page
MY apache transforms it with a CGI (maybe another way?)
User gets a modified page
Again, this is more of an architectural question, so I'd be grateful for any hints; even pointers to some guides would help, as I wasn't able to structure my Google query well enough to locate anything related.
Thanks.
Pass the id "17929632" to this CGI code ("proxy.pl?id=17929632"), and you should see this exact page in your browser.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use CGI::Pretty qw(:standard -any -no_xhtml -oldstyle_urls);

print header;
print "<html>\n";
print " <head><title>Proxy Demo</title></head>\n";
print " <body bgcolor=\"white\">\n";

my $id = param('id') || die "No CGI param 'id'\n";

my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");

# Create a request
my $req = HTTP::Request->new(GET => "http://stackoverflow.com/questions/$id");

# Pass request to the user agent and get a response back
my $response = $ua->request($req);

# Check the outcome of the response
if ($response->is_success) {
    my $content = $response->content;
    # Modify the original content here!
    print $content;
}
else {
    print $response->status_line;
}
print "</body></html>\n";
Vague question, vague answer: write your CGI program to include an HTTP user agent, e.g. LWP.

500 Internal Server Error in perl-cgi program

I am getting the error "Internal Server Error. The server encountered an internal error or misconfiguration and was unable to complete your request."
I am submitting a form in HTML and getting its values.
HTML Code (index.cgi)
#!c:/perl/bin/perl.exe
print "Content-type: text/html; charset=iso-8859-1\n\n";
print "<html>";
print "<body>";
print "<form name='login' method='get' action='/cgi-bin/login.pl'>";
print "<input type='text' name='uid'><br />";
print "<input type='text' name='pass'><br />";
print "<input type='submit'>";
print "</form>";
print "</body>";
print "</html>";
Perl Code to fetch data (login.pl)
#!c:/perl/bin/perl.exe
use CGI::Carp qw(fatalsToBrowser);

my(%frmfields);
getdata(\%frmfields);

sub getdata {
    my ($buffer) = "";
    if (($ENV{'REQUEST_METHOD'} eq 'GET')) {
        my (%hashref) = shift;
        $buffer = $ENV{'QUERY_STRING'};
        foreach (split(/&/,$buffer)) {
            my ($key, $value) = split(/=/, $_);
            $key = decodeURL($key);
            $value= decodeURL($value);
            $hashref{$key} = $value;
        }
    }
    else{
        read(STDIN,$buffer,$ENV{'CONTENT_LENGTH'})
    }
}

sub decodeURL{
    $_=shift;
    tr/+/ /;
    s/%(..)/pack('c', hex($1))/eg;
    return($_);
}
The HTML page opens correctly, but when I submit the form, I get an internal server error.
Please help.
What does the web server's error log say?
Independent of what it says, you must stop parsing the form data yourself. There are modules for that, specifically CGI.pm. Using that, you can do this instead:
use CGI;
my $CGI = CGI->new();
my $uid = $CGI->param( 'uid' );
my $pass = $CGI->param( 'pass' );
# rest of your script
Much cleaner and much safer.
I agree with Tore that you must not parse this yourself. Your code has multiple errors. You don't allow multiple parameter values, you don't allow the ; alternate separator, you don't handle POST with a query string in the URL, and so on.
I don't know how long it will be online for free, but chapter 15 of my new "Beginning Perl" book covers Web programming. That should get you started on some decent basics. Note that the online version is an early, rough draft. The actual book also includes Chapter 19 which has a complete Web app example.
Could it be this line that's the problem?
my (%hashref) = shift;
You're initialising a proper hash, but shift will give you a hash reference, since you called getdata(\%frmfields). You probably want this instead:
my $hashref = shift;
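With that change, the values also need to be stored through the reference rather than into the unrelated %hashref hash. A minimal corrected sketch of the GET branch, kept only to illustrate the bug (the advice above to use CGI.pm still stands):
sub getdata {
    my $hashref = shift;    # the \%frmfields reference
    if ( $ENV{'REQUEST_METHOD'} eq 'GET' ) {
        my $buffer = $ENV{'QUERY_STRING'};
        foreach ( split /&/, $buffer ) {
            my ( $key, $value ) = split /=/, $_;
            # store through the reference, not into a separate %hashref
            $hashref->{ decodeURL($key) } = decodeURL($value);
        }
    }
}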
"500 Internal Server Error" just means that something didn't work the way the web server expected. Maybe you don't have CGI enabled. Maybe the script isn't executable. Maybe it's in a directory the web server isn't allowed to access. It's even possible that maybe the web server ran the script successfully and it worked perfectly, but didn't start its output with a valid set of HTTP headers. You need to look in the web server's error log to find out what it didn't like, which may or may not be a Perl issue.
Like everyone else has said, though, don't try to parse query strings and grovel through %ENV yourself. Use one of the many fine modules or frameworks which are available and already known to work correctly. CGI.pm is the granddaddy of them all and works well for smaller projects, but I'd recommend looking into a proper web application framework such as Dancer, Mojolicious, or Catalyst (there are many others, but those are the big three) if you're planning to build anything with more than a handful of relatively simple pages and forms.
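For a sense of scale, a login route in Mojolicious::Lite takes only a few lines; this is a minimal sketch reusing the uid/pass field names from the question's form:
use Mojolicious::Lite;

# Accept GET or POST on /login with the same uid/pass fields
# as the question's HTML form.
any '/login' => sub {
    my $c    = shift;
    my $uid  = $c->param('uid');
    my $pass = $c->param('pass');
    $c->render( text => "Hello, $uid" );
};

app->start;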

How to obtain the 301/302 website redirect location from the http response and follow it?

I have been trying to obtain the 301/302 redirect location from the HTTP response using Perl's WWW::Mechanize, but have been having problems extracting it from the response using things like $response->header and so on.
Can anyone help with extracting the redirect location from the http responses from websites that use 301 or 302 redirects please?
I know what I want to do and how to do it once I have this redirect location URL, as I have done more complex things with Mechanize before, but I'm just having real problems getting the location (or any other response fields) from the HTTP response.
Your help would be much appreciated. Many thanks, CM
WWW::Mechanize should automatically follow redirects (unless you've told it not to via requests_redirectable), so you should not need to do anything.
EDIT: just to demonstrate:
DB<4> $mech = WWW::Mechanize->new;
DB<5> $mech->get('http://www.preshweb.co.uk/linkedin');
DB<6> x $mech->uri;
0 URI::http=SCALAR(0x903f990)
-> 'http://www.linkedin.com/in/bigpresh'
... as you can see, WWW::Mechanize followed the redirect, and ended up at the destination, automatically.
Updated with another example as requested:
DB<15> $mech = WWW::Mechanize->new;
DB<16> $mech->get('http://jjbsports.com/');
DB<17> x $mech->uri;
0 URI::http=SCALAR(0x90988f0)
-> 'http://www.jjbsports.com/'
DB<18> x substr $mech->content, 0, 40;
0 '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML'
DB<19> x $mech->title;
0 'JJB Sports | Trainers, Clothing, Football Kits, Football Boots, Running'
As you can see, it followed the redirect, and $mech->content is returning the content of the page. Does that help at all?
If it's a redirect, WWW::Mechanize will use $mech->redirect_ok() while request()ing to follow the redirect URL (this is an LWP method).
Note -
WWW::Mechanize's constructor pushes POST on to the agent's
requests_redirectable list
So you wouldn't have to worry about pushing POST to the requests_redirectable list.
If you want to be absolutely certain that the program is following your redirects and log every redirect (to a log file, say), you can use LWP's simple_request and HTTP::Response's is_redirect to detect redirects, something like this:
use strict;
use warnings;
use URI;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->stack_depth(0);

my $resp = $mech->simple_request( HTTP::Request->new( GET => 'http://www.googl.com/' ) );
if ( $resp->is_redirect ) {
    my $location = $resp->header("Location");
    my $uri      = URI->new($location);
    print "Got redirected to URL - $uri\n";
    $mech->get($uri);
    print $mech->content;
}
is_redirect will detect both 301 and 302 response codes.
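If a site chains several redirects and you want to log each hop yourself, you can loop on is_redirect; a sketch where the 10-hop cap and the URL are placeholders:
use strict;
use warnings;
use URI;
use HTTP::Request;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->stack_depth(0);

my $url  = 'http://www.example.com/';    # placeholder URL
my $hops = 0;

my $resp = $mech->simple_request( HTTP::Request->new( GET => $url ) );
while ( $resp->is_redirect && $hops++ < 10 ) {
    # Location may be relative, so resolve it against the current URL
    $url  = URI->new_abs( $resp->header('Location'), $url );
    print "Redirected to $url\n";
    $resp = $mech->simple_request( HTTP::Request->new( GET => $url ) );
}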

Site scraping in perl using WWW::Mechanize

I have used WWW::Mechanize in Perl for a site scraping application.
I have faced some difficulties when trying to log in to a particular site via WWW::Mechanize. I have gone through some examples of WWW::Mechanize, but I couldn't figure out my issue.
I have included my code below.
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTTP::Cookies;
use Crypt::SSLeay;

my $agent = WWW::Mechanize->new(noproxy => 0);
$agent->cookie_jar(HTTP::Cookies->new());
$agent->agent('Mozilla/5.0');
$agent->proxy(['https', 'http', 'ftp'], 'http://proxy.rcapl.com:3128');
$agent->get("http://www.facebook.com");

my $re = $agent->submit_form(
    form_number => 1,
    fields      => {
        Email  => 'xyz@gmail.com',
        Passwd => 'xyz',
    }
);
print $re->content();
When I run the code, it says:
Error POSTing https://www.facebook.com/login.php?login_attempt=1: Not Implemented at ./test.pl line 11
Can anybody tell me what's going wrong in the code? Do I need to set all the parameters that Facebook sends for login?
The proxy is faulty:
Error GETing http://www.facebook.com: Can't connect to proxy.rcapl.com:3128 (Bad hostname) at so11406791.pl line 11.
The program works for me without calling the proxy method. Remove that call.
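In other words, a minimal setup without the proxy call might look like this sketch (env_proxy is an optional LWP::UserAgent flag that honors http_proxy/https_proxy environment variables, in case a working proxy really is needed):
use strict;
use warnings;
use WWW::Mechanize;

# No explicit proxy() call; env_proxy tells the underlying
# LWP::UserAgent to honor http_proxy/https_proxy environment
# variables if (and only if) they are set.
my $agent = WWW::Mechanize->new( env_proxy => 1 );
$agent->agent('Mozilla/5.0');
$agent->get('http://www.facebook.com');
Note that WWW::Mechanize also sets up an in-memory cookie jar by default, so the explicit HTTP::Cookies setup isn't strictly needed either.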