Why does my Perl script using WWW-Mechanize fail intermittently? - perl

I am trying to write a Perl script using WWW::Mechanize.
Here is my code:
use DBI;
use JSON;
use WWW::Mechanize;

sub fetch_companies_list
{
    my $url = shift;
    my $browser = WWW::Mechanize->new( stack_depth => 0 );
    my ($content, $json, $parsed_text, $company_name, $company_url);

    eval
    {
        print "Getting the companies list...\n";
        $browser->get( $url );
        # die "Can't get the companies list.\n" unless( $browser->status );

        $content = $browser->content();
        # die "Can't get companies names.\n" unless( $browser->status );

        $json = JSON->new;
        $parsed_text = $json->allow_nonref->utf8->relaxed->escape_slash->loose->allow_singlequote->allow_barekey->decode( $content );

        foreach ( @$parsed_text )
        {
            $company_name = $_->{name};
            fetch_company_info( $company_name, $browser );
        }
    };
}

fetch_companies_list( "http://api.crunchbase.com/v/1/companies.js" );
The problem is as follows:
I start the script and it finishes fine.
I restart the script. The script fails in $browser->get().
I have to wait some time (about 5 minutes) and then it will start working again.
I am working on Linux and have WWW::Mechanize version 1.66.
Any idea what the problem might be? I don't have a firewall installed, either on the computer or on my router.
Moreover, uncommenting the "die ..." line does not help, since the script is stopping inside the get() call. I can try to upgrade to the latest version, which is 1.71, but I'd like to know if someone else has experienced this with this Perl module.

5 minutes (300 seconds) is the default timeout. Exactly what timed out will be returned in the response's status line.
my $response = $mech->res;
if ( !$response->is_success() ) {
    die( $response->status_line() );
}
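If you would rather fail fast than wait out the default, you can also pass a shorter timeout when constructing the object. A minimal sketch (the 30-second value is an arbitrary choice; WWW::Mechanize forwards constructor options it doesn't recognize to LWP::UserAgent):

my $browser = WWW::Mechanize->new(
    stack_depth => 0,
    timeout     => 30,   # give up on a request after 30 seconds
);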

This is a target-site issue. Right now it is returning:
503 Service Unavailable No server is available to handle this request.

Retry with a wait between attempts, and back off exponentially:

## set the maximum number of tries
my $retries = 10;
## number of seconds to sleep
my $sleep = 1;
do {
    eval {
        print "Getting the companies list...\n";
        $browser->get($url);
        # die "Can't get the companies list.\n" unless( $browser->status );

        $content = $browser->content();
        # die "Can't get companies names.\n" unless( $browser->status );

        $json = JSON->new;
        $parsed_text = $json->allow_nonref->utf8->relaxed->escape_slash->loose->allow_singlequote->allow_barekey->decode($content);

        foreach ( @$parsed_text ) {
            $company_name = $_->{name};
            fetch_company_info( $company_name, $browser );
        }
    };
    if ($@) {
        warn $@;
        ## rest for some time
        sleep($sleep);
        ## increase the value of $sleep exponentially
        $sleep *= 2;
    }
} while ( $@ && $retries-- );

Related

Understanding async in perl on specific example

I have to write a script that gets some URLs in parallel and does some work. In the past I have always used Parallel::ForkManager for such things, but now I wanted to learn something new and try asynchronous programming with AnyEvent (and AnyEvent::HTTP or AnyEvent::Curl::Multi) ... but I'm having a problem understanding AnyEvent and writing a script that should:
open a file (every line is a separate URL)
(from now on in parallel, but with a limit of e.g. 10 concurrent requests)
read the file line after line (I don't want to load the whole file into memory - it might be big)
make an HTTP request for that URL
read the response
update the MySQL record accordingly
(next file line)
I have read many manuals and tutorials, but it's still hard for me to understand the differences between blocking and non-blocking code. I have found a similar script at http://perlmaven.com/fetching-several-web-pages-in-parallel-using-anyevent, where Mr. Szabo explains the basics, but I still can't understand how to implement something like:
...
open my $fh, "<", $file;
while ( my $line = <$fh> )
{
    # http request, read response, update MySQL
}
close $fh;
...
... and add a concurrency limit in this case.
I would be very grateful for help ;)
UPDATE
Following Ikegami's advice I gave Net::Curl::Multi a try. I'm very pleased with the results. After years of using Parallel::ForkManager just for concurrently grabbing thousands of URLs, Net::Curl::Multi seems to be awesome.
Here is my code with a while loop on the filehandle. It seems to work as it should, but considering it's my first time writing something like this, I would like to ask more experienced Perl users to take a look and tell me if there are potential bugs, something I missed, etc.
Also, if I may ask: as I don't fully understand how Net::Curl::Multi's concurrency works, please tell me whether I should expect any problems with putting a MySQL UPDATE command (via DBI) inside the RESPONSE loop (besides higher server load, obviously - I expect the final script to run with about 50 concurrent N::C::M workers, maybe more).
#!/usr/bin/perl
use Net::Curl::Easy qw( :constants );
use Net::Curl::Multi qw( );

sub make_request {
    my ( $url ) = @_;
    my $easy = Net::Curl::Easy->new();
    $easy->{url} = $url;
    $easy->setopt( CURLOPT_URL,        $url );
    $easy->setopt( CURLOPT_HEADERDATA, \$easy->{head} );
    $easy->setopt( CURLOPT_FILE,       \$easy->{body} );
    return $easy;
}

my $maxWorkers = 10;

my $multi = Net::Curl::Multi->new();
my $workers = 0;

my $i = 1;
open my $fh, "<", "urls.txt";
LINE: while ( my $url = <$fh> )
{
    chomp( $url );
    $url .= "?$i";
    print "($i) $url\n";
    my $easy = make_request( $url );
    $multi->add_handle( $easy );
    $workers++;

    my $running = 0;
    do {
        my ($r, $w, $e) = $multi->fdset();
        my $timeout = $multi->timeout();    # milliseconds
        select $r, $w, $e, $timeout / 1000
            if $timeout > 0;

        $running = $multi->perform();
        RESPONSE: while ( my ( $msg, $easy, $result ) = $multi->info_read() ) {
            $multi->remove_handle( $easy );
            $workers--;
            printf( "%s getting %s\n", $easy->getinfo( CURLINFO_RESPONSE_CODE ), $easy->{url} );
        }

        # don't max out the CPU while waiting
        select( undef, undef, undef, 0.01 );
    # keep polling while the pool is full, or, at EOF, until all transfers finish
    } while ( $workers == $maxWorkers || ( eof && $running ) );

    $i++;
}
close $fh;
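As for the DBI question above, a minimal sketch of what is being asked about (the $dbh handle and the urls table here are hypothetical). Net::Curl::Multi runs everything in the one calling process, so a DBI call inside the RESPONSE loop should be safe; the trade-off is that no transfers are serviced while the statement runs:

my $sth = $dbh->prepare('UPDATE urls SET status = ? WHERE url = ?');

RESPONSE: while ( my ( $msg, $easy, $result ) = $multi->info_read() ) {
    $multi->remove_handle( $easy );
    $workers--;
    # blocking: the event loop pauses for the duration of this statement
    $sth->execute( $easy->getinfo( CURLINFO_RESPONSE_CODE ), $easy->{url} );
}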
Net::Curl is a rather good library that's extremely fast. Furthermore, it can handle parallel requests too! I'd recommend using this instead of AnyEvent.
use Net::Curl::Easy qw( :constants );
use Net::Curl::Multi qw( );

sub make_request {
    my ( $url ) = @_;
    my $easy = Net::Curl::Easy->new();
    $easy->{url} = $url;
    $easy->setopt( CURLOPT_URL,        $url );
    $easy->setopt( CURLOPT_HEADERDATA, \$easy->{head} );
    $easy->setopt( CURLOPT_FILE,       \$easy->{body} );
    return $easy;
}

my $max_running = 10;
my @urls = ( 'http://www.google.com/' );

my $multi = Net::Curl::Multi->new();

my $running = 0;
while (1) {
    # keep the pipeline full
    while ( @urls && $running < $max_running ) {
        my $easy = make_request( shift( @urls ) );
        $multi->add_handle( $easy );
        ++$running;
    }

    last if !$running;

    my ( $r, $w, $e ) = $multi->fdset();
    my $timeout = $multi->timeout();    # milliseconds
    select( $r, $w, $e, $timeout / 1000 )
        if $timeout > 0;

    $running = $multi->perform();
    while ( my ( $msg, $easy, $result ) = $multi->info_read() ) {
        $multi->remove_handle( $easy );
        printf( "%s getting %s\n", $easy->getinfo( CURLINFO_RESPONSE_CODE ), $easy->{url} );
    }
}
This does exactly what you want, in an asynchronous fashion, and it does so by wrapping Net::Curl in a safe fashion (YADA is a queue-oriented wrapper around Net::Curl::Multi):
#!/usr/bin/env perl
package MyDownloader;
use strict;
use warnings qw(all);

use Moo;
extends 'YADA::Worker';

has '+use_stats' => (default => sub { 1 });
has '+retry'     => (default => sub { 10 });

after init => sub {
    my ($self) = @_;

    $self->setopt(
        encoding => '',
        verbose  => 1,
    );
};

after finish => sub {
    my ($self, $result) = @_;

    if ($self->has_error) {
        print "ERROR: $result\n";
    } else {
        # do the interesting stuff here
        printf "Finished downloading %s: %d bytes\n", $self->final_url, length ${$self->data};
    }
};

around has_error => sub {
    my $orig = shift;
    my $self = shift;

    return 1 if $self->$orig(@_);
    return 1 if $self->getinfo('response_code') =~ m{^5[0-9]{2}$}x;
};

1;

package main;
use strict;
use warnings qw(all);

use Carp;
use YADA;

my $q = YADA->new(
    max     => 8,
    timeout => 30,
);

open(my $fh, '<', 'file_with_urls_per_line.txt')
    or croak "can't open queue: $!";
while (my $url = <$fh>) {
    chomp $url;
    $q->append(sub {
        MyDownloader->new($url)
    });
}
close $fh;
$q->wait;

Perl not continuing

Note: This is test Perl code to check whether things work.
I have a problem with my Perl script. I know I can report the failure by adding
print "Your server is not ok, please check!\n";
die "- Server is not ok!\n";
but in my project the script should continue after it stops at die "- Server is not ok!\n";. I use print to show whether it works.
Here's the code:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new( timeout => 1 );
$ua->agent("007");

my $req = HTTP::Request->new( GET => 'http://www.google.com.ph/' );

my $res;
for ( 1 .. 10 ) {
    $res = $ua->request($req);
    if ( $res->is_success ) {
        print "+ Server is ok!\n";
    }
    else {
        die "- Server is not ok!\n"; # I want to stop if the server is not ok and run the last print. Or some other way to stop instead of using die.
    }
    sleep 1;
}
print "Your server is not ok, please check!\n"; # Why is this print not showing after stopping at "- Server is not ok!\n"?
(A screenshot of the output was attached here.)
Is there another way to stop, instead of using die, so the Perl script can continue?
I hope someone can fix this little problem. Thanks for all the replies!
die throws an exception which, left uncaught, terminates the whole program, so the final print is never reached. Use last to exit just the loop:

for ( 1 .. 10 ) {
    $res = $ua->request($req);
    if ( $res->is_success ) {
        print "+ Server is ok!\n";
    }
    else {
        print "- Server is not ok!\n";
        last; # this statement exits the loop
    }
    sleep 1;
}
# execution reaches this point whether the loop finished or was left with last
print "Your server is ok!\n";

Flush INET Socket response data with BLOCKING enabled

I am making a program that interfaces with Teamspeak, and I have an issue where the responses received do not match the commands sent. Each time I run the program I get different results when they should be the same, because the responses are out of sync.
use IO::Socket;

my $buf = '';
my $sock = IO::Socket::INET->new(
    PeerAddr  => 'localhost',
    PeerPort  => '10011',
    Proto     => 'tcp',
    Autoflush => 1,
    Blocking  => 1,
    Timeout   => 10,
);

sub ExecuteCommand {
    print $sock $_[0] . "\n";
    $sock->sysread( $buf, 1024 * 10 );
    return $buf;
}

ExecuteCommand("login ${username} ${password}");
ExecuteCommand("use sid=1");
ExecuteCommand("clientupdate client_nickname=Idle\\sTimer");
my $client_list = ExecuteCommand("clientlist");
Each command is executed properly; however, the server likes to return extra lines, so a single sysread will not be enough and I have to execute another. Responses are at most 512 bytes, so they aren't being cut off. If I try to run sysread multiple times in an attempt to flush, the program just hangs when there is nothing left to read.
Each response ends with "error id=0 msg=ok".
How can I read all the data that comes out, even if it spans multiple lines? Or just flush it all out so I can move on to the next command without having to worry about old data?
So you want to read until you find a line starting with error. In addition to doing that, the following buffers anything extra read since it's part of the next response.
sub read_response {
    my ($conn) = @_;
    my $fh = $conn->{fh};
    our $buf; local *buf = \( $conn->{buf} ); # alias
    our $eof; local *eof = \( $conn->{eof} ); # alias

    $buf = '' if !defined($buf);

    return undef if $eof;

    while (1) {
        # return everything up to and including the "error ..." line
        if ( $buf =~ s/\A(.*?^error[^\n]*\n)//ms ) {
            return $1;
        }

        my $rv = sysread( $fh, $buf, 64*1024, length($buf) );
        if (!$rv) {
            if ( defined($rv) ) {
                $eof = 1;
                return undef;
            } else {
                die "Can't read response: $!\n";
            }
        }
    }
}
my $conn = { fh => $sock };
... send command ...
my $response = read_response($conn);
...
... send command ...
my $response = read_response($conn);
...
I changed my ExecuteCommand subroutine to include a check for error id=[0-9]+, which is always at the end of a response from Teamspeak 3 servers.
sub ExecuteCommand {
    my ($command) = @_;
    print $sock "$command\n";

    my $response = '';
    while (1) {
        $sock->sysread( my $buf, 1024 * 10 );
        $response .= $buf;
        # the server terminates every response with an "error id=... msg=..." line
        last if $buf =~ /error id=([0-9]+)/;
    }
    return $response;
}

How to make Apache "sleep" for a few seconds

I want my server to stop for a few seconds and then start again.
I am trying to use sleep(5).
Will this help?
I've tried with a Perl script containing:
if($mech=~ m/(Connection refused)/) {print "SLEEPING\n";sleep(5);redo;}
A server hosts web pages. A client connects to a server. A Perl script using WWW::Mechanize is a client, so sleep() pauses your script, not Apache.
I suspect you're trying to do the following:
my $retries = 3;
my $resp; # response goes here
while ( $retries-- > 0 ) {
    $resp = $ua->get( "http://www.google.com/" );
    if ( !$resp->is_success ) {
        warn( "Failed to get webpage: " . $resp->status_line );
        sleep( 5 );
        next; # continue the while loop
    }
    last; # success!
}
# $retries is -1 only if every attempt failed
if ( $retries < 0 ) {
    die( "Too many retries!" );
}

# old versions of LWP::UserAgent didn't have decoded_content()
my $content = $resp->can('decoded_content')
    ? $resp->decoded_content
    : $resp->content;
print( $content );
UPDATE
Here is the code you supplied in a comment:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
eval {
    $mech->get($url);
};
if ($@) {
    print STDERR "Burst\n";
    print STDERR Dumper($mech);
    # my $var = $mech;
    # print STDERR "var $var\n";
    # if ( $mech =~ m/(Connection refused)/ )
    # {
    #     print "SLEEPING\n";
    #     sleep(5);
    #     redo;
    # }
}
From perldoc -f redo:
The "redo" command restarts the loop block without evaluating the conditional again. The "continue" block, if any, is not executed. If the LABEL is omitted, the command refers to the innermost enclosing loop.
Seeing as you haven't put your code in a loop, the call to redo doesn't have any effect.
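For example, a minimal sketch that gives redo a loop to restart - a bare block counts as a loop that runs once - with a hypothetical cap of 5 attempts so it cannot spin forever:

my $tries = 0;
{
    eval { $mech->get($url) };
    if ( $@ && $@ =~ m/Connection refused/ && ++$tries < 5 ) {
        print "SLEEPING\n";
        sleep(5);
        redo;   # restarts the bare block
    }
}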

True timeout on LWP::UserAgent request method

I am trying to implement a request to an unreliable server. The request is nice to have, but not 100% required for my Perl script to complete successfully. The problem is that the server will occasionally deadlock (we're trying to figure out why) and the request will never succeed. Since the server thinks it is live, it keeps the socket connection open, so LWP::UserAgent's timeout value does us no good whatsoever. What is the best way to enforce an absolute timeout on a request?
FYI, this is not a DNS problem. The deadlock has something to do with a massive number of updates hitting our Postgres database at the same time. For testing purposes, we've essentially put a while(1) {} line in the server's response handler.
Currently, the code looks like so:
my $ua = LWP::UserAgent->new;
$ua->timeout(5);
$ua->cookie_jar({});

my $req = HTTP::Request->new( POST => "http://$host:$port/auth/login" );
$req->content_type('application/x-www-form-urlencoded');
$req->content("login[user]=$username&login[password]=$password");

# This line never returns
$res = $ua->request($req);
I've tried using signals to trigger a timeout, but that does not seem to work.
eval {
    local $SIG{ALRM} = sub { die "alarm\n" };
    alarm(1);
    $res = $ua->request($req);
    alarm(0);
};
# This never runs
print "here\n";
The final answer I'm going to use was proposed by someone offline, but I'll mention it here. For some reason, Sys::SigAction works while $SIG{ALRM} does not. The likely reason is that Perl's "safe signals" (5.8+) defer running the Perl-level handler until it is safe to do so, so a read blocked inside C code is never interrupted; installing the handler via POSIX sigaction bypasses that deferral. This has been tested to work. Here are two working versions:
# Takes an LWP::UserAgent and an HTTP::Request; returns an HTTP::Response
sub ua_request_with_timeout {
    my $ua  = $_[0];
    my $req = $_[1];

    # Get whatever timeout is set for LWP and use that to
    # enforce a maximum timeout per request in case of server
    # deadlock. (This has happened.)
    use Sys::SigAction qw( timeout_call );
    our $res = undef;
    if ( timeout_call( 5, sub { $res = $ua->request($req); } ) ) {
        return HTTP::Response->new( 408 ); # 408 is the HTTP request timeout
    } else {
        return $res;
    }
}
sub ua_request_with_timeout2 {
    print "ua_request_with_timeout\n";
    my $ua  = $_[0];
    my $req = $_[1];

    # Get whatever timeout is set for LWP and use that to
    # enforce a maximum timeout per request in case of server
    # deadlock. (This has happened.)
    my $timeout_for_client = $ua->timeout() - 2;
    our $socket_has_timedout = 0;

    use POSIX;
    sigaction SIGALRM, POSIX::SigAction->new(
        sub {
            $socket_has_timedout = 1;
            die "alarm timeout";
        }
    ) or die "Error setting SIGALRM handler: $!\n";

    my $res = undef;
    eval {
        alarm($timeout_for_client);
        $res = $ua->request($req);
        alarm(0);
    };

    if ($socket_has_timedout) {
        return HTTP::Response->new( 408 ); # 408 is the HTTP request timeout
    } else {
        return $res;
    }
}
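A usage sketch, assuming $ua and $req are set up as in the question:

my $res = ua_request_with_timeout( $ua, $req );
if ( $res->code == 408 ) {
    warn "request timed out\n";
} else {
    print $res->status_line, "\n";
}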
You might try LWPx::ParanoidAgent, a subclass of LWP::UserAgent that is more cautious about how it interacts with remote web servers.
Among other things, it allows you to specify a global timeout. It was developed by Brad Fitzpatrick as part of the LiveJournal project.
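A minimal sketch of how it slots in (ParanoidAgent is a drop-in replacement for LWP::UserAgent; its timeout is applied to the request as a whole rather than to each individual socket read):

use LWPx::ParanoidAgent;

my $ua = LWPx::ParanoidAgent->new;
$ua->timeout(10);   # cap on the entire request, in seconds

my $res = $ua->get("http://$host:$port/auth/login");
print $res->status_line, "\n";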
You can make your own timeout like this:
use LWP::UserAgent;
use IO::Pipe;

my $agent = LWP::UserAgent->new;

my $finished = 0;
my $timeout = 5;

$SIG{CHLD} = sub { wait; $finished = 1 };

my $pipe = IO::Pipe->new;
my $pid = fork;
if ( $pid == 0 ) {
    # child: fetch the page and write it to the pipe
    $pipe->writer;
    my $response = $agent->get("http://stackoverflow.com/");
    $pipe->print( $response->content );
    exit;
}

# parent: give the child $timeout seconds to finish
$pipe->reader;
sleep($timeout);

if ($finished) {
    print "Finished!\n";
    my $content = join( '', $pipe->getlines );
}
else {
    kill( 9, $pid );
    print "Timed out.\n";
}
From what I understand, the timeout property doesn't take into account DNS timeouts. It's possible that you could make a DNS lookup separately, then make the request to the server if that works, with the correct timeout value set for the useragent.
Is this a DNS problem with the server, or something else?
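A rough sketch of that idea, with a hypothetical $host; note that the safe-signal caveat discussed above applies here too, so a lookup blocked deep in C code may still not be interruptible:

use Socket qw( inet_aton );

my $addr;
eval {
    local $SIG{ALRM} = sub { die "dns timeout\n" };
    alarm(5);
    $addr = inet_aton($host);   # blocking DNS lookup
    alarm(0);
};
die "could not resolve $host in time\n" if !defined $addr;

# DNS answered quickly; proceed with the normal LWP timeout
my $res = $ua->request($req);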
EDIT: It could also be a problem with IO::Socket. Try updating your IO::Socket module, and see if that helps. I'm pretty sure there was a bug in there that was preventing LWP::UserAgent timeouts from working.
Alex
The following generalization of one of the original answers also restores the alarm signal handler to the previous handler, and adds a second call to alarm(0) in case the call in the eval block throws a non-alarm exception and we want to cancel the alarm. Further $@ inspection and handling can be added:
sub ua_request_with_timeout {
    my $ua      = $_[0];
    my $request = $_[1];

    # Get whatever timeout is set for LWP and use that to
    # enforce a maximum timeout per request in case of server
    # deadlock. (This has happened.)
    my $timeout_for_client_sec = $ua->timeout();
    our $res_has_timedout = 0;

    use POSIX ':signal_h';

    my $newaction = POSIX::SigAction->new(
        sub { $res_has_timedout = 1; die "web request timeout"; }, # the handler code ref
        POSIX::SigSet->new(SIGALRM),
        # not using (perl 5.8.2 and later) 'safe' switch or sa_flags
    );
    my $oldaction = POSIX::SigAction->new();
    if ( !sigaction( SIGALRM, $newaction, $oldaction ) ) {
        warn "Error setting SIGALRM handler: $!";
        return $ua->request($request);
    }

    my $response = undef;
    eval {
        alarm($timeout_for_client_sec);
        $response = $ua->request($request);
        alarm(0);
    };
    alarm(0); # cancel the alarm (in case eval failed because of a non-alarm cause)

    if ( !sigaction( SIGALRM, $oldaction ) ) {
        warn "Error resetting SIGALRM handler: $!";
    }

    if ($res_has_timedout) {
        warn "Timeout($timeout_for_client_sec sec) while waiting for a response from cred central";
        return HTTP::Response->new(408); # 408 is the HTTP request timeout
    } else {
        return $response;
    }
}