Understanding async in Perl with a specific example - perl

I have to write a script that gets some URLs in parallel and does some work. In the past I have always used Parallel::ForkManager for such things, but now I wanted to learn something new and try asynchronous programming with AnyEvent (and AnyEvent::HTTP or AnyEvent::Curl::Multi) ... but I'm having problems understanding AnyEvent and writing a script that should:
open a file (every line is a separate URL)
(from here on in parallel, but with a limit of e.g. 10 concurrent requests)
read the file line by line (I don't want to load the whole file into memory - it might be big)
make an HTTP request for that URL
read the response
update the MySQL record accordingly
(next file line)
I have read many manuals and tutorials, but it's still hard for me to understand the differences between blocking and non-blocking code. I found a similar script at http://perlmaven.com/fetching-several-web-pages-in-parallel-using-anyevent, where Mr. Szabo explains the basics, but I still can't understand how to implement something like:
...
open my $fh, "<", $file;
while ( my $line = <$fh> )
{
    # http request, read response, update MySQL
}
close $fh;
...
... and add a concurrency limit in this case.
I would be very grateful for help ;)
UPDATE
Following ikegami's advice I gave Net::Curl::Multi a try. I'm very pleased with the results. After years of using Parallel::ForkManager just to grab thousands of URLs concurrently, Net::Curl::Multi seems to be awesome.
Here is my code with a while loop on the filehandle. It seems to work as it should, but considering it's my first time writing something like this, I would like to ask more experienced Perl users to take a look and tell me if there are potential bugs, something I missed, etc.
Also, if I may ask: as I don't fully understand how Net::Curl::Multi's concurrency works, please tell me whether I should expect any problems with putting a MySQL UPDATE command (via DBI) inside the RESPONSE loop, along the lines of the sketch after the code (besides higher server load, obviously - I expect the final script to run with about 50 concurrent N::C::M workers, maybe more).
#!/usr/bin/perl
use strict;
use warnings;
use Net::Curl::Easy qw( :constants );
use Net::Curl::Multi qw( );

sub make_request {
    my ( $url ) = @_;
    my $easy = Net::Curl::Easy->new();
    $easy->{url} = $url;
    $easy->setopt( CURLOPT_URL,        $url );
    $easy->setopt( CURLOPT_HEADERDATA, \$easy->{head} );
    $easy->setopt( CURLOPT_FILE,       \$easy->{body} );
    return $easy;
}

my $maxWorkers = 10;

my $multi = Net::Curl::Multi->new();
my $workers = 0;

my $i = 1;
open my $fh, "<", "urls.txt"
    or die "Can't open urls.txt: $!";
LINE: while ( my $url = <$fh> )
{
    chomp( $url );
    $url .= "?$i";
    print "($i) $url\n";

    my $easy = make_request( $url );
    $multi->add_handle( $easy );
    $workers++;

    my $running = 0;
    do {
        my ( $r, $w, $e ) = $multi->fdset();
        my $timeout = $multi->timeout();
        select( $r, $w, $e, $timeout / 1000 )
            if $timeout > 0;

        $running = $multi->perform();
        RESPONSE: while ( my ( $msg, $easy, $result ) = $multi->info_read() ) {
            $multi->remove_handle( $easy );
            $workers--;
            printf( "%s getting %s\n", $easy->getinfo( CURLINFO_RESPONSE_CODE ), $easy->{url} );
        }

        # don't max CPU while waiting
        select( undef, undef, undef, 0.01 );
    } while ( $workers == $maxWorkers || ( eof && $running ) );

    $i++;
}
close $fh;
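To make that last question concrete, this is roughly what I have in mind for the RESPONSE loop; the $dbh handle, the credentials, and the urls table are placeholders, not part of the script above:

use DBI;

# Placeholder sketch only: $db_user, $db_pass and the urls table
# are assumptions for illustration, they don't exist in my script.
my $dbh = DBI->connect( 'DBI:mysql:database=mydb', $db_user, $db_pass,
    { RaiseError => 1 } );
my $sth = $dbh->prepare( 'UPDATE urls SET http_code = ? WHERE url = ?' );

# ... and then inside the RESPONSE loop:
RESPONSE: while ( my ( $msg, $easy, $result ) = $multi->info_read() ) {
    $multi->remove_handle( $easy );
    $workers--;
    $sth->execute( $easy->getinfo( CURLINFO_RESPONSE_CODE ), $easy->{url} );
}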

Net::Curl is a rather good library that's extremely fast. Furthermore, it can handle parallel requests too! I'd recommend using this instead of AnyEvent.
use Net::Curl::Easy qw( :constants );
use Net::Curl::Multi qw( );

sub make_request {
    my ( $url ) = @_;
    my $easy = Net::Curl::Easy->new();
    $easy->{url} = $url;
    $easy->setopt( CURLOPT_URL,        $url );
    $easy->setopt( CURLOPT_HEADERDATA, \$easy->{head} );
    $easy->setopt( CURLOPT_FILE,       \$easy->{body} );
    return $easy;
}

my $max_running = 10;
my @urls = ( 'http://www.google.com/' );

my $multi = Net::Curl::Multi->new();

my $running = 0;
while (1) {
    while ( @urls && $running < $max_running ) {
        my $easy = make_request( shift( @urls ) );
        $multi->add_handle( $easy );
        ++$running;
    }

    last if !$running;

    my ( $r, $w, $e ) = $multi->fdset();
    my $timeout = $multi->timeout();
    select( $r, $w, $e, $timeout / 1000 )
        if $timeout > 0;

    $running = $multi->perform();
    while ( my ( $msg, $easy, $result ) = $multi->info_read() ) {
        $multi->remove_handle( $easy );
        printf( "%s getting %s\n", $easy->getinfo( CURLINFO_RESPONSE_CODE ), $easy->{url} );
    }
}

This does exactly what you want, asynchronously, by wrapping Net::Curl in a safe way:
#!/usr/bin/env perl
package MyDownloader;
use strict;
use warnings qw(all);

use Moo;
extends 'YADA::Worker';

has '+use_stats' => (default => sub { 1 });
has '+retry'     => (default => sub { 10 });

after init => sub {
    my ($self) = @_;

    $self->setopt(
        encoding => '',
        verbose  => 1,
    );
};

after finish => sub {
    my ($self, $result) = @_;

    if ($self->has_error) {
        print "ERROR: $result\n";
    } else {
        # do the interesting stuff here
        printf "Finished downloading %s: %d bytes\n", $self->final_url, length ${$self->data};
    }
};

around has_error => sub {
    my $orig = shift;
    my $self = shift;

    return 1 if $self->$orig(@_);
    return 1 if $self->getinfo('response_code') =~ m{^5[0-9]{2}$}x;
};

1;

package main;
use strict;
use warnings qw(all);

use Carp;
use YADA;

my $q = YADA->new(
    max     => 8,
    timeout => 30,
);

open(my $fh, '<', 'file_with_urls_per_line.txt')
    or croak "can't open queue: $!";
while (my $url = <$fh>) {
    chomp $url;
    $q->append(sub {
        MyDownloader->new($url)
    });
}
close $fh;
$q->wait;

Related

Mojo::UserAgent non-blocking vs blocking performance

I have the following code:
my $ua = Mojo::UserAgent->new();
my @ids = qw(id1 id2 id3);
foreach (@ids) {
    my $input = $_;
    my $res = $ua->get('http://my_site/rest/id/' . $input . '.json' => sub {
        my ($ua, $res) = @_;
        print "$input =>" . $res->result->json('/net/id/desc'), "\n";
    });
}
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
Why, when I run the above code (non-blocking), does it take about 6 seconds, while running the code as blocking, i.e. inside the loop something like:
my $res = $ua->get('http://my_site/rest/id/'.$input.'.json');
print "$input =>" . $res->result->json('/net/id/desc'), "\n";
without the Mojo::IOLoop->start line, takes about 1 second?
Why is the blocking code faster than the non-blocking code?
The first thing to check is when things actually happen. I couldn't get the same delay. Remember to try each way several times to spot outliers where there's a network hiccup. Note that the second argument to the non-blocking sub is a transaction object, normally written as $tx, whereas $res normally denotes the response object:
use feature qw(say);
use Mojo::Util qw(steady_time);
use Mojo::UserAgent;

say "Begin: " . steady_time();
END { say "End: " . steady_time() }

my $ua = Mojo::UserAgent->new();

my @ids = qw(id1 id2 id3);
foreach (@ids) {
    my $input = $_;
    my $res = $ua->get(
        $url =>    # the URL from the question
        sub {
            my ($ua, $tx) = @_;
            print "Fetched\n";
        }
    );
}
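As an aside, a minimal sketch of reaching the response through the transaction inside the callback, assuming the JSON structure from the question:

sub {
    my ($ua, $tx) = @_;
    my $res = $tx->result;    # Mojo::Message::Response; raises on connection errors
    print "$input => " . $res->json('/net/id/desc') . "\n";
}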
One possibility is that keep-alive is holding an open connection. What happens if you turn that off?
my $res = $ua->get(
    $url =>
    { Connection => 'close' } =>
    sub {
        my ($ua, $tx) = @_;
        print "Fetched\n";
    }
);
Here's a version that uses promises, which you'll want to get used to as more Mojo stuff moves to it:
use feature qw(say signatures);
no warnings qw(experimental::signatures);

use Mojo::Promise;
use Mojo::Util qw(steady_time);
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

say "Begin: " . steady_time();
END { say "End: " . steady_time() }

my @ids = qw(id1 id2 id3);

my @gets = map {
    $ua->get_p( 'http://www.perl.com' )->then(
        sub ( $tx ) { say "Fetched: " . steady_time() },
        sub { print "Error: @_" }
    );
} @ids;

Mojo::Promise->all( @gets )->wait;

AnyEvent::STOMP::Client + AnyEvent::ForkManager = Intermittent Error

I'm trying to write a process that listens to ActiveMQ and, based on the message, goes out and grabs data from a webservice, does some processing, and then puts the processed data to another webservice (REST/JSON).
The module below works fine until one of the wonky webservices I talk to returns an error. I've tried many things to catch the error but to no avail, yet. Once the webservice error happens though I get the following message:
unhandled callback exception on event (MESSAGE,
AnyEvent::STOMP::Client=HASH(0x3ad5e48), HASH(0x3a6bbb0)
{"action":"created","data":{"id":40578737,"type":"alert","who":null},"guid":"ADCCEE0C-73A7-11E6-8084-74B346D1CA67","hostname":"myserver","pid":48632}):
$fork_manager->start() should be called within the manager process
OK, I conceptually understand that the child process is trying to start another process and that fork manager is saying that is a no-no. But given the module below, what is the proper way to start a new process to handle the long-running processing? Or why is a child process dying causing this exception, and how can I prevent it?
Here's the module (stripped down):
package consumer;

use AnyEvent::ForkManager;
use AnyEvent::STOMP::Client;
use JSON;
use Data::Dumper;
use v5.18;
use Moose;

sub run {
    my $self = shift;
    my $pm = AnyEvent::ForkManager->new(max_workers => 20);
    my $stomp = AnyEvent::STOMP::Client->new();

    $stomp->connect();

    $stomp->on_connected(sub {
        my $stomp = shift;
        $stomp->subscribe('/topic/test');
        say "Connected to STOMP";
    });

    $pm->on_start(sub {
        my ($pm, $pid, @params) = @_;
        say "Starting $pid worker";
    });

    $pm->on_finish(sub {
        my ($pm, $pid, @params) = @_;
        say "Finished $pid worker";
    });

    $pm->on_error(sub {
        say Dumper(\@_);
    });

    $stomp->on_message(sub {
        my ($stomp, $header, $body) = @_;
        my $href = decode_json $body;
        $pm->start(
            cb => sub {
                my ($pm, @params) = @_;
                $self->process(@params);
            },
            args => [ $href->{id}, $href->{data}->{type}, $href->{data}->{who} ],
        );
    });

    my $cv = AnyEvent->condvar;
    $cv->recv;
}

sub process {
    say "Processing " . Dumper(\@_);
    sleep 5;
    if ( int(rand(10)) < 5 ) {
        die "OOPS"; # this triggers the error message above
    }
    say "Done Processing $_[1]";
}

1;
Here's the driver for the module above:
#!/usr/bin/env perl
use v5.18;
use lib '.';
use consumer;
my $c = consumer->new();
$c->run;
Finally, a traffic generator that you can use to see this in action:
#!/usr/bin/env perl
use lib '../lib';
use lib '../../lib';
use v5.18;
use Data::Dumper;
use JSON;
use Net::STOMP::Client;
$ENV{'scot_mode'} = "testing";
my $stomp = Net::STOMP::Client->new(
    host => "127.0.0.1",
    port => 61613,
);
$stomp->connect();
for (my $i = 1; $i < 1000000; $i++) {
    my $href = {
        id   => $i,
        type => "event",
        what => "foo",
    };
    my $json = encode_json $href;
    say "Sending " . Dumper($href);
    $stomp->send(
        destination => "/topic/test",
        body        => $json,
    );
}
$stomp->disconnect();
I was able to solve this by using Try::Catch and wrapping the call to $self->process with a try/catch like this:
$stomp->on_message(sub {
    my ($stomp, $header, $body) = @_;
    my $href = decode_json $body;
    $pm->start(
        cb => sub {
            my ($pm, @params) = @_;
            try {
                $self->process(@params);
            }
            catch {
                # error handling stuff
            };
        },
        args => [ $href->{id}, $href->{data}->{type}, $href->{data}->{who} ],
    );
});

Is it impolite to put an END block in a module?

Would it be OK to keep the END block in this example, because nobody wants a broken terminal, or should I not put an END block in a module at all?
package My_Package;
use warnings;
use strict;
use feature qw(switch);
no if $] >= 5.018, warnings => 'experimental::smartmatch';
use Term::ReadKey;

sub _init_scr {
    my ( $arg ) = @_;
    $arg->{backup_flush} = $|;
    $| = 1;
    Term::ReadKey::ReadMode 'ultra-raw';
}

sub _end_win {
    my ( $arg ) = @_;
    print "\n\r";
    Term::ReadKey::ReadMode 'restore';
    $| = $arg->{backup_flush};
}

END {
    Term::ReadKey::ReadMode 'restore';
}

sub my_function {
    my $arg = {};
    _init_scr( $arg );
    while ( 1 ) {
        my $c = ReadKey 0;
        if ( ! defined $c ) {
            _end_win( $arg );
            warn "EOT";
            return;
        }
        next if $c eq "\e";
        given ( $c ) {
            when ( $c ge 'a' && $c le 'z' ) {
                print $c;
                $arg->{string} .= $c;
            }
            when ( $c eq "\cC" ) {
                _end_win( $arg );
                print STDERR "^C";
                kill( 'INT', $$ );
                return;
            }
            when ( $c eq "\r" ) {
                _end_win( $arg );
                return $arg->{string};
            }
        }
    }
}
If your module changes the terminal mode, then I would think the most polite thing to do would be for it to also install an END block to restore the terminal mode before the program exits.
No, it's polite and expected that you put things back as you found them.
However, it's unwelcome to tidy up someone else's workspace unless you've been asked to do so.
That is, your END routine shouldn't run unless it has reason to do so, and your module probably ought to allow a developer to disable the automatic cleanup. (E.g., use My_Package qw(:no_auto_restore).)
Failing that, the POD ought to explicitly document that the module fiddles with a system resource upon exit.
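A minimal sketch of that opt-out, assuming a :no_auto_restore import tag and a package flag (both made up for illustration, not an existing API):

package My_Package;
use warnings;
use strict;
use Term::ReadKey;

# Illustrative flag: cleanup runs unless the caller opted out with
#   use My_Package qw(:no_auto_restore);
my $auto_restore = 1;

sub import {
    my ( $class, @tags ) = @_;
    $auto_restore = 0 if grep { $_ eq ':no_auto_restore' } @tags;
}

END {
    Term::ReadKey::ReadMode 'restore' if $auto_restore;
}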

Why does my Perl script using WWW-Mechanize fail intermittently?

I am trying to write a Perl script using WWW-Mechanize.
Here is my code:
use DBI;
use JSON;
use WWW::Mechanize;

sub fetch_companies_list
{
    my $url = shift;
    my $browser = WWW::Mechanize->new( stack_depth => 0 );
    my ($content, $json, $parsed_text, $company_name, $company_url);

    eval
    {
        print "Getting the companies list...\n";
        $browser->get( $url );
        # die "Can't get the companies list.\n" unless( $browser->status );
        $content = $browser->content();
        # die "Can't get companies names.\n" unless( $browser->status );
        $json = new JSON;
        $parsed_text = $json->allow_nonref->utf8->relaxed->escape_slash->loose->allow_singlequote->allow_barekey->decode( $content );

        foreach (@$parsed_text)
        {
            $company_name = $_->{name};
            fetch_company_info( $company_name, $browser );
        }
    };
}

fetch_companies_list( "http://api.crunchbase.com/v/1/companies.js" );
The problem is as follows:
I start the script; it finishes fine.
I restart the script. The script fails in $browser->get().
I have to wait some time (about 5 minutes), then it will start working again.
I am working on Linux and have WWW-Mechanize version 1.66.
Any idea what might be the problem? I don't have any firewall installed, either on my computer or on my router.
Moreover, uncommenting the "die ..." line does not help, as the script stops inside the get() call. I can try upgrading to the latest version, which is 1.71, but I'd like to know if someone else has experienced this with this Perl module.
5 minutes (300 seconds) is the default timeout. Exactly what timed out will be returned in the response's status line.
my $response = $mech->res;
if (!$response->is_success()) {
    die($response->status_line());
}
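If the default is too long for your use case, the timeout can be shortened when constructing the object; WWW::Mechanize passes extra options such as timeout through to the underlying LWP::UserAgent:

my $browser = WWW::Mechanize->new(
    stack_depth => 0,
    timeout     => 30,    # fail after 30 seconds instead of 300
);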
This is a target-site issue. Right now it shows:
503 Service Unavailable
No server is available to handle this request.
Retry with a wait; try this:
## set maximum no of tries
my $retries = 10;
## number of secs to sleep
my $sleep = 1;
do {
    eval {
        print "Getting the companies list...\n";
        $browser->get($url);
        # die "Can't get the companies list.\n" unless( $browser->status );
        $content = $browser->content();
        # die "Can't get companies names.\n" unless( $browser->status );
        $json = new JSON;
        $parsed_text = $json->allow_nonref->utf8->relaxed->escape_slash->loose->allow_singlequote->allow_barekey->decode($content);

        foreach (@$parsed_text) {
            $company_name = $_->{name};
            fetch_company_info( $company_name, $browser );
        }
    };
    if ($@) {
        warn $@;
        ## rest for some time
        sleep($sleep);
        ## increase the value of $sleep exponentially
        $sleep *= 2;
    }
} while ( $@ && $retries-- );

Perl Classes :: Cannot write output

I am new to object-oriented programming in Perl. So, I have a silly question.
What --
I am writing a script which will do something and write the result to a stream (STDOUT or network).
How --
[main.pl]
#!/usr/bin/perl
use strict;
require const::StreamTypes;
require output;
my $out = output->new("output");
$out->writeLine("Sample output");
[output.pm]
#!/usr/bin/perl
use strict;
require const::StreamTypes;

package output;

my $OUTSTR;

sub new{
    my $class = shift();
    my $stream = shift();
    if($stream eq const::StreamTypes->STDNET){
    }elsif($stream eq const::StreamTypes->STDWEB){
    }else{
        *OUTSTR = *STDOUT;
    }
    my $self = {
        "_outStream"    => $stream,
        "_outStreamPtr" => $OUTSTR
    };
    bless($self, $class);
}

sub writeLine{
    my $msg = shift();
    print(OUTSTR "$msg\n");
}

return 1;
So, can anyone help me understand what is going wrong here? Because the program runs without error but produces no output.
Thanks!
I changed a couple of things here:
the first parameter of a method is the invocant (instance or class) itself
bareword file handles are globals!
the autodie module comes in handy if using open
consider using strict in your modules, too
I would not recommend using a package global variable ( my $OUTSTR; ), because that's going to be messy with multiple instances that want different streams.
And I definitely got into the habit of using accessors for all attributes. You can use a lightweight system like Class::Accessor, or perhaps you are even lucky enough to use Moose or Mouse. Of course there are a couple of other modules also providing accessors in different ways.
package output;

use strict;
use warnings;
use autodie;
use Class::Accessor "moose-like";

has "outStream" => ( is => 'rw' );

sub new {
    my ( $class, $stream ) = @_;
    my $self = bless( {}, $class );
    if ( 0 ) {
        # ...
    } else {
        open( my $outStream, '>&', \*STDOUT );
        $self->outStream( $outStream );
    }
    return $self;
}

sub writeLine {
    my ( $self, $msg ) = @_;
    print { $self->outStream } "$msg\n";
}

return 1;
Moose would create a constructor for you, but you can insert your parameter processing as easily as follows:
use Moose;

has "outStream" => ( is => 'rw' );

sub BUILDARGS {
    my ( $class, $stream ) = @_;
    open( my $outStream, '>&', \*STDOUT );
    return {
        outStream => $outStream,
    };
}
$OUTSTR and *OUTSTR are very different things -- you should clear up your misunderstanding about this before you worry about object oriented programming.
That said, you can probably fix this script by getting everything to refer to $OUTSTR:
...
}else{
$OUTSTR = *STDOUT;
}
...
print $OUTSTR "$msg\n";
How about just passing a file handle directly into the object's constructor?
package output;

sub new {
    my ($class, $fh) = @_;
    bless { file_handle => $fh }, $class;
}

sub writeLine {
    my $self = shift;
    my $line = shift;
    print {$self->{file_handle}} $line;
}

1;
Example usage:
my $output = output->new(\*STDOUT); # write to stdout
my $socket = IO::Socket::INET->new('www.perl.org', PeerPort => 'http(80)', Proto => 'tcp');
my $output = output->new($socket); # write to a socket
Please don't use bareword file handles. Use lexical file handles.
The following lines assume that there is a hash %type_handlers somewhere that looks something like this:
{ const::StreamTypes->STDNET => \&constructor_for_stdnet_handles
, const::StreamTypes->STDWEB => \&constructor_for_stdweb_handles
}
Then you can replace the bottom of your constructor with:
my $handler = $type_handlers{ $stream };
my $outstr
    = $handler ? $handler->()
    :            do { my $h; open( $h, '>&', \*::STDOUT ) and $h; }
    ;
return bless( {
      _outStream    => $stream
    , _outStreamPtr => $outstr
    }
    , $class
);
Then writeLine becomes:
sub writeLine {
    my ( $self, $msg ) = @_;
    ( $self->{_outStreamPtr} || *::STDOUT{IO} )->say( $msg );
}
The method is a little more robust in cases where somebody just blessed themselves into your class.
my $q_and_d = bless {}, 'output';
If you don't want to allow "quick & dirty" instances, and want more precise messages from possible failures, you could do this:
Carp::croak( 'No outstream!' )
    unless my $h = Params::Util::_HANDLE( $self->{_outStreamPtr} );