How to multithread seeing if a webpage exists in Perl? - perl

I'm writing a Perl script that takes in a list of URLs and checks to see if they exist. (Note that I only care if they exist; I don’t care what their contents are. Here’s the important part of the program.
use LWP::Simple qw($ua head);
if (head($url))
{
$numberAlive ++;
}
else
{
$numberDead ++;
}
Right now the program works fine; however, I want it to run faster. Thus I'm considering making it multithreaded. I assume that the slow part of my program is contacting the server for each URL; therefore, I'm looking for a way in which I can send out requests to the URLs of other webpages on my list while I'm waiting for the first response. How can I do this? As far as I can tell, the head routine doesn't have a callback that can get called once the server has responded.

Begin with familiar-looking front matter.
#! /usr/bin/env perl
use strict;
use warnings;
use 5.10.0; # for // (defined-or)
use IO::Handle;
use IO::Select;
use LWP::Simple;
use POSIX qw/ :sys_wait_h /;
use Socket;
Global constants control program execution.
my $DEBUG = 0;
my $EXIT_COMMAND = "<EXIT>";
my $NJOBS = 10;
URLs to check arrive one per line on a worker’s end of the socket. For each URL, the worker calls LWP::Simple::head to determine whether the resource is fetchable. The worker then writes back to the socket a line of the form url : *status* where *status* is either "YES" or "NO" and represents the space character.
If the URL is $EXIT_COMMAND, then the worker exits immediately.
sub check_sites {
my($s) = #_;
warn "$0: [$$]: waiting for URL" if $DEBUG;
while (<$s>) {
chomp;
warn "$0: [$$]: got '$_'" if $DEBUG;
exit 0 if $_ eq $EXIT_COMMAND;
print $s "$_: ", (head($_) ? "YES" : "NO"), "\n";
}
die "NOTREACHED";
}
To create a worker, we start by creating a socketpair. The parent process will use one end and each worker (child) will use the other. We disable buffering at both ends and add the parent end to our IO::Select instance. We also note each child’s process ID so we can wait for all workers to finish.
sub create_worker {
my($sel,$kidpid) = #_;
socketpair my $parent, my $kid, AF_UNIX, SOCK_STREAM, PF_UNSPEC
or die "$0: socketpair: $!";
$_->autoflush(1) for $parent, $kid;
my $pid = fork // die "$0: fork: $!";
if ($pid) {
++$kidpid->{$pid};
close $kid or die "$0: close: $!";
$sel->add($parent);
}
else {
close $parent or die "$0: close: $!";
check_sites $kid;
die "NOTREACHED";
}
}
To dispatch URLs, the parent grabs as many readers as are available and hands out the same number of URLs from the job queue. Any workers that remain after the job queue is empty receive the exit command.
Note that print will fail if the underlying worker has already exited. The parent must ignore SIGPIPE to prevent immediate termination.
sub dispatch_jobs {
my($sel,$jobs) = #_;
foreach my $s ($sel->can_write) {
my $url = #$jobs ? shift #$jobs : $EXIT_COMMAND;
warn "$0 [$$]: sending '$url' to fd ", fileno $s if $DEBUG;
print $s $url, "\n" or $sel->remove($s);
}
}
By the time control reaches read_results, the workers have been created and received work. Now the parent uses can_read to wait for results to arrive from one or more workers. A defined result is an answer from the current worker, and an undefined result means the child has exited and closed the other end of the socket.
sub read_results {
my($sel,$results) = #_;
warn "$0 [$$]: waiting for readers" if $DEBUG;
foreach my $s ($sel->can_read) {
warn "$0: [$$]: reading from fd ", fileno $s if $DEBUG;
if (defined(my $result = <$s>)) {
chomp $result;
push #$results, $result;
warn "$0 [$$]: got '$result' from fd ", fileno $s if $DEBUG;
}
else {
warn "$0 [$$]: eof from fd ", fileno $s if $DEBUG;
$sel->remove($s);
}
}
}
The parent must keep track of live workers in order to collect all results.
sub reap_workers {
my($kidpid) = #_;
while ((my $pid = waitpid -1, WNOHANG) > 0) {
warn "$0: [$$]: reaped $pid" if $DEBUG;
delete $kidpid->{$pid};
}
}
Running the pool executes the subs above to dispatch all URLs and return all results.
sub run_pool {
my($n,#jobs) = #_;
my $sel = IO::Select->new;
my %kidpid;
my #results;
create_worker $sel, \%kidpid for 1 .. $n;
local $SIG{PIPE} = "IGNORE"; # writes to dead workers will fail
while (#jobs || keys %kidpid || $sel->handles) {
dispatch_jobs $sel, \#jobs;
read_results $sel, \#results;
reap_workers \%kidpid;
}
warn "$0 [$$]: returning #results" if $DEBUG;
#results;
}
Using an example main program
my #jobs = qw(
bogus
http://stackoverflow.com/
http://www.google.com/
http://www.yahoo.com/
);
my #results = run_pool $NJOBS, #jobs;
print $_, "\n" for #results;
the output is
bogus: NO
http://www.google.com/: YES
http://stackoverflow.com/: YES
http://www.yahoo.com/: YES

Another option is HTTP::Async.
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Async;
my $numberAlive = 0;
my $numberDead = 0;
my #urls = ('http://www.perl.com','http://www.example.xyzzy/foo.html');
my $async = HTTP::Async->new;
# you might want to wrap this in a loop to deal with #urls in batches
foreach my $url (#urls){
$async->add( HTTP::Request->new( HEAD => $url ) );
}
while ( my $response = $async->wait_for_next_response ) {
if ($response->code == 200){$numberAlive ++;}
else{$numberDead ++;}
}
print "$numberAlive Alive, $numberDead Dead\n";

Worker-based parallelisation (using your choice of threads or processes):
use strict;
use warnings;
use feature qw( say );
use threads; # or: use forks;
use LWP::Simple qw( head );
use Thread::Queue::Any qw( );
use constant NUM_WORKERS => 10; # Or whatever.
my $req_q = Thread::Queue::Any->new();
my $resp_q = Thread::Queue::Any->new();
my #workers;
for (1..NUM_WORKERS) {
push #workers, async {
while (my $url = $req_q->dequeue()) {
my $is_alive = head($url) ? 1 : 0;
$resp_q->enqueue($is_alive);
}
};
}
$req_q->enqueue($_) for #urls;
my ($alive, $dead);
for (1..#urls) {
my $is_alive = $resp_q->dequeue();
++( $is_alive ? $alive : $dead );
}
$req_q->enqueue(undef) for #workers;
$_->join for #workers;
say $alive;
say $dead;

Related

perl redirect stdout to lexical filehandle

I'm trying to write a helper function that runs a perl function in another process and returns a closure that produces a line of output at a time when called.
I figured out a way of doing this using pipe that mixes old and new-style filehandles. I used an old-style one for the sink in order to use the open(STDOUT, ">&thing") syntax and a new-style one for the source since it needs to be captured by a closure and I didn't want to burden the caller with providing a filehandle.
Is there a way of using a new-style filehandle in a construction with the same meaning as open(STDOUT, ">&thing")?
#!/usr/bin/env perl
# pipe.pl
# use pipe() to create a pair of fd's.
# write to one and read from the other.
#
# The source needs to be captured by the closure and can't be
# destructed at the end of get_reader(), so it has to be lexical.
#
# We need to be able to redirect stdout to sink in such a way that
# we actually dup the file descriptor (so shelling out works as intended).
# open(STDOUT, ">&FILEHANDLE") achieves this but appears to require an
# old-style filehandle.
use strict;
use warnings;
sub get_reader {
local *SINK;
my $source;
pipe($source, SINK) or die "can't open pipe!";
my $cpid = fork();
if ($cpid == -1) {
die 'failed to fork';
}
elsif ($cpid == 0) {
open STDOUT, ">&SINK" or die "can't open sink";
system("echo -n hi");
exit;
}
else {
return sub {
my $line = readline($source);
printf "from child (%s)\n", $line;
exit;
}
}
}
sub main {
my $reader = get_reader();
$reader->();
}
main();
When run, this produces
from child (hi)
as expected.
sub get_reader {
my ($cmd) = #_;
open(my $pipe, '-|', #$cmd);
return sub {
return undef if !$pipe;
my $line = <$pipe>;
if (!defined($line)) {
close($pipe);
$pipe = undef;
return undef;
}
chomp($line);
return $line;
};
}
If that's not good enough (e.g. because you also need to redirect the child's STDIN or STDERR), you can use IPC::Run instead.
use IPC::Run qw( start );
sub get_reader {
my ($cmd) = #_;
my $buf = '';
my $h = start($cmd, '>', \$buf);
return sub {
return undef if !$h;
while (1) {
if ($buf =~ s/^([^\n]*)\n//) {
return $1;
}
if (!$h->pump())) {
$h->finish();
$h = undef;
return substr($buf, 0, length($buf), '') if length($buf);
return undef;
}
}
};
}
Either way, you can now do
my $i = get_reader(['prog', 'arg', 'arg']);
while (defined( my $line = $i->() )) {
print "$line\n";
}
Either way, error handling left to you.

How to check if command executed with IPC::open3 is hung?

I'm using the following script to capture STDIN, STDOUT and STDERR from the command passed as an argument.
#!/usr/bin/perl
use strict;
use warnings;
use IPC::Open3;
local(*CMD_IN, *CMD_OUT, *CMD_ERR);
my $pid = open3(*CMD_IN, *CMD_OUT, *CMD_ERR, $ARGV[0]);
close(CMD_IN);
my #stdout_output = <CMD_OUT>;
my #stderr_output = <CMD_ERR>;
close(CMD_OUT);
close(CMD_ERR);
waitpid ($pid, 0); # reap the exit code
print "OUT:\n", #stdout_output;
print "ERR:\n", #stderr_output;
It all works good with the exception that I'm not sure how to monitor if the command passed is hung. Could you please suggest a way?
I've borrowed this snippet originally from 'Programming Perl'.
You can use select or IO::Select and provide a timeout. If you want to read both from stdout and stderr, you should do that anyway (see the documentation of IPC::Open3).
Here's an example program using IO::Select:
#!/usr/bin/perl
use strict;
use warnings;
use IO::Select;
use IPC::Open3;
use Symbol 'gensym';
my ($cmd_in, $cmd_out, $cmd_err);
$cmd_err = gensym;
my $pid = open3($cmd_in, $cmd_out, $cmd_err, $ARGV[0]);
close($cmd_in);
my $select = IO::Select->new($cmd_out, $cmd_err);
my $stdout_output = '';
my $stderr_output = '';
while (my #ready = $select->can_read(5)) {
foreach my $handle (#ready) {
if (sysread($handle, my $buf, 4096)) {
if ($handle == $cmd_out) {
$stdout_output .= $buf;
}
else {
$stderr_output .= $buf;
}
}
else {
# EOF or error
$select->remove($handle);
}
}
}
if ($select->count) {
print "Timed out\n";
kill('TERM', $pid);
}
close($cmd_out);
close($cmd_err);
waitpid($pid, 0); # reap the exit code
print "OUT:\n", $stdout_output;
print "ERR:\n", $stderr_output;
Notes:
I use lexical vars for file handles. This requires the use of gensym for the stderr handle.
The argument to can_read is the timeout in seconds.
I use sysread for non-buffered IO.
I terminate the child if there's a read timeout.
I came up with the following solution heavily based on this answer.
However using select and avoiding signals as in nwellnhof's example looks much cleaner which is why I accepted it. I'm posting it here if somebody is interested:
my $pid = open3(*CMD_IN, *CMD_OUT, *CMD_ERR, $cmd);
if ($pid > 0){
eval{
local $SIG{ALRM} = sub {kill 9, $pid;};
alarm 6;
waitpid($pid, 0);
alarm 0;
};
}

Reading STDOUT and STDERR of external command with no wait

I would like to execute external command rtmpdump and read it's STDOUT and STDERR separately, but not to wait till such command ends, but read its partial outputs in bulks, when available...
What is a safe way to do it in Perl?
This is a code I have that works "per-line" basis:
#!/usr/bin/perl
use warnings;
use strict;
use Symbol;
use IPC::Open3;
use IO::Select;
sub execute {
my($cmd) = #_;
print "[COMMAND]: $cmd\n";
my $pid = open3(my $in, my $out, my $err = gensym(), $cmd);
print "[PID]: $pid\n";
my $sel = new IO::Select;
$sel->add($out, $err);
while(my #fhs = $sel->can_read) {
foreach my $fh (#fhs) {
my $line = <$fh>;
unless(defined $line) {
$sel->remove($fh);
next;
}
if($fh == $out) {
print "[OUTPUT]: $line";
} elsif($fh == $err) {
print "[ERROR] : $line";
} else {
die "[ERROR]: This should never execute!";
}
}
}
waitpid($pid, 0);
}
But the above code works in text mode only, I believe. To use rtmpdump as a command, I need to collect partial outputs in binary mode, so do not read STDOUT line-by-line as it is in the above code.
Binary output of STDOUT should be stored in variable, not printed.
Using blocking functions (e.g. readline aka <>, read, etc) inside a select loop defies the use of select.
$sel->add($out, $err);
my %bufs;
while ($sel->count) {
for my $fh ($sel->can_read) {
my $rv = sysread($fh, $bufs{$fh}, 128*1024, length($bufs{$fh}));
if (!defined($rv)) {
# Error
die $! ;
}
if (!$rv) {
# Eof
$sel->remove($fh);
next;
}
if ($fh == $err) {
while ($bufs{$err} =~ s/^(.*\n)//) {
print "[ERROR] $1";
}
}
}
}
print "[ERROR] $bufs{$err}\n" if length($bufs{$err});
waitpid($pid, 0);
... do something with $bufs{$out} ...
But it would be much simpler to use IPC::Run.
use IPC::Run qw( run );
my ($out_buf, $err_buf);
run [ 'sh', '-c', $cmd ],
'>', \$out_buf,
'2>', sub {
$err_buf .= $_[0];
while ($err_buf =~ s/^(.*\n)//) {
print "[ERROR] $1";
}
};
print "[ERROR] $err_buf\n" if length($err_buf);
... do something with $out_buf ...
If you're on a POSIX system, try using Expect.pm. This is exactly the sort of problem it is designed to solve, and it also simplifies the task of sending keystrokes to the spawned process.

Can't read from socket in perl - possible deadlock?

My OS is Archlinux with perl 5.14.2. I am just trying to write a little program to accomplish a remote comlile. The program just passes a C source file to the server. The server will call gcc to compile the C code and pass the compiler's message. The client can't receive the compiler's message. I have the message in the server.
There is the code:
#!/usr/bin/perl -w
# oj.pl --- alpha
use warnings;
use strict;
use IO::File;
use IO::Socket;
use constant MY_TRAN_PORT => 138000;
$| = 1;
my $tmpFileToBeCompiled = IO::File->new ("> tmpFile09090989.c") or die "Can't creat this file";
#if (defined $tmpFileToBeCompiled) {
# print $tmpFileToBeCompiled "argh"; # just for test!
#}
# $fihi->close;
my $port = shift || MY_TRAN_PORT;
my $sock_server = IO::Socket::INET->new (Listen => 20,
LocalPort => $port,
Timeout => 60,
Reuse => 1)
or die "Can't create listening socket: $!\n";
my $tmp = 1;
while ($tmp) {
next unless my $session = $sock_server->accept;
my $peer = gethostbyaddr ($session->peeraddr, AF_INET)
|| $session->peerhost;
warn "Connection from [$peer, $port]\n";
while (<$session>) {
print $tmpFileToBeCompiled $_; # if it works, the filehandle should be changed into tmpFile. just fixed.
print $session "test!";
}
my #lines = `gcc tmpFile09090989.c 2>&1`;
foreach ( #lines) {
print $session $_ . "test!!!\n";
# $session->print;
}
print "OK!";
$tmpFileToBeCompiled->close;
warn "Connecting finished!\n";
$session->close;
$tmp --;
}
$sock_server->close;
----------------------------------------end--------------------------------------------------------
-------------------------------------client.pl--------------------------------------------------------
use warnings;
use strict;
use IO::Socket qw(:DEFAULT);
use File::Copy;
use constant MY_TRAN_PORT => 138000;
use IO::File;
my $host = shift || '127.0.0.1';
my $port = shift || MY_TRAN_PORT;
my $socket = IO::Socket::INET->new("$host:$port") or die $#;
my $fh = IO::File->new("a.c", "r");
my $child = fork();
die "Can't fork: $!\n" unless defined $child;
# if (!$child) {
# $SIG{CHLD} = sub { exit 0 };
# userToHost();
# print "Run userToHost done!\n";
# $socket->shutdown(1);
# sleep;
# } else {
# hostToUser();
# print "Run hostToUser done! \n";
# warn "Connection closed by foreign host\n";
# }
userToHost();
unless ($child) {
hostToUser();
print "Run hostToUser done! \n";
warn "Connection closed by foreign host\n";
$socket->close;
}
sub userToHost {
while (<$fh>) {
# print $_; # for debug
print $socket $_;
}
}
sub hostToUser {
while (<$socket >) {
print $_;
}
}
# copy ("a.c", $socket) or die "Copy failed: $!";
print "Done!";
You don't need to fork in client. At all. Just like themel said
You have error in client code: <$socket > should be <$socket>
You need to notify server that you have written all data and server can start compilation. Otherwise server will stuck at while (<$session>) forever.
To achieve this you could call shutdown($socket, 1) which means you finished writing. See perldoc -f shutdown
Final prototype (very rough) could look like this: https://gist.github.com/19b589b8fc8072e3cfff
yko nailed it, but let me just suggest that your task will be solved in a much easier and more maintainable way by a shell script running from inetd.

How can I timeout a forked process that might hang?

I am writing a Perl script that will write some inputs and send those inputs to an external program. There is a small but non-zero chance that this program will hang, and I want to time it out:
my $pid = fork;
if ($pid > 0){
eval{
local $SIG{ALRM} = sub { die "TIMEOUT!"};
alarm $num_secs_to_timeout;
waitpid($pid, 0);
alarm 0;
};
}
elsif ($pid == 0){
exec('echo blahblah | program_of_interest');
exit(0);
}
As it stands now, after $num_secs_to_timeout, program_of_interest still persists. I tried to kill it in the anonymous subroutine for $SIG{ALRM} as follows:
local $SIG{ALRM} = sub{kill 9, $pid; die "TIMEOUT!"}
but this doesn't do anything. program_of_interest is still persisting. How do I go about killing this process?
I was able to successfully kill my exec()ed process by killing the process group, as shown as the answer to question In perl, killing child and its children when child was created using open. I modified my code as follows:
my $pid = fork;
if ($pid > 0){
eval{
local $SIG{ALRM} = sub {kill 9, -$PID; die "TIMEOUT!"};
alarm $num_secs_to_timeout;
waitpid($pid, 0);
alarm 0;
};
}
elsif ($pid == 0){
setpgrp(0,0);
exec('echo blahblah | program_of_interest');
exit(0);
}
After timeout, program_of_interest is successfully killed.
The above code (by strictlyrude27) didn't work out of the box, because -$PID is spelt in capitals.
(BTW: there's also: http://www.gnu.org/software/coreutils/manual/html_node/timeout-invocation.html)
Here's an example with test:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
my $prg = basename $0;
my $num_secs_sleep = 2;
my $num_secs_to_timeout = 1;
my $orig_program = "sleep $num_secs_sleep; echo \"Look ma, survived!\"";
my $program = $orig_program;
my $expect = "";
if (#ARGV){
if($ARGV[0] eq "test"){
test();
exit 0;
} elsif (#ARGV == 1) {
$num_secs_to_timeout = $ARGV[0];
} elsif (#ARGV == 2) {
$program = $ARGV[0];
$num_secs_to_timeout = $ARGV[1];
} else {
die "Usage: $prg [ \"test\" | [program] seconds ] "
}
}
if($orig_program eq $program) {
if(#ARGV < 2) {
$expect = $num_secs_to_timeout > $num_secs_sleep ?
"(we expected to survive.)" : "(we expected to TIME OUT!)";
}
print STDERR "sleeping: $num_secs_sleep seconds$/";
}
print STDERR <<END;
timeout after: $num_secs_to_timeout seconds,
running program: '$program'
END
if($orig_program eq $program) {
print STDERR "$expect$/";
}
exit Timed::timed($program, $num_secs_to_timeout);
sub test {
eval "use Test::More qw(no_plan);";
my $stdout;
close STDOUT;
open STDOUT, '>', \$stdout or die "Can't open STDOUT: $!";
Timed::timed("sleep 1", 3);
is($stdout, undef);
Timed::timed("sleep 2", 1);
is($stdout, "TIME OUT!$/");
}
################################################################################
package Timed;
use strict;
use warnings;
sub timed {
my $retval;
my ($program, $num_secs_to_timeout) = #_;
my $pid = fork;
if ($pid > 0){ # parent process
eval{
local $SIG{ALRM} =
sub {kill 9, -$pid; print STDOUT "TIME OUT!$/"; $retval = 124;};
alarm $num_secs_to_timeout;
waitpid($pid, 0);
alarm 0;
};
return defined($retval) ? $retval : $?>>8;
}
elsif ($pid == 0){ # child process
setpgrp(0,0);
exec($program);
} else { # forking not successful
}
}
Hmmm your code works for me, after some minor modifications - which I assume are changes made by yourself to make the code into a generic example.
So that leaves me with two ideas:
You removed the problem when you created the sample code - try creating a small sample that actually runs (I had to change 'program_of_interest' and $num_secs_to_timeout to real values to test it). Make sure the sample has the same problem.
It's something to do with the program_of_interest you're running - as far as I know, you can't mask a kill 9, but maybe there's something going on. Have you tried testing your code with a really simple script. I created one for my testing that goes while (1) { print "hi\n"; sleep 1; }
Something else.
Good luck...
The only way SIGKILL can be ignored is if the process is stuck in a system call which is uninterruptible. Check the state of the hung process (with ps aux) if the state is D, then the process can't be killed.
You might also want to check that the function is being called by outputting something from it.