Perl crashing with Parallel::ForkManager and WWW::Mechanize

I have written a Perl Script using WWW::Mechanize which reads URLs from a text file and connects to them one by one. In each operation, it parses the content of the webpage looking for some specific keywords and if found, it will be written to the output file.
To speed up the process, I used Parallel::ForkManager with MAX_CHILDREN set to 3. Though I have observed an increase in speed, the problem is that after a while the script crashes: the perl.exe process gets killed and no specific error message is displayed.
I have run the script multiple times to see if it always fails at the same point, however the point of failure seems to be intermittent.
Please note that I have already taken care of any memory leaks in WWW::Mechanize and HTML::TreeBuilder::XPath as follows:
For WWW::Mechanize, I set stack_depth(0) so that it does not cache the history of visited pages.
For HTML::TreeBuilder::XPath, I delete the root node once I am done with it. This approach helped me resolve a memory leak issue in another, similar script which does not use fork.
Here is the structure of the script, I have mentioned only the relevant parts here, please let me know if more details are required to troubleshoot:
#!/usr/bin/perl
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;
use Parallel::ForkManager;
use warnings;
use diagnostics;
use constant MAX_CHILDREN => 3;

open(INPUT,  "<",  $input)  || die("Couldn't read from the file, $input with error: $!\n");
open(OUTPUT, ">>", $output) || die("Couldn't open the file, $output with error: $!\n");

$pm = Parallel::ForkManager->new(MAX_CHILDREN);

$mech = WWW::Mechanize->new();
$mech->stack_depth(0);

while (<INPUT>)
{
    chomp $_;
    $url = $_;

    $pm->start() and next;

    $mech->get($url);
    if ($mech->success)
    {
        $tree = HTML::TreeBuilder::XPath->new();
        $tree->parse($mech->content);
        # do some processing here on the content and print the results to the OUTPUT file
        # once done then delete the root node
        $tree->delete();
    }

    $pm->finish();
    print "Child Processing finished\n"; # it never reaches this point!
}
$pm->wait_all_children;
I would like to know why this Perl script keeps failing after a while.
For my own understanding, I added a print statement right after the finish method of the fork manager, but it never prints.
I have also used the wait_all_children method, since, as per the module's documentation on CPAN, it waits until processing has finished for all children of the parent process.
I have not understood why the wait_all_children method is placed outside the while or for loop (as it is in the documentation as well), since all the processing takes place inside the loop.
Thanks.

As for why this code is written with a main job loop containing the start and finish calls, followed by a wait_all_children outside the loop: it works like this:
1. The parent process gets the next job from <INPUT> at the start of each loop.
2. The parent runs start, which causes the child process to fork. At this point, you have 2 processes, each of which is running the exact same code at the exact same point.
3a. The parent process hits the "and next" and jumps back to the top to read the next <INPUT> and start the process over.
3b. The child process does not hit the "and next" and continues running the code you give until it hits finish, where the child exits.
4. Meanwhile, the parent process is busy going through the loop and creating a child each time through. After forking 3 children (or whatever you set the limit to), it blocks until one of the children exits, at which point it immediately spawns a new child (resulting in step 3b for each child each time).
5. When the parent runs out of jobs, it falls out of the while loop (never having run anything inside it itself) and then waits for all the remaining children to exit.
As you can see, any code in the loop after finish is called is never going to run, in either the parent (because it doesn't do anything after the "and next" within the loop) or the children (because they exit at finish).
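To make steps 3a and 3b concrete, here is the same loop from the question with comments marking which process runs each line (a sketch only, not a change to your code):

while (my $url = <INPUT>) {
    $pm->start and next;    # parent: forks a child, then jumps to the next URL
                            # child:  start() returns 0, so it falls through

    # ... only the child ever runs this part ...

    $pm->finish;            # child exits here
    print "unreachable\n";  # parent took "next", child has already exited
}
$pm->wait_all_children;     # parent: block until every remaining child has exited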
I've never used Parallel::ForkManager, but it looks like you can add a run_on_finish hook to run some code when a child finishes, if you want a print statement there at the end.
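A sketch of that hook (run_on_finish is the callback registration method documented by Parallel::ForkManager; it runs in the parent each time a child exits):

# Register this in the parent, before the job loop.
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident) = @_;
    print "Child $pid finished with exit code $exit_code\n";
});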
To find the problem, though, I'd suggest wrapping all the code between start and finish in an eval or use Try::Tiny and warn out the error to see if there's an exception happening in there that's breaking it. I'd expect such things to show up in STDERR when the child dies, though, so I'm not really sure that will help.
However, it's worth a shot. Here's my suggestion in code, just showing the part I'd catch exceptions from:
# At the top add
use Try::Tiny;

# Later in your main loop
$pm->start() and next;

try {
    $mech->get($url);
    if ($mech->success)
    {
        $tree = HTML::TreeBuilder::XPath->new();
        $tree->parse($mech->content);
        # do some processing here on the content and print the results to OUTPUT file
        # once done then delete the root node
        $tree->delete();
    }
}
catch {
    warn "Bad Stuff: ", $_;
};

$pm->finish();
That might help show you what's gone wrong.
If it does not help, you might try moving the try block to include more of the program (like nearly all of it after the use Try::Tiny line) and see if that elucidates anything.

The $pm->wait_all_children; call waits for ALL the child processes to end, blocking until they do. I am not sure what kind of error handling you have done for $mech inside the if() statement, but you may want to revisit that.
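Note that WWW::Mechanize enables autocheck by default, so a failed get() throws an exception before if ($mech->success) is ever evaluated. One way to keep the error handling inside the if is to turn autocheck off; a sketch of how that slots into the child section of the loop in the question:

# Sketch: with autocheck off, a failed GET returns instead of dying,
# and you decide what to do about it.
my $mech = WWW::Mechanize->new( autocheck => 0 );

$mech->get($url);
unless ($mech->success) {
    warn "Failed to fetch $url: ", $mech->status, "\n";
    $pm->finish;   # child gives up on this URL
}
# ... parse the page as before ...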

Related

A daemon to tail a log and fork multiple external (perl) script

I'm trying to write a program, actually a daemon, which stays in memory and performs something like tail -F on a rapidly updated log file. When the program detects a new line in the file, it has to launch another compiled Perl script, which will perform some operations on the log line and then send it with a POST.
To explain clearly, I will refer to these two programs as "prgTAIL" and "prgPROCESS". So, prgTAIL tails the log and launches prgPROCESS, passing the new line to it.
Obviously prgTAIL doesn't have to wait for prgPROCESS to end, because prgTAIL has to stay in memory and keep detecting new lines in the log. Also, the rate of file updates requires launching multiple parallel prgPROCESS instances. For this reason I'm using two programs: the first is small and fast and just passes the data to the second, which may be heavier since it can be launched in multiple instances.
On the prgTAIL I used:
a pipe to tail the log file
a while loop to launch prgPROCESS on new log line
a fork() to continue without waiting for prgPROCESS to end
my $log_csv = "/log/csv.csv";
open (my $pipe, "-|", "tail", "-n0", "-F", $log_csv) or die "error";

while (<$pipe>) {
    $line = $_;
    my $pid = fork();
    if (defined $pid && $pid == 0) {
        exec("/bin/prgPROCESS " . $line);   # I tried system() too.
        exit 0;
    }
}
The prgPROCESS operations are not so important anyway: it parses the $line passed as an argument, constructs an XML document, and then posts it via HTTPS.
So, this actually runs, but I think I messed up something with the processes, because when it reaches around 550 new lines and prgPROCESS calls, prgTAIL keeps running but can't call prgPROCESS anymore because there are too many processes. I get this error in bash:
-bash: fork: Resource temporarily unavailable
What's wrong? Any ideas? Maybe the prgPROCESS processes don't end and stay stuck without making room for other processes?
PS: I'm using a Mac OS X now, but this will run on Linux.
Your problem is this:
while () {
doesn't have any constraint condition, so it's just spinning as fast as it can. You're never actually reading from your pipe, you're just forking as fast as you can and spawning that new script.
You might be wanting:
while ( my $line = <$pipe> ) {
    # ....
}
But really - it's arguable that you don't actually need to fork at all, because a read/process/read loop would probably do just fine - fork() and exec() are basically what system already does anyway.
You should also - if forking - clean up child processes. It doesn't matter too much for short-running programs, but things that sit in a loop will leave a lot of zombie processes. Either via setting $SIG{CHLD} or using waitpid.
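For example, a sketch of the two usual ways to reap children so they don't pile up as zombies (use one or the other):

# Option 1: tell the OS to reap children automatically; you can't collect
# their exit statuses, but nothing is left as a zombie.
$SIG{CHLD} = 'IGNORE';

# Option 2: reap explicitly, e.g. a non-blocking sweep inside the tail loop.
use POSIX ':sys_wait_h';
1 while waitpid(-1, WNOHANG) > 0;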

Perl: Move to next item in loop before it is done executing

I have a perl script that is like so:
foreach my $addr ('http://site1.com', ...., 'http://site2.com') {
my $script = `curl -m 15 $addr`;
*do stuff with $script*
}
The -m sets a timeout of 15 seconds. Is there a way to make it so that if a user presses a key, it stops the current execution and moves on to the next item in the foreach? I know next; can move to the next item, but I am unsure how to tie this to a key being pressed, and how to do it while curl is running.
Edit: Based on the answers, it seems difficult to do this while curl is running. Would it be possible to press a key while curl is running and have it skip to the next item in the loop as soon as curl returns (or times out after 15 seconds)?
The problem you've got with this, is that when you run curl perl hands over control and waits for completion. It blocks until it's 'done'.
So it's not as easy to do this as it might seem.
As another poster alludes to - you can use a variety of parallel processing options. I would suggest the easiest is to move away from using 'any' key, and require a ctrl-c.
So you'd then do:
foreach my $addr ('http://site1.com', ...., 'http://site2.com') {
    my $pid = open ( my $curl_fh, "-|", "curl -m 15 $addr" );
    $SIG{'INT'} = sub { print "Aborting fetch of $addr\n"; kill 'TERM', $pid };
    while ( <$curl_fh> ) {
        print;
    }
    # might want to set it to something else.
    # undef means 'ctrl-c' will abort the whole program.
    # IGNORE means exactly what it says on the tin.
    # important to change it though, as it has a specific pid it'll kill,
    # and that might cause problems.
    $SIG{'INT'} = undef;
}
What this does is configure SIGINT (e.g. ctrl-c) so it doesn't kill your program, but does kill the sub-process.
If you wanted to look at other options, I'd offer:
Multithreading, spawn a thread to 'do' the curl fetching in the background and use Thread::Queue to pass results back and forth. (Thread::Queue supports nonblocking checks).
Forking - fork a sub process to do the curl, and use your 'main' process to send a signal if a key is pressed.
IO::Select such that you're not making blocking reads on your process.
Basically you have two options:
1. Use threads
Create a new thread and call the desired system function there. Wait for output. In another thread, check for user input. On input, you can kill the child process. When the child process has finished, you can ignore user input.
Such a solution seems to be rather complex, with a lot of synchronization needed, probably with using signals. Risky.
2. Use non-blocking IO
Please read this thread. It explains how to make non-blocking IO reads from either a file or a pipe. You'd like to make a non-blocking read from pipe (created with open), then non-blocking read from STDIN, loop.
Seems like a way to go, but, alas, rather complex as well.
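For what it's worth, a rough sketch of that pipe-plus-STDIN polling with IO::Select (untested; it assumes the user presses Enter, since STDIN is line-buffered, and uses example URLs):

#!/usr/bin/perl
use strict;
use warnings;
use IO::Select;

URL: for my $addr ('http://site1.com', 'http://site2.com') {
    # run curl through a pipe; open returns the child's pid
    my $pid = open(my $curl_fh, '-|', 'curl', '-s', '-m', '15', $addr)
        or die "Cannot run curl: $!";

    my $sel    = IO::Select->new($curl_fh, \*STDIN);
    my $script = '';

    while (my @ready = $sel->can_read) {
        for my $fh (@ready) {
            if (fileno($fh) == fileno(STDIN)) {
                scalar <STDIN>;          # swallow the keypress line
                kill 'TERM', $pid;       # abort this fetch
                close $curl_fh;          # also reaps the curl child
                next URL;
            }
            # data (or EOF) from curl: read a chunk, don't wait for a full line
            my $n = sysread($fh, my $chunk, 4096);
            if (!$n) {                   # curl finished or timed out
                close $curl_fh;
                # ... do stuff with $script here ...
                next URL;
            }
            $script .= $chunk;
        }
    }
}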

In Perl is there a way to restart the program currently running from within itself?

I am running a program in Perl that at one point evaluates data in an if statement called from within a subroutine, e.g.
sub check_good {
    if (!good) {
        # exit this subroutine
        # restart program
    }
    else {
        # keep going
    }
} # end sub
The problem I have is with exiting and restarting. I know that I can just use exit 0; to exit straight out, but obviously this is not correct if I want to go back to the beginning. I tried calling the subroutine which essentially starts the program, but of course once it has run it will go back to this point again.
I thought about putting it in a while loop, but this would mean putting the whole file in the loop and it would be very impractical.
I don't actually know whether this is possible, so any input would be great.
If you have not changed @ARGV, or you keep a copy of it, you could possibly do something like exec($^X, $0, @ARGV).
$^X and $0 (or $EXECUTABLE_NAME and $PROGRAM_NAME, see Brian's comment below) are the current perl interpreter and current perl script, respectively.
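A minimal sketch of that idiom (the guard against endless restarts is my own addition, using a made-up RESTART_GUARD environment variable):

use strict;
use warnings;

sub restart_self {
    # RESTART_GUARD is only here so an accidental loop can't restart forever;
    # the exec'd copy inherits it through the environment.
    die "Refusing to restart twice\n" if $ENV{RESTART_GUARD}++;
    exec($^X, $0, @ARGV) or die "Could not re-exec $0: $!";
}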
An alternative would be to always have two processes: A supervisor and a worker.
Refactor all your logic into a subroutine called run (or main, or whatever). When your real logic detects that it needs to restart, it should exit with a predefined non-zero exit code (like 1, for example).
Then your main script and supervisor would look like this:
my $worker = fork;
die "fork failed: $!" unless defined $worker;

if ($worker == 0) {
    # child (worker) process
    run(@ARGV);
    exit 0;
}

# supervisor process
waitpid $worker, 0;
my $status = ($? >> 8);
if ($status == 1) { ... restart .. }
exit $status; # propagate exit code...
In the simple scenario where you just want to restart once, this might be a bit overkill. But if you at any point need to be able to handle other error scenarios this method might be preferable.
For example, if the exit code is 255, this indicates that the main script called die(). In this case you might want to implement some decision procedure on whether to restart the script, ignore the error, or escalate the issue.
There are quite a few modules on CPAN implementing such supervisors. Proc::Launcher is one of them, and its manual page includes an extensive discussion of related works. (I have never used Proc::Launcher; it is mainly because of that discussion that I'm linking to it.)
There's nothing to stop you calling system on yourself. Something like this (clearly in need of a tidy), where I pass in a command-line argument to prevent the code calling itself forever.
#!/usr/bin/perl
use strict;
use warnings;

print "Starting...\n";
sleep 5;

if (! @ARGV) {
    print "Start myself again...\n";
    system("./sleep.pl secondgo");
    print "...and die now\n";
    exit;
} elsif ((@ARGV) && $ARGV[0] eq "secondgo") {
    print "Just going to die straightaway this time\n";
    exit;
}

Making a Perl daemon that runs 24/7 and reads from named pipes

I'm trying to make a log analyser using perl. The analyser would run 24/7 in the background on an AIX server and read from pipes that syslog directs logs to (from the entire network). Basically:
logs from network ----> named pipe A --------> | perl daemon
                  ----> named pipe B --------> | * reads pipes
                  ----> named pipe C --------> | * decides what to do based on which pipe
So, for example, I want my daemon to be able to be configured to mail root@domain.com all logs that are written to named pipe C. For this, I'm assuming the daemon needs to have a hash (I'm new to Perl, but this seems like an appropriate data structure) that can be changed on the fly and tells it what to do with each pipe.
Is this possible? Or should I create a .conf file in /etc to hold the information. Something like this:
namedpipeA:'mail root@domain.com'
namedpipeB:save:'mail user@domain.com'
So anything coming from A will be mailed to root@domain.com, and everything from B will be saved to a log file (as it usually is) AND sent to user@domain.com.
Seeing as this is my first time using Perl and my first time creating a daemon, is there any way for me to do this while adhering to the KISS principle? Also, are there any conventions I should stick to? If you could take my lack of knowledge into consideration when replying, it would be most helpful.
I'll cover part of your question: how to write a long-running Perl program that deals with IO.
The most efficient way to write a Perl program that handles many simultaneous IO operations is to use an event loop. This will allow us to write handlers for events, like "a line appeared on the named pipe" or "the email was sent successfully" or "we received SIGINT". Crucially, it will allow us to compose an arbitrary number of these event handlers in one program. This means that you can "multitask" but still easily share state between the tasks.
We'll use the AnyEvent framework. It lets us write event handlers, called watchers, that will work with any event loop that Perl supports. You probably don't care which event loop you use, so this abstraction probably doesn't matter to your application. But it will let us reuse pre-written event handlers available on CPAN; AnyEvent::SMTP to handle email, AnyEvent::Subprocess to interact with child processes, AnyEvent::Handle to deal with the pipes, and so on.
The basic structure of an AnyEvent-based daemon is very simple. You create some watchers, enter the event loop, and ... that's it; the event system does everything else. To get started, let's write a program that will print "Hello" every five seconds.
We start by loading modules:
use strict;
use warnings;
use 5.010;
use AnyEvent;
Then, we'll create a time watcher, or a "timer":
my $t = AnyEvent->timer( after => 0, interval => 5, cb => sub {
    say "Hello";
});
Note that we assign the timer to a variable. This keeps the timer alive as long as $t is in scope. If we said undef $t, then the timer would be cancelled and the callback would never be called.
About callbacks, that's the sub { ... } after cb =>, and that's how we handle events. When an event happens, the callback is invoked. We do our thing, return, and the event loop continues calling other callbacks as necessary. You can do anything you want in callbacks, including cancelling and creating other watchers. Just don't make a blocking call, like system("/bin/sh long running process") or my $line = <$fh> or sleep 10. Anything that blocks must be done by a watcher; otherwise, the event loop won't be able to run other handlers while waiting for that task to complete.
Now that we have a timer, we just need to enter the event loop. Typically, you'll choose an event loop that you want to use, and enter it in the specific way that the event loop's documentation describes. EV is a good one, and you enter it by calling EV::loop(). But, we'll let AnyEvent make the decision about what event loop to use, by writing AnyEvent->condvar->recv. Don't worry what this does; it's an idiom that means "enter the event loop and never return". (You'll see a lot about condition variables, or condvars, as you read about AnyEvent. They are nice for examples in the documentation and in unit tests, but you really don't want to ever use them in your program. If you're using them inside a .pm file, you're doing something very wrong. So just pretend they don't exist for now, and you'll write extremely clean code right from the start. And that'll put you ahead of many CPAN authors!)
So, just for completeness:
AnyEvent->condvar->recv;
If you run that program, it will print "Hello" every five seconds until the universe ends, or, more likely, you kill it with control c. What's neat about this is that you can do other things in those five seconds between printing "Hello", and you do it just by adding more watchers.
So, now onto reading from pipes. AnyEvent makes this very easy with its AnyEvent::Handle module. AnyEvent::Handle can connect to sockets or pipes and will call a callback whenever data is available to read from them. (It can also do non-blocking writes, TLS, and other stuff. But we don't care about that right now.)
First, we need to open a pipe:
use autodie 'open';
open my $fh, '<', '/path/to/pipe';
Then, we wrap it with an AnyEvent::Handle. After creating the Handle object, we'll use it for all operations on this pipe. You can completely forget about $fh, AnyEvent::Handle will handle touching it directly.
my $h = AnyEvent::Handle->new( fh => $fh );
Now we can use $h to read lines from the pipe when they become available:
$h->push_read( line => sub {
    my ($h, $line, $eol) = @_;
    say "Got a line: $line";
});
This will call the callback that prints "Got a line" when the next line becomes available. If you want to continue reading lines, then you need to make the function push itself back onto the read queue, like:
my $handle_line; $handle_line = sub {
    my ($h, $line, $eol) = @_;
    say "Got a line: $line";
    $h->push_read( line => $handle_line );
};
$h->push_read( line => $handle_line );
This will read lines and call $handle_line->() for each line until the file is closed. If you want to stop reading early, that's easy... just don't push_read again in that case. (You don't have to read at the line level; you can ask that your callback be called whenever any bytes become available. But that's more complicated and left as an exercise to the reader.)
So now we can tie this all together into a daemon that handles reading the pipes. What we want to do is: create a handler for lines, open the pipes and handle the lines, and finally set up a signal handler to cleanly exit the program. I recommend taking an OO approach to this problem; make each action ("handle lines from the access log file") a class with a start and stop method, instantiate a bunch of actions, setup a signal handler to cleanly stop the actions, start all the actions, and then enter the event loop. That's a lot of code that's not really related to this problem, so we'll do something simpler. But keep that in mind as you design your program.
#!/usr/bin/env perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Handle;
use EV;
use autodie 'open';
use 5.010;

my @handles;

my $abort; $abort = AnyEvent->signal( signal => 'INT', cb => sub {
    say "Exiting.";
    $_->destroy for @handles;
    undef $abort;
    # all watchers destroyed, event loop will return
});

my $handler; $handler = sub {
    my ($h, $line, $eol) = @_;
    my $name = $h->{name};
    say "$name: $line";
    $h->push_read( line => $handler );
};

for my $file (@ARGV) {
    open my $fh, '<', $file;
    my $h = AnyEvent::Handle->new( fh => $fh );
    $h->{name} = $file;
    $h->push_read( line => $handler );
    push @handles, $h;   # keep the handle alive (and reachable from the INT handler)
}

EV::loop;
Now you have a program that reads a line from an arbitrary number of pipes, prints each line received on any pipe (prefixed with the path to the pipe), and exits cleanly when you press Control-C!
First simplification - handle each named pipe in a separate process. That means you will run one perl process for each named pipe, but then you won't have to use event-based I/O or threading.
Given that, how about just passing the configuration data (path to the named pipe, email address to use, etc.) on the command line, e.g.:
the-daemon --pipe /path/to/named-pipe-A --mailto root@domainA.com
the-daemon --pipe /path/to/named-pipe-B --mailto root@domainB.com
...
Does that work for you?
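If it does, a single-pipe worker along those lines could look roughly like this (a sketch only; it uses Getopt::Long for the options shown above and assumes a mail command is available for delivery):

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

GetOptions(
    'pipe=s'   => \my $pipe,
    'mailto=s' => \my $mailto,
) or die "Usage: $0 --pipe /path/to/fifo --mailto someone\@example.com\n";

while (1) {
    # opening a FIFO for reading blocks until a writer shows up
    open my $fh, '<', $pipe or die "Cannot open $pipe: $!";
    while (my $line = <$fh>) {
        # hand the line off however you like; piping to mail is one option
        open my $mail, '|-', 'mail', '-s', "log from $pipe", $mailto
            or die "Cannot run mail: $!";
        print {$mail} $line;
        close $mail;
    }
    # a named pipe returns EOF when the last writer closes it; reopen and keep going
}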
To make sure the daemons stay up, have a look at a package like D. J. Bernstein's daemontools or supervisord (gasp! a python package).
Each of these packages tells you how to configure your rc-scripts so that they start at machine boot time.

losing children in a fork

use strict;
use warnings;
use Parallel::ForkManager;

my $log = "/scripts/downloads/test.log";
print "Check the $log file\n" and open(LOG, ">$log");
*STDERR = *LOG;
*STDOUT = *LOG;

my $pmm = new Parallel::ForkManager(9);
my @arr = (1..200);
for (@arr) {
    $pmm->start and next; # do the fork
    print $_."\n";
    $pmm->finish; # do the exit in the child process
}
$pmm->wait_all_children;

open(READ, "$log") or die $!;
my @check = <READ>;
print "there are " . scalar @arr . " items....";
print "but we only got " . scalar @check . " items\n";
This is a simplified version of a script I have going.
In this case, every time I use more than 9 children I lose anywhere from 3 to 15 children, sometimes more. The obvious answer is to use fewer children, but in my "real" script, if I use fewer children it will take many more hours to complete, time we don't have. Why is it losing children, and is there a way to "capture" them and re-run them if they don't get run?
thx!
You're having all of your children write to the same logfile, and you haven't taken any steps to prevent them from overwriting each others' output. Without being able to reproduce the problem on my machine I can't say for sure, but I would guess that all N children are actually running, but some of the output is getting clobbered.
To have N processes write to the same file simultaneously without any output getting lost, you have to open the file for append rather than regular write. In C, you would use flags O_WRONLY|O_APPEND with open(2), or mode "a" with fopen(3). The kernel then ensures that all writes go to the very end of the file. (According to the man page, though, this is not reliable over NFS.) You also have to pay attention to how much you write to the file at once, or output from one process might show up in the middle of output from another. I don't know if either of these things is possible in Perl, but it sounds like you found another solution anyway.
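For reference, a minimal sketch of how the append-plus-lock approach looks in Perl ($log matches the file in the question; log_line is a placeholder name):

use strict;
use warnings;
use Fcntl ':flock';
use IO::Handle;

my $log = "/scripts/downloads/test.log";
open(my $log_fh, '>>', $log) or die "Cannot append to $log: $!";
$log_fh->autoflush(1);   # don't let buffering interleave partial writes

sub log_line {
    my ($line) = @_;
    flock($log_fh, LOCK_EX) or die "Cannot lock $log: $!";
    print {$log_fh} "$line\n";   # one complete line per write
    flock($log_fh, LOCK_UN);
}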