Perl: Move to next item in loop before it is done executing

I have a perl script that is like so:
foreach my $addr ('http://site1.com', ...., 'http://site2.com') {
    my $script = `curl -m 15 $addr`;
    # do stuff with $script
}
The -m sets a timeout of 15 seconds. Is there a way to make it so that if a user presses a key, it stops the current iteration and moves on to the next item in the foreach? I know next; can move to the next item, but I am unsure how to tie this to a key press, or how to do it while the curl command is running.
Edit: Based on the answers, it seems difficult to do this while curl is running. Would it be possible to press a key while curl is running and have it skip to the next item in the loop as soon as the curl command returns (or times out after 15 seconds)?

The problem you've got with this is that when you run curl, Perl hands over control and waits for completion. It blocks until it's 'done'.
So it's not as easy to do this as it might seem.
As another poster alludes to - you can use a variety of parallel processing options. I would suggest the easiest is to move away from using 'any' key, and require a ctrl-c.
So you'd then do:
foreach my $addr ('http://site1.com', ...., 'http://site2.com') {
    my $pid = open( my $curl_fh, "-|", "curl -m 15 $addr" )
        or die "Couldn't run curl: $!";
    $SIG{'INT'} = sub { print "Aborting fetch of $addr\n"; kill 'TERM', $pid };
    while (<$curl_fh>) {
        print;
    }
    close $curl_fh;

    # Might want to set it to something else.
    # undef means 'ctrl-c' will abort the whole program.
    # 'IGNORE' means exactly what it says on the tin.
    # Important to change it though, as the handler kills a specific pid,
    # and that might cause problems.
    $SIG{'INT'} = undef;
}
What this does is configure SIGINT (i.e. ctrl-c) so it doesn't kill your program, but does kill the sub-process.
If you wanted to look at other options, I'd offer:
Multithreading: spawn a thread to do the curl fetching in the background and use Thread::Queue to pass results back and forth (Thread::Queue supports non-blocking checks); see the sketch after this list.
Forking: fork a sub-process to do the curl, and use your 'main' process to send it a signal if a key is pressed.
IO::Select, so that you're not making blocking reads from the curl process.
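As a rough sketch of the first option only, assuming a threaded perl and a line-buffered terminal (so a "key press" really means a key followed by Enter); the details here are illustrative, not a drop-in solution:

use strict;
use warnings;
use threads;
use Thread::Queue;
use IO::Select;

my $stdin = IO::Select->new(\*STDIN);

foreach my $addr ('http://site1.com', 'http://site2.com') {
    my $results = Thread::Queue->new;

    # background thread does the blocking fetch and queues the result
    my $fetcher = threads->create(sub {
        $results->enqueue( scalar `curl -m 15 $addr` );
    });

    while (1) {
        # non-blocking check: has the fetch finished?
        if (defined( my $body = $results->dequeue_nb )) {
            # ... do stuff with $body ...
            last;
        }
        # brief wait for a key press; skip this site if one arrives
        if ($stdin->can_read(0.25)) {
            my $discard = <STDIN>;
            last;
        }
    }
    $fetcher->detach;   # let a skipped fetch finish (or time out) on its own
}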

Basically you have two options:
1. Use threads
Create a new thread and call the desired system function there, waiting for its output. In another thread, check for user input. On input, you can kill the child process. When the child process has finished, you can ignore user input.
Such a solution seems rather complex, with a lot of synchronization needed, probably using signals. Risky.
2. Use non-blocking IO
Please read this thread. It explains how to make non-blocking IO reads from either a file or a pipe. You'd make a non-blocking read from the pipe (created with open), then a non-blocking read from STDIN, and loop.
Seems like the way to go, but, alas, rather complex as well.
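If the simpler behaviour from the question's edit is enough (let curl run to completion or time out, then skip the processing if a key was pressed in the meantime), a zero-timeout select on STDIN after the backticks avoids most of that complexity. A hedged sketch, again assuming a line-buffered terminal:

use strict;
use warnings;
use IO::Select;

my $stdin = IO::Select->new(\*STDIN);

foreach my $addr ('http://site1.com', 'http://site2.com') {
    my $script = `curl -m 15 $addr`;    # blocks for up to 15 seconds

    # zero-timeout check: was anything typed while curl was running?
    if ($stdin->can_read(0)) {
        my $discard = <STDIN>;          # throw the pending input away
        next;                           # skip "do stuff" for this site
    }

    # ... do stuff with $script ...
}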

Related

A daemon to tail a log and fork multiple external (perl) scripts

I'm trying to write a program, actually a daemon, which stays in memory and performs something like tail -F on a rapidly updated log file. When it detects a new line in the file, the program has to launch another compiled Perl script which will perform some operations on the log line and then send it with a POST.
To explain clearly, I will refer to these two programs as "prgTAIL" and "prgPROCESS". So, prgTAIL tails the log and launches prgPROCESS, passing the new line to it.
Obviously prgTAIL doesn't have to wait for prgPROCESS to finish, because prgTAIL has to stay in memory and keep detecting new lines in the log. Also, the rate of file updates requires launching multiple parallel prgPROCESS instances. For this reason I'm using two programs: the first, small and fast, just passes the data to the second, which may be heavier because it can be launched in multiple instances.
On the prgTAIL I used:
a pipe to tail the log file
a while loop to launch prgPROCESS on each new log line
a fork() to continue without waiting for prgPROCESS to end
my $log_csv = "/log/csv.csv";
open (my $pipe, "-|", "tail", "-n0", "-F", $log_csv) or die "error";
while (<$pipe>) {
    my $line = $_;
    my $pid = fork();
    if (defined $pid && $pid == 0) {
        exec("/bin/prgPROCESS " . $line);   # I tried system() too.
        exit 0;
    }
}
What prgPROCESS does is not so important; anyway, it parses the $line passed as an argument, constructs an XML document, and then posts it via HTTPS.
So, this actually runs, but I think I messed something up with the processes, because when I reach around 550 new lines and prgPROCESS calls, prgTAIL keeps running but can't call prgPROCESS anymore because there are too many processes. I get this error in bash:
-bash: fork: Resource temporarily unavailable
What's wrong? Any ideas? Maybe the prgPROCESS processes don't end and stay stuck, leaving no room for other processes?
PS: I'm using Mac OS X now, but this will run on Linux.
Your problem is this:
while () {
doesn't have any constraint condition, so it's just spinning as fast as it can. You're never actually reading from your pipe, you're just forking as fast as you can and spawning that new script.
You might be wanting:
while ( my $line = <$pipe> ) {
#....
}
But really - it's arguable that you don't actually need to fork at all, because a read/process/read loop would probably do just fine - fork() and exec() is basically what system already does anyway.
You should also - if forking - clean up child processes. It doesn't matter too much for short running things, but things that sit in a loop will leave a lot of zombie processes. Either via setting $SIG{CHLD} or using waitpid.
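For example, a hedged sketch of a SIGCHLD reaper that could sit near the top of prgTAIL, using the non-blocking waitpid from POSIX:

use POSIX ':sys_wait_h';

# reap every child that has already exited, without blocking
$SIG{CHLD} = sub {
    while ( (my $kid = waitpid(-1, WNOHANG)) > 0 ) {
        # could log "$kid exited with status $?" here
    }
};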

Perl crashing with Parallel::ForkManager and WWW::Mechanize

I have written a Perl Script using WWW::Mechanize which reads URLs from a text file and connects to them one by one. In each operation, it parses the content of the webpage looking for some specific keywords and if found, it will be written to the output file.
To speed up the process, I used Parallel::ForkManager with MAX_CHILDREN set to 3. Though I have observed an increase in speed, the problem is that after a while the script crashes. The perl.exe process gets killed and it does not display any specific error message.
I have run the script multiple times to see if it always fails at the same point, however the point of failure seems to be intermittent.
Please note that I have already taken care of any memory leaks in WWW::Mechanize and HTML::TreeBuilder::XPath as follows:
For WWW::Mechanize, I set stack_depth(0) so that it does not cache the history of visited pages.
For HTML::TreeBuilder::XPath, I delete the root node once I am done with it. This approach helped me resolve a memory-leak issue in another similar script which does not use fork.
Here is the structure of the script, I have mentioned only the relevant parts here, please let me know if more details are required to troubleshoot:
#!/usr/bin/perl
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;
use Parallel::ForkManager;
use warnings;
use diagnostics;
use constant MAX_CHILDREN => 3;

open(INPUT, "<", $input)    || die("Couldn't read from the file, $input with error: $!\n");
open(OUTPUT, ">>", $output) || die("Couldn't open the file, $output with error: $!\n");

$pm = Parallel::ForkManager->new(MAX_CHILDREN);
$mech = WWW::Mechanize->new();
$mech->stack_depth(0);

while (<INPUT>)
{
    chomp $_;
    $url = $_;
    $pm->start() and next;
    $mech->get($url);
    if ($mech->success)
    {
        $tree = HTML::TreeBuilder::XPath->new();
        $tree->parse($mech->content);
        # do some processing here on the content and print the results to OUTPUT file
        # once done then delete the root node
        $tree->delete();
    }
    $pm->finish();
    print "Child Processing finished\n"; # it never reaches this point!
}
$pm->wait_all_children;
I would like to know why this Perl script keeps failing after a while.
For understanding purposes, I added a print statement right after the finish method of the fork manager; however, it does not print that.
I have also used the wait_all_children method, since as per the module's documentation on CPAN, it will wait until processing is finished for all the children of the parent process.
I have not understood why the wait_all_children method is placed outside the while or for loop, though (as observed in the documentation as well), since all the processing takes place inside the loop.
Thanks.
As for why this code is written with a main job loop containing the start and finish calls, followed by a wait_all_children outside the loop: it works like this:
1. The parent process gets the next job from <INPUT> at the start of each loop.
2. The parent runs start, which causes the child process to fork. At this point, you have 2 processes, each of which is running the exact same code at the exact same point.
3a. The parent process hits that or next and jumps back to the top to read the next <INPUT> and start the process over.
3b. The child process does not hit the or next and continues running the code you give it until it hits finish, where the child exits.
Meanwhile the parent process is busy going through the loop and creating a child each time through. After forking 3 children (or whatever you set the limit to) it blocks until one of the children exits. At that point, it immediately spawns a new child (resulting in step 3b for each child each time).
When the parent runs out of jobs, it drops out of the while loop (never having run any of the job code itself) and then waits for all the remaining children to exit.
As you can see, any code in the loop after finish is called is never going to run in either the parent (because it doesn't do anything after or next within the loop) or the children (because they exit at finish).
I've never used Parallel::ForkManager, but it looks like you can use a run_on_finish hook to run some code when a child finishes, which would let you put a print statement there at the end.
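As a small sketch of that hook, assuming Parallel::ForkManager's run_on_finish callback signature, registered once before the while loop:

$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_ref) = @_;
    print "Child $pid finished with exit code $exit_code\n";
});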
To find the problem, though, I'd suggest wrapping all the code between start and finish in an eval or use Try::Tiny and warn out the error to see if there's an exception happening in there that's breaking it. I'd expect such things to show up in STDERR when the child dies, though, so I'm not really sure that will help.
However, it's worth a shot. Here's my suggestion in code, just showing the part I'd catch exceptions from:
# At the top add
use Try::Tiny;
# Later in your main loop
$pm->start() and next;
try {
    $mech->get($url);
    if ($mech->success)
    {
        $tree = HTML::TreeBuilder::XPath->new();
        $tree->parse($mech->content);
        # do some processing here on the content and print the results to OUTPUT file
        # once done then delete the root node
        $tree->delete();
    }
}
catch {
    warn "Bad Stuff: ", $_;
};
$pm->finish();
That might help show you what's gone wrong.
If it does not help, you might try moving the try block to include more of the program (like nearly all of it after the use Try::Tiny line) and see if that elucidates anything.
The $pm->wait_all_children; call waits for ALL the child processes to end, blocking until they do. I am not sure what kind of error handling you have done for $mech inside the if() statement, but you may want to revisit that.

Killing an application started using system() in Perl

I am trying to run an application inside a Perl script using system(). The application I'm running gets stuck sometimes (it enters some kind of infinite loop). Is there a way I can know if this application is stuck and kill it to continue with the Perl script?
I'm trying to do something like this:
start testapp.exe;
if (stuck with testapp.exe) {
    kill testapp.exe;
}
Determining whether "it is stuck in an infinite loop" is known as the Halting Problem, and it is undecidable in general.
If you want to kill it, you will have to start the application with fork and then kill it from the parent process if it has been running for too long.
You can determine whether the process has been running for too long like this:
use POSIX ":sys_wait_h";
waitpid($pid, WNOHANG) > 0    # waitpid returns 0 if it is still running
at least according to this manual page. I am not sure how well it works on various systems; you can try it out.
Not a direct answer, but I can recommend the forks module if you want to fork with ease. It works only on UNIX systems (not Windows).
OK, here is more helpful code :) It works on UNIX; according to the perlfork perldoc, it should work on Windows exactly the same way.
use warnings;
use strict;
use POSIX ":sys_wait_h";

my $exited_cleanly;                     #to this variable I will save the info about exiting

my $pid = fork;
if (!$pid) {
    system("anything_long.exe");        #your long program
    exit;                               #make sure the child does not fall through
} else {
    sleep 10;                           #wait 10 seconds (can be longer)
    my $result = waitpid(-1, WNOHANG);  #here will be the result
    if ($result == 0) {                 #system is still running
        $exited_cleanly = 0;            #I already know I had to kill it
        kill('TERM', $pid);             #kill it with TERM ("cleaner") first
        sleep(1);                       #wait a bit to see if it ends
        my $result_term = waitpid(-1, WNOHANG);
        #did it end?
        if ($result_term == 0) {        #if it still didn't...
            kill('KILL', $pid);         #kill it with full force!
        }
    } else {
        $exited_cleanly = 1;            #it exited cleanly
    }
}

#you can now say something to the user, for example
if (!$exited_cleanly) {...}
system("start testapp")
is short for
system("cmd", "/c", "start testapp")
Perl just knows about cmd; it doesn't know anything about start, much less about testapp. system is not the tool you want. That's the first problem.
The second problem is that you haven't defined what it means to be "stuck". If you want to monitor a program, it needs a heartbeat. A heartbeat is a periodic activity that can be externally examined. It can be writing to a pipe. It can be changing a file. Anything.
The monitoring program listens for this heartbeat, and presumes the program is dead if the heart stops beating, so to speak.
"Killing" is done using signals in unix, but it's done using TerminateProcess in Windows. The third problem is that Perl core does not give you access to that function.
The solution to the first and third problem is Win32::Process. It allows you to launch a process in the background, and it also allows you to terminate it.
Creating a heartbeat is up to you.
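A rough sketch of that combination, assuming Win32::Process's Create/Wait/Kill interface and a hypothetical path for testapp.exe:

use Win32;
use Win32::Process;

my $proc;
Win32::Process::Create(
    $proc,
    'C:\\path\\to\\testapp.exe',   # hypothetical location
    'testapp.exe',
    0,
    NORMAL_PRIORITY_CLASS,
    '.',
) or die Win32::FormatMessage( Win32::GetLastError() );

# Wait up to 15 seconds; Wait() returns false if the process is still running.
unless ( $proc->Wait(15_000) ) {
    $proc->Kill(1);    # terminate it with exit code 1
}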
Here is one way you can handle the problem: if you know that testapp should not take more than N seconds to do its thing, you can use a timeout to kill the app by way of IPC::Run.
In the example below there is a timeout of 1 second which kills the sleep 10 command that takes too long (longer than the timeout of 1 second). If this doesn't do what you want, then you should provide more information about how you can detect that testapp.exe is "stuck".
#!/usr/bin/env perl
use IPC::Run qw( run timeout );

my ($in, $out, $err);
eval { # if (stuck with testapp.exe for more than N seconds)
    my @cmd = ('sleep', '10');   # this could be testapp.exe instead of sleep
    run \@cmd, \$in, \$out, \$err, timeout( 1 ) or die "test"; # start testapp.exe
    print "do stuff if cmd succeeds\n";
};
print "more stuff to do afterwards whether or not command fails or succeeds\n";
You can't determine that the application is stuck if you execute it like that, because the system statement won't return until the application terminates.
So, at least, you need to start the test application so it can run asynchronously from the Perl script that is to monitor it.
Having resolved that part of the problem, you have to establish a mechanism that will allow the monitoring Perl script to determine that the application is stuck. That is a non-trivial exercise, and likely system dependent, unless you adopt a simple expedient such as requiring the application to write a heart-beat indication somewhere, and the Perl script monitors for the heart-beat. For example (not necessarily a good example), the application could write the current time into a file identified by its PID, and the Perl script could monitor the file to see if the heart-beat is sufficiently recent. Of course, this assumes that the 'infinite loop' doesn't include code that writes to the heart-beat file.
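A hedged sketch of such a monitor, assuming a Unix-like system, that the application's PID is in $pid, and that the application touches the heart-beat file every few seconds (the file name and the 30-second threshold are made up):

my $beat_file = "/tmp/testapp.$pid.heartbeat";    # written by the application
my $age = time() - (stat $beat_file)[9];          # seconds since the last beat
if ($age > 30) {
    kill 'KILL', $pid;                            # presume it is stuck
}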

Making a Perl daemon that runs 24/7 and reads from named pipes

I'm trying to make a log analyser using perl. The analyser would run 24/7 in the background on an AIX server and read from pipes that syslog directs logs to (from the entire network). Basically:
logs from network ----> named pipe A --------> | perl daemon
                  ----> named pipe B --------> | * reads pipes
                  ----> named pipe C --------> | * decides what to do based on which pipe
So, for example, I want to be able to configure my daemon to mail root@domain.com all logs that are written to named pipe C. For this, I'm assuming the daemon needs to have a hash (I'm new to Perl, but this seems like an appropriate data structure) that can be changed on the fly and tells it what to do with each pipe.
Is this possible? Or should I create a .conf file in /etc to hold the information? Something like this:
namedpipeA:'mail root@domain.com'
namedpipeB:save:'mail user@domain.com'
So anything coming from A will be mailed to root@domain.com, and everything from B will be saved to a log file (as it usually is) AND sent to user@domain.com.
Seeing as this is my first time using Perl and my first time creating a daemon, is there any way for me to do this while adhering to the KISS principle? Also, are there any conventions that I should stick to? If you could take into consideration my lack of knowledge when replying, it would be most helpful.
I'll cover part of your question: how to write a long-running Perl program that deals with IO.
The most efficient way to write a Perl program that handles many simultaneous IO operations is to use an event loop. This will allow us to write handlers for events, like "a line appeared on the named pipe" or "the email was sent successfully" or "we received SIGINT". Crucially, it will allow us to compose an arbitrary number of these event handlers in one program. This means that you can "multitask" but still easily share state between the tasks.
We'll use the AnyEvent framework. It lets us write event handlers, called watchers, that will work with any event loop that Perl supports. You probably don't care which event loop you use, so this abstraction probably doesn't matter to your application. But it will let us reuse pre-written event handlers available on CPAN; AnyEvent::SMTP to handle email, AnyEvent::Subprocess to interact with child processes, AnyEvent::Handle to deal with the pipes, and so on.
The basic structure of an AnyEvent-based daemon is very simple. You create some watchers, enter the event loop, and ... that's it; the event system does everything else. To get started, let's write a program that will print "Hello" every five seconds.
We start by loading modules:
use strict;
use warnings;
use 5.010;
use AnyEvent;
Then, we'll create a time watcher, or a "timer":
my $t = AnyEvent->timer( after => 0, interval => 5, cb => sub {
    say "Hello";
});
Note that we assign the timer to a variable. This keeps the timer alive as long as $t is in scope. If we said undef $t, then the timer would be cancelled and the callback would never be called.
About callbacks, that's the sub { ... } after cb =>, and that's how we handle events. When an event happens, the callback is invoked. We do our thing, return, and the event loop continues calling other callbacks as necessary. You can do anything you want in callbacks, including cancelling and creating other watchers. Just don't make a blocking call, like system("/bin/sh long running process") or my $line = <$fh> or sleep 10. Anything that blocks must be done by a watcher; otherwise, the event loop won't be able to run other handlers while waiting for that task to complete.
Now that we have a timer, we just need to enter the event loop. Typically, you'll choose an event loop that you want to use, and enter it in the specific way that the event loop's documentation describes. EV is a good one, and you enter it by calling EV::loop(). But, we'll let AnyEvent make the decision about what event loop to use, by writing AnyEvent->condvar->recv. Don't worry what this does; it's an idiom that means "enter the event loop and never return". (You'll see a lot about condition variables, or condvars, as you read about AnyEvent. They are nice for examples in the documentation and in unit tests, but you really don't want to ever use them in your program. If you're using them inside a .pm file, you're doing something very wrong. So just pretend they don't exist for now, and you'll write extremely clean code right from the start. And that'll put you ahead of many CPAN authors!)
So, just for completeness:
AnyEvent->condvar->recv;
If you run that program, it will print "Hello" every five seconds until the universe ends, or, more likely, you kill it with control c. What's neat about this is that you can do other things in those five seconds between printing "Hello", and you do it just by adding more watchers.
So, now onto reading from pipes. AnyEvent makes this very easy with its AnyEvent::Handle module. AnyEvent::Handle can connect to sockets or pipes and will call a callback whenever data is available to read from them. (It can also do non-blocking writes, TLS, and other stuff. But we don't care about that right now.)
First, we need to open a pipe:
use autodie 'open';
open my $fh, '<', '/path/to/pipe';
Then, we wrap it with an AnyEvent::Handle. After creating the Handle object, we'll use it for all operations on this pipe. You can completely forget about $fh, AnyEvent::Handle will handle touching it directly.
my $h = AnyEvent::Handle->new( fh => $fh );
Now we can use $h to read lines from the pipe when they become available:
$h->push_read( line => sub {
    my ($h, $line, $eol) = @_;
    say "Got a line: $line";
});
This will call the callback that prints "Got a line" when the next line becomes available. If you want to continue reading lines, then you need to make the function push itself back onto the read queue, like:
my $handle_line; $handle_line = sub {
    my ($h, $line, $eol) = @_;
    say "Got a line: $line";
    $h->push_read( line => $handle_line );
};
$h->push_read( line => $handle_line );
This will read lines and call $handle_line->() for each line until the file is closed. If you want to stop reading early, that's easy... just don't push_read again in that case. (You don't have to read at the line level; you can ask that your callback be called whenever any bytes become available. But that's more complicated and left as an exercise to the reader.)
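If you do want the byte-level version, a rough sketch (building on the $h created above, and assuming AnyEvent::Handle's on_read callback plus direct access to its read buffer) looks roughly like this:

$h->on_read(sub {
    my ($h) = @_;
    my $chunk = $h->{rbuf};    # whatever bytes have arrived so far
    $h->{rbuf} = '';           # consume them from the buffer
    say "Got ", length($chunk), " bytes";
});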
So now we can tie this all together into a daemon that handles reading the pipes. What we want to do is: create a handler for lines, open the pipes and handle the lines, and finally set up a signal handler to cleanly exit the program. I recommend taking an OO approach to this problem; make each action ("handle lines from the access log file") a class with a start and stop method, instantiate a bunch of actions, setup a signal handler to cleanly stop the actions, start all the actions, and then enter the event loop. That's a lot of code that's not really related to this problem, so we'll do something simpler. But keep that in mind as you design your program.
#!/usr/bin/env perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Handle;
use EV;
use autodie 'open';
use 5.010;
my @handles;

my $abort; $abort = AnyEvent->signal( signal => 'INT', cb => sub {
    say "Exiting.";
    $_->destroy for @handles;
    undef $abort;
    # all watchers destroyed, event loop will return
});

my $handler; $handler = sub {
    my ($h, $line, $eol) = @_;
    my $name = $h->{name};
    say "$name: $line";
    $h->push_read( line => $handler );
};

for my $file (@ARGV) {
    open my $fh, '<', $file;
    my $h = AnyEvent::Handle->new( fh => $fh );
    $h->{name} = $file;
    $h->push_read( line => $handler );
    push @handles, $h;   # keep the handle alive and reachable from the INT handler
}
EV::loop;
Now you have a program that reads a line from an arbitrary number of pipes, prints each line received on any pipe (prefixed with the path to the pipe), and exits cleanly when you press Control-C!
First simplification - handle each named pipe in a separate process. That means you will run one perl process for each named pipe, but then you won't have to use event-based I/O or threading.
Given that, how about just passing the configuration data (path to the named pipe, email address to use, etc.) on the command line, e.g.:
the-daemon --pipe /path/to/named-pipe-A --mailto root@domainA.com
the-daemon --pipe /path/to/named-pipe-B --mailto root@domainB.com
...
Does that work for you?
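If it does, one hedged way to read those switches inside the-daemon is Getopt::Long (the option names here just mirror the example above):

use strict;
use warnings;
use Getopt::Long;

GetOptions(
    'pipe=s'   => \my $pipe_path,
    'mailto=s' => \my $mailto,
) or die "usage: $0 --pipe /path/to/named-pipe --mailto user\@example.com\n";

open my $fh, '<', $pipe_path or die "can't open $pipe_path: $!";
while (my $line = <$fh>) {
    # mail $line to $mailto, or whatever this pipe's action is
}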
To make sure the daemons stay up, have a look at a package like D. J. Bernstein's daemontools or supervisord (gasp! a python package).
Each of these packages tells you how to configure your rc-scripts so that they start at machine boot time.

How can I make fork in Perl in different scripts?

I have a Perl process that creates another one with the system command. I leave it in memory and pass some variables to it like this:
my $var1 = "Hello";
my $var2 = "World";
system "./another_process.pl $var1 $var2 &";
But the system command only returns the result; I need to get the PID. I want to do something like fork. What should I do? How can I do something like fork, but across different scripts?
Thanks in advance!
Perl has a fork function.
See perldoc perlfaq8 - How do I start a process in the background?
(contributed by brian d foy)
There's not a single way to run code in the background so you don't have to wait for it to finish before your program moves on to other tasks. Process management depends on your particular operating system, and many of the techniques are in perlipc.
Several CPAN modules may be able to help, including IPC::Open2 or IPC::Open3, IPC::Run, Parallel::Jobs, Parallel::ForkManager, POE, Proc::Background, and Win32::Process. There are many other modules you might use, so check those namespaces for other options too. If you are on a Unix-like system, you might be able to get away with a system call where you put an & on the end of the command:
system("cmd &")
You can also try using fork, as described in perlfunc (although this is the same thing that many of the modules will do for you).
STDIN, STDOUT, and STDERR are shared
Both the main process and the backgrounded one (the "child" process) share the same STDIN, STDOUT and STDERR filehandles. If both try to access them at once, strange things can happen. You may want to close or reopen these for the child. You can get around this with opening a pipe (see open) but on some systems this means that the child process cannot outlive the parent.
Signals
You'll have to catch the SIGCHLD signal, and possibly SIGPIPE too. SIGCHLD is sent when the backgrounded process finishes. SIGPIPE is sent when you write to a filehandle whose child process has closed (an untrapped SIGPIPE can cause your program to silently die). This is not an issue with system("cmd&").
Zombies
You have to be prepared to "reap" the child process when it finishes.
$SIG{CHLD} = sub { wait };
$SIG{CHLD} = 'IGNORE';
You can also use a double fork. You immediately wait() for your first child, and the init daemon will wait() for your grandchild once it exits.
unless ($pid = fork) {
    unless (fork) {
        exec "what you really wanna do";
        die "exec failed!";
    }
    exit 0;
}
waitpid($pid, 0);
See Signals in perlipc for other examples of code to do this. Zombies are not an issue with system("prog &").
It's true that you can use fork/exec, but I think it will be much easier to simply use the pipe form of open. Not only is the return value the pid you are looking for, you can be connected to either the stdin or stdout of the process, depending on how you open. For instance:
open my $handle, "foo|";
will return the pid of foo and connect you to its stdout, so that when you read from $handle you get a line of output from foo. Using "|foo" instead will allow you to write to foo's stdin.
You can also use open2 and open3 to do both simultaneously, though those come with some major caveats, as you can run into unexpected issues due to IO buffering.
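Tying that together, a small, hedged sketch of the pipe-open approach, using the three-argument form of open with 'foo' standing in for the real command:

my $pid = open( my $fh, '-|', 'foo' ) or die "can't run foo: $!";
print "foo is running with PID $pid\n";

while (my $line = <$fh>) {
    print "foo said: $line";
}
close $fh;    # also waits for foo to finish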
Use fork and exec.
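A minimal sketch of that, reusing $var1 and $var2 from the question and adding basic error checks:

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {                                   # child
    exec('./another_process.pl', $var1, $var2)
        or die "exec failed: $!";
}

# parent: $pid now holds the child's process ID
print "started another_process.pl with PID $pid\n";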
If you need to get the PID of a Perl script, you can use the $$ variable. You can put it in your another_process.pl and have it output the PID to a file. Can you be more specific about what you mean by "like fork"? You can always use the fork/exec combination.