For the temp directory I need in my script, I go with:
my $work_dir = File::Temp->newdir(TEMPLATE => "/tmp/work.$$.XXXX" ) or die "Cannot create tempdir directory $!";
My hope with newdir() is to get the promise of:
By default the directory is deleted when the object goes out of scope.
Only to learn that if I hit CTRL-C, $work_dir will NOT be removed.
So I added signals:
use sigtrap qw(handler signal_handler normal-signals);
and then, with use File::Path 'remove_tree';, I simply define:
sub signal_handler
{
remove_tree $work_dir;
}
This helps if I hit CTRL-C while my script runs...
However, it does NOT clean up when I'm using the debugger and CTRL-C out of the debugger! If I quit cleanly (with the q command) the cleanup works fine; it is only when I CTRL-C out of the debug session that $work_dir is not removed.
Is it possible, in any way, to have the signal handler called automatically even within a Perl debug session?
(Or is there any other "proper" way to use/install signal handlers?)
Your signal handler isn't doing what you think it does, because passing an object to remove_tree doesn't work:
use strict;
use warnings;
use 5.010;
use File::Path qw(remove_tree);
use File::Temp;
my $tmpdir = File::Temp->newdir(CLEANUP => 0);
remove_tree $tmpdir;
say "$tmpdir still exists" if -d $tmpdir;
Outputs:
/tmp/lTfotn79RD still exists
The call to remove_tree in your signal handler seems to work when run outside of the debugger, but it's actually not doing anything. (You can prove this to yourself by commenting out the call to remove_tree and re-running your script.) So why does the directory get removed?
If a signal handler doesn't exit or die, execution continues wherever it left off before the signal was caught. In your case, after the signal handler finishes, the program simply runs to completion. When the program terminates, any objects that are still in scope are cleaned up by calling their DESTROY methods. File::Temp->newdir returns a File::Temp::Dir object; this object's DESTROY method is what actually removes the directory from the filesystem (it uses rmtree under the hood).
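To see that scope-based cleanup on its own, here is a minimal sketch (none of it is from the original script; it just demonstrates the DESTROY behaviour described above):
use strict;
use warnings;
use File::Temp;

my $saved_path;
{
    my $tmpdir = File::Temp->newdir;   # CLEANUP is on by default
    $saved_path = "$tmpdir";           # the object stringifies to the directory path
    print "in scope: directory exists\n" if -d $saved_path;
}   # $tmpdir goes out of scope here; its DESTROY removes the directory
print "out of scope: directory gone\n" unless -d $saved_path;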
This doesn't work when you interrupt the debugger; I'm not familiar with the debugger's internals, but I'm guessing it keeps references to objects so that DESTROY isn't called, even when you step off the end of the program. If you Ctrl+C again at this point, the object is never cleaned up, and neither is the temporary directory.
One way I found to work around this is to explicitly undef the object returned by File::Temp->newdir in your signal handler:
use strict;
use warnings;
use 5.010;
use File::Temp;
use sigtrap qw(handler cleanup normal-signals);
my $tmpdir = File::Temp->newdir;
sub cleanup {
my ($sig) = @_;
say "Caught signal SIG$sig";
undef $tmpdir;
exit 0;
}
This causes the DESTROY method to be called before the program exits, so you're not relying on the debugger to clean up. This seems like kind of a hack, though; why not just quit the debugger gracefully with q?
Note that you could also pass the stringified version of $tmpdir to remove_tree like this:
remove_tree "$tmpdir";
But I wouldn't recommend this, since the documentation strongly cautions against relying on file names:
For maximum security, endeavour always to avoid ever looking at, touching, or even imputing the existence of the filename. You do not know that that filename is connected to the same file as the handle you have, and attempts to check this can only trigger more race conditions. It's far more secure to use the filehandle alone and dispense with the filename altogether.
I like to use an END block. Any clean exit from the program, especially a 'quit' from the debugger, will trigger the END block and, in my case, delete all my test data.
So put your cleanup code in an END block and have your signal handler call exit() instead of remove_tree.
END {
remove_tree $work_dir;
}
sub signal_handler
{
exit();
}
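Putting the two answers together, a minimal sketch (combining the END block with the undef trick from above, so you don't rely on remove_tree accepting the object) might look like this:
use strict;
use warnings;
use File::Temp;
use sigtrap qw(handler signal_handler normal-signals);

my $work_dir = File::Temp->newdir(TEMPLATE => "/tmp/work.$$.XXXX");

END {
    # Runs on any clean exit, including a q from the debugger
    undef $work_dir;   # triggers File::Temp::Dir's DESTROY, which removes the directory
}

sub signal_handler {
    exit();   # exit() still runs the END block
}

# ... the rest of the script works inside $work_dir ...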
Related
Basically I want to use system(), exec(), back-ticks or something to make a system call, but then to immediately resume execution in the calling script without caring about the result of the call and whether or not it returns, dies, stalls, whatever.
Is it possible to do this without threading/forking?
Short answer, no. This is the entire point of fork(). The only question is whether you call fork() directly, or get something else to do that for you.
As mentioned, you could use the shell's backgrounding operator (&) to do that, but then you're using the shell, and that comes with the usual string-injection attack problems:
system("some command with args &");
More directly, you could just do the usual fork() and exec() and then not perform the waitpid() that normally happens, which is where the blocking occurs:
if(fork() == 0) {
exec("some", "command", "with", "args") or die "Cannot exec - $!";
}
# No waitpid here so no waiting
If you're doing that, it's best also to install a SIGCHLD handler that ignores the child when it does eventually exit, so as not to leave zombies hanging around. At some point near the beginning of the code, put
$SIG{CHLD} = 'IGNORE';
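Putting those two pieces together, a minimal fire-and-forget sketch (the command itself is just the placeholder from above) could look like:
use strict;
use warnings;

$SIG{CHLD} = 'IGNORE';   # let the kernel reap the child; no zombies to wait for

my $pid = fork();
die "Cannot fork: $!" unless defined $pid;
if ($pid == 0) {
    # child: replace this process with the external command
    exec("some", "command", "with", "args") or die "Cannot exec - $!";
}
# parent: no waitpid, so execution continues immediately
print "Launched background child $pid\n";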
Use the shell's background operator &:
system("some command &");
I am running a program in Perl that at one point evaluates data in an if statement called from within a subroutine, e.g.
sub check_good {
if (!$good) {
# exit this subroutine
# restart program
}
else {
# keep going
}
} # end sub
The problem I have is with exiting and restarting. I know that I can just use exit 0; to exit straight out, but obviously this is not correct if I want to go back to the beginning. I tried calling the subroutine which essentially starts the program, but of course once it has run it will go back to this point again.
I thought about putting it in a while loop, but this would mean putting the whole file in the loop and it would be very impractical.
I don't actually know whether this is possible, so any input would be great.
If you have not changed @ARGV, or you keep a copy of it, you could possibly do something like exec($^X, $0, @ARGV).
$^X and $0 (or $EXECUTABLE_NAME and $PROGRAM_NAME, see Brian's comment below) are the current perl interpreter and current perl script, respectively.
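For illustration, a minimal restart helper might look like this (restart_program is just a name made up for the example, and it assumes @ARGV was copied before anything modified it):
# Save the original arguments early, before any option parsing modifies @ARGV
my @original_argv = @ARGV;

sub restart_program {
    # Re-run the same interpreter ($^X) on the same script ($0) with the same arguments
    exec($^X, $0, @original_argv) or die "Cannot restart $0: $!";
}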
An alternative would be to always have two processes: A supervisor and a worker.
Refactor all your logic into a subroutine called run (or main or whatever). When your real logic detects that it needs to restart, it should exit with a predefined non-zero exit code (like 1, for example).
Then your main script and supervisor would look like this:
my $worker = fork;
die "Cannot fork: $!" unless defined $worker;
if ($worker == 0) {
    # child process
    run(@ARGV);
    exit 0;
}
# supervisor process
waitpid $worker, 0;
my $status = ($? >> 8);
if ($status == 1) { ... restart ... }
exit $status; # propagate exit code...
In the simple scenario where you just want to restart once, this might be a bit overkill. But if you at any point need to be able to handle other error scenarios this method might be preferable.
For example, if the exit code is 255, this indicates that the main script called die(). In this case you might want to implement some decision procedure whether to restart the script, ignore the error, or escalate the issue.
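As a hedged sketch, a supervisor loop built on the code above might look like this (the policy of when to restart or give up is just an example):
while (1) {
    my $worker = fork;
    die "Cannot fork: $!" unless defined $worker;
    if ($worker == 0) {
        run(@ARGV);   # the real logic
        exit 0;
    }
    waitpid $worker, 0;
    my $status = ($? >> 8);
    last if $status == 0;   # clean exit: stop supervising
    next if $status == 1;   # the worker asked for a restart
    warn "worker exited with status $status; not restarting\n";
    exit $status;           # escalate anything else
}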
There are quite a few modules on CPAN implementing such supervisors. Proc::Launcher is one of them, and its manual page includes an extensive discussion of related works. (I have never used Proc::Launcher myself; I'm linking to it mainly for that discussion.)
There's nothing to stop you calling system on yourself. Something like this (clearly in need of a tidy), where I pass in a command-line argument to prevent the code from calling itself forever:
#!/usr/bin/perl
use strict;
use warnings;
print "Starting...\n";
sleep 5;
if (! @ARGV) {
print "Start myself again...\n";
system("./sleep.pl secondgo");
print "...and die now\n";
exit;
} elsif ((@ARGV) && $ARGV[0] eq "secondgo") {
print "Just going to die straightaway this time\n";
exit;
}
I have written a Perl Script using WWW::Mechanize which reads URLs from a text file and connects to them one by one. In each operation, it parses the content of the webpage looking for some specific keywords and if found, it will be written to the output file.
To speed up the process, I used Parallel::ForkManager with MAX_CHILDREN set to 3. Though I have observed an increase in the speed, the problem is that, after a while the script crashes. Perl.exe process gets killed and it does not display any specific error message.
I have run the script multiple times to see if it always fails at the same point, however the point of failure seems to be intermittent.
Please note that I have already taken care of any memory leaks in WWW::Mechanize and HTML::TreeBuilder::XPath as follows:
For WWW::Mechanize, I set stack_depth(0) so that it does not cache the history of visited pages.
For HTML::TreeBuilder::XPath, I delete the root node once I am done with it. This approach helped me resolve a memory leak in another, similar script which does not use fork.
Here is the structure of the script, I have mentioned only the relevant parts here, please let me know if more details are required to troubleshoot:
#! /usr/bin/perl
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;
use Parallel::ForkManager;
use warnings;
use diagnostics;
use constant MAX_CHILDREN => 3;
open(INPUT,"<",$input) || die("Couldn't read from the file, $input with error: $!\n");
open(OUTPUT, ">>", $output) || die("Couldn't open the file, $output with error: $!\n");
$pm = Parallel::ForkManager->new(MAX_CHILDREN);
$mech=WWW::Mechanize->new();
$mech->stack_depth(0);
while(<INPUT>)
{
chomp $_;
$url=$_;
$pm->start() and next;
$mech->get($url);
if($mech->success)
{
$tree=HTML::TreeBuilder::XPath->new();
$tree->parse($mech->content);
# do some processing here on the content and print the results to OUTPUT file
# once done then delete the root node
$tree->delete();
}
$pm->finish();
print "Child Processing finished\n"; # it never reaches this point!
}
$pm->wait_all_children;
I would like to know, why does this Perl script keep failing after a while?
For understanding purposes, I added a print statement right after the finish method of the fork manager; however, it never prints.
I have also used the wait_all_children method since, as per the module's documentation on CPAN, it waits until processing is finished for all children of the parent process.
What I have not understood is why the wait_all_children method is placed outside the while (or for) loop, as shown in the documentation as well, since all the processing takes place inside the loop.
Thanks.
As for why this code is written with a main job loop containing the start and finish calls, followed by a wait_all_children call outside the loop: it works like this.
1. The parent process gets the next job from <INPUT> at the start of each loop.
2. The parent runs start, which causes the child process to fork. At this point, you have two processes, each of which is running the exact same code at the exact same point.
3a. The parent process hits the and next and jumps back to the top to read the next <INPUT> and start the process over.
3b. The child process does not hit the and next and continues running the code you give it until it hits finish, where the child exits.
Meanwhile the parent process is busy going through the loop and creating a child each time through. After forking 3 children (or whatever you set the limit to) it blocks until one of the children exit. At which point, it immediately spawns a new child (resulting in step 3b for each child each time).
When the parent runs out of jobs, it falls out of the while loop (never having run anything inside it itself) and then waits for all the remaining children to exit.
As you can see, any code in the loop after finish is called is never going to run, in either the parent (because it does nothing after and next within the loop) or the children (because they exit at finish).
I've never used Parallel::ForkManager, but it looks like you can register a run_on_finish hook to run some code when each child finishes, if you want to put a print statement there at the end.
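For example, a minimal run_on_finish hook might look like this (the callback arguments follow Parallel::ForkManager's documented order):
# Register the hook on the Parallel::ForkManager object before the main loop
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump) = @_;
    print "Child $pid finished with exit code $exit_code\n";
});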
To find the problem, though, I'd suggest wrapping all the code between start and finish in an eval or use Try::Tiny and warn out the error to see if there's an exception happening in there that's breaking it. I'd expect such things to show up in STDERR when the child dies, though, so I'm not really sure that will help.
However, it's worth a shot. Here's my suggestion in code, just showing the part I'd catch exceptions from:
# At the top add
use Try::Tiny;
# Later in your main loop
$pm->start() and next;
try {
$mech->get($url);
if($mech->success)
{
$tree=HTML::TreeBuilder::XPath->new();
$tree->parse($mech->content);
# do some processing here on the content and print the results to OUTPUT file
# once done then delete the root node
$tree->delete();
}
}
catch {
warn "Bad Stuff: ", $_;
};
$pm->finish();
That might help show you what's gone wrong.
If it does not help, you might try moving the try block to include more of the program (like nearly all of it after the use Try::Tiny line) and see if that elucidates anything.
The $pm->wait_all_children call waits for ALL the child processes to end, and it blocks until they do. I am not sure what kind of error handling you have done for $mech inside the if() statement, but you may want to revisit that.
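One way to make a single bad URL non-fatal for a child is a sketch like this (it assumes you turn off WWW::Mechanize's autocheck option, which otherwise makes get() die on HTTP errors):
my $mech = WWW::Mechanize->new( autocheck => 0 );   # don't die on HTTP errors
# ... inside the loop, in the child, after $pm->start() and next; ...
$mech->get($url);
unless ($mech->success) {
    warn "Failed to fetch $url: ", $mech->status, "\n";
    $pm->finish();   # give up on this URL only; the other children carry on
}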
I'm trying to make a log analyser using perl. The analyser would run 24/7 in the background on an AIX server and read from pipes that syslog directs logs to (from the entire network). Basically:
logs from network ----> named pipe A --------> | perl daemon
----> named pipe B --------> | * reads pipes
----> named pipe c --------> | * decides what to do based on which pipe
So, for example, I want my daemon to be able to be configured to mail root@domain.com all logs that are written to named pipe C. For this, I'm assuming the daemon needs to have a hash (I'm new to Perl, but this seems like an appropriate data structure) that could be changed on the fly and would tell it what to do with each pipe.
Is this possible? Or should I create a .conf file in /etc to hold the information. Something like this:
namedpipeA:'mail root@domain.com'
namedpipeB:save:'mail user@domain.com'
So getting anything from A will be mailed to root@domain.com, and everything from B will be saved to a log file (as it usually is) AND sent to user@domain.com.
Seeing as this is my first time using Perl and my first time creating a daemon, is there any way for me to do this while adhering to the KISS principle? Also, are there any conventions I should stick to? If you could take my lack of knowledge into consideration when replying, it would be most helpful.
I'll cover part of your question: how to write a long-running Perl program that deals with IO.
The most efficient way to write a Perl program that handles many simultaneous IO operations is to use an event loop. This will allow us to write handlers for events, like "a line appeared on the named pipe" or "the email was sent successfully" or "we received SIGINT". Crucially, it will allow us to compose an arbitrary number of these event handlers in one program. This means that you can "multitask" but still easily share state between the tasks.
We'll use the AnyEvent framework. It lets us write event handlers, called watchers, that will work with any event loop that Perl supports. You probably don't care which event loop you use, so this abstraction probably doesn't matter to your application. But it will let us reuse pre-written event handlers available on CPAN: AnyEvent::SMTP to handle email, AnyEvent::Subprocess to interact with child processes, AnyEvent::Handle to deal with the pipes, and so on.
The basic structure of an AnyEvent-based daemon is very simple. You create some watchers, enter the event loop, and ... that's it; the event system does everything else. To get started, let's write a program that will print "Hello" every five seconds.
We start by loading modules:
use strict;
use warnings;
use 5.010;
use AnyEvent;
Then, we'll create a time watcher, or a "timer":
my $t = AnyEvent->timer( after => 0, interval => 5, cb => sub {
say "Hello";
});
Note that we assign the timer to a variable. This keeps the timer alive as long as $t is in scope. If we said undef $t, then the timer would be cancelled and the callback would never be called.
About callbacks, that's the sub { ... } after cb =>, and that's how we handle events. When an event happens, the callback is invoked. We do our thing, return, and the event loop continues calling other callbacks as necessary. You can do anything you want in callbacks, including cancelling and creating other watchers. Just don't make a blocking call, like system("/bin/sh long running process") or my $line = <$fh> or sleep 10. Anything that blocks must be done by a watcher; otherwise, the event loop won't be able to run other handlers while waiting for that task to complete.
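For instance, if you needed a 10-second pause inside a callback, a sketch of the non-blocking way (my own example, using the same timer API as above) would be to schedule a one-shot timer and return rather than calling sleep:
# Inside a callback: instead of sleep 10, schedule a one-shot timer and return.
my $later;
$later = AnyEvent->timer( after => 10, cb => sub {
    say "ten seconds later, and the event loop kept running the whole time";
    undef $later;   # release the watcher once it has fired
});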
Now that we have a timer, we just need to enter the event loop. Typically, you'll choose an event loop that you want to use, and enter it in the specific way that the event loop's documentation describes. EV is a good one, and you enter it by calling EV::loop(). But, we'll let AnyEvent make the decision about what event loop to use, by writing AnyEvent->condvar->recv. Don't worry what this does; it's an idiom that means "enter the event loop and never return". (You'll see a lot about condition variables, or condvars, as you read about AnyEvent. They are nice for examples in the documentation and in unit tests, but you really don't want to ever use them in your program. If you're using them inside a .pm file, you're doing something very wrong. So just pretend they don't exist for now, and you'll write extremely clean code right from the start. And that'll put you ahead of many CPAN authors!)
So, just for completeness:
AnyEvent->condvar->recv;
If you run that program, it will print "Hello" every five seconds until the universe ends or, more likely, you kill it with Ctrl+C. What's neat about this is that you can do other things in those five seconds between printing "Hello", and you do it just by adding more watchers.
So, now onto reading from pipes. AnyEvent makes this very easy with its AnyEvent::Handle module. AnyEvent::Handle can connect to sockets or pipes and will call a callback whenever data is available to read from them. (It can also do non-blocking writes, TLS, and other stuff. But we don't care about that right now.)
First, we need to open a pipe:
use autodie 'open';
open my $fh, '<', '/path/to/pipe';
Then, we wrap it in an AnyEvent::Handle. After creating the Handle object, we'll use it for all operations on this pipe. You can completely forget about $fh; AnyEvent::Handle will deal with it directly.
my $h = AnyEvent::Handle->new( fh => $fh );
Now we can use $h to read lines from the pipe when they become available:
$h->push_read( line => sub {
my ($h, $line, $eol) = @_;
say "Got a line: $line";
});
This will call the callback that prints "Got a line" when the next line becomes available. If you want to continue reading lines, then you need to make the function push itself back onto the read queue, like:
my $handle_line; $handle_line = sub {
my ($h, $line, $eol) = @_;
say "Got a line: $line";
$h->push_read( line => $handle_line );
};
$h->push_read( line => $handle_line );
This will read lines and call $handle_line->() for each line until the file is closed. If you want to stop reading early, that's easy... just don't push_read again in that case. (You don't have to read at the line level; you can ask that your callback be called whenever any bytes become available. But that's more complicated and left as an exercise to the reader.)
So now we can tie this all together into a daemon that handles reading the pipes. What we want to do is: create a handler for lines, open the pipes and handle the lines, and finally set up a signal handler to cleanly exit the program. I recommend taking an OO approach to this problem; make each action ("handle lines from the access log file") a class with a start and stop method, instantiate a bunch of actions, setup a signal handler to cleanly stop the actions, start all the actions, and then enter the event loop. That's a lot of code that's not really related to this problem, so we'll do something simpler. But keep that in mind as you design your program.
#!/usr/bin/env perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Handle;
use EV;
use autodie 'open';
use 5.010;
my @handles;
my $abort; $abort = AnyEvent->signal( signal => 'INT', cb => sub {
say "Exiting.";
$_->destroy for @handles;
undef $abort;
# all watchers destroyed, event loop will return
});
my $handler; $handler = sub {
my ($h, $line, $eol) = @_;
my $name = $h->{name};
say "$name: $line";
$h->push_read( line => $handler );
};
for my $file (@ARGV) {
open my $fh, '<', $file;
my $h = AnyEvent::Handle->new( fh => $fh );
$h->{name} = $file;
$h->push_read( line => $handler );
}
EV::loop;
Now you have a program that reads a line from an arbitrary number of pipes, prints each line received on any pipe (prefixed with the path to the pipe), and exits cleanly when you press Control-C!
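Assuming you save that script as, say, readpipes.pl (the file name is just an example), you would start it with something like: perl readpipes.pl /path/to/named-pipe-A /path/to/named-pipe-B /path/to/named-pipe-C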
First simplification - handle each named pipe in a separate process. That means you will run one perl process for each named pipe, but then you won't have to use event-based I/O or threading.
Given that, how about just passing the configuration data (path to the named pipe, email address to use, etc.) on the command line, e.g.:
the-daemon --pipe /path/to/named-pipe-A --mailto root@domainA.com
the-daemon --pipe /path/to/named-pipe-B --mailto root@domainB.com
...
Does that work for you?
To make sure the daemons stay up, have a look at a package like D. J. Bernstein's daemontools or supervisord (gasp! a python package).
Each of these packages tells you how to configure your rc-scripts so that they start at machine boot time.
I'm debugging a really weird problem with a long-running Perl script.
The problem is that the script does not exit on die() as expected. Instead the script just hangs without returning.
I've not defined any error handlers myself so I would assume that die() would lead to an immediate termination of the script.
This is the basic structure of the script and the modules used:
#!/usr/bin/perl
use strict;
use utf8;
use warnings;
use DBI; # with MySQL driver ("dbi:mysql:database=...")
use Geo::IP;
use POSIX;
use URI::Escape;
open(COMMAND, 'command_line |');
while (<COMMAND>) {
#
# .. stuff that can go wrong ..
#
die("I'm expecting the script to terminate here. It doesn't.") if ($gone_wrong);
}
close(COMMAND);
What could be the explanation to this behaviour? Is any of the modules used known to set up error handlers that could explain the script hanging on die()?
Well, END blocks and object destructors are still called after a die. If one of those hangs (or does something that takes a long time), the script won't exit immediately. But that should happen after printing the message from die (unless STDERR is buffered so you don't see the message immediately).
You mention DBI, so you probably have a database handle whose destructor is being called. (I'm not sure that's the problem, though.)
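To see that destructor-after-die behaviour in isolation, here is a small contrived sketch (the package name and the sleep are made up purely to simulate a slow destructor):
package SlowCleanup;
sub new     { bless {}, shift }
sub DESTROY { warn "destructor running...\n"; sleep 30 }   # simulates a slow cleanup

package main;
my $obj = SlowCleanup->new;
die "dying now\n";   # the die message appears, then DESTROY runs and delays the exit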