losing children in a fork - perl

use strict;
use warnings;
use Parallel::ForkManager;
my $log = "/scripts/downloads/test.log";
print "Check the $log file\n" and open(LOG, ">$log");
*STDERR = *LOG;
*STDOUT = *LOG;
my $pmm = new Parallel::ForkManager(9);
my @arr = (1..200);
for (@arr) {
    $pmm->start and next; # do the fork
    print $_ . "\n";
    $pmm->finish; # do the exit in the child process
}
$pmm->wait_all_children;
open(READ, "$log") or die $!;
my @check = <READ>;
print "there are " . scalar @arr . " items....";
print "but we only got " . scalar @check . " items\n";
This is a simplified version of a script I have going.
In this case, every time I use more than 9 children I lose anywhere from 3 to 15 children, sometimes more. The obvious answer is to use fewer children, but in my "real" script, using fewer children means it will take many more hours to complete...time we don't have. Why is it losing children, and is there a way to "capture" them and re-run them if they don't get run?
thx!

You're having all of your children write to the same logfile, and you haven't taken any steps to prevent them from overwriting each other's output. Without being able to reproduce the problem on my machine I can't say for sure, but I would guess that all N children are actually running, and some of the output is getting clobbered.
To have N processes write to the same file simultaneously without any output getting lost, you have to open the file for append rather than regular write. In C, you would use flags O_WRONLY|O_APPEND with open(2), or mode "a" with fopen(3). The kernel then ensures that all writes go to the very end of the file. (According to the man page, though, this is not reliable over NFS.) You also have to pay attention to how much you write to the file at once, or output from one process might show up in the middle of output from another. Both of these things are possible in Perl as well, but it sounds like you found another solution anyway.
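For what it's worth, here is a minimal sketch of the append-mode approach in Perl, assuming each child opens the shared log itself rather than inheriting the parent's handle (the path is the one from the question; $logfh is my own name):
use strict;
use warnings;
use IO::Handle;

my $log = "/scripts/downloads/test.log";

# ">>" opens with O_APPEND, so the kernel places each write at the current
# end of the file regardless of what other processes have written meanwhile.
open(my $logfh, '>>', $log) or die "Cannot open $log: $!";

# Disable buffering so each print goes out as one small write();
# otherwise buffered fragments from different children can interleave.
$logfh->autoflush(1);

print {$logfh} "child $$ finished item $_\n";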

Related

How to pipe to and read from the same tempfile handle without race conditions?

Was debugging a Perl script for the first time in my life and came across this:
$my_temp_file = File::Temp->tmpnam();
system("cmd $blah | cmd2 > $my_temp_file");
open(FIL, "$my_temp_file");
...
unlink $my_temp_file;
This works pretty much like I want, except the obvious race conditions in lines 1-3. Even if using proper tempfile() there is no way (I can think of) to ensure that the file streamed to at line 2 is the same opened at line 3. One solution might be pipes, but the errors during cmd might occur late because of limited pipe buffering, and that would complicate my error handling (I think).
How do I:
Write all output from cmd $blah | cmd2 into a tempfile opened file handle?
Read the output without re-opening the file (risking race condition)?
You can open a pipe to a command and read its contents directly with no intermediate file:
open my $fh, '-|', 'cmd', $blah;
while( <$fh> ) {
...
}
With short output, backticks might do the job, although in this case you have to be more careful to scrub the inputs so they aren't misinterpreted by the shell:
my $output = `cmd $blah`;
There are various modules on CPAN that handle this sort of thing, too.
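For example, Capture::Tiny (just one option, not something the original answer names) can grab both output streams and the exit status without any temporary file. A minimal sketch, reusing the question's cmd/cmd2 pipeline and $blah argument:
use strict;
use warnings;
use Capture::Tiny qw(capture);

my $blah = 'whatever';   # stands in for the question's $blah

# capture() runs the block and returns what it wrote to STDOUT and STDERR,
# plus the block's return value (here, system()'s wait status).
my ($stdout, $stderr, $status) = capture {
    system("cmd $blah | cmd2");
};

die "pipeline failed ($status): $stderr" if $status != 0;
print $stdout;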
Some comments on temporary files
The comments mentioned race conditions, so I thought I'd write a few things for those wondering what people are talking about.
In the original code, Andreas uses File::Temp, a module from the Perl Standard Library. However, they use the tmpnam POSIX-like call, which has this caveat in the docs:
Implementations of mktemp(), tmpnam(), and tempnam() are provided, but should be used with caution since they return only a filename that was valid when function was called, so cannot guarantee that the file will not exist by the time the caller opens the filename.
This is discouraged and was removed for Perl v5.22's POSIX.
That is, you get back the name of a file that does not exist yet. After you get the name, you don't know if that filename was made by another program. And, that unlink later can cause problems for one of the programs.
The "race condition" comes in when two programs that probably don't know about each other try to do the same thing as roughly the same time. Your program tries to make a temporary file named "foo", and so does some other program. They both might see at the same time that a file named "foo" does not exist, then try to create it. They both might succeed, and as they both write to it, they might interleave or overwrite the other's output. Then, one of those programs think it is done and calls unlink. Now the other program wonders what happened.
In the malicious exploit case, some bad actor knows a temporary file will show up, so it recognizes a new file and gets in there to read or write data.
But this can also happen within the same program. Two or more versions of the same program run at the same time and try to do the same thing. With randomized filenames, it is probably exceedingly rare that two running programs will choose the same name at the same time. However, we don't care how rare something is; we care how devastating the consequences are should it happen. And, rare is much more frequent than never.
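For the curious, the usual way to close that window at the filesystem level is to ask the kernel to create the file atomically. A minimal sketch with sysopen (the filename is just an illustration; this exclusive-create dance is roughly what File::Temp does for you under the hood):
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT O_EXCL);

my $name = "/tmp/myapp.$$.scratch";   # hypothetical name, for illustration only

# O_EXCL makes the open fail if the file already exists, so two processes
# can never both believe they created the same file.
sysopen(my $fh, $name, O_RDWR | O_CREAT | O_EXCL, 0600)
    or die "Could not create $name exclusively: $!";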
File::Temp
Knowing all that, File::Temp handles the details of ensuring that you get a filehandle:
my( $fh, $name ) = File::Temp->tempfile;
This uses a default template to create the name. When the filehandle goes out of scope, File::Temp also cleans up the mess.
{
    my( $fh, $name ) = File::Temp->tempfile;
    print $fh ...;
    ...;
} # file cleaned up
Some systems might automatically clean up temp files, although I haven't cared about that in years. Typically it was a batch job (say, once a week).
I often go one step further by giving my temporary filenames a template, where the Xs are literal characters the module recognizes and fills in with randomized characters:
my( $fh, $name ) = File::Temp->tempfile(
    sprintf "$0-%d-XXXXXX", time );
I'm often doing this while I'm developing things so I can watch the program make the files (and in which order) and see what's in them. In production I probably want to obscure the source program name ($0) and the time; I don't want to make it easier to guess who's making which file.
A scratchpad
I can also open a temporary file with open by not giving it a filename. This is useful when you only need scratch space and nothing outside the program needs the file. Opening it read-write means you can output some stuff then move around that file (we show a fixed-length record example in Learning Perl):
open(my $tmp, "+>", undef) or die ...
print $tmp "Some stuff\n";
seek $tmp, 0, 0;
my $line = <$tmp>;
File::Temp opens the temp file in O_RDWR mode so all you have to do is use that one file handle for both reading and writing, even from external programs. The returned file handle is overloaded so that it stringifies to the temp file name so you can pass that to the external program. If that is dangerous for your purpose you can get the fileno() and redirect to /dev/fd/<fileno> instead.
All you have to do is mind your seeks and tells. :-) Just remember to always set autoflush!
use File::Temp;
use Data::Dump;

my $fh = File::Temp->new;
$fh->autoflush;

system "ls /tmp/*.txt >> $fh" and die $!;
my @lines = <$fh>;
printf "%s\n\n", Data::Dump::pp(\@lines);

print $fh "How now brown cow\n";
seek $fh, 0, 0 or die $!;
my @lines2 = <$fh>;
printf "%s\n", Data::Dump::pp(\@lines2);
Which prints
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
]
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
"How now brown cow\n",
]
HTH

What is the preferable way to get the I/O of another console program in Perl?

It seems there are two ways to get the I/O of another program in Perl: one is to surround the program and its arguments with backticks (`), the other is to use open as below. Is there any difference between the two, and which one is preferred?
# 2 options to get the I/O of another program
1. $output = `program`;
2. open(PIPE, 'program |');
Your 2nd case appears to be piping to 'program', rather than receiving input from it.
If you mean (and I think you do):
open(PIPE, "program |")
then the advantage of piping in is that you can process record-by-record, and you don't consume the whole of the process's output in one go (as occurs in the first scenario). That would be better from a memory perspective, and possibly from a time perspective if the program runs over a long period.
A possibility is to pipe from stdin, and then you can organise your pipes on the command line, and provide additional filtering etc. if required, and not hardcode a dependency on a particular binary.
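In other words, the script just reads STDIN and the shell does the plumbing. A minimal sketch (filter.pl is a made-up name):
#!/usr/bin/perl
# filter.pl - process whatever is piped into us, record by record
use strict;
use warnings;

while ( my $line = <STDIN> ) {
    chomp $line;
    # ... filter / transform each record here ...
    print "GOT: $line\n";
}
Run it as program | perl filter.pl, so swapping in a different producer needs no code change.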
Pipes are useful when you want to read output from a process over time. Let's say you want to monitor a log file.
You can, for example, write:
open(PIPE, 'tail -f log.txt |');
while (<PIPE>) {
chomp;
print("NEW LINE: $_\n");
}
and do some action on every new line added to the log.
If you were to write:
$output = `tail -f log.txt`;
print($output);
Then your program would never actually get to the print part, since a command in backticks will wait until the command has returned.
There are a lot of different ways of communicating between processes in Perl - so many that there's a whole section of the documentation devoted to it: perlipc
For simple requirements, it doesn't really matter what you use. I would say - use whatever's most obvious given the context.
If you're just running a command and grabbing the output as a block:
my $df_out = `df -h`;
Then I consider backticks most appropriate.
If you're wanting to do line by line parsing - backticks will work, but generally I'll prefer an open.
E.g.
open ( my $ssh_data, "-|", "ssh -n $hostname df" );
my %size_of;
while ( <$ssh_data> ) {
    my ( $fs, $total, $used, $avail ) = split;
    $size_of{$fs} = $total;
}
close ( $ssh_data );
IPC::Open2 and IPC::Open3 are also options to allow bidirectional communication.
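For example, here's a minimal IPC::Open2 sketch using sort as a stand-in child process, just to show the shape of bidirectional use:
use strict;
use warnings;
use IPC::Open2;

# open2() returns the child's PID and gives us one handle reading from the
# child's STDOUT and one handle writing to its STDIN.
my $pid = open2( my $from_child, my $to_child, 'sort' );

print {$to_child} "pear\napple\nbanana\n";
close $to_child;                 # signal end-of-input to sort

while ( my $line = <$from_child> ) {
    print "sorted: $line";
}

waitpid $pid, 0;                 # reap the child
Be aware that writing a lot of data before reading can deadlock once the pipe buffers fill up; IPC::Run handles that kind of interleaving more gracefully.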
One caveat though - I'd strongly recommend not using 2-argument open, and using lexical filehandles. It's a good habit to train yourself in: even if it doesn't matter now, getting used to 3-arg open is worth doing.

Perl crashing with Parallel::ForkManager and WWW::Mechanize

I have written a Perl script using WWW::Mechanize which reads URLs from a text file and connects to them one by one. For each one, it parses the content of the webpage looking for some specific keywords and, if they are found, writes them to the output file.
To speed up the process, I used Parallel::ForkManager with MAX_CHILDREN set to 3. Though I have observed an increase in speed, the problem is that after a while the script crashes. The perl.exe process gets killed and it does not display any specific error message.
I have run the script multiple times to see if it always fails at the same point, however the point of failure seems to be intermittent.
Please note that I have already taken care of any memory leaks in WWW::Mechanize and HTML::TreeBuilder::XPath as follows:
For WWW::Mechanize, I set stack_depth(0) so that it does not cache the history of visited pages.
For HTML::TreeBuilder::XPath, I delete the root node once I am done with it. This approach helped me resolve a memory leak in another similar script which does not use fork.
Here is the structure of the script, I have mentioned only the relevant parts here, please let me know if more details are required to troubleshoot:
#!/usr/bin/perl
use warnings;
use diagnostics;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use Parallel::ForkManager;   # needed for Parallel::ForkManager->new below

use constant MAX_CHILDREN => 3;

open(INPUT, "<", $input)    || die("Couldn't read from the file, $input with error: $!\n");
open(OUTPUT, ">>", $output) || die("Couldn't open the file, $output with error: $!\n");

$pm = Parallel::ForkManager->new(MAX_CHILDREN);
$mech = WWW::Mechanize->new();
$mech->stack_depth(0);

while (<INPUT>)
{
    chomp $_;
    $url = $_;
    $pm->start() and next;
    $mech->get($url);
    if ($mech->success)
    {
        $tree = HTML::TreeBuilder::XPath->new();
        $tree->parse($mech->content);
        # do some processing here on the content and print the results to OUTPUT file
        # once done then delete the root node
        $tree->delete();
    }
    $pm->finish();
    print "Child Processing finished\n"; # it never reaches this point!
}
$pm->wait_all_children;
$pm->wait_all_children;
I would like to know why this Perl script keeps failing after a while.
To understand what is going on, I added a print statement right after the finish call of the fork manager; however, it never prints.
I have also used the wait_all_children method since, as per the module's documentation on CPAN, it waits until all children of the parent process have finished.
I have not understood why the wait_all_children method is placed outside the while or for loop, though (as shown in the documentation as well), since all the processing takes place inside the loop.
Thanks.
As for why this code is written with a main job loop around the start and finish calls, followed by wait_all_children outside the loop: it works like this:
1. The parent process gets the next job from <INPUT> at the start of each loop.
2. The parent runs start, which causes the child process to fork. At this point, you have 2 processes, each of which is running the exact same code at the exact same point.
3a. The parent process hits that and next and jumps back to the top to read the next <INPUT> and start the process over.
3b. The child process does not hit the and next and continues running the code you give it until it hits finish, where the child exits.
Meanwhile the parent process is busy going through the loop and creating a child each time through. After forking 3 children (or whatever you set the limit to), it blocks until one of the children exits, at which point it immediately spawns a new child (resulting in step 3b for each child each time).
When the parent runs out of jobs, it jumps out of the while loop (never having run anything in it itself) and then waits for all the remaining children to exit.
As you can see, any code in the loop after finish is called is never going to run in either the parent (because it doesn't do anything after and next within the loop) or the children (because they exit at finish).
I've never used Parallel::ForkManager, but it looks like you can register a run_on_finish callback to run some code whenever a child finishes, if you want to put a print statement there at the end.
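A minimal sketch of that, going purely from the Parallel::ForkManager docs (I haven't run this against your script), registered on $pm before the job loop:
$pm->run_on_finish( sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump) = @_;
    print "Child $pid finished with exit code $exit_code";
    print " (killed by signal $exit_signal)" if $exit_signal;
    print " (dumped core)" if $core_dump;
    print "\n";
} );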
To find the problem, though, I'd suggest wrapping all the code between start and finish in an eval or use Try::Tiny and warn out the error to see if there's an exception happening in there that's breaking it. I'd expect such things to show up in STDERR when the child dies, though, so I'm not really sure that will help.
However, it's worth a shot. Here's my suggestion in code, just showing the part I'd catch exceptions from:
# At the top add
use Try::Tiny;
# Later in your main loop
$pm->start() and next;
try {
    $mech->get($url);
    if ($mech->success)
    {
        $tree = HTML::TreeBuilder::XPath->new();
        $tree->parse($mech->content);
        # do some processing here on the content and print the results to OUTPUT file
        # once done then delete the root node
        $tree->delete();
    }
}
catch {
    warn "Bad Stuff: ", $_;
};
$pm->finish();
That might help show you what's gone wrong.
If it does not help, you might try moving the try block to include more of the program (like nearly all of it after the use Try::Tiny line) and see if that elucidates anything.
The $pm->wait_all_children; call waits for ALL the child processes to end, and it blocks until they do. I am not sure what kind of error handling you have done for $mech inside the if() statement, but you may want to revisit that.
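One extra thing worth checking (my aside, not part of the original answers): newer WWW::Mechanize versions turn autocheck on by default, which makes a failed get() die outright, so a child can disappear before it ever reaches the if($mech->success) test. A sketch of making that failure mode explicit inside the question's loop:
my $mech = WWW::Mechanize->new( autocheck => 0 );   # don't die on HTTP errors
$mech->stack_depth(0);

# ... later, inside the forked child ...
$mech->get($url);
unless ( $mech->success ) {
    warn "GET $url failed: ", $mech->status, "\n";
    $pm->finish;   # let the child exit cleanly instead of dying
}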

Perl script getting stuck in terminal for no apparent reason

I have a Perl script which reads three files and writes new files after reading each one of them. Everything runs in a single thread.
In this script, I open and work with three text files and store the contents in a hash. The files are large (close to 3 MB).
I am using a loop to go through each of the files (open -> read -> Do some action (hash table) -> close)
I am noticing that whenever I am scanning through the first file, the terminal window in my Cygwin shell gets stuck. The moment I hit the Enter key, the script processes the rest of the files without any issues.
It's very odd as there is no read from STDIN in my script. Moreover, the same logic applies to all three files, as everything is in the same loop.
Has anyone here faced a similar issue? Does this usually happen when dealing with large files or big hashes?
I can't post the script here, but there is not much in it to post anyway.
Could this just be a problem in my Cygwin shell?
If this problem does not go away, how can I circumvent it? Like providing the enter input when the script is in progress? More importantly, how can I debug such a problem?
sub read_set
{
    @lines_in_set = ();
    push @lines_in_set, $_[0];
    while (<INPUT_FILE>)
    {
        $line = $_;
        chomp($line);
        if ($line =~ /ENDNEWTYPE/i or $line =~ /ENDSYNTYPE/ or eof())
        {
            push @lines_in_set, $line;
            last;
        }
        else
        {
            push @lines_in_set, $line;
        }
    }
    return @lines_in_set;
}
--------> I think I found the problem: the eof() call was ensuring that the script would be stuck!! Somehow it only happens the first time. I have no idea why, though.
The eof() call is the problem. See perldoc -f eof.
eof with empty parentheses refers to the pseudo file accessed via while (<>), which consists of either all the files named in @ARGV, or STDIN if there are none.
And in particular:
Note that this function actually reads a character and then "ungetc"s it, so isn't useful in an interactive context.
But your loop reads from another handle, one called INPUT_FILE.
It would make more sense to call eof(INPUT_FILE). But even that probably isn't necessary; your outer loop will terminate when it reaches the end of INPUT_FILE.
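A minimal sketch of the same inner loop without any eof() call (it simply relies on the while condition going false at end-of-file; the rest of the sub stays as it is):
while ( my $line = <INPUT_FILE> ) {    # loop ends by itself at EOF
    chomp $line;
    push @lines_in_set, $line;
    last if $line =~ /ENDNEWTYPE/i or $line =~ /ENDSYNTYPE/;
}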
Some more suggestions, not related to the symptoms you're seeing:
Add
use strict;
use warnings;
near the top of your script, and correct any error messages this produces (perl -cw script-name does a compile-only check). You'll need to declare your variables using my (perldoc -f my). And use consistent indentation; I recommend the same style you'll find in most Perl documentation.

How can I quickly find the user's terminal PID in Perl?

The following snippet of code is used to find the PID of a user's terminal, by using ptree and grabbing the third PID from the results it returns. All terminal PID's are stored in a hash with the user's login as the key.
## If process is a TERMINAL.
## The command ptree is used to get the terminal's process ID.
## The user can then use this ID to peek the user's terminal.
if ($PID =~ /(\w+)\s+(\d+) .+basic/) {
    $user = $1;
    if (open(PTREE, "ptree $2 |")) {
        while ($PTREE = <PTREE>) {
            if ($PTREE =~ /(\d+)\s+-pksh-ksh/) {
                $terminals{$user} = $terminals{$user} . " $1";
                last;
            }
            next;
        }
        close(PTREE);
    }
    next;
}
Below is a sample ptree execution:
ares./home_atenas/lmcgra> ptree 29064
485 /usr/lib/inet/inetd start
23054 /usr/sbin/in.telnetd
23131 -pksh-ksh
26107 -ksh
29058 -ksh
29064 /usr/ob/bin/basic s=61440 pgm=/usr/local/etc/logon -q -nr trans
412 sybsrvr
I'd like to know if there is a better way to code this. This is the part of the script that takes longest to run.
Note: this code, along with other snippets, are inside a loop and are executed a couple of times.
I think the main problem is that this code is in a loop. You don't need to run ptree and parse the results more than once! You need to figure out a way to run ptree once and put the results into a data structure that you can use later. Probably some kind of simple hash will suffice. You may even be able to just keep around your %terminals hash and keep reusing it.
Some nitpicks...
Both of your "next" statements seem unnecessary to me... you should be able to just remove them.
Replace
$terminals{$user} = $terminals{$user} . " $1";
with:
$terminals{$user} .= " $1";
Replace the bareword PTREE which you are using as a filehandle with $ptreeF or some such... using barewords became unnecessary for filehandles about 10 years ago :)
I don't know why your $PID variable is all caps... it could be confusing to readers of your code because it looks like there is something special about that variable, and there isn't.
I think you'll get the best performance improvement by avoiding the overhead of repeatedly executing an external command (ptree, in this case). I'd look for a CPAN module that provides a direct interface to the data structures that ptree is reading. Check the Linux:: namespace, maybe? (I'm not sure if ptree is setuid; that may complicate things.)
The above advice aside, some additional style and robustness notes based on the posted snippet only (forgive me if the larger code invalidates them):
I'd start by using strict, at the very least. Lexical filehandles would also be a good idea.
You appear to be silently ignoring the case when you cannot open() the ptree command. That could happen for many reasons, some of which I can't imagine you wanting to ignore, such as…
You're not using the full path to the ptree command, but rather assuming it's in your path—and that the one in your path is the right one.
How many users are on the system? Can you invert this? List all -pksh-ksh processes in the system along with their EUIDs, and build the map from that - that might be only one execution of ps/ptree.
I was thinking of using ps to get the parent's pid, but I would need to loop this to get the great-grandparent's pid. That's the one I need. Thanks. – lamcro
Sorry, there are many users and each can have up to three terminals open. The whole script is used to find those terminals that are using a file. I use fuser to find the processes that use a file. Then use ptree to find the terminal's pid. – lamcro
If you have (or can get) a list of PIDs using a file, and just need all of the grand-parents of that PID, there's an easier way, for sure.
#!perl
use warnings;
use strict;

#***** these PIDs are gotten with fuser or some other method *****
my($fpids) = [27538, 31812, 27541];

#***** get all processes, assuming linux PS *****
my($cmd) = "ps -ef";
open(PS, "$cmd |") || die qq([ERROR] Cannot open pipe from "$cmd" - $!\n);
my($processlist) = {};
while (<PS>) {
    chomp;
    my($user, $pid, $ppid, $rest) = split(/ +/, $_, 4);
    $processlist->{$pid} = $ppid;
}
close PS;

#***** lookup grandparent *****
foreach my $fpid (@$fpids) {
    my($parent)      = $processlist->{$fpid}   || 0;
    my($grandparent) = $processlist->{$parent} || 0;
    if ($grandparent) {
        #----- do something here with grandparent's pid -----
        print "PID:GRANDPID - $fpid:$grandparent\n";
    }
    else {
        #----- some error condition -----
        print "ERROR - Cannot determine GrandPID: $fpid ($parent)\n";
    }
}
Which for me produces:
ERROR - Cannot determine GrandPID: 27538 (1)
PID:GRANDPID - 31812:2804
PID:GRANDPID - 27541:27538
Have you considered using 'who -u' to tell you which process is the login shell for a given tty instead of using ptree? This would simplify your search - irrespective of the other changes you should also make.
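If you go that route, a minimal sketch of parsing who -u might look like this (the exact column layout of who -u varies between Solaris and Linux, so treat the PID extraction below as an assumption to verify on your system):
use strict;
use warnings;

my %shell_pid_for;   # login name => PIDs of the login shells on that user's ttys

open(my $who, '-|', 'who', '-u') or die "Cannot run who -u: $!";
while (<$who>) {
    my @fields = split;
    my $user = $fields[0];
    # On most systems the PID is the last purely numeric field on the line.
    my ($pid) = ( grep { /^\d+$/ } @fields )[-1];
    $shell_pid_for{$user} .= " $pid" if defined $pid;
}
close $who;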
I just did some trivial timings here based on your script (calling "cat ptree.txt" instead of ptree itself) and confirmed my thoughts that all of your time is spent creating new sub-processes and running ptree itself. Unless you can factor away the need to call ptree (maybe there's a way to open up the connection once and reuse it, like with nslookup), you won't see any real gains.