I need to query several different LDAP servers from Perl via Net::LDAP. I have something that works well. However, in an attempt to speed things up, I tried to query the different servers in parallel using Parallel::ForkManager - and things do not work when I do that.
I get the following types of errors:
decode error 02<=>30 0 8 at /Users/myname/perl5/perlbrew/perls/perl-5.18.2/lib/site_perl/5.18.2/Convert/ASN1/_decode.pm line 113, <> line 18.
decode error 43<=>30 0 8 at /Users/myname/perl5/perlbrew/perls/perl-5.18.2/lib/site_perl/5.18.2/Convert/ASN1/_decode.pm line 113, <> line 25.
at the line that gets the search response:
$mesg = $ldap->search( base => $dn, filter => '(CN=*)');
I am puzzled.
In other words, why does this fail:
use Net::LDAP;
use Parallel::ForkManager;
use Try::Tiny;    # the try/catch blocks below assume Try::Tiny (or similar)
...; # bind LDAP servers
while (<>) {
    chop;
    my $dn = $_;
    foreach my $ldap (@servers) {
        my $pid = $pm->start and next;    # do the fork
        print $dn, $pid;
        my $mesg;
        try {
            $mesg = $ldap->search( base => $dn, filter => '(CN=*)' );
        } catch {
            ...;
        };
        $pm->finish;
    }
}
while this:
use Net::LDAP;
use Try::Tiny;
...; # bind LDAP servers
while (<>) {
    chop;
    my $dn = $_;
    foreach my $ldap (@servers) {
        print $dn;
        my $mesg;
        try {
            $mesg = $ldap->search( base => $dn, filter => '(CN=*)' );
        } catch {
            ...;
        };
    }
}
works perfectly?
Whilst forking doesn't have quite the same thread-safety problems as threading, there are still a few gotchas. I think this is what's biting you: your Net::LDAP objects are created in the parent process, but are then (effectively) cloned into each child when you fork.
Which means that in your code there's a very real possibility that, if the list of names is coming in fast enough, a new fork will try to reuse an existing Net::LDAP connection before a previous one has finished with it.
The easy way of preventing this is to call wait_all_children, to ensure all your parallel LDAP queries are finished before the next batch starts.
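For example, a minimal rearrangement of the loop from the question (a sketch only, assuming your existing $pm and @servers) would be:

# Same loop as in the question, but the parent now waits for all
# per-server children before reading the next DN, so no two forks
# ever share a live Net::LDAP connection at the same time.
while (<>) {
    chop;
    my $dn = $_;
    foreach my $ldap (@servers) {
        $pm->start and next;    # child runs the query
        my $mesg = $ldap->search( base => $dn, filter => '(CN=*)' );
        $pm->finish;
    }
    $pm->wait_all_children;     # parent blocks here on each iteration
}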
If you put your LDAP bind within the ForkManager loop, do you still have the same problem? I appreciate that's a potential overhead, as you'll be binding on each iteration, but if that addresses it, I'd suggest it's because Net::LDAP is sharing the same file descriptors between forks.
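As a rough, self-contained sketch of what I mean (the hostnames here are placeholders, not from your code, and each child does an anonymous bind - adjust as needed):

use strict;
use warnings;
use Net::LDAP;
use Parallel::ForkManager;

# Hypothetical host list; substitute your own servers and credentials.
my @server_hosts = qw(ldap1.example.com ldap2.example.com);
my $pm = Parallel::ForkManager->new( scalar @server_hosts );

while ( my $dn = <> ) {
    chomp $dn;
    foreach my $host (@server_hosts) {
        $pm->start and next;
        # Each child opens (and closes) its own connection, so no file
        # descriptors are shared between forks.
        my $ldap = Net::LDAP->new($host) or die "Cannot connect to $host: $@";
        $ldap->bind;    # anonymous bind; use bind($binddn, password => ...) if required
        my $mesg = $ldap->search( base => $dn, filter => '(CN=*)' );
        $ldap->unbind;
        $pm->finish;
    }
    $pm->wait_all_children;
}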
The next best solution would be to adopt a 'worker' model, where you've got a bunch of workers, each with their own LDAP connection, doing the querying. That's easier with threading than with forking - it goes a bit like this:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
sub server_worker {
    my ( $hostname, $q_ref ) = @_;
    ## connect LDAP to $hostname;
    while ( my $dn = $$q_ref->dequeue ) {
        # query $dn
    }
}

my @queues;
foreach my $server (@list_of_servers) {
    my $server_q = Thread::Queue->new();
    push( @queues, $server_q );
    threads->create( \&server_worker, $server, \$server_q );
}

while ( my $dn = <STDIN> ) {
    chomp($dn);
    foreach my $q (@queues) {
        $q->enqueue($dn);
    }
}

foreach my $q (@queues) {
    $q->end;
}
foreach my $thr ( threads->list ) {
    $thr->join();
}
Doing something similar with forking should work:
#!/usr/bin/perl
use strict;
use warnings;
use IO::Pipe;
use Parallel::ForkManager;
use Net::LDAP;
my @list_of_servers = qw( servername servenama anotherserver );

my $pm = Parallel::ForkManager->new( scalar @list_of_servers );

my %pipe_for;

foreach my $server (@list_of_servers) {
    my $pipe = IO::Pipe->new();
    my $pid  = $pm->start;
    if ($pid) {
        print "$$: parent\n";
        $pipe->writer->autoflush;
        $pipe_for{$server} = $pipe;
    }
    else {
        print "$$ child connecting to $server\n";
        $pipe->reader->autoflush;
        close(STDIN);    # because this is a child.
        # Net::LDAP setup
        while ( my $item = <$pipe> ) {
            chomp($item);
            # ldap_search $item;
        }
        $pm->finish;     # only the child should call finish
    }
}
And then send stuff:
for my $number ( 1 .. 10 ) {
    foreach my $pipe ( values %pipe_for ) {
        print {$pipe} "test$number\n";
    }
}
close $_ for values %pipe_for;    # close the write ends so the children see EOF and exit
$pm->wait_all_children();
Edit: Note that autoflush is important - otherwise the IO buffers and it doesn't look like it's working. I'm pretty sure closing STDIN in the child is a good idea, but it is perhaps not vitally necessary if the children don't use it.
Related
I need to access in a main program the contents of hashes that were generated via subroutines that were forked. Here specifically is what I am trying to do:-
use Benchmark;
use File::Find;
use File::Basename;
use File::Path;
use Data::Dumper;
use strict;
use warnings;
print "Process ID: $$ \n";
my @PartitionRoots = qw(/nfs/dir1 /nfs/dir2 /nfs/dir3 /nfs/dir4);
my @PatternsToCheck = qw(prefix1 prefix2);
my @MatchedDirnames = qw();
my $DirCount = 0;
my $Forks = 0;
my @AllDirs = qw();
my %SweepStats = ();
foreach my $RootPath (@PartitionRoots) {
    foreach my $Pattern (@PatternsToCheck) {
        if (grep {-e} glob ("$RootPath/$Pattern*")) {
            my @Files = glob ("$RootPath/$Pattern*");
            foreach my $FileName (@Files) {
                if (-d $FileName) {
                    $DirCount++;
                    push (@AllDirs, $FileName);
                    my $PID = fork;
                    if (not defined $PID) {
                        warn "Could not fork!\n";
                        next;
                    }
                    if ($PID) {
                        $Forks++;
                        print "In the parent PID ($$), Child pid: $PID Number of forked child processes: $Forks\n";
                    } else {
                        print "In the child PID ($$)\n";
                        find(\&file_stats, $FileName);
                        print "Child ($$) exiting...\n";
                        exit;
                    }
                }
            }
        }
    }
}
for (1 .. $Forks) {
    my $PID = wait();
    print "Parent saw child $PID exit.\n";
}
print "Parent ($$) ending.\n";
print Dumper (\%SweepStats);
foreach my $DirName (@AllDirs) {
    print ("Printing $DirName contents...\n");
    foreach (@{$SweepStats{$DirName}}) {
        my $uname = $_->{uname};
        my $mtime = $_->{mtime};
        my $size  = $_->{size};
        my $file  = $_->{file};
        print ("$uname $mtime $size $file\n");
    }
}
sub file_stats {
    if (-f $File::Find::name) {
        my $FileName = $_;
        my $PathName = dirname($_);
        my $DirName  = basename($_);
        my $uid   = (stat($_))[4];
        my $uname = getpwuid($uid);
        my $size  = (stat($_))[7];
        my $mtime = (stat($_))[9];
        if (defined $uname && $uname ne '') {
            push @{$SweepStats{$FileName}}, {path=>$PathName,dir=>$DirName,uname=>$uname,mtime=>$mtime,size=>$size,file=>$File::Find::name};
        } else {
            push @{$SweepStats{$FileName}}, {path=>$PathName,dir=>$DirName,uname=>$uid,mtime=>$mtime,size=>$size,file=>$File::Find::name};
        }
    }
    return;
}
exit;
...but Dumper is coming up empty, so the dereferencing and printing that immediately follows is empty, too. I know the file-stat collecting is working, because if I replace the "push @{$SweepStats{$FileName}}" statements with print statements, I see exactly what is expected. I just need to properly access the hashes at the global level, but I cannot get it quite right. What am I doing wrong here? There are all kinds of posts about passing hashes to subroutines, but not the other way around.
Thanks!
The fork call creates a new, independent process. That child process and its parent cannot write to each other's data. So in order for data to be exchanged between the parent and the child we need to use some Inter-Process-Communication (IPC) mechanism.†
It is by far easiest to use a library that takes care of the details, and Parallel::ForkManager seems rather suitable here: it provides an easy way to pass data from a child back to the parent, and it has a simple queue (to keep the number of simultaneous processes limited to a given number).
Here is some working code, and comments follow
use warnings;
use strict;
use feature 'say';
use File::Find;
use File::Spec;
use Parallel::ForkManager;
my %file_stats; # written from callback in run_on_finish()
my $pm = Parallel::ForkManager->new(16);
$pm->run_on_finish(
    sub {    # 6th argument is what is passed back from finish()
        my ($pid, $exit, $ident, $signal, $core, $dataref) = @_;
        foreach my $file_name (keys %$dataref) {
            $file_stats{$file_name} = $dataref->{$file_name};
        }
    }
);
my @PartitionRoots  = '.';    # For my tests: current directory,
my @PatternsToCheck = '';     # no filter (pattern is empty string)

my %stats;    # for use by File::Find in child processes

foreach my $RootPath (@PartitionRoots) {
    foreach my $Pattern (@PatternsToCheck) {
        my @dirs = grep { -d } glob "$RootPath/$Pattern*";
        foreach my $dir (@dirs) {
            #say "Looking inside $dir";
            $pm->start and next;             # child process
            find(\&get_file_stats, $dir);
            $pm->finish(0, { %stats });      # exits, {%stats} passed back
        }
    }
}
$pm->wait_all_children;
sub get_file_stats {
    return if not -f;
    #say "\t$File::Find::name";
    my ($uid, $size, $mtime) = (stat)[4,7,9];
    my $uname = getpwuid $uid;
    push @{$stats{$File::Find::name}}, {
        path  => $File::Find::dir,
        dir   => ( File::Spec->splitdir($File::Find::dir) )[-1],
        uname => (defined $uname and $uname ne '') ? $uname : $uid,
        mtime => $mtime,
        size  => $size,
        file  => $File::Find::name
    };
}
Comments
The main question in all this is: at which level of your three-level hierarchy should child processes be spawned? I left it as in the question, where a child is forked for each directory. This may be suitable if (most of) the directories contain many files; but if that isn't so and there is little work for each child to do, then it may all get too busy and the forking overhead may reduce, or even cancel out, the speedup.
The %stats hash, which File::Find needs in order to store the data it finds, must be declared outside of all loops so that it is visible in the sub. It is therefore inherited by every child process, but each child gets its own copy in due course, so we need not worry about data overlap or the like.
I simplified (and corrected) the code other than the forking as well, following what seemed to me to be intended. Please let me know if that is off.
See the linked documentation, and for example this post and the links in it, for details.
To display complex data structures, use a library; there are many of them.
I use Data::Dump, intended to simply print nicely,
use Data::Dump qw(dd pp);
...
dd \%file_stats; # or: say "Stats for all files: ", pp \%file_stats;
for its compact output, while the most widely used is the core Data::Dumper
use Data::Dumper;
...
say Dumper \%file_stats;
which also "understands" data structures (so you can mostly eval them back).
(Note: In this case there will likely be a lot of output! So redirect to a file, or exit the loops after the first iteration, just to see how it's all going.)
† When a process is forked, the variables and data from the parent are available to the child. They aren't copied right away, for efficiency reasons, so initially the child does read the parent's data. But any data generated after the fork call, in either the parent or the child, cannot be seen by the other process.
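A minimal, self-contained illustration of that point (not part of the question's script, just a demo):

#!/usr/bin/perl
use strict;
use warnings;

# Demo: a hash key written by the child after fork() is not visible in the parent.
my %h = ( before => 'set in parent' );

my $pid = fork // die "fork failed: $!";
if ($pid == 0) {                  # child
    $h{after} = 'set in child';   # only this process sees the new key
    exit;
}
waitpid $pid, 0;
print exists $h{after} ? "parent sees child's write\n"
                       : "parent does not see child's write\n";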
Try this module: IPC::Shareable.
It is recommended by perldoc perlipc, and you can find an answer to your question here.
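A rough sketch of how that could look for the hash in the question; the glue key and options here are illustrative, not prescriptive, so check the IPC::Shareable docs before relying on them:

use strict;
use warnings;
use IPC::Shareable;

# Tie %SweepStats to shared memory so writes made in forked children
# are visible in the parent. 'swep' is an arbitrary glue key chosen
# only for this example.
tie my %SweepStats, 'IPC::Shareable', 'swep', { create => 1, destroy => 1 };

my $pid = fork // die "fork failed: $!";
if ($pid == 0) {                                     # child
    $SweepStats{'/some/file'} = 'uname mtime size';  # placeholder value
    exit;
}
waitpid $pid, 0;
print $SweepStats{'/some/file'}, "\n";               # parent sees the child's write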
I have code that works as expected, but I am having difficulty storing the output of each executed command from $result = $ssh->capture($command_to_execute);. The script uses the Parallel::ForkManager module to run the commands on different hosts, using different files as input parameters.
Once the command execution is done, I want the resulting output stored in the $result variable. It should append each host's result to the same variable, and at the end I want to process the values in $result. I am using .= to append the resulting data to $result, but it doesn't seem to be working.
Pasting my code here for reference:
.
.
.
my $result;
my $pm = Parallel::ForkManager->new(5);
DATA_LOOP:
foreach my $n (1..$num_buckets) {
    my $pid = $pm->start and next DATA_LOOP;
    $command_to_execute = $my_command." ".$Files{$n};
    my $ssh = SSH_Connection( $list_of_ips[$n-1], 'username', 'pass' );
    $result = $ssh->capture($command_to_execute);
    $result .= "Result from File:$Files{$n} and Host:$list_of_ips[$n-1] is $result\n";
    print "Result: INSIDE: $result";
    $pm->finish;
}
$pm->wait_all_children;
print "Result: OUTSIDE: $result";
print "Done\n";
sub SSH_Connection {
    my ( $host, $user, $passwd ) = @_;
    my $ssh = Net::OpenSSH->new($host,
        user        => $user,
        password    => $passwd,
        master_opts => [-o => "StrictHostKeyChecking=no"]
    );
    $ssh->error and die "Couldn't establish SSH connection: ". $ssh->error;
    return $ssh;
}
print "Result: INSIDE: $result"; Could able to print result one by one. But print "Result: OUTSIDE: $result";
is empty, which should actually have the combined results of $results which has been taken from inside the for loop.
As shown in the documentation of Parallel::ForkManager, to get a result from a child, you need to supply a reference to the result as another parameter to finish.
$pm->finish(0, [$Files{$n}, $list_of_ips[$n-1], $result]);
Use run_on_finish to gather the results:
my $result;
$pm->run_on_finish( sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $single_result) = @_;
    $result .= "Result from File: $single_result->[0] and Host: $single_result->[1]"
             . " is $single_result->[2]\n";
});
Each time you run $pm->start, you are forking a new process to run the code until $pm->finish. This forked process cannot affect the parent process in any way except by the mechanism Parallel::ForkManager provides to send data back to the parent. This mechanism is described at https://metacpan.org/pod/Parallel::ForkManager#RETRIEVING-DATASTRUCTURES-from-child-processes.
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    my $result = $$data;
    ...
});

DATA_LOOP:
foreach my $n (1..$num_buckets) {
    my $pid = $pm->start and next DATA_LOOP;
    ...
    $pm->finish(0, \$result);
}
Forking is in fact not needed for these operations if you are willing to restructure a bit. Net::OpenSSH can provide commands that can be managed simultaneously by an event loop such as IO::Async::Loop, thus all Perl operations will occur in the same process (but not necessarily in the order they appear). Since IO::Async::Loop->run_process returns a Future, Future::Utils provides a way to manage concurrency of these commands.
use strict;
use warnings;
use Net::OpenSSH;
use IO::Async::Loop;
use Future::Utils 'fmap_concat';

my $loop = IO::Async::Loop->new;

my $future = fmap_concat {
    my $n = shift;
    ...
    my $remote_command = $ssh->make_remote_command($command_to_execute);
    return $loop->run_process(command => $remote_command, capture => ['stdout'])
        ->transform(done => sub { "Result from File:$Files{$n} and Host:$list_of_ips[$n-1] is $_[0]\n" });
} foreach => [1..$num_buckets], concurrent => 5;

my @results = $future->get;
There is a lot of flexibility in how the individual and overall Futures (the latter returned by fmap) are managed, but by default any failure to execute a process will cause the whole Future to fail immediately (so get throws an exception), while any nonzero exit status is ignored.
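If you would rather collect per-host failures than abort the whole batch, one option (a sketch only, reusing the $loop, $n and $remote_command from the snippet above) is to convert each failed per-host Future into an ordinary result with else:

# Inside the fmap_concat block: a failed per-host Future becomes a
# normal string result, so the overall Future still succeeds.
return $loop->run_process(command => $remote_command, capture => ['stdout'])
    ->transform(done => sub { "Host $n succeeded: $_[0]\n" })
    ->else(sub {
        my ($failure) = @_;
        Future->done("Host $n failed: $failure\n");
    });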
I have the following script that runs a command and puts the data in a DB. I need to account for the possibility of being asked for a password ("password:") some of the time. How do I wrap an Expect call into this?
#!/usr/software/bin/perl
use strict;
use DBI;             # needed for DBI->connect below
use DatabaseLib;
use Data::Dumper;
use Expect;

#Connect to database
my $dbh = DBI->connect($DB_CONNECT_STRING, $DB_USER, $DB_PASSWORD, { RaiseError => 1, AutoCommit => 1 })
    or die "failed to connect to database: $DB_CONNECT_STRING";

my $expect = Expect->new;
my %burtHash;
my @cols = qw/date_create sub_by impact date-lastmod lastmod-by bug_rel case_score state s p type subtype subteam found_by target_release/;
my @burtInfo = `mycommand`;
my $timeout  = 20;
my $password = "password";
while (my $ele = shift(@burtInfo)) {
    my ($index, @data) = split(/\s+/, $ele);
    for my $i (0 .. $#cols) {
        $burtHash{$index}->{$cols[$i]} = shift(@data);
    }
    for my $id (keys %burtHash) {
        my %burt_details;
        for my $col (keys %{$burtHash{$id}}) {
            $burt_details{$col} = $burtHash{$id}->{$col};
        }
        if ($id =~ /\d+/) {
            burt_update(
                $dbh,
                $id,
                \%burt_details,
            );
        }
    }
}
I think I just need to put in something like this and call it, but I'm not sure where or how:
$expect->expect($timeout,
    [ qr/password:/i, #/
        sub {
            my $self = shift;
            $self->send("$password\n");
            exp_continue;
        }
    ]);
You're not using $expect anywhere there. You have to run your command via $expect->spawn so that your Expect object can handle things. And then you'll need some way of gathering its output (I'm thinking of using $expect->log_file(...) to direct the log to a string or filehandle).
Once you're using $expect->spawn, you can insert your password check. But there's no way you can do this with qx (the backticks).
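Roughly like this - a sketch only, not tested against your environment, reusing $timeout, $password and @burtInfo from your script, with 'mycommand' standing in for your actual command:

my $expect = Expect->new;
$expect->raw_pty(1);
$expect->log_stdout(0);
$expect->spawn('mycommand') or die "Cannot spawn mycommand: $!";

my $output = '';
$expect->log_file(sub { $output .= shift });   # append everything the command prints

$expect->expect($timeout,
    [ qr/password:/i => sub { my $self = shift; $self->send("$password\n"); exp_continue; } ],
    [ 'eof' => sub { } ],
);
$expect->soft_close;

my @burtInfo = split /\n/, $output;            # then parse as before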
I'm interested in performing multiple database actions in parallel. I have played with Perl's Parallel::ForkManager but haven't used it with any databases yet. I have read that database connectivity is not supported very well when forking. Does anyone have experience with this?
As an example, I would probably be spawning a system call (which does the DBI work), NOT raw code, i.e.
#!/opt/local/bin/perl -w
use strict;
use Parallel::ForkManager;
$| = 1;
my $max_procs = 10;
my $pm = new Parallel::ForkManager($max_procs);
for (my $var = 0; $var <= 10; $var++) {
    my $pid = $pm->start('proc'.$var) and next;
    sleep ( 2 );
    system( "./DBworker.pl $var" );
    $pm->finish(0);
}
print "Waiting for child procs\n";
$pm->wait_all_children;
print "complete!\n";
If the work is being done by other programs, there is no danger in forking. The danger comes when you open a connection to the database and then fork: the child can't reuse the parent's connection. However, take a look at DBIx::Connector; it handles the things you need to do after forking for you, and running multiple programs is generally not the right answer anyway.
#!/usr/bin/perl
use strict;
use warnings;
use DBIx::Connector;
use Parallel::ForkManager;
my $dsn = "dbi:SQLite:dbname=foo.db";
my $user = "";
my $pass = "";
my $conn = DBIx::Connector->new($dsn, $user, $pass,
    {
        AutoCommit       => 0,
        PrintError       => 0,
        RaiseError       => 1,
        ChopBlanks       => 1,
        FetchHashKeyName => 'NAME_lc',
    }
);
END { unlink "foo.db" }
#setup table
$conn->run(fixup => sub {
    my $dbh = $_;
    $dbh->do("create table foo ( id integer, name char(35) )");
    my $sth = $dbh->prepare("insert into foo (id, name) values (?, ?)");
    while (<DATA>) {
        chomp;
        $sth->execute(split /,/);
    }
});
my $pm = Parallel::ForkManager->new(3);
my $sth = $conn->dbh->prepare("select * from foo where id = ?");
for my $id (1 .. 3) {
    next if $pm->start;
    $sth->execute($id);
    while (my $row = $sth->fetchrow_hashref) {
        print "$id saw $row->{id} => $row->{name}\n";
    }
    $pm->finish;
}
$pm->wait_all_children;
print "done\n";
__DATA__
1,foo
2,bar
3,baz
I'm setting something up to SSH out to several servers in 'batches'. I basically want to maintain 5 connections at a time, and when one finishes open up another (following an array of server IPs).
I'm wondering, for something like this, should I be using fork()? If so, what logic can I use to ensure that I maintain 5 children at a time?
Forking (or threading) is what you want, but you should look at CPAN for modules that provide most of what you need, to keep you from reinventing the wheel and going through the learning pains of doing it all yourself.
For example, Parallel::ForkManager looks like it's EXACTLY what you want.
use Parallel::ForkManager;

$pm = new Parallel::ForkManager($MAX_PROCESSES);

foreach $data (@all_data) {
    # Forks and returns the pid for the child:
    my $pid = $pm->start and next;

    ... do some work with $data in the child process ...

    $pm->finish; # Terminates the child process
}
There are several modules that solve exactly this problem. See Parallel::ForkManager, Forks::Super, or Proc::Queue, for example.
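For instance, a hedged sketch with Forks::Super, which queues jobs itself once the process limit is reached (the IP list and the remote command are placeholders, not from the question):

use strict;
use warnings;
use Forks::Super MAX_PROC => 5, ON_BUSY => 'queue';

my @servers = qw(10.0.0.1 10.0.0.2 10.0.0.3);   # placeholder IPs

# Each fork runs one ssh command; Forks::Super queues the rest until
# one of the 5 running children exits.
for my $ip (@servers) {
    fork { cmd => [ 'ssh', $ip, 'uptime' ] };
}
waitall;   # wait for every queued and running child to finish

Alternatively, Net::OpenSSH::Parallel rolls the connection limit and the ssh work into one module: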
use Net::OpenSSH::Parallel;
my $pssh = Net::OpenSSH::Parallel->new(connections => 5);
for my $ip (@ips) {
    $pssh->add_host($ip);
}
$pssh->push('*', command => 'do this');
$pssh->push('*', command => 'do that');
$pssh->push('*', scp_get => 'foo', 'bar-%HOST%');
$pssh->push('*', scp_put => 'doz', 'there');
$pssh->run;
My personal forking(!) favourite is Proc::Fork
General overview from pod:
use Proc::Fork;
run_fork {
    child {
        # child code goes here.
    }
    parent {
        my $child_pid = shift;
        # parent code goes here.
        waitpid $child_pid, 0;
    }
    retry {
        my $attempts = shift;
        # what to do if fork() fails:
        # return true to try again, false to abort
        return if $attempts > 5;
        sleep 1, return 1;
    }
    error {
        # Error-handling code goes here
        # (fork() failed and the retry block returned false)
    }
};
And to limit the maximum number of processes running for something like SSH batches, this should do the trick:
use strict;
use warnings;
use 5.010;
use POSIX qw(:sys_wait_h);
use Proc::Fork;
my $max = 5;
my %pids;

my @ssh_files = (
    sub { system "scp file0001 baz\@foo:/somedir/." },
    ...
    sub { system "scp file9999 baz\@foo:/somedir/." },
);

while (my $proc = shift @ssh_files) {
    # max limit reached
    while ($max == keys %pids) {
        # loop thru pid list until a child is released
        for my $pid (keys %pids) {
            if (my $kid = waitpid($pid, WNOHANG)) {
                delete $pids{$kid};
                last;
            }
        }
    }

    run_fork {
        parent {
            my $child = shift;
            $pids{$child} = 1;
        }
        child {
            $proc->();
            exit;
        }
    }
}
/I3az/