What are the Perl techniques to detach just a portion of code to run independently? - perl

I'm not well versed in close-to-the-OS programming techniques, but as far as I know, when it comes to doing something in parallel in Perl the weapon of choice is fork and probably some useful modules built upon it. The doc page for fork says:
Does a fork(2) system call to create a new process running the same program at the same point.
As a consequence, having a big application that consumes a lot of memory and calling fork for a small task means there will be two big perl processes, with the second wasting resources just to do some simple work.
So, the question is: what to do (or how to use fork, if it's the only method) in order to have a detached portion of code running independently and consuming just the resources it needs?
Just a very simple example:
use strict;
use warnings;
my @big_array = ( 1 .. 2000000 ); # at least 80 MB memory
sleep 10; # to have time to inspect the memory usage easily
fork();
sleep 10; # to have time to inspect the memory usage easily
and the child process consumes 80+ MB too.
To be clear: it's not important to communicate with this detached code or to use its result somehow; it just needs to be possible to say "hey, run this simple task for me in the background and let me continue my heavy work meanwhile ... and don't waste my resources!" when running a heavy perl application.

fork() to exec() is your bunny here. You fork() to create a new process (which is a fairly cheap operation, see below), then exec() to replace the big perl you've got running with something smaller. This looks like this:
use strict;
use warnings;
use 5.010;
my @ary = (1 .. 10_000_000);
if (my $pid = fork()) {
    # parent
    say "Forked $pid from $$; sleeping";
    sleep 1_000;
} else {
    # child
    exec('perl', '-e', 'sleep 1_000');
}
(@ary was just used to fill up the original process' memory a bit.)
I said that fork()ing was relatively cheap, even though it does copy the entire original process. These statements are not in conflict; the guys who designed fork noticed this same problem. The copy is lazy, that is, only the bits that are actually changed are copied.
If you find you want the processes to talk to each other, you'll start getting into the more complex domain of IPC, about which a number of books have been written.

Your forked process is not actually using 80MB of resident memory. A large portion of that memory will be shared - 'borrowed' from the parent process until either the parent or child writes to it, at which point copy-on-write semantics will cause the memory to actually be copied.
If you want to drop that baggage completely, run exec in your fork. That will replace the child Perl process with a different executable, thus freeing the memory. It's also perfect if you don't need to communicate anything back to the parent.

There is no way to fork just a subset of your process's footprint, so the usual workarounds come down to:
fork before you run memory intensive code in the parent process
start a separate process with system or open HANDLE,'|-',.... Of course this new process won't inherit any data from its parent, so you will need to pass data to this child somehow.
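A hedged sketch of that second option: a piped open starts a small, independent worker (the one-liner here is purely illustrative) and the parent passes data to it over the child's standard input.
use strict;
use warnings;

# Start a separate, small perl process; the pipe connects to the child's STDIN.
open( my $to_child, '|-', $^X, '-ne', 'print STDERR "child got: $_"' )
    or die "cannot start child: $!";

print {$to_child} "item $_\n" for 1 .. 5;    # pass data to the child
close($to_child) or warn "child exited with status $?";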

fork() as implemented on most operating systems is nicely efficient. It commonly uses a technique called copy-on-write, meaning that pages are initially shared until one or the other process writes to them. Also, a lot of your process memory is going to be read-only mapped files anyway.
Just because one process uses 80 MB before fork() doesn't mean that afterwards the two will use 160 MB. To start with, the total will be only a tiny fraction more than 80 MB, until each process starts dirtying more pages.

Related

Perl daemonize with child daemons

I have to employ daemons in my code. I need a control daemon that constantly checks the database for tasks and supervises child daemons. The control daemon must assign tasks to the child daemons, control tasks, create new children if one of them dies, etc. The child daemons check the database for tasks assigned to them (by PID). How should I implement daemons for this purpose?
Daemon is just a code word for "background process that runs a long time". So the answer is 'it depends'. Perl has two major ways of doing multiprocessing:
Threading
You run a subroutine as a thread, in parallel with the main program code. (Which may then just monitor thread states).
The overhead of creating a thread is higher, but it's better suited to 'shared memory' style multiprocessing, e.g. when you're passing significant quantities of data back and forth. There are several libraries that make passing information between threads positively straightforward. Personally I quite like Thread::Queue, Thread::Semaphore and Storable.
In particular - Storable has freeze and thaw which lets you move complex data structures (e.g. objects/hashes) around in queues, which is very useful.
Basic threading example:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#This is a subroutine, but one that runs 'as a thread'.
#When it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes, Thread::Queue objects are 'shared' arrays.
sub worker {
    #NB - this will sit in a loop indefinitely, until you close the queue
    #using $process_q->end().
    #We do this once we've queued all the things we want to process,
    #and then the sub completes and exits neatly.
    #However, if you _don't_ end it, this will sit waiting forever.
    while ( my $server = $process_q->dequeue() ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
    print "$server failed to ping\n";
}
Storable
When it comes to Storable, this is worth a separate example I think, because it's handy for moving data around.
use threads;
use Thread::Queue;
use Storable qw( freeze thaw );

use MyObject;    # home-made object

my $work_q = Thread::Queue->new();

sub worker_thread {
    while ( my $packed_item = $work_q->dequeue ) {
        my $object = thaw($packed_item);
        $object->run_some_methods();
        $object->set_status("processed");
        # maybe return $object via 'freeze' and a queue?
    }
}

my $thr = threads->create( \&worker_thread );

my $newobject = MyObject->new("some_parameters");
$work_q->enqueue( freeze($newobject) );
$work_q->end();
$thr->join();
Because you're passing the object around within the queue, you're effectively cloning it between threads. So bear in mind that you may need to freeze it and 'return' it somehow once you've done something to its internal state. But it does mean you can do this asynchronously without needing to arbitrate locking or shared memory. You may also find it useful to be able to 'store' and 'retrieve' an object - this works as you might expect. (Although I daresay you might need to be careful about availability of module versions vs. defined attributes if you're retrieving a stored object.)
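To illustrate the "freeze it and return it" point, here's a small sketch along the same lines - it assumes the same hypothetical MyObject class (plus a made-up get_status accessor) and simply sends the processed objects back on a second queue:
use strict;
use warnings;

use threads;
use Thread::Queue;
use Storable qw( freeze thaw );
use MyObject;    # same hypothetical home-made object as above

my $work_q    = Thread::Queue->new();
my $results_q = Thread::Queue->new();

sub worker_thread {
    while ( my $packed_item = $work_q->dequeue ) {
        my $object = thaw($packed_item);
        $object->run_some_methods();
        $object->set_status("processed");
        # 'Return' the updated clone: freeze it and enqueue it for the parent.
        $results_q->enqueue( freeze($object) );
    }
    $results_q->end();
}

my $thr = threads->create( \&worker_thread );

$work_q->enqueue( freeze( MyObject->new("some_parameters") ) );
$work_q->end();
$thr->join();

# Back in the main thread: thaw the processed copies.
while ( defined( my $packed = $results_q->dequeue_nb() ) ) {
    my $processed = thaw($packed);
    print "worker returned an object with status: ", $processed->get_status(), "\n";
}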
Forking
Your script clones itself, leaving a 'parent' and a 'child' - the child then generally diverges and does something different. This uses the Unix built-in fork(), which as a result is well optimised and generally very efficient - but because it's low level, it's difficult to do lots of data transfer. You'll end up doing some slightly complicated things for interprocess communication, or IPC (see perlipc for more detail). It's efficient not least because most fork() implementations do a lazy copy - memory for your process is only actually copied as it's needed, e.g. when it's changed.
It's therefore really good if you want to delegate a lot of tasks that don't require much supervision from the parent. For example - you might fork a web server, because the child is reading files and delivering them to a particular client, and the parent doesn't much care. Or perhaps you would do this if you want to spend a lot of CPU time computing a result, and only pass that result back.
It's also not supported natively on Windows (Perl emulates fork there using threads).
Useful libraries include Parallel::ForkManager
A basic example of 'forking' code looks a bit like this:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $concurrent_fork_limit = 4;
my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);
foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
    my $pid = $fork_manager->start;
    if ($pid) {
        print "$$: Fork made a child with pid $pid\n";
    } else {
        print "$$: child process started, with a key of $thing ($pid)\n";
    }
    $fork_manager->finish;
}
$fork_manager->wait_all_children();
$fork_manager->wait_all_children();
Which is right for you?
So it's hard to say without a bit more detail about what you're trying to accomplish. This is why Stack Overflow usually likes to see some working code, the approaches you've tried, etc.
I would generally say:
if you need to pass data around, use threads. Thread::Queue, especially when combined with Storable, is very good for it.
if you don't, forks (on Unix) are generally faster/more efficient. (But speed alone isn't usually enough - write understandable stuff first, and aim for speed second. It usually doesn't matter much.)
Where possible, avoid spawning too many threads - they're fairly intensive in memory and creation overhead. You're far better off using a fixed number in a 'worker thread' style of programming than repeatedly creating new, short-lived threads. (Forks, on the other hand, are actually very good at this, because copy-on-write means they don't physically copy your whole process.)
I would suggest that in the scenario you give, you're looking at threads and queues. Your parent process can track child threads via threads->list(), and join or create to keep the right number, and can feed data via a central queue to your worker threads. Or have multiple queues - one per 'child' - and use that as a task assignment system.
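For instance, a rough sketch of the "one queue per child" variant, where the parent hands specific tasks to specific workers (the task strings are just illustrative):
use strict;
use warnings;
use threads;
use Thread::Queue;

# One queue per worker; each worker only sees its own queue.
my @queues  = map { Thread::Queue->new() } 1 .. 3;
my @workers = map {
    my $q = $queues[$_];
    threads->create( sub {
        while ( defined( my $task = $q->dequeue() ) ) {
            print threads->self()->tid(), " doing task: $task\n";
        }
    } );
} 0 .. $#queues;

# Assign tasks to particular workers.
$queues[0]->enqueue("task for worker 1");
$queues[1]->enqueue("task for worker 2");
$queues[2]->enqueue("task for worker 3");

$_->end()  for @queues;
$_->join() for @workers;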

Perl fork queue for n-Core processor

I am writing an application similar to what was suggested here. Essentially, I am using Perl to manage the execution of multiple CPU intensive processes in parallel via fork and wait. However, I am running on a 4-core machine, and I have many more processes, all with very dissimilar expected run-times which aren't known a priori.
Ultimately, it would take more effort to estimate the run times and gang them appropriately than to simply use a queue system for each core. In the end, I want each core to be processing, with as little downtime as possible, until everything is done. Is there a preferred algorithm or mechanism for doing this? I would assume this is a common problem/use case, so I don't want to re-invent the wheel, as my wheel will probably be inferior to the 'right way'.
As a minor aside, I would prefer to not have to import additional modules (like Parallel::ForkManager) to accomplish this, but if that is the best way to go, then I will consider it.
~Thanks!
EDIT: Fixed 'here' link: Thanks to ikegami
EDIT: P::FM is too easy to use, not to... Today I Learned.
Forks::Super has some features that are good for this sort of task.
extended syntax, but not a lot of new syntax: if you already have a program with fork and wait calls, you can still use the features of Forks::Super without too many changes. That is, your new code will still have fork and wait calls.
job throttling: like Parallel::ForkManager, you can control how many jobs you run simultaneously. When one job completes, the module can start another one, keeping your system fully utilized. You can also specify more complex logic like "run at most 6 background jobs on the weekends or between midnight and 6:00 am, but 2 background jobs the rest of the time" (see the short throttling sketch after these examples).
timing utilities: Forks::Super keeps track of the start time and end time of every job, letting you log and analyze how long each job took:
fork { cmd => "some command" };
...
$pid = wait;
$elapsed = $pid->{end} - $pid->{start};
print LOG "That job took ${elapsed}s\n";
CPU affinity control: I can't tell whether this is something you need, but Guarav seemed to think it mattered. You can assign background jobs to specific cores
# restrict job to cores #0 and #2
$job = fork { sub => \&background_process, args => \@args,
              cpu_affinity => 0x05 };
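And to tie the job-throttling feature back to the question's four-core box, a minimal sketch - the command list is made up, and $Forks::Super::MAX_PROC / $Forks::Super::ON_BUSY are the documented knobs as I understand them:
use strict;
use warnings;
use Forks::Super;

$Forks::Super::MAX_PROC = 4;        # at most 4 background jobs at once
$Forks::Super::ON_BUSY  = 'queue';  # further fork calls queue up until a slot frees

# Hypothetical list of CPU-heavy commands with very different run times.
my @commands = map { [ './crunch', '--chunk', $_ ] } 1 .. 20;

fork { cmd => $_ } for @commands;   # the module keeps 4 running, queues the rest
waitall();                          # block until every job has finished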

Running perl function in background or separate process

I'm running in circles. I have a webpage that creates a huge file. Creating the file takes forever, and the work happens in a subroutine.
What is the best way for my page to run this subroutine but not wait for it to be created/processed? Are there any issues with apache processes since I'm doing this from a webpage?
The simplest way to perform this task is to simply use fork() and have the long-running subroutine run in the child process. Meanwhile, have the parent return to Apache. You indicate that you've tried this already, but absent more information on exactly what the code looks like and what is failing it's hard to help you move forward on this path.
Another option is to run a separate process that is responsible for managing the long-running task. Have the webpage send a unit of work to the long-running process using a local socket (or by creating a file with the necessary input data), and then your web script can return immediately while the separate process takes care of completing the long running task.
This method of decoupling the execution is fairly common and is often called a "task queue" (if there is some mechanism in place for queuing requests as they come in). There are a number of tools out there that will help you design this sort of solution (but for simple cases with filesystem-based communication you may be fine without them).
I think you want to create a worker grandchild of Apache -- that is:
Apache -> child -> grandchild
where the child dies right after forking the grandchild, and the grandchild closes STDIN, STDOUT, and STDERR. (The grandchild then creates the file.) These are the basic steps in daemonizing a worker: an orphaned background process no longer connected to the webserver.
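A hedged sketch of those steps - fork twice, reap the short-lived child, and detach the grandchild from the webserver's filehandles before it does the slow work (write_huge_file stands in for the real subroutine):
use strict;
use warnings;
use POSIX qw( setsid _exit );

sub run_in_background {
    my ($task) = @_;

    defined( my $child = fork() ) or die "first fork failed: $!";
    if ($child) {
        waitpid( $child, 0 );    # reap the intermediate child; the web script carries on
        return;
    }

    # Intermediate child: new session, fork again, exit at once so the
    # grandchild is orphaned (re-parented to init) rather than left a zombie.
    setsid() or die "setsid failed: $!";
    defined( my $grandchild = fork() ) or die "second fork failed: $!";
    _exit(0) if $grandchild;

    # Grandchild: let go of Apache's STDIN/STDOUT/STDERR, then do the slow work.
    open STDIN,  '<',  '/dev/null';
    open STDOUT, '>',  '/dev/null';
    open STDERR, '>&', \*STDOUT;
    $task->();
    _exit(0);
}

run_in_background( \&write_huge_file );    # hypothetical long-running subroutine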

Remove Perl module from child stack

I have a daemon which loads DBI (DBD::mysql) and then forks child processes. I'd like to prevent the DBI module from being in memory in the forked child processes.
So something like this:
#!/usr/bin/perl
use DBI;
my $dbh = DBI->connect(db_info);
my $pid = fork();
if ($pid == 0) {
    # The forked child process here should not have DBI loaded
}
Thanks for the help!
Loading a module is to execute it like a script. There's absolutely no difference between a module and a script to Perl. To unload a module, one would need to undo the effects of running it. That can't be done mechanically, and it's not feasible to do manually.
The simplest solution would be to have the child exec something. It could even be the script you are already running.
exec($^X, $0, '--child', @args);
The child can be given access to the socket by binding it to the child's fd 0 (stdin) and fd 1 (stdout).
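For example, a sketch of handing an already-connected socket (here the hypothetical $socket) to the re-exec'd child on fd 0 and fd 1:
# $socket is assumed to be a connected socket the child should take over;
# after the dups, the exec'd child just reads STDIN and writes STDOUT.
open( STDIN,  '<&', $socket ) or die "can't dup socket onto fd 0: $!";
open( STDOUT, '>&', $socket ) or die "can't dup socket onto fd 1: $!";
exec( $^X, $0, '--child', @args ) or die "exec failed: $!";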
You can't do that easily unless you put the load after the fork. But to do that you have to not use use. Do this instead:
my $pid = fork();
if ($pid) {
    # parent - now it's safe to load DBI
    require DBI;
    DBI->import();
} else {
    # child - DBI never gets loaded in this process
}
That should prevent the DBI module from loading until after the fork. The use routine essentially does a require/import but inside a BEGIN {} block which is why you have to not use it.
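In other words, the two forms below are roughly equivalent - which is exactly why the require has to be written out explicitly to delay loading until after the fork:
use DBI;

# ... is roughly the same as writing:
BEGIN {
    require DBI;
    DBI->import();
}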
If you are running a modern Linux system, then forks are COW (copy on write). This means pages from the parent are only copied to the child's address space if they are modified by the parent or the child. So the DBI module's memory is shared with the parent rather than duplicated in the forked child processes (until one of them writes to those pages).
Perl 5 does not have any way of unloading modules from memory. If you really need the children to have different code than the parent for some reason, you are better off separating that code out of the main code as its own script and then using exec after the fork to run the child script. This will be slower than normal forking since it has to compile the child code, so if you fork a lot, it might be better to have two scripts that talk to each other over sockets and have the "child" script pre-fork.
Knowing now what you want to do with this: since there isn't a good way to unload modules in Perl, a good solution to the problem is to write an authentication server separate from the application server. The application server asks the authentication server whether an IP has permissions. That way they remain in wholly separate processes. This might also have security benefits: your application code can't access your authentication database.
Since any given application is likely to expand to the point where it needs a SQL database of its own, this exercise is probably futile, but your call.
This is a bunch of extra work and maintenance and complexity. It's only worthwhile if it's causing you real memory problems, not just because it bugs you. Remember, RAM is very cheap. Developer time is very expensive.

Is there a way to have managed processes in Perl (i.e. a threads replacement that actually works)?

I have a multithreaded application in perl for which I have to rely on several non-thread-safe modules, so I have been using fork()ed processes with kill() signals as a message passing interface.
The problem is that the signal handlers are a bit erratic (to say the least) and often end up with processes that get killed in inappropriate states.
Is there a better way to do this?
Depending on exactly what your program needs to do, you might consider using POE, which is a Perl framework for multi-threaded applications with user-space threads. It's complex, but elegant and powerful and can help you avoid non-thread-safe modules by confining activity to a single Perl interpreter thread.
Helpful resources to get started:
Programming POE presentation by Matt Sergeant (start here to understand what it is and does)
POE project page (lots of cookbook examples)
Plus there are hundreds of pre-built POE components you can use to assemble into an application.
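To give a flavour of the event-driven style, a minimal hedged sketch of a POE session - one Perl interpreter cooperatively switching between event handlers rather than OS threads:
#!/usr/bin/perl
use strict;
use warnings;
use POE;

POE::Session->create(
    inline_states => {
        _start => sub {
            # Kick off the first 'tick' event when the session starts.
            $_[KERNEL]->yield( tick => 1 );
        },
        tick => sub {
            my ( $kernel, $n ) = @_[ KERNEL, ARG0 ];
            print "tick $n\n";
            # Re-post the event; other sessions would get a chance to run in between.
            $kernel->yield( tick => $n + 1 ) if $n < 3;
        },
    },
);

POE::Kernel->run();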
You can always have a pipe between parent and child to pass messages back and forth.
pipe my $reader, my $writer;
my $pid = fork();
if ( $pid == 0 ) {
    # child: writes its messages into the pipe
    close $reader;
    ...
}
else {
    # parent: reads the child's messages
    close $writer;
    my $msg_from_child = <$reader>;
    ....
}
Not a very comfortable way of programming, but it shouldn't be 'erratic'.
Have a look at forks.pm, a "drop-in replacement for Perl threads using fork()" which makes for much more sensible memory usage (but don't use it on Win32). It will allow you to declare "shared" variables and then it automatically passes changes made to such variables between the processes (similar to how threads.pm does things).
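A small sketch of what that drop-in looks like - the threads API is unchanged, but each "thread" is really a forked process and the shared variable is kept in sync behind the scenes:
use forks;           # must be loaded before any threads-style code
use forks::shared;   # same interface as threads::shared

my $count : shared = 0;

my @workers = map {
    threads->create( sub { lock($count); $count++; } );
} 1 .. 4;

$_->join() for @workers;
print "count is $count\n";    # prints 4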
From perl 5.8 onwards you should be looking at the core threads module. Have a look at http://metacpan.org/pod/threads
If you want to use modules which aren't thread-safe, you can usually load them with a require and import inside the thread entry point.
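For example (Some::Unsafe::Module is just a placeholder name), loading the module inside the thread's entry sub means only that thread ever compiles and uses it, instead of the module being cloned into every thread at creation time:
use strict;
use warnings;
use threads;

sub worker {
    # Placeholder for a non-thread-safe module; loaded only in this thread.
    require Some::Unsafe::Module;
    Some::Unsafe::Module->import();

    # ... do the work that needs the module here ...
    return;
}

threads->create( \&worker )->join();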