What modules should I look at for doing multithreading in Perl?
I'm looking to do something fairly low performance; I want to use threads to run multiple workers simultaneously, with each of them sleeping for varying amounts of time.
There are lots of reasons why you might not want to multithread. If you do want to multithread, however, the following code might serve as a helpful example. It creates a number of jobs, puts those in a thread-safe queue, then starts some threads that pull jobs from the queue and complete them. Each thread keeps pulling jobs from the queue in a loop until it sees no more jobs. The program waits for all the threads to finish and then prints the total time it spent working on the jobs.
#!/usr/bin/perl
use threads;
use Thread::Queue;
use Modern::Perl;
my $queue= Thread::Queue->new;
my $thread_count= 4;
my $job_count= 10;
my $start_time= time;
my $max_job_time= 10;
# Come up with some jobs and put them in a thread-safe queue. Each job
# is a string with an id and a number of seconds to sleep. Jobs consist
# of sleeping for the specified number of seconds.
my @jobs= map {"$_," . (int(rand $max_job_time) + 1)} (1 .. $job_count);
$queue->enqueue(@jobs);
# List the jobs
say "Job IDs: ", join(", ", map {(split /,/, $_)[0]} @jobs);
# Start the threads
my @threads= map {threads->create(\&function, $_)} (1 .. $thread_count);
# Wait for all the threads to complete their work
$_->join for (@threads);
# We're all done
say "All done! Total time: ", time - $start_time;
# Here's what each thread does. Each thread starts, then fetches jobs
# from the job queue until there are no more jobs in the queue. Then,
# the thread exits.
sub function {
    my $thread_id= shift;
    my ($job, $job_id, $seconds);
    while($job= $queue->dequeue_nb) {
        ($job_id, $seconds)= split /,/, $job;
        say "Thread $thread_id starting on job $job_id ",
            "(job will take $seconds seconds).";
        sleep $seconds;
        say "Thread $thread_id done with job $job_id.";
    }
    say "No more jobs for thread $thread_id; thread exiting.";
}
Most recent versions of Perl have threading support. Run perl -V:usethreads to see if it is available on your system.
$ perl -V:usethreads
usethreads='define'
perldoc threads gives a pretty good introduction to using them.
If performance isn't a big issue, then forking multiple processes will probably be a lot easier than dealing with threads. I frequently use Parallel::ForkManager, which is very simple but very good at what it does.
It sounds like you don't need preemptive multithreading, in which case look at POE's cooperative model. Since your code will only yield to other threads when you decide, and only one thread runs at a time, development and debugging will be much easier.
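For a flavor of the cooperative style, here is a minimal POE sketch; the session names and tick counts are purely illustrative. Two "workers" take turns, scheduled by timer events rather than preemption.
#!/usr/bin/perl
use strict;
use warnings;
use POE;
for my $name (qw(alpha beta)) {
    POE::Session->create(
        inline_states => {
            _start => sub { $_[HEAP]{count} = 0; $_[KERNEL]->yield('tick') },
            tick   => sub {
                my ( $kernel, $heap ) = @_[ KERNEL, HEAP ];
                print "$name: tick ", ++$heap->{count}, "\n";
                # re-schedule ourselves; the kernel runs other sessions meanwhile
                $kernel->delay( tick => 1 ) if $heap->{count} < 3;
            },
        },
    );
}
POE::Kernel->run();    # runs until no session has pending events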
Coro is a nice module for cooperative multitasking.
99% of the time, this is what you need if you want threads in Perl.
If you want threads to speed up your code when multiple cores are available, you are going down the wrong path. Perl can easily be 50x slower than other languages. Rewriting your code to run on two CPUs means that it now only runs 25x slower than other languages ... on one CPU. Better to spend the effort porting the slow parts to a different language.
But if you just don't want IO to block other "threads", then Coro is exactly what you want.
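A minimal Coro sketch (the worker and step counts are arbitrary): each async block is a cooperative thread that explicitly hands over control with cede.
#!/usr/bin/perl
use strict;
use warnings;
use Coro;
my @workers = map {
    my $id = $_;
    async {
        for my $step ( 1 .. 3 ) {
            print "worker $id, step $step\n";
            cede;    # yield so other ready coros get a turn
        }
    };
} 1 .. 2;
$_->join for @workers;    # wait for all workers to finish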
Related
I have to employ daemons in my code. I need a control daemon that constantly checks the database for tasks and supervises child daemons. The control daemon must assign tasks to the child daemons, control tasks, create new children if one of them dies, etc. The child daemons check the database for tasks assigned to them (by PID). How should I implement daemons for this purpose?
Daemon is just a code word for "background process that runs a long time". So the answer is 'it depends'. Perl has two major ways of doing multiprocessing:
Threading
You run a subroutine as a thread, in parallel with the main program code (which may then just monitor thread states).
The overhead of creating a thread is higher, but it's better suited for 'shared memory' style multiprocessing, e.g. when you're passing significant quantities of data back and forth. There are several libraries that make passing information between threads positively straightforward. Personally I quite like Thread::Queue, Thread::Semaphore and Storable.
In particular - Storable has freeze and thaw which lets you move complex data structures (e.g. objects/hashes) around in queues, which is very useful.
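Thread::Semaphore deserves a quick illustration as well. A minimal sketch, with an arbitrary limit of 2, that throttles how many threads may work at once:
use strict;
use warnings;
use threads;
use Thread::Semaphore;
my $semaphore = Thread::Semaphore->new(2);    # at most 2 workers at a time
my @threads = map {
    threads->create(sub {
        $semaphore->down;    # blocks while 2 other threads hold the semaphore
        print "thread ", threads->tid(), " in the critical section\n";
        sleep 1;
        $semaphore->up;
    });
} 1 .. 5;
$_->join for @threads;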
Basic threading example:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is a subroutine, but one that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes - Thread::Queue objects are 'shared' arrays.
sub worker {
    #NB - this will sit in a loop indefinitely, until you close the queue
    #using $process_q->end.
    #we do this once we've queued all the things we want to process,
    #and the sub then completes and exits neatly.
    #however if you _don't_ end it, this will sit waiting forever.
    while ( my $server = $process_q->dequeue() ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns undef when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
    print "$server failed to ping\n";
}
Storable
When it comes to Storable, this is worth a separate example I think, because it's handy for moving data around.
use strict;
use warnings;
use threads;
use Storable qw( freeze thaw );
use MyObject;    #home made object.
use Thread::Queue;
my $work_q = Thread::Queue->new();
sub worker_thread {
    while ( my $packed_item = $work_q->dequeue ) {
        my $object = thaw($packed_item);
        $object->run_some_methods();
        $object->set_status("processed");
        #maybe return $object via 'freeze' and a queue?
    }
}
my $thr = threads->create( \&worker_thread );
my $newobject = MyObject->new("some_parameters");
$work_q->enqueue( freeze($newobject) );
$work_q->end();
$thr->join();
Because you're passing the object around within the queue, you're effectively cloning it between threads. So bear in mind that you may need to freeze it and 'return' it somehow once you've done something to its internal state. But it does mean you can do this asynchronously without needing to arbitrate locking or shared memory. You may also find it useful to be able to 'store' and 'retrieve' an object - this works as you might expect. (Although I daresay you might need to be careful about availability of module versions vs. defined attributes if you're retrieving a stored object.)
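For instance, here is a minimal sketch of 'returning' the processed object through a second queue; the plain hashref stands in for a real object, and the queue names are illustrative.
use strict;
use warnings;
use threads;
use Thread::Queue;
use Storable qw( freeze thaw );
my $work_q    = Thread::Queue->new();
my $results_q = Thread::Queue->new();
my $thr = threads->create(sub {
    while ( defined( my $packed = $work_q->dequeue() ) ) {
        my $object = thaw($packed);
        $object->{status} = 'processed';          # stand-in for real method calls
        $results_q->enqueue( freeze($object) );   # 'return' the modified copy
    }
    $results_q->end();
});
$work_q->enqueue( freeze( { id => 1, status => 'new' } ) );
$work_q->end();
$thr->join();
while ( defined( my $packed = $results_q->dequeue_nb() ) ) {
    my $object = thaw($packed);
    print "object $object->{id} came back with status '$object->{status}'\n";
}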
Forking
Your script clones itself, leaving a 'parent' and a 'child' - the child then generally diverges and does something different. This uses the Unix built-in fork(), which as a result is well optimised and generally very efficient - but because it's low level, it's difficult to do lots of data transfer. You'll end up doing some slightly complicated things for interprocess communication - IPC. (See perlipc for more detail.) It's efficient not least because most fork() implementations do a lazy data copy - memory space for your process is only allocated as it's needed, e.g. when it's changed.
It's therefore really good if you want to delegate a lot of tasks that don't require much supervision from the parent. For example - you might fork a web server, because the child is reading files and delivering them to a particular client, and the parent doesn't much care. Or perhaps you would do this if you want to spend a lot of CPU time computing a result, and only pass that result back.
It's also not natively supported on Windows (Perl emulates fork() there using interpreter threads).
Useful libraries include Parallel::ForkManager
A basic example of 'forking' code looks a bit like this:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $concurrent_fork_limit = 4;
my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);
foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
    my $pid = $fork_manager->start;
    if ($pid) {
        print "$$: Fork made a child with pid $pid\n";
    } else {
        print "$$: child process started, with a key of $thing ($pid)\n";
    }
    $fork_manager->finish;
}
$fork_manager->wait_all_children();
Which is right for you?
So it's hard to say without a bit more detail about what you're trying to accomplish. This is why Stack Overflow usually likes to see some working code, approaches you've tried, etc.
I would generally say:
if you need to pass data around, use threads. Thread::Queue especially when combined with Storable is very good for it.
if you don't, forks (on Unix) are generally faster/more efficient. (But fast alone isn't usually enough - write understandable stuff first, and aim for speed second. It usually doesn't matter much.)
Where possible, avoid spawning too many threads - they're fairly intensive in memory and creation overhead. You're far better off using a fixed number of workers in a 'worker thread' style of programming than repeatedly creating new, short-lived threads. (Forks, on the other hand, are actually very good at this, because they don't copy your whole process.)
In the scenario you give, I would suggest you're looking at threads and queues. Your parent process can track child threads via threads->list(), joining or creating threads to keep the right number, and can feed data via a central queue to your worker threads. Or have multiple queues - one per 'child' - and use that as a task-assignment system, as in the sketch below.
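A minimal sketch of that one-queue-per-child idea; the worker count and task strings are illustrative. Each worker owns a queue, and the parent assigns tasks by choosing which queue to enqueue into.
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
# one queue per worker, used as a task-assignment system
my %queue_for = map { $_ => Thread::Queue->new() } 1 .. 3;
my @workers = map {
    my $id = $_;
    threads->create(sub {
        while ( defined( my $task = $queue_for{$id}->dequeue() ) ) {
            print "worker $id handling: $task\n";
        }
    });
} keys %queue_for;
# the 'control' side hands tasks to specific workers
$queue_for{1}->enqueue('check database');
$queue_for{2}->enqueue('send report');
$queue_for{3}->enqueue('rotate logs');
$_->end()  for values %queue_for;    # makes each worker's dequeue return undef
$_->join() for @workers;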
I am writing an application similar to what was suggested here. Essentially, I am using Perl to manage the execution of multiple CPU intensive processes in parallel via fork and wait. However, I am running on a 4-core machine, and I have many more processes, all with very dissimilar expected run-times which aren't known a priori.
Ultimately, it would take more effort to estimate the run times and gang them appropriately than to simply use a queue system for each core. I want each core to keep processing, with as little downtime as possible, until everything is done. Is there a preferred algorithm or mechanism for doing this? I assume this is a common problem/use case, so I don't want to re-invent the wheel, as my wheel will probably be inferior to the 'right way'.
As a minor aside, I would prefer to not have to import additional modules (like Parallel::ForkManager) to accomplish this, but if that is the best way to go, then I will consider it.
~Thanks!
EDIT: Fixed 'here' link: Thanks to ikegami
EDIT: P::FM is too easy to use, not to... Today I Learned.
Forks::Super has some features that are good for this sort of task.
extended syntax, but not a lot of new syntax: if you already have a program with fork and wait calls, you can still use the features of Forks::Super without too many changes. That is, your new code will still have fork and wait calls.
job throttling: like Parallel::ForkManager, you can control how many jobs you run simultaneously. When one job completes, the module can start another one, keeping your system fully utilized. You can also specify more complex logic like "run at most 6 background jobs on the weekends or between midnight and 6:00 am, but 2 background jobs the rest of the time"
timing utilities: Forks::Super keeps track of the start time and end time of every job, letting you log and analyze how long each job took:
fork { cmd => "some command" };
...
$pid = wait;
$elapsed = $pid->{end} - $pid->{start};
print LOG "That job took ${elapsed}s\n";
CPU affinity control: I can't tell whether this is something you need, but Guarav seemed to think it mattered. You can assign background jobs to specific cores:
# restrict job to cores #0 and #2
$job = fork { sub => \&background_process, args => \@args,
              cpu_affinity => 0x05 };
I'm not well versed in close-to-OS programming techniques, but as far as I know, when it comes to doing something in parallel in Perl, the weapon of choice is fork, and probably some useful modules built upon it. The doc page for fork says:
Does a fork(2) system call to create a new process running the same program at the same point.
As a consequence, having a big application that consumes a lot of memory and calling fork for a small task means there will be 2 big perl processes, and the second will waste resources just to do some simple work.
So, the question is: what to do (or how to use fork, if it's the only method) in order to have a detached portion of code running independently and consuming just the resources it needs?
Just a very simple example:
use strict;
use warnings;
my @big_array = ( 1 .. 2000000 ); # at least 80 MB memory
sleep 10; # to have time to inspect the memory usage easily
fork();
sleep 10; # to have time to inspect the memory usage easily
and the child process consumes 80+ MB too.
To be clear: it's not important to communicate with this detached code or to use its result somehow; it just needs to be possible to say "hey, run this simple task for me in the background and let me continue my heavy work meanwhile ... and don't waste my resources!" when running a heavy perl application.
fork() to exec() is your bunny here. You fork() to create a new process (which is a fairly cheap operation, see below), then exec() to replace the big perl you've got running with something smaller. It looks like this:
use strict;
use warnings;
use 5.010;
my @ary = (1 .. 10_000_000);
if (my $pid = fork()) {
    # parent
    say "Forked $pid from $$; sleeping";
    sleep 1_000;
} else {
    # child
    exec('perl', '-e', 'sleep 1_000');
}
(@ary was just used to fill up the original process' memory a bit.)
I said that fork()ing was relatively cheap, even though it does copy the entire original process. These statements are not in conflict; the guys who designed fork noticed this same problem. The copy is lazy, that is, only the bits that are actually changed are copied.
If you find you want the processes to talk to each other, you'll start getting into the more complex domain of IPC, about which a number of books have been written.
Your forked process is not actually using 80MB of resident memory. A large portion of that memory will be shared - 'borrowed' from the parent process until either the parent or child writes to it, at which point copy-on-write semantics will cause the memory to actually be copied.
If you want to drop that baggage completely, run exec in your fork. That will replace the child Perl process with a different executable, thus freeing the memory. It's also perfect if you don't need to communicate anything back to the parent.
There is no way to fork just a subset of your process's footprint, so the usual workarounds come down to:
fork before you run memory intensive code in the parent process
start a separate process with system or open HANDLE,'|-',.... Of course this new process won't inherit any data from its parent, so you will need to pass data to this child somehow - see the sketch below.
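A minimal sketch of the piped-open approach; wc -l is just a stand-in for a real child command.
use strict;
use warnings;
# hand lines to a child process through its STDIN
open( my $child, '|-', 'wc', '-l' ) or die "cannot fork: $!";
print {$child} "line $_\n" for 1 .. 100;
close $child or warn "child exited with status $?";    # close waits for the child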
fork() as implemented on most operating systems is nicely efficient. It commonly uses a technique called copy-on-write, meaning that pages are initially shared until one or other process writes to them. Also, a lot of your process memory is going to be read-only mapped files anyway.
Just because one process uses 80MB before fork() doesn't mean that afterwards the two will use 160MB. To start with, the total will be only a tiny fraction more than 80MB, until each process starts dirtying more pages.
I want to write unit tests for a subroutine in perl. The subroutine is using multiple threads to do its tasks. So, it first creates some threads and then it waits for them to join.
The problem is that our unit tests run on a server which is not able to run multi-threaded tests, so I need to somehow mock out the thread behavior. Basically I want to override the threads create and join functions so that the code is no longer threaded. Any pointers on how I can do that and test the code?
Edit : The server fails to run the threaded code for the following reason:
Devel::Cover does not yet work with threads
Update: this answer doesn't solve the OP's problem as described in the edited question, but it might be useful to someone.
Perl threads are an interpreter emulation, not an operating system feature. So, they should work on any platform. If your testing server doesn't support threads, it's probably for one of these reasons:
Your version of Perl is very old.
Perl was compiled without thread support.
Your testing framework wasn't created with threaded code in mind.
The first two could be easily rectified by updating your environment. However, I suspect yours is the third issue.
I don't think you should solve this by mocking the thread behavior. This changes the original code too much to be a valid test. And it would be a significant amount of work anyway, so why not direct that effort toward getting a threaded test working?
The exact issues depend on your code, but probably the issue is that your subroutine starts a thread and then returns, with the thread still running. Then your test framework runs the sub over and over, accumulating a whole bunch of concurrent threads.
In that case, all you need is a wrapper sub that calls the sub you are testing, and then blocks until the threads are complete. This should be fairly simple. Take a look at threads->list() to see how you can detect running threads. Just have a loop that waits until the threads in question are no longer running before exiting the wrapper sub.
Here is a simple complete example demonstrating a wrapper sub:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
sub sub_to_test {
    threads->create(sub { sleep 5; print("Thread done\n"); threads->detach() });
    return "Sub done\n";
}
sub wrapper {
    #Get a count of the running threads.
    my $original_running_threads = threads->list(threads::running);
    my @results = sub_to_test(@_);
    #block until the number of running threads is the same as when we started.
    sleep 1 while (threads->list(threads::running) > $original_running_threads);
    return @results;
}
print wrapper;
I have a multithreaded application in Perl for which I have to rely on several non-thread-safe modules, so I have been using fork()ed processes with kill() signals as a message-passing interface.
The problem is that the signal handlers are a bit erratic (to say the least) and often end up with processes that get killed in inappropriate states.
Is there a better way to do this?
Depending on exactly what your program needs to do, you might consider using POE, which is a Perl framework for multi-threaded applications with user-space threads. It's complex, but elegant and powerful and can help you avoid non-thread-safe modules by confining activity to a single Perl interpreter thread.
Helpful resources to get started:
Programming POE presentation by Matt Sergeant (start here to understand what it is and does)
POE project page (lots of cookbook examples)
Plus there are hundreds of pre-built POE components you can use to assemble into an application.
You can always have a pipe between parent and child to pass messages back and forth.
pipe my $reader, my $writer;
my $pid = fork();
defined $pid or die "fork failed: $!";
if ( $pid == 0 ) {
    close $reader;
    ...
}
else {
    close $writer;
    my $msg_from_child = <$reader>;
    ....
}
Not a very comfortable way of programming, but it shouldn't be 'erratic'.
Have a look at forks.pm, a "drop-in replacement for Perl threads using fork()" which makes for much more sensible memory usage (but don't use it on Win32). It will allow you to declare "shared" variables and then it automatically passes changes made to such variables between the processes (similar to how threads.pm does things).
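A minimal sketch of what that looks like, assuming the threads-compatible API that forks.pm advertises:
use forks;           # must be loaded before any threads-style code
use forks::shared;   # the counterpart of threads::shared
my $counter : shared = 0;
my @kids = map {
    threads->create(sub { lock($counter); $counter++ });
} 1 .. 4;
$_->join for @kids;
print "counter is $counter\n";    # 4 - the updates propagated between processes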
From Perl 5.8 onwards you should be looking at the core threads module. Have a look at http://metacpan.org/pod/threads
If you want to use modules which aren't thread-safe, you can usually load them with a require and import inside the thread entry point.
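A minimal sketch of that pattern, with the hypothetical Some::NonThreadSafe standing in for the real module:
use strict;
use warnings;
use threads;
my $thr = threads->create(sub {
    # load the non-thread-safe module at runtime, inside this thread only
    require Some::NonThreadSafe;    # hypothetical module name
    Some::NonThreadSafe->import();
    return Some::NonThreadSafe::do_work();
});
my $result = $thr->join();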