I have to employ daemons in my code. I need a control daemon that constantly checks the database for the tasks and supervises child daemons. The control daemon must assign tasks to the child daemons, control tasks, create new children if one of them dies, etc. The child daemons check database for tasks for them (by PID). How should I implement daemons for this purpose?
"Daemon" is just shorthand for "a background process that runs for a long time", so the answer is "it depends". Perl has two major ways of doing multiprocessing:
Threading
You run a subroutine as a thread, in parallel with the main program code. (Which may then just monitor thread states).
The overhead of creating a thread is higher than that of a fork, but threads are better suited to 'shared memory' style multiprocessing, e.g. when you're passing significant quantities of data back and forth. There are several libraries that make passing information between threads positively straightforward. Personally I quite like Thread::Queue, Thread::Semaphore and Storable.
In particular - Storable has freeze and thaw which lets you move complex data structures (e.g. objects/hashes) around in queues, which is very useful.
Basic threading example:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is an ordinary subroutine, but it runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes, Thread::Queue objects are 'shared' arrays.
sub worker {
    #NB - this will sit in a loop indefinitely, until you close the queue
    #using $process_q->end.
    #we do this once we've queued all the things we want to process,
    #and then the sub completes and exits neatly.
    #however if you _don't_ end it, this will sit waiting forever.
    while ( my $server = $process_q->dequeue() ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - once we do, no more items may be inserted,
#and 'dequeue' returns undef when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
    print "$server failed to ping\n";
}
Storable
When it comes to Storable, it's worth a separate example I think, because it's so handy for moving data around.
use strict;
use warnings;
use threads;
use Storable qw ( freeze thaw );
use MyObject;    #home made object.
use Thread::Queue;
my $work_q = Thread::Queue->new();
sub worker_thread {
    while ( my $packed_item = $work_q->dequeue ) {
        my $object = thaw($packed_item);
        $object->run_some_methods();
        $object->set_status("processed");
        #maybe return $object via 'freeze' and a queue?
    }
}
my $thr = threads->create( \&worker_thread );
my $newobject = MyObject->new("some_parameters");
$work_q->enqueue( freeze($newobject) );
$work_q->end();
$thr->join();
Because you're passing the object around within the queue, you're effectively cloning it between threads. So bear in mind that you may need to freeze it and 'return' it somehow once you've done something to its internal state. But it does mean you can do this asynchronously without needing to arbitrate locking or shared memory. You may also find it useful to be able to 'store' and 'retrieve' an object to and from a file - this works as you might expect. (Although I daresay you might need to be careful about availability of module versions vs. defined attributes if you're retrieving a stored object.)
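For instance, one way to 'return' the processed object is a second queue carrying frozen copies back to the parent. This is a minimal sketch extending the example above (the result queue and the final loop are my additions; MyObject remains the same hypothetical home-made class):
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Storable qw( freeze thaw );
use MyObject;    #the same hypothetical home-made object as above
my $work_q   = Thread::Queue->new();
my $result_q = Thread::Queue->new();
sub worker_thread {
    while ( my $packed_item = $work_q->dequeue ) {
        my $object = thaw($packed_item);
        $object->run_some_methods();
        $object->set_status("processed");
        #freeze the modified clone and hand it back to the parent
        $result_q->enqueue( freeze($object) );
    }
}
my $thr = threads->create( \&worker_thread );
$work_q->enqueue( freeze( MyObject->new("some_parameters") ) );
$work_q->end();
$thr->join();
#thaw the processed copies back in the parent
while ( my $packed = $result_q->dequeue_nb ) {
    my $processed = thaw($packed);
    print "Got back a ", ref($processed), " object\n";
}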
Forking
Your script clones itself, leaving a 'parent' and a 'child' - the child then generally diverges and does something different. This uses the Unix built-in fork(), which as a result is well optimised and generally very efficient - but because it's low level, it's difficult to do lots of data transfer. You'll end up doing some slightly complicated things for interprocess communication - IPC. (See perlipc for more detail.) It's efficient not least because most fork() implementations do a lazy data copy - memory for your process is only copied as it's needed, e.g. when it's changed.
It's therefore really good if you want to delegate a lot of tasks that don't require much supervision from the parent. For example - you might fork a web server, because the child is reading files and delivering them to a particular client, and the parent doesn't much care. Or perhaps you would do this if you want to spend a lot of CPU time computing a result, and only pass that result back.
It's also not natively supported on Windows (Perl emulates fork there using threads).
Useful libraries include Parallel::ForkManager
A basic example of 'forking' code looks a bit like this:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $concurrent_fork_limit = 4;
my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);
foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
    my $pid = $fork_manager->start;
    if ($pid) {
        print "$$: Fork made a child with pid $pid\n";
    } else {
        print "$$: child process started, with a key of $thing ($pid)\n";
    }
    $fork_manager->finish;
}
$fork_manager->wait_all_children();
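If the parent does need something back from a forked child, that's where perlipc comes in, as mentioned above. Here is a minimal sketch (assuming a Unix-ish system) using plain fork() and a pipe, with the child writing one line of results back to the parent:
#!/usr/bin/perl
use strict;
use warnings;
#parent will read from $reader; child will write to $writer
pipe( my $reader, my $writer ) or die "pipe failed: $!";
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid) {
    #parent
    close $writer;
    my $result = <$reader>;
    print "parent $$ got from child $pid: $result";
    waitpid( $pid, 0 );
} else {
    #child
    close $reader;
    print {$writer} "some expensively computed result\n";
    exit 0;
}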
Which is right for you?
So it's hard to say without a bit more detail about what you're trying to accomplish. This is why Stack Overflow usually likes to see some working code, the approaches you've tried, etc.
I would generally say:
if you need to pass data around, use threads. Thread::Queue especially when combined with Storable is very good for it.
if you don't, forks (on Unix) are generally faster/more efficient. (But fast alone isn't usually enough - write understandable stuff first, and aim for speed second. It usually doesn't matter much.)
Where possible, avoid spawning too many threads - they're fairly heavy in memory and creation overhead. You're far better off using a fixed number of threads in a 'worker thread' style of program than repeatedly creating new, short-lived threads. (Forks, on the other hand, are actually very good at this, because copy-on-write means they don't physically copy your whole process up front.)
I would suggest that in the scenario you give, you're looking at threads and queues. Your parent process can track child threads via threads->list(), and join or create them to keep the right number running, and it can feed data via a central queue to your worker threads. Or have multiple queues - one per 'child' - and use that as a task assignment system.
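As a rough sketch of that last idea - per-worker queues as the assignment mechanism (the task names, worker count and round-robin assignment here are made up for illustration; in your case the parent would assign based on what it reads from the database):
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nworkers = 3;
#one queue per 'child' thread - the parent decides who gets what
my @queues  = map { Thread::Queue->new() } 1 .. $nworkers;
my @workers = map {
    my $q = $queues[$_];
    threads->create(
        sub {
            while ( defined( my $task = $q->dequeue() ) ) {
                print threads->self()->tid(), ": working on $task\n";
            }
        }
    );
} 0 .. $nworkers - 1;
#parent assigns tasks - round-robin here, but this could be driven
#by whatever the database says each child should be doing
my @tasks = map {"task_$_"} 1 .. 10;
my $i = 0;
for my $task (@tasks) {
    $queues[ $i++ % $nworkers ]->enqueue($task);
}
#close each queue so the workers fall out of their loops, then reap them
$_->end()  for @queues;
$_->join() for @workers;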
Related
sub parallelizing {
    my $counter = 0;
    my $MAX_PROCESS = 10;
    my $workerQueue = Parallel::ForkManager->new($MAX_PROCESS);
    foreach ( 1 .. $MAX_PROCESS ) {
        $workerQueue->start and next;
        print "process #" . $counter . " started\n";
        $counter = $counter + 1;
        $workerQueue->finish;
    }
}
I am using Parallel::ForkManager to create child processes that I want to share the variable $counter, but it turns out that it's not shared. Is there any way to let child processes share a variable?
Short answer - no, not that way. fork creates a separate process instance, with its own memory state. It implements a copy-on-write mechanism (generally), so it's quite efficient, but processes can't trivially share memory.
However, you can use something like threads::shared to create a shared variable that's accessible from multiple threads (although I'd normally suggest using Thread::Queue instead).
Something like this: "Perl daemonize with child daemons" might give some useful examples.
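For completeness, a minimal sketch of the threads::shared approach (the worker count and loop bounds are made up; note the lock to avoid lost updates from concurrent increments):
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
my $counter : shared = 0;
my @workers = map {
    threads->create(
        sub {
            for ( 1 .. 1000 ) {
                lock($counter);    #serialise the increments
                $counter++;
            }
        }
    );
} 1 .. 4;
$_->join() for @workers;
print "counter is $counter\n";    #4000 - every thread saw the same variable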
I have a couple long running perl scripts in windows (strawberry perl) that I'm working on.
The first process is a parent monitoring process. It restarts the child process every 24 hours and will be always running.
The second is the child payment processing script. It is imperative that this process completes whatever it's doing before being shutdown.
It's my understanding that signal handling doesn't work in perl on win32 and that it shouldn't be relied on. Is there some other way that I can handle a signal? Win32::Process::Kill seems to kill the process without letting it safely shut down.
This is the signal handling that I've tried...
#Child
my $interrupted = 0;
$SIG{INT} = sub{$interrupted = 1;};
while(!$interrupted){
    #keep doing your thing, man
}
#Parent
my $pid = open2(\*CHLD_OUT,\*CHLD_IN,'C:\\strawberry\\perl\\bin\\perl.exe','process.pl');
kill INT=>$pid;
waitpid($pid,0);
The only other thing I can think of is to open a socket between the two processes and write messages across the socket. But there must be something easier. Anyone know of any module that can do this?
Update
I've started working on creating a "signal" mechanism via IO::Socket::INET and IO::Select by opening a socket. This appears to work and I'm thinking of writing a module that is compatible with AnyEvent. But I'm still interested in an implementation that doesn't require opening a listening port and that doesn't require a server/client relationship. Is it possible to do this by subscribing to and firing custom events in windows?
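For reference, the sort of mechanism described in that update might look something like the sketch below (the port number and the 'shutdown' message are arbitrary choices of mine): the child polls a loopback listener with IO::Select between units of work, and the parent connects and sends one line.
#Child - check a loopback listener for a 'shutdown' message between work units
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 9999,        #arbitrary port, for illustration only
    Listen    => 1,
    Reuse     => 1,
) or die "listen failed: $!";
my $select = IO::Select->new($listener);
my $interrupted = 0;
until ($interrupted) {
    #do one unit of payment processing here...
    #then poll (timeout 0, so non-blocking) for a pending connection
    if ( $select->can_read(0) ) {
        my $conn = $listener->accept;
        my $msg  = <$conn>;
        $interrupted = 1 if defined $msg and $msg =~ /^shutdown/;
    }
}
#safe to finish up and exit here
#Parent - connect and ask the child to shut down:
# my $sock = IO::Socket::INET->new( PeerAddr => '127.0.0.1', PeerPort => 9999 )
#     or die "connect failed: $!";
# print {$sock} "shutdown\n";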
Hmm, an interesting question. One thing I'd be wondering - how feasible is it to rewrite your code to thread?
When faced with a similar problem I found encapsulating the 'child' process as a thread meant I could better 'manage' it from the parent.
e.g.:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
my $interrupted : shared;
sub child {
    while ( not $interrupted ) {
        #loop;
    }
}
#main process
while (1) {
    $interrupted = 0;
    my $child = threads->create( \&child );
    sleep 60;
    $interrupted = 1;
    $child->join();
    sleep 3600;
}
But with threading you also get its IPC mechanisms - Thread::Queue, threads::shared, Thread::Semaphore - and, I'm at least fairly sure, you can send pseudo 'kill' signals within the script, because threads emulates 'kill' signals internally too.
http://www.perlmonks.org/?node_id=557328
Add to your thread:
$SIG{'TERM'} = sub { threads->exit(); };
And then your 'main' can:
$thr->kill('TERM')->detach();
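Putting those two fragments together, a minimal sketch might look like this (the 60-second run time is arbitrary; the handler runs inside the child thread, at a safe point, when the main thread sends the pseudo-signal):
#!/usr/bin/perl
use strict;
use warnings;
use threads;
sub child {
    #exit the thread cleanly when the pseudo 'TERM' signal arrives
    $SIG{'TERM'} = sub { threads->exit(); };
    while (1) {
        #keep doing your thing, man
        sleep 1;
    }
}
my $thr = threads->create( \&child );
sleep 60;    #let the child work for a while
$thr->kill('TERM')->detach();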
With ActiveState Perl I use windows native events through the Win32::Event module.
This way you don't need to implement anything fancy, and you can even have your script interact with native applications.
I use this module in many applications.
Since it is part of Win32::IPC, which uses native code, it may not be available for Strawberry Perl. If that is the case you could try compiling it from the CPAN sources. It might be worth a try if you have lots of Windows perl-based software.
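For what it's worth, a minimal sketch of that approach (the event name 'MyApp_Shutdown' is made up): one process polls a named event between units of work, and the other signals it when it's time to stop.
#Child - poll a named Windows event between units of work
use strict;
use warnings;
use Win32::Event;
my $shutdown = Win32::Event->new( 1, 0, 'MyApp_Shutdown' )
    or die "could not create event";
while (1) {
    #do one unit of work here...
    #wait(0) polls without blocking and is true once the event is signalled
    last if $shutdown->wait(0);
}
#shut down safely here
#Parent - signal the child to shut down:
# my $shutdown = Win32::Event->open('MyApp_Shutdown') or die "no such event";
# $shutdown->set;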
I'm not involved in close-to-OS programming techniques, but as I know, when it comes to doing something in parallel in Perl the weapon of choice is fork and probably some useful modules built upon it. The doc page for fork says:
Does a fork(2) system call to create a new process running the same program at the same point.
As a consequence, having a big application that consumes a lot of memory and calling fork for a small task means there will be 2 big perl processes, and the second will waste resources just to do some simple work.
So, the question is: what to do (or how to use fork, if it's the only method) in order to have a detached portion of code running independently and consuming just the resources it needs?
Just a very simple example:
use strict;
use warnings;
my @big_array = ( 1 .. 2000000 ); # at least 80 MB of memory
sleep 10; # to have time to inspect the memory usage easily
fork();
sleep 10; # to have time to inspect the memory usage easily
and the child process consumes 80+ MB too.
To be clear: it's not important to communicate with this detached code or to use its result somehow, just to be able to say "hey, run this simple task for me in the background and let me continue my heavy work meanwhile ... and don't waste my resources!" when running a heavy perl application.
fork() to exec() is your bunny here. You fork() to create a new process (which is a fairly cheap operation, see below), then exec() to replace the big perl you've got running with something smaller. It looks like this:
use strict;
use warnings;
use 5.010;
my @ary = (1 .. 10_000_000);
if (my $pid = fork()) {
    # parent
    say "Forked $pid from $$; sleeping";
    sleep 1_000;
} else {
    # child
    exec('perl', '-e', 'sleep 1_000');
}
(@ary was just used to fill up the original process' memory a bit.)
I said that fork()ing was relatively cheap, even though it does copy the entire original process. These statements are not in conflict; the guys who designed fork noticed this same problem. The copy is lazy, that is, only the bits that are actually changed are copied.
If you find you want the processes to talk to each other, you'll start getting into the more complex domain of IPC, about which a number of books have been written.
Your forked process is not actually using 80MB of resident memory. A large portion of that memory will be shared - 'borrowed' from the parent process until either the parent or child writes to it, at which point copy-on-write semantics will cause the memory to actually be copied.
If you want to drop that baggage completely, run exec in your fork. That will replace the child Perl process with a different executable, thus freeing the memory. It's also perfect if you don't need to communicate anything back to the parent.
There is no way to fork just a subset of your process's footprint, so the usual workarounds come down to:
fork before you run memory intensive code in the parent process
start a separate process with system or open HANDLE,'|-',.... Of course this new process won't inherit any data from its parent, so you will need to pass data to this child somehow.
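A minimal sketch of that second option, using the '|-' form of open (the parent gets a handle attached to the child's STDIN, which is how you pass the data across; the child command here is just an illustrative one-liner):
use strict;
use warnings;
#'|-' forks a child running the given command and gives us its STDIN
open( my $to_child, '|-', 'perl', '-ne', 'print "child got: $_"' )
    or die "cannot start child: $!";
#the child inherits nothing useful from us, so send it what it needs
print {$to_child} "$_\n" for 1 .. 5;
close $to_child or warn "child exited with status $?\n";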
fork() as implemented on most operating systems is nicely efficient. It commonly uses a technique called copy-on-write, meaning that pages are initially shared until one or other process writes to them. Also, a lot of your process memory is going to be read-only mapped files anyway.
Just because one process uses 80MB before fork() doesn't mean that afterwards the two will use 160. To start with it will be only a tiny fraction more than 80MB, until each process starts dirtying more pages.
I need to fetch some data from many web data providers, who do not expose any service, so I have to write something like this, using for example WWW::Mechanize:
use WWW::Mechanize;
my @urls = ('http://www.first.data.provider.com', 'http://www.second.data.provider.com', 'http://www.third.data.provider.com');
my %results;
foreach my $url (@urls) {
    my $mech = WWW::Mechanize->new();
    $mech->get($url);
    $mech->form_number(1);
    $mech->set_fields('user' => 'myuser', pass => 'mypass');
    my $resp = $mech->submit();
    $results{$url} = parse($resp->content());
}
consume(%results);
Is there some (possibly simple ;-) way to fetch data to a common %results variable, simultaneously, i.e: in parallel, from all the providers?
threads are to be avoided in Perl. use threads is mostly for emulating UNIX-style fork on Windows; beyond that, it's pointless. (If you care, the implementation makes this fact very clear. In perl, the interpreter is a PerlInterpreter object. The way threads works is by making a bunch of threads, and then creating a brand-new PerlInterpreter object in each thread. Threads share absolutely nothing, even less than child processes do; fork gets you copy-on-write, but with threads, all the copying is done in Perl space! Slow!)
If you'd like to do many things concurrently in the same process, the way to do that in Perl is with an event loop, like EV, Event, or POE, or by using Coro. (You can also write your code in terms of the AnyEvent API, which will let you use any event loop. This is what I prefer.) The difference between the two is how you write your code.
AnyEvent (and EV, Event, POE, and so on) forces you to write your code in a callback-oriented style. Instead of control flowing from top to bottom, control is in a continuation-passing style. Functions don't return values, they call other functions with their results. This allows you to run many IO operations in parallel -- when a given IO operation has yielded results, your function to handle those results will be called. When another IO operation is complete, that function will be called. And so on.
The disadvantage of this approach is that you have to rewrite your code. So there's a module called Coro that gives Perl real (user-space) threads that will let you write your code top-to-bottom, but still be non-blocking. (The disadvantage of this is that it heavily modifies Perl's internals. But it seems to work pretty well.)
So, since we don't want to rewrite WWW::Mechanize tonight, we're going to use Coro. Coro comes with a module called Coro::LWP that will make all calls to LWP be non-blocking. It will block the current thread ("coroutine", in Coro lingo), but it won't block any other threads. That means you can make a ton of requests all at once, and process the results as they become available. And Coro will scale better than your network connection; each coroutine uses just a few k of memory, so it's easy to have tens of thousands of them around.
With that in mind, let's see some code. Here's a program that starts three HTTP requests in parallel, and prints the length of each response. It's similar to what you're doing, minus the actual processing; but you can just put your code in where we calculate the length and it will work the same.
We'll start off with the usual Perl script boilerplate:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;    # for 'say', which we use below
Then we'll load the Coro-specific modules:
use Coro;
use Coro::LWP;
use EV;
Coro uses an event loop behind the scenes; it will pick one for you if you want, but we'll just specify EV explicitly. It's the best event loop.
Then we'll load the modules we need for our work, which is just:
use WWW::Mechanize;
Now we're ready to write our program. First, we need a list of URLs:
my @urls = (
    'http://www.google.com/',
    'http://www.jrock.us/',
    'http://stackoverflow.com/',
);
Then we need a function to spawn a thread and do our work. To make a new thread on Coro, you call async like async { body; of the; thread; goes here }. This will create a thread, start it, and continue with the rest of the program.
sub start_thread($) {
    my $url = shift;
    return async {
        say "Starting $url";
        my $mech = WWW::Mechanize->new;
        $mech->get($url);
        printf "Done with $url, %d bytes\n", length $mech->content;
    };
}
So here's the meat of our program. We just put our normal LWP program inside async, and it will be magically non-blocking. get blocks, but the other coroutines will run while waiting for it to get the data from the network.
Now we just need to start the threads:
start_thread $_ for @urls;
And finally, we want to start handling events:
EV::loop;
And that's it. When you run this, you'll see some output like:
Starting http://www.google.com/
Starting http://www.jrock.us/
Starting http://stackoverflow.com/
Done with http://www.jrock.us/, 5456 bytes
Done with http://www.google.com/, 9802 bytes
Done with http://stackoverflow.com/, 194555 bytes
As you can see, the requests are made in parallel, and you didn't have to resort to threads!
Update
You mentioned in your original post that you want to limit the number of HTTP requests that run in parallel. One way to do that is with a semaphore - Coro::Semaphore in Coro.
A semaphore is like a counter. When you want to use the resource that a semaphore protects, you "down" the semaphore. This decrements the counter and continues running your program. But if the counter is at zero when you try to down the semaphore, your thread/coroutine will go to sleep until it is non-zero. When the count goes up again, your thread will wake up, down the semaphore, and continue. Finally, when you're done using the resource that the semaphore protects, you "up" the semaphore and give other threads the chance to run.
This lets you control access to a shared resource, like "making HTTP requests".
All you need to do is create a semaphore that your HTTP request threads will share:
my $sem = Coro::Semaphore->new(5);
The 5 means "let us call 'down' 5 times before we block", or, in other words, "let there be 5 concurrent HTTP requests".
Before we add any code, let's talk about what can go wrong. Something bad that could happen is a thread "down"-ing the semaphore, but never "up"-ing it when it's done. Then nothing can ever use that resource, and your program will probably end up doing nothing. There are lots of ways this could happen. If you wrote some code like $sem->down; do something; $sem->up, you might feel safe, but what if "do something" throws an exception? Then the semaphore will be left down, and that's bad.
Fortunately, Perl makes it easy to have scope guard objects, which will automatically run code when the variable holding the object goes out of scope. We can make that code be $sem->up, and then we'll never have to worry about holding a resource when we don't intend to.
Coro::Semaphore integrates the concept of guards, meaning you can say my $guard = $sem->guard, and that will automatically down the semaphore and up it when control flows away from the scope where you called guard.
With that in mind, all we have to do to limit the number of parallel requests is guard the semaphore at the top of our HTTP-using coroutines:
async {
    say "Waiting for semaphore";
    my $guard = $sem->guard;
    say "Starting";
    ...;
    return result;
}
Addressing the comments:
If you don't want your program to live forever, there are a few options. One is to run the event loop in another thread, and then join on each worker thread. This lets you pass results from the thread to the main program, too:
async { EV::loop };
# start all threads
my @running = map { start_thread $_ } @urls;
# wait for each one to return
my @results = map { $_->join } @running;
for my $result (@results) {
    say $result->[0], ': ', $result->[1];
}
Your threads can return results like:
sub start_thread($) {
    my $url = shift;
    return async {
        ...;
        return [$url, length $mech->content];
    };
}
This is one way to collect all your results in a data structure. If you don't want to return things, remember that all the coroutines share state. So you can put:
my %results;
at the top of your program, and have each coroutine update the results:
async {
    ...;
    $results{$url} = 'whatever';
};
When all the coroutines are done running, your hash will be filled with the results. You'll have to join each coroutine to know when the answer is ready, though.
Finally, if you are doing this as part of a web service, you should use a coroutine-aware web server like Corona. This will run each HTTP request in a coroutine, allowing you to handle multiple HTTP requests in parallel, in addition to being able to send HTTP requests in parallel. This will make very good use of memory, CPU, and network resources, and will be pretty easy to maintain!
(You can basically cut-n-paste our program from above into the coroutine that handles the HTTP request; it's fine to create new coroutines and join inside a coroutine.)
Looks like ParallelUserAgent is what you're looking for.
Well, you could create threads to do it--specifically see perldoc perlthrtut and Thread::Queue. So, it might look something like this.
use WWW::Mechanize;
use threads;
use threads::shared;
use Thread::Queue;
my @urls = (    #whatever
);
my %results :shared;
my $queue = Thread::Queue->new();
foreach (@urls) {
    $queue->enqueue($_);
}
my @threads = ();
my $num_threads = 16;    #Or whatever...a pre-specified number of threads.
foreach ( 1 .. $num_threads ) {
    push @threads, threads->create(\&mechanize);
}
foreach (@threads) {
    $queue->enqueue(undef);
}
foreach (@threads) {
    $_->join();
}
consume(\%results);
sub mechanize {
    while ( my $url = $queue->dequeue ) {
        my $mech = WWW::Mechanize->new();
        $mech->get($url);
        $mech->form_number(1);
        $mech->set_fields('user' => 'myuser', pass => 'mypass');
        my $resp = $mech->submit();
        $results{$url} = parse($resp->content());
    }
}
Note that since you're storing your results in a hash (instead of writing stuff to a file), you shouldn't need any kind of locking unless there's a danger of overwriting values, in which case you'll want to lock %results by replacing
$results{$url} = parse($resp->content());
with
{
    lock(%results);
    $results{$url} = parse($resp->content());
}
Try https://metacpan.org/module/Parallel::Iterator -- saw a very good presentation about it last week, and one of the examples was parallel retrieval of URLs -- it's also covered in the pod example. It's simpler than using threads manually (although it uses fork underneath).
As far as I can tell, you'd still be able to use WWW::Mechanize, but avoid messing about with memory sharing between threads. It's a higher-level model for this task, and might be a little simpler, leaving the main logic of @Jack Maney's mechanize routine intact.
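A minimal sketch of what that might look like, following Parallel::Iterator's iterate interface (the fetching body is lifted from the question; treat the details as an approximation of the module's documented usage rather than a tested solution):
use strict;
use warnings;
use WWW::Mechanize;
use Parallel::Iterator qw( iterate );
my @urls = ( 'http://www.first.data.provider.com',
             'http://www.second.data.provider.com' );
#the worker runs in a forked child; its return value is serialised back
my $iter = iterate(
    sub {
        my ( $index, $url ) = @_;
        my $mech = WWW::Mechanize->new();
        $mech->get($url);
        return $mech->content();    #or parse() it here, as in the question
    },
    \@urls
);
#collect the results back in the parent as they arrive
my %results;
while ( my ( $index, $content ) = $iter->() ) {
    $results{ $urls[$index] } = $content;
}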
What modules should I look at for doing multithreading in Perl?
I'm looking to do something fairly low performance; I want threads to run multiple workers simultaneously, with each of them sleeping for varying amounts of time.
There are lots of reasons why you might not want to multithread. If you do want to multithread, however, the following code might serve as a helpful example. It creates a number of jobs, puts them in a thread-safe queue, then starts some threads that pull jobs from the queue and complete them. Each thread keeps pulling jobs from the queue in a loop until it sees no more jobs. The program waits for all the threads to finish and then prints the total time that it spent working on the jobs.
#!/usr/bin/perl
use threads;
use Thread::Queue;
use Modern::Perl;
my $queue= Thread::Queue->new;
my $thread_count= 4;
my $job_count= 10;
my $start_time= time;
my $max_job_time= 10;
# Come up with some jobs and put them in a thread-safe queue. Each job
# is a string with an id and a number of seconds to sleep. Jobs consist
# of sleeping for the specified number of seconds.
my @jobs= map {"$_," . (int(rand $max_job_time) + 1)} (1 .. $job_count);
$queue->enqueue(@jobs);
# List the jobs
say "Job IDs: ", join(", ", map {(split /,/, $_)[0]} @jobs);
# Start the threads
my @threads= map {threads->create(sub {function($_)})} (1 .. $thread_count);
# Wait for all the threads to complete their work
$_->join for (@threads);
# We're all done
say "All done! Total time: ", time - $start_time;
# Here's what each thread does. Each thread starts, then fetches jobs
# from the job queue until there are no more jobs in the queue. Then,
# the thread exits.
sub function {
    my $thread_id= shift;
    my ($job, $job_id, $seconds);
    while($job= $queue->dequeue_nb) {
        ($job_id, $seconds)= split /,/, $job;
        say "Thread $thread_id starting on job $job_id ",
            "(job will take $seconds seconds).";
        sleep $seconds;
        say "Thread $thread_id done with job $job_id.";
    }
    say "No more jobs for thread $thread_id; thread exiting.";
}
Most recent versions of Perl have threading support. Run perl -V:usethreads to see if it is available in your system.
$ perl -V:usethreads
usethreads='define'
perldoc threads gives a pretty good introduction to using them.
If performance isn't a big issue, then forking multiple processes will probably be a lot easier than dealing with threads. I frequently use Parallel::ForkManager which is very simple, but very good at what it does.
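For reference, a minimal sketch of the usual Parallel::ForkManager idiom (the sleeps here stand in for the real work):
use strict;
use warnings;
use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new(4);    #at most 4 children at once
for my $seconds ( 3, 1, 4, 1, 5 ) {
    $pm->start and next;                   #parent continues the loop here
    #everything below runs only in the child
    print "$$: sleeping for $seconds seconds\n";
    sleep $seconds;
    $pm->finish;                           #child exits here
}
$pm->wait_all_children;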
It sounds like you don't need preemptive multithreading; in which case, look at POE's cooperative model. Since your code will only yield to other threads when you decide, and you'll only have one thread running at a time, development and debugging will be much easier.
Coro is a nice module for cooperative multitasking.
99% of the time, this is what you need if you want threads in Perl.
If you want threads to speed up your code when multiple cores are available, you are going down the wrong path. Perl is 50x slower than other languages. Rewriting your code to run on two CPUs means that it now only runs 25x slower than other languages ... on one CPU. Better to spend the effort porting the slow parts to a different language.
But if you just don't want IO to block other "threads", then Coro is exactly what you want.