I have a server which creates an AnyEvent timer watcher object on every client connection (adding it to the AnyEvent event loop).
use AnyEvent;
...
my $db_handle = myschema->connect();
my $w; $w = AnyEvent->timer(
    interval => $interval,
    after    => $interval,
    cb       => sub {
        ## Remove the watcher from the loop on some condition
        unless ( my $ret = _check_session($sid) ) {
            undef $w;
        }
        ## Execute the main logic
        $db_handle->lookup($sid);
    },
);
...
So the callback will be executed every $interval seconds. If there are a lot of clients, some callbacks will have to be executed at the same time. How does AnyEvent handle this? Does it execute those callbacks one after another, or is there some concurrency mechanism so that the callbacks run simultaneously (for instance, speeding up several callbacks that are due at the same time by creating several threads)?
In my server's case the callback performs a database lookup, and the database handle was initialized outside the event loop. My concern is that even if AnyEvent had some concurrency mechanism, the callbacks still could not be executed simultaneously, because each callback would have to wait until another had finished its database lookup and the shared handle was free.
P.S. Thanks to 'ikegami' for the answer.
AnyEvent does not create any concurrency. It processes events when you call into it (e.g. ->recv).
As ikegami said - there's no concurrency in AnyEvent. Any event loop gives you the ability to handle events 'asynchronously' - so you can have multiple requests/operations/timers in progress simultaneously, without ever having concurrent code execution. Events are always handled sequentially.
For your specific case of multiple timers expiring at the same time - each timer is processed in turn (they are sorted on their expiry time), and all 'expired' timers are processed in each event loop iteration - see, e.g., lines 208-213 of Loop.pm for the pure-Perl implementation, which you can view on CPAN.
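To make the sequencing concrete, here's a minimal, self-contained sketch (mine, not from the original post) with two timers due at the same moment; their callbacks are still invoked one after the other within a single loop iteration:

use strict;
use warnings;
use AnyEvent;

my $cv    = AnyEvent->condvar;
my $fired = 0;
my @w;
for my $id (1 .. 2) {
    push @w, AnyEvent->timer(
        after => 1,
        cb    => sub {
            # Both timers expire "at the same time", but the callbacks
            # run sequentially inside one event-loop iteration.
            print "timer $id fired\n";
            $cv->send if ++$fired == 2;
        },
    );
}
$cv->recv;    # events are only processed while we're blocked in here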
I have to employ daemons in my code. I need a control daemon that constantly checks the database for tasks and supervises child daemons. The control daemon must assign tasks to the child daemons, monitor those tasks, create new children if one of them dies, and so on. The child daemons check the database for tasks assigned to them (by PID). How should I implement daemons for this purpose?
Daemon is just a code word for "background process that runs a long time". So the answer is 'it depends'. Perl has two major ways of doing multiprocessing:
Threading
You run a subroutine as a thread, in parallel with the main program code. (Which may then just monitor thread states).
The overhead of creating a thread is higher, but threads are better suited to 'shared memory' style multiprocessing, e.g. when you're passing significant quantities of data back and forth. There are several libraries that make passing information between threads positively straightforward. Personally I quite like Thread::Queue, Thread::Semaphore and Storable.
In particular, Storable has freeze and thaw, which let you move complex data structures (e.g. objects/hashes) around in queues - very useful.
Basic threading example:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $nthreads = 5;
my $process_q = Thread::Queue->new();
my $failed_q = Thread::Queue->new();
#this is a subroutine, but that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#Behind the scenes - Thread::Queue are 'shared' arrays.
sub worker {
    # NB - this will sit in a loop indefinitely, until you close the queue
    # using $process_q->end.
    # We do this once we've queued all the things we want to process,
    # and then the sub completes and exits neatly.
    # However, if you _don't_ end it, this will sit waiting forever.
    while ( my $server = $process_q->dequeue() ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}
#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);
#we 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns 'undefined' when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();
#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}
#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}
#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
    print "$server failed to ping\n";
}
Storable
Storable is worth a separate example, I think, because it's so handy for moving data around.
use threads;
use Thread::Queue;
use Storable qw ( freeze thaw );
use MyObject;    # home-made object.

my $work_q = Thread::Queue->new();

sub worker_thread {
    while ( my $packed_item = $work_q->dequeue ) {
        my $object = thaw($packed_item);
        $object->run_some_methods();
        $object->set_status("processed");
        # maybe return $object via 'freeze' and a queue?
    }
}

my $thr = threads->create( \&worker_thread );

my $newobject = MyObject->new("some_parameters");
$work_q->enqueue( freeze($newobject) );
$work_q->end();
$thr->join();
Because you're passing the object around via the queue, you're effectively cloning it between threads. So bear in mind that you may need to freeze it and 'return' it somehow once you've done something to its internal state. But it does mean you can do this asynchronously without needing to arbitrate locking or shared memory. You may also find it useful to be able to 'store' and 'retrieve' an object - this works as you might expect. (Although I daresay you might need to be careful about availability of module versions vs. defined attributes if you're retrieving a stored object.)
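For completeness, here's a minimal store/retrieve sketch (the filename and data are purely illustrative); note that Storable operates on references:

use Storable qw( store retrieve );

my %config = ( host => 'localhost', port => 5432 );

# Write the structure to disk (store takes a reference)...
store( \%config, '/tmp/config.stored' );

# ...and read it back later, possibly in another process.
my $restored = retrieve('/tmp/config.stored');
print "port is $restored->{port}\n";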
Forking
Your script clones itself, leaving a 'parent' and a 'child'; the child then generally diverges and does something different. This uses the Unix built-in fork(), which as a result is well optimised and generally very efficient - but because it's low level, it's difficult to transfer lots of data between processes. You'll end up doing some slightly complicated things for interprocess communication (IPC) - see perlipc for more detail. It's efficient not least because most fork() implementations do a lazy data copy - memory for your process is only duplicated when it's needed, e.g. when it's changed.
It's therefore really good if you want to delegate a lot of tasks that don't require much supervision from the parent. For example - you might fork a web server, because the child is reading files and delivering them to a particular client, and the parent doesn't much care. Or perhaps you would do this if you want to spend a lot of CPU time computing a result, and only pass that result back.
Note also that on Windows fork() is only emulated (using threads under the hood), so you lose most of its benefits there.
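As a hedged illustration of the low-level approach - plain fork() plus a pipe for IPC, no modules, with placeholder 'work' - see perlipc for the full picture:

use strict;
use warnings;

# The parent reads one result back from the child over a pipe.
pipe( my $reader, my $writer ) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {    # child
    close $reader;
    my $result = 6 * 7;    # stand-in for real work
    print {$writer} "$result\n";
    close $writer;
    exit 0;
}

# parent
close $writer;
chomp( my $answer = <$reader> );
close $reader;
waitpid( $pid, 0 );
print "child $pid said: $answer\n";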
Useful libraries include Parallel::ForkManager
A basic example of 'forking' code looks a bit like this:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $concurrent_fork_limit = 4;
my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);
foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
    my $pid = $fork_manager->start;
    if ($pid) {
        print "$$: Fork made a child with pid $pid\n";
    } else {
        print "$$: child process started, with a key of $thing ($pid)\n";
    }
    $fork_manager->finish;
}
$fork_manager->wait_all_children();
Which is right for you?
It's hard to say without a bit more detail about what you're trying to accomplish - this is why Stack Overflow usually likes to see some working code, the approaches you've tried, and so on.
I would generally say:
If you need to pass data around, use threads. Thread::Queue, especially when combined with Storable, is very good for that.
If you don't, forks (on Unix) are generally faster/more efficient. (But fast alone isn't usually the point - write understandable code first and aim for speed second; it usually doesn't matter that much.)
Avoid spawning too many threads where possible - they're fairly heavy in memory and creation overhead. You're far better off using a fixed number of threads in a 'worker pool' style of programming than repeatedly creating new, short-lived ones. (Forks handle frequent creation better, because the lazy copy-on-write means the whole process isn't physically duplicated up front.)
In the scenario you give, I'd suggest threads and queues. Your parent process can track child threads via threads->list(), joining finished ones and creating new ones to keep the right number running, and can feed work to the workers via a central queue. Alternatively, use multiple queues - one per 'child' - as a task-assignment system. A sketch of that supervision loop follows.
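Here is a minimal, hedged sketch of that supervision idea (the task list and worker body are placeholders, and the one-second polling interval is arbitrary):

use strict;
use warnings;
use threads;
use Thread::Queue;

my $task_q      = Thread::Queue->new();
my $num_workers = 4;

sub worker {
    while ( defined( my $task = $task_q->dequeue() ) ) {
        # ... real work for $task goes here ...
        print threads->self()->tid(), ": handled task $task\n";
    }
}

$task_q->enqueue( 1 .. 20 );    # placeholder tasks
$task_q->end();

# Supervision loop: reap finished threads, keep $num_workers running.
while ( $task_q->pending() or threads->list() ) {
    $_->join() for threads->list(threads::joinable);
    while ( threads->list(threads::running) < $num_workers and $task_q->pending() ) {
        threads->create( \&worker );
    }
    sleep 1;
}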
In my application, I want to have a polling loop which blocks on a socket receive operation but times out after 100 ms. This would allow me to exit the loop when I want (e.g. the user clicks something in the UI) while avoiding using a busy loop or Thread.sleep.
However, it seems that once a .NET socket is opened, it can only time out once. After the first timeout, any calls that would block throw an exception immediately.
According to this question, "you can’t timeout or cancel asynchronous Socket operations." Why not? Is there a better way to approach the problem in the .NET world?
To write a non-busy polling loop on a .NET socket, you can use the Poll socket method, as follows:
while (keepGoing) {
    if (mySocket.Poll(1000 * timeout_milliseconds, SelectMode.SelectRead)) {
        // Assert: mySocket.Available > 0.
        // TODO: Call mySocket.Receive or a related method
    }
}
In this case, Poll returns false if no data becomes available to read from the socket within the specified timeout (its first argument is in microseconds, hence the multiplication by 1000). Another thread can set keepGoing to false (declare it volatile, or protect it with some other memory barrier) if it wants to shut down the polling loop cleanly.
I read that GCD synchronous queues (dispatch_sync) should be used to implement critical sections of code. An example would be a block that subtracts transaction amount from account balance. The interesting part of sync calls is a question, how does that affect the work of other blocks on multiple threads?
Let's imagine a situation where three threads use and execute both system and user-defined blocks from the main and custom queues in asynchronous mode; those blocks all execute in parallel in some order. Now, if a block is put on a custom queue with dispatch_sync, does that mean that all other blocks (including those on other threads) are suspended until that block finishes? Or does it mean only that some lock is held for that block while the others keep executing? And if other blocks use the same data as the sync'd block, then presumably those blocks will inevitably have to wait until the lock is released.
IMHO it shouldn't matter whether there is one core or several: sync mode should freeze the whole app's work. However, these are just my thoughts, so please comment and share your insights :)
Synchronous dispatch suspends the execution of your code until the dispatched block has finished. Asynchronous dispatch returns immediately, the block is executed asynchronously with regard to the calling code:
dispatch_sync(somewhere, ^{ something });
// Reached later, when the block is finished.
dispatch_async(somewhere, ^{ something });
// Reached immediately. The block might be waiting
// to be executed, executing or already finished.
And there are two kinds of dispatch queues, serial and concurrent. The serial ones dispatch the blocks strictly one by one in the order they are being added. When one finishes, another one starts. There is only one thread needed for this kind of execution. The concurrent queues dispatch the blocks concurrently, in parallel. There are more threads being used there.
You can mix and match sync/async dispatch and serial/concurrent queues as you see fit. If you want to use GCD to guard access to a critical section, use a single serial queue and dispatch all operations on the shared data on this queue (synchronously or asynchronously, does not matter). That way there will always be just one block operating with the shared data:
- (void) addFoo: (id) foo {
    dispatch_sync(guardingQueue, ^{ [sharedFooArray addObject:foo]; });
}

- (void) removeFoo: (id) foo {
    dispatch_sync(guardingQueue, ^{ [sharedFooArray removeObject:foo]; });
}
Now if guardingQueue is a serial queue, the add/remove operations can never clash even if the addFoo: and removeFoo: methods are called concurrently from different threads.
No it doesn't.
The synchronised part is that the block is put on a queue but control does not pass back to the calling function until the block returns.
Many uses of GCD are asynchronous; you put a block on a queue and, rather than waiting for the block to complete its work, control is passed back to the calling function.
This has no effect on other queues.
If you need to serialize access to a certain resource then there are at least two mechanisms available to you. If you have an account object (one that is unique for a given account number), then you can do something like:
@synchronized(accountObject) { ... }
If you don't have an object but are using a C structure for which there is only one
such structure for a given account number then you can do the following:
// Should be added to the account structure.
// 1 => at most one thread can hold accountLock at a time.
dispatch_semaphore_t accountLock = dispatch_semaphore_create(1);

// In your block you do the following:
block = ^(void) {
    dispatch_semaphore_wait(accountLock, DISPATCH_TIME_FOREVER);
    // Do something
    dispatch_semaphore_signal(accountLock);
};
// -- Edited: semaphore was leaking.
// At the appropriate time release the lock
// If the semaphore was created in the init then
// the semaphore should be released in the release method.
dispatch_release(accountLock);
With this, regardless of the level of concurrency of your queues, you are guaranteed that only one thread will access an account at any given time.
There are many more types of synchronization objects but these two are easy to use and
quite flexible.
I need to fetch some data from many web data providers, who do not expose any service, so I have to write something like this, using for example WWW::Mechanize:
use WWW::Mechanize;

my @urls = ('http://www.first.data.provider.com', 'http://www.second.data.provider.com', 'http://www.third.data.provider.com');
my %results;
foreach my $url (@urls) {
    my $mech = WWW::Mechanize->new();
    $mech->get($url);
    $mech->form_number(1);
    $mech->set_fields('user' => 'myuser', pass => 'mypass');
    my $resp = $mech->submit();
    $results{$url} = parse($resp->content());
}
consume(%results);
Is there some (possibly simple ;-) way to fetch the data into a common %results variable simultaneously, i.e. in parallel, from all of the providers?
threads are to be avoided in Perl. use threads is mostly for
emulating UNIX-style fork on Windows; beyond that, it's pointless.
(If you care, the implementation makes this fact very clear. In perl,
the interpreter is a PerlInterpreter object. The way threads
works is by making a bunch of threads, and then creating a brand-new
PerlInterpreter object in each thread. Threads share absolutely
nothing, even less than child processes do; fork gets you
copy-on-write, but with threads, all the copying is done in Perl
space! Slow!)
If you'd like to do many things concurrently in the same process, the
way to do that in Perl is with an event loop, like
EV,
Event, or
POE, or by using Coro. (You can
also write your code in terms of the AnyEvent API, which will let
you use any event loop. This is what I prefer.) The difference
between the two is how you write your code.
AnyEvent (and EV, Event,
POE, and so on) forces you to write your code in a callback-oriented
style. Instead of control flowing from top to bottom, control is in a
continuation-passing style. Functions don't return values, they call
other functions with their results. This allows you to run many IO
operations in parallel -- when a given IO operation has yielded
results, your function to handle those results will be called. When
another IO operation is complete, that function will be called. And
so on.
The disadvantage of this approach is that you have to rewrite your
code. So there's a module called Coro that gives Perl real
(user-space) threads that will let you write your code top-to-bottom,
but still be non-blocking. (The disadvantage of this is that it
heavily modifies Perl's internals. But it seems to work pretty well.)
So, since we don't want to rewrite
WWW::Mechanize
tonight, we're going to use Coro. Coro comes with a module called
Coro::LWP that will make
all calls to LWP be
non-blocking. It will block the current thread ("coroutine", in Coro
lingo), but it won't block any other threads. That means you can make
a ton of requests all at once, and process the results as they become
available. And Coro will scale better than your network connection;
each coroutine uses just a few k of memory, so it's easy to have tens
of thousands of them around.
With that in mind, let's see some code. Here's a program that starts
three HTTP requests in parallel, and prints the length of each
response. It's similar to what you're doing, minus the actual
processing; but you can just put your code in where we calculate the
length and it will work the same.
We'll start off with the usual Perl script boilerplate:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;    # for 'say', used below
Then we'll load the Coro-specific modules:
use Coro;
use Coro::LWP;
use EV;
Coro uses an event loop behind the scenes; it will pick one for you if
you want, but we'll just specify EV explicitly. It's the best event
loop.
Then we'll load the modules we need for our work, which is just:
use WWW::Mechanize;
Now we're ready to write our program. First, we need a list of URLs:
my @urls = (
    'http://www.google.com/',
    'http://www.jrock.us/',
    'http://stackoverflow.com/',
);
Then we need a function to spawn a thread and do our work. To make a
new thread on Coro, you call async like async { body; of the
thread; goes here }. This will create a thread, start it, and
continue with the rest of the program.
sub start_thread($) {
    my $url = shift;
    return async {
        say "Starting $url";
        my $mech = WWW::Mechanize->new;
        $mech->get($url);
        printf "Done with $url, %d bytes\n", length $mech->content;
    };
}
So here's the meat of our program. We just put our normal LWP program
inside async, and it will be magically non-blocking. get blocks,
but the other coroutines will run while waiting for it to get the data
from the network.
Now we just need to start the threads:
start_thread $_ for @urls;
And finally, we want to start handling events:
EV::loop;
And that's it. When you run this, you'll see some output like:
Starting http://www.google.com/
Starting http://www.jrock.us/
Starting http://stackoverflow.com/
Done with http://www.jrock.us/, 5456 bytes
Done with http://www.google.com/, 9802 bytes
Done with http://stackoverflow.com/, 194555 bytes
As you can see, the requests are made in parallel, and you didn't have
to resort to threads!
Update
You mentioned in your original post that you want to limit the number of HTTP requests that run in parallel. One way to do that is with a semaphore,
Coro::Semaphore in Coro.
A semaphore is like a counter. When you want to use the resource that a semaphore protects, you "down" the semaphore. This decrements the counter and continues running your program. But if the counter is at zero when you try to down the semaphore, your thread/coroutine will go to sleep until it is non-zero. When the count goes up again, your thread will wake up, down the semaphore, and continue. Finally, when you're done using the resource that the semaphore protects, you "up" the semaphore and give other threads the chance to run.
This lets you control access to a shared resource, like "making HTTP requests".
All you need to do is create a semaphore that your HTTP request threads will share:
my $sem = Coro::Semaphore->new(5);
The 5 means "let us call 'down' 5 times before we block", or, in other words, "let there be 5 concurrent HTTP requests".
Before we add any code, let's talk about what can go wrong. Something bad that could happen is a thread "down"-ing the semaphore, but never "up"-ing it when it's done. Then nothing can ever use that resource, and your program will probably end up doing nothing. There are lots of ways this could happen. If you wrote some code like $sem->down; do something; $sem->up, you might feel safe, but what if "do something" throws an exception? Then the semaphore will be left down, and that's bad.
Fortunately, Perl makes it easy to have scope Guard objects, that will automatically run code when the variable holding the object goes out of scope. We can make the code be $sem->up, and then we'll never have to worry about holding a resource when we don't intend to.
Coro::Semaphore integrates the concept of guards, meaning you can say my $guard = $sem->guard, and that will automatically down the semaphore and up it when control flows away from the scope where you called guard.
With that in mind, all we have to do to limit the number of parallel requests is guard the semaphore at the top of our HTTP-using coroutines:
async {
    say "Waiting for semaphore";
    my $guard = $sem->guard;
    say "Starting";
    ...;
    return result;
}
Addressing the comments:
If you don't want your program to live forever, there are a few options. One is to run the event loop in another thread, and then join on each worker thread. This lets you pass results from the thread to the main program, too:
async { EV::loop };

# start all threads
my @running = map { start_thread $_ } @urls;

# wait for each one to return
my @results = map { $_->join } @running;

for my $result (@results) {
    say $result->[0], ': ', $result->[1];
}
Your threads can return results like:
sub start_thread($) {
    return async {
        ...;
        return [$url, length $mech->content];
    }
}
This is one way to collect all your results in a data structure. If you don't want to return things, remember that all the coroutines share state. So you can put:
my %results;
at the top of your program, and have each coroutine update the results:
async {
    ...;
    $results{$url} = 'whatever';
};
When all the coroutines are done running, your hash will be filled with the results. You'll have to join each coroutine to know when the answer is ready, though.
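Putting those pieces together into one runnable shape - a sketch only; parse() and consume() are the hypothetical functions from your original question, so here we just record the content length instead:

#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;

use Coro;
use Coro::LWP;
use EV;
use WWW::Mechanize;

my @urls = ('http://www.google.com/', 'http://stackoverflow.com/');
my %results;

# Run the event loop in its own coroutine, as described above.
async { EV::loop };

my @coros = map {
    my $url = $_;
    async {
        my $mech = WWW::Mechanize->new;
        $mech->get($url);
        $results{$url} = length $mech->content;    # or parse($mech->content)
    };
} @urls;

# Join every coroutine; once this finishes, %results is complete.
$_->join for @coros;

# consume(%results);  # hypothetical consumer from the question
say "$_ => $results{$_}" for sort keys %results;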
Finally, if you are doing this as part of a web service, you should use a coroutine-aware web server like Corona. This will run each HTTP request in a coroutine, allowing you to handle multiple HTTP requests in parallel, in addition to being able to send HTTP requests in parallel. This will make very good use of memory, CPU, and network resources, and will be pretty easy to maintain!
(You can basically cut-n-paste our program from above into the coroutine that handles the HTTP request; it's fine to create new coroutines and join inside a coroutine.)
Looks like ParallelUserAgent is what you're looking for.
Well, you could create threads to do it--specifically see perldoc perlthrtut and Thread::Queue. So, it might look something like this.
use strict;
use warnings;

use WWW::Mechanize;
use threads;
use threads::shared;
use Thread::Queue;

my @urls = (    # whatever
);
my %results : shared;
my $queue = Thread::Queue->new();
foreach (@urls) {
    $queue->enqueue($_);
}

my @threads     = ();
my $num_threads = 16;    # Or whatever...a pre-specified number of threads.
foreach ( 1 .. $num_threads ) {
    push @threads, threads->create( \&mechanize );
}

# One undef per thread tells the workers to exit their dequeue loops.
foreach (@threads) {
    $queue->enqueue(undef);
}

foreach (@threads) {
    $_->join();
}

consume( \%results );

sub mechanize {
    while ( my $url = $queue->dequeue ) {
        my $mech = WWW::Mechanize->new();
        $mech->get($url);
        $mech->form_number(1);
        $mech->set_fields( 'user' => 'myuser', pass => 'mypass' );
        my $resp = $mech->submit();
        $results{$url} = parse( $resp->content() );
    }
}
Note that since you're storing your results in a hash (instead of writing stuff to a file), you shouldn't need any kind of locking unless there's a danger of overwriting values. In which case, you'll want to lock %results by replacing
$results{$url} = parse($resp->content());
with
{
    lock(%results);
    $results{$url} = parse($resp->content());
}
Try https://metacpan.org/module/Parallel::Iterator -- I saw a very good presentation about it last week, and one of the examples was parallel retrieval of URLs; it's also covered in the module's pod. It's simpler than using threads manually (although it uses fork underneath).
As far as I can tell, you'd still be able to use WWW::Mechanize, but avoid messing about with memory sharing between threads. It's a higher-level model for this task, and might be a little simpler, leaving the main logic of @Jack Maney's mechanize routine intact.
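A rough, hedged sketch of what that might look like, assuming the iterate() interface from the module's synopsis (an options hashref, then a worker that receives an index and an item); parse() and consume() are the hypothetical functions from the question:

use strict;
use warnings;
use Parallel::Iterator qw( iterate );
use WWW::Mechanize;

my @urls = (
    'http://www.first.data.provider.com',
    'http://www.second.data.provider.com',
    'http://www.third.data.provider.com',
);

# Each call to the worker runs in a forked child; results come back
# to the parent as ( index, value ) pairs.
my $iter = iterate(
    { workers => 3 },    # assumed option; check the module's pod
    sub {
        my ( $index, $url ) = @_;
        my $mech = WWW::Mechanize->new();
        $mech->get($url);
        $mech->form_number(1);
        $mech->set_fields( 'user' => 'myuser', pass => 'mypass' );
        my $resp = $mech->submit();
        return parse( $resp->content() );    # parse() from the question
    },
    \@urls,
);

my %results;
while ( my ( $index, $parsed ) = $iter->() ) {
    $results{ $urls[$index] } = $parsed;
}

consume(%results);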
Futures are very convenient, but in practice, you may need some guarantees on their execution. For example, consider:
import scala.actors.Futures._
def slowFn(time: Int) = {
  Thread.sleep(time * 1000)
  println("%d second fn done".format(time))
}
val fs = List( future(slowFn(2)), future(slowFn(10)) )
awaitAll(5000, fs:_*)
println("5 second expiration. Continuing.")
Thread.sleep(12000) // ie more calculations
println("done with everything")
The idea is to kick off some slow running functions in parallel. But we wouldn't want to hang forever if the functions executed by the futures don't return. So we use awaitAll() to put a timeout on the futures. But if you run the code, you see that the 5 second timer expires, but the 10 second future continues to run and returns later. The timeout doesn't kill the future; it just limits the join wait.
So how do you kill a future after a timeout period? It seems like futures can't be used in practice unless you're certain that they will return in a known amount of time. Otherwise, you run the risk of losing threads in the thread pool to non-terminating futures until there are none left.
So the questions are: How do you kill futures? What are the intended usage patterns for futures given these risks?
Futures are intended for settings where you do need to wait for the computation to complete, no matter what. That's why they are described as being used for slow-running functions: you want that function's result, but you have other stuff you can be doing meanwhile. In fact, you might have many futures, all independent of each other, that you want to run in parallel while you wait for them all to complete.
The timeout just gives you a way to stop waiting and collect whatever partial results are ready.
I think the reason Future can't simply be "killed" is exactly the same as why java.lang.Thread.stop() is deprecated.
While a Future is running, a Thread is required. To stop a Future without calling stop() on the executing Thread, you need application-specific logic: periodically checking an application-specific flag, or the interrupted status of the executing Thread, is one way to do it.