Perl: how to fetch data from URLs in parallel?

I need to fetch some data from many web data providers, who do not expose any service, so I have to write something like this, using for example WWW::Mechanize:
use strict;
use warnings;
use WWW::Mechanize;

my @urls = ('http://www.first.data.provider.com', 'http://www.second.data.provider.com', 'http://www.third.data.provider.com');
my %results;
foreach my $url (@urls) {
    my $mech = WWW::Mechanize->new();
    $mech->get($url);
    $mech->form_number(1);
    $mech->set_fields(user => 'myuser', pass => 'mypass');
    my $resp = $mech->submit();
    $results{$url} = parse($resp->content());
}
consume(%results);
Is there some (possibly simple ;-) way to fetch the data into a common %results variable simultaneously, i.e. in parallel, from all of the providers?

threads are to be avoided in Perl. use threads is mostly for
emulating UNIX-style fork on Windows; beyond that, it's pointless.
(If you care, the implementation makes this fact very clear. In perl,
the interpreter is a PerlInterpreter object. The way threads
works is by making a bunch of threads, and then creating a brand-new
PerlInterpreter object in each thread. Threads share absolutely
nothing, even less than child processes do; fork gets you
copy-on-write, but with threads, all the copying is done in Perl
space! Slow!)
If you'd like to do many things concurrently in the same process, the
way to do that in Perl is with an event loop, like
EV,
Event, or
POE, or by using Coro. (You can
also write your code in terms of the AnyEvent API, which will let
you use any event loop. This is what I prefer.) The difference
between the two is how you write your code.
AnyEvent (and EV, Event,
POE, and so on) forces you to write your code in a callback-oriented
style. Instead of control flowing from top to bottom, control is in a
continuation-passing style. Functions don't return values, they call
other functions with their results. This allows you to run many IO
operations in parallel -- when a given IO operation has yielded
results, your function to handle those results will be called. When
another IO operation is complete, that function will be called. And
so on.
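To make the callback style concrete, here is a toy, dependency-free sketch. It is not AnyEvent's actual API: fetch_cb and its fake response body are invented for illustration, and a real implementation would start non-blocking IO and invoke the callback later, from the event loop.

```perl
use strict;
use warnings;

# Continuation-passing style: fetch_cb never *returns* the page body.
# Instead, it calls the supplied callback with the result when it is ready.
sub fetch_cb {
    my ( $url, $cb ) = @_;
    # A real event-driven version would register IO watchers here and call
    # $cb later; we fake an instant result to keep the sketch runnable.
    my $fake_body = "contents of $url";
    $cb->( $url, $fake_body );
}

my %results;
for my $url ( 'http://a.example/', 'http://b.example/' ) {
    fetch_cb( $url, sub {
        my ( $url, $body ) = @_;
        # Control "returns" here when the fetch is done, not to the caller.
        $results{$url} = length $body;
    } );
}

printf "%s => %d bytes\n", $_, $results{$_} for sort keys %results;
```

With a real event loop, the two fetches would be in flight at the same time, and the callbacks would fire in whatever order the responses arrive.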
The disadvantage of this approach is that you have to rewrite your
code. So there's a module called Coro that gives Perl real
(user-space) threads that will let you write your code top-to-bottom,
but still be non-blocking. (The disadvantage of this is that it
heavily modifies Perl's internals. But it seems to work pretty well.)
So, since we don't want to rewrite
WWW::Mechanize
tonight, we're going to use Coro. Coro comes with a module called
Coro::LWP that will make
all calls to LWP be
non-blocking. It will block the current thread ("coroutine", in Coro
lingo), but it won't block any other threads. That means you can make
a ton of requests all at once, and process the results as they become
available. And Coro will scale better than your network connection;
each coroutine uses just a few k of memory, so it's easy to have tens
of thousands of them around.
With that in mind, let's see some code. Here's a program that starts
three HTTP requests in parallel, and prints the length of each
response. It's similar to what you're doing, minus the actual
processing; but you can just put your code in where we calculate the
length and it will work the same.
We'll start off with the usual Perl script boilerplate:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;    # for "say", used below
Then we'll load the Coro-specific modules:
use Coro;
use Coro::LWP;
use EV;
Coro uses an event loop behind the scenes; it will pick one for you if
you want, but we'll just specify EV explicitly. It's the best event
loop.
Then we'll load the modules we need for our work, which is just:
use WWW::Mechanize;
Now we're ready to write our program. First, we need a list of URLs:
my @urls = (
    'http://www.google.com/',
    'http://www.jrock.us/',
    'http://stackoverflow.com/',
);
Then we need a function to spawn a thread and do our work. To make a
new thread on Coro, you call async like async { body; of the
thread; goes here }. This will create a thread, start it, and
continue with the rest of the program.
sub start_thread($) {
    my $url = shift;
    return async {
        say "Starting $url";
        my $mech = WWW::Mechanize->new;
        $mech->get($url);
        printf "Done with %s, %d bytes\n", $url, length $mech->content;
    };
}
So here's the meat of our program. We just put our normal LWP program
inside async, and it will be magically non-blocking. get blocks,
but the other coroutines will run while waiting for it to get the data
from the network.
Now we just need to start the threads:
start_thread $_ for @urls;
And finally, we want to start handling events:
EV::loop;
And that's it. When you run this, you'll see some output like:
Starting http://www.google.com/
Starting http://www.jrock.us/
Starting http://stackoverflow.com/
Done with http://www.jrock.us/, 5456 bytes
Done with http://www.google.com/, 9802 bytes
Done with http://stackoverflow.com/, 194555 bytes
As you can see, the requests are made in parallel, and you didn't have
to resort to threads!
Update
You mentioned in your original post that you want to limit the number of HTTP requests that run in parallel. One way to do that is with a semaphore; Coro provides one as Coro::Semaphore.
A semaphore is like a counter. When you want to use the resource that a semaphore protects, you "down" the semaphore. This decrements the counter and continues running your program. But if the counter is at zero when you try to down the semaphore, your thread/coroutine will go to sleep until it is non-zero. When the count goes up again, your thread will wake up, down the semaphore, and continue. Finally, when you're done using the resource that the semaphore protects, you "up" the semaphore and give other threads the chance to run.
This lets you control access to a shared resource, like "making HTTP requests".
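Before we use the real thing, here is a toy, single-threaded model of the counter semantics just described. This is invented for illustration, not Coro::Semaphore: a real semaphore puts the caller to sleep and wakes it later instead of dying when the count hits zero, but the count/down/up/guard bookkeeping is the same idea.

```perl
use strict;
use warnings;

package Toy::Semaphore;

# A toy model of semaphore counting. A real (Coro::)semaphore blocks the
# caller in down() when the count is zero and wakes a sleeper in up();
# this sketch only tracks the count.
sub new   { my ( $class, $count ) = @_; return bless { count => $count }, $class }
sub down  { my $self = shift; die "would block\n" unless $self->{count} > 0; $self->{count}-- }
sub up    { my $self = shift; $self->{count}++ }
sub count { $_[0]{count} }

# A guard object ups the semaphore when it goes out of scope, even if the
# protected code dies - the same idea as Coro::Semaphore's guard().
sub guard {
    my $self = shift;
    $self->down;
    return bless { sem => $self }, 'Toy::Semaphore::Guard';
}

package Toy::Semaphore::Guard;
sub DESTROY { $_[0]{sem}->up }

package main;

my $sem = Toy::Semaphore->new(2);
{
    my $guard = $sem->guard;    # count drops to 1
    # ... protected work happens here ...
}                               # guard destroyed: count back to 2
print "count is ", $sem->count, "\n";    # count is 2
```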
All you need to do is create a semaphore that your HTTP request threads will share:
my $sem = Coro::Semaphore->new(5);
The 5 means "let us call 'down' 5 times before we block", or, in other words, "let there be 5 concurrent HTTP requests".
Before we add any code, let's talk about what can go wrong. Something bad that could happen is a thread "down"-ing the semaphore, but never "up"-ing it when it's done. Then nothing can ever use that resource, and your program will probably end up doing nothing. There are lots of ways this could happen. If you wrote some code like $sem->down; do something; $sem->up, you might feel safe, but what if "do something" throws an exception? Then the semaphore will be left down, and that's bad.
Fortunately, Perl makes it easy to have scope guard objects, which automatically run code when the variable holding the object goes out of scope. We can make that code be $sem->up, and then we'll never have to worry about holding a resource when we don't intend to.
Coro::Semaphore integrates the concept of guards, meaning you can say my $guard = $sem->guard, and that will automatically down the semaphore and up it when control flows away from the scope where you called guard.
With that in mind, all we have to do to limit the number of parallel requests is guard the semaphore at the top of our HTTP-using coroutines:
async {
    say "Waiting for semaphore";
    my $guard = $sem->guard;
    say "Starting";
    ...;
    return result;
}
Addressing the comments:
If you don't want your program to live forever, there are a few options. One is to run the event loop in another thread, and then join on each worker thread. This lets you pass results from the thread to the main program, too:
async { EV::loop };

# start all threads
my @running = map { start_thread $_ } @urls;

# wait for each one to return
my @results = map { $_->join } @running;

for my $result (@results) {
    say $result->[0], ': ', $result->[1];
}
Your threads can return results like:
sub start_thread($) {
    return async {
        ...;
        return [$url, length $mech->content];
    };
}
This is one way to collect all your results in a data structure. If you don't want to return things, remember that all the coroutines share state. So you can put:
my %results;
at the top of your program, and have each coroutine update the results:
async {
    ...;
    $results{$url} = 'whatever';
};
When all the coroutines are done running, your hash will be filled with the results. You'll have to join each coroutine to know when the answer is ready, though.
Finally, if you are doing this as part of a web service, you should use a coroutine-aware web server like Corona. This will run each HTTP request in a coroutine, allowing you to handle multiple HTTP requests in parallel, in addition to being able to send HTTP requests in parallel. This will make very good use of memory, CPU, and network resources, and will be pretty easy to maintain!
(You can basically cut-n-paste our program from above into the coroutine that handles the HTTP request; it's fine to create new coroutines and join inside a coroutine.)

Looks like ParallelUserAgent is what you're looking for.

Well, you could create threads to do it--specifically see perldoc perlthrtut and Thread::Queue. So, it might look something like this.
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use WWW::Mechanize;

my @urls = (    # whatever
);
my %results :shared;

my $queue = Thread::Queue->new();
foreach (@urls) {
    $queue->enqueue($_);
}

my @threads = ();
my $num_threads = 16;    # or whatever... a pre-specified number of threads
foreach (1 .. $num_threads) {
    push @threads, threads->create(\&mechanize);
}
foreach (@threads) {
    $queue->enqueue(undef);
}
foreach (@threads) {
    $_->join();
}
consume(\%results);

sub mechanize {
    while (my $url = $queue->dequeue) {
        my $mech = WWW::Mechanize->new();
        $mech->get($url);
        $mech->form_number(1);
        $mech->set_fields(user => 'myuser', pass => 'mypass');
        my $resp = $mech->submit();
        $results{$url} = parse($resp->content());
    }
}
Note that since you're storing your results in a hash (instead of writing stuff to a file), you shouldn't need any kind of locking unless there's a danger of overwriting values. In that case, lock %results by replacing
$results{$url} = parse($resp->content());
with
{
    lock(%results);
    $results{$url} = parse($resp->content());
}

Try https://metacpan.org/module/Parallel::Iterator -- saw a very good presentation about it last week, and one of the examples was parallel retrieval of URLs -- it's also covered in the pod example. It's simpler than using threads manually (although it uses fork underneath).
As far as I can tell, you'd still be able to use WWW::Mechanize, but avoid messing about with memory sharing between threads. It's a higher-level model for this task, and might be a little simpler, leaving the main logic of @Jack Maney's mechanize routine intact.
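A minimal sketch of the Parallel::Iterator shape, based on its documented iterate_as_array interface. The fetch itself is stubbed out so the example stands alone; in real use the worker body would be the mechanize code from the answer above, returning parse($resp->content()).

```perl
use strict;
use warnings;
use Parallel::Iterator qw( iterate_as_array );

my @urls = ( 'http://www.first.example/', 'http://www.second.example/' );

# The worker receives (index, item); each call runs in a forked child, and
# the return values are collected back into an array in input order.
my @data = iterate_as_array(
    sub {
        my ( $index, $url ) = @_;
        return "fetched: $url";    # placeholder for the real fetch + parse
    },
    \@urls,
);

print "$_\n" for @data;
```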

Related

Perl daemonize with child daemons

I have to employ daemons in my code. I need a control daemon that constantly checks the database for tasks and supervises child daemons. The control daemon must assign tasks to the child daemons, control tasks, create new children if one of them dies, etc. The child daemons check the database for tasks addressed to them (by PID). How should I implement daemons for this purpose?
Daemon is just a code word for "background process that runs a long time". So the answer is 'it depends'. Perl has two major ways of doing multiprocessing:
Threading
You run a subroutine as a thread, in parallel with the main program code. (Which may then just monitor thread states).
The overhead of creating a thread is higher, but it's better suited for 'shared memory' style multiprocessing, e.g. when you're passing significant quantities of data back and forth. There are several libraries that make passing information between threads positively straightforward. Personally I quite like Thread::Queue, Thread::Semaphore, and Storable.
In particular - Storable has freeze and thaw which lets you move complex data structures (e.g. objects/hashes) around in queues, which is very useful.
Basic threading example:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $nthreads = 5;

my $process_q = Thread::Queue->new();
my $failed_q  = Thread::Queue->new();

# This is a subroutine, but one that runs 'as a thread'.
# When it starts, it inherits the program state 'as is', e.g.
# the variable declarations above all apply - but changes to
# values within the program are 'thread local' unless the
# variable is defined as 'shared'.
# Behind the scenes, Thread::Queue objects are 'shared' arrays.
sub worker {
    # NB - this will sit in a loop indefinitely, until you close the queue
    # using $process_q->end.
    # We do this once we've queued all the things we want to process,
    # and then the sub completes and exits neatly.
    # However, if you _don't_ end it, this will sit waiting forever.
    while ( my $server = $process_q->dequeue() ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}

# Insert tasks into the thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);

# We 'end' process_q - once we do, no more items may be inserted,
# and 'dequeue' returns undef when the queue is emptied.
# This means our worker threads (in their 'while' loop) will then exit.
$process_q->end();

# Start some threads.
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}

# Wait for the threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}

# Collate results (a 'synchronise' operation).
while ( my $server = $failed_q->dequeue_nb() ) {
    print "$server failed to ping\n";
}
Storable
When it comes to Storable, this is worth a separate example I think, because it's handy to move data around.
use strict;
use warnings;
use threads;
use Storable qw( freeze thaw );
use Thread::Queue;
use MyObject;    # home-made object

my $work_q = Thread::Queue->new();

sub worker_thread {
    while ( my $packed_item = $work_q->dequeue ) {
        my $object = thaw($packed_item);
        $object->run_some_methods();
        $object->set_status("processed");
        # maybe return $object via 'freeze' and a queue?
    }
}

my $thr = threads->create( \&worker_thread );

my $newobject = MyObject->new("some_parameters");
$work_q->enqueue( freeze($newobject) );
$work_q->end();
$thr->join();
Because you're passing the object around within the queue, you're effectively cloning it between threads. So bear in mind that you may need to freeze it and 'return' it somehow once you've done something to its internal state. But it does mean you can do this asynchronously without needing to arbitrate locking or shared memory. You may also find it useful to be able to 'store' and 'retrieve' an object - this works as you might expect. (Although I daresay you might need to be careful about availability of module versions vs. defined attributes if you're retrieving a stored object.)
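A minimal store/retrieve round trip looks like the following sketch. File::Temp just supplies a scratch file, and the hash contents are invented for illustration; store()/retrieve() are the file-based cousins of freeze()/thaw().

```perl
use strict;
use warnings;
use Storable qw( store retrieve );
use File::Temp qw( tempfile );

my %job = ( host => 'db.example', port => 5432, tags => [ 'a', 'b' ] );

# Grab a throwaway file name for the sketch.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
close $fh;

store( \%job, $file );         # serialise the hash (and nested array) to disk
my $copy = retrieve($file);    # read it back as a deep clone

print "port: $copy->{port}\n";            # port: 5432
print "second tag: $copy->{tags}[1]\n";   # second tag: b
```

Note that $copy is a fresh deep copy: mutating it does not touch %job, which is exactly the cloning behaviour described above for queued objects.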
Forking
Your script clones itself, leaving a 'parent' and a 'child' - the child then generally diverges and does something different. This uses the Unix built-in fork(), which as a result is well optimised and generally very efficient - but because it's low level, it's difficult to do lots of data transfer. You'll end up doing some slightly complicated things for interprocess communication, or IPC (see perlipc for more detail). It's efficient not least because most fork() implementations do a lazy data copy - memory space for your process is only allocated as it's needed, e.g. when it's changed.
It's therefore really good if you want to delegate a lot of tasks that don't require much supervision from the parent. For example - you might fork a web server, because the child is reading files and delivering them to a particular client, and the parent doesn't much care. Or perhaps you would do this if you want to spend a lot of CPU time computing a result, and only pass that result back.
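The "spend CPU time in the child, pass just the result back" pattern from the paragraph above can be done with nothing but built-ins: fork(), a pipe, and waitpid. The 6 * 7 is a stand-in for a real long-running computation.

```perl
use strict;
use warnings;

pipe( my $reader, my $writer ) or die "pipe: $!";

my $pid = fork();
die "fork: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child: do the expensive work, write the answer down the pipe, exit.
    close $reader;
    my $result = 6 * 7;    # stand-in for a CPU-heavy computation
    print {$writer} "$result\n";
    close $writer;
    exit 0;
}

# Parent: read the one-line result and reap the child.
close $writer;
my $answer = <$reader>;
chomp $answer;
close $reader;
waitpid( $pid, 0 );

print "child computed: $answer\n";    # child computed: 42
```

This is the simplest possible IPC; perlipc covers richer mechanisms (sockets, shared memory, signals) when one line down a pipe isn't enough.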
It's also not supported on Windows.
Useful libraries include Parallel::ForkManager
A basic example of 'forking' code looks a bit like this:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $concurrent_fork_limit = 4;
my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);

foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
    my $pid = $fork_manager->start;
    if ($pid) {
        print "$$: Fork made a child with pid $pid\n";
        next;    # only the child should fall through to finish()
    }
    print "$$: child process started, with a key of $thing\n";
    $fork_manager->finish;
}
$fork_manager->wait_all_children();
Which is right for you?
It's hard to say without a bit more detail about what you're trying to accomplish, which is why Stack Overflow usually likes to see some working code, approaches you've tried, etc.
I would generally say:
if you need to pass data around, use threads. Thread::Queue especially when combined with Storable is very good for it.
if you don't, forks (on Unix) are generally faster/more efficient. (But fast alone isn't usually enough - write understandable stuff first, and aim for speed second. It doesn't matter much usually).
Avoid where possible spawning too many threads - they're fairly intensive on memory and creation overhead. You're far better off using a fixed number in a 'worker thread' style of programming, than repeatedly creating new, short lived threads. (On the other hand - forks are actually very good at this, because they don't copy your whole process).
I would suggest that in the scenario you give, you're looking at threads and queues. Your parent process can track child threads via threads->list(), and join or create to keep the right number. And it can feed data via a central queue to your worker threads. Or have multiple queues - one per 'child' - and use that as a task assignment system.

Perl AnyEvent concurrency internals

I have a server which creates an AnyEvent timer watcher object on every client connection (and adds it to an AnyEvent loop).
use AnyEvent;
...
my $db_handle = myschema->connect();
my $w; $w = AnyEvent->timer(
    interval => $interval,
    after    => $interval,
    cb       => sub {
        # Remove from loop on some condition
        unless ( my $ret = _check_session($sid) ) {
            undef $w;
        }
        # Execute main logic
        $db_handle->lookup($sid);
    }
);
...
...
So the callback will be executed every $interval seconds. In case there are a lot of clients, some callbacks will have to be executed at the same time. How does AnyEvent handle this? Does it execute those callbacks one after another, or is there some concurrency mechanism so that those callbacks are executed simultaneously? (For instance, speeding up several callbacks that are due at the same time by creating several threads.)
In case of my server the callback performs database lookup. The database handle for the database connection has been initialized outside of event loop. My concern is that if there is any concurrency mechanism in AnyEvent, then the callbacks cannot be executed simultaneously because one callback has to wait until another one has finished database lookup and the database handle is free.
P.S. Thanks to 'ikegami' for the answer.
AnyEvent does not create any concurrency. It processes events when you call into it (e.g. ->recv).
As ikegami said - there's no concurrency in AnyEvent. Any event loop gives you the ability to handle events 'asynchronously' - so you can have multiple requests/operations/timers in progress simultaneously, without ever having concurrent code execution. Events are always handled sequentially.
For your specific case of multiple timers expiring at the same time: each timer is processed in turn (they are sorted on their expiry time), and all 'expired' timers are processed on each event loop iteration - e.g. lines 208-213 of Loop.pm for the pure Perl implementation, which you can view at CPAN.
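To see the "sequential, never concurrent" point in miniature, here is a toy model of one timer-dispatch iteration. It is invented for illustration and is not AnyEvent's code, but the shape - sort by expiry, then run every expired callback one after another - matches the description above.

```perl
use strict;
use warnings;

my $now = 100;
my @ran;

# Three timers; two of them are due at the same instant.
my @timers = (
    { at => 101, cb => sub { push @ran, 'later' } },
    { at => 100, cb => sub { push @ran, 'a' } },
    { at => 100, cb => sub { push @ran, 'b' } },
);

# One loop iteration: sort the timers by expiry time, then run every
# expired callback in turn - there is no concurrent callback execution.
@timers = sort { $a->{at} <=> $b->{at} } @timers;
while ( @timers and $timers[0]{at} <= $now ) {
    my $timer = shift @timers;
    $timer->{cb}->();
}

print "ran: ", join( ' ', sort @ran ), "\n";    # ran: a b
```

Both due timers fire during the same iteration, but strictly one after the other; the third timer stays queued until a later iteration.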

perl how to mock thread behavior

I want to write unit tests for a subroutine in Perl. The subroutine uses multiple threads to do its tasks: it first creates some threads and then waits for them to join.
The problem is that our unit tests run on a server which is not able to run multi-threaded tests, so I need to somehow mock out the thread behavior. Basically I want to override the threads create and join functions so that it's not threaded anymore. Any pointers on how I can do that and test the code?
Edit : The server fails to run the threaded code for the following reason:
Devel::Cover does not yet work with threads
Update: this answer doesn't solve the OP's problem as described in the edited question, but it might be useful to someone.
Perl threads are an interpreter emulation, not an operating system feature. So, they should work on any platform. If your testing server doesn't support threads, it's probably for one of these reasons:
Your version of Perl is very old.
Perl was compiled without thread support.
Your testing framework wasn't created with threaded code in mind.
The first two could be easily rectified by updating your environment. However, I suspect yours is the third issue.
I don't think you should solve this by mocking the thread behavior. This changes the original code too much to be a valid test. And it would be a significant amount of work anyway, so why not direct that effort toward getting a threaded test working?
The exact issues depend on your code, but probably the issue is that your subroutine starts a thread and then returns, with the thread still running. Then your test framework runs the sub over and over, accumulating a whole bunch of concurrent threads.
In that case, all you need is a wrapper sub that calls the sub you are testing, and then blocks until the threads are complete. This should be fairly simple. Take a look at threads->list() to see how you can detect running threads. Just have a loop that waits until the threads in question are no longer running before exiting the wrapper sub.
Here is a simple complete example demonstrating a wrapper sub:
#!/usr/bin/perl
use strict;
use warnings;
use threads;

sub sub_to_test {
    threads->create(sub { sleep 5; print("Thread done\n"); threads->detach() });
    return "Sub done\n";
}

sub wrapper {
    # Get a count of the running threads.
    my $original_running_threads = threads->list(threads::running);
    my @results = sub_to_test(@_);
    # Block until the number of running threads is back to what it was when we started.
    sleep 1 while ( threads->list(threads::running) > $original_running_threads );
    return @results;
}

print wrapper;

Is there a way to have managed processes in Perl (i.e. a threads replacement that actually works)?

I have a multithreaded application in Perl for which I have to rely on several non-thread-safe modules, so I have been using fork()ed processes with kill() signals as a message-passing interface.
The problem is that the signal handlers are a bit erratic (to say the least) and often end up with processes that get killed in inappropriate states.
Is there a better way to do this?
Depending on exactly what your program needs to do, you might consider using POE, which is a Perl framework for multi-threaded applications with user-space threads. It's complex, but elegant and powerful and can help you avoid non-thread-safe modules by confining activity to a single Perl interpreter thread.
Helpful resources to get started:
Programming POE presentation by Matt Sergeant (start here to understand what it is and does)
POE project page (lots of cookbook examples)
Plus there are hundreds of pre-built POE components you can use to assemble into an application.
You can always have a pipe between parent and child to pass messages back and forth.
pipe my $reader, my $writer;
my $pid = fork();
if ( $pid == 0 ) {
    close $reader;
    ...
}
else {
    close $writer;
    my $msg_from_child = <$reader>;
    ...
}
Not a very comfortable way of programming, but it shouldn't be 'erratic'.
Have a look at forks.pm, a "drop-in replacement for Perl threads using fork()" which makes for much more sensible memory usage (but don't use it on Win32). It will allow you to declare "shared" variables and then it automatically passes changes made to such variables between the processes (similar to how threads.pm does things).
From Perl 5.8 onwards you should be looking at the core threads module. Have a look at http://metacpan.org/pod/threads
If you want to use modules which aren't thread safe you can usually load them with a require and import inside the thread entry point.
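A sketch of that tip: defer loading with require + import inside the thread body, so the module is loaded fresh inside the thread rather than cloned from the parent interpreter at thread creation. POSIX here is just a stand-in for the hypothetical non-thread-safe module, and this needs a threads-enabled perl.

```perl
use strict;
use warnings;
use threads;

my $thr = threads->create(sub {
    # Load the module only once we are inside the thread.
    require POSIX;
    POSIX->import('floor');
    return POSIX::floor(2.7);
});

my $result = $thr->join();
print "floor(2.7) = $result\n";    # floor(2.7) = 2
```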

Which signals should a wrapper script pass along to a subprogram?

If I have a script that is a wrapper for another program (e.g., a daemonizer wrapper or a wrapper for mathematica), it is sometimes useful to trap signals in the wrapper program and pass them along to the subprogram.
For example, here's some Perl code to deal with the INT (interrupt) signal so that if you do ctrl-C after launching the wrapper the subprogram also gets interrupted:
my $subprogram = "foo args";
my $pid = open(F, "$subprogram |") or die "Can't open pipe from $subprogram: $!";
$SIG{INT} = sub {
    kill('INT', $pid);
    close(F);
    die "Passed along INT signal and now aborting.\n";
};
print while (<F>);
close(F);
Of all the possible signals (http://en.wikipedia.org/wiki/Signal_(computing)) that a program might handle, which ones should a wrapper script pass along?
Is there anything else that a good wrapper should do?
EDIT: Originally this question was asking how to pass along all possible signals. Thanks to the initial answers I learned that was not the right question.
EDIT: I figured out what threw me for a loop here. Mathematica apparently detaches itself from its parent process. So I have to pass along various termination signals explicitly:
$SIG{INT} = sub { kill('INT', $pid); }; # pass along SIGINT (eg, ctrl-C)
$SIG{TERM} = sub { kill('TERM', $pid); }; # pass along SIGTERM (kill's default)
$SIG{QUIT} = sub { kill('QUIT', $pid); }; # pass along SIGQUIT
$SIG{ABRT} = sub { kill('ABRT', $pid); }; # pass along SIGABRT
$SIG{HUP} = sub { kill('HUP', $pid); }; # pass along SIGHUP
Normally this would not be necessary as child processes automatically get these signals passed along (thanks for the answers that set me straight on this!). So now I'm wondering why (and how) mathematica detaches itself...
There are many signals that would be dangerous to just "pass through" in this way. Run "man 7 signal" and look at things like SIGPIPE, SIGILL, SIGCHLD, etc. It's very likely that you just don't want to touch those guys.
So, the real question becomes: What is the behavior you're looking for? I bet all you really want is SIGINT & SIGCONT to be passed through to the child. Does the subprogram do anything fancy with other signals, like HUP, USR1, or USR2?
I presume you're really just interested in SIGINT & SIGCONT, and the rest (i.e. SIGSEGV, SIGKILL, etc.) will take care of themselves, as the parent process termination will clean up the child as well.
Oh, and by the way, your example of:
print while (<F>);
is fine. If the parent Perl process is suspended, it won't continue reading from F; once that pipe fills up, your subprogram will block writing to stdout, which is probably pretty much the behavior you want as well.
For some more interesting thoughts, take a look at "man bash" and the "trap" builtin to see what the shell developers have done to help this problem.
None.
The signals that you might generate at the keyboard are sent to the child process anyway, unless the child takes steps to avoid that (in which case, who are you to interfere).
The other signals that might kill the parent process would normally cause the child to disappear when it next writes to the pipe that was closed when the parent died.
So, unless you have reason to think that the child process mismanages signals, I would not worry about relaying any signals to the child. And if the child mismanages signals, maybe you should fix the child rather than hack the parent.
I really don't see why you would want to do that. For that matter, I don't see why you would want to send signals to the wrapper script.
ETA: Stop and restart signals are sent to the whole process group. If that's what you're after: just make sure your child process is part of that process group (it is by default AFAIK).
ETA2:
If mathematica really detaches itself (which can be done using setsid() and setpgid()), that means it explicitly does not want to receive such signals, so you shouldn't send them: it probably won't handle them anyway.
You can get a list of all signals with keys %SIG. So, to pass them all through, you set handlers to send the signal the child process ID:
for my $signal (keys %SIG) {
    $SIG{$signal} = sub { kill($signal, $child_pid); };
}
You'll probably want to localize %SIG.
Not that I'm saying it's a good idea, but that's how to do it.