This is my first Catalyst app and I'm not sure how to solve the following problem.
The user enters some data in a form and selects a file (up to 100MB) for uploading. After submitting the form, the actual computation takes up to 5 minutes and the results are stored in a DB.
What I want to do is to run this process (and maybe also the file upload) in the background to avoid a server timeout. There should be some kind of feedback to the user (like a message "Job has been started" or a progress bar). The form should be blocked while the job is still running. A result page should be displayed once the job finished.
In hours of reading I stumbled upon concepts like asynchronous requests, job queues, daemons, Gearman, or Catalyst::Plugin::RunAfterRequest.
How would you do it?
Thanks for helping a web dev novice!
PS: In my current local app the work is done in parallel with Parallel::ForkManager. For the real app, would it be advisable to use a cloud computing service like Amazon EC2? Or just find a hoster who offers multi-core servers?
Somehow I couldn't get the idea of File::Queue. For non-blocking parallel execution, I ended up using a combination of TheSchwartz and Parallel::Prefork like it is implemented in the Foorum Catalyst App.
Basically, there are 5 important elements. Maybe this summary will be helpful to others.
1) TheSchwartz DB
2) A client (DB handle) for the TheSchwartz DB
package MyApp::TheSchwartz::Client;
use TheSchwartz;
sub theschwartz {
my $theschwartz = TheSchwartz->new(
databases => [ {
dsn => 'dbi:mysql:theschwartz',
user => 'user',
pass => 'pass',
} ],
verbose => 1,
);
return $theschwartz;
}
3) A job worker (where the actual work is done)
package MyApp::TheSchwartz::Worker::Test;
use base qw( TheSchwartz::Moosified::Worker );
use MyApp::Model::DB; # Catalyst DB connect_info
use MyApp::Schema; # Catalyst DB schema
sub work {
my $class = shift;
my $job = shift;
my ($args) = $job->arg;
my ($arg1, $arg2) = #$args;
# re-use Catalyst DB schema
my $connect_info = MyApp::Model::DB->config->{connect_info};
my $schema = MyApp::Schema->connect($connect_info);
# do the heavy lifting
$job->completed();
}
4) A worker process TheSchwartzWorker.pl that monitors the table job non-stop
use MyApp::TheSchwartz::Client qw/theschwartz/; # db connection
use MyApp::TheSchwartz::Worker::Test;
use Parallel::Prefork;
my $client = theschwartz();
my $pm = Parallel::Prefork->new({
max_workers => 16,
trap_signals => {
TERM => 'TERM',
HUP => 'TERM',
USR1 => undef,
}
});
while ($pm->signal_received ne 'TERM') {
$pm->start and next;
$client->can_do('MyApp::TheSchwartz::Worker::Test');
my $delay = 10; # When no job is available, the working process will sleep for $delay seconds
$client->work( $delay );
$pm->finish;
}
$pm->wait_all_children();
5) In the Catalyst controller: insert a new job into the table job and pass some arguments
use MyApp::TheSchwartz::Client qw/theschwartz/;
sub start : Chained('base') PathPart('start') Args(0) {
my ($self, $c ) = #_;
$client = theschwartz();
$client->insert(‘MyApp::TheSchwartz::Worker::Test’, [ $arg1, $arg2 ]);
$c->response->redirect(
$c->uri_for(
$self->action_for('archive'),
{mid => $c->set_status_msg("Run '$name' started")}
)
);
}
The new run is greyed out on the "archive" page until all results are available in the database.
Put the job in a queue and do it in a different process, outside of the Web application. While you Catalyst process is busy, even if using Catalyst::Plugin::RunAfterRequest, it cannot be used to process other web requests.
There are very simple queuing systems, like File::Queue. Basically, you assign a job ID to the document, put it in the queue. Another process checks the queue and picks up new jobs.
You can save the job status in a database, or anything accessible any the web applications. On the front end, you can poll the job status every X seconds or minutes to give feedback to the user.
You have to figure out how much memory and CPU you need. Multi-core CPU or multiple CPUs may not be required, even if you have several processes running. Choosing between a dedicated server or cloud like EC2 is more about the flexibility (resizing, snapshot, etc.) vs. price.
Related
I've added advisory locking to Sqitch, using Postgres advisory locks and MySQL GET_LOCK(). This feature prevents more than one instance of Sqitch from deploying to a database at one time. This works great, but I wanted to add a lock timeout, too, so that one never finds a CI/CD process hung for hours or days because something went amiss.
MySQL's GET_LOCK() supports a timeout argument, but Postgres advisory locks do not. Since I thought it likely that other database engines would also not have timeouts, I thought it best to implement the timeout in Perl. Following the DBI manual, I used Sys::SigAction to set and handle the timeout:
# Try waiting for the lock.
require App::Sqitch::SigAction;
return $self->_locked(1) unless App::Sqitch::SigAction::timeout_call($wait, sub {
$self->wait_lock
});
I also added tests to confirm it works with both MySQL and Postgres. So far so good.
Alas, Sys::SigAction does not work on Windows. I took a stab and testing it on Windows, but since Windows Perl is not compiled with d_sigaction, which Sys::SigAction also requires, I didn't get far. I tried implementing the Perl-standard alarm/$SIG{ALRM} pattern, but it failed to send the signal while waiting on the Postgres lock.
Which has led me here and to my question: What is the best cross-platform pattern for timing out some execution in Perl? Ideally it has a straight-forward interface, works on *nix and Windows, and effectively handles breaking out of a database query.
I ended up ditching Sys::SigAction following discussion here and elsewhere, and instead switched to:
Letting the database handle the timeout, as MySQL's get_lock() does
Adding a simple interface for polling with exponential backoff and timeout that engines can use to poll for a lock instead of waiting (similar to Retry::Backoff)
Switching the Postgres implementation to use the async query support in DBD::Pg to send off the lock request, and uses the backoff/timeout interface to check to see if it has returned and cancel the query if it times out
I was especially pleased to realize I could do #3, as I originally used the timeout/backoff interface to poll with pg_try_advisory_lock( key ), which just feels heavy. Better to asynchronously call pg_advisory_lock ( key ) and poll for its response. It looks like this:
sub wait_lock {
my $self = shift;
# Asyncronouslly request a lock with an indefinite wait.
my $dbh = $self->dbh;
$dbh->do(
'SELECT pg_advisory_lock(75474063)',
{ pg_async => DBD::Pg::PG_ASYNC() },
);
# Use _timeout to periodically check for the result.
return 1 if $self->_timeout(sub { $dbh->pg_ready && $dbh->pg_result });
# Timed out, cancel the query and return false.
$dbh->pg_cancel;
return 0;
}
Of course the MySQL implementation is simpler, since get_lock() does all the work:
sub wait_lock {
my $self = shift;
$self->dbh->selectcol_arrayref(
q{SELECT get_lock('sqitch working', ?)},
undef, $self->lock_timeout
)->[0]
}
In my server program I need to have ability to iterate every 5 minutes through all opened connections and see which is really "active" or not.
I know that the best approach is to use "heart beat", but then, the server need to have somehow ability to check weather the connection is "off" in order to delete the associated "user parameters" that is attached to the connection.
My first approach was to use "Async" module, but this works in a separate process - so I cannot delete any element from the main process unless I use a technique to invoke a subroutine from the main process called from the child process (I don't know how, any help will be warmly welcomed).
Another possibility using Async is create a static client that is all the time on (also in the server) and sending "commands" to the server, but to me it looks "too exaggerating" to create such "wasting memory" client in the server, and also "eat" CPU time (I think much more than simple event like equivalent to "setTimeout" in JS).
Yet another approach is to use EV: But when I call EV::run it will NOT RUN anything ELSE than this "periodic event" - means that it will not reach the next line where the ->Start for the server is.
Placing it after the ->Start will make this event useless too: As the server works the program will not go behind the ->Start line.
Using EV::run EV::RUN_NOWAIT; will make the server work, but the EV will somehow not work, for a strange reason (Anyone know how can I still make it work?)
I prefer to not use Net::Websocket::EV, because as per their script, it doesn't do the handshake automatically, and many things (as well as SSL connection that I have) I will need to do manually and for me it is change a lot in my program.
PROBLEM SUMMARY:
How to make the code in EV run every 5 minutes, together with the server?
my %CON; # Connections (And user data) hash
my $__ConChk=EV::periodic 0, 300, 0, sub {
my #l=keys %CON;
for(my $i=0 ; $i<#l ; $i++) {
if($CON{$l[$i]}{"T"}+3600<time()) { # I give one hour time to be completely offline (for different reasons)
$CON{$l[$i]}{"C"}->disconnect(); delete $CON{$l[$i]};
}
}
};
EV::run EV::RUN_NOWAIT; # This is not working - Seems to be ignored!
Net::WebSocket::Server->new(
listen => $ssl, # Earlier preset
silence_max=>60, # Time to just shut the connection off, but don't delete user data
on_connect => sub {
my($serv,$conn)=#_;
my $cid; # Connection ID (for the hash)
$conn->on(
handshake => sub {
my($conn,$handshake)=#_;
# Create user data in $CON{$cid}
},
binary => sub {
$CON{$cid}{"T"}=time();
# Handling of single incomming message (command)
},
disconnect => sub {
# Do NOT DELETE THE ENTRY!! Maybe the connection drop due to instability!!
}
);
}
)->start; # This will run - but ignoring EV::run - what to do....?
undef $__ConChk;
I have small perl script which gets services details from mongoDB, queries its statuses and gives html output
#...some stuff to get $token
my #cmd = ('/opt/mongo/bin/mongo', '127.0.0.1:27117/service_discovery', '--quiet', '-u', 'xxx', '-p', 'xxx', '--eval', "var environ='$env'; var action='status'", '/home/mongod/www/cgi/getstatus.js');
my $mongo_out;
run \#cmd, '>>', \$mongo_out;
$json->incr_parse ($mongo_out);
while (my $obj = $json->incr_parse) {
my $hostname = "$obj->{'hostname'}";
print "<tr><td colspan=4 align=\"center\"><h4>$hostname</h4></td></tr>";
foreach my $service (#{$obj->{'services'}}) {
my $name = "$service->{'name'}";
my $port = "$service->{'port'}";
my $proto = "$service->{'proto'}";
my $request = HTTP::Request->new(GET => "${proto}://$hostname:${port}/status/service");
$request->header(Authorization => "Bearer $token");
my $ua = LWP::UserAgent->new;
$ua->timeout(2);
my $response = $ua->request($request);
my $code = $response->code();
if ($code == 200) {
my $var = %$response->{'_content'};
my $coder = JSON::XS->new->ascii->pretty->allow_nonref;
my $out = try {my $output = $coder->decode($var)} catch {undef};
if(exists $out->{'name'} && exists $out->{'version'}) {
print "<tr><td align=\"center\">$port</td><td align=\"center\">$name</td><td align=\"center\">$out->{'name'}</td><td align=\"center\">$out->{'version'}</td></tr>";
} else {
print "<tr><td align=\"center\">$port</td><td align=\"center\">$name</td><td colspan=2 align=\"center\">auth failed</td></tr>";
}
} elsif ($code == 500) {
print "<tr><td align=\"center\">$port</td><td align=\"center\">$name</td><td colspan=2 align=\"center\">offline</td></tr>";
} elsif ($code == 404) {
print "<tr><td align=\"center\">$port</td><td align=\"center\">$name</td><td colspan=2 align=\"center\">page not found</td></tr>";
}
}
}
It executes for a while, especially when some services are offline. Is it possible to query services within same host simultaneously?
This is almost a question that's too broad to answer, because ... it depends.
But yes. You have two and a half mechanism for parallelising in perl:
thread
fork
Non blocking IO.
I say two and a half, because non-blocking IO isn't really parallel, as much as solving the same problem a different way.
Implementation of parallelism is a really good way to end up with some horrific and hard to trace bugs, and requires a bit of a shift of mind set, because your code is no longer executing in a well defined sequence - the whole point is that your code might hit different bits at different times, and that can cause utter chaos.
And not least because modules you import - might well not be "thread safe" (which means they may be fine, but occasionally will break in a very unpredictable way, and you'll tear your hair out trying to track down the bug).
So with that in mind
threads
Perhaps slightly counter intuitively, if you've used threads in another language - perl threads are NOT light weight. There is a significant cost to starting them, not least because you effectively end up multiplying your memory footprint by the number of threads you are running.
I would normally suggest as a result - look at a "worker threads" model, using Thread::Queue. You start up a number of threads, and use queues to serialise the input and output from the threads.
forking
fork() is a unix native system call. You use it a lot, and it's quite efficient. It splits your program into two identical copies - including position within the code - at the point at which it's called. The only difference initially is the return code of the fork() system call - the parent will get the process ID of the child, the child will get zero.
It's quite easy to do strange thing accidentally, as both piece of code at this point are at exactly the same point in terms of loop iterations, file handles, etc. but this rapidly diverges and you can again, end up with some very strange things happening if you interact with 'shared' resources.
I would normally suggest looking at Parallel::ForkManager module as an easy way to avoid tripping yourself up with fork().
non blocking IO
You can often use something like IO::Select and the can_read method, which detects which file handles will block if you read from them - you can skip that one, until it blocks. This would also work for your use case, although it's not always applicable.
I've got examples of both the above here: Perl daemonize with child daemons
I am working on a project that uses AnyEvent Log in the main program as well as several dependent modules/packages. I currently have each module writing to it's own context, and all contexts are added to the main programs context as slaves. This project is part of a much larger project, and in addition to writing out a local log file, there are certain messages that I would like to send to a remote program which will then be responsible for presenting the messages to users.
The problem is that in order to send to the remote program, I have to have a piece of information that is only available from the main program, so it's not feesible to just implement a method at the package level to send messages. The piece of information I need is more or less a transaction id, and the log messages are interesting events from a particular transaction.
The main program has 2 contexts ( main , secondary ). The messages I am interested in will either come from the secondary ctx OR one of the package/module contexts. I am interested in only sending info - crit level messages to users, but ONLY WHEN the txID exists in the main program. I ALWAYS want messages to be written to my local log file regardless of whether or not a deployment is running. I would like this to be something that I setup in the main program rather than in a module because the modules are tasked to do certain thing and shouldn't even be aware of the fact that there is an ID associated with the task at hand.
Here is a quick breakdown of the log configuration specific code in the main program.
# Immediately after Proc::Daemon::Init
my $logger = AnyEvent::Log::ctx "desman";
# configure is done before daemonization to allow for --nodaemon
sub configure {
my ( $level, $file ) = #_;
$AnyEvent::Log::FILTER->level($level);
$AnyEvent::Log::LOG->log_to_file($file);
}
sub log_event {
... logic to send messages as tx event ...
}
sub worker_init {
threads->create(sub {
$logger->attach( my $worklog = AnyEvent::Log::ctx "worker" );
... more stuff for worker specifics ...
});
}
Ideally, I would be able to use one or both of log_cb and fmt_cb to handle the formatting and sending of messages to the remote program using the log_event sub. I have tried a few different things, and so far I'm stuck.
# doesn't seem to do anything
$logger->fmt_cb( sub { ... } );
$logger->log_cb( sub { ... } );
# broke everything
$AnyEvent::Log::COLLECT->attach( my $evtlog = new AnyEvent::Log::Ctx
fmt_cb => \&event_formatter,
log_cb => \&log_event
);
$evtlog->levels('crit','warning','notice','info');
I've been searching around for more examples than what's in the docs, but haven't found much yet. Not much of a surprise there since AE::log is pretty much awesome as it is, but anything to help will be greatly appreciated.
I'm working on a mathematical model that uses data generated from XFOIL, a popular aerospace tool used to find the lift and drag coefficients on airfoils.
I have a Perl script that calls XFOIL repeatedly with different input parameters to generate the data I need. I need XFOIL to run 5,600 times, at around 100 seconds per run, soabout 6.5 days to complete.
I have a quad-core machine, but my experience as a programmer is limited, and I really only know how to use basic Perl.
I would like to run four instances of XFOIL at a time, all on their own core. Something like this:
while ( 1 ) {
for ( i = 1..4 ) {
if ( ! exists XFOIL_instance(i) ) {
start_new_XFOIL_instance(i, input_parameter_list);
}
}
}
So the program is checking (or preferably sleeping) until an XFOIL instance is free, when we can start a new instance with the new input parameter list.
Try Parallel::ForkManager. It's a module that provides a simple interface for forking off processes like this.
Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my #input_parameter_list =
map { join '_', ('param', $_) }
( 1 .. 15 );
my $n_processes = 4;
my $pm = Parallel::ForkManager->new( $n_processes );
for my $i ( 1 .. $n_processes ) {
$pm->start and next;
my $count = 0;
foreach my $param_set (#input_parameter_list) {
$count++;
if ( ( $count % $i ) == 0 ) {
if ( !output_exists($param_set) ) {
start_new_XFOIL_instance($param_set);
}
}
}
$pm->finish;
}
$pm->wait_all_children;
sub output_exists {
my $param_set = shift;
return ( -f "$param_set.out" );
}
sub start_new_XFOIL_instance {
my $param_set = shift;
print "starting XFOIL instance with parameters $param_set!\n";
sleep( 5 );
touch( "$param_set.out" );
print "finished run with parameters $param_set!\n";
}
sub touch {
my $fn = shift;
open FILE, ">$fn" or die $!;
close FILE or die $!;
}
You'll need to supply your own implementations for the start_new_XFOIL_instance and the output_exists functions, and you'll also want to define your own sets of parameters to pass to XFOIL.
This looks like you can use gearman for this project.
www.gearman.org
Gearman is a job queue. You can split your work flow into a lot of mini parts.
I would recommend using amazon.com or even their auction able servers to complete this project.
Spending 10cents per computing hour or less, can significantly spead up your project.
I would use gearman locally, make sure you have a "perfect" run for 5-10 of your subjobs before handing it off to an amazon compute farm.
Perl threads will take advantage of multiple cores and processors. The main pro of threads is its fairly easy to share data between the threads and coordinate their activities. A forked process cannot easily return data to the parent nor coordinate amongst themselves.
The main cons of Perl threads is they are relatively expensive to create compared to a fork, they must copy the entire program and all its data; you must have them compiled into your Perl; and they can be buggy, the older the Perl, the buggier the threads. If your work is expensive, the creation time should not matter.
Here's an example of how you might do it with threads. There's many ways to do it, this one uses Thread::Queue to create a big list of work your worker threads can share. When the queue is empty, the threads exit. The main advantages are that its easier to control how many threads are active, and you don't have to create a new, expensive thread for each bit of work.
This example shoves all the work into the queue at once, but there's no reason you can't add to the queue as you go. If you were to do that, you'd use dequeue instead of dequeue_nb which will wait around for more input.
use strict;
use warnings;
use threads;
use Thread::Queue;
# Dummy work routine
sub start_XFOIL_instance {
my $arg = shift;
print "$arg\n";
sleep 1;
}
# Read in dummy data
my #xfoil_args = <DATA>;
chomp #xfoil_args;
# Create a queue to push work onto and the threads to pull work from
# Populate it with all the data up front so threads can finish when
# the queue is exhausted. Makes things simpler.
# See https://rt.cpan.org/Ticket/Display.html?id=79733
my $queue = Thread::Queue->new(#xfoil_args);
# Create a bunch of threads to do the work
my #threads;
for(1..4) {
push #threads, threads->create( sub {
# Pull work from the queue, don't wait if its empty
while( my $xfoil_args = $queue->dequeue_nb ) {
# Do the work
start_XFOIL_instance($xfoil_args);
}
# Yell when the thread is done
print "Queue empty\n";
});
}
# Wait for threads to finish
$_->join for #threads;
__DATA__
blah
foo
bar
baz
biff
whatever
up
down
left
right
Did you consider gnu parallel parallel.
It will allow you to run several install instances of your program with different inputs and
fill your CPU cores as they begin available. It's often a very simple an efficient way to achieve parallelization of simple tasks.
This is quite old but if someone is still looking for suitable answers to this question, you might want to consider Perl Many-Core-Engine (MCE)