How can I fork a Perl CGI program to hive off long-running tasks? - perl

I am writing a Bulk Mail scheduler controlled from a Perl/CGI Application and would like to learn abut "good" ways to fork a CGI program to run a separate task? Should one do it at all? Or is it better to suffer the overhead of running a separate job-queue engine like Gearman or TheSchwartz as has been suggested recently. Does the answer/perspective change when using an near-MVC framework like CGI::Application over vanilla CGI.pm? The last comes from a possible project that I have in mind for a CGI::Application Plugin - that would make "forking" a process relatively simple to call.

Look at Proc::Daemon - it's the simplest thing that works. From your CGI script, do the CGI business (getting input, returning a response to the browser), then call Proc::Daemon::init() which does the fork, daemonizes your process and makes the parent exit. Then your script (now a daemon) does its long-running tasks and exits when they're done.
You'll want to update something (file, database record) while running as a daemon, so subsequent CGI invocations can check what it did (or how it's progressing).

Would something like POE be useful? It's more event-driven than forked, but it may meet your needs.

Related

Start mod_perl handler from another perl module

How can I start a mod_perl handler (called MyCacheHandler.pm) directly from another perl module (called MyModule.pm). Because currently I'm starting the handler via a web browser, but it would be a little bit easier to call it with MyModule.
As I understand it, you want to have it (MyCacheHandler) running in the background, and it won't produce any visible (to a browser) output? (Just side effects).
If that's correct, why is it even implemented as a mod_perl handler. Just implement it as a script and run it from cron, or as a daemon of some kind.
You could still control MyCacheHandler from MyModule (say via IPC).
Do some refactoring. Split MyCacheHandler.pm into two modules: one which is doing the hard work and does not depend on mod_perl anymore (that is, no more handling with $r), so it's callable from other modules. The other would be a now thin mod_perl handler calling the first module.
Or leave it as is, and just use LWP::UserAgent to access MyCacheHandler from MyModule.

Non-blocking / Asynchronous Execution in Perl

Is there a way to implement non-blocking / asynchronous execution (without fork()'ing) in Perl?
I used to be a Python developer for many years... Python has really great 'Twisted' framework that allows to do so (using DEFERREDs. When I ran search to see if there is anything in Perl to do the same, I came across POE framework - which seemed "close" enough to what I was searching for. But... after spending some time reading the documentation and "playing" with the code, I came against "the wall" - which is following limitation (from POE::Session documentation):
Callbacks are not preemptive. As long as one is running, no others will be dispatched. This is known as cooperative multitasking. Each session must cooperate by returning to the central dispatching kernel.
This limitation essentially defeats the purpose of asynchronous/parallel/non-blocking execution - by restricting to only one callback (block of code) executing at any given moment. No other callback can start running while another is already running!
So... is there any way in Perl to implement multi-tasking (parallel, non-blocking, asynchronous execution of code) without fork()'ing - similar to DEFERREDs in Python?
Coro is a mix between POE and threads. From reading its CPAN documentation, I think that IO::Async does real asynchronous execution. threads can be used too - at least Padre IDE successfully uses them.
I'm not very familiar with Twisted or POE, but basic parallel execution is pretty simple with threads. Interpreters are generally not compiled with threading support, so you would need to check for that. The forks package is a drop-in replacement for threading (implements the full API) but using processes seamlessly. Then you can do stuff like this:
my $thread = async {
print "you can pass a block of code as an arg unlike Python :p";
return some_func();
};
my $result = $thread->join();
I've definitely implemented callbacks from an event loop in an async process using forks and I don't see why it wouldn't work with threads.
Twisted also uses cooperative multi-tasking just like POE & Coro.
However it looks like Twisted Deferred does (or can) make use of threads. NB. See this answer from the SO question Twisted: Making code non-blocking
So you would need to go the same route with POE (though using fork is probably preferable).
So one POE solution would be to use: POE::Wheel::Run - portably run blocking code and programs in subprocesses.
For alternatives to POE take a look at AnyEvent and Reflex.
I believe you use select for that kind of thing. More similarly to forking, there's threading.
POE is fine if you want asynchronous processing but using only a single cpu (core) is fine.
For example if the app is I/O limited a single process will be enough most of the time.
No other callback can start running while another is already running!
As far as I can tell - this is the same with all languages (per CPU thread of course; modern web servers usually spawn a least one process or thread per CPU core, so it will look (to users) like stuff it working in parallel, but the long-running callback didn't get interrupted, some other core just did that work).
You can't interrupt an interrupt, unless the interrupted interrupt has been programmed specifically to accommodate it.
Imagine code that takes 1min to run, and a PC with 16 cores - now imagine a million people try to load that page, you can deliver working results to 16 people, and "time out" all the rest, or, you can crash your web server and give no results to anyone. Folks choose not to crash their web server, which is why they never permit callbacks to interrupt other callbacks (not that they could even if they tried - the caller never gets control back to make a new call before the prior one has ended anyhow...)

Which logging module to use under Perl's AnyEvent?

I am using the wonderful AnyEvent for creating an asynchronous TCP server (specifically, a MUD server).
In order to keep everything running smoothly and with as few blocking/synchronous pieces of code possible, I have replaced some modules I was using with their asynchronous counterpart, for example AnyEvent::Memcached and AnyEvent::Gearman. This allows the main program to be quite speedy, which is desirable. I have coded around the need for some of these calls to be synchronous.
One problem I currently have, and the focus of this question, is logging.
Before turning to AnyEvent for this server program, I was using Log::Log4perl as it allows me to fine-tune which modules or subroutines should be logged, at which level and to which log output (screen, file, etc).
The problem here is that the Log4perl actions (warn, info, etc) are currently performed synchronously but I have no requirement for that as long as the log lines eventually end up on the screen / file (and in the correct order).
Is Log::Log4perl still the right choice when using an asynchronous event handler such as AnyEvent, or should I look at a different module? If so, which is recommended?
AnyEvent::Log, which comes with AnyEvent, uses AnyEvent::IO, which appends to files asynchronously when IO::AIO is available (and synchronously when not).
What you are trying to avoid? If it's synchronous file IO (writing to log files/stdout etc.) then your problem would probably be solved with an asynchronous and/or buffering appender(s) rather than replacing all use of Log4perl in your code.
Log::Log4perl::Appender::Buffer seems like it might be a good start, but a completely async appender doesn't appear to exist anymore.

Perl scripts, to use forks or threads?

I am writing a couple fo scripts that go and collect data from a number of servers, the number will grow and im trynig to future proof my scripts, but im a little stuck.
so to start off with I have a script that looks up an IP in a mysql database and then connects to each server grabs some information and then puts it into the database again.
What i have been thinknig is there is a limited amount of time to do this and if i have 100 servers it will take a little bit of time to go out to each server get the information and then push it to a db. So I have thought about either using forks or threads in perl?
Which would be the prefered option in my situation? And hs anyone got any examples?
Thanks!
Edit: Ok so a bit more inforamtion needed: Im running on Linux, and what I thought was i could get the master script to collect the db information, then send off each sub process / task to connect and gather information then push teh information back to the db.
Which is best depends a lot on your needs; but for what it's worth here's my experience:
Last time I used perl's threads, I found it was actually slower and more problematic for me than forking, because:
Threads copied all data anyway, as a thread would, but did it all upfront
Threads didn't always clean up complex resources on exit; causing a slow memory leak that wasn't acceptable in what was intended to be a server
Several modules didn't handle threads cleanly, including the database module I was using which got seriously confused.
One trap to watch for is the "forks" library, which emulates "threads" but uses real forking. The problem I faced here was many of the behaviours it emulated were exactly what I was trying to get away from. I ended up using a classic old-school "fork" and using sockets to communicate where needed.
Issues with forks (the library, not the fork command):
Still confused the database system
Shared variables still very limited
Overrode the 'fork' command, resulting in unexpected behaviour elsewhere in the software
Forking is more "resource safe" (think database modules and so on) than threading, so you might want to end up on that road.
Depending on your platform of choice, on the other hand, you might want to avoid fork()-ing in Perl. Quote from perlfork(1):
Perl provides a fork() keyword that
corresponds to the Unix system call of
the same name. On most Unix-like
platforms where the fork() system call
is available, Perl's fork() simply
calls it.
On some platforms such as Windows
where the fork() system call is not
available, Perl can be built to
emulate fork() at the interpreter
level. While the emulation is
designed to be as compatible as
possible with the real fork() at the
level of the Perl program, there are
certain important differences that
stem from the fact that all the pseudo
child "processes" created this way
live in the same real process as far
as the operating system is concerned.

So, who should daemonize? The script or the caller?

I'm always wondering who should do it. In Ruby, we have the Daemons library which allows Ruby scripts to daemonize themselves. And then, looking at God (a process monitoring tool, similar to monit) page, I see that God can daemonize processes.
Any definitive answer out there?
You probably cannot get a definitive answer, as we generally end up with both: the process has the ability to daemonize itself, and the process monitor has the ability to daemonize its children.
Personally I prefer to have the process monitor or script do it, for a few reasons:
1. if the process monitor wishes to closely follow its children to restart them if they die, it can choose not to daemonize them. A SIGCHLD will be delivered to the monitor when one of its child processes exits. In embedded systems we do this a lot.
2. Typically when daemonizing, you also set the euid and egid. I prefer not to encode into every child process a knowledge of system-level policy like uids to use.
3. It allows re-use of the same application as either a command line tool or a daemon (I freely admit that this rarely happens in practice).
I would say it is better for your script to do it. I don't know your process monitoring tool there, but I would think users could potentially use an alternative tool, which means that having the script do it would be preferable.
If you can envision the script run in non-daemon fashion, I would add an option to the script to enable or disable daemonization.