Intra-process coordination in mod_perl under the worker MPM - perl

I need to do some simple timezone calculation in mod_perl. DateTime isn't an option. What I need to do is easily accomplished by setting $ENV{TZ} and using localtime and POSIX::mktime, but under a threaded MPM, I'd need to make sure only one thread at a time was mucking with the environment. (I'm not concerned about other uses of localtime, etc.)
How can I use a mutex or other locking strategy to serialize (in the non-marshalling sense) access to the environment? The docs I've looked at don't explain well enough how I would create a mutex for just this use. Maybe there's something I'm just not getting about how you create mutexes in general.
Update: yes, I am aware of the need for using Env::C to set TZ.

(repeating what I said over at PerlMonks...)
BEGIN {
    my $mutex;
    sub that {
        $mutex ||= APR::ThreadMutex->new( $r->pool() );
        $mutex->lock();
        $ENV{TZ}= ...;
        ...
        $mutex->unlock();
    }
}
But, of course, lock() should happen in a c'tor and unlock() should happen in a d'tor except for one-off hacks.
Update: Note that there is a race condition in how $mutex is initialized in the subroutine (two threads could call that() for the first time nearly simultaneously). You'd most likely want to initialize $mutex before (additional) threads are created but I'm unclear on the details on the 'worker' Apache MPM and how you would accomplish that easily. If there is some code that gets run "early", simply calling that() from there would eliminate the race.
Which all suggests a much safer (hypothetical) interface to APR::ThreadMutex:
BEGIN {
    my $mutex;
    sub that {
        my $autoLock= APR::ThreadMutex->autoLock( \$mutex );
        ...
        # Mutex automatically released when $autoLock destroyed
    }
}
Note that when autoLock() is handed a reference to undef, it would itself use a mutex while initializing $mutex, which prevents the initialization race.
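The "lock in a c'tor, unlock in a d'tor" idea can be sketched without any new APR method. This is only an illustration: Local::FakeMutex and Local::LockGuard are hypothetical names, and the fake mutex merely records calls so the sketch runs outside Apache. Under mod_perl you would pass a real APR::ThreadMutex object, which offers the same lock()/unlock() pair. The sketch does not address the initialization race discussed above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real mutex: it only records the calls made.
{
    package Local::FakeMutex;
    sub new    { return bless { log => [] }, shift }
    sub lock   { push @{ $_[0]{log} }, 'lock' }
    sub unlock { push @{ $_[0]{log} }, 'unlock' }
}

# Hypothetical guard class: lock in the constructor, unlock in DESTROY.
{
    package Local::LockGuard;
    sub new {
        my ($class, $mutex) = @_;
        $mutex->lock();
        return bless { mutex => $mutex }, $class;
    }
    sub DESTROY { $_[0]{mutex}->unlock() }
}

my $mutex = Local::FakeMutex->new;

sub that {
    my ($fail) = @_;
    my $guard = Local::LockGuard->new($mutex);   # locked from here on
    die "simulated failure\n" if $fail;
    return 'done';
}   # $guard goes out of scope here, so the mutex is released

that(0);            # normal return
eval { that(1) };   # exception: the guard still unlocks during unwinding

print join(',', @{ $mutex->{log} }), "\n";
```

The point of the guard is the second call: even when the body dies, every lock is paired with an unlock.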

Because of this issue, mod_perl 2 actually deals with the %ENV hash differently than mod_perl 1. In mod_perl 1 %ENV was tied directly to the environ struct, so changing %ENV changed the environment. In mod_perl 2, the %ENV hash is populated from environ, but changes are not passed back.
This means you can no longer muck with $ENV{TZ} to adjust the timezone -- particularly in a threaded environment. The Apache2::Localtime module will make it work for the non-threaded case (by using Env::C) but when running in a threaded MPM that will be bad news.
There are some comments in the mod_perl source (src/modules/perl/modperl_env.c) regarding this issue:
/*
 * XXX: what we do here might change:
 *  - make it optional for %ENV to be tied to r->subprocess_env
 *  - make it possible to modify environ
 *  - we could allow modification of environ if mpm isn't threaded
 *  - we could allow modification of environ if variable isn't a CGI
 *    variable (still could cause problems)
 */
/*
 * problems we are trying to solve:
 *  - environ is shared between threads
 *    + Perl does not serialize access to environ
 *    + even if it did, CGI variables cannot be shared between threads!
 * problems we create by trying to solve above problems:
 *  - a forked process will not inherit the current %ENV
 *  - C libraries might rely on environ, e.g. DBD::Oracle
 */

If you're using Apache 1.3, then you shouldn't need to resort to mutexes. Apache 1.3 spawns off a number of worker processes, and each worker executes a single thread. In this case, you can write:
{
    local $ENV{TZ} = whatever_I_need_it_to_be();
    # Do calculations here.
}
Changing the variable with local means that it reverts back to the previous value at the end of the block, but is still passed into any subroutine calls made from within that block. It's almost certainly what you want. Since each process has its own independent environment, you won't be changing the environment of other processes using this technique.
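A minimal sketch of that pattern for a plain, single-threaded perl process (not mod_perl 2, where %ENV changes may not reach the C environment). The helper name hour_in_zone and the POSIX-style zone strings are assumptions chosen for illustration. POSIX::tzset is called explicitly so the C library re-reads TZ, and once more after the block so the restored value takes effect.

```perl
use strict;
use warnings;
use POSIX qw(tzset);

# Hypothetical helper: hour of day for an epoch time in a given zone.
sub hour_in_zone {
    my ($epoch, $tz) = @_;
    my $hour;
    {
        local $ENV{TZ} = $tz;   # reverts automatically at the end of the block
        tzset();                # make the C library pick up the new TZ
        $hour = (localtime $epoch)[2];
    }
    tzset();                    # re-read the restored TZ
    return $hour;
}

my $utc_hour = hour_in_zone(0, 'UTC0');   # epoch 0 is midnight UTC
my $est_hour = hour_in_zone(0, 'EST5');   # five hours west of UTC
print "UTC: $utc_hour, EST: $est_hour\n";
```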
For Apache 2, I don't know what model it uses with regard to forks and threads. If it keeps the same approach of forking off processes and having a single thread each, you're fine.
If Apache 2 uses honest-to-goodness real threads, then that's outside my area of detailed knowledge, but I hope another lovely Stack Overflow person can provide assistance.
All the very best,
Paul

Related

Is the only way to disconnect WWW::Mechanize::Firefox from mozrepl destruction of the objects?

As the title says, I'm trying to write a Perl daemon which, being long-running, should be sane about resource usage.
None of the examples or documentation I've seen mention a way to disconnect a session.
The best documentation on the topic I can find is in WWW::Mechanize::Firefox::Troubleshooting,
where it's suggested that the object (and connection?) is kept alive until global destruction.
In short, I've seen no 'disconnect' function, and wonder if I'm missing something.
Disconnection seems to be handled via destructors. Perl uses special DESTROY methods for this. It is not advisable to call this method manually.
You need to decrease the refcount of your $mech object in order to get it destroyed automatically. This happens when the variable drops out of scope, in the Global Destruction Phase at the end of the process, or (in the case of objects), by assigning something different to your variable, e.g.
$mech = undef;
To completely deallocate any variable, you can also
undef $mech; # which btw is the answer provided in the FAQ you linked
The differences are subtle, and irrelevant in this case.
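To illustrate the refcount behaviour without a running Firefox, here is a small sketch using a hypothetical stand-in class, Local::Connection, whose DESTROY plays the role of the disconnect. It is not WWW::Mechanize::Firefox code; it only shows when Perl calls a destructor.

```perl
use strict;
use warnings;

my @events;

# Hypothetical stand-in for an object that owns a connection.
{
    package Local::Connection;
    sub new {
        my ($class, $log) = @_;
        push @$log, 'connect';
        return bless { log => $log }, $class;
    }
    sub DESTROY { push @{ $_[0]{log} }, 'disconnect' }
}

my $mech  = Local::Connection->new(\@events);
my $alias = $mech;               # a second reference to the same object

undef $mech;                     # refcount drops to 1: no DESTROY yet
my $still_open = (@events == 1);

undef $alias;                    # last reference gone: DESTROY runs now
print "@events\n";
```

If the object seems to stay connected after `undef $mech`, some other variable or closure is probably still holding a reference to it.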

Thread safety of global variables in perl

I have following questions:
How is global code executed and global variables initialized in perl?
If I write use package_name; in multiple packages, does the global code execute each time?
Are global variables defined this way thread safe?
Perl makes a complete copy of all code and variables for each thread. Communication between threads is via specially marked shared variables (which in fact are not shared: there is still a copy in each thread, but all the copies get updated). This is a significantly different threading model from the one many other languages have, so the thread-safety concerns are different. They mostly center on what happens when objects are copied to make a new thread and those objects hold some kind of handle to a resource outside the program (e.g. a database connection).
Your question about use isn't really related to threads, as far as I can tell? use does several things; one is loading the specified module and running any top-level code in it; this happens only once per module, not once per use statement.
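The "once per module" behaviour can be checked with a throwaway module. In this sketch the module name LoadOnceDemo and the counter $main::loads are made up for the demonstration, and require is used at runtime because the file is created on the fly (use is essentially require plus import at compile time).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

$main::loads = 0;

# Write a tiny module whose top-level code bumps a counter.
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/LoadOnceDemo.pm" or die "cannot write module: $!";
print {$fh} "package LoadOnceDemo;\n\$main::loads++;\n1;\n";
close $fh or die "cannot close module: $!";

unshift @INC, $dir;

# Only the first require compiles and runs the file; later ones find
# the entry in %INC and do nothing.
require LoadOnceDemo;
require LoadOnceDemo;
require LoadOnceDemo;

print "top-level code ran $main::loads time(s)\n";
```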

What are the Perl techniques to detach just a portion of code to run independently?

I'm not involved in close-to-OS programming techniques, but as I know, when it comes to doing something in parallel in Perl the weapon of choice is fork and probably some useful modules built upon it. The doc page for fork says:
Does a fork(2) system call to create a new process running the same program at the same point.
As a consequence, having a big application that consumes a lot of memory and calling fork for a small task means there will be 2 big perl processes, and the second will waste resources just to do some simple work.
So, the question is: what to do (or how to use fork, if it's the only method) in order to have a detached portion of code running independently and consuming just the resources it needs?
Just a very simple example:
use strict;
use warnings;
my @big_array = ( 1 .. 2000000 ); # at least 80 MB memory
sleep 10; # to have time to inspect the memory usage easily
fork();
sleep 10; # to have time to inspect the memory usage easily
and the child process consumes 80+ MB too.
To be clear: it's not important to communicate to this detached code or to use its result somehow, just to be possible to say "hey, run for me this simple task in the background and let me continue my heavy work meanwhile ... and don't waste my resources!" when running a heavy perl application.
fork() to exec() is your bunny here. You fork() to create a new process (which is a fairly cheap operation, see below), then exec() to replace the big perl you've got running with something smaller. This looks like this:
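use strict;
use warnings;
use 5.010;
my @ary = (1 .. 10_000_000);
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid) {
    # parent
    say "Forked $pid from $$; sleeping";
    sleep 1_000;
} else {
    # child: the list form of exec avoids the shell splitting the -e argument
    exec('perl', '-e', 'sleep 1_000') or die "exec failed: $!";
}
(@ary was just used to fill up the original process' memory a bit.)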
I said that fork()ing was relatively cheap, even though it does copy the entire original process. These statements are not in conflict; the guys who designed fork noticed this same problem. The copy is lazy, that is, only the bits that are actually changed are copied.
If you find you want the processes to talk to each other, you'll start getting into the more complex domain of IPC, about which a number of books have been written.
Your forked process is not actually using 80MB of resident memory. A large portion of that memory will be shared - 'borrowed' from the parent process until either the parent or child writes to it, at which point copy-on-write semantics will cause the memory to actually be copied.
If you want to drop that baggage completely, run exec in your fork. That will replace the child Perl process with a different executable, thus freeing the memory. It's also perfect if you don't need to communicate anything back to the parent.
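A compact, checkable sketch of the fork-then-exec pattern. The child command (`exit 7`) is an arbitrary placeholder for the small task, and waitpid is used here only so the example can show the child's exit status; a truly detached task would not be waited on this way.

```perl
use strict;
use warnings;

my @big = (1 .. 100_000);        # stands in for the parent's large data

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: replace this copy of the big process with a fresh, small perl.
    # $^X is the path of the perl binary currently running.
    exec($^X, '-e', 'exit 7') or die "exec failed: $!";
}

# Parent: carry on with the heavy work, then collect the child's status.
waitpid($pid, 0);
my $exit_code = $? >> 8;
print "child exited with $exit_code\n";
```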
There is no way to fork just a subset of your process's footprint, so the usual workarounds come down to:
fork before you run memory intensive code in the parent process
start a separate process with system or open HANDLE,'|-',.... Of course this new process won't inherit any data from its parent, so you will need to pass data to this child somehow.
fork() as implemented on most operating systems is nicely efficient. It commonly uses a technique called copy-on-write, to mean that pages are initially shared until one or other process writes to them. Also a lot of your process memory is going to be readonly mapped files anyway.
Just because one process uses 80MB before fork() doesn't mean that afterwards the two will use 160. To start with it will be only a tiny fraction more than 80MB, until each process starts dirtying more pages.

Is my Rose::DB::Object compile-time too slow?

I'm planning to move from Class::DBI to Rose::DB::Object because of its nice structure and the claim that RDBO is faster than CDBI and DBIC.
However, on my machine (linux 2.6.9-89, perl 5.8.9) RDBO's compile time is much slower than CDBI's:
$ time perl -MClass::DBI -e0
real 0m0.233s
user 0m0.208s
sys 0m0.024s
$ time perl -MRose::DB::Object -e0
real 0m1.178s
user 0m1.097s
sys 0m0.078s
That's quite a difference...
Has anyone experienced similar behaviour?
Cheers.
@manni and @john: thanks for the explanation about the modules referenced by RDBO; it certainly explains why the compile time is slower than CDBI's.
The application is not running in a persistent environment. In fact it's invoked by several simultaneous cron jobs that run at 2-minute, 5-minute, and x-minute intervals, so yes, compile time is crucial here...
Jonathan Rockway's App::Persistent seems interesting; however, its (current) limitation of allowing only one application to run at a time makes it unsuitable for my purpose. It also has an issue when we kill the client: the server process keeps running...
Rose::DB::Object simply contains (or references from other modules) much more code than Class::DBI. On the bright side, it also has many more features and is much faster at runtime than Class::DBI. If compile time is a concern for you, then your best bet is to load as little code as possible (or get faster disks).
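One general way to "load as little code as possible" in short-lived scripts is to defer loading with a runtime require inside the code path that needs the module. The sketch below uses the core module Text::ParseWords purely as a placeholder for an expensive dependency; split_command is a made-up helper and nothing here is specific to Rose::DB::Object.

```perl
use strict;
use warnings;

# Hypothetical helper: the module is compiled only if this sub is called.
sub split_command {
    my ($line) = @_;
    require Text::ParseWords;            # no-op after the first call
    return Text::ParseWords::shellwords($line);
}

my $before = exists $INC{'Text/ParseWords.pm'} ? 1 : 0;
my @words  = split_command(q{rsync -a "my dir" backup});
my $after  = exists $INC{'Text/ParseWords.pm'} ? 1 : 0;

print "loaded before call: $before, after call: $after\n";
print "words: ", join('|', @words), "\n";
```

A cron job that never reaches split_command never pays for compiling the module.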
Another option is to set auto_load_related_classes to false in your Metadata objects. To do this early enough and globally will probably require you to make a Metadata subclass and then set that as the meta_class in your common Rose::DB::Object base class.
Turning auto_load_related_classes off means that you'd have to manually load related classes that you actually want to use in your script. That's a bit of a pain, but it lets you control how many classes get loaded. (If you have heavily interrelated classes, loading a single one can end up pulling all the other ones in.)
You could, perhaps, have an environment variable to control the behavior. Example metadata class:
package My::DB::Object::Metadata;
use base 'Rose::DB::Object::Metadata';
# New class method to handle default
sub default_auto_load_related_classes
{
    return $ENV{'RDBO_AUTO_LOAD_RELATED_CLASSES'} ? 1 : 0
}
# Override existing object method, honoring new class-defined default
sub auto_load_related_classes
{
    my($self) = shift;
    return $self->SUPER::auto_load_related_classes(@_) if(@_);
    if(defined(my $value = $self->SUPER::auto_load_related_classes))
    {
        return $value;
    }
    # Initialize to default
    return $self->SUPER::auto_load_related_classes(ref($self)->default_auto_load_related_classes);
}
And here's how it's tied to your common object base class:
package My::DB::Object;
use base 'Rose::DB::Object';
use My::DB::Object::Metadata;
sub meta_class { 'My::DB::Object::Metadata' }
Then set RDBO_AUTO_LOAD_RELATED_CLASSES to true when you're running in a persistent environment, and leave it false (and don't forget to explicitly load related classes) for command-line scripts.
Again, this will only help if you're currently loading more classes than you strictly need in a particular script due to the default true value of the auto_load_related_classes Metadata attribute.
If compile time is an issue, there are methods to lessen the impact. One is PPerl which makes a normal Perl script into a daemon that is compiled once. The only change you need to make (after installing it, of course) is to the shebang line:
#!/usr/bin/pperl
Another option is to write a client/server program in which the bulk of the work is done by a server that loads the expensive modules, with a thin script that just interacts with the server over sockets or pipes.
You should also look at App::Persistent and this article, both of which were written by Jonathan Rockway (aka jrockway).
This looks almost as dramatic over here:
time perl -MClass::DBI -e0
real 0m0.084s
user 0m0.080s
sys 0m0.004s
time perl -MRose::DB::Object -e0
real 0m0.391s
user 0m0.356s
sys 0m0.036s
I'm afraid part of the difference can simply be explained by the number of dependencies in each module:
perl -MClass::DBI -le 'print scalar keys %INC'
46
perl -MRose::DB::Object -le 'print scalar keys %INC'
95
Of course, you should ask yourself how much compilation time really matters for your particular problem. And what source code would be easier to maintain for you.

Is there a way to have managed processes in Perl (i.e. a threads replacement that actually works)?

I have a multithreaded application in Perl for which I have to rely on several non-thread-safe modules, so I have been using fork()ed processes with kill() signals as a message-passing interface.
The problem is that the signal handlers are a bit erratic (to say the least), and I often end up with processes that get killed in inappropriate states.
Is there a better way to do this?
Depending on exactly what your program needs to do, you might consider using POE, a Perl framework for event-driven applications built on cooperative multitasking (in effect, user-space threads). It's complex, but elegant and powerful, and it can help you sidestep non-thread-safe modules by confining all activity to a single Perl interpreter thread.
Helpful resources to get started:
Programming POE presentation by Matt Sergeant (start here to understand what it is and does)
POE project page (lots of cookbook examples)
Plus there are hundreds of pre-built POE components you can use to assemble into an application.
You can always have a pipe between parent and child to pass messages back and forth.
pipe my $reader, my $writer;
my $pid = fork();
if ( $pid == 0 ) {
    close $reader;
    ...
}
else {
    close $writer;
    my $msg_from_child = <$reader>;
    ...
}
Not a very comfortable way of programming, but it shouldn't be 'erratic'.
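Filled in, the pipe approach might look like the sketch below. The message text is an arbitrary placeholder, and error handling is kept to the minimum needed to make it run.

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: write one line to the parent, then exit.
    close $reader;
    print {$writer} "status: done\n";
    close $writer;               # flushes the buffered line
    exit 0;
}

# Parent: read the child's message, then reap the child.
close $writer;                   # otherwise the read would never see EOF
my $msg = <$reader>;
close $reader;
waitpid($pid, 0);

print "parent received: $msg";
```

Each side closes the end of the pipe it does not use; forgetting that in the parent is a common cause of reads that block forever.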
Have a look at forks.pm, a "drop-in replacement for Perl threads using fork()" which makes for much more sensible memory usage (but don't use it on Win32). It will allow you to declare "shared" variables and then it automatically passes changes made to such variables between the processes (similar to how threads.pm does things).
From perl 5.8 onwards you should be looking at the core threads module. Have a look at http://metacpan.org/pod/threads
If you want to use modules which aren't thread safe you can usually load them with a require and import inside the thread entry point.