I wrote a persistent network service in Perl that runs on Linux.
Unfortunately, as it runs, its Resident Set Size (RSS) just grows, and grows, and grows, slowly but surely.
This is despite diligent efforts on my part to expunge all unneeded hash keys and delete all references to objects that would otherwise keep reference counts above zero and obstruct garbage collection.
Are there any good tools for profiling the memory usage associated with various native data primitives, blessed hash reference objects, etc. within a Perl program? What do you use for tracking down memory leaks?
I do not habitually spend time in the Perl debugger or any of the various interactive profilers, so a warm, gentle, non-esoteric response would be appreciated. :-)
You could have a circular reference in one of your objects. Perl's garbage collection is based on reference counting, so the members of a reference cycle keep each other's counts above zero and will never get freed. You can check for circular references with Devel::Cycle and Test::Memory::Cycle. One thing to try (although it might get expensive in production code, so I'd disable it when a debug flag is not set) is checking for circular references inside the destructor for all your objects:
# Make this the parent class for all objects you want to check,
# or alternatively stuff this into UNIVERSAL::DESTROY.
package My::Parent;

use strict;
use warnings;
use Devel::Cycle;  # exports find_cycle() by default

sub DESTROY
{
    my $this = shift;
    # The callback is invoked once for every cycle found.
    find_cycle($this, sub {
        my $path = shift;
        foreach (@$path)
        {
            my ($type, $index, $ref, $value) = @$_;
            print STDERR "Circular reference found while destroying object of type " .
                ref($this) . "! reftype: $type\n";
            # Print other diagnostics if needed; see the docs for find_cycle().
        }
    });
    # Perhaps add code to weaken any circular references found,
    # so that the destructor can Do The Right Thing.
}
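If your classes are covered by a test suite, Test::Memory::Cycle (mentioned above) wraps the same check in test form. A minimal sketch, where My::Object stands in for whichever class you want to check:

use Test::More;
use Test::Memory::Cycle;
use My::Object;  # hypothetical class under test

my $obj = My::Object->new;
# ... exercise the object as your application would ...
memory_cycle_ok($obj, 'no circular references in My::Object');
done_testing();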
You can use Devel::Leak to search for memory leaks. However, the documentation is pretty sparse... for example, just where does one get the $handle reference to pass to Devel::Leak::NoteSV()? If I find the answer, I will edit this response.
OK, it turns out that using this module is pretty straightforward (code stolen shamelessly from Apache::Leak):
use Devel::Leak;

my $handle;  # apparently this doesn't need to be initialized at all
my $enterCount = Devel::Leak::NoteSV($handle);
print STDERR "ENTER: $enterCount SVs\n";

# ... code that may leak ...

my $leaveCount = Devel::Leak::CheckSV($handle);
print STDERR "\nLEAVE: $leaveCount SVs\n";
I'd place as much code as possible in the middle section, with the CheckSV call as close to the end of execution as possible -- after as many variables as possible have been deallocated. If you can't get a variable out of scope, you can assign undef to it to free whatever it was pointing to.
What I'd try next (other than Devel::Leak):
Try to eliminate "unnecessary" parts of your program, or segment it into separate executables (they could use signals to communicate, or call each other with command-line arguments) -- the goal is to boil the executable down to the smallest amount of code that still exhibits the bad behaviour. If you're sure it's not your own code that's leaking, reduce the number of external modules you're using, particularly those with an XS implementation. If it is perhaps your own code, look for anything potentially fishy:
definitely any use of Inline::C or XS code
direct use of references, e.g. \@list or \%hash, rather than anonymous references like [ qw(foo bar) ] (the former creates an extra reference which may get lost; with the latter, there is just one reference to worry about, which is usually stored in a local lexical scalar)
manipulating variables indirectly, e.g. $$foo where $foo is modified, which can cause autovivification of variables (although you need to disable strict 'refs' checking); see the sketch below
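To illustrate that last point, here is a minimal sketch (with made-up variable names) of how a symbolic reference quietly bypasses your lexicals and autovivifies a package variable instead:

use strict;
use warnings;

my $count = 0;        # lexical, invisible to symbolic references
my $name  = 'count';
{
    no strict 'refs'; # required to use a string as a reference
    ${$name}++;       # touches the package variable $main::count,
                      # autovivifying it; the lexical stays untouched
}
print "lexical: $count\n";        # prints 0
print "package: $main::count\n";  # prints 1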
I recently used NYTProf as a profiler for a large Perl application. It doesn't track memory usage, but it does trace all executed code paths, which helps with finding out where leaks originate. If what you are leaking is scarce resources such as database connections, tracing where they are allocated and closed goes a long way towards finding leaks.
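For reference, a typical NYTProf run looks like this (assuming the module is installed from CPAN):

perl -d:NYTProf yourscript.pl   # writes ./nytprof.out
nytprofhtml                     # turns it into an HTML report in ./nytprof/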
A nice guide about this is included in the Perl manual: Debugging Perl memory usage (in perldebguts).
I've been working on some Perl libraries for data mining. The libraries are full of nested loops for gathering and processing information. I'm working with strict mode and I always declare my variables with my outside of the first loop. For instance:
# Pretty useless code for clarity purposes:
my $flag = 1;
my ($v1, $v2);
while ($flag) {
    for $v1 (1 .. 1000) {
        # Lots and lots of code...
        $v2 = $v1 * 2;
    }
}
From what I've read, performance-wise it is better to declare them outside of the loop; however, maintaining my code is becoming increasingly difficult because the declarations of some variables end up pretty far away from where they are actually used.
Something like this would be easier to maintain:
my $flag = 1;
while ($flag) {
    for my $v1 (1 .. 1000) {
        # Lots and lots of code...
        my $v2 = $v1 * 2;
    }
}
I don't have much experience with Perl, since I come mostly from C++. At some point I would like to open source most of my libraries, so I would like them to be as pleasing to Perl gurus as possible.
From a professional Perl developer's point of view, what is the most appropriate choice between these options?
The general rule is to declare every variable as late as possible.
If the value of a variable doesn't need to be kept across iterations of a loop then declare it inside the loop, or as the loop control variable for a for loop.
If it needs to remain static across the loop iterations (like your $flag) then declare it immediately before the loop.
Yes, there is a minimal speed cost to be paid if you discard and reallocate a variable every time a block is executed, but programming and maintenance costs are by far the most important efficiency concern and should always be put first.
You shouldn't be optimising your code before it has been made to work and found to be running too slowly; and even then, moving declarations to the top of the file is a long way down the list of compromises that are likely to make a useful difference.
Optimize for readability. This means declaring variables in the smallest possible scope. Ideally, I can see the variable declaration and all usages of that variable at the same time. We can only keep a very limited amount of context in our heads, so declaring variables near their use makes it easier to understand, write, and debug code.
Which variant performs better is difficult to predict and difficult to measure, as the effect will be rather small. But if performance is roughly equivalent, we might as well use the more readable variant.
I personally often try to write code in a single-assignment form where variables aren't reassigned, and mutators like push @array, $elem are avoided. This makes sure that the name of a variable and its value are always interchangeable, which makes it easier to reason about the code. It implies that each variable declaration is also an initialization, which removes a whole class of errors.
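For example, here is a rough sketch of the same computation in mutator style and in single-assignment style (the variable names are illustrative):

my @numbers = (1 .. 5);

# Mutator style: @doubled_mut changes meaning as the loop runs.
my @doubled_mut;
push @doubled_mut, $_ * 2 for @numbers;

# Single-assignment style: the name and its value never diverge.
my @doubled = map { $_ * 2 } @numbers;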
You should declare variables when you're ready to define them, unless you need to access the value in a larger scope. Even then, passing the value back explicitly will be easier to follow.
The particular example you have given (declaration of a loop variable) probably does not have a performance penalty. As the link you quoted says the reason for a performance difference boils down to whether the variable is initialised inside the loop. In the case of a for loop it will be initialised either way.
I almost always declare the variables in the innermost scope possible. It reduces the chances of making mistakes. I would only alter that if performance became a problem in a specific loop.
As the title says, I'm trying to write a Perl daemon; since it's long-running, I want it to be sane about resource usage.
All the examples and documentation I've seen don't seem to mention a way to disconnect a session.
The best documentation on the topic I can find is WWW::Mechanize::Firefox::Troubleshooting, where it's suggested that the object (and connection?) is kept alive until global destruction.
In short, I've seen no 'disconnect' function, and I wonder if I'm missing something.
Disconnection seems to be handled via destructors. Perl uses the special DESTROY method for this; it is not advisable to call this method manually.
You need to decrease the refcount of your $mech object in order to have it destroyed automatically. This happens when the variable drops out of scope, during the global destruction phase at the end of the process, or when you assign something different to the variable, e.g.
$mech = undef;
To completely deallocate any variable, you can also
undef $mech; # which btw is the answer provided in the FAQ you linked
The differences are subtle, and irrelevant in this case.
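For a long-running daemon, the simplest approach is therefore to confine the object to the smallest scope that needs it. A sketch (the URL is a placeholder):

use WWW::Mechanize::Firefox;

{
    my $mech = WWW::Mechanize::Firefox->new;
    $mech->get('http://example.com');
    # ... work with the page ...
}   # $mech drops out of scope here: its refcount hits zero,
    # DESTROY runs, and the connection is torn down

# the daemon carries on without holding a browser connection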
I am facing a weird issue with memory handling in Perl.
I am working on a Perl application which uses pretty big hash structures. I assign the hash ref to and retrieve it from objects, but in the end, even after I deallocate the object and the hash, the memory usage remains the same.
Here is a sample of the problem:
my $hash = {};

# ... this data structure gets populated with a lot of data ...

{
    my $obj = Class->new(data => $hash);
    # ...
}

# Even undefing the hash...
undef $hash;
# ...I would expect some improvement in memory utilization, but it's not happening.
I think I am making some very basic mistake. Can anyone advise?
You can't really return memory to the OS. Perl will usually keep it in order to reallocate it later, though it does garbage collect occasionally.
See http://learn.perl.org/faq/perlfaq3.html#How-can-I-free-an-array-or-hash-so-my-program-shrinks-
and
http://clokwork.net/2012/02/12/memory-management-in-perl/
Generally speaking, Perl memory management does what you need, and you needn't worry about it. For example, what is the harm of keeping a huge chunk of memory allocated for the rest of your program? Probably none: if memory gets tight, the OS can page the unused parts out.
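You can watch this behaviour directly. A Linux-specific sketch that reads the script's own RSS from /proc before and after freeing a large hash (the sizes are arbitrary):

use strict;
use warnings;

# Read our own resident set size, in kB, from /proc (Linux only).
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or die $!;
    while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)/ }
    return;
}

my $hash = { map { $_ => 'x' x 100 } 1 .. 500_000 };
printf "RSS with hash:   %s kB\n", rss_kb();

undef $hash;  # the hash is freed inside perl...
printf "RSS after undef: %s kB\n", rss_kb();  # ...but RSS stays high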
Suppose you have some special case, like a script that runs constantly in the background but occasionally needs to do a memory-intensive task. You could solve this by splitting it into two scripts: background.pl and memory-intensive-task.pl. background.pl would execute memory-intensive-task.pl when needed, and the memory would be freed when that program completed and exited.
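A rough sketch of that pattern; time_for_heavy_task() is a hypothetical stand-in for whatever condition triggers the work:

#!/usr/bin/perl
# background.pl: stays small, and runs the memory-hungry work in a
# child process that returns all of its memory to the OS on exit.
use strict;
use warnings;

sub time_for_heavy_task { rand() < 0.1 }  # placeholder trigger

while (1) {
    if (time_for_heavy_task()) {
        system($^X, 'memory-intensive-task.pl') == 0
            or warn "task failed: $?";
        # the child has exited, so its memory is back with the OS
    }
    sleep 60;
}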
SO community,
I have been scratching my head lately over two memory issues I am running into with some of my Perl scripts, and I am hoping to find some help/pointers here to better understand what is going on.
Questionable observation #1:
I am running the same Perl script on different server instances (local Mac OS X laptop, dedicated server hardware, virtual server hardware) and am getting significantly varying results in the traced memory consumption. Just after script initialization, one instance reports a memory consumption of 210 MB, compared to 330 MB on another box -- a difference of over 50%. I understand that Perl's malloc() behaviour is OS-specific, but are such deviations normal, or should I be looking more closely at what is going on?
Questionable observation #2:
One script that is leaking memory is relatively trivial:
foreach (@dataSamples) {
    #memorycheck_1
    my $string = subRoutine($_);
    print FILE $string;
    #memorycheck_2
}
All variables in subRoutine are kept local and should be out of scope once the subroutine finishes. Yet when checking memory usage at #memorycheck_1 and #memorycheck_2, there is significant memory growth.
Is there any explanation for that? Using Devel::Leak, it seems there are leaked pointers, and I have a hard time understanding where they come from. Is there an easy way to translate the output of Devel::Leak into something that can actually tell me where those leaked references originate?
Thanks
You have two different questions:
1) Why is the memory footprint not the same across various environments?
Well, are all the OSes involved 64-bit, or is there a mix? If one OS is 32-bit and the other 64-bit, the variation is to be expected. Or, as @hobbs notes in the comments, is one of the perls compiled with threads support whereas another is not?
2) Why does the memory footprint change between check #1 and check #2?
That does not necessarily mean there is a memory leak. Perl won't give back memory to the OS. The memory footprint of your program will be the largest footprint it reaches and will not go down.
Neither of these points is Perl-specific. For more detailed answers, you'll need to show more code.
See also Question 7.25 in the C FAQ and further reading mentioned in that FAQ entry.
The most common reason for a memory leak in Perl is circular references. The simplest form would be something along the lines of:
sub subRoutine {
    my ($this, $that);
    $this = \$that;
    $that = \$this;  # $this and $that now reference each other
    return $_[0];
}
Now of course people reading that are probably saying, "Why would anyone do that?" And one generally wouldn't. But more complex data structures can contain circular references pretty easily, and we don't even blink an eye at them. Consider doubly-linked lists, where each node refers to the node to its left and the node to its right. It's important not to let the last explicit reference to such a list pass out of scope without first breaking the circular references contained in each of its nodes, or you'll get a structure that is inaccessible but can't be garbage collected, because the reference count of each node never falls to zero.
Per Eric Strom's excellent suggestion, the core module Scalar::Util has a function called weaken. A reference that has been weakened won't hold a reference count to the entity it refers to. This can be helpful for preventing circular references. Another strategy is to implement your circular-reference-wielding datastructure within a class where an object method explicitly breaks the circular reference. Either way, such data structures do require careful handling.
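A minimal sketch of weaken breaking the kind of two-node cycle described above:

use Scalar::Util qw(weaken);

my $head = { name => 'head' };
my $tail = { name => 'tail' };
$head->{next} = $tail;
$tail->{prev} = $head;  # a cycle: $head -> $tail -> $head

# The weakened back-reference no longer counts toward $head's
# reference count, so both nodes can be freed normally when
# $head and $tail go out of scope.
weaken($tail->{prev});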
Another source of trouble is poorly written XS modules (nothing against XS authors; it's just really tricky to write XS modules well). What goes on behind the closed doors of an XS module may be a memory leak.
Until we see what's happening inside of subRoutine we can only guess whether or not there's actually an issue, and what the source of the issue may be.
I'm planning to move from Class::DBI to Rose::DB::Object due to its nice structure and the claims that RDBO is faster than CDBI and DBIC.
However, on my machine (Linux 2.6.9-89, Perl 5.8.9), RDBO compile time is much slower than CDBI's:
$ time perl -MClass::DBI -e0
real 0m0.233s
user 0m0.208s
sys 0m0.024s
$ time perl -MRose::DB::Object -e0
real 0m1.178s
user 0m1.097s
sys 0m0.078s
That's a lot different...
Anyone experiences similar behaviour here?
Cheers.
@manni and @john: thanks for the explanation about the modules referenced by RDBO; it certainly answers why the compile time is slower than CDBI's.
The application is not running in a persistent environment. In fact, it's invoked by several simultaneous cron jobs that run at 2-minute, 5-minute, and x-minute intervals -- so yes, compile time is crucial here...
Jonathan Rockway's App::Persistent seems interesting; however, its (current) limitation of allowing only one application to run at a time makes it unsuitable for my purpose. It also has an issue where killing the client leaves the server process running...
Rose::DB::Object simply contains (or references from other modules) much more code than Class::DBI. On the bright side, it also has many more features and is much faster at runtime than Class::DBI. If compile time is a concern for you, then your best bet is to load as little code as possible (or get faster disks).
Another option is to set auto_load_related_classes to false in your Metadata objects. To do this early enough and globally will probably require you to make a Metadata subclass and then set that as the meta_class in your common Rose::DB::Object base class.
Turning auto_load_related_classes off means that you'd have to manually load related classes that you actually want to use in your script. That's a bit of a pain, but it lets you control how many classes get loaded. (If you have heavily interrelated classes, loading a single one can end up pulling all the other ones in.)
You could, perhaps, have an environment variable to control the behavior. Example metadata class:
package My::DB::Object::Metadata;

use base 'Rose::DB::Object::Metadata';

# New class method to supply the default
sub default_auto_load_related_classes
{
    return $ENV{'RDBO_AUTO_LOAD_RELATED_CLASSES'} ? 1 : 0;
}

# Override the existing object method, honoring the new class-defined default
sub auto_load_related_classes
{
    my $self = shift;

    return $self->SUPER::auto_load_related_classes(@_) if (@_);

    if (defined(my $value = $self->SUPER::auto_load_related_classes))
    {
        return $value;
    }

    # Initialize to the default
    return $self->SUPER::auto_load_related_classes(ref($self)->default_auto_load_related_classes);
}
And here's how it's tied to your common object base class:
package My::DB::Object;

use base 'Rose::DB::Object';
use My::DB::Object::Metadata;

sub meta_class { 'My::DB::Object::Metadata' }
Then set RDBO_AUTO_LOAD_RELATED_CLASSES to true when you're running in a persistent environment, and leave it false (and don't forget to explicitly load related classes) for command-line scripts.
Again, this will only help if you're currently loading more classes than you strictly need in a particular script due to the default true value of the auto_load_related_classes Metadata attribute.
If compile time is an issue, there are ways to lessen the impact. One is PPerl, which turns a normal Perl script into a daemon that is compiled only once. The only change you need to make (after installing it, of course) is to the shebang line:
#!/usr/bin/pperl
Another option is to write a client/server program where the bulk of the work is done by a server that loads the expensive modules once, and a thin client just talks to the server over sockets or pipes.
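A bare-bones sketch of that model over a Unix-domain socket; the socket path and the request handling are placeholders:

#!/usr/bin/perl
# server.pl: pays the compile cost once, then serves thin clients.
use strict;
use warnings;
use IO::Socket::UNIX;
use Socket qw(SOCK_STREAM);
use Rose::DB::Object;  # the expensive compile happens here, once

my $path = '/tmp/rdbo-worker.sock';
unlink $path;
my $server = IO::Socket::UNIX->new(
    Local  => $path,
    Type   => SOCK_STREAM,
    Listen => 5,
) or die "listen: $!";

while (my $client = $server->accept) {
    chomp(my $request = <$client>);
    # ... dispatch $request to the already-compiled code ...
    print $client "done: $request\n";
    close $client;
}

The matching client loads no heavy modules, so it starts almost instantly:

#!/usr/bin/perl
# client.pl: thin wrapper that talks to the preloaded server.
use strict;
use warnings;
use IO::Socket::UNIX;

my $sock = IO::Socket::UNIX->new(Peer => '/tmp/rdbo-worker.sock')
    or die "connect: $!";
print $sock "do_work\n";
print while <$sock>;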
You should also look at App::Persistent and this article, both of which were written by Jonathan Rockway (aka jrockway).
This looks almost as dramatic over here:
time perl -MClass::DBI -e0
real 0m0.084s
user 0m0.080s
sys 0m0.004s
time perl -MRose::DB::Object -e0
real 0m0.391s
user 0m0.356s
sys 0m0.036s
I'm afraid part of the difference can simply be explained by the number of dependencies in each module:
perl -MClass::DBI -le 'print scalar keys %INC'
46
perl -MRose::DB::Object -le 'print scalar keys %INC'
95
Of course, you should ask yourself how much compilation time really matters for your particular problem, and which source code would be easier for you to maintain.