Declaring variables inside or outside loops in Perl, best practices - perl

I've been working on some Perl libraries for data mining. The libraries are full of nested loops for gathering and processing information. I'm working with strict mode and I always declare my variables with my outside of the first loop. For instance:
# Pretty useless code for clarity purposes:
my $flag = 1;
my ($v1, $v2);
while ($flag) {
for $v1 (1 .. 1000) {
# Lots and lots of code...
$v2 = $v1 * 2;
}
}
For what I've read here, performance-wise, it is better to declare them outside of the loop, however, the maintenance of my code is becoming increasingly difficult because the declaration of some variables ends up pretty far away from where they are actually used.
Something like this would be easier to mantain:
my $flag = 1;
while ($flag) {
for my $v1 (1 .. 1000) {
# Lots and lots of code...
my $v2 = $v1 * 2;
}
}
I don't have much of experience with Perl since I come from working mostly with C++. At some point, I would like to open source most of my libraries, so I would like them to be as pleasing for all of the Perl gurus as possible.
From a professional Perl developer point of view, what is most appropriate choice between these options?

The general rule is to declare every variable as late as possible.
If the value of a variable doesn't need to be kept across iterations of a loop then declare it inside the loop, or as the loop control variable for a for loop.
If it needs to remain static across the loop iterations (like your $flag) then declare it immediately before the loop.
Yes, there is a minimal speed cost to be paid if you discard and reallocate a variable every time a block is executed, but programming and maintenance costs are by far the most important efficiency and should always be put first.
You shouldn't be optimising your code before it has been made to work and found to be running too slowly; and even then, moving declarations to the top of the file is a long way down the list of compromises that are likely to make a useful difference.

Optimize for readability. This means declaring variables in the smallest possible scope. Ideally, I can see the variable declaration and all usages of that variable at the same time. We can only keep a very limited amount of context in our heads, so declaring variables near their use makes it easier to understand, write, and debug code.
Understanding what variant performs better is difficult to estimate, and difficult to measure as the effect will be rather small. But if performance is roughly equivalent, we might as well use the more readable variant.
I personally often try to write code in a single assignment form where variables aren't reassigned, and mutators like push #array, $elem are avoided. This makes sure that the name of a variable and its value are always interchangeable which makes it easier to reason about code. This implies that each variable declaration is also an initialization, which removes a whole class of errors.

You should declare variables when you're ready to define them unless you need to access the answer in a larger scope. Even then passing the value back explicitly will be easier to follow.

The particular example you have given (declaration of a loop variable) probably does not have a performance penalty. As the link you quoted says the reason for a performance difference boils down to whether the variable is initialised inside the loop. In the case of a for loop it will be initialised either way.
I almost always declare the variables in the innermost scope possible. It reduces the chances of making mistakes. I would only alter that if performance became a problem in a specific loop.

Related

Unexpected behavior in nested recursive function

I have some code that behaves rather strangely.
I am inside a function, and I declare a nested one, which should check if something isn't okay. If it's not then it should sleep for five seconds and call itself again.
sub stop {
sub wait_for_stop {
my $vm_view = shift;
if ( $vm_view->runtime->powerState->val ne "poweredOff" ) {
debug("...");
sleep(5);
wait_for_stop();
}
}
debug("Waiting for the VM to stop");
wait_for_stop( #$vm_views[0] );
}
So, in the call that results in the recursion inside the if condition, if I put the parameter (as the function definition expects it), like this:
wait_for_stop($vm_view);
I get an infinite loop.
If I leave it without a parameter, as in the code example above, it works as expected.
Shouldn't $vm_view in the subsequent calls be empty? Or the last used value ($vm_view->runtime->powerState->val)? Either case should result in unexpected behavior and error.
But it works without any parameter. So why is that? Is there something I've missed from perldoc?
EDIT1: Actually, the value of $vm_views does get changed, so that's not the reason for the infinite loop.
General clarifications
I am using the VMware SDK. The $vm_views object contains the VM details. I am polling one of its methods to detect a change, in this particular case, I need to know when the machine is turned off. So, for lack of a better way, I make a call every 5 seconds until the value is satisfactory.
My purpose is to stop the VM, make modifications that can only be made while it's off, and then launch it.
Actual question
When I don't pass a parameter, the block works as expected – it waits until the value is poweredOff (the VM is off), and continues, which doesn't make much sense, at least to me.
In the case I put $vm_view as parameter, I get an infinite loop (the value will still get changed, since I'm calling a method).
So I am wondering why the function works, when after the first call, $vm_view should be undef, and therefore, be stuck in an infinite loop? [undef ne "poweredOff" -> sleep -> recursion till death]
And why, when I pass the expected value, it gets stuck?
PS: To all those saying my recursion is weird and useless in this scenario – due to a myriad of reasons, I need to use such a format (it's better suited for my needs, since, after I get this bit working, I'll modify it to add various stuff and reuse it, and, for what I have in mind, a function seems to be the best option).
You should always look at your standard tools before going for something a little more exotic like recursion. All you need here is a while loop
It's also worth noting that #$vm_views[0] should be $$vm_views[0]) or, better, $vm_views->[0]. And you don't gain anything by defining a subroutine inside another one -- the effect is the same as if it was declared separately afterwards
An infinite loop is what I would expect if $vm_view->runtime->powerState->val never returns poweredOff, and the code below won't fix that. I don't see any code that tells the VM to stop before you wait for the status change. Is that correct?
I don't understand why you say that your code works correctly when you call wait_for_stop without any parameters. You will get the fatal error
Can't call method "runtime" on an undefined value
and your program will stop. Is what you have posted the real code?
This will do what you intended. I also think it's easier to read
use strict;
use warnings;
my $vm_views;
sub stop {
debug ("Waiting for the VM to stop");
my $vm_view = $vm_views->[0];
while ( $vm_view->runtime->powerState->val ne 'poweredOff' ) {
debug('...');
sleep 5;
}
}
I think you would have a better time not calling wait_for_stop() recursively. This way might serve you better:
sub stop
{
sub wait_for_stop
{
my $vm_view = shift;
if ($vm_view->runtime->powerState->val ne "poweredOff")
{
debug("...");
#sleep(5);
#wait_for_stop();
return 0;
}
return 1;
}
debug ("Waiting for the VM to stop");
until(wait_for_stop(#$vm_views[0]))
{
sleep(5);
}
}
Your old way was rather confusing and I don't think you were passing the $vm_view variable through to the recursive subroutine call.
Update:
I tried reading it here:
https://www.vmware.com/support/developer/viperltoolkit/doc/perl_toolkit_guide.html#step3
It says:
When the script is run, the VI Perl Toolkit runtime checks the
environment variables, configuration file contents, and command-line
entries (in this order) for the necessary connection-setup details. If
none of these overrides are available, the runtime uses the defaults
(see Table 1 ) to setup the connection.
So, the "runtime" is using the default connection details even when a vm object is not defined? May be?
That still doesn't answer why it doesn't work when the parameter is passed.
You need to understand the VM SDK better. You logic for recursion and usage of function parameters are fine.
Also, the page: https://www.vmware.com/support/developer/viperltoolkit/doc/perl_toolkit_guide.html
says -
VI Perl Toolkit views have several characteristics that you should
keep in mind as you write your own scripts. Specifically, a view:
Is a Perl object
Includes properties and methods that correlate to the properties and
operations of the server-side managed object
Is a static copy of one or more server-side managed objects, and as
such (static), is not updated automatically as the state of objects
on the server changes.
So what the "vm" function returns is a static copy, which can be updated from the script. May be it is getting updated when you make a call while passing the $vm_view?
Old Answer:
Problem is not what you missed from Perl docs. The problem is with your understanding of recursion.
The purpose of recursion is to keep running until $vm_view->runtime->powerState->val becomes "PoweredOff" and then cascade back. If you don't update the value, it keeps running forever.
When you say:
I get an infinite loop.
Are you updating the $vm_view within the if condition?
Otherwise, the variable is same every time you call the function and hence you can end up in infinite loop.
If I leave it without a parameter, as in the code example above, it
works as expected.
How can it work as expected? What is expected? There is no way the function can know what value your $vm_view is being updated with.
I have simplified the code, added updating a simple variable (similar to your $vm_view) and tested. Use this to understand what is happening:
sub wait_for_stop
{
my $vm_view = shift;
if ($vm_view < 10){
print "debug...\n";
sleep(5);
$vm_view++; // update $vm_view->runtime->powerState->val here
// so that it becomes "poweredOff" at some point of time
// and breaks the recursion
wait_for_stop($vm_view);
}
}
wait_for_stop(1);
Let me know in comments how the variable is being updated and I will help resolve.

Which is better in PHP: suppress warnings with '#' or run extra checks with isset()?

For example, if I implement some simple object caching, which method is faster?
1. return isset($cache[$cls]) ? $cache[$cls] : $cache[$cls] = new $cls;
2. return #$cache[$cls] ?: $cache[$cls] = new $cls;
I read somewhere # takes significant time to execute (and I wonder why), especially when warnings/notices are actually being issued and suppressed. isset() on the other hand means an extra hash lookup. So which is better and why?
I do want to keep E_NOTICE on globally, both on dev and production servers.
I wouldn't worry about which method is FASTER. That is a micro-optimization. I would worry more about which is more readable code and better coding practice.
I would certainly prefer your first option over the second, as your intent is much clearer. Also, best to keep away edge condition problems by always explicitly testing variables to make sure you are getting what you are expecting to get. For example, what if the class stored in $cache[$cls] is not of type $cls?
Personally, if I typically would not expect the index on $cache to be unset, then I would also put error handling in there rather than using ternary operations. If I could reasonably expect that that index would be unset on a regular basis, then I would make class $cls behave as a singleton and have your code be something like
return $cls::get_instance();
The isset() approach is better. It is code that explicitly states the index may be undefined. Suppressing the error is sloppy coding.
According to this article 10 Performance Tips to Speed Up PHP, warnings take additional execution time and also claims the # operator is "expensive."
Cleaning up warnings and errors beforehand can also keep you from
using # error suppression, which is expensive.
Additionally, the # will not suppress the errors with respect to custom error handlers:
http://www.php.net/manual/en/language.operators.errorcontrol.php
If you have set a custom error handler function with
set_error_handler() then it will still get called, but this custom
error handler can (and should) call error_reporting() which will
return 0 when the call that triggered the error was preceded by an #.
If the track_errors feature is enabled, any error message generated by
the expression will be saved in the variable $php_errormsg. This
variable will be overwritten on each error, so check early if you want
to use it.
# temporarily changes the error_reporting state, that's why it is said to take time.
If you expect a certain value, the first thing to do to validate it, is to check that it is defined. If you have notices, it's probably because you're missing something. Using isset() is, in my opinion, a good practice.
I ran timing tests for both cases, using hash keys of various lengths, also using various hit/miss ratios for the hash table, plus with and without E_NOTICE.
The results were: with error_reporting(E_ALL) the isset() variant was faster than the # by some 20-30%. Platform used: command line PHP 5.4.7 on OS X 10.8.
However, with error_reporting(E_ALL & ~E_NOTICE) the difference was within 1-2% for short hash keys, and up 10% for longer ones (16 chars).
Note that the first variant executes 2 hash table lookups, whereas the variant with # does only one lookup.
Thus, # is inferior in all scenarios and I wonder if there are any plans to optimize it.
I think you have your priorities a little mixed up here.
First of all, if you want to get a real world test of which is faster - load test them. As stated though suppressing will probably be slower.
The problem here is if you have performance issues with regular code, you should be upgrading your hardware, or optimize the grand logic of your code rather than preventing proper execution and error checking.
Suppressing errors to steal the tiniest fraction of a speed gain won't do you any favours in the long run. Especially if you think that this error may keep happening time and time again, and cause your app to run more slowly than if the error was caught and fixed.

Are there performance reasons against goto? [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
GOTO, does it affect the performance while it is executed and run on the device?
Is it a good practice to use GOTO in objective C or is it bad practice to use it?
And, when is it a good choice to use GOTO statement?
Thanks.
A goto is simply a jump, so that its effect on performance is practically zero. It’s a bad practice because it harms code readability; you can mostly do without it. Some of the cases where it makes sense to use goto are described in previous questions, just search for goto.
Using a go to statement is usually a bad practice, especially in a object oriented language(where you can achieve the same purpose in an OO way easyly ), but not from a performance point of view but rather from the code readability point of view...
There is nothing in BAD and GOOD practice this is up to your requirement.
If you have same code which you want to execute you can say loop then you can use goto. Well here is a small example about this I think it would clear your doubt.
Declare any label name, here hello is label then you can call it using goto statement like this -
hello:
NSLog(#"Print hello!");
goto hello;
This would print 'Print hello!' again and again.
Not affecting the performance, just for a good structure and readability which are important features of professional programming. But sometimes, using goto may help to ease complexity in cases where the loop is too deep, but you want to jump out when certain condition is triggered. Even so, it can also be avoided in other ways.
In principle goto can affect performance simply by being present in the function.
The performance difference will almost always be unnoticeable, and there are a lot of things other than goto than can slightly perturb the optimizer and affect performance. But if you're interested you could examine the emitted code for differences.
It's a basic requirement in the emitted code that the same registers must be used for the same things at the source and target of the goto[*]. This constrains the register allocation when the compiler optimizes the code. Such constraints may have no effect at all, they may slow things down or cause additional code to be emitted. If they speed things up, it can only be by accident because the compiler's heuristics were in effect incorrect when applied to the unconstrained version.
The effect might be more pronounced for a computed goto (a GNU extension), where you can store a label in a variable and goto the variable. In that case, every possible target has to share a register state with every possible source.
What doesn't (normally) make a difference to performance is goto the start or end of a block vs. the equivalent break or continue or else. It's all the same to the optimizer: the compiler breaks your code down into so-called "basic blocks" with jumps and conditional jumps between them. It doesn't normally care whether the reason for the jump is a goto or not, and it has to get the register states right no matter which. This is why almost any programming construct can be described as "goto in disguise" by someone who's only thinking about the emitted instructions.
[*] to be more precise -- there could be an implicit zap at a goto, meaning that some register is used for one thing at the source and isn't used at all at the target. But you can't have some register that the target expects to contain a particular value (like the current value of a variable) and the source doesn't. So if that was the case before and then you add the goto, either the target needs to stop expecting it, or the source needs to put it there. Typically either one is going to require extra code to shuffle values between registers and stack.

Does MATLAB perform tail call optimization?

I've recently learned Haskell, and am trying to carry the pure functional style over to my other code when possible. An important aspect of this is treating all variables as immutable, i.e. constants. In order to do so, many computations that would be implemented using loops in an imperative style have to be performed using recursion, which typically incurs a memory penalty due to the allocation a new stack frame for each function call. In the special case of a tail call (where the return value of a called function is immediately returned to the callee's caller), however, this penalty can be bypassed by a process called tail call optimization (in one method, this can be done by essentially replacing a call with a jmp after setting up the stack properly). Does MATLAB perform TCO by default, or is there a way to tell it to?
If I define a simple tail-recursive function:
function tailtest(n)
if n==0; feature memstats; return; end
tailtest(n-1);
end
and call it so that it will recurse quite deeply:
set(0,'RecursionLimit',10000);
tailtest(1000);
then it doesn't look as if stack frames are eating a lot of memory. However, if I make it recurse much deeper:
set(0,'RecursionLimit',10000);
tailtest(5000);
then (on my machine, today) MATLAB simply crashes: the process unceremoniously dies.
I don't think this is consistent with MATLAB doing any TCO; the case where a function tail-calls itself, only in one place, with no local variables other than a single argument, is just about as simple as anyone could hope for.
So: No, it appears that MATLAB does not do TCO at all, at least by default. I haven't (so far) looked for options that might enable it. I'd be surprised if there were any.
In cases where we don't blow out the stack, how much does recursion cost? See my comment to Bill Cheatham's answer: it looks like the time overhead is nontrivial but not insane.
... Except that Bill Cheatham deleted his answer after I left that comment. OK. So, I took a simple iterative implementation of the Fibonacci function and a simple tail-recursive one, doing essentially the same computation in both, and timed them both on fib(60). The recursive implementation took about 2.5 times longer to run than the iterative one. Of course the relative overhead will be smaller for functions that do more work than one addition and one subtraction per iteration.
(I also agree with delnan's sentiment: highly-recursive code of the sort that feels natural in Haskell is typically likely to be unidiomatic in MATLAB.)
There is a simple way to check this. Create this function tail_recursion_check:
function r = tail_recursion_check(n)
if n > 1
r = tail_recursion_check(n - 1);
else
error('error');
end
end
and run tail_recursion_check(10), for example. You are going to see a very long stack trace with 10 items that says error at line 3. If there were tail call optimization, you would only see one.

Perl memory usage profiling and leak detection?

I wrote a persistent network service in Perl that runs on Linux.
Unfortunately, as it runs, its Resident Stack Size (RSS) just grows, and grows, and grows, slowly but surely.
This is despite diligent efforts on my part to expunge all unneeded hash keys and delete all references to objects that would otherwise cause reference counts to remain in place and obstruct garbage collection.
Are there any good tools for profiling the memory usage associated with various native data primitives, blessed hash reference objects, etc. within a Perl program? What do you use for tracking down memory leaks?
I do not habitually spend time in the Perl debugger or any of the various interactive profilers, so a warm, gentle, non-esoteric response would be appreciated. :-)
You could have a circular reference in one of your objects. When the garbage collector comes along to deallocate this object, the circular reference means that everything referred to by that reference will never get freed. You can check for circular references with Devel::Cycle and Test::Memory::Cycle. One thing to try (although it might get expensive in production code, so I'd disable it when a debug flag is not set) is checking for circular references inside the destructor for all your objects:
# make this be the parent class for all objects you want to check;
# or alternatively, stuff this into the UNIVERSAL class's destructor
package My::Parent;
use strict;
use warnings;
use Devel::Cycle; # exports find_cycle() by default
sub DESTROY
{
my $this = shift;
# callback will be called for every cycle found
find_cycle($this, sub {
my $path = shift;
foreach (#$path)
{
my ($type,$index,$ref,$value) = #$_;
print STDERR "Circular reference found while destroying object of type " .
ref($this) . "! reftype: $type\n";
# print other diagnostics if needed; see docs for find_cycle()
}
});
# perhaps add code to weaken any circular references found,
# so that destructor can Do The Right Thing
}
You can use Devel::Leak to search for memory leaks. However, the documentation is pretty sparse... for example, just where does one get the $handle reference to pass to Devel::Leak::NoteSV()? f I find the answer, I will edit this response.
Ok it turns out that using this module is pretty straightforward (code stolen shamelessly from Apache::Leak):
use Devel::Leak;
my $handle; # apparently this doesn't need to be anything at all
my $leaveCount = 0;
my $enterCount = Devel::Leak::NoteSV($handle);
print STDERR "ENTER: $enterCount SVs\n";
# ... code that may leak
$leaveCount = Devel::Leak::CheckSV($handle);
print STDERR "\nLEAVE: $leaveCount SVs\n";
I'd place as much code as possible in the middle section, with the leaveCount check as close to the end of execution (if you have one) as possible -- after most variables have been deallocated as possible (if you can't get a variable out of scope, you can assign undef to it to free whatever it was pointing to).
What next to try (not sure if this would be best placed in a comment after Alex's question above though): What I'd try next (other than Devel::Leak):
Try to eliminate "unnecessary" parts of your program, or segment it into separate executables (they could use signals to communicate, or call each other with command-line arguments perhaps) -- the goal is to boil down an executable into the smallest amount of code that still exhibits the bad behaviour. If you're sure it's not your code that's doing it, reduce the number of external modules you're using, particularly those that have an XS implementation. If perhaps it is your own code, look for anything potentially fishy:
definitely any use of Inline::C or XS code
direct use of references, e.g. \#list or \%hash, rather than preallocated references like [ qw(foo bar) ] (the former creates another reference which may get lost; in the latter, there is just one reference to worry about, which is usually stored in a local lexical scalar
manipulating variables indirectly, e.g. $$foo where $foo is modified, which can cause autovivication of variables (although you need to disable strict 'refs' checking)
I recently used NYTProf as a profiler for a large Perl application. It doesn't track memory usage, but it does trace all executed code paths which helps with finding out where leaks originate. If what you are leaking is scarce resources such as database connections, tracing where they are allocated and closed goes a long way towards finding leaks.
A nice guide about this is included in the Perl manual : Debugging Perl memory usage