Reducing code verbosity and improving efficiency - Perl

I came across the code below: after some heavy filtering is done we end up with a number of @hits, and we need to return just one:
if ($#hits > 0)
{
    my $highestScore = 0;
    my $chosenMatch = "";
    for $hit (@hits)
    {
        my $currScore = 0;
        foreach $k (keys %{$hit})
        {
            next if $k eq $retColumn;
            $currScore++ if ($hit->{$k} =~ /\S+/);
        }
        if ($currScore > $highestScore)
        {
            $chosenMatch = $hit;
            $highestScore = $currScore;
        }
    }
    return ($chosenMatch);
}
elsif ($#hits == 0)
{
    return ($hits[0]);
}
That's an eyeful, and I was hoping to simplify the above code. I came up with:
return reduce {grep /\S+/, values %{$a} > grep /\S+/, values %{$b} ? $a : $b} @matches;
After, of course, using List::Util.
I wonder if the terse version is any more efficient, or has any advantage over the original one. Also, there's one condition the terse version skips: next if $k eq $retColumn. How can I efficiently get that in?

There is a famous quote:
"Premature optimisation is the root of all evil" - Donald Knuth
It is almost invariably the case that making code more concise really doesn't make much difference to the efficiency, and can cause significant penalties to readability and maintainability.
Algorithm is important, code layout ... isn't really. Things like reduce, map and grep are still looping - they're just doing so behind the scenes. You've gained almost no efficiency by using them, you've just saved some bytes in your file. That's fine if they make your code more clear, but that should be your foremost consideration.
Please - keep things clear first, foremost and always. Make your algorithm good. Don't worry about replacing an explicit loop with a grep or map unless these things make your code clearer.
And in the interests of being constructive:
use strict and use warnings are really important. Really, really important.
To answer your original question:
I wonder if the terse version is any more efficient, or has any advantage over the original one
No, I think if anything the opposite. Short of profiling code speed, the rule of thumb is to look at the number and size of loops - a single chunk of code rarely makes much difference, but running it lots and lots of times (unnecessarily) is where you get your inefficiency.
In your first example - you have two loops, a foreach loop inside a for loop. It looks like you traverse your @hits data structure once, and 'unwrap' it to get at the inner layers.
In your second example, both your greps are loops, and your reduce is as well. If I'm reading it correctly, it'll be traversing your data structure multiple times, because you are grepping the values of $a and $b - and those greps are applied at every comparison.
So I don't think you have gained either readability or efficiency by doing what you've done. But you have made a function that's going to make future maintenance programmers have to think really hard. To take another quote:
"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?" - Brian Kernighan

I wonder if the terse version is any more efficient, or has any advantage over the original one
The terse version is less efficient than the original because it calculates the score of every element twice, but it does have readability advantages.
The following keeps the readability gain (and even adds some):
sub get_score {
    my ($match) = @_;
    my @keys = grep { $_ ne $retColumn } keys %$match;
    my $score = grep { /\S/ } @{$match}{ @keys };
    return $score;
}

return reduce { get_score($a) > get_score($b) ? $a : $b } @matches;
You can look at any part of that sub and understand it without looking around. The less context you need to understand code, the more readable it is.
If you did need an efficiency boost, you can avoid calling get_score on every input twice by using a Schwartzian Transform. As with many optimizations, you will take a readability hit, but at least it's idiomatic (well known and thus well recognizable).
return
    map { $_->[0] }
    reduce { $a->[1] > $b->[1] ? $a : $b }
    map { [ $_, get_score($_) ] }    # $_ is the current element of @matches
    @matches;


Suggestions for optimizing sieve of Eratosthenes in Perl

use warnings;
use strict;
my $in = <STDIN>;
my @array = (1 .. $in);
foreach my $j (2 .. sqrt($in)) {
    for (my $i = $j * 2; $i <= $in; $i += $j) {
        delete($array[$i - 1]);
    }
}
delete($array[0]);
open FILE, ">", "I:\\Perl_tests\\primes.dat";
foreach my $i (@array) {
    if ($i) {
        print FILE $i, "\n";
    }
}
I'm sure there is a better way to build the array of all numbers, but I don't really know of a way to do it in Perl. I'm pretty inexperienced in Perl. Any recommendations for speeding it up are greatly appreciated. Thanks a ton!
The fastest solution is using a module (e.g. ntheory) if you just want primes and don't care how you get them. That will be massively faster and use much less memory.
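For example, a minimal sketch assuming the ntheory module (a.k.a. Math::Prime::Util) is installed; its primes() function returns an array reference:
use ntheory qw(primes);

my $limit  = 1_000_000;
my $primes = primes($limit);                      # array ref of all primes up to $limit
print scalar(@$primes), " primes up to $limit\n";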
I suggest looking at the RosettaCode Sieve task. There are a number of simple sieves shown, a fast string version, some weird examples, and a couple extensible sieves.
Even the basic one uses less memory than your example, and the vector and string versions shown there use significantly less memory.
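To give a feel for the approach, here is a minimal bit-vector sketch of my own (not the RosettaCode code itself); vec() packs one bit per candidate, so sieving to $n costs roughly $n/8 bytes:
use strict;
use warnings;

sub sieve_primes {
    my ($n) = @_;
    my $composite = '';
    vec($composite, $n, 1) = 0;                 # pre-extend the bit string in one allocation
    for my $i (2 .. int sqrt $n) {
        next if vec($composite, $i, 1);         # $i was already struck out
        for (my $j = $i * $i; $j <= $n; $j += $i) {
            vec($composite, $j, 1) = 1;         # mark multiples of $i composite
        }
    }
    return grep { !vec($composite, $_, 1) } 2 .. $n;
}

print join(", ", sieve_primes(100)), "\n";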
Sieve of Atkin is rarely a good method unless you are Dan Bernstein. Even then, it's slower than fast SoE implementations.
That's a fairly inefficient way to look for primes, so unless you absolutely have to use the Sieve of Eratosthenes, the Sieve of Atkin is probably a faster algorithm for finding primes.
With regard to the memory usage, here's a Perlified version of this Python answer. Instead of striking out the numbers from a big up-front array of all the integers, it tracks which primes are divisors of non-prime numbers, and then masks out the next non-prime number after each iteration. This means that you can generate as many primes as you have precision for, without using all the RAM, or any more memory than you have to. More code and more indirection make this version slower than what you have already, though.
#!/usr/bin/perl
use warnings;
use strict;

sub get {
    my $this = shift;
    if ($this->{next} == 2) {
        $this->{next} = 3;
        return 2;
    }
    while (1) {
        my $next = $this->{next};
        $this->{next} += 2;
        if (not exists $this->{used}{$next}) {
            $this->{used}{$next * $next} = [$next];
            return $next;
        } else {
            foreach my $x (@{$this->{used}{$next}}) {
                # step by 2*$x: we only ever visit odd numbers, and
                # $next + $x would be even, so it would never be seen
                push(@{$this->{used}{$next + 2 * $x}}, $x);
            }
            delete $this->{used}{$next};
        }
    }
}

sub buildSieve {
    my $this = {
        "used" => {},
        "next" => 2,
    };
    bless $this;
    return $this;
}

my $sieve = buildSieve();
foreach (1 .. 100) {
    print $sieve->get() . "\n";
}
You should be able to combine the better algorithm with the more memory-efficient generative version above to come up with a good solution, though.
If you want to see how the algorithm works in detail, it's quite instructive to use Data::Dumper and print out $sieve in between each call to get().
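For example, something along these lines:
use Data::Dumper;

my $sieve = buildSieve();
foreach (1 .. 5) {
    print $sieve->get(), "\n";
    print Dumper($sieve);    # shows which composites are tracked, and by which prime divisors
}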

C-style loops vs Perl loops (in Perl)

I feel like there is something I don't get about Perl's looping mechanism.
It was my understanding that
for my $j (0 .. $#arr){...}
was functionally equivalent to:
for(my $i=0; $i<=$#arr; $i++){..}
However, in my code there seem to be some slight differences in the way they operate - specifically, the point at which they decide to terminate. For example:
Assume @arr is initialized with one element in it.
These two blocks should do the same thing, right?
for my $i (0 .. $#arr)
{
    if (some condition that happens to be true)
    {
        push(@arr, $value);
    }
}
and
for (my $i=0; $i<=$#arr; $i++)
{
    if (some condition that happens to be true)
    {
        push(@arr, $value);
    }
}
In execution, however, even though a new value gets pushed in both cases, the first loop will stop after only one iteration.
Is this supposed to happen? If so, why?
EDIT: Thank you for all of your answers. I am aware I can accomplish the same thing with other looping mechanisms. When I asked if there was another syntax, I was specifically talking about using for - and obviously there isn't, as the syntax to do what I want is already achieved with the C style. I was only asking because I was told to avoid the C style, but I still like my for loops.
$i<=$#arr is evaluated before each iteration, while (0 .. $#arr) is evaluated once, before the first iteration.
As such, the first version doesn't "see" changes to the size of @arr.
Is there another syntax I can use that would force the evaluation after each iteration? (besides using c-style)
for (my $i=0; $i<=$#arr; $i++) {
    ...
}
is just another way of writing
my $i=0;
while ($i<=$#arr) {
    ...
} continue {
    $i++;
}
(Except the scope of $i is slightly different.)
An alternative would be the do-while construct, although it is a little ungainly.
my $i;
do {
    push @arr, $value if condition;
} while ( $i++ < @arr );

Are references better return values in Perl functions?

What are the pros and cons of returning an array or a hash compared to returning a reference on it?
Is there an impact on memory or execution time?
What are the functional differences between the two?
sub i_return_an_array
{
    my @a = ();
    # push things in @a;
    return @a;
}

sub i_return_a_ref
{
    my @a = ();
    # push things in @a;
    return \@a;
}

my @v = i_return_an_array();
my $v = i_return_a_ref();
Yes, there is an impact on memory and execution time - returning a reference returns a single (relatively small) scalar and nothing else. Returning an array or hash as a list makes a shallow copy of the array/hash and returns that, which can take up substantial time to make the copy and memory to store the copy if the array/hash is large.
The functional difference between the two is simply a question of whether you work with the result as an array/hash or as an arrayref/hashref. Some people consider references much more cumbersome to work with; personally, I don't consider it a significant difference.
Another functional difference is that you can't return multiple arrays or hashes as lists (they get flattened into a single list), but you can return multiple references. When this comes up, it's a killer detail that forces you to use references, but my experience is that it only comes up very rarely, so I don't know how important I'd consider it to be overall.
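To make the flattening concrete, here is a small sketch (the sub names are made up for illustration):
sub two_lists {
    my @a = (1, 2);
    my @b = (3, 4);
    return (@a, @b);            # flattens into one four-element list
}

sub two_refs {
    my @a = (1, 2);
    my @b = (3, 4);
    return (\@a, \@b);          # both arrays survive the trip intact
}

my (@x, @y) = two_lists();      # @x swallows all four values; @y is empty
my ($x, $y) = two_refs();       # $x->[1] == 2, $y->[0] == 3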
Going to the title question, I believe that the most important factor regarding returning lists vs. references is that you should be consistent so that you don't have to waste your time remembering which functions return arrays/hashes and which return references. Given that references are better in some situations and, at least for me, references are never significantly worse, I choose to standardize on always returning arrays/hashes as references rather than as lists.
(You could also choose to standardize on using wantarray in your subs so that they'll return lists in list context and references in scalar context, but I tend to consider that to be a largely pointless over-complication.)
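For reference, that context-sensitive approach would look something like this (a sketch; get_items is a made-up name):
sub get_items {
    my @items = qw(a b c);
    return wantarray ? @items : \@items;
}

my @list = get_items();         # list context: a copy of the contents
my $ref  = get_items();         # scalar context: a reference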
The performance impact is more noticeable as the array gets bigger, but it is only around 50% on Perl 5.10.
I usually prefer to return a reference, since it makes the code easier to read: a function has only one return value, and it avoids some pitfalls (automatic scalar evaluation) mentioned in the post referenced by manu_v.
#! /usr/bin/perl
use Benchmark;

sub i_return_an_array
{
    my @a = (1 .. shift);
    # push things in @a;
    return @a;
}

sub i_return_a_ref
{
    my @a = (1 .. shift);
    # push things in @a;
    return \@a;
}

for my $nb (1, 10, 100, 1000, 10000) {
    Benchmark::cmpthese(0, {
        "array_$nb" => sub { my @v = i_return_an_array($nb); },
        "ref_$nb"   => sub { my $v = i_return_a_ref($nb); },
    });
}
returns:
                Rate    ref_1  array_1
ref_1       702345/s       --      -3%
array_1     722083/s       3%       --
                Rate array_10   ref_10
array_10    230397/s       --     -29%
ref_10      324620/s      41%       --
                Rate array_100  ref_100
array_100    27574/s        --     -47%
ref_100      52130/s       89%       --
                Rate array_1000 ref_1000
array_1000    2891/s         --     -51%
ref_1000      5855/s       103%       --
                Rate array_10000 ref_10000
array_10000    299/s          --      -48%
ref_10000      578/s         93%        --
On other versions of Perl, the figures might be different.
As far as the interface is concerned, returning an array allows you to process the result more easily with things like map and grep, without having to resort to @{bletchorousness}.
But with hashes, it's often more useful to return a reference, because then you can do clever things like my $val = function()->{key} without having to assign to an intermediate variable.
Short answer: it depends on the size of what you return. If it's small, you can return the whole array.
For details, see Stack Overflow question Is returning a whole array from a Perl subroutine inefficient?.
Note that it's not actually possible to return an array or a hash from a function. The only thing that a function can return is a list of scalars. When you say return #a, you are returning the contents of the array.
In certain programs, it works out more cleanly to return the contents of the array. In certain programs, it works out more cleanly to return a reference to the array. Make your decision case by case.
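For instance, a returned hash really is its contents as a key/value list, which the caller can reassemble or not (a minimal sketch; get_config is a made-up name):
sub get_config {
    my %h = ( host => 'localhost', port => 8080 );
    return %h;                  # leaves the sub as the list ( host => 'localhost', port => 8080 )
}

my %copy  = get_config();       # rebuilt into a hash from the returned list
my @pairs = get_config();       # the same data seen as a flat four-element list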

How fast is Perl's smartmatch operator when searching for a scalar in an array?

I want to repeatedly search for values in an array that does not change.
So far, I have been doing it this way: I put the values in a hash (so I have an array and a hash with essentially the same contents) and I search the hash using exists.
I don't like having two different variables (the array and the hash) that both store the same thing; however, the hash is much faster for searching.
I found out that there is a ~~ (smartmatch) operator in Perl 5.10. How efficient is it when searching for a scalar in an array?
If you want to search for a single scalar in an array, you can use List::Util's first subroutine. It stops as soon as it knows the answer. I don't expect this to be faster than a hash lookup if you already have the hash, but when you consider creating the hash and having it in memory, it might be more convenient for you to just search the array you already have.
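For example (a minimal sketch of List::Util's first):
use List::Util qw(first);

my @haystack = qw(apple banana cherry);
my $found   = first { $_ eq 'banana' } @haystack;   # 'banana'; the scan stops here
my $missing = first { $_ eq 'mango' } @haystack;    # undef when nothing matches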
As for the smarts of the smart-match operator, if you want to see how smart it is, test it. :)
There are at least three cases you want to examine. The worst case is that every element you want to find is at the end. The best case is that every element you want to find is at the beginning. The likely case is that the elements you want to find average out to being in the middle.
Now, before I start this benchmark, I expect that if the smart match can short-circuit (and it can; it's documented in perlsyn), the best-case times will stay the same despite the array size, while the other ones get increasingly worse. If it can't short-circuit and has to scan the entire array every time, there should be no difference in the times, because every case involves the same amount of work.
Here's a benchmark:
#!perl
use 5.12.2;
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @hits = qw(A B C);
my @base = qw(one two three four five six) x ( $ARGV[0] || 1 );

my @at_end       = ( @base, @hits );
my @at_beginning = ( @hits, @base );

my @in_middle = @base;
splice @in_middle, int( @in_middle / 2 ), 0, @hits;

my @random = @base;
foreach my $item ( @hits ) {
    my $index = int rand @random;
    splice @random, $index, 0, $item;
}

sub count {
    my( $hits, $candidates ) = @_;
    my $count;
    foreach ( @$hits ) { when( $candidates ) { $count++ } }
    $count;
}

cmpthese(-5, {
    hits_beginning => sub { my $count = count( \@hits, \@at_beginning ) },
    hits_end       => sub { my $count = count( \@hits, \@at_end ) },
    hits_middle    => sub { my $count = count( \@hits, \@in_middle ) },
    hits_random    => sub { my $count = count( \@hits, \@random ) },
    control        => sub { my $count = count( [], [] ) },
    }
);
Here's how the various parts did. Note that this is a logarithmic plot on both axes, so the slopes of the plunging lines aren't as close as they look:
So, it looks like the smart match operator is a bit smart, but that doesn't really help you because you still might have to scan the entire array. You probably don't know ahead of time where you'll find your elements. I expect a hash will perform the same as the best case smart match, even if you have to give up some memory for it.
Okay, so the smart match being smart times two is great, but the real question is "Should I use it?". The alternative is a hash lookup, and it's been bugging me that I haven't considered that case.
As with any benchmark, I start off thinking about what the results might be before I actually test them. I expect that if I already have the hash, looking up a value is going to be lightning fast. That case isn't a problem. I'm more interested in the case where I don't have the hash yet. How quickly can I make the hash and lookup a key? I expect that to perform not so well, but is it still better than the worst case smart match?
Before you see the benchmark, though, remember that there's almost never enough information about which technique you should use just by looking at the numbers. The context of the problem selects the best technique, not the fastest, contextless micro-benchmark. Consider a couple of cases that would select different techniques:
You have one array you will search repeatedly
You always get a new array that you only need to search once
You get very large arrays but have limited memory
Now, keeping those in mind, I add to my previous program:
my %old_hash = map {$_,1} @in_middle;

cmpthese(-5, {
    ...,
    new_hash => sub {
        my %h = map {$_,1} @in_middle;
        my $count = 0;
        foreach ( @hits ) { $count++ if exists $h{$_} }
        $count;
    },
    old_hash => sub {
        my $count = 0;
        foreach ( @hits ) { $count++ if exists $old_hash{$_} }
        $count;
    },
    control_hash => sub {
        my $count = 0;
        foreach ( @hits ) { $count++ }
        $count;
    },
    }
);
Here's the plot. The colors are a bit difficult to distinguish. The lowest line there is the case where you have to create the hash any time you want to search it. That's pretty poor. The highest two (green) lines are the control for the hash (no hash actually there) and the existing hash lookup. This is a log/log plot; those two cases are faster than even the smart match control (which just calls a subroutine).
There are a few other things to note. The lines for the "random" case are a bit different. That's understandable because each benchmark (so, once per array scale run) randomly places the hit elements in the candidate array. Some runs put them a bit earlier and some a bit later, but since I only make the #random array once per run of the entire program, they move around a bit. That means that the bumps in the line aren't significant. If I tried all positions and averaged, I expect that "random" line to be the same as the "middle" line.
Now, looking at these results, I'd say that a smart-match is much faster in its worst case than the hash lookup is in its worst case. That makes sense. To create a hash, I have to visit every element of the array and also make the hash, which is a lot of copying. There's no copying with the smart match.
Here's a further case I won't examine though. When does the hash become better than the smart match? That is, when does the overhead of creating the hash spread out enough over repeated searches that the hash is the better choice?
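One hedged way to probe that crossover (a sketch of my own, separate from the benchmarks above) is to pay the hash-construction cost once per trial and vary how many searches it is amortized over:
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(first);

my @candidates = ('aa' .. 'zz');                    # 676 strings to search
my @probes     = qw(aa mm zz nope);

for my $searches (1, 10, 100) {
    cmpthese(-2, {
        "hash_$searches" => sub {
            my %seen = map { $_ => 1 } @candidates; # built once, reused for every search
            my $count = 0;
            for (1 .. $searches) {
                for my $p (@probes) { $count++ if exists $seen{$p} }
            }
        },
        "scan_$searches" => sub {
            my $count = 0;
            for (1 .. $searches) {
                for my $p (@probes) {
                    # short-circuiting linear scan, standing in for the smart match
                    $count++ if defined first { $_ eq $p } @candidates;
                }
            }
        },
    });
}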
Fast for small numbers of potential matches, but not faster than the hash. Hashes are really the right tool for testing set membership. Since hash access is O(1) and smartmatch on an array is still an O(n) linear scan (albeit short-circuiting, unlike grep), smartmatch gets relatively worse as the number of allowed matches grows.
Benchmark code (matching against 3 values):
#!perl
use 5.12.0;
use Benchmark qw(cmpthese);

my @hits = qw(one two three);
my @candidates = qw(one two three four five six); # 50% hit rate

my %hash;
@hash{@hits} = ();

sub count_hits_hash {
    my $count = 0;
    for (@_) {
        $count++ if exists $hash{$_};
    }
    $count;
}

sub count_hits_smartmatch {
    my $count = 0;
    for (@_) {
        $count++ when @hits;
    }
    $count;
}

say count_hits_hash(@candidates);
say count_hits_smartmatch(@candidates);

cmpthese(-5, {
    hash       => sub { count_hits_hash((@candidates) x 1000) },
    smartmatch => sub { count_hits_smartmatch((@candidates) x 1000) },
    }
);
Benchmark results:
             Rate smartmatch  hash
smartmatch  404/s         --  -65%
hash       1144/s       183%    --
The "smart" in "smart match" isn't about the searching. It's about doing the right thing at the right time based on context.
The question of whether it's faster to loop through an array or index into a hash is something you'd have to benchmark, but in general, it'd have to be a pretty small array to be quicker to skim through than indexing into a hash.

Is returning a whole array from a Perl subroutine inefficient?

I often have a subroutine in Perl that fills an array with some information. Since I'm also used to hacking in C++, I often find myself doing it like this in Perl, using references:
my @array;
getInfo(\@array);

sub getInfo {
    my ($arrayRef) = @_;
    push @$arrayRef, "obama";
    # ...
}
instead of the more straightforward version:
my @array = getInfo();

sub getInfo {
    my @array;
    push @array, "obama";
    # ...
    return @array;
}
The reason, of course, is that I don't want the array to be created locally in the subroutine and then copied on return.
Is that right? Or does Perl optimize that away anyway?
What about returning an array reference in the first place?
sub getInfo {
    my $array_ref = [];
    push @$array_ref, 'foo';
    # ...
    return $array_ref;
}

my $a_ref = getInfo();
# or if you want the array expanded
my @array = @{ getInfo() };
Edit according to dehmann's comment:
It's also possible to use a normal array in the function and return a reference to it.
sub getInfo {
    my @array;
    push @array, 'foo';
    # ...
    return \@array;
}
Passing references is more efficient, but the difference is not as big as in C++. The argument values themselves (that means: the values in the array) are always passed by reference anyway (returned values are copied, though).
The question is: does it matter? Most of the time, it doesn't. If you're returning 5 elements, don't bother about it. If you're returning/passing 100,000 elements, use references. Only optimize it if it's a bottleneck.
If I look at your example and think about what you want to do, this is how I would usually write it:
sub getInfo {
    my @array;
    push @array, 'obama';
    # ...
    return \@array;
}
This seems to me the most straightforward version when I need to return a large amount of data. There is no need to allocate the array outside the sub, as you do in your first code snippet, because my does it for you. In any case, you should not do premature optimization, as Leon Timmermans suggests.
To answer the final rumination, no, Perl does not optimize this away. It can't, really, because returning an array and returning a scalar are fundamentally different.
If you're dealing with large amounts of data or if performance is a major concern, then your C habits will serve you well - pass and return references to data structures rather than the structures themselves so that they won't need to be copied. But, as Leon Timmermans pointed out, the vast majority of the time, you're dealing with smaller amounts of data and performance isn't that big a deal, so do it in whatever way seems most readable.
This is the way I would normally return an array.
sub getInfo {
    my @array;
    push @array, 'foo';
    # ...
    return @array if wantarray;
    return \@array;
}
This way it will work the way you want, in scalar or list context.
my $array = getInfo;
my @array = getInfo;
$array->[0] == $array[0];
# same length
@$array == @array;
I wouldn't try to optimize it unless you know it is a slow part of your code. Even then I would use benchmarks to see which subroutine is actually faster.
There are two considerations. The obvious one is how big is your array going to get? If it's less than a few dozen elements, then size is not a factor (unless you're micro-optimizing for some rapidly called function, but you'd have to do some memory profiling to prove that first).
That's the easy part. The oft overlooked second consideration is the interface. How is the returned array going to be used? This is important because whole array dereferencing is kinda awful in Perl. For example:
for my $info (@{ getInfo($some, $args) }) {
    ...
}
That's ugly. This is much better.
for my $info ( getInfo($some, $args) ) {
    ...
}
It also lends itself to mapping and grepping.
my @info = grep { ... } getInfo($some, $args);
But returning an array ref can be handy if you're going to pick out individual elements:
my $address = getInfo($some, $args)->[2];
That's simpler than:
my $address = (getInfo($some, $args))[2];
Or:
my @info = getInfo($some, $args);
my $address = $info[2];
But at that point, you should question whether @info is truly a list or a hash.
my $address = getInfo($some, $args)->{address};
What you should not do is have getInfo() return an array ref in scalar context and an array in list context. This muddles the traditional use of scalar context as array length, which will surprise the user.
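A small sketch of the surprise (getInfo here is the hypothetical sub under discussion):
sub getInfo {
    my @info = qw(a b c);
    return wantarray ? @info : \@info;   # ref in scalar context: looks clever, but...
}

my $n = getInfo();     # reads like "how many elements?"
print "$n\n";          # ...prints something like ARRAY(0x55f1c8), not 3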
Finally, I will plug my own module, Method::Signatures, because it offers a compromise for passing in array references without having to use the array ref syntax.
use Method::Signatures;

method foo(\@args) {
    print "@args";     # @args is not a copy
    push @args, 42;    # this alters the caller's array
}

my @nums = (1, 2, 3);
Class->foo(\@nums);    # prints 1 2 3
print "@nums";         # prints 1 2 3 42
This is done through the magic of Data::Alias.
Three other potentially LARGE performance improvements if you are reading an entire, largish file and slicing it into an array (a sketch follows this list):
- Turn off buffering, using sysread() instead of read() (the manual warns about mixing them)
- Pre-extend the array by assigning to its last element - this saves memory allocations
- Use unpack() to swiftly split up data such as uint16_t graphics channel data
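A hedged sketch of all three tips together; the file name and the little-endian uint16 layout are assumptions made for illustration:
use strict;
use warnings;

my $file = 'channels.dat';                  # hypothetical raw data file
open my $fh, '<:raw', $file or die "open $file: $!";
my $size = -s $fh;

my $raw = '';
my $got = sysread($fh, $raw, $size);        # unbuffered read, no PerlIO buffering layer
die "short read" unless defined $got and $got == $size;
close $fh;

my @samples;
$#samples = int($size / 2) - 1;             # pre-extend so the assignment below never regrows the array
@samples  = unpack 'v*', $raw;              # 'v*' = a stream of little-endian uint16 values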
Passing an array ref to the function allows the main program to deal with a simple array, while the write-once-and-forget worker function uses the more complicated $# and ->[$II] access forms. Being quite C-ish, it is likely to be fast!
I know nothing about Perl so this is a language-neutral answer.
It is, in a sense, inefficient to copy an array from a subroutine into the calling program. The inefficiency arises in the extra memory used and the time taken to copy the data from one place to another. On the other hand, for all but the largest arrays, you might not give a damn, and might prefer to copy arrays out for elegance, cussedness or any other reason.
The efficient solution is for the subroutine to pass the calling program the address of the array. As I say, I haven't a clue about Perl's default behaviour in this respect. But some languages provide the programmer the option to choose which approach.