How would I do the equivalent of Prototype's Enumerator.detect in Perl with the least amount of code? - perl

Lately I've been thinking a lot about functional programming. Perl offers quite a few tools to go that way, however there's something I haven't been able to find yet.
Prototype has the function detect for enumerators, the descriptions is simply this:
Enumerator.detect(iterator[, context]) -> firstElement | undefined
Finds the first element for which the iterator returns true.
Enumerator in this case is any list while iterator is a reference to a function, which is applied in turn on each element of the list.
I am looking for something like this to apply in situations where performance is important, i.e. when stopping upon encountering a match saves time by disregarding the rest of the list.
I am also looking for a solution that would not involve loading any extra module, so if possible it should be done with builtins only. And if possible, it should be as concise as this for example:
my #result = map function #array;

You say you don't want a module, but this is exactly what the first function in List::Util does. That's a core module, so it should be available everywhere.
use List::Util qw(first);
my $first = first { some condition } #array;
If you insist on not using a module, you could copy the implementation out of List::Util. If somebody knew a faster way to do it, it would be in there. (Note that List::Util includes an XS implementation, so that's probably faster than any pure-Perl approach. It also has a pure-Perl version of first, in List::Util::PP.)
Note that the value being tested is passed to the subroutine in $_ and not as a parameter. This is a convenience when you're using the first { some condition} #values form, but is something you have to remember if you're using a regular subroutine. Some more examples:
use 5.010; # I want to use 'say'; nothing else here is 5.10 specific
use List::Util qw(first);
say first { $_ > 3 } 1 .. 10; # prints 4
sub wanted { $_ > 4 }; # note we're using $_ not $_[0]
say first \&wanted, 1 .. 10; # prints 5
my $want = \&wanted; # Get a subroutine reference
say first \&$want, 1 .. 10; # This is how you pass a reference in a scalar
# someFunc expects a parameter instead of looking at $_
say first { someFunc($_) } 1 .. 10;

Untested since I don't have Perl on this machine, but:
sub first(\&#) {
my $pred = shift;
die "First argument to "first" must be a sub" unless ref $pred eq 'CODE';
for my $val (#_) {
return $val if $pred->($val);
}
return undef;
}
Then use it as:
my $first = first { sub performing test } #list;
Note that this doesn't distinguish between no matches in the list and one of the elements in the list being an undefined value and having that match.

Just since its not here, a Perl function definition of first that localizes $_ for its block:
sub first (&#) {
my $code = shift;
for (#_) {return $_ if $code->()}
undef
}
my #array = 1 .. 10;
say first {$_ > 5} #array; # prints 6
While it will work fine, I don't advocate using this version, since List::Util is a core module (installed by default), and its implementation of first will usually use the XS version (written in C) which is much faster.

Related

Can I make a variable optional in a perl sub prototype?

I'd like to understand if it's possible to have a sub prototype and optional parameters in it. With prototypes I can do this:
sub some_sub (\#\#\#) {
...
}
my #foo = qw/a b c/;
my #bar = qw/1 2 3/;
my #baz = qw/X Y Z/;
some_sub(#foo, #bar, #baz);
which is nice and readable, but the minute I try to do
some_sub(#foo, #bar);
or even
some_sub(#foo, #bar, ());
I get errors:
Not enough arguments for main::some_sub at tablify.pl line 72, near "#bar)"
or
Type of arg 3 to main::some_sub must be array (not stub) at tablify.pl line 72, near "))"
Is it possible to have a prototype and a variable number of arguments? or is something similar achievable via signatures?
I know it could be done by always passing arrayrefs I was wondering if there was another way. After all, TMTOWTDI.
All arguments after a semi-colon are optional:
sub some_sub(\#\#;\#) {
}
Most people are going to expect your argument list to flatten, and you are reaching for an outdated tool to do what people don't expect.
Instead, pass data structures by reference:
some_sub( \#array1, \#array2 );
sub some_sub {
my #args = #_;
say "Array 1 has " . $args[0]->#* . " elements";
}
If you want to use those as named arrays within the sub, you can use ref aliasing
use v5.22;
use experimental qw(ref_aliasing);
sub some_sub {
\my( #array1 ) = $_[0];
...
}
With v5.26, you can move the reference operator inside the parens:
use v5.26;
use experimental qw(declared_refs);
sub some_sub {
my( \#array1 ) = $_[0];
...
}
And, remember that v5.20 introduced the :prototype attribute so you can distinguish between prototypes and signatures:
use v5.20;
sub some_sub :prototype(##;#) { ... }
I write about these things at The Effective Perler (which you already read, I see), in Perl New Features, a little bit in Preparing for Perl 7 (which is mostly about what you need to stop doing in Perl 5 to be future proof).

Does perl cache regex generation?

Suppose I have a function that dynamically generates regular expressions and then matches against them.
For example, in the following function match_here a \G anchor is inserted at the beginning of the regex. This simplifies the API because the caller does not need to remember to include the pos anchor in the pattern.
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use Data::Dumper;
sub match_here {
my ($str, $index, $rg) = #_;
pos($str) = $index;
croak "index ($index) out of bounds" unless pos($str) == $index;
my $out;
if ($str =~ /\G$rg/) {
$out = $+[0];
}
return $out;
}
# no match starting at position 0
# prints '$VAR1 = undef;'
print Dumper(match_here("abc", 0, "b+"));
# match from 1 to 2
# prints '$VAR1 = 2;'
print Dumper(match_here("abc", 1, "b+"));
I'm wondering whether an anonymous regex object is "compiled" every time the function is evaluated or if there's some caching so that identical strings will not cause additional regex objects to be compiled.
Also, assuming that no caching is done by the Perl interpreter, is compiling a regex object expensive enough to be worth caching (possibly in an XS extension)?
From perlop(1), under the m// operator:
PATTERN may contain variables, which will be interpolated every time the pattern search is evaluated
[...]
Perl will not recompile the pattern unless an interpolated variable that it contains changes. You can force Perl to skip the test and never recompile by adding a "/o" (which stands for "once") after the trailing delimiter. Once upon a time, Perl would recompile regular expressions unnecessarily, and this modifier was useful to tell it not to do so, in the interests of speed.
So yes, there is a cache, and you can even force the use of the cache even when it's invalid by saying /o, but you really shouldn't do that.
But that cache only stores one compiled regexp per instance of the m// or s/// operator, so it only helps if the regexp is used with the same variables (e.g. your $rg) many times consecutively. If you alternate between calling it with $rg='b+' and $rg='c+' you will get a recompile every time.
For that kind of situation, you can do your own caching with the qr// operator. It explicitly compiles the regexp and returns an object that you can store and use to execute the regexp later. That could be incorporated into your match_here like this:
use feature 'state';
sub match_here {
my ($str, $index, $rg) = #_;
pos($str) = $index;
croak "index ($index) out of bounds" unless pos($str) == $index;
my $out;
state %rg_cache;
my $crg = $rg_cache{$rg} ||= qr/\G$rg/;
if ($str =~ /$crg/) {
$out = $+[0];
}
return $out;
}
To add more detail on the basic cache (when not using qr//): the fact that $rg is a newly allocated lexical variable each time makes no difference. It only matters that the value is the same as the previous one.
Here's an example to prove the point:
use re qw(Debug COMPILE);
while(<>) {
chomp;
# Insane interpolation. Do not use anything remotely like this in real code
print "MATCHED: $_\n" if /^${\(`cat refile`)}/;
}
Every time the match operator executes, it reads refile. The regular expression is ^ followed by the contents of refile. The debugging output shows that it is recompiled only if the contents of the file have changed. If the file still has the same contents as the last time, the operator notices that the same string is being passed to the regexp compiler again, and reuses the cached result.
Or try this less dramatic example:
use re qw(Debug COMPILE);
#patterns = (
'\d{3}',
'\d{3}',
'[aeiou]',
'[aeiou]',
'\d{3}',
'\d{3}'
);
for ('xyz', '123', 'other') {
for $i (0..$#patterns) {
if(/$patterns[$i]/) {
print "$_ matches $patterns[$i]\n";
} else {
print "$_ does not match $patterns[$i]\n";
}
}
}
in which there are 18 compilations and 11 of them are cache hits, even though the same "variable" (the same element of the #patterns array) is never used twice in a row.

in perl, is it bad practice to call multiple subroutines with default arguments?

I am learning perl and understand that it is a common and accepted practice to unpack subroutine arguments using shift. I also understand that it is common and acceptable practice to omit function arguments to use the default #_ array.
Considering these two things, if you call a subroutine without arguments, the #_ can (and will, if using shift) be changed. Does this mean that calling another subroutine with default arguments, or, in fact, using the #_ array after this, is considered bad practice? Consider this example:
sub total { # calculate sum of all arguments
my $running_sum;
# take arguments one by one and sum them together
while (#_) {
$running_sum += shift;
}
$running_sum;
}
sub avg { calculate the mean of given arguments
if (#_ == 0) { return }
my $sum = &total; # gets the correct answer, but changes #_
$sum / #_ # causes division by zero, since #_ is now empty
}
My gut feeling tells me that using shift to unpack arguments would actually be bad practice, unless your subroutine is actually supposed to change the passed arguments, but I have read in multiple places, including Stack Overflow, that this is not a bad practice.
So the question is: if using shift is common practice, should I always assume the passed argument list could get changed, as a side-effect of the subroutine (like the &total subroutine in the quoted example)? Is there maybe a way to pass arguments by value, so I can be sure that the argument list does not get changed, so I could use it again (like in the &avg subroutine in the quoted text)?
In general, shifting from the arguments is ok—using the & sigil to call functions isn't. (Except in some very specific situations you'll probably never encounter.)
Your code could be re-written, so that total doesn't shift from #_. Using a for-loop may even be more efficient.
sub total {
my $total = 0;
$total += $_ for #_;
$total;
}
Or you could use the sum function from List::Util:
use List::Util qw(sum);
sub avg { #_ ? sum(#_) / #_ : 0 }
Using shift isn't that common, except for extracting $self in object oriented Perl. But as you always call your functions like foo( ... ), it doesn't matter if foo shifts or doesn't shift the argument array.
(The only thing worth noting about a function is whether it assigns to elements in #_, as these are aliases for the variables you gave as arguments. Assigning to elements in #_ is usually bad.)
Even if you can't change the implementation of total, calling the sub with an explicit argument list is safe, as the argument list is a copy of the array:
(a) &total — calls total with the identical #_, and overrides prototypes.
(b) total(#_) — calls total with a copy of #_.
(c) &total(#_) — calls total with a copy of #_, and overrides prototypes.
Form (b) is standard. Form (c) shouldn't be seen, except in very few cases for subs inside the same package where the sub has a prototype (and don't use prototypes), and they have to be overridden for some obscure reason. A testament to poor design.
Form (a) is only sensible for tail calls (#_ = (...); goto &foo) or other forms of optimization (and premature optimization is the root of all evil).
You should avoid using the &func; style of calling unless you have a really good reason, and trust that others do the same.
To guard your #_ against modification by a callee, just do &func() or func.
Perl is a little too lax sometimes and having multiple ways of accessing input parameters can make smelly and inconsistent code. For want of a better answer, try to impose your own standard.
Here's a few ways I've used and seen
Shifty
sub login
{
my $user = shift;
my $passphrase = shift;
# Validate authentication
return 0;
}
Expanding #_
sub login
{
my ($user, $passphrase) = #_;
# Validate authentication
return 0;
}
Explicit indexing
sub login
{
my user = $_[0];
my user = $_[1];
# Validate authentication
return 0;
}
Enforce parameters with function prototypes (this is not popular however)
sub login($$)
{
my ($user, $passphrase) = #_;
# Validate authentication
return 0;
}
Sadly you still have to perform your own convoluted input validation/taint checking, ie:
return unless defined $user;
return unless defined $passphrase;
or better still, a little more informative
unless (defined($user) && defined($passphrase)) {
carp "Input error: user or passphrase not defined";
return -1;
}
Perldoc perlsub should really be your first port of call.
Hope this helps!
Here are some examples where the careful use of #_ matters.
1. Hash-y Arguments
Sometimes you want to write a function which can take a list of key-value pairs, but one is the most common use and you want that to be available without needing a key. For example
sub get_temp {
my $location = #_ % 2 ? shift : undef;
my %options = #_;
$location ||= $options{location};
...
}
So now if you call the function with an odd number of arguments, the first is location. This allows get_temp('Chicago') or get_temp('New York', unit => 'C') or even get_temp( unit => 'K', location => 'Nome, Ak'). This may be a more convenient API for your users. By shifting the odd argument, now #_ is an even list and may be assigned to a hash.
2. Dispatching
Lets say we have a class that we want to be able to dispatch methods by name (possibly AUTOLOAD could be useful, we will hand roll). Perhaps this is a command line script where arguments are methods. In this case we define two dispatch methods one "clean" and one "dirty". If we call with the -c flag we get the clean one. These methods find the method by name and call it. The difference is how. The dirty one leaves itself in the stack trace, the clean one has to be more cleaver, but dispatches without being in the stack trace. We make a death method which gives us that trace.
#!/usr/bin/env perl
use strict;
use warnings;
package Unusual;
use Carp;
sub new {
my $class = shift;
return bless { #_ }, $class;
}
sub dispatch_dirty {
my $self = shift;
my $name = shift;
my $method = $self->can($name) or confess "No method named $name";
$self->$method(#_);
}
sub dispatch_clean {
my $self = shift;
my $name = shift;
my $method = $self->can($name) or confess "No method named $name";
unshift #_, $self;
goto $method;
}
sub death {
my ($self, $message) = #_;
$message ||= 'died';
confess "$self->{name}: $message";
}
package main;
use Getopt::Long;
GetOptions
'clean' => \my $clean,
'name=s' => \(my $name = 'Robot');
my $obj = Unusual->new(name => $name);
if ($clean) {
$obj->dispatch_clean(#ARGV);
} else {
$obj->dispatch_dirty(#ARGV);
}
So now if we call ./test.pl to invoke the death method
$ ./test.pl death Goodbye
Robot: Goodbye at ./test.pl line 32
Unusual::death('Unusual=HASH(0xa0f7188)', 'Goodbye') called at ./test.pl line 19
Unusual::dispatch_dirty('Unusual=HASH(0xa0f7188)', 'death', 'Goodbye') called at ./test.pl line 46
but wee see dispatch_dirty in the trace. If instead we call ./test.pl -c we now use the clean dispatcher and get
$ ./test.pl -c death Adios
Robot: Adios at ./test.pl line 33
Unusual::death('Unusual=HASH(0x9427188)', 'Adios') called at ./test.pl line 44
The key here is the goto (not the evil goto) which takes the subroutine reference and immediately switches the execution to that reference, using the current #_. This is why I have to unshift #_, $self so that the invocant is ready for the new method.
Refs:
sub refWay{
my ($refToArray,$secondParam,$thirdParam) = #_;
#work here
}
refWay(\#array, 'a','b');
HashWay:
sub hashWay{
my $refToHash = shift; #(if pass ref to hash)
#and i know, that:
return undef unless exists $refToHash->{'user'};
return undef unless exists $refToHash->{'password'};
#or the same in loop:
for (qw(user password etc)){
return undef unless exists $refToHash->{$_};
}
}
hashWay({'user'=>YourName, 'password'=>YourPassword});
I tried a simple example:
#!/usr/bin/perl
use strict;
sub total {
my $sum = 0;
while(#_) {
$sum = $sum + shift;
}
return $sum;
}
sub total1 {
my ($a, $aa, $aaa) = #_;
return ($a + $aa + $aaa);
}
my $s;
$s = total(10, 20, 30);
print $s;
$s = total1(10, 20, 30);
print "\n$s";
Both print statements gave answer as 60.
But personally I feel, the arguments should be accepted in this manner:
my (arguments, #garb) = #_;
in order to avoid any sort of issue latter.
I found the following gem in http://perldoc.perl.org/perlsub.html:
"Yes, there are still unresolved issues having to do with visibility of #_ . I'm ignoring that question for the moment. (But note that if we make #_ lexically scoped, those anonymous subroutines can act like closures... (Gee, is this sounding a little Lispish? (Never mind.)))"
You might have run into one of those issues :-(
OTOH amon is probably right -> +1

Where is $_ being modified in this perl code?

The following perl code generates a warning in PerlCritic (by Activestate):
sub natural_sort {
my #sorted;
#sorted = grep {s/(^|\D)0+(\d)/$1$2/g,1} sort grep {s/(\d+)/sprintf"%06.6d",$1/ge,1} #_;
}
The warning generated is:
Don't modify $_ in list functions
More info about that warning here
I don't understand the warning because I don't think I'm modifying $_, although I suppose I must be.
Can someone explain it to me please?
Both of your greps are modifying $_ because you're using s//. For example, this:
grep {s/(^|\D)0+(\d)/$1$2/g,1}
is the same as this:
grep { $_ =~ s/(^|\D)0+(\d)/$1$2/g; 1 }
I think you'd be better off using map as you are not filtering anything with your greps, you're just using grep as an iterator:
sub natural_sort {
my $t;
return map { ($t = $_) =~ s/(^|\D)0+(\d)/$1$2/g; $t }
sort
map { ($t = $_) =~ s/(\d+)/sprintf"%06.6d",$1/ge; $t }
#_;
}
That should do the same thing and keep critic quiet. You might want to have a look at List::MoreUtils if you want some nicer list operators than plain map.
You are doing a substitution ( i.e. s/// ) in the grep, which modifies $_ i.e. the list being grepped.
This and other cases are explained in perldoc perlvar:
Here are the places where Perl will
assume $_ even if you don't use it:
The following functions:
abs, alarm, chomp, chop, chr, chroot,
cos, defined, eval, exp, glob, hex,
int, lc, lcfirst, length, log, lstat,
mkdir, oct, ord, pos, print,
quotemeta, readlink, readpipe, ref,
require, reverse (in scalar context
only), rmdir, sin, split (on its
second argument), sqrt, stat, study,
uc, ucfirst, unlink, unpack.
All file tests (-f , -d ) except for -t , which defaults to STDIN.
See -X
The pattern matching operations m//, s/// and tr/// (aka y///) when
used without an =~ operator.
The default iterator variable in a foreach loop if no other variable is
supplied.
The implicit iterator variable in the grep() and map() functions.
The implicit variable of given().
The default place to put an input record when a operation's result
is tested by itself as the sole
criterion of a while test. Outside a
while test, this will not happen.
Many people have correctly answered that the s operator is modifying $_, however in the soon to be released Perl 5.14.0 there will be the new r flag for the s operator (i.e. s///r) which rather than modify in-place will return the modified elements. Read more at The Effective Perler . You can use perlbrew to install this new version.
Edit: Perl 5.14 is now available! Announcement Announcement Delta
Here is the function suggested by mu (using map) but using this functionality:
use 5.14.0;
sub natural_sort {
return map { s/(^|\D)0+(\d)/$1$2/gr }
sort
map { s/(\d+)/sprintf"%06.6d",$1/gre }
#_;
}
The VERY important part that other answers have missed is that the line
grep {s/(\d+)/sprintf"%06.6d",$1/ge,1} #_;
Is actually modifying the arguments passed into the function, and not copies of them.
grep is a filtering command, and the value in $_ inside the code block is an alias to one of the values in #_. #_ in turn contains aliases to the arguments passed to the function, so when the s/// operator performs its substitution, the change is being made to the original argument. This is shown in the following example:
sub test {grep {s/a/b/g; 1} #_}
my #array = qw(cat bat sat);
my #new = test #array;
say "#new"; # prints "cbt bbt sbt" as it should
say "#array"; # prints "cbt bbt sbt" as well, which is probably an error
The behavior you are looking for (apply a function that modifies $_ to a copy of a list) has been encapsulated as the apply function in a number of modules. My module List::Gen contains such an implementation. apply is also fairly simple to write yourself:
sub apply (&#) {
my ($sub, #ret) = #_;
$sub->() for #ret;
wantarray ? #ret : pop #ret
}
With that, your code could be rewritten as:
sub natural_sort {
apply {s/(^|\D)0+(\d)/$1$2/g} sort apply {s/(\d+)/sprintf"%06.6d",$1/ge} #_
}
If your goal with the repeated substitutions is to perform a sort of the original data with a transient modification applied, you should look into a Perl idiom known as the Schwartzian transform which is a more efficient way of achieving that goal.

How can I prevent perl from reading past the end of a tied array that shrinks when accessed?

Is there any way to force Perl to call FETCHSIZE on a tied array before each call to FETCH? My tied array knows its maximum size, but could shrink from this size depending on the results of earlier FETCH calls. here is a contrived example that filters a list to only the even elements with lazy evaluation:
use warnings;
use strict;
package VarSize;
sub TIEARRAY { bless $_[1] => $_[0] }
sub FETCH {
my ($self, $index) = #_;
splice #$self, $index, 1 while $$self[$index] % 2;
$$self[$index]
}
sub FETCHSIZE {scalar #{$_[0]}}
my #source = 1 .. 10;
tie my #output => 'VarSize', [#source];
print "#output\n"; # array changes size as it is read, perl only checks size
# at the start, so it runs off the end with warnings
print "#output\n"; # knows correct size from start, no warnings
for brevity I have omitted a bunch of error checking code (such as how to deal with accesses starting from an index other than 0)
EDIT: rather than the above two print statements, if ONE of the following two lines is used, the first will work fine, the second will throw warnings.
print "$_ " for #output; # for loop "iterator context" is fine,
# checks FETCHSIZE before each FETCH, ends properly
print join " " => #output; # however a list context expansion
# calls FETCHSIZE at the start, and runs off the end
Update:
The actual module that implements a variable sized tied array is called List::Gen which is up on CPAN. The function is filter which behaves like grep, but works with List::Gen's lazy generators. Does anyone have any ideas that could make the implementation of filter better?
(the test function is similar, but returns undef in failed slots, keeping the array size constant, but that of course has different usage semantics than grep)
sub FETCH {
my ($self, $index) = #_;
my $size = $self->FETCHSIZE;
...
}
Ta da!
I suspect what you're missing is they're just methods. Methods called by tie magic, but still just methods you can call yourself.
Listing out the contents of a tied array basically boils down to this:
my #array;
my $tied_obj = tied #array;
for my $idx (0..$tied_obj->FETCHSIZE-1) {
push #array, $tied_obj->FETCH($idx);
}
return #array;
So you don't get any opportunity to control the number of iterations. Nor can FETCH reliably tell if its being called from #array or $array[$idx] or #array[#idxs]. This sucks. Ties kinda suck, and they're really slow. About 3 times slower than a normal method call and 10 times than a regular array.
Your example already breaks expectations about arrays (10 elements go in, 5 elements come out). What happen when a user asks for $array[3]? Do they get undef? Alternatives include just using the object API, if your thing doesn't behave exactly like an array pretending it does will only add confusion. Or you can use an object with array deref overloaded.
So, what you're doing can be done, but its difficult to get it to work well. What are you really trying to accomplish?
I think that order in which perl calls FETCH/FETCHSIZE methods can't be changed. It's perls internal part.
Why not just explicitly remove warnings:
sub FETCH {
my ($self, $index) = #_;
splice #$self, $index, 1 while ($$self[$index] || 0) % 2;
exists $$self[$index] ? $$self[$index] : '' ## replace '' with default value
}