How do map and grep work? - perl

I came across this code in a script. Can you please explain what map and grep do here?
open FILE, '<', $file or die "Can't open file $file: $!\n";
my @sets = map {
chomp;
$_ =~ m/use (\w+)/;
$1;
}
grep /^use/, ( <FILE> );
close FILE;
The file pointed to by $file contains:
use set_marvel;
use set_caprion;
and so on...

Despite the fact that your question doesn't show any research effort, I'm going to answer it anyway, because it might be helpful for future readers who come across this page.
According to perldoc, map:
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value composed of the results
of each such evaluation. In scalar context, returns the total number
of elements so generated. Evaluates BLOCK or EXPR in list context, so
each element of LIST may produce zero, one, or more elements in the
returned value.
The definition for grep, on the other hand:
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value consisting of those
elements for which the expression evaluated to true. In scalar
context, returns the number of times the expression was true.
So they're similar in their input values, their return values, and the fact that they both localize $_.
In your specific code, going from right to left:
<FILE> slurps the lines in the file pointed to by the FILE filehandle and returns a list
In the context of grep, /^use/ looks at each line and returns true for the ones that match the regular expression. The return value of grep, therefore, is a list of lines that start with use.
In the BLOCK of your map (which is only considering lines that passed the earlier grep test):
chomp removes any trailing string from $_ that corresponds to the current value of $/ (i.e., the newline). This is unnecessary, because as you'll see below, \w will never match a newline.
$_ =~ m/use (\w+)/ is a regular expression that looks for use followed by a space, followed by one or more word characters ([0-9a-zA-Z_]) in a capture group. The $_ =~ is redundant, since the match operator m// binds to $_ by default.
$1 is the first matching capture group from the previous expression. Since it's the last expression in the BLOCK, it bubbles up as the return value for each list item that was evaluated.
The end result is stored in an array named @sets, which should contain 'set_marvel', 'set_caprion', etc.
Equivalently, your code could be rewritten without map and grep like this, which may make it easier for you to understand:
my @sets;
while (<FILE>) {
next unless /^use (\w+)/;
push(@sets, $1);
}
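If you want to try the map/grep pipeline without a file, here is a minimal sketch. The @lines array is a hypothetical stand-in for what <FILE> would return, and the ternary inside map is my own addition (not in your script) so that a line whose match fails contributes nothing instead of a stale $1.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines that <FILE> would return.
my @lines = ("use set_marvel;\n", "# a comment\n", "use set_caprion;\n");

# grep keeps the lines that start with "use";
# map turns each surviving line into the captured word (or nothing on failure).
my @sets = map { /use (\w+)/ ? $1 : () } grep { /^use/ } @lines;

print "$_\n" for @sets;
```

Returning an empty list from the map block is what lets map drop an element, since the block is evaluated in list context.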

The grep takes the <FILE> as input and uses the regular expression ^use to copy all of the lines that start with use into an array that is passed to map.
The map loops through each entry of that list and puts it in $_, then calls chomp, which operates on $_ implicitly. Then $_ =~ m/use (\w+)/; matches a regular expression against $_ that captures the word after use and puts it into $1. Finally, $1 is the last expression in the block, so it becomes the value returned for that entry and ends up in @sets.

Related

Substitution on string in perl changes string to an integer value

I am trying to delete some characters matching a regex in perl, and when I do that it returns an integer value.
I have tried substituting multiple spaces in a string with empty string or basically deleting the space.
#! /usr/intel/bin/perl
my $line = "foo/\\bar car";
print "$line\n";
#$line = ~s/(\\|(\s)+)+//; <--Ultimately need this, where backslash and space needs to be deleted. Tried this, returns integer value
$line = ~s/\s+//; # <-- tried this, returns integer value
print "$line\n";
Expected results:
First print: foo/\bar car
Second print: foo/barcar
Actual result:
First print: foo/\\bar car
Second print: 18913234908
The proper solution is
$line =~ s/[\s\\]+//g;
Note:
g flag to substitute all occurrences
no space between = and ~
=~ is a single operator, binding the substitution operator s to the target variable $line.
Inserting a space (as in your code) means s binds to the default target, $_, because there is no explicit target, and then the return value (which is the number of substitutions made) has all its bits inverted (unary ~ is bitwise complement) and is assigned to $line.
In other words,
$line = ~ s/...//
parses as
$line = ~(s/...//)
which is equivalent to
$line = ~($_ =~ s/...//)
If you had enabled use warnings, you would've gotten the following message:
Use of uninitialized value $_ in substitution (s///) at prog.pl line 6.
You've already accepted an answer, but I thought it would be useful to give you a few more details.
As you now know,
$line = ~s/\s+//;
is completely different to:
$line =~ s/\s+//;
You wanted the second, but you typed the first. So what did you end up with?
~ is the "bitwise negation operator". That is, it converts its argument to a binary number and then bit-flips that number: all the zeroes become ones and all the ones become zeroes.
So you're asking for the bitwise negation of s/\s+//. Which means the bitwise negation works on the value returned by s/\s+//. And the value returned by a substitution is the number of substitutions made.
We can now work out all of the details.
s/\s+// carries out your substitution and returns the number of substitutions made (an integer).
~s/\s+// returns the bitwise negation of the integer returned by the substitution (which is also an integer).
$line = ~s/\s+// takes that second integer and assigns it to the variable $line.
Probably, the first step returns 1 (you don't use /g on your s/.../.../, so only one substitution will be made). It's easy enough to get the bitwise negation of 1.
$ perl -E'say ~1'
18446744073709551614
So that might well be the integer that you're seeing (although it might be different on a 32-bit system).
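To see both behaviours side by side, here is a small sketch. The variable names $fixed and $oops and the sample value placed in $_ are made up for illustration; the exact integer printed for $oops depends on your perl's integer width, so I haven't listed it.

```perl
use strict;
use warnings;

my $line = "foo/\\bar car";    # the string holds: foo/\bar car

# Correct form: =~ binds the substitution to $fixed (a copy of $line).
(my $fixed = $line) =~ s/[\s\\]+//g;

# The typo: "= ~" runs s/// against $_ and assigns the complemented count.
local $_ = "a b c";            # illustrative value so the substitution matches once
my $oops = ~ s/\s+//;

print "$fixed\n";              # backslash and space removed from the copy
print "$_\n";                  # $_ was modified instead of $line
print "$oops\n";               # bitwise complement of 1
```

Note that $line itself is untouched in both cases here; only $fixed and $_ change.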

Extracting matches from perl regex using global modifier in foreach loop

I am trying to extract matched parts from a string using the global modifier.
Consider:
my $a="A B C";
my $b="A B C";
foreach ($a =~ /(\w)/g) {
print "$1\n";
}
while ($b =~ /(\w)/g) {
print "$1\n";
}
Output:
C
C
C
A
B
C
I am confused; why does the while loop work, whereas the foreach loop does not? (It prints C three times).
In short: change body of the first loop to print "$_\n".
When a global regex match is used as a list, it evaluates to a list of all captures (here: qw(A B C)). The foreach loop iterates over this list, and sets $_ to each item in turn. However, $1 points to the first capture group of the last (successful) match. As the list of matches is produced before the looping begins, this will point to the last match the whole time.
When a global regex match is used as an iterator in a while, it matches the regex and, if successful, executes the loop body, then tries again. Because only one match is produced at a time, $1 always refers to the first capture group in the current match.
The statement
foreach ($a =~ /(\w)/g)
evaluates the regular expression in list context and iterates through each item in the list. $1 is the last thing that was captured in the brackets when constructing the list. The following should work:
foreach my $matched ($a =~ /(\w)/g) {
print "$matched\n";
}
However, the while syntax is usually best since it does not construct and store that temporary list.
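The difference between the two loops can be checked with a short sketch like the one below. The names $str, @stale, @from_foreach and @from_while are mine, chosen to avoid $a and $b, which sort treats specially.

```perl
use strict;
use warnings;

my $str = "A B C";

# List context: every match is found before the loop body runs,
# so $1 still holds the capture from the final match.
my (@stale, @from_foreach);
foreach my $matched ($str =~ /(\w)/g) {
    push @stale, $1;              # capture from the last match, every time
    push @from_foreach, $matched; # the element being iterated over
}

# Scalar context: one match per iteration, so $1 is current each time.
my @from_while;
while ($str =~ /(\w)/g) {
    push @from_while, $1;
}

print "stale:   @stale\n";
print "foreach: @from_foreach\n";
print "while:   @from_while\n";
```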

What do these Perl variables mean?

I'm a little noobish to perl coding conventions, could someone help explain:
why are there / and /< in front of perl variables?
what do \= and =~ mean, and what is the difference?
why does the code require an ending / before the ;, e.g. /start=\'([0-9]+)\'/?
The first 3 sub-questions were more or less answered by the perldocs, but what does the following line mean in the code?
push(@{$Start{$start}},$features);
I understand that we are pushing $features into a @Start array, but what does @{$Start{$start}} mean? Is it the same as:
@Start = ($start);
Within the code there is something like this:
use FileHandle;
sub open_infile {
my $file = shift;
my $in = FileHandle->new($file,"<:encoding(UTF-8)")
or die "ERROR: cannot open $file: $!\n" if ($Opt_utf8);
$in = new FileHandle("$file")
or die "ERROR: cannot open $file: $!\n" if (!$Opt_utf8);
return $in;
}
$uamf = shift @ARGV;
$uamin = open_infile($uamf);
while (<$uamin>) {
chomp;
if(/<segment /){
/start=\'([0-9]+)\'/;
/end=\'([0-9]+)\'/;
/features=\'([^\']+)\'/;
$features =~ s/annotation;//;
push(@{$Start{$start}},$features);
push(@{$End{$end}},$features);
}
}
EDITED
So after some intensive reading of the Perl docs, here are some things I've figured out:
The /<segment / is a regex check that checks whether the line read
in while (<$uamin>) contains the following string: <segment.
Similarly, the /start=\'([0-9]+)\'/ has nothing to do with
instantiating any variable; it's a regex check to see whether the
line read in while (<$uamin>) contains start=\'([0-9]+)\', in which
\'([0-9]+)\' refers to a numeric string.
In $features =~ s/annotation;// the =~ is used because the string
replacement was testing a regular expression match. See
What does =~ do in Perl?
Where did you see this syntax (or more to the point: have you edited stuff out of what you saw)? /foo/ represents the match operator using regular expressions, not variables. In other words, the first line is checking to see if the input string $_ contains the character sequence <segment.
The subsequent three lines essentially do nothing useful, in the sense that they run regular expression matches and then discard the results (there are side-effects, but subsequent regular expressions discard the side-effects, too).
The last line does a substitution, replacing the first occurrence of the characters annotation; with the empty string in the string $features.
Run the command perldoc perlretut to learn about regex in Perl.
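For what it's worth, here is a rough sketch of what the loop may have been meant to do, assuming the three matches were supposed to fill $start, $end and $features. The sample lines in @lines are hypothetical, as is the idea that the original assigned the captures at all. It also shows what @{$Start{$start}} does: $Start{$start} is an element of the hash %Start that holds an array reference, and @{...} dereferences it so push can append to that array.

```perl
use strict;
use warnings;

my (%Start, %End);

# Hypothetical input lines; the real file format is not shown in the question.
my @lines = (
    "<segment start='10' end='20' features='annotation;noun'>",
    "<segment start='10' end='35' features='annotation;verb'>",
    "<other start='99'>",
);

for my $line (@lines) {
    next unless $line =~ /<segment /;

    # A match in list context returns its captures, so they can be stored.
    my ($start)    = $line =~ /start='([0-9]+)'/;
    my ($end)      = $line =~ /end='([0-9]+)'/;
    my ($features) = $line =~ /features='([^']+)'/;
    next unless defined $start && defined $end && defined $features;

    $features =~ s/annotation;//;

    # %Start maps a start position to an array of features;
    # the array is created automatically on the first push.
    push @{ $Start{$start} }, $features;
    push @{ $End{$end} },     $features;
}

print "$_: @{ $Start{$_} }\n" for sort keys %Start;
```

So it is not the same as assigning to an array named @Start; each key of %Start gets its own list of features.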

Can you explain the context dependent variable assignment in perl

The following is one of the many cool things that Perl can do
my ($tmp) = ($_=~ /^>(.*)/);
It finds the pattern ^>.* in the current line in a loop, and it stores what's in the parentheses in the $tmp variable.
What I am curious about is the concept behind this syntax. How and why (under what premises) does this work?
My understanding is that the snippet $_=~ /^>(.*)/ is in boolean context, but the parentheses render it as list context? But how come only what is in the parentheses in the matched pattern is stored in the variable?!
Is it some kind of special case of variable assignment I have to "memorize", or can it be fully explained? If so, what is this feature called (a name like "autovivification"?)
There are two assignment operators: list assignment and scalar assignment. The choice is determined based on the LHS of the "=". (The two operators are covered in detail here.)
In this case, a list assignment operator is used. The list assignment operator evaluates both of its operands in list context.
So what does $_=~ /^>(.*)/ do in list context? Quote perlop:
If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3...) [...] When there are no parentheses in the pattern, the return value is the list (1) for success. With or without parentheses, an empty list is returned upon failure.
In other words,
my ($match) = $_ =~ /^>(.*)/;
is equivalent to
my $match;
if ($_ =~ /^>(.*)/) {
$match = $1;
} else {
$match = undef;
}
Were the parens omitted (my $tmp = ...;), a scalar assignment would be used instead. The scalar assignment operator evaluates both of its operands in scalar context.
So what does $_=~ /^>(.*)/ do in scalar context? Quote perlop:
returns true if it succeeds, false if it fails.
In other words,
my $matched = $_ =~ /^>(.*)/;
is equivalent to
my $matched;
if ($_ =~ /^>(.*)/) {
$matched = 1; # !!1 if you want to be picky.
} else {
$matched = 0; # !!0 if you want to be picky.
}
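A quick way to convince yourself is a sketch like this one. The sample strings and the names $header, $ok and $none are made up for illustration.

```perl
use strict;
use warnings;

my $header = ">seq1 description";

# List assignment: the match runs in list context and returns its captures.
my ($tmp) = $header =~ /^>(.*)/;

# Scalar assignment: the match runs in scalar context and returns true/false.
my $ok = $header =~ /^>(.*)/;

# Failed match in list context: empty list, so the variable stays undef.
my ($none) = "no marker here" =~ /^>(.*)/;

print "tmp:  $tmp\n";
print "ok:   $ok\n";
print "none: ", (defined $none ? $none : "undef"), "\n";
```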
The brackets in the search pattern make that a "group". What $_ =~ /regex/ returns is a list of all the matching groups, so my ($tmp) grabs the first group into $tmp.
All operations in Perl have a return value, including assignment. That's why you can do $a=$b=1 and set $a to the result of $b=1.
You can use =~ in a boolean (well, scalar) context, where it returns a false value if there's no match. Calling it in list context returns a list, just like other context-sensitive functions can do using wantarray to determine context.

Why does Perl's shift complain 'Type of arg 1 to shift must be array (not grep iterator).'?

I've got a data structure that is a hash that contains an array of hashes. I'd like to reach in there and pull out the first hash that matches a value I'm looking for. I tried this:
my $result = shift grep {$_->{name} eq 'foo'} @{$hash_ref->{list}};
But that gives me this error: Type of arg 1 to shift must be array (not grep iterator). I've re-read the perldoc for grep and I think what I'm doing makes sense. grep returns a list, right? Is it in the wrong context?
I'll use a temporary variable for now, but I'd like to figure out why this doesn't work.
A list isn't an array.
my ($result) = grep {$_->{name} eq 'foo'} @{$hash_ref->{list}};
… should do the job though. Take the return from grep in list context, but don't assign any of the values other than the first.
I think a better way to write this would be this:
use List::Util qw/first/;
my $result = first { $_->{name} eq 'foo' } @{ $hash_ref->{list} };
Not only will it be more clear what you're trying to do, it will also be faster because it will stop grepping your array once it has found the matching element.
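Here is a small self-contained sketch comparing the list assignment with first. The contents of $hash_ref (including the id fields) are invented for the example; only the shape, a hash holding an array of hashes, comes from the question.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical data in the shape described: a hash containing an array of hashes.
my $hash_ref = {
    list => [
        { name => 'bar', id => 1 },
        { name => 'foo', id => 2 },
        { name => 'foo', id => 3 },
    ],
};

# grep in list context returns every match; the list assignment keeps only the first.
my ($via_grep) = grep { $_->{name} eq 'foo' } @{ $hash_ref->{list} };

# first stops scanning as soon as the block returns true.
my $via_first = first { $_->{name} eq 'foo' } @{ $hash_ref->{list} };

print "grep:  id $via_grep->{id}\n";
print "first: id $via_first->{id}\n";
```

Both variables end up referring to the same element of the array; the difference is only in how much of the list gets examined.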
Another way to do it:
my $result = (grep {$_->{name} eq 'foo'} @{$hash_ref->{list}})[0];
Note that the curlies around the first argument to grep are redundant in this case, so you can avoid block setup and teardown costs with
my $result = (grep $_->{name} eq 'foo', @{$hash_ref->{list}})[0];
“List value constructors” in perldata documents subscripting of lists:
A list value may also be subscripted like a normal array. You must put the list in parentheses to avoid ambiguity. For example:
# Stat returns list value.
$time = (stat($file))[8];
# SYNTAX ERROR HERE.
$time = stat($file)[8]; # OOPS, FORGOT PARENTHESES
# Find a hex digit.
$hexdigit = ('a','b','c','d','e','f')[$digit-10];
# A "reverse comma operator".
return (pop(@foo),pop(@foo))[0];
As I recall, we got this feature when Randal Schwartz jokingly suggested it, and Chip Salzenberg—who was a patching machine in those days—implemented it that evening.
Update: A bit of searching shows the feature I had in mind was $coderef->(@args). The commit message even logs the conversation!