Extracting matches from perl regex using global modifier in foreach loop - perl

I am trying to extract matched parts from a string using the global modifier.
Consider:
my $a="A B C";
my $b="A B C";
foreach ($a =~ /(\w)/g) {
    print "$1\n";
}
while ($b =~ /(\w)/g) {
    print "$1\n";
}
Output:
C
C
C
A
B
C
I am confused; why does the while loop work, whereas the foreach loop does not? (It prints C three times).

In short: change the body of the first loop to print "$_\n".
When a global regex match is used as a list, it evaluates to a list of all captures (here: qw(A B C)). The foreach loop iterates over this list, and sets $_ to each item in turn. However, $1 points to the first capture group of the last (successful) match. As the list of matches is produced before the looping begins, this will point to the last match the whole time.
When a global regex match is used as an iterator in a while loop, it matches the regex and, if successful, executes the loop body, then tries again. Because only one match is produced at a time, $1 always refers to the first capture group of the current match.

The statement
foreach ($a =~ /(\w)/g)
evaluates the regular expression in list context and iterates over each item in the resulting list. By the time the loop body runs, $1 holds whatever the parentheses captured in the last match made while constructing that list. The following should work:
foreach my $matched ($a =~ /(\w)/g) {
    print "$matched\n";
}
However, the while syntax is usually best since it does not construct and store that temporary list.
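To make the difference between the two contexts visible, here is a minimal sketch (the variable names are made up for illustration). It collects all captures at once in list context, then walks the same string in scalar context and records pos(), which is where the next /g attempt resumes.

```perl
use strict;
use warnings;

my $text = "A B C";

# List context: every capture is collected before any loop body runs.
my @all = $text =~ /(\w)/g;
print "list context: @all\n";

# Scalar context: one match per evaluation; pos() reports where the
# next attempt on $text will resume.
my @seen;
while ($text =~ /(\w)/g) {
    push @seen, "$1 (pos " . pos($text) . ")";
}
print "scalar context: $_\n" for @seen;
```

In the while loop, $1 is read right after each individual match, which is why it tracks the current capture there.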

Related

Best way to parse string in perl

To achieve the task below I have written the C-like Perl program shown here (I am new to Perl), but I am not sure whether it is the best way to do this.
Can someone please guide me?
Note: I am not asking for a full program, only for pointers on where I can improve.
Thanks in advance.
Input :
$str = "mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>"
Expected Output :
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
Sample Program
my $str = "mail1, local<mail1\@mail.local>, mail2\@mail.local, <mail3\@mail.local>, mail4, local<mail4\@mail.local>";
my ($count, $flag, $tempStr) = (0, 0, "");
my @array;
for my $c (split(//, $str)) {
    # Skip spaces at the start of a record.
    if (($count == 0) and ($c eq ' ')) {
        next;
    }
    if (length $c) {
        # A comma ends the record only once '@' or '>' has been seen.
        if (($c eq ',') and ($flag == 1)) {
            push @array, $tempStr;
            $count = 0;
            $flag = 0;
            $tempStr = "";
            next;
        }
        if (($c eq '>') or ($c eq '@')) {
            $flag = 1;
        }
        $tempStr = "$tempStr$c";
        $count++;
    }
}
if ($count > 0) {
    push @array, $tempStr;
}
foreach my $var (@array) {
    print "$var\n";
}
Edit:
Input:
Input is the output of above code.
Expected Output :
"mail1, local"<mail1@mail.local>
"mail4, local"<mail4@mail.local>
Sample Code:
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
my @addresses = split(/\n/, $str);
if (scalar @addresses) {
    foreach my $address (@addresses) {
        if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)) {
            $address = "\"$address";
            $address =~ s/</\"</g;
        }
    }
    $str = join(',', @addresses);
}
print "$str\n";
As I see it, you want to replace each comma (together with any following spaces) that occurs after either @ or > with a newline.
To make such a replacement, you can use a regex instead of writing a parsing program.
The search part can be as follows:
([^@>]+[@>][^,]+),\s*
Details:
( - Start of the 1st capturing group.
[^@>]+ - A non-empty sequence of chars other than @ or >.
[@>] - Either @ or >.
[^,]+ - A non-empty sequence of chars other than a comma.
) - End of the 1st capturing group.
,\s* - A comma and optional sequence of spaces.
The replace part should be:
$1 - The 1st capturing group.
\n - A newline.
So the whole program, much shorter than yours, can be as follows:
my $str = 'mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4, local<mail4@mail.local>';
print "Before:\n$str\n";
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";
To replace all the commas that need replacing, I used the g option.
Note that I put the source string in single quotes; otherwise Perl
would have complained about Possible unintended interpolation of @mail.
Edit
Your modified requirements must be handled in a different way.
"Ordinary" replacement is not an option, because now there are some
fragments to match and some fragments to ignore.
So the basic idea is to write a while loop with a matching regex:
(\w+),?\s+(\w+)(<[^>]+>), meaning:
(\w+) - First capturing group - a sequence of word chars (e.g. mail1).
,?\s+ - Optional comma and a sequence of spaces.
(\w+) - Second capturing group - a sequence of word chars (e.g. local).
(<[^>]+>) - Third capturing group - a sequence of chars other than >
(actual mail address), enclosed in angle brackets, e.g. <mail1@mail.local>.
Within each execution of the loop you have access to the groups
captured in this particular match ($1, $2, ...).
So the content of this loop is to print all these captured groups,
with required additional chars.
The code (again much shorter than yours) should look like this:
my $str = 'mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
    print "\"$1, $2\"$3\n";
}
Here is an approach using split, which in this case also needs a careful regex:
use warnings;
use strict;
use feature 'say';
my $string =  # broken into two parts for readability
    q(mail1, local<mail1@mail.local>, mail2@mail.local, )
  . q(<mail3@mail.local>, mail4, local<mail4@mail.local>);
my @addresses = split /@.+?\K,\s*/, $string;
say for @addresses;
split takes a full regex as its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so @.+?,
Matching a pattern only when it is preceded by another brings to mind a lookbehind before the comma. But those can't be of arbitrary variable length, which is precisely the case here.
We can instead match the pattern @.+? normally and then use \K (a form of lookbehind), which drops everything matched so far from the match so that it is not taken out of the string. Thus the above splits on ,\s* when that is preceded by the email address, @... (which isn't consumed).
It prints
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
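To see what \K contributes, here is a small side-by-side sketch on made-up sample data (the $record string and array names are assumptions for illustration only). The same pattern is used with and without \K.

```perl
use strict;
use warnings;

my $record = 'a@x.org, b@y.org';   # hypothetical sample data

# Without \K the whole match, including the tail of the address,
# is treated as the delimiter and disappears from the fields.
my @without = split /@.+?,\s*/, $record;

# With \K only the comma and the spaces after it form the delimiter.
my @with = split /@.+?\K,\s*/, $record;

print join('|', @without), "\n";
print join('|', @with), "\n";
```

The first line printed is a|b@y.org (the first address lost its domain), the second is a@x.org|b@y.org.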
The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once addresses have been parsed out of the string as above. For example
my @addresses = split /@.+?\K,\s*/, $string; #/ stop syntax highlight
s/(.+?,\s*.+?)</"$1"</ for @addresses;
say for @addresses;
The regex in a loop is one way to change elements of an array. I use it for its efficiency (changes elements in place), conciseness, and as a demonstration of the following properties.
In a foreach loop the index variable (or $_) is an alias for the currently processed element – so changing it changes that element. This is a known source of bugs when allowed unknowingly, which was another reason to show it in the above form.
The statement also uses the statement modifier and it is equivalent to
foreach my $elem (@addresses) {
    $elem =~ s/(.+?,\s*.+?)</"$1"</;
}
This is often considered a more proper way to write it but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach.
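The aliasing behaviour can be checked with a short sketch like the following. The sample data and the @upper array are assumptions made for illustration; the substitution is the one used above.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @addresses = ('mail1, local<mail1@mail.local>', 'mail2@mail.local');

# $elem is an alias for the array element, so the substitution
# changes @addresses itself.
for my $elem (@addresses) {
    $elem =~ s/(.+?,\s*.+?)</"$1"</;
}

# map builds a new list from its return values; uc returns a new
# string, so @addresses is left as it is.
my @upper = map { uc $_ } @addresses;

print "$_\n" for @addresses;
```

After the loop the first element carries the added quotes, while the second, which has no description before the address, is unchanged.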

Perl: If two elements match print elements, else iterate until match and then print

I'm new to Perl and I'm trying to iterate over two elements of an array with multiple indices in each element and look for a match. If element2 matches element1, I want to print both and move to the next position in element1 and continue the loop looking for the next match. If I don't have a match, loop until I get a match. Here is what I have:
@array = split(',',$row);
foreach $element1(@array[1])
{
    foreach $element2(@array[2])
    {
        if($element1 == $element2)
        {
            print "1 = $element1 : 2 = $element2 \n";
        }
    }
}
I'm not getting the matched output. I've tried multiple iterations with different syntactical changes.
I can get both elements when I do this:
foreach $element1(@array[1])
{
    foreach $element2(@array[2])
    {
        print "1 = $element1 : 2 = $element2 \n";
    }
}
I thought I might not be dereferencing correctly. Any guidance or suggestions would be appreciated. Thanks.
There are a number of issues with your script. Briefly:
You should always use strict and warnings.
Array indices start at 0, not 1.
You get an element of an array with $array[0], not @array[0]. This is a common frustration for new Perl programmers. The thing to remember is that the sigil (the symbol preceding a variable name) indicates the type of value being passed (e.g. $scalar, @array, or %hash) to the left-hand side of the expression, not the type of data structure being accessed on the right-hand side.
As @sp-asic pointed out in the comments on the OP, string comparisons are performed with eq, not ==.
References to data structures are stored in scalars, and you dereference by prepending the sigil of the original data structure. If $foo is a reference to an array, @$foo gets you the original array.
You apparently want to break out of your inner loops when you find a match, but you'll want to make it clear (for people who look at this code in the future, which may include yourself) which loop you're breaking out of.
Most critically, @array will be an array of strings after you split another string (the row) on commas, so it's not clear why you expect to be able to treat the strings in the first and second position as arrays that you can loop through. I have a few guesses about what you're actually trying to do, and what your inputs and expected outputs actually look like, but I'll wait for you to provide some additional information and leave the information above as general guidance in the meantime, along with a lightly-reworked version of your code below.
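use strict;
use warnings;
# $row is assumed to be declared elsewhere and to hold one comma-separated line.
my @array = split(',', $row);
# Each "list" here is a single string, so each loop runs exactly once.
foreach my $element1 ($array[0]) {
    foreach my $element2 ($array[1]) {
        if ($element1 eq $element2) {
            print "1 = $element1 : 2 = $element2\n";
            last;
        }
    }
}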
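Since the real input format is unknown, the following is only a sketch under the assumption that there are two separate lists to compare (the @first and @second arrays and their contents are invented for illustration). It shows how a loop label makes it explicit which loop is being left once a match is found.

```perl
use strict;
use warnings;

# Hypothetical input: two lists of strings to compare.
my @first  = qw(apple pear plum);
my @second = qw(plum fig apple);

my @pairs;
OUTER: foreach my $element1 (@first) {
    foreach my $element2 (@second) {
        if ($element1 eq $element2) {
            push @pairs, "1 = $element1 : 2 = $element2";
            next OUTER;   # match found: continue with the next item of @first
        }
    }
}
print "$_\n" for @pairs;
```

With this sample data, the matches reported are apple and plum; pear has no partner and produces no output.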

How do map and grep work?

I came across this code in a script, can you please explain what map and grep does here?
open FILE, '<', $file or die "Can't open file $file: $!\n";
my @sets = map {
    chomp;
    $_ =~ m/use (\w+)/;
    $1;
}
grep /^use/, ( <FILE> );
close FILE;
The file pointed by $file has:
use set_marvel;
use set_caprion;
and so on...
Despite the fact that your question doesn't show any research effort, I'm going to answer it anyway, because it might be helpful for future readers who come across this page.
According to perldoc, map:
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value composed of the results
of each such evaluation. In scalar context, returns the total number
of elements so generated. Evaluates BLOCK or EXPR in list context, so
each element of LIST may produce zero, one, or more elements in the
returned value.
The definition for grep, on the other hand:
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value consisting of those
elements for which the expression evaluated to true. In scalar
context, returns the number of times the expression was true.
So they're similar in their input values, their return values, and the fact that they both localize $_.
In your specific code, going from right to left:
<FILE> slurps the lines in the file pointed to by the FILE filehandle and returns a list
In the context of grep, /^use/ looks at each line and returns true for the ones that match the regular expression. The return value of grep, therefore, is a list of lines that start with use.
In the BLOCK of your map (which is only considering lines that passed the earlier grep test):
chomp removes any trailing string from $_ that corresponds to the current value of $/ (i.e., the newline). This is unnecessary, because as you'll see below, \w will never match a newline.
$_ =~ m/use (\w+)/ is a regular expression that looks for use followed by a space, followed by one or more word characters ([0-9a-zA-Z_]) in a capture group. The $_ =~ is redundant, since the match operator m// binds to $_ by default.
$1 is the first matching capture group from the previous expression. Since it's the last expression in the BLOCK, it bubbles up as the return value for each list item that was evaluated.
The end result is stored in an array named @sets, which should contain 'set_marvel', 'set_caprion', etc.
Equivalently, your code could be rewritten without map and grep like this, which may make it easier for you to understand:
my @sets;
while (<FILE>) {
    next unless /^use (\w+)/;
    push(@sets, $1);
}
The grep takes <FILE> as input and uses the regular expression ^use to copy all of the lines that start with use into a list that is passed to map.
The map loops through that list and puts each entry in $_, then calls chomp, which acts on $_ implicitly. Then $_ =~ m/use (\w+)/; matches a regular expression against $_ that captures the word after use and puts it into $1. Finally, $1 is the last expression in the block, so its value is what ends up in @sets.
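As a further variation, the filtering and the extraction can be folded into a single map, because a failed match in list context returns an empty list. This is a sketch only; the @lines array is a made-up stand-in for the lines read from the file.

```perl
use strict;
use warnings;

# Hypothetical lines standing in for the contents of the file.
my @lines = ("use set_marvel;\n", "# a comment\n", "use set_caprion;\n");

# In list context a successful match returns its captures and a failed
# match returns the empty list, so non-matching lines add nothing.
my @sets = map { /^use (\w+)/ } @lines;

print "$_\n" for @sets;
```

This prints set_marvel and set_caprion, one per line.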

Perl significance of $#_ variable

I see when I loop through elements of an array, and test $#_ , I get -1 for each element. I am hoping someone can explain what this variable does, and what it is used for most often.
Just like $#foo is the last existing index of array @foo, $#_ is the last existing index of array @_. If @_ is empty, $#_ is -1.
It sounds like you mean to use $_. $_ is aliased by foreach, map and grep loops to the element currently being processed. while (<>) also sets $_ (as it gets rewritten to while (defined($_ = <>))). As a result, $_ is used as the default argument by many builtins (e.g. say).
# Print each element on its own line
say for @a;
is short for
# Print each element on its own line
say $_ for @a;
which is the terse form of
# Print each element on its own line
for my $ele (@a) {
    say $ele;
}
I believe you mean $_, which is a special variable in Perl. It holds the current element while looping through a list. For instance, the code below will print out each element of @foo, one at a time.
foreach (@foo) {
    print $_;
}
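Where $#_ does come up is inside a subroutine, since @_ holds the arguments of the call. A minimal sketch, with last_arg_index being a made-up name for illustration:

```perl
use strict;
use warnings;

# @_ holds a subroutine's arguments, so $#_ is the index of the last one.
sub last_arg_index {
    return $#_;
}

print last_arg_index(), "\n";               # called with no arguments
print last_arg_index('a', 'b', 'c'), "\n";  # called with three arguments
```

The first call prints -1 and the second prints 2. Outside any subroutine @_ is normally empty, which would explain seeing -1 on every iteration of a loop.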

Can you explain the context dependent variable assignment in perl

The following is one of the many cool things that Perl can do
my ($tmp) = ($_=~ /^>(.*)/);
It finds the pattern ^>.* in the current line in a loop, and it stores what's in the parentheses in the $tmp variable.
What I am curious about is the concept behind this syntax. How and why (under what premises) does this work?
My understanding is that the snippet $_=~ /^>(.*)/ is in boolean context, but the parentheses render it as list context? But how come only what is in the parentheses in the matched pattern is stored in the variable?!
Is it some kind of special case of variable assignment I have to "memorize", or can it be fully explained? If so, what is this feature called (a name like "autovivification")?
There are two assignment operators: list assignment and scalar assignment. The choice is determined based on the LHS of the "=". (The two operators are covered in detail in here.)
In this case, a list assignment operator is used. The list assignment operator evaluates both of its operands in list context.
So what does $_=~ /^>(.*)/ do in list context? Quote perlop:
If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3...) [...] When there are no parentheses in the pattern, the return value is the list (1) for success. With or without parentheses, an empty list is returned upon failure.
In other words,
my ($match) = $_ =~ /^>(.*)/;
is equivalent to
my $match;
if ($_ =~ /^>(.*)/) {
    $match = $1;
} else {
    $match = undef;
}
Were the parens omitted (my $tmp = ...;), a scalar assignment would be used instead. The scalar assignment operator evaluates both of its operands in scalar context.
So what does $_=~ /^>(.*)/ do in scalar context? Quote perlop:
returns true if it succeeds, false if it fails.
In other words,
my $matched = $_ =~ /^>(.*)/;
is equivalent to
my $matched;
if ($_ =~ /^>(.*)/) {
    $matched = 1; # !!1 if you want to be picky.
} else {
    $matched = 0; # !!0 if you want to be picky.
}
The parentheses in the search pattern make that a "group". What $_ =~ /regex/ returns is a list of all the matching groups, so my ($tmp) grabs the first group into $tmp.
All operations in Perl have a return value, including assignment. That's why you can do $a=$b=1 and set $a to the result of $b=1.
You can use =~ in a boolean (well, scalar) context, where it returns a false value if there's no match. Calling it in list context returns a list, just as other context-sensitive functions can do by using the wantarray function to determine the context they were called in.
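The effect of the parentheses on the left-hand side can be observed directly. In this sketch, context() is a hypothetical helper that reports how it was called, and $line is made-up sample input.

```perl
use strict;
use warnings;

# Hypothetical helper that reports the context it was called in.
sub context {
    return wantarray ? 'list' : 'scalar';
}

my ($in_list) = context();   # parentheses on the left: list assignment
my $in_scalar = context();   # no parentheses: scalar assignment

my $line = '>header';                 # made-up sample input
my ($captured) = $line =~ /^>(.*)/;   # list context: first capture group
my $succeeded  = $line =~ /^>(.*)/;   # scalar context: true or false

print "$in_list $in_scalar $captured $succeeded\n";
```

This prints list scalar header 1, so the same match expression yields the captured text or a success flag depending only on how it is assigned.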