Grouping things and hierarchical matching in Perl

I've been testing the Perl regex code from the perlrequick section on "Grouping things and hierarchical matching".
This is my Perl code:
my $t = "housecats";
my ($m) = $t =~ m/house(cat|)/;
print $m;
The output is cat, but going by the documentation I expected housecat:
/house(cat|)/; # matches either 'housecat' or 'house'
What is wrong? Is there something amiss?

What you're doing with this code
my $t = "housecats";
my ($m) = $t =~ m/house(cat|)/;
print $m;
is copying the first capture into $m. Parentheses () in the pattern indicate which parts of the matching string to capture and store in the built-in variables $1, $2 etc. You can have as many captures as you like, and they are numbered in the same order as their opening parentheses appear in the pattern.
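As a small sketch of that numbering (the extra groups in this pattern are made up for illustration and are not from perlrequick), nested groups are counted by the position of their opening parenthesis:

```perl
use strict;
use warnings;

my $t = "housecats";

# Groups are numbered left to right by their opening parenthesis:
# $1 = (house)(cat|) together, $2 = house, $3 = cat, $4 = optional "s"
my @groups = $t =~ m/((house)(cat|))(s?)/;

print join(",", @groups), "\n";
```

In list context the match returns the captures in that same order, so @groups holds $1 through $4.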
What perlrequick is talking about is what constitutes a successful match. Normally you would write
my $t = "housecats";
my $success = $t =~ m/house(cat|)/;
print $success ? "matched\n" : "no match\n";
This code produces
matched
as the document describes. If you set $t to housemartin then the result is the same because the regex pattern successfully finds house. But if $t is hosepipe then we see no match because the string contains neither house nor housecat
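A quick way to check the three cases just described is to loop over them. This is only a sketch; the %outcome hash is a name I made up to collect the results:

```perl
use strict;
use warnings;

my %outcome;   # hypothetical helper hash: string => result text
for my $t (qw( housecats housemartin hosepipe )) {
    # In scalar context the match operator returns true or false
    my $success = $t =~ m/house(cat|)/;
    $outcome{$t} = $success ? "matched" : "no match";
    print "$t: $outcome{$t}\n";
}
```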
If you need to extract parts of the matched string then you must use captures as described above. You can access the whole string that was matched through the built-in variable $&, but in Perl versions before 5.20 using it anywhere in the program slows down every regex match. For backward compatibility you should simply capture the whole pattern by writing
my $t = "housecats";
my ($m) = $t =~ m/(house(cat|))/;
print $m;
which produces housecat as you expected. It also sets the values of $1 and $2 to housecat and cat respectively
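If you can assume perl 5.10 or later, another option is the /p modifier together with ${^MATCH}, which gives the whole match without the $& penalty. A minimal sketch (the $whole variable is just a name for illustration):

```perl
use strict;
use warnings;

my $t = "housecats";
my $whole;

# /p asks perl to preserve the matched text for this match only
if ($t =~ m/house(cat|)/p) {
    $whole = ${^MATCH};   # whole matched text, like $&
    print "$whole\n";
}
```

On perl 5.20 and later the /p flag is accepted but no longer needed.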

You probably misunderstood the comment. It means that
for my $t (qw( housecats house )) {
my ($m) = $t =~ /house(cat|)/;
print "[$m]\n";
}
will print
[cat]
[]
i.e. the regex will match both housecat and house. If the pattern didn't match at all then $m would be undef
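To see the undef case next to the other two, here is a sketch that adds a non-matching string (hosepipe is borrowed from the other answer, and @report is a made-up name):

```perl
use strict;
use warnings;

my @report;
for my $t (qw( housecats house hosepipe )) {
    my ($m) = $t =~ /house(cat|)/;
    # An empty capture is a defined empty string; a failed match leaves $m undef
    push @report, defined $m ? "[$m]" : "undef";
}
print "$_\n" for @report;
```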

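my $t = "housecats";
# /n (perl 5.22+) turns off capturing, so with /g in list context
# the match returns the whole matched text instead of the group
my ($m) = $t =~ m/house(cat|)/gn;
print $m;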

Related

When using match operator on a variable declaration using qq, why do the parentheses work around the declaration itself?

I know the title is a mouthful but I wasn't sure how to make it more succinct.
Reading up on multiline strings in Perl I came across this post at PerlMaven about here-documents. It talks about here-documents, qq and q. In all cases leading whitespace, like that used for code indentation, is retained. I understand that. The way this is mitigated in the example shown is by using a regex replacement to remove the leading spaces.
if ($send) {
    (my $message = qq{
        Dear $name,
        this is a message I plan to send to you.
        regards
        the Perl Maven
        }) =~ s/^ {8}//mg;
    print $message;
}
When I was trying to adopt this style I wrote it like this (accidentally) instead:
if ($send) {
    my $message = (qq{
        Dear $name,
        words and stuff
        }) =~ s/^ {8}//mg;
    print $message;
}
Perl is not a strong language for me. The incorrect syntax I tried above seemed natural to me though. Since I am using the match operator incorrectly I obviously get the error:
Can't modify string in substitution (s///) at nagios_send_html_service_mail.pl line 91, near "s/^ {8}//mg;"
In the working example, why does $message actually contain the changes? It seems like something you are allowed to do when you declare variables, but I just wanted to know what it is called.
Because it's about what you're trying to change.
=~ is the binding operator, and tells you what to apply the pattern match to.
my $message = "fish";
$message =~ s/i/a/;
print $message
Will work, because you're trying to transform $message. This is what's happening in the first example: $message is assigned first, and then the modification is applied, because of the brackets.
However, =~ binds more tightly than =, so it happens first.
This precedence is documented in perldoc perlop.
So in the first example the assignment happens first because of the brackets, and then the transform. Without the brackets, it tries the transform first:
"fish" =~ s/i/a/;
Which is invalid, because it's not changing a variable at that point, but a static piece of text.
my $result = "fish" =~ s/i/a/; #gives same error.
( my $result = "fish" ) =~ s/i/a/; #works.
You could do this another way (if your version of perl is new enough) by using the r regex modifier, to return a value:
my $result = "fish" =~ s/i/a/r;
The r flag stops trying to modify the value, and just 'returns' the result of the operation, which then can be assigned to $result.
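A short sketch of the difference, assuming perl 5.14 or later ($original is a made-up variable name):

```perl
use strict;
use warnings;

my $original = "fish";

# With /r the substitution returns the changed copy and leaves $original alone
my $result = $original =~ s/i/a/r;

print "$original -> $result\n";
```

Because nothing is modified in place, the left-hand side may also be a string literal.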
=~ has higher precedence than =, so
my $message = (qq{...}) =~ s/^ {8}//mg;
is equivalent to
my $message = ( (qq{...}) =~ s/^ {8}//mg );
This tries to modify the constant returned by qq, which is not allowed.
A scalar assignment operator in scalar context returns its left-hand side (as an lvalue)[1]. That means
( $message = qq{...} ) =~ s/^ {8}//mg;
is equivalent to
$message = qq{...};
$message =~ s/^ {8}//mg;
Furthermore, my $message returns $message (as an lvalue), so
( my $message = qq{...} ) =~ s/^ {8}//mg;
is equivalent to
my $message;
$message = qq{...};
$message =~ s/^ {8}//mg;
This is why the original solution worked.
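The same idiom is handy whenever you want a modified copy and need the original untouched. A minimal sketch with made-up text and a 4-space indent:

```perl
use strict;
use warnings;

my $template = "    Dear reader,\n    regards\n";   # hypothetical indented text

# The parenthesised assignment yields $message as an lvalue,
# so s/// edits the copy, not $template
(my $message = $template) =~ s/^ {4}//mg;

print $message;
```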
Note that indenting changes will break your code, so your technique is fragile. Consider using the following instead:
if ($send) {
(my $message = qq{
!Dear $name,
!
!this is a message I plan to send to you.
!
!regards
! the Perl Maven
}) =~ s/^[^\S\n]*[!\n]?//mg;
print $message;
}
The above also removes the undesired blank leading line.
Finally, note that the r modifier was introduced in 5.14. With it, you can write:
if ($send) {
my $message = qq{
!Dear $name,
!
!this is a message I plan to send to you.
!
!regards
! the Perl Maven
} =~ s/^[^\S\n]*[!\n]?//mgr;
print $message;
}
See Mini-Tutorial: Scalar vs List Assignment Operator for more on what the assignment operator returns.

How do I match two strings that contain parentheses in Perl?

How do I match two strings that contain brackets?
The Perl code is here:
#!/usr/bin/perl -w
$a = "cat(S1)rat";
$b = "cat(S1)r";
if ( $a =~ $b ) {
printf("matching\n");
}
I am not getting the desired output.
Please help.
There are several answers here, but not a lot address your fundamental misunderstanding.
Here is a simplified version of your problem:
my $str = "tex(t)";
my $pattern = "tex(t)";
if ($str =~ $pattern) {
print "match\n";
} else {
print "NO MATCH\n";
}
This prints out NO MATCH.
The reason for this is the behavior of the =~ operator.
The thing on the left of that operator is treated as a string, and the thing on the right is treated as a pattern (a regular expression).
Parentheses have special meaning in patterns, but not in strings.
For the specific example above, you could fix it with:
my $str = "tex(t)";
my $pattern = "tex\\(t\\)";
More generally, if you want to escape "special characters" in $pattern (such as *, ., etc.), you can use the \Q...\E syntax others have mentioned.
Does it make sense?
Typically, you do not see a pattern represented as a string (as with "tex(t)").
The more common way to write this would be:
if ($str =~ /tex(t)/)
Which could be fixed by writing:
if ($str =~ /tex\(t\)/)
Note that in this case, since you are using a regex literal (the /.../ syntax), you do not need to double-escape the backslashes, as we did for the quoted string.
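To make the \Q...\E suggestion concrete, here is a sketch using the strings from the question (the variable names are mine). quotemeta is the function form of \Q:

```perl
use strict;
use warnings;

my $str     = "cat(S1)rat";
my $literal = "cat(S1)r";

# Unescaped, the parentheses form a group, so the pattern looks for "catS1r"
my $plain  = $str =~ /$literal/      ? 1 : 0;

# \Q...\E escapes the metacharacters, so the text is matched literally
my $quoted = $str =~ /\Q$literal\E/  ? 1 : 0;

my $escaped = quotemeta $literal;    # the escaped pattern as a string

print "plain=$plain quoted=$quoted escaped=$escaped\n";
```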
Try this code:
my $p = "cat(S1)rat";
my $q = "cat(S1)r";
if ( index( $p, $q ) == -1 ) {
print "Does not match";
} else {
print "Match";
}
You have to escape the parentheses:
if ( $a =~ /\Q$b/ ) {
print "matching\n";
}
And please avoid using the variable names $a and $b; they are reserved for sorting.
Also, there is no need to use printf here; print is enough.

What does this if statement do? (string comparison)

I am trying to understand a piece of code which loops over a file, does various assignments, then enters a set of if statements where a string is seemingly compared to nothing. What are /nonsynonymous/ and /prematureStop/ being compared to here? I am mostly experienced with python.
open(IN,$file);
while(<IN>){
chomp $_;
my @tmp = split /\t+/,$_;
my $id = join("\t",$tmp[0],$tmp[1]-1);
$id =~ s/chr//;
my @info_field = split /;/,$tmp[2];
my $vat = $info_field[$#info_field];
my $score = 0;
$self -> {VAT} ->{$id}= $vat;
$self ->{GENE} -> {$id} = $tmp[3];
if (/nonsynonymous/ || /prematureStop/){...
It is comparing against the current input line ($_).
By default, perl will automatically use the current input line ($_) when doing regex matches unless overridden (with =~).
From http://perldoc.perl.org/perlretut.html
If you're matching against the special default variable $_ , the $_ =~
part can be omitted:
$_ = "Hello World";
if (/World/) {
print "It matches\n";
}
else {
print "It doesn't match\n";
}
Often in Perl, if a specific variable isn't given, it's assumed that you want to use the default variable $_. For instance, the while loop assigns the incoming lines from <IN> to that variable, chomp $_; could just as well have been written chomp;, and the regular expressions in the if statement try to match with $_ as well.
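Here is a self-contained sketch of the same pattern. The sample data is made up, and I read it through an in-memory filehandle so that no real file is needed:

```perl
use strict;
use warnings;

# Hypothetical tab-separated lines standing in for the real input file
my $data = "chr1\tnonsynonymous\nchr2\tsynonymous\nchr3\tprematureStop\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my @hits;
while (<$in>) {        # each line is assigned to $_
    chomp;             # chomp works on $_ by default
    # Both matches are tested against $_ because no =~ is given
    push @hits, $_ if /nonsynonymous/ || /prematureStop/;
}
close $in;

print "$_\n" for @hits;
```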

Matching in Perl

I am trying to get text in between two dots of a line, but my program returns the entire line.
For example: I have text which looks like:
My sampledata 1,2 for perl .version 1_1.
I used the following match statement
$x =~ m/(\.)(.*)(\.)/;
My output for $x should be version 1_1, but I am getting the entire line as my match.
In your code, the value of $x will not change after the match; a match only tests the string and sets the capture variables.
When $x is successfully matched with m/(\.)(.*)(\.)/, your three capture groups will contain '.', 'version 1_1' and '.' respectively (in the order given). $2 will give you 'version 1_1'.
Since you probably only want the part 'version 1_1', you need not capture the two dots. This code will give you the same result:
$x =~ m/\.(.*)\./;
print $1;
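One thing to watch for: .* is greedy, so if the line contains more than two dots the capture runs up to the last one. A sketch with a made-up longer line shows the difference from a negated character class:

```perl
use strict;
use warnings;

# Hypothetical line with extra dots after the part we want
my $x = "My sampledata 1,2 for perl .version 1_1. More text. End";

my ($greedy) = $x =~ /\.(.*)\./;      # runs to the last dot
my ($tight)  = $x =~ /\.([^.]*)\./;   # stops at the next dot

print "greedy: $greedy\n";
print "tight:  $tight\n";
```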
Try this:
my $str = "My sampledata 1,2 for perl .version 1_1.";
$str =~ /\.\K[^.]+(?=\.)/;
print $&;
The period must be escaped outside a character class.
\K resets all that has been matched before (you can replace it by a lookbehind (?<=\.))
[^.] means any character except a period.
For several results, you can do this:
my $str = "qwerty .target 1.target 2.target 3.";
my @matches = ($str =~ /\.\K[^.]+(?=\.)/g);
print join("\n", @matches);
If you don't want the same period to be used twice (as the end of one match and the start of the next), you can do this:
my $str = "qwerty .target 1.target 2.target 3.";
my @matches = ($str =~ /\.([^.]+)\./g);
print join("\n", @matches)."\n";
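Run side by side, the two patterns give different lists, because the second one consumes the closing period. A sketch (the array names are mine):

```perl
use strict;
use warnings;

my $str = "qwerty .target 1.target 2.target 3.";

# The lookahead leaves the closing period available for the next match
my @shared   = $str =~ /\.\K[^.]+(?=\.)/g;

# Here the closing period is consumed, so "target 2" has no opening period left
my @consumed = $str =~ /\.([^.]+)\./g;

print "shared:   ", join(", ", @shared), "\n";
print "consumed: ", join(", ", @consumed), "\n";
```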
It should be simple enough to do something like this:
#!/usr/bin/perl
use warnings;
use strict;
my @tests = (
"test one. get some stuff. extra",
"stuff with only one dot.",
"another test line.capture this. whatever",
"last test . some data you want.",
"stuff with only no dots",
);
for my $test (@tests) {
# For this example, I skip $test if the match fails,
# otherwise, I move on to do stuff with $want
next if $test !~ /\.(.*)\./;
my $want = $1;
print "got: $want\n";
}
Output
$ ./test.pl
got: get some stuff
got: capture this
got: some data you want

Perl regular expressions and returned array of matched groups

I am new to Perl and I need to do some regex matching.
I read that when an array is used as an integer value, it gives the count of elements inside.
So I am doing, for example,
if (@result = $pattern =~ /(\d)\.(\d)/) {....}
and I thought it would return an empty array when pattern matching fails, but it still gives me an array with 2 elements, just with uninitialized values.
So how can I put pattern matching inside an if condition? Is it possible?
EDIT:
foreach (keys @ARGV) {
if (my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
if (defined $params{$result[0]}) {
print STDERR "Cmd option error\n";
}
$params{$result[0]} = (defined $result[1] ? $result[1] : 1);
}
else {
print STDERR "Cmd option error\n";
exit ERROR_CMD;
}
}
This is a regex pattern for command line options. The options are in long format, preceded by two hyphens and possibly followed by an argument, so
--CMD[=ARG]. I want an elegant solution, which is why I want to put it into the if condition without any prologue.
EDIT2:
Oh sorry, I was thinking the groups in the @result array are always counted from 0, but only the groups from the branch where the pattern succeeds are set. So if the command in my code is "input", I expected it in $result[0], but it is actually in $result[1]. I thought that because $result[0] is uninitialized, the pattern had failed and yet execution still went into the if statement.
Consider the following:
use strict;
use warnings;
my $pattern = 42.42;
my @result = $pattern =~ /(\d)\.(\d)/;
print @result, ' elements';
Output:
24 elements
Context tells Perl how to treat @result. There certainly aren't 24 elements! Perl has printed the array's elements, which are your regex's captures. However, if we do the following:
print 0 + @result, ' elements';
we get:
2 elements
In this latter case, Perl evaluates @result in scalar context, so it adds the number of elements to 0. This can also be achieved with scalar @result.
Edit to accommodate revised posting: Thus, the conditional in your code:
if(my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) { ...
evaluates to true if and only if the match was successful.
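The reason is that a list assignment in scalar context returns the number of elements on its right-hand side, and a failed match returns an empty list. A small sketch with made-up inputs (%count is just for illustration):

```perl
use strict;
use warnings;

my %count;   # hypothetical: input => number of captures returned
for my $input ("4.2", "no digits here") {
    # The condition is the element count of the list the match returned
    if (my @result = $input =~ /(\d)\.(\d)/) {
        $count{$input} = scalar @result;
        print "'$input' matched, captures: @result\n";
    }
    else {
        $count{$input} = 0;
        print "'$input' did not match\n";
    }
}
```

A successful match returns one element per group, even for groups that did not take part, which is why undefined entries do not make the condition false.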
@results = $pattern =~ /(\d)\.(\d)/ ? ($1,$2) : ();
Try this:
@result = ();
if ($pattern =~ /(\d)\.(\d)/)
{
push @result, $1;
push @result, $2;
}
=~ is not an equal sign. It's doing a regexp comparison.
So my code above is initializing the array to empty, then assigning values only if the regexp matches.
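For the issue raised in EDIT2, where the option name lands in a different slot depending on which branch matched, one possible approach is the branch reset group (?|...), available since perl 5.10. It restarts group numbering in each alternative. This is only a sketch with made-up arguments, not the asker's full option handling:

```perl
use strict;
use warnings;

my %params;
for my $arg ("--help", "--input=data.txt") {   # hypothetical arguments
    # Inside (?|...) both alternatives use groups 1 and 2,
    # so the option name is always $result[0]
    if (my @result = $arg =~ /^--(?|(help|br)|(input|output|format)=(.+))$/) {
        $params{ $result[0] } = defined $result[1] ? $result[1] : 1;
    }
}
print "$_ => $params{$_}\n" for sort keys %params;
```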