perl, matching balanced parens using .Net regex - perl

I needed some perl code to match balanced parens in a string.
so I found this regular expresion code below from .Net and pasted it into my Perl program thinking the regex engine was similar enough for it to work:
/
\s*\(
(?: [^\(\)] | (?<openp>\() | (?<-openp>\)) )+
(?(openp)(?!))
\)\s*
/x
My understanding of how this regex works is a follows:
Match first paren:
\(
Match pattern a, b, or c at least once:
(?: <a> | <b> | <c>)+
where a, b, and c are:
a is any character that is not a paren
[^\(\)]
b is character that is a left-paren
\(
c is character that is a right-paren
\)
and:
b is a capture group that pushes to named capture "openp"
(?<openp>\()
c is a capture group that pops from named capture "openp"
(?<openp>\()
reject any regular expresssion match where openp doesn't equal zero items on stack:
(?<-openp>\))
4. match end paren
\)
Here's the perl code:
sub eat_parens($) {
my $line = shift;
if ($line !~ /
\s*\(
(?: [^\(\)] | (?<openp>\() | (?<-openp>\)) )+
(?(openp)(?!))
\)\s*
/x)
{
return $line;
}
return $';
}
sub testit2 {
my $t1 = "(( (sdfasd)sdfsas (sdfasd) )sadf) ()";
$t2 = eat_parens($t1);
print "t1: $t1\n";
print "t2: $t2\n";
}
testit2();
Error is:
$ perl x.pl
Sequence (?<-...) not recognized in regex; marked by <-- HERE in m/\s*\((?: [^\(\)] | (?<openp> \( ) | (?<- <-- HERE openp> \) ) )+ (?(openp)(?!) ) \) \s*/ at x.pl line 411.
Not sure what's causing this.... any ideas?

Here's one way to do it:
/
(?&TEXT)
(?(DEFINE)
(?<TEXT>
[^()]*+
(?: \( (?&TEXT) \)
[^()]*+
)*+
)
)
/x
It can also be done without naming anything. Search for "recursive" in perlre.

Related

Why does multiple use of `<( )>` token within `comb` not behave as expected?

I want to extract the row key(here is 28_2820201112122420516_000000), the column name(here is bcp_startSoc), and the value(here is 64.0) in $str, where $str is a row from HBase:
# `match` is OK
my $str = '28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0';
my $match = $str.match(/^ ([\d+]+ % '_') \s 'column=d:' (\w+) ',' \s timestamp '=' \d+ ',' \s 'value=' (<-[=]>+) $/);
my #match-result = $match».Str.Slip;
say #match-result; # Output: [28_2820201112122420516_000000 bcp_startSoc 64.0]
# `smartmatch` is OK
# $str ~~ /^ ([\d+]+ % '_') \s 'column=d:' (\w+) ',' \s timestamp '=' \d+ ',' \s 'value=' (<-[=]>+) $/
# say $/».Str.Array; # Output: [28_2820201112122420516_000000 bcp_startSoc 64.0]
# `comb` is NOT OK
# A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint.
# The <( is similar to other languages \K to discard any matches found before the \K.
my #comb-result = $str.comb(/<( [\d+]+ % '_' )> \s 'column=d:' <(\w+)> ',' \s timestamp '=' \d+ ',' \s 'value=' <(<-[=]>+)>/);
say #comb-result; # Expect: [28_2820201112122420516_000000 bcp_startSoc 64.0], but got [64.0]
I want comb to skip some matches, and just match what i wanted, so i use multiple <( and )> here, but only get the last match as result.
Is it possible to use comb to get the same result as match method?
TL;DR Multiple <(...)>s don't mean multiple captures. Even if they did, .comb reduces each match to a single string in the list of strings it returns. If you really want to use .comb, one way is to go back to your original regex but also store the desired data using additional code inside the regex.
Multiple <(...)>s don't mean multiple captures
The default start point for the overall match of a regex is the start of the regex. The default end point is the end.
Writing <( resets the start point for the overall match to the position you insert it at. Each time you insert one and it gets applied during processing of a regex it resets the start point. Likewise )> resets the end point. At the end of processing a regex the final settings for the start and end are applied in constructing the final overall match.
Given that your code just unconditionally resets each point three times, the last start and end resets "win".
.comb reduces each match to a single string
foo.comb(/.../) is equivalent to foo.match(:g, /.../)>>.Str;.
That means you only get one string for each match against the regex.
One possible solution is to use the approach #ohmycloudy shows in their answer.
But that comes with the caveats raised by myself and #jubilatious1 in comments on their answer.
Add { #comb-result .push: |$/».Str } to the regex
You can workaround .comb's normal functioning. I'm not saying it's a good thing to do. Nor am I saying it's not. You asked, I'm answering, and that's it. :)
Start with your original regex that worked with your other solutions.
Then add { #comb-result .push: |$/».Str } to the end of the regex to store the result of each match. Now you will get the result you want.
$str.comb( / ^ [\d+]+ % '_' | <?after d\:> \w+ | <?after value\=> .*/ )
Since you have a comma-separated 'row' of information you're examining, you could try using split() to break your matches up, and assign to an array. Below in the Raku REPL:
> my $str = '28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0';
28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0
> my #array = $str.split(", ")
[28_2820201112122420516_000000 column=d:bcp_startSoc timestamp=1605155065124 value=64.0]
> dd #array
Array #array = ["28_2820201112122420516_000000 column=d:bcp_startSoc", "timestamp=1605155065124", "value=64.0"]
Nil
> say #array.elems
3
Match on individual elements of the array:
> say #array[0] ~~ m/ ([\d+]+ % '_') \s 'column=d:' (\w+) /;
「28_2820201112122420516_000000 column=d:bcp_startSoc」
0 => 「28_2820201112122420516_000000」
1 => 「bcp_startSoc」
> say #array[0] ~~ m/ ([\d+]+ % '_') \s 'column=d:' <(\w+)> /;
「bcp_startSoc」
0 => 「28_2820201112122420516_000000」
> say #array[0] ~~ m/ [\d+]+ % '_' \s 'column=d:' <(\w+)> /;
「bcp_startSoc」
Boolean tests on matches to one-or-more array elements:
> say True if ( #array[0] ~~ m/ [\d+]+ % '_' \s 'column=d:' <(\w+)> /)
True
> say True if ( #array[2] ~~ m/ 'value=' <(<-[=]>+)> / )
True
> say True if ( #array[0] ~~ m/ [\d+]+ % '_' \s 'column=d:' <(\w+)> /) & ( #array[2] ~~ m/ 'value=' <(<-[=]>+)> / )
True
HTH.

pass by value to be made into pass by pointer-perl code [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 6 years ago.
Improve this question
I have a function call like this below:
Send(0x39,((rLoss>>8)&0xFF),(rLoss&0xFF) );
I want to convert this function to pass by pointer. I have written two macros like
BYTE0(var) ((uint8_t *)&var)
BYTE1(var) ((uint8_t)&var)+1)
I want the result to be
Send(0x39,BYTE1(rLoss),BYTE0(rLoss) );
Can you please help me do this in perl... Thanks....
I presume that the first argument to the call is always a hex number that doesn't have to be examined or transformed. I've also presumed that args 2 and 3 are always ANDed with 0xFF. Finally, I've presumed that the function being called and the argument being shifted are simple words - i.e. match \w+. With these presumptions, the following appears to do what you want;
use v5.12;
while (<>) {
chomp ;
if (/ ^ (\w+) \( .* \) \s* ; $ /x) {
my $call = $1 ; # function name being called
s/ ^ \w+ \( //x ; # Strip off fn call and open paren
s/ \) \s* ; \s* $ //x ; # Strip off close paren and semicolon
my ($arg1 , $arg2 , $arg3) = split ',' ; # split into arguements of the call
my $new_args = join("," , $arg1 , transform($arg2) , transform($arg3)) ;
say "$call($new_args );" ;
}
else {
say $_ ;
}
}
sub transform {
$_ = shift ;
my $replacement ;
s/ ^ \s* \( //x; # Strip off opening paren
s/ \) \s* $ //x; # Strip off closing paren
s/ & 0xFF $ //x ; # Strip off ANDing all ones
if (/^ \w+ $/x) { # Simple var name left?
$replacement = "BYTE0(" . $_ . ")" ;
}
elsif (/ ^ \( (\w+) >> (\d+) \) $ /x) { # var name shifted some number of bits
my $var_name = $1 ;
my $shift_size = $2 ;
my $byte_num = $shift_size / 8 ;
$replacement = "BYTE" . $byte_num . "(" . $var_name . ")" ;
}
else {
warn "Dont understand '$_' on line $.\n";
$replacement = $_ ;
}
return $replacement
}
Its unix filter style - input on STDIN, transformed output on STDOUT. When I feed it this made up data;
hello
Send(0x39,((rLoss>>8)&0xFF),(rLoss&0xFF) );
world
Receive(0x12,(rWin&0xFF),((rWin>>16)&0xFF) );
bye
It spits out
hello
Send(0x39,BYTE1(rLoss),BYTE0(rLoss) );
world
Receive(0x12,BYTE0(rWin),BYTE2(rWin) );
bye
Hopefully, the inline comments explain the code. The decision on whether to attempt to transform the line or leave it alone is based solely on the first regex - a word (fn call) followed by something in parentheses - this may or may not be what you want. Perhaps you know that its always a call to "Send" in which case you can put that in the regex.
The only other thing you may not be familiar with is the integer division operator '/'. This is used to translate the number of bits being shifted to the BYTE num macro call.

how to pass one regex output to another regex in perl

How to combine two regex . This is my input:
1.UE_frequency_offset_flag else { 2} UE_frequency_offset_flag
2.served1 0x00 Uint8,unsigned char
#my first regex expression is used for extracting the values inside curly braces
my ($first_match) = /(\b(\d+)\b)/g;
print "$1 \n";
#my second regex expression
my ($second_match) = / \S \s+ ( \{ [^{}]+ \} | \S+ ) /x;
I was trying to combine both regex but did not get the expected output.
my ($second_match) = / \S \s+ ( \{ [^{}]+ \} |\b(\d+)\b| \S+ ) /x;
My expected output:
2,0x00
Please help where I am doing mistake?
The question is not completely clear to me, because I don't see how you want to combine two regex or pass the output of one to the other.
If you want to pass the captured part of the first regex then you need to save it to a variable:
my ($first_match) = /(\b(\d+)\b)/g;
my $captured = $1;
Then you can place the variable $captured in the second regex.
If you want to use the complete match and search inside that. Then you need to do the following:
my ($first_match) = /(\b(\d+)\b)/g;
print "$1,"; # Don't print one space then new line if you want to have a comma separating the two values
my ($second_match) = $first_match =~ / \S \s+ ( \{ [^{}]+ \} | \S+ ) /x;
Based on your input, this won't generate the expected output.
The following code would print out:
2,0x00
When processing your input.
print "$1," if /\{\s*(\d+)\s*\}/;
print "$1\n" if /(\d+x\d+)/;

using Regexp::Grammars what does (*COMMIT) do?

I'm looking at an example for Regexp::Grammars. The example has a comment around *COMMIT stating about how it will improve the error messages. I can't find any documentation on *COMMIT. What does it do ? I've included part of the example below:
use Regexp::Grammars;
qr{
\A
<Answer>
(*COMMIT) # <-- Remove this to see the error messages get less accurate
(?:
\Z
|
<warning: (?{ "Extra junk after expression at index $INDEX: '$CONTEXT'" })>
<warning: Expected end of input>
<error:>
)
<rule: Answer>
<[_Operand=Mult]>+ % <[_Op=(\+|\-)]>
(?{ $MATCH = shift #{$MATCH{_Operand}};
for my $term (#{$MATCH{_Operand}}) {
my $op = shift #{$MATCH{_Op}};
if ($op eq '+') { $MATCH += $term; }
else { $MATCH -= $term; }
}
})
|
<error: Expected valid arithmetic expression>
(*COMMIT) is documented in perlre.
(*COMMIT) is useful in causing an whole alternation to fail when one of the branch fails after reaching a certain point.
$ perl -E'
say "$_: ", /
^
(?: a (*COMMIT) b
| c (*COMMIT) d
| . z
)
/sx ?1:0
for qw( ab cd az yz );
'
ab: 1
cd: 1
az: 0
yz: 1
You could have written the following, but it could be far less efficient and far harder to write in more complex examples:
/
^
(?: a b
| c d
| [^ac] z
)
/x
It's a backtracking control verb, described in perlre
Essentially it forces a regex to fail if a later part of the pattern mismatches and in a way that would cause the regex engine to backtrack into it

regular expression is not working

my $pat = '^x.*d$';
my $dir = '/etc/inet.d';
if ( $dir =~ /$pat/xmsg ) {
print "found ";
}
how to make it sucess
Your pattern is looking for strings starting with x (^x) and ending in d (d$). The path you are trying does not match as it doesn't start with x.
You can use YAPE::Regex::Explain to help you understand regular expressions:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/^x.*d$/xms;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?msx-i:^x.*d$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?msx-i: group, but do not capture (with ^ and $
matching start and end of line) (with .
matching \n) (disregarding whitespace and
comments) (case-sensitive):
----------------------------------------------------------------------
^ the beginning of a "line"
----------------------------------------------------------------------
x 'x'
----------------------------------------------------------------------
.* any character (0 or more times (matching
the most amount possible))
----------------------------------------------------------------------
d 'd'
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
Also, you should not need the g modifier in this case. The documentation has plenty of information about regexes: perlre
There is an 'x' too much :
my $pat = '^.*d$';
my $dir = '/etc/inet.d';
if ( $dir =~ /$pat/xmsg ) {
print "found ";
}
My guess is that you're trying to list all files in /etc/init.d whose name matches the regular expression.
Perl isn't smart enough to figure out that when you name a string variable $dir, assign to it the full pathname of an existing directory, and pattern match against it, you don't intend to match against the pathname,
but against the filenames in that directory.
Some ways to fix this:
perldoc -f glob
perldoc -f readdir
perldoc File::Find
You may just want to use this:
if (glob('/etc/init.d/x*'))
{
warn "found\n";
}