Perl one-liner to extract a multi-line pattern

I have a pattern in a file, as follows, which may or may not span multiple lines:
abcd25
ef_gh
( fg*_h
hj_b*
hj ) {
What I have tried:
perl -nle 'print while m/^\s*(\w+)\s+(\w+?)\s*(([\w-0-9,* \s]))\s{/gm'
I don't know what the flags mean here; all I did was write a regex for the pattern and insert it into the pattern space. This matches well if the pattern is on a single line, as in:
abcd25 ef_gh ( fg*_h hj_b* hj ) {
But it fails in the multi-line case!
I started with Perl yesterday, but the syntax is way too confusing. So, as suggested by a fellow SO member, I wrote a regex and inserted it into the code he provided.
I hope a Perl monk can help me in this case. Alternative solutions are welcome.
Input file:
abcd25
ef_gh
( fg*_h
hj_b*
hj ) {
abcd25
ef_gh
fg*_h
hj_b*
hj ) {
jhijdsiokdù ()lmolmlxjk;
abcd25 ef_gh ( fg*_h hj_b* hj ) {
Expected output:
abcd25
ef_gh
( fg*_h
hj_b*
hj ) {
abcd25 ef_gh ( fg*_h hj_b* hj ) {
The input file can contain several blocks whose start and end coincide with the start and end of the required pattern.
Thanks in advance for the replies.

Use the Flip-Flop Operator for a One-Liner
Perl makes this really easy with the flip-flop operator, which will allow you to print out all the lines between two regular expressions. For example:
$ perl -ne 'print if /^abcd25/ ... /\bhj \) {/' /tmp/foo
abcd25
ef_gh
( fg*_h
hj_b*
hj ) {
However, a simple one-liner like this can't reject a block based on what appears between the delimiting patterns. That calls for a more complex approach.
More Complicated Comparisons Benefit from Conditional Branching
One-liners aren't always the best choice, and regular expressions can get out of hand quickly if they become too complex. In such situations, you're better off writing an actual program that can use conditional branching rather than trying to use an over-clever regular expression match.
One way to do this is to build up your match with a simple pattern, and then reject any match that doesn't match some other simple pattern. For example:
#!/usr/bin/perl -nw
# Use flip-flop operator to select matches.
if (/^abcd25/ ... /\bhj \) {/) {
    push @string, $_
};
# Reject multi-line patterns that don't include a particular expression
# between flip-flop delimiters. For example, "( fg" will match, while
# "^fg" won't.
if (/\bhj \) {/) {
    $string = join("", @string);
    undef @string;
    push(@matches, $string) if $string =~ /\( fg/;
};
END {print @matches}
When run against the OP's updated corpus, this correctly yields:
abcd25
ef_gh
( fg*_h
hj_b*
hj ) {
abcd25 ef_gh ( fg*_h hj_b* hj ) {
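One detail of the flip-flop is worth a small sketch. With three dots, Perl does not test the end pattern on the line that matched the start pattern; with two dots, both are tested on the same line. That matters for a block that opens and closes on one line, like the last line of the input. The helper name select_range and the sample lines below are made up for illustration, and the brace is written as \{ only to stay on the safe side with newer Perls.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines selected by a two-dot flip-flop.
sub select_range {
    my @lines = @_;
    my @kept;
    for (@lines) {
        # ".." checks the end pattern on the same line that matched the
        # start pattern, so a one-line block closes the range immediately.
        push @kept, $_ if /^abcd25/ .. /\bhj \) \{/;
    }
    return @kept;
}

# Made-up sample: a one-line block, noise, then a multi-line block.
print "$_\n" for select_range(
    'abcd25 ef_gh ( fg*_h hj_b* hj ) {',
    'noise that should be skipped',
    'abcd25', 'ef_gh', '( fg*_h', 'hj_b*', 'hj ) {',
    'trailing noise',
);
```

With the three-dot form, the range opened by the one-line block would stay open and the noise line after it would be selected as well.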

The regex does not match even the single line. What do you think the double parentheses do?
You probably wanted
m/^\s*(\w+)\s+(\w+?)\s*\([\w0-9,*\s]+\)\s{/gm
Update: The specification has changed. The regex has (almost) not, but you have to change the code slightly:
perl -0777 -nle 'print "$1\n" while m/^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s{)/gm'
Another update:
Explanation:
The switches are described in perlrun: -0 (used here as -0777, which reads the whole file as one record), -n, -l and -e
The regex can be auto-explained by YAPE::Regex::Explain
perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s{)/)->explain'
The regular expression:
(?-imsx:^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s{))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
----------------------------------------------------------------------
\w+? word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the least amount
possible))
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
[\w0-9,*\s]+ any character of: word characters (a-z,
A-Z, 0-9, _), '0' to '9', ',', '*',
whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
----------------------------------------------------------------------
\) ')'
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
{ '{'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The /gm switches are explained in perlre
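To make the effect of the switches concrete, here is a minimal sketch of the same match written as a script. It assumes the whole input is already held in one string, which is what -0777 arranges for the one-liner. The sub name extract_blocks is invented for this example, the sample data mirrors the question's input without the non-ASCII character, and the final brace is escaped as \{ as a precaution for newer Perls.

```perl
use strict;
use warnings;

# Hypothetical helper: return every block matched by the answer's regex.
sub extract_blocks {
    my ($text) = @_;
    my @blocks;
    # /m lets ^ match after every newline; /g resumes after each match.
    while ($text =~ m/^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s\{)/gm) {
        push @blocks, $1;
    }
    return @blocks;
}

# The question's input, joined into a single string (slurp mode).
my $text = join "\n",
    'abcd25', 'ef_gh', '( fg*_h', 'hj_b*', 'hj ) {',
    'abcd25', 'ef_gh', 'fg*_h', 'hj_b*', 'hj ) {',
    'jhijdsiokd ()lmolmlxjk;',
    'abcd25 ef_gh ( fg*_h hj_b* hj ) {', '';

print "$_\n" for extract_blocks($text);
```

Because \s also matches newlines, the same pattern covers both the single-line and the multi-line layout once the input is one string.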

Related

perl, matching balanced parens using .Net regex

I needed some Perl code to match balanced parens in a string.
So I found the regular expression below, written for .NET, and pasted it into my Perl program, thinking the regex engines were similar enough for it to work:
/
\s*\(
(?: [^\(\)] | (?<openp>\() | (?<-openp>\)) )+
(?(openp)(?!))
\)\s*
/x
My understanding of how this regex works is as follows:
1. Match the first paren:
\(
2. Match pattern a, b, or c at least once:
(?: <a> | <b> | <c>)+
where a, b, and c are:
a is any character that is not a paren
[^\(\)]
b is a character that is a left paren
\(
c is a character that is a right paren
\)
and:
b is a capture group that pushes to the named capture "openp"
(?<openp>\()
c is a capture group that pops from the named capture "openp"
(?<-openp>\))
3. Reject any regular expression match where openp doesn't have zero items on the stack:
(?(openp)(?!))
4. Match the end paren:
\)
Here's the perl code:
sub eat_parens($) {
my $line = shift;
if ($line !~ /
\s*\(
(?: [^\(\)] | (?<openp>\() | (?<-openp>\)) )+
(?(openp)(?!))
\)\s*
/x)
{
return $line;
}
return $';
}
sub testit2 {
my $t1 = "(( (sdfasd)sdfsas (sdfasd) )sadf) ()";
$t2 = eat_parens($t1);
print "t1: $t1\n";
print "t2: $t2\n";
}
testit2();
Error is:
$ perl x.pl
Sequence (?<-...) not recognized in regex; marked by <-- HERE in m/\s*\((?: [^\(\)] | (?<openp> \( ) | (?<- <-- HERE openp> \) ) )+ (?(openp)(?!) ) \) \s*/ at x.pl line 411.
Not sure what's causing this.... any ideas?
Here's one way to do it:
/
(?&TEXT)
(?(DEFINE)
(?<TEXT>
[^()]*+
(?: \( (?&TEXT) \)
[^()]*+
)*+
)
)
/x
It can also be done without naming anything. Search for "recursive" in perlre.
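The error arises because balancing groups such as (?<-openp>...) are a .NET feature that Perl's regex engine does not implement; Perl expresses nesting through recursion instead. As a minimal sketch, the question's eat_parens could be rebuilt around the recursive pattern above. The \A anchor (so that only a leading group is removed) is an assumption about the intended behaviour, and recursion needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Sketch of eat_parens(): remove one leading balanced parenthesized
# group, plus surrounding whitespace, and return what is left.
sub eat_parens {
    my ($line) = @_;
    my $balanced = qr/
        \s* \( (?&TEXT) \) \s*
        (?(DEFINE)
            (?<TEXT>
                [^()]*+                           # text without parens
                (?: \( (?&TEXT) \) [^()]*+ )*+    # nested groups, recursively
            )
        )
    /x;
    # Work on a copy; if nothing matches, the line comes back unchanged.
    (my $rest = $line) =~ s/\A$balanced//;
    return $rest;
}

my $t1 = "(( (sdfasd)sdfsas (sdfasd) )sadf) ()";
print "t1: $t1\n";
print "t2: ", eat_parens($t1), "\n";
```

The possessive quantifiers (*+) keep the engine from backtracking into already balanced text, so an unbalanced input fails quickly instead of being retried in many ways.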

Why does multiple use of `<( )>` token within `comb` not behave as expected?

I want to extract the row key (here 28_2820201112122420516_000000), the column name (here bcp_startSoc), and the value (here 64.0) from $str, where $str is a row from HBase:
# `match` is OK
my $str = '28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0';
my $match = $str.match(/^ ([\d+]+ % '_') \s 'column=d:' (\w+) ',' \s timestamp '=' \d+ ',' \s 'value=' (<-[=]>+) $/);
my @match-result = $match».Str.Slip;
say @match-result; # Output: [28_2820201112122420516_000000 bcp_startSoc 64.0]
# `smartmatch` is OK
# $str ~~ /^ ([\d+]+ % '_') \s 'column=d:' (\w+) ',' \s timestamp '=' \d+ ',' \s 'value=' (<-[=]>+) $/
# say $/».Str.Array; # Output: [28_2820201112122420516_000000 bcp_startSoc 64.0]
# `comb` is NOT OK
# A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint.
# The <( is similar to other languages \K to discard any matches found before the \K.
my @comb-result = $str.comb(/<( [\d+]+ % '_' )> \s 'column=d:' <(\w+)> ',' \s timestamp '=' \d+ ',' \s 'value=' <(<-[=]>+)>/);
say @comb-result; # Expect: [28_2820201112122420516_000000 bcp_startSoc 64.0], but got [64.0]
I want comb to skip some parts of the match and return just the parts I want, so I used multiple <( and )> here, but I only get the last one as the result.
Is it possible to use comb to get the same result as match method?
TL;DR Multiple <(...)>s don't mean multiple captures. Even if they did, .comb reduces each match to a single string in the list of strings it returns. If you really want to use .comb, one way is to go back to your original regex but also store the desired data using additional code inside the regex.
Multiple <(...)>s don't mean multiple captures
The default start point for the overall match of a regex is the start of the regex. The default end point is the end.
Writing <( resets the start point for the overall match to the position you insert it at. Each time you insert one and it gets applied during processing of a regex it resets the start point. Likewise )> resets the end point. At the end of processing a regex the final settings for the start and end are applied in constructing the final overall match.
Given that your code just unconditionally resets each point three times, the last start and end resets "win".
.comb reduces each match to a single string
foo.comb(/.../) is equivalent to foo.match(:g, /.../)>>.Str;.
That means you only get one string for each match against the regex.
One possible solution is to use the approach @ohmycloudy shows in their answer.
But that comes with the caveats raised by myself and @jubilatious1 in comments on their answer.
Add { @comb-result .push: |$/».Str } to the regex
You can work around .comb's normal functioning. I'm not saying it's a good thing to do. Nor am I saying it's not. You asked, I'm answering, and that's it. :)
Start with your original regex that worked with your other solutions.
Then add { @comb-result .push: |$/».Str } to the end of the regex to store the result of each match. Now you will get the result you want.
$str.comb( / ^ [\d+]+ % '_' | <?after d\:> \w+ | <?after value\=> .*/ )
Since you have a comma-separated 'row' of information you're examining, you could try using split() to break your matches up, and assign to an array. Below in the Raku REPL:
> my $str = '28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0';
28_2820201112122420516_000000 column=d:bcp_startSoc, timestamp=1605155065124, value=64.0
> my @array = $str.split(", ")
[28_2820201112122420516_000000 column=d:bcp_startSoc timestamp=1605155065124 value=64.0]
> dd @array
Array @array = ["28_2820201112122420516_000000 column=d:bcp_startSoc", "timestamp=1605155065124", "value=64.0"]
Nil
> say @array.elems
3
3
Match on individual elements of the array:
> say @array[0] ~~ m/ ([\d+]+ % '_') \s 'column=d:' (\w+) /;
「28_2820201112122420516_000000 column=d:bcp_startSoc」
0 => 「28_2820201112122420516_000000」
1 => 「bcp_startSoc」
> say @array[0] ~~ m/ ([\d+]+ % '_') \s 'column=d:' <(\w+)> /;
「bcp_startSoc」
0 => 「28_2820201112122420516_000000」
> say @array[0] ~~ m/ [\d+]+ % '_' \s 'column=d:' <(\w+)> /;
「bcp_startSoc」
Boolean tests on matches to one-or-more array elements:
> say True if ( @array[0] ~~ m/ [\d+]+ % '_' \s 'column=d:' <(\w+)> /)
True
> say True if ( @array[2] ~~ m/ 'value=' <(<-[=]>+)> / )
True
> say True if ( @array[0] ~~ m/ [\d+]+ % '_' \s 'column=d:' <(\w+)> /) & ( @array[2] ~~ m/ 'value=' <(<-[=]>+)> / )
True
HTH.

how to pass one regex output to another regex in perl

How do I combine two regexes? This is my input:
1.UE_frequency_offset_flag else { 2} UE_frequency_offset_flag
2.served1 0x00 Uint8,unsigned char
#my first regex expression is used for extracting the values inside curly braces
my ($first_match) = /(\b(\d+)\b)/g;
print "$1 \n";
#my second regex expression
my ($second_match) = / \S \s+ ( \{ [^{}]+ \} | \S+ ) /x;
I was trying to combine both regex but did not get the expected output.
my ($second_match) = / \S \s+ ( \{ [^{}]+ \} |\b(\d+)\b| \S+ ) /x;
My expected output:
2,0x00
Where am I making a mistake?
The question is not completely clear to me, because I don't see how you want to combine the two regexes or pass the output of one to the other.
If you want to pass the captured part of the first regex then you need to save it to a variable:
my ($first_match) = /(\b(\d+)\b)/g;
my $captured = $1;
Then you can place the variable $captured in the second regex.
If you want to use the complete match and search inside that, then you need to do the following:
my ($first_match) = /(\b(\d+)\b)/g;
print "$1,"; # Don't print one space then new line if you want to have a comma separating the two values
my ($second_match) = $first_match =~ / \S \s+ ( \{ [^{}]+ \} | \S+ ) /x;
Based on your input, though, this won't generate the expected output.
When processing your input, the code below prints:
2,0x00
print "$1," if /\{\s*(\d+)\s*\}/;
print "$1\n" if /(\d+x\d+)/;
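As a minimal runnable sketch of those two statements, the version below collects the captures instead of printing them directly. The sub name extract_values is hypothetical, and the two sample lines are copied from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the number inside {...} and the hex-looking
# value (digits, "x", digits) from the given lines, joined by a comma.
sub extract_values {
    my @found;
    for my $line (@_) {
        push @found, $1 if $line =~ /\{\s*(\d+)\s*\}/;
        push @found, $1 if $line =~ /(\d+x\d+)/;
    }
    return join ',', @found;
}

print extract_values(
    '1.UE_frequency_offset_flag else { 2} UE_frequency_offset_flag',
    '2.served1 0x00 Uint8,unsigned char',
), "\n";
```

Each pattern is applied independently to every line, so there is no need to feed the result of one match into the other.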

how can i match two consecutive words that both start with capital letters?

I want to match a first and last name,
e.g. Robert Still. The words can be preceded and followed by whitespace; however, the string can only contain two words.
' Robert Still ' = true
' Robert Still ' = true
'e Robert Still 4 ' = false
This is the code that I tried:
m/^\s*[A-Z].*[a-z]\s*[A-Z].*[a-z]\s*$/
Try this:
#!/usr/bin/perl -w
use strict;
my @names = ('Robert Still', ' Robert Still', 'Robert Still ', '4 Robert Still 2', 'Robert e Still');
foreach (@names){
    if ($_ =~ /(^\s*[A-Z]\w+\s+[A-Z]\w+\s*$)/){
        print "'$_' : true\n"
    }
    else {
        print "'$_' : false\n";
    }
}
Output:
'Robert Still' : true
' Robert Still' : true
'Robert Still ' : true
'4 Robert Still 2' : false
'Robert e Still' : false
Regex explained:
^ Start of line
\s* 0 to infinite times [greedy] Whitespace [\t \r\n\f\v]
Char class [A-Z] matches: A-Z A character range between Literal A and Literal Z
\w+ 1 to infinite times [greedy] Word character [a-zA-Z_\d]
\s+ 1 to infinite times [greedy] Whitespace [\t \r\n\f\v]
Char class [A-Z] matches: A-Z A character range between Literal A and Literal Z
\w+ 1 to infinite times [greedy] Word character [a-zA-Z_\d]
\s* 0 to infinite times [greedy] Whitespace [\t \r\n\f\v]
$ End of line
/^ \s* \p{Lu}\S+ \s+ \p{Lu}\S+ \s* \z/x
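This last pattern can be tried out with a small sketch like the following; is_full_name is a made-up name. Note that \p{Lu} accepts any uppercase letter, not just A-Z, and that \S+ accepts any non-space characters after the capital, so this version is looser than the \w+ one above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string is exactly two words that each
# start with an uppercase letter, allowing surrounding whitespace.
sub is_full_name {
    my ($s) = @_;
    return $s =~ /^ \s* \p{Lu}\S+ \s+ \p{Lu}\S+ \s* \z/x ? 1 : 0;
}

for my $name (' Robert Still ', 'e Robert Still 4 ', 'Robert e Still') {
    printf "'%s' : %s\n", $name, is_full_name($name) ? 'true' : 'false';
}
```

Using \z rather than $ means a trailing newline only passes because \s* consumes it, not because the anchor silently ignores it.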

Perl Text processing on a variable before its usage

I wrote a Perl script which outputs a list containing entries like the one below:
$var = ' whatever'
$var contains: a single quote, a space, the word whatever, and a single quote.
Actually, this is the key of a hash and I want to pull the value for it. But because of the single quotes and the space in between, I am not able to look up the hash value.
So, I want to strip $var as below:
$var = whatever
meaning: remove the leading single quote, the space and the trailing single quote,
so that I can use $var as the hash key to pull the respective value.
Could you guide me to a Perl one-liner for this?
Thanks.
There are several ways to do it, but beware: modifying the keys of a hash can produce unwanted results, for example:
use strict;
use warnings;
use Data::Dumper;
my $src = {
    "a a" => 1,
    " a a " => 2,
    "' a a '" => 3,
};
print "src: ", Dumper($src);
my $trg;
@$trg{ map { s/^[\s']*(.*?)[\s']*$/$1/; $_ } keys %$src } = values %$src;
print "copy: ", Dumper($trg);
will produce:
src: $VAR1 = {
' a a ' => 2,
'\' a a \'' => 3,
'a a' => 1
};
copy: $VAR1 = {
'a a' => 1
};
Any regex can be explained with the YAPE::Regex::Explain module (from CPAN). For the above regex:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new( qr(^[\s']*(.*?)[\s']*$) )->explain;
will produce:
The regular expression:
(?-imsx:^[\s']*(.*?)[\s']*$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
[\s']* any character of: whitespace (\n, \r, \t,
\f, and " "), ''' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
[\s']* any character of: whitespace (\n, \r, \t,
\f, and " "), ''' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
In short, s/^[\s']*(.*?)[\s']*$/$1/; means:
at the beginning of the string, match whitespace or apostrophes as many times as possible,
then match anything (as little as possible),
at the end of the string, match whitespace or apostrophes as many times as possible,
and keep only the "anything" part
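Applied to the original problem, the substitution can clean a copy of the string just before the lookup, which leaves the hash itself untouched. This is only a sketch: clean_key and %value_for are names invented here, and the hash content is made-up data.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing quotes and whitespace
# from a copy of the string; the caller's variable is not modified.
sub clean_key {
    my ($raw) = @_;
    (my $key = $raw) =~ s/^[\s']*(.*?)[\s']*$/$1/;
    return $key;
}

my %value_for = (whatever => 42);    # made-up data for illustration
my $var = "' whatever'";

print $value_for{ clean_key($var) }, "\n";
```

Cleaning the lookup string rather than rewriting the hash keys avoids the key collisions shown in the Dumper output above.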
#!/usr/bin/perl
$string = "' my string'";
print $string . "\n";
$string =~ s/'//g;
$string =~ s/^ //g;
print $string;
Output
' my string'
my string
$var =~ tr/ '//d;
see: tr operator
or, by regex
$var =~ s/(?:^['\s]+)|'//g;
The latter will keep the spaces in the middle of the word, the former removes all spaces and single quotes.
A short test:
...
$var = q{' what ever'};
$var =~ s/
(?: # find the following group
^ # at string begin, followed by
['\s]+ # space or single quote, one or more
) # close group
| # OR
' # single quotes in the whole string
//gx ; # replace by nothing, use formatted regex (x)
print "|$var|\n";
...
prints:
|what ever|
as expected.