Perl Text processing on a variable before its usage - perl

I wrote a Perl script which outputs a list containing entries like the one below:
$var = ' whatever'
$var contains: a single quote, a space, the word whatever, and a single quote.
Actually, this is the key of a hash and I want to pull the value for it, but because of the single quotes and the space in between, I am not able to look up the hash value.
So, I want to strip $var as below:
$var = whatever
meaning remove the leading single quote, the space, and the trailing single quote,
so that I can use $var as a hash key to pull the respective value.
Could you guide me to a Perl one-liner for this?
Thanks.

Here are several ways to do it, but beware: modifying the keys of a hash can produce unwanted results, for example:
use strict;
use warnings;
use Data::Dumper;
my $src = {
"a a" => 1,
" a a " => 2,
"' a a '" => 3,
};
print "src: ", Dumper($src);
my $trg;
@$trg{ map { s/^[\s']*(.*?)[\s']*$/$1/; $_ } keys %$src } = values %$src;
print "copy: ", Dumper($trg);
will produce:
src: $VAR1 = {
' a a ' => 2,
'\' a a \'' => 3,
'a a' => 1
};
copy: $VAR1 = {
'a a' => 1
};
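The collapse shown above can be detected before it silently loses data. The following is a minimal sketch, assuming the same sample keys; the names %trg and %sources and the "first value wins" policy are illustrative choices, not part of the original answer.

```perl
use strict;
use warnings;

# Same sample data as above.
my %src = ( "a a" => 1, " a a " => 2, "' a a '" => 3 );

my (%trg, %sources);
for my $old (sort keys %src) {
    # Normalize a copy of the key; the original hash is left untouched.
    (my $new = $old) =~ s/^[\s']*(.*?)[\s']*$/$1/;
    push @{ $sources{$new} }, $old;
    # Keep the first value seen for each normalized key (an arbitrary policy).
    $trg{$new} = $src{$old} unless exists $trg{$new};
}

# Report every normalized key that more than one original key maps to.
for my $new (sort keys %sources) {
    my @olds = @{ $sources{$new} };
    next if @olds < 2;
    warn sprintf "%d keys collapse into '%s'\n", scalar @olds, $new;
}
```

Whether to keep the first value, the last value, or to die on a collision depends on the data; the sketch only makes the collision visible.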
Any regex can be explained with the YAPE::Regex::Explain module (from CPAN). For the above regex:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new( qr(^[\s']*(.*?)[\s']*$) )->explain;
will produce:
The regular expression:
(?-imsx:^[\s']*(.*?)[\s']*$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
[\s']* any character of: whitespace (\n, \r, \t,
\f, and " "), ''' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
[\s']* any character of: whitespace (\n, \r, \t,
\f, and " "), ''' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
In short, s/^[\s']*(.*?)[\s']*$/$1/; means:
at the beginning of the string, match whitespace or apostrophes as many times as possible,
then match anything (as little as possible),
then match whitespace or apostrophes at the end of the string as many times as possible,
and keep only the "anything" part.
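Applied to the original question, the substitution explained above can clean a single scalar before it is used as a hash key. This is a small sketch; the hash %value_for, its contents, and the helper clean_key are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table; the key and value are illustrative only.
my %value_for = ( whatever => 42 );

# Return a copy of the key without leading/trailing whitespace and single quotes.
sub clean_key {
    my ($key) = @_;
    $key =~ s/^[\s']*(.*?)[\s']*$/$1/;
    return $key;
}

my $var = "' whatever'";
my $key = clean_key($var);

if ( exists $value_for{$key} ) {
    print "$key => $value_for{$key}\n";
}
else {
    print "no entry for '$key'\n";
}
```

Working on a copy inside the helper leaves $var itself unchanged, which is convenient if the raw value is still needed for logging.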

#!/usr/bin/perl
$string = "' my string'";
print $string . "\n";
$string =~ s/'//g;
$string =~ s/^ //g;
print $string;
Output
' my string'
my string

$var =~ tr/ '//d;
see: tr operator
or, by regex
$var =~ s/(?:^['\s]+)|'//g;
The latter will keep the spaces in the middle of the word, the former removes all spaces and single quotes.
A short test:
...
$var = q{' what ever'};
$var =~ s/
(?: # find the following group
^ # at string begin, followed by
['\s]+ # space or single quote, one or more
) # close group
| # OR
' # single quotes in the whole string
//gx ; # replace by nothing, use formatted regex (x)
print "|$var|\n";
...
prints:
|what ever|
as expected.
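To see the difference between the two approaches side by side, here is a short sketch that applies both to copies of the same input; the variable names are illustrative.

```perl
use strict;
use warnings;

my $orig = q{' what ever'};

# tr with /d deletes every space and every single quote, including inner spaces.
(my $by_tr = $orig) =~ tr/ '//d;

# The regex removes leading quotes/whitespace and any remaining single quotes,
# but leaves whitespace inside the text alone.
(my $by_regex = $orig) =~ s/(?:^['\s]+)|'//g;

print "tr:    |$by_tr|\n";
print "regex: |$by_regex|\n";
```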

Related

how can I partition a line into code and comment using a single regex in perl?

I want to read through a text file and partition each line into the following three variables. Each variable must be defined, although it might be equal to the empty string.
$a1code: all characters up to and not including the first non-escaped percent sign. If there is no non-escaped percent sign, this is the entire line. As we see in the example below, this also could be the empty string in a line where the following two variables are non-empty.
$a2boundary: the first non-escaped percent sign, if there is one.
$a3cmnt: any characters after the first non-escaped percent sign, if there is one.
The script below accomplishes this but requires several lines of code, two hashes, and a composite regex, that is, 2 regex combined by |.
The composite seems necessary because the first clause,
(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)
does not match a line that is pure code, no comment.
Is there a more elegant way, using a single regex and fewer steps?
In particular, is there a way to dispense with the %match hash and somehow
fill the %+ hash with all three variables in a single step?
#!/usr/bin/env perl
use strict; use warnings;
print join('', 'perl ', $^V, "\n",);
use Data::Dumper qw(Dumper); $Data::Dumper::Sortkeys = 1;
my $count=0;
while(<DATA>)
{
$count++;
print "$count\t";
chomp;
my %match=(
a2boundary=>'',
a3cmnt=>'',
);
print "|$_|\n";
if($_=~/^(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)|(?<a1code>.*)/)
{
print "from regex:\n";
print Dumper \%+;
%match=(%match,%+,);
}
else
{
die "no match? coding error, should never get here";
}
if(scalar keys %+ != scalar keys %match)
{
print "from multiple lines of code:\n";
print Dumper \%match;
}
print "------------------------------------------\n";
}
__DATA__
This is 100\% text and below you find an empty line.
abba 5\% %comment 9\% %Borgia
%all comment
%
Result:
perl v5.34.0
1 |This is 100\% text and below you find an empty line. |
from regex:
$VAR1 = {
'a1code' => 'This is 100\\% text and below you find an empty line. '
};
from multiple lines of code:
$VAR1 = {
'a1code' => 'This is 100\\% text and below you find an empty line. ',
'a2boundary' => '',
'a3cmnt' => ''
};
------------------------------------------
2 ||
from regex:
$VAR1 = {
'a1code' => ''
};
from multiple lines of code:
$VAR1 = {
'a1code' => '',
'a2boundary' => '',
'a3cmnt' => ''
};
------------------------------------------
3 |abba 5\% %comment 9\% %Borgia|
from regex:
$VAR1 = {
'a1code' => 'abba 5\\% ',
'a2boundary' => '%',
'a3cmnt' => 'comment 9\\% %Borgia'
};
------------------------------------------
4 |%all comment|
from regex:
$VAR1 = {
'a1code' => '',
'a2boundary' => '%',
'a3cmnt' => 'all comment'
};
------------------------------------------
5 |%|
from regex:
$VAR1 = {
'a1code' => '',
'a2boundary' => '%',
'a3cmnt' => ''
};
------------------------------------------
You can use the following:
my ($a1code, $a2boundary, $a3cmnt) =
/
^
( (?: [^\\%]+ | \\. )* )
(?: (%) (.*) )?
\z
/sx;
It does not consider % escaped in abc\\%def since the preceding \ is escaped.
It requires no backtracking, and it always matches.
$a1code is always a string. It can be zero characters long (when the input is an empty string and when % is the first character), or the entire input string (when there is no unescaped %).
However, $a2boundary and $a3cmnt are only defined if there's an unescaped %. In other words, $a2boundary is equivalent to defined($a3cmnt) ? '%' : undef.
Explanation: [^\\%]+ matches non-escaped characters other than \ and %. \\. matches escaped characters. So (?: [^\\%]+ | \\. )* gets us the prefix, or the entire string if there are no unescaped %.
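Since the question asks for all three variables to be defined, the pattern above can be wrapped in a small helper that fills in empty strings. This is a sketch; the name partition_line is made up, and treating a non-matching line (for example one ending in a lone backslash) as pure code is an assumption, not something the answer specifies.

```perl
use strict;
use warnings;

sub partition_line {
    my ($line) = @_;
    my ($code, $boundary, $comment) = $line =~ /
        ^
        ( (?: [^\\%]+ | \\. )* )   # code: plain characters or backslash-escaped pairs
        (?: (%) (.*) )?            # optional boundary and comment
        \z
    /sx;

    # Assumption: if the pattern does not match, treat the whole line as code.
    return ($line, '', '') unless defined $code;

    # The optional group leaves these undefined when there is no unescaped %.
    return ($code, $boundary // '', $comment // '');
}

for my $line ( 'abba 5\% %comment 9\% %Borgia', '%all comment', 'no comment here' ) {
    my ($a1code, $a2boundary, $a3cmnt) = partition_line($line);
    print "|$a1code|$a2boundary|$a3cmnt|\n";
}
```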
What about cases like this\\%string where the backslash before the percent sign is itself escaped?
Consider something like this, which, instead of trying to use a regular expression to split the string into three groups, uses one to look for where it should be split, and substr to do the actual splitting:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
sub splitter {
my $line = shift;
if ($line =~ /
# Match either
(?<!\\)% # A % not preceded by a backslash
| # or
(?<=[^\\])(?:\\\\)+\K% # Any even number of backslashes followed by a %
/x) {
return (substr($line, 0, $-[0]), '%', substr($line, $+[0]));
} else {
return ($line, '', '');
}
}
while (<DATA>) {
chomp;
# Assign to an array instead of individual scalars for demonstration purposes
my @vals = splitter $_;
print Dumper(\@vals);
}
__DATA__
This is 100\% text and below you find an empty line.
abba 5\% %comment 9\% %Borgia
%all comment
%
a tricky\\%test % case
another \\\%one % to mess with you

Can somebody explain this obfuscated perl regexp script?

This code is taken from the HackBack DIY guide to rob banks by Phineas Fisher. It outputs a long text (The Sixth Declaration of the Lacandon Jungle). Where does it fetch it? I don't see any alphanumeric characters at all. What is going on here? And what does the -r switch do? It seems undocumented.
perl -Mre=eval <<\EOF
''
=~(
'(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})')
;$:="\."^ '~';$~='@'
|'(';$^= ')'^'[';
$/='`' |'.';
$,= '('
EOF
The basic idea of the code you posted is that each alphanumeric character has been replaced by a bitwise operation between two non-alphanumeric characters. For instance,
'`'|'%'
(5th line of the "star" in your code)
is a bitwise or between the backquote and the percent sign, whose codepoints are 96 and 37 respectively; their "or" is 101, which is the codepoint of the letter "e". The following few lines all print the same thing:
say '`' | '%' ;
say chr( ord('`' | '%') );
say chr( ord('`') | ord('%') );
say chr( 96 | 37 );
say chr( 101 );
say "e"
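The same arithmetic applies to the xor (^) pairs in the code. As a small sketch, the helper below (combine is a made-up name) applies Perl's string-wise bitwise operators to a few pairs taken from the obfuscated script and prints the codepoints involved.

```perl
use strict;
use warnings;

# Combine two one-character strings with string-wise bitwise OR or XOR.
sub combine {
    my ($left, $op, $right) = @_;
    return $op eq '|' ? ($left | $right) : ($left ^ $right);
}

# Pairs copied from the obfuscated code: they decode to "e", "u" and a space.
for my $triple ( [ '`', '|', '%' ], [ '[', '^', '.' ], [ '{', '^', '[' ] ) {
    my ($left, $op, $right) = @$triple;
    my $char = combine($left, $op, $right);
    printf "%s %s %s  ->  %d %s %d = %d = '%s'\n",
        $left, $op, $right, ord($left), $op, ord($right), ord($char), $char;
}
```

Because both operands are strings, Perl applies the operator character by character instead of converting them to numbers first.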
Your code starts with (ignoring whitespace, which doesn't matter):
'' =~ (
The corresponding closing bracket is 28 lines later:
^'(').'"})')
(C-f this pattern to see it on the web-page; I used my editor's matching parenthesis highlighting to find it)
We can assign everything in between the opening and closing parenthesis to a variable that we can then print:
$x = '(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})';
print $x;
This will print:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
The remainder of the code is a handful of assignments to some variables, probably there only to complete the pattern. The end of the star is:
$:="\."^'~';
$~='@'|'(';
$^=')'^'[';
$/='`'|'.';
$,='(';
Which just assigns simple one-character strings to some variables.
Back to the main code:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
This code is inside a regex that is matched against an empty string (don't forget that we first had '' =~ (...)). (?{...}) inside a regex runs the code between the braces. With some whitespace added, and with the code taken out of the string passed to eval, this gives us:
# fetch a URL from the web using curl _quietly_ (-s)
($: = `curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)
# replace end of paragraphs with newlines in the HTML fetched
=~ s|</p>|\n|g;
# Remove all HTML tags
$: =~ s/<.*?>//g;
# Print everything between SEXTA and Mexicano (+2 chars)
print $: =~ /(SEXTA.*?Mexicano..)/s
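The text-processing steps can be tried without touching the network. The sketch below runs the same three operations on a small, made-up HTML string instead of the output of curl; the sample markup and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Made-up stand-in for the page that curl would fetch.
my $html = '<div><p>intro</p><p>SEXTA one</p><p>two Mexicano.</p><p>after</p></div>';

# Replace paragraph ends with newlines, working on a copy.
(my $text = $html) =~ s|</p>|\n|g;

# Remove all remaining HTML tags.
$text =~ s/<.*?>//g;

# Capture everything from SEXTA up to Mexicano plus two more characters;
# /s lets . match the newlines inserted above.
my ($excerpt) = $text =~ /(SEXTA.*?Mexicano..)/s;

print $excerpt;
```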
You can automate this deobfuscation process by using B::Deparse: running
perl -MO=Deparse yourcode.pl
Will produce something like:
'' =~ m[(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})];
$: = 'P';
$~ = 'h';
$^ = 'r';
$/ = 'n';
$, = '(';

How to move a substring with preg_replace or preg_match in PHP?

I want to find a substring and move it in the string instead of replacing (for example, moving it from the beginning to the end of the string).
'THIS the rest of the string' -> 'the rest of the string THIS'
I do this by the following code
preg_match('/^(THIS).?/', $str, $match);
$str = trim( $str . $match[1] );
$str = preg_replace('/^(THIS).?/', '', $str);
There should be an easier way to do this with one regex.
You may use
$re = '/^(THIS)\b\s*(.*)/s';
$str = 'THIS the rest of the string';
$result = preg_replace($re, '$2 $1', $str);
See the regex demo and a PHP demo.
Details
^ - start of string
(THIS) - Group 1 (referenced to with $1 from the replacement pattern): THIS
\b - a word boundary (if you do not need a whole word, you may remove it)
\s* - 0+ whitespaces (if there is always at least one whitespace, use \s+ and remove \b, as it will become redundant)
(.*) - Group 2 (referenced to with $2 from the replacement pattern): the rest of the string (the s modifier allows . to match line break chars, too).

Parse single quoted string using Marpa::R2 perl

How do I parse a single-quoted string using Marpa::R2?
In my code below, single-quoted strings come out with a '\' added after parsing.
Code:
use strict;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new(
{ default_action => '[values]',
source => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:start ::= Expression
# include begin
Expression ::= Param
Param ::= Unquoted
| ('"') Quoted ('"')
| (') Quoted (')
:discard ~ whitespace
whitespace ~ [\s]+
Unquoted ~ [^\s\/\(\),&:\"~]+
Quoted ~ [^\s&:\"~]+
END_OF_SOURCE
});
my $input1 = 'foo';
#my $input2 = '"foo"';
#my $input3 = '\'foo\'';
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
print "Trying to parse:\n$input1\n\n";
$recce->read(\$input1);
my $value_ref = ${$recce->value};
print "Output:\n".Dumper($value_ref);
Outputs:
Trying to parse:
foo
Output:
$VAR1 = [
[
'foo'
]
];
Trying to parse:
"foo"
Output:
$VAR1 = [
[
'foo'
]
];
Trying to parse:
'foo'
Output:
$VAR1 = [
[
'\'foo\''
]
]; (don't want it to be parsed like this)
Above are the outputs for all three inputs. I don't want the third one to come out with the '\' and the single quotes; I want it to be parsed like the second output. Please advise.
Ideally, it should just pick the content between single quotes according to Param ::= (') Quoted (')
The other answer regarding Data::Dumper output is correct. However, your grammar does not work the way you expect it to.
When you parse the input 'foo', Marpa will consider the three Param alternatives. The predicted lexemes at that position are:
Unquoted ~ [^\s\/\(\),&:\"~]+
'"'
') Quoted ('
Yes, the last is literally ) Quoted (, not anything containing a single quote.
Even if it were ([']) Quoted ([']): Due to longest token matching, the Unquoted lexeme will match the entire input, including the single quote.
What would happen for an input like " foo " (with double quotes)? Now, only the '"' lexeme would match, then any whitespace would be discarded, then the Quoted lexeme matches, then any whitespace is discarded, then closing " is matched.
To prevent this whitespace-skipping behaviour and to prevent the Unquoted rule from being preferred due to LATM, it makes sense to describe quoted strings as lexemes. For example:
Param ::= Unquoted | Quoted
Unquoted ~ [^'"]+
Quoted ~ DQ | SQ
DQ ~ '"' DQ_Body '"'
DQ_Body ~ [^"]*
SQ ~ ['] SQ_Body [']
SQ_Body ~ [^']*
These lexemes will then include any quotes and escapes, so you need to post-process the lexeme contents. You can either do this using the event system (which is conceptually clean, but a bit cumbersome to implement), or adding an action that performs this processing during parse evaluation.
Since lexemes cannot have actions, it is usually best to add a proxy production:
Param ::= Unquoted | Quoted
Unquoted ~ [^'"]+
Quoted ::= Quoted_Lexeme action => process_quoted
Quoted_Lexeme ~ DQ | SQ
DQ ~ '"' DQ_Body '"'
DQ_Body ~ [^"]*
SQ ~ ['] SQ_Body [']
SQ_Body ~ [^']*
The action could then do something like:
sub process_quoted {
my (undef, $s) = @_;
# remove delimiters from double-quoted string
return $1 if $s =~ /^"(.*)"$/s;
# remove delimiters from single-quoted string
return $1 if $s =~ /^'(.*)'$/s;
die "String was not delimited with single or double quotes";
}
Your result doesn't contain \', it contains '. Dumper merely formats the result like that so it's clear what's inside the string and what isn't.
You can test this behavior for yourself:
use Data::Dumper;
my $tick = chr(39);
my $back = chr(92);
print "Tick Dumper: " . Dumper($tick);
print "Tick Print: " . $tick . "\n";
print "Backslash Dumper: " . Dumper($back);
print "Backslash Print: " . $back . "\n";
You can see a demo here: https://ideone.com/d1V8OE
If you don't want the output to contain single quotes, you'll probably need to remove them from the input yourself.
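Another way to convince yourself that the backslashes belong to Dumper's formatting and not to the data is to check the string's length and to ask Dumper for double-quoted output. This is a small sketch using the documented $Data::Dumper::Terse and $Data::Dumper::Useqq settings; the variable names are illustrative.

```perl
use strict;
use warnings;
use Data::Dumper;

my $value = q{'foo'};    # five characters: quote, f, o, o, quote

print "length: ", length($value), "\n";
print "raw:    $value\n";

# Terse drops the variable-name prefix so only the literal is printed.
local $Data::Dumper::Terse = 1;
my $single_quoted = Dumper($value);

# Useqq makes Dumper emit a double-quoted literal, where a single quote
# does not need escaping.
local $Data::Dumper::Useqq = 1;
my $double_quoted = Dumper($value);

print "Dumper (default): $single_quoted";
print "Dumper (Useqq):   $double_quoted";
```

The string is the same in both cases; only the Perl literal that Dumper prints for it changes.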
I am not so familiar with Marpa::R2, but could you try to use an action on the Expression rule:
Expression ::= Param action => strip_quotes
Then, implement a simple quote stripper like:
sub MyActions::strip_quotes {
${$_[1]}[0] =~ s/^'|'$//gr;
}

how to get the required strings from a text using perl

Here is the text to trim:
/home/netgear/Desktop/WGET-1.13/wget-1.13/src/cmpt.c:388,error,resourceLeak,Resource leak: fr
From the above text I need to get the data next to ":". How do I get 388,error,resourceLeak,Resource leak: fr?
You can use split to separate a string into a list based on a delimiter. In your case the delimiter should be a colon (:):
my @parts = split ':', $text;
As the text you want to extract can also contain a :, use the limit argument to stop after the first one:
my @parts = split ':', $text, 2;
$parts[1] will then contain the text you wanted to extract. You could also pass the result into a list, discarding the first element:
my (undef, $extract) = split ':', $text, 2;
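The effect of the limit argument is easy to see on the sample line from the question. A short sketch (the variable names are illustrative):

```perl
use strict;
use warnings;

my $text = '/home/netgear/Desktop/WGET-1.13/wget-1.13/src/cmpt.c:388,error,resourceLeak,Resource leak: fr';

# Without a limit, the colon inside "Resource leak: fr" also splits the string.
my @all_parts = split /:/, $text;

# With a limit of 2, only the first colon is used as a separator.
my (undef, $extract) = split /:/, $text, 2;

print "pieces without limit: ", scalar(@all_parts), "\n";
print "extract: $extract\n";
```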
Aside from @RobEarl's suggestion of using split, you could use a regular expression to do this.
my ($match) = $text =~ /^[^:]+:(.*?)$/;
Regular expression:
^ the beginning of the string
[^:]+ any character except: ':' (1 or more times)
: match ':'
( group and capture to \1:
.*? any character except \n (0 or more times)
) end of \1
$ before an optional \n, and the end of the string
$match will now hold the result of capture group #1:
388,error,resourceLeak,Resource leak: fr