Perl 5: How to suppress #var interpolation in regex patterns - perl

This perl script doesn't substitute the input text_str as expected:
my $text_str ='public class ZipUtilTest extends TestCase {}';
my $find = '^(public class \\w+) extends TestCase \\{';
my $replace = '#RunWith(JUnit4.class)\\n\\1 {';
eval '$text_str =~ ' . "s#$find#$replace#mg";
say "$text_str";
Output (wrong):
(JUnit4.class)
public class ZipUtilTest {}
This revised perl script (with '#' in 'replace' escaped) substitutes as expected:
my $text_str ='public class ZipUtilTest extends TestCase {}';
my $find = '^(public class \\w+) extends TestCase \\{';
my $replace = '#RunWith(JUnit4.class)\\n\\1 {';
$replace =~ s/#/\\#/g; # Escape '#' to avoid Perl #var interpolation
eval '$text_str =~ ' . "s#$find#$replace#mg";
say "$text_str";
Output (correct):
#RunWith(JUnit4.class)
public class ZipUtilTest {}
It looks like '#RunWith' in the 'replace' pattern is treated as a Perl #variable and interpolated to an empty string.
Is there a better way to handle this than escaping the '#' character in patterns? If we have to do this, any other '#'-like characters need to be escaped?
(Note: This has nothing to do with using \Q\E to suppress the magic of regex meta characters. Please don't close this because of existing questions of that nature.)

You can use a positive-lookahead to match the { without capturing it in $1. Then the replacement string does not need to contain the $1.
When building a regex, it's better to use the regex quoting operator qr{} than strings; it will quote like a regex, not a string. This can avoid subtle bugs.
use v5.10;
my $text_str = 'public class ZipUtilTest extends TestCase {}';
# Use a positive-look ahead to match, but not capture, the {
# Quote as regex to avoid subtle quoting issues.
my $find = qr'^(public class \w+) extends TestCase(?>\s*\{)';
# Use double-quotes to interpolate the \n, but escape the \#.
my $replace = "\#RunWith(JUnit4.class)\n";
# Add the $1 to the end of the replacement.
$text_str =~ s{$find}{$replace$1};
say $text_str;
Demonstration.

It seems you want to load language-agnostic search and replace patterns from a configuration file, and then apply them via a Perl script.
If that is your goal, then using eval is not appropriate since Perl has syntax that you do not want to support, as you found out.
It is not reasonable to try to work around those Perl-specific parts by trying to escape them, since that can get rather complex. For example, you considered escaping occurrences of # as they can introduce an array name, but what if that character already is backslash-escaped? Handling this properly would require an almost complete re-implementation of Perl's string literal syntax, which doesn't sound like fun.
What I would do is to define a replacement string syntax of our own, so that we're completely independent from Perl's syntax.
For example, we might define our replacement string syntax to be entirely verbatim, except that we support certain backslash-escapes. Let's say that the syntax '\' DIGIT such as \1 replaces a capture, and that the usual backlash escapes are supported (\b \t \n \v \f \r \" \' \\ \x0A), which is the common subset of JavaScript string literals, Python 3 string literals, and Perl escapes, minus octal escapes. Note that these languages do not agree on a syntax for Unicode characters.
We can implement an interpreter for this string replacement language as follows: we parse the replacement string into an array of opcodes, alternating a literal string with the number of a capture. For example, the replacement pattern abc\1def would be parsed into ['abc', 1, 'def']:
sub parse_replacement_pattern {
my ($pattern) = #_;
my #ops = (''); # init with empty string
# use m//gc style parsing which lets us anchor patterns at the current "pos"
pos($pattern) = 0;
while (pos $pattern < length $pattern) {
if ($pattern =~ /\G([^\\]+)/gc) {
$ops[-1] .= $1;
}
elsif ($pattern =~ /\G\\n/gc) {
$ops[-1] .= "\n";
}
... # and so on for the basic escapes
elsif ($pattern =~ /\G\\x([0-9a-fA-F]{2})/gc) {
$ops[-1] .= chr $1;
}
elsif ($pattern =~ /\G\\([1-9])/gc) {
push #ops, $1, ''; # add replacement opcode + empty string
}
else {
die "invalid syntax";
}
}
return \#ops;
}
We can apply such a replacement pattern by looping through the operations, appending the literal string or the capture contents as appropriate.
sub apply_replacement_pattern {
my ($ops) = #_;
my $output = '';
my $is_capture = 0;
for my $op (#$ops) {
if ($is_capture) {
# we know that $op must be the number of a capture buffer
$output .= ${^CAPTURE}[$op - 1]; # like eval "\$$op"
}
else {
# we know that $op must be a literal string
$output .= $op;
}
$is_capture = !$is_capture;
}
return $output;
}
We can now use these functions in your test case:
my $text_str ='public class ZipUtilTest extends TestCase {}';
my $find = '^(public class \\w+) extends TestCase \\{';
my $replace = '#RunWith(JUnit4.class)\\n\\1 {';
my $replace_ops = parse_replacement_pattern($replace);
$text_str =~ s{$find}{apply_replacement_pattern($replace_ops)}mge;
say $text_str;
This produces the expected output
#RunWith(JUnit4.class)
public class ZipUtilTest {}

Here is a version that works: use $1 directly in the replacement side, not in a pre-made variable for it. It saves us some hassle.
use warnings;
use strict;
use feature 'say';
my $text_str = 'public class ZipUtilTest extends TestCase {}';
#say $text_str;
my $re = '^(public class \w+) extends TestCase \{';
#say $re;
my $replace = "\#RunWith(JUnit4.class)\n";
#say $replace;
$text_str =~ s/$re/${replace}$1 {/;
say $text_str;
Update with comments
The variables for the pattern and the replacement string are read from a configuration file. Then the "hassle" I mention becomes more serious.
If $1 is to be prepared in the replacement-string variable, it must be a mere string (of characters $ and 1) there while it need become a variable, and be evaluated, in the regex.
That means the variable must be eval-ed (or regex run with /ee), and that is the problem with the string form of eval -- input from outside: eval will evaluate (run) anything, any code. We don't need malicious action regarding text-to-become-code in config files, just consider typos.
As for nicely escaping (only) what need be escaped, one can prepare for that, a hash for example:
my %esc_char = ( at => '\#' ); # etc
and use this when composing the variable with the replacement string.
If both the pattern and replacement must come from config files and must be non-specific to Perl, as a comment says, then I am not sure how to improve the code offered in the question. Except that it should be heavily protected against running (accidentally, say) bad code.

Related

Perl jail escape

Given the following Perl code, how could one get code execution if they control $foo?
sub Parse($)
{
my $dataPt = shift;
my (#toks, $tok, $more);
Tok: for (;;) {
# find the next token
last unless $$dataPt =~ /(\S)/sg; # get next non-space character
if ($1 eq '(') { # start of list
$tok = Parse($dataPt);
} elsif ($1 eq ')') { # end of list
$more = 1;
last;
} elsif ($1 eq '"') { # quoted string
$tok = '';
for (;;) {
my $pos = pos($$dataPt);
last Tok unless $$dataPt =~ /"/sg;
$tok .= substr($$dataPt, $pos, pos($$dataPt)-1-$pos);
# we're good unless quote was escaped by odd number of backslashes
last unless $tok =~ /(\\+)$/ and length($1) & 0x01;
print("here\n");
$tok .= '"'; # quote is part of the string
}
# must protect unescaped "$" and "#" symbols, and "\" at end of string
$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
# convert C escape sequences (allowed in quoted text)
$tok = eval qq{"$tok"};
} else { # key name
pos($$dataPt) = pos($$dataPt) - 1;
# allow anything in key but whitespace, braces and double quotes
# (this is one of those assumptions I mentioned)
$tok = $$dataPt =~ /([^\s()"]+)/sg ? $1 : undef;
}
push #toks, $tok if defined $tok;
}
# prevent further parsing unless more after this
pos($$dataPt) = length $$dataPt unless $more;
return #toks ? \#toks : undef;
}
$foo = '(test(foo "bar"))';
$ref = \$foo;
ParseAnt $ref;
I believe there is a way to force the parsing function to include an unescaped double quote in the $tok variable before it is processed by eval, but I was not successful in doing so.
I cannot provide more information as this code snippet is used in production.
Edit
Since the (well-meant) changes to the question happened to invalidate an early answer I am adding this note, along with the original version for the reader's convenience (what can be seen under revisions anyway) ---
Original version of this question:
Given the following Perl code, how could one get code execution if they control $str?
my $str = "malicious payload";
die if $str =~ /"/;
$str =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
eval qq{"$str"};
You can take advantage of \c to eat an inserted escape character.
\c${ print qq{0wn3d\n}; \'' }
The key code is
$str =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
This answer focuses on this as this is all that was provided intially.
There are two ways to inject code:
Closing the string literal.
This would require a literal " in the input, or its production by the validator.
Using a construct that allows code to be embedded.
These are:
$BLOCK
#BLOCK
$NAME[ EXPR ], $NAME->[ EXPR ], $BLOCK[ EXPR ], $$NAME[ EXPR ]
#NAME[ EXPR ], $NAME->#[ EXPR ], #BLOCK[ EXPR ], #$NAME[ EXPR ]
$NAME{ EXPR }, $NAME->{ EXPR }, $BLOCK{ EXPR }, $$NAME{ EXPR }
#NAME{ EXPR }, $NAME->#{ EXPR }, #BLOCK{ EXPR }, #$NAME{ EXPR }
Both EXPR and BLOCK can contain executable code.
There are various ways of getting those sequences into a string.
Fooling the validator into thinking something is already escaped.
Causing an an escape to be treated as something else.
Fooling the validator into escaping what would already escape the sequence.
Through removal of characters from the middle.
Taking advantage of $$ or $\ somehow.
The snippet's intent to is to process \ escapes as Perl would.[1] We can take advantage of \c to eat an escape character. \c eats the next character, so we can use before a $ to each the the validator's attempt to escape the $.
\c${ print qq{0wn3d\n}; \'' }
becomes
"\c\${ print qq{0wn3d\n}; \'' }"
which means
do { print qq{0wn3d\n}; chr(0x1C) }
Kudos to #bananabr for finding \c.
This, in of itself, is surely a bug. Write a parser for your language's escapes.
{ package Jail::Breaker;
use overload
'""' => sub {
my ($self) = #_;
if ($self->[0]++ < 1) {
return $self->[1]
} else {
return qq(";system '$self->[1]';")
}
},
fallback => 1;
sub new {
my ($class, $string) = #_;
bless [0, $string], $class
}
}
my $str = 'Jail::Breaker'->new('ls -la /');
die 'invalid' if $str =~ /"/;
$str =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
eval qq{"$str"};
or, similarly,
{ package Jail::Breaker;
use Tie::Scalar;
use parent -norequire => 'Tie::StdScalar';
my $fetched;
sub FETCH {
my ($self) = #_;
if ($fetched++) {
return qq(";system'$$self';")
} else {
return $$self
}
}
}
tie my $str, 'Jail::Breaker', 'ls -la /';
...
Both the solutions use an object which returns something else when read for the first time, and the "evil" string when read later.

substring match in perl, yes or no

With perl, I want to find a substring in a string. The result is a True/False and then I will decide what to do. There is good post here, but the usage is vague for me.
my $big_string = "Hello Good World";
my $pat = "world";
While the capital/non-capital letters are not important, I want to get True.
use List::Util 'any';
my $match_found = any { /$pat/ } #big_string;
if (match_found)
print "yes\n";
else
print "no\n";
Is that correct? Is there any better API for this purpose?
It's not correct because it doesn't even compile: if / else are always followed by a { } block in Perl. Also, if (match_found) is missing the $ sigil on the variable.
If you really want case insensitive matching (i.e. to ignore the differences between uppercase and lowercase letters), you need to add the i flag to the regex.
Finally, your code doesn't define a #big_string array, only a $big_string scalar.
So:
use strict;
use warnings;
use List::Util 'any';
my $big_string = "Hello Good World";
my $pat = "world";
my $match_found = any { /$pat/i } $big_string;
if ($match_found) {
print "yes\n";
} else {
print "no\n";
}
This code would work, but it can be improved.
First off, why use any at all? We don't have a list of multiple strings to check:
my $match_found = $big_string =~ /$pat/i;
Second, $pat doesn't look like it's supposed to be a regex. It's a plain string. This doesn't make any difference for alphanumeric characters (such as world), but in general we should escape all regex metacharacters in strings that aren't intended to be interpreted as regexes. This can be done with the quotemeta function (or its \Q \E short form):
my $match_found = $big_string =~ /\Q$pat\E/i;
This is our improved version:
use strict;
use warnings;
my $big_string = "Hello Good World";
my $pat = "world";
my $match_found = $big_string =~ /\Q$pat\E/i;
if ($match_found) {
print "yes\n";
} else {
print "no\n";
}
Finally, we don't even need a regex for a simple substring search. Instead we can do this:
use feature 'fc';
my $match_found = index(fc($big_string), fc($pat)) >= 0;
fc implements full Unicode case folding.

Joining Arguments - Perl

I'm trying to send arguments to a function and then in that function, join the arguments to make a string.
The code I'm trying
method send_data(\#args) {
my $string = join('%', #args);
print $string . '\n';
}
send_data('test1', 'test2', 'test3');
I know I could do it this way:
my #params = ('test1', 'test2', 'test3');
send_data(\#params);
But I prefer the first method. Without Method::Signatures you would do it this way:
sub send_data {
my (#args) = #_;
my $string = join('%', #args);
print $string . '\n';
}
send_data('test1', 'test2', 'test3');
Help?
With Method::Signatures, if you use method then the first parameter gets consumed as an implicit $self variable leaving only the remaining parameters in #args.
As this function isn't part of a class you should be using func instead of method and in this particular case there's no need for the \ either:
func send_data(#args) {
my $string = join('%', #args);
print $string . '\n';
}
send_data('test1', 'test2', 'test3');
NB: this will also print the literal string \n - if you wanted an actual newline you must enclose the \n in double quotes instead of single quotes.

Reference a variable within variable - hashes?

I have been looking for a solution to my problem and Hashes seem to be the answer after reading several posts but I am unsure how to implement this for my requirement, can anyone suggest how or even a better option?
In my code the variable $host is being set from values in a database. I loop through these values changing the value of $host each time.
I want to discard some types of host names, and to determine which hosts to discard I read in a user-configurable file which holds the Perl regex for that exclude. i.e. the config file has a line
EXCLUDE=\d+DAT\d+,\d+INF\d+
I then want to build up the Perl regexp match (logical OR), i.e.
if ( $host =~ m/\d+DAT\d+/ || $host =~ m/\d+INF\d+/ ) {
# do something
}
At the moment my code is hard wired as in the above example, but how can I dynamically construct the Perl regex after reading in the config file?
I have read the config file into an array and will start from there. The code above needs to end up like this:
if ($exclude clause) {
# do something
}
This is how I set about achieving that reading from the array:
for ($i = 1; $i < #conf; $i++) {
$exclude_clause .= "$host =~/" . #conf[$i] . "/ || ";
}
$exclude_clause =~ s/ \|\| $//;
The problem is referencing $host within the $exclude_clause. My regex string is built OK apart from the $host.
I would suggest a different approach that doesn't require you to build up a big Regex string and then evaluate it. Instead, what about using the List::MoreUtils module's any function, which accepts a block of code, evaluates it for each member of a list, and returns true once the block returns true for at least one entry in the list. For example:
use List::MoreUtils qw{ any };
if ( any { $host =~ $_ } #conf ) {
# do something
}
In the code block passed to any, the temp variable $_ contains the current entry in the list. That way you can avoid constructing a Regex in the first place.
I think you should store the complete regex in the configuration file, but it can be be a set of comma separated alternatives if need be.
You would use the qr// construct to build the regex:
my $exc1 = "\d+DAT\d+"; # Read from configuration file
my $ecc2 = "\d+INF\d+";
my $rex1 = qr/$exc1/;
my $rex2 = qr/$exc2/;
...populate $host...
if ($host =~ $rex1 || $host =~ $rex2)
{
...exclude $host...
}
else
{
...include $host...
}
Alternatively, you can build a single regex:
my $exc1 = "\d+DAT\d+"; # Read from configuration file
my $ecc2 = "\d+INF\d+";
my $rex = qr/$exc1|$exc2/;
...populate $host...
if ($host =~ $rex)
{
...exclude $host...
}
else
{
...include $host...
}
The single regex can be built from as many alternative exclusionary regex fragments as you like. Of course, if the value in the file is:
EXCLUDE=\d+DAT\d+|\d+INF\d+
then your code simplifies once more, assuming the regex string is read into $exc:
my $exc = "\d+DAT\d+|\d+INF\d+"; # Read from file
my $rex = qr/$exc/;
...populate $host...
if ($host =~ $rex)
{
...exclude $host...
}
else
{
...include $host...
}

Perl: How to detect which file exists among foo.(txt|abc)

My perl script needs to detect the extension of an existing file and print out the filename. The input that specifies the filename with a vague extension would be in this format:
foo.(txt|abc)
and the script would print "foo.txt" if it exists. If foo.txt does not exist and foo.abc exists, then it would print "foo.abc."
How can I do this detection and printing of the correct existing file in a neat and clean way?
Thanks!
Actually, you've almost got the regular expression right there: the only thing you need to do is escape the . with a backslash (since . means "any character except the newline character" in regular expressions), and it would also help to put a ?: inside of the parentheses (since you don't need to capture the file extension). Also, ^ and $ denote markers for the beginning and the end of the string (so we're matching the entire string, not just part of a string...that way we don't get a match for the file name "thisisnotfoo.txt")
Something like this should work:
use strict;
use warnings;
my $file1="foo.txt";
my $file2="foo.abc";
my $file3="some_other_file";
foreach ($file1,$file2,$file3)
{
if(/^foo\.(?:txt|abc)$/)
{
print "$_\n";
}
}
When the above code is run, the output is:
foo.txt
foo.abc
Take a look at perldoc perlretut for more stuff about regular expressions.
You may want to look at glob, but you'd have to use a different syntax. The equivalent would be:
foo.{txt,abc}
See File::Glob for more information. Also note that this will return a list of all of the matches, so you'll have to do your own rules if it should prefer one when multiple exist.
sub text_to_glob {
my ($s) = #_;
$s =~ s/([\\\[\]{}*?~\s])/\\$1/g;
return $s;
}
my $pat = 'foo.(txt|abc)';
my #possibilities;
if (my ($base, $alt) = $pat =~ /^(.*\.)\(([^()]*)\)\z/s) {
#possibilities = glob(
text_to_glob($base) .
'{' . join(',', split(/\|/, $alt)) . '}'
);
} else {
#possibilities = $pat;
}
for my $possibility (#possibilities) {
say "$possibility: ", -e $possibility ? "exists" : "doesn't exist";
}
glob, but also see File::Glob
-e
use strict;
use warnings;
FILE:
for (glob "file.{txt,abc}") {
if (-f $_) {
print $_, "\n";
last FILE;
}
}