Preserving backslashes in Perl strings - perl

Is there a way in Perl to preserve and print all backslashes in a string variable?
For example:
$str = 'a\\b';
The output is
a\b
but I need
a\\b
The problem is that I can't process the string in any way to escape the backslashes, because
I have to read complex regular expressions from a database and don't know in which combination and number they appear and have to print them exactly as they are on a web page.
I tried with template toolkit and html and html_entity filters. The only way it works so far is to use a single quoted here document:
print <<'XYZ';
a\\b
XYZ
But then I can't interpolate variables which makes this solution useless.
I tried writing a string to a web page, into a file and to the shell, but no luck: one backslash always disappears. Maybe I am totally on the wrong track, but what is the correct way to print complex regular expressions, including backslashes in all combinations and numbers, without any changes?
In other words:
I have a database containing hundreds of regular expressions as string data. I want to read them with Perl and print them on a web page exactly as they are in the database.
Many administrators change these regular expressions all the time, so I don't know in advance what to escape or how.
A typical example would look like this:
'C:\\test\\file \S+'
but it could change the next day to
'\S+ C:\\test\\file'
Maybe a correct conclusion would be to escape every backslash exactly once, no matter in which combination and in which number it appears? That would mean doubling them up works. Then the problem isn't as big as I feared. I tested it in bash and it works with two and even three backslashes in a row (4 backslashes print 2, and 6 backslashes print 3).

The backslash only has significance to Perl when it occurs in Perl source code, e.g.: your assignment of a literal string to a variable:
my $str = 'a\\b';
However, if you read data from a file (or a database or socket etc) any backslashes in the data you read will be preserved without you needing to take any special steps.
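A minimal sketch of that point, using an in-memory filehandle as a stand-in for the file or database. The sample data and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Stand-in for data coming from a file or database. The quoted here-document
# tag ('END') means $data holds exactly the characters typed below.
my $data = <<'END';
C:\\test\\file \S+
END

# Read it back through a filehandle, as you would from a real file.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $line = <$fh>;
close $fh;
chomp $line;

print "$line\n";                    # the backslashes arrive unchanged
my $count = () = $line =~ /\\/g;    # count the backslashes in the data
print "backslashes: $count\n";
```

Nothing in the read or the print touches the backslashes; only string literals in source code go through escape processing.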

my $str = 'a\\b';
print $str;
This prints a\b.
Use
my $str = 'a\\\\b';
instead

It's a PITA, but you will just have to double up the backslashes, e.g.
a\\\\b
Otherwise, you could store the backslash in another variable, and interpolate that.

The minimum to get two backslashes is (unfortunately) three backslashes:
use 5.016;
my $a = 'a\\\b';
say $a;

The problem I tried to solve does not exist. I confused initializing a string directly in the code with entering it through an HTML form. Writing a string literal in the code that preserves all backslashes is only possible with a here-document or by reading a text file containing the string. But if I just use the HTML form on a web page to insert a string and use escapeHTML() from the CGI module, it takes care of everything and you can insert the weirdest combinations of special characters. They all get displayed and preserved exactly as inserted. So I should have started directly with the HTML and database operations instead of trying to examine things first by using strings directly in the code. Anyway, thanks for your help.
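As a rough illustration of why HTML escaping leaves backslashes alone, here is a sketch with a hand-written helper. The name escape_html and the sample string are hypothetical; this is not the CGI module's implementation, only a stand-in for the same idea:

```perl
use strict;
use warnings;

# Hypothetical helper: escape only the characters that are special in HTML.
# A backslash is not special in HTML, so it passes through untouched.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Stand-in for a regular expression read from the database.
my $regex_from_db = <<'END';
C:\\test\\file <\S+> & "more"
END
chomp $regex_from_db;

print escape_html($regex_from_db), "\n";
```

Only &, <, > and " are rewritten; every backslash in the input appears in the output exactly as it was.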

You can use the following regular expression to form your string correctly:
my $str = 'a\\b';
$str =~ s/\\/\\\\/g;
print "$str\n";
This prints a\\b.
EDIT:
You can use a non-interpolating here-document instead:
my $str = <<'EOF';
a\\b
EOF
print "$str\n";
This still prints a\\b.

Grant's answer provided the hint I needed. Some of the other answers did not match Perl's operation on my system so ...
#!/usr/bin/perl
use warnings;
use strict;
my $var = 'content';
print "\'\"\N{U+0050}\\\\\\$var\n";
print <<END;
\'\"\N{U+0050}\\\\\\$var\n
END
print '\'\"\N{U+0050}\\\\\\$var\n'.$/;
my $str = '\'\"\N{U+0050}\\\\\\$var\n';
print $str.$/;
print @ARGV;
print $/;
Called from bash ... using the bash means of escaping in quotes which changes \' to '\''.
jamie@debian:~$ ./ft.pl '\'\''\"\N{U+0050}\\\\\\$var\n'
'"P\\\content
'"P\\\content
'\"\N{U+0050}\\\$var\n
'\"\N{U+0050}\\\$var\n
\'\"\N{U+0050}\\\\\\$var\n
The final line, with six backslashes in the middle, was what I had expected. Reality differed.
So:
"in here \" is interpolated
in a HEREDOC \ is interpolated
'in single quotes \ is interpolated only before \ and ' (are there more?)
my $str = 'same limited \ interpolation';
perl.pl 'escape using bash rules' with @ARGV is not interpolated
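The quoting rules summarized above can be checked with a small sketch. The variable names are made up, and the same four backslashes are typed in each form:

```perl
use strict;
use warnings;

my $var = 'content';

my $double  = "\\\\$var";    # escapes processed, $var interpolated
my $single  = '\\\\$var';    # only \\ and \' are processed, no interpolation
my $heredoc = <<'END';       # quoted tag: nothing is processed at all
\\\\$var
END
chomp $heredoc;

printf "%-8s %s\n", 'double:',  $double;
printf "%-8s %s\n", 'single:',  $single;
printf "%-8s %s\n", 'heredoc:', $heredoc;
```

The double-quoted and single-quoted forms both end up with two backslashes (they differ in whether $var is expanded), while the quoted here-document keeps all four.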

Related

What is the meaning of the number sign (#) in a Perl regex match?

What is the meaning of below statement in perl?
($script = $0) =~ s#^.*/##g;
I am trying to understand the operator =~ along with the statement on the right side s#^.*/##g.
Thanks
=~ applies the thing on the right (a pattern match or search and replace) to the thing on the left. There's lots of documentation about =~ out there, so I'm just going to point you at a pretty good one.
There's a couple of idioms going on there which are not obvious nor well documented which might be tripping you up. Let's cover them.
First is this...
($copy = $original) =~ s/foo/bar/;
This is a way of copying a variable and performing a search and replace on it in a single step. It is equivalent to:
$copy = $original;
$copy =~ s/foo/bar/;
The =~ operates on whatever is on the left after the left hand code has been run. ($copy = $original) evaluates to $copy so the =~ acts on the copy.
s#^.*/##g is the same as s/^.*\///g but using alternative delimiters to avoid Leaning Toothpick Syndrome. You can use just about anything as a regex delimiter. # is common, though I think it's ugly and hard to read. I prefer {} because they balance. s{^.*/}{}g is equivalent code.
Unrolling the idioms, you have this:
$script = $0;
$script =~ s{^.*/}{}g;
$0 is the name of the script. So this is code to copy the name of the script and strip everything up to the last slash (.* is greedy and will match as much as possible) off it. It is getting just the filename of the script.
The /g indicates to perform the match on the string as many times as possible. Since this can only ever match once (the ^ anchors it to the beginning of the string) it serves no purpose.
There's a better and safer way to do this.
use File::Basename;
$script = basename($0);
It's very, very simple:
Perl quote-like expressions can take many different characters as part separators. The separator right after the command (in this case, the s) is the separator for the rest of the operation. For example:
# Out with the "Old" and "In" with the new
$string =~ s/old/new/;
$string =~ s#old#new#;
$string =~ s(old)(new);
$string =~ s@old@new@;
All four of those expressions are the same thing. They replace the string old with new in $string. Whatever comes after the s is the separator. Note that parentheses, curly braces, and square brackets are used in pairs. This works out rather nicely for q and qq, which can be used instead of single quotes and double quotes:
print "The value of \$foo is \"$foo\"\n"; # A bit hard to read
print qq/The value of \$foo is "$foo"\n/; # Maybe slashes weren't a great choice...
print qq(The value of \$foo is "$foo"\n); # Very nice and clean!
print qq(The value of \$foo is (believe it or not) "$foo"\n); #Still works!
The last still works because the quote like operators count opening and closing parentheses. Of course, with regular expressions, parentheses and square brackets are part of the regular expression syntax, so you won't see them so much in substitutions.
Most of the time, it is highly recommended that you stick with the s/.../.../ form just for readability. It's what people are used to and it's easy to digest. However, what if you have this?
$bin_dir =~ s/\/home\/([^\/]+)\/bin/\/Users\/$1\/bin/;
Those backslashes can make it hard to read, so the tradition has been to replace the forward slash separators to avoid the hills and valleys effect.
$bin_dir =~ s#/home/([^/]+)/bin#/Users/$1/bin#;
This is still a bit hard to read, but at least I don't have to escape each forward slash with a backslash, so it's easier to see what I'm substituting. Good separator characters are hard to find for regular expressions: symbols such as ^, *, |, and + are regular expression metacharacters and are likely to appear in the pattern itself. The # is a common choice because it rarely appears in strings and has no special meaning in a regular expression (unless the /x modifier is in effect), so it is unlikely to clash with the pattern.
Getting back to your original question:
($script = $0) =~ s#^.*/##g;
is the equivalent of:
($script = $0) =~ s/^.*\///g;
But because the original programmer didn't want to escape that slash with a backslash, they changed the separator character.
As for the:
($script = $0) =~ s#^.*/##g;
It's the same as saying:
$script = $0;
$script =~ s#^.*/##g;
You're assigning the $script variable and doing the substitution in a single step. It's very common in Perl, but it is a bit hard to understand at first.
By the way, if I understand that basic expression correctly (removing all characters up to the last forward slash), this would have been way cleaner:
use File::Basename;
...
$script = basename($0);
Much easier to read and understand -- even for an old Perl hand.
In Perl, you can use many kinds of characters as quoting characters (string, regular expression, list). Let's break it down:
Assign the $script variable the contents of $0 (the string that contains the name of the calling script).
The =~ is the binding operator. It applies a regular expression match or a search and replace to its left-hand side. In this case, it operates on the new variable, $script.
The s character indicates a search and replace regex.
The # character is being used as the delimiter for the regex. The regex pattern quote character is usually the / character, but you can use others, including # in this case.
The regex is ^.*/. It means "at the start of the string, match zero or more characters up to a slash". Because .* is greedy, the match extends to the last slash it can reach; . does not match newline characters by default.
The next # marks the start of the 'replace' value. Usually you have a replacement here, which may use any captured part of the match.
The # again. This ends the replacement. Since there is nothing between the start and end of the replacement, everything that was matched is replaced with nothing.
g, or global match. The search and replace will keep happening as many times as the pattern matches in the value.
Effectively, it removes everything up to and including the last / in the name of the script (stopping at newlines, which . does not match). It's a really lazy way of getting the script name, and it only works with Unix-like paths.
If you have a chance, consider replacing with File::Basename, a core module in Perl:
use File::Basename;
# later ...
my $script = fileparse($0);
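To compare the approaches discussed in these answers side by side, here is a small sketch. The path is a made-up stand-in for $0, and the /r modifier assumes Perl 5.14 or later:

```perl
use strict;
use warnings;
use File::Basename qw(basename);

my $path = '/usr/local/bin/myscript.pl';    # made-up stand-in for $0

# The idiom from the question: copy first, then substitute on the copy.
(my $copy = $path) =~ s{^.*/}{};

# With /r the substitution returns the modified copy and leaves $path alone.
my $with_r = $path =~ s{^.*/}{}r;

# The File::Basename route.
my $base = basename($path);

print "$copy\n$with_r\n$base\n";
```

All three variables end up holding the file name without its directory, and $path itself is never modified.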

How does split work here?

$string = 'a=1;b=2';
use Data::Dumper;
@array = split("; ?", $string);
print Dumper(\@array);
output:
$VAR1 = [
'a=1',
'b=2'
];
Does anyone know how "; ?" works here? It's not a regex, but it works quite like a regex, so I don't understand.
I think it means "semicolon followed by optional space (just one or zero)".
It's not a regex, but it works quite like a regex, so I don't understand.
The pattern parameter to split is always treated as a regular expression (would be better to not use a string, though). The only exception is the "single space", which is taken to mean "split on whitespace"
The first parameter of split is a regex. So I'd rather write split /; ?/, $string;.
When you use a string for the first parameter, it just means the regex can vary and has to be compiled anew each time the split is run. See perldoc -f split for details.
The regex could be read as: the character ";" optionally followed by a space. See perlretut and perlreref for details.
A semicolon (the ;) followed by an optional (the ?) space (the " ").
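A short sketch of the behaviour described in these answers. The input strings are made up; the third call shows the single-space special case:

```perl
use strict;
use warnings;

my $string = 'a=1;b=2; c=3';

my @from_string = split("; ?", $string);    # the string is used as a regex
my @from_regex  = split(/; ?/, $string);    # same pattern, written as a regex

print join('|', @from_string), "\n";
print join('|', @from_regex),  "\n";

# The one special case: a single-space string splits on runs of whitespace
# and ignores leading whitespace.
my @words = split(' ', "  one   two three ");
print join('|', @words), "\n";
```

Both of the first two calls produce the same three fields, whether or not a space follows the semicolon.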

Why does this subroutine work if I type out its arguments literally, but not if I give the arguments in the form of a variable?

I am using a perl package (Biomart), that includes a subroutine called addFilter(). That subroutine needs a couple of arguments, including one that needs to be of the format "nr:nr:nr"
If I use the subroutine as follows, it works fine:
$query->addFilter("chromosomal_region", ["1:1108138:1108138","1:1110294:1110294"]);
However, if I use it like this, it does not work:
my $string = '"1:1108138:1108138","1:1110294:1110294","1:1125105:1125105"';
$query->addFilter("chromosomal_region", ['$string']);
Since there are tens of thousands of those arguments that I construct in a for loop, I really need the second way to work... What could be causing this? I hope someone can help me out, many thanks in advance!
Because you seem to be trying to write in a language that's not Perl. '"this","that","another"' isn't an array, it's a string. And '$string' doesn't interpolate or include $string in any way because it uses single quotes. It just produces a string that starts with a dollar sign and ends with "string".
Something more like what you intend would be:
my @things = ("1:1108138:1108138","1:1110294:1110294","1:1125105:1125105");
$query->addFilter("chromosomal_region", \@things);
-or-
$query->addFilter("chromosomal_region", [ @things ] );
And to build it up dynamically, you can simply do push @things, $value in a loop or whatever you need.
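A minimal sketch of building the list in a loop. The input pairs and variable names are made up for illustration, and no Biomart call is made here:

```perl
use strict;
use warnings;

# Made-up input: chromosome and position pairs standing in for real data.
my @positions = ([1, 1108138], [1, 1110294], [1, 1125105]);

my @regions;
for my $pos (@positions) {
    my ($chr, $start) = @$pos;
    # One "nr:nr:nr" string per array element, not one big quoted string.
    push @regions, join(':', $chr, $start, $start);
}

print scalar(@regions), " regions: @regions\n";
# A reference such as \@regions is what would be passed to addFilter.
```

Each element of @regions is a separate string, which is the shape the literal array in the working call has.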
'$string' is literally "$string"; the variable isn't replaced with its contents. Lose the single quotes.
Of course, it's unlikely passing a reference to an array consisting of a single comma-separated string with quotes embedded in it is going to do the same thing as passing a reference to an array of strings.
Try something like:
my $ref = ["1:1108138:1108138","1:1110294:1110294"];
$query->addFilter("chromosomal_region", $ref);
I agree with hobbs. If you want to take many inputs like that, you can use a loop and an array like this (provided you are taking inputs from STDIN):
my @values;
while (defined(my $line = <STDIN>)) {
    chomp($line);
    last if $line eq "end";    # a line containing only "end" stops the input
    push @values, $line;
}
It reads lines from STDIN and pushes them onto the @values array. You have to indicate the end of the data with a line containing end.
And for your error, what others said was right. Perl's variable interpolation works only for variables in double quotes.

How can I interpolate literal \t and \n in Perl strings? [duplicate]

Say I have an environment variable myvar:
myvar=\tapple\n
When the following command prints this variable
perl -e 'print "$ENV{myvar}"'
I literally get \tapple\n. However, I want those control characters to be evaluated, not printed as escapes. How would I achieve that?
In the real code, $ENV is used inside a substitution, but I hope the answer will cover that as well.
Use eval:
perl -e 'print eval qq{"$ENV{myvar}"}'
UPD: You can also use substitution with the ee switch, which is safer:
perl -e '(my $s = $ENV{myvar}) =~ s/(\\n|\\t)/"qq{$1}"/gee; print $s'
You should probably be using String::Escape.
use String::Escape qw(unbackslash);
my $var = unbackslash($ENV{'myvar'});
unbackslash unescapes any string escape sequences it finds, turning them into the characters they represent. If you want to explicitly only translate \n and \t, you'll probably have to do it yourself with a substitution as in this answer.
There's nothing particularly special about a sequence of characters that includes a \. If you want to substitute one sequence of characters for another, it's very simple to do in Perl:
my %sequences = (
'\\t' => "\t",
'\\n' => "\n",
'foo' => 'bar',
);
my $string = '\\tstring fool string\\tfoo\\n';
print "Before: [$string]\n";
$string =~ s/\Q$_/$sequences{$_}/g for ( keys %sequences );
print "After: [$string]\n";
The only trick with \ is to keep track of the times when Perl thinks it's an escape character.
Before: [\tstring fool string\tfoo\n]
After: [ string barl string bar
]
However, as darch notes, you might just be able to use String::Escape.
Note that you have to be extremely careful when you're taking values from environment variables. I'd be reluctant to use String::Escape since it might process quite a bit more than you are willing to translate. The safe way is to only expand the particular values you explicitly want to allow. See my "Secure Programming Techniques" chapter in Mastering Perl where I talk about this, along with the taint checking you might want to use in this case.
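One way to sketch the "only expand what you explicitly allow" approach is shown below. The helper name expand_allowed and the sample input are hypothetical, and this simple version does not treat a doubled backslash as an escaped backslash:

```perl
use strict;
use warnings;

# Hypothetical helper: expand only the escapes listed in %allowed and
# leave every other backslash sequence exactly as it was.
sub expand_allowed {
    my ($text) = @_;
    my %allowed = (t => "\t", n => "\n");
    $text =~ s/\\([tn])/$allowed{$1}/g;
    return $text;
}

my $raw      = '\tapple\n \x41';    # single quotes: the backslashes are kept
my $expanded = expand_allowed($raw);
print "[$expanded]\n";
```

The \t and \n become a tab and a newline, while \x41 stays as four literal characters because it is not on the list.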

Why does this base64 string comparison in Perl fail?

I am trying to compare an encode_base64('test') to the string variable containing the base64 string of 'test'. The problem is it never validates!
use MIME::Base64 qw(encode_base64);
if (encode_base64("test") eq "dGVzdA==")
{
print "true";
}
Am I forgetting anything?
Here's a link to a Perlmonks page which says "Beware of the newline at the end of the encode_base64() encoded strings".
So the simple 'eq' may fail.
To suppress the newline, say encode_base64("test", "") instead.
When you do a string comparison and it fails unexpectedly, print the strings to see what is actually in them. I put brackets around the value to see any extra whitespace:
use MIME::Base64;
$b64 = encode_base64("test");
print "b64 is [$b64]\n";
if ($b64 eq "dGVzdA==") {
print "true";
}
This is a basic debugging technique using the best debugger ever invented. Get used to using it a lot. :)
Also, sometimes you need to read the documentation for things a couple of times to catch the important parts. In this case, MIME::Base64 tells you that encode_base64 takes two arguments. The second argument is the line ending and defaults to a newline. If you don't want a newline on the end of the string you need to give it another line ending, such as the empty string:
encode_base64("test", "")
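A small sketch comparing the two calls, using the core MIME::Base64 module (the variable names are made up):

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my $default = encode_base64("test");        # default line ending: "\n"
my $bare    = encode_base64("test", "");    # empty line ending

# Brackets make any trailing whitespace visible.
print "default is [$default]\n";
print "bare is [$bare]\n";

print "default matches\n" if $default eq "dGVzdA==";
print "bare matches\n"    if $bare    eq "dGVzdA==";
```

Only the second comparison succeeds, because the default call appends a newline to the encoded string.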
Here's an interesting tip: use Perl's wonderful and well-loved testing modules for debugging. Not only will that give you a head start on testing, but sometimes they'll make your debugging output a lot faster. For example:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More 0.88;
BEGIN { use_ok 'MIME::Base64' => qw(encode_base64) }
is( encode_base64("test"), "dGVzdA==", q{"test" encodes okay} );
done_testing;
Run that script, with perl or with prove, and it won't just tell you that it didn't match, it will say:
# Failed test '"test" encodes okay'
# at testbase64.pl line 6.
# got: 'dGVzdA==
# '
# expected: 'dGVzdA=='
and sharp-eyed readers will notice that the difference between the two is indeed the newline. :)