Perl regex for allowing all the special characters - perl

The environment I am working in is PCRE, which supports Perl syntax.
I want to build a regular expression which supports all characters, including special characters.
I have tried
(.*)
But it does not work.
For example. I am trying to redirect
From
https://oldaddress.com/select.do?cyuf=err%3Errt.com%3fsfds%4A222-3424234&p=66&j=8
To
https://newsite.com/select.do?cyuf=err%3Errt.com%3fsfds%4A222-3424234&p=66&j=8
The old site oldaddress.com successfully redirects to newsite.com but the URI part select.do?cyuf=err%3Errt.com%3fsfds%4A222-3424234&p=66&j=8 is not carried over.
I have used regEx =~ (^.*) to handle the URI part but the regex does not support all the special characters.
I would like to implement this regEx in Perl.

Your problem is almost certainly not with your regex. The query parameters don't change, so they shouldn't be included in the regex at all as this Perl code will demonstrate:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $url = 'https://oldaddress.com/select.do?cyuf=err%3Errt.com%3fsfds%4A222-3424234&p=66&j=8';
my $new_url = 'https://newsite.com/select.do?cyuf=err%3Errt.com%3fsfds%4A222-3424234&p=66&j=8';
$url =~ s/oldaddress/newsite/;
if ($url eq $new_url) {
say "Looks like that worked";
} else {
say "Looks like you've got a problem";
}
I'm only changing the domain, so that's all I need to refer to in the regex.
If your query string isn't surviving the transformation, then the cause lies somewhere else in the technology that you are using. Without knowing more about what you're doing, we really can't be any more help.
Update: From Sawyer's comment
but how to handle special characters ?, % using a regex in Perl
You don't seem to be reading what people are telling you.
Your regex doesn't need to handle ?, % or other special characters. Your regex only needs to deal with the bits of your string that are changing - that's the domain names and they don't include these characters.
% has no special meaning in Perl regular expressions.
? has a special meaning, you avoid that by escaping it with a backslash (so use \?).
A dot (.) in a Perl regular expression matches any character except a newline. It matches ? and % without any problems at all.
Your problem is not with your regex; it is somewhere else in your system. But because you are being so mysterious about what you're doing and how you're doing it, we really can't be any more help.
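To make the three points above concrete, here is a small sketch that runs a couple of patterns over the URI part from the question. The variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $uri = 'select.do?cyuf=err%3Errt.com%3fsfds%4A222-3424234&p=66&j=8';

# A bare dot-star captures the whole string, "?" and "%" included.
my ($all) = $uri =~ /^(.*)$/;

# An escaped \? matches a literal question mark; % needs no escaping.
# Inside a character class, ? is already literal.
my ($path, $query) = $uri =~ /^([^?]*)\?(.*)$/;

print "all:   $all\n";
print "path:  $path\n";
print "query: $query\n";
```

If (.*) appears to drop the query string in your setup, it is worth checking whether the tool doing the redirect ever passes the query string to the regex in the first place.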
Update2: Here's another version of my program which demonstrates (.*) matching ? and %. But I can't emphasise enough that you don't need to do this.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $url = 'https://oldaddress.com/select.do?cyuf=err%3Errt.com%3fsfds%4A222-3424234&p=66&j=8';
my $new_url = 'https://newsite.com/select.do?cyuf=err%3Errt.com%3fsfds%4A222-3424234&p=66&j=8';
$url =~ s/(.*)oldaddress(.*)/${1}newsite${2}/;
if ($url eq $new_url) {
say "Looks like that worked";
} else {
say "Looks like you've got a problem";
}

Related

In Perl, can you use a variable for the whole of a match string?

I'm new to Perl, though not to programming, and am working through Learning Perl. The book has exercises to match successive lines of a small text file.
I had the idea of supplying match strings from STDIN, and going through the file for each one:
while(<STDIN>) {
chomp;
$regex = $_;
seek JUNK, 0, 0;
while(<JUNK>) {
chomp();
if(/$regex/) {
say;
}
}
say '';
}
This works fine, but I can't find a way to interpolate an entire match string, e.g.
/fred/i
into the predicate. I tried
if($$matcher) # with $matcher = '/fred/'
but Perl complained.
I imagine this is my ignorance, and should welcome enlightenment.
Match modifiers, such as /i, are a part of the code telling Perl how to perform the match, not a part of the pattern to be matched. This is why that doesn't work for you.
You have three ways to work around this (well, probably more, since this is Perl we're talking about, but three ways that I can think of straight off):
1) Use extended regex syntax and, when you want a case-insensitive match, enter (?i:fred), as suggested in comments on the question.
2) Use string eval to allow the use of the regular modifiers: if (eval "\$_ =~ $regex") { say } The $_ is escaped so that the eval'd code tests the current line, rather than having the line's text pasted into the code. Note that this method will require you to also type the surrounding slashes, e.g., you'd have to enter /fred/i; just typing in fred would not work. Note also that it's a huge security hole to do this without validating your input first, since the user's entered text is executed as Perl code, just as if it were part of the original program. (Imagine if the user entered //, system("rm -rf /") - it would test against an empty regex, then delete all the files on your computer.) So probably not a recommended approach unless you really know what you're doing and/or you're the only one who will ever run the program.
3) The most complex, but also most correct, solution is to write a parser which inspects the user's entered string to see whether any special flags are present and then responds accordingly. A very simple example which allows the user to append /i for a case-insensitive search:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
while(<STDIN>) {
chomp;
my @parts = split '/', $_;
# If the user input starts with a /, the first part will be empty, so throw
# it away.
shift @parts unless $parts[0];
my $re = shift @parts;
my %flags;
for (@parts) {
for (split '') {
$flags{i} = 1 if $_ eq 'i';
}
}
my $f = join '', keys %flags;
say "Matched" if eval qq('foo' =~ /$re/$f);
}
This also uses string eval, so it is potentially vulnerable to the same kind of security issues as #2, but $re cannot contain any / characters (the split '/' would have ended $re immediately prior to the first /), which prevents code from being inserted there and $f can contain only the letter i (or any other flags you might choose to recognize if you expand on this). So it should be safe. (But, if anyone can demonstrate an exploit I missed, please tell me about it in comments!)
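For comparison, the same idea can be sketched without any string eval by folding the flags into the pattern as inline modifiers and compiling it with qr//. This is a different technique from the parser above, not a fix for it, and compile_user_regex is a hypothetical helper name. It assumes only the flags i, m, s and x should be accepted.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): turn "/pattern/flags" or
# a bare "pattern" into a compiled regex without any string eval.
sub compile_user_regex {
    my ($input) = @_;
    my ($pattern, $flags) = ($input, '');

    # Split off the slashes and trailing flags if the input has that shape.
    if ($input =~ m{\A/(.*)/([a-z]*)\z}s) {
        ($pattern, $flags) = ($1, $2);
    }

    # Accept only flags that are valid as inline modifiers.
    return if $flags =~ /[^imsx]/;

    # (?flags:pattern) carries the flags inside the pattern. The block eval
    # only traps a compile error from a malformed pattern.
    return eval { qr/(?$flags:$pattern)/ };
}

for my $input ('/fred/i', 'fred', '/fred/e') {
    my $re = compile_user_regex($input);
    if (!defined $re) {
        print "$input: rejected\n";
        next;
    }
    print "$input: ", ('Fred said Hello.' =~ $re ? "match" : "no match"), "\n";
}
```

The user's text is only ever treated as a pattern here, never as Perl code, which is the main reason to prefer this shape when the input is not trusted.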
Problem
What you are trying to do can be summarized by:
my $regex = '/fred/i';
my @lines = (
'A line containing some words and Fred said Hello.',
'Another line. Here is a regex embedded in the line: /fred/i',
);
for ( @lines ) {
say if /$regex/;
}
Output:
Another line. Here is a regex embedded in the line: /fred/i
We see that the second line matches $regex, whereas we wanted the first line containing Fred to match the string fred with the (case insensitive) i flag added to the regex. The problem is that the characters / and i in $regex are taken as characters to be matched literally, i.e., they are not interpreted as special characters surrounding a Regex (as part of a Perl expression).
Note:
The character / is special as part of a Perl expression for a regular expression, but it is not special inside the Regex pattern. There are however characters that are special inside the pattern, the so-called meta characters:
\ | ( ) [ { ^ $ * + ? .
see perldoc quotemeta for more information.
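A short sketch of the difference between an interpolated pattern and a quoted one may help here. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $input = '/fred/i';
my $line  = 'Here is a regex embedded in the line: /fred/i';

# Interpolated as-is, "/" and "i" are ordinary characters to match.
my $plain = $line =~ /$input/ ? 1 : 0;

# quotemeta (or \Q...\E) backslash-escapes every non-word character, so even
# real metacharacters are matched literally.
my $dotted  = 'a.c';
my $quoted  = quotemeta($dotted);
my $as_meta = 'abc' =~ /$dotted/     ? 1 : 0;   # "." matches any character
my $literal = 'abc' =~ /\Q$dotted\E/ ? 1 : 0;   # "." must be a real dot

print "plain=$plain quoted=$quoted as_meta=$as_meta literal=$literal\n";
```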
A solution using extended patterns
Simply change the first line to:
my $regex = '(?i)fred'; # or alternatively: (?i:fred)
Regex flags can be added to a regex pattern using "Extended patterns" described in the manual perldoc perlre :
Extended Patterns
The syntax for most of these is a pair of parentheses with a question
mark as the first thing within the parentheses. The character after
the question mark indicates the extension.
[...]
(?adlupimnsx-imnsx)
(?^alupimnsx)
One or more embedded pattern-match modifiers, to be turned on (or
turned off if preceded by "-" ) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any). This is
particularly useful for dynamically-generated patterns, such as those
read in from a configuration file, taken from an argument, or
specified in a table somewhere.
[...]
These modifiers are restored at the end of the enclosing group.
Alternatively the non-capturing form can be used:
(?:pattern)
(?adluimnsx-imnsx:pattern)
(?^aluimnsx:pattern)
This is for clustering, not capturing; it groups subexpressions like
"()" , but doesn't make backreferences as "()" does.
The question has been answered in the following comment:
Try (?i:fred), see Extended patterns in perldoc perlre for more information
– Håkon Hægland

Can't find Unicode property definition "o" - Regular expression is not working in perl

I am trying to capture a substring from a string using a regex, but it is not working. The error I am getting is Can't find Unicode property definition "o"
I am using Windows machine for running the below code.
Here is the code :
use strict;
use warnings;
my $path = 'C:\APTscripts\APStress\Logs\APStress_September-18---20.44.25\APTLogs\PostBootLogs\09-18-2014_15-18-32\UILogs_09-18-2014_15-50-43.txt';
my ($captured) = $path =~ /(.+?) \PostBootLogs/gx;
print "$captured\n";
You just need to escape the backslash in the pattern:
/(.+?) \\PostBootLogs/gx
You were inadvertently triggering the use of Unicode character properties with the use of \P.
As has already been demonstrated, you need to escape your literal backslashes in regular expressions.
my ($captured) = $path =~ /(.+?) \\PostBootLogs/x;
However, you can also accomplish the same task without a regex if you use Path::Class or a similar file and directory managing module.
use Path::Class;
my $captured = file($path)->dir->parent->parent;
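If the text to stop at comes from a variable, one more option is to let \Q...\E do the escaping. This is a sketch of that approach rather than something from the answers above; $marker is an illustrative name.

```perl
use strict;
use warnings;

my $path = 'C:\APTscripts\APStress\Logs\APStress_September-18---20.44.25\APTLogs\PostBootLogs\09-18-2014_15-18-32\UILogs_09-18-2014_15-50-43.txt';

# The literal text to stop at, kept in a single-quoted string so the
# backslash is not interpreted.
my $marker = '\PostBootLogs';

# \Q...\E quotes the interpolated value, so \P is not read as the start of
# a Unicode property escape.
my ($captured) = $path =~ /(.+?)\Q$marker\E/;
print "$captured\n";
```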

What is the meaning of the number sign (#) in a Perl regex match?

What is the meaning of below statement in perl?
($script = $0) =~ s#^.*/##g;
I am trying to understand the operator =~ along with the statement on the right side s#^.*/##g.
Thanks
=~ applies the thing on the right (a pattern match or search and replace) to the thing on the left. There's lots of documentation about =~ out there, so I'm just going to point you at a pretty good one.
There's a couple of idioms going on there which are not obvious nor well documented which might be tripping you up. Let's cover them.
First is this...
($copy = $original) =~ s/foo/bar/;
This is a way of copying a variable and performing a search and replace on it in a single step. It is equivalent to:
$copy = $original;
$copy =~ s/foo/bar/;
The =~ operates on whatever is on the left after the left hand code has been run. ($copy = $original) evaluates to $copy so the =~ acts on the copy.
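Here is a small sketch of that copy-and-substitute idiom, next to the /r modifier, which does the same job on Perl 5.14 and later. The sample string is made up.

```perl
use strict;
use warnings;
use 5.014;    # the /r modifier needs Perl 5.14 or later

my $original = 'foo and more foo';

# Copy, then substitute on the copy; $original is left alone.
(my $copy = $original) =~ s/foo/bar/;

# Non-destructive substitution: /r returns the modified string instead of
# changing $original in place.
my $copy_r = $original =~ s/foo/bar/r;

say $original;
say $copy;
say $copy_r;
```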
s#^.*/##g is the same as s/^.*\///g but using alternative delimiters to avoid Leaning Toothpick Syndrome. You can use just about anything as a regex delimiter. # is common, though I think it's ugly and hard to read. I prefer {} because they balance. s{^.*/}{}g is equivalent code.
Unrolling the idioms, you have this:
$script = $0;
$script =~ s{^.*/}{}g;
$0 is the name of the script. So this is code to copy the name of the script and strip everything up to the last slash (.* is greedy and will match as much as possible) off it. It is getting just the filename of the script.
The /g indicates to perform the match on the string as many times as possible. Since this can only ever match once (the ^ anchors it to the beginning of the string) it serves no purpose.
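The effect of the greedy .* and the redundant /g can be checked with a quick sketch. The path below is a hypothetical stand-in for $0.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/myscript.pl';   # hypothetical stand-in for $0

(my $with_g    = $path) =~ s{^.*/}{}g;     # the form from the question
(my $without_g = $path) =~ s{^.*/}{};      # same thing without /g
(my $lazy      = $path) =~ s{^.*?/}{};     # non-greedy: stops at the first slash

print "$with_g\n$without_g\n$lazy\n";
```

The first two lines of output are identical, and the third shows what would be left if .* were not greedy.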
There's a better and safer way to do this.
use File::Basename;
$script = basename($0);
It's very, very simple:
Perl quote-like expressions can take many different characters as part separators. The separator right after the command (in this case, the s) is the separator for the rest of the operation. For example:
# Out with the "Old" and "In" with the new
$string =~ s/old/new/;
$string =~ s#old#new#;
$string =~ s(old)(new);
$string =~ s@old@new@;
All four of those expressions are the same thing. They replace the string old with new in my $string. Whatever comes after the s is the separator. Note that parentheses, curly braces, and square brackets are used in pairs. This works out rather nicely for the q and qq which can be used instead of single quotes and double quotes:
print "The value of \$foo is \"$foo\"\n"; # A bit hard to read
print qq/The value of \$foo is "$foo"\n/; # Maybe slashes weren't a great choice...
print qq(The value of \$foo is "$foo"\n); # Very nice and clean!
print qq(The value of \$foo is (believe it or not) "$foo"\n); #Still works!
The last still works because the quote like operators count opening and closing parentheses. Of course, with regular expressions, parentheses and square brackets are part of the regular expression syntax, so you won't see them so much in substitutions.
Most of the time, it is highly recommended that you stick with the s/.../.../ form just for readability. It's what people are used to and it's easy to digest. However, what if you have this?
$bin_dir =~ s/\/home\/([^\/]+)\/bin/\/Users\/$1\/bin/;
Those backslashes can make it hard to read, so the tradition has been to replace the slash separators with another character to avoid the hills and valleys effect.
$bin_dir =~ s#/home/([^/]+)/bin#/Users/$1/bin#;
This is still a bit hard to read, but at least I don't have to escape each forward slash, so it's easier to see what I'm substituting. Good delimiter characters are hard to find for regular expressions: symbols such as ^, *, |, and + are magical regular expression characters and could well appear in the pattern itself. The # is a common one to use because it's not common in strings and it doesn't have any special meaning in a regular expression (unless the /x modifier is on, where it starts a comment), so it is unlikely to clash with the pattern.
Getting back to your original question:
($script = $0) =~ s#^.*/##g;
is the equivalent of:
($script = $0) =~ s/^.*\///g;
But because the original programmer didn't want to backslash-escape that slash, they changed the separator character.
As for the:
($script = $0) =~ s#^.*/##g;
It's the same as saying:
$script = $0;
$script =~ s#^.*/##g;
You're assigning the $script variable and doing the substitution in a single step. It's very common in Perl, but it is a bit hard to understand at first.
By the way, if I understand that expression correctly (removing all characters up to the last forward slash), this would have been way cleaner:
use File::Basename;
...
$script = basename($0);
Much easier to read and understand -- even for an old Perl hand.
In Perl, you can use many kinds of characters as quoting characters (string, regular expression, list). Let's break it down:
Assign the $script variable the contents of $0 (the string that contains the name of the calling script.)
The =~ operator is the binding operator. It invokes a regular expression match or a regex search and replace. In this case, it operates on the new variable, $script.
The s character indicates a search and replace regex.
The # character is being used as the delimiter for the regex. The regex pattern quote character is usually the / character, but you can use others, including # in this case.
The regex, ^.*/. It means, "at the start of the string, match zero or more characters up to a slash." Because .* is greedy, the match runs to the last slash it can reach; . does not match newline characters by default.
The # indicating the start of the 'replace' value. Usually you have a replacement here, which may use any captured part of the match.
The # again. This ends the replacement. Since there was nothing between the start and end of the replacement, everything that the pattern matched is replaced with nothing.
g, or global match. The search and replace will keep happening as many times as it matches in the value.
Effectively, this deletes everything up to and including the last / in the name of the script (as long as no newline gets in the way). It's a really lazy way of getting the script name, and it only works with a Unix-like path.
If you have a chance, consider replacing with File::Basename, a core module in Perl:
use File::Basename;
# later ...
my $script = fileparse($0);
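As a rough sketch of what File::Basename gives you beyond the bare file name, assuming a Unix-style path (the path itself is hypothetical):

```perl
use strict;
use warnings;
use File::Basename qw(basename dirname fileparse);

my $path = '/usr/local/bin/myscript.pl';   # hypothetical path, Unix-style

# Just the file name, or just the directory.
my $name = basename($path);
my $dir  = dirname($path);

# In list context fileparse returns name, directory and suffix; the suffix
# is only split off when a suffix pattern is supplied.
my ($base, $dirs, $suffix) = fileparse($path, qr/\.[^.]*/);

print "name=$name dir=$dir\n";
print "base=$base dirs=$dirs suffix=$suffix\n";
```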

Evaluating escape sequences in perl

I'm reading strings from a file. Those strings contain escape sequences which I would like to have evaluated before processing them further. So I do:
$t = eval("\"$t\"");
which works fine. But I have doubts about the performance. If eval forks a perl process each time, it will be a performance killer. Another way I considered was a regex, for which I have found related questions on SO.
My question: is there a better, more efficient way to do it?
EDIT: before calling eval in my example $t is containing \064\065\x20a\n. It is evaluated to 45 a<LF>.
It's not quite clear what the strings in the file look like and what you do to them before passing off to eval. There's something missing in the explanation.
If you simply want to undo C-style escaping (as also used in Perl), use Encode::Escape:
use Encode qw(decode);
use Encode::Escape qw();
my $string_with_unescaped_literals = decode 'ascii-escape', $string_with_escaped_literals;
If you have placeholders in the file which look like Perl variables that you want to fill with values, then you are abusing eval as a poor man's templating engine. Use a real one that does not have the dangerous side effect of running arbitrary code.
$string =~ s/\\([rnt'"\\])/"qq|\\$1|"/gee
String eval can solve the problem too, but it brings up a host of security and maintenance issues, like @ in the string.
Oh gah, don't use eval for this, that's dangerous if someone provides it with input like "system('sync;reboot');"..
But, you could do something like this:
#!/usr/bin/perl
$string = 'foo\"ba\\\'r';
printf("%s\n", $string);
$string =~ s/\\([\"\'])/$1/g;
printf("%s\n", $string);
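Going one step further than the snippets above, here is a minimal sketch of an eval-free unescape that also covers the octal and hex escapes from the question's sample. The unescape function and the %simple table are illustrative names, and only a small subset of Perl's escapes is handled; anything unrecognised is left as it was.

```perl
use strict;
use warnings;

# Single-character escapes this sketch understands (an assumed subset).
my %simple = (n => "\n", t => "\t", r => "\r", '\\' => '\\', '"' => '"', "'" => "'");

sub unescape {
    my ($s) = @_;
    $s =~ s{
        \\ (?:
            x([0-9A-Fa-f]{1,2})   # hex escape
          | ([0-7]{1,3})          # octal escape
          | (.)                   # single-character escape
        )
    }{
        defined($1) ? chr(hex $1)
      : defined($2) ? chr(oct $2)
      : exists $simple{$3} ? $simple{$3} : "\\$3"
    }gsex;
    return $s;
}

my $t = '\064\065\x20a\n';   # the sample from the question, as read from a file
print unescape($t);
```

Nothing from the input is ever run as code here, so a string like system('sync;reboot') just passes through as text.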

Why does this base64 string comparison in Perl fail?

I am trying to compare an encode_base64('test') to the string variable containing the base64 string of 'test'. The problem is it never validates!
use MIME::Base64 qw(encode_base64);
if (encode_base64("test") eq "dGVzdA==")
{
print "true";
}
Am I forgetting anything?
Here's a link to a Perlmonks page which says "Beware of the newline at the end of the encode_base64() encoded strings".
So the simple 'eq' may fail.
To suppress the newline, say encode_base64("test", "") instead.
When you do a string comparison and it fails unexpectedly, print the strings to see what is actually in them. I put brackets around the value to see any extra whitespace:
use MIME::Base64;
$b64 = encode_base64("test");
print "b64 is [$b64]\n";
if ($b64 eq "dGVzdA==") {
print "true";
}
This is a basic debugging technique using the best debugger ever invented. Get used to using it a lot. :)
Also, sometimes you need to read the documentation for things a couple of times to catch the important parts. In this case, MIME::Base64 tells you that encode_base64 takes two arguments. The second argument is the line ending and defaults to a newline. If you don't want a newline on the end of the string you need to give it another line ending, such as the empty string:
encode_base64("test", "")
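Putting the two calls side by side, as a quick sketch (chomp on a copy is included as one more way to drop the newline):

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my $with_newline    = encode_base64("test");        # default line ending "\n"
my $without_newline = encode_base64("test", "");    # empty line ending

# chomp on a copy also removes the trailing newline.
chomp(my $chomped = $with_newline);

print "default: [$with_newline]\n";
print "no eol:  [$without_newline]\n";
print "equal\n" if $without_newline eq "dGVzdA==";
```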
Here's an interesting tip: use Perl's wonderful and well-loved testing modules for debugging. Not only will that give you a head start on testing, but sometimes they'll make your debugging output a lot faster. For example:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More 0.88;
BEGIN { use_ok 'MIME::Base64' => qw(encode_base64) }
is( encode_base64("test"), "dGVzdA==", q{"test" encodes okay} );
done_testing;
Run that script, with perl or with prove, and it won't just tell you that it didn't match, it will say:
# Failed test '"test" encodes okay'
# at testbase64.pl line 6.
# got: 'dGVzdA==
# '
# expected: 'dGVzdA=='
and sharp-eyed readers will notice that the difference between the two is indeed the newline. :)