Why aren't my nested lookarounds working correctly in my Perl substitution? - perl

I have a Perl substitution which converts hyperlinks to lowercase:
's/(?<=<a href=")([^"]+)(?=")/\L$1/g'
I want the substitution to ignore any links which begin with a hash, for example I want it to change the path in Foo Bar to lowercase but skip if it comes across Bar.
Nesting lookaheads to instruct it to skip these links isn't working correctly for me. This is the one-liner I've written:
perl -pi -e 's/(?<=<a href=" (?! (?<=<a href="#) ) )([^"]+)(?=")/\L$1/g' *;
Could anyone hint to me where I have gone wrong with this substitution? It executes just fine, but does not do anything.

As near as I can tell, your initial regex will work just fine, if you add the condition that the first character in the link may not be a hash # or a double quote, e.g. [^#"]
s/(?<=<a href=")([^#"][^"]+)(?=")/\L$1/gi;
In the case you have links which do not start with a hash, e.g. Foo Bar, it becomes slightly more complicated:
s{(?<=<a href=")([^#"]+)(#[^"]+)*(?=")}{ lc($1) . ($2 // "") }gei;
We now have to evaluate the substitution, since otherwise we get undefined variable warnings when the optional anchor reference is not present.

You don't need look-arounds, from what I see
use 5.010;
...
s/<a \s+ href \s* = \s* "\K([^#"][^"]*)"/\L$1"/gx;
\K means "keep" everything before it. It amounts to a variable-length look-behind.
perlre:
For various reasons \K may be significantly more efficient than the equivalent (?<=...) construct, and it is especially useful in situations where you want to efficiently remove something following something else in a string.

Related

Not able to understand a command in perl

I need help to understand what below command is doing exactly
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
and $abc{hier} contains a path "/home/test1/test2/test3"
Can someone please let me know what the above command is doing exactly. Thanks
s/PATTERN/REPLACEMENT/ is Perl's substitution operator. It searches a string for text that matches the regex PATTERN and replaces it with REPLACEMENT.
By default, the substitution operator works on $_. To tell it to work on a different variable, you use the binding operator - =~.
The default delimiter used by the substitution operator is a slash (/) but you can change that to any other character. This is useful if your PATTERN or your REPLACEMENT contains a slash. In this case, the programmer has used # as the delimiter.
To recap:
$abc{hier} =~ s#PATTERN#REPLACEMENT#;
means "look for text in $abc{hier} that matches PATTERN and replace it with REPLACEMENT.
The substitution operator also has various options that change its behaviour. They are added by putting letters after the final delimiter. In this case we have a g. That means "make the substitution global" - or match and change all occurrences of PATTERN.
In your case, the REPLACEMENT string is empty (we have two # characters next to each other). So we're replacing the PATTERN with nothing - effectively deleting whatever matches PATTERN.
So now we have:
$abc{hier} =~ s#PATTERN*##g;
And we know it means, "in the variable $abc{hier}, look for any string that matches PATTERN and replace it with nothing".
The last thing to look at is the PATTERN (or regular expression - "regex"). You can get the full definition of regexes in perldoc perlre. But to explain what we're using here:
/tools : is the fixed string "/tools"
.* : is zero or more of any character
/dfII : is the fixed string "/dfII"
/? : is an optional slash character
.* : is (again) zero or more of any character
So, basically, we're removing bits of a file path from a value that's stored in a hash.
This =~ means "Do a regex operation on that variable."
(Actually, as ikegami correctly reminds me, it is not necessarily only regex operations, because it could also be a transliteration.)
The operation in question is s#something#else#, which means replace the "something" with something "else".
The g at the end means "Do it for all occurences of something."
Since the "else" is empty, the replacement has the effect of deleting.
The "something" is a definition according to regex syntax, roughly it means "Starting with '/tools' and later containing '/dfII', followed pretty much by anything until the end."
Note, the regex mentions at the end /?.*. In detail, this would mean "A slash (/) , or maybe not (?), and then absolutely anything (.) any number of times including 0 times (*). Strictly speaking it is not necessary to define "slash or not", if it is followed by "anything any often", because "anything" includes as slash, and anyoften would include 0 or one time; whether it is followed by more "anything" or not. I.e. the /? could be omitted, without changing the behaviour.
(Thanks ikeagami for confirming.)
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
The above commands use regular expression to strip/remove trailing /tools.*/dfII and
/tools.*/dfII/.* from value of hier member of %abc hash.
It is pretty basic perl except non standard regular expression limiters (# instead of standard /). It allows to avoid escaping / inside the regular expression (s/\/tools.*\/dfII\/?.*//g).
My personal preferred style-guide would make it s{/tools.*/dfII/?.*}{}g .

Perl global substitution of a file path

I am reading a tab delimited file using Perl; and want to apply a global substitution to a file path within this file. I have read that I need to incorporate Q and E into my substitution command; but I'm not able to get the substitution to work. I want to replace the partial string psoft/batch/cs with ps/bat/csprd.
$xl[$idx] =~ s/\Qpsoft/batch/cs\E/\Q/psoft/batch/csprd\E/g;
You can't use \Q to escape a delimiter. For example,
s/\Qa*b//
is equivalent to
s/a\*b//
and not
s/a\*b\/\/...
That means
$xl[$idx] =~ s/\Qpsoft/batch/cs\E/\Q/psoft/batch/csprd\E/g;
is equivalent to
$xl[$idx] =~ s/psoft/batch/cs <junk>
Solution:
$xl[$idx] =~ s/psoft\/batch\/cs/\/psoft\/batch\/csprd/g;
Better:
$xl[$idx] =~ s{psoft/batch/cs}{/psoft/batch/csprd}g;
In more details
There are three steps to parsing an m//, qr// or s/// operator.
The first step is to obtain the trailing flags that affect how the regex pattern is parsed (e.g. x, s, m, i, etc). Since Perl doesn't yet know how to parse the regex pattern and to keep costs down, Perl simply looks for the delimiter marking the end of the pattern and the end of the substitution (usually /), paying attention to no other character other than backslashes (\). \Q is ignored at this point.
The second step is where the double-quoted string escapes (e.g. \Q, \L, etc) and interpolation occurs. Perl won't have a regex pattern until these are processed.
Finally, Perl has a regex pattern and knows how to compile it, so the third step is to compile the regex pattern.
The first problem is that you need to use a different set of delimiters for the substitution operator. Instead of s///, you can use s{}{}. Another problem is that you should not use \Q and \E on the right side of s/// because the right side is not a regular expression. In your case, you don't need Q/E at all:
s{psoft/batch/cs}{/psoft/batch/csprd}g;
Refer to s/PATTERN/REPLACEMENT/

Funky 'x' usage in perl

My usual 'x' usage was :
print("#" x 78, "\n");
Which concatenates 78 times the string "#". But recently I came across this code:
while (<>) { print if m{^a}x }
Which prints every line of input starting with an 'a'. I understand the regexp matching part (m{^a}), but I really don't see what that 'x' is doing here.
Any explanation would be appreciated.
It's a modifier for the regex. The x modifier tells perl to ignore whitespace and comments inside the regex.
In your example code it does not make a difference because there are no whitespace or comments in the regex.
The "x" in your first case, is a repetition operator, which takes the string as the left argument and the number of times to repeat as the right argument. Perl6 can replicate lists using the "xx" repetition operator.
Your second example uses the regular expression m{^a}x. While you may use many different types of delimiters, neophytes may like to use the familiar notation, which uses a forward slash: m/^a/x
The "x" in a regex is called a modifier or a flag and is but one of many optional flags that may be used. It is used to ignore whitespace in the regex pattern, but it also allows the use of normal comments inside. Because regex patterns can get really long and confusing, using whitespace and comments are very helpful.
Your example is very short (all it says is if the first letter of the line starts with "a"), so you probably wouldn't need whitespace or comments, but you could if you wanted to.
Example:
m/^a # first letter is an 'a'
# <-- you can put more regex on this line because whitespace is ignored
# <-- and more here if you want
/x
In this use case 'x' is a regex modifier which "Extends your pattern's legibility by permitting whitespace and comments." according to the perl documentation. However it seems redundant here

How do I escape special characters for a substitution in a Perl one-liner?

Is there some way to replace a string such as #or * or ? or & without needing to put a "\" before it?
Example:
perl -pe 'next if /^#/; s/\#d\&/new_value/ if /param5/' test
In this example I need to replace a #d& with new_value but the old value might contain any character, how do I escape only the characters that need to be escaped?
You have several problems:
You are using \b incorrectly
You are replacing code with shell variables
You need to quote metacharacters
From perldoc perlre
A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it
Neither of the characters # or & are \w characters. So your match is guaranteed to fail. You may want to use something like s/(^|\s)\#d\&(\s|$)/${1}new text$2/
(^|\s) says to match either the start of the string (^)or a whitespace character (\s).
(\s|$) says to match either the end of the string ($) or a whitespace character (\s).
To solve the second problem, you should use %ENV.
To solve the third problem, you should use the \Q and \E escape sequences to escape the value in $ENV{a}.
Putting it all together we get:
#!/bin/bash
export a='#d&'
export b='new text'
echo 'param5 #d&' |
perl -pe 'next if /^#/; s/(^|\s)\Q$ENV{a}\E(\s|$)/$1$ENV{b}$2/ if /param5/'
Which prints
param5 new text
As discussed at perldoc perlre:
...Today it is more common to use the quotemeta() function or the "\Q" metaquoting
escape sequence to disable all metacharacters' special meanings like this:
/$unquoted\Q$quoted\E$unquoted/
Beware that if you put literal backslashes (those not inside interpolated variables) between "\Q" and "\E", double-quotish backslash interpolation may
lead to confusing results. If you need to use literal backslashes within "\Q...\E", consult "Gory details of parsing quoted constructs" in perlop.
You can also use a ' as the delimiter in the s/// operation to make everything be parsed literally:
my $text = '#';
$text =~ s'#'1';
print $text;
In your example, you can do (note the single quotes):
perl -pe 's/\b\Q#f&\E\b/new_value/g if m/param5/ and not /^ *#/'
The other answers have covered the question, now here's your meta-problem: Leaning Toothpick Syndrome. Its when the delimiter and escapes start to blur together:
s/\/foo\/bar\\/\/bar\/baz/
The solution is to use a different delimiter. You can use just about anything, but balanced braces work best. Most editors can parse them and you generally don't have to worry about escaping.
s{/foo/bar\\}{/bar/baz}
Here's your regex with braced delimiters.
s{\#d\&}{new_value}
Much easier on the eyeholes.
If you really want to avoid typing the \s, put your search string into a variable and then use that in your regex instead. You don't need quotemeta or \Q ... \E in that case. For example:
my $s = '#d&';
s/$s/new_value/g;
If you must use this in a one-liner, bear in mind that you will have to escape the $s if you use "s to contain your perl code, or escape the 's if you use 's to contain your perl code.
If you have a string like
my $var1 = abc$123
and you want to replace it with abcd then you have to use \Q \E. If you don't then no matter what perl doesn't replace the string.
This is the only thing that worked for me.
my $var2 = s/\Q$var1\E/abcd/g;

How can I prevent Perl from interpreting \ as an escape character?

How can I print a address string without making Perl take the slashes as escape characters? I don't want to alter the string by adding more escape characters also.
What you're asking about is called interpolation. See the documentation for "Quote-Like Operators" at perldoc perlop, but more specifically the way to do it is with the syntax called the "here-document" combined with single quotes:
Single quotes indicate the text is to be treated literally with no interpolation of its content. This is similar to single quoted strings except that backslashes have no special meaning, with \ being treated as two backslashes and not one as they would in every other quoting construct.
This is the only form of quoting in perl where there is no need to worry about escaping content, something that code generators can and do make good use of.
For example:
my $address = <<'EOF';
blah#blah.blah.com\with\backslashes\all\over\theplace
EOF
You may want to read up on the various other quoting operators such as qw and qq (at the same document as I referenced above), as they are very commonly used and make good shorthand for other more long-winded ways of escaping content.
Use single quotes. For example
print 'lots\of\backslashes', "\n";
gives
lots\of\backslashes
If you want to interpolate variables, use the . operator, as in
$var = "pesky";
print 'lots\of\\' . $var . '\backslashes', "\n";
Notice that you have to escape the backslash at the end of the string.
As an alternative, you could use join:
print join("\\" => "lots", "of", $var, "backslashes"), "\n";
We could give much more helpful answers if you'd give us sample code.
It depends what you're escaping, but the Quote-like operators may help.
See the perlop man page.
Use the backslah two times,
print "This is a backslah character \\";