Perl global substitution of a file path


I am reading a tab-delimited file using Perl and want to apply a global substitution to a file path within this file. I have read that I need to incorporate \Q and \E into my substitution command, but I'm not able to get the substitution to work. I want to replace the partial string psoft/batch/cs with ps/bat/csprd.
$xl[$idx] =~ s/\Qpsoft/batch/cs\E/\Q/psoft/batch/csprd\E/g;

You can't use \Q to escape a delimiter. For example,
s/\Qa*b//
is equivalent to
s/a\*b//
and not
s/a\*b\/\/...
That means
$xl[$idx] =~ s/\Qpsoft/batch/cs\E/\Q/psoft/batch/csprd\E/g;
is equivalent to
$xl[$idx] =~ s/psoft/batch/cs <junk>
Solution:
$xl[$idx] =~ s/psoft\/batch\/cs/\/psoft\/batch\/csprd/g;
Better:
$xl[$idx] =~ s{psoft/batch/cs}{/psoft/batch/csprd}g;
In more detail
There are three steps to parsing an m//, qr// or s/// operator.
The first step is to obtain the trailing flags that affect how the regex pattern is parsed (e.g. x, s, m, i, etc). Since Perl doesn't yet know how to parse the regex pattern, and to keep costs down, Perl simply looks for the delimiter marking the end of the pattern and the end of the substitution (usually /), paying attention to no character other than backslashes (\). \Q is ignored at this point.
The second step is where the double-quoted string escapes (e.g. \Q, \L, etc) and interpolation occurs. Perl won't have a regex pattern until these are processed.
Finally, Perl has a regex pattern and knows how to compile it, so the third step is to compile the regex pattern.
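One consequence of that ordering, shown as a minimal sketch: because the delimiters are located before interpolation happens, slashes stored inside a variable cannot end the pattern early, and \Q then applies to the variable's contents. The variable names and the sample path are assumptions made up for illustration.

```perl
use strict;
use warnings;

my $from = 'psoft/batch/cs';        # search text, contains slashes
my $to   = '/psoft/batch/csprd';    # replacement text
my $path = 'psoft/batch/cs/run.sh'; # hypothetical sample value

# Step 1 looks for delimiters in the source code only, so the slashes
# stored inside $from and $to are never seen as delimiters.
# Step 2 interpolates $from and applies \Q to its contents.
(my $fixed = $path) =~ s/\Q$from\E/$to/g;

print "$fixed\n";
```

This is only useful when the search text lives in a variable; for a literal path, switching delimiters as shown above is simpler.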

The first problem is that you need to use a different set of delimiters for the substitution operator. Instead of s///, you can use s{}{}. Another problem is that you should not use \Q and \E on the right side of s/// because the right side is not a regular expression. In your case, you don't need \Q/\E at all:
s{psoft/batch/cs}{/psoft/batch/csprd}g;
Refer to s/PATTERN/REPLACEMENT/ in perldoc perlop.
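For context, here is a rough sketch of how the brace-delimited substitution might be applied to every field of one tab-delimited record. The sample line is invented, and @xl simply mirrors the array name used in the question.

```perl
use strict;
use warnings;

# Hypothetical sample record; real data would be read from the file.
my $line = "job1\tpsoft/batch/cs/run.sh\tdone";
my @xl   = split /\t/, $line;

# Braces as delimiters mean the slashes in the path need no escaping.
s{psoft/batch/cs}{/psoft/batch/csprd}g for @xl;

my $result = join "\t", @xl;
print "$result\n";
```

Only the field containing the path changes; the other fields pass through untouched.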

Related

Not able to understand a command in perl

I need help to understand what below command is doing exactly
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
and $abc{hier} contains a path "/home/test1/test2/test3"
Can someone please let me know what the above command is doing exactly. Thanks

s/PATTERN/REPLACEMENT/ is Perl's substitution operator. It searches a string for text that matches the regex PATTERN and replaces it with REPLACEMENT.
By default, the substitution operator works on $_. To tell it to work on a different variable, you use the binding operator - =~.
The default delimiter used by the substitution operator is a slash (/) but you can change that to any other character. This is useful if your PATTERN or your REPLACEMENT contains a slash. In this case, the programmer has used # as the delimiter.
To recap:
$abc{hier} =~ s#PATTERN#REPLACEMENT#;
means "look for text in $abc{hier} that matches PATTERN and replace it with REPLACEMENT".
The substitution operator also has various options that change its behaviour. They are added by putting letters after the final delimiter. In this case we have a g. That means "make the substitution global" - or match and change all occurrences of PATTERN.
In your case, the REPLACEMENT string is empty (we have two # characters next to each other). So we're replacing the PATTERN with nothing - effectively deleting whatever matches PATTERN.
So now we have:
$abc{hier} =~ s#PATTERN##g;
And we know it means, "in the variable $abc{hier}, look for any string that matches PATTERN and replace it with nothing".
The last thing to look at is the PATTERN (or regular expression - "regex"). You can get the full definition of regexes in perldoc perlre. But to explain what we're using here:
/tools : is the fixed string "/tools"
.* : is zero or more of any character
/dfII : is the fixed string "/dfII"
/? : is an optional slash character
.* : is (again) zero or more of any character
So, basically, we're removing bits of a file path from a value that's stored in a hash.
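A small sketch to make that concrete. The first value is the one from the question; the second is a hypothetical path invented here so that the pattern has something to match.

```perl
use strict;
use warnings;

my %abc   = (hier => '/home/test1/test2/test3');        # value from the question
my %other = (hier => '/opt/tools/vendor/dfII/bin/run'); # hypothetical value

$abc{hier}   =~ s#/tools.*/dfII/?.*##g;
$other{hier} =~ s#/tools.*/dfII/?.*##g;

print "$abc{hier}\n";
print "$other{hier}\n";
```

The path from the question contains neither /tools nor /dfII, so it is left unchanged. The second path is cut back to /opt, because everything from /tools to the end of the string matches.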

This =~ means "Do a regex operation on that variable."
(Actually, as ikegami correctly reminds me, it is not necessarily only regex operations, because it could also be a transliteration.)
The operation in question is s#something#else#, which means replace the "something" with something "else".
The g at the end means "Do it for all occurrences of something."
Since the "else" is empty, the replacement has the effect of deleting.
The "something" is a definition according to regex syntax, roughly it means "Starting with '/tools' and later containing '/dfII', followed pretty much by anything until the end."
Note that the regex ends with /?.*. In detail, this means "a slash (/), or maybe not (?), and then absolutely anything (.) any number of times, including zero (*)". Strictly speaking, the "slash or not" part is unnecessary when it is followed by "anything, any number of times", because "anything" includes a slash and "any number of times" includes zero or one. In other words, the /? could be omitted without changing the behaviour.
(Thanks ikegami for confirming.)
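The redundancy can be checked with a quick sketch that runs both versions of the pattern over a few paths. The sample paths are assumptions chosen to cover a match with and without trailing text.

```perl
use strict;
use warnings;

# Hypothetical sample paths.
my @samples = (
    '/opt/tools/x/dfII',
    '/opt/tools/x/dfII/',
    '/opt/tools/x/dfII/bin',
    '/home/test1',
);

my $same = 1;
for my $path (@samples) {
    (my $with    = $path) =~ s#/tools.*/dfII/?.*##g;  # original pattern
    (my $without = $path) =~ s#/tools.*/dfII.*##g;    # optional slash removed
    $same = 0 if $with ne $without;
}

print $same ? "identical results\n" : "results differ\n";
```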

$abc{hier} =~ s#/tools.*/dfII/?.*##g;
The above command uses a regular expression to strip trailing text matching /tools.*/dfII or /tools.*/dfII/.* from the value of the hier member of the %abc hash.
It is pretty basic Perl, apart from the non-standard regular expression delimiters (# instead of the standard /). They avoid the need to escape / inside the regular expression (s/\/tools.*\/dfII\/?.*//g).
My personal preferred style-guide would make it s{/tools.*/dfII/?.*}{}g .

Difference between /.../ and m/.../ in Perl

What is difference between /.../ and m/.../?
use strict;
use warnings;
my $str = "This is a testing for modifier";
if ($str =~ /This/i) { print "Modifier...\n"; }
if ($str =~ m/This/i) { print "W/O Modifier...\n"; }
I checked a reference site but could not clearly understand the theory.

There's no difference. If you just supply /PATTERN/ then it assumes m. However, if you're using an alternative delimiter, you need to supply the m. E.g. m|PATTERN| won't work as |PATTERN|.
In your example, i is the modifier as it's after the pattern. m is the operation. (as opposed to s, tr, y etc.)
Perhaps slightly confusingly, you can also use m as a modifier, but only if you put it after the pattern.
m/PATTERN/m will cause ^ and $ to match differently than in m/PATTERN/, but it's the trailing m that does this, not the leading one.
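A minimal sketch of that difference, using an invented two-line string:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Leading m is just the match operator; ^ only matches at the start of the string.
my $plain = ($text =~ m/^second/)  ? 1 : 0;

# Trailing m is the modifier; ^ now also matches right after each newline.
my $multi = ($text =~ m/^second/m) ? 1 : 0;

print "without modifier: $plain, with modifier: $multi\n";
```

Only the second match succeeds, since "second" starts a line but not the string.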

Perl has a number of quote-like operators where you can choose the delimiter to suit the data you're passing to the operator.
q(...) creates a single-quoted string
qq(...) creates a double-quoted string
qw(...) creates a list by splitting its arguments on white-space
qx(...) executes a command and returns the output
qr(...) compiles a regular expression
m(...) matches its argument as a regular expression
(There's also s(...)(...) but I've left that off the list as it has two arguments)
For some of these, you can omit the letter at the start of the operator if you choose the default delimiter.
You can omit q if you use single quote characters ('...').
You can omit qq if you use double quote characters ("...").
You can omit qx if you use backticks (`...`).
You can omit m if you use slashes (/.../).
So, to answer your original question, m/.../ and /.../ are the same; because slashes are the default delimiter for the match operator, you can omit the m.
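To illustrate the list above, here is a short sketch using several of these operators with non-default delimiters. All values are made up for the example.

```perl
use strict;
use warnings;

my $name   = 'world';
my $single = q(costs $5);           # like '...': no interpolation
my $double = qq(hello $name);       # like "...": interpolates $name
my @words  = qw(alpha beta gamma);  # list of three words
my $regex  = qr{^/usr/};            # compiled pattern, no escaped slashes

my $path = '/usr/bin/perl';
my $hit  = ($path =~ m{^/usr/} and $path =~ $regex) ? 1 : 0;

print "$single | $double | @words | $hit\n";
```

With m{...} and qr{...} the slashes in the path need no backslashes, which is the usual reason to pick a different delimiter.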

Why aren't my nested lookarounds working correctly in my Perl substitution?

I have a Perl substitution which converts hyperlinks to lowercase:
's/(?<=<a href=")([^"]+)(?=")/\L$1/g'
I want the substitution to ignore any links which begin with a hash, for example I want it to change the path in Foo Bar to lowercase but skip if it comes across Bar.
Nesting lookaheads to instruct it to skip these links isn't working correctly for me. This is the one-liner I've written:
perl -pi -e 's/(?<=<a href=" (?! (?<=<a href="#) ) )([^"]+)(?=")/\L$1/g' *;
Could anyone hint to me where I have gone wrong with this substitution? It executes just fine, but does not do anything.

As near as I can tell, your initial regex will work just fine, if you add the condition that the first character in the link may not be a hash # or a double quote, e.g. [^#"]
s/(?<=<a href=")([^#"][^"]+)(?=")/\L$1/gi;
If you have links which contain a hash but do not start with one, e.g. Foo Bar, it becomes slightly more complicated:
s{(?<=<a href=")([^#"]+)(#[^"]+)*(?=")}{ lc($1) . ($2 // "") }gei;
We now have to evaluate the substitution, since otherwise we get undefined variable warnings when the optional anchor reference is not present.
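Here is a rough sketch of that second substitution run against three invented links: one with a trailing anchor, one that is only an anchor, and one with no anchor at all.

```perl
use strict;
use warnings;

# Hypothetical markup for illustration.
my $html = '<a href="Foo/BAR.html#Top">a</a> <a href="#Bar">b</a> <a href="PAGE.html">c</a>';

# The e flag evaluates the replacement as code, so the missing
# second capture can be defaulted to an empty string.
(my $out = $html) =~ s{(?<=<a href=")([^#"]+)(#[^"]+)*(?=")}{ lc($1) . ($2 // "") }gei;

print "$out\n";
```

The path part is lowercased, the #Top anchor keeps its case, and the link that starts with a hash is skipped.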

You don't need look-arounds, from what I see
use 5.010;
...
s/<a \s+ href \s* = \s* "\K([^#"][^"]*)"/\L$1"/gx;
\K means "keep" everything before it. It amounts to a variable-length look-behind.
perlre:
For various reasons \K may be significantly more efficient than the equivalent (?<=...) construct, and it is especially useful in situations where you want to efficiently remove something following something else in a string.
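A small sketch of the \K version on two invented links. The second link has extra whitespace around href and = to show what a variable-length prefix allows.

```perl
use strict;
use warnings;
use 5.010;   # \K needs Perl 5.10 or later

# Hypothetical markup for illustration.
my $html = '<a href="FOO/Page.HTML">x</a> <a  href = "#Bar">y</a>';

# Everything matched before \K is kept as is; only the text after it is replaced.
(my $out = $html) =~ s/<a \s+ href \s* = \s* "\K([^#"][^"]*)"/\L$1"/gx;

say $out;
```

The first link is lowercased and the anchor-only link is left alone.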

What do these various pieces of syntax mean?

I'm trying to figure out the syntax of both the sed command and perl script:
sed 's/^EOR:$//' INPUTFILE |
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'
Why is there a circumflex ^ in the sed command? The third slash / will replace all instances of EOR: with a blank line, correct?
I understand some of the Perl script. Looking at perlrun, -00 will slurp the stream in paragraph mode and -n starts a while <> loop.
Why is there the first slash / next to the apostrophe? The command searches for TAGXXXX:, but I am not sure what \s+(.*?) does. Does that put whatever is after the tag into a variable? How about the .* between the tag searches? What does /xs do? What do the $1 and $2 refer to in the print line?
This was tough to find online, and if someone could kick me in the right direction, I'd appreciate it.

The circumflex ^ is regex for "start of line", and $ is regex for "end of line"; so sed will only remove lines which contain exactly "EOR:" and nothing else.
The Perl script is basically perl -00 -ne '/(re)g(ex)/ && print "re ex\n"' with a big ole regex instead of the simple placeholder I put here. In particular, the /x modifier allows you to split the regex over several lines. So the first / is the start of the regex and the final / is the end of the regex and the lines in between form the regex together.
The /s modifier changes how Perl interprets . in a regex; normally it will match any character except newline, but with this option, it includes newlines as well. This means that .* can match multiple lines.
\s matches a single whitespace character; \s+ matches as many whitespace characters as possible, but there has to be at least one.
(.*?) matches an arbitrary length of string; the dot matches any character, the asterisk says zero or more of any character, and the question mark modifies the asterisk repetition operator to match as short a string as possible instead of as long a string as possible. The parentheses cause the matched expression to be captured in a back reference; the backrefs are named $1, $2, etc, as many as there are backreferences; the numbers correspond to the order of the opening parenthesis (so if you apply (a(b)) to the string "ab", $1 will be "ab" and $2 will be "b").
Finally, \n matches a literal newline. So the (.*?) non-greedy match will match up to the first newline, i.e. the tail of the line on which the TAGsomething was found. (I imagine these are gene sequences, not "tags"?)
It doesn't really make sense to run sed separately; Perl would be quite capable of removing the EOR: lines before attempting to match the regex.
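The pieces described above can be tried out with a minimal sketch. The record text is invented; only the tag names come from the question.

```perl
use strict;
use warnings;

# Hypothetical record in the shape the question describes.
my $record = "TAGA01:   first value\nOTHER:  ignored\nTAGCC08: second value\n";

# In list context a successful match returns the captured groups.
my ($first, $second) = $record =~ /
    TAGA01:\s+(.*?)\n    # first capture, rest of the TAGA01 line
    .*                   # anything in between, newlines included
    TAGCC08:\s+(.*?)\n   # second capture, rest of the TAGCC08 line
/xs;

# Groups are numbered by the position of their opening parenthesis.
my ($outer, $inner) = 'ab' =~ /(a(b))/;

print "$first | $second | $outer | $inner\n";
```

The leading spaces after each tag are consumed by \s+ and so do not end up in the captures.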

Let's see...
Yes, sed will empty the lines with EOR:
The first / in the Perl script means a regexp pattern. Concretely, it is searching for a pattern in the form below
The regex ends with "xs": x lets the pattern be spread over several lines with whitespace and comments, and s lets . match newlines, so the regex can match across multiple lines of the input
The script also will print as output the strings found in the tags (see below). The $1 and $2 mean the elements contained in the first pair of parentheses ($1) and in the second ($2).
The form is this one:
TAGA01:<spaces><string1>
<whatever here>
TAGCC08:<spaces><string2>
In this case, $1 is <string1> and $2 is <string2>.

How do I escape special characters for a substitution in a Perl one-liner?

Is there some way to replace a string such as # or * or ? or & without needing to put a "\" before it?
Example:
perl -pe 'next if /^#/; s/\#d\&/new_value/ if /param5/' test
In this example I need to replace a #d& with new_value but the old value might contain any character, how do I escape only the characters that need to be escaped?

You have several problems:
You are using \b incorrectly
You are replacing code with shell variables
You need to quote metacharacters
From perldoc perlre
A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it
Neither of the characters # or & are \w characters. So your match is guaranteed to fail. You may want to use something like s/(^|\s)\#d\&(\s|$)/${1}new text$2/
(^|\s) says to match either the start of the string (^)or a whitespace character (\s).
(\s|$) says to match either the end of the string ($) or a whitespace character (\s).
To solve the second problem, you should use %ENV.
To solve the third problem, you should use the \Q and \E escape sequences to escape the value in $ENV{a}.
Putting it all together we get:
#!/bin/bash
export a='#d&'
export b='new text'
echo 'param5 #d&' |
perl -pe 'next if /^#/; s/(^|\s)\Q$ENV{a}\E(\s|$)/$1$ENV{b}$2/ if /param5/'
Which prints
param5 new text

As discussed at perldoc perlre:
...Today it is more common to use the quotemeta() function or the "\Q" metaquoting escape sequence to disable all metacharacters' special meanings like this:
/$unquoted\Q$quoted\E$unquoted/
Beware that if you put literal backslashes (those not inside interpolated variables) between "\Q" and "\E", double-quotish backslash interpolation may lead to confusing results. If you need to use literal backslashes within "\Q...\E", consult "Gory details of parsing quoted constructs" in perlop.
You can also use a ' as the delimiter in the s/// operation to switch off variable interpolation (regex metacharacters such as * and ? keep their special meaning):
my $text = '#';
$text =~ s'#'1';
print $text;
In your example, you can do (note the single quotes):
perl -pe 's/\b\Q#f&\E\b/new_value/g if m/param5/ and not /^ *#/'
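As a minimal sketch of the two quoting approaches mentioned in the perlre excerpt, quotemeta() and \Q...\E can be compared side by side. The old value here is hypothetical and deliberately includes * and ? so that quoting matters.

```perl
use strict;
use warnings;

my $old  = '#d&*?';                 # hypothetical old value with metacharacters
my $line = 'param5 #d&*? end';

my $quoted = quotemeta $old;        # backslashes every non-word character
(my $via_function = $line) =~ s/$quoted/new_value/;
(my $via_escape   = $line) =~ s/\Q$old\E/new_value/;

print "$via_function\n$via_escape\n";
```

Both forms give the same result, so the choice is mostly a matter of whether the quoted text is needed again elsewhere.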

The other answers have covered the question, now here's your meta-problem: Leaning Toothpick Syndrome. It's when the delimiter and escapes start to blur together:
s/\/foo\/bar\\/\/bar\/baz/
The solution is to use a different delimiter. You can use just about anything, but balanced braces work best. Most editors can parse them and you generally don't have to worry about escaping.
s{/foo/bar\\}{/bar/baz}
Here's your regex with braced delimiters.
s{\#d\&}{new_value}
Much easier on the eyeholes.
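A quick sketch to confirm that the slash-delimited and brace-delimited forms above behave the same. The input string is an assumption made up for the test.

```perl
use strict;
use warnings;

my $path = '/foo/bar\\';   # the literal text /foo/bar followed by one backslash

(my $slashes = $path) =~ s/\/foo\/bar\\/\/bar\/baz/;  # toothpick version
(my $braces  = $path) =~ s{/foo/bar\\}{/bar/baz};     # braced version

print "$slashes $braces\n";
```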

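If you really want to avoid typing the backslashes, put your search string into a variable and then use that in your regex instead. For this particular value you don't need quotemeta or \Q ... \E, because #, d and & are not regex metacharacters; a value containing characters such as * or ? would still need \Q ... \E. For example: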
my $s = '#d&';
s/$s/new_value/g;
If you must use this in a one-liner, bear in mind that you will have to escape the $s if you use "s to contain your perl code, or escape the 's if you use 's to contain your perl code.

If you have a string like
my $var1 = 'abc$123';
and you want to replace it with abcd, then you have to use \Q and \E. If you don't, Perl never replaces the string, because the $ is treated as a regex anchor rather than a literal character.
This is the only thing that worked for me.
$var2 =~ s/\Q$var1\E/abcd/g;
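The effect can be seen in a short sketch that runs the substitution with and without \Q. The string being edited is hypothetical.

```perl
use strict;
use warnings;

my $var1 = 'abc$123';            # single quotes keep the $ literal
my $text = 'x abc$123 y';        # hypothetical string to edit

(my $plain  = $text) =~ s/$var1/abcd/g;      # $ acts as an end-of-string anchor
(my $quoted = $text) =~ s/\Q$var1\E/abcd/g;  # $ is matched literally

print "$plain\n$quoted\n";
```

Without \Q nothing is replaced, because no text can follow an end-of-string anchor on the same line. With \Q the whole of abc$123 is replaced.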