meaning of the following regular expressions written in perl - perl

Here is a piece of code
while($l=~/(\\\s*)$/) {
statements;
}
$l contains a line of text taken form file, in effect this code is for go through lines in file.
Questions:
I don't clearly understand what the condition in while is doing. I think it is trying to match group of \ followed by some number of white spaces at the end of line and loop should stop whenever a line ends with \ and may be some white spaces. I am not sure of it.
I came across statement $a ~= s/^(.*$)/$1/ . What I understand that ^ will force matching at the beginning of string, but in (.*$) would mean match all the characters at the end of string . Dose it mean that the statement is trying to find if any group of character at the end is same as group of character in the beginning of text ?

It is interesting to note that this statement:
while ( $l =~ /(\\\s*)$/ ) {
Is an infinite loop unless $l is altered inside the loop so that the regex no longer matches. As has already been mentioned by others, this is what it matches:
( ... ) a capture group, captures string to $1 (that's the number one, not lower case L)
\\ matches a literal backslash
\s* matches 0 or more whitespace characters.
$ matches end of line with optional newline.
Since you do not have the /g modifier, this regex will not iterate through matches, it will simply check if there is a match, resetting the regex each iteration, thereby causing an endless loop.
The statement
$a ~= s/^(.*$)/$1/
Looks rather pointless. It captures a string of characters up until end of string, then replaces it with itself. The captured text is stored in $1 and is simply replaced. The only marginally useful thing about this regex is that:
It matches up until newline \n, and nothing further, which may be of some use to a parser. A period . matches any character except newline, unless the /s modifier is present on the regex.
It captures the line in $1 for future use. However, a simple /^(.*$)/ would do the same.

1. the while
Usually while (regex) is used with the /g modifier, otherwise, if it matches, you get an infinite loop (unless you exit the loop, like using last).
statements would be executed continuously in an infinite loop.
In your case, adding the g
while($l=~/(\\\s*)$/g)
will have the while make only one loop, due to the $ - making a match unique (whatever matches up to the end of string is unique, as $ marks the end, and there is nothing after...).
2. $a ~= s/^(.*$)/$1/
This is a substitution. If the string ^.*$ matches (and it will, since ^.*$ matches (almost, see comment) anything) it is replaced with... $1 or what's inside the (), ie itself, since the match occurs from 1st char to the end of string
^ means beginning of string
(.*) means all chars
$ end of string
so that will replace $a with itself - probably not what you want.

it matches a literal backslash followed by 0 or more spaces followed by the end of the line.

it executes statements for all the lines in that text file that contain a \, followed by zero or more spaces ( \s* ), at the end of the line ($).

It matches lines that end with a backslash character, ignoring any trailing whitespace characters.
Ending a line with a backslash is used in some languages and data files to indicate that the line is being continued on the next line. So I suspect this is part of a parser that merges these continuation lines.
If you enter a regular expression at RegExr and hover your mouse over the pieces, it displays the meaning of each piece in a tooltip.

(\\\s*)$ this regex means --- a \ followed by zero or more number of white space characters which is followed by end of the line. Since you have your regex in (...), you can extract what you matched using $1, if you need.
http://rubular.com/r/dtHtEPh5DX
EDIT -- based on your update
$a ~= s/^(.$)/$1/ --- this is search and replace. So your regex matches a line which contains exactly one character (since you use . http://www.regular-expressions.info/dot.html), except a new-line character. Since you use (...), the character which matched the regex is extracted and stored in variable a
EDIT -- you changed your regex so here is the updated answer
$a ~= s/^(.*$)/$1/ -- same as above except now it matches zero or more characters (except new-line)

Related

Alphanumeric substitution with vim

I'm using the vscode vimplugin. I have a bunch of lines that look like:
Terry,169,80,,,47,,,22,,,6,,
I want to remove all the alphanumeric characters after the first comma so I get:
Terry,,,,,,,,,,,,,
In command mode I tried:
s/^.+\,[a-zA-Z0-9-]\+//g
But this does not appear to do anything. How can I get this working?
edit:
s/^[^,]\+,[a-zA-Z0-9-]\+//g
\+ is greedy; ^.\+, eats the entire line up to the last ,.
Instead of the dot (which means "any character") use [^,] which means "any but a comma". Then ^[^,]\+, means "any characters up to the first comma".
The problem with your requirement is that you want to anchor at the beginning using ^ so you cannot use flag g — with the anchor any substitution will be done once. The only way I can solve the puzzle is to use expressions: match and preserve the anchored text and then use function substitute() with flag g.
I managed with the following expression:
:s/\(^[^,]\+\)\(,\+\)\(.\+\)$/\=submatch(1) . submatch(2) . substitute(submatch(3), '[^,]', '', 'g')/
Let me split it in parts. Searching:
\(^[^,]\+\) — first, match any non-commas
\(,\+\) — any number of commas
\(.\+\)$ — all chars to the end of the string
Substituting:
\= — the substitution is an expression
See http://vimdoc.sourceforge.net/htmldoc/change.html#sub-replace-expression
submatch(1) — replace with the first match (non-commas anchored with ^)
submatch(2) — replace with the second match (commas)
substitute(submatch(3), '[^,]', '', 'g') — replace in the rest of the string
The last call to substitute() is simple, it replaces all non-commas with empty strings.
PS. Tested in real vim, not vscode.

what does this $tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge; (perl code) mean?

What does this mean?
$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
This comes from a cve study blog which written in Perl. I know this is a regular expression, the content in the second {} should replace that in the first, but I do NOT get what '\\'.($2 || $1)means.
$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
It is a substitution operator s/// applied to the string $tok, with the modifiers sge. The delimiters of the operator has been changed from / to {}. Lets break that regex down
s{
\\(.) # (1) match a backslash followed by 1 character, capture
| # (2) or
( # (3) start capture parens
[\$\#] # (4) either a literal $ or #
| # (5) or
\\$ # (6) backslash at the end of line (including newline)
) # end capture parens
}{ # replace with
'\\'.($2 || $1)} # (7) backslash concatenated with either capture 2 or 1
sge; # (8) s = . matches newline, g = match multiple times, e = eval
Judging (at a glance) from the rest of that blog code, this code is not written by someone skilled at Perl. So I will take their comments at face value:
# must protect unescaped "$" and "#" symbols, and "\" at end of string
The eval (8) is apparently to concatenate a backslash with either capture group 2 (2) or 1 (1), depending on which is "true". Or rather, which one matched the string.
Looking closer at the code, (1) and (6) are very similar. The latter one will trigger only at the end of a line that does not have a newline, whereas the first one will handle all other cases, including end of line with a newline (because of /s modifier).
(1) will match any escaped character, so \1, or \$ or \\ anything with a backslash followed by a character. If we look at the replacement part (7), we see that this capture group is the fallback, which will only trigger if the second capture group fails. The second capture group also only matches if the first fails. Confusing? Maybe a little.
(2) triggers if the matching character is not a backslash followed by a character. Now we are looking for a literal $ or #. Or failing that, a backslash at the end of line. But wait a minute, we already checked for backslash? Yes, but this is an edge case.
In the case of (1) matching, $2 will be undefined, and $1, the first capture group, a single character, will be put back into the text. The backslash that was before it will be removed in (1), and then put back in (7). This will not really do anything, just make the regex not destroy already escaped characters.
In the case of (2) matching, it will either be an end of line backslash that is consumed (6) and put back (7), or it will be a $ or # which is consumed (4) and put back (7), with a backslash in front.
So basically what the OP says in the comment is happening.

How to find and replace with sed, except when between curly braces?

I have a command like this, it is marking words to appear in an index in the document:
sed -i "s/\b$line\b/\\\keywordis\{$line\}\{$wordis\}\{$definitionis\}/g" file.txt
The problem is, it is finding matches within existing matches, which means its e.g. "hello" is replaced with \keywordis{hello}{a common greeting}, but then "greeting" might be searched too, and \keywordis{hello}{a common \keywordis{greeting}{a phrase used when meeting someone}}...
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
Curley brackets in this case will always appear on the same line.
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
First tokenize input. Place something unique, like | or byte \x01 between every \keywordis{hello}{a common greeting} and store that in hold space. Something along s/\\the regex to match{hello}{a common greeting}/\x01&\x01/g'.
Ten iterate over elements in hold space. Use \n to separate elements already parsed from not parsed - input from output. If the element matches the format \keywordis{hello}{a common greeting}, just move it to the front before the newline in hold space, if it does not, perform the replacement. Here's an example: Identify and replace selective space inside given text file , it uses double newline \n\n as input/output separator.
Because, as you noted, replacements can have overlapping words with the patterns you are searching for, I believe the simplest would be after each replacement shuffling the pattern space like for ready output and starting the process all over for the current line.
Then on the end, shuffle the hold space to remove \x01 and newline and any leftovers and output.
Overall, it's Latex. I believe it would be simpler to do it manually.
By "eating" the string from the back and placing it in front of input/output separator inside pattern space, I simplified the process. The following program:
sed '
# add our input/output separator - just a newline
s/^/\n/
: loop
# l1000
# Ignore any "\keywords" and "{stuff}"
/^\([^\n]*\)\n\(.*\)\(\\[^{}]*\|{[^{}]*}\)$/{
s//\3\1\n\2/
b loop
}
# Replace hello followed by anthing not {}
# We match till the end because regex is greedy
# so that .* will eat everything.
/^\([^\n]*\)\n\(.*\)hello\([{}]*\)$/{
s//\\keywordis{hello}{a common greeting}\3\1\n\2/
b loop
}
# Hello was not matched - ignore anything irrelevant
# note - it has to match at least one character after newline
/^\([^\n]*\)\n\(.*\)\([^{}]\+\)$/{
s//\3\1\n\2/
b loop
}
s/\n//
' <<<'
\keywordis{hello}{hello} hello {some other hello} another hello yet
'
outputs:
\keywordis{hello}{hello} \keywordis{hello}{a common greeting} {some other hello} another \keywordis{hello}{a common greeting} yet

sed - remove specific subscript from string

please provide me a sed oneliner which provides this output:
sdc3 sdc2
for Input :
sdc3[1] sdc2[0]
I mean remove all subscript value from the string ..
sed 's/\[[^]]*\]//g'
reads: substitute any string with literal "[" followed by zero or more characters that aren't a "]", and then the closing "]", with an empty string.
You need the [^]] bit to prevent greedy matching treating "[1] sdc2[0]" as a single match in your sample string.
As for your comment:
sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
I switch the seperator from the usual '/' to '#', just to avoid escaping the /dev/ bit you asked for (I won't say "for clarity")
the \(...\) bit matches a subgroup, here sdc2 or whatever, so we can refer to it in the replacement
the subgroup uses a similar character class to the one we used discarding the index: [^[ ] means any character except an "[" (again, to avoid greedily matching the index) or a space (assuming your values are space-delimited as per your post)
the replacement is now the literal "/dev/" followed by the first (and only) subgroup match
the g flag at the end tells it to perform multiple matches per line, instead of stopping at the first one

What do these various pieces of syntax mean?

I'm trying to figure out the syntax of both the sed command and perl script:
sed 's/^EOR:$//' INPUTFILE |
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'
Why is there a circumflex ^ in the sed command? The third slash / will replace all instances of EOR: with a blank line, correct?
I understand some of the Perl script. Looking at perlrun, -00 will slurp the stream in paragraph mode and -n starts a while <> loop.
Why is there the first slash / next to the apostrophe? The command searches for TAGXXXX:, but I am not sure what \s+(.*?) does. Does that put whatever is after the tag into a variable? How about the .* in the between tag searches? What does /ns do? What do the $1 and $2 refer to in the print line?
This was tough to find online, and if someone could kick me in the right direction, I'd appreciate it.
The circumflex ^ is regex for "start of line", and $ is regex for "end of line"; so sed will only remove lines which contain exactly "EOR:" and nothing else.
The Perl script is basically perl -00 -ne '/(re)g(ex)/ && print "re ex\n"' with a big ole regex instead of the simple placeholder I put here. In particular, the /x modifier allows you to split the regex over several lines. So the first / is the start of the regex and the final / is the end of the regex and the lines in between form the regex together.
The /s modifier changes how Perl interprets . in a regex; normally it will match any character except newline, but with this option, it includes newlines as well. This means that .* can match multiple lines.
\s matches a single whitespace character; \s+ matches as many whitespace characters as possible, but there has to be at least one.
(.*?) matches an arbitrary length of string; the dot matches any character, the asterisk says zero or more of any character, and the question mark modifies the asterisk repetition operator to match as short a string as possible instead of as long a string as possible. The parentheses cause the skipped expression to be captured in a back reference; the backrefs are named $1, $2, etc, as many as there are backreferences; the numbers correspond to the order of the opening parenthesis (so if you apply (a(b)) to the string "ab", $1 will be "ab" and $2 will be "b").
Finally, \n matches a literal newline. So the (.*?) non-greedy match will match up to the first newline, i.e. the tail of the line on which the TAGsomething was found. (I
imagine these are gene sequences, not "tags"?)
It doesn't really make sense to run sed separately; Perl would be quite capable of removing the EOR: lines before attempting to match the regex.
Let's see...
Yes, sed will empty the lines with EOR:
The first / in the Perl script means a regexp pattern. Concretely, it is searching for a pattern in the form below
The regex ends with "xs", which means that the regex will match multiple lines of the input
The script also will print as output the strings found in the tags (see below). The $1 and $2 mean the elements contained in the first pair of parentheses ($1) and in the second ($2).
. The form is this one:
TAGA01:<spaces><string1>
<whatever here>
TAGCC00:<spaces><string2>
In this case, $1 is <string1> and $2 is <string2>.