Conditional replacement of a character - pcre

I would like to replace a character in a long string only if a special sequence is present in the input.
Example:
This string is a sample! I wrote it to describe my problem! I hope somebody can help me with this! I have the ID: 12345! That's all!
My desired output is:
This string is a sample. I wrote it to describe my problem. I hope somebody can help me with this. I have the ID: 12345. That's all.
Only when '12345' present in the input string.
I tried (positive|negative) look(ahead|behind)
(?<!=12345)(!+(.*))+
Does not work, so as ?=, ?!...
Is this possible with PCRE replacement in one step?

In general, this is possible with any regex flavor supporting \G "string start/end of the previous match" operator. You may replace with $1 + desired text when searching with the following patterns:
(?:\G(?!^)|^(?=.*CHECKME))(.*?)REPLACEME <-- Replace REPLACEME if CHECKME is present
(?:\G(?!^)|^(?!.*CHECKME))(.*?)REPLACEME <-- Replace REPLACEME if CHECKME is absent
With Perl/PCRE/Onigmo that support \K, you may replace with your required text when searching with
(?:\G(?!^)|^(?=.*CHECKME)).*?\KREPLACEME <-- Replace REPLACEME if CHECKME is present
(?:\G(?!^)|^(?!.*CHECKME)).*?\KREPLACEME <-- Replace REPLACEME if CHECKME is absent
In your case, since the text searched for is a single character, you may use a more efficient regex with just one .*:
(?:\G(?!^)|^(?=.*12345))[^!]*\K!
and replace with . (or with $1. if you use (?:\G(?!^)|^(?=.*12345))([^!]*)!). See the regex demo.
If there can be line breaks in the string use (?s)(?:\G(?!^)|^(?=.*12345))[^!]*\K!.
Details
(?:\G(?!^)|^(?=.*12345)) - either the end of the previous match (\G(?!^)) or (|) the start of a string position followed with any 0+ chars as many as possible up to the last occurrence of 12345 (^(?=.*12345))
[^!]* - 0 or more chars other than !
\K - match reset operator that discards all text matched so far in the match memory buffer
! - a ! char.

Related

Alphanumeric substitution with vim

I'm using the vscode vimplugin. I have a bunch of lines that look like:
Terry,169,80,,,47,,,22,,,6,,
I want to remove all the alphanumeric characters after the first comma so I get:
Terry,,,,,,,,,,,,,
In command mode I tried:
s/^.+\,[a-zA-Z0-9-]\+//g
But this does not appear to do anything. How can I get this working?
edit:
s/^[^,]\+,[a-zA-Z0-9-]\+//g
\+ is greedy; ^.\+, eats the entire line up to the last ,.
Instead of the dot (which means "any character") use [^,] which means "any but a comma". Then ^[^,]\+, means "any characters up to the first comma".
The problem with your requirement is that you want to anchor at the beginning using ^ so you cannot use flag g — with the anchor any substitution will be done once. The only way I can solve the puzzle is to use expressions: match and preserve the anchored text and then use function substitute() with flag g.
I managed with the following expression:
:s/\(^[^,]\+\)\(,\+\)\(.\+\)$/\=submatch(1) . submatch(2) . substitute(submatch(3), '[^,]', '', 'g')/
Let me split it in parts. Searching:
\(^[^,]\+\) — first, match any non-commas
\(,\+\) — any number of commas
\(.\+\)$ — all chars to the end of the string
Substituting:
\= — the substitution is an expression
See http://vimdoc.sourceforge.net/htmldoc/change.html#sub-replace-expression
submatch(1) — replace with the first match (non-commas anchored with ^)
submatch(2) — replace with the second match (commas)
substitute(submatch(3), '[^,]', '', 'g') — replace in the rest of the string
The last call to substitute() is simple, it replaces all non-commas with empty strings.
PS. Tested in real vim, not vscode.

How to do negate or subtract a regex from another regex result in just one line of regex

I am trying to do a regex string to find all cases of force unwrapping in swift. This will search all words with exclamation points in the entire code base. However, the regex that I already have has included implicit declaration of variable which I am trying to exclude.
This is the regex that I already have.
(:\s)?\w+(?<!as)\)*!
And it works fine. It searches for "variableName!", "(variableName)!", "hello.hello!". The exclusion of force casting also works. It avoids cases like "hello as! UIView", But I am trying also to exclude another cases such as "var hello: UIView!" which has an exclamation point. That's the problem I am having. I tried negative lookahead and negative lookbehind and nothing solved this kind of case.
This is the sample regex I am working on
(:\s)?\w+(?<!as)\)*!
And this is the result
testing.(**test)))!**
Details lists capture **groups!**
hello as! hello
**Hello!**
**testing!**
testing**.test!**
Hello != World
var noNetworkBanner**: StatusBarNotificationBanner!** <-- need to exclude
"var noNetworkBanner**: StatusBarNotificationBanner!**" <-- need to exclude
You may use
(?<!:\s)\b\w+(?<!\bas)\b\)*!
I added \b word boundaries to match whole words only, and changed the (:\s)? optional group to a negative lookbehind, (?<!:\s), that disallows a : + space before the word you need to match.
See the regex demo and the regex graph:
Details
(?<!:\s) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a : and a whitespace
\b - word boundary
\w+ - 1+ word chars
(?<!\bas) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a whole word as
\b - word boundary
\)* - 0 or more ) chars
! - a ! char.

meaning of the following regular expressions written in perl

Here is a piece of code
while($l=~/(\\\s*)$/) {
statements;
}
$l contains a line of text taken form file, in effect this code is for go through lines in file.
Questions:
I don't clearly understand what the condition in while is doing. I think it is trying to match group of \ followed by some number of white spaces at the end of line and loop should stop whenever a line ends with \ and may be some white spaces. I am not sure of it.
I came across statement $a ~= s/^(.*$)/$1/ . What I understand that ^ will force matching at the beginning of string, but in (.*$) would mean match all the characters at the end of string . Dose it mean that the statement is trying to find if any group of character at the end is same as group of character in the beginning of text ?
It is interesting to note that this statement:
while ( $l =~ /(\\\s*)$/ ) {
Is an infinite loop unless $l is altered inside the loop so that the regex no longer matches. As has already been mentioned by others, this is what it matches:
( ... ) a capture group, captures string to $1 (that's the number one, not lower case L)
\\ matches a literal backslash
\s* matches 0 or more whitespace characters.
$ matches end of line with optional newline.
Since you do not have the /g modifier, this regex will not iterate through matches, it will simply check if there is a match, resetting the regex each iteration, thereby causing an endless loop.
The statement
$a ~= s/^(.*$)/$1/
Looks rather pointless. It captures a string of characters up until end of string, then replaces it with itself. The captured text is stored in $1 and is simply replaced. The only marginally useful thing about this regex is that:
It matches up until newline \n, and nothing further, which may be of some use to a parser. A period . matches any character except newline, unless the /s modifier is present on the regex.
It captures the line in $1 for future use. However, a simple /^(.*$)/ would do the same.
1. the while
Usually while (regex) is used with the /g modifier, otherwise, if it matches, you get an infinite loop (unless you exit the loop, like using last).
statements would be executed continuously in an infinite loop.
In your case, adding the g
while($l=~/(\\\s*)$/g)
will have the while make only one loop, due to the $ - making a match unique (whatever matches up to the end of string is unique, as $ marks the end, and there is nothing after...).
2. $a ~= s/^(.*$)/$1/
This is a substitution. If the string ^.*$ matches (and it will, since ^.*$ matches (almost, see comment) anything) it is replaced with... $1 or what's inside the (), ie itself, since the match occurs from 1st char to the end of string
^ means beginning of string
(.*) means all chars
$ end of string
so that will replace $a with itself - probably not what you want.
it matches a literal backslash followed by 0 or more spaces followed by the end of the line.
it executes statements for all the lines in that text file that contain a \, followed by zero or more spaces ( \s* ), at the end of the line ($).
It matches lines that end with a backslash character, ignoring any trailing whitespace characters.
Ending a line with a backslash is used in some languages and data files to indicate that the line is being continued on the next line. So I suspect this is part of a parser that merges these continuation lines.
If you enter a regular expression at RegExr and hover your mouse over the pieces, it displays the meaning of each piece in a tooltip.
(\\\s*)$ this regex means --- a \ followed by zero or more number of white space characters which is followed by end of the line. Since you have your regex in (...), you can extract what you matched using $1, if you need.
http://rubular.com/r/dtHtEPh5DX
EDIT -- based on your update
$a ~= s/^(.$)/$1/ --- this is search and replace. So your regex matches a line which contains exactly one character (since you use . http://www.regular-expressions.info/dot.html), except a new-line character. Since you use (...), the character which matched the regex is extracted and stored in variable a
EDIT -- you changed your regex so here is the updated answer
$a ~= s/^(.*$)/$1/ -- same as above except now it matches zero or more characters (except new-line)

sed - remove specific subscript from string

please provide me a sed oneliner which provides this output:
sdc3 sdc2
for Input :
sdc3[1] sdc2[0]
I mean remove all subscript value from the string ..
sed 's/\[[^]]*\]//g'
reads: substitute any string with literal "[" followed by zero or more characters that aren't a "]", and then the closing "]", with an empty string.
You need the [^]] bit to prevent greedy matching treating "[1] sdc2[0]" as a single match in your sample string.
As for your comment:
sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
I switch the seperator from the usual '/' to '#', just to avoid escaping the /dev/ bit you asked for (I won't say "for clarity")
the \(...\) bit matches a subgroup, here sdc2 or whatever, so we can refer to it in the replacement
the subgroup uses a similar character class to the one we used discarding the index: [^[ ] means any character except an "[" (again, to avoid greedily matching the index) or a space (assuming your values are space-delimited as per your post)
the replacement is now the literal "/dev/" followed by the first (and only) subgroup match
the g flag at the end tells it to perform multiple matches per line, instead of stopping at the first one

How to get a perfect match for a regexp pattern in Perl?

I've to match a regular-expression, stored in a variable:
#!/bin/env perl
use warnings;
use strict;
my $expr = qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx;
$str = "abcd[3] xyzg[4:0]";
if ($str =~ m/$expr/) {
print "\n%%%%%%%%% $`-----$&-----$'\n";
}
else {
print "\n********* NOT MATCHED\n";
}
But I'm getting the outout in $& as
%%%%%%%%% -----abcd[3] xyzg-----[4:0]
But expecting, it shouldn't go inside the if clause.
What is intended is:
if $str = "abcd xyzg" => %%%%%%%%% -----abcd xyzg----- (CORRECT)
if $str = "abcd[2] xyzg" => %%%%%%%%% -----abcd[2] xyzg----- (CORRECT)
if $str = "abcd[2] xyzg[3] => %%%%%%%%% -----abcd[2] xyzg[3]----- (CORRECT)
if $str = "abcd[2:0] xyzg[3] => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2:0] xyzg[3:0] => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2] xyzg[3:0]" => ********* NOT MATCHED (CORRECT/INTENDED)
but output is %%%%%%%%% -----abcd[2] xyzg-----[3:0] (WRONG)
OR better to say this is not intended.
In this case, it should/my_expectation go to the else block.
Even I don't know, why $& take a portion of the string (abcd[2] xyzg), and $' having [3:0]?
HOW?
It should match the full, not a part like the above. If it didn't, it shouldn't go to the if clause.
Can anyone please help me to change my $expr pattern, so that I can have what is intended?
By default, Perl regexes only look for a matching substring of the given string. In order to force comparison against the entire string, you need to indicate that the regex begins at the beginning of the string and ends at the end by using ^ and $:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)$/;
(Also, there's no reason to have the /x modifier, as your regex doesn't include any literal whitespace or # characters, and there's no reason for the /s modifier, as you're not using ..)
EDIT: If you don't want the regex to match against the entire string, but you want it to reject anything in which the matching portion is followed by something like "[0:0]", the simplest way would be to use lookahead:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\]|(?=[^[\w])|$ ))/x;
This will match anything that takes the following form:
beginning of the string (which your example in the comments seems to imply you want)
zero or more whitespace characters
one or more word characters
optional: [, one or more digits, ]
one or more whitespace characters
one or more word characters
one of the following, in descending order of preference:
[, one or more digits, ]
an empty string followed by (but not including!) a character that is neither [ nor a word character (The exclusion of word characters is to keep the regex engine from succeeding on "a[0] bc[1:2]" by only matching "a[0] b".)
end of string (A space is needed after the $ to keep it from merging with the following ) to form the name of a special variable, and this entails the reintroduction of the /x option.)
Do you have any more unstated requirements that need to be satisfied?
The short answer is your regexp is wrong.
We can't fix it for you without you explaining what you need exactly, and the community is not going to write a regexp exactly for your purpose because that's just too localized a question that only helps you this one time.
You need to ask something more general about regexps that we can explain to you, that will help you fix your regexp, and help others fix theirs.
Here's my general answer when you're having trouble testing your regexp. Use a regexp tool, like the regex buddy one.
So I'm going to give a specific answer about what you're overlooking here:
Let's make this example smaller:
Your pattern is a(bc+d)?. It will match: abcd abccd etc. While it will not match bcd nor bzd in the case of abzd it will match as matching only a because the whole group of bc+d is optional. Similarly it will match abcbcd as a dropping the whole optional group that couldn't be matched (at the second b).
Regexps will match as much of the string as they can and return a true match when they can match something and have satisfied the entire pattern. If you make something optional, they will leave it out when they have to including it only when it's present and matches.
Here's what you tried:
qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx
First, s and x aren't needed modifiers here.
Second, this regex can match:
Any or no whitespace followed by
a word of at least one alpha character followed by
optionally a grouped square bracketed number with at least one digit (eg [0] or [9999]) followed by
at least one white space followed by
a word of at least one alpha character followed by
optionally a square bracketed number with at least one digit.
Clearly when you ask it to match abcd[0] xyzg[0:4] the colon ends the \d+ pattern but doesn't satisfy the \] so it backtracks the whole group, and then happily finds the group was optional. So by not matching the last optional group, your pattern has matched successfully.