Match all words except non-duplicates - pcre

Hello Foo Bar World Foo World Bar Test Foo
foo bar
I want my regex to match everything except non-duplicate words.
It should match all of the following words in the test string: Foo Bar World
It shouldn't match Hello or Test, because those are not duplicated.
How can I accomplish this?

With positive look-ahead assertions you can. It's not especially efficient and I wouldn't use regex for the task.
/(\b\w++\b)(?=(?>.*?\1(?:.(?!\1))*)$)/gs
Edit: Missed Casimir et Hippolyte's comment answer, which is faster but less compatible.
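Since the answer above advises against using regex for this task, here is a minimal non-regex sketch in Python. It assumes words are separated by whitespace and compared case-sensitively; the function name duplicated_words is made up for illustration.

```python
from collections import Counter


def duplicated_words(text):
    """Return every occurrence of each word that appears more than once."""
    words = text.split()      # assumption: words are whitespace-separated
    counts = Counter(words)   # maps each word to its number of occurrences
    return [word for word in words if counts[word] > 1]


if __name__ == "__main__":
    print(duplicated_words("Hello Foo Bar World Foo World Bar Test Foo"))
```

Hello and Test occur only once in the test string, so they are filtered out, while every occurrence of Foo, Bar and World is kept.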

Related

How to negate or subtract a regex from another regex result in just one line of regex

I am trying to write a regex to find all cases of force unwrapping in Swift. It will search for all words followed by an exclamation point in the entire code base. However, the regex that I already have also matches implicitly unwrapped variable declarations, which I am trying to exclude.
This is the regex that I already have.
(:\s)?\w+(?<!as)\)*!
And it works fine. It finds "variableName!", "(variableName)!", and "hello.hello!". The exclusion of force casting also works: it avoids cases like "hello as! UIView". But I am also trying to exclude other cases, such as "var hello: UIView!", which also has an exclamation point. That's the problem I am having. I tried negative lookahead and negative lookbehind, and neither solved this kind of case.
This is the sample regex I am working on
(:\s)?\w+(?<!as)\)*!
And this is the result
testing.(**test)))!**
Details lists capture **groups!**
hello as! hello
**Hello!**
**testing!**
testing**.test!**
Hello != World
var noNetworkBanner**: StatusBarNotificationBanner!** <-- need to exclude
"var noNetworkBanner**: StatusBarNotificationBanner!**" <-- need to exclude
You may use
(?<!:\s)\b\w+(?<!\bas)\b\)*!
I added \b word boundaries to match whole words only, and changed the (:\s)? optional group to a negative lookbehind, (?<!:\s), that disallows a : + space before the word you need to match.
Details
(?<!:\s) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a : and a whitespace
\b - word boundary
\w+ - 1+ word chars
(?<!\bas) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a whole word as
\b - word boundary
\)* - 0 or more ) chars
! - a ! char.
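As a quick check of the pattern described above, here is a small Python sketch that runs it over some of the sample lines from the question. The pattern is carried over unchanged (Python's re module accepts these fixed-width lookbehinds); the helper name find_force_unwraps is hypothetical.

```python
import re

# Same pattern as in the answer: no ": " before the word, the word is not "as".
FORCE_UNWRAP = re.compile(r'(?<!:\s)\b\w+(?<!\bas)\b\)*!')


def find_force_unwraps(line):
    """Return every force-unwrap match found in one line of source text."""
    return FORCE_UNWRAP.findall(line)


if __name__ == "__main__":
    samples = [
        "testing.(test)))!",
        "hello as! hello",
        "Hello!",
        "Hello != World",
        "var noNetworkBanner: StatusBarNotificationBanner!",
    ]
    for line in samples:
        print(repr(line), "->", find_force_unwraps(line))
```

The declaration line and the "as!" cast produce no matches, because the two lookbehinds reject them before the "!" is ever reached.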

Print strings alongside math results in s// substitution with the /e modifier

I am trying to write a very simple one liner to find cases of:
foo N
and replace them with
foo N-Y
For example, if I had 3 files and they had the following lines in them:
foo 5
foo 3
foo 9
After the script is run with Y=4, the lines would read:
foo 1
foo -1
foo 5
I stumbled upon an existing thread that suggested using /e to run code in the replacement half of the substitute command, and I was able to subtract Y from all my matches. But I have no idea how best to print "foo" back into the file: when I separate foo and the number into two capture groups and print them back in, perl thinks I am trying to multiply them and wants an operator.
Here's where I'm at:
find . -iname "*somematch*" -exec perl -pi -e 's/(Foo *)(\d+)/$1$2-4/e' {} \;
Of course this doesn't work, "Scalar found where operator expected at -e line 1, near "$1$2." I'm at a loss as to how best to proceed without writing something much longer.
Edit: To be more specific, if I have the /e option enabled to be able to perform math in the substitution, is there a simple way to print the string in another capture group in that substitution without it trying to do math to it?
Alternatively, is there a simple way to surgically perform the substitution on only part of the pattern? I tried to combine m// and s/// to achieve the results but ended up getting nowhere.
The replacement part is treated as code under /e, so it needs to be written using legal syntax, like you'd use in a program. Writing $t$v isn't legal syntax ($1$2 in your regex).
One way to concatenate strings is $t . $v. You then also need parentheses around the subtraction: otherwise $1 and $2 are concatenated first and the subtraction is attempted on that alphanumeric string, drawing a warning. So
perl -i -pe's/(Foo *)([0-9]+)/$1.($2-4)/e'
I replaced \d with [0-9] since \d matches all kinds of "digits" from all over Unicode, which doesn't seem to be what you need.
There is another way if the math comes after the rest of the pattern, as it does in your examples
perl -i -pe's/Foo *\K([0-9]+)/$1-4/e'
Here \K acts as a form of positive lookbehind: everything matched before it is dropped from the match, so it is left untouched by the substitution. Thus only the [0-9]+ is replaced, as needed.
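For comparison, the same "keep the text, compute the number" idea can be sketched in Python, where re.sub accepts a function as the replacement and plays a role similar to /e. This is only an illustration under that assumption; the function name subtract_from_foo and its y parameter are invented here.

```python
import re


def subtract_from_foo(line, y):
    """Replace 'Foo N' with 'Foo N-y', keeping the 'Foo' prefix as it is."""
    def replacement(match):
        prefix = match.group(1)        # "Foo" plus any following spaces
        number = int(match.group(2))   # the digits, converted for arithmetic
        return prefix + str(number - y)

    # Same capture groups as the Perl one-liner: (Foo *)([0-9]+)
    return re.sub(r'(Foo *)([0-9]+)', replacement, line)


if __name__ == "__main__":
    for line in ["Foo 5", "Foo 3", "Foo 9"]:
        print(subtract_from_foo(line, 4))
```

As in the Perl version, the string part and the arithmetic are joined explicitly, so the prefix is never treated as a number.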

PCRE: Difference between .* and .*? in regular expressions

I was wondering why .* and .*? are not the same in PCRE regular expressions (for example in PHP's preg_match()). The dot . stands for any character and * means 0 to infinitely many repetitions. Why add ?, which means 0 or 1 repetitions? The two are clearly not the same, because .*? is not interchangeable with .*, but I can't see the logical difference; I always have to try what works and what doesn't in each case. I would expect .* to match anything from nothing upward and the ? to be redundant, because it says that .* can occur 0 or 1 times, but zero times is the empty string, and the empty string should be matched by .* too.
Can anyone explain to me what the exact difference is and show me a short example?
Thanks
i love wantons because they are tasty snacks
In the above string, let's say you try to match it with i.*s. The result would be the entire string, because * is a greedy match. It matches from the first instance of i until the last instance of s.
If you were to use the non-greedy modifier ?, as in i.*?s, the result would be the following:
i love wantons
This is because the non-greedy ? modifier only matches until the first instance of s.
* is a greedy match - in other words, match zero to many times, as many times as possible. *? is a minimal match - in other words, match zero to many times, as few times as possible for the rest of the pattern to make sense. Similarly, +? is a minimally-matching version of +.
Consider the string this is "quoted" and this is "also quoted". The regular expression ".*" would match one result, "quoted" and this is "also quoted"; ".*?" would match twice, "quoted" and "also quoted".
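The two examples above can be reproduced with a few lines of Python, whose re module treats * and *? the same way as PCRE in these cases. The variable names are just for illustration.

```python
import re

snack = "i love wantons because they are tasty snacks"

# Greedy: .* grabs as much as possible, so the match runs to the last "s".
greedy = re.search(r"i.*s", snack).group()

# Lazy: .*? grabs as little as possible, so the match stops at the first "s".
lazy = re.search(r"i.*?s", snack).group()

quoted = 'this is "quoted" and this is "also quoted"'

# Greedy finds one long match; lazy finds each quoted phrase separately.
greedy_quotes = re.findall(r'".*"', quoted)
lazy_quotes = re.findall(r'".*?"', quoted)

if __name__ == "__main__":
    print(greedy)
    print(lazy)
    print(greedy_quotes)
    print(lazy_quotes)
```

In both pairs the ? does not change what .* is allowed to match, only which of the possible matches the engine tries first.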

How to match a string when not followed by another with sed

Does anybody know how to match, with sed, each 'foo' instance except when it is followed by 'bar' in the following string?
'foo boo foo
foo bar foo
foo
foo'
Desired result (matched instances in bold)
'MaTcHeD boo MaTcHeD
foo bar MaTcHeD
MaTcHeD
MaTcHeD'
After a lot of testing, I found:
sed -e "s/foo\( *\)bar/FOO\1BAR/g" -e "s/foo/MaTcHeD/g" -e "s/FOO\( *\)BAR/foo\1bar/g"
It consists of first matching the 'foo bar' instances and replacing them with a temporary string (here 'FOO BAR'), then matching the remaining 'foo' instances, and finally restoring the 'foo bar' ones to their original form. (I hope I am clear.)
But this is not clean at all. I would be surprised if there were no more straightforward way to do it, even though I have not been able to find one so far.
Any hint would be appreciated. :-)
Thank you very much,
If you have to stick to sed, your idea is OK. However, if the original string already contains FOO BAR, your solution fails. You always have to choose a temp string that cannot occur in the input, which makes the script fragile.
You could improve it by using non-printable characters, rather than ordinary text, as the temp string.
For example, this line works:
sed -r 's/foo( *)bar/\x94\1\x98/g; s/foo/Matched/g;s/\x94( *)\x98/foo\1bar/g' file
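The placeholder approach can be sketched in Python to make the three steps explicit. This is an illustration, not a replacement for the sed command: the placeholder bytes \x01 and \x02 are an arbitrary choice assumed not to occur in the input, and both function names are invented. For comparison, the second function uses a negative lookahead, which Python's re module offers but the sed commands above do not use.

```python
import re

# Assumption: these control characters never appear in the input text.
OPEN, CLOSE = "\x01", "\x02"


def replace_with_placeholder(text):
    """Three-step swap, mirroring the sed command above."""
    # 1. Hide every "foo bar" behind placeholders, keeping the spaces.
    hidden = re.sub(r"foo( *)bar", OPEN + r"\1" + CLOSE, text)
    # 2. Replace the remaining "foo" instances.
    replaced = re.sub(r"foo", "MaTcHeD", hidden)
    # 3. Restore the hidden "foo bar" instances.
    return re.sub(OPEN + r"( *)" + CLOSE, r"foo\1bar", replaced)


def replace_with_lookahead(text):
    """Single pass: match "foo" only when " *bar" does not follow it."""
    return re.sub(r"foo(?! *bar)", "MaTcHeD", text)


if __name__ == "__main__":
    sample = "foo boo foo\nfoo bar foo\nfoo\nfoo"
    print(replace_with_placeholder(sample))
    print(replace_with_lookahead(sample) == replace_with_placeholder(sample))
```

Both functions give the same result on the sample string from the question; the placeholder version still depends on the chosen characters being absent from the input.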

How to match a pattern and exclude another pattern within a string in Perl

My objective is to match a pattern that doesn't contain '#' before the pattern, for example this:
array = ("# abc", "# abcd", "abc", " abc ", "abcd", "abc # foo")
I want to match "abc", " abc ", "abcd", and "abc # foo".
What regular expression do I need so as to match only patterns of 'abc' that do not contain '#'?
I tried m/[^#]+abc/g but it doesn't work.
look for regex lookbehind and lookahead.
something like this:
m/^(?<!#).*/g
may work, it's a negative look behind.
Your criterion isn't at all clear. Do you want to reject anything that has # as the first character?
print "$_\n" for grep /^[^#]/, @array;
will do that. But if you also want to check for the abc after possible leading space then you need
print "$_\n" for grep /^\s*abc/, @array;
These produce the same results from your data and select the items you say you want.
If you don't want # anywhere before abc, you were almost there. Try this: ^[^#]*abc.
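To see what ^[^#]*abc selects, here is a small Python sketch that applies it to the six strings from the question. The list contents come from the question; the names items and matches_abc_without_hash are made up for this example.

```python
import re

# "abc" preceded, from the start of the string, only by non-# characters.
PATTERN = re.compile(r"^[^#]*abc")

items = ["# abc", "# abcd", "abc", " abc ", "abcd", "abc # foo"]


def matches_abc_without_hash(strings):
    """Keep the strings in which no # appears anywhere before abc."""
    return [s for s in strings if PATTERN.search(s)]


if __name__ == "__main__":
    print(matches_abc_without_hash(items))
```

The anchor ^ matters here: without it the engine could simply start matching after the #, and "# abc" would be accepted.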
this should work, it uses negative lookahead:
^[^#]*abc(?!(.*#))
New thoughts...
I carefully read your question again and found that I hadn't understood exactly what you meant. Your intention is rather unclear, and all I can say for sure is that you DON'T want a match if the string has one or more # before abc on the same line (and I guess you don't care whether there is anything else between them). I was confused because you explicitly say you WANT to match "abc # foo", but at the same time you say, inconsistently, "to match only patterns of 'abc' that do not contain '#'".
If we want to follow this new interpretation, the correct regular expression may be:
(?<!(#.*))abc.*
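The expression does not consume any text before abc and matches from that point on, but only if nothing before it contains a #. Note that it needs variable-length lookbehind, which Perl and PCRE do not accept for an unbounded pattern such as #.*, so ^[^#]*abc is the safer form there.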