In regex, can you match a zero-length position immediately following a given character (for insertion/replacement purposes)? [duplicate] - swift

I found these things in my regex body but I haven't got a clue what I can use them for.
Does somebody have examples so I can try to understand how they work?
(?!) - negative lookahead
(?=) - positive lookahead
(?<=) - positive lookbehind
(?<!) - negative lookbehind
(?>) - atomic group

Examples
Given the string foobarbarfoo:
bar(?=bar) finds the 1st bar ("bar" which has "bar" after it)
bar(?!bar) finds the 2nd bar ("bar" which does not have "bar" after it)
(?<=foo)bar finds the 1st bar ("bar" which has "foo" before it)
(?<!foo)bar finds the 2nd bar ("bar" which does not have "foo" before it)
You can also combine them:
(?<=foo)bar(?=bar) finds the 1st bar ("bar" with "foo" before it and "bar" after it)
Definitions
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look ahead negative (?!)
Find expression A where expression B does not follow:
A(?!B)
Look behind positive (?<=)
Find expression A where expression B precedes:
(?<=B)A
Look behind negative (?<!)
Find expression A where expression B does not precede:
(?<!B)A
Atomic groups (?>)
An atomic group exits a group and throws away alternative patterns after the first matched pattern inside the group (backtracking is disabled).
(?>foo|foot)s applied to foots will match its 1st alternative foo, then fail as s does not immediately follow, and stop as backtracking is disabled
A non-atomic group will allow backtracking; if subsequent matching ahead fails, it will backtrack and use alternative patterns until a match for the entire expression is found or all possibilities are exhausted.
(foo|foot)s applied to foots will:
match its 1st alternative foo, then fail as s does not immediately follow in foots, and backtrack to its 2nd alternative;
match its 2nd alternative foot, then succeed as s immediately follows in foots, and stop.
Some resources
http://www.regular-expressions.info/lookaround.html
http://www.rexegg.com/regex-lookarounds.html
Online testers
https://regex101.com

Lookarounds are zero width assertions. They check for a regex (towards right or left of the current position - based on ahead or behind), succeeds or fails when a match is found (based on if it is positive or negative) and discards the matched portion. They don't consume any character - the matching for regex following them (if any), will start at the same cursor position.
Read regular-expression.info for more details.
Positive lookahead:
Syntax:
(?=REGEX_1)REGEX_2
Match only if REGEX_1 matches; after matching REGEX_1, the match is discarded and searching for REGEX_2 starts at the same position.
example:
(?=[a-z0-9]{4}$)[a-z]{1,2}[0-9]{2,3}
REGEX_1 is [a-z0-9]{4}$ which matches four alphanumeric chars followed by end of line.
REGEX_2 is [a-z]{1,2}[0-9]{2,3} which matches one or two letters followed by two or three digits.
REGEX_1 makes sure that the length of string is indeed 4, but doesn't consume any characters so that search for REGEX_2 starts at the same location. Now REGEX_2 makes sure that the string matches some other rules. Without look-ahead it would match strings of length three or five.
Negative lookahead
Syntax:
(?!REGEX_1)REGEX_2
Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.
example:
(?!.*\bFWORD\b)\w{10,30}$
The look-ahead part checks for the FWORD in the string and fails if it finds it. If it doesn't find FWORD, the look-ahead succeeds and the following part verifies that the string's length is between 10 and 30 and that it contains only word characters a-zA-Z0-9_
Look-behind is similar to look-ahead: it just looks behind the current cursor position. Some regex flavors like javascript doesn't support look-behind assertions. And most flavors that support it (PHP, Python etc) require that look-behind portion to have a fixed length.
Atomic groups basically discards/forgets the subsequent tokens in the group once a token matches. Check this page for examples of atomic groups

Grokking lookaround rapidly.
How to distinguish lookahead and lookbehind?
Take 2 minutes tour with me:
(?=) - positive lookahead
(?<=) - positive lookbehind
Suppose
A B C #in a line
Now, we ask B, Where are you?
B has two solutions to declare it location:
One, B has A ahead and has C bebind
Two, B is ahead(lookahead) of C and behind (lookhehind) A.
As we can see, the behind and ahead are opposite in the two solutions.
Regex is solution Two.

Why - Suppose you are playing wordle, and you've entered "ant". (Yes three-letter word, it's only an example - chill)
The answer comes back as blank, yellow, green, and you have a list of three letter words you wish to use a regex to search for? How would you do it?
To start off with you could start with the presence of the t in the third position:
[a-z]{2}t
We could improve by noting that we don't have an a
[b-z]{2}t
We could further improve by saying that the search had to have an n in it.
(?=.*n)[b-z]{2}t
or to break it down;
(?=.*n) - Look ahead, and check the match has an n in it, it may have zero or more characters before that n
[b-z]{2} - Two letters other than an 'a' in the first two positions;
t - literally a 't' in the third position

I used look behind to find the schema and look ahead negative to find tables missing with(nolock)
expression="(?<=DB\.dbo\.)\w+\s+\w+\s+(?!with\(nolock\))"
matches=re.findall(expression,sql)
for match in matches:
print(match)

Related

What's the common denominator for regex "pattern" in OpenAPI?

I'm using FastAPI, which allows pattern=re.compile("(?P<foo>[42a-z]+)...").
https://editor.swagger.io/ shows an error for this pattern.
My guess is that Python's named group syntax (?P<name>...) is different from ES2018 (?<name>...).
But, come to think of it, the idea of OpenAPI is interoperability, and some other language, esp. a compiled language may use yet another notation, or may not support named groups in the regular expressions at all.
What common denominator of regular expression syntax should I use?
OpenAPI uses json schema, and the json schema spec defines regex as "A regular expression, which SHOULD be valid according to the ECMA-262 regular expression dialect." Here is the relevant ECMA-262 section.
Of course non-javascript implementations probably won't care too much about it, and just use the default regex library of their platform. So good luck with figuring out the common denominator :)
I suggest just using as simple regexes as possible. And add some tests for it, using the library that you use in production.
Json Schema recommends a specific subset of regular expressions because the authors accept that most implementations will not support full ECMA 262 syntax:
https://json-schema.org/understanding-json-schema/reference/regular_expressions.html
A single unicode character (other than the special characters below) matches itself.
.: Matches any character except line break characters. (Be aware that what constitutes a line break character is somewhat dependent on your platform and language environment, but in practice this rarely matters).
^: Matches only at the beginning of the string.
$: Matches only at the end of the string.
(...): Group a series of regular expressions into a single regular expression.
|: Matches either the regular expression preceding or following the | symbol.
[abc]: Matches any of the characters inside the square brackets.
[a-z]: Matches the range of characters.
[^abc]: Matches any character not listed.
[^a-z]: Matches any character outside of the range.
+: Matches one or more repetitions of the preceding regular expression.
*: Matches zero or more repetitions of the preceding regular expression.
?: Matches zero or one repetitions of the preceding regular expression.
+?, *?, ??: The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired and you want to match as few characters as possible.
(?!x), (?=x): Negative and positive lookahead.
{x}: Match exactly x occurrences of the preceding regular expression.
{x,y}: Match at least x and at most y occurrences of the preceding regular expression.
{x,}: Match x occurrences or more of the preceding regular expression.
{x}?, {x,y}?, {x,}?: Lazy versions of the above expressions.
P.S. Kudos to #erosb for the idea how to find this recommendation.

Searching for two Word wildcard strings that are nested

I'm having trouble finding the proper Word wildcard string to find numbers that fit the following patterns:
"NN NN NN" or "NN NN NN.NN" (where N is any number 0-9)
The trouble is the first string is a subset of the second string. My goal is to find a single wildcard string that will capture both. Unfortunately, I need to use an operator that is zero or more occurrences for the ".NN" portion and that doesn't exist.
I'm having to do two searches, and I'm using the following patterns:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}?[!0-9]
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}.[0-9]{2}
The problem is that first pattern (in bold). It works well unless I have the number in a table or something and there is nothing after it to match (or not match, if you will) the [!0-9].
You could use a single wildcard Find:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1,4}
or:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1;4}
to capture both. Which you use depends on your regional settings.

PATINDEX incorrect result when looking for dash character "-"

This simple example shows the issue I've run into, but I don't understand why...
I'm testing for the location of the first character that is either a lower or upper case letter, a single dash, or a period in a string parameter passed to me.
These two pattern matches appear to check the same thing, and yet run this code yourself and it will print a 0 then a 3:
PRINT PATINDEX ( '%[a-z,A-Z,-,.]%', '16-82')
PRINT PATINDEX ( '%[-,a-z,A-Z,.]%', '16-82')
I don't understand why it works only if the dash character is the first one we check for.
Is this a bug? Or working as designed and I missed something... I'm using SQL Server 2016, but I don't think that matters.
A dash within a character group may play either of the two roles:
It may denote the dash itself, like it does in the expression [-abc]
It may denote the "everything inbetween" operator, like it does in the expression [a-z].
In your particular example, the character group [a-z,A-Z,-,.] denotes the following:
Everything from a to z
Comma ,
Everything from A to Z
Everything from , to , (i.e. just the comma again).
Dot .
In fact, you probably wanted to write [-a-zA-Z.]

How can I write a regex that matches words that overlap themselves?

I'm trying to match a word forwards and backwards in a string but it isn't catching all matches. For example, searching for the word "AB" in the string "AAABAAABAAA", I create and use the regex /AB|BA/, but it only matches the two "AB" substrings, and ignores the "BA" substrings.
I'm using RegexKitLite on the iPhone, but I think this is a more general regex problem (I see the same behavior in online regex testers). Nevertheless, here's the code I'm using to enumerate the matches:
[#"AAABAAABAAA" enumerateStringsMatchedByRegex:#"AB|BA" usingBlock:
^(NSInteger captureCount,
NSString * const capturedStrings[captureCount],
const NSRange capturedRanges[captureCount],
volatile BOOL * const stop) {
NSLog(#"%#", capturedStrings[0]);
}];
Output:
AB
AB
I don't know which online tester you tried, but http://www.regextester.com/ (for example) will not consider the same character for multiple matches. In this case, since ABA matches AB, the B is not considered for the BA match. It's purely a guess that RegexKitLite is implemented similarly.
Even if you don't consider the mirrored variant, the original search string may overlap with itself. For example, if you search ABCA|ACBA in ABCABCACBACBA you'll get two of four matches, searching in both directions will be the same.
It should be possible to find matches incrementally, but perhaps not with RegexKitLite
I would say, thats not possible in one turn. The regex matches for the given pattern and "eats" the matched characters. So if you search AB|BA in ABA the first found pattern is AB, then the regex continue to search on the third A.
So it is not possible to find overlapping patterns with the same regex and using the | operator.
I'm not sure how you'd accomplish exactly what I think you're asking for without reversing the string and testing twice.
However, I suppose it depends on what you're after exactly. If you're simply trying to determine if the pattern occurs in the string backwards or forwards, and not so much how it occurs, then you could do something like this:
ABA?|BAB?
The ? makes the last character optional on each side of the |. In the case of AAABAAABAAA, it'll find ABA twice. In the case of AB it'll find AB, and in the case of BA it'll find BA.
Here it is with test cases...
http://regexhero.net/tester/?id=a387ae0a-1707-4d9e-856b-ebe2176679bb

Removing a trailing Space from Regex Matched group

I'm using regular expression lib icucore via RegKit on the iPhone to
replace a pattern in a large string.
The Pattern i'm looking for looks some thing like this
| hello world (P1)|
I'm matching this pattern with the following regular expression
\|((\w*|.| )+)\((\w\d+)\)\|
This transforms the input string into 3 groups when a match is found, of which group 1(string) and group 3(string in parentheses) are of interest to me.
I'm converting these formated strings into html links so the above would be transformed into
Hello world
My problem is the trailing space in the third group. Which when the link is highlighted and underlined, results with the line extending beyond the printed characters.
While i know i could extract all the matches and process them manually, using the search and replace feature of the icu lib is a much cleaner solution, and i would rather not do that as a result.
Many thanks as always
Would the following work as an alternate regular expression?
\|((\w*|.| )+)\s+\((\w\d+)\)\| Where inserting the extra \s+ pulls the space outside the 1st grouping.
Though, given your example & regex, I'm not sure why you don't just do:
\|(.+)\s+\((\w\d+)\)\|
Which will have the same effect. However, both your original regex and my simpler one would both fail, however on:
| hello world (P1)| and on the same line | howdy world (P1)|
where it would roll it up into 1 match.
\|\s*([\w ,.-]+)\s+\((\w\d+)\)\|
will put the trailing space(s) outside the capturing group. This will of course only work if there always is a space. Can you guarantee that?
If not, use
\|\s*([\w ,.-]+(?<!\s))\s*\((\w\d+)\)\|
This uses a lookbehind assertion to make sure the capturing group ends in a non-space character.