Why is my regex putting a character in a different capture group than what I am expecting? - regex-group

I have the given strings:
bar"{foo}"bar
bar{foo}bar
I am trying to get both the first and third capture group to be just bar. The second capture group should be {foo} both with and without quotes for each match respectively.
I have the following regex:
(^.*)("?\{.*\}"?)(.*$)
With these results:
Match 1
Full match. bar"{foo}"bar
Group 1. bar"
Group 2. {foo}"
Group 3. bar
Match 2
Full match. bar{foo}bar
Group 1. bar
Group 2. {foo}
Group 3. bar
Why are the " characters not both in the second group? I do not understand why it would be in the first if I am specifically calling it out in the second group. Do I need to tell the first group to ignore it or use it as a right bound?

As I was finishing writing this question I realized the problem I had. I needed to use a reluctant quantifier so the first group would not grab the " before the next group had a shot.
Here is the small correction - I added .*? (reluctant) instead of .* (greedy):
(^.*?)("?\{.*\}"?)(.*$)
Good explanation here: Greedy vs. Reluctant vs. Possessive Quantifiers
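To see the difference concretely, here is a small sketch in Python. Python is an assumption, since the question does not name a regex engine, but greedy and reluctant quantifiers behave the same way for these patterns. The helper name `groups` and the variable names are only for illustration.

```python
import re

GREEDY = r'(^.*)("?\{.*\}"?)(.*$)'
RELUCTANT = r'(^.*?)("?\{.*\}"?)(.*$)'

def groups(pattern, text):
    """Return the three capture groups, or None if there is no match."""
    m = re.match(pattern, text)
    return m.groups() if m else None

quoted = 'bar"{foo}"bar'
plain = 'bar{foo}bar'

# Greedy: group 1 grabs as much as it can, including the opening quote,
# because the "? in group 2 is happy to match nothing.
greedy_quoted = groups(GREEDY, quoted)

# Reluctant: group 1 grows one character at a time, so group 2 gets
# the first chance at the opening quote.
reluctant_quoted = groups(RELUCTANT, quoted)
reluctant_plain = groups(RELUCTANT, plain)

print(greedy_quoted)     # ('bar"', '{foo}"', 'bar')
print(reluctant_quoted)  # ('bar', '"{foo}"', 'bar')
print(reluctant_plain)   # ('bar', '{foo}', 'bar')
```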

Related

Regex : findall with a repeated capture group

I would like to understand why :
re.findall(r"(\d[A-Za-z]+)", "My user name is 3e4r 5fg")
returns
['3e', '4r', '5fg']
while :
re.findall(r"(\d[A-Za-z]+)+", "My user name is 3e4r 5fg")
returns
['4r', '5fg']
I tested some combinations, and two things are clearly involved:
the spaces between the "digit-letter" groups
the trailing "+".
I don't really understand why adding "+" after the group changes the result. Can someone explain to me the steps of the process that lead to those different answers? Thank you very much.
When you put + after the parentheses, you are searching for one or more consecutive repetitions of the sub-pattern (1 digit followed by one or more letters).
So the pattern "(\d[A-Za-z]+)+" returns 2 matches:
3e4r
5fg
When you put a sub-pattern in parentheses, whatever it matches is stored in a group. A group that repeats keeps only its last repetition, so the groups are:
4r
5fg
The function re.findall returns only the groups (unless there are no groups, in which case it returns the full matches).
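The following sketch puts the full matches and the captured groups side by side, using the string from the question. The variable names are illustrative only. It also shows a non-capturing group, which is one way to get the full matches back from re.findall.

```python
import re

text = "My user name is 3e4r 5fg"

# One group, no repetition: every digit-letters chunk is its own match.
single = re.findall(r"(\d[A-Za-z]+)", text)

# Repeated group: findall still returns group 1, which holds only
# the last repetition inside each match.
repeated = re.findall(r"(\d[A-Za-z]+)+", text)

# finditer exposes both the full match (group 0) and group 1.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r"(\d[A-Za-z]+)+", text)]

# With a non-capturing group there are no groups, so findall
# returns the full matches instead.
non_capturing = re.findall(r"(?:\d[A-Za-z]+)+", text)

print(single)         # ['3e', '4r', '5fg']
print(repeated)       # ['4r', '5fg']
print(pairs)          # [('3e4r', '4r'), ('5fg', '5fg')]
print(non_capturing)  # ['3e4r', '5fg']
```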

regexp_extract is getting spaces

I have this sample data to test regexp_extract function.
message_txt="test 9341Come Products Preferred*TEST*TEST, the mfg SYSTEM, paid18.26 toward the"
message_txt="mfg of TR tt 100 test, paid $861.82 toward your "
message_txt="TEST 0.015% , paid $1119.00toward your "
I need to extract the numeric value between "paid" and "toward", i.e. 18.26, 861.82 and 1119.00. I execute the below statement
regexp_extract(col("message_txt"),"(?i)paid\\s+(.*?)\\s+(?i)toward",1)
... but getting only spaces.
I don't know regexp_extract() but it looks to me like...
You don't want $ in your results, so you need to move that outside of the capture group.
There aren't always spaces before/after the target, so \\s needs to be optional.
There's no point in having a 2nd (?i).
It's usually better to describe exactly what's permitted in the capture group.
Try something like: "(?i)paid\\s*\\$?([\\d.]+)\\s*toward"
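A quick way to sanity-check the suggested pattern is to run it against the three sample messages. The sketch below uses Python's re module rather than Spark's regexp_extract, so it is only an approximation of what Spark's Java regex engine would do. The helper `extract_amount` is a hypothetical name, and it returns an empty string when nothing matches, mirroring what regexp_extract returns for a non-match.

```python
import re

# Same pattern as suggested above, with single backslashes because
# this is a Python raw string rather than a Java/Scala string literal.
PAID_AMOUNT = re.compile(r"(?i)paid\s*\$?([\d.]+)\s*toward")

def extract_amount(message):
    """Return the number between 'paid' and 'toward', or '' if absent."""
    m = PAID_AMOUNT.search(message)
    return m.group(1) if m else ""

messages = [
    "test 9341Come Products Preferred*TEST*TEST, the mfg SYSTEM, paid18.26 toward the",
    "mfg of TR tt 100 test, paid $861.82 toward your ",
    "TEST 0.015% , paid $1119.00toward your ",
]

amounts = [extract_amount(m) for m in messages]
print(amounts)  # ['18.26', '861.82', '1119.00']
```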

Can I write a PCRE conditional that only needs the no-match part?

I am trying to create a regular expression to determine if a string contains a number for an SQL statement. If the value is numeric, then I want to add 1 to it. If the value is not numeric, I want to return a 1. More or less. Here is the SQL:
SELECT
    field,
    CASE
        WHEN regexp_like(field, '^ *\d*\.?\d* *$') THEN dec(field) + 1
        ELSE 1
    END nextnumber
FROM mytable
This actually works, and returns something like this:
INVALID 1
00000 1
00001E 1
00379 380
00013 14
99904 99905
But to push the envelope of understanding, what if I wanted to cover negative numbers, or those with a positive sign? The sign would have to immediately precede or follow the number, but not both, and I would not want to allow white space between the sign and the number.
I came up with a conditional expression with a capture group to capture the sign on the front of the number to determine if a sign was allowed on the end, but it seems a little awkward to handle given I don't really need a yes-pattern.
Here is the modified regex: ^ *([+-])?\d*\.?\d*(?(1) *|[+-]? *)$
This works at regex101.com, but in order for it to work I need to have something before the pipe, so I have to duplicate the next pattern in both the yes-pattern and the no-pattern.
All that background for this question: How can I avoid that duplication?
EDIT: DB2 for i uses International Components for Unicode to provide regular expression processing. It turns out that this library does not support conditionals the way PCRE does, so I changed the tags on this question. The answer given by Wiktor Stribiżew provides a working alternative to the conditional by using a negative lookahead.
You do not have to duplicate the end pattern, just move it outside the conditional:
^ *([+-])?\d*\.?\d*(?(1)|[+-]?) *$
See the regex demo. So, the yes-part is empty, and the no-part has an optional pattern.
You may also solve it with a mere negative lookahead:
^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$
See another regex demo. Here, ([+-](?!.*[-+]))? optionally matches a + or - that is not followed, anywhere later in the string, by another + or -.
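The negative-lookahead version can be exercised with a few inputs. This is a minimal sketch in Python, not DB2, so it only checks the pattern's logic. The function name `is_signed_number` is hypothetical.

```python
import re

# Lookahead variant from the answer: a leading sign is accepted only
# if no other sign appears later in the string.
SIGNED_NUMBER = re.compile(r"^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$")

def is_signed_number(value):
    """True if value is a number with at most one adjacent sign."""
    return SIGNED_NUMBER.match(value) is not None

accepted = ["00379", "-5", "5-", " +12.5 "]
rejected = ["-5-", "+5-", "- 5", "INVALID", "00001E"]

for value in accepted + rejected:
    print(repr(value), is_signed_number(value))
```

Every part of the pattern is optional, so inputs such as a lone sign also pass; a stricter pattern would need at least one required digit.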

In DB2 SQL RegEx, how can a conditional replacement be done without CASE WHEN END..?

I have a DB2 v7r3 SQL SELECT statement with three instances of REGEXP_SUBSTR(), all with the same regex pattern string, each of which extracts one of three groups.
I'd like to change the first SUBSTR to REGEXP_REPLACE() to do a conditional replacement if there's no match, to insert a default value similarly to the ELSE section of a CASE...END. But I can't make it work. I could easily use a CASE, but it seems more compact & efficient to use RegEx.
For example, I have descriptions of food containers sizes, in various states of completeness:
12X125
6X350
1X1500
1500ML
1000
The last two don't have the 'nnX' part at the beginning, in which case '1X' is assumed and needs to be inserted.
This is my current working pattern string:
^(?:(\d{1,3})(?:X))?((?:\d{1,4})(?:\.\d{1,3})?)(L|ML|PK|Z|)$
The groups returned are: quantity, size, and unit.
But only the first group needs the conditional replacement:
(?:(\d{1,3})(?:X))?
This RexEgg webpage describes the (?=...) operator, and it seems to be what I need, but I'm not sure. It's in the list of operators for my version of DB2, but I can't make it work. Frankly, it's a bit deeper than my regex knowledge, and I can't even make it work in my favorite online regex tester, Regex101.
So...does anyone have any idea or suggestions..? Thanks.
Try this (replace "digits not followed by X_or_digit"):
with t(s) as (values
'12X125'
, '6X350'
, '1X1500'
, '1500'
, '1125'
)
select regexp_replace(s, '^([\d]+(?![X\d]))', '1X\1')
from t;
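The same replacement can be tried outside the database. The sketch below uses Python's re.sub as a stand-in for REGEXP_REPLACE, which is an assumption about equivalent behaviour for this particular pattern. The function name `add_default_quantity` is made up for the example.

```python
import re

def add_default_quantity(description):
    """Prefix '1X' when the leading digits are not followed by X or a digit."""
    # The lookahead (?![X\d]) rejects '12X125' at every backtracking step:
    # after '12' comes 'X', and after '1' comes the digit '2'.
    return re.sub(r"^(\d+(?![X\d]))", r"1X\1", description)

samples = ["12X125", "6X350", "1X1500", "1500ML", "1000"]
results = [add_default_quantity(s) for s in samples]
print(results)  # ['12X125', '6X350', '1X1500', '1X1500ML', '1X1000']
```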

PCRE Regex - How to return matches with multiline string looking for multiple strings in any order

I need to use Perl-compatible regex to match several strings which appear over multiple lines in a file.
The matches need to appear in any order (server servernameA.company.com followed by servernameZ.company.com followed by servernameD.company.com or any order combination of the three). Note: All matches will appear at the beginning of each line.
In my testing with grep -P, I haven't even been able to produce a match on simple string terms that appear in any order over new lines (even when using the /s and /m modifiers). I am pretty sure from reading I need a look-ahead assertion but the samples I used didn't produce a match for me even after analyzing each bit of the regex to make sure it was relevant to my scenario.
Since I need to support this in Production, I would like an answer that is simple and relatively straight-forward to interpret.
Sample Input
irrelevant_directive = 0
# Comment
server servernameA.company.com iburst
additional_directive = yes
server servernameZ.company.com iburst
server servernameD.company.com iburst
# Additional Comment
final_directive = true
Expectation
The regex should match and return the 3 lines beginning with server (that appear in any order) if and only if there is a perfect match for the strings 'serverA.company.com', 'serverZ.company.com', and 'serverD.company.com' followed by iburst. All 3 strings must be included.
Finally, if the answer (or a very similar form of the answer) can address checking for strings in any order on a single line, that would be very helpful. For example, if I have a single-line string of: preauth param audit=true silent deny=5 severe=false unlock_time=1000 time=20ms and I want to ensure the terms deny=5 and time=20ms appear in any order and if so match.
Thank you in advance for your assistance.
Regarding the main issue [for the secondary question see Casimir et Hippolyte answer] (using x modifier): https://regex101.com/r/mkxcap/5
(?:
(?<a>.*serverA\.company\.com\s+iburst.*)
|(?<z>.*serverZ\.company\.com\s+iburst.*)
|(?<d>.*serverD\.company\.com\s+iburst.*)
|[^\n]*(?:\n|$)
)++
(?(a)(?(z)(?(d)(*ACCEPT))))(*SKIP)(*F)
The matches are now all in the a, z and d capturing groups.
It's not the most efficient (it goes over each line up to three times with backtracking...), but the main takeaway is to register the matches with capturing groups and then check whether they are defined.
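The idea of "register each hit in a named group, then check that all groups were seen" can also be sketched without PCRE verbs. The Python version below is a substitute technique, not the pattern above: Python's re has no (*ACCEPT) or (*SKIP)(*F), so the final check is done in code. The function name and the sample text (which uses the serverA/serverZ/serverD hostnames from the regex) are assumptions for illustration.

```python
import re

# One named group per required line, tried against every line of the input.
SERVER_LINE = re.compile(
    r"^(?:(?P<a>.*serverA\.company\.com\s+iburst.*)"
    r"|(?P<z>.*serverZ\.company\.com\s+iburst.*)"
    r"|(?P<d>.*serverD\.company\.com\s+iburst.*))$",
    re.MULTILINE,
)

def required_server_lines(text):
    """Return matching lines in file order, or [] unless a, z and d all appear."""
    seen = set()
    lines = []
    for m in SERVER_LINE.finditer(text):
        # Record which named group matched on this line.
        seen.update(name for name, value in m.groupdict().items() if value is not None)
        lines.append(m.group(0))
    return lines if seen == {"a", "z", "d"} else []

complete = """irrelevant_directive = 0
# Comment
server serverA.company.com iburst
additional_directive = yes
server serverZ.company.com iburst
server serverD.company.com iburst
final_directive = true
"""
incomplete = complete.replace("server serverZ.company.com iburst\n", "")

print(required_server_lines(complete))
print(required_server_lines(incomplete))  # []
```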
You don't need to use the PCRE features, you can simply write in ERE:
grep -E '.*(\bdeny=5\b.*\btime=20ms\b|\btime=20ms\b.*\bdeny=5\b).*' file
The PCRE approach is different (although you can also use the previous pattern):
grep -P '^(?=.*\bdeny=5\b).*\btime=20ms\b.*' file
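The lookahead approach generalises to any number of terms by adding one lookahead per term. Here is a minimal sketch in Python, assuming every term starts and ends with a word character so that \b behaves as intended. The function name `contains_all` is hypothetical.

```python
import re

def contains_all(line, terms):
    """True if every term occurs in line as a whole word, in any order."""
    # Each lookahead scans from the start of the line without consuming
    # input, so the order of the terms in the line does not matter.
    lookaheads = "".join(rf"(?=.*\b{re.escape(term)}\b)" for term in terms)
    return re.search("^" + lookaheads, line) is not None

line = ("preauth param audit=true silent deny=5 severe=false "
        "unlock_time=1000 time=20ms")

print(contains_all(line, ["deny=5", "time=20ms"]))   # True
print(contains_all(line, ["time=20ms", "deny=5"]))   # True
print(contains_all("preauth deny=50 time=20ms", ["deny=5", "time=20ms"]))  # False
```

The word boundaries matter here: without them, deny=5 would also match inside deny=50, and time=... could match inside unlock_time=...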