Regex : findall with a repeated capture group - regex-group

I would like to understand why :
re.findall(r"(\d[A-Za-z]+)", "My user name is 3e4r 5fg")
returns
['3e', '4r', '5fg']
while :
re.findall(r"(\d[A-Za-z]+)+", "My user name is 3e4r 5fg")
returns
['4r', '5fg']
I tested some combinations with spaces between groups of "digit-letter" and 2 points clearly are involved in :
spaces between those groups
last "+".
I don't really understand why adding "+" after the group changes the result. Can someone explain me the steps of the process which leads to those different answers? Thank you very much.

When you put + after parenthesis you are searching for a pattern that contains one or more sub pattern with 1 digit and (one or more) letters'
so this phrase: "(\d[A-Za-z]+)+" return 2 matches:
3e4r
5fg
When you put a sub-pattern in parenthesis it means that all matches this sub-pattern will enter in a group, the groups is:
3e
5fg
The function re.findall returns only the groups (Unless there are no groups then it returns the matches ).

Related

Why is my regex putting a character in a different capture group than what I am expecting?

I have the given strings:
bar"{foo}"bar
bar{foo}bar
I am trying to get both the first and third capture group to be just bar. The second capture group should be {foo} both with and without quotes for each match respectively.
I have the following regex:
(^.*)("?\{.*\}"?)(.*$)
With these results:
Match 1
Full match. bar"{foo}"bar
Group 1. bar"
Group 2. {foo}"
Group 3. bar
Match 2
Full match. bar{foo}bar
Group 1. bar
Group 2. {foo}
Group 3. bar
Why are the " characters not both in the second group? I do not understand why it would be in the first if I am specifically calling it out in the second group. Do I need to tell the first group to ignore it or use it as a right bound?
As I was finishing writing this question I realized the problem I had. I needed to use a reluctant quantifier so the first group would not grab the " before the next group had a shot.
Here is the small correction - I added .*? (reluctant) instead of .* (greedy):
(^.*?)("?\{.*\}"?)(.*$)
Good explanation here: Greedy vs. Reluctant vs. Possessive Quantifiers

Can I write a PCRE conditional that only needs the no-match part?

I am trying to create a regular expression to determine if a string contains a number for an SQL statement. If the value is numeric, then I want to add 1 to it. If the number is not numeric, I want to return a 1. More or less. Here is the SQL:
SELECT
field,
CASE
WHEN regexp_like(field, '^ *\d*\.?\d* *$') THEN dec(field) + 1
ELSE 1
END nextnumber
FROM mytable
This actually works, and returns something like this:
INVALID 1
00000 1
00001E 1
00379 380
00013 14
99904 99905
But to push the envelope of understanding, what if I wanted to cover negative numbers, or those with a positive sign. The sign would have to immediately precede or follow the number, but not both, and I would not want to allow white space between the sign and the number.
I came up with a conditional expression with a capture group to capture the sign on the front of the number to determine if a sign was allowed on the end, but it seems a little awkward to handle given I don't really need a yes-pattern.
Here is the modified regex: ^ ([+-]?)*\d*\.?\d*(?(1) *|[+-]? *)$
This works at regex101.com, but in order for it to work I need to have something before the pipe, so I have to duplicate the next pattern in both the yes-pattern and the no-pattern.
All that background for this question: How can I avoid that duplication?
EDIT: DB2 for i uses International Components for Unicode to provide regular expression processing. It turns out that this library does not support conditionals like PRCE, so I changed the tags on this question. The answer given by Wiktor Stribiżew provides a working alternative to the conditional by using a negative lookahead.
You do not have to duplicate the end pattern, just move it outside the conditional:
^ *([+-])?\d*\.?\d*(?(1)|[+-]?) *$
See the regex demo. So, the yes-part is empty, and the no-part has an optional pattern.
You may also solve it with a mere negative lookahead:
^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$
See another regex demo. Here, ([+-](?!.*[-+]))? matches (optionally) a + or - that are not followed with any 0+ char followed with another + or -.

Microsoft graph Mail Search Strict value

I have an issue with the search parameters. I want to pass a phrase in my query. For exemple i'm looking for emails where the subject is "Test 1".
For this i'm doing a get on this ressource.
https://graph.microsoft.com/v1.0/me/messages?$search="subject:Test 1"
But the behaviour of this query is : Looking for mails that contains "Test" in the subject OR 1 in any other fields.
Refering to the KQL Syntax
A phrase (includes two or more words together, separated by spaces; however, the words must be enclosed in double quotation marks)
So, to do what i want i have to put double quotes (") around my phrase to do a strict value search. Like below
subject:"Test 1"
The problem it's at this point. Microsoft graph api already use double quotes (") after the parameters $search.
?$search="Key words"
So I can't do what is mentioned in the KQL doc.
https://graph.microsoft.com/v1.0/me/messages?$search="subject:"Test 1""
It's throwing an error :
"Syntax error: character '1' is not valid at position 15 in '\"subject:\"test 1\"\"'.",
It's an expected behaviour. I was pretty sure it will not work.
If someone has any suggestions for a solution or a workaround, I'm a buyer.
What I've already tried so far :
Use simple quote
Remove the quotes right after $select=
Remove the subject part $select="Test 1", same behaviour as the first request mentioned in this post. It will looks for emails that contain "test" or "1".
Best regards.
EDIT :
After sasfrog's anwser :
I used $filter : It works well with simple operator AND, OR.I have some errors by using the Not Operator. And btw you have to use the orderby parameter to show the result by date and add the field in filter parameters.
Exemple 1 (working, what I asked for first) :
https://graph.microsoft.com/v1.0/me/messages/?$orderby=receivedDateTime desc &$filter=receivedDateTime ge 1900-01-01T00:00:00Z AND contains(subject,'test 1')
Exemple 2 (not working)
https://graph.microsoft.com/v1.0/me/messages/?$orderby=receivedDateTime desc &$filter=(receivedDateTime ge 1900-01-01T00:00:00Z AND contains(subject,'test 1')) NOT(contains(from/EmailAddress/address,[specific address]))
EDIT 2
After some test with the filter parameters.
The NOT operator is still not working so to workaround use "ne" (non-equals)
the example 2 becomes :
https://graph.microsoft.com/v1.0/me/messages/?$orderby=receivedDateTime desc&$filter=(receivedDateTime ge 1900-01-01T00:00:00Z AND contains(subject,'test 1')) AND (from/EmailAddress/address ne [specific address])
UPDATE : OTHER SOLUTION WITH $search
Using $filter is great but it looks like it was sometimes pretty slow. So I found a workaround aboutmy issue.
It's to use AND operator between all terms.
Exemple 4 :
I'm looking for the mails where the subject is test 1;
Let value = "test 1". So you have to splice it by using space separator. And after write some code to manipulate this array, to obtain something like below.
$search="(subject:test AND subject:1)"
The brackets can be important if you use a multiple fields search. And Voilà.
Not sure if it's sufficient for what you're doing, but how about using the contains function within a filter query instead:
https://graph.microsoft.com/v1.0/me/messages?$filter=contains(subject,'Test 1')
Sounds like you're already looking at the doco but here it is just in case.
Update also, this worked for me using the search method:
https://graph.microsoft.com/v1.0/me/messages?$search="subject:'Test 1'"

Partial String Replacement using PowerShell

Problem
I am working on a script that has a user provide a specific IP address and I want to mask this IP in some fashion so that it isn't stored in the logs. My problem is, that I can easily do this when I know what the first three values of the IP typically are; however, I want to avoid storing/hard coding those values into the code to if at all possible. I also want to be able to replace the values even if the first three are unknown to me.
Examples:
10.11.12.50 would display as XX.XX.XX.50
10.12.11.23 would also display as XX.XX.XX.23
I have looked up partial string replacements, but none of the questions or problems that I found came close to doing this. I have tried doing things like:
# This ended up replacing all of the numbers
$tempString = $str -replace '[0-9]', 'X'
I know that I am partway there, but I aiming to only replace only the first 3 sets of digits so, basically every digit that is before a '.', but I haven't been able to achieve this.
Question
Is what I'm trying to do possible to achieve with PowerShell? Is there a best practice way of achieving this?
Here's an example of how you can accomplish this:
Get-Content 'File.txt' |
ForEach-Object { $_ = $_ -replace '\d{1,3}\.\d{1,3}\.\d{1,3}','xx.xx.xx' }
This example matches a digit 1-3 times, a literal period, and continues that pattern so it'll capture anything from 0-999.0-999.0-999 and replace with xx.xx.xx
TheIncorrigible1's helpful answer is an exact way of solving the problem (replacement only happens if 3 consecutive .-separated groups of 1-3 digits are matched.)
A looser, but shorter solution that replaces everything but the last .-prefixed digit group:
PS> '10.11.12.50' -replace '.+(?=\.\d+$)', 'XX.XX.XX'
XX.XX.XX.50
(?=\.\d+$) is a (positive) lookahead assertion ((?=...)) that matches the enclosed subexpression (a literal . followed by 1 or more digits (\d) at the end of the string ($)), but doesn't capture it as part of the overall match.
The net effect is that only what .+ captured - everything before the lookahead assertion's match - is replaced with 'XX.XX.XX'.
Applied to the above example input string, 10.11.12.50:
(?=\.\d+$) matches the .-prefixed digit group at the end, .50.
.+ matches everything before .50, which is 10.11.12.
Since the (?=...) part isn't captured, it is therefore not included in what is replaced, so it is only substring 10.11.12 that is replaced, namely with XX.XX.XX, yielding XX.XX.XX.50 as a result.

Tableau Filter Formula

I am trying to filter workgroup name that only contains BL or CL so I used the formula...
STARTSWITH([wrkgrp_shrt_nm], "BL") or STARTSWITH([wrkgrp_shrt_nm], "CL" )
I get the little green check, but when I hit apply it is blank and nothing pulls through
I tried another formula...
if right([wrkgrp_shrt_nm],2) = 'BL' then 1 elseif
right([wrkgrp_shrt_nm],2) = 'CL' then 1 elseif
right([wrkgrp_shrt_nm],2) then 0
end
but I am only getting an error
any suggestions?
If you want "contains", you can just call contains()
contains(wrkgrp_shrt_nm, 'BL') or contains(wrkgrp_shrt_nm, 'CL')
Does the same thing as the find() solution Fred posted, just a little easier to read in this case. I'm not sure why Fred says you cannot use IF. I use IF all the time without problems.
BTW, in case you were wondering, the square brackets around field names are optional if the field name does not include spaces or punctuation, and function names are not case sensitive.
To clarify, you're asking for "contains BL or CL", but your formula specify STARTSWITH which will be true is your field [wrkgrp_shrt_nm] starts with the string "BL" or the string "CL".
If you want "contains", you could use FIND:
FIND([wrkgrp_shrt_nm], 'BL' ) > 0 OR FIND([wrkgrp_shrt_nm], 'CL' ) > 0
You cannot use IF in a condition field, but you can use inline IF (IIF), however it's not necessary in your case.
Edit:
I can totally be wrong with my comment on IF (because I'm still new in Tableau) but I tried IF in a condition field of a filter (as the OP asked) and I can't make it work. I use IF all the time in Calculated Fields however. I'll try again...