how to get matched group number in pcre2 - pcre

I want to use PCRE2 to match a string.
For example, I have several string patterns: "a", "b", "c", "d", and "e".
I have a long text "str" to match against.
I construct the pattern "a|b|c|d|e" and match it against "str" using pcre2_match.
How can I tell which alternative matched?
I only want the number of the matched pattern, not "a" or "b", because I don't want to compare the matched text against "a", "b", "c", "d", "e" again.

Assuming you're using the PCRE2 library directly and have access to all of its features, you have several solutions for this, from the simplest to the most involved:
Use numbered capture groups: (a)|(b)|(c)|(d)
Use named capture groups: (?<a>a)|(?<b>b)|(?<c>c)|(?<d>d)
Use marks: a(*MARK:a)|b(*MARK:b)|c(*MARK:c)|d(*MARK:d)
Use callouts: a(?C{a})|b(?C{b})|c(?C{c})|d(?C{d})
If you really can't modify your input pattern, use PCRE2_AUTO_CALLOUT and find some way to map pattern offsets to branches, then remember the last pattern offset seen before the end of the match.
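The numbered-capture-group option can be sketched outside PCRE2 as well. The snippet below is only an illustration in Python's standard re module, not PCRE2 code; the names patterns, combined, and matched_branch are made up for this example, and it assumes that none of the individual patterns contains capture groups of its own.

```python
import re

# Hypothetical pattern list; each alternative is wrapped in its own capture group.
patterns = ["a", "b", "c", "d", "e"]
combined = re.compile("|".join(f"({p})" for p in patterns))

def matched_branch(text):
    """Return the 1-based number of the alternative that matched, or None."""
    m = combined.search(text)
    # lastindex is the index of the last capture group that participated in
    # the match; with one flat group per alternative, that is the branch number.
    return m.lastindex if m else None

print(matched_branch("xxcxx"))  # branch number for "c"
print(matched_branch("zzz"))    # no alternative matches
```

With PCRE2 itself, the same idea would mean inspecting the ovector after pcre2_match to see which group is set; that part is not shown here.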

Related

Why is #regex used in task.json in Azure DevOps extension? What does it check for?

I came across this and was wondering what this means and how it works?
What's the significance of using #regex here and how does it expand?
https://github.com/microsoft/azure-pipelines-tasks/blob/master/Tasks/DownloadPackageV0/task.json
"endpointUrl": "{{endpoint.url}}/{{ #regex ([a-fA-F0-9\\-]+/)[a-fA-F0-9\\-]+ feed }}_apis/Packaging/Feeds/{{ #regex [a-fA-F0-9\\-]*/([a-fA-F0-9\\-]+) feed }}{{#if view}}#{{{view}}}{{/if}}/Packages?includeUrls=false"
I would also like to know how many packages it will return and display in the task input UI dropdown if there are thousands of packages in the feed. Is there a known limit, such as the first 100?
#regex doesn't appear to actually be documented anywhere, but it takes two space-delimited arguments. The first is a regular expression and the second is a "path expression" identifying what value to match against, in this case the value of the feed input parameter. If the regex matches the value, it returns the first capturing subexpression, otherwise it returns the empty string.
In this particular context, the feed parameter is formatted as 'projectId/feedId', where projectId and feedId are GUIDs, and projectId and the / are eliminated for organization-scoped feeds (i.e. feeds that are not inside a project). The first regex therefore extracts the project ID and inserts it into the URL, and the second regex extracts the feed ID and inserts it into the URL.
As of this writing, the default limit on the API it's calling is 1000.
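As a rough illustration of the behavior described in this answer, the two expressions can be tried in Python. The function regex_helper is a hypothetical stand-in for the undocumented #regex helper; using search semantics rather than a full match is an assumption, and the feed values are made-up placeholders rather than real GUIDs.

```python
import re

def regex_helper(pattern, value):
    """Hypothetical stand-in for #regex: first capture group, or '' if no match."""
    m = re.search(pattern, value)
    return m.group(1) if m else ""

# The patterns as they look after JSON decoding (\\- becomes \-).
PROJECT_RE = r"([a-fA-F0-9\-]+/)[a-fA-F0-9\-]+"
FEED_RE = r"[a-fA-F0-9\-]*/([a-fA-F0-9\-]+)"

project_scoped = "1111-aaaa/2222-bbbb"  # placeholder for 'projectId/feedId'
org_scoped = "2222-bbbb"                # placeholder for a bare 'feedId'

print(regex_helper(PROJECT_RE, project_scoped))  # project ID including the trailing '/'
print(regex_helper(FEED_RE, project_scoped))     # feed ID
print(repr(regex_helper(PROJECT_RE, org_scoped)))  # empty: no project part to insert
```

Note that the first expression captures the trailing slash together with the project ID, which is why the template can place `_apis` directly after it.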
Regex stands for regular expression, which lets you match a pattern rather than an exact string. You can find more info on how to use it in Azure DevOps here
This regex is very specific. In this case, the regex ([a-fA-F0-9\\-]+/)[a-fA-F0-9\\-]+ matches one or more of the following: 1) the letters a-f (lower or upper case), 2) the digits 0-9, or 3) -, followed by / and then again one or more of those characters. In the JSON source \\- decodes to \-, an escaped hyphen; if you paste the pattern verbatim into a regex tester, \\ is read as a literal backslash, so \ is accepted as well.
You can copy the regex ([a-fA-F0-9\\-]+/)[a-fA-F0-9\\-]+ into https://regexr.com/ to play around with it and see what does and doesn't match the pattern.
Examples:
it matches: a/a a/b abcdef-\/dcba
but doesn't match: /a, abcdef, this-doesn't-match
Note that the full endpoint consists of concatenations of both regular expression and hardcoded strings!

Can I Exclude Certain Pattern Matches In Rosie?

I want to match all five digit numbers except for a specific pattern. So I want to be able to match 12345 but exclude 00000. Is there a pattern which I can use in Rosie to match this set of patterns?
Yes, this is possible. Given the example above, the correct expression would be:
allButFiveZeroes = {!"00000" [0-9]{5}}
The !"00000" is referred to as negative lookahead.
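For readers more familiar with conventional regex syntax, the same negative-lookahead idea can be sketched in Python. This is an analogue, not Rosie code; the names all_but_five_zeroes and is_allowed are invented for the example, and it checks whole strings only.

```python
import re

# (?!00000) is a negative lookahead: it fails if the next five characters
# are "00000", and consumes nothing when it succeeds.
all_but_five_zeroes = re.compile(r"(?!00000)[0-9]{5}")

def is_allowed(s):
    """True if s is exactly five digits and not '00000'."""
    return all_but_five_zeroes.fullmatch(s) is not None

for candidate in ["12345", "00000", "00001", "1234"]:
    print(candidate, is_allowed(candidate))
```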

In DB2 SQL RegEx, how can a conditional replacement be done without CASE WHEN END..?

I have a DB2 v7r3 SQL SELECT statement with three instances of REGEXP_SUBSTR(), all with the same regex pattern string, each of which extract one of three groups.
I'd like to change the first SUBSTR to REGEXP_REPLACE() to do a conditional replacement if there's no match, to insert a default value similarly to the ELSE section of a CASE...END. But I can't make it work. I could easily use a CASE, but it seems more compact & efficient to use RegEx.
For example, I have descriptions of food containers sizes, in various states of completeness:
12X125
6X350
1X1500
1500ML
1000
The last two don't have the 'nnX' part at the beginning, in which case '1X' is assumed and needs to be inserted.
This is my current working pattern string:
^(?:(\d{1,3})(?:X))?((?:\d{1,4})(?:\.\d{1,3})?)(L|ML|PK|Z|)$
The groups returned are: quantity, size, and unit.
But only the first group needs the conditional replacement:
(?:(\d{1,3})(?:X))?
This RexEgg webpage describes the (?=...) operator, and it seems to be what I need, but I'm not sure. It's in the list of operators for my version of DB2, but I can't make it work. Frankly, it's a bit deeper than my regex knowledge, and I can't even make it work in my favorite online regex tester, Regex101.
So...does anyone have any idea or suggestions..? Thanks.
Try this (replace "digits not followed by X_or_digit"):
with t(s) as (values
'12X125'
, '6X350'
, '1X1500'
, '1500'
, '1125'
)
select regexp_replace(s, '^([\d]+(?![X\d]))', '1X\1')
from t;
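To see why the lookahead leaves the 'nnX...' values alone, the same pattern can be tried in Python. This is only a sketch for checking the regex logic; DB2's REGEXP_REPLACE is not involved, and the names NO_QUANTITY and add_default_quantity are made up here.

```python
import re

# Same pattern as in the SQL: leading digits NOT followed by X or another digit.
# For '12X125' the engine tries "12" (followed by X) and then "1" (followed by
# a digit); both fail the lookahead, so nothing is replaced.
NO_QUANTITY = re.compile(r"^(\d+(?![X\d]))")

def add_default_quantity(s):
    """Prefix '1X' when the description has no 'nnX' quantity part."""
    return NO_QUANTITY.sub(r"1X\1", s)

for s in ["12X125", "6X350", "1X1500", "1500ML", "1000"]:
    print(s, "->", add_default_quantity(s))
```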

Regular expression repetition: how to match expressions of variable lengths?

Essentially, here's what I want to do:
if ($expression =~ /^\d{num}\w{num}$/)
{
#doSomething
}
where num is not an identifier, but could stand for any integer greater than 0 (\d and \w were arbitrarily chosen). I want to match a string iff it contains two groups of related characters, one group immediately followed by the other, and the number of characters in each group is the same.
For this example, 123abc and 021202abcdef would match, but 43abc would not, neither would 12ab3c or 1234acbcde.
Don’t think of the string as growing from left to right, but rather from the outside in:
xy
x(xy)y
xx(xy)yy
Your regex would then be something like:
/^(x(?1)?y)$/
Where (?1) is a reference to the outer pair of parentheses. ? makes it optional in order to give a “base case” of sorts to the recursive match. This is probably the simplest example of how regexes can be used to match context-free grammars—though it’s generally easier to get right with a parser generator or parser combinator library.
Well, there's
if ($expression =~ /^(\d+)([[:alpha:]]+)$/ && length($1)==length($2))
{
#doSomething
}
A regex isn't always the best option.
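The capture-and-compare approach carries over to other languages. Below is a minimal Python sketch of it; the function name equal_halves is hypothetical, and [A-Za-z] is used as an ASCII-only stand-in for [[:alpha:]]. Python's standard re module does not support the recursive (?1) construct, so only this second approach is shown.

```python
import re

def equal_halves(s):
    """True if s is a run of digits followed by a run of letters of equal length."""
    m = re.fullmatch(r"(\d+)([A-Za-z]+)", s)
    # The regex checks the shape; the length comparison does the counting.
    return m is not None and len(m.group(1)) == len(m.group(2))

for s in ["123abc", "021202abcdef", "43abc", "12ab3c", "1234acbcde"]:
    print(s, equal_halves(s))
```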

matlab regexprep

How can I use MATLAB's regexprep for multiple expressions and replacements?
file='http:xxx/sys/tags/Rel/total';
I want to replace 'sys' with 'sys1' and 'total' with 'total1'. For a single expression and replacement it works like this:
strrep(file,'sys', 'sys1')
and I would like something like
strrep(file,'sys','sys1','total','total1')
I know this doesn't work with strrep.
Why not just issue the command twice?
file = 'http:xxx/sys/tags/Rel/total';
file = strrep(file,'sys','sys1')
file = strrep(file,'total','total1')
To solve it you need regex substitution with group references. Look for something in MATLAB's regex functions similar to this in PHP:
$string = 'http:xxx/sys/tags/Rel/total';
preg_replace('/http:(.*?)\//', 'http:${1}1/', $string);
${1} refers to the first match group, that is, whatever is matched by the parenthesized (.*?).
http:(.*?)\/ - match pattern
http:${1}1/ - replacement pattern; the second 1 is the literal character you wish to append (the first 1 is the group number)
http:xxx/sys/tags/Rel/total - input string
The trick is that whatever (.*?) matches (whether xxx, yyyy, or 1234) is inserted in place of ${1} in the replacement pattern, and the result then replaces the matched text in the input string. You are welcome to look up more examples of regex substitution in PHP.
As documented in the help page for regexprep, you can specify pairs of patterns and replacements like this:
file='http:xxx/sys/tags/Rel/total';
regexprep(file, {'sys' 'total'}, {'sys1' 'total1'})
ans =
http:xxx/sys1/tags/Rel/total1
It is even possible to use tokens, should you be able to define a match pattern for everything you want to replace:
regexprep(file, '/([st][yo][^/$]*)', '/$11')
ans =
http:xxx/sys1/tags/Rel/total1
However, care must be taken with the first approach in some circumstances, because MATLAB applies the pairs one after another. If the first pattern matches a string and replaces it with something that a later pattern then matches, that text is replaced again by the later replacement, even though the original string might not have matched the later pattern at that position.
Example:
regexprep('This\is{not}LaTeX.', {'\\' '([{}])'}, {'\\textbackslash{}' '\\$1'})
ans =
This\textbackslash\{\}is\{not\}LaTeX.
=> This\{}is{not}LaTeX.
and
regexprep('This\is{not}LaTeX.', {'([{}])' '\\'}, {'\\$1' '\\textbackslash{}'})
ans =
This\textbackslash{}is\textbackslash{}{not\textbackslash{}}LaTeX.
=> This\is\not\LaTeX.
Both results are unintended, and there seems to be no way around this with consecutive replacements instead of simultaneous ones.
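The difference between consecutive and simultaneous replacement can be illustrated in Python. This is a sketch of the general idea, not MATLAB code: the first part mimics the consecutive behavior from the first example above, and the second part does all replacements in a single pass with a lookup table (the names table, sequential, and simultaneous are invented for the example).

```python
import re

text = "This\\is{not}LaTeX."  # the string This\is{not}LaTeX.

# Consecutive: first replace the backslash, then escape the braces.
# The braces introduced by \textbackslash{} get escaped in the second step.
step1 = text.replace("\\", "\\textbackslash{}")
sequential = re.sub(r"([{}])", r"\\\1", step1)

# Simultaneous: one pass over the ORIGINAL text; each matched character is
# looked up in a table, so replacement text is never rescanned.
table = {"\\": "\\textbackslash{}", "{": "\\{", "}": "\\}"}
simultaneous = re.sub(r"[\\{}]", lambda m: table[m.group(0)], text)

print(sequential)
print(simultaneous)
```

The single-pass variant relies on a replacement callback; whether an equivalent is convenient in MATLAB is a separate question that this sketch does not answer.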