Can I Exclude Certain Pattern Matches In Rosie? - rosie-pattern-language

I want to match all five digit numbers except for a specific pattern. So I want to be able to match 12345 but exclude 00000. Is there a pattern which I can use in Rosie to match this set of patterns?

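Yes, this is possible. Given the example above, the correct expression would be:
allButFiveZeroes = {!"00000" [0-9]{5}}
The !"00000" is a negative lookahead: it succeeds only if the input at that position does not start with 00000, and it consumes no input, so [0-9]{5} then matches the five digits.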
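For readers more familiar with conventional regex engines, the same idea can be sketched in Python. This is only an analogue, not Rosie code: (?!00000) plays the role of !"00000", and the names FIVE_DIGITS_NOT_ALL_ZERO and is_allowed are made up for illustration.

```python
import re

# Python analogue of the Rosie pattern: (?!00000) is a negative lookahead,
# so the five-digit match is attempted only when the input is not 00000.
FIVE_DIGITS_NOT_ALL_ZERO = re.compile(r"(?!00000)[0-9]{5}")

def is_allowed(s: str) -> bool:
    """Return True if s is exactly five digits and is not 00000."""
    return FIVE_DIGITS_NOT_ALL_ZERO.fullmatch(s) is not None

for candidate in ["12345", "00000", "00001", "1234"]:
    print(candidate, is_allowed(candidate))
```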

Related

Can I write a PCRE conditional that only needs the no-match part?

I am trying to create a regular expression to determine if a string contains a number for an SQL statement. If the value is numeric, then I want to add 1 to it. If the value is not numeric, I want to return a 1. More or less. Here is the SQL:
SELECT
field,
CASE
WHEN regexp_like(field, '^ *\d*\.?\d* *$') THEN dec(field) + 1
ELSE 1
END nextnumber
FROM mytable
This actually works, and returns something like this:
INVALID 1
00000 1
00001E 1
00379 380
00013 14
99904 99905
But to push the envelope of understanding, what if I wanted to cover negative numbers, or those with a positive sign? The sign would have to immediately precede or follow the number, but not both, and I would not want to allow white space between the sign and the number.
I came up with a conditional expression with a capture group to capture the sign on the front of the number to determine if a sign was allowed on the end, but it seems a little awkward to handle given I don't really need a yes-pattern.
Here is the modified regex: ^ *([+-])?\d*\.?\d*(?(1) *|[+-]? *)$
This works at regex101.com, but in order for it to work I need to have something before the pipe, so I have to duplicate the next pattern in both the yes-pattern and the no-pattern.
All that background for this question: How can I avoid that duplication?
EDIT: DB2 for i uses International Components for Unicode to provide regular expression processing. It turns out that this library does not support conditionals the way PCRE does, so I changed the tags on this question. The answer given by Wiktor Stribiżew provides a working alternative to the conditional by using a negative lookahead.
You do not have to duplicate the end pattern, just move it outside the conditional:
^ *([+-])?\d*\.?\d*(?(1)|[+-]?) *$
See the regex demo. So, the yes-part is empty, and the no-part has an optional pattern.
You may also solve it with a mere negative lookahead:
^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$
See another regex demo. Here, ([+-](?!.*[-+]))? optionally matches a + or - that is not followed, after zero or more other characters, by another + or -.
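The lookahead version does not depend on conditionals, so it can be tried in other engines too. Below is a minimal sketch using Python's re module (not ICU or DB2); the helper name is_signed_number and the sample inputs are assumptions for illustration only.

```python
import re

# Pattern copied from the answer: a leading sign is accepted only if
# no other sign appears later in the string.
SIGNED_NUMBER = re.compile(r"^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$")

def is_signed_number(field: str) -> bool:
    """Return True if field has a sign before or after the digits, but not both."""
    return SIGNED_NUMBER.match(field) is not None

for sample in ["00379", "-5", "5-", "-5-", "+ 5", "INVALID", "00001E"]:
    print(repr(sample), is_signed_number(sample))
```

Because every part of the pattern is optional, it also accepts an empty string or a lone sign, so the value may still need a separate check before it is converted.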

Wildcard searching between words with CRC mode in Sphinx

I use Sphinx with CRC mode and min_infix_length = 1, and I want to use wildcard searching between the characters of a keyword. Assume I have some data like this in my index files:
name
-------
mickel
mick
mickol
mickil
micknil
nickol
nickal
When I search for all records whose name starts with 'mick' and ends with 'l':
select * from all where match ('mick*l')
I expect the results should be like this:
name
-------
mickel
mickol
mickil
micknil
but nothing is returned. How can I do that?
I know that I can do this in dict=keywords mode, but I have to use CRC mode for other reasons.
I also tried the '^' and '$' operators, and they didn't work.
You can't use 'middle' wildcards with CRC. That is one of the reasons for dict=keywords: the wildcards it can support are much more flexible.
With CRC, the indexer 'precomputes' all the wildcard combinations and injects them as separate keywords in the index.
For example, with mickel as a document word and min_prefix_len=1, the indexer will create the words:
mickel
mickel*
micke*
mick*
mic*
mi*
m*
... as words in the index, so all the combinations can match. If using min_infix_len, it also has to do all the combinations at the start as well (so (word_length)^2 + 1 combinations).
... If it had to precompute all the combinations for wildcards in the middle, there would be a lot more again, particularly if it then allowed middle AND start/end combinations as well.
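The prefix expansion listed above can be sketched in a few lines of Python. This only illustrates the idea of precomputed wildcard keywords; it is not Sphinx's actual indexing code, and prefix_keywords is a hypothetical helper name.

```python
def prefix_keywords(word: str, min_prefix_len: int = 1) -> list[str]:
    """Illustrative only: the word itself plus its wildcard prefixes, longest first."""
    keywords = [word]
    for length in range(len(word), min_prefix_len - 1, -1):
        keywords.append(word[:length] + "*")
    return keywords

print(prefix_keywords("mickel"))
# ['mickel', 'mickel*', 'micke*', 'mick*', 'mic*', 'mi*', 'm*']
```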
Although having said that, you can rewrite
select * from all where match ('mick*l')
as
select * from all where match ('mick* *l')
because with min_infix_len, the start and end will be indexed as separate words. You just need to insist that both match (although I can't think how to make them both match the same word!).
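The effect of the rewritten query, and its caveat, can be emulated outside Sphinx. The sketch below uses Python's fnmatch purely as a stand-in for wildcard matching; matches_all_terms is a hypothetical helper, and the two-word document "mick nickal" is an invented example.

```python
from fnmatch import fnmatchcase

def matches_all_terms(document: str, terms: list[str]) -> bool:
    """True if every wildcard term matches at least one word of the document."""
    words = document.split()
    return all(any(fnmatchcase(w, t) for w in words) for t in terms)

names = ["mickel", "mick", "mickol", "mickil", "micknil", "nickol", "nickal"]
hits = [n for n in names if matches_all_terms(n, ["mick*", "*l"])]
print(hits)  # ['mickel', 'mickol', 'mickil', 'micknil']

# The caveat: the two terms may be satisfied by different words.
print(matches_all_terms("mick nickal", ["mick*", "*l"]))  # True
```

For single-word fields like the name column in the question, the two terms can only be satisfied by the same word, so the rewrite gives the expected rows.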

how to get matched group number in pcre2

I want to use pcre2 to match a string.
For example, I have several string patterns: "a", "b", "c", "d", and "e".
I have a long text "str" to match against.
Now I construct a pattern "a|b|c|d|e" and match it against "str" using pcre2_match.
How can I tell which pattern matched?
I just want the number of the matched pattern, not "a" or "b", as I don't want to compare the match against "a", "b", "c", "d", and "e" again.
Assuming you're using the PCRE2 library directly and have access to all of its features, you have several solutions for this, from the simplest to the most involved:
Use numbered capture groups: (a)|(b)|(c)|(d)
Use named capture groups: (?<a>a)|(?<b>b)|(?<c>c)|(?<d>d)
Use marks: a(*MARK:a)|b(*MARK:b)|c(*MARK:c)|d(*MARK:d)
Use callouts: a(?C{a})|b(?C{b})|c(?C{c})|d(?C{d})
If you really can't modify your input pattern, use PCRE2_AUTO_CALLOUT and find some way to map pattern offsets to branches, then remember the last pattern offset seen before the end of the match
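The numbered-capture-group option is the easiest one to demonstrate. The sketch below uses Python's re module rather than the PCRE2 C API, so it shows the idea only; ALTERNATIVES and matched_branch are illustrative names. With PCRE2 itself, the highest-numbered captured group should be recoverable from the return value of pcre2_match, which is not shown here.

```python
import re

ALTERNATIVES = ["a", "b", "c", "d", "e"]

# One capture group per alternative: (a)|(b)|(c)|(d)|(e)
pattern = re.compile("|".join(f"({re.escape(alt)})" for alt in ALTERNATIVES))

def matched_branch(text: str):
    """Return the 1-based number of the alternative that matched, or None."""
    m = pattern.search(text)
    # Only one group participates in a match, so lastindex identifies the branch.
    return m.lastindex if m else None

print(matched_branch("xxcxx"))  # 3
print(matched_branch("zzz"))    # None
```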

Regular expression repetition: how to match expressions of variable lengths?

Essentially, here's what I want to do:
if ($expression =~ /^\d{num}\w{num}$/)
{
#doSomething
}
where num is not an identifier, but could stand for any integer greater than 0 (\d and \w were arbitrarily chosen). I want to match a string iff it contains two groups of related characters, one group immediately followed by the other, and the number of characters in each group is the same.
For this example, 123abc and 021202abcdef would match, but 43abc would not, neither would 12ab3c or 1234acbcde.
Don’t think of the string as growing from left to right, but rather from the outside in:
xy
x(xy)y
xx(xy)yy
Your regex would then be something like:
/^(x(?1)?y)$/
Where (?1) is a reference to the outer pair of parentheses. ? makes it optional in order to give a “base case” of sorts to the recursive match. This is probably the simplest example of how regexes can be used to match context-free grammars—though it’s generally easier to get right with a parser generator or parser combinator library.
Well, there's
if ($expression =~ /^(\d+)([[:alpha:]]+)$/ && length($1)==length($2))
{
#doSomething
}
A regex isn't always the best option.
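The capture-and-compare approach carries over directly to engines without recursion support. Here is a minimal Python sketch of the same check; the function name equal_length_groups is made up, and [A-Za-z] stands in for [[:alpha:]], so it assumes ASCII letters only.

```python
import re

# Digits followed by letters; the lengths are compared outside the regex.
DIGITS_THEN_LETTERS = re.compile(r"(\d+)([A-Za-z]+)")

def equal_length_groups(expression: str) -> bool:
    """True if expression is N digits immediately followed by N letters."""
    m = DIGITS_THEN_LETTERS.fullmatch(expression)
    return m is not None and len(m.group(1)) == len(m.group(2))

for s in ["123abc", "021202abcdef", "43abc", "12ab3c", "1234acbcde"]:
    print(s, equal_length_groups(s))
```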

Regex for matching any number between 0 and 100?

I need a regex to match any number between 0 and 100, including decimal numbers. For example,
my expression should match 1, 2, 2.3, 40, 40.12, 100, 100.00 and so on. Thanks in advance.
Assuming you have to allow for a leading sign, you are best off writing
if ( /(?<![-+.\d])([-+]?\d+(?:\.\d*)?)(?![-+.\d])/ and $1 >= 0 and $1 <= 100 ) { .. }
But if you are forced into using a regex alone, then you need
if ( /(?<![-+.\d])([-+]?(?:100|\d\d?)(?:\.\d*)?)(?![-+.\d])/ ) { .. }
These patterns may well be more complex than necessary because they allow for the number appearing anywhere in the string. If you are simply checking an entire string to see if it matches the criteria, then the pattern could be much shorter.
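For the whole-string case, both approaches can be sketched briefly. The Python below is an illustration under that assumption (the input is the entire string to validate); the function names are hypothetical, and the regex-only variant deliberately rejects signs and values such as 100.01.

```python
import re

NUMBER = re.compile(r"[-+]?\d+(?:\.\d*)?")
REGEX_ONLY = re.compile(r"100(?:\.0+)?|\d{1,2}(?:\.\d+)?")

def in_range_numeric(s: str) -> bool:
    """Validate the shape with a regex, then compare the value numerically."""
    return NUMBER.fullmatch(s) is not None and 0 <= float(s) <= 100

def in_range_regex(s: str) -> bool:
    """Regex-only check; fullmatch anchors the pattern at both ends."""
    return REGEX_ONLY.fullmatch(s) is not None

for s in ["1", "2.3", "40.12", "100", "100.00", "100.01", "101", "-1"]:
    print(s, in_range_numeric(s), in_range_regex(s))
```

The numeric comparison keeps the pattern simple and leaves the range logic to ordinary arithmetic, which is usually easier to get right than encoding the range in the regex.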
This would work:
(100(\.0+)?)|([0-9]{1,2}(\.[0-9]+)?)
This matches either "100" (with an optional dot plus one or more zeroes) or one or two digits, optionally followed by a dot and at least one digit.
EDITED!!!
This problem was much more difficult than I initially realized. With some amount of effort, I have produced a new regex that is without error. Enjoy.
/(?<!\d)(?<!\.)(100(?:(?!\.)|(?:\.0*+|\.))(?=\D)|[0-9]?[0-9](?:\.|\.[0-9]*+)?(?=[\D]))/
This pattern will capture in $1