What's the common denominator for regex "pattern" in OpenAPI? - openapi

I'm using FastAPI, which allows pattern=re.compile("(?P<foo>[42a-z]+)...").
https://editor.swagger.io/ shows an error for this pattern.
My guess is that Python's named group syntax (?P<name>...) is different from ES2018 (?<name>...).
But, come to think of it, the idea of OpenAPI is interoperability, and some other language, esp. a compiled language may use yet another notation, or may not support named groups in the regular expressions at all.
What common denominator of regular expression syntax should I use?

OpenAPI uses JSON Schema, and the JSON Schema spec defines a regex as "A regular expression, which SHOULD be valid according to the ECMA-262 regular expression dialect." Here is the relevant ECMA-262 section.
Of course, non-JavaScript implementations probably won't care too much about it and will just use the default regex library of their platform. So good luck with figuring out the common denominator :)
I suggest keeping the regexes as simple as possible and adding tests for them, using the library that you use in production.
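A minimal sketch of such a test, assuming the production library is Python's re module. The pattern and the sample values are hypothetical; the pattern deliberately avoids named groups so that it stays within syntax shared by Python and ECMA-262.

```python
import re

# Hypothetical pattern: plain group, character classes and a counted
# repetition only, so it should read the same in most regex engines.
PATTERN = r"^([42a-z]+)-[0-9]{2}$"

# Hypothetical sample values with the expected outcome for each.
CASES = [
    ("abc-12", True),
    ("42-07", True),
    ("ABC-12", False),
    ("abc-1", False),
]

def matches(pattern: str, value: str) -> bool:
    # JSON Schema patterns are not implicitly anchored, so search() is
    # the closest equivalent; the anchors live in the pattern itself.
    return re.search(pattern, value) is not None

def failing_cases() -> list:
    # Return every sample whose actual result differs from the expectation.
    return [(v, exp) for v, exp in CASES if matches(PATTERN, v) != exp]
```

Running the same cases through each consumer's regex library would be the real interoperability check; this only covers the Python side.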

JSON Schema recommends a specific subset of regular expressions because the authors accept that most implementations will not support the full ECMA-262 syntax:
https://json-schema.org/understanding-json-schema/reference/regular_expressions.html
A single unicode character (other than the special characters below) matches itself.
.: Matches any character except line break characters. (Be aware that what constitutes a line break character is somewhat dependent on your platform and language environment, but in practice this rarely matters).
^: Matches only at the beginning of the string.
$: Matches only at the end of the string.
(...): Group a series of regular expressions into a single regular expression.
|: Matches either the regular expression preceding or following the | symbol.
[abc]: Matches any of the characters inside the square brackets.
[a-z]: Matches the range of characters.
[^abc]: Matches any character not listed.
[^a-z]: Matches any character outside of the range.
+: Matches one or more repetitions of the preceding regular expression.
*: Matches zero or more repetitions of the preceding regular expression.
?: Matches zero or one repetitions of the preceding regular expression.
+?, *?, ??: The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired and you want to match as few characters as possible.
(?!x), (?=x): Negative and positive lookahead.
{x}: Match exactly x occurrences of the preceding regular expression.
{x,y}: Match at least x and at most y occurrences of the preceding regular expression.
{x,}: Match x occurrences or more of the preceding regular expression.
{x}?, {x,y}?, {x,}?: Lazy versions of the above expressions.
P.S. Kudos to #erosb for the idea of how to find this recommendation.
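One way to apply the list above is a rough lint that flags constructs outside the recommended subset before a pattern goes into the schema. This is only a heuristic sketch: the function name and the table of checks are made up for illustration, the table is not exhaustive, and it does not account for escaped parentheses or for text inside character classes.

```python
import re

# Constructs that at least one common engine accepts but that are not in
# the subset listed above. Labels and checks are illustrative assumptions.
NON_PORTABLE = {
    "Python named group (?P<name>...)": re.compile(r"\(\?P<"),
    "Python named backreference (?P=name)": re.compile(r"\(\?P="),
    "ES2018 named group (?<name>...)": re.compile(r"\(\?<[^=!]"),
    "lookbehind (?<=...) or (?<!...)": re.compile(r"\(\?<[=!]"),
    "inline flags such as (?i)": re.compile(r"\(\?[aiLmsux]+[):-]"),
    "\\A or \\Z anchors": re.compile(r"\\[AZ]"),
}

def non_portable_constructs(pattern: str) -> list:
    # Return the label of every flagged construct found in the pattern.
    return [label for label, rx in NON_PORTABLE.items() if rx.search(pattern)]
```

For the pattern from the question, non_portable_constructs(r"(?P<foo>[42a-z]+)") reports the Python named group, while a pattern built only from the listed constructs comes back with an empty list.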

Related

What characters are allowed in the name of a rule in Drools?

I haven't been able to find in Drools documentation, which characters (beyond alphabet letters) are allowed/disallowed in a rule name in Drools - does anyone know or have a reference?
The only relevant section of Drools doc I've found so far does not specify:
Each rule must have a unique name within the rule package. If you use the same rule name more than once in any DRL file in the package, the rules fail to compile. Always enclose rule names with double quotation marks (rule "rule name") to prevent possible compilation errors, especially if you use spaces in rule names.
I think I have discovered, anecdotally, that some "grouping" characters do not work in rule names (rules named with them seem not to be found or included), or at least not in extension rules (the extended rule seems to work with grouping chars, but not its extension; example below). The grouping chars include parentheses "()", square brackets "[]", and curly braces "{}". Less-than and greater-than "<>" do work, so for now I'm replacing the former with the latter.
Or are there escape chars for the problematic grouping chars?
Example:
rule "(grouping chars, and commas, work here)"
when
// conditions LHS
then
end
// removing parentheses, or replacing with < >,
// from below line works
rule "(grouping chars DON'T work here)"
extends "(grouping chars, and commas, work here)"
when
then
// consequences RHS
end
I haven't yet tested other characters either way (for example, other punctuation), except that I have discovered commas "," work. But it would be nice to know ahead of time what characters are allowed.
Theoretically every identifier inside a string should work, but you might have empirically found some combination that is breaking the grammar somehow.
Thanks for the investigation. I've filed a Jira issue; please take a look at it.

Searching for two Word wildcard strings that are nested

I'm having trouble finding the proper Word wildcard string to find numbers that fit the following patterns:
"NN NN NN" or "NN NN NN.NN" (where N is any number 0-9)
The trouble is the first string is a subset of the second string. My goal is to find a single wildcard string that will capture both. Unfortunately, I need to use an operator that is zero or more occurrences for the ".NN" portion and that doesn't exist.
I'm having to do two searches, and I'm using the following patterns:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}?[!0-9]
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}.[0-9]{2}
The problem is the first pattern. It works well unless the number is in a table or similar and there is nothing after it to match (or not match, if you will) the [!0-9].
You could use a single wildcard Find:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1,4}
or:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1;4}
to capture both. Which you use depends on your regional settings.
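To see why the single pattern covers both forms, here is a rough Python approximation of the Word wildcard above. It is only an analogue: Word's ^s (nonbreaking space) is written as \u00a0, and Python's re is not Word's wildcard engine.

```python
import re

# Approximation of [0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1,4}, with
# Word's ^s (nonbreaking space) spelled as \u00a0.
WILDCARD_LIKE = re.compile(r"[0-9]{2}[\u00a0 ][0-9]{2}[\u00a0 ][0-9][0-9.]{1,4}")

def find_numbers(text: str) -> list:
    # Return every substring matched by the approximated pattern.
    return [m.group(0) for m in WILDCARD_LIKE.finditer(text)]
```

The trailing [0-9.]{1,4} absorbs either the last digit alone or the last digit plus ".NN". In this approximation it would also absorb a sentence-ending period directly after "NN NN NN", so the results may need a quick review.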

Regex match invalid pattern ios swift 4 [duplicate]

How can I rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match a hyphen along with the existing characters?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
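The four cases above can be checked quickly. The sketch below uses Python's re module, which is an assumption on my part (the question is about another platform), but the hyphen rules shown here are the same in most engines. The helper name is made up for illustration.

```python
import re

def chars_matched(char_class: str, candidates: str = "abcd-") -> str:
    # Return the candidate characters that the character class matches.
    rx = re.compile(char_class)
    return "".join(ch for ch in candidates if rx.fullmatch(ch))

# The class from the question with a trailing, unescaped hyphen added.
QUESTION_CLASS = r"[a-zA-Z0-9!$* \t\r\n-]"
```

chars_matched("[ab-d]") yields "abcd" (a range, no hyphen), while chars_matched("[abc-]") and chars_matched("[-abc]") both yield "abc-".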
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
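To illustrate what those general categories cover, here is a sketch using Python's standard unicodedata module. Python's built-in re does not support \p{...}, so this checks the categories directly instead of using the pattern itself; the function name is hypothetical.

```python
import unicodedata

# The three literal characters kept from the original class.
EXTRA = set("!$*")

def in_set(ch: str) -> bool:
    # Mirror [\p{L}\p{Nd}\p{Pd}!$*]: any letter, decimal digit,
    # dash punctuation, or one of the three literals.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat in ("Nd", "Pd") or ch in EXTRA
```

With this, in_set("é"), in_set("\u2013") (EN DASH) and in_set("-") are all true, while in_set("#") is false.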
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all equivalent. A hyphen between two ranges is treated as a literal character. The regex [a-z0-9-+()]+ also allows a hyphen.
Use "\p{Pd}" without quotes to match any type of hyphen. The '-' character is just one type of hyphen which also happens to be a special character in Regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

Include slashes and parentheses in tokens

Background
I have search indexes containing Greek characters. Many people don't know how to type Greek so they enter something called "beta-code". Beta-code can be converted into Greek. For example, beta-code "NO/MOU" would be converted to "νόμου". Characters such as a slash or parenthesis are used to indicate an accent.
Desired Behavior
I want users to be able to search using either beta-code or text in the Greek script. I figured out that the Whoosh Variations class provides the mechanism I need and it almost solves my problem.
Problem
The Variations class works well except when a slash or a parenthesis is used to indicate an accent in a user's query. The problem is that the query is parsed such that the special characters used to denote the accent cause the words to be split up. For example, a search for "NO/MOU" results in the Variations class being asked to find variations of "no" and "mou" instead of "NO/MOU".
Question
Is there a way to influence how the query is parsed such that slashes and parentheses are included in the search words (i.e. that a search for "NO/MOU" results in a search for a token of "NO/MOU" instead of "no" and "mou")?
The search parser uses a Tokenizer class for breaking up the search string into individual terms. Whoosh will use the class that is associated with the schema. For example, in the case below, the SimpleAnalyzer() will be used when searching the "content" field.
Schema( verse_id = NUMERIC(unique=True, stored=True),
content = TEXT(analyzer=SimpleAnalyzer()) )
By default, the SimpleAnalyzer() uses the following regular expression to tokenize search terms: "\w+(\.?\w+)*"
To use a different regular expression, assign the first argument to the SimpleAnalyzer to another regular expression. For example, to include beta-code characters (slashes, parentheses, etc.) in tokens, use the following SimpleAnalyzer:
SimpleAnalyzer( rcompile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*") )
Searches will now allow terms to include the special beta-code characters and the Variations class will be able to convert the term to the unicode version.
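The effect of the two expressions can be compared without Whoosh at all. The sketch below uses only Python's re module and is an approximation: it shows how each pattern splits the raw text and ignores everything else the analyzer does (such as lowercasing). The helper name is made up for illustration.

```python
import re

# The default expression and the beta-code-aware one from above.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")
BETACODE_PATTERN = re.compile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*")

def split_terms(pattern, text: str) -> list:
    # Each regex match becomes one term, as in a regex-based tokenizer.
    return [m.group(0) for m in pattern.finditer(text)]
```

split_terms(DEFAULT_PATTERN, "NO/MOU") gives ["NO", "MOU"], whereas split_terms(BETACODE_PATTERN, "NO/MOU") keeps the single term ["NO/MOU"].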

Force CL-Lex to read whole word

I'm using CL-Lex to implement a lexer (as input for CL-YACC) and my language has several keywords such as "let" and "in". However, while the lexer recognizes such keywords, it does too much. When it finds words such as "init", it returns the first token as IN, while it should return a "CONST" token for the "init" word.
This is a simple version of the lexer:
(define-string-lexer lexer
(...)
("in" (return (values :in $#)))
("[a-z]([a-z]|[A-Z]|\_)" (return (values :const $#))))
How do I force the lexer to fully read the whole word until some whitespace appears?
This is both a correction of Kaz's errors, and a vote of confidence for the OP.
In his original response, Kaz states the order of Unix lex precedence exactly backward. From the lex documentation:
Lex can handle ambiguous specifications. When more than one expression can
match the current input, Lex chooses as follows:
The longest match is preferred.
Among rules which matched the same number of characters, the rule given
first is preferred.
In addition, Kaz is wrong to criticize the OP's solution of using Perl-regex word-boundary matching. As it happens, you are allowed (free of tormenting guilt) to match words in any way that your lexer generator will support. CL-LEX uses Perl regexes, which use \b as a convenient syntax for the more cumbersome lex approximation:
%{
#include <stdio.h>
%}
WC [A-Za-z']
NW [^A-Za-z']
%start INW NIW
%%
{WC} { BEGIN INW; REJECT; }
{NW} { BEGIN NIW; REJECT; }
<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }
All things being equal, finding a way to unambiguously match his words is probably better than the alternative.
Despite Kaz wanting to slap him, the OP has answered his own question correctly, coming up with a solution that takes advantage of the flexibility of his chosen lexer generator.
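The word-boundary idea can be tried out in Python, whose regex syntax also supports \b. This is only an analogue of the OP's approach, not CL-LEX itself; the ordered alternation stands in for the rule list, and the token names follow the question.

```python
import re

# Keyword first, guarded by \b so it cannot match a prefix of a longer
# word; the general identifier rule comes second.
TOKEN = re.compile(r"(?P<IN>in\b)|(?P<CONST>[a-z][a-zA-Z_]*)")

def tokenize_first_match(text: str) -> list:
    # lastgroup is the name of the alternative that matched.
    return [(m.lastgroup, m.group(0)) for m in TOKEN.finditer(text)]
```

For the input "in init" this yields an IN token for "in" and a CONST token for "init", because "in" inside "init" is not followed by a word boundary.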
Your example lexer above has two rules, both of which match a sequence of exactly two characters. Moreover, they have common matches (the language matched by the second is a strict superset of the first).
In the classic Unix lex, if two rules both match the same length of input, precedence is given to the rule which occurs first in the specification. Otherwise, the longest possible match dominates.
(Although without RTFM, I can't say that that is what happens in CL-LEX, it does make a plausible hypothesis of what is happening in this case.)
It looks like you're missing a regex Kleene operator to match a longer token in the second rule.
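The classic lex behaviour described here (longest match wins, ties go to the earlier rule) can be sketched in a few lines of Python. This illustrates the lex rule only and is not a claim about how CL-LEX resolves ambiguity; the second pattern adds the Kleene star that the question's rule lacks.

```python
import re

# Rules in specification order: (token name, compiled pattern).
RULES = [
    ("IN", re.compile(r"in")),
    ("CONST", re.compile(r"[a-z]([a-z]|[A-Z]|_)*")),
]

def tokenize(text: str) -> list:
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace separates tokens and is skipped
            continue
        best = None  # (token name, lexeme) of the best match so far
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if m and m.end() > pos and (best is None or len(m.group(0)) > len(best[1])):
                best = (name, m.group(0))
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens
```

With the star in place, "init" is matched in full by the second rule and beats the two-character "in", while a bare "in" ties between both rules and goes to the keyword rule because it is listed first.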