how to enable the shortest match rule in flex (lexer)? - match

By default flex uses the longest match rule.
Is there any way to override this behavior to make it match the shortest sequence?
Thank you

This page in the Flex manual says that it doesn't have any non-greedy operators because it is a scanner rather than a parser, and suggests regular expressions could be used to add the missing functionality.
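One common way to follow that suggestion is to rewrite the pattern so that the longest match and the shortest match coincide, for example by using a negated character class instead of a wildcard. The sketch below illustrates the idea with Python's re module rather than flex itself; the sample text is made up, and a flex rule would need flex's own quoting and escaping.

```python
import re

text = 'say "hello" and "bye"'

# Greedy wildcard: the match runs to the LAST quote, which is the kind of
# result a longest-match rule produces.
greedy = re.search(r'".*"', text).group(0)

# Negated class: the body cannot contain a quote, so the match has to stop at
# the first closing quote. The longest match is now also the shortest one.
shortest = re.search(r'"[^"]*"', text).group(0)

print(greedy)    # "hello" and "bye"
print(shortest)  # "hello"
```

No non-greedy operator is needed here, because the pattern itself rules out the longer match.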

Related

Add prefix to all class names in visual studio code

I want to add a prefix to each class name in my code, so that class="text-black border-b p-2 border-gray-100 flex justify-between" becomes class="tw-text-black tw-border-b tw-p-2 tw-border-gray-100 tw-flex tw-justify-between"
Try this find and replace:
Find: (?<=class="[^"]*)([^\s"]+)
Replace: tw-$1
See regex101 demo.
That uses a non-fixed-length positive lookbehind (the * can match any number of characters), so the regex works in VS Code's Find widget but not in the Search across files input. It therefore has to be applied one file at a time.
Then the idea is to get all blocks that do not contain a space or " as a group $1.
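If the change has to be applied across many files, the same idea can be scripted. The following is a minimal sketch in Python, whose re module does not support variable-length lookbehind, so it matches the whole class attribute and rewrites it in a callback instead. The function name add_prefix and the sample markup are hypothetical.

```python
import re

def add_prefix(html: str, prefix: str = "tw-") -> str:
    """Prefix every class name found inside class="..." attributes."""
    def fix_attr(m: re.Match) -> str:
        # Split the attribute value on whitespace and prefix each name.
        names = m.group(1).split()
        return 'class="' + " ".join(prefix + n for n in names) + '"'
    return re.sub(r'class="([^"]*)"', fix_attr, html)

src = '<div class="text-black border-b p-2">x</div>'
result = add_prefix(src)
print(result)  # <div class="tw-text-black tw-border-b tw-p-2">x</div>
```

Note that this sketch is not idempotent (running it twice yields tw-tw- prefixes) and it collapses runs of whitespace between class names to a single space.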

What's the common denominator for regex "pattern" in OpenAPI?

I'm using FastAPI, which allows pattern=re.compile("(?P<foo>[42a-z]+)...").
https://editor.swagger.io/ shows an error for this pattern.
My guess is that Python's named group syntax (?P<name>...) is different from ES2018 (?<name>...).
But, come to think of it, the idea of OpenAPI is interoperability, and some other language, esp. a compiled language may use yet another notation, or may not support named groups in the regular expressions at all.
What common denominator of regular expression syntax should I use?
OpenAPI uses JSON Schema, and the JSON Schema spec defines a regex as "A regular expression, which SHOULD be valid according to the ECMA-262 regular expression dialect." Here is the relevant ECMA-262 section.
Of course, non-JavaScript implementations probably won't care too much about that and will just use the default regex library of their platform. So good luck with figuring out the common denominator :)
I suggest using the simplest regexes possible, and adding some tests for them that run against the library you use in production.
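A test along those lines could look like the sketch below, assuming the production library is Python's re. It compares the Python-only named group from the question with a plain group that both dialects accept. The sample values are made up, and the elided "..." part of the original pattern is left out.

```python
import re

# Python-only spelling from the question (only the group itself is used).
python_only = r"(?P<foo>[42a-z]+)"

# Same character class with a plain group, valid in Python and in ECMA-262.
portable = r"([42a-z]+)"

samples = ["abc42", "4242", "ABC", ""]

def matches(pattern: str, value: str) -> bool:
    # JSON Schema patterns are not implicitly anchored, so use search().
    return re.search(pattern, value) is not None

results = [(s, matches(python_only, s), matches(portable, s)) for s in samples]

# Both spellings should accept and reject exactly the same values.
for value, named, plain in results:
    assert named == plain, value
print(results)
```

If the group name is not needed by the application, dropping it is the simplest way to keep the pattern acceptable to both Python and ECMA-262 tooling.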
JSON Schema recommends a specific subset of regular expressions because the authors accept that most implementations will not support the full ECMA-262 syntax:
https://json-schema.org/understanding-json-schema/reference/regular_expressions.html
A single unicode character (other than the special characters below) matches itself.
.: Matches any character except line break characters. (Be aware that what constitutes a line break character is somewhat dependent on your platform and language environment, but in practice this rarely matters).
^: Matches only at the beginning of the string.
$: Matches only at the end of the string.
(...): Group a series of regular expressions into a single regular expression.
|: Matches either the regular expression preceding or following the | symbol.
[abc]: Matches any of the characters inside the square brackets.
[a-z]: Matches the range of characters.
[^abc]: Matches any character not listed.
[^a-z]: Matches any character outside of the range.
+: Matches one or more repetitions of the preceding regular expression.
*: Matches zero or more repetitions of the preceding regular expression.
?: Matches zero or one repetitions of the preceding regular expression.
+?, *?, ??: The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired and you want to match as few characters as possible.
(?!x), (?=x): Negative and positive lookahead.
{x}: Match exactly x occurrences of the preceding regular expression.
{x,y}: Match at least x and at most y occurrences of the preceding regular expression.
{x,}: Match x occurrences or more of the preceding regular expression.
{x}?, {x,y}?, {x,}?: Lazy versions of the above expressions.
P.S. Kudos to @erosb for the idea how to find this recommendation.

Parsing commas in Sphinx

I have a field which can have multiple commas which are actually critical to some regex pattern matching.
Commas however do not index and adding them to the charset breaks it (for a # of technical reasons on how sphinx searches/indexes).
I cannot change the char prior to indexing (e.g. COMMA) so I have some anchor for the pattern and can't properly pattern extract w/o.
My only thought is to use exceptions to map ,=>COMMA (this won't process large text fields, so it is not a huge issue). I'm curious whether this will work and what the pipeline is, i.e. what it could break that I'm not considering. AFAIK exceptions happen first and do not obey the charset, so this might in fact work. I know I can test it, but I am more concerned with what this might break, given my rudimentary knowledge of the Sphinx indexing pipeline.
Quoting the question: "Commas however do not index and adding them to the charset breaks it (for a # of technical reasons on how sphinx searches/indexes)."
Just use U+2C to add comma to your charset_table, e.g.
charset_table=a..z,A..Z,0..9,U+2C
You might also want to add it to blend_chars instead to consider a comma both a word separator and not.

How to generalize special entities

We use Apache UIMA Ruta for processing our documents. The input documents contain all kinds of patterns that we try to recognize and translate into a hierarchy of annotations.
One of the things we will do with the result is to decorate the input text with links. For that it's important that we know the original position information of the found annotations.
Some of the annotations are based on value lists. We use MarkTable to resolve them.
The problem we have is that an input document can contain different kinds of special entities. For example, the document can contain words with entities like &amp; or the numeric character reference for 𝌆. These can also occur in words or sentences that will be looked up in the value lists.
We are searching for a way to generalize (convert) all such variants to a normal "plain text" form, so that we don't have to add every variant with special entities to the value lists.
Pre-processing the document and replacing them all (for example with the HTMLConverter engine) is AFAIK not a good option, because that will also change the position info. &amp; should match &, but still be seen as having size 5.
I tried to use the replace action, which adds an extra "replacement" attribute to the annotation. When I add an interceptor (aspect) to the getCoveredText of the annotation class, and return the replacement instead of the real text if available, the matching succeeds. But this gives problems if the replacement text contains spaces (the end position is still equal to that of the original text / first RutaBasic).
Any suggestions how we can solve this?
I solved this issue by building a pre- and post processor for the content.
In the pre-processor I replace text fragments with other text. For example, &amp; and &AMP; will be replaced by a normal &. While pre-processing I store the details of each replacement in a replacement object, which is added to an ordered list. A replacement object contains the original text and the difference in length (&amp; is 4 characters longer than a single &).
After annotating with RUTA (and other annotators) I correct all the found annotation values (text) to the original value and I fix the position information (begin and end) of the annotations, so that they match the original content. I use the list of replacement details for this process.
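The bookkeeping described above can be sketched roughly as follows. This is not the actual UIMA/Ruta code: it is a plain-Python illustration, and the names Replacement, ENTITIES, preprocess and to_original are hypothetical. It assumes that an annotation never begins or ends in the middle of a replaced fragment.

```python
import re
from dataclasses import dataclass

@dataclass
class Replacement:
    clean_begin: int  # offset of the replacement in the cleaned text
    original: str     # text as it appeared in the input
    replaced: str     # text substituted for it

    @property
    def delta(self) -> int:
        # How many characters longer the original is than the replacement.
        return len(self.original) - len(self.replaced)

ENTITIES = {"&amp;": "&", "&AMP;": "&"}  # hypothetical, minimal table
ENTITY_RE = re.compile("|".join(map(re.escape, ENTITIES)))

def preprocess(text: str):
    """Return the cleaned text plus an ordered list of replacements."""
    parts, log, last, clean_len = [], [], 0, 0
    for m in ENTITY_RE.finditer(text):
        parts.append(text[last:m.start()])
        clean_len += m.start() - last
        rep = ENTITIES[m.group(0)]
        log.append(Replacement(clean_len, m.group(0), rep))
        parts.append(rep)
        clean_len += len(rep)
        last = m.end()
    parts.append(text[last:])
    return "".join(parts), log

def to_original(begin: int, end: int, log):
    """Map a [begin, end) span in the cleaned text to original offsets."""
    orig_begin, orig_end = begin, end
    for r in log:
        if r.clean_begin < begin:
            orig_begin += r.delta  # replacement lies before the span
        if r.clean_begin < end:
            orig_end += r.delta    # replacement lies before or inside the span
    return orig_begin, orig_end

original = "Tom &amp; Jerry &AMP; Co"
clean, log = preprocess(original)
b, e = to_original(0, 11, log)   # span of "Tom & Jerry" in the cleaned text
print(clean)                     # Tom & Jerry & Co
print(original[b:e])             # Tom &amp; Jerry
```

The post-processing step would then call something like to_original for every annotation and read the covered text from the original document instead of the cleaned one.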

Regular Expression for number.(space), objective-c

I have an NSArray of lines (objective-c iphone), and I'm trying to find the line which starts with a number, followed by a dot and a space, but can have any number of spaces (including none) before it, and have any text following it eg:
1. random text
2. text random
3.
what regular expression would I use to get this? (I'm trying to learn it, and I needed the above expression anyway, so I thought I'd use it as an example)
With C#:
@"^ *[0-9]+\. "
It doesn't check for the presence of something after the ., so this is legal:
1.(space)
If you delete the @ and escape the \ it should work with other languages (it is pretty "down-to-earth" as RegExpes go)
I may suggest (Perl-compatible regexp):
^\s*\d+\.\s
At the beginning of a line:
Any number (0-n) of spaces
One or more digits
A dot
A space
Something like
^\s*\d+\.
But it depends on the language.
/^\s*[0-9]+\.\s+/
would be my guess providing you don't have any space before the number
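To see how these patterns behave on the example lines, here is a small check written in Python rather than Objective-C. The sample lines are made up, and the pattern is the "optional whitespace, digits, dot, space" variant suggested above. In an Objective-C string literal the backslashes would need to be doubled.

```python
import re

# Optional leading whitespace, one or more digits, a dot, then one space.
NUMBERED = re.compile(r"^\s*\d+\. ")

lines = [
    "1. random text",
    "   2. text random",
    "3. ",          # dot followed by a space and nothing else: still matches
    "3.",           # no space after the dot: rejected
    "no number",
    "4.text",       # no space after the dot: rejected
]

numbered = [line for line in lines if NUMBERED.match(line)]
print(numbered)  # ['1. random text', '   2. text random', '3. ']
```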