Is this a bug of lex? - lex

rule:
%%
AAAA print("AAAA : %s\n",yytext);
AAA print("AAA : %s\n",yytext);
AA print("AA : %s\n",yytext);
And the input is AAAAA,the output is:
AAAA : AAAA
A
Instead of :
AAA : AAA
AA : AAA
Is it a bug of lex?

No, it's adhering to the specification.
The rule is (man 1p lex)
During pattern matching, lex shall search the set of patterns for the single longest possible match. Among rules that match the same number of characters, the rule given first shall be chosen.
so it will always greedily search for the longest AAAA first. This rule is common in lexical conventions of many languages. Eg. C++:
void f(void*=0);
would fail to parse, because the characters *= are interpreted as assign-multiplication operator (which is the longest match), not * and then =.
The reason behind this rule is it can be implemented efficiently. The scanner with this rule only needs O(1) space (including input, ie. input need not fit into memory) and O(N) time. If it were to check that the rest of the input can be tokenized as well, it would need O(N) space and as much as O(N^2) time. Particularly the memory consumption was crucial in the Middle Ages of computing when all compilation was done in linear passes. And I'm sure you wouldn't appreciate O(N^2) running time when parsing today's several-hundred-thousand line source files (eg. C files including headers). Second, the scanners thus generated are very fast and help a lot when parsing.
Last, but not least, the rule is simple to understand. As an example of the opposite, consider ANTLR's rule for tokenization, which would sometimes fail to match even though a prefix of the current token is a token, and the input minus that prefix is tokenizable. For example:
TOK1 : 12
TOK2 : (13)+
would fail to match '12131312'. No such surprises happen with lex; so I suggest to take the rule as-is.

Related

In regex, can you match a zero-length position immediately following a given character (for insertion/replacement purposes)? [duplicate]

I found these things in my regex body but I haven't got a clue what I can use them for.
Does somebody have examples so I can try to understand how they work?
(?!) - negative lookahead
(?=) - positive lookahead
(?<=) - positive lookbehind
(?<!) - negative lookbehind
(?>) - atomic group
Examples
Given the string foobarbarfoo:
bar(?=bar) finds the 1st bar ("bar" which has "bar" after it)
bar(?!bar) finds the 2nd bar ("bar" which does not have "bar" after it)
(?<=foo)bar finds the 1st bar ("bar" which has "foo" before it)
(?<!foo)bar finds the 2nd bar ("bar" which does not have "foo" before it)
You can also combine them:
(?<=foo)bar(?=bar) finds the 1st bar ("bar" with "foo" before it and "bar" after it)
Definitions
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look ahead negative (?!)
Find expression A where expression B does not follow:
A(?!B)
Look behind positive (?<=)
Find expression A where expression B precedes:
(?<=B)A
Look behind negative (?<!)
Find expression A where expression B does not precede:
(?<!B)A
Atomic groups (?>)
An atomic group exits a group and throws away alternative patterns after the first matched pattern inside the group (backtracking is disabled).
(?>foo|foot)s applied to foots will match its 1st alternative foo, then fail as s does not immediately follow, and stop as backtracking is disabled
A non-atomic group will allow backtracking; if subsequent matching ahead fails, it will backtrack and use alternative patterns until a match for the entire expression is found or all possibilities are exhausted.
(foo|foot)s applied to foots will:
match its 1st alternative foo, then fail as s does not immediately follow in foots, and backtrack to its 2nd alternative;
match its 2nd alternative foot, then succeed as s immediately follows in foots, and stop.
Some resources
http://www.regular-expressions.info/lookaround.html
http://www.rexegg.com/regex-lookarounds.html
Online testers
https://regex101.com
Lookarounds are zero width assertions. They check for a regex (towards right or left of the current position - based on ahead or behind), succeeds or fails when a match is found (based on if it is positive or negative) and discards the matched portion. They don't consume any character - the matching for regex following them (if any), will start at the same cursor position.
Read regular-expression.info for more details.
Positive lookahead:
Syntax:
(?=REGEX_1)REGEX_2
Match only if REGEX_1 matches; after matching REGEX_1, the match is discarded and searching for REGEX_2 starts at the same position.
example:
(?=[a-z0-9]{4}$)[a-z]{1,2}[0-9]{2,3}
REGEX_1 is [a-z0-9]{4}$ which matches four alphanumeric chars followed by end of line.
REGEX_2 is [a-z]{1,2}[0-9]{2,3} which matches one or two letters followed by two or three digits.
REGEX_1 makes sure that the length of string is indeed 4, but doesn't consume any characters so that search for REGEX_2 starts at the same location. Now REGEX_2 makes sure that the string matches some other rules. Without look-ahead it would match strings of length three or five.
Negative lookahead
Syntax:
(?!REGEX_1)REGEX_2
Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.
example:
(?!.*\bFWORD\b)\w{10,30}$
The look-ahead part checks for the FWORD in the string and fails if it finds it. If it doesn't find FWORD, the look-ahead succeeds and the following part verifies that the string's length is between 10 and 30 and that it contains only word characters a-zA-Z0-9_
Look-behind is similar to look-ahead: it just looks behind the current cursor position. Some regex flavors like javascript doesn't support look-behind assertions. And most flavors that support it (PHP, Python etc) require that look-behind portion to have a fixed length.
Atomic groups basically discards/forgets the subsequent tokens in the group once a token matches. Check this page for examples of atomic groups
Grokking lookaround rapidly.
How to distinguish lookahead and lookbehind?
Take 2 minutes tour with me:
(?=) - positive lookahead
(?<=) - positive lookbehind
Suppose
A B C #in a line
Now, we ask B, Where are you?
B has two solutions to declare it location:
One, B has A ahead and has C bebind
Two, B is ahead(lookahead) of C and behind (lookhehind) A.
As we can see, the behind and ahead are opposite in the two solutions.
Regex is solution Two.
Why - Suppose you are playing wordle, and you've entered "ant". (Yes three-letter word, it's only an example - chill)
The answer comes back as blank, yellow, green, and you have a list of three letter words you wish to use a regex to search for? How would you do it?
To start off with you could start with the presence of the t in the third position:
[a-z]{2}t
We could improve by noting that we don't have an a
[b-z]{2}t
We could further improve by saying that the search had to have an n in it.
(?=.*n)[b-z]{2}t
or to break it down;
(?=.*n) - Look ahead, and check the match has an n in it, it may have zero or more characters before that n
[b-z]{2} - Two letters other than an 'a' in the first two positions;
t - literally a 't' in the third position
I used look behind to find the schema and look ahead negative to find tables missing with(nolock)
expression="(?<=DB\.dbo\.)\w+\s+\w+\s+(?!with\(nolock\))"
matches=re.findall(expression,sql)
for match in matches:
print(match)

How does the handling of combining characters in the Unicode Collation Algorithm work?

I maintain an open-source, pure-Python implementation of the Unicode Collation Algorithm called pyuca.
While it meets my needs in sorting Ancient Greek text (and seems to meet the needs of many other people), I'm looking to improve its coverage of rarer cases by getting it to the point where it passes the entire suite of official conformance tests.
However, 1,869 of the tests (just over 1%) fail. The first failure is at 0332 0334 which the test files suggest should get the sort key | 004A 0021 | 0002 0002 |.
pyuca, however, forms the sort key | 0021 004A | 0002 0002 |.
At first I thought this might be due to lack of support for non-starter characters (S2.1.1 thru S2.1.3 of the algorithm in the latest spec). However, my subsequent implementation of this part did nothing to change the sort key and a manual working through the algorithm on paper also fails to trigger that section which has me wondering if I'm just missing something.
The relevant steps in the algorithm are:
S2.1.1 If there are any non-starters following S, process each non-starter C.
S2.1.2 If C is not blocked from S, find if S + C has a match in the table.
S2.1.3 If there is a match, replace S by S + C, and remove C.
The key phrase is "If there is a match". In the test mentioned above that fails, there is no match for 0332 0334 and so this part of the algorithm cannot explain why the sort key should be in a different order to what my implementation produces.
Can anyone explain what part of the UCA would form a sort key like the test file suggests?
Does it work better if you shove the string into Normalization Form D first? (Step 1.)
This is an utter wild guess based on the fact that 0332 0334 is not in NFD. I haven't tried to work through the algorithm at all.

Perl part-of-speech tagging: need tag set for Lingua::EN::Tagger

So, I want to use Lingua::EN::Tagger, but I can't seem to find the domain of tags (aka the parts of speech: ie MD, NN, etc.
I found something similar to what I want here: http://engtagger.rubyforge.org/
but this is for ruby, not perl, and I'm not sure if the tags are going to be the exact same set.
Thanks in advance.
The tag set is available in the readme from Lingua0EN-Tagger-0.23 (http://cpansearch.perl.org/src/ACOBURN/Lingua-EN-Tagger-0.23/README)
Lingua::EN::Tagger
This module uses part-of-speech statistics from the Penn Treebank
to assign POS tags to English text. The tagger applies a bigram (two-word)
Hidden Markov Model to guess the appropriate POS tag for a word. That means
that the tagger will try to assign a POS tag based on the known POS tags
for a given word and the POS tag assigned to its predecessor.
The tagger tends to assume unknown words are nouns, but this behavior is
configurable.
The POS tagger can also be used to find maximal noun phrases in tagged text.
You can also use this module to extract all nouns and/or noun phrases.
TAG SET
----------------------------------------------------------------
The set of POS tags used here is a modified version of the
Penn Treebank tagset. Tags with non-letter characters have been
redefined to work better in our data structures. Also, the
``Determiner'' tag (DET) has been changed from `DT', in order to
avoid confusion with the HTML tag, <DT>.
-----------------------------------------------------------------
CC Conjunction, coordinating and, or
CD Adjective, cardinal number 3, fifteen
DET Determiner this, each, some
EX Pronoun, existential there there
FW Foreign words
IN Preposition / Conjunction for, of, although, that
JJ Adjective happy, bad
JJR Adjective, comparative happier, worse
JJS Adjective, superlative happiest, worst
LS Symbol, list item A, A.
MD Verb, modal can, could, 'll
NN Noun aircraft, data
NNP Noun, proper London, Michael
NNPS Noun, proper, plural Australians, Methodists
NNS Noun, plural women, books
PDT Determiner, prequalifier quite, all, half
POS Possessive 's, '
PRP Determiner, possessive second mine, yours
PRPS Determiner, possessive their, your
RB Adverb often, not, very, here
RBR Adverb, comparative faster
RBS Adverb, superlative fastest
RP Adverb, particle up, off, out
SYM Symbol *
TO Preposition to
UH Interjection oh, yes, mmm
VB Verb, infinitive take, live
VBD Verb, past tense took, lived
VBG Verb, gerund taking, living
VBN Verb, past/passive participle taken, lived
VBP Verb, base present form take, live
VBZ Verb, present 3SG -s form takes, lives
WDT Determiner, question which, whatever
WP Pronoun, question who, whoever
WPS Determiner, possessive & question whose
WRB Adverb, question when, how, however
PP Punctuation, sentence ender ., !, ?
PPC Punctuation, comma ,
PPD Punctuation, dollar sign $
PPL Punctuation, quotation mark left ``
PPR Punctuation, quotation mark right ''
PPS Punctuation, colon, semicolon, elipsis :, ..., -
LRB Punctuation, left bracket (, {, [
RRB Punctuation, right bracket ), }, ]

Force CL-Lex to read whole word

I'm using CL-Lex to implement a lexer (as input for CL-YACC) and my language has several keywords such as "let" and "in". However, while the lexer recognizes such keywords, it does too much. When it finds words such as "init", it returns the first token as IN, while it should return a "CONST" token for the "init" word.
This is a simple version of the lexer:
(define-string-lexer lexer
(...)
("in" (return (values :in $#)))
("[a-z]([a-z]|[A-Z]|\_)" (return (values :const $#))))
How do I force the lexer to fully read the whole word until some whitespace appears?
This is both a correction of Kaz's errors, and a vote of confidence for the OP.
In his original response, Kaz states the order of Unix lex precedence exactly backward. From the lex documentation:
Lex can handle ambiguous specifications. When more than one expression can
match the current input, Lex chooses as follows:
The longest match is preferred.
Among rules which matched the same number of characters, the rule given
first is preferred.
In addition, Kaz is wrong to criticize the OP's solution of using Perl-regex word-boundary matching. As it happens, you are allowed (free of tormenting guilt) to match words in any way that your lexer generator will support. CL-LEX uses Perl regexes, which use \b as a convenient syntax for the more cumbersome lex approximate of :
%{
#include <stdio.h>
%}
WC [A-Za-z']
NW [^A-Za-z']
%start INW NIW
{WC} { BEGIN INW; REJECT; }
{NW} { BEGIN NIW; REJECT; }
<INW>a { printf("'a' in wordn"); }
<NIW>a { printf("'a' not in wordn"); }
All things being equal, finding a way to unambiguously match his words is probably better than the alternative.
Despite Kaz wanting to slap him, the OP has answered his own question correctly, coming up with a solution that takes advantage of the flexibility of his chosen lexer generator.
Your example lexer above has two rules, both of which match a sequence of exactly two characters. Moreover, they have common matches (the language matched by the second is a strict superset of the first).
In the classic Unix lex, if two rules both match the same length of input, precedence is given to the rule which occurs first in the specification. Otherwise, the longest possible match dominates.
(Although without RTFM, I can't say that that is what happens in CL-LEX, it does make a plausible hypothesis of what is happening in this case.)
It looks like you're missing a regex Kleene operator to match a longer token in the second rule.

How can I write a regex that matches words that overlap themselves?

I'm trying to match a word forwards and backwards in a string but it isn't catching all matches. For example, searching for the word "AB" in the string "AAABAAABAAA", I create and use the regex /AB|BA/, but it only matches the two "AB" substrings, and ignores the "BA" substrings.
I'm using RegexKitLite on the iPhone, but I think this is a more general regex problem (I see the same behavior in online regex testers). Nevertheless, here's the code I'm using to enumerate the matches:
[#"AAABAAABAAA" enumerateStringsMatchedByRegex:#"AB|BA" usingBlock:
^(NSInteger captureCount,
NSString * const capturedStrings[captureCount],
const NSRange capturedRanges[captureCount],
volatile BOOL * const stop) {
NSLog(#"%#", capturedStrings[0]);
}];
Output:
AB
AB
I don't know which online tester you tried, but http://www.regextester.com/ (for example) will not consider the same character for multiple matches. In this case, since ABA matches AB, the B is not considered for the BA match. It's purely a guess that RegexKitLite is implemented similarly.
Even if you don't consider the mirrored variant, the original search string may overlap with itself. For example, if you search ABCA|ACBA in ABCABCACBACBA you'll get two of four matches, searching in both directions will be the same.
It should be possible to find matches incrementally, but perhaps not with RegexKitLite
I would say, thats not possible in one turn. The regex matches for the given pattern and "eats" the matched characters. So if you search AB|BA in ABA the first found pattern is AB, then the regex continue to search on the third A.
So it is not possible to find overlapping patterns with the same regex and using the | operator.
I'm not sure how you'd accomplish exactly what I think you're asking for without reversing the string and testing twice.
However, I suppose it depends on what you're after exactly. If you're simply trying to determine if the pattern occurs in the string backwards or forwards, and not so much how it occurs, then you could do something like this:
ABA?|BAB?
The ? makes the last character optional on each side of the |. In the case of AAABAAABAAA, it'll find ABA twice. In the case of AB it'll find AB, and in the case of BA it'll find BA.
Here it is with test cases...
http://regexhero.net/tester/?id=a387ae0a-1707-4d9e-856b-ebe2176679bb