Lex - Do the rules match only single tokens, or can they span a sequence of tokens?

In Lex, a set of rules is defined. Do the rules apply only to tokens that are delimited by spaces and the like? Or will a rule fire for any block in the line buffer that matches it?
For example, suppose I want to write a rule that recognizes a function header (e.g., void sum()) with a pattern such as "void "[a-zA-Z]+"()"; that rule would have to match across a whole line. Can rules that span more than a single token work in Lex?

When you call yylex, it finds the longest match starting at the current buffer pointer. It does not search for a token. It will match spaces if (and only if) the rule can match space characters.
Lex has no idea of what a token is other than "a sequence of characters which match a rule", so the question about whether a rule can span more than a token is meaningless. By definition, anything which matches a rule is a token.
There must always be some rule which matches, since the scanner will never match anything which doesn't start at the current buffer pointer. By default, lex adds a rule (if necessary) at the end which matches any character and echoes it to yyout. Unless you're writing a transducer, that is almost certainly not what you want, so I always recommend that you add
%option nodefault
(assuming you are actually using flex, which is by far the most common lex implementation); that will suppress the default rule and give you a warning if it is possible that no rule matches the input. Then you can define your own fallback rule, which might be something like:
.|\n { return yytext[0]; }
or which might throw a scanner error.
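To make this concrete, here is a minimal flex sketch (the function-header pattern is the one from the question; the token names and printf actions are just illustrative) showing a rule that happily spans spaces, with nodefault and an explicit fallback:
%option nodefault noyywrap
%%
"void "[a-zA-Z]+"()"   { printf("FUNC-HEADER: %s\n", yytext); }
[ \t\n]+               { /* whitespace between tokens */ }
.                      { printf("OTHER: %c\n", yytext[0]); }
%%
int main(void) { return yylex(); }
Fed the line void sum(), the first rule matches the whole stretch, space included, precisely because the pattern itself contains a space; "spanning tokens" is just a matter of writing a pattern that covers the characters.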

Related

How to properly parse a multi-character token in tree-sitter scanner function

Tree-sitter allows you to use an external scanner for those tokens that are tricky to parse or that depend on specific states like multiline strings.
The scanner takes a convenient lexer object with several methods that allow you to "scan" the document looking for the proper token characters.
Two of the key parts of this lexer are lookahead, which tells you the next character the lexer is "looking at", and advance, which moves the lexer pointer to the next character.
However, after reading the docs and several other parsers that make use of this, it's still not clear to me whether calling these methods will "affect" the overall tree-sitter parser or whether they are just local to my function invocation.
Especially tricky is trying to parse a multi-character token (more than 2 characters, in fact), because you need to "advance" the lexer, consuming the potential next characters that may be part of other tokens. One possible escape is to just return false after consuming the characters and let tree-sitter go on to the next step in the parsing, but that may skip other valid tokens that potentially depend on the characters I already consumed.
Of course I can move this parsing to the bottom of the scan function, but then other, shorter tokens may shadow this longer one and also produce an incorrect parse.
As far as I know, there is no way to "rewind" the parser to undo the "consumption" of the characters, so I am not sure how to deal with this.
The tokens that I'm trying to parse are {js| for string opening and |js} for string closing.
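Not an authoritative answer, but a sketch of how such a scanner is commonly written in C may help frame the question (the grammar name mylang, the enum, and scan_js_open are all made up for illustration). The documented escape hatch for the rewind worry is lexer->mark_end: call it where the token should end, and any advance calls made after it are pure lookahead that never becomes part of the token.
// External scanner sketch for the {js| opening token (names are hypothetical).
#include <tree_sitter/parser.h>

enum TokenType { JS_STRING_OPEN };  // must mirror the externals list in grammar.js

static bool scan_js_open(TSLexer *lexer) {
  const char *want = "{js|";
  for (const char *p = want; *p; p++) {
    if (lexer->lookahead != (int32_t)*p) return false;  // mismatch: give up
    lexer->advance(lexer, false);                       // consume one character
  }
  lexer->result_symbol = JS_STRING_OPEN;
  return true;
}

bool tree_sitter_mylang_external_scanner_scan(void *payload, TSLexer *lexer,
                                              const bool *valid_symbols) {
  if (valid_symbols[JS_STRING_OPEN] && lexer->lookahead == '{')
    return scan_js_open(lexer);
  return false;
}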

What characters are allowed in the name of a rule in Drools?

I haven't been able to find, in the Drools documentation, which characters (beyond alphabet letters) are allowed or disallowed in a rule name in Drools - does anyone know or have a reference?
The only relevant section of Drools doc I've found so far does not specify:
Each rule must have a unique name within the rule package. If you use the same rule name more than once in any DRL file in the package, the rules fail to compile. Always enclose rule names with double quotation marks (rule "rule name") to prevent possible compilation errors, especially if you use spaces in rule names.
I think I have discovered, anecdotally, that some "grouping" characters do not work in rule names (rules named with them seem not to be found, or not to be included) - or at least not in extension rules (the extended rule seems to work with grouping chars, but not its extension; example below). The grouping chars include parentheses "()", square brackets "[]", and curly braces "{}". Less-than and greater-than "<>" do work, though, so for now I'm replacing the former with the latter.
Or are there escape chars for the problematic grouping chars?
Example:
rule "(grouping chars, and commas, work here)"
when
// conditions LHS
then
end
// removing parentheses, or replacing with < >,
// from below line works
rule "(grouping chars DON'T work here)"
extends "(grouping chars, and commas, work here)"
when
then
// consequences RHS
I haven't yet tested all the other characters either way (other punctuation, for example), except that I have discovered commas "," work. But it would be nice to know ahead of time which characters are allowed.
Theoretically every identifier inside a string should work, but you might have empirically found some combination that is breaking the grammar somehow.
Thanks for the investigation. I've filed a Jira; please take a look at it.

Nesting Math Functions in Javascript

I am working on an Acrobat form that should only accept positive, whole numbers in a field.
It would be ideal if the number were simply reformatted to meet the criteria. For example, if a user types in "-1.4", it should simply change to "1".
Is it acceptable to use this as the "Validation Script" for the field:
if (event.value) event.value = Math.abs(Math.round(event.value));
It seems to work, but is it OK to nest functions like this in general, or will it lead to issues?
Rather than change the value during the validation event, prevent an invalid value from being entered in the first place. To allow only numbers with no dashes to be entered, add the following to the custom keystroke event.
event.rc = !(/[a-zA-Z\-]/.test(event.change));
You may want to modify the regex to prevent other characters as well; I just did the bare minimum. Remember that you'll need to allow the delete key, return key, and backspace, so you can't simply limit the regex to 0-9 (which would be the obvious thing to do).
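A hedged sketch of the full keystroke script, assuming the standard Acrobat event properties (event.willCommit, event.change, event.value, event.rc): deletions and backspace arrive as an empty event.change, which a * quantifier still accepts, so they keep working without being listed explicitly.
// Custom keystroke script: accept digits only.
if (event.willCommit) {
    event.rc = /^\d*$/.test(event.value);   // final check on the committed value
} else {
    event.rc = /^\d*$/.test(event.change);  // per-keystroke check; "" (deletion) passes
}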

Lua pattern matching for email address

I have the following code:
if not (email:match("[A-Za-z0-9%.]+@[%a%d]+%.[%a%d]+")) then
print(false)
end
It doesn't currently catch
"test#yahoo,ca" or "test#test1.test2,com"
as an error.
I thought by limiting the input to %a - characters and %d - digits, I would by default catch any punctuation, including commas.
But I guess I'm wrong. Or there's something else that I'm just not seeing.
A second pair of eyes would be appreciated.
In the example of "test@test1.test2,com", the pattern matches test@test1.test2 and stops because of the following ,. It's not lying; it does match, just not what you expected. To fix this, use anchors:
^[A-Za-z0-9%.]+@[%a%d]+%.[%a%d]+$
You can further simplify it to:
^[%w.]+@%w+%.%w+$
in which %w matches an alphanumeric character.
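A quick self-contained check of the anchored pattern against the question's examples:
-- The anchored pattern rejects the comma variants the original let through.
local function valid_email(s)
  return s:match("^[%w.]+@%w+%.%w+$") ~= nil
end

print(valid_email("test@yahoo.ca"))         --> true
print(valid_email("test@yahoo,ca"))         --> false
print(valid_email("test@test1.test2,com"))  --> false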
I had a hard time finding a true email validation function for Lua.
I couldn't find any that would allow some of the special cases that emails allow: things like + or quotes are actually acceptable in email addresses.
I wrote my own Lua function that could pass all the tests that are outlined in the spec for email addresses.
http://ohdoylerules.com/snippets/validate-email-with-lua
I also added a bunch of comments, so if there is some strange validation that you want to ignore, just remove the if statement for that particular check.

Force CL-Lex to read whole word

I'm using CL-Lex to implement a lexer (as input for CL-YACC) and my language has several keywords such as "let" and "in". However, while the lexer recognizes such keywords, it does so too eagerly. When it finds words such as "init", it returns IN as the first token, when it should return a CONST token for the whole word "init".
This is a simple version of the lexer:
(define-string-lexer lexer
(...)
("in" (return (values :in $#)))
("[a-z]([a-z]|[A-Z]|\_)" (return (values :const $#))))
How do I force the lexer to fully read the whole word until some whitespace appears?
This is both a correction of Kaz's errors, and a vote of confidence for the OP.
In his original response, Kaz states the order of Unix lex precedence exactly backward. From the lex documentation:
Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:
The longest match is preferred.
Among rules which matched the same number of characters, the rule given first is preferred.
In addition, Kaz is wrong to criticize the OP's solution of using Perl-regex word-boundary matching. As it happens, you are allowed (free of tormenting guilt) to match words in any way that your lexer generator will support. CL-LEX uses Perl regexes, which use \b as a convenient syntax for the more cumbersome lex approximation of:
%{
#include <stdio.h>
%}
WC [A-Za-z']
NW [^A-Za-z']
%start INW NIW
%%
{WC}   { /* word char: note we are inside a word */ BEGIN INW; REJECT; }
{NW}   { /* non-word char: note we are outside */ BEGIN NIW; REJECT; }
<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }
All things being equal, finding a way to unambiguously match his words is probably better than the alternative.
Despite Kaz wanting to slap him, the OP has answered his own question correctly, coming up with a solution that takes advantage of the flexibility of his chosen lexer generator.
Your example lexer above has two rules, both of which match a sequence of exactly two characters. Moreover, they have common matches (the language matched by the second is a strict superset of the first).
In the classic Unix lex, if two rules both match the same length of input, precedence is given to the rule which occurs first in the specification. Otherwise, the longest possible match dominates.
(Although without RTFM I can't say that this is what happens in CL-LEX, it makes a plausible hypothesis for what is happening in this case.)
It looks like you're missing a regex Kleene operator to match a longer token in the second rule.
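A sketch of what the fixed lexer might look like (untested, and the \b word boundary is the OP's own trick; CL-LEX patterns are Perl-style via CL-PPCRE):
(define-string-lexer lexer
  (...)
  ;; \b keeps "in" from matching inside "init"
  ("in\\b" (return (values :in $@)))
  ;; the trailing * is the missing Kleene operator
  ("[a-z]([a-z]|[A-Z]|\_)*" (return (values :const $@))))
With the star in place, "init" becomes a 4-character :const match that beats the 2-character "in" under the longest-match rule quoted above, so the \b guard is belt and braces.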