How does ANTLR decide whether terminals should be separated by whitespace? - swift

I'm writing a lexical analyzer for Swift, in Swift. I used ANTLR's grammar, but I ran into a problem: I don't understand how ANTLR decides whether terminals must be separated by whitespace.
Here's the grammar: https://github.com/antlr/grammars-v4/blob/master/swift/Swift.g4
Consider casting in Swift. It can operate on optional types (Int?, String?) and on non-optional types (Int, String). Valid examples: "as? Int", "as Int", "as?Int". Invalid example: "asInt" (it isn't a cast). I've implemented logic in which terminals in grammar rules can be separated by 0 or more WS (whitespace) symbols. But with this logic "asInt" matches a cast, because it contains "as", a type "Int", and 0 or more WS symbols in between. But it should be invalid.
Swift grammar contains these rules:
DOT : '.' ;
LCURLY : '{' ;
LPAREN : '(' ;
LBRACK : '[' ;
RCURLY : '}' ;
RPAREN : ')' ;
RBRACK : ']' ;
COMMA : ',' ;
COLON : ':' ;
SEMI : ';' ;
LT : '<' ;
GT : '>' ;
UNDERSCORE : '_' ;
BANG : '!' ;
QUESTION: '?' ;
AT : '#' ;
AND : '&' ;
SUB : '-' ;
EQUAL : '=' ;
OR : '|' ;
DIV : '/' ;
ADD : '+' ;
MUL : '*' ;
MOD : '%' ;
CARET : '^' ;
TILDE : '~' ;
It seems that all these terminals can be separated from the others by 0 WS symbols, while other terminals cannot (e.g. "as" + Identifier).
Am I right? If I'm right, the problem is solved. But there may be more complex logic.
Now if I have rules
WS : [ \n\r\t\u000B\u000C\u0000]+
a : 'str1' b
b : 'str2' c
c : '+' d
d : 'str3'
I use them as if they were these rules:
WS : [ \n\r\t\u000B\u000C\u0000]+
a : WS? 'str1' WS? 'str2' WS? '+' WS? 'str3' WS?
And I suppose they should instead be like this (I don't know, and that is the question):
WS : [ \n\r\t\u000B\u000C\u0000]+
a: 'str1' WS 'str2' WS? '+' WS? 'str3'
(notice WS is not optional between 'str1' and 'str2')
So there are two questions:
Am I right?
What did I miss?
Thanks.

Here's the ANTLR WS rule in your Swift grammar:
WS : [ \n\r\t\u000B\u000C\u0000]+ -> channel(HIDDEN) ;
The -> channel(HIDDEN) instruction tells the lexer to put these tokens on a separate channel, so the parser won't see them at all. You shouldn't litter your grammar with WS rules - it'd become unreadable.
ANTLR works in two steps: you have the lexer and the parser. The lexer produces the tokens, and the parser tries to figure out a concrete syntax tree from these tokens and the grammar.
The lexer in ANTLR works like this:
Consume characters as long as they match any lexer rule.
If several rules match the text you've consumed, use the first one that appears in the grammar.
Literal strings in the grammar (like 'as') are turned into implicit lexer rules (equivalent to TOKEN_AS: 'as'; except the name will be just 'as'). These end up first in the lexer rules list.
Example 1
Let's see the consequences of these when lexing as?Int (with a space at the end):
a... potentially matches Identifier and 'as'
as... potentially matches Identifier and 'as'
as? does not match any lexer rule
Therefore, you consume as, which will become a token. Now you have to decide which will be the token type. Both Identifier and 'as' rules match. 'as' is an implicit lexer rule, and considered to appear first in the grammar, therefore it takes precedence. The lexer emits a token with text as of type 'as'.
Next token.
?... potentially matches the QUESTION rule
?I doesn't match any rule
Therefore, you consume ? from the input and emit a token of type QUESTION with text ?.
Next token.
I... potentially matches Identifier
In... potentially matches Identifier
Int... potentially matches Identifier
Int (followed by a space) does not match anything
Therefore, you consume Int from the input and emit a token of type Identifier with text Int.
Next token.
You have a space there, it matches the WS rule.
You consume that space, and emit a WS token on the HIDDEN channel. The parser won't see this.
Example 2
Now let's see how asInt is tokenized.
a... potentially matches Identifier and 'as'
as... potentially matches Identifier and 'as'
asI... potentially matches Identifier
asIn... potentially matches Identifier
asInt... potentially matches Identifier
asInt followed by a space doesn't match any lexer rule.
Therefore, you consume asInt from the input stream, and emit an Identifier token with text asInt.
The parser
The parser stage is only interested in the token types it gets. It does not care about what text they contain. Tokens outside the default channel are ignored, which means the following inputs:
as?Int - tokens: 'as' QUESTION Identifier
as? Int - tokens: 'as' QUESTION WS Identifier
as ? Int - tokens: 'as' WS QUESTION WS Identifier
Will all result in the parser seeing the following token types: 'as' QUESTION Identifier, as WS is on a separate channel.
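If you want to see this for yourself, you can dump the token stream. Here's a minimal Java sketch, assuming you've generated the lexer from the linked Swift.g4 (so the class is called SwiftLexer; adjust the name if yours differs):
import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) {
        SwiftLexer lexer = new SwiftLexer(CharStreams.fromString("as ? Int"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        // Print every token, including the hidden WS ones the parser never sees.
        for (Token t : tokens.getTokens()) {
            System.out.printf("%s '%s' channel=%d%n",
                    lexer.getVocabulary().getDisplayName(t.getType()),
                    t.getText(), t.getChannel());
        }
    }
}
Running it on as ? Int should show the WS tokens sitting on channel 1 (HIDDEN), while 'as', QUESTION and Identifier sit on channel 0.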

Related

Including comments in the MATLAB grammar using ANTLR4

Could anybody help me with these two problems, please?
The first one is almost solved for me by the question "regular expression for multiline commentary in matlab", but I do not know how exactly I should use ^.*%\{(?:\R(?!.*%\{).*)*\R\h*%\}$, or where in the grammar, if I want to use it with ANTLR4. I have been using the MATLAB grammar from this source.
The second one is related to another type of comment in MATLAB, e.g. a = 3 % type any ascii I want.... In this case it worked when I added an alternative to the rule unary_expression in this form:
unary_expression
: postfix_expression
| unary_operator postfix_expression
| postfix_expression COMMENT
;
where COMMENT: '%' [ a-zA-Z0-9]*;, but when I use [\x00-\x7F] instead of [ a-zA-Z0-9]* (which I found here), parsing goes wrong; see the example below:
INPUT FOR PARSER: a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
ANTLR OUTPUT : Exception in thread "main" java.lang.RuntimeException: set is empty
at org.antlr.v4.runtime.misc.IntervalSet.getMaxElement(IntervalSet.java:421)
at org.antlr.v4.runtime.atn.ATNSerializer.serialize(ATNSerializer.java:169)
at org.antlr.v4.runtime.atn.ATNSerializer.getSerialized(ATNSerializer.java:601)
at org.antlr.v4.Tool.generateInterpreterData(Tool.java:745)
at org.antlr.v4.Tool.processNonCombinedGrammar(Tool.java:400)
at org.antlr.v4.Tool.process(Tool.java:361)
at org.antlr.v4.Tool.processGrammarsOnCommandLine(Tool.java:328)
at org.antlr.v4.Tool.main(Tool.java:172)
line 1:9 token recognition error at: '$'
line 1:20 token recognition error at: '"'
line 1:21 token recognition error at: '!'
line 1:22 token recognition error at: '"'
line 1:38 token recognition error at: '$'
line 1:43 token recognition error at: '"'
line 1:10 missing {',', ';', CR} at 'L'
line 1:32 missing {',', ';', CR} at '3'
Can anybody please tell me what I have done wrong? And what is the best practice for this problem? (I am not exactly a regex person...)
Let's take the simple one first.
This looks (to me) like a typical "comment everything through the end of the line" comment.
Assuming I'm correct, it's best not to think about all the valid characters that might be contained, but rather about what not to consume.
Try: COMMENT: '%' ~[\r\n]* '\r'? '\n';
(I notice that you did not include anything in your rule to terminate it at the end of the line, so I've added that).
This basically says: once I see a %, consume everything that is not a \r or \n, and stop when you see an optional \r followed by a required \n.
Generally, comments can occur just about anywhere within a grammar structure, so it's VERY useful to "shove them off to the side" rather than inject them everywhere you allow them in the grammar.
So, a short grammar:
grammar test;
test: ID EQ INT;
EQ: '=';
INT: [0-9]+;
COMMENT: '%' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN);
ID: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
You'll notice that I removed the COMMENT element from the test rule.
test file:
a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
(be sure to include the \n)
➜ grun test test -tree -tokens < test.txt
[#0,0:0='a',<ID>,1:0]
[#1,2:2='=',<'='>,1:2]
[#2,4:4='3',<INT>,1:4]
[#3,6:48='% $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa\n',<COMMENT>,channel=1,1:6]
[#4,49:48='<EOF>',<EOF>,2:0]
(test a = 3)
You still get a COMMENT token, it's just ignored when matching the parser rules.
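And if you ever need the comments back (say, for a documentation tool), they are still in the stream. A small Java sketch, assuming the lexer generated from the grammar above (grammar test produces a class named testLexer):
import org.antlr.v4.runtime.*;

public class ListComments {
    public static void main(String[] args) throws Exception {
        testLexer lexer = new testLexer(CharStreams.fromFileName("test.txt"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        // The parser skips hidden-channel tokens, but they are still here.
        for (Token t : tokens.getTokens()) {
            if (t.getChannel() == Token.HIDDEN_CHANNEL) {
                System.out.println("comment: " + t.getText());
            }
        }
    }
}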
Now for the multiline comments:
ANTLR uses a rather "regex-like" syntax for lexer rules, but don't be fooled, it's not regex (it's actually more powerful, as it can pair up nested brackets, etc.).
From a quick reading, MATLAB multiline comments start with a %{ and consume everything until a %}. This is very similar to the prior rule (it just doesn't care about \r or \n), so:
MLCOMMENT: '%{' .*? '%}' -> channel(HIDDEN);
Included in the grammar:
grammar test;
test: ID EQ INT;
EQ: '=';
INT: [0-9]+;
COMMENT: '%' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN);
MLCOMMENT: '%{' .*? '%}' -> channel(HIDDEN);
ID: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
Input file:
a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
%{
A whole bunch of stuff
on several
lines
%}
➜ grun test test -tree -tokens < test.txt
[#0,0:0='a',<ID>,1:0]
[#1,2:2='=',<'='>,1:2]
[#2,4:4='3',<INT>,1:4]
[#3,6:48='% $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa\n',<COMMENT>,channel=1,1:6]
[#4,50:106='%{\n A whole bunch of stuff\n on several\n lines\n%}',<MLCOMMENT>,channel=1,3:0]
[#5,108:107='<EOF>',<EOF>,8:0]
(test a = 3)

ANTLR4 lexer rule creates errors or conflicts on Perl grammar

I am having an issue with my Perl grammar; here are the relevant parts:
element
: element (ASTERISK_CHAR | SLASH_CHAR | PERCENT_CHAR) element
| word
;
SLASH_CHAR: '/';
REGEX_STRING
: '/' (~('/' | '\r' | '\n') | NEW_LINE)* '/'
;
fragment NEW_LINE
: '\r'? '\n'
;
If the rule REGEX_STRING is not commented out, then the following Perl doesn't parse:
$b = 1/2;
$c = 1/2;
<2021/08/20-19:24:37> <ERROR> [parsing.AntlrErrorLogger] - Unit 1: <unknown>:2:6: extraneous input '/2;\r\n$c = 1/' expecting {<EOF>, '=', '**=', '+=', '-=', '.=', '*=', '/=', '%=', CROSS_EQUAL, '&=', '|=', '^=', '&.=', '|.=', '^.=', '<<=', '>>=', '&&=', '||=', '//=', '==', '>=', '<=', '<=>', '<>', '!=', '>', '<', '~~', '++', '--', '**', '.', '+', '-', '*', '/', '%', '=~', '!~', '&&', '||', '//', '&', '&.', '|', '|.', '^', '^.', '<<', '>>', '..', '...', '?', ';', X_KEYWORD, AND, CMP, EQ, FOR, FOREACH, GE, GT, IF, ISA, LE, LT, OR, NE, UNLESS, UNTIL, WHEN, WHILE, XOR, UNSIGNED_INTEGER}
Note that it doesn't matter where the lexer rule REGEX_STRING is used; even if it is not referenced anywhere in the parser rules, just being there makes the parsing fail (so the issue is on the lexer side).
If I remove the lexer rule REGEX_STRING, then it gets parsed just fine, but then I can't parse:
$dateCalc =~ /^([0-9]{4})([0-9]{2})([0-9]{2})/
Also, I noticed that this Perl parses, so there seems to be some kind of interaction between the first and the second '/'.
$b = 12; # Removed the / between 1 and 2
$c = 1/2; # Removing the / here would work as well.
I can't seem to find how to write my regex lexer rule so that it doesn't break something else.
What am I missing? How can I parse both expressions correctly?
The basic issue here is that ANTLR4, like many other parsing frameworks, performs lexical analysis independently of the syntax; the same tokens are produced regardless of which tokens might be acceptable to the parser. So it is the lexical analyser which must decide whether a given / is a division operator or the start of a regex, a decision which can really only be made using syntactic information. (There are parsing frameworks which do not have this limitation, and thus can be used to implement scannerless parsers. These include PEG-based parsers and GLR/GLL parsers.)
There's an example of solving this lexical ambiguity, which also shows up in parsing ECMAScript, in the ANTLR4 example directory. (That's a github permalink so that the line numbers cited below continue to work.)
The basic strategy is to decide whether a / can start a regular expression based on the immediately previous token. This works in ECMAScript because the syntactic contexts in which an operator (such as / or /=) can appear are disjoint from the contexts in which an operand can appear. This will probably not translate directly into a Perl parser, but it might help show the possibilities.
Lines 780-782: The regex token itself is protected by a semantic guard:
RegularExpressionLiteral
: {isRegexPossible()}? '/' RegularExpressionBody '/' RegularExpressionFlags
;
Lines 154-182: The guard function itself is simple, but obviously required a certain amount of grammatical analysis to generate the correct test. (Note: The list of tokens has been abbreviated; see the original file for the complete list):
private boolean isRegexPossible() {
    if (this.lastToken == null) {
        return true;
    }
    switch (this.lastToken.getType()) {
        case Identifier:
        case NullLiteral:
        ...
            // After any of the tokens above, no regex literal can follow.
            return false;
        default:
            // In all other cases, a regex literal _is_ possible.
            return true;
    }
}
Lines 127-147: In order for that to work, the scanner must retain the previous token in the member variable lastToken. (Comments removed for space):
@Override
public Token nextToken() {
    Token next = super.nextToken();
    if (next.getChannel() == Token.DEFAULT_CHANNEL) {
        this.lastToken = next;
    }
    return next;
}
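To adapt this to the Perl grammar from the question, the guard's token list has to be derived from Perl's syntax instead: after an operand a / is division, after an operator or =~ it can start a regex. A rough sketch of such a guard, pairing with the nextToken override above (the token names here are assumptions for illustration, not the grammar's actual vocabulary):
private boolean isRegexPossible() {
    if (this.lastToken == null) {
        // Start of input: a '/' here can only begin a regex.
        return true;
    }
    switch (this.lastToken.getType()) {
        // After an operand, '/' must be the division operator.
        case UNSIGNED_INTEGER:   // hypothetical token names: substitute the
        case IDENTIFIER:         // ones your Perl grammar actually defines
        case RPAREN:
            return false;
        default:
            // After '=~', '!~', '=', '(', ',' and the other operators,
            // a regex literal is possible.
            return true;
    }
}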

Xtext cross-referencing

I have written the following grammar:
Model:
package = PackageDec?
greetings+=Greeting*
usage+=Usage* ;
PackageDec:
'package' name=QualifiedName ;
Greeting:
'greet' name=ID '{' ops += Operation* '}' ;
Operation:
'op' name=ID ('(' ')' '{' '}')? ;
QualifiedName:
ID ('.' ID)*;
Usage:
'use';
With the above I can write the following script.
package p1.p2
greet G1 {op f1 op f2 }
Now I need to write something like this:
package p1.p2
greet G1 {op f1 op f2 op f3}
use p1.p2.G1.f1
use p1.p2.G1
use p1.p2.G1.f3
To support that I changed the Usage rule like this:
Usage:
'use' head=[Greet|QualifiedName] =>('.' tail=[Operation])?
However, when I generate the Xtext artifacts, it complains about multiple alternatives.
Please let me know how to write a correct grammar rule for this.
This is because QualifiedName consumes dots (.). Adding ('.' ...)? makes two alternatives. Consider input
a.b.c
This could be parsed as
head="a" tail = "b.c"
head="a.b" tail = "c"
If I understand your intention of using the predicate => right, then you just have to replace
head=[Greet|QualifiedName]
with
head=[Greet]
In this case, however, you will not be able to parse references with dots.
As a solution I would recommend substituting the dot with some other character, for example a colon:
Usage:
'use' head=[Greet|QualifiedName] (':' tail=[Operation])?

Forcing gaps between words in a Marpa grammar

I'm trying to set up a grammar that requires that [\w] characters cannot appear directly adjacent to each other if they are not in the same lexeme. That is, words must be separated from each other by a space or punctuation.
Consider the following grammar:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This parses successfully. Now I want to change the grammar to force a separation between 9 and september. I thought of doing this by introducing an unused lexeme that matches [\w]+:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+ ### <== Add unused lexeme to match joined keywords
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
Unfortunately, this grammar fails with:
A lexeme is not accessible from the start symbol: word
Marpa::R2 exception at marpa.pl line 3.
Although this can be resolved by using a lexeme default statement:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
lexeme default = action => [value] ### <== Fix exception by adding lexeme default statement
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This results in the following output:
Inaccessible symbol: word
Error in SLIF parse: No lexemes accepted at line 1, column 1
* String before error:
* The error was at line 1, column 1, and at character 0x0039 '9', ...
* here: 9september
Marpa::R2 exception at marpa.pl line 16.
That is, the parse has failed due to the fact that there is no gap between 9 and september which is exactly what I want to happen. The only fly in the ointment is that there is an annoying Inaccessible symbol: word message on STDERR because the word lexeme is not used in the actual grammar.
I see that in Marpa::R2::Grammar I could have declared word as inaccessible_ok in the constructor options but I can't do that in Marpa::R2::Scanless.
I also could have done something like the following:
Rule ::= nine september
nine ~ word
september ~ word
then used a pause to run custom code that examines the actual lexeme value and returns the appropriate lexeme depending on the value.
What is the best way to construct a grammar that uses keywords or numbers and words but will disallow adjacent lexemes to be run together without white space or punctuation separating them?
Well, the obvious solution is to require some whitespace in between (on the G1 level). When we use the following grammar
:default ::= action => ::array
:start ::= Rule
Rule ::= '9' (Ws) 'september'
Ws ::= [\s]+
:discard ~ whitespace
whitespace ~ [\s]+
then 9september fails, but 9 september is parsed. Important points to note:
Lexemes can be both discarded and required, when they are both a longest token. This is why the :discard and Ws rule don't interfere with each other. Marpa doesn't mind this kind of “ambiguity”.
The Ws rule is enclosed in parens, which discards the value – to keep the resulting parse tree clean.
You do not usually want to use tricks like phantom lexemes to misguide the parser. That way lies breakage.
When every bit of whitespace is important, you might want to get rid of :discard ~ whitespace. This is meant to be used e.g. for C-like languages where whitespace traditionally does not matter.

Scala string pattern matching for mathematical symbols

I have the following code:
val z: String = tree.symbol.toString
z match {
case "method +" | "method -" | "method *" | "method ==" =>
println("no special op")
false
case "method /" | "method %" =>
println("we have the special div operation")
true
case _ =>
false
}
Is it possible to create a match for the primitive operations in Scala:
"method *".matches("(method) (+-*==)")
I know that the (+-*) signs are used as quantifiers. Is there a way to match them anyway?
Thanks from an avid Scala scholar!
Sure.
val z: String = tree.symbol.toString
val noSpecialOp = "method (?:[-+*]|==)".r
val divOp = "method [/%]".r
z match {
case noSpecialOp() =>
println("no special op")
false
case divOp() =>
println("we have the special div operation")
true
case _ =>
false
}
Things to consider:
I chose to match against single characters using [abc] instead of (?:a|b|c).
Note that - has to be the first character when using [], or it will be interpreted as a range. Likewise, ^ cannot be the first character inside [], or it will be interpreted as negation.
I'm using (?:...) instead of (...) because I don't want to extract the contents. If I did want to extract the contents -- so I'd know what was the operator, for instance, then I'd use (...). However, I'd also have to change the matching to receive the extracted content, or it would fail the match.
It is important not to forget () on the matches -- like divOp(). If you forget them, a simple assignment is made (and Scala will complain about unreachable code).
And, as I said, if you are extracting something, then you need something inside those parenthesis. For instance, "method ([%/])".r would match divOp(op), but not divOp().
Much the same as in Java. To escape a character in a regular expression, you prefix the character with \. However, backslash is also the escape character in standard Java/Scala strings, so to pass it through to the regular expression processing you must again prefix it with a backslash. You end up with something like:
scala> "+".matches("\\+")
res1 : Boolean = true
As James Iry points out in the comment below, Scala also has support for 'raw strings', enclosed in three quotation marks: """Raw string in which I don't need to escape things like \!""" This allows you to avoid the second level of escaping, that imposed by Java/Scala strings. Note that you still need to escape any characters that are treated as special by the regular expression parser:
scala> "+".matches("""\+""")
res1 : Boolean = true
Escaping characters in Strings works like in Java.
If you have larger Strings which need a lot of escaping, consider Scala's """ raw strings.
E.g. """String without needing to escape anything \n \d"""
If you put three """ around your regular expression, you don't need the extra string-level escaping anymore (though characters special to the regex engine still need a single backslash).