ANTLR4 lexer rule creates errors or conflicts on perl grammar - perl

I am having an issue with my Perl grammar. Here are the relevant parts:
element
: element (ASTERISK_CHAR | SLASH_CHAR | PERCENT_CHAR) element
| word
;
SLASH_CHAR: '/';
REGEX_STRING
: '/' (~('/' | '\r' | '\n') | NEW_LINE)* '/'
;
fragment NEW_LINE
: '\r'? '\n'
;
If the rule REGEX_STRING is not commented out, then the following Perl doesn't parse:
$b = 1/2;
$c = 1/2;
<2021/08/20-19:24:37> <ERROR> [parsing.AntlrErrorLogger] - Unit 1: <unknown>:2:6: extraneous input '/2;\r\n$c = 1/' expecting {<EOF>, '=', '**=', '+=', '-=', '.=', '*=', '/=', '%=', CROSS_EQUAL, '&=', '|=', '^=', '&.=', '|.=', '^.=', '<<=', '>>=', '&&=', '||=', '//=', '==', '>=', '<=', '<=>', '<>', '!=', '>', '<', '~~', '++', '--', '**', '.', '+', '-', '*', '/', '%', '=~', '!~', '&&', '||', '//', '&', '&.', '|', '|.', '^', '^.', '<<', '>>', '..', '...', '?', ';', X_KEYWORD, AND, CMP, EQ, FOR, FOREACH, GE, GT, IF, ISA, LE, LT, OR, NE, UNLESS, UNTIL, WHEN, WHILE, XOR, UNSIGNED_INTEGER}
Note that it doesn't matter where the lexer rule REGEX_STRING is used; even if it is not referenced anywhere in the parser rules, just being there makes the parsing fail (so the issue is on the lexer side).
If I remove the lexer rule REGEX_STRING, then it gets parsed just fine, but then I can't parse:
$dateCalc =~ /^([0-9]{4})([0-9]{2})([0-9]{2})/
Also, I noticed that the following Perl parses, so there seems to be some kind of interaction between the first and the second '/'.
$b = 12; # Removed the / between 1 and 2
$c = 1/2; # Removing the / here would work as well.
I can't seem to find a way to write my regex lexer rule without breaking something.
What am I missing? How can I parse both expressions correctly?

The basic issue here is that ANTLR4, like many other parsing frameworks, performs lexical analysis independently of the syntax; the same tokens are produced regardless of which tokens might be acceptable to the parser. So it is the lexical analyser which must decide whether a given / is a division operator or the start of a regex, a decision which can really only be made using syntactic information. (There are parsing frameworks which do not have this limitation, and which can thus be used to implement scannerless parsers. These include PEG-based and GLL/GLR parsers.)
There's an example of solving this lexical ambiguity, which also shows up in parsing ECMAScript, in the ANTLR4 example directory. (That's a github permalink so that the line numbers cited below continue to work.)
The basic strategy is to decide whether a / can start a regular expression based on the immediately previous token. This works in ECMAScript because the syntactic contexts in which an operator (such as / or /=) can appear are disjoint from the contexts in which an operand can appear. This will probably not translate directly into a Perl parser, but it might help show the possibilities.
Lines 780-782: The regex token itself is protected by a semantic guard:
RegularExpressionLiteral
: {isRegexPossible()}? '/' RegularExpressionBody '/' RegularExpressionFlags
;
Lines 154-182: The guard function itself is simple, but obviously required a certain amount of grammatical analysis to generate the correct test. (Note: The list of tokens has been abbreviated; see the original file for the complete list):
private boolean isRegexPossible() {
    if (this.lastToken == null) {
        return true;
    }
    switch (this.lastToken.getType()) {
        case Identifier:
        case NullLiteral:
        ...
            // After any of the tokens above, no regex literal can follow.
            return false;
        default:
            // In all other cases, a regex literal _is_ possible.
            return true;
    }
}
Lines 127-147: In order for that to work, the scanner must retain the previous token in the member variable lastToken. (Comments removed for space):
@Override
public Token nextToken() {
    Token next = super.nextToken();
    if (next.getChannel() == Token.DEFAULT_CHANNEL) {
        this.lastToken = next;
    }
    return next;
}


Include commentary to the matlab grammar using antlr4

Could anybody help me with these two problems, please?
The first one is almost solved for me by the question regular expression for multiline commentary in matlab, but I do not know exactly how I should use ^.*%\{(?:\R(?!.*%\{).*)*\R\h*%\}$, or where in the grammar, if I want to use it with antlr4. I have been using the matlab grammar from this source.
The second one is related to another type of comment in matlab, i.e. a = 3 % type any ascii I want.... In this case it worked when I inserted an alternative into the rule unary_expression in this form:
unary_expression
: postfix_expression
| unary_operator postfix_expression
| postfix_expression COMMENT
;
where COMMENT: '%' [ a-zA-Z0-9]*;, but when I use [\x00-\x7F] instead of [ a-zA-Z0-9]* (which I found here), parsing goes wrong; see the example below:
INPUT FOR PARSER: a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
ANTLR OUTPUT : Exception in thread "main" java.lang.RuntimeException: set is empty
at org.antlr.v4.runtime.misc.IntervalSet.getMaxElement(IntervalSet.java:421)
at org.antlr.v4.runtime.atn.ATNSerializer.serialize(ATNSerializer.java:169)
at org.antlr.v4.runtime.atn.ATNSerializer.getSerialized(ATNSerializer.java:601)
at org.antlr.v4.Tool.generateInterpreterData(Tool.java:745)
at org.antlr.v4.Tool.processNonCombinedGrammar(Tool.java:400)
at org.antlr.v4.Tool.process(Tool.java:361)
at org.antlr.v4.Tool.processGrammarsOnCommandLine(Tool.java:328)
at org.antlr.v4.Tool.main(Tool.java:172)
line 1:9 token recognition error at: '$'
line 1:20 token recognition error at: '"'
line 1:21 token recognition error at: '!'
line 1:22 token recognition error at: '"'
line 1:38 token recognition error at: '$'
line 1:43 token recognition error at: '"'
line 1:10 missing {',', ';', CR} at 'L'
line 1:32 missing {',', ';', CR} at '3'
Can anybody please tell me what I have done wrong? And what is the best practice for this problem? (I am not exactly a regex person...)
Let's take the simple one first.
This looks (to me) like a typical "comment everything through the end of the line" comment.
Assuming I'm correct, then it's best not to consider what all the valid characters are that might be contained, but rather to think about what not to consume.
Try: COMMENT: '%' ~[\r\n]* '\r'? '\n';
(I notice that you did not include anything in your rule to terminate it at the end of the line, so I've added that).
This basically says: once I see a %, consume everything that is not a \r or \n, and stop when you see an optional \r followed by a required \n.
Generally, comments can occur just about anywhere within a grammar structure, so it's VERY useful to "shove them off to the side" rather than inject them everywhere you allow them in the grammar.
So, a short grammar:
grammar test;
test: ID EQ INT;
EQ: '=';
INT: [0-9]+;
COMMENT: '%' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN);
ID: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
You'll notice that I removed the COMMENT element from the test rule.
test file:
a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
(be sure to include the \n)
➜ grun test test -tree -tokens < test.txt
[#0,0:0='a',<ID>,1:0]
[#1,2:2='=',<'='>,1:2]
[#2,4:4='3',<INT>,1:4]
[#3,6:48='% $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa\n',<COMMENT>,channel=1,1:6]
[#4,49:48='<EOF>',<EOF>,2:0]
(test a = 3)
You still get a COMMENT token, it's just ignored when matching the parser rules.
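If you ever need the comments later (say, to report or re-emit them), they are still in the token stream; they are only invisible to the parser rules. A minimal sketch using the Java runtime, assuming the toy grammar test above (so the generated lexer class is testLexer):
import org.antlr.v4.runtime.*;

public class DumpComments {
    public static void main(String[] args) {
        CharStream input = CharStreams.fromString("a = 3 % a comment\n");
        testLexer lexer = new testLexer(input);            // lexer generated from the grammar above
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();                                      // tokenize the whole input
        for (Token t : tokens.getTokens()) {
            if (t.getChannel() == Token.HIDDEN_CHANNEL) {   // COMMENT tokens end up here
                System.out.println("comment: " + t.getText().trim());
            }
        }
    }
}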
Now for the multiline comments:
ANTLR uses a rather "regex-like" syntax for lexer rules, but don't be fooled, it's not regex (it's actually more powerful, as it can pair up nested brackets, etc.)
From a quick reading, MatLab multiline comments start with a %{ and consume everything until a %}. This is very similar to the prior rule (it just doesn't care about \r or \n), so:
MLCOMMENT: '%{' .*? '%}' -> channel(HIDDEN);
Included in grammar:
grammar test;
test: ID EQ INT;
EQ: '=';
INT: [0-9]+;
COMMENT: '%' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN);
MLCOMMENT: '%{' .*? '%}' -> channel(HIDDEN);
ID: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
Input file:
a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
%{
A whole bunch of stuff
on several
lines
%}
➜ grun test test -tree -tokens < test.txt
[#0,0:0='a',<ID>,1:0]
[#1,2:2='=',<'='>,1:2]
[#2,4:4='3',<INT>,1:4]
[#3,6:48='% $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa\n',<COMMENT>,channel=1,1:6]
[#4,50:106='%{\n A whole bunch of stuff\n on several\n lines\n%}',<MLCOMMENT>,channel=1,3:0]
[#5,108:107='<EOF>',<EOF>,8:0]
(test a = 3)
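One caveat: the non-greedy .*? stops at the first %}, so if the MATLAB sources you care about can nest block comments (an assumption to verify against the MATLAB documentation), the rule can be made recursive instead; this kind of self-reference is part of what makes lexer rules more powerful than plain regexes, as noted above:
MLCOMMENT: '%{' (MLCOMMENT | .)*? '%}' -> channel(HIDDEN);
The recursive reference makes the lexer consume an inner %{ ... %} pair as a unit before it looks for the closing %} of the outer comment.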

Why is $ split valid syntax? [duplicate]

I just discovered that Perl ignores space between the sigil and its variable name, and was wondering if someone could tell me if this is the expected behaviour. I've never run into this before, and it can result in strange behaviour inside of strings.
For example, in the following code, $bar will end up with the value 'foo':
my $foo = 'foo';
my $bar = "$ foo";
This also works with variable declarations:
my $
bar = "foo\n";
print $bar;
The second case doesn't really matter much to me but in the case of string interpolation this can lead to very confusing behaviour. Anyone know anything about this?
Yes, it is part of the language. No, you should not use it for serious code. As for being confusing in interpolation, all dollar signs (that are not part of a variable) should be escaped, not just the ones next to letters, so it shouldn't be a problem.
I do not know if this is the real reason behind allowing whitespace in between the sigil and the variable name, but it allows you to do things like
my $ count = 0;
my $file_handle_foo = IO::File->new;
which might be seen by some people as handy (since it puts the sigils and the unique parts of the variable names next to each other). It is also useful for Obfu (see the end of line 9 and beginning of line 10):
#!/usr/bin/perl -w # camel code
use strict;
$_='ev
al("seek\040D
ATA,0, 0;");foreach(1..3)
{<DATA>;}my @camel1hump;my$camel;
my$Camel ;while( <DATA>){$_=sprintf("%-6
9s",$_);my#dromedary 1=split(//);if(defined($
_=<DATA>)){#camel1hum p=split(//);}while(#dromeda
ry1){my$camel1hump=0 ;my$CAMEL=3;if(defined($_=shif
t(@dromedary1 ))&&/\S/){$camel1hump+=1<<$CAMEL;}
$CAMEL--;if(d efined($_=shift(@dromedary1))&&/\S/){
$camel1hump+=1 <<$CAMEL;}$CAMEL--;if(defined($_=shift(
@camel1hump))&&/\S/){$camel1hump+=1<<$CAMEL;}$CAMEL--;if(
defined($_=shift(@camel1hump))&&/\S/){$camel1hump+=1<<$CAME
L;;}$camel.=(split(//,"\040..m`{/J\047\134}L^7FX"))[$camel1h
ump];}$camel.="\n";}@camel1hump=split(/\n/,$camel);foreach(@
camel1hump){chomp;$Camel=$_;y/LJF7\173\175`\047/\061\062\063\
064\065\066\067\070/;y/12345678/JL7F\175\173\047`/;$_=reverse;
print"$_\040$Camel\n";}foreach(#camel1hump){chomp;$Camel=$_;y
/LJF7\173\175`\047/12345678/;y/12345678/JL7F\175\173\0 47`/;
$_=reverse;print"\040$_$Camel\n";}';;s/\s*//g;;eval; eval
("seek\040DATA,0,0;");undef$/;$_=<DATA>;s/\s*//g;( );;s
;^.*_;;;map{eval"print\"$_\"";}/.{4}/g; __DATA__ \124
\1 50\145\040\165\163\145\040\157\1 46\040\1 41\0
40\143\141 \155\145\1 54\040\1 51\155\ 141
\147\145\0 40\151\156 \040\141 \163\16 3\
157\143\ 151\141\16 4\151\1 57\156
\040\167 \151\164\1 50\040\ 120\1
45\162\ 154\040\15 1\163\ 040\14
1\040\1 64\162\1 41\144 \145\
155\14 1\162\ 153\04 0\157
\146\ 040\11 7\047\ 122\1
45\15 1\154\1 54\171 \040
\046\ 012\101\16 3\16
3\15 7\143\15 1\14
1\16 4\145\163 \054
\040 \111\156\14 3\056
\040\ 125\163\145\14 4\040\
167\1 51\164\1 50\0 40\160\
145\162 \155\151
\163\163 \151\1
57\156\056

How ANTLR decides whether terminals should be separated with whitespaces or not?

I'm writing a lexical analyzer in Swift, for Swift. I used ANTLR's grammar, but I ran into a problem: I don't understand how ANTLR decides whether terminals should be separated by whitespace.
Here's the grammar: https://github.com/antlr/grammars-v4/blob/master/swift/Swift.g4
Assume we have casting in Swift. It can operate on optional types (Int?, String?) and on non-optional types (Int, String). Here are valid examples: "as? Int", "as Int", "as?Int". Invalid example: "asInt" (it isn't a cast). I've implemented logic where terminals in grammar rules can be separated by 0 or more WS (whitespace) symbols. But with this logic "asInt" matches a cast, because it contains "as" and a type "Int" separated by 0 or more WS symbols. But it should be invalid.
Swift grammar contains these rules:
DOT : '.' ;
LCURLY : '{' ;
LPAREN : '(' ;
LBRACK : '[' ;
RCURLY : '}' ;
RPAREN : ')' ;
RBRACK : ']' ;
COMMA : ',' ;
COLON : ':' ;
SEMI : ';' ;
LT : '<' ;
GT : '>' ;
UNDERSCORE : '_' ;
BANG : '!' ;
QUESTION: '?' ;
AT : '@' ;
AND : '&' ;
SUB : '-' ;
EQUAL : '=' ;
OR : '|' ;
DIV : '/' ;
ADD : '+' ;
MUL : '*' ;
MOD : '%' ;
CARET : '^' ;
TILDE : '~' ;
It seems that all these terminals can be separated from the others by 0 WS symbols, while other terminals can't (e.g. "as" + Identifier).
Am I right? If I'm right, the problem is solved. But there may be more complex logic.
Now if I have rules
WS : [ \n\r\t\u000B\u000C\u0000]+
a : 'str1' b
b : 'str2' c
c : '+' d
d : 'str3'
I use them as if they were these rules:
WS : [ \n\r\t\u000B\u000C\u0000]+
a : WS? 'str1' WS? 'str2' WS? '+' WS? 'str3' WS?
And I suppose that they should be like these (I don't know and that is the question):
WS : [ \n\r\t\u000B\u000C\u0000]+
a: 'str1' WS 'str2' WS? '+' WS? 'str3'
(notice WS is not optional between 'str1' and 'str2')
So there are two questions:
Am I right?
What did I miss?
Thanks.
Here's the ANTLR WS rule in your Swift grammar:
WS : [ \n\r\t\u000B\u000C\u0000]+ -> channel(HIDDEN) ;
The -> channel(HIDDEN) instruction tells the lexer to put these tokens on a separate channel, so the parser won't see them at all. You shouldn't litter your grammar with WS rules - it'd become unreadable.
ANTLR works in two steps: you have the lexer and the parser. The lexer produces the tokens, and the parser tries to figure out a concrete syntax tree from these tokens and the grammar.
The lexer in ANTLR works like this:
Consume characters as long as they match any lexer rule.
If several rules match the text you've consumed, use the first one which appears in the grammar
Literal strings in the grammar (like 'as') are turned into implicit lexer rules (equivalent to TOKEN_AS: 'as'; except the name will be just 'as'). These end up first in the lexer rules list.
Example 1
Let's see the consequences of these when lexing as?Int (with a space at the end):
a... potentially matches Identifier and 'as'
as... potentially matches Identifier and 'as'
as? does not match any lexer rule
Therefore, you consume as, which will become a token. Now you have to decide which will be the token type. Both Identifier and 'as' rules match. 'as' is an implicit lexer rule, and considered to appear first in the grammar, therefore it takes precedence. The lexer emits a token with text as of type 'as'.
Next token.
?... potentially matches the QUESTION rule
?I doesn't match any rule
Therefore, you consume ? from the input and emit a token of type QUESTION with text ?.
Next token.
I... potentially matches Identifier
In... potentially matches Identifier
Int... potentially matches Identifier
Int (followed by a space) does not match anything
Therefore, you consume Int from the input and emit a token of type Identifier with text Int.
Next token.
You have a space there, it matches the WS rule.
You consume that space, and emit a WS token on the HIDDEN channel. The parser won't see this.
Example 2
Now let's see how asInt is tokenized.
a... potentially matches Identifier and 'as'
as... potentially matches Identifier and 'as'
asI... potentially matches Identifier
asIn... potentially matches Identifier
asInt... potentially matches Identifier
asInt followed by a space doesn't match any lexer rule.
Therefore, you consume asInt from the input stream, and emit an Identifier token with text asInt.
The parser
The parser stage is only interested in the token types it gets. It does not care about what text they contain. Tokens outside the default channel are ignored, which means the following inputs:
as?Int - tokens: 'as' QUESTION Identifier
as? Int - tokens: 'as' QUESTION WS Identifier
as ? Int - tokens: 'as' WS QUESTION WS Identifier
Will all result in the parser seeing the following token types: 'as' QUESTION Identifier, as WS is on a separate channel.
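You can see this for yourself by dumping the token stream. A small sketch using the Java runtime (SwiftLexer stands for whatever lexer class ANTLR generates from the linked grammar; adjust the name to your setup):
import org.antlr.v4.runtime.*;

public class TokenDump {
    static void dump(String source) {
        SwiftLexer lexer = new SwiftLexer(CharStreams.fromString(source));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        System.out.println(source + "  ->  " + tokens.getTokens());
    }

    public static void main(String[] args) {
        dump("as?Int");   // expected, per the walkthrough above: 'as' QUESTION Identifier EOF
        dump("as? Int");  // also contains a WS token, but it sits on the hidden channel
        dump("asInt");    // a single Identifier token, so it can never match a cast
    }
}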

Error in the semantic values returned by bison

A part of my bison grammar is as shown
head: OPEN statement CLOSE
        {
            $$ = $2;
        }
    ;

statement: word
        {
            $$ = $1;
        }
    | statement word
        {
            $$ = $1;
            printf("%s", $$);
        }
    ;
Now if my input is [hai hello], where [ is the OPEN and ] is the CLOSE respectively, then in the printf statement I get the output "hai hello" itself, but in the $$ of head I get "hai hello]". The same happens with other grammars too, i.e., if I try to print the value of $1, the values of $2, $3, ... are also printed. Why is it so?
The problem is probably in your lexer -- you probably have lexer actions that do something like yylval.str = yytext; to return a semantic value. The problem is that yytext is a pointer into the scanner's read buffer and is only valid until the next call to yylex. So all your semantic values in the parser quickly become dangling pointers and what they point at is no longer valid.
You need to make a copy of the token string in the lexer. Use an action something like yylval.str = strdup(yytext);. Of course, then you have potential memory leak issues in your parser -- you need to free the $n values you don't need anymore.

Laundering tainted data

When laundering tainted data by checking whether it contains any bad characters, are there Unicode properties which will filter out the bad characters?
User-Defined Character Properties in perlunicode
package Characters::Sid_com;
sub InBad {
return <<"BAD";
0000\t10FFFF
BAD
}
sub InEvil {
return <<"EVIL";
0488
0489
EVIL
}
sub InStupid {
return <<"STUPID";
E630\tE64F
F8D0\tF8FF
STUPID
}
⋮
die 'No.' if $tring =~ /
(?: \p{Characters::Sid_com::InBad}
| \p{Characters::Sid_com::InEvil}
| \p{Characters::Sid_com::InStupid}
)
/x;
I think "no" is an understatement for an answer, but there you have it. No, Unicode does not have a concept of "bad" or "good" characters (let alone "ugly" ones).
XML (and thus XHTML) can only contain these chars:
\x09 \x0A \x0D
\x{0020}-\x{D7FF}
\x{E000}-\x{FFFD}
\x{10000}-\x{10FFFF}
Of the above, the following should be avoided:
\x7F-\x84
\x86-\x9F
\x{FDD0}-\x{FDEF}
\x{1FFFE}-\x{1FFFF}
\x{2FFFE}-\x{2FFFF}
\x{3FFFE}-\x{3FFFF}
\x{4FFFE}-\x{4FFFF}
\x{5FFFE}-\x{5FFFF}
\x{6FFFE}-\x{6FFFF}
\x{7FFFE}-\x{7FFFF}
\x{8FFFE}-\x{8FFFF}
\x{9FFFE}-\x{9FFFF}
\x{AFFFE}-\x{AFFFF}
\x{BFFFE}-\x{BFFFF}
\x{CFFFE}-\x{CFFFF}
\x{DFFFE}-\x{DFFFF}
\x{EFFFE}-\x{EFFFF}
\x{FFFFE}-\x{FFFFF}
\x{10FFFE}-\x{10FFFF}
If you are generating XHTML, you need to escape the following:
& ⇒ &amp;
< ⇒ &lt;
> ⇒ &gt; (optional)
" ⇒ &quot; (optional except in attribute values delimited with ")
' ⇒ &apos; (optional except in attribute values delimited with ')
HTML should have the same if not looser requirements, so if you stick to this, you should be safe.