Forcing gaps between words in a Marpa grammar - perl

I'm trying to set up a grammar that requires that [\w] characters cannot appear directly adjacent to each other if they are not in the same lexeme. That is, words must be separated from each other by a space or punctuation.
Consider the following grammar:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This parses successfully. Now I want to change the grammar to force a separation between 9 and september. I thought of doing this by introducing an unused lexeme that matches [\w]+:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+ ### <== Add unused lexeme to match joined keywords
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
Unfortunately, this grammar fails with:
A lexeme is not accessible from the start symbol: word
Marpa::R2 exception at marpa.pl line 3.
Although this can be resolved by using a lexeme default statement:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
lexeme default = action => [value] ### <== Fix exception by adding lexeme default statement
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This results in the following output:
Inaccessible symbol: word
Error in SLIF parse: No lexemes accepted at line 1, column 1
* String before error:
* The error was at line 1, column 1, and at character 0x0039 '9', ...
* here: 9september
Marpa::R2 exception at marpa.pl line 16.
That is, the parse has failed because there is no gap between 9 and september, which is exactly what I want to happen. The only fly in the ointment is the annoying Inaccessible symbol: word message on STDERR, because the word lexeme is not used in the actual grammar.
I see that in Marpa::R2::Grammar I could have declared word as inaccessible_ok in the constructor options, but I can't do that in Marpa::R2::Scanless.
I also could have done something like the following:
Rule ::= nine september
nine ~ word
september ~ word
then used a pause to run custom code that examines the actual lexeme value and returns the appropriate lexeme depending on the value.
What is the best way to construct a grammar that uses keywords or numbers and words but will disallow adjacent lexemes to be run together without white space or punctuation separating them?

Well, the obvious solution is to require some whitespace in between (on the G1 level). When we use the following grammar
:default ::= action => ::array
:start ::= Rule
Rule ::= '9' (Ws) 'september'
Ws ::= [\s]+
:discard ~ whitespace
whitespace ~ [\s]+
then 9september fails, but 9 september is parsed. Important points to note:
Lexemes can be both discarded and required, when they both are a longest token. This is why the :discard and Ws rules don't interfere with each other. Marpa doesn't mind this kind of “ambiguity”.
The Ws rule is enclosed in parens, which discards the value – to keep the resulting parse tree clean.
You do not usually want to use tricks like phantom lexemes to misguide the parser. That way lies breakage.
When every bit of whitespace is important, you might want to get rid of :discard ~ whitespace. This is meant to be used e.g. for C-like languages where whitespace traditionally does not matter.
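For comparison, the gap requirement itself can be sketched outside Marpa. The following Python helper is purely illustrative (it is not Marpa API, and the function name is made up): it matches literal lexemes in order, allows optional whitespace between them, and rejects two word-character lexemes that touch with no gap.

```python
import re

def match_with_gaps(lexemes, text):
    """Match the literal `lexemes` in order. Whitespace between lexemes is
    optional, except where two word-character lexemes would touch."""
    pos, prev_is_word = 0, False
    for lex in lexemes:
        ws = re.match(r'\s*', text[pos:]).end()  # length of leading whitespace
        pos += ws
        if not text.startswith(lex, pos):
            return False
        # reject '9september': adjacent \w lexemes with no gap between them
        if ws == 0 and prev_is_word and re.match(r'\w', lex):
            return False
        pos += len(lex)
        prev_is_word = bool(re.match(r'\w', lex[-1]))
    return pos == len(text)
```

With this sketch, match_with_gaps(['9', 'september'], '9 september') succeeds while '9september' is rejected, mirroring the grammar above.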

Related

Greedy negative lookbehind (in Swift)

I'm in need of a regular expression that acts like the following:
matches (any part of foo() in the following statement):
foo()
arg: foo()
foo()
(arg: foo()) {}
does not match:
#foo()
I currently have the following, but it has some problems:
^\s*?(?<!#)((\w+?)\()
^\s*? includes any whitespace at the beginning of the line, which means arg: foo() doesn't match the foo() bit. I had to include this to get the # lookbehind working correctly;
(?<!#) is a lookbehind to discard the match if a # before the thing() is matched;
(\w+?)\( matches the part of thething( correctly, only if there's no # before it.
Without ^\s*?, the regex behaves only partly correctly: it discards just one character rather than the whole match.
It has to discard the match entirely if any # is before it, yet it must still match this correctly: #Mode foo() (the foo() bit, disregarding the #Mode before it).
If there are any tips to help me out, that would be awesome!
Use
(?<![\w#])\w+\(\)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
[\w#] any character of: word characters (a-z,
A-Z, 0-9, _), '#'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
\) ')'
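The pattern can be checked against the question's examples in any PCRE-like engine; here is a quick sketch using Python's re, whose negative-lookbehind semantics match this usage:

```python
import re

pattern = re.compile(r'(?<![\w#])\w+\(\)')

# matches foo() in each of these: the preceding char is not a word char or '#'
for text in ['foo()', 'arg: foo()', '(arg: foo()) {}', '#Mode foo()']:
    assert pattern.search(text).group(0) == 'foo()'

# no match at all: the '#' directly before foo trips the lookbehind,
# and every later start position is preceded by a word character
assert pattern.search('#foo()') is None
```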

Scala Syntax Specification mismatch if-else with one line expression end by semicolon?

I'm learning Scala Syntax Specification.
Confused by the if-else syntax:
Expr1 ::= ‘if’ ‘(’ Expr ‘)’ {nl} Expr [[semi] ‘else’ Expr]
| ...
How can it match the following if-else, where each one-line expression ends with a semicolon?
if (true) // \n
println(1); //\n
else //\n
println(2); //\n
Notice there are 4 lines, each followed by a '\n'. I have these questions:
When the 1st ; after println(1) matches semi before else ([[semi] ‘else’ Expr]), how is the 2nd '\n' after that ; matched?
How is the 3rd '\n' after else matched?
How are the 2nd ; and the 4th '\n' after println(2) matched? The if-else production doesn't match any trailing ; or '\n'.
I think you are being confused by thinking that all newlines must match the nl token. That is not correct.
Newlines are in general simply treated as whitespace. There is a very long subsection on newlines in the Lexical Syntax chapter section 1.2 Newline characters which explains in detail, when, exactly, a newline character is an nl token and when it isn't.
Only the first newline character in your example is an nl token, the other three are just whitespace.
In Scala, a trailing semicolon is optional (the parser effectively ignores it).
With braces, the if-else expression is straightforward:
if (true) {
"\n" // evaluated, then discarded
println(1) // prints 1; its result () is discarded
"\n" // the last expression is the value of the block
} else {
"\n" // evaluated, then discarded
println(2) // prints 2; its result () is discarded
"\n" // the last expression is the value of the block
}
Or you can write it without braces, but then each branch must be a single expression:
if (true)
"\n" // this will be returned, can not have another line here
else
"\n"
without comments: if (true) "\n" else "\n"
More about if-else in Scala

How ANTLR decides whether terminals should be separated with whitespaces or not?

I'm writing a lexical analyzer in Swift, for Swift. I used ANTLR's grammar, but I ran into a problem: I don't understand how ANTLR decides whether terminals must be separated by whitespace.
Here's the grammar: https://github.com/antlr/grammars-v4/blob/master/swift/Swift.g4
Consider casting in Swift. It can operate on optional types (Int?, String?) and on non-optional types (Int, String). Valid examples: "as? Int", "as Int", "as?Int". Invalid example: "asInt" (it isn't a cast). I've implemented logic where terminals in grammar rules can be separated by 0 or more WS (whitespace) symbols. But with this logic "asInt" matches a cast, because it contains "as" and a type "Int" separated by 0 WS symbols. It should be invalid.
Swift grammar contains these rules:
DOT : '.' ;
LCURLY : '{' ;
LPAREN : '(' ;
LBRACK : '[' ;
RCURLY : '}' ;
RPAREN : ')' ;
RBRACK : ']' ;
COMMA : ',' ;
COLON : ':' ;
SEMI : ';' ;
LT : '<' ;
GT : '>' ;
UNDERSCORE : '_' ;
BANG : '!' ;
QUESTION: '?' ;
AT : '#' ;
AND : '&' ;
SUB : '-' ;
EQUAL : '=' ;
OR : '|' ;
DIV : '/' ;
ADD : '+' ;
MUL : '*' ;
MOD : '%' ;
CARET : '^' ;
TILDE : '~' ;
It seems that all these terminals can directly abut other terminals with no WS symbols in between, while other terminals can't (e.g. "as" + Identifier).
Am I right? If so, the problem is solved. But there may be more complex logic.
Now if I have rules
WS : [ \n\r\t\u000B\u000C\u0000]+
a : 'str1' b
b : 'str2' c
c : '+' d
d : 'str3'
I use them as if they were these rules:
WS : [ \n\r\t\u000B\u000C\u0000]+
a : WS? 'str1' WS? 'str2' WS? '+' WS? 'str3' WS?
And I suppose that they should instead be like this (I don't know, and that is the question):
WS : [ \n\r\t\u000B\u000C\u0000]+
a: 'str1' WS 'str2' WS? '+' WS? 'str3'
(notice WS is not optional between 'str1' and 'str2')
So there's 2 questions:
Am I right?
What I missed?
Thanks.
Here's the ANTLR WS rule in your Swift grammar:
WS : [ \n\r\t\u000B\u000C\u0000]+ -> channel(HIDDEN) ;
The -> channel(HIDDEN) instruction tells the lexer to put these tokens on a separate channel, so the parser won't see them at all. You shouldn't litter your grammar with WS rules - it'd become unreadable.
ANTLR works in two steps: you have the lexer and the parser. The lexer produces the tokens, and the parser tries to figure out a concrete syntax tree from these tokens and the grammar.
The lexer in ANTLR works like this:
Consume characters as long as they match any lexer rule.
If several rules match the text you've consumed, use the first one that appears in the grammar.
Literal strings in the grammar (like 'as') are turned into implicit lexer rules (equivalent to TOKEN_AS: 'as'; except the name will be just 'as'). These end up first in the lexer rules list.
Example 1
Let's see the consequences of these when lexing as?Int (with a space at the end):
a... potentially matches Identifier and 'as'
as... potentially matches Identifier and 'as'
as? does not match any lexer rule
Therefore, you consume as, which will become a token. Now you have to decide which will be the token type. Both Identifier and 'as' rules match. 'as' is an implicit lexer rule, and considered to appear first in the grammar, therefore it takes precedence. The lexer emits a token with text as of type 'as'.
Next token.
?... potentially matches the QUESTION rule
?I doesn't match any rule
Therefore, you consume ? from the input and emit a token of type QUESTION with text ?.
Next token.
I... potentially matches Identifier
In... potentially matches Identifier
Int... potentially matches Identifier
Int (followed by a space) does not match anything
Therefore, you consume Int from the input and emit a token of type Identifier with text Int.
Next token.
You have a space there, it matches the WS rule.
You consume that space, and emit a WS token on the HIDDEN channel. The parser won't see this.
Example 2
Now let's see how asInt is tokenized.
a... potentially matches Identifier and 'as'
as... potentially matches Identifier and 'as'
asI... potentially matches Identifier
asIn... potentially matches Identifier
asInt... potentially matches Identifier
asInt followed by a space doesn't match any lexer rule.
Therefore, you consume asInt from the input stream, and emit an Identifier token with text asInt.
The parser
The parser stage is only interested in the token types it gets. It does not care about what text they contain. Tokens outside the default channel are ignored, which means the following inputs:
as?Int - tokens: 'as' QUESTION Identifier
as? Int - tokens: 'as' QUESTION WS Identifier
as ? Int - tokens: 'as' WS QUESTION WS Identifier
Will all result in the parser seeing the following token types: 'as' QUESTION Identifier, as WS is on a separate channel.
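The longest-match-plus-rule-order behavior described above can be simulated in a few lines. This Python sketch mirrors the answer's description only; the rule names and ordering are illustrative, not ANTLR's actual implementation:

```python
import re

# Ordered rules: implicit literal rules (like 'as') come first, as in ANTLR.
RULES = [
    ("'as'", re.compile(r'as')),
    ('QUESTION', re.compile(r'\?')),
    ('Identifier', re.compile(r'[A-Za-z_]\w*')),
    ('WS', re.compile(r'[ \n\r\t]+')),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        # Longest match wins; ties are broken by rule order (earliest rule).
        best_len, best_type = 0, None
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m and len(m.group(0)) > best_len:
                best_len, best_type = len(m.group(0)), name
        if best_type is None:
            raise ValueError(f'no rule matches at position {pos}')
        if best_type != 'WS':  # hidden channel: the parser never sees WS
            tokens.append((best_type, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

Here tokenize('as?Int') and tokenize('as ? Int') yield the same visible tokens, while tokenize('asInt') collapses into a single Identifier, exactly as in the walkthrough above.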

How to identify and extract simple nested tokens with a BNF lexer?

I have no idea how to find documentation about this. I just discovered that most compilers use the Backus–Naur Form to describe a language.
From the Marpa::R2 Perl package, take this simple example that parses arithmetic strings such as 42 * 1 + 7:
:default ::= action => [name,values]
lexeme default = latm => 1
Calculator ::= Expression action => ::first
Factor ::= Number action => ::first
Term ::=
Term '*' Factor action => do_multiply
| Factor action => ::first
Expression ::=
Expression '+' Term action => do_add
| Term action => ::first
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
I would like to modify this in order to recursively parse an XML like sample such as:
<foo>
Some content here
<bar>
I am nested into foo
</bar>
A nested block was before me.
</foo>
And express it into something like:
>(Some content here)
>>(I am nested into foo)
>(A nested block was before me)
Where I may use this function:
sub block {
    my ($content, $level) = @_;
    # prefix each line of $content with $level '>' characters
    return join "\n", map { (">" x $level) . $_ } split /\n/, $content;
}
What would be a good start for me?
There is an open-source Marpa-powered XML parser.
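Independently of the linked parser, a good start is a small recursive-descent pass over the tag/text tokens. This Python sketch is purely illustrative (it assumes well-formed, attribute-free tags, which is all the sample needs):

```python
import re

def parse_blocks(text):
    """Parse <tag>...</tag> blocks (possibly nested) into (tag, children)
    trees, where children are text strings or nested blocks."""
    tokens = [t for t in re.split(r'(</?\w+>)', text) if t.strip()]
    pos = 0

    def block():
        nonlocal pos
        tag = tokens[pos][1:-1]          # '<foo>' -> 'foo'
        pos += 1
        children = []
        while tokens[pos] != f'</{tag}>':
            if tokens[pos].startswith('<'):
                children.append(block()) # recurse into the nested block
            else:
                children.append(tokens[pos].strip())
                pos += 1
        pos += 1                         # consume the closing tag
        return (tag, children)

    return block()

def render(node, level=1):
    """Flatten a block tree into '>'-prefixed lines, one per text chunk."""
    tag, children = node
    lines = []
    for child in children:
        if isinstance(child, tuple):
            lines += render(child, level + 1)
        else:
            lines.append('>' * level + f'({child})')
    return lines
```

render(parse_blocks(text)) turns the sample input into the '>'-prefixed lines shown in the question.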

Marpa parser can't seem to cope with optional first symbol?

I've been getting to grips with the Marpa parser and encountered a problem when the first symbol is optional. Here's an example:
use strict;
use warnings;
use 5.10.0;
use Marpa::R2;
use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= <optional a> 'X'
<optional a> ~ a *
a ~ 'a'
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\"X");
When I run this, I get the following error:
Error in SLIF parse: No lexemes accepted at line 1, column 1
* String before error:
* The error was at line 1, column 1, and at character 0x0058 'X', ...
* here: X
Marpa::R2 exception at small.pl line 20
at /usr/local/lib/perl/5.14.2/Marpa/R2.pm line 126
Marpa::R2::exception('Error in SLIF parse: No lexemes accepted at line 1, column 1\x{a}...') called at /usr/local/lib/perl/5.14.2/Marpa/R2/Scanless.pm line 1545
Marpa::R2::Scanless::R::read_problem('Marpa::R2::Scanless::R=ARRAY(0x95cbfd0)', 'no lexemes accepted') called at /usr/local/lib/perl/5.14.2/Marpa/R2/Scanless.pm line 1345
Marpa::R2::Scanless::R::resume('Marpa::R2::Scanless::R=ARRAY(0x95cbfd0)', 0, -1) called at /usr/local/lib/perl/5.14.2/Marpa/R2/Scanless.pm line 926
Marpa::R2::Scanless::R::read('Marpa::R2::Scanless::R=ARRAY(0x95cbfd0)', 'SCALAR(0x95aeb1c)') called at small.pl line 20
Perl version 5.14.2 (debian wheezy)
Marpa version 2.068000
(I see there's a brand new Marpa 2.069 that I haven't tried yet)
Is this something I'm doing wrong in my grammar?
In Marpa Scanless, your grammar has two levels: The main, high-level grammar where you can attribute actions and such, and the low-level lexing grammar. They are executed independently (which is expected if you have used traditional parser/lexers, but is very confusing when you come from regexes to Marpa).
Now on the low-level grammar, Marpa recognizes your input as a single X, not “zero a's and then an X”. However, the high-level grammar requires the optional a symbol to be present.
The best way around that is to make the a optional in the high-level grammar:
<optional a> ::= <many a>
<optional a> ::= # empty
<many a> ~ a* # would work the same here with "a+"
a ~ 'a'
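The underlying point, that a longest-match lexer never emits an empty token so "zero a's" must be handled by the parser, can be illustrated outside Marpa. This Python sketch is hypothetical and not how Marpa works internally:

```python
import re

def lex(text):
    """Longest-match lexer for 'a' runs and 'X'. It never produces an
    empty token, so there is no lexeme for 'zero a's'."""
    tokens, pos = [], 0
    while pos < len(text):
        m = re.match(r'a+|X', text[pos:])
        if not m:
            raise ValueError(f'no lexeme at position {pos}')
        tokens.append(m.group(0))
        pos += m.end()
    return tokens

def parse_rule(tokens):
    """Rule ::= <optional a> 'X' -- the optionality lives in the parser,
    which may consume zero 'a' lexemes before requiring 'X'."""
    pos = 0
    if pos < len(tokens) and set(tokens[pos]) == {'a'}:
        pos += 1  # consumed the optional <many a> lexeme
    return tokens[pos:] == ['X']
```

Both parse_rule(lex('X')) and parse_rule(lex('aaX')) succeed, because the parser, not the lexer, decides that the a-run may be absent.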