How to identify and extract simple nested tokens with a BNF lexer?

I have no idea where to find documentation about this. I just discovered that most compilers use Backus–Naur Form to describe a language.
From the Marpa::R2 Perl package, I got this simple example that parses arithmetic strings such as 42 * 1 + 7:
:default ::= action => [name,values]
lexeme default = latm => 1
Calculator ::= Expression action => ::first
Factor ::= Number action => ::first
Term ::=
Term '*' Factor action => do_multiply
| Factor action => ::first
Expression ::=
Expression '+' Term action => do_add
| Term action => ::first
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
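To make the precedence encoded in this grammar concrete, here is a minimal sketch in Python (not Marpa) of a hand-written parser that follows the same rules. The function names are hypothetical, and the left-recursive rules are turned into loops, which is a common hand-translation and not something Marpa itself requires.

```python
import re

def tokenize(text):
    # Mirrors `Number ~ digits` and `:discard ~ whitespace`.
    tokens = re.findall(r"\d+|[*+]", text)
    if "".join(tokens) != re.sub(r"\s+", "", text):
        raise ValueError("unexpected character in input")
    return tokens

def parse_expression(tokens):
    # Expression ::= Expression '+' Term | Term
    value = parse_term(tokens)
    while tokens and tokens[0] == "+":
        tokens.pop(0)
        value += parse_term(tokens)
    return value

def parse_term(tokens):
    # Term ::= Term '*' Factor | Factor
    value = parse_factor(tokens)
    while tokens and tokens[0] == "*":
        tokens.pop(0)
        value *= parse_factor(tokens)
    return value

def parse_factor(tokens):
    # Factor ::= Number
    if not tokens or not tokens[0].isdigit():
        raise ValueError("expected a number")
    return int(tokens.pop(0))

def calculate(text):
    tokens = tokenize(text)
    value = parse_expression(tokens)
    if tokens:
        raise ValueError(f"unexpected token: {tokens[0]}")
    return value

print(calculate("42 * 1 + 7"))  # prints 49
```

Because Term sits below Expression in the grammar, multiplication binds tighter than addition without any explicit precedence table.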
I would like to modify this in order to recursively parse an XML-like sample such as:
<foo>
Some content here
<bar>
I am nested into foo
</bar>
A nested block was before me.
</foo>
And turn it into something like:
>(Some content here)
>>(I am nested into foo)
>(A nested block was before me)
Where I may use this function:
sub block {
    my ($content, $level) = @_;
    # Prefix each line of $content with $level '>' characters.
    return join "\n", map { (">" x $level) . $_ } split /\n/, $content;
}
What would be a good start for me?

There is an open-source Marpa-powered XML parser.
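Independently of Marpa, the transformation asked for can be sketched with a tag stack. The following Python sketch is only an illustration under simplifying assumptions: every tag sits on its own line, tags carry no attributes, and the function name `render` is hypothetical.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")

def render(text):
    """Return the content lines, each prefixed with one '>' per enclosing tag."""
    out, stack = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = TAG.fullmatch(line)
        if m is None:
            # Plain content: nesting depth is the number of open tags.
            out.append(">" * len(stack) + "(" + line + ")")
        elif m.group(1):
            # Closing tag: it must match the innermost open tag.
            if not stack or stack[-1] != m.group(2):
                raise ValueError(f"unexpected closing tag: {line}")
            stack.pop()
        else:
            stack.append(m.group(2))
    if stack:
        raise ValueError(f"unclosed tag: {stack[-1]}")
    return out

sample = """<foo>
Some content here
<bar>
I am nested into foo
</bar>
A nested block was before me.
</foo>"""
print("\n".join(render(sample)))
```

A grammar-based solution would express the same nesting with a recursive rule (an element contains content or further elements); the stack here plays the role of that recursion.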

Related

XText cross referencing

I have written following grammar
Model:
package = PackageDec?
greetings+=Greeting*
usage+=Usage* ;
PackageDec:
'package' name=QualifiedName ;
Greeting:
'greet' name=ID '{' ops += Operation* '}' ;
Operation:
'op' name=ID ('(' ')' '{' '}')? ;
QualifiedName:
ID ('.' ID)*;
Usage:
'use';
With the above I can write the following script.
package p1.p2
greet G1 {op f1 op f2 }
Now I need to write something like this:
package p1.p2
greet G1 {op f1 op f2 op f3}
use p1.p2.G1.f1
use p1.p2.G1
use p1.p2.G1.f3
To support that, I changed the Usage rule like this:
Usage:
'use' head=[Greet|QualifiedName] =>('.' tail=[Operation])?
However, when I generate the Xtext artifacts, it complains about multiple alternatives.
Please let me know how to write a correct grammar rule for this.
This is because QualifiedName consumes dots (.). Adding ('.' ...)? creates two alternatives at each dot. Consider the input
a.b.c
This could be parsed as
head="a" tail = "b.c"
head="a.b" tail = "c"
If I understand your intention in using the predicate => correctly, then you just have to replace
head=[Greet|QualifiedName]
with
head=[Greet]
In this case however you will not be able to parse references with dots.
As a solution I would recommend to substitute your dot with some other character. For example with colon:
Usage:
'use' head=[Greet|QualifiedName] (':' tail=[Operation])?
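To see why the dotted version is ambiguous, the possible head/tail splits of a reference can simply be enumerated. This Python sketch is only an illustration of the counting argument; the function name is hypothetical and, like the example above, it treats the tail as whatever follows the chosen dot.

```python
def candidate_splits(reference):
    """List every way to divide a dotted reference into (head, tail)."""
    parts = reference.split(".")
    # The whole reference may be the head, with no tail at all.
    results = [(reference, None)]
    # Otherwise any dot could be the one that separates head from tail.
    for k in range(1, len(parts)):
        results.append((".".join(parts[:k]), ".".join(parts[k:])))
    return results

for head, tail in candidate_splits("a.b.c"):
    print(head, tail)
```

With a distinct separator such as ':' there is at most one place to split, so the parser never has to choose.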

Forcing gaps between words in a Marpa grammar

I'm trying to set up a grammar that requires that [\w] characters cannot appear directly adjacent to each other if they are not in the same lexeme. That is, words must be separated from each other by a space or punctuation.
Consider the following grammar:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This parses successfully. Now I want to change the grammar to force a separation between 9 and september. I thought of doing this by introducing an unused lexeme that matches [\w]+:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+ ### <== Add unused lexeme to match joined keywords
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
Unfortunately, this grammar fails with:
A lexeme is not accessible from the start symbol: word
Marpa::R2 exception at marpa.pl line 3.
Although this can be resolved by using a lexeme default statement:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
lexeme default = action => [value] ### <== Fix exception by adding lexeme default statement
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This results in the following output:
Inaccessible symbol: word
Error in SLIF parse: No lexemes accepted at line 1, column 1
* String before error:
* The error was at line 1, column 1, and at character 0x0039 '9', ...
* here: 9september
Marpa::R2 exception at marpa.pl line 16.
That is, the parse has failed due to the fact that there is no gap between 9 and september which is exactly what I want to happen. The only fly in the ointment is that there is an annoying Inaccessible symbol: word message on STDERR because the word lexeme is not used in the actual grammar.
I see that in Marpa::R2::Grammar I could have declared word as inaccessible_ok in the constructor options but I can't do that in Marpa::R2::Scanless.
I also could have done something like the following:
Rule ::= nine september
nine ~ word
september ~ word
then used a pause to use custom code to examine the actual lexeme value and return the appropriate lexeme depending on the value.
What is the best way to construct a grammar that uses keywords or numbers and words but will disallow adjacent lexemes to be run together without white space or punctuation separating them?
Well, the obvious solution is to require some whitespace in between (on the G1 level). When we use the following grammar
:default ::= action => ::array
:start ::= Rule
Rule ::= '9' (Ws) 'september'
Ws ::= [\s]+
:discard ~ whitespace
whitespace ~ [\s]+
then 9september fails, but 9 september is parsed. Important points to note:
Lexemes can be both discarded and required, when they are both a longest token. This is why the :discard and Ws rule don't interfere with each other. Marpa doesn't mind this kind of “ambiguity”.
The Ws rule is enclosed in parens, which discards the value – to keep the resulting parse tree clean.
You do not usually want to use tricks like phantom lexemes to misguide the parser. That way lies breakage.
When every bit of whitespace is important, you might want to get rid of :discard ~ whitespace. This is meant to be used e.g. for C-like languages where whitespace traditionally does not matter.
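The alternative mentioned in the question (lex a generic word first, then classify it by value) can be sketched outside Marpa as follows. This is a Python illustration only; the keyword set and function name are assumptions made for the example.

```python
import re

# Longest-match lexemes: a run of word characters, or one punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")
KEYWORDS = {"9", "september"}  # hypothetical keyword set for this example

def tokenize(text):
    tokens = []
    for lexeme in TOKEN.findall(text):
        # A word is only accepted if the whole run of \w characters is a keyword.
        if re.fullmatch(r"\w+", lexeme) and lexeme not in KEYWORDS:
            raise ValueError(f"unknown word: {lexeme!r}")
        tokens.append(lexeme)
    return tokens

print(tokenize("9 september"))   # ['9', 'september']
print(tokenize("9,september"))   # ['9', ',', 'september']
```

Because `\w+` always grabs the longest run, `9september` is seen as one unknown word and rejected, while whitespace or punctuation between the keywords lets both through.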

Marpa parser can't seem to cope with optional first symbol?

I've been getting to grips with the Marpa parser and encountered a problem when the first symbol is optional. Here's an example:
use strict;
use warnings;
use 5.10.0;
use Marpa::R2;
use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= <optional a> 'X'
<optional a> ~ a *
a ~ 'a'
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\"X");
When I run this, I get the following error:
Error in SLIF parse: No lexemes accepted at line 1, column 1
* String before error:
* The error was at line 1, column 1, and at character 0x0058 'X', ...
* here: X
Marpa::R2 exception at small.pl line 20
at /usr/local/lib/perl/5.14.2/Marpa/R2.pm line 126
Marpa::R2::exception('Error in SLIF parse: No lexemes accepted at line 1, column 1\x{a}...') called at /usr/local/lib/perl/5.14.2/Marpa/R2/Scanless.pm line 1545
Marpa::R2::Scanless::R::read_problem('Marpa::R2::Scanless::R=ARRAY(0x95cbfd0)', 'no lexemes accepted') called at /usr/local/lib/perl/5.14.2/Marpa/R2/Scanless.pm line 1345
Marpa::R2::Scanless::R::resume('Marpa::R2::Scanless::R=ARRAY(0x95cbfd0)', 0, -1) called at /usr/local/lib/perl/5.14.2/Marpa/R2/Scanless.pm line 926
Marpa::R2::Scanless::R::read('Marpa::R2::Scanless::R=ARRAY(0x95cbfd0)', 'SCALAR(0x95aeb1c)') called at small.pl line 20
Perl version 5.14.2 (debian wheezy)
Marpa version 2.068000
(I see there's a brand new Marpa 2.069 that I haven't tried yet)
Is this something I'm doing wrong in my grammar?
In Marpa Scanless, your grammar has two levels: The main, high-level grammar where you can attribute actions and such, and the low-level lexing grammar. They are executed independently (which is expected if you have used traditional parser/lexers, but is very confusing when you come from regexes to Marpa).
Now on the low-level grammar, Marpa recognizes your input as a single X, not “zero a's and then an X”. However, the high-level grammar requires the optional a symbol to be present.
The best way around that is to make the a optional in the high-level grammar:
<optional a> ::= <many a>
<optional a> ::= # empty
<many a> ~ a* # would work the same here with "a+"
a ~ 'a'
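The split between the two levels can be mimicked with a small Python sketch (not Marpa; the names are hypothetical). The lexer only ever emits tokens that consume input, so it can never produce an empty "zero a's" token; optionality has to be handled by the parser.

```python
import re

# Low level: every lexeme consumes at least one character.
LEXEME = re.compile(r"(?P<MANY_A>a+)|(?P<X>X)")

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = LEXEME.match(text, pos)
        if m is None:
            raise ValueError(f"no lexeme accepted at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(tokens):
    # High level: Rule ::= <optional a> 'X'
    #             <optional a> ::= <many a> | (empty)
    i = 0
    if i < len(tokens) and tokens[i][0] == "MANY_A":
        i += 1  # the optional part was present
    return i + 1 == len(tokens) and tokens[i][0] == "X"

print(parse(lex("X")))    # True
print(parse(lex("aaX")))  # True
print(parse(lex("aa")))   # False
```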

Valid identifier characters in Scala

One thing I find quite confusing is knowing which characters and combinations I can use in method and variable names. For instance
val #^ = 1 // legal
val # = 1 // illegal
val + = 1 // legal
val &+ = 1 // legal
val &2 = 1 // illegal
val £2 = 1 // legal
val ¬ = 1 // legal
As I understand it, there is a distinction between alphanumeric identifiers and operator identifiers. You can mix and match one or the other but not both, unless separated by an underscore (a mixed identifier).
From Programming in Scala section 6.10,
An operator identifier consists of one or more operator characters.
Operator characters are printable ASCII characters such as +, :, ?, ~
or #.
More precisely, an operator character belongs to the Unicode set
of mathematical symbols(Sm) or other symbols(So), or to the 7-bit
ASCII characters that are not letters, digits, parentheses, square
brackets, curly braces, single or double quote, or an underscore,
period, semi-colon, comma, or back tick character.
So we are excluded from using ()[]{}'"_.;, and `
I looked up Unicode mathematical symbols on Wikipedia, but the ones I found didn't include +, :, ? etc. Is there a definitive list somewhere of what the operator characters are?
Also, any ideas why Unicode mathematical operators (rather than symbols) do not count as operators?
Working from the EBNF syntax in the spec:
upper ::= ‘A’ | ... | ‘Z’ | ‘$’ | ‘_’ and Unicode category Lu
lower ::= ‘a’ | ... | ‘z’ and Unicode category Ll
letter ::= upper | lower and Unicode categories Lo, Lt, Nl
digit ::= ‘0’ | ... | ‘9’
opchar ::= “all other characters in \u0020-007F and Unicode
categories Sm, So except parentheses ([]) and periods”
But also taking into account the very beginning on Lexical Syntax that defines:
Parentheses ‘(’ | ‘)’ | ‘[’ | ‘]’ | ‘{’ | ‘}’.
Delimiter characters ‘`’ | ‘'’ | ‘"’ | ‘.’ | ‘;’ | ‘,’
Here is what I come up with. Working by elimination in the range \u0020-007F, eliminating letters, digits, parentheses and delimiters, we have for opchar... (drumroll):
! # % & * + - / : < = > ? @ \ ^ | ~
and also Sm and So - except for parentheses and periods.
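The elimination for the ASCII range can be reproduced mechanically. This is a small Python sketch; the variable names are my own, and the excluded sets are the ones quoted from the spec above (letters including $ and _, digits, parentheses, delimiters), plus the space and DEL, which are not printable symbols.

```python
import string

# Printable ASCII without the space (0x20) and DEL (0x7F).
printable = [chr(c) for c in range(0x21, 0x7F)]

letters = set(string.ascii_letters) | {"$", "_"}  # the spec counts $ and _ as letters
digits = set(string.digits)
parens = set("()[]{}")
delims = set("`'\".;,")
excluded = letters | digits | parens | delims

opchars = [c for c in printable if c not in excluded]
print(" ".join(opchars))
# ! # % & * + - / : < = > ? @ \ ^ | ~
```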
(Edit: adding valid examples here.) In summary, here are some valid examples that highlight all cases. Watch out for \ in the REPL; I had to escape it as \\:
val !#%&*+-/:<=>?@\^|~ = 1 // all simple opchars
val simpleName = 1
val withDigitsAndUnderscores_ab_12_ab12 = 1
val wordEndingInOpChars_!#%&*+-/:<=>?@\^|~ = 1
val !^©® = 1 // opchars and symbols
val abcαβγ_!^©® = 1 // mixing unicode letters and symbols
Note 1:
I found this Unicode category index to figure out Lu, Ll, Lo, Lt, Nl:
Lu (uppercase letters)
Ll (lowercase letters)
Lo (other letters)
Lt (titlecase)
Nl (letter numbers like roman numerals)
Sm (symbol math)
So (symbol other)
Note 2:
val #^ = 1 // legal - two opchars
val # = 1 // illegal - reserved word like class or => or @
val + = 1 // legal - opchar
val &+ = 1 // legal - two opchars
val &2 = 1 // illegal - opchar and letter do not mix arbitrarily
val £2 = 1 // working - £ is part of Sc (Symbol currency) - undefined by spec
val ¬ = 1 // legal - part of Sm
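The Unicode general category of any character can be checked directly. Here is a short sketch using Python's standard `unicodedata` module (Python is used only as a convenient lookup tool; it says nothing about how the Scala compiler itself classifies characters).

```python
import unicodedata

# Look up the general category of the non-ASCII symbols used in the examples.
for ch in "£¬©®":
    print(ch, unicodedata.category(ch))
# £ Sc
# ¬ Sm
# © So
# ® So
```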
Note 3:
Other operator-looking things that are reserved words: _ : = => <- <: <% >: # @ and also \u21D2 ⇒ and \u2190 ←
The language specification gives the rule in Chapter 1, Lexical Syntax (on page 3):
Operator characters. These consist of all printable ASCII
characters \u0020-\u007F which are in none of the sets above,
mathematical symbols (Sm) and other symbols (So).
This is basically the same as your extract from Programming in Scala. Whatever its Unicode category, + is definitely an ASCII printable character not listed above (not a letter, including _ or $, a digit, a parenthesis, a delimiter).
In your list:
# is illegal not because the character is not an operator character
(#^ is legal), but because it is a reserved word (on page 4), for type projection.
&2 is illegal because you mix an operator character & and a non-operator character, digit 2
£2 is legal because £ is not an operator character: it is not a seven bit ASCII, but 8 bit extended ASCII. It is not nice, as $ is not one either (it is considered a letter).
Use backticks to escape these limitations and use Unicode symbols:
val `r→f` = 150
println(`r→f`)

Lisp grammar in yacc

I am trying to build a Lisp grammar. Easy, right? Apparently not.
I present these inputs and receive errors...
( 1 1)
23 23 23
ui ui
This is the grammar...
%%
sexpr: atom {printf("matched sexpr\n");}
| list
;
list: '(' members ')' {printf("matched list\n");}
| '('')' {printf("matched empty list\n");}
;
members: sexpr {printf("members 1\n");}
| sexpr members {printf("members 2\n");}
;
atom: ID {printf("ID\n");}
| NUM {printf("NUM\n");}
| STR {printf("STR\n");}
;
%%
As near as I can tell, I need a single non-terminal defined as a program, upon which the whole parse tree can hang. But I tried it and it didn't seem to work.
edit - this was my "top terminal" approach:
program: slist;
slist: slist sexpr | sexpr;
But it allows problems such as:
( 1 1
Edit2: The FLEX code is...
%{
#include <stdio.h>
#include "a.yacc.tab.h"
int linenumber;
extern int yylval;
%}
%%
\n { linenumber++; }
[0-9]+ { yylval = atoi(yytext); return NUM; }
\"[^\"\n]*\" { return STR; }
[a-zA-Z][a-zA-Z0-9]* { return ID; }
.
%%
An example of the over-matching...
(1 1 1)
NUM
matched sexpr
NUM
matched sexpr
NUM
matched sexpr
(1 1
NUM
matched sexpr
NUM
matched sexpr
What's the error here?
edit: The error was in the lexer.
Lisp grammar cannot be represented as a context-free grammar, and yacc cannot parse all Lisp code.
This is because of Lisp features such as read-time evaluation and the programmable reader. So, just to read arbitrary Lisp code, you need to have a full Lisp running. This is not some obscure, unused feature; it is actually used, e.g. by CL-INTERPOL and CL-SQL.
If the goal is to parse a subset of Lisp, then the program text is a sequence of sexprs.
The error is really in the lexer. Your parentheses are matched by the final "." rule in the lexer and discarded, so they never show up as parentheses in the parser.
Add rules like
\) { return RPAREN; }
\( { return LPAREN; }
to the lexer and change all occurrences of '(' and ')' to LPAREN and RPAREN respectively in the parser. (Also, you need to declare LPAREN and RPAREN, e.g. with %token, where you define your token list.)
Note: I'm not sure about the syntax, could be the backslashes are wrong.
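The effect of the catch-all rule can be reproduced with a small Python sketch of the same lexer (an illustration only, not flex; the `with_parens` switch is a hypothetical flag that toggles between the original and the fixed behaviour).

```python
import re

# Rules are tried in order (regex alternation is first-match, unlike flex's
# longest-match, so the catch-all must stay last).
TOKEN_SPEC = [
    ("NUM", r"[0-9]+"),
    ("STR", r'"[^"\n]*"'),
    ("ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP", r"."),  # catch-all, like the final '.' rule: matched and dropped
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

def tokens(text, with_parens=True):
    out = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        # Without dedicated rules, parentheses fall through to the catch-all.
        if kind == "SKIP" or (not with_parens and kind in ("LPAREN", "RPAREN")):
            continue
        out.append(kind)
    return out

print(tokens("(1 1", with_parens=False))  # ['NUM', 'NUM']
print(tokens("(1 1", with_parens=True))   # ['LPAREN', 'NUM', 'NUM']
```

In the first case the parser only ever sees two NUM tokens, which is why the unbalanced input appeared to be accepted.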
You are correct in that you need to define a non-terminal. That would be defined as a set of sexpr. I'm not sure of the YACC syntax for that. I'm partial to ANTLR for parser generators and the syntax would be:
program: sexpr*
Indicating 0 or more sexpr.
Update with YACC syntax:
program : /* empty */
| program sexpr
;
Not in YACC, but might be helpful anyway: here's a full grammar in ANTLR v3 that works for the cases you described (it excludes strings in the lexer because they are not important for this example, and uses C# console output because that's what I tested it with):
program: (sexpr)*;
sexpr: list
| atom {Console.WriteLine("matched sexpr");}
;
list:
'('')' {Console.WriteLine("matched empty list");}
| '(' members ')' {Console.WriteLine("matched list");}
;
members: (sexpr)+ {Console.WriteLine("members 1");};
atom: Id {Console.WriteLine("ID");}
| Num {Console.WriteLine("NUM");}
;
Num: ( '0' .. '9')+;
Id: ('a' .. 'z' | 'A' .. 'Z')+;
Whitespace : ( ' ' | '\r' '\n' | '\n' | '\t' ) {Skip();};
This won't work exactly as is in YACC because YACC generates an LALR parser while ANTLR is a modified recursive descent. There is a C/C++ output target for ANTLR if you wanted to go that way.
Do you necessarily need a yacc/bison parser? A "reads a subset of lisp syntax" reader isn't that hard to implement in C (start with a read_sexpr function, dispatch to a read_list when you see a '(', that in turn builds a list of contained sexprs until a ')' is seen; otherwise, call a read_atom that collects an atom and returns it when it can no longer read atom-constituent characters).
However, if you want to be able to read arbitrary Common Lisp, you'll need to (at the worst) implement a Common Lisp, as CL can modify the reader at run-time (and even switch between different read-tables at run-time under program control; quite handy when you're wanting to load code written in another language or dialect of lisp).
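The read_sexpr / read_list / read_atom structure described above can be sketched as follows. The sketch is in Python, not C, to keep it short; strings and quoting are left out, and the helpers `skip_ws` and `read_program` are hypothetical additions for the example.

```python
def skip_ws(text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos

def read_sexpr(text, pos=0):
    """Read one s-expression; return (value, position after it)."""
    pos = skip_ws(text, pos)
    if pos >= len(text):
        raise SyntaxError("unexpected end of input")
    if text[pos] == "(":
        return read_list(text, pos + 1)
    if text[pos] == ")":
        raise SyntaxError(f"unexpected ')' at offset {pos}")
    return read_atom(text, pos)

def read_list(text, pos):
    # Collect contained sexprs until the matching ')' is seen.
    items = []
    while True:
        pos = skip_ws(text, pos)
        if pos >= len(text):
            raise SyntaxError("missing ')'")
        if text[pos] == ")":
            return items, pos + 1
        item, pos = read_sexpr(text, pos)
        items.append(item)

def read_atom(text, pos):
    # Consume atom-constituent characters: anything but whitespace and parens.
    start = pos
    while pos < len(text) and not text[pos].isspace() and text[pos] not in "()":
        pos += 1
    word = text[start:pos]
    is_number = word.isascii() and word.isdigit()
    return (int(word) if is_number else word), pos

def read_program(text):
    # A program is a sequence of zero or more sexprs.
    forms, pos = [], skip_ws(text, 0)
    while pos < len(text):
        form, pos = read_sexpr(text, pos)
        forms.append(form)
        pos = skip_ws(text, pos)
    return forms

print(read_program("( 1 1) 23 ui"))  # [[1, 1], 23, 'ui']
```

Unbalanced input such as `( 1 1` raises a SyntaxError here, because read_list reaches the end of the text before finding its closing parenthesis.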
It's been a long time since I worked with YACC, but you do need a top-level non-terminal. Could you be more specific about "tried it" and "it didn't seem to work"? Or, for that matter, what the errors are?
I'd also suspect that YACC might be overkill for such a syntax-light language. Something simpler (like recursive descent) might work better.
You could try this grammar here.
I just tried it, my "yacc lisp grammar" works fine :
%start exprs
exprs:
| exprs expr
/// if you prefer right recursion :
/// | expr exprs
;
list:
'(' exprs ')'
;
expr:
atom
| list
;
atom:
IDENTIFIER
| CONSTANT
| NIL
| '+'
| '-'
| '*'
| '^'
| '/'
;