Avoiding nested objects using ModelBuilderSemantics in Grako

If you take a look at the grammar below, you can see a primary rule, expression, which gets parsed into more specific expression types.
expression::Expression
=
or_ex:and_expr {'||' or_ex:and_expr}+
| andex:and_expr
;
and_expr::AndExpression
=
and_ex:sub_expr {'&&' and_ex:sub_expr}+
| subex:sub_expr
;
sub_expr::SubExpression
=
{'!!'}* '!(' not_ex:expression ')'
| {'!!'}* '(' sub_ex:expression ')'
| compex:comp_expr
;
comp_expr::CompareExpression
=
comp:identifier operator:('>=' | '<=' | '==' | '!=' | '>' | '<') comp:identifier
;
identifier::str
=
?/[a-zA-Z][A-Za-z0-9_]*/?
;
Parsing the test_input below works as expected, but I would prefer to label the and_expr element in the expression rule with an '@' instead of 'andex'. My hope was that the parsed output would contain only a CompareExpression object inside a not_ex element of an Expression object.
!(a == b)
It seems that when the '@' label is used on the and_expr element, no attributes are shown in the Expression object. Is this a bug or intentional? Must I label all elements with names and avoid the '@' label when using ModelBuilderSemantics?
Another issue I've been facing is that if a later rule, such as comp_expr, did not have an associated class name, its elements would appear in a dictionary when printed, but the dot notation accessor would fail with an AttributeError, i.e. "AttributeError: 'dict' object has no attribute 'comp'". Is there any way to use the dot notation accessor even when rules do not have class names associated with them?

Some of the criteria I use:
Not every rule must have an associated Node class.
Rules with a closure {} as main expression are good for returning a list.
Rules with a choice | as main expression are best returning whatever the successful option returns, even if this often requires factoring the option into its own rule.
Precedence is important.
Etc.
The idea is that the generated parse model should be easy to use, especially with walkers, with a minimum of if-else or isinstance().
This is how I would do your example:
start
=
expression $
;
expression
=
| or_expr
| and_expr
| sub_expr
;
or_expr::OrExpression
=
operands:'||'.{and_expr}+
;
and_expr::AndExpression
=
operands:'&&'.{sub_expr}+
;
sub_expr
=
| not_expr
| comp_expr
| atomic
;
not_expr::NotExpression
=
'!!' ~ sub_expr
;
comp_expr::CompareExpression
=
left:atomic operator:('>=' | '<=' | '==' | '!=' | '>' | '<') ~ right:atomic
;
atomic
=
| group_expr
| identifier
;
group_expr::GroupExpression
=
'(' ~ expre:expression ')'
;
identifier::str
=
/[a-zA-Z][A-Za-z0-9_]*/
;

Related

Using Reactive Extensions, how can I ignore a sequence of characters based on delimiters?

I have an app that uses Rx to receive data from a device on the serial port. So I have an IObservable<char> that I slice and dice into various strings. However, the device vendor added some debugging information that is enclosed in braces:
interesting stuff {debug stuff} interesting stuff
source ---a-b-c-{-d-e-b-u-g-}-d-e-f---|
| | | | | |
output ---a-b-c---------------d-e-f---|
I need to filter out (discard, ignore) the {debug stuff} from my character sequence. Is there a simple way to do that? Something like "when you see this character, ignore elements until you see this other character".
I looked at Until but that would terminate the sequence and I don't want that to happen...
This should do what you want, assuming no nested or unbalanced brackets.
source
.Scan((prev, c) =>
{
if (prev == '{')
return c == '}' ? c : '{';
else
return c;
})
.Where(c => c != '{' && c != '}')
It converts everything after the { into { until the }, then filters out all braces. The diagrammed output is:
source ---a-b-c-{-d-e-b-u-g-}-d-e-f---|
scan ---a-b-c-{-{-{-{-{-{-}-d-e-f---|
| | | | | |
where ---a-b-c---------------d-e-f---|
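For readers who want to trace the logic outside Rx, the same scan-then-filter idea can be sketched with plain Python iterators. This is only a transliteration under the same assumption (no nested or unbalanced braces); strip_debug is a hypothetical name, and itertools.accumulate plays the role of Scan.

```python
from itertools import accumulate


def strip_debug(chars):
    """Drop everything between '{' and '}' (inclusive) from a character stream."""

    def step(prev, c):
        # While inside braces, keep emitting '{' until the closing '}' arrives.
        if prev == "{":
            return c if c == "}" else "{"
        return c

    scanned = accumulate(chars, step)               # equivalent of Scan
    return (c for c in scanned if c not in "{}")    # equivalent of Where


print("".join(strip_debug("abc{debug}def")))
```

Like Scan without a seed, accumulate passes the first element through unchanged, so a stream that starts with '{' is handled the same way.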

Extracting specific data from a string with regex using Powershell

I'm returning some data like this in PowerShell:
1)Open;#1
2)Open;#1;#Close;#2;#pending;#6
3)Closed;#5
But I want output like this:
1)1 Open
2)
1 Open
2 Close
6 pending
3)
5 Closed
The code:
$lookupitem = $lookupList.Items
$CMRSItems = $list.Items | where {$_['ID'] -le 5}
$CMRSItems | ForEach-Object {
$realval = $_['EventType']
Write-Host "RefNumber: " $_['RefID']
Write-Host $realval
}
Any help would be appreciated as my powershell isn't that good.
Without regular expressions, you could do something like the following:
Ignore everything up to the first ')' character
Split the string on the ';' character
foreach pair of the split string
the state is the first part (ignore potentially leading '#')
the number is the second part (ignore leading '#')
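The steps above can be sketched as follows. The sketch is in Python rather than PowerShell, purely to show the splitting logic; parse_lookup is a hypothetical name, and the input format is assumed to be exactly the one shown in the question.

```python
def parse_lookup(value):
    """Return (number, state) pairs from a string such as '2)Open;#1;#Close;#2'."""
    # Ignore everything up to the first ')' character, if there is one.
    rest = value.partition(")")[2] if ")" in value else value
    # Split on ';' and drop the leading '#' of each part.
    parts = [part.lstrip("#") for part in rest.split(";")]
    # Even positions hold states, odd positions hold numbers.
    return [(number, state) for state, number in zip(parts[0::2], parts[1::2])]


for number, state in parse_lookup("2)Open;#1;#Close;#2;#pending;#6"):
    print(number, state)
```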
Or you could do it using the .NET System.Text.RegularExpressions.Regex class with the following regular expression:
(?:#?(?<state>[a-zA-Z]+);#(?<number>\d+);?)
The Matches method returns a MatchCollection with one Match per state/number pair. Each Match exposes a Groups collection containing two named groups, state and number.
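As a rough illustration of how the named groups are read back, here is the same pattern in Python, where named groups are written (?P<name>...) instead of .NET's (?<name>...). PAIR and parse_lookup_regex are hypothetical names, and the pattern is written here to accept numbers of more than one digit.

```python
import re

# One match per "state;#number" pair; the leading '#' is optional for the first pair.
PAIR = re.compile(r"#?(?P<state>[A-Za-z]+);#(?P<number>\d+)")


def parse_lookup_regex(value):
    """Return (number, state) pairs found anywhere in the string."""
    return [(m.group("number"), m.group("state")) for m in PAIR.finditer(value)]


for number, state in parse_lookup_regex("2)Open;#1;#Close;#2;#pending;#6"):
    print(number, state)
```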

xText Variable/Attribute Assignment

I built a grammar in xText to recognize formal expressions of a specific format
and to use the generated object tree in Java.
This is what it looks like:
grammar eu.gemtec.device.espa.texpr.Texpr with org.eclipse.xtext.common.Terminals
generate texpr "http://www.gemtec.eu/device/espa/texpr/Texpr"
Model:
(expressions+=AbstractExpression)*
;
AbstractExpression:
MatcherExpression | Assignment;
MatcherExpression:
TerminalMatcher ({Operation.left=current} operator='or' right= MatcherExpression)?
;
TerminalMatcher returns MatcherExpression:
'(' MatcherExpression ')' | {MatcherLiteral} value=Literal
;
Literal:
CharMatcher | ExactMatcher
;
CharMatcher:
type=('text'|'number'|'symbol'|'whitespace') ('(' cardinality=Cardinality ')')?
;
/* Cardinalities for CharMatcher */
Cardinality:
CardinalityMin | CardinalityMinMax | CardinalityMax| CardinalityExact
;
CardinalityMin: min=INT '->';
CardinalityMinMax: min=INT '->' max=INT;
CardinalityMax: '->' max=INT;
CardinalityExact: exact=INT;
ExactMatcher:
(ignoreCase='ignoreCase''(' expected=STRING ')') | expected=STRING
;
/* Variable assignment
*
* e.g. $myVar=number
* */
Assignment:
'$' name=ID '=' expression=MatcherExpression
;
Everything works fine except for the 'cardinality' assignment.
The Expressions look like this:
text number(3) - (an arbitrary amount of letters followed by exactly 3 numbers)
symbol number(2->) - (an arbitrary amount of special characters followed by at least 2 numbers)
whitespace number(->4) - (an arbitrary amount of whitespaces followed by a maximum of 4 numbers)
number(3->6) - (at least 3 numbers but not more than 6)
When I run Eclipse with this grammar (so that my language is recognized and has code completion and so on), everything I type is shown in the "Outline"-tab as a tree-structure as it should, except for the cardinality values.
When I add a cardinality statement to a CharMatcher, the little plus appears before it, but when I click on it it just disappears.
Can anyone tell me why this does not work?
I found the solution myself. I think the problem was that the compiler could not decide which class to create at this point:
Cardinality:
CardinalityMin | CardinalityMinMax | CardinalityMax| CardinalityExact
;
CardinalityMin: min=INT '->';
CardinalityMinMax: min=INT '->' max=INT;
CardinalityMax: '->' max=INT;
CardinalityExact: exact=INT;
So I simplified the whole thing a little, it now looks like this:
Cardinality:
CardinalityMinMax | CardinalityExact
;
CardinalityMinMax: (min=INT '..' max=INT) | (min=INT '..') | ('..' max=INT);
CardinalityExact: exact=INT;
It is still not shown in the "Outline"-Tab, but I suppose that is a problem of the visualisation.
The generated classes now work as intended.
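Outside of Xtext, the idea of folding the open-ended and closed ranges into one min/max shape can be sketched in a few lines of Python. This is only an illustration of the data model, not of the generated classes: Cardinality and parse_cardinality are hypothetical names, the sketch uses the '->' separator from the example expressions in the question, and representing an exact count as min == max is an assumption of this example.

```python
import re
from typing import NamedTuple, Optional


class Cardinality(NamedTuple):
    min: Optional[int]   # None means "no lower bound given"
    max: Optional[int]   # None means "no upper bound given"


_EXACT = re.compile(r"\d+")
_RANGE = re.compile(r"(\d+)?->(\d+)?")


def parse_cardinality(text):
    """Parse '3', '2->', '->4' or '3->6' into a Cardinality."""
    text = text.strip()
    if _EXACT.fullmatch(text):
        n = int(text)
        return Cardinality(n, n)
    m = _RANGE.fullmatch(text)
    # A bare '->' with neither bound is rejected.
    if m and (m.group(1) or m.group(2)):
        lo, hi = m.group(1), m.group(2)
        return Cardinality(int(lo) if lo else None, int(hi) if hi else None)
    raise ValueError("not a cardinality: " + repr(text))


print(parse_cardinality("3->6"))
```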

Handling multiple return values in ANTLR

I have a simple rule in ANTLR:
title returns [ElementVector<Element> v]
@init {
$v = new ElementVector<Element>() ;
}
: '[]'
| '[' title_args {$v.add($title_args.ele);} (',' title_args {$v = $title_args.ele ;})* ']'
;
with title_args being:
title_args returns [Element ele]
: author {$ele = new Element("author", $author.text); }
| location {$ele = new Element("location", $location.text); }
;
When I try to compile that, I get error 127 in the title rule: title_args is a non-unique reference.
I've followed the solution given to a similar question on this website (How to deal with list return values in ANTLR); however, it only seems to work with lexical rules.
Is there a specific way to get around it?
Thank you,
Christos
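You have two title_args references in the same alternative, so you need to alias them. Try this:
| '[' t1=title_args {$v.add($t1.ele);} (',' t2=title_args {$v.add($t2.ele);})* ']'
t1 and t2 are arbitrary aliases; you can choose anything you want as long as each action refers to the matching alias. Note that the second action calls $v.add(...) rather than assigning to $v, since $v is an ElementVector<Element> and $t2.ele is a single Element.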
I think the problem is that you're reusing the title_args variable. Try changing one of those variable names.
Yeah, I had the same problem.
You need to change one of the variable names; for example, do like the following:
title_args
title_args1
in your code instead of using title_args twice.
If title_args is a parser rule, then just create the same rule with the name title_args1.
So, basically there would be two rules with the same functionality.

Lisp grammar in yacc

I am trying to build a Lisp grammar. Easy, right? Apparently not.
I present these inputs and receive errors...
( 1 1)
23 23 23
ui ui
This is the grammar...
%%
sexpr: atom {printf("matched sexpr\n");}
| list
;
list: '(' members ')' {printf("matched list\n");}
| '('')' {printf("matched empty list\n");}
;
members: sexpr {printf("members 1\n");}
| sexpr members {printf("members 2\n");}
;
atom: ID {printf("ID\n");}
| NUM {printf("NUM\n");}
| STR {printf("STR\n");}
;
%%
As near as I can tell, I need a single non-terminal defined as a program, upon which the whole parse tree can hang. But I tried it and it didn't seem to work.
edit - this was my "top terminal" approach:
program: slist;
slist: slist sexpr | sexpr;
But it allows problems such as:
( 1 1
Edit2: The FLEX code is...
%{
#include <stdio.h>
#include "a.yacc.tab.h"
int linenumber;
extern int yylval;
%}
%%
\n { linenumber++; }
[0-9]+ { yylval = atoi(yytext); return NUM; }
\"[^\"\n]*\" { return STR; }
[a-zA-Z][a-zA-Z0-9]* { return ID; }
.
%%
An example of the over-matching...
(1 1 1)
NUM
matched sexpr
NUM
matched sexpr
NUM
matched sexpr
(1 1
NUM
matched sexpr
NUM
matched sexpr
What's the error here?
edit: The error was in the lexer.
The full Lisp grammar cannot be represented as a context-free grammar, and yacc cannot parse all Lisp code.
This is because of Lisp features such as read-time evaluation and the programmable reader. So, just to read arbitrary Lisp code, you need to have a full Lisp running. This is not some obscure, unused feature; it is actually used, e.g., by CL-INTERPOL and CL-SQL.
If the goal is to parse a subset of lisp, then the program text is a sequence of sexprs.
The error is really in the lexer. Your parentheses end up as the last "." in the lexer, and don't show up as parentheses in the parser.
Add rules like
\) { return RPAREN; }
\( { return LPAREN; }
to the lexer and change all occurrences of '(' and ')' to LPAREN and RPAREN respectively in the parser. (Also, you need to declare LPAREN and RPAREN with %token where you define your token list.)
Note: I'm not sure about the syntax, could be the backslashes are wrong.
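The effect of the catch-all rule can be reproduced with a small Python tokenizer. This is only a sketch that mimics the lexer from the question, not flex itself; TOKEN and tokenize are hypothetical names, and the keep_parens flag switches between the original behaviour (parentheses discarded by the final "." rule) and the fixed one.

```python
import re

# Alternatives are tried in order, like the rules in the flex file.
TOKEN = re.compile(
    r'(?P<NUM>[0-9]+)'
    r'|(?P<STR>"[^"\n]*")'
    r'|(?P<ID>[A-Za-z][A-Za-z0-9]*)'
    r'|(?P<PAREN>[()])'
    r'|(?P<OTHER>.)',
    re.S,
)


def tokenize(text, keep_parens):
    """Return the token kinds the parser would see."""
    tokens = []
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        if kind in ("NUM", "STR", "ID"):
            tokens.append(kind)
        elif kind == "PAREN" and keep_parens:
            tokens.append("LPAREN" if m.group() == "(" else "RPAREN")
        # Anything else is silently discarded, as with the "." rule.
    return tokens


print(tokenize("(1 1", keep_parens=False))
print(tokenize("(1 1", keep_parens=True))
```

With keep_parens=False the parser only ever receives NUM tokens, so it has no way to notice that "(1 1" is missing its closing parenthesis.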
You are correct in that you need to define a non-terminal. That would be defined as a set of sexpr. I'm not sure of the YACC syntax for that. I'm partial to ANTLR for parser generators and the syntax would be:
program: sexpr*
Indicating 0 or more sexpr.
Update with YACC syntax:
program : /* empty */
| program sexpr
;
It's not YACC, but it might be helpful anyway: here's a full grammar in ANTLR v3 that works for the cases you described (it excludes strings in the lexer because they're not important for this example, and it uses C# console output because that's what I tested it with):
program: (sexpr)*;
sexpr: list
| atom {Console.WriteLine("matched sexpr");}
;
list:
'('')' {Console.WriteLine("matched empty list");}
| '(' members ')' {Console.WriteLine("matched list");}
;
members: (sexpr)+ {Console.WriteLine("members 1");};
atom: Id {Console.WriteLine("ID");}
| Num {Console.WriteLine("NUM");}
;
Num: ( '0' .. '9')+;
Id: ('a' .. 'z' | 'A' .. 'Z')+;
Whitespace : ( ' ' | '\r' '\n' | '\n' | '\t' ) {Skip();};
This won't work exactly as is in YACC because YACC generates an LALR parser while ANTLR generates a modified recursive-descent parser. There is a C/C++ output target for ANTLR if you wanted to go that way.
Do you necessarily need a yacc/bison parser? A "reads a subset of Lisp syntax" reader isn't that hard to implement in C: start with a read_sexpr function and dispatch to a read_list when you see a '('; read_list in turn builds a list of contained sexprs until a ')' is seen. Otherwise, call a read_atom that collects an atom and returns it when it can no longer read atom-constituent characters.
However, if you want to be able to read arbitrary Common Lisp, you'll need to (at the worst) implement a Common Lisp, as CL can modify the reader at run time (and even switch between different read-tables at run time under program control, which is quite handy when you want to load code written in another language or dialect of Lisp).
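A minimal sketch of the reader described above, written in Python rather than C to keep it short. The function names read_sexpr, read_list and read_atom follow the answer; tokenize and read_program are hypothetical helpers. As a simplification, the sketch splits the input into tokens up front instead of reading character by character, and it does not handle string literals containing spaces.

```python
def tokenize(text):
    """Split the input into '(' , ')' and atom tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read_atom(token):
    """Turn a token into a number or leave it as a symbol name."""
    return int(token) if token.isascii() and token.isdigit() else token


def read_list(tokens, pos):
    """Read sexprs until the matching ')'; pos points just after the '('."""
    items = []
    while pos < len(tokens) and tokens[pos] != ")":
        item, pos = read_sexpr(tokens, pos)
        items.append(item)
    if pos >= len(tokens):
        raise SyntaxError("missing ')'")
    return items, pos + 1


def read_sexpr(tokens, pos):
    """Read one sexpr starting at pos; return (value, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        return read_list(tokens, pos + 1)
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return read_atom(token), pos + 1


def read_program(text):
    """A program is a sequence of zero or more sexprs."""
    tokens = tokenize(text)
    forms, pos = [], 0
    while pos < len(tokens):
        form, pos = read_sexpr(tokens, pos)
        forms.append(form)
    return forms


print(read_program("(1 1) 23 23 23 ui ui"))
```

Unlike the over-matching grammar in the question, this reader rejects "( 1 1" because read_list reaches the end of the input before finding the closing parenthesis.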
It's been a long time since I worked with YACC, but you do need a top-level non-terminal. Could you be more specific about "tried it" and "it didn't seem to work"? Or, for that matter, what the errors are?
I'd also suspect that YACC might be overkill for such a syntax-light language. Something simpler (like recursive descent) might work better.
You could try this grammar here.
I just tried it, and my "yacc lisp grammar" works fine:
%start exprs
exprs:
| exprs expr
/// if you prefer right recursion :
/// | expr exprs
;
list:
'(' exprs ')'
;
expr:
atom
| list
;
atom:
IDENTIFIER
| CONSTANT
| NIL
| '+'
| '-'
| '*'
| '^'
| '/'
;