Avoid Custom Terminals Hiding (Suppressing) Derived Ones - eclipse

I started playing around with Xtext a few days ago and just went through the tutorials. Maybe the solution is covered somewhere in the reference, but I cannot find it quickly.
My problem is this: I wrote a simple grammar that mixes in org.eclipse.xtext.common.Terminals . Then I wanted to add a custom terminal FILE_NAME like this:
terminal FILE_NAME:
( !('/' | '\\' | ':' | '*' | '?' | '"' | '<' | '>' | '|') )+
;
That's basically what a filename is allowed to be under Windows. However, by doing that, inherited rules like ID, INT, etc. are never matched, because inherited terminals are always generated after custom ones, and FILE_NAME matches almost any character sequence.
Can that kind of problem be avoided gracefully (with as little repetition and as much generality as possible)? Thanks in advance!

Terminal rules (aka lexer rules) are used to tokenize the input sequence. IMHO there should be a minimum of semantics in terminal rules.
You try to express a specialized parser rule which accepts only valid file names.
Have a look at parser phases described in the Xtext Documentation [1]. My suggestion:
Lexing: Instead of using a specialized terminal rule go with STRING.
Validation: Write a validation rule for an EClass with a 'fileName' EAttribute.
"with as little repetition and as much generality as possible"
You don't want to repeat your validation for every EClass with a 'fileName' EAttribute. If you have a hand-refined Ecore model, introduce a new super type with a 'fileName' EAttribute.
Then you can implement one general validation rule #check_fileName_is_valid(ElementWithFile).
And if you don't have a refined metamodel, use metamodel hints within your grammar. If you provide a generalized super type, Xtext's Ecore inferrer will pull up common features of the subtypes. Ex:
ElementWithFile: A | B;
A: ... 'file' fileName=STRING ...;
B: ... 'file' fileName=STRING ...;
// => Ecore: ElementWithFile.fileName<EString>
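The suggested validation can be sketched in plain Java. The class and method names here are hypothetical, and the forbidden-character set simply mirrors the original FILE_NAME terminal; in a real project this logic would live inside a method annotated with @Check in the generated validator class:

```java
import java.util.Set;

public class FileNameValidator {

    // The characters the original FILE_NAME terminal excluded
    private static final Set<Character> FORBIDDEN =
            Set.of('/', '\\', ':', '*', '?', '"', '<', '>', '|');

    // True if the string is a plausible Windows file name:
    // non-empty and free of forbidden characters
    public static boolean isValidFileName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (char c : name.toCharArray()) {
            if (FORBIDDEN.contains(c)) {
                return false;
            }
        }
        return true;
    }
}
```

From the @Check method you would raise an error on the 'fileName' EAttribute whenever this check returns false.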
[1] http://www.eclipse.org/Xtext/documentation.html#DSL

Related

Why does a similar rule in the ANTLR grammar file produce a completely different tree?

I am using the grammar file at https://github.com/antlr/grammars-v4/blob/master/sql/tsql/TSqlParser.g4. It has a built_in_functions grammar rule. I want to parse a new function, DAYZ, as a built-in function. I introduced it thus in the .g4
built_in_functions
// https://msdn.microsoft.com/en-us/library/ms173784.aspx
: BINARY_CHECKSUM '(' '*' ')' #BINARY_CHECKSUM
// https://msdn.microsoft.com/en-us/library/ms186819.aspx
| DATEADD '(' datepart=ID ',' number=expression ',' date=expression ')' #DATEADD
| DAYZ '(' date=expression ')' #DAYZ
When I use grun to test the grammar, I get unexpected results for DAYZ. For a DATEDIFF I get what I expect.
For DAYZ, I get the following tree
Why does the parser not treat DAYZ as satisfying the rule built_in_functions like it does for DATEDIFF? If the parser eventually recognizes DAYZ as an id_, it should do the same for DATEDIFF. There must be something wrong in the way I am introducing DAYZ into the grammar, but I can't figure it out. Any help appreciated, and apologies if I am not using the correct ANTLR terminology; I am a newbie to ANTLR.
I am using antlr-4.9.2-complete.jar
Move your lexer rule for DAYZ to appear before the ID rule in the TSqlLexer.g4 file.
Since the id_ rule recognizes the token, it must be being tokenized as an ID token. This will happen if your DAYZ rule definition appears after the ID rule definition.
When ANTLR finds two lexer rules that match the same string of input characters (i.e. "DAYZ"), then it will use whichever rule appears first in the grammar.
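A sketch of what that looks like (rule bodies simplified and positions illustrative; the only point that matters is that DAYZ precedes ID):

```antlr
// In TSqlLexer.g4 -- lexer rules are tried in order of appearance,
// so keyword rules must come before the catch-all ID rule
DAYZ : 'DAYZ';
// ... other keyword rules ...
ID   : [A-Z_#$@] [A-Z_#$@0-9]* ;   // simplified catch-all identifier rule
```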

Need to explain the kdb/q script to save partitioned table

I'm trying to understand this code snippet from:
https://code.kx.com/q/kb/loading-from-large-files/
so I can customize it myself (e.g. partition by hours, minutes, number of ticks, ...):
$ cat fs.q
\d .Q
/ extension of .Q.dpft to separate table name & data
/ and allow append or overwrite
/ pass table data in t, table name in n, : or , in g
k)dpfgnt:{[d;p;f;g;n;t]if[~&/qm'r:+en[d]t;'`unmappable];
{[d;g;t;i;x]#[d;x;g;t[x]i]}[d:par[d;p;n];g;r;<r f]'!r;
#[;f;`p#]#[d;`.d;:;f,r#&~f=r:!r];n}
/ generalization of .Q.dpfnt to auto-partition and save a multi-partition table
/ pass table data in t, table name in n, name of column to partition on in c
k)dcfgnt:{[d;c;f;g;n;t]*p dpfgnt[d;;f;g;n]'?[t;;0b;()]',:'(=;c;)'p:?[;();();c]?[t;();1b;(,c)!,c]}
\d .
r:flip`date`open`high`low`close`volume`sym!("DFFFFIS";",")0:
w:.Q.dcfgnt[`:db;`date;`sym;,;`stats]
.Q.fs[w r#]`:file.csv
But I couldn't find any resources with a detailed explanation. For example:
if[~&/qm'r:+en[d]t;'`unmappable];
what does it do with the parameter d?
(Promoting this to an answer as I believe it helps answer the question).
Following on from the comment chain: in order to translate the k code into q code (or simply to understand the k code) you have a few options, none of which are particularly well documented, as that would defeat the purpose of the q language - to be the wrapper which obscures the k language.
Option 1 is to inspect the built-in functions in the .q namespace
q).q
| ::
neg | -:
not | ~:
null | ^:
string | $:
reciprocal| %:
floor | _:
...
Option 2 is to inspect the q.k script which creates the above namespace (be careful not to edit/change this):
vi $QHOME/q.k
Option 3 is to lookup some of the nuggets of documentation on the code.kx website, for example https://code.kx.com/q/wp/parse-trees/#k4-q-and-qk and https://code.kx.com/q/basics/exposed-infrastructure/#unary-forms
Option 4 is to google search for reference material for other/similar versions of k, for example k2/k3. They tend to be similar-ish.
A final point to note is that in most of these examples you'll see a colon (:) after the primitives. This colon is required in q/kdb+ to use the monadic form of a primitive (most are heavily overloaded), while in k it is not required to explicitly force the monadic form. This is why where shows as &: in the q reference but will usually be just & in actual k code.
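To illustrate with where, a quick sketch in a q session (the three forms all apply the same primitive to a boolean vector):

```q
q)where 0 1 1 0b     / q keyword for "indices of true"
1 2
q)(&:)0 1 1 0b       / same primitive, forced monadic with the colon
1 2
q)k)&0 1 1 0b        / inside k code, the bare & suffices
1 2
```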

How do I write a perl6 macro to enquote text?

I'm looking to create a macro in P6 which converts its argument to a string.
Here's my macro:
macro tfilter($expr) {
quasi {
my $str = Q ({{{$expr}}});
filter-sub $str;
};
}
And here is how I call it:
my @some = tfilter(age < 50);
However, when I run the program, I obtain the error:
Unable to parse expression in quote words; couldn't find final '>'
How do I fix this?
Your use case, converting some code to a string via a macro, is very reasonable. There isn't an established API for this yet (even in my head), although I have come across and thought about the same use case. It would be nice in cases such as:
assert a ** 2 + b ** 2 == c ** 2;
This assert statement macro could evaluate its expression, and if it fails, it could print it out. Printing it out requires stringifying it. (In fact, in this case, having file-and-line information would be a nice touch also.)
(Edit: 007 is a language laboratory to flesh out macros in Perl 6.)
Right now in 007 if you stringify a Q object (an AST), you get a condensed object representation of the AST itself, not the code it represents:
$ bin/007 -e='say(~quasi { 2 + 2 })'
Q::Infix::Addition {
identifier: Q::Identifier "infix:+",
lhs: Q::Literal::Int 2,
rhs: Q::Literal::Int 2
}
This is potentially more meaningful and immediate than outputting source code. Consider also the fact that it's possible to build ASTs that were never source code in the first place. (And people are expected to do this. And to mix such "synthetic Qtrees" with natural ones from programs.)
So maybe what we're looking at is a property on Q nodes called .source or something. Then we'd be able to do this:
$ bin/007 -e='say((quasi { 2 + 2 }).source)'
2 + 2
(Note: doesn't work yet.)
It's an interesting question what .source ought to output for synthetic Qtrees. Should it throw an exception? Or just output <black box source>? Or do a best-effort attempt to turn itself into stringified source?
Coming back to your original code, this line fascinates me:
my $str = Q ({{{$expr}}});
It's actually a really cogent attempt to express what you want to do (turn an AST into its string representation). But I doubt it'll ever work as-is. In the end, it's still kind of based on a source-code-as-strings kind of thinking à la C. The fundamental issue with it is that the place where you put your {{{$expr}}} (inside of a string quote environment) is not a place where an expression AST is able to go. From an AST node type perspective, it doesn't typecheck because expressions are not a subtype of quote environments.
Hope that helps!
(PS: Taking a step back, I think you're doing yourself a disservice by making filter-sub accept a string argument. What will you do with the string inside of this function? Parse it for information? In that case you'd be better off analyzing the AST, not the string.)
(PPS: Moritz++ on #perl6 points out that there's an unrelated syntax error in age < 50 that needs to be addressed. Perl 6 is picky about things being defined before they are used; macros do not change this equation much. Therefore, the Perl 6 parser is going to assume that age is a function you haven't declared yet. Then it's going to consider the < an opening quote character. Eventually it'll be disappointed that there's no >. Again, macros don't rescue you from needing to declare your variables up-front. (Though see #159 for further discussion.))
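A minimal sketch of that last point, with no macros involved (the age sub is a hypothetical stand-in; any term declared before the call site would do):

```raku
# If 'age' were not declared yet, the parser would treat it as an unknown
# function and take '<' as the start of a quote-words list, then fail
# looking for the closing '>'. Declaring it first avoids that:
sub age { 42 }       # declared before use
say age() < 50;      # now parses as an ordinary comparison; prints True
```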

Xtext disambiguation

Given the following grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Program:
{Range} ID '.' '.' ID
| {Group} ID ID ID ID
;
terminal ID:
'a' | '.'
;
and the following input:
a . . a
I would argue that there are two ways in which the string can be parsed: as a Range (the first alternative) or as a Group (the second alternative). When I try this in my generated IDE and inspect the Ecore model, a Range is instantiated.
What makes Xtext decide in favor of the Range?
Edit: specifically, I'm wondering why the Xtext grammar itself is not ambiguous, since a range 'a'..'z' can be parsed as either a Group of Keyword, Wildcard, Wildcard, Keyword or as a CharacterRange of Keyword, Keyword.
Keywords become lexer rules as well. Thus you have two lexer rules:
terminal FULL_STOP_KEYWORD: '.' ;
and
terminal ID: 'a' | '.';
The lexer is not stateful, and only one rule can win for any given input. Keyword rules take precedence over terminal rules, so '.' will always be lexed as the keyword and never as an ID. The input therefore tokenizes as ID '.' '.' ID, which only the Range alternative can match.

BNF grammar + Gold LALR parser, failing to distinguish special case NewLine from Whitespace

I want to treat newlines as ordinary whitespace in most contexts.
At the same time, I want to be able to distinguish newlines from other whitespace in order to allow a special case.
My first attempt to write such a grammar fails.
Here is the grammar:
! ------------------------------------------------- Sets
{WS} = {Whitespace} - {CR} - {LF}
{ID Head} = {Letter} + [_]
{ID Tail} = {Alphanumeric} + [_]
{String Chars} = {Printable} + {HT} - ["\]
! ------------------------------------------------- Terminals
! The following defines the Whitespace terminal using the {WS}
! set - which excludes the carriage return and line feed
! characters
Whitespace = {WS}+ | {CR}{LF} | {CR} | {LF}
!NewLine = {CR}{LF} | {CR} | {LF}
MyNewLine = {CR}{LF} | {CR} | {LF}
They are ambiguous because they both contain the same sub-pattern {CR}{LF} | {CR} | {LF}.
Given the input {CR}{LF}, the tokenizer has no way to tell which terminal it should match.
A table-driven parser isn't really designed to handle "special cases" directly. If you want to ignore new-lines in some contexts, but attribute meaning to them in others then you'll have to handle that in your reductions (i.e. tokenize the newlines separately, and discard them in your reductions), but that will get ugly.
A (potentially) better solution is to use tokenizer states (possibly controlled from the parser), to change how the newline inputs are tokenized. It's hard to say without fully understanding your grammar. Plus, it's been a few years since I've messed with this stuff.
I think the grammar is ambiguous in the sense that both Whitespace and MyNewLine match newline characters. Since it throws a wobbly doing it your way, I suggest detecting whitespace and newlines separately and deciding what to do with each newline on a case-by-case basis.
I am not too experienced in the area, but that's what I remember from my Theory of Computation and Compiler Design classes.
I hope this helps.
A late answer.
To my dismay, I'm just a recent member; a late bloomer ;-)
Keep using the usual Line-Based Grammar Declarations
! ====================================================================
{Whitespace Ch} = {Whitespace} - {CR} - {LF}
Whitespace = {Whitespace Ch}+
Newline = {CR}{LF} | {CR} | {LF}
! ====================================================================
Whitespace vs. Newline distinction is already taken into account!
Consider addressing your special case when writing your production rules.
For complex case you may even need to define some virtual terminal (advanced technique).
You may elaborate your grammar and ask by posting it again.
Last Edit: Please, share if you've already addressed the issue. Thanks.