Here is an example from section 1.3.1, Integer Literals, of the Scala documentation:
integerLiteral ::= (decimalNumeral | hexNumeral) [‘L’ | ‘l’]
decimalNumeral ::= ‘0’ | nonZeroDigit {digit}
hexNumeral ::= ‘0’ ‘x’ hexDigit {hexDigit}
digit ::= ‘0’ | nonZeroDigit
nonZeroDigit ::= ‘1’ | ... | ‘9’
It does not look like a regular expression or SGML.
I generally understand what it means, but what notation is this written in?
As mentioned in the comments, it is EBNF.
The proof is in the Scala grammar description.
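To make the notation concrete, here is a minimal sketch in Python that transcribes the integerLiteral rule from the question into a regular expression: {x} becomes x* and [x] becomes x?. The names DECIMAL, HEX, INTEGER_LITERAL and is_integer_literal are made up for this illustration, and it assumes hexDigit means 0-9, a-f, A-F, which the excerpt does not define.

```python
import re

# decimalNumeral ::= '0' | nonZeroDigit {digit}
DECIMAL = r"(?:0|[1-9][0-9]*)"

# hexNumeral ::= '0' 'x' hexDigit {hexDigit}
# Assumption: hexDigit is 0-9, a-f, A-F (not defined in the excerpt).
HEX = r"(?:0x[0-9a-fA-F]+)"

# integerLiteral ::= (decimalNumeral | hexNumeral) ['L' | 'l']
INTEGER_LITERAL = re.compile(rf"(?:{HEX}|{DECIMAL})[Ll]?")


def is_integer_literal(text: str) -> bool:
    """Return True if the whole string matches the integerLiteral rule."""
    return INTEGER_LITERAL.fullmatch(text) is not None


print(is_integer_literal("42L"))   # True
print(is_integer_literal("0x1F"))  # True
print(is_integer_literal("007"))   # False: a decimal may not start with 0
```

A regular expression is enough here only because this particular rule is not recursive; most of the Scala grammar is, and needs a real parser.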
Flutter:
Framework • revision 18116933e7 (8 weeks ago) • 2021-10-15 10:46:35 -0700
Engine • revision d3ea636dc5
Tools • Dart 2.14.4
ANTLR4:
antlr4: ^4.9.3
I would like to implement a simple tool that formats text like in the following definition: https://www.motoslave.net/sugarcube/2/docs/#markup-style
So basically each __ is the start of an underlined text and the next __ is the end.
I got some issues with the following input:
^^subscript=^^
Shell: line 1:13 token recognition error at '^'
Shell: line 1:14 extraneous input '' expecting {'==', '//', '''', '__', '~~', '^^', TEXT}
MyLexer.g4:
STRIKETHROUGH : '==';
EMPHASIS : '//';
STRONG : '\'\'';
UNDERLINE : '__';
SUPERSCRIPT : '~~';
SUBSCRIPT : '^^';
TEXT
: ( ~[<[$=/'_^~] | '<' ~'<' | '=' ~'=' | '/' ~'/' | '\'' ~'\'' | '_' ~'_' | '~' ~'~' | '^' ~'^' )+
;
MyParser.g4:
options {
tokenVocab=SugarCubeLexer;
//language=Dart;
}
parse
: block EOF
;
block
: statement*
;
statement
: strikethroughStyle
| emphasisStyle
| strongStyle
| underlineStyle
| superscriptStyle
| subscriptStyle
| unstyledStatement
;
unstyledStatement
: plaintext
;
strikethroughStyle
: STRIKETHROUGH (emphasisStyle | strongStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* STRIKETHROUGH
;
emphasisStyle
: EMPHASIS (strikethroughStyle | strongStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* EMPHASIS
;
strongStyle
: STRONG (strikethroughStyle | emphasisStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* STRONG
;
underlineStyle
: UNDERLINE (strikethroughStyle | emphasisStyle | strongStyle | superscriptStyle | subscriptStyle | unstyledStatement)* UNDERLINE
;
superscriptStyle
: SUPERSCRIPT (strikethroughStyle | emphasisStyle | strongStyle | underlineStyle | subscriptStyle | unstyledStatement)* SUPERSCRIPT
;
subscriptStyle
: SUBSCRIPT (strikethroughStyle | emphasisStyle | strongStyle | underlineStyle | superscriptStyle | unstyledStatement)* SUBSCRIPT
;
plaintext
: TEXT
;
I would be super happy for any help. Thanks
It's your TEXT rule:
TEXT
: (
~[<[$=/'_^~]
| '<' ~'<'
| '=' ~'='
| '/' ~'/'
| '\'' ~'\''
| '_' ~'_'
| '~' ~'~'
| '^' ~'^'
)+
;
You can't write a lexer rule in ANTLR the way you're trying to (i.e. a '^' unless it's followed by another '^'). The ~'^' means "any character that's not ^", and that character is consumed as part of the token.
If you run your input through grun with the -tokens option, you'll see that the TEXT token pulls in everything through the EOL:
[#0,0:1='^^',<'^^'>,1:0]
[#1,2:14='subscript=^^\n',<TEXT>,1:2]
[#2,15:14='<EOF>',<EOF>,2:0]
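The same effect can be reproduced outside ANTLR. The sketch below transcribes the TEXT rule into a Python regular expression (TEXT_RULE is a name made up for this illustration) and applies it right after the opening ^^. It only approximates the ANTLR lexer, but it shows how each 'x' ~'x' alternative swallows the following character as well:

```python
import re

# Rough transcription of the TEXT lexer rule; illustration only.
TEXT_RULE = re.compile(
    r"(?:[^<\[$=/'_^~]"  # ~[<[$=/'_^~]
    r"|<[^<]"            # '<' ~'<'
    r"|=[^=]"            # '=' ~'='
    r"|/[^/]"            # '/' ~'/'
    r"|'[^']"            # '\'' ~'\''
    r"|_[^_]"            # '_' ~'_'
    r"|~[^~]"            # '~' ~'~'
    r"|\^[^\^]"          # '^' ~'^'
    r")+"
)

with_newline = "^^subscript=^^\n"
without_newline = "^^subscript=^^"

# Start matching at offset 2, just after the opening '^^' token.
print(repr(TEXT_RULE.match(with_newline, 2).group()))     # 'subscript=^^\n'
print(repr(TEXT_RULE.match(without_newline, 2).group()))  # 'subscript=^'
```

In the first case '=' takes the first '^' with it and the second '^' takes the newline, so the closing ^^ is never seen as a token. In the second case a lone '^' is left over at column 13, which is where the error message from the question points.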
Try something like this:
grammar MyParser;
parse: block EOF;
block: statement*;
statement
: STRIKETHROUGH statement STRIKETHROUGH # Strikethrough
| EMPHASIS statement EMPHASIS # Emphasis
| STRONG statement STRONG # Strong
| UNDERLINE statement UNDERLINE # Underline
| SUPERSCRIPT statement SUPERSCRIPT # SuperScript
| SUBSCRIPT statement SUBSCRIPT # Subscript
| plaintext # unstyledStatement
;
plaintext: TEXT+;
STRIKETHROUGH: '==';
EMPHASIS: '//';
STRONG: '\'\'';
UNDERLINE: '__';
SUPERSCRIPT: '~~';
SUBSCRIPT: '^^';
TEXT: .;
This grammar correctly parses your input, but at the expense of turning everything other than your special characters into single character tokens.
With a bit more thought, we can minimize this:
grammar MyParser;
parse: block EOF;
block: statement*;
statement
: STRIKETHROUGH statement STRIKETHROUGH # Strikethrough
| EMPHASIS statement EMPHASIS # Emphasis
| STRONG statement STRONG # Strong
| UNDERLINE statement UNDERLINE # Underline
| SUPERSCRIPT statement SUPERSCRIPT # SuperScript
| SUBSCRIPT statement SUBSCRIPT # Subscript
| (U_TEXT | TEXT)+ # unstyledStatement
;
STRIKETHROUGH: '==';
EMPHASIS: '//';
STRONG: '\'\'';
UNDERLINE: '__';
SUPERSCRIPT: '~~';
SUBSCRIPT: '^^';
U_TEXT: ~[=/'_~^]+;
TEXT: .;
This adds the U_TEXT lexer rule, which pulls all unambiguous characters together into a single token. This significantly reduces the number of tokens produced (as well as the number of diagnostic warnings). It should perform much better than the first version. I've not tried or timed it on large enough input to see the difference, but the resulting parse tree is much better.
Elaboration:
The ANTLR lexer rule evaluation works by examining your input. When multiple rules could match the next n characters of input, then it will continue looking at input characters until a character fails to match any of the "active" lexer rules. This establishes the longest run of characters that could match a lexer rule. If this is a single rule, it wins (by virtue of having matched the longest sequence of input characters). If there is more than one rule matching the same run of input characters then the lexer matches the first of those rules to appear in your grammar. (Technically, these situations are "ambiguities", as, looking at the whole grammar, there are multiple ways that ANTLR could have tokenized it. But, since ANTLR has deterministic rules for resolving these ambiguities, they're not really a problem.)
Lexer rules just don't have the ability to use negation, except for negating a set of characters (those that appear between [ and ]). That means we can't write a rule to match a "< not followed by another <". We can, however, match "<<" as a longer token than "<". To do that, we have to ensure that every character that could start one of your two-character sequences is matched by a single-character token rule. However, we want to avoid turning ALL other characters into single-character tokens, so we can introduce a rule that is "everything but our special characters". This will greedily consume everything that isn't possibly "problematic", leaving only the special characters to be caught by the single-character `.` rule at the end of the grammar.
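As a rough illustration of the "longest match wins, first rule breaks ties" behaviour described above, here is a small Python sketch. It is not how ANTLR is implemented; RULES and tokenize are names made up for this example, and each lexer rule of the second grammar is approximated by a regular expression.

```python
import re

# Lexer rules in grammar order; the order only matters when two rules
# match the same number of characters.
RULES = [
    ("STRIKETHROUGH", re.compile(r"==")),
    ("EMPHASIS", re.compile(r"//")),
    ("STRONG", re.compile(r"''")),
    ("UNDERLINE", re.compile(r"__")),
    ("SUPERSCRIPT", re.compile(r"~~")),
    ("SUBSCRIPT", re.compile(r"\^\^")),
    ("U_TEXT", re.compile(r"[^=/'_~^]+")),
    ("TEXT", re.compile(r".", re.DOTALL)),
]


def tokenize(source):
    """Split source into (rule name, text) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(source):
        best_name, best_text = None, ""
        for name, pattern in RULES:
            m = pattern.match(source, pos)
            # A strictly longer match wins; on a tie the earlier rule stays.
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        # TEXT matches any single character, so best_text is never empty.
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


for token in tokenize("^^subscript=^^"):
    print(token)
# ('SUBSCRIPT', '^^')
# ('U_TEXT', 'subscript')
# ('TEXT', '=')
# ('SUBSCRIPT', '^^')
```

The lone '=' falls through to TEXT because STRIKETHROUGH needs two of them, and the closing ^^ is now recognised because nothing before it consumed its first character.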
Here they say we can generate code using EBNF, but I don't understand how; it seems to only accept JSON. Does anyone know how to do it?
Thank you in advance.
The link you mentioned does not say that a new snippet can be generated from EBNF.
What it documents is this:
Below is the EBNF (extended Backus-Naur form) for snippets
and then it gives the EBNF for snippets:
any ::= tabstop | placeholder | choice | variable | text
tabstop ::= '$' int | '${' int '}'
placeholder ::= '${' int ':' any '}'
choice ::= '${' int '|' text (',' text)* '|}'
variable ::= '$' var | '${' var '}'
| '${' var ':' any '}'
| '${' var '/' regex '/' (format | text)+ '/' options '}'
format ::= '$' int | '${' int '}'
| '${' int ':' '/upcase' | '/downcase' | '/capitalize' '}'
| '${' int ':+' if '}'
| '${' int ':?' if ':' else '}'
| '${' int ':-' else '}' | '${' int ':' else '}'
regex ::= JavaScript Regular Expression value (ctor-string)
options ::= JavaScript Regular Expression option (ctor-options)
var ::= [_a-zA-Z] [_a-zA-Z0-9]*
int ::= [0-9]+
text ::= .*
It describes the combinations and keywords that a snippet body accepts. The snippet definition itself is still written in JSON; the EBNF only describes what may appear inside the body text. Snippet creation is limited to this at the moment; more advanced snippets cannot be generated in the current release (version 1.24).
Please read through the document to gather more information on how to make a new snippet with the given variables and the replacement logic. Thanks.
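To show how the two fit together, here is a small sketch that assumes the editor in question is Visual Studio Code, whose snippet files are JSON objects with prefix, body and description keys. The snippet name "Log value" and the prefix "logv" are made up; the body strings use the tabstop, placeholder, choice and variable forms from the EBNF above. Python is used only to build and print the JSON.

```python
import json

# Hypothetical snippet definition; only the body strings follow the EBNF.
snippet = {
    "Log value": {
        "prefix": "logv",
        "body": [
            # choice:      '${' int '|' text (',' text)* '|}'
            # placeholder: '${' int ':' any '}'
            "console.${1|log,warn,error|}('${2:label}:', ${2:label});",
            # variable: '$' var  (TM_FILENAME is a built-in VS Code variable)
            "// file: $TM_FILENAME",
            # tabstop: '$' int  ($0 is the final cursor position)
            "$0",
        ],
        "description": "Log a value with a label",
    }
}

# This JSON text is what would go into a snippets file.
print(json.dumps(snippet, indent=2))
```

So the JSON is the container, and the EBNF tells you what you are allowed to write inside each body line.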
I've tried to read the SLS, but it uses a strange BNF-like notation. Can anyone clarify this notation? For example, the Types chapter has the following:
Type ::= FunctionArgTypes ‘=>’ Type
| InfixType [ExistentialClause]
FunctionArgTypes ::= InfixType
| ‘(’ [ ParamType {‘,’ ParamType } ] ‘)’
ExistentialClause ::= ‘forSome’ ‘{’ ExistentialDcl {semi ExistentialDcl} ‘}’
ExistentialDcl ::= ‘type’ TypeDcl
| ‘val’ ValDcl
InfixType ::= CompoundType {id [nl] CompoundType}
CompoundType ::= AnnotType {‘with’ AnnotType} [Refinement]
| Refinement
AnnotType ::= SimpleType {Annotation}
SimpleType ::= SimpleType TypeArgs
| SimpleType ‘#’ id | StableId
| Path ‘.’ ‘type’
| ‘(’ Types ’)’
TypeArgs ::= ‘[’ Types ‘]’
Types ::= Type {‘,’ Type}
Symbols like ::= and | are clear to me, but what is the difference between [] and {}? Also, I couldn't find a description for things like id, [nl], Refinement, Type.
You are right, the notation used in the SLS is called EBNF (Extended Backus–Naur Form). It was developed by Niklaus Wirth, the creator of Pascal, and, if I'm not mistaken, he was Prof. Odersky's supervisor during his Ph.D. research. The whole Scala syntax is summarized at the end of the SLS (page 159); there you can find Type, Refinement, nl and the other things used in Scala.
As for EBNF itself, here is the complete table of its syntax:
Usage              Notation
definition         =
concatenation      ,
termination        ;
alternation        |
option             [ ... ]
repetition         { ... }
grouping           ( ... )
terminal string    " ... "
terminal string    ' ... '
comment            (* ... *)
special sequence   ? ... ?
exception          -
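The notation in the SLS is slightly modified: ::= is used instead of a plain =, and a space is used for concatenation instead of a comma. So [x] means that x is optional (zero or one occurrence) and {x} means that x may be repeated zero or more times.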
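One way to see the difference between [] and {} is to write a tiny hand-made parser for a rule that uses both. The sketch below is in Python and handles a simplified form of the FunctionArgTypes alternative from the question, with ParamType reduced to a plain identifier. The names tokenize and parse_arg_types are made up for this illustration.

```python
import re

PUNCTUATION = {"(", ")", ","}


def tokenize(text):
    # Identifiers plus the three punctuation tokens the rule uses.
    # Anything else (such as spaces) is silently skipped in this sketch.
    return re.findall(r"[A-Za-z_]\w*|[(),]", text)


def parse_arg_types(tokens):
    """ArgTypes ::= '(' [ id {',' id} ] ')'   (ParamType simplified to id)"""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(value):
        nonlocal pos
        if peek() != value:
            raise SyntaxError(f"expected {value!r}, found {peek()!r}")
        pos += 1

    def ident():
        nonlocal pos
        tok = peek()
        if tok is None or tok in PUNCTUATION:
            raise SyntaxError(f"expected an identifier, found {tok!r}")
        pos += 1
        return tok

    names = []
    expect("(")
    if peek() != ")":         # [ ... ]: the bracketed part may be absent
        names.append(ident())
        while peek() == ",":  # { ... }: zero or more repetitions
            expect(",")
            names.append(ident())
    expect(")")
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek()!r}")
    return names


print(parse_arg_types(tokenize("()")))             # []
print(parse_arg_types(tokenize("(Int)")))          # ['Int']
print(parse_arg_types(tokenize("(Int, String)")))  # ['Int', 'String']
```

The optional part turns into an if and the repetition turns into a while, which is usually how these two constructs end up in a recursive-descent parser.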
I am trying to write an ANTLR grammar that matches a certain ID.
I need to match a string that starts with the character 'n' and ends with 'd'.
This ID can contain a space.
Everywhere else I want to ignore whitespace.
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9'
| ('+'|'-'|'*'|'/'|'_')
| '='
| '~'
| '{'
| '}'
| ','
| NA
)+
;
NA : 'n'[ ]['a'..'z']'d' ;
WS : [ \t\n]+ -> skip;
I tested this with the expression A1=not attempted.
It treats A1=not as an ID and attempted as an error node.
Can a grammar ignore whitespace but make an exception for a certain string such as "not attempted"?
You should try to separate the ID ("A1") from the rest. You also need to take care of the priority of the lexer rules. Your "n...d" rule should have higher priority, so make it one of your first lexer rules.
A working grammar (only tested with your example "A1=not attempted") is:
statement : ID expr;
expr : OP expr
| (NA | ID | OP)
;
NA : 'n'[a-zA-Z ]*'d' ;
ID
: (
'a'..'z'
| 'A'..'Z'
| '0'..'9'
| ('+'|'-'|'*'|'/'|'_')
)+ ;
OP : '='
| '~'
| '{'
| '}'
| ','
;
WS : [ \t\r\n]+ -> skip;
Try it with the start rule statement. I changed the NA rule so that it matches zero or more of the characters a to z, A to Z and space, in any order, between the 'n' and the 'd'.
Good luck with ANTLR, it's a nice tool.
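If it helps to see why the rule order matters, here is a rough Python approximation of how the lexer rules above compete. It is only a sketch, not ANTLR itself: RULES and tokenize are made-up names, each rule is approximated by a regular expression, the longest match wins, and the earlier rule wins a tie.

```python
import re

# Rules in the order of the grammar above: NA comes before ID.
RULES = [
    ("NA", re.compile(r"n[a-zA-Z ]*d")),
    ("ID", re.compile(r"[a-zA-Z0-9+\-*/_]+")),
    ("OP", re.compile(r"[=~{},]")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def tokenize(source):
    """Return (rule name, text) pairs; WS tokens are skipped."""
    tokens, pos = [], 0
    while pos < len(source):
        best = None
        for name, pattern in RULES:
            m = pattern.match(source, pos)
            # Only a strictly longer match replaces an earlier rule's match.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, m = best
        if name != "WS":
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens


print(tokenize("A1=not attempted"))
# [('ID', 'A1'), ('OP', '='), ('NA', 'not attempted')]
print(tokenize("nd"))    # [('NA', 'nd')]   tie, so the first rule wins
print(tokenize("node"))  # [('ID', 'node')] ID matches more characters
```

The last line shows the catch: an identifier that merely starts with 'n' is still an ID whenever ID can match more characters than NA.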
Why are the parentheses needed here? Are there some precedence rules I should know?
scala> 'x' match { case _ => 1 } + 1
<console>:1: error: ';' expected but identifier found.
'x' match { case _ => 1 } + 1
^
scala> ('x' match { case _ => 1 }) + 1
res2: Int = 2
Thanks!
As Agilesteel says, a match is not considered a simple expression, nor is an if statement, so you need to surround the expression with parentheses. According to The Scala Language Specification, 6 Expressions, p73, the match is an Expr, as is an if. Only SimpleExpr are accepted on either side of the + operator.
To convert an Expr into a SimpleExpr, you have to surround it with ().
Copied for completeness:
Expr ::= (Bindings | id | ‘_’) ‘=>’ Expr
| Expr1
Expr1 ::= ‘if’ ‘(’ Expr ‘)’ {nl} Expr [[semi] else Expr]
| ‘while’ ‘(’ Expr ‘)’ {nl} Expr
| ‘try’ ‘{’ Block ‘}’ [‘catch’ ‘{’ CaseClauses ‘}’] [‘finally’ Expr]
| ‘do’ Expr [semi] ‘while’ ‘(’ Expr ’)’
| ‘for’ (‘(’ Enumerators ‘)’ | ‘{’ Enumerators ‘}’) {nl} [‘yield’] Expr
| ‘throw’ Expr
| ‘return’ [Expr]
| [SimpleExpr ‘.’] id ‘=’ Expr
| SimpleExpr1 ArgumentExprs ‘=’ Expr
| PostfixExpr
| PostfixExpr Ascription
| PostfixExpr ‘match’ ‘{’ CaseClauses ‘}’
PostfixExpr ::= InfixExpr [id [nl]]
InfixExpr ::= PrefixExpr
| InfixExpr id [nl] InfixExpr
PrefixExpr ::= [‘-’ | ‘+’ | ‘~’ | ‘!’] SimpleExpr
SimpleExpr ::= ‘new’ (ClassTemplate | TemplateBody)
| BlockExpr
| SimpleExpr1 [‘_’]
SimpleExpr1 ::= Literal
| Path
| ‘_’
| ‘(’ [Exprs] ‘)’
| SimpleExpr ‘.’ id
| SimpleExpr TypeArgs
| SimpleExpr1 ArgumentExprs
| XmlExpr
Exprs ::= Expr {‘,’ Expr}
BlockExpr ::= ‘{’ CaseClauses ‘}’
| ‘{’ Block ‘}’
Block ::= {BlockStat semi} [ResultExpr]
ResultExpr ::= Expr1
| (Bindings | ([‘implicit’] id | ‘_’) ‘:’ CompoundType) ‘=>’ Block
Ascription ::= ‘:’ InfixType
| ‘:’ Annotation {Annotation}
| ‘:’ ‘_’ ‘*’
After some inspection of the Scala specification, I think I can give it a shot.
If I am wrong, please correct me.
First, an if or a match is defined as an Expr (an expression).
You are trying to create an infix expression (defined by the use of an operator between two expressions).
However, the specification (section 3.2.8) states that:
All type infix operators have the same precedence; parentheses have to
be used for grouping
It also states that:
In a sequence of consecutive type infix operations t0 op1 t1 op2 . .
.opn tn, all operators op1, . . . , opn must have the same
associativity. If they are all left-associative, the sequence is
interpreted as (. . . (t0 op1 t1) op2 . . .) opn tn.
So my take is that Scala does not know what to reduce first: the match or the method '+' invocation.
Take a look at this answer
Please correct me if I am wrong.
A match expression is not considered a simple expression. Here is a similar example:
scala> val foo = "bar" + if(3 < 5) 3 else 5 // does not compile
scala> val foo = "bar" + (if(3 < 5) 3 else 5) // does compile
Apparently you can't write complex expressions wherever you want. I don't know why and hope that someone with more knowledge of the topic will give you a better answer.