Could anybody help me with these two problems, please?
The first one is almost solved by the question "regular expression for multiline commentary in matlab", but I do not know exactly how to use ^.*%\{(?:\R(?!.*%\{).*)*\R\h*%\}$, or where to put it in the grammar, if I want to use it with ANTLR4. I have been using the MATLAB grammar from this source.
The second one concerns the other type of comment in MATLAB, as in a = 3 % type any ascii I want.... This worked when I added an alternative to the unary_expression rule in this form:
unary_expression
: postfix_expression
| unary_operator postfix_expression
| postfix_expression COMMENT
;
where COMMENT: '%' [ a-zA-Z0-9]*;. But when I use [\x00-\x7F] instead of [ a-zA-Z0-9]* (which I found here), parsing goes wrong; see the example below:
INPUT FOR PARSER: a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
ANTLR OUTPUT : Exception in thread "main" java.lang.RuntimeException: set is empty
at org.antlr.v4.runtime.misc.IntervalSet.getMaxElement(IntervalSet.java:421)
at org.antlr.v4.runtime.atn.ATNSerializer.serialize(ATNSerializer.java:169)
at org.antlr.v4.runtime.atn.ATNSerializer.getSerialized(ATNSerializer.java:601)
at org.antlr.v4.Tool.generateInterpreterData(Tool.java:745)
at org.antlr.v4.Tool.processNonCombinedGrammar(Tool.java:400)
at org.antlr.v4.Tool.process(Tool.java:361)
at org.antlr.v4.Tool.processGrammarsOnCommandLine(Tool.java:328)
at org.antlr.v4.Tool.main(Tool.java:172)
line 1:9 token recognition error at: '$'
line 1:20 token recognition error at: '"'
line 1:21 token recognition error at: '!'
line 1:22 token recognition error at: '"'
line 1:38 token recognition error at: '$'
line 1:43 token recognition error at: '"'
line 1:10 missing {',', ';', CR} at 'L'
line 1:32 missing {',', ';', CR} at '3'
Can anybody please tell me what I have done wrong? And what is the best practice for this problem? (I am not exactly a regex person...)
Let's take the simple one first.
This looks (to me) like a typical "comment everything through the end of the line" comment.
Assuming I'm correct, then best not to consider what all the valid characters are that might be contained, but rather to think about what not to consume.
Try: COMMENT: '%' ~[\r\n]* '\r'? '\n';
(I notice that you did not include anything in your rule to terminate it at the end of the line, so I've added that).
This basically says: once I see a %, consume everything that is not a \r or \n, and stop when you see an optional \r followed by a required \n.
Generally, comments can occur just about anywhere within a grammar structure, so it's VERY useful to "shove them off to the side" rather than inject them everywhere you allow them in the grammar.
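As a rough illustration outside ANTLR, the same "consume everything up to the end of the line" idea can be sketched with a Python regular expression. The names LINE_COMMENT and split_comment are made up for this sketch; the pattern is only an analogue of the lexer rule, not something ANTLR itself uses.

```python
import re

# Python analogue of the lexer rule  '%' ~[\r\n]* '\r'? '\n'
# (LINE_COMMENT and split_comment are hypothetical names for this sketch)
LINE_COMMENT = re.compile(r"%[^\r\n]*\r?\n")

def split_comment(line):
    """Return (code, comment); comment is None when no terminated comment is found."""
    m = LINE_COMMENT.search(line)
    if m is None:
        return line, None
    return line[:m.start()], m.group(0)

sample = 'a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa\n'
code, comment = split_comment(sample)
print(repr(code))     # the part before the '%'
print(repr(comment))  # the '%' through the newline, whatever characters it contains
```

Because the pattern requires the final newline, a comment on a last line without a trailing \n would not match, which is the same caveat that applies to the lexer rule.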
So, a short grammar:
grammar test;
test: ID EQ INT;
EQ: '=';
INT: [0-9]+;
COMMENT: '%' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN);
ID: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
You'll notice that I removed the COMMENT element from the test rule.
test file:
a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
(be sure to include the \n)
➜ grun test test -tree -tokens < test.txt
[#0,0:0='a',<ID>,1:0]
[#1,2:2='=',<'='>,1:2]
[#2,4:4='3',<INT>,1:4]
[#3,6:48='% $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa\n',<COMMENT>,channel=1,1:6]
[#4,49:48='<EOF>',<EOF>,2:0]
(test a = 3)
You still get a COMMENT token, it's just ignored when matching the parser rules.
Now for the multiline comments:
ANTLR uses a rather "regex-like" syntax for Lexer rules, but don't be fooled, it's not regex (it's actually more powerful, as it can pair up nested brackets, etc.)
From a quick reading, MATLAB multiline comments start with a %{ and consume everything until a %}. This is very similar to the prior rule (it just doesn't care about \r or \n), so:
MLCOMMENT: '%{' .*? '%}' -> channel(HIDDEN);
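The role of the non-greedy .*? can be seen in a small Python regex sketch. This is only an analogue of the lexer rule; the names BLOCK_COMMENT and GREEDY and the sample text are assumptions made for the illustration.

```python
import re

# Analogue of  '%{' .*? '%}' ; DOTALL lets '.' match newlines as well.
BLOCK_COMMENT = re.compile(r"%\{.*?%\}", re.DOTALL)  # non-greedy, like .*?
GREEDY = re.compile(r"%\{.*%\}", re.DOTALL)          # greedy, for comparison

text = "a = 1\n%{\nfirst\n%}\nb = 2\n%{\nsecond\n%}\n"

lazy = BLOCK_COMMENT.findall(text)
greedy = GREEDY.findall(text)

print(len(lazy))    # 2: each comment stops at its own closing %}
print(len(greedy))  # 1: one match running from the first %{ to the last %}
```

With the greedy form, the code between the two comments (b = 2) would be swallowed into a single comment, which is why the rule uses .*? rather than .*.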
Included in grammar:
grammar test;
test: ID EQ INT;
EQ: '=';
INT: [0-9]+;
COMMENT: '%' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN);
MLCOMMENT: '%{' .*? '%}' -> channel(HIDDEN);
ID: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
Input file:
a = 3 % $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa
%{
A whole bunch of stuff
on several
lines
%}
➜ grun test test -tree -tokens < test.txt
[#0,0:0='a',<ID>,1:0]
[#1,2:2='=',<'='>,1:2]
[#2,4:4='3',<INT>,1:4]
[#3,6:48='% $£ K JFKL£J"!"OIJ+2432 3K3KJ£$K M£"Kdsa\n',<COMMENT>,channel=1,1:6]
[#4,50:106='%{\n A whole bunch of stuff\n on several\n lines\n%}',<MLCOMMENT>,channel=1,3:0]
[#5,108:107='<EOF>',<EOF>,8:0]
(test a = 3)
I am trying to identify emojis within a sentence
def extractEmojiFromSentence (sentence: Any) : Seq[String] = {
return raw"[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\p{block=Supplemental Symbols and Pictographs}]".r.findAllIn(sentence.toString).toSeq
}
This gives the following error
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unknown character block name {Supplemental Symbols and Pictographs}
near index 112 [\p{block=Emoticons}\p{block=Miscellaneous Symbols and
Pictographs}\p{block=Supplemental Symbols and Pictographs}]
Do I have to import some libraries into my build.sbt? Or what is the reason for the above error?
UPDATE
I'm trying the code below, as suggested in the comment
val x = raw"\p{block=Supplemental Symbols and Pictographs}".r.findAllIn(mySentence.toString).toSeq
But I'm getting the error below
Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown character block name {Supplemental Symbols and Pictographs} near index 45
\p{block=Supplemental Symbols and Pictographs}
^
It appears that the regex engine in your JVM version does not recognize that block label. (Mine doesn't either.)
You can just supply the equivalent character range instead.
def extractEmojiFromSentence(sentence: String): Seq[String] =
  ("[\\p{block=Emoticons}" +
    "\\p{block=Miscellaneous Symbols and Pictographs}" +
    "\uD83E\uDD00-\uD83E\uDDFF]") // Supplemental Symbols & Pictographs
    .r.findAllIn(sentence).toSeq
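The escapes \uD83E\uDD00 and \uD83E\uDDFF in that range are the UTF-16 surrogate pairs for U+1F900 and U+1F9FF, the first and last code points of the Supplemental Symbols and Pictographs block. A short Python sketch of the arithmetic (to_surrogates is a hypothetical helper written for this illustration, not part of any library):

```python
import re
from typing import Tuple

def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

first = to_surrogates(0x1F900)
last = to_surrogates(0x1F9FF)
print([hex(u) for u in first])  # ['0xd83e', '0xdd00']
print([hex(u) for u in last])   # ['0xd83e', '0xddff']

# Python strings hold whole code points, so the same block can be written directly.
supplemental = re.compile("[\U0001F900-\U0001F9FF]")
found = supplemental.findall("hmm \U0001F914 ok")  # U+1F914 lies inside the block
print(len(found))  # 1
```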
I'm using the SystemVerilog stringify operator, `", in a macro, as below. The case is deliberately contrived to show the bug:
module my_test();
`define print(x) $fwrite(log_file, `"x`")
`define println(x) $fwrite(log_file, `"x\n`")
integer log_file;
initial begin
log_file = $fopen("result.txt", "w");
`print(A);
`print(B);
`println(C);
`println(D);
`print(E);
`print(F);
end
endmodule
This gives the output (no trailing newline):
ABC
`D
`EF
Why are there `s in the output, but only from the println?
Is this documented behaviour in the spec, or a bug in my simulator (Aldec Active-HDL)?
This is a bug in your tool. However, the second `" is not needed, and leaving it out gives you the results you are looking for.
Please find the error in lines 15,17 and 19:
%{
#include<stdio.h>
int c=0;
FILE *fp;
%}
operator [+-*/]
identifier [a-zA-Z][a-zA-Z0-9]*
number [0-9]+
expression ({identifier}|{number}){operator}({identifier}|{number})
%%
\n { c++; }
^"#".+ ;
^("int "|"float "|"char ").+ ;
"void main()" ;
{identifier}"="({expression}+";") {printf("Valid arithmetic expression in line %d",c+1);ECHO;printf("\n");}
{identifier}"="({number}|{identifier}";") {printf("Valid assignment statement in line %d",c+1);ECHO;printf("\n");}
({number}|([0-9]+[a-zA-Z0-9]*))"="{expression}+ {printf("Invalid: rules for naming identifier are violated in line %d",c+1);ECHO;printf("\n");}
{identifier}"=;" {printf("Invalid right side of expression missing in line %d",c+1);ECHO;printf("\n");}
{operator}{operator}+ {printf("Invalid multiple operators cannot occur consecutively in line %d",c+1);ECHO;printf("\n");}
. ;
%%
main()
{
yyin=fopen("3b.txt","r");
yylex();
fclose(yyin);
}
I don't think that your error "Negative Range in Character Class" is actually on lines 15, 17, or 19. I believe that it is on line 6. Your code says operator [+-*/], by which you appear to mean "the symbols +, -, *, and /".
However, the - is actually being interpreted as a "range" from + to *. Since + is character 43 and * is character 42, that range is backwards.
If you escape the - with \ before it, you should not have that error anymore.
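The same mistake can be reproduced with Python's re module, used here only as a stand-in for flex: its character-class rules are similar but not identical, so treat this as an illustrative sketch. Placing the - first or last in the class (for example [-+*/]) is another common way to have it treated literally.

```python
import re

# '+' is 43 and '*' is 42, so the range '+-*' runs backwards.
print(ord("+"), ord("*"))  # 43 42

try:
    re.compile(r"[+-*/]")   # '-' between '+' and '*' is read as a range
    compiled = True
except re.error as exc:
    compiled = False
    print("rejected:", exc)

# Escaping the '-' makes it a literal member of the class.
fixed = re.compile(r"[+\-*/]")
ops = fixed.findall("a+b-c*d/e")
print(ops)  # ['+', '-', '*', '/']
```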
I have written the following grammar:
Model:
package = PackageDec?
greetings+=Greeting*
usage+=Usage* ;
PackageDec:
'package' name=QualifiedName ;
Greeting:
'greet' name=ID '{' ops += Operation* '}' ;
Operation:
'op' name=ID ('(' ')' '{' '}')? ;
QualifiedName:
ID ('.' ID)*;
Usage:
'use';
With the above I can write the following script:
package p1.p2
greet G1 {op f1 op f2 }
Now I need to write something like this:
package p1.p2
greet G1 {op f1 op f2 op f3}
use p1.p2.G1.f1
use p1.p2.G1
use p1.p2.G1.f3
To support that, I changed the Usage rule like this:
Usage:
'use' head=[Greet|QualifiedName] =>('.' tail=[Operation])?
However, when I generate the Xtext artifacts, it complains about multiple alternatives.
Please let me know how to write a correct grammar rule for this.
This is because QualifiedName consumes dots (.). Adding ('.' ...)? makes two alternatives. Consider input
a.b.c
This could be parsed as
head="a" tail = "b.c"
head="a.b" tail = "c"
If I understand your intention in using the predicate => correctly, then you just have to replace
head=[Greet|QualifiedName]
with
head=[Greet]
In this case, however, you will not be able to parse references with dots.
As a solution, I would recommend replacing the dot before the tail with some other character, for example a colon:
Usage:
'use' head=[Greet|QualifiedName] (':' tail=[Operation])? ;