End of line character for regular expressions in ml-lex

What is the end of line character for regular expressions in ml-lex?
"$" is used to match the end of a line in regular expressions in most other languages, but when I use it in ml-lex, I get an error:
mllex a.lex
ml-lex: error, line 45: lookahead is unimplemented
unhandled exception: Error
I am currently appending an extra \n to all my regular expressions to match the end of line explicitly. However, stripping the extra \n from each match is making the code ugly.
I read somewhere that $ is not implemented in ml-lex.
So, can there be any other solution for my problem? Please help.

Unfortunately, it looks like the $ character is not implemented in ML-Lex according to this manual:
"The dollar sign of C Lex $ is not implemented, since it is an
abbreviation for lookahead involving the newline character (that is,
it is an abbreviation for /\n)."
And it is also noted in this user guide:
"The dollar sign of C Lex $ is not implemented, since it is an
abbreviation for lookahead involving the newline character that is, it
is an abbreviation for /\n."
So that would at least explain the error, and it backs up what you read about $ not being implemented in ML-Lex. Unfortunately, that probably means that, for now at least, you will need to keep using your existing method for matching those end-of-lines, even if it doesn't look super clean.
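To make the quoted definition concrete, here is a small sketch in Python rather than ML-Lex. Python's re module is only a stand-in, and the [a-z]+ pattern is a made-up example. It contrasts a lookahead for the newline, which is what $ abbreviates in C Lex, with the workaround from the question of consuming \n and stripping it afterwards.

```python
import re

line = "abc\n"

# What `$` abbreviates in C Lex: match, then peek at the newline without consuming it.
with_lookahead = re.match(r"[a-z]+(?=\n)", line).group()

# The workaround from the question: consume the newline, then strip it from the lexeme.
with_newline = re.match(r"[a-z]+\n", line).group()
stripped = with_newline[:-1]

print(repr(with_lookahead), repr(with_newline), repr(stripped))
```

Both approaches end up with the same lexeme; the difference is only whether the newline is consumed as part of the match.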

Related

How to fix ICU Lexing Error: Unexpected character in Flutter

I am using flutter_localizations to localize my app.
Since updating to Flutter 3.7, I am getting this error:
ICU Syntax Error: Expected "identifier" but found "}".
This =|(){}[] obviously
This =|\(){}[] obviously is the text that I have in my .arb file.
I understand that curly braces "{}" have a special meaning and should be escaped, but I cannot find a way to escape them correctly. Has anyone managed to do so?
One simple way to reproduce the issue is simply following the steps to add localization support here, and then instead of the hello world string, write anything that includes the character "{".
P.S.: There is a related issue open on GitHub. Be sure to check there for updates!
There is an escaping syntax that is implemented but not enabled by default as it is a new feature that wasn't completely backward compatible with existing ICU message strings.
First, add the following to your l10n.yaml file:
use-escaping: true
Then, this will allow you to wrap parts of your strings in single quotes to ignore any syntax within the single quotes; to use a single quote normally as a character and not an escape, use a double single quote. For example,
{
  "message": "This '{isn''t}' obvious"
}
becomes
String get message => "This {isn't} obvious";
See here for information on the syntax. I'll add this to the documentation later.
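The quoting rule described above can be modelled in a few lines. The following is a simplified Python sketch of that rule only, not Flutter's actual parser; the function name unescape_icu_quotes is hypothetical, and placeholders, plurals and selects are ignored.

```python
def unescape_icu_quotes(message: str) -> str:
    """Simplified model of the quoting rule; not Flutter's actual parser."""
    out = []
    i = 0
    while i < len(message):
        ch = message[i]
        if ch == "'":
            if message[i + 1 : i + 2] == "'":
                out.append("'")  # a doubled quote is a literal single quote
                i += 2
                continue
            # A lone quote opens or closes a literal section and is dropped.
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


print(unescape_icu_quotes("This '{isn''t}' obvious"))
```

Running it on the .arb value from the example yields the same text as the generated getter.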

How do I specify a unicode literal that requires more than four hex digits in Antlr?

I want to define a lexer rule for ranges between unicode characters that have code points that need more than four hexadecimal digits to identify. To be concrete, I want to declare the following rule:
ID_Continue : [\uE0100-\uE01EF] ;
Unfortunately, it doesn't work. This rule will match characters that are not in this range. (I'm not certain exactly what behaviour this results in, but it isn't the one I want.) I've also tried the following (padding with leading zeros and using 8 digits):
ID_Continue : [\U000E0100-\U000E01EF] ;
But it seems to result in the same unwanted behaviour.
I am using Antlr4 and the IntelliJ plugin for it for testing.
Does Antlr4 not support unicode literals above \uFFFF?
No, ANTLR's maximum is the same as Java's Character.MAX_VALUE (\uFFFF).
If you look at (a part of) ANTLR4's lexer grammar you will see these rules:
// Any kind of escaped character that we can embed within ANTLR literal strings.
fragment EscSeq
: Esc
( [btnfr"'\\] // The standard escaped character set such as tab, newline, etc.
| UnicodeEsc // A Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
...
fragment UnicodeEsc
: 'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
;
...
fragment Esc : '\\' ;
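Because UnicodeEsc takes at most four hex digits, an escape such as \uE0100 is presumably read as \uE010 followed by a literal 0. Python's \u escape has the same four-digit limit, so the effect can be sketched there. This is only an analogy under that assumption, not ANTLR's own behaviour, and the variable names are made up for illustration.

```python
import re

four_digit = "\uE0100"      # parsed as U+E010 followed by the character '0'
eight_digit = "\U000E0100"  # a single code point, U+E0100

print(len(four_digit), len(eight_digit))

# The same splitting inside a character class changes its meaning entirely:
# U+E010, then the range '0'..U+E01E, then the letter 'F'.
accidental = re.compile("[\uE0100-\uE01EF]")
intended = re.compile("[\U000E0100-\U000E01EF]")

print(bool(accidental.fullmatch("A")), bool(intended.fullmatch("A")))
```

Under this reading, the class accidentally covers most of the BMP from '0' upwards, which would explain why the rule matches characters outside the intended range.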
Note: the limitation to the BMP is purely a Java limitation. Other targets might go much further. For instance my MySQL grammar, written for ANTLR3 (C target) can easily lex e.g. emojis from beyond the BMP. This works for quoted strings as well as IDENTIFIERs.
What's a bit strange here, however, is that I haven't specified that range in the grammar (it uses only the BMP). Still, the parser can parse any UTF-8 input. Might be a bug in the target runtime, though I'm happy it exists :-D

Do Unicode's line breaking rules require the last character to be a mandatory break?

I'm trying to use libunibreak (https://github.com/adah1972/libunibreak) to mark the possible line breaks in some given unicode text.
Libunibreak gives back four possible options for each code unit in some text:
LINEBREAK_MUSTBREAK
LINEBREAK_ALLOWBREAK
LINEBREAK_NOBREAK
LINEBREAK_INSIDEACHAR
Hopefully these are self-explanatory. I would expect that MUSTBREAK corresponds to newline characters like LF. However, for any given text, Libunibreak always indicates that the last character is MUSTBREAK.
So for example with the string "abc", the output would be [NOBREAK,NOBREAK,MUSTBREAK]. For "abc\n" the output would be [NOBREAK,NOBREAK,NOBREAK,MUSTBREAK]. I use the MUSTBREAK attribute to start a new line when drawing text so the first case ("abc") creates an extra linebreak that shouldn't be there.
Is this behaviour what Unicode specifies or is this a quirk of the library implementation I'm using?
Yes, this is what the Unicode line breaking algorithm specifies. Rule LB3 in UAX #14: Unicode Line Breaking Algorithm, section 6.1 "Non-tailorable Line Breaking Rules" says:
Always break at the end of text.
The spec further explains:
[This rule is] designed to deal with degenerate cases, so that there is [...] at least one line break for the whole text.
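Since LB3 makes the final MUSTBREAK unconditional, one way to avoid the extra line when drawing is to read MUSTBREAK as "the current line ends after this position" rather than "start a new line here". The following is a minimal Python sketch of that idea, assuming one break value per character. The function name and constant values are placeholders, and the break arrays are written by hand rather than produced by libunibreak.

```python
# Placeholder names mirroring the libunibreak options listed in the question;
# the numeric values are arbitrary for this sketch.
MUSTBREAK, ALLOWBREAK, NOBREAK, INSIDEACHAR = range(4)


def split_on_mandatory_breaks(text, breaks):
    """Split text after each MUSTBREAK position, without adding an empty last line."""
    lines, start = [], 0
    for i, kind in enumerate(breaks):
        if kind == MUSTBREAK:
            lines.append(text[start : i + 1])
            start = i + 1
    if start < len(text):  # leftover text after the last mandatory break
        lines.append(text[start:])
    return lines


print(split_on_mandatory_breaks("abc", [NOBREAK, NOBREAK, MUSTBREAK]))
print(split_on_mandatory_breaks("a\nb", [NOBREAK, MUSTBREAK, MUSTBREAK]))
```

With this reading, "abc" produces a single line, and a new line only begins when more text follows a mandatory break.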

complete list of special characters (e.g \nquit)

I desperately tried to find out what the symbol '\nquit' is, and I couldn't find any reference on the web.
What I tried to find is a complete list of all of those characters (\n, \p, \0, ...) but I couldn't find any.
cheers usche
Wikipedia has a list of C language escapes here.
As noted in my comment, I believe this represents the newline (linefeed) character \n followed by the word quit (which would be forced by the newline to the beginning of the next line of output). But in that case the string should be "double"-quoted rather than 'single'-quoted.
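A quick way to check this reading is to inspect the string in a language with the same \n escape. The sketch below uses Python, where, unlike in C, single and double quotes are interchangeable for strings, so the quoting remark above does not apply to it.

```python
s = "\nquit"

print(len(s))      # 5 characters: one newline plus "quit"
print(repr(s[0]))  # the first character is the newline escape
print(s)           # prints an empty line, then "quit" at the start of the next line
```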

ack-grep: chars escaping

My goal is to find all "<?=" occurrences with ack. How can I do that?
ack "<?="
Doesn't work. Please tell me how can I fix escaping here?
Since ack uses Perl regular expressions, your problem stems from the fact that in Perl's regex language, ? is a special character meaning "the preceding item is optional". So what you are grepping for is = preceded by an optional <.
So you need to escape the ? if that's just meant to be a regular character.
To escape, there are two approaches - either <\?= or <[?]=; some people find the second form of escaping (putting a special character into a character class) more readable than backslash-escape.
UPDATE As Josh Kelley graciously added in the comment, a third form of escaping is to use the \Q operator which escapes all the following special characters till \E is encountered, as follows: \Q<?=\E
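The effect of the unescaped ? and of the two escaping forms can be checked with any regex engine that shares this syntax. Here is a small sketch using Python's re module as a stand-in for Perl; the sample strings are made up. Python's re does not support \Q...\E, so re.escape is used as the closest analogue.

```python
import re

php = "<?= $title ?>"
assignment = "x = 1"

# Unescaped: '?' makes '<' optional, so a bare '=' matches too.
print(re.search(r"<?=", assignment).group())

# Either escaping form only matches the literal "<?=".
print(re.search(r"<\?=", assignment))
print(re.search(r"<[?]=", php).group())

# Escape everything automatically instead of remembering the special characters.
literal = re.escape("<?=")
print(re.search(literal, php).group())
```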
Rather than trying to remember which characters have to be escaped, you can use -Q to quote everything that needs to be quoted.
ack -Q "<?="
This is the best solution if you want to search for plain text (that is, if you don't need a regular expression).
ack "<\?="
? is a regex operator, so it needs escaping