For reference, I'm using SWI-Prolog v7.4.2 on Windows 10, 64-bit.
Entering the following code in the REPL:
write("\U0001D7F6"). % Mathematical Monospace Digit Zero
gives me this error in the output:
ERROR: Syntax error: Illegal character code
ERROR: write("
ERROR: ** here **
ERROR: \U0001D7F6") .
I know for a fact that U+1D7F6 is a valid Unicode character, so what's up?
SWI-Prolog internally uses the C wchar_t type to represent Unicode characters. On Windows these are 16 bits wide and intended to hold UTF-16 encoded strings. SWI-Prolog, however, uses wchar_t to get nice arrays of code points, and thus effectively only supports UCS-2 on Windows (code points U+0000..U+FFFF).
On non-Windows systems, wchar_t is usually 32 bits and thus the complete Unicode range is supported.
It is not a trivial thing to fix: handling wchar_t as UTF-16 loses the nice property that each element of the array is exactly one code point, while using our own 32-bit type means we cannot use the C library's wide character functions and have to reimplement them in SWI-Prolog. This is not only work; replacing them with pure C versions also loses the optimizations typically present in modern C runtime libraries.
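The platform difference is easy to see from C++ itself. A minimal check (the sizes shown are common platform conventions, not guarantees):

#include <cstdio>

int main() {
    // On Windows this typically prints 2: wchar_t holds UTF-16 code units,
    // so a single wchar_t cannot store U+1D7F6. On most Unix-like systems
    // it prints 4, which covers the full Unicode range.
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}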
The ISO core standard syntax for char codes looks different. The following works in SICStus Prolog, Jekejeke Prolog, SWI-Prolog, etc., and is thus more portable:
Using SWI-Prolog on a Mac:
Welcome to SWI-Prolog (threaded, 64 bits, version 7.5.8)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
?- set_prolog_flag(double_quotes, codes).
true.
?- X = "\x1D7F6\".
X = [120822].
?- write('\x1D7F6\'), nl.
𝟶
And Jekejeke Prolog on a Mac:
Jekejeke Prolog 2, Runtime Library 1.2.2
(c) 1985-2017, XLOG Technologies GmbH, Switzerland
?- X = "\x1D7F6\".
X = [120822]
?- write('\x1D7F6\'), nl.
𝟶
The underlying syntax is found in the ISO core standard, section 6.4.2.1 "hexadecimal escape sequence". It reads as follows and is shorter than the U-syntax:
hex_esc_seq --> "\x" hex_digit { hex_digit } "\".
For comparison, I get:
?- write('\U0001D7F6').
𝟶
What is your environment and what do the flags say?
For example:
$ set | grep LANG
LANG=en_US.UTF-8
and also:
?- current_prolog_flag(encoding, F).
F = utf8.
I have filenames using char16_t characters:
char16_t Text[2560] = u"ThisIsTheFileName.txt";
char16_t const* Filename = Text;
How can I check if the file exists already? I know that I can do so for wchar_t using _wfindfirst(). But I need char16_t here.
Is there an equivalent function to _wfindfirst() for char16_t?
Background for this is that I need to work with Unicode characters and want my code to work on Linux (32-bit) as well as on other platforms (16-bit).
_findfirst() is the narrow-character counterpart to _wfindfirst().
However, both _findfirst() and _wfindfirst() are specific to Windows. _findfirst() accepts ANSI strings (outdated legacy stuff). _wfindfirst() accepts UTF-16 in the form of wchar_t (which is not exactly the same thing as char16_t).
ANSI and UTF-16 are generally not used on Linux, and _findfirst()/_wfindfirst() are not provided by gcc there.
Linux uses UTF-8 for its Unicode format. You can use access() to check for file permission, or use opendir()/readdir()/closedir() as the equivalent to findfirst().
If you have a UTF-16 filename from Windows, you can convert the name to UTF-8, and use the UTF-8 name in Linux. See How to convert UTF-8 std::string to UTF-16 std::wstring?
Consider using std::filesystem in C++17 or higher.
Note that a Windows or Linux executable is 32-bit or 64-bit; that doesn't have anything to do with the character set. Some very old systems are 16-bit, but you probably won't come across them.
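As a minimal sketch (assuming a C++17 compiler), std::filesystem::path can be constructed directly from a char16_t string, so no manual UTF-16-to-UTF-8 conversion is needed:

#include <cstdio>
#include <filesystem>

int main() {
    // std::filesystem::path accepts char16_t sources directly and converts
    // to the platform's native encoding internally.
    const char16_t* filename = u"ThisIsTheFileName.txt";
    std::filesystem::path p(filename);

    if (std::filesystem::exists(p))
        std::puts("file exists");
    else
        std::puts("file not found");
}

This works unchanged on Windows and Linux, which sidesteps the _wfindfirst()/access() split entirely.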
I am using MIT scheme, and would like to be able to do something like this:
(define π 3.14159265)
without getting an encoding error like this:
;Illegal character: #\U+80
;To continue, call RESTART with an option number:
; (RESTART 1) => Return to read-eval-print level 1
MIT Scheme does have Unicode support, but it appears that it doesn't support Unicode in source code, which is what I'm looking for. It turns out that ISO-8859-1 (the encoding MIT Scheme uses) does not contain any Greek letters, which is a pity.
Solutions that might work, but are not very good:
Writing all of my code in text files and using the built-in Unicode support to read the Unicode characters in as code.
Rewriting the entire interpreter to accept unicode names
Using a different lisp implementation which allows for Unicode names.
Can't wait to hear from the Stack Overflowers!
You can use Unicode symbols in Guile, Gambit, SCM, and Chicken for sure.
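For example, in Guile (assuming a UTF-8 locale) the definition from the question works as-is:

(define π 3.14159265)
(* 2 π)   ; => 6.2831853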
I want to define a lexer rule for ranges between unicode characters that have code points that need more than four hexadecimal digits to identify. To be concrete, I want to declare the following rule:
ID_Continue : [\uE0100-\uE01EF] ;
Unfortunately, it doesn't work: this rule matches characters that are not in this range. (I'm not certain what exact behaviour results, but it isn't the one I want.) I've also tried the following (padding with leading zeros and using 8 digits):
ID_Continue : [\U000E0100-\U000E01EF] ;
But it seems to result in the same unwanted behaviour.
I am using Antlr4 and the IntelliJ plugin for it for testing.
Does ANTLR4 not support Unicode literals above \uFFFF?
No, ANTLR's max is the same as Java's Character.MAX_VALUE, i.e. \uFFFF.
If you look at (a part of) ANTLR4's lexer grammar you will see these rules:
// Any kind of escaped character that we can embed within ANTLR literal strings.
fragment EscSeq
: Esc
( [btnfr"'\\] // The standard escaped character set such as tab, newline, etc.
| UnicodeEsc // A Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
...
fragment UnicodeEsc
: 'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
;
...
fragment Esc : '\\' ;
Note: the limitation to the BMP is purely a Java limitation. Other targets might go much further. For instance, my MySQL grammar, written for ANTLR3 (C target), can easily lex e.g. emojis from beyond the BMP. This works for quoted strings as well as IDENTIFIERs.
What's a bit strange here, however, is that I haven't specified that range in the grammar (it uses only the BMP). Still, the parser can parse any UTF-8 input. Might be a bug in the target runtime, though I'm happy it exists :-D
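If you are stuck on a BMP-limited Java-target ANTLR, one workaround (a sketch; verify against your ANTLR version) is to spell out the UTF-16 surrogate pairs: every code point in U+E0100..U+E01EF encodes as the high surrogate \uDB40 followed by a low surrogate in the range \uDD00..\uDDEF, so the rule can be written as:

// Hypothetical rule: match U+E0100..U+E01EF as an explicit surrogate pair.
ID_Continue : '\uDB40' [\uDD00-\uDDEF] ;

Newer releases (ANTLR 4.7 and later) also accept \u{E0100}-style escapes covering the full code point range, which makes this workaround unnecessary.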
I want to write a grammar for a file format whose content can contain characters other than US-ASCII ones. Since I am used to ABNF, I try to use it...
However, neither RFC 5234 nor RFC 7405 is very friendly towards people who DO NOT use US-ASCII.
In fact, I'm looking for an ABNF version (and possibly some basic rules as well) which is character oriented rather than byte oriented; the only thing which RFC 5234 has to say about this is in section 2.4:
2.4. External Encodings
External representations of terminal value characters will vary
according to constraints in the storage or transmission environment.
Hence, the same ABNF-based grammar may have multiple external
encodings, such as one for a 7-bit US-ASCII environment, another for
a binary octet environment, and still a different one when 16-bit
Unicode is used. Encoding details are beyond the scope of ABNF,
although Appendix B provides definitions for a 7-bit US-ASCII
environment as has been common to much of the Internet.
By separating external encoding from the syntax, it is intended that
alternate encoding environments can be used for the same syntax.
That doesn't really clarify matters.
Is there a version of ABNF somewhere which is code point oriented rather than byte oriented?
Refer to section 2.3 of RFC 5234, which says:
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
Unicode is just the set of non-negative integers U+0000 through U+10FFFF minus the surrogate range U+D800-U+DFFF, and there are various RFCs that use ABNF accordingly. An example is RFC 3987.
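For example, RFC 3987's ucschar rule writes its ranges directly over code points (excerpt):

ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
        / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
        ; ...and so on, plane by plane, up to %xE1000-EFFFD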
If the ABNF you're writing is intended for human reading, then I'd say just use the normal syntax and refer to code points instead of bytes. You could take a look at various language specifications that allow Unicode in source text, e.g. C#, Java, PowerShell, etc. They all have a grammar, and they all have to define Unicode characters somewhere (e.g. for identifiers).
E.g. the PowerShell grammar has lines like this:
double-quote-character:
       " (U+0022)
       Left double quotation mark (U+201C)
       Right double quotation mark (U+201D)
       Double low-9 quotation mark (U+201E)
Or in the Java specification:
UnicodeInputCharacter:
       UnicodeEscape
       RawInputCharacter
UnicodeEscape:
       \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
       u
       UnicodeMarker u
RawInputCharacter:
       any Unicode character
HexDigit: one of
       0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
The \, u, and hexadecimal digits here are all ASCII characters.
Note that there is surrounding text explaining the intent – which is always better than just dumping a heap of grammar on someone.
If it's for automatic parser generation, you may be better off finding a tool that allows you to specify a grammar in both Unicode and ABNF-like form, and publish that instead. People writing parsers should be expected to understand either, though.
I am wondering if the newest version of flex supports unicode?
If so, how can I use patterns to match Chinese characters?
More:
Use regular expression to match ANY Chinese character in utf-8 encoding
At the moment, flex only generates 8-bit scanners, which basically limits you to UTF-8. So if you have a pattern:
肖晗 { printf ("xiaohan\n"); }
it will work as expected, as the sequence of bytes in the pattern and in the input will be the same. What's more difficult is character classes. If you want to match either the character 肖 or 晗, you can't write:
[肖晗] { printf ("xiaohan/2\n"); }
because this will match each of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97, which in practice means that if you supply 肖晗 as the input, the pattern will match six times. So in this simple case, you have to rewrite the pattern to (肖|晗).
For ranges, Hans Aberg has written a tool in Haskell that transforms these into 8-bit patterns:
Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0
This isn't pretty, but it should work.
Flex does not support Unicode. However, Flex supports "8 bit clean" binary input. Therefore you can write lexical patterns which match UTF-8. You can use these patterns in specific lexical areas of the input language, for instance identifiers, comments or string literals.
This will work well for typical programming languages, where you may be able to assert to the users of your implementation that the source language is written in ASCII/UTF-8 (and no other encoding is supported, period).
This approach won't work if your scanner must process text that can be in any encoding. It also won't work (very well) if you need to express lexical rules specifically for Unicode elements. I.e. you need Unicode characters and Unicode regexes in the scanner itself.
The idea is that you can recognize a pattern which includes UTF-8 bytes using a lex rule (and then perhaps take the yytext and convert it out of UTF-8, or at least validate it).
For a working example, see the source code of the TXR language, in particular this file: http://www.kylheku.com/cgit/txr/tree/parser.l
Scroll down to this section:
ASC [\x00-\x7f]
ASCN [\x00-\t\v-\x7f]
U [\x80-\xbf]
U2 [\xc2-\xdf]
U3 [\xe0-\xef]
U4 [\xf0-\xf4]
UANY {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UONLY {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
As you can see, we can define patterns to match ASCII characters as well as UTF-8 start and continuation bytes. UTF-8 is a lexical notation, and this is a lexical analyzer generator, so ... no problem!
Some explanations: UANY matches any character, single-byte ASCII or multi-byte UTF-8. UANYN is like UANY, but does not match the newline. This is useful for tokens that do not break across lines, like, say, a comment from # to the end of the line containing international text. UONLY matches only a UTF-8 extended character, not an ASCII one. This is useful for writing a lex rule which needs to exclude certain specific ASCII characters (not just newline) but where all extended characters are okay.
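As a quick illustration of these macros (a sketch, not taken from the TXR source), a rule for a # comment running to the end of the line, possibly containing international text, could be written like this:

#{UANYN}*    { /* skip comment; UANYN matches ASCII or UTF-8, but not newline */ }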
DISCLAIMER: Note that TXR's scanner rules use a function called utf8_dup_from to convert the yytext to wide character strings containing Unicode code points. That function is robust; it detects problems like overlong sequences and invalid bytes and properly handles them. I.e. the program is not relying on these lex rules to do the validation and conversion, just to do the basic lexical recognition. These rules will recognize an overlong form (like an ASCII code encoded using several bytes) as valid syntax, but the conversion function will treat them properly. In any case, I don't expect UTF-8 related security issues in program source code, since you have to trust source code to run it anyway (but data handled by the program may not be trusted!). If you're writing a scanner for untrusted UTF-8 data, take care!
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++.
RE/flex safely supports the full Unicode standard and accepts UTF-8, UTF-16, and UTF-32 input files without requiring UTF-8 hacks (which can't even support UTF-16/32 input or handle a UTF BOM).
Also, UTF-8 hacks with Flex don't allow you to write Unicode regular expressions such as [肖晗], which are fully supported in RE/flex.
It works seamlessly with Bison to build lexers and parsers.
In fact, with RE/flex we can write any Unicode patterns as UTF-8-based regular expressions in lexer .l specifications, such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
This generates a lexer that scans UTF-8, UTF-16, and UTF-32 files automatically. As per UTF standardization, for UTF-16/32 input a UTF BOM is expected in the input, while a UTF-8 BOM is optional.
We can use global %option unicode to enable Unicode and %option flex to specify Flex specifications. A local modifier (?u:) can be used to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option, RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide-char length), and echo(). The RE/flex method calls are cleaner IMHO, and include wide-char operations.
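For completeness, building works much like with Flex (a sketch; check the RE/flex documentation for the exact options of your version):

reflex --flex lexer.l                             # emits lex.yy.cpp
c++ -std=c++11 lex.yy.cpp -lreflex -o scanner     # link against libreflex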