Flex(lexer) support for unicode - unicode

I am wondering if the newest version of flex supports unicode?
If so, how can I use patterns to match Chinese characters?
More:
Use regular expression to match ANY Chinese character in utf-8 encoding

At the moment, flex only generates 8-bit scanners, which basically limits you to using UTF-8. So if you have a pattern:
肖晗 { printf ("xiaohan\n"); }
it will work as expected, as the sequence of bytes in the pattern and in the input will be the same. What's more difficult is character classes. If you want to match either the character 肖 or 晗, you can't write:
[肖晗] { printf ("xiaohan/2\n"); }
because this will match each of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97, which in practice means that if you supply 肖晗 as the input, the pattern will match six times. So in this simple case, you have to rewrite the pattern to (肖|晗).
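To see the six-way match concretely, here is a rough Python sketch. It is not flex: Python's re module working on bytes patterns stands in for an 8-bit scanner, which is an assumption of this illustration.

```python
import re

data = "肖晗".encode("utf-8")  # six bytes: e8 82 96 e6 99 97

# What an 8-bit scanner effectively sees for [肖晗]: a class of six single bytes.
byte_class = re.compile(b"[" + data + b"]")
# The rewritten pattern (肖|晗): two alternatives of three bytes each.
alternation = re.compile("肖".encode("utf-8") + b"|" + "晗".encode("utf-8"))

class_hits = byte_class.findall(data)
alt_hits = alternation.findall(data)

print(len(class_hits))  # 6: one match per byte
print(len(alt_hits))    # 2: one match per character
```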
For ranges, Hans Aberg has written a tool in Haskell that transforms these into 8-bit patterns:
Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0
This isn't pretty, but it should work.
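If you want to sanity-check a generated pattern before pasting it into a .l file, one option is to try it against encoded characters in Python. This is only a sketch: the helper name is_one_bmp_char is made up for the example, and the pattern is the urToRegU8 0 0xFFFF output above with the group made non-capturing.

```python
import re

# The urToRegU8 0 0xFFFF output from above, as a Python bytes pattern.
bmp_utf8 = re.compile(
    rb"[\0-\x7F]"
    rb"|[\xC2-\xDF][\x80-\xBF]"
    rb"|(?:\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]"
)

def is_one_bmp_char(ch: str) -> bool:
    """True if the UTF-8 encoding of ch matches the whole pattern."""
    return bmp_utf8.fullmatch(ch.encode("utf-8")) is not None

print(is_one_bmp_char("A"))           # True  (1 byte)
print(is_one_bmp_char("\u00e9"))      # True  (2 bytes)
print(is_one_bmp_char("\u8096"))      # True  (3 bytes, 肖)
print(is_one_bmp_char("\U0001F600"))  # False (4 bytes, outside 0..0xFFFF)
```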

Flex does not support Unicode. However, Flex supports "8 bit clean" binary input. Therefore you can write lexical patterns which match UTF-8. You can use these patterns in specific lexical areas of the input language, for instance identifiers, comments or string literals.
This will work well for typical programming languages, where you may be able to assert to the users of your implementation that the source language is written in ASCII/UTF-8 (and no other encoding is supported, period).
This approach won't work if your scanner must process text that can be in any encoding. It also won't work (very well) if you need to express lexical rules specifically for Unicode elements. I.e. you need Unicode characters and Unicode regexes in the scanner itself.
The idea is that you can recognize a pattern which includes UTF-8 bytes using a lex rule, (and then perhaps take the yytext, and convert it out of UTF-8 or at least validate it.)
For a working example, see the source code of the TXR language, in particular this file: http://www.kylheku.com/cgit/txr/tree/parser.l
Scroll down to this section:
ASC [\x00-\x7f]
ASCN [\x00-\t\v-\x7f]
U [\x80-\xbf]
U2 [\xc2-\xdf]
U3 [\xe0-\xef]
U4 [\xf0-\xf4]
UANY {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UONLY {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
As you can see, we can define patterns to match ASCII characters as well as UTF-8 start and continuation bytes. UTF-8 is a lexical notation, and this is a lexical analyzer generator, so ... no problem!
Some explanations: UANY means match any character, single-byte ASCII or multi-byte UTF-8. UANYN is like UANY but does not match the newline. This is useful for tokens that do not break across lines, like say a comment from # to the end of the line, containing international text. UONLY means match only a UTF-8 extended character, not an ASCII one. This is useful for writing a lex rule which needs to exclude certain specific ASCII characters (not just newline) but where all extended characters are okay.
DISCLAIMER: Note that the scanner's rules use a function called utf8_dup_from to convert the yytext to wide character strings containing Unicode codepoints. That function is robust; it detects problems like overlong sequences and invalid bytes and properly handles them. I.e. this program is not relying on these lex rules to do the validation and conversion, just to do the basic lexical recognition. These rules will recognize an overlong form (like an ASCII code encoded using several bytes) as valid syntax, but the conversion function will treat them properly. In any case, I don't expect UTF-8 related security issues in the program source code, since you have to trust source code to be running it anyway (but data handled by the program may not be trusted!) If you're writing a scanner for untrusted UTF-8 data, take care!
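The split between "recognize lexically" and "validate while converting" can be illustrated outside of flex. The following Python sketch transcribes UANY into a bytes regex and uses Python's strict UTF-8 decoder as a stand-in for utf8_dup_from; lex_then_validate is a name invented for this example and is not part of TXR.

```python
import re

U = rb"[\x80-\xbf]"
# Python transcription of the UANY definition above.
UANY = re.compile(
    rb"[\x00-\x7f]"
    rb"|[\xc2-\xdf]" + U +
    rb"|[\xe0-\xef]" + U + U +
    rb"|[\xf0-\xf4]" + U + U + U
)

def lex_then_validate(raw: bytes):
    """Return the decoded character, or None if either stage rejects it."""
    if UANY.fullmatch(raw) is None:
        return None                 # not even lexically a UTF-8 character
    try:
        return raw.decode("utf-8")  # strict decoder rejects overlong forms
    except UnicodeDecodeError:
        return None

overlong = b"\xe0\x80\xaf"  # overlong 3-byte encoding of "/"
print(UANY.fullmatch(overlong) is not None)  # True: the lex rule accepts it
print(lex_then_validate(overlong))           # None: the decoder rejects it
print(lex_then_validate("\u6657".encode("utf-8")) == "\u6657")  # True (晗)
```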

I am wondering if the newest version of flex supports unicode?
If so, how can I use patterns to match Chinese characters?
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++.
RE/flex safely supports the full Unicode standard and accepts UTF-8, UTF-16, and UTF-32 input files without requiring UTF-8 hacks (that can't even support UTF-16/32 input and handle UTF BOM.)
Also, UTF-8 hacks with Flex don't allow you to write Unicode regular expressions such as [肖晗] that are fully supported in RE/flex.
It works seamlessly with Bison to build lexers and parsers.
In fact, with RE/flex we can write any Unicode patterns as UTF-8-based regular expressions in lexer .l specifications, such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
This generates a lexer that scans UTF-8, UTF-16, and UTF-32 files automatically. As per UTF standardization, for UTF-16/32 input a UTF BOM is expected in the input, while a UTF-8 BOM is optional.
We can use global %option unicode to enable Unicode and %option flex to specify Flex specifications. A local modifier (?u:) can be used to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.

Related

Understanding Unicode: Surrogate Blocks, Noncharacters

I am trying to actually understand the unicode standard and was poking through the xml spec where it reads:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Now I have a couple of questions:
What are the surrogate blocks? Are they the UTF-16 codes that indicate a 4 byte code point?
Does #xXXXX refer to the code point or to the UTF-16 encoded value here?
If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here? Isn't it the task of an encoding to hide those encoding-related details from the space the encoding maps from?
Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.
Thanks for clarification!
What are the surrogate blocks?
Unicode codepoints in the U+D800 to U+DFFF range, inclusive, which are reserved for exclusive use as UTF-16 surrogates and are illegal in any other context.
Are they the UTF-16 codes that indicate a 4 byte code point?
Yes.
Does #xXXXX refer to the code point or to the UTF-16 encoded value here?
The actual Unicode codepoints. The definition of Char includes values > #xFFFF, which an individual UTF-16 code unit cannot exceed, so it cannot be referring to UTF-16 encoded values. UTFs are byte encoding schemes for codepoint values. The XML spec is written in terms of codepoints, not encodings. An XML document can be encoded in any charset specified in the "encoding" attribute of the XML prolog, for purposes of storage and transmission, but the actual XML content is processed in terms of unencoded codepoints.
If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here?
The surrogate codepoints are reserved and not allowed to appear unencoded in any textual content. The Char definition is simply enforcing that rule.
Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.
Because the encoding is not always known ahead of time, and may have to be detected dynamically. The BOM itself is U+FEFF. U+FFFE is its byte-swapped counterpart and is permanently reserved as a noncharacter, so a decoder that reads FFFE where it expects a BOM knows it has the byte order wrong. Early versions of Unicode allowed U+FEFF to be used as either a BOM or an actual zero-width non-breaking space within textual content. That led to ambiguity at times. So newer versions of Unicode recommend U+FEFF strictly as a BOM, and non-breaking behavior is handled by U+2060 WORD JOINER instead to avoid any ambiguity.
That being said, in the context of XML, it doesn't make sense to use U+FFFE in any textual content. It never represents real text, and seeing it usually means the document was decoded with the wrong byte order. The entire document is encoded in a particular charset, and any BOM used would have to appear before the XML prolog. The XML spec defines BOM handling and charset detection outside of the Char production. So that is why the Char definition excludes U+FFFE.
U+FFFF is reserved and is not intended to ever be used in real content in practice. So that is why the Char definition excludes it.
So basically the Char definition allows all Unicode codepoints minus restricted codepoints.
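As a quick illustration, the Char production quoted in the question translates almost directly into a range check on codepoints. This is a minimal sketch; the function name is_xml_char is made up for the example.

```python
def is_xml_char(cp: int) -> bool:
    """Check a codepoint against the XML 1.0 Char production quoted above."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

print(is_xml_char(ord("A")))  # True
print(is_xml_char(0xD800))    # False: surrogate block
print(is_xml_char(0xFFFE))    # False: noncharacter
print(is_xml_char(0x1F600))   # True: supplementary-plane codepoint
```

Note that the check works on codepoints, not on UTF-16 code units, which is exactly why values above #xFFFF can appear in it.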

How does code pages work in case of chinese

How does code pages work in case of chinese / japanese?
It is unable to encode all alphabet's characters for these languages in the limits of one byte so how does it work then?
Note that I'm taking about pre-Unicode times.
I'm most familiar with Japanese, but in general the strategy is the same for any language that needs more characters than fit in a single byte - you use a variable width multibyte encoding where some bytes are recognized as starting a "wide" character and ASCII is left alone.
In the early days so-called "ASCII-safe" encodings were useful. These used only seven bits (the high bit was always 0) so they worked with a variety of systems (including hardware) that expected only control characters to set the high bit in any byte. ISO-2022-JP is one of these and is still used in email quite often (mostly on feature phones).
Here's what ISO-2022-JP looks like if you don't decode it:
echo "日本語" | iconv -f utf8 -t iso2022jp | cat -v
^[$BF|K\8l^[(B
Note that every byte in the output is valid ASCII, and any ASCII text in the input would come through unchanged; ^[ is an ASCII escape character. (ISO-2022 also has 8-bit versions, but the 7-bit is the most commonly used variety.)
Later variable width encodings, like EUC, Shift-JIS, and UTF-8 all work on the same principle except they use binary (non-ASCII) escapes, so the first byte of any multi-byte character has the high bit set (that is, the unsigned byte value is >= 128). The Wikipedia article for UTF-8 has a nice table explaining how UTF-8 handles this. Just like the older ASCII-safe encodings, these leave ASCII strings unmodified.
There also exist fixed-width multibyte encodings, but they're relatively uncommon. There was an attempt to popularize an encoding that just used two bytes for everything, called "UCS-2", but it ended up not having room for enough characters and was mostly superseded by variable width UTF-16 in the 1990s. UTF-16 is (practically speaking) the internal encoding used in Java and Javascript, but due to the history with UCS-2 sometimes things like string length work in strange ways.
Technically fixed-width UTF-32 exists, but it's not widely used and I've never personally encountered it in the wild.
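If you want to poke at these encodings yourself, Python ships codecs for all of them. This is just a small sketch using the standard codec names iso2022_jp, euc_jp and shift_jis; the variable names are arbitrary.

```python
text = "日本語"

# Print the raw bytes each encoding produces for the same three characters.
for codec in ("iso2022_jp", "euc_jp", "shift_jis", "utf-8"):
    encoded = text.encode(codec)
    print(f"{codec:10} {len(encoded):2} bytes  {encoded!r}")

seven_bit = text.encode("iso2022_jp")
# ISO-2022-JP never sets the high bit; it switches character sets with ESC sequences.
print(all(b < 0x80 for b in seven_bit))  # True

# The 8-bit encodings mark the start of a multi-byte character with the high bit.
for codec in ("euc_jp", "shift_jis", "utf-8"):
    print(codec, text.encode(codec)[0] >= 0x80)  # True for each

# ASCII passes through all of them unchanged.
print(all("test".encode(c) == b"test"
          for c in ("iso2022_jp", "euc_jp", "shift_jis", "utf-8")))  # True
```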

Unicode version of ABNF?

I want to write a grammar for a file format whose content can contain characters other than US-ASCII ones. Since I am used to ABNF, I try to use it...
However, neither RFC 5234 nor RFC 7405 is very friendly towards people who DO NOT use US-ASCII.
In fact, I'm looking for an ABNF version (and possibly some basic rules as well) which is character oriented rather than byte oriented; the only thing which RFC 5234 has to say about this is in section 2.4:
2.4. External Encodings
External representations of terminal value characters will vary
according to constraints in the storage or transmission environment.
Hence, the same ABNF-based grammar may have multiple external
encodings, such as one for a 7-bit US-ASCII environment, another for
a binary octet environment, and still a different one when 16-bit
Unicode is used. Encoding details are beyond the scope of ABNF,
although Appendix B provides definitions for a 7-bit US-ASCII
environment as has been common to much of the Internet.
By separating external encoding from the syntax, it is intended that
alternate encoding environments can be used for the same syntax.
That doesn't really clarify matters.
Is there a version of ABNF somewhere which is code point oriented rather than byte oriented?
Refer to section 2.3 of RFC 5234, which says:
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
Unicode is just the set of non-negative integers U+0000 through U+10FFFF minus the surrogate range D800-DFFF and there are various RFCs that use ABNF accordingly. An example is RFC 3987.
If the ABNF you're writing is intended for human reading, then I'd say just use the normal syntax and refer to code points instead of bytes. You could take a look at various language specifications that allow Unicode in source text, e.g. C#, Java, PowerShell, etc. They all have a grammar, and they all have to define Unicode characters somewhere (e.g. for identifiers).
E.g. the PowerShell grammar has lines like this:
double-quote-character:
       " (U+0022)
       Left double quotation mark (U+201C)
       Right double quotation mark (U+201D)
       Double low-9 quotation mark (U+201E)
Or in the Java specification:
UnicodeInputCharacter:
       UnicodeEscape
       RawInputCharacter
UnicodeEscape:
       \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
       u
       UnicodeMarker u
RawInputCharacter:
       any Unicode character
HexDigit: one of
       0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
The \, u, and hexadecimal digits here are all ASCII characters.
Note that there is surrounding text explaining the intent – which is always better than just dumping a heap of grammar on someone.
If it's for automatic parser generation, you may be better off finding a tool that allows you to specify a grammar both in Unicode and ABNF-like form and publish that instead. People writing parsers should be expected to understand either, though.

Unicode Encodings

I have a question as to how programs parse strings if they do not a priori know the encoding that is used.
As I understand it, the UTF-8 encoding stores ASCII characters with 1 byte, and all other characters with up to as many as 6 (I think it's 6) bytes. Thus, for example, two spaces would be stored in memory as 0x2020.
How then would a program be able to determine the difference between this string and the string 0x2020 encoded using the UTF-16 encoding, which corresponds to a single character that evidently looks similar to the symbol sometimes used to denote the adjoint of an operator in mathematics (I just looked that up here)?
It seems as if the parser would always have to know the encoding of a string beforehand. If so, how is this implemented in practice? Is there a byte preceding each string which tells the parser what encoding is used or something?
In general, it is not possible to know for certain the exact encoding used based solely on the stream of bytes that can represent text. However, if there is a byte order mark somewhere, you can use it at least as a hint as to what encoding is being used.
But with no hints or some kind of contract/exchange of metadata between the producer and consumer of the text, you can't be 100% sure. You can try using a heuristic, but then you get these kinds of problems if you end up guessing wrong.
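The 0x2020 example from the question is easy to reproduce. In this small Python sketch the same two bytes decode without error under both encodings, which is why the bytes alone cannot settle the matter:

```python
raw = b"\x20\x20"

as_utf8 = raw.decode("utf-8")       # two SPACE characters
as_utf16 = raw.decode("utf-16-be")  # one character, U+2020 DAGGER

print(len(as_utf8), [f"U+{ord(c):04X}" for c in as_utf8])    # 2 ['U+0020', 'U+0020']
print(len(as_utf16), [f"U+{ord(c):04X}" for c in as_utf16])  # 1 ['U+2020']
```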
If you want to be really sure, set up some kind of protocol or contract between the producer and the consumer of the text so that the text and the encoding scheme is known. You can hardcode the encoding scheme (for example, your program may parse UTF-8 and only UTF-8), or ensure the producer of the text always prepend a byte order mark or specially designed header bytes to communicate the encoding scheme.
Does the language always store strings in a certain encoding so that
the display function could safely assume that the string was encoded,
say, using UTF-8?
It depends on the language.
In C#, yes. A char is defined by the language specification (8.2.1) as a UTF-16 code unit, and thus a string is always UTF-16. Just like Java.
In Ruby 1.9, a string is a byte array with an associated Encoding.
But in pre-Unicode languages like C (and badly-designed post-Unicode languages like PHP), a string is just a byte array with no encoding information. You have to rely on convention. It's a real interesting experience to write a program that uses both a library that assumes UTF-8 strings and another that assumes windows-1252 strings.
A question that's equally relevant to all languages is: How do you determine the encoding of a byte array that contains encoded text? There are several different approaches:
Encoding declarations.
In protocols that use MIME types (notably, SMTP and HTTP), you can declare Content-Type: text/html; charset=UTF-8. In HTML, you can use <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> or the newer <meta charset="UTF-8">. In XML, there's <?xml version="1.0" encoding="UTF-8"?>. In Python source code, there's # -*- coding: UTF-8 -*-.
Unfortunately, such declarations aren't always accurate. And they aren't available at all for locally-stored plain .txt files, so then a different approach must be used.
Byte-order mark (BOM)
Putting the special character U+FEFF at the beginning of a file lets you distinguish between the various UTF encodings.
But it's not usable for legacy encodings like ISO-8859-x or Windows-125x, and not always used with UTF-8.
Validation
Some encodings have strict rules about what makes a valid string. The best-known is UTF-8, with its rigid separation of leading/trailing bytes, prohibition of "overlong" encodings, etc. UTF-32 is even easier to recognize because the restriction of Unicode to 17 "planes" means that every code unit must have the form 00 {00-10} xx xx (or xx xx {00-10} 00 for little-endian).
So if text validates as being UTF-8 or UTF-32, you can safely assume that it is. There's a possibility of false positives, but it's very low.
However, this approach doesn't work well for UTF-16, where the false-positive rate is too high. (The only way for an even-length byte array to not be valid UTF-16 is to contain unpaired surrogates, or U+FFFE or U+FFFF.)
Statistical analysis
Use character frequency tables of various language/encoding combinations. This is the approach used by chardet (in combination with BOM and validation).
Falling back on a default encoding
When all else fails, assume ISO-8859-1, windows-1252, or Encoding.Default.
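Put together, the BOM, validation and fallback steps might look roughly like the Python sketch below. The function name guess_encoding and the cp1252 default are choices made for this example, and the statistical step is left out entirely (that is what a library like chardet adds).

```python
import codecs

# Check the longer BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE). The returned codec names consume the BOM on decode.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_encoding(data: bytes, default: str = "cp1252") -> str:
    """Best-effort guess: BOM, then strict UTF-8 validation, then a default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")  # strict: fails on most legacy 8-bit text
        return "utf-8"
    except UnicodeDecodeError:
        return default

for sample in ("h\u00e9llo".encode("utf-8"),
               "h\u00e9llo".encode("cp1252"),
               "h\u00e9llo".encode("utf-16")):
    enc = guess_encoding(sample)
    print(enc, sample.decode(enc) == "h\u00e9llo")
```

This is still a guess: UTF-16 text that starts with a BOM followed by U+0000 would be taken for UTF-32, and any legacy encoding other than the default is decoded incorrectly without an error.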

What issues would come from treating UTF-16 as a fixed 16-bit encoding?

I was reading a few questions on SO about Unicode and there were some comments I didn't fully understand, like this one:
Dean Harding: UTF-8 is a
variable-length encoding, which is
more complex to process than a
fixed-length encoding. Also, see my
comments on Gumbo's answer: basically,
combining characters exist in all
encodings (UTF-8, UTF-16 & UTF-32) and
they require special handling. You can
use the same special handling that you
use for combining characters to also
handle surrogate pairs in UTF-16, so
for the most part you can ignore
surrogates and treat UTF-16 just like
a fixed encoding.
I'm a little confused by the last part ("for the most part"). If UTF-16 is treated as fixed 16-bit encoding, what issues could this cause? What are the chances that there are characters outside of the BMP? If there are, what issues could this cause if you'd assumed two-byte characters?
I read the Wikipedia info on Surrogates but it didn't really make things any clearer to me!
Edit: I guess what I really mean is "Why would anyone suggest treating UTF-16 as fixed encoding when it seems bogus?"
Edit2:
I found another comment in "Is there any reason to prefer UTF-16 over UTF-8?" which I think explains this a little better:
Andrew Russell: For performance:
UTF-8 is much harder to decode than
UTF-16. In UTF-16 characters are
either a Basic Multilingual Plane
character (2 bytes) or a Surrogate
Pair (4 bytes). UTF-8 characters can
be anywhere between 1 and 4 bytes
This suggests the point being made was that UTF-16 would not have any three-byte characters, so by assuming 16bits, you wouldn't "totally screw up" by ending up one-byte off. But I'm still not convinced this is any different to assuming UTF-8 is single-byte characters!
UTF-16 includes all "base plane" characters. The BMP covers most of the current writing systems, and includes many older characters that one can practically encounter. Take a look at them and decide whether you really are going to encounter any characters from the extended planes: cuneiform, alchemical symbols, etc. Few people will really miss them.
If you still encounter characters that require extended planes, these are encoded by two code units (surrogates), and you'll see two empty squares or question marks instead of such a character. UTF-16 is self-synchronizing, so a part of a surrogate pair never looks like a legitimate character. This allows things like string searches to work even if surrogates are present and you don't handle them.
Thus issues arising from treating UTF-16 as effectively UCS-2 are minimal, aside from the fact that you don't handle the extended characters.
EDIT: Unicode uses 'combining marks' that render at the space of previous character, like accents, tilde, circumflex, etc. Sometimes a combination of a diacritic mark with a letter can be represented as a distinct code point, e.g. á can be represented as a single \u00e1 instead of a plain 'a' + accent which are \u0061\u0301. Still you can't represent unusual combinations like z̃ as one code point. This makes search and splitting algorithms a bit more complex. If you somehow make your string data uniform (e.g. only using plain letters and combining marks), search and splitting become simple again, but anyway you lose the 'one position is one character' property. A symmetrical problem happens if you're seriously into typesetting and want to explicitly store ligatures like fi or ffl where one code point corresponds to 2 or 3 characters. This is not a UTF issue, it's an issue of Unicode in general, AFAICT.
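The á, z̃ and ligature cases can be checked with Python's standard unicodedata module. This is just an illustration of the point, using the code points mentioned above plus U+FB01 for the fi ligature:

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# z with a tilde has no precomposed form, so NFC leaves two code points.
z_tilde = "z\u0303"
print(len(unicodedata.normalize("NFC", z_tilde)))               # 2

# The fi ligature is one code point that expands to two letters under NFKC.
print(unicodedata.normalize("NFKC", "\ufb01"))                  # fi
```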
It is important to understand that even UTF-32 is fixed-length when it comes to code points, not characters. There are many characters that are composed from multiple code points, and therefore you can't really have a Unicode encoding where one number (code unit) corresponds to one character (as perceived by users).
To answer your question - the most obvious issue with treating UTF-16 as fixed-length encoding form would be to break a string in a middle of a surrogate pair so you get two invalid code points. It all really depends what you are doing with the text.
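Here is a small Python sketch of that failure. Python strings are indexed by code point, so the UTF-16 view is simulated by encoding to utf-16-le and slicing the bytes two at a time; that simulation is an assumption of the example, not how a UTF-16 based language exposes it.

```python
s = "a\U0001F600b"              # 'a', U+1F600 (an emoji outside the BMP), 'b'
units = s.encode("utf-16-le")

print(len(s))           # 3 code points
print(len(units) // 2)  # 4 UTF-16 code units: the emoji takes a surrogate pair

# "Take the first 2 characters" under a fixed-width assumption = first 2 code units.
head = units[:4]
try:
    head.decode("utf-16-le")
    cut_ok = True
except UnicodeDecodeError:
    cut_ok = False      # the slice ends on a lone high surrogate

print(cut_ok)           # False
```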
I guess what I really mean is
"Why would anyone suggest treating
UTF-16 as fixed encoding when it seems
bogus?"
Two words: Backwards compatibility.
Unicode was originally intended to use a fixed-width 16-bit encoding (UCS-2), which is why early adopters of Unicode (e.g., Sun with Java and Microsoft with Windows NT) used a 16-bit character type. When it turned out that 65,536 characters wasn't enough for everyone, UTF-16 was developed in order to allow these 16-bit character systems to represent the 16 new "planes".
This meant that characters were no longer fixed-width, so people created the rationalization that "that's OK because UTF-16 is almost fixed width."
But I'm still not convinced this is
any different to assuming UTF-8 is
single-byte characters!
Strictly speaking, it's not any different. You'll get incorrect results for things like "\uD801\uDC00".lower().
However, assuming UTF-16 is fixed width is less likely to break than assuming UTF-8 is fixed-width. Non-ASCII characters are very common in languages other than English, but non-BMP characters are very rare.
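To make the lower() example concrete: in Python 3 the two escapes are seen as two separate surrogate code points, which is roughly what code that treats UTF-16 as fixed width sees. A sketch, with the pair-to-code-point arithmetic written out by hand:

```python
pair = "\ud801\udc00"  # two separate surrogates, processed one unit at a time
real = "\U00010400"    # the character that pair encodes: DESERET CAPITAL LETTER LONG I

print(pair.lower() == pair)          # True: surrogates have no case mapping
print(real.lower() == "\U00010428")  # True: the actual lowercase letter

# Combining the pair into the code point it stands for.
hi, lo = 0xD801, 0xDC00
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(hex(cp))                       # 0x10400
```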
You can use the same special handling
that you use for combining characters
to also handle surrogate pairs in
UTF-16
I don't know what he's talking about. Combining sequences, whose constituent characters have an individual identity, are nothing at all like surrogate characters, which are only meaningful in pairs.
In particular, the characters within a combining sequence can be converted to a different encoding form one character at a time.
>>> 'a'.encode('UTF-8') + '\u0301'.encode('UTF-8')
b'a\xcc\x81'
But not surrogates:
>>> '\uD801'.encode('UTF-8') + '\uDC00'.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
UTF-16 is a variable-length encoding. The older UCS-2 is not. If you treat a variable-length encoding like fixed (constant length) you risk introducing error whenever you use "number of 16-bit numbers" to mean "number of characters", since the number of characters might actually be less than the number of 16-bit quantities.
The Unicode standard has changed several times along the way. For example, UCS-2 is not a valid encoding anymore. It has been deprecated for a while now.
As mentioned by user 9000, even in UTF-32, you have sequences of characters that are interdependent. The á is a good example, although this character can be canonicalized to \u00E1. So you can make it simple.
Unicode, even when using the UTF-32 encoding, supports up to 30 code points, one after the other, to represent the most complex characters. (The existing characters do not use that many, I think the longest in existence is currently 17 if I'm correct.)
For that reason, Unicode developed Normalization Forms. It actually considers five different forms:
Unnormalized -- a sequence you create manually, for example; text editors are expected to save properly normalized (NFC) code sequences
NFD -- Normalization Form Canonical Decomposition
NFKD -- Normalization Form Compatibility Decomposition
NFC -- Normalization Form Canonical Composition
NFKC -- Normalization Form Compatibility Canonical Composition
Although in most situations it does not matter much because long compositions are rare, even in languages that use them.
And in most cases, your code already deals with canonical compositions. However, if you create strings manually in your code, you are not unlikely to create an unnormalized string (assuming you use such long forms).
Properly implemented servers on the Internet are expected to refuse strings that are not canonical compositions as per Unicode. Long forms are also forbidden over connections. For example, the UTF-8 encoding technically allows for ASCII characters to be encoded using 1, 2, 3, or 4 bytes (and the old encoding allowed up to 6 bytes!) but those encodings are not permitted.
Any comment on the Internet that contradicts the Unicode Normalization Form document is simply incorrect.