Suppose I need the following Python value, the Unicode character with code point 0:
>>> unichr(0)
u'\x00'
How can I define it in Lua?
There isn't one.
Lua has no concept of a Unicode value. Lua has no concept of Unicode at all. All Lua strings are sequences of 8-bit "characters", and all Lua string functions treat them as such. Lua does not assume any Unicode encoding for strings; they're just sequences of bytes.
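For illustration (my own example, not part of the original answer), here is how that byte-oriented view behaves in practice:
local s = "h\195\169llo"        -- "héllo", with 'é' spelled out as its two UTF-8 bytes
print(#s)                       --> 6: Lua counts bytes, not characters
print(s:byte(2), s:byte(3))     --> 195  169: the two bytes encoding 'é'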
You can insert an arbitrary number into a string. For example:
"\065\066"
Is equivalent to:
"AB"
The \ escape is followed by up to three decimal digits (or it is one of the standard escape characters); the value must be less than or equal to 255. Lua is perfectly capable of handling strings with embedded \000 bytes.
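A short sketch (mine) showing that an embedded \000 is an ordinary byte to Lua:
local s = "a\000b"
print(#s)           --> 3: the NUL byte does not terminate the string
print(s:byte(2))    --> 0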
But you cannot directly insert Unicode codepoints into Lua strings. You can decompose the codepoint into UTF-8 and use the above mechanism to insert the codepoint into a string. For example:
"x\226\131\151"
This is the x character followed by the Unicode combining above arrow character.
But since no Lua functions actually understand UTF-8, you will have to expose some function that expects a UTF-8 string in order for it to be useful in any way.
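To make the decomposition concrete, here is a minimal pure-Lua sketch of a UTF-8 encoder; the name utf8_encode is my own invention (Lua 5.3's built-in utf8.char, mentioned further down, does the same job):
-- Encode a single Unicode code point (0..0x10FFFF) as a UTF-8 byte string.
function utf8_encode(cp)
    if cp < 0x80 then                 -- 1 byte: 0xxxxxxx
        return string.char(cp)
    elseif cp < 0x800 then            -- 2 bytes: 110xxxxx 10xxxxxx
        return string.char(0xC0 + math.floor(cp / 0x40),
                           0x80 + cp % 0x40)
    elseif cp < 0x10000 then          -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return string.char(0xE0 + math.floor(cp / 0x1000),
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    else                              -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return string.char(0xF0 + math.floor(cp / 0x40000),
                           0x80 + math.floor(cp / 0x1000) % 0x40,
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    end
end
print(utf8_encode(0x20D7) == "\226\131\151")   --> true: U+20D7, the combining arrow above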
How about
function unichr(ord)
    if ord == nil then return nil end
    -- control characters (0-31): return printable escape text, not the raw byte
    if ord < 32 then return string.format('\\x%02x', ord) end
    -- printable ASCII (32-126): return the character itself
    if ord < 127 then return string.char(ord) end
    -- other BMP code points: 4-hex-digit escape text
    if ord < 0x10000 then return string.format("\\u%04x", ord) end
    -- supplementary planes, up to the last code point U+10FFFF (8-digit escape text)
    if ord <= 0x10FFFF then return string.format("\\u%08x", ord) end
end
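A hedged usage sketch for the function above (note that for non-printable and non-ASCII inputs it returns escape text suitable for display, not actual encoded bytes; see the encoder sketch earlier for real UTF-8 output):
print(unichr(65))       --> A
print(unichr(0))        --> \x00 (four characters of display text, not a NUL byte)
print(unichr(0x20D7))   --> \u20d7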
While native Lua does not directly support or handle Unicode, its strings are really buffers of arbitrary bytes that by convention hold ASCII characters. Since strings may contain any byte values, it is relatively straightforward to build support for Unicode on top of native strings. Should byte buffers prove to be insufficiently robust for the purpose, one can also use a userdata object to hold anything, and with the addition of a suitable metatable, endow it with methods for creation, translation to a desired encoding, concatenation, iteration, and anything else that is needed.
There is a page at the Lua User's Wiki that discusses various ways to handle Unicode in Lua programs.
For a more modern answer: Lua 5.3 now has utf8.char, documented as follows:
Receives zero or more integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
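For example (my own illustration), the original question's unichr(0) becomes:
print(utf8.char(0) == "\0")        --> true: a one-byte string holding a NUL byte
print(utf8.char(0x8096, 0x6657))   --> 肖晗 (two code points, six UTF-8 bytes)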
Related
I am trying to understand what a "Unicode string" is, and the more I read the Unicode standard the less I understand it. Let's start from a definition coming from the standard.
A Unicode scalar value is any integer between 0x0 and 0xD7FF inclusive, or between 0xE000 and 0x10FFFF inclusive (D76, p:119)
My feeling was that a Unicode string is a sequence of Unicode scalar values. I would then define a UTF-8 Unicode string as a sequence of Unicode scalar values encoded in UTF-8. But I am not sure that this is the case. Here is one of the many definitions we can see in the standard:
"Unicode string: A code unit sequence containing code units of a particular Unicode encoding form" (D80, p:120)
But to me this definition is very fuzzy. Just to show how fuzzy it is, here are a few other "definitions" or strange things in the standard.
(p: 43) "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units."
According to this definition, any sequence of uint8 values would be valid UTF-8. I would rule out this definition, as it would accept anything as a Unicode string!
(p: 122) "Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is."
I would rule out this definition, as it would make it impossible to define the sequence of Unicode scalar values for a Unicode string encoded in UTF-16, since it allows surrogate pairs to be cut!
For a start, let's look for a clear definition of a UTF-8 Unicode string. So far, I can propose 3 definitions, but the real one (if there is one) might be different:
(1) Any array of uint8
(2) Any array of uint8 that comes from a sequence of Unicode scalar values encoded in UTF-8
(3) Any subarray of an array of uint8 that comes from a sequence of Unicode scalar values encoded in UTF-8
To make things concrete, here are a few examples:
[ 0xFF ] would be a UTF-8 Unicode string according to definition 1, but not according to definitions 2 or 3, as the byte 0xFF can never appear in the UTF-8 encoding of any Unicode scalar value.
[ 0xB0 ] would be a UTF-8 Unicode string according to definition 3, but not according to definition 2, as 0xB0 is a continuation byte of a multi-byte sequence and cannot stand alone as the encoding of a scalar value.
I am just lost with this "standard". Do you have any clear definition?
My feeling was that a Unicode string is a sequence of Unicode scalar values.
No, a Unicode string is a sequence of code units. The standard doesn't contain "many definitions", but only a single one:
D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
This doesn't require the string to be well-formed (see the following definitions). None of your other quotes from the standard contradict this definition. To the contrary, they only illustrate that a Unicode string, as defined by the standard, can be ill-formed.
An application shall only create well-formed strings, of course:
If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.
But the standard also contains some sections on how to deal with ill-formed input sequences.
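To make the well-formed/ill-formed distinction concrete (my own illustration, in Lua 5.3 since that is the language used elsewhere on this page): utf8.len counts characters in a well-formed string and otherwise reports the first offending byte, so it doubles as a well-formedness check:
print(utf8.len("x\226\131\151"))   --> 2: well-formed ('x' followed by U+20D7)
print(utf8.len("\255"))            --> nil  1: ill-formed, invalid byte at position 1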
In http://nedbatchelder.com/text/unipain.html it is explained that:
In Python 2, there are two different string data types. A plain-old string literal gives you a "str" object, which stores bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points.
What's the difference between a code point and a byte? (I'm thinking not really in terms of Python per se but just about the concept in general.) Essentially it's just a bunch of bits, right? I think of a plain old string literal as treating each 8 bits as a byte, and we interpret the bytes as integers, which allows us to map them to ASCII and the extended character sets. What's the difference between interpreting integers as that set of characters and interpreting a "code point" as a Unicode character? It says Python's Unicode object stores "code points". Isn't that just the same as plain old bytes, except possibly in the interpretation (where the bits of each Unicode character start and stop, as in UTF-8, for example)?
A code point is a number which acts as an identifier for a Unicode character. A code point itself cannot be stored; it must be encoded from Unicode into bytes in, e.g., UTF-16LE. While a certain byte or sequence of bytes can represent a specific code point in a given encoding, without the encoding information there is nothing to connect the code point to the bytes.
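To make this concrete (my own illustration, in Lua 5.3, the language used elsewhere on this page): the same two bytes are just numbers until you know they are UTF-8:
local s = "\195\169"            -- the two UTF-8 bytes encoding U+00E9 ('é')
print(string.byte(s, 1, -1))    --> 195  169: the bytes
print(utf8.codepoint(s))        --> 233: the code point, recoverable only because
                                --  we know the bytes are UTF-8-encoded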
I am wondering if the newest version of flex supports Unicode.
If so, how can I use patterns to match Chinese characters?
More:
Use regular expression to match ANY Chinese character in utf-8 encoding
At the moment, flex only generates 8-bit scanners, which basically limits you to UTF-8. So if you have a pattern:
肖晗 { printf ("xiaohan\n"); }
it will work as expected, as the sequence of bytes in the pattern and in the input will be the same. What's more difficult is character classes. If you want to match either the character 肖 or 晗, you can't write:
[肖晗] { printf ("xiaohan/2\n"); }
because this will match each of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97, which in practice means that if you supply 肖晗 as the input, the pattern will match six times. So in this simple case, you have to rewrite the pattern to (肖|晗).
For ranges, Hans Aberg has written a tool in Haskell that transforms these into 8-bit patterns:
Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0
This isn't pretty, but it should work.
Flex does not support Unicode. However, Flex supports "8 bit clean" binary input. Therefore you can write lexical patterns which match UTF-8. You can use these patterns in specific lexical areas of the input language, for instance identifiers, comments or string literals.
This will work well for typical programming languages, where you may be able to assert to the users of your implementation that the source language is written in ASCII/UTF-8 (and no other encoding is supported, period).
This approach won't work if your scanner must process text that can be in any encoding. It also won't work (very well) if you need to express lexical rules specifically for Unicode elements. I.e. you need Unicode characters and Unicode regexes in the scanner itself.
The idea is that you can recognize a pattern which includes UTF-8 bytes using a lex rule (and then perhaps take the yytext and convert it out of UTF-8, or at least validate it).
For a working example, see the source code of the TXR language, in particular this file: http://www.kylheku.com/cgit/txr/tree/parser.l
Scroll down to this section:
ASC [\x00-\x7f]
ASCN [\x00-\t\v-\x7f]
U [\x80-\xbf]
U2 [\xc2-\xdf]
U3 [\xe0-\xef]
U4 [\xf0-\xf4]
UANY {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UONLY {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
As you can see, we can define patterns to match ASCII characters as well as UTF-8 start and continuation bytes. UTF-8 is a lexical notation, and this is a lexical analyzer generator, so ... no problem!
Some explanations: UANY means match any character, single-byte ASCII or multi-byte UTF-8. UANYN is like UANY, but it does not match the newline. This is useful for tokens that do not break across lines, like, say, a comment from # to the end of the line, containing international text. UONLY means match only a UTF-8 extended character, not an ASCII one. This is useful for writing a lex rule which needs to exclude certain specific ASCII characters (not just newline) but where all extended characters are okay.
DISCLAIMER: Note that the scanner's rules use a function called utf8_dup_from to convert the yytext to wide character strings containing Unicode codepoints. That function is robust; it detects problems like overlong sequences and invalid bytes and properly handles them. I.e. this program is not relying on these lex rules to do the validation and conversion, just to do the basic lexical recognition. These rules will recognize an overlong form (like an ASCII code encoded using several bytes) as valid syntax, but the conversion function will treat it properly. In any case, I don't expect UTF-8 related security issues in the program source code, since you have to trust source code in order to run it anyway (but data handled by the program may not be trusted!). If you're writing a scanner for untrusted UTF-8 data, take care!
I am wondering if the newest version of flex supports Unicode.
If so, how can I use patterns to match Chinese characters?
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++.
RE/flex safely supports the full Unicode standard and accepts UTF-8, UTF-16, and UTF-32 input files without requiring UTF-8 hacks (which cannot even support UTF-16/32 input or handle a UTF BOM).
Also, UTF-8 hacks with Flex don't allow you to write Unicode regular expressions such as [肖晗] that are fully supported in RE/flex.
It works seamlessly with Bison to build lexers and parsers.
In fact, with RE/flex we can write any Unicode patterns as UTF-8-based regular expressions in lexer .l specifications, such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
This generates a lexer that scans UTF-8, UTF-16, and UTF-32 files automatically. As per UTF standardization, a UTF BOM is expected in UTF-16/32 input, while a UTF-8 BOM is optional.
We can use global %option unicode to enable Unicode and %option flex to specify Flex specifications. A local modifier (?u:) can be used to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.
Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does, but with limitations. I simply don't understand: are the limitations significant, or not a big deal?
You can certainly store Unicode strings in Lua, as UTF-8. You can use them as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ASCII characters, etc. As an example, if you have the string "開発.txt" and you search for "." in that string using string.find(string_var, ".", 1, true) (a plain-text search; note that a bare "." would otherwise be a pattern matching any single byte), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct UTF-8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in Unicode characters of a string, just count the number of bytes with the high bit zero (ASCII characters) and the number of bytes with the top two bits 11 ("leading bytes" of non-ASCII characters); the length is the sum of those two (a sketch follows this list).
For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page
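As promised above, a minimal sketch of both points; utf8_len is a hypothetical helper name, and the split relies on string.find's plain-text mode:
-- Length in characters: count every byte that is not a continuation
-- byte (10xxxxxx), i.e. ASCII bytes plus leading bytes.
function utf8_len(s)
    local n = 0
    for i = 1, #s do
        local b = s:byte(i)
        if b < 0x80 or b >= 0xC0 then n = n + 1 end
    end
    return n
end

local name = "開発.txt"
print(utf8_len(name))                        --> 6 (whereas #name is 10 bytes)

local dot = string.find(name, ".", 1, true)  -- plain find, no pattern matching
print(name:sub(1, dot - 1), name:sub(dot))   --> 開発    .txt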
Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the sense that specifying, storing and querying arbitrary byte values in strings is supported, so you can store a string in any Unicode encoding in a Lua string.
What is not supported is iteration by Unicode character; there is no standard function for string length in Unicode characters, etc. So the higher-level kind of Unicode support (like what is available in Python: length, lower-to-upper-case conversion, encoding into an arbitrary encoding, etc.) is not available.
Lua 5.3 has now been released. It comes with a basic UTF-8 library.
You can use the utf8 library to do things related to UTF-8 encoding, like getting the length of a UTF-8 string in characters (not in bytes, as string.len does), iterating over characters (not bytes), etc.
It doesn't provide support beyond encoding, such as answering "is this character a Chinese character?".
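A small illustration (mine, not from the original answer):
local s = "開発.txt"
print(string.len(s), utf8.len(s))   --> 10   6: bytes versus characters
for pos, cp in utf8.codes(s) do     -- iterate over characters, not bytes
    print(pos, cp)                  -- byte position and code point of each character
end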
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.
I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and limitations of the language in this area.
The types are given on the website as:
char; // unsigned 8 bit UTF-8
wchar; // unsigned 16 bit UTF-16
dchar; // unsigned 32 bit UTF-32
Since we know that most of the Unicode Transformation (UTF) Format encodings represent characters with a variable bit-width, does this mean that a char in D can only contain the values that will fit in 8 bits, or does it expand in the machine's physical memory when you give it double byte characters? Perhaps there is some other possibility, like automatic casting into the next most appropriate type as you overload the variable?
Let's say, for example, I want to use the UTF-8 char in an editor and type in Chinese. Will it simply fall over, or is it able to deal with Unicode characters more 'correctly', like in C#? Would it still be necessary to provide glue code to allow working with any language supported by Unicode?
I'd appreciate any specific information you can offer on how these types work under the covers, and any general best practices advice on dealing with their limitations.
A single char or wchar represents a UTF code unit. This means that, by itself, a char in D can either represent an ASCII symbol (0-127) or be part of a UTF-8 sequence representing a Unicode character (code point). Only the dchar type can represent an entire Unicode character, because there are more than 65536 code points in Unicode.
Casting between the string types (string, wstring and dstring, which are simply dynamic arrays of the character types) will not automatically convert their contents to the respective UTF representation. In order to do this, you must use the functions toUTF8, toUTF16 and toUTF32 from std.utf (or toString / toString16 / toString32 from tango.text.convert.Utf if you use Tango).
Users have implemented string classes which will automatically use the most memory-efficient representation that can map each character to a single code unit. This allows quick slicing and indexing with a minimal memory overhead. One such implementation is mtext by Christopher E. Miller.
Further reading:
the String handling section in Wikipedia's entry on D
Text in D, by Daniel Keep