Suppose that my Haskell function is given an input, which is supposed to be the number of a unicode code point. How can one convert this to the corresponding character?
Example:
123 to '{'.
Use toEnum; for example, toEnum 123 :: Char yields '{'. (Char is an instance of the Enum typeclass.)
When I encounter an arbitrary Unicode string, such as a hashtag, I would like to express only its alphanumeric components as a string of their ASCII equivalents. For example,
x='𝐏𝐚𝐭𝐫𝐢𝐨𝐭'
would be rendered as
x='Patriot'
Since I cannot anticipate the unicode that could appear in such strings, I would like the method to be as general as possible. Any suggestions?
The unicodedata.normalize function can translate characters such as these to their canonical or compatibility equivalents. Then run the result through ASCII encoding, ignoring anything non-ASCII, to get a byte string, and decode it back through ASCII to get a Unicode string again:
>>> import unicodedata as ud
>>> x='𝐏𝐚𝐭𝐫𝐢𝐨𝐭'
>>> ud.normalize('NFKC',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
If you also need to remove accents from letters but keep the base letter, use 'NFKD' instead.
>>> x='𝐏𝐚𝐭𝐫𝐢ô𝐭'
>>> ud.normalize('NFKD',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
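If you need to do this in more than one place, the two steps can be wrapped in a small helper; a minimal sketch (the name to_ascii is not from any library, just a convenience):

import unicodedata as ud

def to_ascii(text):
    # NFKD decomposes styled letters into plain letters and splits accented
    # letters into a base letter plus combining marks; the ascii/ignore
    # round trip then drops anything without an ASCII equivalent.
    return ud.normalize('NFKD', text).encode('ascii', errors='ignore').decode('ascii')

print(to_ascii('𝐏𝐚𝐭𝐫𝐢ô𝐭'))   # prints Patriot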
I am trying to understand what a "Unicode string" is, and the more I read the Unicode standard, the less I understand it. Let's start with a definition from the Unicode standard.
A unicode scalar value is any integer between 0x0 and 0xD7FF inclusive, or between 0xE000 and 0x10FFFF inclusive (D76, p. 119)
My feeling was that a unicode string is a sequence of unicode scalar values. I would define a UTF-8 unicode string as a sequence of unicode scalar values encoded in UTF-8. But I am not sure that it is the case. Here is one of the many definitions we can see in the standard.
"Unicode string: A code unit sequence containing code units of a particular Unicode encoding form" (D80, p:120)
But to me this definition is very fuzzy. Just to show how bad it is, here are a few other "definitions" or strange passages in the standard.
(p: 43) "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8- bit string is an ordered sequence of 8-bit code units."
According to this definition, any sequence of uint8 would be a valid UTF-8 unicode string. I would rule out this definition, as it would accept anything as a unicode string!!!
(p: 122) "Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and , each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well- formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is."
I would rule out this definition, as it would make it impossible to define the sequence of unicode scalar values for a unicode string encoded in UTF-16, because it allows surrogate pairs to be split!!!
For a start, let's look for a clear definition of a UTF-8 unicode string. So far, I can propose 3 definitions, but the real one (if there is one) might be different:
(1) Any array of uint8
(2) Any array of uint8 that comes from a sequence of unicode scalar values encoded in UTF-8
(3) Any subarray of an array of uint8 that comes from a sequence of unicode scalar values encoded in UTF-8
To make things concrete, here are a few examples (a quick Python check follows them):
[ 0xFF ] would be a UTF-8 unicode string according to definition 1, but not according to definitions 2 and 3, as 0xFF can never appear in a sequence of code units produced by UTF-8 encoding unicode scalar values.
[ 0xB0 ] would be a UTF-8 unicode string according to definition 3, but not according to definition 2, as it is a continuation byte, which can only appear after a leading byte inside a multi-byte code unit sequence.
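As a quick sanity check of these two examples, a Python sketch (its strict UTF-8 codec only accepts complete, well-formed sequences):

bytes([0xFF]).decode('utf-8')   # raises UnicodeDecodeError: 0xFF never occurs in UTF-8
bytes([0xB0]).decode('utf-8')   # also raises: a continuation byte cannot stand on its own
'\u00b0'.encode('utf-8')        # b'\xc2\xb0', so 0xB0 does occur inside a valid sequence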
I am just lost with this "standard". Do you have any clear definition?
My feeling was that a unicode string is a sequence of unicode scalar values.
No, a Unicode string is a sequence of code units. The standard doesn't contain "many definitions", but only a single one:
D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
This doesn't require the string to be well-formed (see the following definitions). None of your other quotes from the standard contradict this definition. To the contrary, they only illustrate that a Unicode string, as defined by the standard, can be ill-formed.
An application shall only create well-formed strings, of course:
If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.
But the standard also contains some sections on how to deal with ill-formed input sequences.
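One way to see the distinction in practice is a Python sketch (Python's str is a sequence of code points rather than UTF-16 code units, so it only approximates the standard's terminology):

s = '\ud800'                                # a lone high surrogate is allowed in a Python str
# s.encode('utf-8')                         # would raise UnicodeEncodeError: the string is ill-formed
raw = s.encode('utf-8', 'surrogatepass')    # b'\xed\xa0\x80', which is not valid UTF-8
# raw.decode('utf-8')                       # would raise UnicodeDecodeError for the same reason

The string exists and can be manipulated, but it cannot claim to be in the UTF-8 encoding form until the surrogate is paired or removed.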
I have this kind of symbols in a db table (ะ ัะ ยฐะ ัะ ัะ ยต), and I don't know who inserted this data into the table. Is there any way to convert them back to Cyrillic?
Yes, you can do the conversion. Since you haven't mentioned any language, only the general logic is given:
1. Assuming the string length is even, take the next two adjacent characters.
2. Combine the underlying byte values of the two characters to give a 16-bit value. This gives you the multi-byte value of the Cyrillic character, which you can decode to its text representation using a proper encoding such as UTF-8.
3. Repeat steps 1 and 2 for the next two characters until the end of the string.
If you want, you can implement it in any language of your choice.
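A minimal sketch of these steps in Python, assuming the garbled text is UTF-8 bytes (two per Cyrillic letter) that were misread as a single-byte encoding such as Latin-1; the exact decoding depends on which encodings were actually mixed up in your database:

def fix_mojibake(garbled):
    # Recover each character's underlying byte value, then decode the
    # recombined two-byte sequences as UTF-8 Cyrillic letters.
    raw = bytes(ord(ch) for ch in garbled)   # only works if every ord(ch) < 256
    return raw.decode('utf-8')

# Equivalent one-liner: garbled.encode('latin-1').decode('utf-8')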
If I need to have the following Python value, the Unicode character 0:
>>> unichr(0)
u'\x00'
How can I define it in Lua?
There isn't one.
Lua has no concept of a Unicode value. Lua has no concept of Unicode at all. All Lua strings are 8-bit sequences of "characters", and all Lua string functions will treat them as such. Lua does not treat strings as having any Unicode encoding; they're just a sequence of bytes.
You can insert an arbitrary number into a string. For example:
"\065\066"
Is equivalent to:
"AB"
The \ notation is followed by up to 3 decimal digits (or by one of the escape characters), and the value must be less than or equal to 255. Lua is perfectly capable of handling strings with embedded \000 characters.
But you cannot directly insert Unicode codepoints into Lua strings. You can decompose the codepoint into UTF-8 and use the above mechanism to insert the codepoint into a string. For example:
"x\226\131\151"
This is the x character followed by the Unicode combining right arrow above character (U+20D7).
But since no Lua functions actually understand UTF-8, you will have to expose some function that expects a UTF-8 string in order for it to be useful in any way.
How about
function unichr(ord)
    if ord == nil then return nil end
    if ord < 32 then return string.format('\\x%02x', ord) end         -- control characters as \xNN escape text
    if ord < 127 then return string.char(ord) end                      -- printable ASCII as the character itself
    if ord < 0x10000 then return string.format('\\u%04x', ord) end     -- BMP code points as \uNNNN escape text
    if ord <= 0x10FFFF then return string.format('\\u%08x', ord) end   -- supplementary planes as 8-digit escape text
end
While native Lua does not directly support or handle Unicode, its strings are really buffers of arbitrary bytes that by convention hold ASCII characters. Since strings may contain any byte values, it is relatively straightforward to build support for Unicode on top of native strings. Should byte buffers prove to be insufficiently robust for the purpose, one can also use a userdata object to hold anything, and with the addition of a suitable metatable, endow it with methods for creation, translation to a desired encoding, concatenation, iteration, and anything else that is needed.
There is a page at the Lua User's Wiki that discusses various ways to handle Unicode in Lua programs.
For a more modern answer, Lua 5.3 now has utf8.char:
Receives zero or more integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and limitations of the language in this area.
The types are given on the website as:
char; // unsigned 8 bit UTF-8
wchar; // unsigned 16 bit UTF-16
dchar; // unsigned 32 bit UTF-32
Since we know that most of the Unicode Transformation (UTF) Format encodings represent characters with a variable bit-width, does this mean that a char in D can only contain the values that will fit in 8 bits, or does it expand in the machine's physical memory when you give it double byte characters? Perhaps there is some other possibility, like automatic casting into the next most appropriate type as you overload the variable?
Let's say, for example, I want to use the UTF-8 char in an editor and type in Chinese. Will it simply fall over, or is it able to deal with Unicode characters more 'correctly', like in C#? Would it still be necessary to provide glue code to allow working with any language supported by Unicode?
I'd appreciate any specific information you can offer on how these types work under the covers, and any general best practices advice on dealing with their limitations.
A single char or wchar represents a UTF code unit. This means that, on its own, a char can either represent an ASCII symbol (0-127) or be part of a UTF-8 sequence representing a Unicode character (code point). Only the dchar type can represent any Unicode code point on its own, because there are more than 65536 code points in Unicode.
Casting between the string types (string, wstring and dstring, which are simply dynamic arrays of the character types) will not automatically convert their contents to the corresponding UTF representation. In order to do this, you must use the functions toUTF8, toUTF16 and toUTF32 from std.utf (or toString / toString16 / toString32 from tango.text.convert.Utf if you use Tango).
Users have implemented string classes which will automatically use the most memory-efficient representation that can map each character to a single code unit. This allows quick slicing and indexing with a minimal memory overhead. One such implementation is mtext by Christopher E. Miller.
Further reading:
the String handling section in Wikipedia's entry on D
Text in D, by Daniel Keep