I am trying to understand what a "Unicode string" is, and the more I read the Unicode standard the less I understand it. Let's start from a definition coming from the Unicode standard.
A Unicode scalar value is any integer between 0x0 and 0xD7FF inclusive, or between 0xE000 and 0x10FFFF inclusive (D76, p:119).
My feeling was that a unicode string is a sequence of unicode scalar values. I would define a UTF-8 unicode string as a sequence of unicode scalar values encoded in UTF-8. But I am not sure that it is the case. Here is one of the many definitions we can see in the standard.
"Unicode string: A code unit sequence containing code units of a particular Unicode encoding form" (D80, p:120)
But to me this definition is very fuzzy. Just to understand how bad it is, here are a few other "definitions" or strange statements in the standard.
(p: 43) "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units."
According to this definition, any sequence of uint8 is a valid UTF-8 unicode string. I would rule out this definition, as it would accept anything as a unicode string!!!
(p: 122) "Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is."
I would rule out this definition, as it would make it impossible to define the sequence of unicode scalar values for a unicode string encoded in UTF-16: this definition allows surrogate pairs to be cut in half!!!
For a start, let's look for a clear definition of a UTF-8 unicode string. So far, I can propose 3 definitions, but the real one (if there is one) might be different:
(1) Any array of uint8
(2) Any array of uint8 that comes from a sequence of unicode scalar values encoded in UTF-8
(3) Any subarray of an array of uint8 that comes from a sequence of unicode scalar values encoded in UTF-8
To make things concrete, here are a few examples:
[ 0xFF ] would be a UTF-8 unicode string according to definition 1, but not according to definitions 2 and 3, as 0xFF can never appear in a sequence of code units that comes from UTF-8-encoded unicode scalar values.
[ 0xB0 ] would be a UTF-8 unicode string according to definition 3, but not according to definition 2, as it is a continuation byte of a multi-byte sequence: it can appear inside a valid UTF-8 encoding, but never stand alone as one.
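To check the two examples above against an actual decoder, here is a small sketch in Python (used only because it ships with a strict UTF-8 codec; this is an illustration, not part of the standard):

# [0xFF] can never appear in UTF-8, so it fails under definitions 2 and 3.
try:
    bytes([0xFF]).decode("utf-8")
except UnicodeDecodeError as e:
    print("0xFF rejected:", e.reason)          # 'invalid start byte'

# [0xB0] is a continuation byte: not a complete UTF-8 encoding on its own
# (definition 2), but it does occur inside one, e.g. the encoding of U+00B0,
# so it is a subarray of a valid encoding (definition 3).
try:
    bytes([0xB0]).decode("utf-8")
except UnicodeDecodeError as e:
    print("0xB0 alone rejected:", e.reason)    # 'invalid start byte'

print("\u00b0".encode("utf-8"))                # b'\xc2\xb0' -- 0xB0 as a trailing byte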
I am just lost with this "standard". Do you have any clear definition?
My feeling was that a unicode string is a sequence of unicode scalar values.
No, a Unicode string is a sequence of code units. The standard doesn't contain "many definitions", but only a single one:
D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
This doesn't require the string to be well-formed (see the following definitions). None of your other quotes from the standard contradict this definition. To the contrary, they only illustrate that a Unicode string, as defined by the standard, can be ill-formed.
An application shall only create well-formed strings, of course:
If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.
But the standard also contains some sections on how to deal with ill-formed input sequences.
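The standard's concatenation example quoted above can be reproduced with any UTF-16 decoder; here is a minimal sketch in Python (the byte values are simply the UTF-16LE serialization of the code units quoted from the standard):

# The two ill-formed halves <004D D800> and <DF02 004D>, as UTF-16LE bytes.
a = b"\x4d\x00\x00\xd8"    # ends with a lone lead surrogate
b = b"\x02\xdf\x4d\x00"    # starts with a lone trail surrogate

for part in (a, b):
    try:
        part.decode("utf-16-le")
    except UnicodeDecodeError:
        print("ill-formed on its own")         # printed for both halves

# The concatenation is well-formed UTF-16: D800 DF02 pairs up to U+10302.
print((a + b).decode("utf-16-le"))             # 'M' + U+10302 + 'M'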
Related
In the Lua 5.3 reference manual, we can see:
Lua is also encoding-agnostic; it makes no assumptions about the contents of a string.
I can't understand what the sentence says.
The same byte value in a string may represent different characters depending on the character encoding used for that string. For example, the same value \177 may represent ▒ in Code page 437 encoding or ± in Windows 1252 encoding.
Lua makes no assumption as to what the encoding of a given string is, and the ambiguity needs to be resolved at the script level; in other words, your script needs to know whether to treat the byte sequence as a Windows 1252, Code page 437, UTF-8, or otherwise encoded string.
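As a quick illustration of the \177 example above (sketched in Python simply because it bundles both codecs; the same bytes could come from any Lua string):

raw = bytes([0xB1])             # Lua's "\177"
print(raw.decode("cp437"))      # '▒'  MEDIUM SHADE
print(raw.decode("cp1252"))     # '±'  PLUS-MINUS SIGN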
Essentially, a Lua string is a counted sequence of bytes. If you use a Lua string for binary data, the concept of character encodings is not relevant and does not interfere with the binary data. In that way, a string is encoding-agnostic.
There are functions in the standard string library that treat string values as text, that is, as a sequence of characters rather than raw bytes. There is no text but encoded text: an encoding maps a member of a character set to a sequence of bytes, and a string holds the bytes for zero or more such encoded characters. To understand a string as text, you must know the character set and the encoding. To use the string functions on text, the encoding should be compatible with os.setlocale().
I am trying to convert a UTF-8 string to a Unicode (code point) list with the Erlang library "unicode". My input data is the string "АБВ" (a Russian string, whose correct Unicode representation is [1040,1041,1042]), encoded in UTF-8. When I run the following code:
1> unicode:characters_to_list(<<208,144,208,145,208,146>>,utf8).
[1040,1041,1042]
it returns the correct value, but the following:
2> unicode:characters_to_list([208,144,208,145,208,146],utf8).
[208,144,208,145,208,146]
does not. Why does this happen? As I read in the specification, the input data can be either a binary or a list of chars, so as far as I can tell I am doing everything right.
The signature of the function is unicode:characters_to_list(Data, InEncoding). It expects Data to be either a binary containing a string encoded in InEncoding, or a possibly deep list of characters (code points) and of binaries encoded in InEncoding. It returns a list of Unicode characters. Characters in Erlang are integers.
When you call unicode:characters_to_list(<<208,144,208,145,208,146>>, utf8) or unicode:characters_to_list([1040,1041,1042], utf8), it correctly decodes the Unicode string (yes, the second call is a no-op, because Data is already a list of code points). But when you call unicode:characters_to_list([208,144,208,145,208,146], utf8), Erlang thinks you are passing a list of six characters; since integers in the list are already taken to be code points, the output is exactly the same list.
There is no byte type in Erlang, but you assumed that unicode:characters_to_list/2 would accept a list of bytes and behave correctly.
To sum it up: there are two usual ways to represent a string in Erlang, bitstrings and lists of characters. unicode:characters_to_list(Data, InEncoding) takes the string Data in one of these representations (or a combination of them), in the InEncoding encoding, and converts it to a list of Unicode code points.
If you have the list [208,144,208,145,208,146] as in your example, you can convert it to a binary using erlang:list_to_binary/1 and then pass it to unicode:characters_to_list/2, i.e.
1> unicode:characters_to_list(list_to_binary([208,144,208,145,208,146]), utf8).
[1040,1041,1042]
The unicode module supports only Unicode and Latin-1. Thus, since the function expects code points of Unicode or Latin-1, characters_to_list does not need to do anything with the list in the case of a flat list of code points. However, the list may be deep (e.g. unicode:characters_to_list([[1040],1041,<<1042/utf8>>]).), which is the reason the list datatype is supported for the Data argument.
<<208,144,208,145,208,146>> is a UTF-8 binary.
[208,144,208,145,208,146] is a list of bytes (not code points).
[1040,1041,1042] is a list of code points.
You are passing a list of bytes, but the function wants a list of chars or a binary.
In http://nedbatchelder.com/text/unipain.html it is explained that:
In Python 2, there are two different string data types. A plain-old string literal gives you a "str" object, which stores bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points.
What's the difference between a code point and a byte? (I'm not really thinking in terms of Python per se, but just the concept in general.) Essentially it's all just a bunch of bits, right? I think of a plain old string literal as treating each 8 bits as a byte, handled as such, and we interpret those bytes as integers, which lets us map them to ASCII and the extended character sets. What's the difference between interpreting an integer as one of those characters and interpreting a "code point" as a Unicode character? It says Python's Unicode object stores "code points". Isn't that just the same as plain old bytes, except possibly in the interpretation (where the bits of each Unicode character start and stop, as in UTF-8, for example)?
A code point is a number that acts as an identifier for a Unicode character. A code point itself cannot be stored; it must be encoded from Unicode into bytes using an encoding such as UTF-16LE. While a certain byte or sequence of bytes can represent a specific code point in a given encoding, without the encoding information there is nothing to connect the code point to the bytes.
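A rough way to see the difference in code (a Python 3 sketch, where str stores code points and bytes stores raw bytes):

s = "é"                                   # one code point, U+00E9
print(ord(s))                             # 233 -- the code point, an abstract number

# The same code point becomes different byte sequences under different encodings:
print(s.encode("utf-8"))                  # b'\xc3\xa9'
print(s.encode("utf-16-le"))              # b'\xe9\x00'
print(s.encode("latin-1"))                # b'\xe9'

# Going back requires knowing which encoding produced the bytes:
print(b"\xe9".decode("latin-1"))          # 'é'
print(b"\xe9\x00".decode("utf-16-le"))    # 'é'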
I need some help understanding the concept of a well-formed UTF-16 string as mentioned in these two paragraphs of Chapter 2: General Structure, Section 2.7, Unicode Strings:
"Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16—that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide.
Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form. For example, isolated surrogates in a Unicode 16-bit string are not allowed when that string is specified to be well-formed UTF-16."
The paragraph explains it for UTF-16; not well-formed means the string contains isolated surrogate code units.
That is, there are certain code units which are only valid when they appear in pairs. A code unit in the range [0xD800-0xDFFF] must occur only in pairs where the first must be in the range [0xD800-0xDBFF] and the second must be in the range [0xDC00-0xDFFF]. If a string does not obey this requirement then it is not well-formed.
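To make that rule concrete, a naive well-formedness check over a sequence of 16-bit code units might look like the following Python sketch (the function name and structure are mine, not from the standard):

def is_well_formed_utf16(code_units):
    # Surrogates may only occur as a lead (0xD800-0xDBFF) immediately
    # followed by a trail (0xDC00-0xDFFF).
    i = 0
    while i < len(code_units):
        u = code_units[i]
        if 0xD800 <= u <= 0xDBFF:          # lead surrogate: needs a trail right after
            if i + 1 >= len(code_units) or not (0xDC00 <= code_units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:        # isolated trail surrogate
            return False
        else:
            i += 1
    return True

print(is_well_formed_utf16([0x004D, 0xD800]))                  # False (isolated surrogate)
print(is_well_formed_utf16([0x004D, 0xD800, 0xDF02, 0x004D]))  # True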
What is the difference between a charset and a character encoding? When I say I am using the UTF-8 encoding, then what is my charset? Does it take Unicode as the charset by default?
UTF-8 is an encoding of the Unicode character set. Therefore, if you're using UTF-8, the character set is Unicode, but you're not likely to have to specify this separately anywhere. The other main encoding of Unicode is UTF-16, which is rarely put into 8-bit byte streams because it contains zero bytes. If you are dealing with Unicode in a byte stream, it is almost certainly encoded as UTF-8.
Other than Unicode, character sets are usually considered to have a single fixed encoding, and then terms like character set, charset, codepage, and encoding are often used interchangeably, depending on the vendor. This is sloppy but usually creates no runtime problems.
The only possible exceptions I can think of are East Asian: JIS and EUC originally defined multiple encodings for the same character set, but in practice today, each encoding is just treated separately.
Character set: defines which character has which numeric code point (ASCII, JIS, Unicode)
Encoding: defines how the numeric code point is physically represented (UTF, UCS, Shift JIS)
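A small sketch of that split (in Python, purely for illustration): the character set assigns the number, the encoding decides the bytes.

ch = "あ"                              # HIRAGANA LETTER A
print(hex(ord(ch)))                    # 0x3042 -- the code point assigned by the character set

# Different encodings turn that same character into different bytes:
print(ch.encode("utf-8").hex())        # 'e38182'
print(ch.encode("utf-16-le").hex())    # '4230'
print(ch.encode("shift_jis").hex())    # '82a0'  (JIS character set, Shift JIS encoding)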
According to Unicode terminology:
ACR: Abstract Character Repertoire
= the set of characters to be encoded, for example, some alphabet or symbol set
CCS: Coded Character Set
= a mapping from an abstract character repertoire to a set of nonnegative integers
CEF: Character Encoding Form
= a mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
CES: Character Encoding Scheme
= a reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes
CM: Character Map
= a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes, bridging all four levels in a single operation
TES: Transfer Encoding Syntax
= a reversible transform of encoded data, which may or may not contain textual data
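To tie these layers together with one character, here is a Python sketch that walks U+10302 from code point (CCS) through UTF-16 code units (CEF) to serialized bytes (CES); the choice of character is arbitrary:

ch = "\U00010302"

# CCS: the coded character set assigns the nonnegative integer 0x10302.
print(hex(ord(ch)))                     # 0x10302

# CEF: UTF-16 maps that integer to two 16-bit code units (a surrogate pair).
be = ch.encode("utf-16-be")
units = [hex(int.from_bytes(be[i:i+2], "big")) for i in (0, 2)]
print(units)                            # ['0xd800', '0xdf02']

# CES: serializing the code units to bytes fixes a byte order.
print(ch.encode("utf-16-be").hex())     # 'd800df02'
print(ch.encode("utf-16-le").hex())     # '00d802df'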
Older protocols like MIME use "charset" when they really mean "character encoding scheme". Originally, different character encodings were thought of as independent character repertoires instead of subsets of Unicode.
A character set defines the mapping between numbers and characters. Almost all character sets say 65 is A, and agree in general about the mappings of numbers up to 127. But they may take different stances when it comes to numbers above 127.
There are a lot of character sets:
EBCDIC
Double Byte Character Set
ANSI
Different OEM char sets
Unicode, an effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.
When you say character encoding, you're talking about how a Unicode code point (a character) is stored internally.
In the UTF-8 encoding, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes (the original design allowed up to 6, but UTF-8 as standardized today never needs more than 4).
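For example (a Python sketch; the characters are arbitrary picks from each range):

for ch in ("A", "é", "€", "\U0001F600"):
    print(hex(ord(ch)), len(ch.encode("utf-8")), "byte(s)")
# 0x41 1 byte(s)
# 0xe9 2 byte(s)
# 0x20ac 3 byte(s)
# 0x1f600 4 byte(s)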
There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero.
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language).
UTF-7, UTF-8, UTF-16, and UTF-32 all have the nice property of being able to store any code point correctly.
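A quick round-trip check of that property (a Python sketch; any string with code points from several ranges would do):

s = "A\u00b1\u20ac\U0001F600"
for enc in ("utf-7", "utf-8", "utf-16", "utf-32"):
    assert s.encode(enc).decode(enc) == s      # every code point survives the round trip
print("all round trips OK")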
This post is almost entirely based on Joel Spolsky's post on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Read it to get a better idea.
In practice, charset is a synonym for character encoding.
The default encoding depends on the operating system and the locale.
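You can ask the runtime what it thinks that default is; for instance in Python (the values shown are only typical examples, yours will vary with OS and locale):

import locale, sys
print(locale.getpreferredencoding())   # e.g. 'UTF-8' on most Linux/macOS locales, 'cp1252' on many Windows setups
print(sys.getdefaultencoding())        # Python's own internal default, 'utf-8' on Python 3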
EDIT
http://www.w3.org/TR/REC-xml/#sec-TextDecl
http://www.w3.org/TR/REC-xml/#NT-EncodingDecl