I am trying to convert UTF-16 to UTF-8. For the string 0xDCF0, the conversion failed with an "invalid multi-byte sequence" error. I don't understand why the conversion fails. In the library I am using for UTF-16 to UTF-8 conversion, there is this check:
if ((first_byte & 0xfc) == 0xdc) {
    return -1;
}
Can you please help me understand why this check is present?
Unicode characters in the DC00–DFFF range are "low" surrogates, i.e. are used in UTF-16 as the second part of a surrogate pair, the first part being a "high" surrogate character in the range D800–DBFF.
See e.g. Wikipedia article UTF-16 for more information.
The reason you cannot convert to UTF-8 is that you only have half a Unicode code point.
In UTF-16, the two-byte sequence
DCF0
cannot begin the encoding of any character at all.
The way UTF-16 works is that some characters are encoded in 2 bytes and some characters are encoded in 4 bytes. The characters encoded in two bytes use 16-bit values in the ranges:
0000 .. D7FF
E000 .. FFFF
All other characters require four bytes to be encoded in UTF-16. For these characters the first pair of bytes must be in the range
D800 .. DBFF
and the second pair of bytes must be in the range
DC00 .. DFFF
This is how the encoding scheme is defined. See the Wikipedia page for UTF-16.
Notice that the FIRST sixteen bits of an encoding of a character can NEVER be in DC00 through DFFF. It is simply not allowed in UTF-16. This is exactly what is being checked for (if you follow the bitwise arithmetic in the code you found).
Related
Since strings in many modern languages are now sequences of Unicode characters, a single character can span more than one byte. But if I only care about some ASCII characters, is it safe to treat the string as a sequence of bytes (assuming the given string is a sequence of valid Unicode characters)?
Yes.
From Wikipedia:
[...] ASCII bytes do not occur when encoding non-ASCII code points into UTF-8 [...]
Moreover, 7-bit bytes (bytes where the most significant bit is 0) never appear in a multi-byte sequence, and no valid multi-byte sequence decodes to an ASCII code-point. [...] Therefore, the 7-bit bytes in a UTF-8 stream represent all and only the ASCII characters in the stream. Thus, many [programs] will continue to work as intended by treating the UTF-8 byte stream as a sequence of single-byte characters, without decoding the multi-byte sequences.
From utf8everywhere.org:
By design of this encoding, UTF-8 guarantees that an ASCII character value or a substring will never match a part of a multi-byte encoded character.
This is visualized nicely by this table from Wikipedia:
Number of bytes   Byte 1     Byte 2     Byte 3     Byte 4
1                 0xxxxxxx
2                 110xxxxx   10xxxxxx
3                 1110xxxx   10xxxxxx   10xxxxxx
4                 11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
All ASCII characters, seen as 8-bit bytes, have the most significant bit set to 0. But in multi-byte encoded characters, every byte has the MSB set to 1.
Note that UTF-8 is one encoding of Unicode; the two are not the same thing! My answer talks about UTF-8 encoded strings (which, luckily, is the most prominent encoding).
An additional thing to be aware of is Unicode normalization, combining characters and other characters that "kind of" contain an ASCII character. Take the Umlaut ä for example:
ä 0xC3A4 LATIN SMALL LETTER A WITH DIAERESIS
ä 0x61CC88 LATIN SMALL LETTER A + COMBINING DIAERESIS
If you search for the ASCII character 'a', you will find it in the second line, but not in the first one, despite the lines logically containing the same "user perceived characters". You can tackle this at least partially by normalizing your strings beforehand.
It is helpful to drop the notion of a "Unicode character" and instead talk about a Unicode code point (for example U+0065: LATIN SMALL LETTER E) and different encodings (ASCII, UTF-8, UTF-16, etc.). You are asking about properties of the UTF-8 encoding. In the case of UTF-8: code points below U+0080 have the same encoding as ASCII. The Wikipedia page has a nice table:
Number    Bits for      First         Last
of bytes  code point    code point    code point    Byte 1      Byte 2
1         7             U+0000        U+007F        0xxxxxxx
2         11            U+0080        U+07FF        110xxxxx    10xxxxxx
...
Talking about strings in languages is too broad in my opinion, because even if your language stores string values in some specified encoding, you can still receive input in a different encoding. (Think of a Java program, which uses UTF-16 for its internal representation: you can still serialize a string as UTF-8, or get user input encoded in ASCII.)
In UTF-8, code points above 127 are encoded with multiple bytes. For example, the character U+041F (binary 100'0001'1111) is encoded as:
1101'0000 1001'1111
^^^       ^^
The marked bits determine the leading and trailing bytes; the other bits are the actual bits of the code point.
But can we encode code point 1 as
1100'0000 1000'0001
Of course, it is redundant, but is it legal in UTF-8?
Overlong UTF-8 sequences are not considered valid UTF-8 representations of a code point. A UTF-8 decoder must reject overlong sequences.
Wikipedia citation: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
Original RFC 2279 specification: https://www.ietf.org/rfc/rfc2279.txt
Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?
Also known as \xAE
A Unicode character as such does not have any length in bytes. It is the character encoding that matters. You know the length of a character in bytes in a specific encoding from the definition of the encoding.
For example, in the ISO-8859-1 (ISO Latin 1) encoding, which encodes just a small subset of Unicode, including “®”, every character is 1 byte long.
In the UTF-16 encoding, all characters are either 2 or 4 bytes long, and characters in the range U+0000...U+FFFF, such as “®”, are 2 bytes long.
In the UTF-32 encoding, all characters are 4 bytes long.
In the UTF-8 encoding, characters take 1 to 4 bytes. A simple way to check this out is to use the Fileformat.info Character search (though this is not normative information, just a nice quick reference). E.g., the page about U+00AE shows the character in some encodings, including 0xC2 0xAE (that is, 2 bytes) in UTF-8.
It is Unicode code point U+00AE. It's in the range [0x80, 0x7FF], so in UTF-8 it will be encoded as two bytes; the table at the top of the Wikipedia article explains this in more detail*.
If you were using UTF-16 it'd also be two bytes, since no continuation is necessary.
(* my summary though: one of the features of UTF-8 is that you can jump midway into a byte stream and synchronise with the text without generating any spurious characters, because you can tell whether any byte is a continuation character without further context.
An unavoidable side effect is that only the 7-bit ASCII characters fit into a single byte and everything else takes multiple bytes. 0xae is sufficiently close to the 7-bit range to require only one extra byte. See Wikipedia for specifics.)
I have a need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie). To preserve this prefix-similarity as much as possible, I'd prefer to use a terminator at the end of my byte arrays, rather than (say) a byte-length prefix.
What terminator should I use? It seems 0xff is an illegal byte in all positions of any UTF-8 string, but perhaps someone knows concretely?
0xFF and 0xFE cannot appear in legal UTF-8 data. Also, the bytes 0xF8-0xFD will only appear in the obsolete version of UTF-8 that allows up to six-byte sequences.
0x00 is legal but won't appear anywhere except in the encoding of U+0000. This is exactly the same as other encodings, and the fact that it's legal in all these encodings never stopped it from being used as a terminator in C strings. I'd probably go with 0x00.
The byte 0xff cannot appear in a valid UTF-8 sequence, nor can any of 0xfc, 0xfd, 0xfe.
All UTF-8 bytes must match one of
0xxxxxxx - Lower 7 bits (ASCII).
10xxxxxx - Second and subsequent bytes in a multi-byte sequence.
110xxxxx - First byte of a two-byte sequence.
1110xxxx - First byte of a three-byte sequence.
11110xxx - First byte of a four-byte sequence.
111110xx - First byte of a five-byte sequence.
1111110x - First byte of a six-byte sequence.
There are no sequences of seven or more bytes. The latest version of UTF-8 only allows sequences up to 4 bytes in length, which would leave 0xf8-0xff unused, but it is possible that a byte sequence could be validly called UTF-8 according to an obsolete version and include octets in 0xf8-0xfb.
What about using one of the UTF-8 control characters?
You can choose one from http://www.utf8-chartable.de/
I have a Unicode string "hao123--我的上网主页". In a C++ UTF-8 string it appears as "hao123锛嶏紞鎴戠殑涓婄綉涓婚〉", but I should write it to a file in this format: "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". How can I do it? I know little about this encoding. Can anyone help? Thanks!
You seem to mix up UTF-8 and UTF-16 (or possibly UCS-2). UTF-8 encoded characters have a variable length of 1 to 4 bytes. Contrary to this, you seem to want to write UTF-16 or UCS-2 to your files (I am guessing this from the \uxxxx character references in your file output string).
For an overview of these character sets, have a look at Wikipedia's article on UTF-8 and browse from there.
Here's some of the very basic basics (heavily simplified):
UCS-2 stores all characters as exactly 16 bits. It therefore cannot encode all Unicode characters, only the so-called "Basic Multilingual Plane".
UTF-16 stores the most frequently-used characters in 16 bits, but some characters must be encoded in 32 bits.
UTF-8 encodes characters with a variable length of 1 to 4 bytes. Only characters from the original 7-bit ASCII charset are encoded as 1 byte.