The Unicode FAQ mentions that UTF-8 doesn't need a BOM.
Q: Is the UTF-8 encoding scheme the same irrespective of whether the
underlying processor is little endian or big endian?
A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no
endian problem as there is for encoding forms that use 16-bit or
32-bit code units. Where a BOM is used with UTF-8, it is only used as
an encoding signature to distinguish UTF-8 from other encodings — it
has nothing to do with byte order.
For code points above U+007F (such as U+0744), UTF-8 needs 2 to 4 bytes to represent them. Doesn't it need a BOM to specify the endianness of these bytes, or does UTF-8 adopt a default?
UTF-8 gives a strict definition for the order of the bytes that encode a character. No variation between computing platforms is allowed.
For example, the Euro sign U+20AC must be encoded as the byte sequence \xE2\x82\xAC. No other ordering of these bytes is permitted.
UTF-8 uses 1-byte code units, so there is no need for a BOM to indicate a byte order, because there is only 1 byte order possible, and the encoding algorithm determines the ordering of the bytes. For example, U+0744 is encoded in UTF-8 as code units 0xDD 0x84, which are represented in bytes as DD 84. Bytes 84 DD would be an illegal UTF-8 sequence.
This is unlike UTF-16 and UTF-32, which use 2-byte and 4-byte code units, respectively. The encoding algorithm determines the order of the code units, but since the code units themselves are multi-byte, they are subject to endianness. For example, U+0744 is encoded in UTF-16 as the code unit 0x0744 and in UTF-32 as the code unit 0x00000744, which are represented in bytes as 07 44 or 44 07 in UTF-16, and as 00 00 07 44 or 44 07 00 00 in UTF-32, depending on the endianness.
So a BOM makes sense to indicate which byte order is actually being used for UTF-16/32, but not for UTF-8.
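To make that concrete, here is a minimal C sketch (my own illustration, not from any particular library) that encodes U+0744 as UTF-8 and then prints the same code point's UTF-16 code unit in both byte orders:

#include <stdio.h>
#include <stdint.h>

/* Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
   The algorithm itself fixes the byte order: lead byte first,
   continuation byte second, on every platform. */
static void utf8_encode_2byte(uint32_t cp, uint8_t out[2]) {
    out[0] = 0xC0 | (cp >> 6);     /* 110xxxxx */
    out[1] = 0x80 | (cp & 0x3F);   /* 10xxxxxx */
}

int main(void) {
    uint8_t utf8[2];
    utf8_encode_2byte(0x0744, utf8);
    printf("UTF-8   : %02X %02X\n", utf8[0], utf8[1]);        /* DD 84, always */

    uint16_t unit = 0x0744;        /* the single UTF-16 code unit */
    printf("UTF-16BE: %02X %02X\n", unit >> 8, unit & 0xFF);  /* 07 44 */
    printf("UTF-16LE: %02X %02X\n", unit & 0xFF, unit >> 8);  /* 44 07 */
    return 0;
}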
Related
I am trying to convert UTF-16 to UTF-8. For the string 0xDCF0, the conversion fails with "invalid multi-byte sequence". I don't understand why the conversion fails. In the library I am using to do the UTF-16 to UTF-8 conversion, there is a check:
if ((first_byte & 0xfc) == 0xdc) {
    return -1;
}
Can you please help me understand why this check is present.
Unicode characters in the DC00–DFFF range are "low" surrogates, i.e. are used in UTF-16 as the second part of a surrogate pair, the first part being a "high" surrogate character in the range D800–DBFF.
See e.g. the Wikipedia article on UTF-16 for more information.
The reason you cannot convert to UTF-8 is that you only have half of a Unicode code point.
In UTF-16, the two byte sequence
DCF0
cannot begin the encoding of any character at all.
The way UTF-16 works is that some characters are encoded in 2 bytes and some characters are encoded in 4 bytes. The characters that are encoded with two bytes use 16-bit sequences in the ranges:
0000 .. D7FF
E000 .. FFFF
All other characters require four bytes to be encoded in UTF-16. For these characters the first pair of bytes must be in the range
D800 .. DBFF
and the second pair of bytes must be in the range
DC00 .. DFFF
This is how the encoding scheme is defined. See the Wikipedia page for UTF-16.
Notice that the FIRST sixteen bits of an encoding of a character can NEVER be in DC00 through DFFF. It is simply not allowed in UTF-16. This is (if you follow the bitwise arithmetic in the code you found) exactly what is being checked for.
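As an illustration of the answer above, here is a hedged C sketch of my own (not the library's actual code) that performs the same test on the whole 16-bit code unit before decoding; testing (unit & 0xFC00) == 0xDC00 on the unit is the same as testing (first_byte & 0xFC) == 0xDC on its high byte:

#include <stdint.h>

/* Return the code point decoded from up to two UTF-16 code units,
   or -1 on an invalid sequence. *consumed is set to 1 or 2. */
static int32_t utf16_decode(const uint16_t *units, int count, int *consumed) {
    uint16_t u = units[0];
    *consumed = 1;

    if ((u & 0xFC00) == 0xDC00)      /* lone low surrogate, e.g. 0xDCF0 */
        return -1;                   /* the same condition the library's check rejects */

    if ((u & 0xFC00) == 0xD800) {    /* high surrogate: a low surrogate must follow */
        if (count < 2 || (units[1] & 0xFC00) != 0xDC00)
            return -1;
        *consumed = 2;
        return 0x10000 + (((int32_t)(u & 0x3FF) << 10) | (units[1] & 0x3FF));
    }

    return u;                        /* ordinary BMP code unit */
}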
Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?
Also known as \xAE
A Unicode character as such does not have any length in bytes. It is the character encoding that matters. You know the length of a character in bytes in a specific encoding from the definition of the encoding.
For example, in the ISO-8859-1 (ISO Latin 1) encoding, which encodes just a small subset of Unicode, including “®”, every character is 1 byte long.
In the UTF-16 encoding, all characters are either 2 or 4 bytes long, and characters in the range U+0000...U+FFFF, such as “®”, are 2 bytes long.
In the UTF-32 encoding, all characters are 4 bytes long.
In the UTF-8 encoding, characters take 1 to 4 bytes. A simple way to check this out is to use the Fileformat.info Character search (though this is not normative information, just a nice quick reference). E.g., the page about U+00AE shows the character in some encodings, including 0xC2 0xAE (that is, 2 bytes) in UTF-8.
It is Unicode code point U+00AE. It's in the range [0x80, 0x7FF], so in UTF-8 it'll be encoded as two bytes; the table at the top of the Wikipedia article explains this in more detail*.
If you were using UTF-16 it'd also be two bytes, since no surrogate pair is necessary.
(* my summary though: one of the features of UTF-8 is that you can jump midway into a byte stream and synchronise with the text without generating any spurious characters, because you can tell whether any byte is a continuation character without further context.
An unavoidable side effect is that only the 7-bit ASCII characters fit into a single byte and everything else takes multiple bytes. 0xae is sufficiently close to the 7-bit range to require only one extra byte. See Wikipedia for specifics.)
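The range test described above can be written as a small helper; this is only a sketch of my own, not code from any standard library, giving the UTF-8 byte count for any code point (U+00AE lands in the two-byte bucket):

#include <stdint.h>

/* Number of bytes needed to encode a code point in UTF-8,
   based on the standard range boundaries; 0 for invalid input. */
static int utf8_length(uint32_t cp) {
    if (cp <= 0x7F)     return 1;   /* 7-bit ASCII */
    if (cp <= 0x7FF)    return 2;   /* U+00AE (R in a circle) lands here */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}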
I have read this great tutorial:
http://www.joelonsoftware.com/articles/Unicode.html
But I didn't understand how UTF-8 solves the big-endian/little-endian machine issue.
For a single byte, it's fine.
For multi-byte sequences, how does it work?
Can someone explain this better?
Here is a link that explains UTF-8 in depth. http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
At the heart of it, UTF-16 is short-integer (16-bit) oriented and UTF-8 is byte oriented. Since architectures can differ in how the bytes of a datatype are ordered (big endian, little endian), the UTF-16 encoding can go either way. On all architectures I am aware of there is no endianness at the nibble or semi-octet level; all bytes are a sequential series of 8 bits. Therefore UTF-8 has no endianness.
The Japanese character あ is a good example. It is U+3042 (binary=0011 0000 : 0100 0010).
UTF-16BE: 30, 42 = 0011 0000 : 0100 0010
UTF-16LE: 42, 30 = 0100 0010 : 0011 0000
UTF-8: e3, 81, 82 = 1110 0011 : 10 000001 : 10 000010
Here is some information on the Unicode character あ.
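A short C sketch (mine, purely illustrative) makes the difference visible: the UTF-8 bytes for あ are fixed by the encoding, while dumping a raw 16-bit code unit from memory depends on the host's byte order:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    /* U+3042 (あ) encoded by the UTF-8 algorithm: three bytes, fixed order. */
    const uint8_t utf8[3] = { 0xE3, 0x81, 0x82 };
    printf("UTF-8 bytes   : %02X %02X %02X\n", utf8[0], utf8[1], utf8[2]);

    /* The same code point stored as a 16-bit UTF-16 code unit and then
       dumped as raw memory: the result depends on the host's byte order. */
    uint16_t unit = 0x3042;
    uint8_t raw[2];
    memcpy(raw, &unit, sizeof unit);
    printf("UTF-16 in RAM : %02X %02X  (42 30 on little endian, 30 42 on big endian)\n",
           raw[0], raw[1]);
    return 0;
}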
There is no endianness problem with UTF-8. The problem arises with UTF-16, because a sequence of two-byte code units has to be seen as a sequence of bytes when it is written to a file or a communication stream, and the other side may have a different idea about the byte order within a two-byte number. Because UTF-8 works at the byte level, there is no need for a BOM to parse the sequence correctly on both a big-endian and a little-endian machine. It does not matter that a character may be multi-byte: UTF-8 defines exactly the order in which the bytes of a multi-byte encoding of a code point must appear.
The BOM in UTF-8 is for something completely different (so the name 'Byte Order Mark' is a little 'off'). It is there to signal "this is going to be a UTF-8 stream". The UTF-8 BOM is generally unpopular, and many programs do not support it correctly. The site utf8everywhere.org believes it should be deprecated in the future.
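For example, consuming that signature is just a fixed three-byte comparison with no byte-order decision involved; a minimal sketch of my own (the function name is hypothetical):

#include <stddef.h>
#include <stdint.h>

/* Illustrative helper, not from any library: if the buffer starts with
   the UTF-8 signature EF BB BF, return a pointer just past it;
   otherwise return the buffer unchanged. */
static const uint8_t *skip_utf8_bom(const uint8_t *buf, size_t len) {
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return buf + 3;
    return buf;
}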
I have a need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie.) To preserve this prefix-similarity as much as possible, I'd prefer to use a terminator at the end of my byte arrays, rather than (say) a byte-length prefix.
What terminator should I use? It seems 0xff is an illegal byte in all positions of any UTF-8 string, but perhaps someone knows concretely?
0xFF and 0xFE cannot appear in legal UTF-8 data. Also, the bytes 0xF8-0xFD will only appear in the obsolete version of UTF-8 that allows sequences of up to six bytes.
0x00 is legal but won't appear anywhere except in the encoding of U+0000. This is exactly the same as other encodings, and the fact that it's legal in all these encodings never stopped it from being used as a terminator in C strings. I'd probably go with 0x00.
The byte 0xff cannot appear in a valid UTF-8 sequence, nor can 0xfe; 0xfc and 0xfd could only ever appear as the lead byte of the now-obsolete six-byte form.
All UTF-8 bytes must match one of
0xxxxxxx - The lower 7-bit (ASCII) range.
10xxxxxx - Second and subsequent bytes in a multi-byte sequence.
110xxxxx - First byte of a two-byte sequence.
1110xxxx - First byte of a three-byte sequence.
11110xxx - First byte of a four-byte sequence.
111110xx - First byte of a five-byte sequence.
1111110x - First byte of a six-byte sequence.
There are no seven-byte or longer sequences. The latest version of UTF-8 only allows sequences up to 4 bytes in length, which leaves 0xF8-0xFF unused, but it is possible that a byte sequence could be validly called UTF-8 according to an obsolete version and include lead octets in the 0xF8-0xFD range.
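To make the consequence concrete, here is a small C sketch of my own (not from any library) that classifies a byte according to the patterns listed above; since 0xFE and 0xFF match none of them, even under the obsolete definition, 0xFF is safe to use as an out-of-band terminator:

#include <stdint.h>

enum utf8_byte_class { UTF8_ASCII, UTF8_CONT, UTF8_LEAD, UTF8_NEVER };

/* Classify one byte against the bit patterns above (including the
   obsolete five- and six-byte lead forms). 0xFE and 0xFF fit none of
   them, so they never occur in UTF-8 data of any vintage. */
static enum utf8_byte_class classify(uint8_t b) {
    if (b < 0x80) return UTF8_ASCII;   /* 0xxxxxxx */
    if (b < 0xC0) return UTF8_CONT;    /* 10xxxxxx */
    if (b < 0xFE) return UTF8_LEAD;    /* 110xxxxx .. 1111110x */
    return UTF8_NEVER;                 /* 0xFE or 0xFF: usable as a terminator */
}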
What about using one of the UTF-8 control characters?
You can choose one from http://www.utf8-chartable.de/
It seems like there's an ambiguity between the Byte Order Marks used for UTF-16LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:
FF FE 00 00 00 00 00 00
How can I tell if this file contains:
The UTF-16LE BOM (FF FE) followed by 3 null characters; or
The UTF-32LE BOM (FF FE 00 00) followed by one null character?
Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?
As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first, then you can use the BOM to determine whether the least or most significant bytes are first for multibyte sequences.
A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.
It is unambiguous. FF FE is for UTF-16LE, and FF FE 00 00 denotes UTF-32LE. There is no reason to think that FF FE 00 00 is possibly UTF-16LE because the UTFs were designed for text, and users shouldn't be using NUL characters in their text. After all, when was the last time you opened a hex editor and inserted a few bytes of 00 into a text document? ^_^
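In other words, a BOM sniffer just has to test the longer signatures first. Here is a minimal sketch of my own (not from any library), and it deliberately inherits the trade-off above: a UTF-16LE file whose first character is U+0000 will be reported as UTF-32LE:

#include <stddef.h>
#include <stdint.h>

enum bom { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE };

/* Guess an encoding from a leading BOM. The UTF-32 signatures are checked
   before the UTF-16 ones because FF FE 00 00 also begins with FF FE. */
static enum bom sniff_bom(const uint8_t *p, size_t len) {
    if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return BOM_UTF32LE;
    if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return BOM_UTF32BE;
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_UTF16LE;
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_UTF16BE;
    return BOM_NONE;
}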
I have experienced the same problem as Edward. I agree with Dustin: usually one will not use null characters in text files.
However, I created a file that contains all Unicode characters, first in the UTF-32LE encoding, then in UTF-32BE, UTF-16LE, UTF-16BE, and UTF-8.
When trying to re-encode the files to UTF-8, I wanted to compare the result to the already existing UTF-8 file. Because the first character in my files after the BOM is the null character, I could not correctly detect the file with the UTF-16LE BOM; it showed up as UTF-32LE, because the bytes appeared exactly as Edward has described. The first character after the BOM FF FE is 0000, but the BOM detection found FF FE 00 00 and therefore detected UTF-32LE instead of UTF-16LE, and my first 0000 character was swallowed as part of the BOM.
So one should never use a null character as the first character of a file encoded in UTF-16 little endian, because it makes the UTF-16LE and UTF-32LE BOMs ambiguous.
To solve my problem, I will swap the first and second characters. :-)