So I'm teaching myself character encoding, and I have a presumably stupid question: Wikipedia says
The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER
MARK (BOM), ...
, and a chart on that page writes
Encoding Representation (hexadecimal)
UTF-8 EF BB BF
UTF-16 (BE) FE FF
UTF-16 (LE) FF FE
...
I'm a little confused by this. As far as I know, most machines using Intel CPUs are little-endian, so why is the BOM U+FE FF for UTF-16 (BE), rather than U+EF BB BF for UTF-8 or U+FF FE for UTF-16 (LE)?
As far as I know, most machines using Intel CPUs are little-endian
Intel CPUs are not the only CPUs in the world; there are also AMD, ARM, and others, and some architectures are big-endian (or can run in big-endian mode).
why is the BOM U+FE FF for UTF-16 (BE), rather than U+EF BB BF for UTF-8 or U+FF FE for UTF-16 (LE)?
U+FEFF is a Unicode codepoint designation. FE FF, EF BB BF, and FF FE, on the other hand, are sequences of bytes. The U+ prefix applies only to Unicode codepoint designations, never to bytes.
The numeric value of Unicode codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE (which is its official designation, not U+FEFF BYTE ORDER MARK, though it is also used as a BOM) is 0xFEFF (65279).
That codepoint value encoded in UTF-8 produces three 8-bit code unit values, 0xEF 0xBB 0xBF, which are not subject to any endianness issues; this is why UTF-8 does not have separate LE and BE variants.
That same codepoint value encoded in UTF-16 produces a single 16-bit code unit with the value 0xFEFF. Because it is a multi-byte (16-bit) value, it is subject to endianness when serialized as two 8-bit bytes, hence the LE (0xFF 0xFE) and BE (0xFE 0xFF) variants.
It is not just the BOM that is affected. All code units in a UTF-16 string are affected by endianness. The BOM helps a decoder know the byte order used for the code units in the entire string.
UTF-32, which also uses multi-byte (32-bit) code units, is likewise subject to endianness, so it also has LE and BE variants, and a 32-bit BOM to express that byte order to decoders (0xFF 0xFE 0x00 0x00 for LE, 0x00 0x00 0xFE 0xFF for BE). And yes, as you can probably guess, there is an ambiguity between the UTF-16LE BOM and the UTF-32LE BOM if you don't know ahead of time which UTF you are dealing with. A BOM is meant to identify the byte order, hence the name "Byte Order Mark", not the particular encoding (though it is commonly used for that purpose).
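All of the byte sequences above can be verified with Python's standard codecs (a small sketch; note that the suffix-less "utf-16"/"utf-32" codecs prepend a BOM automatically, while the explicit -le/-be codecs do not):

```python
# U+FEFF serialized by each Unicode encoding form
bom = "\ufeff"

assert bom.encode("utf-8") == b"\xef\xbb\xbf"        # no endianness issue
assert bom.encode("utf-16-be") == b"\xfe\xff"        # most-significant byte first
assert bom.encode("utf-16-le") == b"\xff\xfe"        # least-significant byte first
assert bom.encode("utf-32-be") == b"\x00\x00\xfe\xff"
assert bom.encode("utf-32-le") == b"\xff\xfe\x00\x00"

# The plain "utf-16" codec prepends a BOM in the platform's byte order:
assert "A".encode("utf-16") in (b"\xff\xfeA\x00", b"\xfe\xff\x00A")
```

Note the ambiguity mentioned above: the UTF-32LE output starts with the same two bytes as the UTF-16LE BOM.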
why BOM is U+FE FF for UTF-16 (BE)
It isn't. The BOM is character number U+FEFF. There's no space; it's a single hexadecimal number, aka 65279. This definition does not depend on what sequence of bytes is used to represent that character in any particular encoding.
It happens that the hexadecimal representation of the byte sequence that encodes the character(*) in UTF-16BE, 0xFE 0xFF, has the same order of digits as the hexadecimal representation of the character number U+FEFF; this is just an artefact of big-endianness: it puts the most-significant content first, same as humans do for big [hexa]decimal numbers.
(* and indeed any character in the Basic Multilingual Plane. It gets hairier when you go above this range as they no longer fit in two bytes.)
Related
Today I was learning about character encoding and Unicode, but there is one thing I'm not sure about. I used a website to convert 字 to its Unicode value, 101101101010111 (which, from my understanding, comes from a character set), and the same symbol to UTF-16 (a character encoding system), 01010111 01011011, which is how it is supposed to be saved in memory or on disk.
Unicode is just a character set.
UTF-16 is an encoding system that transforms the charset so it can be saved in memory or on disk.
Am I right?
If yes, how did the encoding system change 101101101010111 into 01010111 01011011? How does it work?
Unicode at the core is indeed a character set, i.e. it assigns numbers to what most people think of as characters. These numbers are called codepoints.
The codepoint for 字 is U+5B57. This is the format in which codepoints are usually specified. "5B57" is a hexadecimal number.
In binary, 5B57 is 101101101010111, or 0101101101010111 if it is extended to 16 bits. But it is very unusual to specify codepoints in binary.
UTF-16 is one of several encodings for Unicode, i.e. a representation in memory or in files. UTF-16 uses 16-bit code units. Since 16-bit is 2 bytes, two variants exist for splitting it into bytes:
little-endian (lower 8 bits first)
big-endian (higher 8 bits first)
Often they are called UTF-16LE and UTF-16BE. Since most computers today use a little endian architecture, UTF-16LE is more common.
A single codepoint can result in 1 or 2 UTF-16 code units. In this particular case, it's a single code unit, and it is the same as the value for the codepoint: 5B57. It is saved as two bytes, either as:
5B 57 (or 01011011 01010111 in binary, big endian)
57 5B (or 01010111 01011011 in binary, little endian)
The latter one is the one you have shown. So it is UTF-16LE encoding.
For codepoints resulting in 2 UTF-16 code units, the encoding is somewhat more involved. It is explained in the UTF-16 Wikipedia article.
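The walk-through above can be reproduced with Python's standard codecs (a quick sketch):

```python
# The codepoint for 字 and its binary form
assert ord("字") == 0x5B57
assert f"{ord('字'):016b}" == "0101101101010111"   # extended to 16 bits

# One 16-bit code unit, split into bytes in either order
assert "字".encode("utf-16-be") == b"\x5b\x57"   # higher 8 bits first
assert "字".encode("utf-16-le") == b"\x57\x5b"   # lower 8 bits first, as in the question
```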
While studying Unicode and UTF-8 encoding,
I noticed that the 129th Unicode character, encoded in UTF-8, starts with 0xC2.
I checked up through lead byte 0xCF;
no Unicode character was encoded with a lead byte of 0xC1.
Why does the 129th Unicode character start at 0xC2 instead of 0xC1?
The UTF-8 specification, RFC 3629, specifically states in the introduction:
The octet values C0, C1, F5 to FF never appear.
The reason for this is that a 1-byte UTF-8 sequence consists of the 8-bit binary pattern 0xxxxxxx (a zero followed by seven bits) and can represent Unicode code points that fit in seven bits (U+0000 to U+007F).
A 2-byte UTF-8 sequence consists of the 16-bit binary pattern 110xxxxx 10xxxxxx and can represent Unicode code points that fit in eight to eleven bits (U+0080 to U+07FF).
It is not legal in UTF-8 encoding to use more bytes than the minimum required, so while U+007F could be represented in two bytes as 11000001 10111111 (C1 BF hex), the specification requires the more compact 1-byte form 01111111 (7F hex).
The first valid two-byte value is the encoding of U+0080, which is 11000010 10000000 (C2 80 hex), so C0 and C1 will never appear.
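This boundary can be checked with Python's built-in UTF-8 codec (a quick sketch):

```python
# U+007F is the last code point with a 1-byte encoding;
# U+0080 is the first with a 2-byte encoding, and it starts at C2.
assert chr(0x7F).encode("utf-8") == b"\x7f"
assert chr(0x80).encode("utf-8") == b"\xc2\x80"

# U+07FF is the last 2-byte code point (all eleven payload bits set):
assert chr(0x7FF).encode("utf-8") == b"\xdf\xbf"
```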
See section 3 UTF-8 definition in the standard. The last paragraph states:
Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences. For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000....
UTF-8 starting with 0xc1 would be a Unicode code point in the range 0x40 to 0x7f. 0xc0 would be a Unicode code point in the range 0x00 to 0x3f.
There is an iron rule that every code point is represented in UTF-8 in the shortest possible way. Since all these code points can be stored in a single UTF-8 byte, they are not allowed to be stored using two bytes.
For the same reason you will find that there are no 4-byte codes starting with 0xf0 0x80 to 0xf0 0x8f because they are stored using fewer bytes instead.
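Python's decoder behaves as RFC 3629 requires and rejects such overlong sequences (a small sketch):

```python
# C0 80 and C1 BF are overlong encodings of U+0000 and U+007F;
# a conforming decoder must reject them.
for bad in (b"\xc0\x80", b"\xc1\xbf"):
    try:
        bad.decode("utf-8")
        raise AssertionError("overlong sequence was accepted")
    except UnicodeDecodeError:
        pass  # rejected, as the specification requires
```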
I realize this is pretty basic, as I am reading about Unicode on Wikipedia and wherever it points, but this "U+0000" notation is not completely explained. It appears to me that "U" always equals 0.
Why is that "U+" part of the notation? What exactly does it mean? (It appears to be some base value, but I cannot understand when or why it is ever non-zero.)
Also, if I receive a string of text from some other source, how do I know whether that string is encoded as UTF-8, UTF-16, or UTF-32? Is there some way I can determine that automatically from context?
From Wikipedia, article Unicode, section Architecture and Terminology:
Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used.
This convention was introduced so that the readers understand that the code point is specifically a Unicode code point. For example, the letter ă (LATIN SMALL LETTER A WITH BREVE) is U+0103; in Code Page 852 it has the code 0xC7, in Code Page 1250 it has the code 0xE3, but when I write U+0103 everybody understands that I mean the Unicode code point and they can look it up.
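The notation is easy to produce programmatically; here is a small Python sketch (`u_notation` is a made-up helper name):

```python
def u_notation(ch: str) -> str:
    """Format a character's code point in U+ notation:
    at least four hex digits, five or six outside the BMP."""
    return f"U+{ord(ch):04X}"

assert u_notation("X") == "U+0058"    # LATIN CAPITAL LETTER X
assert u_notation("ă") == "U+0103"    # LATIN SMALL LETTER A WITH BREVE
assert u_notation("😀") == "U+1F600"  # outside the BMP: five digits
```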
For languages written with the Latin alphabet, UTF-16 and UTF-32 strings will most likely contain lots and lots of bytes with the value 0, which should not appear in UTF-8 encoded strings. By looking at which bytes are zero you can also infer the byte order of UTF-16 and UTF-32 strings, even in the absence of a Byte Order Mark.
So for example if you get the bytes
0xC3 0x89 0x70 0xC3 0xA9 0x65
this is most likely Épée in UTF-8 encoding. In little-endian UTF-16 this would be
0xC9 0x00 0x70 0x00 0xE9 0x00 0x65 0x00
(Note how every other byte, the high-order byte of each 16-bit code unit, is zero.)
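This zero-byte heuristic can be sketched in Python (a rough toy for Latin-script text only; `guess_encoding` is a hypothetical helper name, not a library function):

```python
def guess_encoding(data: bytes) -> str:
    """Rough heuristic for Latin-script text, based on where zero bytes fall."""
    if 0 not in data:
        return "utf-8 (or another 8-bit encoding)"
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    if odd_zeros > even_zeros:
        return "utf-16-le"   # high (second) byte of each code unit is zero
    return "utf-16-be"       # high (first) byte of each code unit is zero

assert guess_encoding("Épée".encode("utf-8")) == "utf-8 (or another 8-bit encoding)"
assert guess_encoding("Épée".encode("utf-16-le")) == "utf-16-le"
assert guess_encoding("Épée".encode("utf-16-be")) == "utf-16-be"
```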
Suppose you have a UTF-16 string whose length in bytes is, for example, 21. Is it safe to say right away that this string contains invalid UTF-16? I am not counting the null terminator here, just the actual text data. My reasoning is that in UTF-16, text elements are encoded as one or two two-byte sequences.
The answer is yes, of course. As you said,
UTF-16 text elements are encoded as 1 or 2 two-byte sequences.
An odd byte count means the string ends with one half of a two-byte sequence, and half of a two-byte sequence is always invalid.
But beware: you say that you are “not counting in the null-terminator here”. But there cannot be a single-byte null-terminator in UTF-16, because a single 0x00 byte can be the least significant byte of a valid UTF-16 byte pair. E.g., the character Ā, called “Latin Capital Letter A with macron” is Unicode U+0100, i.e., the byte sequence 0x00 0x01 in UTF-16LE (little endian) or 0x01 0x00 in UTF-16BE (big endian).
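Python's UTF-16 decoder illustrates both points (a small sketch):

```python
# 10 characters -> 20 bytes in UTF-16
data = "ten chars!".encode("utf-16-le")
assert len(data) == 20

# Any odd byte count (19, 21, ...) leaves half a code unit dangling:
try:
    data[:19].decode("utf-16-le")
    raise AssertionError("odd-length UTF-16 was accepted")
except UnicodeDecodeError:
    pass  # truncated data, as expected

# And a valid UTF-16LE string can contain single 0x00 bytes:
assert b"\x00" in "Ā".encode("utf-16-le")   # U+0100 -> 00 01
```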
Why does the degree symbol differ between Unicode and UTF-8?
According to http://www.utf8-chartable.de/ and
http://www.fileformat.info/info/unicode/char/b0/index.htm,
the Unicode value is B0 but the UTF-8 value is C2 B0. How come!??
UTF-8 is a way to encode Unicode characters using a variable number of bytes (the number of bytes depends on the code point).
Code points between U+0080 and U+07FF use the following 2-byte encoding:
110xxxxx 10xxxxxx
where x represent the bits of the code point being encoded.
Let's consider U+00B0. In binary, 0xB0 is 10110000. If one substitutes the bits into the above template, one gets:
11000010 10110000
In hex, this is 0xC2 0xB0.
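The substitution into the template can be done by hand with bit operations, and the result matches what Python's codec produces (a small sketch):

```python
cp = 0x00B0                      # degree sign code point
assert f"{cp:08b}" == "10110000"

# Fill the 2-byte template 110xxxxx 10xxxxxx with the code point's bits:
byte1 = 0b11000000 | (cp >> 6)           # 110 prefix + top 5 bits
byte2 = 0b10000000 | (cp & 0b00111111)   # 10 prefix + low 6 bits
assert bytes([byte1, byte2]) == b"\xc2\xb0"

# The built-in codec agrees:
assert "\u00b0".encode("utf-8") == b"\xc2\xb0"
```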
UTF-8 is one encoding of Unicode. UTF-16 and UTF-32 are other encodings of Unicode.
Unicode defines a numeric value for each character; the degree symbol happens to be 0xB0, or 176 in decimal. Unicode does not define how those numeric values are represented.
UTF-8 encodes the value 0xB0 as two consecutive octets (bytes) with values 0xC2 0xB0.
UTF-16 encodes the same value either as 0x00 0xB0 or as 0xB0 0x00, depending on endianness.
UTF-32 encodes it as 0x00 0x00 0x00 0xB0 or as 0xB0 0x00 0x00 0x00, again depending on endianness (I suppose other orderings are possible).
Unicode (in UTF-16 and UTF-32) uses the code point 0x00B0 for that character. UTF-8 doesn't allow single-byte values above 127 (0x7F), because the high bit of each byte is reserved to indicate that this particular character is part of a multi-byte sequence.
Basic 7-bit ASCII maps directly to the first 128 characters of UTF-8. Any character whose value is above 127 decimal (7F hex) must be "escaped" by setting the high bit and adding one or more extra bytes to describe it.
The answers from NPE, Marc and Keith are good and above my knowledge on this topic. Still I had to read them a couple of times before I realized what this was about. Then I saw this web page that made it "click" for me.
At http://www.utf8-chartable.de/, you can see a chart of code points next to their UTF-8 byte sequences.
Notice how it is necessary to use TWO bytes to encode ONE character. Now read the accepted answer from NPE again.