Byte values in multi-byte UTF-8 characters - unicode

I'm reading about UTF-8 character encoding but struggling to understand it. I know that ASCII characters (that is, byte values 0x00 to 0x7F) are represented in UTF-8 as a single byte. The question I'm trying to answer is, in the case of multi-byte UTF-8 characters, are the second and subsequent bytes always 0x80 to 0xFF, or can they be any value?

When a given Unicode codepoint value is U+0000 - U+007F, it fits in a single byte in UTF-8. The byte's high bit is set to 0, and the remaining 7 bits hold the bits of the codepoint value.
When a given Unicode codepoint value is U+0080 or higher, it requires 2-4 bytes in UTF-8, depending on the codepoint value (2 for U+0080 - U+07FF, 3 for U+0800 - U+FFFF, and 4 for U+10000 and above). The first byte's high bits are set to either 110, 1110, or 11110 to indicate how many bytes are in the full sequence (2-4, respectively). The high bits of the subsequent byte(s) are set to 10. The rest of the bits of all of the bytes contain the bits of the codepoint value, spread out through the bytes as needed. Note that while the 4-byte pattern can physically reach U+1FFFFF, RFC 3629 restricts UTF-8 to a maximum codepoint of U+10FFFF to match the range of UTF-16.
Bits of      First        Last         Bytes in
code point   code point   code point   sequence   Byte 1     Byte 2     Byte 3     Byte 4
7            U+0000       U+007F       1          0xxxxxxx
11           U+0080       U+07FF       2          110xxxxx   10xxxxxx
16           U+0800       U+FFFF       3          1110xxxx   10xxxxxx   10xxxxxx
21           U+10000      U+1FFFFF     4          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
Read the description on Wikipedia for more details. It provides the above table, as well as a few examples. Also read RFC 3629, which is an official UTF-8 spec.
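The bit layout in the table above can be sketched as a hand-rolled encoder in Python (for illustration only; real code should use the built-in codec, which also validates surrogates and range limits):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point by hand, following the
    bit layout in the table above."""
    if cp <= 0x7F:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Agrees with Python's built-in codec on one example per length:
for ch in "A\u00e9\u20ac\U0001F600":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```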

In a multi-byte representation of a character in UTF-8, all bytes are in the range 0x80 to 0xFF, i.e. they have the most significant bit set. This means that bytes 0x00 to 0x7F are used only as single-byte representations of ASCII characters (called Basic Latin in Unicode).
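A quick Python check of this property, using one character each of the 2-, 3-, and 4-byte lengths:

```python
# Every byte of a multi-byte UTF-8 sequence has its most significant
# bit set, so the values 0x00-0x7F never occur inside one.
for ch in ("\u00e9", "\u20ac", "\U0001F600"):   # é, €, 😀
    encoded = ch.encode("utf-8")
    assert len(encoded) > 1
    assert all(b >= 0x80 for b in encoded)
```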

Related

Does any other UTF-8-encoded code point use the ESC byte 0x1B?

Is there any Unicode codepoint where one of the bytes in its UTF-8 representation is the ESC byte (0x1B), other than the U+001B codepoint itself?
Context: the ESC byte is used in ANSI escape codes (in terminals) and I'd like to know whether that byte can appear as part of a multi-byte UTF-8 sequence.
No, all bytes in a UTF-8 multi-byte sequence have bit 7 set. Only the single-byte ASCII range 0-127 has bit 7 clear, and that includes byte 0x1B (whose bit pattern is 00011011), so no other encoded codepoint will have a 0x1B byte in it:
https://en.wikipedia.org/wiki/UTF-8
First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4     Code points
U+0000             U+007F            0xxxxxxx                                    128
U+0080             U+07FF            110xxxxx   10xxxxxx                         1920
U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx              61440
U+10000            U+10FFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   1048576
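This can be verified exhaustively in Python by encoding every codepoint and looking for the 0x1B byte (a brute-force sketch; it takes a moment to run):

```python
# 0x1B has its high bit clear, so it can never appear inside a
# multi-byte sequence. Brute-force check over the whole code space:
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:       # surrogates cannot be encoded
        continue
    if 0x1B in chr(cp).encode("utf-8"):
        assert cp == 0x1B            # only ESC itself contains the byte
```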

Why is there no Unicode starting with 0xC1?

While studying the Unicode and utf-8 encoding,
I noticed that the 129th Unicode character (U+0080), when encoded in UTF-8, starts with 0xC2.
I checked through the lead byte 0xCF, and no codepoint was encoded starting with 0xC1.
Why does the 129th character start at 0xC2 instead of 0xC1?
The UTF-8 specification, RFC 3629, specifically states in the introduction:
The octet values C0, C1, F5 to FF never appear.
The reason for this is that a 1-byte UTF-8 sequence consists of the 8-bit binary pattern 0xxxxxxx (a zero followed by seven bits) and can represent Unicode code points that fit in seven bits (U+0000 to U+007F).
A 2-byte UTF-8 sequence consists of the 16-bit binary pattern 110xxxxx 10xxxxxx and can represent Unicode code points that fit in eight to eleven bits (U+0080 to U+07FF).
It is not legal in UTF-8 encoding to use more bytes than the minimum required, so while U+007F could in principle be written in two bytes as 11000001 10111111 (C1 BF hex), the specification requires the shorter 1-byte form 01111111 (7F hex).
The first valid two-byte sequence is the encoding of U+0080, which is 11000010 10000000 (C2 80 hex), so C0 and C1 will never appear.
See section 3 UTF-8 definition in the standard. The last paragraph states:
Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences. For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000....
A two-byte UTF-8 sequence starting with 0xC1 would decode to a Unicode codepoint in the range 0x40 to 0x7F, and one starting with 0xC0 to a codepoint in the range 0x00 to 0x3F.
There is an iron rule that every code point is represented in UTF-8 in the shortest possible way. Since all these code points can be stored in a single UTF-8 byte, they are not allowed to be stored using two bytes.
For the same reason you will find that there are no 4-byte codes starting with 0xf0 0x80 to 0xf0 0x8f because they are stored using fewer bytes instead.
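Python's strict UTF-8 decoder demonstrates this rejection of overlong sequences:

```python
# The overlong two-byte sequence C1 BF would decode to U+007F, but a
# strict decoder must reject it; only the one-byte form 7F is valid.
try:
    b"\xc1\xbf".decode("utf-8")
    raise AssertionError("overlong sequence was accepted")
except UnicodeDecodeError:
    pass  # rejected, as required

# The shortest-form rule in action:
assert b"\x7f".decode("utf-8") == "\u007f"        # 1-byte form of U+007F
assert "\u0080".encode("utf-8") == b"\xc2\x80"    # first valid 2-byte sequence
```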

unicode UTF-8, Decoding issue

UTF-8 is a variable-length encoding. If a character can be represented using a single byte, e.g. A (the Latin letter A), UTF-8 encodes it with a single byte. If it requires two bytes, it uses two bytes, and so on.
Now consider i encode A (01000001) あ(11100011 10000001 10000010).
This will be stored in memory as continuous space: 01000001 11100011 10000001 10000010.
My question is: while decoding, how does the editor know that the 1st byte is for the first character only and the next 3 bytes are for the 2nd character?
It could end up decoding 4 characters, with each byte considered a separate character. I mean, where is the distinction here?
The UTF-8 encoding tells the program how many bytes there are for each encoded codepoint. Any byte starting with 0xxxxxxx is an ASCII character from 0 to 127. Any byte starting with 10xxxxxx is a continuation byte and can only occur after a starting byte: 110xxxxx, 1110xxxx or 11110xxx specify that the next byte, two bytes or three bytes are continuation bytes, respectively.
If there aren’t the right number of continuation bytes, or a continuation byte ever appears in the wrong place, then the string is not valid UTF-8. Some programs take advantage of this to try to auto-detect the encoding.
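A minimal sketch of this length-detection logic in Python (the function name `sequence_length` is my own; real code should rely on the built-in decoder):

```python
def sequence_length(first_byte: int) -> int:
    """Total number of bytes in the UTF-8 sequence that starts with
    this byte, read off its high bits as described above."""
    if first_byte < 0x80:            # 0xxxxxxx: single-byte ASCII
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: start of 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: start of 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: start of 4-byte sequence
        return 4
    raise ValueError("continuation byte cannot start a sequence")

data = "A\u3042".encode("utf-8")     # 01000001 11100011 10000001 10000010
assert sequence_length(data[0]) == 1   # 'A' is one byte
assert sequence_length(data[1]) == 3   # あ occupies the next three bytes
```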

How does UTF-8 encoding identify single byte and double byte characters?

Recently I've faced an issue regarding character encoding, and while digging into character sets and character encodings this doubt came to my mind. UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding format, how does it differentiate single-byte and two-byte characters? For example, "Aݔ" is stored as "410754" (the codepoint for A is 41 and for the Arabic character ݔ it is 0754). How does the decoder identify 41 as one character and 0754 as another two-byte character? Why isn't it read as 4107 as one two-byte character and 54 as a single-byte character?
For example, "Aݔ" is stored as "410754"
That’s not how UTF-8 works.
Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose codepoints numerically match their UTF-8 presentation. For example, U+0041 becomes 0x41 which is 01000001 in binary.
All other characters are represented with multiple bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
Computers know where one character ends and the next one starts because UTF-8 was designed so that the single-byte values used for ASCII do not overlap with those used in multi-byte sequences. The bytes 0x00 through 0x7F are only used for ASCII and nothing else; the bytes above 0x7F are only used for multi-byte sequences and nothing else. Furthermore, the bytes that are used at the beginning of the multi-byte sequences also cannot occur in any other position in those sequences.
Because of that the codepoints need to be encoded. Consider the following binary patterns:
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The number of leading ones in the first byte tells you how many bytes the whole sequence has (110 means two bytes, 1110 three, 11110 four). All following bytes of the sequence start with 10 in binary. To encode the character you convert its codepoint to binary and fill in the x’s.
As an example: U+0754 is between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100, so you replace the x’s with those digits:
11011101 10010100
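The worked example can be checked in Python:

```python
# U+0754 by hand, following the steps above:
cp = 0x0754
byte1 = 0xC0 | (cp >> 6)        # 110xxxxx -> 0xDD
byte2 = 0x80 | (cp & 0x3F)      # 10xxxxxx -> 0x94
assert bytes([byte1, byte2]) == "\u0754".encode("utf-8") == b"\xdd\x94"
```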
Short answer:
UTF-8 is designed to be able to unambiguously identify the type of each byte in a text stream:
1-byte codes (all and only the ASCII characters) start with a 0
Leading bytes of 2-byte codes start with two 1s followed by a 0 (i.e. 110)
Leading bytes of 3-byte codes start with three 1s followed by a 0 (i.e. 1110)
Leading bytes of 4-byte codes start with four 1s followed by a 0 (i.e. 11110)
Continuation bytes (of all multi-byte codes) start with a single 1 followed by a 0 (i.e. 10)
Your example Aݔ, which consists of the Unicode code points U+0041 and U+0754, is encoded in UTF-8 as:
01000001 11011101 10010100
So, when decoding, UTF-8 knows that the first byte must be a 1-byte code, the second byte must be the leading byte of a 2-byte code, the third byte must be a continuation byte, and since the second byte is the leading byte of a 2-byte code, the second and third byte together must form this 2-byte code.
See here how UTF-8 encodes Unicode code points.
Just to clarify, ASCII means standard 7-bit ASCII, not the extended 8-bit "ASCII" commonly used in Europe.
Thus, codepoints that would fit in one extended byte (U+0080 to U+00FF) already spill into the two-byte representation, and the upper part of the two-byte codepoint range (U+0800 to U+FFFF) takes the full three-byte representation.
The four-byte representation only needs the lowest three bytes of the codepoint value, and codepoints only go up to 1,114,111 (0x10FFFF) of the 16,777,215 (0xFFFFFF) values that three bytes could hold.
Hope this helps somebody!

In what circumstances would 32-bits be required in UTF-8 encoding?

From my understanding and what I have been reading around the web, UTF-8 can use 1-4 code units (each a byte in length) to encode all characters from the Unicode character set. What I am wondering is this: since all code points in Unicode can be represented in 21 bits, when would you use 4 code units rather than 3?
You only need 24 bits to represent any Unicode character so when would you use 32 bits in UTF-8 encoding and why? Are extra bits needed to store additional data of some kind?
The UTF-8 encoding has overhead. The first byte uses 1 to 5 bits (the leading 0, 110, 1110, or 11110) to indicate how many bytes are in the sequence, and each continuation byte uses 2 bits (the leading 10) as a marker. Thus, a four-byte UTF-8 sequence has 5 bits of overhead in the first byte and 2 bits of overhead in each of the remaining 3 bytes, leaving 21 bits to encode the codepoint, which is enough for every codepoint up to U+10FFFF.
1-byte UTF-8, 7 data bits (U+0000 to U+007F): 0xxxxxxx
2-byte UTF-8, 11 data bits (U+0080 to U+07FF): 110xxxxx 10xxxxxx
3-byte UTF-8, 16 data bits (U+0800 to U+FFFF): 1110xxxx 10xxxxxx 10xxxxxx
4-byte UTF-8, 21 data bits (U+10000 to U+10FFFF): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Ref: UTF-8
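A quick Python check of the 3-byte/4-byte boundary described above:

```python
# U+FFFF is the last codepoint whose 16 bits fit the 3-byte form;
# U+10000 needs more than 16 data bits, so the 4-byte form kicks in.
assert len("\uffff".encode("utf-8")) == 3
assert len("\U00010000".encode("utf-8")) == 4
assert len("\U0001F600".encode("utf-8")) == 4   # 😀 (U+1F600)
assert len("\U0010FFFF".encode("utf-8")) == 4   # highest valid codepoint
```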