Are there any UTF-8 code units that have byte 60 or 62 (`<` and `>`) as not the first byte of their binary representation? - unicode

I need to debug an XML parser, and I am wondering if I can construct "malicious" input that will cause it to not recognize opening and closing tags correctly.
Additionally, where can I find this sort of information in general? After this I will also want to be sure that the parser I am working with won't have trouble with other special characters such as &, =, ", etc.

UTF-8 makes it very easy to figure out what the role of a code unit (i.e. a byte) is:
If the highest bit is not set, i.e. the code unit is 0xxxxxxx, then this byte expresses an entire code point, whose value is xxxxxxx (i.e. 7 bits of information).
If the highest bit is set and the code unit is 10xxxxxx, then it is a continuation part of a multibyte sequence, carrying six bits of information.
Otherwise, the code unit is the initial byte of a multibyte sequence, as follows:
110xxxxx: Two bytes (one continuation byte), for 5 + 6 = 11 bits.
1110xxxx: Three bytes (two continuation bytes), for 4 + 6 + 6 = 16 bits.
11110xxx: Four bytes (three continuation bytes), for 3 + 6 + 6 + 6 = 21 bits.
As you can see, a value 60, which is 00111100, is a single-byte codepoint of value 60, and the same byte cannot occur as part of any multibyte sequence.
The scheme can actually be extended up to seven bytes, encoding up to 36 bits, but since Unicode only requires 21 bits, four bytes suffice. The standard mandates that a code point must be represented with the minimal number of code units.
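A decoder can classify any byte with a couple of bit tests; here is a minimal sketch in Python (my own illustration of the rules above, not code from the answer):

```python
def classify(byte: int) -> str:
    """Classify a UTF-8 code unit by its high bits."""
    if byte < 0x80:          # 0xxxxxxx: a complete one-byte code point
        return "single"
    if byte < 0xC0:          # 10xxxxxx: continuation byte of a multibyte sequence
        return "continuation"
    if byte < 0xE0:          # 110xxxxx: leads a 2-byte sequence
        return "lead-2"
    if byte < 0xF0:          # 1110xxxx: leads a 3-byte sequence
        return "lead-3"
    return "lead-4"          # 11110xxx: leads a 4-byte sequence

# 0x3C ('<') and 0x3E ('>') always classify as standalone single-byte code points:
print(classify(0x3C), classify(0x3E))  # single single
```

Since every byte below 0x80 is classified "single", neither 0x3C nor 0x3E can ever occur inside a multibyte sequence.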
Update: As @Mark Tolonen rightly points out, you should check carefully whether each encoded code point is actually encoded with the minimal number of code units. If a browser inadvertently accepted such input, a user could sneak something past you that you would not spot in a byte-for-byte analysis. As a starting point you could look for bytes like 10111100, but you'd have to check the entire multibyte sequence of which it is a part (since such a byte can of course occur legitimately as part of many different code points). Ultimately, if you can't trust the browser, you don't really get around decoding everything and checking the resulting code point sequence for occurrences of U+003C etc., rather than looking at the byte stream at all.

In UTF-8, no. In other encodings, yes.
In UTF-8, by design, all bytes of a multibyte character always have the highest bit set. Conversely, a byte that doesn't have the highest bit set is always an ASCII character.
However, this is not true for other encodings, which are also valid for XML.
For more information about UTF-8, see e.g. Wikipedia.

A poorly-designed UTF-8 decoder could interpret the bytes C0 BC and C0 BE as U+003C and U+003E. As @Kerrek SB stated in his answer:
The standard mandates that a code point must be represented with the minimal number of code units.
But a poor algorithm might still decode a malformed two-byte UTF-8 sequence that is not the minimal number of code units:
C0 BC = 11000000 10111100 → payload bits 00000 111100 = 00000111100 = 3C (hex) = 60 (dec) = '<'
So in your testing be sure to include malformed UTF-8 sequences and verify that they are rejected.
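Python's built-in decoder is strict about this, which makes for a quick demonstration (my own example, not from the answer):

```python
# The overlong sequence C0 BC would decode to '<' (U+003C) under a naive
# decoder; a conforming strict decoder must reject it instead.
malformed = b"\xc0\xbc"
try:
    print(malformed.decode("utf-8"))
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# The well-formed single byte decodes normally:
print(b"\x3c".decode("utf-8"))  # <
```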

Related

why cannot UTF-8 encoding of unicode code points fit in 3 bytes

Wikipedia
Unicode comprises 1,114,112 code points in the range 0x0 to 0x10FFFF
I am a little puzzled that a Unicode encoding can take up to 4 bytes. Couldn't the 1,114,112 code points comfortably fit in 3 bytes? Maybe I am missing some special situation where 4 bytes are needed; please give a concrete example if there is one.
The Wikipedia article on the history of UTF-8 says that an earlier version of UTF-8 allowed more than 21 bits to be encoded. These encodings took 5 or even 6 bytes.
After it became clear that 2^21 code points will probably be enough for the remaining time of humankind (the same thinking as with 5 bits, 6 bits, 7 bits, 8 bits and 16 bits), the encodings with 5 and 6 bytes were simply forbidden. All other encoding rules were kept, for backwards compatibility.
As a consequence, the number space for the Unicode code points is now 0..10FFFF, which needs even a bit less than 21 bits. Therefore it may be worth checking whether these 21 bits could fit into the 24 bits of 3 bytes, instead of the current 4 bytes.
One important property of UTF-8 is that each byte that is part of a multibyte encoding has its highest bit set. To distinguish the leading byte from the trailing bytes, the leading byte has the second-highest bit set, while the trailing bytes have the second-highest bit cleared. This property ensures a consistent ordering. Therefore the characters could be encoded like this:
0xxx_xxxx 7 bits freely choosable
110x_xxxx 10xx_xxxx 11 bits freely choosable
1110_xxxx 10xx_xxxx 10xx_xxxx 16 bits freely choosable
Now these three forms together cover 2^7 + 2^11 + 2^16 = 67,712 code points, i.e. roughly 2^16.05, which is far fewer than the 2^21 needed. Therefore encoding all Unicode code points using up to 3 bytes per the current UTF-8 encoding rules is impossible.
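The shortfall is easy to verify numerically (a sketch of the arithmetic above):

```python
import math

# Capacity of the 1-, 2- and 3-byte forms under the current UTF-8 rules:
capacity = 2**7 + 2**11 + 2**16   # 67,712 code points
needed = 0x110000                  # 1,114,112 code points in 17 planes
print(capacity, needed, capacity < needed)  # 67712 1114112 True
print(round(math.log2(capacity), 2))        # 16.05 "bits" of address space
```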
You can define another encoding where the highest bit of each byte is the continuation bit:
0xxx_xxxx 7 bits freely choosable
1xxx_xxxx 0xxx_xxxx 14 bits freely choosable
1xxx_xxxx 1xxx_xxxx 0xxx_xxxx 21 bits freely choosable
Now you have enough space to encode all 21-bit code points. But that's an entirely new encoding, so you would have to establish this world-wide. Given the experience from Unicode, it will take about 20 years. Good luck.
"unicode" is not an encoding. The common encodings for Unicode are UTF-8, UTF-16 and UTF-32. UTF-8 uses 1-, 2-, 3- or 4-byte sequences and is explained below. It is the overhead of the leading/trailing bit sequences that requires 4 bytes for a 21-bit value.
The UTF-8 encoding uses up to 4 bytes to represent Unicode code points using the following bit patterns:
1-byte UTF-8 = 0xxxxxxx (binary) = 7 bits = U+0000 to U+007F
2-byte UTF-8 = 110xxxxx 10xxxxxx (binary) = 11 bits = U+0080 to U+07FF
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx (binary) = 16 bits = U+0800 to U+FFFF
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (binary) = 21 bits = U+10000 to U+10FFFF
The advantage of UTF-8 is that lead bytes and trailing bytes use disjoint bit patterns, which allows easy validation of a correct UTF-8 sequence.
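The table maps each code point range to a sequence length, and that mapping can be cross-checked against a real encoder (a sketch of mine, using Python's `str.encode` as the reference):

```python
def utf8_len(cp: int) -> int:
    """UTF-8 sequence length for a code point, per the table above."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

# Spot-check each range boundary against Python's own encoder
# (surrogates U+D800..U+DFFF are skipped; they are not encodable):
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
```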
Note also it is illegal to use a longer encoding for a Unicode value that fits into a smaller sequence. For example:
1100_0001 1000_0001 (binary), i.e. C1 81 (hex), would encode U+0041, but 0100_0001 (binary), i.e. 41 (hex), is the shorter sequence, so the two-byte form is illegal.
Ref: https://en.wikipedia.org/wiki/UTF-8
Expanding on my comment:
Unicode is not an encoding. It makes no sense to speak of the size of a Unicode code point. Unicode is a mapping between code points and semantic names (e.g. 'LATIN CAPITAL LETTER A'). You are free to choose your own encoding.
Originally Unicode was meant to be a universal coded character set that fit into 16 bits (hence the Japanese/Chinese Han unification). As you can see, it failed at that target. A second (very important) goal was to be able to convert to Unicode and back without loss of data (this simplifies conversion to Unicode: one tool at a time, at any layer).
So there was the problem of how to expand Unicode to support more than 16 bits while, at the same time, not breaking all existing Unicode programs. The idea was to use surrogates, so that programs that only know about 16-bit Unicode (UCS-2) could still work (and, by the way, Python 2 and JavaScript know just UCS-2, and they still work fine; the language does not need to know that Unicode code points can require more than 16 bits).
The surrogates gave the upper limit of present-day Unicode (which is why it is not a power of 2).
UTF-8 was designed later. Its characteristics (by design) are: compatibility with ASCII (for 7-bit characters), the ability to encode all code points (including those beyond 16 bits), and the ability to jump to a random position and quickly synchronize on the start of a character. This last point costs some address space, so the text is not as dense as it could be, but it is much more practical (and quick for scanning through files). It is this extra synchronization data that makes it impossible to encode all Unicode code points in 3 bytes with UTF-8.
You could use a UTF-24 (see the comment), but then you would lose UTF-8's advantage of being compatible with ASCII, and you would also lose out to UTF-16, which often needs just 2 bytes per character (and not 4).
Remember: the Unicode code points above 16 bits are the rarer ones: ancient scripts, better (semantic) representations of existing glyphs, and new emoji (and hopefully nobody fills an entire long text just with emoji). So the utility of a 3-byte form is not (yet) compelling. Maybe if aliens come to Earth and we have to write with their new language's characters, we will use mostly Unicode code points above 16 bits. Not something I think will happen soon.

UTF8, codepoints, and their representation in Erlang and Elixir

going through Elixir's handling of unicode:
iex> String.codepoints("abc§")
["a", "b", "c", "§"]
very good; and byte_size/1 of this is not 4 but 5, because the last char takes 2 bytes, I get that.
The ? operator (or is it a macro? can't find the answer) tells me that
iex(69)> ?§
167
Great; so then I look into the UTF-8 encoding table, and see the value c2 a7 as the hex encoding for the char. That means the two bytes (as witnessed by byte_size/1) are c2 (194 in decimal) and a7 (167 in decimal). That 167 is the result I got when evaluating ?§ earlier. What I don't understand, exactly, is why that number is a "code point", as per the description of the ? operator. When I try to work backwards, and evaluate the binary, I get what I want:
iex(72)> <<0xc2, 0xa7>>
"§"
And to make me go completely bananas, this is what I get in Erlang shell:
24> <<167>>.
<<"§">>
25> <<"\x{a7}">>.
<<"§">>
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>>
27> <<"\x{c2a7}">>.
<<"§">>
!! while Elixir is only happy with the code above... what is it that I don't understand? Why is Erlang perfectly happy with a single byte, given that Elixir insists that char takes 2 bytes - and Unicode table seems to agree?
The codepoint is what identifies the Unicode character. The codepoint for § is 167 (0xA7). A codepoint can be represented in bytes in different ways, depending on your encoding of choice.
The confusion here comes from the fact that the codepoint 167 (0xA7) is identified by the bytes 0xC2 0xA7 when encoded to UTF-8.
When you add Erlang to the conversation, you have to remember that Erlang's default encoding was/is latin1 (there is an effort to migrate to UTF-8, but I am not sure if it made it to the shell - someone please correct me).
In latin1, the codepoint § (0xA7) is also represented by the byte 0xA7. So explaining your results directly:
24> <<167>>.
<<"§">> %% this is encoded in latin1
25> <<"\x{a7}">>.
<<"§">> %% still latin1
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>> %% this is encoded in utf8, as the /utf8 modifier says
27> <<"\x{c2a7}">>.
<<"§">> %% this is latin1
The last one is quite interesting and potentially confusing. In Erlang binaries, if you pass an integer with value more than 255, it is truncated. So the last example is effectively doing <<49831>> which when truncated becomes <<167>>, which is again equivalent to <<"§">> in latin1.
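The truncation is just modular arithmetic, so the numbers can be checked in any language (a quick Python sketch, since the point here is the math rather than Erlang itself):

```python
value = 0xC2A7          # the integer written as \x{c2a7} in the Erlang example
print(value)            # 49831
print(value & 0xFF)     # 167 - only the low byte survives in <<49831>>
print(0xA7)             # 167 - the latin1 byte for '§'
```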
The code point is a number assigned to the character. It's an abstract value, not dependent on any particular representation in actual memory somewhere.
In order to store the character, you have to convert the code point to some sequence of bytes. There are several different ways to do this; each is called a Unicode Transformation Format, and named UTF-n, where the n is the number of bits in the basic unit of encoding. There used to be a UTF-7, used where 7-bit ASCII was assumed and even the 8th bit of a byte couldn't be reliably transmitted; in modern systems, there are UTF-8, UTF-16, and UTF-32.
Since the largest code point value fits comfortably in 21 bits, UTF-32 is the simplest; you just store the code point as a 32-bit integer. (There could theoretically be a UTF-24 or even a UTF-21, but common modern computing platforms deal naturally with values that take up either exactly 8 or a multiple of 16 bits, and have to work harder to deal with anything else.)
So UTF-32 is simple, but inefficient. Not only does it have 11 extra bits that will never be needed, it has 5 bits that are almost never needed. Far and away most Unicode characters found in the wild are in the Basic Multilingual Plane, U+0000 through U+FFFF. UTF-16 lets you represent all of those code points as a plain integer, taking up half the space of UTF-32. But it can't represent anything from U+10000 on up that way, so part of the 0000-FFFF range is reserved as "surrogate pairs" that can be put together to represent a high-plane Unicode character with two 16-bit units, for a total of 32 bits again but only when needed.
Java uses UTF-16 internally, but Erlang (and therefore Elixir), along with most other programming systems, uses UTF-8. UTF-8 has the advantage of completely transparent compatibility with ASCII - all characters in the ASCII range (U+0000 through U+007F, or 0-127 decimal) are represented by single bytes with the corresponding value. But any characters with code points outside the ASCII range require more than one byte each - even those in the range U+0080 through U+00FF, decimal 128 through 255, which only take up one byte in the Latin-1 encoding that used to be the default before Unicode.
So with Elixir/Erlang "binaries", unless you go out of your way to encode things differently, you are using UTF-8. If you look at the high bit of the first byte of a UTF-8 character, it's either 0, meaning you have a one-byte ASCII character, or it's 1. If it's 1, then the second-highest bit is also 1, because the number of consecutive 1-bits counting down from the high bit before you get to a 0 bit tells you how many bytes total the character takes up. So the pattern 110xxxxx means the character is two bytes, 1110xxxx means three bytes, and 11110xxx means four bytes. (There is no legal UTF-8 character that requires more than four bytes, although the encoding could theoretically support up to seven.)
The rest of the bytes all have the two high bits set to 10, so they can't be mistaken for the start of a character. And the rest of the bits are the code point itself.
To use your case as an example, the code point for "§" is U+00A7 - that is, hexadecimal A7, which is decimal 167 or binary 10100111. Since that's greater than decimal 127, it will require two bytes in UTF-8. Those two bytes will have the binary form 110abcde 10fghijk, where the bits abcdefghijk will hold the code point. So the binary representation of the code point, 10100111, is padded out to 00010100111 and split into the sequences 00010, which replaces abcde in the UTF-8 template, and 100111, which replaces fghijk. That yields two bytes with binary values 11000010 and 10100111, which are C2 and A7 in hexadecimal, or 194 and 167 in decimal.
You'll notice that the second byte coincidentally has the same value as the code point you're encoding, but it's important to realize that this correspondence is just a coincidence. There are a total of 64 code points, from 128 (U+0080) through 191 (U+00BF), that work out that way: their UTF-8 encoding consists of a byte with decimal value 194 followed by a byte whose value is equal to the code point itself. But for the other 1,114,048 code points possible in Unicode, that is not the case.
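That 64-code-point coincidence is easy to confirm mechanically (my own check, using Python's built-in `str.encode`):

```python
# For U+0080..U+00BF, UTF-8 is the lead byte C2 followed by the code point itself:
for cp in range(0x80, 0xC0):
    assert chr(cp).encode("utf-8") == bytes([0xC2, cp])

# From U+00C0 onward the lead byte changes and the trail byte no longer matches:
print("§".encode("utf-8").hex())  # c2a7 - U+00A7, trail byte equals code point
print("À".encode("utf-8").hex())  # c380 - U+00C0, trail byte 80, not C0
```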

How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode, with explanation. I know a char can be encoded as 1, 2, 3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though the starting byte of a char already makes clear how long the sequence should be.
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,929 code points are actually assigned in Unicode 12.1.
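The arithmetic behind that 1,111,998 figure, spelled out (a small sketch; the noncharacter breakdown is the standard one of 34 plane-final code points plus the U+FDD0..U+FDEF block):

```python
planes = 17 * 65536            # 1,114,112 code points in total
surrogates = 0xE000 - 0xD800   # 2,048 surrogate code points (U+D800..U+DFFF)
noncharacters = 34 + 32        # 66: two per plane (U+..FFFE/FFFF) + U+FDD0..U+FDEF
encodable = planes - surrogates - noncharacters
print(encodable)  # 1111998
```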
I also don't understand why continuation bytes have restrictions even though the starting byte of a char already makes clear how long the sequence should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.
In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
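The false positive is reproducible with Python's codecs (my own demo; it relies on Python shipping a `gb18030` codec, which CPython does):

```python
text = "ß"  # U+00DF, the letter from the GB 18030 example

gb = text.encode("gb18030")
utf8 = text.encode("utf-8")
print(gb.hex())    # 81308938 - contains 0x30 ('0') and 0x38 ('8')
print(utf8.hex())  # c39f - every byte has the high bit set

# A naive byte-oriented search finds a bogus '8' inside the GB 18030 form,
# but can never find an ASCII byte inside a multibyte UTF-8 sequence:
print(b"8" in gb)    # True - false positive
print(b"8" in utf8)  # False - UTF-8 is self-synchronizing
```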
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.
The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for restrictions on the continuation bytes are presumably so it is easy to find the beginning of the next character (as continuation characters are always of the form 10xxxxxx, but the starting byte can never be of this form).
Unicode supports 1,114,112 code points. There are 2048 surrogate code point, giving 1,112,064 scalar values. Of these, there are 66 non-characters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).
To give a metaphorically accurate answer, all of them.
Continuation bytes in the UTF-8 encoding allow for resynchronization of the encoded octet stream in the face of "line noise". The decoder merely needs to scan forward for a byte whose value is not between 0x80 and 0xBF to find the start of the next character.
In theory, the encodings used today allow for expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where a maximal-length tweet can encode up to 4,340 bits' worth of data (140 characters, valid and invalid, times 31 bits each).
According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.
Unicode spans the hexadecimal amount of 110000 code points, which is 1,114,112 in decimal.

Unicode code point limit

As explained here, all Unicode encodings end at the largest code point, 10FFFF. But I've heard differently, that
they can go up to 6 bytes - is that true?
UTF-8 underwent some changes during its life, and there are many specifications (most of which are outdated now) which standardized UTF-8. Most of the changes were introduced to help compatibility with UTF-16 and to allow for the ever-growing amount of codepoints.
To make a long story short, UTF-8 was originally specified to allow codepoints of up to 31 bits (or 6 bytes). But with RFC 3629, this was reduced to 4 bytes max, to be more compatible with UTF-16.
Wikipedia has some more information. The specification of the Universal Character Set is closely linked to the history of Unicode and its transformation format (UTF).
See the answers to Do UTF-8,UTF-16, and UTF-32 Unicode encodings differ in the number of characters they can store?
UTF-8 and UTF-32 are theoretically capable of representing characters above U+10FFFF, but were artificially restricted to match UTF-16's capacity.
The largest Unicode codepoint and the encodings used for Unicode characters are two different things. According to the standard, the highest codepoint really is 0x10FFFF, but for that you need just 21 bits, which fit easily into 4 bytes - even with 11 bits wasted!
I guess by your question about 6 bytes you mean a 6-byte UTF-8 sequence, right? As others have answered already, using the UTF-8 mechanism you could in principle handle 6-byte sequences; you could even handle 7-byte and even 8-byte sequences. A 7-byte sequence gives you a range of just what the continuation bytes have to offer, 6 × 6 bits = 36 bits, and an 8-byte sequence gives you 7 × 6 bits = 42 bits. You could handle them, but they are not allowed because they are unneeded - the highest codepoint is 0x10FFFF.
It is also forbidden to use longer sequences than needed, as Hibou57 has mentioned. With UTF-8 one must always use the shortest sequence possible, or the sequence will be treated as invalid! An ASCII character must be a 7-bit single byte, of course. The second thing is that the UTF-8 4-byte sequence gives you 3 bits of payload in the start byte and 18 bits of payload in the following bytes, making 21 bits, and that matches the calculation of surrogates in the UTF-16 encoding: the bias 0x10000 is subtracted from the codepoint, and the remaining 20 bits go to the high- and low-surrogate payload areas, each of 10 bits. The third and last thing is that within UTF-8 it is not allowed to encode high- or low-surrogate values. Surrogates are not characters but containers for them; surrogates can only appear in UTF-16, not in UTF-8 or UTF-32 encoded files.
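The surrogate arithmetic mentioned here can be written out explicitly; a sketch (mirroring what a UTF-16 encoder does internally):

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                # subtract the bias, leaving 20 bits
    high = 0xD800 + (v >> 10)       # top 10 bits into the high surrogate
    low = 0xDC00 + (v & 0x3FF)      # bottom 10 bits into the low surrogate
    return high, low

# U+1F600 (grinning face emoji) becomes the pair D83D DE00:
print([hex(u) for u in to_surrogates(0x1F600)])  # ['0xd83d', '0xde00']
```

Note how the highest code point, U+10FFFF, maps to the pair DBFF DFFF, the last possible surrogate combination: this is exactly why Unicode stops there.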
Indeed, in some views of the UTF‑8 encoding, UTF‑8 may technically permit encoding code points beyond the forever‑fixed upper limit of the valid range; one may encode such a code point, but it will not be a valid code point anywhere. On the other hand, you may encode a character with unneeded zeroed high‑order bits, e.g. encoding an ASCII code point with multiple bytes, as in 2#1100_0001# 2#1000_0001# (using Ada's notation), which would be the ASCII letter A UTF‑8 encoded with two bytes. But such input may be rejected by some safety/security filters, as overlong forms are used for hacking and piracy. RFC 3629 has some explanation about this. One should just stick to encoding valid code points (as defined by Unicode), the safe way (no extraneous bytes).

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 bytes (C0-C1 and F5-FF) that are never used. And multi-byte sequences that are not used such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
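The claimed optimum can be verified by brute force over all (B, C) splits (my own check of the numbers above):

```python
# Maximize N = A + B*T + C*T^2 with A = 128 and T = 256 - (A + B + C),
# by exhaustive search over all feasible (B, C) pairs.
best = max(
    (128 + b * t + c * t * t, b, c)
    for b in range(129)
    for c in range(129 - b)
    for t in [256 - 128 - b - c]
    if t > 0
)
print(best)  # (310803, 0, 43): B = 0, C = 43 supports 310,803 characters
```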
Is there a different approach that could encode more characters?
It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.
I did the math, and it's not possible (if wanting to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.
Sure it's possible. Proof:
2^24 = 16,777,216
So there is enough of a bit-space for 1,114,112 characters but the more crowded the bit-space the more bits are used per character. The whole point of UTF-8 is that it makes the assumption that the lower code points are far more likely in a character stream so the entire thing will be quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then solve this as an equation. Choose an encoding scheme where the first byte determines how many bytes are used. So there are 128 values. Each of these will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128-x) = 1114112 - 128
Solving this, you need 111 values of the first byte for 2-byte characters and the remaining 17 for 3-byte. To check:
128 + 111 * 256 + 17 * 65536 = 1,142,656, which covers all 1,114,112 code points.
To put it another way:
128 code points require 1 byte;
28,416 code points require 2 bytes; and
1,114,112 code points require 3 bytes.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because it's simple bitwise AND tests to determine length and gives an address space of 4,210,816 code points.
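Both variants can be tallied mechanically; a sketch of the arithmetic in this answer:

```python
# Variant 1: 111 lead values for 2-byte forms, 17 for 3-byte forms:
scheme1 = 128 + 111 * 256 + 17 * 65536
# Variant 2: 64 lead values each for 2-byte and 3-byte forms:
scheme2 = 128 + 64 * 256 + 64 * 65536
print(scheme1, scheme1 >= 0x110000)  # 1142656 True
print(scheme2, scheme2 >= 0x110000)  # 4210816 True
```

(Note that, unlike UTF-8, neither variant keeps lead and trail bytes disjoint, so they give up the self-synchronization property discussed earlier.)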