Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character? - unicode

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 bytes (C0-C1 and F5-FF) that are never used. And multi-byte sequences that are not used such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
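A quick brute-force check of that claimed optimum (sketched in Python, purely as an illustration of the formula above):

A = 128
best = (0, 0, 0)
for B in range(0, 129):
    for C in range(0, 129 - B):
        T = 256 - (A + B + C)          # values left over for trailing bytes
        N = A + B * T + C * T * T      # total encodable characters
        if N > best[0]:
            best = (N, B, C)
print(best)   # (310803, 0, 43) -> at most 310,803 characters, ~28% of Unicode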
Is there a different approach that could encode more characters?

It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.

I did the math, and it's not possible (if wanting to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000. (And that is already generous: with UTF-8-style continuation bytes, each new prefix byte would only give 64 * 64 = 4,096 characters.)
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.

Sure it's possible. Proof:
2^24 = 16,777,216
So there is enough of a bit-space for 1,114,112 characters but the more crowded the bit-space the more bits are used per character. The whole point of UTF-8 is that it makes the assumption that the lower code points are far more likely in a character stream so the entire thing will be quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then solve this with an equation. Choose an encoding scheme where the first byte determines how many bytes are used; that leaves 128 possible values for the first byte of a multi-byte sequence. Each of these will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128-x) = 1114112 - 128
Solving this, you need 111 values of the first byte for 2-byte characters and the remaining 17 for 3-byte characters. To check:
128 + 111 * 256 + 17 * 65536 = 1,142,656, which is more than the 1,114,112 code points needed.
To put it another way:
128 code points fit in 1 byte;
another 28,416 code points fit in 2 bytes; and
another 1,114,112 code points fit in 3 bytes,
which comfortably covers the 1,085,568 code points that remain.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because simple bitwise tests on the first byte determine the length, and it gives an address space of 4,210,816 code points.
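A minimal sketch (Python, illustrative only) of that "first byte selects the length" scheme. Note that, unlike UTF-8, the trailing bytes can take any value, so it does not satisfy the question's self-synchronization requirement:

def encode(cp):
    if cp < 0x80:                      # 128 one-byte values
        return bytes([cp])
    cp -= 0x80
    if cp < 64 * 256:                  # 64 lead bytes * 256 trailing values
        return bytes([0x80 + (cp >> 8), cp & 0xFF])
    cp -= 64 * 256
    if cp < 64 * 65536:                # 64 lead bytes * 65,536 trailing pairs
        return bytes([0xC0 + (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("code point out of range (max 4,210,815)")

def decode(b):
    if b[0] < 0x80:
        return b[0]
    if b[0] < 0xC0:
        return 0x80 + (((b[0] - 0x80) << 8) | b[1])
    return 0x80 + 64 * 256 + (((b[0] - 0xC0) << 16) | (b[1] << 8) | b[2])

assert decode(encode(0x10FFFF)) == 0x10FFFF   # all of Unicode fits in 3 bytes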


why cannot UTF-8 encoding of unicode code points fit in 3 bytes

Wikipedia
Unicode comprises 1,114,112 code points in the range 0 to 10FFFF (hexadecimal)
I am a little puzzled that Unicode encoding can take up to 4 bytes. Couldn't 1,114,112 code points comfortably fit in 3 bytes? Maybe I am missing some special situation where 4 bytes are needed; please give a concrete example if there is one.
The Wikipedia article on the history of UTF-8 says that an earlier version of UTF-8 allowed more than 21 bits to be encoded. These encodings took 5 or even 6 bytes.
After it became clear that 2^21 code points would probably be enough for the remaining time of humankind (the same thinking as with 5 bits, 6 bits, 7 bits, 8 bits and 16 bits), the encodings for 5 and for 6 bytes were simply forbidden. All other encoding rules were kept, for backwards compatibility.
As a consequence, the number space for the Unicode code points is now 0..10FFFF, which is even a bit less than 21 bits. Therefore it might be worth checking whether these 21 bits fit into the 24 bits of 3 bytes, instead of the current 4 bytes.
One important property of UTF-8 is that each byte that is part of a multibyte encoding has its highest bit set. To distinguish the leading byte from the trailing bytes, the leading byte has the second-highest bit set, while the trailing bytes have the second-highest bit cleared. This property is what makes the encoding self-synchronizing. Therefore the characters could be encoded like this:
0xxx_xxxx 7 bits freely choosable
110x_xxxx 10xx_xxxx 11 bits freely choosable
1110_xxxx 10xx_xxxx 10xx_xxxx 16 bits freely choosable
These patterns give 2^7 + 2^11 + 2^16 = 67,712 code points in total (roughly 2^16.05), which falls far short of the 1,114,112 code points (roughly 2^20.1) that are needed. Therefore encoding all Unicode code points using at most 3 bytes per character under the current UTF-8 encoding rules is impossible.
You can define another encoding where the highest bit of each byte is the continuation bit:
0xxx_xxxx 7 bits freely chooseable
1xxx_xxxx 0xxx_xxxx 14 bits freely chooseable
1xxx_xxxx 1xxx_xxxx 0xxx_xxxx 21 bits freely chooseable
Now you have enough space to encode all 21-bit code points. But that's an entirely new encoding, so you would have to establish this world-wide. Given the experience from Unicode, it will take about 20 years. Good luck.
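A sketch (Python, illustrative only) of that alternative scheme, where the high bit of each byte is the continuation flag. Note that a trailing byte in the range 0x00-0x7F then appears inside multi-byte characters, unlike in UTF-8:

def encode(cp):                          # up to 2^21 - 1 fits in at most 3 bytes
    out = [cp & 0x7F]                    # final byte: high bit clear
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))   # earlier bytes: high bit set
        cp >>= 7
    return bytes(reversed(out))

def decode(b):
    cp = 0
    for byte in b:
        cp = (cp << 7) | (byte & 0x7F)
    return cp

assert len(encode(0x10FFFF)) == 3 and decode(encode(0x10FFFF)) == 0x10FFFF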
"unicode" is not an encoding. The common encodings for Unicode are UTF-8, UTF-16 and UTF-32. UTF-8 uses 1-, 2-, 3- or 4-byte sequences and is explained below. It is the overhead of the leading/trailing bit sequences that requires 4 bytes for a 21-bit value.
The UTF-8 encoding uses up to 4 bytes to represent Unicode code points using the following bit patterns:
1-byte UTF-8 = 0xxxxxxx (binary) = 7 bits = U+0000 to U+007F
2-byte UTF-8 = 110xxxxx 10xxxxxx (binary) = 11 bits = U+0080 to U+07FF
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx (binary) = 16 bits = U+0800 to U+FFFF
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (binary) = 21 bits = U+10000 to U+10FFFF
The advantage of UTF-8 is that lead bytes and trailing bytes use distinct bit patterns, which allows easy validation of a correct UTF-8 sequence.
Note also it is illegal to use a longer encoding for a Unicode value that fits into a smaller sequence. For example:
1100_0001 1000_0001 (binary), i.e. C1 81 (hex), would encode U+0041, but 0100_0001 (binary, 41 hex) is the shorter sequence.
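A quick check (Python shown here as an illustration) that standard decoders reject the overlong form:

print(b"\x41".decode("utf-8"))       # 'A': the minimal, legal encoding of U+0041
try:
    b"\xc1\x81".decode("utf-8")      # the overlong form C1 81
except UnicodeDecodeError as e:
    print("rejected:", e)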
Ref: https://en.wikipedia.org/wiki/UTF-8
Expanding on my comment:
Unicode is not an encoding. It makes no sense to speak of a size for a Unicode code point. Unicode is a mapping between code points and semantic names (e.g. 'LATIN CAPITAL LETTER A'). You are free to choose your own encoding.
Originally Unicode was meant to be a universal coding that fit into 16 bits (hence the Han unification of Japanese and Chinese characters). As you can see, it failed in that goal. A second, very important goal was to be able to convert to Unicode and back without loss of data (this simplifies migration to Unicode: one tool at a time, at any layer).
So there was a problem: how to expand Unicode to support more than 16 bits without breaking all existing Unicode programs. The idea was to use surrogates, so that programs that only know about 16-bit Unicode (UCS-2) could still work (and by the way, Python 2 and JavaScript know just UCS-2, and they still work fine; a language does not need to know that Unicode code points can have more than 16 bits).
The surrogate mechanism is what fixed the upper limit of today's Unicode (which is why it is not a power of 2).
UTF-8 was designed later. Its design goals were: compatibility with ASCII (for 7-bit characters), encoding all code points (including those above 16 bits), and being able to jump to a random position and quickly synchronize on where a character starts. This last point costs some address space, so the text is not as dense as it could be, but it is much more practical (and quick to "scroll" through files). This extra synchronization overhead is what makes it impossible to encode all Unicode code points in 3 bytes with UTF-8.
You could use a UTF-24 (see the comment), but then you lose UTF-8's advantage of being compatible with ASCII, and also the advantage that with UTF-16 you often need just 2 bytes (and not 4).
Remember: the Unicode code points above 16 bits are the rarest: ancient scripts, better (semantic) representations of existing glyphs, or new emoji (and hopefully nobody fills an entire long text with nothing but emoji). So a 3-byte encoding is not (yet) necessary. Maybe if aliens come to Earth and we have to write with their new language's characters, we will mostly use Unicode code points above 16 bits. Not something I think will happen soon.

Are there any UTF-8 code units that have byte 60 or 62 (`<` and `>`) as not the first byte of their binary representation?

I need to debug an XML parser and I am wondering if I can construct "malicious" input that will cause it to not recognize opening and closing tags correctly.
Additionally, where can I find this sort of information in general? After this I will also want to be sure that the parser I am working with won't have trouble with other special characters such as &, =, ", etc.
UTF-8 makes it very easy to figure out what the role of a code unit (i.e. a byte) is:
If the highest bit is not set, i.e. the code unit is 0xxxxxxx, then this byte expresses an entire code point, whose value is xxxxxxx (i.e. 7 bits of information).
If the highest bit is set and the code unit is 10xxxxxx, then it is a continuation part of a multibyte sequence, carrying six bits of information.
Otherwise, the code unit is the initial byte of a multibyte sequence, as follows:
110xxxxx: Two bytes (one continuation byte), for 5 + 6 = 11 bits.
1110xxxx: Three bytes (two continuation bytes), for 4 + 6 + 6 = 16 bits.
11110xxx: Four bytes (three continuation bytes), for 3 + 6 + 6 + 6 = 21 bits.
As you can see, a value 60, which is 00111100, is a single-byte codepoint of value 60, and the same byte cannot occur as part of any multibyte sequence.
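A small helper (Python, for illustration) implementing the classification just described:

def utf8_byte_role(b):
    if b < 0x80: return "single-byte code point"        # 0xxxxxxx
    if b < 0xC0: return "continuation byte"             # 10xxxxxx
    if b < 0xE0: return "lead byte of 2-byte sequence"  # 110xxxxx
    if b < 0xF0: return "lead byte of 3-byte sequence"  # 1110xxxx
    return "lead byte of 4-byte sequence"                # 11110xxx (valid lead bytes stop at 0xF4)

print(utf8_byte_role(60))   # single-byte code point: '<' can never hide inside another character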
The scheme can actually be extended up to seven bytes, encoding up to 36 bits, but since Unicode only requires 21 bits, four bytes suffice. The standard mandates that a code point must be represented with the minimal number of code units.
Update: As @Mark Tolonen rightly points out, you should check carefully whether each encoded code point is actually encoded with the minimal number of code units. If a browser were to inadvertently accept such input, a user could sneak something past you that you would not spot in a byte-for-byte analysis. As a starting point you could look for bytes like 10111100, but you would have to check the entire multibyte sequence of which it is a part (since it can of course occur legitimately as part of different code points). Ultimately, if you can't trust the browser, you don't really get around decoding everything and checking the resulting code point sequence for occurrences of U+003C etc., rather than looking at the byte stream.
In UTF-8, no. In other encodings, yes.
In UTF-8, by design, all bytes of a multibyte character will always have the highest bit set. Vice versa, a byte that doesn't have the highest bit set is always an ASCII character.
However, this is not true for other encodings, which are also valid for XML.
For more information about UTF-8, check e.g wikipedia
A poorly-designed UTF-8 decoder could interpret the bytes C0 BC and C0 BE as U+003C and U+003E. As @KerrekSB stated in his answer:
The standard mandates that a code point must be represented with the minimal number of code units.
But a poor algorithm might still decode a malformed two-byte UTF-8 sequence that is not the minimal number of code units:
C0 BC = 11000000 10111100 → 000 0011 1100 = 3C hex = 60 dec = '<'
So in your testing be sure to include malformed UTF-8 sequences and verify that they are rejected.
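For instance, a sketch (Python, illustrative) of the sloppy decoding described above: a naive two-byte decoder without an overlong check turns C0 BC into '<', while a strict decoder rejects it:

def naive_decode_2byte(b):
    return chr(((b[0] & 0x1F) << 6) | (b[1] & 0x3F))   # no overlong check!

print(naive_decode_2byte(b"\xc0\xbc"))   # '<' -- the false positive
try:
    b"\xc0\xbc".decode("utf-8")
except UnicodeDecodeError:
    print("strict decoder rejects the overlong sequence")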

If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?

On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings UTF-8 is an 8-bit encoding.
So, what's the truth?
If it's an 8-bit encoding, then what's the difference between ASCII and UTF-8?
If it's not, then why is it called UTF-8 and why do we need UTF-16 and others if they occupy the same memory?
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky - Wednesday, October 08, 2003
Excerpt from above:
Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
UTF-8 is an 8-bit, variable-width encoding. The first 128 characters of Unicode, when represented in UTF-8, have exactly the same byte representation as the corresponding ASCII characters.
To understand this further, Unicode treats characters as code points - a mere number that can be represented in multiple ways (the encodings). UTF-8 is one such encoding. It is the most commonly used, as it gives the best space-consumption characteristics among all encodings for typical text. If you are storing characters from the ASCII character set in UTF-8, the encoded data takes the same amount of space. This allowed applications that previously used ASCII to move to Unicode seamlessly (well, not quite, but it certainly didn't result in something like Y2K), because the character representations are the same.
I'll leave this extract here from RFC 3629, on how the UTF-8 encoding would work:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You'll notice why the encoding will result in characters occupying anywhere between 1 and 4 bytes (the right-hand column) for different ranges of characters in Unicode (the left-hand column).
UTF-16, UTF-32, UCS-2, etc. employ different encoding schemes, where the code points are represented as 16-bit or 32-bit code units instead of the 8-bit code units that UTF-8 uses.
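A quick demonstration (Python used here purely as an illustration) of the 1- to 4-byte ranges in the table above:

for ch in "A", "é", "€", "🚀":          # U+0041, U+00E9, U+20AC, U+1F680
    b = ch.encode("utf-8")
    print(ch, f"U+{ord(ch):04X}", b.hex(" "), len(b), "byte(s)")
# A  U+0041  41           1 byte(s)
# é  U+00E9  c3 a9        2 byte(s)
# €  U+20AC  e2 82 ac     3 byte(s)
# 🚀 U+1F680 f0 9f 9a 80  4 byte(s)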
The '8-bit' part means that the individual code units of the encoding are 8 bits wide. In contrast, pure ASCII is a 7-bit encoding, as it only has code points 0-127. It used to be that software had problems with 8-bit encodings; one of the reasons for the Base-64 and uuencode encodings was to get binary data through email systems that did not handle 8-bit data. However, it has been a decade or more since that was an acceptable excuse; software has had to be 8-bit clean, or capable of handling 8-bit encodings.
Unicode itself is a 21-bit character set. There are a number of encodings for it:
UTF-32 where each Unicode code point is stored in a 32-bit integer
UTF-16 where many Unicode code points are stored in a single 16-bit integer, but some need two 16-bit integers (so it needs 2 or 4 bytes per Unicode code point).
UTF-8 where Unicode code points can require 1, 2, 3 or 4 bytes to store a single Unicode code point.
So, "UTF-8 can be represented by 1-4 bytes" is probably not the most appropriate way of phrasing it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.
Just complementing the other answers about UTF-8 encoding, which uses 1 to 4 bytes per code point.
As people said above, a 4-byte code totals 32 bits, but of these 32 bits, 11 bits are used as prefixes in the control bytes, i.e. to identify the size (1 to 4 bytes) of the encoded Unicode symbol and to make it possible to resynchronize easily even when starting in the middle of the text.
The gold question is: why do we need so many bits (11) for control in a 32-bit code? Wouldn't it be useful to have more than 21 bits for the code point itself?
The point is that the scheme needs to be designed so that it is easy to get back to the first byte of a code.
Thus, bytes other than the first byte cannot have all of their bits available to encode the Unicode symbol, because otherwise they could easily be confused with the first byte of a valid UTF-8 code.
So the model is:
0UUUUUUU for a 1-byte code. We have 7 Us, so there are 2^7 = 128 possibilities: the traditional ASCII codes.
110UUUUU 10UUUUUU for a 2-byte code. Here we have 11 Us, so there are 2^11 = 2,048 values, of which 2,048 - 128 = 1,920 are new possibilities (the first 128 values are already covered by the 1-byte legacy ASCII codes).
1110UUUU 10UUUUUU 10UUUUUU for a 3-byte code. Here we have 16 Us, so there are 2^16 = 65,536 values, of which 65,536 - 2,048 = 63,488 are new possibilities.
11110UUU 10UUUUUU 10UUUUUU 10UUUUUU for a 4-byte code. Here we have 21 Us, so there are 2^21 = 2,097,152 values, of which 2,097,152 - 65,536 = 2,031,616 are new possibilities, where U is a bit (0 or 1) used to encode a Unicode symbol in UTF-8.
So the total is 128 + 1,920 + 63,488 + 2,031,616 = 2,097,152 encodable symbols.
In the Unicode tables available (for example, in the Unicode Pad app for Android, or here), the Unicode code appears in the form U+H, where H is a hex number of 1 to 6 digits. For example, U+1F680 represents a rocket icon: 🚀.
The code point value is simply the U bits of the encoding (21 bits for 4 bytes, 16 for 3 bytes, 11 for 2 bytes, 7 for 1 byte) read as a single binary number, grouped into hex digits, with the incomplete leading group padded with 0s.
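Working through the rocket example (Python, purely illustrative): U+1F680 needs 17 bits, so it takes the 4-byte pattern 11110UUU 10UUUUUU 10UUUUUU 10UUUUUU.

cp = 0x1F680
b = bytes([0xF0 | (cp >> 18),           # 11110UUU
           0x80 | ((cp >> 12) & 0x3F),  # 10UUUUUU
           0x80 | ((cp >> 6) & 0x3F),   # 10UUUUUU
           0x80 | (cp & 0x3F)])         # 10UUUUUU
print(b.hex(" "))                       # f0 9f 9a 80
assert b == "🚀".encode("utf-8")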
Below is an attempt to explain why 11 control bits are needed. Some of the choices made were essentially arbitrary conventions (a choice between 0 and 1 with no deeper rationale).
Since 0 is used to indicate a 1-byte code, 0xxxxxxx is always equivalent to the ASCII code of one of the 128 characters (backwards compatibility).
For symbols that use more than 1 byte, the 10 at the start of the 2nd, 3rd and 4th bytes always tells us that we are in the middle of a code.
To avoid confusion: if the first byte starts with 11, it indicates that this byte begins a Unicode character with a 2-, 3- or 4-byte code. On the other hand, 10 marks a middle byte, i.e. one that never starts the encoding of a Unicode symbol. (Obviously the prefix for continuation bytes could not be just 1, because 0... and 1... would exhaust all possible bytes.)
If there were no rules for non-initial bytes, the encoding would be very ambiguous.
With this choice, we know that an initial byte starts with 0 or 11, which can never be confused with a middle byte, which starts with 10. Just by looking at a byte we already know whether it is an ASCII character, the beginning of a 2-, 3- or 4-byte sequence, or a byte from the middle of such a sequence.
It could have been the opposite choice: the prefix 11 could have marked the middle bytes and the prefix 10 the start byte of a 2-, 3- or 4-byte code. That choice is just a matter of convention.
Also by convention, a third bit of 0 in the first byte means a 2-byte UTF-8 code, and a third bit of 1 means a 3- or 4-byte UTF-8 code (again, the prefix 11 alone could not be adopted for 2-byte symbols; that would also exhaust all possible bytes: 0..., 10... and 11...).
So a 4th bit is required in the first byte to distinguish 3-byte from 4-byte UTF-8 codes.
A 4th bit of 0 means a 3-byte code and 1 means a 4-byte code, which then still uses an additional 0 bit that at first sight seems unnecessary.
One of the reasons, beyond the pretty symmetry (0 is always the last prefix bit in the starting byte), for having this additional 0 as the 5th bit of the first byte of a 4-byte symbol is to make an unknown string easier to recognize as UTF-8: no byte may fall in the range 11111000 to 11111111 (F8 to FF, or 248 to 255).
If, hypothetically, we used 22 bits (taking the last 0 of the 5 prefix bits in the first byte as part of the character code for 4-byte characters), there would be 2^22 = 4,194,304 possibilities in total (22 because there would be 4 + 6 + 6 + 6 = 22 bits left for the symbol and 4 + 2 + 2 + 2 = 10 bits as prefixes).
With the adopted UTF-8 coding system (the 5th bit fixed at 0 for 4-byte codes), there are 2^21 = 2,097,152 possibilities, of which only 1,112,064 are valid Unicode scalar values (21 because 3 + 6 + 6 + 6 = 21 bits are left for the symbol and 5 + 2 + 2 + 2 = 11 bits form the prefixes).
As we have seen, not even all the possibilities offered by 21 bits are used (only 1,112,064 of 2,097,152), so gaining one more bit would bring no tangible benefit.
Another reason is the possibility of using the unused byte values for control functions outside the Unicode world.

How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode with explanation. I know a char can be encoded as 1, 2, 3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though the starting byte of that character already makes clear how long it should be.
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,929 code points are actually assigned in Unicode 12.1.
I also don't understand why continuation bytes have restrictions even though the starting byte of that character already makes clear how long it should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.
In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
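The false-positive effect described above can be demonstrated like this (Python used for illustration; the GB 18030 byte sequence is the one quoted above):

eszett = "ß"
gb = eszett.encode("gb18030")     # b'\x81\x30\x89\x38' -- contains 0x38, the digit '8'
utf8 = eszett.encode("utf-8")     # b'\xc3\x9f'         -- no ASCII byte values at all
print(b"8" in gb)                 # True  -> a naive byte search finds a bogus '8' inside ß
print(b"8" in utf8)               # False -> cannot happen in UTF-8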
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.
The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for the restrictions on the continuation bytes is presumably so that it is easy to find the beginning of the next character (continuation bytes are always of the form 10xxxxxx, and a starting byte can never be of this form).
Unicode supports 1,114,112 code points. There are 2,048 surrogate code points, giving 1,112,064 scalar values. Of these, there are 66 non-characters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).
To give a metaphorically accurate answer, all of them.
Continuation bytes in the UTF-8 encoding allow for resynchronization of the encoded octet stream in the face of "line noise". A decoder merely needs to scan forward for a byte that does not have a value between 0x80 and 0xBF to find the start of the next code point.
In theory, the original UTF-8 design allows for expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where the maximal-length tweet can encode up to 4,340 bits' worth of data. (140 characters [valid and invalid], times 31 bits each.)
According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.
Unicode has 110000 code points in hexadecimal, which is 1,114,112 in decimal.

UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32?
I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?
UTF-8 has an advantage in the case where ASCII characters represent the majority of characters in a block of text, because UTF-8 encodes these into 8 bits (like ASCII). It is also advantageous in that a UTF-8 file containing only ASCII characters has the same encoding as an ASCII file.
UTF-16 is better where ASCII is not predominant, since it uses 2 bytes per character, primarily. UTF-8 will start to use 3 or more bytes for the higher order characters where UTF-16 remains at just 2 bytes for most characters.
UTF-32 will cover all possible characters in 4 bytes. This makes it pretty bloated. I can't think of any advantage to using it.
In short:
UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.
In long: see Wikipedia: UTF-8, UTF-16, and UTF-32.
UTF-8 is variable 1 to 4 bytes.
UTF-16 is variable 2 or 4 bytes.
UTF-32 is fixed 4 bytes.
Unicode defines a single huge character set, assigning one unique integer value to every graphical symbol (that is a major simplification, and isn't actually true, but it's close enough for the purposes of this question). UTF-8/16/32 are simply different ways to encode this.
In brief, UTF-32 uses 32-bit values for each character. That allows them to use a fixed-width code for every character.
UTF-16 uses 16-bit by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values.
And UTF-8 uses 8-bit values by default, which means that the first 128 values (0-127) are fixed-width, single-byte characters (a set most significant bit signals that the byte belongs to a multi-byte sequence, leaving 7 bits for the actual character value in the single-byte case). All other characters are encoded as sequences of up to 4 bytes (if memory serves).
And that leads us to the advantages. Any ASCII character is directly compatible with UTF-8, so for upgrading legacy apps, UTF-8 is a common and obvious choice. In almost all cases, it will also use the least memory. On the other hand, you can't make any guarantees about the width of a character. It may be 1, 2, 3 or 4 bytes wide, which makes string manipulation difficult.
UTF-32 is opposite, it uses the most memory (each character is a fixed 4 bytes wide), but on the other hand, you know that every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from the length in bytes of the string. You can't do that with UTF-8.
UTF-16 is a compromise. It lets most characters fit into a fixed-width 16-bit value. So as long as you don't have emoji, musical notation, rare CJK ideographs or other characters outside the Basic Multilingual Plane, you can assume that each character is 16 bits wide. It uses less memory than UTF-32. But it is in some ways "the worst of both worlds". It almost always uses more memory than UTF-8, and it still doesn't avoid the problem that plagues UTF-8 (variable-length characters).
Finally, it's often helpful to just go with what the platform supports. Windows uses UTF-16 internally, so on Windows, that is the obvious choice.
Linux varies a bit, but they generally use UTF-8 for everything that is Unicode-compliant.
So short answer: All three encodings can encode the same character set, but they represent each character as different byte sequences.
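A small comparison (Python, purely illustrative) of how the same text sizes up in each encoding, which matches the trade-offs described above:

for text in "Hello", "語言處理":         # ASCII text vs. BMP CJK text
    print(text,
          len(text.encode("utf-8")),     # Hello: 5   語言處理: 12
          len(text.encode("utf-16-le")), # Hello: 10  語言處理: 8
          len(text.encode("utf-32-le"))) # Hello: 20  語言處理: 16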
Unicode is a standard, and you can think of each UTF-x as a technical implementation of it for particular practical purposes:
UTF-8 - "size optimized": best suited for Latin-based (or ASCII) data; it takes only 1 byte per character for such text, but the size grows with the variety of symbols used (in the worst case a character takes 4 bytes; the original design allowed up to 6).
UTF-16 - "balance": it takes a minimum of 2 bytes per character, which is enough for the existing set of mainstream languages, and its mostly fixed size eases character handling (but the size is still variable and can grow to 4 bytes per character).
UTF-32 - "performance": allows the use of simple algorithms as a result of fixed-size characters (4 bytes), at the cost of memory.
I tried to give a simple explanation in my blogpost.
UTF-32
requires 32 bits (4 bytes) to encode any character. For example, in order to represent the "A" character code point using this scheme, you'll need to write 65 as a 32-bit binary number:
00000000 00000000 00000000 01000001 (Big Endian)
If you take a closer look, you'll note that the right-most seven bits are actually the same bits as when using the ASCII scheme. But since UTF-32 is a fixed-width scheme, we must attach three additional bytes. Meaning that if we have two files that only contain the "A" character, one ASCII-encoded and the other UTF-32-encoded, their sizes will be 1 byte and 4 bytes respectively.
UTF-16
Many people think that as UTF-32 uses fixed width 32 bit to represent a code-point, UTF-16 is fixed width 16 bits. WRONG!
In UTF-16 a code point may be represented in either 16 bits or 32 bits, so this scheme is a variable-length encoding system. What is the advantage over UTF-32? At least for ASCII, the size of files won't be 4 times the original (but still twice), so we're still not ASCII backward compatible.
Since 7 bits are enough to represent the "A" character, we can now use 2 bytes instead of the 4 that UTF-32 uses. It'll look like:
00000000 01000001
UTF-8
You guessed right. In UTF-8 a code point may be represented using either 8, 16, 24 or 32 bits, and like UTF-16, this is also a variable-length encoding system.
Finally we can represent "A" in the same way we represent it using ASCII encoding system:
01000001
A small example where UTF-16 is actually better than UTF-8:
Consider the Chinese letter "語" - its UTF-8 encoding is:
11101000 10101010 10011110
While its UTF-16 encoding is shorter:
10001010 10011110
In order to understand the representation and how it's interpreted, visit the original post.
UTF-8
has no concept of byte-order
uses between 1 and 4 bytes per character
ASCII is a compatible subset of encoding
completely self-synchronizing, e.g. a dropped byte anywhere in a stream will corrupt at most a single character
pretty much all European languages are encoded in two bytes or less per character
UTF-16
must be parsed with known byte-order or reading a byte-order-mark (BOM)
uses either 2 or 4 bytes per character
UTF-32
every character is 4 bytes
must be parsed with known byte-order or reading a byte-order-mark (BOM)
UTF-8 is going to be the most space efficient unless a majority of the characters are from the CJK (Chinese, Japanese, and Korean) character space.
UTF-32 is best for random access by character offset into a byte-array.
I made some tests to compare database performance between UTF-8 and UTF-16 in MySQL.
(Charts comparing update, insert and delete speeds for UTF-8 vs UTF-16 appeared here.)
In UTF-32 all characters are coded with 32 bits. The advantage is that you can easily calculate the length of the string. The disadvantage is that for each ASCII character you waste an extra three bytes.
In UTF-8, characters have variable length: ASCII characters are coded in one byte (eight bits), most Western special characters are coded in two or three bytes (for example, € is three bytes), and more exotic characters can take up to four bytes. A clear disadvantage is that you cannot calculate a string's length in characters up front, but it takes far fewer bytes to encode Latin (English) alphabet text than UTF-32 does.
UTF-16 is also variable length. Characters are coded in either two or four bytes. I really don't see the point: it has the disadvantage of being variable length, but not the advantage of saving as much space as UTF-8.
Of those three, clearly UTF-8 is the most widely spread.
I'm surprised this question is 11 years old and not one of the answers mentions the #1 advantage of UTF-8.
UTF-8 generally works even with programs that are not UTF-8 aware. That's partly what it was designed for. Other answers mention that the first 128 code points are the same as ASCII. All other code points are encoded as sequences of 8-bit values with the high bit set (values from 128 to 255), so from the point of view of a non-Unicode-aware program, strings just look like ASCII with some extra characters.
As an example, let's say you wrote a program to add line numbers that effectively does this (shown here as a runnable Python sketch; to keep it simple, assume end of line is just ASCII 13):

import sys

def read_line(f):
    # read bytes (8-bit values) into a buffer until we hit 13 (CR) or end of file
    buf = bytearray()
    while True:
        b = f.read(1)
        if not b:
            return bytes(buf) if buf else None
        if b[0] == 13:
            return bytes(buf)
        buf += b

line_no = 1
with open(sys.argv[1], "rb") as f:
    while True:
        s = read_line(f)
        if s is None:
            break
        sys.stdout.buffer.write(str(line_no).encode() + b" " + s + b"\n")
        line_no += 1
Passing a UTF-8 file to this program will continue to work. Similarly, splitting on tabs, commas, parsing for ASCII quotes, or other parsing for which only ASCII values are significant all just work with UTF-8, because ASCII byte values never appear in UTF-8 output except when they actually mean those ASCII characters.
Some other answers or comments mention that UTF-32 has the advantage that you can treat each code point separately. This would suggest, for example, that you could take a string like "ABCDEFGHI" and split it at every 3rd code point to make
ABC
DEF
GHI
This is false. Many code points affect other code points. For example, the skin-tone modifier code points that let you choose between 👨🏻‍🦳👨🏼‍🦳👨🏽‍🦳👨🏾‍🦳👨🏿‍🦳. If you split at an arbitrary code point you will break those.
Another example is the bidirectional code points. The following paragraph was not entered backward. It is just preceded by the 0x202E codepoint
‮This line is not typed backward it is only displayed backward
So no, utf-32 will not let you just randomly manipulate unicode strings without a thought to their meanings. It will let you look at each codepoint with no extra code.
FYI though, UTF-8 was designed so that, looking at any individual byte, you can find the start of the current code point or of the next one.
Take an arbitrary byte in UTF-8 data. If it is < 128, it is a complete code point by itself. If it is >= 128 and < 192 (the top 2 bits are 10), then to find the start of the code point you need to look at the preceding bytes until you find one with a value >= 192 (the top 2 bits are 11). That byte is the start of the code point, and it encodes how many subsequent bytes make up the code point.
If you want to find the next code point, just scan forward until you reach a byte that is < 128 or >= 192; that is the start of the next code point.
Num bytes | 1st code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4
----------+----------------+-----------------+----------+----------+----------+---------
        1 | U+0000         | U+007F          | 0xxxxxxx |          |          |
        2 | U+0080         | U+07FF          | 110xxxxx | 10xxxxxx |          |
        3 | U+0800         | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |
        4 | U+10000        | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
Where xxxxxx are the bits of the code point. Concatenate the xxxx bits from the bytes to get the code point
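The boundary-scanning idea described above can be written down as a small sketch (Python, illustrative only; assumes well-formed UTF-8):

def codepoint_start(data, i):
    # index of the first byte of the code point containing byte i
    while (data[i] & 0xC0) == 0x80:   # 10xxxxxx = continuation byte, keep stepping back
        i -= 1
    return i

def next_codepoint_start(data, i):
    # index of the first byte of the following code point
    i += 1
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

s = "héllo🚀".encode("utf-8")
print(codepoint_start(s, 2))        # byte 2 is the 2nd byte of 'é' -> its code point starts at 1
print(next_codepoint_start(s, 1))   # 'é' starts at 1 and is 2 bytes long -> next starts at 3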
Depending on your development environment you may not even have the choice what encoding your string data type will use internally.
But for storing and exchanging data I would always use UTF-8, if you have the choice. If you have mostly ASCII data this will give you the smallest amount of data to transfer, while still being able to encode everything. Optimizing for the least I/O is the way to go on modern machines.
As mentioned, the difference is primarily the size of the underlying variables, which in each case get larger to allow more characters to be represented.
However, fonts, encoding and things are wickedly complicated (unnecessarily?), so a big link is needed to fill in more detail:
http://www.cs.tut.fi/~jkorpela/chars.html#ascii
Don't expect to understand it all, but if you don't want to have problems later it's worth learning as much as you can, as early as you can (or just getting someone else to sort it out for you).
Paul.
After reading through the answers, UTF-32 needs some loving.
C#:
using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Text;

byte[] Data1 = RandomNumberGenerator.GetBytes(500_000_000);
var sw = Stopwatch.StartNew();
int l = Encoding.UTF8.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-8: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.Unicode.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"Unicode: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.UTF32.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-32: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.ASCII.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"ASCII: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
UTF-8 -- Elapsed 9.939s - Size 473,752,800
Unicode -- Elapsed 0.853s - Size 250,000,000
UTF-32 -- Elapsed 3.143s - Size 125,030,570
ASCII -- Elapsed 2.362s - Size 500,000,000
UTF-32 -- MIC DROP
In short, the only reason to use UTF-16 or UTF-32 is to support non-English and ancient scripts respectively.
I was wondering why anyone would choose a non-UTF-8 encoding when it is obviously more efficient for web/programming purposes.
A common misconception: the suffixed number is NOT an indication of capability. They all support the complete Unicode range; it's just that UTF-8 can handle ASCII with a single byte, so it is MORE efficient/less corruptible to the CPU and over the internet.
Some good reading: http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/10/which_utf_do_i_use.html
and http://utf8everywhere.org