UTF8, codepoints, and their representation in Erlang and Elixir - unicode

going through Elixir's handling of unicode:
iex> String.codepoints("abc§")
["a", "b", "c", "§"]
Very good, and byte_size/1 of this is not 4 but 5, because the last char takes 2 bytes. I get that.
The ? operator (or is it a macro? can't find the answer) tells me that
iex(69)> ?§
167
Great; so then I look into the UTF-8 encoding table, and see value c2 a7 as hex encoding for the char. That means the two bytes (as witnessed by byte_size/1) are c2 (194 in decimal) and a7 (167 in decimal). That 167 is the result I got when evaluating ?§ earlier. What I don't understand, exactly, is why that number is a "code point", as per the description of the ? operator. When I try to work backwards, and evaluate the binary, I get what I want:
iex(72)> <<0xc2, 0xa7>>
"§"
And to make me go completely bananas, this is what I get in Erlang shell:
24> <<167>>.
<<"§">>
25> <<"\x{a7}">>.
<<"§">>
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>>
27> <<"\x{c2a7}">>.
<<"§">>
!! while Elixir is only happy with the code above... what is it that I don't understand? Why is Erlang perfectly happy with a single byte, given that Elixir insists that char takes 2 bytes - and Unicode table seems to agree?

The codepoint is what identifies the Unicode character. The codepoint for § is 167 (0xA7). A codepoint can be represented in bytes in different ways, depending on your encoding of choice.
The confusion here comes from the fact that the codepoint 167 (0xA7) is identified by the bytes 0xC2 0xA7 when encoded to UTF-8.
When you add Erlang to the conversation, you have to remember that Erlang's default encoding was/is latin1 (there is an effort to migrate to UTF-8, but I am not sure whether it has made it to the shell; someone please correct me).
In latin1, the codepoint § (0xA7) is also represented by the byte 0xA7. So explaining your results directly:
24> <<167>>.
<<"§">> %% this is encoded in latin1
25> <<"\x{a7}">>.
<<"§">> %% still latin1
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>> %% this is encoded in utf8, as the /utf8 modifier says
27> <<"\x{c2a7}">>.
<<"§">> %% this is latin1
The last one is quite interesting and potentially confusing. In Erlang binaries, if you pass an integer with value more than 255, it is truncated. So the last example is effectively doing <<49831>> which when truncated becomes <<167>>, which is again equivalent to <<"§">> in latin1.
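To make the difference between the code point and its bytes concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the Erlang or Elixir runtime, and the variable names are made up). It shows the same code point encoded as UTF-8 and as Latin-1, plus the truncation to the low 8 bits described above.

```python
# The code point of the section sign is an abstract number;
# an encoding turns it into concrete bytes.
section = "\u00a7"
code_point = ord(section)                 # 167 == 0xA7

utf8_bytes = section.encode("utf-8")      # two bytes: C2 A7
latin1_bytes = section.encode("latin-1")  # one byte: A7

# An 8-bit segment keeps only the low 8 bits of an integer, which is
# what happens to 0xC2A7 in the last Erlang shell example.
truncated = 0xC2A7 & 0xFF

print(code_point, utf8_bytes.hex(), latin1_bytes.hex(), hex(truncated))
```

The Latin-1 byte and the truncated value are both 0xA7, which is why the Erlang shell prints the same character for lines 24, 25 and 27.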

The code point is a number assigned to the character. It's an abstract value, not dependent on any particular representation in actual memory somewhere.
In order to store the character, you have to convert the code point to some sequence of bytes. There are several different ways to do this; each is called a Unicode Transformation Format, and named UTF-n, where the n is the number of bits in the basic unit of encoding. There used to be a UTF-7, used where 7-bit ASCII was assumed and even the 8th bit of a byte couldn't be reliably transmitted; in modern systems, there are UTF-8, UTF-16, and UTF-32.
Since the largest code point value fits comfortably in 21 bits, UTF-32 is the simplest; you just store the code point as a 32-bit integer. (There could theoretically be a UTF-24 or even a UTF-21, but common modern computing platforms deal naturally with values that take up either exactly 8 or a multiple of 16 bits, and have to work harder to deal with anything else.)
So UTF-32 is simple, but inefficient. Not only does it have 11 extra bits that will never be needed, it has 5 bits that are almost never needed. Far and away most Unicode characters found in the wild are in the Basic Multilingual Plane, U+0000 through U+FFFF. UTF-16 lets you represent all of those code points as a plain integer, taking up half the space of UTF-32. But it can't represent anything from U+10000 on up that way, so part of the 0000-FFFF range is reserved as "surrogate pairs" that can be put together to represent a high-plane Unicode character with two 16-bit units, for a total of 32 bits again but only when needed.
Java uses UTF-16 internally, but Erlang (and therefore Elixir), along with most other programming systems, uses UTF-8. UTF-8 has the advantage of completely transparent compatibility with ASCII - all characters in the ASCII range (U+0000 through U+007F, or 0-127 decimal) are represented by single bytes with the corresponding value. But any characters with code points outside the ASCII range require more than one byte each - even those in the range U+0080 through U+00FF, decimal 128 through 255, which only take up one byte in the Latin-1 encoding that used to be the default before Unicode.
So with Elixir/Erlang "binaries", unless you go out of your way to encode things differently, you are using UTF-8. If you look at the high bit of the first byte of a UTF-8 character, it's either 0, meaning you have a one-byte ASCII character, or it's 1. If it's 1, then the second-highest bit is also 1, because the number of consecutive 1-bits counting down from the high bit before you get to a 0 bit tells you how many bytes total the character takes up. So the pattern 110xxxxx means the character is two bytes, 1110xxxx means three bytes, and 11110xxx means four bytes. (There is no legal UTF-8 character that requires more than four bytes, although the encoding could theoretically support up to seven.)
The rest of the bytes all have the two high bits set to 10, so they can't be mistaken for the start of a character. And the rest of the bits are the code point itself.
To use your case as an example, the code point for "§" is U+00A7 - that is, hexadecimal A7, which is decimal 167 or binary 10100111. Since that's greater than decimal 127, it will require two bytes in UTF-8. Those two bytes will have the binary form 110abcde 10fghijk, where the bits abcdefghijk will hold the code point. So the binary representation of the code point, 10100111, is padded out to 00010100111 and split into the sequences 00010, which replaces abcde in the UTF-8 template, and 100111, which replaces fghijk. That yields two bytes with binary values 11000010 and 10100111, which are C2 and A7 in hexadecimal, or 194 and 167 in decimal.
You'll notice that the second byte has the same value as the code point you're encoding, but it's important to realize that this correspondence is just a coincidence. There are a total of 64 code points, from 128 (U+0080) through 191 (U+00BF), that work out that way: their UTF-8 encoding consists of a byte with decimal value 194 followed by a byte whose value is equal to the code point itself. But for the other 1,114,048 code points possible in Unicode, that is not the case.
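The bit templates above can be sketched as a small hand-written encoder. This is Python for illustration only, and utf8_encode is a hypothetical helper, not anything from Erlang or Elixir; in practice you would use the language's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the UTF-8 bit templates."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0xA7)
print([format(b, "08b") for b in encoded])

# Two-byte code points whose second UTF-8 byte equals the code point itself.
coincidences = [cp for cp in range(0x80, 0x800) if utf8_encode(cp)[1] == cp]
print(len(coincidences), hex(coincidences[0]), hex(coincidences[-1]))
```

The list of coincidences covers exactly U+0080 through U+00BF, which matches the count of 64 given above.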

Related

What's the difference between ASCII and Unicode?

What's the exact difference between Unicode and ASCII?
ASCII has a total of 128 characters (256 in the extended set).
Is there any size specification for Unicode characters?
ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 2^21 characters, which, similarly, map to numbers 0–2^21 (though not all numbers are currently assigned, and some are reserved).
Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".
Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.
Understanding why ASCII and Unicode were created in the first place helped me understand the differences between the two.
ASCII, Origins
As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*. Which means that we can represent 128 characters maximum.
Wait, 7 bits? But why not 1 byte (8 bits)?
The last bit (8th) is used for avoiding errors as parity bit.
This was relevant years ago.
Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.
See below the binary representation of a few characters in ASCII:
0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)
See the full ASCII table over here.
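If you want to reproduce the 7-bit patterns above yourself, here is a minimal sketch in Python (used only for illustration; ascii_bits is a made-up name):

```python
# Map a few ASCII characters to their 7-bit binary patterns.
ascii_bits = {ch: format(ord(ch), "07b") for ch in "%ABC\r"}

for ch, bits in ascii_bits.items():
    print(bits, "->", repr(ch), ord(ch))
```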
ASCII was meant for English only.
What? Why English only? So many languages out there!
Because the center of the computer industry was in the USA at that
time. As a consequence, they didn't need to support accents or other
marks such as á, ü, ç, ñ, etc. (aka diacritics).
ASCII Extended
Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support "é", in French, for example). Just using one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256 characters). And not 2^7 as before (128).
10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)
The name for this "ASCII extended to 8 bits and not 7 bits as before" could be just referred as "extended ASCII" or "8-bit ASCII".
As #Tom pointed out in his comment below there is no such thing as "extended ASCII" yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example, the ISO 8859-1, also called ISO Latin-1.
Unicode, The Rise
ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the likes?
We would have needed an entirely new character set... that's the rationale behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic amount of characters (see this table).
You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.
Encodings: UTF-8 vs UTF-16 vs UTF-32
This answer does a pretty good job at explaining the basics:
UTF-8 and UTF-16 are variable length encodings.
In UTF-8, a character may occupy a minimum of 8 bits.
In UTF-16, a character length starts with 16 bits.
UTF-32 is a fixed length encoding of 32 bits.
UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.
Mnemonics:
UTF-8: minimum 8 bits.
UTF-16: minimum 16 bits.
UTF-32: minimum and maximum 32 bits.
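A quick way to check these claims is to encode a character and count the bytes. The sketch below uses Python purely as an illustration; the "-le" codec variants are picked so that no byte order mark is added to the output.

```python
text = "plain ASCII"

# ASCII text produces exactly the same bytes when encoded as UTF-8.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# Smallest encoded size of a single character in each encoding, in bytes.
sizes = {enc: len("A".encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(same_bytes, sizes)
```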
Note:
Why 2^7?
This is obvious for some, but just in case. We have seven slots available filled with either 0 or 1 (Binary Code).
Each can have two combinations. If we have seven spots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.
Source: Wikipedia, this great blog post and Mocki.co where I initially posted this summary.
ASCII has 128 code points, 0 through 127. It fits in a single 8-bit byte, and the values 128 through 255 tended to be used for other characters, with incompatible choices that caused the code page disaster. Text encoded in one code page cannot be read correctly by a program that assumes or guesses at another code page.
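As a small illustration of the code page problem, the same byte values decode to different characters depending on which code page the reader assumes. This is a sketch in Python; cp437 (the original IBM PC code page) and latin-1 are just two example code pages picked for the demonstration.

```python
raw = bytes([130, 160])

as_cp437 = raw.decode("cp437")     # the IBM PC code page
as_latin1 = raw.decode("latin-1")  # ISO 8859-1

print(repr(as_cp437), repr(as_latin1))

# The other direction: the Latin-1 byte for e-acute is a different byte
# value, and cp437 reads that value as some other character.
e_acute_latin1 = "\u00e9".encode("latin-1")
misread = e_acute_latin1.decode("cp437")
```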
Unicode came about to solve this disaster. Version 1 started out with 65536 code points, commonly encoded in 16 bits. Later extended in version 2 to 1.1 million code points. The current version is 6.3, using 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.
Encoding in 16-bits was common when v2 came around, used by Microsoft and Apple operating systems for example. And language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16-bits. An encoding called UTF-16, a variable length encoding where one code point can take either 2 or 4 bytes. The original v1 code points take 2 bytes, added ones take 4.
Another variable length encoding that's very common, used in *nix operating systems and tools is UTF-8, a code point can take between 1 and 4 bytes, the original ASCII codes take 1 byte the rest take more. The only non-variable length encoding is UTF-32, takes 4 bytes for a code point. Not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, widely ignored.
An issue with the UTF-16/32 encodings is that the order of the bytes will depend on the endian-ness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.
Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers which UTF choice is "best". Their association with operating system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special codepoint (U+FEFF, zero width no-break space) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianness and is neutral to a text rendering engine. Unfortunately it is optional and many programmers claim their right to omit it so accidents are still pretty common.
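To see the byte order issue and the BOM in action, here is a minimal sketch in Python (an illustration only; the variable names are made up). The codecs module ships the BOM byte sequences as constants.

```python
import codecs

# The same character in the two byte orders of UTF-16.
a_be = "A".encode("utf-16-be")   # 00 41
a_le = "A".encode("utf-16-le")   # 41 00

# U+FEFF encoded in each byte order gives the two possible BOMs.
bom_be = "\ufeff".encode("utf-16-be")
bom_le = "\ufeff".encode("utf-16-le")

# The generic "utf-16" decoder reads the BOM, picks the byte order,
# and drops the BOM from the decoded text.
decoded_be = (codecs.BOM_UTF16_BE + a_be).decode("utf-16")
decoded_le = (codecs.BOM_UTF16_LE + a_le).decode("utf-16")

print(a_be.hex(), a_le.hex(), bom_be.hex(), bom_le.hex())
```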
Java provides support for Unicode, i.e. it supports alphabets from around the world. Hence the size of char in Java is 2 bytes, with a range of 0 to 65535.
ASCII has 128 code positions, allocated to graphic characters and control characters (control codes).
Unicode has 1,114,112 code positions. About 100,000 of them have currently been allocated to characters, and many code points have been made permanently noncharacters (i.e. not used to encode any character ever), and most code points are not yet assigned.
The only things that ASCII and Unicode have in common are: 1) They are character codes. 2) The 128 first code positions of Unicode have been defined to have the same meanings as in ASCII, except that the code positions of ASCII control characters are just defined as denoting control characters, with names corresponding to their ASCII names, but their meanings are not defined in Unicode.
Sometimes, however, Unicode is characterized (even in the Unicode standard!) as “wide ASCII”. This is a slogan that mainly tries to convey the idea that Unicode is meant to be a universal character code the same way as ASCII once was (though the character repertoire of ASCII was hopelessly insufficient for universal use), as opposite to using different codes in different systems and applications and for different languages.
Unicode as such defines only the “logical size” of characters: Each character has a code number in a specific range. These code numbers can be presented using different transfer encodings, and internally, in memory, Unicode characters are usually represented using one or two 16-bit quantities per character, depending on character range, sometimes using one 32-bit quantity per character.
ASCII and Unicode are two character encodings. Basically, they are standards on how to represent different characters in binary so that they can be written, stored, transmitted, and read in digital media. The main difference between the two is in the way they encode the character and the number of bits that they use for each. ASCII originally used seven bits to encode each character. This was later increased to eight with Extended ASCII to address the apparent inadequacy of the original. In contrast, Unicode uses a variable bit encoding scheme where you can choose between 32, 16, and 8-bit encodings. Using more bits lets you use more characters at the expense of larger files while fewer bits give you a limited choice but you save a lot of space. Using fewer bits (i.e. UTF-8 or ASCII) would probably be best if you are encoding a large document in English.
One of the main reasons for Unicode was the problem that arose from the many non-standard extended ASCII code pages. Unless you are using the prevalent code page, which is used by Microsoft and most other software companies, you are likely to encounter problems with your characters appearing as boxes. Unicode virtually eliminates this problem as all the character code points were standardized.
Another major advantage of Unicode is that at its maximum it can accommodate a huge number of characters. Because of this, Unicode currently contains most written languages and still has room for even more. This includes typical left-to-right scripts like English and even right-to-left scripts like Arabic. Chinese, Japanese, and the many other variants are also represented within Unicode. So Unicode won’t be replaced anytime soon.
In order to maintain compatibility with the older ASCII, which was already in widespread use at the time, Unicode was designed in such a way that the first eight bits matched that of the most popular ASCII page. So if you open an ASCII encoded file with Unicode, you still get the correct characters encoded in the file. This facilitated the adoption of Unicode as it lessened the impact of adopting a new encoding standard for those who were already using ASCII.
Summary:
1. ASCII uses an 8-bit encoding while Unicode uses a variable bit encoding.
2. Unicode is standardized while ASCII isn’t.
3. Unicode represents most written languages in the world while ASCII does not.
4. ASCII has its equivalent within Unicode.
Taken From: http://www.differencebetween.net/technology/software-technology/difference-between-unicode-and-ascii/#ixzz4zEjnxPhs
Storage
Given numbers are only for storing 1 character
ASCII ⟶ 7 bits (1 byte)
Extended ASCII ⟶ 8 bits (1 byte)
UTF-8 ⟶ minimum 8, maximum 32 bits (min 1, max 4 bytes)
UTF-16 ⟶ minimum 16, maximum 32 bits (min 2, max 4 bytes)
UTF-32 ⟶ 32 bits (4 bytes)
ASCII defines 128 characters, whereas Unicode contains a repertoire of more than 120,000 characters.
Beyond how UTF is a superset of ASCII, another good difference to know between ASCII and UTF is in terms of disk file encoding and data representation and storage in random memory. Programs know that given data should be understood as an ASCII or UTF string either by detecting special byte order mark codes at the start of the data, or by assuming from programmer intent that the data is text and then checking it for patterns that indicate it is in one text encoding or another.
Using the conventional prefix notation of 0x for hexadecimal data, basic good reference is that ASCII text starts with byte values 0x00 to 0x7F representing one of the possible ASCII character values. UTF text is normally indicated by starting with the bytes 0xEF 0xBB 0xBF for UTF8. For UTF16, start bytes 0xFE 0xFF, or 0xFF 0xFE are used, with the endian-ness order of the text bytes indicated by the order of the start bytes. The simple presence of byte values that are not in the ASCII range of possible byte values also indicates that data is probably UTF.
There are other byte order marks that use different codes to indicate data should be interpreted as text encoded in a certain encoding standard.
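A rough sketch of the BOM check described above, in Python for illustration. sniff_bom and the BOMS table are hypothetical names, and a real detector would also need a fallback for data that has no BOM at all.

```python
import codecs

# The UTF-32 marks are checked first because the UTF-32 LE mark
# (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"hello"))
```

Note that this is only a guess: UTF-16 LE text that starts with a BOM followed by U+0000 looks identical to a UTF-32 LE BOM, so the sketch would report it as UTF-32.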

How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode with explanation. I know a char can be encoded as 1,2,3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be.
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,929 code points are actually assigned in Unicode 12.1.
I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.
In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.
The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for restrictions on the continuation bytes are presumably so it is easy to find the beginning of the next character (as continuation characters are always of the form 10xxxxxx, but the starting byte can never be of this form).
Unicode supports 1,114,112 code points. There are 2048 surrogate code points, giving 1,112,064 scalar values. Of these, there are 66 non-characters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).
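The arithmetic can be checked with a few lines of Python (an illustration only). The breakdown of the 66 noncharacters used here is the 32 code points U+FDD0 through U+FDEF plus the last two code points of each of the 17 planes.

```python
planes = 17
code_points = planes * 0x10000           # 65,536 code points per plane
surrogates = 0xDFFF - 0xD800 + 1         # U+D800 through U+DFFF
noncharacters = 32 + 2 * planes          # U+FDD0..U+FDEF plus U+xxFFFE/U+xxFFFF

scalar_values = code_points - surrogates
encodable_characters = scalar_values - noncharacters

print(code_points, scalar_values, encodable_characters)
```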
To give a metaphorically accurate answer, all of them.
Continuation bytes in the UTF-8 encodings allow for resynchronization of the encoded octet stream in the face of "line noise". The decoder merely needs to scan forward for a byte that does not have a value between 0x80 and 0xBF to know that it has found the start of a new character.
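Here is a minimal sketch of that resynchronization in Python (illustration only; char_start and next_char_start are hypothetical helpers). Starting from an arbitrary byte offset, it skips continuation bytes (0x80 through 0xBF) to find a character boundary, either backwards or forwards.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return 0x80 <= byte <= 0xBF


def char_start(data: bytes, index: int) -> int:
    """Step back from index to the first byte of the character containing it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def next_char_start(data: bytes, index: int) -> int:
    """Scan forward from index to the next byte that starts a character."""
    while index < len(data) and is_continuation(data[index]):
        index += 1
    return index


# "a", section sign, euro sign: 61 | C2 A7 | E2 82 AC
data = "a\u00a7\u20ac".encode("utf-8")
print(char_start(data, 5), next_char_start(data, 4))
```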
In theory, the encodings used today allow for expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where the maximal length tweet can encode up to 4,340 bits' worth of data. (140 characters [valid and invalid], times 31 bits each.)
According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.
Unicode has the hexadecimal amount of 110000, which is 1114112

Unicode code point limit

As explained here, all Unicode encodings end at the largest code point, 10FFFF. But I've heard differently, that they can go up to 6 bytes. Is it true?
UTF-8 underwent some changes during its life, and there are many specifications (most of which are outdated now) which standardized UTF-8. Most of the changes were introduced to help compatibility with UTF-16 and to allow for the ever-growing amount of codepoints.
To make the long story short, UTF-8 was originally specified to allow codepoints with up to 31 bits (or 6 bytes). But with RFC3629, this was reduced to 4 bytes max. to be more compatible to UTF-16.
Wikipedia has some more information. The specification of the Universal Character Set is closely linked to the history of Unicode and its transformation format (UTF).
See the answers to Do UTF-8,UTF-16, and UTF-32 Unicode encodings differ in the number of characters they can store?
UTF-8 and UTF-32 are theoretically capable of representing characters above U+10FFFF, but were artificially restricted to match UTF-16's capacity.
The largest Unicode code point and the encodings used for Unicode characters are two different things. According to the standard, the highest code point really is 0x10FFFF, and for that you need just 21 bits, which fit easily into 4 bytes, even with 11 bits wasted!
I guess with your question about 6 bytes you mean a 6-byte utf-8 sequence, right? As others have answered already, using the utf-8 mechanism you could really deal with 6-byte sequences, you can even deal with 7-byte sequences and even with an 8-byte sequence. The 7-byte sequence gives you a range of just what the following bytes have to offer, 6 x 6 bits = 36 bits and a 8-byte sequence gives you 7 x 6 bits = 42 bits. You could deal with it but it is not allowed because unneeded, the highest codepoint is 0x10ffff.
It is also forbidden to use longer sequences than needed, as Hibou57 has mentioned. With utf-8 one must always use the shortest sequence possible or the sequence will be treated as invalid! An ASCII character must be in a 7-bit single byte, of course. The second thing is that the utf-8 4-byte sequence gives you 3 bits of payload in the start byte and 18 bits of payload in the following bytes, which makes 21 bits, and that matches the calculation of surrogates when using the utf-16 encoding. The bias 0x10000 is subtracted from the codepoint and the remaining 20 bits go to the high and low surrogate payload areas, 10 bits each. The third and last thing is that within utf-8 it is not allowed to encode high or low surrogate values. Surrogates are not characters but containers for them; surrogates can only appear in utf-16, not in utf-8 or utf-32 encoded files.
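The surrogate calculation can be sketched like this in Python (an illustration; to_surrogates is a made-up helper name). The bias 0x10000 is subtracted, then the remaining 20 bits are split into two 10-bit halves.

```python
def to_surrogates(code_point: int):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogates")
    offset = code_point - 0x10000          # at most 20 bits
    high = 0xD800 | (offset >> 10)         # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
```

The pair for U+1F600 matches what a built-in UTF-16 encoder produces for the same code point.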
Indeed, from some views of the UTF-8 encoding, UTF-8 may technically permit encoding code points beyond the forever-fixed upper limit of the valid range; so one may encode a code point beyond that range, but it will not be a valid code point anywhere. On the other hand, you may encode a character with unneeded zeroed high-order bits, e.g. encoding an ASCII code point with multiple bytes, like in 2#1100_0001#, 2#1000_0001# (using Ada's notation), which would be the ASCII letter A UTF-8 encoded with two bytes. But then, it may be rejected by some safety/security filters, as this used to be used for hacking and piracy. RFC 3629 has some explanation about it. One should just stick to encoding valid code points (as defined by Unicode), the safe way (no extraneous bytes).
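You can watch a strict decoder reject these cases. The sketch below uses Python's built-in UTF-8 decoder as one example of a conforming decoder; is_valid_utf8 is a hypothetical helper, and other decoders may report the errors differently.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_a = bytes([0b1100_0001, 0b1000_0001])   # "A" padded out to two bytes
beyond_range = bytes([0xF4, 0x90, 0x80, 0x80])   # would be U+110000
encoded_surrogate = bytes([0xED, 0xA0, 0x80])    # would be U+D800

for name, data in [("plain A", b"A"),
                   ("overlong A", overlong_a),
                   ("beyond U+10FFFF", beyond_range),
                   ("surrogate", encoded_surrogate)]:
    print(name, is_valid_utf8(data))
```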

Are 6 octet UTF-8 sequences valid?

Can UTF-8 encode 5 or 6 byte sequences, allowing all Unicode characters to be encoded? I'm getting conflicting standards. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.
(All quotes are from RFC 3629)
Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets. The
only octet of a "sequence" of one has the higher-order bit set to 0,
the remaining 7 bits being used to encode the character number. In a
sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the number of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.
So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from different planes than the BMP?
Section 2:
The octet values C0, C1, F5 to FF never appear.
This means we cannot encode UTF-8 values with 5 or 6 octets (or even some with 4 that aren't within the above range)?
Section 12:
Restricted the range of characters to 0000-10FFFF (the UTF-16
accessible range).
Looking at the previous RFC confirms this...they reduced the range of characters.
Section 10:
Another security issue occurs when encoding to UTF-8: the ISO/IEC
10646 description of UTF-8 allows encoding character numbers up to
U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
a risk of buffer overflow if the range of character numbers is not
explicitly limited to U+10FFFF or if buffer sizing doesn't take into
account the possibility of 5- and 6-byte sequences.
So these sequences are allowed per the ISO/IEC 10646 definition, but not the RFC 3629 definition? Which one should I follow?
Thanks in advance.
There are no Unicode characters beyond 10FFFF; the BMP covers 0000 through FFFF.
UTF-8 is well-defined for 0-10FFFF.
Both UTF-8 and UTF-16 allow all Unicode characters to be encoded. What UTF-8 is not allowed to do is to encode upper and lower surrogate halves (which UTF-16 uses) or values above U+10FFFF, which aren't legal Unicode.
Note that the BMP ends at U+FFFF.
I would have to say no: Unicode code points are valid for the range [0, 0x10FFFF], and those map to 1-4 octets. So, if you did come across a 5- or 6-octet UTF-8 encoded code point, it's not a valid code point - there's certainly nothing assigned there. I am a little baffled as to why they're there in the ISO standard - I couldn't find an explanation.
It does make you wonder, however, if perhaps someday in the future, they would expand past U+10FFFF. 0x10FFFF allows for over a million characters, but there are a lot of characters out there, and it would depend on how much eventually gets encoded. (For sanity's sake, let's hope not, a million characters is a lot!) UTF-32 could handle more code points, and as you've discovered, UTF-8 could. It'd really be UTF-16 that's out of luck - more surrogate pairs would be needed somewhere in the spectrum of code points.

UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32?
I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?
UTF-8 has an advantage in the case where ASCII characters represent the majority of characters in a block of text, because UTF-8 encodes these into 8 bits (like ASCII). It is also advantageous in that a UTF-8 file containing only ASCII characters has the same encoding as an ASCII file.
UTF-16 is better where ASCII is not predominant, since it uses 2 bytes per character, primarily. UTF-8 will start to use 3 or more bytes for the higher order characters where UTF-16 remains at just 2 bytes for most characters.
UTF-32 will cover all possible characters in 4 bytes. This makes it pretty bloated. I can't think of any advantage to using it.
In short:
UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.
In long: see Wikipedia: UTF-8, UTF-16, and UTF-32.
UTF-8 is variable 1 to 4 bytes.
UTF-16 is variable 2 or 4 bytes.
UTF-32 is fixed 4 bytes.
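The ranges above can be checked by encoding one sample code point from each range and counting bytes. This is a sketch in Python for illustration; the sample code points and the encoded_lengths helper are arbitrary choices, and the "-le" codecs are used so no BOM is counted.

```python
def encoded_lengths(code_point: int):
    """Byte length of one code point in UTF-8, UTF-16 and UTF-32."""
    ch = chr(code_point)
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))


# One sample per UTF-8 length class.
samples = [0x41, 0xA7, 0x20AC, 0x1F600]
lengths = {hex(cp): encoded_lengths(cp) for cp in samples}

for cp, sizes in lengths.items():
    print(cp, sizes)
```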
Unicode defines a single huge character set, assigning one unique integer value to every graphical symbol (that is a major simplification, and isn't actually true, but it's close enough for the purposes of this question). UTF-8/16/32 are simply different ways to encode this.
In brief, UTF-32 uses 32-bit values for each character. That allows them to use a fixed-width code for every character.
UTF-16 uses 16-bit by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values.
And UTF-8 uses 8-bit values by default, which means that the first 128 values are fixed-width single-byte characters (the most significant bit is used to signify that this is part of a multi-byte sequence, leaving 7 bits for the actual character value). All other characters are encoded as sequences of up to 4 bytes (if memory serves).
And that leads us to the advantages. Any ASCII-character is directly compatible with UTF-8, so for upgrading legacy apps, UTF-8 is a common and obvious choice. In almost all cases, it will also use the least memory. On the other hand, you can't make any guarantees about the width of a character. It may be 1, 2, 3 or 4 bytes wide, which makes string manipulation difficult.
UTF-32 is opposite, it uses the most memory (each character is a fixed 4 bytes wide), but on the other hand, you know that every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from the length in bytes of the string. You can't do that with UTF-8.
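As a small illustration of that difference, here is a sketch in Python (the sample text and variable names are made up). With UTF-32 the character count is simply the byte length divided by four; with UTF-8 you have to walk the bytes and skip the continuation bytes.

```python
text = "na\u00efve \U0001F600"   # 7 code points, two of them non-ASCII

utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")

# UTF-32: fixed width, so a division is enough.
count_from_utf32 = len(utf32) // 4

# UTF-8: count only the bytes that are not continuation bytes (10xxxxxx).
count_from_utf8 = sum(1 for b in utf8 if (b & 0xC0) != 0x80)

print(len(utf32), len(utf8), count_from_utf32, count_from_utf8)
```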
UTF-16 is a compromise. It lets most characters fit into a fixed-width 16-bit value. So as long as you don't have characters outside the Basic Multilingual Plane (rarer Chinese characters, musical notes, emoji and some others), you can assume that each character is 16 bits wide. It uses less memory than UTF-32. But it is in some ways "the worst of both worlds". It almost always uses more memory than UTF-8, and it still doesn't avoid the problem that plagues UTF-8 (variable-length characters).
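For the characters that do not fit in 16 bits, UTF-16 uses a pair of 16-bit units (a surrogate pair). A minimal Python sketch of that arithmetic; the function name `to_surrogates` is hypothetical, and the musical-note example (U+1D11E) is my own choice:

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the range that needs a surrogate pair")
    v = cp - 0x10000             # leaves a 20-bit value
    high = 0xD800 | (v >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x1D11E)  # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")
```

The result can be cross-checked against Python's own codec with `chr(0x1D11E).encode("utf-16-be")`.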
Finally, it's often helpful to just go with what the platform supports. Windows uses UTF-16 internally, so on Windows, that is the obvious choice.
Linux varies a bit, but they generally use UTF-8 for everything that is Unicode-compliant.
So short answer: All three encodings can encode the same character set, but they represent each character as different byte sequences.
Unicode is a standard, and you can think of the UTF-x encodings as technical implementations of it, each suited to different practical purposes:
UTF-8 - "size optimized": best suited for Latin-character-based data (or ASCII); it takes only 1 byte per character there, but the size grows with the variety of symbols (in the worst case up to 4 bytes per character; the original design allowed up to 6, but the current standard caps it at 4)
UTF-16 - "balance": it takes a minimum of 2 bytes per character, which is enough for the existing set of mainstream languages and keeps their characters at a fixed size to ease handling (but the size is still variable and can grow up to 4 bytes per character)
UTF-32 - "performance": allows the use of simple algorithms as a result of fixed-size characters (4 bytes), but at a memory cost
I tried to give a simple explanation in my blogpost.
UTF-32
requires 32 bits (4 bytes) to encode any character. For example, in order to represent the "A" character code-point using this scheme, you'll need to write 65 in 32-bit binary number:
00000000 00000000 00000000 01000001 (Big Endian)
If you take a closer look, you'll note that the rightmost seven bits are the same bits used by the ASCII scheme. But since UTF-32 is a fixed-width scheme, we must attach three additional bytes. This means that if we have two files that each contain only the "A" character, one ASCII-encoded and the other UTF-32-encoded, their sizes will be 1 byte and 4 bytes respectively.
UTF-16
Many people think that as UTF-32 uses fixed width 32 bit to represent a code-point, UTF-16 is fixed width 16 bits. WRONG!
In UTF-16 the code point may be represented in either 16 bits or 32 bits, so this scheme is a variable-length encoding system. What is the advantage over UTF-32? At least for ASCII, the size of files won't be 4 times the original (but it is still twice), so we're still not backward compatible with ASCII.
Since 7-bits are enough to represent the "A" character, we can now use 2 bytes instead of 4 like the UTF-32. It'll look like:
00000000 01000001
UTF-8
You guessed right. In UTF-8 the code point may be represented using 8, 16, 24 or 32 bits, and like UTF-16, this one is also a variable-length encoding system.
Finally we can represent "A" in the same way we represent it using the ASCII encoding system:
01000001
A small example where UTF-16 is actually better than UTF-8:
Consider the Chinese character "語" (U+8A9E) - its UTF-8 encoding is:
11101000 10101010 10011110
While its UTF-16 encoding is shorter:
10001010 10011110
In order to understand the representation and how it's interpreted, visit the original post.
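The bit patterns for "語" can be reproduced with a few lines of Python. This is only a quick check, not part of the original post; the helper name `bits` is hypothetical:

```python
def bits(data):
    """Render bytes as space-separated groups of 8 binary digits."""
    return " ".join(f"{b:08b}" for b in data)

ch = "\u8a9e"  # the character 語

utf8_bits = bits(ch.encode("utf-8"))       # three bytes
utf16_bits = bits(ch.encode("utf-16-be"))  # two bytes, big-endian, no BOM

print(utf8_bits)
print(utf16_bits)
```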
UTF-8
has no concept of byte-order
uses between 1 and 4 bytes per character
ASCII is a compatible subset of encoding
completely self-synchronizing e.g. a dropped byte from anywhere in a stream will corrupt at most a single character
pretty much all European languages are encoded in two bytes or less per character
UTF-16
must be parsed with known byte-order or reading a byte-order-mark (BOM)
uses either 2 or 4 bytes per character
UTF-32
every character is 4 bytes
must be parsed with known byte-order or reading a byte-order-mark (BOM)
UTF-8 is going to be the most space efficient unless a majority of the characters are from the CJK (Chinese, Japanese, and Korean) character space.
UTF-32 is best for random access by character offset into a byte-array.
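The byte-order point can be seen directly in Python. A small sketch, assuming the standard `codecs` module constants; the single-character sample is my own:

```python
import codecs

ch = "A"
be = ch.encode("utf-16-be")   # bytes 00 41
le = ch.encode("utf-16-le")   # bytes 41 00

# Reading big-endian data as little-endian silently yields a different character.
wrong = be.decode("utf-16-le")

# With a BOM in front, the generic "utf-16" codec works out the byte order itself.
decoded = (codecs.BOM_UTF16_BE + be).decode("utf-16")

# UTF-8 has a single byte sequence per character, so there is no order to get wrong.
utf8 = ch.encode("utf-8")

print(be, le, f"U+{ord(wrong):04X}", decoded, utf8)
```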
I made some tests to compare database performance between UTF-8 and UTF-16 in MySQL.
[Charts not reproduced: update, insert, and delete speeds, UTF-8 vs UTF-16.]
In UTF-32 all characters are coded with 32 bits. The advantage is that you can easily calculate the length of the string. The disadvantage is that for each ASCII character you waste an extra three bytes.
In UTF-8 characters have variable length: ASCII characters are coded in one byte (eight bits), most western special characters are coded in either two or three bytes (for example € is three bytes), and more exotic characters can take up to four bytes. The clear disadvantage is that you cannot calculate a string's length a priori. But it takes a lot fewer bytes to code Latin (English) alphabet text, compared to UTF-32.
UTF-16 is also variable length. Characters are coded either in two bytes or four bytes. I really don't see the point. It has disadvantage of being variable length, but hasn't got the advantage of saving as much space as UTF-8.
Of those three, clearly UTF-8 is the most widely spread.
I'm surprised this question is 11yrs old and not one of the answers mentioned the #1 advantage of utf-8.
utf-8 generally works even with programs that are not utf-8 aware. That's partly what it was designed for. Other answers mention that the first 128 code points are the same as ASCII. All other code points are generated by 8bit values with the high bit set (values from 128 to 255) so that from the POV of a non-unicode aware program it just sees strings as ASCII with some extra characters.
As an example let's say you wrote a program to add line numbers that effectively does this (and to keep it simple let's assume end of line is just ASCII 13)
// pseudo code
function readLine
  if end of file
    return null
  read bytes (8bit values) into string until you hit 13 or end of file
  return string

function main
  lineNo = 1
  do {
    s = readLine
    if (s == null) break;
    print lineNo++, s
  }
Passing a utf-8 file to this program will continue to work. Similarly, splitting on tabs, commas, parsing for ASCII quotes, or other parsing for which only ASCII values are significant all just work with utf-8, because no ASCII value appears in utf-8 except when it is actually meant to be that ASCII value.
Some other answers or comments mention that utf-32 has the advantage that you can treat each codepoint separately. This would suggest for example you could take a string like "ABCDEFGHI" and split it at every 3rd code point to make
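A runnable take on the line-numbering idea, as a rough Python sketch rather than a faithful port of the pseudo code. The function name `number_lines` and the sample text are assumptions; the code only ever looks at raw bytes and the single end-of-line byte 13, and knows nothing about utf-8:

```python
def number_lines(data: bytes, eol: int = 13) -> bytes:
    """Prefix each line with its number, splitting purely on one EOL byte."""
    sep = bytes([eol])
    out = []
    for line_no, line in enumerate(data.split(sep), start=1):
        # The line body is copied through untouched, whatever bytes it holds.
        out.append(str(line_no).encode("ascii") + b" " + line)
    return sep.join(out)

# Sample input with 2-byte and 3-byte utf-8 characters (chosen for illustration).
text = "h\u00e9llo\rw\u00f6rld \u00a7\r\u8a9e"
numbered = number_lines(text.encode("utf-8"))
print(numbered.decode("utf-8").replace("\r", "\n"))
```

The output still decodes as valid utf-8 because every byte of a multi-byte sequence is 128 or above, so none of them can be mistaken for byte 13.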
ABC
DEF
GHI
This is false. Many code points affect other code points. For example, the skin tone modifier code points that let you choose between 👨🏻‍🦳👨🏼‍🦳👨🏽‍🦳👨🏾‍🦳👨🏿‍🦳. If you split at an arbitrary code point you'll break those.
Another example is the bidirectional code points. The following paragraph was not entered backward. It is just preceded by the 0x202E codepoint
‮This line is not typed backward it is only displayed backward
So no, utf-32 will not let you just randomly manipulate unicode strings without a thought to their meanings. It will let you look at each codepoint with no extra code.
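To make the emoji case concrete, here is a short Python sketch. Python strings index by code point, so this behaves like working on utf-32 data; the variable names are my own:

```python
# One visible glyph built from four code points:
# U+1F468 MAN, U+1F3FB skin tone modifier, U+200D ZERO WIDTH JOINER,
# U+1F9B3 white hair component.
emoji = "\U0001F468\U0001F3FB\u200D\U0001F9B3"

codepoints = [f"U+{ord(c):04X}" for c in emoji]
utf32_size = len(emoji.encode("utf-32-le"))  # 4 bytes per code point, no BOM

# Splitting "per code point" tears the glyph apart: the first piece is a
# plain man emoji, the rest is a dangling modifier sequence.
first, rest = emoji[:1], emoji[1:]

print(codepoints, utf32_size)
```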
FYI though, utf-8 was designed so that looking at any individual byte you can find out the start of the current code point or the next code point.
Take an arbitrary byte in utf-8 data. If it is < 128 it is a complete code point by itself. If it's >= 128 and < 192 (the top 2 bits are 10) then to find the start of the code point you need to look at the preceding bytes until you find a byte with a value >= 192 (the top 2 bits are 11). At that byte you've found the start of a codepoint. That byte encodes how many subsequent bytes make up the code point.
If you want to find the next code point, just scan forward until you hit a byte that is < 128 or >= 192; that's the start of the next code point.
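The two scans described above could look like this in Python. A minimal sketch that assumes the input is valid utf-8; both function names are hypothetical:

```python
def codepoint_start(data: bytes, i: int) -> int:
    """Index of the first byte of the code point that contains data[i]."""
    # Step back over continuation bytes (top two bits are 10).
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

def next_codepoint_start(data: bytes, i: int) -> int:
    """Index of the first byte of the code point following the one at data[i].

    Returns len(data) when there is no following code point.
    """
    i += 1
    # Skip forward over continuation bytes until a byte < 128 or >= 192.
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

data = "a\u20acb".encode("utf-8")  # bytes: 61 E2 82 AC 62
print(codepoint_start(data, 3), next_codepoint_start(data, 2))
```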
Num Bytes | 1st code point | last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4
1         | U+0000         | U+007F          | 0xxxxxxx |          |          |
2         | U+0080         | U+07FF          | 110xxxxx | 10xxxxxx |          |
3         | U+0800         | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |
4         | U+10000        | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
Where the x's are the bits of the code point. Concatenate the x bits from all of the bytes to get the code point.
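The byte layout above translates fairly directly into code. A minimal hand-rolled encoder in Python, purely as an illustration (real code would just call `str.encode`); the function name `encode_utf8` is hypothetical, and this sketch does not reject the surrogate range U+D800 to U+DFFF the way a strict encoder would:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by filling in the x bits of the layout by hand."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point is above U+10FFFF")

for cp in (0x41, 0xA7, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", encode_utf8(cp).hex(" "))
```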
Depending on your development environment you may not even have the choice what encoding your string data type will use internally.
But for storing and exchanging data I would always use UTF-8, if you have the choice. If you have mostly ASCII data this will give you the smallest amount of data to transfer, while still being able to encode everything. Optimizing for the least I/O is the way to go on modern machines.
As mentioned, the difference is primarily the size of the underlying variables, which in each case get larger to allow more characters to be represented.
However, fonts, encoding and things are wickedly complicated (unnecessarily?), so a big link is needed to fill in more detail:
http://www.cs.tut.fi/~jkorpela/chars.html#ascii
Don't expect to understand it all, but if you don't want to have problems later it's worth learning as much as you can, as early as you can (or just getting someone else to sort it out for you).
Paul.
After reading through the answers, UTF-32 needs some loving.
C#:
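using System.Diagnostics;
using System.Security.Cryptography;
using System.Text;
// 500 MB of random bytes, decoded below under each encoding in turn
byte[] Data1 = RandomNumberGenerator.GetBytes(500_000_000);
Stopwatch sw = Stopwatch.StartNew();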
int l = Encoding.UTF8.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-8: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.Unicode.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"Unicode: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.UTF32.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-32: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.ASCII.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"ASCII: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
UTF-8 -- Elapsed 9.939s - Size 473,752,800
Unicode -- Elapsed 0.853s - Size 250,000,000
UTF-32 -- Elapsed 3.143s - Size 125,030,570
ASCII -- Elapsed 2.362s - Size 500,000,000
UTF-32 -- MIC DROP
In short, the only reason to use UTF-16 or UTF-32 is to support non-English and ancient scripts respectively.
I was wondering why anyone would choose a non-UTF-8 encoding when UTF-8 is obviously more efficient for web/programming purposes.
A common misconception: the suffixed number is NOT an indication of capability. They all support the complete Unicode range; it's just that UTF-8 can handle ASCII with a single byte, so it is MORE efficient and less prone to corruption, both for the CPU and over the internet.
Some good reading: http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/10/which_utf_do_i_use.html
and http://utf8everywhere.org