How does UTF-16 encoding use surrogate code points? - unicode

According to the Unicode specification:
D91 UTF-16 encoding form: The Unicode encoding form that assigns each
Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF
to a single unsigned 16-bit code unit with the same numeric value as
the Unicode scalar value, and that assigns each Unicode scalar value
in the range U+10000..U+10FFFF to a surrogate pair.
The term "scalar value" is referred to unicode code points, that is the range of abstract ideas which must be encoded into specific byte sequence by different encoding forms (UTF-16 and so on). So it seems that this excerpt gist is in view of not all code points can be accommodated into one UTF-16 code unit (two bytes), there are ones which should be encoded into a pair of code units - 4 bytes (it's called "a surrogate pair").
However, the very term "scalar value" is defined as follows:
D76 Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.
Wait... does Unicode have surrogate code points? What is the reason for them, when UTF-16 can use 4 bytes to represent scalar values? Can anyone explain the rationale and how UTF-16 uses these code points?

Yes, Unicode reserves ranges for surrogate code points:
High Surrogate Area U+D800 to U+DBFF
Low Surrogate Area U+DC00 to U+DFFF
Unicode reserves these ranges because the corresponding 16-bit values are used in surrogate pairs and no character will ever be assigned to them. Surrogate pairs are two 16-bit code units that together encode a code point above U+FFFF, which does not fit into a single 16-bit value.

Just for the sake of ultimate clarification.
UTF-16 uses 16-bit (2-byte) code units. That means this encoding form encodes code points (abstract values that have to be represented in computer memory somehow), as a rule, into 16 bits, so a decoder reads the data two bytes at a time.
UTF-16 does this quite straightforwardly: the U+000E code point is encoded as 000E, U+000F as 000F, and so on.
The issue is that 16 bits give a range that is not sufficient to accommodate all Unicode code points (0000..FFFF allows only 65 536 possible values). We might simply use two 16-bit words (4 bytes) for the code points beyond this range (in fact, my misunderstanding was about why UTF-16 doesn't do exactly that). However, that approach makes some sequences impossible to decode unambiguously. For example, if we encode the U+10000 code point as 0001 0000 (hex notation), how should a decoder interpret that representation: as the two subsequent code points U+0001 and U+0000, or as the single code point U+10000?
The Unicode specification takes a better route. If we need to encode the range U+10000..U+10FFFF (1 048 576 code points), we set apart 1 024 + 1 024 = 2 048 values from those that can be encoded with 16 bits (the spec chose the D800..DFFF range for this). When a decoder encounters a value from D800..DBFF (the High Surrogate Area) in memory, it knows this is not a "fully-fledged" code point (not a scalar value in terms of the spec), so it reads another 16 bits, expects a value from the DC00..DFFF range (the Low Surrogate Area), and from the two together works out which code point in U+10000..U+10FFFF was encoded by these 4 bytes (by this surrogate pair). Note that such a scheme makes it possible to encode 1 024 * 1 024 = 1 048 576 code points - exactly the amount we need.
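Here is a minimal sketch of that scheme in Elixir (the encode_pair/decode_pair helpers and their names are made up purely for illustration; the Bitwise operators do the 10-bit splitting and joining):
import Bitwise

# Hypothetical helpers, only to illustrate the scheme described above.
encode_pair = fn cp when cp in 0x10000..0x10FFFF ->
  offset = cp - 0x10000                 # a 20-bit value, 0x00000..0xFFFFF
  {0xD800 + (offset >>> 10),            # top 10 bits -> D800..DBFF (High Surrogate Area)
   0xDC00 + (offset &&& 0x3FF)}         # low 10 bits -> DC00..DFFF (Low Surrogate Area)
end

decode_pair = fn hi, lo when hi in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF ->
  0x10000 + ((hi - 0xD800) <<< 10) + (lo - 0xDC00)
end

encode_pair.(0x10000)          # {0xD800, 0xDC00} - the lowest pair
encode_pair.(0x10FFFF)         # {0xDBFF, 0xDFFF} - the highest pair
decode_pair.(0xD800, 0xDC00)   # 0x10000 - the round trip is unambiguous
The guard clauses mirror the reserved ranges: a code unit outside D800..DBFF can never be mistaken for the start of a pair, which is exactly what removes the ambiguity described above.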
And because the Unicode codespace is defined as the range of integers from 0 to 10FFFF, we also have to introduce the concept of surrogate code points (not code units) - the U+D800..U+DFFF range (we cannot simply exclude this range from the Unicode codespace). Since surrogate code points are designated for the surrogate code units of UTF-16 (see C1, D74), these code points can look like a UTF-16 relic.

Related

UTF8, codepoints, and their representation in Erlang and Elixir

Going through Elixir's handling of Unicode:
iex> String.codepoints("abc§")
["a", "b", "c", "§"]
very good, and byte_size/1 of this is not 4 but 5, because the last char takes 2 bytes - I get that.
The ? operator (or is it a macro? can't find the answer) tells me that
iex(69)> ?§
167
Great; so then I look into the UTF-8 encoding table, and see the value c2 a7 as the hex encoding for the char. That means the two bytes (as witnessed by byte_size/1) are c2 (194 in decimal) and a7 (167 in decimal). That 167 is the result I got when evaluating ?§ earlier. What I don't understand, exactly, is why that number is a "code point", as per the description of the ? operator. When I try to work backwards and evaluate the binary, I get what I want:
iex(72)> <<0xc2, 0xa7>>
"§"
And to make me go completely bananas, this is what I get in Erlang shell:
24> <<167>>.
<<"§">>
25> <<"\x{a7}">>.
<<"§">>
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>>
27> <<"\x{c2a7}">>.
<<"§">>
!! While Elixir is only happy with the code above... what is it that I don't understand? Why is Erlang perfectly happy with a single byte, given that Elixir insists that the char takes 2 bytes - and the Unicode table seems to agree?
The codepoint is what identifies the Unicode character. The codepoint for § is 167 (0xA7). A codepoint can be represented in bytes in different ways, depending on your encoding of choice.
The confusion here comes from the fact that the codepoint 167 (0xA7) is identified by the bytes 0xC2 0xA7 when encoded to UTF-8.
When you add Erlang to the conversation, you have to remember that Erlang's default encoding was/is latin1 (there is an effort to migrate to UTF-8 but I am not sure if it has made it to the shell - someone please correct me).
In latin1, the codepoint § (0xA7) is also represented by the byte 0xA7. So explaining your results directly:
24> <<167>>.
<<"§">> %% this is encoded in latin1
25> <<"\x{a7}">>.
<<"§">> %% still latin1
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>> %% this is encoded in utf8, as the /utf8 modifier says
27> <<"\x{c2a7}">>.
<<"§">> %% this is latin1
The last one is quite interesting and potentially confusing. In Erlang binaries, if you pass an integer with value more than 255, it is truncated. So the last example is effectively doing <<49831>> which when truncated becomes <<167>>, which is again equivalent to <<"§">> in latin1.
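For what it's worth, the same truncation is easy to check from Elixir, whose binaries are Erlang binaries (just a quick sketch):
iex> <<0xC2A7>> == <<0xA7>>             # only the low 8 bits of 49831 (0xC2A7) survive
true
iex> <<0xC2A7::16>> == <<0xC2, 0xA7>>   # a 16-bit segment keeps both bytes
true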
The code point is a number assigned to the character. It's an abstract value, not dependent on any particular representation in actual memory somewhere.
In order to store the character, you have to convert the code point to some sequence of bytes. There are several different ways to do this; each is called a Unicode Transformation Format, and named UTF-n, where n is the number of bits in the basic unit of encoding. There used to be a UTF-7, used where 7-bit ASCII was assumed and even the 8th bit of a byte couldn't be reliably transmitted; in modern systems, there are UTF-8, UTF-16, and UTF-32.
Since the largest code point value fits comfortably in 21 bits, UTF-32 is the simplest; you just store the code point as a 32-bit integer. (There could theoretically be a UTF-24 or even a UTF-21, but common modern computing platforms deal naturally with values that take up either exactly 8 or a multiple of 16 bits, and have to work harder to deal with anything else.)
So UTF-32 is simple, but inefficient. Not only does it have 11 extra bits that will never be needed, it has 5 bits that are almost never needed. Far and away most Unicode characters found in the wild are in the Basic Multilingual Plane, U+0000 through U+FFFF. UTF-16 lets you represent all of those code points as a plain integer, taking up half the space of UTF-32. But it can't represent anything from U+10000 on up that way, so part of the 0000-FFFF range is reserved for "surrogates" that can be put together to represent a high-plane Unicode character with two 16-bit units, for a total of 32 bits again, but only when needed.
Java uses UTF-16 internally, but Erlang (and therefore Elixir), along with most other programming systems, uses UTF-8. UTF-8 has the advantage of completely transparent compatibility with ASCII - all characters in the ASCII range (U+0000 through U+007F, or 0-127 decimal) are represented by single bytes with the corresponding value. But any characters with code points outside the ASCII range require more than one byte each - even those in the range U+0080 through U+00FF, decimal 128 through 255, which only take up one byte in the Latin-1 encoding that used to be the default before Unicode.
So with Elixir/Erlang "binaries", unless you go out of your way to encode things differently, you are using UTF-8. If you look at the high bit of the first byte of a UTF-8 character, it's either 0, meaning you have a one-byte ASCII character, or it's 1. If it's 1, then the second-highest bit is also 1, because the number of consecutive 1-bits counting down from the high bit before you get to a 0 bit tells you how many bytes total the character takes up. So the pattern 110xxxxx means the character is two bytes, 1110xxxx means three bytes, and 11110xxx means four bytes. (There is no legal UTF-8 character that requires more than four bytes, although the encoding could theoretically support up to seven.)
The rest of the bytes all have the two high bits set to 10, so they can't be mistaken for the start of a character. And the rest of the bits are the code point itself.
To use your case as an example, the code point for "§" is U+00A7 - that is, hexadecimal A7, which is decimal 167 or binary 10100111. Since that's greater than decimal 127, it will require two bytes in UTF-8. Those two bytes will have the binary form 110abcde 10fghijk, where the bits abcdefghijk will hold the code point. So the binary representation of the code point, 10100111, is padded out to 00010100111 and split into the sequences 00010, which replaces abcde in the UTF-8 template, and 100111, which replaces fghijk. That yields two bytes with binary values 11000010 and 10100111, which are C2 and A7 in hexadecimal, or 194 and 167 in decimal.
You'll notice that the second byte has the same value as the code point you're encoding, but it's important to realize that this correspondence is just a coincidence. There are a total of 64 code points, from 128 (U+0080) through 191 (U+00BF), that work out that way: their UTF-8 encoding consists of a byte with decimal value 194 followed by a byte whose value is equal to the code point itself. But for the other 1,114,048 code points possible in Unicode, that is not the case.
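If you want to watch those bits from iex itself, Elixir's bit syntax can pull the 110xxxxx / 10xxxxxx pieces apart (a quick illustration; the variable names high and low are arbitrary):
iex> <<0b110::3, high::5, 0b10::2, low::6>> = "§"
"§"
iex> {high, low}
{2, 39}
iex> import Bitwise
Bitwise
iex> (high <<< 6) ||| low    # stitch the 5 + 6 payload bits back together
167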

In Swift, how is a UTF-16 surrogate pair represented in bits?

I'm currently learning Swift using the book The Swift Programming Language (3.1).
The book states that Swift's String and Character types are fully Unicode-compliant, with each character represented by a 21-bit Unicode scalar value, and that each character can be viewed via its UTF-8, UTF-16, or UTF-32 representation.
I understand how UTF-8 and UTF-32 work at the byte and bit level, but I'm having trouble understanding how UTF-16 works at the bit level.
I know that for characters whose code point fits into 16 bits, UTF-16 simply represents the character as a 16-bit number. But for characters whose representation requires more than 16 bits, two 16-bit blocks are used (called a surrogate pair, I believe).
But how are those two 16-bit blocks laid out at the bit level?
A "Unicode Scalar Value" is
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
Every Unicode scalar value can be represented as a sequence of one or two UTF-16 code units, as described in the
Unicode Standard:
D91 UTF-16 encoding form
The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
Table 3-5. UTF-16 Bit Distribution
Scalar Value                UTF-16
xxxxxxxxxxxxxxxx            xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx    110110wwwwxxxxxx 110111xxxxxxxxxx
Note: wwww = uuuuu - 1
There are 2²⁰ Unicode scalar values in the "Supplementary Planes" (U+10000..U+10FFFF), which means that 20 bits are sufficient to encode all of them in a surrogate pair. Technically this is done by subtracting 0x010000 from the value before splitting it into two blocks of 10 bits.
The UTF-16 range D800..DFFF is reserved. Values below and above that range are plain 16-bit code points. A value in D800..DBFF, minus D800, gives the high 10 bits of the 20-bit offset (code point minus 0x10000) of a code point beyond FFFF. The next 16-bit code unit, minus DC00, contains the low 10 bits. Of course, endianness comes into the picture, making us all wish we could just use UTF-8. Sigh.
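To make Table 3-5 concrete, here is the same arithmetic for one arbitrary supplementary-plane code point, U+1F600, sketched in Elixir rather than Swift only because its bit syntax makes the layout easy to verify:
import Bitwise

cp = 0x1F600                            # an arbitrary code point above U+FFFF
offset = cp - 0x10000                   # 0x0F600, a 20-bit value
high = 0xD800 ||| (offset >>> 10)       # 0xD83D - the leading (high) surrogate
low  = 0xDC00 ||| (offset &&& 0x3FF)    # 0xDE00 - the trailing (low) surrogate

<<cp::utf16>> == <<high::16, low::16>>  # true: the built-in utf16 segment builds the same pair (big-endian)
On top of that bit layout, a UTF-16 stream then stores each 16-bit code unit as two bytes in either big- or little-endian order, which is where the endianness mentioned above comes in.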

Unicode scalar value in Swift

In The Swift Programming Language (3.0), in the chapter on strings and characters, the book states:
A Unicode scalar is any Unicode code point in the range U+0000 to
U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do
not include the Unicode surrogate pair code points, which are the code
points in the range U+D800 to U+DFFF inclusive
What does this mean?
A Unicode scalar is any code point except those reserved for use as the halves of a UTF-16 surrogate pair (the surrogate code points).
A code point is the number assigned to a character in the Unicode standard. For instance, the code point of the letter A is 0x41 (or 65 in decimal).
A code unit is each group of bits used in the serialisation of a code point. For instance, UTF-16 uses one or two code units of 16 bit each.
The letter A is a Unicode scalar, and it happens to be expressible with only one UTF-16 code unit: 0x0041. Less common characters (those above U+FFFF) are also Unicode scalars, but they require two UTF-16 code units; such a pair of code units is called a surrogate pair. The code point values that UTF-16 reserves for building surrogate pairs, U+D800..U+DFFF, are the surrogate code points - so a Unicode scalar may also be defined as any code point except the surrogate code points.
The answer from courteouselk is correct, by the way; this is just a more plain-English version.
From Unicode FAQ:
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
Basically, surrogates are code points that are reserved for this special purpose and are promised never to encode a character on their own; they only carry meaning in UTF-16, as the leading or trailing half of a surrogate pair.
[UPD] Also, from wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case[citation needed] and Windows allows such sequences in filenames.
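As a quick illustration of that last paragraph in Elixir (whose strings are strict UTF-8): the "trivial and obvious" three-byte encoding of U+D800 is rejected as invalid, while a properly encoded supplementary-plane code point is accepted.
iex> String.valid?(<<0xED, 0xA0, 0x80>>)        # the "obvious" UTF-8-style bytes for U+D800
false
iex> String.valid?(<<0xF0, 0x90, 0x80, 0x80>>)  # U+10000, properly encoded in UTF-8
true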

Why were the code points in the range of U+D800 to U+DFFF removed from the Unicode character set?

I am learning about UTF-16 encoding, and I have read that if you want to represent code points in the range of U+10000 to U+10FFFF, then you have to use surrogate pairs, which are in the range of U+D800 to U+DFFF.
So let's say I want to encode the following code point: U+10123 (10000000100100011 in binary):
First I subtract 0x10000 from the code point, which leaves the 20-bit value 0x00123 (00000000000100100011 in binary). Then I lay out this sequence of bits:
110110xxxxxxxxxx 110111xxxxxxxxxx
and fill the places marked x with those 20 bits:
1101100000000000 1101110100100011 (D800 DD23 in hexadecimal)
I have also read that the code points in the range of U+D800 to U+DFFF were removed from the Unicode character set, but I don't understand why this range was removed!
I mean, this range could easily be encoded in 4 bytes as well; for example, filling the same bit template directly with the value of the U+D812 code point (1101100000010010 in binary, without applying the 0x10000 offset) would give:
1101100000110110 1101110000010010 (D836 DC12 in hexadecimal)
Note: I was using UTF-16 Big Endian in my examples.
Codepoints U+D800 - U+DFFF are reserved exclusively¹ for use with UTF-16. Since they are not in the range of U+10000 - U+10FFFF, UTF-16 would not encode them individually using surrogate pairs, so it would be ambiguous (and illegal²) for these individual codepoints to appear un-encoded in a UTF-16 sequence.
Per the Unicode.org UTF-16 FAQ:
1: Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
2: Q: Are there any 16-bit values that are invalid?
A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.
I don't have an official source to back this up, but I believe it was to prevent confusion, so that you couldn't get a code sequence that could be interpreted as both valid UTF-16 and valid UCS-2. The loss of 2048 code points was nothing compared to the addition of 1048576 new ones.
Since encoding a code point as a surrogate pair starts by subtracting 0x010000, encoding U+D812 that way would lead to a negative number. The point of this subtraction is to gain 65,536 extra code points - the pairs map to U+10000..U+10FFFF rather than overlapping U+0000..U+FFFFF - instead of re-encoding the 2,048 values that were set aside. Maybe this will prove useful, should the whole codespace be used up in the distant future.
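A small numeric check of that point (the decode function here is just an illustrative sketch):
import Bitwise

decode = fn hi, lo -> 0x10000 + ((hi - 0xD800) <<< 10) + (lo - 0xDC00) end

decode.(0xD800, 0xDC00)  # 65536   (0x10000,  the lowest pair)
decode.(0xDBFF, 0xDFFF)  # 1114111 (0x10FFFF, the highest pair)
# Without the 0x10000 offset, the same 2^20 pair values would only cover
# 0x00000..0xFFFFF, overlapping the BMP instead of adding 65,536 new code points.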

Naming convention for less than utf32

Unicode UTF-32 values we can call codepoints, though I suppose even that is imprecise, since a single surrogate is itself a codepoint. UTF-8 can be called multi-byte or multi-octet. But what about UTF-16 and UCS-2? They aren't exactly multi-byte, since they deal in 2-byte units, and I think "multi-word" is more of an MS naming scheme.
What is a more accurate name for the pieces an encoded codepoint can be made up of - bytes, as in UTF-8, or 16-bit words, as in UTF-16?
I believe the term you're looking for is 'code unit'.
Code points are simply integral values that may be assigned a character in a character set.
A code unit is a fixed width integer representation used in sequences to represent encoded text. UTF-8, UTF-16, and UTF-32 are all encodings, and use 8, 16, and 32 bit code units respectively.
UTF-32 is unique among the three in that its code unit values are always exactly the code point values of the represented Unicode data.
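As a quick illustration of code units versus code points (an Elixir sketch; U+1F600 is just an arbitrary supplementary-plane example):
byte_size(<<0x1F600::utf8>>)             # 4 - four 8-bit code units in UTF-8
div(byte_size(<<0x1F600::utf16>>), 2)    # 2 - two 16-bit code units in UTF-16
div(byte_size(<<0x1F600::utf32>>), 4)    # 1 - one 32-bit code unit in UTF-32
<<0x1F600::utf32>> == <<0x1F600::32>>    # true - the UTF-32 code unit equals the code point value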
'multi-byte' can appropriately be used in reference to UTF-16. (And 'Unicode' can be used in reference to UTF-8; Microsoft's usage of the terminology is misleading on both counts.)
a single surrogate is itself a codepoint.
Unicode classifies code points in the range [U+D800-U+DFFF] as surrogates. These code points are never used as such, however. They are reserved and cannot be used because UTF-16 cannot represent them: to represent a code point in this range, UTF-16 would have to use code units in the range [0xD800-0xDFFF], but UTF-16 already uses code unit values in that range to represent code points above U+FFFF, and therefore cannot also use them to represent code points in the range [U+D800-U+DFFF].