In Swift, how is a UTF-16 surrogate pair represented at the bit level?

I'm currently learning Swift using the book The Swift Programming Language (3.1).
In the book, it states that Swift's String and Character types are fully Unicode compliant, with each character represented by a 21-bit Unicode scalar value. Each character can be viewed via its UTF-8, UTF-16, or UTF-32 representation.
I understand how UTF-8 and UTF-32 work at the byte and bit level, but I'm having trouble understanding how UTF-16 works at the bit level.
I know that for characters whose code point fits into 16 bits, UTF-16 simply represents the character as a 16-bit number. But for characters whose representation requires more than 16 bits, two 16-bit blocks are used (called a surrogate pair, I believe).
But how are the two 16-bit blocks represented at the bit level?

A "Unicode Scalar Value" is
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
Every Unicode scalar value can be represented as a sequence of one or two UTF-16 code units, as described in the
Unicode Standard:
D91 UTF-16 encoding form
The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
Table 3-5. UTF-16 Bit Distribution
Scalar Value                  UTF-16
xxxxxxxxxxxxxxxx              xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx      110110wwwwxxxxxx 110111xxxxxxxxxx
Note: wwww = uuuuu - 1
There are 2^20 Unicode scalar values in the "Supplementary Planes" (U+10000..U+10FFFF), which means that 20 bits are sufficient to encode
all of them in a surrogate pair. Technically this is done by subtracting
0x10000 from the scalar value before splitting it into two blocks of 10 bits.
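To make the arithmetic concrete, here is a small Swift sketch of my own (not from the Unicode Standard or the Swift book) that applies Table 3-5 by hand and compares the result with the standard library's utf16 view:

// Encode a supplementary-plane scalar as a surrogate pair by hand (Table 3-5).
let scalar: UInt32 = 0x1F600                    // U+1F600, 😀
let offset = scalar - 0x10000                   // 20-bit offset
let high = UInt16(0xD800 + (offset >> 10))      // leading (high) surrogate
let low  = UInt16(0xDC00 + (offset & 0x3FF))    // trailing (low) surrogate

print(String(high, radix: 16), String(low, radix: 16))     // d83d de00
print("😀".utf16.map { String($0, radix: 16) })             // ["d83d", "de00"]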

The UTF-16 range D800..DFFF is reserved. Code units below and above that range are simple 16-bit code points. A code unit in D800..DBFF, minus D800, gives the high 10 bits of the 20-bit offset (code point minus 10000) for code points above FFFF; the next code unit, in DC00..DFFF, minus DC00, gives the low 10 bits. Of course, endianness comes into the picture, making us all wish we could just use UTF-8. Sigh.
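Decoding is the same arithmetic in reverse; a quick Swift sketch (mine, for illustration only):

// Reassemble the scalar value from a high/low surrogate pair.
let high: UInt16 = 0xD83D
let low:  UInt16 = 0xDE00
let value = 0x10000 + ((UInt32(high) - 0xD800) << 10) + (UInt32(low) - 0xDC00)
print(String(value, radix: 16))                             // 1f600
if let u = Unicode.Scalar(value) { print(Character(u)) }    // 😀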

Related

How does UTF-16 encoding use surrogate code points?

According to the Unicode specification:
D91 UTF-16 encoding form: The Unicode encoding form that assigns each
Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF
to a single unsigned 16-bit code unit with the same numeric value as
the Unicode scalar value, and that assigns each Unicode scalar value
in the range U+10000..U+10FFFF to a surrogate pair.
The term "scalar value" is referred to unicode code points, that is the range of abstract ideas which must be encoded into specific byte sequence by different encoding forms (UTF-16 and so on). So it seems that this excerpt gist is in view of not all code points can be accommodated into one UTF-16 code unit (two bytes), there are ones which should be encoded into a pair of code units - 4 bytes (it's called "a surrogate pair").
However, the very term "scalar value" is defined as follows:
D76 Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.
Wait... Does Unicode have surrogate code points? What's the reason for them when UTF-16 can use 4 bytes to represent such code points? Can anyone explain the rationale and how UTF-16 uses these code points?
Yes, Unicode reserves ranges for surrogate code points:
High Surrogate Area U+D800 to U+DBFF
Low Surrogate Area U+DC00 to U+DFFF
Unicode reserves these ranges because these 16-bit values are used in surrogate pairs and no other symbol can be assigned to them. Surrogate pairs are two 16-bit values that encode code points above U+FFFF that do not fit into a single 16-bit value.
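For example (a sketch of my own in Swift, not part of the answer above), classifying a UTF-16 code unit is just a pair of range checks against these reserved areas:

// Classify UTF-16 code units by the reserved surrogate ranges.
func describe(_ unit: UInt16) -> String {
    switch unit {
    case 0xD800...0xDBFF: return "high surrogate"
    case 0xDC00...0xDFFF: return "low surrogate"
    default:              return "encodes a scalar value on its own"
    }
}

for unit in "A😀".utf16 {
    print(String(unit, radix: 16), "-", describe(unit))
}
// 41 - encodes a scalar value on its own
// d83d - high surrogate
// de00 - low surrogate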
Just for the sake of ultimate clarification.
UTF-16 uses 16-bit (2-byte) code units. That means this encoding form encodes code points (= abstract values that must be represented in computer memory in some way), as a rule, into 16 bits, so a decoder reads the data two bytes at a time.
UTF-16 does its best to be straightforward: the U+000E code point would be encoded as 000E, U+000F as 000F, and so on.
The issue is that 16 bits can only cover a range that is not sufficient to accommodate all Unicode code points (0000..FFFF allows only 65,536 possible values). We might simply use two 16-bit words (4 bytes) for the code points beyond this range (actually, my misunderstanding was about why UTF-16 doesn't do exactly that). However, this approach makes it impossible to decode some values unambiguously. For example, if we encode the U+10000 code point as 0001 0000 (hex notation), how on earth should a decoder interpret such a representation: as two subsequent code points, U+0001 and U+0000, or as the single code point U+10000?
The Unicode specification settles on a better way. If there is a need to encode the range U+10000..U+10FFFF (1,048,576 code points), then we should set apart 1,024 + 1,024 = 2,048 values from those that can be encoded with 16 bits (the spec chose the D800..DFFF range for this). And when a decoder encounters a value in D800..DBFF (the High Surrogate Area) in memory, it knows that this is not a "fully-fledged" code point (not a scalar value in terms of the spec) and that it should read another 16 bits, get a value in the DC00..DFFF range (the Low Surrogate Area), and finally work out which of the U+10000..U+10FFFF code points was encoded with these 4 bytes (with this surrogate pair). Note that such a scheme makes it possible to encode 1,024 * 1,024 = 1,048,576 code points - exactly the amount we need.
Because the Unicode codespace is defined as the range of integers from 0 to 10FFFF, we have to introduce the concept of surrogate code points (not code units) - the U+D800..U+DFFF range - since we can't exclude this range from the Unicode codespace. And since surrogate code points are designated for surrogate code units in UTF-16 (see C1, D74), these code points can look like a UTF-16 relic.
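Swift makes this distinction visible: its Unicode.Scalar type refuses to be built from a surrogate code point. A tiny sketch of mine:

// Unicode.Scalar models scalar values only, so surrogate code points are rejected.
print(Unicode.Scalar(0xD7FF as UInt32) != nil)    // true  (just below the surrogate range)
print(Unicode.Scalar(0xD800 as UInt32) != nil)    // false (high-surrogate code point)
print(Unicode.Scalar(0xDFFF as UInt32) != nil)    // false (low-surrogate code point)
print(Unicode.Scalar(0xE000 as UInt32) != nil)    // true  (just above the range)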

Why are Unicode code points always written with at least 2 bytes?

Why are Unicode code points always written with at least 2 bytes (4 hex digits), even when that's not necessary?
From the Wikipedia page about UTF-8:
$ -> U+0024
¢ -> U+00A2
TL;DR This is all by convention of the Unicode Consortium.
Here is the formal definition, found in Appendix A: Notational Conventions of the Unicode standard (I've referenced the latest at this time, version 11):
In running text, an individual Unicode code point is expressed as U+n, where n is four to
six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15,
respectively). Leading zeros are omitted, unless the code point would have fewer than four
hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345,
U+102345.
They are hexadecimal digits that represent Unicode scalar values. Initially only the first plane, called the Basic Multilingual Plane, was defined, which supported the range U+0000 to U+FFFF. The U+ notation therefore always had 4 hexadecimal digits at first.
However, that only allows 64 Ki (65,536) code points for characters (excluding some reserved values). So the single plane was later extended to 17 planes. Leading zeros are suppressed for values of U+10000 or higher, so the next code point is written as U+10000, not U+010000. Currently there are 17 planes of 64 Ki code points each (some of which may be reserved), starting at U+0000, U+10000, U+20000, ... through U+F0000 and finally U+100000.
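The convention is easy to reproduce programmatically; here is a Swift sketch of my own (codePointNotation is a made-up helper name, not standard API) that pads to a minimum of four uppercase hex digits:

// Format a scalar in the conventional U+n form: 4 to 6 uppercase hex digits.
func codePointNotation(_ scalar: Unicode.Scalar) -> String {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    return "U+" + String(repeating: "0", count: max(0, 4 - hex.count)) + hex
}

print(codePointNotation("$"))    // U+0024
print(codePointNotation("¢"))    // U+00A2
print(codePointNotation("😀"))   // U+1F600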
The U+xxxx notation does not follow the UTF-8 encoding. Nor does it follow UTF-16, UTF-32 or the deprecated UCS encodings, in either big or little endian form. The encoding of characters within the Basic Multilingual Plane is, however, identical to UTF-16(BE) written in hexadecimal. Note that UTF-16 may contain surrogate code units that are used as escapes to encode the characters in the other planes. The ranges of those code units are not mapped to characters and will therefore never appear in the textual code point representation.
See for instance, the PLUS-MINUS SIGN, ±:
Unicode code point: U+00B1 (as a textual string)
UTF-8 : 0xC2 0xB1 (as two bytes)
UTF-16 : 0x00B1
UTF-16BE : 0x00B1 as 0x00 0xB1 (as two bytes)
UTF-16LE : 0x00B1 as 0xB1 0x00 (as two bytes)
https://www.fileformat.info/info/unicode/char/00b1/index.htm
Much of this information can be found at sil.org.

UTF8, codepoints, and their representation in Erlang and Elixir

Going through Elixir's handling of Unicode:
iex> String.codepoints("abc§")
["a", "b", "c", "§"]
Very good; and byte_size/1 of this is not 4 but 5, because the last char takes 2 bytes, I get that.
The ? operator (or is it a macro? can't find the answer) tells me that
iex(69)> ?§
167
Great; so then I look into the UTF-8 encoding table, and see the value c2 a7 as the hex encoding for the char. That means the two bytes (as witnessed by byte_size/1) are c2 (194 in decimal) and a7 (167 in decimal). That 167 is the result I got when evaluating ?§ earlier. What I don't understand, exactly, is why that number is a "code point", as per the description of the ? operator. When I try to work backwards and evaluate the binary, I get what I want:
iex(72)> <<0xc2, 0xa7>>
"§"
And to make me go completely bananas, this is what I get in Erlang shell:
24> <<167>>.
<<"§">>
25> <<"\x{a7}">>.
<<"§">>
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>>
27> <<"\x{c2a7}">>.
<<"§">>
!! while Elixir is only happy with the code above... What is it that I don't understand? Why is Erlang perfectly happy with a single byte, given that Elixir insists that the char takes 2 bytes - and the Unicode table seems to agree?
The codepoint is what identifies the Unicode character. The codepoint for § is 167 (0xA7). A codepoint can be represented in bytes in different ways, depending on your encoding of choice.
The confusion here comes from the fact that the codepoint 167 (0xA7) is identified by the bytes 0xC2 0xA7 when encoded to UTF-8.
When you add Erlang to the conversation, you have to remember Erlang's default encoding was/is latin1 (there is an effort to migrate to UTF-8 but I am not sure if it made it to the shell - someone please correct me).
In latin1, the codepoint for § (0xA7) is also represented by the byte 0xA7. So, explaining your results directly:
24> <<167>>.
<<"§">> %% this is encoded in latin1
25> <<"\x{a7}">>.
<<"§">> %% still latin1
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>> %% this is encoded in utf8, as the /utf8 modifier says
27> <<"\x{c2a7}">>.
<<"§">> %% this is latin1
The last one is quite interesting and potentially confusing. In Erlang binaries, if you pass an integer with a value greater than 255, it is truncated. So the last example is effectively doing <<49831>>, which when truncated becomes <<167>>, which is again equivalent to <<"§">> in latin1.
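The latin1-versus-UTF-8 point can also be checked from Swift, for what it's worth - a sketch of mine using Foundation's String(bytes:encoding:), which, as far as I know, returns nil for ill-formed input:

import Foundation

// The single byte 0xA7 is "§" in Latin-1, but it is not valid UTF-8 on its own.
let latin1Byte: [UInt8] = [0xA7]
let utf8Bytes:  [UInt8] = [0xC2, 0xA7]
print(String(bytes: latin1Byte, encoding: .isoLatin1) ?? "nil")   // §
print(String(bytes: latin1Byte, encoding: .utf8) ?? "nil")        // nil (ill-formed UTF-8)
print(String(bytes: utf8Bytes, encoding: .utf8) ?? "nil")         // §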
The code point is a number assigned to the character. It's an abstract value, not dependent on any particular representation in actual memory somewhere.
In order to store the character, you have to convert the code point to some sequence of bytes. There are several different ways to do this; each is called a Unicode Transformation Format, and named UTF-n, where n is the number of bits in the basic unit of encoding. There used to be a UTF-7, used where 7-bit ASCII was assumed and even the 8th bit of a byte couldn't be reliably transmitted; in modern systems, there are UTF-8, UTF-16, and UTF-32.
Since the largest code point value fits comfortably in 21 bits, UTF-32 is the simplest; you just store the code point as a 32-bit integer. (There could theoretically be a UTF-24 or even a UTF-21, but common modern computing platforms deal naturally with values that take up either exactly 8 or a multiple of 16 bits, and have to work harder to deal with anything else.)
So UTF-32 is simple, but inefficient. Not only does it have 11 extra bits that will never be needed, it has 5 bits that are almost never needed. Far and away most Unicode characters found in the wild are in the Basic Multilingual Plane, U+0000 through U+FFFF. UTF-16 lets you represent all of those code points as a plain integer, taking up half the space of UTF-32. But it can't represent anything from U+10000 on up that way, so part of the 0000-FFFF range is reserved as "surrogate pairs" that can be put together to represent a high-plane Unicode character with two 16-bit units, for a total of 32 bits again but only when needed.
Java uses UTF-16 internally, but Erlang (and therefore Elixir), along with most other programming systems, uses UTF-8. UTF-8 has the advantage of completely transparent compatibility with ASCII - all characters in the ASCII range (U+0000 through U+007F, or 0-127 decimal) are represented by single bytes with the corresponding value. But any characters with code points outside the ASCII range require more than one byte each - even those in the range U+0080 through U+00FF, decimal 128 through 255, which only take up one byte in the Latin-1 encoding that used to be the default before Unicode.
So with Elixir/Erlang "binaries", unless you go out of your way to encode things differently, you are using UTF-8. If you look at the high bit of the first byte of a UTF-8 character, it's either 0, meaning you have a one-byte ASCII character, or it's 1. If it's 1, then the second-highest bit is also 1, because the number of consecutive 1-bits counting down from the high bit before you get to a 0 bit tells you how many bytes total the character takes up. So the pattern 110xxxxx means the character is two bytes, 1110xxxx means three bytes, and 11110xxx means four bytes. (There is no legal UTF-8 character that requires more than four bytes, although the encoding could theoretically support up to seven.)
The rest of the bytes all have the two high bits set to 10, so they can't be mistaken for the start of a character. And the rest of the bits are the code point itself.
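That lead-byte rule can be written down mechanically; a small Swift sketch of my own (utf8SequenceLength is a made-up helper name):

// How many bytes a UTF-8 sequence occupies, judged from its first byte.
func utf8SequenceLength(leadByte: UInt8) -> Int? {
    switch leadByte {
    case 0b0000_0000...0b0111_1111: return 1    // 0xxxxxxx: ASCII
    case 0b1100_0000...0b1101_1111: return 2    // 110xxxxx
    case 0b1110_0000...0b1110_1111: return 3    // 1110xxxx
    case 0b1111_0000...0b1111_0111: return 4    // 11110xxx
    default:                        return nil  // 10xxxxxx continuation byte, or invalid
    }
}

print(utf8SequenceLength(leadByte: 0xC2) ?? -1)   // 2 (first byte of "§" in UTF-8)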
To use your case as an example, the code point for "§" is U+00A7 - that is, hexadecimal A7, which is decimal 167 or binary 10100111. Since that's greater than decimal 127, it will require two bytes in UTF-8. Those two bytes will have the binary form 110abcde 10fghijk, where the bits abcdefghijk will hold the code point. So the binary representation of the code point, 10100111, is padded out to 00010100111 and split into the sequences 00010, which replaces abcde in the UTF-8 template, and 100111, which replaces fghijk. That yields two bytes with binary values 11000010 and 10100111, which are C2 and A7 in hexadecimal, or 194 and 167 in decimal.
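Written out in Swift as a sanity check (my own sketch of the same bit arithmetic, not library code):

// Build the two-byte UTF-8 form 110abcde 10fghijk for a code point below 0x800.
let codePoint: UInt32 = 0xA7                               // "§", U+00A7
let byte1 = UInt8(0b1100_0000 | (codePoint >> 6))          // 110 + high 5 bits -> 0xC2
let byte2 = UInt8(0b1000_0000 | (codePoint & 0b11_1111))   // 10 + low 6 bits  -> 0xA7

print(String(byte1, radix: 16), String(byte2, radix: 16))  // c2 a7
print("§".utf8.map { String($0, radix: 16) })              // ["c2", "a7"]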
You'll notice that the second byte coincidentally has the same value as the code point you're encoding, but it's important to realize that this correspondence is just a coincidence. There are a total of 64 code points, from 128 (U+0080) through 191 (U+00BF), that work out that way: their UTF-8 encoding consists of a byte with decimal value 194 followed by a byte whose value is equal to the code point itself. But for the other 1,114,048 code points possible in Unicode, that is not the case.

Unicode scalar value in Swift

In The Swift Programming Language (3.0), in the chapter on strings and characters, the book states:
A Unicode scalar is any Unicode code point in the range U+0000 to
U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do
not include the Unicode surrogate pair code points, which are the code
points in the range U+D800 to U+DFFF inclusive
What does this mean?
A Unicode scalar is any code point that is not itself used as one half of a UTF-16 surrogate pair - that is, any code point except the surrogate code points.
A code point is the number resulting from encoding a character in the Unicode standard. For instance, the code point of the letter A is 0x41 (or 65 in decimal).
A code unit is each group of bits used in the serialisation of a code point. For instance, UTF-16 uses one or two code units of 16 bit each.
The letter A is a Unicode scalar, and it can be expressed with only one code unit: 0x0041. Less common characters require two UTF-16 code units; this pair of code units is called a surrogate pair, and the code unit values used to form such pairs are set aside as the surrogate code points. Thus, a Unicode scalar may also be defined as: any code point except the surrogate code points themselves.
The answer from courteouselk is correct, by the way; this is just a more plain-English version.
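Seen from Swift, the same distinction is the difference between a string's unicodeScalars view (code points) and its utf16 view (code units); a quick sketch of my own:

// One scalar (code point) may need one or two UTF-16 code units.
for character in "A😀" {
    let s = String(character)
    print(character,
          "scalars:", s.unicodeScalars.map { String($0.value, radix: 16) },
          "utf16:",   s.utf16.map { String($0, radix: 16) })
}
// A scalars: ["41"] utf16: ["41"]
// 😀 scalars: ["1f600"] utf16: ["d83d", "de00"]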
From Unicode FAQ:
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
Basically, surrogates are code points that are reserved for this special purpose and are promised to never encode a character on their own, but only as part of a surrogate pair in the UTF-16 encoding.
[UPD] Also, from Wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case[citation needed] and Windows allows such sequences in filenames.
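Swift behaves accordingly: as far as I know, String(decoding:as:) repairs ill-formed input with U+FFFD, so an unpaired surrogate in UTF-16 data does not survive decoding (a sketch of mine):

// An unpaired high surrogate is ill-formed UTF-16; lenient decoding replaces it with U+FFFD.
let wellFormed: [UInt16] = [0xD83D, 0xDE00]    // a valid surrogate pair: 😀
let illFormed:  [UInt16] = [0xD83D, 0x0041]    // high surrogate not followed by a low one

print(String(decoding: wellFormed, as: UTF16.self))                   // 😀
let repaired = String(decoding: illFormed, as: UTF16.self)
print(repaired.unicodeScalars.map { String($0.value, radix: 16) })    // ["fffd", "41"]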

Clarification on Joel Spolsky's Unicode article

I'm reading the popular Unicode article from Joel Spolsky and there's one illustration that I don't understand.
What does "Hex Min, Hex Max" mean? What do those values represent? Min and max of what?
Binary can only have 1 or 0. Why do I see tons of letter "v" here?
http://www.joelonsoftware.com/articles/Unicode.html
The Hex Min/Max columns define the range of Unicode code points (typically represented by their Unicode number in hex) covered by each row of the table.
The v's refer to the bits of the original code point number.
So the first line is saying:
The unicode characters in the range 0 (hex 00) to 127 (hex 7F) (a 7
bit number) are represented by a 1 byte bit string starting with '0'
followed by all 7 bits of the unicode number.
The second line is saying:
The Unicode numbers in the range 128 (hex 0080) to 2047 (hex 07FF) (an 11
bit number) are represented by a 2 byte bit string where the first
byte starts with '110' followed by the first 5 of the 11 bits, and the
second byte starts with '10' followed by the remaining 6 of the 11 bits
etc
Hope that makes sense
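Those Hex Min/Max boundaries are easy to check empirically; for example, in Swift (a sketch of mine using the utf8 view):

// The UTF-8 byte count jumps exactly at the Hex Min/Max boundaries of the table.
for s in ["\u{7F}", "\u{80}", "\u{7FF}", "\u{800}", "\u{FFFF}", "\u{10000}", "\u{10FFFF}"] {
    let value = s.unicodeScalars.first!.value
    print("U+" + String(value, radix: 16, uppercase: true), "->", s.utf8.count, "UTF-8 byte(s)")
}
// U+7F -> 1, U+80 -> 2, U+7FF -> 2, U+800 -> 3, U+FFFF -> 3, U+10000 -> 4, U+10FFFF -> 4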
Note that the table in Joel's article covers code points that do not, and never will, exist in Unicode. In fact, UTF-8 never needs more than 4 bytes, though the scheme underlying UTF-8 could be extended much further, as shown.
A more nuanced version of the table is available in How does a file with Chinese characters know how many bytes to use per character? It points out some of the gaps. For example, the bytes 0xC0, 0xC1, and 0xF5..0xFF can never appear in valid UTF-8. You can also see information about invalid UTF-8 at Really good bad UTF-8 example test data.
In the table you showed, the Hex Min and Hex Max values are the minimum and maximum U+wxyz values that can be represented using the number of bytes in the 'byte sequence in binary' column. Note that the maximum code point in Unicode is U+10FFFF (which is defined/reserved as a noncharacter). This is the maximum value that can be represented in UTF-16 using the surrogate encoding scheme, with just 4 bytes (two UTF-16 code units).