What is the difference between Unicode code points and Unicode scalars? - unicode

I see the two being used (seemingly) interchangeably in many cases - are they the same or different? This also seems to depend on whether the language is talking about UTF-8 (e.g. Rust) or UTF-16 (e.g. Java/Haskell). Is the code point/scalar distinction somehow dependent on the encoding scheme?

First, let's look at definitions D9, D10 and D10a in Section 3.4, Characters and Encoding:
D9 Unicode codespace:
A range of integers from 0 to 10FFFF₁₆.
D10 Code point:
Any value in the Unicode codespace.
• A code point is also known as a code position.
...
D10a Code point type:
Any of the seven fundamental classes of code points in the standard:
Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
[emphasis added]
Okay, so code points are integers in a certain range. They are divided into categories called "code point types".
Now let's look at definition D76, Section 3.9, Unicode Encoding Forms:
D76 Unicode scalar value:
Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
Surrogates are defined and explained in Section 3.8, just before D76. The gist is that surrogates are divided into two categories: high-surrogates and low-surrogates. These are used only by UTF-16 so that it can represent all scalar values. (There are 1,112,064 scalars, but 2¹⁶ = 65,536 is much less than that.)
UTF-8 doesn't have this problem: it is a variable-length encoding scheme (a code point is encoded as 1 to 4 bytes), so it can encode all scalars without using surrogates.
Summary: a code point is either a scalar or a surrogate. A code point is merely a number in the most abstract sense; how that number is encoded into binary form is a separate issue. UTF-16 uses surrogate pairs because it can't directly represent all possible scalars. UTF-8 doesn't use surrogate pairs.
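This distinction is easy to see in practice. Here is a quick Python sketch (the example values are mine): any code point can be created as a number with chr(), but only scalar values survive encoding to a UTF.

```python
# A surrogate code point exists as a number, but it is not a scalar,
# so every UTF encoder rejects it.
surrogate = chr(0xD800)   # a code point, but NOT a scalar value
scalar = chr(0x1F600)     # a scalar value above U+FFFF (an emoji)

print(scalar.encode("utf-8"))   # fine: b'\xf0\x9f\x98\x80'

try:
    surrogate.encode("utf-8")   # lone surrogates are not encodable
except UnicodeEncodeError as e:
    print("surrogates are not encodable:", e.reason)
```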
In the future, you might find consulting the Unicode glossary helpful. It contains many of the frequently used definitions, as well as links to the definitions in the Unicode specification.

Related

How UTF-16 encoding uses surrogate code points?

According to the Unicode specification:
D91 UTF-16 encoding form: The Unicode encoding form that assigns each
Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF
to a single unsigned 16-bit code unit with the same numeric value as
the Unicode scalar value, and that assigns each Unicode scalar value
in the range U+10000..U+10FFFF to a surrogate pair.
The term "scalar value" refers to Unicode code points, that is, the range of abstract values that must be encoded into specific byte sequences by the different encoding forms (UTF-16 and so on). The gist of this excerpt seems to be that since not all code points can be accommodated in one UTF-16 code unit (two bytes), some must be encoded into a pair of code units, 4 bytes in total (called "a surrogate pair").
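A quick Python check (my own example) of what D91 describes: scalars up to U+FFFF become one 16-bit code unit, while scalars above U+FFFF become a surrogate pair.

```python
# Scalars in U+0000..U+FFFF take one 16-bit code unit; scalars in
# U+10000..U+10FFFF take two (a surrogate pair, 4 bytes).
bmp = "\u0041"         # U+0041, LATIN CAPITAL LETTER A
astral = "\U0001F600"  # U+1F600, above U+FFFF

print(len(bmp.encode("utf-16-be")))     # 2 bytes  = one code unit
print(len(astral.encode("utf-16-be")))  # 4 bytes  = a surrogate pair
```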
However, the very term "scalar value" is defined as follows:
D76 Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.
Wait... does Unicode have surrogate code points? What is the reason for them, when UTF-16 can use 4 bytes to represent scalar values? Can anyone explain the rationale and how UTF-16 uses these code points?
Yes, Unicode reserves ranges for surrogate code points:
High Surrogate Area U+D800 to U+DBFF
Low Surrogate Area U+DC00 to U+DFFF
Unicode reserves these ranges because these 16-bit values are used in surrogate pairs and no other symbol can be assigned to them. Surrogate pairs are two 16-bit values that encode code points above U+FFFF that do not fit into a single 16-bit value.
Just for the sake of ultimate clarification.
UTF-16 uses 16-bit (2-byte) code units. This means the encoding form encodes code points (abstract values that must somehow be represented in computer memory) into units of 16 bits, so a decoder reads the data two bytes at a time.
UTF-16 does this quite straightforwardly: the code point U+000E is encoded as 000E, U+000F as 000F, and so on.
The issue is that 16 bits yield a range that cannot accommodate all Unicode code points (0000..FFFF allows only 65,536 possible values). We might use two 16-bit words (4 bytes) for the code points beyond this range (in fact, my misunderstanding was about why UTF-16 doesn't simply do so). However, that approach makes some values impossible to decode. For example, if we encode the code point U+10000 as 0001 0000 (hex notation), how should a decoder interpret that representation: as the two subsequent code points U+0001 and U+0000, or as the single code point U+10000?
The Unicode specification takes a better way. If there is a need to encode the range U+10000..U+10FFFF (1,048,576 code points), then we set apart 1,024 + 1,024 = 2,048 values from those that can be encoded with 16 bits (the spec chose the range D800..DFFF for this). When a decoder encounters a value in D800..DBFF (the High Surrogate Area), it knows this is not a "fully-fledged" code point (not a scalar value in terms of the spec) and that it should read another 16 bits, a value in DC00..DFFF (the Low Surrogate Area), to finally work out which code point in U+10000..U+10FFFF was encoded with these 4 bytes (with this surrogate pair). Note that this scheme makes it possible to encode 1,024 * 1,024 = 1,048,576 code points, which is exactly the amount we need.
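The surrogate-pair arithmetic above can be sketched in Python (the function and variable names are mine) and cross-checked against a real UTF-16 encoder: the 20 bits of (code point − 0x10000) split into two 10-bit halves, placed into the high and low surrogate ranges.

```python
# Sketch of the surrogate-pair arithmetic described above.
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits -> High Surrogate Area
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> Low Surrogate Area
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder:
print("\U0001F600".encode("utf-16-be").hex())  # 'd83dde00'
```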
Because the Unicode codespace is defined as a range of integers from 0 to 10FFFF, we have to introduce the concept of surrogate code points (not code units), the range U+D800..U+DFFF, since we can't exclude this range from the codespace. Given that surrogate code points exist only to serve as surrogate code units in UTF-16 (see C1, D74), these code points can look like a UTF-16 relic.

Unicode scalar value in Swift

In The Swift Programming Language (Swift 3.0), the chapter on strings and characters states:
A Unicode scalar is any Unicode code point in the range U+0000 to
U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do
not include the Unicode surrogate pair code points, which are the code
points in the range U+D800 to U+DFFF inclusive
What does this mean?
A Unicode scalar is any code point that is not itself a surrogate code point; the surrogates exist only to be used, in pairs, as UTF-16 code units.
A code point is the number resulting from encoding a character in the Unicode standard. For instance, the code point of the letter A is 0x41 (or 65 in decimal).
A code unit is each group of bits used in the serialisation of a code point. For instance, UTF-16 uses one or two code units of 16 bits each.
The letter A is a Unicode scalar; in UTF-16 it happens to fit into a single code unit, 0x0041. Less common characters require two UTF-16 code units; such a pair of code units is called a surrogate pair. The code point values reserved for those paired code units (U+D800 to U+DFFF) are the surrogates, so a Unicode scalar may also be defined as: any code point except the surrogate code points.
The answer from courteouselk is correct, by the way; this is just a more plain-English version.
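That definition translates directly into a predicate. A small Python sketch (the helper name is mine), using the exact ranges quoted from the Swift book:

```python
# A scalar is any code point outside the surrogate range.
def is_unicode_scalar(cp: int) -> bool:
    return (0x0000 <= cp <= 0xD7FF) or (0xE000 <= cp <= 0x10FFFF)

print(is_unicode_scalar(0x41))     # True  (letter A)
print(is_unicode_scalar(0xD800))   # False (high surrogate)
print(is_unicode_scalar(0x1F600))  # True  (emoji; needs a pair in UTF-16)
```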
From Unicode FAQ:
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
Basically, surrogates are code points that are reserved for this special purpose and are promised to never encode a character on their own; they appear only in pairs of UTF-16 code units.
[UPD] Also, from wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case[citation needed] and Windows allows such sequences in filenames.

Are surrogate pairs the only way to represent code points larger than 2 bytes in UTF-16?

I know that this is probably a stupid question, but I need to be sure on this issue. So I need to know for example if a programming language says that its String type uses UTF-16 encoding, does that mean:
it will use 2 bytes for code points in the range of U+0000 to U+FFFF.
it will use surrogate pairs for code points larger than U+FFFF (4 bytes per code point).
Or do some programming languages use their own "tricks" when encoding and not follow this standard 100%?
UTF-16 is a specified encoding, so if you "use UTF-16", then you do what it says and don't invent any "tricks" of your own.
I wouldn't talk about "two bytes" the way you do, though. That's a detail. The key part of UTF-16 is that you encode code points as a sequence of 16-bit code units, and pairs of surrogates are used to encode code points greater than 0xFFFF. The fact that one code unit is composed of two 8-bit bytes is a second layer of detail that applies to many systems (but there are systems with larger byte sizes where this isn't relevant), and in that case you may distinguish big- and little-endian representations.
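The two layers are easy to observe in Python (example mine): the sequence of 16-bit code units is fixed by UTF-16, while the byte order of each unit is a separate choice.

```python
# Same single code unit, two different byte serialisations.
print("A".encode("utf-16-be"))  # b'\x00A'  (big-endian: high byte first)
print("A".encode("utf-16-le"))  # b'A\x00'  (little-endian: low byte first)
print("A".encode("utf-16"))     # a BOM prefix marks the byte order chosen
```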
But looking the other direction, there's absolutely no reason why you should use UTF-16 specifically. Ultimately, Unicode text is just a sequence of numbers (of value up to 2²¹), and it's up to you how to represent and serialize those.
I would happily make the case that UTF-16 is a historic accident that we probably wouldn't have done if we had to redo everything now: It is a variable-length encoding just as UTF-8, so you gain no random access, as opposed to UTF-32, but it is also verbose. It suffers endianness problems, unlike UTF-8. Worst of all, it confuses parts of the Unicode standard with internal representation by using actual code point values for the surrogate pairs.
The only reason (in my opinion) that UTF-16 exists is because at some early point people believed that 16 bit would be enough for all humanity forever, and so UTF-16 was envisaged to be the final solution (like UTF-32 is today). When that turned out not to be true, surrogates and wider ranges were tacked onto UTF-16. Today, you should by and large either use UTF-8 for serialization externally or UTF-32 for efficient access internally. (There may be fringe reasons for preferring maybe UCS-2 for pure Asian text.)
UTF-16 per se is standard. However most languages whose strings are based on 16-bit code units (whether or not they claim to ‘support’ UTF-16) can use any sequence of code units, including invalid surrogates. For example this is typically an acceptable string literal:
"x \uDC00 y \uD800 z"
and usually you only get an error when you attempt to write it to another encoding.
Python's optional encode/decode error handler surrogateescape uses such invalid surrogates to smuggle tokens representing the single bytes 0x80–0xFF into standalone surrogate code units U+DC80–U+DCFF, producing strings like the literal above. This is typically only used internally, so you're unlikely to meet it in files or on the wire; and it only applies to UTF-16 in as much as Python's str datatype is based on 16-bit code units (which it is on 'narrow' builds of Python 3.0 through 3.3).
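A minimal sketch of surrogateescape in action (the byte values are arbitrary examples of mine): an undecodable byte such as 0x80 becomes the lone surrogate U+DC80, and encoding with the same handler restores the original bytes.

```python
raw = b"abc\x80def"                       # not valid UTF-8
s = raw.decode("utf-8", "surrogateescape")
print([hex(ord(c)) for c in s])           # the 0x80 byte shows up as 0xdc80
print(s.encode("utf-8", "surrogateescape") == raw)  # True: lossless round-trip
```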
I'm not aware of any other commonly-used extensions/variants of UTF-16.

Naming convention for less than utf32

Unicode UTF-32 values we can call codepoints, though I suppose even this is wrong since a single surrogate is itself a codepoint. UTF-8 can be called multi-byte or multi-octet. But what about UTF-16 and UCS-2. They aren't exactly multi-byte since they deal in 2 bytes, and I think multi-word is more of a MS naming scheme.
What is a more accurate name to describe UTF-32 codepoints that can be made up of bytes, as in UTF-8 and words as in UTF-16?
I believe the term you're looking for is 'code unit'.
Code points are simply integral values that may be assigned a character in a character set.
A code unit is a fixed width integer representation used in sequences to represent encoded text. UTF-8, UTF-16, and UTF-32 are all encodings, and use 8, 16, and 32 bit code units respectively.
UTF-32 is unique among the three in that its code unit values are always exactly the code point values of the represented Unicode data.
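These relationships are easy to verify in Python (example mine) by counting code units for one astral character; the "-be" variants avoid a BOM so the byte counts are exact.

```python
ch = "\U0001F600"  # one code point above U+FFFF
print(len(ch.encode("utf-8")))           # 4 bytes = 4 8-bit code units
print(len(ch.encode("utf-16-be")) // 2)  # 2 16-bit code units (a surrogate pair)
print(len(ch.encode("utf-32-be")) // 4)  # 1 32-bit code unit
# In UTF-32 the code unit value IS the code point value:
print(int.from_bytes(ch.encode("utf-32-be"), "big") == ord(ch))  # True
```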
'multi-byte' can appropriately be used in reference to UTF-16. (And 'Unicode' can be used in reference to UTF-8; Microsoft's usage of the terminology is misleading on both counts.)
a single surrogate is itself a codepoint.
Unicode classifies code points in the range [U+D800-U+DFFF] as surrogates. These code points are never used as such, however: they are reserved precisely because UTF-16 cannot represent them. To represent such a code point, UTF-16 would need code units in the range [0xD800-0xDFFF], but those code unit values are already used to represent code points above U+FFFF, so they cannot also stand for code points in the range [U+D800-U+DFFF].

Why is 'U+' used to designate a Unicode code point?

Why do Unicode code points appear as U+<codepoint>?
For example, U+2202 represents the character ∂.
Why not U- (dash or hyphen character) or anything else?
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990's software industry discussions about Unicode, is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came in to use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990's should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols U+HHHH and U-HHHHHHHH (p. 559).
The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:
u'xyz' to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx' to indicate a string with a unicode character denoted by four hex digits
'\Uxxxxxxxx' to indicate a string with a unicode character denoted by eight hex digits
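The listed literal forms in action (Python 3, where the u'' prefix is optional; examples mine):

```python
print("\u2202")                       # '∂'  (four hex digits, BMP code point)
print("\U0001F600" == chr(0x1F600))   # True (eight hex digits, astral code point)
print(ord("\u2202"))                  # 8706, i.e. 0x2202
```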
It depends on what version of the Unicode standard you are talking about. From Wikipedia:
Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits to indicate a code unit, not a code point.
It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)