Why do Unicode code points appear as U+<codepoint>?
For example, U+2202 represents the character ∂.
Why not U- (dash or hyphen character) or anything else?
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990s should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols U+HHHH and U-HHHHHHHH (p. 559).
The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:
u'xyz' to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx' to indicate a string with a unicode character denoted by four hex digits
'\Uxxxxxxxx' to indicate a string with a unicode character denoted by eight hex digits
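As a small illustration of those escapes, here is a Python 3 sketch that writes the ∂ character from the question in both forms and converts it back to "U+" notation. The variable names are made up for this example.

```python
# Both escape forms name the same code point, U+2202 (the character from the question).
short_form = '\u2202'        # backslash-u with exactly four hex digits
long_form = '\U00002202'     # backslash-U with exactly eight hex digits

# ord() returns the code point as an integer; format it in "U+" style.
label = 'U+{:04X}'.format(ord(short_form))

print(short_form == long_form, label)
```

In Python 3 every str is already a Unicode string, so the u'...' prefix is accepted but changes nothing.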
It depends on what version of the Unicode standard you are talking about. From Wikipedia:
Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits to indicate a code unit, not a code point.
It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)
Related
Why are Unicode code points always written with 2 bytes (4 digits) even when that's not necessary?
From the Wikipedia page about UTF-8 :
$ -> U+0024
¢ -> U+00A2
TL;DR This is all by convention of the Unicode Consortium.
Here is the formal definition, found in Appendix A: Notational Conventions of the Unicode standard (I've referenced the latest at this time, version 11):
In running text, an individual Unicode code point is expressed as U+n, where n is four to
six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15,
respectively). Leading zeros are omitted, unless the code point would have fewer than four
hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345,
U+102345.
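The quoted rule is easy to reproduce in code. The following is only a sketch in Python; `format_code_point` is a hypothetical helper name, not something defined by the standard.

```python
def format_code_point(cp: int) -> str:
    """Return cp in "U+n" form: uppercase hex, padded to at least four digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("value is outside the Unicode codespace")
    # 04X pads with leading zeros up to four digits and never truncates longer values.
    return f"U+{cp:04X}"


# The same sample values that the standard lists.
examples = [format_code_point(cp)
            for cp in (0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345)]
print(examples)
```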
They are hexadecimal digits that represent Unicode code points. Initially only the first plane, called the Basic Multilingual Plane, was available, which covered the range U+0000 to U+FFFF. The U+ notation therefore initially always had 4 hexadecimal digits.
However, that only allows 64 Ki (65536) code points to use for characters (excluding some reserved values). So the single plane was later extended to 17 planes. Leading zeros are suppressed for values of U+10000 or higher, so the first code point after U+FFFF is written as U+10000, not U+010000. Currently there are 17 planes of 64 Ki code points each (some of which may be reserved), starting at U+0000, U+10000, U+20000 ... U+F0000 and finally U+100000.
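The plane arithmetic can be checked with a few lines of Python. This is a sketch only; `PLANE_SIZE`, `PLANE_COUNT` and `plane_of` are names invented for the example.

```python
PLANE_SIZE = 0x10000   # 64 Ki code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def plane_of(cp: int) -> int:
    """Return the plane number of a code point (0 is the Basic Multilingual Plane)."""
    return cp >> 16


# First code point of every plane, written in U+ notation.
plane_starts = [f"U+{n * PLANE_SIZE:04X}" for n in range(PLANE_COUNT)]
print(plane_starts)
```

The last plane starts at U+100000 and ends at U+10FFFF, which is why the codespace holds 17 × 65536 = 1,114,112 code points.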
The U+xxxx notation does not follow the UTF-8 encoding. Nor does it follow UTF-16, UTF-32 or the deprecated UCS encodings, in either big or little endian. For characters within the Basic Multilingual Plane, however, the hexadecimal digits are identical to the UTF-16(BE) encoding. Note that UTF-16 may contain surrogate code units that are used as an escape to encode the characters in the other planes. The ranges of those code units are not mapped to characters and will therefore not be present in the textual code point representation.
See for instance, the PLUS-MINUS SIGN, ±:
Unicode code point: U+00B1 (as a textual string)
UTF-8 : 0xC2 0xB1 (as two bytes)
UTF-16 : 0x00B1
UTF-16BE : 0x00B1 as 0x00 0xB1 (as two bytes)
UTF-16LE : 0x00B1 as 0xB1 0x00 (as two bytes)
https://www.fileformat.info/info/unicode/char/00b1/index.htm
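The byte sequences listed for ± can be reproduced with Python's built-in codecs. This is a minimal sketch; the variable names are only illustrative.

```python
ch = '\u00b1'  # PLUS-MINUS SIGN, code point U+00B1

utf8 = ch.encode('utf-8')         # two bytes: C2 B1
utf16be = ch.encode('utf-16-be')  # two bytes, high byte first: 00 B1
utf16le = ch.encode('utf-16-le')  # two bytes, low byte first: B1 00

print(f"U+{ord(ch):04X}", utf8.hex(), utf16be.hex(), utf16le.hex())
```

The code point itself stays U+00B1 in every case; only the serialised bytes differ.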
Much of this information can be found at sil.org.
I see the two being used (seemingly) interchangeably in many cases - are they the same or different? This also seems to depend on whether the language is talking about UTF-8 (e.g. Rust) or UTF-16 (e.g. Java/Haskell). Is the code point/scalar distinction somehow dependent on the encoding scheme?
First let's look at definitions D9, D10 and D10a, Section 3.4, Characters and Encoding:
D9 Unicode codespace:
A range of integers from 0 to 10FFFF₁₆.
D10 Code point:
Any value in the Unicode codespace.
• A code point is also known as a
code position.
...
D10a Code point type:
Any of the seven fundamental classes of code points in the standard:
Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
[emphasis added]
Okay, so code points are integers in a certain range. They are divided into categories called "code point types".
Now let's look at definition D76, Section 3.9, Unicode Encoding Forms:
D76 Unicode scalar value:
Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
Surrogates are defined and explained in Section 3.8, just before D76. The gist is that surrogates are divided into two categories: high-surrogates and low-surrogates. These are used only by UTF-16 so that it can represent all scalar values. (There are 1,112,064 scalars, but a single 16-bit code unit can hold only 2^16 = 65,536 values, which is far fewer.)
UTF-8 doesn't have this problem; it is a variable-length encoding scheme (each code point is encoded as 1 to 4 bytes), so it can encode all scalars without using surrogates.
Summary: a code point is either a scalar or a surrogate. A code point is merely a number in the most abstract sense; how that number is encoded into binary form is a separate issue. UTF-16 uses surrogate pairs because it can't directly represent all possible scalars. UTF-8 doesn't use surrogate pairs.
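The definitions above translate directly into range checks. Here is a sketch in Python; `is_surrogate` and `is_scalar_value` are hypothetical helper names, and the count is computed by brute force purely to confirm the figure quoted earlier.

```python
def is_surrogate(cp: int) -> bool:
    """True for high- and low-surrogate code points (D800 to DFFF)."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    """True for any code point in the codespace that is not a surrogate (D76)."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


# Count every scalar value in the codespace 0..10FFFF.
scalar_count = sum(1 for cp in range(0x110000) if is_scalar_value(cp))
print(scalar_count)
```

The result is the full codespace (1,114,112 code points) minus the 2,048 surrogate code points.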
In the future, you might find consulting the Unicode glossary helpful. It contains many of the frequently used definitions, as well as links to the definitions in the Unicode specification.
In Swift Programming Language 3.0, chapter on string and character, the book states
A Unicode scalar is any Unicode code point in the range U+0000 to
U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do
not include the Unicode surrogate pair code points, which are the code
points in the range U+D800 to U+DFFF inclusive
What does this mean?
A Unicode scalar is any code point that is not a surrogate code point, that is, any code point outside the range U+D800 to U+DFFF.
A code point is the number assigned to a character in the Unicode standard. For instance, the code point of the letter A is 0x41 (or 65 in decimal).
A code unit is each group of bits used in the serialisation of a code point. For instance, UTF-16 uses one or two code units of 16 bits each.
The letter A can be expressed with only one UTF-16 code unit: 0x0041. However, code points above U+FFFF require two UTF-16 code units. This pair of code units is called a surrogate pair, and its two halves take their values from the reserved range U+D800 to U+DFFF. Those reserved code points never stand for a character on their own, which is why they are excluded from the scalars. Note that a character which needs a surrogate pair in UTF-16, such as U+1F600, is itself still a Unicode scalar.
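How UTF-16 turns a code point into one or two code units can be sketched in Python as follows. The function name `to_utf16_code_units` is invented for this example; the arithmetic is the standard UTF-16 surrogate calculation.

```python
def to_utf16_code_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one Unicode scalar value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0xFFFF:
        return [cp]                      # fits in a single 16-bit code unit
    offset = cp - 0x10000                # 20-bit value to split in half
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return [high, low]


print([hex(u) for u in to_utf16_code_units(0x41)])
print([hex(u) for u in to_utf16_code_units(0x1F600)])
```

For U+1F600 this gives the pair D83D DE00: two code units, but still one code point.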
The answer from courteouselk is correct, by the way; this is just a more plain-English version.
From Unicode FAQ:
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
Basically, surrogates are code points that are reserved for a special purpose: they are promised never to encode a character on their own, but only as one half of a pair of UTF-16 code units.
[UPD] Also, from wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case[citation needed] and Windows allows such sequences in filenames.
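Python 3 is one place where both behaviours can be observed. The sketch below assumes CPython's standard codecs: the strict UTF-8 encoder rejects a lone surrogate, while the `surrogatepass` error handler encodes it in the "trivial and obvious" way.

```python
lone = '\ud800'  # a str holding a single unpaired high surrogate

# The default (strict) UTF-8 codec treats this as an encoding error.
try:
    lone.encode('utf-8')
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

# The 'surrogatepass' handler encodes the surrogate like any other code point.
lenient = lone.encode('utf-8', errors='surrogatepass')

print(strict_rejects, lenient.hex())
```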
The arrangement of the characters that can be used as super-/subscript letters seems completely chaotic. Most of them are obviously not meant to be used as superscript or subscript letters, but even those that are do not follow any reasonable ordering. In Unicode 6.0 there is now at last an alphabetically ordered subset of the subscript letters h-t in U+2095 through U+209C, but this was obviously squeezed into the remaining space in the block and covers less than 1/3 of all letters.
Why did the consortium not just allocate enough space for at least one sup and one subscript alphabet in lower case?
The disorganization in the arrangement of these characters is because they were encoded piecemeal as scripts that used them were encoded, and as round-trip compatibility with other character sets was added. Chapter 15 of the Unicode Standard has some discussion of their origins: for example superscript digits 1 to 3 were in ISO Latin-1 while the others were encoded to support the MARC-8 bibliographic character set (see table here); and U+2071 SUPERSCRIPT LATIN SMALL LETTER I and U+207F SUPERSCRIPT LATIN SMALL LETTER N were encoded to support the Uralic Phonetic Alphabet.
The Unicode Consortium have a general policy of not encoding characters unless there's some evidence that people are using the characters to make semantic distinctions that require encoding. So characters won't be encoded just to complete the set, or to make things look neat.
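The piecemeal history is visible in the code points themselves. As a quick illustration, this Python sketch uses the standard `unicodedata` module to list the superscript digits 0 to 9 in numeric order; the variable names are only for the example.

```python
import unicodedata

# Superscript digits 0-9 in numeric order. 1, 2 and 3 come from Latin-1,
# the rest sit in the Superscripts and Subscripts block.
superscript_digits = '\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079'

for ch in superscript_digits:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

code_points = [ord(ch) for ch in superscript_digits]
```

The resulting list of code points is not monotonic: it jumps from U+2070 down to U+00B9, U+00B2 and U+00B3 before returning to U+2074.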