Is there a good resource for finding the last two characters of each plane, particularly planes 3–13?
Obviously 0xFFFE and 0xFFFF are noncharacters, as are 0x10FFFE and 0x10FFFF, but I can't find a complete list of where the last characters of each plane are, as I can't tell where each plane ends.
On Unicode's website it refers to the last two characters of every plane as being noncharacters.
The official source can already be found at http://unicode.org/charts/index.html; search for "Noncharacters in Charts." In fact, the noncharacters at the end of Planes 3 to D [as of Unicode 12.1] are the only designated code points in these planes.
There are exactly 66 noncharacters in Unicode. There are 34 noncharacters residing at the final two code points of each of the 17 planes, and there is an additional contiguous range of 32 noncharacters from U+FDD0 to U+FDEF in the Arabic Presentation Forms-A block.
Any code point ending in FFFE or FFFF is a noncharacter. In addition, any 4-digit code point beginning with FDD or FDE is a noncharacter. (A short code sketch follows the list below.)
I'll enumerate the noncharacters:
FDD0-FDEF [These 32 are designated in Unicode 3.1, to allocate more code points for internal use]
FFFE [Probably the most notable one, this one is involved in BOM usage]
FFFF [Can be used as a sentinel, equal to -1 in a 16-bit signed int]
XFFFE [16 of them in the supplementary planes; X is a hexadecimal digit from 1 to F, or 10]
XFFFF [16 of them in the supplementary planes; X is a hexadecimal digit from 1 to F, or 10]
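To make that rule concrete, here is a minimal sketch in Java (the helper name isNoncharacter is my own; java.lang.Character provides no such method):

public class NoncharacterCheck {
    // Hypothetical helper implementing the rule described above.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)                          // the 32 noncharacters in the BMP
            || (cp >= 0 && cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE); // xxFFFE / xxFFFF in every plane
    }

    public static void main(String[] args) {
        System.out.println(isNoncharacter(0xFFFE));   // true
        System.out.println(isNoncharacter(0x3FFFF));  // true (last code point of Plane 3)
        System.out.println(isNoncharacter(0xFDD0));   // true
        System.out.println(isNoncharacter(0x10FFFD)); // false (a Private Use code point)
    }
}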
The Unicode Character Database contains authoritative information on the status of each code point. Using it, you can determine the last assigned code point of each plane. This may (actually, will) change over time, as new characters are assigned. You would also need to define what you mean by “character” – in particular, whether you regard Private Use code points as “characters”.
Each Unicode plane contains 2¹⁶ = 65,536 code points (the codespace starts at 0x000000), and the last two code points of each plane are noncharacters. Therefore, all 0x••FFFE and 0x••FFFF code points are noncharacters, where •• is anything from 0x00 through 0x10 (identifying the plane).
..., as I can't tell where each plane ends.
Every plane by definition ends at U+xxFFFF.
On Unicode's website it refers to the last two characters of every plane as being noncharacters.
No. The Unicode Standard Version 9.0 - Core Specification says (in section 23.7 Noncharacters):
The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.
Note the key term is "code points", not "characters"; they are always U+xxFFFE and U+xxFFFF.
According to the Unicode specification:
D91 UTF-16 encoding form: The Unicode encoding form that assigns each
Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF
to a single unsigned 16-bit code unit with the same numeric value as
the Unicode scalar value, and that assigns each Unicode scalar value
in the range U+10000..U+10FFFF to a surrogate pair.
The term "scalar value" is referred to unicode code points, that is the range of abstract ideas which must be encoded into specific byte sequence by different encoding forms (UTF-16 and so on). So it seems that this excerpt gist is in view of not all code points can be accommodated into one UTF-16 code unit (two bytes), there are ones which should be encoded into a pair of code units - 4 bytes (it's called "a surrogate pair").
However, the very term "scalar value" is defined as follows:
D76 Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.
Wait... Does Unicode have surrogate code points? What is the reason for them when UTF-16 can use 4 bytes to represent scalar values? Can anyone explain the rationale and how UTF-16 uses these code points?
Yes, Unicode reserves ranges for surrogate code points:
High Surrogate Area U+D800 to U+DBFF
Low Surrogate Area U+DC00 to U+DFFF
Unicode reserves these ranges because these 16-bit values are used in surrogate pairs and no other symbol can be assigned to them. Surrogate pairs are two 16-bit values that encode code points above U+FFFF that do not fit into a single 16-bit value.
Just for the sake of ultimate clarification.
UTF-16 uses 16-bit (2-byte) code units. That means this encoding form encodes code points (abstract values that must be represented in computer memory in some way), as a rule, as 16 bits (so an interpreter reads the data two bytes at a time).
UTF-16 does this quite straightforwardly: the code point U+000E is encoded as 000E, U+000F as 000F, and so on.
The issue is that 16 bits only cover a range (0000..FFFF, just 65,536 possible values) that is not sufficient to accommodate all Unicode code points. We might simply use two 16-bit words (4 bytes) for code points beyond this range (actually, my misunderstanding was about why UTF-16 doesn't do exactly that). However, this naive approach makes some values impossible to decode unambiguously. For example, if we encode the code point U+10000 as 0001 0000 (hex notation), how should an interpreter decode that representation: as the two subsequent code points U+0001 and U+0000, or as the single code point U+10000?
The Unicode specification takes a better way. To encode the range U+10000..U+10FFFF (1,048,576 code points), we set apart 1,024 + 1,024 = 2,048 values from those that can be encoded with 16 bits (the spec chose the D800..DFFF range for this). When an interpreter encounters a D800..DBFF value (High Surrogate Area) in memory, it knows this is not a "fully-fledged" code point (not a scalar value in terms of the spec), so it reads another 16 bits, expecting a value from the DC00..DFFF range (Low Surrogate Area), and only then concludes which code point in U+10000..U+10FFFF was encoded with these 4 bytes (with this surrogate pair). Note that such a scheme makes it possible to encode 1,024 * 1,024 = 1,048,576 code points, which is exactly the amount we need.
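Here is a minimal Java sketch of the surrogate-pair arithmetic described above (Java's own Character.highSurrogate, Character.lowSurrogate and Character.toCodePoint do the same work; U+1F600 is just an arbitrary example code point):

public class SurrogateDemo {
    public static void main(String[] args) {
        int cp = 0x1F600;                                 // a code point beyond U+FFFF
        int offset = cp - 0x10000;                        // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));     // top 10 bits -> high surrogate
        char low  = (char) (0xDC00 + (offset & 0x3FF));   // low 10 bits -> low surrogate
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) high, (int) low); // U+1F600 -> D83D DE00

        // Decoding: recombine the two code units into the original code point.
        int decoded = ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
        System.out.printf("decoded: U+%X%n", decoded);    // decoded: U+1F600
    }
}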
Because the Unicode codespace is defined as the range of integers from 0 to 10FFFF, we also have to introduce the concept of surrogate code points (not code units): the U+D800..U+DFFF range (since we can't exclude this range from the Unicode codespace). Given that surrogate code points are designated for the surrogate code units of UTF-16 (see C1, D74), these code points can look like a UTF-16 relic.
At the moment, Unicode has 17 planes
But why are there 17? Why aren't they all just included in 1 plane?
Why are the 4th to 13th planes unoccupied? I expected they would be filled contiguously.
It looks like initially there was only one plane. But as Unicode has evolved from its first version to Unicode 13.0.0 as of March 2020, more planes were added along the way.
It looks like they aren't filled contiguously because the first planes are used for encoding all languages, and the final planes only contain non-graphical characters. The "gap" is left for any new text that needs to be added to Unicode. Quote:
It is not anticipated that all these planes will be used in the foreseeable future, given the total sizes of the known writing systems left to be encoded. The number of possible symbol characters that could arise outside of the context of writing systems is potentially huge. At the moment, these 11 planes out of 17 are unused.
I see the two being used (seemingly) interchangeably in many cases - are they the same or different? This also seems to depend on whether the language is talking about UTF-8 (e.g. Rust) or UTF-16 (e.g. Java/Haskell). Is the code point/scalar distinction somehow dependent on the encoding scheme?
First let's look at definitions D9, D10 and D10a, Section 3.4, Characters and Encoding:
D9 Unicode codespace:
A range of integers from 0 to 10FFFF₁₆.
D10 Code point:
Any value in the Unicode codespace.
• A code point is also known as a code position.
...
D10a Code point type:
Any of the seven fundamental classes of code points in the standard:
Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
[emphasis added]
Okay, so code points are integers in a certain range. They are divided into categories called "code point types".
Now let's look at definition D76, Section 3.9, Unicode Encoding Forms:
D76 Unicode scalar value:
Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
Surrogates are defined and explained in Section 3.8, just before D76. The gist is that surrogates are divided into two categories, high-surrogates and low-surrogates. These are used only by UTF-16 so that it can represent all scalar values. (There are 1,112,064 scalar values, i.e. 17 × 65,536 - 2,048, but a single 16-bit code unit can take only 2¹⁶ = 65,536 values.)
UTF-8 doesn't have this problem; it is a variable-length encoding form (scalar values are encoded as 1-4 bytes), so it can encode all scalars without using surrogates.
Summary: a code point is either a scalar or a surrogate. A code point is merely a number in the most abstract sense; how that number is encoded into binary form is a separate issue. UTF-16 uses surrogate pairs because it can't directly represent all possible scalars. UTF-8 doesn't use surrogate pairs.
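A small Java demonstration of the distinction (Java strings are sequences of UTF-16 code units, so a supplementary character occupies two code units but is still a single code point/scalar value; the class name is arbitrary):

import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";  // 'a' followed by U+1F600, written as a surrogate pair
        System.out.println(s.length());                       // 3 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 2 code points (both are scalar values)
        System.out.printf("U+%X%n", s.codePointAt(1));        // U+1F600, recombined from the pair
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 5 bytes in UTF-8 (1 + 4), no surrogates needed
    }
}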
In the future, you might find consulting the Unicode glossary helpful. It contains many of the frequently used definitions, as well as links to the definitions in the Unicode specification.
If I apply Unicode Normalization Form C to a string, will the number of code points in the string ever increase?
Yes, there are code points that expand to multiple code points after applying NFC normalization. Within the Basic Multilingual Plane, for example, there are 70 code points that expand to 2 code points after applying NFC normalization, and there are 2 code points (U+FB2C and U+FB2D within the Alphabetic Presentation Forms block) that expand to 3 code points.
One guarantee that you have for this so-called "expansion factor" is that no string will ever expand more than 3 times in length (in terms of number of code units) after NFC normalization is applied:
There is also a Unicode Consortium stability policy that canonical mappings are always limited in all versions of Unicode, so that no string when decomposed with NFC expands to more than 3× in length (measured in code units). This is true whether the text is in UTF-8, UTF-16, or UTF-32. This guarantee also allows for certain optimizations in processing, especially in determining buffer sizes.
Section 9, Detecting Normalization Forms. UAX #15: Unicode Normalization Forms.
I have written a Java program to determine which code points within a Unicode block expand to multiple code points: http://ideone.com/9PUOCb
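For reference, here is a minimal sketch of how such a scan might look using java.text.Normalizer (this is my own simplified version restricted to the BMP, not the linked program; exact results depend on the Unicode version shipped with your JDK):

import java.text.Normalizer;

public class NfcExpansion {
    public static void main(String[] args) {
        // Scan the BMP for code points whose NFC form is longer than one code point.
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (Character.isSurrogate((char) cp) || !Character.isDefined(cp)) continue;
            String source = new String(Character.toChars(cp));
            String nfc = Normalizer.normalize(source, Normalizer.Form.NFC);
            int expanded = nfc.codePointCount(0, nfc.length());
            if (expanded > 1) {
                System.out.printf("U+%04X expands to %d code points%n", cp, expanded);
            }
        }
    }
}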
Alternatively, Tom Christiansen's unichars utility, part of the Unicode::Tussle CPAN module, can be used. (Note: Mac users may see an error at the make test installation step saying that the Perl version is too old. If you see this error, you can install the module by running notest install Unicode::Tussle within a CPAN shell.)
Examples:
Print the code points in the BMP that expand to 3 code points:
unichars 'length(NFC) == 3'
שּׁ U+FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT
שּׂ U+FB2D HEBREW LETTER SHIN WITH DAGESH AND SIN DOT
Count the number of code points in all planes that expand to more than one code point:
unichars -a 'length(NFC) > 1' | wc -l
85
See also the frequently asked question What are the maximum expansion factors for the different normalization forms?
Was the position of UTF-16 surrogates area (U+D800..U+DFFF) chosen at random or does it have some logical reason, that it is on this place?
The surrogates area was added in Unicode 2.0, to expand the code beyond 65536 code points while retaining compatibility with the existing 16-bit representation. To encode the 20 bits necessary to represent the 1048576 new code points, they took 1024 characters to represent the first 10 bits and 1024 to represent the second 10 bits (they used 2048 characters instead of 1024 to allow the code to be self-synchronizing). For efficiency in recognizing the characters, it would be best if all 2048 shared a (binary) prefix.
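As a quick check of that shared prefix, a small Java snippet (the class name is arbitrary): every code unit in 0xD800..0xDFFF begins with the 5 bits 11011, the next bit distinguishes high from low surrogates, and the remaining 10 bits carry the payload.

public class SurrogatePrefix {
    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(0xD800)); // 1101100000000000 (first high surrogate)
        System.out.println(Integer.toBinaryString(0xDBFF)); // 1101101111111111 (last high surrogate)
        System.out.println(Integer.toBinaryString(0xDC00)); // 1101110000000000 (first low surrogate)
        System.out.println(Integer.toBinaryString(0xDFFF)); // 1101111111111111 (last low surrogate)
    }
}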
I can only guess that they wanted to shove this unusually-purposed block to higher rather than lower codepoints. The blocks 0xE000–0xE7FF, 0xE800–0xEFFF, and 0xF000–0xF7FF were already reserved for the "private use" area, and 0xF800–0xFFFF was also partially reserved for private use and partially used for other codes. So 0xD800–0xDFFF would have been the highest block available.
Unicode was originally designed as a 16-bit code, and had already assigned a bunch of characters before the need for “supplementary planes” was recognized. The largest available block was U+A000 – U+DFFF, so surrogates would have to go somewhere in there.