Which nonnegative integers aren't assigned a character in the UCS?

Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).
Note: There's a difference between characters and abstract characters: the latter term more closely matches our intuitive notion of a character, while the former is a concept defined in the context of coded character sets. Some abstract characters are represented by more than one character. The Unicode article on Wikipedia cites an example:
For example, a Latin small letter "i" with an ogonek, a dot above, and
an acute accent [an abstract character], which is required in
Lithuanian, is represented by the character sequence U+012F, U+0307,
U+0301.
The UCS (Universal Coded Character Set) is a coded character set defined by the International Standard ISO/IEC 10646, which, for reference, may be downloaded through this official link.
The task at hand is to tell whether a given nonnegative integer is mapped to a character by the UCS, the Universal Coded Character Set.
Let us consider first the nonnegative integers that are not assigned a character, even though they are, in fact, reserved by the UCS. The UCS (§ 6.3.1, Classification, Table 1; page 19 of the linked document) lists three possibilities, based on the basic type that corresponds to them:
surrogate (the range D800–DFFF)
noncharacter (the range FDD0–FDEF plus any code point ending in the value FFFE or FFFF)
The Unicode standard defines noncharacters as follows:
Noncharacters are code points that are permanently reserved and will
never have characters assigned to them.
This page lists noncharacters more precisely.
reserved (I haven't found which nonnegative integers belong to this category)
On the other hand, code points whose basic type is any of:
graphic
format
control
private use
are assigned to characters. This is, however, open to discussion: for instance, should private use code points be considered to actually be assigned any characters? The UCS itself (§ 6.3.5, Private use characters; page 20 of the linked document) defines them as:
Private use characters are not constrained in any way by this
International Standard. Private use characters can be used to provide
user-defined characters.
Additionally, I would like to know the range of nonnegative integers that the UCS maps or reserves. What is the maximum value? Some pages suggest that the whole range of nonnegative integers that the UCS maps is 0–0x10FFFF. Is this true?
Ideally, this information would be publicly offered in a machine-readable format that one could build algorithms upon. Is it, by chance?
For clarity: what I need is a function that takes a nonnegative integer as argument and returns whether it is mapped to a character by the UCS. Additionally, I would prefer that it be based on official, machine-readable information. To answer this question, it would be enough to point to one such resource upon which I could build the function myself.

The Unicode Character Database (UCD) is available on the unicode.org site; it is certainly machine-readable. It contains a list of all of the assigned characters. (Of course, the set of assigned codepoints is larger with every new version of Unicode.) Full documentation on the various files which make up the UCD is also linked from the UCD page.
The range of potential codes is, as you suspect, 0-0x10FFFF. Of those, the non-characters and the surrogate blocks will never be assigned as codepoints to any character. Codes in the private use areas can be assigned to characters only by mutual agreement between applications; they will never be assigned to characters by Unicode itself. Any other code might be.
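As a minimal sketch of the requested function: Python's standard unicodedata module bundles a snapshot of the UCD, so its General Category property can stand in for parsing UnicodeData.txt yourself (which Unicode version you get depends on your Python build, and whether private use should count as "assigned" is the open question above):

    import unicodedata

    def is_assigned(cp: int) -> bool:
        """True if the code point is assigned to a character (graphic,
        format, control, or private use); False for surrogates (Cs) and
        for noncharacters and reserved code points (both Cn)."""
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("not a valid code point")
        return unicodedata.category(chr(cp)) not in ("Cn", "Cs")

    print(is_assigned(97))       # True  (LATIN SMALL LETTER A)
    print(is_assigned(0xFDD0))   # False (noncharacter)
    print(is_assigned(0xD800))   # False (surrogate)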

What is the meaning of the indicator XXX in the Unicode charts?

Consider the Unicode chart for C1 Controls and Latin-1 Supplement in the Unicode Charts. If a character has a glyph, it is shown; if it does not, a dotted box with a symbolic marker or identifier is given instead. In this case, both 0080 and 0081 seem to have some "invalid marker", which I think is what "XXX" means. Is that what it means?
Secondly, what should be the behaviour of a Unicode-aware string type that stores the value 0x80 (hex), i.e. 128 (decimal)? Should it be converted to some other code point, along the lines of this mapping:
Byte value 128 in many ANSI code pages is the euro sign.
Is storing the decimal value 128 therefore equivalent to storing U+20AC?
The magic "non orthogonality" I have encountered in a particular language or operating system API implementation of its MBCS and Unicode types, and Java's interesting handling, lead me to wonder: what is the real intended use of the U+0080 character? This reference link confuses me by showing that Java treats this character as a euro symbol (one-way ANSI-code-page-to-Unicode friendliness), yet its name is <control>, which is not anything I know how to deal with. Wikipedia says it's PAD here.
Can anyone help me? Did I skip a foundational concepts day at Unicode School? What am I missing?
Update: The block from 0080 to 0098 consists of non-printable control characters. This much I know. What I wonder is: what does the XXX mean, and how am I to think of this character when I am processing Unicode data with this value in it?
According to the explanation in Ch. 17 (About the Code Charts) of the Unicode Standard, p. 573, by the “Dashed Box Convention”, characters that have no visible rendering as such “are represented by a square dashed box. This box surrounds a short mnemonic abbreviation of the character’s name.” The characters referred to in the questions are control characters, in the C1 Controls area.
The Unicode Standard says, in Ch. 16, p. 544, about C0 and C1 Controls: “The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.” And the abbreviations in the square dashed boxes reflect the meanings given in ISO/IEC 6429:1992.
Some code points in the C1 Controls area are not defined in ISO/IEC 6429:1992. For them, such as U+0080, the code chart has “XXX” in place of a mnemonic abbreviation. So this indicates that the Unicode standard does not refer to any meaning for those code points, beyond their being control characters with some abstract properties.
Thus, “XXX” does not mean “invalid”, but rather “completely undefined meaning”. The meaning of such code points can be defined by various standards or other conventions, as long as they are consistent with the general definitions—e.g., it would be incompatible to define U+0080 as a graphic character.
Such code points must not be replaced or omitted in any character-level processing; applications that actually change data may do whatever they want, but any general conversion routines, for example, must keep these code points (characters) intact. They must not be treated as malformed or invalid; but an application may treat them as undefined. By Unicode principles, it’s OK to be ignorant of a character, but not completely wrong about it.
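To make "undefined but valid" concrete, here is what the UCD itself records for U+0080, checked with Python's stdlib unicodedata module:

    import unicodedata

    ch = "\u0080"
    print(unicodedata.category(ch))           # 'Cc' -- a control character
    print(unicodedata.name(ch, "<no name>"))  # control characters carry no name
    # Conversion routines must pass it through intact:
    print(ch.encode("utf-8").decode("utf-8") == ch)   # True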
This has nothing to do with the meaning of bytes like 0x80 in 8-bit codes like Windows-1252. But if you send e.g. data labeled as ISO-8859-1 encoded (where e.g. 0x80 is in principle U+0080) to a web browser, it will actually treat it as Windows-1252 encoded. The reason is that characters like U+0080 are practically never used in ISO-8859-1 data; occurrence of 0x80 in ISO-8859-1 labeled data is virtually always either windows-1252 mislabeled or messed-up data that cannot be meaningfully processed. So browsers take the practical route and treat ISO-8859-1 as windows-1252; this is being formalized in HTML5 and related specifications.
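A quick Python illustration of that byte-level difference (standard-library codec names):

    raw = b"\x80"
    print(repr(raw.decode("latin-1")))   # '\x80' -- U+0080, the <control> code point
    print(raw.decode("cp1252"))          # '€'    -- U+20AC, what browsers assume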

Differentiate between symbol, number and letter codepoints in Unicode?

Unicode has a huge number of codepoints. How can I check whether a codepoint is a symbol (like "!" or "☭"), a number (like "4" or "৯"), a letter (like "a" or "え"), or a control character (which are usually not displayed directly)?
Is there any logic connecting the position of a character to what kind of character it is (as opposed to just which alphabet it is part of)? If not, are there existing resources that classify which ranges are what?
That would be done through the General Category property of those codepoints. It's part of the canonical UnicodeData.txt dataset, and every serious Unicode-related library should have some way for you to get this property.
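As a rough sketch in Python (stdlib unicodedata; note that Unicode files "!" under punctuation, Po, rather than symbols):

    import unicodedata

    def kind(ch: str) -> str:
        """Map the major General Category class to a coarse label."""
        major = unicodedata.category(ch)[0]
        return {"L": "letter", "N": "number", "S": "symbol",
                "P": "punctuation", "C": "control/other"}.get(major, "other")

    for ch in "!☭4৯aえ":
        print(ch, unicodedata.category(ch), kind(ch))
    # ! Po punctuation | ☭ So symbol | 4 Nd number
    # ৯ Nd number | a Ll letter | え Lo letter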

Unicode alphanumeric character range

I'm looking at the IsCharAlphaNumeric Windows API function. As it only takes a single TCHAR, it obviously can't make any decisions about surrogate pairs for UTF16 content. Does that mean that there are no alphanumeric characters that are surrogate pairs?
Characters outside the BMP can be letters. (Michael Kaplan recently discussed a bug in the classification of the character U+1F48C.) But IsCharAlphaNumeric cannot see characters outside the BMP (for the reasons you noted), so you cannot obtain classification information for them that way.
If you have a surrogate pair, call GetStringType with cchSrc = 2 and check for C1_ALPHA and C1_DIGIT.
Edit: The second half of this answer is incorrect; GetStringType does not support surrogate pairs.
You can see for yourself what you are missing by not being able to inspect non-BMP codepoints: look at the Unicode plane assignments.
For example, you won't be able to identify Imperial Aramaic characters as alphanumeric. Shame.
Does that mean that there are no alphanumeric characters that are surrogate pairs?
No, there are supplementary code-points that are in the letter group.
Comparing a char to a code-point?
For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
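A concrete supplementary-plane letter and its surrogate pair, checked in Python (Imperial Aramaic, as lamented above):

    import unicodedata

    ch = chr(0x10840)                    # IMPERIAL ARAMAIC LETTER ALAPH
    print(unicodedata.category(ch))      # 'Lo' -- a letter outside the BMP
    print(ch.encode("utf-16-be").hex())  # 'd802dc40' -- a surrogate pair in UTF-16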

What's the purpose of the noncharacters U+FDD0 to U+FDEF?

U+FFFE needs to be a noncharacter in order to allow the Byte Order Mark to work.
U+FFFF is described in The Unicode Standard as "useful for internal purposes as sentinels". Makes sense.
But I can't figure out, and The Unicode Standard doesn't really explain, why the set of noncharacters includes some random block within "Arabic Presentation Forms-A". What are these for? (Besides the eye of the basilisk?)
OK, so the question is really two questions: what are they for, and why are they in the middle of the Arabic Presentation Forms?
There was a need for a block of 32 non-characters "to make additional codes available to programmers to use for internal processing purposes" http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=IWS-Chapter04a#4d3110c8
It was required that it be in the Basic Multilingual Plane (BMP), i.e. 0x0000 to 0xFFFF, so that they could have single-codepoint representations in UTF-16.
There was a block of unused codepoints in the Arabic Presentation Forms block.
It had been agreed not to encode any more Arabic Presentation Forms, so these were never going to be used.
http://www.unicode.org/mail-arch/unicode-ml/y2001-m10/0014.html
Therefore it was agreed that these codepoints, which were never going to be used otherwise, would be designated noncharacters so they could be used internally by applications/programmers.
These noncharacters are for internal use by applications and should not be interchanged.
Let me try to explain, based on what the Unicode Standard says.
Unicode has 66 noncharacters. Each of the 17 planes contributes two: its last two code points, the ones ending in FFFE and FFFF. The other 32 noncharacters form the contiguous block U+FDD0 to U+FDEF.
So the total count is
17*2 + 32 = 66
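That arithmetic as a small Python predicate:

    def is_noncharacter(cp: int) -> bool:
        """True for the 66 permanently reserved noncharacters."""
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    # 32 in the U+FDD0 block + 2 per plane * 17 planes = 66
    assert sum(map(is_noncharacter, range(0x110000))) == 66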
Read the following text from chapter 16 of the Unicode Standard, which says the block sits where it does for "historical reasons". I'm curious too, but I don't think there is any ambiguity:
For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not "Arabic noncharacters" or "right-to-left noncharacters," and are not distinguished in any other way from the other noncharacters, except in their code point values.
U+FEFF is the BOM, and U+FFFE is its byte-swapped counterpart. Since U+FFFE is a noncharacter, an interpreting process that finds U+FFFE as the first character can take it as a signal either that it has encountered text of the wrong byte order or that the file is not valid Unicode text. It is just a signal, not a standardized mechanism; either interpretation is possible.
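In Python terms (an illustration of the byte swap, not a detection algorithm):

    bom = "\ufeff".encode("utf-16-le")    # b'\xff\xfe'
    # Decoding those bytes with the wrong byte order yields the noncharacter:
    print(repr(bom.decode("utf-16-be")))  # '\ufffe'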
Section 3.2 of the Unicode Standard, clause C2, says:
C2 A process shall not interpret a noncharacter code point as an abstract character.
The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
So as application developers you are free to use these code points internally as you wish. They can serve as sentinels, delimiters, or perhaps basilisk eyes, but they should not be interchanged.
Section 16.7 says
In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.
Again, U+FFFF is not reserved as a sentinel by the Unicode Standard; the standard merely describes that as a typical use case. From Section 16.7:
U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.
As mentioned here at xkcd, U+FDD0 is actually the Unicode character for the eye of a basilisk. For (obvious) reasons of personal safety however, the character is not rendered to the screen... :)

Is there encoding in Unicode where every "character" is just one code point?

Trying to rephrase: Can you map every combining character combination into one code point?
I'm new to Unicode, but it seems to me that there is no encoding, normalization, or representation in Unicode where one character would always be one code point. Is this correct?
Is this true for Basic Multilingual Plane also?
If you mean one char == one number (i.e. where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but it's quite wasteful if you don't need any of the higher chars.
If you mean combining sequences (i.e. where e + ´ => é): there are single-character precomposed representations for most of the combinations used in existing modern languages. If you're making up your own language, you could run into problems... but if you're sticking to the ones that people actually use, you'll be fine.
Can you map every combining character combination into one code point?
Every combining character combination? How would your proposed encoding represent the string "à̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡"? (an 'a' with more than a hundred combining marks attached to it?) It's just not practical.
There are, however, a lot of "precomposed" characters in Unicode, like áçñü. Normalization form C will use these instead of the decomposed version whenever possible.
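A short Python demonstration (stdlib unicodedata) of NFC composing where a precomposed character exists and leaving the rest alone:

    import unicodedata

    s = "e\u0301"                                 # e + COMBINING ACUTE ACCENT
    print(len(unicodedata.normalize("NFC", s)))   # 1 -- composes to U+00E9

    t = "a\u0301\u0302"                           # a + acute + circumflex
    print(len(unicodedata.normalize("NFC", t)))   # 2 -- no fully precomposed form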
it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Depends on the meaning of the word “character.” Unicode has the concepts of abstract character (definition 7 in chapter 3 of the standard: “A unit of information used for the organization, control, or representation of textual data”) and encoded character (definition 11: “An association (or mapping) between an abstract character and a code point”). So a character never is a code point, but for many code points there exists an abstract character that maps to the code point, this mapping being called an “encoded character.” But (definition 11, paragraph 4): “A single abstract character may also be represented by a sequence of code points”
Is this true for Basic Multilingual Plane also?
There is no conceptual difference related to abstract or encoded characters between the BMP and the other planes. The statement above holds for all subsets of the codespace.
Depending on your application, you have to distinguish between the terms glyph, grapheme cluster, grapheme, abstract character, encoded character, code point, scalar value, code unit and byte. All of these concepts are different, and there is no simple mapping between them. In particular, there is almost never a one-to-one mapping between these entities.