Does each code point have a corresponding name? - unicode

Many code points have names; for example, the name of 'a' is 'LATIN SMALL LETTER A'.
Do all code points have names?

No, not every code point has a name. From the Unicode standard core specification, version 12, 4.8 Name, under Unicode Name Property:
NR4: For all other Unicode code points of all other types (Control, Private-Use, Surrogate, Noncharacter and Reserved), the value of the Name property is the null string.
NR1 through NR3 cover the other cases, specifically:
derivation of names for Hangul syllables;
derivation of names for ideographs;
specific names for other graphic, and all format, characters.
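As a quick way to see these rules in practice, here is a minimal sketch using Python's standard unicodedata module. The helper name code_point_name is made up for illustration, and the results reflect whatever Unicode version your Python build bundles, not necessarily version 12.

```python
import unicodedata

def code_point_name(cp: int) -> str:
    """Return the Name property of a code point, or "" if it has none.

    `code_point_name` is a hypothetical helper, not part of any library.
    unicodedata.name() raises ValueError for nameless code points unless
    a default is supplied, so we pass "" to mirror the "null string" of NR4.
    """
    return unicodedata.name(chr(cp), "")

if __name__ == "__main__":
    for cp in (0x61, 0xAC00, 0x4E00, 0x0A, 0xE000, 0xD800, 0xFFFF):
        print(f"U+{cp:04X}: {code_point_name(cp)!r}")
```

Control characters such as U+000A do have informal aliases (for example LINE FEED), but those are name aliases, not values of the Name property, so the function returns an empty string for them.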

Related

What are all characters that the Unicode Other_ID_Start and Other_ID_Continue properties include?

While reading about the ID_Start and ID_Continue definitions, I found this: https://unicode.org/reports/tr31/#D1. It says that ID_Start includes Other_ID_Start and ID_Continue includes Other_ID_Continue. I'm unable to find the definitions of these Other_* properties. The document I mentioned says that they're defined by UAX44, so I consulted the Unicode 15 version of UAX44: https://www.unicode.org/reports/tr44/tr44-30.html. Table 9 (Property Table) only says:
Other_ID_Start Used to maintain backward compatibility of ID_Start.
Other than that, there is no additional information. What am I missing?
Other_ID_Start and Other_ID_Continue, like most binary character properties, are defined in the PropList.txt data file in the Unicode Character Database.
In particular, Other_ID_Start includes characters that used to be included in ID_Start automatically due to some other property they possessed, but now need to be specified manually because said property value has since changed. For example, U+212E ESTIMATED SYMBOL was originally classified as a letter and all letters are ID_Start by default, but later it was reclassified as a symbol and thus would have been excluded if it weren’t for the backwards compatibility requirement.
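To make the PropList.txt pointer concrete, here is a rough sketch of how one might pull the ranges for a single property out of text in that file's format. The function name property_ranges and the SAMPLE string are assumptions made for illustration; SAMPLE is only a short excerpt in the file's layout, so for real use you would feed in the full PropList.txt from the Unicode Character Database.

```python
from typing import List, Tuple

# Illustrative excerpt in PropList.txt format (NOT the full file).
SAMPLE = """\
# code point(s) ; property name # comment
2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
00B7          ; Other_ID_Continue # Po       MIDDLE DOT
"""

def property_ranges(text: str, prop: str) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) code point ranges listed for `prop`."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop trailing comments
        if not data:
            continue                           # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges

if __name__ == "__main__":
    for first, last in property_ranges(SAMPLE, "Other_ID_Start"):
        print(f"{first:04X}..{last:04X}")
```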

All ranges of Unicode letters?

Each Unicode code point is assigned to a character category. I'm looking for a list of the ranges that have the category "Letter". Ideally it would be a CSV in the format "FROM_CODEPOINT;TO_CODEPOINT" covering all ranges that define letters.
Looks like the Unicode Consortium publishes a database. The UnicodeData.txt file contains the character category. I can obtain the ranges from there with a simple utility.
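Such a utility could look roughly like the sketch below. Note the assumption: instead of parsing UnicodeData.txt directly, it relies on Python's bundled unicodedata module (which is generated from that file), so the output depends on the Unicode version shipped with your Python. The names letter_ranges and to_csv are made up for this example.

```python
import unicodedata
from typing import List, Tuple

def letter_ranges() -> List[Tuple[int, int]]:
    """Collect inclusive ranges of code points whose General Category
    starts with "L" (Lu, Ll, Lt, Lm, Lo)."""
    ranges = []
    start = None
    for cp in range(0x110000):
        is_letter = unicodedata.category(chr(cp)).startswith("L")
        if is_letter and start is None:
            start = cp                      # a new run of letters begins
        elif not is_letter and start is not None:
            ranges.append((start, cp - 1))  # the run ended just before cp
            start = None
    if start is not None:                   # run reaching the last code point
        ranges.append((start, 0x10FFFF))
    return ranges

def to_csv(ranges: List[Tuple[int, int]]) -> str:
    """Format ranges as FROM_CODEPOINT;TO_CODEPOINT lines in hex."""
    return "\n".join(f"{first:04X};{last:04X}" for first, last in ranges)

if __name__ == "__main__":
    print(to_csv(letter_ranges()))
```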

Which nonnegative integers aren't assigned a character in the UCS?

Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).
Note: There's a difference between characters and abstract characters: the latter term more closely refers to our notion of character, while the former is a concept in the context of coded character sets. Some abstract characters are represented by more than one character. The Unicode article at Wikipedia cites an example:
For example, a Latin small letter "i" with an ogonek, a dot above, and
an acute accent [an abstract character], which is required in
Lithuanian, is represented by the character sequence U+012F, U+0307,
U+0301.
The UCS (Universal Coded Character Set) is a coded character set defined by the International Standard ISO/IEC 10646, which, for reference, may be downloaded through this official link.
The task at hand is to tell whether a given nonnegative integer is mapped to a character by the UCS.
Let us consider first the nonnegative integers that are not assigned a character, even though they are, in fact, reserved by the UCS. The UCS (§ 6.3.1, Classification, Table 1; page 19 of the linked document) lists three possibilities, based on the basic type that corresponds to them:
surrogate (the range D800–DFFF)
noncharacter (the range FDD0–FDEF plus any code point ending in the value FFFE or FFFF)
The Unicode standard defines noncharacters as follows:
Noncharacters are code points that are permanently reserved and will
never have characters assigned to them.
This page lists noncharacters more precisely.
reserved (I haven't found which nonnegative integers belong to this category)
On the other hand, code points whose basic type is any of:
graphic
format
control
private use
are assigned to characters. This is, however, open to discussion. For instance, should private use code points be considered to actually be assigned any characters? The UCS itself (§ 6.3.5, Private use characters; page 20 of the linked document) defines them as:
Private use characters are not constrained in any way by this
International Standard. Private use characters can be used to provide
user-defined characters.
Additionally, I would like to know the range of nonnegative integers that the UCS maps or reserves. What is the maximum value? On some pages I have found that the whole range of nonnegative integers that the UCS maps is, presumably, 0–0x10FFFF. Is this true?
Ideally, this information would be publicly offered in a machine-readable format that one could build algorithms upon. Is it, by chance?
For clarity: What I need is a function that takes a nonnegative integer as argument and returns whether it is mapped to a character by the UCS. Additionally, I would prefer that it were based on official, machine-readable information. To answer this question, it would be enough to point to one such resource that I could build the function myself upon.
The Unicode Character Database (UCD) is available on the unicode.org site; it is certainly machine-readable. It contains a list of all of the assigned characters. (Of course, the set of assigned codepoints is larger with every new version of Unicode.) Full documentation on the various files which make up the UCD is also linked from the UCD page.
The range of potential codes is, as you suspect, 0–0x10FFFF. Of those, the noncharacters and the surrogate block will never be assigned to any character. Codes in the private use areas can be assigned to characters only by mutual agreement between applications; they will never be assigned to characters by Unicode itself. Any other code might be.
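A function along the lines the question asks for might be sketched as follows. It is built on Python's unicodedata module rather than on the UCD files themselves, so "reserved" here means "unassigned in the Unicode version bundled with your Python". The names classify and is_assigned, and the choice to make private use an opt-in, are assumptions of this sketch.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF

def classify(cp: int) -> str:
    """Classify a nonnegative integer by the basic types discussed above."""
    if not 0 <= cp <= MAX_CODE_POINT:
        return "out of range"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: FDD0..FDEF plus the last two code points of every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private use"
    if category == "Cn":
        return "reserved"   # not assigned yet, but might be in the future
    return "assigned"       # graphic, format, or control

def is_assigned(cp: int, include_private_use: bool = False) -> bool:
    """True if the code point is mapped to a character.

    Whether private use counts is debatable, so it is left to the caller.
    """
    kind = classify(cp)
    return kind == "assigned" or (include_private_use and kind == "private use")

if __name__ == "__main__":
    for cp in (0x61, 0x0A, 0xD800, 0xFDD0, 0x1FFFE, 0xE000, 0x50000, 0x110000):
        print(f"{cp:#x}: {classify(cp)}")
```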

Differentiate between symbol, number and letter codepoints in Unicode?

Unicode has a huge number of codepoints. How can I check whether a codepoint is a symbol (like "!" or "☭"), a number (like "4" or "৯"), a letter (like "a" or "え") or a control character (which is usually not displayed directly)?
Is there any logic connecting the position of a character to what kind of character it is (as opposed to just what alphabet it is part of)? If not, are there any existing resources which classify which ranges are what?
That would be done through the General Category property of those codepoints. It's part of the canonical UnicodeData.txt dataset, and every serious Unicode-related library should have some way for you to get this property.
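For instance, Python exposes this property through unicodedata.category(), which returns a two-letter code whose first letter is the major class. The mapping table and the helper name major_class below are assumptions of this sketch, not library API.

```python
import unicodedata

# First letter of the two-letter General Category code -> major class.
MAJOR_CLASSES = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other",   # control, format, surrogate, private use, unassigned
}

def major_class(ch: str) -> str:
    """Return the major General Category class of a single character."""
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]

if __name__ == "__main__":
    for ch in ("!", "\u262d", "4", "\u09ef", "a", "\u3048", "\n"):
        print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {major_class(ch)}")
```

One thing this makes visible: "!" has General Category Po, so Unicode treats it as punctuation rather than as a symbol, while "☭" is So (other symbol).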

How to enumerate all Unicode canonically equivalent sequences in Perl?

Does there exist a standard Perl module or function that, given a Unicode Combining Character Sequence (or, more generally, an arbitrary Unicode text string), will generate a list of all canonically equivalent strings?
For example, if given the character U+1EAD, I'd like to get back a list of all these canonically equivalent sequences:
0061 0302 0323
0061 0323 0302
00E2 0323
1EA1 0302
1EAD
(I don't particularly care whether the interface is in terms of arrays of USVs or utf strings.)
Is this an XY problem? If you want to compare or match two Unicode strings and you're worried that different ways of encoding the accented characters would create false negatives, then the best way to do this would be to normalize both strings using one of the normalization functions from Unicode::Normalize before doing the comparison or match.
Otherwise it gets a little messy.
You could get the complete character name using charnames::viacode(0x1EAD); (for U+1EAD it would be LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW), and get the various composing characters by splitting the name on WITH|AND. Then you could generate all combinations (checking that they exist!) of the base character + modifiers and the other modifiers. At this point you will run into the problem of matching the combining characters' names as they appear in the full name (e.g. CIRCUMFLEX) with the combining characters' real names (COMBINING CIRCUMFLEX ACCENT). There are probably rules for this, but I don't know them.
This would be my naive attempt, there may be better ways of doing this, but since so far no one has volunteered the information...
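To illustrate the normalization route suggested at the start of this answer, here is a small sketch. It is written in Python with the standard unicodedata module rather than Perl's Unicode::Normalize, purely as an illustration of the idea; the helper name canonically_equal is made up. It checks that the five sequences from the question all collapse to the same normalized form.

```python
import unicodedata

# The five canonically equivalent spellings listed in the question.
SEQUENCES = [
    "\u0061\u0302\u0323",
    "\u0061\u0323\u0302",
    "\u00e2\u0323",
    "\u1ea1\u0302",
    "\u1ead",
]

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

if __name__ == "__main__":
    for s in SEQUENCES:
        nfd = unicodedata.normalize("NFD", s)
        print(" ".join(f"{ord(c):04X}" for c in s), "->",
              " ".join(f"{ord(c):04X}" for c in nfd))
```

This does not enumerate the equivalent sequences, but if the underlying goal is comparison or matching, normalizing both sides makes the enumeration unnecessary.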