What are all characters that the Unicode Other_ID_Start and Other_ID_Continue properties include? - unicode

While reading about ID_Start and ID_Continue definitions, I found this: https://unicode.org/reports/tr31/#D1. It says that ID_Start includes Other_ID_Start and ID_Continue includes Other_ID_Continue. I'm unable to find the definitions of these other. The document I mentioned says that they're defined by UAX44. So for example, I tried to consult Unicode 15 version of UAX44: https://www.unicode.org/reports/tr44/tr44-30.html The table 9 (Property Table) only says:
Other_ID_Start Used to maintain backward compatibility of ID_Start.
Other than that, there is no additional information. What am I missing?

Other_ID_Start and Other_ID_Continue, like most binary character properties, are defined in the PropList.txt data file in the Unicode Character Database.
In particular, Other_ID_Start includes characters that used to be included in ID_Start automatically due to some other property they possessed, but now need to be specified manually because said property value has since changed. For example, U+212E ESTIMATED SYMBOL was originally classified as a letter and all letters are ID_Start by default, but later it was reclassified as a symbol and thus would have been excluded if it weren’t for the backwards compatibility requirement.

Related

Using Unicode Tags to represent something other than a subdivision flag?

To this day, unicode tags are exclusively used to represent subdivision flags. But are they limited to subdivisions? Can I use them for instance to represent a language flag like the Esperanto Flag using the mapping:
🏴 + [e] + [o] + [cancel]
You can use any sequence of Unicode characters to represent anything. The trouble is that nobody will be able to understand your message unless you’re following an established protocol.
If you want to conform to the Unicode Standard, or more specifically Unicode Technical Standard #51 (Unicode Emoji), then the tag sequence you are considering is invalid, has no meaning, and should ideally be displayed using a special “error” glyph to indicate its malformedness. Annex C of UTS #51 has more information on that matter.
As of the time of writing, there is only one type of valid emoji tag sequences: Those representing flags of regions with a region code ultimately derived from ISO 3166-2. The Esperanto language does not possess such a region code (because it is not a region) and thus cannot be represented with this mechanism. I recommend employing private-use characters.

Which nonnegative integers aren't assigned a character in the UCS?

Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).
Note: There's a difference between characters and abstract characters: the latter term more closely refers to our notion of character, while the first is a concept in the context of coded character sets. Some abstract characters are represented by more than one character. The Unicode article at Wikipedia cites an example:
For example, a Latin small letter "i" with an ogonek, a dot above, and
an acute accent [an abstract character], which is required in
Lithuanian, is represented by the character sequence U+012F, U+0307,
U+0301.
The UCS (Universal Coded Character Set) is a coded character set defined by the International Standard ISO/IEC 10646, which, for reference, may be downloaded through this official link.
The task at hand is to tell whether a given nonnegative integer is mapped to a character by the UCS, the Universal Coded Character Set.
Let us consider first the nonnegative integers that are not assigned a character, even though they are, in fact, reserved by the UCS. The UCS (§ 6.3.1, Classification, Table 1; page 19 of the linked document) lists three possibilities, based on the basic type that corresponds to them:
surrogate (the range D800–DFFF)
noncharacter (the range FDD0–FDEF plus any code point ending in the value FFFE or FFFF)
The Unicode standard defines noncharacters as follows:
Noncharacters are code points that are permanently reserved and will
never have characters assigned to them.
This page lists noncharacters more precisely.
reserved (I haven't found which nonnegative integers belong to this category)
On the other hand, code points whose basic type is any of:
graphic
format
control
private use
are assigned to characters. This is, however, open to discussion. For instance, should private use code points be considered to actually be assigned any characters? The very UCS (§ 6.3.5, Private use characters; page 20 of the linked document) defines them as:
Private use characters are not constrained in any way by this
International Standard. Private use characters can be used to provide
user-defined characters.
Additionally, I would like to know the range of nonnegative integers that the UCS maps or reserves. What is the maximum value? In some pages I have found that the whole range of nonnegative integers that the UCS maps is –presumably– 0–0x10FFFF. Is this true?
Ideally, this information would be publicly offered in a machine-readable format that one could build algorithms upon. Is it, by chance?
For clarity: What I need is a function that takes a nonnegative integer as argument and returns whether it is mapped to a character by the UCS. Additionally, I would prefer that it were based on official, machine-readable information. To answer this question, it would be enough to point to one such resource that I could build the function myself upon.
The Unicode Character Database (UCD) is available on the unicode.org site; it is certainly machine-readable. It contains a list of all of the assigned characters. (Of course, the set of assigned codepoints is larger with every new version of Unicode.) Full documentation on the various files which make up the UCD is also linked from the UCD page.
The range of potential codes is, as you suspect, 0-0x10FFFF. Of those, the non-characters and the surrogate blocks will never be assigned as codepoints to any character. Codes in the private use areas can be assigned to characters only by mutual agreement between applications; they will never be assigned to characters by Unicode itself. Any other code might be.

Character Sets, Locales, Fonts and Code Pages?

I can't figure out the relation between those terms. I actually need a brief explanation for each and eventually a relation between them.
Moreover, where do all these stuff reside? Where are they implemented? Is it the job of the operating system to manage the aforementioned terms? If not, then who is responsible for this job?
Character Sets describe the relationship between character codes and characters.
As an example, all (extended) ASCII character sets assign 41hex == 65dec to A.
Common character sets are ASCII, Unicode (UTF-8, UTF-16), Latin-1 and Windows-1252.
Code Pages are a a representation of character sets / a mechanism for selecting which character set is in use: There are the old DOS/computer manufacturer codepages and the legacy support for them in Windows, ANSI-Codepage (ANSI is not to blame here) and OEM-Codepage.
If you have a choice, avoid them like the plague and go for unicode, preferably UTF-8, though UTF-16 is an acceptable choice for the OS-facing part in Windows.
Locales are collections of all the information needed to conform to local conventions for display of information. On systems which neither use code pages nor have a system-defined universal character set, they also determine the character set used (e.g. Unixoids).
Fonts are the graphics and ancillary information neccessary for displaying text of a known encoding. Examples are "Times New Roman", "Verdana", "Arial" and "WingDings".
Not all Fonts have symbols for all characters present in any specific character set.

What Unicode normalization (and other processing) is appropriate for passwords when hashing?

If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. This is under the control of the user-agent or its operating system.
I'd like to ensure that those hash to the same thing.
I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
Reference
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
Considerations
Any normalization procedure may cause collisions, e.g. "office" == "office".
Normalization can change the number of bytes in the string.
Further questions
What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
What happens if the server receives characters that are unassigned in its version of Unicode?
Normalization is undefined in case of malformed inputs, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be interpreted differently in different environments: Rejection, replacement, or omission.
Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
The Unicode Annex 15 guarantees normalization stability when the input contains assigned characters only:
11.1 Stability of Normalized Forms
For all versions, even prior to Unicode 4.1, the following policy is followed:
A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.
More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.
Recommendation #2: Whichever normalization form is used must use the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.
The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.
The spec warns:
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
However, semantics and round-tripping are not of concern here.
Recommendation #3: Apply NFKC or NFKD before hashing.
As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.
RFC 8265, § 4.1:
This document specifies that a password is a string of Unicode code points [Unicode] that is conformant to the OpaqueString profile (specified below) of the PRECIS FreeformClass defined in Section 4.3 of [RFC8264] and expressed in a standard Unicode Encoding Form (such as UTF-8 [RFC3629]).
RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that the following rules be applied in the following order:
the string must be prepared to ensure that it consists only of Unicode code point explicitly allowed by the FreeformClass string class defined in RFC 8264, § 4.3. Certain characters are specified as:
Valid: traditional letters and number, all printable, non-space code points from the 7-bit ASCII range, space code points, symbol code points, punctuation code points, “[a]ny code point that is decomposed and recomposed into something other than itself under Unicode Normalization Form KC, i.e., the HasCompat (‘Q’) category defined under Section 9.17,” and “[l]etters and digits other than the ‘traditional’ letters and digits allowed in IDNs, i.e., the OtherLetterDigits (‘R’) category defined under Section 9.18.”
Invalid: Old Hangul Jamo code points, control code points, and ignorable code points. Further, any currently unassigned code points are considered invalid.
“Contextual Rule Required”: a number of code points from an “
Exceptions” category and “joining code points.” (“Contextual Rule Required” means: “Some characteristics of the code point, such as its being invisible in certain contexts or problematic in others, require that it not be used in a string unless specific other code points or properties are present in the string.”)
Width Mapping Rule: Fullwidth and halfwidth code points MUST NOT be mapped to their decomposition mappings.
Additional Mapping Rule: Any instances of non-ASCII space MUST be mapped to SPACE (U+0020).
Unicode Normalization Form C (NFC) MUST be applied to all strings.
I can’t speak for any other programming language, but the Python package precis-i18n implements the PRECIS framework described in RFCs 8264, 8265, 8266.
Here’s an example of how simple it is to enforce the OpaqueString profile on a password string:
# pip install precis-i18n
>>> import precis_i18n
>>> precis_i18n.get_profile('OpaqueString').enforce('😳å∆3⨁ucei=The4e-iy5am=3iemoo')
'😳å∆3⨁ucei=The4e-iy5am=3iemoo'
>>>
I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction and source of Python examples.

XML and Unicode specifications: what’s a legal character?

My manager asked me to explain why I called jdom’s checkCharacterData before passing my string to an XMLStreamWriter, so I referred to the XML spec and then got confused.
XML 1.0 and XML 1.1 say that a valid XML character is “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.” That sounds stupid: tab, carriage return, and line feed are legal characters of Unicode. Then there’s the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF,” which was modified in XML 1.1 to refer to U+0000 – U+10FFFF excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there’s the Note that says authors are “discouraged” from using the compatibility characters including some characters that are already excluded by the BNF.
Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Question: What is/was a legal Unicode character?
The Unicode Glossary defines it thus:
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]
Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.)
NUL is a codepoint, and it falls under the definition of "abstract character" so it is a character by sense 2 above.
Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded?
NUL has been a control character from early versions.
Appendix D contains a list of changes.
It says in table D.2 that there have been 65 control characters from Version 1 through Version 3 without change.
Table D-2 documents the number of characters assigned in the different versions of the Unicode standard.
V1.0 V1.1 V2.0 V2.1 V3.0
...
Controls 65 65 65 65 65
And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Writing specifications that are both complete and succinct is hard. When the text disagrees with the BNF, trust the BNF.
The use of the word “character” is intentionally fuzzy in the Unicode standard, but mostly it is used in a technical sense: a code point designated as an assigned character code point. This does not completely coincide with the intuitive concept of character. For example, the intuitive character that consists of letter i with macron and grave accent does not exist as a code point; in Unicode, it can only be represented as a sequence of two or three code points. As another example, the so-called control characters are not characters in the intuitive sense.
When other standards and specifications refer to “Unicode characters,” they refer to code points designated as assigned character code points. The set of Unicode characters varies by Unicode standard version, since new code points are assigned. Technically, the UnicodeData.txt file (at ftp://ftp.unicode.org/Public/UNIDATA/) indicates which code points are characters.
U+0000, conventionally denoted by NUL, has been a Unicode character since the beginning.
The XML specifications are inexact in many ways as regards to characters, as you have observed. But the essential definition is the BNF production for “Char” and the statement “XML processors MUST accept any character in the range specified for Char.” This means that in XML specifications, the concept of character is broader than Unicode character. The ranges in the production contain unassigned code points, actually a huge number of them.
The comment to the “Char” production in XML specifications is best ignored. It is very confusing and even incorrect. The “Char” production simply refers to a set of Unicode code points (different sets in different versions of XML). The set includes code points that you should never use in character data, as well as code points that should be avoided for various reasons. But such rules are at a level different from the formal rules of XML and requirements on XML implementations.
When selecting or writing a routine for checking character data, it depends on the application and purpose what should be accepted and what should be done with code points that fail the test. Even surrogate code points might be processed in some way instead of being just discarded; they may well appear due to confusions with encodings (or e.g. when a Java string has been naively taken as a string of Unicode characters – it is as such just a sequence of 16-bit code units).
I would ignore the verbage and just focus on the definitions:
XML 1.0:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
XML 1.1:
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
Document authors are encouraged to avoid "compatibility characters", as defined in Unicode [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
It sounds stupid because it is stupid. The First Edition of XML (1998) read "the legal graphic characters of Unicode." For whatever reason, the word "graphic" was removed from the Second Edition of 2000, perhaps because it is inaccurate: XML allows many characters that are not graphic characters.
The definition in the Char production is indeed the right place to look.