I have been working on Unicode normalization for NFKC. In Section 1.3 I found the following line:
For NFKC or NFKD, one does a full compatibility decomposition, which
makes use of canonical and compatibility Decomposition_Mapping values.
Where can I get the canonical and compatibility Decomposition_Mapping values?
According to http://old.kpfu.ru/eng/departments/ktk/test/perl/lib/unicode/UCDFF301.html, the 5th field of the following file contains the Decomposition_Mapping values:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
Related
While reading about the ID_Start and ID_Continue definitions, I found this: https://unicode.org/reports/tr31/#D1. It says that ID_Start includes Other_ID_Start and ID_Continue includes Other_ID_Continue. I'm unable to find the definitions of these "other" properties. The document I mentioned says that they're defined by UAX #44, so I consulted the Unicode 15 version of UAX #44: https://www.unicode.org/reports/tr44/tr44-30.html. Table 9 (Property Table) only says:
Other_ID_Start Used to maintain backward compatibility of ID_Start.
Other than that, there is no additional information. What am I missing?
Other_ID_Start and Other_ID_Continue, like most binary character properties, are defined in the PropList.txt data file in the Unicode Character Database.
In particular, Other_ID_Start includes characters that used to be included in ID_Start automatically due to some other property they possessed, but now need to be specified manually because said property value has since changed. For example, U+212E ESTIMATED SYMBOL was originally classified as a letter and all letters are ID_Start by default, but later it was reclassified as a symbol and thus would have been excluded if it weren’t for the backwards compatibility requirement.
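For example, here is a minimal sketch in Python of pulling the code points for a binary property such as Other_ID_Start out of a local copy of PropList.txt (the file layout is described in UAX #44; the function name is just for illustration):

def binary_property_codepoints(path, prop_name):
    """Collect the code points listed for one binary property in PropList.txt."""
    codepoints = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()      # drop trailing comments
            if not line:
                continue
            cp_field, prop = (part.strip() for part in line.split(";", 1))
            if prop != prop_name:
                continue
            if ".." in cp_field:                      # a range such as 1885..1886
                start, end = (int(cp, 16) for cp in cp_field.split(".."))
            else:
                start = end = int(cp_field, 16)
            codepoints.update(range(start, end + 1))
    return codepoints

for cp in sorted(binary_property_codepoints("PropList.txt", "Other_ID_Start")):
    print(f"U+{cp:04X}")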
I'm writing a script to create tables containing Unicode characters for case folding, etc.
I was able to extract those tables just fine, but I'm struggling to figure out which properties to use to get codepoints for normalization.
In Unicode Standard Annex #44, the closest property group I can find is NF(C|D|KC|KD)_QC, which is for telling whether a string has already been normalized, and it still doesn't list the values I need to actually build the tables.
What am I doing wrong here?
Edit: I'm writing a C library to handle Unicode. This isn't a simple one-and-done "write it in Python" problem; I'm trying to write my own normalization (technically composition/decomposition) functions.
Edit 2: The decomposition property is "dm", but what about composition, and the compatibility (K) variants?
The Unicode XML database in the ucdxml directory is not authoritative. I'd suggest working with the authoritative files in the ucd directory. You'll need
the fields Decomposition_Type and Decomposition_Mapping from column 5 of UnicodeData.txt,
the field Canonical_Combining_Class from column 3, and
the composition exclusions from CompositionExclusions.txt.
If there's a decomposition type in angle brackets, it's a compatibility mapping (NFKD), otherwise it's a canonical mapping. Composition is defined in terms of decomposition mappings. See section 3.11 Normalization Forms of the Unicode standard and UAX #15 for details.
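As a rough sketch (in Python, assuming a local copy of UnicodeData.txt; field numbering per UAX #44), the decomposition and combining-class data can be read like this:

def load_normalization_data(path="UnicodeData.txt"):
    """Parse the UnicodeData.txt fields needed for (de)composition.

    Returns (ccc, decomp): ccc maps code point -> Canonical_Combining_Class,
    decomp maps code point -> (is_compat, [decomposed code points]).
    """
    ccc, decomp = {}, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(";")
            cp = int(fields[0], 16)
            if fields[3] != "0":                      # field 3: combining class
                ccc[cp] = int(fields[3])
            if fields[5]:                             # field 5: decomposition
                parts = fields[5].split()
                is_compat = parts[0].startswith("<")  # <tag> => compatibility (NFK*)
                if is_compat:
                    parts = parts[1:]                 # drop the <tag>
                decomp[cp] = (is_compat, [int(p, 16) for p in parts])
    return ccc, decomp

The canonical composition pairs are then derived from the canonical (non-compatibility) mappings, minus the entries in CompositionExclusions.txt and the other exclusions (singletons and non-starter decompositions) spelled out in section 3.11.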
Can you please explain how version comparison works for the update check? How does the updater determine that the version in updates.xml is newer than the installed version?
For example, my versioning uses the formula YEAR/MAJOR.MINOR: 2015/1, 2015/1.1, 2015/1.2, 2015/2, 2016/1 and so on. But I also have specific releases such as 2015/1.2-LOK15. How will these version numbers be compared during the version check?
Thank you in advance.
install4j transforms the version string into an array of version components. The separators for creating version components are ".", "-" and "_". Each version component has an optional leading text part and a trailing numeric part. The text parts are compared lexically, the numeric parts are compared numerically.
Version components that start with non-numeric characters, like "LOK15", are generally considered to be precursor versions of the same version component without the text part (like "beta" or "RC"). So 2015/1.2-LOK15 is considered to be lower than 2015/1.2. However, 2015/1.2-15LOK would be higher than 2015/1.2.
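Purely as an illustration of the rules described above, a sketch in Python (not install4j's actual code; I've also treated "/" as a separator so the YEAR/MAJOR.MINOR examples split, which is an assumption):

import re
from itertools import zip_longest

def component_key(part):
    """One version component: optional leading text part, trailing numeric part.

    Components with a text part ("RC1", "LOK15") rank below a missing or
    purely numeric component at the same position.
    """
    text, number = re.match(r"(\D*)(\d*)", part).groups()
    return (0, text, int(number or 0)) if text else (1, "", int(number or 0))

def compare_versions(a, b):
    ka = [component_key(p) for p in re.split(r"[.\-_/]", a)]
    kb = [component_key(p) for p in re.split(r"[.\-_/]", b)]
    # Pad the shorter version with "missing component" placeholders.
    ka, kb = zip(*zip_longest(ka, kb, fillvalue=(1, "", 0)))
    return (ka > kb) - (ka < kb)

print(compare_versions("2015/1.2-LOK15", "2015/1.2"))  # -1: LOK15 is a precursor
print(compare_versions("2015/1.2-15LOK", "2015/1.2"))  #  1: numeric 15 wins
print(compare_versions("2016/1", "2015/2"))            #  1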
I am having a hard time understanding unicode sorting order.
When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1 I get a return value of -1 indicating that _ comes before #.
However, looking at http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec I see that # (U+0023) comes before _ (U+005F). Why is ICU4J returning a value of -1?
First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.
Now, the page you linked to shows everything in numerical Code Point order. That is the order things would sort in if using a binary collation (in SQL Server, that would be collations with names ending in _BIN and _BIN2). But the non-binary ordering is far more complex. The rules are described here: Unicode Collation Algorithm (UCA).
The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt
It shows:
005F ; [*010A.0020.0002] # LOW LINE
...
0023 ; [*0290.0020.0002] # NUMBER SIGN
It is very important to keep in mind that any locale / culture can override these base rules. Hence, while the few lines noted above explain this specific circumstance, other circumstances would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see if there are any locale-specific overrides.
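For what it's worth, the same comparison can be reproduced from Python through the PyICU bindings (assuming the PyICU package is installed):

from icu import Collator, Locale

collator = Collator.createInstance(Locale("en"))
print(collator.compare("_", "#"))   # -1: LOW LINE collates before NUMBER SIGN
print("_" < "#")                    # False: raw code point order says the opposite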
Converting Mark Ransom's comments into an answer:
The ordering of individual characters is based on a collation table, which has little relationship to the codepoint numbers. See: http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
If you follow the first link on that page, it leads to allkeys.txt which gives the default collation ordering.
In particular, _ is 005F ; [*020B.0020.0002] # LOW LINE while # is 0023 ; [*0391.0020.0002] # NUMBER SIGN. Note that the collation numbers for _ are lower than the numbers for #.
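To see those weights in action, the collator's sort keys make the ordering explicit (again PyICU, assuming it is available; sort keys compare in collation order):

from icu import Collator, Locale

collator = Collator.createInstance(Locale("en"))
sort_key_underscore = collator.getSortKey("_")
sort_key_hash = collator.getSortKey("#")
print(sort_key_underscore < sort_key_hash)   # True: '_' sorts first by default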
If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. Which form is produced is under the control of the user agent or its operating system.
I'd like to ensure that those hash to the same thing.
I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
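To make the goal concrete, here is the "mañana" example checked with Python's standard unicodedata module (just an illustration; the real system need not be Python):

import unicodedata

composed = "ma\u00F1ana"             # "ñ" as one precomposed code point
decomposed = "ma\u006E\u0303ana"     # "n" followed by U+0303 COMBINING TILDE

print(composed == decomposed)                       # False: different code points
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))     # True: equal after NFC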
Reference
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
Considerations
Any normalization procedure may cause collisions, e.g. "oﬃce" (with the ligature U+FB03) == "office" under compatibility normalization.
Normalization can change the number of bytes in the string.
Further questions
What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
What happens if the server receives characters that are unassigned in its version of Unicode?
Normalization is undefined in case of malformed inputs, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be interpreted differently in different environments: Rejection, replacement, or omission.
Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
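For example, a minimal sketch in Python, assuming the password arrives as raw bytes claimed to be UTF-8:

def decode_password(raw: bytes) -> str:
    """Reject byte sequences that are not valid UTF-8 rather than guessing."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        raise ValueError("password is not valid UTF-8") from None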
Unicode Standard Annex #15 guarantees normalization stability when the input contains only assigned characters:
11.1 Stability of Normalized Forms
For all versions, even prior to Unicode 4.1, the following policy is followed:
A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.
More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.
Recommendation #2: Whichever normalization form is used, apply the Normalization Process for Stabilized Strings, i.e., reject any password input that contains unassigned characters, since its normalization is not guaranteed to be stable across server upgrades.
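One way to detect unassigned code points is to check for general category Cn; a sketch with Python's unicodedata, whose data is tied to the interpreter's own Unicode version (see unicodedata.unidata_version):

import unicodedata

def contains_unassigned(s: str) -> bool:
    """True if any code point is unassigned (general category Cn)."""
    return any(unicodedata.category(ch) == "Cn" for ch in s)

print(contains_unassigned("ma\u00F1ana"))   # False
print(contains_unassigned("\U000E0080"))    # True: unassigned in current versions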
The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.
The spec warns:
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
However, semantics and round-tripping are not of concern here.
Recommendation #3: Apply NFKC or NFKD before hashing.
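Putting the recommendations together, a minimal sketch (Python's unicodedata plus SHA-256 purely for illustration; a real system would feed the result to a proper password hash such as bcrypt or Argon2):

import hashlib
import unicodedata

def password_digest(password: str) -> str:
    # Recommendation #2: refuse code points whose normalization may change.
    if any(unicodedata.category(ch) == "Cn" for ch in password):
        raise ValueError("password contains unassigned code points")
    # Recommendation #3: compatibility normalization before hashing.
    normalized = unicodedata.normalize("NFKC", password)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Both spellings of "mañana" now produce the same digest.
assert password_digest("ma\u00F1ana") == password_digest("ma\u006E\u0303ana")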
As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.
RFC 8265, § 4.1:
This document specifies that a password is a string of Unicode code points [Unicode] that is conformant to the OpaqueString profile (specified below) of the PRECIS FreeformClass defined in Section 4.3 of [RFC8264] and expressed in a standard Unicode Encoding Form (such as UTF-8 [RFC3629]).
RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that the following rules be applied in the following order:
the string must be prepared to ensure that it consists only of Unicode code points explicitly allowed by the FreeformClass string class defined in RFC 8264, § 4.3. Certain characters are specified as:
Valid: traditional letters and numbers, all printable, non-space code points from the 7-bit ASCII range, space code points, symbol code points, punctuation code points, “[a]ny code point that is decomposed and recomposed into something other than itself under Unicode Normalization Form KC, i.e., the HasCompat (‘Q’) category defined under Section 9.17,” and “[l]etters and digits other than the ‘traditional’ letters and digits allowed in IDNs, i.e., the OtherLetterDigits (‘R’) category defined under Section 9.18.”
Invalid: Old Hangul Jamo code points, control code points, and ignorable code points. Further, any currently unassigned code points are considered invalid.
“Contextual Rule Required”: a number of code points from an “Exceptions” category and “joining code points.” (“Contextual Rule Required” means: “Some characteristics of the code point, such as its being invisible in certain contexts or problematic in others, require that it not be used in a string unless specific other code points or properties are present in the string.”)
Width Mapping Rule: Fullwidth and halfwidth code points MUST NOT be mapped to their decomposition mappings.
Additional Mapping Rule: Any instances of non-ASCII space MUST be mapped to SPACE (U+0020).
Unicode Normalization Form C (NFC) MUST be applied to all strings.
I can’t speak for any other programming language, but the Python package precis-i18n implements the PRECIS framework described in RFCs 8264, 8265, 8266.
Here’s an example of how simple it is to enforce the OpaqueString profile on a password string:
# pip install precis-i18n
>>> import precis_i18n
>>> precis_i18n.get_profile('OpaqueString').enforce('😳å∆3⨁ucei=The4e-iy5am=3iemoo')
'😳å∆3⨁ucei=The4e-iy5am=3iemoo'
>>>
I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction and source of Python examples.