What Unicode normalization (and other processing) is appropriate for passwords when hashing?

If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another, the hashes will differ and the login will fail. Which form gets sent is under the control of the user agent or its operating system, not the user.
I'd like to ensure that those hash to the same thing.
I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
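To make the failure concrete, here is a quick sketch using Python's standard unicodedata and hashlib modules (any hash would do; SHA-256 is just for illustration):
>>> import hashlib, unicodedata
>>> composed = 'ma\u00F1ana'     # precomposed U+00F1
>>> decomposed = 'man\u0303ana'  # 'n' followed by combining tilde U+0303
>>> composed == decomposed
False
>>> hashlib.sha256(composed.encode()).digest() == hashlib.sha256(decomposed.encode()).digest()
False
>>> unicodedata.normalize('NFC', composed) == unicodedata.normalize('NFC', decomposed)
True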
Reference
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
Considerations
Any normalization procedure may cause collisions, e.g. "oﬃce" (with the ffi ligature U+FB03) == "office".
Normalization can change the number of bytes in the string.
Further questions
What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
What happens if the server receives characters that are unassigned in its version of Unicode?

Normalization is undefined for malformed input, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be handled differently in different environments: rejection, replacement, or omission.
Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
Unicode Standard Annex #15 guarantees normalization stability when the input contains only assigned characters:
11.1 Stability of Normalized Forms
For all versions, even prior to Unicode 4.1, the following policy is followed:
A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.
More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.
Recommendation #2: Whichever normalization form is used must use the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.
The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.
The spec warns:
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
However, semantics and round-tripping are not of concern here.
Recommendation #3: Apply NFKC or NFKD before hashing.
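A minimal sketch of all three recommendations in Python, using the standard unicodedata module (the function name and error messages are my own):
import unicodedata

def prepare_password(raw: bytes) -> str:
    # Recommendation #1: reject byte sequences that are not valid UTF-8.
    try:
        text = raw.decode('utf-8')  # strict decoding raises on illegal sequences
    except UnicodeDecodeError:
        raise ValueError('password is not valid UTF-8')
    # Recommendation #2: reject unassigned code points (General_Category Cn),
    # since their normalization is not guaranteed stable across Unicode versions.
    if any(unicodedata.category(ch) == 'Cn' for ch in text):
        raise ValueError('password contains unassigned code points')
    # Recommendation #3: apply a compatibility normalization form before hashing.
    return unicodedata.normalize('NFKC', text)
Note that the Cn check answers for the Unicode version bundled with the interpreter, so upgrading the runtime can widen the set of accepted code points, which is exactly what the stability policy quoted above permits.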

As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.
RFC 8265, § 4.1:
This document specifies that a password is a string of Unicode code points [Unicode] that is conformant to the OpaqueString profile (specified below) of the PRECIS FreeformClass defined in Section 4.3 of [RFC8264] and expressed in a standard Unicode Encoding Form (such as UTF-8 [RFC3629]).
RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that the following rules be applied in the following order:
the string must be prepared to ensure that it consists only of Unicode code points explicitly allowed by the FreeformClass string class defined in RFC 8264, § 4.3. Certain characters are specified as:
Valid: traditional letters and numbers, all printable, non-space code points from the 7-bit ASCII range, space code points, symbol code points, punctuation code points, “[a]ny code point that is decomposed and recomposed into something other than itself under Unicode Normalization Form KC, i.e., the HasCompat (‘Q’) category defined under Section 9.17,” and “[l]etters and digits other than the ‘traditional’ letters and digits allowed in IDNs, i.e., the OtherLetterDigits (‘R’) category defined under Section 9.18.”
Invalid: Old Hangul Jamo code points, control code points, and ignorable code points. Further, any currently unassigned code points are considered invalid.
“Contextual Rule Required”: a number of code points from an “Exceptions” category and “joining code points.” (“Contextual Rule Required” means: “Some characteristics of the code point, such as its being invisible in certain contexts or problematic in others, require that it not be used in a string unless specific other code points or properties are present in the string.”)
Width Mapping Rule: Fullwidth and halfwidth code points MUST NOT be mapped to their decomposition mappings.
Additional Mapping Rule: Any instances of non-ASCII space MUST be mapped to SPACE (U+0020).
Unicode Normalization Form C (NFC) MUST be applied to all strings.
I can’t speak for any other programming language, but the Python package precis-i18n implements the PRECIS framework described in RFCs 8264, 8265, 8266.
Here’s an example of how simple it is to enforce the OpaqueString profile on a password string:
# pip install precis-i18n
>>> import precis_i18n
>>> precis_i18n.get_profile('OpaqueString').enforce('😳å∆3⨁ucei=The4e-iy5am=3iemoo')
'😳å∆3⨁ucei=The4e-iy5am=3iemoo'
>>>
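For contrast, here is what rejection looks like; if I read the precis-i18n documentation correctly, enforcement failures surface as a UnicodeEncodeError whose reason names the violated rule:
>>> try:
...     precis_i18n.get_profile('OpaqueString').enforce('pass\u0007word')  # U+0007 is a control code point
... except UnicodeEncodeError as exc:
...     print('rejected:', exc.reason)
...
rejected: DISALLOWED/controls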
I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction and source of Python examples.

Which nonnegative integers aren't assigned a character in the UCS?

Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).
Note: There's a difference between characters and abstract characters: the former is a concept in the context of coded character sets, while the latter more closely matches our intuitive notion of a character. Some abstract characters are represented by more than one character. The Unicode article at Wikipedia cites an example:
For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent [an abstract character], which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301.
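The three code points of that sequence can be inspected with Python's unicodedata module, for instance:
>>> import unicodedata
>>> [unicodedata.name(c) for c in '\u012F\u0307\u0301']
['LATIN SMALL LETTER I WITH OGONEK', 'COMBINING DOT ABOVE', 'COMBINING ACUTE ACCENT']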
The UCS (Universal Coded Character Set) is a coded character set defined by the International Standard ISO/IEC 10646, which, for reference, is available for download from ISO's Publicly Available Standards page.
The task at hand is to tell whether a given nonnegative integer is mapped to a character by the UCS, the Universal Coded Character Set.
Let us consider first the nonnegative integers that are not assigned a character, even though they are, in fact, reserved by the UCS. The UCS (§ 6.3.1, Classification, Table 1; page 19 of the linked document) lists three possibilities, based on the basic type that corresponds to them:
surrogate (the range D800–DFFF)
noncharacter (the range FDD0–FDEF plus any code point ending in the value FFFE or FFFF)
The Unicode standard defines noncharacters as follows:
Noncharacters are code points that are permanently reserved and will
never have characters assigned to them.
The Unicode FAQ on noncharacters lists them more precisely.
reserved (I haven't found which nonnegative integers belong to this category)
On the other hand, code points whose basic type is any of:
graphic
format
control
private use
are assigned to characters. This is, however, open to discussion. For instance, should private use code points be considered to actually be assigned any characters? The UCS itself (§ 6.3.5, Private use characters; page 20 of the linked document) defines them as:
Private use characters are not constrained in any way by this
International Standard. Private use characters can be used to provide
user-defined characters.
Additionally, I would like to know the range of nonnegative integers that the UCS maps or reserves. What is the maximum value? In some pages I have found that the whole range of nonnegative integers that the UCS maps is, presumably, 0–0x10FFFF. Is this true?
Ideally, this information would be publicly offered in a machine-readable format that one could build algorithms upon. Is it, by chance?
For clarity: What I need is a function that takes a nonnegative integer as argument and returns whether it is mapped to a character by the UCS. Additionally, I would prefer that it were based on official, machine-readable information. To answer this question, it would be enough to point to one such resource that I could build the function myself upon.
The Unicode Character Database (UCD) is available on the unicode.org site; it is certainly machine-readable. It contains a list of all of the assigned characters. (Of course, the set of assigned codepoints is larger with every new version of Unicode.) Full documentation on the various files which make up the UCD is also linked from the UCD page.
The range of potential codes is, as you suspect, 0-0x10FFFF. Of those, the non-characters and the surrogate blocks will never be assigned as codepoints to any character. Codes in the private use areas can be assigned to characters only by mutual agreement between applications; they will never be assigned to characters by Unicode itself. Any other code might be.
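The same UCD data ships with Python's unicodedata module, so a sketch of the requested function could look like this (the function name is mine; it reflects whichever Unicode version the interpreter bundles, and it follows the basic-type classification above, counting private-use, control, format, and graphic code points as assigned):
import unicodedata

def is_assigned(cp: int) -> bool:
    # Cn covers noncharacters and reserved (unassigned) code points;
    # Cs covers the surrogates. Everything else counts as assigned.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError('outside the UCS code space')
    return unicodedata.category(chr(cp)) not in ('Cn', 'Cs')

>>> is_assigned(97)        # LATIN SMALL LETTER A
True
>>> is_assigned(0xFDD0)    # noncharacter
False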

What is the limit to encoding base in case of Unicode strings as opposed to base64 having base = 64?

This is actually related to code golf in general, but also applicable elsewhere. People commonly use base64 encoding to store large amounts of binary data in source code.
Assuming all programming languages are happy to read Unicode source code, what is the max N for which we can reliably devise a baseN encoding?
Reliability here means being able to encode/decode any data, so every single combination of input bytes can be encoded and then decoded. Only the encoded form is exempt from this rule.
The main goal is to minimize the character count, regardless of byte-count.
Would it be base2147483647 (32-bit)?
Also, because I know it may vary from browser-to-browser, and we already have problems with copy-pasting code from codegolf answers to our editors, the copy-paste-ability is also a factor here. I know there is a Unicode range of characters that are not displayed.
NOTE:
I know that for binary data, base64 usually expands data, but here the character-count is the main factor.
It really depends on how reliable you want the encoding to be. Character encodings are designed with trade-offs, and in general the more characters allowed, the less likely the encoding is to be universally accepted, i.e., the less reliable it is. Base64 isn't immune to this. RFC 3548, published in 2003, mentions that case sensitivity may be an issue, and that the characters + and / may be problematic in certain scenarios. It describes Base32 (no lowercase) and Base16 (hex digits) as potentially safer alternatives.
It does not get better with Unicode. Adding that many characters introduces many more possible points of failure. Depending on how stringent your requirements are, you might have different values for N. I'll cover a few possibilities from large N to small N, adding a requirement each time.
1,114,112: Code points. This is the number of possible code points defined by the Unicode Standard.
1,112,064: Valid UTF. This excludes the surrogates which cannot stand on their own.
1,111,998: Valid for exchange between processes. Unicode reserves 66 code points as permanent non-characters for internal use only. Theoretically, this is the maximum N you could justifiably expect for your copy-paste scenario, but as you noted, in practice many other Unicode strings will fail that exercise.
120,503: Printable characters only, depending on your definition. I've defined it to be all characters outside of the Other and Separator general categories. Also, starting from this bullet point, N is subject to change in future versions of Unicode.
103,595: NFKD normalized Unicode. Unfortunately, many processes automatically normalize Unicode input to a standardized form. If the process uses NFKC or NFKD, some information may be lost. For more reliability, the encoding should thus define a normalization form, with NFKD being better for increasing character count.
101,684: No combining characters. These are "characters" which shouldn't stand on their own, such as accents, and are meant to be combined with another base character. Some processes might panic if they are left standing alone, or if there are too many combining characters on a single base character. I've now excluded the Mark category.
85: ASCII85, aka. I want my ASCII back. Okay, this is no longer Unicode, but I felt like mentioning it because it's a lesser known ASCII-only encoding. It's mainly used in Adobe's PostScript and PDF formats, and has a 5:4 encoded data size increase, rather than Base64's 4:3 ratio.
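To make the "character count over byte count" goal concrete, here is a hedged sketch of a generic baseN encoder over an arbitrary alphabet, treating the input as one big integer (simple, though quadratic on large inputs; the function name and alphabet choice are mine):
def encode_base_n(data: bytes, alphabet: str) -> str:
    # Interpret the bytes as a big-endian integer and emit base-N digits.
    n = len(alphabet)
    value = int.from_bytes(data, 'big')
    digits = []
    while value:
        value, rem = divmod(value, n)
        digits.append(alphabet[rem])
    # Re-add leading zero bytes, which the integer view would silently drop.
    digits.extend(alphabet[0] for _ in range(len(data) - len(data.lstrip(b'\x00'))))
    return ''.join(reversed(digits))

>>> encode_base_n(b'hello', '0123456789abcdef')  # same digits as b'hello'.hex()
'68656c6c6f'
Swap in an alphabet of up to 1,111,998 characters (minus whatever your reliability requirements exclude) to approach the limits discussed above.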

Unicode version of ABNF?

I want to write a grammar for a file format whose content can contain characters other than US-ASCII ones. Since I am used to ABNF, I try to use it...
However, neither RFC 5234 nor RFC 7405 is very friendly towards people who DO NOT use US-ASCII.
In fact, I'm looking for an ABNF version (and possibly some basic rules as well) which is character oriented rather than byte oriented; the only thing which RFC 5234 has to say about this is in section 2.4:
2.4. External Encodings
External representations of terminal value characters will vary
according to constraints in the storage or transmission environment.
Hence, the same ABNF-based grammar may have multiple external
encodings, such as one for a 7-bit US-ASCII environment, another for
a binary octet environment, and still a different one when 16-bit
Unicode is used. Encoding details are beyond the scope of ABNF,
although Appendix B provides definitions for a 7-bit US-ASCII
environment as has been common to much of the Internet.
By separating external encoding from the syntax, it is intended that
alternate encoding environments can be used for the same syntax.
That doesn't really clarify matters.
Is there a version of ABNF somewhere which is code point oriented rather than byte oriented?
Refer to section 2.3 of RFC 5234, which says:
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
Unicode is just the set of non-negative integers U+0000 through U+10FFFF minus the surrogate range D800–DFFF, and there are various RFCs that use ABNF over those code points accordingly. An example is RFC 3987.
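For instance, RFC 3987 defines its extended character range directly as numeric code point ranges in ABNF (excerpt; note how it simply skips the surrogates and the noncharacter ranges):
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
        / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
        / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
        / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
        / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
        / %xD0000-DFFFD / %xE1000-EFFFD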
If the ABNF you're writing is intended for human reading, then I'd say just use the normal syntax and refer to code points instead of bytes. You could take a look at various language specifications that allow Unicode in source text, e.g. C#, Java, PowerShell, etc. They all have a grammar, and they all have to define Unicode characters somewhere (e.g. for identifiers).
E.g. the PowerShell grammar has lines like this:
double-quote-character:
       " (U+0022)
       Left double quotation mark (U+201C)
       Right double quotation mark (U+201D)
       Double low-9 quotation mark (U+201E)
Or in the Java specification:
UnicodeInputCharacter:
       UnicodeEscape
       RawInputCharacter
UnicodeEscape:
       \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
       u
       UnicodeMarker u
RawInputCharacter:
       any Unicode character
HexDigit: one of
       0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
The \, u, and hexadecimal digits here are all ASCII characters.
Note that there is surrounding text explaining the intent – which is always better than just dumping a heap of grammar on someone.
If it's for automatic parser generation, you may be better off finding a tool that allows you to specify a grammar both in Unicode and ABNF-like form and publish that instead. People writing parsers should be expected to understand either, though.

What is a well-formed UTF-16 string?

I need some help understanding the concept of a well-formed UTF-16 string, as mentioned in these two paragraphs of Chapter 2, General Structure, § 2.7, Unicode Strings:
"Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16—that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide.
Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form. For example, isolated surrogates in a Unicode 16-bit string are not allowed when that string is specified to be well-formed UTF-16."
The paragraph explains it for UTF-16: not well-formed means the string contains isolated surrogate code units.
That is, there are certain code units which are only valid when they appear in pairs. A code unit in the range [0xD800-0xDFFF] must occur only in pairs where the first must be in the range [0xD800-0xDBFF] and the second must be in the range [0xDC00-0xDFFF]. If a string does not obey this requirement then it is not well-formed.
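Sketched as a Python function over a sequence of 16-bit code unit values (the function name is mine):
def is_well_formed_utf16(units):
    # Surrogates may appear only as a leading unit (D800-DBFF)
    # immediately followed by a trailing unit (DC00-DFFF).
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:        # leading surrogate
            if i + 1 == len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False             # isolated leading surrogate
            i += 2                       # well-formed surrogate pair
        elif 0xDC00 <= u <= 0xDFFF:      # trailing surrogate with no lead
            return False
        else:
            i += 1
    return True

>>> is_well_formed_utf16([0x0041, 0xD801, 0xDC37])  # 'A' plus U+10437 as a pair
True
>>> is_well_formed_utf16([0xD801, 0x0041])          # isolated leading surrogate
False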

What's the purpose of the noncharacters U+FDD0 to U+FDEF?

U+FFFE needs to be a noncharacter in order to allow the Byte Order Mark to work.
U+FFFF is described in The Unicode Standard as "useful for internal purposes as sentinels". Makes sense.
But I can't figure out, and The Unicode Standard doesn't really explain, why the set of noncharacters includes some random block within "Arabic Presentation Forms-A". What are these for? (Besides the eye of the basilisk?)
OK, the question is “what are they for” and “why are they in the middle of the Arabic Presentation Forms”.
There was a need for a block of 32 non-characters "to make additional codes available to programmers to use for internal processing purposes" http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=IWS-Chapter04a#4d3110c8
It was required that it be in the Basic Multilingual Plane (BMP), i.e. 0x0000 to 0xFFFF, so that they could have single-codepoint representations in UTF-16.
There was a block of unused codepoints in the Arabic Presentation Forms block.
It had been agreed not to encode any more Arabic Presentation Forms, so these were never going to be used.
http://www.unicode.org/mail-arch/unicode-ml/y2001-m10/0014.html
Therefore it was agreed that these codepoints, which were never going to be used otherwise, would be designated noncharacters so they could be used internally by applications/programmers.
These noncharacters are for internal use by applications and should not be interchanged.
I'll try to explain, based on what the Unicode Standard says.
Unicode has 66 noncharacters. Each of the 17 planes has two: its last two code points, ending in FFFE and FFFF. The other 32 noncharacters form the contiguous block U+FDD0 to U+FDEF.
So the total count is 17 × 2 + 32 = 66.
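The count is easy to verify mechanically, e.g. in Python:
>>> noncharacters = {0xFDD0 + i for i in range(32)}                              # U+FDD0..U+FDEF
>>> noncharacters |= {(p << 16) | u for p in range(17) for u in (0xFFFE, 0xFFFF)}  # plane-final pairs
>>> len(noncharacters)
66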
Read the following text from Chapter 16 of the Unicode Standard, which says the block is where it is for “historical reasons.” It's curious, but I don't think there is any ambiguity.
For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not "Arabic noncharacters" or "right-to-left noncharacters," and are not distinguished in any other way from the other noncharacters, except in their code point values.
U+FEFF is the BOM, and U+FFFE is its byte-swapped version. Since U+FFFE is a noncharacter, an interpreting process that finds U+FFFE as the first character knows either that it has encountered text in the wrong byte order or that the file is not valid Unicode text. This is just a signal, not a mandated behavior; it could be either case: byte-swapped text, or text that simply isn't valid.
The Unicode Standard, § 3.2, clause C2, says:
C2 A process shall not interpret a noncharacter code point as an abstract character.
The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
So as an application developer you are free to use these code points as you wish, as sentinels or delimiters (or maybe basilisk characters), but they should not be interchanged.
Section 16.7 says:
In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.
Again, U+FFFF is not reserved as a sentinel by the Unicode Standard; the standard just describes that typical use case. From Section 16.7:
U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.
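As a toy illustration of that sentinel use in Python (which compares strings by code point, so the noncharacter U+10FFFF sorts after any valid text):
>>> sentinel = '\U0010FFFF'   # the largest code point, and a noncharacter
>>> words = ['mañana', 'office', 'zebra']
>>> all(w < sentinel for w in words)
True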
As mentioned here at xkcd, U+FDD0 is actually the Unicode character for the eye of a basilisk. For (obvious) reasons of personal safety however, the character is not rendered to the screen... :)