Can anyone please explain me why the NFD normalization from U+2126 (Ω) and U+03A9 (Ω) results in the same representation and does not preserve the code point? I would have expected this behaviour for NFKD and NFKC (and for characters with diacritics) only.
result1 = unicodedata.normalize("NFD", u"\u2126")
result2 = unicodedata.normalize("NFD", u"\u03A9")
print("NFD: " + repr(result1))
print("NFD: " + repr(result2))
Output:
NFD: u'\u03a9'
NFD: u'\u03a9'
These are known as "singleton decompositions", and exist for characters like U+2126 (Ω) that are present in Unicode for compatibility with existing standards. They are not "compatibility decompositions" (like U+1D6C0 𝛀) because they are both visually and semantically identical to another code point (in this case, U+03A9 Ω).
Because they essentially duplicate another code point, one is chosen as the "preferred form" and the other is always replaced by it when normalised (into any form). The first form is essentially deprecated.
Related
This is a follow-up of this question. I'm interested by different glyphs for the same character, also known as "Unicode Compatibility Characters".
Let's take the following two Arabic "reversed-character" words: كلمة ةملك
First word is:
كلمة
in hex code:
0643 0644 0645 0629
Second word is:
ةملك
in hex code:
0629 0645 0644 0643
If I paste those two words in Microsoft Word using Deja Vu Sans, I get this:
With the following pseudo-code using FreeType2, I get:
FT_Face face;
FT_New_Face(library, "DejaVuSans.ttf", 0, &face);
FT_GlyphSlot slot;
FT_Load_Char(face, each_character, FT_LOAD_RENDER);
slot = face->glyph;
//Use slot->bitmap.buffer
FT_Done_Face(face);
What am I missing? How can I have the right glyphs depending of the context?
My key issue is that I store each "character" (I should say glyph - but for me, character was equivalent to glyph) in a table so it's going to be complicated. I'm limited in speed, not in space. Can I have two different unicode characters for the same logical character?
libraqm is a solution to get the glyth for each character depending of its position in the sentence. But I'm still interested to get the character corresponding to the glyth (I know it's not a 1-to-1 relation). For instance, there are 4 characters for the 4 glyths of the letter Kaf as stated in the comment above.
I try to use this solution and this (for str_eval()) but seems other encode or other UTF8's Normalization Form, perhaps combining diacritical marks...
select distinct logradouro, str_eval(logradouro)
from logradouro where logradouro like '%CECi%';
-- logradouro | str_eval
------------------------------+----------------------------
-- AV CECi\u008DLIA MEIRELLES | AV CECi\u008DLIA MEIRELLES
PROBLEM: how to select all rows of the table where the problem exists?That is, where \u occurs?
not works with like '%CECi\u%' neither like '%CECi\\u%'
works with like E'%CECi\u008D%' but is not generic
For Google, edited after solved question: this is a typical XY problem. In the original question (above) I used ~wrong hypothesis. All the solutions bellow are answers to the following (objective) question:
How to select only printable ASCII text?
"Printable ASCII" is a subset of UTF8, it is "all ASCII that is not a 'control character'".
The "non-printable" control characters are UNICODE hexadecimal 00 to 1F and 7F(HTML entity to + or decimal 0 to 31 + 127).
PS1: the zero () is the "end of text" mark of PostgreSQL text datatype internal representation, so not need to be checked, but no problems to include it in the range.
PS2: about the secondary question "how to convert a word with encode bug to a valid word?", see an heuristic at my answer.
This condition will exclude any strings that do not entirely consist of printable ASCII characters:
logradouro ~ '[^\u0020-\u007E]'
Solving with workaround
select distinct logradouro, str_eval(logradouro)
from logradouro where not(logradouro ~ E'^[a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+$');
There is a systematic bug on encode, no way to convert to correct UTF8... Even converting, the problem is that "CECi\u008DLIA" is not "CECíLIA".
The solution is to use a kind of "heuristic spell corrector" on
regexp_replace(logradouro, E'[^a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+', '!')
Example: the i! of "Ceci!lia" is corrected to í.
NOTICE. Any heuristic solution (or neural network) trained with a specific dataset (specific systematic error source) is a black box solution, valid only for that type of systematic error. There is no generalization for this type of problem.
If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. This is under the control of the user-agent or its operating system.
I'd like to ensure that those hash to the same thing.
I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
Reference
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
Considerations
Any normalization procedure may cause collisions, e.g. "office" == "office".
Normalization can change the number of bytes in the string.
Further questions
What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
What happens if the server receives characters that are unassigned in its version of Unicode?
Normalization is undefined in case of malformed inputs, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be interpreted differently in different environments: Rejection, replacement, or omission.
Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
The Unicode Annex 15 guarantees normalization stability when the input contains assigned characters only:
11.1 Stability of Normalized Forms
For all versions, even prior to Unicode 4.1, the following policy is followed:
A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.
More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.
Recommendation #2: Whichever normalization form is used must use the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.
The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.
The spec warns:
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
However, semantics and round-tripping are not of concern here.
Recommendation #3: Apply NFKC or NFKD before hashing.
As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.
RFC 8265, § 4.1:
This document specifies that a password is a string of Unicode code points [Unicode] that is conformant to the OpaqueString profile (specified below) of the PRECIS FreeformClass defined in Section 4.3 of [RFC8264] and expressed in a standard Unicode Encoding Form (such as UTF-8 [RFC3629]).
RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that the following rules be applied in the following order:
the string must be prepared to ensure that it consists only of Unicode code point explicitly allowed by the FreeformClass string class defined in RFC 8264, § 4.3. Certain characters are specified as:
Valid: traditional letters and number, all printable, non-space code points from the 7-bit ASCII range, space code points, symbol code points, punctuation code points, “[a]ny code point that is decomposed and recomposed into something other than itself under Unicode Normalization Form KC, i.e., the HasCompat (‘Q’) category defined under Section 9.17,” and “[l]etters and digits other than the ‘traditional’ letters and digits allowed in IDNs, i.e., the OtherLetterDigits (‘R’) category defined under Section 9.18.”
Invalid: Old Hangul Jamo code points, control code points, and ignorable code points. Further, any currently unassigned code points are considered invalid.
“Contextual Rule Required”: a number of code points from an “
Exceptions” category and “joining code points.” (“Contextual Rule Required” means: “Some characteristics of the code point, such as its being invisible in certain contexts or problematic in others, require that it not be used in a string unless specific other code points or properties are present in the string.”)
Width Mapping Rule: Fullwidth and halfwidth code points MUST NOT be mapped to their decomposition mappings.
Additional Mapping Rule: Any instances of non-ASCII space MUST be mapped to SPACE (U+0020).
Unicode Normalization Form C (NFC) MUST be applied to all strings.
I can’t speak for any other programming language, but the Python package precis-i18n implements the PRECIS framework described in RFCs 8264, 8265, 8266.
Here’s an example of how simple it is to enforce the OpaqueString profile on a password string:
# pip install precis-i18n
>>> import precis_i18n
>>> precis_i18n.get_profile('OpaqueString').enforce('😳å∆3⨁ucei=The4e-iy5am=3iemoo')
'😳å∆3⨁ucei=The4e-iy5am=3iemoo'
>>>
I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction and source of Python examples.
What is the likelihood that I'll run into COMBINING LATIN SMALL LETTER C (U+0368) in "real life" (besides clever Scottish folk)?
I'm asking since it's in both the Unicode Block Combining Diacritical Marks and the Category Mark, Nonspacing [Mn].
As a result, it seems to gets treated the same as characters such as COMBINING GRAVE ACCENT (U+0300) by Utilities such as the ICU Transliterator (using either the suggested "NFD; [:Nonspacing Mark:] Remove; NFC" or a straight "Latin-ASCII" transliteration).
The likelihood is very close to zero, but not exactly zero. You cannot prevent anyone from using a Unicode character as he likes. There is no specific information about U+0368 in the Unicode Standard, but it has definitely been defined as a combining character that will cause a symbol (c) to be displayed above the preceding character. I would expect to find it mostly in digitized forms of medieval manuscripts, or something like that.
Using it after a space character, as in the “clever” page mentioned, is not the intended use, but not invalid either. Unicode lets you use any combining mark after any character, whether it makes sense or not.
It has no canonical or compatibility decomposition, so there is no clear-cut way to deal with in a context where you cannot, or do not want to, retain the character.
The likelihood is utterly indeterminate except to say that if you expect it not to occur, then it will occur.
We are processing IBMEnterprise Japanese COBOL source code.
The rules that describe exactly what is allowed in G type literals,
and what are allowed for identifiers are unclear.
The IBM manual indicates that a G'....' literal
must have a SHIFT-OUT as the first character inside the quotes,
and a SHIFT-IN as the last character before the closing quote.
Our COBOL lexer "knows" this, but objects to G literals
found in real code. Conclusion: the IBM manual is wrong,
or we are misreading it. The customer won't let us see the code,
so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended below text for clarity:
Does anyone know the exact rules of G literal formation,
and how they (don't) match what the IBM reference manuals say?
The ideal answer would a be regular expression for the G literal.
This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where <name> is a macro that is another regular expression. Presumably they
are named well enough so you can guess what they contain.
Here is the IBM Enterprise COBOL Reference.
Chapter 3 "Character Strings", subheading "DBCS literals" page 32 is relevant reading.
I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means
when it says "one or more characters in the range X'00...X'FF for either byte"
How can DBCS-characters be anything but pairs of 8-bit character codes?
The existing RE matches 3 types of pairs of characters if you examine it.
One answer below suggests that the <squote><squote> pairing is wrong.
OK, I might believe that, but that means the RE would only reject
literal strings containing single <squote>s. I don't believe that's
the problem we are having as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparantly be composed
with DBCS characters. What is allowed for an identifier, exactly?
Again a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE.
We are reading Shift-JIS encoded text. Our reader converts that
text to Unicode as it goes. But DBCS characters are really
not Shift-JIS; rather, they are binary-coded data. Likely
what is happening is the that DBCS data is getting translated
as if it were Shift-JIS, and that would muck up the ability
to recognize "two bytes" as a DBCS element. For instance,
if a DBCS character pair were :81 :1F, a ShiftJIS reader
would convert this pair into a single Unicode character,
and its two-byte nature is then lost. If you can't count pairs,
you can't find the end quote. If you can't find the end quote,
you can't recognize the literal. So the problem would appear
to be that we need to switch input-encoding modes in the middle
of the lexing process. Yuk.
Try to add a single quote in your rule to see if it passes by making this change,
<squote><squote> => <squote>{1,2}
If I remember it correctly, one difference between N and G literals is that G allows single quote. Your regular expression doesn't allow that.
EDIT: I thought you got all other DBCS literals working and just having issues with G-string so I just pointed out the difference between N and G. Now I took a closer look at your RE. It has problems. In the Cobol I used, you can mix ASCII with Japanese, for example,
G"ABC<ヲァィ>" <> are Shift-out/shift-in
You RE assumes the DBCS only. I would loose this restriction and try again.
I don't think it's possible to handle G literals entirely in regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatching tokens manually.
You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16, treating it as ASCII will not work. SO/SI sometimes are converted to 0x1E/0x1F on Windows.
I am just trying to help you shoot in the dark without seeing the actual code :)
Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal closing character sequence >' ...
I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.