To compare two strings case insensitively, one correct way is to case
fold them first. How is this better than upper casing or lower casing?
I find examples where lower casing doesn't work right online. For
example "σ" and "ς" (two forms of "Σ") don't become the same when
converted to lower case. But I've failed to find why case folding is
better than mapping to upper case. Is there a case where two strings
that should match case insensitively don't upper case to the same
strings?
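The sigma example can be checked directly. This is a minimal sketch, assuming Python's built-in `str.casefold()`, which applies Unicode full case folding; the variable names are only illustrative.

```python
# The two small sigma forms and their shared capital.
sigma, final_sigma, capital_sigma = "\u03c3", "\u03c2", "\u03a3"

# Lowercasing leaves the two small forms distinct.
lower_match = sigma.lower() == final_sigma.lower()

# Case folding maps all three to U+03C3.
fold_match = sigma.casefold() == final_sigma.casefold() == capital_sigma.casefold()

# Uppercasing also happens to unify this particular trio.
upper_match = sigma.upper() == final_sigma.upper() == capital_sigma.upper()

print(lower_match, fold_match, upper_match)  # False True True
```

So this pair shows why lowercasing fails, but it does not by itself separate uppercasing from case folding.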
Another scenario is when I want to store a case insensitive index. The
recommended way seems to be case folding and then normalizing. What are
its advantages over storing the string mapped to upper case and
normalized? The specs say mapping to upper case is not guaranteed to be
stable across versions of Unicode while case folding is. But are there
any cases where mapping to upper case gives a different string in an
earlier version of Unicode?
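For the index scenario, a key function could look roughly like the sketch below. `caseless_key` is a hypothetical helper name, and the choice of NFD on both sides of the fold is an assumption modelled on the usual "normalize, fold, normalize" recipe rather than anything stated in the question.

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Hypothetical helper: fold case and normalize, for use as an index key."""
    # Normalize first so precomposed and decomposed input fold alike,
    # then fold, then normalize again because folding can denormalize.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())

print(caseless_key("Straße") == caseless_key("STRASSE"))   # True
print(caseless_key("\u212b") == caseless_key("a\u030a"))   # True: ANGSTROM SIGN vs a + ring
```

The keys depend on the Unicode data shipped with the interpreter, so an index built this way should record which version produced it.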
As per Unicode stability policy, case mappings are only stable for case pairs, i.e. pairs of characters X and Y where X is the full uppercase mapping of Y, and Y is the full lowercase mapping of X. Only when both these characters exist with these properties is the casing relation between them set in stone.
However, Unicode contains many “incomplete” case pairs where only the lowercase form has been encoded and the uppercase form is missing completely. This is usually the case for letters used in transcription systems that are traditionally lowercase-only. Should capital forms be discovered and subsequently added to Unicode, these letters would then receive a new uppercase mapping.
The most recent characters this has happened to are “ʂ” (from Unicode 1.1), “ᶎ” (from Unicode 4.1), and “ꞔ” (from Unicode 7.0), which all got brand new uppercase forms (Ʂ, Ᶎ, and Ꞔ respectively) in Unicode 12.0 two years ago.
Because case mappings do not have to be unique, this makes uppercasing a poor substitute for proper case-folding. For example, both U+0434 (д) and U+1C81 (ᲁ) uppercase to U+0414 (Д), but only the former is locked into a case pair by virtue of being U+0414’s full lowercase mapping. If someone were to find a dedicated capital letter version of U+1C81 in some old manuscript, it would be given a new uppercase mapping, resulting in U+0434 and U+1C81 suddenly no longer comparing equal under that operation.
EDIT: I have just remembered a current example of uppercasing not being sufficient for case-insensitive matching: U+1E9E (ẞ) is already a capital letter and thus uppercases to itself. Its lowercase counterpart is U+00DF (ß), but the uppercase mapping of U+00DF is the sequence <U+0053, U+0053> (SS).
uppercase("ẞ") ≠ uppercase(lowercase("ẞ"))
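Both examples from this answer can be reproduced in Python. This is a sketch only; the results depend on the Unicode version bundled with the interpreter (U+1C81 needs Unicode 9.0 data or later), and the variable names are illustrative.

```python
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00df"    # LATIN SMALL LETTER SHARP S

# The capital form uppercases to itself, the small form uppercases to "SS".
print(capital_sharp_s.upper() == small_sharp_s.upper())        # False
# Case folding sends both to "ss".
print(capital_sharp_s.casefold() == small_sharp_s.casefold())  # True

de = "\u0434"               # CYRILLIC SMALL LETTER DE
long_legged_de = "\u1c81"   # CYRILLIC SMALL LETTER LONG-LEGGED DE

# Today both uppercase to U+0414, but only the case folding is guaranteed stable.
print(de.upper() == long_legged_de.upper())                    # True
print(de.casefold() == long_legged_de.casefold())              # True
```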
I found a list from here.
As of Unicode 13.0.0.
Equivalence classes that have more than 1 uppercase mapping.
| case fold | original | UPPER CASE |
|---|---|---|
| k 006B LATIN SMALL LETTER K | K 004B LATIN CAPITAL LETTER K | K 004B LATIN CAPITAL LETTER K |
| | k 006B LATIN SMALL LETTER K | K 004B LATIN CAPITAL LETTER K |
| | K 212A KELVIN SIGN | K 212A KELVIN SIGN |
| ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S | ß 00DF LATIN SMALL LETTER SHARP S | SS 0053 LATIN CAPITAL LETTER S; 0053 LATIN CAPITAL LETTER S |
| | ẞ 1E9E LATIN CAPITAL LETTER SHARP S | ẞ 1E9E LATIN CAPITAL LETTER SHARP S |
| å 00E5 LATIN SMALL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE |
| | å 00E5 LATIN SMALL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE |
| | Å 212B ANGSTROM SIGN | Å 212B ANGSTROM SIGN |
| θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA |
| | θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA |
| | ϑ 03D1 GREEK THETA SYMBOL | Θ 0398 GREEK CAPITAL LETTER THETA |
| | ϴ 03F4 GREEK CAPITAL THETA SYMBOL | ϴ 03F4 GREEK CAPITAL THETA SYMBOL |
| ω 03C9 GREEK SMALL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA |
| | ω 03C9 GREEK SMALL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA |
| | Ω 2126 OHM SIGN | Ω 2126 OHM SIGN |
And for lowercasing.
| case fold | original | lower case |
|---|---|---|
| s 0073 LATIN SMALL LETTER S | S 0053 LATIN CAPITAL LETTER S | s 0073 LATIN SMALL LETTER S |
| | s 0073 LATIN SMALL LETTER S | s 0073 LATIN SMALL LETTER S |
| | ſ 017F LATIN SMALL LETTER LONG S | ſ 017F LATIN SMALL LETTER LONG S |
| st 0073 LATIN SMALL LETTER S; 0074 LATIN SMALL LETTER T | ſt FB05 LATIN SMALL LIGATURE LONG S T | ſt FB05 LATIN SMALL LIGATURE LONG S T |
| | st FB06 LATIN SMALL LIGATURE ST | st FB06 LATIN SMALL LIGATURE ST |
| β 03B2 GREEK SMALL LETTER BETA | Β 0392 GREEK CAPITAL LETTER BETA | β 03B2 GREEK SMALL LETTER BETA |
| | β 03B2 GREEK SMALL LETTER BETA | β 03B2 GREEK SMALL LETTER BETA |
| | ϐ 03D0 GREEK BETA SYMBOL | ϐ 03D0 GREEK BETA SYMBOL |
| ε 03B5 GREEK SMALL LETTER EPSILON | Ε 0395 GREEK CAPITAL LETTER EPSILON | ε 03B5 GREEK SMALL LETTER EPSILON |
| | ε 03B5 GREEK SMALL LETTER EPSILON | ε 03B5 GREEK SMALL LETTER EPSILON |
| | ϵ 03F5 GREEK LUNATE EPSILON SYMBOL | ϵ 03F5 GREEK LUNATE EPSILON SYMBOL |
| θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA | θ 03B8 GREEK SMALL LETTER THETA |
| | θ 03B8 GREEK SMALL LETTER THETA | θ 03B8 GREEK SMALL LETTER THETA |
| | ϑ 03D1 GREEK THETA SYMBOL | ϑ 03D1 GREEK THETA SYMBOL |
| | ϴ 03F4 GREEK CAPITAL THETA SYMBOL | θ 03B8 GREEK SMALL LETTER THETA |
| ι 03B9 GREEK SMALL LETTER IOTA | ◌ͅ 0345 COMBINING GREEK YPOGEGRAMMENI | ◌ͅ 0345 COMBINING GREEK YPOGEGRAMMENI |
| | Ι 0399 GREEK CAPITAL LETTER IOTA | ι 03B9 GREEK SMALL LETTER IOTA |
| | ι 03B9 GREEK SMALL LETTER IOTA | ι 03B9 GREEK SMALL LETTER IOTA |
| | ι 1FBE GREEK PROSGEGRAMMENI | ι 1FBE GREEK PROSGEGRAMMENI |
| ΐ 03B9 GREEK SMALL LETTER IOTA; 0308 COMBINING DIAERESIS; 0301 COMBINING ACUTE ACCENT | ΐ 0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS | ΐ 0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS |
| | ΐ 1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA | ΐ 1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA |
| κ 03BA GREEK SMALL LETTER KAPPA | Κ 039A GREEK CAPITAL LETTER KAPPA | κ 03BA GREEK SMALL LETTER KAPPA |
| | κ 03BA GREEK SMALL LETTER KAPPA | κ 03BA GREEK SMALL LETTER KAPPA |
| | ϰ 03F0 GREEK KAPPA SYMBOL | ϰ 03F0 GREEK KAPPA SYMBOL |
| μ 03BC GREEK SMALL LETTER MU | µ 00B5 MICRO SIGN | µ 00B5 MICRO SIGN |
| | Μ 039C GREEK CAPITAL LETTER MU | μ 03BC GREEK SMALL LETTER MU |
| | μ 03BC GREEK SMALL LETTER MU | μ 03BC GREEK SMALL LETTER MU |
| π 03C0 GREEK SMALL LETTER PI | Π 03A0 GREEK CAPITAL LETTER PI | π 03C0 GREEK SMALL LETTER PI |
| | π 03C0 GREEK SMALL LETTER PI | π 03C0 GREEK SMALL LETTER PI |
| | ϖ 03D6 GREEK PI SYMBOL | ϖ 03D6 GREEK PI SYMBOL |
| ρ 03C1 GREEK SMALL LETTER RHO | Ρ 03A1 GREEK CAPITAL LETTER RHO | ρ 03C1 GREEK SMALL LETTER RHO |
| | ρ 03C1 GREEK SMALL LETTER RHO | ρ 03C1 GREEK SMALL LETTER RHO |
| | ϱ 03F1 GREEK RHO SYMBOL | ϱ 03F1 GREEK RHO SYMBOL |
| σ 03C3 GREEK SMALL LETTER SIGMA | Σ 03A3 GREEK CAPITAL LETTER SIGMA | σ 03C3 GREEK SMALL LETTER SIGMA |
| | ς 03C2 GREEK SMALL LETTER FINAL SIGMA | ς 03C2 GREEK SMALL LETTER FINAL SIGMA |
| | σ 03C3 GREEK SMALL LETTER SIGMA | σ 03C3 GREEK SMALL LETTER SIGMA |
| ΰ 03C5 GREEK SMALL LETTER UPSILON; 0308 COMBINING DIAERESIS; 0301 COMBINING ACUTE ACCENT | ΰ 03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS | ΰ 03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS |
| | ΰ 1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA | ΰ 1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA |
| φ 03C6 GREEK SMALL LETTER PHI | Φ 03A6 GREEK CAPITAL LETTER PHI | φ 03C6 GREEK SMALL LETTER PHI |
| | φ 03C6 GREEK SMALL LETTER PHI | φ 03C6 GREEK SMALL LETTER PHI |
| | ϕ 03D5 GREEK PHI SYMBOL | ϕ 03D5 GREEK PHI SYMBOL |
| в 0432 CYRILLIC SMALL LETTER VE | В 0412 CYRILLIC CAPITAL LETTER VE | в 0432 CYRILLIC SMALL LETTER VE |
| | в 0432 CYRILLIC SMALL LETTER VE | в 0432 CYRILLIC SMALL LETTER VE |
| | ᲀ 1C80 CYRILLIC SMALL LETTER ROUNDED VE | ᲀ 1C80 CYRILLIC SMALL LETTER ROUNDED VE |
| д 0434 CYRILLIC SMALL LETTER DE | Д 0414 CYRILLIC CAPITAL LETTER DE | д 0434 CYRILLIC SMALL LETTER DE |
| | д 0434 CYRILLIC SMALL LETTER DE | д 0434 CYRILLIC SMALL LETTER DE |
| | ᲁ 1C81 CYRILLIC SMALL LETTER LONG-LEGGED DE | ᲁ 1C81 CYRILLIC SMALL LETTER LONG-LEGGED DE |
| о 043E CYRILLIC SMALL LETTER O | О 041E CYRILLIC CAPITAL LETTER O | о 043E CYRILLIC SMALL LETTER O |
| | о 043E CYRILLIC SMALL LETTER O | о 043E CYRILLIC SMALL LETTER O |
| | ᲂ 1C82 CYRILLIC SMALL LETTER NARROW O | ᲂ 1C82 CYRILLIC SMALL LETTER NARROW O |
| с 0441 CYRILLIC SMALL LETTER ES | С 0421 CYRILLIC CAPITAL LETTER ES | с 0441 CYRILLIC SMALL LETTER ES |
| | с 0441 CYRILLIC SMALL LETTER ES | с 0441 CYRILLIC SMALL LETTER ES |
| | ᲃ 1C83 CYRILLIC SMALL LETTER WIDE ES | ᲃ 1C83 CYRILLIC SMALL LETTER WIDE ES |
| т 0442 CYRILLIC SMALL LETTER TE | Т 0422 CYRILLIC CAPITAL LETTER TE | т 0442 CYRILLIC SMALL LETTER TE |
| | т 0442 CYRILLIC SMALL LETTER TE | т 0442 CYRILLIC SMALL LETTER TE |
| | ᲄ 1C84 CYRILLIC SMALL LETTER TALL TE | ᲄ 1C84 CYRILLIC SMALL LETTER TALL TE |
| | ᲅ 1C85 CYRILLIC SMALL LETTER THREE-LEGGED TE | ᲅ 1C85 CYRILLIC SMALL LETTER THREE-LEGGED TE |
| ъ 044A CYRILLIC SMALL LETTER HARD SIGN | Ъ 042A CYRILLIC CAPITAL LETTER HARD SIGN | ъ 044A CYRILLIC SMALL LETTER HARD SIGN |
| | ъ 044A CYRILLIC SMALL LETTER HARD SIGN | ъ 044A CYRILLIC SMALL LETTER HARD SIGN |
| | ᲆ 1C86 CYRILLIC SMALL LETTER TALL HARD SIGN | ᲆ 1C86 CYRILLIC SMALL LETTER TALL HARD SIGN |
| ѣ 0463 CYRILLIC SMALL LETTER YAT | Ѣ 0462 CYRILLIC CAPITAL LETTER YAT | ѣ 0463 CYRILLIC SMALL LETTER YAT |
| | ѣ 0463 CYRILLIC SMALL LETTER YAT | ѣ 0463 CYRILLIC SMALL LETTER YAT |
| | ᲇ 1C87 CYRILLIC SMALL LETTER TALL YAT | ᲇ 1C87 CYRILLIC SMALL LETTER TALL YAT |
| ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE | Ṡ 1E60 LATIN CAPITAL LETTER S WITH DOT ABOVE | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE |
| | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE |
| | ẛ 1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE | ẛ 1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE |
| ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK | ᲈ 1C88 CYRILLIC SMALL LETTER UNBLENDED UK | ᲈ 1C88 CYRILLIC SMALL LETTER UNBLENDED UK |
| | Ꙋ A64A CYRILLIC CAPITAL LETTER MONOGRAPH UK | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK |
| | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK |
And for lowercase(uppercase(X)).
| case fold | original | lower case of upper case |
|---|---|---|
| ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S | ß 00DF LATIN SMALL LETTER SHARP S | ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S |
| | ẞ 1E9E LATIN CAPITAL LETTER SHARP S | ß 00DF LATIN SMALL LETTER SHARP S |
For uppercase(lowercase(s)), no equivalence group has multiple results.
They both render as Ð (And the other as Đ). I think that pretty much sums it up. Is there an inherent difference?
They render slightly differently, or similarly, depending on your system. Char U+00D0 is "Latin Capital Letter Eth" and char U+0110 is "Latin Capital Letter D with stroke". Try opening these pages in different tabs and switching back and forth; the example is rendered slightly differently.
Unicode Character 'LATIN CAPITAL LETTER ETH' (U+00D0)
Unicode Character 'LATIN CAPITAL LETTER D WITH STROKE' (U+0110)
They render very similarly¹, but they are different characters, and the lowercase versions of those uppercase letters render very differently¹.
For comparison:
| Name | Capital | Small |
|---|---|---|
| Latin Letter Eth | Ð (U+00D0) | ð (U+00F0) |
| Latin Letter D with Stroke | Đ (U+0110) | đ (U+0111) |
| Latin Letter African D / Latin Letter D with Tail | Ɖ (U+0189) | ɖ (U+0256) |
1) Depends on the font, of course.
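The comparison can be reproduced from the Unicode Character Database. This is a small sketch using Python's standard `unicodedata` module; the `CAPITALS` name is just an illustrative choice.

```python
import unicodedata

# The three look-alike capitals from the comparison.
CAPITALS = ("\u00d0", "\u0110", "\u0189")

for capital in CAPITALS:
    small = capital.lower()
    print(f"U+{ord(capital):04X} {unicodedata.name(capital)} -> "
          f"U+{ord(small):04X} {unicodedata.name(small)}")

# None of them decomposes, so normalization never merges them.
normalized = {unicodedata.normalize("NFKC", c) for c in CAPITALS}
print(len(normalized))  # 3
```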
Credit: This answer is mostly a refinement/combination of the answer by doublesharp and the comment by John Montgomery.
Example in Python:
>>> import unicodedata
>>> s = 'ı̇'
>>> len(s)
2
>>> list(s)
['ı', '̇']
>>> print(", ".join(map(unicodedata.name, s)))
LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE
>>> normalized = unicodedata.normalize('NFC', s)
>>> print(", ".join(map(unicodedata.name, normalized)))
LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE
As you can see, NFC normalization does not compose the dotless i + a dot into a normal i. Is there a rationale for this? Is this an oversight? Or is it not included because NFC is supposed to be the perfect inverse of NFD (and one wouldn’t want to decompose i to dotless i + dot)?
While NFC isn't the "perfect inverse" of NFD, this follows from NFC being defined in terms of the same decomposition mappings as NFD. NFC is basically defined as NFD followed by recomposing certain NFD decomposition pairs. Since there's no decomposition mapping for LATIN SMALL LETTER I, it can never be the result of a recomposition.
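The decomposition mappings can be inspected directly. The following sketch uses only the standard `unicodedata` module and contrasts "é", which has a mapping, with "i", which has none.

```python
import unicodedata

# "é" has a canonical decomposition, so NFC can rebuild it from e + U+0301.
print(repr(unicodedata.decomposition("\u00e9")))            # '0065 0301'
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True

# "i" has no decomposition mapping, so nothing can ever compose to it.
print(repr(unicodedata.decomposition("i")))                 # ''
print(unicodedata.normalize("NFC", "\u0131\u0307") == "i")  # False
```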
I was just reading a post and saw some odd text effects, however I cannot locate how it is achieved or what it is called:
P̢̢̲̭̘̣̪͉͞͞h̴̛̫͉͖̜͙̳͎̕͞͠'̶̀͢҉̯̞̹͈ṉ̶̘̠̯̬̭̖̳͘͞ģ̵̛͠҉̰̝͇̩͍̗͍̘̫͈̺̭̥͉l̨͍̘͔̰͔̖͍̹̠̭̱̰̖͙̦̦͎̕͟u̢̡҉̲̭̲̺̮̖͖͖i̴̢̹̳͉͎̥̪̜͎̼̣̦̖̻͈̖͉͚ͅ ̵͏͇̗̭ͅm̶̨͍̤̪̱͇̤̬̥̥͔̼͍̠̼͕g̷̷̰̩͙̪̫͉̺̯͘͟͠ļ̶̭͇̘̮̕͢ẃ̵̸̷҉͕̬̠̥̤͖̙̲͇̼̹'̺̩̖̟̣͈̖͙̤̫̰̗̯̀͡ń̷̴̶̰̮̺͔̼̺̹̘̟a̷̰̪͙͇̤͓̤̭͎̦͕̻f͏̨͙̰̘͔̟̜̠͈̯̻͕̖̳̝̝́͘ͅḩ̴̛͉͉̲͇̠͙̣̩͙̩͚̮̼̺ͅ ̧̛̟͓̤͇̯͍̫͖͎͈̫̳͓̞͘Ç͘͏͈̹̠̙͎̳̯͚͔̼͙̻͔͖̲̩̹̕ͅt͏̖̲̤̫̤̫̼̪̥̠͙͚͍̭́ͅḩ̡̲͈̫̯͚͉̱͍̳͝ù̧͙̭̙̻̲̙͚͔̲̬͚͢͝͡ḻ̴̵̨̹͉͙̟̯̞̠͔̦̝̩͜h̶̼̜̦͖͍͎͍̕ṷ̴̶̢͙̗̬͇̯̞̗̰̣̬̥̲̣̦ ̵̲͍̩̭̩̗͈͚͟͝R͏̛͘͟҉̫̝̞̪̣̪̻̤̼͖̪͎'̛̯͚͎̳͎̼͓̘͉͢l͟҉̵̘͈͙̣̹̜͍͎̬̺̹̪̜̀y͏͓̞̬͙̥̞̦͎͖̞͖͎̖̀e̶̵̡̺͉̯̭̣̗h͇̺͇̖̼̻̟͓͜͟͜͞ͅ ̴̷̡̨̪͍̙̳̞̭̙̫̯̘͚͇͚̼͙͟w̧̮̜̯̭̘͈̫̳̖̕͜͠g̢̨̗͖̬̠͎͓̱̞͓̭̯̺͕̭̯̦ͅa̴̠̘̬̩͍͜ͅh̵̷̨̜̻͔̖͈̤͈̩͔͈͇̩̞̲̜̩͍̺'̸̨͇̞̜͈͟n̨͟͞҉̤͚͎͇̣̺͚̻̖͖́ͅà̻͉̙̲̲̞͘͝ģ̙̗̙͓̜̣͔̥̫͟͡l̴̨̨̼͚̫̞̙̳͙͢͟ ̢̦͚̲͇̞̺̗̫͇f̸̸̫̠͖͙̜͉̲͖͓̭͇̦̭̩̲͡͠ḩ̸̲̤͍̖̻̣̝̼́̕͝ͅt̴͝҉҉̵͔̮̞̪á̢̕͢͏̗̯̗̙͙͉̪͓͙̣̰̣g͏̶̡͓̤͍͖̜̠̜ͅn̴̶̛̝̼͉̠̻͓
Don't worry though, unless thousands read it I think we are safe.
It's called Zalgo text.
You can Google for an online generator and use it:
TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
As a side note, don't try to parse HTML with RegEx.
You can get a clue if you pick a sample and submit it to Unicode character inspector:
C U+0043 LATIN CAPITAL LETTER C Lu Basic Latin
̧ U+0327 COMBINING CEDILLA Mn Combining Diacritical Marks
͘ U+0358 COMBINING DOT ABOVE RIGHT Mn Combining Diacritical Marks
U+034F COMBINING GRAPHEME JOINER Mn Combining Diacritical Marks
̕ U+0315 COMBINING COMMA ABOVE RIGHT Mn Combining Diacritical Marks
͈ U+0348 COMBINING DOUBLE VERTICAL LINE BELOW Mn Combining Diacritical Marks
[…]
… where Lu stands for Unicode Character Category 'Letter, Uppercase' and Mn stands for Unicode Character Category 'Mark, Nonspacing'.
In short, they're just regular letters attached to all sort of combining diacritics, thanks to the magic of Unicode. It abuses the fact that é can also be written as e + ´ for entertainment purposes.
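To see how little base text is underneath, the marks can be filtered out by general category. This is a rough sketch: `strip_marks` is a hypothetical helper, it only removes category Mn (nonspacing marks), and it would also strip legitimate accents, so it is for inspection rather than cleanup.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Hypothetical helper: drop every nonspacing mark (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# The code points from the inspector listing above: C plus five marks.
sample = "C\u0327\u0358\u034f\u0315\u0348"
print(len(sample), repr(strip_marks(sample)))  # 6 'C'
```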
Unicode defines two kinds of equivalence: canonical equivalence and compatibility equivalence. The example in Unicode Standard Annex #15 for compatibility equivalence is SUPERSCRIPT ONE (U+00B9) and DIGIT ONE (U+0031). It doesn't discuss characters that are visually indistinguishable.
I am curious if characters that are visually indistinguishable have compatibility equivalence under the standard.
Thanks..
ᴇᴅɪᴛ: Added exactly what the original question is looking for at the bottom. This is really cool.
The answer to your question about ʀᴏᴍᴀɴ ɴᴜᴍᴇʀᴀʟ ᴏɴᴇ and ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ is YES. Here’s a quick way to check:
$ perl -Mcharnames=:full -MUnicode::Normalize -le 'print
NFKD "\N{ROMAN NUMERAL ONE}" eq NFKD "\N{LATIN CAPITAL LETTER I}"'
1
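For readers without Perl at hand, the same check can be sketched in Python with the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

roman_one = "\N{ROMAN NUMERAL ONE}"
latin_i = "\N{LATIN CAPITAL LETTER I}"

# Compare the compatibility decompositions of the two characters.
same = unicodedata.normalize("NFKD", roman_one) == unicodedata.normalize("NFKD", latin_i)
print(same)  # True
```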
However, the answer to your question as to whether characters that are visually indistinguishable have compatibility equivalence is most definitely NO!
For example, ᴄʜᴇʀᴏᴋᴇᴇ ʟᴇᴛᴛᴇʀ ɢᴏ (Ꭺ) looks like ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ (A), but is certainly not NFKD equivalent. Similarly with ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀʟᴘʜᴀ (Α) and ᴄʏʀɪʟʟɪᴄ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ (А) not being NFKD equivalent. There are effectively uncountably many (well, I can’t count them :) such issues. The only code points that are NFKD-equiv to ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ, for example, are:
U+00041 A GC=Lu SC=Latin LATIN CAPITAL LETTER A
U+01D2C ᴬ GC=Lm SC=Latin MODIFIER LETTER CAPITAL A
U+024B6 Ⓐ GC=So SC=Common CIRCLED LATIN CAPITAL LETTER A
U+0FF21 A GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER A
U+1D400 𝐀 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL A
U+1D434 𝐴 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL A
U+1D468 𝑨 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL A
U+1D49C 𝒜 GC=Lu SC=Common MATHEMATICAL SCRIPT CAPITAL A
U+1D4D0 𝓐 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL A
U+1D504 𝔄 GC=Lu SC=Common MATHEMATICAL FRAKTUR CAPITAL A
U+1D538 𝔸 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL A
U+1D56C 𝕬 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL A
U+1D5A0 𝖠 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL A
U+1D5D4 𝗔 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL A
U+1D608 𝘈 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL A
U+1D63C 𝘼 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL A
U+1D670 𝙰 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL A
U+1F130 🄰 GC=So SC=Common SQUARED LATIN CAPITAL LETTER A
Similarly, here are the codepoints that are NFKD equiv to the ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ you were looking at:
U+00049 I GC=Lu SC=Latin LATIN CAPITAL LETTER I
U+01D35 ᴵ GC=Lm SC=Latin MODIFIER LETTER CAPITAL I
U+02110 ℐ GC=Lu SC=Common SCRIPT CAPITAL I
U+02111 ℑ GC=Lu SC=Common BLACK-LETTER CAPITAL I
U+02160 Ⅰ GC=Nl SC=Latin ROMAN NUMERAL ONE
U+024BE Ⓘ GC=So SC=Common CIRCLED LATIN CAPITAL LETTER I
U+0FF29 I GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER I
U+1D408 𝐈 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL I
U+1D43C 𝐼 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL I
U+1D470 𝑰 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL I
U+1D4D8 𝓘 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL I
U+1D540 𝕀 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL I
U+1D574 𝕴 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL I
U+1D5A8 𝖨 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL I
U+1D5DC 𝗜 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL I
U+1D610 𝘐 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL I
U+1D644 𝙄 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL I
U+1D678 𝙸 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL I
U+1F138 🄸 GC=So SC=Common SQUARED LATIN CAPITAL LETTER I
Notice there’s no ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪᴏᴛᴀ there, just as one example.
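A brute-force way to build such lists is to normalize every code point and keep the ones that land on the target. The sketch below assumes Python's `unicodedata`; `nfkd_equivalents` is a hypothetical helper, and the exact membership depends on the Unicode version the interpreter ships.

```python
import sys
import unicodedata

def nfkd_equivalents(target: str) -> list:
    """Hypothetical helper: code points whose NFKD form equals `target`."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, which are not characters
        if unicodedata.normalize("NFKD", chr(cp)) == target:
            found.append(cp)
    return found

like_a = nfkd_equivalents("A")
print(len(like_a), [f"U+{cp:04X}" for cp in like_a[:4]])

# Look-alikes from other scripts are absent from the result.
for lookalike in ("\u0391", "\u0410", "\u13aa"):
    print(unicodedata.name(lookalike), ord(lookalike) in like_a)
```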
You can’t use NFKD to find lookalikes, and some things that are NFKD equiv don’t look much alike. So you can’t do it that way in the general case. It’s not a problem you can even begin to look at without looking at actual fonts.
I believe ICU has an extended, non-standard property for this, like \p{X-Confusable=A}. I downloaded their datafiles for this, but haven’t played with it much yet.
Update
It turns out that UTS #39, Unicode Security Mechanisms, has exactly what you are looking for. If you fetch its raw, plaintext datafiles, you will be able to determine which code points are potentially confusable with one another.
For example, in the text earlier in this message, I enumerated the code points that were NFKD equivalent to ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ, and pointed out that many potential confusables were missing from that set. That’s because the NFKD mapping is not designed to detect confusables. However, the datafiles from UTS#39 very much are designed for just that very purpose.
To redo my ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ enumeration, updating it to handle all code points that UTS#39 deems mutually confusable with it, we have these, formatted using unichars and sorted in order of the Unicode Collation Algorithm using ucsort:
U+0007C | GC=Sm SC=Common VERTICAL LINE
U+02223 ∣ GC=Sm SC=Common DIVIDES
U+0FFE8 │ GC=So SC=Common HALFWIDTH FORMS LIGHT VERTICAL
U+00031 1 GC=Nd SC=Common DIGIT ONE
U+1D7CF 𝟏 GC=Nd SC=Common MATHEMATICAL BOLD DIGIT ONE
U+1D7D9 𝟙 GC=Nd SC=Common MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
U+1D7E3 𝟣 GC=Nd SC=Common MATHEMATICAL SANS-SERIF DIGIT ONE
U+1D7ED 𝟭 GC=Nd SC=Common MATHEMATICAL SANS-SERIF BOLD DIGIT ONE
U+1D7F7 𝟷 GC=Nd SC=Common MATHEMATICAL MONOSPACE DIGIT ONE
U+00049 I GC=Lu SC=Latin LATIN CAPITAL LETTER I
U+0FF29 I GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER I
U+02160 Ⅰ GC=Nl SC=Latin ROMAN NUMERAL ONE
U+02110 ℐ GC=Lu SC=Common SCRIPT CAPITAL I
U+02111 ℑ GC=Lu SC=Common BLACK-LETTER CAPITAL I
U+1D408 𝐈 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL I
U+1D43C 𝐼 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL I
U+1D470 𝑰 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL I
U+1D4D8 𝓘 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL I
U+1D540 𝕀 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL I
U+1D574 𝕴 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL I
U+1D5A8 𝖨 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL I
U+1D5DC 𝗜 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL I
U+1D610 𝘐 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL I
U+1D644 𝙄 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL I
U+1D678 𝙸 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL I
U+00196 Ɩ GC=Lu SC=Latin LATIN CAPITAL LETTER IOTA
U+0006C l GC=Ll SC=Latin LATIN SMALL LETTER L
U+0FF4C l GC=Ll SC=Latin FULLWIDTH LATIN SMALL LETTER L
U+0217C ⅼ GC=Nl SC=Latin SMALL ROMAN NUMERAL FIFTY
U+02113 ℓ GC=Ll SC=Common SCRIPT SMALL L
U+1D425 𝐥 GC=Ll SC=Common MATHEMATICAL BOLD SMALL L
U+1D459 𝑙 GC=Ll SC=Common MATHEMATICAL ITALIC SMALL L
U+1D48D 𝒍 GC=Ll SC=Common MATHEMATICAL BOLD ITALIC SMALL L
U+1D4C1 𝓁 GC=Ll SC=Common MATHEMATICAL SCRIPT SMALL L
U+1D4F5 𝓵 GC=Ll SC=Common MATHEMATICAL BOLD SCRIPT SMALL L
U+1D529 𝔩 GC=Ll SC=Common MATHEMATICAL FRAKTUR SMALL L
U+1D55D 𝕝 GC=Ll SC=Common MATHEMATICAL DOUBLE-STRUCK SMALL L
U+1D591 𝖑 GC=Ll SC=Common MATHEMATICAL BOLD FRAKTUR SMALL L
U+1D5C5 𝗅 GC=Ll SC=Common MATHEMATICAL SANS-SERIF SMALL L
U+1D5F9 𝗹 GC=Ll SC=Common MATHEMATICAL SANS-SERIF BOLD SMALL L
U+1D62D 𝘭 GC=Ll SC=Common MATHEMATICAL SANS-SERIF ITALIC SMALL L
U+1D661 𝙡 GC=Ll SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL L
U+1D695 𝚕 GC=Ll SC=Common MATHEMATICAL MONOSPACE SMALL L
U+001C0 ǀ GC=Lo SC=Latin LATIN LETTER DENTAL CLICK
U+00399 Ι GC=Lu SC=Greek GREEK CAPITAL LETTER IOTA
U+1D6B0 𝚰 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL IOTA
U+1D6EA 𝛪 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL IOTA
U+1D724 𝜤 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL IOTA
U+1D75E 𝝞 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL IOTA
U+1D798 𝞘 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL IOTA
U+02C92 Ⲓ GC=Lu SC=Coptic COPTIC CAPITAL LETTER IAUDA
U+00406 І GC=Lu SC=Cyrillic CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
U+004C0 Ӏ GC=Lu SC=Cyrillic CYRILLIC LETTER PALOCHKA
U+005D5 ו GC=Lo SC=Hebrew HEBREW LETTER VAV
U+005DF ן GC=Lo SC=Hebrew HEBREW LETTER FINAL NUN
U+007CA ߊ GC=Lo SC=Nko NKO LETTER A
U+02D4F ⵏ GC=Lo SC=Tifinagh TIFINAGH LETTER YAN
U+0A4F2 ꓲ GC=Lo SC=Lisu LISU LETTER I
Nifty though that is, it gets even better. The datafiles include not just single-codepoint confusables, but also confusables that may in some cases require multiple code points. For example, here’s one such set, this time in file-native format:
# C̦ С̡ Ç Ҫ
( C̦ ) 0043 0326 LATIN CAPITAL LETTER C, COMBINING COMMA BELOW
← ( С̡ ) 0421 0321 CYRILLIC CAPITAL LETTER ES, COMBINING PALATALIZED HOOK BELOW
← ( Ç ) 00C7 LATIN CAPITAL LETTER C WITH CEDILLA # →Ҫ→→С̡→
← ( Ҫ ) 04AA CYRILLIC CAPITAL LETTER ES WITH DESCENDER # →С̡→
Isn’t that swell? The only hitch is unless you use the ICU classes, you’ll have to roll your own from the UTS#39 datafiles.
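Rolling your own could start from something like the sketch below. It assumes the semicolon-separated `source ; target ; type # comment` layout of the main UTS#39 mapping file, which differs from the summary layout quoted above. The three `SAMPLE` lines are hand-written for illustration and are not quoted from the real data, and `load_confusables` and `skeleton` are hypothetical names.

```python
import unicodedata

# Hand-written sample in the assumed layout; not quoted from the real data file.
SAMPLE = """\
0049 ; 006C ; MA # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER L
0399 ; 006C ; MA # GREEK CAPITAL LETTER IOTA -> LATIN SMALL LETTER L
0406 ; 006C ; MA # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I -> LATIN SMALL LETTER L
"""

def parse_hex(field: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(h, 16)) for h in field.split())

def load_confusables(text: str) -> dict:
    """Build a source -> target mapping, ignoring comments and blank lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        source, target = (f.strip() for f in line.split(";")[:2])
        table[parse_hex(source)] = parse_hex(target)
    return table

def skeleton(text: str, table: dict) -> str:
    """Map each character to its prototype; equal skeletons mean confusable."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

TABLE = load_confusables(SAMPLE)
print(skeleton("\u0399", TABLE) == skeleton("I", TABLE))  # True
print(skeleton("A", TABLE) == skeleton("I", TABLE))       # False
```

Two strings are then treated as confusable when their skeletons are equal, which is the comparison the ICU classes perform for you.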
Since there are no other language bindings that I am aware of, I’ve added to my ᴛᴏᴅᴏ list to create Perl bindings to mimic the ICU style of writing \p{X-Confusable=I} in the regex engine.
Note that you may also wish to consider both UTS#36 and UTS#39, which the ICU SpoofChecker class handles for you. It’s specifically for URI-type things (read: Internet identifiers, which use a restricted character set), not just any old arbitrary text.
Yes. Look in UnicodeData.txt:
2160;ROMAN NUMERAL ONE;Nl;0;L;<compat> 0049;;;1;N;;;;2170;
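The same field is exposed by Python's standard `unicodedata` module, so the lookup can be sketched without downloading the file. A `<compat>` tag marks a compatibility mapping, while an untagged mapping is canonical.

```python
import unicodedata

# ROMAN NUMERAL ONE: compatibility mapping to LATIN CAPITAL LETTER I.
compat = unicodedata.decomposition("\u2160")
print(compat)     # <compat> 0049

# LATIN CAPITAL LETTER A WITH RING ABOVE: canonical mapping, no tag.
canonical = unicodedata.decomposition("\u00c5")
print(canonical)  # 0041 030A
```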
The answer by @dan04 is the correct answer to the explicit question, but the indirect question “if characters that are visually indistinguishable have compatibility equivalence” has a more complicated answer.
As a rule, canonically equivalent characters or character sequences are supposed to look similar. They are, roughly speaking, different representations of the same intuitive character as encoded characters. This however depends on several practical considerations, and the renderings might in fact be different.
On the other hand, characters can be entirely distinct even though their renderings (glyphs) are identical in every known font. For example, any normal font that contains the capital Latin letter A, the capital Greek letter alpha, and the capital Cyrillic letter A has identical glyphs for them, but they are still completely distinct characters, with no equivalence mapping between them.
Compatibility equivalent characters may differ in presentation, and they often do, partly because their difference is often stylistic. But they need not differ.
In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16.
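The size difference is easy to confirm. This is a small sketch in Python; `code_units` is a hypothetical helper and the sample characters are arbitrary picks from inside and outside the BMP.

```python
def code_units(ch: str) -> tuple:
    """Hypothetical helper: (UTF-8 byte count, UTF-16 code unit count)."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2

# ASCII, Latin-1, a BMP CJK ideograph, a Gothic letter, and an emoji.
for ch in ("A", "\u00e9", "\u4e2d", "\U00010330", "\U0001f602"):
    utf8_bytes, utf16_units = code_units(ch)
    print(f"U+{ord(ch):06X} utf-8 bytes={utf8_bytes} "
          f"utf-16 units={utf16_units} non-BMP={ord(ch) > 0xFFFF}")
```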
I would've expected the answer to be Chinese and Japanese characters used in names but not included in the most widespread CJK multibyte character sets, but on the project I do most work on, the English Wiktionary, we have found that the Gothic alphabet is far more common so far.
UPDATE
I've written a couple of software tools to scan entire Wikipedias for non-BMP characters and found to my surprise that even in the Japanese Wikipedia Gothic alphabet is the most common. This is also true in the Chinese Wikipedia but it also had many Chinese characters being used up to 50 or 70 times, including "𨭎", "𠬠", and "𩷶".
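The core of such a scan is short. The sketch below is not the tool described above, only an assumption of what a minimal version could look like in Python; `count_non_bmp` is a hypothetical name and the sample string is made up.

```python
from collections import Counter

def count_non_bmp(text: str) -> Counter:
    """Hypothetical helper: tally every code point above U+FFFF."""
    return Counter(ch for ch in text if ord(ch) > 0xFFFF)

# Made-up sample: two Gothic letters (one repeated), one emoji, some BMP text.
sample = "\U00010330\U00010331\U00010330 abc \u4e2d \U0001f602"
counts = count_non_bmp(sample)

for ch, n in counts.most_common():
    print(f"{n:4d} U+{ord(ch):06X}")
```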
Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter's public stream. It occurs more frequently than the tilde!
Excellent question!
The answer is the mathematical letters. This past December I did a scan of the entire PubMed Open Access corpus, and came up with these figures for astral characters in it.
The first number in the figures below is how many copies of each given code point I found in the entire corpus. First, though, to give you a notion on the relative frequencies, here are the top ten trans-ASCII code points in that corpus:
2663710 U+002013 ‹–› GC=Pd EN DASH
1065594 U+0000A0 ‹ › GC=Zs NO-BREAK SPACE
1009762 U+0000B1 ‹±› GC=Sm PLUS-MINUS SIGN
784139 U+002212 ‹−› GC=Sm MINUS SIGN
602377 U+002003 ‹ › GC=Zs EM SPACE
528576 U+0003BC ‹μ› GC=Ll GREEK SMALL LETTER MU
519669 U+0003B2 ‹β› GC=Ll GREEK SMALL LETTER BETA
512312 U+0003B1 ‹α› GC=Ll GREEK SMALL LETTER ALPHA
491842 U+00200A ‹ › GC=Zs HAIR SPACE
462505 U+0000B0 ‹°› GC=So DEGREE SIGN
And here now are the trans-BMP code points, in order of descending frequency:
544 U+01D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
450 U+01D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
385 U+01D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
292 U+01D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
285 U+01D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
262 U+01D4A9 ‹𝒩› GC=Lu MATHEMATICAL SCRIPT CAPITAL N
258 U+01D4AB ‹𝒫› GC=Lu MATHEMATICAL SCRIPT CAPITAL P
254 U+01D4A2 ‹𝒢› GC=Lu MATHEMATICAL SCRIPT CAPITAL G
185 U+01D49C ‹𝒜› GC=Lu MATHEMATICAL SCRIPT CAPITAL A
178 U+01D53C ‹𝔼› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL E
137 U+01D4AA ‹𝒪› GC=Lu MATHEMATICAL SCRIPT CAPITAL O
56 U+01D4A5 ‹𝒥› GC=Lu MATHEMATICAL SCRIPT CAPITAL J
48 U+01D4A6 ‹𝒦› GC=Lu MATHEMATICAL SCRIPT CAPITAL K
44 U+01D4B1 ‹𝒱› GC=Lu MATHEMATICAL SCRIPT CAPITAL V
43 U+01D4B2 ‹𝒲› GC=Lu MATHEMATICAL SCRIPT CAPITAL W
42 U+01D4B4 ‹𝒴› GC=Lu MATHEMATICAL SCRIPT CAPITAL Y
41 U+01D4B5 ‹𝒵› GC=Lu MATHEMATICAL SCRIPT CAPITAL Z
35 U+01D4B0 ‹𝒰› GC=Lu MATHEMATICAL SCRIPT CAPITAL U
30 U+01D4AC ‹𝒬› GC=Lu MATHEMATICAL SCRIPT CAPITAL Q
23 U+01D54A ‹𝕊› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL S
21 U+01D539 ‹𝔹› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL B
19 U+01D5A7 ‹𝖧› GC=Lu MATHEMATICAL SANS-SERIF CAPITAL H
18 U+01D517 ‹𝔗› GC=Lu MATHEMATICAL FRAKTUR CAPITAL T
15 U+01D4C3 ‹𝓃› GC=Ll MATHEMATICAL SCRIPT SMALL N
14 U+01D535 ‹𝔵› GC=Ll MATHEMATICAL FRAKTUR SMALL X
13 U+01D4BF ‹𝒿› GC=Ll MATHEMATICAL SCRIPT SMALL J
11 U+01D540 ‹𝕀› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL I
9 U+01D465 ‹𝑥› GC=Ll MATHEMATICAL ITALIC SMALL X
9 U+01D4CE ‹𝓎› GC=Ll MATHEMATICAL SCRIPT SMALL Y
9 U+01D538 ‹𝔸› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL A
8 U+01D4C2 ‹𝓂› GC=Ll MATHEMATICAL SCRIPT SMALL M
8 U+01D54D ‹𝕍› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL V
7 U+01D4B6 ‹𝒶› GC=Ll MATHEMATICAL SCRIPT SMALL A
7 U+01D4BE ‹𝒾› GC=Ll MATHEMATICAL SCRIPT SMALL I
7 U+01D4CC ‹𝓌› GC=Ll MATHEMATICAL SCRIPT SMALL W
7 U+01D516 ‹𝔖› GC=Lu MATHEMATICAL FRAKTUR CAPITAL S
4 U+01D4CF ‹𝓏› GC=Ll MATHEMATICAL SCRIPT SMALL Z
4 U+01D53B ‹𝔻› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL D
4 U+01D54B ‹𝕋› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL T
3 U+01D4BB ‹𝒻› GC=Ll MATHEMATICAL SCRIPT SMALL F
3 U+01D4CA ‹𝓊› GC=Ll MATHEMATICAL SCRIPT SMALL U
3 U+01D507 ‹𝔇› GC=Lu MATHEMATICAL FRAKTUR CAPITAL D
3 U+01D542 ‹𝕂› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL K
3 U+01D546 ‹𝕆› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL O
2 U+01D4BD ‹𝒽› GC=Ll MATHEMATICAL SCRIPT SMALL H
2 U+01D4C5 ‹𝓅› GC=Ll MATHEMATICAL SCRIPT SMALL P
2 U+01D505 ‹𝔅› GC=Lu MATHEMATICAL FRAKTUR CAPITAL B
2 U+01D50E ‹𝔎› GC=Lu MATHEMATICAL FRAKTUR CAPITAL K
2 U+01D541 ‹𝕁› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL J
2 U+01D543 ‹𝕃› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL L
2 U+100002 ‹› GC=Co <private use character>
1 U+01D4B8 ‹𝒸› GC=Ll MATHEMATICAL SCRIPT SMALL C
1 U+01D4C1 ‹𝓁› GC=Ll MATHEMATICAL SCRIPT SMALL L
1 U+01D53D ‹𝔽› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL F
1 U+01D53E ‹𝔾› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL G
1 U+01D54C ‹𝕌› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL U
1 U+01D6A4 ‹𝚤› GC=Ll MATHEMATICAL ITALIC SMALL DOTLESS I
1 U+01D7D9 ‹𝟙› GC=Nd MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
I really wish I knew what they were using U+100002 to do. :(
If those aren't showing up in your browser, you should install George Douros’s Symbola font or another mirror for download. It also has all the fun Unicode 6.0.0 code points in it, too.
For me, the Mathematical Alphanumeric Symbols that are used for math typesetting with OpenType fonts such as Cambria Math.