What Is The Difference Between u+00d0 and u+0110? [duplicate] - unicode

One renders as Ð and the other as Đ. I think that pretty much sums it up. Is there an inherent difference?

They render slightly differently, or similarly, depending on your system. U+00D0 is "Latin Capital Letter Eth" and U+0110 is "Latin Capital Letter D with stroke". Try opening these pages in different tabs and switching back and forth; the example glyph is rendered slightly differently.
Unicode Character 'LATIN CAPITAL LETTER ETH' (U+00D0)
Unicode Character 'LATIN CAPITAL LETTER D WITH STROKE' (U+0110)

They render very similarly¹, but they are different characters, and the lowercase versions of those uppercase letters render very differently¹.
For comparison:
| Name | Capital | Small |
|---|---|---|
| Latin Letter Eth | Ð (U+00D0) | ð (U+00F0) |
| Latin Letter D with Stroke | Đ (U+0110) | đ (U+0111) |
| Latin Letter African D / Latin Letter D with Tail | Ɖ (U+0189) | ɖ (U+0256) |
1) Depends on the font, of course.
Credit: This answer is mostly a refinement/combination of the answer by doublesharp and the comment by John Montgomery.
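One font-independent way to check that these are distinct characters is to ask Python's standard `unicodedata` module for their names and lowercase mappings. This is only a small sketch; `describe` is a helper name made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, official name, code point of the lowercase form)."""
    lower = ch.lower()
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), f"U+{ord(lower):04X}")

# The three look-alike capitals from the comparison above.
for ch in "\u00d0\u0110\u0189":
    print(describe(ch))
```

Each capital reports a different name and a different lowercase code point, which is the "inherent difference" the question asks about.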

Related

Why is upper casing not enough for case-insensitive comparison?

To compare two strings case insensitively, one correct way is to case
fold them first. How is this better than upper casing or lower casing?
I find examples where lower casing doesn't work right online. For
example "σ" and "ς" (two forms of "Σ") don't become the same when
converted to lower case. But I've failed to find why case folding is
better than mapping to upper case. Is there a case where two strings
that should match case insensitively don't upper case to the same
strings?
Another scenario is when I want to store a case insensitive index. The
recommended way seems to be case folding and then normalizing. What are
its advantages over storing the string mapped to upper case and
normalized? The specs say mapping to upper case is not guaranteed to be
stable across versions of Unicode while case folding is. But are there
any cases where mapping to upper case gives a different string in an
earlier version of Unicode?
As per Unicode stability policy, case mappings are only stable for case pairs, i.e. pairs of characters X and Y where X is the full uppercase mapping of Y, and Y is the full lowercase mapping of X. Only when both these characters exist with these properties is the casing relation between them set in stone.
However, Unicode contains many “incomplete” case pairs where only the lowercase form has been encoded and the uppercase form is missing completely. This is usually the case for letters used in transcription systems that are traditionally lowercase-only. Should capital forms be discovered and subsequently added to Unicode, these letters would then receive a new uppercase mapping.
The most recent characters this has happened to are “ʂ” (from Unicode 1.1), “ᶎ” (from Unicode 4.1), and “ꞔ” (from Unicode 7.0), which all got brand new uppercase forms (Ʂ, Ᶎ, and Ꞔ, respectively) in Unicode 12.0 (2019).
Because case mappings do not have to be unique, this makes uppercasing a poor substitute for proper case-folding. For example, both U+0434 (д) and U+1C81 (ᲁ) uppercase to U+0414 (Д), but only the former is locked into a case pair by virtue of being U+0414’s full lowercase mapping. If someone were to find a dedicated capital letter version of U+1C81 in some old manuscript, it would be given a new uppercase mapping, resulting in U+0434 and U+1C81 suddenly no longer comparing equal under that operation.
EDIT: I have just remembered a current example of uppercasing not being sufficient for case-insensitive matching: U+1E9E (ẞ) is already a capital letter and thus uppercases to itself. Its lowercase counterpart is U+00DF (ß), but the uppercase mapping of U+00DF is the sequence <U+0053, U+0053> (SS).
uppercase("ẞ") ≠ uppercase(lowercase("ẞ"))
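The ẞ/ß mismatch is easy to reproduce with Python's built-in string methods. The sketch below assumes a Python 3 interpreter; `equal_ignoring_case` is a hypothetical helper name, and it deliberately skips the normalization step that full caseless matching would also need.

```python
capital_sharp_s = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00df"    # ß LATIN SMALL LETTER SHARP S

def equal_ignoring_case(a, b):
    """Compare two strings by their case folds (no normalization applied)."""
    return a.casefold() == b.casefold()

# Uppercasing leaves ẞ alone but turns ß into "SS", so the results differ.
print(capital_sharp_s.upper() == small_sharp_s.upper())
# Case folding maps both to "ss", so they compare equal.
print(equal_ignoring_case(capital_sharp_s, small_sharp_s))
```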
I found a list from here.
As of Unicode 13.0.0.
Equivalence classes that have more than 1 uppercase mapping.
| case fold | original | UPPER CASE |
|---|---|---|
| k 006B LATIN SMALL LETTER K | K 004B LATIN CAPITAL LETTER K | K 004B LATIN CAPITAL LETTER K |
| k 006B LATIN SMALL LETTER K | k 006B LATIN SMALL LETTER K | K 004B LATIN CAPITAL LETTER K |
| k 006B LATIN SMALL LETTER K | K 212A KELVIN SIGN | K 212A KELVIN SIGN |
| ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S | ß 00DF LATIN SMALL LETTER SHARP S | SS 0053 LATIN CAPITAL LETTER S; 0053 LATIN CAPITAL LETTER S |
| ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S | ẞ 1E9E LATIN CAPITAL LETTER SHARP S | ẞ 1E9E LATIN CAPITAL LETTER SHARP S |
| å 00E5 LATIN SMALL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE |
| å 00E5 LATIN SMALL LETTER A WITH RING ABOVE | å 00E5 LATIN SMALL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE |
| å 00E5 LATIN SMALL LETTER A WITH RING ABOVE | Å 212B ANGSTROM SIGN | Å 212B ANGSTROM SIGN |
| θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA |
| θ 03B8 GREEK SMALL LETTER THETA | θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA |
| θ 03B8 GREEK SMALL LETTER THETA | ϑ 03D1 GREEK THETA SYMBOL | Θ 0398 GREEK CAPITAL LETTER THETA |
| θ 03B8 GREEK SMALL LETTER THETA | ϴ 03F4 GREEK CAPITAL THETA SYMBOL | ϴ 03F4 GREEK CAPITAL THETA SYMBOL |
| ω 03C9 GREEK SMALL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA |
| ω 03C9 GREEK SMALL LETTER OMEGA | ω 03C9 GREEK SMALL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA |
| ω 03C9 GREEK SMALL LETTER OMEGA | Ω 2126 OHM SIGN | Ω 2126 OHM SIGN |
And for lowercasing.
| case fold | original | lower case |
|---|---|---|
| s 0073 LATIN SMALL LETTER S | S 0053 LATIN CAPITAL LETTER S | s 0073 LATIN SMALL LETTER S |
| s 0073 LATIN SMALL LETTER S | s 0073 LATIN SMALL LETTER S | s 0073 LATIN SMALL LETTER S |
| s 0073 LATIN SMALL LETTER S | ſ 017F LATIN SMALL LETTER LONG S | ſ 017F LATIN SMALL LETTER LONG S |
| st 0073 LATIN SMALL LETTER S; 0074 LATIN SMALL LETTER T | ſt FB05 LATIN SMALL LIGATURE LONG S T | ſt FB05 LATIN SMALL LIGATURE LONG S T |
| st 0073 LATIN SMALL LETTER S; 0074 LATIN SMALL LETTER T | st FB06 LATIN SMALL LIGATURE ST | st FB06 LATIN SMALL LIGATURE ST |
| β 03B2 GREEK SMALL LETTER BETA | Β 0392 GREEK CAPITAL LETTER BETA | β 03B2 GREEK SMALL LETTER BETA |
| β 03B2 GREEK SMALL LETTER BETA | β 03B2 GREEK SMALL LETTER BETA | β 03B2 GREEK SMALL LETTER BETA |
| β 03B2 GREEK SMALL LETTER BETA | ϐ 03D0 GREEK BETA SYMBOL | ϐ 03D0 GREEK BETA SYMBOL |
| ε 03B5 GREEK SMALL LETTER EPSILON | Ε 0395 GREEK CAPITAL LETTER EPSILON | ε 03B5 GREEK SMALL LETTER EPSILON |
| ε 03B5 GREEK SMALL LETTER EPSILON | ε 03B5 GREEK SMALL LETTER EPSILON | ε 03B5 GREEK SMALL LETTER EPSILON |
| ε 03B5 GREEK SMALL LETTER EPSILON | ϵ 03F5 GREEK LUNATE EPSILON SYMBOL | ϵ 03F5 GREEK LUNATE EPSILON SYMBOL |
| θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA | θ 03B8 GREEK SMALL LETTER THETA |
| θ 03B8 GREEK SMALL LETTER THETA | θ 03B8 GREEK SMALL LETTER THETA | θ 03B8 GREEK SMALL LETTER THETA |
| θ 03B8 GREEK SMALL LETTER THETA | ϑ 03D1 GREEK THETA SYMBOL | ϑ 03D1 GREEK THETA SYMBOL |
| θ 03B8 GREEK SMALL LETTER THETA | ϴ 03F4 GREEK CAPITAL THETA SYMBOL | θ 03B8 GREEK SMALL LETTER THETA |
| ι 03B9 GREEK SMALL LETTER IOTA | ◌ͅ 0345 COMBINING GREEK YPOGEGRAMMENI | ◌ͅ 0345 COMBINING GREEK YPOGEGRAMMENI |
| ι 03B9 GREEK SMALL LETTER IOTA | Ι 0399 GREEK CAPITAL LETTER IOTA | ι 03B9 GREEK SMALL LETTER IOTA |
| ι 03B9 GREEK SMALL LETTER IOTA | ι 03B9 GREEK SMALL LETTER IOTA | ι 03B9 GREEK SMALL LETTER IOTA |
| ι 03B9 GREEK SMALL LETTER IOTA | ι 1FBE GREEK PROSGEGRAMMENI | ι 1FBE GREEK PROSGEGRAMMENI |
| ΐ 03B9 GREEK SMALL LETTER IOTA; 0308 COMBINING DIAERESIS; 0301 COMBINING ACUTE ACCENT | ΐ 0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS | ΐ 0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS |
| ΐ 03B9 GREEK SMALL LETTER IOTA; 0308 COMBINING DIAERESIS; 0301 COMBINING ACUTE ACCENT | ΐ 1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA | ΐ 1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA |
| κ 03BA GREEK SMALL LETTER KAPPA | Κ 039A GREEK CAPITAL LETTER KAPPA | κ 03BA GREEK SMALL LETTER KAPPA |
| κ 03BA GREEK SMALL LETTER KAPPA | κ 03BA GREEK SMALL LETTER KAPPA | κ 03BA GREEK SMALL LETTER KAPPA |
| κ 03BA GREEK SMALL LETTER KAPPA | ϰ 03F0 GREEK KAPPA SYMBOL | ϰ 03F0 GREEK KAPPA SYMBOL |
| μ 03BC GREEK SMALL LETTER MU | µ 00B5 MICRO SIGN | µ 00B5 MICRO SIGN |
| μ 03BC GREEK SMALL LETTER MU | Μ 039C GREEK CAPITAL LETTER MU | μ 03BC GREEK SMALL LETTER MU |
| μ 03BC GREEK SMALL LETTER MU | μ 03BC GREEK SMALL LETTER MU | μ 03BC GREEK SMALL LETTER MU |
| π 03C0 GREEK SMALL LETTER PI | Π 03A0 GREEK CAPITAL LETTER PI | π 03C0 GREEK SMALL LETTER PI |
| π 03C0 GREEK SMALL LETTER PI | π 03C0 GREEK SMALL LETTER PI | π 03C0 GREEK SMALL LETTER PI |
| π 03C0 GREEK SMALL LETTER PI | ϖ 03D6 GREEK PI SYMBOL | ϖ 03D6 GREEK PI SYMBOL |
| ρ 03C1 GREEK SMALL LETTER RHO | Ρ 03A1 GREEK CAPITAL LETTER RHO | ρ 03C1 GREEK SMALL LETTER RHO |
| ρ 03C1 GREEK SMALL LETTER RHO | ρ 03C1 GREEK SMALL LETTER RHO | ρ 03C1 GREEK SMALL LETTER RHO |
| ρ 03C1 GREEK SMALL LETTER RHO | ϱ 03F1 GREEK RHO SYMBOL | ϱ 03F1 GREEK RHO SYMBOL |
| σ 03C3 GREEK SMALL LETTER SIGMA | Σ 03A3 GREEK CAPITAL LETTER SIGMA | σ 03C3 GREEK SMALL LETTER SIGMA |
| σ 03C3 GREEK SMALL LETTER SIGMA | ς 03C2 GREEK SMALL LETTER FINAL SIGMA | ς 03C2 GREEK SMALL LETTER FINAL SIGMA |
| σ 03C3 GREEK SMALL LETTER SIGMA | σ 03C3 GREEK SMALL LETTER SIGMA | σ 03C3 GREEK SMALL LETTER SIGMA |
| ΰ 03C5 GREEK SMALL LETTER UPSILON; 0308 COMBINING DIAERESIS; 0301 COMBINING ACUTE ACCENT | ΰ 03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS | ΰ 03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS |
| ΰ 03C5 GREEK SMALL LETTER UPSILON; 0308 COMBINING DIAERESIS; 0301 COMBINING ACUTE ACCENT | ΰ 1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA | ΰ 1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA |
| φ 03C6 GREEK SMALL LETTER PHI | Φ 03A6 GREEK CAPITAL LETTER PHI | φ 03C6 GREEK SMALL LETTER PHI |
| φ 03C6 GREEK SMALL LETTER PHI | φ 03C6 GREEK SMALL LETTER PHI | φ 03C6 GREEK SMALL LETTER PHI |
| φ 03C6 GREEK SMALL LETTER PHI | ϕ 03D5 GREEK PHI SYMBOL | ϕ 03D5 GREEK PHI SYMBOL |
| в 0432 CYRILLIC SMALL LETTER VE | В 0412 CYRILLIC CAPITAL LETTER VE | в 0432 CYRILLIC SMALL LETTER VE |
| в 0432 CYRILLIC SMALL LETTER VE | в 0432 CYRILLIC SMALL LETTER VE | в 0432 CYRILLIC SMALL LETTER VE |
| в 0432 CYRILLIC SMALL LETTER VE | ᲀ 1C80 CYRILLIC SMALL LETTER ROUNDED VE | ᲀ 1C80 CYRILLIC SMALL LETTER ROUNDED VE |
| д 0434 CYRILLIC SMALL LETTER DE | Д 0414 CYRILLIC CAPITAL LETTER DE | д 0434 CYRILLIC SMALL LETTER DE |
| д 0434 CYRILLIC SMALL LETTER DE | д 0434 CYRILLIC SMALL LETTER DE | д 0434 CYRILLIC SMALL LETTER DE |
| д 0434 CYRILLIC SMALL LETTER DE | ᲁ 1C81 CYRILLIC SMALL LETTER LONG-LEGGED DE | ᲁ 1C81 CYRILLIC SMALL LETTER LONG-LEGGED DE |
| о 043E CYRILLIC SMALL LETTER O | О 041E CYRILLIC CAPITAL LETTER O | о 043E CYRILLIC SMALL LETTER O |
| о 043E CYRILLIC SMALL LETTER O | о 043E CYRILLIC SMALL LETTER O | о 043E CYRILLIC SMALL LETTER O |
| о 043E CYRILLIC SMALL LETTER O | ᲂ 1C82 CYRILLIC SMALL LETTER NARROW O | ᲂ 1C82 CYRILLIC SMALL LETTER NARROW O |
| с 0441 CYRILLIC SMALL LETTER ES | С 0421 CYRILLIC CAPITAL LETTER ES | с 0441 CYRILLIC SMALL LETTER ES |
| с 0441 CYRILLIC SMALL LETTER ES | с 0441 CYRILLIC SMALL LETTER ES | с 0441 CYRILLIC SMALL LETTER ES |
| с 0441 CYRILLIC SMALL LETTER ES | ᲃ 1C83 CYRILLIC SMALL LETTER WIDE ES | ᲃ 1C83 CYRILLIC SMALL LETTER WIDE ES |
| т 0442 CYRILLIC SMALL LETTER TE | Т 0422 CYRILLIC CAPITAL LETTER TE | т 0442 CYRILLIC SMALL LETTER TE |
| т 0442 CYRILLIC SMALL LETTER TE | т 0442 CYRILLIC SMALL LETTER TE | т 0442 CYRILLIC SMALL LETTER TE |
| т 0442 CYRILLIC SMALL LETTER TE | ᲄ 1C84 CYRILLIC SMALL LETTER TALL TE | ᲄ 1C84 CYRILLIC SMALL LETTER TALL TE |
| т 0442 CYRILLIC SMALL LETTER TE | ᲅ 1C85 CYRILLIC SMALL LETTER THREE-LEGGED TE | ᲅ 1C85 CYRILLIC SMALL LETTER THREE-LEGGED TE |
| ъ 044A CYRILLIC SMALL LETTER HARD SIGN | Ъ 042A CYRILLIC CAPITAL LETTER HARD SIGN | ъ 044A CYRILLIC SMALL LETTER HARD SIGN |
| ъ 044A CYRILLIC SMALL LETTER HARD SIGN | ъ 044A CYRILLIC SMALL LETTER HARD SIGN | ъ 044A CYRILLIC SMALL LETTER HARD SIGN |
| ъ 044A CYRILLIC SMALL LETTER HARD SIGN | ᲆ 1C86 CYRILLIC SMALL LETTER TALL HARD SIGN | ᲆ 1C86 CYRILLIC SMALL LETTER TALL HARD SIGN |
| ѣ 0463 CYRILLIC SMALL LETTER YAT | Ѣ 0462 CYRILLIC CAPITAL LETTER YAT | ѣ 0463 CYRILLIC SMALL LETTER YAT |
| ѣ 0463 CYRILLIC SMALL LETTER YAT | ѣ 0463 CYRILLIC SMALL LETTER YAT | ѣ 0463 CYRILLIC SMALL LETTER YAT |
| ѣ 0463 CYRILLIC SMALL LETTER YAT | ᲇ 1C87 CYRILLIC SMALL LETTER TALL YAT | ᲇ 1C87 CYRILLIC SMALL LETTER TALL YAT |
| ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE | Ṡ 1E60 LATIN CAPITAL LETTER S WITH DOT ABOVE | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE |
| ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE |
| ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE | ẛ 1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE | ẛ 1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE |
| ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK | ᲈ 1C88 CYRILLIC SMALL LETTER UNBLENDED UK | ᲈ 1C88 CYRILLIC SMALL LETTER UNBLENDED UK |
| ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK | Ꙋ A64A CYRILLIC CAPITAL LETTER MONOGRAPH UK | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK |
| ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK |
And for lowercase(uppercase(X)).
| case fold | original | lower case of upper case |
|---|---|---|
| ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S | ß 00DF LATIN SMALL LETTER SHARP S | ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S |
| ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S | ẞ 1E9E LATIN CAPITAL LETTER SHARP S | ß 00DF LATIN SMALL LETTER SHARP S |
For uppercase(lowercase(s)), no equivalence group has multiple results.
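Lists like these can, in principle, be regenerated from the case mappings that ship with a language runtime. The following is a rough Python sketch, not the script that produced the list above: `classes_with_conflicts` is a hypothetical helper name, it only groups single code points by their case fold, and its output depends on the Unicode version bundled with your Python build.

```python
from collections import defaultdict

def classes_with_conflicts(mapping):
    """Group code points by case fold; keep groups where `mapping` disagrees."""
    groups = defaultdict(set)
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not characters
            continue
        ch = chr(cp)
        groups[ch.casefold()].add(ch)
    return {
        fold: members
        for fold, members in groups.items()
        if len({mapping(m) for m in members}) > 1
    }

# Equivalence classes whose members do not all uppercase to the same string.
upper_conflicts = classes_with_conflicts(str.upper)
for fold, members in sorted(upper_conflicts.items()):
    print(fold, sorted(f"U+{ord(m):04X}" for m in members))
```

Passing `str.lower` instead of `str.upper` gives the corresponding check for lowercasing.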

Why do LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE not get normalized to "i" in NFC form?

Example in Python:
>>> import unicodedata
>>> s = 'ı̇'
>>> len(s)
2
>>> list(s)
['ı', '̇']
>>> print(", ".join(map(unicodedata.name, s)))
LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE
>>> normalized = unicodedata.normalize('NFC', s)
>>> print(", ".join(map(unicodedata.name, normalized)))
LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE
As you can see, NFC normalization does not compose dotless i + combining dot above into a normal i. Is there a rationale for this? Is this an oversight? Or is it not included because NFC is supposed to be the perfect inverse of NFD (and one wouldn’t want to decompose i to dotless i + dot)?
While NFC isn't the "perfect inverse" of NFD, this follows from NFC being defined in terms of the same decomposition mappings as NFD. NFC is basically defined as NFD followed by recomposing certain NFD decomposition pairs. Since there's no decomposition mapping for LATIN SMALL LETTER I, it can never be the result of a recomposition.
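The role of the decomposition mappings can be seen directly with `unicodedata.decomposition`. This is a small sketch that contrasts the dotless i case with é, which does have a canonical decomposition; the variable names are just for illustration.

```python
import unicodedata

# U+0069 "i" has no decomposition mapping, so NFC has nothing to recompose into it.
print(repr(unicodedata.decomposition("i")))
# U+00E9 "é" decomposes to U+0065 U+0301, so NFC can rebuild it from that pair.
print(unicodedata.decomposition("\u00e9"))

dotless_plus_dot = "\u0131\u0307"  # dotless i + combining dot above
e_plus_acute = "e\u0301"           # e + combining acute accent

# The first sequence survives NFC unchanged; the second is composed to U+00E9.
print(unicodedata.normalize("NFC", dotless_plus_dot) == dotless_plus_dot)
print(unicodedata.normalize("NFC", e_plus_acute) == "\u00e9")
```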

How to achieve unusual text effect?

I was just reading a post and saw some odd text effects, however I cannot locate how it is achieved or what it is called:
P̢̢̲̭̘̣̪͉͞͞h̴̛̫͉͖̜͙̳͎̕͞͠'̶̀͢҉̯̞̹͈ṉ̶̘̠̯̬̭̖̳͘͞ģ̵̛͠҉̰̝͇̩͍̗͍̘̫͈̺̭̥͉l̨͍̘͔̰͔̖͍̹̠̭̱̰̖͙̦̦͎̕͟u̢̡҉̲̭̲̺̮̖͖͖i̴̢̹̳͉͎̥̪̜͎̼̣̦̖̻͈̖͉͚ͅ ̵͏͇̗̭ͅm̶̨͍̤̪̱͇̤̬̥̥͔̼͍̠̼͕g̷̷̰̩͙̪̫͉̺̯͘͟͠ļ̶̭͇̘̮̕͢ẃ̵̸̷҉͕̬̠̥̤͖̙̲͇̼̹'̺̩̖̟̣͈̖͙̤̫̰̗̯̀͡ń̷̴̶̰̮̺͔̼̺̹̘̟a̷̰̪͙͇̤͓̤̭͎̦͕̻f͏̨͙̰̘͔̟̜̠͈̯̻͕̖̳̝̝́͘ͅḩ̴̛͉͉̲͇̠͙̣̩͙̩͚̮̼̺ͅ ̧̛̟͓̤͇̯͍̫͖͎͈̫̳͓̞͘Ç͘͏͈̹̠̙͎̳̯͚͔̼͙̻͔͖̲̩̹̕ͅt͏̖̲̤̫̤̫̼̪̥̠͙͚͍̭́ͅḩ̡̲͈̫̯͚͉̱͍̳͝ù̧͙̭̙̻̲̙͚͔̲̬͚͢͝͡ḻ̴̵̨̹͉͙̟̯̞̠͔̦̝̩͜h̶̼̜̦͖͍͎͍̕ṷ̴̶̢͙̗̬͇̯̞̗̰̣̬̥̲̣̦ ̵̲͍̩̭̩̗͈͚͟͝R͏̛͘͟҉̫̝̞̪̣̪̻̤̼͖̪͎'̛̯͚͎̳͎̼͓̘͉͢l͟҉̵̘͈͙̣̹̜͍͎̬̺̹̪̜̀y͏͓̞̬͙̥̞̦͎͖̞͖͎̖̀e̶̵̡̺͉̯̭̣̗h͇̺͇̖̼̻̟͓͜͟͜͞ͅ ̴̷̡̨̪͍̙̳̞̭̙̫̯̘͚͇͚̼͙͟w̧̮̜̯̭̘͈̫̳̖̕͜͠g̢̨̗͖̬̠͎͓̱̞͓̭̯̺͕̭̯̦ͅa̴̠̘̬̩͍͜ͅh̵̷̨̜̻͔̖͈̤͈̩͔͈͇̩̞̲̜̩͍̺'̸̨͇̞̜͈͟n̨͟͞҉̤͚͎͇̣̺͚̻̖͖́ͅà̻͉̙̲̲̞͘͝ģ̙̗̙͓̜̣͔̥̫͟͡l̴̨̨̼͚̫̞̙̳͙͢͟ ̢̦͚̲͇̞̺̗̫͇f̸̸̫̠͖͙̜͉̲͖͓̭͇̦̭̩̲͡͠ḩ̸̲̤͍̖̻̣̝̼́̕͝ͅt̴͝҉҉̵͔̮̞̪á̢̕͢͏̗̯̗̙͙͉̪͓͙̣̰̣g͏̶̡͓̤͍͖̜̠̜ͅn̴̶̛̝̼͉̠̻͓
Don't worry though, unless thousands read it I think we are safe.
It's called Zalgo text.
You can Google for an online generator and use it:
TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
As a side note, don't try to parse HTML with RegEx.
You can get a clue if you pick a sample and submit it to Unicode character inspector:
C U+0043 LATIN CAPITAL LETTER C Lu Basic Latin
̧ U+0327 COMBINING CEDILLA Mn Combining Diacritical Marks
͘ U+0358 COMBINING DOT ABOVE RIGHT Mn Combining Diacritical Marks
U+034F COMBINING GRAPHEME JOINER Mn Combining Diacritical Marks
̕ U+0315 COMBINING COMMA ABOVE RIGHT Mn Combining Diacritical Marks
͈ U+0348 COMBINING DOUBLE VERTICAL LINE BELOW Mn Combining Diacritical Marks
[…]
… where Lu stands for Unicode Character Category 'Letter, Uppercase' and Mn stands for Unicode Character Category 'Mark, Nonspacing'.
In short, they're just regular letters attached to all sorts of combining diacritics, thanks to the magic of Unicode. It abuses the fact that é can also be written as e + ´ for entertainment purposes.
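To illustrate the mechanism, here is a minimal Python sketch that piles combining marks onto letters and then strips them again. Both function names are invented for this example, the fixed run of marks is taken from the inspector output above, and real generators presumably pick marks at random.

```python
import unicodedata

def add_marks(text, marks="\u0327\u0358\u0315\u0348"):
    """Append the same run of combining marks after every letter."""
    return "".join(ch + marks if ch.isalpha() else ch for ch in text)

def strip_marks(text):
    """Drop nonspacing (Mn) and enclosing (Me) marks, keeping base characters."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Mn", "Me")
    )

zalgo = add_marks("Cthulhu")
print(zalgo)
print(strip_marks(zalgo))
```

Note that `strip_marks` leaves precomposed letters such as ḩ untouched; normalizing to NFD first would expose their marks as separate code points.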

Whatever happened to the Unicode Character MATHEMATICAL DOUBLE-STRUCK CAPITAL C?

If you look at the Unicode Block for Mathematical Alphanumeric Symbols you notice that the MATHEMATICAL DOUBLE-STRUCK CAPITAL C is missing. And it is not the only one. Why? What is the point of having DOUBLE-STRUCK if you don't have all 26?
The code chart (PDF) for Mathematical Alphanumeric Symbols contains the following explanation:
Double-struck symbols already encoded in the Letterlike
Symbols block and omitted here to avoid duplicate encoding.
Here “and” is apparently to be read as “are”. Anyway, the point is that the Letterlike Symbols block already contains the double-struck C (as well as a few other double-struck letters). This reflects their relatively common use in mathematics (e.g. for ℂ the use as an alternative to C to denote the set of complex numbers) and their presence in old character codes. The block does not have enough code points for adding all the double-struck letters, so additions were made elsewhere. To keep the allocation natural, holes (reserved code points) were left there.
The code chart contains cross references to the characters allocated elsewhere, e.g. for the reserved code point 1D53A it has the comment “→ 2102 ℂ double-struck capital c”.
The ℂ for example is in another block, as said in the article:
The characters with pink background are located in other Unicode blocks, such as Letterlike symbols.
The ℂ specifically is in Letterlike symbols:
ℂ Double-Struck Capital C U+2102
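The hole and its replacement can be checked from Python's `unicodedata` module. This is just a quick sketch; the `<no name>` default is an arbitrary placeholder string.

```python
import unicodedata

hole = "\U0001d53a"    # reserved slot between double-struck capital B and D
letterlike_c = "\u2102"

# The Letterlike Symbols character has a name; the reserved slot is unassigned (Cn).
print(unicodedata.name(letterlike_c))
print(unicodedata.category(hole), unicodedata.name(hole, "<no name>"))

# Compatibility normalization maps characters from both blocks to plain letters.
double_struck_abcd = "\U0001d538\U0001d539\u2102\U0001d53b"
print(unicodedata.normalize("NFKC", double_struck_abcd))
```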

What are the most common non-BMP Unicode characters in actual use? [closed]

In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16.
I would've expected the answer to be Chinese and Japanese characters used in names but not included in the most widespread CJK multibyte character sets, but on the project I do most work on, the English Wiktionary, we have found that the Gothic alphabet is far more common so far.
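To make the "4 bytes in UTF-8 or surrogates in UTF-16" criterion concrete, here is a small Python sketch. `encoded_sizes` is a made-up helper name, and the sample characters are only examples.

```python
def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 code units, is outside the BMP) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return (utf8_bytes, utf16_units, ord(ch) > 0xFFFF)

# A Latin-1 letter, a BMP CJK character, a Gothic letter, and an emoji.
for ch in ["\u00e9", "\u4e2d", "\U00010330", "\U0001f602"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```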
UPDATE
I've written a couple of software tools to scan entire Wikipedias for non-BMP characters and found to my surprise that even in the Japanese Wikipedia Gothic alphabet is the most common. This is also true in the Chinese Wikipedia but it also had many Chinese characters being used up to 50 or 70 times, including "𨭎", "𠬠", and "𩷶".
Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter's public stream. It occurs more frequently than the tilde!
Excellent question!
The answer is the mathematical letters. This past December I did a scan of the entire PubMed Open Access corpus, and came up with these figures for astral characters in it.
The first number in the figures below is how many copies of each given code point I found in the entire corpus. First, though, to give you a notion of the relative frequencies, here are the top ten trans-ASCII code points in that corpus:
2663710 U+002013 ‹–› GC=Pd EN DASH
1065594 U+0000A0 ‹ › GC=Zs NO-BREAK SPACE
1009762 U+0000B1 ‹±› GC=Sm PLUS-MINUS SIGN
784139 U+002212 ‹−› GC=Sm MINUS SIGN
602377 U+002003 ‹ › GC=Zs EM SPACE
528576 U+0003BC ‹μ› GC=Ll GREEK SMALL LETTER MU
519669 U+0003B2 ‹β› GC=Ll GREEK SMALL LETTER BETA
512312 U+0003B1 ‹α› GC=Ll GREEK SMALL LETTER ALPHA
491842 U+00200A ‹ › GC=Zs HAIR SPACE
462505 U+0000B0 ‹°› GC=So DEGREE SIGN
And here now are the trans-BMP code points, in order of descending frequency:
544 U+01D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
450 U+01D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
385 U+01D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
292 U+01D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
285 U+01D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
262 U+01D4A9 ‹𝒩› GC=Lu MATHEMATICAL SCRIPT CAPITAL N
258 U+01D4AB ‹𝒫› GC=Lu MATHEMATICAL SCRIPT CAPITAL P
254 U+01D4A2 ‹𝒢› GC=Lu MATHEMATICAL SCRIPT CAPITAL G
185 U+01D49C ‹𝒜› GC=Lu MATHEMATICAL SCRIPT CAPITAL A
178 U+01D53C ‹𝔼› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL E
137 U+01D4AA ‹𝒪› GC=Lu MATHEMATICAL SCRIPT CAPITAL O
56 U+01D4A5 ‹𝒥› GC=Lu MATHEMATICAL SCRIPT CAPITAL J
48 U+01D4A6 ‹𝒦› GC=Lu MATHEMATICAL SCRIPT CAPITAL K
44 U+01D4B1 ‹𝒱› GC=Lu MATHEMATICAL SCRIPT CAPITAL V
43 U+01D4B2 ‹𝒲› GC=Lu MATHEMATICAL SCRIPT CAPITAL W
42 U+01D4B4 ‹𝒴› GC=Lu MATHEMATICAL SCRIPT CAPITAL Y
41 U+01D4B5 ‹𝒵› GC=Lu MATHEMATICAL SCRIPT CAPITAL Z
35 U+01D4B0 ‹𝒰› GC=Lu MATHEMATICAL SCRIPT CAPITAL U
30 U+01D4AC ‹𝒬› GC=Lu MATHEMATICAL SCRIPT CAPITAL Q
23 U+01D54A ‹𝕊› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL S
21 U+01D539 ‹𝔹› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL B
19 U+01D5A7 ‹𝖧› GC=Lu MATHEMATICAL SANS-SERIF CAPITAL H
18 U+01D517 ‹𝔗› GC=Lu MATHEMATICAL FRAKTUR CAPITAL T
15 U+01D4C3 ‹𝓃› GC=Ll MATHEMATICAL SCRIPT SMALL N
14 U+01D535 ‹𝔵› GC=Ll MATHEMATICAL FRAKTUR SMALL X
13 U+01D4BF ‹𝒿› GC=Ll MATHEMATICAL SCRIPT SMALL J
11 U+01D540 ‹𝕀› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL I
9 U+01D465 ‹𝑥› GC=Ll MATHEMATICAL ITALIC SMALL X
9 U+01D4CE ‹𝓎› GC=Ll MATHEMATICAL SCRIPT SMALL Y
9 U+01D538 ‹𝔸› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL A
8 U+01D4C2 ‹𝓂› GC=Ll MATHEMATICAL SCRIPT SMALL M
8 U+01D54D ‹𝕍› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL V
7 U+01D4B6 ‹𝒶› GC=Ll MATHEMATICAL SCRIPT SMALL A
7 U+01D4BE ‹𝒾› GC=Ll MATHEMATICAL SCRIPT SMALL I
7 U+01D4CC ‹𝓌› GC=Ll MATHEMATICAL SCRIPT SMALL W
7 U+01D516 ‹𝔖› GC=Lu MATHEMATICAL FRAKTUR CAPITAL S
4 U+01D4CF ‹𝓏› GC=Ll MATHEMATICAL SCRIPT SMALL Z
4 U+01D53B ‹𝔻› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL D
4 U+01D54B ‹𝕋› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL T
3 U+01D4BB ‹𝒻› GC=Ll MATHEMATICAL SCRIPT SMALL F
3 U+01D4CA ‹𝓊› GC=Ll MATHEMATICAL SCRIPT SMALL U
3 U+01D507 ‹𝔇› GC=Lu MATHEMATICAL FRAKTUR CAPITAL D
3 U+01D542 ‹𝕂› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL K
3 U+01D546 ‹𝕆› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL O
2 U+01D4BD ‹𝒽› GC=Ll MATHEMATICAL SCRIPT SMALL H
2 U+01D4C5 ‹𝓅› GC=Ll MATHEMATICAL SCRIPT SMALL P
2 U+01D505 ‹𝔅› GC=Lu MATHEMATICAL FRAKTUR CAPITAL B
2 U+01D50E ‹𝔎› GC=Lu MATHEMATICAL FRAKTUR CAPITAL K
2 U+01D541 ‹𝕁› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL J
2 U+01D543 ‹𝕃› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL L
2 U+100002 ‹􀀂› GC=Co <private use character>
1 U+01D4B8 ‹𝒸› GC=Ll MATHEMATICAL SCRIPT SMALL C
1 U+01D4C1 ‹𝓁› GC=Ll MATHEMATICAL SCRIPT SMALL L
1 U+01D53D ‹𝔽› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL F
1 U+01D53E ‹𝔾› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL G
1 U+01D54C ‹𝕌› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL U
1 U+01D6A4 ‹𝚤› GC=Ll MATHEMATICAL ITALIC SMALL DOTLESS I
1 U+01D7D9 ‹𝟙› GC=Nd MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
I really wish I knew what they were using U+100002 to do. :(
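A scan of this kind can be approximated in a few lines of Python. The sketch below is not the tool used for the figures above; `count_astral` and `report` are hypothetical names, and the output format only loosely imitates the listing.

```python
from collections import Counter
import unicodedata

def count_astral(text):
    """Count every code point above U+FFFF (outside the BMP) in `text`."""
    return Counter(ch for ch in text if ord(ch) > 0xFFFF)

def report(counter):
    """Format counts, most frequent first, with category and character name."""
    lines = []
    for ch, n in counter.most_common():
        name = unicodedata.name(ch, "<unnamed>")
        category = unicodedata.category(ch)
        lines.append(f"{n:7d} U+{ord(ch):06X} GC={category} {name}")
    return lines

sample = "a \U0001d49e b \U0001d49e \U00100002 \u2013"
for line in report(count_astral(sample)):
    print(line)
```

For a large corpus you would feed the files through `count_astral` one at a time and add the counters together, since `Counter` objects support `+`.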
If those aren't showing up in your browser, you should install George Douros’s Symbola font (or get it from another mirror for download). It has all the fun Unicode 6.0.0 code points in it, too.
For me, it is the Mathematical Alphanumeric Symbols, which are used for math typesetting with OpenType fonts such as Cambria Math.