What does the \u{...} notation mean in Unicode, and why are only some characters displayed like this in the CLDR project?

In this link you will find the most used characters for each language. Why are some characters in some languages displayed using the \u{...} notation?
I think that what is in the braces is the hexadecimal code of the character, but I can't understand why they would do it only with some characters.

The character sequences enclosed in curly brackets {} are digraphs (trigraphs, …) counted as a distinct letter in the given language (supposedly with its own place in the alphabet), for instance:
the digraph {ch} in cs (Czech);
the trigraph {dzs} in hu (Hungarian);
more complex digraph examples in kkj (Kako), as the following Python snippet shows:
>>> kkj='[a á à â {a\u0327} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ\u0301} {ɛ\u0300} {ɛ\u0302} {ɛ\u0327} f g {gb} {gw} h i í ì î {i\u0327} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ\u0301} {ɔ\u0300} {ɔ\u0302} {ɔ\u0327} p r s t u ú ù û {u\u0327} v w y]'
>>> print( kkj)
[a á à â {a̧} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̧} f g {gb} {gw} h i í ì î {i̧} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̧} p r s t u ú ù û {u̧} v w y]
>>>
For instance, {a\u0327} renders as {a̧}, i.e. something like a Latin Small Letter A with combining cedilla, which has no precomposed Unicode equivalent. A counterexample:
ņ (U+0146) Latin Small Letter N With Cedilla, with decomposition 006E 0327:
>>> import unicodedata
>>> print( 'ņ', unicodedata.normalize('NFC','{n\u0327}'))
ņ {ņ}
Edit:
Characters presented as Unicode literals (\uxxxx = the character with 16-bit hex value xxxx) are the unrenderable ones (or at least hard to render in isolation). The following Python script shows some of them (Bidi_Class values: L = Left_To_Right, R = Right_To_Left, NSM = Nonspacing_Mark, BN = Boundary_Neutral):
# -*- coding: utf-8 -*-
import unicodedata

# The first two assignments are earlier experiments; only the last `pa` is printed.
pa = 'ੱੰ਼੍ੁੂੇੈੋੌ'
pa = '\u0327 \u0A71 \u0A70 \u0A3C ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ ੴ ੳ ਉ ਊ ਓ ਅ ਆ ਐ ਔ ੲ ਇ ਈ ਏ ਸ {ਸ\u0A3C} ਹ ਕ ਖ {ਖ\u0A3C} ਗ {ਗ\u0A3C} ਘ ਙ ਚ ਛ ਜ {ਜ\u0A3C} ਝ ਞ ਟ ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ ਪ ਫ {ਫ\u0A3C} ਬ ਭ ਮ ਯ ਰ ਲ ਵ ੜ \u0A4D ਾ ਿ ੀ \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C'
pa = '\u0300 \u0301 \u0302 \u1DC6 \u1DC7 \u0A71 \u0A70 \u0A3C \u0A4D \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C \u05B7 \u05B8 \u05BF \u200C \u200D \u200E \u200F \u064B \u064C \u064E \u064F \u0650'
# above examples from ·kkj· ·bas· ·pa· ·yi· ·kn· ·ur· ·mzn·
print( pa )
for ch in pa:           # renamed from `chr` to avoid shadowing the builtin
    if ch != ' ':
        if ch == '{' or ch == '}':
            print( ch )
        else:
            print( '\\u%04x' % ord(ch), ch,
                   unicodedata.category(ch),
                   unicodedata.bidirectional(ch) + '\t',
                   str( unicodedata.combining(ch)) + '\t',
                   unicodedata.name(ch, '?') )
Result: .\SO\63659122.py
̀ ́ ̂ ᷆ ᷇ ੱ ੰ ਼ ੍ ੁ ੂ ੇ ੈ ੋ ੌ ַ ָ ֿ ‌ ‍ ‎ ‏ ً ٌ َ ُ ِ
\u0300 ̀ Mn NSM 230 COMBINING GRAVE ACCENT
\u0301 ́ Mn NSM 230 COMBINING ACUTE ACCENT
\u0302 ̂ Mn NSM 230 COMBINING CIRCUMFLEX ACCENT
\u1dc6 ᷆ Mn NSM 230 COMBINING MACRON-GRAVE
\u1dc7 ᷇ Mn NSM 230 COMBINING ACUTE-MACRON
\u0a71 ੱ Mn NSM 0 GURMUKHI ADDAK
\u0a70 ੰ Mn NSM 0 GURMUKHI TIPPI
\u0a3c ਼ Mn NSM 7 GURMUKHI SIGN NUKTA
\u0a4d ੍ Mn NSM 9 GURMUKHI SIGN VIRAMA
\u0a41 ੁ Mn NSM 0 GURMUKHI VOWEL SIGN U
\u0a42 ੂ Mn NSM 0 GURMUKHI VOWEL SIGN UU
\u0a47 ੇ Mn NSM 0 GURMUKHI VOWEL SIGN EE
\u0a48 ੈ Mn NSM 0 GURMUKHI VOWEL SIGN AI
\u0a4b ੋ Mn NSM 0 GURMUKHI VOWEL SIGN OO
\u0a4c ੌ Mn NSM 0 GURMUKHI VOWEL SIGN AU
\u05b7 ַ Mn NSM 17 HEBREW POINT PATAH
\u05b8 ָ Mn NSM 18 HEBREW POINT QAMATS
\u05bf ֿ Mn NSM 23 HEBREW POINT RAFE
\u200c ‌ Cf BN 0 ZERO WIDTH NON-JOINER
\u200d ‍ Cf BN 0 ZERO WIDTH JOINER
\u200e ‎ Cf L 0 LEFT-TO-RIGHT MARK
\u200f ‏ Cf R 0 RIGHT-TO-LEFT MARK
\u064b ً Mn NSM 27 ARABIC FATHATAN
\u064c ٌ Mn NSM 28 ARABIC DAMMATAN
\u064e َ Mn NSM 30 ARABIC FATHA
\u064f ُ Mn NSM 31 ARABIC DAMMA
\u0650 ِ Mn NSM 32 ARABIC KASRA

It seems that all code points that don't have a well-defined stand-alone appearance (or are not meant to be used as stand-alone characters) are represented with this notation.
For example, U+0A3C appears in the "character" {ਫ\u0A3C}. U+0A3C is a combining code point that modifies the one before it.
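A minimal sketch of that rule in Python (this is an assumption about which characters CLDR escapes, not its documented logic, and the `needs_escape` helper is hypothetical):

```python
import unicodedata

# Hypothetical predicate: escape combining marks (Mn/Mc/Me) and invisible
# format characters (Cf), since neither renders usefully on its own.
def needs_escape(ch):
    return unicodedata.category(ch) in ('Mn', 'Mc', 'Me', 'Cf')

for ch in ('a', '\u0A3C', '\u200D'):
    print(f'\\u{ord(ch):04X}' if needs_escape(ch) else ch)
```

Running this prints `a`, `\u0A3C`, and `\u200D`: the ordinary letter passes through, while the Gurmukhi nukta and the zero width joiner are shown escaped.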

Related

BreakPermittedHere char in filename

I received (as a mail attachment) a .pdf file with a U+0082 (Break Permitted Here) character, presumably in place of an é:
Sciences Num<here>riques et Technologie.pdf
What could have happened ?
It's a simple mojibake case: the sender and the receiver apply different code pages.
Example for given characters .\Py\mojibakeWindows.py é \x82
Mojibake prove using 97 codecs
string é ['U+00e9'] Latin Small Letter E With Acute
versus ['U+0082'] ??? Cc
[b'\x82']
é ['cp437', 'cp720', 'cp775', 'cp850', 'cp852', 'cp857', 'cp858', 'cp860', 'cp861', 'cp863', 'cp865']
['cp1006', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'iso8859_11']
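The core of the confusion can be reproduced in a couple of lines of Python (assuming the sender encoded é under CP437 and the receiver decoded the bytes as Latin-1; the other code-page pairs listed above behave the same way):

```python
# 'é' is byte 0x82 in CP437; decoded as Latin-1, that byte becomes U+0082,
# the BREAK PERMITTED HERE control character.
raw = 'é'.encode('cp437')
print(raw)                    # b'\x82'
print(hex(ord(raw.decode('latin_1'))))   # 0x82, i.e. U+0082
```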
Here the mojibakeWindows.py script is as follows:
import sys
import codecs
import unicodedata

if len(sys.argv) == 3:
    str1st = sys.argv[1].encode('raw_unicode_escape').decode('unicode_escape').encode('utf-16_BE','surrogatepass').decode('utf-16_BE')
    str2nd = sys.argv[2].encode('raw_unicode_escape').decode('unicode_escape').encode('utf-16_BE','surrogatepass').decode('utf-16_BE')
else:
    print( 'need two `string` parameters e.g. as follows:')
    print( sys.argv[0], '"╧╤╪"', '"ÏÑØ"' )
    sys.exit()
codec_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig'] + ['cp273', 'cp1125', 'iso8859_11', 'koi8_t', 'kz1048']  # 'cp65001',
print( 'Mojibake prove using', len( codec_list ), 'codecs' )
str1stname = unicodedata.name(str1st, '??? {}'.format(unicodedata.category(str1st))) if len(str1st)==1 else ''
str2ndname = unicodedata.name(str2nd, '??? {}'.format(unicodedata.category(str2nd))) if len(str2nd)==1 else ''
print( 'string', str1st, ['U+{0:04x}'.format( ord(ch)) for ch in str1st], str1stname.title() )
print( 'versus', str2nd, ['U+{0:04x}'.format( ord(ch)) for ch in str2nd], str2ndname.title() )
str1list = []
str2list = []
strXlist = []
for cod in codec_list:
    for doc in codec_list:
        if cod != doc:
            # str1ste = codecs.encode( str1st, encoding=cod, errors='replace')
            try:
                str1ste = codecs.encode( str1st, encoding=cod, errors='strict')
            except Exception:
                str1ste = b'?' * len(str1st)
            # str2nde = codecs.encode( str2nd, encoding=doc, errors='replace')
            try:
                str2nde = codecs.encode( str2nd, encoding=doc, errors='strict')
            except Exception:
                str2nde = b'?' * len(str2nd)
            if str1ste == str2nde and b'?' not in str1ste:
                if cod not in str1list: str1list.append( cod )
                if doc not in str2list: str2list.append( doc )
                if str1ste not in strXlist: strXlist.append( str1ste )
print( strXlist )
print( str1st, str1list )
print( str2nd, str2list )
Another example uses my Alt KeyCode Finder script (see column Alt0 for the ACP code and column Dec for the OEMCP code): powershell -COMMAND .\PShell\MyCharMap.ps1 é,0x82
Ch Unicode Dec CP IME Alt Alt0 IME 0405/cs-CZ; CP65001; ACP 65001
é U+00E9 233 …233… Latin Small Letter E With Acute
130 CP437 en-US 0233 (ACP 1252) US & Western Eu
130 CP850 en-GB 0233 (ACP 1252) US & Western Eu
130 CP852 cs-CZ 0233 (ACP 1250) Central Europe
130 CP775 et-EE 0233 (ACP 1257) Baltic
130 CP857 tr-TR 0233 (ACP 1254) Turkish
130 CP720 ar-EG 0233 (ACP 1256) Arabic
vi-VN 0233 (ACP 1258) Vietnamese
U+0082 130 …130… Break Permitted Here
130 CP869 el-gr (ACP 1253) Greek-2
th-TH 0130 (ACP 874) Thai

UTF-8 encoding makes me confused

let buf1 = Buffer.from("3", "utf8");
let buf2 = Buffer.from("Здравствуйте", "utf8");
// <Buffer 33>
// <Buffer d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5>
Why does char '3' encode to '33' in buf1 but 'd0 97' in buf2?
Because 3 is not З, despite the similarity to the untrained eye. Look closer and you'll see the difference, however subtle.
The former is Unicode code point U+0033 - DIGIT THREE (see here), while the latter is U+0417 - CYRILLIC CAPITAL LETTER ZE (see here), encoded in UTF-8 as d0 97.
The Russian word actually means "hello", pronounced (very roughly, since I only know hello and goodbye, taught by a Russian girlfriend many decades ago) "Strasvoytza"; there is no "three" anywhere in it.
The first character of the second buffer is the Cyrillic character "Ze" https://en.m.wikipedia.org/wiki/Ze_(Cyrillic) and not the Arabic numeral 3 https://en.m.wikipedia.org/wiki/3
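The same comparison can be made in Python; the bytes match the Buffer contents in the Node example:

```python
# Visually similar characters, but different code points and different
# UTF-8 encodings: ASCII digits are 1 byte, Cyrillic letters are 2 bytes.
print('3'.encode('utf-8').hex())  # 33   (U+0033 DIGIT THREE)
print('З'.encode('utf-8').hex())  # d097 (U+0417 CYRILLIC CAPITAL LETTER ZE)
```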

Unicode letters with more than 1 alphabetic latin character?

I'm not really sure how to express it, but I'm searching for Unicode letters that look like more than one Latin letter.
I found this in Word so far:
DZ
Dz
dz
NJ
Lj
LJ
Nj
nj
Any others?
Here are some of the characters I've found. I first did this manually by looking at some likely blocks; later I wrote a Python script to do it automatically, which you can find at the end of this answer.
Digraphs

Two Glyphs | Digraph | Code Points
DZ, Dz, dz | Ǳ, ǲ, ǳ | U+01F1 U+01F2 U+01F3
DŽ, Dž, dž | Ǆ, ǅ, ǆ | U+01C4 U+01C5 U+01C6
IJ, ij | Ĳ, ĳ | U+0132 U+0133
LJ, Lj, lj | Ǉ, ǈ, ǉ | U+01C7 U+01C8 U+01C9
NJ, Nj, nj | Ǌ, ǋ, ǌ | U+01CA U+01CB U+01CC
Ligatures

Non-ligature | Ligature | Code Points
AA, aa | Ꜳ, ꜳ | U+A732, U+A733
AE, ae | Æ, æ | U+00C6, U+00E6
AO, ao | Ꜵ, ꜵ | U+A734, U+A735
AU, au | Ꜷ, ꜷ | U+A736, U+A737
AV, av | Ꜹ, ꜹ | U+A738, U+A739
AV, av (with bar) | Ꜻ, ꜻ | U+A73A, U+A73B
AY, ay | Ꜽ, ꜽ | U+A73C, U+A73D
et | 🙰 | U+1F670
ff | ﬀ | U+FB00
ffi | ﬃ | U+FB03
ffl | ﬄ | U+FB04
fi | ﬁ | U+FB01
fl | ﬂ | U+FB02
OE, oe | Œ, œ | U+0152, U+0153
OO, oo | Ꝏ, ꝏ | U+A74E, U+A74F
ſs, ſz | ẞ, ß | U+1E9E, U+00DF
st | ﬆ | U+FB06
ſt | ﬅ | U+FB05
TZ, tz | Ꜩ, ꜩ | U+A728, U+A729
ue | ᵫ | U+1D6B
VY, vy | Ꝡ, ꝡ | U+A760, U+A761
There are a few other ligatures that are used for phonetic transcription but look like Latin characters:

Non-ligature | Ligature | Code Points
db | ȸ | U+0238
dz | ʣ | U+02A3
IJ, ij | Ĳ, ĳ | U+0132, U+0133
ls | ʪ | U+02AA
lz | ʫ | U+02AB
qp | ȹ | U+0239
ts | ʦ | U+02A6
ui | ꭐ | U+AB50
turned ui | ꭑ | U+AB51
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode#Digraphs_and_ligatures
Edit:
There are more letterlike symbols besides ℻ and ℡, like those the OP found in the comments:
℀ ℁ ⅍ ℅ ℆ ℔ ℠ ™
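These letterlike symbols carry a compatibility decomposition to the plain ASCII letters they depict, which NFKC normalization exposes (a quick check in Python):

```python
import unicodedata

# Each letterlike symbol normalizes to its multi-letter ASCII form
for ch in '™℠℻℡':
    print(ch, '->', unicodedata.normalize('NFKC', ch))
```

This prints `TM`, `SM`, `FAX`, and `TEL` respectively, which is exactly how the script at the end of this answer finds such characters.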
Longer letter sequences come mainly from the CJK Compatibility block:
U+XXXX: 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+338x: ㎀ ㎁ ㎂ ㎃ ㎄ ㎅ ㎆ ㎇ ㎈ ㎉ ㎊ ㎋ ㎌ ㎍ ㎎ ㎏
U+339x: ㎐ ㎑ ㎒ ㎓ ㎔ ㎕ ㎖ ㎗ ㎘ ㎙ ㎚ ㎛ ㎜ ㎝ ㎞ ㎟
U+33Ax: ㎠ ㎡ ㎢ ㎣ ㎤ ㎥ ㎦ ㎧ ㎨ ㎩ ㎪ ㎫ ㎬ ㎭ ㎮ ㎯
U+33Bx: ㎰ ㎱ ㎲ ㎳ ㎴ ㎵ ㎶ ㎷ ㎸ ㎹ ㎺ ㎻ ㎼ ㎽ ㎾ ㎿
U+33Cx: ㏀ ㏁ ㏂ ㏃ ㏄ ㏅ ㏆ ㏇ ㏈ ㏉ ㏊ ㏋ ㏌ ㏍ ㏎ ㏏
U+33Dx: ㏐ ㏑ ㏒ ㏓ ㏔ ㏕ ㏖ ㏗ ㏘ ㏙ ㏚ ㏛ ㏜ ㏝ ㏞ ㏟
Among the 3-letter-like symbols are ㎈ ㎑ ㎒ ㎓ ㎔ ㏒ ㏕ ㏖ ㏙ ㎪ ㎫ ㎬ ㎭ ㏆ ㏿ ㍱… Probably the ones with the most letters are ㎉ and ㎯.
Unicode even has code points for Roman numerals, where another 4-letter-like character can be found: Ⅷ
U+XXXX: 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+215x: ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞ ⅟
U+216x: Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
U+217x: ⅰ ⅱ ⅲ ⅳ ⅴ ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ ⅻ ⅼ ⅽ ⅾ ⅿ
U+218x: ↀ ↁ ↂ Ↄ ↄ ↅ ↆ ↇ ↈ ↉ ↊ ↋
If ordinary numbers count as well, there are code points for multi-digit sequences, such as ⒆ ⒇ ⓳ ⓴ in the Enclosed Alphanumerics block:
U+XXXX: 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+246x: ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ ⑪ ⑫ ⑬ ⑭ ⑮ ⑯
U+247x: ⑰ ⑱ ⑲ ⑳ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾ ⑿
U+248x: ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏
U+249x: ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖ ⒗ ⒘ ⒙ ⒚ ⒛ ⒜ ⒝ ⒞ ⒟
U+24Ax: ⒠ ⒡ ⒢ ⒣ ⒤ ⒥ ⒦ ⒧ ⒨ ⒩ ⒪ ⒫ ⒬ ⒭ ⒮ ⒯
U+24Bx: ⒰ ⒱ ⒲ ⒳ ⒴ ⒵ Ⓐ Ⓑ Ⓒ Ⓓ Ⓔ Ⓕ Ⓖ Ⓗ Ⓘ Ⓙ
U+24Cx: Ⓚ Ⓛ Ⓜ Ⓝ Ⓞ Ⓟ Ⓠ Ⓡ Ⓢ Ⓣ Ⓤ Ⓥ Ⓦ Ⓧ Ⓨ Ⓩ
U+24Dx: ⓐ ⓑ ⓒ ⓓ ⓔ ⓕ ⓖ ⓗ ⓘ ⓙ ⓚ ⓛ ⓜ ⓝ ⓞ ⓟ
U+24Ex: ⓠ ⓡ ⓢ ⓣ ⓤ ⓥ ⓦ ⓧ ⓨ ⓩ ⓪ ⓫ ⓬ ⓭ ⓮ ⓯
U+24Fx: ⓰ ⓱ ⓲ ⓳ ⓴ ⓵ ⓶ ⓷ ⓸ ⓹ ⓺ ⓻ ⓼ ⓽ ⓾ ⓿
and in the Enclosed Alphanumeric Supplement:
🅫, 🅪, 🆋, 🆌, 🆍, 🄭, 🄮, 🅊, 🅋, 🅌, 🅍, 🅎, 🅏
A few more:
Currency symbol group
₧ ₨ ₶ ₯ ₠ ₢ ₷
Miscellaneous technical group
⎂ ⏨
Control pictures (probably you'll need to zoom out to see)
U+XXXX: 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+240x: ␀ ␁ ␂ ␃ ␄ ␅ ␆ ␇ ␈ ␉ ␊ ␋ ␌ ␍ ␎ ␏
U+241x: ␐ ␑ ␒ ␓ ␔ ␕ ␖ ␗ ␘ ␙ ␚ ␛ ␜ ␝ ␞ ␟
U+242x: ␠ ␡ ␢ ␣ ␤ ␥ ␦
Alchemical Symbols
🜀 🜅 🜆 🜇 🜈 🝪 🝫 🝬 🝛 🝜 🝝
Musical Symbols
𝄶 𝄷 𝄸 𝄹 𝄉 𝄊 𝄫
And there are the emojis 🔟 💤🆔🚾🆖🆗🔢🔡🔠 💯🆘🆎🆑™🔙🔚🔜🔝🔛📆🗓🔞
Vertical bars may be read as an uppercase I or a lowercase l (like your 〷 example, which is actually the TELEGRAPH LINE FEED SEPARATOR SYMBOL), and we have:
ꔖ U+A516 Vai syllable SEE
⫼ U+2AFC LARGE TRIPLE VERTICAL BAR OPERATOR
𝍫 U+1D36B COUNTING ROD TENS DIGIT THREE
〢 〣 Suzhou numerals
川 Chinese "river"
║ BOX DRAWINGS DOUBLE VERTICAL...
Here's the automatic script to find the multi-character letters:
import unicodedata

for c in range(0, 0x10FFFF + 1):
    d = unicodedata.normalize('NFKD', chr(c))
    if len(d) > 1 and d.isascii() and d.isalpha():   # str.isascii() needs Python 3.7+
        print("U+%04X (%s): %s" % (c, chr(c), d))
It won't find ligatures like æ or œ, because they're not considered orthographic ligatures and aren't decomposable in Unicode. Here's the result under Unicode 11.0.0 (checked with unicodedata.unidata_version):
U+0132 (IJ): IJ
U+0133 (ij): ij
U+01C7 (LJ): LJ
U+01C8 (Lj): Lj
U+01C9 (lj): lj
U+01CA (NJ): NJ
U+01CB (Nj): Nj
U+01CC (nj): nj
U+01F1 (DZ): DZ
U+01F2 (Dz): Dz
U+01F3 (dz): dz
U+20A8 (₨): Rs
U+2116 (№): No
U+2120 (℠): SM
U+2121 (℡): TEL
U+2122 (™): TM
U+213B (℻): FAX
U+2161 (Ⅱ): II
U+2162 (Ⅲ): III
U+2163 (Ⅳ): IV
U+2165 (Ⅵ): VI
U+2166 (Ⅶ): VII
U+2167 (Ⅷ): VIII
U+2168 (Ⅸ): IX
U+216A (Ⅺ): XI
U+216B (Ⅻ): XII
U+2171 (ⅱ): ii
U+2172 (ⅲ): iii
U+2173 (ⅳ): iv
U+2175 (ⅵ): vi
U+2176 (ⅶ): vii
U+2177 (ⅷ): viii
U+2178 (ⅸ): ix
U+217A (ⅺ): xi
U+217B (ⅻ): xii
U+3250 (㉐): PTE
U+32CC (㋌): Hg
U+32CD (㋍): erg
U+32CE (㋎): eV
U+32CF (㋏): LTD
U+3371 (㍱): hPa
U+3372 (㍲): da
U+3373 (㍳): AU
U+3374 (㍴): bar
U+3375 (㍵): oV
U+3376 (㍶): pc
U+3377 (㍷): dm
U+337A (㍺): IU
U+3380 (㎀): pA
U+3381 (㎁): nA
U+3383 (㎃): mA
U+3384 (㎄): kA
U+3385 (㎅): KB
U+3386 (㎆): MB
U+3387 (㎇): GB
U+3388 (㎈): cal
U+3389 (㎉): kcal
U+338A (㎊): pF
U+338B (㎋): nF
U+338E (㎎): mg
U+338F (㎏): kg
U+3390 (㎐): Hz
U+3391 (㎑): kHz
U+3392 (㎒): MHz
U+3393 (㎓): GHz
U+3394 (㎔): THz
U+3396 (㎖): ml
U+3397 (㎗): dl
U+3398 (㎘): kl
U+3399 (㎙): fm
U+339A (㎚): nm
U+339C (㎜): mm
U+339D (㎝): cm
U+339E (㎞): km
U+33A9 (㎩): Pa
U+33AA (㎪): kPa
U+33AB (㎫): MPa
U+33AC (㎬): GPa
U+33AD (㎭): rad
U+33B0 (㎰): ps
U+33B1 (㎱): ns
U+33B3 (㎳): ms
U+33B4 (㎴): pV
U+33B5 (㎵): nV
U+33B7 (㎷): mV
U+33B8 (㎸): kV
U+33B9 (㎹): MV
U+33BA (㎺): pW
U+33BB (㎻): nW
U+33BD (㎽): mW
U+33BE (㎾): kW
U+33BF (㎿): MW
U+33C3 (㏃): Bq
U+33C4 (㏄): cc
U+33C5 (㏅): cd
U+33C8 (㏈): dB
U+33C9 (㏉): Gy
U+33CA (㏊): ha
U+33CB (㏋): HP
U+33CC (㏌): in
U+33CD (㏍): KK
U+33CE (㏎): KM
U+33CF (㏏): kt
U+33D0 (㏐): lm
U+33D1 (㏑): ln
U+33D2 (㏒): log
U+33D3 (㏓): lx
U+33D4 (㏔): mb
U+33D5 (㏕): mil
U+33D6 (㏖): mol
U+33D7 (㏗): PH
U+33D9 (㏙): PPM
U+33DA (㏚): PR
U+33DB (㏛): sr
U+33DC (㏜): Sv
U+33DD (㏝): Wb
U+33FF (㏿): gal
U+FB00 (ff): ff
U+FB01 (fi): fi
U+FB02 (fl): fl
U+FB03 (ffi): ffi
U+FB04 (ffl): ffl
U+FB05 (ſt): st
U+FB06 (st): st
U+1F12D (🄭): CD
U+1F12E (🄮): WZ
U+1F14A (🅊): HV
U+1F14B (🅋): MV
U+1F14C (🅌): SD
U+1F14D (🅍): SS
U+1F14E (🅎): PPV
U+1F14F (🅏): WC
U+1F16A (🅪): MC
U+1F16B (🅫): MD
U+1F190 (🆐): DJ
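A quick check of the caveat above about æ and œ: unlike, say, the ﬀ ligature, æ has no decomposition mapping at all, so NFKD leaves it alone and the script cannot find it:

```python
import unicodedata

# U+00E6 has no decomposition, so NFKD leaves it unchanged...
print(unicodedata.normalize('NFKD', 'æ'))     # æ
print(repr(unicodedata.decomposition('æ')))   # ''
# ...whereas U+FB00 (ff ligature) has a compatibility decomposition
print(unicodedata.decomposition('\uFB00'))    # <compat> 0066 0066
```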

How to identify all non-basic UTF-8 characters in a set of strings in perl

I'm using perl's XML::Writer to generate an import file for a program called OpenNMS. According to the documentation I need to pre-declare all special characters as XML ENTITY declarations. Obviously I need to go through all strings I'm exporting and catalogue the special characters used. What's the easiest way to work out which characters in a perl string are "special" with respect to UTF-8 encoding? Is there any way to work out what the entity names for those characters should be?
In order to find "special" characters, you can use ord to find out the codepoint. Here's an example:
# Create a Unicode test file with some Latin chars, some Cyrillic,
# and some outside the BMP.
# The BMP is the basic multilingual plane, see perluniintro.
# (Not sure what you mean by saying "non-basic".)
perl -CO -lwe "print join '', map chr, 97 .. 100, 0x410 .. 0x415, 0x10000 .. 0x10003" > u.txt
# Read it and find codepoints outside the BMP.
perl -CI -nlwe "print for map ord, grep ord > 0xffff, split //" < u.txt
You can get a good introduction from reading perluniintro.
I'm not sure what the docs you're referring to mean in the section "Exported XML".
Looks like some limitation of a system which is de facto ASCII and doesn't do Unicode.
Or a misunderstanding of XML. Or both.
Anyway, if you're looking for names you could use or reference the canonical ones.
See XML Entity Definitions for Characters or one of the older documents for HTML or MathML referenced therein.
You might look into the uniquote program. It has a --xml option. For example:
$ cat sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
3 NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
5 invisible characters: (4⁄3⁢π⁢r³) and (4⁄3⁢π⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
7 astral + combining chars: (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
8 wide characters: (wide) and (wide).
9 regular characters: (normal) and (normal).
$ uniquote -x sample
1 NFD single combining characters: (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}) and (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}).
2 NFC single combining characters: (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}) and (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}).
3 NFD multiple combining characters: (ha\x{302}\x{303}c\x{327}\x{30C}k) and (ha\x{303}\x{302}c\x{327}\x{30C}k).
3 NFC multiple combining characters: (h\x{1EAB}\x{E7}\x{30C}k) and (h\x{E3}\x{302}\x{E7}\x{30C}k).
5 invisible characters: (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}) and (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}).
6 astral characters: (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]) and (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]).
7 astral + combining chars: (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]) and (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]).
8 wide characters: (\x{FF57}\x{FF49}\x{FF44}\x{FF45}) and (\x{FF57}\x{FF49}\x{FF44}\x{FF45}).
9 regular characters: (normal) and (normal).
$ uniquote -b sample
1 NFD single combining characters: (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81) and (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81).
2 NFC single combining characters: (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9) and (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9).
3 NFD multiple combining characters: (ha\xCC\x82\xCC\x83c\xCC\xA7\xCC\x8Ck) and (ha\xCC\x83\xCC\x82c\xCC\xA7\xCC\x8Ck).
3 NFC multiple combining characters: (h\xE1\xBA\xAB\xC3\xA7\xCC\x8Ck) and (h\xC3\xA3\xCC\x82\xC3\xA7\xCC\x8Ck).
5 invisible characters: (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3) and (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3).
6 astral characters: (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]) and (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]).
7 astral + combining chars: (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]) and (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]).
8 wide characters: (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85) and (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85).
9 regular characters: (normal) and (normal).
$ uniquote -v sample
1 NFD single combining characters: (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}) and (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}).
2 NFC single combining characters: (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}) and (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}).
3 NFD multiple combining characters: (ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k) and (ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k).
3 NFC multiple combining characters: (h\N{LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k) and (h\N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING CIRCUMFLEX ACCENT}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k).
5 invisible characters: (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}) and (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}).
6 astral characters: (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]).
7 astral + combining chars: (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]).
8 wide characters: (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}) and (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}).
9 regular characters: (normal) and (normal).
$ uniquote --xml sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hâçk) and (hãçk).
3 NFC multiple combining characters: (hẫk) and (hãk).
5 invisible characters: (4⁄3⁢r³) and (4⁄3⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀 + 𝐁]) and (𝐂 = sqrt[𝐀 + 𝐁]).
7 astral + combining chars: (𝐂 = sqrt[𝐀 + 𝐁]) and (𝐂 = sqrt[𝐀 + 𝐁]).
8 wide characters: (w) and (w).
9 regular characters: (normal) and (normal).
$ uniquote --verbose --html sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
3 NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
5 invisible characters: (4⁄3⁢π⁢r³) and (4⁄3⁢π⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
7 astral + combining chars: (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
8 wide characters: (wide) and (wide).
9 regular characters: (normal) and (normal).

The number of characters in each unicode block

Does anyone know of a reference showing the number of characters in each Unicode block (in a newer version, such as 5.x or 6.0.0)?
Thanks a lot.
http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt contains the data you are interested in.
http://www.unicode.org/Public/6.0.0/ucd/ReadMe.txt contains some instructions and refers to http://unicode.org/reports/tr44/ for interpreting the data. In that document you should read http://unicode.org/reports/tr44/#UnicodeData.txt.
unichars
Does this answer your question?
% unichars '\p{InCyrillic}' | wc -l
256
% unichars '\p{InEthiopic}' | wc -l
356
% unichars '\p{InLatin1}' | wc -l
128
% unichars '\p{InCombiningDiacriticalMarks}' | wc -l
112
To include the 16 astral planes, add -a:
% unichars -a '\p{InAncientGreekNumbers}' | wc -l
75
If you want unassigned or Han or Hangul, you need -u:
% unichars -u '\p{InEthiopic}' | wc -l
384
% unichars -u '\p{InCJKUnifiedIdeographsExtensionA}' | wc -l
6592
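If you don't have unichars handy, a similar count can be approximated in Python by counting the code points that have a name within a block's range (the block boundary used here, U+0400..U+04FF for Cyrillic, is an assumption taken from the Unicode block list):

```python
import unicodedata

# Count assigned (named) code points in the Cyrillic block U+0400..U+04FF
count = sum(1 for cp in range(0x0400, 0x0500)
            if unicodedata.name(chr(cp), None) is not None)
print(count)  # 256, matching the unichars InCyrillic count above
```

The number you get depends on the Unicode version your Python build ships with (`unicodedata.unidata_version`); the Cyrillic block has been fully assigned for many versions.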
You can get other information, too:
% unichars '\P{IsGreek}' '\p{InGreek}'
ʹ 884 0374 GREEK NUMERAL SIGN
; 894 037E GREEK QUESTION MARK
΅ 901 0385 GREEK DIALYTIKA TONOS
· 903 0387 GREEK ANO TELEIA
Ϣ 994 03E2 COPTIC CAPITAL LETTER SHEI
ϣ 995 03E3 COPTIC SMALL LETTER SHEI
Ϥ 996 03E4 COPTIC CAPITAL LETTER FEI
ϥ 997 03E5 COPTIC SMALL LETTER FEI
Ϧ 998 03E6 COPTIC CAPITAL LETTER KHEI
ϧ 999 03E7 COPTIC SMALL LETTER KHEI
Ϩ 1000 03E8 COPTIC CAPITAL LETTER HORI
ϩ 1001 03E9 COPTIC SMALL LETTER HORI
Ϫ 1002 03EA COPTIC CAPITAL LETTER GANGIA
ϫ 1003 03EB COPTIC SMALL LETTER GANGIA
Ϭ 1004 03EC COPTIC CAPITAL LETTER SHIMA
ϭ 1005 03ED COPTIC SMALL LETTER SHIMA
Ϯ 1006 03EE COPTIC CAPITAL LETTER DEI
ϯ 1007 03EF COPTIC SMALL LETTER DEI
% unichars '\p{IsGreek}' '\P{InGreek}' | wc -l
250
% unichars '\P{IsGreek}' '\p{InGreek}' | wc -l
18
% unichars '\p{In=1.1}' | wc -l
6362
% unichars '\p{In=6.0}' | wc -l
15087
uniprops
Here’s uniprops:
% uniprops -l | grep -c 'Block='
84
% uniprops digamma 450 %
U+03DC ‹Ϝ› \N{ GREEK LETTER DIGAMMA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC Changes_When_Casefolded CWCF
Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base
Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print
Upper Uppercase Word XID_Continue XIDC XID_Start XIDS XPosixAlnum XPosixAlpha XPosixGraph XPosixPrint XPosixUpper
XPosixWord
U+0450 ‹ѐ› \N{ CYRILLIC SMALL LETTER IE WITH GRAVE }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InCyrillic Cyrillic Is_Cyrillic Cased Cased_Letter LC Changes_When_Casemapped
CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Cyrl Ll L Gr_Base Grapheme_Base Graph GrBase
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
XPosixAlnum XPosixAlpha XPosixGraph XPosixLower XPosixPrint XPosixWord
U+0025 ‹%› \N{ PERCENT SIGN }:
\pP \p{Po}
All Any ASCII Assigned Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct Print Punctuation XPosixGraph XPosixPrint XPosixPunct
Or even all these:
% uniprops -vag 777
U+0777 ‹ݷ› \N{ ARABIC LETTER FARSI YEH WITH EXTENDED ARABIC-INDIC DIGIT FOUR BELOW }:
\w \pL \p{L_} \p{Lo}
\p{All} \p{Any} \p{Alnum} \p{Alpha} \p{Alphabetic} \p{Arab} \p{Arabic} \p{Assigned} \p{Is_Arabic} \p{InArabicSupplement} \p{L} \p{Lo} \p{Gr_Base} \p{Grapheme_Base} \p{Graph}
\p{GrBase} \p{ID_Continue} \p{IDC} \p{ID_Start} \p{IDS} \p{Letter} \p{L_} \p{Other_Letter} \p{Print} \p{Word} \p{XID_Continue} \p{XIDC} \p{XID_Start} \p{XIDS} \p{XPosixAlnum}
\p{XPosixAlpha} \p{XPosixGraph} \p{XPosixPrint} \p{XPosixWord}
\p{Age:5.1} \p{Script=Arabic} \p{Bidi_Class:AL} \p{Bidi_Class=Arabic_Letter} \p{Bidi_Class:Arabic_Letter} \p{Bc=AL} \p{Block:Arabic_Supplement} \p{Canonical_Combining_Class:0}
\p{Canonical_Combining_Class=Not_Reordered} \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR} \p{Canonical_Combining_Class:NR} \p{Decomposition_Type:None} \p{Dt=None}
\p{East_Asian_Width=Neutral} \p{East_Asian_Width:Neutral} \p{General_Category:L} \p{General_Category=Letter} \p{General_Category:Letter} \p{Gc=L} \p{General_Category:Lo}
\p{General_Category=Other_Letter} \p{General_Category:Other_Letter} \p{Gc=Lo} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX} \p{Grapheme_Cluster_Break:XX}
\p{Grapheme_Cluster_Break=Other} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:Yeh}
\p{Jg=Yeh} \p{Joining_Type:D} \p{Joining_Type=Dual_Joining} \p{Joining_Type:Dual_Joining} \p{Jt=D} \p{Line_Break:AL} \p{Line_Break=Alphabetic} \p{Line_Break:Alphabetic}
\p{Lb=AL} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2} \p{Present_In:6.0} \p{In=6.0}
\p{Script:Arab} \p{Script:Arabic} \p{Sc=Arab} \p{Sentence_Break:LE} \p{Sentence_Break=OLetter} \p{Sentence_Break:OLetter} \p{SB=LE} \p{Word_Break:ALetter} \p{WB=LE}
\p{Word_Break:LE} \p{Word_Break=ALetter}
My uniprops and unichars scripts should run anywhere Perl 5.10 or later is available. There's also a uninames script that goes with them.
There's a list available here, although it does not specify which version of the standard it applies to.