Why does Unicode have several reserved character codes?
Look at the Unicode code charts for two languages, Kannada and Tamil.
Both languages are very old, and I think there is no chance of new characters being added to them.
EDIT: Then why are they wasting character codes by marking them as reserved?
Why don't they place the reserved character codes at the end of each language's character set?
This has to do with how the Unicode consortium doles out its allocated blocks, scripts, and code points. For example, in Block=Tamil, the start of it runs this way:
$ unichars '\p{Block=Tamil}' | head -20
U+00B82 ◌ஂ GC=Mn SC=Tamil TAMIL SIGN ANUSVARA
U+00B83 ஃ GC=Lo SC=Tamil TAMIL SIGN VISARGA
U+00B85 அ GC=Lo SC=Tamil TAMIL LETTER A
U+00B86 ஆ GC=Lo SC=Tamil TAMIL LETTER AA
U+00B87 இ GC=Lo SC=Tamil TAMIL LETTER I
U+00B88 ஈ GC=Lo SC=Tamil TAMIL LETTER II
U+00B89 உ GC=Lo SC=Tamil TAMIL LETTER U
U+00B8A ஊ GC=Lo SC=Tamil TAMIL LETTER UU
U+00B8E எ GC=Lo SC=Tamil TAMIL LETTER E
U+00B8F ஏ GC=Lo SC=Tamil TAMIL LETTER EE
U+00B90 ஐ GC=Lo SC=Tamil TAMIL LETTER AI
U+00B92 ஒ GC=Lo SC=Tamil TAMIL LETTER O
U+00B93 ஓ GC=Lo SC=Tamil TAMIL LETTER OO
U+00B94 ஔ GC=Lo SC=Tamil TAMIL LETTER AU
U+00B95 க GC=Lo SC=Tamil TAMIL LETTER KA
U+00B99 ங GC=Lo SC=Tamil TAMIL LETTER NGA
U+00B9A ச GC=Lo SC=Tamil TAMIL LETTER CA
U+00B9C ஜ GC=Lo SC=Tamil TAMIL LETTER JA
U+00B9E ஞ GC=Lo SC=Tamil TAMIL LETTER NYA
U+00B9F ட GC=Lo SC=Tamil TAMIL LETTER TTA
They tend to reserve contiguous rows of 4, 8, or 16 code points for the same “kind” of character. Yes, there are gaps, but it’s like a filesystem: once you allocate a sector (or block, if you don’t have separate sectors within a block) to one file, you don’t go giving away the unused bytes in its final sector to some other file, even if the file doesn’t fill it. Things tend to get padded to block boundaries anyway.
It’s not like we’re at any risk of running out of codes.
Here the allocated area starts with “Signs”, as shown by the first assigned code points in that block. A gap may mark a change from one kind of character to another. If you check the properties of the first six code points in the block, you see that the unassigned ones still have the right Block property:
$ uniprops -a U+00B80 U+00B81 U+00B82 U+00B83 U+00B84 U+00B85
U+0B80 ‹U+0B80› \N{U+0B80}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B81 ‹U+0B81› \N{U+0B81}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B82 ‹◌ஂ› \N{TAMIL SIGN ANUSVARA}
\w \pM \p{Mn}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil Case_Ignorable CI M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC
Mark Nonspacing_Mark Print Taml Word XID_Continue XIDC X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=Nonspacing_Mark BC=NSM Bidi_Class=NSM Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=EX
Grapheme_Cluster_Break=Extend GCB=EX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=EX Sentence_Break=Extend SB=EX Word_Break=Extend WB=Extend
U+0B83 ‹ஃ› \N{TAMIL SIGN VISARGA}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
Word_Break=LE
U+0B84 ‹U+0B84› \N{U+0B84}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B85 ‹அ› \N{TAMIL LETTER A}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
Word_Break=LE
If you look at other allocated blocks, you see the same sort of thing. It doesn’t make sense to slice up blocks into unrelated things.
As I said, it’s not as though they’re going to run out of space, so I don’t know what the concern is here.
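The gaps are easy to see without the Perl tools too. Here is a minimal Python sketch that lists the unassigned (General_Category Cn) code points in the Tamil block. The helper name `unassigned_in_range` is made up for illustration, and the result depends on the Unicode version bundled with your Python, which is probably newer than the 6.0 data shown above.

```python
import unicodedata

def unassigned_in_range(first, last):
    """Return the code points in [first, last] whose General_Category is Cn."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cn"]

# The Tamil block spans U+0B80..U+0BFF (128 code points).
gaps = unassigned_in_range(0x0B80, 0x0BFF)
print(len(gaps), "unassigned out of", 0x0BFF - 0x0B80 + 1)
print(" ".join("U+%04X" % cp for cp in gaps[:8]))
```

The unassigned code points still belong to the block; they simply have no character yet.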
BTW, you can get Unicode exploration and processing tools like unichars, uniprops, and uninames from my Unicode Command-Line Toolchest, either individually from there or as the entire suite through the CPAN Unicode::Tussle suite.
Related
In this link you will find the most used characters for each language. Why are some characters in some languages displayed under the \u{...} notation?
I think that what is in the brackets is the hexadecimal code of the character, but I can't understand why they would only do it with some characters.
The character sequences enclosed in curly brackets {} are digraphs (trigraphs, …) counted as a distinct letter in a given language (supposedly with its own place in the alphabet), for instance:
the digraph {ch} in cs (Czech);
the trigraph {dzs} in hu (Hungarian);
more complex examples in kkj (Kako), as the following Python snippet shows:
>>> kkj='[a á à â {a\u0327} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ\u0301} {ɛ\u0300} {ɛ\u0302} {ɛ\u0327} f g {gb} {gw} h i í ì î {i\u0327} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ\u0301} {ɔ\u0300} {ɔ\u0302} {ɔ\u0327} p r s t u ú ù û {u\u0327} v w y]'
>>> print( kkj)
[a á à â {a̧} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̧} f g {gb} {gw} h i í ì î {i̧} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̧} p r s t u ú ù û {u̧} v w y]
>>>
For instance, {a\u0327} renders as {a̧}, i.e. Latin Small Letter A followed by Combining Cedilla, a combination that has no precomposed Unicode character. A counterexample:
ņ (U+0146) Latin Small Letter N With Cedilla, with decomposition 006E 0327:
>>> import unicodedata
>>> print( 'ņ', unicodedata.normalize('NFC','{n\u0327}'))
ņ {ņ}
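To make the contrast explicit, here is a small sketch (the helper name `nfc_length` is only for illustration) that compares how many code points remain after NFC normalization. The base letter a plus U+0327 stays as two code points, while n plus U+0327 composes into the single code point U+0146.

```python
import unicodedata

def nfc_length(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

for base in "an":
    sequence = base + "\u0327"  # base letter + COMBINING CEDILLA
    print(base, len(sequence), "->", nfc_length(sequence))
```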
Edit:
Characters presented as unicode literals (\uxxxx = a character with 16-bit hex value xxxx) are unrenderable ones (or hard to render, at least). The following Python script shows some of them (Bidi_Class Values L-Left_To_Right, R-Right_To_Left, NSM-Nonspacing_Mark, BN-Boundary_Neutral):
# -*- coding: utf-8 -*-
import unicodedata
pa = 'ੱੰ਼੍ੁੂੇੈੋੌ'
pa = '\u0327 \u0A71 \u0A70 \u0A3C ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ ੴ ੳ ਉ ਊ ਓ ਅ ਆ ਐ ਔ ੲ ਇ ਈ ਏ ਸ {ਸ\u0A3C} ਹ ਕ ਖ {ਖ\u0A3C} ਗ {ਗ\u0A3C} ਘ ਙ ਚ ਛ ਜ {ਜ\u0A3C} ਝ ਞ ਟ ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ ਪ ਫ {ਫ\u0A3C} ਬ ਭ ਮ ਯ ਰ ਲ ਵ ੜ \u0A4D ਾ ਿ ੀ \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C'
pa = '\u0300 \u0301 \u0302 \u1DC6 \u1DC7 \u0A71 \u0A70 \u0A3C \u0A4D \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C \u05B7 \u05B8 \u05BF \u200C \u200D \u200E \u200F \u064B \u064C \u064E \u064F \u0650'
# above examples from ·kkj· ·bas· ·pa· ·yi· ·kn· ·ur· ·mzn·
print( pa )
for ch in pa:                      # 'ch' avoids shadowing the built-in chr()
    if ch != ' ':
        if ch == '{' or ch == '}':
            print(ch)
        else:
            # escape, glyph, General_Category, Bidi_Class, combining class, name
            print('\\u%04x' % ord(ch), ch,
                  unicodedata.category(ch),
                  unicodedata.bidirectional(ch) + '\t',
                  str(unicodedata.combining(ch)) + '\t',
                  unicodedata.name(ch, '?'))
Result: .\SO\63659122.py
̀ ́ ̂ ᷆ ᷇ ੱ ੰ ਼ ੍ ੁ ੂ ੇ ੈ ੋ ੌ ַ ָ ֿ ً ٌ َ ُ ِ
\u0300 ̀ Mn NSM 230 COMBINING GRAVE ACCENT
\u0301 ́ Mn NSM 230 COMBINING ACUTE ACCENT
\u0302 ̂ Mn NSM 230 COMBINING CIRCUMFLEX ACCENT
\u1dc6 ᷆ Mn NSM 230 COMBINING MACRON-GRAVE
\u1dc7 ᷇ Mn NSM 230 COMBINING ACUTE-MACRON
\u0a71 ੱ Mn NSM 0 GURMUKHI ADDAK
\u0a70 ੰ Mn NSM 0 GURMUKHI TIPPI
\u0a3c ਼ Mn NSM 7 GURMUKHI SIGN NUKTA
\u0a4d ੍ Mn NSM 9 GURMUKHI SIGN VIRAMA
\u0a41 ੁ Mn NSM 0 GURMUKHI VOWEL SIGN U
\u0a42 ੂ Mn NSM 0 GURMUKHI VOWEL SIGN UU
\u0a47 ੇ Mn NSM 0 GURMUKHI VOWEL SIGN EE
\u0a48 ੈ Mn NSM 0 GURMUKHI VOWEL SIGN AI
\u0a4b ੋ Mn NSM 0 GURMUKHI VOWEL SIGN OO
\u0a4c ੌ Mn NSM 0 GURMUKHI VOWEL SIGN AU
\u05b7 ַ Mn NSM 17 HEBREW POINT PATAH
\u05b8 ָ Mn NSM 18 HEBREW POINT QAMATS
\u05bf ֿ Mn NSM 23 HEBREW POINT RAFE
\u200c Cf BN 0 ZERO WIDTH NON-JOINER
\u200d Cf BN 0 ZERO WIDTH JOINER
\u200e Cf L 0 LEFT-TO-RIGHT MARK
\u200f Cf R 0 RIGHT-TO-LEFT MARK
\u064b ً Mn NSM 27 ARABIC FATHATAN
\u064c ٌ Mn NSM 28 ARABIC DAMMATAN
\u064e َ Mn NSM 30 ARABIC FATHA
\u064f ُ Mn NSM 31 ARABIC DAMMA
\u0650 ِ Mn NSM 32 ARABIC KASRA
It seems like all codepoints that don't have a well-defined stand-alone look (or are not meant to be used as stand-alone characters) are represented with this notation.
For example U+0A3C is present in the "character" {ਫ\u0A3C}. U+0A3C is a combining codepoint that modifies the one that is before it.
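As a rough illustration of that idea, the sketch below escapes every nonspacing mark (Mn) and format character (Cf) and leaves everything else alone. This is only my guess at a rule that reproduces the notation; I don't know which rule the locale data actually uses, and `escape_marks` is a made-up name.

```python
import unicodedata

def escape_marks(text):
    """Replace nonspacing marks (Mn) and format characters (Cf) with \\uXXXX escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Cf"):
            out.append("\\u%04X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

# U+0A2B is GURMUKHI LETTER PHA; U+0A3C is GURMUKHI SIGN NUKTA.
print(escape_marks("{a\u0327} {\u0A2B\u0A3C} \u200D"))
```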
I'm not really sure how to express it but I'm searching for unicode letters which are more than one visual latin letter.
I found this in Word so far:
Ǳ
ǲ
ǳ
Ǌ
ǈ
Ǉ
ǋ
ǌ
Any others?
Here are some of the characters I've found. I first did this manually by looking at some probable blocks, but I later wrote a Python script to do it automatically; you can find it at the end of this answer.
Digraphs
Two Glyphs | Digraph | Unicode Code Point | HTML
DZ, Dz, dz | Ǳ, ǲ, ǳ | U+01F1 U+01F2 U+01F3 | Ǳ ǲ ǳ
DŽ, Dž, dž | Ǆ, ǅ, ǆ | U+01C4 U+01C5 U+01C6 | Ǆ ǅ ǆ
IJ, ij | Ĳ, ĳ | U+0132 U+0133 | Ĳ ĳ
LJ, Lj, lj | Ǉ, ǈ, ǉ | U+01C7 U+01C8 U+01C9 | Ǉ ǈ ǉ
NJ, Nj, nj | Ǌ, ǋ, ǌ | U+01CA U+01CB U+01CC | Ǌ ǋ ǌ
Ligatures
Non-ligature | Ligature | Unicode | HTML
AA, aa | Ꜳ, ꜳ | U+A732, U+A733 | Ꜳ ꜳ
AE, ae | Æ, æ | U+00C6, U+00E6 | Æ æ
AO, ao | Ꜵ, ꜵ | U+A734, U+A735 | Ꜵ ꜵ
AU, au | Ꜷ, ꜷ | U+A736, U+A737 | Ꜷ ꜷ
AV, av | Ꜹ, ꜹ | U+A738, U+A739 | Ꜹ ꜹ
AV, av (with bar) | Ꜻ, ꜻ | U+A73A, U+A73B | Ꜻ ꜻ
AY, ay | Ꜽ, ꜽ | U+A73C, U+A73D | Ꜽ ꜽ
et | 🙰 | U+1F670 | 🙰
ff | ﬀ | U+FB00 | ﬀ
ffi | ﬃ | U+FB03 | ﬃ
ffl | ﬄ | U+FB04 | ﬄ
fi | ﬁ | U+FB01 | ﬁ
fl | ﬂ | U+FB02 | ﬂ
OE, oe | Œ, œ | U+0152, U+0153 | Œ œ
OO, oo | Ꝏ, ꝏ | U+A74E, U+A74F | Ꝏ ꝏ
ſs, ſz | ẞ, ß | U+1E9E, U+00DF | ß
st | ﬆ | U+FB06 | ﬆ
ſt | ﬅ | U+FB05 | ﬅ
TZ, tz | Ꜩ, ꜩ | U+A728, U+A729 | Ꜩ ꜩ
ue | ᵫ | U+1D6B | ᵫ
VY, vy | Ꝡ, ꝡ | U+A760, U+A761 | Ꝡ ꝡ
There are a few other ligatures that are used for phonetic transcription but look like Latin characters:
Non-ligature | Ligature | Unicode | HTML
db | ȸ | U+0238 | ȸ
dz | ʣ | U+02A3 | ʣ
IJ, ij | Ĳ, ĳ | U+0132, U+0133 | Ĳ ĳ
ls | ʪ | U+02AA | ʪ
lz | ʫ | U+02AB | ʫ
qp | ȹ | U+0239 | ȹ
ts | ʦ | U+02A6 | ʦ
ui | ꭐ | U+AB50 | ꭐ
turned ui | ꭑ | U+AB51 | ꭑ
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode#Digraphs_and_ligatures
Edit:
There are more letterlike symbols besides ℻ and ℡, like the ones the OP mentioned in the comments:
℀ ℁ ⅍ ℅ ℆ ℔ ℠ ™
Longer letters are mainly from the CJK Compatibility block
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+338x ㎀ ㎁ ㎂ ㎃ ㎄ ㎅ ㎆ ㎇ ㎈ ㎉ ㎊ ㎋ ㎌ ㎍ ㎎ ㎏
U+339x ㎐ ㎑ ㎒ ㎓ ㎔ ㎕ ㎖ ㎗ ㎘ ㎙ ㎚ ㎛ ㎜ ㎝ ㎞ ㎟
U+33Ax ㎠ ㎡ ㎢ ㎣ ㎤ ㎥ ㎦ ㎧ ㎨ ㎩ ㎪ ㎫ ㎬ ㎭ ㎮ ㎯
U+33Bx ㎰ ㎱ ㎲ ㎳ ㎴ ㎵ ㎶ ㎷ ㎸ ㎹ ㎺ ㎻ ㎼ ㎽ ㎾ ㎿
U+33Cx ㏀ ㏁ ㏂ ㏃ ㏄ ㏅ ㏆ ㏇ ㏈ ㏉ ㏊ ㏋ ㏌ ㏍ ㏎ ㏏
U+33Dx ㏐ ㏑ ㏒ ㏓ ㏔ ㏕ ㏖ ㏗ ㏘ ㏙ ㏚ ㏛ ㏜ ㏝ ㏞ ㏟
Among the 3-letter-like symbols are ㎈ ㎑ ㎒ ㎓ ㎔㏒ ㏕ ㏖ ㏙ ㎪ ㎫ ㎬ ㎭ ㏆ ㏿ ㍱... Probably the ones with most characters are ㎉ and ㎯
Unicode even has code points for Roman numerals. Another 4-letter-like character can be found here: Ⅷ
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+215x ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞ ⅟
U+216x Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
U+217x ⅰ ⅱ ⅲ ⅳ ⅴ ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ ⅻ ⅼ ⅽ ⅾ ⅿ
U+218x ↀ ↁ ↂ Ↄ ↄ ↅ ↆ ↇ ↈ ↉ ↊ ↋
If normal numbers can be considered then there are some other code points for multiple digits like ⒆ ⒇ ⓳ ⓴ in enclosed alphanumerics
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+246x ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ ⑪ ⑫ ⑬ ⑭ ⑮ ⑯
U+247x ⑰ ⑱ ⑲ ⑳ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾ ⑿
U+248x ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏
U+249x ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖ ⒗ ⒘ ⒙ ⒚ ⒛ ⒜ ⒝ ⒞ ⒟
U+24Ax ⒠ ⒡ ⒢ ⒣ ⒤ ⒥ ⒦ ⒧ ⒨ ⒩ ⒪ ⒫ ⒬ ⒭ ⒮ ⒯
U+24Bx ⒰ ⒱ ⒲ ⒳ ⒴ ⒵ Ⓐ Ⓑ Ⓒ Ⓓ Ⓔ Ⓕ Ⓖ Ⓗ Ⓘ Ⓙ
U+24Cx Ⓚ Ⓛ Ⓜ Ⓝ Ⓞ Ⓟ Ⓠ Ⓡ Ⓢ Ⓣ Ⓤ Ⓥ Ⓦ Ⓧ Ⓨ Ⓩ
U+24Dx ⓐ ⓑ ⓒ ⓓ ⓔ ⓕ ⓖ ⓗ ⓘ ⓙ ⓚ ⓛ ⓜ ⓝ ⓞ ⓟ
U+24Ex ⓠ ⓡ ⓢ ⓣ ⓤ ⓥ ⓦ ⓧ ⓨ ⓩ ⓪ ⓫ ⓬ ⓭ ⓮ ⓯
U+24Fx ⓰ ⓱ ⓲ ⓳ ⓴ ⓵ ⓶ ⓷ ⓸ ⓹ ⓺ ⓻ ⓼ ⓽ ⓾ ⓿
and in Enclosed Alphanumeric Supplement
🅫, 🅪, 🆋, 🆌, 🆍, 🄭, 🄮, 🅊, 🅋, 🅌, 🅍, 🅎, 🅏
A few more:
Currency symbol group
₧ ₨ ₶ ₯ ₠ ₢ ₷
Miscellaneous technical group
⎂ ⏨
Control pictures (probably you'll need to zoom out to see)
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+240x ␀ ␁ ␂ ␃ ␄ ␅ ␆ ␇ ␈ ␉ ␊ ␋ ␌ ␍ ␎ ␏
U+241x ␐ ␑ ␒ ␓ ␔ ␕ ␖ ␗ ␘ ␙ ␚ ␛ ␜ ␝ ␞ ␟
U+242x ␠ ␡ ␢ ␣ ␤ ␥ ␦
Alchemical Symbols
🜀 🜅 🜆 🜇 🜈 🝪 🝫 🝬 🝛 🝜 🝝
Musical Symbols
𝄶 𝄷 𝄸 𝄹 𝄉 𝄊 𝄫
And there are the emojis 🔟 💤🆔🚾🆖🆗🔢🔡🔠 💯🆘🆎🆑™🔙🔚🔜🔝🔛📆🗓🔞
Vertical bars may be considered uppercase i or lowercase L (like your 〷 example which is actually the TELEGRAPH LINE FEED SEPARATOR SYMBOL) and we have
Vai syllable see ꔖ 0xa516
Large triple vertical bar operator ⫼ 0x2afc
Counting rod tens digit three: 𝍫 0x1d36b
Suzhou numerals 〢 〣
Chinese river 川
║ BOX DRAWINGS DOUBLE VERTICAL...
Here's the automatic script to find the multi-character letters
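import unicodedata

# Scan every code point and keep those whose compatibility decomposition
# (NFKD) is two or more plain ASCII letters. str.isascii() needs Python 3.7+.
for c in range(0, 0x10FFFF + 1):
    d = unicodedata.normalize('NFKD', chr(c))
    if len(d) > 1 and d.isascii() and d.isalpha():
        print("U+%04X (%s): %s\n" % (c, chr(c), d))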
It won't be able to find many ligatures like æ or œ because they're not considered orthographic ligatures and aren't decomposable in Unicode. Here's the result in Unicode 11.0.0 (checked with unicodedata.unidata_version)
U+0132 (Ĳ): IJ
U+0133 (ĳ): ij
U+01C7 (Ǉ): LJ
U+01C8 (ǈ): Lj
U+01C9 (ǉ): lj
U+01CA (Ǌ): NJ
U+01CB (ǋ): Nj
U+01CC (ǌ): nj
U+01F1 (Ǳ): DZ
U+01F2 (ǲ): Dz
U+01F3 (ǳ): dz
U+20A8 (₨): Rs
U+2116 (№): No
U+2120 (℠): SM
U+2121 (℡): TEL
U+2122 (™): TM
U+213B (℻): FAX
U+2161 (Ⅱ): II
U+2162 (Ⅲ): III
U+2163 (Ⅳ): IV
U+2165 (Ⅵ): VI
U+2166 (Ⅶ): VII
U+2167 (Ⅷ): VIII
U+2168 (Ⅸ): IX
U+216A (Ⅺ): XI
U+216B (Ⅻ): XII
U+2171 (ⅱ): ii
U+2172 (ⅲ): iii
U+2173 (ⅳ): iv
U+2175 (ⅵ): vi
U+2176 (ⅶ): vii
U+2177 (ⅷ): viii
U+2178 (ⅸ): ix
U+217A (ⅺ): xi
U+217B (ⅻ): xii
U+3250 (㉐): PTE
U+32CC (㋌): Hg
U+32CD (㋍): erg
U+32CE (㋎): eV
U+32CF (㋏): LTD
U+3371 (㍱): hPa
U+3372 (㍲): da
U+3373 (㍳): AU
U+3374 (㍴): bar
U+3375 (㍵): oV
U+3376 (㍶): pc
U+3377 (㍷): dm
U+337A (㍺): IU
U+3380 (㎀): pA
U+3381 (㎁): nA
U+3383 (㎃): mA
U+3384 (㎄): kA
U+3385 (㎅): KB
U+3386 (㎆): MB
U+3387 (㎇): GB
U+3388 (㎈): cal
U+3389 (㎉): kcal
U+338A (㎊): pF
U+338B (㎋): nF
U+338E (㎎): mg
U+338F (㎏): kg
U+3390 (㎐): Hz
U+3391 (㎑): kHz
U+3392 (㎒): MHz
U+3393 (㎓): GHz
U+3394 (㎔): THz
U+3396 (㎖): ml
U+3397 (㎗): dl
U+3398 (㎘): kl
U+3399 (㎙): fm
U+339A (㎚): nm
U+339C (㎜): mm
U+339D (㎝): cm
U+339E (㎞): km
U+33A9 (㎩): Pa
U+33AA (㎪): kPa
U+33AB (㎫): MPa
U+33AC (㎬): GPa
U+33AD (㎭): rad
U+33B0 (㎰): ps
U+33B1 (㎱): ns
U+33B3 (㎳): ms
U+33B4 (㎴): pV
U+33B5 (㎵): nV
U+33B7 (㎷): mV
U+33B8 (㎸): kV
U+33B9 (㎹): MV
U+33BA (㎺): pW
U+33BB (㎻): nW
U+33BD (㎽): mW
U+33BE (㎾): kW
U+33BF (㎿): MW
U+33C3 (㏃): Bq
U+33C4 (㏄): cc
U+33C5 (㏅): cd
U+33C8 (㏈): dB
U+33C9 (㏉): Gy
U+33CA (㏊): ha
U+33CB (㏋): HP
U+33CC (㏌): in
U+33CD (㏍): KK
U+33CE (㏎): KM
U+33CF (㏏): kt
U+33D0 (㏐): lm
U+33D1 (㏑): ln
U+33D2 (㏒): log
U+33D3 (㏓): lx
U+33D4 (㏔): mb
U+33D5 (㏕): mil
U+33D6 (㏖): mol
U+33D7 (㏗): PH
U+33D9 (㏙): PPM
U+33DA (㏚): PR
U+33DB (㏛): sr
U+33DC (㏜): Sv
U+33DD (㏝): Wb
U+33FF (㏿): gal
U+FB00 (ﬀ): ff
U+FB01 (ﬁ): fi
U+FB02 (ﬂ): fl
U+FB03 (ﬃ): ffi
U+FB04 (ﬄ): ffl
U+FB05 (ﬅ): st
U+FB06 (ﬆ): st
U+1F12D (🄭): CD
U+1F12E (🄮): WZ
U+1F14A (🅊): HV
U+1F14B (🅋): MV
U+1F14C (🅌): SD
U+1F14D (🅍): SS
U+1F14E (🅎): PPV
U+1F14F (🅏): WC
U+1F16A (🅪): MC
U+1F16B (🅫): MD
U+1F190 (🆐): DJ
If anyone's used formtoemail.com for email before, I'm trying to accept other languages. For example, I just had the following message come in. How do I fix it?
comments: http://propohudenie.com/earn/423-ot-1500-2000-rub-v-den-na-prosmotre-videorolikov-bez-vlozheniy-i-prodazh.html>ÐºÑƒÑ€Ñ Ñƒ шиханова заработок в интернете
ÐžÑ€Ð³Ð°Ð½Ð¸Ð·Ð°Ñ†Ð¸Ñ â€œÐŸÑ€Ð¾Ð²ÐµÑ€ÐµÐ½Ð½Ñ‹Ðµ товары и ÑƒÑ Ð»ÑƒÐ³Ð¸â€ Ð¿Ñ€ÐµÐ´Ð¾Ñ Ñ‚Ð°Ð²Ð»Ñ ÐµÑ‚ Ð´Ð¾Ñ Ñ‚Ð¾Ð²ÐµÑ€Ð½ÑƒÑŽ и проверенную информацию о Ð²Ñ ÐµÐ²Ð¾Ð·Ð¼Ð¾Ð¶Ð½Ñ‹Ñ… товарах и Ñ ÐµÑ€Ð²Ð¸Ñ Ð°Ñ… Ð´Ð»Ñ Ð½Ð°Ñ ÐµÐ»ÐµÐ½Ð¸Ñ . Рашей задачей ÐµÑ Ñ‚ÑŒ проверка ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ð° Ñ ÐµÑ€Ð²Ð¸Ñ Ð° или товара, которые Ñ€ÐµÐ°Ð»Ð¸Ð·ÑƒÑŽÑ‚Ñ Ñ Ð² Ñ ÐµÑ‚Ð¸. ÐŸÐ¾Ñ Ð»Ðµ нашей Ð¸Ð½Ñ Ð¿ÐµÐºÑ†Ð¸Ð¸ на ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ð¾ товары и Ñ ÐµÑ€Ð²Ð¸Ñ Ñ‹ Ñ€Ð°Ð·Ð¼ÐµÑ‰Ð°ÑŽÑ‚Ñ Ñ Ð² каталоге проверенных товаров и ÑƒÑ Ð»ÑƒÐ³. Данный каталог поможет Ð»ÑŽÐ´Ñ Ð¼ подобрать необходимый товар или ÑƒÑ Ð»ÑƒÐ³Ñƒ, не Ñ Ð¾Ð¼Ð½ÐµÐ²Ð°Ñ Ñ ÑŒ в их ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ðµ.
http://propohudenie.com/earn/423-ot-1500-2000-rub-v-den-na-prosmotre-videorolikov-bez-vlozheniy-i-prodazh.html>от 1500 рублей в день на Ð¿Ñ€Ð¾Ñ Ð¼Ð¾Ñ‚Ñ€Ð°Ñ… видеороликов (Ñ ÐµÑ€Ð³ÐµÐ¹ шиханов)
ÐžÑ Ð½Ð¾Ð²Ð½Ñ‹Ð¼Ð¸ Ð½Ð°Ð¿Ñ€Ð°Ð²Ð»ÐµÐ½Ð¸Ñ Ð¼Ð¸ нашей Ð´ÐµÑ Ñ‚ÐµÐ»ÑŒÐ½Ð¾Ñ Ñ‚Ð¸ ÐµÑ Ñ‚ÑŒ товары и ÑƒÑ Ð»ÑƒÐ³Ð¸, Ñ Ð²Ñ Ð·Ð°Ð½Ð½Ñ‹Ðµ Ñ Ð¿Ð¾Ñ…ÑƒÐ´ÐµÐ½Ð¸ÐµÐ¼ и заработком в Интернете. Ðти товары и Ñ ÐµÑ€Ð²Ð¸Ñ Ñ‹ чаще Ð²Ñ ÐµÐ³Ð¾ Ð¿Ð¾Ð´Ð²ÐµÑ€Ð³Ð°ÑŽÑ‚Ñ Ñ Ñ„Ð°Ð»ÑŒÑ Ð¸Ñ„Ð¸ÐºÐ°Ñ†Ð¸Ð¸, Ð²Ñ Ð»ÐµÐ´Ñ Ñ‚Ð²Ð¸Ðµ их широкой Ð²Ð¾Ñ Ñ‚Ñ€ÐµÐ±Ð¾Ð²Ð°Ð½Ð½Ð¾Ñ Ñ‚Ð¸.
http://propohudenie.com/earn/431-ot-1300-v-mesyac-na-kulinarnyh-receptah.html>Ñ‚Ð°Ñ‚ÑŒÑ Ð½Ñ‹ зориной заработок на кулинарных интернет Ð¸Ð·Ð´Ð°Ð½Ð¸Ñ Ñ…
Мы Ð¾Ñ ÑƒÑ‰ÐµÑ Ñ‚Ð²Ð»Ñ ÐµÐ¼ квалифицированную проверку на ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ð¾ данных ÑƒÑ Ð»ÑƒÐ³ и товаров, Ð¿Ð¾Ñ Ð»Ðµ чего Ð¿Ñ€ÐµÐ´Ð¾Ñ Ñ‚Ð°Ð²Ð»Ñ ÐµÐ¼ Ð¿Ð¾Ð»ÑŒÐ·Ð¾Ð²Ð°Ñ‚ÐµÐ»Ñ Ð¼ информацию о результатах проверки. Мы также Ñ Ð¼Ð¾Ð¶ÐµÐ¼ Ð¿Ñ€Ð¸Ð½Ñ Ñ‚ÑŒ жалобы от пользователей на какую-нибудь ÑƒÑ Ð»ÑƒÐ³Ñƒ либо товар, предлагаемые в Интернете, на Ð½ÐµÐ´Ð¾Ð±Ñ€Ð¾Ñ Ð¾Ð²ÐµÑ Ñ‚Ð½Ð¾Ðµ отношение фирм-реализаторов. Раш каталог имеет также информацию о ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ðµ других товаров и ÑƒÑ Ð»ÑƒÐ³ – которые не Ð¾Ñ‚Ð½Ð¾Ñ Ñ Ñ‚Ñ Ñ Ðº ÐºÐ°Ñ‚ÐµÐ³Ð¾Ñ€Ð¸Ñ Ð¼ заработка в Ñ ÐµÑ‚Ð¸ и похудению, мы Ð¿Ñ€Ð¾Ð²ÐµÑ€Ñ ÐµÐ¼ ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ð¾ любых ÑƒÑ Ð»ÑƒÐ³ и товаров.
submit: Send
You have to configure the form at the formtoemail.com control panel to accept content in the same encoding you're using to serve your HTML page containing the <form>.
UTF-8 is the sensible choice for this as it allows all characters, but unfortunately the default is ISO-8859-1 and the free version of the service doesn't let you change it.
It's possible to rescue the text above by encoding it to ISO-8859-1 bytes and then decoding those bytes to text using UTF-8. But that's a right pain to do for every mail that comes in. For what it's worth, the above appears to have come from a spambot anyway, so you're not missing much.
I would look for a different form sending service.
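If you do want to rescue the occasional message, the encode-then-decode round trip described above might look like this in Python. It is a sketch under assumptions: `fix_mojibake` is a made-up helper, and it simulates the damage on a short sample string rather than using the mail above.

```python
def fix_mojibake(text, wrong_encoding="latin-1"):
    """Undo UTF-8 bytes that were mistakenly decoded as `wrong_encoding`."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way; leave the text untouched.
        return text

original = "заработок в интернете"
garbled = original.encode("utf-8").decode("latin-1")  # simulate the damage
print(fix_mojibake(garbled) == original)
```

The message above contains characters such as ƒ and €, which suggests it actually passed through Windows-1252 rather than strict ISO-8859-1. Passing `wrong_encoding="cp1252"` may work better for it, but any bytes that were dropped along the way cannot be recovered.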
I have a client who is using Pligg CMS for social bookmarking. He is getting a lot of manual spam entries in multiple languages. Does Pligg have any setting or plugin that lets us allow only English entries? Suggestions for a good alternative to Pligg would also help.
Thanks in advance.
I would use the spam trigger module (included with Pligg) and add the following to one of the Trigger configuration fields from the settings page.
À
Á
Â
Ã
Ä
Å
Æ
Ā
Ą
Ă
Ç
Ć
Č
Ĉ
Ċ
Ď
Đ
È
É
Ê
Ë
Ē
Ę
Ě
Ĕ
Ė
Ĝ
Ğ
Ġ
Ģ
Ĥ
Ħ
Ì
Í
Î
Ï
Ī
Ĩ
Ĭ
Į
İ
Ĳ
Ĵ
Ķ
Ľ
Ĺ
Ļ
Ŀ
Ł
Ñ
Ń
Ň
Ņ
Ŋ
Ò
Ó
Ô
Õ
Ö
Ø
Ō
Ő
Ŏ
Œ
Ŕ
Ř
Ŗ
Ś
Ş
Ŝ
Ș
Š
Ť
Ţ
Ŧ
Ț
Ù
Ú
Û
Ü
Ū
Ů
Ű
Ŭ
Ũ
Ų
Ŵ
Ŷ
Ÿ
Ý
Ź
Ż
Ž
à
á
â
ã
ä
ā
ą
ă
å
æ
ç
ć
č
ĉ
ċ
ď
đ
è
é
ê
ë
ē
ę
ě
ĕ
ė
ƒ
ĝ
ğ
ġ
ģ
ĥ
ħ
ì
í
î
ï
ī
ĩ
ĭ
į
ı
ĳ
ĵ
ķ
ĸ
ł
ľ
ĺ
ļ
ŀ
ñ
ń
ň
ņ
ŉ
ŋ
ò
ó
ô
õ
ö
ø
ō
ő
ŏ
œ
ŕ
ř
ŗ
ś
š
ş
ť
ţ
ù
ú
û
ü
ū
ů
ű
ŭ
ũ
ų
ụ
ŵ
ÿ
ý
ŷ
ż
ź
ž
ß
ſ
Α
Ά
Ἀ
Ἁ
Ἂ
Ἃ
Ἄ
Ἅ
Ἆ
Ἇ
ᾈ
ᾉ
ᾊ
ᾋ
ᾌ
ᾍ
ᾎ
ᾏ
Ᾰ
Ᾱ
Ὰ
Ά
ᾼ
Β
Γ
Δ
Ε
Έ
Ἐ
Ἑ
Ἒ
Ἓ
Ἔ
Ἕ
Έ
Ὲ
Ζ
Η
Ή
Ἠ
Ἡ
Ἢ
Ἣ
Ἤ
Ἥ
Ἦ
Ἧ
ᾘ
ᾙ
ᾚ
ᾛ
ᾜ
ᾝ
ᾞ
ᾟ
Ὴ
Ή
ῌ
Θ
Ι
Ί
Ϊ
Ἰ
Ἱ
Ἲ
Ἳ
Ἴ
Ἵ
Ἶ
Ἷ
Ῐ
Ῑ
Ὶ
Ί
Κ
Λ
Μ
Ν
Ξ
Ο
Ό
Ὀ
Ὁ
Ὂ
Ὃ
Ὄ
Ὅ
Ὸ
Ό
Π
Ρ
Ῥ
Σ
Τ
Υ
Ύ
Ϋ
Ὑ
Ὓ
Ὕ
Ὗ
Ῠ
Ῡ
Ὺ
Ύ
Φ
Χ
Ψ
Ω
Ώ
Ὠ
Ὡ
Ὢ
Ὣ
Ὤ
Ὥ
Ὦ
Ὧ
ᾨ
ᾩ
ᾪ
ᾫ
ᾬ
ᾭ
ᾮ
ᾯ
Ὼ
Ώ
ῼ
α
ά
ἀ
ἁ
ἂ
ἃ
ἄ
ἅ
ἆ
ἇ
ᾀ
ᾁ
ᾂ
ᾃ
ᾄ
ᾅ
ᾆ
ᾇ
ὰ
ά
ᾰ
ᾱ
ᾲ
ᾳ
ᾴ
ᾶ
ᾷ
β
γ
δ
ε
έ
ἐ
ἑ
ἒ
ἓ
ἔ
ἕ
ὲ
έ
ζ
η
ή
ἠ
ἡ
ἢ
ἣ
ἤ
ἥ
ἦ
ἧ
ᾐ
ᾑ
ᾒ
ᾓ
ᾔ
ᾕ
ᾖ
ᾗ
ὴ
ή
ῂ
ῃ
ῄ
ῆ
ῇ
θ
ι
ί
ϊ
ΐ
ἰ
ἱ
ἲ
ἳ
ἴ
ἵ
ἶ
ἷ
ὶ
ί
ῐ
ῑ
ῒ
ΐ
ῖ
ῗ
κ
λ
μ
ν
ξ
ο
ό
ὀ
ὁ
ὂ
ὃ
ὄ
ὅ
ὸ
ό
π
ρ
ῤ
ῥ
σ
ς
τ
υ
ύ
ϋ
ΰ
ὐ
ὑ
ὒ
ὓ
ὔ
ὕ
ὖ
ὗ
ὺ
ύ
ῠ
ῡ
ῢ
ΰ
ῦ
ῧ
φ
χ
ψ
ω
ώ
ὠ
ὡ
ὢ
ὣ
ὤ
ὥ
ὦ
ὧ
ᾠ
ᾡ
ᾢ
ᾣ
ᾤ
ᾥ
ᾦ
ᾧ
ὼ
ώ
ῲ
ῳ
ῴ
ῶ
ῷ
¨
΅
᾿
῾
῍
῝
῎
῞
῏
῟
῀
῁
΄
΅
`
῭
ͺ
᾽
А
Б
В
Г
Д
Е
Ё
Ж
З
И
Й
К
Л
М
Н
О
П
Р
С
Т
У
Ф
Х
Ц
Ч
Ш
Щ
Ы
Э
Ю
Я
а
б
в
г
д
е
ё
ж
з
и
й
к
л
м
н
о
п
р
с
т
у
ф
х
ц
ч
ш
щ
ы
э
ю
я
Ъ
ъ
Ь
ь
ð
Ð
þ
Þ
Ề
ề
Ể
ể
Ễ
ễ
Ế
ế
Ệ
ệ
Ộ
ộ
Ơ
ơ
Ư
ư
ờ
ა
ბ
გ
დ
ე
ვ
ზ
თ
ი
კ
ლ
მ
ნ
ო
პ
ჟ
რ
ს
ტ
უ
ფ
ქ
ღ
ყ
შ
ჩ
ც
ძ
წ
ჭ
ხ
ჯ
ჰ
ב
ג
ד
ה
ו
ז
ח
ט
י
כ
ל
מ
נ
ס
פ
צ
ק
ר
ש
ת
ա
բ
գ
դ
ե
զ
է
ը
թ
ժ
ի
լ
խ
ծ
կ
հ
ձ
ղ
ճ
մ
յ
ն
շ
ո
չ
պ
ջ
ռ
ս
վ
տ
ր
ց
ւ
փ
ք
օ
ֆ
և
This list is made up of many non-English characters, so it will flag any posts that use these letters. I pulled this list from Pligg's /languages/translit.txt file, which is used to transcribe these letters into more common English ones for use in URLs. It's not complete, for example it does not have any Asian language characters.
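If you end up filtering outside Pligg, the same idea can be expressed without a fixed list. The sketch below is not a Pligg API, just a hypothetical helper that flags any text containing a letter outside ASCII.

```python
def has_non_ascii_letter(text):
    """True if the text contains an alphabetic character outside ASCII."""
    return any(ch.isalpha() and ord(ch) > 127 for ch in text)

for sample in ("Hello, world! 123", "заработок в интернете", "café"):
    print(repr(sample), has_non_ascii_letter(sample))
```

Like the trigger list, this also flags legitimate English text that happens to contain accented words such as "café", so treat a hit as a reason to review a post rather than to delete it outright.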
Does anyone know any reference showing the number of characters in each Unicode block? (in newer version such as 5.x.x or 6.0.0)
Thanks a lot.
http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt contains the data you are interested in.
http://www.unicode.org/Public/6.0.0/ucd/ReadMe.txt contains some instructions and refers to http://unicode.org/reports/tr44/ for interpreting the data. In that document you should read http://unicode.org/reports/tr44/#UnicodeData.txt.
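As a rough sketch of how to turn that data into per-block counts, the Python below parses lines in the `first..last; name` format used by the UCD file Blocks.txt and counts the code points that are not General_Category Cn. Two assumptions to note: the three sample lines are typed in by hand rather than downloaded, and the counts come from the Unicode version bundled with your Python, not necessarily 6.0.0.

```python
import unicodedata

# A few lines in Blocks.txt format; the real file lists every block.
SAMPLE_BLOCKS = """\
0000..007F; Basic Latin
0B80..0BFF; Tamil
0C80..0CFF; Kannada
"""

def parse_blocks(text):
    """Parse 'first..last; name' lines into (name, first, last) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, name = line.split(";", 1)
        first, last = span.split("..")
        blocks.append((name.strip(), int(first, 16), int(last, 16)))
    return blocks

def count_assigned(first, last):
    """Count code points in [first, last] that are not unassigned (Cn)."""
    return sum(1 for cp in range(first, last + 1)
               if unicodedata.category(chr(cp)) != "Cn")

for name, first, last in parse_blocks(SAMPLE_BLOCKS):
    print("%-12s %4d of %4d" % (name, count_assigned(first, last), last - first + 1))
```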
unichars
Does this answer your question:
% unichars '\p{InCyrillic}' | wc -l
256
% unichars '\p{InEthiopic}' | wc -l
356
% unichars '\p{InLatin1}' | wc -l
128
% unichars '\p{InCombiningDiacriticalMarks}' | wc -l
112
To include the 16 astral planes, add -a:
% unichars -a '\p{InAncientGreekNumbers}' | wc -l
75
If you want unassigned or Han or Hangul, you need -u:
% unichars -u '\p{InEthiopic}' | wc -l
384
% unichars -u '\p{InCJKUnifiedIdeographsExtensionA}' | wc -l
6592
You can get other information, too:
% unichars '\P{IsGreek}' '\p{InGreek}'
ʹ 884 0374 GREEK NUMERAL SIGN
; 894 037E GREEK QUESTION MARK
΅ 901 0385 GREEK DIALYTIKA TONOS
· 903 0387 GREEK ANO TELEIA
Ϣ 994 03E2 COPTIC CAPITAL LETTER SHEI
ϣ 995 03E3 COPTIC SMALL LETTER SHEI
Ϥ 996 03E4 COPTIC CAPITAL LETTER FEI
ϥ 997 03E5 COPTIC SMALL LETTER FEI
Ϧ 998 03E6 COPTIC CAPITAL LETTER KHEI
ϧ 999 03E7 COPTIC SMALL LETTER KHEI
Ϩ 1000 03E8 COPTIC CAPITAL LETTER HORI
ϩ 1001 03E9 COPTIC SMALL LETTER HORI
Ϫ 1002 03EA COPTIC CAPITAL LETTER GANGIA
ϫ 1003 03EB COPTIC SMALL LETTER GANGIA
Ϭ 1004 03EC COPTIC CAPITAL LETTER SHIMA
ϭ 1005 03ED COPTIC SMALL LETTER SHIMA
Ϯ 1006 03EE COPTIC CAPITAL LETTER DEI
ϯ 1007 03EF COPTIC SMALL LETTER DEI
% unichars '\p{IsGreek}' '\P{InGreek}' | wc -l
250
% unichars '\P{IsGreek}' '\p{InGreek}' | wc -l
18
% unichars '\p{In=1.1}' | wc -l
6362
% unichars '\p{In=6.0}' | wc -l
15087
uniprops
Here’s uniprops:
% uniprops -l | grep -c 'Block='
84
% uniprops digamma 450 %
U+03DC ‹Ϝ› \N{ GREEK LETTER DIGAMMA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC Changes_When_Casefolded CWCF
Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base
Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print
Upper Uppercase Word XID_Continue XIDC XID_Start XIDS XPosixAlnum XPosixAlpha XPosixGraph XPosixPrint XPosixUpper
XPosixWord
U+0450 ‹ѐ› \N{ CYRILLIC SMALL LETTER IE WITH GRAVE }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InCyrillic Cyrillic Is_Cyrillic Cased Cased_Letter LC Changes_When_Casemapped
CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Cyrl Ll L Gr_Base Grapheme_Base Graph GrBase
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
XPosixAlnum XPosixAlpha XPosixGraph XPosixLower XPosixPrint XPosixWord
U+0025 ‹%› \N{ PERCENT SIGN }:
\pP \p{Po}
All Any ASCII Assigned Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct Print Punctuation XPosixGraph XPosixPrint XPosixPunct
Or even all these:
% uniprops -vag 777
U+0777 ‹ݷ› \N{ ARABIC LETTER FARSI YEH WITH EXTENDED ARABIC-INDIC DIGIT FOUR BELOW }:
\w \pL \p{L_} \p{Lo}
\p{All} \p{Any} \p{Alnum} \p{Alpha} \p{Alphabetic} \p{Arab} \p{Arabic} \p{Assigned} \p{Is_Arabic} \p{InArabicSupplement} \p{L} \p{Lo} \p{Gr_Base} \p{Grapheme_Base} \p{Graph}
\p{GrBase} \p{ID_Continue} \p{IDC} \p{ID_Start} \p{IDS} \p{Letter} \p{L_} \p{Other_Letter} \p{Print} \p{Word} \p{XID_Continue} \p{XIDC} \p{XID_Start} \p{XIDS} \p{XPosixAlnum}
\p{XPosixAlpha} \p{XPosixGraph} \p{XPosixPrint} \p{XPosixWord}
\p{Age:5.1} \p{Script=Arabic} \p{Bidi_Class:AL} \p{Bidi_Class=Arabic_Letter} \p{Bidi_Class:Arabic_Letter} \p{Bc=AL} \p{Block:Arabic_Supplement} \p{Canonical_Combining_Class:0}
\p{Canonical_Combining_Class=Not_Reordered} \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR} \p{Canonical_Combining_Class:NR} \p{Decomposition_Type:None} \p{Dt=None}
\p{East_Asian_Width=Neutral} \p{East_Asian_Width:Neutral} \p{General_Category:L} \p{General_Category=Letter} \p{General_Category:Letter} \p{Gc=L} \p{General_Category:Lo}
\p{General_Category=Other_Letter} \p{General_Category:Other_Letter} \p{Gc=Lo} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX} \p{Grapheme_Cluster_Break:XX}
\p{Grapheme_Cluster_Break=Other} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:Yeh}
\p{Jg=Yeh} \p{Joining_Type:D} \p{Joining_Type=Dual_Joining} \p{Joining_Type:Dual_Joining} \p{Jt=D} \p{Line_Break:AL} \p{Line_Break=Alphabetic} \p{Line_Break:Alphabetic}
\p{Lb=AL} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2} \p{Present_In:6.0} \p{In=6.0}
\p{Script:Arab} \p{Script:Arabic} \p{Sc=Arab} \p{Sentence_Break:LE} \p{Sentence_Break=OLetter} \p{Sentence_Break:OLetter} \p{SB=LE} \p{Word_Break:ALetter} \p{WB=LE}
\p{Word_Break:LE} \p{Word_Break=ALetter}
My uniprops and unichars should run anywhere running Perl version 5.10 or better. There’s also a uninames script that goes with them.
There's a list available here, although it does not specify which version of the standard it applies to: