How to allow only English submissions on Pligg CMS - content-management-system

I have a client who uses Pligg CMS for social bookmarking. He is getting a lot of manual spam entries in multiple languages. Does Pligg have any setting or plugin that allows only English entries? A suggestion for a good alternative to Pligg would also help.
Thanks in advance.

I would use the spam trigger module (included with Pligg) and add the following to one of the Trigger configuration fields from the settings page.
À
Á
Â
Ã
Ä
Å
Æ
Ā
Ą
Ă
Ç
Ć
Č
Ĉ
Ċ
Ď
Đ
È
É
Ê
Ë
Ē
Ę
Ě
Ĕ
Ė
Ĝ
Ğ
Ġ
Ģ
Ĥ
Ħ
Ì
Í
Î
Ï
Ī
Ĩ
Ĭ
Į
İ
IJ
Ĵ
Ķ
Ľ
Ĺ
Ļ
Ŀ
Ł
Ñ
Ń
Ň
Ņ
Ŋ
Ò
Ó
Ô
Õ
Ö
Ø
Ō
Ő
Ŏ
Œ
Ŕ
Ř
Ŗ
Ś
Ş
Ŝ
Ș
Š
Ť
Ţ
Ŧ
Ț
Ù
Ú
Û
Ü
Ū
Ů
Ű
Ŭ
Ũ
Ų
Ŵ
Ŷ
Ÿ
Ý
Ź
Ż
Ž
à
á
â
ã
ä
ā
ą
ă
å
æ
ç
ć
č
ĉ
ċ
ď
đ
è
é
ê
ë
ē
ę
ě
ĕ
ė
ƒ
ĝ
ğ
ġ
ģ
ĥ
ħ
ì
í
î
ï
ī
ĩ
ĭ
į
ı
ij
ĵ
ķ
ĸ
ł
ľ
ĺ
ļ
ŀ
ñ
ń
ň
ņ
ʼn
ŋ
ò
ó
ô
õ
ö
ø
ō
ő
ŏ
œ
ŕ
ř
ŗ
ś
š
ş
ť
ţ
ù
ú
û
ü
ū
ů
ű
ŭ
ũ
ų
ụ
ŵ
ÿ
ý
ŷ
ż
ź
ž
ß
ſ
Α
Ά
Ἀ
Ἁ
Ἂ
Ἃ
Ἄ
Ἅ
Ἆ
Ἇ
ᾈ
ᾉ
ᾊ
ᾋ
ᾌ
ᾍ
ᾎ
ᾏ
Ᾰ
Ᾱ
Ὰ
Ά
ᾼ
Β
Γ
Δ
Ε
Έ
Ἐ
Ἑ
Ἒ
Ἓ
Ἔ
Ἕ
Έ
Ὲ
Ζ
Η
Ή
Ἠ
Ἡ
Ἢ
Ἣ
Ἤ
Ἥ
Ἦ
Ἧ
ᾘ
ᾙ
ᾚ
ᾛ
ᾜ
ᾝ
ᾞ
ᾟ
Ὴ
Ή
ῌ
Θ
Ι
Ί
Ϊ
Ἰ
Ἱ
Ἲ
Ἳ
Ἴ
Ἵ
Ἶ
Ἷ
Ῐ
Ῑ
Ὶ
Ί
Κ
Λ
Μ
Ν
Ξ
Ο
Ό
Ὀ
Ὁ
Ὂ
Ὃ
Ὄ
Ὅ
Ὸ
Ό
Π
Ρ
Ῥ
Σ
Τ
Υ
Ύ
Ϋ
Ὑ
Ὓ
Ὕ
Ὗ
Ῠ
Ῡ
Ὺ
Ύ
Φ
Χ
Ψ
Ω
Ώ
Ὠ
Ὡ
Ὢ
Ὣ
Ὤ
Ὥ
Ὦ
Ὧ
ᾨ
ᾩ
ᾪ
ᾫ
ᾬ
ᾭ
ᾮ
ᾯ
Ὼ
Ώ
ῼ
α
ά
ἀ
ἁ
ἂ
ἃ
ἄ
ἅ
ἆ
ἇ
ᾀ
ᾁ
ᾂ
ᾃ
ᾄ
ᾅ
ᾆ
ᾇ
ὰ
ά
ᾰ
ᾱ
ᾲ
ᾳ
ᾴ
ᾶ
ᾷ
β
γ
δ
ε
έ
ἐ
ἑ
ἒ
ἓ
ἔ
ἕ
ὲ
έ
ζ
η
ή
ἠ
ἡ
ἢ
ἣ
ἤ
ἥ
ἦ
ἧ
ᾐ
ᾑ
ᾒ
ᾓ
ᾔ
ᾕ
ᾖ
ᾗ
ὴ
ή
ῂ
ῃ
ῄ
ῆ
ῇ
θ
ι
ί
ϊ
ΐ
ἰ
ἱ
ἲ
ἳ
ἴ
ἵ
ἶ
ἷ
ὶ
ί
ῐ
ῑ
ῒ
ΐ
ῖ
ῗ
κ
λ
μ
ν
ξ
ο
ό
ὀ
ὁ
ὂ
ὃ
ὄ
ὅ
ὸ
ό
π
ρ
ῤ
ῥ
σ
ς
τ
υ
ύ
ϋ
ΰ
ὐ
ὑ
ὒ
ὓ
ὔ
ὕ
ὖ
ὗ
ὺ
ύ
ῠ
ῡ
ῢ
ΰ
ῦ
ῧ
φ
χ
ψ
ω
ώ
ὠ
ὡ
ὢ
ὣ
ὤ
ὥ
ὦ
ὧ
ᾠ
ᾡ
ᾢ
ᾣ
ᾤ
ᾥ
ᾦ
ᾧ
ὼ
ώ
ῲ
ῳ
ῴ
ῶ
ῷ
¨
΅
᾿
῾
῍
῝
῎
῞
῏
῟
῀
῁
΄
΅
`
῭
ͺ
᾽
А
Б
В
Г
Д
Е
Ё
Ж
З
И
Й
К
Л
М
Н
О
П
Р
С
Т
У
Ф
Х
Ц
Ч
Ш
Щ
Ы
Э
Ю
Я
а
б
в
г
д
е
ё
ж
з
и
й
к
л
м
н
о
п
р
с
т
у
ф
х
ц
ч
ш
щ
ы
э
ю
я
Ъ
ъ
Ь
ь
ð
Ð
þ
Þ
Ề
ề
Ể
ể
Ễ
ễ
Ế
ế
Ệ
ệ
Ộ
ộ
Ơ
ơ
Ư
ư
ờ
ა
ბ
გ
დ
ე
ვ
ზ
თ
ი
კ
ლ
მ
ნ
ო
პ
ჟ
რ
ს
ტ
უ
ფ
ქ
ღ
ყ
შ
ჩ
ც
ძ
წ
ჭ
ხ
ჯ
ჰ
ב
ג
ד
ה
ו
ז
ח
ט
י
כ
ל
מ
נ
ס
פ
צ
ק
ר
ש
ת
ա
բ
գ
դ
ե
զ
է
ը
թ
ժ
ի
լ
խ
ծ
կ
հ
ձ
ղ
ճ
մ
յ
ն
շ
ո
չ
պ
ջ
ռ
ս
վ
տ
ր
ց
ւ
փ
ք
օ
ֆ
և
This list is made up of many non-English characters, so it will flag any posts that use these letters. I pulled this list from Pligg's /languages/translit.txt file, which is used to transcribe these letters into more common English ones for use in URLs. It's not complete; for example, it does not include any Asian-language characters.
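The same idea can be turned around: instead of listing every foreign letter, flag any letter that is not plain ASCII. The following is only a sketch of that check in Python, not Pligg code (Pligg itself is PHP), and the function name `non_english_letters` is made up for illustration. Note that it would also flag accented words that occur in English text, such as "café".

```python
import unicodedata

def non_english_letters(text):
    """Return the letters in text that are not plain ASCII letters.

    Hypothetical helper for illustration; it is not part of Pligg.
    """
    flagged = []
    for ch in text:
        # General categories starting with 'L' are letters in any script.
        if unicodedata.category(ch).startswith('L') and not ch.isascii():
            flagged.append(ch)
    return flagged

# A submission could be treated as suspicious when the list is non-empty.
print(non_english_letters('Hello world'))
print(non_english_letters('caf\u00e9'))
```

Digits, punctuation and whitespace are ignored, so URLs and ordinary English punctuation pass through untouched.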

Related

What does the \u{...} notation mean in UNICODE and why are only some characters displayed like this in the CLDR project?

In this link you will find the most used characters for each language. Why are some characters in some languages displayed under the \u{...} notation?
I think that what is in the brackets is the hexadecimal code of the character, but I can't understand why they would only do it with some characters.
The character sequences enclosed in curly brackets {} are digraphs (trigraphs, …) counted as a distinct letter in given language (supposedly with its own place in the alphabet), for instance
digraph {ch} in cs (Czech language);
trigraph {dzs} in hu (Hungarian alphabet);
more complex digraph examples in kkj (Kako language), as the following Python snippet shows:
>>> kkj='[a á à â {a\u0327} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ\u0301} {ɛ\u0300} {ɛ\u0302} {ɛ\u0327} f g {gb} {gw} h i í ì î {i\u0327} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ\u0301} {ɔ\u0300} {ɔ\u0302} {ɔ\u0327} p r s t u ú ù û {u\u0327} v w y]'
>>> print( kkj)
[a á à â {a̧} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̧} f g {gb} {gw} h i í ì î {i̧} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̧} p r s t u ú ù û {u̧} v w y]
>>>
For instance, {a\u0327} renders as {a̧}, i.e. something like Latin Small Letter A with Combining Cedilla, which has no precomposed Unicode equivalent. A counterexample:
ņ (U+0146) Latin Small Letter N With Cedilla with decomposition 004E 0327:
>>> import unicodedata
>>> print( 'ņ', unicodedata.normalize('NFC','{n\u0327}'))
ņ {ņ}
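The contrast between the two cases can be checked directly. This is a small sketch using only `unicodedata.normalize`; the helper name `has_precomposed` is an invention for this example.

```python
import unicodedata

def has_precomposed(base, mark):
    """True if base + combining mark composes to a single code point under NFC."""
    return len(unicodedata.normalize('NFC', base + mark)) == 1

CEDILLA = '\u0327'  # COMBINING CEDILLA

print(has_precomposed('n', CEDILLA))  # n + cedilla composes to U+0146
print(has_precomposed('a', CEDILLA))  # a + cedilla stays as two code points
```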
Edit:
Characters presented as unicode literals (\uxxxx = a character with 16-bit hex value xxxx) are unrenderable ones (or hard to render, at least). The following Python script shows some of them (Bidi_Class Values L-Left_To_Right, R-Right_To_Left, NSM-Nonspacing_Mark, BN-Boundary_Neutral):
# -*- coding: utf-8 -*-
import unicodedata
pa = 'ੱੰ਼੍ੁੂੇੈੋੌ'
pa = '\u0327 \u0A71 \u0A70 \u0A3C ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ ੴ ੳ ਉ ਊ ਓ ਅ ਆ ਐ ਔ ੲ ਇ ਈ ਏ ਸ {ਸ\u0A3C} ਹ ਕ ਖ {ਖ\u0A3C} ਗ {ਗ\u0A3C} ਘ ਙ ਚ ਛ ਜ {ਜ\u0A3C} ਝ ਞ ਟ ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ ਪ ਫ {ਫ\u0A3C} ਬ ਭ ਮ ਯ ਰ ਲ ਵ ੜ \u0A4D ਾ ਿ ੀ \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C'
pa = '\u0300 \u0301 \u0302 \u1DC6 \u1DC7 \u0A71 \u0A70 \u0A3C \u0A4D \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C \u05B7 \u05B8 \u05BF \u200C \u200D \u200E \u200F \u064B \u064C \u064E \u064F \u0650'
# above examples from ·kkj· ·bas· ·pa· ·yi· ·kn· ·ur· ·mzn·
print( pa )
for ch in pa:                      # 'ch' avoids shadowing the built-in chr()
    if ch != ' ':
        if ch == '{' or ch == '}':
            print( ch )
        else:
            print( '\\u%04x' % ord(ch), ch,
                   unicodedata.category(ch),
                   unicodedata.bidirectional(ch) + '\t',
                   str( unicodedata.combining(ch)) + '\t',
                   unicodedata.name(ch, '?') )
Result: .\SO\63659122.py
̀ ́ ̂ ᷆ ᷇ ੱ ੰ ਼ ੍ ੁ ੂ ੇ ੈ ੋ ੌ ַ ָ ֿ ‌ ‍ ‎ ‏ ً ٌ َ ُ ِ
\u0300 ̀ Mn NSM 230 COMBINING GRAVE ACCENT
\u0301 ́ Mn NSM 230 COMBINING ACUTE ACCENT
\u0302 ̂ Mn NSM 230 COMBINING CIRCUMFLEX ACCENT
\u1dc6 ᷆ Mn NSM 230 COMBINING MACRON-GRAVE
\u1dc7 ᷇ Mn NSM 230 COMBINING ACUTE-MACRON
\u0a71 ੱ Mn NSM 0 GURMUKHI ADDAK
\u0a70 ੰ Mn NSM 0 GURMUKHI TIPPI
\u0a3c ਼ Mn NSM 7 GURMUKHI SIGN NUKTA
\u0a4d ੍ Mn NSM 9 GURMUKHI SIGN VIRAMA
\u0a41 ੁ Mn NSM 0 GURMUKHI VOWEL SIGN U
\u0a42 ੂ Mn NSM 0 GURMUKHI VOWEL SIGN UU
\u0a47 ੇ Mn NSM 0 GURMUKHI VOWEL SIGN EE
\u0a48 ੈ Mn NSM 0 GURMUKHI VOWEL SIGN AI
\u0a4b ੋ Mn NSM 0 GURMUKHI VOWEL SIGN OO
\u0a4c ੌ Mn NSM 0 GURMUKHI VOWEL SIGN AU
\u05b7 ַ Mn NSM 17 HEBREW POINT PATAH
\u05b8 ָ Mn NSM 18 HEBREW POINT QAMATS
\u05bf ֿ Mn NSM 23 HEBREW POINT RAFE
\u200c ‌ Cf BN 0 ZERO WIDTH NON-JOINER
\u200d ‍ Cf BN 0 ZERO WIDTH JOINER
\u200e ‎ Cf L 0 LEFT-TO-RIGHT MARK
\u200f ‏ Cf R 0 RIGHT-TO-LEFT MARK
\u064b ً Mn NSM 27 ARABIC FATHATAN
\u064c ٌ Mn NSM 28 ARABIC DAMMATAN
\u064e َ Mn NSM 30 ARABIC FATHA
\u064f ُ Mn NSM 31 ARABIC DAMMA
\u0650 ِ Mn NSM 32 ARABIC KASRA
It seems like all codepoints that don't have a well-defined stand-alone look (or are not meant to be used as stand-alone characters) are represented with this notation.
For example U+0A3C is present in the "character" {ਫ\u0A3C}. U+0A3C is a combining codepoint that modifies the one that is before it.
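To make that observation concrete, here is a rough sketch that escapes every code point which is a combining mark (general category M*) or a format character (Cf) and leaves everything else alone. That rule is only a guess at the convention, based on the table above; it is not CLDR's documented algorithm, and `escape_non_standalone` is a hypothetical name.

```python
import unicodedata

def escape_non_standalone(text):
    """Write marks (M*) and format characters (Cf) as \\uXXXX escapes.

    Assumed rule for illustration only; CLDR may use different criteria.
    """
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith('M') or cat == 'Cf':
            out.append('\\u%04X' % ord(ch))
        else:
            out.append(ch)
    return ''.join(out)

# Gurmukhi letter PHA followed by the combining nukta U+0A3C.
print(escape_non_standalone('{\u0a2b\u0a3c}'))
```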

Let users send message in other languages via FormtoEmail

If anyone's used formtoemail.com for email before, I'm trying to accept other languages. For example, I just had the following message come in. How do I fix it?
comments: http://propohudenie.com/earn/423-ot-1500-2000-rub-v-den-na-prosmotre-videorolikov-bez-vlozheniy-i-prodazh.html>ÐºÑƒÑ€Ñ Ñƒ шиханова заработок в интернете
ÐžÑ€Ð³Ð°Ð½Ð¸Ð·Ð°Ñ†Ð¸Ñ â€œÐŸÑ€Ð¾Ð²ÐµÑ€ÐµÐ½Ð½Ñ‹Ðµ товары и ÑƒÑ Ð»ÑƒÐ³Ð¸â€ Ð¿Ñ€ÐµÐ´Ð¾Ñ Ñ‚Ð°Ð²Ð»Ñ ÐµÑ‚ Ð´Ð¾Ñ Ñ‚Ð¾Ð²ÐµÑ€Ð½ÑƒÑŽ и проверенную информацию о Ð²Ñ ÐµÐ²Ð¾Ð·Ð¼Ð¾Ð¶Ð½Ñ‹Ñ… товарах и Ñ ÐµÑ€Ð²Ð¸Ñ Ð°Ñ… Ð´Ð»Ñ Ð½Ð°Ñ ÐµÐ»ÐµÐ½Ð¸Ñ . Рашей задачей ÐµÑ Ñ‚ÑŒ проверка ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ð° Ñ ÐµÑ€Ð²Ð¸Ñ Ð° или товара, которые Ñ€ÐµÐ°Ð»Ð¸Ð·ÑƒÑŽÑ‚Ñ Ñ Ð² Ñ ÐµÑ‚Ð¸. ÐŸÐ¾Ñ Ð»Ðµ нашей Ð¸Ð½Ñ Ð¿ÐµÐºÑ†Ð¸Ð¸ на ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ð¾ товары и Ñ ÐµÑ€Ð²Ð¸Ñ Ñ‹ Ñ€Ð°Ð·Ð¼ÐµÑ‰Ð°ÑŽÑ‚Ñ Ñ Ð² каталоге проверенных товаров и ÑƒÑ Ð»ÑƒÐ³. Данный каталог поможет Ð»ÑŽÐ´Ñ Ð¼ подобрать необходимый товар или ÑƒÑ Ð»ÑƒÐ³Ñƒ, не Ñ Ð¾Ð¼Ð½ÐµÐ²Ð°Ñ Ñ ÑŒ в их ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ðµ.
http://propohudenie.com/earn/423-ot-1500-2000-rub-v-den-na-prosmotre-videorolikov-bez-vlozheniy-i-prodazh.html>от 1500 рублей в день на Ð¿Ñ€Ð¾Ñ Ð¼Ð¾Ñ‚Ñ€Ð°Ñ… видеороликов (Ñ ÐµÑ€Ð³ÐµÐ¹ шиханов)
ÐžÑ Ð½Ð¾Ð²Ð½Ñ‹Ð¼Ð¸ Ð½Ð°Ð¿Ñ€Ð°Ð²Ð»ÐµÐ½Ð¸Ñ Ð¼Ð¸ нашей Ð´ÐµÑ Ñ‚ÐµÐ»ÑŒÐ½Ð¾Ñ Ñ‚Ð¸ ÐµÑ Ñ‚ÑŒ товары и ÑƒÑ Ð»ÑƒÐ³Ð¸, Ñ Ð²Ñ Ð·Ð°Ð½Ð½Ñ‹Ðµ Ñ Ð¿Ð¾Ñ…ÑƒÐ´ÐµÐ½Ð¸ÐµÐ¼ и заработком в Интернете. Эти товары и Ñ ÐµÑ€Ð²Ð¸Ñ Ñ‹ чаще Ð²Ñ ÐµÐ³Ð¾ Ð¿Ð¾Ð´Ð²ÐµÑ€Ð³Ð°ÑŽÑ‚Ñ Ñ Ñ„Ð°Ð»ÑŒÑ Ð¸Ñ„Ð¸ÐºÐ°Ñ†Ð¸Ð¸, Ð²Ñ Ð»ÐµÐ´Ñ Ñ‚Ð²Ð¸Ðµ их широкой Ð²Ð¾Ñ Ñ‚Ñ€ÐµÐ±Ð¾Ð²Ð°Ð½Ð½Ð¾Ñ Ñ‚Ð¸.
http://propohudenie.com/earn/431-ot-1300-v-mesyac-na-kulinarnyh-receptah.html>Ñ‚Ð°Ñ‚ÑŒÑ Ð½Ñ‹ зориной заработок на кулинарных интернет Ð¸Ð·Ð´Ð°Ð½Ð¸Ñ Ñ…
Мы Ð¾Ñ ÑƒÑ‰ÐµÑ Ñ‚Ð²Ð»Ñ ÐµÐ¼ квалифицированную проверку на ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ð¾ данных ÑƒÑ Ð»ÑƒÐ³ и товаров, Ð¿Ð¾Ñ Ð»Ðµ чего Ð¿Ñ€ÐµÐ´Ð¾Ñ Ñ‚Ð°Ð²Ð»Ñ ÐµÐ¼ Ð¿Ð¾Ð»ÑŒÐ·Ð¾Ð²Ð°Ñ‚ÐµÐ»Ñ Ð¼ информацию о результатах проверки. Мы также Ñ Ð¼Ð¾Ð¶ÐµÐ¼ Ð¿Ñ€Ð¸Ð½Ñ Ñ‚ÑŒ жалобы от пользователей на какую-нибудь ÑƒÑ Ð»ÑƒÐ³Ñƒ либо товар, предлагаемые в Интернете, на Ð½ÐµÐ´Ð¾Ð±Ñ€Ð¾Ñ Ð¾Ð²ÐµÑ Ñ‚Ð½Ð¾Ðµ отношение фирм-реализаторов. Раш каталог имеет также информацию о ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ðµ других товаров и ÑƒÑ Ð»ÑƒÐ³ – которые не Ð¾Ñ‚Ð½Ð¾Ñ Ñ Ñ‚Ñ Ñ Ðº ÐºÐ°Ñ‚ÐµÐ³Ð¾Ñ€Ð¸Ñ Ð¼ заработка в Ñ ÐµÑ‚Ð¸ и похудению, мы Ð¿Ñ€Ð¾Ð²ÐµÑ€Ñ ÐµÐ¼ ÐºÐ°Ñ‡ÐµÑ Ñ‚Ð²Ð¾ любых ÑƒÑ Ð»ÑƒÐ³ и товаров.
submit: Send
You have to configure the form at the formtoemail.com control panel to accept content in the same encoding you're using to serve your HTML page containing the <form>.
UTF-8 is the sensible choice for this as it allows all characters, but unfortunately the default is ISO-8859-1 and the free version of the service doesn't let you change it.
It's possible to rescue the text above by encoding it to ISO-8859-1 bytes and then decoding those bytes to text using UTF-8. But that's a right pain to do for every mail that comes in. For what it's worth, the above appears to have come from a spambot anyway, so you're not missing much.
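A minimal sketch of that rescue step in Python follows. The function name `repair_mojibake` is made up, and trying Windows-1252 as a fallback is an assumption on my part, since mail software often treats ISO-8859-1 as Windows-1252. If bytes were dropped along the way, the strict UTF-8 decode fails and the text is returned unchanged.

```python
def repair_mojibake(garbled, wrong_encodings=('latin-1', 'cp1252')):
    """Undo UTF-8 text that was mistakenly decoded with a single-byte encoding.

    Hypothetical helper: each candidate encoding is tried in turn, and the
    input is returned as-is when none of them round-trips cleanly.
    """
    for enc in wrong_encodings:
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return garbled.encode(enc).decode('utf-8')
        except UnicodeError:
            continue
    return garbled

# Simulate the fault: UTF-8 bytes read as ISO-8859-1.
original = '\u0437\u0430\u0440\u0430\u0431\u043e\u0442\u043e\u043a'
broken = original.encode('utf-8').decode('latin-1')
print(repair_mojibake(broken) == original)
```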
I would look for a different form sending service.

Reserved character codes in Unicode

Why does Unicode have several reserved character codes?
See the Unicode blocks for two languages, Kannada and Tamil.
Both languages are very old, and I think there is no chance of new characters being added to them.
EDIT: Then why are they wasting some character codes by making them reserved?
Why not place the reserved character codes at the end of each language's character set?
This has to do with how the Unicode consortium doles out its allocated blocks, scripts, and code points. For example, in Block=Tamil, the start of it runs this way:
$ unichars '\p{Block=Tamil}' | head -20
U+00B82 ‭ ◌ஂ GC=Mn SC=Tamil TAMIL SIGN ANUSVARA
U+00B83 ‭ ஃ GC=Lo SC=Tamil TAMIL SIGN VISARGA
U+00B85 ‭ அ GC=Lo SC=Tamil TAMIL LETTER A
U+00B86 ‭ ஆ GC=Lo SC=Tamil TAMIL LETTER AA
U+00B87 ‭ இ GC=Lo SC=Tamil TAMIL LETTER I
U+00B88 ‭ ஈ GC=Lo SC=Tamil TAMIL LETTER II
U+00B89 ‭ உ GC=Lo SC=Tamil TAMIL LETTER U
U+00B8A ‭ ஊ GC=Lo SC=Tamil TAMIL LETTER UU
U+00B8E ‭ எ GC=Lo SC=Tamil TAMIL LETTER E
U+00B8F ‭ ஏ GC=Lo SC=Tamil TAMIL LETTER EE
U+00B90 ‭ ஐ GC=Lo SC=Tamil TAMIL LETTER AI
U+00B92 ‭ ஒ GC=Lo SC=Tamil TAMIL LETTER O
U+00B93 ‭ ஓ GC=Lo SC=Tamil TAMIL LETTER OO
U+00B94 ‭ ஔ GC=Lo SC=Tamil TAMIL LETTER AU
U+00B95 ‭ க GC=Lo SC=Tamil TAMIL LETTER KA
U+00B99 ‭ ங GC=Lo SC=Tamil TAMIL LETTER NGA
U+00B9A ‭ ச GC=Lo SC=Tamil TAMIL LETTER CA
U+00B9C ‭ ஜ GC=Lo SC=Tamil TAMIL LETTER JA
U+00B9E ‭ ஞ GC=Lo SC=Tamil TAMIL LETTER NYA
U+00B9F ‭ ட GC=Lo SC=Tamil TAMIL LETTER TTA
They tend to reserve contiguous rows of 4, 8, or 16 code points to all the same “kind” of character. Yes, there are gaps there, but it’s like how in the filesystem, once you allocate a sector (or block if you don’t have separate sectors within a block) to one file, even if that file doesn’t use everything in its (final) sector, you don’t go giving away those unused bytes to some other process. Things tend to get padded to block boundaries anyway.
It’s not like we’re at any risk of running out of codes.
The allocated area begins with “Signs”, as shown by the first assigned code points in that block. The gap may represent a change from one kind of character to another. If you check the properties of the first six code points in the block, you see that the unassigned ones still have the right block property:
$ uniprops -a U+00B80 U+00B81 U+00B82 U+00B83 U+00B84 U+00B85
U+0B80 ‹U+0B80› \N{U+0B80}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B81 ‹U+0B81› \N{U+0B81}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B82 ‹◌ஂ› \N{TAMIL SIGN ANUSVARA}
\w \pM \p{Mn}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil Case_Ignorable CI M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC
Mark Nonspacing_Mark Print Taml Word XID_Continue XIDC X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=Nonspacing_Mark BC=NSM Bidi_Class=NSM Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=EX
Grapheme_Cluster_Break=Extend GCB=EX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=EX Sentence_Break=Extend SB=EX Word_Break=Extend WB=Extend
U+0B83 ‹ஃ› \N{TAMIL SIGN VISARGA}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
Word_Break=LE
U+0B84 ‹U+0B84› \N{U+0B84}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B85 ‹அ› \N{TAMIL LETTER A}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
Word_Break=LE
If you look at other allocated blocks, you see the same sort of thing. It doesn’t make sense to slice up blocks into unrelated things.
As I said, it’s not as though they’re going to run out of space, so I don’t know what the concern is here.
BTW, you can get Unicode exploration and processing tools like unichars, uniprops, and uninames from my Unicode Command-Line Toolchest, either individually from there or as the entire suite available through the CPAN Unicode::Tussle suite.
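If those tools are not at hand, the gaps in the Tamil block can also be listed with Python's standard `unicodedata` module. This is only a sketch: the block range U+0B80..U+0BFF is hard-coded because `unicodedata` does not expose block boundaries, and the result reflects whichever Unicode version the running interpreter ships with.

```python
import unicodedata

# Tamil block, U+0B80..U+0BFF (range written out by hand).
TAMIL_BLOCK = range(0x0B80, 0x0C00)

def unassigned_in(block):
    """Return the code points in block whose general category is Cn (unassigned)."""
    return [cp for cp in block if unicodedata.category(chr(cp)) == 'Cn']

gaps = unassigned_in(TAMIL_BLOCK)
print(len(gaps), 'unassigned of', len(TAMIL_BLOCK))
print(' '.join('U+%04X' % cp for cp in gaps[:5]))
```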

How to identify all non-basic UTF-8 characters in a set of strings in perl

I'm using perl's XML::Writer to generate an import file for a program called OpenNMS. According to the documentation I need to pre-declare all special characters as XML ENTITY declarations. Obviously I need to go through all strings I'm exporting and catalogue the special characters used. What's the easiest way to work out which characters in a perl string are "special" with respect to UTF-8 encoding? Is there any way to work out what the entity names for those characters should be?
In order to find "special" characters, you can use ord to find out the codepoint. Here's an example:
# Create a Unicode test file with some Latin chars, some Cyrillic,
# and some outside the BMP.
# The BMP is the basic multilingual plane, see perluniintro.
# (Not sure what you mean by saying "non-basic".)
perl -CO -lwe "print join '', map chr, 97 .. 100, 0x410 .. 0x415, 0x10000 .. 0x10003" > u.txt
# Read it and find codepoints outside the BMP.
perl -CI -nlwe "print for map ord, grep ord > 0xffff, split //" < u.txt
You can get a good introduction from reading perluniintro.
I'm not sure what the docs you're referring to mean in the section "Exported XML".
Looks like some limitation of a system which is de facto ASCII and doesn't do Unicode.
Or a misunderstanding of XML. Or both.
Anyway, if you're looking for names you could use or reference the canonical ones.
See XML Entity Definitions for Characters or one of the older documents for HTML or MathML referenced therein.
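For comparison, the cataloguing step can be sketched in Python. It assumes that "special" simply means "outside ASCII", which is one reading of the question rather than anything OpenNMS specifies. Both function names are hypothetical, and the markup characters `&`, `<` and `>` are deliberately left alone here.

```python
def special_chars(strings):
    """Collect every non-ASCII character used in the given strings, sorted by code point."""
    found = set()
    for s in strings:
        found.update(ch for ch in s if ord(ch) > 0x7F)
    return sorted(found)

def to_char_refs(s):
    """Replace each non-ASCII character with a numeric character reference."""
    return ''.join(ch if ord(ch) <= 0x7F else '&#x%X;' % ord(ch) for ch in s)

words = ['cr\u00e8me', 'br\u00fbl\u00e9e']
print(['U+%04X' % ord(ch) for ch in special_chars(words)])
print(to_char_refs(words[0]))
```

Numeric character references need no ENTITY declarations, which may sidestep the naming problem altogether.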
You might look into the uniquote program. It has a --xml option. For example:
$ cat sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
3 NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
5 invisible characters: (4⁄3⁢π⁢r³) and (4⁄3⁢π⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
7 astral + combining chars: (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
8 wide characters: (wide) and (wide).
9 regular characters: (normal) and (normal).
$ uniquote -x sample
1 NFD single combining characters: (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}) and (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}).
2 NFC single combining characters: (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}) and (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}).
3 NFD multiple combining characters: (ha\x{302}\x{303}c\x{327}\x{30C}k) and (ha\x{303}\x{302}c\x{327}\x{30C}k).
3 NFC multiple combining characters: (h\x{1EAB}\x{E7}\x{30C}k) and (h\x{E3}\x{302}\x{E7}\x{30C}k).
5 invisible characters: (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}) and (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}).
6 astral characters: (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]) and (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]).
7 astral + combining chars: (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]) and (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]).
8 wide characters: (\x{FF57}\x{FF49}\x{FF44}\x{FF45}) and (\x{FF57}\x{FF49}\x{FF44}\x{FF45}).
9 regular characters: (normal) and (normal).
$ uniquote -b sample
1 NFD single combining characters: (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81) and (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81).
2 NFC single combining characters: (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9) and (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9).
3 NFD multiple combining characters: (ha\xCC\x82\xCC\x83c\xCC\xA7\xCC\x8Ck) and (ha\xCC\x83\xCC\x82c\xCC\xA7\xCC\x8Ck).
3 NFC multiple combining characters: (h\xE1\xBA\xAB\xC3\xA7\xCC\x8Ck) and (h\xC3\xA3\xCC\x82\xC3\xA7\xCC\x8Ck).
5 invisible characters: (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3) and (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3).
6 astral characters: (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]) and (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]).
7 astral + combining chars: (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]) and (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]).
8 wide characters: (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85) and (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85).
9 regular characters: (normal) and (normal).
$ uniquote -v sample
1 NFD single combining characters: (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}) and (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}).
2 NFC single combining characters: (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}) and (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}).
3 NFD multiple combining characters: (ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k) and (ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k).
3 NFC multiple combining characters: (h\N{LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k) and (h\N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING CIRCUMFLEX ACCENT}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k).
5 invisible characters: (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}) and (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}).
6 astral characters: (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]).
7 astral + combining chars: (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]).
8 wide characters: (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}) and (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}).
9 regular characters: (normal) and (normal).
$ uniquote --xml sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hâçk) and (hãçk).
3 NFC multiple combining characters: (hẫk) and (hãk).
5 invisible characters: (4⁄3⁢r³) and (4⁄3⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀 + 𝐁]) and (𝐂 = sqrt[𝐀 + 𝐁]).
7 astral + combining chars: (𝐂 = sqrt[𝐀 + 𝐁]) and (𝐂 = sqrt[𝐀 + 𝐁]).
8 wide characters: (w) and (w).
9 regular characters: (normal) and (normal).
$ uniquote --verbose --html sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
3 NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
5 invisible characters: (4⁄3⁢π⁢r³) and (4⁄3⁢π⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
7 astral + combining chars: (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
8 wide characters: (wide) and (wide).
9 regular characters: (normal) and (normal).

The number of characters in each unicode block

Does anyone know of a reference showing the number of characters in each Unicode block (for a newer version such as 5.x.x or 6.0.0)?
Thanks a lot.
http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt contains the data you are interested in.
http://www.unicode.org/Public/6.0.0/ucd/ReadMe.txt contains some instructions and refers to http://unicode.org/reports/tr44/ for interpreting the data. In that document you should read http://unicode.org/reports/tr44/#UnicodeData.txt.
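As a rough cross-check that needs no download, assigned code points per block can be counted with Python's `unicodedata` module. This sketch rests on two assumptions: the block ranges in `BLOCKS` are copied by hand, since `unicodedata` has no block table, and the counts reflect the Unicode version bundled with the interpreter (`unicodedata.unidata_version`), which is unlikely to be 6.0.0.

```python
import unicodedata

# Hand-copied block ranges (first, last); extend as needed.
BLOCKS = {
    'Cyrillic': (0x0400, 0x04FF),
    'Latin-1 Supplement': (0x0080, 0x00FF),
    'Tamil': (0x0B80, 0x0BFF),
}

def assigned_count(first, last):
    """Count code points in [first, last] that are not unassigned (category Cn)."""
    return sum(1 for cp in range(first, last + 1)
               if unicodedata.category(chr(cp)) != 'Cn')

print('Unicode data version:', unicodedata.unidata_version)
for name, (first, last) in BLOCKS.items():
    print(name, assigned_count(first, last), 'of', last - first + 1)
```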
unichars
Does this answer your question:
% unichars '\p{InCyrillic}' | wc -l
256
% unichars '\p{InEthiopic}' | wc -l
356
% unichars '\p{InLatin1}' | wc -l
128
% unichars '\p{InCombiningDiacriticalMarks}' | wc -l
112
To include the 16 astral planes, add -a:
% unichars -a '\p{InAncientGreekNumbers}' | wc -l
75
If you want unassigned or Han or Hangul, you need -u:
% unichars -u '\p{InEthiopic}' | wc -l
384
% unichars -u '\p{InCJKUnifiedIdeographsExtensionA}' | wc -l
6592
You can get other information, too:
% unichars '\P{IsGreek}' '\p{InGreek}'
ʹ 884 0374 GREEK NUMERAL SIGN
; 894 037E GREEK QUESTION MARK
΅ 901 0385 GREEK DIALYTIKA TONOS
· 903 0387 GREEK ANO TELEIA
Ϣ 994 03E2 COPTIC CAPITAL LETTER SHEI
ϣ 995 03E3 COPTIC SMALL LETTER SHEI
Ϥ 996 03E4 COPTIC CAPITAL LETTER FEI
ϥ 997 03E5 COPTIC SMALL LETTER FEI
Ϧ 998 03E6 COPTIC CAPITAL LETTER KHEI
ϧ 999 03E7 COPTIC SMALL LETTER KHEI
Ϩ 1000 03E8 COPTIC CAPITAL LETTER HORI
ϩ 1001 03E9 COPTIC SMALL LETTER HORI
Ϫ 1002 03EA COPTIC CAPITAL LETTER GANGIA
ϫ 1003 03EB COPTIC SMALL LETTER GANGIA
Ϭ 1004 03EC COPTIC CAPITAL LETTER SHIMA
ϭ 1005 03ED COPTIC SMALL LETTER SHIMA
Ϯ 1006 03EE COPTIC CAPITAL LETTER DEI
ϯ 1007 03EF COPTIC SMALL LETTER DEI
% unichars '\p{IsGreek}' '\P{InGreek}' | wc -l
250
% unichars '\P{IsGreek}' '\p{InGreek}' | wc -l
18
% unichars '\p{In=1.1}' | wc -l
6362
% unichars '\p{In=6.0}' | wc -l
15087
uniprops
Here’s uniprops:
% uniprops -l | grep -c 'Block='
84
% uniprops digamma 450 %
U+03DC ‹Ϝ› \N{ GREEK LETTER DIGAMMA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC Changes_When_Casefolded CWCF
Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base
Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print
Upper Uppercase Word XID_Continue XIDC XID_Start XIDS XPosixAlnum XPosixAlpha XPosixGraph XPosixPrint XPosixUpper
XPosixWord
U+0450 ‹ѐ› \N{ CYRILLIC SMALL LETTER IE WITH GRAVE }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InCyrillic Cyrillic Is_Cyrillic Cased Cased_Letter LC Changes_When_Casemapped
CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Cyrl Ll L Gr_Base Grapheme_Base Graph GrBase
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
XPosixAlnum XPosixAlpha XPosixGraph XPosixLower XPosixPrint XPosixWord
U+0025 ‹%› \N{ PERCENT SIGN }:
\pP \p{Po}
All Any ASCII Assigned Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct Print Punctuation XPosixGraph XPosixPrint XPosixPunct
Or even all these:
% uniprops -vag 777
U+0777 ‹ݷ› \N{ ARABIC LETTER FARSI YEH WITH EXTENDED ARABIC-INDIC DIGIT FOUR BELOW }:
\w \pL \p{L_} \p{Lo}
\p{All} \p{Any} \p{Alnum} \p{Alpha} \p{Alphabetic} \p{Arab} \p{Arabic} \p{Assigned} \p{Is_Arabic} \p{InArabicSupplement} \p{L} \p{Lo} \p{Gr_Base} \p{Grapheme_Base} \p{Graph}
\p{GrBase} \p{ID_Continue} \p{IDC} \p{ID_Start} \p{IDS} \p{Letter} \p{L_} \p{Other_Letter} \p{Print} \p{Word} \p{XID_Continue} \p{XIDC} \p{XID_Start} \p{XIDS} \p{XPosixAlnum}
\p{XPosixAlpha} \p{XPosixGraph} \p{XPosixPrint} \p{XPosixWord}
\p{Age:5.1} \p{Script=Arabic} \p{Bidi_Class:AL} \p{Bidi_Class=Arabic_Letter} \p{Bidi_Class:Arabic_Letter} \p{Bc=AL} \p{Block:Arabic_Supplement} \p{Canonical_Combining_Class:0}
\p{Canonical_Combining_Class=Not_Reordered} \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR} \p{Canonical_Combining_Class:NR} \p{Decomposition_Type:None} \p{Dt=None}
\p{East_Asian_Width=Neutral} \p{East_Asian_Width:Neutral} \p{General_Category:L} \p{General_Category=Letter} \p{General_Category:Letter} \p{Gc=L} \p{General_Category:Lo}
\p{General_Category=Other_Letter} \p{General_Category:Other_Letter} \p{Gc=Lo} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX} \p{Grapheme_Cluster_Break:XX}
\p{Grapheme_Cluster_Break=Other} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:Yeh}
\p{Jg=Yeh} \p{Joining_Type:D} \p{Joining_Type=Dual_Joining} \p{Joining_Type:Dual_Joining} \p{Jt=D} \p{Line_Break:AL} \p{Line_Break=Alphabetic} \p{Line_Break:Alphabetic}
\p{Lb=AL} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2} \p{Present_In:6.0} \p{In=6.0}
\p{Script:Arab} \p{Script:Arabic} \p{Sc=Arab} \p{Sentence_Break:LE} \p{Sentence_Break=OLetter} \p{Sentence_Break:OLetter} \p{SB=LE} \p{Word_Break:ALetter} \p{WB=LE}
\p{Word_Break:LE} \p{Word_Break=ALetter}
My uniprops and unichars should run anywhere running Perl version 5.10 or better. There’s also a uninames script that goes with them.
There's a list available here, although it does not specify which version of the standard it applies to: