How are Chinese numbers categorized in the Unicode database?

I am learning Unicode, and I listed all the code points in the category 'Decimal Number' (General_Category=Decimal_Number).
My question, as a Chinese speaker: I cannot find the Chinese numerals under Decimal Number. I found them categorized as 'Other Letter' instead:
一二三四五六七八九十零壹貳叁肆伍陸柒捌玖拾贰陆
I would like to know more: is there a reason for this?
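For reference, the distinction can be checked programmatically. The sketch below (Java, not part of the original question) prints the General_Category of the ASCII digit '3', the Arabic-Indic digit '٣', and the ideograph '三': the first two are Decimal_Number (Nd), while the ideograph is Other_Letter (Lo), since Nd is reserved for digits used positionally in decimal notation.

public class CategoryCheck {
    public static void main(String[] args) {
        int[] threes = { '3', '٣', '三' };   // ASCII digit, Arabic-Indic digit, CJK ideograph for "three"
        for (int cp : threes) {
            int type = Character.getType(cp);
            System.out.printf("U+%04X %c  Decimal_Number=%b  Other_Letter=%b%n",
                    cp, cp,
                    type == Character.DECIMAL_DIGIT_NUMBER,   // General_Category=Nd
                    type == Character.OTHER_LETTER);          // General_Category=Lo
        }
    }
}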

Related

Japanese and Chinese first/last name detector

Given two Unicode strings encoding a first and last name (in Japanese or Chinese), what would be the best approach to tell whether the first/last name is Chinese or Japanese?
For example, is it possible to tell if the following are Chinese or Japanese names?
任天堂
金城武
唐泽西
白川轩
竹中宇
叶山明
林慧梦
No, it is impossible to tell the language of a string from its raw character content alone.
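To illustrate the point (a Java sketch, not part of the original answer): Unicode assigns each character a script, not a language, and every character in the sample names reports the same script, Han, whether the name happens to be Japanese or Chinese.

public class ScriptCheck {
    public static void main(String[] args) {
        String[] names = { "任天堂", "金城武", "唐泽西", "叶山明" };
        for (String name : names) {
            for (int cp : name.codePoints().toArray()) {
                // Every code point here prints HAN; the script property cannot separate the languages.
                System.out.printf("%c -> %s%n", cp, Character.UnicodeScript.of(cp));
            }
        }
    }
}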

antlr4 and international characters

I have been using ANTLR4 to parse a German document, and so far I have used the following rule to handle text that includes German characters:
LETTERS:
[a-zA-Z_\u00DC\u00FC\u00D6\u00F6\u00C4\u00E4\u00DF]; // hex unicodes for ÜüÖöÄäß
What is the best way to describe the letters of all languages in Unicode in a way that ANTLR understands, without specifying each language or character individually? Say, French, Arabic, Chinese, or Japanese characters?
Thank you
The best way is to use character ranges corresponding to the desired Unicode classes. Even then, the result can be a bit clumsy. See this worked example.
The raw data available in the Unicode standard's Appendix tables can be stripped and munged into a usable format with just a bit too much effort. ;)
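For illustration (in Java rather than an ANTLR grammar), the Unicode classes the answer refers to are the General_Category groups such as Letter, which can be tested with a single property check instead of hand-listed escapes; as far as I know, newer ANTLR 4 releases (4.7 and later) also accept Unicode property escapes such as \p{Alphabetic} directly in lexer character sets, which avoids building the ranges by hand.

public class LetterCheck {
    public static void main(String[] args) {
        // Mixed-script sample: German, Arabic, Chinese, Japanese (the spaces report false).
        String sample = "Übung مرحبا 漢字 ねこ";
        for (int cp : sample.codePoints().toArray()) {
            System.out.printf("%c  isLetter=%b  \\p{L}=%b%n",
                    cp, Character.isLetter(cp),
                    new String(Character.toChars(cp)).matches("\\p{L}"));
        }
    }
}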

Strategy for defining Unicode Ranges by Culture

I am new to Unicode and have been given the requirement to look at some translated text, iterate over all of the characters of that translation, and determine whether all the characters are valid for the target culture (language and location).
For example, if I am translating a document from English to Greek, I want to detect any English/ASCII "A"s in the Greek translation and report them as errors. This could easily happen with corrupted data from a translation memory.
Is there any existing grouping of Unicode characters by culture? Or is there any existing strategy for developing such a grouping? I see that there is some grouping of characters at http://www.unicode.org/charts/, but at first glance it does not seem to be quite what I am looking for.
Does anything exist like "Here are the valid Unicode characters for Spanish - Spain: [some Unicode range(s)]" or "Here are the valid Unicode characters for Russian - Russia: [some Unicode range(s)]"?
Or has anyone developed a strategy to define these?
If this is not the right place to ask this question, I would welcome any direction on where might be a good place to ask the question.
This is something that CLDR (Common Locale Data Repository) deals with. It is not part of the Unicode Standard, but it is an activity and a resource managed by the Unicode Consortium. The LDML specification defines the format of the locale data. The Character Elements define some sets of characters: “main/standard”, “auxiliary”, “index”, and “punctuation”.
The data for Greek includes only Greek letters and some basic punctuation. This, like all such data in CLDR, is largely subjective. And even though the CLDR process is meant to produce well-reviewed data based on consensus, the reality is different. It can be argued that in normal Greek texts, Latin letters are not uncommon, especially in technical contexts. For example, the international symbol for the ampere is "A", a Latin letter; the symbol for the kilogram is "kg", in Latin letters, even though the word for it is written in Greek letters in Greek.
Thus, no matter how you run the analysis, the occurrence of Latin “A” in Greek text could be flagged as potentially suspicious, but not an error.
There are C/C++ and Java libraries that implement access to CLDR data, as part of ICU.
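A minimal sketch, assuming the ICU4J library (com.ibm.icu on the classpath; exact method names may vary slightly between ICU versions), of fetching the CLDR "main/standard" and "auxiliary" exemplar sets for Greek and testing characters against them:

import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.LocaleData;
import com.ibm.icu.util.ULocale;

public class ExemplarCheck {
    public static void main(String[] args) {
        LocaleData data = LocaleData.getInstance(new ULocale("el"));
        UnicodeSet standard  = data.getExemplarSet(0, LocaleData.ES_STANDARD);   // "main/standard" set
        UnicodeSet auxiliary = data.getExemplarSet(0, LocaleData.ES_AUXILIARY);  // "auxiliary" set
        for (String s : new String[] { "α", "ώ", "A" }) {
            // Membership in neither set is a hint to flag the character, not proof of an error.
            System.out.printf("%s  standard=%b  auxiliary=%b%n",
                    s, standard.containsAll(s), auxiliary.containsAll(s));
        }
    }
}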

Do I need Unicode to identify different writing systems?

Whether or not it is optimal, I am trying to identify specific characters using their hexadecimal codes. (Is there a better way to identify alphabets, Arabic, Chinese, or Japanese characters?)
http://play.golang.org/p/b81_rgXr3G
fmt.Printf("%x \n", "가") //eab080
fmt.Printf("%x \n", "ㅎ") //e3858e
So it is true that in Korean
eab080 < e3858e
Then my question is
do we have any table or chart for each language's hexadecimal boundary?
I mean, for English
fmt.Printf("%x \n", "A") //41
fmt.Printf("%x \n", "z") //7a
Then 41 < 7a
As you can see above, the alphabet is bounded between 41 and 7a.
I am trying the same thing for other writing systems that do not use the Latin alphabet.
Do I need Unicode to identify different writing systems? The standard unicode library seems to provide only encoding and decoding of the English alphabet.
Thanks in advance.
No, we do not have any table or chart for each language’s hexadecimal boundary. There is some data about characters typically used in various languages.
This answers the question asked, but you should consider whether that was your real problem. The question refers to writing systems, alphabets, and languages as if they were one thing; they are separate concepts. You should define your practical problem: what information do you really need? In a text in some language, any Unicode character may appear.
By the way, English also has (at least in some forms of the language) words like fiancé, coöperation, rôle, anæmia, belovèd, etc.
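To make this concrete: instead of comparing raw UTF-8 byte values, ask which Unicode script a code point belongs to. The sketch below is in Java for illustration; Go's standard unicode package exposes the same Script property through range tables, e.g. unicode.Is(unicode.Hangul, r) or unicode.Is(unicode.Han, r).

public class WritingSystem {
    public static void main(String[] args) {
        // Hangul syllable, Hangul jamo, Latin letters, Han ideograph, Hiragana.
        String sample = "가ㅎAz漢か";
        for (int cp : sample.codePoints().toArray()) {
            System.out.printf("U+%04X %c -> %s%n",
                    cp, cp, Character.UnicodeScript.of(cp));
        }
    }
}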

Romanization of Unicode text

I am looking for a way to transliterate Unicode letter characters from any language into accented Latin letters. The intent is to allow foreigners to gain insight into the pronunciation of names and words written in any non-Latin script.
Examples:
Greek: Romanize("Αλφαβητικός") returns "Alphabētikós" (or "Alfavi̱tikós")
Japanese: Romanize("しんばし") returns "shimbashi" (or "sinbasi")
Russian: Romanize("яйца Фаберже") returns "yaytsa Faberzhe" (or "jajca Faberže")
It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek. It should be data-driven and extensible, using data from the Unicode Consortium, the USA, the EU, or the UN. The code should be open source, written in .NET or Java.
Does such a library exist?
The problem is a lot more complex than you think.
Greek, Cyrillic, Indic scripts, Georgian -> trivial, you could program that in an hour
Thai, Japanese Kana -> doable with a bit more effort
Japanese Kanji, Chinese -> these are not alphabets/syllabaries, so you're not in fact transliterating; you're looking up the pronunciation of each symbol in a (hopefully large) dictionary (EDICT and CCDICT should work), and a lot of the time you'll get it wrong unless you also consider the context, especially in Japanese
Korean -> technically an alphabet, but computers can only handle the composed characters, so you need another large database, I'm not aware of any
Arabic, Hebrew -> these languages don't write down short vowels, so a lot of times your transliteration will be something unreadable like "bytlhm" (Bethlehem). I'm not aware of any large databases that map Arabic or Hebrew words to their pronunciation.
You can use UnidecodeSharp:
a C# port of the Python Unidecode package, which is itself a port of the Perl unidecode module.
(There are also PHP and Ruby implementations available.)
Usage:
using System.Windows.Forms;
using BinaryAnalysis.UnidecodeSharp;   // provides the Unidecode() extension method on string

// ... then, anywhere in your code:
string _Greek = "Αλφαβητικός";
MessageBox.Show(_Greek.Unidecode());     // shows an ASCII approximation of the Greek text

string _Japan = "しんばし";
MessageBox.Show(_Japan.Unidecode());

string _Russian = "яйца Фаберже";
MessageBox.Show(_Russian.Unidecode());
I hope it works well for you.
I am unaware of any open source solution here beyond ICU. If ICU works for you, great. If not, note that I am the CTO of a company that sells a commercial product for this purpose, one that can deal with the icky cases like Chinese words, Japanese multiple readings, and Arabic incomplete orthography.
The Unicode Common Locale Data Repository has some transliteration mappings you could use.
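As a concrete starting point, ICU ships those CLDR transforms as Transliterator objects. A minimal sketch, assuming the ICU4J library (com.ibm.icu), chaining "Any-Latin" with "Latin-ASCII"; the caveats above about kanji readings and missing Arabic short vowels still apply:

import com.ibm.icu.text.Transliterator;

public class Romanize {
    public static void main(String[] args) {
        // Convert any script to Latin, then reduce the result to ASCII.
        Transliterator toLatin = Transliterator.getInstance("Any-Latin; Latin-ASCII");
        for (String s : new String[] { "Αλφαβητικός", "しんばし", "яйца Фаберже" }) {
            System.out.println(s + " -> " + toLatin.transliterate(s));
        }
    }
}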