What Unicode Ranges are used exclusively by Simplified Chinese, Traditional Chinese, Korean, and Japanese? - unicode

I need to create packages that contain the Unicode characters used only by a specified language. A key requirement is to make these packages as small as possible, which is why each package contains only the characters used for its language.
The problem is that I can't find a single resource online that specifies the ranges for ONLY a certain language, such as ranges X1-X2, Y3-Y8, etc. for Simplified Chinese. Instead, every resource tells me to use the CJK Unified Ideographs block (U+4E00 - U+9FFF). I'd like to know which parts of CJK are used for each of the languages below.
I understand that many characters in Asian languages are considered obsolete or unused, so they should be excluded from the ranges. The ranges should only include characters actually used to communicate. I hope that's clear, haha.
That being said, the languages I'm trying to make these packages for are:
Simplified Chinese
Traditional Chinese
Korean
Japanese
Does anyone know the exclusive ranges for these languages or how to find them out?

Related

Unicode for Contextual forms of ټ,ګ,ځ,څ,ڼ,ښ,ډ,ۍ,ړ,ې in Pashto language

I am developing a program that gives the correct presentation form of text. For example, if I write سلام, it gives FEB3, FEE0, FE8E, and FEE2, which are the Unicode code points of سـ, ـلـ, ﺎ, and ـم. But if I write ټول, there is a Unicode code point for the character ټ, which is 067C, while there is no code point for ټـ, its initial contextual form.
I found the code points for the isolated forms of ټ, ګ, ځ, څ, ڼ, ښ, ډ, ۍ, ړ, and ې on Wikipedia, but I can't find the code points for their contextual forms.
For example, the code points of ټـ, ـټـ, and ـټ.
I would appreciate a response if anyone knows the solution to this problem.
Thanks.
A Unicode character is intended to be abstract in the sense that it doesn't have a particular presentation form. The preferred way to display cursive scripts like Arabic is to store the standard, non-contextual forms, and convert them to their cursive forms at display time - that is, as one of the final stages of a text display system in an operating system or word processor.
The cursive forms are usually provided as glyphs in the font, and are chosen using information in tables in the font file embodying the contextual rules.
Unicode stores quite a large number of Arabic contextual forms, but only for compatibility with older encodings, and with traditional metal type, for which only a finite number of physical glyphs can be supplied. Unfortunately for your purposes, these contextual forms don't cover all the extended characters used in languages other than Arabic, such as the example you give, which is U+067C ARABIC LETTER TEH WITH RING, used in Pashto.
It's very unlikely that further contextual Arabic forms will be added, in my opinion. Therefore your proposed program cannot be made to work, at least according to its current design.
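As a rough illustration of the compatibility forms mentioned above, a lookup table for the legacy presentation forms of the letters in سلام might look like the Java sketch below (the class and method names are invented for this example). The code points come from the Arabic Presentation Forms-B block; a letter like U+067C simply has no row that could be added to such a table.

import java.util.Map;

public class PresentationForms {

    // Per base letter: { isolated, final, initial, medial }; -1 means no such form is encoded.
    static final Map<Integer, int[]> FORMS = Map.of(
        0x0633, new int[] { 0xFEB1, 0xFEB2, 0xFEB3, 0xFEB4 },  // SEEN
        0x0644, new int[] { 0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0 },  // LAM
        0x0627, new int[] { 0xFE8D, 0xFE8E, -1, -1 },          // ALEF (joins on the right only)
        0x0645, new int[] { 0xFEE1, 0xFEE2, 0xFEE3, 0xFEE4 }   // MEEM
    );

    // Returns the presentation-form code point, or the base code point when none
    // is encoded (which is exactly what happens for U+067C TEH WITH RING).
    static int shape(int base, int position) {
        int[] forms = FORMS.get(base);
        return (forms == null || forms[position] == -1) ? base : forms[position];
    }

    public static void main(String[] args) {
        System.out.printf("Initial seen: U+%04X%n", shape(0x0633, 2));          // U+FEB3
        System.out.printf("Initial teh-with-ring: U+%04X%n", shape(0x067C, 2)); // stays U+067C
    }
}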
Earlier Unicode versions included separate codes for the different forms of Arabic letters, for all but some letters. The Arabic script is used to write Pashto, Farsi, Urdu, and a few other languages. The letters used in Arabic, Farsi, and perhaps a couple of other languages were assigned a different code for each of their forms, but the letters used only by less widely supported languages like Pashto, which you are asking about, were assigned codes for the isolated forms only. In later versions of Unicode, it was decided to assign just a single code to each letter, which leaves Pashto-only letters with codes for the isolated forms alone.
Actually, there was no need to have a separate code for each form; that was a bad decision made in the earlier Unicode versions. A rendering engine (an editor, or any other program that deals with plain text) should account for the different forms of each letter and display the correct form according to its position in the word.
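If you do want a program that computes the forms itself rather than leaving it to the font, ICU4J's ArabicShaping class performs this positional analysis for the letters that have legacy presentation forms. A rough sketch, assuming ICU4J is on the classpath; exact results depend on the ICU version, and letters without encoded forms, such as U+067C, should come back unchanged.

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;

public class ShapeDemo {
    public static void main(String[] args) throws ArabicShapingException {
        ArabicShaping shaper = new ArabicShaping(
                ArabicShaping.LETTERS_SHAPE | ArabicShaping.TEXT_DIRECTION_LOGICAL);

        // Standard Arabic letters are replaced by their legacy presentation-form
        // code points (including lam-alef ligatures where they apply).
        System.out.println(toHex(shaper.shape("\u0633\u0644\u0627\u0645"))); // سلام

        // Pashto-only letters such as U+067C have no presentation forms and are
        // expected to pass through unchanged.
        System.out.println(toHex(shaper.shape("\u067C\u0648\u0644")));       // ټول
    }

    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }
}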

Strategy for defining Unicode Ranges by Culture

I am new to Unicode and have been given the requirement to look at some translated text, iterate over all of the characters of that translation, and determine whether all of the characters are valid for the target culture (language and location).
For example, if I am translating a document from English to Greek, I want to detect any English/ASCII "A"s in the Greek translation and report that as an error. This could easily happen with corrupted data from a translation memory.
Is there any existing grouping of Unicode characters by culture? Or is there any existing strategy for developing this kind of grouping? I see that there is some grouping of characters at http://www.unicode.org/charts/, but at first glance it does not seem to be quite what I am looking for.
Does anything exist like "Here are the valid Unicode characters for Spanish - Spain: [some Unicode range(s)]" or "Here are the valid Unicode characters for Russian - Russia: [some Unicode range(s)]"?
Or has anyone developed a strategy to define these?
If this is not the right place to ask this question, I would welcome any direction on where might be a good place to ask the question.
This is something that CLDR (Common Locale Data Repository) deals with. It is not part of the Unicode Standard, but it is an activity and a resource managed by the Unicode Consortium. The LDML specification defines the format of the locale data. The Character Elements define some sets of characters: “main/standard”, “auxiliary”, “index”, and “punctuation”.
The data for Greek includes only Greek letters and some basic punctuation. This, like all such data in CLDR, is largely subjective. And even though the CLDR process is meant to produce well-reviewed data based on consensus, the reality is different. It can be argued that in normal Greek texts, Latin letters are not uncommon, especially in technical areas. For example, the international symbol for the ampere is "A", a Latin letter; the symbol for the kilogram is "kg", in Latin letters, even though the word for it is written in Greek letters in Greek text.
Thus, no matter how you run the analysis, the occurrence of Latin “A” in Greek text could be flagged as potentially suspicious, but not an error.
There are C/C++ and Java libraries that implement access to CLDR data, as part of ICU.
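As a rough sketch of how that data can be consumed from Java through ICU4J (the locale, the extra characters added to the set, and the acceptance policy are all choices you would tune, and whether a given character gets flagged depends on the CLDR version):

import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.LocaleData;
import com.ibm.icu.util.ULocale;

public class ExemplarCheck {
    public static void main(String[] args) {
        ULocale greek = new ULocale("el");

        // Combine the CLDR "main/standard", "auxiliary" and "punctuation" exemplar sets.
        UnicodeSet allowed = new UnicodeSet()
                .addAll(LocaleData.getExemplarSet(greek, 0, LocaleData.ES_STANDARD))
                .addAll(LocaleData.getExemplarSet(greek, 0, LocaleData.ES_AUXILIARY))
                .addAll(LocaleData.getExemplarSet(greek, 0, LocaleData.ES_PUNCTUATION))
                .addAll(new UnicodeSet("[0-9\\u0020]"))    // allow digits and space as well
                .closeOver(UnicodeSet.CASE_INSENSITIVE);   // exemplar data is lowercase

        // "The voltage is 5 A"; the Latin "A" is the kind of character that
        // gets reported as suspicious rather than as an outright error.
        String translation = "Η τάση είναι 5 A";
        translation.codePoints()
                   .filter(cp -> !allowed.contains(cp))
                   .forEach(cp -> System.out.printf("Suspicious: U+%04X '%c'%n", cp, cp));
    }
}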

Unicode comparison of Cyrillic 'С' and Latin 'C'

I have a dataset that mixes use of the Unicode characters \u0421 ('С') and \u0043 ('C'). Is there some sort of Unicode comparison that considers those two characters the same? So far I've tried several ICU collations, including the Russian one.
There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables” – characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You may need to code your own use of this data, although recent ICU versions do ship a SpoofChecker class that implements the UTS #39 confusable checks.
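A minimal sketch of that check, assuming a reasonably recent ICU4J on the classpath:

import com.ibm.icu.text.SpoofChecker;

public class ConfusableDemo {
    public static void main(String[] args) {
        SpoofChecker checker = new SpoofChecker.Builder().build();

        String latin = "C";         // U+0043
        String cyrillic = "\u0421"; // С

        // areConfusable() returns a non-zero bit mask when the strings are confusable.
        System.out.println(checker.areConfusable(latin, cyrillic) != 0);

        // Skeletons can serve as normalization keys when comparing a whole dataset:
        // two strings with the same skeleton are visually confusable.
        System.out.println(checker.getSkeleton(cyrillic).equals(checker.getSkeleton(latin)));
    }
}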
When you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated as being similar in use to other code points; however, I'm not aware of any extensive list that covers visual similarities across scripts. You might want to search for URL spoofing using intentional misspellings, which was discussed when Punycode was devised. Other than that, your best bet might be to search the data for characters outside the expected script using regular expressions, and to compile a series of ad-hoc text fixers like text = text.replace(/с/, 'c').
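A small Java sketch of that ad-hoc approach; the replacement table below covers only the one pair from the question and is meant as a starting point to extend as more confusables turn up in the data:

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AdHocFixer {
    // Flag Cyrillic letters embedded in otherwise Latin text.
    private static final Pattern STRAY_CYRILLIC = Pattern.compile("\\p{IsCyrillic}");

    private static final Map<String, String> FIXES = Map.of(
            "\u0421", "C",  // CYRILLIC CAPITAL LETTER ES -> LATIN CAPITAL LETTER C
            "\u0441", "c"); // CYRILLIC SMALL LETTER ES   -> LATIN SMALL LETTER C

    public static String fix(String text) {
        for (Map.Entry<String, String> e : FIXES.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        String value = "Mos\u0441ow";   // "Moscow" typed with a Cyrillic с
        Matcher m = STRAY_CYRILLIC.matcher(value);
        while (m.find()) {
            System.out.printf("Stray Cyrillic at %d: U+%04X%n", m.start(), m.group().codePointAt(0));
        }
        System.out.println(fix(value)); // "Moscow"
    }
}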

Subset of Unicode normally used in writing?

What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
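For example, a simple filter over General Category values might look like the Java sketch below. Which categories to reject is a policy decision, not something the Unicode data decides for you; this version keeps letters, marks, numbers, punctuation, symbols, and spaces, and rejects control, format, surrogate, private-use, and unassigned code points.

import java.util.Set;

public class CategoryFilter {

    private static final Set<Integer> REJECTED = Set.of(
            (int) Character.CONTROL,      // Cc
            (int) Character.FORMAT,       // Cf
            (int) Character.SURROGATE,    // Cs
            (int) Character.PRIVATE_USE,  // Co
            (int) Character.UNASSIGNED);  // Cn

    public static boolean isAcceptable(int codePoint) {
        return !REJECTED.contains(Character.getType(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable('A'));     // true  (Lu)
        System.out.println(isAcceptable(0x4E2D));  // true  (Lo, the CJK ideograph 中)
        System.out.println(isAcceptable(0x200E));  // false (Cf, LEFT-TO-RIGHT MARK)
        System.out.println(isAcceptable(0x0007));  // false (Cc, BELL)
    }
}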
“I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters”
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski

Romanization of Unicode text

I am looking for a way to transliterate Unicode letter characters from any language into accented Latin letters. The intent is to allow foreigners to gain insight into the pronunciation of names and words written in any non-Latin script.
Examples:
Greek: Romanize("Αλφαβητικός") returns "Alphabētikós" (or "Alfavi̱tikós")
Japanese: Romanize("しんばし") returns "shimbashi" (or "sinbasi")
Russian: Romanize("яйца Фаберже") returns "yaytsa Faberzhe" (or "jajca Faberže")
It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek. It should be data-driven and extensible, using data from the Unicode Consortium, the USA, the EU, or the UN. The code should be open source, written in .NET or Java.
Does such a library exist?
The problem is a lot more complex than you think.
Greek, Cyrillic, Indic scripts, Georgian -> trivial, you could program that in an hour
Thai, Japanese Kana -> doable with a bit more effort
Japanese Kanji, Chinese -> these are not alphabets/syllabaries, so you're not in fact transliterating; you're looking up the pronunciation of each symbol in a hopefully large dictionary (EDICT and CCDICT should work), and a lot of the time you'll get it wrong unless you also consider the context, especially in Japanese
Korean -> technically an alphabet, but computers can only handle the composed characters, so you need another large database; I'm not aware of any
Arabic, Hebrew -> these languages don't write down short vowels, so a lot of times your transliteration will be something unreadable like "bytlhm" (Bethlehem). I'm not aware of any large databases that map Arabic or Hebrew words to their pronunciation.
You can use UnidecodeSharp:
It is a C# port of the Python Unidecode package, which is itself a port of Perl's unidecode.
(There are also PHP and Ruby implementations available.)
Usage:
using BinaryAnalysis.UnidecodeSharp;

// ... then, anywhere in your code:

string _Greek = "Αλφαβητικός";
MessageBox.Show(_Greek.Unidecode());    // ASCII approximation of the Greek word

string _Japan = "しんばし";
MessageBox.Show(_Japan.Unidecode());    // kana transliterated to romaji

string _Russian = "яйца Фаберже";
MessageBox.Show(_Russian.Unidecode());  // Cyrillic transliterated to Latin
I hope it will work well for you.
I am unaware of any open source solution here beyond ICU. If ICU works for you, great. If not, note that I am the CTO of a company that sells a commercial product for this purpose, one that can deal with the icky cases like Chinese words, multiple readings in Japanese, and the incomplete orthography of Arabic.
The Unicode Common Locale Data Repository has some transliteration mappings you could use.
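A short Java sketch of those transforms through ICU4J's Transliterator; the transform IDs are the standard CLDR ones, and the exact romanizations you get back depend on the CLDR version bundled with ICU:

import com.ibm.icu.text.Transliterator;

public class RomanizeDemo {
    public static void main(String[] args) {
        Transliterator toLatin = Transliterator.getInstance("Any-Latin");
        Transliterator toAscii = Transliterator.getInstance("Any-Latin; Latin-ASCII");

        System.out.println(toLatin.transliterate("Αλφαβητικός"));   // Greek to accented Latin
        System.out.println(toLatin.transliterate("яйца Фаберже"));  // Russian to accented Latin
        System.out.println(toAscii.transliterate("Αλφαβητικός"));   // accents stripped to plain ASCII

        // Han characters are romanized one at a time (via Han-Latin), without the
        // dictionary context a proper Chinese or Japanese romanizer needs.
        System.out.println(toLatin.transliterate("新橋"));
    }
}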