How to determine the simplified Unicode variant of a semantic variant of a traditional Chinese character?

As mentioned in an answer to Simplified Chinese Unicode table, the Unihan database specifies whether a traditional character has a simplified variant (kSimplifiedVariant). However, some characters have semantic variants (kSemanticVariant) which themselves have simplified variants. For example U+8216 舖 has a semantic variant U+92EA 鋪 which in turn has a simplified variant U+94FA 铺.
Should traditional to simplified mappings convert U+8216 to U+94FA?
If so, what's the easiest way of generating or downloading the full mapping, given that the Unihan database does not list U+94FA as a kSimplifiedVariant directly for U+8216, only for the intermediate form U+92EA?
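For reference, one way to generate the kind of chained mapping the question asks about is to walk Unihan_Variants.txt and, for each character, take its direct kSimplifiedVariant entries plus the kSimplifiedVariant entries of each of its kSemanticVariant characters. A rough Python sketch under those assumptions (the file path, the one-hop chaining, and the helper names are illustrative, not anything prescribed by Unihan):

from collections import defaultdict

def load_variants(path="Unihan_Variants.txt"):
    """Parse Unihan_Variants.txt into {codepoint: {field: [target codepoints]}}."""
    variants = defaultdict(dict)
    for line in open(path, encoding="utf-8"):
        if not line.startswith("U+"):
            continue  # skip comments and blank lines
        cp, field, value = line.rstrip("\n").split("\t")
        # Values look like "U+92EA" or "U+92EA<kLau,kMatthews"; keep only the codepoints.
        targets = [int(v.split("<")[0][2:], 16) for v in value.split()]
        variants[int(cp[2:], 16)][field] = targets
    return variants

def simplified_forms(cp, variants):
    """Direct kSimplifiedVariant entries, plus those reached via one kSemanticVariant hop."""
    found = set(variants.get(cp, {}).get("kSimplifiedVariant", []))
    for sem in variants.get(cp, {}).get("kSemanticVariant", []):
        found.update(variants.get(sem, {}).get("kSimplifiedVariant", []))
    return found

# variants = load_variants()
# {hex(c) for c in simplified_forms(0x8216, variants)}  # expected to include 0x94FA (铺)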

Related

How is this mixed-character string split on Unicode word boundaries?

Consider the string "abc를". According to Unicode's demo implementation of word segmentation, this string should be split into two words, "abc" and "를". However, three different Rust implementations of word boundary detection (regex, unic-segment, unicode-segmentation) all disagree and group that string into one word. Which behavior is correct?
As a follow-up, if the grouped behavior is correct, what would be a good way to scan this string for the search term "abc" in a way that still mostly respects word boundaries (for the purpose of checking the validity of string translations)? I'd want to match something like "abc를" but not something like "abcdef".
I'm not so certain that the demo for word segmentation should be taken as the ground truth, even if it is on an official site. For example, it considers "abc를" ("abc\uB97C") to be two separate words but considers "abc를" ("abc\u1105\u1173\u11af") to be one, even though the former decomposes to the latter.
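For what it's worth, that decomposition claim is easy to verify with Python's unicodedata module; this only checks that the precomposed syllable decomposes to those jamo, not what the demo does with it:

import unicodedata

# U+B97C 를 canonically decomposes to the jamo sequence U+1105 U+1173 U+11AF.
print(unicodedata.normalize("NFD", "abc\uB97C") == "abc\u1105\u1173\u11af")  # True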
The idea of a word boundary isn't exactly set in stone. Unicode has a Word Boundary specification which outlines where word breaks should and should not occur. However, it has an extensive notes section elaborating on other cases (emphasis mine):
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
...
My understanding is that the crates you list are following the spec without further contextual analysis. Why the demo disagrees I cannot say, but it may be an attempt to implement one of these edge cases.
To address your specific problem, I'd suggest using Regex with \b for matching a word boundary. This unfortunately follows the same Unicode rules and will not consider "를" to be a new word. However, this regex implementation offers an escape hatch to fall back to ASCII behaviour. Simply use (?-u:\b) to match a non-Unicode boundary:
use regex::Regex;

fn main() {
    // (?-u:\b) is an ASCII word boundary, so the match ends right before "를"
    // but is rejected inside "abcdef".
    let pattern = Regex::new("(?-u:\\b)abc(?-u:\\b)").unwrap();
    println!("{:?}", pattern.find("some abcdef abc를 sentence"));
}
You can run it for yourself on the playground to test your cases and see if this works for you.

How do I extract Unicode normalization tables from the XML Unicode Character Database?

I'm writing a script to create tables containing Unicode characters for case folding, etc.
I was able to extract those tables just fine, but I'm struggling to figure out which properties to use to get the codepoints for normalization.
The closest property group I can find in Unicode Standard Annex #44 is NF(C|D|KC|KD)_QC, but that only tells you whether a string is already normalized; it still doesn't list the values I need to actually build the tables.
What am I doing wrong here?
Edit: I'm writing a C library to handle Unicode. This isn't a simple one-and-done, write-it-in-Python problem; I'm trying to write my own normalization (technically composition/decomposition) functions.
Edit 2: The decomposition property is "dm", but what about composition, and the compatibility (K) variants?
The Unicode XML database in the ucdxml directory is not authoritative. I'd suggest working with the authoritative files in the ucd directory. You'll need
the fields Decomposition_Type and Decomposition_Mapping from column 5 of UnicodeData.txt,
the field Canonical_Combining_Class from column 3, and
the composition exclusions from CompositionExclusions.txt.
If there's a decomposition type in angle brackets, it's a compatibility mapping (NFKD), otherwise it's a canonical mapping. Composition is defined in terms of decomposition mappings. See section 3.11 Normalization Forms of the Unicode standard and UAX #15 for details.
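As a rough illustration of reading those fields, here is a minimal Python sketch; the file paths are assumed, and the NFC "primary composite" logic is only hinted at in the final comment rather than implemented:

def parse_unicode_data(path="UnicodeData.txt"):
    ccc = {}        # codepoint -> Canonical_Combining_Class (column 3)
    canonical = {}  # codepoint -> canonical decomposition (column 5, no angle brackets)
    compat = {}     # codepoint -> compatibility decomposition (column 5, <type> prefix)
    for line in open(path, encoding="utf-8"):
        fields = line.split(";")
        cp = int(fields[0], 16)
        if fields[3] != "0":
            ccc[cp] = int(fields[3])
        decomp = fields[5]
        if decomp:
            if decomp.startswith("<"):   # e.g. "<compat> 0020 0301": compatibility (NFKD only)
                compat[cp] = tuple(int(c, 16) for c in decomp.split(">", 1)[1].split())
            else:                        # canonical mapping: used by both NFD and NFKD
                canonical[cp] = tuple(int(c, 16) for c in decomp.split())
    return ccc, canonical, compat

def parse_exclusions(path="CompositionExclusions.txt"):
    excluded = set()
    for line in open(path, encoding="utf-8"):
        line = line.split("#")[0].strip()  # drop comments
        if line:
            excluded.add(int(line, 16))
    return excluded

# For NFC, the composable pairs are the canonical two-element decompositions whose
# codepoint is not in the exclusion set (plus the singleton and non-starter rules in UAX #15).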

Unicode comparison of Cyrillic 'С' and Latin 'C'

I have a dataset which mixes use of the Unicode characters U+0421 'С' and U+0043 'C'. Is there some sort of Unicode comparison which considers those two characters the same? So far I've tried several ICU collations, including the Russian one.
There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables” – characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You would probably need to code your own use of this data, as ICU does not seem to have anything related to the confusable concept.
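If it helps, here is a minimal sketch of that do-it-yourself route in Python, using the confusables.txt data file from UTS #39 (downloaded separately; the skeleton-and-compare idea follows the standard, but the parsing details are my own assumption about the file layout):

import unicodedata

def load_confusables(path="confusables.txt"):
    """Map each source character to its prototype string from confusables.txt."""
    table = {}
    for line in open(path, encoding="utf-8-sig"):  # the file starts with a BOM
        line = line.split("#")[0].strip()          # drop comments
        if not line:
            continue
        source, target, _kind = [f.strip() for f in line.split(";")]
        table[chr(int(source, 16))] = "".join(chr(int(c, 16)) for c in target.split())
    return table

def skeleton(s, table):
    """UTS #39 'skeleton': NFD, map confusables, NFD again."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)

# table = load_confusables()
# skeleton("\u0421", table) == skeleton("\u0043", table)  # Cyrillic Es vs Latin C -> True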
When you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated as being similar in use to other codepoints; however, I'm not aware of any extensive list that covers visual similarities across scripts. You might want to search for URL spoofing using intentional misspellings, which was discussed when punycode was devised. Other than that, your best bet might be to search the data for characters outside the expected range using regular expressions, and compile a series of ad-hoc text fixers like text = text.replace /с/, 'c'.

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE:
I have found that there is another encoding called GB 2312 (http://en.wikipedia.org/wiki/GB_2312) which contains only simplified characters. Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode (http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt), but I'm not sure if it's accurate or not.
If that table isn't correct, maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2:
This site also provides a GB/Unicode table and even a Java program to generate a file with all the GB characters as well as the Unicode equivalents:
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a traditional/simplified pair looks like this:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
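A small sketch of that approach in Python; the file name cedict_ts.u8 is an assumption, and the line format is taken to be the "traditional simplified [pinyin] /gloss/" layout shown above, with a comment header starting with #:

def simplified_charset(path="cedict_ts.u8"):
    """Collect every character appearing in the simplified (second) column."""
    chars = set()
    for line in open(path, encoding="utf-8"):
        if line.startswith("#"):
            continue  # skip the comment header
        traditional, simplified = line.split(" ", 2)[:2]
        chars.update(simplified)
    return chars

# Note: this set also contains characters shared with traditional text,
# not only the exclusively simplified ones.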
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stack Overflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, and also that these are UTF-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why, it's not come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list, or run two regexes: one to check the text is Chinese, and another to check it is not simplified Chinese (a sketch of that follows).
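A hedged illustration of that two-regex check in Python; SIMPLIFIED_ONLY below uses a tiny hand-picked sample of exclusively simplified characters as a stand-in for the full pastebin ranges:

import re

CJK = re.compile(r"[一-龥]")               # rough "contains Chinese" test from the answer above
SIMPLIFIED_ONLY = re.compile(r"[们东书]")  # placeholder: substitute the full simplified-only ranges

def looks_traditional_only(text):
    """True if the text contains Chinese but none of the listed simplified-only characters."""
    return bool(CJK.search(text)) and not SIMPLIFIED_ONLY.search(text)

# looks_traditional_only("我的氣墊船充滿了鱔魚")  # True with this tiny sample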
According to Wikipedia, whether a character displays as simplified Chinese, traditional Chinese, kanji, or another form is in many cases left up to the font rendering. So while you could have a selection of simplified Chinese codepoints, this list would not be at all complete, since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF.

How to normalize CodePage data to Unicode Form C when the diacritic precedes the base character and the accent is not the combining form

I would like to be able to say "Normalize this string by forcing diacritic accents into their combining form".
Details:
My code is being developed in C#, but I don't believe the issue to be language-specific.
There are two problems with my data: (1) the diacritic precedes the base character in this data (it needs to follow the base character in Unicode forms D or KD), and (2) the accent diacritic in my data is a Greek Tonos (U+0384) but needs to be the combining form (U+0301) in order to normalize.
I would like to do this programmatically. I would think that this type of operation should be well known but I did not find support in the C# Globalization methods (There are normalization methods but there is no way to force the diacritic accents into their combining form).
I do not think that the C# Globalization methods can help you here. The issue, as you pointed out, is that U+0384 is not a combining character. It is a character by itself. This can also be seen from its compatibility decomposition (to U+0020 U+0301). The data set most likely comes from a source that would display the tonos as a diacritic on the next character. This is not "proper" according to the Unicode spec, so you'll have to convert the data yourself. I have run into a similar issue with the apostrophe; sometimes applications use the right quotation mark instead.
The data conversion is not hard; I'm sure you can code that up.
I would have a stateful converter and stream the data through it. When U+0384 is detected, it does not get emitted; you switch to the "tonos" state and emit U+0301 after the NEXT character. There are error conditions to be handled (runs of U+0384, end of data while in the "tonos" state).
This data can be normalized with the usual APIs.
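For what it's worth, here is a minimal sketch of that state machine (in Python rather than C#, since the logic is language-agnostic; the error handling mentioned above is only noted in a comment):

import unicodedata

def fix_tonos(text):
    out = []
    pending = False
    for ch in text:
        if ch == "\u0384":        # GREEK TONOS seen: hold it back
            pending = True        # (runs of U+0384 and a trailing tonos still need handling)
            continue
        out.append(ch)
        if pending:
            out.append("\u0301")  # emit COMBINING ACUTE ACCENT after the base character
            pending = False
    return unicodedata.normalize("NFC", "".join(out))

# fix_tonos("\u0384\u03b5") -> "\u03ad" (GREEK SMALL LETTER EPSILON WITH TONOS)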
Good Luck.