How to find the missing components of Chinese characters within Unicode? - unicode

I am currently working on the decomposition of Chinese characters (Japanese kanji, to be more exact) and I have found a few components that are seemingly either not included in the Unihan database or they cannot be properly displayed with any font I am aware of. Is there some way to locate these characters within UTF-8 or UTF-16 and make them to be properly displayed in their character form? The list of components is provided below:
渋 ---> 氵+ 止 + ??? ... I have not managed to find those four dots in the Unihan database ... even here the authors had to encode the component ... the same issue appears in kanji 楽 and 摂 and 率
龍 ---> 𦚏 + ??? .... the component on the right hand side seems not to be in the Unicode ... the same goes for 拝 or 継
制 ---> ??? + 刂 ... left component seems not to be in the Unicode (the closest probably is 韦) ... the same goes for the kanji 段 ---> ??? + 殳
祭 ---> ??? + 示
留 ---> ??? + 田 (it is possible to decompose into three components 𠫔 + 刀 + 田, but two would be better)
Thank you very much for your advice :-)
I went through the whole Unihan database (over 90 000 chars) and did not manage to find the missing components. I tried installing various fonts Babel Stone Han, simch5100, etc. but their coverage of Unicode is not 100%. Nevertheless, I am afraid that some of these components are not included within Unicode by themselves and they can be displayed only as a part another character.

You may want to have a look at the IDS.TXT data file maintained by Andrew West (BabelStone), which provides Ideographic Description Sequences (IDS) for all the 97,058 CJK unified ideographs defined in Unicode version 15.0.
It makes use of about 120 "numbered components" which are characters not yet defined in Unicode (although it seems they may be added later on, according to some official proposal). They are currently represented by glyphs found in an associated Private Use Area (PUA) font named BabelStone Han PUA, which can be freely downloaded from the bottom of the page.
There is also one open-source application making extensive use of this data in a graphical way, called Unicopedia Sinica, available on GitHub.

Related

Unicode characters aren't combined properly

I am working with some Devanagari text data I want to display in the browser. Unfortunately, there's one combination of nonspacing combining characters that doesn't get rendered as a proberly combined character.
The problem occurs every time a base character is combined with the Devanagari Stress Sign Udatta ॑ (U+0951) and the Devanagari Sign Visarga ः (U+0903).
An example for this would be र॑ः, which is र (U+0930) + ॑ + ः and should be rendered as one character. But the stress sign and the other one don't seem to like each other (as you can see above!).
It's no problem to combine the base char with each of the other two signs alone, btw: र॑ / रः
I already tried to use several fonts which should be able to render Devanagari characters (some Noto fonts, Siddhanta, GentiumPlus) and tested it with different browsers, but the problem seems to be something else.
Does anyone have an idea? Is this not a valid combination of symbols?
EDIT: I just tried to switch around the two marks just to see what if - it renders as रः॑, so U+0951 and U+0903 don't seem to have the same function, as the stress sign gets rendered on top of the other mark.
It looks like i don't understand Unicode enough, yet.
This is NOT a solution for your problem, but might be useful information:
I am working with some Devanagari text data I want to display in the
browser.
Like you, I couldn't get this to work in any browser despite trying several fonts, including Arial Unicode MS:
The browser was simply rendering the text Devanagari Test: रः॑ from within the <body> of a JSP. The stress sign is clearly appearing above the Sign Visarga instead of the base character.
Is this not a valid combination of symbols?
It is a valid combination. I don't know Devanagari, so I don't know whether it is semantically "valid", but it is trivial to generate exactly the character you want from a Java application:
System.out.println("Devanagari test: \u0930\u0903\u0951");
This is the output from executing the println() call, showing the stress sign above the base character:
The screenshot above is from NetBeans 8.2 on Windows 10, but the rendering also worked fine using the latest releases of Eclipse and Intellij IDEA. The constraints are:
The three characters must be specified in that order in println() for the rendering to work.
The Sign Visarga and the Stress Sign Udatta must be presented in their Unicode form. Pasting their glyph representations into the source code won't work, although this can be done for the base character.
An appropriate font must be used for the display. I used Arial Unicode MS for the screen shot above, but other fonts such as Serif, SansSerif and Monospaced also worked.
Does anyone have an idea?
Unfortunately not, although it is clear that:
The grapheme you want to render exists, and is valid.
Although it won't render in a browser, it can be written to the console by a Java application.
The problem seems to be that all browsers apply the diacritic (Stress Sign Udatta) to the immediately preceding character rather than the base character.
See Why are some combining diacritics shifted to the right in some programs? for more information on this.

What character is this:?

EDIT
While posting the question, character I ask for was shown well to me, but after postig it does not show up anymore. As it does not appear, please look up in original site
EDIT2
I looked for Unicode chars associated with "alien", and found no matching ones. Here is how they are compared side by side:
I found, that some texts inside my database contain character like . I am not sure, how it would rendered with different fonts and environments, so here is the image, how I see it:
I tried to identify it with different ways. For example, when I paste it into Sublime Text, it automatically shows as control character <0x85>. When I tried to identify it in different unicode-detectors (http://www.babelstone.co.uk/Unicode/whatisit.html, https://unicode-table.com/en/, https://unicode-search.net/unicode-namesearch.pl), their conclusion is pretty match the same:
Uni­code code point char­acter U+0085
UTF-8 en­co­ding c2 85 hexa­decimal
194 133 deci­mal
0302 0205 octal
Uni­co­de char­ac­ter name <control>
Uni­co­de 1.0 char­act­er name (de­pre­ca­ted) NEXT LINE (NEL)
https://unicode-search.net/unicode-namesearch.pl
also included this information
HTML en­co­ding … … hexa­decimal
… … deci­mal
which gave me some vague hint, how it was possible, that … become ``. But this is not main problem here.
My question is: how is possible, that control character is shown up like this and what is the actual glyph used to represent it?
I tried to sketch into http://shapecatcher.com/ to identify it but without success. I did not find such a glyph in any Unicode table.
The alien symbol is not a Unicode character; but is in Microsoft's Webdings font, with character code 0x85. Running Start > Run > charmap, then selecting Webdings from the Font drop list, opens this window:
If I click that alien character in the leftmost column, the message Character Code : 0x85 is shown at the bottom of the window.
I can even copy that character from the Character Map and paste it into Microsoft Wordpad:
The WebDings symbols were included in Unicode Release 7: Pictographic symbols (including many emoji), geometric symbols, arrows, and ornaments originating from the Wingdings and Webdings sets. Therefore you would expect the alien symbol to also be in Unicode. However, I don't think the version of Webdings that was used included that alien symbol, since Windows 10 also has a ttf file for Webdings (version 5.01), and it also does not include the alien symbol:
So presumably what originally caught your attention was some text being rendered with an older version of the Webdings font which included that alien symbol.
The glyph is 👽 U+1F47D EXTRATERRESTRIAL ALIEN. I don't know why your system misrenders a control character.

How do I add a new Arabic vowel-sign in the PUA area of a font?

I am using Ubuntu 14.04, with FontForge compiled from the Git repo as of 31
July.
I'm trying to add a vowel-sign to an Arabic font, Graph, by Future Soft Egypt:
http://openfontlibrary.org/en/font/graph
I have added glyphs where the Unicode code-point already exists (eg peh,
U+067E), and that works fine. I am now trying to add a vowel sign where no
Unicode code-point exists - it is a "damma with tail", used by some writers in
Swahili to mean "o".
I decided to put it in the PUA at U+E909, and copied the font's damma (U+064F)
and added a tail:
http://kevindonnelly.org.uk/swahili/images/dammas.png
I generated the font, and set up the keyboard to emit that character.
The glyph comes up OK, but there are two problems, as can be seen here:
http://kevindonnelly.org.uk/swahili/images/output.png
showing at top "bubu", using the original damma, and at bottom "bobo", using
the new damma-with-tail.
(1) The damma-with-tail is too far to the left, even though the anchor points
in FF have not been moved.
(2) Worse, the damma-with-tail means that only the isolated versions of the
consonant glyphs get used - in the second line the two bs should be joined, as
in the first line.
I'm not sure whether this is a function of using the PUA, or whether it's due
to my missing some step I need to take in FF (eg the Encoding -> Add Encoding
Slots that needs to be done for the consonants), but if anyone could shed some
light on how to fix the two problems, I'd be very grateful.

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE :
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2 :
This site also provides a GB/Unicode table and even a Java program to generate a file
with all the GB characters as well as the Unicode equivalents :
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stackoverflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, but also that these are utf-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why, it's not come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list. Or run two regex's, one to check it is Chinese, but is not simplified Chinese
According to wikipedia simplified Chinese v. traditional, kanji, or other formats is left up to the font rendering in many cases. So while you could have a selection of simplified Chinese codepoints, this list would not be at all complete since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF

How do I use Unicode Character Combining with Kanji/Hanzi?

I'm trying to find a workaround to display old and rare characters in unicode using character combining. Currently I'm converting some dictionaries from EPWING into text and there are 36 different characters which cannot be reproduced using normal UTF-8. Below is the problem section of the epwing gaiji to unicode mappings for one of the dictionaries that I am converting, in some areas it has an interesting syntax that is clearly being used to combine characters in different ways. I was hoping if someone could identify what this syntax is, and where I might find documentation or a tutorial on how to use it.
s/<?w=b02a>/𡓦/g
s/<?w=b04b>/者/g
s/<?w=b064>/<⾱ 𤰇>/g
s/<?w=b077>/<彳<匕\/匕>>/g
s/<?w=b07c>/<山\/⺀>/g
s/<?w=b12e>/𥝝/g
s/<?w=b155>/</>/g
s/<?w=b156>/<\/>/g
s/<?w=b157>/<\/\/>/g
s/<?w=b158>/<こ[1]/と|ヿ>/g
s/<?w=b16f>/<㗢>/g
s/<?w=b170>/<㗥>/g
s/<?w=b171>/ଏ/g
s/<?w=b175>/lb/g
s/<?w=b22a>//g
s/<?w=b234>/ff/g
s/<?w=b25e>/㯌/g
s/<?w=b271>/<扌 晉>/g
s/<?w=b36b>/𣴴/g
s/<?w=b373>/𥝱/g
s/<?w=b42c>/𦼠/g
s/<?w=b434>/<已\/大>/g
s/<?w=b438>/𩸽/g
s/<?w=b43a>/𩺊/g
s/<?w=b43f>/<㇀/丶>/g
s/<?w=b440>/𠂆/g
s/<?w=b45a>/<?>/g
s/<?w=b45b>/<|>/g
s/<?w=b53d>/<?>/g
s/<?w=b53e>/<?>/g
s/<?w=b540>/<o>/g
s/<?w=b537>/<ト モ>/g
s/<?w=b541>/<一/𠔀>/g
s/<?w=b544>/<?>/g
s/<?w=b546>/<[r45]卐>/g
s/<?w=b55f>/*/g
I know that this line is supposed to represent 彳as a left vertical radical with one 匕 stacked on top of another 匕 as the right vertical portion of the character:
s/<?w=b077>/<彳<匕\/匕>>/g
This one is also pretty obvious, it's a 卐 rotated 45 degrees:
s/<?w=b546>/<[r45]卐>/g
Note: the four character hexadecimal codes that come after the ?w= is an identifier for the epwing gaiji that the unicode is supposed to correspond to.
Thank you for your time.
Please see The Unicode Standard section 12.2, Ideographic Description Characters. It discusses your precise situation.
Unfortunately, you may found that software support for what you are trying to do is practically non-existent.