How to automatically translate between traditional and simplified Chinese characters with Unicode? - unicode

Is there a way to translate programmatically from traditional to simplified Chinese characters? If so, how do you do it, does unicode offer a way? If not, why doesn't there exist a database with the mapping, is it not one-to-one? I know you can find a mirror image glyph from another glyph in Unicode, but can you find the simplified glyph from a traditional one?

It is indeed not one to one. My favorite example to explain this quickly is this:
Take the character for face, 面. So far so good, it's the same in Traditional and Simplified Chinese. However, 面 is also the simplified version of 麵, noodle (where the 面 part on the right is the phonetic part). So if you have 面, you have no way of knowing which is which.

Related

Unicode Character search/modification

So I am trying to combine some unicode characters to create something like this white symbol.
Currently I have ▝███ but am struggling to get anything closer.
https://www.compart.com/en/unicode/block/U+2580
Could anyone shed some light on the subject or provide other character combinations?
Am I barking up the wrong tree?
The Unicode characters you're looking for are in the Symbols for Legacy Computing Block. These include 1/3 offsets to do what you're trying to do. These have extremely poor font support, however. I don't know of any font that includes glyphs for them. You'd probably have to provide your own, in which case you might do better using a private code point or a custom ligature.

What are valid uses of U+0080 through U+009F?

I'm making a virtual computer with a custom font and programming environment (Mini Micro), all Unicode based. I have need for a few custom glyphs in my environment. I know about the Private Use Areas, but I'm wondering about the "control" code points at U+0080 through U+009F. I can't find any documentation on what these points are for beyond "control".
Would it be a gross abuse of Unicode to tuck a few of my custom glyphs in there? What would be a proper use of them?
Wikipedia lists their meaning. You get 2 of them for your use, U+0091 and U+0092.
The 0x80 - 0x9F range you referto to is generally called the C1 control characters. Like other control codes, the C1s are for code extension, and by their very nature, some are generally left open for further expansion and thus have only vague standardization.
The original and most comprehensive reference is probably ECMA-48 - up to the Fifth Edition in June 1991. (The link takes you to a free download in PDF format.)
For additional glyphs, C1 codes would not be appropriate. In effect, the whole idea of control codes is that they are the special case of non-graphical codes.
UNICODE has continued to evolve, with an emoji block that has a lot of "characters" you might not expect. Let's try one: 💎 it is officially called the GemStone Emoji. I used this copy/paste website to insert it, you might look to see if something you can use has been standardized in the Emoji code block.
One of the interesting things about the emoji characters is that they are double-wide, even in a fixed-width font.
Microsoft uses them for smart quotes the Euro and a few other symbols in its latin-1 extension cp1252. As this character encoding is frequently reported as latin-1 using these code points for other uses can cause problems, especially as latin-1 is supposed to be code point equivalent to Unicode. This Wikipedia page gives some history and the meanings of these control characters.

How can I normalize fonts?

Users sometimes use weird ASCII characters in a program, and I was wondering if there was a way to "normalize" it.
So basically, if the input ᴀʙᴄᴅᴇꜰɢ, the output would be ABCDEFG. Is there a dictionary that exists somewhere that does something like this? If not, is there a better method than just doing something like str.replace("ᴀ", "A") for all the different "fonts"?
This isn't a language specific question -- if something doesn't exist like this than I guess the next step is to create a dictionary myself.
Yes.
BTW—The technical terms are: Latin Capital Letters from the C0 Controls and Basic Latin block and the Latin Letter Small Capitals from the Phonetic Extensions block.
Anyway, the general topic for your question is Unicode confusables. The link is for a mapping. Uncode.org has more material on confusables and everything else Unicode.
(Normalization is always something to consider when processing Unicode text, but it doesn't particularly relate to this issue.)
Your example seems to involve unicode characters, not ASCII characters. Unicode normalization (FAQ) is a large and complex subject, with many difference equivalence classes of characters, depending on what you are trying to do.

Detect if character is simplified or traditional Chinese character

I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
update
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument that the characters regardless of their shape have the same meaning and therefore should be represented by the same code. Well, it's not meaningless to me because I am analyzing individual characters which doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
Characters that are traditional only.
Characters that are simplified only.
Characters that have been left untouched, and are available in both.
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduct that it is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduct that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.

What are the experiences with using unicode in identifiers

These days, more languages are using unicode, which is a good thing. But it also presents a danger. In the past there where troubles distinguising between 1 and l and 0 and O. But now we have a complete new range of similar characters.
For example:
ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ
With these, it is not that difficult to create some very hard to find bugs.
At my work, we have decided to stay with the ANSI characters for identifiers. Is there anybody out there using unicode identifiers and what are the experiences?
Besides the similar character bugs you mention and the technical issues that might arise when using different editors (w/BOM, wo/BOM, different encodings in the same file by copy pasting which is only a problem when there are actually characters that cannot be encoded in ASCII and so on), I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development and you should stick to it while writing code.
This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).
My experience with using unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). Source Safe doesn't like unicode, and when you find yourself manually fixing corrupted source files in Word you know something isn't right.
I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English, and even if they're not the world is used to the ANSI character set).
I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.
Now on to the difference between ANSI code pages and Unicode.
In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.
Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.
I would also recommend using ascii for identifiers. Comments can stay in a non-english language if the editor/ide/compiler etc. are all locale aware and set up to use the same encoding.
Additionally, some case insensitive languages change the identifiers to lowercase before using, and that causes problems if active system locale is Turkish or Azerbaijani . see here for more info about Turkish locale problem. I know that PHP does this, and it has a long standing bug.
This problem is also present in any software that compares strings using Turkish locales, not only the language implementations themselves, just to point out. It causes many headaches
It depends on the language you're using. In Python, for example, is easierfor me to stick to unicode, as my aplications needs to work in several languages. So when I get a file from someone (something) that I don't know, I assume Latin-1 and translate to Unicode.
Works for me, as I'm in latin-america.
Actually, once everithing is ironed out, the whole thing becomes a smooth ride.
Of course, this depends on the language of choice.
I haven't ever used unicode for identifier names. But what comes to my mind is that Python allows unicode identifiers in version 3: PEP 3131.
Another language that makes extensive use of unicode is Fortress.
Even if you decide not to use unicode the problem resurfaces when you use a library that does. So you have to live with it to a certain extend.