Related (in fact, perhaps a duplicate of): how to extract characters from a Korean string in VBA
The linked question doesn't give me satisfactory answers and it's 2 years old so I'm making a new question.
I want to find the first symbol in a Korean glyph, ie. "한" -> "ㅎ" or "가" -> "ㄱ". I also want to recognize inputs that are already single symbols, such as "ㄱ".
I'm working with NSString, which I believe uses UTF-8. Do I have to convert the string to EUC-KR, then start reading bytes, or what?
As a disclaimer, I have no experience in working with iphone or NSString, except for what I've read in the documentation in order to answer this question. I'm addressing the question mainly as a unicode problem.
In order to find the first symbol (jamo) from a Korean glyph, you have to perform a decomposition as described in my answer to how to extract characters from a Korean string in VBA (it's a new answer so you didn't see it when you posted your question). To apply my answer (which is derived directly from the Unicode standard), you have to work with the Unicode code points (numerical values) of the Korean syllables. It looks like calling the method dataUsingEncoding passing NSUnicodeStringEncoding as a parameter should do the trick.
In order to identify single symbols, you have to check whether the Unicode code point of the character you are checking is in any of the following ranges:
1100-11FF (Hangul Jamo). I think this should cover most of the real life cases.
A960-A97F (Hangul Jamo Extended-A)
D7B0-D7FF (Hangul Jamo Extended-B)
3130-318F (Hangul Compatibility Jamo)
FFA0-FFDC (Halfwidth Jamo)
Check the Unicode Code Charts for a complete reference.
Related
I'm making a virtual computer with a custom font and programming environment (Mini Micro), all Unicode based. I have need for a few custom glyphs in my environment. I know about the Private Use Areas, but I'm wondering about the "control" code points at U+0080 through U+009F. I can't find any documentation on what these points are for beyond "control".
Would it be a gross abuse of Unicode to tuck a few of my custom glyphs in there? What would be a proper use of them?
Wikipedia lists their meaning. You get 2 of them for your use, U+0091 and U+0092.
The 0x80 - 0x9F range you referto to is generally called the C1 control characters. Like other control codes, the C1s are for code extension, and by their very nature, some are generally left open for further expansion and thus have only vague standardization.
The original and most comprehensive reference is probably ECMA-48 - up to the Fifth Edition in June 1991. (The link takes you to a free download in PDF format.)
For additional glyphs, C1 codes would not be appropriate. In effect, the whole idea of control codes is that they are the special case of non-graphical codes.
UNICODE has continued to evolve, with an emoji block that has a lot of "characters" you might not expect. Let's try one: 💎 it is officially called the GemStone Emoji. I used this copy/paste website to insert it, you might look to see if something you can use has been standardized in the Emoji code block.
One of the interesting things about the emoji characters is that they are double-wide, even in a fixed-width font.
Microsoft uses them for smart quotes the Euro and a few other symbols in its latin-1 extension cp1252. As this character encoding is frequently reported as latin-1 using these code points for other uses can cause problems, especially as latin-1 is supposed to be code point equivalent to Unicode. This Wikipedia page gives some history and the meanings of these control characters.
I want to use emojis in my iOS and Android app. I checked the list of emojis here and it lists out the hex code for the emojis. When I try to use the hex code such as U+1F600 directly, I don't see the emoji within the app. I found one other way of representing emoji which looks like \uD83D\uDE00. When using this notation, the emoji is seen within the app without any extra code. I think this is a Unicode string for the emoji. I think this is more of a general question that specific to emojis. How can I convert an emoji hex code to the Unicode string as shown above. I didn't find any list where the Unicode for the emojis is listed.
It seems that your question is really one of "how do I display a character, knowing its code point?"
This question turns out to be rather language-dependent! Modern languages have little trouble with this. In Swift, we do this:
$ swift
Welcome to Apple Swift version 3.0.2 (swiftlang-800.0.63 clang-800.0.42.1). Type :help for assistance.
1> "\u{1f600}"
$R0: String = "😀"
In JavaScript, it is the same:
$ node
> "\u{1f600}"
'😀'
In Java, you have to do a little more work. If you want to use the code point directly you can say:
new StringBuilder().appendCodePoint(0x1f600).toString();
The sequence "\uD83D\uDE00" also works in all three languages. This is because those "characters" are actually what Unicode calls surrogates and when they are combined together a certain way they stand for a single character. The details of how this all works can be found on the web in many places (look for UTF-16 encoding). The algorithm is there. In a nutshell you take the code point, subtract 10000 hex, and spread out the 20 bits of that difference like this: 110110xxxxxxxxxx110111xxxxxxxxxx.
But rather than worrying about this translation, you should use the code point directly if your language supports it well. You might also be able to copy-paste the emoji character into a good text editor (make sure the encoding is set to UTF-8). If you need to use the surrogates, your best best is to look up a Unicode chart that shows you something called the "UTF-16 encoding."
In Delphi XE #$1F600 is equivalent to #55357#56832 or D83D DE04 smile.
Within a program, I use it in the following way:
const smilepage : array [1..3] of WideString =(#$1F600,#$1F60A,#$2764);
JavaScript - two way
let hex = "😀".codePointAt(0).toString(16)
let emo = String.fromCodePoint("0x"+hex);
console.log(hex, emo);
I have a string that contains unicode escape codes, eg. #"D\u017cem" (\u017c is code for ż). I would like to convert that string to the one containg actual characters. In the example that would be #"Dżem".
Is there any method in SDK or library that can do such replacement AND work on iPhone?
(Obviously I can do the replacement myself, changing characters one by one, but it is rather cumbersome)
According to Apple,
It is not safe is to include high-bit characters in your source code
Note that the "universal character name" \u017c is replaced at compile time with an implementation-defined value which in practice is the UTF8 representation, so the end result is the same as you would get if you (correctly) did the replacement you are talking about. If you're having a problem with some other source-processing tool, you might be better served by teaching that tool to recognize C99 universal character names.
I suggest to start using NSLocalizedString()
http://www.pushplay.net/2009/08/developing-localized-iphone-applications/
http://developer.apple.com
Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE :
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2 :
This site also provides a GB/Unicode table and even a Java program to generate a file
with all the GB characters as well as the Unicode equivalents :
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stackoverflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, but also that these are utf-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why, it's not come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list. Or run two regex's, one to check it is Chinese, but is not simplified Chinese
According to wikipedia simplified Chinese v. traditional, kanji, or other formats is left up to the font rendering in many cases. So while you could have a selection of simplified Chinese codepoints, this list would not be at all complete since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF
I have a twelve-year-old Windows program. As may be obvious to the knowledgeable, it was designed for ASCII characters, not Unicode. Most of it has been converted, but there's one spot that still needs to be changed over. There is a serious constraint on it though: the exact same ASCII byte sequence MUST be created by different encoders, some of which will be operating on non-Windows systems.
I'm trying to determine whether UTF-8 will do the trick or not. I've heard in passing that different UTF-8 sequences can come up with the same Unicode string, which would be a problem here.
So the question is: given a Unicode string, can I expect a single canonical UTF-8 sequence to be generated by any standards-conforming implementation of a converter? Or are there multiple possibilities?
Any given Unicode string will have only one representation in UTF-8.
I think the confusion here is that there are multiple ways in Unicode to get the same visual output for some languages. Not to mention that Unicode has several characters that have no visual representation.
But this has nothing to do with UTF-8, its a property of Unicode itself. The encoding of a given Unicode as UTF-8 is a purely mechanical process, and it's perfectly reversible.
The conversion rules are here:
http://en.wikipedia.org/wiki/UTF-8
As John already said, there is only one standards-conforming UTF-8 representation.
But the tricky point is "standards-conforming".
Older encoders are usually unable to properly convert UTF-16 because of surrogates.
Java is one notable case of those non-conforming converters (it will produce two 3-bytes sequences instead of one 4-byte sequence).
MySQL had problems until recently, and I am not sure about the current status.
Now, you will only have problems with code points that need surrogates, meaning above U+FFFF. If you application survived without Unicode for a long time, it means you never needed to move such "esoteric" characters :-)
But it is good to get things right from the get go.
Try using standards-conforming encoders and you will be fine.