iPhone Emoji Unicode Encoding UTF - iphone

I'm encoding Unicode emoji characters into a string and loading that into a web view.
For the encoding I'm using NSUTF8StringEncoding, but the emoji characters just show up as gibberish Unicode symbols like ∆¬ÓÔÓ˝ˆ. So what encoding do I need?

NSUTF32StringEncoding did the trick; I didn't see it in the list of encodings before.
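One common cause of this kind of gibberish is that the bytes are written in one encoding and read back in another. Whether that is what happened in the web view above is an assumption. The sketch below uses Java purely for illustration (the question itself is about Objective-C), and the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class EmojiBytes {
    // Decode UTF-8 bytes with the wrong charset to reproduce the gibberish effect.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encode and decode with the same charset; the text survives intact.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, written as a surrogate pair
        System.out.println("UTF-8 bytes:  " + emoji.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("UTF-16 bytes: " + emoji.getBytes(StandardCharsets.UTF_16BE).length);
        System.out.println("round trip ok: " + roundTrip(emoji).equals(emoji));
        System.out.println("misdecoded ok: " + misdecode(emoji).equals(emoji));
    }
}
```

The point is only that the writer and the reader have to agree on the encoding; which encoding a given web view expects is not something this sketch can tell you.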

Related

Character Encodings compatibility with ASCII

I'm currently reading mails from file and processing some of the header information. Non-ASCII characters are encoded according to RFC 2047 in quoted-printable or Base64, so the files contain no non-ASCII characters. If the file is encoded in UTF-8, Win-1252 or one of the ISO-8859-* character encodings, I won't run into problems because ASCII is embedded at the same place in all these charsets (so 0x41 is an A in all of those charsets).
But what if the file is encoded using an encoding that does not embed ASCII in that way? Do encodings like this even exist? And if so, is there even a reliable way of detecting them?
Mozilla has a charset detector based on this very interesting article. It can detect a large number of different encodings. There is also a port to C# available on GitHub which I have used before. It turned out to be quite reliable. But of course, when the text contains only ASCII characters, it cannot distinguish between the different encodings that encode ASCII in the same way. Any encoding that encodes ASCII in a different way should be detected correctly with this library.

ASCII compatibles and not compatibles characters encoding

What is an example of a character encoding which is not compatible with ASCII and why isn't it?
Also, what are other encodings which have upward compatibility with ASCII (besides UTF and ISO8859, which I already know), and for what reason?
There are EBCDIC-based encodings that are not compatible with ASCII. For example, I recently encountered an email that was encoded using CP1026, aka EBCDIC 1026. If you look at its character table, letters and numbers are encoded at very different offsets than in ASCII. This was throwing off my email parser particularly because LF is encoded as 0x25 instead of as 0x0A in ASCII.
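To see the difference in offsets, the following sketch encodes the same letter in ASCII and in CP1026. It assumes the JVM ships the `IBM1026` charset (it is part of the extended charsets and may be missing on trimmed-down runtimes), so the helper returns -1 when it is unavailable. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

class EbcdicDemo {
    // First byte of the text in the named charset, or -1 if this JVM lacks the charset.
    static int firstByte(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        return text.getBytes(Charset.forName(charsetName))[0] & 0xFF;
    }

    public static void main(String[] args) {
        // 'A' is 0x41 in ASCII but sits at a different offset in EBCDIC code pages.
        System.out.printf("US-ASCII: 0x%02X%n", firstByte("A", "US-ASCII"));
        int ebcdic = firstByte("A", "IBM1026");
        if (ebcdic < 0) {
            System.out.println("IBM1026 is not available on this JVM");
        } else {
            System.out.printf("IBM1026:  0x%02X%n", ebcdic);
        }
    }
}
```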

How to convert ANSI text to Unicode?

I would like to convert RTF text to Unicode. In the RTF font table one can find the name of the font or font-face (eg. Arial Cyr, Courier Greek) and the charset to use with it (0-255). So how to write a function that converts a character code (0-255) with these settings to Unicode?
As I see, the post-tags like Greek, Cyr, Tur etc. affect the glyph of the displayed characters and the charset affects it too. So the function could have these input parameters:
fontname postfix, font charset, character code
But what is next? Or am I on the wrong way?
RTF was invented long before Unicode. It most certainly isn't ANSI text: the file itself uses only ASCII, with non-ASCII characters encoded in hex alongside a reference to the character set, which makes for a rather unholy mix of character sets. The mapping is also not perfect; many Unicode code points have no corresponding charset.
You'll spend a lifetime creating your own RTF to Unicode converter. Take advantage of an existing solution; most any platform has one. On Windows that would be the RichEdit control. If you use .NET then it is especially simple: use the RichTextBox class, assign its Rtf property and read back its Text property, which is UTF-16 encoded Unicode.
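For the narrow part of the question, turning one character code (0-255) plus a font charset into Unicode, a minimal Java sketch might look like this. It is not an RTF parser. The table covers only a handful of charset numbers, and mapping charset 0 to windows-1252 is an assumption (the real code page can depend on the document's settings). The class and method names are made up.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

class RtfCharsetDemo {
    // Partial table: font charset number -> Windows code page name.
    static final Map<Integer, String> CODE_PAGES = new HashMap<>();
    static {
        CODE_PAGES.put(0, "windows-1252");   // ANSI; assumed Western European here
        CODE_PAGES.put(161, "windows-1253"); // Greek
        CODE_PAGES.put(162, "windows-1254"); // Turkish
        CODE_PAGES.put(204, "windows-1251"); // Cyrillic
        CODE_PAGES.put(238, "windows-1250"); // Central European
    }

    // Decode one byte value (0-255) using the code page for the given font charset.
    static String decode(int fontCharset, int code) {
        String name = CODE_PAGES.get(fontCharset);
        if (name == null || !Charset.isSupported(name)) {
            throw new IllegalArgumentException("No code page for charset " + fontCharset);
        }
        return new String(new byte[] {(byte) code}, Charset.forName(name));
    }

    public static void main(String[] args) {
        System.out.println(decode(204, 0xC0)); // Cyrillic capital A
        System.out.println(decode(161, 0xE1)); // Greek small alpha
    }
}
```

The sketch keys on the charset number only; the font name suffix (Cyr, Greek, Tur) is not used.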

url encode non latin characters

I need to URL encode a non-latin string (japanese, chinese, or just non ascii characters in spanish/french/italian etc.). I can't find any encoders or snippets that deal with more than just ASCII characters to create a URL encoding. Is there a library or some feature I haven't found in OS that can create a fully compliant URL encoding from any UTF8 content?
Have you tried using
- (NSString *)stringByAddingPercentEscapesUsingEncoding:(NSStringEncoding)encoding
but specifying the appropriate NSStringEncoding related to the language choice?
You can see all the available string encodings here
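For comparison, here is how the same idea looks in Java (used only for illustration; the answer above is about the Objective-C method). The sketch assumes UTF-8 as the target encoding, which is what most servers expect today. `URLEncoder` produces form encoding, so the helper swaps `+` back to `%20`. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeDemo {
    // Percent-encode any text as UTF-8 bytes.
    static String encode(String text) {
        try {
            // URLEncoder turns spaces into '+'; a literal '+' in the input is
            // already escaped as %2B, so this replace only touches spaces.
            return URLEncoder.encode(text, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("\u65E5\u672C\u8A9E"));   // Japanese text
        System.out.println(encode("caf\u00E9 au lait"));    // accented Latin text
    }
}
```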

Japanese ASCII Code

Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana and katakana characters? I am writing a Java function and some JavaScript which determine whether a character is Japanese. What is its range in the ASCII code?
ASCII stands for American Standard Code for Information Interchange, only includes 128 characters (not all of them even printable), and is based on the needs of American use circa 1960. It includes nothing related to any Japanese characters.
I believe you want the Unicode code points for some characters, which you can look up in the charts provided by unicode.org.
Please see my similar question regarding Kanji/Kana characters. As @coobird mentions, it may be tricky to decide what range you want to check against, since many Kanji overlap with Chinese characters.
In short, the Unicode ranges for hiragana and katakana are:
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0–30FF
If you find this answer useful please upvote @coobird's answer to my question as well.
がんばって!
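The two ranges above translate directly into a range check. A minimal Java sketch, with made-up class and method names, and covering only hiragana and katakana (kanji are left out because of the overlap with Chinese mentioned above):

```java
class KanaRanges {
    // Hiragana block: U+3040 to U+309F.
    static boolean isHiragana(int codePoint) {
        return codePoint >= 0x3040 && codePoint <= 0x309F;
    }

    // Katakana block: U+30A0 to U+30FF.
    static boolean isKatakana(int codePoint) {
        return codePoint >= 0x30A0 && codePoint <= 0x30FF;
    }

    // True if any character in the text falls in either kana block.
    static boolean containsKana(String text) {
        return text.codePoints().anyMatch(cp -> isHiragana(cp) || isKatakana(cp));
    }

    public static void main(String[] args) {
        System.out.println(containsKana("abc"));                // Latin only
        System.out.println(containsKana("\u732B\u3067\u3059")); // kanji followed by hiragana
    }
}
```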
Well, it has been a while, but here's a link to tables of hiragana, katakana, kanji etc. and their Unicode code points...
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
BUT, as you probably know, Unicode code points are written in hexadecimal. You can translate them into decimal numbers using Windows Calc in programmer mode and then type that number as an Alt code, and it will produce the character you want, depending on what you're typing it into. It works in MS Wordpad and Word (not Notepad).
For example the hiragana ぁ is 3041 in Unicode. 3041 is hexadecimal and translates to 12353 in decimal. If you enter 12353 as an Alt code in Wordpad or Word, i.e. hold Alt, enter 12353 on the number pad, then release Alt, it will print ぁ. These are decimal Unicode values, not ASCII codes. The ranges of Japanese characters seem to be Hiragana: 3040-309f (12352-12447 in decimal), Katakana: 30a0-30ff (12448-12543 in decimal), Kanji: 4e00-9faf (19968-40879 in decimal), with rarer kanji in CJK Extension A at 3400-4dbf (13312-19903 in decimal), so there are several ranges. There's also a half-width katakana range on that chart.
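The hexadecimal-to-decimal step does not need a calculator; in Java it can be sketched like this (class and method names are made up for the example):

```java
class CodePointConvert {
    // Parse a hexadecimal code point such as "3041" into its decimal value.
    static int hexToDecimal(String hex) {
        return Integer.parseInt(hex, 16);
    }

    // Turn a code point into a String; this also works beyond U+FFFF.
    static String toCharacter(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int value = hexToDecimal("3041");
        System.out.println(value);              // decimal value of U+3041
        System.out.println(toCharacter(value)); // the character itself
    }
}
```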
Japanese characters won't be in the ASCII range, they'll be in Unicode. What do you want, just the char value for each character?
I won't rehash the ASCII part. Just have a look at the Unicode Code Charts.
Kanji will have a Unicode "Script" property of Hani, hiragana will have a "Script" property of Hira, and katakana have a "Script" property of Kana. In Java, you can determine the "Script" property of a character using the Character.UnicodeScript class: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html I don't know if you can determine a character's "Script" property in Javascript.
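A minimal sketch of that check with `Character.UnicodeScript.of` (available since Java 7). The wrapper class and the `isJapaneseScript` name are hypothetical, and as noted, a `HAN` result cannot distinguish Japanese from Chinese use.

```java
class ScriptDemo {
    // True if the code point's Script property is Han, Hiragana or Katakana.
    static boolean isJapaneseScript(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA;
    }

    public static void main(String[] args) {
        int[] samples = {0x732B, 0x3042, 0x30AB, 'A'}; // kanji, hiragana, katakana, Latin
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> "
                    + Character.UnicodeScript.of(cp) + " " + isJapaneseScript(cp));
        }
    }
}
```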
Of course, most kanji are characters that are also used in Chinese; given a character like 猫, it is impossible to tell whether it's being used as a Chinese character or a Japanese character.
I think what you mean by ASCII code for Japanese is the SBCS (Single-Byte Character Set) equivalent in Japanese. For Japanese you only have an MBCS (Multi-Byte Character Set), which combines single-byte and multi-byte characters. So for a Japanese text file saved in an MBCS you have non-Japanese characters (English letters, numbers and common non-alphanumeric characters) saved as one byte and Japanese characters saved as two bytes.
That assumes you are not referring to Unicode, which is often thought of as a uniform DBCS (Double-Byte Character Set) where each character is exactly two bytes. To be more correct, that only holds for UTF-16, and only for the original 16-bit range: once that range could no longer accommodate more characters, the characters beyond it came to be encoded in 4 bytes, as a pair of two-byte units with the first acting as the lead unit.
If you are referring to the first one (MBCS) and not Unicode, then there are a lot of Japanese character sets, such as Shift-JIS (the more popular one). So I suggest that you search for a Shift-JIS character map, although there are other Japanese character set maps aside from Shift-JIS.
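The byte counts can be checked by encoding a few sample characters. The sketch below assumes the JVM provides a `Shift_JIS` charset (it skips that line if not); the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengths {
    // Number of bytes the text occupies in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // Latin letter
            "\u3042",       // hiragana a
            "\uD842\uDFB7"  // U+20BB7, a kanji outside the 16-bit range
        };
        for (String s : samples) {
            System.out.println("UTF-8:  " + byteLength(s, StandardCharsets.UTF_8));
            System.out.println("UTF-16: " + byteLength(s, StandardCharsets.UTF_16BE));
        }
        if (Charset.isSupported("Shift_JIS")) {
            Charset sjis = Charset.forName("Shift_JIS");
            System.out.println("Shift_JIS A:        " + byteLength("A", sjis));
            System.out.println("Shift_JIS hiragana: " + byteLength("\u3042", sjis));
        }
    }
}
```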