NSString unicode encoding problem - iphone

I'm having problems converting the string to something readable. I'm using
NSString *substring = [NSString stringWithUTF8String:[symbol.data cStringUsingEncoding:NSUTF8StringEncoding]];
but I can't convert \U7ab6\U51b1 into '
It shows as 窶冱 which is what I don't want, it should show as an '. Can anyone help me?

it is shown as a ’
That's character U+2019 RIGHT SINGLE QUOTATION MARK.
What has happened is you've had the character sequence ’s submitted to you, in the UTF-8 encoding, which comes out as bytes:
’ s
E2 80 99 73
That byte sequence has then, incorrectly, been interpreted as if it were encoded in Windows code page 932 (Japanese; more or less Shift-JIS):
E2 80 99 73
窶 冱
So in this one particular case, you could recover the ’s string by firstly encoding the characters into cp932 bytes, and then decoding those bytes back to characters using UTF-8.
However, this will not solve your real problem, which is that the strings were read in incorrectly in the first place. You got 窶冱 in this case because the UTF-8 byte sequence resulting from encoding ’s happened also to be a valid Shift-JIS byte sequence. But that won't be the case for all possible UTF-8 byte sequences you might get. Many other characters will be unrecoverably mangled.
You need to find where bytes are being read into the system and decoded as Shift-JIS, and fix that to use UTF-8 instead.
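To see the mix-up in isolation, here is a minimal Python sketch that replays it. The variable names and the `survives_cp932` helper are made up for illustration; the only assumption is that the wrong decoder behaves like Python's `cp932` codec.

```python
# The text as it was meant: U+2019 RIGHT SINGLE QUOTATION MARK followed by "s".
original = "\u2019s"

# Step 1: the sender encodes it as UTF-8 (bytes E2 80 99 73).
utf8_bytes = original.encode("utf-8")

# Step 2: the receiver wrongly decodes those bytes as cp932.
mojibake = utf8_bytes.decode("cp932")

# One-off repair: undo the wrong decode, then decode as UTF-8.
repaired = mojibake.encode("cp932").decode("utf-8")


def survives_cp932(text):
    """Return True if text can be recovered after a wrong cp932 decode."""
    try:
        wrong = text.encode("utf-8").decode("cp932")
        return wrong.encode("cp932").decode("utf-8") == text
    except UnicodeError:
        # The UTF-8 bytes were not a valid cp932 sequence.
        return False


print([hex(ord(ch)) for ch in mojibake])  # the two code points from the question
print(repaired == original)
print(survives_cp932("\u2019 "))  # quote followed by a space: 99 20 is not valid cp932
```

The last line is the reason the round trip is not a real fix: as soon as the UTF-8 bytes do not happen to form valid cp932, the wrong decode fails or loses data.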

Related

I have a problem converting a Unicode character into two hexadecimal bytes in UTF-8

I have a problem and I hope you can help me... basically I just started UTF-8 and Unicode, the professor wrote a text file, he wrote "ciaò" inside and showed us the content, displaying each character in hexadecimal (for example the 'c' is 0063, the 'i' is 0069, the 'a' is 0061). The problem is the 'ò' character, which is formed by 2 bytes in UTF-8: c3; b2 (hex). The exercise he gave us is to verify that in UTF-8 the 'ò' character is written just like that (for the resolution he advised us to look at the Unicode website).
I tried to do the exercise this way: I saw that the character 'ò' in hex is 00F2, I transformed it into binary (11110010) and I formed the two bytes of UTF-8 by filling the bytes to complete them: |110|11110| and |10|010000|. The problem is that this way I get the following values: DE (instead of c3 for the first byte) and 90 (instead of b2 for the second byte). Can someone please explain where I went wrong?
For the character "ò", the Unicode code point is U+00F2 (its UTF-16 representation is also 00F2), and its UTF-8 representation is C3 B2. I don't think you can use 00F2 directly to represent it in UTF-8.
To verify C3B2 is "ò", you can check a website like this one, or if you are using a linux-like terminal you can write:
echo -e "\xC3\xB2"
Which should simply print "ò".
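As for where DE 90 came from: a two-byte UTF-8 sequence carries 11 bits, so 11110010 has to be padded on the left to 00011110010 and then split 5 + 6 into 00011 and 110010. That gives 110|00011 = C3 and 10|110010 = B2. Padding on the right instead (11110 and 010000) is what produces DE 90. A small sketch of the calculation in Python (`utf8_two_bytes` is just an illustrative name, and it only handles the two-byte range):

```python
def utf8_two_bytes(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 only covers U+0080..U+07FF")
    high5 = code_point >> 6        # top 5 of the 11 payload bits
    low6 = code_point & 0b111111   # bottom 6 payload bits
    # Prefix 110 for the lead byte, prefix 10 for the continuation byte.
    return bytes([0b11000000 | high5, 0b10000000 | low6])


encoded = utf8_two_bytes(0x00F2)
print(encoded.hex())                      # c3b2
print(encoded == "ò".encode("utf-8"))     # True
```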

how to calculate URL encoding for characters outside the ASCII character set?

I know that for ASCII characters the URL encoding is just a percentage sign and a hex number that corresponds to the character. But for characters outside that range, hex encoding consists of two or more %hex-number sequences.
For example, for the character that corresponds to hex value 56CE, URL encoding, according to standard .net/java APIs is not %56CE but "%e5%9b%8e"
So if we know the hex value for a character outside the ASCII character range, how is the URL encoding calculated? In other words, how does e5, 9b, 8e come out of 56CE? I tried converting to binary and did see a pattern for the last 2 numbers (%9b, %8e) but have no idea where the %e5 comes from.
You have to encode the Unicode codepoints into charset bytes first, and then you can url-encode those bytes. In your example, E5 9B 8E are the UTF-8 encoded bytes of Unicode codepoint U+56CE, and then %E5%9B%8E is the url encoded form of the UTF-8 bytes.
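The two steps can be checked with a short Python sketch. It uses the standard `urllib.parse.quote`, which encodes to UTF-8 by default; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

char = "\u56ce"  # the character with code point U+56CE

# Step 1: encode the code point into UTF-8 bytes.
utf8_bytes = char.encode("utf-8")

# Step 2: percent-encode each byte.
manual = "".join(f"%{byte:02X}" for byte in utf8_bytes)

# The library does both steps at once.
library = quote(char)

print(utf8_bytes.hex())           # e59b8e
print(manual)                     # %E5%9B%8E
print(library == manual)          # True
print(unquote(manual) == char)    # True
```

In binary, U+56CE is 0101 0110 1100 1110. The three-byte UTF-8 form splits it 4 + 6 + 6: 1110|0101 = E5, 10|011011 = 9B, 10|001110 = 8E, which is where the E5 comes from.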

What character encoding is c3 82 c2 bf?

I have a source of text data that includes the byte sequence c3 82 c2 bf. In context I think it's supposed to be a capital Greek Phi symbol (Φ).
Anyway I can't figure out what encoding is being used; I'm writing a Python script to process this data into a database that expects Unicode, and it throws an exception on this particular sequence of data.
Any suggestions on how to handle it?
Interpreted as UTF-8, c3 82 is “Â” U+00C2 and c2 bf is “¿” U+00BF, which does not make much sense, but it’s technically valid UTF-8 data, so it should not be reported as character-level data error. Interpreted as UTF-16, it’s Hangul syllables and possibly a CJK ideograph, depending on endianness, but still formally valid data, though most probably not what was meant.
This sounds like the result of double conversion, but it’s difficult to make educated guesses. If it stands for Φ, then the UTF-16 form is 03 A6 or A6 03 and the UTF-8 form is CE A6, which don’t really resemble the actual data. Information about the origin of the data might help in guessing what transcodings may have happened.
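It's probably a double conversion: text that was already UTF-8 got converted from LATIN-1 to UTF-8 a second time.
Take the Ñ character as an example. In UTF-8 it is 0xc391.
If you convert from LATIN-1 to UTF-8 an Ñ that is already encoded in UTF-8, each byte is treated as its own character and you get 0xc383c291, or 0xc383c2bf if the converter substitutes ¿ for the second byte.
Why?
0xc383 is the UTF-8 translation of the LATIN-1 character 0xc3, Ã (A with tilde).
0xc2bf is the ¿ character, which is what some converters emit when they can't convert a character from LATIN-1 (0x91 is a C1 control code in LATIN-1, not a printable character).
Note that the sequence in the question starts with 0xc382, not 0xc383. 0xc382 is the UTF-8 translation of the LATIN-1 character 0xc2, Â (A with circumflex), so the bytes that were converted twice were 0xc2bf, which is ¿ in UTF-8, rather than Ñ.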
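A double-conversion theory is easy to replay in Python. The sketch below assumes the second conversion treats every byte as a LATIN-1 character and substitutes nothing; the function names are made up for illustration.

```python
def double_encode(text):
    """UTF-8 encode, misread the bytes as Latin-1, then UTF-8 encode again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def undo_double_encode(data):
    """Reverse one layer of the double conversion above."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


print(double_encode("\u00d1").hex())   # Ñ: c383c291
print(double_encode("\u00bf").hex())   # ¿: c382c2bf
print(undo_double_encode(bytes.fromhex("c382c2bf")) == "\u00bf")  # True
```

Under these assumptions, c3 82 c2 bf is what a doubly converted ¿ looks like. Since ¿ is a common substitution character, the original Φ may already have been lost in an earlier conversion to an encoding that cannot represent it.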
FWIW, I ended up with c3 82 c2 bf from . I did not dig into the transformations because I was able to simply throw that part of the code away. Suffice it to say that was in an html email template that was processed by a wordpress (php) plugin.
I don't know the reason, but here is a possible scenario.
A byte of the binary form 10xxxxxx is converted to 0xC2 followed by 10xxxxxx.
A byte of the binary form 11xxxxxx is converted to 0xC3 followed by 10xxxxxx.
So lots of c2 and c3 bytes get added.
Where does this happen? If you send non-ASCII characters in a URL query string for an AJAX call, the Flask server will do this.
I received this character \xc3\x82 from an external UTF-16 document after conversion to UTF-8 using $str = mb_convert_encoding($content, "UTF-8" , "UTF-16LE"); (PHP)
The original sequence was 0xA0 0x00, and the converter probably converted it to what it was meant to be, an NBSP. It was the character used as the thousands separator in a currency number. NBSP is \xc2\xa0, so right now I remove the thousands separator like this:
$price = str_replace(["\xc2\xa0","\xc3\x82"], '', $price);

How to detect CString text encoding in iphone/iPad?

I have a mixed set of C strings in different text encodings.
Since I do not know the original encoding of each C string, how can I detect its text encoding on iPhone/iPad?
Thanks.
You cannot solve this problem in the general case without some additional information, because the same string could be valid in multiple encodings. For example, the hex values 48 45 4C 4C D4 equate to "HELLÔ" in iso-8859-1, and "HELLт" in the KOI8-R encoding. Any of the 8-bit encodings are going to be pretty much indistinguishable, unless you start getting into heuristics like doing dictionary checks (hmmm... looks like Bulgarian).
One strategy is to try utf-8 first, and then fall back on a designated 8-bit encoding (e.g., iso-8859-1) if the input fails to decode as utf-8. (With utf-8, there are byte sequences that are invalid, so there's a good chance that a string in some arbitrary 8-bit encoding will throw an error if you try to decode it as utf-8).
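The idea is language-independent, so here is a minimal sketch of it in Python rather than Objective-C. `decode_with_fallback` is a hypothetical helper, and iso-8859-1 as the fallback is just the example choice from above.

```python
def decode_with_fallback(data, fallback="iso-8859-1"):
    """Try UTF-8 first; if the bytes are not valid UTF-8, use the fallback."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a character, so this cannot fail,
        # but it can still be the wrong guess.
        return data.decode(fallback), fallback


ambiguous = bytes.fromhex("48454c4cd4")

print(decode_with_fallback(ambiguous))                  # ('HELLÔ', 'iso-8859-1')
print(decode_with_fallback("HELLÔ".encode("utf-8")))    # ('HELLÔ', 'utf-8')
print(ambiguous.decode("koi8-r"))                       # HELLт
```

The last line shows the limit of the approach: the same bytes decode without error as KOI8-R too, so the fallback encoding has to come from knowledge about the data, not from the bytes themselves.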
The NSString class offers some encoding detection with +stringWithContentsOfFile:usedEncoding:error:, but it seems to be available only when loading from a file or URL. I'm not sure how many encodings it tries or how accurate it is.

IMAP message encoding problem

Some of the mail contents fetched from the IMAP server look like =C3=B6=C3=BC=C3=B6=C3=BC=C3=B6=C3=BC=. What kind of encoding is this? The mail header says the encoding is UTF-8, but when I decode with UTF-8 I get a scrambled message.
Any help is much appreciated.
Quoted-Printable
It is used to transmit 8-bit data over a 7-bit medium.
Each 8-bit byte is converted to three 7-bit characters in the form =XX, where XX is the hexadecimal value of the byte. The = character itself becomes =3D.
The length of a line is restricted to 76 characters. Soft line breaks are added to comply with this rule: a line that ends with = continues on the next line.
https://www.rfc-editor.org/rfc/rfc2045
http://en.wikipedia.org/wiki/Quoted-printable
Online Decoder
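As a sketch of how the decoding goes, here it is in Python using the standard `quopri` module. The sample body is the one from the question with a soft line break added in the middle, and it is assumed that the charset really is UTF-8 as the header says.

```python
import quopri

# Quoted-printable body; "=\r\n" is a soft line break that joins two lines.
body = b"=C3=B6=C3=BC=C3=B6=C3=BC=\r\n=C3=B6=C3=BC"

# Step 1: undo the transfer encoding to get the raw bytes back.
raw = quopri.decodestring(body)

# Step 2: only now decode the bytes using the charset from the header.
text = raw.decode("utf-8")

print(raw.hex())   # c3b6c3bcc3b6c3bcc3b6c3bc
print(text)        # öüöüöü
print(quopri.decodestring(b"1 + 1 =3D 2"))  # b'1 + 1 = 2'
```

Decoding the =XX text directly as UTF-8 skips step 1, which is why the message looks scrambled.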