NSString UTF8String mangling unicode characters - iphone

When I run [NSString UTF8String] on certain unicode characters the resulting const char* representation is mangled both in NSLog and on the device/simulator display. The NSString itself displays fine but I need to convert the NSString to a cStr to use it in CGContextShowTextAtPoint.
It's very easy to reproduce (see code below) but I've searched for similar questions without any luck. Must be something basic I'm missing.
const char *cStr = [#"章" UTF8String];
NSLog(#"%s", cStr);
Thanks!

CGContextShowTextAtPoint is only for ASCII chars.
Check this SO question for answers.

When using the string format specifier (aka %s) you cannot be guaranteed that the characters of a c string will print correctly if they are not ASCII. Using a complex character as you've defined can be expressed in UTF-8 using escape characters to indicate the character set from which the character can be found. However the %s uses the system encoding to interpret the characters in the character string you provide to the formatting ( in this case, in NSLog ). See Apple's documentation:
https://developer.apple.com/library/mac/documentation/cocoa/Conceptual/Strings/Articles/formatSpecifiers.html
%s
Null-terminated array of 8-bit unsigned characters. %s interprets its input in the system encoding rather than, for example, UTF-8.
Going onto you CGContextShowTextAtPoint not working, that API supports only the macRoman character set, which is not the entire Unicode character set.
Youll need to look into another API for showing Unicode characters. Probably Core Text is where you'll want to start.

I've never noticed this issue before, but some quick experimentation shows that using printf instead of NSLog will cause the correct Unicode character to show up.
Try:
printf("%s", cStr);
This gives me the desired output ("章") both in the Xcode console and in Terminal. As nob1984 stated in his answer, the interpretation of the character data is up to the callee.

Related

use wcstombs() to convert wchar_t* (containing Unicode) to MBCS (char*) dependent on locale

My input is Unicode characters, eg:, "(U+00DB) (U+0081)" (wchar_t*). I use wcstombs to convert this wide char string into char * (MBCS). Since Unicode is already encoded in UTF-8 I am expecting it to return byte by byte copied sequence of Unicode as DB81 char*. But instead I get c3 9b. This is happening on Linux and on windows i get "DB 81" only.
I need to open a file with name DB 81 (as shown in hexdump), but fopen uses char* filename. Thus I have to convert this wchar_t* to MBCS. Please help!!
No, what you want to do is not what you think you should.
fopen(), under any circumstances - cannot handle all possible filenames on your system, because it lacks unicode support.
Please refer to http://www.utf8everywhere.org to see how to do it with _wfopen().

How do you add a macron to a character in an NSString? via Unicode?

Objective-C iOS Programming:
I need to display a number like 8.33333 just as 8.3, with the three having a macron (repeating number symbol, a bar line) above it. I have done some searching and have not found a solution to this. I have found the encoding for C/C++/Java source code being "\u0304" and for Unicode being "U+0304". Is there a way that I can create an NSString from a Unicode character? And how would a create a Unicode character with a macron?
Thanks.
For combining characters such as U+0304, the string should contain the original letter followed by the combining character. For instance,
NSString *str = #"ca\u0304t";
is a representation of cāt.

How to convert from unicode to ASCII

Is there any way to convert unicode values to ASCII?
To simply strip the accents from unicode characters you can use something like:
string.Concat(input.Normalize(NormalizationForm.FormD).Where(
c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
You CAN'T convert from Unicode to ASCII. Almost every character in Unicode cannot be expressed in ASCII, and those that can be expressed have exactly the same codepoints in ASCII as in UTF-8, which is probably what you have. Almost the only thing you can do that is even close to the right thing is to discard all characters above codepoint 128, and even that is very likely nowhere near what your requirements say. (The other possibility is to simplify accented or umlauted letters to make more than 128 characters 'nearly' expressible, but that still doesn't even begin to actually cover Unicode.)
Technically, yes you can by using Encoding.ASCII.
Example (from byte[] to ASCII):
// Convert Unicode to Bytes
byte[] uni = Encoding.Unicode.GetBytes("Whatever unicode string you have");
// Convert to ASCII
string Ascii = Encoding.ASCII.GetString(uni);
Just remember Unicode a much larger standard than Ascii and there will be characters that simply cannot be correctly encoded. Have a look here for tables and a little more information on the two encodings.
This workaround might better suit your needs. It strips the unicode chars from a string and only keeps the ASCII chars.
byte[] bytes = Encoding.ASCII.GetBytes("eéêëèiïaâäàåcç  test");
char[] chars = Encoding.ASCII.GetChars(bytes);
string line = new String(chars);
line = line.Replace("?", "");
//Results in "eiac test"
Please note that the 2nd "space" in the character input string is the char with ASCII value 255
It depends what you mean by "convert".
You can transliterate using the AnyAscii package.
// C#
using AnyAscii;
string s = "άνθρωποι".Transliterate();
// anthropoi
Well, seeing as how there's some 100,000+ unicode characters and only 128 ASCII characters, a 1-1 mapping is obviously impossible.
You can use the Encoding.ASCII object to get the ASCII byte values from a Unicode string, though.
If your metadata fields only accept ASCII input. Unicode characters can be converted to their TEX equivalent through MathJax. What is MathJax?
MathJax is a JavaScript display engine for rendering TEX or MathML-coded mathematics in browsers without requiring font installation or browser plug-ins. Any modern browser with JavaScript enabled will be MathJax-ready. For general information about MathJax, visit mathjax.org.

NSStream, UTF8String & NSString... Messy Conversion

I am constructing a data packet to be sent over NSStream to a server. I am trying to seperate two pieces of data with the a '§' (ascii code 167). This is the way the server is built, so I need to try to stay within those bounds...
unichar asciiChar = 167; //yields #"§"
[self setSepString:[NSString stringWithCharacters:&asciiChar length:1]];
sendData=[NSString stringWithFormat:#"USER User%#Pass", sepString];
NSLog(sendData);
const uint8_t *rawString=(const uint8_t *)[sendData UTF8String];
[oStream write:rawString maxLength:[sendData length]];
So the final outcome should look like this.. and it does when sendData is first constructed:
USER User§Pass
however, when it is received on the server side, it looks like this:
//not a direct copy and paste. The 'mystery character' may not be exact
USER UserˤPas
...the seperator string has become two in length, and the last letter is getting cropped from the command. I believe this to be cause by the UTF8 conversion.
Can anyone shed some light on this for me?
Any help would be greatly appreciated!
The correct encoding in UTF-8 for this character is the two-byte sequence 0xC2 0xA7, which is what you're getting. (Fileformat.info is invaluable for this stuff.) This is out of the LATIN-1 set, so you almost certainly want to be using NSISOLatin1StringEncoding rather than NSUTF8StringEncoding in order to get a single-byte 167 encoding. Look at NSString -dataUsingEncoding:.
What you have and what you want to transmit is not really a UTF-8 string, and it's technically not us-ascii, because that's only 7 bits. You want to transmit an arbitrary array of bytes, according to the protocol that you're working with. The two fields of the byte array, username and password, might themselves be UTF-8 strings, but with the 167 separator it cannot be a UTF-8 string.
Here are some options I see:
Construct the uint8_t* byte array using at least two different NSString objects plus the 167 code. This will be necessary if the username or password can possibly contain non-ascii characters.
Use the NSString method getBytes:maxLength:usedLength:encoding:options:range:remainingRange and set encoding to NSASCIIStringEncoding. If you do this you must validate elsewhere that your username and password is us-ascii only.
Use the NSString method getCString. However, that's been deprecated because you cannot specify the encoding you want.

I do replace literal \xNN with their character in Perl?

I have a Perl script that takes text values from a MySQL table and writes it to a text file. The problem is, when I open the text file for viewing I am getting a lot of hex characters like \x92 and \x93 which stands for single and double quotes, I guess.
I am using DBI->quote function to escape the special chars before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both the tables is latin1.
How do I get rid of those hex characters and get the character to show in the text file?
ISO Latin-1 does not define characters in the range 0x80 to 0x9f, so displaying these bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin1 except that it defines additional characters (including left/right quotes) in this range.
\x92 and \x93 are empty characters in the latin1 character set (see here or here). If you are certain that you are indeed dealing with latin1, you can simply delete them.
It sounds like you need to change the character sets on the tables, or translate the non-latin-1 characters into latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)