NSStrings, C strings, pathnames and encodings in iPhone

I am using libxml2 in my iPhone app. I have an NSString that holds the pathname to an XML file. The pathname may include non-ASCII characters. I want to get a C string representation of the NSString to pass to xmlReadFile(). It appears that cStringUsingEncoding: gives me the representation I seek, but I am not clear on which encoding to use.
I wonder if there is a "default" encoding in iPhone OS that I can use here and ensure that I can roundtrip non-ASCII pathnames.

Use NSString's fileSystemRepresentation. If the string contains characters that are not representable in the file system's encoding then this method will raise an exception.
To convert back, use NSFileManager's stringWithFileSystemRepresentation:length:
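A minimal sketch of that round trip, assuming libxml2 is available (the pathname here is illustrative):

#import <Foundation/Foundation.h>
#include <libxml/parser.h>
#include <string.h>

NSString *path = @"/tmp/d\u017cem.xml"; // illustrative non-ASCII pathname ("dżem")

// fileSystemRepresentation raises NSCharacterConversionException if the
// string cannot be represented in the file system's encoding.
const char *fsPath = [path fileSystemRepresentation];
xmlDocPtr doc = xmlReadFile(fsPath, NULL, 0);

// Round-trip the C string back to an NSString:
NSString *back = [[NSFileManager defaultManager]
    stringWithFileSystemRepresentation:fsPath length:strlen(fsPath)];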

Related

NSString to NSData encoding considerations

I understand why you need to specify an encoding when going from NSData to NSString.
However, I'm finding it frustrating that the reverse (NSString to NSData) also needs an encoding specified.
In this related question the answers suggested using
NSUTF8StringEncoding or defaultCStringEncoding, with the latter not being fully explained.
So I just wanted to ask if the following is correct when converting NSString to NSData:
In cases where you want to be 100% sure the binary representation of the NSString object is UTF-8, use NSUTF8StringEncoding (or whatever encoding is needed).
In cases where the encoding of the NSString object is known/expected to already be of a certain type, and no conversion is required, it's safe (and perhaps internally faster) to use defaultCStringEncoding. (From what I have read, Objective-C uses UTF-16 internally; I'm not sure if LE or BE, but I'd assume LE because the platform is LE.)
TIA
The encoding needs to be specified for converting NSString to NSData for the same reason it needs to be specified going from NSData to NSString.
An NSData object is a wrapper for a string of absolutely raw bytes. If the NSString doesn't specify some encoding, it doesn't know what to write, because at the level of ones and zeroes, a UTF-16 encoding looks different from a UTF-8 encoding of the same letter, and of course, if you write UTF-16 as big-endian and read it as little-endian you will get gibberish.
In other words, don't think of it as converting or escaping a string; it's generating a byte buffer, and the encoding tells it which ones and zeroes to write when the next character is "a" and which ones to write when it means "妈".
As for your question...here's my two cents.
1) If you are converting an NSString to an NSData so that your same program can convert it back later, and no other software will need to deal with that NSData until after you've read it back into an NSString, then none of this matters. All that matters is that your string-to-data encoding and your data-to-string encoding match.
2) If you are dealing only with ASCII characters, you can probably get away with a lot, just because many kinds of encoding use the same representation for characters under 128. But this breaks easily, even with little things like smart quotes.
3) Despite the name, defaultCStringEncoding is not something you should use as a default. It's designed for special circumstances where you need to deal with system strings and don't otherwise know how the system deals with its internal strings. It refers to the way strings are handled in the default C implementation, NOT in the NSString internals, so there's not necessarily a performance benefit.
4) If you write a string with an unknown string encoding, and you try to read it back with a different string encoding, your code will fail; in many cases, you will just end up with nil or an empty string.
Bottom line is: who will be trying to interpret your NSData objects? If it's your own app, pick an encoding that makes sense for you (I use UTF8 for everything) and use it for both conversions. Otherwise, figure out what your ecosystem needs to read or write and make that your standard.
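A minimal sketch of that bottom line, using UTF-8 for both conversions:

#import <Foundation/Foundation.h>

NSString *original = @"smart quotes \u201Chi\u201D and \u5988";

// Same encoding both ways: the round trip is lossless.
NSData *data = [original dataUsingEncoding:NSUTF8StringEncoding];
NSString *restored = [[NSString alloc] initWithData:data
                                           encoding:NSUTF8StringEncoding];

// Mismatched encoding: strict 7-bit ASCII cannot represent these bytes,
// so the decode fails and initWithData:encoding: returns nil.
NSString *broken = [[NSString alloc] initWithData:data
                                         encoding:NSASCIIStringEncoding];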

Erlang, io_lib and unicode

I'm having a little trouble getting Erlang to give me a unicode string.
Here's what works:
io:format("~ts~n", [<<226,132,162>>]).
™
ok
But instead of printing to the console, I want to assign it to a variable. So I thought:
T = lists:flatten(io_lib:format("~ts~n", [<<226,132,162>>])).
T.
[8482,10]
How can I get T in the io_lib example to contain the ™ symbol so I can write it to a network stream?
Instead of assigning the flattened version to a variable for sending on the network, can you instead re-write your code that sends over the network to accept the binary in the first place and use the formatted write mechanism ~ts when sending over the socket?
That would also let you avoid the lists:flatten, which isn't needed for the built-in IO mechanisms.
It does contain the trademark symbol: as you can see, 8482 is its code. It isn't printed as ™ in the shell because the shell prints as strings only those lists that contain printable character codes in Latin-1. So [8482, 10] is a Unicode string (in UTF-32 encoding). If you want to convert it to a different encoding, use the unicode module.
The first thing is knowing what you need to do; then you can adapt your code accordingly.
Erlang represents unicode strings as lists of codepoints. Unicode codepoints are integers, not bytes. Since you can only send bytes over the network, things like unicode strings need to be encoded into byte sequences by the sending side and decoded by the receiving side. UTF-8 is the most widely used encoding for unicode strings, and that's what your binary is: the UTF-8 encoding of the unicode string composed of the codepoint 8482.
What you get out of the io_lib:format call is the Erlang string representation of that codepoint plus the newline character.
A very reasonable way to send unicode strings over the network is to encode them in UTF-8. Don't use io_lib:format for that, though; unicode:characters_to_binary/1 is the function meant to transform unicode strings into UTF-8-encoded binaries.
On the receiving side (and probably even better, in your whole application) you'll have to decide how you will handle the strings, either as encoded binaries (or lists) or as plain unicode lists. But over the network the only choice is binaries (or iolists, which are possibly deep lists of bytes), and I'll bet the most reasonable encoding for your application will be UTF-8.
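A minimal sketch of that advice; Socket is assumed to be a connected gen_tcp socket opened elsewhere:

T = [8482],                              % "™" as a list of codepoints
Utf8 = unicode:characters_to_binary(T), % <<226,132,162>>, the UTF-8 bytes
ok = gen_tcp:send(Socket, Utf8),        % sockets carry bytes, so send the binary

% Receiving side: decode the UTF-8 binary back into a codepoint list.
String = unicode:characters_to_list(Utf8). % [8482]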

How to convert unicode escape code to character in Objective C (on iPhone)

I have a string that contains unicode escape codes, e.g. @"D\u017cem" (\u017c is the code for ż). I would like to convert that string to one containing the actual characters. In the example that would be @"Dżem".
Is there any method in SDK or library that can do such replacement AND work on iPhone?
(Obviously I can do the replacement myself, changing characters one by one, but it is rather cumbersome)
According to Apple,
It is not safe to include high-bit characters in your source code
Note that the "universal character name" \u017c is replaced at compile time with an implementation-defined value which in practice is the UTF8 representation, so the end result is the same as you would get if you (correctly) did the replacement you are talking about. If you're having a problem with some other source-processing tool, you might be better served by teaching that tool to recognize C99 universal character names.
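If the escape sequences genuinely arrive in the string data at runtime (say, from a file or network payload rather than a source literal), one option is ICU's "Any-Hex/Java" transform applied in reverse via CFStringTransform; a minimal sketch:

#import <Foundation/Foundation.h>

// The input string contains a literal backslash-u escape, not the character.
NSMutableString *s = [NSMutableString stringWithString:@"D\\u017cem"];
CFStringTransform((__bridge CFMutableStringRef)s, NULL,
                  CFSTR("Any-Hex/Java"), true); // plain C cast if not using ARC
// s now contains @"Dżem"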
I suggest starting to use NSLocalizedString():
http://www.pushplay.net/2009/08/developing-localized-iphone-applications/
http://developer.apple.com

What encoding to use with raw bytes in a NSString

I need to store some raw bytes from an NSData object into an NSString (basically a null encoding), but I am not sure how to do this. Obviously assigning an improper 8-bit encoding would be bad. NSASCIIStringEncoding is not OK because the docs say "Strict 7-bit ASCII encoding within 8-bit chars; ASCII values 0…127 only." but I need the full range of 0x00–0xFF.
Base64 encoding is NOT an acceptable solution.
Basically, you don't.
An NSString is for strings of validly encoded string data; typically UTF8 or UTF16. NSData is for arbitrary binary data.
If you want to store raw bytes into an NSString, you need to encode them and base64 is one of the most common means of doing so.
Use NSNEXTSTEPStringEncoding. According to the documentation:
8-bit ASCII encoding with NEXTSTEP extensions.
It appears in the current documentation (as of writing this post) and is available in both Apple's and GNU's implementations of the (OpenStep) standard.
Caveat: It doesn't state what exactly those "extensions" are, so tread lightly.
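A round-trip sketch of this idea; NSISOLatin1StringEncoding is shown as an alternative because, as far as I know, it maps every byte value 0x00–0xFF directly onto U+0000–U+00FF, but verify the round trip for whichever encoding you pick:

#import <Foundation/Foundation.h>

const unsigned char raw[] = {0x00, 0x41, 0x80, 0xFF}; // sample raw bytes
NSData *bytes = [NSData dataWithBytes:raw length:sizeof(raw)];

// Wrap the bytes in an NSString via an 8-bit encoding...
NSString *wrapped = [[NSString alloc] initWithData:bytes
                                          encoding:NSISOLatin1StringEncoding];
// ...and recover them. wrapped will be nil if any byte has no mapping.
NSData *recovered = [wrapped dataUsingEncoding:NSISOLatin1StringEncoding];
// [recovered isEqualToData:bytes] should be YES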

Emoji icons are not displaying correctly when read from plist

I am trying to read some text from a plist file and display it to the user in an alert box.
When I build the string using this code, everything works (the user sees Hello with a smiley icon):
NSString *hello = @"Hello \ue415";
but when I get the string from the plist using this code, the user sees "Hello \ue415":
NSString *hello = (NSString *)[pageLiteratureDic objectForKey:litratureKey];
Do I have to encode the string differently? Any help or pointers will be much appreciated... everyone loves emojis ;)
You shouldn't literally type "\ue415" as text into the plist file. \u.... is an escape sequence in the syntax of strings and characters in the C language. The string itself does not contain backslash and "u" and whatever, it contains just 1 character, the Unicode character at the codepoint 0xe415. If you want to save that in a plist, you have to manually type that one Unicode character in there yourself, making sure to use whatever encoding that is required of a plist (maybe utf-8 or utf-16, not sure). Alternately, you can write a program that creates a plist from that string, and then copy and paste whatever is in that plist file over to your file.
In the plist, instead of "Hello \ue415", try using the smiley face character explicitly, as in "Hello :)". Just cut and paste the smiley character over the unicode escape. The reading of the plist is probably escaping the backslash and stopping the interpretation as a unicode character.
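A minimal sketch of the "write a program that creates the plist" suggestion above; the key mirrors the question and the output path is illustrative:

#import <Foundation/Foundation.h>

// The compiler turns \ue415 into the real character, so the plist file
// ends up containing U+E415 itself, not the literal text "\ue415".
NSDictionary *page = @{ @"litratureKey": @"Hello \ue415" };
[page writeToFile:@"/tmp/pageLiterature.plist" atomically:YES];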