NSString to NSData encoding considerations - encoding

I understand why when going from NSData to NSString you need to specify encoding.
However I'm finding it frustrating how the reverse (NSString to NSData) needs to have an encoding specified.
In this related question the answers suggested using
NSUTF8StringEncoding or defaultCStringEncoding, with the latter not being fully explained.
So I just wanted to ask IF the following is correct when converting NSString to NSData:
In cases where you want to be 100% sure the binary representation of the NSString object is UTF-8, use NSUTF8StringEncoding (or whatever encoding is needed).
In cases where the encoding of the NSString object is known/expected to already be of a certain type and no conversion is required, it's safe (and perhaps internally faster) to use defaultCStringEncoding. (From what I have read, Objective-C uses UTF-16 internally; I'm not sure if LE or BE, but I'd assume LE because the platform is LE.)
TIA

The encoding needs to be specified for converting NSString to NSData for the same reason it needs to be specified going from NSData to NSString.
An NSData object is a wrapper for a string of absolutely raw bytes. If the NSString doesn't specify some encoding, it doesn't know what to write, because at the level of ones and zeroes, a UTF-16 encoding looks different from a UTF-8 encoding of the same letter, and of course, if you write UTF-16 as big-endian and read it as little-endian you will get gibberish.
In other words, don't think of it as converting or escaping a string; it's generating a byte buffer, and the encoding tells it which ones and zeroes to write when the next character is "a" and which ones to write when it means "妈".
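To make that concrete, here is a minimal sketch in Python (used here only because it runs anywhere; the helper name encodings_of is made up for illustration). It prints the bytes that three encodings produce for the same two characters.

```python
def encodings_of(text):
    """Return the bytes each encoding produces for the same text."""
    names = ("utf-8", "utf-16-le", "utf-16-be")
    return {name: text.encode(name) for name in names}


for ch in ("a", "妈"):
    for name, raw in encodings_of(ch).items():
        # .hex() shows the raw bytes as hexadecimal digits
        print(f"{ch!r} as {name}: {raw.hex()}")

# Reading big-endian UTF-16 bytes as little-endian yields a different character.
swapped = "妈".encode("utf-16-be").decode("utf-16-le")
print("BE bytes read as LE:", repr(swapped))
```

The same character comes out as one, two or three bytes depending on the encoding, and the two UTF-16 variants differ only in byte order.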
As for your question...here's my two cents.
1) If you are converting an NSString to an NSData so that your same program can convert it back later, and no other software will need to deal with that NSData until after you've read it back into an NSString, then none of this matters. All that matters is that your string-to-data encoding and your data-to-string encoding match.
2) If you are dealing only with ASCII characters, you can probably get away with a lot, just because many kinds of encoding use the same representation for characters under 128. But this breaks easily, even with little things like smart quotes.
3) Despite the name, defaultCStringEncoding is not something you should use as a default. It's designed for special circumstances where you need to deal with system strings and don't otherwise know how the system deals with its internal strings. It refers to the way strings are handled in the default C implementation, NOT in the NSString internals, so there's not necessarily a performance benefit.
4) If you write a string with an unknown string encoding, and you try to read it back with a different string encoding, your code will fail; in many cases, you will just end up with an empty string.
Bottom line is: who will be trying to interpret your NSData objects? If it's your own app, pick an encoding that makes sense for you (I use UTF8 for everything) and use it for both conversions. Otherwise, figure out what your ecosystem needs to read or write and make that your standard.
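Points 1, 2 and 4 can be sketched in a few lines of Python (an illustration only, not Cocoa code; round_trip is a hypothetical helper). Python either raises an error or returns garbled text on a mismatch, whereas Cocoa signals the failure differently, but the underlying cause is the same.

```python
def round_trip(text, write_encoding, read_encoding):
    """Encode text to bytes, then decode with a possibly different encoding."""
    return text.encode(write_encoding).decode(read_encoding)


plain = "hello"
fancy = "\u201chello\u201d"  # the same word wrapped in smart quotes

# Matching encodings give the original string back.
print(round_trip(fancy, "utf-8", "utf-8") == fancy)

# Pure ASCII survives a mismatch between ASCII-compatible encodings.
print(round_trip(plain, "utf-8", "latin-1") == plain)

# Smart quotes do not: the result is garbled text, not the original.
print(repr(round_trip(fancy, "utf-8", "latin-1")))

# A stricter decoder rejects the mismatched bytes outright.
try:
    round_trip(fancy, "cp1252", "utf-8")
except UnicodeDecodeError as err:
    print("mismatch rejected:", err)
```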

Related

Working with strings with mixed encodings in python 3.x

I'm working with a binary file that references another file using absolute paths.
The path contains both japanese and ascii characters.
The length of the string is given, so I can just read that many bytes and convert it into a string.
However the problem is trying to convert the string. If I specify the encoding as ascii, it'll fail on the japanese characters. If I specify it as japanese encoding (shift-jis or something), it won't read the english characters properly.
One byte is used for each ascii character, while two bytes are used for each japanese character.
What is the fastest and cleanest way to convert these bytes into a string? The encodings are known. Will the same technique work in older versions of python.
This sounds like you have fallen victim to a misunderstanding of the basics of Unicode and encodings. It may be that you have not, but such misunderstandings are common and understandable, while the situation you describe is not.
A string of bytes that contains mixed encodings is, by definition, invalid in any of these encodings. If this really were the case, you would have to split the byte string into its parts and decode every part separately. In this case it would probably mean splitting on the path separators, so it would be reasonably easy, but in other cases it would not. However, I seriously doubt that this is the case, as it would mean that your source is insane. That happens, but it is unlikely. :-)
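For completeness, the "split and decode every part separately" fallback could look roughly like the sketch below. The function name decode_parts, the "/" separator and the candidate list are all assumptions made for illustration; a real file would dictate its own separator and encodings.

```python
def decode_parts(raw_path, candidates=("ascii", "shift_jis"), sep=b"/"):
    """Decode each path component with the first candidate encoding that fits."""
    decoded = []
    for part in raw_path.split(sep):
        for encoding in candidates:
            try:
                decoded.append(part.decode(encoding))
                break  # this component is done, move to the next one
            except UnicodeDecodeError:
                continue  # try the next candidate encoding
        else:
            raise ValueError(f"no candidate encoding fits {part!r}")
    return sep.decode("ascii").join(decoded)


# Hypothetical sample: ASCII components around one Shift JIS component.
sample = b"/home/" + "日本語".encode("shift_jis") + b"/file.txt"
print(decode_parts(sample))
```

Trying encodings in order is a guess, not a proof: a component can decode without error in the wrong encoding, so the order of the candidates matters.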
If the source gives you one path as a bytes string, it is most likely that this string uses only one encoding. It may contain both Japanese and ASCII characters and still be using one encoding. Common encodings that can handle both Japanese and ASCII include UTF-8, UTF-16, Shift JIS and EUC-JP. My guess is that your source uses one of those. Since you write "One byte is used for each ascii character, while two bytes are used for each japanese character", it is probably not UTF-8 or UTF-16: UTF-8 needs three bytes for most Japanese characters, and UTF-16 needs two bytes even for ASCII. That pattern fits Shift JIS (or a variant such as cp932) or EUC-JP, so if plain Shift JIS fails, those are worth trying.
If not, please explain what your source is, and give examples of the byte strings (in ASCII/HEX) that you are given.
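One way to compare candidate encodings against the "one byte per ASCII character, two per Japanese character" description is to measure how many bytes each character takes. This is a minimal sketch; byte_widths and the sample path are made up for illustration.

```python
def byte_widths(text, encoding):
    """Return the number of bytes each character of text occupies in encoding."""
    return [len(ch.encode(encoding)) for ch in text]


sample_path = "C:/日本語/a.txt"  # hypothetical path mixing ASCII and Japanese
for encoding in ("shift_jis", "euc_jp", "utf-8", "utf-16-le"):
    raw = sample_path.encode(encoding)
    # A single encoding round-trips the whole mixed path.
    assert raw.decode(encoding) == sample_path
    print(encoding, byte_widths("a日", encoding), len(raw), "bytes in total")
```

In Python 3 the whole job is then a single call such as raw_bytes.decode("shift_jis") once the right encoding is known; in Python 2 the same decode method exists on byte strings and returns a unicode object.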

Change data encoding

I get some data from the server in Unicode. However I need this data in UTF8. How can I convert data to UTF8 encoding?
The ideal solution is that the server sends you UTF-8 in the first place.
UTF-8 is an encoding of Unicode, so depending on what you mean by “Unicode” in your question, it may already be doing that.
Cocoa misuses “Unicode” in the symbol NSUnicodeStringEncoding to refer to UTF-16. It's possible, but unlikely, that that's what the server is sending you.
The server should tell you in the Content-Type header what encoding it used for the content. You should look at that in your program rather than assuming the server will use any specific encoding.
If the encoding is not specified in the header, try treating it as UTF-8, and if that doesn't work, I suggest complaining to whoever runs the server.
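As an illustration of "look at the header, fall back to UTF-8", here is a small Python sketch that uses the standard library's email.message.Message to parse a Content-Type value. The function name charset_from_content_type is hypothetical, and a Cocoa program would read the header through its own networking classes instead.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case,
    # or the fallback when the header has no charset parameter.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/plain; charset=UTF-16"))
print(charset_from_content_type("application/json"))
```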
To convert from any encoding supported by Cocoa to UTF-8, pass the input data and the encoding it's in to the -[NSString initWithData:encoding:] method, which will decode the data and produce a string; then, send the string a dataUsingEncoding: message with NSUTF8StringEncoding as the desired encoding.
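The same two-step idea (decode with the source encoding, then encode as UTF-8) looks like this in Python. It is a sketch of the pattern rather than of the Cocoa calls, and to_utf8 is a made-up name.

```python
def to_utf8(data, source_encoding):
    """Decode bytes from source_encoding, then re-encode them as UTF-8."""
    text = data.decode(source_encoding)  # bytes -> string, the decoding step
    return text.encode("utf-8")          # string -> bytes, the encoding step


utf16_bytes = "é".encode("utf-16-le")
print(to_utf8(utf16_bytes, "utf-16-le"))
```

There is no direct bytes-to-bytes shortcut in this sketch: the string in the middle is what makes the conversion independent of either byte format.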
Well UTF-8 is an encoding for Unicode, but to get a string:
NSString *string = [[NSString alloc] initWithData:yourData encoding:NSUTF8StringEncoding];

What encoding to use with raw bytes in a NSString

I need to store some raw bytes from NSData object into an NSString (basically a null encoding) but I am not sure how to do this. Obviously assigning an improper 8-bit encoding would be bad. NSASCIIStringEncoding is not OK because the docs say "Strict 7-bit ASCII encoding within 8-bit chars; ASCII values 0…127 only." but I need full range of 0x0 - 0xFF.
Base64 encoding is NOT an acceptable solution.
Basically, you don't.
An NSString is for strings of validly encoded string data; typically UTF8 or UTF16. NSData is for arbitrary binary data.
If you want to store raw bytes into an NSString, you need to encode them and base64 is one of the most common means of doing so.
Use NSNEXTSTEPStringEncoding. According to the documentation:
8-bit ASCII encoding with NEXTSTEP extensions.
It appears in the current documentation (as of writing this post) and is available in both Apple and GNU's implementation of the (OPENSTEP) standard.
Caveat: It doesn't state what exactly those "extensions" are, so tread lightly.
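What the question really needs is an 8-bit encoding that assigns a character to every one of the 256 byte values, so that nothing is rejected or altered on the way back. As an analogy only, Python's latin-1 codec has that property; whether NSNEXTSTEPStringEncoding does is not verified here. The helper names are hypothetical.

```python
def bytes_to_text(raw):
    """Map each byte 0x00-0xFF to the character with the same code point."""
    return raw.decode("latin-1")


def text_to_bytes(text):
    """Reverse of bytes_to_text; fails for characters above U+00FF."""
    return text.encode("latin-1")


every_byte = bytes(range(256))
as_text = bytes_to_text(every_byte)
print(len(as_text), text_to_bytes(as_text) == every_byte)

# Strict 7-bit ASCII, by contrast, refuses anything above 127.
try:
    bytes([200]).decode("ascii")
except UnicodeDecodeError as err:
    print("ascii rejected the byte:", err)
```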

NSStrings, C strings, pathnames and encodings in iPhone

I am using libxml2 in my iPhone app. I have an NSString that holds the pathname to an XML file. The pathname may include non-ASCII characters. I want to get a C string representation of the NSString to pass to xmlReadFile(). It appears that cStringUsingEncoding gives me the representation I seek. I am not clear on which encoding to use.
I wonder if there is a "default" encoding in iPhone OS that I can use here and ensure that I can roundtrip non-ASCII pathnames.
Use NSString's fileSystemRepresentation. If the string contains characters that are not representable in the file system's encoding then this method will raise an exception.
To convert back, use NSFileManager's stringWithFileSystemRepresentation:length:
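For comparison, Python has a similar pair of functions, os.fsencode and os.fsdecode, which convert between a path string and the byte form the operating system's file APIs expect. This is an analogy to the Cocoa methods above, not a description of them; the wrapper names and the sample path are made up.

```python
import os


def path_to_fs_bytes(path):
    """Encode a path string using the file system's encoding."""
    return os.fsencode(path)


def fs_bytes_to_path(raw):
    """Decode file-system bytes back into a path string."""
    return os.fsdecode(raw)


raw = path_to_fs_bytes("/tmp/data.xml")  # hypothetical path
print(type(raw).__name__, fs_bytes_to_path(raw))
```

The point in both environments is the same: let the platform pick the file-system encoding rather than hard-coding one.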

NSStream, UTF8String & NSString... Messy Conversion

I am constructing a data packet to be sent over NSStream to a server. I am trying to separate two pieces of data with a '§' (ASCII code 167). This is the way the server is built, so I need to try to stay within those bounds...
unichar asciiChar = 167; //yields @"§"
[self setSepString:[NSString stringWithCharacters:&asciiChar length:1]];
sendData=[NSString stringWithFormat:@"USER User%@Pass", sepString];
NSLog(sendData);
const uint8_t *rawString=(const uint8_t *)[sendData UTF8String];
[oStream write:rawString maxLength:[sendData length]];
So the final outcome should look like this.. and it does when sendData is first constructed:
USER User§Pass
however, when it is received on the server side, it looks like this:
//not a direct copy and paste. The 'mystery character' may not be exact
USER UserˤPas
...the separator string has become two in length, and the last letter is getting cropped from the command. I believe this to be caused by the UTF8 conversion.
Can anyone shed some light on this for me?
Any help would be greatly appreciated!
The correct encoding in UTF-8 for this character is the two-byte sequence 0xC2 0xA7, which is what you're getting. (Fileformat.info is invaluable for this stuff.) This is out of the LATIN-1 set, so you almost certainly want to be using NSISOLatin1StringEncoding rather than NSUTF8StringEncoding in order to get a single-byte 167 encoding. Look at NSString -dataUsingEncoding:.
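Both symptoms can be reproduced outside Cocoa. In the Python sketch below (an illustration, not the original code), the message is 14 characters long but 15 bytes long in UTF-8, so sending only "length of the string" bytes of the UTF-8 buffer drops the final byte. That matches the question's code, which passes [sendData length] as the byte count.

```python
message = "USER User\u00a7Pass"  # U+00A7 is the section sign

utf8 = message.encode("utf-8")
latin1 = message.encode("latin-1")

# Character count versus byte count in each encoding.
print(len(message), len(utf8), len(latin1))

# The separator is two bytes in UTF-8 but a single byte 167 in Latin-1.
print(utf8[9:11].hex(), latin1[9:10].hex())

# Sending only len(message) bytes of the UTF-8 buffer crops the last letter.
print(utf8[:len(message)])
```

Whichever encoding is used, the byte count passed to the stream should come from the encoded buffer, not from the string.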
What you have and what you want to transmit is not really a UTF-8 string, and it's technically not us-ascii, because that's only 7 bits. You want to transmit an arbitrary array of bytes, according to the protocol that you're working with. The two fields of the byte array, username and password, might themselves be UTF-8 strings, but with the 167 separator it cannot be a UTF-8 string.
Here are some options I see:
Construct the uint8_t* byte array using at least two different NSString objects plus the 167 code. This will be necessary if the username or password can possibly contain non-ascii characters.
Use the NSString method getBytes:maxLength:usedLength:encoding:options:range:remainingRange: and set encoding to NSASCIIStringEncoding. If you do this you must validate elsewhere that your username and password are us-ascii only.
Use the NSString method getCString. However, that's been deprecated because you cannot specify the encoding you want.
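The first option, assembling the byte array from separately encoded fields plus the raw 167 byte, can be sketched as follows. This is Python rather than Objective-C, build_packet is a hypothetical helper, and the choice of UTF-8 for the two fields is an assumption; the server's protocol decides what it really expects.

```python
SEPARATOR = b"\xa7"  # the single byte 167 that the server uses as a separator


def build_packet(username, password, field_encoding="utf-8"):
    """Join 'USER <username>', the separator byte and the password as raw bytes."""
    return (
        b"USER "
        + username.encode(field_encoding)
        + SEPARATOR
        + password.encode(field_encoding)
    )


packet = build_packet("User", "Pass")
# The byte count to send comes from the packet itself.
print(len(packet), packet)
```

Passing field_encoding="ascii" turns the sketch into the second option: encoding then fails loudly if either field contains a non-ASCII character.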