I get some data from the server in Unicode. However I need this data in UTF8. How can I convert data to UTF8 encoding?
The ideal solution is for the server to send you UTF-8 in the first place.
UTF-8 is an encoding of Unicode, so depending on what you mean by “Unicode” in your question, it may already be doing that.
Cocoa misuses “Unicode” in the symbol NSUnicodeStringEncoding to refer to UTF-16. It's possible, but unlikely, that that's what the server is sending you.
The server should tell you in the Content-Type header what encoding it used for the content. You should look at that in your program rather than assuming the server will use any specific encoding.
If the encoding is not specified in the header, try treating it as UTF-8, and if that doesn't work, I suggest complaining to whoever runs the server.
To convert from any encoding supported by Cocoa to UTF-8, pass the input data and the encoding it's in to the -[NSString initWithData:encoding:] method, which will decode the data and produce a string; then, send the string a dataUsingEncoding: message with NSUTF8StringEncoding as the desired encoding.
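The same decode-then-re-encode idea is easy to try outside Cocoa. Here is a minimal Java sketch of it, not Cocoa code: the helper name `toUtf8` is made up for illustration, and Windows-1252 is only an assumed source encoding (in a real program you would take it from the Content-Type header).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Hypothetical helper: decode `input` using `source`, then re-encode as UTF-8.
    static byte[] toUtf8(byte[] input, Charset source) {
        String decoded = new String(input, source);       // bytes -> text
        return decoded.getBytes(StandardCharsets.UTF_8);  // text -> UTF-8 bytes
    }

    public static void main(String[] args) {
        // "año" in Windows-1252: ñ is the single byte 0xF1.
        byte[] cp1252 = { 'a', (byte) 0xF1, 'o' };
        byte[] utf8 = toUtf8(cp1252, Charset.forName("windows-1252"));
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: 61 C3 B1 6F
    }
}
```

The single byte 0xF1 becomes the two bytes 0xC3 0xB1, which is why the conversion has to go through a decoded string rather than patching bytes in place.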
Well UTF-8 is an encoding for Unicode, but to get a string:
NSString *string = [[NSString alloc] initWithData:yourData encoding:NSUTF8StringEncoding];
Related
I have received this in a name field (so it should be a person's name)
Ð˜Ð³Ð¾Ñ€Ñœ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint or maybe links to websites that explain what meaningful letters I should get out of that would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
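// Requires: import java.nio.charset.StandardCharsets;
String s = "Ð˜Ð³Ð¾Ñ€Ñœ"; // the garbled text as received; source file saved as UTF-8
byte[] b = s.getBytes("Cp1252"); // recover the original bytes (may throw UnsupportedEncodingException)
System.out.println(new String(b, StandardCharsets.UTF_8)); // decode those bytes as UTF-8
// Игорќ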
The data might easily get corrupted. Above I got some results with Windows-1252 (MS Windows Latin-1). The Java source must be compiled with UTF-8 encoding to accept those characters.
Since you already pasted the original text into a UTF-8 encoded site such as Stack Overflow, what we see now is corrupt data perfectly encoded as UTF-8. If you want to find out anything about the data's encoding, you need to use a hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Ð˜Ð³Ð¾Ñ€Ñœ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
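If you would rather check this in code than in a text editor, here is a rough Java sketch of both directions: how the corruption arises and how it is undone, plus a hex dump so you can look at the raw bytes. The class and method names are made up for illustration, and Windows-1252 is an assumption about which single-byte encoding was involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // What a Windows-1252 reader shows when it is handed UTF-8 bytes.
    static String misread(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Undo the damage: turn the garbled text back into bytes, decode as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    // Space-separated hex dump, similar to what a hex editor would show.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes for the decoded name, so the source file encoding does not matter.
        String original = "\u0418\u0433\u043E\u0440\u045C";
        String garbled = misread(original);
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8)));
        // prints: D0 98 D0 B3 D0 BE D1 80 D1 9C
        System.out.println(repair(garbled).equals(original)); // prints: true
    }
}
```

The round trip only works when every byte has a mapping in the single-byte encoding. Windows-1252 leaves a few byte values undefined (0x81, for example), so text containing those bytes cannot be recovered this way.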
I understand why when going from NSData to NSString you need to specify encoding.
However I'm finding it frustrating how the reverse (NSString to NSData) needs to have an encoding specified.
In this related question the answers suggested using
NSUTF8StringEncoding or defaultCStringEncoding, with the latter not being fully explained.
So I just wanted to ask IF the following is correct when converting NSString to NSData:
In cases where you want to be 100% sure the binary representation of the NSString object is UTF8 then use NSUTF8StringEncoding (or whatever encoding is needed)
In cases where the encoding of the NSString object is known/expected to already be of a certain type and no conversion is required, then it's safe (perhaps internally faster) to use defaultCStringEncoding (from what I have read, Objective-C uses UTF-16 internally; not sure if LE or BE, but I'd assume LE because the platform is LE)
TIA
The encoding needs to be specified for converting NSString to NSData for the same reason it needs to be specified going from NSData to NSString.
An NSData object is a wrapper for a string of absolutely raw bytes. If the NSString doesn't specify some encoding, it doesn't know what to write, because at the level of ones and zeroes, a UTF-16 encoding looks different from a UTF-8 encoding of the same letter, and of course, if you write UTF-16 as big-endian and read it as little-endian you will get gibberish.
In other words, don't think of it as converting or escaping a string; it's generating a byte buffer, and the encoding tells it which ones and zeroes to write when the next character is "a" and which ones to write when it means "妈".
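To make the "which ones and zeroes" point concrete, here is a small Java sketch (not Cocoa code; the class and method names are made up for illustration) that prints the bytes produced for "a" and "妈" under three encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Encode `text` with `charset` and return the bytes as space-separated hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = { "a", "\u5988" }; // "\u5988" is the character 妈
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (String s : samples) {
            for (Charset cs : charsets) {
                System.out.println(cs.name() + ": " + hex(s, cs));
            }
        }
    }
}
```

For "a" this gives 61, 00 61 and 61 00; for "妈" it gives E5 A6 88, 59 88 and 88 59. Same characters, three different byte buffers, and the two UTF-16 variants are byte-swapped versions of each other.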
As for your question...here's my two cents.
1) If you are converting an NSString to an NSData so that your same program can convert it back later, and no other software will need to deal with that NSData until after you've read it back into an NSString, then none of this matters. All that matters is that your string-to-data encoding and your data-to-string encoding match.
2) If you are dealing only with ASCII characters, you can probably get away with a lot, just because many kinds of encoding use the same representation for characters under 128. But this breaks easily, even with little things like smart quotes.
3) Despite the name, defaultCStringEncoding is not something you should use as a default. It's designed for special circumstances where you need to deal with system strings and don't otherwise know how the system deals with its internal strings. It refers to the way strings are handled in the default C implementation, NOT in the NSString internals, so there's not necessarily a performance benefit.
4) If you write a string with an unknown string encoding, and you try to read it back with a different string encoding, your code will fail; in many cases, you will just end up with a nil string.
Bottom line is: who will be trying to interpret your NSData objects? If it's your own app, pick an encoding that makes sense for you (I use UTF8 for everything) and use it for both conversions. Otherwise, figure out what your ecosystem needs to read or write and make that your standard.
I've been trying to solve the encoding conversion problem without any luck so far. I found many suggestions on Stack Overflow for tackling the problem, such as converting the XML string into NSData using UTF-8 encoding, but the result was the same: my Spanish tildes are presented as weird chars. This is what I am using to grab the XML:
//Convert Win 1252 encoding of the string to UTF8
NSString *xmlString = [NSString stringWithContentsOfURL:[NSURL URLWithString:chosenDrawRss] encoding:NSWindowsCP1252StringEncoding error:&error];
If I call the above method with UTF-8 encoding, the app crashes...
I tried converting the string into NSData using UTF8 and then back to NSString but still I had no luck. I wish I could simply change the XML file encoding but unfortunately that is out of my hands :(
When printed with NSLog, everything is presented as it should be, whether I'm printing the XML string or the NSData created from that string...
Best regards,
L
UPDATE: I haven't found the solution for the encoding mess :(. Since I found out that my Win1252 encoded RSS also contained some HTML that I wanted to get rid of, I wrote a .php script that I am calling from the iPhone. This .php script parses the original XML from the remote server. The script is in UTF-8; it does some HTML cleaning and reorders the XML elements, so in my case it made sense to do it this way. Unfortunately, I still have no clue how to read a Win1252 encoded XML and convert it to UTF-8 directly from iOS :(((
I need to store some raw bytes from NSData object into an NSString (basically a null encoding) but I am not sure how to do this. Obviously assigning an improper 8-bit encoding would be bad. NSASCIIStringEncoding is not OK because the docs say "Strict 7-bit ASCII encoding within 8-bit chars; ASCII values 0…127 only." but I need full range of 0x0 - 0xFF.
Base64 encoding is NOT an acceptable solution.
Basically, you don't.
An NSString is for strings of validly encoded string data; typically UTF8 or UTF16. NSData is for arbitrary binary data.
If you want to store raw bytes into an NSString, you need to encode them and base64 is one of the most common means of doing so.
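For what the base64 round trip looks like, here is a minimal sketch in Java using the standard `java.util.Base64` class (the class and method names `RawBytesAsText`, `encode` and `decode` are made up for illustration). On the Cocoa side, NSData has `base64EncodedStringWithOptions:` for the same job.

```java
import java.util.Arrays;
import java.util.Base64;

class RawBytesAsText {
    // Arbitrary bytes -> plain ASCII text that is safe to keep in a string.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // The text produced by encode() -> the original bytes.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Includes values above 0x7F, which strict ASCII cannot represent.
        byte[] raw = { 0x00, 0x7F, (byte) 0x80, (byte) 0xFF };
        String text = encode(raw);
        System.out.println(text);                              // prints: AH+A/w==
        System.out.println(Arrays.equals(raw, decode(text)));  // prints: true
    }
}
```

The price is size: every 3 input bytes become 4 characters of text.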
Use NSNEXTSTEPStringEncoding. According to the documentation:
8-bit ASCII encoding with NEXTSTEP extensions.
It appears in the current documentation (as of writing this post) and is available in both Apple and GNU's implementation of the (OPENSTEP) standard.
Caveat: It doesn't state what exactly those "extensions" are, so tread lightly.
I am constructing a data packet to be sent over NSStream to a server. I am trying to separate two pieces of data with a '§' (ascii code 167). This is the way the server is built, so I need to try to stay within those bounds...
unichar asciiChar = 167; //yields @"§"
[self setSepString:[NSString stringWithCharacters:&asciiChar length:1]];
sendData=[NSString stringWithFormat:@"USER User%@Pass", sepString];
NSLog(sendData);
const uint8_t *rawString=(const uint8_t *)[sendData UTF8String];
[oStream write:rawString maxLength:[sendData length]];
So the final outcome should look like this.. and it does when sendData is first constructed:
USER User§Pass
however, when it is received on the server side, it looks like this:
//not a direct copy and paste. The 'mystery character' may not be exact
USER UserˤPas
...the separator string has become two bytes long, and the last letter is getting cropped from the command. I believe this to be caused by the UTF-8 conversion.
Can anyone shed some light on this for me?
Any help would be greatly appreciated!
The correct encoding in UTF-8 for this character is the two-byte sequence 0xC2 0xA7, which is what you're getting. (Fileformat.info is invaluable for this stuff.) The character is part of the Latin-1 set, so you almost certainly want to use NSISOLatin1StringEncoding rather than NSUTF8StringEncoding in order to get a single-byte 167 encoding. Look at NSString -dataUsingEncoding:.
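The cropped last letter has the same cause: the string is 14 characters long, but its UTF-8 form is 15 bytes, and the question's code writes only `[sendData length]` bytes. A quick Java sketch of the arithmetic (not Cocoa code; class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

class SeparatorBytes {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String command = "USER User\u00A7Pass"; // \u00A7 is the section sign, code 167
        System.out.println(command.length());         // prints: 14 (characters)
        System.out.println(utf8(command).length);     // prints: 15 (separator takes 2 bytes)
        System.out.println(latin1(command).length);   // prints: 14 (separator takes 1 byte)
    }
}
```

Whichever encoding you pick, the number of bytes to write should come from the encoded data, never from the character count of the string.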
What you have and what you want to transmit is not really a UTF-8 string, and it's technically not us-ascii, because that's only 7 bits. You want to transmit an arbitrary array of bytes, according to the protocol that you're working with. The two fields of the byte array, username and password, might themselves be UTF-8 strings, but with the 167 separator it cannot be a UTF-8 string.
Here are some options I see:
Construct the uint8_t* byte array using at least two different NSString objects plus the 167 code. This will be necessary if the username or password can possibly contain non-ascii characters.
Use the NSString method getBytes:maxLength:usedLength:encoding:options:range:remainingRange: and set encoding to NSASCIIStringEncoding. If you do this you must validate elsewhere that your username and password are US-ASCII only.
Use the NSString method getCString:. However, that's been deprecated because you cannot specify the encoding you want.
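The first option can be sketched roughly as follows, in Java rather than Objective-C. The framing ("USER ", user, byte 167, password) is taken from the question; the class name, method name and the choice of UTF-8 for the two fields are assumptions for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PacketBuilder {
    static final int SEPARATOR = 0xA7; // the single byte 167 the server expects

    // Assemble the packet from separately encoded pieces plus the raw separator byte.
    static byte[] build(String user, String pass) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prefix = "USER ".getBytes(StandardCharsets.US_ASCII);
        byte[] userBytes = user.getBytes(StandardCharsets.UTF_8);
        byte[] passBytes = pass.getBytes(StandardCharsets.UTF_8);
        out.write(prefix, 0, prefix.length);
        out.write(userBytes, 0, userBytes.length);
        out.write(SEPARATOR); // written as one byte, not as an encoded character
        out.write(passBytes, 0, passBytes.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packet = build("User", "Pass");
        System.out.println(packet.length); // prints: 14
    }
}
```

One caveat: 0xA7 can also occur as the second byte of a UTF-8 character (the section sign itself, for one), so a server that splits on the first 0xA7 could cut a non-ASCII username in the wrong place. Check how the server parses the packet before allowing such names.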