I've been trying to solve an encoding conversion problem without any luck so far. I found many suggestions on Stack Overflow for tackling it, such as converting the XML string into NSData using UTF-8 encoding, but the result was the same: my Spanish accented characters (tildes) are displayed as weird characters. This is what I am using to grab the XML:
//Convert Win 1252 encoding of the string to UTF8
NSString *xmlString = [NSString stringWithContentsOfURL:[NSURL URLWithString:chosenDrawRss] encoding:NSWindowsCP1252StringEncoding error:&error];
If I call the above method with UTF-8 encoding, the app crashes...
I tried converting the string into NSData using UTF-8 and then back to NSString, but still had no luck. I wish I could simply change the XML file's encoding, but unfortunately that is out of my hands :(
When printed with NSLog, everything is displayed as it should be, whether I'm printing the XML string or the NSData created from that string...
Best regards,
L
UPDATE: I haven't found a solution for the encoding mess :(. Since I found out that my Win-1252 encoded RSS also contained some HTML that I wanted to get rid of, I wrote a .php script that I call from the iPhone. This script parses the original XML from the remote server, outputs UTF-8, does some HTML cleaning, and reorders the XML elements, so in my case it made sense to do it this way. Unfortunately, I still have no clue how to read a Win-1252 encoded XML and convert it to UTF-8 directly on iOS :(((
I have received this in a name field (so it should be a person's name):
Ð˜Ð³Ð¾Ñ€Ñœ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
A hint, or links to websites that explain what meaningful letters I should get out of that, would be helpful. Thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
String s = "Ð˜Ð³Ð¾Ñ€Ñœ"; // UTF-8 bytes that were misread as Windows-1252
byte[] b = s.getBytes("Cp1252"); // recover the original bytes
System.out.println(new String(b, StandardCharsets.UTF_8));
// Игорќ
The data can easily get corrupted along the way. The result above was obtained with Windows-1252 (MS Windows Latin-1). The Java source must be compiled with UTF-8 encoding to accept those characters.
Since you already pasted the original text into a UTF-8 encoded site such as Stack Overflow, it is now corrupt data perfectly encoded as UTF-8. If you want to find out anything about the data's encoding, you need to use a hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
1. Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows; I used Windows-1252)
2. Paste the Ð˜Ð³Ð¾Ñ€Ñœ gibberish and save the file
3. Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
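The same round trip can be sketched in a few lines of Python. This is only an illustration: the function name repair_mojibake is made up for this example, and it assumes the text really is UTF-8 that was decoded as Windows-1252 and that no bytes were lost along the way.

```python
def repair_mojibake(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Undo a wrong single-byte decode (assumed Windows-1252) of UTF-8 data."""
    # Re-encoding with the wrongly assumed codec recovers the original bytes...
    raw = garbled.encode(wrong_encoding)
    # ...which can then be decoded as the UTF-8 they were all along.
    return raw.decode("utf-8")


# Simulate the damage: UTF-8 bytes of a Russian name misread as Windows-1252.
garbled = "Игорь".encode("utf-8").decode("cp1252")
print(repair_mojibake(garbled))
```

If the re-encoding step raises UnicodeEncodeError, the garbled text contains characters that Windows-1252 cannot represent, which suggests a different wrong encoding was involved or the data was damaged further.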
Edit: Added some new information to make the question clearer.
In MATLAB versions before R2012b, urlread returns a string built with the wrong charset if the web content's charset is not UTF-8. (This has been improved somewhat in R2012b.)
For example:
% a Chinese website whose content is encoded in GB2312
url = 'http://www.cnbeta.com/articles/213618.htm';
html = urlread(url)
Because MATLAB decodes the HTML as UTF-8 instead of GB2312, the Chinese characters in html do not display correctly.
If I read a Chinese website that is encoded in UTF-8, everything works fine:
% a Chinese website whose content is encoded in UTF-8
url = 'http://www.baidu.com/';
html = urlread(url)
So is there any way to reconstruct the string correctly from html?
I have tried the following, but it did not work:
>> bytes = unicode2native(html,'utf8');
>> str = native2unicode(bytes,'gb2312')
However, I do know there is a way to fix urlread's problem:
Type edit urlread.m in the console and then replace the code near line 108 (in MATLAB R2011b):
output = native2unicode(typecast(byteArrayOutputStream.toByteArray','uint8'),'UTF-8');
by:
output = native2unicode(typecast(byteArrayOutputStream.toByteArray','uint8'),'gb2312');
Save the file, and urlread now works for websites encoded in GB2312.
This fix also shows why urlread sometimes fails: it always uses the UTF-8 charset to decode the content, even if the content is not encoded in UTF-8.
It seems that you already have the solution: just create a function called urlread_gb that can read GB2312.
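A likely reason the unicode2native/native2unicode round trip in the question cannot work is that decoding GB2312 bytes as UTF-8 is lossy: invalid sequences are replaced, so the original bytes no longer exist in the string. The Python sketch below illustrates this with a made-up two-character sample; it is an analogy for what urlread does, not MATLAB code.

```python
text = "中文"
raw = text.encode("gb2312")  # stand-in for the bytes the server sent

# Decoding with the wrong charset replaces invalid sequences with U+FFFD,
# so the original bytes cannot be recovered from the resulting string.
wrong = raw.decode("utf-8", errors="replace")
round_trip = wrong.encode("utf-8").decode("gb2312", errors="replace")

# Decoding the raw bytes with the correct charset is the reliable fix.
right = raw.decode("gb2312")

print(repr(wrong), round_trip == text, right == text)
```

This is why patching the decoding step itself (or writing a urlread_gb variant that decodes the raw bytes as GB2312) works, while re-encoding the already decoded string does not.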
Currently, I'm trying to use the MSXML loadXML method in ASP to load an XML string which may contain Unicode Chinese characters like
𠮢 (U+20BA2), 4 bytes
and the xml string looks like
<City>City</City><Name>𠮢</Name>
So, in my code, I can see that the XML string comes in correctly, but loadXML returns an error message like
Invalid unicode characters, &#55362;&#57250;
Can someone please tell me what I can do to resolve this issue?
Thanks,
Edited
The code looks like this
Set objDoc = CreateObject("MSXML2.DOMDocument")
objDoc.async = false
objDoc.setProperty "SelectionLanguage", "XPath"
objDoc.validateOnParse = false
objDoc.loadXML(strXml)
I suggest posting the exact code, XML source and error message you are getting. I cannot reproduce an error by parsing <element>𠮢</element> in MSXML 4.0 SP3; this works fine.
I certainly do get a parseError with reason "Invalid unicode character" by trying to parse <element>&#55362;&#57250;</element>, because that's not well-formed XML. If you do have this in your markup, then you need to fix the serialiser that produced it, because neither MSXML nor any standards-compliant XML parser will load it.
If 𠮢 is turned into a character reference, it must be &#134050; (or &#x20BA2;). Code units 55362 and 57250 are 'surrogates', reserved for encoding astral plane characters in UTF-16. They can't be included in an XML document.
&#55362;&#57250; is the entity-encoded form of 0xD842 0xDFA2, which is the UTF-16 encoded form of the Unicode character 𠮢. Make sure that the XML is completely UTF-16 encoded, not mixed single-byte ASCII and multi-byte UTF-16.
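The arithmetic behind those numbers can be checked with a short Python sketch. The helper names below are hypothetical and exist only for this example; the formulas are the standard UTF-16 surrogate calculation.

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split an astral-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def combine_surrogates(high: int, low: int) -> int:
    """Rebuild the code point from a high/low surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def char_reference(ch: str) -> str:
    """Numeric character reference for one character, using its code point."""
    return f"&#{ord(ch)};"


print(utf16_surrogates(0x20BA2))         # the pair seen in the error message
print(combine_surrogates(55362, 57250))  # back to the real code point
print(char_reference("\U00020BA2"))      # the reference a serialiser should emit
```

A serialiser that writes one reference per UTF-16 code unit produces the invalid pair; writing one reference per code point produces the valid form.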
I'm having trouble finding out what's wrong with the JSON string I receive from http://www.hier-bin-ich-koenig.de/json/events, so I can't parse it. It doesn't validate, at least not with jsonlint, but I don't know where the issue is. So of course SBJson is unhappy too.
I also don't understand where that [Ô] is coming from. I'd love to know whether it's from the content or from the code that converts the content into JSON. Being able to find where the validation error is would be great.
The exact error sent by the tokeniser is:
JSONValue failed. Error is: Illegal start of token [Ô]
Your page includes a UTF-16 BOM (byte order mark), followed by a UTF-8 encoded document. You should drop the BOM entirely. It is not recommended for UTF-8 encoding.
I had the same problem when parsing a JSON string that was generated by a PHP page. I resolved it using Notepad++:
1. Open the PHP file.
2. Menu -> Encoding -> Encode in UTF-8 without BOM
3. Save.
That's it.
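When the server cannot be fixed, a client can defensively strip a leading BOM before handing the data to the JSON parser. The following Python sketch shows the idea; the function name loads_ignoring_bom is made up, and it assumes the body after the BOM is UTF-8, as described in the answer above.

```python
import codecs
import json

# BOMs we are willing to drop from the start of the payload.
_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def loads_ignoring_bom(raw: bytes):
    """Parse JSON from bytes, dropping one leading BOM if present.

    Assumption: everything after the BOM is UTF-8 encoded.
    """
    for bom in _BOMS:
        if raw.startswith(bom):
            raw = raw[len(bom):]
            break
    return json.loads(raw.decode("utf-8"))


payload = codecs.BOM_UTF16_LE + '{"name": "König"}'.encode("utf-8")
print(loads_ignoring_bom(payload))
```

The same check can be done in any client language by comparing the first two or three bytes of the response before parsing.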
I get some data from the server in Unicode. However I need this data in UTF8. How can I convert data to UTF8 encoding?
The ideal solution is that the server sends you UTF-8 in the first place.
UTF-8 is an encoding of Unicode, so depending on what you mean by “Unicode” in your question, it may already be doing that.
Cocoa misuses “Unicode” in the symbol NSUnicodeStringEncoding to refer to UTF-16. It's possible, but unlikely, that that's what the server is sending you.
The server should tell you in the Content-Type header what encoding it used for the content. You should look at that in your program rather than assuming the server will use any specific encoding.
If the encoding is not specified in the header, try treating it as UTF-8, and if that doesn't work, I suggest complaining to whoever runs the server.
To convert from any encoding supported by Cocoa to UTF-8, pass the input data and the encoding it's in to the -[NSString initWithData:encoding:] method, which will decode the data and produce a string; then, send the string a dataUsingEncoding: message with NSUTF8StringEncoding as the desired encoding.
Well, UTF-8 is an encoding of Unicode, but to get a string from UTF-8 data:
NSString *string = [[NSString alloc] initWithData:yourData encoding:NSUTF8StringEncoding];
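The decode-then-encode approach described above is not specific to Cocoa. As a rough sketch in Python, with hypothetical helper names and UTF-8 assumed as the fallback when the header names no charset:

```python
from email.message import Message


def charset_from_content_type(header: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def to_utf8(body: bytes, content_type: str) -> bytes:
    """Decode the body with the declared charset, then encode it as UTF-8."""
    text = body.decode(charset_from_content_type(content_type))
    return text.encode("utf-8")


body = "señal".encode("cp1252")
print(to_utf8(body, "text/xml; charset=windows-1252"))
```

Reading the charset from the header rather than hard-coding it follows the advice above about not assuming what the server sends.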