How to decode shift-jis encoded data in an XML document using Cocoa (iPhone) - iphone

I have an XML document that may have shift-jis encoded data in it and I'm trying to parse it using an NSXMLParser object.
Ordinarily I assume the document is UTF8 encoded and all is well - does anyone know if/how I can determine if an element is shift-jis encoded and then how to decode it?
Thanks

An XML document is UTF-8 encoded (or UTF-16, if it starts with a byte order mark) unless it has an XML declaration stating otherwise, for example:
<?xml version="1.0" encoding="shift_jis"?>
or:
<?xml version="1.0" encoding="cp932"?>
Any XML parser should detect the encoding given in the XML declaration. (Some parsers may not support some of the CJK codecs so will complain, but AIUI NSXMLParser should be fine.)
If you've got a file with Shift-JIS byte sequences that does not have such a stated encoding, or which contains Shift-JIS byte sequences in some elements and UTF-8 in others, what you have is not well-formed; it's not an XML document at all and no parser will read it.
If you've just got a missing encoding declaration, you really need to fix it at the source end, but in the meantime hacking in a suitable XML declaration or transcoding the bytes manually from Shift-JIS to UTF-8 before feeding it into the parser should help.
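As a rough illustration of the "transcode before parsing" workaround, here is a minimal sketch in Python rather than Cocoa. It assumes the bytes really are Shift-JIS throughout and that the document carries no encoding declaration; the function name `parse_undeclared_shift_jis` is made up for this example.

```python
import xml.etree.ElementTree as ET


def parse_undeclared_shift_jis(raw: bytes) -> ET.Element:
    """Parse XML bytes that are Shift-JIS encoded but declare no encoding."""
    # Decode strictly: a UnicodeDecodeError here means the Shift-JIS guess was wrong.
    text = raw.decode("shift_jis")
    # Re-encode as UTF-8, which is what a parser assumes without a declaration.
    return ET.fromstring(text.encode("utf-8"))


# Simulated input: Shift-JIS bytes with no XML declaration (an assumption).
raw = "<name>日本語</name>".encode("shift_jis")
root = parse_undeclared_shift_jis(raw)
print(root.tag, root.text)
```

The same idea should carry over to Cocoa: convert the data to UTF-8 first, then hand the result to the parser.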

Related

having problems opening DITA files in OxygenXML which contain special characters

I am having problems opening files which contain special characters like é, è, ë, ê, à, á, ö, etc. The error message I get from OxygenXML is:
File encoding (UTF8) does not support all characters from the current file.
To ignore these errors or to replace invalid characters follow the link below to change the "Encoding errors handling" option value from REPORT to IGNORE or REPLACE.
The strange thing is: when I alter the file (by swapping the 'ó' for an 'o', for instance), I can import the files both in OxygenXML and in FontoXML. Afterwards I can correct them again and save the file. But I don't see a difference between the original file and the altered file.
This is the original file
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is zó zenuwachtig, dat ze bijna aan de ... moet .</p>
And this is the saved corrected file (from FontoXML, in this case - just to show the added instructions):
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is
z<?fontoxml-change-addition-start author-id="erik.verhaar" change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22" timestamp="1627473671530"?>ó<?fontoxml-change-addition-end change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22"?><?fontoxml-change-deletion author-id="erik.verhaar" change-id="0296c77c-863b-421f-bf5c-c0901c7a2751" text="ó" timestamp="1627473669483"?>
zenuwachtig, dat ze bijna aan de ... moet .</p>
What is the difference between the original ó and the corrected one? And how can I change my original files so they can be imported in OxygenXML?
Thanks!!
Text files (XML for example) are saved on disk using bytes, they are edited and presented using characters. An encoding takes care of converting bytes to characters (sometimes multiple bytes are converted to characters) when the document is opened and again the encoding does the conversion of characters to bytes when the document is saved.
There are many encodings, but with the most popular ones (like UTF-8), characters in the ASCII range 0-127, such as a-z and A-Z, are saved as a single byte. Characters outside that range, for example e-acute (é), may be saved as one byte or as several, depending on the encoding used for saving.
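To make the byte/character distinction concrete, here is a small Python sketch that encodes a few characters with two encodings and prints the resulting bytes. The choice of Windows-1252 as the second encoding is only for comparison.

```python
def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of one character in the given encoding, as hex."""
    return ch.encode(encoding).hex()


for ch in ["a", "Z", "é", "ó"]:
    # ASCII letters are one byte in both encodings; accented letters differ.
    print(repr(ch), "utf-8:", encoded_hex(ch, "utf-8"),
          "cp1252:", encoded_hex(ch, "cp1252"))
```

The accented letters take two bytes in UTF-8 but a single, different byte in Windows-1252, which is why reading a file with the wrong encoding goes wrong only at those characters.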
When an XML document is opened, Oxygen tries to work out which encoding to use for reading it. If the XML document starts with an XML declaration (heading) that names an encoding, Oxygen uses that encoding.
If the XML document lacks the heading, Oxygen falls back to UTF-8. Basically, Oxygen implements the XML specification when it comes to detecting the encoding of the XML file:
https://www.w3.org/TR/xml/#sec-guessing
In your case Oxygen detected the encoding as UTF-8 and started to use UTF-8 to convert bytes to characters. It encountered a sequence of bytes which were not encoded using UTF-8. Oxygen does not continue loading the file because in such cases you may end up with corrupt content when saving it back.
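The check that fails here can be reproduced outside the editor. The following Python sketch finds the first byte that is not valid UTF-8; it assumes, purely for illustration, that the problem file was saved as Windows-1252, and `first_invalid_utf8` is a made-up helper name.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


sentence = "Ze is zó zenuwachtig"
legacy = sentence.encode("cp1252")   # assumed encoding of the problem file
proper = sentence.encode("utf-8")    # what the declaration promised

print(first_invalid_utf8(legacy))    # offset and value of the offending byte
print(first_invalid_utf8(proper))    # None: decodes cleanly
```

Running something like this over the original files would show where the bytes stop being valid UTF-8, even though both versions look identical on screen.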
In my opinion, the other editing tool you used to create the XML files was not XML-aware: it did not actually save the XML as UTF-8, even though the heading in the XML document says so.
We do not know which encoding that other tool actually used to save the XML. One thing you could try is to reopen the XML document in that tool and change its encoding declaration from:
<?xml version='1.0' encoding='UTF-8'?>
to:
<?xml version='1.0' encoding='CP1250'?>
because I suspect the other editing tool saved the XML document using the default platform encoding, which on Windows is usually a legacy code page such as CP1250 (Central European) or CP1252 (Western European), depending on the system locale.
Then save the XML document in the other editing tool and try to re-open it in Oxygen. If it works, change the encoding declaration back to UTF-8 and save the XML document in Oxygen so that it is properly stored using the UTF-8 encoding.
This older set of slides I made about XML encoding might also be useful to you:
https://www.oxygenxml.com/events/2018/large_xml_documents.pdf

MSXML.DOMDocument.4.0 loadXML with Chinese Unicode characters

Currently, I'm trying to use the MSXML loadXML method in ASP to load XML string which may contain Unicode Chinese characters like
𠮢 (U+20BA2) 4bytes
and the xml string looks like
<City>City</City><Name>𠮢</Name>
So, in my code, I can see that the XML string comes in correctly, but loadXML returns an error message like
Invalid unicode characters, &#55362;&#57250;
Can someone please tell me what I can do to resolve this issue?
Thanks,
Edited
The code looks like this
Set objDoc = CreateObject("MSXML2.DOMDocument")
objDoc.async = false
objDoc.setProperty "SelectionLanguage", "XPath"
objDoc.validateOnParse = false
objDoc.loadXML(strXml)
I suggest posting the exact code, XML source and error message you are getting. I cannot reproduce an error by parsing <element>𠮢</element> in MSXML 4.0 SP3; this works fine.
I certainly do get a parseError with reason "Invalid unicode character" by trying to parse <element>&#55362;&#57250;</element>, because that's not well-formed XML. If you do have this in your markup then you need to fix the serialiser that produced it because neither MSXML nor any standards-compliant XML parser will load it.
If 𠮢 is turned into a character reference it must be &#134050; (or &#x20BA2;). Code units 55362 and 57250 are 'surrogates', reserved for encoding astral plane characters in UTF-16. They can't be included in an XML document.
&#55362;&#57250; is the entity-encoded form of 0xD842 0xDFA2, which is the UTF-16 encoded form of the Unicode 𠮢 character. Make sure that the XML is completely UTF-16 encoded, not mixed single-byte ASCII and multi-byte UTF-16.
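The relationship between U+20BA2 and the code units 55362 and 57250 can be checked with a little arithmetic. The Python sketch below does that, and also shows one possible stopgap: rewriting a decimal surrogate-pair reference into a single valid character reference before parsing. The helper names are invented for this example, hexadecimal references are not handled, and fixing the serialiser remains the real solution.

```python
import re


def to_surrogates(code_point: int):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


_PAIR = re.compile(r"&#(\d+);&#(\d+);")


def fix_surrogate_references(xml_text: str) -> str:
    """Replace decimal surrogate-pair references with one valid reference."""
    out, pos = [], 0
    while True:
        m = _PAIR.search(xml_text, pos)
        if m is None:
            out.append(xml_text[pos:])
            return "".join(out)
        high, low = int(m.group(1)), int(m.group(2))
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            out.append(xml_text[pos:m.start()])
            out.append(f"&#{from_surrogates(high, low)};")
            pos = m.end()
        else:
            # Not a surrogate pair: keep the first reference, rescan from the second.
            second = m.start(2) - 2
            out.append(xml_text[pos:second])
            pos = second


print(to_surrogates(0x20BA2))
print(fix_surrogate_references("<Name>&#55362;&#57250;</Name>"))
```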

Reconstructing Windows-1252 characters from data incorrectly saved as UTF-8

I'm dealing with data that has been sampled using Java HtmlUnit. The webpage used Windows-1252 encoding but the response was retrieved as if the page was encoded as UTF-8 (ie when getContentAsString on the HtmlUnit WebResponse object was invoked, UTF-8 encoding was specified rather than deferring to the encoding specified in the server response). Is there any way to reverse this process to reconstruct the original Windows-1252 data from the incorrectly labelled UTF-8 character data?
Most other questions on this topic are concerned with identifying the type of file or converting from one stream type to another for characters correctly encoded in the first place. That is not the case here. I don't believe utilities such as iconv will work because they expect the streams to have been correctly persisted in their source encoding to begin with.
Probably not. If Windows-1252-encoded text gets mistaken for UTF-8, all non-ASCII codepoints would be damaged, because of the way UTF-8 deals with those codepoints. Only if you are very very lucky, and all non-ASCII codepoints come in pairs or triplets that, by pure chance, convert to real Unicode codepoints, you can reverse the process.
But you're pretty much out of luck.
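A short Python sketch shows why the damage is usually irreversible. It assumes the faulty decoding step replaced every undecodable byte with U+FFFD, which is the common default but is not confirmed for the HtmlUnit call in the question; `mislabel_as_utf8` is a hypothetical helper.

```python
def mislabel_as_utf8(original: str) -> str:
    """Encode as Windows-1252, then decode as UTF-8, replacing invalid bytes."""
    return original.encode("cp1252").decode("utf-8", errors="replace")


a = mislabel_as_utf8("café")
b = mislabel_as_utf8("cafè")

# Two different originals collapse to the same damaged string,
# so nothing in the result says which one it was.
print(repr(a), repr(b), a == b)
```

Plain ASCII text survives the round trip untouched, so only the non-ASCII characters are lost.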

parse xml with objective c

I have a problem when I parse XML because it contains this character: ö
<?xml version="1.0" encoding="UTF-8"?>
<rsp stat="ok">
<mediaid>abösjdk3</mediaid>
<mediaurl>http://twitöic.com/abc123</mediaurl>
</rsp>
The error I get:
parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x9A 0x74 0x68 0x65
<mediaid>ab\232sjdk3</mediaid>
^
Another question, please: if I want to parse text like "> 6 < 12 month" I will have a problem, and I don't want to replace the >. Does someone have a solution?
You'll have this problem with any parser, not only with objective-c.
That character isn't encoded as UTF-8 and as such it will halt any parser.
Either remove the encoding information or change it to the correct value.
Edited to answer a comment
i use GDataXmlNode to parse and in my xml file i not use <?xml version="1.0" encoding="UTF-8"?> – cs1.6
If the original XML file does not have the encoding attribute, then specify the proper encoding when you instantiate the parser or load the XML file. I have no idea which encoding that is.
The way the O.P. is posted implies that the character ö is encoded as \232. Read as decimal, 232 in ISO-8859-1 represents the character è, and ö is represented as 246. However, \232 looks like an octal escape for the byte 0x9A shown in the "Bytes:" line, and 0x9A is ö in Mac OS Roman, so the file may have been saved in that encoding.
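One way to narrow down the real encoding is to decode the byte from the error output (0x9A) with a few candidate codecs and see which one yields ö. The Python sketch below does that; the candidate list is my own guess, not something the O.P. confirmed.

```python
suspect = bytes([0x9A])          # first byte reported in the parser error

# \232 read as an octal escape is the same byte value.
print(int("232", 8) == 0x9A)

# Decode the byte with several single-byte codecs (candidates are assumptions).
decoded = {codec: suspect.decode(codec)
           for codec in ("mac_roman", "cp1252", "latin-1")}
for codec, ch in decoded.items():
    print(codec, repr(ch))

# For comparison: decimal 232 and 246 in ISO-8859-1.
print(repr(bytes([232]).decode("latin-1")), repr(bytes([246]).decode("latin-1")))
```

If Mac OS Roman turns out to be right, declaring that encoding or converting the file to UTF-8 before parsing should get rid of the error.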
Go through this, it will help...

how to parse XML which contains data in Norwegian language?

how to parse XML which contains data in Norwegian language ?
Do I need any particular encoding with NSXMLParser?
Thanks.
I guess you are worried about non-ASCII characters in the XML file. Well you don't need to. The first line of an XML file should look something like:
<?xml version="1.0" encoding="UTF-8"?>
where the encoding attribute tells you which character set was used to encode the characters in the file. NSXMLParser will use that line to determine which character set it will use. Once it gets to your methods, all the text will be in NSStrings which will be able to cope with your Norwegian characters automatically.
All you need to be concerned about is that the file really is encoded in the character set that the first line says it is.
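The same behaviour can be seen with any conforming parser. This Python sketch (not NSXMLParser) parses Norwegian text from a document declared as ISO-8859-1, once with bytes that match the declaration and once with bytes that do not; the sample element name and text are invented.

```python
import xml.etree.ElementTree as ET

document = '<?xml version="1.0" encoding="ISO-8859-1"?><by>Tromsø</by>'

# Bytes match the declared encoding: the parser decodes them correctly.
good = ET.fromstring(document.encode("iso-8859-1")).text

# Bytes are really UTF-8 but the declaration still says ISO-8859-1:
# the parser trusts the declaration and produces mojibake.
bad = ET.fromstring(document.encode("utf-8")).text

print(good)
print(bad)
```

The second result is the typical symptom of a file whose first line does not match how it was actually saved.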
XML does not care which human language your data is written in. Every start tag must have a matching end tag; once that holds, you can parse the document with any XML parser.
Here is a tutorial for understanding XML, and
here is a tutorial on parsing an XML file.
Maybe this will be helpful for your problem.