How do I access the encoding of an XML file using TBXML?
To clarify: I would like to access the top line of an XML file and read the value of its encoding attribute there, using TBXML.
<?xml version="1.0" encoding="utf-8"?>
You can't, it's not part of the XML document.
Why do you want to know the encoding?
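If you really do need the declared value, TBXML itself won't give it to you, but you can read the prolog from the raw bytes yourself. Here is a minimal Objective-C sketch of that idea; the helper name DeclaredXMLEncoding and the file path are made up for illustration:

#import <Foundation/Foundation.h>

// Sketch: read the start of the raw file, decode it with a single-byte
// encoding (the XML declaration itself is ASCII-compatible), and pull out
// the encoding pseudo-attribute with a regular expression.
NSString *DeclaredXMLEncoding(NSString *path) {
    NSData *data = [NSData dataWithContentsOfFile:path];
    if (data.length == 0) return nil;

    // The declaration, if present, sits in the first few dozen bytes.
    NSUInteger len = MIN((NSUInteger)100, data.length);
    NSString *head = [[NSString alloc] initWithData:[data subdataWithRange:NSMakeRange(0, len)]
                                           encoding:NSISOLatin1StringEncoding];
    if (head == nil) return nil;

    NSRegularExpression *re =
        [NSRegularExpression regularExpressionWithPattern:@"encoding=[\"']([^\"']+)[\"']"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    NSTextCheckingResult *m = [re firstMatchInString:head
                                             options:0
                                               range:NSMakeRange(0, head.length)];
    return m ? [head substringWithRange:[m rangeAtIndex:1]] : nil;
}

Note that this only tells you what the file claims; it does not prove the bytes are actually encoded that way.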
I am having problems opening files which contain special characters like é, è, ë, ê, à, á, ö, etc. The error message I get from OxygenXML is:
File encoding (UTF8) does not support all characters from the current file.
To ignore these errors or to replace invalid characters follow the link below to change the "Encoding errors handling" option value from REPORT to IGNORE or REPLACE.
The strange thing is: when I alter the file (by swapping the 'ó' for an 'o', for instance), I can import it both in OxygenXML and in FontoXML. Afterwards I can correct it again and save the file. But I don't see a difference between the original file and the altered file.
This is the original file:
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is zó zenuwachtig, dat ze bijna aan de ... moet .</p>
And this is the saved corrected file (from FontoXML, in this case - just to show the added instructions):
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is
z<?fontoxml-change-addition-start author-id="erik.verhaar" change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22" timestamp="1627473671530"?>ó<?fontoxml-change-addition-end change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22"?><?fontoxml-change-deletion author-id="erik.verhaar" change-id="0296c77c-863b-421f-bf5c-c0901c7a2751" text="ó" timestamp="1627473669483"?>
zenuwachtig, dat ze bijna aan de ... moet .</p>
What is the difference between the original ó and the corrected one? And how can I change my original files so they can be imported in OxygenXML?
Thanks!!
Text files (XML, for example) are saved on disk as bytes but edited and presented as characters. An encoding takes care of converting bytes to characters (sometimes multiple bytes map to a single character) when the document is opened, and of converting characters back to bytes when the document is saved.
There are many encodings, but with the most popular ones (like UTF-8) characters in the 0-127 ASCII range, such as a-z and A-Z, are usually saved as a single byte. Characters outside that range, for example e-acute (é), usually get saved as multiple bytes, depending on the encoding used for saving.
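To illustrate the single-byte versus multi-byte point, here is a small Objective-C sketch showing how the same character turns into different byte sequences under different encodings (the three encodings chosen are just examples):

#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // "é" is a single character, but its on-disk byte count depends on the encoding.
        NSString *eAcute = @"é";
        NSData *utf8   = [eAcute dataUsingEncoding:NSUTF8StringEncoding];      // 2 bytes: 0xC3 0xA9
        NSData *latin1 = [eAcute dataUsingEncoding:NSISOLatin1StringEncoding]; // 1 byte:  0xE9
        NSData *ascii  = [eAcute dataUsingEncoding:NSASCIIStringEncoding];     // nil: é is not ASCII
        NSLog(@"UTF-8: %@  Latin-1: %@  ASCII: %@", utf8, latin1, ascii);
    }
    return 0;
}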
When an XML document is opened, Oxygen attempts to work out which encoding to use for reading it. If the XML document has a heading like this:
<?xml version="1.0" encoding="UTF-8"?>
Oxygen uses the encoding specified in the heading. If the XML document lacks the heading, Oxygen falls back to UTF-8. Basically Oxygen implements the XML specification when it comes to detecting the encoding of the XML file:
https://www.w3.org/TR/xml/#sec-guessing
In your case Oxygen detected the encoding as UTF-8 and started to use UTF-8 to convert bytes to characters, but it encountered a sequence of bytes that was not valid UTF-8. Oxygen does not continue loading the file, because in such cases you could end up with corrupt content when saving it back.
In my opinion the other editor tool you used to create the XML files was not XML-aware: it did not actually save the XML as UTF-8, even though the heading in the XML document specified this.
We do not actually know which encoding that other editing tool used to save the XML. One thing you could try is to reopen the XML document in that other editing tool and change its encoding heading declaration from:
<?xml version='1.0' encoding='UTF-8'?>
to:
<?xml version='1.0' encoding='CP1250'?>
because I suspect that the other editing tool actually saved the XML document using the default platform encoding, which on Windows should usually be CP1250.
Then save the XML document in the other editing tool and try to re-open it in Oxygen. If it works, change its heading encoding declaration back to UTF-8 and save the XML document in Oxygen so that it is properly saved using the UTF-8 encoding.
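If you cannot fix it in the other tool, the same repair can be done programmatically. A small Objective-C sketch of the idea; the function name is made up, and CP1250 is only the guess discussed above, not a confirmed fact:

#import <Foundation/Foundation.h>

// Sketch: reinterpret the raw bytes using the suspected encoding (CP1250 here,
// only a guess) and save the result back as genuine UTF-8.
BOOL TranscodeCP1250FileToUTF8(NSString *path, NSError **error) {
    NSData *raw = [NSData dataWithContentsOfFile:path options:0 error:error];
    if (raw == nil) return NO;

    NSString *text = [[NSString alloc] initWithData:raw
                                           encoding:NSWindowsCP1250StringEncoding];
    if (text == nil) return NO; // the guess was wrong, try another encoding

    // Make sure the XML declaration says encoding="UTF-8" afterwards.
    return [text writeToFile:path atomically:YES
                    encoding:NSUTF8StringEncoding error:error];
}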
This older set of slides I made about XML encoding might also be useful to you:
https://www.oxygenxml.com/events/2018/large_xml_documents.pdf
I have a problem when I parse XML because it contains this character: ö
<?xml version="1.0" encoding="UTF-8"?>
<rsp stat="ok">
<mediaid>abösjdk3</mediaid>
<mediaurl>http://twitöic.com/abc123</mediaurl>
</rsp>
The parser output:
parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x9A 0x74 0x68 0x65
<mediaid>ab\232sjdk3</mediaid>
^
Another question, please: if I want to parse something like "> 6 < 12 month", will I have a problem? I don't want to replace the > character. Does someone have a solution?
You'll have this problem with any parser, not only with Objective-C.
That character isn't encoded as UTF-8 and as such it will halt any parser.
Either remove the encoding information or change it to the correct value.
Edited to answer a comment
I use GDataXmlNode to parse, and in my XML file I do not use <?xml version="1.0" encoding="UTF-8"?> – cs1.6
If the original XML file does not have the encoding attribute, then specify the proper encoding either when you instantiate the parser or when you load the XML file; I have no idea which encoding that is.
Because of the way the O.P. is posted, it implies that the character ö is encoded as \232. However, decimal 232 in ISO-8859-1 represents the character è; the character ö is represented as \246.
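One option, sketched below in Objective-C: decode the raw bytes with whatever encoding you believe was really used (Latin-1 here is purely an assumption, since, as noted above, we don't actually know it) and hand the parser clean UTF-8 data. The helper name is made up:

#import <Foundation/Foundation.h>

// Sketch: try strict UTF-8 first; if that fails, fall back to a guessed
// single-byte encoding and re-encode the text as real UTF-8 before
// handing it to the XML parser.
NSData *UTF8DataForParsing(NSData *raw) {
    NSString *text = [[NSString alloc] initWithData:raw encoding:NSUTF8StringEncoding];
    if (text == nil) {
        // Latin-1 decoding never fails, so this only gives correct text
        // if the guess matches the encoding the file was really saved in.
        text = [[NSString alloc] initWithData:raw encoding:NSISOLatin1StringEncoding];
    }
    return [text dataUsingEncoding:NSUTF8StringEncoding];
}

The resulting data is valid UTF-8, so a declaration of encoding="UTF-8" (or no declaration at all) will then be accurate.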
Go through this, it will help...
How do I parse XML which contains data in the Norwegian language?
Do I need any type of encoding with NSXMLParser?
Thanks.
I guess you are worried about non-ASCII characters in the XML file. Well, you don't need to be. The first line of an XML file should look something like:
<?xml version="1.0" encoding="UTF-8"?>
where the encoding attribute tells you which character set was used to encode the characters in the file. NSXMLParser will use that line to determine which character set it will use. Once it gets to your methods, all the text will be in NSStrings which will be able to cope with your Norwegian characters automatically.
All you need to be concerned about is that the file really is encoded in the character set that the first line says it is.
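A minimal sketch of what that looks like in practice; the delegate class and the sample XML are invented for illustration:

#import <Foundation/Foundation.h>

// Minimal delegate: logs the character data of each element.
@interface EchoDelegate : NSObject <NSXMLParserDelegate>
@end

@implementation EchoDelegate
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // Norwegian characters such as æ, ø and å arrive here as ordinary NSString content.
    NSLog(@"text: %@", string);
}
@end

int main(void) {
    @autoreleasepool {
        NSString *xml = @"<?xml version=\"1.0\" encoding=\"UTF-8\"?><by>Tromsø</by>";
        NSXMLParser *parser =
            [[NSXMLParser alloc] initWithData:[xml dataUsingEncoding:NSUTF8StringEncoding]];
        EchoDelegate *delegate = [[EchoDelegate alloc] init];
        parser.delegate = delegate;
        [parser parse];
    }
    return 0;
}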
XML is a format that does not care which natural language you are using! In XML there should be a start tag and its matching end tag; then you can parse it with any XML parser.
Here is a tutorial for understanding XML, and
here is a link to a tutorial on parsing an XML file.
Maybe this will be helpful for your problem.
I have an XML document that may have Shift-JIS encoded data in it and I'm trying to parse it using an NSXMLParser object.
Ordinarily I assume the document is UTF-8 encoded and all is well. Does anyone know if/how I can determine whether an element is Shift-JIS encoded, and then how to decode it?
Thanks
An XML document is UTF-8 encoded unless it has an XML declaration stating otherwise, for example:
<?xml version="1.0" encoding="shift_jis"?>
or:
<?xml version="1.0" encoding="cp932"?>
Any XML parser should detect the encoding given in the XML declaration. (Some parsers may not support some of the CJK codecs so will complain, but AIUI NSXMLParser should be fine.)
If you've got a file with Shift-JIS byte sequences that does not have such a stated encoding, or which contains Shift-JIS byte sequences in some elements and UTF-8 in others, what you have is not well-formed; it's not an XML document at all and no parser will read it.
If you've just got a missing encoding declaration, you really need to fix it at the source end, but in the meantime hacking in a suitable XML declaration or transcoding the bytes manually from Shift-JIS to UTF-8 before feeding it into the parser should help.
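A sketch of the manual transcoding route in Objective-C, assuming the whole byte stream really is Shift-JIS (the function name is made up):

#import <Foundation/Foundation.h>

// Sketch: if the bytes are Shift-JIS but the document carries no (or a wrong)
// encoding declaration, transcode them to UTF-8 before parsing.
NSData *UTF8DataFromShiftJIS(NSData *shiftJISData) {
    NSString *text = [[NSString alloc] initWithData:shiftJISData
                                           encoding:NSShiftJISStringEncoding];
    if (text == nil) return nil; // the bytes were not valid Shift-JIS

    return [text dataUsingEncoding:NSUTF8StringEncoding];
}

If the original bytes contain an XML declaration that still names another encoding, fix or remove it before feeding the result to NSXMLParser.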
How do I use a specific encoding with NSXMLParser?
It defaults to UTF-8; I want to use TIS-620.
any idea?
If there is correct encoding info inside the XML (first line, e.g. <?xml version="1.0" encoding="UTF-8"?>), everything should work out of the box.
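If the declaration is missing or the parser refuses the file, you can decode the bytes yourself and hand NSXMLParser UTF-8 instead. A hedged sketch: Foundation has no dedicated TIS-620 NSStringEncoding constant, so this goes through Core Foundation's ISO 8859-11 encoding (which is TIS-620 plus a non-breaking space); the function name is made up:

#import <Foundation/Foundation.h>

// Sketch: decode TIS-620 / ISO 8859-11 bytes manually and re-encode as UTF-8.
NSData *UTF8DataFromTIS620(NSData *tis620Data) {
    NSStringEncoding enc =
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingISOLatinThai);
    NSString *text = [[NSString alloc] initWithData:tis620Data encoding:enc];
    // If the document still declares encoding="tis-620", adjust or strip that
    // declaration so it matches the UTF-8 bytes you now pass to the parser.
    return [text dataUsingEncoding:NSUTF8StringEncoding];
}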