How to parse XML which contains data in the Norwegian language? - iPhone

How do I parse XML that contains data in the Norwegian language?
Do I need any special encoding with NSXMLParser?
Thanks.

I guess you are worried about non-ASCII characters in the XML file. Well, you don't need to be. The first line of an XML file should look something like:
<?xml version="1.0" encoding="UTF-8"?>
where the encoding attribute tells you which character set was used to encode the characters in the file. NSXMLParser uses that line to work out which character set to read the file with. Once it gets to your delegate methods, all the text will be in NSStrings, which cope with your Norwegian characters automatically.
All you need to be concerned about is that the file really is encoded in the character set that the first line says it is.
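For example, a minimal NSXMLParser setup looks like this (a sketch only: xmlData and currentText are placeholder names, not anything from the question):

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:xmlData]; // xmlData holds the raw bytes of the file
parser.delegate = self;
[parser parse];

// ...and in the delegate, the text already arrives as NSString with ø, æ and å decoded:
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.currentText appendString:string]; // currentText: an NSMutableString you own
}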

XML itself does not care which natural language your data is written in. As long as every start tag has a matching end tag (i.e. the document is well formed), you can parse it as usual.
Here is a tutorial to understand XML, and
here is a link to a tutorial on parsing an XML file.
Hopefully this helps with your problem.

Related

dotNetRDF OWL file encoding for Chinese

I have an OWL file generated by Protégé. Some of the class names and property names contain Chinese words like "苹果".
The file looks fine when I simply open it. However, when I use OntologyGraph to load the OWL file and foreach over its OntologyClass instances, the names come out garbled.
Does dotNetRDF support Chinese? How can I set the encoding with dotNetRDF?
Thanks for answering!
The problem might be with the file encoding, similar to the one reported in this question.
A Protégé .owl file is an XML file whose first line should specify what the file encoding is. If that line is missing, or declares an encoding that doesn't match the actual bytes of the file, dotNetRDF may misread the file, leading to errors.
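If you want to check, open the .owl file in a text editor and make sure its first line declares the encoding the file is actually saved in; for a UTF-8 file that line would typically look like this (illustrative, the exact declaration Protégé writes may differ):

<?xml version="1.0" encoding="UTF-8"?>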

having problems opening DITA files in OxygenXML which contain special characters

I am having problems opening files which contain special characters like é, è, ë, ê, à, á, ö, etc. The error message I get from OxygenXML is:
File encoding (UTF8) does not support all characters from the current file.
To ignore these errors or to replace invalid characters follow the link below to change the "Encoding errors handling" option value from REPORT to IGNORE or REPLACE.
The strange thing is: when I alter the file (by swapping the 'ó' for an 'o', for instance), I can import the file both in OxygenXML and in FontoXML. Afterwards I can correct it again and save the file. But I don't see a difference between the original file and the altered file.
This is the original file
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is zó zenuwachtig, dat ze bijna aan de ... moet .</p>
And this is the saved corrected file (from FontoXML, in this case - just to show the added instructions):
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is
z<?fontoxml-change-addition-start author-id="erik.verhaar" change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22" timestamp="1627473671530"?>ó<?fontoxml-change-addition-end change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22"?><?fontoxml-change-deletion author-id="erik.verhaar" change-id="0296c77c-863b-421f-bf5c-c0901c7a2751" text="ó" timestamp="1627473669483"?>
zenuwachtig, dat ze bijna aan de ... moet .</p>
What is the difference between the original ó and the corrected one? And how can I change my original files so they can be imported in OxygenXML?
Thanks!!
Text files (XML, for example) are saved on disk as bytes, but they are edited and presented as characters. An encoding takes care of converting bytes to characters (sometimes several bytes make up one character) when the document is opened, and converts characters back to bytes when the document is saved.
There are many encodings, but with the most popular ones (like UTF-8), characters in the 0-127 ASCII range, such as a-z and A-Z, are saved as a single byte each. Characters outside that range, for example e-acute (é), are usually saved as multiple bytes, depending on the encoding used for saving.
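As a concrete illustration using the Cocoa classes mentioned elsewhere in this thread (the variable names are made up):

NSString *eAcute = @"é"; // U+00E9
NSData *asUTF8 = [eAcute dataUsingEncoding:NSUTF8StringEncoding];        // 2 bytes: 0xC3 0xA9
NSData *asLatin1 = [eAcute dataUsingEncoding:NSISOLatin1StringEncoding]; // 1 byte: 0xE9
// The same character becomes different byte sequences depending on the encoding,
// which is why the declared encoding and the actual bytes of the file must agree.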
When an XML document is opened, Oxygen tries to work out which encoding to use for reading it. If the XML document starts with a declaration like this:
<?xml version='1.0' encoding='UTF-8'?>
Oxygen uses the encoding specified in that declaration. If the declaration is missing, Oxygen falls back to UTF-8. Essentially, Oxygen implements the XML specification's rules for detecting the encoding of an XML file:
https://www.w3.org/TR/xml/#sec-guessing
In your case Oxygen detected the encoding as UTF-8 and started using UTF-8 to convert the bytes to characters. At some point it encountered a sequence of bytes that is not valid UTF-8. Oxygen does not continue loading the file, because in such cases you could end up with corrupt content when saving it back.
In my opinion, the other editing tool you used to create the XML files was not XML aware: it did not actually save the XML as UTF-8, even though the declaration in the XML document says so.
We do not actually know which encoding that other editing tool used to save the XML. One thing you could try is to reopen the XML document in that other editing tool and change its encoding declaration from:
<?xml version='1.0' encoding='UTF-8'?>
to:
<?xml version='1.0' encoding='CP1250'?>
because I suspect the other editing tool actually saved the XML document using the default platform encoding, which on Windows is usually CP1250.
Then save the XML document in the other editing tool and try to re-open it in Oxygen. If that works, change its encoding declaration back to UTF-8 and save the XML document in Oxygen, so that it really is saved using the UTF-8 encoding.
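(If you have many such files, a small script can transcode them instead of round-tripping each one through an editor. A rough Cocoa sketch, assuming the files really are CP1250; the path is a placeholder:)

NSError *error = nil;
NSString *content = [NSString stringWithContentsOfFile:@"/path/to/topic.dita"
                                              encoding:NSWindowsCP1250StringEncoding
                                                 error:&error];
if (content != nil) {
    // Re-save the same text as real UTF-8 so it matches the encoding declaration.
    [content writeToFile:@"/path/to/topic.dita"
              atomically:YES
                encoding:NSUTF8StringEncoding
                   error:&error];
}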
This older set of slides I made about XML encoding might also be useful to you:
https://www.oxygenxml.com/events/2018/large_xml_documents.pdf

Get encoding for xml-file using TBXML

How do I access the encoding for an xml-file using TBXML?
To clarify: I would like to access the first line of the XML file and read the value of its encoding attribute using TBXML.
<?xml version="1.0" encoding="utf-8"?>
You can't with TBXML: the XML declaration is part of the prolog, not an element, so TBXML's element tree never exposes it.
Why do you want to know the encoding?
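If you really do need the declared encoding, one workaround is to peek at the prolog yourself before handing the bytes to TBXML. A rough sketch (this is not a TBXML API; the helper name is made up, and it only handles double-quoted declarations):

// Returns the value of the encoding attribute from the XML declaration, or nil.
NSString *DeclaredEncodingForXMLData(NSData *data) {
    NSUInteger length = MIN((NSUInteger)200, data.length);
    // Latin-1 maps every byte to some character, so this conversion never fails outright.
    NSString *prolog = [[NSString alloc] initWithBytes:data.bytes
                                                length:length
                                              encoding:NSISOLatin1StringEncoding];
    NSRange attr = [prolog rangeOfString:@"encoding=\""];
    if (attr.location == NSNotFound) return nil;
    NSUInteger start = NSMaxRange(attr);
    NSRange closingQuote = [prolog rangeOfString:@"\""
                                         options:0
                                           range:NSMakeRange(start, prolog.length - start)];
    if (closingQuote.location == NSNotFound) return nil;
    return [prolog substringWithRange:NSMakeRange(start, closingQuote.location - start)];
}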

query about xml parsing

I just want to know: are there any limitations on the characters that can be parsed from XML?
For example, can we parse words containing characters like
"frühe", containing "ü", or
"böser", containing "ö"?
The XML I am parsing contains a few different languages, and some characters are like the ones above.
When I watch the console, the output gets interrupted exactly when it reaches "ü":
the console prints only "fr".
Can someone give me some ideas about this?
Regards,
shishir
If you are using the standard NSXMLParser class and the XML file has the correct encoding= attribute, you shouldn't have anything to worry about. The console output probably isn't Unicode-aware, so it is interpreting the multi-byte UTF-8 characters literally. Try showing the parsed text in a UIAlertView or some other UI element and see if you still have problems.
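For example, instead of trusting the console, accumulate the parsed text and put it into the UI (a sketch using the UIAlertView API of that era; parsedText is an NSMutableString you fill in foundCharacters:):

- (void)parserDidEndDocument:(NSXMLParser *)parser {
    // If "frühe" shows up intact here, parsing is fine and only the console display was at fault.
    UIAlertView *alert = [[UIAlertView alloc] initWithTitle:@"Parsed text"
                                                    message:self.parsedText
                                                   delegate:nil
                                          cancelButtonTitle:@"OK"
                                          otherButtonTitles:nil];
    [alert show];
}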

How to decode shift-jis encoded data in an XML document using Cocoa (iPhone)

I have an XML document that may have shift-jis encoded data in it and I'm trying to parse it using an NSXMLParser object.
Ordinarily I assume the document is UTF8 encoded and all is well - does anyone know if/how I can determine if an element is shift-jis encoded and then how to decode it?
Thanks
An XML document is UTF-8 encoded unless it has an XML declaration stating otherwise, for example:
<?xml version="1.0" encoding="shift_jis"?>
or:
<?xml version="1.0" encoding="cp932"?>
Any XML parser should detect the encoding given in the XML declaration. (Some parsers may not support some of the CJK codecs so will complain, but AIUI NSXMLParser should be fine.)
If you've got a file with Shift-JIS byte sequences that does not have such a stated encoding, or which contains Shift-JIS byte sequences in some elements and UTF-8 in others, what you have is not well-formed; it's not an XML document at all and no parser will read it.
If you've just got a missing encoding declaration, you really need to fix it at the source end, but in the meantime hacking in a suitable XML declaration or transcoding the bytes manually from Shift-JIS to UTF-8 before feeding it into the parser should help.
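If you only control the consuming side, here is a rough Cocoa sketch of the transcoding approach (assuming the bytes really are Shift-JIS and the document carries no conflicting encoding declaration; rawData is whatever you downloaded or read from disk):

NSString *decoded = [[NSString alloc] initWithData:rawData
                                          encoding:NSShiftJISStringEncoding];
if (decoded != nil) {
    // With no declaration present the parser assumes UTF-8, so the re-encoded bytes parse cleanly.
    NSData *utf8Data = [decoded dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:utf8Data];
    parser.delegate = self;
    [parser parse];
}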