How to parse XML with special characters? - iphone

Whenever I try to parse XML with special characters such as ō or 満月先生 I get an error. The xml documents claims to use UTF-8 encoding but that does not seem to be the case.
Here is what the troublesome text looks like when I view the XML in Firefox:
Bleach: The Diamond Dust
Rebellion - M� Hitotsu no
Hy�rinmaru; Bleach - The
DiamondDust Rebellion - Mou Hitotsu no
Hyourinmaru
On the actual website, Å� is actually the character ō.
<br /> One day,
Doraemon and his friends meet
Professor Mangetsu
(����,
Professor Mangetsu?), who studies
magic and magical beings such as
goblins, and his daughter Miyoko
(���,
Miyoko?), and are warned of the
dangerous approximation of the
"star of the
Underworld" to the
Earth's orbit.<br />
<br />
And once again, on the actual website, those characters appear as 満月先生 and 美夜子.
The actual XML file is formatted properly other than those special characters, which certainly do not appear to be using the UTF-8 encoding. Is there a way to get NSXML to parse these XML files?

To use other characters than those who are utf-8, you need to use their special character code. If you want to represent ö you need to type ö
Find more on
Wikipedia: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

Related

Unicode and UTF-8 difference, lof of inconsistencies from the whole internet

I know there are a lot of answers about this subject, but I need some clarification.
From what I've understood, ASCII and Unicode are both charsets,
they tell you that A is decimal(41) and B is decimal(42) for example.
UTF-8, UTF-16, UTF-32, and ANSI are encodings
they are tasked with storing those 41 and 42 numbers into a binary form of their liking and managing their retrieval and conversion back to decimal. Then with the charset, you are able to get the corresponding char.
But, I was looking into how to get which charset/encoding is used by a webpage and I did tools>page information on Firefox.
And I can read this: charset=utf-8
(this is the page: http://www.leboncoin.fr/annonces/offres/ile_de_france/)
Is this a bug in Firefox?
Or, did I completely misunderstand charset/encoding?
You have slightly misunderstood character sets, though this is not a big issue. A character set is just the set of available characters, it doesn't have to reference any numbers (though they almost always do). See also: What's the difference between encoding and charset?
The real issue here is the use of charset. It comes from an HTML5 meta tag that often looks something like this:
<meta charset="utf-8" />
Despite the name, charset actually specifies a character encoding in HTML5, not a character set. This is likely due to historical confusion between character sets and encodings, as there was not much difference between the two before Unicode introduced multiple encodings for a single character set.

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing .txt file from a remote server and saving it to a database. I use a .Net script for this purpose. I sometimes notice a garbled word/characters (Ullerهkersvنgen) inside the files, which makes a problem while saving to the database.
I want to filter all such characters and convert them to unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script does know the correct encoding of your text snippet than that should be the regular expression to find all Non-ASCII charactres:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character encoding assumption or as munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian text are normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)

Difference between Unicode format?

I noticed something while uploading some unicode data to the database. When the content is uploaded throught textarea, is gets stored in क format, but when you personally type or paste the unicode and insert it hardcoded in php, then it would store in ठformat. But for both, the unicode character is same क.
Now please tell me the difference between the different formats of unicode characters. And how they affect the development. There has to be some limitations in those formats.
& #2325; is markup used in HTML to represent a Unicode character
If you hard code something in a php source file, Make sure you are opening it with editor that correctly displays text files with unicode characters in it.
http://www.joelonsoftware.com/articles/Unicode.html is good place to know the basics of unicode.
UTF-8 encoding of क has the byte sequence E0 A4
Now if somebody interprets this as 8 bit Latin encoding it will think it is two characters
you will see in the table in the above link E0 is à and A4 is ¤
When the content is uploaded throught textarea, is gets stored in क format,
Forms should not submit content in a character-reference (&#...;) format.
But in reality, they do in most current browsers... but only when they can't submit the character in question in any other way. In this case, you can't tell whether the user originally typed क or क, it is a lossy encoding.
To avoid this, make sure you are serving your page in a charset that supports all possible Unicode characters. In practical terms this means always use UTF-8, and serve your page with the Content-Type: text/html;charset=utf-8 header and/or the <meta http-equiv="Content=Type" content="text/html;charset=utf-8"/> element in the header. You'll then get all characters in simple, uncorrupted UTF-8 format.

Should I use hex ascii accented character code in HTML or use the actual character?

I have several huge CSVs with lots of accented characters in html hex code: é for é and lots of others, even – for –, etc.
My site is a wiki for people to update listings. So when they are presented a textarea for update, the existing content is filled in, and obviously those hex codes will be shown.
Should I be bothered replacing those codes with actual accented characters, or just leave it as it is? I wrote a script to replace the characters, but somehow the output are weird characters. Probably the format saved in Ruby isn't in UTF-8 format.
By default my site is in UTF-8, and the accented characters are displayed properly with some html coding in the view.
Please advise. Thanks.
Could you clarify what the problem is?
If your data (CSV) is in UTF-8, and the default encoding of your site is UTF-8, then all you would need to do is make sure that when users are editing content, that content is properly treated as UTF-8.
You may not need to display the markup to the users. Perhaps you could leverage a WYSIWIG editor package like TinyMCE?

query about xml parsing

i just want to knw,is there any boundations in xml parsing with characters
like can we parse a word containing some characters like
"frühe" containing "ü"
"böser" containing "ö"
while i am parsing my xml,which is few different languages, some characters are like the above.
and wen i saw in console, it get interpted,exaactly wen it reacher "ü"
becoz at console it prints "fr"
so can someone provide me some ideas about this thing
regards
shishir
If you are using the standard NSXmlParser class and the XML file has the correct encoding= attribute then you shouldn't have anything to worry about. The console output probably isn't unicode-aware so it is interpreting the multi-byte UTF-8 characters literally. Try showing the parsed text in a UIAlertView or some other UI element and see if you still have problems.