SBJson parser unhappy with [Ô] - iphone

I'm having trouble finding out what's wrong with the JSON string I receive from http://www.hier-bin-ich-koenig.de/json/events so that I can parse it. It doesn't validate, at least not with jsonlint, but I can't tell where the problem is. So of course SBJson is unhappy too.
I also don't understand where that [Ô] is coming from. I'd love to know if it's from the content or the code that's converting the content into json. Being able to find where the validation error is would be great.
The exact error sent by the tokeniser is:
JSONValue failed. Error is: Illegal start of token [Ô]

Your page includes a UTF-16 BOM (byte order mark) followed by a UTF-8 encoded document. You should drop the BOM entirely; a BOM is not recommended for UTF-8-encoded content in any case.
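If you cannot fix the server right away, you can also strip a leading BOM from the raw response bytes before handing them to the parser. Below is a minimal sketch of the idea in Python; the question is iOS/SBJson, so treat it as pseudocode for the approach (on iOS you would trim the same bytes from the NSData before building the string), and the filename events.json is only a stand-in for the downloaded feed.

import codecs
import json

def strip_bom(raw: bytes) -> bytes:
    # Remove a leading UTF-8 or UTF-16 byte order mark, if present.
    for bom in (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if raw.startswith(bom):
            return raw[len(bom):]
    return raw

raw = open("events.json", "rb").read()              # hypothetical local copy of the feed
data = json.loads(strip_bom(raw).decode("utf-8"))   # parses once the BOM is gone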

I had the same problem when parsing a JSON string generated by a PHP page. I resolved it with Notepad++:
1. Open the PHP file.
2. Menu -> Encoding -> Encode in UTF-8 without BOM.
3. Save.
That's it.

What does this decode to, and is it UTF? Игорќ

I have received this in a name field (so it should be a person's name)
Игорќ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint, or links to websites that explain how to work out what meaningful letters I should get out of that, it would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s = "Игорќ"; // the garbled text (source file compiled as UTF-8)
byte[] b = s.getBytes(Charset.forName("windows-1252")); // re-encode with the wrong charset to recover the original bytes
System.out.println(new String(b, StandardCharsets.UTF_8)); // decode those bytes as UTF-8
// prints: Игорќ
The data might easily have been corrupted further along the way. Above I got a sensible result with Windows-1252 (MS Windows Latin-1). The Java source must be compiled with encoding UTF-8 to accept those characters.
Since you pasted the original text into a UTF-8 encoded site such as Stack Overflow, your code is now corrupt data perfectly encoded as UTF-8. If you want to find out anything about the data's encoding you need to use a hexadecimal editor, or a similar tool, on the original raw bytes.
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Игорќ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
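To reproduce that round trip without an editor, here is a minimal Python sketch of the same steps, assuming the single-byte encoding involved is Windows-1252 as in the experiment above:

name = "Игорќ"
# Encoding the real name as UTF-8 and decoding the bytes as Windows-1252
# produces exactly this kind of "Ð...Ñ..." gibberish.
gibberish = name.encode("utf-8").decode("windows-1252")
print(gibberish)
# Reversing the two steps recovers the original name.
print(gibberish.encode("windows-1252").decode("utf-8"))   # Игорќ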

Polish name (Wężarów) returned from json service as W\u0119\u017car\u00f3w, renders as WÄ™Å¼arÃ³w. Can't figure out encoding/charset.

I'm using DB-IP.com to get city names from IP addresses. Many of these are international cities, with special characters in the names.
As an example, one of these cities is Wężarów in Poland. Checking the JSON return in the console or opening the request URL directly, it's being returned from DB-IP as "W\u0119\u017car\u00f3w" with a Content-Type of text/javascript;charset=UTF-8. This is rendered in the browser as WÄ™Å¼arÃ³w - it is also saved in my mysql database as WÄ™Å¼arÃ³w (which I've tried with both utf8 and latin1 encoding).
I'm ok with saving it in the DB as another format, as long as I can convert it back to Wężarów for display in browser. I've tried encoding and decoding to/from several formats, even just to display directly on the screen (ignoring the DB entirely). I'm completely confused on what I need to do here to get it in readable format.
I'm working with PERL, however if I can figure out what I need to do with the encoding/decoding/charset (as I'm currently clueless), I'm sure I can figure it out from there.
It looks like the UTF-8 encoded string was interpreted by the browser as if it were Windows-1252. Here's how I deduced it:
% python3
>>> s = "W\u0119\u017car\u00f3w"
>>> b = bytes(s, encoding='utf-8')
>>> b
b'W\xc4\x99\xc5\xbcar\xc3\xb3w'
>>> str(b, encoding='utf-8')
'Wężarów'
>>> str(b, encoding='latin-1')
'WÄ\x99Å¼arÃ³w'
>>> str(b, encoding='windows-1252')
'WÄ™Å¼arÃ³w'
If you're not good with Python, what I'm doing here is encoding the string "W\u0119\u017car\u00f3w" into UTF-8, yielding the byte sequence 'W\xc4\x99\xc5\xbcar\xc3\xb3w'. Decoding that with UTF-8 yielded 'Wężarów', confirming that this is the correct UTF-8 encoding of the string you want. So I took a guess that the browser is using the wrong encoding to render it, and decoded it using Latin-1. That gave me something very close, so I looked up Latin-1 and noticed that it's named as the basis for Windows-1252. Decoding again as Windows-1252 gives the result you saw.
What's gone wrong here is that the browser can't tell what encoding to use to render the page, and it's guessing wrong. You need to fix this by telling it explicitly to use UTF-8. Here's a page by the W3C that describes how to do that. Essentially what you need to do is add an HTML <meta> element to the document head. If you also set an HTTP header with the encoding name in it, make sure they are consistent.
(In Firefox, while you're debugging, you can go to View -> Character Encoding to set the encoding on a page-by-page basis. I assume other browsers have the same feature.)
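As a sketch of what declaring the encoding explicitly can look like (the question is Perl, so treat this Python CGI-style fragment as pseudocode; the page content is only a placeholder): the charset in the HTTP Content-Type header and the <meta> element must name the same encoding.

import sys

html = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">  <!-- declares the encoding inside the document -->
</head>
<body>Wężarów</body>
</html>"""

# CGI-style response: the header charset matches the <meta> tag above.
sys.stdout.buffer.write(b"Content-Type: text/html; charset=UTF-8\r\n\r\n")
sys.stdout.buffer.write(html.encode("utf-8"))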

Description encoding error with YouTube API v3

I've successfully created a project to upload YouTube videos programmatically through VB.NET, and it had worked for some weeks until today.
I'm having trouble uploading videos which contain German umlauts in the description field: as soon as I try to upload such a video, I'm getting the following WebException:
System.Exception: Bad Request ---> System.Net.WebException:
If I remove the description field or the umlauts, the upload works without problems.
I've also tried to UTF-8-encode the string, but without success.
The error just occurred today...
I had the very same error today: it was occurring with Japanese and Korean while English and Chinese/Taiwanese were fine.
At first I thought it was UTF-8 related. A few hours later, I found out that YouTube's defaultAudioLanguage does not take ISO 3166-1 country codes but language codes. You can get their list there.
Replacing 'jp' with 'ja' and 'kr' with 'ko' in defaultAudioLanguage fixed the issue.
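A minimal sketch of that fix in Python, assuming the value is built from country-style codes somewhere (the snippet dict only mirrors the shape of the videos resource; the helper name is made up):

# Map country-style codes to the ISO 639-1 language codes that
# snippet.defaultAudioLanguage expects.
LANGUAGE_FIX = {"jp": "ja", "kr": "ko"}

def fixed_language(code):
    return LANGUAGE_FIX.get(code.lower(), code)

snippet = {
    "title": "My video",
    "defaultAudioLanguage": fixed_language("jp"),   # -> "ja"
}
print(snippet["defaultAudioLanguage"])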
The issue is that the special characters can't be parsed through an http request. So why not write a converter that searches for the umlaut characters and converts them to characters that can be parsed, for example
ä -> a
ë -> e
ö -> o
û -> u
etc...
That would be the simplest way to do it, although you might be able to get away with switching to some encoding that will automatically remove them for you, then switch back to default to build the request.
I would play around with the different encodings that you can use in VB.Net and see what you can get.
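For illustration, here is a rough sketch of such a converter in Python; the project is VB.NET, so treat it as pseudocode for the approach, and the replacement table is only an example (you could equally map ä to ae, ö to oe, and so on):

# Replace characters that cause trouble in the request with plain ASCII stand-ins.
UMLAUT_MAP = {
    "ä": "a", "ö": "o", "ü": "u", "ë": "e", "û": "u",
    "Ä": "A", "Ö": "O", "Ü": "U", "ß": "ss",
}

def to_plain_ascii(text):
    return "".join(UMLAUT_MAP.get(ch, ch) for ch in text)

print(to_plain_ascii("Schöne Grüße aus München"))   # Schone Grusse aus Munchen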
Here is some documentation on which encodings are available in .NET, how to UTF-8-encode strings in VB.NET, and the Encoding class reference:
http://msdn.microsoft.com/en-us/library/ms404377.aspx
vb.net - Encode string to UTF-8
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx?cs-save-lang=1&cs-lang=vb#code-snippet-1

MSXML.DOMDocument.4.0 loadXML with Chinese Unicode characters

Currently, I'm trying to use the MSXML loadXML method in ASP to load XML string which may contain Unicode Chinese characters like
𠮢 (U+20BA2), which takes 4 bytes,
and the xml string looks like
<City>City</City><Name>𠮢</Name>
So, in my code, I can see that the XML string comes in correctly, but loadXML returns an error message like
Invalid unicode character, &#55362;&#57250;
Can someone please tell me what I can do to resolve this issue?
Thanks,
Edited
The code looks like this
Set objDoc = CreateObject("MSXML2.DOMDocument")
objDoc.async = false
objDoc.setProperty "SelectionLanguage", "XPath"
objDoc.validateOnParse = false
objDoc.loadXML(strXml)
I suggest posting the exact code, XML source and error message you are getting. I cannot reproduce an error by parsing <element>𠮢</element> in MSXML 4.0 SP3; this works fine.
I certainly do get a parseError with reason "Invalid unicode character" by trying to parse <element>&#55362;&#57250;</element>, because that's not well-formed XML. If you do have this in your markup then you need to fix the serialiser that produced it, because neither MSXML nor any standards-compliant XML parser will load it.
If 𠮢 is turned into a character reference it must be &#134050; (or &#x20BA2;). Code units 55362 and 57250 are 'surrogates', reserved for encoding astral-plane characters in UTF-16. They can't be included in an XML document.
&#55362;&#57250; is the entity-encoded form of 0xD842 0xDFA2, which is the UTF-16 encoded form of the Unicode 𠮢 character. Make sure that the XML is completely UTF-16 encoded, not mixed single-byte ASCII and multi-byte UTF-16.
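A quick way to check those numbers, sketched in Python (any conforming XML parser behaves like MSXML here):

import xml.etree.ElementTree as ET

ch = "\U00020BA2"                     # the character U+20BA2
print(ch.encode("utf-16-be").hex())   # d842dfa2 -> the surrogate pair 0xD842 0xDFA2
print(ord(ch))                        # 134050   -> so the correct reference is &#134050;

ET.fromstring("<element>&#134050;</element>")             # parses fine (same as &#x20BA2;)
try:
    ET.fromstring("<element>&#55362;&#57250;</element>")  # individually escaped surrogates
except ET.ParseError as err:
    print(err)   # rejected: character references to surrogate code units are not allowed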

problem while parsing the CDATA

<text><![CDATA[øCu·l es tu principal reto, objetivo o problema?]]></text>
While parsing the above tag, the parser is crashing.
How do I parse the CDATA?
The same line appears like this on Windows:
<text><![CDATA[¿Cuál es tu principal reto, objetivo o problema?]]></text>
The parser is crashing because of those special characters.
Why are they converted into special characters on the Mac?
How can I solve this?
Well for one, the string as you post it here looks like something has gone wrong with the encoding. "ø" is not a Spanish character.
What XML parser are you using? I would guess that somewhere in that string there is a character, possibly hidden, or maybe it's the "ø", which makes your parser crash.
Edit (in response to the OP's comment)
I will try to guess what is happening and hope you can use my guess to resolve what is actually happening. So when you created the xml file you used some editor. This editor used a particular encoding. This means that it transferred the characters on your screen into bytes on your disk using a particular mapping from character into bytes (it encoded the characters into bytes). There are many different encodings, one common encoding is called Latin-1. So let's assume the file was encoded using Latin-1. After creating it, you transferred the file onto another machine where you opened it in a different editor. Now, how does the new editor know the encoding of the file? The answer is that it probably tried to guess the encoding. Now here is where the problem arises: it guessed wrong and interpreted the bytes using an encoding other than Latin-1.
While you have your (garbled) file open in an editor try selecting different encodings from the menu. The one that displays all your special characters correctly is likely to be the one used when the file was created.
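As a concrete sketch of that guess in Python, assuming the file was written as Latin-1 (or Windows-1252) and then read back as Mac Roman, which reproduces the exact garbling in the question:

original = "¿Cuál es tu principal reto, objetivo o problema?"
# Latin-1 bytes decoded as Mac Roman turn "¿" into "ø" and "á" into "·".
garbled = original.encode("latin-1").decode("mac_roman")
print(garbled)   # øCu·l es tu principal reto, objetivo o problema?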
Edit 2
But my other question remains: what xml parser are you using?
Edit 3
Ok, so now when you write "crashing", do you actually mean crashing or does it just return? Do you get an error message? If yes, what? Can you do the following:
Remove the funny characters from this line and run your code on the following:
<text><![CDATA[l es tu principal reto, objetivo o problema?]]></text>
Does it still crash?