Unicode Chars in JWPlayer 6 RSS playlists (with demo) - unicode

Ok, I have a cloudfront install which has characters such as ö and ó in the folder/file name.
JWPlayer goes into infinite buffering whenever I point an RTMP stream at it which has one of these.
What do I escape these to in order for this to work? It's driving me nuts. I've tried ditching RSS and using JSON but couldn't get that to work either.
Demo: http://apricot.viviotech.net/miscellaneous/index2.html
You'll see that N*E*R*D tracks work - so it's not the overall config/player, just anything with an unusual Char.
Help?
UPDATE:
Right, managed to get something working: if I copy and paste the o with diareses from the generated HTML and resave, it works.
i.e apricot.viviotech.net/miscellaneous/index3.html
They look indentical, but in a hex editor, the one which DOESN'T work looks like Björk (with a byte encoding of C3 B6) And the one which DOES comes out as Björk (with a byte encoding of 6F CC 88) (purely in the underlyigng HEX). In the HTML source, they look identical, but obviously there's some encoding issue in the way these are being output which I just don't understand.

Found the problem: https://stackoverflow.com/a/16477559/442412 : note to self, concentrate more on what's going in rather than what's going out...

Related

Hebrew characters processed by HTML Tidy turn into gibberish

I'm using HTML Tidy Online (http://infohound.net/tidy/) to tidy up some very old and messed up HTML file which contains some Hebrew characters. Whenever the page is processed by Tidy the output turns Hebrew characters into gibberish, even after changing encoding methods in the settings. Using different settings, I do manage to get the same output with the Hebrew characters as unicode entities.
I Googled around for a possible solution but found none.
I had a couple ideas in mind, but I'm not sure exactly how to approach them, if at all (maybe someone has a better solution).
I thought maybe I could (after processing the page) scan the page for unicode entities and replace them with the corresponding Hebrew characters (in a systematic way, of course).
Maybe I could take the HTML Tidy source code and modify it to output Hebrew characters appropriately. The problem with this is that I doubt I am knowledgeable enough to even get started on something like this.
I had a similar problem. Document in UTF-8, containing unicode characters. HTML Tidy turned them into HTML entities. This in HTMLTIDY.CFG fixed it:
char-encoding: utf8
input-encoding: utf8
output-encoding: utf8
Hope it helps.
The website http://infohound.net/tidy/ that you are using has a "Char encoding" clause at the bottom right. You need to choose utf-8, but first you need to make sure that the page is encoded in UTF-8 in your test editor. In Notepad++ for example, you can go to Encoding > Convert to UTF-8 without BOM.

problem while parsing the CDATA

<text><![CDATA[øCu·l es tu principal reto, objetivo o problema?]]></text>
while parsing the above tag, its crashing.
how to parse the CDATA
the same line is appearing in windows like this...
<text><![CDATA[¿Cuál es tu principal reto, objetivo o problema?]]></text>
due to the special chars the parser is crashing.
why they are converted into special chars in Mac..?
how to solve this?
Well for one, the string as you post it here looks like something has gone wrong with the encoding. "ø" is not a Spanish character.
What xml parser are you using? I would guess that somewhere in that string is a character, possibly hidden, or maybe it's "ø" which makes your parser crash.
Edit (in response to the OP's comment)
I will try to guess what is happening and hope you can use my guess to resolve what is actually happening. So when you created the xml file you used some editor. This editor used a particular encoding. This means that it transferred the characters on your screen into bytes on your disk using a particular mapping from character into bytes (it encoded the characters into bytes). There are many different encodings, one common encoding is called Latin-1. So let's assume the file was encoded using Latin-1. After creating it, you transferred the file onto another machine where you opened it in a different editor. Now, how does the new editor know the encoding of the file? The answer is that it probably tried to guess the encoding. Now here is where the problem arises: it guessed wrong and interpreted the bytes using an encoding other than Latin-1.
While you have your (garbled) file open in an editor try selecting different encodings from the menu. The one that displays all your special characters correctly is likely to be the one used when the file was created.
Edit 2
But my other question remains: what xml parser are you using?
Edit 3
Ok, so now when you write "crashing", do you actually mean crashing or does it just return? Do you get an error message? If yes, what? Can you do the following:
Remove the funny characters from this line and run your code on the following:
<text><![CDATA[l es tu principal reto, objetivo o problema?]]></text>
Does it still crash?

UTF-8 incorrectly displayed in Lua/ Corona

In Lua, for an iPad Corona project, I'm requesting a UTF-8 server text file (containing Chinese characters) using network.request, but the result when displayed in the console or in the app shows as "garbage". Google Chrome, for instance, displays the same UTF-8 page fine, as I'm setting the http header when the server sends this (using PHP) to 'Content-Type: text/plain; charset=utf-8' (and there's no BOM, byte order mark either). The "garbage" I'm seeing in Lua looks similar to when I "force" Chrome to render the page as ISO-8859-1 using the options menu.
Does anyone have any help or pointers?
If all else fails, how would I convert the "garbage" string back to its UTF-8 origins within Lua?
Thanks for any help!
Lua doesn't know anything about UTF-8; Lua strings are just sequences of bytes. It sounds like Corona itself is parsing the strings as ISO8859-1. The most likely cause for this is them doing something really stupid and naive like treating each byte of the string as a Unicode code point.
I'm afraid I don't know Corona, so can't provide any specific solutions, but I'd suggest looking to see what functions it's got that involve encodings --- there may be a specific function to render a string with a particular encoding, for example.
Can you show the code for your network.request() call?
If you're downloading a html page, you should use network.download().
I had this exact same problem, except with Japanese characters. Although Lua doesn't support UTF-8, Corona acts like it does. What that means is that... if you pass a UTF-8 String to display.newText(...), it should display properly. Now, if you output to the console, it will actually print out the raw bytes of the String. And, if you try to print the length of the string, it will actually print out the number of bytes.
So, in summary, Lua treats all strings as an array of bytes. It knows nothing about UTF-8. Some Corona API methods, when passed UTF-8 strings, will display the strings correctly.
I had issues when I mixed UTF-8 with plain ASCII characters, which I believe confused Corona (what I mean is that I mixed English characters with Japanese characters... still all UTF-8, though). I have a hunch that each character in the string must be of the same length in bytes for Corona to display it properly. Try printing out one character at a time to see if that helps. Please feel free to post comments here if you run into trouble. I'd like to figure this issue out myself, too.

How to diagnose, and reverse (not prevent) Unicode mangling

Somewhere upstream of me, "something" happened that looks like unicode mangling. One symptom is that a lowercase u umlaut (ü) gets converted to "ü" (ie, character FC gets converted to C3 BC). Assuming that I have no control over this upstream process, how can I reverse-engineer what's going on? And if that is possible, can I crank the sausage machine backwards and get the original text back?
(If it helps to understand this case, the text I received was in the form of a MySQL dump. I think somwewhere in the dump/transport process it got mangled.)
Your text isn't 'mangled'. It's just in UTF8. C3 BC is what the ü is supposed to be encoded as. Just set whatever software you use to UTF8 also, and all pain will go away. If you can't set your software to Unicode, seriously consider switching to newer software.
I know it's scary at first, but you will have to do that eventually, anyway. My favorite music typesetter switched to Unicode-only input a while ago (they even deliberately removed support for the old 8-bit code pages to get people to switch), and I was upset, thinking that Latin-1 was good enough for me, and it was stupid to break stuff that was working perfectly well... and then I got over it and just set emacs to Unicode buffers, and now I'll never have to think about character encoding again in my life!
First of all, it looks like you've got UTF-8 encoded text (as you've found ü interpreted in your expected encoding, maybe Latin-1).
You could guess this encoding being used by checking that the correct byte sequences are used (and the illegal ones not used, of course). See the Wikipedia article for a reference and look for valid and invalid byte sequences. You can be pretty sure about the encoding if the text starts with a BOM, but that's not required for UTF-8.
To get the text back in your required encoding, several tools are available, GNU recode for one.

How can I figure out what code page I am looking at?

I have a device with some documentation on how to send it text. It uses 0x00-0x7F to send 'special' characters like accented characters, euro signs, ...
I am guessing they copied an existing code page and made some changes, but I have no idea how to figure out what code page is closest to the one in my documentation.
In theory, this should be easy to do. For example, they map Á to 0x41, so if I could find some way to go through all code pages and find the ones that have this character on that position, it would be a piece of cake.
However, all I can find on the internet are links to code page dumps just like the one I'm looking at, or software that uses heuristics to read text and guess the most likely code page. Surely someone out there has made it possible to look up what code page one is looking at ?
If it uses 0x00 to 0x7F for the "special" characters, how does it encode the regular ASCII characters?
In most of the charsets that support the character Á, its codepoint is 193 (0xC1). If you subtract 128 from that, you get 65 (0x41). Maybe your "codepage" is just the upper half of one of the standard charsets like ISO-8859-1 or windows-1252, with the high-order bit set to zero instead of one (that is, subtracting 128 from each one).
If that's the case, I would expect to find a flag you can set to tell it whether the next bunch of codepoints should be converted using the "upper" or "lower" encoding. I don't know of any system that uses that scheme, but it's the most sensible explanation I can come with for the situation you describe.
There is no way to auto-detect the codepage without additional information. Below the display layer it’s just bytes and all bytes are created equal. There’s no way to say “I’m a 0x41 from this and that codepage”, there’s only “I’m 0x41. Display me!”
What endian is the system? Perhaps you're flipping bit orders?
In most codepages, 0x41 is just the normal "A", I don't think any standard codepages have "Á" in that position. It could have a control character somewhere before the A that added the accent, or uses a non-standard codepage.
I don't see any use in knowing the "closest codepage", you just need to use the docs you got with the device.
Your last sentence is puzzling, what do you mean by "possible to look up what code page one is looking at"?
If you include your whole codepage, people here on SO could be more helpful and give you more insight about this issue, having one data point 0x41=Á doesn't help much.
Somewhat random idea, but if you can get replicate a significant amount of the text off the device, you could try running it through something like the detect function in http://chardet.feedparser.org/.