How did SourceForge maim this Unicode character? - unicode

A little encoding puzzle for you.
A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.
In the XML export, however, it shows up as:
&#226;&#8364;&#8221;
Decoding the entities, that results in these code points:
U+00E2 U+20AC U+201D
I.e. the characters â€”. The XML should have been &#8212;, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.
Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?

The XML output has incorrectly been encoded using CP1252. To revert this, convert â€” to bytes using CP1252 encoding and then convert those bytes back to a string using UTF-8 encoding.
Java based evidence:
String s = "â€”";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —
Note that this assumes that the stdout console itself uses UTF-8 to display the character.

In .NET, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("â€”")) returns —.
SourceForge converted it to UTF-8, interpreted each of the bytes as a character in CP1252, then saved the characters as three separate entities using the actual Unicode code points for those characters.
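For anyone without Java or .NET at hand, the same round trip can be sketched in Python (a sketch, assuming the bytes were mangled exactly as described above):

```python
# Simulate the SourceForge bug: an EM DASH is encoded to UTF-8 bytes,
# but those bytes are then read back as CP1252 characters.
mojibake = "\u2014".encode("utf-8").decode("cp1252")
print(mojibake)  # â€” (U+00E2 U+20AC U+201D)

# Reversing it: re-encode the mojibake as CP1252 to recover the
# original UTF-8 bytes, then decode those bytes correctly.
fixed = mojibake.encode("cp1252").decode("utf-8")
print(fixed)  # — (U+2014, EM DASH)
```

Note that this only works because all three mojibake characters happen to exist in CP1252; byte values that CP1252 leaves undefined would break the round trip.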

Related

Browser not recognising UTF8

I have UTF8 data in a MYSQL table. I Base64 encode this as it's read from the table and transport it to a web page via PHP and AJAX. Javascript Base64 decodes it as it is inserted into the HTML. The page receiving it is declared to be UTF8.
My problem is that if I insert the Base64-decoded data (using atob()) into the page, any two bytes that make up a single UTF-8 character are presented as two separate Unicode code points. I have to use "decodeURIComponent(escape(window.atob(data)))" (learned from another question on this forum, thank you) to get the characters represented correctly. What this process does is convert the two UTF-8 bytes to a single byte equaling the Unicode code point for the char (also the same char under ISO 8859).
In short, to get the UTF-8 data correctly rendered in a UTF-8 page they have to be converted to their unicode code-point/ISO 8859 values.
An example:
The Unicode code point for lowercase e-acute is \u00e9. The UTF-8 encoding of this character is \xc3\xa9.
The following images show what is rendered for various decodings of my Base64 encoding of this word: first plain atob(), then adding escape() to the process, then further adding decodeURIComponent(). I show the console reporting the output of each, as well as three INPUT fields populated with the three outputs ("record[6]" contains the Base64-encoded data). First the code:
console.log(window.atob(record[6]));
console.log(escape(window.atob(record[6])));
console.log(decodeURIComponent(escape(window.atob(record[6]))));
jQuery("#b64-1").val(window.atob(record[6]));
jQuery("#b64-2").val(escape(window.atob(record[6])));
jQuery("#b64-3").val(decodeURIComponent(escape(window.atob(record[6]))));
Copying and pasting the two versions of née into a hex editor reveals what has happened.
Clearly, the two bytes from the atob() decoding are the correct values for UTF-8 e-acute (\xc3\xa9), but are initially rendered not as a single UTF-8 char, but as two individual chars: C3 (uppercase A tilde) and A9 (copyright sign). The next two steps convert those two chars to the single codepoint for e-acute \u00e9.
So decodeURIComponent() obviously recognises the two bytes as a single UTF-8 character (because it changes them to \u00e9), but the browser does not.
Can anyone explain to me why this needs to happen in a page declared to be UTF-8?
(I am using Chrome on W10-64)
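No answer is shown here, but the chain can be modelled outside the browser. The following Python sketch (an illustration, not the page's actual code) treats atob() as a Latin-1 decode, which is effectively what it does: each byte of the Base64 payload becomes one code point.

```python
import base64
import urllib.parse

utf8_bytes = "é".encode("utf-8")                    # b'\xc3\xa9'
b64 = base64.b64encode(utf8_bytes).decode("ascii")

# atob() maps each byte to a single code point, i.e. a Latin-1 decode,
# which is why the two UTF-8 bytes show up as two characters (Ã©):
binary_string = base64.b64decode(b64).decode("latin-1")
print(binary_string)

# escape() percent-encodes those code points as %XX bytes, and
# decodeURIComponent() reinterprets the %XX sequence as UTF-8:
escaped = urllib.parse.quote(binary_string, encoding="latin-1")
print(urllib.parse.unquote(escaped, encoding="utf-8"))  # é
```

The page's charset declaration doesn't help because by the time atob() has run, the bytes have already been turned into individual code points; the document encoding only governs how the raw HTML bytes were decoded.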

Change of char encoding in Eclipse

I am working on an assignment where I need to XOR the bits of each char of a given text. For example, weird chars like '��'.
When trying to save, Eclipse prompts that "Some characters cannot be mapped with Cp1252...", after which I can choose to save as UTF-8.
My knowledge of character encoding is quite fuzzy; wouldn't saving to UTF-8 change the bits? If so, how may I instead work with the original message (original bits) to XOR them and do my assignment?
Thanks!
I am assuming you are using Java in this answer.
The file encoding only changes how the data is represented in the file. When you read the file again (using the correct encoding) it will be converted back to Unicode in your String, so the program will see the same bits.
Encoding Cp1252 can only represent a small number of characters (fewer than 256) compared to the 113,021 characters in Unicode 7, all of which can be encoded with UTF-8.
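The answer above is about Java, but the invariant (save and load with the same encoding and you get identical characters back, so the XOR operates on the same bits) can be sketched in Python:

```python
import os
import tempfile

text = "weird chars: é — ✓"

# Save as UTF-8, which is what Eclipse offers when Cp1252 cannot
# represent every character in the source file:
path = os.path.join(tempfile.mkdtemp(), "message.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read it back with the same encoding: the program sees the identical
# string, so XOR-ing each character's code point is unaffected by the
# on-disk representation.
with open(path, encoding="utf-8") as f:
    restored = f.read()

print(restored == text)                      # True
xored = [ord(c) ^ 0x2A for c in restored]    # 0x2A is an arbitrary key
```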

addPortalMessage requires decode('utf-8')

Currently it seems that in order for UTF-8 characters to display in a portal message you need to decode them first.
Here is a snippet from my code:
self.context.plone_utils.addPortalMessage(_(u'This document (%s) has already been uploaded.' % (doc_obj.Title().decode('utf-8'))))
If Titles in Plone are already UTF-8 encoded, the string is a unicode string and the underscore function is handled by i18ndude, I do not see a reason why we specifically need to decode utf-8. Usually I forget to add it and remember once I get a UnicodeError.
Any thoughts? Is this the expected behavior of addPortalMessage? Is it i18ndude that is causing the issue?
UTF-8 is a representation of Unicode, not Unicode and not a Python unicode string. In Python, we convert back and forth between Python's unicode strings and representations of unicode via encode/decode.
Decoding a UTF-8 string via utf8string.decode('utf-8') produces a Python unicode string that may be concatenated with other unicode strings.
Python will automatically convert a string to unicode if it needs to by using the ASCII decoder. That will fail if there are non-ASCII characters in the string -- because, for example, it is encoded in UTF-8.
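The snippets in this thread are Python 2, where str holds bytes and implicit ASCII coercion happens behind your back. The failure mode is easiest to see in Python 3, where the str/bytes split is explicit (a sketch; the Title value is made up):

```python
# What a UTF-8-encoded Plone Title effectively holds: bytes, not text.
title_bytes = "Résumé.pdf".encode("utf-8")

# Decoding yields a real unicode string that mixes cleanly with other
# unicode strings, which is why the explicit .decode('utf-8') is needed:
title = title_bytes.decode("utf-8")
message = "This document (%s) has already been uploaded." % title

# Without the explicit decode, Python 2 falls back to the ASCII codec,
# which chokes on the non-ASCII bytes -- hence the UnicodeError:
try:
    title_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)
```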

Reconstructing Windows-1252 characters from data incorrectly saved as UTF-8

I'm dealing with data that has been sampled using Java HtmlUnit. The webpage used Windows-1252 encoding but the response was retrieved as if the page was encoded as UTF-8 (ie when getContentAsString on the HtmlUnit WebResponse object was invoked, UTF-8 encoding was specified rather than deferring to the encoding specified in the server response). Is there any way to reverse this process to reconstruct the original Windows-1252 data from the incorrectly labelled UTF-8 character data?
Most other questions on this topic are concerned with identifying the type of file or converting from one stream type to another for characters correctly encoded in the first place. That is not the case here. I don't believe utilities such as iconv will work because they expect the streams to have been correctly persisted in their source encoding to begin with.
Probably not. If Windows-1252-encoded text gets mistaken for UTF-8, all non-ASCII code points are damaged, because of the way UTF-8 handles those code points. Only if you are very, very lucky, and all non-ASCII bytes happen to come in pairs or triplets that, by pure chance, form valid UTF-8 sequences, can you reverse the process.
But you're pretty much out of luck.
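The asymmetry with the SourceForge case earlier on this page is worth spelling out: there, valid UTF-8 bytes were misread as CP1252, which is lossless; here, CP1252 bytes were misread as UTF-8, which is lossy. A Python sketch of why:

```python
# Most single CP1252 bytes above 0x7F are not valid UTF-8 sequences,
# so a strict UTF-8 decode fails outright ...
cp1252_bytes = "déjà vu".encode("cp1252")     # b'd\xe9j\xe0 vu'
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)

# ... and a lenient decode replaces them with U+FFFD, discarding the
# original byte values for good:
print(cp1252_bytes.decode("utf-8", errors="replace"))  # d�j� vu
```

If the sampling step used a lenient mode like errors='replace', every damaged byte is now U+FFFD and the original data is gone; only if the raw bytes were somehow stored unmodified could an iconv-style reconversion help.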

IWebBrowser: How to specify the encoding when loading html from a stream?

Using the concepts from the sample code provided by Microsoft for loading HTML content into an IWebBrowser from an IStream using the web browser's IPersistStreamInit interface:
pseudocode:
void LoadWebBrowserFromStream(IWebBrowser webBrowser, IStream stream)
{
IPersistStreamInit persist = webBrowser.Document as IPersistStreamInit;
persist.Load(stream);
}
How can one specify the encoding of the html inside the IStream? The IStream will contain a series of bytes, but the problem is what do those bytes represent? They could, for example, contain bytes where:
each byte represents a character from the current Windows code-page (e.g. 1252)
each byte could represent a character from the ISO-8859-1 character set
the bytes could represent UTF-8 encoded characters
every 2 bytes could represent a character, using UTF-16 encoding
In my particular case, i am providing the IWebBrowser an IStream that contains a series of double-bytes characters (UTF-16), but the browser (incorrectly) believes that UTF-8 encoding is in effect. This results in garbled characters.
Workaround solution
While the question asks how to specify the encoding, in my particular case, with only UTF-16 encoding, there's a simple workaround: adding the 0xFEFF Byte Order Mark (BOM) indicates that the text is UTF-16, so the browser then uses the proper encoding and shows the text properly.
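The effect of the BOM can be sketched in Python (an illustration of the byte layout, not IWebBrowser code):

```python
import codecs

text = "Hello"

# Prepending U+FEFF and encoding as UTF-16LE puts the bytes FF FE at
# the front of the stream -- the marker a consumer can sniff:
data = ("\ufeff" + text).encode("utf-16-le")
print(data[:2] == codecs.BOM_UTF16_LE)   # True

# A BOM-aware decoder detects the byte order, strips the mark, and
# reads the rest correctly:
print(data.decode("utf-16"))             # Hello
```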
Of course that wouldn't work if the text were encoded, for example with:
UCS-2
UCS-4
ISO-10646-UCS-2
UNICODE-1-1-UTF-8
UNICODE-2-0-UTF-16
UNICODE-2-0-UTF-8
US-ASCII
ISO-8859-1
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1253
WINDOWS-1254
WINDOWS-1255
WINDOWS-1256
WINDOWS-1257
WINDOWS-1258
IE's document supports IPersistMoniker loading too. IE uses URL monikers for downloading. You can replace the URL moniker created by CreateURLMonikerEx with your own moniker. A few details about the URL moniker's implementation can be found here. See if you can get IHTTPNegotiate from the binding context when your BindToStorage implementation is called.