Browser not recognising UTF8 - unicode

I have UTF-8 data in a MySQL table. I Base64-encode it as it is read from the table and transport it to a web page via PHP and AJAX. JavaScript Base64-decodes it as it is inserted into the HTML. The page receiving it is declared to be UTF-8.
My problem is that if I insert the Base64-decoded data (using atob()) into the page, the two bytes that make up a single UTF-8 character are presented as two separate Unicode code points. I have to use "decodeURIComponent(escape(window.atob(data)))" (learned from another question on this forum, thank you) to get the characters represented correctly. What this process does is convert the two UTF-8 bytes into a single character whose value is the Unicode code point for the char (which is also the same char under ISO 8859-1).
In short, to get the UTF-8 data correctly rendered in a UTF-8 page, the bytes have to be converted to their Unicode code point/ISO 8859-1 values.
An example:
The Unicode code point for lowercase e-acute is \u00e9. The UTF-8 encoding of this character is \xc3\xa9.
The following images show what is rendered for various decodings of my Base64 encoding of the word née: first plain atob(), then adding escape() to the process, then further adding decodeURIComponent(). I show the console reporting the output of each, as well as three INPUT fields populated with the three outputs ("record[6]" contains the Base64-encoded data). First the code:
console.log(window.atob(record[6]));
console.log(escape(window.atob(record[6])));
console.log(decodeURIComponent(escape(window.atob(record[6]))));
jQuery("#b64-1").val(window.atob(record[6]));
jQuery("#b64-2").val(escape(window.atob(record[6])));
jQuery("#b64-3").val(decodeURIComponent(escape(window.atob(record[6]))));
Copy and pasting the two versions of née into a hex editor reveals what has happened
Clearly, the two bytes from the atob() decoding are the correct values for UTF-8 e-acute (\xc3\xa9), but are initially rendered not as a single UTF-8 char, but as two individual chars: C3 (uppercase A tilde) and A9 (copyright sign). The next two steps convert those two chars to the single codepoint for e-acute \u00e9.
So decodeURIComponent() obviously recognises the two bytes as a single UTF-8 character (because it changes them to E9), but the browser does not.
Can anyone explain to me why this needs to happen in a page declared to be UTF-8?
(I am using Chrome on W10-64)
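For readers who want to reproduce the effect outside the browser, here is a minimal Python sketch of what appears to be going on. It assumes that atob() hands back a "binary string" with one character per decoded byte, which is modelled here by decoding the bytes as Latin-1. The variable names are illustrative and not taken from the original code.

```python
import base64

# "née" encoded as UTF-8 and then Base64, as the server side is described as doing.
encoded = base64.b64encode("n\u00e9e".encode("utf-8"))

# Base64-decoding gives back the original UTF-8 bytes: 6e c3 a9 65.
raw = base64.b64decode(encoded)

# Assumption: atob() yields one character per byte, which is equivalent to
# interpreting the bytes as Latin-1. C3 and A9 become two separate characters.
binary_string = raw.decode("latin-1")

# decodeURIComponent(escape(...)) effectively turns those characters back
# into bytes and then decodes the bytes as UTF-8.
text = binary_string.encode("latin-1").decode("utf-8")

print(encoded)              # b'bsOpZQ=='
print(len(binary_string))   # 4 characters: n, U+00C3, U+00A9, e
print(len(text))            # 3 characters: n, U+00E9, e
```

On this reading, the page's UTF-8 declaration only governs how the bytes of the HTML document itself are decoded. A string produced by script is already a sequence of characters, so no further UTF-8 decoding is applied when it is inserted.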

Related

Encode binary data into ASCII while preserving most of its valid characters

When displaying a bytes object in Python, the print function will display a select number of values as ASCII characters instead of their numeric representation.
>>> b"1 2 \x30 a b \x80"
b'1 2 0 a b \x80'
Is there a known encoding that would allow a set of binary data containing mostly ASCII text to be put into a valid ASCII string where the few invalid characters are replaced by a numeric representation, similarly to what python does with bytes?
Edit:
I used Python's bytes.__repr__ as an example of what the encoding would do. The project we need this for is written in C++, so something like a "language agnostic" spec would be nice.
Edit 2:
Think of this as a base64 alternative so that binary data that is mostly ASCII does not get altered too much.
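To make the request concrete, here is a minimal sketch of one such scheme: printable ASCII passes through unchanged, and every other byte (plus the backslash itself, so that decoding stays unambiguous) is written as \xHH. The function names escape_bytes and unescape_bytes are hypothetical, and this is an illustration of the idea rather than an existing standard.

```python
def escape_bytes(data: bytes) -> str:
    """Keep printable ASCII as-is; write everything else as \\xHH."""
    out = []
    for b in data:
        if 0x20 <= b <= 0x7E and b != 0x5C:   # printable, and not a backslash
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02x}")
    return "".join(out)


def unescape_bytes(text: str) -> bytes:
    """Inverse of escape_bytes; rejects anything that is not a \\xHH escape."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            if text[i + 1:i + 2] != "x" or len(text) < i + 4:
                raise ValueError(f"malformed escape at position {i}")
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


print(escape_bytes(b"1 2 \x30 a b \x80"))   # 1 2 0 a b \x80
```

The same rules translate directly to C++, since they only involve comparing byte values and formatting two hex digits.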

How did SourceForge maim this Unicode character?

A little encoding puzzle for you.
A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.
In the XML export, however, it shows up as:
&#226;&#8364;&#8221;
Decoding the entities, that results in these code points:
U+00E2 U+20AC U+201D
I.e. the characters â€”. The XML should have been &#8212;, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.
Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?
The XML output has been incorrectly encoded using CP1252. To revert this, convert â€” to bytes using the CP1252 encoding and then convert those bytes back to a string using the UTF-8 encoding.
Java based evidence:
String s = "â€”";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —
Note that this assumes that the stdout console itself uses UTF-8 to display the character.
In .NET, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("â€”")) returns —.
SourceForge converted it to UTF-8, interpreted each of the bytes as a character in CP1252, then saved the characters as three separate entities using the actual Unicode code points for those characters.
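The sequence described above can be replayed end to end in a few lines. This is a sketch of the presumed exporter behaviour and its reversal, not SourceForge's actual code; the variable names are illustrative.

```python
import html

original = "\u2014"   # EM DASH

# Step 1: encode as UTF-8 (bytes e2 80 94), then misread each byte as CP1252.
mangled = original.encode("utf-8").decode("cp1252")

# Step 2: write each resulting character as a decimal character reference.
exported = "".join(f"&#{ord(ch)};" for ch in mangled)
print(exported)                                  # &#226;&#8364;&#8221;
print([f"U+{ord(ch):04X}" for ch in mangled])    # ['U+00E2', 'U+20AC', 'U+201D']

# Reversal: decode the references, re-encode as CP1252, decode as UTF-8.
restored = html.unescape(exported).encode("cp1252").decode("utf-8")
print(restored == original)                      # True
```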

base64 encoding: input character

I'm trying to understand what the input requirements are for base64 encoding. Nicholas Zakas, for whom I have tremendous respect, has an article on base64 in which he quotes a specification saying that an error should be thrown if the input contains any character with a code higher than 255:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters. Since base64 encoding requires eight bits per input character, any character with a code higher than 255 cannot be accurately represented. The specification indicates that an error should be thrown in this case:
if (/([^\u0000-\u00ff])/.test(text)){
    throw new Error("Can't base64 encode non-ASCII characters.");
}
In another part of the article he provides a link to RFC 3548, but I don't see any input requirements there other than:
Implementations MUST reject the encoding if it contains characters
outside the base alphabet when interpreting base encoded data, unless
the specification referring to this document explicitly states
otherwise.
Not sure what "base alphabet" means but perhaps this is what Zakas is referring to. But by saying they must reject the encoding it seems to imply that this is something that has already been encoded as opposed to the input (of course if the input is invalid it will also show up in the encoding so perhaps the point is moot).
A bit confused on what the standard is.
Fundamentally, it's a mistake to talk about "base64 encoding a string" where "string" is meant in terms of text.
Base64 encoding is applied to binary data (a sequence of bytes, or octets if you want to be even more picky), and the result is text. Every character in the output is printable ASCII text. The whole point of base64 is to provide a safe way of converting arbitrary binary data into a text format which can be reliably embedded in other text, transported etc. ASCII is compatible with almost all character sets, so you're very unlikely to be unable to encode ASCII text as part of something else.
When someone talks about "base64 encoding a string" they're really talking about encoding text as binary using some existing encoding (e.g. UTF-8), then applying a base64 encoding to the result. When decoding, you'd need to decode the base64 back to binary, and then decode that binary data with the original encoding, to get the original text.
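The two-step round trip described above can be sketched as follows. UTF-8 is chosen here only as an example of "some existing encoding"; any encoding works as long as both sides agree on it.

```python
import base64

text = "caf\u00e9"

# Encoding side: text -> bytes (via UTF-8) -> base64 text.
data = text.encode("utf-8")
b64 = base64.b64encode(data).decode("ascii")
print(b64)   # Y2Fmw6k=

# Decoding side: base64 text -> bytes -> text, using the same encoding.
round_tripped = base64.b64decode(b64).decode("utf-8")
print(round_tripped == text)   # True

# Decoding the bytes with a different encoding gives different text.
print(base64.b64decode(b64).decode("latin-1") == text)   # False
```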
For me the (first) linked article has a fundamental problem:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters
You don't base64 encode strings. You base64 encode byte sequences. And when you're dealing with any kind of encoding work, it's extremely important to keep in mind this difference.
Also, his check for 'ASCII' actually lets through everything from 80 to ff, which aren't ASCII - ASCII is only 00 to 7f.
Now, if you have a string which you have checked is pure ASCII, you can then safely treat it as a byte sequence of the ASCII values of the characters in it - but this is a separate earlier step, nothing strictly to do with the act of base64 encoding.
(I should say that I do like his repeated urging for the reader to note that base64 encoding is not in any shape or form encryption)
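The gap between the quoted check and a real ASCII check is easy to demonstrate. The sketch below ports the article's regular expression to Python and compares it with str.isascii() (available from Python 3.7); the helper name passes_article_check is hypothetical.

```python
import re

# The range test quoted from the article: rejects only code points above U+00FF.
ABOVE_FF = re.compile(r"[^\u0000-\u00ff]")


def passes_article_check(s: str) -> bool:
    return ABOVE_FF.search(s) is None


for label, sample in [("plain", "abc"), ("e-acute", "\u00e9"), ("euro", "\u20ac")]:
    print(label, passes_article_check(sample), sample.isascii())
# plain True True
# e-acute True False
# euro False False
```

The middle row is the case pointed out above: U+00E9 passes the range test even though it is not ASCII.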

IWebBrowser: How to specify the encoding when loading html from a stream?

Using the concepts from the sample code provided by Microsoft for loading HTML content into an IWebBrowser from an IStream using the web browser's IPersistStreamInit interface:
pseudocode:
void LoadWebBrowserFromStream(IWebBrowser webBrowser, IStream stream)
{
    IPersistStreamInit persist = webBrowser.Document as IPersistStreamInit;
    persist.Load(stream);
}
How can one specify the encoding of the html inside the IStream? The IStream will contain a series of bytes, but the problem is what do those bytes represent? They could, for example, contain bytes where:
each byte represents a character from the current Windows code-page (e.g. 1252)
each byte could represent a character from the ISO-8859-1 character set
the bytes could represent UTF-8 encoded characters
every 2 bytes could represent a character, using UTF-16 encoding
In my particular case, I am providing the IWebBrowser with an IStream that contains a series of double-byte characters (UTF-16), but the browser (incorrectly) believes that UTF-8 encoding is in effect. This results in garbled characters.
Workaround solution
While the question asks how to specify the encoding, in my particular case, with only UTF-16 encoding, there's a simple workaround. Prepending the 0xFEFF byte order mark (BOM) to the stream indicates that the text is UTF-16. IE then uses the proper encoding and shows the text properly.
Of course that wouldn't work if the text were encoded, for example with:
UCS-2
UCS-4
ISO-10646-UCS-2
UNICODE-1-1-UTF-8
UNICODE-2-0-UTF-16
UNICODE-2-0-UTF-8
US-ASCII
ISO-8859-1
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1253
WINDOWS-1254
WINDOWS-1255
WINDOWS-1256
WINDOWS-1257
WINDOWS-1258
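The UTF-16 BOM workaround from the question can be illustrated independently of IWebBrowser. The sketch below only shows how the byte stream is built and that a BOM-aware decoder picks the right byte order; the HTML string is made up, and feeding the bytes into an IStream is outside its scope.

```python
import codecs

html_text = "<html><body>caf\u00e9</body></html>"

# UTF-16 little-endian bytes with no marker: a consumer has to guess the encoding.
without_bom = html_text.encode("utf-16-le")

# U+FEFF serialised in little-endian order is FF FE; prepend it to the payload.
with_bom = codecs.BOM_UTF16_LE + without_bom
print(with_bom[:2].hex())   # fffe

# Python's generic "utf-16" codec reads the BOM to choose the byte order
# and strips it from the decoded text.
print(with_bom.decode("utf-16") == html_text)   # True
```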
IE's document supports IPersistMoniker loading too. IE uses URL monikers for downloading. You can replace the URL moniker created by CreateURLMonikerEx with your own moniker. A few details about the URL moniker's implementation can be found here. See if you can get IHttpNegotiate from the binding context when your BindToStorage implementation is called.

What is encoding & decoding in communication?

Can someone please direct me to some good references about encoding and decoding in communication, and about different encoding techniques (Unicode, base64, UTF-7, etc.)?
Wikipedia is always a good start.
Then there's always Joel Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Note that the three things you name operate on different levels.
Unicode is a character set: a mapping between characters and numbers (code points).
UTF7 maps between code points and bytes.
base64 maps between bytes and bytes. (It mangles bytes so that they are represented by bytes in the ASCII range.)
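The three levels can be seen by pushing one character through each of them in turn. This is a small sketch; the euro sign is an arbitrary choice, and UTF-8 is shown alongside UTF-7 because it is the more common code-point-to-bytes mapping.

```python
import base64

ch = "\u20ac"   # EURO SIGN

# Level 1, Unicode: character <-> number (code point).
code_point = ord(ch)
print(hex(code_point))   # 0x20ac

# Level 2, UTF-7 / UTF-8: code points <-> bytes.
utf7 = ch.encode("utf-7")
utf8 = ch.encode("utf-8")
print(utf8)              # b'\xe2\x82\xac'

# Level 3, base64: bytes <-> bytes restricted to the ASCII range.
b64 = base64.b64encode(utf8)
print(b64)               # b'4oKs'
```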
The definitions of encoding and decoding are somewhat subjective.
Both are forms of transliteration, being the process of converting from one alphabet to another. ASCII to UTF8, ASCII to base64, etc are all examples of this.
What distinguishes the two is that "encoding" is often used when transliterating from a usable format to a transmission or intermediate format of some kind and decoding is the reverse. This is where the "subjective" bit comes in. ASCII to UTF8 can be viewed as encoding or decoding depending on the context.
Other formats like base64 are used almost universally for transmission only (eg binary data in email) and as such converting to them is almost universally called "encoding" and converting from as "decoding".
The important point to take away from all this is that something like ASCII or UTF8 is not magical in any way. All these formats are simply an agreed-upon encoding of information into a binary format. So ASCII 65 is 'A' for no other reason than that's the standard.
Unicode formats get more interesting because they make the distinction between the code point and the encoding. Unicode defines the code points for each character. The binary data is different for each encoding format. For example, see Unicode Character 'EURO-CURRENCY SIGN' (U+20A0) to see all the different binary values for one code point.
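As a quick illustration of one code point having several binary forms, the sketch below encodes U+20A0 in three common encodings. Big-endian variants are used so that no byte order mark is added.

```python
ch = "\u20a0"   # EURO-CURRENCY SIGN

# Same code point, three different byte sequences.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")
# utf-8      e282a0
# utf-16-be  20a0
# utf-32-be  000020a0
```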
Regarding your Unicode, base64 and UTF-7 (hardly anyone uses UTF-7; you may have meant UTF-8): these are not just "encoding and decoding" in general but encoding and decoding of text data.
Unicode is the way all real and possible characters are enumerated. It says nothing about encoding in itself. UTF-xx is a set of encodings of Unicode (converting code points to actual bytes); the most popular are UTF-8 and UTF-16. Very basically, UTF-8 is ASCII-compatible (characters with codes < 128 are represented the same way as in ASCII), while other characters are represented by 2 to 4 bytes. UTF-16 encodes most characters in 2 bytes and the rest in 4.
Base64 has nothing to do with text data. It encodes generic binary data as text consisting of 64 printable ASCII characters. It is usually used to transfer binary data, including UTF-8 and UTF-16 text, via email.
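The byte counts mentioned above can be checked directly. This sketch measures four sample characters, one per UTF-8 width; the sample characters are arbitrary picks.

```python
samples = {
    "A": "A",              # U+0041, ASCII
    "e-acute": "\u00e9",   # U+00E9
    "euro": "\u20ac",      # U+20AC
    "emoji": "\U0001f600", # U+1F600, outside the Basic Multilingual Plane
}

# Number of bytes each character occupies in UTF-8 and in UTF-16.
widths = {
    label: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    for label, ch in samples.items()
}

for label, (utf8_len, utf16_len) in widths.items():
    print(f"{label:8} utf-8: {utf8_len}  utf-16: {utf16_len}")
# A        utf-8: 1  utf-16: 2
# e-acute  utf-8: 2  utf-16: 2
# euro     utf-8: 3  utf-16: 2
# emoji    utf-8: 4  utf-16: 4
```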