What is the difference between EM Dash &#151; and &#8212;? - unicode

I've an ASCII file that contains an EM Dash (— or &#8212; in HTML). The hex value is 0x97. When we pass this file through one application it arrives as UTF-8, and it converts the character to 0xC297, which is &#151; in HTML. However, when we pass this file through a different application it converts the character to 0xE28094 or &#8212;.
What would cause these applications to convert these characters differently? Is it perhaps a code page setting?

&#151; is wrong. When you use numeric character references, the number refers to the Unicode codepoint. For numbers below 256 that is the same as the codepoint in ISO-8859-1. In 8859-1, character 151 is amongst the “C1 control codes”, and not a dash or any other visible character.
The confusion arises because character 151 is a dash in Windows code page 1252 (Western European). Many people think cp1252 is the same thing as ISO-8859-1, but in reality it's not: the characters in the C1 range (128 to 159) are different.
The first application is reading your “ASCII” file* as ISO-8859-1, but actually it's probably cp1252 and you'll need a way to clue the app in about what encoding it has to expect.
(*: “ASCII” is a misnomer if there are top-bit-set characters in the file. You probably mean “ANSI”, which is really also a misnomer, but one which has stuck in the Windows world to mean “text encoded in the current system-default code page”.)
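To see the distinction concretely, here is a quick Python check (Python 3 assumed; purely illustrative):
import unicodedata
unicodedata.name("\u2014")      # 'EM DASH': code point 8212, i.e. &#8212;
unicodedata.category("\u0097")  # 'Cc': code point 151 is a C1 control, not a dash
b"\x97".decode("cp1252")        # '—': cp1252 is what maps byte 0x97 to the em dash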

&#151; is not an em dash; your text was mis-translated from em dash to that value.
&#8212; is the HTML decimal entity for em dash. Specifically, it references the Unicode code point 8212, which represents an em dash.
Your file is not ASCII if it contains an em dash. ASCII chars only encode to decimal range 0 - 127, and em dash is not a character that can be represented by ASCII encoding. If you have em dash stored as 0x97 (151 in decimal) you probably have an ANSI text file (aka Windows Codepage 1252 (w-1252)).
Your first app...
The data started as an em dash encoded in w-1252. In w-1252 the em dash maps to the decimal value 151 (0x97 in hex, or 10010111 in binary).
At some point the em dash was handled by code that thought the bytes in your file were iso-8859-1 encoded text. When that code interpreted 0x97 as a string/char it mapped 0x97 to a character according to the iso-8859-1 encoding. In iso-8859-1 0x97 maps to the char "End of guarded area".
Next, the string, which the code thinks is the "End of guarded area" control char, was encoded as utf-8. "End of guarded area" encoded in utf-8 is the two-byte sequence: 0xC2 0x97.
Your second app...
The text file was correctly interpreted as w-1252, thus the 0x97 is recognized as em dash, which was correctly encoded as the em dash in utf-8: 0xE2 0x80 0x94.
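A short Python 3 sketch of the two pipelines (illustrative only):
raw = b"\x97"                             # the em dash byte from the w-1252 file
raw.decode("iso-8859-1").encode("utf-8")  # b'\xc2\x97': the first app's 0xC297
raw.decode("cp1252").encode("utf-8")      # b'\xe2\x80\x94': the second app's 0xE28094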
What influences this behavior
Not sure if you're dealing with web apps or what, but the concept should be the same whatever it is. We had the same 0x97->0xC297 scenario in a web app where people input data into a form. I found that the charset of the web page was declared as iso-8859-1, and the browser's best way to handle the w-1252 chars was to just send them along as the iso bytes without alerting the user or the server. The server receives the data, thinks it's iso, and converts to utf-8, resulting in 0xC297.
Basically any time an app touches text it needs to be told how the text is encoded, or else it might fall back to a system default. If that happens you risk data corruption.

According to the HTML4 specification's character entity references, the em dash is &mdash; (U+2014).

An ASCII file cannot contain the character 0x97, as the ASCII character set only ranges from 0x00 to 0x7F. Therefore your file is not ASCII, but uses some other single-byte encoding. The windows-1250 encoding, for example, has the em-dash at 0x97.
If the applications decode the text file using some other encoding than the one that was used to create the file, any character above 0x7F will be wrong.
In unicode the em-dash has the character code 0x2014, or 8212 in decimal.
Unicode Character 'EM DASH' (U+2014)
In a web page that for example uses windows-1250 as encoding, the code &#8212; will render as an em-dash:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>em-dash</title>
<meta http-equiv="content-type" content="text/html; charset=windows-1250"/>
</head>
<body>
<div>&#8212;</div>
</body>
</html>

Related

How did SourceForge maim this Unicode character?

A little encoding puzzle for you.
A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.
In the XML export, however, it shows up as:
&#226;&#8364;&#8221;
Decoding the entities, that results in these code points:
U+00E2 U+20AC U+201D
I.e. the characters â€”. The XML should have been &#8212;, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.
Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?
The XML output has incorrectly been encoded using CP1252. To revert this, convert â€” to bytes using CP1252 encoding and then convert those bytes back to string/char using UTF-8 encoding.
Java based evidence:
String s = "â€”";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —
Note that this assumes that the stdout console uses by itself UTF-8 to display the character.
In .Net, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("â€”")) returns —.
SourceForge converted it to UTF8, interpreted each of the bytes as characters in CP1252, then saved the characters as three separate entities using the actual Unicode codepoints for those characters.

IWebBrowser: How to specify the encoding when loading html from a stream?

Using the concepts from the sample code provided by Microsoft for loading HTML content into an IWebBrowser from an IStream using the web browser's IPersistStreamInit interface:
pseudocode:
void LoadWebBrowserFromStream(IWebBrowser webBrowser, IStream stream)
{
    IPersistStreamInit persist = webBrowser.Document as IPersistStreamInit;
    persist.Load(stream);
}
How can one specify the encoding of the html inside the IStream? The IStream will contain a series of bytes, but the problem is what do those bytes represent? They could, for example, contain bytes where:
each byte represents a character from the current Windows code-page (e.g. 1252)
each byte could represent a character from the ISO-8859-1 character set
the bytes could represent UTF-8 encoded characters
every 2 bytes could represent a character, using UTF-16 encoding
In my particular case, I am providing the IWebBrowser an IStream that contains a series of double-byte characters (UTF-16), but the browser (incorrectly) believes that UTF-8 encoding is in effect. This results in garbled characters.
Workaround solution
While the question asks how to specify the encoding, in my particular case, with only UTF-16 encoding, there's a simple workaround: adding the 0xFEFF Byte Order Mark (BOM) indicates that the text is UTF-16 Unicode. IE then uses the proper encoding and shows the text properly.
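For illustration, here is roughly what the stream bytes end up looking like, sketched in Python just to show the byte layout (the real code writes these bytes into the IStream):
html = "<html><body>text</body></html>"
data = ("\ufeff" + html).encode("utf-16-le")  # prepend U+FEFF, then encode as UTF-16LE
print(data[:2])  # b'\xff\xfe': the BOM bytes the browser sniffs to pick UTF-16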
Of course that wouldn't work if the text were encoded, for example with:
UCS-2
UCS-4
ISO-10646-UCS-2
UNICODE-1-1-UTF-8
UNICODE-2-0-UTF-16
UNICODE-2-0-UTF-8
US-ASCII
ISO-8859-1
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1253
WINDOWS-1254
WINDOWS-1255
WINDOWS-1256
WINDOWS-1257
WINDOWS-1258
IE's document supports IPersistMoniker loading too. IE uses URL monikers for downloading. You can replace the url moniker created by CreateURLMonikerEx with your own moniker. A few details about URL moniker's implementation can be found here. See if you can get IHTTPNegotiate from the binding context when your BindToStorage implementation is called.

"’" showing on page instead of " ' "

’ is showing on my page instead of '.
I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
In addition, my browser is set to Unicode (UTF-8):
So what's the problem, and how can I fix it?
So what's the problem,
It's a ’ (RIGHT SINGLE QUOTATION MARK - U+2019) character which is being decoded as CP-1252 instead of UTF-8. If you check the encodings table, you'll see that in UTF-8 this character is composed of the bytes 0xE2, 0x80 and 0x99. If you check the CP-1252 code page layout, you'll see that each of those bytes stands for the individual characters â, € and ™.
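You can reproduce that mapping in a couple of lines of Python (purely illustrative):
"\u2019".encode("utf-8")          # b'\xe2\x80\x99'
b"\xe2\x80\x99".decode("cp1252")  # 'â€™'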
and how can I fix it?
Use UTF-8 instead of CP-1252 to read, write, store, and display the characters.
I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
This only instructs the client which encoding to use to interpret and display the characters. This doesn't instruct your own program which encoding to use to read, write, store, and display the characters in. The exact answer depends on the server side platform / database / programming language used. Do note that the one set in HTTP response header has precedence over the HTML meta tag. The HTML meta tag would only be used when the page is opened from local disk file system instead of from HTTP.
In addition, my browser is set to Unicode (UTF-8):
This only tells the client which encoding to use to interpret and display the characters. But the actual problem is that you're already sending â€™ (encoded in UTF-8) to the client instead of ’. The client is correctly displaying â€™ using the UTF-8 encoding. If the client was misinstructed to use, for example ISO-8859-1, you would likely have seen ââ¬â¢ instead.
I am using ASP.NET 2.0 with a database.
This is most likely where your problem lies. You need to verify with an independent database tool what the data looks like.
If the ’ character is there, then you aren't connecting to the database correctly. You need to tell the database connector to use UTF-8.
If your database contains ’, then it's your database that's messed up. Most probably the tables aren't configured to use UTF-8. Instead, they use the database's default encoding, which varies depending on the configuration. If this is your issue, then usually just altering the table to use UTF-8 is sufficient. If your database doesn't support that, you'll need to recreate the tables. It is good practice to set the encoding of the table when you create it.
You're most likely using SQL Server, but here is some MySQL code (copied from this article):
CREATE DATABASE db_name CHARACTER SET utf8;
CREATE TABLE tbl_name (...) CHARACTER SET utf8;
If your table is however already UTF-8, then you need to take a step back: who or what put the data there? That's where the problem is. One example would be HTML form submitted values which are incorrectly encoded/decoded.
Here are some more links to learn more about the problem:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), from our own Joel.
Unicode - How to get the characters right?, with more concise and practical information, solutions are targeted on Java environments.
How to setup your PHP site to use UTF8, targeted on PHP environments.
Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.
Or use &rsquo;.
’ (Unicode codepoint U+2019 RIGHT SINGLE QUOTATION MARK) is encoded in UTF-8 as bytes:
0xE2 0x80 0x99.
’ (Unicode codepoints U+00E2 U+20AC U+2122) is encoded in UTF-8 as bytes:
0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.
These are the bytes your browser is actually receiving in order to produce ’ when processed as UTF-8.
That means that your source data is going through two charset conversions before being sent to the browser:
The source ’ character (U+2019) is first encoded as UTF-8 bytes:
0xE2 0x80 0x99
those individual bytes were then being mis-interpreted and decoded to Unicode codepoints U+00E2 U+20AC U+2122 by one of the Windows-125X charsets (1252, 1254, 1256, and 1258 all map 0xE2 0x80 0x99 to U+00E2 U+20AC U+2122), and then those codepoints are being encoded as UTF-8 bytes:
0xE2 -> U+00E2 -> 0xC3 0xA2
0x80 -> U+20AC -> 0xE2 0x82 0xAC
0x99 -> U+2122 -> 0xE2 0x84 0xA2
You need to find where the extra conversion in step 2 is being performed and remove it.
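The double conversion is easy to reproduce in Python (Python 3, purely illustrative):
s = "\u2019"                                   # the source ’
once = s.encode("utf-8")                       # b'\xe2\x80\x99'
twice = once.decode("cp1252").encode("utf-8")  # b'\xc3\xa2\xe2\x82\xac\xe2\x84\xa2'
print(twice.decode("utf-8"))                   # prints â€™, exactly the symptom on the page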
This sometimes happens when a string is converted from Windows-1252 to UTF-8 twice.
We had this in a Zend/PHP/MySQL application where characters like that were appearing in the database, probably due to the MySQL connection not specifying the correct character set. We had to:
Ensure Zend and PHP were communicating with the database in UTF-8 (was not by default)
Repair the broken characters with several SQL queries like this...
UPDATE MyTable SET
MyField1 = CONVERT(CAST(CONVERT(MyField1 USING latin1) AS BINARY) USING utf8),
MyField2 = CONVERT(CAST(CONVERT(MyField2 USING latin1) AS BINARY) USING utf8);
Do this for as many tables/columns as necessary.
You can also fix some of these strings in PHP if necessary. Note that because characters have been encoded twice, we actually need to do a reverse conversion from UTF-8 back to Windows-1252, which confused me at first.
mb_convert_encoding('â€™', 'Windows-1252', 'UTF-8'); // returns ’
I have some documents where … was showing as â€¦ and ê was showing as Ãª. This is how it got there (python code):
# Adam edits original file using windows-1252
windows = b'\x85\xea'
# that is HORIZONTAL ELLIPSIS, LATIN SMALL LETTER E WITH CIRCUMFLEX
# Beth reads it correctly as windows-1252 and writes it as utf-8
utf8 = windows.decode("windows-1252").encode("utf-8")
print(utf8)
# Charlie reads it *incorrectly* as windows-1252 and writes a twingled utf-8 version
twingled = utf8.decode("windows-1252").encode("utf-8")
print(twingled)
# detwingle by reading as utf-8 and writing as windows-1252 (it's really utf-8)
detwingled = twingled.decode("utf-8").encode("windows-1252")
assert utf8 == detwingled
To fix the problem, I used python code like this:
with open("dirty.html","rb") as f:
dt = f.read()
ct = dt.decode("utf8").encode("windows-1252")
with open("clean.html","wb") as g:
g.write(ct)
(Because someone had inserted the twingled version into a correct UTF-8 document, I actually had to extract only the twingled part, detwingle it and insert it back in. I used BeautifulSoup for this.)
It is far more likely that you have a Charlie in content creation than that the web server configuration is wrong. You can also force your web browser to twingle the page by selecting windows-1252 encoding for a utf-8 document. Your web browser cannot detwingle the document that Charlie saved.
Note: the same problem can happen with any other single-byte code page (e.g. latin-1) instead of windows-1252.
You have a mismatch in your character encoding; your string is encoded in one encoding (UTF-8) and whatever is interpreting this page is using another (say, Windows-1252).
Always specify your encoding in your http headers and make sure this matches your framework's definition of encoding.
Sample http header:
Content-Type: text/html; charset=utf-8
Setting encoding in asp.net
<configuration>
<system.web>
<globalization
fileEncoding="utf-8"
requestEncoding="utf-8"
responseEncoding="utf-8"
culture="en-US"
uiCulture="de-DE"
/>
</system.web>
</configuration>
Setting encoding in jsp
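A standard JSP page directive does the equivalent (assuming a typical servlet container):
<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>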
If your content type is already UTF-8, then the data is likely already arriving in the wrong encoding. If you are getting the data from a database, make sure the database connection uses UTF-8.
If this is data from a file, make sure the file is encoded correctly as UTF-8. You can usually set this in the "Save as..." Dialog of the editor of your choice.
If the data is already broken when you view it in the source file, chances are that it used to be a UTF-8 file but was saved in the wrong encoding somewhere along the way.
If someone gets this error on a WordPress website, you need to change the DB charset in wp-config.php:
define('DB_CHARSET', 'utf8mb4_unicode_ci');
instead of:
define('DB_CHARSET', 'utf8mb4');
If the other answers haven't helped, you might want to check whether your database is actually storing the mojibake characters. I was viewing the text in utf-8, but I was still seeing the mojibake and it turned out that, due to a database upgrade, the text had been permanently "mojibaked".
In this case, one option is to "fix" the text with Python's ftfy package (or the JavaScript version here).
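ftfy usage is minimal; something along these lines (assuming the package is installed):
import ftfy
print(ftfy.fix_text("â€™"))  # prints ’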
You must have copied/pasted text from a Word document. Word uses smart quotes. You can replace them with the special character (’) or simply type a plain apostrophe (') in your HTML editor.
I'm sure this will solve your problem.
In DBeaver (or other editors) the script file you're working on can prompt to save as UTF-8, and that will change the char:
–
into
–
or
–
The same thing happened to me with the '–' character (long minus sign).
I used this simple replace to resolve it:
htmlText = htmlText.Replace('–', '-');

What is this encoding that FF and Chrome use?

I was looking at the source code of a particular page of my project and noticed that FF transforms special characters such as "á" into &#225;.
Which encoding is that?
Thanks!!
I suspect it is the other way round; Firefox and Chrome take &#225; in the HTML source code and render it as the character á ("Latin small a with acute").
The reason for allowing these in HTML is that the HTML might be supplied in an encoding which doesn't support the character. Any Unicode character is allowed, but it may not get rendered correctly if your browser doesn't have that character in any of its fonts.
As it says in the W3C HTML spec, there are two ways of encoding arbitrary Unicode characters:
&#D;: where D is the decimal value of the Unicode character (e.g. &#225;)
&#xH;: where H is the (case-insensitive) hexadecimal value of the Unicode character, e.g. &#xE1; in your case
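As a quick illustration, Python's html module resolves both forms to the same character (Python 3 assumed):
import html
html.unescape("&#225; and &#xE1;")  # 'á and á': both references name U+00E1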
These are numeric character references, as defined in the HTML 4.01 Specification.
HTML ASCII Character Encoding. Here's a table of many of them:
HTML Codes

What multi-byte character set starts with 0x7F and is 4 bytes long?

I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information for it? Thanks..
UPDATE:
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long. It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application. However, if any locale other than Japanese is selected, I cannot even view the filenames properly. So I'm guessing this encoding is not Unicode. Anyone know what it is? Is it ANSI? Is it Shift JIS?
For the Chinese one, I've tested it with Unicode and UTF-8 characters and I'm getting the same pattern; 0x7F followed by three bytes. Are Unicode and UTF-8 the same?
One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long
What are the other bytes? Do you have any Latin text in this encoding?
If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.
Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.
ETA:
are there supposed to be Byte Order Marks?
There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long.
That's surely UTF-8. Sequence 0xE3 0x... 0x... would result in a character between U+3000 and U+4000, which is where the hiragana/katakana live.
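A quick Python check of that claim (illustrative):
b"\xe3\x81\x82".decode("utf-8")  # 'あ' (HIRAGANA LETTER A)
hex(ord("あ"))                   # '0x3042': right in the range the 0xE3 lead byte covers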
It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application.
Then chances are your application is one of the regrettable horde of non-Unicode-compliant apps, still using the ‘A’(*) versions of the Win32 interfaces instead of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.
(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)
ETA(2):
OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher if you assume the premise that Unicode code points are being encoded.
0x00-0x7E: plain ASCII
0x7F A B C: Unicode character
The character encoded in a Unicode escape can be calculated by taking the index in a key string of A, B and C and adding together:
A*0x1000 + B*0x40 + C
That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:
.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.
So, for example:
0x7F 3 u g
A=4 B=58 C=44
4*0x1000 + 58*0x40 + 44 = 0x4EAC
U+4EAC = 京
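A small Python sketch of that decoding, assuming the guessed key string is right:
KEY = ".0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

def decode_escape(a, b, c):
    # 0x7F A B C encodes one Unicode code point as three base-64 digits
    return chr(KEY.index(a) * 0x1000 + KEY.index(b) * 0x40 + KEY.index(c))

print(decode_escape("3", "u", "g"))  # 京 (U+4EAC)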
ETA(3):
Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding as 6 is as many bits as you can get out of readable ASCII.
Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.
It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.
You might want to look at the Chinese character encoding page on Wikipedia. The only encoding in there that I can see that is always 4 bytes is UTF-32.
GB 18030 is the current standard Chinese character set, but it can be 1 to 4 bytes long.
Try chardet. It does a good job of guessing the character encoding of a string of bytes.
Are Unicode and UTF-8 the same?
No. UTF-8 is just one way to represent Unicode characters as a sequence of bytes. Unicode is the full standard, assigning numeric and human-readable identifiers to each character, as well as lots of metadata about the characters.
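For example (Python 3):
hex(ord("京"))        # '0x4eac': the code point Unicode assigns to the character
"京".encode("utf-8")  # b'\xe4\xba\xac': one particular byte representation of it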
It might be a valid Unicode encoding, such as UTF-8 or a UTF-16 surrogate pair.
Yes, the Chinese one is UTF-8, an encoding of Unicode.
UTF-8 is 1 byte long for ASCII characters and up to 4 bytes for others.