Some UTF-8 characters do not show up on browser - encoding

Some UTF-8 characters like the UTF-8 equivalent of C2 96 (hyphen). On the browser it displays it as (utf box with 00 96). And not as '-'(hyphen). Any reasons for this behavior? How do we correct this?
http://stuffofinterest.com/misc/utf8.php?s=128 (Refer this URL for the codes)
I found that this can be handled with html entities. Is there any way to display this without converting to html entities?

The character you're talking about is an en-dash, not a hyphen. Its Unicode code point is U+2013, and its UTF-8 encoding is E2 80 93, not C2 96. That table you linked to is incorrect. The first two columns have nothing to do with UCS-2 or Unicode; they actually contain the windows-1252 encodings for the characters in question. The columns labeled "UTF-8 Hex" and "UTF-8 Native" are just plain wrong, at least for the rows labeled 128 to 159. The entities – and – represent an en-dash, but the UTF-8 sequence C2 96 represents a non-displayable control character.
You shouldn't need to encode those characters manually anyway. Just tell your text editor (or whatever you use to create the content) to save the file as UTF-8.

I suspect this is because the characters between U+0080 and U+009F inclusive are control characters. I'm still slightly surprised that they show differently when encoded directly in the HTML than using entities, but basically you shouldn't be using them to start with. U+0096 isn't really "hyphen", it's "start of guarded area".
See the U+0080-U+00FF code chart for more information. Basically, try to avoid control characters...

Two reasons come to mind:
Are you sure that you have output the correct character code to the browser? Better check in some hex viewer.
The font you are using doesn't have a glyph defined at this code point.

Related

understanding different character encodings

When I save a text document in UTF-8 that's basically saying: Computer, use the codepage for UTF8 that's installed somewhere on your computer to figure out, how to turn the 1's and 0's to characters, right?
When I save this content:
激光
äüß
#§
in ISO-8895-1, it becomes this (on Linux, using Kate editor):
æ¿å
äüÃ
#§
What is not displayed here is that in the first and second row that are some weird squares displayed instead of characters (can be seen in developer tools).
So my understanding is that this means that the combination of 0's and 1's that represent 激 in utf-8 is mapped to æ in ISO-8895-1, right? And the weird squares > < happen because there is no mapping for that binary number in the ISO-8895-1 character set so the computer defaults to some other encoding.
Is that correct?
Yes, sort of correct.
If you store a file as UTF-8, it usually gets a special byte combination that indicates its type of encoding at the beginning of the file. I think, Kate (don't know this editor) doesn't recognize this and just displays the file as something else. So basically, your file is still correct, but was just visualized in a wrong way.
The weird squares are another indicator, that Kate doesn't recognize those leading bytes, cause usually editors hide them from the user and just use the information to display the file correctly.
You have it pretty much right. The character U+6FC0 (激) for example is encoded with 3 bytes in UTF-8: 0xE6 0xBF 0x80.
If you interpret these bytes in ISO-8859-1, you get the characters æ¿. Depending on the version of ISO-8859-1, 0x80 is either not mapped to a character at all, or is mapped to a non-printable control character, that's why you can see only two characters for the three bytes.
If you use Windows-1252 instead of ISO-8859-1 you'll see æ¿€.

Are all character encoding compatible with ascii, for at least a-z

Some charset don't have all the 128 first identical to ascii, but is A to Z and a to z, always in the sam position?
I had a plan to set apaches default charset to somting odd in my test envirement, for easy detecting sites that don't tell the browser what encoding it sending.
But so far, I can't find one that makes A to Z show up as someting else.
There is an other question close to the subject, but thats about all 128 ascii chars:
Are ASCII characters always encoded the same way in all character encodings?
No, EBCDIC from IBM is the famous exception.
Another testcase is UTF-16 Big Endian, which puts "A" at U+0041. ASCII would treat the first 00 as a NUL, which often is interpreted as an end-of-string.
In short: no. The encoding mentioned in the other answer, EBCDIC, has a very different layout, to pick just one example.
Most encodings you find in the wild on the web today are probably ASCII compatible. But there are encodings other than ASCII which are entirely incompatible too.

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing .txt file from a remote server and saving it to a database. I use a .Net script for this purpose. I sometimes notice a garbled word/characters (Ullerهkersvنgen) inside the files, which makes a problem while saving to the database.
I want to filter all such characters and convert them to unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script does know the correct encoding of your text snippet than that should be the regular expression to find all Non-ASCII charactres:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character encoding assumption or as munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian text are normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)

Difference between Unicode format?

I noticed something while uploading some unicode data to the database. When the content is uploaded throught textarea, is gets stored in क format, but when you personally type or paste the unicode and insert it hardcoded in php, then it would store in ठformat. But for both, the unicode character is same क.
Now please tell me the difference between the different formats of unicode characters. And how they affect the development. There has to be some limitations in those formats.
& #2325; is markup used in HTML to represent a Unicode character
If you hard code something in a php source file, Make sure you are opening it with editor that correctly displays text files with unicode characters in it.
http://www.joelonsoftware.com/articles/Unicode.html is good place to know the basics of unicode.
UTF-8 encoding of क has the byte sequence E0 A4
Now if somebody interprets this as 8 bit Latin encoding it will think it is two characters
you will see in the table in the above link E0 is à and A4 is ¤
When the content is uploaded throught textarea, is gets stored in क format,
Forms should not submit content in a character-reference (&#...;) format.
But in reality, they do in most current browsers... but only when they can't submit the character in question in any other way. In this case, you can't tell whether the user originally typed क or क, it is a lossy encoding.
To avoid this, make sure you are serving your page in a charset that supports all possible Unicode characters. In practical terms this means always use UTF-8, and serve your page with the Content-Type: text/html;charset=utf-8 header and/or the <meta http-equiv="Content=Type" content="text/html;charset=utf-8"/> element in the header. You'll then get all characters in simple, uncorrupted UTF-8 format.

Codepages and encodings

Before anyone recommends that I do a google search on this, I have. I just need a bit more clarity around what codepages and encodings.
If I use UTF8 encoding, and use an italian code page and then a french code page, does this mean ill get different characters even though the bytes havent changed?
Joel has a nice summary of this:
http://www.joelonsoftware.com/articles/Unicode.html
And no. if I understand your question correctly it doesn't mean that.
When you're converting UTF-8 to a specific code page, it is possible that only some of the characters are going to be converted. What happens to the ones that don't get converted depends on how you call the conversion. A possible result is that the characters which could not be mapped to the code page would be converted to question mark characters.
An encoding is simply a mapping between numerical values and "characters".
US-ASCII maps the number 65 to the letter A, 32 to a space and 49 to the digit "1". (How these things are rendered is another matter.) In fact, UTF-8 does the same! But there are other values which UTF-8 treats differently to ASCII. It is a variable-length encoding, i.e. a character may be encoded with 1, 2, 3, or 4 bytes; common characters generally consume less bytes.
Plain text files, including web pages, are stored and transmitted as sequences of bytes. These bytes are supposed to represent something textual. Software applications (like text editors and web browsers) are responsible for rending the information within these files on the screen. Usually they make use of library or OS functions.
If the software assumes a different encoding to the software that created the file, the wrong characters may be displayed!
Note that it is possible to convert between different encodings; however if you convert to an encoding that does not contain a certain character, the software must make a choice as to what to use instead. This conversion often happens transparently (when you save a file with a certain encoding, whatever you've typed must be changed into that encoding).
UTF-8 includes all characters from your French and Italian code page, but the language specific code pages does not include all of each others characters.
So you can take input from each language and convert it to UTF-8 for storage, but you can not be certain that you will get the right characters if you take Italian input and show it as French.
Use UTF-8 all the way if you can.