What is this encoding that FF and Chrome use? - encoding

I was looking at the source code of a particular page of my project and noticed that FF transforms special characters such as "á" into &#225;.
Which encoding is that?
Thanks!!

I suspect it is the other way round; Firefox and Chrome take &#225; in the HTML source code and render it as the character á ("Latin small letter a with acute").
The reason for allowing these in HTML is that the HTML might be supplied in an encoding which doesn't support the character. Any Unicode character is allowed, but it may not get rendered correctly if your browser doesn't have that character in any of its fonts.
As it says in the W3C HTML spec, there are two ways of encoding arbitrary Unicode characters:
&#D;: where D is the decimal value of the Unicode character (e.g. &#225;)
&#xH;: where H is the (case-insensitive) hexadecimal value of the Unicode character (e.g. &#xE1; in your case)
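As a rough Java illustration of building both forms (just a sketch; the class name is made up):

public class NumericRefs {
    public static void main(String[] args) {
        int codePoint = 0xE1;  // U+00E1, LATIN SMALL LETTER A WITH ACUTE (á)
        // Decimal numeric character reference
        System.out.println("&#" + codePoint + ";");                                      // &#225;
        // Hexadecimal numeric character reference
        System.out.println("&#x" + Integer.toHexString(codePoint).toUpperCase() + ";");  // &#xE1;
    }
}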

It's a numeric character reference, as defined in the HTML 4.01 Specification.

These are HTML character references. Here's a table of many of them:
HTML Codes

Related

How to convert ANSI text to Unicode?

I would like to convert RTF text to Unicode. In the RTF font table one can find the name of the font or font face (e.g. Arial Cyr, Courier Greek) and the charset to use with it (0-255). So how do I write a function that converts a character code (0-255) with these settings to Unicode?
As I see it, the postfixes like Greek, Cyr, Tur etc. affect the glyph of the displayed characters, and the charset affects it too. So the function could have these input parameters:
fontname postfix, font charset, character code
But what comes next? Or am I on the wrong track?
RTF was invented long before Unicode. It most certainly isn't ANSI text; RTF itself only uses ASCII, and it encodes non-ASCII characters with a rather unholy mix of character sets, writing them as hex escapes with a reference to the character set. The mapping is also not perfect: many Unicode code points have no corresponding charset.
You'll spend a lifetime creating your own RTF-to-Unicode converter. Take advantage of an existing solution; almost any platform has one. On Windows that would be the RichEdit control. If you use .NET it is especially simple: use the RichTextBox class, assign its Rtf property and read back its Text property, which is UTF-16 encoded Unicode.
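The same idea in Java, as a minimal sketch using Swing's built-in RTFEditorKit (error handling omitted; the file name input.rtf is made up):

import java.io.FileInputStream;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

public class RtfToText {
    public static void main(String[] args) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        // Let the toolkit deal with the font table and charsets, then read back plain Unicode text
        try (FileInputStream in = new FileInputStream("input.rtf")) {
            kit.read(in, doc, 0);
        }
        System.out.println(doc.getText(0, doc.getLength()));
    }
}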

Difference between Unicode formats?

I noticed something while uploading some Unicode data to the database. When the content is uploaded through a textarea, it gets stored in &#2325; format, but when you personally type or paste the Unicode and insert it hardcoded in PHP, then it is stored in à¤• format. But for both, the Unicode character is the same: क.
Now please tell me the difference between the different formats of Unicode characters, and how they affect development. There have to be some limitations in those formats.
&#2325; is markup used in HTML to represent a Unicode character (a numeric character reference).
If you hard-code something in a PHP source file, make sure you are opening it with an editor that correctly displays text files with Unicode characters in them.
http://www.joelonsoftware.com/articles/Unicode.html is a good place to learn the basics of Unicode.
The UTF-8 encoding of क is the byte sequence E0 A4 95.
Now if somebody interprets those bytes as an 8-bit Latin encoding, they will see a separate character for each byte:
you will see in the table in the above link that E0 is à and A4 is ¤ (and 95 is a C1 control code, displayed as • in windows-1252).
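A short Java sketch of exactly that mix-up (the class name is just for illustration):

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String ka = "\u0915";                               // क DEVANAGARI LETTER KA
        byte[] utf8 = ka.getBytes(StandardCharsets.UTF_8);  // E0 A4 95
        // Misreading the UTF-8 bytes as an 8-bit Latin charset yields three unrelated characters
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread);                        // à¤ followed by a C1 control character
    }
}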
"When the content is uploaded through textarea, it gets stored in &#2325; format"
Forms should not submit content in a character-reference (&#...;) format.
But in reality, they do in most current browsers... but only when they can't submit the character in question in any other way. In that case, you can't tell whether the user originally typed क or &#2325;; it is a lossy encoding.
To avoid this, make sure you are serving your page in a charset that supports all possible Unicode characters. In practical terms this means: always use UTF-8, and serve your page with the Content-Type: text/html;charset=utf-8 header and/or the <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> element in the document's <head>. You'll then get all characters in simple, uncorrupted UTF-8 format.
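As an illustration only (sketched here with the JDK's built-in com.sun.net.httpserver; in PHP the equivalent is a header() call), serving a page with an explicit UTF-8 Content-Type might look like:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class Utf8PageServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            // \u0915 (Devanagari ka) and \u00E1 (a-acute), written as escapes to keep this source file ASCII-safe
            byte[] body = ("<!DOCTYPE html><html><head><meta charset=\"utf-8\"/></head>"
                    + "<body>\u0915 \u00E1 &#8212;</body></html>").getBytes(StandardCharsets.UTF_8);
            // The charset declared here controls how the browser decodes (and later submits) text
            exchange.getResponseHeaders().set("Content-Type", "text/html;charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}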

Some UTF-8 characters do not show up on browser

Some UTF-8 characters, such as the sequence C2 96 (which I expected to be a hyphen), do not show up. The browser displays it as a box containing 00 96, and not as '-' (hyphen). Any reason for this behavior? How do we correct this?
http://stuffofinterest.com/misc/utf8.php?s=128 (Refer this URL for the codes)
I found that this can be handled with html entities. Is there any way to display this without converting to html entities?
The character you're talking about is an en-dash, not a hyphen. Its Unicode code point is U+2013, and its UTF-8 encoding is E2 80 93, not C2 96. That table you linked to is incorrect. The first two columns have nothing to do with UCS-2 or Unicode; they actually contain the windows-1252 encodings for the characters in question. The columns labeled "UTF-8 Hex" and "UTF-8 Native" are just plain wrong, at least for the rows labeled 128 to 159. The entities &ndash; and &#8211; represent an en-dash, but the UTF-8 sequence C2 96 represents a non-displayable control character.
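You can verify this with a few lines of Java, for example:

import java.nio.charset.StandardCharsets;

public class EnDashBytes {
    public static void main(String[] args) {
        // U+2013 EN DASH really encodes to three bytes in UTF-8
        for (byte b : "\u2013".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);                // E2 80 93
        }
        System.out.println();
        // The sequence C2 96, decoded as UTF-8, is U+0096, a C1 control character
        String decoded = new String(new byte[] { (byte) 0xC2, (byte) 0x96 }, StandardCharsets.UTF_8);
        System.out.printf("U+%04X%n", (int) decoded.charAt(0));  // U+0096
    }
}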
You shouldn't need to encode those characters manually anyway. Just tell your text editor (or whatever you use to create the content) to save the file as UTF-8.
I suspect this is because the characters between U+0080 and U+009F inclusive are control characters. I'm still slightly surprised that they show differently when encoded directly in the HTML than using entities, but basically you shouldn't be using them to start with. U+0096 isn't really "hyphen", it's "start of guarded area".
See the U+0080-U+00FF code chart for more information. Basically, try to avoid control characters...
Two reasons come to mind:
Are you sure that you have output the correct character code to the browser? Better check in some hex viewer.
The font you are using doesn't have a glyph defined at this code point.

Japanese ASCII Code

Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana and katakana characters? I am writing a Java function and JavaScript which determine whether a character is Japanese. What is its range in the ASCII code?
ASCII stands for American Standard Code for Information Interchange, only includes 128 characters (not all of them even printable), and is based on the needs of American use circa 1960. It includes nothing related to any Japanese characters.
I believe you want the Unicode code points for some characters, which you can lookup in the charts provided by unicode.org.
Please see my similar question regarding Kanji/Kana characters. As @coobird mentions, it may be tricky to decide what range you want to check against, since many Kanji overlap with Chinese characters.
In short, the Unicode ranges for hiragana and katakana are:
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0-30FF
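A minimal Java sketch of the corresponding range check (the method name isKana is just for illustration):

public class KanaCheck {
    // True if the code point falls in the hiragana or katakana blocks listed above
    static boolean isKana(int codePoint) {
        return (codePoint >= 0x3040 && codePoint <= 0x309F)   // hiragana
            || (codePoint >= 0x30A0 && codePoint <= 0x30FF);  // katakana
    }

    public static void main(String[] args) {
        System.out.println(isKana("\u3072".codePointAt(0)));  // ひ -> true
        System.out.println(isKana("\u30AB".codePointAt(0)));  // カ -> true
        System.out.println(isKana("A".codePointAt(0)));       // false
    }
}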
If you find this answer useful please upvote @coobird's answer to my question as well.
がんばって! ("Good luck!")
Well, it has been a while, but here's a link to tables of hiragana, katakana, kanji etc. and their Unicode values...
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
BUT, as you probably know, Unicode code points are given in hexadecimal. You can translate them into decimal numbers using Windows Calc in programmer mode and then input that number as an Alt code, and it will produce the character you want, well, depending on what you're putting it into. It will in MS Wordpad and Word (not Notepad).
For example the hiragana ぁ is U+3041. 3041 is hexadecimal and translates to 12353 in decimal. If you enter 12353 as an Alt code into Wordpad or Word, i.e. hold Alt, enter 12353 on the number pad, then release Alt, it will print ぁ. The ranges of Japanese characters are roughly Hiragana: 3040-309F (12352-12447 in decimal), Katakana: 30A0-30FF (12448-12543 in decimal), Kanji: 4E00-9FAF (19968-40879 in decimal), so there are several ranges. There's also a half-width katakana range on that chart.
Japanese characters won't be in the ASCII range, they'll be in Unicode. What do you want, just the char value for each character?
I won't rehash the ASCII part. Just have a look at the Unicode Code Charts.
Kanji will have a Unicode "Script" property of Hani, hiragana will have a "Script" property of Hira, and katakana have a "Script" property of Kana. In Java, you can determine the "Script" property of a character using the Character.UnicodeScript class: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html I don't know if you can determine a character's "Script" property in Javascript.
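For example, a small sketch using that class (Java 7+; the helper name isJapaneseScript is made up):

import java.lang.Character.UnicodeScript;

public class ScriptCheck {
    // True if the code point belongs to one of the scripts used to write Japanese
    static boolean isJapaneseScript(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HAN;                            // kanji, shared with Chinese
    }

    public static void main(String[] args) {
        System.out.println(isJapaneseScript("\u3072".codePointAt(0))); // ひ -> true (HIRAGANA)
        System.out.println(isJapaneseScript("\u732B".codePointAt(0))); // 猫 -> true (HAN)
        System.out.println(isJapaneseScript("A".codePointAt(0)));      // false (LATIN)
    }
}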
Of course, most kanji are characters that are also used in Chinese; given a character like 猫, it is impossible to tell whether it's being used as a Chinese character or a Japanese character.
I think what you mean by an ASCII code for Japanese is the SBCS (Single Byte Character Set) equivalent in Japanese. For Japanese you only have MBCS (Multi-Byte Character Sets), which mix single-byte and multi-byte characters. So in a Japanese text file saved as MBCS, non-Japanese characters (English letters, digits and common punctuation) are stored as one byte each and Japanese characters are stored as two bytes.
This assumes you are not referring to Unicode, which was originally a uniform DBCS (Double Byte Character Set) where each character is exactly two bytes. To be more correct, Unicode has since outgrown that: the character set could not accommodate all the characters any more, so some characters (in UTF-16) consist of 4 bytes, with the first two bytes acting as a lead (surrogate) unit.
If you are referring to an MBCS rather than Unicode, then there are many Japanese character sets, Shift-JIS being the most popular, so I suggest you search for a Shift-JIS character map. There are other Japanese character set maps aside from Shift-JIS, though.
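A tiny Java illustration of that single-byte/double-byte mix (assuming the Shift_JIS charset is available in your JRE, which it is in standard desktop JREs):

import java.nio.charset.Charset;

public class ShiftJisWidths {
    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println("A".getBytes(sjis).length);        // 1 byte: ASCII letter
        System.out.println("\u3042".getBytes(sjis).length);   // 2 bytes: あ (hiragana a)
        System.out.println("\u732B".getBytes(sjis).length);   // 2 bytes: 猫 (kanji for "cat")
    }
}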

What is the difference between EM Dash &#151; and &#8212;?

I've an ASCII file that contains an EM Dash (&#151; or &#8212; in HTML). The hex value is 0x97. When we pass this file through one application it arrives as UTF-8, and it converts the character to 0xC297, which is &#151; in HTML. However, when we pass this file through a different application it converts the character to 0xE28094, or &#8212;.
What would cause these applications to convert these characters differently? Is it perhaps a code page setting?
&#151; is wrong. When you use numeric character references, the number refers to the Unicode codepoint. For numbers below 256 that is the same as the codepoint in ISO-8859-1. In 8859-1, character 151 is amongst the “C1 control codes”, and not a dash or any other visible character.
The confusion arises because character 151 is a dash in Windows code page 1252 (Western European). Many people think cp1252 is the same thing as ISO-8859-1, but in reality it's not: the characters in the C1 range (128 to 159) are different.
The first application is reading your “ASCII” file* as ISO-8859-1, but actually it's probably cp1252 and you'll need a way to clue the app in about what encoding it has to expect.
(*: “ASCII” is a misnomer if there are top-bit-set characters in the file. You probably mean “ANSI”, which is really also a misnomer, but one which has stuck in the Windows world to mean “text encoded in the current system-default code page”.)
&#151; is not an em dash; your text was mis-translated from em dash to that value.
&#8212; is the HTML decimal entity for an em dash. Specifically, it references the Unicode code point 8212 (U+2014), which is an em dash.
Your file is not ASCII if it contains an em dash. ASCII chars only encode to decimal range 0 - 127, and em dash is not a character that can be represented by ASCII encoding. If you have em dash stored as 0x97 (151 in decimal) you probably have an ANSI text file (aka Windows Codepage 1252 (w-1252)).
Your first app...
The data started as an em dash encoded in w-1252. In w-1252 the em dash maps to the decimal value 151 (0x97 in hex, or 10010111 in binary).
At some point the em dash was handled by code that thought the bytes in your file were iso-8859-1 encoded text. When that code interpreted 0x97 as a string/char it mapped 0x97 to a character according to the iso-8859-1 encoding. In iso-8859-1 0x97 maps to the char "End of guarded area".
Next, the string, which the code thinks is the "End of guarded area" control char, was encoded as utf-8. "End of guarded area" encoded in utf-8 is the two-byte sequence: 0xC2 0x97.
Your second app...
The text file was correctly interpreted as w-1252, thus the 0x97 is recognized as em dash, which was correctly encoded as the em dash in utf-8: 0xE2 0x80 0x94.
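A small Java sketch reproducing both paths (only the byte 0x97 and the charset names come from the question; the rest is illustrative):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EmDashPaths {
    public static void main(String[] args) {
        byte[] fileBytes = { (byte) 0x97 };                    // the em dash as stored in the w-1252 file

        // First app: wrongly decodes as ISO-8859-1 (giving U+0097), then re-encodes as UTF-8
        String wrong = new String(fileBytes, StandardCharsets.ISO_8859_1);
        printHex(wrong.getBytes(StandardCharsets.UTF_8));      // C2 97

        // Second app: correctly decodes as windows-1252 (giving U+2014), then re-encodes as UTF-8
        String right = new String(fileBytes, Charset.forName("windows-1252"));
        printHex(right.getBytes(StandardCharsets.UTF_8));      // E2 80 94
    }

    static void printHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
        System.out.println(sb.toString().trim());
    }
}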
What influences this behavior
Not sure if you're dealing with web apps or what, but the concept should be the same whatever it is. We had the same 0x97->0xC297 scenario in a web app where people input data into a form. I found that the charset of the web page was declared as iso-8859-1, and the browser's best way to handle the w-1252 chars was to just send them along as the iso bytes without alerting the user or the server. The server receives the data, thinks it's iso, and converts it to utf-8, resulting in 0xC297.
Basically any time an app touches text it needs to be told how the text is encoded, or else it might fall back to a system default. If that happens you risk data corruption.
According to the HTML4 specification's character entity references, the em dash is &mdash; (U+2014).
An ASCII file cannot contain the byte 0x97, as the ASCII character set only ranges from 0x00 to 0x7F. Therefore your file is not ASCII, but some other single-byte encoding. The windows-1250 encoding, for example, has the em dash at 0x97.
If the applications decode the text file using some other encoding than the one that was used to create the file, any character above 0x7F will be wrong.
In unicode the em-dash has the character code 0x2014, or 8212 in decimal.
Unicode Character 'EM DASH' (U+2014)
In a web page that, for example, uses windows-1250 as its encoding, the code &#8212; will render as an em dash:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>em-dash</title>
<meta http-equiv="content-type" content="text/html; charset=windows-1250"/>
</head>
<body>
<div>&#8212;</div>
</body>
</html>