Which single-byte charset matches the first 0x100 Unicode characters?

Is there a single-byte charset (e.g. ISO-8859-x) that matches the first 256 Unicode characters (i.e. characters \u0000-\u00FF) exactly or almost exactly?

ISO-8859-1 matches the first 256 Unicode code points exactly, by design: Unicode's first 256 code points were taken directly from it.
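A quick way to verify this (a Python sketch; 'latin-1' is Python's codec alias for ISO-8859-1):

    # Decode every byte value 0x00-0xFF as ISO-8859-1 and confirm each byte
    # maps to the Unicode code point with the same numeric value.
    decoded = bytes(range(256)).decode("latin-1")
    assert all(ord(ch) == i for i, ch in enumerate(decoded))
    print("ISO-8859-1 bytes 0x00-0xFF == code points U+0000-U+00FF")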


Are the first 128 characters of UTF-8 and ASCII identical?
UTF-8 table
ASCII table
Yes. This was an intentional choice in the design of UTF-8, so that existing 7-bit ASCII text would also be valid UTF-8.
The encoding is also designed intentionally so that 7-bit ASCII values cannot mean anything except their ASCII equivalent. For example, in UTF-16, the Euro symbol (€) is encoded as 0x20 0xAC. But 0x20 is SPACE in ASCII. So if an ASCII-only algorithm tries to space-delimit a string like "€ 10" encoded in UTF-16, it'll corrupt the data.
This can't happen in UTF-8. € is encoded there as 0xE2 0x82 0xAC, none of which are legal 7-bit ASCII values. So an ASCII algorithm that naively splits on the ASCII SPACE (0x20) will still work, even though it doesn't know anything about UTF-8 encoding. (The same is true for any ASCII character like slash, comma, backslash, percent, etc.) UTF-8 is an incredibly clever text encoding.
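A minimal Python sketch of that failure mode, using big-endian UTF-16 so the Euro sign comes out as the 0x20 0xAC byte pair described above:

    text = "€ 10"
    utf16 = text.encode("utf-16-be")  # bytes: 20 ac 00 20 00 31 00 30
    utf8 = text.encode("utf-8")       # bytes: e2 82 ac 20 31 30

    # A naive ASCII-style split on the space byte 0x20:
    print(utf16.split(b" "))  # [b'', b'\xac\x00', b'\x001\x000'] -- corrupted
    print(utf8.split(b" "))   # [b'\xe2\x82\xac', b'10'] -- Euro sign intact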

Encoding and character set for ISO-8859-1

I have read Joel's article about encodings. As I understand it, in the case of Unicode:
Unicode is a character set: a mapping between integer values and characters
UTF-8 is an encoding, used to represent those Unicode integers in binary form
What about ISO-8859-1? Is it an encoding, a character set, or both?
ISO 8859-1 (Latin-1) is a single-byte encoding. It represents the first 256 Unicode characters. So, since it is a subset of the Unicode character set, I suppose it could be treated as both an encoding and a character set.
What about ISO-8859-1? Is it an encoding, a character set, or both?
Historically, it was described as a coded character set: it defined both a set of characters, and a mapping of those characters to byte values — what we would today call an encoding, but it was not explicitly described in those terms.
When Unicode was created, it was designed to encompass (nearly) all characters in widely-used character sets, and hence it recast the byte stream defined by the ISO-8859-1 coded character set as an encoding of the wider Universal Character Set.
So if you are working in a modern Unicode environment you would consider ISO-8859-1 to be an encoding. But it can't really be said to be wrong to consider it also a character set.
(There are other encodings which are definitely not character sets: for example the UTFs, and multibyte encodings like Shift-JIS, which was itself defined as an encoding for the JIS X 0208 character set prior to Unicode's extend-and-embrace.)
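A short Python sketch of the distinction: the character-set question is "which number is é?", while the encoding question is "which bytes represent that number?":

    # Character set: 'é' is the abstract character at code point U+00E9.
    print(hex(ord("é")))             # 0xe9

    # Encoding: how that code point becomes bytes depends on the encoding.
    print("é".encode("iso-8859-1"))  # b'\xe9'     -- one byte, equal to the code point
    print("é".encode("utf-8"))       # b'\xc3\xa9' -- two bytes, same character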

Are Unicode and ASCII characters the same?

What exactly are Unicode character codes? And how are they different from ASCII characters?
Unicode is a way to assign unique numbers (called code points) to characters from nearly all languages in active use today, plus many other characters such as mathematical symbols. There are many ways to encode Unicode strings as bytes, such as UTF-8 and UTF-16.
ASCII assigns values only to 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters).
For every character that has an ASCII value, the Unicode code point and the ASCII value of that character are the same.
In most modern applications you should prefer to use Unicode strings rather than ASCII. This will for example allow you to have users with accented characters in their name or address, and to localize your interface to languages other than English.
The first 128 Unicode code points are the same as ASCII; beyond those, Unicode defines well over 100,000 more.
There are two common encodings for Unicode: UTF-8, which uses 1-4 bytes per code point (so for the first 128 characters, UTF-8 is byte-for-byte identical to ASCII), and UTF-16, which uses 2 or 4 bytes.
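A small Python sketch of those two encodings side by side (the code point is encoding-independent; only the bytes change):

    print(ord("A"))                 # 65 -- same value as in ASCII
    print("A".encode("utf-8"))      # b'A'     (1 byte, identical to ASCII)
    print("A".encode("utf-16-be"))  # b'\x00A' (2 bytes)
    print("€".encode("utf-8"))      # 3 bytes: e2 82 ac
    print("€".encode("utf-16-be"))  # 2 bytes: 20 ac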

ISO-8859-1 (True / False)

Is this a true or false statement?
Unicode is a superset of ISO-8859-1 such that the first 256 Unicode characters correspond to ISO-8859-1.
The ISO-8859-1 specification consists of only 256 codes; there is nothing beyond those 256 codes.
True. The encoding uses only eight bits for each character, so there are only 256 possible characters.
UTF-8 is a superset that has, for its first 256 codes, the same encoding as ISO-8859-1.
Not exactly correct, but essentially true.
The ISO-8859-1 character set is the same as the first 256 characters in the Unicode character set. The UTF-8 encoding is used to encode Unicode characters. As UTF-8 is a multi-byte encoding, it uses some codes in the 0-255 range as the start of multi-byte codes. This means that you can't safely decode ISO-8859-1 as UTF-8 or vice versa.
Ref: en.wikipedia.org/wiki/ISO/IEC_8859-1
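A brief Python sketch of why the two cannot be decoded interchangeably:

    data = "café".encode("iso-8859-1")  # b'caf\xe9'

    # 0xE9 alone is not valid UTF-8: there it would have to start a
    # multi-byte sequence, but no continuation byte follows.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        print("not valid UTF-8:", e)

    # The reverse direction fails silently, producing mojibake instead:
    print("café".encode("utf-8").decode("iso-8859-1"))  # 'cafÃ©'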
The first paragraph of the Wikipedia page answers this: "[ISO 8859-1] defines the first 256 code point assignments in Unicode[.]"
256 characters : http://htmlhelp.com/reference/charset/
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
I vote TRUE

Are ASCII characters always encoded the same way in all character encodings?

In ASCII, the character < is encoded as the single byte 0x3C. What I'd like to know is: is there a character set where < is encoded differently? I tried UTF-8; it's the same. I tried GB2312; it's the same...
Another question: are all ASCII characters the same in all character sets?
The 128 characters of ASCII are the same in all ASCII-derived character sets. They are not the same in non-ASCII-based character sets (such as EBCDIC).
Characters with codes > 127 differ depending on the code page and/or the encoding.
No, there are national variants of ISO 646 which differ quite a lot from ASCII, typically replacing punctuation such as #, $, [, ], {, |, } with local characters.
In UTF-16 'abc' is encoded as '0 97 0 98 0 99' (big-endian), which is very similar to ASCII, but if you try to interpret it as ASCII you will end up with an extra NUL character before (or after, depending on endianness) each character. Not a huge difference, but enough to make the two uninterchangeable.
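A Python sketch pulling the thread's examples together (cp037 is one of Python's EBCDIC codecs):

    for name in ["ascii", "utf-8", "gb2312", "iso-8859-1", "cp037", "utf-16-be"]:
        print(f"{name:10} ->", "<".encode(name).hex(" "))

    # ascii      -> 3c
    # utf-8      -> 3c
    # gb2312     -> 3c
    # iso-8859-1 -> 3c
    # cp037      -> 4c     (EBCDIC: a genuinely different byte)
    # utf-16-be  -> 00 3c  (the ASCII value, padded to two bytes)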