Is this a true or false statement?
Unicode is a superset of ISO-8859-1 such that the first 256 Unicode characters correspond to ISO-8859-1.
The ISO-8859-1 encoding consists of only 256 codes. Meaning, there is nothing more than those 256 codes.
True. The encoding uses only eight bits for each character, so there are only 256 possible characters.
UTF-8 is a superset whose first 256 codes are the same as those of ISO-8859-1.
Not exactly correct, but essentially true.
The ISO-8859-1 character set is the same as the first 256 characters in the Unicode character set. The UTF-8 encoding is used to encode Unicode characters. As UTF-8 is a multi-byte encoding, it uses some codes in the 0-255 range as the start of multi-byte codes. This means that you can't safely decode ISO-8859-1 as UTF-8 or vice versa.
Ref: en.wikipedia.org/wiki/ISO/IEC_8859-1
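To make the mismatch concrete, here is a minimal C++ sketch (my own illustration, not part of the original answer) that prints the byte sequences for U+00E9 (é) in both encodings; the lone ISO-8859-1 byte 0xE9 would start a three-byte sequence in UTF-8, so a UTF-8 decoder rejects it.

#include <cstdio>

int main() {
    // U+00E9 (e with acute) in the two encodings:
    unsigned char latin1[] = { 0xE9 };        // ISO-8859-1: one byte, equal to the code point
    unsigned char utf8[]   = { 0xC3, 0xA9 };  // UTF-8: two bytes, 110xxxxx 10xxxxxx
    // A UTF-8 decoder that sees the bare byte 0xE9 (1110'1001) expects two
    // continuation bytes to follow it, so the ISO-8859-1 data is rejected.
    std::printf("ISO-8859-1: %02X\n", latin1[0]);
    std::printf("UTF-8:      %02X %02X\n", utf8[0], utf8[1]);
    return 0;
}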
The first paragraph of the Wikipedia page answers this: "[ISO 8859-1] defines the first 256 code point assignments in Unicode[.]"
256 characters: http://htmlhelp.com/reference/charset/
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
I vote TRUE
I am trying to convert UTF-16 to UTF-8. For the string 0xdcf0, the conversion failed with an invalid multi-byte sequence error. I don't understand why the conversion fails. In the library I am using to do the UTF-16 to UTF-8 conversion, there is a check:
if ((first_byte & 0xfc) == 0xdc) {
    return -1;
}
Can you please help me understand why this check is present?
Unicode characters in the DC00–DFFF range are "low" surrogates, i.e. are used in UTF-16 as the second part of a surrogate pair, the first part being a "high" surrogate character in the range D800–DBFF.
See e.g. Wikipedia article UTF-16 for more information.
The reason you cannot convert to UTF-8 is that you only have half a Unicode code point.
In UTF-16, the two-byte sequence
DCF0
cannot begin the encoding of any character at all.
The way UTF-16 works is that some characters are encoded in 2 bytes and some characters are encoded in 4 bytes. The characters that are encoded with two bytes use 16-bit sequences in the ranges:
0000 .. D7FF
E000 .. FFFF
All other characters require four bytes to be encoded in UTF-16. For these characters the first pair of bytes must be in the range
D800 .. DBFF
and the second pair of bytes must be in the range
DC00 .. DFFF
This is how the encoding scheme is defined. See the Wikipedia page for UTF-16.
Notice that the FIRST sixteen bits of an encoding of a character can NEVER be in DC00 through DFFF. It is simply not allowed in UTF-16. This is (if you follow the bitwise arithmetic in the code you found) exactly what is being checked for.
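Here is a small C++ sketch of the same test on 16-bit code units (the helper names are my own, not from the asker's library); it classifies a leading unit and combines a valid surrogate pair into a code point.

#include <cstdint>

// Returns 0 for a complete BMP character, 1 for a high (leading) surrogate,
// and -1 for a low (trailing) surrogate such as 0xDCF0, which can never
// start a character.
int classify_lead_unit(uint16_t unit) {
    if (unit >= 0xD800 && unit <= 0xDBFF) return 1;   // must be followed by DC00..DFFF
    if (unit >= 0xDC00 && unit <= 0xDFFF) return -1;  // invalid as the first unit
    return 0;                                         // 0000..D7FF or E000..FFFF
}

// Combines a valid surrogate pair into a code point in U+10000..U+10FFFF.
uint32_t combine_surrogates(uint16_t high, uint16_t low) {
    return 0x10000u + ((uint32_t)(high - 0xD800) << 10) + (uint32_t)(low - 0xDC00);
}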
Are the first 128 characters of UTF-8 and ASCII identical?
UTF-8 table
ASCII table
Yes. This was an intentional choice in the design of UTF-8 so that existing 7-bit ASCII would be compatible.
The encoding is also designed intentionally so that 7-bit ASCII values cannot mean anything except their ASCII equivalent. For example, in UTF-16, the Euro symbol (€) is encoded as 0x20 0xAC. But 0x20 is SPACE in ASCII. So if an ASCII-only algorithm tries to space-delimit a string like "€ 10" encoded in UTF-16, it'll corrupt the data.
This can't happen in UTF-8. € is encoded there as 0xE2 0x82 0xAC, none of which are legal 7-bit ASCII values. So an ASCII algorithm that naively splits on the ASCII SPACE (0x20) will still work, even though it doesn't know anything about UTF-8 encoding. (The same is true for any ASCII character like slash, comma, backslash, percent, etc.) UTF-8 is an incredibly clever text encoding.
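Here is a short C++ sketch (my own illustration) of that point: a naive split on the ASCII space byte still finds the right boundary in the UTF-8 encoding of "€ 10", because none of the three € bytes is below 0x80.

#include <cstdio>
#include <string>

int main() {
    std::string s = "\xE2\x82\xAC 10";  // the UTF-8 bytes of "€ 10"
    std::size_t space = s.find(' ');    // naive ASCII-only search for 0x20
    // Prints "3" and "10": the space is found only between the two tokens,
    // never inside the multi-byte sequence E2 82 AC.
    std::printf("first token: %zu bytes, second token: \"%s\"\n",
                space, s.c_str() + space + 1);
    return 0;
}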
In UTF-8, code points above 127 are encoded with multiple bytes. For example, the character U+041F (binary 100'0001'1111) is encoded as:
1101'0000 1001'1111
^^^       ^^
The marked bits identify the leading and trailing bytes; the other bits are the actual bits of the code point.
But can we encode code point 1 as
1100'0000 1000'0001
Of course, it is redundant, but is it legal in UTF-8?
Overlong UTF-8 sequences are not considered valid UTF-8 representations of a code point. A UTF-8 decoder must reject overlong sequences.
Wikipedia citation: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
Original RFC 2279 specification: https://www.ietf.org/rfc/rfc2279.txt
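A minimal C++ decoder sketch (my own, not taken from either reference) shows where the rejection happens for two-byte sequences: the overlong form 0xC0 0x81 for code point 1 decodes to a value below 0x80 and is refused, while the 0xD0 0x9F example above decodes to U+041F.

#include <cstdint>

// Decodes a two-byte UTF-8 sequence; returns the code point, or -1 if the
// marker bits are wrong or the sequence is overlong.
int32_t decode_two_bytes(uint8_t b0, uint8_t b1) {
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return -1;                               // not 110xxxxx 10xxxxxx
    int32_t cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    if (cp < 0x80)
        return -1;                               // overlong: fits in one byte
    return cp;                                   // e.g. 0xD0 0x9F -> 0x041F
}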
Is there a single-byte charset (e.g. ISO-8859-x) that matches the first 256 Unicode characters (i.e. characters \u0000-\u00FF) exactly or almost exactly?
ISO-8859-1 matches the first 256 Unicode code points exactly, by design.
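Because of that one-to-one mapping, converting ISO-8859-1 text to Unicode needs no lookup table at all. Here is a small C++ sketch (the function name is my own) that re-encodes Latin-1 bytes as UTF-8:

#include <string>

// Every ISO-8859-1 byte value is also its Unicode code point, so ASCII bytes
// pass through unchanged and 0x80..0xFF become the two-byte UTF-8 sequence
// for the same code point.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += (char)c;
        } else {
            out += (char)(0xC0 | (c >> 6));
            out += (char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}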
I have a string in Unicode, "hao123--我的上网主页", while the UTF-8 version in a C++ string is "hao123锛嶏紞鎴戠殑涓婄綉涓婚〉", but I should write it to a file in this format: "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". How can I do it? I know little about this encoding. Can anyone help? Thanks!
You seem to be mixing up UTF-8 and UTF-16 (or possibly UCS-2). UTF-8 encoded characters have a variable length of 1 to 4 bytes. In contrast, you seem to want to write UTF-16 or UCS-2 code units to your file (I am guessing this from the \uxxxx character references in your file output string).
For an overview of these character sets, have a look at Wikipedia's article on UTF-8 and browse from there.
Here's some of the very basic basics (heavily simplified):
UCS-2 stores all characters as exactly 16 bits. It therefore cannot encode all Unicode characters, only the so-called "Basic Multilingual Plane".
UTF-16 stores the most frequently-used characters in 16 bits, but some characters must be encoded in 32 bits.
UTF-8 encodes characters with a variable length of 1 to 4 bytes. Only characters from the original 7-bit ASCII charset are encoded as 1 byte.
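If the goal is really the \uXXXX text format from the question, one approach (a sketch of my own, assuming the input bytes are valid UTF-8 and every character is in the Basic Multilingual Plane, as they are in the example string) is to decode each code point and print non-ASCII ones as escapes:

#include <cstdio>
#include <string>

std::string utf8_to_escapes(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ) {
        unsigned char b = utf8[i];
        unsigned cp;
        std::size_t len;
        if      (b < 0x80)           { cp = b;        len = 1; }  // ASCII
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else                         { cp = b & 0x07; len = 4; }
        for (std::size_t j = 1; j < len; ++j)                     // fold in continuation bits
            cp = (cp << 6) | ((unsigned char)utf8[i + j] & 0x3F);
        if (cp < 0x80) {
            out += (char)cp;                                      // keep ASCII as-is
        } else {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\u%04X", cp);        // BMP assumption: 4 hex digits
            out += buf;
        }
        i += len;
    }
    return out;
}

Run on the UTF-8 bytes of the example, this produces exactly the target line from the question, "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875", which can then be written to the file with ordinary narrow-string output.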