How does the Chinese GB18030 code set differ from Unicode?
What special techniques are required for handling GB18030?
Are there any (open source) libraries for handling GB18030?
As per the Wikipedia article on GB18030, "GB18030 can be be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set." That is, all Unicode characters can be encoded in GB18030, but they will be encoded with different byte sequences than would be generated with UTF-8 or UTF-16. Handling the GB18030 encoding doesn't require any more special techniques than are required for any other non-Unicode encoding.
The ICU project is an open source library (for C or Java) that has full support for many different encodings, including GB18030. Information on converting between different encodings with ICU can be found here.
What special techniques are required for handling GB18030?
The biggest thing to be aware of is that, unlike UTF-8, GB18030 allows ASCII bytes to occur within the encoding of a multi-byte character. (For example, 'ß' is encoded as the bytes 81 30 89 38, which contains the ASCII encoding of '0' and '8'.) This means that you can't use a simple byte-oriented find/index function.
Related
I know of the following Unicode encodings:
UTF-7
UTF-8
UTF-16
UTF-32
UCS-2
Are there more Unicode encodings? And are all of the Unicode encodings still in use, or are some of them now obsolete?
There is one Unicode (really there are different versions).
You can define any kind of encoding, this do not matter much.
There are UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE as official encoding form. Also officially, so in Unicode standard, you have description of UTF-8, UTF-16, and UTF-32.
UCS2 was the old unicode encoding (equal to UTF-16, but with support only to code < 65536), so now it is obsolete (replaced by UTF16, which is capable to code all (also the newer) unicode code points). UTF-7 is also obsolete.
There are also April fools UTF-9 and UTF-18.
Some application have the UTF8-sig (which is UTF-8 with initial BOM).
On mail, probably you will use UTF8 + BASE64, or some other double encoding.
Mysql uses UTF8MB3 and UTF8MB4, so it specifies UFT-8 but also how many bytes to reserve (3 or 4) per SQL CHAR.
Python3 uses (internally, you probably never see it) a mixed encoding: UTF-8, UTF-16, or UTF-32 according to the larger code in the entire string (and the "encoding" is saved together with string length, outside the "true string"). So this is also a sort of encoding.
We have 21 bits to describe any unicode code point. Then we are free to choose any encodings (in a manner that we can get back to the original code point). UTF-8, UTF-16 and UTF-32 are just the most common (and described in Unicode standard).
I want to know what encoding is used for Khmer Language (Official language of Cambodia). Is it UTF-8 or UTF-16.
A language doesn't have an encoding. What it might have is a set of characters that are commonly used when writing in that language. Some encodings only support subsets of Unicode, but both UTF-8 and UTF-16 can encode all Unicode characters, so either will work (i.e. will allow you to represent anything in Unicode) if you have the ability to choose which to use. (If you want characters which aren't in Unicode, neither UTF-8 nor UTF-16 is going to help you.)
I would typically use UTF-8 as a default - pretty much every platform supports UTF-8, whereas UTF-16 is slightly less well supported. On the other hand, you may find that there are lots of characters used by Khmer which take two bytes to encode in UTF-16 but three bytes in UTF-8, making UTF-8 take a bit more space. You should investigate this by encoding sample documents in both encodings if space is important to you.
Was reading Joel Spolsky's 'The Absolute Minimum' about character encoding.
It is my understanding that ASCII is a Code-point + Encoding scheme, and in modern times, we use Unicode as the Code-point scheme and UTF-8 as the Encoding scheme. Is this correct?
In modern times, ASCII is now a subset of UTF-8, not its own scheme. UTF-8 is backwards compatible with ASCII.
Yes, except that UTF-8 is an encoding scheme. Other encoding schemes include UTF-16 (with two different byte orders) and UTF-32. (For some confusion, a UTF-16 scheme is called “Unicode” in Microsoft software.)
And, to be exact, the American National Standard that defines ASCII specifies a collection of characters and their coding as 7-bit quantities, without specifying a particular transfer encoding in terms of bytes. In the past, it was used in different ways, e.g. so that five ASCII characters were packed into one 36-bit storage unit or so that 8-bit bytes used the extra bytes for checking purposes (parity bit) or for transfer control. But nowadays ASCII is used so that one ASCII character is encoded as one 8-bit byte with the first bit set to zero. This is the de facto standard encoding scheme and implied in a large number of specifications, but strictly speaking not part of the ASCII standard.
Unicode and ASCII are both Codepoints + Encoding scheme
Unicode(UTF-8) is a superset of ASCII as its backward compatible with ASCII.
Conversion and Representation(in binary/hexadecimal) of String:
String := sequence of Graphemes(character is a "kind of" its subset).
Sequence of graphemes(characters) is converted into Codepoints (also using Encoding scheme)
Codepoints are Encoded(converted) to binary/hex also using Encoding Schemes
for Graphemes its UTF-8/UTF-32(aka Unicodes), for Character its ASCII.
Unicode(UTF-8) supports 1,112,064 valid character codepoints(covers most of the graphemes from different languages)
ASCII supports 128 character codepoints(mostly english)
This question already has answers here:
What is the difference between UTF-8 and Unicode?
(18 answers)
Closed 6 years ago.
Consider:
Is it true that unicode=utf16?
Many are saying Unicode is a standard, not an encoding, but most editors support save as Unicode encoding actually.
As Rasmus states in his article "The difference between UTF-8 and Unicode?":
If asked the question, "What is the difference between UTF-8 and
Unicode?", would you confidently reply with a short and precise
answer? In these days of internationalization all developers should be
able to do that. I suspect many of us do not understand these concepts
as well as we should. If you feel you belong to this group, you should
read this ultra short introduction to character sets and encodings.
Actually, comparing UTF-8 and Unicode is like comparing apples and
oranges:
UTF-8 is an encoding - Unicode is a character
set
A character set is a list of characters with unique numbers (these
numbers are sometimes referred to as "code points"). For example, in
the Unicode character set, the number for A is 41.
An encoding on the other hand, is an algorithm that translates a
list of numbers to binary so it can be stored on disk. For example
UTF-8 would translate the number sequence 1, 2, 3, 4 like this:
00000001 00000010 00000011 00000100
Our data is now translated into binary and can now be saved to
disk.
All together now
Say an application reads the following from the disk:
1101000 1100101 1101100 1101100 1101111
The app knows this data represent a Unicode string encoded with
UTF-8 and must show this as text to the user. First step, is to
convert the binary data to numbers. The app uses the UTF-8 algorithm
to decode the data. In this case, the decoder returns this:
104 101 108 108 111
Since the app knows this is a Unicode string, it can assume each
number represents a character. We use the Unicode character set to
translate each number to a corresponding character. The resulting
string is "hello".
Conclusion
So when somebody asks you "What is the difference between UTF-8 and
Unicode?", you can now confidently answer short and precise:
UTF-8 (Unicode Transformation Format) and Unicode cannot be compared. UTF-8 is an encoding
used to translate numbers into binary data. Unicode is a character set
used to translate characters into numbers.
most editors support save as ‘Unicode’ encoding actually.
This is an unfortunate misnaming perpetrated by Windows.
Because Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (the system codepage on the current machine, subject to total unportability) and there are Unicode strings (stored internally as UTF-16LE).
This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.
This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.
(Other editors that do encodings themselves, like Notepad++, don't have this problem.)
If it makes you feel any better about it, ‘ANSI’ strings aren't based on any ANSI standard, either.
It's not that simple.
UTF-16 is a 16-bit, variable-width encoding. Simply calling something "Unicode" is ambiguous, since "Unicode" refers to an entire set of standards for character encoding. Unicode is not an encoding!
http://en.wikipedia.org/wiki/Unicode#Unicode_Transformation_Format_and_Universal_Character_Set
and of course, the obligatory Joel On Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) link.
There's a lot of misunderstanding being displayed here. Unicode isn't an encoding, but the Unicode standard is devoted primarily to encoding anyway.
ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin Capital Letter A" or "Greek small letter alpha") and a set of code points (a number assigned to each -- for example, 61 hexadecimal and 3B1 hexadecimal for those two respectively; for Unicode code points, the standard notation would be U+0061 and U+03B1).
At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. That was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique to try to keep the number of necessary characters to a minimum (Han Unification -- basically treating Chinese, Japanese and Korean characters that were quite a bit alike as being the same character).
Since then, the Unicode consortium has tacitly admitted that that wasn't going to work, and now concentrate primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (aka UTF-32). Those (except for UTF-8) also have LE (little endian) and BE (big-endian) variants.
By itself, "Unicode" could refer to almost any of the above (though we can probably eliminate the others that it shows explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens the most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it's identical for characters in the "basic multilingual plane", which covers a lot, including all the characters for most Western European languages).
UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is not more Unicode than the other.
Don't let an unfortunate historical artifact from Microsoft confuse you.
The development of Unicode was aimed
at creating a new standard for mapping
the characters in a great majority of
languages that are being used today,
along with other characters that are
not that essential but might be
necessary for creating the text. UTF-8
is only one of the many ways that you
can encode the files because there are
many ways you can encode the
characters inside a file into Unicode.
Source:
http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/
In addition to Trufa's comment, Unicode explicitly isn't UTF-16. When they were first looking into Unicode, it was speculated that a 16-bit integer might be enough to store any code, but in practice that turned out not to be the case. However, UTF-16 is another valid encoding of Unicode - alongside the 8-bit and 32-bit variants - and I believe is the encoding that Microsoft use in memory at runtime on the NT-derived operating systems.
Let's start from keeping in mind that data is stored as bytes; Unicode is a character set where characters are mapped to code points (unique integers), and we need something to translate these code points data into bytes. That's where UTF-8 comes in so called encoding – simple!
It's weird. Unicode is a standard, not an encoding. As it is possible to specify the endianness I guess it's effectively UTF-16 or maybe 32.
Where does this menu provide from?
It seems the most confusing issue to me.
How is the beginning of a new character recognized?
How are the codepoints allocated?
Let's take Chinese character for example.
What range of codepoints are allocated to them,
and why is it thus allocated,any reason?
EDIT:
Plz describe it in your own words,not by citation.
Or could you recommend a book that talks about Unicode systematically,which you think have made it clear(it's the most important).
The Unicode Consortium is responsible for the codepoint allocation. If you have want a new character or a code page allocated, you can apply there. See the proposal pipeline for examples.
Chapter 2 of the Unicode specification defines the general structure of Unicode, including what ranges are allocated for what kind of characters.
Take a look here for a general overview of Unicode that might be helpful: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)
Unicode is a standard specified by the Unicode Consortium. The specification defines Unicode’s character set, the Universal Character Set (UCS), and some encodings to encode that characters, the Unicode Transformation Formats UTF-7, UTF-8, UTF-16 and UTF-32.
How is the beginning of a new character recognized?
It depends on the encoding that’s been used. UTF-16 and UTF-32 are encodings with fixed code word lengths (16 and 32 bits respectively) while UTF-7 and UTF-8 have a variable code word length (from 8 bits up to 32 bits) depending on the character point that is to be encoded.
How are the codepoints allocated? Let's take Chinese character for example. What range of codepoints are allocated to them, and why is it thus allocated,any reason?
The UCS is separated into so called character planes. The first one is Basic Latin (U+0000–U+007F, encoded like ASCII), the second is Latin-1 Supplement (U+0080–U+00FF, encoded like ISO 8859-1) and so on.
It is better to say Character Encoding instead of Codepage
A Character Encoding is a way to map some character to some data (and also vice-versa!)
As Wikipedia says:
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers
Most popular character encodings are ASCII,UTF-16 and UTF-8
ASCII
First code-page that widely used in computers. in ANSI just one byte is allocated for each character. So ANSI could have a very limited set of characters (English letters, Numbers,...)
As I said, ASCII used videly in old operating systems like MS-DOS. But ASCII is not dead and still used. When you have a txt file with 10 characters and it is 10 bytes, you have a ASCII file!
UTF-16
In UTF-16, Two bytes is allocated of a character. So we can have 65536 different characters in UTF-16 !
Microsoft Windows uses UTF-16 internally.
UTF-8
UTF-8 is another popular way for encoding characters. it uses variable-length bytes (1byte to 4bytes) for characters. It is also compatible with ASCII because uses 1byte for ASCII characters.
Most Unix based systems uses UTF-8
Programming languages do not depend on code-pages. Maybe a specific implementation of a programming language do not support codepages (like Turbo C++)
You can use any code-page in modern programming languages. They also have some tools for converting the code-pages.
There is different Unicode versions like Utf-7,Utf-8,... You can read about them here (recommanded!) and maybe for more formal details here