Suppose I have been given a hex string: 416f1c7918f83a4f1922d86df5e78348.
How does my program know how to convert this to Unicode?
I don't see how my program would split this number into Unicode characters.
Depends on the encoding. If it's UTF-8, then implement a UTF-8 decoding algorithm. If it's UTF-16, then you need to find out if it's big-endian or little-endian, and then implement a UTF-16 decoding algorithm. If it's Big5, etc.
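As an illustration (a minimal Python sketch of my own, not part of the answer above), the same 16 bytes give different results depending on which encoding you assume, and some assumptions simply fail:

```python
# The given hex is just 16 bytes; the program must be told (or must guess)
# the encoding before it can turn those bytes into characters.
raw = bytes.fromhex("416f1c7918f83a4f1922d86df5e78348")

for encoding in ("utf-8", "utf-16-le", "utf-16-be", "latin-1"):
    try:
        print(encoding, "->", raw.decode(encoding))
    except UnicodeDecodeError as err:
        # Many byte sequences are simply invalid under a given encoding.
        print(encoding, "-> not valid:", err.reason)
```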
Related
UTF-32 has its last bits zeroed.
As I understand it UTF-16 doesn't use all its bits either.
Is there a 16-bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7-bit?
UTF-32 has its last bits zeroed
This might not be correct, depending on how you count. Typically we count from the left, so it is the high (i.e. first) bits of UTF-32 that are zero.
As I understand it UTF-16 doesn't use all its bits either
It's not correct either. UTF-16 uses all of its bits. It's just that the range 0xD800–0xDFFF is reserved for UTF-16 surrogates, so those values will never be assigned any character and will never appear in UTF-32. If you need to encode characters outside the BMP with UTF-16, then those values will be used.
In fact, Unicode was limited to U+10FFFF precisely because of UTF-16, even though UTF-8 and UTF-32 are themselves able to represent up to U+7FFFFFFF and U+FFFFFFFF respectively. The use of surrogate pairs makes it impossible to encode values larger than 0x10FFFF in UTF-16.
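For a quick sanity check, here is the arithmetic behind that limit as a small Python sketch (the ranges used are just the standard surrogate ranges; nothing here is implementation-specific):

```python
# 1024 high surrogates x 1024 low surrogates reach 1,048,576 supplementary
# code points on top of the 65,536 BMP code points.
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1   # 1024 lead (high) surrogates
LOW_SURROGATES  = 0xDFFF - 0xDC00 + 1   # 1024 trail (low) surrogates

bmp = 0x10000                                      # reachable with one code unit
supplementary = HIGH_SURROGATES * LOW_SURROGATES   # reachable with a surrogate pair

print(hex(bmp + supplementary - 1))  # 0x10ffff -- the largest code point UTF-16 can reach
```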
See Why Unicode is restricted to 0x10FFFF?
Is there a 16 bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7 bit?
First, there's no such thing as "a subset of UTF", since UTF isn't a character set but a way to encode Unicode code points.
Prior to the existence of UTF-16, Unicode was a fixed 16-bit character set encoded with UCS-2, so UCS-2 might be the closest you'll get; it encodes only the characters in the BMP. Other fixed 16-bit non-Unicode charsets also have encodings that map all of the bit combinations to some character.
However, why would you want that? UCS-2 was deprecated long ago. Some old tools and less experienced programmers still assume that Unicode characters are always 16 bits long, which is incorrect and will break modern text processing.
Also note that not all the code points at or below 0xFFFF are assigned characters, so no 16-bit encoding can map every bit combination to an assigned character.
Further reading
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
What is a "surrogate pair" in Java?
I've seen that Unicode code points that don't fit in 2 bytes, like U+10000, can be written as a pair, like \uD800\uDC00. They seem to start with the nibble D, but that's all I've noticed.
What is that splitting action called and how does it work?
UTF-8 means (in my own words) that the minimum atom of processing is a byte (the code unit is 1 byte long). I don't know if this is how it happened historically, but at least conceptually speaking, the UCS-2 and UCS-4 Unicode encodings come first, and UTF-8/UTF-16 appeared to solve some of the problems of UCS-*.
UCS-2 means that each character uses 2 bytes instead of one; it's a fixed-length encoding. UCS-2 stores the bit string of each code point directly, as you say. The problem is that there are characters whose code points require more than 2 bytes to store. So UCS-2 can only handle a subset of Unicode (the range U+0000 to U+FFFF, of course).
UCS-4 uses 4 bytes for each character instead, which is obviously enough to store the bit string of any Unicode code point (the Unicode range is U+0000 to U+10FFFF).
The problem with UCS-4 is that characters outside the 2-byte range are very, very uncommon, so any text encoded using UCS-4 wastes too much space. Using UCS-2 is therefore a better approach, unless you need characters outside the 2-byte range.
But again, English text, source code files and so on consist mostly of ASCII characters, and UCS-2 has the same problem: it wastes too much space for text that is mostly ASCII (too many useless zero bytes).
That is the problem UTF-8 solves. Characters inside the ASCII range are stored in UTF-8 text as-is, using just the bit string of the code point/ASCII value of each character. So, if a UTF-8 encoded text uses only ASCII characters, it is indistinguishable from the same text in ASCII or Latin-1. Clients without UTF-8 support can handle UTF-8 texts that use only ASCII characters, because they look identical. It's a backward-compatible encoding.
Beyond that (for Unicode characters outside the ASCII range), UTF-8 uses two, three or four bytes per code point, depending on the character.
The bit string of the code point is split across one, two, three or four bytes, using known bit prefixes that indicate how many bytes the code point occupies. If a byte begins with 0, the character is ASCII and uses only 1 byte (the ASCII range is 7 bits long). If it begins with 110, 1110 or 11110, it is the lead byte of a two-, three- or four-byte sequence respectively, and every following byte of that sequence begins with 10.
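To make those prefixes concrete, here is a minimal decoding sketch in Python (my own illustration; it deliberately skips the validation of continuation bytes, overlong forms and surrogate values that a real decoder must perform):

```python
# Rough sketch of the UTF-8 lead-byte rules described above.
def utf8_decode_one(data: bytes):
    """Return (code point, number of bytes consumed) for the first character."""
    b0 = data[0]
    if b0 < 0x80:                      # 0xxxxxxx -> 1 byte (ASCII)
        return b0, 1
    if b0 >> 5 == 0b110:               # 110xxxxx 10xxxxxx -> 2 bytes
        return ((b0 & 0x1F) << 6) | (data[1] & 0x3F), 2
    if b0 >> 4 == 0b1110:              # 1110xxxx 10xxxxxx 10xxxxxx -> 3 bytes
        return ((b0 & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F), 3
    if b0 >> 3 == 0b11110:             # 11110xxx plus three 10xxxxxx bytes -> 4 bytes
        cp = ((b0 & 0x07) << 18) | ((data[1] & 0x3F) << 12) \
             | ((data[2] & 0x3F) << 6) | (data[3] & 0x3F)
        return cp, 4
    raise ValueError("invalid UTF-8 lead byte")

print(hex(utf8_decode_one("€".encode("utf-8"))[0]))  # 0x20ac (U+20AC, 3 bytes)
```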
The problem with UTF-8 is that it requires more processing (you must examine the first bits of each lead byte to know the character's length), especially if the text is not English-like. For example, a text written in Greek will use mostly two-byte characters.
UTF-16 uses two-byte code units to solve that problem for non-ASCII texts. That means the atoms of processing are 16-bit words. If a character doesn't fit in a single two-byte code unit, then it uses 2 code units (four bytes) to encode the character. That pair of code units is called a surrogate pair. I think a UTF-16 text using only characters inside the 2-byte range is byte-for-byte identical to the same text in UCS-2.
UTF-32, in turn, uses 4-byte code units, as UCS-4 does. I don't know the differences between them, though.
The complete picture, filling in your points of confusion, is laid out below:
Referencing what I learned from the comments...
U+10000 is a Unicode code point (hexadecimal integer mapped to a character).
Unicode is a one-to-one mapping of code points to characters.
The inclusive range of code points from 0xD800 to 0xDFFF is reserved for UTF-16[1] (Unicode vs UTF) surrogate units (see below).
\uD800\uDC00[2] are two such surrogate units, called a surrogate pair. (A surrogate unit is a code unit that's part of a surrogate pair.)
Abstract representation: Code point (abstract character) --> Code unit (abstract UTF-16) --> Code unit (UTF-16 encoded bytes) --> Interpreted UTF-16
Actual usage example: Input data is bytes and may be wrapped in a second encoding, like ASCII for HTML entities and unicode escapes, or anything the parser handles --> Encoding interpreted; mapped to code point via scheme --> Font glyph --> Character on screen
How surrogate pairs work
Surrogate pair advantages:
There are only high and low units. A high must be followed by a low, so high and low units cannot be confused with each other.
UTF-16 can use 2 bytes for the 63,488 non-surrogate BMP code points, because surrogate values cannot be mistaken for ordinary code units.
The 2048 reserved code points split into 1024 high and 1024 low units, and (2048/2)**2 pairs yield a range of 1,048,576 additional code points.
The extra processing only happens for the less frequently used characters (those outside the BMP).
[1] UTF-16 is the only UTF which uses surrogate pairs.
[2] This is formatted as a Unicode escape sequence.
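The splitting itself is plain arithmetic on the code point. A minimal Python sketch of it (my own; the constants are the standard surrogate bases, nothing invented):

```python
# UTF-16 "splitting": supplementary code points become a high/low surrogate pair.
def to_surrogate_pair(cp: int):
    assert 0x10000 <= cp <= 0x10FFFF           # only supplementary code points are split
    v = cp - 0x10000                           # 20 bits remain
    high = 0xD800 + (v >> 10)                  # top 10 bits    -> high (lead) surrogate
    low  = 0xDC00 + (v & 0x3FF)                # bottom 10 bits -> low (trail) surrogate
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10000 from the question becomes exactly \uD800\uDC00:
print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print(hex(from_surrogate_pair(0xD800, 0xDC00)))       # 0x10000
```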
(Graphics describing character encoding omitted.)
Keep reading:
How does UTF-8 "variable-width encoding" work?
Unicode, UTF, ASCII, ANSI format differences
Code point
ASCII uses an 8-bit system. Each character is assigned a unique ASCII value. But Unicode uses a 32- or 64-bit representation. So how are characters assigned values there? Does C/C++ use Unicode?
From this
To convert ASCII to Unicode, take all one byte ASCII codes, and zero-extend them to 16 bits. That should be the Unicode version of the ASCII characters.
For Unicode in C/C++, look into this.
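A tiny Python sketch of that zero-extension (my own illustration; encoding ASCII text as big-endian UTF-16 amounts to the same thing):

```python
# "Zero-extend each one-byte ASCII code to 16 bits."
ascii_bytes = b"Hi"
code_units = [int(b) for b in ascii_bytes]                    # 0x48, 0x69 -- values unchanged
utf16_be   = b"".join(b.to_bytes(2, "big") for b in ascii_bytes)

print([hex(u) for u in code_units])   # ['0x48', '0x69']
print(utf16_be.hex())                 # 00480069 -- each byte padded with a zero byte
assert utf16_be == "Hi".encode("utf-16-be")
```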
Unicode first and foremost defines characters by a code point. This is simply a giant table which specifies that the letter "A" (LATIN CAPITAL LETTER A) has the code point U+0041, "ท" (THAI CHARACTER THO THAHAN) has the code point U+0E17 and so on and so forth.
There are then several Unicode encodings which encode these code points into physical bits. UCS-2 was an early encoding which is now superseded by UTF-16. UTF-32 exists as well, but UTF-8 has become the de facto standard Unicode encoding. Each encoding works differently and has different pros and cons; read their specifications in detail if you are interested. The most obvious difference is that UTF-8 uses a minimum of 8 bits per character, UTF-16 a minimum of 16 bits and UTF-32 always 32 bits.
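A short Python sketch (my own) showing those two example code points and how each encoding lays them out in bytes:

```python
# "A" is U+0041 and "ท" is U+0E17; the code point is the same in every encoding,
# only the byte layout differs.
for ch in ("A", "ท"):
    print(f"U+{ord(ch):04X}",
          ch.encode("utf-8").hex(" "),
          ch.encode("utf-16-be").hex(" "),
          ch.encode("utf-32-be").hex(" "))
# U+0041  41          00 41   00 00 00 41
# U+0E17  e0 b8 97    0e 17   00 00 0e 17
```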
as far I know, the UNICODE is the industry standard for character mapping.
What I don't get is that why it has to be encoded via UTF-8 and not directly as Unicode?
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value, and must be stored as octal 0061?
do i make any sense?
Who says it must be encoded as UTF-8? There are several common encodings for Unicode, including UTF-16 (big- or little-endian), and some less common ones such as UTF-7 and UTF-32.
Unicode itself is not an encoding; it's merely a specification of numeric code points for several thousand characters.
The Unicode code point for lowercase a is 0x61 in hexadecimal, 97 in decimal, or 0141 in octal.
If you're suggesting that 'a' should be encoded as the 6-character ASCII string "U+0061", that would be terribly wasteful of space and more difficult to decode than UTF-8.
If you're suggesting storing the numeric values directly, that's what UTF-32 does: it stores each character as a 32-bit (4-octet) number that directly represents the code point. The trouble with that is that it's nearly as wasteful of space as "U+0061" (4 bytes per character vs. 6.)
The UTF-8 encoding has a number of advantages. One is that it's upward compatible with ASCII. Another is that it's reasonably efficient even for non-ASCII characters, as long as most of the encoded text is within the first few thousand code points.
UTF-16 has some other advantages, but I personally prefer UTF-8. MS Windows tends to use UTF-16, but mostly for historical reasons; Windows added Unicode support when there were fewer than 65536 defined code points, which made UTF-16 equivalent to UCS-2, which is a simpler representation.
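A small Python sketch (mine, with arbitrary sample strings) comparing how much space the encodings discussed above actually use:

```python
# Byte counts for the same text under three encodings.
samples = {"ascii": "hello", "thai": "ไทย", "emoji": "🎉"}
for label, text in samples.items():
    sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(label, sizes)
# UTF-8 wins for ASCII-heavy text, UTF-16 wins for BMP text above U+07FF (e.g. Thai),
# and UTF-32 always spends 4 bytes per code point.
```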
UTF-8 is only one 'memory format' of Unicode. There are also UTF-16, UTF-32 and a number of other memory mapping formats.
UTF-8 has been used broadly because it is upwardly compatible with the 7-bit ASCII character code.
You can tell a browser via HTML, MySQL at several levels, and Notepad++ via its encoding option to use other formats for the data they operate on.
DuckDuckGo or Google Unicode and you will find plenty of articles on this on the internet. Here is one: https://ssl.icu-project.org/docs/papers/forms_of_unicode/
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value
Stored data is a sequence of byte values, generally interpreted at the lowest level as numbers. We usually use bytes that can be one of 256 values, so we look at them as numbers in the range 0 to 255.
So when you say 'just stored as a String with "U+0061"' what sequence of numbers in the range 0-255 do you mean?
Unicode code points like U+0061 are written in hexadecimal. Hexadecimal 61 is the number 97 in the more familiar decimal system, so perhaps you think that the letter 'a' should be stored as a single byte with the value 97. You might be surprised to learn that this is exactly how the encoding UTF-8 represents this string.
Of course there are more than 256 characters defined in Unicode, so not all Unicode characters can be stored as bytes with the same value as their Unicode codepoint. UTF-8 has one way of dealing with this, and there are other encodings with different ways.
UTF-32, for example, is an encoding which uses 4 bytes together at a time to represent a codepoint. Since one byte has 256 values four bytes can have 256 × 256 × 256 × 256, or 4,294,967,296 different arrangements. We can number those arrangements of bytes from 0 to 4,294,967,295 and then store every Unicode codepoint as the arrangement of bytes that we've numbered with the number corresponding to the Unicode codepoint value. This is exactly what UTF-32 does.
(However, there are different ways to assign numbers to those arrangements of four bytes and so there are multiple versions of UTF-32, such as UTF-32BE and UTF-32LE. Typically a particular medium of storing or transmitting bytes specifies its own numbering scheme, and the encoding 'UTF-32' without further qualification implies that whatever the medium's native scheme is should be used.)
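A brief Python sketch (my own illustration) of that byte-order difference for the letter 'a':

```python
# The same code point, two UTF-32 byte layouts.
ch = "a"                                   # U+0061
print(ch.encode("utf-32-be").hex(" "))     # 00 00 00 61  (big-endian)
print(ch.encode("utf-32-le").hex(" "))     # 61 00 00 00  (little-endian)
print(ch.encode("utf-32").hex(" "))        # BOM first, then native byte order
```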
Read this article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
do i make any sense?
Not a lot! (Read on ...)
as far I know, the UNICODE (sic) is the industry standard for character mapping.
That is incorrect. Unicode IS NOT a standard for character mapping. It is a standard that defines a set of character codes and what they mean.
It is essentially a catalogue that defines a mapping of codes (Unicode "code points") to conceptual characters, but it is not a standard for mapping characters. It certainly DOES NOT define a standard way to represent the code points; i.e. a mapping to a representation. (That is what character encoding schemes do!)
What I don't get is that why it has to be encoded via UTF-8 and not directly as Unicode?
That is incorrect. Character data DOES NOT have to be encoded in UTF-8. It can be encoded as UTF-8. But it can also be encoded in a number of other ways too:
The Unicode standard specifies a number of encoding schemes, including UTF-8, UTF-16 and UTF-32, and various historical variants.
There are many other standard encoding schemes (probably hundreds of them). This Wikipedia page lists some of the common ones.
The various different encoding schemes have different purposes (and different limitations). For example:
ASCII and LATIN-1 are 7- and 8-bit character sets (respectively) that encode a small subset of Unicode code points. (ASCII encodes roman letters and numbers, some punctuation, and "control codes". LATIN-1 adds a number of accented Latin letters used in Western Europe and some other common "typographical" characters.)
UTF-8 is a variable-length encoding scheme that encodes Unicode code points as 1 to 4 bytes (octets). (It is biased towards western usage ... since it encodes the basic Latin / Roman letters and numbers as single bytes.)
UTF-16 is designed for encoding Unicode code points in 16-bit units. (Java Strings are essentially UTF-16 encoded.)
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value, and must be stored as octal 0061?
In fact, a Java String is represented as a sequence of char values. The char type is a 16-bit unsigned integer type; i.e. it has values 0 through 65535. And the char value that represents a lowercase "a" character is hex 0061 == octal 141 == decimal 97.
You are incorrect about "octal 0061" ... but I can't figure out what distinction you are actually trying to make here, so I can't really comment on that.
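The same numeric facts can be checked in any language; a quick Python sketch of my own (the answer above is about Java's char type, but the values are language-independent):

```python
# Lowercase 'a' is a single code point; hex, octal and decimal are just notations for it.
assert ord("a") == 0x61 == 97 == 0o141
print(hex(ord("a")), oct(ord("a")), ord("a"))   # 0x61 0o141 97
print("a".encode("utf-16-be").hex())            # 0061 -- the 16-bit value a Java char holds
```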
I was reading about Unicode at http://www.joelonsoftware.com/articles/Unicode.html. Joel says UCS-2 encodes all Unicode characters in 2 bytes, whereas UTF-8 may take up to 6 bytes to encode some of the Unicode characters. Would you please explain with an example how a 6-byte UTF-8 encoded Unicode character is encoded in UCS-2?
UCS-2 was created when Unicode had less than 65536 codepoints, so they all fit in 2 bytes max. Once Unicode grew to more than 65536 codepoints, UCS-2 became obsolete and was replaced with UTF-16, which encodes all of the UCS-2 compatible codepoints using 2 bytes and the rest using 4 bytes via surrogate pairs.
UTF-8 was originally designed to encode code points using up to 6 bytes (U+7FFFFFFF max) but was later limited to 4 bytes (U+1FFFFF max, though anything above U+10FFFF is forbidden) so that it is 100% round-trip compatible with UTF-16 and does not encode any code points that UTF-16 does not support. The maximum code point that both UTF-8 and UTF-16 support is U+10FFFF.
So, to answer your question, a code point that would require a 5- or 6-byte UTF-8 sequence (U+200000 to U+7FFFFFFF) cannot be encoded in UCS-2, or even UTF-16. There are not enough bits available to hold such large code point values.
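A minimal Python sketch (my own) of those length rules, including the historical 5- and 6-byte ranges:

```python
# UTF-8 byte count per code point; the boundary values are from the original scheme.
def utf8_length(cp: int) -> int:
    if cp <= 0x7F:       return 1
    if cp <= 0x7FF:      return 2
    if cp <= 0xFFFF:     return 3
    if cp <= 0x1FFFFF:   return 4   # modern UTF-8 stops here (and at U+10FFFF)
    if cp <= 0x3FFFFFF:  return 5   # historical only
    return 6                        # historical only, up to U+7FFFFFFF

for cp in (0x61, 0x10FFFF, 0x200000, 0x7FFFFFFF):
    print(f"U+{cp:X}: {utf8_length(cp)} byte(s)")
# Anything needing 5 or 6 bytes lies above U+10FFFF, so it has no UTF-16 (or UCS-2) form.
```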
UCS-2 stores everything it can in two bytes, and does nothing about the code points that won't fit into that space, which is why UCS-2 is pretty much useless today.
Instead, we have UTF-16, which looks like UCS-2 for all the two-byte sequences but also allows surrogate pairs: pairs of two-byte code units. Using those, the remaining code points can be encoded using a total of 4 bytes.