In The Swift Programming Language (Swift 3.0), in the chapter on strings and characters, the book states:
A Unicode scalar is any Unicode code point in the range U+0000 to
U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do
not include the Unicode surrogate pair code points, which are the code
points in the range U+D800 to U+DFFF inclusive.
What does this mean?
A Unicode scalar is any code point except the surrogate code points, i.e. the values reserved for the two halves of a UTF-16 surrogate pair.
A code point is the number assigned to a character in the Unicode standard. For instance, the code point of the letter A is 0x41 (or 65 in decimal).
A code unit is each group of bits used in the serialisation of a code point. For instance, UTF-16 uses one or two code units of 16 bits each.
The letter A is serialised in UTF-16 with a single code unit: 0x0041. Less common characters (those above U+FFFF) require two UTF-16 code units. This pair of code units is called a surrogate pair, and its two halves are taken from the reserved range U+D800 to U+DFFF. Those reserved values are code points, but they never stand for a character on their own, so they are not Unicode scalars. Thus, a Unicode scalar may also be defined as: any code point except the surrogate code points.
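Here is a short Swift sketch (my own illustration, not from the book) showing the difference: the emoji U+1F600 is a single Unicode scalar, but it takes a surrogate pair of UTF-16 code units:

let text = "A😀"
// Unicode scalars: one value per code point, never a surrogate value.
for scalar in text.unicodeScalars {
    print(String(scalar.value, radix: 16, uppercase: true))   // 41, 1F600
}
// UTF-16 code units: U+1F600 becomes the surrogate pair D83D DE00.
for unit in text.utf16 {
    print(String(unit, radix: 16, uppercase: true))           // 41, D83D, DE00
}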
The answer from courteouselk is correct by the way, this is just a more plain english version.
From Unicode FAQ:
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
Basically, surrogates are code points that are reserved for this special purpose and promised to never encode a character on their own; they only ever appear as the code units of a surrogate pair in the UTF-16 encoding.
[UPD] Also, from Wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case[citation needed] and Windows allows such sequences in filenames.
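As a quick Swift illustration (my own, not from the quoted sources): the standard library's failable Unicode.Scalar initializer refuses to build a scalar from a surrogate value, because a surrogate is a code point but not a scalar value:

print(Unicode.Scalar(0x0041 as UInt32) != nil)   // true  - U+0041 is a scalar value
print(Unicode.Scalar(0xD800 as UInt32) != nil)   // false - a surrogate, not a scalar value
print(Unicode.Scalar(0x1F600 as UInt32) != nil)  // true  - above U+FFFF, still a scalar value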
Related
UTF-32 has its last bits zeroed.
As I understand it UTF-16 doesn't use all its bits either.
Is there a 16-bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7-bit?
UTF-32 has its last bits zeroed
This might not be correct, depending on how you count. Typically we count from the left, so it is the high (i.e. first) bits of UTF-32 that will be zero.
As I understand it UTF-16 doesn't use all its bits either
That's not correct either. UTF-16 uses all of its bits. It's just that the range [0xD800-0xDFFF] is reserved for UTF-16 surrogate pairs, so those values will never be assigned any character and will never appear as code points in valid UTF-32. If you need to encode characters outside the BMP with UTF-16, then those values will be used.
In fact Unicode was limited to U+10FFFF just because of UTF-16: even though UTF-8 and UTF-32 themselves are able to represent up to U+7FFFFFFF and U+FFFFFFFF respectively, the use of surrogate pairs makes it impossible to encode values larger than 0x10FFFF in UTF-16.
See Why Unicode is restricted to 0x10FFFF?
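A quick sketch of that limit (my own illustration): the largest value a surrogate pair can carry works out to exactly 0x10FFFF:

let maxHigh: UInt32 = 0xDBFF - 0xD800          // 0x3FF, the top 10 bits
let maxLow:  UInt32 = 0xDFFF - 0xDC00          // 0x3FF, the bottom 10 bits
let maxCodePoint = 0x10000 + ((maxHigh << 10) | maxLow)
print(String(maxCodePoint, radix: 16))         // 10ffff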
Is there a 16 bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7 bit?
First, there's no such thing as "a subset of UTF", since UTF isn't a character set but a way to encode Unicode code points.
Prior to the existence of UTF-16, Unicode was a fixed 16-bit character set encoded with UCS-2. So UCS-2 might be the closest you'll get; it encodes only the characters in the BMP. Other fixed 16-bit non-Unicode charsets also have encodings that map all of the bit combinations to some character.
However, why would you want that? UCS-2 was deprecated long ago. Some old tools and less experienced programmers still assume that Unicode characters are always 16 bits long, which is incorrect and will break modern text processing.
Also note that not all of the code points below U+FFFF are assigned to characters, so no encoding can map every 16-bit value to an assigned character.
Further reading
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
What is a "surrogate pair" in Java?
I am learning about UTF-16 encoding, and I have read that if you want to represent code points in the range of U+10000 to U+10FFFF, then you have to use surrogate pairs, which are in the range of U+D800 to U+DFFF.
So let's say I want to encode the following code point: U+10123 (10000000100100011 in binary):
First I lay out this sequence of bits:
110110xxxxxxxxxx 110111xxxxxxxxxx
Then I fill the places marked with x with the binary form of the code point:
1101100001000000 1101110100100011 (D840 DD23 in hexadecimal)
I have also read that the code points in the range of U+D800 to U+DFFF were removed from the Unicode character set, but I don't understand why this range was removed!
I mean this range can be easily encoded in 4 bytes, for example the following is the UTF-16 encoded format of the U+D812 code point (1101100000010010 in binary):
1101100000110110 1101110000010010 (D836 DC12 in hexadecimal)
Note: I was using UTF-16 Big Endian in my examples.
Codepoints U+D800 - U+DFFF are reserved exclusively¹ for use with UTF-16. Since they are not in the range of U+10000 - U+10FFFF, UTF-16 would not encode them individually using surrogate pairs, so it would be ambiguous (and illegal²) for these individual codepoints to appear un-encoded in a UTF-16 sequence.
Per the Unicode.org UTF-16 FAQ:
¹ Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
² Q: Are there any 16-bit values that are invalid?
A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.
I don't have an official source to back this up, but I believe it was to prevent confusion, so that you couldn't get a code sequence that could be interpreted as both valid UTF-16 and valid UCS-2. The loss of 2048 code points was nothing compared to the addition of 1048576 new ones.
Since encoding a code point as a surrogate pair starts by subtracting 0x010000, encoding a code point below U+10000 (such as U+D812) this way would lead to a negative number. And the point of this subtraction is to gain 65536 more code points instead of merely re-encoding the 2048 that were left out. Maybe this will prove useful, should the whole code space be used up in a distant future.
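For reference, here is a rough sketch of the standard algorithm with that subtraction (my own code, hypothetical helper name); it is only defined for code points in U+10000 through U+10FFFF, which is exactly why U+D812 cannot be put through it:

// Only defined for code points above the BMP.
func surrogatePair(for codePoint: UInt32) -> (high: UInt16, low: UInt16)? {
    guard (0x10000...0x10FFFF).contains(codePoint) else { return nil }
    let v = codePoint - 0x10000                  // a 20-bit value
    let high = UInt16(0xD800 + (v >> 10))        // leading (high) surrogate
    let low  = UInt16(0xDC00 + (v & 0x3FF))      // trailing (low) surrogate
    return (high, low)
}
// U+10123 comes out as D800 DD23 under this algorithm.
if let (h, l) = surrogatePair(for: 0x10123) {
    print(String(h, radix: 16, uppercase: true), String(l, radix: 16, uppercase: true))
}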
I've seen that Unicode code points that need more than 2 bytes, like U+10000, can be written as a pair, like \uD800\uDC00. They seem to start with the nibble D, but that's all I've noticed.
What is that splitting action called and how does it work?
UTF-8 means (using my own words) that the minimum atom of processing is a byte (the code unit is 1 byte long). I don't know if this is how it happened historically, but at least conceptually speaking, the UCS-2 and UCS-4 Unicode encodings came first, and UTF-8/UTF-16 appeared later to solve some of their problems.
UCS-2 means that each character uses 2 bytes instead of one. It's a fixed-length encoding. UCS-2 stores the bit string of each code point directly, as you say. The problem is that there are characters whose code points require more than 2 bytes to store. So UCS-2 can only handle a subset of Unicode (the range U+0000 to U+FFFF, of course).
UCS-4 uses 4 bytes for each character instead, and it's more than capable of storing the bit string of any Unicode code point (the Unicode range is U+000000 to U+10FFFF).
The problem with UCS-4 is that characters outside the 2-byte range are very, very uncommon, so any text encoded using UCS-4 wastes a lot of space. So using UCS-2 is a better approach, unless you need characters outside the 2-byte range.
But again, English texts, source code files and so on use mostly ASCII characters, and UCS-2 has the same problem: it wastes too much space for texts which use mostly ASCII characters (too many useless zero bytes).
That is the problem UTF-8 solves. Characters inside the ASCII range are saved in UTF-8 texts as-is, using just the bit string of the code point/ASCII value of each character. So if a UTF-8 encoded text uses only ASCII characters, it is indistinguishable from the same text encoded as ASCII or Latin-1. Clients without UTF-8 support can handle such UTF-8 texts because they look identical. It's a backward compatible encoding.
Beyond that (for Unicode characters outside the ASCII range), UTF-8 texts use two, three or four bytes per code point, depending on the character.
Roughly speaking, the bit string is split across two, three or four bytes, using known bit prefixes in the first byte to signal how many bytes the code point takes. If a byte begins with 0, the character is ASCII and uses only 1 byte (the ASCII range is 7 bits long). If it begins with 1, the character is encoded using two, three or four bytes, depending on the bits that follow.
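A small Swift sketch of that prefix scheme (the helper function is just my own illustration):

// The lead byte's prefix tells you the length of the UTF-8 sequence:
// 0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4.
func utf8SequenceLength(leadByte: UInt8) -> Int? {
    switch leadByte {
    case 0x00...0x7F: return 1    // 0xxxxxxx: ASCII
    case 0xC0...0xDF: return 2    // 110xxxxx
    case 0xE0...0xEF: return 3    // 1110xxxx
    case 0xF0...0xF7: return 4    // 11110xxx
    default:          return nil  // 10xxxxxx continuation byte, or invalid
    }
}
for character in "Aλ€😀" {
    let bytes = Array(String(character).utf8)
    print(character, bytes.count, utf8SequenceLength(leadByte: bytes[0]) ?? -1)
    // A 1 1, λ 2 2, € 3 3, 😀 4 4
}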
The problem with UTF-8 is that it requires more processing (you must examine the first bits of each character to know its length), especially if the text is not English-like. For example, a text written in Greek will use mostly two-byte characters.
UTF-16 uses two-byte code units to ease that problem for non-ASCII texts. That means the atoms of processing are 16-bit words. If a character doesn't fit in a single two-byte code unit, it uses 2 code units (four bytes) to encode the character. That pair of code units is called a surrogate pair. A UTF-16 text using only characters inside the 2-byte range is byte-for-byte the same as that text in UCS-2.
UTF-32, in turn, uses 4-byte code units, as UCS-4 does. I don't know the exact differences between them, though.
The complete picture filling in your confusion is formatted below:
Referencing what I learned from the comments...
U+10000 is a Unicode code point (hexadecimal integer mapped to a character).
Unicode is a one-to-one mapping of code points to characters.
The inclusive range of code points from 0xD800 to 0xDFFF is reserved for UTF-16¹ (Unicode vs UTF) surrogate units (see below).
\uD800\uDC00² are two such surrogate units, called a surrogate pair. (A surrogate unit is a code unit that's part of a surrogate pair.)
Abstract representation: Code point (abstract character) --> Code unit (abstract UTF-16) --> Code unit (UTF-16 encoded bytes) --> Interpreted UTF-16
Actual usage example: Input data is bytes and may be wrapped in a second encoding, like ASCII for HTML entities and unicode escapes, or anything the parser handles --> Encoding interpreted; mapped to code point via scheme --> Font glyph --> Character on screen
How surrogate pairs work
Surrogate pair advantages:
There are only high and low units, drawn from distinct ranges, so they cannot be confused with each other. A high must be followed by a low.
UTF-16 can use 2 bytes for the first 63488 code points (the BMP minus the surrogate range), because a lone code unit there can never be mistaken for half of a surrogate pair.
Giving up a range of 2048 code points yields (2048/2)² = 1024 × 1024 = 1,048,576 additional code points.
The extra processing falls only on the less frequently used characters (those outside the BMP); see the decoding sketch after the notes below.
¹ UTF-16 is the only UTF which uses surrogate pairs.
² This is formatted as a Unicode escape sequence.
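A rough sketch (my own helper, hypothetical name) of how a pair such as \uD800\uDC00 is combined back into the code point U+10000:

func codePoint(high: UInt16, low: UInt16) -> UInt32 {
    // high is in D800...DBFF, low is in DC00...DFFF
    let top    = UInt32(high - 0xD800) << 10   // upper 10 bits
    let bottom = UInt32(low  - 0xDC00)         // lower 10 bits
    return 0x10000 + top + bottom
}
print(String(codePoint(high: 0xD800, low: 0xDC00), radix: 16))   // 10000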
Keep reading:
How does UTF-8 "variable-width encoding" work?
Unicode, UTF, ASCII, ANSI format differences
Code point
as far I know, the UNICODE is the industry standard for character mapping.
What I don't get is that why it has to be encoded via UTF-8 and not directly as Unicode?
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value, and must be stored as octal 0061?
do i make any sense?
Who says it must be encoded as UTF-8? There are several common encodings for Unicode, including UTF-16 (big- or little-endian), and some less common ones such as UTF-7 and UTF-32.
Unicode itself is not an encoding; it's merely a specification of numeric code points for several thousand characters.
The Unicode code point for lowercase a is 0x61 in hexadecimal, 97 in decimal, or 0141 in octal.
If you're suggesting that 'a' should be encoded as the 6-character ASCII string "U+0061", that would be terribly wasteful of space and more difficult to decode than UTF-8.
If you're suggesting storing the numeric values directly, that's what UTF-32 does: it stores each character as a 32-bit (4-octet) number that directly represents the code point. The trouble with that is that it's nearly as wasteful of space as "U+0061" (4 bytes per character vs. 6.)
The UTF-8 encoding has a number of advantages. One is that it's upward compatible with ASCII. Another is that it's reasonably efficient even for non-ASCII characters, as long as most of the encoded text is within the first few thousand code points.
UTF-16 has some other advantages, but I personally prefer UTF-8. MS Windows tends to use UTF-16, but mostly for historical reasons; Windows added Unicode support when there were fewer than 65536 defined code points, which made UTF-16 equivalent to UCS-2, which is a simpler representation.
UTF-8 is only one 'memory format' of Unicode. There are also UTF-16, UTF-32 and a number of other memory mapping formats.
UTF-8 has been adopted broadly because it is upwardly compatible with the 7-bit ASCII character code.
You can tell a browser via HTML, MySQL at several levels, and Notepad++ via its encoding option to use other formats for the data they operate on.
DuckDuckGo or Google Unicode and you will find plenty of articles on this on the internet. Here is one: https://ssl.icu-project.org/docs/papers/forms_of_unicode/
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value
Stored data is a sequence of byte values, generally interpreted at the lowest level as numbers. We usually use bytes that can be one of 256 values, so we look at them as numbers in the range 0 to 255.
So when you say 'just stored as a String with "U+0061"', what sequence of numbers in the range 0-255 do you mean?
Unicode code points like U+0061 are written in hexadecimal. Hexadecimal 61 is the number 97 in the more familiar decimal system, so perhaps you think that the letter 'a' should be stored as a single byte with the value 97. You might be surprised to learn that this is exactly how the encoding UTF-8 represents this string.
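A quick Swift check of that claim (my own illustration, standard library only):

print(Array("a".utf8))                  // [97] - a single byte equal to the code point
print("a".unicodeScalars.first!.value)  // 97   - the code point U+0061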
Of course there are more than 256 characters defined in Unicode, so not all Unicode characters can be stored as bytes with the same value as their Unicode codepoint. UTF-8 has one way of dealing with this, and there are other encodings with different ways.
UTF-32, for example, is an encoding which uses 4 bytes together at a time to represent a codepoint. Since one byte has 256 values four bytes can have 256 × 256 × 256 × 256, or 4,294,967,296 different arrangements. We can number those arrangements of bytes from 0 to 4,294,967,295 and then store every Unicode codepoint as the arrangement of bytes that we've numbered with the number corresponding to the Unicode codepoint value. This is exactly what UTF-32 does.
(However, there are different ways to assign numbers to those arrangements of four bytes and so there are multiple versions of UTF-32, such as UTF-32BE and UTF-32LE. Typically a particular medium of storing or transmitting bytes specifies its own numbering scheme, and the encoding 'UTF-32' without further qualification implies that whatever the medium's native scheme is should be used.)
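A short Swift sketch of that (my own illustration; using withUnsafeBytes is just one way to look at the raw bytes): the UTF-32BE bytes of "a" are its code point value padded out to four bytes, while UTF-8 needs only one byte here:

let scalar = "a".unicodeScalars.first!.value                      // 0x61
let utf32be = withUnsafeBytes(of: scalar.bigEndian) { Array($0) }
print(utf32be)                                                    // [0, 0, 0, 97]
print(Array("a".utf8))                                            // [97]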
Read this article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
do i make any sense?
Not a lot! (Read on ...)
as far I know, the UNICODE (sic) is the industry standard for character mapping.
That is incorrect. Unicode IS NOT a standard for character mapping. It is a standard that defines a set of character codes and what they mean.
It is essentially a catalogue that defines a mapping of codes (Unicode "code points") to conceptual characters, but it is not a standard for mapping characters. It certainly DOES NOT define a standard way to represent the code points; i.e. a mapping to a representation. (That is what character encoding schemes do!)
What I don't get is that why it has to be encoded via UTF-8 and not directly as Unicode?
That is incorrect. Character data DOES NOT have to be encoded in UTF-8. It can be encoded as UTF-8. But it can also be encoded in a number of other ways too:
The Unicode standard specifies a number of encoding schemes, including UTF-8, UTF-16 and UTF-32, and various historical variants.
There are many other standard encoding schemes (probably hundreds of them). This Wikipedia page lists some of the common ones.
The various different encoding schemes have different purposes (and different limitations). For example:
ASCII and LATIN-1 are 7 and 8-bit character sets (respectively) that encode a small subset of Unicode code-points. (ASCII encodes roman letters and numbers, some punctuation, and "control codes". LATIN-1 adds a number of accented latin letters used in Western Europe and some other common "typographical" characters.)
UTF-8 is a variable length encoding scheme that encodes Unicode code points as 1 to 4 bytes (octets). (It is biased towards western usage ... since it encodes all latin / roman letters and numbers as single bytes.)
UTF-16 is designed for encoding Unicode code points in 16-bit units. (Java Strings are essentially UTF-16 encoded.)
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value, and must be stored as octal 0061?
In fact, a Java String is represented as a sequence of char values. The char type is a 16-bit unsigned integer type; i.e. it has values 0 through 65535. And the char value that represents a lowercase "a" character is hex 0061 == octal 141 == decimal 97.
You are incorrect about "octal 0061" ... but I can't figure out what distinction you are actually trying to make here, so I can't really comment on that.
Unicode UTF-32 values we can call codepoints, though I suppose even this is wrong, since a single surrogate is itself a codepoint. UTF-8 can be called multi-byte or multi-octet. But what about UTF-16 and UCS-2? They aren't exactly multi-byte, since they deal in 2 bytes, and I think multi-word is more of an MS naming scheme.
What is a more accurate name to describe UTF-32 codepoints that can be made up of bytes, as in UTF-8 and words as in UTF-16?
I believe the term you're looking for is 'code unit'.
Code points are simply integral values that may be assigned a character in a character set.
A code unit is a fixed width integer representation used in sequences to represent encoded text. UTF-8, UTF-16, and UTF-32 are all encodings, and use 8, 16, and 32 bit code units respectively.
UTF-32 is unique among the three in that its code unit values are always exactly the code point values of the represented Unicode data.
'multi-byte' can appropriately be used in reference to UTF-16. (And 'Unicode' can be used in reference to UTF-8; Microsoft's usage of the terminology is misleading on both counts.)
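A quick Swift illustration (my own) of the three code unit sizes for the same text:

let s = "€😀"
print(Array(s.utf8).count)       // 7 - eight-bit code units (3 for €, 4 for 😀)
print(Array(s.utf16).count)      // 3 - sixteen-bit code units (1 for €, a surrogate pair for 😀)
print(s.unicodeScalars.count)    // 2 - one per code point; UTF-32 would use 2 code units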
a single surrogate is itself a codepoint.
Unicode classifies code points in the range [U+D800-U+DFFF] as surrogates. These code points are never used as such, however. They are reserved and cannot be used because UTF-16 cannot represent code points in this range: to represent such a code point directly, UTF-16 would need to use a code unit in the range [0xD800-0xDFFF], but UTF-16 already uses code unit values in that range to represent code points above U+FFFF, and therefore cannot use them to represent code points in the range [U+D800-U+DFFF].