Is utf-8 null the same as utf-16/utf-32 null? - unicode

Does a single zero byte mean null in UTF-16 and UTF-32, as it does in UTF-8, or do we need 2 and 4 zero bytes respectively to represent null in UTF-16 and UTF-32?

In UTF-16 it would be two bytes, and in UTF-32 it would be 4 bytes.
After all, otherwise you couldn't differentiate between a character whose encoded value just happened to start with a zero byte and a single zero byte representing U+0000.
Basically UTF-16 works in blocks of 2 bytes, and UTF-32 works in blocks of 4 bytes. (Admittedly for characters outside the BMP you need two "blocks" of UTF-16, but the principle is still the same.) If you were to implement a UTF-16 decoder, you'd read two bytes at a time.
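A quick way to check this is to encode U+0000 with Python's built-in codecs. This is only a small sketch; the variable names are made up for illustration, and the "-le" codec variants are used so that no byte order mark is prepended.

```python
# Encode U+0000 in each encoding. The little-endian codec names are used
# because plain "utf-16"/"utf-32" would prepend a byte order mark.
NUL = "\u0000"
nul_utf8 = NUL.encode("utf-8")
nul_utf16 = NUL.encode("utf-16-le")
nul_utf32 = NUL.encode("utf-32-le")
print(len(nul_utf8), len(nul_utf16), len(nul_utf32))  # prints: 1 2 4

# 'A' in UTF-16LE contains a zero byte that is NOT a null character,
# which is why a lone zero byte cannot mean U+0000 in UTF-16.
a_utf16 = "A".encode("utf-16-le")
print(a_utf16)  # prints: b'A\x00'
```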

Related

unicode UTF-8, Decoding issue

UTF-8 is a variable-length encoding. If a character can be represented using a single byte, e.g. A (the letter A in English), UTF-8 will encode it with a single byte. If it requires two bytes, it will use two bytes, and so on.
Now consider that I encode A (01000001) and あ (11100011 10000001 10000010).
This will be stored in memory as a contiguous run of bytes: 01000001 11100011 10000001 10000010.
My question is: while decoding, how does the editor know that the first byte is for the first character only and the next 3 bytes are for the second character?
It could end up decoding 4 characters, with each byte treated as its own character. Where is the distinction here?
The UTF-8 encoding tells the program how many bytes there are for each encoded codepoint. Any byte starting with 0xxxxxxx is an ASCII character from 0 to 127. Any byte starting with 10xxxxxx is a continuation byte and can only occur after a starting byte: 110xxxxx, 1110xxxx or 11110xxx specify that the next byte, two bytes or three bytes are continuation bytes, respectively.
If there aren’t the right number of continuation bytes, or a continuation byte ever appears in the wrong place, then the string is not valid UTF-8. Some programs take advantage of this to try to auto-detect the encoding.
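The rule above can be sketched as a small splitter that reads the lead byte, works out the sequence length, and checks the continuation bytes. The function names `sequence_length` and `split_characters` are hypothetical, and this sketch only checks the bit patterns described here (it does not reject overlong forms or surrogates the way a full decoder would).

```python
def sequence_length(lead: int) -> int:
    """Return the total byte count announced by a lead byte."""
    if lead & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: ASCII
    if lead & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if lead & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if lead & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError(f"byte 0x{lead:02X} cannot start a sequence")


def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per encoded code point."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead byte must look like 10xxxxxx.
        if len(chunk) != n or any(b & 0b1100_0000 != 0b1000_0000 for b in chunk[1:]):
            raise ValueError(f"bad continuation bytes at offset {i}")
        chunks.append(chunk)
        i += n
    return chunks


parts = split_characters(bytes([0b01000001, 0b11100011, 0b10000001, 0b10000010]))
print(parts)  # prints: [b'A', b'\xe3\x81\x82']
```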

Regarding unicode characters and their utf8 binary representation

Out of curiosity, I wonder why, for example, the character "ł" with code point 322 has a UTF-8 binary representation of 11000101:10000010 (decimal 197:130) and not its actual binary representation 00000001:01000010 (decimal 1:66)?
UTF-8 encodes Unicode code points in the range U+0000..U+007F in a single byte. Code points in the range U+0080..U+07FF use 2 bytes, code points in the range U+0800..U+FFFF use 3 bytes, and code points in the range U+10000..U+10FFFF use 4 bytes.
When the code point needs two bytes, then the first byte starts with the bit pattern 110; the remaining 5 bits are the high order 5 bits of the Unicode code point. The continuation byte starts with the bit pattern 10; the remaining 6 bits are the low order 6 bits of the Unicode code point.
You are looking at ł U+0142 LATIN SMALL LETTER L WITH STROKE (decimal 322). The bit pattern representing hexadecimal 142 is:
00000001 01000010
With the UTF-8 sub-field grouping marked by colons, that is:
00000:001 01:000010
So the UTF-8 code is:
110:00101 10:000010
11000101 10000010
0xC5 0x82
197 130
The same basic ideas apply to 3-byte and 4-byte encodings: you chop off 6 bits per continuation byte, and combine the leading bits with the appropriate marker bits (1110 for 3 bytes; 11110 for 4 bytes; there are as many leading 1 bits as there are bytes in the complete character). There are a bunch of other rules that don't matter much to you right now. For example, you never encode a UTF-16 high surrogate (U+D800..U+DBFF) or a low surrogate (U+DC00..U+DFFF) in UTF-8 (or UTF-32, come to that). You never encode a non-minimal sequence (so although bytes 0xC0 0x80 could be used to encode U+0000, this is invalid). One consequence of these rules is that the bytes 0xC0 and 0xC1 are never valid in UTF-8 (and neither are 0xF5..0xFF).
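To make the bit shuffling concrete, here is a minimal encoder sketch that follows the ranges and marker bits described above. The name `encode_utf8` is made up for this example; in real code you would simply call `str.encode("utf-8")`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"U+{cp:04X} is not an encodable code point")
    if cp <= 0x7F:
        return bytes([cp])                      # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),         # 110xxxxx: high 5 bits
                      0x80 | (cp & 0x3F)])      # 10xxxxxx: low 6 bits
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),        # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),            # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


l_stroke = encode_utf8(0x0142)
print(list(l_stroke))  # prints: [197, 130]
```

Choosing the shortest form for each range is what rules out non-minimal sequences such as 0xC0 0x80.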
UTF8 is designed for compatibility with 7-bit ASCII.
To achieve this the most significant bit of bytes in a UTF8 encoded byte sequence is used to signal whether a byte is part of a multi-byte encoded code point. If the MSB is set, then the byte is part of a sequence of 2 or more bytes that encode a single code point. If the MSB is not set then the byte encodes a code point in the range 0..127.
Therefore in UTF8 the byte sequence [1][66] represents the two code points 1 and 66 respectively since the MSB is not set (=0) in either byte.
Furthermore, the code point #322 must be encoded using a sequence of bytes where the MSB is set (=1) in each byte.
The precise details of UTF8 encoding are quite a bit more complex but there are numerous resources that go into those details.
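You can see the difference between the two byte sequences by decoding both with Python's built-in UTF-8 codec (the variable names are just for illustration):

```python
# Bytes 1 and 66 both have the MSB clear, so each is its own code point.
two_ascii = bytes([1, 66]).decode("utf-8")
print([ord(c) for c in two_ascii])  # prints: [1, 66]

# Bytes 197 and 130 both have the MSB set and form one 2-byte sequence.
l_stroke = bytes([197, 130]).decode("utf-8")
print([ord(c) for c in l_stroke])  # prints: [322]
```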

How does UTF-8 encoding identify single byte and double byte characters?

Recently I faced an issue with character encoding, and while I was digging into character sets and character encodings this doubt came to mind. UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding, how does it differentiate single-byte and double-byte characters? For example, "Aݔ" is stored as "410754" (the Unicode value for A is 41 and for the Arabic character it is 0754). How does the encoding identify that 41 is one character and 0754 is another, two-byte character? Why is it not read as 4107 as one double-byte character and 54 as a single-byte character?
For example, "Aݔ" is stored as "410754"
That’s not how UTF-8 works.
Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose codepoints numerically match their UTF-8 representation. For example, U+0041 becomes 0x41 which is 01000001 in binary.
All other characters are represented with multiple bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
Computers know where one character ends and the next one starts because UTF-8 was designed so that the single-byte values used for ASCII do not overlap with those used in multi-byte sequences. The bytes 0x00 through 0x7F are only used for ASCII and nothing else; the bytes above 0x7F are only used for multi-byte sequences and nothing else. Furthermore, the bytes that are used at the beginning of the multi-byte sequences also cannot occur in any other position in those sequences.
Because of that, codepoints above U+007F cannot be stored as-is; they need to be encoded. Consider the following binary patterns:
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The number of leading ones in the first byte tells you how many bytes the sequence has in total, so a first byte starting with 1110 is followed by two more bytes that belong to the same character. All of those following bytes start with 10 in binary. To encode the character you convert its codepoint to binary and fill in the x’s.
As an example: U+0754 is between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100, so you replace the x’s with those digits:
11011101 10010100
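The "fill in the x’s" step for U+0754 can be reproduced literally with bit strings. This is just an illustrative sketch of the two-byte case, with made-up variable names:

```python
cp = 0x0754
bits = format(cp, "011b")       # 11 payload bits: '11101010100'

first = "110" + bits[:5]        # 110xxxxx gets the high 5 bits
second = "10" + bits[5:]        # 10xxxxxx gets the low 6 bits
encoded = bytes([int(first, 2), int(second, 2)])

print(first, second)            # prints: 11011101 10010100
print(encoded.hex())            # prints: dd94
```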
Short answer:
UTF-8 is designed to be able to unambiguously identify the type of each byte in a text stream:
1-byte codes (all and only the ASCII characters) start with a 0
Leading bytes of 2-byte codes start with two 1s followed by a 0 (i.e. 110)
Leading bytes of 3-byte codes start with three 1s followed by a 0 (i.e. 1110)
Leading bytes of 4-byte codes start with four 1s followed by a 0 (i.e. 11110)
Continuation bytes (of all multi-byte codes) start with a single 1 followed by a 0 (i.e. 10)
Your example Aݔ, which consists of the Unicode code points U+0041 and U+0754, is encoded in UTF-8 as:
01000001 11011101 10010100
So, when decoding, UTF-8 knows that the first byte must be a 1-byte code, the second byte must be the leading byte of a 2-byte code, the third byte must be a continuation byte, and since the second byte is the leading byte of a 2-byte code, the second and third byte together must form this 2-byte code.
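The byte types in the list above can be told apart with a few shifts. A rough sketch (the function name `classify` and its labels are hypothetical):

```python
def classify(byte: int) -> str:
    """Name the role of a single byte in a UTF-8 stream."""
    if byte >> 7 == 0b0:
        return "1-byte code"
    if byte >> 6 == 0b10:
        return "continuation"
    if byte >> 5 == 0b110:
        return "lead of 2-byte code"
    if byte >> 4 == 0b1110:
        return "lead of 3-byte code"
    if byte >> 3 == 0b11110:
        return "lead of 4-byte code"
    return "invalid"


kinds = [classify(b) for b in "A\u0754".encode("utf-8")]
print(kinds)  # prints: ['1-byte code', 'lead of 2-byte code', 'continuation']
```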
See here how UTF-8 encodes Unicode code points.
Just to clarify, ASCII here means standard 7-bit ASCII, not the extended 8-bit ASCII variants commonly used in Europe.
Thus, the upper half of the one-byte range (0x80 to 0xFF) moves to the two-byte representation, and the upper part of the two-byte range (0x0800 to 0xFFFF) takes the full three-byte representation.
The four-byte representation only holds values that fit in the lowest three bytes, and uses only 1,114,111 of the 16,777,215 available possibilities.
That means that interpreters must 'jump back' a NUL (0) byte when they find those binary patterns.
Hope this helps somebody!

Size of characters in unicode

We are upgrading our database to 11g and also converting everything to Unicode. After reading online, I found out that each character in a string can take 1, 2 or 4 bytes.
I was wondering how the system can know the number of bytes a character takes. Is there a reserved bit in each byte in the Unicode encoding that says "this character is 2 bytes"?
First, be aware that there are major differences between Unicode and a particular encoding. There are multiple ways to encode Unicode (UTF-8, UTF-16, and UTF-32 being three of the more common) each of which has different properties. You appear to be describing the properties of the UTF-8 encoding.
Yes, the leading bit(s) within each byte of a UTF-8 encoded string indicate how many bytes a particular character uses. The Wikipedia article on the UTF-8 encoding shows the various bit-patterns for each byte for 1, 2, 3, and 4 byte characters.
A Unicode character as such is an abstract concept. When characters are encoded as byte strings, they may have different lengths. In UTF-32, each character is 4 bytes. In UTF-16, each character is 2 or 4 bytes. In UTF-8, each character is 1, 2, 3, or 4 bytes.
In UTF-16, the first two bytes determine whether there are two more bytes. The additional bytes are present if the quantity defined by the first two bytes is in a specific designated range called “high surrogates”.
In UTF-8, the bit pattern of the first byte specifies how many bytes there are for the character. If the most significant bit is 0, there is just this one byte (so ASCII characters are represented just as in ASCII). If the first three bits are 110, there is one more byte. If the first four bits are 1110, there are two more bytes, and if the first five bits are 11110, three more bytes.
If you pick up an arbitrary byte from a UTF-8 stream, you cannot generally decide whether it is part of a 2, 3, or 4 byte representation. If it is one of the patterns described for the start byte, you know what it is. But if it starts with the bits 10, you cannot know without looking at the preceding bytes.
This means that a UTF-8 stream must be processed sequentially. Direct addressing by character position is impossible; to find the Nth character, you need to start reading from the beginning and observe the bit patterns of start bytes.
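As a sketch of what that scanning looks like, the helper `nth_char_offset` below counts start bytes from the beginning to find the Nth character. The second helper, `char_start`, shows that from an arbitrary continuation byte you can still step backwards to the start byte of the same character. Both names are made up, and both assume the input is valid UTF-8.

```python
def nth_char_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based), found by scanning."""
    count = -1
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:        # not 10xxxxxx, so a character starts here
            count += 1
            if count == n:
                return i
    raise IndexError("fewer characters than requested")


def char_start(data: bytes, pos: int) -> int:
    """Step back from pos to the start byte of the character it belongs to."""
    while pos > 0 and data[pos] & 0xC0 == 0x80:
        pos -= 1
    return pos


data = "A\u3042\u0142".encode("utf-8")   # 1 + 3 + 2 bytes
print(nth_char_offset(data, 2))          # prints: 4
print(char_start(data, 3))               # prints: 1
```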

How does UCS-2 display unicode code points that take 6 bytes in UTF-8?

I was reading about Unicode at http://www.joelonsoftware.com/articles/Unicode.html. Joel says UCS-2 encodes all Unicode characters in 2 bytes whereas UTF-8 may take up to 6 bytes to encode some Unicode characters. Would you please explain, with an example, how a 6-byte UTF-8 encoded Unicode character is encoded in UCS-2?
UCS-2 was created when Unicode had less than 65536 codepoints, so they all fit in 2 bytes max. Once Unicode grew to more than 65536 codepoints, UCS-2 became obsolete and was replaced with UTF-16, which encodes all of the UCS-2 compatible codepoints using 2 bytes and the rest using 4 bytes via surrogate pairs.
UTF-8 was originally written to encode codepoints up to 6 bytes (U+7FFFFFFF max) but was later limited to 4 bytes (U+1FFFFF max, though anything above U+10FFFF is forbidden) so that it is 100% compatible with UTF-16 back and forth and does not encode any codepoints that UTF-16 does not support. The maximum codepoint that both UTF-8 and UTF-16 support is U+10FFFF.
So, to answer your question, a codepoint that requires a 5- or 6-byte UTF-8 sequence (U+200000 to U+7FFFFFFF) cannot be encoded in UCS-2, or even UTF-16. There are not enough bits available to hold such large codepoint values.
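The limits quoted above fall out of counting payload bits. A small arithmetic sketch (the helper name `payload_bits` is made up, and the 5- and 6-byte rows refer to the original, now obsolete, UTF-8 design):

```python
def payload_bits(n: int) -> int:
    """Payload bits in an n-byte UTF-8 sequence (original 1..6 byte scheme)."""
    if n == 1:
        return 7                      # 0xxxxxxx
    # The lead byte spends n ones and a zero on the marker, leaving 7 - n bits;
    # each of the n - 1 continuation bytes carries 6 bits.
    return (7 - n) + 6 * (n - 1)


max_code_point = {n: (1 << payload_bits(n)) - 1 for n in range(1, 7)}
for n, cp in max_code_point.items():
    print(n, f"U+{cp:04X}")

# UCS-2 has 16 bits; UTF-16 surrogate pairs add 20 bits on top of U+10000.
ucs2_max = (1 << 16) - 1
utf16_max = 0x10000 + (1 << 20) - 1
print(f"U+{ucs2_max:04X}", f"U+{utf16_max:04X}")  # prints: U+FFFF U+10FFFF
```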
UCS-2 stores everything it can in two bytes, and does nothing about the code points that won't fit into that space. Which is why UCS-2 is pretty much useless today.
Instead, we have UTF-16, which looks like UCS-2 for all the two-byte sequences, but also allows surrogate pairs, pairs of two-byte sequences. Using those, remaining code points can be encoded using a total of 4 bytes.
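For illustration, this is roughly how a code point outside the two-byte range is turned into a surrogate pair. The function name `utf16_units` is hypothetical, and the sketch assumes the input is a valid code point that is not itself a surrogate.

```python
def utf16_units(cp: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if cp < 0x10000:
        return [cp]                       # fits in a single 2-byte unit
    v = cp - 0x10000                      # 20 bits remain
    high = 0xD800 | (v >> 10)             # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)            # bottom 10 bits -> low surrogate
    return [high, low]


units = utf16_units(0x1F600)
print([f"{u:04X}" for u in units])  # prints: ['D83D', 'DE00']
```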