In UTF-8, code points >127 encoded with multiple bytes. For example, character U+041F (100'0001'1111) encoded as:
1101'0000 1001'1111
^^^ ^^
Marked bits determine leading and trailing bytes, other bits are actual bits of the code point.
But can we encode code point 1 as
1100'0000 1000'0001
Of course, it is redundant, but is it legal in UTF-8?
Overlong UTF-8 sequences are not considered valid UTF-8 representations of a code point. A UTF-8 decoder must reject overlong sequences.
Wikipedia citation: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
Original RFC 2279 specification: https://www.ietf.org/rfc/rfc2279.txt
Related
I am trying to convert a UTF16 to UTF8. For string 0xdcf0, the conversion failed with invalid multi byte sequence. I don't understand why the conversion fails. In the library I am using to do utf-16 to utf-8 conversion, there is a check
if (first_byte & 0xfc == 0xdc) {
return -1;
}
Can you please help me understand why this check is present.
Unicode characters in the DC00–DFFF range are "low" surrogates, i.e. are used in UTF-16 as the second part of a surrogate pair, the first part being a "high" surrogate character in the range D800–DBFF.
See e.g. Wikipedia article UTF-16 for more information.
The reason you cannot convert to UTF-8, is that you only have half a Unicode code point.
In UTF-16, the two byte sequence
DCFO
cannot begin the encoding of any character at all.
The way UTF-16 works is that some characters are encoded in 2 bytes and some characters are encoded in 4 bytes. The characters that are encoded with two bytes use 16-bit sequences in the ranges:
0000 .. D7FF
E000 .. FFFF
All other characters require four bytes to be encoded in UTF-16. For these characters the first pair of bytes must be in the range
D800 .. DBFF
and the second pair of bytes must be in the range
DC00 .. DFFF
This is how the encoding scheme is defined. See the Wikipedia page for UTF-16.
Notice that the FIRST sixteen bits of an encoding of a character can NEVER be in DC00 through DFFF. It is simply not allowed in UTF-16. This is (if you follow the bitwise arithmetic in the code you found), exactly what is being checked for.
I cannot understand some key elements of encoding:
Is ASCII only a character or it also has its encoding scheme algorithm ?
Does other windows code pages such as Latin1 have their own encoding algorithm ?
Are UTF7, 8, 16, 32 the only encoding algorithms ?
Does the UTF alghoritms are used only with the UNICODE set ?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are being used in this process ? More specifically, does Latin1/Big5 use their own encoding alghoritm or I have to use a UTF alghoritm ?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both, it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes
no, there are others as well ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Every character in the world has been assigned a unicode value [ numbered from 0 to ...]. It is actually an unique value. Now, it depends on an individual that how he wants to use that unicode value. He can even use it directly or can use some known encoding schemes like utf8, utf16 etc. Encoding schemes map that unicode value into some specific bit sequence [ can vary from 1 byte to 4 bytes or may be 8 in future if we get to know about all the languages of universe/aliens/multiverse ] so that it can be uniquely identified in the encoding scheme.
For example ASCII is an encoding scheme which only encodes 128 characters out of all characters. It uses one byte for every character which is equivalent to utf8 representation. GSM7 is one other format which uses 7 bit per character to encode 128 characters from unicode character list.
Utf8:
It uses 1 byte for characters whose unicode value is till 127.
Beyond this it has its own way of representing the unicode values.
Uses 2 byte for Cyrillic then 3 bytes for Hindi characters.
Utf16:
It uses 2 byte for characters whose unicode value is till 127.
and it also uses 2 byte for Cyrillic, Hindi characters.
All the utf encoding schemes fixes initial bits in specific pattern [ eg: 110|restbits] and rest bits [eg: initialbits|11001] takes the unicode value to make a unique representation.
Wikipedia on utf8, utf16, unicode will make it clear.
I coded an utf translator which converts incoming utf8 text across all languages into its equivalent utf16 text.
Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?
Also known as \xAE
A Unicode character as such does not have any length in bytes. It is the character encoding that matters. You know the length of a character in bytes in a specific encoding from the definition of the encoding.
For example, in the ISO-8859-1 (ISO Larin 1) encoding, which encodes just a small subset of Unicode, including “®”, every character is 1 byte long.
In the UTF-16 encoding, all characters are either 2 or 4 bytes long, and characters in the range U+0000...U+FFFF, such as “®”, are 2 bytes
In the UTF-32 encoding, all characters are 4 bytes long.
In the UTF-8 encoding, characters take 1 to 4 bytes. A simple way to check this out is to use the Fileformat.info Character search (though this is not normative information, just a nice quick reference). E.g., the page about U+00AE shows the character in some encodings, including 0xC2 0xAE (that is, 2 bytes) in UTF-8.
It is unicode number U+00AE. It's in the range [0x80, 0x7ff] so in UTF-8 it'll be encoded as two bytes — the table at the top of the Wikipedia article explains in more detail*.
If you were using UTF-16 it'd also be two bytes, since no continuation is necessary.
(* my summary though: one of the features of UTF-8 is that you can jump midway into a byte stream and synchronise with the text without generating any spurious characters, because you can tell whether any byte is a continuation character without further context.
An unavoidable side effect is that only the 7-bit ASCII characters fit into a single byte and everything else takes multiple bytes. 0xae is sufficiently close to the 7-bit range to require only one extra byte. See Wikipedia for specifics.)
Quick & dirty Q: Can I safely assume that a byte of a UTF-8, UTF-16 or UTF-32 codepoint (character) will not be an ASCII whitespace character (unless the codepoint is representing one)?
I'll explain:
Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc - Unicode defines some more whitespace characters, but forget about them).
So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.
Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?
UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.
As for UTF-16 and UTF-32 I doubt it'll work at all, but I barely know anything about the details of these, so feel free to surprise me there...
The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.
For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high-bit set and all ASCII characters have the high bit unset.
Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work in another language).
In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.
However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C8, it's the first byte of a two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces. In order to not break existing code (which used stuff like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.
Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.
For UTF-8, any continuation bytes always start with the bits 10, making them greater than 0x7f, no there's no chance they could be mistaken for a ASCII space.
You can see this in the following table:
Range Encoding Binary value
----------------- -------- --------------------------
U+000000-U+00007f 0xxxxxxx 0xxxxxxx
U+000080-U+0007ff 110yyyxx 00000yyy xxxxxxxx
10xxxxxx
U+000800-U+00ffff 1110yyyy yyyyyyyy xxxxxxxx
10yyyyxx
10xxxxxx
U+010000-U+10ffff 11110zzz 000zzzzz yyyyyyyy xxxxxxxx
10zzyyyy
10yyyyxx
10xxxxxx
You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
See wikipedia UTF-8 for more detail.
UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either.
See wikipedia UTF-16 for more detail.
Finally, UTF-32 (wikipedia link here) is big enough to represent all of the Unicode code points so no special encoding is required.
It is strongly suggested not to work against bytes when dealing with Unicode. The two major platforms (Java and .Net) support unicode natively and also provide a mechanism for determining these kind of things. For e.g. In Java you can use Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.
I have a sting in unicode is "hao123--我的上网主页", while in utf8 in C++ string is "hao123锛嶏紞鎴戠殑涓婄綉涓婚〉", but I should write it to a file in this format "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875", how can I do it. I know little about this encoding. Can anyone help? thanks!
You seem to mix up UTF-8 and UTF-16 (or possibly UCS-2). UTF-8 encoded characters have a variable length of 1 to 4 bytes. Contrary to this, you seem to want to write UTF-16 or UCS-2 to your files (I am guessing this from the \uxxxx character references in your file output string).
For an overview of these character sets, have a look at Wikipedia's article on UTF-8 and browse from there.
Here's some of the very basic basics (heavily simplified):
UCS-2 stores all characters as exactly 16 bits. It therefore cannot encode all Unicode characters, only the so-called "Basic Multilingual Plane".
UTF-16 stores the most frequently-used characters in 16 bits, but some characters must be encoded in 32 bits.
UTF-8 encodes characters with a variable length of 1 to 4 bytes. Only characters from the original 7-bit ASCII charset are encoded as 1 byte.