utf-8: How to recover from missing bit? - encoding

Quote from Wikipedia
If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronize at the start of the next code point;
...
UTF-16 and UTF-32 will handle corrupt (altered) bytes by resynchronizing on the next good code point, but an odd number of lost or spurious byte (octet)s will garble all following text
I don't quite understand why UTF-8 recovers better than UTF-16 when corruption happens. What does "odd number" mean? Does it mean a single 1 in 0101010111101 (a byte stream)?
In addition, can I say that UTF-16's recovery performance is as good as UTF-8's when no odd number of lost or spurious bytes (octets) occurs?
Could anyone give some pointers on how a system recovers from such errors when using UTF-8 or UTF-16?

This quote refers to recovering the stream after some bytes have gone missing, not to correcting characters that went bad during transmission.
And what it means is simply this: if you lose one or more bytes in a UTF-8 stream, then as soon as the next byte received is the beginning of a new code point (usually a character), you can resume your text from there.
In UTF-16, every code unit is 2 bytes long, so if you lose a single byte, all subsequent bytes are shifted by one, and without further, higher-level correction no later code point in that stream will be decoded correctly.
To visualize that, in a Python 3 interactive session we could do:
In [19]: a = "Resumé for mr. Fernando"
In [20]: b = a.encode("utf-8"); b = b[:3] + b[4:]    # drop the byte at index 3 (the ASCII 'u')
In [21]: print(b.decode("utf-8"))
Resmé for mr. Fernando
In [22]: b = a.encode("utf-16"); b = b[:10] + b[11:]    # drop a single byte in the middle of the stream
In [23]: print(b.decode("utf-16", errors="replace"))
Resu 昀漀爀 洀爀⸀ 䘀攀爀渀愀渀搀漀�
(The slicing notation used in Python for those unfamiliar means [<begin (inclusive)>: <end (exclusive)>] as positions. Any parameter left blank is assumed to be the beginning or end of the sliced sequence)
And why "odd" should be obvious at this point: if you miss an even number of bytes, the decoding system will still try to decode at character boundaries - with an odd number, all character boundaries will be incorrect

The magic hides in the structure of the encoding. In UTF-8, a byte that starts an encoded character can never be mistaken for a byte that doesn't, because the two kinds have distinct bit patterns. That is not true in UTF-16.
UTF-8 structure
Let's simulate a machine that receives UTF-8 encoded bytes as input and outputs code points, and see how it behaves when one byte is lost from a 3-byte sequence.
Say the UTF-8 encodings of the two characters 一二 are e4 b8 80 (一) and e4 ba 8c (二). The machine reads e4, which is 11100100, so it knows a 3-byte code point follows. Unfortunately, the following byte b8 is lost, so the machine next reads 80, which looks like a plausible continuation byte. Then the first byte of 二 is read, which is e4. But that doesn't fit the rule: the third byte of a 3-byte code point must start with the bits 10, and e4 starts with 111. Now the machine knows some bytes are missing and the current code point is broken, and it looks around for the right place to restart decoding. The next byte, ba, is not a good start because it begins with 10, which marks a continuation byte, not a leading byte. Looking back, the machine finds that e4 is exactly a good leading byte, so it resumes decoding from there and recovers 二 intact.
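Here is a rough sketch of that resynchronizing machine in Python (the helper name is my own, and a real decoder would additionally reject byte values that never occur in UTF-8, which this sketch skips): it treats any byte that is not a 10xxxxxx continuation byte as a possible restart point.

def resync_points(data):
    """Yield offsets of bytes that can start a UTF-8 sequence (ASCII or lead bytes)."""
    for i, byte in enumerate(data):
        if byte < 0x80 or byte >= 0xC0:       # 0xxxxxxx or 11xxxxxx; 10xxxxxx is skipped
            yield i

good = "一二".encode("utf-8")                  # e4 b8 80 e4 ba 8c
broken = good[:1] + good[2:]                   # lose the b8 continuation byte -> e4 80 e4 ba 8c

print(list(resync_points(broken)))             # [0, 2]: the second e4 is a valid restart point
print(broken[2:].decode("utf-8"))              # 二 decodes cleanly from there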
In UTF-16 this is impossible, because looking at a single byte there is no way to tell whether it is the first or second byte of a code unit, so the decoder cannot recognize that it has gone out of step.

Related

Are control sequences the same number in every encoding?

I am writing a parser, and the original spec states:
The file header ends with the control sequence Ctrl-Z
They do not specify which encoding the header is written in (it could be Latin-1, UTF-8, Windows-1252, ...), so I wonder whether the sequence is the same number in every encoding.
It appears to be the case that it always corresponds to decimal 26, or hex 1A.
It would be good to know, in a more general way, whether this holds for all control sequences.
Most likely, ASCII is assumed. In ASCII, and especially since you say "Ctrl-Z" corresponds to the codepoint dec 26 / hex 1A, this would be the SUB (substitute) control character.
Alternative extended character sets/encodings wouldn't change anything here, because dec 26 in ASCII lies within the lower 7 bits of the byte (dec 0-127 of the 256 possible values). The 8th bit was later used to double the code space and gain the other 128 codepoints, dec 128-255. The idea is that the extended character sets usually retain the lower ASCII codepoints/mappings (also for backward compatibility) but introduce their own special characters in the higher range 128-255. There are many different sets of this type, each trying to support more of the world's writing scripts: Windows-1252, a Western European mix; ISO-8859-1 for Western European languages; ISO-8859-15, which is nearly identical but adds the Euro currency symbol; code page 437, whose IBM DOS glyphs include box-drawing characters for console TUIs (and which, for example, assigns printable glyphs to the codepoints that are control sequences in ASCII); and so on. The obvious problem is that there are a lot of these:
each only gains 128 more characters
you can't combine/load/apply any two of them at the same time (if characters would be needed from multiple different code sets)
each application has to know (or be told) beforehand which code set a file was saved in, in order to map its byte patterns to the intended characters on screen; and if a user or tool saves the file with the wrong code set, not recognizing that it was originally created with a different one, the file effectively becomes "corrupt": some bytes were stored under the assumption of code set A and some under code set B, and both cannot apply at once. These flat, dumb plain-text files on old, memory-short DOS file systems had no mechanism to record which part of a file belongs to which code set, so the characters can never all be rendered correctly, and it can be difficult or impossible to guess and repair, after the fact, which interpretation was intended for a given byte pattern
there is no hope of getting anywhere with only 128 extra characters on top of ASCII when it comes to Chinese and other large scripts
So the improvement was to stop using the 8th bit for these code pages and instead use it as a marker: if it is set, it indicates that another byte follows (UTF-8), expanding the number of available code points greatly. This can even be repeated in the subsequent bytes. But it's optional: if the character is within the 7-bit ASCII codepoints, UTF-8 does not set the 8th bit and does not add another byte.
This also means the extended code pages and UTF-8 cannot be mixed (used/applied at the same time). For most code pages, and for UTF-8/UTF-16 as well, the character-to-codepoint mappings in the ASCII range are identical to ASCII. If all your characters lie within the lower 7 bits of the byte, it does not matter what the encoding theoretically is, as the 8th bit is not used by any of the code pages or by UTF-8 there. It matters a great deal only for bytes that do have the 8th bit set; and when such bytes occur, the choice of encoding generally applies to the whole file, even though many bytes may stay within single-byte ASCII, so great care is needed when inserting or interpreting byte patterns with the 8th bit set.
An easy rule is: if all bytes (or the byte in question) don't have the 8th bit set, it only matters whether the lower 7 bits follow ASCII or not. EBCDIC, for example, is a non-ASCII alternative in which dec 26 / hex 1A is UBS (unit backspace); it does have a SUB (substitute), but at codepoint dec 63 / hex 3F. Other encodings may not have ASCII's SUB at all, or may have something similar with a slightly different meaning or use, or maybe ASCII took its SUB from EBCDIC, etc. But there's no need to worry about UTF-8: it doesn't change anything if ASCII can be assumed, because characters in the ASCII range are encoded identically in UTF-8, as a single byte with the highest bit not set.
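As a quick sanity check of the above with Python's bundled codecs (encoding names as Python spells them): U+001A comes out as the byte 1A in the ASCII-compatible encodings, and as 3F in EBCDIC cp037.

for enc in ("ascii", "latin-1", "windows-1252", "cp437", "utf-8"):
    print(enc, chr(0x1A).encode(enc).hex())       # every one of these prints 1a

print("cp037", chr(0x1A).encode("cp037").hex())   # EBCDIC: the SUB character sits at 3f instead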
Maybe it can be determined from the spec whether all the characters mentioned are within the ASCII range and follow the ASCII codepoint definitions, or whether there are characters that could only come from UTF-8 (or UTF-16 or UTF-32) or from one of the old extended code pages (but not from others), or whether there is any indication that the encoding might not be ASCII-based.
It's obviously problematic when a spec for a format, protocol or data representation doesn't explicitly state the encoding it implicitly assumes. On the other hand, maybe the Ctrl-Z mention is misleading, because dec 26 / hex 1A is the same byte no matter what the encoding would be if it were text. Maybe the spec just uses this byte pattern as a construct with no character-display meaning at all, giving it only its own particular local meaning as defined within the spec.

Why do we have to specify BOM in case of UTF-16 and UTF-32 encodings

I don't quite understand the principles behind UTF encodings and BOM.
What is the point of having BOM in UTF-16 and UTF-32 if computers already know how to compose multibyte data types (for example, integers with the size of 4 bytes) into one variable? Why do we need to specify it explicitly for these encodings then?
And why don't we need to specify it for UTF-8? The Unicode standard says that it's "byte oriented", but even then we need to know whether a byte is the first byte of an encoded code point or not. Or is that specified in the first/last bits of every character?
UTF-16 code units are two bytes wide; let's call those bytes B0|B1.
Let's say we have the letter 'a'; logically this is the number 0x0061. Unfortunately, different computer architectures store this number in memory in different ways: on the x86 platform the less significant byte is stored first (at the lower memory address), so 'a' is stored as 61|00, while on PowerPC it is stored as 00|61. These two conventions are called little endian and big endian for that reason.
To speed up string processing, libraries generally store two-byte characters in native order (big endian or little endian); swapping bytes on every access would be too expensive.
Now imagine that someone on PowerPC writes a string to a file; the library will write the bytes 00|61. Someone on x86 then wants to read those bytes, but do they mean 0x0061 or 0x6100? We can put a special sequence at the beginning of the string so anyone will know the byte order used to save it and can process it correctly. (Converting a string between endiannesses is a costly operation, but most of the time an x86 string will be read on an x86 machine, and a PowerPC string on PowerPC machines.)
With UTF-8 it is a different story: UTF-8 uses a single byte order and encodes the length of a character into the pattern of the first bits of its first byte. The UTF-8 encoding is well described on Wikipedia; generally speaking, it was designed to avoid the endianness problem.
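A short illustration of the problem and the fix, assuming Python's standard codecs: the explicit-endian codecs produce the two byte orders, and the plain "utf-16" codec prepends a BOM so a reader can tell them apart.

s = "a"                                   # U+0061

print(s.encode("utf-16-be").hex())        # 0061 - big endian, as a PowerPC library might write it
print(s.encode("utf-16-le").hex())        # 6100 - little endian, as an x86 library might write it

data = s.encode("utf-16")                 # native order plus BOM, e.g. fffe6100 on a little-endian machine
print(data.hex())
print(data.decode("utf-16"))              # the decoder reads the BOM, picks the right order, prints 'a'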
Different architectures can encode things differently. One system might write 0x12345678 as 0x12 0x34 0x56 0x78 and another might write it as 0x78 0x56 0x34 0x12. It's important to have a way of understanding how the source system has written things. Bytes are the smallest units read or written, so if a format is written byte-by-byte, there is not a problem, just like no system has trouble reading an ASCII file written by another.
The UTF-16 BOM, U+FEFF, will be written either as 0xFE 0xFF or as 0xFF 0xFE, depending on the system. Knowing in which order those bytes were written tells the reader which order the bytes will be in for the rest of the file. UTF-32 uses the same BOM character, padded with 16 zero bits, and its use is the same.
UTF-8, on the other hand, is designed to be read a byte at a time. Therefore, the order is the same on all systems, even when dealing with multi-byte characters.
The UTF-16 and UTF-32 encodings do not specify a byte order. In a stream of 8-bit bytes, the code point U+FEFF can be encoded in UTF-16 as the bytes FE, FF (big endian) or as FF, FE (little endian). The stream writer obviously cannot know where the stream will end up (a file, a network socket, a local program?) so you put a BOM at the beginning to help the reader(s) determine the encoding and byte-order variant.
UTF-8 does not have this ambiguity because it is a byte-oriented encoding right from the start. The only way to encode this code point in UTF-8 is with the bytes EF, BB, BF, in this precise order. (Conveniently, the high bits in the first byte of the serialization also reveal how many bytes the sequence will occupy.)
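A tiny check of that last point with Python's codecs: U+FEFF has exactly one UTF-8 byte sequence, while its UTF-16 serializations differ by byte order.

print("\ufeff".encode("utf-8").hex())      # efbbbf - identical on every platform
print("\ufeff".encode("utf-16-be").hex())  # feff
print("\ufeff".encode("utf-16-le").hex())  # fffe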

Are there bytes that are not used in the UTF-8 encoding?

As I understand it, UTF-8 is a superset of ASCII, and therefore includes the control characters which are not used to represent printable characters.
My question is: Are there any bytes (of the 256 different) that are not used by the UTF-8 encoding?
I wondered if you could convert/encode UTF-8 text to binary.
Here is my thought process:
I have no idea how the UTF-8 text encoding works or how it can represent so many characters (only that it uses multiple bytes for characters not in ASCII (Latin-1??)), but I know that ASCII text is valid UTF-8, so the control characters (bytes 0-31) are not used differently by the UTF-8 encoding, while at the same time they are not used for displaying characters, right??
So of the 256 different byte values, only about 230 are used. For a 1000-byte-long Unicode text there are then only 230^1000 different texts, right?
If that is true, you could convert it to binary data which is smaller than 1000 bytes.
Wolfram alpha: 1000 bytes of unicode (assumption unicode only uses 230 of the 256 different bytes) --> 496 bytes
Yes, it is possible to devise encodings which are more space-efficient than UTF-8, but you have to weigh the advantages against the disadvantages.
For example, if your primary target is (say) ISO-8859-1, you could map the character codes 0xA0-0xFF to themselves, and only use 0x80-0x9F to select an extension map, somewhat vaguely like UTF-8 uses (nearly) all of 0x80-0xFF to encode sequences which can represent all of Unicode above 0x7F. You would gain a significant advantage when the majority of your text does not use characters in the range 0x80-0x9F or code points from 0x0100 up to 0x10FFFF, but correspondingly lose when this is not the case.
Or you could require the user to keep a state variable which tells you which range of characters is currently selected, and have each byte in the stream act as an index into that range. This has significant disadvantages, but used to be how these things were done way back when (witness e.g. ISO-2022).
The original UTF-8 draft before Ken Thompson and Rob Pike famously intervened was probably also somewhat more space-efficient than the final specification, but the changes they introduced had some very attractive properties, trading (I assume) some space efficiency for lack of contextual ambiguity.
I would urge you to read the Wikipedia article about UTF-8 to understand the design desiderata -- the spec is possible to grasp in just a few minutes, although you might want to reserve an hour or more to follow footnotes etc. (The Thompson anecdote is currently footnote #7.)
All in all, unless you are working on space travel or some similarly efficiency-critical application, losing UTF-8 compatibility is probably not worth the time you have already spent, and you should stop now.
The bytes 0xC0, 0xC1 and 0xF5-0xFF are not valid anywhere in UTF-8, and some other bytes are not valid at certain positions.
The lead byte of a character indicates the number of bytes used to encode the character, and each continuation byte has 10 as its two high-order bits. This is so that you can pick any byte within the text and find the start of the character containing it. If you don't mind losing this ability, you could certainly come up with a more efficient encoding.
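A brute-force way to see which byte values the UTF-8 encoder can ever produce is to encode every code point and collect the bytes. A small sketch (it takes a few seconds to run, and it skips the surrogate range, which cannot be encoded):

used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:            # surrogates are not encodable in UTF-8
        continue
    used.update(chr(cp).encode("utf-8"))

never_used = sorted(set(range(256)) - used)
print([hex(b) for b in never_used])       # 0xc0, 0xc1 and 0xf5 through 0xff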
You have to distinguish between characters, Unicode and the UTF-8 encoding:
In encodings like ASCII, Latin-1, etc. there is a one-to-one relation between one character and one number between 0 and 255, so a character can be encoded by exactly one byte (e.g. "A" -> 65). For decoding such a text you need to know which encoding was used (does 65 really mean "A"?).
To overcome this situation, Unicode assigns every character (including all kinds of special things like control characters, diacritic marks, etc.) a unique number in the range from 0 to 0x10FFFF (the so-called Unicode codepoint). As this range does not fit into one byte, the question is how to encode it. There are several ways to do this; the simplest would be to always use 4 bytes per character. As this consumes a lot of space, a more efficient encoding is UTF-8: here every Unicode codepoint (= character) is encoded in one, two, three or four bytes (not all byte values from 0 to 255 are used for this encoding, but that is only a technical detail).
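For instance, with Python you can see the one-to-four-byte lengths directly:

for ch in ("A", "é", "€", "一", "😀"):
    print(ch, hex(ord(ch)), len(ch.encode("utf-8")), "byte(s) in UTF-8")
# prints 1 byte for A (0x41), 2 for é (0xe9), 3 for € (0x20ac) and 一 (0x4e00), 4 for 😀 (0x1f600)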

Checking Unicode string for whitespace - byte for byte!

Quick & dirty Q: Can I safely assume that a byte of a UTF-8, UTF-16 or UTF-32 codepoint (character) will not be an ASCII whitespace character (unless the codepoint is representing one)?
I'll explain:
Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc - Unicode defines some more whitespace characters, but forget about them).
So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.
Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?
UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.
As for UTF-16 and UTF-32 I doubt it'll work at all, but I barely know anything about the details of these, so feel free to surprise me there...
The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.
For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high-bit set and all ASCII characters have the high bit unset.
Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work in another language).
In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.
However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C1, it's the first byte of an overlong two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces. In order not to break existing code (which used stuff like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.
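Note that strict decoders reject that overlong form; Python's, for example, refuses to turn C0 A0 into a space:

try:
    b"\xc0\xa0".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)                            # rejected as an invalid start byte; real UTF-8 never contains 0xc0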
Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.
For UTF-8, any continuation byte always starts with the bits 10, making it greater than 0x7f, so there's no chance it could be mistaken for an ASCII space.
You can see this in the following table:
Range               Encoding   Binary value
-----------------   --------   --------------------------
U+000000-U+00007f   0xxxxxxx   0xxxxxxx
U+000080-U+0007ff   110yyyxx   00000yyy xxxxxxxx
                    10xxxxxx
U+000800-U+00ffff   1110yyyy   yyyyyyyy xxxxxxxx
                    10yyyyxx
                    10xxxxxx
U+010000-U+10ffff   11110zzz   000zzzzz yyyyyyyy xxxxxxxx
                    10zzyyyy
                    10yyyyxx
                    10xxxxxx
You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
See wikipedia UTF-8 for more detail.
UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either.
See wikipedia UTF-16 for more detail.
Finally, UTF-32 is big enough to represent all of the Unicode code points directly, so no special encoding is required. See wikipedia UTF-32 for more detail.
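Putting the byte-for-byte idea into code, here is a minimal sketch (the helper name is my own), together with the UTF-16 pitfall described above: the dagger character U+2020 serializes to the bytes 20 20 in little-endian UTF-16, which a naive byte scan would mistake for two spaces.

ASCII_WS = frozenset(b" \t\n\r\v\f")

def has_ascii_whitespace_utf8(data):
    return any(b in ASCII_WS for b in data)   # safe: UTF-8 never hides these byte values inside other characters

print(has_ascii_whitespace_utf8("no_space_here_一二".encode("utf-8")))   # False
print(has_ascii_whitespace_utf8("two words".encode("utf-8")))            # True

print("†".encode("utf-16-le").hex())          # 2020 - two 0x20 bytes, but no space in sight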
It is strongly suggested not to work with raw bytes when dealing with Unicode. The two major platforms (Java and .NET) support Unicode natively and also provide mechanisms for determining these kinds of things. For example, in Java you can use the Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.

How do I determine the character set of a string?

I have several files that are in several different languages. I thought they were all encoded UTF-8, but now I'm not so sure. Some characters look fine, some do not. Is there a way that I can break out the strings and try to identify the character sets? Perhaps split on white space then identify each word? Finally, is there an easy way to translate characters from one set to UTF-8?
If you don't know the character set for sure, you can only guess, basically. utf8::valid might help you with that, but you can't really know for certain. If you know that, when it isn't Unicode, it must be one specific character set (like Latin-1), you're lucky. If you have no idea, you're screwed. In any case, you should always assume the whole file is in the same character set unless otherwise specified. You will lose your sanity if you don't.
As for your question of how to convert between character sets: the Encode module is there to do that for you.
Determining whether a file is probably UTF-8 or not should be pretty easy. Determining the encoding if it is not UTF-8 would be very difficult in general.
If the file is encoded with UTF-8, the high bits of each byte should follow a pattern. If a character is one byte, its high bit will be cleared (zero). Otherwise, an n byte character (where n is 2–4) will have the high n bits of the first byte set to one, followed by a single zero bit. The following n - 1 bytes should all have the highest bit set and the second-highest bit cleared.
If all the bytes in your file follow these rules, it's probably encoded with UTF-8. I say probably, because anyone can invent a new encoding that happens to follow the same rules, deliberately or by chance, but interprets the codes differently.
Note that a file encoded with US-ASCII will follow these rules, but the high bit of every byte is zero. It's okay to treat such a file as UTF-8, since they are compatible in this range. Otherwise, it's some other encoding, and there's not an inherent test to distinguish the encoding. You'll have to use some contextual knowledge to guess.
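A rough sketch of that pattern check (the function name is my own; it only verifies the lead/continuation structure described above and does not reject overlong forms or surrogates):

def looks_like_utf8(data):
    pending = 0                               # continuation bytes still expected
    for b in data:
        if pending:
            if b & 0xC0 != 0x80:              # must be 10xxxxxx
                return False
            pending -= 1
        elif b & 0x80 == 0x00:                # 0xxxxxxx: one-byte (ASCII) character
            continue
        elif b & 0xE0 == 0xC0:                # 110xxxxx: start of a 2-byte character
            pending = 1
        elif b & 0xF0 == 0xE0:                # 1110xxxx: start of a 3-byte character
            pending = 2
        elif b & 0xF8 == 0xF0:                # 11110xxx: start of a 4-byte character
            pending = 3
        else:
            return False
    return pending == 0

print(looks_like_utf8("Resumé".encode("utf-8")))    # True
print(looks_like_utf8("Resumé".encode("latin-1")))  # False - é is a bare 0xE9 byte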
Take a look at iconv
http://www.gnu.org/software/libiconv/
Text::Iconv