What's a good terminator byte for UTF-8 data? - unicode

I have a need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie.) To preserve this prefix-similarity as much as possible, I'd prefer to use a terminator at the end of my byte arrays, rather than (say) a byte-length prefix.
What terminator should I use? It seems 0xff is an illegal byte in all positions of any UTF-8 string, but perhaps someone knows concretely?

0xFF and 0xFE cannot appear in legal UTF-8 data. Also the bytes 0xF8-0xFD will only appear in the obsolete version of UTF-8 that allows up to six byte sequences.
0x00 is legal but won't appear anywhere except in the encoding of U+0000. This is exactly the same as other encodings, and the fact that it's legal in all these encodings never stopped it from being used as a terminator in C strings. I'd probably go with 0x00.

The byte 0xff cannot appear in a valid UTF-8 sequence, nor can any of 0xfc, 0xfd, 0xfe.
All UTF-8 bytes must match one of
0xxxxxxx - Lower 7 bit.
10xxxxxx - Second and subsequent bytes in a multi-byte sequence.
110xxxxx - First byte of a two-byte sequence.
1110xxxx - First byte of a three-byte sequence.
11110xxx - First byte of a four-byte sequence.
111110xx - First byte of a five-byte sequence.
1111110x - First byte of a six-byte sequence.
There are no seven or larger byte sequences. The latest version of UTF-8 only allows UTF-8 sequences up to 4 bytes in length, which would leave 0xf8-0xff unused, but is possible though that a byte sequence could be validly called UTF-8 according to an obsolete version and include octets in 0xf8-0xfb.

What about using one of the UTF-8 control characters?
You can choose one from http://www.utf8-chartable.de/

Related

UTF16 BIG ENDIAN to UTF8 conversion for failed for 0xdcf0

I am trying to convert a UTF16 to UTF8. For string 0xdcf0, the conversion failed with invalid multi byte sequence. I don't understand why the conversion fails. In the library I am using to do utf-16 to utf-8 conversion, there is a check
if (first_byte & 0xfc == 0xdc) {
return -1;
}
Can you please help me understand why this check is present.
Unicode characters in the DC00–DFFF range are "low" surrogates, i.e. are used in UTF-16 as the second part of a surrogate pair, the first part being a "high" surrogate character in the range D800–DBFF.
See e.g. Wikipedia article UTF-16 for more information.
The reason you cannot convert to UTF-8, is that you only have half a Unicode code point.
In UTF-16, the two byte sequence
DCFO
cannot begin the encoding of any character at all.
The way UTF-16 works is that some characters are encoded in 2 bytes and some characters are encoded in 4 bytes. The characters that are encoded with two bytes use 16-bit sequences in the ranges:
0000 .. D7FF
E000 .. FFFF
All other characters require four bytes to be encoded in UTF-16. For these characters the first pair of bytes must be in the range
D800 .. DBFF
and the second pair of bytes must be in the range
DC00 .. DFFF
This is how the encoding scheme is defined. See the Wikipedia page for UTF-16.
Notice that the FIRST sixteen bits of an encoding of a character can NEVER be in DC00 through DFFF. It is simply not allowed in UTF-16. This is (if you follow the bitwise arithmetic in the code you found), exactly what is being checked for.

Given a UTF-8 string, can I treat it as a string of bytes when searching for ASCII characters?

Since a string in many modern languages now are sequence of unicode character, it can span more than a single byte. But, If I only care about some ascii character, is it safe to treat string as sequences of byte (assuming the given string is a sequence of valid unicode characters)?
Yes.
From Wikipedia:
[...] ASCII bytes do not occur when encoding non-ASCII code points into UTF-8 [...]
Moreover, 7-bit bytes (bytes where the most significant bit is 0) never appear in a multi-byte sequence, and no valid multi-byte sequence decodes to an ASCII code-point. [...] Therefore, the 7-bit bytes in a UTF-8 stream represent all and only the ASCII characters in the stream. Thus, many [programs] will continue to work as intended by treating the UTF-8 byte stream as a sequence of single-byte characters, without decoding the multi-byte sequences.
From utf8everywhere.org:
By design of this encoding, UTF-8 guarantees that an ASCII character value or a substring will never match a part of a multi-byte encoded character.
This is visualized nicely by this table from Wikipedia:
Number of bytes Byte 1 Byte 2 Byte 3 Byte 4
1 0xxxxxxx
2 110xxxxx 10xxxxxx
3 1110xxxx 10xxxxxx 10xxxxxx
4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
All ASCII characters seen as a 8-bit byte have the most significant bit set to 0. But in multi-byte encoded characters, all bytes have the MSB set to 1.
Note that UTF8 is one encoding of Unicode. They are not the same! My answer talks about UTF8 encoded strings (which luckily is the most prominent encoding).
An additional thing to be aware of is Unicode normalization, combining characters and other characters that "kind of" contain an ASCII character. Take the Umlaut ä for example:
ä 0xC3A4 LATIN SMALL LETTER A WITH DIAERESIS
ä 0x61CC88 LATIN SMALL LETTER A + COMBINING DIAERESIS
If you search for the ASCII character 'a', you will find it in the second line, but not in the first one, despite the lines logically containing the same "user perceived characters". You can tackle this at least partially by normalizing your strings beforehand.
It is helpful to drop the notion of a unicode character and rather talk about a unicode codpoint (for example U+0065: 'LATIN SMALL LETTER E') and different encodings (ASCII, UTF-8, UTF-16, etc). You are asking about properties of the UTF-8 encoding. In the case of UTF-8: code points below U+0080 have the same encoding as ASCII. The wikipedia page has a nice table
Number Bits for First Last Byte 1 Byte 2
of bytes code point code point code point
1 7 U+0000 U+007F 0xxxxxxx
2 11 U+0080 U+07FF 110xxxxx 10xxxxxx
...
Talking about strings in languages is too broad in my opinion, because even if you language stores string values in some specified encoding you can still receive input in a different encoding. (Think of a Java program (which uses UTF-16 for the internal representation. You can still serialize a string as UTF-8 or get user input which is encoded in ASCII.)

Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?

Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?
Also known as \xAE
A Unicode character as such does not have any length in bytes. It is the character encoding that matters. You know the length of a character in bytes in a specific encoding from the definition of the encoding.
For example, in the ISO-8859-1 (ISO Larin 1) encoding, which encodes just a small subset of Unicode, including “®”, every character is 1 byte long.
In the UTF-16 encoding, all characters are either 2 or 4 bytes long, and characters in the range U+0000...U+FFFF, such as “®”, are 2 bytes
In the UTF-32 encoding, all characters are 4 bytes long.
In the UTF-8 encoding, characters take 1 to 4 bytes. A simple way to check this out is to use the Fileformat.info Character search (though this is not normative information, just a nice quick reference). E.g., the page about U+00AE shows the character in some encodings, including 0xC2 0xAE (that is, 2 bytes) in UTF-8.
It is unicode number U+00AE. It's in the range [0x80, 0x7ff] so in UTF-8 it'll be encoded as two bytes — the table at the top of the Wikipedia article explains in more detail*.
If you were using UTF-16 it'd also be two bytes, since no continuation is necessary.
(* my summary though: one of the features of UTF-8 is that you can jump midway into a byte stream and synchronise with the text without generating any spurious characters, because you can tell whether any byte is a continuation character without further context.
An unavoidable side effect is that only the 7-bit ASCII characters fit into a single byte and everything else takes multiple bytes. 0xae is sufficiently close to the 7-bit range to require only one extra byte. See Wikipedia for specifics.)

How to print escaped hexadecimal in a string in C++?

I have questions related to Unicode, printing escaped hexadecimal values in const char*.
From what I have understood, utf-8 includes 2-, 3- or 4-byte characters, ranging from pound symbol to Kanji characters. Within strings these are represented in hexadecimal values using \u as escape sequence. Also I have understood while using hexadecimal escape in a string, the characters whose value can be included in the escape will be included. For example say "abc\x0f0dab" will include 0f0dab to be considered within \x as hex even though you want only 0f0d to be considered.
Now while writing a Unicode string, say you want to write "abc𤭢def₤ghi", where Unicode for 𤭢 is 0x24B62 and ₤ is 0x00A3. So I will have to compose the string as "abc0x24B62def0x00A3ghi". The 0x will consider all values that can be included in it. So if you want to print "abc𤭢62" the string will be "abc0x24B6262". Won't the entire string be taken as a 4-byte unicode (0x24B6262) value considered within 0x? How to solve this? How to print "abc𤭢62" and not abc(0x24B6262)?
I have a string const char* tmp = "abc\x0fdef";. When I print using printf("\n string = %s", tmp); then it will print abcdef. Where is 0f here? I know the decimal value of \x0f will be stored in the string, i.e. 15, so when we try to print, 15 should be printed right? I mean, it should be "abc15def" but it prints only "abcdef".
I think you may be unfamiliar with the concept of encodings, from reading your post.
For instance, you say "unicode of ... ₤ is 0x00A3". That is true - unicode codepoint U+00A3 is the pound sign. But 0x00A3 is not how you represent the pound sign in, for example, UTF-8 (a particular common encoding of Unicode). Take a look here to see what I mean. As you can see, the UTF-8 encoding of U+00A3 is the two bytes is 0xc2, 0xa3 (in that order).
There are several things that happen between your call to printf() and when something appears on your screen.
First, your program runs the code printf("abc\x0fdef"), and that means that the following bytes in order, are written to stdout for your program:
0x61, 0x62, 0x63, 0x0f, 0x64, 0x65, 0x66
Note: I'm assuming your source code is ASCII (or UTF-8), which is very common. Technically, the interpretation of your source code's character set is implementation-defined, I believe.
Now, in order to see output, you will typically be running this program inside some kind of shell, and it has to eventually transform those bytes into visual characters. It does this by using an encoding. Again, something ASCII-compatible is common, such as UTF-8. On Windows, CP1252 is common.
And if that is the case, you get the following mapping:
0x61 - a
0x62 - b
0x63 - c
0x0f - the 'shift in' ASCII control code
0x64 - d
0x65 - e
0x66 - f
This prints out as "abcdef" because the 'shift in' control code is a non-printing character.
Note: The above can change depending on what exact character sets are involved, but ASCII or UTF-8 is very likely what you're dealing with unless you have an exotic setup.
If you have a UTF-8 compatible terminal, the following should print out "abc₤def", just as an example to get you started:
printf("abc\xc2\xa3def");
Make sense?
Update: To answer the question from your comment: you need to distinguish between a codepoint and the byte values for an encoding of that codepoint.
The Unicode standard defines 'codepoints' which are numerical values for characters. These are commonly written as U+XYZ where XYZ is a hexidecimal value.
For instance, the character U+219e is LEFTWARDS TWO HEADED ARROW.
This might also be written 0x219e. You would know from context that the writer is talking about a codepoint.
When you need to encode that codepoint (to print, or save to file, etc), you use an encoding, such as UTF-8. Note, if you used, for example, the UTF-32 encoding, every codepoint corresponds exactly to the encoded value. So in UTF-32, the codepoint U+219e would indeed be encoded simply as 0x219e. But other encodings will do things differently. UTF-8 will encode U+219e as the three bytes 0xE2 0x86 0x9E.
Lastly, the \x notation is simply how you write arbitrary byte values inside a C/C++ quoted string. If I write, in C source code, "\xff", then that string in memory will be the two bytes 0xff 0x00 (since it automatically gets a null terminator).

Checking Unicode string for whitespace - byte for byte!

Quick & dirty Q: Can I safely assume that a byte of a UTF-8, UTF-16 or UTF-32 codepoint (character) will not be an ASCII whitespace character (unless the codepoint is representing one)?
I'll explain:
Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc - Unicode defines some more whitespace characters, but forget about them).
So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.
Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?
UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.
As for UTF-16 and UTF-32 I doubt it'll work at all, but I barely know anything about the details of these, so feel free to surprise me there...
The reason for this whacky question is that I have code checking for whitespace that works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.
For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high-bit set and all ASCII characters have the high bit unset.
Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work in another language).
In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.
However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C8, it's the first byte of a two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces. In order to not break existing code (which used stuff like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.
Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.
For UTF-8, any continuation bytes always start with the bits 10, making them greater than 0x7f, no there's no chance they could be mistaken for a ASCII space.
You can see this in the following table:
Range Encoding Binary value
----------------- -------- --------------------------
U+000000-U+00007f 0xxxxxxx 0xxxxxxx
U+000080-U+0007ff 110yyyxx 00000yyy xxxxxxxx
10xxxxxx
U+000800-U+00ffff 1110yyyy yyyyyyyy xxxxxxxx
10yyyyxx
10xxxxxx
U+010000-U+10ffff 11110zzz 000zzzzz yyyyyyyy xxxxxxxx
10zzyyyy
10yyyyxx
10xxxxxx
You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
See wikipedia UTF-8 for more detail.
UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either.
See wikipedia UTF-16 for more detail.
Finally, UTF-32 (wikipedia link here) is big enough to represent all of the Unicode code points so no special encoding is required.
It is strongly suggested not to work against bytes when dealing with Unicode. The two major platforms (Java and .Net) support unicode natively and also provide a mechanism for determining these kind of things. For e.g. In Java you can use Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.