Are the first 128 characters of UTF-8 and ASCII identical?

Are the first 128 characters of UTF-8 and ASCII identical?
UTF-8 table
ASCII table

Yes. This was an intentional design choice in UTF-8, so that existing 7-bit ASCII text is also valid UTF-8.
The encoding is also designed intentionally so that 7-bit ASCII values cannot mean anything except their ASCII equivalent. For example, in UTF-16, the Euro symbol (€) is encoded as 0x20 0xAC. But 0x20 is SPACE in ASCII. So if an ASCII-only algorithm tries to space-delimit a string like "€ 10" encoded in UTF-16, it'll corrupt the data.
This can't happen in UTF-8. € is encoded there as 0xE2 0x82 0xAC, none of which are legal 7-bit ASCII values. So an ASCII algorithm that naively splits on the ASCII SPACE (0x20) will still work, even though it doesn't know anything about UTF-8 encoding. (The same is true for any ASCII character like slash, comma, backslash, percent, etc.) UTF-8 is an incredibly clever text encoding.
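A quick Python sketch of that space-splitting example (illustrative only; it just encodes "€ 10" both ways and splits on the ASCII space byte):

# Split "€ 10" on the ASCII space byte (0x20) in UTF-8 and UTF-16-BE.
text = "€ 10"

utf8 = text.encode("utf-8")       # e2 82 ac 20 31 30
utf16 = text.encode("utf-16-be")  # 20 ac 00 20 00 31 00 30

# UTF-8: the only 0x20 byte is the real space, so a naive split is safe.
print(utf8.split(b"\x20"))   # [b'\xe2\x82\xac', b'10']

# UTF-16-BE: the euro sign itself contains a 0x20 byte, so the split corrupts it.
print(utf16.split(b"\x20"))  # [b'', b'\xac\x00', b'\x001\x000']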

Related

Given a UTF-8 string, can I treat it as a string of bytes when searching for ASCII characters?

Since strings in many modern languages are sequences of Unicode characters, a single character can span more than one byte. But if I only care about certain ASCII characters, is it safe to treat the string as a sequence of bytes (assuming the given string is a sequence of valid Unicode characters)?
Yes.
From Wikipedia:
[...] ASCII bytes do not occur when encoding non-ASCII code points into UTF-8 [...]
Moreover, 7-bit bytes (bytes where the most significant bit is 0) never appear in a multi-byte sequence, and no valid multi-byte sequence decodes to an ASCII code-point. [...] Therefore, the 7-bit bytes in a UTF-8 stream represent all and only the ASCII characters in the stream. Thus, many [programs] will continue to work as intended by treating the UTF-8 byte stream as a sequence of single-byte characters, without decoding the multi-byte sequences.
From utf8everywhere.org:
By design of this encoding, UTF-8 guarantees that an ASCII character value or a substring will never match a part of a multi-byte encoded character.
This is visualized nicely by this table from Wikipedia:
Number of bytes | Byte 1   | Byte 2   | Byte 3   | Byte 4
1               | 0xxxxxxx |          |          |
2               | 110xxxxx | 10xxxxxx |          |
3               | 1110xxxx | 10xxxxxx | 10xxxxxx |
4               | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
All ASCII characters, seen as an 8-bit byte, have the most significant bit set to 0. But in multi-byte encoded characters, every byte has the MSB set to 1.
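As a small Python sketch of why this matters (the string and the '/' delimiter are just made-up examples), a byte-level search for an ASCII character can never land inside a multi-byte sequence:

# Byte-level search for an ASCII delimiter inside UTF-8 data.
s = "bücher/arkiv"        # 'ü' is a multi-byte character (0xC3 0xBC)
data = s.encode("utf-8")

# '/' (0x2F) can never match a byte of 'ü', because every byte of a
# multi-byte sequence has its most significant bit set.
idx = data.index(b"/")
print(idx, data[:idx].decode("utf-8"))   # 7 bücher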
Note that UTF-8 is one encoding of Unicode; they are not the same thing! My answer talks about UTF-8 encoded strings (which, luckily, is the most prominent encoding).
An additional thing to be aware of is Unicode normalization, combining characters and other characters that "kind of" contain an ASCII character. Take the Umlaut ä for example:
ä 0xC3 0xA4       LATIN SMALL LETTER A WITH DIAERESIS
ä 0x61 0xCC 0x88  LATIN SMALL LETTER A + COMBINING DIAERESIS
If you search for the ASCII character 'a', you will find it in the second line, but not in the first one, despite the lines logically containing the same "user perceived characters". You can tackle this at least partially by normalizing your strings beforehand.
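Here is a small Python sketch of that normalization pitfall (using the standard unicodedata module; the strings are the two ä forms above):

# Whether a byte-level search for ASCII 'a' matches depends on normalization.
import unicodedata

precomposed = "\u00E4"   # ä as one code point (0xC3 0xA4 in UTF-8)
decomposed = "a\u0308"   # 'a' + COMBINING DIAERESIS (0x61 0xCC 0x88 in UTF-8)

print(b"a" in precomposed.encode("utf-8"))   # False
print(b"a" in decomposed.encode("utf-8"))    # True

# Normalizing both strings to NFD first makes the search behave consistently.
for s in (precomposed, decomposed):
    print(b"a" in unicodedata.normalize("NFD", s).encode("utf-8"))  # True, True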
It is helpful to drop the notion of a "Unicode character" and instead talk about a Unicode code point (for example U+0065: 'LATIN SMALL LETTER E') and different encodings (ASCII, UTF-8, UTF-16, etc.). You are asking about properties of the UTF-8 encoding. In the case of UTF-8: code points below U+0080 have the same encoding as ASCII. The Wikipedia page has a nice table:
Number of bytes | Bits for code point | First code point | Last code point | Byte 1   | Byte 2
1               | 7                   | U+0000           | U+007F          | 0xxxxxxx |
2               | 11                  | U+0080           | U+07FF          | 110xxxxx | 10xxxxxx
...
Talking about strings in programming languages is too broad in my opinion, because even if your language stores string values in some specified encoding, you can still receive input in a different encoding. (Think of a Java program, which uses UTF-16 for its internal representation: you can still serialize a string as UTF-8, or get user input that is encoded in ASCII.)

ASCII-compatible and non-ASCII-compatible character encodings

What is an example of a character encoding which is not compatible with ASCII and why isn't it?
Also, what other encodings are upward-compatible with ASCII (apart from UTF and ISO 8859, which I already know about), and for what reason?
There are EBCDIC-based encodings that are not compatible with ASCII. For example, I recently encountered an email that was encoded using CP1026, aka EBCDIC 1026. If you look at its character table, letters and numbers are encoded at very different offsets than in ASCII. This was throwing off my email parser, particularly because LF is encoded as 0x25 rather than 0x0A as it is in ASCII.
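You can reproduce the effect in a few lines of Python (a sketch that relies on Python's built-in cp1026 codec; the sample text is made up):

# CP1026 (an EBCDIC code page) puts basic characters at different byte values
# than ASCII, which is what breaks ASCII-based parsers.
line = "A1\n"

print(line.encode("ascii").hex(" "))    # 41 31 0a
print(line.encode("cp1026").hex(" "))   # c1 f1 25  (LF is 0x25, not 0x0A)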

Scandinavian characters when encoding to ASCII in PowerShell

I need to export some data using PowerShell to an ASCII-encoded file.
My problem is that Scandinavian characters like Æ, Ø and Å turn into ??? in the output file.
Example:
$str = "ÆØÅ"
$str | Out-File C:\test\test.txt -Encoding ascii
In the output file the result of this is: ???
It seems as though you have conflicting requirements:
Save the text in ASCII encoding
Save characters outside the ASCII character range
ASCII encoding does not support the characters you mention, which is the reason they do not work as you expect them to. The MSDN documentation on ASCII Encoding states that:
ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F.
And also further that
If your application requires 8-bit encoding (which is sometimes incorrectly referred to as "ASCII"), the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable. Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit.
You can read more about ASCII encoding on the Wikipedia page regarding ASCII Encoding (this page also includes tables showing all possible ASCII characters and control codes).
You need to either use a different encoding (such as UTF-8) or accept that you can't use characters which fall outside the ASCII range.
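The same loss is easy to demonstrate outside PowerShell; here is a Python sketch of what an ASCII encoder does with those characters compared to UTF-8:

# Lossy ASCII encoding versus UTF-8 for "ÆØÅ".
s = "ÆØÅ"

print(s.encode("ascii", errors="replace"))   # b'???'  (the data is gone)
print(s.encode("utf-8"))                     # b'\xc3\x86\xc3\x98\xc3\x85'
print(s.encode("utf-8").decode("utf-8"))     # ÆØÅ  (round-trips without loss)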

Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?

Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?
Also known as \xAE
A Unicode character as such does not have any length in bytes. It is the character encoding that matters. You know the length of a character in bytes in a specific encoding from the definition of the encoding.
For example, in the ISO-8859-1 (ISO Latin 1) encoding, which encodes just a small subset of Unicode, including “®”, every character is 1 byte long.
In the UTF-16 encoding, all characters are either 2 or 4 bytes long, and characters in the range U+0000...U+FFFF, such as “®”, are 2 bytes long.
In the UTF-32 encoding, all characters are 4 bytes long.
In the UTF-8 encoding, characters take 1 to 4 bytes. A simple way to check this out is to use the Fileformat.info Character search (though this is not normative information, just a nice quick reference). E.g., the page about U+00AE shows the character in some encodings, including 0xC2 0xAE (that is, 2 bytes) in UTF-8.
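A quick Python sketch that makes the same point (the byte counts are properties of the encodings, not of the character itself):

# The size of ® depends entirely on the encoding chosen.
c = "\u00AE"   # ® REGISTERED SIGN

for enc in ("latin-1", "utf-8", "utf-16-be", "utf-32-be"):
    data = c.encode(enc)
    print(f"{enc:10} {len(data)} byte(s): {data.hex(' ')}")

# latin-1    1 byte(s): ae
# utf-8      2 byte(s): c2 ae
# utf-16-be  2 byte(s): 00 ae
# utf-32-be  4 byte(s): 00 00 00 ae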
It is Unicode code point U+00AE. It's in the range [0x80, 0x7FF], so in UTF-8 it'll be encoded as two bytes; the table at the top of the Wikipedia article explains this in more detail*.
If you were using UTF-16 it'd also be two bytes, since no surrogate pair is necessary.
(* my summary though: one of the features of UTF-8 is that you can jump midway into a byte stream and synchronise with the text without generating any spurious characters, because you can tell whether any byte is a continuation character without further context.
An unavoidable side effect is that only the 7-bit ASCII characters fit into a single byte and everything else takes multiple bytes. 0xAE is only just outside the 7-bit range, so it requires just one extra byte. See Wikipedia for the specifics.)
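That self-synchronisation property is easy to check in code; here is a short Python sketch (the helper name is my own):

# A UTF-8 continuation byte is recognisable from its top two bits (10xxxxxx)
# without any surrounding context, which is what makes resynchronisation work.
def is_continuation(byte: int) -> bool:
    return (byte & 0xC0) == 0x80

data = "a®€".encode("utf-8")   # 61 c2 ae e2 82 ac
print([hex(b) for b in data if not is_continuation(b)])
# ['0x61', '0xc2', '0xe2']  -- exactly one lead byte per character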

UTF-8 and encoding

I have a Unicode string, "hao123--我的上网主页"; as a UTF-8 encoded C++ string it displays as "hao123锛嶏紞鎴戠殑涓婄綉涓婚〉". But I need to write it to a file in this format: "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". How can I do that? I know little about this encoding. Can anyone help? Thanks!
You seem to be mixing up UTF-8 and UTF-16 (or possibly UCS-2). UTF-8 encoded characters have a variable length of 1 to 4 bytes. In contrast, you seem to want to write UTF-16 or UCS-2 to your files (I am guessing this from the \uxxxx character references in your file output string).
For an overview of these character sets, have a look at Wikipedia's article on UTF-8 and browse from there.
Here are the very basics (heavily simplified):
UCS-2 stores all characters as exactly 16 bits. It therefore cannot encode all Unicode characters, only the so-called "Basic Multilingual Plane".
UTF-16 stores the most frequently used characters in 16 bits, but some characters must be encoded in 32 bits.
UTF-8 encodes characters with a variable length of 1 to 4 bytes. Only characters from the original 7-bit ASCII charset are encoded as 1 byte.
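To actually produce the \uXXXX form the question asks for, one approach (shown here as a Python sketch; the same loop translates directly to C++) is to escape every code point above 0x7F and leave plain ASCII alone. Note that the expected output in the question uses U+FF0D fullwidth hyphens, so those are used here rather than ASCII '-':

# Escape every non-ASCII code point as \uXXXX, leaving ASCII untouched.
# (Code points above U+FFFF would need a different escape; not relevant here.)
s = "hao123－－我的上网主页"   # '－' is U+FF0D FULLWIDTH HYPHEN-MINUS

escaped = "".join(
    ch if ord(ch) < 0x80 else "\\u{:04X}".format(ord(ch))
    for ch in s
)
print(escaped)   # hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875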