What is an example of a character encoding which is not compatible with ASCII and why isn't it?
Also, what are other encodings which are upward compatible with ASCII (apart from UTF and ISO 8859, which I already know about), and for what reason?
There are EBCDIC-based encodings that are not compatible with ASCII. For example, I recently encountered an email that was encoded using CP1026, aka EBCDIC 1026. If you look at its character table, letters and numbers are encoded at very different offsets than in ASCII. This was throwing off my email parser, particularly because LF is encoded as 0x25 rather than 0x0A as it is in ASCII.
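As a quick illustration, here is a PowerShell sketch, assuming Windows PowerShell where the EBCDIC code pages such as 1026 ship with the .NET Framework:

# Compare how a line feed is encoded in ASCII vs. EBCDIC code page 1026.
$ascii  = [System.Text.Encoding]::ASCII
$cp1026 = [System.Text.Encoding]::GetEncoding(1026)
"ASCII LF:  0x{0:X2}" -f $ascii.GetBytes("`n")[0]    # 0x0A
"CP1026 LF: 0x{0:X2}" -f $cp1026.GetBytes("`n")[0]   # 0x25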
I am a very stupid Programme Manager and I have a client requesting that we send files in either ASCII or ANSI encoding.
Our programmers have used Unicode (UTF-16), so my question is whether Unicode (UTF-16) is compatible with ASCII or ANSI. Or am I understanding this incorrectly? Do we need to change the encoding?
We haven't tried anything yet.
In short: ASCII encoding contains 128 characters. ANSI encoding contains 256 characters. UTF-16 encoding has the capacity for 1,112,064 character codes. There is some nuance, such as the number of bytes used to store each character, but I don't think that is relevant here.
You can certainly convert a UTF-16 document down to ANSI or ASCII encoding, but any characters outside the target character set will be lost (typically replaced with a substitute character such as '?', or dropped, depending on the converter).
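As a minimal PowerShell/.NET sketch (the exact substitute character depends on the converter; .NET's ASCII encoder happens to use '?'):

# Characters outside ASCII are replaced during the conversion.
$bytes = [System.Text.Encoding]::ASCII.GetBytes("€10")
[System.Text.Encoding]::ASCII.GetString($bytes)   # "?10" - the euro sign is lost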
For you, as a manager, there are some questions. At minimum:
Why does the client need this particular encoding? Can it be accommodated in some other way?
Are any characters in your data beyond the scope of ASCII/ANSI? Most (all?) programming languages provide a way to retrieve an integer representation of a character and determine whether it is beyond the range of the desired encoding. This could be leveraged to discover how many instances exist of characters that are not compatible with the desired encoding (see the sketch after this list).
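A rough sketch of that check in PowerShell (the sample string is made up; $text would be your real data):

# Count and list the characters whose code points fall outside 7-bit ASCII (> 127).
$text = "Æblegrød for €10"   # hypothetical sample data
$nonAscii = $text.ToCharArray() | Where-Object { [int]$_ -gt 127 }
"{0} character(s) outside the ASCII range: {1}" -f $nonAscii.Count, ($nonAscii -join ' ')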
Are the first 128 characters of UTF-8 and ASCII identical?
UTF-8 table
ASCII table
Yes. This was an intentional choice in the design of UTF-8 so that existing 7-bit ASCII would be compatible.
The encoding is also designed intentionally so that 7-bit ASCII values cannot mean anything except their ASCII equivalent. For example, in UTF-16, the Euro symbol (€) is encoded as 0x20 0xAC. But 0x20 is SPACE in ASCII. So if an ASCII-only algorithm tries to space-delimit a string like "€ 10" encoded in UTF-16, it'll corrupt the data.
This can't happen in UTF-8. € is encoded there as 0xE2 0x82 0xAC, none of which are legal 7-bit ASCII values. So an ASCII algorithm that naively splits on the ASCII SPACE (0x20) will still work, even though it doesn't know anything about UTF-8 encoding. (The same is true for any ASCII character like slash, comma, backslash, percent, etc.) UTF-8 is an incredibly clever text encoding.
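You can inspect those byte sequences directly, for example in PowerShell (BigEndianUnicode is used here because the 0x20 0xAC example above assumes big-endian UTF-16):

# The euro sign's bytes under big-endian UTF-16 vs. UTF-8.
$euro = "€"
([System.Text.Encoding]::BigEndianUnicode.GetBytes($euro) | ForEach-Object { "0x{0:X2}" -f $_ }) -join ' '   # 0x20 0xAC
([System.Text.Encoding]::UTF8.GetBytes($euro) | ForEach-Object { "0x{0:X2}" -f $_ }) -join ' '               # 0xE2 0x82 0xAC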
I need to export some data using PowerShell to an ASCII-encoded file.
My problem is that Scandinavian characters like Æ, Ø and Å turn into ? ? ? in the output file.
Example:
$str = "ÆØÅ"
$str | Out-File C:\test\test.txt -Encoding ascii
In the output file the result of this is: ???
It seems as though you have conflicting requirements.
Save the text in ASCII encoding
Save characters outside the ASCII character range
ASCII encoding does not support the characters you mention, which is the reason they do not work as you expect them to. The MSDN documentation on ASCII Encoding states that:
ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F.
And also further that
If your application requires 8-bit encoding (which is sometimes incorrectly referred to as "ASCII"), the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable. Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit.
You can read more about ASCII encoding on the Wikipedia page regarding ASCII Encoding (this page also includes tables showing all possible ASCII characters and control codes).
You need to either use a different encoding (such as UTF-8) or accept that you can't use characters which fall outside the ASCII range.
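If whatever consumes the file can accept UTF-8, the original example only needs the encoding switched (a sketch, reusing the path from the question):

$str = "ÆØÅ"
# Write UTF-8 instead of ASCII so the Scandinavian characters survive.
$str | Out-File C:\test\test.txt -Encoding utf8

Note that in Windows PowerShell, utf8 writes a byte order mark at the start of the file, which some consumers may not expect.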
Was reading Joel Spolsky's 'The Absolute Minimum' about character encoding.
It is my understanding that ASCII is a Code-point + Encoding scheme, and in modern times, we use Unicode as the Code-point scheme and UTF-8 as the Encoding scheme. Is this correct?
In modern times, ASCII is effectively a subset of UTF-8 rather than a scheme of its own. UTF-8 is backwards compatible with ASCII.
Yes, except that UTF-8 is itself an encoding scheme. Other encoding schemes include UTF-16 (with two different byte orders) and UTF-32. (Adding to the confusion, the UTF-16 scheme is called “Unicode” in Microsoft software.)
And, to be exact, the American National Standard that defines ASCII specifies a collection of characters and their coding as 7-bit quantities, without specifying a particular transfer encoding in terms of bytes. In the past, it was used in different ways, e.g. so that five ASCII characters were packed into one 36-bit storage unit, or so that 8-bit bytes used the extra bit for checking purposes (as a parity bit) or for transfer control. Nowadays, ASCII is used so that one ASCII character is encoded as one 8-bit byte with the high-order bit set to zero. This is the de facto standard encoding scheme and is implied in a large number of specifications, but strictly speaking it is not part of the ASCII standard.
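A quick PowerShell sketch of both points, the two UTF-16 byte orders and the "one 8-bit byte with the high bit zero" convention for ASCII:

# 'A' (U+0041) under the two UTF-16 byte orders and under ASCII.
$le = [System.Text.Encoding]::Unicode.GetBytes("A")            # little-endian UTF-16
$be = [System.Text.Encoding]::BigEndianUnicode.GetBytes("A")   # big-endian UTF-16
$a  = [System.Text.Encoding]::ASCII.GetBytes("A")              # one byte, high bit zero
($le | ForEach-Object { "0x{0:X2}" -f $_ }) -join ' '   # 0x41 0x00
($be | ForEach-Object { "0x{0:X2}" -f $_ }) -join ' '   # 0x00 0x41
($a  | ForEach-Object { "0x{0:X2}" -f $_ }) -join ' '   # 0x41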
Unicode and ASCII both define code points plus an encoding scheme.
UTF-8 is a superset of ASCII, since it is backward compatible with ASCII.
Conversion and representation (in binary/hexadecimal) of a string:
A string is a sequence of graphemes (a "character" is, roughly, a simpler subset of that idea).
The sequence of graphemes (characters) is mapped to code points defined by the character set.
The code points are then encoded (converted) to binary/hex bytes using an encoding scheme: UTF-8/UTF-16/UTF-32 for Unicode, or ASCII's own single-byte encoding.
Unicode supports 1,112,064 valid code points (covering most of the graphemes from different languages); ASCII supports 128 code points (mostly English).
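To make the character, code point, bytes pipeline concrete, a small PowerShell sketch ('Å' is just an arbitrary example character):

# Character -> code point -> bytes under two different encoding schemes.
$ch = "Å"   # U+00C5
"Code point: U+{0:X4}" -f [char]::ConvertToUtf32($ch, 0)
([System.Text.Encoding]::UTF8.GetBytes($ch) | ForEach-Object { "0x{0:X2}" -f $_ }) -join ' '    # 0xC3 0x85
([System.Text.Encoding]::UTF32.GetBytes($ch) | ForEach-Object { "0x{0:X2}" -f $_ }) -join ' '   # 0xC5 0x00 0x00 0x00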
I have a Unicode string, "hao123--我的上网主页", but the UTF-8 string in C++ displays as "hao123锛嶏紞鎴戠殑涓婄綉涓婚〉". I need to write it to a file in this format: "hao123\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". How can I do that? I know little about this encoding. Can anyone help? Thanks!
You seem to mix up UTF-8 and UTF-16 (or possibly UCS-2). UTF-8 encoded characters have a variable length of 1 to 4 bytes. Contrary to this, you seem to want to write UTF-16 or UCS-2 to your files (I am guessing this from the \uxxxx character references in your file output string).
For an overview of these character sets, have a look at Wikipedia's article on UTF-8 and browse from there.
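If the goal really is the \uXXXX form, those escapes are just the UTF-16 code units written as four hex digits. The question is about C++, but here is a rough PowerShell sketch of the mapping only (characters at or below 0x7F are left as-is; the sample string is shortened):

# Turn each UTF-16 code unit above 0x7F into a \uXXXX escape.
$s = "hao123我的"
-join ($s.ToCharArray() | ForEach-Object {
    if ([int]$_ -gt 0x7F) { "\u{0:X4}" -f [int]$_ } else { [string]$_ }
})
# hao123\u6211\u7684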
Here's some of the very basic basics (heavily simplified):
UCS-2 stores all characters as exactly 16 bits. It therefore cannot encode all Unicode characters, only the so-called "Basic Multilingual Plane".
UTF-16 stores the most frequently-used characters in 16 bits, but some characters must be encoded in 32 bits.
UTF-8 encodes characters with a variable length of 1 to 4 bytes. Only characters from the original 7-bit ASCII charset are encoded as 1 byte.
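A small sketch of those size differences in PowerShell (€ is a BMP character; the emoji is from a supplementary plane, which UCS-2 cannot represent at all):

# Byte counts for a BMP character and a supplementary-plane character.
foreach ($ch in "€", [char]::ConvertFromUtf32(0x1F600)) {
    $cp    = [char]::ConvertToUtf32($ch, 0)
    $utf16 = [System.Text.Encoding]::Unicode.GetBytes($ch).Length
    $utf8  = [System.Text.Encoding]::UTF8.GetBytes($ch).Length
    "U+{0:X4}: {1} bytes in UTF-16, {2} bytes in UTF-8" -f $cp, $utf16, $utf8
}
# U+20AC: 2 bytes in UTF-16, 3 bytes in UTF-8
# U+1F600: 4 bytes in UTF-16, 4 bytes in UTF-8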