Difficulties inherent in ASCII and Extended ASCII, and Unicode compatibility?

What are the difficulties inherent in ASCII and Extended ASCII, and how are these difficulties overcome by Unicode?
Can someone explain Unicode compatibility to me?
And what do the terms associated with Unicode mean, such as Planes, Basic Multilingual Plane (BMP), Supplementary Multilingual Plane (SMP), Supplementary Ideographic Plane (SIP), Supplementary Special Plane (SSP) and Private Use Planes (PUP)?
I have found all these terms very confusing.

ASCII
ASCII was more or less the first character encoding ever. In the days when a byte was very expensive and 1 MHz was extremely fast, only the characters which appeared on those ancient US typewriters (as well as on the average US International keyboard nowadays) were covered by the charset of the ASCII character encoding. This includes the complete Latin alphabet (A-Z, in both lowercase and uppercase), the numeral digits (0-9), punctuation (space, period, comma, colon, etcetera) and some special characters (the at sign, the hash sign, the dollar sign, etcetera), plus a set of control characters. All those characters fit in 7 bits, half of the room a byte provides, for a total of 128 code points.
Extended ASCII and ISO 8859
Later the remaining bit of the byte was used for Extended ASCII, which provides room for a total of 256 characters. Most of the extra room was used for special characters, such as diacritical characters and line-drawing characters. But because everyone used that extra room in their own way (IBM, Commodore, universities, organizations, etcetera), the results were not interchangeable: characters originally encoded using encoding X show up as mojibake when they are decoded using a different encoding Y. Later, ISO came up with standard character encoding definitions for 8-bit ASCII extensions, resulting in the well-known ISO 8859 family of encodings built on top of ASCII, such as ISO 8859-1, so that text became better interchangeable.
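A quick way to see the interchangeability problem in practice is to decode the same byte with two different 8-bit encodings. A tiny Python sketch, with Latin-1 and the old IBM PC code page 437 chosen purely as an illustration:

# The single byte 0xE9 means different things in different 8-bit encodings.
data = bytes([0xE9])
print(data.decode('latin-1'))   # 'é' in ISO 8859-1
print(data.decode('cp437'))     # 'Θ' in the old IBM PC code page -> mojibake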
Unicode
8 bits may be enough for languages using the Latin alphabet, but it is certainly not enough for the remaining non-Latin scripts of the world, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etcetera, let alone enough to include them all in only 8 bits. Those communities developed their own non-ISO character encodings, which were, again, not interchangeable: Guobiao, Big5, JIS, KOI, MIK, TSCII, etcetera. Finally a new character encoding standard, built on top of ISO 8859-1, was established to cover all of the characters used around the world, so that text is interchangeable everywhere: Unicode. It provides room for over a million characters, of which currently roughly 10% are assigned. The UTF-8 character encoding is based on Unicode.
Unicode Planes
The Unicode characters are categorized in seventeen planes, each providing room for 65536 characters (16 bits).
Plane 0: Basic Multilingual Plane (BMP), it contains characters of all modern languages known in the world.
Plane 1: Supplementary Multilingual Plane (SMP), it contains historic languages/scripts as well as musical notation and mathematical symbols.
Plane 2: Supplementary Ideographic Plane (SIP), it contains "special" CJK (Chinese/Japanese/Korean) characters, of which there are quite a lot, but which are very seldom used in modern writing. The "normal" CJK characters are already present in the BMP.
Planes 3-13: unused.
Plane 14: Supplementary Special Plane (SSP), so far it contains only some tag characters and variation selectors. The tag characters were deprecated for a time, but they have since been repurposed for emoji tag sequences (for example, regional flags). The variation selectors act as a kind of metadata added after an existing character, which can instruct the renderer to give that character a slightly different glyph.
Planes 15-16: Private Use Planes (PUP), they provide room for (major) organizations or user communities to use their own special characters or symbols by private agreement. For example, emoji (Japanese-style smilies/emoticons) were originally exchanged via private-use code points before they were standardized; today they live mostly in the BMP and SMP.
Usually, you would be only interested in the BMP and using UTF-8 encoding as the standard character encoding throughout your entire application.
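If you ever need to know which plane a code point belongs to, the plane number is simply the code point divided by 0x10000 (65536). A small Python sketch, purely illustrative (the helper name is my own):

def unicode_plane(ch):
    """Return the Unicode plane number (0-16) of a single character."""
    return ord(ch) >> 16   # same as ord(ch) // 0x10000

print(unicode_plane('A'))          # 0 -> BMP
print(unicode_plane('\U0001F600')) # 1 -> SMP (emoji live here)
print(unicode_plane('\U00020000')) # 2 -> SIP (a rare CJK ideograph)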

Related

Unicode scalar value in Swift

In Swift Programming Language 3.0, chapter on string and character, the book states
A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do not include the Unicode surrogate pair code points, which are the code points in the range U+D800 to U+DFFF inclusive.
What does this mean?
A Unicode scalar is any code point that is not itself a surrogate code point; the surrogates are reserved purely for use as UTF-16 code units.
A code point is the number resulting from encoding a character in the Unicode standard. For instance, the code point of the letter A is 0x41 (or 65 in decimal).
A code unit is each group of bits used in the serialisation of a code point. For instance, UTF-16 uses one or two code units of 16 bit each.
The letter A is a Unicode scalar; so is a character such as U+1F600, even though the latter is serialised in UTF-16 as two code units. Such a pair of code units is called a surrogate pair, and the code unit values it uses (0xD800 to 0xDFFF) are set aside as surrogate code points that will never be assigned to characters of their own. Thus a Unicode scalar may also be defined as: any code point except the surrogate code points themselves.
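To see this in practice, here is a small Python sketch (the musical symbol U+1D11E and the helper name are just my own illustrative choices): U+1D11E needs two UTF-16 code units yet is still a Unicode scalar, whereas the individual values D834 and DD1E never encode characters on their own.

import struct

def utf16_code_units(ch):
    """Return the UTF-16 code units of a single character as hex strings."""
    data = ch.encode('utf-16-be')                      # big-endian, no BOM
    units = struct.unpack('>%dH' % (len(data) // 2), data)
    return [format(u, '04X') for u in units]

print(utf16_code_units('A'))            # ['0041']         -> one code unit
print(utf16_code_units('\U0001D11E'))   # ['D834', 'DD1E'] -> surrogate pair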
The answer from courteouselk is correct, by the way; this is just a more plain-English version.
From Unicode FAQ:
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
Basically, surrogates are code points that are reserved for special purposes and promised never to encode a character on their own, only as members of a pair of UTF-16 code units (a high surrogate followed by a low surrogate).
[UPD] Also, from wikipedia:
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case, and Windows allows such sequences in filenames.

How is a high unicode codepoint expressed as two codepoints?

I've seen that >2 byte unicode codepoints like U+10000 can be written as a pair, like \uD800\uDC00. They seem to start with the nibble d, but that's all I've noticed.
What is that splitting action called and how does it work?
UTF-8 means (using my own words) that the minimum atom of processing is a byte (the code unit is one byte long). I don't know if this is the historical order, but at least conceptually speaking, the UCS-2 and UCS-4 encodings come first, and UTF-8/UTF-16 appeared to solve some of the problems of UCS-*.
UCS-2 means that each character uses 2 bytes instead of one. It's a fixed-length encoding: UCS-2 stores the bit string of each code point directly, as you say. The problem is that there are characters whose code points require more than 2 bytes to store. So UCS-2 can only handle a subset of Unicode (the range U+0000 to U+FFFF, of course).
UCS-4 uses 4 bytes for each character instead, and it's obviously large enough to store the bit string of any Unicode code point (the Unicode range is U+0000 to U+10FFFF).
The problem with UCS-4 is that characters outside the 2-byte range are very, very uncommon, and any text encoded using UCS-4 wastes a lot of space. So UCS-2 is a better approach, unless you need characters outside the 2-byte range.
But again, English texts, source code files and so on use mostly ASCII characters, and UCS-2 has the same problem: wasting too much space for texts which use mostly ASCII characters (too many useless zeros).
That is what UTF-8 addresses. Characters inside the ASCII range are saved in UTF-8 texts as-is, using exactly the bit string of the ASCII value of each character. So a UTF-8 encoded text that uses only ASCII characters is byte-for-byte identical to the same text encoded in ASCII or Latin-1. Clients without UTF-8 support can still handle UTF-8 texts that use only ASCII characters, because they look identical. It's a backward compatible encoding.
From then on (Unicode characters outside the ASCII range), UTF-8 texts use two, three or four bytes to save code points, depending on the character.
The exact method: the bit string of the code point is split over one, two, three or four bytes, using known bit prefixes that tell you how many bytes the character occupies. If a byte begins with 0, the character is ASCII and uses only 1 byte (the ASCII range is 7 bits long). If the first byte begins with 110, 1110 or 11110, the character uses two, three or four bytes respectively, and every following byte of that character begins with 10.
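A small Python sketch that makes those prefixes visible (the characters and the helper name are arbitrary choices of mine):

def show_utf8_bits(ch):
    """Show the UTF-8 bytes of a character in binary."""
    bits = ' '.join(format(b, '08b') for b in ch.encode('utf-8'))
    print(ch, '->', bits)

show_utf8_bits('A')    # A -> 01000001
show_utf8_bits('é')    # é -> 11000011 10101001
show_utf8_bits('€')    # € -> 11100010 10000010 10101100
show_utf8_bits('😀')   # 😀 -> 11110000 10011111 10011000 10000000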
The problem with UTF-8 is that it requires more processing (a decoder must examine the leading bits of each character to know its length), especially if the text is not English-like. For example, a text written in Greek will use mostly two-byte characters.
UTF-16 uses two-byte code units to solve that problem for non-ASCII texts. That means that the atoms of processing are 16-bit words. If a character doesn't fit in a single two-byte code unit, then it uses 2 code units (four bytes) to encode the character. That pair of code units is called a surrogate pair. I think a UTF-16 text using only characters inside the 2-byte range is byte-for-byte identical to the same text in UCS-2.
UTF-32, in turn, uses 4-byte code units, as UCS-4 does. In practice they are essentially the same format; UTF-32 is simply restricted to the valid Unicode range (up to U+10FFFF), whereas UCS-4 originally allowed a larger 31-bit space.
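As a rough, illustrative comparison of how much space each encoding uses per character, a small Python sketch (the -be variants are used only to leave out the BOM):

for ch in ('A', 'λ', '😀'):
    sizes = {enc: len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-be', 'utf-32-be')}
    print(ch, sizes)
# A  {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
# λ  {'utf-8': 2, 'utf-16-be': 2, 'utf-32-be': 4}
# 😀 {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}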
The complete picture, to clear up the confusion, is laid out below:
Referencing what I learned from the comments...
U+10000 is a Unicode code point (hexadecimal integer mapped to a character).
Unicode is a one-to-one mapping of code points to characters.
The inclusive range of code points from 0xD800 to 0xDFFF is reserved for UTF-16¹ (Unicode vs UTF) surrogate units (see below).
\uD800\uDC00² are two such surrogate units, called a surrogate pair. (A surrogate unit is a code unit that's part of a surrogate pair.)
Abstract representation: Code point (abstract character) --> Code unit (abstract UTF-16) --> Code unit (UTF-16 encoded bytes) --> Interpreted UTF-16
Actual usage example: Input data is bytes and may be wrapped in a second encoding, like ASCII for HTML entities and unicode escapes, or anything the parser handles --> Encoding interpreted; mapped to code point via scheme --> Font glyph --> Character on screen
How surrogate pairs work
Surrogate pair advantages:
There are only high and low units: a high must be followed by a low, so there is no confusing them.
UTF-16 can use 2 bytes for the first 63,488 code points (the BMP minus the 2,048 surrogate code points) because surrogates cannot be mistaken for ordinary code points.
The 2,048 surrogate code points split into 1,024 high and 1,024 low surrogates, and 1024 × 1024 = 1,048,576 additional code points can be addressed by pairing them.
The extra processing cost falls on the less frequently used characters.
¹ UTF-16 is the only UTF which uses surrogate pairs.
² This is formatted as a Unicode escape sequence.
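The splitting the question asks about is plain arithmetic: subtract 0x10000, put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate. A Python sketch of both directions, with hypothetical helper names, just to illustrate the formula:

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Combine a UTF-16 surrogate pair back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([format(u, '04X') for u in to_surrogate_pair(0x10000)])  # ['D800', 'DC00']
print(format(from_surrogate_pair(0xD800, 0xDC00), 'X'))        # 10000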
Keep reading:
How does UTF-8 "variable-width encoding" work?
Unicode, UTF, ASCII, ANSI format differences
Code point

What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?

What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?
They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly missing something here. Why the distinction?
The Unicode Standard, Chapter 3, D52
Combining character: A character with the General Category of Combining Mark (M).
Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me).
All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation.
These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non-joiner. The combining character is said to apply to that base character.
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character—for example, a carriage return, tab, or right-left mark. In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
The Unicode Standard, Chapter 3, D59
Grapheme extender: A character with the property Grapheme_Extend.
Grapheme extender characters consist of all nonspacing marks, zero width joiner, zero width non-joiner, U+FF9E, U+FF9F, and a small number of spacing marks.
A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character.
zero width joiner and zero width non-joiner are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.
The small number of spacing marks that have the property Grapheme_Extend are all the second parts of a two-part combining mark.
The set of characters with the Grapheme_Extend property and the set of characters with the Grapheme_Base property are disjoint, by definition.
The difference in actual usage is that combining characters are defined via the General Category, a rough classification of characters, while grapheme extenders are mainly used for UAX #29 text segmentation.
EDIT: Since you offered a bounty, I can elaborate a bit.
Combining characters are characters that can't be used as stand-alone characters but must be combined with another character. They're used to define combining character sequences.
Grapheme extenders were introduced in Unicode 3.2 to be used in Unicode Technical Report #29: Text Boundaries (then in a proposed status, now known as Unicode Standard Annex #29:
Unicode Text Segmentation). The main use is to define grapheme clusters. Grapheme clusters are basically user-perceived characters. According to UAX #29:
Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.
The main difference is that grapheme extenders don't include most of the spacing marks (the set is actually smaller than the set of combining characters). Most of the spacing marks are vowel signs for Asian scripts. In these scripts, vowels are sometimes written by modifying a consonant character. If this modification takes up horizontal space (a spacing mark), it was traditionally treated as a separate user-perceived character and so forms a new (legacy) grapheme cluster. In later versions of UAX #29, this was changed and extended grapheme clusters were introduced, where most, but not all, spacing marks do not break a cluster.
I think the key sentence from the standard is: "A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character." Combining characters, on the other hand, also include spacing marks that are applied to the left or right. There are a few exceptions, though (see the property Other_Grapheme_Extend).
Example
U+0995 BENGALI LETTER KA:
ক
U+09C0 BENGALI VOWEL SIGN II (combining character, but no grapheme extender):
ী
Combination of the two:
কী
This is a single combining character sequence consisting of two legacy grapheme clusters. The vowel sign can't be used by itself but it still counts as a legacy grapheme cluster. A text editor, for example, could allow placing the cursor between the two characters.
There are over 300 combining characters like this which do not extend graphemes, and four characters which are not combining but do extend graphemes.
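Python's unicodedata module doesn't expose the Grapheme_Extend property directly, but the General Category is a reasonable proxy for the distinction described above (all nonspacing (Mn) and enclosing (Me) marks are grapheme extenders, while most spacing marks (Mc) are not). A small sketch reusing the Bengali example, with the other characters chosen by me for contrast:

import unicodedata

for ch in ('\u09C0',   # BENGALI VOWEL SIGN II  -> Mc: combining, spacing, not a grapheme extender
           '\u0301',   # COMBINING ACUTE ACCENT -> Mn: combining, nonspacing, grapheme extender
           '\u200D'):  # ZERO WIDTH JOINER      -> Cf: not a combining mark, yet a grapheme extender
    print('U+%04X' % ord(ch), unicodedata.name(ch), unicodedata.category(ch))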
I’ve posted this question on the Unicode mailing list and got some more responses. I’ll post some of them here.
Tom Gewecke wrote:
I'm not an expert on this aspect of Unicode, but I understand that
"grapheme extender" is a finer distinction in character properties
designed to be used in certain specific and complex processes like
grapheme breaking. You might find this blog article helpful in seeing
where it comes into play:
http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-4.html
PS The answer by nwellnhof at StackOverflow is an excellent explanation of this issue in my view.
Philippe Verdy wrote:
Many grapheme extenders are not "combining characters". Combining
characters are classified this way for legacy reasons (the very weak
"general category" property) and this property is normatively
stabilized. As well most combining characters have a non-zero
combining class and they are stabilized for the purpose of
normalization.
Grapheme extenders include characters that are also NOT combining
characters but controls (e.g. joiners). Some grapheme clusters are also
more complex in some scripts (there are extenders encoded BEFORE the
base character; and they cannot be classified as combining characters
because combining characters are always encoded AFTER a base
character)
For legacy reasons (and roundtrip compatibility with older standards)
not all scripts are encoded using the UCS character model using
combining characters. (E.g. the Thai script; not following the
"logical" encoding order; but following the model used in TIS-620 and
other standards based on it; including for Windows, and *nix/*nux).
Richard Wordingham wrote:
Spacing combining marks (category Mc) are in general not grapheme
extenders. The ones that are included are mostly included so that the
boundaries between 'legacy grapheme clusters'
http://www.unicode.org/reports/tr29/tr29-23.html are invariant under
canonical equivalence. There are six grapheme extenders that are not
nonspacing (Mn) or enclosing (Me) and are not needed by this rule:
ZWNJ, ZWJ, U+302E HANGUL SINGLE DOT TONE MARK, U+302F HANGUL DOUBLE DOT TONE MARK, U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK, and U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK.
I can see that it will sometimes be helpful to keep ZWNJ and ZWJ along with
the previous base character. The fullwidth sound marks U+3099 and
U+309A are included for reasons of canonical equivalence, so it makes
sense to include their halfwidth versions.
I don't actually see the logic for including U+302E and U+302F. If
you're going to encourage forcing someone who's typed the wrong base
character before a sequence of 3 non-spacing marks to retype the lot,
you may as well do the same with Hangul tone marks.
May I quote from Yannis Haralambous' Fonts and Encodings, page 116f.:
The idea is that a script or a system of notation is sometimes too
finely divided into characters. And when we have cut constructs up
into characters, there is no way to put them back together again to
rebuild larger characters. For example, Catalan has the ligature
‘ŀl’. This ligature is encoded as two Unicode characters: an ‘ŀ’
0x0140 latin small letter l with middle dot and an ordinary ‘l’. But
this division may not always be what we want.
Suppose that we wish to
place a circumflex accent over this ligature, as we might well wish to
do with the ligatures ‘œ’ and ‘æ’. How can this be done in Unicode?
To allow users to build up characters in constructs that play the rôle
of new characters, Unicode introduced three new properties (grapheme
base, grapheme extension, grapheme link) and one new character:
0x034F combining grapheme joiner.
So the way I see it, this means that grapheme extenders are used to apply (for example) accents on characters that are themselves composed of several characters.

What's the difference between ASCII and Unicode?

What's the exact difference between Unicode and ASCII?
ASCII has a total of 128 characters (256 in the extended set).
Is there any size specification for Unicode characters?
ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 2^21 characters, which, similarly, map to numbers 0–2^21 (though not all numbers are currently assigned, and some are reserved).
Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".
Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.
Understanding why ASCII and Unicode were created in the first place helped me understand the differences between the two.
ASCII, Origins
As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*. Which means that we can represent 128 characters maximum.
Wait, 7 bits? But why not 1 byte (8 bits)?
The last (8th) bit was used as a parity bit, to help detect transmission errors.
This was relevant years ago.
Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.
See below the binary representation of a few characters in ASCII:
0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)
See the full ASCII table over here.
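Those binary values are easy to reproduce yourself; a trivial Python check of the table above:

for ch in '%ABC\r':
    print(format(ord(ch), '07b'), '->', repr(ch), '(%d)' % ord(ch))
# 0100101 -> '%' (37)
# 1000001 -> 'A' (65)
# ...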
ASCII was meant for English only.
What? Why English only? So many languages out there!
Because the center of the computer industry was in the USA at that
time. As a consequence, they didn't need to support accents or other
marks such as á, ü, ç, ñ, etc. (aka diacritics).
ASCII Extended
Some clever people started using the 8th bit (the bit previously used for parity) to encode more characters, to support their language (to support "é" in French, for example). Just using one extra bit doubled the size of the original ASCII table, mapping up to 256 characters (2^8 = 256) instead of 2^7 (128) as before.
10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)
The name for this "ASCII extended to 8 bits and not 7 bits as before" could just be referred to as "extended ASCII" or "8-bit ASCII".
As @Tom pointed out in his comment below, there is no single thing called "extended ASCII", yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example ISO 8859-1, also called ISO Latin-1.
Unicode, The Rise
ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the likes?
We would have needed an entirely new character set... that's the rationale behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic number of characters (see this table).
You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.
Encodings: UTF-8 vs UTF-16 vs UTF-32
This answer does a pretty good job at explaining the basics:
UTF-8 and UTF-16 are variable length encodings.
In UTF-8, a character may occupy a minimum of 8 bits.
In UTF-16, a character length starts with 16 bits.
UTF-32 is a fixed length encoding of 32 bits.
UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.
Mnemonics:
UTF-8: minimum 8 bits.
UTF-16: minimum 16 bits.
UTF-32: minimum and maximum 32 bits.
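The point that ASCII text is also valid UTF-8 is easy to demonstrate; a tiny Python sketch (and, for contrast, how the same ASCII text looks in UTF-16):

text = 'Hello, World!'                                 # ASCII-only text
print(text.encode('ascii') == text.encode('utf-8'))    # True: byte-for-byte identical
print(text.encode('utf-16-le'))                        # a zero byte after every ASCII character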
Note:
Why 2^7?
This is obvious for some, but just in case. We have seven slots available filled with either 0 or 1 (Binary Code).
Each can have two combinations. If we have seven spots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.
Source: Wikipedia, this great blog post and Mocki.co where I initially posted this summary.
ASCII has 128 code points, 0 through 127. It fits in a single 8-bit byte, and the values 128 through 255 tended to be used for other characters, with incompatible choices causing the code page disaster: text encoded in one code page cannot be read correctly by a program that assumes or guesses another code page.
Unicode came about to solve this disaster. Version 1 started out with 65,536 code points, commonly encoded in 16 bits. It was later extended in version 2 to 1.1 million code points. The current version is 6.3, using 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.
Encoding in 16 bits was common when v2 came around, used by the Microsoft and Apple operating systems, for example, and by language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16 bits: an encoding called UTF-16, a variable-length encoding where one code point can take either 2 or 4 bytes. The original v1 code points take 2 bytes, the added ones take 4.
Another very common variable-length encoding, used in *nix operating systems and tools, is UTF-8. A code point can take between 1 and 4 bytes; the original ASCII codes take 1 byte, the rest take more. The only non-variable-length encoding is UTF-32, which takes 4 bytes per code point. It is not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, that are widely ignored.
An issue with the UTF-16/32 encodings is that the order of the bytes depends on the endianness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.
Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers about which UTF choice is "best". Their association with operating-system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special code point (U+FEFF, ZERO WIDTH NO-BREAK SPACE) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianness and is neutral to a text rendering engine. Unfortunately it is optional, and many programmers claim their right to omit it, so accidents are still pretty common.
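For reference, these are the actual BOM byte sequences; a small Python sketch using the constants from the standard codecs module:

import codecs

print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF32_BE)  # b'\x00\x00\xfe\xff'
print(codecs.BOM_UTF32_LE)  # b'\xff\xfe\x00\x00'

# The BOM is just U+FEFF encoded in the respective encoding:
print('\ufeff'.encode('utf-8'))  # b'\xef\xbb\xbf'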
Java provides support for Unicode, i.e. it supports all worldwide scripts. Hence the size of char in Java is 2 bytes, with a range of 0 to 65535; a char holds a single UTF-16 code unit, so characters outside that range are represented by a pair of chars (a surrogate pair).
ASCII has 128 code positions, allocated to graphic characters and control characters (control codes).
Unicode has 1,114,112 code positions. About 100,000 of them have currently been allocated to characters, and many code points have been made permanently noncharacters (i.e. not used to encode any character ever), and most code points are not yet assigned.
The only things that ASCII and Unicode have in common are: 1) They are character codes. 2) The 128 first code positions of Unicode have been defined to have the same meanings as in ASCII, except that the code positions of ASCII control characters are just defined as denoting control characters, with names corresponding to their ASCII names, but their meanings are not defined in Unicode.
Sometimes, however, Unicode is characterized (even in the Unicode standard!) as “wide ASCII”. This is a slogan that mainly tries to convey the idea that Unicode is meant to be a universal character code the same way as ASCII once was (though the character repertoire of ASCII was hopelessly insufficient for universal use), as opposite to using different codes in different systems and applications and for different languages.
Unicode as such defines only the “logical size” of characters: Each character has a code number in a specific range. These code numbers can be presented using different transfer encodings, and internally, in memory, Unicode characters are usually represented using one or two 16-bit quantities per character, depending on character range, sometimes using one 32-bit quantity per character.
ASCII and Unicode are two character encodings. Basically, they are standards on how to represent different characters in binary so that they can be written, stored, transmitted, and read in digital media. The main difference between the two is in the way they encode characters and the number of bits that they use for each. ASCII originally used seven bits to encode each character. This was later increased to eight with Extended ASCII to address the apparent inadequacy of the original. In contrast, Unicode uses variable-width encoding schemes where you can choose between 32-, 16-, and 8-bit code units. Using more bits lets you use more characters at the expense of larger files, while fewer bits give you a limited choice but save a lot of space. Using fewer bits (i.e. UTF-8 or ASCII) would probably be best if you are encoding a large document in English.
One of the main reasons Unicode was developed was the problem that arose from the many non-standard extended ASCII code pages. Unless you use the prevalent code page, which is used by Microsoft and most other software companies, you are likely to encounter problems with your characters appearing as boxes. Unicode virtually eliminates this problem because all its character code points are standardized.
Another major advantage of Unicode is that at its maximum it can accommodate a huge number of characters. Because of this, Unicode currently contains most written languages and still has room for even more. This includes typical left-to-right scripts like English and even right-to-left scripts like Arabic. Chinese, Japanese, and the many other variants are also represented within Unicode. So Unicode won’t be replaced anytime soon.
In order to maintain compatibility with the older ASCII, which was already in widespread use at the time, Unicode was designed in such a way that its first 128 code points match ASCII. So if you open an ASCII-encoded file as Unicode (UTF-8), you still get the correct characters. This facilitated the adoption of Unicode, as it lessened the impact of adopting a new encoding standard for those who were already using ASCII.
Summary:
1. ASCII uses a fixed 7-bit encoding (8 bits in its extended variants) while Unicode uses variable-width encodings.
2. Unicode is standardized while the extended ASCII variants aren't.
3. Unicode represents most written languages in the world while ASCII does not.
4. ASCII has its equivalent within Unicode.
Taken From: http://www.differencebetween.net/technology/software-technology/difference-between-unicode-and-ascii/#ixzz4zEjnxPhs
Storage
The numbers given are for storing one character:
ASCII ⟶ 7 bits (1 byte)
Extended ASCII ⟶ 8 bits (1 byte)
UTF-8 ⟶ minimum 8, maximum 32 bits (min 1, max 4 bytes)
UTF-16 ⟶ minimum 16, maximum 32 bits (min 2, max 4 bytes)
UTF-32 ⟶ 32 bits (4 bytes)
Usage (as of Feb 2020): [chart of character-encoding usage on the web omitted]
ASCII defines 128 characters, whereas Unicode contains a repertoire of more than 120,000 characters.
Beyond the fact that Unicode is a superset of ASCII, another good difference to know between ASCII and UTF is in terms of disk-file encoding and data representation and storage in memory. Programs know that given data should be understood as an ASCII or UTF string either by detecting a special byte order mark at the start of the data, or by assuming from programmer intent that the data is text and then checking it for patterns that indicate it is in one text encoding or another.
Using the conventional prefix notation of 0x for hexadecimal data, a good basic reference is that ASCII text consists only of byte values 0x00 to 0x7F, each representing one of the possible ASCII characters. UTF text is often indicated by starting with the bytes 0xEF 0xBB 0xBF for UTF-8. For UTF-16, the start bytes 0xFE 0xFF or 0xFF 0xFE are used, with the byte order of the text indicated by the order of these start bytes. The simple presence of byte values outside the ASCII range also indicates that the data is probably UTF.
There are other byte order marks that use different codes to indicate data should be interpreted as text encoded in a certain encoding standard.
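A minimal sketch of the detection approach described above, in Python, assuming you only want to recognise the BOM-marked cases and otherwise fall back to a crude heuristic (the function name and the fallback behaviour are my own, purely illustrative):

import codecs

# Ordered longest-first so UTF-32 BOMs are not mistaken for UTF-16 ones.
_BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]

def guess_encoding(data: bytes) -> str:
    """Guess a text encoding from a BOM, else fall back to a crude heuristic."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    if all(b <= 0x7F for b in data):
        return 'ascii'          # only 0x00-0x7F bytes: plain ASCII
    return 'utf-8'              # non-ASCII bytes present: probably UTF

print(guess_encoding(b'\xef\xbb\xbfHello'))  # utf-8-sig
print(guess_encoding(b'Hello'))              # ascii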

Why does the unicode Superscripts and Subscripts block not contain simple sequences of all letters?

The arrangement of the characters that can be used as super-/subscript letters seems completely chaotic. Most of them are obviously not meant to be used as sup/subscript letters, but even those which are do not follow any very reasonable ordering. In Unicode 6.0 there is now at last an alphabetically ordered subset of the subscript letters h-t in U+2095 through U+209C, but this was obviously rather squeezed into the remaining space in the block and encompasses less than a third of all letters.
Why did the consortium not just allocate enough space for at least one sup and one subscript alphabet in lower case?
The disorganization in the arrangement of these characters is because they were encoded piecemeal as scripts that used them were encoded, and as round-trip compatibility with other character sets was added. Chapter 15 of the Unicode Standard has some discussion of their origins: for example superscript digits 1 to 3 were in ISO Latin-1 while the others were encoded to support the MARC-8 bibliographic character set (see table here); and U+2071 SUPERSCRIPT LATIN SMALL LETTER I and U+207F SUPERSCRIPT LATIN SMALL LETTER N were encoded to support the Uralic Phonetic Alphabet.
The Unicode Consortium have a general policy of not encoding characters unless there's some evidence that people are using the characters to make semantic distinctions that require encoding. So characters won't be encoded just to complete the set, or to make things look neat.