Confusion regarding ASCII encoding

I understand that ASCII is a character encoding scheme, where a byte is assigned a certain decimal number, hex code, or a letter of our alphabet.
What I don't understand and couldn't find out via Google is how exactly the computer deals with ASCII behind the scenes. For instance when I write a text file with the text "hello world", what is the computer doing? Does it save the bytes in memory and where does the ASCII encoding come into play?

Almost anything that computers store on disk, transfer over the network or keep in their memory is handled as 8-bit chunks of data, called bytes.
Those bytes are just numbers. Anything between 0 and 255 *.
So a 100-byte file is just 100 numbers, one after the other.
A network message is similar: it's just a bunch of numbers one after the other.
(We tend to abstract over those and call them something like "streams", because at some level it often doesn't matter if you read from a file on disk or receive a network message, they are fundamentally just finite streams of bytes).
If you want to display a file from the disk as text, something needs to convert those numbers to something meaningful for humans. Because if I tell you that a file contains the bytes 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a, then chances are you don't really know what that means. (By the way, those are hex values which is already an interpretation, one could equivalently say that the file contains the decimal byte values 104, 101, 108, ...)
ASCII is a pattern of how to interpret those numbers. It tells you that 0x68 (decimal 104) represents the character h. And that 0x65 (decimal 101) represents e. And if you apply that mapping to those bytes you'll get hello world.
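As a concrete illustration (a small Python sketch of my own, not anything the computer literally runs for you), you can apply the ASCII mapping to exactly those bytes:

    # The raw bytes of the file, written out as hex above
    data = bytes.fromhex("68 65 6c 6c 6f 20 77 6f 72 6c 64 0a")

    print(list(data))            # the same bytes as decimal numbers: [104, 101, 108, ...]
    print(data.decode("ascii"))  # apply the ASCII mapping: "hello world\n"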
That decoding only has to happen when the computer wants to show the text to a user, because internally it doesn't care that 0x68 is h. So if the computer wants to display some text to you, it looks up what letter 0x68 represents (h), probably represented again via its Unicode code point, which happens to be U+0068, and then it looks up how that character is represented in the font. The font then has a mapping of U+0068 to some instructions on how to draw the h.
And since we're talking about ASCII it should be mentioned that ASCII is not actually used an awful lot these days, mostly because it only supports a very limited set of characters (basically just barely enough to write English language text, and not even all of that). More commonly used encodings today are UTF-8 (which has the benefit of being ASCII compatible which means all valid ASCII text is also valid UTF-8 text, but not the other way around) and UTF-16. Other encodings that used to be popular, but are on the decline are the ISO-8859-* family (which are basically extended versions of ASCII, but still only support a small number of characters each).
* So technically even saying "those are numbers between 0 and 255" is already an interpretation. Technically they are 8 bits, each one of which can be off or on. Those can be interpreted as an unsigned number (0 to 255), a signed number (-128 to 127), a character (using something like the ASCII encoding) or potentially anything else you want. But the "unsigned number" interpretation is one of the most straightforward ones.

For instance when I write a text file with the text "hello world", what is the computer doing?
When you hit those keys on your keyboard, a certain protocol between the keyboard and computer lets the computer know which keys were hit. The computer translates that into a character, like "h", based on what keyboard layout is currently selected. It may also cause your video game character to move sideways or whatever else, there's no direct connection between a key and what it causes to happen. But let's say you're in a text editor and your computer interpreted your hitting the "h" key as "inputting the letter h". It now turns that into some internal, in-memory character representation. Often in-memory representations will be UTF-16 encoded bytes, so the computer can represent any and all possible Unicode characters.
When you hit File → Save as... and choose to store the file in ASCII encoding, the text editor goes through the UTF-16 code units stored in memory and converts them all into equivalent ASCII bytes, according to a UTF-16/Unicode → ASCII encoding table. Those bytes are stored on disk.
When you open that file again, the text editor reads those bytes from disk, probably turns them into its internal UTF-16 representation, and stores them in memory so you can edit the file. At this point you can typically think about each character as a character; it doesn't matter what bytes it's encoded as, that is abstracted away. An "h" is just an "h" at this point.
Each in-memory character is mapped to a glyph in a font, typically by its Unicode code point, to be able to display a graphical representation of it on screen for you.
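If you want to watch that round trip, here is a minimal Python sketch of the same idea; the file name hello.txt and the use of Python are illustrative, not part of any particular editor's internals:

    text = "hello world"                     # the editor's in-memory characters

    # "Save as..." with ASCII chosen: encode the characters into bytes, write them out
    with open("hello.txt", "wb") as f:
        f.write(text.encode("ascii"))

    # Opening the file again: read the bytes back and decode them into characters
    with open("hello.txt", "rb") as f:
        restored = f.read().decode("ascii")

    print(restored == text)                  # True: an "h" is just an "h" again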

Related

Are control sequences the same number in every encoding?

I am writing a parser, and the original spec states:
The file header ends with the control sequence Ctrl-Z
They do not specify which encoding the header is written in (it could be latin1, utf8, windows-1252, ...), so I wonder whether the sequence is the same number in every encoding.
It appears to be the case that it always corresponds to decimal 26 or hex 1A.
It would be good to know, in a more general way, whether this holds for all control sequences.
Most likely, ASCII is assumed. In ASCII, especially since you say "Ctrl-Z" corresponds to the binary representation/"codepoint" dec 26 hex 1A, this would be the SUB (substitute) control character.
The alternative extended character sets/encodings don't change anything here, because dec 26 sits within the first/lower 7 bits of the byte (dec 0-127 of the 256 values in total). The 8th bit was later used to double the range and gain the other half, the remaining 128 code points from dec 128-255. The idea is that the extended character sets usually share/retain the lower ASCII code points/mappings (also for backward compatibility), but introduce their own special characters in the higher code-point range 128-255. There are many different encodings of this type, each trying to support more of the world's writing scripts with its own custom extended code set: Windows-1252, a Western European mix; ISO-8859-1 (Latin-1) for Western European languages such as German; ISO-8859-15, which is nearly identical but adds the Euro currency symbol; code page 437, which IBM DOS used for box-drawing characters on the console (and which, for example, displays printable shapes at the code points that are control characters in ASCII); and so on. The problem, obviously, is that there are a lot of these:
each only gains 128 more characters
you can't combine/load/apply any two of them at the same time (if characters would be needed from multiple different code sets)
each application has to know (or be told) beforehand which code set a file was saved in, in order to map the byte patterns to the correct character renderings/symbols on screen. If a user or a tool applies the wrong code set when saving, not recognizing that the source was previously created/saved with a different code set, the file becomes "corrupt": some bytes were stored under the assumption they would be rendered with code set A and some with code set B, and both cannot apply at once. There is also no mechanism in these flat, dumb, plain-text files on old, memory-short DOS file systems to say which part of a file belongs to which code set, so the characters can never all be rendered correctly, and it can be difficult or impossible to retroactively guess and repair what the intended interpretation/rendering of a byte's binary pattern was
no hope of getting anywhere with only 128 more characters added on top of ASCII when it comes to Chinese etc.
So the improvement was to stop using the 8th bit for these code pages and instead use it as a marker: if it is set, it indicates that another byte follows (UTF-8), which greatly expands the number of available code points. This can even be repeated in the subsequent byte. And it is optional: if the character is within the 7-bit ASCII code points, UTF-8 does not set the 8th bit and does not add another byte.
This also means the extended code pages and UTF-8 cannot be mixed (used/applied at the same time). For most code pages, and for UTF-8 as well, the byte values in the lower range mean the same characters as in ASCII (UTF-16 shares the code points, but not the single-byte layout). If your characters are all within the lower 7 bits of the byte, it does not matter much which of these encodings is nominally in use, as the 8th bit is not used by any of the code pages or by UTF-8. It matters a great deal for bytes that do have the 8th bit set; and if there are such bytes, the choice of encoding usually applies to the entire file, even though some bytes may stay within single-byte ASCII, so you should take great care when inserting or interpreting byte patterns that have the 8th bit set.
An easy rule is: if all bytes (or the byte in question) have the 8th bit clear, the only question is whether the lower 7 bits follow ASCII or not. EBCDIC, for example, is a non-ASCII alternative where dec 26 hex 1A is UBS (unit backspace); it also has a SUB (substitute), but at code point dec 63 hex 3F. Other encodings may not have ASCII's SUB at all, or may have something similar with a slightly different meaning/use, or maybe ASCII took its SUB from EBCDIC, etc. But there is no need to wonder/worry about UTF-8: if ASCII can be assumed, the characters encoded in ASCII are encoded identically in UTF-8, as a single byte with the highest bit not set.
Maybe it can be determined from the spec if all the characters mentioned are within the ASCII range and according to the ASCII codepoint definitions, or if there's other characters that might only be found in UTF-8 (or UTF-16 or UTF-32) or in one of the old extended code pages (but not found in others), or if there's any indication that the encoding might not be ASCII/ASCII-based.
It's obviously problematic if a spec doesn't explicitly state the encoding it implicitly assumes, when the spec describes a format, protocol or data representation. On the other hand, maybe the "Ctrl-Z" wording is misleading, because dec 26 hex 1A is the same byte no matter which text encoding is in play. Maybe the spec just uses this byte pattern as a marker with no character-display meaning whatsoever, introducing only its own particular local meaning as defined within the spec.
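If it helps, a quick Python check (my own sketch, using Python's codec names) confirms that the byte 0x1A maps to the same code point U+001A under ASCII and the common ASCII-based encodings:

    raw = b"\x1a"  # the Ctrl-Z / SUB byte from the spec

    for codec in ("ascii", "latin-1", "cp1252", "utf-8"):
        decoded = raw.decode(codec)
        print(codec, hex(ord(decoded)))  # every line prints 0x1a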

Understanding the need of encoding and decoding in context to saving the strings on disk

I have read the answer here. I understand what a byte stream is (a stream of 1s and 0s), what encoding is (a mapping from characters that we humans understand to the corresponding bytes) and what decoding is (the reverse mapping from bytes back to characters).
I still cannot reconcile the entire concept in my head. In RAM we already have everything as bytes only. And I guess my interpreter is inherently using some decoding scheme to show me the characters corresponding to that byte stream. What, then, do we mean by having to encode before saving to disk? If my interpreter is using 'utf-8' to show this text that I am typing, and I ask it to save this text using 'cp-1252', have I changed the underlying byte stream?
There are different ways to see it.
One way: "Hello World!" could be encoded in different ways. What you want is the semantics of the string: a salutation and a target. But if you save it to a UTF-8 file, you will get different byte values than in a UTF-16LE file, or in an EBCDIC-encoded file.
E.g. A is 65 in the ASCII encoding, but 193 in the EBCDIC encoding (used e.g. by many IBM mainframes), and 0 65 in a UTF-16 encoding (or 65 0, depending on byte order). Etc. So when you save text, you need to specify the encoding (as expected by the reader, so it may depend on the file format).
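Those byte values are easy to verify; a small Python sketch, where cp500 is one of several EBCDIC code pages and stands in for "EBCDIC" here:

    for codec in ("ascii", "cp500", "utf-16-le", "utf-16-be"):
        print(codec, list("A".encode(codec)))

    # ascii      [65]
    # cp500      [193]      (EBCDIC)
    # utf-16-le  [65, 0]
    # utf-16-be  [0, 65]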
Also, the libraries of a language may not handle all encodings (for all functions). It is usually better to decode on input, using the standard libraries, and to encode again when the data goes out. That way you only need to implement encoding and decoding (e.g. for EBCDIC), and not all the sorting, upper/lower-case handling, is_digit, is_symbol, etc.
It is standard practice to separate semantics from raw values, or display from logic. If you are a control freak, you can do everything without decoding values, but it is error-prone, and you have to know so many details that few people want to.
Another example: do you need to know the real byte values of your data/strings? You have a number: is it encoded little-endian or big-endian? Or maybe as a float (e.g. in JavaScript)? We only have to care when we save or transmit data: to send it over the internet, or to save an image, we need a way to state the byte ordering, so that on machines with the other ordering the bytes get swapped when reading a multi-byte number.
Or another example: you take a selfie. You have an image, but you can save it as a PNG file or as a JPEG file: you will get very different files, with different byte values. But you know the encoding (fortunately, for such image files, the first bytes describe the format, followed by a little data about the encoding). For you it is enough to know that it is your image. But do you think the computer works directly on the bytes of either format? Probably not. When you read the image, it is converted into a different in-memory encoding (which you probably do not need to care about): often an RGB (or RGBA) format, but you do not know how many bits per channel, or whether some colour rendering (from profiles) is applied [JPEG stores it as YCC].
Python takes a stricter, semantic view: you do not know how Python encodes a string internally. It may be 8-bit (ASCII/Latin-1), 16-bit (UCS-2) or 32-bit (UTF-32). It chooses the internal encoding dynamically, according to the most efficient way to store the string. You can still get the code point of each character, and use the many string/character functions. Only when you encode a string do you get a fixed sequence of numbers. On the string side you really do not know how strings are represented in memory. This keeps the two parts of Unicode clearly separated: the semantic values (the description of all characters) and the encoding/decoding (how to represent those values as bytes).
When you are handling a string in Python, you should only care about the semantics. The implementation (and so the physical layout of strings in memory) is not your business, and Python can change it (it has changed it before).
But with your example:
You may not notice much of this, because of recent standardisation: ASCII became pretty much the only encoding for the most common Latin letters and symbols. Latin-1 is compatible with ASCII, just extending it from 7 bits to 8 bits. "Windows ANSI" (cp-1252) uses Latin-1 and adds characters in the unallocated positions. Unicode based its first 256 code points on Latin-1. So you may be used to a character always having one fixed number (or not being available at all), but this was not the rule, even in early Windows.
So your cp-1252 text lines up with UTF-8 for most of its characters (the code points largely agree, apart from a few characters), although the byte values differ outside ASCII. If you use some other encoding, you have to do more transcoding (converting from one encoding to another). But usually you do this only when you save: you keep the internal encoding and encode a copy to be saved.
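For the cp-1252 question specifically, a short Python sketch shows that saving with a different encoding really does change the bytes, even though the in-memory string (the semantics) stays the same:

    text = "façade"                      # what you see and edit: just characters

    utf8_bytes = text.encode("utf-8")    # b'fa\xc3\xa7ade'  - 'ç' becomes two bytes
    cp1252_bytes = text.encode("cp1252") # b'fa\xe7ade'      - 'ç' becomes one byte

    print(utf8_bytes == cp1252_bytes)    # False: same text, different byte streams
    print(utf8_bytes.decode("utf-8") == cp1252_bytes.decode("cp1252"))  # True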
A byte is 8 bits, whether it is in RAM, on disk, or on the wire.
A bit is the "atom" of computer data. A byte is the "molecule", except that there is only one kind of byte.
A bit is the smallest unit of information in computers. It is usually said to represent 0 or 1, or OFF or ON.
Whether you "interpret" a byte as a number (0 to 255), a signed number (-128 to +127), an "ascii" character, like the characters I am typing, depends on what you (or the computer) does with the byte. Or a byte can be part of a bigger number, one that requires several bytes to represent.
Because there are too many "letters" or "characters" (especially in Chinese) to fit in a byte, there is the additional concept that a "character" may be composed of multiple bytes. UTF-8 is the main standard today. Giacomo discusses several less common encodings that say which "character" is represented by a byte (or bytes). Remember, each byte is composed of 8 bits.
English letters, digits and some punctuation are represented (encoded) in bytes in the same way in ASCII, Latin-1, cp-1252 and UTF-8 (and some other encodings). But as soon as you get into European accented letters, the encodings diverge.
A common thing you may hear of is to represent one byte as two hexadecimal digits.
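A short Python sketch of those different interpretations of one and the same byte (0xE9 is just an arbitrary example value):

    import struct

    raw = bytes([0xE9])                  # one byte, eight bits

    print(raw[0])                        # 233   - unsigned number (0 to 255)
    print(struct.unpack("b", raw)[0])    # -23   - signed number (-128 to +127)
    print(raw.decode("latin-1"))         # 'é'   - a character, via the Latin-1 mapping
    print(raw.hex())                     # 'e9'  - the byte as two hexadecimal digits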

troubles with understanding how ASCII works

I have a few questions about ASCII encoding:
How come there are only 127 characters definable in ASCII coding? It should be 7-bit, which means 2 to the power of 7, which equals 128. Where is the one missing character?
When I save a text file as textfile.txt, it should be saved in ASCII coding, right? But when I write, say, 10 characters into the file, it has 10 bytes, which is 80 bits; shouldn't it be 70?
How do I save a file in a kind of ASCII code which has 7 bits per character?
Does some software still use the ASCII encoding for storing information?
1) ASCII has 128 values, but they are enumerated from 0 to 127, like most computer arrays. 0 is the NUL character.
2) Either the ASCII is fitted into 8 bits, or you are using an extended 8-bit version.
3) Write your own program that writes to a byte stream; then you can check the bytes yourself.
4) Most readable text is encoded using UTF, but things that only need the basic characters, such as computer code, can still use ASCII.
1) How come there are only 127 characters definable in ASCII coding? It should be 7-bit, which means 2 to the power of 7, which equals 128... where is the one missing character?
The NUL character, whose ASCII code is 0. That's the one you missed.
2) When I save a textfile as textfile.txt it should be saved in ASCII coding, right? But when I write like 10 characters into the file it has 10 bytes, which is 80 bits; shouldn't it be 70?
Storage systems (and main memory) tend to use a byte as the minimum piece of information to store, so a file full of standard ASCII characters wastes one bit per character. Non-English users give thanks for that, because it allowed ASCII to be extended to 8 bits, providing codes to store accented vowels and the like.
3) How do I save a file to a kind of ASCII code which has 7 bits per character?
Just make sure all your file contents are ASCII standard. You will not, however, recover those missing bits. A compression algorithm might take advantage of that to squeeze a text file a little, though.
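In practice, "saving as ASCII" just means every character must have an ASCII code, and the byte count is easy to check; a Python sketch (the file name ascii_demo.txt is only an example):

    import os

    with open("ascii_demo.txt", "w", encoding="ascii") as f:
        f.write("helloworld")                   # 10 ASCII characters

    print(os.path.getsize("ascii_demo.txt"))    # 10: one full byte per character
    # Writing "café" instead would raise UnicodeEncodeError, since 'é' has no ASCII code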
4) Does some software still use the ASCII encoding for storing information?
The vast majority of software use ASCII even to encode things that are not ASCII by themselves. Notable examples: e-mail, and the HTML source of this very page you are reading.
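HTML is a good example of that trick: characters outside ASCII can be written using ASCII-only character references. A Python sketch using the standard xmlcharrefreplace error handler:

    text = "naïve café"

    # Force pure ASCII output; non-ASCII characters become numeric character references
    ascii_html = text.encode("ascii", errors="xmlcharrefreplace")
    print(ascii_html)  # b'na&#239;ve caf&#233;'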
Stepping back a bit…
ASCII is an archaic, nearly obsolete character set. That said, nearly all character sets in use are a superset of ASCII and have compatible encodings. For example, Unicode has the UTF-8 encoding which maps the bytes for the first 128 characters the same as ASCII. Windows-1252 has 251 characters with the first 128 the same as ASCII.
Many modern programming environments use Unicode (at least for their source code and/or strings): Java, .NET, XML, HTML, ….
So, if you are reading a file, don't assume that it is ASCII. And if you are writing a file, you could lose data if your programming environment uses Unicode and you force the output to be ASCII. In either case, the intended character set and encoding should be known to both readers and writers.

Are there bytes that are not used in the UTF-8 encoding?

As I understand it, UTF-8 is a superset of ASCII, and therefore includes the control characters which are not used to represent printable characters.
My question is: Are there any bytes (of the 256 different) that are not used by the UTF-8 encoding?
I wondered if you could convert/encode UTF-8 text to binary.
Here my though process:
I have no idea how the UTF-8 text encoding works, or how it can represent so many characters (only that it uses multiple bytes for characters not in ASCII (Latin-1??)), but I know that ASCII text is valid in UTF-8, so the control characters (bytes 0-31) are not used differently by the UTF-8 encoding, while at the same time they are not used for displaying characters, right?
So of the 256 different byte values, only ~230 are used. For a 1000-byte-long Unicode text there are only 230^1000 different possible texts, right?
If that is true, you could convert it to binary data which is smaller than 1000 bytes.
Wolfram alpha: 1000 bytes of unicode (assumption unicode only uses 230 of the 256 different bytes) --> 496 bytes
Yes, it is possible to devise encodings which are more space-efficient than UTF-8, but you have to weigh the advantages against the disadvantages.
For example, if your primary target is (say) ISO-8859-1, you could map the character codes 0xA0-0xFF to themselves, and only use 0x80-0x9F to select an extension map somewhat vaguely like UTF-8 uses (nearly) all of 0x80-0xFF to encode sequences which can represent all of Unicode > 0x80. You would gain a significant advantage when the majority of your text does not use characters in the ranges 0x80-0x9F or 0x0100-0x1EFFFFFFFF, but correspondingly lose when this is not the case.
Or you could require the user to keep a state variable which tells you which range of characters is currently selected, and have each byte in the stream act as an index into that range. This has significant disadvantages, but used to be how these things were done way back when (witness e.g. ISO-2022).
The original UTF-8 draft before Ken Thompson and Rob Pike famously intervened was probably also somewhat more space-efficient than the final specification, but the changes they introduced had some very attractive properties, trading (I assume) some space efficiency for lack of contextual ambiguity.
I would urge you to read the Wikipedia article about UTF-8 to understand the design desiderata -- the spec is possible to grasp in just a few minutes, although you might want to reserve an hour or more to follow footnotes etc. (The Thompson anecdote is currently footnote #7.)
All in all, unless you are working on space travel or some similarly efficiency-critical application, losing UTF-8 compatibility is probably not worth the time you have already spent, and you should stop now.
0xFE and 0xFF are not valid anywhere in UTF-8, nor (under the current definition in RFC 3629) are 0xC0, 0xC1 and 0xF5-0xFD, and some other bytes are not valid at certain positions.
The lead byte of a character indicates the number of bytes used to encode the character, and each continuation byte has 10 as its two high order bits. This is so that you can pick any byte within the text and find the start of the character containing it. If you don't mind losing this ability, you could certainly come up with more efficient encoding.
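You can also let a program enumerate exactly which byte values never occur in well-formed UTF-8; a brute-force Python sketch over every code point (it takes a few seconds to run):

    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:          # surrogate code points are not encodable
            continue
        used.update(chr(cp).encode("utf-8"))

    unused = sorted(set(range(256)) - used)
    print([hex(b) for b in unused])
    # ['0xc0', '0xc1', '0xf5', '0xf6', '0xf7', '0xf8', '0xf9', '0xfa', '0xfb', '0xfc', '0xfd', '0xfe', '0xff']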
You have to distinguish characters, Unicode and the UTF-8 encoding:
In encodings like ASCII, Latin-1, etc. there is a one-to-one relation between one character and one number between 0 and 255, so a character can be encoded by exactly one byte (e.g. "A" -> 65). For decoding such a text you need to know which encoding was used (does 65 really mean "A"?).
To overcome this situation, Unicode assigns every character (including all kinds of special things like control characters, diacritic marks, etc.) a unique number in the range from 0 to 0x10FFFF (the so-called Unicode code point). As this range does not fit into one byte, the question is how to encode it. There are several ways to do this; e.g. the simplest way would be to always use 4 bytes per character. As this consumes a lot of space, a more efficient encoding is UTF-8: here every Unicode code point (= character) is encoded in one, two, three or four bytes (for this encoding not all byte values from 0 to 255 are used, but that is only a technical detail).

What are Unicode, UTF-8, and UTF-16?

What's the basis for Unicode and why the need for UTF-8 or UTF-16?
I have researched this on Google and searched here as well, but it's not clear to me.
In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?
Please explain in simple terms.
Why do we need Unicode?
In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).
But for argument's sake, let’s say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.
Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.
Memory considerations
So how many bytes give access to what characters in these encodings?
UTF-8:
1 byte: Standard ASCII
2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
3 bytes: BMP
4 bytes: All Unicode characters
UTF-16:
2 bytes: BMP
4 bytes: All Unicode characters
It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese, Japanese, and Korean (CJK) characters.
If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web-pages or lengthy word documents, this could impact performance.
Encoding basics
Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.
UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.
UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions map to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.
As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.
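To make the incompatibility concrete, here is a small Python sketch dumping the same three characters in each encoding (bytes.hex with a separator needs Python 3.8 or later):

    for ch in ("A", "€", "😀"):
        print(
            ch,
            ch.encode("utf-8").hex(" "),      # 41 / e2 82 ac / f0 9f 98 80
            ch.encode("utf-16-be").hex(" "),  # 00 41 / 20 ac / d8 3d de 00  (surrogate pair)
            ch.encode("utf-16-le").hex(" "),  # 41 00 / ac 20 / 3d d8 00 de  (endianness!)
        )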
Practical programming considerations
Character and string data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
Recommended, default, and dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.
Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters can occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly since they occur very rarely.
Counting characters: There are combining characters in Unicode. For example, the code point U+006E (n) followed by U+0303 (a combining tilde) forms ñ, but the single code point U+00F1 also forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example and 1 for the latter. This isn't necessarily wrong, but it may not be the desired outcome either.
Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ. One is a letter, and the other is a Roman numeral. In addition, we have the combining characters to consider as well. For more information, see Duplicate characters in Unicode.
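Both of those pitfalls are easy to reproduce; a Python sketch using the standard unicodedata module:

    import unicodedata

    decomposed = "n\u0303"   # 'n' followed by a combining tilde
    composed = "\u00f1"      # 'ñ' as a single code point

    print(decomposed, composed)                   # both render as ñ
    print(len(decomposed), len(composed))         # 2 1   - naive counting disagrees
    print(decomposed == composed)                 # False - naive comparison disagrees
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True after normalization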
Surrogate pairs: These come up often enough on Stack Overflow, so I'll just provide some example links:
Getting string length
Removing surrogate pairs
Palindrome checking
Unicode
is a set of characters used around the world
UTF-8
a character encoding capable of encoding all possible characters (called code points) in Unicode
its code unit is 8 bits
uses one to four code units to encode a code point
00100100 for "$" (one 8-bit unit); 11000010 10100010 for "¢" (two 8-bit units); 11100010 10000010 10101100 for "€" (three 8-bit units)
UTF-16
another character encoding
its code unit is 16 bits
uses one or two code units to encode a code point
00000000 00100100 for "$" (one 16-bit unit); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units)
Unicode is a fairly complex standard. Don’t be too afraid, but be prepared for some work! [2]
Because a credible resource is always needed, but the official report is massive, I suggest reading the following:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) An introduction by Joel Spolsky, Stack Exchange CEO.
To the BMP and beyond! A tutorial by Eric Muller, Technical Director then, Vice President later, at The Unicode Consortium (the first 20 slides and you are done)
A brief explanation:
Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but it covers only the Latin script (seven bits per character can represent 128 different characters). Unicode is a standard whose goal is to cover all possible characters in the world (it can hold up to 1,114,112 characters, i.e. at most 21 bits per character; the current Unicode 8.0 specifies 120,737 characters in total).
The main difference is that an ASCII character can fit to a byte (eight bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:
Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.
An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory: 8-bit units, 16-bit units and so on. UTF-8 uses one to four units of eight bits, and UTF-16 uses one or two units of 16 bits, to cover the entire Unicode range of at most 21 bits. Units use prefixes so that character boundaries can be spotted, and more units mean more prefix bits occupied. So, although UTF-8 uses one byte for the Latin script, it needs three bytes for other scripts inside the Basic Multilingual Plane, while UTF-16 uses two bytes for all of these. And that is their main difference.
Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.
character: π
code point: U+03C0
encoding forms (code units):
      UTF-8: CF 80
      UTF-16: 03C0
encoding schemes (bytes):
      UTF-8: CF 80
      UTF-16BE: 03 C0
      UTF-16LE: C0 03
Tip: a hexadecimal digit represents four bits, so a two-digit hex number represents a byte.
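The encoding-scheme step (code units to bytes, including the byte order mark) can be checked with a few lines of Python; a sketch using the π example from above:

    pi = "\u03c0"                               # π, code point U+03C0

    print(pi.encode("utf-16-be").hex(" "))      # 03 c0   (UTF-16BE scheme)
    print(pi.encode("utf-16-le").hex(" "))      # c0 03   (UTF-16LE scheme)

    # A byte order mark at the front lets the decoder work out the byte order itself
    print(b"\xfe\xff\x03\xc0".decode("utf-16")) # π  (BOM FE FF: big-endian)
    print(b"\xff\xfe\xc0\x03".decode("utf-16")) # π  (BOM FF FE: little-endian)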
Also take a look at plane maps on Wikipedia to get a feeling of the character set layout.
The article What every programmer absolutely, positively needs to know about encodings and character sets to work with text explains all the details.
Writing to buffer
If you write the symbol あ to a 4-byte buffer with the UTF-8 encoding, your binary will look like this:
00000000 11100011 10000001 10000010
If you write the symbol あ to a 4-byte buffer with the UTF-16 encoding, your binary will look like this:
00000000 00000000 00110000 01000010
As you can see, the languages used in your content affect how much memory the text takes up.
Example: for this particular symbol あ, the UTF-16 encoding is more efficient, since there are 2 spare bytes left to use for the next symbol. But that doesn't mean you must use UTF-16 for the Japanese alphabet.
Reading from buffer
Now if you want to read the above bytes, you have to know in what encoding it was written to and decode it back correctly.
e.g. If you decode this :
00000000 11100011 10000001 10000010
as UTF-16, you will end up with 臣, not あ.
Note: Encoding and Unicode are two different things. Unicode is the big table mapping each symbol to a unique code point; e.g. the symbol (letter) あ has the code point 30 42 (hex). Encoding, on the other hand, is an algorithm that converts symbols into a form that is more appropriate for storing on hardware.
30 42 (hex) -> UTF-8 encoding -> E3 81 82 (hex), which is the above result in binary.
30 42 (hex) -> UTF-16 encoding -> 30 42 (hex), which is the above result in binary.
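The same experiment in Python (a sketch; errors="replace" is used so the deliberately wrong decode does not raise):

    ch = "\u3042"                                # あ, code point 30 42 (hex)

    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    print(utf8.hex(" "), "|", utf16.hex(" "))    # e3 81 82 | 30 42

    # Decoding the UTF-8 bytes as if they were UTF-16 gives mojibake, not あ
    print(utf8.decode("utf-16-le", errors="replace"))   # 臣 plus a replacement character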
Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.
Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.
Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.
Unicode is a standard which maps the characters in all languages to a particular numeric value called a code point. The reason it does this is that it allows different encodings to be possible using the same set of code points.
UTF-8 and UTF-16 are two such encodings. They take code points as input and encode them using a well-defined formula to produce the encoded byte sequence.
Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements and depending upon the characters that you will be dealing with, you should choose the encoding which uses the least sequences of bytes to encode those characters.
For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article,
What every programmer should know about Unicode
Why Unicode? Because ASCII has just 128 characters. Those from 128 to 255 differ in different countries, and that's why there are code pages. So they said: let's have up to 1114111 characters.
So how do you store the highest code point? You'll need to store it using 21 bits, so you'll use a DWORD having 32 bits with 11 bits wasted. So if you use a DWORD to store a Unicode character, it is the easiest way, because the value in your DWORD matches exactly the code point.
But DWORD arrays are of course larger than WORD arrays and of course even larger than BYTE arrays. That's why there is not only UTF-32, but also UTF-16. But UTF-16 means a WORD stream, and a WORD has 16 bits, so how can the highest code point 1114111 fit into a WORD? It cannot!
So they put everything higher than 65535 into a DWORD, which they call a surrogate pair. Such a surrogate pair is two WORDs and can be detected by looking at the first 6 bits of each.
So what about UTF-8? It is a byte array or byte stream, but how can the highest code point 1114111 fit into a byte? It cannot! Okay, so did they also put it into a DWORD? Or possibly a WORD? Almost right!
They invented UTF-8 sequences, which means that every code point higher than 127 must get encoded into a 2-byte, 3-byte or 4-byte sequence. Wow! But how can we detect such sequences? Well, everything up to 127 is ASCII and is a single byte. What starts with 110 is a two-byte sequence, what starts with 1110 is a three-byte sequence, and what starts with 11110 is a four-byte sequence. The remaining bits of these so-called "start bytes" belong to the code point.
Then, depending on the sequence, continuation bytes must follow. A continuation byte starts with 10, and its remaining 6 bits are payload bits that belong to the code point. Concatenate the payload bits of the start byte and the continuation byte(s) and you have the code point. That's all the magic of UTF-8.
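That bit-level logic fits in a few lines. Here is a rough Python sketch of a decoder for a single UTF-8 sequence, with no validation of overlong forms or surrogates, just the start-byte/continuation-byte rules described above; the function name decode_utf8_char is my own:

    def decode_utf8_char(data: bytes) -> int:
        """Return the code point of the first UTF-8 sequence in data (a sketch)."""
        first = data[0]
        if first < 0x80:                       # 0xxxxxxx: plain ASCII, one byte
            return first
        elif first >> 5 == 0b110:              # 110xxxxx: start of a 2-byte sequence
            length, cp = 2, first & 0x1F
        elif first >> 4 == 0b1110:             # 1110xxxx: start of a 3-byte sequence
            length, cp = 3, first & 0x0F
        elif first >> 3 == 0b11110:            # 11110xxx: start of a 4-byte sequence
            length, cp = 4, first & 0x07
        else:
            raise ValueError("invalid start byte")
        for byte in data[1:length]:            # continuation bytes: 10xxxxxx
            if byte >> 6 != 0b10:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (byte & 0x3F)     # append 6 payload bits
        return cp

    print(hex(decode_utf8_char("€".encode("utf-8"))))  # 0x20ac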
ASCII - software allocates only a single 8-bit byte in memory for a given character. It works well for English, as all of its characters have decimal values below 128; think of a plain C program working on char strings.
UTF-8 - software allocates one to four 8-bit bytes for a given character. What is meant by "variable" here? Let us say you are sending the character 'A' in an HTML page through the browser (HTML is usually UTF-8): the decimal value of A is 65, which in binary is 01000001, so it requires only one byte. Adopted characters like the 'ç' in façade, however, are no longer plain ASCII and need two bytes, as do most other European accented characters. Most Asian characters require three bytes, and emojis typically require four. UTF-8 covers all of these needs.
UTF-16 allocates a minimum of 2 bytes and a maximum of 4 bytes per character; it will not allocate 1 or 3 bytes. Each character is represented in either 16 bits or 32 bits.
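A quick Python comparison of those byte counts (the sample characters are arbitrary; the UTF-16 lengths are measured without a byte order mark):

    for ch in ("A", "ç", "अ", "😀"):      # ASCII, Latin, Devanagari, emoji
        print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))

    # A  1 2
    # ç  2 2
    # अ  3 2
    # 😀 4 4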
Then why does UTF-16 exist? Because originally Unicode was designed as a 16-bit code, not an 8-bit one. Java adopted that original 16-bit form, which later became UTF-16.
In a nutshell, you don't need UTF-16 anywhere unless it has already been adopted by the language or platform you are working on.
A Java program invoked by a web browser uses UTF-16 internally, but the web browser sends characters using UTF-8.
UTF stands for Unicode Transformation Format. Basically, in today's world there are scripts written in hundreds of languages that are not covered by the basic ASCII used earlier. Hence, UTF came into existence.
UTF-8 encodes characters using a code unit of eight bits, while for UTF-16 the code unit is 16 bits.