What is the use of Binary encoding when we have Character encoding? - encoding

I used to think that binary encoding is needed because every device has its own way of interpreting bytes. Thus, if one router sends a byte as significant information, another router might treat that byte as a parity byte or something else...
But isn't all of that already covered by character encoding? I mean, a character encoding tells which byte represents which character, right? (Or am I missing something?) Isn't the information about the character encoding (like UTF-8) enough for devices to read bytes directly? If yes, why would anyone want to encode this (using something like base64), since it will increase the size of the data to be transferred?

Related

Why Base64 is used "only" to encode binary data?

I saw many resources about the uses of base64 on today's internet. As I understand it, all of those resources seem to spell out a single use case in different ways: encode binary data in Base64 to avoid getting it misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate servers/systems/routers on the path to the client. Why would intermediate servers/systems/routers need to interpret something that they receive? Is there any example of such systems which may corrupt/wrongly interpret the data they receive, on today's internet?
Why do we fear that only binary data will be corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by this same logic, any text characters that do not belong to the base64 characters can be corrupted/misinterpreted. Why, then, is base64 used only to encode binary data? Extending the same idea, when we use a browser, are JavaScript and HTML files transferred in base64 form?
There are two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time" where some systems took ASCII seriously and only ever considered (and transferred) 7bits out of any 8bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these would have similar effects when transferring binary (i.e. non-text) data over it: they would try to interpret the binary data as textual data in a character encoding that obviously doesn't make sense (since there is no character encoding in binary data) and as a consequence modify the data in an un-fixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
As to your second question, the answer is two-fold:
textual data often comes with some kind of meta-data (either a HTTP header or a meta-tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might be the use of quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least: Base64 has a serious drawback and that's that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
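As a rough illustration of both points (ASCII-only output and the 4-for-3 size cost), here is a minimal sketch using Python's standard base64 module. The sample data is made up for the example.

```python
import base64

# 768 arbitrary bytes; many of them have the 8th bit set.
raw = bytes(range(256)) * 3

# Base64 output is itself a bytes object, but contains only ASCII characters.
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))             # 768 1024: 4 output bytes per 3 input bytes
print(max(encoded) < 128)                 # True: the 8th bit is never set
print(base64.b64decode(encoded) == raw)   # True: the round trip is lossless
```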
One of the uses of BASE64 is sending email.
Mail servers used a terminal to transmit data. It was also common to have translation, e.g. \r\n into a single \n and the reverse. Note: there was also no guarantee that 8-bit data could be used (the email standard is old, and it also allowed non-"internet" email, e.g. addresses with ! instead of @). Also, systems may not be fully ASCII.
Also, a line containing only a . (\n.\n) is treated as the end of the body, and mbox files use \nFrom to mark the start of a new mail (so body lines starting with From get escaped as >From). So even when the 8-bit flag became common in mail servers, the problems were not totally solved.
BASE64 was a good way to remove all these problems: the content is just sent as characters that all servers must know, and encoding/decoding requires only that sender and receiver agree (and have the right programs), without worrying about the many relay servers in between. Note: all \r, \n, etc. are just ignored by the decoder.
Note: you can also use BASE64 to encode strings in URLs (typically with the URL-safe alphabet, which replaces + and / with - and _), without worrying about how web browsers interpret them. You may also see BASE64 in configuration files (e.g. to include icons), so that specially crafted images cannot be interpreted as configuration. BASE64 is simply handy for embedding binary data in protocols that were not designed for binary data.

Understanding the need of encoding and decoding in context to saving the strings on disk

I have read the answer here. I understand what a byte stream is (a stream of 1s and 0s), what encoding is (a mapping from the characters that we humans understand to the corresponding bytes) and what decoding is (the reverse mapping from that byte stream back to characters).
I still cannot reconcile the entire concept in my head. In RAM we already have everything as bytes only. And I guess my interpreter is inherently using some decoding scheme to show me the characters corresponding to that byte stream. What then do we mean by having to encode before saving to disk? If my interpreter is using 'utf-8' to show us this text that I am typing and I ask it to save this text using 'cp-1252', have I changed the underlying byte stream?
There are different ways to see it.
One way: "Hello World!" could be encoded in different ways. What you want is the semantics of the string: a salutation and a target. But if you save it to a UTF-8 file, you will get different byte values than in a UTF-16LE file or in an EBCDIC encoding.
E.g. A is 65 in ASCII, but 193 in EBCDIC (used e.g. by many IBM mainframes), and 0 65 in UTF-16 (or 65 0, depending on the byte order), etc. So when you save a string, you need to specify the encoding (the one the reader expects, so it may depend on the file format).
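The values above can be checked with a short sketch. It assumes cp037 (one of the EBCDIC code pages shipped with Python) as the EBCDIC variant; other EBCDIC code pages may differ for other characters.

```python
text = "A"

# Same character, different byte values depending on the chosen encoding.
# cp037 is assumed here as a representative EBCDIC code page.
for codec in ("ascii", "cp037", "utf-16-be", "utf-16-le"):
    print(codec, list(text.encode(codec)))

# ascii     [65]
# cp037     [193]
# utf-16-be [0, 65]
# utf-16-le [65, 0]
```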
Also, the libraries of a language may not handle all encodings (for all functions). Usually it is better to decode on input, using the standard libraries, and then encode when the data goes out. That way you need to implement just encoding and decoding (e.g. for EBCDIC), and not all the sorting, upper/lower case handling, is_digit, is_symbol, etc. for every encoding.
It is standard practice to separate the semantics from the actual values, or the display from the logic. If you are a control freak, you can do everything without decoding the values. But it is error-prone, and you would need to know so many details that few people want to know.
Another example: do you need to know the actual values of your data/strings? You have a number: is it encoded little-endian or big-endian? Or maybe as a float (e.g. JavaScript)? We only need to know this when we save the data (e.g. to send it over the internet we need a way to tell the byte order; or when saving images we state the byte order, so on some machines the bytes will be swapped when reading a large number).
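The byte-order point can be sketched in a few lines. The number and the 4-byte width are arbitrary choices for the illustration.

```python
number = 258  # 0x0102

# The same integer, serialized with the two byte orders.
big = number.to_bytes(4, "big")        # b'\x00\x00\x01\x02'
little = number.to_bytes(4, "little")  # b'\x02\x01\x00\x00'
print(big.hex(), little.hex())         # 00000102 02010000

# Reading with the wrong byte order silently gives a different number.
print(int.from_bytes(big, "big"))      # 258
print(int.from_bytes(big, "little"))   # 33619968
```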
Or another example: you take a selfie. You have an image, but you can save it as a PNG file or a JPEG file: you will get very different files, with different values. But you know the encoding (fortunately, for such image files, the first bytes describe the format, followed by a few details about the encoding). For you it is enough to know that it is your image. But do you think the computer will work directly on the bytes of the two formats? Probably not. When you read the image, it is converted to a different encoding in memory (which you probably do not need to care about): often an RGB (or RGBA) format, but how many bits per channel, or whether some colour rendering (from profiles) is applied, you do not know [JPEG stores it as YCC].
Python has a stricter semantic view: you do not know how Python will encode the string internally. It may be 8-bit (ASCII/Latin-1), 16-bit (UCS-2), or 32-bit (UTF-32). It chooses the internal encoding dynamically, according to the most efficient way to store each string. You can still get the code point for each character, and use many string/character functions. Only when you encode a string do you get a fixed sequence of numbers. On the string side you really do not know how strings are represented in memory. This keeps the two parts of Unicode clearly separated: the semantic values (the description of all characters) and the encoding/decoding (how to represent the values in bytes).
When you are handling a string in Python, you should care only about the semantics. The implementation (and so the physical layout of the string in memory) is not your business, and Python can change it (it has changed it).
But with your example:
You may not notice much of this, because of recent standardisation: ASCII became nearly the only encoding for the most common Latin letters and symbols. Latin-1 is compatible with ASCII, just extending it from 7 bits to 8 bits. "Windows ANSI" uses Latin-1 and adds characters in the non-allocated parts. Unicode is based on Latin-1 (for the first 256 characters). So you may see a character always with the same number (or not available at all), but this was not the rule, even in early Windows.
So your cp-1252 produces the same bytes as UTF-8 for the ASCII characters, but not for the others (e.g. accented letters). If you use another encoding, you need a transcoding (changing from one encoding to another). But usually you do this only when you save: you keep the internal encoding, and you produce an encoded copy to be saved.
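To make the question's scenario concrete, here is a small sketch. The sample strings are made up; the point is only that the string stays the same while the saved bytes depend on the encoding chosen at save time.

```python
text = "café"

utf8 = text.encode("utf-8")      # b'caf\xc3\xa9' (é takes two bytes)
cp1252 = text.encode("cp1252")   # b'caf\xe9'     (é takes one byte)
print(utf8, cp1252)

# For pure-ASCII text the two encodings give identical bytes.
print("cafe".encode("utf-8") == "cafe".encode("cp1252"))   # True

# Transcoding: decode with the source encoding, encode with the target one.
transcoded = utf8.decode("utf-8").encode("cp1252")
print(transcoded == cp1252)                                 # True
```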
A byte is 8 bits, whether it is in RAM, on disk, or on the wire.
A bit is the "atom" of computer data. A byte is the "molecule", except that there is only one kind of byte.
A bit is the smallest unit of information in computers. It is usually said to represent 0 or 1, or OFF or ON.
Whether you "interpret" a byte as a number (0 to 255), a signed number (-128 to +127), or an "ASCII" character like the ones I am typing depends on what you (or the computer) do with the byte. A byte can also be part of a bigger number, one that requires several bytes to represent.
Because there are too many "letters" or "characters" (especially in Chinese) to fit in a byte, there is the additional concept that a "character" may be composed of multiple bytes. UTF-8 is the main standard today. Giacomo discusses several less-common encodings that say what "character" is represented by a byte (or bytes). Remember, each byte is composed of 8 bits.
English letters, digits, and some punctuation are represented (encoded) in bytes in the same way in ASCII, Latin-1, cp-1252, and UTF-8 (and some other encodings). But as soon as you get into European accented letters, the encodings diverge.
A common thing you may hear of is to represent one byte as two hexadecimal digits.
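For instance, a minimal sketch of the hexadecimal notation, using Python's built-in bytes.hex and bytes.fromhex on a made-up two-character string:

```python
data = "Aé".encode("utf-8")

# Each byte becomes exactly two hexadecimal digits.
hex_text = data.hex()
print(hex_text)                          # 41c3a9

# The notation is reversible.
print(bytes.fromhex(hex_text) == data)   # True
```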

utf-8: How to recover from missing bit?

Quote from Wikipedia
If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronize at the start of the next code point;
...
UTF-16 and UTF-32 will handle corrupt (altered) bytes by resynchronizing on the next good code point, but an odd number of lost or spurious byte (octet)s will garble all following text
I don't quite understand why UTF-8 works better than UTF-16 when corruption happens. What does "odd number" mean? Does it mean 1 in 0101010111101 (byte stream)?
In addition, can I say that UTF-16's recovery performance is as good as UTF-8's when no odd number of lost or spurious bytes (octets) occurs?
Could anyone give some pointers on how the system recovers from bit errors when using UTF-8 or UTF-16?
This quote refers to recovering the stream after some bytes have gone missing, not to correcting the characters that went bad during transmission.
What it means is simply this: if you miss one or more bytes in a UTF-8 stream, then as soon as the next byte is the beginning of a new code point (usually a character), you can resume your text from there.
In UTF-16, all code units are 2 bytes long (a character takes one or two code units), and if you miss a single byte then all subsequent bytes will be misaligned; without further, higher-level correction, no other code point will be correct for that stream.
For visualizing that, in Python3 interactive mode we could do:
In [19]: a = "Resumé for mr. Fernando"
In [20]: b = a.encode("utf-8"); b = b[:3] + b[4:]
In [21]: print (b.decode("utf-8"))
Resmé for mr. Fernando
In [22]: b = a.encode("utf-16"); b = b[:10] + b[11:]
In [23]: print (b.decode("utf-16", errors="replace"))
Resu 昀漀爀 洀爀⸀ 䘀攀爀渀愀渀搀漀�
(The slicing notation used in Python for those unfamiliar means [<begin (inclusive)>: <end (exclusive)>] as positions. Any parameter left blank is assumed to be the beginning or end of the sliced sequence)
And why "odd" should be obvious at this point: if you miss an even number of bytes, the decoder will still decode at the correct 2-byte boundaries; with an odd number, all subsequent boundaries will be wrong.
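A variation on the same experiment, as a sketch: this time the lost byte is in the middle of a multi-byte character, which is the harder case for UTF-8. The sample text and the choice of utf-16-le (to avoid the BOM) are assumptions made for the illustration.

```python
text = "一二三"

# UTF-8: drop the second byte of the first (3-byte) character.
utf8 = text.encode("utf-8")
damaged8 = utf8[:1] + utf8[2:]
recovered8 = damaged8.decode("utf-8", errors="replace")
print(recovered8)    # the damaged character is replaced, 二三 survive

# UTF-16: drop one byte of the first (2-byte) code unit.
utf16 = text.encode("utf-16-le")
damaged16 = utf16[:1] + utf16[2:]
recovered16 = damaged16.decode("utf-16-le", errors="replace")
print(recovered16)   # every following code unit is misaligned
```

Only the damaged character is lost in the UTF-8 case, while nothing after the missing byte survives in the UTF-16 case.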
The magic hides in the structure of the encoding. In UTF-8, the starting byte of an encoded character can never be mistaken for a byte that is not a starting byte. However, this is not true in UTF-16.
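UTF-8 structure
Code point range      Byte 1     Byte 2     Byte 3     Byte 4
U+0000 - U+007F       0xxxxxxx
U+0080 - U+07FF       110xxxxx   10xxxxxx
U+0800 - U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
U+10000 - U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx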
Let's simulate how a magic machine that receives UTF-8 encoded bytes as input and outputs code points behaves when one byte goes missing from a 3-byte UTF-8 sequence.
Say the UTF-8 encoded bytes of the two characters 一二 are e4b880 (一) and e4ba8c (二). The machine reads e4, which is 11100100, and so knows that a 3-byte code point follows. Unfortunately, the following byte b8 is missing. Next, 80 is read. Then the first byte of 二 is read, which is e4. However, e4 doesn't fit the rule: in a 3-byte code point, byte 3 needs to start with 10 according to the table above. Now the machine knows that some bytes are missing and the code point is broken. It will look around to find the right point to restart decoding. Obviously, the next byte ba is not a good start because it doesn't fit any of the byte 1 patterns in the table above. So it looks back and finds that the previous byte, e4, is exactly a good start byte to decode from.
In UTF-16, a byte can be mistaken for one that is not a starting byte, because at the byte level UTF-16 has no way to tell whether a byte starts a code unit or not.
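The machine described above can be sketched roughly as follows. The function names classify and next_start are hypothetical helpers invented for this illustration, not part of any library; they only look at the leading bits of each byte.

```python
def classify(byte: int) -> str:
    """Classify a byte by its leading bits, following the UTF-8 layout."""
    if byte >> 7 == 0b0:
        return "single"          # 0xxxxxxx
    if byte >> 6 == 0b10:
        return "continuation"    # 10xxxxxx
    if byte >> 5 == 0b110:
        return "lead of 2"       # 110xxxxx
    if byte >> 4 == 0b1110:
        return "lead of 3"       # 1110xxxx
    if byte >> 3 == 0b11110:
        return "lead of 4"       # 11110xxx
    return "invalid"


def next_start(data: bytes, pos: int) -> int:
    """Index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and classify(data[pos]) == "continuation":
        pos += 1
    return pos


# The stream from the example, with b8 lost: e4 80 e4 ba 8c
stream = bytes.fromhex("e480e4ba8c")
for b in stream:
    print(f"{b:02x}", classify(b))

# After the broken first character, resynchronize on the next start byte.
restart = next_start(stream, 1)
print(restart, stream[restart:].decode("utf-8"))   # 2 二
```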

What is encoding & decoding in communication?

Can someone please point me to some good references about encoding and decoding in communication and the different encoding techniques (unicode, base64, utf7, etc.)?
Wikipedia is always a good start.
Then there's always Joel Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Note that the three things you name operate on different levels.
Unicode is a character set: a mapping between characters and numbers (code points).
UTF7 maps between code points and bytes.
base64 maps between bytes and bytes. (It mangles bytes so that they are represented by bytes in the ASCII range.)
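The three levels can be seen side by side in a short sketch. The sample character is arbitrary, and UTF-8 is shown next to UTF-7 only for comparison.

```python
import base64

char = "é"

# Level 1 (Unicode): character <-> code point.
code_point = ord(char)
print(code_point, hex(code_point))       # 233 0xe9

# Level 2 (UTF-7, UTF-8, ...): code points <-> bytes.
utf8_bytes = char.encode("utf-8")
utf7_bytes = char.encode("utf-7")
print(utf8_bytes, utf7_bytes)

# Level 3 (base64): bytes <-> bytes restricted to the ASCII range.
b64_bytes = base64.b64encode(utf8_bytes)
print(b64_bytes)                         # b'w6k='
```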
The definitions of encoding and decoding are somewhat subjective.
Both are forms of transliteration, being the process of converting from one alphabet to another. ASCII to UTF8, ASCII to base64, etc are all examples of this.
What distinguishes the two is that "encoding" is often used when transliterating from a usable format to a transmission or intermediate format of some kind and decoding is the reverse. This is where the "subjective" bit comes in. ASCII to UTF8 can be viewed as encoding or decoding depending on the context.
Other formats like base64 are used almost universally for transmission only (eg binary data in email) and as such converting to them is almost universally called "encoding" and converting from as "decoding".
The important point to take away from all this is that something like ASCII or UTF8 is not magical in any way. All these formats are simply an agreed-upon encoding of information into a binary format. So ASCII 65 is 'A' for no other reason than that's the standard.
Unicode formats get more interesting because they make the distinction between the code point and the encoding. Unicode defines the code points for each character. The binary data is different for each encoding format. For example, see Unicode Character 'EURO-CURRENCY SIGN' (U+20A0) to see all the different binary values for one code point.
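As a quick sketch of that distinction, the same code point U+20A0 can be encoded with three Unicode encodings. The big-endian variants are chosen here only to avoid a byte order mark in the output.

```python
char = "\u20a0"   # EURO-CURRENCY SIGN, one code point

# One code point, three different byte sequences.
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(codec, char.encode(codec).hex())

# utf-8     e282a0
# utf-16-be 20a0
# utf-32-be 000020a0
```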
Regarding your unicode, base64, utf7 (no one uses utf7; you may mean utf8): they are not just "encoding & decoding" but encoding & decoding of text data.
Unicode is the way all real and possible characters are enumerated. It says nothing about encoding itself. UTF-XX is a set of encodings of Unicode (converting code points to actual bytes). The most popular are UTF8 and UTF16. Very basically, UTF8 is ASCII compatible (chars with codes < 128 are represented the same way as in ASCII), while other characters are represented by 2 to 4 bytes. UTF16 encodes most characters in 2 bytes (and the rest in 4).
Base64 has nothing to do with text data. It encodes generic binary data into text that consists of 64 printable ASCII characters. It is usually used to transfer binary data, or UTF8 and UTF16 text, via email.

How do I determine the character set of a string?

I have several files that are in several different languages. I thought they were all encoded UTF-8, but now I'm not so sure. Some characters look fine, some do not. Is there a way that I can break out the strings and try to identify the character sets? Perhaps split on white space then identify each word? Finally, is there an easy way to translate characters from one set to UTF-8?
If you don't know the character set for sure, you can only guess, basically. utf8::valid might help you with that, but you can't really know for sure. If you know that anything that isn't Unicode must be in one specific character set (like Latin-1), you're lucky. If you have no idea, you're screwed. In any case, you should always assume the whole file is in the same character set, unless otherwise specified. You will lose your sanity if you don't.
As for your question about how to convert between character sets: Encode is there to do that for you.
Determining whether a file is probably UTF-8 or not should be pretty easy. Determining the encoding if it is not UTF-8 would be very difficult in general.
If the file is encoded with UTF-8, the high bits of each byte should follow a pattern. If a character is one byte, its high bit will be cleared (zero). Otherwise, an n byte character (where n is 2–4) will have the high n bits of the first byte set to one, followed by a single zero bit. The following n - 1 bytes should all have the highest bit set and the second-highest bit cleared.
If all the bytes in your file follow these rules, it's probably encoded with UTF-8. I say probably, because anyone can invent a new encoding that happens to follow the same rules, deliberately or by chance, but interprets the codes differently.
Note that a file encoded with US-ASCII will follow these rules, but the high bit of every byte is zero. It's okay to treat such a file as UTF-8, since they are compatible in this range. Otherwise, it's some other encoding, and there's not an inherent test to distinguish the encoding. You'll have to use some contextual knowledge to guess.
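The bit-pattern rules above can be sketched as a small checker. The name looks_like_utf8 is a hypothetical helper for this illustration. It tests only the high-bit pattern described here, so it is looser than a real decoder, which also rejects overlong forms and surrogates; calling data.decode("utf-8") and catching UnicodeDecodeError is the stricter check.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Check only the high-bit pattern of each byte sequence."""
    i = 0
    while i < len(data):
        first = data[i]
        if first >> 7 == 0b0:
            length = 1            # 0xxxxxxx: one-byte character
        elif first >> 5 == 0b110:
            length = 2            # 110xxxxx
        elif first >> 4 == 0b1110:
            length = 3            # 1110xxxx
        elif first >> 3 == 0b11110:
            length = 4            # 11110xxx
        else:
            return False          # continuation byte or invalid lead byte
        tail = data[i + 1:i + length]
        # The following n - 1 bytes must all look like 10xxxxxx.
        if len(tail) != length - 1 or any(b >> 6 != 0b10 for b in tail):
            return False
        i += length
    return True


print(looks_like_utf8("naïve 一".encode("utf-8")))    # True
print(looks_like_utf8("naïve".encode("latin-1")))     # False
print(looks_like_utf8(b"plain ascii"))                # True
```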
Take a look at iconv
http://www.gnu.org/software/libiconv/
Text::Iconv