Encoding - what is it and why do we need it?

Can someone explain encoding and its importance to me? I understand that we have various encodings and that in each of them the first 127 characters are the same.

Read Joel Spolsky's excellent article on the subject.
An interesting point noted in the discussion of another answer (which I didn't really think the author needed to delete) is that there is a difference between a character set, which (in that author's words - I don't remember the username) defines a mapping between integers and characters (e.g. "capital A is 65"), and an encoding, which defines how those integers are represented as a byte stream. Most old character sets, such as ASCII, have only one very simple encoding: each integer becomes exactly one byte. The Unicode character set, on the other hand, has several different encodings, none of which is quite that simple: UTF-8, UTF-16, UTF-32...
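To make that distinction concrete, here is a minimal Java sketch (class and method names are mine, not from the discussion above): it prints the code points of a short string, which is the character-set side of the story, and then the byte stream each encoding turns those numbers into. UTF-32BE ships with common JDKs but is not one of the six charsets every Java platform is required to provide.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CharsetVsEncoding {
        public static void main(String[] args) {
            String s = "A€";  // U+0041 and U+20AC: the character set fixes these numbers
            System.out.println("Code points: " + s.codePoints()
                    .mapToObj(cp -> String.format("U+%04X", cp))
                    .reduce((a, b) -> a + " " + b)
                    .orElse(""));

            // ...while each encoding turns the same numbers into a different byte stream.
            for (Charset cs : new Charset[] {StandardCharsets.UTF_8,
                                             StandardCharsets.UTF_16BE,
                                             Charset.forName("UTF-32BE")}) {
                System.out.printf("%-8s -> %s%n", cs, toHex(s.getBytes(cs)));
            }
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02X ", b));
            }
            return sb.toString().trim();
        }
    }

Running it should print U+0041 U+20AC once, followed by 41 E2 82 AC for UTF-8, 00 41 20 AC for UTF-16BE and 00 00 00 41 00 00 20 AC for UTF-32BE: one character set, three byte representations.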

Apart from the article mentioned above by Aasmund Eldhuset, I find this TEDx Talk really interesting and explanatory on the same topic.
Hope this helps!

I think encoding is often confused with encryption: encryption is the technique of converting your message into a form that is not readable by unauthorized persons, so that you can maintain secrecy; encoding merely defines how characters are represented as bytes.

Related

Why are there different encoding types?

This is a noob question, but I want to know why there are different encoding types and what their differences are (e.g. ASCII, UTF-8 and UTF-16, Base64, etc.).
Reasons are many, I believe, but the main point is: "How many characters do you need to display (encode)?" If you live in the US, for example, you could go pretty far with ASCII. But in many countries we need characters like ä, å, ü etc. (If SO were ASCII-only, or you tried to read this text as ASCII-encoded text, you'd see some weird characters in the places of ä, å and ü.) Think also of China, Japan, Thailand and other "exotic" countries. Those weird figures in photos you may have seen from around the world just might be letters, not pretty pictures.
As for the differences between different encoding types you need to see their specification. Here's something for UTF-8.
http://www.unicode.org/standard/standard.html
http://www.utf-8.com/
http://en.wikipedia.org/wiki/UTF-8#Compared_to_other_multi-byte_encodings
I'm not familiar with UTF-16. Here's some information about the differences.
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Unicode_plane
Base64 is used when there is a need to encode binary data that has to be stored and transferred over media designed to deal with textual data. If you've ever built some sort of email system with PHP, you've probably encountered Base64 (a small sketch follows after the links below).
http://en.wikipedia.org/wiki/Base64
http://www.phpeveryday.com/articles/PHP-Email-Using-Embedded-Images-in-HTML-Email-P113.html
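Since the question also lists Base64, here is a small sketch of the idea in Java (java.util.Base64 has been in the JDK since Java 8; the byte values are just an example, the signature bytes of a PNG file): arbitrary binary data becomes plain ASCII text that survives text-only channels such as e-mail bodies or JSON fields, and the decoder recovers the original bytes.

    import java.util.Arrays;
    import java.util.Base64;

    public class Base64Demo {
        public static void main(String[] args) {
            // Arbitrary binary data: the signature bytes of a PNG file.
            byte[] binary = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

            // Base64 turns the bytes into plain ASCII text.
            String text = Base64.getEncoder().encodeToString(binary);
            System.out.println("Base64: " + text);   // iVBORw0KGgo=

            // Decoding gives the original bytes back unchanged.
            byte[] roundTripped = Base64.getDecoder().decode(text);
            System.out.println("Round trip ok: " + Arrays.equals(binary, roundTripped));
        }
    }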
In short: to support localization of computer programs' user interfaces into many different languages. (Programming languages themselves still mainly consist of characters found in ASCII, although it is possible, for example in Java, to use UTF-8 characters in variable names, and source code files are usually stored as something other than plain ASCII text, for example in UTF-8.)
In short, vol. 2: whenever different people try to solve the same problem, each from their own specific point of view (or even without one, if that's even possible), the results may be quite different. A quote from Joel's Unicode article (link below): "Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255."
Thanks to Joachim and tchrist for all the info and discussion. Here are two articles I just read. (Both links are on the page I linked to earlier.) I'd forgotten most of the stuff from Joel's article since I last read it a few years back. It's a good introduction to the subject, I hope. Mark Davis goes a little deeper.
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
The real reason why there are so many variants is that the Unicode Consortium came along too late.
In The Beginning, memory and storage were expensive, and using more than 8 (or sometimes only 7) bits of memory to store a single character was considered excessive. Thus pretty much all text was stored using 7 or 8 bits per character. Clearly, 8 bits are not enough memory to represent the characters of all human languages. They're barely enough to represent most characters used in a single language (and for some languages even that's not possible). Therefore many different character encodings were designed to allow different languages (English, German, Greek, Russian, ...) to encode their texts in 8 bits per character. After all, a single text file (and usually even a single computer system) will only ever be used in a single language, right?
This led to a situation where there was no single agreed-upon mapping of characters to numbers of any kind. Many different, incompatible solutions were produced, and no real central control existed. Some computer systems used ASCII, others EBCDIC (or, more precisely, one of the many variations of EBCDIC), ISO-8859-* (or one of its many derivatives), or any of a big list of encodings that are rarely heard of now.
Finally, the Unicode Consortium stepped up to the task of producing that single mapping (together with lots of auxiliary data that's useful but outside the bounds of this answer).
When the Unicode consortium finally produced a fairly comprehensive list of characters that a computer might represent (together with a number of encoding schemes to encode them to binary data, depending on your concrete needs), the other character encoding schemes were already widely used. This slowed down the adoption of Unicode and its encodings (UTF-8, UTF-16) considerably.
These days, if you want to represent text, your best bet is to use one of the few encodings that can represent all Unicode characters. UTF-8 and UTF-16 together should suffice for 99% of all use cases; UTF-32 covers almost all the others. And just to be clear: all the UTF-* encodings can encode all valid Unicode characters. But because UTF-8 and UTF-16 are variable-width encodings, they might not be ideal for all use cases. Unless you need to interact with a legacy system that can't handle those encodings, there is rarely a reason to choose anything else these days.
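To illustrate the variable-width point, here is a quick Java sketch (names are mine) that counts how many bytes a single code point costs in each encoding; UTF-16BE is used instead of plain UTF-16 only to avoid the 2-byte byte-order mark that the latter's encoder prepends.

    import java.nio.charset.StandardCharsets;

    public class VariableWidth {
        public static void main(String[] args) {
            // One code point each: ASCII, Latin with a diacritic, CJK,
            // and an emoji outside the Basic Multilingual Plane.
            String[] samples = {"A", "ä", "語", "😀"};

            System.out.println("char    UTF-8  UTF-16  UTF-32");
            for (String s : samples) {
                System.out.printf("%-6s %6d %7d %7d%n",
                        s,
                        s.getBytes(StandardCharsets.UTF_8).length,
                        s.getBytes(StandardCharsets.UTF_16BE).length,
                        s.codePointCount(0, s.length()) * 4);  // UTF-32: always 4 bytes per code point
            }
        }
    }

The expected counts are 1/2/4 bytes for "A", 2/2/4 for "ä", 3/2/4 for "語" and 4/4/4 for the emoji, which is exactly what "variable width" means in practice.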
The main reason is to be able to show more characters. When the internet was in its infancy, no one really planned ahead, thinking that one day there would be people using it from every country and in every language around the world. So a small character set was good enough. Gradually it was revealed to be limited and English-centric, hence the demand for bigger character sets.

When to use Unicode (as opposed to non-Unicode)?

I haven't found much (concise) info about when exactly to use Unicode. I understand that many say best practice is to always use Unicode. But Unicode strings DO have a larger memory footprint. Am I correct to say that Unicode must be used only when
Printing something to the screen for other than local (for example, debugging) use.
Generally, sending any type of text across a network with the two ends being in different locales/country
When you're not sure which to use
I think it would be beneficial if someone explained (concisely) the basics of what actually happens with Unicode... am I correct to say that things get messy when:
the physical (byte) string gets sent to a machine that uses a different representation of strings (code page, etc.; this is already a detail, although an interesting one) from the sender's.
The context is using Unicode in a programming language (say C++), but I hope answers to this question can be used for any encoding situation.
Also, I'm aware Unicode and NLS are not the same thing, but is it correct to say that NLS implies usage of Unicode?
P.S. awesome site
Always use Unicode, it will save you and others a lot of pain.
What you may have confused is the issue of encoding. Unicode strings do not necessarily take more memory than the equivalent ASCII (or other encoding) strings, that depends a lot on the encoding used.
Sometimes "Unicode" is used as a synonym for "UCS-2" or "UTF-16". Strictly speaking that use is wrong, because "Unicode" is the standard that defines the set of characters and their unicode codepoints. It does not as such define a mapping to bytes (or words). UTF-16, UTF-8 and other encoding take over the job of mapping the characters to concrete bytes.
The beauty of Unicode is that it frees you from restrictions and lots of headaches. Unicode is the largest character set available to date, i.e. it enables you to actually encode and use virtually any character of any halfway mainstream language in use today. With any other character set you need to think about whether it can actually encode a character or not. Latin-1 cannot encode the character "あ", Shift-JIS cannot encode the character "ڥ" and so on. Only if you're very sure you will never ever need anything other than basic Latin/Arabic/Japanese/whatever other subset of characters should you choose a specialized encoding such as Latin-1, BIG-5, Shift-JIS or ASCII.
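If you want to check that kind of restriction programmatically, CharsetEncoder.canEncode does the job. A minimal Java sketch follows; note that Shift_JIS is an extended charset that ships with common JDKs, but strictly speaking only a handful of charsets is guaranteed to be present everywhere.

    import java.nio.charset.Charset;

    public class CanEncodeDemo {
        public static void main(String[] args) {
            check("ISO-8859-1", "あ");   // Latin-1 has no Japanese hiragana
            check("Shift_JIS", "ڥ");     // Shift-JIS has no Arabic letters
            check("UTF-8", "あڥ");       // a Unicode encoding handles both
        }

        private static void check(String charsetName, String text) {
            boolean ok = Charset.forName(charsetName).newEncoder().canEncode(text);
            System.out.printf("%-10s can encode \"%s\": %s%n", charsetName, text, ok);
        }
    }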
Unicode is the most versatile charset available and therefore a good standard to adhere to.
The Unicode encodings are nothing special, they're just a little more complex in their bit representation since they have to encode many more characters while still trying to be space efficient. For a very detailed excursion into this topic, please see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
I have a little utility which is sometimes helpful in seeing the difference between character encodings: http://sodved.awardspace.info/unicode.pl. If you paste ö into the Raw (UTF-8) field, you will see that it is represented by different byte sequences in different encodings. And as the other two good answers describe, some non-Unicode encodings cannot represent it at all.

What is exactly an overlong form/encoding?

Reading the Wikipedia article on UTF-8, I've been wondering about the term overlong. This term is used several times, but the article doesn't provide a definition or reference for its meaning.
I would like to know if someone can explain the term and its purpose.
It's an encoding of a code point which takes more code units than it needs to.
For example, U+0020 is represented in UTF-8 by the single byte 0x20. If you decode the two bytes 0xc0 0xa0 in the normal fashion, you'll still end up back at U+0020, but that's an invalid representation.
The Unicode Corrigendum #1 has more information, particularly around table 3.1B.
UTF-8's byte layout would in principle allow different representations of characters that also have a shorter one. For example, you could encode an ASCII character in two bytes by padding its payload bits with leading zeros. The UTF-8 specification explicitly forbids such overlong forms.
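Here is a small Java sketch of that rule in action (class and method names are mine): a strict UTF-8 decoder accepts the one-byte form of U+0020 but rejects the overlong two-byte form 0xC0 0xA0 mentioned above, as the standard requires. Without CodingErrorAction.REPORT, the decoder would quietly substitute U+FFFD instead of throwing.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class OverlongDemo {
        public static void main(String[] args) {
            tryDecode(new byte[] {0x20});                      // shortest form of U+0020
            tryDecode(new byte[] {(byte) 0xC0, (byte) 0xA0});  // overlong form: same value, two bytes
        }

        private static void tryDecode(byte[] bytes) {
            try {
                String s = StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)  // be strict, don't substitute U+FFFD
                        .decode(ByteBuffer.wrap(bytes))
                        .toString();
                System.out.println("decoded to U+" + String.format("%04X", (int) s.charAt(0)));
            } catch (CharacterCodingException e) {
                System.out.println("rejected as malformed: " + e);
            }
        }
    }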

Can I get a single canonical UTF-8 string from a Unicode string?

I have a twelve-year-old Windows program. As may be obvious to the knowledgeable, it was designed for ASCII characters, not Unicode. Most of it has been converted, but there's one spot that still needs to be changed over. There is a serious constraint on it though: the exact same ASCII byte sequence MUST be created by different encoders, some of which will be operating on non-Windows systems.
I'm trying to determine whether UTF-8 will do the trick or not. I've heard in passing that different UTF-8 sequences can come up with the same Unicode string, which would be a problem here.
So the question is: given a Unicode string, can I expect a single canonical UTF-8 sequence to be generated by any standards-conforming implementation of a converter? Or are there multiple possibilities?
Any given Unicode string will have only one representation in UTF-8.
I think the confusion here is that there are multiple ways in Unicode to get the same visual output for some languages. Not to mention that Unicode has several characters that have no visual representation.
But this has nothing to do with UTF-8; it's a property of Unicode itself. The encoding of a given Unicode string as UTF-8 is a purely mechanical process, and it's perfectly reversible.
The conversion rules are here:
http://en.wikipedia.org/wiki/UTF-8
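A tiny Java sketch of that "purely mechanical, perfectly reversible" point (the sample text and class name are mine): for any well-formed Unicode string, a conforming encoder produces exactly one UTF-8 byte sequence, and decoding that sequence returns the original string.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class RoundTrip {
        public static void main(String[] args) {
            String original = "Þή語😀";  // arbitrary sample text

            // Encoding is deterministic: a conforming encoder always produces these bytes...
            byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
            System.out.println("UTF-8 bytes: " + Arrays.toString(utf8));

            // ...and decoding them yields exactly the original string.
            String decoded = new String(utf8, StandardCharsets.UTF_8);
            System.out.println("Lossless round trip: " + original.equals(decoded));
        }
    }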
As John already said, there is only one standards-conforming UTF-8 representation.
But the tricky point is "standards-conforming".
Older encoders are usually unable to properly convert UTF-16 because of surrogates.
Java's "modified UTF-8" (used by DataOutputStream.writeUTF and JNI, though not by String.getBytes("UTF-8")) is one notable case of those non-conforming converters: it produces two 3-byte sequences instead of one 4-byte sequence.
MySQL had problems until recently, and I am not sure about the current status.
Now, you will only have problems with code points that need surrogates, meaning those above U+FFFF. If your application survived without Unicode for a long time, it means you never needed to move such "esoteric" characters :-)
But it is good to get things right from the get-go.
Try using standards-conforming encoders and you will be fine.
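To see what "standards-conforming" means in concrete bytes, here is a hedged Java sketch (names are mine) using U+1D11E, a code point above U+FFFF: a conforming UTF-8 encoder such as String.getBytes emits one 4-byte sequence, while Java's "modified UTF-8" writer encodes each surrogate separately and emits 6 bytes.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class SupplementaryDemo {
        public static void main(String[] args) throws IOException {
            // U+1D11E MUSICAL SYMBOL G CLEF lies above U+FFFF, so UTF-16 needs a surrogate pair.
            String clef = new String(Character.toChars(0x1D11E));
            System.out.println("UTF-16 code units:    " + clef.length());                                // 2

            // A conforming UTF-8 encoder produces a single 4-byte sequence.
            System.out.println("UTF-8 bytes:          " + clef.getBytes(StandardCharsets.UTF_8).length); // 4

            // Java's "modified UTF-8" (writeUTF) encodes each surrogate on its own: 6 bytes.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            new DataOutputStream(out).writeUTF(clef);
            System.out.println("modified UTF-8 bytes: " + (out.size() - 2));  // minus the 2-byte length prefix
        }
    }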

Dummy's guide to Unicode

Could anyone give me a concise definitions of
Unicode
UTF7
UTF8
UTF16
UTF32
Codepages
How they differ from ASCII/ANSI/Windows-1252
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you want a really brief introduction:
Unicode in 5 Minutes
Or if you are after one-liners:
Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different than the Unicode assignments
ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
ANSI: a standards body (on Windows, "ANSI" is also used loosely for the legacy system code page, e.g. Windows-1252)
Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode
Þ (LATIN CAPITAL LETTER THORN)
ﬁ (LATIN SMALL LIGATURE FI)
ή (GREEK SMALL LETTER ETA WITH TONOS)
or 13 other characters, depending on the encoding and character set used (the sketch after this list decodes that byte under a few of them).
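Here is a quick Java sketch of that ambiguity (the charset choices are mine; the ligature case comes from Mac Roman, whose "x-MacRoman" charset is not guaranteed to exist on every JRE, hence the check): the same byte 0xDE decodes to three different characters under three legacy charsets.

    import java.nio.charset.Charset;

    public class AmbiguousByte {
        public static void main(String[] args) {
            byte[] data = {(byte) 0xDE};
            // The same single byte, decoded under three different legacy charsets.
            for (String name : new String[] {"windows-1252", "ISO-8859-7", "x-MacRoman"}) {
                if (!Charset.isSupported(name)) {          // x-MacRoman is not on every JRE
                    System.out.println(name + ": not available here");
                    continue;
                }
                String decoded = new String(data, Charset.forName(name));
                System.out.printf("%-12s -> %s (U+%04X)%n", name, decoded, (int) decoded.charAt(0));
            }
        }
    }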
As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...
Yeah, I got some insight; it might be wrong, but it's helped me understand it.
Let's just take some text. It's stored in the computer's RAM as a series of bytes, and the codepage is simply the mapping table between those bytes and the characters you and I read. So something like Notepad comes along with its codepage and translates the bytes to your screen, and you see a bunch of garbage, upside-down question marks, etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter at detecting the correct codepage to use than others, and some streams of bytes in memory contain a BOM (Byte Order Mark), which can declare the correct encoding to use.
UTF-7, UTF-8, UTF-16 etc. are all just different codepages using different formats.
The same file stored as bytes using different codepages will have a different file size because the bytes are stored differently.
They also don't really differ from Windows-1252 in kind, as that's just another codepage.
For a better, smarter answer try one of the links.
Here, read this wonderful explanation from Joel himself.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Others have already pointed out good enough references to begin with. I'm not listing a true dummy's guide, but rather some pointers from the Unicode Consortium's own pages, where you'll find some more of the nitty-gritty reasons for the usage of different encodings.
The Unicode FAQ is a good enough place to answer some (not all) of your queries.
A more succinct answer on why Unicode exists is given in the Newcomer's section of the Unicode website itself:
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32 bits) of data for each character.
A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you need uniform, fixed-width storage for every character (a rough size comparison follows below).
By the way, UTF-7 is not part of the Unicode Standard; it was designed primarily for use in mail applications.
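For a rough feel of that storage trade-off, here is a Java sketch (the sample strings are mine): Latin-script text is about twice as big in UTF-16 as in UTF-8, while kana/CJK text comes out roughly a third smaller in UTF-16.

    import java.nio.charset.StandardCharsets;

    public class SizeComparison {
        public static void main(String[] args) {
            String english  = "The quick brown fox jumps over the lazy dog";
            String japanese = "いろはにほへとちりぬるを";  // mostly 3 bytes per character in UTF-8

            report("English ", english);
            report("Japanese", japanese);
        }

        private static void report(String label, String text) {
            System.out.printf("%s: %2d chars, UTF-8 %3d bytes, UTF-16 %3d bytes%n",
                    label,
                    text.codePointCount(0, text.length()),
                    text.getBytes(StandardCharsets.UTF_8).length,
                    text.getBytes(StandardCharsets.UTF_16BE).length);
        }
    }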
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
First of all, there aren't "variations of Unicode". Unicode is a standard, the standard, for assigning code points (integers) to characters. UTF-8 is the most popular way to represent those integers as bytes!
Why should you care as a programmer?
It's fun to understand this!
If you don't have a basic understanding of encodings, you can easily produce buggy code.
Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string Hello. Your program works! One day after shipping your code, your German customer calls: "We have a problem, äöü are not displayed correctly!" You start debugging the code, feeling pretty lost without a basic understanding of encodings. However, with an understanding of encodings you know that the error was probably this: when running myByteArray.toString(), your program assumed the string was encoded with the default system encoding. But maybe it wasn't! Maybe it was UTF-8 and your system is LATIN-SOMETHING, so you should have run myByteArray.toString("UTF8") instead!
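The myByteArray.toString("UTF8") call above is pseudocode; in Java the same bug and its fix look roughly like this sketch (class name and sample data are mine). Note that on a platform whose default charset is already UTF-8 (the default since Java 18), both lines happen to print the same thing, which is exactly why this kind of bug can hide for a long time.

    import java.nio.charset.StandardCharsets;

    public class DefaultCharsetBug {
        public static void main(String[] args) {
            // Bytes that arrived "from somewhere", actually UTF-8 encoded text.
            byte[] received = "Grüße".getBytes(StandardCharsets.UTF_8);

            // Bug: relies on the platform default charset, which may be Latin-1, windows-1252, ...
            String risky = new String(received);

            // Fix: state the encoding you expect the bytes to be in.
            String correct = new String(received, StandardCharsets.UTF_8);

            System.out.println("default charset: " + risky);
            System.out.println("explicit UTF-8 : " + correct);
        }
    }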
Resources:
I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back, and the essence of it didn't stick in my brain because there are so many unimportant details.
As already mentioned, http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go to grasp the essence of Unicode.
If you want to actually understand variable-length encodings like UTF-8, I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.