Unicode is awesome. There aren't too many people who disagree with this.
Apart from Python 3 (which did it wrong), what would be the negative impact (if any) of the next major version of all programming languages defaulting to using Unicode/UTF-8 strings?
I'm talking specifically about the many cases which require workarounds to get UTF-8. For example, running a Java program:
java ... -Dfile.encoding=UTF-8
Or working with strings in Python 2:
# -*- coding: utf8 -*-
unicode_string = u"This is Unicode Text"
Certain MySQL databases use a different character encoding by default:
etc. etc.
Why don't we all just default to using Unicode/UTF-8 and allow users to use the workarounds if they need support for other character encodings? What would be the problems with doing this?

UTF-8 is a variable-length encoding, which is slower to parse than fixed-length encodings. Example: the 7th character of an ASCII string is always the 7th byte. We don't know exactly where the 7th character of a UTF-8 string is in memory without starting from the beginning of the string and parsing the whole thing. For long strings this can be expensive.
So for string operations where finding specific substrings based on character/byte position is important (SQL databases are a great example of this) other encodings can often be preferable.
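To make the indexing point concrete, here is a minimal Python sketch. The sample strings and the helper name `nth_char_utf8` are made up for illustration; the point is only that a byte offset equals a character offset in ASCII but not in UTF-8.

```python
# Sample strings are arbitrary; any ASCII / non-ASCII text would do.
ascii_text = "encoding"
cyrillic_text = "кодировка"

ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = cyrillic_text.encode("utf-8")

# ASCII: the 7th character is simply the 7th byte (index 6).
seventh_ascii = chr(ascii_bytes[6])

# UTF-8: each Cyrillic letter takes two bytes here, so byte index 6 is
# the start of the 4th character, not the 7th.
char_at_byte_6 = utf8_bytes[6:8].decode("utf-8")


def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) by scanning from the start.

    Hypothetical helper: every byte that is not a continuation byte
    (bit pattern 10xxxxxx) starts a new character.
    """
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    begin = starts[n]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[begin:end].decode("utf-8")


seventh_utf8 = nth_char_utf8(utf8_bytes, 6)
```

The scan in `nth_char_utf8` is linear in the length of the string, which is the cost the answer is describing.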
Additionally, UTF-8 encodes non-English text (outside the ASCII range) as two or more bytes per character, while many legacy character encodings (KOI8-R for Russian, for example) encode all of the commonly used characters of one language in a single byte each. That is handy for media such as email, where all the data must be sent over the network.
GB2312 is the primary Simplified Chinese character set. It encodes each Chinese character in two bytes, while those same characters take 3 bytes each in UTF-8 (a 50% increase).
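The size difference is easy to measure. This is a small sketch using Python's built-in `koi8_r` and `gb2312` codecs; the sample sentences are arbitrary and chosen only for illustration.

```python
russian = "Привет, мир"   # 9 Cyrillic letters plus a comma and a space
chinese = "中文编码"        # 4 Simplified Chinese characters

sizes = {
    "koi8-r": len(russian.encode("koi8_r")),
    "utf-8 (Russian)": len(russian.encode("utf-8")),
    "gb2312": len(chinese.encode("gb2312")),
    "utf-8 (Chinese)": len(chinese.encode("utf-8")),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For these samples the Russian text roughly doubles in UTF-8 and the Chinese text grows by half, which matches the figures above.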
UTF-8 is amazing for compatibility, but in terms of how it represents characters in memory, other encodings outcompete it in a lot of scenarios.


I cannot understand some key elements of encoding:
Is ASCII only a character set, or does it also have its own encoding algorithm?
Do other Windows code pages, such as Latin1, have their own encoding algorithms?
Are UTF-7, UTF-8, UTF-16 and UTF-32 the only encoding algorithms?
Are the UTF algorithms used only with the Unicode character set?
Given the ASCII text "Hello World", if I want to convert it into Latin1 or Big5, which encoding algorithms are used in the process? More specifically, do Latin1 and Big5 use their own encoding algorithms, or do I have to use a UTF algorithm?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
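The combining-character oddity can be seen with Python's standard `unicodedata` module. This is only a sketch; the letter "é" is an arbitrary example of a character that exists both precombined and as a base letter plus a combining mark.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Both render as the same visible character but are different sequences.
same_without_normalizing = decomposed == precomposed

# NFC composes where a precombined code point exists; NFD splits it again.
composed = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)

# Even in UTF-32 the decomposed form occupies two 4-byte units.
utf32_units = len(decomposed.encode("utf-32-le")) // 4
```

So even with a fixed-width encoding, "one unit per visible character" does not hold once combining marks are involved.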
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
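As an illustration with Python's built-in codecs (not the library mentioned above), feeding Big5 bytes to a UTF-8 decoder is rejected rather than silently accepted. The sample text is arbitrary.

```python
big5_bytes = "中文".encode("big5")   # sample text, chosen for illustration

try:
    big5_bytes.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # A well-behaved decoder reports the problem instead of crashing.
    outcome = f"rejected: {exc.reason}"

# The matching decoder recovers the original text.
round_trip = big5_bytes.decode("big5")
```

A hand-written decoder without this kind of validation is where the crash risk comes from.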
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
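The "Hello World" case, and the lossy direction, can be checked in a few lines of Python. The word "café" is just an assumed example of text containing a character above 127.

```python
text = "Hello World"
# Identical bytes in both encodings, so no real conversion takes place.
same_bytes = text.encode("ascii") == text.encode("latin-1")

accented = "café"
latin1 = accented.encode("latin-1")

# Lossy conversion: characters ASCII cannot represent become '?'.
lossy = latin1.decode("latin-1").encode("ascii", errors="replace")

# Strict conversion: the same input is an outright failure.
try:
    latin1.decode("latin-1").encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Whether to use `errors="replace"` or let the exception propagate is exactly the "tolerance level for lossiness" decision described above.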
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
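In Python the "Unicode as a middleman" route is literally decode-then-encode. A minimal sketch, assuming `cp037` as the EBCDIC variant (there are many EBCDIC code pages) and a hypothetical helper named `convert`:

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two encodings via Unicode code points."""
    return data.decode(source).encode(target)


ebcdic_a = b"\xc1"                                  # "A" in EBCDIC (cp037)
windows_a = convert(ebcdic_a, "cp037", "cp1250")    # "A" in Windows-1250

# The same two codecs also handle the reverse direction.
back_again = convert(windows_a, "cp1250", "cp037")
```

Only one table per encoding is needed (to and from Unicode), instead of one table per pair of encodings.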
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both, it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes
No, there are others as well: ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Unicode assigns every character it covers a unique number (counting up from 0). How that number is used is up to you: you can use it directly, or you can use a known encoding scheme such as UTF-8 or UTF-16. An encoding scheme maps the Unicode value to a specific bit sequence (from 1 to 4 bytes today, or maybe 8 in the future if we ever learn all the languages of the universe/aliens/multiverse) so that it can be uniquely identified within that scheme.
For example, ASCII is an encoding scheme which encodes only 128 of those characters. It uses one byte for every character, which is identical to the UTF-8 representation. GSM7 is another format, which uses 7 bits per character to encode 128 characters from the Unicode character list.
UTF-8 uses 1 byte for characters whose Unicode value is up to 127.
Beyond this it has its own way of representing the Unicode values.
It uses 2 bytes for Cyrillic and 3 bytes for Hindi characters.
UTF-16 uses 2 bytes for characters whose Unicode value is up to 127,
and it also uses 2 bytes for Cyrillic and Hindi characters.
The variable-width UTF encoding schemes fix the leading bits of each unit in a specific pattern [e.g. 110|restbits], and the remaining bits [e.g. initialbits|11001] carry the Unicode value, so that each representation is unique.
Wikipedia on utf8, utf16, unicode will make it clear.
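The fixed-prefix idea can be sketched by hand for UTF-8. This is an illustrative encoder, not the translator mentioned below; the function name `encode_utf8` is made up, and the result is compared against Python's built-in codec.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:          # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Latin, Cyrillic, Devanagari (Hindi) and an emoji outside the BMP.
samples = ["A", "\u0436", "\u0939", "\U0001F600"]
lengths = [len(encode_utf8(ord(ch))) for ch in samples]
```

The byte counts for the four samples line up with the 1, 2 and 3 byte cases listed above, plus the 4-byte case for characters beyond U+FFFF.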
I coded a UTF translator which converts incoming UTF-8 text in any language into its equivalent UTF-16 text.

How do code pages work for Chinese / Japanese?
A single byte cannot encode all the characters of these languages, so how does it work?
Note that I'm talking about pre-Unicode times.
I'm most familiar with Japanese, but in general the strategy is the same for any language that needs more characters than fit in a single byte - you use a variable width multibyte encoding where some bytes are recognized as starting a "wide" character and ASCII is left alone.
In the early days so-called "ASCII-safe" encodings were useful. These used only seven bits (the high bit was always 0) so they worked with a variety of systems (including hardware) that expected only control characters to set the high bit in any byte. ISO-2022-JP is one of these and is still used in email quite often (mostly on feature phones).
Here's what ISO-2022-JP looks like if you don't decode it:
echo "日本語" | iconv -f utf8 -t iso2022jp | cat -v
Note that "test" comes through unchanged and all other characters are valid ASCII; ^[ is an ASCII escape character. (ISO-2022 also has 8-bit versions, but the 7-bit is the most commonly used variety.)
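The same properties can be checked from Python with its built-in `iso2022_jp` codec. The sample string below (Japanese followed by the word "test") is an assumption made for this sketch.

```python
sample = "日本語 test"   # assumed sample: Japanese text plus an ASCII word
encoded = sample.encode("iso2022_jp")

# Every byte stays in the 7-bit range, so the data is "ASCII-safe".
seven_bit = all(b < 0x80 for b in encoded)

# The Japanese part is introduced by an escape sequence (ESC $ B) ...
starts_with_escape = encoded.startswith(b"\x1b$B")

# ... and the ASCII part is carried through as plain ASCII bytes.
ascii_tail_intact = encoded.endswith(b" test")

print(encoded)
```

Decoding with the same codec restores the original string.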
Later variable width encodings, like EUC, Shift-JIS, and UTF-8 all work on the same principle except they use binary (non-ASCII) escapes, so the first byte of any multi-byte character has the high bit set (that is, the unsigned byte value is 128 or greater). The Wikipedia article for UTF-8 has a nice table explaining how UTF-8 handles this. Just like the older ASCII-safe encodings, these leave ASCII strings unmodified.
There also exist fixed-width multibyte encodings, but they're relatively uncommon. There was an attempt to popularize an encoding that just used two bytes for everything, called "UCS-2", but it ended up not having room for enough characters and was mostly superseded by variable width UTF-16 in the 1990s. UTF-16 is (practically speaking) the internal encoding used in Java and Javascript, but due to the history with UCS-2 sometimes things like string length work in strange ways.
Technically fixed-width UTF-32 exists, but it's not widely used and I've never personally encountered it in the wild.
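The "strange string length" effect comes from surrogate pairs, which can be sketched in Python. The helper names `utf16_units` and `surrogate_pair` are hypothetical, and the sample characters are arbitrary.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into high and low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_char = "\u8a9e"           # 語, inside the Basic Multilingual Plane
astral_char = "\U0001F600"    # an emoji, outside the BMP

units_bmp = utf16_units(bmp_char)
units_astral = utf16_units(astral_char)
pair = surrogate_pair(ord(astral_char))
```

A language that reports string length in UTF-16 code units would count the second character as 2, while Python, which counts code points, reports 1.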

Is it true that unicode=utf16?
Many are saying Unicode is a standard, not an encoding, but most editors support save as Unicode encoding actually.
As Rasmus states in his article "The difference between UTF-8 and Unicode?":
If asked the question, "What is the difference between UTF-8 and
Unicode?", would you confidently reply with a short and precise
answer? In these days of internationalization all developers should be
able to do that. I suspect many of us do not understand these concepts
as well as we should. If you feel you belong to this group, you should
read this ultra short introduction to character sets and encodings.
Actually, comparing UTF-8 and Unicode is like comparing apples and
oranges: UTF-8 is an encoding - Unicode is a character set.
A character set is a list of characters with unique numbers (these
numbers are sometimes referred to as "code points"). For example, in
the Unicode character set, the number for A is 41.
An encoding on the other hand, is an algorithm that translates a
list of numbers to binary so it can be stored on disk. For example
UTF-8 would translate the number sequence 1, 2, 3, 4 like this:
00000001 00000010 00000011 00000100
Our data is now translated into binary and can now be saved to disk.
All together now
Say an application reads the following from the disk:
1101000 1100101 1101100 1101100 1101111
The app knows this data represent a Unicode string encoded with
UTF-8 and must show this as text to the user. First step, is to
convert the binary data to numbers. The app uses the UTF-8 algorithm
to decode the data. In this case, the decoder returns this:
104 101 108 108 111
Since the app knows this is a Unicode string, it can assume each
number represents a character. We use the Unicode character set to
translate each number to a corresponding character. The resulting
string is "hello".
So when somebody asks you "What is the difference between UTF-8 and
Unicode?", you can now confidently answer short and precise:
UTF-8 (Unicode Transformation Format) and Unicode cannot be compared. UTF-8 is an encoding
used to translate numbers into binary data. Unicode is a character set
used to translate characters into numbers.
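The two steps from the quoted article (bytes to numbers, numbers to characters) can be followed in Python. This is only a sketch of the article's example, using the same five bytes.

```python
# The bytes "read from disk" in the quoted example.
raw = bytes([0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111])

# Step 1: for these ASCII-range bytes, UTF-8 decoding yields one number
# per byte.
numbers = list(raw)

# Step 2: the Unicode character set maps each number to a character.
text = "".join(chr(n) for n in numbers)

# The built-in decoder performs both steps at once.
decoded = raw.decode("utf-8")

# The character set also works in the other direction.
number_for_A = ord("A")   # 65 in decimal, 41 in hexadecimal
```

For code points above 127 step 1 is no longer one byte per number, which is where the UTF-8 algorithm actually does some work.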
most editors support save as ‘Unicode’ encoding actually.
This is an unfortunate misnaming perpetrated by Windows.
Because Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (the system codepage on the current machine, subject to total unportability) and there are Unicode strings (stored internally as UTF-16LE).
This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.
This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.
(Other editors that do encodings themselves, like Notepad++, don't have this problem.)
If it makes you feel any better about it, ‘ANSI’ strings aren't based on any ANSI standard, either.
It's not that simple.
UTF-16 is a 16-bit, variable-width encoding. Simply calling something "Unicode" is ambiguous, since "Unicode" refers to an entire set of standards for character encoding. Unicode is not an encoding!
and of course, the obligatory Joel On Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) link.
There's a lot of misunderstanding being displayed here. Unicode isn't an encoding, but the Unicode standard is devoted primarily to encoding anyway.
ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin Capital Letter A" or "Greek small letter alpha") and a set of code points (a number assigned to each -- for example, 41 hexadecimal and 3B1 hexadecimal for those two respectively; for Unicode code points, the standard notation would be U+0041 and U+03B1).
At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. That was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique to try to keep the number of necessary characters to a minimum (Han Unification -- basically treating Chinese, Japanese and Korean characters that were quite a bit alike as being the same character).
Since then, the Unicode consortium has tacitly admitted that that wasn't going to work, and now concentrate primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (aka UTF-32). Those (except for UTF-8) also have LE (little endian) and BE (big-endian) variants.
By itself, "Unicode" could refer to almost any of the above (though we can probably eliminate the others that it shows explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens the most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it's identical for characters in the "basic multilingual plane", which covers a lot, including all the characters for most Western European languages).
UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is not more Unicode than the other.
Don't let an unfortunate historical artifact from Microsoft confuse you.
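That all of these are encodings of the same code point can be seen by encoding one character several ways. A small Python sketch, using the euro sign (U+20AC) as an arbitrary example:

```python
euro = "\u20ac"   # one Unicode code point

names = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
encodings = {name: euro.encode(name) for name in names}

for name, data in encodings.items():
    print(f"{name:10} {data.hex(' ')}")

# Every byte sequence decodes back to the same single code point.
all_same = all(data.decode(name) == euro for name, data in encodings.items())
```

The bytes differ in length and order, but the character they represent does not.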
The development of Unicode was aimed
at creating a new standard for mapping
the characters in a great majority of
languages that are being used today,
along with other characters that are
not that essential but might be
necessary for creating the text. UTF-8
is only one of the many ways that you
can encode the files because there are
many ways you can encode the
characters inside a file into Unicode.
In addition to Trufa's comment, Unicode explicitly isn't UTF-16. When they were first looking into Unicode, it was speculated that a 16-bit integer might be enough to store any code, but in practice that turned out not to be the case. However, UTF-16 is another valid encoding of Unicode - alongside the 8-bit and 32-bit variants - and I believe is the encoding that Microsoft use in memory at runtime on the NT-derived operating systems.
Let's start by keeping in mind that data is stored as bytes. Unicode is a character set where characters are mapped to code points (unique integers), and we need something to translate those code points into bytes. That's where UTF-8, a so-called encoding, comes in. Simple!
It's weird. Unicode is a standard, not an encoding. As it is possible to specify the endianness I guess it's effectively UTF-16 or maybe 32.
I'm really trying to get better with this stuff. I'm pretty functional with internationalization concepts like this, but I need to get a better background on the theory behind it.
I've read Spolsky's article, but I'm still unclear because these three terms get used interchangeably a LOT -- even in that article. I think at least two of them are talking about the same thing.
I suspect a high percentage of developers flub their way through this stuff on a daily basis. I don't want to be one of those developers anymore.
A ‘character set’ is just what it says: a properly-specified list of distinct characters.
An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.
UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*).
The confusion comes about because most other well-known encodings (eg.: ISO-8859-1) started out as separate character sets. Then when Unicode came along as a superset of most of these character sets, it became possible to think of them as different (but partial) encodings of the same (Unicode) character set, rather than just isolated character sets. Looking at them this way allows you to convert between them through Unicode easily, which would not be possible if they were merely isolated character sets. But it still makes sense to refer to them as character sets, so either term could be used.
A ‘code page’ is a term stemming from IBM, where it chose which set of symbols would be displayed. The term continued to be used by DOS and then Windows, through to Unicode-aware Windows where it just acts as an encoding with a numbered identifier. Whilst a numbered ‘code page’ is an idea not inherently limited to Microsoft, today the term would almost always just mean an encoding that Windows knows about.
When one is talking of code page ‹some number› one is typically talking about a Windows-specific encoding, as distinct from an encoding devised by a standards body. For example code page 28591 would not normally be referred to under that name, but simply ‘ISO-8859-1’. The Windows-specific Western European encoding based on ISO-8859-1 (with a few extra characters replacing some of its control codes) would normally be referred to as ‘code page 1252’.
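The difference between code page 1252 and ISO-8859-1 shows up in Python's codecs of the same names. A brief sketch; the byte values are picked only to illustrate one differing and one shared position.

```python
differing = b"\x80"
as_cp1252 = differing.decode("cp1252")       # a printable character
as_latin1 = differing.decode("iso-8859-1")   # a C1 control code

shared = b"\xe9"
shared_matches = shared.decode("cp1252") == shared.decode("iso-8859-1")
```

Here `as_cp1252` is the euro sign, one of the extra characters that code page 1252 puts where ISO-8859-1 has control codes.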
[*: All the UTFs are encodings not character sets, but this kind of thing isn't exclusive to Unicode. For example the Japanese standard JIS X 0208 defines a character set and two different byte encodings for it: the somewhat unpleasant high-byte-based encoding (‘Shift-JIS’), and the deeply horrific escape-switching-based encoding (‘JIS’).]
A Character Set is just that, a set of characters that can be used.
Each of these characters is mapped to an integer called code point.
How these code points are represented in memory is the encoding. An encoding is just a method to transform a code-point (U+0041 - Unicode code-point for the character 'A') into raw data (bits and bytes).
I thought Joel's article was pretty much spot on - it is the history behind the evolution of character sets and storage which has brought this about.
FWIW, in my oversimplistic view
Character Sets (ASCII, EBCDIC, UNICODE) would be the numeric representation of characters, independent of storage considerations
Encoding would relate to the efficient storage of characters, ANSI, UTF-7, UTF-8 etc, for file, across the wire etc
Code Page would be the 'kluge' needed when the demand for the addition of new characters (without wanting to increase storage capacity) meant that (certain) characters were only knowable in the additional context of a code page.
IMHO Wikipedia currently doesn't help things by defining code page as 'another name for character encoding'
and redirecting 'character set' to 'character encoding'
A character set is a set of characters, i.e. "glyphs" i.e. visual symbols representing units of communication. The letter a is a glyph and so is € (euro sign). Character sets usually map integers (codepoints) to each character, but it's the encoding that dictates the binary/byte-level representation of the character.
I'm a ruby programmer, so here are some examples to help you understand the concepts.
This reveals how the Unicode character set maps codepoints to characters, but not how each byte is stored. (ruby 1.9 defaults to Unicode strings.)
>> 'a'.codepoints.to_a
=> [97]
>> '€'.codepoints.to_a
=> [8364]
Since 8364 (base 10) is too large to fit in one byte, various encoding strategies exist to specify a translation from Unicode codepoints into one or many bytes. The UTF-8 encoding is probably the most popular of these encodings. (Wikipedia shows the UTF-8 encoding algorithm, if you want to delve into the implementation.) Note that the UTF-8 encoding only makes sense in the context of the Unicode character set.
The following reveals how the UTF-8 encoding stores each Unicode character as bytes (0 thru 255 in base-10). (Ruby 1.9's default encoding is UTF-8.)
>> 'a'.bytes.to_a
=> [97]
>> '€'.bytes.to_a
=> [226, 130, 172]
Here's the same thing in ISO-8859-15 character set:
>> 'a'.encode('iso-8859-15').codepoints.to_a
=> [97]
>> '€'.encode('iso-8859-15').codepoints.to_a
=> [164]
And the ISO-8859-15 encoding:
>> 'a'.encode('iso-8859-15').bytes.to_a
=> [97]
>> '€'.encode('iso-8859-15').bytes.to_a
=> [164]
Notice that the ISO-8859-15 codepoints match the byte representation.
Here's a blog entry that might be helpful: http://graysoftinc.com/character-encodings/what-is-a-character-encoding. Entries 1 thru 3 are good if you don't want to get too ruby-specific.
The chapter on Unicode in the book Advanced Perl Programming contains the best description of encoding, character sets and the other entities of Unicode that I've come across. Unfortunately I don't think it's available for free online.

It seems the most confusing issue to me.
How is the beginning of a new character recognized?
How are the codepoints allocated?
Let's take Chinese characters for example.
What range of codepoints is allocated to them,
and why is it allocated that way? Is there any reason?
Please describe it in your own words, not by citation.
Or could you recommend a book that covers Unicode systematically and that you think makes it clear (that's the most important thing)?
The Unicode Consortium is responsible for the codepoint allocation. If you want a new character or a code page allocated, you can apply there. See the proposal pipeline for examples.
Chapter 2 of the Unicode specification defines the general structure of Unicode, including what ranges are allocated for what kind of characters.
Take a look here for a general overview of Unicode that might be helpful: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)
Unicode is a standard specified by the Unicode Consortium. The specification defines Unicode’s character set, the Universal Character Set (UCS), and some encodings for those characters, the Unicode Transformation Formats UTF-7, UTF-8, UTF-16 and UTF-32.
How is the beginning of a new character recognized?
It depends on the encoding that’s been used. UTF-32 has a fixed code word length of 32 bits per code point. UTF-16 uses 16-bit code units: one unit for characters in the Basic Multilingual Plane and two units (a surrogate pair) for everything else. UTF-7 and UTF-8 have a variable code word length (for UTF-8, from 8 bits up to 32 bits) depending on the code point that is to be encoded.
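For UTF-8 specifically, the start of a character can be recognized from the bit pattern of each byte. A minimal Python sketch; the helper names `char_start` and `sequence_length` are hypothetical, and the input is assumed to be valid UTF-8.

```python
def char_start(data: bytes, index: int) -> int:
    """Back up from any byte offset to the start of its character.

    Continuation bytes look like 10xxxxxx; anything else starts a
    character. Assumes `data` is valid UTF-8.
    """
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


def sequence_length(lead: int) -> int:
    """Number of bytes in the character announced by a lead byte."""
    if lead < 0x80:
        return 1                  # 0xxxxxxx
    if lead >> 5 == 0b110:
        return 2                  # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                  # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                  # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")


data = "a\u20acb".encode("utf-8")   # bytes: 61 E2 82 AC 62
start_of_euro = char_start(data, 3)
euro_length = sequence_length(data[start_of_euro])
```

Because lead bytes and continuation bytes never share a bit pattern, a reader can drop into the middle of a UTF-8 stream and resynchronize within a few bytes.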
How are the codepoints allocated? Let's take Chinese character for example. What range of codepoints are allocated to them, and why is it thus allocated,any reason?
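The UCS is divided into 17 planes of 65,536 code points each, and each plane is subdivided into named blocks. In the first plane (the Basic Multilingual Plane), the first block is Basic Latin (U+0000–U+007F, the same values as ASCII), the second is Latin-1 Supplement (U+0080–U+00FF, the same values as ISO 8859-1) and so on. Most commonly used Chinese characters sit in the CJK Unified Ideographs block (U+4E00–U+9FFF).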
It is better to say Character Encoding instead of Codepage
A Character Encoding is a way to map some character to some data (and also vice-versa!)
As Wikipedia says:
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers
The most popular character encodings are ASCII, UTF-16 and UTF-8.
ASCII was the first code page to be widely used in computers. It allocates just one byte for each character, so it has a very limited set of characters (English letters, numbers, ...).
As I said, ASCII was widely used in old operating systems like MS-DOS, but it is not dead and is still used. When you have a txt file with 10 characters and it is 10 bytes, you probably have an ASCII file (or a file in another single-byte encoding)!
In UTF-16, two bytes are allocated for each of the 65,536 code points in the Basic Multilingual Plane; characters beyond that range take two such units (four bytes).
Microsoft Windows uses UTF-16 internally.
UTF-8 is another popular way of encoding characters. It uses a variable number of bytes (1 to 4) per character. It is also compatible with ASCII because it uses 1 byte for ASCII characters.
Most Unix-based systems use UTF-8.
Programming languages do not depend on code pages, although a specific implementation of a programming language (like Turbo C++) may not support them.
You can use any code page in modern programming languages. They also have tools for converting between code pages.
There are different Unicode encodings like UTF-7, UTF-8, ... You can read about them here (recommended!) and maybe for more formal details here