EUC-JP or GB18030 Text File - unicode

I have a text file with the following contents in it:
Ã(195) Ü(220) Â(195) ë(235) Ó(211) Ã(195) »(187) §(167) Ã(195) û(251) Ã(195) Ü(220) Â(194) ë(235) Ã(195) û(251) ³(179) Æ(198) Ã(195) û(251) ³(179) Æ(198). For simplicity, along with the text I have added the Unicode values that I got from http://www.fileformat.info/. Going by the Unicode Character set, this file seems to comply with this line A character from JIS-X-0208 (code set 1) is represented by two bytes, both in the range 0xA1 – 0xFE. mentioned in https://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-JP and my rendering engine seems to display Japanese characters. However, this actually is a Chinese Text File containing 密码用户名密码名称名称 which gets recognized as GB2312 encoded file by Notepad++. Are there some more restrictions for determining if a file is JIS-X-0208(EUC-JP) encoded, since it seems to comply with what Wiki says?
However, my rendering engine seems to recognize this file as both EUC-JP and Chinese,but since EUC-JP comes higher in the order, we think it is Japanese and Japanese Characters are displayed.

There is no completely reliable way to identify an unknown encoding.
Distribution patterns can probably help you determine whether you are looking at an 8-bit or a 16-bit encoding. Double-byte encodings tend to have a slightly constrained distribution pattern for every other byte. This is where you are now.
Among 16-bit encodings, you can also probably easily determine whether you are looking at a big-endian or little-endian encoding. Little-endian will have the constrained pattern on the even bytes, while big-endian will have it on the odd bytes. Unfortunately, most double-byte encodings seem to be big-endian, so this is not going to help much. If you are looking at little-endian, it's likely UTF-16LE.
Looking at your example data, every other byte seems to be equal to or close to 0xC3, starting at the first byte (but there seem to be some bytes missing, perhaps?)
There are individual byte sequences which are invalid in individual encodings, but on the whole, this is rather unlikely to help you reach a conclusion. If you can remove one or more candidate 16-bit encodings with this tactic, good for you; but it will probably not be sufficient to solve your problem.
Within this space, all you have left is statistics. If the text is long enough, you can probably find repeating patterns, or use a frequency table for your candidate encodings to calculate a score for each. Because the Japanese writing system is sharing a common heritage with Chinese, you will find similarities in their distributions, but also differences. Typologically, the Japanese language is quite different from Chinese, which means that Japanese will have particles every few characters, whereas Chinese does not have them at all. So you would look for "no" の, "wa" は, "ka" か, "ga" が, "ni" に etc and if they are present, conclude that you are looking at Japanese (or, conversely, surmise that perhaps you are looking at Chinese if they are absent; but if you are looking at lists of names, for example, it could still be Japanese).
Within Chinese (and also tangentially for Japanese) you can look at http://www.zein.se/patrick/3000char.html for frequency information; but keep in mind that the Japanese particles will be much more common in Japanese running text than any of these glyphs.
For example, 的 (the first item on the list) aka U+7684 will be 0x76 0x84 in UTF-16be, 0xAA 0xBA in Big-5, 0xC5 0xAA in EUC-JP, 0xB5 0xC4 in GB2312, etc.
From your sample data, you likely have item 139 on that list 名 aka U+540D which is 0x54 0x0D in UTF-16be, 0xA5 0x57 in Big-5, 0xCC 0xBE in EUC-JP, and 0xC3 0xFB in GB2312. (Do you see? Hit!)

Are there some more restrictions for determining if a file is JIS-X-0208(EUC-JP) encoded
A little, in that the lead bytes 0xF5–0xF8 and 0xFD–0xFE are unassigned, and there are also some other unassigned characters sprinkled at the end of blocks throughout.
That doesn't help you here, though, as the byte sequence C3DCC2EBD3C3BBA7C3FBC3DCC2EBC3FBB3C6C3FBB3C6 is equally valid in GB (密码用户名密码名称名称) and EUC-JP (畜鷹喘薩兆畜鷹兆各兆各).
Such is the joy of charset sniffing. You'll have to prune and reorder the charsets you have based on the likelihood of them existing in your input. Typically in a Windows world EUC-JP is rare (Shift-JIS-alike code page 932 would be used instead) so GB-alike code page 936 would typically ‘win’.

Related

True double byte encoding

Exist some real double byte encoding (DBCS)?
Except UCS-2, UTF-16 of course.
I mean encoding, which save also ASCII as 2 bytes.
I mean with null bytes. (00 20 - space)
Please tell if it is used, if it is obsolete in standart / in use.
The same question for 4 bytes encoding, exists any(not UCS-4, UTF-32)?
Thanks.
There are certainly legacy character sets that use exactly two bytes for every character, but these generally do not encode ASCII characters at all, being intended to supplement a single-byte character set rather than replacing it. All of those that I am aware of exist to support Chinese, Japanese, and/or Korean ideograph characters.
There are plenty of legacy documents around that use such encodings, and I would not be surprised to find that in some places they are still used in new documents.
If you are trying to determine whether your software can ignore the existence of multi-byte character encodings other than the UTFs, then I'm afraid you won't come away with an easy answer. Of course your software can do so, in the same sense that it can ignore single-byte encodings other than ISO-8859-15, but only you can determine whether your program will adequately serve its purpose if it does so.
No, there are no double-byte character sets that satisfy your list of requirements. This is because designers back in the day used 7-bit ASCII as their starting point (good for compatibility), then put extra characters or multi-byte start codes in the upper half of the 256 byte values.
Similarly for quad-byte character sets, no serious standard before Unicode even tried to provision for more than 65536 characters.
To give one example, Chinese Big5 uses ASCII definitions for bytes 0x00 to 0x7F, uses 0x81 to 0xFF as a start byte for extended characters, and {0x40 to 0x7E, 0xA1 to 0xFE} for the second byte. This can code a maximum of 20067 different characters.

Are there bytes that are not used in the UTF-8 encoding?

As I understand it, UTF-8 is a superset of ASCII, and therefore includes the control characters which are not used to represent printable characters.
My question is: Are there any bytes (of the 256 different) that are not used by the UTF-8 encoding?
I wondered if you could convert/encode UTF-8 text to binary.
Here my though process:
I have no idea how the UTF-8 text encoding works and how it can use so many characters (only that it uses multiple bytes for characters not in ASCII (Latin-1??)) but I know that ASCII text is valid in UTF-8 so the control characters (bytes 0-30) are not used differently by the UTF-8 encoding but they are at the same time not used for displaying characters, right??
So of the 256 different bytes, only ~230 are used. For a 1000 (binary) long Unicode text there are only 1000^230 different texts? Right?
If that is true, you could convert it to a binary data which is smaller than 1000 bytes.
Wolfram alpha: 1000 bytes of unicode (assumption unicode only uses 230 of the 256 different bytes) --> 496 bytes
Yes, it is possible to devise encodings which are more space-efficient than UTF-8, but you have to weigh the advantages against the disadvantages.
For example, if your primary target is (say) ISO-8859-1, you could map the character codes 0xA0-0xFF to themselves, and only use 0x80-0x9F to select an extension map somewhat vaguely like UTF-8 uses (nearly) all of 0x80-0xFF to encode sequences which can represent all of Unicode > 0x80. You would gain a significant advantage when the majority of your text does not use characters in the ranges 0x80-0x9F or 0x0100-0x1EFFFFFFFF, but correspondingly lose when this is not the case.
Or you could require the user to keep a state variable which tells you which range of characters is currently selected, and have each byte in the stream act as an index into that range. This has significant disadvantages, but used to be how these things were done way back when (witness e.g. ISO-2022).
The original UTF-8 draft before Ken Thompson and Rob Pike famously intervened was probably also somewhat more space-efficient than the final specification, but the changes they introduced had some very attractive properties, trading (I assume) some space efficiency for lack of contextual ambiguity.
I would urge you to read the Wikipedia article about UTF-8 to understand the design desiderata -- the spec is possible to grasp in just a few minutes, although you might want to reserve an hour or more to follow footnotes etc. (The Thompson anecdote is currently footnote #7.)
All in all, unless you are working on space travel or some similarly effeciency-intensive application, losing UTF-8 compatibility is probably not worth the time you have already spent, and you should stop now.
0xF8-0xFF are not valid anywhere in UTF-8, and some other bytes are not valid at certain positions.
The lead byte of a character indicates the number of bytes used to encode the character, and each continuation byte has 10 as its two high order bits. This is so that you can pick any byte within the text and find the start of the character containing it. If you don't mind losing this ability, you could certainly come up with more efficient encoding.
You have to distinguish Characters, Unicode and UTF-8 encoding:
In encodings like ASCII, LATIN-1, etc. there is a one-to-one relation of one character to one number between 0 and 255 so a character can be encoded by exactly one byte (e.g. "A"->65). For decoding such a text you need to know which encoding was used (does 65 really mean "A"?).
To overcome this situation Unicode assigns every Character (including all kinds of special things like control characters, diacritic marks, etc.) a unique number in the range from 0 to 0x10FFFF (so-called Unicode codepoint). As this range does not fit into one byte the question is how to encode. There are several ways to do this, e.g. simplest way would always use 4 bytes for each character. As this consumes a lot of space a more efficient encoding is UTF-8: Here every Unicode codepoint (= Character) is encoded in one, two, three or four bytes (for this encoding not all byte values from 0 to 255 are used but this is only a technical detail).

What is the Best UTF [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'm really confused about UTF in Unicode.
there is UTF-8, UTF-16 and UTF-32.
my question is :
what UTF that are support all Unicode blocks ?
What is the best UTF(performance, size, etc), and why ?
What is different between these three UTF ?
what is endianness and byte order marks (BOM) ?
Thanks
what UTF that are support all Unicode blocks ?
All UTF encodings support all Unicode blocks - there is no UTF encoding that can't represent any Unicode codepoint. However, some non-UTF, older encodings, such as UCS-2 (which is like UTF-16, but lacks surrogate pairs, and thus lacks the ability to encode codepoints above 65535/U+FFFF), may not.
What is the best UTF(performance, size, etc), and why ?
For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient. However, UTF-8 is sometimes less space-efficient than UTF-16 and UTF-32 where most of the codepoints used are high (such as large bodies of CJK text).
What is different between these three UTF ?
UTF-8 encodes each Unicode codepoint from one to four bytes. The Unicode values 0 to 127, which are the same as they are in ASCII, are encoded like they are in ASCII. Bytes with values 128 to 255 are used for multi-byte codepoints.
UTF-16 encodes each Unicode codepoint in either two bytes (one UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) are encoded with one UTF-16 value. Codepoints from higher plains use two UTF-16 values, through a technique called 'surrogate pairs'.
UTF-32 is not a variable-length encoding for Unicode; all Unicode codepoint values are encoded as-is. This means that U+10FFFF is encoded as 0x0010FFFF.
what is endianness and byte order marks (BOM) ?
Endianness is how a piece of data, particular CPU architecture or protocol orders values of multi-byte data types. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as ARM, PowerPC and many networking protocols) put the most-significant byte first.
In a little-endian encoding or system, the 32-bit value 0x12345678 is stored or transmitted as 0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78.
A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way -- U+FEFF is a valid codepoint, used for the byte order mark, while U+FFFE is not. Therefore, if a file starts with 0xFF 0xFE, it can be assumed that the rest of the file is stored in a little-endian byte ordering.
A byte order mark in UTF-8 is technically possible, but is meaningless in the context of endianness for obvious reasons. However, a stream that begins with the UTF-8 encoded BOM almost certainly implies that it is UTF-8, and thus can be used for identification because of this.
Benefits of UTF-8
ASCII is a subset of the UTF-8 encoding and therefore is a great way to introduce ASCII text into a 'Unicode world' without having to do data conversion
UTF-8 text is the most compact format for ASCII text
Valid UTF-8 can be sorted on byte values and result in sorted codepoints
Benefits of UTF-16
UTF-16 is easier than UTF-8 to decode, even though it is a variable-length encoding
UTF-16 is more space-efficient than UTF-8 for characters in the BMP, but outside ASCII
Benefits of UTF-32
UTF-32 is not variable-length, so it requires no special logic to decode
“Answer me these questions four, as all were answered long before.”
You really should have asked one question, not four. But here are the answers.
All UTF transforms by definition support all Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet claim they are UTF-16, and UCS-2 is severely broken in several fundamental ways:
UCS-2 is not a valid Unicode encoding.
UCS-2 supports only ¹⁄₁₇ᵗʰ of Unicode. That is, Plane 0 only, not Planes 1–16.
UCS-2 permits code points that The Unicode Standard guarantees will never be in a valid Unicode stream. These include
all 2,048 UTF-16 surrogates, code points U+D800 through U+DFFF
the 32 non-character code points between U+FDD0 and U+FDEF
both sentinels at U+FFEF and U+FFFF
For what encoding is used internally by seven different programming languages, see slide 7 on Feature Support Summary in my OSCON talk from last week entitled “Unicode Support Shootout”. It varies a great deal.
UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:
UTF-8 is the de facto standard Unicode encoding on the web.
UTF-8 can be stored in a null-terminated string.
UTF-8 is free of the vexing BOM issue.
UTF-8 risks no confusion of UCS-2 vs UTF-16.
UTF-8 compacts mainly-ASCII text quite efficiently, so that even Asian texts that are in XML or HTML often wind up being smaller in bytes than UTF-16. This is an important thing to know, because it is a counterintuitive and surprising result. The ASCII markup tags often make up for the extra byte. If you are really worried about storage, you should be using proper text compression, like LZW and related algorithms. Just bzip it.
If need be, it can be roped into use for trans-Unicodian points of arbitrarily large magnitude. For example, MAXINT on a 64-bit machine becomes 13 bytes using the original UTF-8 algorithm. This property is of rare usefulness, though, and must be used with great caution lest it be mistaken for a legitimate UTF-8 stream.
I use UTF-8 whenever I can get away with it.
I have already given properties of UTF-8, so here are some for the other two:
UTF-32 enjoys a singular advantage for internal storage: O(1) access to code point N. That is, constant time access when you need random access. Remember we lived forever with O(N) access in C’s strlen function, so I am not sure how important this is. My impression is that we almost always process our strings in sequential not random order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run.
UTF-16 is a terrible format, having all the disadvantages of UTF-8 and UTF-32 but none of the advantages of either. It is grudgingly true that when properly handled, UTF-16 can certainly be made to work, but doing so takes real effort, and your language may not be there to help you. Indeed, your language is probably going to work against you instead. I’ve worked with UTF-16 enough to know what a royal pain it is. I would stay clear of both these, especially UTF-16, if you possibly have any choice in the matter. The language support is almost never there, because there are massive pods of hysterical porpoises all contending for attention. Even when proper code-point instead of code-unit access mechanisms exist, these are usually awkward to use and lengthy to type, and they are not the default. This leads too easily to bugs that you may not catch until deployment; trust me on this one, because I’ve been there.
That’s why I’ve come to talk about there being a UTF-16 Curse. The only thing worse than The UTF-16 Curse is The UCS-2 Curse.
Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.
I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange alone. Anything that makes you get at code units instead of code points is far far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. You want a code-point interface, not a code-unit interface. Now that your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.
Category Errors
Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much NOT “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire but rules for handling them. Not just more characters, but rather more characters that have particular rules accompanying them. Unicode characters without Unicode rules are no longer Unicode characters.
If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn’t work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, re, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do you break the rules and it is no longer Unicode. It’s just a mess.
For example, Unicode defines U+03C3 GREEK SMALL LETTER SIGMA and U+03C2 GREEK SMALL LETTER FINAL SIGMA as case-insensitive versions of each other. (This is called Unicode casefolding.) But since they don’t change when blindly mapped to lowercase and compared, that comparison fails. You just can’t do it that way. You can’t fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need to use casefolding belies a shakey understanding of the whole works.
(And that’s nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python’s innumerably many Unicode regex issues is Matthew Barnett’s marvelous regex library for Python 2 and Python 3. It is really quite neat, and it actually gets Unicode casefolding right — amongst many other Unicode things that the standard re gets miserably wrong.)
REMEMBER: Unicode is not just more characters: Unicode is rules for handling more characters. One either learns to work with Unicode, or else one works against it, and if one works against it, then it works against you.
All of them support all Unicode code points.
They have different performance characteristics - for example, UTF-8 is more compact for ASCII characters, whereas UTF-32 makes it easier to deal with the whole of Unicode including values outside the Basic Multilingual Plane (i.e. above U+FFFF). Due to its variable width per character, UTF-8 strings are hard to use to get to a particular character index in the binary encoding - you have scan through. The same is true for UTF-16 unless you know that there are no non-BMP characters.
It's probably easiest to look at the wikipedia articles for UTF-8, UTF-16 and UTF-32
Endianness determines (for UTF-16 and UTF-32) whether the most significant byte comes first and the least significant byte comes last, or vice versa. For example, if you want to represent U+1234 in UTF-16, that can either be { 0x12, 0x34 } or { 0x34, 0x12 }. A byte order mark indicates which endianess you're dealing with. UTF-8 doesn't have different endiannesses, but seeing a UTF-8 BOM at the start of a file is a good indicator that it is UTF-8.
Some good questions here and already a couple good answers. I might be able to add something useful.
As said before, all three cover the full set of possible codepoints, U+0000 to U+10FFFF.
Depends on the text, but here are some details that might be of interest. UTF-8 uses 1 to 4 bytes per char; UTF-16 uses 2 or 4; UTF-32 always uses 4. A useful thing to note is this. If you use UTF-8 then then English text will be encoded with the vast majority of characters in one byte each, but Chinese needs 3 bytes each. Using UTF-16, English and Chinese will both require 2. So basically UTF-8 is a win for English; UTF-16 is a win for Chinese.
The main difference is mentioned in the answer to #2 above, or as Jon Skeet says, see the Wikipedia articles.
Endianness: For UTF-16 and UTF-32 this refers to the order in which the bytes appear; for example in UTF-16, the character U+1234 can be encoded either as 12 34 (big endian), or 34 12 (little endian). The BOM, or byte order mark is interesting. Let's say you have a file encoded in UTF-16, but you don't know whether it is big or little endian, but you notice the first two bytes of the file are FE FF. If this were big-endian the character would be U+FEFF; if little endian, it would signify U+FFFE. But here's the thing: In Unicode the codepoint FFFE is permanently unassigned: there is no character there! Therefore we can tell the encoding must be big-endian. The FEFF character is harmless here; it is the ZERO-WIDTH NO BREAK SPACE (invisible, basically). Similarly if the file began with FF FE we know it is little endian.
Not sure if I added anything to the other answers, but I have found the English vs. Chinese concrete analysis useful in explaining this to others in the past.
One way of looking at it is as size over complexity. Generally they increase in the number of bytes they need to encode text, but decrease in the complexity of decoding the scheme they use to represent characters. Therefore, UTF-8 is usually small but can be complex to decode, whereas UTF-32 takes up more bytes but is easy to decode (but is rarely used, UTF-16 being more common).
With this in mind UTF-8 is often chosen for network transmission, as it has smaller size. Whereas UTF-16 is chosen where easier decoding is more important than storage size.
BOMs are intended as information at the beginning of files which describes which encoding has been used. This information is often missing though.
Joel Spolsky wrote a nice introductory article about Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

What are Unicode, UTF-8, and UTF-16?

What's the basis for Unicode and why the need for UTF-8 or UTF-16?
I have researched this on Google and searched here as well, but it's not clear to me.
In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?
Please explain in simple terms.
Why do we need Unicode?
In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).
But for argument's sake, let’s say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.
Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.
Memory considerations
So how many bytes give access to what characters in these encodings?
UTF-8:
1 byte: Standard ASCII
2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
3 bytes: BMP
4 bytes: All Unicode characters
UTF-16:
2 bytes: BMP
4 bytes: All Unicode characters
It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese, Japanese, and Korean (CJK) characters.
If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web-pages or lengthy word documents, this could impact performance.
Encoding basics
Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.
UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.
UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions map to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.
As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.
Practical programming considerations
Character and string data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
Recommended, default, and dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.
Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters can occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly since they occur very rarely.
Counting characters: There exist combining characters in Unicode. For example, the code point U+006E (n), and U+0303 (a combining tilde) forms ñ, but the code point U+00F1 forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example, and 1 for the latter. This isn't necessarily wrong, but it may not be the desired outcome either.
Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ. One is a letter, and the other is a Roman numeral. In addition, we have the combining characters to consider as well. For more information, see Duplicate characters in Unicode.
Surrogate pairs: These come up often enough on Stack Overflow, so I'll just provide some example links:
Getting string length
Removing surrogate pairs
Palindrome checking
Unicode
is a set of characters used around the world
UTF-8
a character encoding capable of encoding all possible characters (called code points) in Unicode.
code unit is 8-bits
use one to four code units to encode Unicode
00100100 for "$" (one 8-bits);11000010 10100010 for "¢" (two 8-bits);11100010 10000010 10101100 for "€" (three 8-bits)
UTF-16
another character encoding
code unit is 16-bits
use one to two code units to encode Unicode
00000000 00100100 for "$" (one 16-bits);11011000 01010010 11011111 01100010 for "𤭢" (two 16-bits)
Unicode is a fairly complex standard. Don’t be too afraid, but be
prepared for some work! [2]
Because a credible resource is always needed, but the official report is massive, I suggest reading the following:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) An introduction by Joel Spolsky, Stack Exchange CEO.
To the BMP and beyond! A tutorial by Eric Muller, Technical Director then, Vice President later, at The Unicode Consortium (the first 20 slides and you are done)
A brief explanation:
Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but covers only Latin (seven bits/character can represent 128 different characters). Unicode is a standard with the goal to cover all possible characters in the world (can hold up to 1,114,112 characters, meaning 21 bits/character maximum. Current Unicode 8.0 specifies 120,737 characters in total, and that's all).
The main difference is that an ASCII character can fit to a byte (eight bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:
Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.
An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory, 8-bit units, 16-bit units and so on. UTF-8 uses one to four units of eight bits, and UTF-16 uses one or two units of 16 bits, to cover the entire Unicode of 21 bits maximum. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So, although UTF-8 uses one byte for the Latin script, it needs three bytes for later scripts inside a Basic Multilingual Plane, while UTF-16 uses two bytes for all these. And that's their main difference.
Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.
character: π
code point: U+03C0
encoding forms (code units):
      UTF-8: CF 80
      UTF-16: 03C0
encoding schemes (bytes):
      UTF-8: CF 80
      UTF-16BE: 03 C0
      UTF-16LE: C0 03
Tip: a hexadecimal digit represents four bits, so a two-digit hex number represents a byte.
Also take a look at plane maps on Wikipedia to get a feeling of the character set layout.
The article What every programmer absolutely, positively needs to know about encodings and character sets to work with text explains all the details.
Writing to buffer
if you write to a 4 byte buffer, symbol あ with UTF8 encoding, your binary will look like this:
00000000 11100011 10000001 10000010
if you write to a 4 byte buffer, symbol あ with UTF16 encoding, your binary will look like this:
00000000 00000000 00110000 01000010
As you can see, depending on what language you would use in your content this will effect your memory accordingly.
Example: For this particular symbol: あ UTF16 encoding is more efficient since we have 2 spare bytes to use for the next symbol. But it doesn't mean that you must use UTF16 for Japan alphabet.
Reading from buffer
Now if you want to read the above bytes, you have to know in what encoding it was written to and decode it back correctly.
e.g. If you decode this :
00000000 11100011 10000001 10000010
into UTF16 encoding, you will end up with 臣 not あ
Note: Encoding and Unicode are two different things. Unicode is the big (table) with each symbol mapped to a unique code point. e.g. あ symbol (letter) has a (code point): 30 42 (hex). Encoding on the other hand, is an algorithm that converts symbols to more appropriate way, when storing to hardware.
30 42 (hex) - > UTF8 encoding - > E3 81 82 (hex), which is above result in binary.
30 42 (hex) - > UTF16 encoding - > 30 42 (hex), which is above result in binary.
Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.
Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.
Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.
Unicode is a standard which maps the characters in all languages to a particular numeric value called a code point. The reason it does this is that it allows different encodings to be possible using the same set of code points.
UTF-8 and UTF-16 are two such encodings. They take code points as input and encodes them using some well-defined formula to produce the encoded string.
Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements and depending upon the characters that you will be dealing with, you should choose the encoding which uses the least sequences of bytes to encode those characters.
For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article,
What every programmer should know about Unicode
Why Unicode? Because ASCII has just 127 characters. Those from 128 to 255 differ in different countries, and that's why there are code pages. So they said: let’s have up to 1114111 characters.
So how do you store the highest code point? You'll need to store it using 21 bits, so you'll use a DWORD having 32 bits with 11 bits wasted. So if you use a DWORD to store a Unicode character, it is the easiest way, because the value in your DWORD matches exactly the code point.
But DWORD arrays are of course larger than WORD arrays and of course even larger than BYTE arrays. That's why there is not only UTF-32, but also UTF-16. But UTF-16 means a WORD stream, and a WORD has 16 bits, so how can the highest code point 1114111 fit into a WORD? It cannot!
So they put everything higher than 65535 into a DWORD which they call a surrogate-pair. Such a surrogate-pair are two WORDS and can get detected by looking at the first 6 bits.
So what about UTF-8? It is a byte array or byte stream, but how can the highest code point 1114111 fit into a byte? It cannot! Okay, so they put in also a DWORD right? Or possibly a WORD, right? Almost right!
They invented utf-8 sequences which means that every code point higher than 127 must get encoded into a 2-byte, 3-byte or 4-byte sequence. Wow! But how can we detect such sequences? Well, everything up to 127 is ASCII and is a single byte. What starts with 110 is a two-byte sequence, what starts with 1110 is a three-byte sequence and what starts with 11110 is a four-byte sequence. The remaining bits of these so called "startbytes" belong to the code point.
Now depending on the sequence, following bytes must follow. A following byte starts with 10, and the remaining bits are 6 bits of payload bits and belong to the code point. Concatenate the payload bits of the startbyte and the following byte/s and you'll have the code point. That's all the magic of UTF-8.
ASCII - Software allocates only 8 bit byte in memory for a given character. It works well for English and adopted (loanwords like façade) characters as their corresponding decimal values falls below 128 in the decimal value. Example C program.
UTF-8 - Software allocates one to four variable 8-bit bytes for a given character. What is meant by a variable here? Let us say you are sending the character 'A' through your HTML pages in the browser (HTML is UTF-8), the corresponding decimal value of A is 65, when you convert it into decimal it becomes 01000010. This requires only one byte, and one byte memory is allocated even for special adopted English characters like 'ç' in the word façade. However, when you want to store European characters, it requires two bytes, so you need UTF-8. However, when you go for Asian characters, you require minimum of two bytes and maximum of four bytes. Similarly, emojis require three to four bytes. UTF-8 will solve all your needs.
UTF-16 will allocate minimum 2 bytes and maximum of 4 bytes per character, it will not allocate 1 or 3 bytes. Each character is either represented in 16 bit or 32 bit.
Then why does UTF-16 exist? Originally, Unicode was 16 bit not 8 bit. Java adopted the original version of UTF-16.
In a nutshell, you don't need UTF-16 anywhere unless it has been already been adopted by the language or platform you are working on.
Java program invoked by web browsers uses UTF-16, but the web browser sends characters using UTF-8.
UTF stands for stands for Unicode Transformation Format. Basically, in today's world there are scripts written in hundreds of other languages, formats not covered by the basic ASCII used earlier. Hence, UTF came into existence.
UTF-8 has character encoding capabilities and its code unit is eight bits while that for UTF-16 it is 16 bits.

Smallest Unicode encodings for different languages?

What are the typical average bytes-per-character rates for different unicode encodings in different languages?
E.g. if I wanted the smallest number of bytes to encode some english text, then on average UTF-8 would be 1-byte per character and UTF-16 would be 2 so I'd pick UTF-8.
If I wanted some Korean text, then UTF-16 might average about 2 per character but UTF-8 might average about 3 (I don't know, I'm just making up some illustrative numbers here).
Which encodings yield the smallest storage requirements for different languages and character sets?
For any given language, your bytes-per-character rates are fairly constant, because most languages are allocated to contiguous code pages. The big exception is accented Latin characters, which are allocated higher in the code space than the unaccented forms. I don't have hard numbers for these.
For languages with contiguous character allocation, there is a table with detailed numbers for various languages on Wikipedia. In general, UTF-8 works well for most small character sets (except the ones allocated on high code pages), and UTF-16 is great for two-byte character sets.
If you need denser compression, you may also want to look at Unicode Technical Note 14, which compares some special-purpose encodings designed to reduce data size for a variety of languages. But these techniques aren't especially common.
If you're really worried about string/character size, have you thought about compressing them? That would automatically reduce the string to it's 'minimal' encoding. It's a layer of headache, especially if you want to do it in memory, and there are plenty of cases in which it wouldn't buy you anything, but encoding, especially, tend to be too general purpose to the level of compactness you seem to be aiming for.
UTF8 is best for any character-set where characters are primarily below U+0800. Otherwise UTF16.
That is, UTF8 for Latin, Greek, Cyrillic, Hebrew and Arabic and a few others. In langs other than Latin, characters will take up the same space as they would in UTF16, but you'll save bytes on punctuation and spacing.
In UTF-16, all the languages that matter (i.e. anything but klingons, elven and other strange things) will be encoded into 2 byte chars.
So the question is to find the languages that will have glyphs that will be 2-bytes or 1-byte sized characters long.
In the Wikipedia page on UTF-8:
http://en.wikipedia.org/wiki/Utf-8
We see that a character with an unicode index of 0x0800 or more will be at least 3 bytes long in UTF-8.
Knowing that, you just need to look at the code charts on unicode: http://www.unicode.org/charts/
for the languages that comply to your requirements.
:-)
Now, note that, depending on the framework you're using, the choice could well be not yours to do:
On Windows API, Unicode is handled by wchar_t chars, and is UTF-16
On Linux, Unicode is handled by char, and is UTF-8
Java is internally UTF-16, as are most compliant XML parsers
I was told (some tech meeting I was not interested on... sorry...) that UTF-8 was the encoding of choices on Databases.
So, pick up your poison...
:-)
I don't know exact figures, but for Japanese Shift_JIS averages fewer bytes per character than UTF-8, and so does EUC-JP, since they're optimised for Japanese text. However, they don't cover the same space of code points as Unicode, so they might not be correct answers to your question.
UTF-16 is better than UTF-8 for Japanese characters (2 bytes per char as opposed to 3), but worse than UTF-8 if there's a lot of 7-bit chars. It depends on the context - technical text is more likely to contain a lot of chars in the 1-byte range. A classical Japanese text might not have any.
Note that for transport, the encoding doesn't matter much if you can zip (gzip, bz2) the data. Code points for an alphabet in Unicode are close together, so you'd expect common prefixes with very short representations in the compressed data.
UTF-8 is usually good for representation in memory, since it's often more compact than UTF-32 or UTF-16, and is compatible with functions on char* which 'expect' ASCII or ISO-8859-1 NUL-terminated strings. It's useless if you need random access to characters by index, though.
If you don't care about non-BMP characters, UCS-2 is always 2 bytes per character and so offers random access. But that depends what you mean by 'Unicode'.
UTF-8
There is a very good article about unicode on JoelOnSoftware:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)