Unicode usage in software - unicode

I have been wrestling with the question of Unicode usage for a long time. Unicode can speed up and simplify software development (in terms of globalization), but I am concerned about the following factors:
increased memory and diskspace usage;
reduction of the text processing performance;
Asian languages treated all alike to the detriment of the national specificities.
The first point is obvious, but I don't know whether the others are true or not. Has anyone here had to localize software for Asian countries and is willing to share the experience?
At the moment I try to use narrow, region-specific encodings (cp1251 for Russia, cp1254 for Turkey, etc.). Can anyone advise on this issue?

The impact on the size of data in bytes depends on the choice of Unicode encoding and on the type of data. For example, using UTF-8 (the only useful Unicode encoding on the web), English text has the same size as in 8-bit encodings, except for typographically correct punctuation marks, which may take two or three bytes each; for Turkish text, any non-ASCII letter is 2 bytes instead of 1 byte; for Russian text, any Cyrillic letter is 2 bytes. In most cases, this does not matter much.
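If you want to check such numbers for your own data, encoding a few sample strings and counting the bytes is enough; here is a small Python sketch (the sample words below are just illustrative):

    samples = {
        "English": "Hello, world",
        "Turkish": "Günaydın",   # ü and ı are non-ASCII, 2 bytes each in UTF-8
        "Russian": "Привет",     # every Cyrillic letter is 2 bytes in UTF-8
    }
    for lang, text in samples.items():
        print(f"{lang}: {len(text)} characters, {len(text.encode('utf-8'))} bytes in UTF-8")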
Text processing performance depends on what you do and how you do that. The reasonable expectation is that there is no problem worth worrying about. If processing is fast enough, it hardly matters whether it would be 10% faster using an 8-bit encoding.
Unicode unification has its impact, but surely Asian languages are not treated all alike. The Unicode standard has a lot to say about specific treatment of characters in Asian scripts and languages. If you are referring to the different shapes of CJK characters in different languages, then the usual solution is to use fonts designed for the language used. (In addition, it can in principle at least also be handled within a font, when OpenType fonts are used.)
Check out the official Unicode FAQ. It has a lot to say about issues like these.

The first two points are very much negligible. You'd need to have a very specific use case where the difference in size and performance make a discernible difference that justifies the headaches of mixed encodings.
Regarding the Unihan characters: They are grouped by meaning of the character, but that character may be written slightly differently in different writing systems. This is a problem of properly marking up the language, it's not really an encoding problem. In HTML documents, you can mark the document with lang attributes and/or set specific fonts using CSS which will alter the appearance of the character for the language appropriately. How to handle this correctly depends on the type of software (HTML, desktop app, etc). I'd advise you open a new, detailed question about that.

Increased text size: Yes. Text size may increase, up to 4 bytes per character for UTF-8 (the original design allowed sequences of up to 6 bytes, but no valid code point needs more than 4). But storage for text is hardly a big problem nowadays.
Reduction of text processing performance: In my opinion, no. A UTF-8 character may take up to 4 bytes, but when scanning through the text, the first byte of a UTF-8 sequence already tells us how many more bytes to read for the current character. So scanning performance most likely stays O(n), where n is the length of the text. To keep the best performance, try not to access the characters in a text by index (yes, random access by index is the weak point). Java strings are not affected by random index access because a Java String is a sequence of fixed-size 2-byte code units (characters outside the BMP take two such units, though).
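To make the "the first byte already tells you the length" point concrete, here is a minimal Python sketch of the lead-byte test (the function name is mine, purely for illustration):

    def utf8_char_len(first_byte: int) -> int:
        """Number of bytes in a UTF-8 sequence, judged from its first byte."""
        if first_byte < 0x80:
            return 1                     # 0xxxxxxx: plain ASCII
        if first_byte >> 5 == 0b110:
            return 2                     # 110xxxxx
        if first_byte >> 4 == 0b1110:
            return 3                     # 1110xxxx
        if first_byte >> 3 == 0b11110:
            return 4                     # 11110xxx
        raise ValueError("not a valid UTF-8 lead byte")

    text = "i長".encode("utf-8")
    print(utf8_char_len(text[0]))        # 1: the ASCII 'i'
    print(utf8_char_len(text[1]))        # 3: '長' takes three bytes in UTF-8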
Asian languages treated all alike to the detriment of national specificities: When represented as text, human languages are all alike in one sense; a letter 'i' of a single stroke and a letter '長' of 16 strokes are each just one character.

Increased text size and the points that follow it are largely untrue.
They may hold for older Unicode encodings such as UTF-16, but UTF-8 is no larger or slower than ASCII for ASCII-only strings, and it can still encode every Unicode code point. UTF-8 is also the de facto standard way of doing Unicode in the marketplace today.
There is an extensive analysis of performance of different Unicode encodings in http://www.utf8everywhere.org, including for the Asian languages.

Related

Why are there different encoding types?

This is a noob question, but I want to know why there are different encoding types and what their differences are (i.e. ASCII, UTF-8 and 16, Base64, etc.)
Reasons are many, I believe, but the main point is: "How many characters do you need to display (encode)?" If you live in the US, for example, you could go pretty far with ASCII. But in many countries we need characters like ä, å, ü etc. (If SO were ASCII-only, or if you tried to read this text as ASCII-encoded text, you'd see weird characters in place of ä, å and ü.) Think also of China, Japan, Thailand and other "exotic" countries. Those strange figures you may have seen in photos from around the world might just be letters, not pretty pictures.
As for the differences between different encoding types you need to see their specification. Here's something for UTF-8.
http://www.unicode.org/standard/standard.html
http://www.utf-8.com/
http://en.wikipedia.org/wiki/UTF-8#Compared_to_other_multi-byte_encodings
I'm not familiar with UTF-16. Here's some information about the differences.
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Unicode_plane
Base64 is used when there is a need to encode binary data that needs to be stored and transferred over media that are designed to deal with textual data. If you've ever made some sort of email system with PHP, you've probably encountered Base64.
http://en.wikipedia.org/wiki/Base64
http://www.phpeveryday.com/articles/PHP-Email-Using-Embedded-Images-in-HTML-Email-P113.html
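A minimal Python illustration of that idea (the byte values are arbitrary sample data):

    import base64

    attachment = bytes([0xFF, 0x00, 0x7F, 0x80])    # arbitrary binary data, not valid text
    encoded = base64.b64encode(attachment)          # now plain ASCII, safe for text-only channels
    print(encoded)                                  # b'/wB/gA=='
    print(base64.b64decode(encoded) == attachment)  # True: the round trip is lossless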
In short: to support localization of a program's user interface into many different languages. (Programming languages themselves still mainly consist of characters found in ASCII, although it is possible, for example in Java, to use non-ASCII characters in variable names, and the source code file is usually stored as something other than ASCII-encoded text, for example in UTF-8.)
In short, vol. 2: whenever different people try to solve some problem from their own specific points of view, the results may be quite different. A quote from Joel's Unicode article (link below): "Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255."
Thanks to Joachim and tchrist for all the info and discussion. Here are two articles I just read. (Both links are on the page I linked to earlier.) I'd forgotten most of the stuff from Joel's article since I last read it a few years back. A good introduction to the subject, I hope. Mark Davis goes a little deeper.
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
The real reason why there are so many variants is that the Unicode consortium came along too late.
In the beginning, memory and storage were expensive, and using more than 8 (or sometimes only 7) bits of memory to store a single character was considered excessive. Thus pretty much all text was stored using 7 or 8 bits per character. Clearly, 8 bits are not enough to represent the characters of all human languages; they are barely enough to represent most characters used in a single language (and for some languages even that's not possible). Therefore many different character encodings were designed to allow different languages (English, German, Greek, Russian, ...) to encode their texts in 8 bits per character. After all, a single text file (and usually even a single computer system) will only ever be used in a single language, right?
This led to a situation where there was no single agreed-upon mapping of characters to numbers of any kind. Many different, incompatible solutions were produced, and no real central control existed. Some computer systems used ASCII, others used EBCDIC (or more precisely: one of the many variations of EBCDIC), ISO-8859-* (or one of its many derivatives), or any of a long list of encodings that are hardly heard of now.
Finally, the Unicode Consortium stepped up to the task of producing that single mapping (together with lots of auxiliary data that's useful but outside the bounds of this answer).
When the Unicode consortium finally produced a fairly comprehensive list of characters that a computer might represent (together with a number of encoding schemes to encode them to binary data, depending on your concrete needs), the other character encoding schemes were already widely used. This slowed down the adoption of Unicode and its encodings (UTF-8, UTF-16) considerably.
These days, if you want to represent text, your best bet is to use one of the few encodings that can represent all Unicode characters. UTF-8 and UTF-16 together should suffice for 99% of all use cases, UTF-32 covers almost all the others. And just to be clear: all the UTF-* encodings can encode all valid Unicode characters. But due to the fact that UTF-8 and UTF-16 are variable-width encodings, they might not be ideal for all use cases. Unless you need to be able to interact with a legacy system that can't handle those encodings, there is rarely a reason to choose anything else these days.
The main reason is to be able to show more characters. When the internet was in its infancy, no one really planned ahead thinking that one day people would be using it from every country and in every language around the world. So a small character set was good enough. Gradually it was revealed to be limited and English-centric, hence the demand for bigger character sets.

What's the big deal with unicode?

I've heard a lot of people talk about how some new version of a language now supports Unicode, and how much of an achievement Unicode is. What's the big deal about being able to support a new character set? It seems like something that would rarely, if ever, be used, but people mention it quite often. What's the benefit, or why do people use or even care about Unicode?
Programming languages are used to produce software.
Software is used to solve problems faced by humans.
Producing software has a cost.
Software that solves problems for humans produces value. This value can be expressed in the form of profit, or the reduction of costs, depending on the business model of the software developer. How the value is expressed is irrelevant for the purposes of this discussion; what is relevant is that net value is produced.
There are seven billion humans in the world. A significant fraction of them are most comfortable reading text that is not written in the Latin alphabet.
Software which purports to solve a problem for some fraction of those seven billion humans who do not use the Latin alphabet does so more effectively if developers can easily manipulate text written in non-Latin alphabets.
Therefore, a programming language which supports non-Latin character sets lowers the costs of software developers, thereby enabling them to solve more problems for more people at lower costs, and thereby produce more value.
Unicode is the de facto standard for manipulation of non-Latin text.
Therefore, Unicode is important to the design and implementation of programming languages.
Our goal as programming language designers is the creation of tools which produce maximum value. Supporting Unicode is an easy way to massively increase the scope and range of real human problems that can be solved in software.
In the beginning, there were 256 possible characters and many different code pages to represent them. It became a tangled mess. Supporting multiple languages and multiple character sets became a programmer's nightmare.
Then the Unicode Consortium was formed. It created a standard that started out as a single character set with 256 x 256 = 65536 characters (plus combinations thereof) and has since been extended to more than a million code points, enough to include almost all languages of the world.
The biggest advantage is that a single character string may contain multiple languages. That is no small thing.
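For example (a Python sketch with arbitrary sample text), the following would have required switching codepages mid-string in the pre-Unicode world:

    greeting = "Hello, Привет, こんにちは and Γειά σου in one string"
    print(len(greeting))              # one string, one length, no encoding switches
    print(greeting.encode("utf-8"))   # a single encoding covers all of it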
Unicode has been the native character set in Windows ever since Windows 2000. It is also allowed as a character set in HTML and on websites.
If your application does not support Unicode, or is not planning to support it, then it is only a matter of time until your application will be left behind.
What's the big deal about being able to support a new character set?
Unicode is not just "a new characterset". It's the character set that removes the need to think about character sets.
How would you rather write a string containing the Euro sign?
"\x80", "\x88", "\x9c", "\x9f", "\xa2\xe3", "\xa2\xe6", "\xa3\xe1", "\xa4", "\xa9\xa1", "\xd9\xe6", "\xdb", or "\xff" depending upon the encoding.
"\u20AC", in every locale, on every OS.
Unicode can support pretty much any language in the world. Without such an encoding you would have to worry about choosing the correct encoding for different languages, which is very bothersome (not to mention mixing multiple languages in the same text block, ugh)
Unicode support in a language means that the language's native character/string type supports all those languages as well, without the user having to worry about character encodings or multibyte characters and such while doing computations. Of course, one still has to acknowledge character encodings when doing I/O, but doing your string processing in one single sensible encoding helps a lot.
Well, if you care at all about internationalization (a.k.a. the rest of the world), scientific notation, etc., you care about Unicode. Unicode is awkward to deal with mostly because we are so ingrained in ASCII-only thinking. But now that modern systems support Unicode, there is really no reason not to just encode everything as UTF-8. I work in publishing, and for a long time we had to do hacks like inserting GIF images of formulas. Now we can put Unicode straight in, people can search and copy and paste it, and our code can handle it using Unicode-aware regexes.
If you wish to communicate with someone whose native language is not English (either the British or American variants), you care. A lot.
As everyone says - support for all the character sets and formatting used by every other language and locale in the world. Open source and commercial developers both like that because it increases their potential user base by about 20-fold (and growing).
Unicode is a good thing because it eliminates character set problems and leaves one less thing to worry about. Even if your software never leaves the U.S., you never know when you're going to run into a filename or text field with an odd character in it, and Unicode lets you live in ignorance.
Americans like Daisetsu may not care about Unicode, but the rest of the world uses a bit more than 26 Latin letters, and there Unicode is heavily used.
We had hundreds of messed up charsets in the past solely because American computer scientists thought "why would anyone want to use more than 26 Latin characters like we have in English?"
Narrow-mindedness is a bad thing.

Dummy's guide to Unicode

Could anyone give me a concise definitions of
Unicode
UTF7
UTF8
UTF16
UTF32
Codepages
How they differ from Ascii/Ansi/Windows 1252
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you want a really brief introduction:
Unicode in 5 Minutes
Or if you are after one-liners:
Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different than the Unicode assignments
ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
ANSI: a standards body
Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
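To see how the one-liners above relate in practice, here is a small Python sketch encoding one character that lies outside the 16-bit range (the choice of character is arbitrary):

    ch = "\U0001D11E"                 # MUSICAL SYMBOL G CLEF, code point U+1D11E
    print(f"U+{ord(ch):X}")           # U+1D11E
    print(ch.encode("utf-8"))         # 4 bytes
    print(ch.encode("utf-16-le"))     # 4 bytes: two 16-bit units (a surrogate pair)
    print(ch.encode("utf-32-le"))     # 4 bytes: exactly one 32-bit unit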
Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode
Þ (LATIN CAPITAL LETTER THORN)
fi (LATIN SMALL LIGATURE FI)
ή (GREEK SMALL LETTER ETA WITH TONOS)
or 13 other characters, depending on the encoding and character set used.
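You can reproduce that ambiguity directly in Python (the codec names below are Python's spellings of three of the encodings in question):

    raw = bytes([0xDE])
    print(raw.decode("latin-1"))      # 'Þ'  LATIN CAPITAL LETTER THORN
    print(raw.decode("mac_roman"))    # 'ﬁ'  LATIN SMALL LIGATURE FI
    print(raw.decode("iso8859_7"))    # 'ή'  GREEK SMALL LETTER ETA WITH TONOS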
As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...
Yeah, I've got some insight; it might be wrong, but it's helped me understand this.
Let's just take some text. It's stored in the computer's RAM as a series of bytes; the codepage is simply the mapping table between those bytes and the characters you and I read. So something like Notepad comes along with its codepage, translates the bytes to your screen, and you see a bunch of garbage, upside-down question marks, etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter at detecting the correct codepage than others, and some byte streams start with a BOM (Byte Order Mark), which can identify the encoding to use.
UTF-7, UTF-8, UTF-16 and so on play a similar role, but they are encodings of the single Unicode character set rather than separate codepages.
The same text stored using different encodings will have a different file size, because the characters are stored as different byte sequences.
Windows-1252, in turn, is just another codepage: conceptually it is the same kind of mapping table.
For a better smarter answer try one of the links.
Here, read this wonderful explanation from Joel himself.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Others have already pointed out good enough references to begin with. I'm not listing a true Dummy's guide, but rather some pointers from the Unicode Consortium page. You'll find some more nitty-gritty reasons for the usage of different encodings at the Unicode Consortium pages.
The Unicode FAQ is a good enough place to answer some (not all) of your queries.
A more succinct answer on why Unicode exists, is present in the Newcomer's section of the Unicode website itself:
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32 bits) of data for each character.
A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you need uniform, fixed-width storage for every character.
By the way, UTF-7 is not part of the Unicode standard; it was designed primarily for use in mail applications.
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
First of all, there aren't "variations of Unicode". Unicode is a standard, the standard, for assigning code points (integers) to characters. UTF-8 is the most popular way to represent those integers as bytes!
Why should you care as a programmer?
It's fun to understand this!
If you don't have basic understanding of encodings, you can easily produce buggy code.
Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You run myByteArray.toString() and you get the string Hello. Your program works! One day after shipping your code, your German customer calls: "We have a problem, äöü are not displayed correctly!" You start debugging the code, feeling pretty lost without a basic understanding of encodings. With that understanding, however, you know the error probably was this: when running myByteArray.toString(), your program assumed the bytes were in the default system encoding. But maybe they weren't! Maybe they were UTF-8 and your system default is Latin-something, so you should have run myByteArray.toString("UTF8") instead.
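The same mistake is easy to reproduce in Python, where the rough equivalent of toString() on raw bytes is bytes.decode (the sample text is mine):

    data = "äöü".encode("utf-8")      # the bytes that arrive "from somewhere"
    print(data.decode("latin-1"))     # 'Ã¤Ã¶Ã¼' -- wrong guess about the encoding
    print(data.decode("utf-8"))       # 'äöü'    -- decoded with the right encoding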
Resources:
I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back, and the essence of it didn't stick in my mind since there are so many unimportant details.
As already mentioned, http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go to grasp the essence of Unicode.
If you want to actually understand variable length encodings like UTF8 I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.

What do I need to know about Unicode? [closed]

As an application developer, do I need to know Unicode?
Unicode is a standard that defines numeric codes for the characters used in written communication. Or, as they put it themselves:
The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.
There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.
First, go to the source for authoritative, detailed information and implementation guidelines.
As mentioned by others, Joel Spolsky has a good list of these errors.
I also like Elliotte Rusty Harold's Ten Commandments of Unicode.
Developers should also watch out for canonical representation attacks.
Some of the key concepts you should be aware of are:
Glyphs—concrete graphics used to represent written characters.
Composition—combining glyphs to create another glyph.
Encoding—converting Unicode points to a stream of bytes.
Collation—locale-sensitive comparison of Unicode strings.
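Composition in particular trips people up; here is a small Python sketch of what it means in practice (using the accented letter é as an example):

    import unicodedata

    single   = "\u00e9"     # 'é' as one precomposed code point
    combined = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

    print(single, combined)                                  # both render as 'é'
    print(single == combined)                                # False: different code point sequences
    print(unicodedata.normalize("NFC", combined) == single)  # True once composed to NFC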
At the risk of just adding another link, unicode.org is a spectacular resource.
In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has several encoding schemes to handle all those characters - UTF-8, which is more or less the standard these days, stays at a single byte per character wherever it can and is identical to ASCII for the first 128 characters.
(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)
Unicode is an industry-agreed standard for consistently representing text, with the capacity to represent the world's writing systems. All developers need to know about it, as globalization is a growing concern.
One (open) source of code for handling Unicode is ICU - Internationalization Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (presents C interface; uses C++ compiler).
You don't need to learn the whole of Unicode to use it; it's a hell of a complex standard. You just need to know the main issues and how your programming tools deal with it. To learn that, check Galwegian's link and your programming language and IDE documentation.
E.g.:
You can convert any character from Latin-1 to Unicode, but it doesn't work the other way around for all characters.
PHP lets you know that some functions (like stristr) do not work with Unicode.
Python declares Unicode strings this way: u"Hello World".
That's the kind of thing you must know.
Knowing that, if you do not have a GOOD reason not to use Unicode, then just use it.
Unicode is a character set that, unlike ASCII (which contains only 128 characters, enough for English, about a quarter of them being non-printable control characters), has room for roughly 1.1 million code points, including characters of every known language (Chinese, Russian, Greek, Arabic, etc.) and some languages you have probably never even heard of (including lots of symbols from dead languages no longer in use, but useful for archiving ancient documents).
So instead of dealing with dozens of different character encodings, you have one character set for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch the encoding somewhere in the middle). Actually there is still plenty of room left; we are far from having all 1.1 million code points in use, and the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of space.
Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode is the name of the character set itself; how it is expressed as bytes is a separate issue. There are several ways to write Unicode characters: UTF-8 (one to four bytes per character, depending on the code point: English is almost always one byte, other Latin-script languages might be two, Chinese/Japanese are usually three), UTF-16 (most characters are two bytes, some rarely used ones are four bytes) and UTF-32 (every character is four bytes). There are others, but these are the dominant ones.
Unicode is the default character set for many newer OSes (in Mac OS X almost everything is Unicode) and programming languages (Java uses Unicode strings internally, stored as UTF-16; Python 3 strings are Unicode as well). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode, and the sooner the better.
Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.
Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".
There are various encoding schemes that can represent arbitrary Unicode characters in bytes, including UTF-8, UTF-16, and others.

Smallest Unicode encodings for different languages?

What are the typical average bytes-per-character rates for different unicode encodings in different languages?
E.g. if I wanted the smallest number of bytes to encode some English text, then on average UTF-8 would be 1 byte per character and UTF-16 would be 2, so I'd pick UTF-8.
If I wanted some Korean text, then UTF-16 might average about 2 per character but UTF-8 might average about 3 (I don't know, I'm just making up some illustrative numbers here).
Which encodings yield the smallest storage requirements for different languages and character sets?
For any given language, your bytes-per-character rates are fairly constant, because most languages are allocated contiguous blocks of code points. The big exception is accented Latin characters, which are allocated higher in the code space than the unaccented forms. I don't have hard numbers for these.
For languages with contiguous character allocation, there is a table with detailed numbers for various languages on Wikipedia. In general, UTF-8 works well for most small character sets (except those allocated in high code point ranges), and UTF-16 is great for scripts whose characters need two bytes anyway.
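If you would rather measure than estimate, encoding a few samples directly gives the same kind of numbers as that table; a quick Python sketch (the sample sentences are only illustrative):

    samples = {
        "English": "The quick brown fox",
        "Greek":   "Καλημέρα κόσμε",
        "Korean":  "안녕하세요 세계",
    }
    for lang, text in samples.items():
        u8, u16 = len(text.encode("utf-8")), len(text.encode("utf-16-le"))
        print(f"{lang:8} UTF-8: {u8:3} bytes   UTF-16: {u16:3} bytes")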
If you need denser compression, you may also want to look at Unicode Technical Note 14, which compares some special-purpose encodings designed to reduce data size for a variety of languages. But these techniques aren't especially common.
If you're really worried about string/character size, have you thought about compressing them? That would automatically reduce the string to its 'minimal' encoding. It's a layer of headache, especially if you want to do it in memory, and there are plenty of cases in which it wouldn't buy you anything, but encodings tend to be too general-purpose to reach the level of compactness you seem to be aiming for.
UTF-8 is best for any character set whose characters are primarily below U+0800. Otherwise UTF-16.
That is, UTF-8 for Latin, Greek, Cyrillic, Hebrew and Arabic, and a few others. In scripts other than Latin, the letters will take up the same space as they would in UTF-16, but you'll save bytes on punctuation and spacing.
In UTF-16, all the languages that matter (i.e. anything but Klingon, Elvish and other strange things) will be encoded into 2-byte code units.
So the question is to find the languages whose characters fit in 1 or 2 bytes in UTF-8.
In the Wikipedia page on UTF-8:
http://en.wikipedia.org/wiki/Utf-8
We see that a character with a Unicode code point of U+0800 or above will be at least 3 bytes long in UTF-8.
Knowing that, you just need to look at the code charts on unicode: http://www.unicode.org/charts/
for the languages that comply to your requirements.
:-)
Now, note that, depending on the framework you're using, the choice may well not be yours to make:
On Windows API, Unicode is handled by wchar_t chars, and is UTF-16
On Linux, Unicode is handled by char, and is UTF-8
Java is internally UTF-16, as are most compliant XML parsers
I was told (at some tech meeting I was not interested in... sorry...) that UTF-8 was the encoding of choice for databases.
So, pick up your poison...
:-)
I don't know exact figures, but for Japanese Shift_JIS averages fewer bytes per character than UTF-8, and so does EUC-JP, since they're optimised for Japanese text. However, they don't cover the same space of code points as Unicode, so they might not be correct answers to your question.
UTF-16 is better than UTF-8 for Japanese characters (2 bytes per char as opposed to 3), but worse than UTF-8 if there's a lot of 7-bit chars. It depends on the context - technical text is more likely to contain a lot of chars in the 1-byte range. A classical Japanese text might not have any.
Note that for transport, the encoding doesn't matter much if you can zip (gzip, bz2) the data. Code points for an alphabet in Unicode are close together, so you'd expect common prefixes with very short representations in the compressed data.
UTF-8 is usually good for representation in memory, since it's often more compact than UTF-32 or UTF-16, and is compatible with functions on char* which 'expect' ASCII or ISO-8859-1 NUL-terminated strings. It's useless if you need random access to characters by index, though.
If you don't care about non-BMP characters, UCS-2 is always 2 bytes per character and so offers random access. But that depends what you mean by 'Unicode'.
UTF-8
There is a very good article about unicode on JoelOnSoftware:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)