how important is it that libraries treat utf16xx and utf32xx as equal peers to utf8? - unicode

does any signifigant interchange take place in formats other than ascii/utf8? are there any fields where utf16xx and utf32xx are used heavily? i ask as a writer of multiple libraries that work on unicode text, and the burden of supporting all five major variants is quite high compared to the perceived utility.

Windows and Java both treat Unicode as UTF-16 internally, and Python uses UTF-16 or UTF-32 depending on the platform. So more than just UTF-8 is important for these. These are just the cases I'm most familiar with, I'm sure there are others.
So, in my opinion, if you have a Unicode library, you should support UTF-16 and UTF-32. (I can't believe UTF-32 is too difficult, since there's no special processing involved besides byte ordering. Although, I'm not a Unicode library author :) )

One important point is XML: it can come in pretty much any encoding imaginable, but UTF-8 is by far the most common.
However, the XML spec says this:
All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode
So if your application/library handles XML in any way it must support UTF-16 at least in that portion. Note that a conforming parser that converts the data to UTF-8 for processing would be enough here.

When it comes to interchange, I guess you are right that UTF-8 is prevalent. Some cases of using UTF-16 are various binary protocols such as DCOM, Java RMI and (maybe???) CORBA.
As for UTF-32 I've never heard of a case where it is used for interchange.

Related

What determines how strings are encoded in memory?

Say we have a file that is Latin-1 encoded and that we use a text editor to read in that file into memory. My questions are then:
How will those character strings be represented in memory? Latin-1, UTF-8, UTF-16 or something else?
What determines how those strings are represented in memory? Is it the application, the programming language the application was written in, the OS or the hardware?
As a follow-up question:
How do applications then save files to encoding schemes that use different character sets? F.e. converting UTF-8 to UTF-16 seems fairly intuitive to me as I assume you just decode to the Unicode codepoint, then encode to the target encoding. But what about going from UTF-8 to Shift-JIS which has a different character set?
Operating system
Windows
1993: Windows adopted Unicode 1.0 with NT 3.1 - back then Unicode was what is nowadays known as UCS-2. That Windows version also introduced NTFS (New Technology File System), which also stores every filename in UCS-2 like manner (16 bit codepoints).
2000: With NT 5.0 (aka Windows 2000) there was a shift/improvement from UCS-2 to UTF-16 - both OS and encoding became available in this year.
Since then nothing has changed. Internally, Windows uses 16 bit codepoints for almost 30 years already, and thanks to UTF-16 also newest codepoints such as Emojis are supported. Its API works the same way, with compatibility functions for byte-wise encodings merely being stubs that convert the input to UTF-16. See also
What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types?
"Windows uses UTF-16 as its internal encoding", what exactly does this mean?
Why does Windows use UTF-16LE?
Is it safe to assume all Windows platforms will be in UCS-2 LE
Unix: most distributions use UTF-8 by default, because it's most backward compatible while being future proof enough.
Programming language
Depends on their age or on their compiler: while languages themselves are not necessarily bound to an OS the compiler which produces the binaries might treat things differently as per OS.
Pascal: based in 1970 the String was just an array of bytes, not even necessarily meaning text. And for text ASCII or one of the other single-byte encodings could easily be dealt with.
Delphi: adopted as per Windows WideString, dealing with 16 bit per character, to perfectly make use of the WinAPI and its Unicode support. Later additions also emerged the UTF8String, which works with bytes again, but not necessarily only one byte per character. But also creations such as UCS4String are available since 2009, eating 4 bytes per character.
Free Pascal: stays with the old String but always defaults to UTF-8 encoding. While this always needs conversion when using the WinAPI it is also more platform independent. Several other String (compatibilty) types also exist, each with different memory usage.
ECMAScript (JavaScript): as per standard an engine should use UTF-16 for texts. See also JavaScript strings - UTF-16 vs UCS-2?
Java: engines must support a minimum of encodings, including UTF-16, thus internal String handling/memory usage may differ. See also What is the Java's internal represention for String? Modified UTF-8? UTF-16?
Application/program
Depends on the platform/OS. While the in-memory consumption of text is strongly influenced by the programming language compiler and the data types used there, using libraries (which could have been produced by entirely other compilers and programming languages) can mix this.
Strictly speaking the binary file format also has its strict encodings: on Windows the PE (used in EXE, DLL, etc.) has resource Strings in 16 bit characters again. So while f.e. the Free Pascal Compiler can (as per language) make heavy use of UTF-8 it will still build an EXE file with UTF-16 metadata in it.
Programs that deal with text (such as editors) will most likely hold any encoding "as is" in memory for the sake of performance, surely with compromises such as temporarily duplicating parts into Strings of 32 bit per character, just to quickly search through it, let alone supporting Unicode normalization.
Conversion
The most common approach is to use a common denominator:
Either every input is decoded into 32 bit characters which are then encoded into the target. Costs the most memory, but makes it easy to deal with.
In the WinAPI you either convert to UTF-16 via MultiByteToWideChar(), or from UTF-16 via WideCharToMultiByte(). To go from UTF-8 to Shift-JIS you'd make a sidestep from UTF-8 to UTF-16, then from UTF-16 to Shift-JIS. Support for all the encodings shift as per version and localized installation, there's not really a guarantee for all of them.
External libraries specialized on encodings alone can do this, like iconv - these support many encodings unbound to the OS support.

When to use Unicode (aside with non-unicode!)

I haven't found much (concise) info about when exactly to use Unicode. I understand that many say best practice is to always use Unicode. But Unicode strings DO have more memory footprint. Am I correct to say that Unicode must be used only when
Printing something to screen other than local (for example debugging) use.
Generally, sending any type of text across a network with the two ends being in different locales/country
When you're not sure which to use
I think it would be beneficial if someone explained the basics (concise) of what actually happens with Unicode... am I correct to say that things get messy when :
the physical (byte) string gets sent to a machine using a representation of strings (code page, others... this is already detail although interesting) different from the sender.
The context is using Unicode in a programming language (say C++), but I hope answers to this question can be used for any encoding situation.
Also, I'm aware Unicode and NLS are not the same thing, but is it correct to say that NLS implies usage of Unicode?
P.S. awesome site
Always use Unicode, it will save you and others a lot of pain.
What you may have confused is the issue of encoding. Unicode strings do not necessarily take more memory than the equivalent ASCII (or other encoding) strings, that depends a lot on the encoding used.
Sometimes "Unicode" is used as a synonym for "UCS-2" or "UTF-16". Strictly speaking that use is wrong, because "Unicode" is the standard that defines the set of characters and their unicode codepoints. It does not as such define a mapping to bytes (or words). UTF-16, UTF-8 and other encoding take over the job of mapping the characters to concrete bytes.
The beauty of Unicode is that it frees you from restrictions and lots of headaches. Unicode is the largest character set available to date, i.e. it enables you to actually encode and use virtually any character of any halfway mainstream language in use today. With any other character set you need to think about whether it can actually encode a character or not. Latin-1 cannot encode the character "あ", Shift-JIS cannot encode the character "ڥ" and so on. Only if you're very sure you will never ever need anything other than basic Latin/Arabic/Japanaese/whatever other subset of characters should you choose a specialized encoding such as Latin-1, BIG-5, Shift-JIS or ASCII.
Unicode is the most versatile charset available and therefore a good standard to adhere to.
The Unicode encodings are nothing special, they're just a little more complex in their bit representation since they have to encode many more characters while still trying to be space efficient. For a very detailed excursion into this topic, please see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
I have a little utility which is sometimes helpful in seeing the difference between character encodings. http://sodved.awardspace.info/unicode.pl. If you paste in ö into the Raw (UTF-8) field you will see that it is represented by different byte sequences in different encodings. And as the other two good answers describe, some non-unicode encodings cannot represent it at all.

Dummy's guide to Unicode

Could anyone give me a concise definitions of
Unicode
UTF7
UTF8
UTF16
UTF32
Codepages
How they differ from Ascii/Ansi/Windows 1252
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you want a really brief introduction:
Unicode in 5 Minutes
Or if you are after one-liners:
Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different than the Unicode assignments
ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
ANSI: a standards body
Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode
Þ (LATIN CAPITAL LETTER THORN)
fi (LATIN SMALL LIGATURE FI)
ή (GREEK SMALL LETTER ETA WITH TONOS)
or 13 other characters, depending on the encoding and character set used.
As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...
Yea I got some insight but it might be wrong, however it's helped me to understand it.
Let's just take some text. It's stored in the computers ram as a series of bytes, the codepage is simply the mapping table between the bytes and characters you and i read. So something like notepad comes along with its codepage and translates the bytes to your screen and you see a bunch of garbage, upside down question marks etc. This does not mean your data is garbled only that the application reading the bytes is not using the correct codepage. Some applications are smarter at detecting the correct codepage to use than others and some streams of bytes in memory contain a BOM which stands for a Byte Order Mark and this can declare the correct codepage to use.
UTF7, 8 16 etc are all just different codepages using different formats.
The same file stored as bytes using different codepages will be of a different filesize because the bytes are stored differently.
They also don't really differ from windows 1252 as that's just another codepage.
For a better smarter answer try one of the links.
Here, read this wonderful explanation from the Joel himself.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Others have already pointed out good enough references to begin with. I'm not listing a true Dummy's guide, but rather some pointers from the Unicode Consortium page. You'll find some more nitty-gritty reasons for the usage of different encodings at the Unicode Consortium pages.
The Unicode FAQ is a good enough place to answer some (not all) of your queries.
A more succinct answer on why Unicode exists, is present in the Newcomer's section of the Unicode website itself:
Unicode provides a unique number for
every character, no matter what the
platform, no matter what the program,
no matter what the language.
As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:
UTF-8 is popular for HTML and similar
protocols. UTF-8 is a way of
transforming all Unicode characters
into a variable length encoding of
bytes. It has the advantages that the
Unicode characters corresponding to
the familiar ASCII set have the same
byte values as ASCII, and that Unicode
characters transformed into UTF-8 can
be used with much existing software
without extensive software rewrites.
UTF-16 is popular in many environments
that need to balance efficient access
to characters with economical use of
storage. It is reasonably compact and
all the heavily used characters fit
into a single 16-bit code unit, while
all other characters are accessible
via pairs of 16-bit code units.
UTF-32 is popular where memory space
is no concern, but fixed width, single
code unit access to characters is
desired. Each Unicode character is
encoded in a single 32-bit code unit
when using UTF-32.
All three encoding forms need at most
4 bytes (or 32-bits) of data for each
character.
A general thumb rule is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you are concerned about utilizing characters with uniform storage.
By the way UTF-7 is not a Unicode standard and was designed primarily for use in mail applications.
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
First of all, there aren't "variations of unicode". Unicode is a standard, the standard, to assign code points (integers) to characters. UTF8 is the most popular way to represent those integers as bytes!
Why should you care as a programmer?
It's fun to understand this!
If you don't have basic understanding of encodings, you can easily produce buggy code.
Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string Hello. Your program works! One day after shiping your code your german customer calls: "We have a problem, äöü are not displayed correctly!". You start debugging the code, feeling pretty lost without a basic understanding of encodings. However, with the understanding of encodings you know that the error probably was this: When running myByteArray.toString(), your program assumed the string was encoded with the default system encoding. But maybe it wasn't! Maybe it was UTF8 and your system is LATIN-SOMETHING and so you should have ran myByteArray.toString("UTF8") instead!
Resources:
I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back and the essence of it didn't stick to my brain since there are so many unimportant details.
As already mentioned http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go for to grasp the essence of unicode.
If you want to actually understand variable length encodings like UTF8 I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.

Why does anyone use an encoding other than UTF-8? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I want to know why any developer would need to use an encoding other than UTF-8.
Wikipedia lists advantages and disadvantages of UTF-8 as compared to a variety of other encodings:
http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages
The most important disadvantages are IMHO that UTF-8 might use significantly more space especially in Asian languages such as Chinese, Japanese or Hindi and that not all code points have the same size which makes measurements more difficult and many string operations such as search inefficient.
Well, some do it because their tools are archaic or flawed. Some do it because they don't see a need to support anything other than ASCII. Some do it because they don't know any better.
Those are the usual excuses for not using Unicode.
As for not using UTF-8 specifically there are different reasons. Some systems, like Windows1 (and stemming from that, .NET) and Java came to be in a time where Unicode was a strict 16-bit code. Therefore, there was really only one encoding: UCS-2, encoding code points directly as 16-bit words.
Later Unicode was expanded to 21 bits because 65536 code points weren't enough anymore. This caused encodings such as UTF-32 and UTF-16 to appear. For systems previously working with UCS-2 the transition to UTF-16 was the easiest and most sensible choice. Windows did that transition back in Ye Olde Days of Windows 2000.
So while I think that nearly all application nowadays should support Unicode I don't think it is entirely necessary for them to specifically use UTF-8. There are historic reasons for that and no real benefit in converting existing systems from UTF-16 to UTF-8.
1 NT.
In UTF-8 code points between 0800 and FFFF take up three bytes in UTF-8 but only two in UTF-16. See the wikipedia comparison for more details, but basically if text heavily uses code points in this range (say, if it's Chinese), UTF-8 files will be larger than UTF-16 files with the same content.
UTF-8 is very efficient at encoding plain English text (same as ASCII). If your user base is likely to be mostly, say, Chinese, you will be much better off using UTF-16.
For more information, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Because outside the English-speaking world, people have been using various encodings that predate Unicode and are tailored for their respective languages for decades. These language-specific encodings have become ingrained everywhere and are pretty much a standard. If you want to have any hope of interfacing with legacy systems, you have to use them, so all systems have to support them and usually use them as default even if they by now support UTF-8 as well. There may even be multiple legacy encodings traditionally used for different purposes.
Examples:
ISO-8859-1 in western Europe - actually outdated there as well, as you need ISO-8859-15 for the Euro sign
ISO-2022-JP in Japan for emails, Shift JIS for websites
Big5 in Taiwan
GB2312 in China
The last two examples show that encodings can even be a political issue.
Sometimes they are restricted due to historical/unsupported reasons (I'm developing on Windows using Zend Studio on a Samba share on a Linux box: and something in that mix means I keep reverting to Cp1512 instead of UTF8).
Sometimes you don't need to use UTF-8 (for example when storing a md5 hash in a database: you only need the hexadecimal range 0-9 A-F: why make it a UTF-8 field, which will take at least a byte extra storage instead of normal ASCII).
Sometimes it's just laziness learning the UTF-8 functions for a particular language.
Because they do not know better.
The only valid criticism to utf-8 is that encodings for common Asian languages are oversized from other encodings.
UTF-8 is superior because
It is ASCII compatible. Most known and tried string operations do not need adaptation.
It is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age. If you have important data in encoding X, spend two minutes on Google and write a conversion function. Even if you have to interface with sourceless legacy app Z, you can run your communications through a pipe so that your logic stays in the 21st century.
UTF-16 isn't fixed length either and assuming it is like many do, will only cause terrible bugs.
Additionally Unicode is very complex and it is almost certain than any fixed-size algorithm adapted from ASCII will yield bad results even in UTF-32.
Say you have this UTF-16 string.
[0][1][2][F|3] [4] [5]
And you want to insert a character with code 8 between [3] and [4]
you would do insert(5,8)
If you don't check for characters outside BMP(serially as in UTF-8 as you cannot know how many double sized characters you have) you get:
[0][1][2][F|8][3][4][5]
Two new garbage characters. So much for your fixed size encoding.
You can of course disallow such characters altogether, but then when your code interfaces with the real world, you might find your program saves the profile for this user who lives in rm -Rf / in .profile instead of [Classical Chinese Proverb].profile.
Or just an angry user that cannot write his thesis on Classical Chinese Proverbs with your software.
One legitimate reason is when you need to deal with legacy documents, software or hardware that are not Unicode compatible.
Another legitimate reason is that you need to use a programming language / libraries that do not support UTF8 / Unicode well ... or at all.
Other answers mention that UTF-16 is more compact than UTF-8 for Asian languages / characters.
And of course there are reasons like short-sightedness, ignorance, laziness ... and deadlines.
Its also worth remembering that in some circumstances (where a non-latin set of characters are needed) UTF-8 can actually bloat larger than the 16 bit Unicode encoding. In those cases ucs-2 or utf-16 would be a better choice.
The reasons for using non-Unicode 8-bit character sets / encodings are all back compatibility of some kind, and/or inertia. For that matter, the most frequent reasons for using UTF-8 are compatibility with standards like XML that mandate or prefer UTF-8.
Differences in the number of bytes you think text will take up in different encodings, especially in storage, are mostly theoretical. In real world situations, compatibility requirements are more important. If compression is used, the size differences go away anyway. Even if compression is not used, total text size is hard to predict and is rarely a deciding factor.
When converting legacy code that used non-Unicode 8-bit encodings, using UTF-16 can be a tool for making sure all code has been converted, because mismatches can be caught as compile-time type errors. Many languages, runtimes and libraries like Javascript, JVM, .NET, ICU use 16-bit strings and UTF-16, even though storage and Internet protocols are usually 8-bit.
Imagine all files to consider are in GB2312 (China mainland standard). Then you might choose GB18030 as Unicode encoding instead. They are compatible the same way as all ASCII is UTF-8. That is useful in China mainland!
You might decide even quicker when you find out that both mentioned GB-standards are required in your IT-product by law (as far as I have heard), if you want to ship in China (mainland).
Another upside is that GB2312, and as such GB18030 as well, are also ASCII compatible.
It is algorithmically not so robust, though. – So if you have no political reasons or any GB2312 legacy, it makes no sense to use it. But if you do, here you got your answer.
Related to the subject, when using MySQL, as if it wasn't complex enough, you get the option the choose which kind of UTF-8 collation you want to use. So what would you use?
UTF-8 general ci
or
UTF-8 unicode ci?
(I tend to use the UTF-8 variant that is used for the database connection)
Because you sometimes want to operate easily on codepoints -- then you'd choose f.e. UCS-2 or UCS-4.
Many APIs require other Unicode encodings - mostly UTF-16. For instance, Java, .NET, Win32.
At my previous employer we used iso-8859-1 for some of our ASP pages to match the collation of our SQL Server, which as you can guess was not Unicode. I wanted to change the collation, but the manager said to wait till we upgrade our SQL Server to do it. Needless to say it never happened - I haven't been with them for a little over a year now, so I don't know if they finally did it.
Unicode certainly is a good place to work from in most cases, but a developer should be familiar with many different types of character encoding. Certainly ASCII might be used if the set of characters is limited.
What if you're a developer and receiving data from a source that doesn't send UTF-8? There could be lots of interface issues if you don't understand your input.
Joel's article on the must-knows for character encoding is good and worth reading.

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?
In what way are these helpful for programmers?
Going down your list:
"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.
There's more on my Unicode page and tips for debugging Unicode problems.
The other big resource of code is unicode.org which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.
Some reading to get you started on character encodings: Joel on Software:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By the way - ASP.NET has nothing to do with it. Encodings are universal.