What is exactly an overlong form/encoding? - unicode

Reading the Wikipedia article on UTF-8, I've been wondering about the term overlong. This term is used various times but the article doesn't provide a definition or reference for its meaning.
I would like to know if someone can explain the term and its purpose.

It's an encoding of a code point which takes more code units than it needs to.
For example, U+0020 is represented in UTF-8 by the single byte 0x20. If you decode the two bytes 0xc0 0xa0 in the normal fashion, you'll still end up back at U+0020, but that's an invalid representation.
The Unicode Corrigendum #1 has more information, particularly around table 3.1B.
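To make the 0xc0 0xa0 example concrete, here is a minimal Python sketch. The helper `naive_decode_two_bytes` is a hypothetical name invented for this illustration; it applies only the two-byte bit layout and deliberately skips the overlong check that a conforming decoder must perform.

```python
def naive_decode_two_bytes(b1: int, b2: int) -> int:
    """Decode a 110xxxxx 10yyyyyy pair WITHOUT rejecting overlong forms."""
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

overlong = bytes([0xC0, 0xA0])

# The naive arithmetic lands back on U+0020 (space).
code_point = naive_decode_two_bytes(*overlong)

# Python's built-in strict decoder refuses the same two bytes.
try:
    overlong.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
```

The naive helper returns 0x20, while the built-in decoder raises an error, which is the behaviour the specification requires.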

The UTF-8 bit layout makes it mechanically possible to write a longer representation of a character that also has a shorter one. For example, you could put an ASCII character into the two-byte form by setting the most significant payload bits to zero. The UTF-8 specification explicitly forbids this.

Related

Understanding encoding schemes

I cannot understand some key elements of encoding:
Is ASCII only a character set, or does it also have its own encoding algorithm?
Do other Windows code pages such as Latin1 have their own encoding algorithm?
Are UTF-7, 8, 16 and 32 the only encoding algorithms?
Are the UTF algorithms used only with the Unicode set?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are being used in this process? More specifically, do Latin1/Big5 use their own encoding algorithm, or do I have to use a UTF algorithm?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
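As a rough illustration of the "Hello World" case, the following Python sketch uses only the built-in `str.encode` method. The sample string "café" is an assumption added here to show what a lossy conversion looks like; it is not from the question.

```python
text = "Hello World"

# Pure-ASCII text produces identical bytes in ASCII and Latin-1.
ascii_bytes = text.encode("ascii")
latin1_bytes = text.encode("latin-1")
same_bytes = ascii_bytes == latin1_bytes

# A character above 127 cannot go to ASCII without loss.
lossy = "café".encode("ascii", errors="replace")  # "é" becomes "?"
try:
    "café".encode("ascii")  # the strict default refuses instead
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```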
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
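The "encoding A → Unicode code point → encoding B" route can be sketched in a few lines of Python. The function name `transcode` is made up for this example, and `cp037` is chosen here as one common EBCDIC variant; other EBCDIC code pages differ in places.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from one encoding to another via Unicode code points."""
    as_unicode = data.decode(src)   # encoding A -> Unicode
    return as_unicode.encode(dst)   # Unicode -> encoding B

# 0xC1 is "A" in EBCDIC (cp037); it maps to 0x41 in Windows-1250.
ebcdic_a = b"\xC1"
windows_a = transcode(ebcdic_a, "cp037", "cp1250")
```

Each codec only needs its own table to and from Unicode, so no dedicated EBCDIC-to-Windows-1250 table is required.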
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both, it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes.
No, there are others as well: ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Every character in Unicode has been assigned a Unicode value [numbered from 0 upward], and that value is unique. How you use that value is up to you: you can use it directly, or you can use a known encoding scheme such as UTF-8 or UTF-16. An encoding scheme maps the Unicode value to a specific bit sequence [which can vary from 1 byte to 4 bytes, or maybe 8 in future if we get to know about all the languages of the universe/aliens/multiverse] so that it can be uniquely identified within that scheme.
For example, ASCII is an encoding scheme which encodes only 128 of all those characters. It uses one byte for every character, and the result is identical to the UTF-8 representation of the same characters. GSM7 is another format, which uses 7 bits per character to encode 128 characters from the Unicode character list.
Utf8:
It uses 1 byte for characters whose Unicode value is up to 127.
Beyond this it has its own way of representing the Unicode values.
It uses 2 bytes for Cyrillic and 3 bytes for Hindi characters.
Utf16:
It uses 2 bytes for characters whose Unicode value is up to 127.
It also uses 2 bytes for Cyrillic and Hindi characters.
The variable-width UTF encoding schemes fix the initial bits in a specific pattern [e.g. 110|restbits], and the remaining bits [e.g. initialbits|11001] carry the Unicode value to make a unique representation.
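The fixed-prefix idea can be sketched for the two-byte UTF-8 form. `encode_two_byte` is a hypothetical helper written for this illustration; it covers only code points U+0080 to U+07FF and is checked against Python's built-in encoder. The sample characters "Я" (Cyrillic) and "अ" (Devanagari) are choices made for this example.

```python
def encode_two_byte(cp: int) -> bytes:
    """Pack a code point in U+0080..U+07FF as 110xxxxx 10yyyyyy."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point does not fit the two-byte form")
    lead = 0b11000000 | (cp >> 6)         # fixed prefix 110 + top 5 bits
    trail = 0b10000000 | (cp & 0b111111)  # fixed prefix 10 + low 6 bits
    return bytes([lead, trail])

manual = encode_two_byte(ord("Я"))

# Byte counts per character for the cases described above.
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in "AЯअ"}
utf16_sizes = {ch: len(ch.encode("utf-16-le")) for ch in "AЯअ"}
```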
Wikipedia on utf8, utf16, unicode will make it clear.
I coded a UTF translator which converts incoming UTF-8 text across all languages into its equivalent UTF-16 text.

When to use Unicode (aside with non-unicode!)

I haven't found much (concise) info about when exactly to use Unicode. I understand that many say best practice is to always use Unicode. But Unicode strings DO have more memory footprint. Am I correct to say that Unicode must be used only when
Printing something to screen other than local (for example debugging) use.
Generally, sending any type of text across a network with the two ends being in different locales/country
When you're not sure which to use
I think it would be beneficial if someone explained the basics (concise) of what actually happens with Unicode... am I correct to say that things get messy when :
the physical (byte) string gets sent to a machine using a representation of strings (code page, others... this is already detail although interesting) different from the sender.
The context is using Unicode in a programming language (say C++), but I hope answers to this question can be used for any encoding situation.
Also, I'm aware Unicode and NLS are not the same thing, but is it correct to say that NLS implies usage of Unicode?
P.S. awesome site
Always use Unicode, it will save you and others a lot of pain.
What you may have confused is the issue of encoding. Unicode strings do not necessarily take more memory than the equivalent ASCII (or other encoding) strings, that depends a lot on the encoding used.
Sometimes "Unicode" is used as a synonym for "UCS-2" or "UTF-16". Strictly speaking that use is wrong, because "Unicode" is the standard that defines the set of characters and their Unicode codepoints. It does not as such define a mapping to bytes (or words). UTF-16, UTF-8 and other encodings take over the job of mapping the characters to concrete bytes.
The beauty of Unicode is that it frees you from restrictions and lots of headaches. Unicode is the largest character set available to date, i.e. it enables you to actually encode and use virtually any character of any halfway mainstream language in use today. With any other character set you need to think about whether it can actually encode a character or not. Latin-1 cannot encode the character "あ", Shift-JIS cannot encode the character "ڥ" and so on. Only if you're very sure you will never ever need anything other than basic Latin/Arabic/Japanese/whatever other subset of characters should you choose a specialized encoding such as Latin-1, BIG-5, Shift-JIS or ASCII.
Unicode is the most versatile charset available and therefore a good standard to adhere to.
The Unicode encodings are nothing special, they're just a little more complex in their bit representation since they have to encode many more characters while still trying to be space efficient. For a very detailed excursion into this topic, please see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
I have a little utility which is sometimes helpful in seeing the difference between character encodings. http://sodved.awardspace.info/unicode.pl. If you paste in ö into the Raw (UTF-8) field you will see that it is represented by different byte sequences in different encodings. And as the other two good answers describe, some non-unicode encodings cannot represent it at all.
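If the linked utility is unavailable, roughly the same comparison can be done locally. This is a small Python sketch using only `str.encode`; the list of encodings is an arbitrary selection made for this example.

```python
results = {}
for encoding in ("utf-8", "latin-1", "utf-16-le", "ascii"):
    try:
        results[encoding] = "ö".encode(encoding).hex()
    except UnicodeEncodeError:
        results[encoding] = None  # this encoding cannot represent "ö"

# Latin-1 has no slot for Japanese hiragana.
try:
    "あ".encode("latin-1")
    hiragana_fits_latin1 = True
except UnicodeEncodeError:
    hiragana_fits_latin1 = False
```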

What is the Best UTF [closed]

Closed 10 years ago.
I'm really confused about UTF in Unicode.
there is UTF-8, UTF-16 and UTF-32.
my question is :
what UTF that are support all Unicode blocks ?
What is the best UTF(performance, size, etc), and why ?
What is different between these three UTF ?
what is endianness and byte order marks (BOM) ?
Thanks
what UTF that are support all Unicode blocks ?
All UTF encodings support all Unicode blocks - there is no UTF encoding that can't represent any Unicode codepoint. However, some non-UTF, older encodings, such as UCS-2 (which is like UTF-16, but lacks surrogate pairs, and thus lacks the ability to encode codepoints above 65535/U+FFFF), may not.
What is the best UTF(performance, size, etc), and why ?
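For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient. However, UTF-8 is sometimes less space-efficient than UTF-16 where most of the codepoints used are high (such as large bodies of CJK text, which take three bytes per character in UTF-8 but two in UTF-16). UTF-32 is never smaller than UTF-8, since it spends four bytes on every codepoint.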
What is different between these three UTF ?
UTF-8 encodes each Unicode codepoint in one to four bytes. The Unicode values 0 to 127, which are the same as they are in ASCII, are encoded like they are in ASCII. Bytes with values 128 to 255 are used for multi-byte codepoints.
UTF-16 encodes each Unicode codepoint in either two bytes (one UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) is encoded with one UTF-16 value. Codepoints from higher planes use two UTF-16 values, through a technique called 'surrogate pairs'.
UTF-32 is not a variable-length encoding for Unicode; all Unicode codepoint values are encoded as-is. This means that U+10FFFF is encoded as 0x0010FFFF.
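The three forms can be compared side by side in Python. The helper `utf16_units` is a hypothetical name for this sketch; it shows the surrogate-pair arithmetic and, for brevity, does not reject code points that are themselves surrogates.

```python
def utf16_units(cp: int) -> list:
    """Return the UTF-16 code unit(s) for one code point."""
    if cp < 0x10000:
        return [cp]                       # BMP: one 16-bit value
    v = cp - 0x10000                      # 20 bits remain
    return [0xD800 | (v >> 10),           # high surrogate: top 10 bits
            0xDC00 | (v & 0x3FF)]         # low surrogate: bottom 10 bits

top = "\U0010FFFF"
as_utf8 = top.encode("utf-8").hex()
as_utf16 = top.encode("utf-16-be").hex()
as_utf32 = top.encode("utf-32-be").hex()
manual_pair = utf16_units(0x10FFFF)
```

The big-endian codecs are used so that the hex output reads in the same order as the code unit values.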
what is endianness and byte order marks (BOM) ?
Endianness is how a particular piece of data, CPU architecture or protocol orders the bytes of multi-byte data types. Little-endian systems (such as x86-32 and x86-64 CPUs, and ARM in its usual configuration) put the least-significant byte first, and big-endian systems (such as PowerPC in its traditional configuration and many networking protocols) put the most-significant byte first.
In a little-endian encoding or system, the 32-bit value 0x12345678 is stored or transmitted as 0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78.
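The two byte orders for 0x12345678 can be reproduced with Python's built-in `int.to_bytes`; this is just a quick check of the layout described above.

```python
value = 0x12345678

little = value.to_bytes(4, byteorder="little")  # least-significant byte first
big = value.to_bytes(4, byteorder="big")        # most-significant byte first

# Reading the bytes back with the matching order restores the value.
round_trip = int.from_bytes(little, byteorder="little")
```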
A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way -- U+FEFF is a valid codepoint, used for the byte order mark, while U+FFFE is not. Therefore, if a file starts with 0xFF 0xFE, it can be assumed that the rest of the file is stored in a little-endian byte ordering.
A byte order mark in UTF-8 is technically possible, but is meaningless in the context of endianness for obvious reasons. However, a stream that begins with the UTF-8 encoded BOM almost certainly implies that it is UTF-8, and thus can be used for identification because of this.
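A BOM check along these lines can be sketched with the constants in Python's standard `codecs` module. The function name `sniff_bom` and the returned labels are choices made for this example. The UTF-32 marks are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Longest marks first, so FF FE 00 00 is not mistaken for UTF-16 LE.
_BOMS = (
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-8-sig", codecs.BOM_UTF8),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
)

def sniff_bom(data: bytes):
    """Return an encoding label if data starts with a known BOM, else None."""
    for label, bom in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

A missing BOM says nothing about the encoding, so `None` here means "unknown", not "not Unicode".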
Benefits of UTF-8
ASCII is a subset of the UTF-8 encoding and therefore is a great way to introduce ASCII text into a 'Unicode world' without having to do data conversion
UTF-8 text is the most compact format for ASCII text
Valid UTF-8 can be sorted on byte values and result in sorted codepoints
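The sorting property can be checked with a small Python experiment. The four sample strings are arbitrary choices for this sketch; the UTF-16 comparison is included only to show that the property does not carry over once surrogate pairs are involved.

```python
samples = ["\uFF5E", "\U0001F600", "a", "é"]

# Python compares str values code point by code point.
by_code_point = sorted(samples)

# Sorting on the raw encoded bytes instead.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(samples, key=lambda s: s.encode("utf-16-be"))

utf8_matches = by_utf8_bytes == by_code_point
utf16_matches = by_utf16_bytes == by_code_point
```

In UTF-16 the surrogate range (0xD800 to 0xDFFF) sorts below BMP values such as 0xFF5E, so U+1F600 ends up before U+FF5E.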
Benefits of UTF-16
UTF-16 is easier than UTF-8 to decode, even though it is a variable-length encoding
UTF-16 is more space-efficient than UTF-8 for characters in the BMP, but outside ASCII
Benefits of UTF-32
UTF-32 is not variable-length, so it requires no special logic to decode
“Answer me these questions four, as all were answered long before.”
You really should have asked one question, not four. But here are the answers.
All UTF transforms by definition support all Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet claim they are UTF-16, and UCS-2 is severely broken in several fundamental ways:
UCS-2 is not a valid Unicode encoding.
UCS-2 supports only ¹⁄₁₇ᵗʰ of Unicode. That is, Plane 0 only, not Planes 1–16.
UCS-2 permits code points that The Unicode Standard guarantees will never be in a valid Unicode stream. These include
all 2,048 UTF-16 surrogates, code points U+D800 through U+DFFF
the 32 non-character code points between U+FDD0 and U+FDEF
both sentinels at U+FFFE and U+FFFF
For what encoding is used internally by seven different programming languages, see slide 7 on Feature Support Summary in my OSCON talk from last week entitled “Unicode Support Shootout”. It varies a great deal.
UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:
UTF-8 is the de facto standard Unicode encoding on the web.
UTF-8 can be stored in a null-terminated string.
UTF-8 is free of the vexing BOM issue.
UTF-8 risks no confusion of UCS-2 vs UTF-16.
UTF-8 compacts mainly-ASCII text quite efficiently, so that even Asian texts that are in XML or HTML often wind up being smaller in bytes than UTF-16. This is an important thing to know, because it is a counterintuitive and surprising result. The ASCII markup tags often make up for the extra byte. If you are really worried about storage, you should be using proper text compression, like LZW and related algorithms. Just bzip it.
If need be, it can be roped into use for trans-Unicodian points of arbitrarily large magnitude. For example, MAXINT on a 64-bit machine becomes 13 bytes using the original UTF-8 algorithm. This property is of rare usefulness, though, and must be used with great caution lest it be mistaken for a legitimate UTF-8 stream.
I use UTF-8 whenever I can get away with it.
I have already given properties of UTF-8, so here are some for the other two:
UTF-32 enjoys a singular advantage for internal storage: O(1) access to code point N. That is, constant time access when you need random access. Remember we lived forever with O(N) access in C’s strlen function, so I am not sure how important this is. My impression is that we almost always process our strings in sequential not random order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run.
UTF-16 is a terrible format, having all the disadvantages of UTF-8 and UTF-32 but none of the advantages of either. It is grudgingly true that when properly handled, UTF-16 can certainly be made to work, but doing so takes real effort, and your language may not be there to help you. Indeed, your language is probably going to work against you instead. I’ve worked with UTF-16 enough to know what a royal pain it is. I would stay clear of both these, especially UTF-16, if you possibly have any choice in the matter. The language support is almost never there, because there are massive pods of hysterical porpoises all contending for attention. Even when proper code-point instead of code-unit access mechanisms exist, these are usually awkward to use and lengthy to type, and they are not the default. This leads too easily to bugs that you may not catch until deployment; trust me on this one, because I’ve been there.
That’s why I’ve come to talk about there being a UTF-16 Curse. The only thing worse than The UTF-16 Curse is The UCS-2 Curse.
Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.
I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange alone. Anything that makes you get at code units instead of code points is far far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. You want a code-point interface, not a code-unit interface. Now that your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.
Category Errors
Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much NOT “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire but rules for handling them. Not just more characters, but rather more characters that have particular rules accompanying them. Unicode characters without Unicode rules are no longer Unicode characters.
If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn’t work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, re, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do you break the rules and it is no longer Unicode. It’s just a mess.
For example, Unicode defines U+03C3 GREEK SMALL LETTER SIGMA and U+03C2 GREEK SMALL LETTER FINAL SIGMA as case-insensitive versions of each other. (This is called Unicode casefolding.) But since they don’t change when blindly mapped to lowercase and compared, that comparison fails. You just can’t do it that way. You can’t fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need to use casefolding betrays a shaky understanding of the whole works.
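The sigma example can be tried directly in Python 3, whose `str` type offers both `lower()` (casemapping) and `casefold()` (casefolding). This sketch covers plain string comparison only and makes no claim about how any regex library behaves.

```python
sigma = "\u03C3"        # GREEK SMALL LETTER SIGMA
final_sigma = "\u03C2"  # GREEK SMALL LETTER FINAL SIGMA

# Lowercasing leaves both letters as they are, so they compare unequal.
equal_by_lower = sigma.lower() == final_sigma.lower()

# Casefolding maps both to the same folded form.
equal_by_casefold = sigma.casefold() == final_sigma.casefold()
```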
(And that’s nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python’s innumerably many Unicode regex issues is Matthew Barnett’s marvelous regex library for Python 2 and Python 3. It is really quite neat, and it actually gets Unicode casefolding right — amongst many other Unicode things that the standard re gets miserably wrong.)
REMEMBER: Unicode is not just more characters: Unicode is rules for handling more characters. One either learns to work with Unicode, or else one works against it, and if one works against it, then it works against you.
All of them support all Unicode code points.
They have different performance characteristics - for example, UTF-8 is more compact for ASCII characters, whereas UTF-32 makes it easier to deal with the whole of Unicode including values outside the Basic Multilingual Plane (i.e. above U+FFFF). Due to its variable width per character, UTF-8 strings are hard to use to get to a particular character index in the binary encoding - you have to scan through. The same is true for UTF-16 unless you know that there are no non-BMP characters.
It's probably easiest to look at the wikipedia articles for UTF-8, UTF-16 and UTF-32
Endianness determines (for UTF-16 and UTF-32) whether the most significant byte comes first and the least significant byte comes last, or vice versa. For example, if you want to represent U+1234 in UTF-16, that can either be { 0x12, 0x34 } or { 0x34, 0x12 }. A byte order mark indicates which endianness you're dealing with. UTF-8 doesn't have different endiannesses, but seeing a UTF-8 BOM at the start of a file is a good indicator that it is UTF-8.
Some good questions here and already a couple good answers. I might be able to add something useful.
As said before, all three cover the full set of possible codepoints, U+0000 to U+10FFFF.
Depends on the text, but here are some details that might be of interest. UTF-8 uses 1 to 4 bytes per char; UTF-16 uses 2 or 4; UTF-32 always uses 4. A useful thing to note is this: if you use UTF-8, then English text will be encoded with the vast majority of characters in one byte each, but Chinese needs 3 bytes each. Using UTF-16, English and Chinese will both require 2. So basically UTF-8 is a win for English; UTF-16 is a win for Chinese.
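The English versus Chinese comparison is easy to measure. In this Python sketch the two sample strings are assumptions chosen for illustration, and the `-le` codec variants are used so that no BOM is counted in the sizes.

```python
def sizes(text: str) -> dict:
    """Encoded size in bytes for each of the three UTF forms (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

english = sizes("Hello")     # 5 ASCII characters
chinese = sizes("你好世界")   # 4 BMP Chinese characters
```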
The main difference is mentioned in the answer to #2 above, or as Jon Skeet says, see the Wikipedia articles.
Endianness: For UTF-16 and UTF-32 this refers to the order in which the bytes appear; for example in UTF-16, the character U+1234 can be encoded either as 12 34 (big endian), or 34 12 (little endian). The BOM, or byte order mark, is interesting. Let's say you have a file encoded in UTF-16, but you don't know whether it is big or little endian, and you notice the first two bytes of the file are FE FF. If this were big-endian the character would be U+FEFF; if little endian, it would signify U+FFFE. But here's the thing: in Unicode the codepoint FFFE is permanently reserved as a noncharacter: there is no character there! Therefore we can tell the encoding must be big-endian. The FEFF character is harmless here; it is the ZERO WIDTH NO-BREAK SPACE (invisible, basically). Similarly, if the file began with FF FE we know it is little endian.
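Python's generic "utf-16" codec follows this reasoning when decoding: it reads the leading mark, picks the byte order, and drops the mark from the result. A short sketch using U+1234 as above:

```python
char = "\u1234"

big_endian = char.encode("utf-16-be")     # 12 34
little_endian = char.encode("utf-16-le")  # 34 12

# Prefix each with the matching BOM and let the codec work out the order.
from_big = (b"\xFE\xFF" + big_endian).decode("utf-16")
from_little = (b"\xFF\xFE" + little_endian).decode("utf-16")
```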
Not sure if I added anything to the other answers, but I have found the English vs. Chinese concrete analysis useful in explaining this to others in the past.
One way of looking at it is as size over complexity. Generally they increase in the number of bytes they need to encode text, but decrease in the complexity of decoding the scheme they use to represent characters. Therefore, UTF-8 is usually small but can be complex to decode, whereas UTF-32 takes up more bytes but is easy to decode (but is rarely used, UTF-16 being more common).
With this in mind UTF-8 is often chosen for network transmission, as it has smaller size. Whereas UTF-16 is chosen where easier decoding is more important than storage size.
BOMs are intended as information at the beginning of files which describes which encoding has been used. This information is often missing though.
Joel Spolsky wrote a nice introductory article about Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

What issues would come from treating UTF-16 as a fixed 16-bit encoding?

I was reading a few questions on SO about Unicode and there were some comments I didn't fully understand, like this one:
Dean Harding: UTF-8 is a variable-length encoding, which is more complex to process than a fixed-length encoding. Also, see my comments on Gumbo's answer: basically, combining characters exist in all encodings (UTF-8, UTF-16 & UTF-32) and they require special handling. You can use the same special handling that you use for combining characters to also handle surrogate pairs in UTF-16, so for the most part you can ignore surrogates and treat UTF-16 just like a fixed encoding.
I'm a little confused by the last part ("for the most part"). If UTF-16 is treated as a fixed 16-bit encoding, what issues could this cause? What are the chances that there are characters outside of the BMP? If there are, what issues could this cause if you'd assumed two-byte characters?
I read the Wikipedia info on Surrogates but it didn't really make things any clearer to me!
Edit: I guess what I really mean is "Why would anyone suggest treating UTF-16 as fixed encoding when it seems bogus?"
Edit2:
I found another comment in "Is there any reason to prefer UTF-16 over UTF-8?" which I think explains this a little better:
Andrew Russell: For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16 characters are either a Basic Multilingual Plane character (2 bytes) or a Surrogate Pair (4 bytes). UTF-8 characters can be anywhere between 1 and 4 bytes
This suggests the point being made was that UTF-16 would not have any three-byte characters, so by assuming 16bits, you wouldn't "totally screw up" by ending up one-byte off. But I'm still not convinced this is any different to assuming UTF-8 is single-byte characters!
UTF-16 includes all "base plane" characters. The BMP covers most of the current writing systems, and includes many older characters that one can practically encounter. Take a look at them and decide whether you really are going to encounter any characters from the extended planes: cuneiform, alchemical symbols, etc. Few people will really miss them.
If you still encounter characters that require extended planes, these are encoded by two code units (surrogates), and you'll see two empty squares or question marks instead of such a character. UTF-16 is self-synchronizing, so a part of a surrogate pair never looks like a legitimate character. This allows things like string searches to work even if surrogates are present and you don't handle them.
Thus issues arising from treating UTF-16 as effectively UCS-2 are minimal, aside from the fact that you don't handle the extended characters.
EDIT: Unicode uses 'combining marks' that render at the space of previous character, like accents, tilde, circumflex, etc. Sometimes a combination of a diacritic mark with a letter can be represented as a distinct code point, e.g. á can be represented as a single \u00e1 instead of a plain 'a' + accent which are \u0061\u0301. Still you can't represent unusual combinations like z̃ as one code point. This makes search and splitting algorithms a bit more complex. If you somehow make your string data uniform (e.g. only using plain letters and combining marks), search and splitting become simple again, but anyway you lose the 'one position is one character' property. A symmetrical problem happens if you're seriously into typesetting and want to explicitly store ligatures like fi or ffl where one code point corresponds to 2 or 3 characters. This is not a UTF issue, it's an issue of Unicode in general, AFAICT.
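Making string data uniform in this sense is what Unicode normalization does. A brief Python sketch with the standard `unicodedata` module, using the same characters as the paragraph above; the ligature is written as an escape (U+FB01) so it survives copy and paste.

```python
import unicodedata

decomposed = "a\u0301"  # plain "a" followed by COMBINING ACUTE ACCENT

# NFC composes to the single code point where one exists...
composed = unicodedata.normalize("NFC", decomposed)

# ...and NFD goes the other way.
back_again = unicodedata.normalize("NFD", composed)

# z with tilde has no precomposed form, so it stays two code points.
z_tilde = unicodedata.normalize("NFC", "z\u0303")

# Compatibility normalization expands the fi ligature into two letters.
ligature_expanded = unicodedata.normalize("NFKD", "\uFB01")
```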
It is important to understand that even UTF-32 is fixed-length when it comes to code points, not characters. There are many characters that are composed from multiple code points, and therefore you can't really have a Unicode encoding where one number (code unit) corresponds to one character (as perceived by users).
To answer your question - the most obvious issue with treating UTF-16 as fixed-length encoding form would be to break a string in a middle of a surrogate pair so you get two invalid code points. It all really depends what you are doing with the text.
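Python 3 strings are indexed by code point, so the following sketch simulates fixed-width handling by slicing the raw UTF-16 bytes two at a time. The sample text is an arbitrary choice with one non-BMP character in the middle.

```python
text = "a\U0001F600b"             # the middle character needs a surrogate pair
raw = text.encode("utf-16-le")     # 4 code units = 8 bytes

# "Take the first two characters", assuming one unit per character.
first_two_units = raw[:4]

try:
    first_two_units.decode("utf-16-le")
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False         # the cut left a lone high surrogate
```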
I guess what I really mean is "Why would anyone suggest treating UTF-16 as fixed encoding when it seems bogus?"
Two words: Backwards compatibility.
Unicode was originally intended to use a fixed-width 16-bit encoding (UCS-2), which is why early adopters of Unicode (e.g., Sun with Java and Microsoft with Windows NT) used a 16-bit character type. When it turned out that 65,536 characters weren't enough for everyone, UTF-16 was developed in order to allow these 16-bit character systems to represent the 16 new "planes".
This meant that characters were no longer fixed-width, so people created the rationalization that "that's OK because UTF-16 is almost fixed width."
But I'm still not convinced this is
any different to assuming UTF-8 is
single-byte characters!
Strictly speaking, it's not any different. You'll get incorrect results for things like "\uD801\uDC00".lower().
However, assuming UTF-16 is fixed width is less likely to break than assuming UTF-8 is fixed-width. Non-ASCII characters are very common in languages other than English, but non-BMP characters are very rare.
You can use the same special handling
that you use for combining characters
to also handle surrogate pairs in
UTF-16
I don't know what he's talking about. Combining sequences, whose constituent characters have an individual identity, are nothing at all like surrogate characters, which are only meaningful in pairs.
In particular, the characters within a combining sequence can be converted to a different encoding form one character at a time.
>>> 'a'.encode('UTF-8') + '\u0301'.encode('UTF-8')
b'a\xcc\x81'
But not surrogates:
>>> '\uD801'.encode('UTF-8') + '\uDC00'.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
UTF-16 is a variable-length encoding. The older UCS-2 is not. If you treat a variable-length encoding like fixed (constant length) you risk introducing error whenever you use "number of 16-bit numbers" to mean "number of characters", since the number of characters might actually be less than the number of 16-bit quantities.
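To illustrate the mismatch between "number of 16-bit numbers" and "number of characters", here is a small Python sketch. The function utf16_code_units is a hypothetical helper written for this example; Python's own len() on a str counts code points:

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to store s in UTF-16."""
    # utf-16-le writes no BOM, so every 2 bytes is exactly one code unit.
    return len(s.encode("utf-16-le")) // 2


bmp_only = "h\u00e9llo"          # every character is inside the BMP
with_astral = "a\U00010400"      # U+10400 needs a surrogate pair

for sample in (bmp_only, with_astral):
    print(len(sample), utf16_code_units(sample))
```

For BMP-only text the two counts agree, which is why the fixed-width assumption appears to work until the first non-BMP character shows up.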
The Unicode standard has changed several times along the way. For example, UCS-2 is not a valid encoding anymore. It has been deprecated for a while now.
As mentioned by user 9000, even in UTF-32, you have sequences of code points that are interdependent. The á (a plain 'a' followed by a combining acute accent) is a good example, although this sequence can be canonicalized to the single code point \u00E1. So you can make it simple.
Unicode, even when using the UTF-32 encoding, supports up to 30 code points, one after the other, to represent the most complex characters. (The existing characters do not use that many, I think the longest in existence is currently 17 if I'm correct.)
For that reason, Unicode developed Normalization Forms. It actually considers five different forms:
Unnormalized -- a sequence you create manually, for example; text editors are expected to save properly normalized (NFC) code sequences
NFD -- Normalization Form Canonical Decomposition
NFKD -- Normalization Form Compatibility Decomposition
NFC -- Normalization Form Canonical Composition
NFKC -- Normalization Form Compatibility Composition
Although in most situations it does not matter much because long compositions are rare, even in languages that use them.
And in most cases, your code already deals with canonical compositions. However, if you create strings manually in your code, you are not unlikely to create an unnormalized string (assuming you use such long forms).
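The normalization forms listed above can be tried out with Python's standard unicodedata module. This is only a sketch on a few hand-picked strings, not a survey of what the forms do in general:

```python
import unicodedata

decomposed = "a\u0301"      # 'a' + COMBINING ACUTE ACCENT
precomposed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE
ligature = "\ufb01"         # LATIN SMALL LIGATURE FI
z_tilde = "z\u0303"         # 'z' + COMBINING TILDE, no precomposed form exists

# Canonical composition and decomposition are inverses for this pair.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Compatibility forms also replace the ligature by plain letters;
# canonical forms leave it alone.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfc_ligature = unicodedata.normalize("NFC", ligature)

# With no precomposed code point available, NFC keeps two code points.
nfc_z = unicodedata.normalize("NFC", z_tilde)

print(len(nfc), len(nfd), nfkc_ligature, len(nfc_ligature), len(nfc_z))
```

Normalizing both sides to the same form before comparing or searching is the usual way to make "a + accent" and the precomposed letter compare equal.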
Properly implemented servers on the Internet are expected to refuse strings that are not canonical compositions as per Unicode. Overlong forms are also forbidden over connections. For example, the UTF-8 bit layout technically allows ASCII characters to be encoded using 1, 2, 3, or 4 bytes (and the old encoding allowed up to 6 bytes!) but only the shortest encoding is permitted.
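As a quick check of the overlong rule, the two bytes 0xC0 0xA0 follow the two-byte UTF-8 bit layout and would yield U+0020 if decoded mechanically, yet Python's built-in decoder rejects them. The function naive_decode_two_byte below is a deliberately sloppy decoder written only for this illustration:

```python
def naive_decode_two_byte(seq: bytes) -> int:
    """Apply the 2-byte UTF-8 bit layout (110xxxxx 10xxxxxx)
    without checking that the result needed two bytes."""
    lead, cont = seq
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def strict_utf8_ok(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_space = bytes([0xC0, 0xA0])   # overlong form of U+0020
proper_space = bytes([0x20])           # the only valid form

naive_result = naive_decode_two_byte(overlong_space)

print(hex(naive_result), strict_utf8_ok(overlong_space), strict_utf8_ok(proper_space))
```

A decoder that accepted the overlong form would let the same character slip past byte-level filters that only look for 0x20, which is one reason the shortest form is required.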
Any comment on the Internet that contradicts the Unicode Normalization Form document is simply incorrect.

Dummy's guide to Unicode

Could anyone give me a concise definitions of
Unicode
UTF7
UTF8
UTF16
UTF32
Codepages
How they differ from Ascii/Ansi/Windows 1252
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you want a really brief introduction:
Unicode in 5 Minutes
Or if you are after one-liners:
Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different than the Unicode assignments
ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
ANSI: a standards body
Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
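To make the one-liners about UTF8, UTF16 and UTF32 concrete, this Python sketch encodes three sample characters and counts the bytes. The samples are picked arbitrarily for illustration, and the "-le" codec variants are used so that no byte order mark is added to the counts:

```python
samples = {
    "A": "A",                    # U+0041, ASCII
    "euro": "\u20ac",            # U+20AC, inside the BMP
    "deseret": "\U00010400",     # U+10400, outside the BMP
}

# Map each sample name to its byte length in UTF-8, UTF-16 and UTF-32.
sizes = {
    name: tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))
    for name, ch in samples.items()
}

for name, (u8, u16, u32) in sizes.items():
    print(f"{name}: utf-8={u8} utf-16={u16} utf-32={u32} bytes")
```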
Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode
Þ (LATIN CAPITAL LETTER THORN)
ﬁ (LATIN SMALL LIGATURE FI)
ή (GREEK SMALL LETTER ETA WITH TONOS)
or 13 other characters, depending on the encoding and character set used.
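The three readings of 0xDE listed above can be reproduced in Python. The codec names latin-1, mac_roman and iso8859_7 are my guess at which encodings the list refers to; the answer itself does not name them:

```python
raw = b"\xde"

# Decode the same single byte under three different 8-bit encodings.
decoded = {
    codec: raw.decode(codec)
    for codec in ("latin-1", "mac_roman", "iso8859_7")
}

for codec, ch in decoded.items():
    print(f"{codec}: U+{ord(ch):04X}")
```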
As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...
Yeah, I've got some insight. It might be wrong, but it has helped me to understand it.
Let's just take some text. It's stored in the computer's RAM as a series of bytes; the codepage is simply the mapping table between the bytes and the characters you and I read. So something like Notepad comes along with its codepage and translates the bytes to your screen, and you see a bunch of garbage, upside-down question marks, etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter at detecting the correct codepage than others, and some streams of bytes start with a BOM (Byte Order Mark), which can signal which Unicode encoding is in use.
UTF7, 8, 16 etc. are all just different codepages using different formats.
The same file stored as bytes using different codepages will be of a different filesize because the bytes are stored differently.
They also don't really differ from windows 1252 as that's just another codepage.
For a better smarter answer try one of the links.
Here, read this wonderful explanation from Joel himself.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Others have already pointed out good enough references to begin with. I'm not listing a true Dummy's guide, but rather some pointers from the Unicode Consortium page. You'll find some more nitty-gritty reasons for the usage of different encodings at the Unicode Consortium pages.
The Unicode FAQ is a good enough place to answer some (not all) of your queries.
A more succinct answer on why Unicode exists, is present in the Newcomer's section of the Unicode website itself:
Unicode provides a unique number for
every character, no matter what the
platform, no matter what the program,
no matter what the language.
As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:
UTF-8 is popular for HTML and similar
protocols. UTF-8 is a way of
transforming all Unicode characters
into a variable length encoding of
bytes. It has the advantages that the
Unicode characters corresponding to
the familiar ASCII set have the same
byte values as ASCII, and that Unicode
characters transformed into UTF-8 can
be used with much existing software
without extensive software rewrites.
UTF-16 is popular in many environments
that need to balance efficient access
to characters with economical use of
storage. It is reasonably compact and
all the heavily used characters fit
into a single 16-bit code unit, while
all other characters are accessible
via pairs of 16-bit code units.
UTF-32 is popular where memory space
is no concern, but fixed width, single
code unit access to characters is
desired. Each Unicode character is
encoded in a single 32-bit code unit
when using UTF-32.
All three encoding forms need at most
4 bytes (or 32-bits) of data for each
character.
A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you are concerned about utilizing characters with uniform storage.
By the way, UTF-7 is not a Unicode standard and was designed primarily for use in mail applications.
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
First of all, there aren't "variations of unicode". Unicode is a standard, the standard, to assign code points (integers) to characters. UTF8 is the most popular way to represent those integers as bytes!
Why should you care as a programmer?
It's fun to understand this!
If you don't have basic understanding of encodings, you can easily produce buggy code.
Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string Hello. Your program works! One day after shipping your code, your German customer calls: "We have a problem, äöü are not displayed correctly!". You start debugging the code, feeling pretty lost without a basic understanding of encodings. However, with an understanding of encodings you know that the error probably was this: when running myByteArray.toString(), your program assumed the bytes were encoded with the default system encoding. But maybe they weren't! Maybe they were UTF8 and your system is LATIN-SOMETHING, and so you should have run myByteArray.toString("UTF8") instead!
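The same bug can be sketched in Python, where bytes.decode plays the role of the pseudocode's toString. Here latin-1 stands in for "LATIN-SOMETHING", and the variable names are invented for the example:

```python
# Bytes as they arrive from "somewhere": UTF-8 encoded text.
ascii_bytes = "Hello".encode("utf-8")
german_bytes = "\u00e4\u00f6\u00fc".encode("utf-8")   # the letters äöü

# Decoding with the wrong encoding goes unnoticed for pure ASCII...
hello_wrong = ascii_bytes.decode("latin-1")

# ...but turns every non-ASCII letter into two garbage characters.
german_wrong = german_bytes.decode("latin-1")

# Naming the actual encoding fixes it.
german_right = german_bytes.decode("utf-8")

print(hello_wrong, german_wrong, german_right)
```

The ASCII case is what makes this bug easy to ship: tests written with English-only sample data pass under either encoding.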
Resources:
I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back and the essence of it didn't stick to my brain since there are so many unimportant details.
As already mentioned, http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go to grasp the essence of unicode.
If you want to actually understand variable length encodings like UTF8 I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.