Does UTF-8 have fixed bytes order - unicode

I heard that I don't have to place BOM at the start of UTF-8 file / stream.
Does it have a fixed bytes order then?
What about UTF-16 and UTF-32 in this case?

UTF-8 does not need a byte order since it is defined in terms of a stream of bytes. The order is given directly by the address of the individual byte. A varying number of bytes makes up a single codepoint.
UTF-32 on the other hand is defined in terms of a stream of 32bit units (i.e. 4 bytes each, each mapping directly to a Unicode codepoint) which can be encoded in different ways into a stream of bytes.
That is what the BOM indicates for you, basically whether the bytes are ordered with their significance (i.e. the earliest byte in the stream is the least significant, little endian) or against it (i.e. the earliest byte is the most significant, big endian).
UTF-16 is similar but a bit funkier. It's defined as a stream of 16bit units, so you have to worry about the byte order. Additionally, since a single 16bit unit is not (anymore) enough to encode all of Unicode, it is also a multi-"unit" encoding, thus combining that shortcomings of UTF-8 and UTF-32 :)

Related

16-bit encoding that has all bits mapped to some value

UTF-32 has its last bits zeroed.
As I understand it UTF-16 doesn't use all its bits either.
Is there a 16-bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7-bit?
UTF-32 has its last bits zeroed
This might be not correct, depending on how you count. Typically we count from left, so the high (i.e. first) bits of UTF-32 will be zero
As I understand it UTF-16 doesn't use all its bits either
It's not correct either. UTF-16 uses all of its bits. It's just that the range [0xD800—0xDFFF] is reserved for UTF-16 surrogate pairs so those values will never be assigned any character and will never appear in UTF-32. If you need to encode characters outside the BMP with UTF-16 then those values will be used
In fact Unicode was limited to U+10FFFF just because of UTF-16, even though UTF-8 and UTF-32 themselves are able to represent up to U+7FFFFFFF and U+FFFFFFFF respectively. The use of surrogate pair makes it impossible to encode values larger than 0x10FFFF in UTF-16
See Why Unicode is restricted to 0x10FFFF?
Is there a 16 bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7 bit?
First there's no such thing as "a subset of UTF", since UTF isn't a character set but a way to encode Unicode code points
Prior to the existence of UTF-16 Unicode was a fixed 16-bit character set encoded with UCS-2. So UCS-2 might be the closest you'll get, which encodes only the characters in the BMP. Other fixed 16-bit non-Unicode charsets also has an encoding that maps all of the bit combinations to some characters
However why would you want that? UCS-2 has been deprecated long ago. Some old tools and less experienced programmers still imply that Unicode is always 16-bit long like that which is correct and will break modern text processing
Also note that not all the values below 0xFFFF are assigned, so no encoding can map every 16-bit value to a Unicode code point
Further reading
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
What is a "surrogate pair" in Java?

What is this variable-length integer encoding?

I am documenting an old file format and have stumped myself with the following issue.
It seems to be that integers are variable-length encoded, with numbers <= 0x7F encoded in a single byte, but >= 0x80 are encoded in two bytes. An example set of integers and their encoded counterparts:
0x390 is encoded as 0x9007
0x150 is encoded as 0xD002
0x82 is encoded as 0x8201
0x89 is encoded as 0x8901
I have yet to come across any numbers that are larger than 0xFFFF, so I can't be sure if/how they are encoded. For the life of me, I can't work out the pattern here. Any ideas?
At a glance it looks like the numbers are split into 7-bit chunks, each of which is encoded as the 7 least significant bits of an output byte, while the most significant bit signifies whether there are more bytes following this one (i.e. the last byte of an encoded integer has 0 as its MSB).
The least significant bits of the input come first, so I guess you could call this "little endian".
Edit: see https://en.wikipedia.org/wiki/Variable-length_quantity (this is used in MIDI and Google protocol buffers)

Unicode code point limit

As explained here, All unicode encodings end at largest code point 10FFFF But I've heard differently that
they can go upto 6 bytes, is it true?
UTF-8 underwent some changes during its life, and there are many specifications (most of which are outdated now) which standardized UTF-8. Most of the changes were introduced to help compatibility with UTF-16 and to allow for the ever-growing amount of codepoints.
To make the long story short, UTF-8 was originally specified to allow codepoints with up to 31 bits (or 6 bytes). But with RFC3629, this was reduced to 4 bytes max. to be more compatible to UTF-16.
Wikipedia has some more information. The specification of the Universal Character Set is closely linked to the history of Unicode and its transformation format (UTF).
See the answers to Do UTF-8,UTF-16, and UTF-32 Unicode encodings differ in the number of characters they can store?
UTF-8 and UTF-32 are theoretically capable of representing characters above U+10FFFF, but were artificially restricted to match UTF-16's capacity.
The largest unicode codepoint and the encodings for unicode characters used, are two things. According to the standard, the highest codepoint really is 0x10ffff but herefore you'll need just 21 bits which fit easily into 4 bytes, even with 11 bits wasted!
I guess with your question about 6 bytes you mean a 6-byte utf-8 sequence, right? As others have answered already, using the utf-8 mechanism you could really deal with 6-byte sequences, you can even deal with 7-byte sequences and even with an 8-byte sequence. The 7-byte sequence gives you a range of just what the following bytes have to offer, 6 x 6 bits = 36 bits and a 8-byte sequence gives you 7 x 6 bits = 42 bits. You could deal with it but it is not allowed because unneeded, the highest codepoint is 0x10ffff.
It is also forbidden to use longer sequences than needed as Hibou57 has mentioned. With utf-8 one must always use the shortest sequence possible or the sequence will be treated as invalid! An ASCII character must be in a 7-bit singlebyte of course. The second thing is that the utf-8 4-byte sequence gives you 3 bits of payload in the startbyte and 18 bits of payload in the following bytes which are 21 bits and that matches to the calculation of surrogates when using the utf-16 encoding. The bias 0x10000 is subtracted from the codepoint and the remaining 20 bits go to the high- as well lo-surrogate payload area, each of 10 bits. The third and last thing is, that within utf-8 it is not allowed to encode hi- or -lo-surrogate values. Surrogates are not characters but containers for them, surrogates can only appear in utf-16, not in utf-8 or utf-32 encoded files.
Indeed, for some view of the UTF‑8 encoding, UTF‑8 may technically permit to encode code‑points beyond the forever‑fixed valid range upper‑limit; so one may encode a code‑point beyond that range, but it will not be a valid code‑point anywhere. On the other hand, you may encode a character with unneeded zeroed high‑order bits, ex. encoding an ASCII code‑point with multiple bits, like in 2#1100_0001#, 2#1000_0001# (using Ada's notation), which would for the ASCII letter A UTF‑8 encoded with two bytes. But then, it may be rejected by some safety/security filters, at this use to be used for hacking and piracy. RFC 3629 has some explanation about it. One should just stick to encode valid code‑points (as defined by Unicode), the safe way (no extraneous bytes).

What is the most efficient binary to text encoding?

The closest contenders that I could find so far are yEnc (2%) and ASCII85 (25% overhead). There seem to be some issues around yEnc mainly around the fact that it uses an 8-bit character set. Which leads to another thought: is there a binary to text encoding based on the UTF-8 character set?
This really depends on the nature of the binary data, and the constraints that "text" places on your output.
First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of 1/0 or individual bytes is more or less random.
Now: why do you need text? Typically, it's because the communication channel does not pass through all characters equally. e.g. you may require pure ASCII text, whose printable characters range from 0x20-0x7E. You have 95 characters to play with. Each character can theoretically encode log2(95) ~= 6.57 bits per character. It's easy to define a transform that comes pretty close.
But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.
To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.
UTF-8 is not a good transport for arbitrarily encoded binary data. It is able to transport values 0x01-0x7F with only 14% overhead. I'm not sure if 0x00 is legal; likely not. But anything above 0x80 expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, or 126 unique characters. If you don't need delimeters then you can transmit 6.98 bits per character.
A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)
Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:
A+B=95 (total number of symbols)
2A+B=128 (total number of 7-bit prefixes that can be made. You can start 2 prefixes with a 6-bit symbol, or one with a 7-bit symbol.)
Solving the system, you get: A=33, B=62. You now build a table of symbols:
Raw Encoded
000000 0000000
000001 0000001
...
100000 0100000
1000010 0100001
1000011 0100010
...
1111110 1011101
1111111 1011110
To encode, first shift off 6 bits of input. If those six bits are greater or equal to 100001 then shift another bit. Then look up the corresponding 7-bit output code, translate to fit in the output space and send. You will be shifting 6 or 7 bits of input each iteration.
To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.
The short answer would be: No, there still isn't.
I ran into the problem with encoding as much information into JSON string, meaning UTF-8 without control characters, backslash and quotes.
I went out and researched how many bit you can squeeze into valid UTF-8 bytes. I disagree with answers stating that UTF-8 brings too much overhead. It's not true.
If you take into account only one-byte sequences, it's as powerful as standard ASCII. Meaning 7 bits per byte. But if you cut out all special characters you'll be left with something like Ascii85.
But there are fewer control characters in higher planes. So if you use 6-byte chunks you'll be able to encode 5 bytes per chunk. In the output you'll get any combination of UTF-8 characters of any length (for 1 to 6 bytes).
This will give you a better result than Ascii85: 5/6 instead of 4/5, 83% efficiency instead of 80%. In theory it'll get even better with higher chunk length: about 84% at 19-byte chunks.
In my opinion the encoding process becomes too complicated while it provides very little profit. So Ascii85 or some modified version of it (I'm looking at Z85 now) would be better.
I searched for most efficient binary to text encoding last year. I realized for myself that compactness is not the only criteria. The most important is where you are able to use encoded string. For example, yEnc has 2% overhead, but it is 8-bit encoding, so its usage is very very limited.
My choice is Z85. It has acceptable 25% overhead, and encoded string can be used almost everywhere: XML, JSON, source code etc. See Z85 specification for details.
Finally, I've written Z85 library in C/C++ and use it in production.
According to Wikipedia
basE91 produces the shortest plain ASCII output for compressed 8-bit binary input.
Currently base91 is the best encoding if you're limited to ASCII characters only and don't want to use non-printable characters. It also has the advantage of lightning fast encoding/decoding speed because a lookup table can be used, unlike base85 which has to be decoded using slow divisions
Going above that base122 will help increasing efficiency a little bit, but it's not 8-bit clean. However because it's based on UTF-8 encoding, it should be fine to use for many purposes. And 8-bit clean is just meaningless nowadays
Note that base122 is in fact base-128 because the 6 invalid values (128 – 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits will be encoded in 1 byte, and in reality can be optimized to be more efficient than base-128
Base-122 Encoding
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
http://blog.kevinalbs.com/base122
See also How viable is base128 encoding for scenarios like JavaScript strings?
Next to the ones listed on Wikipedia, there is Bommanews:
B-News (or bommanews) was developed to lift the weight of the overhead inherent to UUEncode and Base64 encoding: it uses a new encoding method to stuff binary data in text messages. This method eats more CPU resources, but it manages to lower the loss from approximately 40% for UUEncode to 3.5% (the decimal point between those digits is not dirt on your monitor), while still avoiding the use of ANSI control codes in the message body.
It's comparable to yEnc: source
yEnc is less CPU-intensive than B-News and reaches about the same low level of overhead, but it doesn't avoid the use of all control codes, it just leaves out those that were (experimentally) observed to have undesired effects on some servers, which means that it's somewhat less RFC compliant than B-News.
http://b-news.sourceforge.net/
http://www.iguana.be/~stef/
http://bnews-plus.sourceforge.net/
If you are looking for an efficient encoding for large alphabets, you might want to try escapeless. Both escapeless252 and yEnc have 1.6% overhead, but with the first it's fixed and known in advance while with the latter it actually ranges from 0 to 100% depending on the distribution of bytes.

UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32?
I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?
UTF-8 has an advantage in the case where ASCII characters represent the majority of characters in a block of text, because UTF-8 encodes these into 8 bits (like ASCII). It is also advantageous in that a UTF-8 file containing only ASCII characters has the same encoding as an ASCII file.
UTF-16 is better where ASCII is not predominant, since it uses 2 bytes per character, primarily. UTF-8 will start to use 3 or more bytes for the higher order characters where UTF-16 remains at just 2 bytes for most characters.
UTF-32 will cover all possible characters in 4 bytes. This makes it pretty bloated. I can't think of any advantage to using it.
In short:
UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.
In long: see Wikipedia: UTF-8, UTF-16, and UTF-32.
UTF-8 is variable 1 to 4 bytes.
UTF-16 is variable 2 or 4 bytes.
UTF-32 is fixed 4 bytes.
Unicode defines a single huge character set, assigning one unique integer value to every graphical symbol (that is a major simplification, and isn't actually true, but it's close enough for the purposes of this question). UTF-8/16/32 are simply different ways to encode this.
In brief, UTF-32 uses 32-bit values for each character. That allows them to use a fixed-width code for every character.
UTF-16 uses 16-bit by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values.
And UTF-8 uses 8-bit values by default, which means that the 127 first values are fixed-width single-byte characters (the most significant bit is used to signify that this is the start of a multi-byte sequence, leaving 7 bits for the actual character value). All other characters are encoded as sequences of up to 4 bytes (if memory serves).
And that leads us to the advantages. Any ASCII-character is directly compatible with UTF-8, so for upgrading legacy apps, UTF-8 is a common and obvious choice. In almost all cases, it will also use the least memory. On the other hand, you can't make any guarantees about the width of a character. It may be 1, 2, 3 or 4 characters wide, which makes string manipulation difficult.
UTF-32 is opposite, it uses the most memory (each character is a fixed 4 bytes wide), but on the other hand, you know that every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from the length in bytes of the string. You can't do that with UTF-8.
UTF-16 is a compromise. It lets most characters fit into a fixed-width 16-bit value. So as long as you don't have Chinese symbols, musical notes or some others, you can assume that each character is 16 bits wide. It uses less memory than UTF-32. But it is in some ways "the worst of both worlds". It almost always uses more memory than UTF-8, and it still doesn't avoid the problem that plagues UTF-8 (variable-length characters).
Finally, it's often helpful to just go with what the platform supports. Windows uses UTF-16 internally, so on Windows, that is the obvious choice.
Linux varies a bit, but they generally use UTF-8 for everything that is Unicode-compliant.
So short answer: All three encodings can encode the same character set, but they represent each character as different byte sequences.
Unicode is a standard and about UTF-x you can think as a technical implementation for some practical purposes:
UTF-8 - "size optimized": best suited for Latin character based data (or ASCII), it takes only 1 byte per character but the size grows accordingly symbol variety (and in worst case could grow up to 6 bytes per character)
UTF-16 - "balance": it takes minimum 2 bytes per character which is enough for existing set of the mainstream languages with having fixed size on it to ease character handling (but size is still variable and can grow up to 4 bytes per character)
UTF-32 - "performance": allows using of simple algorithms as result of fixed size characters (4 bytes) but with memory disadvantage
I tried to give a simple explanation in my blogpost.
UTF-32
requires 32 bits (4 bytes) to encode any character. For example, in order to represent the "A" character code-point using this scheme, you'll need to write 65 in 32-bit binary number:
00000000 00000000 00000000 01000001 (Big Endian)
If you take a closer look, you'll note that the most-right seven bits are actually the same bits when using the ASCII scheme. But since UTF-32 is fixed width scheme, we must attach three additional bytes. Meaning that if we have two files that only contain the "A" character, one is ASCII-encoded and the other is UTF-32 encoded, their size will be 1 byte and 4 bytes correspondingly.
UTF-16
Many people think that as UTF-32 uses fixed width 32 bit to represent a code-point, UTF-16 is fixed width 16 bits. WRONG!
In UTF-16 the code point maybe represented either in 16 bits, OR 32 bits. So this scheme is variable length encoding system. What is the advantage over the UTF-32? At least for ASCII, the size of files won't be 4 times the original (but still twice), so we're still not ASCII backward compatible.
Since 7-bits are enough to represent the "A" character, we can now use 2 bytes instead of 4 like the UTF-32. It'll look like:
00000000 01000001
UTF-8
You guessed right.. In UTF-8 the code point maybe represented using either 32, 16, 24 or 8 bits, and as the UTF-16 system, this one is also variable length encoding system.
Finally we can represent "A" in the same way we represent it using ASCII encoding system:
01001101
A small example where UTF-16 is actually better than UTF-8:
Consider the Chinese letter "語" - its UTF-8 encoding is:
11101000 10101010 10011110
While its UTF-16 encoding is shorter:
10001010 10011110
In order to understand the representation and how it's interpreted, visit the original post.
UTF-8
has no concept of byte-order
uses between 1 and 4 bytes per character
ASCII is a compatible subset of encoding
completely self-synchronizing e.g. a dropped byte from anywhere in a stream will corrupt at most a single character
pretty much all European languages are encoded in two bytes or less per character
UTF-16
must be parsed with known byte-order or reading a byte-order-mark (BOM)
uses either 2 or 4 bytes per character
UTF-32
every character is 4 bytes
must be parsed with known byte-order or reading a byte-order-mark (BOM)
UTF-8 is going to be the most space efficient unless a majority of the characters are from the CJK (Chinese, Japanese, and Korean) character space.
UTF-32 is best for random access by character offset into a byte-array.
I made some tests to compare database performance between UTF-8 and UTF-16 in MySQL.
Update Speeds
UTF-8
UTF-16
Insert Speeds
Delete Speeds
In UTF-32 all of characters are coded with 32 bits. The advantage is that you can easily calculate the length of the string. The disadvantage is that for each ASCII characters you waste an extra three bytes.
In UTF-8 characters have variable length, ASCII characters are coded in one byte (eight bits), most western special characters are coded either in two bytes or three bytes (for example € is three bytes), and more exotic characters can take up to four bytes. Clear disadvantage is, that a priori you cannot calculate string's length. But it's takes lot less bytes to code Latin (English) alphabet text, compared to UTF-32.
UTF-16 is also variable length. Characters are coded either in two bytes or four bytes. I really don't see the point. It has disadvantage of being variable length, but hasn't got the advantage of saving as much space as UTF-8.
Of those three, clearly UTF-8 is the most widely spread.
I'm surprised this question is 11yrs old and not one of the answers mentioned the #1 advantage of utf-8.
utf-8 generally works even with programs that are not utf-8 aware. That's partly what it was designed for. Other answers mention that the first 128 code points are the same as ASCII. All other code points are generated by 8bit values with the high bit set (values from 128 to 255) so that from the POV of a non-unicode aware program it just sees strings as ASCII with some extra characters.
As an example let's say you wrote a program to add line numbers that effectively does this (and to keep it simple let's assume end of line is just ASCII 13)
// pseudo code
function readLine
if end of file
return null
read bytes (8bit values) into string until you hit 13 or end or file
return string
function main
lineNo = 1
do {
s = readLine
if (s == null) break;
print lineNo++, s
}
Passing a utf-8 file to this program will continue to work. Similarly, splitting on tabs, commas, parsing for ASCII quotes, or other parsing for which only ASCII values are significant all just work with utf-8 because no ASCII value appear in utf-8 except when they are actually meant to be those ASCII values
Some other answers or comments mentions that utf-32 has the advantage that you can treat each codepoint separately. This would suggest for example you could take a string like "ABCDEFGHI" and split it at every 3rd code point to make
ABC
DEF
GHI
This is false. Many code points affect other code points. For example the color selector code points that lets you choose between 👨🏻‍🦳👨🏼‍🦳👨🏽‍🦳👨🏾‍🦳👨🏿‍🦳. If you split at any arbitrary code point you'll break those.
Another example is the bidirectional code points. The following paragraph was not entered backward. It is just preceded by the 0x202E codepoint
‮This line is not typed backward it is only displayed backward
So no, utf-32 will not let you just randomly manipulate unicode strings without a thought to their meanings. It will let you look at each codepoint with no extra code.
FYI though, utf-8 was designed so that looking at any individual byte you can find out the start of the current code point or the next code point.
If you take a arbitrary byte in utf-8 data. If it is < 128 it's the correct code point by itself. If it's >= 128 and < 192 (the top 2 bits are 10) then to find the start of the code point you need to look the preceding byte until you find a byte with a value >= 192 (the top 2 bits are 11). At that byte you've found the start of a codepoint. That byte encodes how many subsequent bytes make the code point.
If you want to find the next code point just scan until the byte < 128 or >= 192 and that's the start of the next code point.
Num Bytes
1st code point
last code point
Byte 1
Byte 2
Byte 3
Byte 4
1
U+0000
U+007F
0xxxxxxx
2
U+0080
U+07FF
110xxxxx
10xxxxxx
3
U+0800
U+FFFF
1110xxxx
10xxxxxx
10xxxxxx
4
U+10000
U+10FFFF
11110xxx
10xxxxxx
10xxxxxx
10xxxxxx
Where xxxxxx are the bits of the code point. Concatenate the xxxx bits from the bytes to get the code point
Depending on your development environment you may not even have the choice what encoding your string data type will use internally.
But for storing and exchanging data I would always use UTF-8, if you have the choice. If you have mostly ASCII data this will give you the smallest amount of data to transfer, while still being able to encode everything. Optimizing for the least I/O is the way to go on modern machines.
As mentioned, the difference is primarily the size of the underlying variables, which in each case get larger to allow more characters to be represented.
However, fonts, encoding and things are wickedly complicated (unnecessarily?), so a big link is needed to fill in more detail:
http://www.cs.tut.fi/~jkorpela/chars.html#ascii
Don't expect to understand it all, but if you don't want to have problems later it's worth learning as much as you can, as early as you can (or just getting someone else to sort it out for you).
Paul.
After reading through the answers, UTF-32 needs some loving.
C#:
Data1 = RandomNumberGenerator.GetBytes(500_000_000);
sw = Stopwatch.StartNew();
int l = Encoding.UTF8.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-8: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.Unicode.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"Unicode: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.UTF32.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-32: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
sw = Stopwatch.StartNew();
l = Encoding.ASCII.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"ASCII: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s} Size - {l:###,###,###}");
UTF-8 -- Elapsed 9.939s - Size 473,752,800
Unicode -- Elapsed 0.853s - Size 250,000,000
UTF-32 -- Elapsed 3.143s - Size 125,030,570
ASCII -- Elapsed 2.362s - Size 500,000,000
UTF-32 -- MIC DROP
In short, the only reason to use UTF-16 or UTF-32 is to support non-English and ancient scripts respectively.
I was wondering why anyone would chose to have non-UTF-8 encoding when it is obviously more efficient for web/programming purposes.
A common misconception - the suffixed number is NOT an indication of its capability. They all support the complete Unicode, just that UTF-8 can handle ASCII with a single byte, so is MORE efficient/less corruptible to the CPU and over the internet.
Some good reading: http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/10/which_utf_do_i_use.html
and http://utf8everywhere.org