Why is the vocab size of byte-level BPE smaller than Unicode's vocab size?

I recently read the GPT-2 paper, and it says:
This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256.
I really don't understand this. Unicode represents roughly 130K characters, so how can the base vocabulary be reduced to 256? Where did the remaining ~129K characters go? What am I missing? Does byte-level BPE allow different characters to share a representation?
I don't understand the logic. Below are my questions:
Why is the vocab size reduced (from 130K to 256)?
What's the logic of the BBPE (Byte-level BPE)?
Detailed question
Thank you for your answer, but I really don't get it. Let's say we have 130K unique characters. What we want (and what BBPE does) is to reduce this basic (unique) vocabulary. Each Unicode character can be converted to 1 to 4 bytes using UTF-8 encoding. The original BBPE paper (Neural Machine Translation with Byte-Level Subwords) says:
Representing text at the level of bytes and using the 256 bytes set as vocabulary is a potential solution to this issue.
A byte can represent 256 values (2^8), and we would only need 17 bits (2^17 = 131,072 values) to represent all the unique Unicode characters. So where do the 256 bytes in the original paper come from? I don't understand the logic or how to derive this result.
Let me arrange my questions again, in more detail:
How does BBPE work?
Why is the vocab size reduced (from 130K to 256 bytes)?
Either way, don't we still need room for 130K entries in the vocab? What's the difference between representing unique characters as Unicode code points and as bytes?
Since I have little knowledge of computer architecture and programming, please let me know if there's something I missed.
Sincerely, thank you.

Unicode code points are integers in the range 0 to 1,114,111 (0x10FFFF), of which roughly 130k are in use at the moment. Every Unicode code point corresponds to a character, like "a" or "λ" or "龙", which is handy to work with in many cases (but there are a lot of complicated details, eg. combining marks).
When you save text data to a file, you use one of the UTFs (UTF-8, UTF-16, UTF-32) to convert code points (integers) to bytes. For UTF-8 (the most popular file encoding), each character is represented by 1, 2, 3, or 4 bytes (there's some inner logic to discriminate single- and multi-byte characters).
So when the base vocabulary is bytes, rare characters will be encoded with multiple BPE segments.
Example
Let's consider a short example sentence like “That’s great 👍”.
With a base vocabulary of all Unicode characters, the BPE model starts off with something like this:
T 54
h 68
a 61
t 74
’ 2019
s 73
20
g 67
r 72
e 65
a 61
t 74
20
👍 1F44D
(The first column is the character, the second its codepoint in hexadecimal notation.)
If you first encode this sentence with UTF-8, then this sequence of bytes is fed to BPE instead:
T 54
h 68
a 61
t 74
� e2
� 80
� 99
s 73
20
g 67
r 72
e 65
a 61
t 74
20
� f0
� 9f
� 91
� 8d
The typographic apostrophe "’" and the thumbs-up emoji are represented by multiple bytes.
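If you want to reproduce this byte table yourself, a short Python snippet (my addition, not part of the original answer) does it by encoding the sentence with UTF-8 and printing each byte in hex:

s = "That\u2019s great \U0001F44D"   # the example sentence “That’s great 👍”
for b in s.encode("utf-8"):
    print(f"{b:02x}")                # prints 54 68 61 74 e2 80 99 73 20 ... f0 9f 91 8d, one per line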
With either input, the BPE segmentation (after training) may end with something like this:
Th|at|’s|great|👍
(This is a hypothetical segmentation, but it's possible that capitalised “That” is too rare to be represented as a single segment.)
The number of BPE operations is different though: to arrive at the segment ’s, only one merge step is required for code-point input, but three steps for byte input.
With byte input, the BPE segmentation is likely to end up with sub-character segments for rare characters.
The down-stream language model will have to learn to deal with that kind of input.
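To make the merge process concrete, here is a minimal toy byte-level BPE trainer, a Python sketch of my own (not the answer's code and not GPT-2's actual implementation): it starts from the 256 single-byte symbols and repeatedly merges the most frequent adjacent pair.

from collections import Counter

def bpe_train(corpus_bytes, num_merges):
    # Toy byte-level BPE: the base vocabulary is the 256 possible byte values;
    # each merge step fuses the most frequent adjacent pair of segments.
    seq = [bytes([b]) for b in corpus_bytes]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)     # replace the pair with a new, longer segment
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

segments, merges = bpe_train("That’s great 👍".encode("utf-8"), num_merges=5)
print(segments)   # multi-byte characters like ’ and 👍 start out split across segments

With more merges (and a real training corpus rather than one sentence), the three bytes of ’ would eventually be merged back into a single segment and then combined with s, exactly as the ’s example above describes.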

So you already know BPE, right? Byte-level BPE is a change in how the base vocabulary is defined. Recall that there are roughly 143,859 characters in the Unicode alphabets, yet the GPT-2 vocabulary size is just 50,257. Having a base vocabulary of ~140K would grow even larger during the training process (where we merge frequently co-occurring symbols).
To solve this issue, GPT-2 uses a byte-level scheme with a base vocabulary of just 256 symbols, with which any Unicode character can be represented by one or more byte-level symbols. (I still don't know the exact process by which a Unicode character is converted to its byte-level representation.)
I hope this explanation gives you some clarity on why we go to a byte-level representation. Once again: GPT-2 starts from this 256-symbol base vocabulary and grows the vocabulary by merging frequently co-occurring symbols.
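For the conversion step the answer says it is unsure about: UTF-8 itself is that conversion. A small Python illustration (my addition, not the answerer's) shows each character's code point next to its byte-level symbols, every one of which lies in the 256-entry base vocabulary:

for ch in ["a", "§", "龙", "👍"]:
    # ord() gives the Unicode code point; encode() gives 1-4 byte values,
    # each of which is one of only 256 possible base-vocabulary symbols.
    print(ch, hex(ord(ch)), list(ch.encode("utf-8")))
# a  0x61    [97]
# §  0xa7    [194, 167]
# 龙 0x9f99  [233, 190, 153]
# 👍 0x1f44d [240, 159, 145, 141]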

Related

UTF8, codepoints, and their representation in Erlang and Elixir

Going through Elixir's handling of Unicode:
iex> String.codepoints("abc§")
["a", "b", "c", "§"]
very good, and byte_size/1 of this is not 4 but 5, because the last char takes 2 bytes. I get that.
The ? operator (or is it a macro? can't find the answer) tells me that
iex(69)> ?§
167
Great; so then I look into the UTF-8 encoding table, and see value c2 a7 as hex encoding for the char. That means the two bytes (as witnessed by byte_size/1) are c2 (194 in decimal) and a7 (167 in decimal). That 167 is the result I got when evaluating ?§ earlier. What I don't understand, exactly, is.. why that number is a "code point", as per the description of the ? operator. When I try to work backwards, and evaluate the binary, I get what I want:
iex(72)> <<0xc2, 0xa7>>
"§"
And to make me go completely bananas, this is what I get in Erlang shell:
24> <<167>>.
<<"§">>
25> <<"\x{a7}">>.
<<"§">>
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>>
27> <<"\x{c2a7}">>.
<<"§">>
!! while Elixir is only happy with the code above... What is it that I don't understand? Why is Erlang perfectly happy with a single byte, given that Elixir insists that the char takes 2 bytes - and the Unicode table seems to agree?
The codepoint is what identifies the Unicode character. The codepoint for § is 167 (0xA7). A codepoint can be represented in bytes in different ways, depending of your encoding of choice.
The confusion here comes from the fact that the codepoint 167 (0xA7) is identified by the bytes 0xC2 0xA7 when encoded to UTF-8.
When you add Erlang to the conversation, you have to remember that Erlang's default encoding was/is Latin-1 (there is an effort to migrate to UTF-8, but I am not sure if it has made it to the shell - someone please correct me).
In Latin-1, the codepoint of § (0xA7) is also represented by the byte 0xA7. So, explaining your results directly:
24> <<167>>.
<<"§">> %% this is encoded in latin1
25> <<"\x{a7}">>.
<<"§">> %% still latin1
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>> %% this is encoded in utf8, as the /utf8 modifier says
27> <<"\x{c2a7}">>.
<<"§">> %% this is latin1
The last one is quite interesting and potentially confusing. In Erlang binaries, if you pass an integer with value more than 255, it is truncated. So the last example is effectively doing <<49831>> which when truncated becomes <<167>>, which is again equivalent to <<"§">> in latin1.
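The same point can be checked in Python (used here purely for illustration, since the behaviour is a property of the encodings, not of the language): one code point, two different byte representations depending on the encoding.

print(ord("§"))                     # 167 -> the code point, U+00A7
print("§".encode("latin-1"))        # b'\xa7'      -> one byte in Latin-1
print("§".encode("utf-8"))          # b'\xc2\xa7'  -> two bytes in UTF-8
print(b"\xc2\xa7".decode("utf-8"))  # '§'          -> decoding the UTF-8 bytes back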
The code point is a number assigned to the character. It's an abstract value, not dependent on any particular representation in actual memory somewhere.
In order to store the character, you have to convert the code point to some sequence of bytes. There are several different ways to do this; each is called a Unicode Transformation Format, and named UTF-n, where the n is the number of bits in the basic unit of encoding. There used to be a UTF-7, used where 7-bit ASCII was assumed and even the 8th bit of a byte couldn't be reliably transmitted; in modern systems, there are UTF-8, UTF-16, and UTF-32.
Since the largest code point value fits comfortably in 21 bits, UTF-32 is the simplest; you just store the code point as a 32-bit integer. (There could theoretically be a UTF-24 or even a UTF-21, but common modern computing platforms deal naturally with values that take up either exactly 8 or a multiple of 16 bits, and have to work harder to deal with anything else.)
So UTF-32 is simple, but inefficient. Not only does it have 11 extra bits that will never be needed, it has 5 bits that are almost never needed. Far and away most Unicode characters found in the wild are in the Basic Multilingual Plane, U+0000 through U+FFFF. UTF-16 lets you represent all of those code points as a plain integer, taking up half the space of UTF-32. But it can't represent anything from U+10000 on up that way, so part of the 0000-FFFF range is reserved as "surrogate pairs" that can be put together to represent a high-plane Unicode character with two 16-bit units, for a total of 32 bits again but only when needed.
Java uses UTF-16 internally, but Erlang (and therefore Elixir), along with most other programming systems, uses UTF-8. UTF-8 has the advantage of completely transparent compatibility with ASCII - all characters in the ASCII range (U+0000 through U+007F, or 0-127 decimal) are represented by single bytes with the corresponding value. But any characters with code points outside the ASCII range require more than one byte each - even those in the range U+0080 through U+00FF, decimal 128 through 255, which only take up one byte in the Latin-1 encoding that used to be the default before Unicode.
So with Elixir/Erlang "binaries", unless you go out of your way to encode things differently, you are using UTF-8. If you look at the high bit of the first byte of a UTF-8 character, it's either 0, meaning you have a one-byte ASCII character, or it's 1. If it's 1, then the second-highest bit is also 1, because the number of consecutive 1-bits counting down from the high bit before you get to a 0 bit tells you how many bytes total the character takes up. So the pattern 110xxxxx means the character is two bytes, 1110xxxx means three bytes, and 11110xxx means four bytes. (There is no legal UTF-8 character that requires more than four bytes, although the encoding could theoretically support up to seven.)
The rest of the bytes all have the two high bits set to 10, so they can't be mistaken for the start of a character. And the rest of the bits are the code point itself.
To use your case as an example, the code point for "§" is U+00A7 - that is, hexadecimal A7, which is decimal 167 or binary 10100111. Since that's greater than decimal 127, it will require two bytes in UTF-8. Those two bytes will have the binary form 110abcde 10fghijk, where the bits abcdefghijk will hold the code point. So the binary representation of the code point, 10100111, is padded out to 00010100111 and split unto the sequences 00010, which replaces abcde in the UTF-8 template, and 100111, which replaces fghijk. That yields two bytes with binary values 11000010 and 10100111, which are C2 and A7 in hexadecimal, or 194 and 167 in decimal.
You'll notice that the second byte coincidentally has the same value as the code point you're encoding, but it's important to realize that this correspondence is just a coincidence. There are a total of 64 code points, from 128 (U+0080) through 191 (U+00BF), that work out that way: their UTF-8 encoding consists of a byte with decimal value 194 followed by a byte whose value is equal to the code point itself. But for the other 1,114,048 code points possible in Unicode, that is not the case.
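Here is that two-byte encoding step written out in Python (a sketch added for illustration; the bit layout is exactly the one described above):

cp = ord("§")                           # 0xA7 = 167 = 0b10100111
byte1 = 0b11000000 | (cp >> 6)          # top 5 bits go into the 110abcde template
byte2 = 0b10000000 | (cp & 0b111111)    # low 6 bits go into the 10fghijk template
print(hex(byte1), hex(byte2))           # 0xc2 0xa7
print("§".encode("utf-8"))              # b'\xc2\xa7' -- the built-in encoder agrees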

Are there any UTF-8 code units that have byte 60 or 62 (`<` and `>`) as not the first byte of their binary representation?

I need to debug an XML parser, and I am wondering if I can construct "malicious" input that will cause it not to recognize opening and closing tags correctly.
Additionally, where can I find this sort of information in general? After this I will also want to be sure that the parser I am working with won't have trouble with other special characters such as &, =, ", etc.
UTF-8 makes it very easy to figure out what the role of a code unit (i.e. a byte) is:
If the highest bit is not set, i.e. the code unit is 0xxxxxxx, then this byte expresses an entire code point, whose value is xxxxxxx (i.e. 7 bits of information).
If the highest bit is set and the code unit is 10xxxxxx, then it is a continuation part of a multibyte sequence, carrying six bits of information.
Otherwise, the code unit is the initial byte of a multibyte sequence, as follows:
110xxxxx: Two bytes (one continuation byte), for 5 + 6 = 11 bits.
1110xxxx: Three bytes (two continuation bytes), for 4 + 6 + 6 = 16 bits.
11110xxx: Four bytes (three continuation bytes), for 3 + 6 + 6 + 6 = 21 bits.
As you can see, a value 60, which is 00111100, is a single-byte codepoint of value 60, and the same byte cannot occur as part of any multibyte sequence.
The scheme can actually be extended up to seven bytes, encoding up to 36 bits, but since Unicode only requires 21 bits, four bytes suffice. The standard mandates that a code point must be represented with the minimal number of code units.
Update: As @Mark Tolonen rightly points out, you should check carefully whether each encoded code point is actually encoded with the minimal number of code units. If a browser would inadvertently accept such input, a user could sneak something past you that you would not spot in a byte-for-byte analysis. As a starting point you could look for bytes like 10111100, but you'd have to check the entire multibyte sequence of which it is a part (since it can of course occur legitimately as a part of different code points). Ultimately, if you can't trust the browser, you don't really get around decoding everything and just checking the resulting code point sequence for occurrences of U+003C etc., and don't even bother looking at the byte stream.
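If you want to convince yourself of the non-overlap property by brute force, a small Python check (my own sketch, not part of the answer) can walk every encodable code point and confirm that the bytes 0x3C and 0x3E never appear inside a multi-byte UTF-8 sequence:

for cp in range(0x80, 0x110000):
    if 0xD800 <= cp <= 0xDFFF:         # surrogates cannot be encoded in UTF-8
        continue
    encoded = chr(cp).encode("utf-8")  # always 2-4 bytes for code points >= 0x80
    assert 0x3C not in encoded and 0x3E not in encoded
print("no multi-byte UTF-8 sequence contains 0x3C ('<') or 0x3E ('>')")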
In UTF-8, no. In other encodings, yes.
In UTF-8, by design, all bytes of a multibyte character will always have the highest bit set. Vice versa, a byte that doesn't have the highest bit set is always an ASCII character.
However, this is not true for other encodings, which are also valid for XML.
For more information about UTF-8, check e.g wikipedia
A poorly-designed UTF-8 decoder could interpret the bytes C0 BC and C0 BE as U+003C and U+003E. As @KerrekSB stated in his answer:
The standard mandates that a code point must be represented with the minimal number of code units.
But a poor algorithm might still decode a malformed two-byte UTF-8 sequence that is not the minimal number of code units:
C0 BC = 11000000 10111100 → payload bits 00000 111100 = 00000111100 = 0x3C = 60 decimal = '<'
So in your testing be sure to include malformed UTF-8 sequences and verify that they are rejected.
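As a concrete test case (Python assumed here, purely for illustration), a strict decoder should refuse the overlong sequence rather than turn it into '<':

try:
    print(bytes([0xC0, 0xBC]).decode("utf-8"))
except UnicodeDecodeError as err:
    print("rejected:", err)    # a conforming decoder rejects overlong encodings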

How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode, with an explanation. I know a character can be encoded as 1, 2, 3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though the starting byte of a character already makes clear how long the sequence should be.
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,929 code points are actually assigned in Unicode 12.1.
I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.
In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.
The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for restrictions on the continuation bytes are presumably so it is easy to find the beginning of the next character (as continuation characters are always of the form 10xxxxxx, but the starting byte can never be of this form).
Unicode supports 1,114,112 code points. There are 2,048 surrogate code points, giving 1,112,064 scalar values. Of these, there are 66 non-characters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).
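The arithmetic behind these figures, spelled out (Python used here only as a calculator, not part of either answer):

code_points   = 17 * 65_536                        # 17 planes -> 1,114,112 code points
surrogates    = 2_048
noncharacters = 66
print(code_points)                                 # 1114112
print(code_points - surrogates)                    # 1112064 scalar values
print(code_points - surrogates - noncharacters)    # 1111998 possible characters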
To give a metaphorically accurate answer, all of them.
Continuation bytes in the UTF-8 encoding allow for resynchronization of the encoded octet stream in the face of "line noise". The decoder merely needs to scan forward for a byte that does not have a value between 0x80 and 0xBF; such a byte is the start of a new code point.
In theory, the encodings used today allow for expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where the maximal length tweet can encode up to 4,340 bits' worth of data. (140 characters [valid and invalid], times 31 bits each.)
According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.
Unicode has room for 0x110000 code points, which is 1,114,112 in decimal.

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 bytes (C0-C1 and F5-FF) that are never used. And multi-byte sequences that are not used such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
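A quick brute-force check of that optimum (a Python sketch added here, not part of the original question):

best = (0, 0, 0)
for B in range(129):                  # lead bytes for 2-byte characters
    for C in range(129 - B):          # lead bytes for 3-byte characters
        T = 256 - 128 - B - C         # values left over for trailing bytes
        N = 128 + B * T + C * T * T   # characters the scheme can represent
        best = max(best, (N, B, C))
print(best)                           # (310803, 0, 43): 310,803 characters at B = 0, C = 43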
Is there a different approach that could encode more characters?
It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.
I did the math, and it's not possible (if wanting to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.
Sure it's possible. Proof:
2^24 = 16,777,216
So there is enough of a bit-space for 1,114,112 characters but the more crowded the bit-space the more bits are used per character. The whole point of UTF-8 is that it makes the assumption that the lower code points are far more likely in a character stream so the entire thing will be quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then solve this as an equation. Choose an encoding scheme where the first byte determines how many bytes are used. So there are 128 values. Each of these will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128-x) = 1114112 - 128
Solving this, you need 111 values of the first byte for 2-byte characters and the remaining 17 for 3-byte characters. To check:
128 + 111 * 256 + 17 * 65536 = 1,142,656, which covers all 1,114,112 code points.
To put it another way:
128 code points can be encoded with 1 byte;
28,416 code points can be encoded with 2 bytes; and
1,114,112 code points can be encoded with 3 bytes.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because simple bitwise AND tests determine the length, and it gives an address space of 4,210,816 code points.
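Checking the capacities of both layouts above (Python used only as a calculator, not part of the answer):

print(128 + 111 * 256 + 17 * 65_536)   # 1,142,656 -> covers all 1,114,112 code points
print(128 +  64 * 256 + 64 * 65_536)   # 4,210,816 -> address space of the 128/64/64 split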

What is the most efficient binary to text encoding?

The closest contenders that I could find so far are yEnc (2%) and ASCII85 (25% overhead). There seem to be some issues around yEnc mainly around the fact that it uses an 8-bit character set. Which leads to another thought: is there a binary to text encoding based on the UTF-8 character set?
This really depends on the nature of the binary data, and the constraints that "text" places on your output.
First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of 1/0 or individual bytes is more or less random.
Now: why do you need text? Typically, it's because the communication channel does not pass through all characters equally. e.g. you may require pure ASCII text, whose printable characters range from 0x20-0x7E. You have 95 characters to play with. Each character can theoretically encode log2(95) ~= 6.57 bits per character. It's easy to define a transform that comes pretty close.
But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.
To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.
UTF-8 is not a good transport for arbitrarily encoded binary data. It is able to transport values 0x01-0x7F with only 14% overhead. I'm not sure if 0x00 is legal; likely not. But anything from 0x80 up expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, or 126 unique characters. If you don't need delimiters then you can transmit 6.98 bits per character.
A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)
Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:
A+B=95 (total number of symbols)
2A+B=128 (total number of 7-bit prefixes that can be made. You can start 2 prefixes with a 6-bit symbol, or one with a 7-bit symbol.)
Solving the system, you get: A=33, B=62. You now build a table of symbols:
Raw Encoded
000000 0000000
000001 0000001
...
100000 0100000
1000010 0100001
1000011 0100010
...
1111110 1011101
1111111 1011110
To encode, first shift off 6 bits of input. If those six bits are greater or equal to 100001 then shift another bit. Then look up the corresponding 7-bit output code, translate to fit in the output space and send. You will be shifting 6 or 7 bits of input each iteration.
To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.
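For reference, the little system above is easy to solve and compare against the theoretical ceiling (a Python sketch, not part of the original answer):

import math

A = 128 - 95            # subtracting the two equations gives A = 33 six-bit symbols
B = 95 - A              # B = 62 seven-bit symbols
print(A, B)             # 33 62
print(math.log2(95))    # ~6.57 bits per character, the theoretical ceiling for 95 symbols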
The short answer would be: No, there still isn't.
I ran into the problem of encoding as much information as possible into a JSON string, meaning UTF-8 without control characters, backslash and quotes.
I went out and researched how many bits you can squeeze into valid UTF-8 bytes. I disagree with the answers stating that UTF-8 brings too much overhead. It's not true.
If you take into account only one-byte sequences, it's as powerful as standard ASCII. Meaning 7 bits per byte. But if you cut out all special characters you'll be left with something like Ascii85.
But there are fewer control characters in higher planes. So if you use 6-byte chunks you'll be able to encode 5 bytes per chunk. In the output you'll get any combination of UTF-8 characters of any length (for 1 to 6 bytes).
This will give you a better result than Ascii85: 5/6 instead of 4/5, 83% efficiency instead of 80%. In theory it'll get even better with higher chunk length: about 84% at 19-byte chunks.
In my opinion the encoding process becomes too complicated while it provides very little profit. So Ascii85 or some modified version of it (I'm looking at Z85 now) would be better.
I searched for the most efficient binary-to-text encoding last year. I realized for myself that compactness is not the only criterion. The most important thing is where you are able to use the encoded string. For example, yEnc has 2% overhead, but it is an 8-bit encoding, so its usage is very, very limited.
My choice is Z85. It has acceptable 25% overhead, and encoded string can be used almost everywhere: XML, JSON, source code etc. See Z85 specification for details.
Finally, I've written Z85 library in C/C++ and use it in production.
According to Wikipedia
basE91 produces the shortest plain ASCII output for compressed 8-bit binary input.
Currently base91 is the best encoding if you're limited to ASCII characters only and don't want to use non-printable characters. It also has the advantage of lightning fast encoding/decoding speed because a lookup table can be used, unlike base85 which has to be decoded using slow divisions
Going above that, base122 will help increase efficiency a little, but it's not 8-bit clean. However, because it's based on UTF-8 encoding, it should be fine to use for many purposes. And 8-bit cleanliness is mostly meaningless nowadays.
Note that base122 is in fact base-128, because the 6 invalid values (128 - 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128, where 7 bits are encoded in 1 byte. In reality it can be optimized to be more efficient than base-128.
Base-122 Encoding
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
http://blog.kevinalbs.com/base122
See also How viable is base128 encoding for scenarios like JavaScript strings?
Next to the ones listed on Wikipedia, there is Bommanews:
B-News (or bommanews) was developed to lift the weight of the overhead inherent to UUEncode and Base64 encoding: it uses a new encoding method to stuff binary data in text messages. This method eats more CPU resources, but it manages to lower the loss from approximately 40% for UUEncode to 3.5% (the decimal point between those digits is not dirt on your monitor), while still avoiding the use of ANSI control codes in the message body.
It's comparable to yEnc: source
yEnc is less CPU-intensive than B-News and reaches about the same low level of overhead, but it doesn't avoid the use of all control codes, it just leaves out those that were (experimentally) observed to have undesired effects on some servers, which means that it's somewhat less RFC compliant than B-News.
http://b-news.sourceforge.net/
http://www.iguana.be/~stef/
http://bnews-plus.sourceforge.net/
If you are looking for an efficient encoding for large alphabets, you might want to try escapeless. Both escapeless252 and yEnc have 1.6% overhead, but with the first it's fixed and known in advance while with the latter it actually ranges from 0 to 100% depending on the distribution of bytes.