Variable-byte encoding clarification - encoding

I am very new to the world of byte encoding so please excuse me (and by all means, correct me) if I am using/expressing simple concepts in the wrong way.
I am trying to understand variable-byte encoding. I have read the Wikipedia article (http://en.wikipedia.org/wiki/Variable-width_encoding) as well as a book chapter from an Information Retrieval textbook. I think I understand how to encode a decimal integer. For example, if I wanted to provide variable-byte encoding for the integer 60, I would have the following result:
1 0 1 1 1 1 0 0
(please let me know if the above is incorrect). If I understand the scheme, then I'm not completely sure how the information is compressed. Is it because usually we would use 32 bits to represent an integer, so that representing 60 would result in 1 1 1 1 0 0 preceded by 26 zeros, thus wasting that space as opposed to representing it with just 8 bits instead?
Thank you in advance for the clarifications.

The way you do it is by reserving one of the bits to mean "I'm not done with the value." Usually, that's the most significant bit.
When you read a byte, you process the lower 7 bits. If the most significant bit is 1, then you know there's one more byte to read, and you repeat the process, adding the next 7 bits to the current 7 bits.
The MIDI format uses that exact encoding to represent lengths of MIDI events, in the following manner:
ExpectedValue = 0
byte=ReadFromFile
ExpectedValue = ExpectedValue + (byte AND 0x7f)
if byte > 127 then
ExpectedValue = ExpectedValue SHL 7
Goto 2
Done
For example, the value 0x80 would be represented using the bytes 0x81 0x00. You can try running the algorithm on those two bytes, and you see you'll get the right value.
UTF-8 works similarly, but it uses a slightly more complex scheme to tell you how many bytes you should be expecting. This allows for some error correction, since you can easily tell if the bytes you're getting match the length claimed. Wikipedia describes their structure quite well.

You hit the nail on the head.
There are many encoding schemes, such as gamma and delta, which are special cases of elias coding. These are bit-level codes, as opposed to the byte-level code you used, and are useful when you have a strong skew towards small numbers (which can often be achieved by encoding deltas instead of absolute values).
Bit-level encoding schemes are much more difficult to implement than byte-level schemes and the additional CPU burden may outweigh the time saved by having less data to read, though most modern CPUs have "highest-bit" and "lowest-bit" instructions that dramatically improve the performance of bit-level codecs. As CPU speeds continue to outpace RAM speeds, bit-level schemes will become more attractive, though the simplicity of byte-level codecs is a big factor too.

Yes, you are right, you save space by encoding using one byte instead of 4.
Generally, you will save memory if the values you are encoding are much smaller than the maximum value that would have fit in your original fixed-width encoding.

Related

Is there any classic 3 byte fingerprint function?

I need a checksum/fingerprint function for short strings (say, 16 to 256 bytes) which fits in a 24 bits word. Is there any well known algorithm for that?
I propose to use a 24-bit CRC as an easy solution. CRCs are available in all lengths and always simple to compute. Wikipedia has a matching entry. The quality is far better than a modulo-reduced sum, because swapping characters will most likely produce a different CRC.
The next step (if it is a real threat to have a wrong string with the same checksum) would be a cryptographic MAC like CMAC. While this is too long out of the book, it can be reduced by taking the first 24 bits.
Simplest thing to do is a basic checksum - add up the bytes in the string, mod (2^24).
You have to watch out for character set issues when converting to bytes though, so everyone agrees on the same encoding of characters to bytes.

Representation of a Kilo/Mega/Tera Byte

I was getting a little confused with the representation of different units of bytes.
It is accepted throughout that 1 byte = 8 bits.
However, in a lot of sources I have seen that
1 kiloByte = 2^10 bytes = 1024 bytes
AND
1 kiloByte = 1000 bytes
Doesn't this contradict as in both cases it is stated that 1 byte is 8 bits...?
Different sources claim different reasons for these different representations, thus I am not sure what the most important/real reason is for this rather confusing difference in representation.
Can someone please explain and clarify?
It is accepted throughout that 1 byte = 8 bits
However, in a lot of sources I have seen that
1 kiloByte = 2^ 10 bytes = 1024 bytes
AND
1 kiloByte = 1000 bytes
To make sure we're all clear, your question is "Is a kilobyte equal to 1024 bytes or 1000 bytes?".
Doesn't this contradict as in both cases it is stated that 1 byte is 8 bits...?
This is irrelevant to the question.
So, let's begin. In SI (metric), the multiplier of 1000 is called kilo, abbreviated k. k always means 1000, never anything else.
When binary computers entered the world, we noticed that 2 to the power of 10 is 1024, which is conveniently close to 1000. Computer engineers decided to abuse this coincidence and say that kilo means 1024. By extension, they say that mega means 10242 (instead of the proper definition of 10002), and so on with giga, tera, etc.
While the difference between 1000 and 1024 is small for many purposes, there are times when exact answers are required, and this is where the abusive terminology hurts everyone. Only after decades after kilo=1024 got established did anyone really try to fix the problem. The IEC proposed new prefixes for the binary multipliers: 1024 = kibi, 10242 = mebi, 10243 = gibi, etc.
In summary, the notion that kilo=1024 is an abusive deviation from the consistent SI definition of kilo=1000. While kilo=1024 is popular in the computer industry, it is nevertheless wrong and should be replaced by kibi=1024. Or numbers need to be recomputed to reflect the true definition of kilo/mega/etc. (For example, "512 MB" of RAM is actually about 536.9 MB.)
Btw, don't use random capitalization; it's spelled kilobyte, not kiloByte.
References and links:
http://physics.nist.gov/cuu/Units/binary.html
http://en.wikipedia.org/wiki/Kilo-
http://en.wikipedia.org/wiki/Kilobyte
http://en.wikipedia.org/wiki/Kibibyte
http://xkcd.com/394/
When you talk about data information in computer science, you always have to calculate the result by a power of two. See what wikipedia says:
"In computing, a binary prefix is a
specifier or mnemonic that is
prepended to the units of digital
information, the bit and the byte, to
indicate multiplication by a power of
2. In practice the powers used are multiples of 10, so the prefixes
denote powers of 1024 = 2^10."
Sometimes people use to round it as you have mentioned, but it is a bad use of it.
I don't see what the byte to bits has to do with anything if you are asking whether 1 kiloByte is equal to 1024 or 1000 bytes. These measurements are not set in stone and are not really controlled at all. Computer makers can (and have) used the 1000 conversion to make it look like they have more memory.
The problem comes up when thinking about binary (base 2) or base 10. Base 10 you would use 1000, base 2, 1024.

Binary integral data compression

I need to transmit integral data types over the network but don't want to transfer all 32 (or 64) bits all the time - data fits into just one byte 99% of time - so it looks like it's need to compress it somehow: for example first bit of a byte is 0 if other 7 bits means just some value (0-127), otherwise (if first byte is 1) it's need to shift these 7 bytes left and read second byte to do the same process.
Is there some common way to do this? I don't want to reinvent a wheel...
Thank you.
The scheme you describe (which is essentially a base-128 encoding: each byte is a 7-bit base-128 "digit" and a single bit flag to indicate whether or not it is the final digit) is a common way of doing this.
For example, see:
the section on "LEB128" in the DWARF spec (§7.6);
"Base 128 Varints" in Google's protocol buffers;
"Variable Width Integers" in the LLVM bitcode format (various different widths are used in various different places there).
Just about any data compression algorithm would be able to compress that kind of data stream very well. Use whatever compression libraries your language provides.

Binary storage of numbers with overflow bits...how's the format called?

For a serialization/protocol format I have to encode unsigned numbers up to unsigned 64bit integer in a space-saving way that should still be easy to implement (meaning, I'm not looking for a dedicated compression algorithm). I was thinking about the following:
if n<128
take bits 0..6 for representing n, set overflow bit 7 to 0
store one byte
if n>=128 and n<16384
take bits 0..6 of byte 1 as bits 0..6 of n, set overflow bit 7 of byte 1 to 1
take bits 0..6 of byte 2 as bits 7..13 of n, set overflow bit 7 of byte 2 to 0
store byte 1 followed by byte 2
if n>=16384 and n<2^21
...set overflow bit 7 of byte 2 to 1... (and so on)
I have two questions about this:
How is this format called? Where can I look up implementations?
This is for a binary protocol that will be sent over sockets, where small numbers <128 will be sent very often. Do you think the extra processing is worth it?
Not the same as, but similar to UTF-8.
Edit
BTW: try and choose a known protocol. UTF-8, Huffman encoding...
Okay, after some more research I've finally found it. It's called 'variable-length quantity' and used in MIDI and ASN.1 (see Wikipedia Entry)
To answer my other question, I'm tending to believe it isn't worth the processing overhead but I'm still pondering about it.

What is the most efficient binary to text encoding?

The closest contenders that I could find so far are yEnc (2%) and ASCII85 (25% overhead). There seem to be some issues around yEnc mainly around the fact that it uses an 8-bit character set. Which leads to another thought: is there a binary to text encoding based on the UTF-8 character set?
This really depends on the nature of the binary data, and the constraints that "text" places on your output.
First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of 1/0 or individual bytes is more or less random.
Now: why do you need text? Typically, it's because the communication channel does not pass through all characters equally. e.g. you may require pure ASCII text, whose printable characters range from 0x20-0x7E. You have 95 characters to play with. Each character can theoretically encode log2(95) ~= 6.57 bits per character. It's easy to define a transform that comes pretty close.
But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.
To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.
UTF-8 is not a good transport for arbitrarily encoded binary data. It is able to transport values 0x01-0x7F with only 14% overhead. I'm not sure if 0x00 is legal; likely not. But anything above 0x80 expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, or 126 unique characters. If you don't need delimeters then you can transmit 6.98 bits per character.
A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)
Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:
A+B=95 (total number of symbols)
2A+B=128 (total number of 7-bit prefixes that can be made. You can start 2 prefixes with a 6-bit symbol, or one with a 7-bit symbol.)
Solving the system, you get: A=33, B=62. You now build a table of symbols:
Raw Encoded
000000 0000000
000001 0000001
...
100000 0100000
1000010 0100001
1000011 0100010
...
1111110 1011101
1111111 1011110
To encode, first shift off 6 bits of input. If those six bits are greater or equal to 100001 then shift another bit. Then look up the corresponding 7-bit output code, translate to fit in the output space and send. You will be shifting 6 or 7 bits of input each iteration.
To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.
The short answer would be: No, there still isn't.
I ran into the problem with encoding as much information into JSON string, meaning UTF-8 without control characters, backslash and quotes.
I went out and researched how many bit you can squeeze into valid UTF-8 bytes. I disagree with answers stating that UTF-8 brings too much overhead. It's not true.
If you take into account only one-byte sequences, it's as powerful as standard ASCII. Meaning 7 bits per byte. But if you cut out all special characters you'll be left with something like Ascii85.
But there are fewer control characters in higher planes. So if you use 6-byte chunks you'll be able to encode 5 bytes per chunk. In the output you'll get any combination of UTF-8 characters of any length (for 1 to 6 bytes).
This will give you a better result than Ascii85: 5/6 instead of 4/5, 83% efficiency instead of 80%. In theory it'll get even better with higher chunk length: about 84% at 19-byte chunks.
In my opinion the encoding process becomes too complicated while it provides very little profit. So Ascii85 or some modified version of it (I'm looking at Z85 now) would be better.
I searched for most efficient binary to text encoding last year. I realized for myself that compactness is not the only criteria. The most important is where you are able to use encoded string. For example, yEnc has 2% overhead, but it is 8-bit encoding, so its usage is very very limited.
My choice is Z85. It has acceptable 25% overhead, and encoded string can be used almost everywhere: XML, JSON, source code etc. See Z85 specification for details.
Finally, I've written Z85 library in C/C++ and use it in production.
According to Wikipedia
basE91 produces the shortest plain ASCII output for compressed 8-bit binary input.
Currently base91 is the best encoding if you're limited to ASCII characters only and don't want to use non-printable characters. It also has the advantage of lightning fast encoding/decoding speed because a lookup table can be used, unlike base85 which has to be decoded using slow divisions
Going above that base122 will help increasing efficiency a little bit, but it's not 8-bit clean. However because it's based on UTF-8 encoding, it should be fine to use for many purposes. And 8-bit clean is just meaningless nowadays
Note that base122 is in fact base-128 because the 6 invalid values (128 – 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits will be encoded in 1 byte, and in reality can be optimized to be more efficient than base-128
Base-122 Encoding
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
http://blog.kevinalbs.com/base122
See also How viable is base128 encoding for scenarios like JavaScript strings?
Next to the ones listed on Wikipedia, there is Bommanews:
B-News (or bommanews) was developed to lift the weight of the overhead inherent to UUEncode and Base64 encoding: it uses a new encoding method to stuff binary data in text messages. This method eats more CPU resources, but it manages to lower the loss from approximately 40% for UUEncode to 3.5% (the decimal point between those digits is not dirt on your monitor), while still avoiding the use of ANSI control codes in the message body.
It's comparable to yEnc: source
yEnc is less CPU-intensive than B-News and reaches about the same low level of overhead, but it doesn't avoid the use of all control codes, it just leaves out those that were (experimentally) observed to have undesired effects on some servers, which means that it's somewhat less RFC compliant than B-News.
http://b-news.sourceforge.net/
http://www.iguana.be/~stef/
http://bnews-plus.sourceforge.net/
If you are looking for an efficient encoding for large alphabets, you might want to try escapeless. Both escapeless252 and yEnc have 1.6% overhead, but with the first it's fixed and known in advance while with the latter it actually ranges from 0 to 100% depending on the distribution of bytes.