when is fixed length encoding better than huffman?

when is fixed length encoding better than huffman? - encoding

For the word "sleeplessness" Huffman encoding is 27 bits while Fixed length encoding is 39
Is there a word or a general condition in which Huffman will need more bits than Fixed length encoding?

A Huffman coding using the probability of the symbols in the message will never need more bits than a fixed-length coding, though only if we ignore the bits required to transmit a description of the code itself. The Huffman code description plus the Huffman-coded message for short messages will often be larger than a fixed-length code that requires no description.

Related

Struggling with Base 64 encoding in T-SQL - and the padding [duplicate]

What is the purpose of padding in base64 encoding. The following is the extract from wikipedia:
"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes) ; these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be not a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."
I wrote a program which could base64 encode any string and decode any base64 encoded string. What problem does padding solves?

Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.
If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.
Edit: An Illustration
Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.
I encodes to SQ (SQ== with padding)
AM encodes to QU0 (QU0= with padding)
TJM encodes to VEpN (VEpN with padding)
So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
Why Bother with Padding?
Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.
That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.
If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.
Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.

On a related note, here's an arbitrary base converter I created for you. Enjoy!
https://convert.zamicol.com/
What are Padding Characters?
Padding characters help satisfy length requirements and carry no other meaning.
Decimal Example of Padding:
Given the arbitrary requirement all strings be 8 characters in length, the number 640 can meet this requirement using preceding 0's as padding characters as they carry no meaning, "00000640".
Binary Encoding
The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.
Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.
Base16, hexadecimal or hex, uses 4 bits for each character. One byte can represent two base16 characters.
Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.
We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced this fraction is 3 bytes over 4 characters.
This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise even measuring with 3 byte bundles, unlike base16 and base256 where every byte can stand on it's own.
So why is padding encouraged even though encoding could work just fine without the padding characters?
If the length of a stream is unknown or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rules out any ambiguity. Even if the length is unknown with padding you'll know where your data stream ends.
As a counter example, some standards like JOSE don't allow padding characters. In this case, if there is something missing, a cryptographic signature won't work or other non base64 characters will be missing (like the "."). Although assumptions about length aren't made, padding isn't needed because if there is something wrong it simply won't work.
And this is exactly what the base64 RFC says,
In some circumstances, the use of padding ("=") in base-encoded data
is not required or used. In the general case, when assumptions about
the size of transported data cannot be made, padding is required to
yield correct decoded data.
[...]
The padding step in base 64 [...] if improperly
implemented, lead to non-significant alterations of the encoded data.
For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders, which is described in the descriptions
on padding below. If this property do not hold, there is no
canonical representation of base-encoded data, and multiple base-
encoded strings can be decoded to the same binary data. If this
property (and others discussed in this document) holds, a canonical
encoding is guaranteed.
Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding there is no longer the explicit acknowledgement of measuring in three byte bundles. Without padding you may not be able to guarantee exact reproduction of original encoding without additional information usually from somewhere else in your stack, like TCP, checksums, or other methods.
Alternatively to bucket conversion schemes like base64 is radix conversion which has no arbitrary bucket sizes and for left-to-right readers is left padded. The "iterative divide by radix" conversion method is typically employed for radix conversions.
Examples
Here is the example form RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)
Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.
BASE64("") = "" (No bytes used. 0 % 3 = 0)
BASE64("f") = "Zg==" (One byte used. 1 % 3 = 1)
BASE64("fo") = "Zm8=" (Two bytes. 2 % 3 = 2)
BASE64("foo") = "Zm9v" (Three bytes. 3 % 3 = 0)
BASE64("foob") = "Zm9vYg==" (Four bytes. 4 % 3 = 1)
BASE64("fooba") = "Zm9vYmE=" (Five bytes. 5 % 3 = 2)
BASE64("foobar") = "Zm9vYmFy" (Six bytes. 6 % 3 = 0)
Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp

There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.
Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.
This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:
A full encoding quantum is always completed at the end of a message.
It does not suggest concatenation (top answer here), nor ease of implementation as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today, however in 1993 unsafe C code would have very likely actually taken advantage of this property.

With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has been corrupted for sure) and thus code can easily process that string in a loop that processes 4 characters at a time (always converting 4 input characters to three or less output bytes). So padding makes sanity checking easy (length % 4 != 0 ==> error as not possible with padding) and it makes processing simpler and more efficient.
I know what people will think: Even without padding, I can process all 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true but you are thinking in terms of C (or higher languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP, that has very limited processing power, no RAM storage and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding that's way simpler than without.

Padding fills the output length to a multiple of four bytes in a defined way.

Why use Base64?

Base64 encoding increases the size of the input by around 37% when sent over the wire. If this is the case, why not use UTF-8 to encode the contents(say a .jpg file). This way the size of the file does not increase right?
eg: If I want to send the string "asd", a UTF-8 encoded version of this will be 3 bytes, whereas a Base64 encoded version will be 4 bytes.

The purpose of Base64 is to allow binary data to be transferred over a communication channel that cannot be relied on to transfer all possible byte values end-to-end. In particular, Base64 is used where byte values between 128 and 255 cannot be easily and reliably transferred.
In contrast, UTF-8 is used to encode Unicode across a channel that can be assumed to reliably transfer all possible byte values end-to-end (sometime referred to as an "8-bit clean" channel).
So, you have two problems with your proposal. First, a JPEG is binary data, not Unicode, so UTF-8 isn't really appropriate: if you "encode a JPEG as UTF-8" in the obvious way (treating the JPEG as a sequence of bytes, each associated with a Unicode code point from U+00 to U+FF, and then encoding those code points as UTF-8), it will double the size of all byte values from 128-255, so you'll have, on average, a 50% increase in file size. Second, even if you did this, the resulting encoded JPEG would require a communication channel that's 8-bit clean, so it couldn't be used in situations where Base64 is needed anyway.
Edit: In a comment, you asked if we couldn't use "input binary -> 7 bit ASCII encoding -> send over wire" to save space. I assume you mean taking the input binary as a long stream of bits and chopping them up into 7-bit chunks and sending those as ASCII? Yes, that could be done and would only increase size by 14%, but it's not just the non-ASCII byte values 128-255 that cause problems. In MIME email, where Base64 is most frequently used, differences in line-ending convention (carriage return, line feed, or a combination) from platform to platform, certain historical line length restrictions enshrined in the standards, and so on mean that not all ASCII characters (bytes 0-127) can be safely used. Base64 is not the best trade-off possible between compatibility and efficiency, but it's pretty close.

Base64 is usually used in instances to represent arbitrary binary data in a text format, it has a 33.3'% overhead but that's better than say hex notation which has a 50% overhead.
utf-8 is a text encoding which cannot represent arbitrary binary data which is what a jped file is.
There is little to no reason to convert the binary data to text to transfer it over the wire so a many times people do it because they don't know any better.
The only reason to use it is if you get it from apis or libraries.

Is there any classic 3 byte fingerprint function?

I need a checksum/fingerprint function for short strings (say, 16 to 256 bytes) which fits in a 24 bits word. Is there any well known algorithm for that?

I propose to use a 24-bit CRC as an easy solution. CRCs are available in all lengths and always simple to compute. Wikipedia has a matching entry. The quality is far better than a modulo-reduced sum, because swapping characters will most likely produce a different CRC.
The next step (if it is a real threat to have a wrong string with the same checksum) would be a cryptographic MAC like CMAC. While this is too long out of the book, it can be reduced by taking the first 24 bits.

Simplest thing to do is a basic checksum - add up the bytes in the string, mod (2^24).
You have to watch out for character set issues when converting to bytes though, so everyone agrees on the same encoding of characters to bytes.

Why does base64 encoding require padding if the input length is not divisible by 3?

What is the purpose of padding in base64 encoding. The following is the extract from wikipedia:
"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes) ; these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be not a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."
I wrote a program which could base64 encode any string and decode any base64 encoded string. What problem does padding solves?

There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.
Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.
This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:
A full encoding quantum is always completed at the end of a message.
It does not suggest concatenation (top answer here), nor ease of implementation as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today, however in 1993 unsafe C code would have very likely actually taken advantage of this property.

With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has been corrupted for sure) and thus code can easily process that string in a loop that processes 4 characters at a time (always converting 4 input characters to three or less output bytes). So padding makes sanity checking easy (length % 4 != 0 ==> error as not possible with padding) and it makes processing simpler and more efficient.
I know what people will think: Even without padding, I can process all 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true but you are thinking in terms of C (or higher languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP, that has very limited processing power, no RAM storage and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding that's way simpler than without.

Padding fills the output length to a multiple of four bytes in a defined way.

What is the most efficient binary to text encoding?

The closest contenders that I could find so far are yEnc (2%) and ASCII85 (25% overhead). There seem to be some issues around yEnc mainly around the fact that it uses an 8-bit character set. Which leads to another thought: is there a binary to text encoding based on the UTF-8 character set?

This really depends on the nature of the binary data, and the constraints that "text" places on your output.
First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of 1/0 or individual bytes is more or less random.
Now: why do you need text? Typically, it's because the communication channel does not pass through all characters equally. e.g. you may require pure ASCII text, whose printable characters range from 0x20-0x7E. You have 95 characters to play with. Each character can theoretically encode log2(95) ~= 6.57 bits per character. It's easy to define a transform that comes pretty close.
But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.
To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.
UTF-8 is not a good transport for arbitrarily encoded binary data. It is able to transport values 0x01-0x7F with only 14% overhead. I'm not sure if 0x00 is legal; likely not. But anything above 0x80 expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, or 126 unique characters. If you don't need delimeters then you can transmit 6.98 bits per character.
A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)
Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:
A+B=95 (total number of symbols)
2A+B=128 (total number of 7-bit prefixes that can be made. You can start 2 prefixes with a 6-bit symbol, or one with a 7-bit symbol.)
Solving the system, you get: A=33, B=62. You now build a table of symbols:
Raw Encoded
000000 0000000
000001 0000001
...
100000 0100000
1000010 0100001
1000011 0100010
...
1111110 1011101
1111111 1011110
To encode, first shift off 6 bits of input. If those six bits are greater or equal to 100001 then shift another bit. Then look up the corresponding 7-bit output code, translate to fit in the output space and send. You will be shifting 6 or 7 bits of input each iteration.
To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.

The short answer would be: No, there still isn't.
I ran into the problem with encoding as much information into JSON string, meaning UTF-8 without control characters, backslash and quotes.
I went out and researched how many bit you can squeeze into valid UTF-8 bytes. I disagree with answers stating that UTF-8 brings too much overhead. It's not true.
If you take into account only one-byte sequences, it's as powerful as standard ASCII. Meaning 7 bits per byte. But if you cut out all special characters you'll be left with something like Ascii85.
But there are fewer control characters in higher planes. So if you use 6-byte chunks you'll be able to encode 5 bytes per chunk. In the output you'll get any combination of UTF-8 characters of any length (for 1 to 6 bytes).
This will give you a better result than Ascii85: 5/6 instead of 4/5, 83% efficiency instead of 80%. In theory it'll get even better with higher chunk length: about 84% at 19-byte chunks.
In my opinion the encoding process becomes too complicated while it provides very little profit. So Ascii85 or some modified version of it (I'm looking at Z85 now) would be better.

I searched for most efficient binary to text encoding last year. I realized for myself that compactness is not the only criteria. The most important is where you are able to use encoded string. For example, yEnc has 2% overhead, but it is 8-bit encoding, so its usage is very very limited.
My choice is Z85. It has acceptable 25% overhead, and encoded string can be used almost everywhere: XML, JSON, source code etc. See Z85 specification for details.
Finally, I've written Z85 library in C/C++ and use it in production.

According to Wikipedia
basE91 produces the shortest plain ASCII output for compressed 8-bit binary input.

Currently base91 is the best encoding if you're limited to ASCII characters only and don't want to use non-printable characters. It also has the advantage of lightning fast encoding/decoding speed because a lookup table can be used, unlike base85 which has to be decoded using slow divisions
Going above that base122 will help increasing efficiency a little bit, but it's not 8-bit clean. However because it's based on UTF-8 encoding, it should be fine to use for many purposes. And 8-bit clean is just meaningless nowadays
Note that base122 is in fact base-128 because the 6 invalid values (128 – 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits will be encoded in 1 byte, and in reality can be optimized to be more efficient than base-128
Base-122 Encoding
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
http://blog.kevinalbs.com/base122
See also How viable is base128 encoding for scenarios like JavaScript strings?

Next to the ones listed on Wikipedia, there is Bommanews:
B-News (or bommanews) was developed to lift the weight of the overhead inherent to UUEncode and Base64 encoding: it uses a new encoding method to stuff binary data in text messages. This method eats more CPU resources, but it manages to lower the loss from approximately 40% for UUEncode to 3.5% (the decimal point between those digits is not dirt on your monitor), while still avoiding the use of ANSI control codes in the message body.
It's comparable to yEnc: source
yEnc is less CPU-intensive than B-News and reaches about the same low level of overhead, but it doesn't avoid the use of all control codes, it just leaves out those that were (experimentally) observed to have undesired effects on some servers, which means that it's somewhat less RFC compliant than B-News.
http://b-news.sourceforge.net/
http://www.iguana.be/~stef/
http://bnews-plus.sourceforge.net/

If you are looking for an efficient encoding for large alphabets, you might want to try escapeless. Both escapeless252 and yEnc have 1.6% overhead, but with the first it's fixed and known in advance while with the latter it actually ranges from 0 to 100% depending on the distribution of bytes.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse