How can I find the average length of a codeword encoded in Huffman if there are N(10 or more) symbols? - encoding

I'm practicing for an exam and I found a problem which asks to find the average length of codewords which are encoded in Huffman.
This usually wouldn't be hard, but in this problem we have to encode 100 symbols which all have the same probability (1/100).
Since there is obviously no point in trying to encode 100 symbols by hand I was wondering if there is a method to find out the average length without actually going through the process of encoding.
I'm guessing this is possible since all the probabilities are equal, however I couldn't find anything online.
Any help is appreciated!

For 100 symbols with equal probability, some will be encoded with six bits and some with seven bits. A Huffman code is a complete prefix code. "Complete" means that all possible bit patterns are used.
Let's say that i codes are six bits long and j codes are seven bits long. We know that i + j = 100. There are 64 possible six-bit codes, so after i of them are used up, there are 64 - i left. Adding one bit to each of those to make them seven bits long doubles the number of possible codes. So now we can have up to 2(64 - i) seven-bit codes.
For the code to be complete, all of those codes must be used, so j = 2(64 - i). We now have two equations in two unknowns. Substituting gives i + 2(64 - i) = 100, so i = 28 and j = 72.
Since all symbols are equally probable, the average number of bits used per symbol is (28×6 + 72×7) / 100, which is 6.72. Not too bad, considering the entropy of each symbol is log2(100) ≈ 6.64 bits.
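Since the symbols are equiprobable, the same reasoning extends to any N: with k = floor(log2 N), a complete code has 2^(k+1) − N codewords of length k and the rest have length k + 1. Here is a minimal sketch of that calculation in C++ (the function name and the printed example are mine, not part of the answer above):
#include <cmath>
#include <cstdio>

// Average Huffman codeword length for n equiprobable symbols.
// With k = floor(log2(n)), a complete prefix code has 2^(k+1) - n codes
// of length k; the remaining codes have length k + 1.
double avgHuffmanLength(int n) {
    int k = (int)std::floor(std::log2((double)n));
    int shortCodes = (1 << (k + 1)) - n;   // "i" in the answer above
    int longCodes  = n - shortCodes;       // "j" in the answer above
    return (shortCodes * k + longCodes * (k + 1)) / (double)n;
}

int main() {
    std::printf("average length for N = 100: %.2f bits\n", avgHuffmanLength(100)); // 6.72
    std::printf("entropy per symbol:         %.2f bits\n", std::log2(100.0));      // 6.64
}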

Related

Matlab mantissa base

I've been using matlab to solve some boundary value problems lately, and I've noticed an annoying quirk. Suppose I start with the interval [0,1], and I want to search inside it. Naturally, one would perform a binary search, so I would subdivide the interval into [0,0.5] and [0.5,1]. Excellent: let's now suppose we narrow down our search to [0.5,1]. Now we divide the interval into [0.5,0.75] and [0.75,1]. No apparent problem yet. However, as we keep going, representation of powers of 2 in base 10 becomes less and less natural. For example, 2^-22 in binary is just 22 bits, while in decimal it is 16 digits. However, keep in mind that each digit of decimal is really encoding ~ 4 bits. In other words, representing these fractions as decimal is extremely inefficient.
Matlab's precision only extends to 16 digit decimal floats, so a binary search going to 2^-22 is as good as you can do. However, 2^-22 ~ 10^-7, which is much bigger than 10^-16, so the best search strategy in matlab seems to be a decimal search! In any case, this is what I have done so far: to take full advantage of the 16 digit precision, I've had to subdivide the interval [0,1] into 10 pieces.
Hopefully I've made my problem clear. So, my question is: how do I make matlab count in native binary? I want to work with 64 bit binary floats!

How can I calculate the impact on collision probability when truncating a hash?

I'd like to reduce an MD5 digest from 32 characters down to, ideally closer to 16. I'll be using this as a database key to retrieve a set of (public) user-defined parameters. I'm expecting the number of unique "IDs" to eventually exceed 10,000. Collisions are undesirable but not the end of the world.
I'd like to understand the viability of a naive truncation of the MD5 digest to achieve a shorter key. But I'm having trouble digging up a formula that I can understand (given I have a limited Math background), let alone use to determine the impact on collision probability that truncating the hash would have.
The shorter the better, within reason. I feel there must be a simple formula, but I'd rather have a definitive answer than do my own guesswork cobbled together from bits and pieces I have read around the web.
You can calculate the chance of collisions with this formula:
chance of collision = 1 - e^(-n^2 / (2 * d))
Where n is the number of messages, d is the number of possibilities, and e is the constant e (2.718281828...).
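If you want to evaluate that formula directly, a couple of lines will do it (just a sketch; the 10,000-message / 64-bit example values are mine, not from the answer):
#include <cmath>
#include <cstdio>

// Birthday-bound approximation: P(collision) ~= 1 - e^(-n^2 / (2d)).
double collisionChance(double n, double d) {
    return -std::expm1(-n * n / (2.0 * d));   // computes 1 - e^(-x) accurately
}

int main() {
    // Example: 10,000 messages hashed into a full 64-bit space.
    std::printf("%.3g\n", collisionChance(10000.0, std::pow(2.0, 64)));  // ~2.7e-12
}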
#mypetition's answer is great.
I found a few other equations that are more-or-less accurate and/or simplified here, along with a great explanation and a handy comparison of real-world probabilities:
1 − e^(−k(k−1)/(2N)) - sample plot here
k(k−1)/(2N) - sample plot here
k^2/(2N) - sample plot here
...where k is the number of ID's you'll be generating (the "messages") and N is the largest number that can be produced by the hash digest or the largest number that your truncated hexadecimal number could represent (technically + 1, to account for 0).
A bit more about "N"
If your original hash is, for example, "38BF05A71DDFB28A504AFB083C29D037" (32 hex chars), and you truncate it down to, say, 12 hex chars (e.g.: "38BF05A71DDF"), the largest number you could produce is 0xFFFFFFFFFFFF (281,474,976,710,655, which is 16^12 − 1, or 256^6 − 1 if you prefer to think in terms of bytes). But since "0" itself counts as one of the numbers you could theoretically produce, you add back that 1, which leaves you simply with 16^12.
So you can think of N as 16 ^ (numberOfHexDigits).
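Putting both answers together for the truncation case, here is a small sketch (the 12- and 8-hex-digit truncations and the 10,000-ID count are example values, and the helper name is mine):
#include <cmath>
#include <cstdio>

// P(collision) ~= 1 - e^(-k(k-1) / (2N)), where N = 16^hexDigits.
double truncatedKeyCollisionChance(double k, int hexDigits) {
    double N = std::pow(16.0, hexDigits);
    return -std::expm1(-k * (k - 1.0) / (2.0 * N));
}

int main() {
    // Roughly 10,000 IDs stored under a truncated MD5 key.
    std::printf("12 hex digits (48 bits): %.3g\n", truncatedKeyCollisionChance(10000, 12)); // ~1.8e-7
    std::printf(" 8 hex digits (32 bits): %.3g\n", truncatedKeyCollisionChance(10000, 8));  // ~1.2e-2
}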

Jpeg huffman coding procedure

A Huffman table, in the JPEG standard, is generated from a collection of statistics in two steps. One of the steps is implementing the function/method given by this picture (the picture is given in Annex K of the JPEG standard):
The problem is here. Earlier, the standard (Annex C) says this:
Huffman tables are specified in terms of a 16-byte list (BITS) giving the number of codes for each code length from
1 to 16. This is followed by a list of the 8-bit symbol values (HUFFVAL), each of which is assigned a Huffman code.
Obviously BITS is a list of 16 elements. But in the picture above, i is first set to 32 (i = 32) and then we access BITS[i]. I have probably misunderstood something, so could someone please give me an answer?
Here is JPEG standard description of picture:
Figure K.3 gives the procedure for adjusting the BITS list so that no code is longer than 16 bits. Since symbols are paired
for the longest Huffman code, the symbols are removed from this length category two at a time. The prefix for the pair
(which is one bit shorter) is allocated to one of the pair; then (skipping the BITS entry for that prefix length) a code word
from the next shortest non-zero BITS entry is converted into a prefix for two code words one bit longer. After the BITS
list is reduced to a maximum code length of 16 bits, the last step removes the reserved code point from the code length
count.
Here is code for the picture above:
void adjustBitLengthTo16Bits(vector<char> &BITS) {
    int i = 32, j = 0;
    while (1) {
        if (BITS[i] > 0) {
            // Find the next shorter non-empty code length below i - 1.
            j = i - 1;
            j--;
            while (BITS[j] <= 0)
                j--;
            // Remove a pair of codes from length i: one of them becomes the
            // prefix at length i - 1, and one code of length j is turned into
            // a prefix for two codes of length j + 1.
            BITS[i] = BITS[i] - 2;
            BITS[i - 1] = BITS[i - 1] + 1;
            BITS[j + 1] = BITS[j + 1] + 2;
            BITS[j] = BITS[j] - 1;
            continue;
        } else {
            i--;
            if (i != 16)
                continue;
            // All lengths above 16 are now empty; remove the reserved code
            // point from the longest remaining code length.
            while (BITS[i] == 0)
                i--;
            BITS[i]--;
            return;
        }
    }
}
This code is only for encoders that want to generate their own custom Huffman tables. The majority of JPEG encoders just use fixed tables that are reasonable approximations of the statistics of most images.
In this particular case, the first step in generating a Huffman table for the AC coefficients produces a table up to 32 entries (bits) long. Since there are only 256 unique symbols to encode (skip/length pairs), there should never be more than 32 bits needed to specify all of the Huffman codes.
After the first pass has produced a set of codes (up to 32 bits in length), the second pass takes the least frequent (longest) codes and "moves" them into shorter length slots so that the maximum code length is 16 bits. In an ideal Huffman table, the frequency distributions correspond to the code lengths. In this case, the table is being made to fit by squeezing the longest codes into slots reserved for shorter codes. This can be done because the 14/15/16-bit Huffman codes have "room" for more permutations of bits and can "fit" the longer codes in them.
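To make the indexing question concrete, here is a minimal sketch of driving the routine from the question (BITS is indexed 1..32 as in Annex K, with an unused entry 0; the counts here are made up purely to exercise the code, whereas a real encoder would fill them from the first pass, including the reserved all-1s code point that the last step removes):
#include <cstdio>
#include <vector>
using std::vector;

// Definition is the routine from the question above.
void adjustBitLengthTo16Bits(vector<char> &BITS);

int main() {
    // Code-length counts from the (unconstrained) first pass:
    // index = code length in bits, entries 17..32 may still be non-zero.
    vector<char> BITS(33, 0);           // entry 0 unused, lengths 1..32
    BITS[2] = 1; BITS[5] = 3; BITS[16] = 4;
    BITS[18] = 2; BITS[20] = 2;         // codes longer than 16 bits

    adjustBitLengthTo16Bits(BITS);      // squeeze everything into lengths 1..16

    for (int len = 1; len <= 16; ++len)
        if (BITS[len] > 0)
            std::printf("%d codes of length %d\n", BITS[len], len);
}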
Update:
There is limited benefit to "optimizing" the Huffman tables in JPEG. Most of the compression occurs because of the quantization and DCT transform of the pixels. Switching to arithmetic coding has a measurable benefit (~10% size reduction), but then it limits the audience since most JPEG decoders don't support arithmetic coding due to past patent issues.

What's the biggest number in a computer?

Just asked by my 5 year old kid: what is the biggest number in the computer?
We are not talking about max number for a specific data types, but the biggest number that a computer can represent.
Infinity is not allowed.
UPDATE: my kid always wants to print as well, so let's say the computer needs to print this number and the kid needs to know that it's a big number. Of course, in practice we won't print it because there aren't enough trees.
This question is actually a very interesting one which mathematicians have devoted a fair bit of thought to. You can read about it in this article, which is a fascinating and accessible read.
Briefly, a guy named Tibor Rado set out to find some really big, but still well-defined, numbers by defining a sequence called the Busy Beaver numbers. He defined BB(n) to be the largest number of steps any halting Turing machine with n states can take when started on a blank tape. Note that this sequence is by its very nature not computable, so the numbers themselves, while well-defined, are very difficult to pin down. Here are the first few:
BB(1) = 1
BB(2) = 6
BB(3) = 21
BB(4) = 107
... wait for it ...
BB(5) >= 8,690,333,381,690,951
No one is sure how big exactly BB(5) is, but it is finite. And no one has any idea how big BB(6) and above are. But at least these numbers are completely well-defined mathematically, unlike "the largest number any human has ever thought of, plus one." ;)
So how about this:
The biggest number a computer can represent is the most instructions a program small enough to fit in its available memory can perform before halting.
Squared.
No, wait, cubed. No, raised to the power of itself!
Dammit!
Bits are not numbers. You, as a programmer, give them the meaning you want, possibly numbers.
Now, I decide that 1 represents "the biggest number ever thought by a human plus one".
Errr this is a five year old?
How about something along the lines of: "I'd love to tell you but the number is so big and would take so long to say, I'd die before I finished telling you".
// wait to see
#include <stdio.h>

int main(void) {
    for (;;) {
        printf("9");
    }
}
roughly 2^AVAILABLE_MEMORY_IN_BITS
EDIT: The above is for actually storing a number and treats all media (RAM, HD, cloud etc.) as memory. Subtracting the OS footprint (measured in KB) doesn't make "roughly" less accurate...
If you want to "represent" a number in a meaningful way, then you probably want to go with what the CPU provides: unsigned 32-bit integers (max a bit over 4 billion) or unsigned 64-bit integers for most computers your kid will come into contact with.
NOTE for talking to 5-year-olds: Often, they just want a factoid. Give him a really big and very accurate number (lots of digits), like 4'294'967'295. Then, once the glazing leaves his eyes, try to see how far you can get with explaining how computers represent numbers.
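If you want to print those exact factoids, a couple of lines of C++ will do it (this is just my illustration of the types mentioned above, not part of the original answer):
#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    // Largest values of the fixed-width integer types most CPUs provide.
    std::printf("uint32_t max: %u\n",
                (unsigned)std::numeric_limits<std::uint32_t>::max());            // 4294967295
    std::printf("uint64_t max: %llu\n",
                (unsigned long long)std::numeric_limits<std::uint64_t>::max());  // 18446744073709551615
    std::printf("double   max: %g\n", std::numeric_limits<double>::max());       // ~1.8e308
}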
EDIT #2: I once read this article: Who Can Name the Bigger Number that should provide a whole lot of interesting information for your kid. Obviously he's not your normal five-year-old. So this might get you started in a cool direction about numbers and computation.
The answer to life (and this kid's question): 42
That depends on the datatype you use to represent it. The computer only stores bits (0/1). We, as developers, give the bits meaning. (65 can be a number or the letter A).
For example, I can define my datatype as 1 × N, where N is unsigned and represented by an array of bits of arbitrary size. The next person can come up with 10 × N, which would make their biggest number ten times larger than mine.
Sure, there would be gaps but if you don't need them, that doesn't matter.
Therefore, the question is meaningless since it doesn't have context.
Well, I had the same question earlier today, so I thought why not write a little C++ program to see where the computer is going to stop...
But my laptop wasn't with me in class so I used another one. The number was too big, and it never ends; I'll run it overnight and then share the number.
You can try it; the code is silly:
#include <stdlib.h>
#include <stdio.h>

int main() {
    int i = 0;
    for (i = 0; i <= i; i++) {   /* always true, so this loops until signed overflow (undefined behaviour) */
        printf("%i\n", i);
        i++;                     /* note: i ends up incremented twice per iteration */
    }
}
And let it run till it stops ^^
The size will obviously be limited by the total size of hard drives you manage to put into your PC. After all, you can store a number in a text file occupying all disk space.
You can have 4×2 TB drives even in a simple box, so around 8 TB available. 8 TB is 64,000,000,000,000 bits, so if you store the number in binary, the biggest number is about 2^64,000,000,000,000.
If your hard drive is 1 TB (8'000'000'000'000 bits), and you would print the number that fits on it on paper as hex digits (nobody would do that, but let's assume), that's 2,000,000,000,000 hex digits.
Each page would contain 4000 hex digits (40 x 100 digits). That's 500,000,000 pages.
Now stack the pages on top of each other (let's say each page is 0.004 inches / 0.1 mm thick); the stack would then be about 50 km (roughly 31 miles) tall.
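The same back-of-the-envelope arithmetic, spelled out (just a sketch of this answer's numbers):
#include <cstdio>

int main() {
    double bits        = 8e12;                // 1 TB ~= 8,000,000,000,000 bits
    double hexDigits   = bits / 4;            // 4 bits per hex digit
    double pages       = hexDigits / 4000;    // 40 x 100 digits per page
    double stackMeters = pages * 0.0001;      // 0.1 mm per page
    std::printf("%.0f hex digits, %.0f pages, stack ~%.0f km tall\n",
                hexDigits, pages, stackMeters / 1000);   // 2e12 digits, 5e8 pages, ~50 km
}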
I'll try to give a practical answer.
Common Lisp number crunching is particularly powerful. It has something called "bignums", which are integers that can be arbitrarily large, limited only by the amount of available memory.
See: http://en.wikibooks.org/wiki/Common_Lisp/Advanced_topics/Numbers#Fixnums_and_Bignums
I don't know much about theory, but as far as I understood from your question, it is: what is the largest number that the computer can represent (and I add: in a reasonable time, and not by printing "9" until the Earth is "eaten by the Sun"). So I put my PC to make one simple calculation (in PHP or whatever language): echo pow(2,1023) - resulting in: 8.9884656743116E+307. So I guess this is the largest number that my PC can calculate. On the other hand, I think the representation of the largest negative number can be: -0,(0)1
LE: That computed value was obtained through PHP, but I tried to figure out what's the largest number that my Windows calculator can compute, and it is pow(2, 33219) = 8.2304951207588748764521361245002E+9999. Now I guess this is the largest number my PC can handle.
I think you should be very proud that your 5 year old is already asking questions like this, and you should continue to encourage that! This is truly amazing! With that said, I would say that ruling out Infinity reflects a misunderstanding of what numbers mean in computer memory. I feel like this way of thinking is a handicap.
Mathematicians will never be able to write out ALL the digits of pi or Euler's number, BUT we FULLY understand them. Pi, as an example, is perfectly represented by this infinite series: (Pi / 4) = 1 - 1/3 + 1/5 - 1/7 + 1/9 - …
Just because you literally can't go to infinity, or print every single digit in a console, means nothing. You could have printed the symbol representing pi, thereby capturing the infinite series.
Computer Algebra Systems (CAS) represent numbers symbolically all the time. Pi, for instance, may be a symbolic object in memory (the binary in memory does not DIRECTLY represent the number; it represents a "mathematical algorithm" for producing the answer to arbitrary precision). Then you do some math with it, transforming one expression into the next. At no point in time did we fail to represent the number COMPLETELY.
At the end, you can do 2 things with this:
A) Evaluate the expression, turning it into a number of some kind (or a Matrix or whatever). BUT this number could very well be an approximation (say, 20 digits of pi).
B) Keep it in its symbolic form for reference. Obviously we don't like staring at symbols, because eventually we need to turn the knobs on the apparatus.
NOTE: sometimes you can get a finite (rational) number perfectly represented in memory (like the number 1) by taking limits or going to infinity - not by literally having an infinite number in memory, but by representing it symbolically. Just throw this into Wolfram Alpha: Lim[Exp[-x], x --> Inf]. It gives you the number 0, which is EXACT.
In short:
It was the HUMAN need to have some binary in memory that DIRECTLY represented the number that caused the number to degrade. Symbolically it was perfectly represented. You could design some algorithm that just keeps calculating the next digits of pi or Euler's number, giving you an arbitrary amount of precision (though this is obviously not practical).
I hope this was at least somewhat useful or interesting to you, even if you disagree =)
It depends on how much the computer can handle, although there are sometimes cases where the computer can handle numbers greater than 2^(bits-1) - 1... For example:
My computer is 64-bit (9223372036854775807), yet the calculator that comes with it can handle numbers of up to 10^9999.
Many supercomputers can exceed these limits, and the one with the most memory (bits) might well be the one with the record (the current largest number that can be held by a computer).
Or, if it comes down to visually seeing it on a computer, you can just make a program that keeps writing 9s to the monitor on the same line, forming an ever-growing string of 9s. :P
Go to Chrome, click the three dots at the top, go to Tools and then Developer Tools, click on Console, and type Number.MAX_VALUE.

Why are 5381 and 33 so important in the djb2 algorithm?

The djb2 algorithm is a hash function for strings:
unsigned long hash(unsigned char *str) {
    unsigned long hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}
Why are 5381 and 33 so important?
This hash function is similar to a Linear Congruential Generator (LCG - a simple class of functions that generate a series of pseudo-random numbers), which generally has the form:
X = (a * X) + c; // "mod M", where M = 2^32 or 2^64 typically
Note the similarity to the djb2 hash function... a=33, M=2^32. In order for an LCG to have a "full period" (i.e. as random as it can be), a must have certain properties:
a-1 is divisible by all prime factors of M (a-1 is 32, which is divisible by 2, the only prime factor of 2^32)
a-1 is a multiple of 4 if M is a multiple of 4 (yes and yes)
In addition, c and M are supposed to be relatively prime (which will be true for odd values of c).
So as you can see, this hash function somewhat resembles a good LCG. And when it comes to hash functions, you want one that produces a "random" distribution of hash values given a realistic set of input strings.
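To make the comparison concrete, here is the LCG skeleton next to the djb2 update step (a rough illustration on my part; the constants other than 33 and 5381 are arbitrary):
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint32_t a = 33;

    // LCG: X = (a * X + c) mod 2^32, with a fixed increment c.
    std::uint32_t x = 1;
    for (int i = 0; i < 4; ++i) {
        x = a * x + 1;                         // wraps mod 2^32 automatically
        std::printf("LCG step %d:  %u\n", i, (unsigned)x);
    }

    // djb2: the same shape, but the "increment" is the next input byte.
    std::uint32_t h = 5381;
    for (unsigned char ch : {'a', 'b', 'c', 'd'}) {
        h = a * h + ch;                        // same as ((h << 5) + h) + ch
        std::printf("djb2 step:   %u\n", (unsigned)h);
    }
}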
As for why this hash function is good for strings, I think it has a good balance of being extremely fast, while providing a reasonable distribution of hash values. But I've seen many other hash functions which claim to have much better output characteristics, but involved many more lines of code. For instance see this page about hash functions
EDIT: This good answer explains why 33 and 5381 were chosen for practical reasons.
33 was chosen because:
1) As stated before, multiplication is easy to compute using shift and add.
2) As you can see from the shift and add implementation, using 33 makes two copies of most of the input bits in the hash accumulator, and then spreads those bits relatively far apart. This helps produce good avalanching. Using a larger shift would duplicate fewer bits, using a smaller shift would keep bit interactions more local and make it take longer for the interactions to spread.
3) The shift of 5 is relatively prime to 32 (the number of bits in the register), which helps with avalanching. While there are enough characters left in the string, each bit of an input byte will eventually interact with every preceding bit of input.
4) The shift of 5 is a good shift amount when considering ASCII character data. An ASCII character can sort of be thought of as a 4-bit character type selector and a 4-bit character-of-type selector. E.g. the digits all have 0x3 in the first 4 bits. So an 8-bit shift would cause bits with a certain meaning to mostly interact with other bits that have the same meaning. A 4-bit or 2-bit shift would similarly produce strong interactions between like-minded bits. The 5-bit shift causes many of the four low order bits of a character to strongly interact with many of the 4-upper bits in the same character.
As stated elsewhere, the choice of 5381 isn't too important and many other choices should work as well here.
This is not a fast hash function, since it processes its input a character at a time and doesn't try to use instruction-level parallelism. It is, however, easy to write. Quality of the output divided by ease of writing the code is likely to hit a sweet spot.
On modern processors, multiplication is much faster than it was when this algorithm was developed and other multiplication factors (e.g. 2^13 + 2^5 + 1) may have similar performance, slightly better output, and be slightly easier to write.
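In shift-and-add terms, the two multipliers mentioned here look like this (a quick sketch; 8225 = 2^13 + 2^5 + 1 is the alternative factor from the paragraph above, and the function names are mine):
#include <cstdint>
#include <cstdio>

// hash * 33 written as shifts and adds (33 = 2^5 + 1).
static std::uint32_t step33(std::uint32_t h, unsigned char c) {
    return ((h << 5) + h) + c;
}

// hash * 8225, the alternative factor 2^13 + 2^5 + 1.
static std::uint32_t step8225(std::uint32_t h, unsigned char c) {
    return ((h << 13) + (h << 5) + h) + c;
}

int main() {
    std::uint32_t h33 = 5381, h8225 = 5381;
    for (unsigned char c : {'h', 'e', 'l', 'l', 'o'}) {
        h33 = step33(h33, c);
        h8225 = step8225(h8225, c);
    }
    std::printf("x33:   %u\nx8225: %u\n", (unsigned)h33, (unsigned)h8225);
}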
Contrary to an answer above, a good non-cryptographic hash function doesn't want to produce a random output. Instead, given two inputs that are nearly identical, it wants to produce widely different outputs. If your input values are randomly distributed, you don't need a good hash function; you can just use an arbitrary set of bits from your input. Some of the modern hash functions (Jenkins 3, Murmur, probably CityHash) produce a better distribution of outputs than random given inputs that are highly similar.
On 5381, Dan Bernstein (djb2) says in this article:
[...] practically any good multiplier works. I think you're worrying
about the fact that 31c + d doesn't cover any reasonable range of hash
values if c and d are between 0 and 255. That's why, when I discovered
the 33 hash function and started using it in my compressors, I started
with a hash value of 5381. I think you'll find that this does just as
well as a 261 multiplier.
The whole thread is here if you're interested.
Ozan Yigit has a page on hash functions which says:
[...] the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Maybe because 33 == 2^5 + 1 and many hashing algorithms use 2^n + 1 as their multiplier?
Credit to Jerome Berger
Update:
This seems to be borne out by the current version of the software package djb2 originally came from: cdb
The notes I linked to describe the heart of the hashing algorithm as using h = ((h << 5) + h) ^ c to do the hashing... x << 5 is a fast hardware way to use 2^5 as the multiplier.
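For comparison with the additive version in the question, here is what that cdb-style update looks like as a complete function (my own transcription of the quoted step, not code taken from cdb):
#include <cstdint>
#include <cstdio>

// XOR variant described in the cdb notes: h = ((h << 5) + h) ^ c,
// i.e. (h * 33) XOR the next byte, rather than (h * 33) + the next byte.
std::uint32_t hash_xor33(const unsigned char *str) {
    std::uint32_t h = 5381;
    int c;
    while ((c = *str++))
        h = ((h << 5) + h) ^ (std::uint32_t)c;
    return h;
}

int main() {
    std::printf("%u\n", (unsigned)hash_xor33((const unsigned char *)"hello"));
}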