How was it figured out that Huffman codes can be decoded unambiguously? - encoding

I have just read this:
This is where a really smart idea called Huffman coding comes in! The idea is that we represent our characters (like a, b, c, d, ….) with codes like
a: 00
b: 010
c: 011
d: 1000
e: 1001
f: 1010
g: 1011
h: 1111
If you look at these carefully, you’ll notice something special! It’s that none of these codes is a prefix of any other code. So if we write down 010001001011 we can see that it’s 010 00 1001 011 or baec! There wasn’t any ambiguity, because 0 and 01 and 0100 don’t mean anything.
I get the gist of this, but I don't understand (a) how it was figured out, and (b) how you know it works, or (c) exactly what it means. Specifically this line describes it:
So if we write down 010001001011 we can see that it’s 010 00 1001 011....
I see that those are the codes, but I don't understand how you know not to read it as 0100 01 0010 11. I see that these values aren't actually codes in the table. However, I don't see how you would ever figure this out! I would like to know how to discover this. If I were trying to tinker with codes and bits like this, I would do this:
Come up with a set of codes, like 10 100 1000 101 1001
Try writing out some examples of the codes. So maybe an example is just concatenating the codes in order above: 1010010001011001.
See if I can parse the codes. So 10 or oops, nope 101 also... Darnit, well maybe I can add a priority to the parsing of the code, and so 10 is higher priority than 101. That gets me to 10 100 1000 10 x nope that last 10 should be 101. Dangit.
So I would try adding different features like that priority function, or other things I can't think of at the moment, to see if it would help solve the problem.
I can't imagine how anyone would figure out that the codes in Huffman coding can be uniquely parsed (I still don't see how it's actually true; I would have to write out some examples to see it, or... that's part of the question: how to see that it's true, and how to prove it). I'm wondering if someone could explain in more depth how it's proven to work, and how it was discovered (or how to discover something similar on your own).

Huffman coding works by laying out the codes in a binary tree. You can associate every leaf with a code by saying that the left child corresponds to a 0 bit and the right child to a 1 bit. The path that leads from the root to a leaf then corresponds to a code in an unambiguous way.
This works for any tree, and the prefix property comes from the fact that a leaf is terminal. Hence, you cannot reach a leaf (have a code) by passing through another leaf (by having another code as a prefix).
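To make the prefix property concrete, here is a small Python sketch (not part of the original answer) that builds the tree for the code table in the question and decodes the example string by walking from the root; the dictionary-based tree is just one convenient representation.

# Build a binary tree from the question's code table: 0 = left, 1 = right,
# and each code ends at a leaf that stores the decoded character.
codes = {
    'a': '00',   'b': '010',  'c': '011',  'd': '1000',
    'e': '1001', 'f': '1010', 'g': '1011', 'h': '1111',
}

root = {}
for char, code in codes.items():
    node = root
    for bit in code:
        node = node.setdefault(bit, {})
    node['char'] = char  # mark the leaf

def decode(bits):
    out, node = [], root
    for bit in bits:
        node = node[bit]        # follow the edge labelled by this bit
        if 'char' in node:      # reached a leaf: emit the symbol, restart at the root
            out.append(node['char'])
            node = root
    return ''.join(out)

print(decode('010001001011'))   # -> 'baec', with no ambiguity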
The basic idea of Huffman coding is that you can build the tree in such a way that the depth of every leaf reflects the probability of appearance of its symbol (symbols more likely to appear will be closer to the root, and so get shorter codes).
There are several algorithms to build such a tree. For instance, assume you have a set of items you want to code, say a..f. You must know the probability of appearance of every item, thanks to either a model of the source or an analysis of the actual values (for instance by analysing the file to encode).
Then you can:
sort the items by probability
pick the two items with the lowest probability
remove these items, group them in a new compound node, and assign one item to the left child (bit 0) and the other to the right child (bit 1)
the probability of the compound node is the sum of the individual probabilities; insert this new node into the sorted item list
go to step 2 while the number of items is > 1
For the previous example, this could correspond to the set of probabilities
a (0.5) b (0.2) c (0.1) d (0.05) e (0.05) f (0.1)
Then you pick items with the lowest probability (d and e), group them in a compound node (de) and get the new list
a (0.5) b (0.2) c (0.1) (de) (0.1) f (0.1)
And the successive item lists can be
a (0.5) b (0.2) (c(de)) (0.2) f (0.1)
a (0.5) b (0.2) (c(de))f (0.3)
a (0.5) b((c(de))f) (0.5)
a(b((c(de))f)) (1.0)
So the prefix property is ensured by construction.
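For reference, here is a rough Python sketch of that construction (not part of the original answer); huffman_codes is a made-up helper, and the exact codes it prints may differ from the hand-built grouping above depending on how ties between equal probabilities are broken.

# Repeatedly merge the two least probable nodes until one tree remains,
# then read each symbol's code off its path from the root.
import heapq
from itertools import count

def huffman_codes(probabilities):
    tiebreak = count()   # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two lowest-probability nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # compound node: recurse into children
            walk(node[0], prefix + '0')   # left child gets bit 0
            walk(node[1], prefix + '1')   # right child gets bit 1
        else:
            codes[node] = prefix          # leaf: record the finished code
    walk(root, '')
    return codes

print(huffman_codes({'a': 0.5, 'b': 0.2, 'c': 0.1, 'd': 0.05, 'e': 0.05, 'f': 0.1}))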

Related

How can I improve my BF factorial code to input large numbers and output the whole result?

I recently started doing some BF again for fun, and today I wrote a factorial program which, as far as I know, is different from a lot of those I found on the net. I only need five cells to compute it, but unfortunately I can't input a number such as, for example, 100.
I'd like to know, if someone has an idea, what I could do to improve my code so that it can handle that.
EDIT : A, B, C, D and E are the cells
++++ #For example we put four in input
[->+<] #Put A in B and A=0
[-] #If A=0 (which is true)
>- #Decrease B by one
[->+>+>+<<<] #Put B in C, D and E (at the end pointer is on B)
> #Move on C
[-<+>] #Put C in B and C=0
<+ #Add one in B
>>[-<<<+>>>] #Put D in A and D=0
> #Pointer move on E
[ #While
- #E is not null
<<<<
[->[->+>+<<]>[-<+>]<<]>[-]>>[-<<+>>] #Do A*B and put the result in B
>
[-<+<<<+>>>>] #Put E in A and D
<
[->+<] #Then put D in E
> #pointer goes on E to test the while condition
] #While end
<<< #If E is null go back to cell B
[-<+>] #Put B in A
< #Pointer on A at the end
Thank you in advance for your answers!
If you need help visualising the code step by step, go there.
The best interpreter I use is this one: you can run BF with 32-bit cells and count the number of operations. It needs more than two thousand billion operations to compute the factorial of 19... but it is really fast.
Well, you would need to implement some sort of bigint functionality. How to do so won't fit in an answer here (read: I don't know), considering that the only unbounded data structure that's easy to implement in BF is a single stack placed after any other cells you might want to use.
You could implement, for example, 16-bit operations in terms of multiple 8-bit operations (likewise 32-bit in terms of 16-bit, and so on), but factorials grow really, really fast, and not even a 64-bit value can hold 100! (which is about 9.3e157, meaning you would need at least 525 bits to represent it).
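If it helps to see what that amounts to, here is a Python sketch (purely illustrative, not BF) of a big number stored as 8-bit limbs and multiplied with explicit carry handling, which is roughly the arithmetic a BF bigint would have to emulate cell by cell.

# A big number as a list of 8-bit cells (base 256, least significant first),
# multiplied in place by a small factor with manual carry propagation.
def multiply_limbs(limbs, factor):
    carry = 0
    for i in range(len(limbs)):
        value = limbs[i] * factor + carry
        limbs[i] = value % 256      # what stays in this 8-bit cell
        carry = value // 256        # what spills over into the next cell
    while carry:                    # grow the number once it no longer fits
        limbs.append(carry % 256)
        carry //= 256
    return limbs

limbs = [1]
for n in range(2, 101):             # 100! built one small multiplication at a time
    multiply_limbs(limbs, n)
print(len(limbs), "cells")          # 66 eight-bit cells (100! is about 525 bits)
print(int.from_bytes(bytes(limbs), "little"))   # equals 100!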
Good luck implementing bigints in BF

Decomposition into ABC & CDE and preserving functional dependencies

Consider a relation R with five attributes ABCDE. Now
assume that R is decomposed into two smaller relations ABC and CDE.
Define S to be the relation (ABC NaturalJoin CDE).
a) Assume that the above decomposition is lossless join. What is the
dependency that guarantees the lossless join property.
b) Give an additional FD such that “dependency preserving” property is
violated by this decomposition.
c) Give two additional FD's that would be preserved by this
decomposition.
The question seems different to me because there is no FD given, and it's asking:
a)
R1 = (A,B,C), R2 = (C,D,E), R1 ∩ R2 = C (how can I check the dependency now?)
F1' = {A->B,A->C,B->C,B->A,C->A,C->B,AB->C,AC->B,BC->A...}
F2' = {C->D,C->E,D->E....}
then I will find F'?
b, c) How do I check? Do I need to look at all possible FDs for R1 and R2?
The question is definitely assuming things it hasn't said clearly. ABCDE could be subject to the JD *{ABC,CDE} while not being subject to any nontrivial FDs at all.
But suppose that the relation is subject to some FDs and isn't subject to any JDs other than ones that they imply. If C is a CK then the join is lossless. But then C -> ABCDE holds, because a CK determines all attributes, and C -> ABDE holds, because a CK determines all other attributes. No other FD holding would imply that the join is lossless, although that requires tedium (by looking at every possible case of CK) or inspiration to show.
Both of these FDs guarantee losslessness, although if one of them holds the other holds, and they express the same condition. So the question is sloppy. Or the question might consider that the two expressions express the same FD in the sense of a condition, but an FD is an expression and not a condition, so that would also be sloppy.
I suspect that the questioner really just wanted you to give some FD whose holding would guarantee losslessness. That would get rid of the complications.
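To see the losslessness condition at work, here is a small Python sketch with a made-up instance of R: when C determines the remaining attributes, projecting onto ABC and CDE and joining back on C returns exactly R.

# A made-up instance in which each C value determines the rest of the tuple.
R = {
    ('a1', 'b1', 'c1', 'd1', 'e1'),
    ('a2', 'b2', 'c2', 'd2', 'e2'),
}

ABC = {(a, b, c) for (a, b, c, d, e) in R}
CDE = {(c, d, e) for (a, b, c, d, e) in R}

joined = {(a, b, c, d, e)
          for (a, b, c) in ABC
          for (c2, d, e) in CDE
          if c == c2}

print(joined == R)   # True: no tuples lost, no spurious tuples added

# If the second tuple used 'c1' instead of 'c2' (so C no longer determines
# the other attributes), the join would gain spurious tuples and be lossy.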

Genetic-algorithm encoding

I am trying to create an algorithm which I believe is similar to a knapsack problem. The problem is to find recipes/bills of materials for certain intermediate products. There are different alternative recipes for the intermediate products. For example, product X can consist of either 25% raw material A + 75% raw material B, or 50% raw material A + 50% raw material B, etc. There are between 1 and 100 different alternatives for each recipe.
My question is how best to encode the different recipe alternatives (and/or where to find similar problems on the internet). I think I have to use value encoding, i.e. assign a value to each alternative of a recipe. Are there other reasonable options?
Thanks & kind regards
You can encode the problem with a number chromosome. If your product has N ingredients, then your number chromosome has length N: X = {x1, x2, ..., xN}. Every number xi of the chromosome represents the parts of ingredient i. It is not required that the numbers sum to one.
E.g. X = {23, 5, 0} means you need 23 parts of ingredient 1, 5 parts of ingredient 2, and zero parts of ingredient 3.
With this encoding, crossover will not invalidate the chromosome.
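A small Python sketch of this encoding (the helper names are made up): the parts are only turned into fractions when a recipe is evaluated, so crossover between valid chromosomes always yields a valid chromosome.

import random

def random_chromosome(n_ingredients, max_parts=100):
    # one non-negative "parts" value per ingredient
    return [random.randint(0, max_parts) for _ in range(n_ingredients)]

def as_fractions(chromosome):
    # normalize only when evaluating the recipe
    total = sum(chromosome)
    return [x / total for x in chromosome] if total else chromosome

def one_point_crossover(parent_a, parent_b):
    cut = random.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:]

x = [23, 5, 0]              # 23 parts of ingredient 1, 5 of ingredient 2, none of 3
print(as_fractions(x))      # -> [0.821..., 0.178..., 0.0]
print(one_point_crossover(x, [10, 10, 10]))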
You can use a 100-dimensional variable to represent an individual, like below:
X = {x1, x2, x3, ..., x100}, xi ∈ [0,1], ∑(xi) = 1.0
It's hard to use a crossover operation here, so I suggest that the offspring just be produced by a mutation operation.
Mutation operation on a parent individual X (a sketch follows below):
(1) randomly choose two dimensions xi and xj from X;
(2) p = rand(0,1);
(3) xj = xj + (1-p)*xi;
(4) xi = xi*p;
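Here is a Python sketch of that mutation (illustrative only): it moves a random share of weight from one dimension to another, so the components still sum to 1 afterwards.

import random

def mutate(x):
    child = list(x)
    i, j = random.sample(range(len(child)), 2)   # two distinct dimensions
    p = random.random()
    child[j] = child[j] + (1 - p) * child[i]     # xj receives the moved share
    child[i] = child[i] * p                      # xi keeps the rest
    return child

x = [0.25, 0.50, 0.25]
y = mutate(x)
print(y, sum(y))   # the sum stays 1.0 (up to floating-point error)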

Simplifying a 9 variable boolean expression

I am trying to create a tic-tac-toe program as a mental exercise and I have the board states stored as booleans like so:
http://i.imgur.com/xBiuoAO.png
I would like to simplify this boolean expression...
(a&b&c) | (d&e&f) | (g&h&i) | (a&d&g) | (b&e&h) | (c&f&i) | (a&e&i) | (g&e&c)
My first thoughts were to use a Karnaugh Map but there were no solvers online that supported 9 variables.
And here's the question:
First of all, how would I know if a boolean condition is already as simple as possible?
and second: What is the above boolean condition simplified?
2. Simplified condition:
The original expression
a&b&c|d&e&f|g&h&i|a&d&g|b&e&h|c&f&i|a&e&i|g&e&c
can be simplified to the following, knowing that & has higher precedence than |
e&(d&f|b&h|a&i|g&c)|a&(b&c|d&g)|i&(g&h|c&f)
which is 4 characters shorter and performs at worst 18 & and | evaluations (the original one had 23)
There is no shorter boolean formula (see point below). If you switch to matrices, maybe you can find another solution.
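If you would rather not trust the algebra, a brute-force check over all 2^9 assignments is cheap; here is a throwaway Python sketch (not part of the answer) confirming the two formulas agree.

from itertools import product

def original(a, b, c, d, e, f, g, h, i):
    return a&b&c | d&e&f | g&h&i | a&d&g | b&e&h | c&f&i | a&e&i | g&e&c

def simplified(a, b, c, d, e, f, g, h, i):
    return e&(d&f | b&h | a&i | g&c) | a&(b&c | d&g) | i&(g&h | c&f)

print(all(original(*bits) == simplified(*bits)
          for bits in product([0, 1], repeat=9)))   # True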
1. Making sure we got the smallest formula
Normally, it is very hard to find the smallest formula. See this recent paper if you are more interested. But in our case, there is a simple proof.
We will reason about a formula being the smallest with respect to the formula size, where for a variable a, size(a)=1, for a boolean operation size(A&B) = size(A|B) = size(A) + 1 + size(B), and for negation size(!A) = size(A) (thus we can suppose that we have Negation Normal Form at no cost).
With respect to that size, our formula has size 37.
The proof that you cannot do better consists in first remarking that there are 8 rows to check, and that there is always a pair of letters distinguishing 2 different rows. Since we can regroup these 8 checks into no fewer than 3 conjuncts sharing the remaining variable, the number of variables in the final formula must be at least 8*2+3 = 19, from which we can deduce the minimal tree size.
Detailed proof
Let us suppose that a given formula F is the smallest and in NNF format.
F cannot contain negated variables like !a. To see this, remark that F should be monotonic; that is, if it returns "true" (there is a winning row), then changing one of the variables from false to true should not change that result. According to Wikipedia, F can be written without negation. Even better, we can prove that we can remove the negations. Following this answer, we could convert to and from DNF format, removing negated variables in the middle or replacing them by true.
F cannot contain a sub-tree like a disjunction of two variables a|b.
For this formula to be useful and not exchangeable with either a or b, it would mean that there are contradicting assignments such that for example
F[a|b] = true and F[a] = false, therefore that a = false and b = true because of monotonicity. Also, in this case, turning b to false makes the whole formula false because false = F[a] = F[a|false] >= F[a|b](b = false).
Therefore there is a row passing through b which is the cause of the truth, and it cannot go through a, hence for example e = true and h = true.
And the checking of this row passes through the expression a|b for testing b. However, it means that with a, e, h being true and all others set to false, F is still true, which contradicts the purpose of the formula.
Every subtree looking like a&b checks a unique row. So the last letter should appear just above the corresponding disjunction (a&b|...)&{c somewhere for sure here}, or this leaf is useless and either a or b can be removed safely. Indeed, suppose that c does not appear above, and the game is one where a&b&c is true and all other variables are false. Then the expression where c is supposed to be above returns false, so a&b will always be useless. So there is a shorter expression obtained by removing a&b.
There are 8 independent branches, so there are at least 8 subtrees of type a&b. We cannot regroup them using a disjunction of only 2 conjunctions, since a, f and h never share the same rows, so there must be at least 3 outer variables. 8*2+3 makes 19 variables appear in the final formula.
A tree with 19 variables cannot have fewer than 18 operators, so in total the size has to be at least 19+18 = 37.
You can have variants of the above formula.
QED.
One option is doing the Karnaugh map manually. Since you have 9 variables, that makes for a 2^4 by 2^5 grid, which is rather large, and by the looks of the equation, probably not very interesting either.
By inspection, it doesn't look like a Karnaugh map will give you any useful information (Karnaugh maps basically reduce expressions such as ((!a)&b) | (a&b) into b), so in that sense of simplification, your expression is already as simple as it can get. But if you want to reduce the number of computations, you can factor out a few variables using the distributivity of the AND operators over ORs.
The best way to think of this is how a person would think of it. No person would say to themselves, "a and b and c, or if d and e and f," etc. They would say "Any three in a row, horizontally, vertically, or diagonally."
Also, instead of doing eight checks (3 rows, 3 columns, and 2 diagonals), you can do just four checks (three rows and one diagonal), then rotate the board 90 degrees, then do the same checks again.
Here's what you end up with. These functions all assume that the board is a three-by-three matrix of booleans, where true represents a winning symbol, and false represents a not-winning symbol.
def win?(board)
  winning_row_or_diagonal?(board) ||
    winning_row_or_diagonal?(rotate_90(board))
end

def winning_row_or_diagonal?(board)
  winning_row?(board) || winning_diagonal?(board)
end

def winning_row?(board)
  3.times.any? do |row_number|
    three_in_a_row?(board, row_number, 0, 0, 1)  # walk across one row
  end
end

def winning_diagonal?(board)
  three_in_a_row?(board, 0, 0, 1, 1)             # top-left to bottom-right
end

def three_in_a_row?(board, x, y, delta_x, delta_y)
  3.times.all? do |i|
    board[x + i * delta_x][y + i * delta_y]
  end
end

def rotate_90(board)
  board.transpose.map(&:reverse)
end
The matrix rotate is from here: https://stackoverflow.com/a/3571501/238886
Although this code is quite a bit more verbose, each function is clear in its intent. Rather than a long boolean expression, the code now expresses the rules of tic-tac-toe.
You know it's as simple as possible when there are no common sub-terms to extract (e.g. if you had "a&b" in two different trios).
You know your tic tac toe solution must already be as simple as possible because any pair of boxes can belong to at most only one winning line (only one straight line can pass through two given points), so (a & b) can't be reused in any other win you're checking for.
(Also, "simple" can mean a lot of things; specifying what you mean may help you answer your own question.)

How to read the classifier confusion matrix in WEKA

Sorry, I am new to WEKA and just learning.
In my decision tree (J48) classifier output, there is a confusion matrix:
   a    b   <-- classified as
 130    8   a = functional
  15  150   b = non-functional
How do I read this matrix? What's the difference between a & b?
Also, can anyone explain to me what domain values are?
Have you read the Wikipedia page on confusion matrices? The text around the matrix is arranged slightly differently in their example (row labels on the left instead of on the right), but you read it just the same.
The row indicates the true class, the column indicates the classifier output. Each entry, then, gives the number of instances of <row> that were classified as <column>. In your example, 15 Bs were (incorrectly) classified as As, 150 Bs were correctly classified as Bs, etc.
As a result, all correct classifications are on the top-left to bottom-right diagonal. Everything off that diagonal is an incorrect classification of some sort.
Edit: The Wikipedia page has since switched the rows and columns around. This happens. When studying a confusion matrix, always make sure to check the labels to see whether it's true classes in rows, predicted class in columns or the other way around.
I'd put it this way:
The confusion matrix is Weka reporting on how good this J48 model is in terms of what it gets right, and what it gets wrong.
In your data, the target variable was either "functional" or "non-functional;" the right side of the matrix tells you that column "a" is functional, and "b" is non-functional.
The columns tell you how your model classified your samples - it's what the model predicted:
The first column contains all the samples which your model thinks are "a" - 145 of them, total
The second column contains all the samples which your model thinks are "b" - 158 of them
The rows, on the other hand, represent reality:
The first row contains all the samples which really are "a" - 138 of them, total
The second row contains all the samples which really are "b" - 165 of them
Knowing the columns and rows, you can dig into the details:
Top left, 130, are things your model thinks are "a" which really are "a" <- these were correct
Bottom left, 15, are things your model thinks are "a" but which are really "b" <- one kind of error
Top right, 8, are things your model thinks are "b" but which really are "a" <- another kind of error
Bottom right, 150, are things your model thinks are "b" which really are "b"
So top-left and bottom-right of the matrix are showing things your model gets right.
Bottom-left and top-right of the matrix are showing where your model is confused.
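As a quick sanity check of the numbers above, here is a small Python sketch (not Weka output) that recomputes them, assuming true classes in rows and predicted classes in columns:

confusion = [[130,   8],    # true "a" (functional):     130 predicted a, 8 predicted b
             [ 15, 150]]    # true "b" (non-functional):  15 predicted a, 150 predicted b

total   = sum(sum(row) for row in confusion)                          # 303 samples
correct = confusion[0][0] + confusion[1][1]                           # the diagonal
print("accuracy:", correct / total)                                   # 280/303, about 0.92

precision_a = confusion[0][0] / (confusion[0][0] + confusion[1][0])   # 130/145
recall_a    = confusion[0][0] / (confusion[0][0] + confusion[0][1])   # 130/138
print("precision(a):", precision_a, "recall(a):", recall_a)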