I have been using Weka’s J48 decision tree to classify frequencies of keywords
in RSS feeds into target categories. And I think I may have a problem
reconciling the generated decision tree with the number of correctly classified
instances reported and in the confusion matrix.
For example, one of my .arff files contains the following data extracts:
#attribute Keyword_1_nasa_Frequency numeric
#attribute Keyword_2_fish_Frequency numeric
#attribute Keyword_3_kill_Frequency numeric
#attribute Keyword_4_show_Frequency numeric
...
#attribute Keyword_64_fear_Frequency numeric
#attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}
#data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
...
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,
0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,S
And so on: there’s a total of 64 keywords (columns) and 570 rows where each one contains the frequency of a keyword in a feed for a day. In this case, there are 57 feeds for
10 days giving a total of 570 records to be classified. Each keyword is prefixed
with a surrogate number and postfixed with ‘Frequency’.
My use of the decision tree is with default parameters using 10x validation.
Weka reports the following:
Correctly Classified Instances 210 36.8421 %
Incorrectly Classified Instances 360 63.1579 %
With the following confusion matrix:
=== Confusion Matrix ===
a b c d e f g <-- classified as
11 0 0 0 39 0 0 | a = BFE
0 0 0 0 60 0 0 | b = FCL
1 0 5 0 72 0 2 | c = F
0 0 1 0 69 0 0 | d = M
3 0 0 0 153 0 4 | e = NCA
0 0 0 0 90 10 0 | f = SNT
0 0 0 0 19 0 31 | g = S
The tree is as follows:
Keyword_22_health_Frequency <= 0
| Keyword_7_open_Frequency <= 0
| | Keyword_52_libya_Frequency <= 0
| | | Keyword_21_job_Frequency <= 0
| | | | Keyword_48_pic_Frequency <= 0
| | | | | Keyword_63_world_Frequency <= 0
| | | | | | Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
| | | | | | Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
| | | | | Keyword_63_world_Frequency > 0
| | | | | | Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
| | | | | | Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
| | | | Keyword_48_pic_Frequency > 0: F (7.0)
| | | Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
| | Keyword_52_libya_Frequency > 0: NCA (31.0)
| Keyword_7_open_Frequency > 0
| | Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
| | Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)
My question concerns reconciling the matrix to the tree or vice versa. As far as
I understand the results, a rating like (461.0/343.0) indicates that 461 instances have been classified as NCA. But how can that be when the matrix reveals only 153? I am
not sure how to interpret this so any help is welcome.
Thanks in advance.
The number in parentheses at each leaf should be read as (number of total instances of this classification at this leaf / number of incorrect classifications at this leaf).
In your example for the first NCA leaf, it says there are 461 test instances that were classified as NCA, and of those 461, there were 343 incorrect classifications. So there are 461-343 = 118 correctly classified instances at that leaf.
Looking through your decision tree, note that NCA is also at other leaves. I count 118 + 3 + 31 + 4 = 156 correctly classified instances out of 461 + 3 + 31 + 4 = 499 total classifications of NCA.
Your confusion matrix shows 153 correct classifications of NCA out of 39 + 60 + 72 + 69 + 153 + 90 + 19 = 502 total classifications of NCA.
So there is a slight difference between the tree (156/499) and your confusion matrix (153/502).
Note that if you are running Weka from the command-line, it outputs a tree and a confusion matrix for testing on all the training data and also for testing with cross-validation. Be careful that you are looking at the right matrix for the right tree. Weka outputs both training and test results, resulting in two pairs of matrix and tree. You may have mixed them up.
I'm working with some tools, and the only way it can determine if a particular transaction is successful is if it passes various checks. However, it is limited in the way that it can only do one check at a time, and it must be sequential. Everything must be computed from left to right.
For example,
A || C && D
It will be computed with A || C first, and then the result will be AND'ed with D.
It gets tougher with parenthesis. I am unable to compute an expression like this, since B || C would need to be compututed first. I cannot work with any order of operations;
A && ( B || C)
I think I've worked this down to this sequential boolean expression,
C || B && A
Where C || B is computed first, then that result is AND'd with A
Can all boolean expressions be successfully worked into a sequential boolean expression? (Like the example I have)
The answer is no:
Consider A || B && C || D which has the truth table:
A | B | C | D |
0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 1 | 0
0 | 0 | 1 | 0 | 0
0 | 0 | 1 | 1 | 0
0 | 1 | 0 | 0 | 0
0 | 1 | 0 | 1 | 1
0 | 1 | 1 | 0 | 1
0 | 1 | 1 | 1 | 1
1 | 0 | 0 | 0 | 0
1 | 0 | 0 | 1 | 1
1 | 0 | 1 | 0 | 1
1 | 0 | 1 | 1 | 1
1 | 1 | 0 | 0 | 0
1 | 1 | 0 | 1 | 1
1 | 1 | 1 | 0 | 1
1 | 1 | 1 | 1 | 1
If it were possible to evaluate sequentially there would have to be a last expression which would be one of two cases:
Case 1:
X || Y such that Y is one of A,B,C,D and X is any sequential boolean expression.
Now, since there is no variable in A,B,C,D where the entire expression is true whenever that variable is true, none of:
X || A
X || B
X || C
X || D
can possibly be the last operation in the expression (for any X).
Case 2:
X && Y: such that Y is one of A,B,C,D and X is any sequential boolean expression.
Now, since there is no variable in A,B,C,D where the entire expression is false whenever that variable is false, none of:
X && A
X && B
X && C
X && D
can possibly be the last operation in the expression (for any X).
Therefore you cannot write (A || B) && (C || D) in this way.
The reason you are able to do this for some expressions, like: A && ( B || C) becoming C || B && A is because that expression can be built recursively out of expressions which have one of the two properties above:
IE.
The truth table for A && ( B || C) is:
A | B | C |
0 | 0 | 0 | 0
0 | 0 | 1 | 0
0 | 1 | 0 | 0
0 | 1 | 1 | 0
1 | 0 | 0 | 0
1 | 0 | 1 | 1
1 | 1 | 0 | 1
1 | 1 | 1 | 1
Which we can quickly see has the property that it is false whenever A is 0. So Our expression Could be X && A.
Then we take A out of the truth table and look at only the rows where A is 1 is the original:
B | C
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1
Which has the property that it is True whenever B is 1 (or C, we can pick here). So we can write the expression as
X || B and the entire expression becomes X || B && A
Then we reduce the table again to the portion where B was 0 and we get:
C
0 | 0
1 | 1
X is just C. So the final expression is C || B && A
This is a problem of rewriting an expression so that no parentheses occur on the right. Logical AND (∧) and OR (∨) are both commutative:
A ∧ B = B ∧ A
A ∨ B = B ∨ A
So you can rewrite any expression of the form “X a (Y)” as “(Y) a X”:
A ∧ (B ∧ C) = (A ∧ B) ∧ C
A ∧ (B ∨ C) = (B ∨ C) ∧ A
A ∨ (B ∧ C) = (B ∧ C) ∨ A
A ∨ (B ∨ C) = (B ∨ C) ∨ A
They are also distributive, by the following laws:
(A ∧ B) ∨ (A ∧ C)
= A ∧ (B ∨ C)
= (B ∨ C) ∧ A
(A ∨ B) ∧ (A ∨ C)
= A ∨ (B ∧ C)
= (B ∧ C) ∨ A
So many Boolean expressions can be rewritten without parentheses on the right. But, as a counterexample, there is no way to rewrite an expression such as (A ∧ B) ∨ (C ∧ D) if A ≠ C, because of the lack of common factors.
Can't you just do this:
(( A || C ) && D)
and for your second example:
((( A && C ) || B ) && A )
Would that work for you?
Hope that helps...
You'll hit problems if you need to do something like (A || B) && (C || D) unless you can store the intermediate values for later use.
If you're allowed to construct more than one chain and try them all until either one of them passes or they all fail (so each chain is effectively ORed with the next) then I should think you can handle any combination. The example above would become (where each line is a separate query):
(A && C) ||
(A && D) ||
(B && C) ||
(B && D)
However, for a very complex check this could easily get out of hand!