Determine the attribute that influences the outcome most - classification

I have a dataset in .csv format as shown:
NRC_CLASS,L1_MARKS_FINAL,L2_MARKS_FINAL,L3_MARKS_FINAL,S1_MARKS_FINAL,S2_MARKS_FINAL,S3_MARKS_FINAL,
FAIL,7,12,12,24,4,30,
PASS,49,36,46,51,31,56,
FAIL,59,35,42,18,18,45,
PASS,61,30,51,33,30,52,
PASS,68,30,35,53,45,54,
2,82,77,75,32,36,56,
FAIL,18,35,35,32,21,35,
2,86,56,46,44,37,60,
1,94,45,62,70,50,59,
Where the first column talks about the over all grade:
FAIL - Fail
PASS - Pass class
1 - First class
2 - Second class
D - Distinction
This is followed by marks of each student in 6 subjects.
Is there anyway i can find out performance in which subject makes a difference in overall outcome?
I am using Weka and had used J48 to build a tree.
The summary of J48 classifier is:
=== Summary ===
Correctly Classified Instances 30503 92.5371 %
Incorrectly Classified Instances 2460 7.4629 %
Kappa statistic 0.902
Mean absolute error 0.0332
Root mean squared error 0.1667
Relative absolute error 10.8867 %
Root relative squared error 42.7055 %
Total Number of Instances 32963
Also I discretized the marks data into 10 bins with useEqualFrequency set to true. The summary of J48 now is:
=== Summary ===
Correctly Classified Instances 28457 86.3301 %
Incorrectly Classified Instances 4506 13.6699 %
Kappa statistic 0.8205
Mean absolute error 0.0742
Root mean squared error 0.2085
Relative absolute error 24.3328 %
Root relative squared error 53.4264 %
Total Number of Instances 32963

First of all, you may need to quantify a value for each of the NRC_CLASS Values (or even better, use the actual grade out of 100) to improve the quality of attribute testing.
From there, you could potentially use Attribute Selection (found in the Select Attribute tab of Weka Explorer) to find the attributes that have the greatest influence on the overall grade. Perhaps the CorrelationAttributeEval as the Attribute Evaluator coupled with the Ranker search method could assist in identifying the attributes of greatest importance to the least.
Hope this Helps!

It seems you want to determine the relative relevance of each attribute. In this case, you need to use a weight learning algorithm. Weka has a few, I just used Relief. Go to the tab Select attributes, in Attribute Evaluator, select ReliefF-AttributeEval, it will select the
Select the attribute that has the value for the outcome class.
Search Method for you. Click Start.
The results will include the ranked attributes, the highest ranked is the most relevant.

In a test data set T with 25 attributes, run i=1:25 rounds where you replace the values of the i-th attribute with random values (=noise). Compare the test performance of each of the 25 rounds with the case where no attribute was replaced, and identify the round in which the performance dropped the most.
If the worst performance decrease occurred e.g. in round 13, this indicates that attribute 13 is the most important one.

Related

Neutrality for sentiment analysis in spark

I have built a pretty basic naive bayes over apache spark and using mllib of course. But I have a few clarifications on what exactly neutrality means.
From what I understand, in a given dataset there are pre-labeled sentences which comprise of the necessary classes, let's take 3 for example below.
0-> Negative sentiment
1-> Positive sentiment
2-> Neutral sentiment
This neutral is pre-labeled in the training set itself.
Is there any other form of neutrality handling. Suppose if there are no neutral sentences available in the dataset then is it possible that I can calculate it from the scale of probability like
0.0 - 0.4 => Negative
0.4- - 0.6 => Neutral
0.6 - 1.0 => Positive
Is such kind of mapping possible in spark. I searched around but could not find any. The NaiveBayesModel class in the RDD API has a predict method which just returns a double that is mapped according to the training set i.e if only 0,1 is there it will return only 0,1 and not in a scaled manner such as 0.0 - 1.0 as above.
Any pointers/advice on this would be incredibly helpful.
Edit - 1
Sample code
//Performs tokenization,pos tagging and then lemmatization
//Returns a array of string
val tokenizedString = Util.tokenizeData(text)
val hashingTF = new HashingTF()
//Returns a double
//According to the training set 1.0 => Positive, 0.0 => Negative
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Sample dataset content
1,Awesome movie
0,This movie sucks
Of course the original dataset contains more longer sentences, but this should be enough for explanations I guess
Using the above code I am calculating. My question is the same
1) Neutrality handling in dataset
In the above dataset if I am adding another category such as
2,This movie can be enjoyed by kids
For arguments sake, lets assume that it is a neutral review, then the model.predict method will give either 1.0,0.0,2.0 based on the passed in sentence.
2) Using the model.predictProbabilities it gives an array of doubles, but I am not sure in what order it gives the result i.e index 0 is for negative or for positive? With three features i.e Negative,Positive,Neutral then in what order will that method return the predictions?
It would have been helpful to have the code that builds the model (for your example to work, the 0.0 from the dataset must be converted to 0.0 as a Double in the model, either after indexing it with a StringIndexer stage, or if you converted that from the file), but assuming that this code works:
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Then yes, it means the probabilities at index 0 is that of negative and at 1 that of positive (it's a bit strange and there must be a reason, but everything is a double in ML, even feature and category indexes). If you have something like this in your code:
val labelIndexer = new StringIndexer()
.setInputCol("sentiment")
.setOutputCol("indexedsentiment")
.fit(trainingData)
Then you can use labelIndexer.labels to identify the labels (probability at index 0 is for labelIndexer.labels at index 0.
Now regarding your other questions.
Neutrality can mean two different things. Type 1: a review contains as much positive and negative words Type 2: there is (almost) no sentiment expressed.
A Neutral category can be very helpful if you want to manage Type 2. If that is the case, you need neutral examples in your dataset. Naive Bayes is not a good classifier to apply thresholding on the probabilities in order to determine Type 2 neutrality.
Option 1: Build a dataset (if you think you will have to deal with a lot of Type 2 neutral texts). The good news is, building a neutral dataset is not too difficult. For instance you can pick random texts that are not movie reviews and assume they are neutral. It would be even better if you could pick content that is closely related to movies (but neutral), like a dataset of movie synopsis. You could then create a multi-class Naive Bayes classifier (between neutral, positive and negative) or a hierarchical classifier (first step is a binary classifier that determines whether a text is a movie review or not, second step to determine the overall sentiment).
Option 2 (can be used to deal with both Type 1 and 2). As I said, Naive Bayes is not very great to deal with thresholds on the probabilities, but you can try that. Without a dataset though, it will be difficult to determine the thresholds to use. Another approach is to identify the number of words or stems that have a significant polarity. One quick and dirty way to achieve that is to query your classifier with each individual word and count the number of times it returns "positive" with a probability significantly higher than the negative class (discard if the probabilities are too close to each other, for instance within 25% - a bit of experimentations will be needed here). At the end, you may end up with say 20 positive words vs 15 negative ones and determine it is neutral because it is balanced or if you have 0 positive and 1 negative, return neutral because the count of polarized words is too low.
Good luck and hope this helped.
I am not sure if I understand the problem but:
prior in Naive Bayes is computed from the data and cannot be set manually.
in MLLib you can use predictProbabilities to obtain class probabilities.
in ML you can use setThresholds to set prediction threshold for each class.

Matlab non-linear binary Minimisation

I have to set up a phoneme table with a specific probability distribution for encoding things.
Now there are 22 base elements (each with an assigned probability, sum 100%), which shall be mapped on a 12 element table, which has desired element probabilities (sum 100%).
So part of the minimisation is to merge several base elements to get 12 table elements. Each base element must occur exactly once.
In addition, the table has 3 rows. So the same 12 element composition of the 22 base elements must minimise the error for 3 target vectors. Let's say the given target vectors are b1,b2,b3 (dimension 12x1), the given base vector is x (dimension 22x1) and they are connected by the unknown matrix A (12x22) by:
b1+err1=Ax
b2+err2=Ax
b3+err3=Ax
To sum it up: A is to be found so that dot_prod(err1+err2+err3, err1+err2+err3)=min (least squares). And - according to the above explanation - A must contain only 1's and 0's, while having exactly one 1 per column.
Unfortunately I have no idea how to approach this problem. Can it be expressed in a way different from the matrix-vector form?
Which tools in matlab could do it?
I think I found the answer while parsing some sections of the Matlab documentation.
First of all, the problem can be rewritten as:
errSum=err1+err2+err3=3Ax-b1-b2-b3
=> dot_prod(errSum, errSum) = min(A)
Applying the dot product (least squares) yields a quadratic scalar expression.
Syntax-wise, the fmincon tool within the optimization box could do the job. It has constraints parameters, which allow to force Aij to be binary and each column to be 1 in sum.
But apparently fmincon is not ideal for binary problems algorithm-wise and the ga tool should be used instead, which can be called in a similar way.
Since the equation would be very long in my case and needs to be written out, I haven't tried yet. Please correct me, if I'm wrong. Or add further solution-methods, if available.

Enhancing accuracy of knn classifier

I have training set of size 54 * 65536 and a testing set of 18 * 65536.
I want to use a knn classifier, but I have some questions:
1) How should I define trainlabel?
Class = knnclassify(TestVec,TrainVec, TrainLabel,k);
Is it a vector of size 54 * 1 that defines to which group each row in training set belongs? Here the group is numbered as 1 ,2,..
2) To find the accuracy I used this:
cp = classperf(TrainLabel);
Class = knnclassify(TestVec,TrainVec, TrainLabel);
cp = classperf(TestLabel,Class);
cp.CorrectRate*100
Is this right? Is there another method to calculate it?
3) How can I enhance the accuracy?
4) How do I choose the best value of k?
I do not know matlab nor the implementation of the knn you are providing, so I can answer only a few of your questions.
1) You assumption is correct. trainlabel is a 54*1 vector or an array of size 54 or something equivalent that defines which group each datapoint (row) in training set belongs to.
2) ... MATLAB / implementation related, sorry.
3) That is a very big discussion. Possible ways are:
Choose a better value of K.
Preprocess the data (or make preprocessing better if already applied).
Get a better / bigger trainset.
to name a few...
4) You can use different values while measuring the accuracy for each one and keep the best. (Note: If you do that, make sure you do not measure the accuracy of the classifier per value of k only once, but rather you use some technique like 10-Folding or something).
There is more than a fair chance that the library you are using for the K-NNclassifier provides such utilities.

Johansen test on two stocks (for pairs trading) yielding weird results

I hope you can help me with this one.
I am using cointegration to discover potential pairs trading opportunities within stocks and more precisely I am utilizing the Johansen trace test for only two stocks at a time.
I have several securities, but for each test I only test two at a time.
If two stocks are found to be cointegrated using the Johansen test, the idea is to define the spread as
beta' * p(t-1) - c
where beta'=[1 beta2] and p(t-1) is the (2x1) vector of the previous stock prices. Notice that I seek a normalized first coefficient of the cointegration vector. c is a constant which is allowed within the cointegration relationship.
I am using Matlab to run the tests (jcitest), but have also tried utilizing Eviews for comparison of results. The two programs yields the same.
When I run the test and find two stocks to be cointegrated, I usually get output like
beta_1 = 12.7290
beta_2 = -35.9655
c = 121.3422
Since I want a normalized first beta coefficient, I set beta1 = 1 and obtain
beta_2 = -35.9655/12.7290 = -2.8255
c =121.3422/12.7290 = 9.5327
I can then generate the spread as beta' * p(t-1) - c. When the spread gets sufficiently low, I buy 1 share of stock 1 and short beta_2 shares of stock 2 and vice versa when the spread gets high.
~~~~~~~~~~~~~~~~ The problem ~~~~~~~~~~~~~~~~~~~~~~~
Since I am testing an awful lot of stock pairs, I obtain a lot of output. Quite often, however, I receive output where the estimated beta_1 and beta_2 are of the same sign, e.g.
beta_1= -1.4
beta_2= -3.9
When I normalize these according to beta_1, I get:
beta_1 = 1
beta_2 = 2.728
The current pairs trading literature doesn't mention any cases where the betas are of the same sign - how should it be interpreted? Since this is pairs trading, I am supposed to long one stock and short the other when the spread deviates from its long run mean. However, when the betas are of the same sign, to me it seems that I should always go long/short in both at the same time? Is this the correct interpretation? Or should I modify the way in which I normalize the coefficients?
I could really use some help...
EXTRA QUESTION:
Under some of my tests, I reject both the hypothesis of r=0 cointegration relationships and r<=1 cointegration relationships. I find this very mysterious, as I am only considering two variables at a time, and there can, at maximum, only be r=1 cointegration relationship. Can anyone tell me what this means?

Tableau 8.2 - how to get max and min from % difference values on table?

I'm facing problem in getting the max % and min % from a table containing % difference values.
Year-----A----------B---------C---------D---------Max %----Max Type----Min %----Min Type
2012
2013---4.30%---4.42%---4.34%---4.38%----4.42%---------B-----------4.30%---------A
The table above shows the % difference in sales from previous year. Thus 2012 shows no % (because there's no 2011). I used table calculation to compute the % difference, i.e. "Percent Difference From", compute using "Table (Down)" and "Previous".
The last four columns are what I'm having trouble doing. I want to get the max % and min % and also the corresponding types. I'm not trying to add the four columns to the existing table, but to get the correct results, as my ultimate goal is to display that results on the dashboard, i.e. on my dashboard, I want to display the highest % and its corresponding type; similarly the lowest % and its corresponding type. For example: on my dashboard, I want to display:
Highest % and type: 4.42% B
Lowest % and type: 4.30% A
So, I need to have the correct formulas to get the max % and min % and their types. These are what I did:
I tried to use WINDOW_MAX and WINDOW_MIN to display the max % and min % on the table but got funky wrong results.
1) I first get the formula in calculating the % difference from the "Customize" button from "Edit Table Calculation" window of SUM([Sales]): (ZN(SUM([Sales])) - LOOKUP(ZN(SUM([Sales])), -1)) / ABS(LOOKUP(ZN(SUM([Sales])), -1))
Then I created a calculated field of the above formula. I named the calculated field "Percent-Diff".
2) I created another calculated filed (named "Max % Difference") using the formula: WINDOW_MAX([Percent-Diff]). But it shows strange results. See image below. I don't know why it gives me 2.78% and 2.91% for 2012 and 2013 respectively. It should be 0% and 4.42% for 2012 and 2013 respectively. Something is not correct.
If it is just SUM([Sales]) instead of % difference, then I get the correct result of showing the max sales using the formula WINDOW_MAX(SUM([Sales])).
3) Also I don't know how to get the corresponding type. I tried using the formula: IF [Max % Difference] = [Percent-Diff] THEN ATTR([Product Type]). But it returns:
NULL
B
I'm not sure if the formula is correct. It looks correct on the result (i.e. "B" is correct), except that it also shows a NULL value which I don't know why. I think it's because I didn't include the ELSE part in my IF formula? But why the NULL value is shown as the first value? I want the formula to return just one value, "B". So, how to only just show "B"?
I've posted twice the problem in tableau forum, but as of now, nobody has answered my problem. I believe that my formulas are incorrect. So, if anyone here can correct the formulas to get the max % and min % from % difference values and also to get the corresponding type, then it'd be very much appreciated. Thanks a million!
It's hard to tell not knowing how your database looks like (as you didn't explicitly presented it, but I can try to infer based on the clues you left on your post). But I could reproduce something like you said using the Sample - Coffee Chain Database, and it worked out well, calculated yoy sales increase by product and then window_max of that.
What you're probably missing is the partitioning. I suggest avoiding using Table or Pane to create the partitions in more complex situations (as it will work only in that specific arrangement of fields), but rather use the dimensions to partition it.
So, your [Percent-Diff] field should be compute using [Date], and your [Max % Difference] should be compute using [Product Type]. IMPORTANT, for [Max % Difference], when you go to Edit Table Calculation, you'll have to choose the Compute using for [Percent-Diff] as well (you can choose on the top of the window)
Your formula to find which type is the max (or min) is also correct (and should only respect the partitions). Nevertheless, it is very hard to have the exact output you're expecting.
What I would do is to create 2 spreadsheets (and later combine them in a dashboard).
The 1st would be what you already got (Each product [Percent-Diff]
The second one I would change your formula (3) to just [Max % Difference] = [Percent-Diff], and use it as filter (filtering only true). I would drag both Date and Product to the sheet (you choose if you want it on columns, rows, or just detail) so I can use them to partition the table. And drag [Max % Difference] to be visualized.
That way you'll only see the product that is the max, and how much is that max.
Hope it helps