Neutrality for sentiment analysis in Spark - Scala

I have built a pretty basic Naive Bayes classifier on Apache Spark, using MLlib of course. But I need a few clarifications on what exactly neutrality means.
From what I understand, in a given dataset there are pre-labeled sentences which comprise the necessary classes; let's take three for the example below.
0-> Negative sentiment
1-> Positive sentiment
2-> Neutral sentiment
This neutral class is pre-labeled in the training set itself.
Is there any other form of neutrality handling? Suppose there are no neutral sentences available in the dataset; is it then possible to derive neutrality from a probability scale, like
0.0 - 0.4 => Negative
0.4 - 0.6 => Neutral
0.6 - 1.0 => Positive
Is this kind of mapping possible in Spark? I searched around but could not find anything. The NaiveBayesModel class in the RDD API has a predict method which just returns a double that is mapped according to the training set, i.e. if only 0 and 1 are there it will return only 0 or 1, and not a value on a scale such as 0.0 - 1.0 as above.
Any pointers/advice on this would be incredibly helpful.
Edit - 1
Sample code
//Performs tokenization, POS tagging and then lemmatization
//Returns an array of strings
val tokenizedString = Util.tokenizeData(text)
val hashingTF = new HashingTF()
//Returns a double
//According to the training set 1.0 => Positive, 0.0 => Negative
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Sample dataset content
1,Awesome movie
0,This movie sucks
Of course the original dataset contains longer sentences, but this should be enough for the explanation, I guess.
Using the above code I am doing the calculation. My questions are the same as before:
1) Neutrality handling in the dataset
If I add another category to the above dataset, such as
2,This movie can be enjoyed by kids
For argument's sake, let's assume that it is a neutral review; then the model.predict method will give either 1.0, 0.0 or 2.0 based on the passed-in sentence.
2) model.predictProbabilities gives an array of doubles, but I am not sure in what order it returns the results, i.e. is index 0 for negative or for positive? With three classes, i.e. Negative, Positive and Neutral, in what order will that method return the predictions?

It would have been helpful to have the code that builds the model (for your example to work, the 0.0 from the dataset must end up as the Double 0.0 in the model, either after indexing it with a StringIndexer stage, or because you converted it yourself when reading the file), but assuming that this code works:
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Then yes, it means the probability at index 0 is that of negative and the one at index 1 that of positive (it's a bit strange and there must be a reason, but everything is a double in ML, even feature and category indexes). If you have something like this in your code:
val labelIndexer = new StringIndexer()
.setInputCol("sentiment")
.setOutputCol("indexedsentiment")
.fit(trainingData)
Then you can use labelIndexer.labels to identify the labels (the probability at index 0 is for the label at labelIndexer.labels index 0).
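If you are staying with the RDD-based API from your sample code, I believe (from memory, Spark 1.5+) the mllib NaiveBayesModel also exposes a labels array whose order matches the output of predictProbabilities, so you can pair them up directly. A minimal sketch, reusing the model, hashingTF and tokenizedString names from your snippet:
// Sketch: pair each posterior probability with its class label (RDD-based MLlib API).
// Assumes `model` is an org.apache.spark.mllib.classification.NaiveBayesModel and that
// `hashingTF` and `tokenizedString` are the ones from the question.
val features = hashingTF.transform(tokenizedString.toSeq)
val probabilities = model.predictProbabilities(features).toArray
// model.labels holds the class labels (as doubles, e.g. 0.0, 1.0, 2.0) in the same order
// as the probabilities returned above.
model.labels.zip(probabilities).foreach { case (label, p) =>
  println(s"class $label -> probability $p")
}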
Now regarding your other questions.
Neutrality can mean two different things. Type 1: a review contains as many positive as negative words. Type 2: there is (almost) no sentiment expressed.
A Neutral category can be very helpful if you want to handle Type 2. If that is the case, you need neutral examples in your dataset. Naive Bayes is not a good classifier for applying thresholds to the probabilities in order to detect Type 2 neutrality.
Option 1: Build a dataset (if you think you will have to deal with a lot of Type 2 neutral texts). The good news is, building a neutral dataset is not too difficult. For instance you can pick random texts that are not movie reviews and assume they are neutral. It would be even better if you could pick content that is closely related to movies (but neutral), like a dataset of movie synopsis. You could then create a multi-class Naive Bayes classifier (between neutral, positive and negative) or a hierarchical classifier (first step is a binary classifier that determines whether a text is a movie review or not, second step to determine the overall sentiment).
Option 2 (can be used to deal with both Type 1 and Type 2). As I said, Naive Bayes is not great at dealing with thresholds on the probabilities, but you can try that. Without a dataset, though, it will be difficult to determine the thresholds to use. Another approach is to count the number of words or stems that have a significant polarity. One quick and dirty way to achieve that is to query your classifier with each individual word and count the number of times it returns "positive" with a probability significantly higher than that of the negative class (discard the word if the probabilities are too close to each other, for instance within 25%; a bit of experimentation will be needed here). In the end, you may end up with, say, 20 positive words vs 15 negative ones and determine the review is neutral because it is balanced, or, if you have 0 positive and 1 negative, return neutral because the count of polarized words is too low.
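To make that concrete, here is a rough, unoptimized sketch of the word-counting heuristic, reusing the mllib NaiveBayesModel and HashingTF from your sample code; the 0.25 gap and the final decision rules are only the placeholder values discussed above, to be tuned by experimentation:
// Sketch: count tokens the model considers clearly positive or clearly negative.
// Assumes `model` is a trained mllib NaiveBayesModel with labels 0.0 = negative, 1.0 = positive,
// and `hashingTF` is the HashingTF from the question; `words` are the lemmatized tokens of one review.
def polarizedCounts(words: Seq[String]): (Int, Int) = {
  var positives = 0
  var negatives = 0
  for (word <- words) {
    val probs = model.labels.zip(model.predictProbabilities(hashingTF.transform(Seq(word))).toArray).toMap
    val pPos = probs.getOrElse(1.0, 0.0)
    val pNeg = probs.getOrElse(0.0, 0.0)
    if (math.abs(pPos - pNeg) >= 0.25) {    // discard words whose class probabilities are too close
      if (pPos > pNeg) positives += 1 else negatives += 1
    }
  }
  (positives, negatives)
}

// Decide neutrality from the counts; both rules below are arbitrary starting points.
def sentimentWithNeutral(words: Seq[String]): String = {
  val (pos, neg) = polarizedCounts(words)
  if (pos + neg < 2) "Neutral"                               // too few polarized words overall
  else if (math.abs(pos - neg) <= (pos + neg) / 5) "Neutral" // roughly balanced
  else if (pos > neg) "Positive" else "Negative"
}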
Good luck and hope this helped.

I am not sure if I understand the problem but:
the prior in Naive Bayes is computed from the data and cannot be set manually.
in MLlib you can use predictProbabilities to obtain class probabilities.
in ML you can use setThresholds to set a prediction threshold for each class (see the sketch below).
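For the last point, a minimal sketch of the DataFrame-based side, assuming you already have training and test DataFrames with "label" (0.0 = negative, 1.0 = positive, 2.0 = neutral) and "features" columns (e.g. from Tokenizer/HashingTF/StringIndexer stages). With setThresholds, the predicted class is the one maximizing probability divided by its threshold, so a larger threshold makes that class harder to predict:
// Sketch: ML (DataFrame-based) Naive Bayes with one threshold per class.
import org.apache.spark.ml.classification.NaiveBayes

val nb = new NaiveBayes()
  .setLabelCol("label")
  .setFeaturesCol("features")
  // One entry per class, in label order; prediction picks the class with the
  // largest probability / threshold ratio.
  .setThresholds(Array(1.0, 1.0, 0.5))     // example values only

val nbModel = nb.fit(training)             // `training` is the assumed training DataFrame
val predicted = nbModel.transform(test)    // adds "probability" and "prediction" columns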

Related

Enhancing accuracy of knn classifier

I have training set of size 54 * 65536 and a testing set of 18 * 65536.
I want to use a knn classifier, but I have some questions:
1) How should I define trainlabel?
Class = knnclassify(TestVec,TrainVec, TrainLabel,k);
Is it a vector of size 54 * 1 that defines which group each row in the training set belongs to? Here the groups are numbered 1, 2, ...
2) To find the accuracy I used this:
cp = classperf(TrainLabel);
Class = knnclassify(TestVec,TrainVec, TrainLabel);
cp = classperf(TestLabel,Class);
cp.CorrectRate*100
Is this right? Is there another method to calculate it?
3) How can I enhance the accuracy?
4) How do I choose the best value of k?
I do not know MATLAB nor the kNN implementation you are using, so I can answer only a few of your questions.
1) Your assumption is correct. TrainLabel is a 54 * 1 vector, or an array of size 54, or something equivalent, that defines which group each datapoint (row) in the training set belongs to.
2) ... MATLAB / implementation related, sorry.
3) That is a very big discussion. Possible ways are:
Choose a better value of K.
Preprocess the data (or make preprocessing better if already applied).
Get a better / bigger trainset.
to name a few...
4) You can try different values of k, measure the accuracy for each one, and keep the best. (Note: if you do that, make sure you do not measure the accuracy of the classifier only once per value of k; rather, use some technique like 10-fold cross-validation. A sketch of the idea is shown below.)
There is more than a fair chance that the library you are using for the kNN classifier provides such utilities.
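Since I do not know MATLAB, here is only an illustration of that selection loop written in Scala (the language used in the main thread above), not of knnclassify itself; the tiny kNN and the Euclidean distance below are toy stand-ins for whatever your library provides, and the point is simply to score each candidate k with k-fold cross-validation and keep the best one:
// Sketch: pick k for a kNN classifier by k-fold cross-validation (toy, in-memory version).
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Classify one point by majority vote among its k nearest training points.
def knnPredict(trainX: Array[Array[Double]], trainY: Array[Int], x: Array[Double], k: Int): Int =
  trainX.zip(trainY)
    .sortBy { case (tx, _) => euclidean(tx, x) }
    .take(k)
    .groupBy(_._2)
    .maxBy(_._2.length)._1

// Accuracy on a single train/test split.
def accuracy(trainX: Array[Array[Double]], trainY: Array[Int],
             testX: Array[Array[Double]], testY: Array[Int], k: Int): Double = {
  val correct = testX.zip(testY).count { case (x, y) => knnPredict(trainX, trainY, x, k) == y }
  correct.toDouble / testX.length
}

// Mean accuracy over `folds` folds for one value of k.
def crossValidate(data: Array[Array[Double]], labels: Array[Int], k: Int, folds: Int = 10): Double = {
  val indexed = data.zip(labels).zipWithIndex
  val foldAccuracies = (0 until folds).map { f =>
    val (test, train) = indexed.partition { case (_, i) => i % folds == f }
    accuracy(train.map(_._1._1), train.map(_._1._2), test.map(_._1._1), test.map(_._1._2), k)
  }
  foldAccuracies.sum / folds
}

// Try several candidate values of k and keep the one with the best cross-validated accuracy.
def bestK(data: Array[Array[Double]], labels: Array[Int], candidates: Seq[Int]): Int =
  candidates.maxBy(k => crossValidate(data, labels, k))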

Assessing performance of a zero inflated negative binomial model

I am modelling the diffusion of movies through a contact network (based on telephone data) using a zero-inflated negative binomial model (package: pscl):
m1 <- zeroinfl(LENGTH_OF_DIFF ~ ., data = trainData, dist = "negbin")
(variables described below.)
The next step is to evaluate the performance of the model.
My attempt has been to do multiple out-of-sample predictions and calculate the MSE.
Using
predict(m1, newdata = testData)
I received a prediction for the mean length of a diffusion chain for each datapoint, and using
predict(m1, newdata = testData, type = "prob")
I received a matrix containing the probability of each datapoint being a certain length.
Problem with the evaluation: Since I have a 0 (and 1) inflated dataset, the model would be correct most of the time if it predicted 0 for all the values. The predictions I receive are good for chains of length zero (according to the MSE), but the deviation between the predicted and the true value for chains of length 1 or larger is substantial.
My question is:
How can we assess how well our model predicts chains of non-zero length?
Is this approach the correct way to make predictions from a zero inflated negative binomial model?
If yes: how do I interpret these results?
If no: what alternative can I use?
My variables are:
Dependent variable:
length of the diffusion chain (count [0,36])
Independent variables:
movie characteristics (both dummies and continuous variables).
Thanks!
It is straightforward to evaluate the RMSPE (root mean square prediction error), but it is probably best to transform your counts beforehand, to ensure that the really big counts do not dominate the sum.
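For instance, with a log(1 + x) transform (one common choice, and only a suggestion), that would be RMSPE = sqrt( mean( (log(1 + predicted) - log(1 + observed))^2 ) ), so the handful of very long chains cannot dominate the sum.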
You may find false negative and false positive error rates (FNR and FPR) useful here. FNR is the chance that a chain of actual non-zero length is predicted to have zero length (i.e. absence, also known as a negative). FPR is the chance that a chain of actual zero length is falsely predicted to have non-zero (i.e. positive) length. I suggest doing a Google search on these terms to find a paper in your favourite quantitative journal or a chapter in a book that explains them simply. For ecologists I tend to go back to Fielding & Bell (1997, Environmental Conservation).
First, let's define a repeatable example that anyone can use (not sure where your trainData comes from). This is from the help for the zeroinfl function in the pscl library:
# an example from help on zeroinfl function in pscl library
library(pscl)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
There are several packages in R that calculate these, but here's the by-hand approach. First, calculate the observed and predicted values.
# store observed values, and determine how many are nonzero
obs <- bioChemists$art
obs.nonzero <- obs > 0
table(obs)
table(obs.nonzero)
# calculate predicted counts, and check their distribution
preds.count <- predict(fm_zinb2, type="response")
plot(density(preds.count))
# also the predicted probability that each item is nonzero
preds <- 1-predict(fm_zinb2, type = "prob")[,1]
preds.nonzero <- preds > 0.5
plot(density(preds))
table(preds.nonzero)
Then get the confusion matrix (the basis of FNR and FPR):
# the confusion matrix is obtained by tabulating the dichotomized observations and predictions
# rows are the predictions (FALSE, TRUE), columns are the observations (FALSE, TRUE)
confusion.matrix <- table(preds.nonzero, obs.nonzero)
FNR <- confusion.matrix[1,2] / sum(confusion.matrix[,2])  # predicted zero, actually non-zero
FPR <- confusion.matrix[2,1] / sum(confusion.matrix[,1])  # predicted non-zero, actually zero
FNR
FPR
In terms of calibration, we can assess it visually or via a calibration regression.
# let's look at how well the counts are being predicted
library(ggplot2)
output <- as.data.frame(list(preds.count=preds.count, obs=obs))
ggplot(aes(x=obs, y=preds.count), data=output) + geom_point(alpha=0.3) + geom_smooth(col="blue")
Transforming the counts to "see" what is going on:
output$log.obs <- log(output$obs)
output$log.preds.count <- log(output$preds.count)
ggplot(aes(x=log.obs, y=log.preds.count), data=output[!is.na(output$log.obs) & !is.na(output$log.preds.count),]) + geom_jitter(alpha=0.3, width=.15, size=2) + geom_smooth(col="blue") + labs(x="Observed count (non-zero, natural logarithm)", y="Predicted count (non-zero, natural logarithm)")
In your case you could also evaluate the correlations between the predicted counts and the actual counts, either including or excluding the zeros.
So you could fit a regression as a kind of calibration to evaluate this! However, since the predictions are not necessarily counts, we can't use a Poisson regression; instead we can use a lognormal model, regressing the log prediction against the log observed, assuming a Normal response.
calibrate <- lm(log(preds.count) ~ log(obs), data=output[output$obs!=0 & output$preds.count!=0,])
summary(calibrate)
sigma <- summary(calibrate)$sigma
sigma
There are fancier ways of assessing calibration, I suppose, as in any modelling exercise... but this is a start.
For a more advanced assessment of zero-inflated models, check out the ways in which the log likelihood can be used, in the references provided for the zeroinfl function. This requires a bit of finesse.

Determine the attribute that influences the outcome most

I have a dataset in .csv format as shown:
NRC_CLASS,L1_MARKS_FINAL,L2_MARKS_FINAL,L3_MARKS_FINAL,S1_MARKS_FINAL,S2_MARKS_FINAL,S3_MARKS_FINAL,
FAIL,7,12,12,24,4,30,
PASS,49,36,46,51,31,56,
FAIL,59,35,42,18,18,45,
PASS,61,30,51,33,30,52,
PASS,68,30,35,53,45,54,
2,82,77,75,32,36,56,
FAIL,18,35,35,32,21,35,
2,86,56,46,44,37,60,
1,94,45,62,70,50,59,
where the first column gives the overall grade:
FAIL - Fail
PASS - Pass class
1 - First class
2 - Second class
D - Distinction
This is followed by marks of each student in 6 subjects.
Is there any way I can find out which subject's performance makes the biggest difference to the overall outcome?
I am using Weka and have used J48 to build a tree.
The summary of J48 classifier is:
=== Summary ===
Correctly Classified Instances 30503 92.5371 %
Incorrectly Classified Instances 2460 7.4629 %
Kappa statistic 0.902
Mean absolute error 0.0332
Root mean squared error 0.1667
Relative absolute error 10.8867 %
Root relative squared error 42.7055 %
Total Number of Instances 32963
Also I discretized the marks data into 10 bins with useEqualFrequency set to true. The summary of J48 now is:
=== Summary ===
Correctly Classified Instances 28457 86.3301 %
Incorrectly Classified Instances 4506 13.6699 %
Kappa statistic 0.8205
Mean absolute error 0.0742
Root mean squared error 0.2085
Relative absolute error 24.3328 %
Root relative squared error 53.4264 %
Total Number of Instances 32963
First of all, you may need to assign a numeric value to each of the NRC_CLASS values (or even better, use the actual grade out of 100) to improve the quality of attribute testing.
From there, you could potentially use Attribute Selection (found in the Select attributes tab of Weka Explorer) to find the attributes that have the greatest influence on the overall grade. Perhaps CorrelationAttributeEval as the Attribute Evaluator, coupled with the Ranker search method, could assist in ranking the attributes from greatest importance to least.
Hope this helps!
It seems you want to determine the relative relevance of each attribute. In this case, you need to use a weight-learning algorithm. Weka has a few; I just used Relief. Go to the Select attributes tab, select the attribute that holds the outcome class, and under Attribute Evaluator choose ReliefF-AttributeEval; it will select the Search Method for you. Click Start.
The results will include the ranked attributes; the highest ranked is the most relevant.
In a test data set T with 25 attributes, run i=1:25 rounds where you replace the values of the i-th attribute with random values (=noise). Compare the test performance of each of the 25 rounds with the case where no attribute was replaced, and identify the round in which the performance dropped the most.
If the worst performance decrease occurred e.g. in round 13, this indicates that attribute 13 is the most important one.
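That permutation idea is easy to script outside of Weka too. Below is a generic sketch (in Scala, which the first thread above uses, not Weka's API); evaluate is a hypothetical placeholder for whatever routine trains and tests your classifier and returns an accuracy, and the noise is implemented as a random shuffle of the column:
// Sketch: rank attributes by how much corrupting each one hurts accuracy (permutation importance).
// `evaluate` is a placeholder for your own train-and-test routine returning an accuracy;
// `data` is the dataset as rows of attribute values.
import scala.util.Random

def permutationImportance(data: Array[Array[Double]],
                          evaluate: Array[Array[Double]] => Double,
                          seed: Long = 42L): Seq[(Int, Double)] = {
  val rng = new Random(seed)
  val baseline = evaluate(data)                  // performance with no attribute replaced
  val numAttributes = data.head.length
  val drops = for (attr <- 0 until numAttributes) yield {
    // Replace the attr-th column with a random shuffle of itself, i.e. noise w.r.t. the rows.
    val shuffled = rng.shuffle(data.map(_(attr)).toSeq)
    val corrupted = data.zip(shuffled).map { case (row, v) => row.updated(attr, v) }
    attr -> (baseline - evaluate(corrupted))     // bigger drop => more important attribute
  }
  drops.sortBy(-_._2)                            // most important attribute first
}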

Scala: large calculation losing value to zero/infinity

I'm trying to calculate a perplexity value for a language model, and the calculation uses a lot of large powers. I have tried converting my calculation to log space using BigDecimal, but I'm not having any luck.
var sum = 0.0
for (ngram <- testNGrams) {
  var prob = Math.log(lm.prob(ngram.last, ngram.slice(0, ngram.size - 1)))
  if (prob == 0.0) sum = sum
  else sum = sum + prob
}
Math.pow(Math.log(Math.exp(sum)), -1.0 / wordSize.toDouble)
How can I perform such a calculation in Scala without losing my large/small values to zero/Infinity? It seems like a trivial question but I haven't managed to do it.
In the above, you can assume that the method lm.prob returns correct probabilities between 0 and 1; this has been amply tested.
Write everything in terms of log probabilities, not probabilities.
For instance, things like log(exp(sum)) just warm up your CPU while throwing away useful information. Avoid!
If you must convert to actual probabilities, do so at the very last step you can.
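For illustration, a minimal sketch of that, reusing lm, testNGrams and wordSize from your question (so those names are assumed, not defined here): the log probabilities are summed directly, and the only exponentiation happens once at the very end, since perplexity is exp(-(1/N) * sum of natural-log probabilities).
// Sketch: keep everything in log space; exponentiate only once, at the very end.
// Assumes lm.prob returns a probability strictly between 0 and 1 (guard or smooth if it can be 0).
val logProbSum = testNGrams.map { ngram =>
  Math.log(lm.prob(ngram.last, ngram.slice(0, ngram.size - 1)))
}.sum
val perplexity = Math.exp(-logProbSum / wordSize.toDouble)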

Jahmm lib: how to interpret negative value from ForwardBackwardScaledCalculator.lnProbability()?

I use the Jahmm library for classification of accelerometer sequences.
I have created my models, but when I try to calculate the probability of a test sequence on a model with:
ForwardBackwardScaledCalculator fbsc = new ForwardBackwardScaledCalculator(test_pair.getValue(),model_pair.getValue().get_hmm());
System.out.println(fbsc.lnProbability());
I get negative values like -1278.0926336276573.
The comment in the code of the library states that the lnProbability method:
Return the napierian logarithm of the probability of the sequence that generated this object.
Returns: The probability of the sequence of interest's napierian logarithm
But how do I compare two such logarithms? I call the method on two different models with the two test sequences, so I get four probabilities:
The test sequence fast_test.seq on fast_model yields a Napierian log of -1278.0926336276573
The test sequence fast_test.seq on slow_model yields a Napierian log of -1862.6947488370433
The test sequence slow_test.seq on fast_model yields a Napierian log of -4433.949818774553
The test sequence slow_test.seq on slow_model yields a Napierian log of -4208.071445499895
But in this context, does it mean that the closer we get to zero, the more similar the test sequence is to the model (so in this example the classification accuracy would be 100%)?
Thank you
If by "Napierian logarithm", the natural logarithm is meant, then you can get a probability from a return value x by raising e to the x, e.g. using Math.exp. However, the reason that logarithms are returned is because the probability values are too small to represent in a double; Math.exp(-1278.0926336276573) will simply return zero. See the Wikipedia article about log probabilities.
does it mean that the closer we get to zero, the more similar the test sequence is to the model
exp(0) == 1 and log(1) == 0, and indeed the lower the probability, the smaller (more negative) its logarithm. So, the closer you get to zero, the more probable the sequence is under the model.
However, this need not directly relate to "similarity to a model", let alone "classification accuracy", since HMMs (being generative models) will ascribe lower probability to longer sequences. Read up on HMMs in your favorite textbook; a full explanation would be too long for this answer box and is a math question, so off-topic for this website.
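To make the comparison concrete, here is a tiny illustration (plain Scala rather than Java, with your two fast_test.seq numbers hard-coded just for the example): comparing the logarithms directly is equivalent to comparing the underlying probabilities, while exponentiating them simply underflows to zero.
// Sketch: classify by picking the model whose log-likelihood is largest (closest to zero).
val fastOnFast = -1278.0926336276573   // lnProbability of fast_test.seq under fast_model
val fastOnSlow = -1862.6947488370433   // lnProbability of fast_test.seq under slow_model

// Both exponentials underflow to 0.0, so compare the logarithms themselves instead.
println(Math.exp(fastOnFast))   // 0.0
println(Math.exp(fastOnSlow))   // 0.0

val bestModel = if (fastOnFast > fastOnSlow) "fast_model" else "slow_model"
println(s"fast_test.seq is more probable under $bestModel")   // fast_model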