Can MRMR be used for an imbalanced dataset? - feature-selection

I tried using MRMR on a dataset in which about 10% of the rows have class '1' and the remaining 90% have class '0'. I used the MRMR code shown below with K=10. However, after checking with COUNTIFS I realized that, for each selected feature, there were more rows where the feature and the class were both 0 than rows where both were 1. This makes me suspect that the selected features are predicting the absence of the target class rather than its presence. Is there a way to use MRMR for imbalanced data?
from mrmr import mrmr_classif
selected_features = mrmr_classif(X, y, K=10)
I would like MRMR to select features that contribute to predicting class '1' rather than class '0'.
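One common workaround (not specific to MRMR) is to rebalance the classes before running the selection, so that the relevance scores are not dominated by class '0'. A minimal sketch, assuming X is a pandas DataFrame and y a pandas Series, and that the mrmr package above is installed (the random_state value is arbitrary):
from mrmr import mrmr_classif
# Undersample the majority class so both classes contribute equally
# to the relevance estimates used by MRMR.
minority_idx = y[y == 1].index
majority_idx = y[y == 0].sample(n=len(minority_idx), random_state=42).index
balanced_idx = minority_idx.union(majority_idx)
X_bal, y_bal = X.loc[balanced_idx], y.loc[balanced_idx]
selected_features = mrmr_classif(X=X_bal, y=y_bal, K=10)
You can then check, with the same COUNTIFS approach, whether the features selected on the balanced sample co-occur with class '1' more often.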

Related

How to decide the range for the hyperparameter space in SVM tuning? (MATLAB)

I am tuning an SVM using a for loop to search over a range of the hyperparameter space. The SVM model learned contains the following fields:
SVMModel: [1×1 ClassificationSVM]
C: 2
FeaturesIdx: [4 6 8]
Score: 0.0142
Question 1) What is the meaning of the field 'Score' and what is it useful for?
Question 2) I am tuning the BoxConstraint, the C value. Let the number of features be denoted by the variable featsize. The variable gridC will contain the search space, which can start from any value, say 2^-5, 2^-3, up to 2^15, so gridC = 2.^(-5:2:15). Is there a principled way to select this range?
1. Score is documented here, which says:
Classification Score
The SVM classification score for classifying observation x is the signed distance from x to the decision boundary, ranging from -∞ to +∞. A positive score for a class indicates that x is predicted to be in that class. A negative score indicates otherwise.
In a two-class case, if there are six observations and the predict function gives us score values called TestScore, then we can determine which class each observation is assigned to by:
TestScore = [-0.4497  0.4497;
             -0.2602  0.2602;
             -0.0746  0.0746;
              0.1070 -0.1070;
              0.2841 -0.2841;
              0.4566 -0.4566];
[~, Classes] = max(TestScore, [], 2);
In the two-class case we can also use find(TestScore(:, 2) > 0) instead; it is clear that the first three observations belong to the second class and the 4th to 6th observations belong to the first class.
In multiclass cases there could be several scores > 0, but the code max(Scores, [], 2) is still valid. For example, we could use the following code (from here, an example called Find Multiple Class Boundaries Using Binary SVM) to determine the classes of the predicted Samples:
for j = 1:numel(classes)
    [~, score] = predict(SVMModels{j}, Samples);
    Scores(:, j) = score(:, 2);   % second column contains positive-class scores
end
[~, maxScore] = max(Scores, [], 2);
maxScore will then denote the predicted class of each sample.
2. The BoxConstraint denotes C in the SVM model, so we can train SVMs with different hyperparameter values and select the best one with something like:
gridC = 2.^(-5:2:15);
bestLoss = Inf;
for ii = 1:length(gridC)
    % Cross-validate an RBF SVM for each candidate C
    CVSVMModel = fitcsvm(data3, theclass, 'KernelFunction', 'rbf', ...
        'BoxConstraint', gridC(ii), 'ClassNames', [-1, 1], 'KFold', 5);
    currentLoss = kfoldLoss(CVSVMModel);
    if currentLoss < bestLoss
        % e.g. keep the C (or model) with the lowest cross-validated loss
        bestLoss = currentLoss;
        bestC = gridC(ii);
    end
end
Note: another way to implement this is using libsvm, a fast and easy-to-use SVM toolbox, which has a MATLAB interface.

Neutrality for sentiment analysis in spark

I have built a pretty basic Naive Bayes classifier on Apache Spark, using MLlib of course. But I have a few questions about what exactly neutrality means.
From what I understand, in a given dataset there are pre-labeled sentences which comprise the necessary classes; let's take 3 for the example below.
0-> Negative sentiment
1-> Positive sentiment
2-> Neutral sentiment
This neutral class is pre-labeled in the training set itself.
Is there any other form of neutrality handling? Suppose there are no neutral sentences available in the dataset; is it then possible to derive neutrality from a probability scale, like
0.0 - 0.4 => Negative
0.4 - 0.6 => Neutral
0.6 - 1.0 => Positive
Is this kind of mapping possible in Spark? I searched around but could not find anything. The NaiveBayesModel class in the RDD API has a predict method which just returns a double mapped according to the training set, i.e. if only 0 and 1 are there it will return only 0 or 1, and not a scaled value such as 0.0 - 1.0 as above.
Any pointers/advice on this would be incredibly helpful.
Edit - 1
Sample code
// Performs tokenization, POS tagging and then lemmatization
// Returns an array of strings
val tokenizedString = Util.tokenizeData(text)
val hashingTF = new HashingTF()
//Returns a double
//According to the training set 1.0 => Positive, 0.0 => Negative
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Sample dataset content
1,Awesome movie
0,This movie sucks
Of course the original dataset contains more and longer sentences, but this should be enough for the explanation, I guess.
This is how I am computing the sentiment with the code above. My questions remain the same:
1) Neutrality handling in dataset
If I add another category to the above dataset, such as
2,This movie can be enjoyed by kids
For argument's sake, let's assume that it is a neutral review; then the model.predict method will return either 1.0, 0.0 or 2.0 based on the sentence passed in.
2) model.predictProbabilities gives an array of doubles, but I am not sure in what order it gives the results, i.e. is index 0 for negative or for positive? With three classes, i.e. Negative, Positive, Neutral, in what order will that method return the predictions?
It would have been helpful to have the code that builds the model (for your example to work, the 0 from the dataset must be converted to 0.0 as a Double in the model, either after indexing it with a StringIndexer stage, or because you converted it when reading the file), but assuming that this code works:
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Then yes, it means the probability at index 0 is that of negative and the one at index 1 that of positive (it's a bit strange and there must be a reason, but everything is a double in ML, even feature and category indexes). If you have something like this in your code:
val labelIndexer = new StringIndexer()
.setInputCol("sentiment")
.setOutputCol("indexedsentiment")
.fit(trainingData)
Then you can use labelIndexer.labels to identify the labels (the probability at index 0 is for labelIndexer.labels at index 0), as illustrated below.
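For illustration, here is the same label lookup in PySpark (assuming a DataFrame trainingData with a "sentiment" column, mirroring the Scala snippet above):
from pyspark.ml.feature import StringIndexer
# Fit the indexer and inspect the label order: labels[i] is the class whose
# probability sits at index i of the probability vector.
label_indexer = StringIndexer(inputCol="sentiment", outputCol="indexedsentiment")
indexer_model = label_indexer.fit(trainingData)
print(indexer_model.labels)   # e.g. ['0', '1'] -> index 0 is negative, index 1 positive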
Now regarding your other questions.
Neutrality can mean two different things. Type 1: a review contains as many positive as negative words. Type 2: there is (almost) no sentiment expressed.
A Neutral category can be very helpful if you want to manage Type 2. If that is the case, you need neutral examples in your dataset. Naive Bayes is not a good classifier for applying thresholding to the probabilities in order to detect Type 2 neutrality.
Option 1: Build a dataset (if you think you will have to deal with a lot of Type 2 neutral texts). The good news is, building a neutral dataset is not too difficult. For instance you can pick random texts that are not movie reviews and assume they are neutral. It would be even better if you could pick content that is closely related to movies (but neutral), like a dataset of movie synopsis. You could then create a multi-class Naive Bayes classifier (between neutral, positive and negative) or a hierarchical classifier (first step is a binary classifier that determines whether a text is a movie review or not, second step to determine the overall sentiment).
Option 2 (can be used to deal with both Type 1 and Type 2). As I said, Naive Bayes is not great at dealing with thresholds on the probabilities, but you can try that. Without a dataset, though, it will be difficult to determine the thresholds to use. Another approach is to identify the number of words or stems that have a significant polarity. One quick and dirty way to achieve that is to query your classifier with each individual word and count the number of times it returns "positive" with a probability significantly higher than that of the negative class (discard the word if the probabilities are too close to each other, for instance within 25%; a bit of experimentation will be needed here). At the end, you may end up with, say, 20 positive words vs 15 negative ones and decide the review is neutral because the counts are balanced, or, if you have 0 positive and 1 negative, return neutral because the count of polarized words is too low.
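A minimal sketch of that counting heuristic in Python (word_scores is a hypothetical stand-in for querying your classifier with a single word, and the margin/ratio values are arbitrary starting points to experiment with):
def classify_with_neutral(words, word_scores, margin=0.25, balance_ratio=0.7, min_polarized=3):
    # word_scores(word) should return (p_negative, p_positive) for one word.
    pos, neg = 0, 0
    for w in words:
        p_neg, p_pos = word_scores(w)
        if abs(p_pos - p_neg) < margin:        # too close to call: ignore the word
            continue
        if p_pos > p_neg:
            pos += 1
        else:
            neg += 1
    if pos + neg < min_polarized:              # almost no sentiment expressed (Type 2)
        return "Neutral"
    if min(pos, neg) / max(pos, neg) >= balance_ratio:   # counts roughly balanced (Type 1)
        return "Neutral"
    return "Positive" if pos > neg else "Negative"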
Good luck and hope this helped.
I am not sure if I understand the problem but:
the prior in Naive Bayes is computed from the data and cannot be set manually.
in MLlib you can use predictProbabilities to obtain class probabilities.
in ML you can use setThresholds to set a prediction threshold for each class; a sketch follows below.
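A minimal PySpark sketch of the thresholds option in the DataFrame-based API (the column names, the three assumed classes and the threshold values are illustrative, not taken from your code):
from pyspark.ml.classification import NaiveBayes
# Thresholds rescale class probabilities before the argmax: the class with the
# largest probability/threshold ratio wins, so a smaller threshold makes that
# class easier to predict. One value per class, in label-index order
# (here assumed to be negative, positive, neutral).
nb = NaiveBayes(featuresCol="features", labelCol="label",
                thresholds=[1.0, 1.0, 0.5])
model = nb.fit(trainingData)   # trainingData: a DataFrame with features/label columns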

Different levels of accuracy for three classes in naive bayes

I am new to machine learning. I am classifying tweets into three classes using a term-frequency feature. My training and test data are balanced across all classes, with a 70-30% split for each class. But accuracy is very different for the three classes: class one is high (92% accurate), class two medium (53% accurate) and class three very low (15% accurate).
Can anyone please tell what can possibly be wrong with my algorithm?
I have three classes: information, neutral and metaphor.
# Create term-document matrices
tweets.information.matrix <- t(TermDocumentMatrix(tweets.information.corpus, control = list(wordLengths = c(4, Inf))))
tweets.metaphor.matrix <- t(TermDocumentMatrix(tweets.metaphor.corpus, control = list(wordLengths = c(4, Inf))))
tweets.neutral.matrix <- t(TermDocumentMatrix(tweets.neutral.corpus, control = list(wordLengths = c(4, Inf))))
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus, control = list(wordLengths = c(4, Inf))))
and this is how the probabilities are calculated:
probabilityMatrix <- function(docMatrix)
{
  # Sum up the term frequencies
  termSums <- cbind(colnames(as.matrix(docMatrix)), as.numeric(colSums(as.matrix(docMatrix))))
  # Add one (Laplace smoothing)
  termSums <- cbind(termSums, as.numeric(termSums[, 2]) + 1)
  # Calculate the probabilities
  termSums <- cbind(termSums, as.numeric(termSums[, 3]) / sum(as.numeric(termSums[, 3])))
  # Calculate the natural log of the probabilities
  termSums <- cbind(termSums, log(as.numeric(termSums[, 4])))
  # Add pretty names to the columns
  colnames(termSums) <- c("term", "count", "additive", "probability", "lnProbability")
  termSums
}
Results for information are highly accurate, for neutral moderately accurate, and for metaphor very low.

Should changing a model's intercept change its Precision and Recall?

For simplicity's sake -- just setting up a very basic logistic regression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, classification_report
train_grad_des = SGDClassifier(alpha=alpha_optimum, l1_ratio=l1_optimum, loss='log')
train_grad_des.fit(train_x, train_y)
For analysis, create an array score_y of predicted probabilities:
score_y = train_grad_des.predict_proba(test_x)
precision, recall, thresholds = precision_recall_curve(test_y, score_y[:,1], pos_label=1)
If I set train_grad_des.intercept_ = 100, will that change the probabilities returned by train_grad_des.predict_proba(test_x)?
It seems like the probabilities should not change; they would just all be shifted 'over' in one direction. And if the returned probabilities remain unchanged, shouldn't precision and recall at the various thresholds remain unchanged as well?
I've been testing this with a model, and am finding that the precision and recall are drastically changed when I alter the intercept, but it's not clear to me if this should happen, and if it should, why it does.
Why shouldn't the probabilities change? If we peel back some layers and look at the model being fit (binary logistic regression with one covariate):
logit(p(y = 1|x)) = alpha + beta*x
Where logit is the log odds in favor of being in class 1, that is:
logit(p(Y=1)) = log( p(Y=1)/p(Y=0) )
You can interpret alpha, the intercept, as related to the prior log odds of belonging to a certain class. So if you change alpha you change the log odds in favor of being in class 1.
Another way to look at it:
p(y=1|x) = 1 / (1 + e^(-(alpha + beta*x)))
Clearly if I change alpha my probabilities will change.
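To see this numerically, here is a minimal sketch with synthetic data (the dataset, the +100 shift and the loss name are illustrative; older scikit-learn versions spell the loss 'log', as in your snippet):
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

clf = SGDClassifier(loss='log_loss', random_state=0).fit(X, y)
print(clf.predict_proba(X[:3]))                                  # probabilities with the fitted intercept
print(precision_score(y, clf.predict(X)), recall_score(y, clf.predict(X)))

clf.intercept_ = clf.intercept_ + 100                            # shift alpha as in the question
print(clf.predict_proba(X[:3]))                                  # probabilities are pushed towards class 1
print(precision_score(y, clf.predict(X)), recall_score(y, clf.predict(X)))
# Precision/recall at the default 0.5 cut-off change, even though the ranking
# of the observations is preserved by the monotone shift.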

How to use sample weights for a random forest classificator in Orange?

I am trying to train a random forest classifier on a very imbalanced dataset with 2 classes (benign/malign).
I have seen and followed the code from a previous question (How to set up and use sample weight in the Orange python package?) and tried setting various higher weights on the minority-class instances, but the classifiers I get behave exactly the same.
My code:
import Orange

data = Orange.data.Table(filename)
st = Orange.classification.tree.SimpleTreeLearner(min_instances=3)
forest = Orange.ensemble.forest.RandomForestLearner(learner=st, trees=40, name="forest")
weight = Orange.feature.Continuous("weight")
weight_id = -10
data.domain.add_meta(weight_id, weight)
data.add_meta_attribute(weight, 1.0)
for inst in data:
    if inst[data.domain.class_var] == 'malign':
        inst[weight] = 100
classifier = forest(data, weight_id)
Am I missing something?
SimpleTreeLearner is simple: it is optimized for speed and does not support weights. I guess learning algorithms in Orange that do not support weights should raise an exception if the weight argument is specified.
If you need weights just to change the class distribution, multiply data instances instead: create a new data table and add 100 copies of each instance of a malignant tumor, for example as sketched below.
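A rough sketch of that idea, assuming the Orange 2.x API used in the question and that Orange.data.Table(domain) creates an empty table supporting append (check this against your Orange version; the 99 extra copies mirror the weight of 100 above):
import Orange

data = Orange.data.Table(filename)
oversampled = Orange.data.Table(data.domain)
for inst in data:
    oversampled.append(inst)
    if inst[data.domain.class_var] == 'malign':
        for _ in range(99):          # 99 extra copies -> malign instances counted 100x
            oversampled.append(inst)

classifier = forest(oversampled)     # forest as defined in the question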