How can I calculate Precision and Recall for sentiment analysis multi-class classifier using Confusion Matrix? - classification

I wonder how to compute precision and recall for a multi-class sentiment analysis classifier using a confusion matrix. I have a dataset of 5000 texts, and I had a sample of 100 labeled by humans. Now, I would like to compute the precision and recall of the classifier based on this sample. I have three classes: Positive, Neutral and Negative.
So how can I compute these metrics for each class?
As I am new here on Stack Overflow, I couldn't illustrate the confusion matrix I have, so let us assume that we have the following confusion matrix:
red color > Negative
green color > Positive
purple color > Neutral

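For the Positive class, you can measure
precision = TPos/(TPos+FPos), where FPos counts the Negative and Neutral texts predicted as Positive, i.e. 30/(30+20+10) = 50%,
recall = TPos/(TPos+FNeg+FNeu), where FNeg and FNeu count the Positive texts predicted as Negative and Neutral, i.e. 30/(30+50+20) = 30%,
F-measure = 2*precision*recall/(precision+recall) = 37.5%, and
accuracy = (all true)/(all data) = (30+60+80)/300 = 56.7%.
Repeat the precision, recall and F-measure calculations with the Negative and Neutral classes to get the metrics for each class.
For more, see http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/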
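The same arithmetic can be sketched in Python for all three classes. The matrix below is hypothetical: only the Positive row, the Positive column and the diagonal (30, 60, 80) match the numbers used above, and the remaining cells are made up so that the total is 300. It also assumes that rows are the human labels and columns are the predictions.

```python
# Hypothetical confusion matrix: rows = human label, columns = predicted label.
classes = ["Positive", "Negative", "Neutral"]
cm = [
    [30, 50, 20],  # actual Positive
    [20, 60, 20],  # actual Negative
    [10, 10, 80],  # actual Neutral
]


def per_class_metrics(cm, k):
    """Return (precision, recall, F1) for class index k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly labeled k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

for k, name in enumerate(classes):
    p, r, f = per_class_metrics(cm, k)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```

If your matrix is stored the other way round (rows = predictions), swap the row and column sums.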

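You can use sklearn's classification report (classification_report in sklearn.metrics), which lists precision, recall and F1-score for each class.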
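A minimal sketch of that call, assuming scikit-learn is installed. The two label lists are hypothetical stand-ins for the 100 human labels and the matching classifier predictions.

```python
from sklearn.metrics import classification_report

# Hypothetical data: human labels and classifier output for the same texts.
y_true = ["Positive", "Negative", "Neutral", "Positive",
          "Neutral", "Negative", "Positive", "Neutral"]
y_pred = ["Positive", "Negative", "Neutral", "Negative",
          "Neutral", "Neutral", "Positive", "Positive"]

labels = ["Positive", "Neutral", "Negative"]

# Text table with one row per class.
print(classification_report(y_true, y_pred, labels=labels))

# The same numbers as a nested dict, convenient for further processing.
report = classification_report(y_true, y_pred, labels=labels, output_dict=True)
print(report["Positive"]["precision"], report["Positive"]["recall"])
```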

Related

ROC curve from SVM classifier is visualised with limited thresholds in Python

I am trying to plot a ROC curve to evaluate my classifier, but my ROC plot is not "smooth". Is it some problem with the thresholds? I am quite new to classification in Python, so there is probably something wrong with my code. See image below. Where should I look for a solution?
I used drop_intermediate=False, but it does not help.
This is because you are passing 0 and 1 values (predicted labels) to the plotting function. Hard labels correspond to a single cutoff, so the curve has only one interior point. To appear "smooth", the ROC curve needs continuous scores, for example floats in the range 0.0 to 1.0 (predicted label probabilities), so that it can consider multiple cutoff values.
Whatever classifier you are using, make sure y_train_pred contains float scores rather than labels, for example values in the range [0.0,1.0]. If you have a scoring classifier with values in the range (-∞,+∞), you can apply a sigmoid function to remap the values to this range; the sigmoid is monotonic, so the ordering of the scores, and therefore the ROC curve, is unchanged.
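A minimal sketch of the difference, assuming scikit-learn and NumPy. The synthetic dataset, the linear SVC and all variable names are illustrative and not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem standing in for the real data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear").fit(X_train, y_train)

hard_labels = clf.predict(X_test)        # 0/1 labels: a single cutoff
scores = clf.decision_function(X_test)   # signed distances to the separating hyperplane

# With labels the curve has at most three points; with scores it has one
# point per distinct score (plus the origin) when drop_intermediate=False.
fpr_hard, tpr_hard, _ = roc_curve(y_test, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_test, scores, drop_intermediate=False)

print(len(fpr_hard), len(fpr_soft))
```

Plotting fpr_soft against tpr_soft gives the finer-grained curve. SVC(probability=True) with predict_proba is another way to obtain scores in [0, 1].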

Cholesky decomposition for simulating correlated random variables

I have a correlation matrix for N random variables. Each of them is uniformly distributed within [0,1]. I am trying to simulate these random variables. How can I do that? Note that N > 2. I was trying to use the Cholesky decomposition, and below are my steps:
get the lower triangle of the correlation matrix (L=N*N)
independently sample 10000 times for each of the N uniformly distributed random variables (S=N*10000)
multiply the two: L*S. This gives me correlated samples, but their range is no longer within [0,1].
How can I solve the problem?
I know that if I only have 2 random variables I can do something like:
rho*x1+sqrt(1-rho^2)*y1
to get my correlated sample y. But with more than two correlated variables, I am not sure what I should do.
You can get approximate solutions by generating correlated normals using the Cholesky factorization, then converting them to U(0,1)'s using the normal CDF. The solution is approximate because the normals have the desired correlation, but converting to uniforms is a non-linear transformation, and only linear transformations preserve correlation.
There's a transformation available which will give exact solutions if the transformed Var/Cov matrix is positive semidefinite, but that's not always the case. See the abstract at https://www.tandfonline.com/doi/abs/10.1080/03610919908813578.
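The approximate approach (correlated normals via Cholesky, then the normal CDF) can be sketched as follows, assuming NumPy. The 3x3 correlation matrix R is a hypothetical example, and the resulting uniforms only approximately reproduce it, as described above.

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target correlation matrix for N = 3 variables.
R = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

L = np.linalg.cholesky(R)             # lower-triangular factor, R = L @ L.T
Z = rng.standard_normal((3, 10_000))  # independent standard normals, one row per variable
X = L @ Z                             # correlated normals with correlation matrix R

# Standard normal CDF applied element-wise maps each row to U(0,1).
normal_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))
U = normal_cdf(X)

print(U.min(), U.max())               # stays inside [0, 1]
print(np.corrcoef(U).round(2))        # close to R, but not exactly equal
```

The key difference from the steps in the question is that the Cholesky factor is applied to normal samples, not to the uniform samples, so the final values stay in [0,1].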

How to find the training accuracy using fitcsvm?

I would like to find the predicted labels of the training feature vectors while training the classifier. I am using MDL=fitcsvm(train_data,train_labels) in MATLAB. MDL has a number of properties, but none of them corresponds to the training accuracy. Is there any way to find it?
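You can apply cross-validation to estimate the accuracy on unseen data; for the accuracy on the training data itself, use the resubstitution loss:
xval = crossval(MDL,'KFold',10);   % 10-fold cross-validated model
kfoldLoss(xval)                    % cross-validated misclassification rate (accuracy = 1 - loss)
1 - resubLoss(MDL)                 % accuracy on the training data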

Match template histogram with testing histogram

How can we calculate the percentage of similarity between two histogram patterns?
For example, I have a template histogram, which I call HistA, and another histogram, HistB, and I want to check how similar HistB is to HistA as a percentage.
I tried some methods such as histogram equalization and histogram matching, but none of them solves my problem.
As shown in the image below, I plotted HistA and HistB together. The frequencies actually come from 1D data.
The patterns of HistA and HistB look almost the same, so I want to know how to calculate the percentage of similarity between these two histograms.
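Measure the Bhattacharyya coefficient between the two normalized histograms $a$ and $b$ as
$$BC(a, b) = \sum_{i=1}^{N} \sqrt{a_i \, b_i}$$
where N is the number of bins in the histograms.
Note the normalization: each histogram must sum to 1, so the coefficient ranges from 0 (no overlap) to 1 (identical histograms).
For more information, see Bhattacharyya distance (Wikipedia) or On a measure of divergence between two statistical populations defined by their probability distributions.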
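A minimal sketch of the coefficient in Python, assuming NumPy and that both histograms were built on the same bin edges. The function name and the bin counts for HistA and HistB are hypothetical.

```python
import numpy as np


def bhattacharyya_coefficient(hist_a, hist_b):
    """Overlap of two histograms with identical bins: 0 = disjoint, 1 = identical shape."""
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("histograms must have the same number of bins")
    a = a / a.sum()  # normalize so each histogram sums to 1
    b = b / b.sum()
    return float(np.sum(np.sqrt(a * b)))


# Hypothetical bin counts for the template and the test histogram.
hist_a = [5, 20, 40, 20, 5]
hist_b = [4, 22, 38, 21, 6]

similarity = bhattacharyya_coefficient(hist_a, hist_b)
print(f"similarity: {100 * similarity:.1f}%")
```

Multiplying the coefficient by 100 gives one possible "percentage of similarity"; because of the normalization it compares the shapes of the histograms, not their total counts.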

2D weighted Kernel Density Estimation(KDE) in MATLAB

I'm looking for available code that can estimate the kernel density of a set of 2D weighted points. So far I have found this option for non-weighted 2D KDE in MATLAB: http://www.mathworks.com/matlabcentral/fileexchange/17204-kernel-density-estimation
However, it does not support weights. Is there any other implemented function or library that would come in handy for this? I thought about "hacking" the problem: suppose I have a simple weight vector [2 1 3 1]; I can literally just repeat each sampled point twice, once, three times and once, respectively. I'm not sure if this computation would be mathematically valid, though. The other issue is that the weight vector I have is decimal, so normalizing by the minimum of the vector and then scaling every other entry introduces rounding errors, especially if the weights are of the same order of magnitude.
Note: The ksdensity function in MATLAB has the weighted option but it is only for 1D data.
Found this, so problem solved. (I guess): http://www.ics.uci.edu/~ihler/code/kde.html
I used this function and found it to be excellent. I discuss varying the n parameter (area over which density is calculated) in this Stack Overflow post, and it contains some examples of 2D KDE plots using contour3.
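Independently of the linked toolbox, the weighted estimate asked about in the question can be sketched directly. This is an illustration in Python/NumPy rather than MATLAB, assuming an isotropic Gaussian kernel with a fixed, hand-picked bandwidth; the function name, points and bandwidth are hypothetical. It also shows that repeating points according to integer weights gives the same density as passing the weights, and that decimal weights need no rounding.

```python
import numpy as np


def weighted_kde_2d(points, weights, query, bandwidth):
    """Weighted Gaussian KDE: density at each query location (isotropic kernel)."""
    points = np.asarray(points, dtype=float)  # shape (n, 2)
    query = np.asarray(query, dtype=float)    # shape (m, 2)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # weights act as a probability mass per point

    diff = query[:, None, :] - points[None, :, :]  # (m, n, 2) pairwise offsets
    sq_dist = np.sum(diff ** 2, axis=2)            # (m, n) squared distances
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2) / (2.0 * np.pi * bandwidth ** 2)
    return kernel @ w                              # (m,) weighted sum of kernels


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[0.5, 0.5], [0.0, 0.2]])

# Integer weights versus literally repeating each point that many times.
by_weight = weighted_kde_2d(points, [2, 1, 3, 1], query, bandwidth=0.4)
repeated = np.repeat(points, [2, 1, 3, 1], axis=0)
by_repeat = weighted_kde_2d(repeated, np.ones(len(repeated)), query, bandwidth=0.4)
print(np.allclose(by_weight, by_repeat))

# Decimal weights work the same way, with no rounding step.
print(weighted_kde_2d(points, [0.25, 0.1, 0.4, 0.25], query, bandwidth=0.4))
```

In practice the bandwidth would be chosen by a selection rule, not fixed by hand. SciPy's gaussian_kde also accepts a weights argument if a library routine is preferred.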