Binary classification (labels 0 & 1): which one is considered to be 'positive' when calculating recall, precision, etc.? - classification

When using pycaret to do binary classification (label 0 and 1), which one is considered to be 'positive' when calculating recall, precision etc.?
For example, I'm trying to build a model to predict whether a patient has a certain disease (0 = negative, 1 = positive). I want a high recall so that cases of the disease are not missed. When I plot the confusion matrix, 0 appears where 'positive' would be in a conventional confusion matrix. I'm confused. Do I need to switch 0 and 1?
Any help is appreciated!

Maybe a solution is to create a 'manual' plot rather than using the integrated package. You can change the layout of the heatmap if you like.
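import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
# Rows of `matrix` are actual labels, columns are predicted labels (sorted: 0, 1)
matrix = confusion_matrix(y_test, y_pred)
# After transposing, rows (y-axis) are predictions and columns (x-axis) are actuals
sns.heatmap(matrix.T, annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.ylabel("Predictions")
plt.xlabel("Actuals")
# Flip both axes so that label 1 (positive) sits in the top-left corner
plt.ylim(0,2)
plt.xlim(2,0)
plt.show()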
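On the original question of which label counts as positive: scikit-learn's metric functions such as recall_score use pos_label=1 by default, and confusion_matrix sorts the labels, so 0 lands in the first row and column. That ordering is why 0 shows up where a textbook matrix puts 'positive', and it does not by itself mean that 0 is scored as the positive class. Whether pycaret follows the same conventions is an assumption worth checking against its documentation. A minimal sketch in plain Python, with made-up labels and hypothetical helper names, shows how much the choice of positive label changes precision and recall:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the chosen positive label."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = disease present, 0 = disease absent
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

for label in (1, 0):
    p, r = precision_recall(y_true, y_pred, positive=label)
    print(f"positive={label}: precision={p:.3f}, recall={r:.3f}")
```

If recall for label 1 (disease present) is the number that matters, no relabeling should be needed as long as the library scores label 1 as positive.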

Related

MFCC spectrogram vs SciPy Spectrogram

I am currently working on a Convolutional Neural Network (CNN) and started to look at different spectrogram plots:
With regard to the Librosa plot (MFCC), the spectrogram is very different from the other spectrogram plots. I took a look at the comment posted here about the "undetailed" MFCC spectrogram. How can I accomplish, in Python code, the task described in the solution given there?
Also, would this low-resolution MFCC plot miss any nuances as the images go through the CNN?
Any help in carrying out the Python code mentioned here will be sincerely appreciated!
Here is my Python code for the comparison of the Spectrograms and here is the location of the wav file being analyzed.
Python Code
# Load various imports
import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
import scipy.io.wavfile
#24bit accessible version
import wavfile
plt.figure(figsize=(17, 30))
filename = 'AWCK AR AK 47 Attached.wav'
librosa_audio, librosa_sample_rate = librosa.load(filename, sr=None)
plt.subplot(4,1,1)
xmin = 0
plt.title('Original Audio - 24BIT')
fig_1 = plt.plot(librosa_audio)
sr = librosa_sample_rate
plt.subplot(4,1,2)
mfccs = librosa.feature.mfcc(y=librosa_audio, sr=librosa_sample_rate, n_mfcc=40)
librosa.display.specshow(mfccs, sr=librosa_sample_rate, x_axis='time', y_axis='hz')
plt.title('Librosa Plot')
print(mfccs.shape)
plt.subplot(4,1,3)
X = librosa.stft(librosa_audio)
Xdb = librosa.amplitude_to_db(abs(X))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
# plt.colorbar()
# maximum frequency
Fs = 96000.
samplerate, data = scipy.io.wavfile.read(filename)
plt.subplot(4,1,4)
plt.specgram(data, Fs=samplerate)
plt.title('Scipy Plot (Fs=96000)')
plt.show()
MFCCs are not spectrograms (time-frequency) but "cepstrograms" (time-cepstrum). Comparing an MFCC with a spectrogram visually is not easy, and I am not sure it is very useful either. If you wish to do so, invert the MFCC to get back a (mel) spectrogram by applying an inverse DCT. You can probably use librosa.feature.inverse.mfcc_to_mel for that.
This lets you estimate how much data was lost in the forward MFCC transformation. But it may not say much about how much information relevant to your task was lost, or how much irrelevant noise was removed.
This needs to be evaluated for your task and dataset. The best way is to try different settings, and evaluate performance using the evaluation metrics that you care about.
Note that MFCCs may not be such a great representation for the typical 2D CNNs that are applied to spectrograms. That is because locality has been reduced: in the MFCC domain, frequencies that are close to each other are no longer next to each other on the vertical axis. And because 2D CNNs have kernels with limited locality (typically 3x3 or 5x5 early on), this can reduce the performance of the model.
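The inversion idea can be illustrated without any audio library. A minimal sketch, assuming an orthonormal DCT-II applied to a single log-mel frame: keeping only the first few coefficients (roughly what MFCC extraction does) and applying the inverse DCT gives back a smoothed version of the frame, and the reconstruction error measures what was discarded. The frame values and the helper names dct2 and idct2 are made up for illustration; on real MFCCs, librosa.feature.inverse.mfcc_to_mel would be the call to try.

```python
import math


def dct2(frame):
    """Orthonormal DCT-II of one frame (list of floats)."""
    n = len(frame)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(frame))
        coeffs.append(scale * total)
    return coeffs


def idct2(coeffs, n):
    """Inverse of dct2; coefficients beyond len(coeffs) are treated as zero."""
    frame = []
    for i in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            value += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        frame.append(value)
    return frame


# Hypothetical log-mel energies (dB) for one time frame with 8 mel bands
log_mel_frame = [-20.0, -18.5, -12.0, -6.0, -9.5, -15.0, -22.0, -30.0]
n_bands = len(log_mel_frame)

coeffs = dct2(log_mel_frame)
n_kept = 3  # keep only the first 3 cepstral coefficients
reconstructed = idct2(coeffs[:n_kept], n_bands)

# Squared error between the original frame and the smoothed reconstruction
lost_energy = sum((a - b) ** 2 for a, b in zip(log_mel_frame, reconstructed))
# Energy carried by the coefficients that were thrown away
discarded_energy = sum(c ** 2 for c in coeffs[n_kept:])

print("reconstructed:", [round(v, 2) for v in reconstructed])
print("lost energy:", round(lost_energy, 4))
print("discarded coefficient energy:", round(discarded_energy, 4))
```

Because the transform is orthonormal, the two energy figures agree, so the discarded coefficients give a direct measure of how much of the frame the truncated representation cannot reproduce.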

Unable to make sense of the confusion matrix returned by SVM

I am trying to understand why the SVM classifier is not able to correctly classify my data. I have presented only 10 samples (XX) out of the 2000 samples in my original data. I cannot make sense of the confusion matrix returned by Matlab. I used an SVM classifier. Is my code wrong, especially the way I did cross-validation?
XX is normalized to X, and Y is the label. Each feature vector is of length 8.
Question: Can somebody please help me tackle this issue?
           pred 0   pred 1
actual 0      100        0
actual 1      100        0
Thank you
You have:
an unbalanced data set (7 and 3 samples),
an 8-dimensional feature space and only 7 and 3 samples, which are very much insufficient to fill it (see curse of dimensionality), and
you're only using half those samples to train, meaning you're even further away from filling the feature space.
Thus, I am not surprised that the generalization that the SVM came up with is to classify everything as "class 0".
Try using only one of the features (first column of XX), and use leave-one-out cross validation.
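The leave-one-out suggestion can be sketched as follows. This is only an illustration in plain Python, not the MATLAB SVM from the question: a nearest-class-mean rule stands in for the SVM, and the ten one-feature samples (7 of class 0, 3 of class 1) are made up. Each sample is held out once, the classifier is fit on the other nine, and the held-out prediction is tallied in a confusion matrix.

```python
def nearest_mean_predict(train_x, train_y, x):
    """Predict the class whose training mean is closest to x (single feature)."""
    means = {}
    for label in set(train_y):
        values = [v for v, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return min(means, key=lambda label: abs(means[label] - x))


def leave_one_out(xs, ys):
    """Return a 2x2 confusion matrix indexed as matrix[actual][predicted]."""
    matrix = [[0, 0], [0, 0]]
    for i in range(len(xs)):
        # Train on every sample except the i-th, then test on the i-th
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predicted = nearest_mean_predict(train_x, train_y, xs[i])
        matrix[ys[i]][predicted] += 1
    return matrix


# Hypothetical data: one feature, 7 samples of class 0 and 3 of class 1
xs = [0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40, 0.80, 0.85, 0.90]
ys = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

matrix = leave_one_out(xs, ys)
print("          pred 0  pred 1")
print(f"actual 0  {matrix[0][0]:6d}  {matrix[0][1]:6d}")
print(f"actual 1  {matrix[1][0]:6d}  {matrix[1][1]:6d}")
```

With so few samples, leave-one-out uses nine of the ten points for every fit, instead of the half that a two-fold split would leave for training.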

How can I calculate Precision and Recall for sentiment analysis multi-class classifier using Confusion Matrix?

I wonder how to compute precision and recall from the confusion matrix of a multi-class sentiment analysis classifier. I have a dataset of 5000 texts and I did human labeling for a sample of 100. Now, I would like to compute the Precision and Recall for the classifier based on this sample of data. I have three classes: Positive, Neutral and Negative.
So how can I compute these metrics for each class?
As I am new to Stack Overflow, I couldn't illustrate the confusion matrix I have, so let us assume that we have the following confusion matrix:
red color > Negative
green color > Positive
purple color> Neutral
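You can measure, for the Positive class:
precision = TPos/(TPos+FPos), where FPos counts the Negative and Neutral texts predicted as Positive, i.e. 30/(30+20+10) = 50%,
recall = TPos/(TPos+FNeg+FNeu), where FNeg and FNeu count the Positive texts predicted as Negative and Neutral, i.e. 30/(30+50+20) = 30%,
F-measure = 2*precision*recall/(precision+recall) = 37.5%, and
accuracy = (all true)/(all data) = (30+60+80)/300 = 56.7%.
For more, see http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/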
You can use sklearn's classification report.
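The per-class calculation can be sketched in plain Python. The matrix below is an assumption: the original figure is not available, so the entries were chosen only to be consistent with the numbers quoted above (30 true positives, 20 and 10 false positives, 50 and 20 misses, diagonal 30/60/80, 300 texts in total). Rows are actual classes and columns are predicted classes; the function name is hypothetical.

```python
def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} for a square confusion matrix
    whose rows are actual classes and whose columns are predicted classes."""
    results = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column sum
        actually_label = sum(matrix[i])                     # row sum
        precision = true_pos / predicted_as_label if predicted_as_label else 0.0
        recall = true_pos / actually_label if actually_label else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Assumed matrix, consistent with the figures quoted in the answer above
labels = ["Positive", "Negative", "Neutral"]
matrix = [
    [30, 50, 20],  # actual Positive
    [20, 60, 20],  # actual Negative
    [10, 10, 80],  # actual Neutral
]

for label, (p, r, f) in per_class_metrics(matrix, labels).items():
    print(f"{label:8s} precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"accuracy={accuracy(matrix):.3f}")
```

sklearn.metrics.classification_report prints the same per-class figures when given the true and predicted label arrays rather than the matrix.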

How to fit a poisson distribution with seaborn?

I try to fit my data to a poisson distribution:
import seaborn as sns
import scipy.stats as stats
sns.distplot(x, kde = False, fit = stats.poisson)
But I get this error:
AttributeError: 'poisson_gen' object has no attribute 'fit'
Other distributions (gamma, etc.) do work well.
The Poisson distribution (implemented in scipy as scipy.stats.poisson) is a discrete distribution. The discrete distributions in scipy do not have a fit method.
I'm not very familiar with the seaborn.distplot function, but it appears to assume that the data comes from a continuous distribution. If that is the case, then even if scipy.stats.poisson had a fit method, it would not be an appropriate distribution to pass to distplot.
The question title is "How to fit a poisson distribution with seaborn?", so for the sake of completeness, here's one way to get a plot of the data and its fit. seaborn is only used for the bar plot, using @mwaskom's suggestion to use seaborn.countplot. The fitting is actually trivial, because the maximum likelihood estimate of the Poisson parameter is simply the mean of the data.
First, the imports:
In [136]: import numpy as np
In [137]: from scipy.stats import poisson
In [138]: import matplotlib.pyplot as plt
In [139]: import seaborn
Generate some data to work with:
In [140]: x = poisson.rvs(0.4, size=100)
These are the integer values from 0 up to the maximum of x:
In [141]: k = np.arange(x.max()+1)
In [142]: k
Out[142]: array([0, 1, 2, 3])
Use seaborn.countplot to plot the data:
In [143]: seaborn.countplot(x=x, order=k, color='g', alpha=0.5)
Out[143]: <matplotlib.axes._subplots.AxesSubplot at 0x114700490>
The maximum likelihood estimation of the Poisson parameter is simply the mean of the data:
In [144]: mlest = x.mean()
Use poisson.pmf() to get the expected probability, and multiply by the size of the data set to get the expected counts, and then plot using matplotlib. The bars are the counts of the actual data, and the dots are the expected counts of the fitted distribution:
In [145]: plt.plot(k, poisson.pmf(k, mlest)*len(x), 'go', markersize=9)
Out[145]: [<matplotlib.lines.Line2D at 0x114da74d0>]
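The same fit can be checked numerically without any plotting. A minimal sketch in plain Python, using a small made-up sample and hypothetical helper names: the maximum likelihood estimate is the sample mean, and the expected count for each value k is the Poisson probability of k times the sample size.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def fit_poisson(data):
    """Maximum likelihood estimate of the Poisson rate: the sample mean."""
    return sum(data) / len(data)


def observed_and_expected(data):
    """Return (lam, rows), where each row is (k, observed count, expected count)."""
    lam = fit_poisson(data)
    counts = Counter(data)
    rows = [(k, counts[k], poisson_pmf(k, lam) * len(data))
            for k in range(max(data) + 1)]
    return lam, rows


# Hypothetical count data
data = [0, 0, 0, 1, 0, 2, 0, 1, 0, 0]

lam, rows = observed_and_expected(data)
print(f"estimated rate: {lam:.2f}")
for k, observed, expected in rows:
    print(f"k={k}: observed={observed}, expected={expected:.2f}")
```

The observed column corresponds to the bars of the count plot and the expected column to the dots drawn with poisson.pmf.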

2D weighted Kernel Density Estimation(KDE) in MATLAB

I'm looking for available code that can estimate the kernel density of a set of 2D weighted points. So far I have found this option for non-weighted 2D KDE in MATLAB: http://www.mathworks.com/matlabcentral/fileexchange/17204-kernel-density-estimation
However, it does not incorporate the weighted feature. Is there any other implemented function or library that would come in handy for this? I thought about "hacking" the problem: suppose I have a simple weight vector [2 1 3 1]; I can literally just repeat each sampled point twice, once, three times, and once, respectively. I'm not sure whether this computation would be mathematically valid, though. The other issue is that my weight vector is decimal, so normalizing by the minimum of the vector and then multiplying every other entry introduces rounding errors, especially if the weights are of the same order of magnitude.
Note: The ksdensity function in MATLAB has the weighted option but it is only for 1D data.
Found this, so problem solved. (I guess): http://www.ics.uci.edu/~ihler/code/kde.html
I used this function and found it to be excellent. I discuss varying the n parameter (area over which density is calculated) in this Stack Overflow post, and it contains some examples of 2D KDE plots using contour3.
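On the question of whether repeating points is mathematically valid: for a fixed bandwidth it is, because a kernel density estimate is a sum of kernels, so repeating a point w times is the same as giving it weight w. Writing the weights into the sum directly also removes the need for integer weights. A minimal sketch in plain Python (not MATLAB), assuming an isotropic Gaussian kernel with a hand-picked bandwidth; the points, bandwidth, and function name are made up for illustration.

```python
import math


def weighted_kde_2d(points, weights, bandwidth, query):
    """Weighted Gaussian KDE evaluated at `query` = (x, y).
    Weights are normalized to sum to 1, so they may be any positive numbers."""
    total_weight = sum(weights)
    qx, qy = query
    # Normalizing constant of a 2D isotropic Gaussian kernel
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    density = 0.0
    for (px, py), w in zip(points, weights):
        sq_dist = (qx - px) ** 2 + (qy - py) ** 2
        kernel = norm * math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        density += (w / total_weight) * kernel
    return density


# Hypothetical sample points with the weight vector from the question
points = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0), (0.5, 1.5)]
weights = [2, 1, 3, 1]
query = (1.0, 1.0)
bandwidth = 0.5

# The "hack": repeat each point according to its integer weight
repeated = [p for p, w in zip(points, weights) for _ in range(w)]

d_weighted = weighted_kde_2d(points, weights, bandwidth, query)
d_repeated = weighted_kde_2d(repeated, [1] * len(repeated), bandwidth, query)
d_decimal = weighted_kde_2d(points, [0.2, 0.1, 0.3, 0.1], bandwidth, query)

print(d_weighted, d_repeated, d_decimal)
```

All three densities agree up to floating-point rounding. The equivalence only holds when the bandwidth is fixed: an automatic bandwidth selector would see a larger sample after the points are repeated and could choose a different value.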