ROC curve from SVM classifier is visualised with limited thresholds in Python - classification

I am trying to plot a ROC curve to evaluate my classifier, but the plot is not "smooth". Is this a problem with the thresholds? I am quite new to classification in Python, so there is probably something wrong with my code. See the image below. Where should I look for a solution?
I used drop_intermediate=False, but it does not help.

This is because you are passing 0 and 1 values (predicted labels) to the plotting function. With hard labels there is only one cutoff to evaluate, so the curve collapses to a few points. The ROC curve needs continuous scores, for example floats in the range 0.0 to 1.0 (predicted label probabilities), so that it can consider multiple cutoff values and appear more "smooth" as a result.
Whatever classifier you are using, make sure y_train_pred contains float scores rather than labels, for example values in the range [0.0,1.0]. If you have a scoring classifier with values in the range [-∞,+∞], you can apply a sigmoid function to remap the values to this range.
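To illustrate the point, here is a minimal dependency-free sketch that counts how many ROC points you get from hard labels versus continuous scores. The helper roc_points and the numbers are made up for illustration; they are not the asker's data. In scikit-learn you would normally get such scores from SVC.decision_function (or predict_proba if the model was fitted with probability=True) and pass them to sklearn.metrics.roc_curve.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs: the origin plus one point per distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Every distinct score value acts as one candidate threshold.
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= th and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= th and y == 0)
        points.append((fp / neg, tp / pos))
    return points


# Made-up ground truth and made-up SVM-style decision values.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [-1.3, -0.2, 0.4, 1.7, 0.1, -0.5, 2.2, -2.0]
# Hard labels, as a predict()-style call would return them.
labels = [1 if s > 0 else 0 for s in scores]

curve_from_scores = roc_points(y_true, scores)
curve_from_labels = roc_points(y_true, labels)

print(len(curve_from_labels), "points from labels")
print(len(curve_from_scores), "points from scores")
```

The label-based curve has only the two corners and one point in between, which is the "not smooth" shape described in the question.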

Related

Why is my Gaussian function giving values out of range?

I am trying to plot Gaussian using Matlab. My code is like this.
a=1/(0.1*sqrt(2*3.14))
y1=a*exp(-1*(((X1-Mu).^2)./(2*(Sigma^2)) ))
plot(X1,y1)
My graph looks like the image in the link.
It shows the correct shape, but the values on the y axis go up to 4. As far as I know, a Gaussian is a probability distribution function and thus must always return a value between 0 and 1. So I am unsure whether my implementation is correct.
Yes, it is a probability distribution function, but it is not required to return a value between 0 and 1 every time. As you can see from the picture below, the Gaussian graph depends on the variance and the mean.
Your implementation is correct. The Gaussian is a probability DENSITY function, which is different from a probability distribution. The former only has to be greater than or equal to zero, but when integrated over the entire range of possible X1, the result must be equal to 1.
Probability distributions are the ones whose values must be less than or equal to 1.
As a side note, Matlab has both the Gaussian probability density and distribution functions built in, as normpdf and normcdf respectively.
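A quick numeric check of this, sketched in Python rather than Matlab. It assumes Sigma = 0.1 (taken from the constant in the question's first line) and Mu = 0; the function name gaussian_pdf and the integration settings are my own choices.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 0.0, 0.1  # assumed values, see the note above

# The density at the mean is well above 1 for a small sigma.
peak = gaussian_pdf(mu, mu, sigma)

# Trapezoidal integration over mu +/- 10 sigma; the area should be about 1.
n = 20000
lo, hi = mu - 10 * sigma, mu + 10 * sigma
h = (hi - lo) / n
inner = sum(gaussian_pdf(lo + i * h, mu, sigma) for i in range(1, n))
ends = 0.5 * (gaussian_pdf(lo, mu, sigma) + gaussian_pdf(hi, mu, sigma))
area = h * (inner + ends)

print(peak, area)
```

The peak comes out close to 4, matching the plot in the question, while the area under the curve stays at 1.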

Filtering a sinusoidal wave with FFT

I am trying to write code in Matlab that takes one sinusoidal wave, or a sum of them, with noise superimposed, and tries to filter it using the following algorithm:
First I take the input and place it in a vector.
Then I apply fft() to that vector and abs() to that fft.
- For example, if 'x' is the vector in which the wave is stored, then
- y= abs(fft(x))
Now in 'y' I set all the elements less than a certain threshold value to 0.
Then I apply the ifft() function to get the filtered signal, let's say 'x1'.
But the final wave I get, even though it is a sinusoidal wave, is out of phase (see the graph). Is it because I am applying abs() to the fft?
The material I got this algorithm from doesn't discuss this.
Do I need to apply any other filter so that I get the actual wave?
Here is the plot of the two waves: the one I got from the above procedure, and the actual wave, which is a sine wave with no noise:
my graph
See how my filtered wave and the actual wave are out of phase. How can I correct it?
If you cannot understand the question or have anything you want to ask me, please comment and I will try to explain.
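You are assigning the absolute values of the FFT result to y, hence you get REAL values. Doing ifft() on that simply assumes the imaginary parts are zero, so the phase information is lost, hence the phase shift. Use abs() only to decide which elements fall below the threshold, and zero those elements in the complex FFT result itself before calling ifft().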
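Here is a small sketch of the difference, in plain Python with a naive DFT so that it runs without any toolbox. The signal, the deterministic "noise" term and the threshold (half of the largest magnitude) are all made up for illustration; in Matlab you would use fft() and ifft() directly.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


n = 64
clean = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
# A small higher-frequency component stands in for the noise.
noisy = [c + 0.1 * math.cos(2 * math.pi * 13 * t / n) for t, c in enumerate(clean)]

spectrum = dft(noisy)
threshold = 0.5 * max(abs(c) for c in spectrum)

# Magnitude decides which bins survive, but the complex values are kept.
kept = [c if abs(c) >= threshold else 0 for c in spectrum]
filtered = [v.real for v in idft(kept)]

# The approach from the question: the phase is thrown away by abs().
mag_only = [abs(c) if abs(c) >= threshold else 0 for c in spectrum]
shifted = [v.real for v in idft(mag_only)]

err_complex = max(abs(a - b) for a, b in zip(filtered, clean))
err_abs = max(abs(a - b) for a, b in zip(shifted, clean))
print(err_complex, err_abs)
```

With the complex values kept, the filtered wave matches the clean sine up to rounding error; with magnitudes only, the sine comes back as a cosine, which is the phase shift seen in the graph.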

Finding defined peaks with Clusters in MATLAB

This is my problem:
I have the following data "A", which looks like:
As you can see, I have marked the apparent peaks with red circles. The most defined are 2 and 7; I say they are defined because their standard deviation is low in comparison with the other peaks (especially the second one).
What I need is a way (any way) to get the values and the standard deviation of n peaks in a numeric array.
I have tried with "clusters", but I got no good results:
First of all, I used the "kmeans" MATLAB function, and I realized that this algorithm doesn't group the peaks as I need. As you can see in the picture above, in the red circle, that cluster has at least 3 or 4 peaks. Also, kmeans needs you to set the number of clusters, and I need to identify it automatically.
I hope that someone can give me some ideas, or a way to get better results, thanks.
PS: I leave the data "A" in the following link.
https://drive.google.com/file/d/0B4WGV21GqSL5a2EyQ2l0SHZURzA/edit?usp=sharing
The problem is that your axes have very different meaning.
K-means optimizes variance. But variance in X is something entirely different than variance in Y, isn't it? Furthermore, each of these methods will split your data in both X and Y, whereas I assume you want the data to be partitioned on the X axis only.
I suggest the following: consider the Y axis to be a weight, and X axis to be a position.
Then perform weighted density estimation, and look for low density to separate your clusters.
I can't help you with MATLAB. I don't use it.
Mathematically, what you want to do is place a Gaussian at each point, with area Y and center X. Then find minima and maxima on the sum of these Gaussians. See Wikipedia, Kernel Density Estimation for details; except that you want to use the Y axis as weights. You could maybe also use 1/Y as standard deviation, if you don't want to use weights.
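A rough sketch of that idea in Python (not MATLAB), assuming a fixed bandwidth and a regular evaluation grid. The data, the bandwidth of 0.4 and the helper names are invented for illustration and would need tuning on the real array "A".

```python
import math


def weighted_density(xs, weights, grid, bandwidth):
    """Sum of Gaussians centred at each x, each scaled by its weight."""
    norm = sum(weights) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(w * math.exp(-0.5 * ((g - x) / bandwidth) ** 2)
            for x, w in zip(xs, weights)) / norm
        for g in grid
    ]


def local_extrema(values):
    """Indices of strict interior local maxima and minima."""
    maxima, minima = [], []
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            maxima.append(i)
        elif values[i] < values[i - 1] and values[i] < values[i + 1]:
            minima.append(i)
    return maxima, minima


# Made-up data: X is the position, Y is used as the weight.
xs = [1.0, 1.2, 1.4, 5.0, 5.1, 5.3]
ys = [2.0, 5.0, 2.0, 1.0, 4.0, 1.0]

grid = [i * 0.05 for i in range(141)]  # 0.0 to 7.0 in steps of 0.05
density = weighted_density(xs, ys, grid, bandwidth=0.4)
maxima, minima = local_extrema(density)

peak_positions = [grid[i] for i in maxima]    # candidate peak centres
split_positions = [grid[i] for i in minima]   # low-density cut points
print(peak_positions, split_positions)
```

The minima give the cut points on the X axis; the standard deviation of each peak can then be computed from the points that fall between two neighbouring cuts. The number of peaks found depends strongly on the bandwidth.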

ROC curve and libsvm

Given a ROC curve drawn with plotroc.m (see here):
Theoretical question: How to select the best threshold to be used?
Programming question: How to induce the libsvm classifier to work with the selected (best) threshold?
A ROC curve is generated by plotting the fraction of true positives (TPR) on the y-axis against the fraction of false positives (FPR) on the x-axis. So the coordinates of any point (x,y) on the ROC curve indicate the FPR and TPR values at a particular threshold.
We find the point (x,y) on the ROC curve that has the minimum distance from the top-left corner of the plot, i.e. (0,1). The threshold value corresponding to that point is the required threshold. Sorry, I am not permitted to post an image, so I couldn't explain this with a figure. For more details about this, click ROC related help.
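The selection rule can be sketched as follows, in Python rather than Matlab. The labels and probabilities are made up, the function name closest_to_top_left is hypothetical, and a sample is counted as positive when its probability is strictly greater than the threshold.

```python
import math


def closest_to_top_left(y_true, probs):
    """Return (threshold, fpr, tpr) of the ROC point nearest to (0, 1)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best = None
    for th in sorted(set(probs)):
        # Predict positive when the probability exceeds the threshold.
        tp = sum(1 for y, p in zip(y_true, probs) if p > th and y == 1)
        fp = sum(1 for y, p in zip(y_true, probs) if p > th and y == 0)
        fpr, tpr = fp / neg, tp / pos
        dist = math.hypot(fpr - 0.0, tpr - 1.0)
        if best is None or dist < best[0]:
            best = (dist, th, fpr, tpr)
    return best[1], best[2], best[3]


# Made-up labels and positive-class probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.7, 0.8, 0.9]

th, fpr, tpr = closest_to_top_left(y_true, probs)
print(th, fpr, tpr)
```

Distance to the corner is only one possible criterion; if false positives and false negatives have different costs, a different point on the curve may be preferable.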
Secondly, in libsvm, the svmpredict function, when called with the '-b 1' option, returns the probability of a data sample belonging to each class. So, if that probability (for the positive class) is greater than the threshold (obtained from the ROC plot), we can classify the sample as positive. These few lines might be useful to you:
[pred_labels,~,p] = svmpredict(target_labels,feature_test,svmStruct,'-b 1');
% where, svmStruct is structure returned by svmtrain function.
op = p(:,svmStruct.Label==1); % This gives probability for positive
% class (i.e whose label is 1 )
Now, if this variable 'op' is greater than the threshold, we can classify the corresponding test sample as positive. This can be done as
op_labels = op>th; % where 'th' is threshold obtained from ROC

Estimating the error when fitting a curve with DCT and polyfit

I have a matlab script that performs curve fitting on a set of curves using polynomials of third, second and first order (using polyfit with the desired order) and also using DCT of 4,3 and 2 coefficients (invoking dct for the whole array and then truncating just the first 4,3 or 2 coeffs).
I'm able to get a graphical view of the accuracy of each curve fitting using polyval and idct for the 2 types of curve fitting, but I was wondering if there is any way of getting a numeric value of the accuracy that makes sense for both approaches (dct and polyfit).
I'm sure this is more a maths question than a Matlab question, but maybe there is a simple and elegant array-based way to do this that I haven't thought of yet.
Thanks in advance for your comments!
EDIT: What about correlation? :D
In the curve fitting tool there should be a residual that uses the standard deviation. If you are interested in another way to do it, you could use the RMSE for the entire curve, scripting a function that does something like:
Input args: y1 = (curve to be fitted), y2 = (fitted curve)
For each value in y1, sum up the squared difference (y1-y2)^2
Divide by the number of entries
Return the square root of the resulting number
See http://en.wikipedia.org/wiki/Root-mean-square_deviation#Formula for more.
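The steps above can be sketched as a short function. This is written in Python for illustration; the name rmse and the sample numbers are my own. The same value can be computed for the polyval output and for the idct output, so the two fits can be compared on one scale.

```python
import math


def rmse(y1, y2):
    """Root-mean-square error between an original curve y1 and a fitted curve y2."""
    if len(y1) != len(y2) or not y1:
        raise ValueError("curves must be non-empty and of equal length")
    squared_sum = sum((a - b) ** 2 for a, b in zip(y1, y2))
    return math.sqrt(squared_sum / len(y1))


# Made-up example: one sample of the fit is off by 2.
original = [1.0, 2.0, 3.0]
fitted = [1.0, 2.0, 5.0]
error = rmse(original, fitted)
print(error)
```

A lower value means a closer fit, and the result is in the same units as the curve itself.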