I want to create a ROC curve in MATLAB using the perfcurve function (it's for logistic regression, similar to what is illustrated in this example (bottom of page)). I have 150 datapoints (binary data), but they are not individual positive or negative observations; each datapoint gives the number of positive observations out of a total number of observations.
Example (random data to illustrate):
datapoint   +ve observations   total observations
1           23                 35
2           27                 41
3           23                 36
4           18                 29
5           19                 39
6           21                 41
7           24                 40
8           29                 36
9           38                 45
10          12                 32
The example illustrated on mathworks (bottom of page) only demonstrates how to create labels for data rows that correspond either solely to positive or negative classes.
For
[X,Y,T,AUC] = perfcurve(labels,scores,posclass)
how do I have to format my labels and posclass in order to make the ROC curve plot work?
Thank you very much in advance.
In order to create an ROC curve in Matlab using the perfcurve function, you need to have the score for each data point (which you pass to perfcurve using the scores argument). The score of a data point is given by your classifier and corresponds to the "probability" [1] that this data point belongs to the positive class (which is defined by the posclass argument). Given your data, you don't have enough information to use the perfcurve function.
[1] Some classifiers don't return strict probabilities, but a higher score still indicates a higher probability, so that is fine. More information: Fawcett, Tom. "An introduction to ROC analysis." Pattern Recognition Letters 27.8 (2006): 861-874.
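If a score per datapoint were available (for example the fitted probability from the logistic regression), one possible way to feed aggregated counts to perfcurve is to expand every datapoint into one row per observation. The sketch below assumes exactly that: the values in p are made up for illustration and are not part of the question's data, and only the first three rows of the table are used.

```matlab
% Aggregated counts from the question (first three rows only)
pos   = [23; 27; 23];      % positive observations per datapoint
total = [35; 41; 36];      % total observations per datapoint
neg   = total - pos;       % negative observations per datapoint

% HYPOTHETICAL scores: one predicted probability per datapoint, e.g. taken
% from a fitted logistic regression. These numbers are invented.
p = [0.60; 0.70; 0.55];

% Expand to one row per observation: all positives first, then all negatives.
% Every observation inherits the score of the datapoint it came from.
labels = [ones(sum(pos), 1); zeros(sum(neg), 1)];
scores = [repelem(p, pos); repelem(p, neg)];

% perfcurve requires the Statistics and Machine Learning Toolbox
[X, Y, T, AUC] = perfcurve(labels, scores, 1);   % posclass = 1
plot(X, Y)
xlabel('False positive rate')
ylabel('True positive rate')
```

Here labels is a numeric 0/1 vector, so posclass is simply 1. repelem needs R2015a or later.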
Related
In MATLAB (R2015b) I have to find the midpoint between two time series of different lengths (ca. 2000 vs. 3000 rows). In both series the first column is the time and the second is a measurement. For example, A:
09:30:14 23
09:31:03 23.5
And B:
09:30:19 25.5
09:30:37 25
09:31:12 24.5
How can I get MATLAB to calculate the midpoint value between A and B and get the result as shown below?
09:30:19 24.25 (Here it is 23+(25.5-23)/2)
09:30:37 24 (Here it is 23+(25-23)/2)
09:31:12 24 (Here it is 23.5+(24.5-23.5)/2)
You can use the interp1 function to estimate the value of one series at the time points corresponding to the other samples. Then the time points agree and you can just take the mean of the values.
interp1 supports several interpolation methods, such as nearest and linear.
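A minimal sketch of this approach, using the sample rows from the question with the times converted to seconds since midnight (the variable names are made up for illustration). Linear interpolation is assumed here; the expected values in the question hold the most recent sample of A instead, which would correspond to the 'previous' method.

```matlab
% Series A and B from the question; times as seconds since midnight
tA = [9*3600 + 30*60 + 14; 9*3600 + 31*60 + 3];
vA = [23; 23.5];
tB = [9*3600 + 30*60 + 19; 9*3600 + 30*60 + 37; 9*3600 + 31*60 + 12];
vB = [25.5; 25; 24.5];

% Estimate series A at the time stamps of series B.
% Query points outside the range of tA come back as NaN.
vA_at_B = interp1(tA, vA, tB, 'linear');

% Midpoint between the two series at each time stamp of B
midpoint = (vA_at_B + vB) / 2;
```

The last time stamp of B (09:31:12) lies after the last sample of A shown here, so its midpoint is NaN in this small example; with the full series this only affects the ends where the two recordings do not overlap.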
I want to understand the use of Gaussian Mixture Models in Hidden Markov Models.
Suppose we have speech data and we are recognizing 5 speech sounds (which are the states of the HMM). For example, let X be the speech sample and O = (s,u,h,b,a) be the HMM states (using characters instead of phones just for simplicity). Now we use a Gaussian mixture model with 3 mixture components to estimate the Gaussian density for every state, using the following equation (sorry, I cannot upload an image because of reputation points).
P(X|O) = sum (i=1->3) w(i) * P (X|mu(i), var(i)) (considering univariate distribution)
So, we first learn the GMM parameters from the training data using EM algorithm.
Then use these parameters for learning HMM parameters and once this is done, we use both of them on test data.
In all we are learning 3 * 3 * 5 (weight, mean and variance for 3 mixtures and 5 states) parameters for GMM in this example.
Is my understanding correct?
Your understanding is mostly correct; however, the number of parameters is usually larger. The mean and variance are vectors, not scalars. The variance can even be a matrix in the rarer case of a full-covariance GMM. Each vector usually contains 39 components: 13 cepstral coefficients + 13 deltas + 13 delta-deltas.
So for every phone you learn
39 + 39 + 1 = 79 parameters
Total number of parameters is
79 * 5 = 395
Also, a phone is usually composed of 3 or so states, not a single state. So you have 395 * 3 = 1185 parameters just for the GMMs. Then you need a transition matrix for the HMM. The number of parameters is large, which is why training requires a lot of data.
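The tally above can be checked with a few lines of arithmetic. This is only a sketch under the answer's assumptions (39-dimensional features, diagonal covariance); the variable names are made up for illustration. Note that 79 is the count for a single Gaussian component, so the last line, which also multiplies by the 3 mixture components per state from the question, is an extension that is not part of the tally above.

```matlab
D              = 39;   % feature dimension: 13 cepstra + 13 deltas + 13 delta-deltas
nPhones        = 5;    % speech sounds in the question
statesPerPhone = 3;    % "3 or so" states per phone
M              = 3;    % mixture components per state (from the question)

perComponent = D + D + 1;                                % mean + variance + weight = 79
totalAnswer  = perComponent * nPhones * statesPerPhone;  % 79 * 5 * 3 = 1185
totalWithM   = totalAnswer * M;                          % 1185 * 3 = 3555
```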
I have a set of scatter points. They are the heights of sixty plants (cm) over time (days). I measure each plant three times (at about day 10, day 50 and day 100), but some of the plants do not have the second and/or third measurement yet. Here is a small example of my data showing four plants (A, B, C, D).
Plant Days Height
A 10 2
B 11 5
C 12 4
D 12 5
A 57 7
B 56 8
C 53 6
A 100 12
B 100 10
Then I can use plotmatrix(Days, Height) to plot the scatter points. I need to make percentile curves (similar to children's growth charts) in MATLAB. I tried prctile(height, [25 50 75], 1), but it only outputs the 25th, 50th and 75th percentile values of height, not a growth curve. Could anyone suggest a way to generate percentile curves from a set of scatter points over time? Is regression needed to generate growth curves (25th, 50th, 75th percentile) for sixty plants?
Sorry I am still new to Matlab and statistics, please help. Thanks!
I have a question that I don't know if there is a solution off the bat.
Here it goes,
I have two data sets plotted on the same figure. I need to find their difference; simple so far...
The problem is that matrix A has 1000 data points while the second (matrix B) has 580 data points. How can I find the difference between the two graphs when there is a dimension mismatch between the two data sets?
One way I thought of is to artificially inflate matrix B to 1000 data points in such a way that the trend of the plot remains the same. Would this be possible, and if so, how?
for example:
A=[1 45 33 4 1009 ];
B=[1 22 33 44 55 66 77 88 99 1010];
Ya=A.*20+4;
Yb=B./10+3;
C=abs(B - A)   % this line errors: A has 5 elements, B has 10
plot(A,Ya,'r',B,Yb)
xlim([-100 1000])
grid on
hold on
plot(length(B),C)
One way to do it is to resample the 580-element vector to 1000 samples. Use MATLAB's resample function (requires the Signal Processing Toolbox, I believe) for this:
x = randn(580,1);
y = randn(1000,1);
xr = resample(x, 50, 29);   % 50/29 = 1000/580 is the resampling ratio
d = abs(y - xr);            % both vectors now have 1000 samples
You should then be able to compare the two data vectors.
There are two ways that I can think of:
1- Matching the size:
Generating more data for the matrix with lower number of elements (using interpolation, etc.)
Removing some data from the matrix with higher number of elements (i.e. outlier removal)
2- Comparing the matrices with their properties.
For instance, you can calculate the mean and the covariance of one matrix and compare them to those of the other matrix. Useful functions include cov, mean, median, std, var, xcorr and xcov.
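The first option (matching the size by interpolation) could look like the sketch below. The two curves are hypothetical stand-ins for the 1000-point and 580-point data sets, and it is assumed that both come with their own x values covering the same range.

```matlab
% Hypothetical data: two curves sampled on different grids over the same range
xa = linspace(0, 10, 1000);   ya = sin(xa);          % 1000 points
xb = linspace(0, 10, 580);    yb = sin(xb) + 0.1;    % 580 points

% Evaluate the shorter series on the x grid of the longer one
yb_on_a = interp1(xb, yb, xa, 'linear');

% Both vectors now have 1000 elements, so they can be subtracted
d = abs(ya - yb_on_a);

plot(xa, ya, 'r', xb, yb, 'b', xa, d, 'k')
legend('A', 'B', '|A - B|')
grid on
```

interp1 needs no toolbox. If the x ranges of the two data sets only partly overlap, the points outside the overlap come back as NaN.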
Can anyone shed some light on my MATLAB program?
I have data from two sensors and I'm doing a kNN classification for each of them separately.
In both cases training set looks like a set of vectors of 42 rows total, like this:
[44 12 53 29 35 30 49;
54 36 58 30 38 24 37;..]
Then I get a sample, e.g. [40 30 50 25 40 25 30] and I want to classify the sample to its closest neighbor.
As the proximity criterion I use the Euclidean metric, sqrt(sum(Y.^2)), where Y is the element-wise difference between the sample and a training vector. This gives me an array of distances between the sample and each class of the training set.
So, two questions:
Is it possible to convert the distances into a probability distribution, something like: Class 1: 60%, Class 2: 30%, Class 3: 5%, Class 5: 1%, etc.?
Added: so far I'm using the formula probability = distance / sum of distances, but I cannot plot a correct CDF or histogram.
This gives me a distribution of sorts, but I see a problem: if the distances are large, for example 700, the closest class will still get the largest probability, which would be wrong because the sample is too far away to be matched with any of the classes.
If I could get two probability density functions, I guess I could then take some product of them. Is that possible?
Any help or remark is highly appreciated.
I think there are multiple ways of doing this:
as Adam suggested, use 1/d / sum(1/d)
use the square, or an even higher power, of the inverse distance, e.g. 1/d^2 / sum(1/d^2). This makes the class probability distribution more skewed. For example, if 1/d yields a 40%/60% prediction, 1/d^2 yields roughly 31%/69%.
use softmax (https://en.wikipedia.org/wiki/Softmax_function) on the negative distance, i.e. exp(-d) / sum(exp(-d)).
use exp(-d^2/sigma^2) / sum[exp(-d^2/sigma^2)], which imitates Gaussian likelihoods. sigma could be the average within-cluster distance, or simply set to 1 for all clusters.
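The options in this list can be compared side by side with a short sketch. The distances in d and the value of sigma are invented for illustration; in practice d would hold the distance from the sample to each class.

```matlab
d     = [2; 4; 8];   % HYPOTHETICAL distances from the sample to three classes
sigma = 4;           % HYPOTHETICAL scale, e.g. an average within-cluster distance

% Inverse distance, normalized to sum to 1
p_inv   = (1 ./ d) / sum(1 ./ d);

% Inverse squared distance: more weight on the closest class
p_inv2  = (1 ./ d.^2) / sum(1 ./ d.^2);

% Softmax of the negative distance
p_soft  = exp(-d) / sum(exp(-d));

% Gaussian-style weighting with scale sigma
p_gauss = exp(-d.^2 / sigma^2) / sum(exp(-d.^2 / sigma^2));

disp([p_inv, p_inv2, p_soft, p_gauss])   % one column per method
```

All four columns sum to 1 and rank the classes in the same order; they differ only in how strongly the closest class is favoured. None of them addresses the concern that a sample far from every class still receives a confident label; that would need an extra check such as a distance threshold.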
You could try inverting your distances to get a likelihood measure: the bigger the distance x, the smaller its inverse. Then you can normalize, as in probability = (1/distance) / sum(1/distance).
Hi: Have you ever tried with the formula probability = 1-distance assuming that you are using a standardized distance between 0 and 1?