Wrong partitions with MATLAB's cvpartition

I am having trouble with MATLAB's cvpartition function. I want to perform 5-fold cross-validation (for classification) with a dataset that has 134 instances of class 1 (negative) and 19 instances of class 2 (positive).
With 5-fold CV one should have something like 4 - 4 - 4 - 4 - 3 positive instances distributed across the 5 folds, or close to that (5 - 4 - 3 - 4 - 3 would also be OK). I make 30 repetitions of the 5-fold CV, and sometimes MATLAB builds partitions like 1 - 5 - 5 - 4 - 4 or even 5 - 5 - 5 - 4 - 0, that is, one of the folds has no positive instances! How is this possible, and how can I correct it? At the very least it should guarantee that both classes are represented in every fold...
This causes problems when I try to compute precision, recall, F-measure and so on...
LS

Are you using the stratified form of cross-validation that cvpartition provides?
Use the second syntax described in the documentation page, i.e. c = cvpartition(group,'kfold',k) rather than c = cvpartition(n,'kfold',k). Here group is a vector (or categorical array, cell array of strings, etc.) of class labels; cvpartition will then stratify the assignment of observations to folds rather than just splitting everything randomly into groups.
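For example, a minimal sketch with class counts like yours (134 negatives, 19 positives; the labels vector here is made up for illustration):
labels = [ones(134,1); 2*ones(19,1)];   % 1 = negative, 2 = positive
c = cvpartition(labels, 'KFold', 5);    % stratified: folds keep the class ratio
% verify that every test fold contains positive instances
for k = 1:c.NumTestSets
    fprintf('Fold %d: %d positives\n', k, sum(labels(test(c,k)) == 2));
end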

Vector with specific number of equally spaced values

I am familiar with the command:
(-5:0.1:5)
which creates a vector of values, equally spaced by 0.1, from -5 to 5.
However, is there a way to produce a vector of values equally spaced from -5 to 5 such that there are, say, 100 values in the vector?
(-5:0.1:5) gives a vector with 101 values; is there a way to get a vector of 100 values without manually calculating the step size?
Yes, there is: use the function linspace (see its documentation page). For example:
linspace(1,10,10)
ans =
1 2 3 4 5 6 7 8 9 10
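Applied to the question's range, to get exactly 100 equally spaced values from -5 to 5:
v = linspace(-5, 5, 100);   % 100 values from -5 to 5
v(2) - v(1)                 % the step is computed automatically: 10/99, about 0.1010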

Why should I transpose the input to a neural network in MATLAB?

I would like to ask a question about MATLAB's transpose operator. For example, in this case:
input=input';
This transposes input, but I would like to understand why we need the transpose when using an artificial neural network in MATLAB.
My second question: I am building a classifier with an ANN in MATLAB, and I display the results like this:
a=sim(neuralnetworkname,test)
Here test represents my test data for the network.
The results look like this:
a =
Columns 1 through 12
2.0374 3.9589 3.2162 2.0771 2.0931 3.9947 3.1718 3.9813 2.1528 3.9995 3.8968 3.9808
Columns 13 through 20
3.9996 3.7478 2.1088 3.9932 2.0966 2.0644 2.0377 2.0653
If an entry of a is about 2, the case is benign; if it is about 4, it is malignant.
I want to calculate, for example, that there are 100 benign cases out of 500 data points, and print this ratio (100/500) to the screen. How can I do that?
I have tried to be clear, but if anything is unclear, I can explain more. Thanks.
First Question
You don't need to transpose the input values every time. MATLAB's nntool expects the input column by column (one sample per column) by default, so you have two choices: 1. reorder your dataset, or 2. transpose the input.
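For illustration, a small sketch of what the transpose does to the orientation (the sizes here are made up):
X = rand(150, 4);   % 150 samples with 4 features each, one sample per ROW
X = X';             % transposed: one sample per COLUMN, i.e. 4-by-150
size(X)             % returns 4   150, the orientation nntool expects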
Second Question
Suppose you have a vector like this:
a=[1 2 3 4 5 6 7 8 9 0 0 0];
To count how many of its elements are below 8, write this:
sum(a<8)   % a<8 is a logical vector marking the elements below 8; sum counts them
Output will be:
10
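Applied to your classification output, a sketch along the same lines (the cut-off of 3, halfway between the two target values, is my assumption):
a = sim(neuralnetworkname, test);   % network outputs, as in your question
nBenign = sum(a < 3);               % outputs near 2 are counted as benign
fprintf('%d/%d benign\n', nBenign, numel(a));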

Unreasonable [positive] log-likelihood values from MATLAB's fitgmdist function

I want to fit a dataset with a Gaussian mixture model. The dataset contains about 120k samples, and each sample has about 130 dimensions. In MATLAB I run this script (with 1000 clusters):
gm = fitgmdist(data, 1000, 'Options', statset('Display', 'iter'), 'RegularizationValue', 0.01);
I get the following outputs:
iter log-likelihood
1 -6.66298e+07
2 -1.87763e+07
3 -5.00384e+06
4 -1.11863e+06
5 299767
6 985834
7 1.39525e+06
8 1.70956e+06
9 1.94637e+06
The log-likelihood is greater than 0! I think this is unreasonable, and I don't know why.
Could somebody help me?
First of all, it is not a problem of how large your dataset is.
Here is some code that produces similar results with a quite small dataset:
options = statset('Display', 'iter');
x = ones(5,2) + (rand(5,2)-0.5)/1000;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 64.4731
2 73.4987
3 73.4987
Of course you know that the log function (the natural logarithm) has a range from -inf to +inf. I suspect your problem is that you expect the input to the log (i.e. the likelihood) to be bounded by [0,1]. But the likelihood is a PDF value, and a PDF can become very large for a very dense dataset.
PDFs must be positive (which is why we can take their log) and must integrate to 1, but they are not bounded by [0,1].
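You can see this with a one-liner (normpdf is from the same toolbox): a narrow Gaussian has a peak far above 1.
normpdf(0, 0, 0.01)   % returns about 39.89, i.e. 1/(0.01*sqrt(2*pi))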
You can verify this by reducing the density in the above code
x = ones(5,2) + (rand(5,2)-0.5)/1;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 -8.99083
2 -3.06465
3 -3.06465
So, I would rather assume that your dataset contains several duplicate (or very close) values.

Nearest Neighbour Classifier for multiple features

I have a dataset that looks like this:
          Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   Class
Obj 1         2           2           2           8           5         1
Obj 2         2           8           3           3           4         2
Obj 3         1           7           4           4           8         1
Obj 4         4           3           5           9           7         2
The rows contain objects, which have a number of features. I have put 5 features here for demonstration purposes, but there are approximately 50 features per object, with the final column being the class label for each object.
I want to create and run the nearest-neighbour classifier algorithm on this dataset and retrieve the error rate. I have managed to get the NN algorithm working for each individual feature; a short pseudocode example is below. For each feature I loop through the objects, assigning object j according to its nearest neighbours.
for i = 1:number of features
    for j = 1:number of objects
        compute distance between data(j,i) and the other values of feature i
        order by shortest distance
        collect the class labels of the k shortest distances
        assign the class with the largest number of labels
    end
    error = mean(labels ~= assigned)
end
The issue I have is how to apply the 1-NN algorithm to multiple features. Say I have selected features 1, 2 and 3 from my dataset, and I want to calculate the error rate when feature 5 is added to this selection, using 1-NN. Would I find the nearest value to each entry of feature 5 among all of my selected features 1-3?
For example, for my dataset above:
Adding feature 5 - for object 1 of feature 5, the closest number is object 4 of feature 3. As that has a class label of 2, I would assign object 1 of feature 5 the class 2. This is obviously a misclassification, but I would continue classifying all other objects of feature 5 and compare the assigned and actual values.
Is this the correct way to perform 1-NN with multiple features?
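For comparison, here is a minimal sketch of the usual convention (my assumption of standard practice, not something stated in the post): treat each object as one point in the joint feature space and compute the distance over all selected feature columns at once, evaluated leave-one-out. It relies on implicit expansion, so it needs R2016b or newer.
data   = [2 2 2 8 5; 2 8 3 3 4; 1 7 4 4 8; 4 3 5 9 7];   % the example objects
labels = [1; 2; 1; 2];
sel    = [1 2 3 5];                  % selected feature subset
n = size(data, 1);
assigned = zeros(n, 1);
for j = 1:n
    % squared Euclidean distance from object j to all objects, over sel only
    d = sum((data(:, sel) - data(j, sel)).^2, 2);
    d(j) = inf;                      % leave-one-out: exclude object j itself
    [~, nn] = min(d);                % index of the nearest neighbour
    assigned(j) = labels(nn);        % 1-NN: inherit its class label
end
errorRate = mean(assigned ~= labels)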

Solving nonlinear optimization equations with large errors

The variable y may take values in a defined range:
3 < y < 5
The value of y should be determined by introducing a constraint like
|x - y| = min
x is given and should scan a larger range, like:
x = -1000:1:1000
How do I find the exact y-value for a given x?
The results I have in mind look like this:
x y
-1000 3
. 3
. 3
2.9 3
3 3
3.1 3.1
4 4
5 5
6 5
7 5
. 5
. 5
1000 5
This means I want to allow a larger "error" outside the range, but between 3 and 5 it should solve with a much smaller error, so that this area is resolved as finely as possible.
What would be the best way to implement something like this in MATLAB, without an "if" condition and, if possible, symbolically? Numerical alternatives would also be interesting.
Based on your comment and example I think you are simply looking for this:
x = -10:0.1:10 %Suppose this is your x
y = max(min(x,5),3) %Clamp it to [3,5]: round up to 3 or down to 5 as needed
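A quick check against the values from the question:
xs = [-1000 2.9 3 3.1 4 5 6 7 1000];
ys = max(min(xs, 5), 3)   % returns 3  3  3  3.1  4  5  5  5  5, matching the table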