Nearest Neighbour Classifier for multiple features - matlab

I have a dataset set that looks like this:
       Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   Class
Obj 1      2           2           2           8           5         1
Obj 2      2           8           3           3           4         2
Obj 3      1           7           4           4           8         1
Obj 4      4           3           5           9           7         2
The rows contain objects, which have a number of features. I have put 5 features for demonstration purposes, but there are approximately 50 features per object, with the final column being the class label for each object.
I want to create and run the nearest neighbour classifier algorithm on this data set and retrieve the error rate. I have managed to get the NN algorithm working for each feature; a short pseudocode example is below. For each feature, I loop through each object, assigning object j according to its nearest neighbours.
for i = 1:Number of features
    for j = 1:Number of objects
        distance between data(j,i) and values of feature i
        order by shortest distance
        sum the class labels of the k shortest distances
        assign the class with the largest number of labels
    end
    error = mean(labels ~= assigned)
end
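For concreteness, a rough MATLAB version of this loop using leave-one-out 1-NN (the variable names data and labels are just for illustration, not my actual code) would look something like this:
[nObj, nFeat] = size(data);           % data = feature columns, labels = class column
errorRate = zeros(nFeat, 1);
for i = 1:nFeat
    assigned = zeros(nObj, 1);
    for j = 1:nObj
        d = abs(data(:,i) - data(j,i));   % distance to every object on feature i
        d(j) = Inf;                       % leave the object itself out
        [~, nnIdx] = min(d);              % nearest neighbour on this single feature
        assigned(j) = labels(nnIdx);      % 1-NN: take its class label
    end
    errorRate(i) = mean(labels ~= assigned);
end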
The issue I have is how to work out the 1-NN algorithm for multiple features. I will have a selection of features from my dataset, say features 1, 2 and 3, and I want to calculate the error rate (using 1-NN) if I add feature 5 to my set of selected features. Would I find, for each value of the feature being added, the nearest value among all of the values of my selected features 1-3?
For example, for my data set above:
Adding feature 5 - for object 1 of feature 5, the closest number to it is object 4 of feature 3. As this has a class label of 2, I would assign object 1 of feature 5 the class 2. This is obviously a misclassification, but I would continue to classify all other objects in feature 5 and compare the assigned and actual values.
Is this the correct way to perform 1-NN with multiple features?
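(For reference, the more common convention for 1-NN over several features, which I could compare against, is to treat each object, i.e. each row, as a point in the selected feature columns and measure Euclidean distance over all of them at once. A rough sketch, again with illustrative variable names:)
selected = [1 2 3 5];                                % currently selected feature columns
X = data(:, selected);                               % objects as points in that feature space
assigned = zeros(size(X,1), 1);
for j = 1:size(X,1)
    d = sqrt(sum(bsxfun(@minus, X, X(j,:)).^2, 2));  % Euclidean distance to every object
    d(j) = Inf;                                      % leave-one-out
    [~, nnIdx] = min(d);
    assigned(j) = labels(nnIdx);
end
errorRate = mean(labels ~= assigned);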

Related

Calculating group means with own group excluded in MATLAB

To be generic, the issue is: I need to exclude the own group's observations before calculating each group mean.
As an example: let's say I have firms, products and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of the product i of firm f, using all products of all firms, excluding firm f product observations.
So I could have a dataset like this:
firm   prod   width
  1      1      30
  1      2      10
  1      3      20
  2      1      25
  2      2      15
  2      4      40
  3      2      10
  3      4      35
To reproduce the table:
firm=[1,1,1,2,2,2,3,3]
prod=[1,2,3,1,2,4,2,4]
hp=[30,10,20,25,15,40,10,35]
x=[firm' prod' hp']
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm   prod   width   mean_desired
  1      1      30         25
  1      2      10         25
  1      3      20         25
  2      1      25
  2      2      15
  2      4      40
  3      2      10
  3      4      35
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page: Calculating group mean/medians in MATLAB where group ID is in a separate column. But there, the own group is not excluded.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.
Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(@ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This creates a zero-one matrix m such that each row of m multiplied by the width column of x (second line of code) gives the sum of the widths over all other firms' observations. m is built by comparing each entry of the firm column of x with the unique values of that column (first line). Dividing by the respective count of other-firm observations (third line) then gives the desired result.
If you need the results repeated as per the original firm column, use result(x(:,1))
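For instance, on the example data from the question (with x = [firm' prod' hp'] as defined there), this gives:
m = bsxfun(@ne, x(:,1).', unique(x(:,1)));
result = (m*x(:,3))./sum(m,2)   % -> [25; 21; 23.3333] for firms 1, 2 and 3
result(x(:,1))                  % expanded to one value per observation, matching mean_desired for firm 1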

Translating a kd-tree in MATLAB

I'm using a kd-tree to perform quick nearest neighbor search queries. I'm using the following piece of code to generate the kd-tree and perform queries on it:
% 3 dimensional vertex data
x = [1 2 2 1 2 5 6 3 4;
3 2 3 2 2 7 6 5 2;
1 2 9 9 7 5 8 9 3]';
% create the kd-tree
kdtree = createns(x, 'NSMethod', 'kdtree');
% perform a nearest neighbor search
nearestNeighborIndex = knnsearch(kdtree, [1 1 1]);
This works well enough when the data is static. However, every once in a while I need to translate every vertex in the kd-tree. I know that changing the whole data set means I need to regenerate the whole tree to perform a nearest neighbor search again. With a couple of thousand vertices in each kd-tree, regenerating the whole tree from scratch seems like overkill, as it takes a significant amount of time. Is there a way to translate the kd-tree without regenerating it from scratch? I tried accessing and changing the X property (which holds the actual vertex data) of the kd-tree, but it seems to be read-only, and it probably wouldn't have worked even if it weren't, since there is a lot more going on behind the curtains.
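(For illustration only: since a translation shifts every vertex by the same offset, it does not change which stored point is nearest to a given query, so one possible workaround is to keep the tree fixed and shift the query point by the opposite offset. A sketch with a made-up offset:)
offset = [2 -1 3];          % hypothetical translation applied to every vertex
queryPoint = [1 1 1];
% equivalent to translating all of x by 'offset' and then querying with 'queryPoint'
nearestNeighborIndex = knnsearch(kdtree, queryPoint - offset);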

Calculating the Local Ternary Pattern of a depth image

I found the details and an implementation of the Local Ternary Pattern (LTP) in Calculating the Local Ternary Pattern of an image?. I would like to ask for more detail: what is the best way to choose the threshold t? I am also confused about the role of reorder_vector = [8 7 4 1 2 3 6 9];
Unfortunately, there isn't a good way to figure out what the threshold should be when using LTPs. It's mostly trial and error or experimentation. However, I would suggest making the threshold adaptive. You can use Otsu's algorithm to dynamically determine the best threshold for your image. This assumes that the distribution of intensities in the image is bimodal; in other words, that there is a clear separation between objects and background. MATLAB has an implementation of this in the graythresh function. However, it generates a threshold between 0 and 1, so you will need to multiply the result by 255, assuming that your image is of type uint8.
Therefore, do:
t = 255*graythresh(im);
im is the image for which you want to compute the LTPs. Now, I can certainly provide insight into what the reorder_vector is doing. Look at the following figure on how to calculate LTPs:
(Figure illustrating the LTP calculation; source: hindawi.com)
When we generate the ternary code matrix (the matrix in the middle), we need to generate an 8-element sequence that doesn't include the centre of the neighbourhood. We start from the east-most element (row 2, column 3), then traverse the elements in counter-clockwise order. The reorder_vector variable allows you to select those specific elements in that order. If you recall, MATLAB can access matrices using column-major linear indices. Specifically, given a 3 x 3 matrix, we can access an element using a number from 1 to 9, and the memory is laid out like so:
1 4 7
2 5 8
3 6 9
Therefore, the first element of reorder_vector is index 8, which is the east-most element. Next is index 7, the top-right element, then index 4, the north-facing element, then 1, 2, 3, 6 and finally 9.
If you follow these numbers, you will determine how I got the reorder_vector:
reorder_vector = [8 7 4 1 2 3 6 9];
By using this variable for accessing each 3 x 3 local neighbourhood, we would thus generate the correct 8 element sequence that respects the ordering of the ternary code so that we can proceed with the next stage of the algorithm.
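To see the ordering in action, here is a small example with a made-up 3 x 3 neighbourhood (the values are arbitrary):
neigh = [10 20 30; 40 50 60; 70 80 90];   % made-up 3 x 3 neighbourhood
reorder_vector = [8 7 4 1 2 3 6 9];
seq = neigh(reorder_vector)               % -> 60 30 20 10 40 70 80 90 (E, NE, N, NW, W, SW, S, SE); the centre (50) is skipped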

wrong partitions with matlab's cvpartition

I am having trouble with the cvpartition function in MATLAB. I want to perform 5-fold cross-validation (for classification) with a dataset that has 134 instances from class 1 (negative) and 19 instances from class 2 (positive).
With 5-fold CV, one should have something like 4 - 4 - 4 - 4 - 3 positive instances partitioned across the 5 folds, or close to that (5 - 4 - 3 - 4 - 3 would also be OK). I run 30 repetitions of the 5-fold CV and sometimes MATLAB builds partitions like 1 - 5 - 5 - 4 - 4 or even 5 - 5 - 5 - 4 - 0, that is, one of the folds has no positive instances! How is this possible and how can I correct it? At the very least it should guarantee that the two classes are always represented in each fold...
This brings me problems when trying to compute Precision, Recall, F-measure and so on...
Are you using the stratified form of cross-validation that cvpartition provides?
Use the second syntax described in the documentation page, i.e. c = cvpartition(group,'kfold',k) rather than c = cvpartition(n,'kfold',k). Here group is a vector (or categorical array, cell array of strings, etc.) of class labels, and cvpartition will then stratify the selection of observations into folds rather than just splitting everything randomly into groups.
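For example, with class labels like those in your dataset (134 negatives and 19 positives; the coding below is just for illustration), each fold should end up with roughly 4 positive instances:
y = [ones(134,1); 2*ones(19,1)];    % illustrative labels: 1 = negative, 2 = positive
c = cvpartition(y, 'kfold', 5);     % stratified, because a vector of class labels is passed
for k = 1:c.NumTestSets
    fprintf('Fold %d: %d positive test instances\n', k, sum(y(test(c,k)) == 2));
end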

Permutation vectors from the CLUSTERGRAM object (MATLAB)

I'm using the CLUSTERGRAM object from the Bioinformatics Toolbox (ver 3.7).
MATLAB version R2011a.
I'd like to get the permutation vectors for rows and columns of a clustergram, as I can with the dendrogram function:
x = magic(10);
>> [~,~,permrows] = dendrogram(linkage(x,'average','euc'))
permrows =
9 10 6 7 8 1 2 4 5 3
>> [~,~,permcols] = dendrogram(linkage(x','average','euc'))
permcols =
6 7 8 9 2 1 3 4 5 10
I found that the ordering is not the same from clustergram as from dendrogram, most probably due to the optimal leaf ordering calculation (which I don't want to disable).
For example, for clustergram from:
clustergram(x)
('average' and 'euclidean' are the default methods for clustergram)
the vectors (as on the figure attached) should be:
permrows = [1 2 4 5 3 10 9 6 7 8];
permcols = [1 2 8 9 6 7 10 5 4 3];
So, how can I get those vectors programmatically? Is anybody well familiar with this object?
Can anyone suggest a good alternative? I know I can create a similar figure by combining the imagesc and dendrogram functions, but the leaf ordering is much better (optimal) in clustergram than in dendrogram.
From looking at the documentation, I guess that get(gco,'ColumnLabels') and get(gco,'RowLabels'), where gco is the clustergram object, should give you the reordered labels. Note that the corresponding set-methods take in the labels in original order and internally reorder them.
Consequently, if you have used custom labels (set(gco,'RowLabels',originalLabels))
[~,permrows] = ismember(get(gco,'RowLabels'),originalLabels)
should return the row permutation.
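Putting this together for the magic(10) example from the question (assuming the default numeric labels, which are stored as strings, so str2double recovers the indices; I haven't verified this on every release):
x = magic(10);
cg = clustergram(x);                             % default 'average' linkage, Euclidean distance
permrows = str2double(get(cg, 'RowLabels'))      % hopefully the reordered row indices
permcols = str2double(get(cg, 'ColumnLabels'))   % and the reordered column indices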