optimizing manually-coded k-means in MATLAB? - matlab

So I'm writing a k-means script in MATLAB, since the native function doesn't seem to be very efficient, and it seems to be fully operational. It appears to work on the small training set that I'm using (which is a 150x2 matrix fed via text file). However, the runtime is taking exponentially longer for my target data set, which is a 3924x19 matrix.
I'm not the greatest at vectorization, so any suggestions would be greatly appreciated. Here's my k-means script so far (I know I'm going to have to adjust my convergence condition, since it's looking for an exact match, and I'll probably need more iterations for a dataset this large, but I want it to be able to finish in a reasonable time first, before I crank that number up):
clear all;
%take input file (manually specified by user
disp('Please type input filename (in working directory): ')
target_file = input('filename: ', 's');
%parse and load into matrix
data = load(target_file);
%prompt name of output file for later) UNCOMMENT BELOW TWO LINES LATER
% disp('Please type output filename (to be saved in working directory): ')
% output_name = input('filename:', 's')
%prompt number of clusters
disp('Please type desired number of clusters: ')
c = input ('number of clusters: ');
%specify type of kmeans algorithm ('regular' for regular, 'fuzzy' for fuzzy)
%UNCOMMENT BELOW TWO LINES LATER
% disp('Please specify type (regular or fuzzy):')
% runtype = input('type: ', 's')
%initialize cluster centroid locations within bounds given by data set
%initialize rangemax and rangemin row vectors
%with length same as number of dimensions
rangemax = zeros(1,size(data,2));
rangemin = zeros(1,size(data,2));
%map max and min values for bounds
for dim = 1:size(data,2)
rangemax(dim) = max(data(:,dim));
rangemin(dim) = min(data(:,dim));
end
% rangemax
% rangemin
%randomly initialize mu_k (center) locations in (k x n) matrix where k is
%cluster number and n is number of dimensions/coordinates
mu_k = zeros(c,size(data,2));
for k = 1:size(data,2)
mu_k(k,:) = rangemin + (rangemax - rangemin).*rand(1,1);
end
mu_k
%iterate k-means
%initialize holding variable for distance comparison
comparisonmatrix = [];
%initialize assignment vector
assignment = zeros(size(data,1),1);
%initialize distance holding vector
dist = zeros(1,size(data,2));
%specify convergence threshold
%threshold = 0.001;
for iteration = 1:25
%save current assignment values to check convergence condition
hold_assignment = assignment;
for point = 1:size(data,1)
%calculate distances from point to centers
for k = 1:c
%holding variables
comparisonmatrix = [data(point,:);mu_k(k,:)];
dist(k) = pdist(comparisonmatrix);
end
%record location of mininum distance (location value will be between 1
%and k)
[minval, location] = min(dist);
%assign cluster number (analogous to location value)
assignment(point) = location;
end
%check convergence criteria
if isequal(assignment,hold_assignment)
break
end
%revise mu_k locations
%count number of each label
assignment_count = zeros(1,c);
for i = 1:size(data,1)
assignment_count(assignment(i)) = assignment_count(assignment(i)) + 1;
end
%compute centroids
point_total = zeros(size(mu_k));
for row = 1:size(data,1)
point_total(assignment(row),:) = point_total(assignment(row)) + data(row,:);
end
%move mu_k values to centroids
for center = 1:c
mu_k(center,:) = point_total(center,:)/assignment_count(center);
end
end
There are a lot of loops in there, so I feel that there's a lot of optimization to be made. However, I think I've just been staring at this code for far too long, so some fresh eyes could help. Please let me know if I need to clarify anything in the code block.
When the above code block is executed (in context) on the large dataset, it takes 3732.152 seconds, according to MATLAB's profiler, to make the full 25 iterations (I'm assuming it hasn't "converged" according to my criteria yet) for 150 clusters, but about 130 of them return NaNs (130 rows in mu_k).

Profiling will help, but the place to rework your code is to avoid the loop over the number of data points (for point = 1:size(data,1)). Vectorize that.
In your for iteration loop here is a quick partial example,
[nPoints,nDims] = size(data);
% Calculate all high-dimensional distances at once
kdiffs = bsxfun(#minus,data,permute(mu_k,[3 2 1])); % NxDx1 - 1xDxK => NxDxK
distances = sum(kdiffs.^2,2); % no need to do sqrt
distances = squeeze(distances); % Nx1xK => NxK
% Find closest cluster center for each point
[~,ik] = min(distances,[],2); % Nx1
% Calculate the new cluster centers (mean the data)
mu_k_new = zeros(c,nDims);
for i=1:c,
indk = ik==i;
clustersizes(i) = nnz(indk);
mu_k_new(i,:) = mean(data(indk,:))';
end
This isn't the only (or the best) way to do it, but it should be a decent example.
Some other comments:
Instead of using input, make this script into a function to efficiently handle input arguments.
If you want an easy way to specify a file, see uigetfile.
With many MATLAB functions, such as max, min, sum, mean, etc., you can specify a dimension over which the function should operate. This way you an run it on a matrix and compute values for multiple conditions/dimensions at the same time.
Once you get decent performance, consider iterating longer, specifically until the centers no longer change or the number of samples that change clusters becomes small.
The cluster with the smallest distance for each point, ik, will be the same with squared Euclidean distance.

Related

K-nearest neighbourhood in a spcific range in MATLAB

I am dealing with k-neighbour problem in MATLAB. There is an image with row r and column c. And divide it into r*c blocks - each blcok represents a patch centered in each pixel.
And I want to find the k-nearest neighbourbood of each blcok within a specific search range. At first I use knnsearch with kdTree:
ns = createns(Block','nsmethod','kdtree');
[Index_nearest,dist] = knnsearch(ns,Block','k',k+1);
However, I find that it would find k-nearest neighbourhood in all blocks, instead of the specific range. Therefore, is there any other method to achieve the goal? Could anyone give me some hints? Thanks in advance!
Edit: the code for knnsearch:
function [Index_nearest, Weight] = Compute_Weight(Input, Options)
% Input the data and pre-processing
windowsize = Options.winsize;
k = Options.directionsize;
deviation = Options.deviation; % Deviation for Gaussian kernel
h = Options.h; % This parameter is for controling the weights
[r,c] = size(Input);
In_pad = padarray(Input, [windowsize windowsize], 'symmetric');
window_size = (2*windowsize+1)*(2*windowsize+1);
Block = zeros(window_size,r*c);
%% Split the input data into blocks
for i = 1:r
for j = 1:c
block = In_pad(i:i+2*windowsize,j:j+2*windowsize);
Block(:,r*(i-1)+j) = block(:); % expand from column to column
end
end
%% Find k-nearest neighbour blocks
% Create a KDtree with all local patches
ns = createns(Block','nsmethod','kdtree');
% Find the patches closest by in intensity in relation to the local patch itself
[Index_nearest,ddd] = knnsearch(ns,Block','k',k+1);
Index_nearest = Index_nearest';
Index_nearest = Index_nearest(2:k+1,:);
end

Why is the matlab filter order limited by one third of the length of data minus one?

I have a question regarding filters in matlab.
I wonder why the filter order in MATLAB is limited by n_order = floor(length(t)/3)-1 as in the example below? Is this a numerical requirement for filter to work?
Also, n_order is equal to the size of the window, thus this limit permits creating 3 windows with the maximum order size. Is there a way to create more windows with the same filter order?
The code below is just to give an example to you.
t = linspace(0,4*pi,1000);
rng default %initialize random number generator
x = sin(t) + 0.25*rand(size(t));
dt = t(2)-t(1);
Fs = 1/dt; % Sampling frequency
f_band = [0.01 2];
n_order = floor(length(t)/3)-1; % max order (will result in 3 windows)
n_wind_filtLen = n_order+1; % step-length of windows
df = Fs/n_wind_filtLen; % frequency bin size
b = fir1(n_order,f_band/(Fs/2),'bandpass',hamming(n_order+1)); % a=1;
x_fil = filtfilt(b,1,x); % a=1;
figure;
plot(t, x,'-k','linewidth',2); hold on;
plot(t, x_fil,'-.r','linewidth',2);
This requirement comes from the function filtfilt. You can find its documentation here.
By typing help filtfilt, you can read:
The length of the input X must be more than three times the filter
order, defined as max(length(B)-1,length(A)-1).
This limitation is confirmed in the function's source code:
nb = numel(b);
nfilt = max(nb,na);
nfact = max(1,3*(nfilt-1)); % length of edge transients
if Npts <= nfact % input data too short
error(message('signal:filtfilt:InvalidDimensionsDataShortForFiltOrder',num2str(nfact)));
end
If you want to increase the order, you have to increase the length of your data.
The other limitation, comes from the function fir1:
The window vector must have n + 1 elements.

"Out of Memory" Matlab

Well I made 4 standalone executable from 4 different Matlab functions to build a face recognition system. I am calling those 4 executables using different batch codes and performing tasks on images. The total number of images I have is above 300k. 3 of these 4 executables is working good, but I am facing "out of memory" problem when I am trying to call the standalone executable of Fisherface function. It simply calculates unique features of each image using Fisher's linear discriminant analysis. The analysis is applied on the huge face matrix which consists of pixel values of over 150,000 images of size 60*60. Hence the size of the matrix is 150,000*3600.
Well what I understand is its happening due to shortage of contiguous memory in RAM. So as a way out, I chose to divide my large image set into number of subsets, each of which contains 3000 images. Now when an input face is provided, it searches for best matches of that input in each of those subset and finally sorts out the final list of 3 best matches with lowest distances (Euclidean). This resolved the out of memory error but the recognition rate became much lower. Because when the discriminant analysis is done in the original face matrix (which I have tested in smaller datasets containing 4000-5000 images), it gives good recognition rate.
I am seeking a way out of this problem. I want to perform all the operations on the large matrix. Is there a way to implement the function more efficiently, for example, allocating memory dynamically in Matlab? I hope I have been fairly specific in order to explain my problem. Below, I have provided the code segment of that particular executable.
function FisherfaceCorenew(matname)
load(matname);
Class_number = size(T,2) ;
Class_population = 1;
P = Class_population * Class_number; % Total number of training images
%%%%%%%%%%%%%%%%%%%%%%%% calculating the mean image
m_database = single(mean(T,2));
%%%%%%%%%%%%%%%%%%%%%%%% Calculating the deviation of each image from mean image
A = T - repmat(m_database,1,P);
L = single(A')*single(A);
[V D] = eig(L); % Diagonal elements of D are the eigenvalues for both L=A'*A and C=A*A'.
%%%%%%%%%%%%%%%%%%%%%%%% Sorting and eliminating small eigenvalues
L_eig_vec = [];
for i = 1 : P
L_eig_vec = [L_eig_vec V(:,i)];
end
%%%%%%%%%%%%%%%%%%%%%%%% Calculating the eigenvectors of covariance matrix 'C'
V_PCA = single(A) * single(L_eig_vec);
%%%%%%%%%%%%%%%%%%%%%%%% Projecting centered image vectors onto eigenspace
ProjectedImages_PCA = [];
for i = 1 : P
temp = single(V_PCA')*single(A(:,i));
ProjectedImages_PCA = [ProjectedImages_PCA temp];
end
%%%%%%%%%%%%%%%%%%%%%%%% Calculating the mean of each class in eigenspace
m_PCA = mean(ProjectedImages_PCA,2); % Total mean in eigenspace
m = zeros(P,Class_number);
Sw = zeros(P,P); %new
Sb = zeros(P,P); %new
for i = 1 : Class_number
m(:,i) = mean( ( ProjectedImages_PCA(:,((i-1)*Class_population+1):i*Class_population) ), 2 )';
S = zeros(P,P); %new
for j = ( (i-1)*Class_population+1 ) : ( i*Class_population )
S = S + (ProjectedImages_PCA(:,j)-m(:,i))*(ProjectedImages_PCA(:,j)-m(:,i))';
end
Sw = Sw + S; % Within Scatter Matrix
Sb = Sb + (m(:,i)-m_PCA) * (m(:,i)-m_PCA)'; % Between Scatter Matrix
end
%%%%%%%%%%%%%%%%%%%%%%%% Calculating Fisher discriminant basis's
% We want to maximise the Between Scatter Matrix, while minimising the
% Within Scatter Matrix. Thus, a cost function J is defined, so that this condition is satisfied.
[J_eig_vec, J_eig_val] = eig(Sb,Sw);
J_eig_vec = fliplr(J_eig_vec);
%%%%%%%%%%%%%%%%%%%%%%%% Eliminating zero eigens and sorting in descend order
for i = 1 : Class_number-1
V_Fisher(:,i) = J_eig_vec(:,i);
end
%%%%%%%%%%%%%%%%%%%%%%%% Projecting images onto Fisher linear space
for i = 1 : Class_number*Class_population
ProjectedImages_Fisher(:,i) = V_Fisher' * ProjectedImages_PCA(:,i);
end
save fisherdata.mat m_database V_PCA V_Fisher ProjectedImages_Fisher;
end
It's not easy to help you, because we can't see the sizes of your matrices.
At least you could use the Matlab clear command after you don't use a variable anymore (e.g. A).
Maybe you could use the single() command when you allocate A variable instead of in every equation.
A = single(T - repmat(m_database,1,P));
And then
L = A'*A;
Also you could use the Matlab profiler with memory usage to see your memory demand.
Another option could be to use sparse matrices or reduce to even smaller datatypes like uint8, if appropriate for some data.

Finding 2D area defined by contour lines in Matlab

I am having difficulty with calculating 2D area of contours produced from a Kernel Density Estimation (KDE) in Matlab. I have three variables:
X and Y = meshgrid which variable 'density' is computed over (256x256)
density = density computed from the KDE (256x256)
I run the code
contour(X,Y,density,10)
This produces the plot that is attached. For each of the 10 contour levels I would like to calculate the area. I have done this in some other platforms such as R but am having trouble figuring out the correct method / syntax in Matlab.
C = contourc(density)
I believe the above line would store all of the values of the contours allowing me to calculate the areas but I do not fully understand how these values are stored nor how to get them properly.
This little script will help you. Its general for contour. Probably working for contour3 and contourf as well, with adjustments of course.
[X,Y,Z] = peaks; %example data
% specify certain levels
clevels = [1 2 3];
C = contour(X,Y,Z,clevels);
xdata = C(1,:); %not really useful, in most cases delimters are not clear
ydata = C(2,:); %therefore further steps to determine the actual curves:
%find curves
n(1) = 1; %n: indices where the certain curves start
d(1) = ydata(1); %d: distance to the next index
ii = 1;
while true
n(ii+1) = n(ii)+d(ii)+1; %calculate index of next startpoint
if n(ii+1) > numel(xdata) %breaking condition
n(end) = []; %delete breaking point
break
end
d(ii+1) = ydata(n(ii+1)); %get next distance
ii = ii+1;
end
%which contourlevel to calculate?
value = 2; %must be member of clevels
sel = find(ismember(xdata(n),value));
idx = n(sel); %indices belonging to choice
L = ydata( n(sel) ); %length of curve array
% calculate area and plot all contours of the same level
for ii = 1:numel(idx)
x{ii} = xdata(idx(ii)+1:idx(ii)+L(ii));
y{ii} = ydata(idx(ii)+1:idx(ii)+L(ii));
figure(ii)
patch(x{ii},y{ii},'red'); %just for displaying purposes
%partial areas of all contours of the same plot
areas(ii) = polyarea(x{ii},y{ii});
end
% calculate total area of all contours of same level
totalarea = sum(areas)
Example: peaks (by Matlab)
Level value=2 are the green contours, the first loop gets all contour lines and the second loop calculates the area of all green polygons. Finally sum it up.
If you want to get all total areas of all levels I'd rather write some little functions, than using another loop. You could also consider, to plot just the level you want for each calculation. This way the contourmatrix would be much easier and you could simplify the process. If you don't have multiple shapes, I'd just specify the level with a scalar and use contour to get C for only this level, delete the first value of xdata and ydata and directly calculate the area with polyarea
Here is a similar question I posted regarding the usage of Matlab contour(...) function.
The main ideas is to properly manipulate the return variable. In your example
c = contour(X,Y,density,10)
the variable c can be returned and used for any calculation over the isolines, including area.

exponential random numbers with a bound in matlab

I want to pick values between, say, 50 and 150 using an exponential random number generator (a flat hazard function). How do I implement bounds on the built-in exponential random number function in matlab?
A quick way is to a sequence longer than you need, and throw out values outside your desired range.
dist = exprnd(100,1,1000);
%# mean of 100 ---^ ^---^--- 1x1000 random numbers
dist(dist<50 | dist>150) = []; %# will be shorter than 1000
If you don't have enough values after pruning, you can repeat and append onto the vector, or however else you want to do it.
exprandn uses rand (see >> open exprnd.m) so you can bound the output of that instead by reversing the process and sampling uniformly within the desired range [r1, r2].
sizeOut = [1, 1000]; % sample size
mu = 100; % parameter of exponential
r1 = 50; % lower bound
r2 = 150; % upper bound
r = exprndBounded(mu, sizeOut, r1, r2); % bounded output
function r = exprndBounded(mu, sizeOut, r1, r2);
minE = exp(-r1/mu);
maxE = exp(-r2/mu);
randBounded = minE + (maxE-minE).*rand(sizeOut);
r = -mu .* log(randBounded);
The drawn densities (using a non-parametric kernel estimator) look like the following for 20K samples