K-nearest neighbours within a specific range in MATLAB - matlab

I am dealing with a k-nearest-neighbour problem in MATLAB. There is an image with r rows and c columns, which I divide into r*c blocks; each block represents a patch centered on one pixel.
I want to find the k nearest neighbours of each block within a specific search range. At first I used knnsearch with a k-d tree:
ns = createns(Block','nsmethod','kdtree');
[Index_nearest,dist] = knnsearch(ns,Block','k',k+1);
However, this finds the k nearest neighbours among all blocks instead of only those within the search range. Is there another method to achieve this? Could anyone give me some hints? Thanks in advance!
Edit: the code for knnsearch:
function [Index_nearest, Weight] = Compute_Weight(Input, Options)
% Input the data and pre-processing
windowsize = Options.winsize;
k = Options.directionsize;
deviation = Options.deviation; % Deviation for Gaussian kernel
h = Options.h; % This parameter is for controlling the weights
[r,c] = size(Input);
In_pad = padarray(Input, [windowsize windowsize], 'symmetric');
window_size = (2*windowsize+1)*(2*windowsize+1);
Block = zeros(window_size,r*c);
%% Split the input data into blocks
for i = 1:r
for j = 1:c
block = In_pad(i:i+2*windowsize,j:j+2*windowsize);
Block(:,c*(i-1)+j) = block(:); % c*(i-1)+j is unique per pixel (i,j), also for non-square images; block(:) stacks the patch column by column
end
end
%% Find k-nearest neighbour blocks
% Create a KDtree with all local patches
ns = createns(Block','nsmethod','kdtree');
% Find the patches closest by in intensity in relation to the local patch itself
[Index_nearest,ddd] = knnsearch(ns,Block','k',k+1);
Index_nearest = Index_nearest';
Index_nearest = Index_nearest(2:k+1,:);
end
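One way to restrict the search is to skip the global k-d tree and, for each pixel, compare its patch only against the patches whose centers lie inside a spatial window around it. The following is a minimal sketch under stated assumptions, not a drop-in replacement for the function above: searchRadius is a hypothetical parameter, Block holds random stand-in data, and the columns of Block are assumed to be ordered by sub2ind([r c], i, j) (column-major), which differs from the ordering used in the question. It relies on implicit expansion (R2016b or later).

```matlab
% Hypothetical sizes and stand-in data
r = 6; c = 8; k = 3; searchRadius = 2;
rng(0);
Block = rand(9, r*c);                 % one 3x3 patch per column (column-major pixel order assumed)

Index_nearest = zeros(k, r*c);
for j = 1:c
    for i = 1:r
        % Pixels inside the search window, clipped to the image borders
        rows = max(1, i-searchRadius):min(r, i+searchRadius);
        cols = max(1, j-searchRadius):min(c, j+searchRadius);
        [RR, CC] = ndgrid(rows, cols);
        cand = sub2ind([r c], RR(:), CC(:));
        self = sub2ind([r c], i, j);
        cand(cand == self) = [];      % exclude the patch itself

        % Squared Euclidean distance to every candidate patch
        d = sum((Block(:, cand) - Block(:, self)).^2, 1);
        [~, order] = sort(d);
        Index_nearest(:, self) = cand(order(1:k));
    end
end
```

Each column of Index_nearest then holds the linear indices of the k most similar patches inside the window. The window must contain at least k other pixels, which holds here because even a corner pixel has 8 candidates.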

Related

Matlab Saving the Output of a Function (preferably as a csv)

I know this kind of thing has been asked before but nothing I have tried has helped.
I've borrowed a function from GitHub for a project, but it's in MATLAB and my project is in Python, so I want to save the output data as a CSV so I can load it back into Python.
I've literally only been using matlab for about 4 hrs so this could be a stupid question...
I've tried an array of different methods for saving as a csv but none of them have worked. They don't throw errors though, the files just don't show up.
This is my latest attempt:
clear;clc;
fileName = 'AlzScSK.csv';
csvData = importdata(fileName);
rawData = csvData.data;
dataMV = rawData;
scaledMV = dataMV
K = 6;
function [dataImputed dataImputedWeighted] = NSkNNDataHM(dataMV)
% Function to impute missing values in a dataset using NSkNN. scaledMV is
% the autoscale dataset with the missing values and K is the # of nearest
% neighbors to use to impute the data. filteredMV is the missing value
% dataset after filtering, but before autoscaling.
% NSkNNData_HM does not skip neighbors that have NaN values in the same
% location as the metabolite being imputed. Instead, it replaces these NaN
% values with the half minimum value of that metabolite.
numCol = size(scaledMV,2);
for col = 1:numCol
rowMV{col} = find(isnan(scaledMV(:,col))); % Finds the row # of every missing value in each column
end
counter = 1;
for targetCol = 1:numCol % i is the target sample
for neighborCol = 1:numCol % Calculate the Euclidean distances between the target sample (i) and the other samples (j)
MVRowsRemoved = scaledMV;
rowsToRemove = union(rowMV{targetCol},rowMV{neighborCol}); % Ignore NaNs when calculating distances
MVRowsRemoved(rowsToRemove,:) = []; % Remove rows in target sample that have missing values
numMetInCalc = size(MVRowsRemoved,1); % # of metabolites used in calculation of metabolites in order to weight distance
% Divide by numMetInCalc to avoid scenarios where distances
% calculated with only a few metabolites are weighted more heavily
% over distances that are close to the same distance, but used more
% metabolites in the calculation.
distance = pdist2(MVRowsRemoved(:,targetCol)',MVRowsRemoved(:,neighborCol)')/sqrt(numMetInCalc);
%distance = pdist2(MVRowsRemoved(:,targetCol)',MVRowsRemoved(:,neighborCol)');
distIdx(counter,:) = [targetCol distance neighborCol];
counter = counter+1;
end
end
% Remove rows that calculated the Euclidean distance between a sample and
% itself.
sameSample = find(distIdx(:,1)==distIdx(:,3));
distIdx(sameSample,:) = [];
distIdxSorted = sortrows(distIdx);
minValperRow = min(scaledMV,[],2); % Finds minimum of each metabolite
% Implement NSkNN with half minimum replacement
dataImputed = scaledMV;
dataImputedWeighted = scaledMV;
for targetCol = 1:numCol
numMV = size(rowMV{targetCol},1); % # of missing values in a column
firstNNIdx = (targetCol-1)*(numCol-1)+1; % Column index of the first nearest neighbor of target sample (i) in distIdxSorted
for MVidx = 1:numMV % For each missing value in the target sample...
tempDataMV = scaledMV;
NN = distIdxSorted(firstNNIdx:firstNNIdx+K-1,3); % Column #s of the k nearest neighbors
DistanceNN = distIdxSorted(firstNNIdx:firstNNIdx+K-1,2); % Distances of k nearest neighbors
idxNaNinCol = find(isnan(tempDataMV(rowMV{targetCol}(MVidx),NN))); % Finds missing values that are the same metabolite as the target metabolite to be imputed
if isempty(idxNaNinCol)~=1 % If NaN values found...
% If there are NaN values in the nearest neighbor
% metabolite that is the same as the target metabolite to
% be imputed, replace with half min value of the target
% metabolite
tempDataMV(rowMV{targetCol}(MVidx),NN(idxNaNinCol)) = (minValperRow(rowMV{targetCol}(MVidx)) - avgMV(rowMV{targetCol}(MVidx))/stddevMV(rowMV{targetCol}(MVidx)))/2;
end
% Imputed data is weighted by the inverse of the distance
WeightMultiplier = (1./DistanceNN')/sum(1./DistanceNN);
dataImputedWeighted(rowMV{targetCol}(MVidx),targetCol) = sum(tempDataMV(rowMV{targetCol}(MVidx),NN).*WeightMultiplier);
% Not weighted
dataImputed(rowMV{targetCol}(MVidx),targetCol) = mean(tempDataMV(rowMV{targetCol}(MVidx),NN));
writematrix(dataImputed,['M.csv'])
end
end
end
I'm not sure if I'm maybe referring to the wrong thing or if I've put the writematrix(dataImputed,['M.csv']) in the wrong place.
I've also tried these sorts of routes:
% Convert cell to a table and use first row as variable names
T = cell2table(c(2:end,:),'VariableNames',c(1,:))
% Write the table to a CSV file
writetable(T,'myDataFile.csv')
dlmwrite(filename,M)
csvwrite('filename.csv',variable2,0,2)
This is currently what my workspace looks like
Try this: writematrix(dataImputed,'M.csv','WriteMode','append')
If it's still not working, sharing the AlzScSK.csv with us would be helpful.
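As far as the posted snippet shows, NSkNNDataHM is defined but never called, so the writematrix line inside it would never run, which would explain why no file appears and no error is thrown. A minimal sketch of the call-then-write pattern follows; imputeStandIn is a hypothetical stand-in for the real imputation function (it just replaces NaN with 0), and the file is written to tempdir only so the example does not clutter the working directory.

```matlab
% Small stand-in dataset with missing values
dataMV = [1 NaN 3; 4 5 NaN];

% Hypothetical stand-in for the imputation function
imputeStandIn = @(X) fillmissing(X, 'constant', 0);

% 1) Call the function and capture its output
dataImputed = imputeStandIn(dataMV);

% 2) Write the result once, after the call, using an explicit path
outFile = fullfile(tempdir, 'M.csv');
writematrix(dataImputed, outFile);

% 3) Read it back to confirm the file exists where expected
readBack = readmatrix(outFile);
```

Writing once after the function returns also avoids rewriting the file on every loop iteration, as the original placement inside the innermost loop would do.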

Why is the matlab filter order limited by one third of the length of data minus one?

I have a question regarding filters in matlab.
I wonder why the filter order in MATLAB is limited by n_order = floor(length(t)/3)-1 as in the example below? Is this a numerical requirement for filter to work?
Also, n_order is equal to the size of the window, thus this limit permits creating 3 windows with the maximum order size. Is there a way to create more windows with the same filter order?
The code below is just to give an example to you.
t = linspace(0,4*pi,1000);
rng default %initialize random number generator
x = sin(t) + 0.25*rand(size(t));
dt = t(2)-t(1);
Fs = 1/dt; % Sampling frequency
f_band = [0.01 2];
n_order = floor(length(t)/3)-1; % max order (will result in 3 windows)
n_wind_filtLen = n_order+1; % step-length of windows
df = Fs/n_wind_filtLen; % frequency bin size
b = fir1(n_order,f_band/(Fs/2),'bandpass',hamming(n_order+1)); % a=1;
x_fil = filtfilt(b,1,x); % a=1;
figure;
plot(t, x,'-k','linewidth',2); hold on;
plot(t, x_fil,'-.r','linewidth',2);
This requirement comes from the function filtfilt, as stated in its documentation.
By typing help filtfilt, you can read:
The length of the input X must be more than three times the filter
order, defined as max(length(B)-1,length(A)-1).
This limitation is confirmed in the function's source code:
nb = numel(b);
nfilt = max(nb,na);
nfact = max(1,3*(nfilt-1)); % length of edge transients
if Npts <= nfact % input data too short
error(message('signal:filtfilt:InvalidDimensionsDataShortForFiltOrder',num2str(nfact)));
end
If you want to increase the order, you have to increase the length of your data.
The other limitation comes from the function fir1:
The window vector must have n + 1 elements.
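The length check quoted above can be turned into simple arithmetic. The sketch below assumes an FIR filter (a = 1), so the filter order equals length(b)-1, and it uses only the condition from the quoted help text (input length strictly greater than three times the order). The names nPts, maxOrder, targetOrder and minPts are illustrative, and other releases of filtfilt may enforce a different limit.

```matlab
nPts = 1000;                         % number of samples, as in the question

% Largest order n that still satisfies nPts > 3*n
maxOrder = ceil(nPts/3) - 1;

% The order used in the question, for comparison
askerOrder = floor(nPts/3) - 1;

% Conversely: minimum number of samples needed for a desired order
targetOrder = 500;
minPts = 3*targetOrder + 1;
```

Under this condition the order chosen in the question is within the limit (slightly conservative), and using a higher order such as 500 would need at least 1501 samples.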

Best method of removing flattish areas of a signal in MatLab?

Say I have a signal that I am processing in MATLAB, which looks a bit like this:
What functions would I have to use to get rid of the flattish area in the middle? Are there any functions that can do that, or do I need to program it myself? Currently I just have a blank function, as I don't know where to start:
function removala = removal(a, b)
end
Are there any quick functions that can remove it, or do I just have to search for all values below a threshold and remove them myself? For reference, a and b are vectors of amplitude points.
Use findpeaks:
% generating signal
x = 1:0.1:10;
y = rand(size(x))*0.5;
y([5,25,84]) = [6,-5.5,7.5];
z = y;
thresh = 0.75; % height threshold
% find peaks
[pks,locs] = findpeaks(z,'MinPeakProminence',thresh);
% remove signal noise between peaks
for ii = 1:length(locs)-1
zz = z(locs(ii)+1:locs(ii+1)-1);
zz(abs(zz) < thresh) = 0;
z(locs(ii)+1:locs(ii+1)-1) = zz;
end
% plot
plot(x,y);
hold on
plot(x,z);
plot(x(locs),pks,'og');
legend('original signal','modified signal','peaks')
You probably want to remove the parts of the signal whose absolute amplitude is below a certain threshold.
So the two inputs of your function would be the raw signal and the threshold. The function will output a variable "cleanSignal":
function cleanSignal = removal(rawSignal,threshold)
Simplest implementation: remove the data below the threshold. If rawSignal is a matrix, the result will be a vector concatenating all the samples that are not below the threshold.
ind = abs(rawSignal)<threshold;
rawSignal(ind) = [];
cleanSignal = rawSignal;
This might not be the behavior you want if you need to preserve the original shape of your rawSignal matrix. In that case you can just set the values below the threshold to NaN.
ind = abs(rawSignal)<threshold;
rawSignal(ind) = nan;
cleanSignal = rawSignal;
end
Notice that this does not really remove flat signal, but signal that is close to 0.
If you really want to remove flat signal, you should use
ind = abs(diff(rawSignal))<threshold;
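One detail to watch with the diff-based mask: diff returns one element fewer than its input, so the mask cannot index rawSignal directly. A minimal sketch of one way to handle this, assuming rawSignal is a row vector and that a sample counts as flat when the step to either neighbour is below the threshold (the variable names isFlatStep and isFlat are illustrative):

```matlab
rawSignal = [0 5 5.01 5.02 5.01 9 2 7];   % small made-up example
threshold = 0.1;

% One entry per step between consecutive samples (length numel-1)
isFlatStep = abs(diff(rawSignal)) < threshold;

% Mark both end points of every flat step, restoring the full length
isFlat = [isFlatStep false] | [false isFlatStep];

% Keep the shape of the signal and blank out the flat samples
cleanSignal = rawSignal;
cleanSignal(isFlat) = NaN;
```

Here samples 2 to 5 form the plateau around 5 and are replaced by NaN, while the isolated values 0, 9, 2 and 7 are kept even though some of them are small.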

Can't I use the graphshortestpath function here?

I want to calculate the diameter of a graph, i.e. the greatest distance between any two vertices of G.
cm is the connectivity matrix of the graph, and the diameter should end up in variable a.
But MATLAB gives me the error message 'Input argument should be a sparse array.'
Can't I use the graphshortestpath function to calculate the diameter? If not, what should I do instead?
cm = [0,1,1,1,0;1,0,0,1,0;0,1,0,0,0;1,0,0,0,0;0,0,0,0,0];
bg = biograph(cm);
a = 1;
for i = 1:4
for j = (i+1):5
[dist,path,pred] = graphshortestpath(bg,i,j)
if a<=dist
a = dist
end
end
end
I haven't tested this (I don't have MATLAB here), but how about making cm sparse, and use that as input to graphshortestpath?
According to the documentation, "[The first argument must be an] N-by-N sparse matrix that represents a graph. Nonzero entries in matrix G represent the weights of the edges." Thus, you should not use the biograph as input.
Check out the first example in the documentation; it explains it very well!
cm_full = [0,1,1,1,0;1,0,0,1,0;0,1,0,0,0;1,0,0,0,0;0,0,0,0,0];
cm = sparse(cm_full); % graphshortestpath expects a sparse matrix, not a biograph
a = 1;
for i = 1:4
    for j = (i+1):5
        [dist,path,pred] = graphshortestpath(cm,i,j);
        if a<=dist
            a = dist;
        end
    end
end
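As an alternative that does not use graphshortestpath at all, the all-pairs distances can be computed in one call with the digraph and distances functions from base MATLAB (R2015b or later). This is a different technique from the one in the answer, shown only as a sketch. It assumes the nonzero entries of cm are unit edge weights and that unreachable pairs (vertex 5 is isolated in this matrix, so its distances are Inf) should be ignored when taking the maximum.

```matlab
cm = [0,1,1,1,0;1,0,0,1,0;0,1,0,0,0;1,0,0,0,0;0,0,0,0,0];

G = digraph(cm);          % directed graph from the adjacency matrix
D = distances(G);         % D(i,j) = shortest-path length from i to j, Inf if unreachable

% Diameter over the pairs that are actually connected
diam = max(D(isfinite(D)));
```

Unlike the double loop over i < j, this considers both directions of every pair, which matters because cm is not symmetric.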

optimizing manually-coded k-means in MATLAB?

So I'm writing a k-means script in MATLAB, since the native function doesn't seem to be very efficient, and it seems to be fully operational. It appears to work on the small training set that I'm using (which is a 150x2 matrix fed via text file). However, the runtime is taking exponentially longer for my target data set, which is a 3924x19 matrix.
I'm not the greatest at vectorization, so any suggestions would be greatly appreciated. Here's my k-means script so far (I know I'm going to have to adjust my convergence condition, since it's looking for an exact match, and I'll probably need more iterations for a dataset this large, but I want it to be able to finish in a reasonable time first, before I crank that number up):
clear all;
%take input file (manually specified by user)
disp('Please type input filename (in working directory): ')
target_file = input('filename: ', 's');
%parse and load into matrix
data = load(target_file);
%prompt name of output file for later) UNCOMMENT BELOW TWO LINES LATER
% disp('Please type output filename (to be saved in working directory): ')
% output_name = input('filename:', 's')
%prompt number of clusters
disp('Please type desired number of clusters: ')
c = input ('number of clusters: ');
%specify type of kmeans algorithm ('regular' for regular, 'fuzzy' for fuzzy)
%UNCOMMENT BELOW TWO LINES LATER
% disp('Please specify type (regular or fuzzy):')
% runtype = input('type: ', 's')
%initialize cluster centroid locations within bounds given by data set
%initialize rangemax and rangemin row vectors
%with length same as number of dimensions
rangemax = zeros(1,size(data,2));
rangemin = zeros(1,size(data,2));
%map max and min values for bounds
for dim = 1:size(data,2)
rangemax(dim) = max(data(:,dim));
rangemin(dim) = min(data(:,dim));
end
% rangemax
% rangemin
%randomly initialize mu_k (center) locations in (k x n) matrix where k is
%cluster number and n is number of dimensions/coordinates
mu_k = zeros(c,size(data,2));
for k = 1:size(data,2)
mu_k(k,:) = rangemin + (rangemax - rangemin).*rand(1,1);
end
mu_k
%iterate k-means
%initialize holding variable for distance comparison
comparisonmatrix = [];
%initialize assignment vector
assignment = zeros(size(data,1),1);
%initialize distance holding vector
dist = zeros(1,size(data,2));
%specify convergence threshold
%threshold = 0.001;
for iteration = 1:25
%save current assignment values to check convergence condition
hold_assignment = assignment;
for point = 1:size(data,1)
%calculate distances from point to centers
for k = 1:c
%holding variables
comparisonmatrix = [data(point,:);mu_k(k,:)];
dist(k) = pdist(comparisonmatrix);
end
%record location of minimum distance (location value will be between 1
%and k)
[minval, location] = min(dist);
%assign cluster number (analogous to location value)
assignment(point) = location;
end
%check convergence criteria
if isequal(assignment,hold_assignment)
break
end
%revise mu_k locations
%count number of each label
assignment_count = zeros(1,c);
for i = 1:size(data,1)
assignment_count(assignment(i)) = assignment_count(assignment(i)) + 1;
end
%compute centroids
point_total = zeros(size(mu_k));
for row = 1:size(data,1)
point_total(assignment(row),:) = point_total(assignment(row)) + data(row,:);
end
%move mu_k values to centroids
for center = 1:c
mu_k(center,:) = point_total(center,:)/assignment_count(center);
end
end
There are a lot of loops in there, so I feel that there's a lot of optimization to be made. However, I think I've just been staring at this code for far too long, so some fresh eyes could help. Please let me know if I need to clarify anything in the code block.
When the above code block is executed (in context) on the large dataset, it takes 3732.152 seconds, according to MATLAB's profiler, to make the full 25 iterations (I'm assuming it hasn't "converged" according to my criteria yet) for 150 clusters, but about 130 of them return NaNs (130 rows in mu_k).
Profiling will help, but the place to rework your code is to avoid the loop over the number of data points (for point = 1:size(data,1)). Vectorize that.
Inside your for iteration loop, here is a quick partial example:
[nPoints,nDims] = size(data);
% Calculate all high-dimensional distances at once
kdiffs = bsxfun(@minus,data,permute(mu_k,[3 2 1])); % NxDx1 - 1xDxK => NxDxK
distances = sum(kdiffs.^2,2); % no need to do sqrt
distances = squeeze(distances); % Nx1xK => NxK
% Find closest cluster center for each point
[~,ik] = min(distances,[],2); % Nx1
% Calculate the new cluster centers (mean the data)
mu_k_new = zeros(c,nDims);
clustersizes = zeros(c,1);
for i=1:c
    indk = ik==i;
    clustersizes(i) = nnz(indk);
    mu_k_new(i,:) = mean(data(indk,:),1); % mean over rows; also correct when a cluster has a single point
end
This isn't the only (or the best) way to do it, but it should be a decent example.
Some other comments:
Instead of using input, make this script into a function to efficiently handle input arguments.
If you want an easy way to specify a file, see uigetfile.
With many MATLAB functions, such as max, min, sum, mean, etc., you can specify a dimension over which the function should operate. This way you can run it on a matrix and compute values for multiple conditions/dimensions at the same time.
Once you get decent performance, consider iterating longer, specifically until the centers no longer change or the number of samples that change clusters becomes small.
The cluster with the smallest distance for each point, ik, is the same whether you use Euclidean or squared Euclidean distance, which is why the sqrt can be skipped.
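To illustrate the comment about the dimension argument, the per-dimension loops that compute the data bounds and the initial centers can be collapsed into a few vectorized lines. This is a sketch under assumptions: data is a random stand-in for the real matrix, each center coordinate is drawn independently and uniformly within the bounds (one possible initialization, not necessarily the one intended in the question), and implicit expansion (R2016b or later) is used.

```matlab
rng(1);
data = rand(200, 5);                 % hypothetical stand-in dataset (N x D)
c = 4;                               % number of clusters
nDims = size(data, 2);

% Column-wise bounds in one call each (dimension argument = 1)
rangemax = max(data, [], 1);         % 1 x D
rangemin = min(data, [], 1);         % 1 x D

% One random center per row, each coordinate within its own bounds
mu_k = rangemin + (rangemax - rangemin) .* rand(c, nDims);   % c x D
```

Every one of the c rows of mu_k receives a starting position this way, so no center is left at its preallocated zero value.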