Removing Similar Elements in Matrix - matlab

I'm trying to figure out how to remove an element of a matrix in MATLAB if it differs from any of the other elements by 0.01. I'm supposed to be using all of the unique elements of the matrix as thresholding values for a ROC curve that I'm creating but I need a way to remove values when they are within 0.01 of each other (since we are assuming they are basically equal if this is true).
Any help would be greatly appreciated!
Thanks!

If you are simply trying to remove adjacent values within that tolerance from a vector, I would start with something like this:
roc = ...
tolerance = 0.01;
idx = [true, diff(roc) > tolerance];  % keep the first value and every value farther than the tolerance from its predecessor
rocReduced = roc(idx);
'rocReduced' is now a vector containing all values that didn't have an adjacent value within the tolerance in the original vector.
This approach has two distinct limitations:
The original 'roc' vector must be monotonic.
No more than two items in a row may be within the tolerance; otherwise everything after the first value in the run is removed, even values that differ from it by more than the tolerance (see the example below).
I suspect the above would not be sufficient. That said, I can't think of any simple operations that overcome those (and other) limitations while still using vectorized matrix operations.
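For example (my numbers, not the asker's), take three values whose adjacent gaps are each below the tolerance, even though the first and last differ by more than it:
roc = [1.000 1.005 1.011];
idx = [true, diff(roc) > 0.01];   % gives [1 0 0]
rocReduced = roc(idx);            % keeps only 1.000; 1.011 is lost as well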
If performance is not a huge issue, maybe the following iterative algorithm would suit your application:
roc = ...
tolerance = 0.01;
mask = true(size(roc));  % Start with all points
last = 1;                % Always take the first point
for i = 2:length(roc)    % For all remaining points,
    if abs(roc(i) - roc(last)) < tolerance
        % within tolerance of the last accepted point: drop it from the mask
        mask(i) = false;
    else
        % otherwise keep it and mark it as the last accepted point
        last = i;
    end
end
rocReduced = roc(mask);
This handles multiple consecutive sub-tolerance gaps without throwing the whole run away: with roc = [1.000 1.005 1.011] from the example above, it keeps both 1.000 and 1.011, because each point is compared against the last accepted value. It also handles non-monotonic sequences.
MATLAB users sometimes shy away from iterative solutions (versus vectorized matrix operations), but it's not always worth the trouble of finding a more elegant solution when brute force meets your needs.

Let all the elements in your matrix form a graph G = (V,E) such that there is an edge between two vertices (u,v) if the difference between them is less than 0.01. Now construct an adjacency matrix for this graph, find the vertex with the largest degree, remove it and add its value to a list, remove its neighbors from the graph, and repeat until no vertices are left.
CODE:
%% Toy dataset
M = [1 1.005 2; 2.005 2.009 3; 3.01 3.001 3.005];
M = M(:);
A = false(numel(M), numel(M));    % adjacency matrix (self-loops included)
for i = 1:numel(M)
    ind = abs(M - M(i)) <= 0.01;  % all elements within 0.01 of M(i)
    A(i, ind) = true;
end
C = [];
while any(A(:))
    [~, ind] = max(sum(A, 1));   % vertex with the largest degree
    C(end+1) = M(ind);           % keep its value
    nbrs = A(ind, :);            % its neighbors (itself included)
    A(nbrs, :) = false;          % remove them from the graph: clear
    A(:, nbrs) = false;          % both their rows and their columns
end
This has a runtime of O(n^2) where your matrix has n elements. Yeah it's slow.

From your description, it's not very clear how you want to handle a chain of values (as pointed out in the comments already), e.g. 0.0 0.05 0.1 0.15 ..., and what you actually mean by removing the elements from the matrix: set them to zero, remove the entire column, remove the entire row?
For a vector, it could look like this (similar to Adam's solution):
roc = ...
tolerance = 0.01;
% sort it first to get the similar values in a row
[rocSorted, sortIdx] = sort(roc);
% find the differing values and get their indices
idx = [true; diff(rocSorted) > tolerance];
sortIdxReduced = sortIdx(idx);
% select only the relevant parts from the original vector (revert sorting)
rocReduced = roc(sort(sortIdxReduced));
The code is untested, but it should hopefully work.

Before applying a threshold or tolerance to drop values that are too close together, you can use MATLAB's built-in unique() to shrink the input. MATLAB usually accelerates its built-ins, so try to use as many of them as possible.
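A minimal sketch of that idea (variable names are mine): unique() both sorts and removes exact duplicates, which shrinks the input before the tolerance-based filtering from the answers above.
vals = unique(M(:));               % sorted column vector, exact duplicates removed
keep = [true; diff(vals) > 0.01];  % same adjacent-difference filter as above
reduced = vals(keep);
On R2015a and newer, uniquetol can handle the tolerance directly; passing 'DataScale', 1 makes the tolerance absolute rather than relative:
reduced = uniquetol(M(:), 0.01, 'DataScale', 1);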

Related

How to efficiently create a non-linear mask using two boundary arrays

I make a non-linear mask based on two arrays containing the lower and upper boundaries of the mask. All values in between need to be set to 1. The way I do this now seems to take quite a lot of time, and it is becoming a bottleneck. I was wondering if there is a way to do it more time-efficiently.
First, I was thinking of solving it using parfor to increase the speed. But since this is one of the inner loops in my code, that seems highly inefficient: considering the scheduling overhead, it is more feasible to use parfor on the outer loop. So parallel techniques are not an option.
See here the creation of the mask:
mask = zeros(size(im));
n = length(bufLow);
for i = 1:n
    mask(bufLow(i):bufHigh(i), i) = 1;
end
im is a matrix of a certain size, and bufLow and bufHigh are arrays whose length equals the horizontal size of im, describing the lower and upper boundaries for each column of im. Everything in between these values needs to be set to 1.
So the goal is to reduce the execution time of this loop as much as possible. I was wondering if there is somebody with the knowledge to enlighten me.
Bests,
Matthijs
I admit that your question allows for some interpretation and guesswork, but from the code you provided, I have an idea of what you want to achieve: for the i-th column in your mask, you want to set all pixels between a start index (that would be bufLow(i)) and an end index (bufHigh(i)) to 1. Is that correct?
So, my idea to "vectorize" your loop is to translate the per-column subscript (or array) indices in your bufxxx arrays into "image" linear indices, and then find all linear indices between the start and end indices. The latter is a common problem which already has several significant answers, like this one from Divakar.
I incorporated his answer in my solution. Please see the following code:
dim = 25;
bufLow = int32(10 * rand(1, dim) + 1);
bufHigh = int32(10 * rand(1, dim) + 15);
% Reference implementation from question
mask = zeros(dim);
n = length(bufLow);
for i = 1:n
    mask(bufLow(i):bufHigh(i), i) = 1;
end
% Show mask
figure(1);
imshow(mask);
% Implementation using Divakar's approach
% Translate subscript indices to linear indices
% (int32 cast needed: integer arrays combine only with scalar doubles)
bufLow = bufLow + int32(dim .* (0:dim-1));
bufHigh = bufHigh + int32(dim .* (0:dim-1));
% Divakar's approach for finding all indices between two boundaries
lens = bufHigh - bufLow + 1;
shift_idx = cumsum(lens(1:end-1)) + 1;
id_arr = ones(1, sum(lens));
id_arr([1 shift_idx]) = [bufLow(1) bufLow(2:end) - bufHigh(1:end-1)];
out = cumsum(id_arr);
% Generating mask
mask2 = zeros(dim);
mask2(out) = 1;
% Show mask
figure(2);
imshow(mask2);
The resulting masks are identical.
To have a look at the performance, I set up a separate timing script running both approaches on increasing dimension dim, from 25 to 2500 in steps of 25.
Hope that helps!

Finding roots in data

I have data that looks like this: curves of the same process, but with different parameters (plotted as a blue and a red curve).
I need to find the index (or x value) for certain y values (say, 10).
For the blue curve, this is easy: I'm using min to find the index:
[~, idx] = min(abs(y - target));
where y denotes the data and target the wanted value.
This approach works fine since I know that there is an intersection, and only one.
Now what to do with the red curve? I don't know beforehand if there will be two intersections, so my idea of finding the first one and then stripping some of the data is not feasible.
How can I solve this?
Please note that the curves can shift in the x direction, so checking the found solution for its x range is not really an option (it could work for the data I have, but since there are more to come, this solution is probably not the best).
Shameless steal from here:
function x0 = data_zeros(x, y)
    % Indices of approximate zero-crossings
    % (you can also use your own 'find' method here, although it has
    % this pesky difference of 1-missing-element because of diff...)
    dy = find(y(:) .* circshift(y(:), [-1 0]) <= 0);
    % Do linear interpolation at the near-zero-crossings
    x0 = NaN(size(dy, 1) - 1, 1);
    for k1 = 1:size(dy, 1) - 1
        b = [[1; 1] [x(dy(k1)); x(dy(k1)+1)]] \ ...
            [y(dy(k1)); y(dy(k1)+1)];
        x0(k1) = -b(1) / b(2);
    end
end
Usage:
% Some data
x = linspace(0, 2*pi, 1e2);
y = sin(x);
% Find zeros
xz = data_zeros(x, y);
% Plot original data and zeros found
figure(1), hold on
plot(x, y);
plot(xz, zeros(size(xz)), '+r');
axis([0,2*pi -1,+1]);
The gist: multiply each data point with its consecutive neighbor. Wherever that product is negative, the two points have opposite signs, which gives you the approximate location of a zero. Then use linear interpolation between the same two points to get a more precise answer, and store that.
NOTE: for zeros exactly at the endpoints, this approach will not work. Therefore, it may be necessary to check those manually.
Subtract the desired number from your curve, i.e. if you want the values at 10, do data - 10, then use an equality-within-tolerance test, something like
TOL = 1e-4;
IDX = 1:numel(data(:,1)); % Assuming you have column data
IDX = IDX(abs(data-10)<=TOL);
where logical indexing has been used.
I figured out a way: The answer by b3 in this question did the trick.
idx = find(diff(y > target));
Easy as can be :) The exact x value can then be found by interpolation. For me, this is fine since I don't need exact values.
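For anyone who does need exact values, a minimal interpolation sketch (assuming x is the vector of x values and idx comes from the line above; my code, untested):
x0 = x(idx) + (target - y(idx)) .* (x(idx+1) - x(idx)) ./ (y(idx+1) - y(idx));
This linearly interpolates between the two samples that bracket each crossing.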

Vectorize function that finds an array of nearest values

I am still wrapping my head around vectorization and I'm having a difficult time trying to vectorize the following function I made...
for i = 1:size(X, 1)
    min_n = inf;
    for j = 1:K
        val = X(i,:)' - centroids(j,:)';
        diff = val' * val;   % squared Euclidean distance
        if diff < min_n
            idx(i) = j;
            min_n = diff;
        end
    end
end
X is an array of (x,y) coordinates...
2 5
5 6
...
...
centroids in this example is limited to 3 rows. It is also in (x,y) format as shown above.
For every pair in X I am computing the closest pair of centroids. I then store the index of the centroid in idx.
So idx(i) = j means that I am storing the index j of the centroid at position i, where i corresponds to the row of X. This means the closest centroid to the pair X(i, :) is the one at idx(i).
Can I possibly simplify this via vectorization? I struggle with just vectorizing the inner loop.
Here are three options. But please note that the disadvantage of vectorization, as compared to your double loops, is that it stores all the difference operation results at once, which means that if your matrices have many rows, you might run out of memory. On the other hand, the vectorized approach is probably much faster.
Option 1
If you have access to the Statistics and Machine Learning Toolbox, you can use the function pdist2 to get all the pairwise distances between the rows of two matrices. Then, the min function gives you the minimum of each column of the result. Its first output holds the minimal values, and its second the indices, which is what you need for idx:
diff = pdist2(centroids,X);
[~,idx] = min(diff);
Option 2
If you don't have access to the toolbox, you can use bsxfun. This lets you compute the difference between the two matrices even when their dimensions don't agree. All you need to do is use shiftdim to reshape X' to size [1, size(X,2), size(X,1)]; then reshapedX and centroids have compatible dimensions (see the documentation of bsxfun), which lets you take the difference of their values. The result is a three-dimensional array, which you sum along the second dimension to get the squared norm of the differences between rows (the square root is unnecessary, since it does not change which centroid is closest). At this point you can proceed as in option 1.
reshapedX = shiftdim(X',-1);
diff = bsxfun(@minus, centroids, reshapedX);
diff = squeeze(sum(diff.^2,2));
[~,idx] = min(diff);
Note: Starting with MATLAB R2016b, implicit expansion is applied automatically and you no longer need to call bsxfun, so that line can be replaced with the simpler diff = centroids - reshapedX.
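With implicit expansion, Option 2 collapses to the following (same computation, just without bsxfun):
reshapedX = shiftdim(X', -1);      % 1-by-2-by-n
diff = centroids - reshapedX;      % K-by-2-by-n via implicit expansion
diff = squeeze(sum(diff.^2, 2));   % K-by-n squared distances
[~,idx] = min(diff);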
Option 3
Use the function dsearchn, which performs exactly what you need:
idx = dsearchn(centroids,X);
It could also be done using pdist2 - pairwise distances between rows of two matrices:
% random data
X = rand(500,2);
centroids = rand(3,2);
% pairwise distances
D = pdist2(X,centroids);
% closest centroid index for each X coordinates
[~,idx] = min(D,[],2);
% plot
scatter(centroids(:,1),centroids(:,2),300,(1:size(centroids,1))','filled');
hold on;
scatter(X(:,1),X(:,2),30,idx);
legend('Centroids','data');

Computing a moving average

I need to compute a moving average over a data series within a for loop. I have to get the moving average over N = 9 days. The array I'm computing on holds 4 series of 365 values (M), which are themselves mean values of another set of data. I want to plot the mean values of my data together with the moving average in one plot.
I googled a bit about moving averages and the conv command and found something which I tried implementing in my code:
hold on
for ii = 1:4
    M = mean(C{ii}, 2);
    wts = [1/24; repmat(1/12, 11, 1); 1/24];
    Ms = conv(M, wts, 'valid');
    plot(M)
    plot(Ms, 'r')
end
hold off
So basically, I compute my mean and plot it with a (wrong) moving average. I picked the wts value right off the MathWorks site, so it is incorrect for my case (source: http://www.mathworks.nl/help/econ/moving-average-trend-estimation.html). My problem, though, is that I do not understand what this wts is. Could anyone explain? If it has something to do with the weights of the values, then that is invalid in this case: all my values are weighted the same.
And if I am doing this entirely wrong, could I get some help with it?
My sincerest thanks.
There are two more alternatives:
1) filter
From the doc:
You can use filter to find a running average without using a for loop.
This example finds the running average of a 16-element vector, using a
window size of 5.
data = (1:0.2:4)';
windowSize = 5;
filter(ones(1, windowSize)/windowSize, 1, data)
2) smooth as part of the Curve Fitting Toolbox (which is available in most cases)
From the doc:
yy = smooth(y) smooths the data in the column vector y using a moving
average filter. Results are returned in the column vector yy. The
default span for the moving average is 5.
% Create noisy data with outliers:
x = 15*rand(150,1);
y = sin(x) + 0.5*(rand(size(x))-0.5);
y(ceil(length(x)*rand(2,1))) = 3;
% Smooth the data using the loess and rloess methods with a span of 10%:
yy1 = smooth(x,y,0.1,'loess');
yy2 = smooth(x,y,0.1,'rloess');
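For the plain moving average described in the doc excerpt, a single call should do; here with a span of 9 to match the question's nine-day window (my sketch, untested):
Ms = smooth(M, 9);   % default 'moving' method: a centered moving average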
In R2016a, MATLAB added the movmean function, which calculates a moving average:
N = 9;
M_moving_average = movmean(M,N)
Using conv is an excellent way to implement a moving average. In the code you are using, wts is how much you are weighting each value (as you guessed). The sum of that vector should always be equal to one. If you wish to weight each value evenly and do a size-N moving filter, then you would want to do
N = 7;
wts = ones(N,1)/N;
sum(wts) % result = 1
Using the 'valid' argument in conv will result in fewer values in Ms than you have in M. Use 'same' if you don't mind the effects of zero padding (see the size check at the end of this answer). If you have the Signal Processing Toolbox, you can use cconv if you want to try a circular moving average. Something like
N = 7;
wts = ones(N,1)/N;
cconv(M, wts, numel(M));   % period equal to the signal length, so the output matches M
should work.
You should read the conv and cconv documentation for more information if you haven't already.
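To make the 'valid' versus 'same' difference above concrete, here is a quick size check (my example, assuming a 365-element series as in the question):
M = rand(365, 1);
wts = ones(9, 1) / 9;
size(conv(M, wts, 'valid'))   % 357-by-1: only fully overlapping windows
size(conv(M, wts, 'same'))    % 365-by-1: input length kept, ends zero-padded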
I would use this:
% does moving average on signal x, window size is w
function y = movingAverage(x, w)
    k = ones(1, w) / w;
    y = conv(x, k, 'same');
end
ripped straight from here.
To comment on your current implementation: wts is the weighting vector, which, from the MathWorks example, implements a 13-point average in which the first and last points are weighted at half of the rest.
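A quick sanity check of that description, using the weights from the question's code:
wts = [1/24; repmat(1/12, 11, 1); 1/24];
numel(wts)   % 13 points
sum(wts)     % 1, with the endpoints (1/24) at half the interior weight (1/12)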

How do I create a similarity matrix in MATLAB?

I am working towards comparing multiple images. I have these image data as column vectors of a matrix called "images." I want to assess the similarity of images by first computing their Euclidean distance. I then want to create a matrix over which I can execute multiple random walks.
% clear
% clc
% close all
%
% load tea.mat;
images = Input.X;
M = zeros(size(images, 2), size(images, 2));
for i = 1:size(images, 2)
    for j = 1:size(images, 2)
        normImageTemp = sqrt((sum((images(:, i) - images(:, j))./256).^2));
        % Need to accurately select the value of gamma_i
        gamma_i = 1/10;
        M(i, j) = exp(-gamma_i.*normImageTemp);
    end
end
My matrix M however, ends up having a value of 1 along its main diagonal and zeros elsewhere. I'm expecting "large" values for the first few elements of each row and "small" values for elements with column index > 4. Could someone please explain what is wrong? Any advice is appreciated.
Since you're trying to compute a Euclidean distance, it looks like you have an error in where your parentheses are placed when you compute normImageTemp. You have this:
normImageTemp = sqrt((sum((...)./256).^2));
%# ^--- Note that this parenthesis...
But you actually want to do this:
normImageTemp = sqrt(sum(((...)./256).^2));
%# ^--- ...should be here
In other words, you need to perform the element-wise squaring, then the summation, then the square root. What you are doing now is summing elements first, then squaring and taking the square root of the summation, which essentially cancel each other out (or are actually the equivalent of just taking the absolute value).
Incidentally, you can actually use the function NORM to perform this operation for you, like so:
normImageTemp = norm((images(:, i) - images(:, j))./256);
The results you're getting seem reasonable. Recall the behavior of exp(-x): when x is zero, exp(-x) is 1; when x is large, exp(-x) approaches zero.
Perhaps if you make M(i,j) = normImageTemp; you'd see what you expect to see.
Consider this solution:
I = Input.X;
D = squareform(pdist(I'));   % euclidean distance between columns of I
M = exp(-(1/10) * D);        % similarity matrix between columns of I
PDIST and SQUAREFORM are functions from the Statistics Toolbox.
Otherwise consider this equivalent vectorized code (using only built-in functions):
% we know that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v
X = sum(I.^2, 1);
D = real(sqrt(bsxfun(@plus, X, X') - 2*(I'*I)));
M = exp(-(1/10) * D);
As was explained in the other answers, D is the distance matrix, while exp(-D) is the similarity matrix (which is why you get ones on the diagonal).
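A tiny illustration of that point (my numbers): a distance of zero maps to a similarity of 1, and larger distances decay toward zero.
exp(-[0 0.5 2])   % ans = 1.0000  0.6065  0.1353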
There is an already implemented function, pdist. If you have a matrix A, you can directly do
Sim = squareform(pdist(A));