How to calculate the similarity matrix of a very large dataset by dividing the data into patches in MATLAB? - matlab

I have a 367800x84 matrix where rows are instances and columns are dimensions. I am trying to calculate a similarity matrix, but it does not fit into memory when computed from the whole matrix at once. I have tried different pieces of code, but none of them worked. My idea is to divide the data into batches and build the result gradually. How can I apply such a method? Here is what I tried on the whole data: the first snippet is Euclidean distance, the second is cosine similarity.
function [C] = create_sim_matrix(x)
% pairwise Euclidean distances between the rows of x
Qx = repmat(dot(x, x, 2), 1, size(x, 1));
C = sqrt(Qx + Qx' - 2*x*x');
-----------------------------------------------------:
function [C] = create_sim_matrix(A)
% pairwise cosine similarity between the rows of A
normA = sqrt(sum(A .^ 2, 2));
temp = bsxfun(@rdivide, A * A', normA);
C = bsxfun(@rdivide, temp, normA');
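One way to proceed, not shown in the question's code, is to compute the result block by block. Keep in mind that the full 367800x367800 output is on the order of a terabyte in double precision, so each block has to be streamed to disk or otherwise reduced as soon as it is produced. The following is a minimal sketch of that idea for the Euclidean case; blockSize, outFile and the raw fwrite output format are placeholder choices, not anything from the original post:
function blockwise_euclidean(x, blockSize, outFile)
% Pairwise Euclidean distances between rows of x, computed in blocks and
% streamed to outFile instead of holding the full matrix in memory.
n = size(x, 1);
sq = sum(x.^2, 2);                            % squared norm of every row
fid = fopen(outFile, 'w');
for i = 1:blockSize:n
    ri = i:min(i+blockSize-1, n);             % rows of the current block
    for j = 1:blockSize:n
        rj = j:min(j+blockSize-1, n);
        % ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b, evaluated per block
        D2 = bsxfun(@plus, sq(ri), sq(rj)') - 2*(x(ri,:)*x(rj,:)');
        D = sqrt(max(D2, 0));                 % clamp round-off negatives
        fwrite(fid, D, 'double');
    end
end
fclose(fid);
end
For the cosine case, the same loop structure applies: normalize the rows of x once up front, and each block then reduces to x(ri,:)*x(rj,:)'.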

Related

How to sum 3d Matrix row by interval in Matlab?

I have a 36x256x2232 3-D matrix in MATLAB, created by M = ones(36,256,2232), and I want to reduce the size of the matrix by summing the rows in intervals of 3. The resulting matrix should be 12x256x2232 and each cell should have the value 3.
I tried using the reshape and sum functions, but I get a 1x256x2232 matrix.
How can I do this without using a for-loop?
This should do it:
M = ones(36,256,2232)
reduced = reshape(sum(reshape(M, 3,[], 256,2232), 1),[], 256, 2232);
reshape makes a 4-D matrix with the given intervals;
sum reduces it along the first dimension;
the second reshape transforms it back to 3-D.
You can also use squeeze, which removes singleton dimensions:
reduced = squeeze(sum(reshape(M, 3,[], 256,2232), 1));
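As a quick sanity check of the above on the example data from the question (just confirming the size and the values, nothing new functionally):
M = ones(36, 256, 2232);
reduced = squeeze(sum(reshape(M, 3, [], 256, 2232), 1));
size(reduced)      % 12   256   2232
reduced(1, 1, 1)   % 3, since each output element is the sum of three ones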
You can use the new-ish splitapply function (which is similar to accumarray but can handle data with multiple dimensions). This approach works even if the number of rows is not a multiple of the group size:
M = ones(4,5,2); % example data
n = 3; % group size
result = splitapply(@(x)sum(x,1), M, floor((0:size(M,1)-1).'/n)+1);

Interpolation of time series data in MATLAB

I have a multidimensional time series in MATLAB. Let's say it is of M dimensions and N samples, and as such I have it stored as an NxM matrix.
I want to interpolate the time series to fit a new length N1, where N is always less than N1.
In other words, if I have multiple time series (all sampled at the same rate, just of different lengths), I want to interpolate them all to be of the same length N1.
How can one achieve this with MATLAB?
EDIT: Could one achieve this with imresize?
i.e.:
A = randn(5,10) % 10 dimensions, 5 samples
desiredLength = 15; % we want 15 samples in length
newA = imresize(A, [desiredLength 10], 'bilinear');
A procedure like the following might do what you want. The new data will be a linear interpolation of the old data.
[initSize1, initSize2] = ndgrid(1:size(Data, 1), 1:size(Data, 2));
[newSize1, newSize2] = ndgrid(linspace(1, size(Data, 1), newlength), 1:size(Data, 2));
newData = interpn(initSize1, initSize2, Data, newSize1, newSize2);
As coded up, only dimension 1 should change, as the second gridded dimension is the same in the first and second calls to ndgrid.
If you have a timeseries object, you might also want to look at the resample method for the timeseries object:
http://www.mathworks.co.uk/help/matlab/ref/timeseries.resample.html
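As another option (not from the original answers), interp1 interpolates each column of a matrix in a single call, so a minimal sketch for the question's imresize example, assuming uniformly spaced samples, would be:
A = randn(5, 10);                                      % 10 dimensions, 5 samples
N = size(A, 1);
N1 = 15;                                               % desired number of samples
newA = interp1(1:N, A, linspace(1, N, N1), 'linear');  % N1-by-10 result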

Matlab Vectorization of Multivariate Gaussian Basis Functions

I have the following code for calculating the result of a linear combination of Gaussian functions. What I'd really like to do is to vectorize this somehow so that it's far more performant in MATLAB.
Note that y is a column vector (the output), x is a matrix where each column corresponds to a data point and each row corresponds to a dimension (i.e. 2 rows = 2-D), variance is a scalar double, gaussians is a matrix where each column is the mean point of one Gaussian, and weights is a row vector of the weights in front of each Gaussian. Note that the length of weights is one greater than the number of Gaussians, since weights(1) is the zeroth-order (bias) weight.
function [ y ] = CalcPrediction( gaussians, variance, weights, x )
    basisFunctions = size(gaussians, 2);
    xvalues = size(x, 2);
    if length(weights) ~= basisFunctions + 1
        ME = MException('TRAIN:CALC', 'The number of weights should be equal to the number of basis functions plus one');
        throw(ME);
    end
    y = weights(1) * ones(xvalues, 1);
    for xIdx = 1:xvalues
        for i = 1:basisFunctions
            diff = x(:, xIdx) - gaussians(:, i);
            y(xIdx) = y(xIdx) + weights(i+1) * exp(-(diff')*diff/(2*variance));
        end
    end
end
You can see that at the moment I simply iterate over the x vectors and then the Gaussians inside two for loops. I'm hoping that this can be improved; I've looked at meshgrid, but that seems to only apply to vectors (and I have matrices).
Thanks.
Try this
diffx = bsxfun(@minus, x, permute(gaussians, [1,3,2])); % binary operation with singleton expansion
diffx2 = squeeze(sum(diffx.^2, 1)); % squared distances, shape is now [XVALUES, BASISFUNCTIONS]
weight_col = weights(:); % make sure weights is a column vector
y = weight_col(1) + exp(-diffx2/2/variance)*weight_col(2:end); % add the zeroth-order weight; a column vector of length XVALUES
Note that I changed diff to diffx, since diff is a builtin. I'm not sure how much this will improve performance, as allocating the large intermediate arrays can offset the gains from vectorization.
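To sanity-check the vectorized version against the question's double loop, one can run both on small random data (the sizes below are arbitrary test values, not from the original post):
x = randn(2, 50);                 % 2-D data, 50 points (columns)
gaussians = randn(2, 5);          % 5 Gaussian centres (columns)
weights = randn(1, 6);            % weights(1) is the bias term
variance = 0.7;
y_loop = CalcPrediction(gaussians, variance, weights, x);   % original loop version
diffx = bsxfun(@minus, x, permute(gaussians, [1,3,2]));
diffx2 = squeeze(sum(diffx.^2, 1));
w = weights(:);
y_vec = w(1) + exp(-diffx2/2/variance)*w(2:end);
max(abs(y_vec - y_loop))          % should be on the order of machine precision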

Comparing two sets of vectors

I've got matrices A and B
size(A) = [n x]; size(B) = [n y];
Now I need to compute the Euclidean distance of each column vector of A from each column vector of B. I'm using the dist function right now:
Q = dist([A B]); Q = Q(1:x, x+1:end);
But this also does a lot of needless work (like calculating the distances within A and within B, which I don't need).
What is the best way to calculate this?
You are looking for pdist2.
% Compute the ordinary Euclidean distance
D = pdist2(A.',B.','euclidean'); % euclidean distance
You should take the transpose of the matrices since pdist2 assumes the observations are in rows, not in columns.
An alternative solution to pdist2, if you don't have the Statistics Toolbox, is to compute this manually. For example, one way to do it is:
[X, Y] = meshgrid(1:size(A, 2), 1:size(B, 2)); % or meshgrid(1:x, 1:y)
Q = sqrt(sum((A(:, X(:)) - B(:, Y(:))) .^ 2, 1));
The indices of the columns from A and B for each value in vector Q can be obtained by computing:
[X(:), Y(:)]
where each row contains a pair of indices: the first is the column index in matrix A, and the second is the column index in matrix B.
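If the full x-by-y distance matrix (the same layout pdist2 returns above) is more convenient than the flat vector Q, it can be recovered with a single reshape; a small sketch:
Qmat = reshape(Q, size(B, 2), size(A, 2)).'; % x-by-y, Qmat(i,j) = distance between A(:,i) and B(:,j)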
Another solution if you don't have pdist2 and which may also be faster for very large matrices is to vectorize the following mathematical fact:
||x-y||^2 = ||x||^2 + ||y||^2 - 2*dot(x,y)
where ||a|| is the L2-norm (euclidean norm) of a.
Comments:
C = -2*A'*B (this is an x-by-y matrix) is the vectorization of the dot-product terms.
||x-y||^2 is the square of the euclidean distance which you are looking for.
Is that enough or do you need the explicit code?
The reason this may be faster in practice is that you avoid an explicit per-pair metric calculation over all x*y comparisons and instead make the bottleneck a matrix multiplication (matrix multiplication is highly optimized in MATLAB). You are taking advantage of the fact that this is the Euclidean distance and not just some arbitrary metric.
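For reference, here is a minimal sketch of the explicit code for the identity above (one possible spelling, with round-off clamped before the square root):
nA = sum(A.^2, 1);                       % 1-by-x squared norms of the columns of A
nB = sum(B.^2, 1);                       % 1-by-y squared norms of the columns of B
D2 = bsxfun(@plus, nA.', nB) - 2*(A'*B); % x-by-y matrix of squared distances
D = sqrt(max(D2, 0));                    % clamp small negatives caused by round-off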

How do I create a similarity matrix in MATLAB?

I am working towards comparing multiple images. I have the image data as column vectors of a matrix called "images." I want to assess the similarity of the images by first computing their Euclidean distance. I then want to create a matrix over which I can execute multiple random walks. Right now, my code is as follows:
% clear
% clc
% close all
%
% load tea.mat;
images = Input.X;
M = zeros(size(images, 2), size (images, 2));
for i = 1:size(images, 2)
for j = 1:size(images, 2)
normImageTemp = sqrt((sum((images(:, i) - images(:, j))./256).^2));
%Need to accurately select the value of gamma_i
gamma_i = 1/10;
M(i, j) = exp(-gamma_i.*normImageTemp);
end
end
My matrix M, however, ends up having a value of 1 along its main diagonal and zeros elsewhere. I'm expecting "large" values for the first few elements of each row and "small" values for elements with column index > 4. Could someone please explain what is wrong? Any advice is appreciated.
Since you're trying to compute a Euclidean distance, it looks like you have an error in where your parentheses are placed when you compute normImageTemp. You have this:
normImageTemp = sqrt((sum((...)./256).^2));
%# ^--- Note that this parenthesis...
But you actually want to do this:
normImageTemp = sqrt(sum(((...)./256).^2));
%# ^--- ...should be here
In other words, you need to perform the element-wise squaring first, then the summation, then the square root. What you are doing now is summing the elements first, then squaring and taking the square root of that sum; the squaring and the square root essentially cancel each other out (it is equivalent to just taking the absolute value of the sum).
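A tiny numeric example of the difference (not from the original answer):
v = [3 -4];
sqrt(sum(v).^2)   % 1 -- the square and square root cancel; this is just abs(sum(v))
sqrt(sum(v.^2))   % 5 -- the actual Euclidean norm of v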
Incidentally, you can actually use the function NORM to perform this operation for you, like so:
normImageTemp = norm((images(:, i) - images(:, j))./256);
The results you're getting seem reasonable. Recall the behavior of exp(-x): when x is zero, exp(-x) is 1; when x is large, exp(-x) approaches zero.
Perhaps if you make M(i,j) = normImageTemp; you'd see what you expect to see.
Consider this solution:
I = Input.X;
D = squareform( pdist(I') ); %# euclidean distance between columns of I
M = exp(-(1/10) * D); %# similarity matrix between columns of I
PDIST and SQUAREFORM are functions from the Statistics Toolbox.
Otherwise consider this equivalent vectorized code (using only built-in functions):
%# we know that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v
X = sum(I.^2,1);
D = real( sqrt(bsxfun(@plus,X,X')-2*(I'*I)) );
M = exp(-(1/10) * D);
As was explained in the other answers, D is the distance matrix, while exp(-D) is the similarity matrix (which is why you get ones on the diagonal)
There is an already implemented function, pdist. If you have a matrix A (observations in rows), you can directly do:
D = squareform(pdist(A));
Note that this gives the distance matrix; apply a kernel such as exp(-gamma*D), as in the answer above, to turn it into a similarity matrix.