Matching 2 matrices of different sizes (matlab) - matlab

I have 2 matrices (D:76572x2 and E:1850092x7) and want the values of rows in the larger matrix (E) if the the first two columns are equal to any row of the smaller matrix (D).
Example:
D = [1000 19751231;
1000 19761231]
E = [1234 19701130 4 5 2 9 3;
1000 19751231 2 3 2 5 2]
So in that case I only want the row: [1000 19751231 2 3 2 5 2] from matrix E. How can I compute this relatively quickly for a large matrix without using any/many (for-)loops?
Thanks

We can use the ismember function here
rows_E = ismember(E(:,1:2),D,'rows');
From your example:
>> E(rows_E,:)
Yields
ans =
1000 19751231 2 3 2 5 2

Related

How to split a matrix based on how close the values are?

Suppose I have a matrix A:
A = [1 2 3 6 7 8];
I would like to split this matrix into sub-matrices based on how relatively close the numbers are. For example, the above matrix must be split into:
B = [1 2 3];
C = [6 7 8];
I understand that I need to define some sort of criteria for this grouping so I thought I'd take the absolute difference of the number and its next one, and define a limit upto which a number is allowed to be in a group. But the problem is that I cannot fix a static limit on the difference since the matrices and sub-matrices will be changing.
Another example:
A = [5 11 6 4 4 3 12 30 33 32 12];
So, this must be split into:
B = [5 6 4 4 3];
C = [11 12 12];
D = [30 33 32];
Here, the matrix is split into three parts based on how close the values are. So the criteria for this matrix is different from the previous one though what I want out of each matrix is the same, to separate it based on the closeness of its numbers. Is there any way I can specify a general set of conditions to make the criteria dynamic rather than static?
I'm afraid, my answer comes too late for you, but maybe future readers with a similar problem can profit from it.
In general, your problem calls for cluster analysis. Nevertheless, maybe there's a simpler solution to your actual problem. Here's my approach:
First, sort the input A.
To find a criterion to distinguish between "intraclass" and "interclass" elements, I calculate the differences between adjacent elements of A, using diff.
Then, I calculate the median over all these differences.
Finally, I find the indices for all differences, which are greater or equal than three times the median, with a minimum difference of 1. (Depending on the actual data, this might be modified, e.g. using mean instead.) These are the indices, where you will have to "split" the (sorted) input.
At last, I set up two vectors with the starting and end indices for each "sub-matrix", to use this approach using arrayfun to get a cell array with all desired "sub-matrices".
Now, here comes the code:
% Sort input, and calculate differences between adjacent elements
AA = sort(A);
d = diff(AA);
% Calculate median over all differences
m = median(d);
% Find indices with "significantly higher difference",
% e.g. greater or equal than three times the median
% (minimum difference should be 1)
idx = find(d >= max(1, 3 * m));
% Set up proper start and end indices
start_idx = [1 idx+1];
end_idx = [idx numel(A)];
% Generate cell array with desired vectors
out = arrayfun(#(x, y) AA(x:y), start_idx, end_idx, 'UniformOutput', false)
Due to the unknown number of possible vectors, I can't think of way to "unpack" these to individual variables.
Some tests:
A =
1 2 3 6 7 8
out =
{
[1,1] =
1 2 3
[1,2] =
6 7 8
}
A =
5 11 6 4 4 3 12 30 33 32 12
out =
{
[1,1] =
3 4 4 5 6
[1,2] =
11 12 12
[1,3] =
30 32 33
}
A =
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3
out =
{
[1,1] =
1 1 1 1 1 1 1
[1,2] =
2 2 2 2 2 2
[1,3] =
3 3 3 3 3 3 3
}
Hope that helps!

Matlab: combining multiple matrices row-wise

I have some data in 10 matrices. Each matrix has a different number of rows, but the same number of columns.
I want to combine all 10 matrices to one matrix row-wise, interleaved, meaning the rows in that matrix will look like:
row 1 from matrix 0
...
row 1 from matrix 9
row 2 from matrix 0
...
row 2 from matrix 9
...
Example (with 3 matrices):
Matrix 1: [1 2 3 ; 4 5 6; 7 8 9]
Matrix 2: [3 2 1 ; 6 5 4]
Matrix 3: [1 1 1 ; 2 2 2 ; 3 3 3]
Combined matrix will be: [1 2 3 ; 3 2 1 ; 1 1 1 ; 4 5 6 ; 6 5 4 ; 2 2 2 ; 7 8 9 ; 3 3 3]
You can download the function interleave2 here https://au.mathworks.com/matlabcentral/fileexchange/45757-interleave-vectors-or-matrices
z = interleave2(a,b,c,'row')
you can see the way the function works in the source code of course
Here's a general solution that allows you to place however many matrices you want (with matching number of columns) into the starting cell array Result:
Result = {Matrix1, Matrix2, Matrix3};
index = cellfun(#(m) {1:size(m, 1)}, Result);
[~, index] = sort([index{:}]);
Result = vertcat(Result{:});
Result = Result(index, :);
This will generate an index vector 1:m for each matrix, where m is its number of rows. By concatenating these indices and sorting them, we can get a new index that can be used to sort the rows of the vertically-concatenated set of matrices so that they are interleaved.

MATLAB: sample from population randomly many times?

I am aware of MATLAB's datasample which allows to select k times from a certain population. Suppose population=[1,2,3,4] and I want to uniformly sample, with replacement, k=5 times from it. Then:
datasample(population,k)
ans =
1 3 2 4 1
Now, I want to repeat the above experiment N=10000 times without using a for loop. I tried doing:
datasample(repmat(population,N,1),5,2)
But the output I get is (just a short excerpt below):
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
Every row (result of an experiment) is the same! But obviously they should be different... It's as though some random seed is not updating between rows. How can I fix this? Or some other method I could use that avoids a for loop? Thanks!
You seem to be confusing the way datasample works. If you read the documentation on the function, if you specify a matrix, it will generate a data sampling from a selection of rows in the matrix. Therefore, if you simply repeat the population vector 10000 times, and when you specify the second parameter of the function - which in this case is how many rows of the matrix to extract, even though the actual row locations themselves are different, the actual rows over all of the matrix is going to be the same which is why you are getting that "error".
As such, I wouldn't use datasample here if it is your intention to avoid looping. You can use datasample, but you'd have to loop over each call and you explicitly said that this is not what you want.
What I would recommend you do is first create your population vector to have whatever you desire in it, then generate a random index matrix where each value is between 1 up to as many elements as there are in population. This matrix is in such a way where the number of columns is the number of samples and the number of rows is the number of trials. Once you create this matrix, simply use this to index into your vector to achieve the desired sampling matrix. To generate this random index matrix, randi is a fine choice.
Something like this comes to mind:
N = 10000; %// Number of trials
M = 5; %// Number of samples per trial
population = 1:4; %// Population vector
%// Generate random indices
ind = randi(numel(population), N, M);
%// Get the stuff
out = population(ind);
Here's the first 10 rows of the output:
>> out(1:10,:)
ans =
4 3 1 4 2
4 4 1 3 4
3 2 2 2 3
1 4 2 2 2
1 2 3 4 2
2 2 3 2 1
4 1 3 2 4
1 4 1 3 1
1 1 2 4 4
1 2 4 2 1
I think the above does what you want. Also keep in mind that the above code generalizes to any population vector you want. You simply have to change the vector and it will work as advertised.
datasample interprets each column of your data as one element of your population, sampling among all columns.
To fix this you could call datasample N times in a loop, instead I would use randi
population(randi(numel(population),N,5))
assuming your population is always 1:p, you could simplify to:
randi(p,N,5)
Ok so both of the current answers both say don't use datasample and use randi instead. However, I have a solution for you with datasample and arrayfun.
>> population = [1 2 3 4];
>> k = 5; % Number of samples
>> n = 1000; % Number of times to execute datasample(population, k)
>> s = arrayfun(#(k) datasample(population, k), n*ones(k, 1), 'UniformOutput', false);
>> s = cell2mat(s);
s =
1 4 1 4 4
4 1 2 2 4
2 4 1 2 1
1 4 3 3 1
4 3 2 3 2
We need to make sure to use 'UniformOutput', false with arrayfun as there is more than one output. The cell2mat call is needed as the result of arrayfun is a cell array.

dot product of matrix columns

I have a 4x8 matrix which I want to select two different columns of it then derive dot product of them and then divide to norm values of that selected columns, and then repeat this for all possible two different columns and save the vectors in a new matrix. can anyone provide me a matlab code for this purpose?
The code which I supposed to give me the output is:
A=[1 2 3 4 5 6 7 8;1 2 3 4 5 6 7 8;1 2 3 4 5 6 7 8;1 2 3 4 5 6 7 8;];
for i=1:8
for j=1:7
B(:,i)=(A(:,i).*A(:,j+1))/(norm(A(:,i))*norm(A(:,j+1)));
end
end
I would approach this a different way. First, create two matrices where the corresponding columns of each one correspond to a unique pair of columns from your matrix.
Easiest way I can think of is to create all possible combinations of pairs, and eliminate the duplicates. You can do this by creating a meshgrid of values where the outputs X and Y give you a pairing of each pair of vectors and only selecting out the lower triangular part of each matrix offsetting by 1 to get the main diagonal just one below the diagonal.... so do this:
num_columns = size(A,2);
[X,Y] = meshgrid(1:num_columns);
X = X(tril(ones(num_columns),-1)==1); Y = Y(tril(ones(num_columns),-1)==1);
In your case, here's what the grid of coordinates looks like:
>> [X,Y] = meshgrid(1:num_columns)
X =
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
Y =
1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8
As you can see, if we select out the lower triangular part of each matrix excluding the diagonal, you will get all combinations of pairs that are unique, which is what I did in the last parts of the code. Selecting the lower-part is important because by doing this, MATLAB selects out values column-wise, and traversing the columns of the lower-triangular part of each matrix gives you the exact orderings of each pair of columns in the right order (i.e. 1-2, 1-3, ..., 1-7, 2-3, 2-4, ..., etc.)
The point of all of this is that can then use X and Y to create two new matrices that contain the columns located at each pair of X and Y, then use dot to apply the dot product to each matrix column-wise. We also need to divide the dot product by the multiplication of the magnitudes of the two vectors respectively. You can't use MATLAB's built-in function norm for this because it will compute the matrix norm for matrices. As such, you have to sum over all of the rows for each column respectively for each of the two matrices then multiply both of the results element-wise then take the square root - this is the last step of the process:
matrix1 = A(:,X);
matrix2 = A(:,Y);
B = dot(matrix1, matrix2, 1) ./ sqrt(sum(matrix1.^2,1).*sum(matrix2.^2,1));
I get this for B:
>> B
B =
Columns 1 through 11
1 1 1 1 1 1 1 1 1 1 1
Columns 12 through 22
1 1 1 1 1 1 1 1 1 1 1
Columns 23 through 28
1 1 1 1 1 1
Well.. this isn't useful at all. Why is that? What you are actually doing is finding the cosine angle between two vectors, and since each vector is a scalar multiple of another, the angle that separates each vector is in fact 0, and the cosine of 0 is 1.
You should try this with different values of A so you can see for yourself that it works.
To make this code compatible for copying and pasting, here it is:
%// Define A here:
A = repmat(1:8, 4, 1);
%// Code to produce dot products here
num_columns = size(A,2);
[X,Y] = meshgrid(1:num_columns);
X = X(tril(ones(num_columns),-1)==1); Y = Y(tril(ones(num_columns),-1)==1);
matrix1 = A(:,X);
matrix2 = A(:,Y);
B = dot(matrix1, matrix2, 1) ./ sqrt(sum(matrix1.^2,1).*sum(matrix2.^2,1));
Minor Note
If you have a lot of columns in A, this may be very memory intensive. You can get your original code to work with loops, but you need to change what you're doing at each column.
You can do something like this:
num_columns = nchoosek(size(A,2),2);
B = zeros(1, num_columns);
counter = 1;
for ii = 1 : size(A,2)
for jj = ii+1 : size(A,2)
B(counter) = dot(A(:,ii), A(:,jj), 1) / (norm(A(:,ii))*norm(A(:,jj)));
counter = counter + 1;
end
end
Note that we can use norm because we're specifying vectors for each of the inputs into the function. We first preallocate a matrix B that will contain the dot products of all possible combinations. Then, we go through each pair of combinations - take note that the inner for loop starts from the outer most for loop index added with 1 so you don't look at any duplicates. We take the dot product of the corresponding columns referenced by positions ii and jj and store the results in B. I need an external counter so we can properly access the right slot to place our result in for each pair of columns.

Pairwise distance matrix containing strings

I need to calculate the pairwise distance between two matrix elements in a way that distance is equal to the number of binary differences between features/dimensions.
I want to do this with MATLAB codes without using a loop.
For example:
Assume I want to calculate the distance between instances in A and B:
A = [ 1 2 3 ; 2 3 4] % (two instances with three features)
B = [ 2 3 4 ; 2 5 6 ; 4 5 6] % (three instances with three features)
I need to calculate C, which would be a 2x3 matrix contain the distance of instances in A and B in a way that the distance between [1 3 3] and [2 3 4] would be 2: comparing the features, when a feature is equivalent, add 0 to distance and when they are dissimilar add 1 to distance.
So in this case,
C = [3 3 3; 0 2 3].
A and B may contain strings instead of numbers.
You can use bsxfun with #ne (not equal), followed by a sum to count the number of dissimilar features for an instance:
A = [1 2 3; 2 3 4];
B = [2 3 4; 2 5 6; 4 5 6];
C = squeeze(sum(bsxfun(#ne,A,permute(B,[3 2 1])),2))
C =
3 3 3
0 2 3
The above works by generating a logical array testing for equality of each feature for each instance pair via bsxfun(#ne,...). Then a sum is performed over dimension 2 to count the number of dissimilar features for each instance.
The function pdist2 with Hamming distance already does this for you:
pdist2(A,B,'hamming')
This gives the result as percentage of coordinates that differ. Since you want number instead of percentage, multiply by the number of columns:
pdist2(A,B,'hamming')*size(A,2)
ans =
3 3 3
0 2 3