locating common elements in 2D arrays - matlab

I have two very long 2D lists called "first_data*" and "second_data", and I would like to locate the elements that are equal and place them in the list "final_data". I have a MWE here:
first_data = [1 2; 3 4]';
second_data = [1 2; 9 4]';
final = [];
for i=1:length(first_data(:, 1))
for j=1:length(second_data(:, 1))
if(first_data(i, 2) == second_data(j, 2))
final = [final first_data(i, 1)];
end
end
end
This gives me 2, as desired. This works, but it is very computationally intensive for very large data sets. Is there a more efficient way to write the above code?

ismember will allow you to do what you want.
%# identify the rows in first_data's second column
%# that occur in second_data's second column
goodIdx = ismember(first_data(:,2),second_data(:,2));
%# return the corresponding values of first_data's first column
final = first_data(goodIdx,1);

Try this code:
first_data = [1 2; 3 4]';
second_data = [1 2; 9 4]';
diff_data=first_data-second_data;
Ind=find(diff_data==0);
final=first_data(Ind);

use:
[c, ia, ib] = intersect(first_data(:, 2), second_data(:, 2));
final = second_data(ib,1);
note: I haven't tested this, but it should work (at least up to row/column mixups)

So your two datasets can be completely described by the following tuples?
"first_data" described by (i,j,data_1) - which indicates that the value data_1 is at row i, column j
"second_data" described by (i,j,data_2)
And you want to find such tuples that are equal?
Convert "first_data", and "second_data" to sets using a Bloom Filter
Perform set intersection using the Bloom Filter representation of "first_data" and "second_data" and obtain essentially constant time set intersection using the bitwise AND of the two representations.

Related

MATLAB. How to remove rows, if any of the values in the row is found in another row?

I have a matrix as shown in the image. In this matrix if any of the values in one row is found in another row we remove the shorter row. For example row 2 to row 5 all contain 3, therefore I want to keep only row 5(the row with most non-zero values) and remove all other rows...please suggest a solution.
Thanks
I believe the below code should work. The idea is to sort the matrix first according to the number of elements in the rows, then loop and remove the rows that have matches. Probably not the most efficient code but should work in principle.. see the comments for more explanation
% generating the data
M = zeros(6, 10);
M(2,1:3) = [3 8 10];
M(3,1:4) = [3 8 10 9];
M(4,1:5) = [3 8 10 9 7];
M(5,1:6) = [3 8 10 9 7 4];
M(6,1) = [5];
% sorting according to the number of non-zero elements
nr_of_nonzero = sum(M~=0, 2);
[~, sort_indices] = sort(nr_of_nonzero);
M_sorted = M(sort_indices,:);
M_sorted(M_sorted==0)=NaN; % should not compare 0s (?)
% get rid of the matches
for i=1:size(M_sorted, 1)-1
for j=(i+1):size(M_sorted, 1)
[C,ia,ib] = intersect(M_sorted(i,:),M_sorted(j,:));
if numel(C)>0
M_sorted(i,:) = NaN;
end
break;
end
end
% reorder
M(sort_indices,:) = M_sorted;
% remove all NaN rows
M(all(isnan(M),2),:) = [];
% back to 0s
M(isnan(M)) = 0;
I'm not doing all the code here, but here's the steps that I would take to solve it. You will likely have to try different ways of doing it to obtain the intended result (i.e. vector operations, while loop, for loop, etc.).
Problem
Rows are repetitive and need to be reduced in a more compact form.
Solution
Look up mat2str.
Convert your vectors (rows) to strings. This can be done with temporary values like tmpstr1 = mat2str(yourMatrix(rowToBeCompared, :));
Parse the first string from beginning to end, while parsing the second string in the same way to make comparisons.
use strcmp to see if the string characters (or strings themselves) are the same: http://www.mathworks.com/help/matlab/ref/strcmp.html
Delete a row if you find it appropriate with yourMatrix[rowToDelete, : ] = [];
Try that and see if it works.
Note - Expansion of step 3:
if we have variable a = '[ab+11]';, we can select individual characters from the string like:
a(4)
ans = '+'
a(5)
ans = '1'
a(1)
and = '['
Therefore, you can parse the string with a loop:
for n = 1 : length(a)
if a(n) == '1' || a(n) == '0'
str(n) = a(n);
end
end
Like Sardar Usama said, it's helpful to provide the code so that we can copy and paste into our own MATLAB workspaces.

Efficient aggregation of high dimensional arrays

I have a 3 dimensional (or higher) array that I want to aggregate by another vector. The specific application is to take daily observations of spatial data and average them to get monthly values. So, I have an array with dimensions <Lat, Lon, Day> and I want to create an array with dimensions <Lat, Lon, Month>.
Here is a mock example of what I want. Currently, I can get the correct output using a loop, but in practice, my data is very large, so I was hoping for a more efficient solution than the second loop:
% Make the mock data
A = [1 2 3; 4 5 6];
X = zeros(2, 3, 9);
for j = 1:9
X(:, :, j) = A;
A = A + 1;
end
% Aggregate the X values in groups of 3 -- This is the part I would like help on
T = [1 1 1 2 2 2 3 3 3];
X_agg = zeros(2, 3, 3);
for i = 1:3
X_agg(:,:,i) = mean(X(:,:,T==i),3);
end
In 2 dimensions, I would use accumarray, but that does not accept higher dimension inputs.
Before getting to your answer let's first rewrite your code in a more general way:
ag = 3; % or agg_size
X_agg = zeros(size(X)./[1 1 ag]);
for i = 1:ag
X_agg(:,:,i) = mean(X(:,:,(i-1)*ag+1:i*ag), 3);
end
To avoid using the for loop one idea is to reshape your X matrix to something that you can use the mean function directly on.
splited_X = reshape(X(:), [size(X_agg), ag]);
So now splited_X(:,:,:,i) is the i-th part
that contains all the matrices that should be aggregated which is X(:,:,(i-1)*ag+1:i*ag)) (like above)
Now you just need to find the mean in the 3rd dimension of splited_X:
temp = mean(splited_X, 3);
However this results in a 4D matrix (where its 3rd dimension size is 1). You can again turn it into 3D matrix using reshape function:
X_agg = reshape(temp, size(X_agg))
I have not tried it to see how much more efficient it is, but it should do better for large matrices since it doesn't use for loops.

Creating a vector with random sampling of two vectors in matlab

How does one create a vector that is composed of a random sampling of two other vectors?
For example
Vector 1 [1, 3, 4, 7], Vector 2 [2, 5, 6, 8]
Random Vector [random draw from vector 1 or 2 (value 1 or 2), random draw from vector 1 or 2 (value 3 or 5)... etc]
Finally, how can one ask matlab to repeat this process n times to draw a distribution of results?
Thank you,
There are many ways you could do this. One possibility is:
tmp=round(rand(size(vector1)))
res = tmp.*vector1 + (1-tmp).*vector2
To get one mixed sample, you may use the idea of the following code snippet (not the optimal one, but maybe clear enough):
a = [1, 3, 4, 7];
b = [2, 5, 6, 8];
selector = randn(size(a));
sample = a.*(selector>0) + b.*(selector<=0);
For n samples put the above code in a for loop:
for k=1:n
% Sample code (without initial "samplee" assignments)
% Here do stuff with the sample
end;
More generally, if X is a matrix and for each row you want to take a sample from a column chosen at random, you can do this with a loop:
y = zeros(size(X,1),1);
for ii = 1:size(X,1)
y(ii) = X(ii,ceil(rand*size(X,2)));
end
You can avoid the loop using clever indexing via sub2ind:
idx_n = ceil(rand(size(X,1),1)*size(X,2));
idx = sub2ind(size(X),(1:size(X,1))',idx_n);
y = X(idx);
If I understand your question, you are choosing two random numbers. First you decide whether to select vector 1 or vector 2; next you pick an element from the chosen vector.
The following code takes advantage of the fact that vector1 and vector2 are the same length:
N = 1000;
sampleMatrix = [vector1 vector2];
M = numel(sampleMatrix);
randIndex = ceil(rand(1,N)*M); % N random numbers from 1 to M
randomNumbers = sampleMatrix(randIndex); % sample N times from the matrix
You can then display the result with, for instance
figure; hist(randomNumbers); % draw a histogram of numbers drawn
When vector1 and vector2 have different elements, you run into a problem. If you concatenate them, you will end up picking elements from the longer vector more often. One way around this is to create random samplings from both arrays, then choose between them:
M1 = numel(vector1);
M2 = numel(vector2);
r1 = ceil(rand(1,N)*M1);
r2 = ceil(rand(1,N)*M2);
randMat = [vector1(r1(:)) vector2(r2(:))]; % two columns, now pick one or the other
randPick = ceil(rand(1,N)*2);
randomNumbers = [randMat(randPick==1, 1); randMat(randPick==2, 2)];
On re-reading, maybe you just want to pick "element 1 from either 1 or 2", then "element 2 from either 1 or 2", etc for all the elements of the vector. In that case, do
N=numel(vector1);
randPick = ceil(rand(1,N)*2);
randMat=[vector1(:) vector2(:)];
randomNumbers = [randMat(randPick==1, 1); randMat(randPick==2, 2)];
This problem can be solved using the function datasample.
Combine both vectors into one and apply the function. I like this approach more than the handcrafted versions in the other answers. It gives you much more flexibility in choosing what you actually want, while being a one-liner.

Vectorizing the Notion of Colon (:) - values between two vectors in MATLAB

I have two vectors, idx1 and idx2, and I want to obtain the values between them. If idx1 and idx2 were numbers and not vectors, I could do that the following way:
idx1=1;
idx2=5;
values=idx1:idx2
% Result
% values =
%
% 1 2 3 4 5
But in my case, idx1 and idx2 are vectors of variable length. For example, for length=2:
idx1=[5,9];
idx2=[9 11];
Can I use the colon operator to directly obtain the values in between? This is, something similar to the following:
values = [5 6 7 8 9 9 10 11]
I know I can do idx1(1):idx2(1) and idx1(2):idx2(2), this is, extract the values for each column separately, so if there is no other solution, I can do this with a for-loop, but maybe Matlab can do this more easily.
Your sample output is not legal. A matrix cannot have rows of different length. What you can do is create a cell array using arrayfun:
values = arrayfun(#colon, idx1, idx2, 'Uniform', false)
To convert the resulting cell array into a vector, you can use cell2mat:
values = cell2mat(values);
Alternatively, if all vectors in the resulting cell array have the same length, you can construct an output matrix as follows:
values = vertcat(values{:});
Try taking the union of the sets. Given the values of idx1 and idx2 you supplied, run
values = union(idx1(1):idx1(2), idx2(1):idx2(2));
Which will yield a vector with the values [5 6 7 8 9 10 11], as desired.
I couldn't get #Eitan's solution to work, apparently you need to specify parameters to colon. The small modification that follows got it working on my R2010b version:
step = 1;
idx1 = [5, 9];
idx2 = [9, 11];
values = arrayfun(#(x,y)colon(x, step, y), idx1, idx2, 'UniformOutput', false);
values=vertcat(cell2mat(values));
Note that step = 1 is actually the default value in colon, and Uniform can be used in place of UniformOutput, but I've included these for the sake of completeness.
There is a great blog post by Loren called Vectorizing the Notion of Colon (:). It includes an answer that is about 5 times faster (for large arrays) than using arrayfun or a for-loop and is similar to run-length-decoding:
The idea is to expand the colon sequences out. I know the lengths of
each sequence so I know the starting points in the output array. Fill
the values after the start values with 1s. Then I figure out how much
to jump from the end of one sequence to the beginning of the next one.
If there are repeated start values, the jumps might be negative. Once
this array is filled, the output is simply the cumulative sum or
cumsum of the sequence.
function x = coloncatrld(start, stop)
% COLONCAT Concatenate colon expressions
% X = COLONCAT(START,STOP) returns a vector containing the values
% [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
% Based on Peter Acklam's code for run length decoding.
len = stop - start + 1;
% keep only sequences whose length is positive
pos = len > 0;
start = start(pos);
stop = stop(pos);
len = len(pos);
if isempty(len)
x = [];
return;
end
% expand out the colon expressions
endlocs = cumsum(len);
incr = ones(1, endlocs(end));
jumps = start(2:end) - stop(1:end-1);
incr(endlocs(1:end-1)+1) = jumps;
incr(1) = start(1);
x = cumsum(incr);

How to get the number of columns in a matrix?

Suppose I specify a matrix A like
A = [1 2 3; 4 5 6; 7 8 9]
how can I query A (without using length(A)) to find out it has 3 columns?
Use the size() function.
>> size(A,2)
Ans =
3
The second argument specifies the dimension of which number of elements are required which will be '2' if you want the number of columns.
Official documentation.
While size(A,2) is correct, I find it's much more readable to first define
rows = #(x) size(x,1);
cols = #(x) size(x,2);
and then use, for example, like this:
howManyColumns_in_A = cols(A)
howManyRows_in_A = rows(A)
It might appear as a small saving, but size(.., 1) and size(.., 2) must be some of the most commonly used functions, and they are not optimally readable as-is.
When want to get row size with size() function, below code can be used:
size(A,1)
Another usage for it:
[height, width] = size(A)
So, you can get 2 dimension of your matrix.