How do I compare elements of one row with every other row in the same matrix - matlab

I have the matrix:
a = [ 1 2 3 4;
2 4 5 6;
4 6 8 9]
and I want to compare every row with each of the other rows, one pair at a time. If two rows share a value (a common key), the result should tell that they have a common key.

Using @gnovice's idea of getting all combinations with nchoosek, I propose yet another two solutions:
one using ismember (as noted by @loren)
the other using bsxfun with the eq function handle
The only difference is that intersect sorts and keeps only the unique common keys.
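To see that difference on a toy pair of rows (a quick sketch; the values are arbitrary):

```matlab
query = [4 2 2 9];
set   = [2 4 7];
intersect(query, set)          % sorted and unique: [2 4]
query(ismember(query, set))    % keeps order and duplicates: [4 2 2]
```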
a = randi(30, [100 20]);
%# a = sort(a,2);
comparisons = nchoosek(1:size(a,1),2);
N = size(comparisons,1);
keys1 = cell(N,1);
keys2 = cell(N,1);
keys3 = cell(N,1);
tic
for i=1:N
keys1{i} = intersect(a(comparisons(i,1),:),a(comparisons(i,2),:));
end
toc
tic
for i=1:N
query = a(comparisons(i,1),:);
set = a(comparisons(i,2),:);
keys2{i} = query( ismember(query, set) ); %# unique(...)
end
toc
tic
for i=1:N
query = a(comparisons(i,1),:);
set = a(comparisons(i,2),:)';
keys3{i} = query( any(bsxfun(@eq, query, set),1) ); %# unique(...)
end
toc
... with the following time comparisons:
Elapsed time is 0.713333 seconds.
Elapsed time is 0.289812 seconds.
Elapsed time is 0.135602 seconds.
Note that even by sorting a beforehand and adding a call to unique inside the loops (commented parts), these two methods are still faster than intersect.

Here's one solution (which is generalizable to larger matrices than the sample in the question):
comparisons = nchoosek(1:size(a,1),2);
N = size(comparisons,1);
keys = cell(N,1);
for i = 1:N
keys{i} = intersect(a(comparisons(i,1),:),a(comparisons(i,2),:));
end
The function NCHOOSEK is used to generate all of the unique combinations of row comparisons. For the matrix a in your question, you will get comparisons = [1 2; 1 3; 2 3], meaning that we will need to compare rows 1 and 2, then 1 and 3, and finally 2 and 3. keys is a cell array that stores the results of each comparison. For each comparison, the function INTERSECT is used to find the common values (i.e. keys). For the matrix a given in the question, you will get keys = {[2 4], 4, [4 6]}.
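Putting the pieces together as a self-contained check on the matrix a from the question reproduces the stated result:

```matlab
a = [1 2 3 4; 2 4 5 6; 4 6 8 9];
comparisons = nchoosek(1:size(a,1), 2);   % [1 2; 1 3; 2 3]
N = size(comparisons, 1);
keys = cell(N, 1);
for i = 1:N
keys{i} = intersect(a(comparisons(i,1),:), a(comparisons(i,2),:));
end
% keys is {[2 4], 4, [4 6]}
```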

Related

How to vectorize searching function in Matlab?

Here is a MATLAB coding problem (a slightly different version would use intersect instead of setdiff):
A rating matrix A has 3 columns: the 1st column is the user ID (which may be duplicated), the 2nd column is the item ID (which may be duplicated), and the 3rd column is the rating from user to item, ranging from 1 to 5.
Now, I have a subset of user IDs smallUserIDList and a subset of item IDs smallItemIDList. I want to find the rows in A rated by users in smallUserIDList, collect the items each of those users rated, and do some calculations, such as a setdiff with smallItemIDList followed by counting the result, as the following code does:
userStat = zeros(length(smallUserIDList), 1);
for i = 1:length(smallUserIDList)
A2= A(A(:,1) == smallUserIDList(i), :);
itemIDList_each = unique(A2(:,2));
setDiff = setdiff(itemIDList_each , smallItemIDList);
userStat(i) = length(setDiff);
end
userStat
Finally, the profile viewer shows that the loop above is inefficient. The question is: how can I improve this piece of code with vectorization, without the help of a for loop?
For example:
Input:
A = [
1 11 1
2 22 2
2 66 4
4 44 5
6 66 5
7 11 5
7 77 5
8 11 2
8 22 3
8 44 3
8 66 4
8 77 5
]
smallUserIDList = [1 2 7 8]
smallItemIDList = [11 22 33 55 77]
Output:
userStat =
0
1
0
2
Vanilla MATLAB:
As far as I can tell your code is equivalent to:
%// Create matrix such that: user_item_rating(user,item)==rating
user_item_rating = sparse(A(:,1),A(:,2),A(:,3));
%// Keep all BUT the items in smallItemIDList
user_item_rating(:,smallItemIDList) = [];
%// Keep only those users in `smallUserIDList` and use order of this list
user_item_rating = user_item_rating(smallUserIDList,:);
%// Count the number of ratings
userStat = sum(user_item_rating~=0, 2);
This will work if there is at most one rating per (user,item)-combination. Also it should be quite efficient.
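As a sanity check, running this approach on the example input from the question reproduces the expected output (full() is added here only to display a dense vector):

```matlab
A = [1 11 1; 2 22 2; 2 66 4; 4 44 5; 6 66 5; 7 11 5; ...
     7 77 5; 8 11 2; 8 22 3; 8 44 3; 8 66 4; 8 77 5];
smallUserIDList = [1 2 7 8];
smallItemIDList = [11 22 33 55 77];
user_item_rating = sparse(A(:,1), A(:,2), A(:,3));   % user_item_rating(user,item)==rating
user_item_rating(:, smallItemIDList) = [];           % drop the items we don't count
user_item_rating = user_item_rating(smallUserIDList, :);
userStat = full(sum(user_item_rating ~= 0, 2))       % -> [0; 1; 0; 2]
```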
Clean approach without reinventing the wheel:
Check out grpstats from the Statistics Toolbox!
An implementation could look similar to this:
%// Create ratings table
ratings = array2table(A, 'VariableNames', {'user','item','rating'});
%// Remove items we don't care about (smallItemIDList)
ratings = ratings(~ismember(ratings.item, smallItemIDList),:);
%// Keep only users we care about (smallUserIDList)
ratings = ratings(ismember(ratings.user, smallUserIDList),:);
%// Compute the statistics grouped by 'user'.
userStat = grpstats(ratings, 'user');
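One caveat (my addition, not part of the original answer): users whose ratings were all filtered out no longer form a group, so the grouped table contains no zero rows. A hedged sketch of mapping the GroupCount column back onto the full smallUserIDList (Statistics Toolbox required):

```matlab
A = [1 11 1; 2 22 2; 2 66 4; 4 44 5; 6 66 5; 7 11 5; ...
     7 77 5; 8 11 2; 8 22 3; 8 44 3; 8 66 4; 8 77 5];
smallUserIDList = [1 2 7 8];
smallItemIDList = [11 22 33 55 77];
ratings = array2table(A, 'VariableNames', {'user','item','rating'});
ratings = ratings(~ismember(ratings.item, smallItemIDList), :);
ratings = ratings(ismember(ratings.user, smallUserIDList), :);
stats = grpstats(ratings, 'user');                % one row per remaining user
userStat = zeros(numel(smallUserIDList), 1);      % zeros for users with no rows left
[tf, loc] = ismember(stats.user, smallUserIDList);
userStat(loc(tf)) = stats.GroupCount(tf)          % -> [0; 1; 0; 2]
```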
This could be one vectorized approach -
%// Take care of equality between first column of A and smallUserIDList to
%// find the matching row and column indices.
%// NOTE: This corresponds to "A(:,1) == smallUserIDList(i)" from OP.
[R,C] = find(bsxfun(@eq, A(:,1), smallUserIDList.'));
%// Take care of non-equality between second column of A and smallItemIDList.
%// NOTE: This corresponds to SETDIFF in the original loopy code from OP.
mask1 = ~ismember(A(R,2),smallItemIDList);
AR2 = A(R,2); %// Elements from 2nd col of A that has matches from first step
%// Get only those elements from C and AR2 that has ONES in mask1
C1 = C(mask1);
AR2 = AR2(mask1);
%// Initialized output array
userStat = zeros(numel(smallUserIDList),1);
if ~isempty(C1) %// there is at least one element in C1, so do further processing
%// Find the count of duplicate elements for each ID in C1 indexed into AR2.
%// NOTE: This corresponds to "unique(A2(:,2))" from OP.
dup_counts = accumarray(C1, AR2, [], @(x) numel(x)-numel(unique(x)));
%// Get the count of matches for each ID in C in the mask1.
%// NOTE: This corresponds to:
%// "length(setdiff(itemIDList_each , smallItemIDList))" from OP.
accums = accumarray(C,mask1);
%// Store the counts in output array and also subtract the dup counts
userStat(1:numel(accums)) = accums;
userStat(1:numel(dup_counts)) = userStat(1:numel(dup_counts)) - dup_counts;
end
Benchmarking
The code listed next compares runtimes for proposed approach against the original loopy code -
%// Size parameters and random inputs with them
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
A = [randi(max_userID,A_nrows,1) randi(max_itemID,A_nrows,1) randi(5,A_nrows,1)];
smallUserIDList = randi(max_userID,IDlist_len,1);
smallItemIDList = randi(max_itemID,IDlist_len,1);
disp('---------------------------- With Original Approach')
tic
%// Original posted code
toc
disp('---------------------------- With Proposed Approach')
tic
%// Proposed approach code
toc
The runtimes thus obtained with three sets of datasizes were -
Case #1:
A_nrows = 500;
IDlist_len = 500;
max_userID = 100;
max_itemID = 100;
---------------------------- With Original Approach
Elapsed time is 0.136630 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.004163 seconds.
Case #2:
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 100;
max_itemID = 100;
---------------------------- With Original Approach
Elapsed time is 1.579468 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.050498 seconds.
Case #3:
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
---------------------------- With Original Approach
Elapsed time is 1.252294 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.044198 seconds.
Conclusion: The speedups with the proposed approach over the original loopy code thus seem to be huge!!
I think you are trying to remove a fixed set of ratings for a subset of users and count the number of remaining ratings:
Does the following work:
Asub = A(ismember(A(:,1), smallUserIDList),1:2);
Bremove = allcomb(smallUserIDList, smallItemIDList);
Akeep = setdiff(Asub, Bremove, 'rows');
T = varfun(@sum, array2table(Akeep), 'InputVariables', 'Akeep2', 'GroupingVariables', 'Akeep1');
% userStat = T.GroupCount;
You need the allcomb function from the MATLAB Central File Exchange; it gives the Cartesian product of two vectors and is easy to implement yourself anyway.
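If you'd rather not download it, here is one possible stand-in built on ndgrid (a sketch; the row order may differ from allcomb's, which doesn't matter for setdiff(..., 'rows')):

```matlab
a = [1 2];  b = [30 40];       % example inputs (arbitrary)
[X, Y] = ndgrid(a, b);         % all (a,b) combinations laid out on a grid
C = [X(:), Y(:)]               % -> [1 30; 2 30; 1 40; 2 40]
```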

Multiple sampling with different sizes on Matlab

I am trying to implement this code so it works as quickly as possible.
Say I have a population of 100 different values, you can think of it as pop = 1:100 or pop = randn(1,100) to keep things simple. I have a vector n which gives me the size of samples I want to get. Say, for example, that n=[1 3 10 6 2]. What I want to do is to take 5 (which in reality is length(n)) different samples of pop, each consisting of n(i) elements without replacement. This means that for my first sample I want 1 element out of pop, for the second sample I want 3, for the third I want 10, and so on.
To be honest, I am not really interested in which elements are sampled. What I want to get is the sum of the elements present in the i-th sample. This would be trivial to implement with a loop, but I am trying to avoid loops to keep my code as quick as possible. I have to do this for many different populations and with length(n) being very large.
If I had to do it with a loop, this would be how:
pop = randn(1,100);
n = [1 3 10 6 2];
sum_sample = zeros(length(n),1);
for i = 1:length(n)
sum_sample(i,1) = sum(randsample(pop,n(i)));
end
Is there a way to do this?
The only way to figure out what is fastest for you is to do a comparison of the different methods.
In fact the loop appears to be very fast in this case!
pop = randn(1,100);
n = [1 3 10 6 2];
tic
sr = @(n) sum(randsample(pop,n));
sum_sample = arrayfun(sr,n);
toc %% Returns about 0.004
clear su
tic
for t=numel(n):-1:1
su(t)=sum(randsample(pop,n(t)));
end
toc %% Returns about 0.003
You can create a function handle which choses the random samples and sums these up. Then you can use arrayfun to execute this function for all values of n:
pop = randn(1,100);
n = [1 3 10 6 2];
sr = @(n) sum(randsample(pop,n));
sum_sample = arrayfun(sr,n);
You can do something like this:
pop = randn(1,100);
n = [1 3 10 6 2];
sampled_data_index = randi(length(pop),1,sum(n));
sampled_data = pop(sampled_data_index);
The randi function randomly selects integer values in a specified range (with replacement), which makes them suitable for indexing. Once you have the indices, you can use them all at once to pull the sampled data out of the pop vector.
If you want to have unique indices you can replace the randi function with randperm:
sampled_data_index = randperm(length(pop),sum(n));
Finally:
You can have all the sampled values as a cell variable using the following code:
pop = randn(1,100);
n = [1 3 10 6 2];
fun = @(m) pop(randperm(length(pop),m));
C = arrayfun(fun,n,'UniformOutput',0)
Also having the sum of the sampled data:
funs = @(m) sum(pop(randperm(length(pop),m)));
sumC = arrayfun(funs,n)
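A related sketch (my addition): if a single randperm draw is acceptable, i.e. no element index is reused across any of the samples (which is stricter than per-sample sampling without replacement), the per-sample sums can be obtained in one shot with accumarray. Note that repelem needs R2015a or newer:

```matlab
pop = randn(1, 100);
n = [1 3 10 6 2];
idx = randperm(length(pop), sum(n));   % one draw; indices unique overall
grp = repelem(1:numel(n), n);          % which sample each drawn element belongs to
sum_sample = accumarray(grp(:), pop(idx).')   % one sum per sample
```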

Using find with a struct

I have a struct that holds thousands of samples of data. Each data point contains multiple objects. For example:
Structure(1).a = 7
Structure(1).b = 3
Structure(2).a = 2
Structure(2).b = 6
Structure(3).a = 1
Structure(3).b = 6
...
... (thousands more)
...
Structure(2345).a = 4
Structure(2345).b = 9
... and so on.
If I wanted to find the index number of all the '.b' objects containing the number 6, I would have expected the following function would do the trick:
find(Structure.b == 6)
... and I would expect the answer to contain '2' and '3' (for the input shown above).
However, this doesn't work. What is the correct syntax and/or could I be arranging my data in a more logical way in the first place?
The syntax Structure.b for an array of structs gives you a comma-separated list, so you'll have to concatenate them all (for instance, using brackets []) in order to obtain a vector:
find([Structure.b] == 6)
For the input shown above, the result is as expected:
ans =
2 3
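To see the comma-separated-list expansion in isolation (a minimal sketch using the b values from rows 1 to 3 above):

```matlab
S(1).b = 3;  S(2).b = 6;  S(3).b = 6;
vals = [S.b]            % the brackets collect the list into [3 6 6]
find(vals == 6)         % -> [2 3]
```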
As Jonas noted, this would work only if there are no fields containing empty matrices, because empty matrices will not be reflected in the concatenation result.
Handling structs with empty fields
If you suspect that these fields may contain empty matrices, either convert them to NaNs (if possible...) or consider using one of the safer solutions suggested by Rody.
In addition, I've thought of another interesting workaround for this using strings. We can concatenate everything into a delimited string to keep the information about empty fields, and then tokenize it back (this, in my humble opinion, is easier to be done in MATLAB than handle numerical values stored in cells).
Inspired by Jonas' comment, we can convert empty fields to NaNs like so:
str = sprintf('%f,', Structure.b)
B = textscan(str, '%f', 'delimiter', ',', 'EmptyValue', NaN)
and this allows you to apply find on the contents of B:
find(B{:} == 6)
ans =
2
3
Building on EitanT's answer with Jonas' comment, a safer way could be
>> S(1).a = 7;
S(1).b = 3;
S(2).a = 2;
S(2).b = 6;
S(3).a = 1;
S(3).b = [];
S(4).a = 1;
S(4).b = 6;
>> find( cellfun(@(x)isequal(x,6),{S.b}) )
ans =
2 4
It's probably not very fast though (compared to EitanT's version), so only use this when needed.
Another answer to this question! This time, we'll compare the performance of the following 5 methods:
My original method
EitanT's original method (which does not handle empties)
EitanT's improved method using strings
A new method: a simple for-loop
Another new method: a vectorized, empty-safe version
Test code:
% Set up test
N = 1e5;
S(N).b = [];
for ii = 1:N
S(ii).b = randi(6);
end
% Rody Oldenhuis 1
tic
sol1 = find( cellfun(@(x)isequal(x,6),{S.b}) );
toc
% EitanT 1
tic
sol2 = find([S.b] == 6);
toc
% EitanT 2
tic
str = sprintf('%f,', S.b);
values = textscan(str, '%f', 'delimiter', ',', 'EmptyValue', NaN);
sol3 = find(values{:} == 6);
toc
% Rody Oldenhuis 2
tic
ids = false(N,1);
for ii = 1:N
ids(ii) = isequal(S(ii).b, 6);
end
sol4 = find(ids);
toc
% Rody Oldenhuis 3
tic
idx = false(size(S));
SS = {S.b};
inds = ~cellfun('isempty', SS);
idx(inds) = [SS{inds}]==6;
sol5 = find(idx);
toc
% make sure they are all equal
all(sol1(:)==sol2(:))
all(sol1(:)==sol3(:))
all(sol1(:)==sol4(:))
all(sol1(:)==sol5(:))
Results on my machine at work (AMD A6-3650 APU (4 cores), 4GB RAM, Windows 7 64 bit):
Elapsed time is 28.990076 seconds. % Rody Oldenhuis 1 (cellfun)
Elapsed time is 0.119165 seconds. % EitanT 1 (no empties)
Elapsed time is 22.430720 seconds. % EitanT 2 (string manipulation)
Elapsed time is 0.706631 seconds. % Rody Oldenhuis 2 (loop)
Elapsed time is 0.207165 seconds. % Rody Oldenhuis 3 (vectorized)
ans =
1
ans =
1
ans =
1
ans =
1
On my Homebox (AMD Phenom(tm) II X6 1100T (6 cores), 16GB RAM, Ubuntu64 12.10):
Elapsed time is 0.572098 seconds. % cellfun
Elapsed time is 0.119557 seconds. % no empties
Elapsed time is 0.220903 seconds. % string manipulation
Elapsed time is 0.107345 seconds. % loop
Elapsed time is 0.180842 seconds. % vectorized
Gotta love that JIT :)
and wow...anyone know why the two systems behave so differently?
Also, little known fact -- cellfun with one of the possible string arguments is incredibly fast (which goes to show how much overhead anonymous functions require...).
Still, if you can be absolutely sure there are no empties, go for EitanT's original answer; that's what Matlab is for. If you can't be sure, just go for the loop.

Finding the index of a specific value in a cell in MATLAB

I have a two dimensional cell where every element is either a) empty or b) a vector of varying length with values ranging from 0 to 2. I would like to get the indices of the cell elements where a certain value occurs or even better, the "complete" index of every occurrence of a certain value.
I'm currently working on an agent based model of disease spreading and this is done in order to find the positions of infected agents.
Thanks in advance.
Here's how I would do it:
% some example data
A = { [], [], [3 4 5]
[4 8 ], [], [0 2 3 0 1] };
p = 4; % value of interest
% Finding the indices:
% -------------------------
% use cellfun to find indices
I = cellfun(@(x) find(x==p), A, 'UniformOutput', false);
% check again for empties
% (just for consistency; you may skip this step)
I(cellfun('isempty', I)) = {[]};
Call this method1.
A loop is also possible:
I = cell(size(A));
for ii = 1:numel(I)
I{ii} = find(A{ii} == p);
end
I(cellfun('isempty',I)) = {[]};
Call this method2.
Comparing the two methods for speed like so:
tic; for ii = 1:1e3, [method1], end; toc
tic; for ii = 1:1e3, [method2], end; toc
gives
Elapsed time is 0.483969 seconds. % method1
Elapsed time is 0.047126 seconds. % method2
on Matlab R2010b/32bit with an Intel Core i3-2310M @ 2.10GHz, running Ubuntu 11.10/2.6.38-13. This is mostly due to JIT on loops (and to how terribly cellfun and anonymous functions seem to be implemented, mumblemumble..)
Anyway, in short, use the loop: it's more readable, and an order of magnitude faster than the vectorized solution.

in matlab, calculate mean in a part of one column where another column satisfies a condition

I'm quite new to matlab, and I'm curious how to do this:
I have a rather large (27000x11) matrix, and the 8th column contains a number which changes sometimes but is constant for like 2000 rows (not necessarily consecutive).
I would like to calculate the mean of the entries in the 3rd column for those rows where the 8th column has the same value. This for each value of the 8th column.
I would also like to plot the 3rd column's means as a function of the 8th column's value but that I can do if I can get a new matrix (2x2) containing [mean_of_3rd,8th].
Ex: (smaller matrix for convenience)
1 2 3 4 5
3 7 5 3 2
1 3 2 5 3
4 5 7 5 8
2 4 7 4 4
Since the 4th column has the same value in rows 1 and 5, I'd like to calculate the mean of 2 and 4 (the corresponding elements of column 2) and put it in another matrix together with the 4th column's value. The same for 3 and 5, since the 4th column has the same value for those two rows.
3 4
4 5
and so on... is this possible in an easy way?
Use the almighty, underused accumarray:
This line gives you the mean of the 2nd column, grouped by the values in the 4th column:
means = accumarray( A(:,4), A(:,2), [], @mean)
This line gives you the number of elements in each group:
count = accumarray( A(:,4) ,ones(size(A(:,4))))
Now if you want to keep only those values that occur more than once:
>> filtered = means(count>1)
filtered =
3
4
This will work only for positive integers in the 4th column.
Another possibility for counting the number of elements in each group:
count = accumarray( A(:,4), A(:,4), [], @numel)
A slightly refined approach based on the ideas of Andrey and Rody. We can not use accumarray directly, since the data is real, not integer. But, we can use unique to find the indices of the repeating entries. Then we operate on integers.
% get unique entries in 4th column
[R, I, J] = unique(A(:,4));
% count the repeating entries: now we have integer indices!
counts = accumarray(J, 1, size(R));
% sum the 2nd column for all entries
sums = accumarray(J, A(:,2), size(R));
% compute means
means = sums./counts;
% choose only the entries that show more than once in 4th column
inds = counts>1;
result = [means(inds) R(inds)];
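Running this on the 5-by-5 example matrix from the question reproduces the expected output:

```matlab
A = [1 2 3 4 5; 3 7 5 3 2; 1 3 2 5 3; 4 5 7 5 8; 2 4 7 4 4];
[R, ~, J] = unique(A(:,4));          % R = [3;4;5], J maps each row to its group
counts = accumarray(J, 1);           % [1; 2; 2]
sums   = accumarray(J, A(:,2));      % [7; 6; 8]
means  = sums ./ counts;             % [7; 3; 4]
inds   = counts > 1;
result = [means(inds) R(inds)]       % -> [3 4; 4 5]
```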
Time comparison for the following synthetic data:
A=randi(100, 1000000, 5);
% Rody's solution
Elapsed time is 0.448222 seconds.
% The above code
Elapsed time is 0.148304 seconds.
My official answer:
A4 = A(:,4);
R = unique(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
result = [means(inds) R(inds)];
This is because of the following. Here are all the alternatives we've come up with, in profiling form:
%# sample data
A = [
1 2 3 4 5
3 7 5 3 2
1 3 2 5 3
4 5 7 5 8
2 4 7 4 4];
%# accumarray
%# works only on positive integers in A(:,4)
tic
for ii = 1:1e4
means = accumarray( A(:,4), A(:,2), [], @mean);
count = accumarray( A(:,4), ones(size(A(:,4))));
filtered = means(count>1);
end
toc
%# arrayfun
%# works only on integers in A(:,4)
tic
for ii = 1:1e4
B = arrayfun(@(x) A(A(:,4)==x, 2), min(A(:,4)):max(A(:,4)), 'uniformoutput', false);
filtered = cellfun(@mean, B(cellfun(@(x) numel(x)>1, B)) );
end
toc
%# ordinary loop
%# works only on integers in A(:,4)
tic
for ii = 1:1e4
A4 = A(:,4);
R = min(A4):max(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
filtered = means(inds);
end
toc
Results:
Elapsed time is 1.238352 seconds. %# (accumarray)
Elapsed time is 7.208585 seconds. %# (arrayfun + cellfun)
Elapsed time is 0.225792 seconds. %# (for loop)
The ordinary loop is clearly the way to go here.
Note the absence of mean in the inner loop. This is because mean is not a Matlab builtin function (at least, on R2010), so that using it inside the loop makes the loop unqualified for JIT compilation, which slows it down by a factor of over 10. Using the form above accelerates the loop to almost 5.5 times the speed of the accumarray solution.
Judging by your comment, it is almost trivial to change the loop to work on all entries in A(:,4) (not just the integers):
A4 = A(:,4);
R = unique(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
filtered = means(inds);
Which I will copy-paste to the top as my official answer :)