How to vectorize searching function in Matlab? - matlab

Here is a Matlab coding problem (A little different version with intersect not setdiff here:
a rating matrix A with 3 cols, the 1st col is user'ID which maybe duplicated, 2nd col is the item'ID which maybe duplicated, 3rd col is rating from user to item, ranging from 1 to 5.
Now, I have a subset of user IDs smallUserIDList and a subset of item IDs smallItemIDList, then I want to find the rows in A that rated by users in smallUserIDList, and collect the items that user rated, and do some calculations, such as setdiff with smallItemIDList and count the result, as the following code does:
userStat = zeros(length(smallUserIDList), 1);
for i = 1:length(smallUserIDList)
A2= A(A(:,1) == smallUserIDList(i), :);
itemIDList_each = unique(A2(:,2));
setDiff = setdiff(itemIDList_each , smallItemIDList);
userStat(i) = length(setDiff);
end
userStat
Finally, I find the profile viewer showing that the loop above is inefficient, the question is how to improve this piece of code with vectorization but the help of for loop?
For example:
Input:
A = [
1 11 1
2 22 2
2 66 4
4 44 5
6 66 5
7 11 5
7 77 5
8 11 2
8 22 3
8 44 3
8 66 4
8 77 5
]
smallUserIDList = [1 2 7 8]
smallItemIDList = [11 22 33 55 77]
Output:
userStat =
0
1
0
2

Vanilla MATLAB:
As far as I can tell your code is equivalent to:
%// Create matrix such that: user_item_rating(user,item)==rating
user_item_rating = sparse(A(:,1),A(:,2),A(:,3));
%// Keep all BUT the items in smallItemIDList
user_item_rating(:,smallItemIDList) = [];
%// Keep only those users in `smallUserIDList` and use order of this list
user_item_rating = user_item_rating(smallUserIDList,:);
%// Count the number of ratings
userStat = sum(user_item_rating~=0, 2);
This will work if there is at most one rating per (user,item)-combination. Also it should be quite efficient.
Clean approach without reinventing the wheel:
Check out grpstats from the Statistics Toolbox!
An implementation could look similar to this:
%// Create ratings table
ratings = array2table(A, 'VariableNames', {'user','item','rating'});
%// Remove items we don't care about (smallItemIDList)
ratings = ratings(~ismember(ratings.item, smallItemIDList),:);
%// Keep only users we care about (smallUserIDList)
ratings = ratings(ismember(ratings.user, smallUserIDList),:);
%// Compute the statistics grouped by 'user'.
userStat = grpstats(ratings, 'user');

This could be one vectorized approach -
%// Take care of equality between first column of A and smallUserIDList to
%// find the matching row and column indices.
%// NOTE: This corresponds to "A(:,1) == smallUserIDList(i)" from OP.
[R,C] = find(bsxfun(#eq,A(:,1),smallUserIDList.')); %//'
%// Take care of non-equality between second column of A and smallItemIDList.
%// NOTE: This corresponds to SETDIFF in the original loopy code from OP.
mask1 = ~ismember(A(R,2),smallItemIDList);
AR2 = A(R,2); %// Elements from 2nd col of A that has matches from first step
%// Get only those elements from C and AR2 that has ONES in mask1
C1 = C(mask1);
AR2 = AR2(mask1);
%// Initialized output array
userStat = zeros(numel(smallUserIDList),1);
if ~isempty(C1)%//There is at least one element in C, so do further processing
%// Find the count of duplicate elements for each ID in C1 indexed into AR2.
%// NOTE: This corresponds to "unique(A2(:,2))" from OP.
dup_counts = accumarray(C1,AR2,[],#(x) numel(x)-numel(unique(x)));
%// Get the count of matches for each ID in C in the mask1.
%// NOTE: This corresponds to:
%// "length(setdiff(itemIDList_each , smallItemIDList))" from OP.
accums = accumarray(C,mask1);
%// Store the counts in output array and also subtract the dup counts
userStat(1:numel(accums)) = accums;
userStat(1:numel(dup_counts)) = userStat(1:numel(dup_counts)) - dup_counts;
end
Benchmarking
The code listed next compares runtimes for proposed approach against the original loopy code -
%// Size parameters and random inputs with them
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
A = [randi(max_userID,A_nrows,1) randi(max_itemID,A_nrows,1) randi(5,A_nrows,2)];
smallUserIDList = randi(max_userID,IDlist_len,1);
smallItemIDList = randi(max_itemID,IDlist_len,1);
disp('---------------------------- With Original Approach')
tic
%// Original posted code
toc
disp('---------------------------- With Proposed Approach'))
tic
%// Proposed approach code
toc
The runtimes thus obtained with three sets of datasizes were -
Case #1:
A_nrows = 500;
IDlist_len = 500;
max_userID = 100;
max_itemID = 100;
---------------------------- With Original Approach
Elapsed time is 0.136630 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.004163 seconds.
Case #2:
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 100;
max_itemID = 100;
---------------------------- With Original Approach
Elapsed time is 1.579468 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.050498 seconds.
Case #3:
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
---------------------------- With Original Approach
Elapsed time is 1.252294 seconds.
---------------------------- With Proposed Approach
Elapsed time is 0.044198 seconds.
Conclusion: The speedups with the proposed approach over the original loopy code thus seem to be huge!!

I think you are trying to remove a fixed set of ratings for a subset of users and count the number of remaining ratings:
Does the following work:
Asub = A(ismember(A(:,1), smallUserIDList),1:2);
Bremove = allcomb(smallUserIDList, smallItemIDList);
Akeep = setdiff(Asub, Bremove, 'rows');
T = varfun(#sum, array2table(Akeep), 'InputVariables', 'Akeep2', 'GroupingVariables', 'Akeep1');
% userStat = T.GroupCount;
you need the allcomb function from the file exchange from matlab central, it gives a cartesian product of two vectors, and is easy to implement anyway.

Related

MATLAB - sparse to dense matrix

I have a sparse matrix in a data file produced by a code(which is not MATLAB). The data file consists of four columns. The first two column are the real and imaginary part of a matrix entry and the third and fourth columns are the corresponding row and column index respectively.
I convert this into a dense matrix in Matlab using the following script.
tic
dataA = load('sparse_LHS.dat');
toc
% Initialise matrix
tic
Nr = 15; Nz = 15; Neq = 5;
A (Nr*Nz*Neq,Nr*Nz*Neq) = 0;
toc
tic
lA = length(dataA)
rowA = dataA(:,3); colA = dataA(:,4);
toc
tic
for i = 1:lA
A(rowA(i), colA(i)) = complex(dataA(i,1), dataA(i,2));
end
toc
This scipt is, however, very slow(the for loop is the culprit).
Elapsed time is 0.599023 seconds.
Elapsed time is 0.001978 seconds.
Elapsed time is 0.000406 seconds.
Elapsed time is 275.462138 seconds.
Is there any fast way of doing this in matlab?
Here is what I tried so far:
parfor - This gives me
valid indices are restricted in parfor loops
I tired to recast the for loop as something like this:
A(rowA(:),colA(:)) = complex(dataA(:,1), dataA(:,2));
and I get an error
Subscripted assignment dimension mismatch.
The reason your last try doesn't work is that Matlab can't take a list of subscripts for both columns and rows, and match them to assign elements in order. Instead, it's making all the combinations of rows and columns from the list - this is how it looks:
dataA = magic(4)
dataA =
16 2 3 13
5 11 10 8
9 7 6 12
4 14 15 1
dataA([1,2],[1,4]) =
16 13
5 8
So we got 4 elements ([1,1],[1,4],[2,1],[2,4]) instead of 2 ([1,1] and [2,4]).
In order to use subscripts in a list, you need to converts them to linear indexing, and one simple way to do this is using the function sub2ind.
Using this function you can write the following code to do it all at once:
% Initialise matrix
Nr = 15; Nz = 15; Neq = 5;
A(Nr*Nz*Neq,Nr*Nz*Neq) = 0;
% Place all complex values from dataA(:,1:2) in to A by the subscripts in dataA(:,3:4):
A(sub2ind(size(A),dataA(:,3),dataA(:,4))) = complex(dataA(:,1), dataA(:,2));
sub2ind is not such a quick function (but it will be much quicker than your loop), so if you have a lot of data, you might want to do the computation of the linear index by yourself:
rowA = dataA(:,3);
colA = dataA(:,4);
% compute the linear index:
ind = (colA-1)*size(A,1)+rowA;
% Place all complex values from dataA(:,1:2) in to A by the the index 'ind':
A(ind) = complex(dataA(:,1), dataA(:,2));
P.S.:
If you are using Matlab R2015b or later:
A = zeros(Nr*Nz*Neq,Nr*Nz*Neq);
is quicker than:
A(Nr*Nz*Neq,Nr*Nz*Neq) = 0;

Randomly select subset from vector in order (Matlab)

I want to select a random subset of a vector, much like datasample(data,k), but I want them in order.
I have an ODE which has [t,y] as output and it's the y that I want a subset of. I cannot just do a sort because y is not linear and so I somehow have to sort it with respect to t.
Any ideas how I can to this?
If I understand correctly, you want to sample the elements maintaining their original order. You can do it this way:
randomly sample the indices rather than the values;
sort the sampled indices;
use them to access the selected values;
that is:
result = data(sort(randsample(numel(data), k)));
The above uses the randsample function from the Statistics Toolbox. Alternatively, in recent Matlab versions you can use the two-input form of randperm:
result = data(sort(randperm(numel(data), k)));
For example, given
data = [61 52 43 34 25 16];
k = 4;
a possible result is
result =
61 43 34 25
This can be solved using a combination of randperm and intersect:
function q40673112
% Create a vector:
v = round(sin(0:0.6:6),3); disp(['v = ' mat2str(v)]);
% Set the size of sample we want:
N = 5;
% Create the random indices:
inds = intersect(1:numel(v), randperm(numel(v),N)); disp(['inds = ' mat2str(inds)]);
% Sample from the vector:
v_samp = v(inds); disp(['v_samp = ' mat2str(v_samp)]);
Example output:
% 1 2 3 4 5 6 7 8 9 10 11
v = [0 0.565 0.932 0.974 0.675 0.141 -0.443 -0.872 -0.996 -0.773 -0.279]
inds = [4 6 9 10 11]
v_samp = [0.974 0.141 -0.996 -0.773 -0.279]

Implementing this equation in MATLAB with a for loop?

I am looking to implement the following equation in MATLAB since I have a very large matrix,
How would I be able to do this? It is not really about the 261 and for the sake of simplicity, we can assume d = 0.94, and there is no need to worry about the squared term nor mean term as I will be able to figure that out if I can get the loop concept down. So for instance, I will just try to calculate an average of all the past values in the rows with specific weights attached to them.
To clarify, we can essentially think of i as indexing the rows of a matrix and so this consists of an entire column which I provided as an example below. Ignoring the infinity, we can just sum it to period t, but the idea is that there is a certain weight placed on all the previous values of the rows where the most recent row has the greatest weight.
I was thinking of using something like this:
R = [1; 2; 3; 4; 5; 6; 7; 8; 9; 10];
d = 0.94;
r = zeros(10,1);
for t = 2:10
r(t,1) = R(t,1);
for i = 1:10
W(i,1) = (1-d)*(d.^i)*r(t,1);
end
end
Or even indexing t = 1:10.
None of these works. In essence, I want to be able to calculate a mean for which there is greater weight placed to the most recent value. So for example, at row t=4, the value I would get would be:
(1-0.94)(0.94^3)*(1) + (1-0.94)(0.94^2)(2) +(1-0.94)(0.94)(3).
Right, if I understand you correctly, I think the following should work:
R = [1 2 3 4 5 6 7 8 9 10];
d = 0.94;
W = zeros(size(R));
% at t = 1, sigma will be 0
for t = 2:length(R)
meanR = mean(R(1:t-1));
for i = 1:t-1
W(t) = W(t) + 261*(1-d)*(d.^(t-i))*(R(i) - meanR)^2;
end
end

in matlab, calculate mean in a part of one column where another column satisfies a condition

I'm quite new to matlab, and I'm curious how to do this:
I have a rather large (27000x11) matrix, and the 8th column contains a number which changes sometimes but is constant for like 2000 rows (not necessarily consecutive).
I would like to calculate the mean of the entries in the 3rd column for those rows where the 8th column has the same value. This for each value of the 8th column.
I would also like to plot the 3rd column's means as a function of the 8th column's value but that I can do if I can get a new matrix (2x2) containing [mean_of_3rd,8th].
Ex: (smaller matrix for convenience)
1 2 3 4 5
3 7 5 3 2
1 3 2 5 3
4 5 7 5 8
2 4 7 4 4
Since the 4th column has the same value in row 1 and 5 I'd like to calculate the mean of 2 and 4 (the corresponding elements of column 2, italic bold) and put it in another matrix together with the 4th column's value. The same for 3 and 5 (bold) since the 4th column has the same value for these two.
3 4
4 5
and so on... is this possible in an easy way?
Use the all-mighty, underused accumarray :
This line gives you mean values of 4th column accumulated by 2nd column:
means = accumarray( A(:,4) ,A(:,2),[],#mean)
This line gives you number of element in each set:
count = accumarray( A(:,4) ,ones(size(A(:,4))))
Now if you want to filter only those that have at least one occurence:
>> filtered = means(count>1)
filtered =
3
4
This will work only for positive integers in the 4th column.
Another possibility for counting amount of elements in each set:
count = accumarray( A(:,4) ,A(:,4),[],#numel)
A slightly refined approach based on the ideas of Andrey and Rody. We can not use accumarray directly, since the data is real, not integer. But, we can use unique to find the indices of the repeating entries. Then we operate on integers.
% get unique entries in 4th column
[R, I, J] = unique(A(:,4));
% count the repeating entries: now we have integer indices!
counts = accumarray(J, 1, size(R));
% sum the 2nd column for all entries
sums = accumarray(J, A(:,2), size(R));
% compute means
means = sums./counts;
% choose only the entries that show more than once in 4th column
inds = counts>1;
result = [means(inds) R(inds)];
Time comparison for the following synthetic data:
A=randi(100, 1000000, 5);
% Rody's solution
Elapsed time is 0.448222 seconds.
% The above code
Elapsed time is 0.148304 seconds.
My official answer:
A4 = A(:,4);
R = unique(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
result = [means(inds) R(inds)];
This is because of the following. Here's all of the alternatives we've come up with, in profiling form:
%# sample data
A = [
1 2 3 4 5
3 7 5 3 2
1 3 2 5 3
4 5 7 5 8
2 4 7 4 4];
%# accumarray
%# works only on positive integers in A(:,4)
tic
for ii = 1:1e4
means = accumarray( A(:,4) ,A(:,2),[],#mean);
count = accumarray( A(:,4) ,ones(size(A(:,4))));
filtered = means(count>1);
end
toc
%# arrayfun
%# works only on integers in A(:,4)
tic
for ii = 1:1e4
B = arrayfun(#(x) A(A(:,4)==x, 2), min(A(:,4)):max(A(:,4)), 'uniformoutput', false);
filtered = cellfun(#mean, B(cellfun(#(x) numel(x)>1, B)) );
end
toc
%# ordinary loop
%# works only on integers in A(:,4)
tic
for ii = 1:1e4
A4 = A(:,4);
R = min(A4):max(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
filtered = means(inds);
end
toc
Results:
Elapsed time is 1.238352 seconds. %# (accumarray)
Elapsed time is 7.208585 seconds. %# (arrayfun + cellfun)
Elapsed time is 0.225792 seconds. %# (for loop)
The ordinary loop is clearly the way to go here.
Note the absence of mean in the inner loop. This is because mean is not a Matlab builtin function (at least, on R2010), so that using it inside the loop makes the loop unqualified for JIT compilation, which slows it down by a factor of over 10. Using the form above accelerates the loop to almost 5.5 times the speed of the accumarray solution.
Judging on your comment, it is almost trivial to change the loop to work on all entries in A(:,4) (not just the integers):
A4 = A(:,4);
R = unique(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(A4)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
filtered = means(inds);
Which I will copy-paste to the top as my official answer :)

How do I compare elements of one row with every other row in the same matrix

I have the matrix:
a = [ 1 2 3 4;
2 4 5 6;
4 6 8 9]
and I want to compare every row with every other two rows one by one. If they share the same key then the result will tell they have a common key.
Using #gnovice's idea of getting all combinations with nchoosek, I propose yet another two solutions:
one using ismember (as noted by #loren)
the other using bsxfun with the eq function handle
The only difference is that intersect sorts and keeps only the unique common keys.
a = randi(30, [100 20]);
%# a = sort(a,2);
comparisons = nchoosek(1:size(a,1),2);
N = size(comparisons,1);
keys1 = cell(N,1);
keys2 = cell(N,1);
keys3 = cell(N,1);
tic
for i=1:N
keys1{i} = intersect(a(comparisons(i,1),:),a(comparisons(i,2),:));
end
toc
tic
for i=1:N
query = a(comparisons(i,1),:);
set = a(comparisons(i,2),:);
keys2{i} = query( ismember(query, set) ); %# unique(...)
end
toc
tic
for i=1:N
query = a(comparisons(i,1),:);
set = a(comparisons(i,2),:)';
keys3{i} = query( any(bsxfun(#eq, query, set),1) ); %'# unique(...)
end
toc
... with the following time comparisons:
Elapsed time is 0.713333 seconds.
Elapsed time is 0.289812 seconds.
Elapsed time is 0.135602 seconds.
Note that even by sorting a beforehand and adding a call to unique inside the loops (commented parts), these two methods are still faster than intersect.
Here's one solution (which is generalizable to larger matrices than the sample in the question):
comparisons = nchoosek(1:size(a,1),2);
N = size(comparisons,1);
keys = cell(N,1);
for i = 1:N
keys{i} = intersect(a(comparisons(i,1),:),a(comparisons(i,2),:));
end
The function NCHOOSEK is used to generate all of the unique combinations of row comparisons. For the matrix a in your question, you will get comparisons = [1 2; 1 3; 2 3], meaning that we will need to compare rows 1 and 2, then 1 and 3, and finally 2 and 3. keys is a cell array that stores the results of each comparison. For each comparison, the function INTERSECT is used to find the common values (i.e. keys). For the matrix a given in the question, you will get keys = {[2 4], 4, [4 6]}.