Vectorizing cell find and summing in Matlab - matlab

Would someone please show me how I can go about changing this code from an iterated to a vectorized implementation to speed up performance in Matlab? It takes approximately 8 seconds per i for i=1:20 on my machine currently.
classEachWordCount = zeros(nwords_train, nClasses);
for i=1:nClasses % (20 classes)
for j=1:nwords_train % (53975 words)
classEachWordCount(j,i) = sum(groupedXtrain{i}(groupedXtrain{i}(:,2)==j,3));
end
end
If context is helpful basically groupedXtrain is a cell of 20 matrices which represent different classes, where each class matrix has 3 columns: document#,word#,wordcount, and unequal numbers of rows (tens of thousands). I'm trying to figure out the count total of each word, for each class. So classEachWordCount should be a matrix of size 53975x20 where each row represents a different word and each column a different label. There's got to be a built-in function to assist in something like this, right?
for example groupedXtrain{1} might start off like:
doc#,word#,wordcount
1 1 3
1 2 1
1 4 3
1 5 1
1 8 2
2 2 1
2 5 4
2 6 2

As is mentioned in the comments, you can use accumarray to sum up the values in the third column for each unique value in the second column for each class
results = zeros(nwords_train, numel(groupedXtrain));
for k = 1:numel(groupedXtrain)
results(:,k) = accumarray(groupedXtrain{k}(:,2), groupedXtrain{k}(:,3), ...
[nwords_train 1], #sum);
end

Related

How to rowwise-sort a matrix containing subgrouped data

In matrix A, every column represents an output variable and every row represents a reading (6 rows in total). Every output has a certain subgroup size (groups of 3 rows). I need A's elements to be sorted in the vertical direction within every subgroup.
A = [ 1 7 4; 4 9 3; 8 5 7; 2 9 1; 7 4 4; 8 1 3];
% consecutive 3 rows is one subgroup, within which sorting is required.
B = [1 5 3; 4 7 4; 8 9 7; 2 1 1; 7 4 3; 8 9 4]; % the expected result.
I was considering something along the lines of B = splitapply(#sort,A,2), but splitapply cannot be called like that. How can I get the desired result?
Please note, the actual matrix contains 8 columns and 300 rows. An example is demonstrated above.
The easiest solution would be to reshape your data, sort, then permute it:
rps = 3; % rows per subgroup
B = permute(sort(reshape(A.',rps,size(A,2),[]),2),[2 1 3]);
The above results in a 3x3x2 arrays, which in my opinion are easier to work with, but if you want the output as in the example, you can do the following:
B = reshape(permute(sort(reshape(A.',rps,size(A,2),[]),2),[2 3 1]),size(A));
Alternatively, you are correct to think that splitapply can be useful here, but it requires a bit more work.
This command works on the sample data and should also work on your full dataset:
b = cell2mat( splitapply( #(x){sort(x,2).'}, A.', repelem( 1:size(A,1)/rps, rps ) ).' );
I'll explain what this does:
repelem( 1:size(A,1)/rps, rps ) returns a row vector of groups. The amount of groups is the total amount of rows divided by the group size. (For good measure, there should be an assertion that this is divisible with no remainder).
splitapply( #(x){sort(x,2).'}, ... since splitapply has to return a scalar object per group, it needs to be told that the output is a cell so that it can return a matrix. (This might not be the best explanation, but if you attempt to run it w/o a cell output, you will get the following error:
The function 'sort' returned a non-scalar value when applied to the 1st group of data.
To compute nonscalar values for each group, create an anonymous function to return each value in a scalar cell:
#(x1){sort(x1)}
I perform several transpose operations since this is what splitapply expects.
I used cell2mat to convert the output cells back to a numeric array.

Merge two matrix and find the maximum of its attributes

I've two matrix a and b and I'd like to combine the rows in a way that in the first row I got no duplicate value and in the second value, columns in a and b which have the same row value get the maximum value in new matrix. i.e.
a = 1 2 3
8 2 5
b = 1 2 5 7
2 4 6 1
Desired output
c = 1 2 3 5 7
8 4 5 6 1
Any help is welcomed,please.( the case for accumulation is asked here)
Accumarray accepts functions both anonymous as well as built-in functions. It uses sum function as default. But you could change this to any in-built or anonymous functions like this:
In this case you could use max function.
in = horzcat(a,b).';
[uVal,~,idx] = unique(in(:,1));
out = [uVal,accumarray(idx,in(:,2),[],#max)].'
Based upon your previous question and looking at the help file for accumarray, which has this exact example.
[ii, ~, kk] = unique([a(1,:) b(1,:)]);
result = [ ii; accumarray(kk(:), [a(2,:) b(2,:)], [], #max).'];
The only difference is the anonymous function.

Generate pairs of points using a nested for loop

As an example, I have a matrix [1,2,3,4,5]'. This matrix contains one column and 5 rows, and I have to generate a pair of points like (1,2),(1,3)(1,4)(1,5),(2,3)(2,4)(2,5),(3,4)(3,5)(4,5).
I have to store these values in 2 columns in a matrix. I have the following code, but it isn't quite giving me the right answer.
for s = 1:5;
for tb = (s+1):5;
if tb>s
in = sub2ind(size(pairpoints),(tb-1),1);
pairpoints(in) = s;
in = sub2ind(size(pairpoints),(tb-1),2);
pairpoints(in) = tb;
end
end
end
With this code, I got (1,2),(2,3),(3,4),(4,5). What should I do, and what is the general formula for the number of pairs?
One way, though is limited depending upon how many different elements there are to choose from, is to use nchoosek as follows
pairpoints = nchoosek([1:5],2)
pairpoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
See the limitations of this function in the provided link.
An alternative is to just iterate over each element and combine it with the remaining elements in the list (assumes that all are distinct)
pairpoints = [];
data = [1:5]';
len = length(data);
for k=1:len
pairpoints = [pairpoints ; [repmat(data(k),len-k,1) data(k+1:end)]];
end
This method just concatenates each element in data with the remaining elements in the list to get the desired pairs.
Try either of the above and see what happens!
Another suggestion I can add to the mix if you don't want to rely on nchoosek is to generate an upper triangular matrix full of ones, disregarding the diagonal, and use find to generate the rows and columns of where the matrix is equal to 1. You can then concatenate both of these into a single matrix. By generating an upper triangular matrix this way, the locations of the matrix where they're equal to 1 exactly correspond to the row and column pairs that you are seeking. As such:
%// Highest value in your data
N = 5;
[rows,cols] = find(triu(ones(N),1));
pairpoints = [rows,cols]
pairPoints =
1 2
1 3
2 3
1 4
2 4
3 4
1 5
2 5
3 5
4 5
Bear in mind that this will be unsorted (i.e. not in the order that you specified in your question). If order matters to you, then use the sortrows command in MATLAB so that we can get this into the proper order that you're expecting:
pairPoints = sortrows(pairPoints)
pairPoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
Take note that I specified an additional parameter to triu which denotes how much of an offset you want away from the diagonal. The default offset is 0, which includes the diagonal when you extract the upper triangular matrix. I specified 1 as the second parameter because I want to move away from the diagonal towards the right by 1 unit so I don't want to include the diagonal as part of the upper triangular decomposition.
for loop approach
If you truly desire the for loop approach, going with your model, you'll need two for loops and you need to keep track of the previous row we are at so that we can just skip over to the next column until the end using this. You can also use #GeoffHayes approach in using just a single for loop to generate your indices, but when you're new to a language, one key advice I will always give is to code for readability and not for efficiency. Once you get it working, if you have some way of measuring performance, you can then try and make the code faster and more efficient. This kind of programming is also endorsed by Jon Skeet, the resident StackOverflow ninja, and I got that from this post here.
As such, you can try this:
pairPoints = []; %// Initialize
N = 5; %// Highest value in your data
for row = 1 : N
for col = row + 1 : N
pairPoints = [pairPoints; [row col]]; %// Add row-column pair to matrix
end
end
We get the equivalent output:
pairPoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
Small caveat
This method will only work if your data is enumerated from 1 to N.
Edit - August 20th, 2014
You wish to generalize this to any array of values. You also want to stick with the for loop approach. You can still keep the original for loop code there. You would simply have to add a couple more lines to index your new array. As such, supposing your data array was:
dat = [12, 45, 56, 44, 62];
You would use the pairPoints matrix and use each column to subset the data array to access your values. Also, you need to make sure your data is a column vector, or this won't work. If we didn't, we would be creating a 1D array and concatenating rows and that's not obviously what we're looking for. In other words:
dat = [12, 45, 56, 44, 62];
dat = dat(:); %// Make column vector - Important!
N = numel(dat); %// Total number of elements in your data array
pairPoints = []; %// Initialize
%// Skip if the array is empty
if (N ~= 0)
for row = 1 : N
for col = row + 1 : N
pairPoints = [pairPoints; [row col]]; %// Add row-column pair to matrix
end
end
vals = [dat(pairPoints(:,1)) dat(pairPoints(:,2))];
else
vals = [];
Take note that I have made a provision where if the array is empty, don't even bother doing any calculations. Just output an empty matrix.
We thus get:
vals =
12 45
12 56
12 44
12 62
45 56
45 44
45 62
56 44
56 62
44 62

Remove duplicates appearing next to each other, but keep it if it appears again later

I have a vector that could look like this:
v = [1 1 2 2 2 3 3 3 3 2 2 1 1 1];
that is, the number of equal elements can vary, but they always increase and decrease stepwise by 1.
What I want is an easy way to be left with a new vector looking like this:
v2 = [ 1 2 3 2 1];
holding all the different elements (in the right order as they appear in v), but only one of each. Preferably without looping, since generally my vectors are about 10 000 elements long, and already inside a loop that's taking for ever to run.
Thank you so much for any answers!
You can use diff for this. All you're really asking for is: Delete any element that's equal to the one in front of it.
diff return the difference between all adjacent elements in a vector. If there is no difference, it will return 0. v(ind~=0) will give you all elements that have a value different than zero. The 1 in the beginning is to make sure the first element is counted. As diff returns the difference between elements, numel(diff(v)) = numel(v)-1.
v = [1 1 2 2 2 3 3 3 3 2 2 1 1 1];
ind = [1 diff(v)];
v(ind~=0)
ans =
1 2 3 2 1
This can of course be done in a single line if you want:
v([1, diff(v)]~=0)
You could try using diff which, for a vector X, returns [X(2)-X(1) X(3)-X(2) ... X(n)-X(n-1)] (type help diff for details on this function). Since the elements in your vector always increase or decrease by 1, then
diff(v)
will be a vector (of size one less than v) with zeros and ones where a one indicates a step up or down. We can ignore all the zeros as they imply repeated numbers. We can convert this to a logical array as
logical(diff(v))
so that we can index into v and access its elements as
v(logical(diff(v)))
which returns
1 2 3 2
This is almost what you want, just without the final number which can be added as
[v(logical(diff(v))) v(end)]
Try the above and see what happens!

comparing multiple columns of matrix together

I have a 50x50 matrix named nghlist(i,j) containing 0 and 1 values. 1 means there is a relation between (i,j).
There is another 5x50 matrix named chlist.
I need to check the nghlist matrix and if there is any connection between i and j (nghlist(i,j)==1) then I need to go to the chlist matrix and compare the values on column i and column j. For example compare columns (1,3,8,21,52) and get how many similar values they share together. i.e. I find all those columns have 3 similar values.
I tried using following code. But the problem is I need to compare the unknown number of columns (depend on the node connection (nghlist) for example 4 or 5 columns) together.
for i=1:1:n
for j=1:1:n
if (i~=j & nghlist(i,j)==1)
sum(ismember(chlist(:,i),chlist(:,j)));
end
end
end
Any help is highly appreciated.
++++ simplified example ++++++
take a look at the example http://i.imgur.com/mQjDqzz.jpg
nghlist matrix:
1 1 1 0 0
1 1 1 0 0
1 1 1 1 1
0 0 1 1 1
0 0 1 1 1
chlist matrix:
3 1 4 5 4
4 3 5 6 5
5 4 6 7 6
In this example since node 1 is connected to nodes 2 and 3, I need to compare column 1,2 and 3 from chlist. The output would be 1 (because they only share value '4').
And this value for node 5 would be 2 (because columns 3,4 and 5 only share value '5' and '6'). I hope now it is clear.
If the result of your simplified example is [1,2,0,3,2] then the following code worked for me.
(Matrix a stands for nghlist and matrix b for chlist, result is stored in s )
for i = 1:size(a,1)
s(i)=0;
row = a(i,:);
idx = find(row==1);
idx = idx(idx~=i);
tempb = b(:,idx);
for j=1:size(tempb,1)
if sum(sum(tempb==tempb(j,1)))==size(tempb,2)
s(i)=s(i)+1;
end
end
end
For every node you find all the ones in its row, then discard the one referring to the node itself. Pick the appropriate columns of chlist (line 6) and create a new matrix. For every element of the 1st column of this matrix check if it exists in all other columns.If it does, update the s value
Let's say, the indices of the columns that are to be compared is called idxList:
idxList = [1,3,8,21,50];
You may compare all with the first one and use "AND" to find the minimum number of shared values:
shared = ones(size(chlist(:,i)))
for ii = 2:length(idxList)
shared = (shared == (chlist(:,idxList(1)) == chlist(:,idxList(ii))))
end
Finally, sum as before
sum(shared)
I haven't checked the exact code, but the concept should become clear.
I manage to solve it this way. I compare the first and second column of tempb and put the result in tem. then compare tem with third column of tempb and so on. Anyway thank you finmor and also pyStarter, your codes has inspired me. However I knew its not the best way, but at least it works.
for i=1:size(nghlist,1)
s(i)=0;
j=2;
row=nghlist(i,:);
idx=find(row==1);
tempb=chlist(:,idx);
if (size(tempb,2)>1)
tem=intersect(tempb(:,1),tempb(:,2));
if (size(tempb,2)<=2)
s(i)=size(tem,1);
end
while (size(tempb,2)>j & size(tem)~=0)
j=j+1;
tem= intersect(tem(:,1),tempb(:,j));
s(i)=size(tem,1);
end
end
end