Finding most frequent words in MATLAB

I have a matrix like below:
temp=[1 1 6;
1 2 6;
1 3 7;
1 4 1;
2 1 1;
2 2 2;
2 3 5;
2 4 6;
3 1 4;
3 2 3;
3 3 5;
3 4 7;];
The first column represents the document_id, the second column the word_id, and the third column the number of times that word occurs in that document.
I want to find the top 3 words by total frequency across all documents. Rather than using lots of loops, what is a better way to do this in MATLAB?
I have an initial idea of using:
sorted=sortrows(temp, 2)
I guess histcounts or accumarray could help me, but I'm not sure how.

WoW! This was the answer I was looking for:
sortrows(splitapply(@sum, sorted(:, 3), findgroups(sorted(:, 2))), -1)
ans =
17
14
11
11
https://www.mathworks.com/help/matlab/ref/splitapply.html
Update 1: Actually not, because it doesn't tell me which word_id values in the second column produce these totals.
Update 2: While I can get the single most frequent word_id with max, that does not give me the top 3 word_ids:
>> [max_val, index] = max(splitapply(@sum, sorted(:, 3), findgroups(sorted(:, 2))))
max_val =
17
index =
3
Correct final answer:
>> [frequencies, original_positions] = sort(splitapply(@sum, sorted(:, 3), findgroups(sorted(:, 2))), 'descend')
frequencies =
17
14
11
11
original_positions =
3
4
1
2
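Note that original_positions above are group numbers produced by findgroups; they coincide with the word_ids here only because the ids happen to be 1 through 4. The following is a minimal sketch, assuming MATLAB R2015b or later (for findgroups and splitapply), that recovers the actual word_ids from the second output of findgroups. The names ids, totals, order and top3 are illustrative and not part of the original answer.

```matlab
temp = [1 1 6; 1 2 6; 1 3 7; 1 4 1;
        2 1 1; 2 2 2; 2 3 5; 2 4 6;
        3 1 4; 3 2 3; 3 3 5; 3 4 7];

% G holds a group number per row; ids holds the word_id behind each group
[G, ids] = findgroups(temp(:, 2));
totals = splitapply(@sum, temp(:, 3), G);   % total occurrences per word_id

% Rank the totals and keep the three largest together with their word_ids
[freq, order] = sort(totals, 'descend');
top3 = [ids(order(1:3)) freq(1:3)]
```

The first column of top3 is the word_id and the second its total frequency, so the result stays meaningful when the word_ids are not consecutive integers.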

If you're interested in a solution with accumarray (as you expected), here you go:
[Pos, ~, ind] = unique(temp(:,2)); %Unique word IDs (in ascending ID order)
freq = accumarray(ind, temp(:,3)); %Total frequency per word ID (not yet ranked)
To get the top 3, sort the frequencies in descending order and extract the values at the first three indices (or sort in ascending order and extract the values at the last three indices).
PosFreq = sortrows([Pos freq], 2, 'descend'); %Sorting according to frequencies
Top3PosFreq = PosFreq(1:3,:); %Extracting top three frequencies
Result:
Top3PosFreq =
3 17
4 14
1 11

Related

How to split a matrix based on how close the values are?

Suppose I have a matrix A:
A = [1 2 3 6 7 8];
I would like to split this matrix into sub-matrices based on how relatively close the numbers are. For example, the above matrix must be split into:
B = [1 2 3];
C = [6 7 8];
I understand that I need to define some sort of criterion for this grouping, so I thought I'd take the absolute difference between each number and the next one, and define a limit up to which a number is allowed to be in a group. But the problem is that I cannot fix a static limit on the difference, since the matrices and sub-matrices will be changing.
Another example:
A = [5 11 6 4 4 3 12 30 33 32 12];
So, this must be split into:
B = [5 6 4 4 3];
C = [11 12 12];
D = [30 33 32];
Here, the matrix is split into three parts based on how close the values are. So the criteria for this matrix is different from the previous one though what I want out of each matrix is the same, to separate it based on the closeness of its numbers. Is there any way I can specify a general set of conditions to make the criteria dynamic rather than static?
I'm afraid my answer comes too late for you, but maybe future readers with a similar problem can profit from it.
In general, your problem calls for cluster analysis. Nevertheless, maybe there's a simpler solution to your actual problem. Here's my approach:
First, sort the input A.
To find a criterion to distinguish between "intraclass" and "interclass" elements, I calculate the differences between adjacent elements of A, using diff.
Then, I calculate the median over all these differences.
Next, I find the indices of all differences that are greater than or equal to three times the median, with a minimum difference of 1. (Depending on the actual data, this might be modified, e.g. using mean instead.) These are the indices where you will have to "split" the (sorted) input.
Lastly, I set up two vectors holding the start and end indices of each "sub-matrix", and pass them to arrayfun to get a cell array with all desired "sub-matrices".
Now, here comes the code:
% Sort input, and calculate differences between adjacent elements
AA = sort(A);
d = diff(AA);
% Calculate median over all differences
m = median(d);
% Find indices with "significantly higher difference",
% e.g. greater than or equal to three times the median
% (minimum difference should be 1)
idx = find(d >= max(1, 3 * m));
% Set up proper start and end indices
start_idx = [1 idx+1];
end_idx = [idx numel(A)];
% Generate cell array with desired vectors
out = arrayfun(@(x, y) AA(x:y), start_idx, end_idx, 'UniformOutput', false)
Because the number of resulting vectors is not known in advance, I can't think of a way to "unpack" these into individual variables.
Some tests:
A =
1 2 3 6 7 8
out =
{
[1,1] =
1 2 3
[1,2] =
6 7 8
}
A =
5 11 6 4 4 3 12 30 33 32 12
out =
{
[1,1] =
3 4 4 5 6
[1,2] =
11 12 12
[1,3] =
30 32 33
}
A =
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3
out =
{
[1,1] =
1 1 1 1 1 1 1
[1,2] =
2 2 2 2 2 2
[1,3] =
3 3 3 3 3 3 3
}
Hope that helps!
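As a variation on the code above, the start/end bookkeeping can be replaced by group lengths passed to mat2cell. This is only a sketch of the same median-based heuristic (threshold of three times the median gap, minimum 1), not a general clustering method; the variable len is an illustrative name, and A(:).' is used so that a column vector input works as well.

```matlab
A = [5 11 6 4 4 3 12 30 33 32 12];

AA = sort(A(:).');                        % sorted row vector
d = diff(AA);                             % gaps between neighbouring values
idx = find(d >= max(1, 3 * median(d)));   % a group ends at each of these positions

% Length of each group, then cut AA into pieces of those lengths
len = diff([0 idx numel(AA)]);
out = mat2cell(AA, 1, len);
celldisp(out)
```

For this input the three cells hold the same groups as in the second test above.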

Order of elements in cell array constructed by accumarray

What I'm trying to do: given a 2D matrix, get the column indices of the elements in each row that satisfy some particular condition.
For example, say my matrix is
M = [16 2 3 13; 5 11 10 8; 9 7 6 12; 4 14 15 1]
and my condition is M>6. Then my desired output would be something like
Indices = {[1 4]'; [2 3 4]'; [1 2 4]'; [2 3]';}
After reading the answers to this similar question I came up with this partial solution using find and accumarray:
[ix, iy] = find(M>6);
Indices = accumarray(ix,iy,[],@(iy){iy});
This gives very nearly the results I want -- in fact, the indices are all right, but they're not ordered the way I expected. For example, Indices{2} = [2 4 3]' instead of [2 3 4]', and I can't understand why. There are 3 occurrences of 2 in ix, at indices 3, 6, and 9. The corresponding values of iy at those indices are 2, 3, and 4, in that order. What exactly is creating the observed order? Is it just arbitrary? Is there a way to force it to be what I want, other than sorting each element of Indices afterwards?
Here's one way to solve it with arrayfun -
idx = arrayfun(@(x) find(M(x,:)>6),1:size(M,1),'Uni',0)
Display output with celldisp(idx) -
idx{1} =
1 4
idx{2} =
2 3 4
idx{3} =
1 2 4
idx{4} =
2 3
To keep working with accumarray, you can wrap iy in sort to get your desired output, although it arguably doesn't look as pretty -
Indices = accumarray(ix,iy,[],@(iy){sort(iy)})
Output -
>> celldisp(Indices)
Indices{1} =
1
4
Indices{2} =
2
3
4
Indices{3} =
1
2
4
Indices{4} =
2
3
accumarray is not guaranteed to preserve order of each chunk of its second input (see here, and also here). However, it does seem to preserve it when the first input is already sorted:
[iy, ix] = find(M.'>6); % transpose and swap outputs, to make ix sorted
Indices = accumarray(ix,iy,[],@(iy){iy}); % this line is the same as yours
produces
Indices{1} =
1
4
Indices{2} =
2
3
4
Indices{3} =
1
2
4
Indices{4} =
2
3
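If transposing M is inconvenient, the same effect can be had by sorting the subscript pairs before calling accumarray. This is a minimal sketch built on the observation above that order appears to be preserved when the first input is sorted; the helper variable s is an illustrative name.

```matlab
M = [16 2 3 13; 5 11 10 8; 9 7 6 12; 4 14 15 1];

[ix, iy] = find(M > 6);
% Sort the (row, column) pairs by row first, then by column
s = sortrows([ix iy]);
% With sorted subscripts, each group arrives in ascending column order
Indices = accumarray(s(:, 1), s(:, 2), [], @(v) {v});
celldisp(Indices)
```

Unlike the sort-inside-the-function variant, this sorts once up front rather than once per group.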

average 3rd column when 1st and 2nd column have same numbers

To keep it simple, assume that I have a 10x3 matrix in MATLAB. The first two columns of each row hold the x and y position, and the third column holds the corresponding value. For instance, [1 4 12] means that the value of the function at x=1, y=4 is 12. Some rows share the same x and y, and I want to average the values of the rows with the same (x, y) and replace them with a single row holding the average.
For example :
A = [1 4 12
1 4 14
1 4 10
1 5 5
1 5 7];
I want to have
B = [1 4 12
1 5 6]
I really appreciate your help
Thanks
Ali
Like this?
A = [1 4 12;1 4 14;1 4 10; 1 5 5;1 5 7];
[x,y] = consolidator(A(:,1:2),A(:,3),@mean);
B = [x,y]
B =
1 4 12
1 5 6
Consolidator is on the File Exchange.
Using built-in functions:
sparsemean = accumarray(A(:,1:2), A(:,3).', [], @mean, 0, true);
[i,j,v] = find(sparsemean);
B = [i(:) j(:) v(:)]; % (:) forces columns whether find returns rows or columns
A = [1 4 12;1 4 14;1 4 10; 1 5 5;1 5 7]; %your example data
B = unique(A(:, 1:2), 'rows'); %find the unique xy pairs
C = nan(size(B, 1), 1); % one mean per unique pair (size(B, 1) counts rows)
% calculate means
for ii = 1:size(B, 1)
C(ii) = mean(A(A(:, 1) == B(ii, 1) & A(:, 2) == B(ii, 2), 3));
end
C =
12
6
The step inside the for loop uses logical indexing to find the mean of rows that match the current xy pair in the loop.
Use unique to get the unique rows and use the returned indexing array to find the ones that should be averaged and ask accumarray to do the averaging part:
[C,~,J]=unique(A(:,1:2), 'rows');
B=[C, accumarray(J,A(:,3),[],@mean)];
For your example
>> [C,~,J]=unique(A(:,1:2), 'rows')
C =
1 4
1 5
J =
1
1
1
2
2
C contains the unique rows and J shows which rows in the original matrix correspond to the rows in C then
>> accumarray(J,A(:,3),[],@mean)
ans =
12
6
returns the desired averages and
>> B=[C, accumarray(J,A(:,3),[],@mean)]
B =
1 4 12
1 5 6
is the answer.
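For comparison, the same grouping can be sketched with findgroups and splitapply, which accept several grouping vectors at once. This assumes MATLAB R2015b or later; the names G, xk and yk are illustrative.

```matlab
A = [1 4 12; 1 4 14; 1 4 10; 1 5 5; 1 5 7];

% One group per distinct (x, y) pair; xk and yk are the keys of each group
[G, xk, yk] = findgroups(A(:, 1), A(:, 2));
B = [xk yk splitapply(@mean, A(:, 3), G)]
```

This avoids the explicit unique(...,'rows') call, at the cost of requiring a more recent release.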

MATLAB find mean of column in matrix using two different indices

I have a 22007x3 matrix with data in column 3 and two separate indices in columns 1 and 2.
eg.
x =
1 3 4
1 3 5
1 3 5
1 16 4
1 16 3
1 16 4
2 4 1
2 4 3
2 11 2
2 11 3
2 11 2
I need to find the mean of the values in column 3 when the values in column 1 are the same AND the values in column 2 are the same, to end up with something like:
ans =
1 3 4.6667
1 16 3.6667
2 4 2
2 11 2.3333
Please bear in mind that in my data, the number of times the values in column 1 and 2 occur can be different.
Two options I've tried already are the meshgrid/accumarray option, using two distinct unique functions and a 3D array:
[U, ix, iu] = unique(x(:, 1));
[U2,ix2,iu2] = unique(x(:,2));
[c, r, j] = meshgrid((1:size(x(:, 1), 2)), iu, iu2);
totals = accumarray([r(:), c(:), j(:)], x(:), [], @nanmean);
which gives me this:
??? Maximum variable size allowed by the program is exceeded.
Error in ==> meshgrid at 60
xx = xx(ones(ny,1),:,ones(nz,1));
and the loop option,
for i=1:size(x,1)
if x(i,2)== x(i+1,2);
totals(i,:)=accumarray(x(:,1),x(:,3),[],@nanmean);
end
end
which is obviously so very, very wrong, not least because of the x(i+1,2) bit.
I'm also considering creating separate matrices depending on how many times a value in column 1 occurs, but that would be long and inefficient, so I'm loath to go down that road.
Group on the first two columns with unique(...,'rows'), then accumulate only the third column. It is generally best to accumulate only the values that actually need aggregating and to leave the keys (the first two columns) out of it; you can reattach them afterwards from unX:
[unX,~,subs] = unique(x(:,1:2),'rows');
out = [unX accumarray(subs,x(:,3),[],@nanmean)];
out =
1 3 4.6667
1 16 3.6667
2 4 2
2 11 2.3333
This is an ideal opportunity to use sparse matrix math.
x = [ 1 2 5;
1 2 7;
2 4 6;
3 6 7;
1 4 8;
2 4 8;
1 1 10]; % for example
SM = sparse(x(:,1),x(:,2), x(:,3));
disp(SM)
Result:
(1,1) 10
(1,2) 12
(1,4) 8
(2,4) 14
(3,6) 7
As you can see, we did the "accumulate same indices into same container" in one fell swoop. Now you need to know how many elements you have:
NE = sparse(x(:,1), x(:,2), ones(size(x(:,1))));
disp(NE);
Result:
(1,1) 1
(1,2) 2
(1,4) 1
(2,4) 2
(3,6) 1
Finally, you divide one by the other to get the mean (only use elements that have a value):
matrixMean = SM;
nz = find(NE>0);
matrixMean(nz) = SM(nz) ./ NE(nz);
If you then disp(matrixMean), you get
(1,1) 10
(1,2) 6
(1,4) 8
(2,4) 7
(3,6) 7
If you want to access the individual elements differently, then after you have computed SM and NE you can do
[i, j, n] = find(NE);
lin = sub2ind(size(NE), i, j); % linear indices of the occupied cells
means = full(SM(lin)) ./ n; % element-wise: one mean per occupied cell
disp([i j means]);
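The same sum-divided-by-count idea also works without sparse matrices, and without nanmean (which needs the Statistics Toolbox). Below is a minimal sketch on the question's sample data that combines unique(...,'rows') with two accumarray calls; it assumes the data contains no NaN values, and the names keys, sums, counts and means are illustrative.

```matlab
x = [1 3 4; 1 3 5; 1 3 5; 1 16 4; 1 16 3; 1 16 4;
     2 4 1; 2 4 3; 2 11 2; 2 11 3; 2 11 2];

[keys, ~, subs] = unique(x(:, 1:2), 'rows');
sums   = accumarray(subs, x(:, 3));   % total of column 3 per (col1, col2) pair
counts = accumarray(subs, 1);         % number of rows per pair
means  = sums ./ counts;
out = [keys means]
```

Because the groups are indexed by subs rather than by the raw column values, memory use depends on the number of distinct pairs, not on how large the index values are.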

MATLAB: Keeping a number of random rows in a matrix which satisfy certain conditions

I have a matrix in Matlab which looks similar to this, except with thousands of rows:
A =
5 6 7 8
6 1 2 3
5 1 4 8
5 2 3 7
5 8 7 2
6 1 3 8
5 2 1 6
6 3 2 1
I would like to get out a matrix that has three random rows with a '5' in the first column and three random rows with a '6' in the first column. So in this case, the output matrix would look something like this:
A =
5 6 7 8
6 1 2 3
5 2 3 7
6 1 3 8
5 2 1 6
6 3 2 1
The rows must be random, not just the first three or the last three in the original matrix.
I'm not really sure how to begin this, so any help would be greatly appreciated.
EDIT: This is the most successful attempt I've had so far. I found all the rows with a '5' in the first column:
BLocation = find(A(:,1) == 5);
B = A(BLocation,:);
Then I was trying to use 'randsample' like this to find three random rows from B:
C = randsample(B,3);
But 'randsample' does not work with a matrix.
I also think this could be done a little more efficiently.
You need to run randsample on the row indices that satisfy the conditions, i.e. equality to 5 or 6.
n = size(A,1);
% construct the linear indices for rows with 5 and 6
indexA = 1:n;
index5 = indexA(A(:,1)==5);
index6 = indexA(A(:,1)==6);
% sample three (randomly) from each
nSamples = 3;
r5 = randsample(index5, nSamples);
r6 = randsample(index6, nSamples);
% new matrix from concatenation
B = [A(r5,:); A(r6,:)];
Update: As yuk suggested, you can also use find in place of the original index construction, which turns out to be faster.
Benchmark (MATLAB R2012a)
A = randi(10, 1e8, 2); % 10^8 rows random matrix of 1-10
tic;
n = size(A,1);
indexA = 1:n;
index5_1 = indexA(A(:,1)==5);
toc
tic;
index5_2 = find(A(:,1)==5);
toc
Elapsed time is 1.234857 seconds.
Elapsed time is 0.679076 seconds.
You can do this as follows:
desiredMat=[];
mat1=A(A(:,1)==5,:);
mat1=mat1(randperm(size(mat1,1)),:);
desiredMat=[desiredMat;mat1(1:3,:)];
mat1=A(A(:,1)==6,:);
mat1=mat1(randperm(size(mat1,1)),:);
desiredMat=[desiredMat;mat1(1:3,:)];
The above code uses logical indexing. You can also do this with the find function, although logical indexing is usually the faster of the two.
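Both answers repeat the same steps for 5 and for 6. A minimal sketch that loops over whatever values occur in the first column is shown below; it uses randperm(n, k) to draw k distinct row numbers and assumes every value has at least k rows. The names k, keys, picked and rows are illustrative.

```matlab
A = [5 6 7 8; 6 1 2 3; 5 1 4 8; 5 2 3 7;
     5 8 7 2; 6 1 3 8; 5 2 1 6; 6 3 2 1];
k = 3;                       % rows to keep per value (assumed <= rows available)
keys = unique(A(:, 1));      % distinct values in the first column

picked = [];
for i = 1:numel(keys)
    rows = find(A(:, 1) == keys(i));          % candidate row numbers
    rows = rows(randperm(numel(rows), k));    % k of them, without repetition
    picked = [picked; rows];                  %#ok<AGROW>
end
B = A(picked, :)
```

The result lists the sampled rows grouped by value; use B = A(sort(picked), :) instead if the original row order should be kept.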