How to count patterns columnwise in Matlab? - matlab

I have a matrix S in Matlab that looks like the following:
2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1
I would like to count patterns of values column-wise. I am interested in the frequency of the numbers that follow right after number 3 in any of the columns. For instance, number 3 occurs three times in the first column. The first time we observe it, it is followed by 3, the second time it is followed by 3 again and the third time it is followed by 4. Thus, the frequency for the patterns observed in the first column would look like:
3-3: 66.66%
3-4: 33.33%
3-1: 0%
3-2: 0%

To generate the output, you could use the convenient tabulate
S = [
2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1];
idx = find(S(1:end-1,:)==3);
S2 = S(2:end,:);
tabulate(S2(idx))
Value Count Percent
1 0 0.00%
2 0 0.00%
3 4 66.67%
4 2 33.33%

Here's one approach, finding the 3's then looking at the following digits
[i,j]=find(S==3);
k=i+1<=size(S,1);
T=S(sub2ind(size(S),i(k)+1,j(k))) %// the elements of S that are just below a 3
R=arrayfun(@(x) sum(T==x)./sum(k),1:max(S(:))).' %// get the probability of each digit

I'm going to restate your problem statement in a way that I can understand and my solution will reflect this new problem statement.
For a particular column, locate the locations that contain the number 3.
Look at the row immediately below these locations and look at the values at these locations
Take these values and tally up the total number of occurrences found.
Repeat these for all of the columns and update the tally, then determine the percentage of occurrences for the values.
We can do this by the following:
A = [2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1]; %// Define your matrix
[row,col] = find(A(1:end-1,:) == 3);
vals = A(sub2ind(size(A), row+1, col));
h = 100*accumarray(vals, 1) / numel(vals)
h =
0
0
66.6667
33.3333
Let's go through the above code slowly. The first few lines define your example matrix A. Next, we take a look at all of the rows except for the last row of your matrix and see where the number 3 is located with find. We skip the last row because we want to be sure we stay within the bounds of your matrix. If there is a number 3 in the last row, looking at the row below it would index past the end of the matrix, and sub2ind would raise an out-of-range error because there's nothing there!
Once we do this, we take a look at those values in the matrix that are 1 row beneath those that have the number 3. We use sub2ind to convert the row and column locations into linear indices. Next, we tally these values up using accumarray, then normalize by the total number of values counted to get percentages.
The result would be a 4 element array that displays the percentages encountered per number.
To double check, if we look at the matrix, we see that a 3 is followed by another 3 a total of 4 times: the 3s at first column, rows 3 and 4; second column, row 2; and third column, row 6. A 3 is followed by a 4 two times: the 3s at first column, row 5, and second column, row 3.
In total, we have 6 numbers we counted, and so dividing by 6 gives us 4/6 or 66.67% for number 3 and 2/6 or 33.33% for number 4.
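One caveat with the tally above: accumarray(vals, 1) only returns as many rows as the largest value in vals, so the output can come out shorter than the number of labels. The sketch below is one way around that, assuming the labels are the integers 1 up to max(A(:)); the names target and nLabels are mine, not part of the answer.

```matlab
A = [2 2 1 2; 2 3 1 1; 3 3 1 1; 3 4 1 1; 3 1 2 1; 4 1 3 1; 1 1 3 1];
target  = 3;              % value whose successors we tally (hypothetical name)
nLabels = max(A(:));      % assumption: labels are the integers 1..max(A(:))

% Locate the target in every row except the last one
[row, col] = find(A(1:end-1,:) == target);

% Values sitting one row below each hit
vals = A(sub2ind(size(A), row+1, col));

% Passing the output size keeps one entry per label, even for labels
% that never follow the target
h = 100 * accumarray(vals, 1, [nLabels 1]) / numel(vals);
```

With target = 4, for instance, only 1s follow, so without the size argument the result would collapse to a single element instead of four.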

If I got the problem statement correctly, you could efficiently implement this with MATLAB's logical indexing and an approach that is essentially of two lines -
%// Input 2D matrix
S = [
2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1]
Labels = [1:4]'; %//'# Label array
counts = histc(S([false(1,size(S,2)) ; S(1:end-1,:) == 3]),Labels)
Percentages = 100*counts./sum(counts)
Verify/Present results
The styles for presenting the output results listed next use MATLAB's table to show the data in a human-readable format.
Style #1
>> table(Labels,Percentages)
ans =
Labels Percentages
______ ___________
1 0
2 0
3 66.667
4 33.333
Style #2
You can do some fancy string operations to present the results in a more "representative" manner -
>> Labels_3 = strcat('3-',cellstr(num2str(Labels','%1d')'));
>> table(Labels_3,Percentages)
ans =
Labels_3 Percentages
________ ___________
'3-1' 0
'3-2' 0
'3-3' 66.667
'3-4' 33.333
Style #3
If you want to present them in descending sorted manner based on the percentages as listed in the expected output section of the question, you can do so with an additional step using sort -
>> [Percentages,idx] = sort(Percentages,'descend');
>> Labels_3 = strcat('3-',cellstr(num2str(Labels(idx)','%1d')'));
>> table(Labels_3,Percentages)
ans =
Labels_3 Percentages
________ ___________
'3-3' 66.667
'3-4' 33.333
'3-1' 0
'3-2' 0
Bonus Stuff: Finding frequency (counts) for all cases
Now, let's suppose you would like to repeat this process for 1, 2 and 4 as well, i.e. find occurrences after 1, 2 and 4 respectively. In that case, you can iterate the above steps over all cases, and you can use arrayfun to do so -
%// Get counts
C = cell2mat(arrayfun(@(n) histc(S([false(1,size(S,2)) ; S(1:end-1,:) == n]),...
1:4),1:4,'Uni',0))
%// Get percentages
Percentages = 100*bsxfun(@rdivide, C, sum(C,1))
Giving us -
Percentages =
90.9091 20.0000 0 100.0000
9.0909 20.0000 0 0
0 60.0000 66.6667 0
0 0 33.3333 0
Thus, in Percentages, the first column holds the percentages of [1,2,3,4] that occur right after a 1 somewhere in the input matrix. As an example, column 3 of Percentages is what you had in the sample output when looking for elements right after 3 in the input matrix.
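As a cross-check on the matrix above, the same counts can be built in one accumarray call by pairing every element with the element directly above it. This is only a sketch and assumes the values in S are positive integers, so they can be used as subscripts; the names below and above are mine.

```matlab
S = [2 2 1 2; 2 3 1 1; 3 3 1 1; 3 4 1 1; 3 1 2 1; 4 1 3 1; 1 1 3 1];
n = max(S(:));            % assumption: values are the integers 1..n

above = S(1:end-1,:);     % the "preceding" values
below = S(2:end,:);       % the values right after them, column-wise

% C(b,a) = number of times b appears directly below a
C = accumarray([below(:) above(:)], 1, [n n]);

% Normalize each column so it sums to 100
% (repmat keeps this working on releases without implicit expansion)
Percentages = 100 * C ./ repmat(sum(C,1), n, 1);
```

If some value never occurs in rows 1 to end-1, its column of C sums to zero and the corresponding percentages come out as NaN.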

If you want to compute frequencies independently for each column:
S = [2 2 1 2
2 3 1 1
3 3 1 1
3 4 1 1
3 1 2 1
4 1 3 1
1 1 3 1]; %// data: matrix
N = 3; %// data: number
r = max(S(:));
[R, C] = size(S);
[ii, jj] = find(S(1:end-1,:)==N); %// step 1
count = full(sparse(S(ii+1+(jj-1)*R), jj, 1, r, C)); %// step 2
result = bsxfun(@rdivide, count, sum(S(1:end-1,:)==N)); %// step 3
This works as follows:
find is first applied to determine row and col indices of occurrences of N in S except its last row.
The values in the entries right below the indices of step 1 are accumulated for each column, in variable count. The very convenient sparse function is used for this purpose. Note that this uses linear indexing into S.
To obtain the frequencies for each column, count is divided (with bsxfun) by the number of occurrences of N in each column.
The result in this example is
result =
0 0 0 NaN
0 0 0 NaN
0.6667 0.5000 1.0000 NaN
0.3333 0.5000 0 NaN
Note that the last column correctly contains NaNs because the frequency of the sought patterns is undefined for that column.
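To connect this per-column view with the pooled percentages asked for in the question, here is a small sketch. It swaps sparse for accumarray (my substitution, the tallies are the same) and then sums the counts over columns before normalizing; the variable pooled is a name I made up.

```matlab
S = [2 2 1 2; 2 3 1 1; 3 3 1 1; 3 4 1 1; 3 1 2 1; 4 1 3 1; 1 1 3 1];
N = 3;
r = max(S(:));
[R, C] = size(S);

[ii, jj] = find(S(1:end-1,:) == N);

% Per-column tallies: count(v,c) = times v follows N in column c
count = accumarray([S(ii+1+(jj-1)*R), jj], 1, [r C]);

% Pooling over all columns gives the overall frequencies
pooled = sum(count, 2) / sum(count(:));
```

Here pooled holds fractions rather than percentages, so multiply by 100 to compare with the other answers.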

Related

MATLAB: sample from population randomly many times?

I am aware of MATLAB's datasample which allows to select k times from a certain population. Suppose population=[1,2,3,4] and I want to uniformly sample, with replacement, k=5 times from it. Then:
datasample(population,k)
ans =
1 3 2 4 1
Now, I want to repeat the above experiment N=10000 times without using a for loop. I tried doing:
datasample(repmat(population,N,1),5,2)
But the output I get is (just a short excerpt below):
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
1 3 2 1 3
Every row (result of an experiment) is the same! But obviously they should be different... It's as though some random seed is not updating between rows. How can I fix this? Or some other method I could use that avoids a for loop? Thanks!
You seem to be misunderstanding how datasample works. According to the documentation, when you pass a matrix, datasample samples whole slices along the chosen dimension: rows by default, or columns when the dimension argument is 2, as in your call. Because you repeated the population vector 10000 times, every row of the matrix is identical, and the randomly chosen columns are extracted for all rows at once. The selection is random, but it is made only once, so every row of the result is the same, which is why you are getting that "error".
As such, I wouldn't use datasample here if it is your intention to avoid looping. You can use datasample, but you'd have to loop over each call and you explicitly said that this is not what you want.
What I would recommend you do is first create your population vector to have whatever you desire in it, then generate a random index matrix where each value is between 1 up to as many elements as there are in population. This matrix is in such a way where the number of columns is the number of samples and the number of rows is the number of trials. Once you create this matrix, simply use this to index into your vector to achieve the desired sampling matrix. To generate this random index matrix, randi is a fine choice.
Something like this comes to mind:
N = 10000; %// Number of trials
M = 5; %// Number of samples per trial
population = 1:4; %// Population vector
%// Generate random indices
ind = randi(numel(population), N, M);
%// Get the stuff
out = population(ind);
Here's the first 10 rows of the output:
>> out(1:10,:)
ans =
4 3 1 4 2
4 4 1 3 4
3 2 2 2 3
1 4 2 2 2
1 2 3 4 2
2 2 3 2 1
4 1 3 2 4
1 4 1 3 1
1 1 2 4 4
1 2 4 2 1
I think the above does what you want. Also keep in mind that the above code generalizes to any population vector you want. You simply have to change the vector and it will work as advertised.
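If you want to convince yourself that the trials really differ and that the sampling is roughly uniform, a quick check along these lines may help. It is only a sketch: the seed, the population values and the name freq are my own choices for illustration.

```matlab
rng(0);                          % fixed seed, only so the run is repeatable
N = 10000;                       % number of trials
M = 5;                           % samples per trial
population = [10 20 30 40];      % example population, not necessarily 1:p

ind = randi(numel(population), N, M);   % N-by-M random indices
out = population(ind);                  % same shape as ind

% Relative frequency of each population element over all draws
freq = accumarray(ind(:), 1, [numel(population) 1]) / numel(ind);

% Number of distinct trials (rows); more than 1 means rows are not all equal
nDistinct = size(unique(out, 'rows'), 1);
```

Each entry of freq should land close to 1/numel(population), up to sampling noise.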
datasample interprets each column of your data as one element of your population, sampling among all columns.
To fix this you could call datasample N times in a loop; instead, I would use randi:
population(randi(numel(population),N,5))
assuming your population is always 1:p, you could simplify to:
randi(p,N,5)
OK, so both of the current answers say not to use datasample and to use randi instead. However, I have a solution for you with datasample and arrayfun.
>> population = [1 2 3 4];
>> k = 5; % Number of samples
>> n = 1000; % Number of times to execute datasample(population, k)
>> s = arrayfun(@(m) datasample(population, m), k*ones(n, 1), 'UniformOutput', false); % n calls, k samples each
>> s = cell2mat(s);
s =
1 4 1 4 4
4 1 2 2 4
2 4 1 2 1
1 4 3 3 1
4 3 2 3 2
We need to make sure to use 'UniformOutput', false with arrayfun because each call returns a vector rather than a scalar. The cell2mat call is needed as the result of arrayfun is a cell array.

Vector-defined cross product application matrix and vectorization in Matlab

I ran into an operation I cannot seem to achieve via vectorization.
Let's say I want to find the matrix of the application defined by
h: X -> cross(V,X)
where V is a predetermined vector (both X and V are 3-by-1 vectors).
In Matlab, I would do something like
M= cross(repmat(V,1,3),eye(3,3))
to get this matrix. For instance, V=[1;2;3] yields
M =
0 -3 2
3 0 -1
-2 1 0
Let's now suppose that I have a 3-by-N matrix
V=[V_1,V_2...V_N]
with each column defining its own cross-product operation. For N=2, here's a naive try to find the two cross-product matrices that V's columns define
V=[1,2,3;4,5,6]'
M=cross(repmat(V,1,3),repmat(eye(3,3),1,2))
results in
V =
1 4
2 5
3 6
M =
0 -6 2 0 -3 5
3 0 -1 6 0 -4
-2 4 0 -5 1 0
while I was expecting
M =
0 -3 2 0 -6 5
3 0 -1 6 0 -4
-2 1 0 -5 4 0
Columns 2 and 5 are swapped.
Is there a way to achieve this without for loops?
Thanks!
First, make sure you read the documentation of cross very carefully when dealing with matrices:
It says:
C = cross(A,B,DIM), where A and B are N-D arrays, returns the cross
product of vectors in the dimension DIM of A and B. A and B must
have the same size, and both SIZE(A,DIM) and SIZE(B,DIM) must be 3.
Bear in mind that if you don't specify DIM, cross operates along the first dimension whose size is 3, which is dimension 1 here, so you're operating along the columns. In your first case, you specified both the inputs A and B to be 3 x 3 matrices. Therefore, the output will be the cross product of each column independently. As such, the i'th column of the output contains the cross product of the i'th column of A and the i'th column of B. The number of rows must be 3, and the number of columns must match between A and B.
You're getting what you expect because the first input A has [1;2;3] duplicated correctly over the columns three times. From your second piece of code, what you're expecting for V as the first input (A) looks like this:
V =
1 1 1 4 4 4
2 2 2 5 5 5
3 3 3 6 6 6
However, when you do repmat, you are in fact alternating between each column. In fact, you are getting this:
V =
1 4 1 4 1 4
2 5 2 5 2 5
3 6 3 6 3 6
repmat tiles matrices together and you specified that you wanted to tile V horizontally three times. That's not the ordering you need. This explains why the columns are swapped: the second, fourth and sixth columns of V should actually appear as the last three columns instead. As such, the ordering of your input columns is the reason why the output appears swapped.
You need to re-order V so that the first three vectors are [1;2;3], followed by three copies of [4;5;6]. Therefore, you can generate your original V matrix first, then create a new matrix in which the first column is repeated three times, followed by the second column repeated three times:
>> V = [1,2,3;4,5,6].';
>> V = V(:, [1 1 1 2 2 2])
V =
1 1 1 4 4 4
2 2 2 5 5 5
3 3 3 6 6 6
Now use V with cross and maintain the same second input:
>> M = cross(V, repmat(eye(3), 1, 2))
M =
0 -3 2 0 -6 5
3 0 -1 6 0 -4
-2 1 0 -5 4 0
Looks good to me!
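For a general 3-by-N V, the hard-coded [1 1 1 2 2 2] can be generated instead. The sketch below uses kron to build the repeated column index and then reshapes the result into a 3-by-3-by-N stack; the example data and the names idx and M3 are mine.

```matlab
V = [1 2 3; 4 5 6; 7 8 10].';         % 3-by-N, one vector per column (example data)
N = size(V, 2);

idx = kron(1:N, ones(1, 3));          % [1 1 1 2 2 2 ... N N N]
M = cross(V(:, idx), repmat(eye(3), 1, N));   % 3-by-3N, blocks side by side

% Stack the blocks: M3(:,:,k) is the matrix of x -> cross(V(:,k), x)
M3 = reshape(M, 3, 3, N);

% Quick check against cross on an arbitrary vector
x = [0.5; -1; 2];
err = norm(M3(:,:,2)*x - cross(V(:,2), x));
```

The reshape works because MATLAB stores arrays column by column, so each consecutive group of three columns of M becomes one slice.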

Find two MAXIMUM values' position in 3D matrix (MATLAB)

I have been having a problem identifying the positions of the two maximum values in a 3D matrix (MATLAB). Say I have matrix A output as follows:
A(:,:,1) =
5 3 5
0 1 0
A(:,:,2) =
0 2 0
8 0 8
A(:,:,3) =
3 0 0
0 7 7
A(:,:,4) =
6 6 0
4 0 0
For the first slice, A(:,:,1), I want to identify that the first row has the highest value (A=5). But I need the two index positions, which in this case are 1 and 3. The same goes for the other slices A(:,:,k).
I have searched through SO, but since I am not good at MATLAB, I couldn't find a way to work this out.
Please help me with this. It would be better if I don't need to use a for loop to get the desired output.
Shot #1 Finding the indices for maximum values across each 3D slice -
%// Reshape A into a 2D matrix
A_2d = reshape(A,[],size(A,3))
%// Find linear indices of maximum numbers for each 3D slice
idx = find(reshape(bsxfun(@eq,A_2d,max(A_2d,[],1)),size(A)))
%// Convert those linear indices to dim1, dim2,dim3 indices and
%// present the final output as a Nx3 array
[dim1_idx,dim2_idx,dim3_idx] = ind2sub(size(A),idx)
out_idx_triplet = [dim1_idx dim2_idx dim3_idx]
Sample run -
>> A
A(:,:,1) =
5 3 5
0 1 0
A(:,:,2) =
0 2 0
8 0 8
A(:,:,3) =
3 0 0
0 7 7
A(:,:,4) =
6 6 0
4 0 0
out_idx_triplet =
1 1 1
1 3 1
2 1 2
2 3 2
2 2 3
2 3 3
1 1 4
1 2 4
out_idx_triplet(:,2) is what you are looking for!
Shot #2 Finding the indices for highest two numbers across each 3D slice -
%// Get size of A
[m,n,r] = size(A)
%// Reshape A into a 2D matrix
A_2d = reshape(A,[],r)
%// Find linear indices of highest two numbers for each 3D slice
[~,sorted_idx] = sort(A_2d,1,'descend')
idx = bsxfun(@plus,sorted_idx(1:2,:),[0:r-1]*m*n)
%// Convert those linear indices to dim1, dim2,dim3 indices
[dim1_idx,dim2_idx,dim3_idx] = ind2sub(size(A),idx(:))
%// Present the final output as a Nx3 array
out_idx_triplet = [dim1_idx dim2_idx dim3_idx]
out_idx_triplet(:,2) is what you are looking for!
The following code gives you the column and row of the respective maximum.
The first step will obtain the maximum of each sub-matrix containing the first and second dimension. Since max works per default with the first dimension, the matrix is reshaped to combine the original first and second dimension.
max_vals = max(reshape(A,size(A,1)*size(A,2),size(A,3)));
max_vals =
5 8 7 6
In the second step, the index of elements equal to the respective max_vals of each sub-matrix is obtained using arrayfun over the third dimension. Since the output of arrayfun are cells, cell2mat is used to transform the output into a matrix. As a last step, the linear index from find is transformed into sub-indices by ind2sub.
[i,j] = ind2sub(size(A(:,:,1)),cell2mat(arrayfun(@(i)find(A(:,:,i)==max_vals(i)),1:size(A,3),'UniformOutput',false)))
i =
1 2 2 1
1 2 2 1
j =
1 1 2 1
3 3 3 2
Hence, the values in j are the ones you want to have.
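If you would rather have one row per slice holding the two column positions, something like the following sketch could be used. It assumes every slice contains exactly two maxima, as in the example; with a different number of ties the final reshape would not line up. The names isMax and colsPerSlice are mine.

```matlab
A = cat(3, [5 3 5; 0 1 0], [0 2 0; 8 0 8], [3 0 0; 0 7 7], [6 6 0; 4 0 0]);

% Flatten each slice into a column and mark the entries equal to its maximum
A_2d  = reshape(A, [], size(A,3));
isMax = reshape(bsxfun(@eq, A_2d, max(A_2d, [], 1)), size(A));

% Column subscripts of the maxima, listed slice by slice
[~, colIdx, ~] = ind2sub(size(A), find(isMax));

% Assumption: exactly two maxima per slice, so pair them up row-wise
colsPerSlice = reshape(colIdx, 2, []).';
```

Row k of colsPerSlice then holds the two column positions for A(:,:,k).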

matlab: how to compare two matrices to get the indices of the elements that differ from one to another

I'm using Matlab with very big, similar multidimensional matrices and I'd like to find the differences between them.
The two matrices have the same size.
Here is an example:
A(:,:,1) =
1 1 1
1 1 1
1 1 1
A(:,:,2) =
1 1 1
1 1 1
1 1 1
A(:,:,3) =
1 1 1
1 1 1
1 1 1
B(:,:,1) =
1 1 99
1 1 99
1 1 1
B(:,:,2) =
1 1 1
1 1 1
1 1 1
B(:,:,3) =
1 1 99
1 1 1
1 1 1
I need a function that gives me the indices of the values that differ; in this example this would be:
output =
1 3 1
1 3 3
2 3 1
I know that I can use functions like find(B~=A) or find(~ismember(B, A)), but I don't know how to change their output to the indices I want.
Thank you all!
You almost have it correct! Remember that find finds column major indices of where in your matrix (or vector) the Boolean condition you want to check for is being satisfied. If you want the actual row/column/slice locations, you need to use ind2sub. You would call it this way:
%// To reproduce your problem
A = ones(3,3,3);
B = ones(3,3,3);
B(7:8) = 99;
B(25) = 99;
%// This is what you call
[row,col,dim] = ind2sub(size(A), find(A ~= B));
The first parameter to ind2sub is the matrix size of where you're searching. Since the dimensions of A are equal to B, we can choose either A or B for the first input, and we use size to help us determine the size of the matrix. The second input are the column major indices that we want to access the matrix. These are simply the result of find.
row, col, and dim will give you the rows, columns and slices of which elements in your 3D matrix were not equal. Also note that these will be column vectors, as the output of find will produce a column vector of column-major indices. As such, we can concatenate each of the column vectors into a single matrix and display your information. Therefore:
locations = [row col dim];
disp(locations);
1 3 1
2 3 1
1 3 3
As such, the first column of this matrix tells you the row locations of where the matrix values are unequal, the second column of this matrix tells you the column locations of where the matrix values are unequal, and finally the third column tells you the slices of where the matrix values are unequal. Therefore, we have three points in this matrix that are unequal, which are located at (1,3,1), (2,3,1) and (1,3,3) respectively. Note that this is unsorted due to the nature of find as it searches amongst the columns of your matrix first. If you want to have this sorted like you have in your example output, use sortrows. If we do this, we get:
sortrows(locations)
ans =
1 3 1
1 3 3
2 3 1
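If the number of dimensions is not known in advance, the three named outputs can be replaced by a cell array that collects one subscript vector per dimension. This is just a sketch of that idea; the name subs is mine.

```matlab
A = ones(3,3,3);
B = A;
B(7:8) = 99;
B(25)  = 99;

% One output cell per dimension of A
subs = cell(1, ndims(A));
[subs{:}] = ind2sub(size(A), find(A ~= B));

% Concatenate the subscript columns and sort the rows
locations = sortrows([subs{:}]);
```

Each row of locations is then the full subscript of one differing element, whatever the dimensionality of A and B.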

Count non-zero entries in each column of a matrix

If I have a matrix:
A = [1 2 3 4 5; 1 1 6 1 2; 0 0 9 0 1]
A =
1 2 3 4 5
1 1 6 1 2
0 0 9 0 1
How can I count the number of non-zero entries for each column? For example the desired output for this matrix would be:
2, 2, 3, 2, 3
I am not sure how to do this as size, length or numel do not appear to meet the requirements. Perhaps it would be best to remove zero entries first?
It's simply
> A ~= 0
ans =
1 1 1 1 1
1 1 1 1 1
0 0 1 0 1
> sum(A ~= 0, 1)
ans =
2 2 3 2 3
Here's another solution I can suggest that isn't very fast for dense matrices but is quite fast for sparse matrices (thanks @user1877862!). It also mimics how one might do this in a compiled language, like C or Java, and may be useful for research purposes too. First find the row and column locations that are non-zero, then compute a histogram of just the column locations to count how often a non-zero appears in each column. In other words:
[~,col] = find(A ~= 0);
counts = histc(col, 1:size(A,2));
find outputs the row and column locations of where a matrix satisfies some Boolean condition inside the argument of the function. We ignore the first output as we aren't concerned with the row locations.
The output we get is:
counts =
2
2
3
2
3
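A variant of the same idea swaps histc for accumarray, which lets you fix the output length to the number of columns and get a row vector like the one in the question. This is a sketch of an alternative, not a claim that it is faster.

```matlab
A = [1 2 3 4 5; 1 1 6 1 2; 0 0 9 0 1];

% Column location of every non-zero entry
[~, col] = find(A);

% One bin per column of A; transpose to get a row vector
counts = accumarray(col, 1, [size(A,2) 1]).';
```

For this A the result matches sum(A ~= 0, 1).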