Grouping by nested unique values

I have a matrix A in Matlab:
A = [176 5406 1 4 7903;
155 5406 1 5 7903;
122 5407 0 4 7903;
140 5407 0 5 7904;
130 5407 0 3 7904];
For reference, the second column is a user ID and the fifth column is a time. So 5406 is one user and 5407 is another user. Both of these users have data in the first and fourth columns that I am interested in accessing.
So basically what I want to do is:
For each user take the median of their values in the first column. I have written code (below) that works for this.
If there are two equal "time" values in column 5 for a user, then I want to average the corresponding values in column 4. For example, for user 5406 the time values are both 7903, so I want the average of the values in column 4, i.e. the average of 4 and 5, to end up with one value (4.5).
For the next user, 5407, I will have two final values: one will be the average of 5 and 3 (because 7904 is repeated) and one will be 4 (because 7903 is not repeated).
I am a bit confused about how to do this. I know there needs to be an if statement of some sort, but I've been stuck on it for ages. Can anyone help?
Thanks
Code for the first part:
u=unique(A(:,2));
for i=1:size(u,1)
M=find(A(:,2)==u(i)); % all rows belonging to user u(i)
med(i)=median(A(M,1));
end

You could run unique on the time values of each user (within the loop) and use a similar inner loop to collect the mean for each unique timestamp of that user.
But here I think it's neater to use accumarray. In the first example below, I've modified your code just a bit.
% Get unique
[user, ~, userIdx] = unique(A(:,2));
nUser = numel(user);
% Allocate container for result
med = zeros(nUser,1);
men = cell(nUser,1); % <-- Need a cell since length of result could vary
for i = 1:nUser
% Median of col #1
med(i) = median(A(userIdx == i, 1));
% Mean of col #4 for unique times
[~, ~, timeIdx] = unique(A(userIdx == i, 5));
men{i} = accumarray(timeIdx, A(userIdx == i, 4), [], @mean);
end
Result:
>> med =
165.5
130
>> celldisp(men)
men{1} =
4.5
men{2} =
4
4
To squeeze it a bit more, you could take the unique times over the entire A and use accumarray for both:
[~, ~, userIdx] = unique(A(:,2));
[~, ~, timeIdx] = unique(A(:,5));
med = accumarray(userIdx, A(:,1), [], @median);
men = accumarray([userIdx timeIdx], A(:,4), [], @mean, NaN);
This gives men not as a cell but as a matrix. Therefore the blank spaces have to be filled (here I chose NaN, since 0 could be a result of @mean).
men =
4.5 NaN
4 4
If you want it as a cell without NaN you could just loop over the rows and pick non-NaN values, or place only the men calculation in the loop, or any other way...
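A minimal sketch of the "loop over the rows and pick non-NaN values" idea, assuming the matrix form of men from above. The name menCell is made up for illustration. Note that this would also drop a genuine NaN result, which can only occur if column 4 itself contains NaN.

```matlab
A = [176 5406 1 4 7903;
     155 5406 1 5 7903;
     122 5407 0 4 7903;
     140 5407 0 5 7904;
     130 5407 0 3 7904];

[~, ~, userIdx] = unique(A(:,2));
[~, ~, timeIdx] = unique(A(:,5));
men = accumarray([userIdx timeIdx], A(:,4), [], @mean, NaN);

% One cell per user, keeping only the entries that were actually filled
menCell = cell(size(men,1), 1);   % menCell is a hypothetical name
for k = 1:size(men,1)
    row = men(k,:);
    menCell{k} = row(~isnan(row)).';
end
```

For the sample data this should give the same contents as the cell array from the first version.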
If you are sure that column 4 of A doesn't contain any negative or zero numbers (mean value should never risk being 0), you could collect the result of men as a sparse matrix instead
men = accumarray([userIdx timeIdx], A(:,4), [], @mean, 0, true);
men =
(1,1) 4.5
(2,1) 4
(2,2) 4
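If a "long" list of results is easier to work with than a matrix or a sparse array, one alternative (a different technique from the ones above, not part of the original answer) is to group on the (user, time) pairs directly with unique(..., 'rows'). A rough sketch, where pairs, g and out are illustrative names:

```matlab
A = [176 5406 1 4 7903;
     155 5406 1 5 7903;
     122 5407 0 4 7903;
     140 5407 0 5 7904;
     130 5407 0 3 7904];

% Each distinct (user, time) combination becomes one group
[pairs, ~, g] = unique(A(:,[2 5]), 'rows');

% Mean of column 4 within each group
m = accumarray(g, A(:,4), [], @mean);

% One row per group: [user time mean]
out = [pairs m];
```

This avoids the fill value altogether, because only combinations that exist in A produce a row.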

Here is another solution for your task that avoids explicit loops:
Median values.
u=unique(A(:,2));
umedians = arrayfun( @(x) median (A( A(:,2)==x, 1)), u);
Explanation:
First find all unique users. Then use arrayfun to select the rows of each user and calculate the median of their values in column 1.
Average values of column 4.
This task is a bit harder. We can go this way:
temp = arrayfun( @(x) unique(A ( A(:,2)==x,5 )), u, 'UniformOutput',false);
result = cellfun( @(y,z) arrayfun( @(x) mean( A( A(:,2) == u(z) & A(:,5) == x ,4) ), ...
y, 'UniformOutput',false), temp , num2cell( [1:size(u,1)]'), 'UniformOutput',false)
Explanation: first find all unique times for each user and save them in the cell array temp. Then, for each cell, find the rows that share the same time and calculate their mean. So we use cellfun to do this for each cell of temp, with an arrayfun inside it to calculate the mean for each time.
Hope it helps!

Related

Merge 2 vectors on equal time values

I have collected two types of data. One is a struct Outputs with 3 fields: Outputs.time, Outputs.signals and an unimportant one. Outputs.time is a column vector containing all the time values (where the data is sampled). Outputs.signals has 15 rows, with the values and properties of one signal on each row (so there are 15 signals in total). Consequently, Outputs.signals(i).values has the same number of rows as Outputs.time.
Now I have another table with 4 columns: LabData.time, LabData.NdBoiler, LabData.NdOutput and an unimportant one. Outputs.time contains all the computer-sampled data, while LabData.time contains only some measurements taken by hand. So Outputs.time is much larger than LabData.time, but at certain times (where Outputs.time = LabData.time) there are values for both Outputs.signals and the other columns of LabData.
The goal is to put the values of LabData.NdBoiler and LabData.NdOutput in Outputs.signals(16) and Outputs.signals(17) for the time samples where the value is known. For the other samples, Outputs.signals(16) = NaN and Outputs.signals(17) = NaN. But I don't know how to do that. Could you help me?
Example:
Outputs.time = [1; 2; 3; 4; 5];
Outputs.signals(1).values = [1111; 2222; 3333; 4444; 5555]; %and so on for the other signals
LabData.time = [2; 4];
LabData.NdBoiler = [1.23; 1.32];
%% Now the final result should be
Outputs.signals(16).values = [NaN; 1.23; NaN; 1.32; NaN]

The idea is to first create a vector of NaNs and then match the time points using ismember to substitute in the values you know.
Outputs.signals(16).values = nan(length(Outputs.time),1); %Column vector of NaNs
Lia = ismember(Outputs.time,LabData.time); %Where do the times match?
Outputs.signals(16).values(Lia) = LabData.NdBoiler; %substitute
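The substitution above relies on every lab time appearing in Outputs.time and on both time vectors being in the same order. If that is not guaranteed, the second output of ismember can be used to pick the matching lab row for each sample. A sketch under those assumptions, where the unsorted lab times and the extra sample at time 9 are made up for illustration, and times are assumed to match exactly (no floating-point rounding differences):

```matlab
Outputs.time = [1; 2; 3; 4; 5];
LabData.time = [4; 2; 9];               % unsorted, and 9 never occurs in Outputs.time
LabData.NdBoiler = [1.32; 1.23; 9.99];

vals = nan(size(Outputs.time));         % start with all NaN

% found(k) is true if Outputs.time(k) is a lab time;
% loc(k) is the row of LabData holding that time (0 if none)
[found, loc] = ismember(Outputs.time, LabData.time);
vals(found) = LabData.NdBoiler(loc(found));

Outputs.signals(16).values = vals;
```

Lab samples with no matching time are simply ignored here, rather than causing a size mismatch in the assignment.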

Calculation the elements of different sized matrix in Matlab

Can anybody help me to find out the method to calculate the elements of different sized matrix in Matlab ?
Let say that I have 2 matrices with numbers.
Example:
A=[1 2 3;
4 5 6;
7 8 9]
B=[10 20 30;
40 50 60]
First, we need to find the maximum number in each column.
In this case, Ans=[40 50 60].
Then we need to find the coefficient (k).
The coefficient (k) is equal to 1 divided by the number of columns of matrix A.
In this case, coefficient (k)=1/3=0.33.
I want to create a matrix C filled with the following calculation.
Example in MS Excel.
H4 = ABS((C2-C6)/C9)*0.33+ABS((D2-D6)/D9)*0.33+ABS((E2-E6)/E9)*0.33
I4 = ABS((C3-C6)/C9)*0.33+ABS((D3-D6)/D9)*0.33+ABS((E3-E6)/E9)*0.33
J4 = ABS((C4-C6)/C9)*0.33+ABS((D4-D6)/D9)*0.33+ABS((E4-E6)/E9)*0.33
And then (Like above)
H5 = ABS((C2-C7)/C9)*0.33+ABS((D2-D7)/D9)*0.33+ABS((E2-E7)/E9)*0.33
I5 = ABS((C3-C7)/C9)*0.33+ABS((D3-D7)/D9)*0.33+ABS((E3-E7)/E9)*0.33
J5 = ABS((C4-C7)/C9)*0.33+ABS((D4-D7)/D9)*0.33+ABS((E4-E7)/E9)*0.33
C =
0.34 =|(1-10)|/40*0.33+|(2-20)|/50*0.33+|(3-30)|/60*0.33
0.28 =|(4-10)|/40*0.33+|(5-20)|/50*0.33+|(6-30)|/60*0.33
0.22 =|(7-10)|/40*0.33+|(8-20)|/50*0.33+|(9-30)|/60*0.33
0.95 =|(1-40)|/40*0.33+|(2-50)|/50*0.33+|(3-60)|/60*0.33
0.89 =|(4-40)|/40*0.33+|(5-50)|/50*0.33+|(6-60)|/60*0.33
0.83 =|(7-40)|/40*0.33+|(8-50)|/50*0.33+|(9-60)|/60*0.33
Actually A is a 15x4 matrix and B is a 5x4 matrix.
The matrix dimensions may also be larger than this (they are variable).
How can I write this in Matlab?
Thank you!

You can do it like so. Let's assume that A and B are defined as you did before:
A = vec2mat(1:9, 3)
B = vec2mat(10:10:60, 3)
A =
1 2 3
4 5 6
7 8 9
B =
10 20 30
40 50 60
vec2mat will transform a vector into a matrix. You simply specify how many columns you want, and it will automatically determine the right number of rows to transform the vector into a correctly shaped matrix (thanks @LuisMendo!). Let's also define more things based on your post:
maxCol = max(B); %// Finds maximum of each column in B
coefK = 1 / size(A,2); %// 1 divided by number of columns in A
I am going to assume that coefK is multiplied by every element in A. You would thus compute your desired matrix like so:
cellMat = arrayfun(@(x) sum(coefK*(bsxfun(@rdivide, ...
abs(bsxfun(@minus, A, B(x,:))), maxCol)), 2), 1:size(B,1), ...
'UniformOutput', false);
outputMatrix = cell2mat(cellMat).'
You thus get:
outputMatrix =
0.3450 0.2833 0.2217
0.9617 0.9000 0.8383
Seems like a bit much to chew right? Let's go through this slowly.
Let's start with the bsxfun(@minus, A, B(x,:)) call. Here we take the matrix A and subtract a particular row of B, selected by x. In our case, x is either 1 or 2, since B has two rows. What is cool about bsxfun is that it subtracts the row B(x,:) from every row of A.
Next, after taking absolute values, we divide every number in this result by the maximum of its column, stored in maxCol. To do so, we call bsxfun again to divide every element of the matrix from the first step by the corresponding column element of maxCol.
Once we do this, we weight every value in the matrix by coefK. In our case, this is 1/3.
Then we sum over all of the columns (that is, along each row), which gives us the elements of the output matrix for row x.
As we wish to do this for all of the rows, going from 1, 2, 3, ... up to as many rows as we have in B, we apply arrayfun, which substitutes values of x going from 1, 2, 3, ... up to the number of rows in B. For each value of x, we get a numRows x 1 vector, where numRows is the number of rows in A. This code will only work if A and B have the same number of columns. I have not placed any error checking here. In this case, both matrices have 3 columns. We need to set UniformOutput to false because the output of each arrayfun call is not a single number, but a vector.
This returns each row of the output matrix as one element of a cell array. We use cell2mat to combine these cell array elements into a single matrix.
You'll notice that this is the result we want, but it is transposed, because each value of x contributes one column to the combined matrix. As such, simply transpose the result and we get our final answer.
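To check the steps above against something easier to read, the same quantity can be written as a plain double loop. This is only a sketch for verification, assuming (as above) that the maximum is taken over the columns of B. The name C follows the question's naming.

```matlab
A = [1 2 3; 4 5 6; 7 8 9];
B = [10 20 30; 40 50 60];

maxCol = max(B, [], 1);        % column-wise maximum of B
coefK  = 1 / size(A, 2);       % 1 divided by the number of columns in A

% C(r,q) compares row r of B with row q of A
C = zeros(size(B,1), size(A,1));
for r = 1:size(B,1)
    for q = 1:size(A,1)
        C(r,q) = sum(coefK * abs(A(q,:) - B(r,:)) ./ maxCol);
    end
end
```

For the sample matrices, C should agree with outputMatrix up to rounding.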
Good luck!
Dedication
This post is dedicated to Luis Mendo and Divakar - The bsxfun masters.

Assuming by maximum number in each column, you mean columnwise maximum after vertically concatenating A and B, you can try this one-liner -
sum(abs(bsxfun(@rdivide,bsxfun(@minus,permute(A,[3 1 2]),permute(B,[1 3 2])),permute(max(vertcat(A,B)),[1 3 2]))),3)./size(A,2)
Output -
ans =
0.3450 0.2833 0.2217
0.9617 0.9000 0.8383
If by maximum number in each column, you mean columnwise maximum of B, you can try -
sum(abs(bsxfun(@rdivide,bsxfun(@minus,permute(A,[3 1 2]),permute(B,[1 3 2])),permute(max(B),[1 3 2]))),3)./size(A,2)
The output for this case stays the same as the previous case, owing to the values of A and B.
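On MATLAB R2016b or later, where arithmetic operators expand singleton dimensions implicitly, the second one-liner can probably be written without bsxfun. A sketch with the intermediate sizes spelled out (the variable names are for illustration only):

```matlab
A = [1 2 3; 4 5 6; 7 8 9];
B = [10 20 30; 40 50 60];

A3 = permute(A, [3 1 2]);                 % 1 x rowsA x cols
B3 = permute(B, [1 3 2]);                 % rowsB x 1 x cols
maxB3 = permute(max(B, [], 1), [1 3 2]);  % 1 x 1 x cols

% Implicit expansion builds a rowsB x rowsA x cols array of scaled differences,
% which is then summed over the column dimension (dimension 3)
C = sum(abs(A3 - B3) ./ maxB3, 3) ./ size(A, 2);
```

The idea is the same as in the bsxfun version: rows of B run along dimension 1, rows of A along dimension 2, and the shared columns along dimension 3.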

Sum every n rows of matrix

Is there any way that I can sum up columns values for each group of three rows in a matrix?
I can sum three rows up in a manual way.
For example
% matrix is the one I wanna store the new data.
% data is the original dataset.
matrix(1,1:end) = sum(data(1:3, 1:end))
matrix(2,1:end) = sum(data(4:6, 1:end))
...
But if the dataset is huge, this wouldn't work.
Is there any way to do this automatically without loops?

Here are four other ways:
The obligatory for-loop:
% for-loop over each three rows
matrix = zeros(size(data,1)/3, size(data,2));
counter = 1;
for i=1:3:size(data,1)
matrix(counter,:) = sum(data(i:i+3-1,:));
counter = counter + 1;
end
Using mat2cell for tiling:
% divide each three rows into a cell
matrix = mat2cell(data, ones(1,size(data,1)/3)*3);
% compute the sum of rows in each cell
matrix = cell2mat(cellfun(@sum, matrix, 'UniformOutput',false));
Using third dimension (based on this):
% put each three row into a separate 3rd dimension slice
matrix = permute(reshape(data', [], 3, size(data,1)/3), [2 1 3]);
% sum rows, and put back together
matrix = permute(sum(matrix), [3 2 1]);
Using accumarray:
% build array of group indices [1,1,1,2,2,2,3,3,3,...]
idx = floor(((1:size(data,1))' - 1)/3) + 1;
% use it to accumulate rows (applied to each column separately)
matrix = cell2mat(arrayfun(@(i)accumarray(idx,data(:,i)), 1:size(data,2), ...
'UniformOutput',false));
Of course, all the solutions so far assume that the number of rows is evenly divisible by 3.

This one-liner reshapes so that all the values needed for a particular output cell are in one column, does the sum, and then reshapes the result back to the expected shape.
reshape(sum(reshape(data, 3, [])), [], size(data, 2))
The naked 3 could be changed if you want to sum a different number of rows together. It's on you to make sure the number of rows in data divides evenly by the group size.
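A small worked example may make the reshape trick easier to follow. The data matrix below is made up for illustration, and n stands in for the naked 3:

```matlab
data = reshape(1:12, 6, 2);   % columns are 1..6 and 7..12
n = 3;                        % number of rows to sum together

% Step 1: every column of cols now holds one group of n consecutive rows
% (column-major order walks down each column of data in turn)
cols = reshape(data, n, []);

% Step 2: one sum per group, as a row vector
s = sum(cols, 1);

% Step 3: fold the sums back so each original column keeps its own column
matrix = reshape(s, [], size(data, 2));
```

Here cols has the columns [1 2 3], [4 5 6], [7 8 9] and [10 11 12], so the group sums are 6, 15, 24 and 33.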

Slice the matrix into three pieces and add them together:
matrix = data(1:3:end, :) + data(2:3:end, :) + data(3:3:end, :);
This will give an error if size(data,1) is not a multiple of three, since the three pieces wouldn't be the same size. If appropriate to your data, you might work around that by truncating data, or appending some zeros to the end.
You could also do something fancy with reshape and 3D arrays. But I would prefer the above (unless you need to replace 3 with a variable...)
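The zero-padding workaround mentioned above could look roughly like this. It is a sketch with made-up data, and it assumes that treating the missing rows as zeros is acceptable for your data (the last group is then a partial sum):

```matlab
data = (1:7).' * [1 10];      % 7 rows, which is not a multiple of 3
n = 3;

% Number of zero rows needed to reach the next multiple of n
nPad = mod(-size(data,1), n);
padded = [data; zeros(nPad, size(data,2))];

% Same slicing as above, now safe because all three pieces have equal size
matrix = padded(1:n:end, :) + padded(2:n:end, :) + padded(3:n:end, :);
```

The last output row only contains the sum of the leftover rows, here row 7 on its own.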

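Prashant answered nicely before, but I would make a simple amendment:
fl = filterLength;
A = yourVector; % its length must be a multiple of fl, i.e. mod(length(A),fl)==0
sum(reshape(A,fl,[]),1).'/fl;
The ",1" makes the line run correctly even when fl==1 (which returns the original values). Without it, sum would collapse the resulting 1-by-N row into a single number.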
I discovered this while running it in a for loop like so:
% ... read A ...
L = size(A,1); % number of samples (assumed; L was not defined in the original snippet)
% Plot data
hold on;
averageFactors = [1 3 10 30 100 300 1000];
colors = hsv(length(averageFactors));
clear legendTxt;
for i=1:length(averageFactors)
% ------ FILTERING ----------
clear Atrunc;
clear ttrunc;
clear B;
fl = averageFactors(i); % filter length
Atrunc = A(1:L-mod(L,fl),:);
ttrunc = t(1:L-mod(L,fl),:);
B = sum(reshape(Atrunc,fl,[]),1).'/fl;
tB = sum(reshape(ttrunc,fl,[]),1).'/fl;
length(B)
plot(tB,B,'color',colors(i,:) )
%kbhit ()
end

reformatting a matrix in matlab with nan values

This post follows a previous question regarding the restructuring of a matrix:
re-formatting a matrix in matlab
An additional problem I face is demonstrated by the following example:
depth = [0:1:20]';
data = rand(1,length(depth))';
d = [depth,data];
d = [d;d(1:20,:);d];
Here I would like to alter this matrix so that each column represents a specific depth and each row represents time, so eventually I will have 3 rows (i.e. days) and 21 columns (i.e. measurement at each depth). However, we cannot reshape this because the number of measurements for a given day are not the same i.e. some are missing. This is known by:
dd = sortrows(d,1);
for i = 1:length(depth);
e(i) = length(dd(dd(:,1)==depth(i),:));
end
From 'e' we find that the number of depth is different for different days. How could I insert a nan into the matrix so that each day has the same depth values? I could find the unique depths first by:
unique(d(:,1))
From this, if a depth (from unique) is missing for a given day I would like to insert the depth to the correct position and insert a nan into the respective location in the column of data. How can this be achieved?

You were right that unique may come in handy here. You also need its third output argument, which maps each entry of the original d(:,1) onto its position in the list of unique depths. Have a look at this code; the comments explain what I do.
% find unique depths and their mapping onto the d array
[depths, ~, j] = unique(d(:,1));
% find the start of every day of measurements
% the assumption here is that the depths for each day are in increasing order
days_data = [1; diff(d(:,1))<0];
% count the number of days
ndays = sum(days_data);
% map every entry in d to the correct day
days_data = cumsum(days_data);
% construct the output array full of nans
dd = nan(numel(depths), ndays);
% assign the existing measurements using linear indices
% Where data does not exist, NaN will remain
dd(sub2ind(size(dd), j, days_data)) = d(:,2)
dd =
0.5115 0.5115 0.5115
0.8194 0.8194 0.8194
0.5803 0.5803 0.5803
0.9404 0.9404 0.9404
0.3269 0.3269 0.3269
0.8546 0.8546 0.8546
0.7854 0.7854 0.7854
0.8086 0.8086 0.8086
0.5485 0.5485 0.5485
0.0663 0.0663 0.0663
0.8422 0.8422 0.8422
0.7958 0.7958 0.7958
0.1347 0.1347 0.1347
0.8326 0.8326 0.8326
0.3549 0.3549 0.3549
0.9585 0.9585 0.9585
0.1125 0.1125 0.1125
0.8541 0.8541 0.8541
0.9872 0.9872 0.9872
0.2892 0.2892 0.2892
0.4692 NaN 0.4692
You may want to transpose the matrix.
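The same fill can probably also be done with accumarray instead of sub2ind, using its fill-value argument. A sketch with a small made-up data set (3 depths, 3 days, depth 2 missing on day 2), under the same assumption that depths increase within each day; depthIdx and dayIdx are illustrative names:

```matlab
% Columns: depth, measurement
d = [0 10; 1 11; 2 12;     % day 1
     0 20; 1 21;           % day 2, depth 2 is missing
     0 30; 1 31; 2 32];    % day 3

[depths, ~, depthIdx] = unique(d(:,1));

% A new day starts wherever the depth decreases
dayIdx = cumsum([1; diff(d(:,1)) < 0]);

% One row per depth, one column per day; empty cells are filled with NaN.
% Each (depth, day) pair occurs at most once, so the default sum just copies the value.
dd = accumarray([depthIdx dayIdx], d(:,2), [], [], NaN);
```

As with the sub2ind version, transpose dd if you want days as rows.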

It's not entirely clear from your question what your data looks like exactly, but the following might help you towards an answer.
Suppose you have a column vector
day1 = (1:21)';
and, initially, all the values are NaN
day1(:) = NaN
Suppose next that you have a 2d array of measurements, in which the first column represents depths, and the second the measurements at those depths. For example
msrmnts = [1,2;2,3;4,5;6,7] % etc
then the assignment
day1(msrmnts(:,1)) = msrmnts(:,2)
will set values in only those rows of day1 whose indices are found in the first column of msrmnts. This second statement uses Matlab's capabilities for using one array as a set of indices into another array, for example
d([9 7 8 12 4]) = 1:5
would set elements [9 7 8 12 4] of d to the values 1:5. Note that the indices of the elements do not need to be in order. You could even insert the same value several times into the index array, eg [4 4 5 6 3 4] though it's not terribly useful.
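One caveat when applying this to the question: the depths there start at 0, and 0 is not a valid index in Matlab. A possible way around this, sketched here with made-up measurements, is to look up the position of each depth with ismember instead of using the depth itself as the index:

```matlab
depth = (0:20)';                 % all depths that should appear in the result
day1 = nan(numel(depth), 1);     % one slot per depth, NaN until filled

% Columns: depth, measurement (values invented for illustration)
msrmnts = [0 0.5; 1 0.8; 3 0.9];

% pos(k) is the row of depth that equals msrmnts(k,1), tf(k) flags a match
[tf, pos] = ismember(msrmnts(:,1), depth);
day1(pos(tf)) = msrmnts(tf, 2);
```

Depths that were not measured on that day (here depth 2, for example) stay NaN.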

in matlab, calculate mean in a part of one column where another column satisfies a condition

I'm quite new to matlab, and I'm curious how to do this:
I have a rather large (27000x11) matrix, and the 8th column contains a number which changes sometimes but is constant for like 2000 rows (not necessarily consecutive).
I would like to calculate the mean of the entries in the 3rd column for those rows where the 8th column has the same value. This for each value of the 8th column.
I would also like to plot the 3rd column's means as a function of the 8th column's value but that I can do if I can get a new matrix (2x2) containing [mean_of_3rd,8th].
Ex: (smaller matrix for convenience)
1 2 3 4 5
3 7 5 3 2
1 3 2 5 3
4 5 7 5 8
2 4 7 4 4
Since the 4th column has the same value in rows 1 and 5, I'd like to calculate the mean of 2 and 4 (the corresponding elements of column 2) and put it in another matrix together with the 4th column's value. The same goes for 3 and 5, since the 4th column has the same value for these two rows.
3 4
4 5
and so on... is this possible in an easy way?

Use the all-mighty, underused accumarray:
This line gives you the mean values of the 2nd column, grouped by the value in the 4th column:
means = accumarray( A(:,4) ,A(:,2),[],@mean)
This line gives you the number of elements in each group:
count = accumarray( A(:,4) ,ones(size(A(:,4))))
Now if you want to keep only the groups that occur more than once:
>> filtered = means(count>1)
filtered =
3
4
This will work only for positive integers in the 4th column.
Another possibility for counting the number of elements in each group:
count = accumarray( A(:,4) ,A(:,4),[],@numel)

A slightly refined approach based on the ideas of Andrey and Rody. We can not use accumarray directly, since the data is real, not integer. But, we can use unique to find the indices of the repeating entries. Then we operate on integers.
% get unique entries in 4th column
[R, I, J] = unique(A(:,4));
% count the repeating entries: now we have integer indices!
counts = accumarray(J, 1, size(R));
% sum the 2nd column for all entries
sums = accumarray(J, A(:,2), size(R));
% compute means
means = sums./counts;
% choose only the entries that show more than once in 4th column
inds = counts>1;
result = [means(inds) R(inds)];
Time comparison for the following synthetic data:
A=randi(100, 1000000, 5);
% Rody's solution
Elapsed time is 0.448222 seconds.
% The above code
Elapsed time is 0.148304 seconds.
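As a quick check on the small example from the question, the unique-based approach can be run like this. The sketch uses @mean inside accumarray for brevity, rather than the faster sums./counts form above:

```matlab
A = [1 2 3 4 5;
     3 7 5 3 2;
     1 3 2 5 3;
     4 5 7 5 8;
     2 4 7 4 4];

% Map every value in column 4 to an integer group index
[R, ~, J] = unique(A(:,4));

counts = accumarray(J, 1);                 % group sizes
means  = accumarray(J, A(:,2), [], @mean); % mean of column 2 per group

% Keep only the groups that occur more than once
keep = counts > 1;
result = [means(keep) R(keep)];
```

For this matrix, result should reproduce the two rows asked for in the question: 3 4 and 4 5.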

My official answer:
A4 = A(:,4);
R = unique(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
result = [means(inds) R(inds)];
The reason is as follows. Here are all of the alternatives we've come up with, in profiling form:
%# sample data
A = [
1 2 3 4 5
3 7 5 3 2
1 3 2 5 3
4 5 7 5 8
2 4 7 4 4];
%# accumarray
%# works only on positive integers in A(:,4)
tic
for ii = 1:1e4
means = accumarray( A(:,4) ,A(:,2),[],@mean);
count = accumarray( A(:,4) ,ones(size(A(:,4))));
filtered = means(count>1);
end
toc
%# arrayfun
%# works only on integers in A(:,4)
tic
for ii = 1:1e4
B = arrayfun(@(x) A(A(:,4)==x, 2), min(A(:,4)):max(A(:,4)), 'uniformoutput', false);
filtered = cellfun(@mean, B(cellfun(@(x) numel(x)>1, B)) );
end
toc
%# ordinary loop
%# works only on integers in A(:,4)
tic
for ii = 1:1e4
A4 = A(:,4);
R = min(A4):max(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
filtered = means(inds);
end
toc
Results:
Elapsed time is 1.238352 seconds. %# (accumarray)
Elapsed time is 7.208585 seconds. %# (arrayfun + cellfun)
Elapsed time is 0.225792 seconds. %# (for loop)
The ordinary loop is clearly the way to go here.
Note the absence of mean in the inner loop. This is because mean is not a Matlab builtin function (at least, on R2010), so that using it inside the loop makes the loop unqualified for JIT compilation, which slows it down by a factor of over 10. Using the form above accelerates the loop to almost 5.5 times the speed of the accumarray solution.
Judging by your comment, it is almost trivial to change the loop to work on all entries in A(:,4) (not just the integers):
A4 = A(:,4);
R = unique(A4);
means = zeros(size(R));
inds = false(size(R));
for jj = 1:numel(R)
I = A4==R(jj);
sumI = sum(I);
inds(jj) = sumI>1;
means(jj) = sum(A(I,2))/sumI;
end
filtered = means(inds);
Which I will copy-paste to the top as my official answer :)