Suppose we have a 2D double array such as:
% Data: ID, Index, Weight, Category
A0=[1 1121 204 1;...
2 2212 112 1;...
3 2212 483 3;...
4 4334 233 1;...
5 4334 359 2;...
6 4334 122 3 ];
I need to pivot / group by Index and keep the row with the highest Weight for each Index. The maximum itself can be obtained with any Pivot Table / Group By functionality (such as pivottable, SQL GROUP BY or the MS Excel PivotTable):
% Current Result
A1=pivottable(A0,[2],[],[3],{@max}); % Pivot Table
A1=cell2mat(A1); % Convert to array
>>A1=[1121 204;...
2212 483;...
4334 359 ]
How should I proceed if I need to recover also the ID and the Category columns?
% Required Result
>>A1=[1 1121 204 1;...
3 2212 483 3;...
5 4334 359 2 ];
The syntax is MATLAB, but a solution in another language (Java, SQL) is acceptable, since it can be transcribed into MATLAB.
You can use splitapply with an anonymous function as follows.
grouping_col = 2; % Grouping column
maximize_col = 3; % Column to maximize
[~, ~, group_label] = unique(A0(:,grouping_col));
result = splitapply(@(x) {x(x(:,maximize_col)==max(x(:,maximize_col)),:)}, A0, group_label);
result = cell2mat(result); % convert to matrix
How it works: the anonymous function @(x) {x(x(:,maximize_col)==max(···),:)} is called by splitapply once for each group. The function is provided as input a submatrix containing all rows with the same value of the column with index grouping_col. What this function then does is keep all rows that maximize the column with index maximize_col, and pack that into a cell. The result is then converted to matrix form by cell2mat.
With the above solution, if there are several maximizing rows for each group all of them are produced. To keep only the first one, replace the last line by
result = cell2mat(cellfun(@(c) c(1,:), result, 'uniformoutput', false));
How it works: this uses cellfun to apply the anonymous function @(c) c(1,:) to the content of each cell. The function simply keeps the first row. Alternatively, to keep the last row use @(c) c(end,:). The result is then converted to matrix form using cell2mat again.
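For comparison, here is a sketch of a different route that avoids cell arrays altogether. It is not the splitapply method above: it sorts the rows by Index (ascending) and Weight (descending) with sortrows, then keeps the first row of each Index with unique. The variable names sorted and firstIdx are only illustrative, and when several rows tie for the maximum it keeps just one of them.

```matlab
% Data: ID, Index, Weight, Category
A0 = [1 1121 204 1; ...
      2 2212 112 1; ...
      3 2212 483 3; ...
      4 4334 233 1; ...
      5 4334 359 2; ...
      6 4334 122 3];

% Sort by column 2 ascending, then by column 3 descending,
% so the heaviest row of each Index comes first within its group
sorted = sortrows(A0, [2 -3]);

% Position of the first row of each distinct Index in the sorted matrix
[~, firstIdx] = unique(sorted(:,2), 'first');

% Full rows (ID, Index, Weight, Category) holding the maximum Weight per Index
A1 = sorted(firstIdx, :);
```

For the example data this should give the rows with ID 1, 3 and 5, i.e. the required result.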
Consider that I have a table of such type in MATLAB:
Location String Number
1 a 26
1 b 361
2 c 28
2 a 45
3 a 78
4 b 82
I would like to create a script which returns only 3 rows, which would include the largest Number for each string. So in this case the table returned would be the following:
Location String Number
3 a 78
1 b 361
2 c 28
The actual table that I want to tackle is much greater, though I wrote this like that for simplicity. Any ideas on how this task can be tackled? Thank you in advance for your time!
You could use splitapply, with an id for each row.
Please see the comments for details...
% Assign unique ID to each row
tbl.id = (1:size(tbl,1))';
% Get groups of the different strings
g = findgroups(tbl.String);
% create function which gets id of max within each group
% f must take arguments corresponding to each splitapply table column
f = @(num,id) id(find(num == max(num), 1));
% Use splitapply to apply the function f to all different groups
idx = splitapply( f, tbl(:,{'Number','id'}), g );
% Collect rows
outTbl = tbl(idx, {'Location', 'String', 'Number'});
>> outTbl =
Location String Number
3 'a' 78
1 'b' 361
2 'c' 28
Or just a simple loop. This loop is only over the unique values of String so should be pretty quick.
u = unique(tbl.String);
c = cell(numel(u), size(tbl,2));
for ii = 1:numel(u)
temp = tbl(strcmp(tbl.String, u{ii}),:);
[~, idx] = max(temp.Number);
c(ii,:) = table2cell(temp(idx,:));
end
outTbl = cell2table(c, 'VariableNames', tbl.Properties.VariableNames);
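As a further option, here is a sketch that uses neither splitapply nor a loop. It assumes the table from the question is stored in tbl with String as a cell array of character vectors (the table is rebuilt here so the snippet runs on its own), and the names sorted and firstIdx are only illustrative. It sorts by String and then by Number in descending order, and keeps the first row of each String.

```matlab
% Rebuild the example table from the question
tbl = table([1;1;2;2;3;4], {'a';'b';'c';'a';'a';'b'}, [26;361;28;45;78;82], ...
    'VariableNames', {'Location','String','Number'});

% Sort by String ascending, then Number descending,
% so the largest Number of each String comes first within its group
sorted = sortrows(tbl, {'String','Number'}, {'ascend','descend'});

% unique returns the index of the first occurrence of each String
% (the default behaviour in R2013a and later)
[~, firstIdx] = unique(sorted.String);

% One row per String, holding its largest Number
outTbl = sorted(firstIdx, :);
```

With the example data, outTbl should hold the rows (3, a, 78), (1, b, 361) and (2, c, 28).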
My idea for finding the maximum value of each string is as follows.
Create a vector of all your strings and include them only one time. Something like:
strs=['a','b','c'];
Then create a vector that will store maximum value of each string:
n=length(strs);
max_values=zeros(1,n);
Now loop over every row of the data, compare the current row's value with the maximum stored for its string, and replace the stored maximum if the current value is larger:
for i = 1:height(tbl) % assumes the data is in the table tbl from the question
current_string = tbl.String{i}; % assumes String is a cell array of single characters
m = find(strs == current_string); % index into max_values for this string
if max_values(m) < tbl.Number(i) % Number in the i-th row
max_values(m) = tbl.Number(i);
end
end
Overview
An n×m matrix A and an n×1 vector Date are the inputs of the function S = sumdate(A,Date).
The function returns an n×m matrix S such that each row of S is the sum of all rows of A that share the same date.
For example, if
A = [1 2 7 3 7 3 4 1 9
6 4 3 0 -1 2 8 7 5]';
Date = [161012 161223 161223 170222 160801 170222 161012 161012 161012]';
Then I would expect the returned matrix S is
S = [15 9 9 6 7 6 15 15 15;
26 7 7 2 -1 2 26 26 26]';
Because the elements Date(2) and Date(3) are the same, we have
S(2,1) and S(3,1) are both equal to the sum of A(2,1) and A(3,1)
S(2,2) and S(3,2) are both equal to the sum of A(2,2) and A(3,2).
Because the elements Date(1), Date(7), Date(8) and Date(9) are the same, we have
S(1,1), S(7,1), S(8,1), S(9,1) equal the sum of A(1,1), A(7,1), A(8,1), A(9,1)
S(1,2), S(7,2), S(8,2), S(9,2) equal the sum of A(1,2), A(7,2), A(8,2), A(9,2)
The same for S([4,6],1) and S([4,6],2)
As the element Date(5) does not repeat, S(5,1) = A(5,1) = 7 and S(5,2) = A(5,2) = -1.
The code I have written so far
Here is my try on the code for this task.
function S = sumdate(A,Date)
S = A; %Pre-assign S as a matrix in the same size of A.
Dlist = unique(Date); %Sort out a non-repeating list from Date
for J = 1 : length(Dlist)
loc = (Date == Dlist(J)); %Compute a logical indexing vector for locating the J-th element in Dlist
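S(loc,:) = repmat(sum(S(loc,:),1),sum(loc),1); %Replace the located rows of S by their column-wise sum (dimension 1 is given explicitly so that a single located row is not collapsed to a scalar)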
end
end
I tested it on my computer using A and Date with these attributes:
size(A) = [33055 400];
size(Date) = [33055 1];
length(unique(Date)) = 2645;
It took my PC about 1.25 seconds to perform the task.
This task is performed hundreds of thousands of times in my project, therefore my code is too time-consuming. I think the performance will be boosted up if I can eliminate the for-loop above.
I have found some built-in functions which do special types of sums like accumarray or cumsum, but I still do not have any ideas on how to eliminate the for-loop.
I would appreciate your help.
You can do this with accumarray, but you'll need to generate a set of row and column subscripts into A to do it. Here's how:
[~, ~, index] = unique(Date); % Get indices of unique dates
subs = [repmat(index, size(A, 2), 1) ... % repmat to create row subscript
repelem((1:size(A, 2)).', size(A, 1))]; % repelem to create column subscript
S = accumarray(subs, A(:)); % Reshape A into column vector for accumarray
S = S(index, :); % Use index to expand S to original size of A
S =
15 26
9 7
9 7
6 2
7 -1
6 2
15 26
15 26
15 26
Note #1: This will use more memory than your for loop solution (subs will have twice the number of elements as A), but may give you a significant speed-up.
Note #2: If you are using a version of MATLAB older than R2015a, you won't have repelem. Instead you can replace that line using kron (or one of the other solutions here):
kron((1:size(A, 2)).', ones(size(A, 1), 1))
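Another possibility, sketched here as an alternative to accumarray rather than as part of the answer above, is to build a sparse group-indicator matrix and use two matrix products. The name G is only illustrative. G has one row per unique date and one column per row of A, so G*A gives the per-date sums and G.'*(G*A) copies each sum back to every row with that date. Whether it is faster than the loop for your sizes is something you would need to time yourself.

```matlab
A = [1 2 7 3  7 3 4 1 9
     6 4 3 0 -1 2 8 7 5]';
Date = [161012 161223 161223 170222 160801 170222 161012 161012 161012]';

% Group number of each row (1..number of unique dates)
[~, ~, index] = unique(Date);
n = numel(Date);

% Sparse indicator matrix: G(k, r) = 1 if row r of A belongs to group k
G = sparse(index, (1:n).', 1);

% G*A sums the rows of A per group; G.' expands the sums back to n rows
S = G.' * (G * A);
```

For the example in the question this should reproduce the matrix S shown above.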
I am creating a sparse matrix
sp = sparse(I,J,Val,X,Y)
My Val matrix is a ones matrix. Much to my surprise, the sp matrix does not contain only zeros and ones. I suppose this happens because in some cases there are duplicates in I,J. For example, sp(1,1) is set to 1 twice, and this makes it 2.
Question 1: Is my assumption true? Instead of overwriting the value, does MATLAB really add it?
Question 2: How can we get around this, given that it would be very troublesome to manipulate I and J. Something I can think of, is to use find (thus guaranteeing uniqueness) and then recreate the matrix using ones once more. Any better suggestion?
Question 1: Is my assumption true? Instead of overwriting the value, does Matlab really add it?
Correct. If you have duplicate row and column values each with their own values, MATLAB will aggregate them all into the same row and column location by adding them.
This is clearly seen in the documentation but as a reproducible example, suppose I have the following row and column locations and their associated values at these locations:
i = [6 6 6 5 10 10 9 9].';
j = [1 1 1 2 3 3 10 10].';
v = [100 202 173 305 410 550 323 121].';
Note that these are column vectors as this shape is the expected input. In a neater presentation:
>> [i j v]
ans =
6 1 100
6 1 202
6 1 173
5 2 305
10 3 410
10 3 550
9 10 323
9 10 121
We can see that there are three values that get mapped to location (6, 1), two values that get mapped to location (10, 3) and finally two that get mapped to location (9, 10).
By creating the sparse matrix and displaying it, we thus get:
>> S = sparse(i,j,v)
S =
(6,1) 475
(5,2) 305
(10,3) 960
(9,10) 444
As you can see, the three values mapped to (6, 1) are summed: 100 + 202 + 173 = 475. You can verify this with the other duplicate row and column locations.
Question 2: How can we get around this, given that it would be very troublesome to manipulate I and J. Something I can think of, is to use find (thus guaranteeing uniqueness) and then recreate the matrix using ones once more. Any better suggestion?
There are two possible ways to mitigate this if it is truly your desire to only have a binary matrix.
The first way, which may be preferable since you mentioned that manipulating the row and column locations is troublesome, is to create the matrix as you do now and then convert it to logical, so that every non-zero value becomes 1:
>> S = S ~= 0
S =
10×10 sparse logical array
(6,1) 1
(5,2) 1
(10,3) 1
(9,10) 1
If you require that the precision of the matrix be back in its original double form, cast the result after you convert to logical:
>> S = double(S ~= 0)
S =
(6,1) 1
(5,2) 1
(10,3) 1
(9,10) 1
The second way if you wish is to work on your row and column locations so that you filter out any indices that are non-unique, then create a vector of ones for val that is as long as the unique row and column locations. You can use the unique function to help you do that. Concatenate the row and column locations in a two column matrix and specify that you want to operate on 'rows'. This means that each row is considered an input rather than individual elements in a matrix. Once you find the unique row and column locations, use these as input for creating the sparse matrix:
>> unique_vals = unique([i j], 'rows')
unique_vals =
5 2
6 1
9 10
10 3
>> vals = ones(size(unique_vals, 1), 1);
>> S = sparse(unique_vals(:, 1), unique_vals(:, 2), vals)
S =
(6,1) 1
(5,2) 1
(10,3) 1
(9,10) 1
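As a small self-contained check of both points (duplicates are summed, and the result can be turned back into ones), here is a sketch that also uses spones, which replaces every stored non-zero of a sparse matrix with 1. It is a third option alongside the two ways described above, and the names S and B are only illustrative.

```matlab
i = [6 6 6 5 10 10 9 9].';
j = [1 1 1 2 3 3 10 10].';

% A scalar value is applied to every (i,j) pair; duplicate pairs are summed,
% so each stored entry counts how often its location occurs
S = sparse(i, j, 1);

% Replace every stored non-zero with 1, keeping the double type and sparsity pattern
B = spones(S);
```

Here S(6,1) should be 3 because the location (6,1) occurs three times, while every stored entry of B is 1.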
As an example, I have a matrix [1,2,3,4,5]'. This matrix contains one column and 5 rows, and I have to generate a pair of points like (1,2),(1,3)(1,4)(1,5),(2,3)(2,4)(2,5),(3,4)(3,5)(4,5).
I have to store these values in 2 columns in a matrix. I have the following code, but it isn't quite giving me the right answer.
for s = 1:5;
for tb = (s+1):5;
if tb>s
in = sub2ind(size(pairpoints),(tb-1),1);
pairpoints(in) = s;
in = sub2ind(size(pairpoints),(tb-1),2);
pairpoints(in) = tb;
end
end
end
With this code, I got (1,2),(2,3),(3,4),(4,5). What should I do, and what is the general formula for the number of pairs?
One way, though it is limited by how many different elements there are to choose from, is to use nchoosek as follows
pairpoints = nchoosek([1:5],2)
pairpoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
See the limitations of this function in the provided link.
An alternative is to just iterate over each element and combine it with the remaining elements in the list (assumes that all are distinct)
pairpoints = [];
data = [1:5]';
len = length(data);
for k=1:len
pairpoints = [pairpoints ; [repmat(data(k),len-k,1) data(k+1:end)]];
end
This method just concatenates each element in data with the remaining elements in the list to get the desired pairs.
Try either of the above and see what happens!
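On the general formula asked about in the question: choosing 2 elements out of n distinct ones gives n(n-1)/2 pairs, so 5 elements give 10 pairs. A short check, with illustrative variable names:

```matlab
n = 5;

% All pairs of the values 1..n, one pair per row
pairs = nchoosek(1:n, 2);

% Number of pairs predicted by the formula n*(n-1)/2
numPairs = n * (n - 1) / 2;

% The row count of the pair matrix should match the formula
sameCount = (size(pairs, 1) == numPairs);
```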
Another suggestion I can add to the mix if you don't want to rely on nchoosek is to generate an upper triangular matrix full of ones, disregarding the diagonal, and use find to generate the rows and columns of where the matrix is equal to 1. You can then concatenate both of these into a single matrix. By generating an upper triangular matrix this way, the locations of the matrix where they're equal to 1 exactly correspond to the row and column pairs that you are seeking. As such:
%// Highest value in your data
N = 5;
[rows,cols] = find(triu(ones(N),1));
pairPoints = [rows,cols]
pairPoints =
1 2
1 3
2 3
1 4
2 4
3 4
1 5
2 5
3 5
4 5
Bear in mind that this will be unsorted (i.e. not in the order that you specified in your question). If order matters to you, then use the sortrows command in MATLAB so that we can get this into the proper order that you're expecting:
pairPoints = sortrows(pairPoints)
pairPoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
Take note that I specified an additional parameter to triu which denotes how much of an offset you want away from the diagonal. The default offset is 0, which includes the diagonal when you extract the upper triangular matrix. I specified 1 as the second parameter because I want to move away from the diagonal towards the right by 1 unit so I don't want to include the diagonal as part of the upper triangular decomposition.
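To make the effect of the offset concrete, here is a tiny sketch on a 3×3 matrix of ones. The names U0 and U1 are only illustrative.

```matlab
% Offset 0 (the default): the diagonal is kept
U0 = triu(ones(3));
% U0 =
%      1     1     1
%      0     1     1
%      0     0     1

% Offset 1: only the entries strictly above the diagonal are kept
U1 = triu(ones(3), 1);
% U1 =
%      0     1     1
%      0     0     1
%      0     0     0
```

The three ones left in U1 sit at (1,2), (1,3) and (2,3), which are exactly the pairs for N = 3.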
for loop approach
If you truly desire the for loop approach, going with your model, you'll need two for loops and you need to keep track of the previous row we are at so that we can just skip over to the next column until the end using this. You can also use @GeoffHayes' approach of using just a single for loop to generate your indices, but when you're new to a language, one key piece of advice I will always give is to code for readability and not for efficiency. Once you get it working, if you have some way of measuring performance, you can then try and make the code faster and more efficient. This kind of programming is also endorsed by Jon Skeet, the resident StackOverflow ninja, and I got that from this post here.
As such, you can try this:
pairPoints = []; %// Initialize
N = 5; %// Highest value in your data
for row = 1 : N
for col = row + 1 : N
pairPoints = [pairPoints; [row col]]; %// Add row-column pair to matrix
end
end
We get the equivalent output:
pairPoints =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
Small caveat
This method will only work if your data is enumerated from 1 to N.
Edit - August 20th, 2014
You wish to generalize this to any array of values. You also want to stick with the for loop approach. You can still keep the original for loop code there. You would simply have to add a couple more lines to index your new array. As such, supposing your data array was:
dat = [12, 45, 56, 44, 62];
You would use the pairPoints matrix and use each column to index into the data array to access your values. Also, you need to make sure your data is a column vector, or this won't work. If it weren't, we would be creating a 1D array and concatenating rows, and that's obviously not what we're looking for. In other words:
dat = [12, 45, 56, 44, 62];
dat = dat(:); %// Make column vector - Important!
N = numel(dat); %// Total number of elements in your data array
pairPoints = []; %// Initialize
%// Skip if the array is empty
if (N ~= 0)
for row = 1 : N
for col = row + 1 : N
pairPoints = [pairPoints; [row col]]; %// Add row-column pair to matrix
end
end
vals = [dat(pairPoints(:,1)) dat(pairPoints(:,2))];
else
vals = [];
end
Take note that I have made a provision where if the array is empty, don't even bother doing any calculations. Just output an empty matrix.
We thus get:
vals =
12 45
12 56
12 44
12 62
45 56
45 44
45 62
56 44
56 62
44 62
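If you later want to drop the loops for the general case, one sketch (combining the triu/find idea from earlier with the indexing step above, so not a new method) looks like this. The variable name idx is only illustrative.

```matlab
dat = [12, 45, 56, 44, 62];
dat = dat(:);                 % make sure the data is a column vector
N = numel(dat);

% Row/column positions strictly above the diagonal are the index pairs
[rows, cols] = find(triu(ones(N), 1));

% Sort so the pairs come out in the same order as the nested loops
idx = sortrows([rows, cols]);

% Indexing a column vector with a K-by-2 index matrix returns a K-by-2 matrix
vals = dat(idx);
```

For this data, vals should match the matrix shown above.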
Is there a function in MATLAB that lets you aggregate (that is, sum) the columns of a matrix in groups of a given number of columns?
For example I have:
A =
1 2 3 4 5 6
9 10 11 12 13 14
17 18 19 20 21 22
I wish to aggregate every 2 columns, like this: col1+col2, and then col3+col4, and then col5+col6, so my output is:
A_agg =
3 7 11
19 23 27
35 39 43
I couldn't find a built-in function and was trying to write a for loop but I couldn't manage to do it since I am quite new to programming. Do you have any suggestions/solutions how this could be solved with a for loop, or maybe there is a built-in function?
Since sum operates down columns in a matrix, I first reshape A so that it has 2 rows and 9 columns, then sum down each column. Then reshape back to the desired output matrix A_agg.
A=[1 2 3 4 5 6
9 10 11 12 13 14
17 18 19 20 21 22]
[m,n]=size(A);
A_agg=reshape(sum(reshape(A',2,[])),[],m)' % n/2 rows before the transpose, so this also works when m is not equal to n/2
You can use a combination of mat2cell and cellfun. You can use mat2cell to split up your matrices into individual 2 column chunks. Each chunk would be stored as a cell in a cell array. You can then use cellfun to take each cell and sum row-wise. After you're done, you can use cell2mat to convert back.
Using your example:
A = [1:6;9:14;17:22];
B = mat2cell(A, 3, [2 2 2]);
C = cellfun(@(x) sum(x,2), B, 'UniformOutput', false);
A_agg = cell2mat(C);
A_agg should thus give you:
A_agg =
3 7 11
19 23 27
35 39 43
Let's walk through the code slowly:
A is defined as we had before. B will be a cell array, and will segment your matrix into matrices of 2 columns per cell. The first parameter is the matrix you want to decompose (in our case A). The second parameter tells you how many rows each segment should have. Because we want all of the matrices to have the same number of rows, we thus supply one number which is 3. After, you specify the number of columns you want per matrix. Because there are 6 columns, we need 3 matrices, and so you'd specify a vector of [2 2 2].
C is the output of cellfun, where cellfun applies a function to every single element in a cell matrix. What you want to do here is for each cell (essentially each matrix), you want to sum row-wise. The first parameter is an anonymous function that takes in a matrix from each cell, and sums row-wise. The second parameter is the cell array we just created. You'll notice that we have an additional flag to set: UniformOutput. The reason why you have to set UniformOutput = false is because if you apply cellfun without that flag, the expected result at the end of the function you apply to each cell is scalar. Because we are outputting a column vector instead, we have to set this flag to false.
A_agg will thus aggregate all of your cells back to matrix form.
If you want to do this for a matrix of any size, bear in mind that there has to be an even number of columns for this to work, that is, the number of columns has to be evenly divisible by 2. You would thus re-run the code like so:
B = mat2cell(A, size(A,1), 2*ones(1, size(A,2)/2));
C = cellfun(@(x) sum(x,2), B, 'UniformOutput', false);
A_agg = cell2mat(C);
Another possibility, if you have the Image Processing Toolbox, is to use blockproc. Let n denote the number of columns to be aggregated (2 in your example). Then:
A_agg = blockproc(A, [size(A,1) n], @(x) sum(x.data, 2));
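If you have neither the toolbox nor a taste for cell arrays, a sketch using only reshape and sum is also possible. It assumes, as above, that the number of columns is divisible by n. The idea is to reshape A into a 3-D array whose pages each hold n consecutive columns, sum across the columns of each page, and flatten the result back to 2-D.

```matlab
A = [1:6; 9:14; 17:22];
n = 2;                        % number of columns to aggregate
m = size(A, 1);               % number of rows

% Page k of the 3-D array holds columns (k-1)*n+1 : k*n of A
pages = reshape(A, m, n, []);

% Sum along dimension 2 (within each page), then flatten m-by-1-by-K to m-by-K
A_agg = reshape(sum(pages, 2), m, []);
```

For the example matrix this should give the same A_agg as the other answers.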