In Matlab, how to apply an 'AggregationFunction' with two variables in unstack? - matlab

Objective:
I would like to get, for each period and group of a timetable, the result of a given function of var1 and var2 [i.e. the ratio of (the sum of var1 over the group) by (the sum of var2 over the group)]
using unstack and a function handle:
Data
A = [1 2 3 4 2 4 6 8]';
B = [4 2 14 7 8 4 28 14]';
C=["group1";"group1";"group2";"group2";"group1";"group1";"group2";"group2"];
Year = [2010 2010 2010 2010 2020 2020 2020 2020]';
Year = datetime(string(Year), 'Format', 'yyyy');
t=table(Year,C,A,B,'VariableNames',{'Year' 'group' 'var1' 'var2'});
t=table2timetable(t,'RowTimes','Year');
Desired Output [EDIT]
A table with three columns: year, Ratio_group1, Ratio_group2. Where for instance:
Ratio_group1 for 2010 = (1+2) / (4+2) =0.5.
Function
f = @(x,y) sum(x)./sum(y); % or f = @(x) sum(x(1,:))./sum(x(2,:));
[Ratio,is] = unstack(t,{'var1','var2'},"group",'AggregationFunction',f);
Errors that I get:
%Not enough input arguments.
%Or: Index in position 1 exceeds array bounds (must not exceed 1)
Another failed test inspired from https://www.mathworks.com/help/matlab/ref/double.groupsummary.html
(See Method Function Handle with Multiple Inputs)
[Ratio,is] = unstack(t,{["var1"],["var2"]},"group",'AggregationFunction',f);
%Error: A table variable subscript must be a numeric array containing real positive integers, a logical array (...)

This can be done using findgroups and splitapply (or equivalently accumarray):
result = splitapply(@(x,y) sum(x)/sum(y), t.var1, t.var2, findgroups(t.Year, t.group));
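To get from that one-liner to the wide layout asked for in the question, the group keys returned by findgroups can be kept and the long result pivoted with an ordinary single-variable unstack. This is only a sketch: Year is kept as a plain numeric column for brevity, and the names long and wide are illustrative, not part of the original answer.

```matlab
% Rebuild the question's data as a plain table (Year kept numeric for brevity).
Year  = [2010 2010 2010 2010 2020 2020 2020 2020]';
group = ["group1";"group1";"group2";"group2";"group1";"group1";"group2";"group2"];
var1  = [1 2 3 4 2 4 6 8]';
var2  = [4 2 14 7 8 4 28 14]';
t = table(Year, group, var1, var2);

% One group number per (Year, group) pair, plus the key values of each group.
[G, yr, grp] = findgroups(t.Year, t.group);

% Ratio of sums within each (Year, group) pair.
ratio = splitapply(@(x,y) sum(x)/sum(y), t.var1, t.var2, G);

% Long table with one row per pair, then pivot the groups into columns.
long = table(yr, grp, ratio, 'VariableNames', {'Year','group','Ratio'});
wide = unstack(long, 'Ratio', 'group');

% Optional: rename the two pivoted columns as in the desired output.
wide.Properties.VariableNames(2:3) = {'Ratio_group1','Ratio_group2'};
```

For 2010 and group1 this gives (1+2)/(4+2) = 0.5, as in the desired output.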

Related

find indices of subsets in MATLAB

This question is motivated by a very specific combinatorial optimization problem, where the search space is defined as the space of permuted subsets of an unsorted vector of discrete values with multiplicities.
I am looking for an efficient (fast enough, vectorized, or otherwise more clever) function that finds the indices of subsets in the following manner:
t = [1 1 3 2 2 2 3 ]
is an unsorted vector of all possible values, including their multiplicities.
item = [2 3 1; 2 1 2; 3 1 1; 1 3 3]
is a list of permuted subsets of the vector t.
I need to find, for each subset in item, the list of indices into t that its elements correspond to. So, for the example above we have:
item =
2 3 1
2 1 2
3 1 1
1 3 3
t =
1 1 3 2 2 2 3
ind = item2ind(item,t)
ind =
4 3 1
4 1 5
3 1 2
1 3 7
So, for item = [2 3 1] we get ind = [4 3 1], which means, that:
first value "2" at item corresponds to the first value "2" at t on position "4",
second value "3" at item corresponds to the first value "3" at t on position "3" and
third value "1" at item corresponds to the first value "1" at t on position "1".
In a case item =[ 2 1 2] we get ind = [4 1 5], which means, that:
first value "2" at item corresponds to the first value "2" at t on position "4",
second value "1" at item corresponds to the first value "1" at t on position "1", and
third value "2" at item corresponds to the second(!!!) value "2" at t on position "5".
For
item = [1 1 1]
no solution exists, because vector t contains only two "1" values.
My current version of the function "item2ind" is trivial serial code, which can be parallelized simply by changing the outer "for" loop to "parfor":
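function ind = item2ind(item,t)
    [nlp,N] = size(item);
    ind = zeros(nlp,N);
    for i = 1:nlp
        auxitem = item(i,:);
        auxt = t; % working copy of t for this row
        for j = 1:N
            % first position in t that still holds the requested value
            I = find(auxitem(j) == auxt,1,'first');
            if ~isempty(I)
                auxt(I) = 0; % mark this position as used (assumes 0 is not a value in t)
                ind(i,j) = I;
            else
                error('Incompatible content of item and t.');
            end
        end
    end
end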
But I need something definitely more clever ... and faster:)
Test case for larger input data:
t = 1:10; % 10 unique values at vector t
t = repmat(t,1,5); % unsorted vector t with multiplicity of all unique values 5
nlp = 100000; % number of item rows
[~,p] = sort(rand(nlp,length(t)),2); % 100000 random permutations
item = t(p); % transform permutations to items
item = item(:,1:30); % transform item to shorter subset
tic;ind = item2ind(item,t);toc % running and timing of the original function
tic;ind_ = item2ind_new(item,t);toc % running and timing of the new function
isequal(ind,ind_) % comparison of solutions
To vectorize the code, I have assumed that the error case is not present. Invalid rows should be discarded first, with a simple procedure presented at the end.
Method
First, let's compute the indices of all elements in t:
t = t(:);
mct = max(accumarray(t,1));
G = accumarray(t,1:length(t),[],@(x) {sort(x)});
G = cellfun(@(x) padarray(x.',[0 mct-length(x)],0,'post'), G, 'UniformOutput', false);
G = vertcat(G{:});
Explanation: after reshaping the input into a column vector, we compute the maximum number of occurrences of any value in t using accumarray. Next, we collect the indices of every value. The result is a cell array, since the values may not all occur the same number of times. To form a matrix, we pad each index list independently with zeros up to the maximum length (named mct). Then we can convert the cell array into a matrix. At this step, for the test case above, we have:
G =
1 11 21 31 41
2 12 22 32 42
3 13 23 33 43
4 14 24 34 44
5 15 25 35 45
6 16 26 36 46
7 17 27 37 47
8 18 28 38 48
9 19 29 39 49
10 20 30 40 50
Now, we process item. For that, let's figure out how to compute the running count of occurrences of each value inside a vector. For example, if I have:
A = [1 1 3 2 2 2 3];
then I want to get:
B = [1 2 1 1 2 3 2];
Thanks to implicit expansion, we can have it in one line:
B = diag(cumsum(A==A'));
As easy as this. The syntax A==A' expands into a matrix where each element is A(i)==A(j). Taking the cumulative sum along one dimension and then the diagonal gives the desired result: each column is the cumulative count of occurrences of one value.
To use this trick with item, which is 2-D, we need a 3-D array. Let m=size(item,1) and n=size(item,2). Then:
C = cumsum(reshape(item,m,1,n)==item,3);
is a (big) 3-D array of all cumulative occurrence counts. The last step is to select the elements on the diagonal along dimensions 2 and 3:
ia = C(sub2ind(size(C),repelem((1:m).',1,n),repelem(1:n,m,1),repelem(1:n,m,1)));
Now, with all these matrices, indexing is easy:
ind = G(sub2ind(size(G),item,ia));
Finally, let's recap the code of the function:
function ind = item2ind_new(item,t)
t = t(:);
[m,n] = size(item);
mct = max(accumarray(t,1));
G = accumarray(t,1:length(t),[],@(x) {sort(x)});
G = cellfun(@(x) padarray(x.',[0 mct-length(x)],0,'post'), G, 'UniformOutput', false);
G = vertcat(G{:});
C = cumsum(reshape(item,m,1,n)==item,3);
ia = C(sub2ind(size(C),repelem((1:m).',1,n),repelem(1:n,m,1),repelem(1:n,m,1)));
ind = G(sub2ind(size(G),item,ia));
Results
Running the provided script on an old 4-core machine, I get:
Elapsed time is 4.317914 seconds.
Elapsed time is 0.556803 seconds.
ans =
logical
1
The speed-up is substantial (close to 8x), at the cost of higher memory consumption (for the matrix C). I guess this part could be improved to save memory.
EDIT
Generating ia this way can cost a lot of memory. One way to save memory is to build the array directly with a for-loop:
ia = zeros(size(item));
for i=unique(t(:)).'
ia = ia+cumsum(item==i, 2).*(item==i);
end
In all cases, once you have ind, it is easy to test whether there is an error in item compared to t:
any(ind(:)==0)
A simple solution to get items in error (as a mask) is then
min(ind,[],2)==0
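As a sanity check, the same idea can be run on the small example from the question. The sketch below uses the loop-based ia from the EDIT and, as a deliberate simplification that is not part of the answer, fills G with a plain loop instead of padarray (which, as far as I know, requires the Image Processing Toolbox). G is therefore oversized, which is harmless for a small t.

```matlab
t    = [1 1 3 2 2 2 3];
item = [2 3 1; 2 1 2; 3 1 1; 1 3 3];

% G(v,k) = position in t of the k-th occurrence of value v (0 if there is none).
G = zeros(max(t), numel(t));
for v = unique(t(:)).'
    p = find(t == v);
    G(v, 1:numel(p)) = p;
end

% ia(i,j) = how many times item(i,j) has appeared so far in row i of item.
ia = zeros(size(item));
for v = unique(t(:)).'
    mask = (item == v);
    ia = ia + cumsum(mask, 2) .* mask;
end

% Look up the position of the ia-th occurrence of each requested value.
ind = G(sub2ind(size(G), item, ia));
```

On this input ind reproduces the matrix given in the question, [4 3 1; 4 1 5; 3 1 2; 1 3 7].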

Count occurrences of an event by date

I am facing an issue with counting the number of occurrences by date. Suppose I have an Excel file where the data is as follows:
1/1/2001 23
1/1/2001 29
1/1/2001 24
3/1/2001 22
3/1/2001 23
My desired output is:
1/1/2001 3
2/1/2001 0
3/1/2001 2
Though 2/1/2001 doesn't appear in the input, I want it included in the output with a count of 0. This is my current code:
[Value, Time] = xlsread('F:\1km\fire\2001- 02\2001_02.xlsx','Sheet1','A2:D159','',@convertSpreadsheetExcelDates);
tm=datenum(Time);
val=Value(:,4);
data=[tm val];
% a=(datestr(tm));
T1=datetime('9/23/2001');
T2=datetime('6/23/2002');
T = T1:T2;
tm_all=datenum(T);
[~, idx] = ismember(tm_all,data(:,1));
% idx=idx';
out = tm_all(idx);
The ismember function does not seem to work, because the length of tm_all is 274 and the size of data is 158x2
I suggest using datetime instead of datenum to convert your date strings; among other benefits, it makes the whole computation much easier:
tm = datetime({
'1/1/2001';
'1/1/2001';
'1/1/2001';
'3/1/2001';
'3/1/2001'
},'InputFormat','dd/MM/yyyy');
Once you have obtained your datetime vector, the calculation can be achieved as follows:
% Create a sequence of datetimes from the first date to the last date...
T = (min(tm):max(tm)).';
% Map every occurrence to its position in the sequence...
[~,idx] = ismember(tm,T);
% Count the occurrences for every date in the sequence...
C = accumarray(idx,1);
% Put the dates and their counts together into a single variable...
res = table(T,C)
Here is the output:
res =
T C
___________ _
01-Jan-2001 3
02-Jan-2001 0
03-Jan-2001 2
For more information about the functions used within the computation:
accumarray function
ismember function
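If the reporting range is fixed in advance, as with T1:T2 in the question, rather than taken from the data, two small adjustments seem necessary: dates outside the range must be dropped, and accumarray must be told the output size so that trailing empty days still get a zero. A minimal sketch, where the five-day range is a made-up example:

```matlab
% Dates from the question (day/month/year reading: 1, 1, 1, 3, 3 January 2001).
tm = datetime(2001, 1, [1 1 1 3 3]).';

% Hypothetical fixed reporting range, longer than the data itself.
T = (datetime(2001,1,1):datetime(2001,1,5)).';

% tf flags the dates that fall inside the range; idx is their position in T.
[tf, idx] = ismember(tm, T);

% Fix the output size so days after the last observation are counted as 0.
C = accumarray(idx(tf), 1, [numel(T) 1]);

res = table(T, C);
```

Here C comes out as [3; 0; 2; 0; 0].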
On a side note, I couldn't tell whether your dates are in dd/MM/yyyy or MM/dd/yyyy format. With the latter, the approach above cannot produce your desired output; you would instead need to extract the month and count over a monthly (and, if your dates span more than one year, yearly) criterion:
tm = datetime({
'1/1/2001';
'1/1/2001';
'1/1/2001';
'3/1/2001';
'3/1/2001'
},'InputFormat','MM/dd/yyyy');
M = month(tm);
M_seq = (min(M):max(M)).';
[~,idx] = ismember(M,M_seq);
C = accumarray(idx,1);
res = table(datetime(2001,M_seq,1),C)
res =
Var1 C
___________ _
01-Jan-2001 3
01-Feb-2001 0
01-Mar-2001 2
I'll first give the code and then explain step by step.
code:
[Value, Time] = xlsread('stack','A1:D159','',@convertSpreadsheetExcelDates);
tm=datenum(Time);
val=Value(:,4);
data=[tm val];
a=(datestr(tm));
T1=datetime('1/1/2001');
T2=datetime('6/23/2002');
T = T1:T2;
tm_all=datenum(T);
[~, idx] = ismember(tm_all,data(:,1)); % get indices
[occurence,dates]= hist(data(:,1),unique(data(:,1))); % count occurrences of dates from file
t = [0;data(:,1)]; % prepend 0 to dates (needed later because MATLAB indexing starts at 1)
[~,idx] = ismember(t(idx+1),dates); % get indices
q = [0 occurence]; % prepend 0 to occurrences (needed later because MATLAB indexing starts at 1)
occ = q(idx+1); % make vector with occurrences
out = [tm_all' occ']; % output
The idx output of ismember is a 1×length(tm_all) vector whose element i contains the lowest index at which tm_all(i) is found in data(:,1). For example, take A = [1 2 3 4] and B = [1 1 2 4]; then [~,idx] = ismember(A,B) gives
idx = [1 3 0 4]
because A(1) = 1 and the first 1 in B is found at position 1, and so on. If a number in A doesn't occur in B, the corresponding result is 0.
[occurence,dates]= hist(data(:,1),unique(data(:,1))); gives the number of occurrences of each date.
t = [0;data(:,1)]; adds a zero at the beginning so t looks like:
0
'date 1'
'date 2'
'date 3'
'date 4'
...
Why this is done will be explained next.
t(idx+1) is a vector with length(tm_all) elements and is essentially a copy of tm_all, except that a date that doesn't occur in the file is replaced by zero. How does this work? t(i) gives you the value of t at position i, so t([1 5 4 2 9]) is a vector with the values of t at positions 1, 5, 4, 2 and 9. Remember that idx contains the indices of the dates in data(:,1), with 0 for dates that were not found. Because MATLAB indexing starts at 1, idx+1 is needed, and the entries of data(:,1) must be shifted by one position as well. That is done by adding the zero at the beginning.
[~,idx] = ismember(t(idx+1),dates); is the same as before, but idx now contains the indices into dates.
q = [0 occurence]; again adds a zero at the beginning.
occ = q(idx+1); is the row of occurrence counts for the dates.
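The zero-prepending trick can be seen in isolation on a toy input. In the sketch below the day numbers are made up, and unique with accumarray stands in for the hist call (a substitution of mine, not the answer's code); the lookup through q is the same.

```matlab
all_days  = [10 11 12 13];          % stand-in for tm_all (every day of the range)
file_days = [10 10 12 13 13 13];    % stand-in for data(:,1) (days found in the file)

% Distinct days in the file and how often each one occurs.
[dates, ~, k] = unique(file_days);
occurrence = accumarray(k(:), 1).';     % [2 1 3]

% Position of every day of the range in dates; 0 means "not in the file".
[~, idx] = ismember(all_days, dates);   % [1 0 2 3]

% Prepending a 0 makes index 0 (shifted to 1) look up a count of 0.
q   = [0 occurrence];
occ = q(idx + 1);                       % [2 0 1 3]
```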

MATLAB - Returning a matrix of sums of elements corresponding to the same kind

Overview
An n×m matrix A and an n×1 vector Date are the inputs of the function S = sumdate(A,Date).
The function returns an n×m matrix S such that each row of S is the sum of the rows of A that share the same date.
For example, if
A = [1 2 7 3 7 3 4 1 9
6 4 3 0 -1 2 8 7 5]';
Date = [161012 161223 161223 170222 160801 170222 161012 161012 161012]';
Then I would expect the returned matrix S is
S = [15 9 9 6 7 6 15 15 15;
26 7 7 2 -1 2 26 26 26]';
Because the elements Date(2) and Date(3) are the same, we have
S(2,1) and S(3,1) are both equal to the sum of A(2,1) and A(3,1)
S(2,2) and S(3,2) are both equal to the sum of A(2,2) and A(3,2).
Because the elements Date(1), Date(7), Date(8) and Date(9) are the same, we have
S(1,1), S(7,1), S(8,1), S(9,1) equal the sum of A(1,1), A(7,1), A(8,1), A(9,1)
S(1,2), S(7,2), S(8,2), S(9,2) equal the sum of A(1,2), A(7,2), A(8,2), A(9,2)
The same for S([4,6],1) and S([4,6],2)
As the element Date(5) does not repeat, so S(5,1) = A(5,1) = 7 and S(5,2) = A(5,2) = -1.
The code I have written so far
Here is my try on the code for this task.
function S = sumdate(A,Date)
S = A; %Pre-assign S as a matrix in the same size of A.
Dlist = unique(Date); %Sort out a non-repeating list from Date
for J = 1 : length(Dlist)
loc = (Date == Dlist(J)); %Compute a logical indexing vector for locating the J-th element in Dlist
S(loc,:) = repmat(sum(S(loc,:),1),sum(loc),1); %Replace the located rows of S by their column-wise sum (dim 1, so a date with a single row is left unchanged)
end
end
I tested it on my computer using A and Date with these attributes:
size(A) = [33055 400];
size(Date) = [33055 1];
length(unique(Date)) = 2645;
It took my PC about 1.25 seconds to perform the task.
This task is performed hundreds of thousands of times in my project, therefore my code is too time-consuming. I think the performance will be boosted up if I can eliminate the for-loop above.
I have found some built-in functions which do special types of sums like accumarray or cumsum, but I still do not have any ideas on how to eliminate the for-loop.
I would appreciate your help.
You can do this with accumarray, but you'll need to generate a set of row and column subscripts into A to do it. Here's how:
[~, ~, index] = unique(Date); % Get indices of unique dates
subs = [repmat(index, size(A, 2), 1) ... % repmat to create row subscript
repelem((1:size(A, 2)).', size(A, 1))]; % repelem to create column subscript
S = accumarray(subs, A(:)); % Reshape A into column vector for accumarray
S = S(index, :); % Use index to expand S to original size of A
S =
15 26
9 7
9 7
6 2
7 -1
6 2
15 26
15 26
15 26
Note #1: This will use more memory than your for loop solution (subs will have twice the number of elements as A), but may give you a significant speed-up.
Note #2: If you are using a version of MATLAB older than R2015a, you won't have repelem. Instead you can replace that line using kron (or one of the other solutions here):
kron((1:size(A, 2)).', ones(size(A, 1), 1))
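To check the accumarray version against a straightforward loop, the example data from the question can be run through both. This is a sketch for verification only; S_ref and the explicit dimension argument in sum(..., 1) are my additions, the latter so that a date with a single row is summed column-wise rather than across the row.

```matlab
A = [1 2 7 3 7 3 4 1 9
     6 4 3 0 -1 2 8 7 5]';
Date = [161012 161223 161223 170222 160801 170222 161012 161012 161012]';

% accumarray version: one (group, column) subscript pair per element of A.
[nr, nc] = size(A);
[~, ~, index] = unique(Date);
subs = [repmat(index, nc, 1), repelem((1:nc).', nr)];
S = accumarray(subs, A(:));   % one row of sums per unique date
S = S(index, :);              % expand back to one row per original row

% Loop reference, summing along dimension 1 explicitly.
S_ref = A;
Dlist = unique(Date);
for J = 1:numel(Dlist)
    loc = (Date == Dlist(J));
    S_ref(loc, :) = repmat(sum(A(loc, :), 1), sum(loc), 1);
end

assert(isequal(S, S_ref));
```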

How to extract vectors of consecutive numbers?

Suppose that I have a Q vector which is defined as Q = [1 2 3 4 5 8 9 10 15]; and I would like to find a way to extract different vectors of consecutive numbers and also a vector for the rest of the elements. So my result would be like:
q1 = [1 2 3 4 5];
q2 = [8 9 10 ];
q3 = [15];
You can do this using diff, cumsum and accumarray:
q = accumarray(cumsum([1, diff(Q)~=1])', Q', [], @(x){x})
which returns:
{[1,2,3,4,5];
[8,9,10];
[15]}
i.e. q{1} gives you [1,2,3,4,5] etc., which is a far cleaner solution than having separately named vectors. But if you really wanted them, and you know exactly how many groups you will get, you can do it as follows:
[q1,q2,q3] = q{:};
Explanation:
accumarray will apply an aggregation function (4th input) to elements of a vector (2nd input) based on groupings specified in another vector (1st input).
To use the notation in the docs:
sub = cumsum([1, diff(Q)~=1])';
val = Q';
fun = @(x){x};
Note that sub needs to start from 1. The idea is to use diff, which is vectorized, to find elements that are consecutive (i.e. where Q(i+1) - Q(i) == 1). By specifying diff(Q)~=1 we find the breaks between groups of consecutive numbers (concatenating the 1 at the beginning forces a break at the start). cumsum then converts these breaks into a vector in the right form for sub, i.e.
sub = [1 1 1 1 1 2 2 2 3]
The aggregation function we specify simply wraps each group in a cell.
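Spelling out the intermediate vectors may make the steps easier to follow. The names d and isBreak below are only illustrative:

```matlab
Q = [1 2 3 4 5 8 9 10 15];

d       = diff(Q);              % [1 1 1 1 3 1 1 5]
isBreak = [1, d ~= 1];          % [1 0 0 0 0 1 0 0 1], 1 marks the start of a run
sub     = cumsum(isBreak);      % [1 1 1 1 1 2 2 2 3], run number of each element

% Collect the elements of each run into its own cell.
q = accumarray(sub', Q', [], @(x){x});
```

Because sub is already sorted here, the elements inside each cell keep their original order.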

Use MATLAB function handles to reference original matrix

I often have to manipulate a lot of matrices row by row using MATLAB.
Instead of having to type:
m(x, :);
every time I want to reference a particular row, I created an anonymous MATLAB function:
row = @(x) m(x, :);
allowing me to call row(x) and get the correct row back.
But it seems that this anonymous function is actually not the same as calling m(x, :) directly, as the reference to the matrix is lost. So when I call something like:
row(2) = 2 * row(2);
MATLAB returns the error:
error: can't perform indexed assignment for function handle type
error: assignment failed, or no method for 'function handle = matrix'
Is there a way to define a function handle to get around this problem, or am I better off just typing out m(x, :) when I want to reference a row?
Thanks so much!
By defining an anonymous function, you make every row immutable (at least through row). Assigning to the output of a function handle is simply not possible.
Imagine that you define the function handle mySquare = @(x) x.^2;. If reassigning the output of a function handle were possible, you could change the value of mySquare(2) (e.g., mySquare(2)=2) and basically state that 4=2!
On the positive side, your anonymous function "protects" your initial input m from unexpected modifications. If you want to use your function row, you should simply define another matrix m_prime, whose rows are initialized with the function handle row (avoid using m again since it would probably mix everything up).
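The "protection" comes from the fact that an anonymous function stores the value m had when the handle was created. A short sketch of that behaviour, using magic(3) as arbitrary sample data:

```matlab
m = magic(3);            % [8 1 6; 3 5 7; 4 9 2]
row = @(x) m(x, :);      % the handle captures the current value of m

m(2, :) = 0;             % later change to the workspace variable m

r = row(2);              % still the captured row, [3 5 7]
```

So row keeps returning rows of the original matrix, however m is modified afterwards.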
Reference semantics through a handle only work for MATLAB objects whose class inherits from the handle class.
If as you said in comment "elementary row operations is the end application for this", then David's answer is a good simple way to go (or simply keep using m(x,:) which is still the shortest syntax after all).
If you really want to venture into handles and true reference values, then you can create a class rowClass which you specialise in row operations. An example with a few row operations would be:
classdef rowClass < handle
    properties
        m = [] ;
    end
    methods
        %// constructor
        function obj = rowClass(matrix)
            obj.m = matrix ;
        end
        %// row operations on a single row ----------------------------
        function obj = rowinc(obj,irow,val)
            %// increment values of row "irow" by scalar (or vector) "val"
            obj.m(irow,:) = obj.m(irow,:) + val ;
        end
        function obj = rowmult(obj,irow,val)
            %// multiply by a scalar or by a vector element wise
            obj.m(irow,:) = obj.m(irow,:) .* val ;
        end
        function obj = rowsquare(obj,irow)
            %// multiply the row by itself
            obj.m(irow,:) = obj.m(irow,:).^2 ;
        end
        %// row operations between two rows ---------------------------
        function obj = addrows(obj,irow,jrow)
            %// add values of row "jrow" to row "irow"
            obj.m(irow,:) = obj.m(irow,:) + obj.m(jrow,:) ;
        end
        function obj = swaprows(obj,irow,jrow)
            %// swap rows
            obj.m([irow,jrow],:) = obj.m([jrow,irow],:) ;
        end
    end
end
Of course you could add all the operations you frequently do to your rows, or even to the full matrix (or a subset).
Example usage:
%% // sample data
A = (1:5).' * ones(1,5) ; %'// initial matrix
B = rowClass( A ) ; %// make an object out of it
B =
rowClass with properties:
m: [5x5 double]
B.m %// display the matrix
ans =
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
Add a value (12) to all elements of the row(1):
%% // add a value (scalar or vector) to a row
rowinc(B,1,12) ;
B.m
ans =
13 13 13 13 13
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
Square the row(3):
%% // Square row 3
rowsquare(B,3) ;
B.m
ans =
13 13 13 13 13
2 2 2 2 2
9 9 9 9 9
4 4 4 4 4
5 5 5 5 5
Last one for the road, swap row(3) and row(5):
%% // swap rows
swaprows(B,5,3) ;
B.m
ans =
13 13 13 13 13
2 2 2 2 2
5 5 5 5 5
4 4 4 4 4
9 9 9 9 9
I think you'll be best off typing m(x,:)! Typing row(x) is not much quicker. Another issue with the anonymous function row is that it keeps the original matrix m, which won't change.
Here is an anonymous function that does what you want; I think it's a reasonable way of doing things. row(x,i,k) multiplies the i-th row of the matrix x (not necessarily square) by k.
x=rand(5)
row=@(x,i,k) (diag([ones(1,i-1) k ones(1,size(x,1)-(i))]))*x
x=row(x,2,20)
Ultimately, I think the simplest method is to make a standalone function to do each type of row operation. For example,
function x=scalarmult(x,i,k)
x(i,:)=k*x(i,:);
and
function x=addrows(x,i,j)
x(i,:)=x(i,:)+x(j,:);
and
function x=swaprows(x,i,j)
x([i,j],:)=x([j,i],:);