Index of max value by group using varfun - matlab

I have a table with Ids and Dates. I would like to retrieve the index of the max date for each Id.
My initial approach is so:
varfun(@max, table, 'GroupingVariables', 'Id', 'InputVariables', 'Date');
This obviously gives me the date rather than the index.
I noted that the max function will return both the maxvalue and maxindex when specified:
[max_val, max_idx] = max(values);
How can I define an anonymous function using max to retrieve max_idx? I would then use it in varfun to get my result.
I'd prefer not to declare a wrapper function around max() (as opposed to an anonymous function) because:
1. I'm working in a script and would rather not create another function file
2. I'm unwilling to change my current script to a function
Thanks a million guys,

I'm assuming your Ids are positive integers and your Dates are numbers.
If you wanted the maximum Date for each Id, it would be a perfect case for accumarray with the max function. In the following I'll use f to denote a generic function passed to accumarray.
The fact that you want the index of the maximum makes it a little trickier (and more interesting!). The problem is that the Dates corresponding to a given Id are passed to f without any reference to their original index. Therefore, an f based on max can't help. But you can make the indices "pass through" accumarray as imaginary parts of the Dates.
So: if you want just one maximizing index (even if there are several) for each Id:
result = accumarray(t.Id,... %// col vector of Id's
t.Date+1j*(1:size(t,1)).', ... %'// col vector of Dates (real) and indices (imag)
[], ... %// default size for output
@(x) imag(x(find(real(x)==max(real(x)),1)))); %// function f (the ",1" makes find return a single index)
Note that the function f here maximizes the real part and then extracts the imaginary part, which contains the original index.
Or, if you want all maximizing indices for each Id:
result = accumarray(t.Id,... %// col vector of Id's
t.Date+1j*(1:size(t,1)).', ... %'// col vector of Dates (real) and indices (imag)
[], ... %// default size for output
@(x) {imag(x(find(real(x)==max(real(x)))))}); %// function f
If your Ids are strings: transform them into numeric labels using the third output of unique, and then proceed as above:
[~, ~, NumId] = unique(t.Id);
and then either
result = accumarray(NumId,... %// col vector of Id's
t.Date+1j*(1:size(t,1)).', ... %'// col vector of Dates (real) and indices (imag)
[], ... %// default size for output
@(x) imag(x(find(real(x)==max(real(x)),1)))); %// function f (the ",1" makes find return a single index)
or
result = accumarray(NumId,... %// col vector of Id's
t.Date+1j*(1:size(t,1)).', ... %'// col vector of Dates (real) and indices (imag)
[], ... %// default size for output
@(x) {imag(x(find(real(x)==max(real(x)))))}); %// function f
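As a quick check of the imaginary-part trick, here is a minimal sketch on made-up data. The table contents are hypothetical (they are not from the question), and it assumes numeric Ids and numeric Dates as stated above:

```matlab
% Hypothetical sample data: numeric Ids and numeric Dates
t = table([1;2;1;2;3], [10;40;30;20;50], 'VariableNames', {'Id','Date'});

rowIdx = (1:height(t)).';              % original row indices
result = accumarray(t.Id, ...          % group by Id
    t.Date + 1j*rowIdx, ...            % Dates in the real part, row indices in the imaginary part
    [], ...
    @(x) imag(x(find(real(x) == max(real(x)), 1))));  % index of one maximizing Date

disp(result.')   % row index of the latest Date for Id 1, 2 and 3
```

For this data, Id 1 has its largest Date in row 3, Id 2 in row 2, and Id 3 in row 5.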

I don't think varfun is the right approach here, as
varfun(func,A) applies the function func separately to each variable of
the table A.
This would only make sense if you wanted to apply it to multiple columns.
Simple approach:
Simply go with the loop approach: First find the different IDs using unique, then for each ID find the indices of the maximum dates. (This assumes your dates are in a numerical format which can be compared directly using max.)
I did rename your variable table to t, as otherwise we would be overwriting the built-in function table.
uniqueIds = unique(t.Id);
for i = 1:numel(uniqueIds)
    equalsCurrentId = t.Id==uniqueIds(i);
    globalIdxs = find(equalsCurrentId);
    [~, localIdxsOfMax] = max(t.Date(equalsCurrentId));
    maxIdxs{i} = globalIdxs(localIdxsOfMax);
end
As you mentioned your Ids are actually strings instead of numbers, you will have to change the line: equalsCurrentId = t.Id==uniqueIds(i); to
equalsCurrentId = strcmp(t.Id, uniqueIds{i});
Approach using accumarray:
If you prefer a more compact style, you could use this solution inspired by Luis Mendo's answer, which should work for both numerical and string Ids:
[uniqueIds, ~, global2Unique] = unique(t.Id);
maxDateIdxsOfIdxSubset = @(I) {I(nth_output(2, @max, t.Date(I)))};
maxIdxs = accumarray(global2Unique, 1:length(t.Id), [], maxDateIdxsOfIdxSubset);
This uses nth_output of gnovice's great answer.
Usage:
Both above solutions will yield: A vector uniqueIds with a corresponding cell-array maxIdxs, in a way that maxIdxs{i} are the indices of the maximum dates of uniqueIds(i).
If you only want a single index, even though there are multiple entries where the maximum is attained, use the following to strip away the unwanted data:
maxIdxs = cellfun(@(X) X(1), maxIdxs);
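If a helper such as nth_output is not available, one possible alternative is a sort-based approach. This is a different technique from the two above, sketched here on hypothetical sample data and assuming numeric Dates: sort the rows by group and then by date, so the last row of each group is the one with the maximum date.

```matlab
% Hypothetical sample data (not from the question): string Ids, numeric Dates
t = table({'a';'b';'a';'b';'c'}, [10;40;30;20;50], 'VariableNames', {'Id','Date'});

[uniqueIds, ~, g] = unique(t.Id);             % g maps each row to its group number
[~, order] = sortrows([g, t.Date]);           % sort by group, then by date
isLastOfGroup = [diff(g(order)) ~= 0; true];  % last row of each group holds its max date
maxIdx = order(isLastOfGroup);                % original row index of the max date per Id
```

Here maxIdx(i) is the row of the maximum date for uniqueIds{i}. When the maximum is attained more than once, this sketch returns the last such row rather than the first.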

Related

Extract values from a vector and sort them based on their original sequence

I have a vector of numbers (temperatures), and I am using the MATLAB function mink to extract the 5 smallest numbers from the vector to form a new variable. However, the numbers extracted using mink are automatically ordered from lowest to largest (of those 5 numbers). Ideally, I would like to retain the sequence of the numbers as they are arranged in the original vector. I hope my problem is easy to understand. I appreciate any advice.
The function mink that you use was introduced in MATLAB R2017b. It has (as Andras Deak mentioned) two output arguments:
[B,I] = mink(A,k);
The second output argument, I, contains the indices, such that B == A(I).
To obtain the set B but sorted as they appear in A, simply sort the vector of indices I:
B = A(sort(I));
For example:
>> A = [5,7,3,1,9,4,6];
>> [~,I] = mink(A,3);
>> A(sort(I))
ans =
3 1 4
For older versions of MATLAB, it is possible to reproduce mink using sort:
function [B,I] = mink(A,k)
[B,I] = sort(A);
B = B(1:k);
I = I(1:k);
Note that, in the above, you don't need the B output. Your ordered_mink can be written as follows:
function B = ordered_mink(A,k)
[~,I] = sort(A);
B = A(sort(I(1:k)));
Note: This solution assumes A is a vector. For matrix A, see Andras' answer, which he wrote up at the same time as this one.
First you'll need the corresponding indices for the extracted values from mink using its two-output form:
[vals, inds] = mink(array, k); % k is the number of smallest elements to extract
Then you only need to order the items in vals according to increasing indices in inds. There are multiple ways to do this, but they all revolve around sorting inds and using the corresponding order on vals. The simplest way is to put these vectors into a matrix and sort the rows (this assumes array is a column vector, so that inds and vals are columns too):
sorted_rows = sortrows([inds, vals]); % sort on indices
and then just extract the corresponding column
reordered_vals = sorted_rows(:,2); % items now ordered as they appear in "array"
A less straightforward possibility for doing the sorting after the above call to mink is to take the sorting order of inds (the second output of sort) and apply that same order to vals:
[~, order] = sort(inds); % order(j) is the position in inds of the j-th smallest index
reordered_vals = vals(order); % should be the same as previously
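To compare the variants discussed in this thread side by side, here is a small sketch. It builds vals and inds with sort, so it should also run on releases without mink; the variable names viaIndexing, viaSortrows and viaOrder are made up for illustration.

```matlab
A = [5;7;3;1;9;4;6];   % column vector, same values as the example above
k = 3;

% Equivalent of [vals, inds] = mink(A, k) using sort
[sortedVals, sortedIdx] = sort(A);
vals = sortedVals(1:k);
inds = sortedIdx(1:k);

viaIndexing = A(sort(inds));          % index A with the sorted indices

rows = sortrows([inds, vals]);        % sort [index, value] pairs by index
viaSortrows = rows(:,2);

[~, order] = sort(inds);              % reorder vals by the sort order of inds
viaOrder = vals(order);
```

All three results hold the three smallest values in the order in which they appear in A.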

How to marginalise a vector with index?

I have this code:
A = [3,1,5,8]
B = [0, 0]
indexB = [1,2,2,1]
for i = 1:4
B(indexB(i)) = B(indexB(i)) + A(i)
end
So, in the end, I got
B = [11, 6]
I wonder if I can use a more efficient way to sum up instead of using the for-loop?
Classic use of accumarray. Only this time, you accumulate the entries in A then add this on top of B as B is the starting point of the summation:
B = B(:); % Force into columns
B = B + accumarray(indexB(:), A(:));
How accumarray works is quite simple. You can think of it as a miniature MapReduce paradigm. Simply put, for each data point we have, there is a key and an associated value. The goal of accumarray is to place (or bin) all of the values that belong to the same key and do some operation on all of these values. In our case, the "key" would be the values in indexB where each element is a location to index into B. The values themselves are those from A. We would then want to add up all of the values that belong to each location in indexB together. Thankfully, the default behaviour for accumarray is to add all of these values. Specifically, the output of accumarray would be an array where each position computes the sum of all values that mapped to a key. For example, the first position would be the summation of all values that mapped to the key of 1, the second position would be the summation of all values that mapped to the key of 2 and so on.
Because you are using B as a starting point, the end result would be to take the summation result from accumarray and add this on top of B thus completing the code.
Minor Note
I do have to point out that accumarray works by columns. Because you are using rows, I had to force the input so that they are columns, which is the purpose of the (:) syntax. The output will also be as a column so you can transpose that if you wish to have it in a row format.
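One detail worth sketching: by default the output of accumarray is only as long as the largest index in indexB, so if some trailing positions of B never appear in indexB the addition would fail on a size mismatch. Passing an explicit output size should avoid that. The snippet below reuses the data from the question; the variable name sums is just for illustration.

```matlab
A = [3,1,5,8];
B = [0,0];
indexB = [1,2,2,1];

% Request an output with exactly numel(B) rows, then add it to B as a row
sums = accumarray(indexB(:), A(:), [numel(B) 1]);
B = B + sums.';
```

This keeps B as a row vector and reproduces the result of the loop, B = [11, 6].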

Find index of minimum value for each unique value in MATLAB [duplicate]

I have two vectors of length 16. The first one, r, for example is:
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
r contains a list of IDs. I want to collect the indices of the duplicate IDs in r so that each group is a list of indices for one ID. I would then use these indices to access a second vector a and find the maximum value incident on the indices for each group.
Therefore, I would like to produce an output vector using r and a such that:
max(a(1),a(5)), max(a(2),a(6)), a(3), a(7), max(a(4),a(8)), max(a(9),a(13)), max(a(10),a(14)), max(a(11),a(15)), max(a(12),a(16))
I also want to keep the indices of the maximum values. How can I efficiently implement this in MATLAB?
You can use the third output of unique to assign each unique number in r a unique ID. You can then bin all of the numbers that share the same ID with an accumarray call where the key is the unique ID and the value is the actual value of a for the corresponding position of the key in this unique ID array. accumarray then applies max to each bin, which selects the maximum element of a for each unique value in r:
%// Define r and a
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
a = [...];
%// Relevant code
[~,~,id] = unique(r, 'stable');
out = accumarray(id(:), a(:), [], @max);
The 'stable' flag in unique is important because we want to assign unique IDs in order of occurrence. Not doing this will sort the values in r first before assigning IDs and that's not what we want.
Here's a quick example. Let me set up your problem with generating a random 16 element array stored in a which you are trying to ultimately index. We'll also set up r:
rng(123);
a = rand(16,1);
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
This is what a looks like:
>> a
a =
0.6965
0.2861
0.2269
0.5513
0.7195
0.4231
0.9808
0.6848
0.4809
0.3921
0.3432
0.7290
0.4386
0.0597
0.3980
0.7380
After running through the code, we get this:
out =
0.7195
0.4231
0.2269
0.6848
0.9808
0.4809
0.3921
0.3980
0.7380
You can verify for yourself that this gives the right result. Specifically, the first element is the maximum of a(1) and a(5), which are 0.6965 and 0.7195 respectively, so the maximum is 0.7195. Similarly, the second element is the maximum of a(2) and a(6), which are 0.2861 and 0.4231, so the maximum is 0.4231, and so on.
If it is your desire to also remember what the indices were used to select out the maximum element, this will be slightly more complicated. What you need to do is call accumarray once again, but the values won't be those of a but the actual index values instead. You'd use the second output of max to get the actual location of the value chosen. However, with the nature of max, we can't just grab the second element of max without explicitly calling the two-output version of max (I really wish there was another way around this... Python has a function in NumPy called numpy.argmax) and this can't be properly encapsulated in an anonymous function (i.e. @(x) ...), so you're going to need to create a custom function to do that.
Create a new function called maxmod and save it to a file called maxmod.m. You'd put this inside the function:
function p = maxmod(vals, ind)
[~,ii] = max(vals(ind));
p = ind(ii);
This takes in an array, called vals, and a set of indices into that array, called ind. We'd then find the maximum of these selected values, then return which index gave us the maximum.
After, you'd call accumarray like so:
%// Define r and a
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
a = [...];
%// Relevant code
[~,~,id] = unique(r, 'stable');
out = accumarray(id(:), (1:numel(r)).', [], @(x) maxmod(a,x));
This is now what I get:
>> out
out =
5
6
3
8
7
9
10
15
16
If you look at each value, this reflects which location of a we chose that corresponds to the maximum of each group.
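If creating maxmod.m is inconvenient, one possible workaround is to avoid the two-output form of max altogether and locate the maximum with find, which does fit in an anonymous function. This is a sketch rather than part of the answer above: the name pickMax is made up, and the values of a are copied from the printed listing (rounded to four decimals) instead of being regenerated with rng.

```matlab
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
a = [0.6965;0.2861;0.2269;0.5513;0.7195;0.4231;0.9808;0.6848; ...
     0.4809;0.3921;0.3432;0.7290;0.4386;0.0597;0.3980;0.7380];

[~,~,id] = unique(r, 'stable');

% x holds the positions in a that belong to one group;
% return the first position whose value equals the group maximum
pickMax = @(x) x(find(a(x) == max(a(x)), 1));
out = accumarray(id(:), (1:numel(r)).', [], pickMax);
```

For these values, out matches the index list shown above. If a group has ties, this returns only one of the tied positions.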

how to transfer the pairwise distance value in a distance matrix

I am pretty new to MATLAB, and I want to use it for a clustering job.
If I have three columns of values:
id1 id2 distvalue1
id1 id3 distvalue2
....
id2 id4 distvalue i
.....
There are 5000 ids in total, but some id pairs are missing a distance value.
In Python I can use loops to import these distance values into matrix form. How can I do it in MATLAB?
Also, how do I let MATLAB know that id1, ..., idx are identifiers and the third column is the value?
Thanks!
Based on the comments, you know how to get the data into the form of an N x 3 matrix, called X, where X(:,1) is the first index, X(:,2) is the second index, and X(:,3) is the corresponding distance.
Let's assume that the indices (id1... idx) are arbitrary numeric labels.
So then we can do the following:
% First, build a list of all the unique indices
indx = unique([X(:,1); X(:,2)]);
Nindx = length(indx);
% Second, initialize an empty connection matrix, C
C = zeros(Nindx, Nindx); %or you could use NaN(Nindx, Nindx)
% Third, loop over the rows of X, and map them to points in the matrix C
for n = 1:size(X,1)
    row = find(X(n,1) == indx);
    col = find(X(n,2) == indx);
    C(row,col) = X(n,3);
end
This is not the most efficient method (that would be to remap the indices of X to the range [1... Nindx] in a vectorized manner), but it should be fine for 5000 ids.
If you end up dealing with very large numbers of unique indices, for which only very few of the index-pairs have assigned distance values, then you may want to look at using sparse matrices -- try help sparse -- instead of pre-allocating a large zero matrix.
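The vectorized remapping and the sparse variant mentioned above could look roughly like this. It is a sketch on a tiny made-up X (the ids and distances are hypothetical), and it assumes each id pair appears at most once.

```matlab
% Hypothetical input: columns are [id1, id2, distance]
X = [10 20 1.5;
     10 30 2.0;
     20 40 0.7];

n = size(X,1);
[indx, ~, pos] = unique([X(:,1); X(:,2)]);   % pos remaps every id to 1..Nindx
Nindx = numel(indx);
row = pos(1:n);                              % remapped first ids
col = pos(n+1:end);                          % remapped second ids

% Dense version: fill all entries at once via linear indices
C = zeros(Nindx, Nindx);
C(sub2ind([Nindx Nindx], row, col)) = X(:,3);

% Sparse version: stores only the pairs that have a distance
S = sparse(row, col, X(:,3), Nindx, Nindx);
```

One difference to keep in mind: if a pair did occur more than once, sparse would add the duplicate distances together, whereas the dense assignment would keep only one of them.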

N-Dimensional Histogram Counts

I am currently trying to code up a function to assign probabilities to a collection of vectors using a histogram count. This is essentially a counting exercise, but requires some finesse to be able to achieve efficiently. I will illustrate with an example:
Say that I have a matrix X = [x1, x2....xM] with N rows and M columns. Here, X represents a collection of M, N-dimensional vectors. In other words, each of the columns of X is an N-dimensional vector.
As an example, we can generate such an X for M = 10000 vectors and N = 5 dimensions using:
X = randint(5,10000)
This will produce a 5 x 10000 matrix of 0s and 1s, where each column represents a 5-dimensional vector of 1s and 0s.
I would like to assign a probability to each of these vectors through a basic histogram count. The steps are simple: first find the unique columns of X; second, count the number of times each unique column occurs. The probability of a particular occurrence is then the number of times this column was in X / total number of columns in X.
Returning to the example above, I can do the first step using the unique function in MATLAB as follows:
UniqueXs = unique(X','rows')'
The code above will return UniqueXs, a matrix with N rows that only contains the unique columns of X. Note that the transposes are due to weird MATLAB input requirements.
However, I am unable to find a good way to count the number of times each of the columns in UniqueXs is in X. So I'm wondering if anyone has any suggestions?
Broadly speaking, I can think of two ways of achieving the counting step. The first way would be to use the find function, though I think this may be slow since find is an elementwise operation. The second way would be to call unique recursively as it can also provide the index of one of the unique columns in X. This should allow us to remove that column from X and redo unique on the resulting X and keep counting.
Ideally, I think that unique might already be doing some counting so the most efficient way would probably be to work without the built-in functions.
Here are two solutions, one assumes all values are either 0's or 1's (just like the example in your description), the other does not. Both codes should be very fast (more so the one with binary values), even on large data.
1) only zeros and ones
%# random vectors of 0's and 1's
x = randi([0 1], [5 10000]); %# RANDINT is deprecated, use RANDI instead
%# convert each column to a binary string
str = num2str(x', repmat('%d',[1 size(x,1)])); %'
%# convert binary representation to decimal number
num = (str-'0') * (2.^(size(str,2)-1:-1:0))'; %'# num = bin2dec(str);
%# count how many times each number occurs
count = accumarray(num+1,1); %# num+1 since it starts at zero
%# assign probability based on count
prob = count(num+1)./sum(count);
2) any positive integer
%# random vectors with values 0:MAX_NUM
x = randi([0 999], [5 10000]);
%# format vectors as strings (zero-filled to a constant length)
nDigits = max(1, floor(log10( max(x(:)) )) + 1); %# digits of the largest value (also correct for exact powers of 10)
frmt = repmat(['%0' num2str(nDigits) 'd'], [1 size(x,1)]);
str = cellstr(num2str(x',frmt)); %'
%# find unique strings, and convert them to group indices
[G,GN] = grp2idx(str);
%# count frequency of occurrence
count = accumarray(G,1);
%# assign probability based on count
prob = count(G)./sum(count);
Now we can see for example how many times each "unique vector" occurred:
>> table = sortrows([GN num2cell(count)])
table =
'000064850843749' [1] # original vector is: [0 64 850 843 749]
'000130170550598' [1] # and so on..
'000181606710020' [1]
'000220492735249' [1]
'000275871573376' [1]
'000525617682120' [1]
'000572482660558' [1]
'000601910301952' [1]
...
Note that in my example with random data, the vector space becomes very sparse (as you increase the maximum possible value), thus I wouldn't be surprised if all counts were equal to 1...
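For comparison, the counting step can probably also be done without string conversion or grp2idx, by taking the third output of unique with the 'rows' option and feeding it to accumarray. This is a sketch of that alternative on a tiny hand-made X (hypothetical values), not the method used in the answer above.

```matlab
% Hypothetical input: each column of X is one vector (N = 2, M = 5)
X = [0 1 0 1 1;
     1 1 1 0 1];

[uniqueRows, ~, g] = unique(X.', 'rows');   % g(j) = which unique vector column j is
UniqueXs = uniqueRows.';                    % unique vectors as columns again
count = accumarray(g, 1);                   % occurrences of each unique vector
prob = count(g) / size(X, 2);               % probability assigned to each column of X
```

Here the vectors [0;1] and [1;1] each occur twice and [1;0] once, so the columns of X get probabilities 0.4, 0.4, 0.4, 0.2 and 0.4.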