Creating cumulative matrix which accounts for column start points - matlab

I have a simple example matrix as follows: (The actual matrix I'm working on is 674x11 and is not simply all '1' elements).
a =
1 1 1 NaN NaN
1 1 1 NaN NaN
1 1 1 1 NaN
1 1 1 1 1
1 1 1 1 1
I want to create a cumulative matrix which accounts for the fact that numeric elements start in each column at different rows. I want to achieve this by replacing the NaN value above the first numeric element in each column with the mean of that row.
So instead of:
cumsum(a)=
1 1 1 NaN NaN
2 2 2 NaN NaN
3 3 3 1 NaN
4 4 4 2 1
5 5 5 3 2
what I want to achieve is:
cumsum(a) =
1 1 1 NaN NaN
2 2 2 2 NaN
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
where element (2,4) is the mean of a(2,1:3) and element (3,5) is the mean of a(3,1:4).

You can compute the mean of each row (ignoring the NaN values) by using nanmean. We can then use find to identify the row of each NaN and replace those values with the mean of that row. Finally, we follow that up with the cumsum operation.
% Get the rows of each NaN value
bool = isnan(a);
[row,col] = find(bool);
% Compute the mean value of each row
rowmeans = nanmean(a, 2);
% Replace the NaN values with their row means
a(bool) = rowmeans(row);
% Perform the cumulative sum
result = cumsum(a);
If you want to leave the initial NaN values as NaN values afterwards, then you can follow it up with
result(bool) = NaN;
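For reference, the steps above can be put together as one runnable script. This is only a minimal sketch on the 5x5 example from the question. It swaps nanmean for mean(..., 'omitnan') (available from R2015a), which avoids the toolbox dependency, and the names nanMask, rowMeans and filled are illustrative, not taken from the answer.

```matlab
% Example matrix from the question
a = [1 1 1 NaN NaN
     1 1 1 NaN NaN
     1 1 1 1   NaN
     1 1 1 1   1
     1 1 1 1   1];

nanMask = isnan(a);                 % logical mask of the NaN entries
[row, ~] = find(nanMask);           % row index of each NaN (column-major order)
rowMeans = mean(a, 2, 'omitnan');   % row means, ignoring NaN (assumed swap for nanmean)

filled = a;
filled(nanMask) = rowMeans(row);    % replace each NaN by the mean of its row
result = cumsum(filled);            % cumulative sum down each column
```

On this example every column of result counts 1 to 5. Note that a row consisting only of NaN values would keep its NaNs, since its row mean is itself NaN.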

Related

Cumulative matrix which accounts for column start points

I have a simple example dataset below:
a =
1 1 1 NaN NaN
1 1 1 NaN NaN
1 1 1 1 NaN
1 1 1 1 1
1 1 1 1 1
I want to work out the average cumulative value per row. However, cumsum gives the following output:
cumsum(a)
1 1 1 NaN NaN
2 2 2 NaN NaN
3 3 3 1 NaN
4 4 4 2 1
5 5 5 3 2
Then calculating a row mean of the cumulative matrix gives:
nanmean(cumsum(a),2)
1
2
2.5
3
4
I want to be able to account for the fact that different columns start later i.e. the row mean values for rows (3:5) are reduced with respect to their true values due to low numbers in columns (4:5).
I want to achieve this by replacing the last NaN above the first numeric element in each column in the matrix (a) with the mean of the other columns in that row in the cumulative matrix. This would need to be done iteratively to reflect the changing values in the cumulative matrix. So the new matrix would first look as follows:
(a)
1 1 1 NaN NaN
1 1 1 *2* NaN
1 1 1 1 NaN
1 1 1 1 1
1 1 1 1 1
which would lead to:
cumsum(a)
1 1 1 NaN NaN
2 2 2 2 NaN
3 3 3 3 NaN
4 4 4 4 1
5 5 5 5 2
and then iteratively, (a) would equal:
(a)
1 1 1 NaN NaN
1 1 1 2 NaN
1 1 1 1 *3*
1 1 1 1 1
1 1 1 1 1
which would lead to:
cumsum(a)
1 1 1 NaN NaN
2 2 2 2 NaN
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
which would give the desired row mean values as:
nanmean(cumsum(a),2)
1
2
3
4
5
There may be a way to further vectorise this. However, I think that because each row depends on the previous values, you have to update the matrix row-by-row as follows:
% Cycle through each row in matrix
for i = 1:size(a, 1)
if i > 1
% This makes elements equal to the sum of themselves and above element
% Equivalent outcome to cumsum
a(i,:) = a(i,:) + a(i-1,:);
end
% Replace all NaN values in the row with the average of the non-NaN values
a(i,isnan(a(i,:))) = mean(a(i,~isnan(a(i,:))));
end
This replicates your input and output examples. It doesn't replicate all your iterative steps; in fact it uses far fewer steps, only 5 (the number of rows) for the entire operation.
Edit: equally,
for i = 1:size(a, 1)
% Replace all NaN values in the row with the average of the non-NaN values
a(i,isnan(a(i,:))) = mean(a(i,~isnan(a(i,:))));
end
a = cumsum(a);
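To check that the two loop versions really agree, here is a small sketch on a made-up matrix that is not all ones. It assumes, as in the question, that NaN values only appear at the top of a column and that every row has at least one numeric entry; the matrix and the names c1, c2 and nanCols are hypothetical.

```matlab
% Hypothetical test matrix: columns start at rows 1, 2 and 3
a = [2 NaN NaN
     4 6   NaN
     1 1   1
     3 5   7];

% Version 1: accumulate row by row, then fill NaNs with the row mean
c1 = a;
for i = 1:size(c1, 1)
    if i > 1
        c1(i,:) = c1(i,:) + c1(i-1,:);
    end
    nanCols = isnan(c1(i,:));
    c1(i, nanCols) = mean(c1(i, ~nanCols));
end

% Version 2: fill NaNs with the row mean first, then take cumsum
c2 = a;
for i = 1:size(c2, 1)
    nanCols = isnan(c2(i,:));
    c2(i, nanCols) = mean(c2(i, ~nanCols));
end
c2 = cumsum(c2);
```

Both c1 and c2 come out as [2 2 2; 6 8 7; 7 9 8; 10 14 15] here. With non-integer data the two may differ by rounding error, so a tolerance-based comparison would be safer than isequal.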

Getting row and column numbers of valid elements in a matrix

I have a 3x3 matrix, populated with NaN and values of a variable:
NaN 7 NaN
5 NaN 0
NaN NaN 4
matrix = [NaN 7 NaN; 5 NaN 0; NaN NaN 4]
I would like to get the row and column numbers of non-NaN cells and put them in a matrix together with the value of the variable. That is, I would like to obtain the following matrix:
row col value
1 2 7
2 1 5
2 3 0
3 3 4
want = [1 2 7; 2 1 5; 2 3 0; 3 3 4]
Any help would be highly appreciated.
This can be done without loops:
[jj, ii, kk] = find((~isnan(matrix).*(reshape(1:numel(matrix), size(matrix)))).');
result = [ii jj matrix(kk)];
The trick is to multiply ~isnan(matrix) by a matrix of indices so that the third output of find gives the linear index of non-NaN entries. The transpose is needed to have the same order as in the question.
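If the index-multiplication trick feels opaque, a possibly more readable variant is to call find on the transposed mask and look the values up with sub2ind. This is just a sketch of an alternative, not the answer's method; the names col, row and vals are illustrative.

```matlab
matrix = [NaN 7 NaN; 5 NaN 0; NaN NaN 4];

% find works column by column, so searching the transpose walks the
% original matrix row by row; its outputs are (col, row) of the original
[col, row] = find(~isnan(matrix.'));

% Look up the values at those (row, col) positions
vals = matrix(sub2ind(size(matrix), row, col));

result = [row col vals];
```

For the example this gives the rows in the same order as in the question: [1 2 7; 2 1 5; 2 3 0; 3 3 4].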
The following should work!
[p, q] = find(~isnan(matrix)); % row and column indices of the non-NaN entries
want = zeros(numel(p), 3); % three columns, one row per non-NaN entry
for i = 1:numel(p)
want(i,:) = [p(i) q(i) matrix(p(i), q(i))];
end
Should give you the correct result which is:
2 1 5
1 2 7
2 3 0
3 3 4
If you don't mind the ordering of the rows, you can use a simplified version of Luis Mendo's answer:
[row, col] = find(~isnan(matrix));
result = [row(:), col(:), matrix(~isnan(matrix))];
Which will result in:
2 1 5
1 2 7
2 3 0
3 3 4
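If the ordering does matter, one option is to sort the rows afterwards. A minimal sketch, assuming sortrows with its default behaviour (sort by first column, then second) is acceptable; the variable valid is an illustrative name.

```matlab
matrix = [NaN 7 NaN; 5 NaN 0; NaN NaN 4];

valid = ~isnan(matrix);          % mask of the non-NaN entries
[row, col] = find(valid);        % indices in column-major order

% Sort by row, then by column, to match the order in the question
result = sortrows([row, col, matrix(valid)]);
```

This returns [1 2 7; 2 1 5; 2 3 0; 3 3 4].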

Replace NaN sequence according to values before and after the sequence

I would appreciate if someone can help me with this problem...
I have a vector
A = [NaN 1 1 1 1 NaN NaN NaN NaN NaN 2 2 2 NaN NaN NaN 2 NaN NaN 3 NaN NaN];
I would like to fill the NaN values according to this logic.
1) if the value that precedes the sequence of NaN is different from the one that follows the sequence => assign half of the NaNs to the first value and half to the second value
2) if the NaN sequence is between 2 equal values => fill the NaN with that value.
A should be then:
A = [1 1 1 1 1 1 1 (1) 2 2 2 2 2 2 2 2 2 2 3 3 3]
I have put one 1 within brackets because I assigned that value to the first half: that sequence of NaNs has odd length.
I am typing this on my phone, without MATLAB, so there may be some issues. But this should be close:
t = 1:numel(A);
Anew = interp1(t(~isnan(A)),A(~isnan(A)),t,'nearest','extrap');
If you have the image processing toolbox, you can use bwdist to calculate the index of the nearest non-NaN-neighbor:
nanMask = isnan(A);
[~,idx] = bwdist(~nanMask);
A(nanMask) = A(idx(nanMask));
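Both approaches above leave the tie in the middle of an odd-length run to the library's rounding. If the rule from the question has to be followed exactly, an explicit loop over the NaN runs is one way to do it. This is a rough sketch, assuming A is a row vector with at least one non-NaN value, that the middle element of an odd run goes to the preceding value, and that leading or trailing runs copy their only neighbour; runStart, runEnd, nFirst and filled are illustrative names.

```matlab
A = [NaN 1 1 1 1 NaN NaN NaN NaN NaN 2 2 2 NaN NaN NaN 2 NaN NaN 3 NaN NaN];

nanMask  = isnan(A);
d        = diff([0 nanMask 0]);      % +1 at run starts, -1 just after run ends
runStart = find(d == 1);
runEnd   = find(d == -1) - 1;

filled = A;
for k = 1:numel(runStart)
    s = runStart(k);
    e = runEnd(k);
    if s == 1
        filled(s:e) = A(e+1);        % leading run: copy the following value
    elseif e == numel(A)
        filled(s:e) = A(s-1);        % trailing run: copy the preceding value
    else
        nFirst = ceil((e - s + 1) / 2);       % odd middle goes to the first half
        filled(s:s+nFirst-1) = A(s-1);        % first half: preceding value
        filled(s+nFirst:e)   = A(e+1);        % second half: following value
    end
end
```

For the vector in the question this yields eight 1s, ten 2s and four 3s, with position 8 (the bracketed one) assigned to 1.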

finding clustered NaNs but leaving lonely NaNs alone

I have an incomplete dataset,
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]'
I wish to identify a cluster of NaNs, that is, a run in which the number of consecutive NaNs is 2 or more. How do I do that?
You could do something like this:
aux = diff([0; isnan(N); 0]);
clusters = [find(aux == 1) find(aux == -1) - 1];
Then clusters will be an M-by-2 matrix, where M is the number of NaN clusters (all of them), and each row gives you the start and end index of one cluster.
In this example, that would be:
clusters =
1 1
5 5
8 9
15 19
It means you have 4 NaN clusters, and cluster one ranges from index 1 to index 1, cluster two ranges from 5 to 5, cluster three ranges from 8 to 9 and cluster four ranges from 15 to 19.
If you want only the clusters with at least K NaNs, you could do it like this (for example, with K = 2):
K = 2;
clusters(clusters(:,2) - clusters(:,1) + 1 >= K, :)
That would give you this:
ans =
8 9
15 19
That is, clusters 8-9 and 15-19 have 2 or more NaNs.
Explanation:
Finding the clusters
isnan(N) gives you a logical vector containing the NaNs as ones:
N --------> NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN
isnan(N) -> 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 1 1 1 1
We want to know where each sequence of ones starts, so we use diff, which calculates each value minus the previous one, and gives us this:
aux = diff(isnan(N));
N ----> NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN
aux --> -1 0 0 1 -1 0 1 0 -1 0 0 0 0 1 0 0 0 0
Here a 1 indicates a group start and a -1 indicates a group end. But this misses the first group start and the last group end: the leading 1 is absent because the first element of N has no previous element, and the trailing -1 is absent because there is nothing after the last element of N. A common fix is to add a zero before and after the array, which gives us this:
aux = diff([0; isnan(N); 0]);
N ----> NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN
aux --> 1 -1 0 0 1 -1 0 1 0 -1 0 0 0 0 1 0 0 0 0 -1
Notice two things:
If the diff at index i is 1, N(i) is the start of the NaN block.
If the diff at index i is -1, N(i - 1) is the end of the NaN block.
To get the start and end, we use find to get the indexes where aux == 1 and aux == -1. Hence, we call find twice, and concatenate both calls using [ and ]:
aux = diff([0; isnan(N); 0]);
clusters = [find(aux == 1) find(aux == -1) - 1];
Filtering the clusters which have K or more elements
The last step is to find clusters which have K or more elements. To do that, we first take the cluster matrix, subtract the first column from the second, and add 1, like this:
clusters(:,2) - clusters(:,1) + 1
ans =
1
1
2
5
It means clusters 1 and 2 have 1 NaN each, cluster 3 has 2 NaNs and cluster 4 has 5 NaNs. If we ask which values are greater than or equal to K, we get this:
clusters(:,2) - clusters(:,1) + 1 >= K
ans =
0
0
1
1
It's a logical array. We can use that to index only the 1 (true) rows of the cluster matrix, like this:
clusters(clusters(:,2) - clusters(:,1) + 1 >= K, :)
ans =
8 9
15 19
It's like asking: give us only the clusters where the rows match the ones on this logical vector, and give us all columns (denoted by the :).
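Putting the pieces together, the whole procedure can be sketched as a single script. The last step, which expands each kept cluster into the individual indices of its NaNs, is an addition for illustration and not part of the answer above; starts, stops, runLen, keep and idx are illustrative names.

```matlab
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]';
K = 2;                                  % minimum run length to count as a cluster

aux    = diff([0; isnan(N); 0]);        % +1 at run starts, -1 just after run ends
starts = find(aux == 1);
stops  = find(aux == -1) - 1;
runLen = stops - starts + 1;            % number of NaNs in each run

keep     = runLen >= K;                 % logical mask of the long-enough runs
clusters = [starts(keep) stops(keep)];

% Optional: list every index that belongs to a kept cluster
idx = cell2mat(arrayfun(@(s, e) (s:e)', clusters(:,1), clusters(:,2), ...
                        'UniformOutput', false));
```

With K = 2 this gives clusters = [8 9; 15 19] and idx = [8 9 15 16 17 18 19]'.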
Here is a modular solution:
% the number of NaN you consider as a cluster
num = 3;
% moving-sum filter (a moving average without the normalization)
Z = filter(ones(num,1),1,isnan(N));
x = arrayfun(@(x) find(Z == num) - num + x, 1:num,'uni',0)
y = unique(cell2mat(x))
(UPDATE: faster version below)
gives for num = 1:
y = 1 5 8 9 15 16 17 18 19
for num = 2:
y = 8 9 15 16 17 18 19
for num = 3, num = 4 and num = 5:
y = 15 16 17 18 19
and finally for num = 6 ... and more
y = Empty matrix: 1-by-0
Explanation
isnan(N) returns a logical array with ones at the positions of NaN.
Z = filter(ones(num,1),1,isnan(N)); implements a moving-sum filter (a moving average without the normalization) with a filter window of ones(num,1) = [1 1 1] (for num = 3). So the filter of size 3 slides over the array and only reaches the value num = 3 when there are 3 NaNs in a row.
So it basically looks like:
%// N isnan(N) Z
NaN 1 1
1 0 1
2 0 1
3 0 0
NaN 1 1
5 0 1
6 0 1
NaN 1 1
NaN 1 2
7 0 2
8 0 1
10 0 0
12 0 0
20 0 0
NaN 1 1
NaN 1 2
NaN 1 3
NaN 1 3
NaN 1 3
Now it is easy to find all elements which are 3: find(Z == num). But you also need the two positions right before each of them: find(Z == num) - num + 2 and find(Z == num) - num + 1. Instead of a loop, arrayfun is used, which is basically the same. As a result you get a matrix with a lot of indices, many of them duplicated, but you just need the unique ones. I hope everything is clear now.
Actually, it would be much faster to pull find out of arrayfun, which can then even be replaced by bsxfun so that cell2mat is no longer needed. This leads to the following form:
Faster:
Z = find( filter(ones(num,1),1,isnan(N)) == num ) - num;
y = unique( bsxfun(@plus, Z,1:num) );
or, as the obligatory fancy one-liner:
y = unique(bsxfun(@plus,find(filter(ones(num,1),1,isnan(N))==num)-num,1:num));
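On R2016b or later, implicit expansion makes bsxfun unnecessary here. The following is a sketch of the same idea under that assumption; the double(...) cast is only a precaution so that filter receives a floating-point input.

```matlab
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]';
num = 3;                                       % run length that counts as a cluster

% Moving sum over a window of num samples; it equals num only at the
% last position of num consecutive NaNs
windowSum = filter(ones(num,1), 1, double(isnan(N)));

% Offset so that adding 1:num walks back over the whole window
Z = find(windowSum == num) - num;

% Column + row vector expands to a matrix of indices (R2016b+)
y = unique(Z + (1:num));
```

For num = 3 this returns the indices 15 to 19, matching the output listed above.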
STRFIND Approach
I. Fancy One Liner:
%%// Given input N
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]
out = [strfind(num2str(isnan([ 0 N 0]),'%1d'),'011');strfind(num2str(isnan([ 0 N 0]),'%1d'),'110')]'
Output
out =
8 9
15 19
II. Detailed one with explanation:
Basically you are trying to do sliding window checks, for which there is no direct method when working with double arrays, but after converting to strings, one can use strfind. This trick is used here.
I would suggest following the comments used in the code and the output numbers to understand it. Please note that for this particular case a cluster means a group of two or more consecutive NaNs
Code
%%// Given input N
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN]
%%// Set the locations where NaNs are present and then
%%// append at the start and end with zeros
N2 = isnan([ 0 N 0])
%%// Find the start indices of all NaN clusters
start_ind = strfind(num2str(N2,'%1d'),'011')
%%// Find the stop indices of all NaN clusters
stop_ind = [strfind(num2str(N2,'%1d'),'110')]
%%// Put start and stop indices into a Mx2 matrix
out = [start_ind' stop_ind']
Output
N =
NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN
N2 =
0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0
start_ind =
8 15
stop_ind =
9 19
out =
8 9
15 19
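The same trick can be generalised to runs of at least K NaNs by building the search patterns with repmat. This is a sketch under that assumption, not part of the answer above; it also builds the '0'/'1' string with char(... + '0') instead of num2str, and K and s are illustrative names.

```matlab
N = [NaN 1 2 3 NaN 5 6 NaN NaN 7 8 10 12 20 NaN NaN NaN NaN NaN];
K = 3;                                   % minimum number of consecutive NaNs

% Pad with zeros and convert the logical mask to a string of '0' and '1'
s = char(isnan([0 N 0]) + '0');

% A cluster starts where a '0' is followed by K ones; because of the
% leading pad, the match position is the start index in N
start_ind = strfind(s, ['0' repmat('1', 1, K)]);

% A cluster stops where K ones are followed by a '0'; shift the match
% position to the last NaN and undo the pad offset
stop_ind = strfind(s, [repmat('1', 1, K) '0']) + K - 2;

out = [start_ind' stop_ind'];
```

With K = 3 only the run from 15 to 19 survives; with K = 2 the result is the same [8 9; 15 19] as above.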
This uses diff, as Rafael Monteiro's answer, but seems to be simpler:
ind = diff([0; isnan(N(:))]);
result = find(ind(1:end-1)==1 & ind(2:end)==0);
In your example, this gives [8 15].
How it works: ind takes the values:
1 where a run of (one or more) NaN values starts;
0 where there is no change between NaN and numeric with respect to the previous value;
-1 where a run of (one or more) numeric values starts.
The second line selects the positions at which a run of NaN starts and such that the next position is also NaN. Thus it gives the start of each run of more than one NaN, as desired.

Find row-wise minima in sparse matrix

I would like to get the minimum nonzero values per row in a sparse matrix. Solutions I found for dense matrices suggested masking out the zero values by setting them to NaN or Inf. However, this obviously doesn't work for sparse matrices.
Ideally, I should get a column vector of all the row-wise minima, as I would get with
minValues = min( A, [], 2);
Except, obviously, using min leaves me with an all-zeros column vector due to the sparsity. Is there a solution using find?
This is perfect for accumarray. Consider the following sparse matrix,
vals = [3 1 1 9 7 4 10 1]; % got this from randi(10,1,8)
S = sparse([1 3 4 4 5 5 7 9],[2 2 3 6 7 8 8 11],vals);
Get the minimum value for each row, assuming 0 for empty elements:
[ii,jj] = find(S);
rowMinVals = accumarray(ii,nonzeros(S),[],@min)
Note that rows 4 and 5 of rowMinVals, which correspond to the only two rows of S with multiple nonzero values, are equal to the min of the respective row:
rowMinVals =
3
0
1
1 % min([1 9])
4 % min([7 4])
0
10
0
1
If the last row(s) of your sparse matrix do not contain any nonzeros, but you want the output to have one entry per row (numRows entries, for example), change the accumarray command as follows:
rowMinVals = accumarray(ii,nonzeros(S),[numRows 1],@min)
Also, perhaps you want to avoid including the default 0 in the output. One way to handle that is to set the fillval input argument to NaN:
rowMinVals = accumarray(ii,nonzeros(S),[numRows 1],@min,NaN)
rowMinVals =
3
NaN
1
1
4
NaN
10
NaN
1
NaN
NaN
NaN
Or you can keep using a sparse matrix with the sixth input argument, issparse:
>> rowMinVals = accumarray(ii,nonzeros(S),[],@min,[],true)
rowMinVals =
(1,1) 3
(3,1) 1
(4,1) 1
(5,1) 4
(7,1) 10
(9,1) 1
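As a self-contained check, the NaN-filled variant can be run end to end on the example matrix. This sketch takes the values from the third output of find rather than from nonzeros (both list the nonzeros in column-major order), and it assumes numRows should simply be the number of rows of S.

```matlab
vals = [3 1 1 9 7 4 10 1];
S = sparse([1 3 4 4 5 5 7 9], [2 2 3 6 7 8 8 11], vals);

[ii, ~, v] = find(S);        % row indices and values of the nonzeros
numRows = size(S, 1);        % assumed: one output entry per row of S

% Minimum nonzero per row; rows without nonzeros are filled with NaN
rowMinVals = accumarray(ii, v, [numRows 1], @min, NaN);
```

Here S has 9 rows, so rowMinVals is [3 NaN 1 1 4 NaN 10 NaN 1]'.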