How to exclude NaNs from ranking a vector - matlab

We're working on a MATLAB code to rank stocks. We do not have a full dataset and therefore have to cope with some NaNs. However, in the code we use for sorting, the NaNs are ranked the highest. Our intention is to exclude the NaNs from the ranking. How to do this?
Please consider the example with Y and stockid below:
Y = [1.2 1.3 NaN 0.9 0.95 NaN 0.8 0.7];
stockid = [801 802 803 804 805 806 807 808];
[totalmonths,totalstocks] = size(Y);
nbrstocks = totalstocks - sum(isnan(Y));
[B,I] = sort(Y,'descend');
ncandidates = 4;
idwinner(1:ncandidates) = stockid(I(1:ncandidates));
Running the program results in:
Y =
1.2000 1.3000 NaN 0.9000 0.9500 NaN 0.8000 0.7000
idwinner =
803 806 802 801
So, 803 corresponds to NaN, 806 to NaN, 802 to 1.3 etc.
The result we're aiming for should be like this:
Y =
1.2000 1.3000 NaN 0.9000 0.9500 NaN 0.8000 0.7000
idwinner =
802 801 805 804
So, how can we exclude the NaNs from the ranking?

Use
Y(isnan(Y)) = -inf;
before calling sort. That will change NaN values into -inf, and thus those values will be the lowest.
Alternatively, if you don't want to change any value in Y, you can use an intermediate index as follows:
Y = [1.2 1.3 NaN 0.9 0.95 NaN 0.8 0.7];
stockid = [801 802 803 804 805 806 807 808];
ind = find(~isnan(Y)); %// intermediate index that tells which elements are numbers
[B,I] = sort(Y(ind),'descend');
ncandidates = 4;
idwinner(1:ncandidates) = stockid(ind(I(1:ncandidates))); %// apply intermediate index

After your sort statement, add the line I = I(~isnan(B)); which removes the indices associated with NaNs before you use them to select from stockid.

I = I(~isnan(B));
works best for us, since it does not overwrite the NaNs, as is the case with
Y(isnan(Y)) = -inf;
Later on we also have to determine the loser portfolios from the stocks with the lowest returns. That does not work well with the -inf replacement, because all the NaNs would then have the lowest returns instead of the stocks with actual data.
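To make the point about winners and losers concrete, here is a minimal sketch that takes both portfolios from the same NaN-filtered index. It reuses Y, stockid and ncandidates from the question; idloser is a name made up for this illustration.

```matlab
Y = [1.2 1.3 NaN 0.9 0.95 NaN 0.8 0.7];
stockid = [801 802 803 804 805 806 807 808];
ncandidates = 4;

[B, I] = sort(Y, 'descend');   % NaNs come first in a descending sort
I = I(~isnan(B));              % keep only indices that point at real numbers

% Winners: the first entries of the filtered index (highest returns)
idwinner = stockid(I(1:ncandidates));
% Losers (hypothetical name): the last entries, listed worst first
idloser = stockid(I(end:-1:end-ncandidates+1));
```

With only six valid stocks and four candidates per side, the two portfolios overlap here. A real dataset would presumably need a check that numel(I) is at least 2*ncandidates.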

Related

Aligning multiple arrays in a cell array by prepending/postpending NaNs

I am trying to align arrays within a cell-array while prepending/postpending NaNs to match the size of arrays like this for example:
%Setting up data
A = [0.01 0.02 0.03 0.01 0.60 0.90 -1.02];
B = [0.03 0.01 0.60 0.90];
C = [0.03 0.01 0.60 0.90 -1.02 0.03 -1.02];
CellABC = {A, B, C};
The expected output is this:
CellABC = {[0.01 0.02 0.03 0.01 0.60 0.90 -1.02 NaN NaN ],...
[NaN NaN 0.03 0.01 0.60 0.90 NaN NaN NaN ],...
[NaN NaN 0.03 0.01 0.60 0.90 -1.02 0.03 -1.02]};
This is just an example. In my actual data, I have a 1x100 cell-array containing arrays of sizes ranging from 1x400 to 1x1400.
I have tried this:
[~, idx] = max(cellfun(@numel, CellABC)); %index of maximum no. of entries in CellABC
for i=1:length(CellABC)
[d1, d2] = findsignal(CellABC{idx},CellABC{i},'Metric','absolute');
tmp = NaN(size(CellABC{idx})); %initializing with NaNs
tmp(d1:d2) = CellABC{i}; %saving the array as per indices of found values
CellABC{i} = tmp; %Updating the cell array
end
This aligns CellABC{2} correctly, but the number of postpended NaNs is not correct. It also does not postpend NaNs at the end of CellABC{1} or prepend NaNs at the start of CellABC{3}. I understand why findsignal is not useful in this case: we don't have an array with the complete data to use as its first input argument. How could I make this work?
I have also looked into alignsignals function but it is only for two signals. I am unable to figure out how this could be implemented for 100 signals as in my case.
How could this problem be solved?
It's relatively simple for the example data, but you may need more than one template in multiple loops if the real data is too fragmented.
A = [0.01 0.02 0.03 0.01 0.60 0.90 -1.02];
B = [0.03 0.01 0.60 0.90];
C = [0.03 0.01 0.60 0.90 -1.02 0.03 -1.02];
CellABC = {A, B, C};
% find longest anyway
[~,I]=max(cellfun(@(x) numel(x),CellABC));
% find lags and align in two pass
% 1st pass
lags=zeros(numel(CellABC),1);
for idx=1:numel(CellABC)
if idx==I, continue; end
[r,lag]=xcorr(CellABC{I},CellABC{idx});
[~,lagId]=max(r);
lags(idx)=lag(lagId);
end
% 2nd pass
out=nan(numel(CellABC),max(arrayfun(@(x) numel(CellABC{x})+lags(x),1:numel(CellABC))));
for idx=1:numel(CellABC)
out(idx,lags(idx)+1:lags(idx)+numel(CellABC{idx}))=CellABC{idx};
end
out =
0.0100 0.0200 0.0300 0.0100 0.6000 0.9000 -1.0200 NaN NaN
NaN NaN 0.0300 0.0100 0.6000 0.9000 NaN NaN NaN
NaN NaN 0.0300 0.0100 0.6000 0.9000 -1.0200 0.0300 -1.0200
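The second pass above indexes from lags(idx)+1, so it assumes no lag is negative. If some array starts before the longest one, a possible workaround is to shift every lag by the smallest one first. The sketch below shows only that step; sig and the lag values are made-up inputs standing in for whatever the first pass returns, not output computed by xcorr.

```matlab
% Hypothetical inputs: three arrays and the lags a first pass might return
sig = {[3 4 5 6], [1 2 3 4], [5 6 7]};
lags = [0; -2; 2];              % sig{2} starts two samples before the reference

offset = lags - min(lags);      % shift so the earliest array starts at column 1
len = cellfun(@numel, sig(:));  % length of each array, as a column vector
out = nan(numel(sig), max(offset + len));
for idx = 1:numel(sig)
    out(idx, offset(idx)+1 : offset(idx)+len(idx)) = sig{idx};
end
```

Each row of out then holds one array, padded with NaNs on both sides as needed.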

Ranking (ordering value) of an observation in a matrix

I am trying to get the rank of an observation in a matrix, taking into account NaNs and values that can repeat themselves.
E.g. if we have
A = [0.1 0.15 0.3; 0.5 0.15 0.1; NaN 0.2 0.4];
A =
0.1000 0.1500 0.3000
0.5000 0.1500 0.1000
NaN 0.2000 0.4000
Then I want to get the following output:
B =
1 2 4
6 2 1
NaN 3 5
Thus 0.1 is the lowest value (rank=1), whereas 0.5 is the highest value (rank = 6).
Ideally an efficient solution without loops.
You can use unique. This sorts data by default, and you can get the index of the sorted unique values. This would replicate your tie behaviour, since identical values will have the same index. You can omit NaN values with logical indexing.
r = A; % or NaN(size(A))
nanIdx = isnan(A); % Get indices of NaNs in A to ignore
[~, ~, r(~nanIdx)] = unique(A(~nanIdx)) % Assign non-NaN values to their 'unique' index
>> r =
[ 1 2 4
6 2 1
NaN 3 5 ]
If you have the stats toolbox you can use tiedrank function for a similar result.
r = reshape(tiedrank(A(:)), size(A)) % Have to use reshape or rank will be per-column
>> r =
[ 1.5, 3.5, 6.0
8.0, 3.5, 1.5
NaN, 5.0, 7.0 ]
This is not your desired result (as per your example). You can see that tiedrank actually uses a more conventional ranking system than yours, where a tie gives each result the average rank. For example a tied 1st and 2nd gives each rank 1.5, and the next rank is 3.
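If you want the averaged tie ranks but don't have the toolbox, something along these lines should give the same numbers for this example. It is only a sketch built from sort, unique and accumarray; the variable names are my own.

```matlab
A = [0.1 0.15 0.3; 0.5 0.15 0.1; NaN 0.2 0.4];

valid = ~isnan(A);
v = A(valid);                        % non-NaN values as a column vector
[~, order] = sort(v);
pos = zeros(size(v));
pos(order) = (1:numel(v)).';         % ordinal position of each value
[~, ~, grp] = unique(v);             % identical values share a group index
avgPos = accumarray(grp, pos, [], @mean);  % mean position within each tie group

r = NaN(size(A));                    % NaNs in A stay NaN in the result
r(valid) = avgPos(grp);
```

Replacing avgPos(grp) with grp gives back the 1, 2, 2, 3 style ranks from the unique approach above.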

Cartesian product of row values of a matrix in Matlab

Similarly to this question, I have a matrix A of dimension mxn in Matlab with real values (including NaNs). I want to construct a matrix B listing row-wise each element of the non-unique Cartesian product of the values contained in A's rows which are not NaN. To be more clear, consider the following example.
Example:
%m=3;
%n=3;
A=[2.1 0 NaN;
69 NaN 1;
NaN 32.1 NaN];
%Hence, the Cartesian product {2.1,0}x{69,1}x{32.1} is
%{(2.1,69,32.1),(2.1,1,32.1),(0,69,32.1),(0,1,32.1)}
%I construct B by disposing row-wise each 3-tuple in the Cartesian product
B=[2.1 69 32.1;
2.1 1 32.1;
0 69 32.1;
0 1 32.1];
I came up with a solution using cells:
function B = q48444528(A)
if nargin < 1
A = [2.1 0 NaN;
69 NaN 1 ;
NaN 32.1 NaN];
end
% Converting to a cell array of rows:
C = num2cell(A,2);
% Getting rid of NaN values:
C = cellfun(@(x)x(~isnan(x)),C,'UniformOutput',false);
% Finding combinations:
B = combvec(C{:}).';
Output:
B =
2.1000 69.0000 32.1000
0 69.0000 32.1000
2.1000 1.0000 32.1000
0 1.0000 32.1000
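combvec is not part of base MATLAB (as far as I know it ships with the neural network / Deep Learning Toolbox). If it is unavailable, a sketch with ndgrid along these lines should produce the same rows in the same order for this example. It assumes A has at least two rows and every row has at least one non-NaN value; grids is a name made up here.

```matlab
A = [2.1 0    NaN;
     69  NaN  1  ;
     NaN 32.1 NaN];

% One cell per row of A, with the NaNs removed
C = cellfun(@(x) x(~isnan(x)), num2cell(A, 2), 'UniformOutput', false);

% One N-D grid per row; the first input varies fastest, as with combvec
grids = cell(1, numel(C));
[grids{:}] = ndgrid(C{:});

% Flatten each grid into a column and place the columns side by side
B = cell2mat(cellfun(@(g) g(:), grids, 'UniformOutput', false));
```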

mean based on maximum value of a matrix

This post follows up another post: find common value of one matrix in another matrix
As I explained there, I have one matrix MyMatrix (2549x13 double).
Few example lines from MyMatrix:
-7.80 -4.41 -0.08 2.51 6.31 6.95 4.97 2.91 0.66 -0.92 0.31 1.24 -0.07
4.58 5.87 6.18 6.23 5.20 4.86 5.02 5.33 3.69 1.36 -0.54 0.28 -1.20
-6.22 -3.77 1.18 2.85 -3.55 0.52 3.24 -7.77 -8.43 -9.81 -6.05 -5.88 -7.77
-2.21 -3.21 -4.44 -3.58 -0.89 3.40 6.56 7.20 4.30 -0.77 -5.09 -3.18 0.43
I have identified the maximum value for each row of matrix MyMatrix as following:
[M Ind] = max(MyMatrix, [], 2);
Example lines I obtain in M:
6.95
6.23
3.24
7.20
Now, I would like to select in MyMatrix the 2 values before and after the maximum value as found in M, as I will need to calculate the average of these 5 values. So, in the example, I would like to select:
2.51 6.31 6.95 4.97 2.91
5.87 6.18 6.23 5.20 4.86
-3.55 0.52 3.24 -7.77 -8.43
3.40 6.56 7.20 4.30 -0.77
and to create a new column in MyMatrix with the mean of these 5 values.
Following the code by @Dan, taken from the previous post:
colInd = bsxfun(@plus,PeakInd, -2:2); % PeakInd is the column index of each row's maximum (called Ind above)
MyMatrixT = MyMatrix.';
rowIndT = colInd.';
linIndT = bsxfun(@plus,rowIndT,0:size(MyMatrixT,1):size(MyMatrixT,1)*(size(MyMatrixT,2)-1));
resultT = MyMatrixT(linIndT);
result = resultT.';
mean(result,2)
MyMatrix = [MyMatrix, mean(result,2)];
Here is the new part of the post, regarding the issue when the maximum value is near the edges.
When the maximum is the first or last column of MyMatrix, I would like to have NaN.
Instead, when the maximum is in the second column, I would like to calculate the mean considering one column preceding the maximum, the maximum value, and two columns following the maximum.
While, when the maximum is in the second last column, I would like to consider the two columns preceding the maximum, the maximum value, and only one column following the maximum.
I would be extremely grateful if you could help me. Many thanks!
Instead of creating a 2D array with NaNs plus nanmean, you could use min/max to get the right indexes:
pad = 2;
[~, Ind] = max(MyMatrix, [], 2);
minCol = max(1, Ind-pad);
maxCol = min(size(MyMatrix, 2), Ind+pad);
result = arrayfun(@(row, min_, max_) mean(MyMatrix(row, min_:max_)),...
(1:size(MyMatrix, 1)).', minCol, maxCol);
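The clipping above covers the second and second-to-last columns, but as written it still returns a mean when the maximum sits in the first or last column. One way to get the NaN asked for in the question is a logical mask afterwards. The sketch below uses a small made-up matrix M instead of MyMatrix so the numbers are easy to check by hand.

```matlab
% Hypothetical data: the maxima sit in columns 2, 1, 4 and 6
M = [1 9 2 3 0 0;
     9 1 1 1 1 1;
     0 1 2 8 4 3;
     0 0 0 0 0 7];

pad = 2;
nCols = size(M, 2);
[~, Ind] = max(M, [], 2);
minCol = max(1, Ind - pad);        % clip the window at the left edge
maxCol = min(nCols, Ind + pad);    % clip the window at the right edge
result = arrayfun(@(row, lo, hi) mean(M(row, lo:hi)), ...
    (1:size(M, 1)).', minCol, maxCol);

% Maximum in the first or last column: report NaN instead of a mean
result(Ind == 1 | Ind == nCols) = NaN;
```

Row 1 averages columns 1 to 4 (3.75), row 3 averages columns 2 to 6 (3.6), and rows 2 and 4 become NaN.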
If you have the Image Processing Toolbox, you can also use padarray, e.g.
B = padarray(magic(5),[0 2],NaN);
B =
NaN NaN 17 24 1 8 15 NaN NaN
NaN NaN 23 5 7 14 16 NaN NaN
NaN NaN 4 6 13 20 22 NaN NaN
NaN NaN 10 12 19 21 3 NaN NaN
NaN NaN 11 18 25 2 9 NaN NaN
(...if you don't have padarray, just manually add 2 NaN columns on either side) then using some bsxfun + sub2ind we get the desired result:
pad_sz = 2;
B = padarray(magic(5),[0 pad_sz],NaN);
[~,I] = nanmax(B,[],2); % by using nanmax we "explicitly" say we ignore NaNs.
colInd = bsxfun(@plus,-pad_sz:pad_sz,I);
linearInd = sub2ind(size(B), repmat((1:5).',[1,size(colInd,2)]), colInd);
picks = B(linearInd);
res = nanmean(picks,2);
% or combine the last 3 lines into:
% res = nanmean(B(sub2ind(size(B), repmat((1:5).',[1,size(colInd,2)]), colInd)),2);
res = res + 0./~(I == pad_sz+1 | I == size(B,2)-pad_sz); %add NaN where needed.
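For reference, the same idea can be sketched without any toolbox functions: pad by hand, rely on max skipping NaNs by default, and use the 'omitnan' flag of mean. This assumes a reasonably recent release (implicit expansion needs R2016b or later, 'omitnan' R2015a or later). X and rowInd are names introduced only for this sketch.

```matlab
X = magic(5);
pad_sz = 2;

% Manual padding: pad_sz NaN columns on either side
B = [NaN(size(X,1), pad_sz), X, NaN(size(X,1), pad_sz)];

[~, I] = max(B, [], 2);            % max skips NaNs by default
colInd = I + (-pad_sz:pad_sz);     % implicit expansion instead of bsxfun
rowInd = repmat((1:size(B,1)).', 1, size(colInd, 2));
picks = B(sub2ind(size(B), rowInd, colInd));
res = mean(picks, 2, 'omitnan');   % average only the real values

% Maximum in the first or last original column: NaN, as requested
res(I == pad_sz+1 | I == size(B,2)-pad_sz) = NaN;
```

For magic(5) the row maxima fall in original columns 2, 1, 5, 4 and 3, so rows 2 and 3 end up as NaN.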

MATLAB remove NaN values from matrix and shift values left

I am trying to compute column-wise differences in the following matrix:
A =
0 NaN NaN 0.3750 NaN
NaN 0.1250 0.2500 0.3750 NaN
I would like to obtain:
0.3750 NaN NaN
0.1250 0.1250 0.1250
Where I am essentially taking a columnwise difference, skipping NaN values and shifting values to the left.
A one-dimensional case would be:
A = [0 NaN 0.250 0.375 NaN 0.625];
NaN_diff(A) = [0.250 0.125 0.250];
Any way to do this efficiently in MATLAB without using inefficient find() queries per row?
Here's a solution that vectorizes most of the operations:
notNan = ~isnan(A);
numNN = sum(notNan,2);
shifted = NaN(size(A));
for r = 1:size(A,1)
myRow = A(r,:);
shifted(r,1:numNN(r)) = myRow(notNan(r,:));
end
nanDiff = diff(shifted,1,2);
Here is an alternative vectorized solution:
%// Convert to cell array without NaNs
[rows, cols] = size(A);
C = cellfun(@(x)x(~isnan(x)), mat2cell(A, ones(1, rows), cols), 'Uniform', 0);
%// Compute diff for each row and pad
N = max(sum(~isnan(A), 2));
C = cellfun(@(x)[diff(x) nan(1, N - length(x))], C, 'Uniform', 0);
%// Convert back to a matrix
nandiff = vertcat(C{:});
If you want to pad the result matrix with zeroes instead of NaN values, change the nan function call in nan(1, N - length(x)) to zeros.
Here is an alternative method that does require you to loop over each row, but should still have decent performance and feels very intuitive to me.
B = NaN(size(A,1),size(A,2)-1);
for i = 1:size(A,1)
idx = ~isnan(A(i,:)); % non-NaN entries of row i
B(i,1:sum(idx)-1) = diff(A(i,idx)); % diff returns one element fewer than its input
end
I'm aware that this is a rather old question, but for people like me who stumble into this page, here is a simpler (imho) solution to the question:
A = [0 NaN 0.250 0.375 NaN 0.625];
A(isnan(A))=[]; % identify index of NaN values and remove them from the array
B = diff(A);
Here is another simple solution without using a loop [but assuming all values are in ascending order]:
A=[0 NaN NaN 0.3750 NaN;NaN 0.1250 0.2500 0.3750 NaN]
A(isnan(A(:,1)))=0;
B=sort(A,2);
C=diff(B,1,2)
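The shift-left step from the loop-based answers can also be sketched without a loop and without assuming ascending values. Sorting the isnan mask along each row moves the non-NaN positions to the front, and because sort keeps the original order of equal elements, the values stay in sequence. The names order, rowInd and shifted are just illustrative.

```matlab
A = [0 NaN NaN 0.375 NaN; NaN 0.125 0.25 0.375 NaN];

% Sort the NaN mask per row: false (real values) first, original order kept
[~, order] = sort(isnan(A), 2);

% Reorder each row of A with the column permutation found above
rowInd = repmat((1:size(A,1)).', 1, size(A,2));
shifted = A(sub2ind(size(A), rowInd, order));

% Differences along each row; NaN wherever a row has run out of values
nanDiff = diff(shifted, 1, 2);
```

The result keeps size(A,2)-1 columns, so trailing all-NaN columns may need trimming afterwards.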