I am trying to get the rank of each observation in a matrix, taking into account NaNs and values that can repeat themselves.
E.g. if we have
A = [0.1 0.15 0.3; 0.5 0.15 0.1; NaN 0.2 0.4];
A =
0.1000 0.1500 0.3000
0.5000 0.1500 0.1000
NaN 0.2000 0.4000
Then I want to get the following output:
B =
1 2 4
6 2 1
NaN 3 5
Thus 0.1 is the lowest value (rank=1), whereas 0.5 is the highest value (rank = 6).
Ideally an efficient solution without loops.
You can use unique. This sorts data by default, and you can get the index of the sorted unique values. This would replicate your tie behaviour, since identical values will have the same index. You can omit NaN values with logical indexing.
r = A; % or NaN(size(A))
nanIdx = isnan(A); % Get indices of NaNs in A to ignore
[~, ~, r(~nanIdx)] = unique(A(~nanIdx)) % Assign non-NaN values to their 'unique' index
>> r =
[ 1 2 4
6 2 1
NaN 3 5 ]
If you have the Statistics Toolbox, you can use the tiedrank function for a similar result.
r = reshape(tiedrank(A(:)), size(A)) % Have to use reshape or rank will be per-column
>> r =
[ 1.5, 3.5, 6.0
8.0, 3.5, 1.5
NaN, 5.0, 7.0 ]
This is not your desired result (as per your example). You can see that tiedrank actually uses a more conventional ranking system than yours, where a tie gives each result the average rank. For example a tied 1st and 2nd gives each rank 1.5, and the next rank is 3.
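If the Statistics Toolbox is not available, average ranks like the ones shown for tiedrank can be sketched with sort, unique and accumarray. This is only an illustration under the assumption that NaNs should stay NaN; it is not how tiedrank is implemented, and the names valid, v, pos, grp and avgRank are made up for the example.

```matlab
A = [0.1 0.15 0.3; 0.5 0.15 0.1; NaN 0.2 0.4];

valid = ~isnan(A);                 % positions that take part in the ranking
v = A(valid);                      % non-NaN values as a column vector

[~, order] = sort(v);              % ascending sort order
pos = zeros(size(v));
pos(order) = (1:numel(v)).';       % ordinal position (1..n) of every value

[~, ~, grp] = unique(v);           % identical values share one group index
avgRank = accumarray(grp, pos, [], @mean);   % mean position per group

r = NaN(size(A));                  % NaNs stay NaN
r(valid) = avgRank(grp);           % tied values get their average rank
```

For the example matrix this gives the same values as the tiedrank output above.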
I have two matrices in Matlab.
A =
and
B =
I want to move each element of A to the column whose number equals the element's value. I also want to map the elements of B onto A, so that the B elements move to the same positions.
I want this
A =
And therefore,
B =
Is there a way to do this?!
Thanks.
The easiest way I can think of is to create row/column pairs where the rows correspond to the row locations in the matrix and the columns are the actual elements of the matrix. The values placed at these row/column pairs are again just the matrix values themselves.
You can very easily do this with sparse. Recreating the matrix above and storing this in A:
A = [1 2 5 8; 1 2 4 7];
... I would do it this way:
r = repmat((1:size(A,1)).', 1, size(A,2)); %'
S = full(sparse(r(:),A(:),A(:)));
The first line of code generates row locations for each value in the matrix A. We then use sparse to specify the row/column pairs and the associated values, and full to convert the result to a proper numeric matrix.
We get:
S =
1 2 0 0 5 0 0 8
1 2 0 4 0 0 7 0
You can also do the same for the matrix B. You'd use sparse and specify the third parameter to be B instead:
B = [0.5 0.2 0.6 0.8; 0.4 0.6 0.8 0.9];
S2 = full(sparse(r(:),A(:),B(:)));
We get:
>> S2
S2 =
0.5000 0.2000 0 0 0.6000 0 0 0.8000
0.4000 0.6000 0 0.8000 0 0 0.9000 0
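The same placement can also be sketched with plain linear indexing instead of sparse. This assumes that A holds positive integers and that no value repeats within a row: sparse adds up values that land on the same position, whereas direct indexing keeps only the last one written. The variable rows is an illustrative name.

```matlab
A = [1 2 5 8; 1 2 4 7];
B = [0.5 0.2 0.6 0.8; 0.4 0.6 0.8 0.9];

% Row number of every element of A
[rows, ~] = ndgrid(1:size(A,1), 1:size(A,2));

% Output is as wide as the largest value in A
S2 = zeros(size(A,1), max(A(:)));

% Place each element of B at (its row, the matching value in A)
S2(sub2ind(size(S2), rows(:), A(:))) = B(:);
```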
We're working on a MATLAB code to rank stocks. We do not have a full dataset and therefore have to cope with some NaNs. However, in the code we use for sorting, the NaNs are ranked the highest. Our intention is to exclude the NaNs from the ranking. How to do this?
Please consider the example with Y and stockid below:
Y = [1.2 1.3 NaN 0.9 0.95 NaN 0.8 0.7];
stockid = [801 802 803 804 805 806 807 808];
[totalmonths,totalstocks] = size(Y);
nbrstocks = totalstocks - sum(isnan(Y));
[B,I] = sort(Y,'descend');
ncandidates = 4;
idwinner(1:ncandidates) = stockid(I(1:ncandidates));
Running the program results in:
Y =
1.2000 1.3000 NaN 0.9000 0.9500 NaN 0.8000 0.7000
idwinner =
803 806 802 801
So, 803 corresponds to NaN, 806 to NaN, 802 to 1.3 etc.
The result we're aiming for should be like this:
Y =
1.2000 1.3000 NaN 0.9000 0.9500 NaN 0.8000 0.7000
idwinner =
802 801 805 804
So, how can we exclude the NaNs from the ranking?
Use
Y(isnan(Y)) = -inf;
before calling sort. That will change NaN values into -inf, and thus those values will be the lowest.
Alternatively, if you don't want to change any value in Y, you can use an intermediate index as follows:
Y = [1.2 1.3 NaN 0.9 0.95 NaN 0.8 0.7];
stockid = [801 802 803 804 805 806 807 808];
ind = find(~isnan(Y)); %// intermediate index that tells which elements are numbers
[B,I] = sort(Y(ind),'descend');
ncandidates = 4;
idwinner(1:ncandidates) = stockid(ind(I(1:ncandidates))); %// apply intermediate index
After your sort statement, add the line I = I(~isnan(B));, which removes the indices associated with NaNs before you use them to select from stockid.
I = I(~isnan(B));
works best, since we then do not overwrite the NaNs, as is the case when using
Y(isnan(Y)) = -inf;
Later on we also have to determine the loser portfolios from the stocks with the lowest returns. That does not work well with the -inf replacement, because all the NaNs end up with the lowest returns instead of the stocks with actual data.
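To pick both winners and losers without modifying Y, the intermediate-index idea from the answer above can be extended as in the following sketch. The names ranked and idloser are made up for the illustration, and ncandidates is set to 3 here only so that the two groups do not overlap with six valid stocks.

```matlab
Y = [1.2 1.3 NaN 0.9 0.95 NaN 0.8 0.7];
stockid = [801 802 803 804 805 806 807 808];
ncandidates = 3;                       % assumed portfolio size for this sketch

ind = find(~isnan(Y));                 % positions of stocks with actual data
[~, I] = sort(Y(ind), 'descend');      % rank only the valid returns
ranked = stockid(ind(I));              % ids from best to worst, NaNs excluded

idwinner = ranked(1:ncandidates);              % highest returns
idloser  = ranked(end-ncandidates+1:end);      % lowest returns with data
```

Because the NaN entries never enter the sort, they cannot show up at either end of the ranking.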
Let's say I have an m-by-n matrix of different features of a time series signal (column 1 represents the linear regression of the last n samples, column 2 represents the average of the last n samples, column 3 represents the local max values of a different but correlated time series signal, etc.). How should I normalize these inputs? The inputs fall into different categories, so they have different ranges. One ranges from 0 to 1, another from -5 to 50, and so on.
Should I normalize the WHOLE matrix? Or should I normalize each set of inputs one by one individually?
Note: I usually use mapminmax function from MATLAB for the normalization.
You should normalise each vector/column of your matrix individually: they represent different data types and shouldn't be mixed together.
You could for example transpose your matrix to have your 3 different data types in the rows instead of in the columns of your matrix and still use mapminmax:
A = [0 0.1 -5; 0.2 0.3 50; 0.8 0.8 10; 0.7 0.9 20];
A =
0 0.1000 -5.0000
0.2000 0.3000 50.0000
0.8000 0.8000 10.0000
0.7000 0.9000 20.0000
B = mapminmax(A')
B =
-1.0000 -0.5000 1.0000 0.7500
-1.0000 -0.5000 0.7500 1.0000
-1.0000 1.0000 -0.4545 -0.0909
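Note that the output above lies in [-1, 1], which is the default range of mapminmax. A toolbox-free sketch of the same per-column scaling is shown below, assuming no column is constant (otherwise the range would be zero). The names lo, hi and scaled are illustrative only.

```matlab
A = [0 0.1 -5; 0.2 0.3 50; 0.8 0.8 10; 0.7 0.9 20];

lo = min(A, [], 1);                    % per-column minimum
hi = max(A, [], 1);                    % per-column maximum

% Map each column linearly so that its minimum is -1 and its maximum is 1
scaled = 2 * bsxfun(@rdivide, bsxfun(@minus, A, lo), hi - lo) - 1;

B = scaled.';                          % transposed, to match mapminmax(A')
```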
You should normalize each feature independently.
column 1 represents linear regression of the last n samples, column 2 represents the average of the last n samples, column 3 represents the local max values of a different time series but correlated signal, etc
I can't say for sure about your particular problem, but generally, you should normalize each feature independently. So normalize column 1, then column 2 etc.
Should I normalize the WHOLE matrix? Or should I normalize each set of inputs one by one individually?
I'm not sure what you mean here. What is an input? If by that you mean an instance (a row of your matrix), then no, you should not normalize rows individually, but columns.
I don't know how you would do this in Matlab, but I took your question more as a theoretical one than an implementation one.
If you want every column to be normalized to the range [0,1] independently of the other columns, you can use mapminmax like so (assuming A is the 2D input array) -
out = mapminmax(A.',0,1).'
You can also use bsxfun for the same output, like so -
Aoffsetted = bsxfun(@minus,A,min(A,[],1))
out = bsxfun(@rdivide,Aoffsetted,max(Aoffsetted,[],1))
Sample run -
>> A
A =
3 7 4 2 7
1 3 4 5 7
1 9 7 5 3
8 1 8 6 7
>> mapminmax(A.',0,1).'
ans =
0.28571 0.75 0 0 1
0 0.25 0 0.75 1
0 1 0.75 0.75 0
1 0 1 1 1
>> Aoffsetted = bsxfun(@minus,A,min(A,[],1));
>> bsxfun(@rdivide,Aoffsetted,max(Aoffsetted,[],1))
ans =
0.28571 0.75 0 0 1
0 0.25 0 0.75 1
0 1 0.75 0.75 0
1 0 1 1 1
I have a dataset which has 4 columns/attributes and 150 rows. I want to normalize this data using min-max normalization. So far, my code is:
minData=min(min(data1))
maxData=max(max(data1))
minmaxeddata=((data1-minData)./(maxData))
Here, minData and maxData return the global minimum and maximum values. Therefore, this code actually applies a min-max normalization over all values in the 2D matrix so that the global minimum is 0 and the global maximum is 1.
However, I would like to perform the same operation on each column individually. Specifically, each column of the 2D matrix should be min-max normalized independently from the other columns.
I tried using just min(data1) and max(data1), but got an error saying that the matrix dimensions must agree.
However, by using the global minimum and maximum, I got the values in the range of [0-1] and have done experimentations using this normalized dataset. I would like to know whether there is any problem in my results? Is there a problem in my understanding as well? Any guidance would be appreciated.
If I understand you correctly, you wish to normalize each column of data1. Also, as each column is an independent data set and most likely having different dynamic ranges, doing a global min-max operation is probably not recommended. I would recommend that you go with your initial thoughts in normalizing each column individually.
As for your error, you can't subtract min(data1) from data1 directly, because min(data1) produces a row vector while data1 is a matrix. You are subtracting a vector from a matrix of a different size, which is why you are getting that error.
If you want to achieve what you're asking, use bsxfun to broadcast the vector so that it is repeated for as many rows as data1 has. Therefore:
mindata = min(data1);
maxdata = max(data1);
minmaxdata = bsxfun(@rdivide, bsxfun(@minus, data1, mindata), maxdata - mindata);
With later versions of MATLAB (R2016b and newer), broadcasting is built into the language, so you can simply do:
mindata = min(data1);
maxdata = max(data1);
minmaxdata = (data1 - mindata) ./ (maxdata - mindata);
It's a lot easier to read and still does the same job.
Example
>> data1 = [5 9 9 9 3 3; 3 10 2 1 10 1; 2 4 4 6 5 5]
data1 =
5 9 9 9 3 3
3 10 2 1 10 1
2 4 4 6 5 5
When I run the above normalization code, I get:
minmaxdata =
1.0000 0.8333 1.0000 1.0000 0 0.5000
0.3333 1.0000 0 0 1.0000 0
0 0 0.2857 0.6250 0.2857 1.0000
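One case the code above does not cover is a column whose values are all equal: maxdata - mindata is then zero and the division produces NaN. A possible guard is sketched below. The sample data and the name span are made up for the illustration, and mapping a constant column to 0 is an assumption rather than the only reasonable choice.

```matlab
data1 = [5 9 7; 3 10 7; 2 4 7];        % third column is constant (hypothetical data)

mindata = min(data1, [], 1);           % per-column minimum
span = max(data1, [], 1) - mindata;    % per-column range
span(span == 0) = 1;                   % avoid 0/0; constant columns map to 0

minmaxdata = bsxfun(@rdivide, bsxfun(@minus, data1, mindata), span);
```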
I have a large matrix with two columns. First is an index, second is data. Some indices are repeated. How can I retain only the first instance of rows with repeated indices?
For Example:
x =
1 5.5
1 4.5
2 4
3 2.5
3 3
4 1.5
to end up with:
ans =
1 5.5
2 4
3 2.5
4 1.5
I've tried various variations and iterations of
[Uy, iy, yu] = unique(x(:,1));
[q, t] = meshgrid(1:size(x, 2), yu);
totals = accumarray([t(:), q(:)], x(:));
but nothing so far has given me the output I need.
Use the 'first' option of the unique function. The second output then gives you the row indices you want, which you can use to 'filter' your matrix.
[~, ind] = unique(x(:,1), 'first');
ans = x(ind, :)
ans =
1.0000 5.5000
2.0000 4.0000
3.0000 2.5000
4.0000 1.5000
EDIT
Or, as Jonas points out (especially for old MATLAB releases):
[~, ind] = unique(flipud(x(:,1)));
ans = x(flipud(ind), :)
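If the indices are not sorted and the original row order should be kept, the 'stable' option of unique may be an alternative. The sketch below assumes a release that supports 'stable'; the sample data and the name out are made up for the illustration.

```matlab
x = [3 2.5; 1 5.5; 3 3; 1 4.5; 2 4];   % unsorted indices (hypothetical data)

% 'stable' keeps values in order of first appearance, and ind points
% at the first row in which each index occurs
[~, ind] = unique(x(:,1), 'stable');

out = x(ind, :);                       % first instance of every index
```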