Replace certain elements of matrix with NaN (MATLAB) - matlab

I have a matrix, A:
A=[3,4,5,2,2,4;2,3,4,5,3,4;2,4,3,2,3,1;1,2,3,2,3,4]
Some of the records in A must be replaced by NaN values, as they are inaccurate.
I have created a vector, rowid, which records for each row the column index of the last value to keep; every value after it must be replaced with NaN.
rowid=[4,5,4,3]
So the matrix I wish to create, B, would look as follows:
B=[3,4,5,2,NaN,NaN;2,3,4,5,3,NaN;2,4,3,2,NaN,NaN;1,2,3,NaN,NaN,NaN]
I am at a loss as to how to do this. I have tried to use
A(:,rowid:end)
to start selecting the data from A. I expect I could use sub2ind or some sort of index to do this, possibly a loop with an if statement, but I don't know where to start and cannot find a similar question to base my thoughts on!
I would very much appreciate any tips/pointers, many thanks

If you are not yet a MATLAB expert, I would stick to simple for-loops for now:
B = A;
for i=1:length(rowid)
B(i, rowid(i)+1:end) = NaN;
end
It is always a sport to write this as a one-liner (see Mohsen's answer), but in many cases an explicit for-loop is much clearer.

A compact one is:
B = A;
B(bsxfun(@lt, rowid.', 1:size(A,2)))=NaN;
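To make the one-liner above easier to follow, here is a small sketch that builds the logical mask as a separate step, using the A and rowid from the question. The variable names cols and mask are only illustrative.

```matlab
A = [3,4,5,2,2,4; 2,3,4,5,3,4; 2,4,3,2,3,1; 1,2,3,2,3,4];
rowid = [4,5,4,3];

cols = 1:size(A,2);                  % column indices 1..6
% mask(i,j) is true when rowid(i) < j, i.e. column j lies after the
% last value to keep in row i
mask = bsxfun(@lt, rowid(:), cols);

B = A;
B(mask) = NaN;                       % overwrite everything after rowid(i)
disp(B)
```

The mask has the same size as A, so it can also be inspected or reused before anything is overwritten.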

Related

Matlab: need some help for a seemingly simple vectorization of an operation

I would like to optimize this piece of Matlab code but so far I have failed. I have tried different combinations of repmat and sums and cumsums, but all my attempts seem to not give the correct result. I would appreciate some expert guidance on this tough problem.
S=1000; T=10;
X=rand(T,S);
X=sort(X,1,'ascend');
Result=zeros(S,1);
for c=1:T-1
for cc=c+1:T
d=(X(cc,:)-X(c,:))-(cc-c)/T;
Result=Result+abs(d');
end
end
Basically I create 1000 vectors of 10 random numbers, and for each vector I calculate, for each pair of values (say the mth and the nth), the difference between them minus the scaled index difference (n-m)/T. I sum the absolute values over all possible pairs and return the result for every vector.
I hope this explanation is clear,
Thanks a lot in advance.
It is at least easy to vectorize your inner loop:
Result=zeros(S,1);
for c=1:T-1
d=(X(c+1:T,:)-X(c,:))-((c+1:T)'-c)./T;
Result=Result+sum(abs(d),1)';
end
Here, I'm using the new automatic singleton expansion. If you have an older version of MATLAB you'll need to use bsxfun for two of the subtraction operations. For example, X(c+1:T,:)-X(c,:) is the same as bsxfun(@minus,X(c+1:T,:),X(c,:)).
What is happening in the bit of code is that instead of looping cc=c+1:T, we take all of those indices at once. So I simply replaced cc for c+1:T. d is then a matrix with multiple rows (9 in the first iteration, and one fewer in each subsequent iteration).
Surprisingly, this is slower than the double loop, and similar in speed to Jodag's answer.
Next, we can try to improve indexing. Note that the code above extracts data row-wise from the matrix. MATLAB stores data column-wise. So it's more efficient to extract a column than a row from a matrix. Let's transpose X:
X=X';
Result=zeros(S,1);
for c=1:T-1
d=(X(:,c+1:T)-X(:,c))-((c+1:T)-c)./T;
Result=Result+sum(abs(d),2);
end
This is more than twice as fast as the code that indexes row-wise.
But of course the same trick can be applied to the code in the question, speeding it up by about 50%:
X=X';
Result=zeros(S,1);
for c=1:T-1
for cc=c+1:T
d=(X(:,cc)-X(:,c))-(cc-c)/T;
Result=Result+abs(d);
end
end
My takeaway message from this exercise is that MATLAB's JIT compiler has improved things a lot. Back in the day any sort of loop would grind code to a halt. Today it's not necessarily the worst approach, especially if all you do is use built-in functions.
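Before comparing timings, it is worth confirming that the variants compute the same thing. The sketch below checks the transposed, partially vectorized loop against the original double loop. It uses bsxfun instead of implicit expansion so that it should also run on older releases; the names R1, R2, Xt and maxDiff are illustrative only.

```matlab
S = 1000; T = 10;
X = sort(rand(T,S), 1, 'ascend');

% Original formulation from the question (row-wise indexing)
R1 = zeros(S,1);
for c = 1:T-1
    for cc = c+1:T
        d = (X(cc,:) - X(c,:)) - (cc-c)/T;
        R1 = R1 + abs(d');
    end
end

% Transposed formulation with the inner loop vectorized (column-wise indexing)
Xt = X';
R2 = zeros(S,1);
for c = 1:T-1
    d = bsxfun(@minus, Xt(:,c+1:T), Xt(:,c));     % S-by-(T-c) differences
    d = bsxfun(@minus, d, ((c+1:T) - c) ./ T);    % subtract the scaled index gap
    R2 = R2 + sum(abs(d), 2);
end

% The two results should agree up to floating-point rounding
maxDiff = max(abs(R1 - R2));
fprintf('max abs difference: %g\n', maxDiff);
```

The results are not bit-identical because the terms are summed in a different order, so the comparison uses a tolerance rather than isequal.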
The nchoosek(v,k) function generates all combinations of the elements in v taken k at a time. We can use this to generate all possible pairs of indices and then use those pairs to vectorize the loops. It appears that in this case the vectorization doesn't actually improve performance (at least on my machine with 2017a). Maybe someone will come up with a more efficient approach.
idx = nchoosek(1:T,2);
d = bsxfun(@minus,(X(idx(:,2),:) - X(idx(:,1),:)), (idx(:,2)-idx(:,1))/T);
Result = sum(abs(d),1)';
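To see what the index pairs look like, here is a tiny illustration with T = 4 instead of 10 (the smaller T and the variable gap are chosen only for this example). Each row of idx corresponds to one (c, cc) pair from the double loop, with c < cc.

```matlab
T = 4;
idx = nchoosek(1:T, 2);    % all pairs (c, cc) with c < cc, one pair per row
disp(idx)
%      1     2
%      1     3
%      1     4
%      2     3
%      2     4
%      3     4

% The term subtracted from each difference in the question, (cc - c)/T
gap = (idx(:,2) - idx(:,1)) / T;
disp(gap.')
```

There are T*(T-1)/2 rows, so for T = 10 the intermediate matrix d has 45 rows per column of X, which may be part of why this version is not faster than the loops.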
Update: here are the results for the running times for the different proposals (10^5 trials):
So it looks like transposing the matrix is the most effective intervention, and my original double-loop implementation is, amazingly, the best compared to the vectorized versions. However, in my hands (2017a) the improvement is only 16.6% compared to the original using the mean (18.2% using the median).
Maybe there is still room for improvement?

Select all elements except one in a vector

My question is very similar to this one but I can't manage exactly how to apply that answer to my problem.
I am looping through a vector with a variable k and want to select the whole vector except the single value at index k.
Any idea?
for k = 1:length(vector)
newVector = vector( excluding index k ); % <---- what mask should I use?
% other operations to do with the newVector
end
Another alternative without setdiff() is
vector(1:end ~= k)
vector([1:k-1 k+1:end]) will do. Depending on the other operations, there may be a better way to handle this, though.
For completeness, if you want to remove one element, you do not need to go the vector = vector([1:k-1 k+1:end]) route, you can use vector(k)=[];
Just for fun, here's an interesting way with setdiff:
vector(setdiff(1:end,k))
What's interesting about this, besides the use of setdiff, you ask? Look at the placement of end. MATLAB's end keyword translates to the last index of vector in this context, even as an argument to a function call rather than directly used with paren (vector's () operator). No need to use numel(vector). Put another way,
>> vector=1:10;
>> k=6;
>> vector(setdiff(1:end,k))
ans =
1 2 3 4 5 7 8 9 10
>> setdiff(1:end,k)
Error using setdiff (line 81)
Not enough input arguments.
That is not completely obvious IMO, but it can come in handy in many situations, so I thought I would point this out.
Very easy:
newVector = vector([1:k-1 k+1:end]);
This works even if k is the first or last element.
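The edge cases work because 1:k-1 is empty when k is 1 and k+1:end is empty when k is the last index. A short check, using a made-up four-element vector:

```matlab
vector = [10 20 30 40];

k = 1;                                     % first element: 1:0 is empty
withoutFirst = vector([1:k-1, k+1:end]);
disp(withoutFirst)                         % 20 30 40

k = numel(vector);                         % last element: 5:4 is empty
withoutLast = vector([1:k-1, k+1:end]);
disp(withoutLast)                          % 10 20 30
```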
% create a logical mask of the same size as vector:
l=true(size(vector));
l(k)=false;
newVector=vector(l);
Another way you can do this which allows you to exclude multiple indices at once (or a single index... basically it's robust to allow either) is:
newVector = oldVector(~ismember(1:end,k))
Works just like setdiff really, but builds a logical mask instead of a list of explicit indices.
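As a quick comparison, the sketch below excludes two indices at once with the ismember mask and checks that the setdiff form selects the same elements. The sample values and the names viaMask and viaSetdiff are only for illustration.

```matlab
oldVector = 10:10:100;
k = [2 6];                                      % indices to exclude

% Logical mask: true for every position that is not in k
viaMask = oldVector(~ismember(1:end, k));

% Explicit index list: the positions that remain after removing k
viaSetdiff = oldVector(setdiff(1:end, k));

disp(viaMask)                                   % 10 30 40 50 70 80 90 100
disp(isequal(viaMask, viaSetdiff))              % 1
```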

MatLab Missing data handling in categorical data

I am trying to put my dataset into the MATLAB [ranked,weights] = relieff(X,Ylogical,10, 'categoricalx', 'on') function to rank the importance of my predictor features. The dataset<double n*m> has n observations and m discrete (i.e. categorical) features. It happens that each observation (row) in my dataset has at least one NaN value. These NaNs represent unobserved, i.e. missing or null, predictor values in the dataset. (There is no corruption in the dataset, it is just incomplete.)
relieff() uses this function below to remove any rows that contain a NaN:
function [X,Y] = removeNaNs(X,Y)
% Remove observations with missing data
NaNidx = bsxfun(@or,isnan(Y),any(isnan(X),2));
X(NaNidx,:) = [];
Y(NaNidx,:) = [];
This is not ideal, especially for my case, since it leaves me with X=[] and Y=[] (i.e. no observations!)
In this case:
1) Would replacing all NaNs with an arbitrary value, e.g. 99999, help? By doing this, I am introducing a new feature state for all the predictor features, so I guess it is not ideal.
2) or is replacing NaNs with the mode of the corresponding feature column vector (as below) statistically more sound? (I am not vectorising for clarity's sake)
function [matrixdata] = replaceNaNswithModes(matrixdata)
for i=1: size(matrixdata,2)
cv= matrixdata(:,i);
modevalue= mode(cv);
cv(isnan(cv)) = modevalue;
matrixdata(:,i) = cv;
end
3) Or any other sensible way that would make sense for "categorical" data?
P.S: This link gives possible ways to handle missing data.
I suggest to use a table instead of a matrix.
Then you have functions such as ismissing (for the entire table), and isundefined to deal with missing values for categorical variables.
T = array2table(matrix);
T = standardizeMissing(T); % NaN is standard for double but this
% can be useful for other data type
var1 = categorical(T.matrix1); % array2table names the columns matrix1, matrix2, ...
missing = isundefined(var1);
T = T(~missing,:); % removes rows where the first column is NaN
matrix = table2array(T);
For a start, neither solution (1) nor solution (2) helps you handle your data more properly, since NaN is in fact a label that MATLAB handles appropriately; warnings will be issued. What you should do is:
Handle the NaNs per case
Use try catch blocks
NaN is like a number, and there is nothing bad about it. Even if you divide by NaN, MATLAB will treat it properly and give you a NaN.
If you still want to replace them, then you will need an assumption that holds. For example, if your data is a time series of engine speeds entered by the engine operator, but some time instances have not been specified, then there is more than one way to handle the NaNs that will appear in the matrix.
Replace with 0s
Replace with the previous value
Replace with the next value
Replace with the average of the previous and the next value
and many more.
As you can see your problem is ill-posed, and depends on the predictor and the data source.
In case of categorical data, e.g. three categories {0,1,2} and supposing NaN occurs in Y.
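for k=1:size(Y,2)
id=isnan(Y(:,k)); % rows of column k that are missing
m(k)=median(Y(~id,k)); % median of the observed values in column k
Y(id,k)=round(m(k)); % round so the fill value is a valid category
end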
I feel really bad that I had to write a for-loop but I cannot see any other way. As you can see I made a number of assumptions, by using median and round. You may want to use a threshold depending on your knowledge about the data.
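For what it's worth, a loop-free variant of the same idea is possible. This is only a sketch: it assumes a release where median accepts the 'omitnan' flag (R2015a or later, as far as I know), it keeps the same median-and-round assumption as the loop, and the sample Y is made up for illustration.

```matlab
% Made-up categorical data with categories {0,1,2} and some missing entries
Y = [0 1 NaN; 2 NaN 1; 2 1 1; NaN 1 0];

m = round(median(Y, 1, 'omitnan'));   % per-column median of observed values
missing = isnan(Y);                   % logical mask of the missing entries
[~, col] = find(missing);             % column of each missing entry (column-major order)
Y(missing) = m(col);                  % fill each gap with its own column's value
disp(Y)
```

Both find and logical indexing walk the matrix in column-major order, which is why m(col) lines up with Y(missing).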
I think the answer to this has been given by gd047 in dimension-reduction-in-categorical-data-with-missing-values:
I am going to look into this, if anyone has any other suggestions or particular MatLab implementations, it would be great to hear.
You can take a look at this page: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. For the first dataset, a1a, it describes transforming categorical features into binary ones. That could possibly work. (:

How do I calculate result for every value in a matrix in MATLAB

Keeping simple, take a matrix of ones i.e.
U_iso = ones(72,37)
and some parameters
ThDeg = 0:5:180;
dtheta = 5*pi/180;
dphi = 5*pi/180;
Th = ThDeg*pi/180;
Now the code is
omega_iso = 0;
for i = 1:72
for j=1:37
omega_iso = omega_iso + U_iso(i,j)*sin(Th(j))*dphi*dtheta;
end
end
and
D_iso = (4 * pi)/omega_iso
This code is fine. It takes a matrix with dimensions 72*37. The loop approximates the integral, and 4*pi is then divided by the result to get ONE value for the directivity of the antenna.
Now this code gives one value which will be around 1.002.
My problem is that I don't need one value. I need a 72*37 matrix as my answer, where the above integral approximation is applied to each cell of the 72*37 matrix, so that the directivity 'D' is also a matrix of the same size, with each cell giving the same value.
So all we have to do is instead of getting 1 value, we need value at each cell.
Can anyone please help.
You talk about creating a result that is a function essentially of the elements of U. However, in no place is that code dependent on the elements of U. Look carefully at what you have written. While you do use the variable U_iso, never is any element of U employed anywhere in that code as you have written it.
So while you talk about defining this for a matrix U, that definition is meaningless. So far, it appears that a call to repmat at the very end would create a matrix of the desired size, and clearly that is not what you are looking for.
Perhaps you tried to make the problem simple for ease of explanation. But what you did was to over-simplify, not leaving us with something that even made any sense. Please explain your problem more clearly and show code that is consistent with your explanation, for a better answer than I can provide so far.
(Note: One option MIGHT be to use arrayfun. Or the answer to this question might be more trivial, using simple vectorized operations. I cannot know at this point.)
EDIT:
Your question is still unanswerable. This loop creates a single scalar result, essentially summing over the entire array. You don't say what you mean for the integral to be computed for each element of U_iso, since you are already summing over the entire array. Please learn to be accurate in your questions, otherwise we are just guessing as to what you mean.
My best guess at the moment is that you might wish to compute a cumulative integral, in two dimensions. cumtrapz can help you there, IF that is your goal. But I'm not sure it is your goal, since your explanation is so incomplete.
You say that you wish to get the same value in each cell of the result. If that is what you wish, then a call to repmat at the end will do what you wish.
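If the repmat reading of the question is the right one, a minimal sketch could look like the following. It replaces the double loop with an equivalent vectorized sum and then copies the scalar into a matrix; the names w and D_matrix are made up for illustration.

```matlab
U_iso  = ones(72,37);
ThDeg  = 0:5:180;
dtheta = 5*pi/180;
dphi   = 5*pi/180;
Th     = ThDeg*pi/180;

% Same sum as the double loop: each column j is weighted by sin(Th(j))
w = repmat(sin(Th), size(U_iso,1), 1);
omega_iso = sum(sum(U_iso .* w)) * dphi * dtheta;
D_iso = (4*pi) / omega_iso;            % a single scalar, close to 1

% Copy the scalar into a matrix of the same size as U_iso
D_matrix = repmat(D_iso, size(U_iso));
disp(size(D_matrix))                   % 72 37
```

Every cell of D_matrix holds the same number, so this only makes sense if a constant matrix really is what is wanted.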

What's the best way to iterate through columns of a matrix?

I want to apply a function to all columns in a matrix with MATLAB. For example, I'd like to be able to call smooth on every column of a matrix, instead of having smooth treat the matrix as a vector (which is the default behaviour if you call smooth(matrix)).
I'm sure there must be a more idiomatic way to do this, but I can't find it, so I've defined a map_column function:
function result = map_column(m, func)
result = m;
for col = 1:size(m,2)
result(:,col) = func(m(:,col));
end
end
which I can call with:
smoothed = map_column(input, @(c) (smooth(c, 9)));
Is there anything wrong with this code? How could I improve it?
The MATLAB "for" statement actually loops over the columns of whatever's supplied - normally, this just results in a sequence of scalars since the vector passed into for (as in your example above) is a row vector. This means that you can rewrite the above code like this:
function result = map_column(m, func)
result = [];
for m_col = m
result = horzcat(result, func(m_col));
end
If func does not return a column vector, then you can add something like
f = func(m_col);
result = horzcat(result, f(:));
to force it into a column.
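The column-wise behaviour of for is easy to verify on a small matrix. In this sketch (the matrix and the names colSums and n are just for illustration), the loop variable is a 2-by-1 column on each pass:

```matlab
m = [1 2 3; 4 5 6];
colSums = zeros(1, size(m,2));   % preallocate one slot per column
n = 0;
for col = m                      % col is [1;4], then [2;5], then [3;6]
    n = n + 1;
    colSums(n) = sum(col);
end
disp(colSums)                    % 5 7 9
```

One catch: if a column vector is passed to for, the loop runs only once with the whole vector as the loop variable, which is why the usual 1:N pattern uses a row vector.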
Your solution is fine.
Note that horzcat exacts a substantial performance penalty for large matrices. It makes the code O(N^2) instead of O(N). For a 100x10,000 matrix, your implementation takes 2.6s on my machine, while the horzcat one takes 64.5s. For a 100x5000 matrix, the horzcat implementation takes 15.7s.
If you wanted, you could generalize your function a little and make it be able to iterate over the final dimension or even over arbitrary dimensions (not just columns).
Maybe you could always transform the matrix with the ' operator and then transform the result back.
smoothed = smooth(input', 9)';
That at least works with the fft function.
A way to cause an implicit loop across the columns of a matrix is to use cellfun. That is, you must first convert the matrix to a cell array, each cell will hold one column. Then call cellfun. For example:
A = randn(10,5);
See that here I've computed the standard deviation for each column.
cellfun(@std,mat2cell(A,size(A,1),ones(1,size(A,2))))
ans =
0.78681 1.1473 0.89789 0.66635 1.3482
Of course, many functions in MATLAB are already set up to work on rows or columns of an array as the user indicates. This is true of std of course, but this is a convenient way to test that cellfun worked successfully.
std(A,[],1)
ans =
0.78681 1.1473 0.89789 0.66635 1.3482
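A slightly shorter way to split a matrix into columns is num2cell(A,1), and 'UniformOutput', false lets cellfun handle functions that return a whole column (as smooth would) rather than a scalar. The sketch below uses cumsum as a stand-in for such a function; the names cols, s, c and C are illustrative.

```matlab
A = randn(10,5);

cols = num2cell(A, 1);          % 1-by-5 cell array, each cell holds one column

% Scalar result per column: cellfun can return a plain numeric row vector
s = cellfun(@std, cols);

% Vector result per column: collect the outputs in a cell, then concatenate
c = cellfun(@(x) cumsum(x), cols, 'UniformOutput', false);
C = [c{:}];                     % back to a 10-by-5 matrix
```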
Don't forget to preallocate the result matrix if you are dealing with large matrices. Otherwise your CPU will spend lots of cycles repeatedly re-allocating the matrix every time it adds a new row/column.
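A minimal sketch of the preallocation idea, written as a script so it is self-contained: the first column is processed up front to learn the output height, and the result matrix is allocated once. Here func is a stand-in (cumsum) for whatever column function is being mapped, and it is assumed to return the same number of elements for every column.

```matlab
m = magic(4);
func = @(c) cumsum(c);                     % stand-in for e.g. a smoothing function

% Process the first column to find out how tall the output is
first = func(m(:,1));
result = zeros(numel(first), size(m,2));   % allocate once, up front
result(:,1) = first(:);

for col = 2:size(m,2)
    out = func(m(:,col));
    result(:,col) = out(:);                % (:) forces a column, whatever func returns
end
disp(result)
```

Unlike growing the result with horzcat, each iteration only writes into memory that already exists.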
If this is a common use-case for your function, it would perhaps be a good idea to make the function iterate through the columns automatically if the input is not a vector.
This doesn't exactly solve your problem but it would simplify the functions' usage. In that case, the output should be a matrix, too.
You can also transform the matrix to one long column by using m(:). However, whether this makes sense depends on your function.