Replacing outlier values with NaN in MATLAB

I have an n x m data matrix with n samples and m measurements per sample. I'm dealing with data from mass spectrometry, measuring the concentration of different metabolites. Each column is the concentrations of a single metabolite. The rows are the samples. Some of the samples have a few metabolite measurements that are much higher than the rest of the samples.
I want to find these outlier values, and replace them with NaN. Is there a way to do this automatically, maybe by looking for values higher than X column SDs and making them NaN? I have found relevant questions for R and Python, but not for MATLAB.
Addendum: dfri's solution worked perfectly for me. However, I couldn't use the column SD as a cutoff-measure, because the outliers made the SD so large that the outlier values were still within the threshold (they were 10 000 times larger than the rest). I ended up using 100 x the column median as a threshold for removal.

You can compare the elements of your data against some threshold to identify the outliers, and use the resulting logical index to replace the outlier values with NaN. E.g.
data = randi(4,5); %// values in {1, 2, 3, 4}
threshold = 3; %// decide upon your threshold
data(data > threshold) = NaN
data =
NaN 3 NaN 2 2
3 1 3 2 2
2 2 2 NaN 3
3 1 NaN NaN 3
1 1 1 1 NaN
If you want to replace outliers w.r.t. some threshold column per column, you can make use of e.g. bsxfun (thanks @Dan):
data = randi(4,5) %// values in {1, 2, 3, 4}
threshold = mean(data)+1*std(data) %// per column
data(bsxfun(@(x, y) x > y, data, threshold)) = NaN
%// example:
threshold =
4.7416 3.7416 4.0000 2.8954 1.9477
data =
4 3 2 NaN NaN
4 NaN 3 1 1
1 3 4 1 NaN
4 1 4 1 1
4 1 2 NaN 1
Note that the most important (non-MATLAB-technical) part in your case, as mentioned by @Dan in his comments above, is to decide how you create the threshold values for each of the columns. The simple thresholds in the example above have only been included to show the technical aspects of how to "remove" outliers (set them to NaN) given an array of thresholds for the columns.
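The median-based cutoff mentioned in the question's addendum could be sketched along the same lines. This is only a minimal illustration: the sample data and the variable names factor, isOutlier and cleaned are made up for the example, and the factor of 100 is simply the value quoted in the addendum.

```matlab
% Hypothetical data: 5 samples (rows) x 3 metabolites (columns)
data = [1.0 2.0     3.0
        1.2 2.1     3.1
        0.9 1.9     30000
        1.1 25000   2.9
        1.0 2.0     3.0];

factor = 100;                              % assumed cutoff multiplier
threshold = factor * median(data, 1);      % 1-by-m row of per-column thresholds
isOutlier = bsxfun(@gt, data, threshold);  % n-by-m logical mask, column-wise comparison

cleaned = data;
cleaned(isOutlier) = NaN;                  % replace outliers, keep everything else
```

The median is far less sensitive to a few extreme values than the mean or the SD, which is why it still gives a usable threshold when the outliers are several orders of magnitude larger than the rest.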

Related

Reshape a matrix in matlab with nan elements

I have an Nx3 matrix in MATLAB, with a degree value from 0 to 360 in the first column, a radius value from 0 to 0.5 in the second, and an integer in the third. Every combination (M(n,1), M(n,2)) is unique, with M the matrix and n any index between 1 and N, but a value may occur more than once in M(:,1) or in M(:,2). M is sorted first by the first column, then by the second.
My goal is to reshape this matrix into a 360xV matrix, with V the number of unique values in M(:,2). If there is a value in M(:,3) for the pair M(o,1) and M(p,2) with 1 <= o, p <= N, it should be placed at position (o,p); if there is no value, a NaN should be placed there instead.
Is there a simple way to do this, or do I have to write my own function for that?
Edit:
Desired input:
0 0.25 1
0 0.43 4
1 0.25 2
2 0.03 5
2 0.43 2
Desired output:
NaN 1 4
NaN 2 NaN
5 NaN 2
You can use an approach of finding unique indices for the first and second columns from the input arrays and then using those to set elements in an appropriately (discussed in more detail inside the code as comments) sized output array with the elements from the third column. Here's the implementation -
%// Input array
A = [
0 0.25 1
0 0.43 4
1 0.25 2
2 0.03 5
2 0.43 2 ]
%// Find unique indices for col-1,2
[~,~,idx1] = unique(A(:,1)) %// these would form the row indices of output
[~,~,idx2] = unique(A(:,2)) %// these would form the col indices of output
%// Decide the size of output array based on the "extents" of those indices
nrows = max(idx1)
ncols = max(idx2)
%// Initialize output array with NaNs
out = NaN(nrows,ncols)
%// Set the elements in output indexed by those indices to values from
%// col-3 of input array
out((idx2-1)*nrows + idx1) = A(:,3)
Code run -
out =
NaN 1 4
NaN 2 NaN
5 NaN 2
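As a possible alternative to the manual linear-index computation, the same result can be sketched with accumarray and a NaN fill value. This assumes, as stated in the question, that every (column 1, column 2) pair is unique, so accumarray's default summation never actually combines two values.

```matlab
% Same input array as above
A = [0 0.25 1
     0 0.43 4
     1 0.25 2
     2 0.03 5
     2 0.43 2];

[~,~,idx1] = unique(A(:,1));   % row index for each entry
[~,~,idx2] = unique(A(:,2));   % column index for each entry

% Empty size and function arguments use the defaults (size from the
% subscripts, @sum); the fifth argument fills untouched cells with NaN
out = accumarray([idx1 idx2], A(:,3), [], [], NaN);
```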
Is there a simple way to do this, or do I have to write my own function for that?
You'll need to write your own function, since what you've described is very specific to your problem. There are functions to find unique values (e.g. unique), which will help you when designing your for loop.

Repmat function in matlab

I have been through a bunch of questions about the repmat function in MATLAB, but I can't figure out how this process works.
I am trying to translate it into R, but my problem is that I do not know how the function manipulates the data.
The code is part of a process to make a pairs trading strategy, where the code takes in a vector of FALSE/TRUE expressions.
The code is:
% initialize positions array
positions=NaN(length(tday), 2);
% long entries
positions(shorts, :)=repmat([-1 1], [length(find(shorts)) 1]);
where shorts is the vector of TRUE/FALSE expressions.
Hope you can help.
repmat repeats the matrix you give it [dim1 dim2 dim3,...] times. What your code does is:
1. length(find(shorts)): gets the number of true entries in shorts.
e.g:
shorts=[1 0 0 0 1 0]
length(find(shorts))
ans = 2
2. repmat([-1 1], [length(find(shorts)) 1]); repeats [-1 1] length(find(shorts)) times along the rows and once along the columns.
continuation of e.g.:
repmat([-1 1], [length(find(shorts)) 1]);
ans=[-1 1
-1 1];
3. positions(shorts, :)= stores the given matrix at the given indices. (NOTE: this only works if shorts is of type logical.)
continuation of e.g.:
At this point, if you haven't omitted anything, positions should be a 6x2 NaN matrix. The indexing will fill the rows where shorts is true with the [-1 1] rows, so after this positions will be:
positions=[-1 1
NaN NaN
NaN NaN
NaN NaN
-1 1
NaN NaN]
Hope it helps
The MATLAB repmat function replicates and tiles the array. The syntax is
B = repmat(A,n)
where A is the input array and n specifies how to tile the array. If n is a vector [n1,n2] - as in your case - then A is replicated n1 times in rows and n2 times in columns. E.g.
A = [ 1 2 ; 3 4]
B = repmat(A,[2,3])
B =
1 2 | 1 2 | 1 2
3 4 | 3 4 | 3 4
----+-----+----
1 2 | 1 2 | 1 2
3 4 | 3 4 | 3 4
(the lines are only to illustrate how A gets tiled)
In your case, repmat replicates the vector [-1, 1] once for each non-zero element of shorts. You thus set each row of positions whose corresponding entry in shorts is not zero to [-1,1]. All other rows will stay NaN.
For example if
shorts = [1; 0; 1; 1; 0];
then your code will create
positions =
-1 1
NaN NaN
-1 1
-1 1
NaN NaN
I hope this helps you to clarify the effect of repmat. If not, feel free to ask.
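Putting the pieces together, a minimal runnable sketch of the assignment might look like this. The shorts vector here is made up for illustration, and nnz(shorts) is used as an equivalent of length(find(shorts)).

```matlab
% Hypothetical logical mask: true where a short position is entered
shorts = logical([1; 0; 0; 0; 1; 0]);

% Start with all positions unknown (NaN), one row per entry of shorts
positions = NaN(numel(shorts), 2);

% nnz(shorts) counts the true entries, so repmat builds one [-1 1] row
% per selected row of positions
positions(shorts, :) = repmat([-1 1], [nnz(shorts) 1]);
```

Rows 1 and 5 end up as [-1 1], and all other rows stay NaN.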

How can I use values within a MATLAB matrix as indices to determine the location of data in a new matrix?

I have a matrix that looks like the following.
I want to take the column 3 values and put them in another matrix, according to the following rule.
The value in Column 5 is the row index for the new matrix, and Column 6 is the column index. Therefore 20 (taken from 29,3) should be in Row 1 Column 57 of the new matrix, 30 (from 30,3) should be in Row 1 Column 4 of the new matrix, and so on.
If the value in column 3 is NaN then I want NaN to be copied over to the new matrix.
Example:
% matrix of values and row/column subscripts
A = [
20 1 57
30 1 4
25 1 16
nan 1 26
nan 1 28
25 1 36
nan 1 53
50 1 56
nan 2 1
nan 2 2
nan 2 3
80 2 5
];
% fill new matrix
B = zeros(5,60);
idx = sub2ind(size(B), A(:,2), A(:,3));
B(idx) = A(:,1);
There are a couple other ways to do this, but I think the above code is easy to understand. It is using linear indexing.
Assuming you don't have duplicate subscripts, you could also use:
B = full(sparse(A(:,2), A(:,3), A(:,1), m, n));
(where m and n are the output matrix size)
Another one:
B = accumarray(A(:,[2 3]), A(:,1), [m,n]);
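As a quick sanity check, the three variants can be compared on a small made-up example. The matrix A and the sizes m and n below are assumptions chosen for illustration, with no duplicate subscripts.

```matlab
% Hypothetical input: [value, row subscript, column subscript]
A = [20  1 3
     30  1 2
     NaN 2 1
     80  2 4];
m = 2;  n = 4;   % assumed output size

% Variant 1: linear indexing via sub2ind
B1 = zeros(m, n);
B1(sub2ind([m n], A(:,2), A(:,3))) = A(:,1);

% Variant 2: build a sparse matrix and convert it to full
B2 = full(sparse(A(:,2), A(:,3), A(:,1), m, n));

% Variant 3: accumarray (sums values that share a subscript)
B3 = accumarray(A(:,[2 3]), A(:,1), [m n]);

% isequaln treats NaN entries as equal to each other
allAgree = isequaln(B1, B2) && isequaln(B1, B3);
```

In all three variants the NaN values are copied over, and positions that are never addressed are left at zero.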
I am not sure if I understood your question clearly but this might help:
(Assuming your main matrix is A)
nRows = max(A(:,5));
nColumns = max(A(:,6));
FinalMatrix = zeros(nRows,nColumns);
for i=1:size(A,1)
FinalMatrix(A(i,5),A(i,6))=A(i,3);
end
Note that the above code sets the rest of the elements to zero.

Checking values of two vectors against each other and then using the column location of equal entries to extract columns from a matrix in matlab

I'm doing a curve fitting problem in Matlab and so far I've set up some orthonormal polynomials along a specified range of x-values with x = (0:0.0001:40);
The polynomials themselves are each a manipulation of that x vector and are stored as a row in a matrix. I also have some data entries in the form of two vectors: one for the data x-coords and one for the actual values. I need a way to use the x-coords of my data points to find the same values in my continuous x-vector and then take the corresponding columns from my polynomial matrix and add them to a new matrix.
EDIT: To be more clear. I have, for example:
x = [0 1 2 3 4 5]
Polynomial =
1 1 1 1 1 1
0 1 2 3 4 5
0 1 4 9 16 25
% Data values:
xcoord = [1 3 4]
values = [5 3 8]
I want to check the xcoord values against 'x' to find the corresponding columns and then pull out those columns from the polynomial matrix to get:
Polynomial =
1 1 1
1 3 4
1 9 16
If your x and xcoord were the same length, you could use logical indexing, which is elegant; something along the lines of Polynomial(:, x==xcoord). But since this doesn't seem to be the case, there's a less fancy solution with a for-loop and find(xcoord(i)==x).
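One way to do the lookup without a loop is ismember, which returns the positions in x of the values in xcoord. The sketch below uses the small integer example from the question. The names found, cols and selected are made up for the example.

```matlab
x = [0 1 2 3 4 5];
Polynomial = [ones(1, numel(x)); x; x.^2];   % one polynomial per row
xcoord = [1 3 4];

% found(k) is true if xcoord(k) occurs in x; cols(k) is its position in x
[found, cols] = ismember(xcoord, x);

% Pull out the matching columns, skipping any xcoord not present in x
selected = Polynomial(:, cols(found));
```

With a grid such as x = (0:0.0001:40), exact equality can fail because of floating-point rounding. In that case a tolerance-based comparison (ismembertol in newer MATLAB releases) or computing the index directly from the grid spacing may be more robust; both are suggestions rather than something the original answer covers.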

Find row-wise minima in sparse matrix

I would like to get the minimum nonzero values per row in a sparse matrix. Solutions I found for dense matrices suggested masking out the zero values by setting them to NaN or Inf. However, this obviously doesn't work for sparse matrices.
Ideally, I should get a column vector of all the row-wise minima, as I would get with
minValues = min( A, [], 2);
Except, obviously, using min leaves me with an all-zeros column vector due to the sparsity. Is there a solution using find?
This is perfect for accumarray. Consider the following sparse matrix,
vals = [3 1 1 9 7 4 10 1]; % got this from randi(10,1,8)
S = sparse([1 3 4 4 5 5 7 9],[2 2 3 6 7 8 8 11],vals);
Get the minimum value for each row, assuming 0 for empty elements:
[ii,jj] = find(S);
rowMinVals = accumarray(ii,nonzeros(S),[],@min)
Note that rows 4 and 5 of rowMinVals, which correspond to the only two rows of S with multiple nonzero values, are equal to the min of the nonzero values in each row:
rowMinVals =
3
0
1
1 % min([1 9])
4 % min([7 4])
0
10
0
1
If the last row(s) of your sparse matrix do not contain any non-zeros, but you want your min row value output to reflect that you have numRows, for example, change the accumarray command as follows:
rowMinVals = accumarray(ii,nonzeros(S),[numRows 1],@min)
Also, perhaps you want to avoid including the default 0 in the output. One way to handle that is to set the fillval input argument to NaN:
rowMinVals = accumarray(ii,nonzeros(S),[numRows 1],@min,NaN)
rowMinVals =
3
NaN
1
1
4
NaN
10
NaN
1
NaN
NaN
NaN
Or you can keep using a sparse matrix with the sixth input argument, issparse:
>> rowMinVals = accumarray(ii,nonzeros(S),[],@min,[],true)
rowMinVals =
(1,1) 3
(3,1) 1
(4,1) 1
(5,1) 4
(7,1) 10
(9,1) 1
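For reference, the steps above can be combined into one self-contained sketch. Taking the output height from size(S,1) instead of a separate numRows variable is an assumption made here for convenience, not part of the original answer.

```matlab
% Example sparse matrix from above
vals = [3 1 1 9 7 4 10 1];
S = sparse([1 3 4 4 5 5 7 9], [2 2 3 6 7 8 8 11], vals);

% Row subscripts of the nonzeros, in the same (column-major) order
% that nonzeros(S) returns the values
[ii, ~] = find(S);

% One output row per row of S; rows without nonzeros get NaN
rowMinVals = accumarray(ii, nonzeros(S), [size(S,1) 1], @min, NaN);
```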