Modified linear interpolation with missing data - matlab

Imagine a set of data with given x-values (as a column vector) and several y-values combined in a matrix (row vector of column vectors). Some of the values in the matrix are not available:
%% Create the test data
N = 1e2; % Number of x-values
x = 2*sort(rand(N, 1))-1;
Y = [x.^2, x.^3, x.^4, x.^5, x.^6]; % Example values
Y(50:80, 4) = NaN(31, 1); % Some values are not avaiable
Now i have a column vector of new x-values for interpolation.
K = 1e2; % Number of interplolation values
x_i = rand(K, 1);
My goal is to find a fast way to interpolate all y-values for the given x_i values. If there are NaN values in the y-values, I want to use the y-value which is before the missing data. In the example case this would be the data in Y(49, :).
If I use interp1, I get NaN-values and the execution is slow for large x and x_i:
starttime = cputime;
Y_i1 = interp1(x, Y, x_i);
executiontime1 = cputime - starttime
An alternative is interp1q, which is about two times faster.
What is a very fast way which allows my modifications?
Possible ideas:
Do postprocessing of Y_i1 to eliminate NaN-values.
Use a combination of a loop and the find-command to always use the neighbour without interpolation.

Using interp1 with spline interpolation (spline) ignores NaN's.

Related

Cross validation and ROC curve using Matlab: how plot mean ROC curve?

I am using k-fold cross validation with k = 10. Thus, I have 10 ROC curves.
I would like to average between the curves. I can't just average the values ​​on the Y axes (using perfcurve) because the vectors returned are not the same size.
[X1,Y1,T1,AUC1] = perfcurve(t_test(1),resp(1),1);
.
.
.
[X10,Y10,T10,AUC10] = perfcurve(t_test(10),resp(10),1);
How to solve this? How can I plot the average curve of the 10 ROC curves?
So, you have k curves with different number of points, all bound in [0..1] interval in both dimensions. First, you need to calculate interpolated values for each curve at specified query points. Now you have new curves with fixed number of points and can compute their mean. The interp1 function will do the interpolation part.
%% generating sample data
k = 10;
X = cell(k, 1);
Y = cell(k, 1);
hold on;
for i=1:k
n = 10+randi(10);
X{i} = sort([0 1 rand(1, n)]);
Y{i} = sort([0 1 rand(1, n)].^.5);
end
%% Calculating interpolations
% location of query points
X2 = linspace(0, 1, 50);
n = numel(X2);
% initializing values for different curves at different query points
Y2 = zeros(k, n);
for i=1:k
% finding interpolated values for i-th curve
Y2(i, :) = interp1(X{i}, Y{i}, X2);
end
% finding the mean
meanY = mean(Y2, 1);
Notice that different interpolation methods can affect your results. For example, the ROC plot data are kind of stairs data. To find the exact values on such curves, you should use the Previous Neighbor Interpolation method, instead of the Linear Interpolation which is the default method of interp1:
Y2(i, :) = interp1(X{i}, Y{i}, X2); % linear
Y3(i, :) = interp1(X{i}, Y{i}, X2, 'previous');
This is how it affects the final results:
I solved it using Matlab's perfcurve. For that, I had to pass as a parameter a list of vectors (size vectors 1xn) for "label" and "scores". Thus, the perfcurve function already understands as a set of resolutions made using k-fold and returns the average ROC curve and its confidence interval, in addition to the AUC and its confidence interval.
[X1,Y1,T1,AUC1] = perfcurve(t_test_list,resp_list,1);
t_test and resp they are lists of size 1xk (k is the number of folds / k-fold) and each element of the lists is a 1xn vector with scores and labels.
resp = nnet(x_test(i));
t_test_act = t_test(i);
resp has 2xn format (n is the number of predicted samples). There are two classes.
t_test_act contains the labels of the current set of tests, it has formed 2xn and is composed of 0 and 1 (each column has a 1 and a 0, indicating the true class of the sample).
resp_list{i} = resp(1,:) %(scores)
t_test_list{i} = t_test_act(1,:) %(labels)
[X1,Y1,T1,AUC1] = perfcurve(t_test_list,resp_list,1);

Get Matrix of minimum coordinate distance to point set

I have a set of points or coordinates like {(3,3), (3,4), (4,5), ...} and want to build a matrix with the minimum distance to this point set. Let me illustrate using a runnable example:
width = 10;
height = 10;
% Get min distance to those points
pts = [3 3; 3 4; 3 5; 2 4];
sumSPts = length(pts);
% Helper to determine element coordinates
[cols, rows] = meshgrid(1:width, 1:height);
PtCoords = cat(3, rows, cols);
AllDistances = zeros(height, width,sumSPts);
% To get Roh_I of evry pt
for k = 1:sumSPts
% Get coordinates of current Scribble Point
currPt = pts(k,:);
% Get Row and Col diffs
RowDiff = PtCoords(:,:,1) - currPt(1);
ColDiff = PtCoords(:,:,2) - currPt(2);
AllDistances(:,:,k) = sqrt(RowDiff.^2 + ColDiff.^2);
end
MinDistances = min(AllDistances, [], 3);
This code runs perfectly fine but I have to deal with matrix sizes of about 700 milion entries (height = 700, width = 500, sumSPts = 2k) and this slows down the calculation. Is there a better algorithm to speed things up?
As stated in the comments, you don't necessary have to put everything into a huge matrix and deal with gigantic matrices. You can :
1. Slice the pts matrix into reasonably small slices (say of length 100)
2. Loop on the slices and calculate the Mindistances slice over these points
3. Take the global min
tic
Mindistances=[];
width = 500;
height = 700;
Np=2000;
pts = [randi(width,Np,1) randi(height,Np,1)];
SliceSize=100;
[Xcoords,Ycoords]=meshgrid(1:width,1:height);
% Compute the minima for the slices from 1 to floor(Np/SliceSize)
for i=1:floor(Np/SliceSize)
% Calculate indexes of the next slice
SliceIndexes=((i-1)*SliceSize+1):i*SliceSize
% Get the corresponding points and reshape them to a vector along the 3rd dim.
Xpts=reshape(pts(SliceIndexes,1),1,1,[]);
Ypts=reshape(pts(SliceIndexes,2),1,1,[]);
% Do all the diffs between your coordinates and your points using bsxfun singleton expansion
Xdiffs=bsxfun(#minus,Xcoords,Xpts);
Ydiffs=bsxfun(#minus,Ycoords,Ypts);
% Calculate all the distances of the slice in one call
Alldistances=bsxfun(#hypot,Xdiffs,Ydiffs);
% Concatenate the mindistances
Mindistances=cat(3,Mindistances,min(Alldistances,[],3));
end
% Check if last slice needed
if mod(Np,SliceSize)~=0
% Get the corresponding points and reshape them to a vector along the 3rd dim.
Xpts=reshape(pts(floor(Np/SliceSize)*SliceSize+1:end,1),1,1,[]);
Ypts=reshape(pts(floor(Np/SliceSize)*SliceSize+1:end,2),1,1,[]);
% Do all the diffs between your coordinates and your points using bsxfun singleton expansion
Xdiffs=bsxfun(#minus,Xcoords,Xpts);
Ydiffs=bsxfun(#minus,Ycoords,Ypts);
% Calculate all the distances of the slice in one call
Alldistances=bsxfun(#hypot,Xdiffs,Ydiffs);
% Concatenate the mindistances
Mindistances=cat(3,Mindistances,min(Alldistances,[],3));
end
% Get global minimum
Mindistances=min(Mindistances,[],3);
toc
Elapsed time is 9.830051 seconds.
Note :
You'll not end up doing less calculations. But It will be a lot less intensive for your memory (700M doubles takes 45Go in memory), thus speeding up the process (With the help of vectorizing aswell)
About bsxfun singleton expansion
One of the great strength of bsxfun is that you don't have to feed it matrices whose values are along the same dimensions.
For example :
Say I have two vectors X and Y defined as :
X=[1 2]; % row vector X
Y=[1;2]; % Column vector Y
And that I want a 2x2 matrix Z built as Z(i,j)=X(i)+Y(j) for 1<=i<=2 and 1<=j<=2.
Suppose you don't know about the existence of meshgrid (The example is a bit too simple), then you'll have to do :
Xs=repmat(X,2,1);
Ys=repmat(Y,1,2);
Z=Xs+Ys;
While with bsxfun you can just do :
Z=bsxfun(#plus,X,Y);
To calculate the value of Z(2,2) for example, bsxfun will automatically fetch the second value of X and Y and compute. This has the advantage of saving a lot of memory space (No need to define Xs and Ys in this example) and being faster with big matrices.
Bsxfun Vs Repmat
If you're interested with comparing the computational time between bsxfun and repmat, here are two excellent (word is not even strong enough) SO posts by Divakar :
Comparing BSXFUN and REPMAT
BSXFUN on memory efficiency with relational operations

Matlab best technique to remove outliers in data

I have 2 columns x, y of 100 points each. I would like to remove the outliers data and refill their gap with the average value of the points near to them. Firstly, can I do that? is any Matlab function? Secondly, if yes, what is the best technique to make that?
E.g:
x = 1:1:100
y = rand(1,99)
y(end+1)=2
In this case, not so similar to my problem, I would like to remove value 2 at the end and to be replaced with one similar to their neighbor points. In my case the distribution of the [x,y] is a non linear function, having few outliers.
It depends on what you mean by outlier. If you assume that outliers are more than three standard deviations from the median, for example, you could do this
all_idx = 1:length(x)
outlier_idx = abs(x - median(x)) > 3*std(x) | abs(y - median(y)) > 3*std(y) % Find outlier idx
x(outlier_idx) = interp1(all_idx(~outlier_idx), x(~outlier_idx), all_idx(outlier_idx)) % Linearly interpolate over outlier idx for x
y(outlier_idx) = interp1(all_idx(~outlier_idx), y(~outlier_idx), all_idx(outlier_idx)) % Do the same thing for y
This code will just remove the outliers and linearly interpolate over their positions using the closest values that are not outliers.

Matlab Vectorization of Multivariate Gaussian Basis Functions

I have the following code for calculating the result of a linear combination of Gaussian functions. What I'd really like to do is to vectorize this somehow so that it's far more performant in Matlab.
Note that y is a column vector (output), x is a matrix where each column corresponds to a data point and each row corresponds to a dimension (i.e. 2 rows = 2D), variance is a double, gaussians is a matrix where each column is a vector corresponding to the mean point of the gaussian and weights is a row vector of the weights in front of each gaussian. Note that the length of weights is 1 bigger than gaussians as weights(1) is the 0th order weight.
function [ y ] = CalcPrediction( gaussians, variance, weights, x )
basisFunctions = size(gaussians, 2);
xvalues = size(x, 2);
if length(weights) ~= basisFunctions + 1
ME = MException('TRAIN:CALC', 'The number of weights should be equal to the number of basis functions plus one');
throw(ME);
end
y = weights(1) * ones(xvalues, 1);
for xIdx = 1:xvalues
for i = 1:basisFunctions
diff = x(:, xIdx) - gaussians(:, i);
y(xIdx) = y(xIdx) + weights(i+1) * exp(-(diff')*diff/(2*variance));
end
end
end
You can see that at the moment I simply iterate over the x vectors and then the gaussians inside 2 for loops. I'm hoping that this can be improved - I've looked at meshgrid but that seems to only apply to vectors (and I have matrices)
Thanks.
Try this
diffx = bsxfun(#minus,x,permute(gaussians,[1,3,2])); % binary operation with singleton expansion
diffx2 = squeeze(sum(diffx.^2,1)); % dot product, shape is now [XVALUES,BASISFUNCTIONS]
weight_col = weights(:); % make sure weights is a column vector
y = exp(-diffx2/2/variance)*weight_col(2:end); % a column vector of length XVALUES
Note, I changed diff to diffx since diff is a builtin. I'm not sure this will improve performance as allocating arrays will offset increase by vectorization.

Matlab disregarding NaN's in matrix

I have a matrix (X) of doubles containing time series. Some of the observations are set to NaN when there is a missing value. I want to calculate the standard deviation per column to get a std dev value for each column. Since I have NaNs mixed in, a simple std(X) will not work and if I try std(X(~isnan(X)) I end up getting the std dev for the entire matrix, instead of one per column.
Is there a way to simply omit the NaNs from std dev calculations along the 1st dim without resorting to looping?
Please note that I only want to ignore individual values as opposed to entire rows or cols in case of NaNs. Obviously I cannot set NaNs to zero or any other value as that would impact calculations.
Have a look at nanstd (stat toolbox).
The idea is to center the data using nanmean, then to replace NaN with zero, and finally to compute the standard deviation.
See nanmean below.
% maximum admissible fraction of missing values
max_miss = 0.6;
[m,n] = size(x);
% replace NaNs with zeros.
inan = find(isnan(x));
x(inan) = zeros(size(inan));
% determine number of available observations on each variable
[i,j] = ind2sub([m,n], inan); % subscripts of missing entries
nans = sparse(i,j,1,m,n); % indicator matrix for missing values
nobs = m - sum(nans);
% set nobs to NaN when there are too few entries to form robust average
minobs = m * (1 - max_miss);
k = find(nobs < minobs);
nobs(k) = NaN;
mx = sum(x) ./ nobs;
See nanstd below.
flag = 1; % default: normalize by nobs-1
% center data
xc = x - repmat(mx, m, 1);
% replace NaNs with zeros in centered data matrix
xc(inan) = zeros(size(inan));
% standard deviation
sx = sqrt(sum(conj(xc).*xc) ./ (nobs-flag));