Matlab disregarding NaN's in matrix - matlab

I have a matrix (X) of doubles containing time series. Some of the observations are set to NaN when there is a missing value. I want to calculate the standard deviation of each column. Since I have NaNs mixed in, a simple std(X) will not work, and if I try std(X(~isnan(X))) I end up getting the std dev for the entire matrix instead of one per column.
Is there a way to simply omit the NaNs from std dev calculations along the 1st dim without resorting to looping?
Please note that I only want to ignore individual values as opposed to entire rows or cols in case of NaNs. Obviously I cannot set NaNs to zero or any other value as that would impact calculations.

Have a look at nanstd (Statistics Toolbox).
The idea is to center the data using nanmean, then to replace NaN with zero, and finally to compute the standard deviation.
See nanmean below.
% maximum admissible fraction of missing values
max_miss = 0.6;
[m,n] = size(x);
% replace NaNs with zeros.
inan = find(isnan(x));
x(inan) = zeros(size(inan));
% determine number of available observations on each variable
[i,j] = ind2sub([m,n], inan); % subscripts of missing entries
nans = sparse(i,j,1,m,n); % indicator matrix for missing values
nobs = m - sum(nans);
% set nobs to NaN when there are too few entries to form robust average
minobs = m * (1 - max_miss);
k = find(nobs < minobs);
nobs(k) = NaN;
mx = sum(x) ./ nobs;
See nanstd below.
flag = 1; % default: normalize by nobs-1
% center data
xc = x - repmat(mx, m, 1);
% replace NaNs with zeros in centered data matrix
xc(inan) = zeros(size(inan));
% standard deviation
sx = sqrt(sum(conj(xc).*xc) ./ (nobs-flag));
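The same idea can be sketched on a small matrix without any toolbox. This is a minimal illustration, not the toolbox implementation: the test matrix and the variable names (valid, X0, Xc) are made up for the example, and the max_miss threshold from the code above is left out.

```matlab
% Per-column standard deviation that ignores individual NaN entries.
X = [1 2; NaN 4; 3 NaN; 5 8];        % illustrative data, one series per column

valid = ~isnan(X);                   % true where an observation exists
nobs  = sum(valid, 1);               % number of observations per column

X0 = X;
X0(~valid) = 0;                      % zeros do not contribute to the sums
mx = sum(X0, 1) ./ nobs;             % per-column mean over valid entries

Xc = bsxfun(@minus, X, mx);          % center each column
Xc(~valid) = 0;                      % drop missing entries from the sum of squares

sx = sqrt(sum(Xc.^2, 1) ./ (nobs - 1));   % normalize by nobs-1, like std(X)
disp(sx)

% In MATLAB R2015a and newer, std(X, 0, 1, 'omitnan') should give the same result.
```

A column with fewer than two valid observations yields NaN or Inf here, so a guard on nobs may be needed for sparse data.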

Related

How can I find a specific point in a figure in MATLAB?

I want to read a specific value from a figure in MATLAB. I added the black circle and arrow manually through the figure's Insert menu, but how can I get the value at that point?
I want the x-axis values at which each CDF curve reaches exactly 90%.
I have attached my MATLAB figure as a JPG.
I would use interp1 to find the value. I'll assume that your x variable is called x and your cdf value is called c. You can then use code like this to get the x value where c = 0.9. This will work even if you don't have a cdf value at exactly 0.9.
x_at_0p9 = interp1(c, x, 0.9);
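One caveat: interp1 needs distinct sample points, and an empirical CDF often has repeated values (flat stretches, or a tail that saturates at 1). The sketch below removes duplicates with unique before inverting. The exponential CDF is only an assumed stand-in for the curves in the figure.

```matlab
% Invert a sampled CDF at 0.9 by interpolating x as a function of c.
x = linspace(0, 10, 101).';          % illustrative grid
c = 1 - exp(-x);                     % illustrative CDF values on that grid

[cu, iu] = unique(c);                % keep one x per distinct CDF value
x_at_0p9 = interp1(cu, x(iu), 0.9);  % linear interpolation of the inverse CDF

fprintf('x at 90%% of the CDF: %.4f\n', x_at_0p9);
```

For several curves stored as columns of a matrix, the same two lines can be applied to each column in turn.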
You plotted those figures by using:
plot(X,Y)
So, your problem is to find x_0 value that makes Y = 0.9.
You can do this:
ii = (Y==0.9) % finding index
x_0 = X(ii) % using index to get x_0 value
Of course this will only work if your Y vector has exactly the 0.9 value.
As this is not always the case, you may want the first x_0 value at which Y is greater than or equal to 0.9.
Then you can do this:
ii = find(Y>=0.9, 1) % finding index
x_0 = X(ii) % using index to get x_0 value
Assuming that your values are x and Y (where x is a vector shared by all curves, and Y is a matrix with one row per element of x and one column per curve), you just need to find the first point where Y reaches 0.9:
x = (0:0.01:pi/2)'; % time vector
Y = sin(x*rand(1,3))*10; % value matrix
% where do the values reach 0.9?
lg = Y >= 0.9;
% allocate memory
XY = NaN(2,size(Y,2));
for i = 1:size(Y,2)
    % find the first true entry in column i (this is a row index)
    idx = find(lg(:,i),1);
    if ~isempty(idx) % leave NaN if the curve never reaches 0.9
        XY(:,i) = [x(idx);Y(idx,i)];
    end
end
plot(x,Y, XY(1,:),XY(2,:), 'o')

Error using * Inner matrix dimensions must agree in using Least Squares - how to make the regressor array for multiple independent variables

I am trying to learn how to code linear regression. In statistics_data, the first column is the yeast growth year, the second is the value of a chemical component, and the third is the population. Once theta is calculated using the least squares formulation, I want to predict the population using:
pred_year = 2020;
pred_year_val = [1 2020];
which gives this error:
Error using *
Inner matrix dimensions must agree.
Error in main_normal_equation (line 44)
pred_value = pred_year_val * theta;
Below is the code:
statistics_data = [2007, 2, 9182927;
2008,3,9256347;
2009,3.5,9340682;
2010,4,9415570;
2011,5,9482855;
2012,4.8,9555893;
2013,4.9,9644864;
2014,5,9747355;
2015,5,9851017;
2016,5,9995153;
2017,5,10120242;];
% Convert to independent variable matrix and response
X = (statistics_data(:,1:2));
y = (statistics_data(:,3));
% Convert matrix values to double
X = double(X);
y = double(y);
hold on;
% Set the x-axis label
xlabel('Year');
% Set the y-axis label
ylabel('Population');
% Plot population data
plot(X, y, 'rx', 'MarkerSize', 10);
m = length(y);
% Add ones column
X = [ones(m, 1) X];
% Normal Equation
theta = (pinv(X'*X))*X'*y
% Predict population for 2020
pred_year = 2020;
pred_year_val = [1 2020];
% Calculate predicted value
pred_value = pred_year_val * theta;
% Plot linear regression line
plot(X(:,2), X*theta, '-')
fprintf('Predicted population in 2020 is %d people\n ', int64(pred_value));
In MATLAB, the * operator performs matrix multiplication, which has strict rules about the dimensions of the operands.
Inspecting your code, it does not seem that your intent is to do a matrix multiply.
You can multiply a matrix by a scalar using *, which scales each value in the matrix.
You can also multiply element by element (array multiplication) using the .* operator.
To resolve your issue, clarify whether you intended a matrix multiplication, a scalar multiplication, or an element-wise multiplication, then set your operands and operator accordingly.
It isn't clear to me exactly how the math in your code is supposed to be executed; otherwise I could show you where your operators and operands must be changed.
You could start by reviewing the documentation here: https://www.mathworks.com/help/matlab/matlab_prog/array-vs-matrix-operations.html
So pred_year_val has size [1 2] while theta has size [3 1]. Matrix multiplication requires the number of columns of the left operand to equal the number of rows of the right operand. Here 2 is not equal to 3, so the execution of
pred_value = pred_year_val * theta;
is bound to fail. theta has three entries (intercept, year, chemical component), so you need to add a value for the chemical component to pred_year_val.
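A minimal sketch of that fix, using the 2008 to 2017 rows from the question. The chemical value for 2020 (chem_2020 = 5) is a made-up assumption, since the question does not supply one, and the backslash solver is swapped in for pinv(X'*X)*X'*y as an equivalent least squares solve.

```matlab
% Fit population ~ intercept + year + chemical, then predict one new row.
data = [2008, 3,   9256347;
        2009, 3.5, 9340682;
        2010, 4,   9415570;
        2011, 5,   9482855;
        2012, 4.8, 9555893;
        2013, 4.9, 9644864;
        2014, 5,   9747355;
        2015, 5,   9851017;
        2016, 5,   9995153;
        2017, 5,   10120242];

X = [ones(size(data, 1), 1), data(:, 1:2)];   % columns: 1, year, chemical
y = data(:, 3);

theta = X \ y;                                % 3-by-1 least squares solution

chem_2020 = 5;                                % hypothetical value, not from the question
pred_row  = [1, 2020, chem_2020];             % 1-by-3, matches the rows of theta
pred_value = pred_row * theta;                % (1x3)*(3x1) gives a scalar

fprintf('Predicted population in 2020 is %d people\n', round(pred_value));
```

The prediction row must list its entries in the same order as the columns of X.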

vectorising multiple calls of Matlab 'find'

I make a large number of calls to the 'find' function of Matlab. For example, the following should give the essence:
x=rand(1,10^8);
indx=zeros(1,10^8);
for i=1:10^8
indx(i) = find([0.2, 0.52, 0.76,1] < x(i), 1, 'last');
end
Is there a way to vectorize this code to speed it up? Just including x as a vector creates an error. If vectorization is not possible, then any other suggestions for speed would be appreciated. The actual problem I wish to solve has a considerably longer vector in the place of [0.2, 0.52, 0.76,1], so any solution shouldn't depend on the specific vector I provided.
thanks.
For MATLAB versions R2015a and newer, the answer from crjones gives the best option using discretize:
edges = [0.2, 0.52, 0.76, 1];
indx = discretize(x, edges, 'IncludedEdge', 'right');
Any values in x outside the range of edges will have NaN for their indices.
For MATLAB versions R2014b and newer you can also use histcounts:
[~, ~, indx] = histcounts(x, edges);
The differences with discretize are that you can also get the count of values in each bin (the first output), and indices for values in x outside the range of edges will be 0.
For MATLAB versions prior to R2014b you can use histc (deprecated in newer versions):
[~, indx] = histc(x, edges);
Again, you can also get the count of values in each bin (the first output), and indices for values in x outside the range of edges will be 0.
Based on your example, you may want to consider using the discretize function for this:
x=rand(1,10^8);
edges = [0.2, 0.52, 0.76, 1];
indx = discretize(x, edges, 'IncludedEdge', 'right');
Note that cases outside of the range will result in NaN.
% small test case
% x = [0.5198, 0.0768, 0.6788, 0.9496]
% indx = discretize(x, edges, 'IncludedEdge', 'right')
% answer: 1 NaN 2 3
Of course, this will only be applicable if you're trying to find where x fits in a well-ordered set.
Compare your vector with x to get a logical matrix indicating which values in vec are less than each element of x. Multiply that logical matrix by a column vector holding the row indices of vec, then use max along each column to find the largest (last) index that satisfies the inequality. Where no value satisfies the inequality, you will get zero.
vec = [0.2, 0.52, 0.76, 1]; %Your vector
indx = bsxfun(@lt, vec(:), x); %Making 'vec' a column vector and comparing with 'x'
indx = max(bsxfun(@times, indx, (1:numel(vec)).')); %The required result
With R2016b and later versions, you can use implicit expansion instead of bsxfun:
indx = vec(:) < x ;
indx = max(indx .* (1:numel(vec)).');
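As a quick check, the comparison-based approach can be run against the original loop on the four-element test case used earlier. The names below and indx_loop are just for this sketch, and the loop is given a guard so that values below every threshold map to 0 instead of raising an error.

```matlab
% Compare the looped find with the vectorized comparison on a small input.
vec = [0.2, 0.52, 0.76, 1];
x   = [0.5198, 0.0768, 0.6788, 0.9496];

% Reference: one find per element, 0 when no threshold lies below x(i)
indx_loop = zeros(size(x));
for i = 1:numel(x)
    k = find(vec < x(i), 1, 'last');
    if ~isempty(k)
        indx_loop(i) = k;
    end
end

% Vectorized: numel(vec)-by-numel(x) logical matrix, then last true row per column
below    = bsxfun(@lt, vec(:), x);
indx_vec = max(bsxfun(@times, below, (1:numel(vec)).'), [], 1);

disp([indx_loop; indx_vec])   % both rows: 1 0 2 3
```

The intermediate matrix has numel(vec)*numel(x) entries, so for 10^8 values and a long vec it may be necessary to process x in chunks, or to prefer discretize or histc, which do not build that matrix.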

Matlab Vectorization of Multivariate Gaussian Basis Functions

I have the following code for calculating the result of a linear combination of Gaussian functions. What I'd really like to do is to vectorize this somehow so that it's far more performant in Matlab.
Note that y is a column vector (output), x is a matrix where each column corresponds to a data point and each row corresponds to a dimension (i.e. 2 rows = 2D), variance is a double, gaussians is a matrix where each column is a vector corresponding to the mean point of the gaussian and weights is a row vector of the weights in front of each gaussian. Note that the length of weights is 1 bigger than gaussians as weights(1) is the 0th order weight.
function [ y ] = CalcPrediction( gaussians, variance, weights, x )
basisFunctions = size(gaussians, 2);
xvalues = size(x, 2);
if length(weights) ~= basisFunctions + 1
ME = MException('TRAIN:CALC', 'The number of weights should be equal to the number of basis functions plus one');
throw(ME);
end
y = weights(1) * ones(xvalues, 1);
for xIdx = 1:xvalues
for i = 1:basisFunctions
diff = x(:, xIdx) - gaussians(:, i);
y(xIdx) = y(xIdx) + weights(i+1) * exp(-(diff')*diff/(2*variance));
end
end
end
You can see that at the moment I simply iterate over the x vectors and then the gaussians inside 2 for loops. I'm hoping that this can be improved - I've looked at meshgrid but that seems to only apply to vectors (and I have matrices)
Thanks.
Try this
diffx = bsxfun(@minus,x,permute(gaussians,[1,3,2])); % binary operation with singleton expansion, size [DIMS,XVALUES,BASISFUNCTIONS]
diffx2 = squeeze(sum(diffx.^2,1)); % squared distances, shape is now [XVALUES,BASISFUNCTIONS]
weight_col = weights(:); % make sure weights is a column vector
y = weight_col(1) + exp(-diffx2/2/variance)*weight_col(2:end); % a column vector of length XVALUES, including the 0th order weight
Note, I changed diff to diffx since diff is a builtin. I'm not sure this will improve performance, as the cost of allocating the intermediate arrays may offset the gain from vectorization.
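If the three-dimensional intermediate array is a concern, one alternative is to expand the squared distance as ||x||^2 + ||g||^2 - 2*x'*g, which only needs an XVALUES-by-BASISFUNCTIONS matrix. The sketch below uses small made-up inputs and checks the result against the double loop from the question; it is an illustration, not a benchmarked replacement.

```matlab
% Vectorized Gaussian basis prediction via the expanded squared distance.
x         = [0 1 2; 0 -1 0.5];     % 2 dimensions, 3 data points (illustrative)
gaussians = [0 1; 0 1];            % 2 dimensions, 2 basis centers (illustrative)
variance  = 0.5;
weights   = [0.1 2 -1];            % weights(1) is the 0th order weight
w         = weights(:);

x2 = sum(x.^2, 1).';                                 % XVALUES-by-1
g2 = sum(gaussians.^2, 1);                           % 1-by-BASISFUNCTIONS
d2 = bsxfun(@plus, x2, g2) - 2 * (x.' * gaussians);  % squared distances
d2 = max(d2, 0);                                     % clip tiny negatives from rounding

y_vec = w(1) + exp(-d2 / (2 * variance)) * w(2:end);

% Reference: the double loop from the question
y_loop = w(1) * ones(size(x, 2), 1);
for xIdx = 1:size(x, 2)
    for i = 1:size(gaussians, 2)
        d = x(:, xIdx) - gaussians(:, i);
        y_loop(xIdx) = y_loop(xIdx) + w(i+1) * exp(-(d' * d) / (2 * variance));
    end
end

disp(max(abs(y_vec - y_loop)))     % should be at rounding-error level
```

The expanded form can lose some precision when points lie very close to a center, which is why the clipping step is included.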

Modified linear interpolation with missing data

Imagine a set of data with given x-values (as a column vector) and several y-values combined in a matrix (row vector of column vectors). Some of the values in the matrix are not available:
%% Create the test data
N = 1e2; % Number of x-values
x = 2*sort(rand(N, 1))-1;
Y = [x.^2, x.^3, x.^4, x.^5, x.^6]; % Example values
Y(50:80, 4) = NaN(31, 1); % Some values are not available
Now I have a column vector of new x-values for interpolation.
K = 1e2; % Number of interpolation values
x_i = rand(K, 1);
My goal is to find a fast way to interpolate all y-values for the given x_i values. If there are NaN values in the y-values, I want to use the y-value which is before the missing data. In the example case this would be the data in Y(49, :).
If I use interp1, I get NaN-values and the execution is slow for large x and x_i:
starttime = cputime;
Y_i1 = interp1(x, Y, x_i);
executiontime1 = cputime - starttime
An alternative is interp1q, which is about two times faster.
What is a very fast way which allows my modifications?
Possible ideas:
Do postprocessing of Y_i1 to eliminate NaN-values.
Use a combination of a loop and the find-command to always use the neighbour without interpolation.
Using interp1 with spline interpolation (spline) ignores NaN's.
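One more possibility, sketched here under assumptions, is to pre-process Y instead of post-processing Y_i1: replace each NaN with the last valid value above it in the same column (a forward fill), then call interp1 once. Whether this matches the intended behaviour at the end of a gap is a judgment call, since the interpolant will ramp from the held value to the first valid value after the gap. The small x and Y below are made up for the illustration.

```matlab
% Forward-fill NaNs column by column without a loop, then interpolate.
x = (0:5).';
Y = [x.^2, x.^3];
Y(3:4, 2) = NaN;                       % a gap in the second column

[N, M] = size(Y);
valid  = ~isnan(Y);

% Row number where the entry is valid, 0 where it is NaN
rowIdx = bsxfun(@times, double(valid), (1:N).');

% Running maximum down each column: index of the most recent valid row
lastValid = cummax(rowIdx, 1);
lastValid(lastValid == 0) = 1;         % leading NaNs have no predecessor and stay NaN

colIdx = repmat(1:M, N, 1);
Yf = Y(sub2ind([N, M], lastValid, colIdx));   % forward-filled copy of Y

x_i = [1.5; 2.5; 4.5];
Y_i = interp1(x, Yf, x_i);
disp(Y_i)
```

Newer MATLAB releases also provide fillmissing(Y, 'previous') for the forward-fill step, which may be simpler where it is available.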