Matlab Vectorization of Multivariate Gaussian Basis Functions - matlab

I have the following code for calculating the result of a linear combination of Gaussian functions. What I'd really like to do is to vectorize this somehow so that it's far more performant in Matlab.
Note that y is a column vector (output), x is a matrix where each column corresponds to a data point and each row corresponds to a dimension (i.e. 2 rows = 2D), variance is a double, gaussians is a matrix where each column is a vector corresponding to the mean point of the gaussian and weights is a row vector of the weights in front of each gaussian. Note that the length of weights is 1 bigger than gaussians as weights(1) is the 0th order weight.
function [ y ] = CalcPrediction( gaussians, variance, weights, x )
basisFunctions = size(gaussians, 2);
xvalues = size(x, 2);
if length(weights) ~= basisFunctions + 1
ME = MException('TRAIN:CALC', 'The number of weights should be equal to the number of basis functions plus one');
throw(ME);
end
y = weights(1) * ones(xvalues, 1);
for xIdx = 1:xvalues
for i = 1:basisFunctions
diff = x(:, xIdx) - gaussians(:, i);
y(xIdx) = y(xIdx) + weights(i+1) * exp(-(diff')*diff/(2*variance));
end
end
end
You can see that at the moment I simply iterate over the x vectors and then the gaussians inside 2 for loops. I'm hoping that this can be improved - I've looked at meshgrid but that seems to only apply to vectors (and I have matrices)
Thanks.

Try this
diffx = bsxfun(#minus,x,permute(gaussians,[1,3,2])); % binary operation with singleton expansion
diffx2 = squeeze(sum(diffx.^2,1)); % dot product, shape is now [XVALUES,BASISFUNCTIONS]
weight_col = weights(:); % make sure weights is a column vector
y = exp(-diffx2/2/variance)*weight_col(2:end); % a column vector of length XVALUES
Note, I changed diff to diffx since diff is a builtin. I'm not sure this will improve performance as allocating arrays will offset increase by vectorization.

Related

Cross validation and ROC curve using Matlab: how plot mean ROC curve?

I am using k-fold cross validation with k = 10. Thus, I have 10 ROC curves.
I would like to average between the curves. I can't just average the values ​​on the Y axes (using perfcurve) because the vectors returned are not the same size.
[X1,Y1,T1,AUC1] = perfcurve(t_test(1),resp(1),1);
.
.
.
[X10,Y10,T10,AUC10] = perfcurve(t_test(10),resp(10),1);
How to solve this? How can I plot the average curve of the 10 ROC curves?
So, you have k curves with different number of points, all bound in [0..1] interval in both dimensions. First, you need to calculate interpolated values for each curve at specified query points. Now you have new curves with fixed number of points and can compute their mean. The interp1 function will do the interpolation part.
%% generating sample data
k = 10;
X = cell(k, 1);
Y = cell(k, 1);
hold on;
for i=1:k
n = 10+randi(10);
X{i} = sort([0 1 rand(1, n)]);
Y{i} = sort([0 1 rand(1, n)].^.5);
end
%% Calculating interpolations
% location of query points
X2 = linspace(0, 1, 50);
n = numel(X2);
% initializing values for different curves at different query points
Y2 = zeros(k, n);
for i=1:k
% finding interpolated values for i-th curve
Y2(i, :) = interp1(X{i}, Y{i}, X2);
end
% finding the mean
meanY = mean(Y2, 1);
Notice that different interpolation methods can affect your results. For example, the ROC plot data are kind of stairs data. To find the exact values on such curves, you should use the Previous Neighbor Interpolation method, instead of the Linear Interpolation which is the default method of interp1:
Y2(i, :) = interp1(X{i}, Y{i}, X2); % linear
Y3(i, :) = interp1(X{i}, Y{i}, X2, 'previous');
This is how it affects the final results:
I solved it using Matlab's perfcurve. For that, I had to pass as a parameter a list of vectors (size vectors 1xn) for "label" and "scores". Thus, the perfcurve function already understands as a set of resolutions made using k-fold and returns the average ROC curve and its confidence interval, in addition to the AUC and its confidence interval.
[X1,Y1,T1,AUC1] = perfcurve(t_test_list,resp_list,1);
t_test and resp they are lists of size 1xk (k is the number of folds / k-fold) and each element of the lists is a 1xn vector with scores and labels.
resp = nnet(x_test(i));
t_test_act = t_test(i);
resp has 2xn format (n is the number of predicted samples). There are two classes.
t_test_act contains the labels of the current set of tests, it has formed 2xn and is composed of 0 and 1 (each column has a 1 and a 0, indicating the true class of the sample).
resp_list{i} = resp(1,:) %(scores)
t_test_list{i} = t_test_act(1,:) %(labels)
[X1,Y1,T1,AUC1] = perfcurve(t_test_list,resp_list,1);

Error using * Inner matrix dimensions must agree in using Least Squares - how to make the regressor array for multiple independent variables

I am trying to learn how to code for linear regression where the data statistics_data represents the yeast growth year in first column, the value of a chemical component in the second column and the value of the population in third column. Once theta is calculated using least squares formulation, I want to predict the value of the population using: pred_year = 2020;
pred_year_val = [1 2020]; which is giving this error:
Error using *
Inner matrix dimensions must agree.
Error in main_normal_equation (line 44)
pred_value = pred_year_val * theta;
Below is the code:
statistics_data = [2007, 9182927, 2;
2008,3,9256347;
2009,3.5,9340682;
2010,4,9415570;
2011,5,9482855;
2012,4.8,9555893;
2013,4.9,9644864;
2014,5,9747355;
2015,5,9851017;
2016,5,9995153;
2017,5,10120242;];
% Convert to independent variable matrix and response
X = (statistics_data(:,1:2));
y = (statistics_data(:,3));
% Convert matrix values to double
X = double(X);
y = double(y);
hold on;
% Set the x-axis label
xlabel('Year');
% Set the y-axis label
ylabel('Population');
% Plot population data
plot(X, y, 'rx', 'MarkerSize', 10);
m = length(y);
% Add ones column
X = [ones(m, 1) X];
% Normal Equation
theta = (pinv(X'*X))*X'*y
% Predict population for 2020
pred_year = 2020;
pred_year_val = [1 2020];
% Calculate predicted value
pred_value = pred_year_val * theta;
% Plot linear regression line
plot(X(:,2), X*theta, '-')
fprintf('Predicted population in 2020 is %d people\n ', int64(pred_value));
In matlab when you use the * operator, you are referencing a matrix multiply. Matrix multiplication has strict rules about the dimensions of the multiplied matrices.
Inspecting your code, it does not seem that your intent is to do a matrix multiply....
You can multiply a scalar by a matrix using * and scale each value in the matrix accordingly.
You can also vector multiply which is sometimes called element by element multiplication using the .* operator.
To resolve your issue you must clarify whether you intended to do a matrix multiply, scalar multiplication, or a vector multiplication. Then you must properly set your operands and operator to reflect what it is you aim to achieve.
It isn't clear to me exactly how the math in your code is supposed to be executed otherwise I could help show you where your operators and operands must be changed.
You could start by reviewing the documentation here: https://www.mathworks.com/help/matlab/matlab_prog/array-vs-matrix-operations.html
So pred_year_val has size [1 2] while theta has size [3 1]. Using the pigeon hole principle we can determine that the number of columns of pred_year_val is not equal to the number of rows of theta and therefore we cannot perform a matrix multiplication, i.e. the execution of
pred_value = pred_year_val * theta;
is bound to fail. So it seems like you need to add a value for the chemical component to pred_year_val.

Numerical derivative of a vector

I have a problem with numerical derivative of a vector that is x: Nx1 with respect to another vector t (time) that is the same size of x.
I do the following (x is chosen to be sine function as an example):
t=t0:ts:tf;
x=sin(t);
xd=diff(x)/ts;
but the answer xd is (N-1)x1 and I figured out that it does not compute derivative corresponding to the first element of x.
is there any other way to compute this derivative?
You are looking for the numerical gradient I assume.
t0 = 0;
ts = pi/10;
tf = 2*pi;
t = t0:ts:tf;
x = sin(t);
dx = gradient(x)/ts
The purpose of this function is a different one (vector fields), but it offers what diff doesn't: input and output vector of equal length.
gradient calculates the central difference between data points. For an
array, matrix, or vector with N values in each row, the ith value is
defined by
The gradient at the end points, where i=1 and i=N, is calculated with
a single-sided difference between the endpoint value and the next
adjacent value within the row. If two or more outputs are specified,
gradient also calculates central differences along other dimensions.
Unlike the diff function, gradient returns an array with the same
number of elements as the input.
I know I'm a little late to the game here, but you can also get an approximation of the numerical derivative by taking the derivatives of the polynomial (cubic) splines that runs through your data:
function dy = splineDerivative(x,y)
% the spline has continuous first and second derivatives
pp = spline(x,y); % could also use pp = pchip(x,y);
[breaks,coefs,K,r,d] = unmkpp(pp);
% pre-allocate the coefficient vector
dCoeff = zeroes(K,r-1);
% Columns are ordered from highest to lowest power. Both spline and pchip
% return 4xn matrices, ordered from 3rd to zeroth power. (Thanks to the
% anonymous person who suggested this edit).
dCoeff(:, 1) = 3 * coefs(:, 1); % d(ax^3)/dx = 3ax^2;
dCoeff(:, 2) = 2 * coefs(:, 2); % d(ax^2)/dx = 2ax;
dCoeff(:, 3) = 1 * coefs(:, 3); % d(ax^1)/dx = a;
dpp = mkpp(breaks,dCoeff,d);
dy = ppval(dpp,x);
The spline polynomial is always guaranteed to have continuous first and second derivatives at each point. I haven not tested and compared this against using pchip instead of spline, but that might be another option as it too has continuous first derivatives (but not second derivatives) at every point.
The advantage of this is that there is no requirement that the step size be even.
There are some options to work-around your issue.
First: you can make your domain larger. Instead of N, use N+1 gridpoints.
Second: depending on the end-point of interest, you can use
Forward difference: F(x + dx) - F(x)
Backward difference: F(x) - F(x - dx)

DCT rows instead of columns

I was aquainted in using the fft in matlab with the code
fft(signal,[],n)
where n tells the dimension on which to apply the fft as from Matlab documentation:
http://www.mathworks.it/it/help/matlab/ref/fft.html
I would like to do the same with dct.
Is this possible? I could not find any useful information around.
Thanks for the help.
Luigi
dct does not have the option to pick a dimension like fft. You will have to either transpose your input to operate on rows or pick one vector from your signal and operate on that.
yep!
try it:
matrix = dctmtx(n);
signal_dct = matrix * signal;
Edit
Discrete cosine transform.
Y = dct(X) returns the discrete cosine transform of X.
The vector Y is the same size as X and contains the discrete cosine transform coefficients.
Y = dct(X,N) pads or truncates the vector X to length N before transforming.
If X is a matrix, the dct operation is applied to each column. This transform can be inverted using IDCT.
% Example:
% Find how many dct coefficients represent 99% of the energy
% in a sequence.
x = (1:100) + 50*cos((1:100)*2*pi/40); % Input Signal
X = dct(x); % Discrete cosine transform
[XX,ind] = sort(abs(X)); ind = fliplr(ind);
num_coeff = 1;
while (norm([X(ind(1:num_coeff)) zeros(1,100-num_coeff)])/norm(X)<.99)
num_coeff = num_coeff + 1;
end;
num_coeff

Modified linear interpolation with missing data

Imagine a set of data with given x-values (as a column vector) and several y-values combined in a matrix (row vector of column vectors). Some of the values in the matrix are not available:
%% Create the test data
N = 1e2; % Number of x-values
x = 2*sort(rand(N, 1))-1;
Y = [x.^2, x.^3, x.^4, x.^5, x.^6]; % Example values
Y(50:80, 4) = NaN(31, 1); % Some values are not avaiable
Now i have a column vector of new x-values for interpolation.
K = 1e2; % Number of interplolation values
x_i = rand(K, 1);
My goal is to find a fast way to interpolate all y-values for the given x_i values. If there are NaN values in the y-values, I want to use the y-value which is before the missing data. In the example case this would be the data in Y(49, :).
If I use interp1, I get NaN-values and the execution is slow for large x and x_i:
starttime = cputime;
Y_i1 = interp1(x, Y, x_i);
executiontime1 = cputime - starttime
An alternative is interp1q, which is about two times faster.
What is a very fast way which allows my modifications?
Possible ideas:
Do postprocessing of Y_i1 to eliminate NaN-values.
Use a combination of a loop and the find-command to always use the neighbour without interpolation.
Using interp1 with spline interpolation (spline) ignores NaN's.