Fast scaling of Gaussian Kernel by the Covariance of the Inputs - matlab

I am currently fiddling with multivariate kernel density estimations for estimating the probability density functions (PDF) of hydrological data sets using Matlab. I am most familiar with kernel density estimation using Gaussian kernels as outlined in Sharma (2000 and 2014) (where the kernel bandwidths are set using the Gaussian Reference Rule (GRR)). The GRR is written as follows (Sharma, 2000):
where lambda_ref = GRR bandwidth of kernel, n is the sample size, and d is the dimension of the data set we are using for density estimation. To estimate the multivariate density of our data set X we use the following formula (Sharma, 2000):
where lamda is the same as lamda_ref above, S is the sample covariance of X and det() stands for determinant.
My question is: I understand that there are many "fast" methods for calculating the Gaussian kernel function represented by the term exp() such as the method proposed here (using Matlab): http://mrmartin.net/?p=218. Since I will be working with data sets that are quite large in sample size (1000-10,000) I am looking for a fast code. Is anyone aware how I can write a fast code for the second equation that takes into account the inverse of the sample covariance matrix (S^-1)?
I greatly appreciate any help that can be provided on this issue. Thank you!
Note(s):
I understand that there is a Matlab code for calculating the second equation, found as a sub-function in: http://www.mathworks.com/matlabcentral/fileexchange/29039-mutual-information-2-variablle/content/MutualInfo.m. However this code has a bottleneck in how it calculates the kernel matrix.
References:
1 A. Sharma, Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 3 — A nonparametric probabilistic forecast model, Journal of Hydrology, Volume 239, Issues 1–4, 20 December 2000, Pages 249-258, ISSN 0022-1694, http://dx.doi.org/10.1016/S0022-1694(00)00348-6.
2 Sharma, A., and R. Mehrotra (2014), An information theoretic alternative to model a natural system using observational information alone, Water Resour. Res., 50, 650–660, doi:10.1002/2013WR013845.

I have found a code that I am able to modify for my purposes. The original code is listed at the following link: http://www.kernel-methods.net/matlab/kernels/rbf.m.
Code
function K = rbf(coord,sig)
%function K = rbf(coord,sig)
%
% Computes an rbf kernel matrix from the input coordinates
%
%INPUTS
% coord = a matrix containing all samples as rows
% sig = sigma, the kernel width; squared distances are divided by
% squared sig in the exponent
%
%OUTPUTS
% K = the rbf kernel matrix ( = exp(-1/(2*sigma^2)*(coord*coord')^2) )
%
%
% For more info, see www.kernel-methods.net
%
%Author: Tijl De Bie, february 2003. Adapted: october 2004 (for speedup).
n=size(coord,1);
K=coord*coord'/sig^2;
d=diag(K);
K=K-ones(n,1)*d'/2;
K=K-d*ones(1,n)/2;
K=exp(K);
Modified Code incorporating sample covariance scaling:
xcov = cov(x.'); % sample covariance of the data
invxc = pinv(xcov); % inversion of data sample covariance
coord = x.';
sig = sigma; % kernel bandwidth
n = size(coord,1);
K = coord*invxc*coord'/sig^2;
d = diag(K);
K = K-ones(n,1)*d'/2;
K = K-d*ones(1,n)/2;
K = exp(K); % kernel matrix
I hope this helps someone else looking into the same problem.

Related

This line of code is supposed to generate exponential service times, but I am not able to get the logic behind it

This line of code is supposed to generate exponential service times, but I am not able to get the logic behind it.
% Exponential service time with rate 1
mean = 1;
dt = -mean * log(1 - rand());
This is the source link, but MATLAB is needed to open the example.
I was also thinking if exprnd(1) will give the same result of generating numbers from the exponential distribution that has a mean of 1?
You are right!
First, note that MATLAB parameterizes the Exponential distribution by the mean, not the rate, so exprnd(5) would have a rate lambda = 1/5.
This line of code is another way to do the same thing:
-mean * log(1 - rand());
This is the inverse transform for the Exponential distribution.
If X follows an Exponential distribution, then
and rewriting the cumulative distribution function (CDF) and letting U ~ Uniform(0,1), we can derive the inverse transform.
Note the last equality is because 1-U and U are equal in distribution. In other words, 1-U ~ Uniform(0,1) and U ~ Uniform(0,1).
You can test this yourself with this example code with multiple approaches.
% MATLAB R2018b
rate = 1; % mean = 1 % mean = 1/rate
NumSamples = 1000;
% Approach 1
X1 = (-1/rate)*log(1-rand(NumSamples,1)); % inverse transform
% Approach 2
X2 = exprnd(1/rate,NumSamples,1);
% Approach 3
pd = makedist('Exponential',1/rate) % create probability distribution object
X3 = random(pd,NumSamples,1);
EDIT: The OP asked is there was a reason to generate from the CDF rather than from the probability density function (PDF). This is my attempt to answer that.
The inverse transform method uses the CDF to take advantage of the fact that the CDF is itself a probability and so must be on the interval [0, 1]. Then it is very easy to generate very good (pseudo) random numbers which will be on that interval. The CDF is sufficient to uniquely define the distribution, and inverting the CDF means that its unique "shape" will properly map the uniformly distributed numbers on [0, 1] to a non-uniform shape in the domain which will follow the probability density function (PDF).
You can see the CDF performing this nonlinear mapping in this figure.
One use of the PDF would be Acceptance-Rejection methods, which can be useful for some distributions including custom PDFs (thanks to #pjs for jogging my memory).

MATLAB - Meaning of guassian distribution data. (in Neural Network)

I'm a newbie to MATLAB and now I'm trying to create a 2-d gaussian distribute data to train my neural network. I just found the code on the official document.
mu = [0 0];
Sigma = [.25 .3; .3 1];
x1 = -3:.2:3; x2 = -3:.2:3;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
I know "mu" is average of the data. Sigma is something related to
Standard deviation. But I just don't get what is the idea of mesgrid and the interval(x1,x2). And the Geometric meaning of these code.
Also, can someone explain me why is guassian distribution so important in machine learning and data science? Cause all the course keep saying and saying this term.
Meshgrid is a basic matlab function, that is in no way specifically related to neural networks or a gaussian distribution. Check the documentation of Matlab to find out more about it.
The gaussian distribution (also known as normal distribution) is important for datascience because it comes with several nice statistical properties. Unfortunately it is hard to describe them all in a compact way, and this would also not be a question about programming, but more about statistics.
I think the code you provide seems confusing to you because you expect it to generate samples whereas it merely returns values of the Gaussian PDF (probability density function) for some given pairs of (x1,x2).
For example F = mvnpdf(a,b,mu, Sigma) returns the probability of x1=a and x2=b given that they follow a multivariate Gaussian distribution with mean mu and covariance matrix Sigma.
Being in Stack Overflow, I am focusing on the Matlab aspect of your question: for generating 100 samples of a 2-D Gaussian you can use something like the following (taken from the Matlab help of randn function):
mu = [1 2];
Sigma = [1 .5; .5 2];
R = chol(Sigma);
z = repmat(mu,100,1) + randn(100,2)*R;
The array z = [x1,x2] contains the x1 and x2 vectors that you are looking for.
Some statistics textbook or wikipedia could convince you on why the above code indeed generates such samples. The last line of code is related to one of the nice properties of a Gaussian distribution (or any other elliptical distribution).

Kriging / Gaussian Process Conditional Simulations in Matlab

I would like to perform conditional simulations for Gaussian process (GP) models in Matlab. I have found a tutorial by Martin Kolář (http://mrmartin.net/?p=223).
sigma_f = 1.1251; %parameter of the squared exponential kernel
l = 0.90441; %parameter of the squared exponential kernel
kernel_function = #(x,x2) sigma_f^2*exp((x-x2)^2/(-2*l^2));
%This is one of many popular kernel functions, the squared exponential
%kernel. It favors smooth functions. (Here, it is defined here as an anonymous
%function handle)
% we can also define an error function, which models the observation noise
sigma_n = 0.1; %known noise on observed data
error_function = #(x,x2) sigma_n^2*(x==x2);
%this is just iid gaussian noise with mean 0 and variance sigma_n^2s
%kernel functions can be added together. Here, we add the error kernel to
%the squared exponential kernel)
k = #(x,x2) kernel_function(x,x2)+error_function(x,x2);
X_o = [-1.5 -1 -0.75 -0.4 -0.3 0]';
Y_o = [-1.6 -1.3 -0.5 0 0.3 0.6]';
prediction_x=-2:0.01:1;
K = zeros(length(X_o));
for i=1:length(X_o)
for j=1:length(X_o)
K(i,j)=k(X_o(i),X_o(j));
end
end
%% Demo #5.2 Sample from the Gaussian Process posterior
clearvars -except k prediction_x K X_o Y_o
%We can also sample from this posterior, the same way as we sampled before:
K_ss=zeros(length(prediction_x),length(prediction_x));
for i=1:length(prediction_x)
for j=i:length(prediction_x)%We only calculate the top half of the matrix. This an unnecessary speedup trick
K_ss(i,j)=k(prediction_x(i),prediction_x(j));
end
end
K_ss=K_ss+triu(K_ss,1)'; % We can use the upper half of the matrix and copy it to the
K_s=zeros(length(prediction_x),length(X_o));
for i=1:length(prediction_x)
for j=1:length(X_o)
K_s(i,j)=k(prediction_x(i),X_o(j));
end
end
[V,D]=eig(K_ss-K_s/K*K_s');
A=real(V*(D.^(1/2)));
for i=1:7
standard_random_vector = randn(length(A),1);
gaussian_process_sample(:,i) = A * standard_random_vector+K_s/K*Y_o;
end
hold on
plot(prediction_x,real(gaussian_process_sample))
set(plot(X_o,Y_o,'r.'),'MarkerSize',20)
The tutorial generates the conditional simulations using a direct simulation method based on covariance matrix decomposition. It is my understanding that there are several methods of generating conditional simulations that may be better when the number of simulation points is large such as conditioning by Kriging using a local neighborhood. I have found information regarding several methods in J.-P. Chilès and P. Delfiner, “Chapter 7 - Conditional Simulations,” in Geostatistics: Modeling Spatial Uncertainty, Second Edition, John Wiley & Sons, Inc., 2012, pp. 478–628.
Is there an existing Matlab toolbox that can be used for conditional simulations? I am aware of DACE, GPML, and mGstat (http://mgstat.sourceforge.net/). I believe only mGstat offers the capability to perform conditional simulations. However, mGstat also seems to be limited to only 3D models and I am interested in higher dimensional models.
Can anybody offer any advice on getting started performing conditional simulations with an existing toolbox such as GPML?
===================================================================
EDIT
I have found a few more Matlab toolboxes: STK, ScalaGauss, ooDACE
It appears STK is capable of conditional simulations using covariance matrix decomposition. However, is limited to a moderate number (maybe a few thousand?) of simulation points due to the Cholesky factorization.
I used the STK toolbox and I recommend it for others:
http://kriging.sourceforge.net/htmldoc/
I found that if you need conditional simulations at a large number of points then you might consider generating a conditional simulation at the points in a large design of experiment (DoE) and then simply relying on the mean prediction conditional on that DoE.

Discrete surface integral with cumsum

I have a matrix z(x,y)
This is an NxN abitary pdf constructed from a unique Kernel density estimation (i.e. not a usual pdf and it doesn't have a function). It is multivariate and can't be separated and is discrete data.
I wan't to construct a NxN matrix (F(x,y)) that is the cumulative distribution function in 2 dimensions of this pdf so that I can then randomly sample the F(x,y) = P(x < X ,y < Y);
Analytically I think the CDF of a multivariate function is the surface integral of the pdf.
What I have tried is using the cumsum function in order to calculate the surface integral and tested this with a multivariate normal against the analytical solution and there seems to be some discrepancy between the two:
% multivariate parameters
delta = 100;
mu = [1 1];
Sigma = [0.25 .3; .3 1];
x1 = linspace(-2,4,delta); x2 = linspace(-2,4,delta);
[X1,X2] = meshgrid(x1,x2);
% Calculate Normal multivariate pdf
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
% My attempt at a numerical surface integral
FN = cumsum(cumsum(F,1),2);
% Normalise the CDF
FN = FN./max(max(FN));
X = [X1(:) X2(:)];
% Analytic solution to a multivariate normal pdf
p = mvncdf(X,mu,Sigma);
p = reshape(p,delta,delta);
% Highlight the difference
dif = p - FN;
error = max(max(sqrt(dif.^2)));
% %% Plot
figure(1)
surf(x1,x2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
xlabel('x1'); ylabel('x2'); zlabel('Probability Density');
figure(2)
surf(X1,X2,FN);
xlabel('x1'); ylabel('x2');
figure(3);
surf(X1,X2,p);
xlabel('x1'); ylabel('x2');
figure(5)
surf(X1,X2,dif)
xlabel('x1'); ylabel('x2');
Particularly the error seems to be in the transition region which is the most important.
Does anyone have any better solution to this problem or see what I'm doing wrong??
Any help would be much appreciated!
EDIT: This is the desired outcome of the cumulative integration, The reason this function is of value to me is that when you randomly generate samples from this function on the closed interval [0,1] the higher weighted (i.e. the more likely) values appear more often in this way the samples converge on the expected value(s) (in the case of multiple peaks) this is desired outcome for algorithms such as particle filters, neural networks etc.
Think of the 1-dimensional case first. You have a function represented by a vector F and want to numerically integrate. cumsum(F) will do that, but it uses a poor form of numerical integration. Namely, it treats F as a step function. You could instead do a more accurate numerical integration using the Trapezoidal rule or Simpson's rule.
The 2-dimensional case is no different. Your use of cumsum(cumsum(F,1),2) is again treating F as a step function, and the numerical errors resulting from that assumption only get worse as the number of dimensions of integration increases. There exist 2-dimensional analogues of the Trapezoidal rule and Simpson's rule. Since there's a bit too much math to repeat here, take a look here:
http://onestopgate.com/gate-study-material/mathematics/numerical-analysis/numerical-integration/2d-trapezoidal.asp.
You DO NOT need to compute the 2-dimensional integral of the probability density function in order to sample from the distribution. If you are computing the 2-d integral, you are going about the problem incorrectly.
Here are two ways to approach the sampling problem.
(1) You write that you have a kernel density estimate. A kernel density estimate is a special case of a mixture density. Any mixture density can be sampled by first selecting one kernel (perhaps differently or equally weighted, same procedure applies), and then sampling from that kernel. (That applies in any number of dimensions.) Typically the kernels are some relatively simple distribution such as a Gaussian distribution so that it is easy to sample from it.
(2) Any joint density P(X, Y) is equal to P(X | Y) P(Y) (and equivalently P(Y | X) P(X)). Therefore you can sample from P(Y) (or P(X)) and then from P(X | Y). In order to sample from P(X | Y), you will need to integrate P(X, Y) along a line Y = y (where y is the sampled value of Y), but (this is crucial) you only need to integrate along that line; you don't need to integrate over all values of X and Y.
If you tell us more about your problem, I can help with the details.

Estimating the variance of eigenvalues of sample covariance matrices in Matlab

I am trying to investigate the statistical variance of the eigenvalues of sample covariance matrices using Matlab. To clarify, each sample covariance matrix is constructed from a finite number of vector snapshots (afflicted with random white Gaussian noise). Then, over a large number of trials, a large number of such matrices are generated and eigendecomposed in order to estimate the theoretical statistics of the eigenvalues.
According to several sources (see, for example, [1, Eq.3] and [2, Eq.11]), the variance of each sample eigenvalue should be equal to that theoretical eigenvalue squared, divided by the number of vector snapshots used for each covariance matrix. However, the results I get from Matlab aren't even close.
Is this an issue with my code? With Matlab? (I've never had such trouble working on similar problems).
Here's a very simple example:
% Data vector length
Lvec = 5;
% Number of snapshots per sample covariance matrix
N = 200;
% Number of simulation trials
Ntrials = 10000;
% Noise variance
sigma2 = 10;
% Theoretical covariance matrix
Rnn_th = sigma2*eye(Lvec);
% Theoretical eigenvalues (should all be sigma2)
lambda_th = sort(eig(Rnn_th),'descend');
lambda = zeros(Lvec,Ntrials);
for trial = 1:Ntrials
% Generate new (complex) white Gaussian noise data
n = sqrt(sigma2/2)*(randn(Lvec,N) + 1j*randn(Lvec,N));
% Sample covariance matrix
Rnn = n*n'/N;
% Save sample eigenvalues
lambda(:,trial) = sort(eig(Rnn),'descend');
end
% Estimated eigenvalue covariance matrix
b = lambda - lambda_th(:,ones(1,Ntrials));
Rbb = b*b'/Ntrials
% Predicted (approximate) theoretical result
Rbb_th_approx = diag(lambda_th.^2/N)
References:
[1] Friedlander, B.; Weiss, A.J.; , "On the second-order statistics of the eigenvectors of sample covariance matrices," Signal Processing, IEEE Transactions on , vol.46, no.11, pp.3136-3139, Nov 1998
[2] Kaveh, M.; Barabell, A.; , "The statistical performance of the MUSIC and the minimum-norm algorithms in resolving plane waves in noise," Acoustics, Speech and Signal Processing, IEEE Transactions on , vol.34, no.2, pp. 331- 341, Apr 1986
According to the abstract of from your first reference:
"Formulas for the second-order statistics of the eigenvectors have been derived in the statistical literature and are widely used. We point out a discrepancy between the statistics observed in numerical simulations and the theoretical formulas, due to the nonuniqueness of the definition of eigenvectors. We present two ways to resolve this discrepancy. The first involves modifying the theoretical formulas to match the computational results. The second involved a simple modification of the computations to make them match existing formulas."
Sounds like there is a discrepancy, and it also sounds like the two 'solutions' are hacks, but without access to the actual paper, it's kind of hard to help.