Get Sobol' indices error from Kriging with OpenTURNS

I'm looking for a Python package which computes the Sobol' indices from a Kriging model and which provides a confidence interval on these indices that takes into account both the meta-model and the Monte Carlo errors. Is it possible to do this with OpenTURNS?
Thanks

It is indeed possible to compute Sobol' indices from a Kriging model with confidence intervals that take into account Kriging uncertainty in addition to the uncertainty of Sobol' indices estimators.
The trick is to sample trajectories of the conditional Gaussian process. Assuming you have a KrigingResult object kri_res, and that you have obtained an inputDesign of a given size from a SobolIndicesExperiment, you can build the ConditionalGaussianProcess with:
import openturns as ot
conditional_gp = ot.ConditionalGaussianProcess(kri_res, ot.Mesh(inputDesign))
And then sample output designs corresponding to N different trajectories of the conditional Gaussian process:
outputDesigns = conditional_gp.getSample(N)
Then you can get the distribution of the estimator of the (here first order) Sobol' indices for each trajectory:
distributions = []
for i in range(N):
    algo = ot.SaltelliSensitivityAlgorithm(inputDesign, outputDesigns[i], size)
    dist = algo.getFirstOrderIndicesDistribution()
    distributions.append(dist)
In order to average out Kriging uncertainty, you can build the mixture of the distributions and an associated confidence interval:
mixture = ot.Mixture(distributions)
ci = mixture.computeBilateralConfidenceInterval(0.9)
Be careful, what you get this way is a domain which contains 90% of the probability mass of the joint distribution. If you want to get confidence intervals marginal by marginal, then you need to do:
intervals = []
for j in range(mixture.getDimension()):
    marginal = mixture.getMarginal(j)
    ci = marginal.computeBilateralConfidenceInterval(0.9)
    intervals.append(ci)

Related

Approximate continuous probability distribution in Matlab

Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you also provide an example? A similar question has been asked for Julia.
Here are some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10-th, 20-th,..., 90-th quantiles of D. Find the median point falling in each of the 10 bins obtained. Call these median points a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
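A minimal Matlab sketch of this quantile idea, under the assumption (not fixed by the procedure above) that each of the r equal-probability bins receives mass 1/r:
r = 10;
D = randn(10^6,1);                                      % draws from the distribution of interest
edges = quantile(D, (0:r)/r);                           % bin boundaries at the 0%,10%,...,100% quantiles
a = zeros(r,1);
for k = 1:r
    a(k) = median(D(D >= edges(k) & D <= edges(k+1)));  % median point of each bin
end
pmass = ones(r,1)/r;                                    % each point carries probability mass 1/r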
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers for standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the mean values of each bin
supp=zeros(size(N,2),1);
for j=2:size(N,2)+1
    supp(j-1)=(edges(j)-edges(j-1))/2+edges(j-1);
end
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable
CDF_discrete=zeros(size(N,2),1);
for h=2:size(N,2)+1
    CDF_discrete(h-1)=sum(X<=edges(h))/P;
end
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)
I don't know if this is what you're after, but maybe it can help you. Note that P(X = x) = 0 for any single point x of a continuous probability distribution; the pointwise probability of X mapping to x is infinitesimally small, and thus regarded as 0.
What you could do instead, in order to approximate it by a discrete probability space, is to pick some points (x_1, x_2, ..., x_n) and let their discrete probabilities be the integral of the PDF (of your continuous probability distribution) over a surrounding range, that is
P(x_1) = P(X in (-inf, b_1]), P(x_2) = P(X in (b_1, b_2]), ..., P(x_n) = P(X in (b_{n-1}, +inf)),
where b_1 < ... < b_{n-1} are boundaries separating the chosen points.
:-)
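For instance, a minimal Matlab sketch of this idea for a standard normal, with the boundaries placed halfway between the chosen points (a choice made here for illustration; normcdf plays the role of the CDF):
x = linspace(-3, 3, 10)';        % chosen support points x_1..x_n
b = (x(1:end-1) + x(2:end))/2;   % boundaries halfway between consecutive points
F = [0; normcdf(b); 1];          % CDF at -inf, b_1, ..., b_{n-1}, +inf
pmass = diff(F);                 % P(x_k) = mass of the k-th range; sums to 1 by construction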

Drawing from correlated marginal uniform in Matlab: Alternatives to copularnd?

Consider two random variables X and Y both uniformly distributed in [0,1] and correlated with correlation rho.
I want to draw P realisations from their joint distribution in Matlab.
Is
A = copularnd('gaussian',rho,P);
the only way to do it?
Copulas provide a very convenient way of modelling the joint distribution. Just saying that you want X~U[0,1], Y~U[0,1] and corr(X,Y)=rho is not enough to define the relationship between these two random variables. Simply by substituting different copulas you can build different models (eventually choosing the one that suits your use case best) that all satisfy this condition.
Apart from understanding the basics of what copulas are, you need to understand that there are different types of correlations, such as linear (Pearson) and rank (Spearman / Kendall), and how they relate to each other. In particular, rank correlations are preserved under monotonic transformations, unlike linear correlation. This is the key property that allows us to easily translate the desired linear correlation of the uniform marginals into the linear correlation of the bivariate normal (or whatever distribution underlies the copula), which is the input correlation to copularnd.
There is a very good answer on Cross Validated, describing exactly how to convert the correlation in your case. You should also read this Matlab guide on Using Rank Correlation Coefficients and copulas.
In a nutshell, to model a desired linear correlation of the marginals, you need to translate it into a rank correlation (Spearman is convenient, since for the uniform distribution it equals the Pearson correlation), which is the same for your normals (because the transformation is monotonic). All you need then is to convert that Spearman correlation for your normals into a linear Pearson correlation, which is the input to copularnd.
Hopefully the code would make it easy to understand. Obviously you don't need those temporary variables, but it should make the logic clear:
rho_uni_Pearson = 0.7;
rho_uni_Spearman = rho_uni_Pearson;
rho_normal_Spearman = rho_uni_Spearman;
rho_normal_Pearson = 2*sin(rho_normal_Spearman*pi/6);
X = copularnd('gaussian', rho_normal_Pearson, 1e7);
And the resulting linear correlation is exactly what we wanted (since we've generated a very large sample):
corr(X(:,1), X(:,2))
ans =
0.7000
Note that for a bivariate normal copula, the relationship between linear and rank correlations is easy enough, but it could be more complex if you use a different copula. Matlab has a function copulaparam that allows you to translate from rank to linear correlation. So instead of writing out explicit formula as above, we could just use:
rho_normal_Pearson = copulaparam('Gaussian', rho_normal_Spearman, 'type', 'spearman')
Now that we have learned the basics, let's go ahead and use a t copula with 5 degrees of freedom instead of a Gaussian copula:
nu = 5; % number of degrees of freedom of t distribution
rho_t_Pearson = copulaparam('t', 0.7, nu, 'type', 'spearman');
X = copularnd('t', rho_t_Pearson, nu, 1e7);
Ensuring resulting linear correlation is what we wanted:
corr(X(:,1), X(:,2))
ans =
0.6996
It's easy to observe that the resulting distributions may be strikingly different, depending on which copula you choose, even though they all give you the same linear correlation. It is up to you, the researcher, to determine which model best suits your particular data and problem. For example:
N = 500;
rho_Pearson = copulaparam('gaussian', 0.1, 'type', 'spearman');
X1 = copularnd('gaussian', rho_Pearson, N);
figure(); scatterhist(X1(:,1),X1(:,2)); title('gaussian');
nu = 1; % number of degrees of freedom of t distribution
rho_Pearson = copulaparam('t', 0.1, nu, 'type', 'spearman');
X2 = copularnd('t', rho_Pearson, nu, N);
figure(); scatterhist(X2(:,1),X2(:,2)); title('t, nu=1');

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, the first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
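As for the moments the question also asks about, once you have the samples you can approximate them directly (a sketch, reusing xy_values from above):
mu = mean(xy_values);   % approximate mean of the PDF
C = cov(xy_values);     % approximate covariance (second central moments)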
I hope that this helps!
Chip
I don't believe Matlab has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can easily be generated by inverting the cumulative distribution function, this inverse-CDF approach does not carry over directly to multivariate distributions, so generating such numbers is much messier (the main problem being that two or more variables are correlated). So this part of your question is far beyond the scope of this site.
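For completeness, a minimal sketch of that univariate inverse-CDF approach for an arbitrary discrete pmf (assumes implicit expansion, i.e. Matlab R2016b or newer):
p = [0.2 0.5 0.3];             % arbitrary pmf over the values 1, 2, 3
cdf = cumsum(p);               % discrete CDF
u = rand(1000,1);              % uniform draws on (0,1)
samples = sum(u > cdf, 2) + 1; % index of the first CDF value reaching u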
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
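Higher moments follow the same pattern; for instance, a sketch of the variances (second central moments), using the same normalization:
%variance
var_x=trapz(xv,trapz(yv,A.*(x-mean_x).^2));
var_y=trapz(xv,trapz(yv,A.*(y-mean_y).^2));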
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
    if ~isscalar(arg) && any(size(arg)~=size(A))
        error('Size of A and arg must be the same!')
    end
    res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz must be performed in the proper order. If you interchange xv and yv in the calls, you get a wrong result (or a size-mismatch error) due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.

Linear Discriminant Analysis in Matlab

I am working on performing an LDA in Matlab and I am able to get it to successfully create a threshold for distinguishing between binary classes. However, I noticed that the threshold always crosses the origin, which gives me incorrect thresholds. Is there a way to perform an LDA in Matlab without the threshold crossing the origin?
Thanks in advance
This depends on which formulation you are using for LDA.
By threshold, I assume you're referring to the decision threshold?
In the code below the prior probabilities affect the decision threshold, so you may not be setting them correctly.
Here is some sample pseudo code:
N = number of cases
c = number of classes
Priors = matrix of prior probabilities for each case per class
Targets = target labels for each case per class
dimension of Data = Features x Cases
Get target labels for each data point:
T = Targets(:,Cases); % Target labels for each case
Calculate the mean vector per class and the common covariance matrix:
classifier.u = [mean(Data(:,T(1,:)==1),2), mean(Data(:,T(2,:)==1),2), ..., mean(Data(:,T(c,:)==1),2)]; % matrix of class mean vectors, one column per class
classifier.invCV = inv(cov(Data')); % inverse of the common covariance matrix
Get discriminant value using class mean vectors and common covariance matrix:
A1=classifier.u;
B1=classifier.invCV;
d = size(Data,1); % number of features
D = A1'*B1*Data - 0.5*(A1'*B1.*A1')*ones(d,N) + log(Priors(:,Cases));
This produces c discriminant values per case. Each case is then assigned to the class with the largest discriminant value.
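As a sketch of that assignment step (assuming D is c x N as above):
[~, assigned_class] = max(D, [], 1); % index of the winning class for each case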

determining "how good" a correlation is in matlab?

I'm working with a set of data and I've obtained certain correlations (using Pearson's correlation coefficient). I've been asked to determine the "quality of the correlation", and by that my supervisor means he wants to see what the correlations would be if I permuted all the y values of my ordered pairs and compared the obtained correlation coefficients. Does anyone know a nice way of doing this? Is there a Matlab function that would determine how good a correlation is compared to a correlation between random permutations of the data?
First, you have to check whether the correlation coefficient you got is significantly different from zero. The corr function can do this (see pval).
Second, if it's significantly different from zero, then you would like to decide whether this difference is also significant from a practical point of view. In practice, the square of the correlation coefficient (the coefficient of determination) is often considered practically meaningful if it's larger than 0.5, which means that the variation of one of the correlated parameters "explains" at least 50% of the variation of the other.
Third, there are cases when the coefficient of determination is close to one, but this is not enough to determine the "goodness of correlation". For example, if you measure the same variable using two different methods, you will usually get very similar values, so the correlation coefficient will be almost 1. In such cases you should apply the Bland-Altman analysis, which is very easy to implement in Matlab, and has its own "goodness" parameters (the bias and the so-called limits of agreement).
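A minimal sketch of such a Bland-Altman analysis, with hypothetical data for two measurement methods m1 and m2 (yline requires Matlab R2018b or newer):
m1 = 100 + 5*randn(50,1);          % method 1 measurements (hypothetical)
m2 = m1 + randn(50,1);             % method 2: same quantity plus noise
d = m1 - m2;                       % differences between the methods
bias = mean(d);                    % systematic bias
loa = bias + [-1.96 1.96]*std(d);  % 95% limits of agreement
scatter((m1+m2)/2, d); hold on     % Bland-Altman plot: mean vs difference
yline(bias); yline(loa(1),'--'); yline(loa(2),'--');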
You can permute one vector's labels N times and calculate the correlation coefficient (cc) for each iteration. Then you can compare the distribution of those values with the real correlation.
Something like this:
%# random data
n = 20;
x = (1:n)';
y = x + randn(n,1)*3;
%# real correlation
cc = corr(x,y);
%# do permutations
n_iter = 100; %# number of permutations
cc_iter = zeros(n_iter,1); %# preallocate the vector
for k = 1:n_iter
    ind = randperm(n); %# vector of random permutations
    cc_iter(k) = corr(x,y(ind));
end
%# calculate statistics
cc_mean = mean(cc_iter);
cc_std = std(cc_iter);
zval = (cc - cc_mean) ./ cc_std; %# z-score of the real cc relative to the permutation distribution
%# probability that the real cc belongs to the same distribution as cc from permuted data
pv = 2 * normcdf(-abs(zval)); %# two-sided p-value under the standard normal
%# plot
hist(cc_iter,20)
line([cc cc],ylim,'color','r') %# real value
In addition, if you compute the correlation with [cc pv] = corr(x,y), you get a p-value telling you how different your correlation is from no correlation. This p-value is calculated under the assumption that your vectors are normally distributed. However, if you calculate not the Pearson but the Spearman or Kendall correlation (non-parametric), those p-values are computed from the permutation distribution of the data:
[cc pv] = corr(x,y,'type','Spearman')