Central Limit Theorem in matlab - matlab

I am trying to demonstrate the CLT in MATLAB by comparing the histogram of the sum of three random variables with a normal distribution.
Here is my code:
clc;clear;
len = 50000;
%y0 : Exponential Distribution
lambda = 3;
y0=-log(rand(1,len))./lambda;
%y1 : Rayleigh Distribution
mu = 0;
sig = 2;
var1 = mu + sig*randn(1,len);
var2 = mu + sig*randn(1,len);
t1 = var1 .^ 2;
t2 = var2 .^ 2;
y1 = sqrt(t1+t2);
% %y2: Normal Distribution
y2 = randn(1,len);
%y3 : What the result is expected to be:
mean0 = (sum(y0)+ sum(y1)+ sum(y2)) / (len * 3);%how do I calculate this?
var0 = 1;%how do I calculate this?
y3 = mean0 + var0*randn(1,len);
delta = 0.1;
x3 = min(y3):delta:max(y3);
figure('Name','Normal Distribution');
hist(y3,x3);
%Central Limit Theorem:
%what result is:
res = y0+y1+y2;
xn = min(res):delta:max(res);
figure('Name','Final Result');
hist(res,xn);
I have two main problems.
How can I calculate the mean and variance for y3 (what the result should be)?
Is my code correct?

Since y0, y1 and y2 are row vectors, you have to do:
mean0 = mean([y0 y1 y2]);
variance0 = var([y0 y1 y2]);
When you create [y0 y1 y2], you concatenate all your previous samples into one long vector (as if they were samples from one single distribution).
Now just pass that vector to the functions you want (mean and var), as shown above.
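Note that mean([y0 y1 y2]) describes the pooled samples, not the sum y0+y1+y2. If the goal is a Gaussian to compare against res = y0+y1+y2, one option is to add the individual means and variances. The following is only a sketch, assuming the three variables are independent; the names meanSum and varSum are illustrative and not from the original code:

```matlab
len = 50000; lambda = 3; sig = 2;
y0 = -log(rand(1,len))./lambda;                              % exponential
y1 = sqrt((sig*randn(1,len)).^2 + (sig*randn(1,len)).^2);    % Rayleigh
y2 = randn(1,len);                                           % standard normal
res = y0 + y1 + y2;

% Mean of a sum = sum of means (always).
% Variance of a sum = sum of variances (assumes independence).
meanSum = mean(y0) + mean(y1) + mean(y2);
varSum  = var(y0)  + var(y1)  + var(y2);

% Gaussian reference samples with matching mean and standard deviation
y3 = meanSum + sqrt(varSum)*randn(1,len);
fprintf('sum: mean %.4f, var %.4f | reference: mean %.4f, var %.4f\n', ...
        mean(res), var(res), meanSum, varSum);
```

Note the sqrt(varSum): randn must be scaled by a standard deviation, not by a variance.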
About the statistical part: I think you are getting some things wrong.
The classical Central Limit Theorem applies to the sum of independent variables that all follow the same distribution. It can indeed be any distribution D (with finite variance), but all variables must have that same distribution D. You are trying to sum variables from different distributions.
The theorem says:
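In its classical (Lindeberg-Lévy) form, which matches the scaling used in the code further down, the theorem can be stated as follows. The symbols $\mu$ and $\sigma^2$ (mean and finite variance of each i.i.d. $X_i$) are notation assumed for this sketch:

```latex
\sqrt{N}\left(\frac{1}{N}\sum_{i=1}^{N} X_i - \mu\right)
\xrightarrow{\;d\;} \mathcal{N}\left(0, \sigma^2\right)
\quad \text{as } N \to \infty
```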
I've coded an example for variables distributed according to an exponential distribution.
Run it and you will observe that, as you increase N, the resulting distribution tends to the expected normal distribution. For N=1 you have your exponential distribution (very different from a normal distribution), but for N=100 you already have a distribution that is very close to the expected normal distribution (you can see that the mean and variance are basically the same now).
[Figures: CLT for Exponentials with N=1, N=3, N=10 and N=100; the expected normal distribution (convergence distribution of the CLT)]
clc;clear;
len = 50000;
lambda = 3;
%yA : Exponential Distribution A
yA=-log(rand(1,len))./lambda;
%yB : Exponential Distribution B
yB=-log(rand(1,len))./lambda;
%yC : Exponential Distribution C
yC=-log(rand(1,len))./lambda;
%yD : Exponential Distribution D
yD=-log(rand(1,len))./lambda;
%yE : Exponential Distribution E
yE=-log(rand(1,len))./lambda;
%yF : Exponential Distribution F
yF=-log(rand(1,len))./lambda;
%yG : Exponential Distribution G
yG=-log(rand(1,len))./lambda;
%yH : Exponential Distribution H
yH=-log(rand(1,len))./lambda;
%yI : Exponential Distribution I
yI=-log(rand(1,len))./lambda;
%yJ : Exponential Distribution J
yJ=-log(rand(1,len))./lambda;
%y1 : What you expect the result to be (centred Gaussian with the same variance as the exponential):
mean0 = 0;
var0 = var(yA);
y1 = mean0 + sqrt(var0)*randn(1,len);
delta = 0.01;
x1 = min(y1):delta:max(y1);
figure('Name','Normal Distribution (Expected)');
hist(y1,x1);
%Central Limit Theorem:
%what result is:
res1 = (((yA)/1) - mean(yA))*sqrt(1);
res2 = (((yA+yB)/2) - mean(yA))*sqrt(2);
res3 = (((yA+yB+yC)/3) - mean(yA))*sqrt(3);
res4 = (((yA+yB+yC+yD)/4) - mean(yA))*sqrt(4);
res5 = (((yA+yB+yC+yD+yE)/5) - mean(yA))*sqrt(5);
res10 = (((yA+yB+yC+yD+yE+yF+yG+yH+yI+yJ)/10) - mean(yA))*sqrt(10);
delta = 0.01;
xn = min(res1):delta:max(res1);
figure('Name','Final Result for N=1');
hist(res1,xn);
xn = min(res2):delta:max(res2);
figure('Name','Final Result for N=2');
hist(res2,xn);
xn = min(res3):delta:max(res3);
figure('Name','Final Result for N=3');
hist(res3,xn);
xn = min(res4):delta:max(res4);
figure('Name','Final Result for N=4');
hist(res4,xn);
xn = min(res5):delta:max(res5);
figure('Name','Final Result for N=5');
hist(res5,xn);
xn = min(res10):delta:max(res10);
figure('Name','Final Result for N=10');
hist(res10,xn);
%for N = 100
y100=-log(rand(100,len))./lambda;
res100 = ((sum(y100)/100) - mean(yA))*sqrt(100);
xn = min(res100):delta:max(res100);
figure('Name','Final Result for N=100');
hist(res100,xn);
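The repeated blocks above can also be written as a loop. The following is only a sketch under a few assumptions: it uses the theoretical mean 1/lambda instead of the sample mean mean(yA), and the names Ns, z, v and skew are illustrative. It prints the variance and skewness of the scaled mean for each N; the variance should stay near 1/lambda^2 while the skewness shrinks toward 0, the value for a normal distribution.

```matlab
len = 50000;
lambda = 3;
Ns = [1 3 10 100];            % numbers of summed variables to try
v = zeros(size(Ns));
skew = zeros(size(Ns));
for k = 1:numel(Ns)
    N = Ns(k);
    y = -log(rand(N, len)) ./ lambda;           % N x len exponential samples
    % sum along dimension 1 explicitly so that N = 1 also works
    z = (sum(y, 1) / N - 1/lambda) * sqrt(N);   % centred, scaled sample mean
    v(k) = var(z);
    skew(k) = mean((z - mean(z)).^3) / std(z)^3;
    fprintf('N = %3d: var = %.4f, skewness = %.3f\n', N, v(k), skew(k));
end
```

Passing the dimension to sum() matters here: for a single row, sum(y) would add up all samples instead of returning the row unchanged.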

Related

PDF and CDF plot for central limit theorem using Matlab

I am struggling to plot the PDF and CDF graphs of Sn, where
Sn=X1+X2+X3+....+Xn
using central limit theorem where n = 1; 2; 3; 4; 5; 10; 20; 40
I am taking Xi to be a uniform continuous random variable for values between (0,3).
Here is what i have done so far -
close all
%different sizes of input X
%N=[1 5 10 50];
N = [1 2 3 4 5 10 20 40];
%interval (0,3) for random variables
a=0;
b=3;
%to store sum of different sizes of input
for i=1:length(N)
%generates uniform random numbers in the interval
X = a + (b-a).*rand(N(i),1);
S=zeros(1,length(X));
S=cumsum(X);
cd=cdf('Uniform',S,0,3);
plot(cd);
hold on;
end
legend('n=1','n=2','n=3','n=4','n=5','n=10','n=20','n=40');
title('CDF PLOT')
figure;
for i=1:length(N)
%generates uniform random numbers in the interval
X = a + (b-a).*rand(N(i),1);
S=zeros(1,length(X));
S=cumsum(X);
cd=pdf('Uniform',S,0,3);
plot(cd);
hold on;
end
legend('n=1','n=2','n=3','n=4','n=5','n=10','n=20','n=40');
title('PDF PLOT')
My output is nowhere near what I am expecting. Any help is much appreciated.
This can be done with vectorization using rand() and cumsum().
For example, the code below generates 10000 samples for each of 40 Uniform(0,3) random variables and stores them in X (one variable per row). To meet the Central Limit Theorem (CLT) assumptions, they are independent and identically distributed (i.i.d.). Then cumsum() turns X into 10000 copies of each partial sum Sn = X1 + X2 + ... + Xn: the first row holds n = 10000 copies of S_1 = X1, the 5th row holds n copies of S_5 = X1 + X2 + X3 + X4 + X5, and the last row holds n copies of S_40.
% MATLAB R2019a
% Setup
N = [1:5 10 20 40]; % values of n we are interested in
LB = 0; % lowerbound for X ~ Uniform(LB,UB)
UB = 3; % upperbound for X ~ Uniform(LB,UB)
n = 10000; % Number of copies (samples) for each random variable
% Generate random variates
X = LB + (UB - LB)*rand(max(N),n); % X ~ Uniform(LB,UB) (i.i.d.)
Sn = cumsum(X);
You can see from the image that for the n = 2 case, the sum is indeed a Triangular(0,3,6) distribution. For the n = 40 case, the sum is approximately Normally distributed (Gaussian) with mean 60 (40*mean(X) = 40*1.5 = 60). This shows the convergence in distribution for both the probability density function (PDF) and the cumulative distribution function (CDF).
Note: The CLT is often stated with convergence in distribution to a Normal distribution with zero mean as it has been shifted. Shifting the results by subtracting mean(Sn) = n*mean(X) = n*0.5*(LB+UB) from Sn gets this done.
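As a sketch of that shift (and of the usual additional scaling by the standard deviation, which is an assumption added here and not part of the code in this answer), the S_40 samples can be standardized and checked against mean 0 and variance 1. The names nn, mu_n, sigma_n and Z are illustrative:

```matlab
LB = 0; UB = 3;               % X ~ Uniform(LB,UB)
n = 10000;                    % number of copies of each sum
X = LB + (UB - LB) * rand(40, n);
Sn = cumsum(X);               % row k holds n copies of S_k

nn = 40;                                  % which partial sum to standardize
mu_n = nn * (LB + UB) / 2;                % theoretical mean of S_nn
sigma_n = sqrt(nn * (UB - LB)^2 / 12);    % theoretical std of S_nn
Z = (Sn(nn, :) - mu_n) / sigma_n;         % approximately N(0,1) by the CLT

fprintf('mean(Z) = %.4f, var(Z) = %.4f\n', mean(Z), var(Z));
```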
Code below isn't the gold standard but it produced the image.
figure
s(11) = subplot(6,2,1) % n = 1
histogram(Sn(1,:),'Normalization','pdf')
title(s(11),'n = 1')
s(12) = subplot(6,2,2)
cdfplot(Sn(1,:))
title(s(12),'n = 1')
s(21) = subplot(6,2,3) % n = 2
histogram(Sn(2,:),'Normalization','pdf')
title(s(21),'n = 2')
s(22) = subplot(6,2,4)
cdfplot(Sn(2,:))
title(s(22),'n = 2')
s(31) = subplot(6,2,5) % n = 5
histogram(Sn(5,:),'Normalization','pdf')
title(s(31),'n = 5')
s(32) = subplot(6,2,6)
cdfplot(Sn(5,:))
title(s(32),'n = 5')
s(41) = subplot(6,2,7) % n = 10
histogram(Sn(10,:),'Normalization','pdf')
title(s(41),'n = 10')
s(42) = subplot(6,2,8)
cdfplot(Sn(10,:))
title(s(42),'n = 10')
s(51) = subplot(6,2,9) % n = 20
histogram(Sn(20,:),'Normalization','pdf')
title(s(51),'n = 20')
s(52) = subplot(6,2,10)
cdfplot(Sn(20,:))
title(s(52),'n = 20')
s(61) = subplot(6,2,11) % n = 40
histogram(Sn(40,:),'Normalization','pdf')
title(s(61),'n = 40')
s(62) = subplot(6,2,12)
cdfplot(Sn(40,:))
title(s(62),'n = 40')
sgtitle({'PDF (left) and CDF (right) for Sn with n \in \{1, 2, 5, 10, 20, 40\}';'note different axis scales'})
for tgt = [11:10:61 12:10:62]
xlabel(s(tgt),'Sn')
if rem(tgt,2) == 1
ylabel(s(tgt),'pdf')
else % rem(tgt,2) == 0
ylabel(s(tgt),'cdf')
end
end
Key functions used for plot: histogram() from base MATLAB and cdfplot() from the Statistics toolbox. Note this could be done manually without requiring the Statistics toolbox with a few lines to obtain the cdf and then just calling plot().
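A minimal sketch of that manual route, assuming the same Uniform(0,3) setup as above: sorting the samples gives the x-values of the empirical CDF, and the k-th sorted sample gets the cumulative probability k/n. The variable names x and F are illustrative.

```matlab
LB = 0; UB = 3; n = 10000;
Sn = cumsum(LB + (UB - LB) * rand(40, n));   % row k holds n copies of S_k

x = sort(Sn(40, :));          % sorted samples of S_40
F = (1:n) / n;                % empirical CDF value at each sorted sample

figure
stairs(x, F)                  % step plot, no Statistics toolbox required
xlabel('S_{40}'), ylabel('cdf')
```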
There was some concern in comments over the variance of Sn.
Note the variance of Sn is given by (n/12)*(UB-LB)^2 (derivation below). Monte Carlo simulation shows our samples of Sn do have the correct variance; indeed, it converges to this as n gets larger. Simply call var(Sn(40,:)).
% with n = 10000
var(Sn(40,:)) % sample variance of S_40, e.g. 29.9505 (will vary slightly depending on random seed)
(40/12)*((UB-LB)^2) % theoretical var(S_40) = 30
You can see the convergence is very good by S_40:
step = 0.01;
Domain = 40:step:80;
mu = 40*(LB+UB)/2;
sigma = sqrt((40/12)*((UB-LB)^2));
figure, hold on
histogram(Sn(40,:),'Normalization','pdf')
plot(Domain,normpdf(Domain,mu,sigma),'r-','LineWidth',1.4)
ylabel('pdf')
xlabel('S_n')
Derivation of mean and variance for Sn:
For the expectation (mean), the second equality holds by linearity of expectation. The third equality holds since X_i are identically distributed.
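A reconstruction of the equalities described here, consistent with the formulas used earlier in this answer (n*0.5*(LB+UB) for the mean and (n/12)*(UB-LB)^2 for the variance), might look as follows. The variance line additionally assumes that the X_i are independent:

```latex
\begin{align*}
E[S_n] &= E\left[\sum_{i=1}^{n} X_i\right]
        = \sum_{i=1}^{n} E[X_i]
        = n\,E[X_1]
        = \frac{n\,(LB + UB)}{2} \\
\operatorname{Var}(S_n) &= \operatorname{Var}\left(\sum_{i=1}^{n} X_i\right)
        = \sum_{i=1}^{n} \operatorname{Var}(X_i)
        = n\,\operatorname{Var}(X_1)
        = \frac{n\,(UB - LB)^2}{12}
\end{align*}
```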
The discrete version of this is posted here.

Comparing the function fminunc with the BFGS method for logistic regression

I'm constructing an algorithm that uses the BFGS method to find the parameters of a logistic regression for a binary dataset in Octave.
Now, I'm struggling with something I believe is an overfitting problem. I ran the algorithm on several datasets and it converges to the same results as Octave's fminunc function. However, for a specific "type of dataset" the algorithm converges to very high values of the parameters, in contrast to fminunc, which gives reasonable values of these parameters. I added a regularization term, and with it my algorithm converges to the same values as fminunc.
This specific type of dataset has data that can be completely separated by a straight line. My question is: why is this a problem for the BFGS method but not for fminunc? How does this function avoid the issue without regularization? Could I implement this in my algorithm?
The code of my algorithm is the following:
function [beta] = Log_BFGS(data, L_0)
clc
close
%************************************************************************
%************************************************************************
%Loading the data:
[n, e] = size(data);
d = e - 1;
n; %Number of observations.
d; %Number of features.
Y = data(:, e); %Label values
X_o = data(:, 1:d);
X = [ones(n, 1) X_o]; %Features values
%Initials conditions:
beta_0 = zeros(e, 1);
beta = [];
beta(:, 1) = beta_0;
N = 600; %Max iterations
Tol = 1e-10; %Tolerance
error = .1;
L = L_0; %Regularization parameter
B = eye(e);
options = optimset('GradObj', 'on', 'MaxIter', 600);
[beta_s] = fminunc(@(t)(costFunction(t, X, Y, L)), beta_0, options);
disp('Beta obtained with the fminunc function');
disp("--------------");
disp(beta_s)
k = 1;
a_0 = 1;
% Define the sigmoid function
h = inline('1.0 ./ (1.0 + exp(-z))');
while (error > Tol && k < N)
beta_k = beta(:, k);
x_0 = X*beta_k;
h_0 = h(x_0);
beta_r = [0 ; beta(:, k)(2:e, :)];
g_k = ((X)'*(h_0 - Y) + L*beta_r)/n;
d_k = -pinv(B)*g_k;
a = 0.1; %I'll implement an Armijo line search here (soon)
beta(:, k+1) = beta(:, k) + a*d_k;
beta_k_1 = beta(:, k+1);
x_1 = X*beta_k_1;
h_1 = h(x_1);
beta_s = [0 ; beta(:, k+1)(2:e, :)];
g_k_1 = (transpose(X)*(h_1 - Y) + L*beta_s)/n;
s_k = beta(:, k+1) - beta(:, k);
y_k = g_k_1 - g_k;
B = B - B*s_k*s_k'*B/(s_k'*B*s_k) + y_k*y_k'/(s_k'*y_k);
k = k + 1;
error = norm(d_k);
endwhile
%Accuracy of the logistic model:
p = zeros(n, 1);
for j = 1:n
if (1./(1. + exp(-1.*(X(j, :)*beta(:, k)))) >= 0.5)
p(j) = 1;
else
p(j) = 0;
endif
endfor
R = mean(double(p == Y));
beta = beta(:, k);
%Showing the results:
disp("Estimation of logistic regression model Y = 1/(1 + e^(-beta*X)),")
disp("using the algorithm BFGS =")
disp("--------------")
disp(beta)
disp("--------------")
disp("with a convergence error in the last iteration of:")
disp(error)
disp("--------------")
disp("and a total number of")
disp(k-1)
disp("iterations")
disp("--------------")
if k == N
disp("The maximum number of iterations was reached before obtaining the desired error")
else
disp("The desired error was reached before reaching the maximum of iterations")
endif
disp("--------------")
disp("The precision of the logistic regression model is given by (max 1.0):")
disp("--------------")
disp(R)
disp("--------------")
endfunction
The results I got for the dataset are shown in the following picture. If you need the data used in this situation, please let me know.
Results of the algorithm
Check the objectives!
The values of the solution vector are nice to look at, but the whole optimization is driven by the objective. You say that fminunc "gives reasonable values of these parameters", but "reasonable" is not defined within this model.
It is quite possible that both your low-value and your high-value solutions yield nearly the same objective value. And that is all those solvers care about (when no regularization term is used).
So the important question is: is there a unique solution (which would rule out these results)? Only when your dataset has full rank! So maybe your data is rank-deficient and you obtain two equally good solutions. Of course, there might be slight differences due to numerical issues, which are always a source of error, especially in more complex optimization algorithms.
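One way to check the objectives for the situation described in the question (data that a straight line separates perfectly) is to evaluate the unregularized logistic cost along a separating direction. The sketch below uses a tiny hypothetical data set and a hand-picked direction w; it is not the asker's data or cost function. On such data the cost keeps shrinking as the parameters are scaled up, so very different parameter magnitudes can give almost the same objective, and where a solver stops depends on its stopping criteria.

```matlab
% Hypothetical, perfectly separable 1-D data (intercept column + one feature)
X = [ones(4,1), [-2; -1; 1; 2]];
Y = [0; 0; 1; 1];
w = [0; 1];                        % a direction that separates the classes

% Unregularized logistic cost, written with labels mapped to -1/+1.
% log1p keeps precision when exp(...) is tiny.
cost = @(b) mean(log1p(exp(-(2*Y - 1) .* (X*b))));

scales = [1 10 100];
J = arrayfun(@(s) cost(s*w), scales);
for k = 1:numel(scales)
    fprintf('scale %4d: cost = %.3e\n', scales(k), J(k));
end
```

With a regularization term added to the cost, large scales are penalized and a finite minimizer exists, which matches the behaviour reported in the question.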

Correlation coefficients between two matrices to find intercorrelation

I am trying to calculate Pearson coefficients between all pair combinations of my variables of all my samples.
Say I have an m*n matrix where m is the number of variables and n is the number of samples.
I want to calculate, for each variable of my data, its correlation with every other variable.
So, i managed to do that with nested loops:
X = rand(1000, 100);
for i = 1:1000
base = X(i, :);
for j = 1:1000
target = X(j, :);
correlation = corrcoef(base, target);
correlation = correlation(2, 1);
corData(1, j) = correlation;
end
totalCor(i, :) = corData;
end
and it works, but it takes too much time to run.
I am trying to find a way to run the corrcoef function on a row basis, perhaps by creating an additional matrix with repmat of the base values and correlating it with the X data using some FUN function.
I could not figure out how to use such a function with inputs from two arrays, iterating over individual rows/columns.
Any help will be appreciated.
This post involves a bit of hacking, so bear with it!
Stage #0 To start off, we have -
for i = 1:N
base = X(i, :);
for j = 1:N
target = X(j, :);
correlation = corrcoef(base, target);
correlation = correlation(2, 1);
corData(1, j) = correlation;
end
end
Stage #1 From the documentation of corrcoef in its source code :
If C is the covariance matrix, C = COV(X), then CORRCOEF(X) is the
matrix whose (i,j)'th element is : C(i,j)/SQRT(C(i,i)*C(j,j)).
After hacking into the code of covariance, we see that for the default case of one input, the covariance formula is simply -
[m,n] = size(x);
xc = bsxfun(@minus,x,sum(x,1)/m);
xy = (xc' * xc) / (m-1);
Thus, mixing the two definitions and putting them into the problem at hand, we have -
m = size(X,2);
for i = 1:N
base = X(i, :);
for j = 1:N
target = X(j, :);
BT = [base(:) target(:)];
xc = bsxfun(@minus,BT,sum(BT,1)/m);
C = (xc' * xc) / (m-1); %//'
corData = C(2,1)/sqrt(C(2,2)*C(1,1))
end
end
Stage #2 This is the final stage where we use the real fun aka bsxfun to kill all loops, like so -
%// Broadcasted subtract of each row by the average of it.
%// This corresponds to "xc = bsxfun(@minus,BT,sum(BT,1)/m)"
p1 = bsxfun(@minus,X,mean(X,2));
%// Get pairs of rows from X and get the dot product.
%// Thus, a total of "N x N" such products would be obtained.
p2 = sum(bsxfun(@times,permute(p1,[1 3 2]),permute(p1,[3 1 2])),3);
%// Scale them down by "size(X,2)-1".
%// This was for the part : "C = (xc' * xc) / (m-1)".
p3 = p2/(size(X,2)-1);
%// "C(2,2)" and "C(1,1)" are diagonal elements from "p3", so store them.
dp3 = diag(p3);
%// Get "sqrt(C(2,2)*C(1,1))" by broadcasting elementwise multiplication
%// of "dp3". Finally do elementwise division of "p3" by it.
totalCor_out = p3./sqrt(bsxfun(@times,dp3,dp3.'));
Benchmarking
This section compares the original approach against the proposed one and also verifies the output. Here's the benchmarking code -
disp('---------- With original approach')
tic
X = rand(1000,100);
corData = zeros(1,1000);
totalCor = zeros(1000,1000);
for i = 1:1000
base = X(i, :);
for j = 1:1000
target = X(j, :);
correlation = corrcoef(base, target);
correlation = correlation(2, 1);
corData(1, j) = correlation;
end
totalCor(i, :) = corData;
end
toc
disp('---------- With the real fun aka BSXFUN')
tic
p1 = bsxfun(@minus,X,mean(X,2));
p2 = sum(bsxfun(@times,permute(p1,[1 3 2]),permute(p1,[3 1 2])),3);
p3 = p2/(size(X,2)-1);
dp3 = diag(p3);
totalCor_out = p3./sqrt(bsxfun(@times,dp3,dp3.')); %//'
toc
error_val = max(abs(totalCor(:)-totalCor_out(:)))
Output -
---------- With original approach
Elapsed time is 186.501746 seconds.
---------- With the real fun aka BSXFUN
Elapsed time is 1.423448 seconds.
error_val =
4.996e-16
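The pairwise dot products of the centred rows can also be obtained with a single matrix product, which avoids building the intermediate N x N x n array. This is an alternative sketch rather than the method benchmarked above; the names Xc, C, d, R and R_ref are illustrative, and a smaller random X is used to keep the example quick. The result is checked against corrcoef applied to the transposed matrix, which treats the rows of X as variables:

```matlab
X = rand(200, 50);                            % 200 variables, 50 samples each

% Centre each row (repmat keeps this independent of implicit expansion)
Xc = X - repmat(mean(X, 2), 1, size(X, 2));

C = (Xc * Xc.') / (size(X, 2) - 1);           % covariance between rows
d = sqrt(diag(C));                            % per-row standard deviations
R = C ./ (d * d.');                           % Pearson correlation matrix

R_ref = corrcoef(X.');                        % built-in, rows as variables
err = max(abs(R(:) - R_ref(:)));
fprintf('max abs difference to corrcoef: %.3e\n', err);
```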

Why does linprog only give one value of x1 or x2 and not a combination of both?

Hi I have the following code using linprog
for K = 1:3;
for M = 1:3;
PV_output(:,:,K) = real(PV_power_output(:,:,K));
PV =reshape(PV_output(:,:,1),8760,1);
WT_output(:,:,M) = WT_power_output(:,:,M);
WT = reshape(WT_output(:,:,1),8760,1);
PVenergy = sum(sum(PV_output(:,:,1)));
WTenergy = sum(sum(WT_power_output(:,:,1)));
% Linprog
f = [((CRF*CC_PV)/PVenergy)+OM_PV; ((CRF*CC_WT)/WTenergy)+OM_WT];
A(:,:) = [-PV -WT];
b(:,:) = -0.25.*Demand(:);
lb = zeros(2,1);
ub = [max_PV_area/PV_area; max_WT_area/WT_area]';
[x(:,K,M), fval, exitflag] = linprog(f,A,b,[],[],lb,ub)
end
end
Where PV = 8760x2, WT = 8760x2 and x = 2x1. When I run this program the optimisation converges with an exit flag of 1, but I only ever get one non-zero variable, for example x1 = 0 and x2 equal to a certain integer.
Why doesn't the output give a mixture of the results (i.e. a non-zero value of both x1 and x2)?
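Because a linear programming solver returns a solution at a vertex of the polytope defined by the constraints. If an optimum exists, at least one optimal solution lies at such a vertex. At a vertex several constraints hold with equality, and here these can include a bound such as x1 = 0.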
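A small illustration with made-up numbers (not the asker's data, and using a brute-force grid search instead of linprog so that no toolbox is needed): minimize 2*x1 + 3*x2 subject to x1 + x2 >= 4 and 0 <= x1, x2 <= 5. The cheaper variable alone can satisfy the constraint, so the minimum sits at the vertex (4, 0) and x2 stays at its lower bound.

```matlab
f = [2; 3];                          % hypothetical cost coefficients

% Evaluate the objective on a grid covering the box 0 <= x1, x2 <= 5
[x1, x2] = meshgrid(0:0.5:5);
feasible = (x1 + x2) >= 4;           % hypothetical demand constraint
cost = f(1)*x1 + f(2)*x2;
cost(~feasible) = Inf;               % exclude infeasible grid points

[bestCost, idx] = min(cost(:));
best = [x1(idx), x2(idx)];
fprintf('best point: (%g, %g), cost %g\n', best(1), best(2), bestCost);
```

A mixture of both variables would only be returned if the constraints forced it, for example if the upper bound on x1 were below 4 in this toy problem.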

Generating a triangular distribution in Matlab

I have attempted to generate a triangular probability distribution in Matlab, but was not successful. I used the formula at http://en.wikipedia.org/wiki/Triangular_distribution.
n = 10000000;
a = 0.2;
b = 0.7;
c = 0.5;
u = sqrt(rand(n, 1));
x = zeros(n, 1);
for i = 1:n
U = u(i);
if U < (c-a)/(b-a)
X = a + sqrt(U*(b-a)*(c-a));
else
X = b - sqrt((1-U)*(b-a)*(b-c));
end
x(i) = X;
end
hist(x, 100);
The histogram looks like so:
Doesn't look like much of a triangle to me. What's the problem? Am I abusing rand(n)?
You can add up two uniform random variables: their densities convolve, and you get a triangular distribution.
An easy-to-understand example: rolling two dice. Each roll is uniformly distributed over the numbers 1-6, and the sum has a (discrete) triangular distribution over the numbers 2-12.
edit: minimal working example:
a=randi([0 9],10000,1); % randi replaces the deprecated randint(10000,1,10)
b=randi([0 9],10000,1);
c=a+b;
hist(c,max(c)-min(c)+1)
edit2: looked in your script again. It's working but you've made one mistake:
u = sqrt(rand(n, 1));
should be
u = rand(n, 1);
edit3: optimized code
n = 10000000;
a = 0.2;
b = 0.7;
c = 0.5;
u = rand(n, 1);
x = zeros(n, 1);
idx = u < (c-a)/(b-a); % logical mask for the lower branch
x(idx) = a + sqrt(u(idx)*(b-a)*(c-a));
x(~idx) = b - sqrt((1-u(~idx))*(b-a)*(b-c));
hist(x, 100);
This example uses the makedist() and pdf() commands.
a = 2; m = 7; b = 10;
N = 50000; % Number of samples
pd = makedist('Triangular',a,m,b); % Create probability distribution object
T = random(pd,N,1); % Generate samples from distribution
Triangular Distribution with lowerbound a = 2, mode m = 7, and upperbound b = 10.
% Plot PDF & Compare with Generated Sample
X = (a-2:.1:b+2);
figure, hold on, box on
histogram(T,'Normalization','pdf') % Note normalization-pdf option name-value pair
title([num2str(N) ' Samples'])
plot(X,pdf(pd,X),'r--','LineWidth',1.8)
legend('Empirical Density','Theoretical Density','Location','northwest')
MATLAB introduced makedist() in R2013a. It requires the Statistics Toolbox.
Reference:
Triangular Distribution
Change
u = sqrt(rand(n, 1));
to
u = rand(n, 1);
The nice thing about this formula is that you can generate a sample from a general triangular distribution using a single uniform random number.
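A quick way to see the effect of the extra sqrt is to compare sample means with the theoretical mean (a+b+c)/3 of a triangular distribution. The sketch below wraps the inverse-CDF formula from the question in an anonymous function; the names triInv, Fc, xGood and xBad are illustrative, and n is reduced to keep it fast:

```matlab
a = 0.2; b = 0.7; c = 0.5;           % lower limit, upper limit, mode
n = 1e6;
Fc = (c - a) / (b - a);              % CDF value at the mode

% Inverse CDF of the triangular distribution, vectorized over u in [0,1]
triInv = @(u) (u <  Fc) .* (a + sqrt(u * (b-a) * (c-a))) + ...
              (u >= Fc) .* (b - sqrt((1 - u) * (b-a) * (b-c)));

u = rand(n, 1);
xGood = triInv(u);                   % uniform input: triangular output
xBad  = triInv(sqrt(u));             % sqrt(rand) is no longer uniform

mTheory = (a + b + c) / 3;
fprintf('theory %.4f | rand: %.4f | sqrt(rand): %.4f\n', ...
        mTheory, mean(xGood), mean(xBad));
```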