I have a vector A in Matlab of dimension Nx1. I want to get a non-parametric estimate of the cdf at each point in A and store all the values in a vector B of dimension Nx1. Which options do I have?
I have read about ecdf and ksdensity, but the differences, pros and cons are not clear to me. Any direction would be appreciated.
This doesn't exactly answer your question, but you can compute the empirical CDF very simply:
A = randn(1,1e3); % example Gaussian data
x_cdf = sort(A);
y_cdf = (1:numel(A))/numel(A);
plot(x_cdf, y_cdf) % plot CDF
This works because, by definition, each sample contributes to the (empirical) CDF with an increment of 1/N. That is, for values smaller than the minimum sample the CDF equals 0; for values between the minimum sample and the next highest sample it equals 1/N, etc.
The advantage of this approach is that you know exactly what is being done.
If you need to evaluate the empirical CDF at prescribed x-axis values:
A = randn(1,1e3); % example Gaussian data
x_cdf = -5:.1:5;
y_cdf = sum(bsxfun(@le, A(:), x_cdf), 1)/numel(A);
plot(x_cdf, y_cdf) % plot CDF
If you have prescribed y-axis values, the corresponding x-axis values are by definition the quantiles of the (empirical) distribution:
A = randn(1,1e3); % example Gaussian data
y_cdf = 0:.01:1;
x_cdf = quantile(A, y_cdf);
plot(x_cdf, y_cdf) % plot CDF
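To get the values at the sample points themselves (the Nx1 vector B asked for in the question), the same idea can be sketched as below. This is a minimal sketch rather than a definitive implementation: it assumes there are no repeated values in A (with ties, the rank-based shortcut gives tied samples different values), and the names order and B_check are only illustrative.

```matlab
A = randn(1e3,1);                      % example Gaussian data, Nx1
N = numel(A);

% Rank of each sample: the k-th smallest value gets empirical CDF k/N
[~, order] = sort(A);
B = zeros(N,1);
B(order) = (1:N).'/N;                  % B(i) = fraction of samples <= A(i)

% Cross-check against the direct definition (O(N^2) memory, fine for N = 1e3)
B_check = mean(bsxfun(@le, A.', A), 2);
```

B has the same size and ordering as A, so B(i) is the estimated CDF at A(i).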
You want ecdf, not ksdensity.
ecdf computes the empirical distribution function of your data set. This converges to the cumulative distribution function of the underlying population as the sample size increases.
ksdensity computes a kernel density estimate from your data. This converges to the probability density function of the underlying population as the sample size increases.
The PDF tells you how likely you are to get values near a given value. It wiggles up and down over your domain, going up near more likely values and falling near less likely values. The CDF tells you how likely you are to get values below a given value. So it always starts at zero at the left end of your domain and increases monotonically to one at the right end of your domain.
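For a side-by-side look, the sketch below evaluates both estimates at the sample points. It assumes the Statistics and Machine Learning Toolbox is available and uses ksdensity with the 'Function','cdf' option, which returns a kernel-smoothed CDF instead of a density. The variable names B_emp, B_ks and max_gap are illustrative only.

```matlab
A = randn(1e3,1);                               % example Gaussian data

% Empirical CDF at each sample: a step function, no smoothing
B_emp = mean(bsxfun(@le, A.', A), 2);

% Kernel-smoothed CDF at each sample (default bandwidth chosen by ksdensity)
B_ks = ksdensity(A, A, 'Function', 'cdf');

% Largest disagreement between the two estimates
max_gap = max(abs(B_emp - B_ks));
```

For unimodal data like this the two are usually close; the empirical version makes no smoothing assumption, while the kernel version depends on the bandwidth.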
I am using MATLAB R2020a on macOS. I am trying to find the exponentially weighted moving mean of the cycle period of an ECG signal, and have used the dsp.MovingAverage System object from the DSP System Toolbox with the commands shown. However, I am not sure how to specify how many elements of the vector to include in the weighted mean. At the moment, is it just applying a weight to all of the elements and then finding the moving mean?
movavgExp = dsp.MovingAverage('Method', 'Exponential weighting', 'ForgettingFactor', 0.1);
Whenever I pass a window length as specified in the DSP documentation, it produces a warning:
movavgExp = dsp.MovingAverage(10, 'Method', 'Exponential weighting', 'ForgettingFactor', 0.1);
Warning: The WindowLength property is not relevant in this configuration of the System object.
I would really appreciate any suggestions for this, thanks in advance!
From the Mathworks page for dsp.MovingAverage:
"Exponential weighting — The block multiplies the samples by a set of weighting factors. The magnitude of the weighting factors decreases exponentially as the age of the data increases, but the magnitude never reaches zero. To compute the average, the algorithm sums the weighted data."
So there is no real averaging window, as you use all of your signal up to time t (exponentially weighted) for the mean value at that instant.
Of course older samples are weighted less than newer ones, and the parameter for that is the ForgettingFactor. I guess you could then define an "effective" averaging window as the number of samples whose weight is larger than a threshold.
Unfortunately it doesn't seem like dsp.MovingAverage can return the weights itself, but you can calculate them yourself. From the Mathworks page,
$w_{N,\lambda} = \lambda w_{N-1,\lambda} + 1$
where $w_{N,\lambda}$ is the weight for the Nth sample and $\lambda$ is your forgetting factor. Remember to initialize the weight for the first sample to 1, so that you could have something like:
lambda = 0.1; % forgetting factor, same value as passed to dsp.MovingAverage
w = zeros(length(x),1); % where x is your signal
w(1) = 1; % initialize the weight for the first sample
for i = 2:length(x)
w(i) = lambda*w(i-1) + 1; % calculate the successive weights
end
To then get the averaging window for the N-th sample, I would probably normalize the weights from 1 to N with respect to their sum:
thr = 1.e-3; % your threshold, you'll probably have to play with this a bit
lengthAveragWdw = zeros(length(x),1);
for i = 1:length(x)
wi = w(1:i); % weights used to calculate the moving average up to the i-th sample
wi = wi./sum(wi); % normalize the weights
lengthAveragWdw(i) = sum(wi >= thr); % count the number of samples whose weight is greater than the threshold
end
where thr is a threshold value that you have to decide beforehand.
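An alternative reading of the recursion, offered here as an assumption rather than as what dsp.MovingAverage does internally: if the running mean is updated as xbar(n) = (1 - 1/w(n))*xbar(n-1) + x(n)/w(n), then w(n) acts as the normalizing sum and sample k contributes to the N-th mean with weight lambda^(N-k)/w(N). The sketch below checks that numerically without the toolbox; lambda, x, wk and effLen are hypothetical names and values.

```matlab
lambda = 0.9;                 % hypothetical forgetting factor
x = randn(200,1);             % hypothetical signal
N = numel(x);

% Recursive exponentially weighted mean (assumed update rule, see lead-in)
w = zeros(N,1);  xbar = zeros(N,1);
w(1) = 1;        xbar(1) = x(1);
for n = 2:N
    w(n) = lambda*w(n-1) + 1;                       % normalizing sum
    xbar(n) = (1 - 1/w(n))*xbar(n-1) + x(n)/w(n);   % updated mean
end

% Explicit per-sample weights for the last output: newest sample weighs most
k = (1:N).';
wk = lambda.^(N-k) / w(N);
xbar_explicit = sum(wk.*x);                         % should match xbar(N)

% "Effective" window: number of samples whose weight exceeds a threshold
thr = 1e-3;
effLen = sum(wk >= thr);
```

Under this reading the effective window depends only on lambda and thr once N is large, which may be easier to tune than a per-sample count.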
Consider the following draws for a 2x1 vector in Matlab with a probability distribution that is a mixture of two Gaussian components.
P=10^3; %number draws
v=1;
%First component
mu_a = [0,0.5];
sigma_a = [v,0;0,v];
%Second component
mu_b = [0,8.2];
sigma_b = [v,0;0,v];
%Combine
MU = [mu_a;mu_b];
SIGMA = cat(3,sigma_a,sigma_b);
w = ones(1,2)/2; %equal weight 0.5
obj = gmdistribution(MU,SIGMA,w);
%Draws
RV_temp = random(obj,P);%Px2
% Transform each component of RV_temp into a uniform in [0,1] by estimating the cdf.
RV1=ksdensity(RV_temp(:,1), RV_temp(:,1),'function', 'cdf');
RV2=ksdensity(RV_temp(:,2), RV_temp(:,2),'function', 'cdf');
Now, if we check whether RV1 and RV2 are uniformly distributed on [0,1] by doing
ecdf(RV1)
ecdf(RV2)
we can see that RV1 is uniformly distributed on [0,1] (the empirical cdf is close to the 45 degree line) while RV2 is not.
I don't understand why. It seems that the more distant mu_a(2) and mu_b(2) are, the worse the job done by ksdensity with a reasonable number of draws. Why?
When you have a mixture of N(0.5,v) and N(8.2,v), the range of the generated data is larger than if the expectations were closer, like N(0,v) and N(0,v), as you have in the other dimension. You then ask ksdensity to approximate a function using P points inside this range.
As in standard linear interpolation, the denser the points, the better the approximation of the function (inside the range), and the same applies here. Thus in the N(0.5,v) and N(8.2,v) case, where the points are "sparse" (or sparser, is that a word?), the approximation is worse than in the N(0,v) and N(0,v) case, where the points are denser.
As a small side note, is there any reason that you do not apply ksdensity directly to the bivariate data? Also, I cannot reproduce your comment where you say that 5e2 points are also good. Final comment: 1e3 is typically preferred over 10^3.
I think this is simply about the number of samples you're using. For the first example, the means of the two Gaussians are relatively close, hence a thousand samples are enough to obtain a cdf really close to the U[0,1] cdf. For the second vector, though, the difference is larger, and you need more samples. With 100000 samples, I obtained the following result:
With 1000 I obtained this:
Which is clearly farther from the Uniform cdf function. Try to increase the number of samples to a million and check if the result is again getting closer.
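If the goal is only to map each component to something uniform on [0,1], a rank-based (empirical CDF) transform is one option that does not depend on a kernel bandwidth. The sketch below is an assumption-laden illustration, not the method used in the question: it draws the second component of the mixture directly with randn instead of gmdistribution, and divides ranks by P+1 so the result stays strictly inside (0,1).

```matlab
P = 1e3;                                   % number of draws
% Equal-weight mixture of N(0.5,1) and N(8.2,1), drawn without gmdistribution
comp = rand(P,1) < 0.5;                    % true -> first component
x = randn(P,1) + 0.5.*comp + 8.2.*(~comp);

% Rank-based transform: the k-th smallest draw maps to k/(P+1)
[~, order] = sort(x);
U = zeros(P,1);
U(order) = (1:P).'/(P+1);

% Deviation of the sorted transformed values from an evenly spaced grid
dev = max(abs(sort(U) - (1:P).'/(P+1)));
```

By construction the transformed values are evenly spaced, however far apart the two means are; the price is that the transform is a step function of x rather than a smooth one.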
Suppose I have the following data and commands:
clc;clear;
t = [0:0.1:1];
t_new = [0:0.01:1];
y = [1,2,1,3,2,2,4,5,6,1,0];
p = interp1(t,y,t_new,'spline');
plot(t,y,'o',t_new,p)
You can see they work quite fine, in the sense interpolating function matches the data points at the nodes fine. But my problem is, I need to compute the exact derivative of y (i.e., p function) w.r.t. time and plot it against the t vector. How can it be done? I shall not use diff commands, because I need to make sure the derivative function has the same length as t vector. Thanks a lot.
Method A: Using the derivative
This method calculates the actual derivative of the polynomial. If you have the curve fitting toolbox you can use:
% calculate the piecewise polynomial
pp = interp1(t,y,'spline','pp');
% take the first order derivative of it
pp_der=fnder(pp,1);
% evaluate the derivative at points t (or any other points you wish)
slopes=ppval(pp_der,t);
If you don't have the curve fitting toolbox you can replace the fnder line with:
% piece-wise polynomial
[breaks,coefs,l,k,d] = unmkpp(pp);
% get its derivative
pp_der = mkpp(breaks,repmat(k-1:-1:1,d*l,1).*coefs(:,1:k-1),d);
Source: This mathworks question. Thanks to m7913d for linking it.
Appendix:
Note that
p = interp1(t,y,t_new,'spline');
is a shortcut for
% get the polynomial
pp = interp1(t,y,'spline','pp');
% get the height of the polynomial at query points t_new
p=ppval(pp,t_new);
To get the derivative we obviously need the polynomial and can't just work with the new interpolated points. To avoid interpolating the points twice which can take quite long for a lot of data, you should replace the shortcut with the longer version. So a fully working example that includes your code example would be:
t = [0:0.1:1];
t_new = [0:0.01:1];
y = [1,2,1,3,2,2,4,5,6,1,0];
% fit a polynomial
pp = interp1(t,y,'spline','pp');
% get the height of the polynomial at query points t_new
p=ppval(pp,t_new);
% plot the new interpolated curve
plot(t,y,'o',t_new,p)
% piece-wise polynomial
[breaks,coefs,l,k,d] = unmkpp(pp);
% get its derivative
pp_der = mkpp(breaks,repmat(k-1:-1:1,d*l,1).*coefs(:,1:k-1),d);
% evaluate the derivative at points t (or any other points you wish)
slopes=ppval(pp_der,t);
Method B: Using finite differences
A derivative of a continuous function is, at its base, just the difference between f(x) and f(x + infinitesimal difference), divided by said infinitesimal difference.
In MATLAB, eps is the spacing between 1 and the next larger double-precision number (about 2.2e-16). Therefore after each t_new we add a second point which is a small multiple of eps larger and interpolate y for the new points. Then the difference between each point and its +eps pair, divided by the step, gives the derivative.
The problem is that if we work with a step of exactly eps, the precision of the output derivatives is severely limited: the differences are quantized by round-off, so the slopes can essentially only take integer values. Therefore we use a step slightly larger than eps to allow for higher precision.
% step size as a multiple of eps (a larger value gives finer resolution of the slopes)
precision = 10;
% add after each t_new a second point with +eps difference
t_eps=[t_new; t_new+eps*precision];
t_eps=t_eps(:).';
% interpolate with those points and get the differences between them
differences = diff(interp1(t,y,t_eps,'spline'));
% delete all differences which are not between t_new and t_new + eps
differences(2:2:end)=[];
% get the derivatives of each point
slopes = differences./(eps*precision);
You can of course replace t_new with t (or any other time you want to get the differential of) if you want to get the derivatives at the old points.
This method is inferior to Method A in your case, as it is slower and less precise. But maybe it's useful to somebody else who is in a different situation.
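To see how much the step size matters, the two methods can be compared directly. The sketch below is only an illustration under stated assumptions: it uses a central difference with step sqrt(eps), a common rule of thumb that is not part of the method above, and evaluates at hypothetical query points tq strictly inside the data range.

```matlab
t = 0:0.1:1;
y = [1,2,1,3,2,2,4,5,6,1,0];

% Method A: exact derivative of the piecewise polynomial
pp = interp1(t, y, 'spline', 'pp');
[breaks, coefs, l, k, d] = unmkpp(pp);
pp_der = mkpp(breaks, repmat(k-1:-1:1, d*l, 1).*coefs(:,1:k-1), d);

tq = 0.05:0.1:0.95;                 % hypothetical query points
exact = ppval(pp_der, tq);

% Finite differences with a larger step (assumed rule of thumb: sqrt(eps))
h = sqrt(eps);
fd = (ppval(pp, tq + h) - ppval(pp, tq - h)) / (2*h);

% Largest disagreement between the two derivative estimates
err = max(abs(fd - exact));
```

With a step of this size the finite-difference slopes should agree with Method A to several digits, whereas a step of a few eps leaves the result dominated by round-off.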
I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
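One caveat with the interp1 lookup: it needs the CDF values to be distinct, which is not the case if some cells of A have zero probability. A minimal alternative sketch, assuming A is a non-negative matrix that sums to 1 (a random one is generated here purely as a stand-in), looks up the first CDF entry that reaches each random number and converts the linear index back with ind2sub.

```matlab
A = rand(100);                      % hypothetical stand-in for your 100x100 PMF
A(A < 0.2) = 0;                     % include some zero-probability cells
A = A / sum(A(:));                  % normalize so that sum(A(:)) == 1

cdf = cumsum(A(:));
cdf(end) = 1;                       % guard against round-off in the last entry

N_vals = 1000;                      % number of samples to draw
u = rand(N_vals,1);

% For each u, take the first linear index whose cumulative probability reaches u
ind = arrayfun(@(r) find(cdf >= r, 1, 'first'), u);

% Convert linear indices back to (x,y) index pairs
[x_idx, y_idx] = ind2sub(size(A), ind);
xy_values = [x_idx, y_idx];

% Sample mean of the drawn pairs
sample_mean = mean(xy_values, 1);
```

Cells with zero probability are never selected, because their cumulative value equals that of the preceding cell and find returns the earlier index.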
I hope that this helps!
Chip
I don't believe MATLAB has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated by inverting the cumulative distribution function, a multivariate CDF has no unique inverse, so generating such numbers is much messier (the main problem is that two or more variables can be correlated). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and arg must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.
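Higher moments follow the same pattern. As a minimal sketch, assuming a single Gaussian hump as test input (chosen here only because its mean and variance are known, it is not the distribution from the example above), the variance and covariance can be computed with the same double trapz call:

```matlab
xv = linspace(-50, 50, 101);
yv = linspace(-30, 30, 100);
[x, y] = meshgrid(xv, yv);

% Hypothetical test density: Gaussian centered at (10, 0), variance 50 per axis
A = exp(-((x-10).^2 + y.^2)/100);
A = A / trapz(xv, trapz(yv, A));          % normalize to unit integral

% First moments
mean_x = trapz(xv, trapz(yv, A.*x));
mean_y = trapz(xv, trapz(yv, A.*y));

% Second central moments
var_x  = trapz(xv, trapz(yv, A.*(x - mean_x).^2));
var_y  = trapz(xv, trapz(yv, A.*(y - mean_y).^2));
cov_xy = trapz(xv, trapz(yv, A.*(x - mean_x).*(y - mean_y)));
```

The inner trapz integrates over y (the rows of A) and the outer one over x, in the same order as in the example above.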
Does anyone know how to calculate the empirical quantiles of a distribution in Matlab? Specifically, I have issues working with the empiricalQuantiles() function and need to calculate empirical quantiles of a rolling population (a matrix that is, say, 49x1025 for every 100 points).
If you can also give information on how to calculate the inverse of the empirical distribution (which should give approximately the same answer), that would be great.
% Simulating empirical data
empiricalData=randn(50000,1);
% Quantile evaluation
% For instance: Median
y = quantile(empiricalData,[.50]);
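For the inverse of the empirical distribution, one possible sketch that needs no toolbox is to sort the data and pick the smallest sample whose empirical CDF reaches the requested probability. The rolling part below is a guess at what is wanted: it assumes non-overlapping windows of 100 points along a vector, and the names p, q_inv, win and rolling_median are illustrative.

```matlab
empiricalData = randn(50000,1);          % simulated data, as above
p = [0.25 0.5 0.75];                     % probabilities of interest

% Inverse empirical CDF: smallest sample x with ECDF(x) >= p
s = sort(empiricalData);
idx = max(1, ceil(p*numel(s)));          % index into the sorted samples
q_inv = s(idx);

% Rolling version: median of each non-overlapping block of 100 points
win = 100;
nWin = floor(numel(empiricalData)/win);
blocks = reshape(empiricalData(1:nWin*win), win, nWin);   % one window per column
blocks = sort(blocks, 1);
rolling_median = blocks(ceil(0.5*win), :);                % inverse ECDF at p = 0.5
```

quantile interpolates between neighbouring order statistics, so its output can differ slightly from this step-function inverse, but the two should agree closely for large samples.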