I'm running Matlab code for kernel density estimation, i.e., [f,xi] = ksdensity(x), where x is a two-column matrix of bivariate data. The resulting output f is the density vector, while xi is the meshgrid of evaluation points, which is 30x30 by default. See the documentation here: Link.
I'm trying to increase the number of evaluation points that I receive from this code. The documentation mentions an option called 'NumPoints', but it is only applicable to univariate data. Is there an option or a way to increase the meshgrid of evaluation points for bivariate data to, say, 100x100?
You need to use the optional second input argument pts to specify the range and number of the output points in your grid. See this example in the documentation. Depending on your input data, you could specify something like this:
pts = [linspace(min(x(:,1)),max(x(:,1)),1000).' linspace(min(x(:,2)),max(x(:,2)),1000).'];
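Note that the line above gives a 1000x2 list of points along a diagonal of the data range rather than a full grid. For a true 100x100 grid, you could build the evaluation points with meshgrid; a minimal sketch, assuming x is your N-by-2 data matrix:
g1 = linspace(min(x(:,1)),max(x(:,1)),100);
g2 = linspace(min(x(:,2)),max(x(:,2)),100);
[X1,X2] = meshgrid(g1,g2);
pts = [X1(:) X2(:)]; %10000x2 matrix of evaluation points
[f,xi] = ksdensity(x,pts); %one density value per row of pts
F = reshape(f,size(X1)); %back to 100x100 for plotting
surf(X1,X2,F) %or contour(X1,X2,F)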
'NumPoints' is passed to ksdensity() as a name-value argument, e.g., [f,xi] = ksdensity(x,'NumPoints',1000) will return 1000 points in xi and f. Note, however, that this option applies to univariate data only.
I am trying to graph the polynomial fit of a 2D dataset in Matlab.
This is what I tried:
rawTable = readtable('Test_data.xlsx','Sheet','Sheet1');
x = rawTable.A;
y = rawTable.B;
figure(1)
scatter(x,y)
c = polyfit(x,y,2);
y_fitted = polyval(c,x);
hold on
plot(x,y_fitted,'r','LineWidth',2)
rawTable.A and rawTable.B are randomly generated numbers (i.e., the x dataset cannot be represented in the form x = 0:0.1:100).
The result: [plot: second-order polynomial fit]
But the result I expect looks like this (generated in Excel): [expected plot: smooth quadratic curve]
How can I graph the second-order polynomial fit in MATLAB?
I sense some confusion regarding what the output of each of those Matlab functions means, so I'll clarify. I think we need some details as well, so expect some verbosity. A quick answer, however, is available at the end.
c = polyfit(x,y,2) gives the coefficient vector of the polynomial fit. You can get fit information, such as error estimates, by following the documentation.
Name this polynomial P. In Matlab, P is actually the function P = @(x) c(1)*x.^2 + c(2)*x + c(3).
Suppose you have a single point X, then polyval(c,X) outputs the value of P(X). And if x is a vector, polyval(c,x) is a vector corresponding to [P(x(1)), P(x(2)),...].
Now, that by itself does not show what the fit looks like. As a quick hack to see something visually, you can try plot(sort(x),polyval(c,sort(x)),'r','LineWidth',2), i.e., first sort your data and plot on those x-values.
However, it is only a hack, because a) your data set may be so irregularly spaced that the connected line segments don't represent the function well, or b) evaluating on the whole of your data set is unnecessary and inefficient.
The robust and 'standard' way to plot a 2D function of known analytical form in Matlab is as follows:
Define some evenly-spaced x-values over the interval you want to plot the function on, e.g., x = 1:0.1:10 or x = linspace(0,1,100).
Evaluate the function on these x-values
Put the above two components into plot(). plot() can either show the function as sampled points or connect the points with straight line segments, which is the default.
(For step 1, 'discretization' or 'sampling' describes this process if you wish to communicate with a single word.)
So, instead of using the x in your original data set, you should do something like:
t=linspace(min(x),max(x),100);
plot(t,polyval(c,t),'r','LineWidth',2)
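In the context of the original script, the full plotting sequence would then be something like this (same variable names as in the question):
scatter(x,y)
hold on
t = linspace(min(x),max(x),100); %evenly spaced plotting grid
plot(t,polyval(c,t),'r','LineWidth',2)
hold off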
Consider the following draws for a 2x1 vector in Matlab with a probability distribution that is a mixture of two Gaussian components.
P=10^3; %number draws
v=1;
%First component
mu_a = [0,0.5];
sigma_a = [v,0;0,v];
%Second component
mu_b = [0,8.2];
sigma_b = [v,0;0,v];
%Combine
MU = [mu_a;mu_b];
SIGMA = cat(3,sigma_a,sigma_b);
w = ones(1,2)/2; %equal weight 0.5
obj = gmdistribution(MU,SIGMA,w);
%Draws
RV_temp = random(obj,P);%Px2
% Transform each component of RV_temp into a uniform in [0,1] by estimating the cdf.
RV1=ksdensity(RV_temp(:,1), RV_temp(:,1),'function', 'cdf');
RV2=ksdensity(RV_temp(:,2), RV_temp(:,2),'function', 'cdf');
Now, if we check whether RV1 and RV2 are uniformly distributed on [0,1] by doing
ecdf(RV1)
ecdf(RV2)
we can see that RV1 is uniformly distributed on [0,1] (the empirical cdf is close to the 45 degree line) while RV2 is not.
I don't understand why. It seems that the more distant mu_a(2) and mu_b(2) are, the worse the job done by ksdensity with a reasonable number of draws. Why?
When you have a mixture of N(0.5,v) and N(8.2,v), the range of the generated data is larger than if the expectations were closer together, as with the N(0,v) and N(0,v) you have in the other dimension. You then ask ksdensity to approximate a function using P points spread over this wider range.
As with standard linear interpolation, the denser the points, the better the approximation of the function (inside the range), and the same holds here. Thus in the N(0.5,v) and N(8.2,v) case, where the points are sparser, the approximation is worse than in the N(0,v) and N(0,v) case, where the points are denser.
As a small side note, is there any reason you do not apply ksdensity directly to the bivariate data? Also, I cannot reproduce your comment where you say that 5e2 points are also good. Final comment: 1e3 is typically preferred over 10^3.
I think this is simply about the number of samples you're using. For the first component, the means of the two Gaussians are relatively close, hence a thousand samples are enough to obtain a cdf really close to the U[0,1] cdf. On the second component, though, the means differ by much more, and you need more samples: with 100000 samples the empirical cdf I obtained was close to the diagonal, while with 1000 samples it was clearly farther from the uniform cdf. Try increasing the number of samples to a million and check whether the result again gets closer.
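A quick way to check this yourself, reusing obj from the question (a sketch, not from the original answers):
Pbig = 1e5; %more draws than the original 1e3
RV_big = random(obj,Pbig); %Pbig x 2
RV2_big = ksdensity(RV_big(:,2),RV_big(:,2),'function','cdf');
figure
ecdf(RV2_big)
hold on
plot([0 1],[0 1],'r--') %45-degree reference line
hold off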
I've got an arbitrary probability density function discretized as a matrix in Matlab; that is, for every pair (x,y) the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, the first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
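One caveat: interp1 requires its sample points to be strictly increasing, so the lookup above breaks if A contains zeros (which produce repeated CDF values). If you have the Statistics Toolbox, a shorter alternative that avoids this is a weighted draw of linear indices (a sketch, not part of the original answer):
ind = randsample(numel(A),N_vals,true,A); %weighted sampling of linear indices
xy_values = [row_vals(ind) col_vals(ind)];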
I hope that this helps!
Chip
I don't believe Matlab has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can easily be generated by inverting the cumulative distribution function, the inverse-CDF method does not carry over to multivariate distributions, so generating such numbers is much messier (the main problem being that two or more variables can be correlated). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and arg must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
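and, following the same pattern, higher moments such as the variances and covariance (these lines are my addition, not in the original answer):
var_x=trapz2(xv,yv,A,(x-mean_x).^2);
var_y=trapz2(xv,yv,A,(y-mean_y).^2);
cov_xy=trapz2(xv,yv,A,(x-mean_x).*(y-mean_y));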
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.
I have plotted a piecewise-defined continuous linear function comprising several oblique straight lines joined end-to-end:
x=[0,1/4,1/2,3/4,1];
oo=[1.23 2.31 1.34 5.69 7] % edit
y=[oo(1),oo(2),oo(3),oo(4),oo(5)];
plot(x,y,'g--')
I now wish to sample points from this plot itself; say I want the y corresponding to x = 0.89. How can I achieve that using Matlab? Is there a built-in function for this?
Yes, there's a built-in function for that: interp1:
vq = interp1(x,v,xq) returns interpolated values of a 1-D function at specific query points using linear interpolation. Vector x contains the sample points, and v contains the corresponding values, v(x). Vector xq contains the coordinates of the query points.
[...]
See the linked documentation for further options. For example, you can specify the interpolation method (default is linear), or whether you want to extrapolate (i.e. allow for xq values to lie outside the original x range).
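Applied to the example in the question (reusing its x and y), that would be something like:
yq = interp1(x,y,0.89) %linearly interpolated y-value at x = 0.89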
Are there any built-in functions in MATLAB that would statistically extend a sequence of real numbers so that the resulting sequence is extended to any size I want? I have a sequence of 499 elements and I want to extend it to 4096 elements. Thanks in advance.
If you want to interpolate a vector of 499 elements to a higher resolution of 4096 elements, you can use the INTERP1 function in the following way (where x is your 499-element vector):
y = interp1(x,linspace(1,499,4096));
The above uses the function LINSPACE to generate a 4096-element vector of values spaced linearly between 1 and 499, which is then used as the interpolation points. By default, the INTERP1 function uses linear interpolation to compute new values between the old points. You can use other interpolation methods in the following way:
y = interp1(x,linspace(1,499,4096),'spline'); %# Cubic spline method
y = interp1(x,linspace(1,499,4096),'pchip'); %# Piecewise cubic Hermite method
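A quick visual check of the upsampling (a sketch, assuming x is your 499-element vector):
y = interp1(x,linspace(1,499,4096));
plot(1:499,x,'b.',linspace(1,499,4096),y,'r-')
legend('original 499 samples','interpolated 4096 samples')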
I don't really understand the word "statistically" in the question, but from your comments it seems that you just need linear (or smooth) interpolation.
Try with interp1q or interp1.
If you know the distribution of the data to be in the Pearson or Johnson system of parametric distribution families, then you can generate more data using the sampling functions pearsrnd and johnsrnd (useful for generating random values without having to specify a particular parametric distribution).
Example:
%# load data, lets say this is vector of 499 elements
data = load('data.dat');
%# generate more data using pearsrnd
moments = {mean(data),std(data),skewness(data),kurtosis(data)};
newData = pearsrnd(moments{:}, [4096-499 1]);
%# concat sequences
extendedData = [data; newData];
%# plot histograms (you may need to adjust the num of bins to see the similarity)
subplot(121), hist(data), xlabel('x'), ylabel('Frequency')
subplot(122), hist(extendedData), xlabel('x'), ylabel('Frequency')
or using johnsrnd:
%# generate more data using johnsrnd
quantiles = quantile(data, normcdf([-1.5 -0.5 0.5 1.5]));
newData = johnsrnd(quantiles, [4096-499 1]);
On the other hand, if you want to assume a non-parametric distribution, you can use the ecdf function or the ksdensity function.
Please refer to the demo Nonparametric Estimates of Cumulative Distribution Functions and Their Inverses for a complete example (highly suggested!).
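For instance, a nonparametric version of the resampling above could draw from the kernel-smoothed empirical distribution via its inverse cdf (a sketch using ksdensity's 'icdf' option):
%# uniform draws in (0,1), mapped through the smoothed inverse cdf
u = rand(4096-499, 1);
newData = ksdensity(data, u, 'function', 'icdf');
extendedData = [data; newData];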