Is there any built-in functions in MATLAB that would statistically extend a sequence of real numbers so that the resulting sequence is extended to any size I want. I have a sequence of 499 elements and I want to extend it to 4096 elements. Thanks in advance.
If you're wanting to interpolate a vector of 499 elements to a higher resolution of 4096 elements, you can use the INTERP1 function in the following way (where x is your 499-element vector):
y = interp1(x,linspace(1,499,4096));
The above uses the function LINSPACE to generate a 4096-element vector of values spaced linearly between 1 and 499, which is then used as the interpolation points. By default, the INTERP1 function uses linear interpolation to compute new values between the old points. You can use other interpolation methods in the following way:
y = interp1(x,linspace(1,499,4096),'spline'); %# Cubic spline method
y = interp1(x,linspace(1,499,4096),'pchip'); %# Piecewise cubic Hermite method
I don't really understand the word "statistically" in the question, but from your comments it seems that you just need linear (or smooth) interpolation.
Try with interp1q or interp1.
If you know the distribution of the data to be in a Pearson or Johnson system of parametric family of distributions, then you can generate more data using the sampling functions pearsrnd and johnsrnd (useful in generating random values without specifying which parametric distribution)
Example:
%# load data, lets say this is vector of 499 elements
data = load('data.dat');
%# generate more data using pearsrnd
moments = {mean(data),std(data),skewness(data),kurtosis(data)};
newData = pearsrnd(moments{:}, [4096-499 1]);
%# concat sequences
extendedData = [data; newData];
%# plot histograms (you may need to adjust the num of bins to see the similarity)
subplot(121), hist(data), xlabel('x'), ylabel('Frequency')
subplot(122), hist(extendedData), xlabel('x'), ylabel('Frequency')
or using johnsrnd:
%# generate more data using johnsrnd
quantiles = quantile(data, normcdf([-1.5 -0.5 0.5 1.5]));
newData = johnsrnd(quantiles, [4096-499 1]);
On the other hand, if you want to assume a non-paramteric distribution, you can use the ecdf function or the ksdensity function.
Please refer to the demo Nonparametric Estimates of Cumulative Distribution Functions and Their Inverses for a complete example (highly suggested!).
Related
I'm running Matlab code for kernel density, i.e., [f,xi] = ksdensity(x), where x is a two column bivariate data. The resulting output f is the density vector, while xi is the meshgrid of evaluation points that is 30x30 in dimension. See the documentation here: Link.
I'm trying to increase number of evaluation points that I receive from this code. There is an option mentioned in the documentation called 'NumPoints' that is only applicable for univariate data. Is there an option or ways that I can increase the meshgrid points of evaluation points of bivariate data to, say, 100x100?
You need to use the optional second input argument pts to specify the range and number of the output points in your grid. See this example in the documentation. Depending on your input data, you could specify something like this:
pts = [linspace(min(x(:,1)),max(x(:,1)),1000).' linspace(min(x(:,2)),max(x(:,2)),1000).'];
The NumPoints is npoints in the ksdensity(). e.g., [f,xi] = ksdensity(x, 'npoints', 1000) will return 1000 points of xi and f.
I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probably density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
I don't believe matlab has built-in functionality for generating multivariate random variables with arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated based on the cumulative distribution function, the CDF does not exist for multivariate distributions, so generating such numbers is much more messy (the main problem is the fact that 2 or more variables have correlation). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as you mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and var must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.
I have plotted a piece-wise defined continuous linear function comprising of several oblique straight lines joined end-to-end:-
x=[0,1/4,1/2,3/4,1];
oo=[1.23 2.31 1.34 5.69 7] % edit
y=[oo(1),oo(2),oo(3),oo(4),oo(5)];
plot(x,y,'g--')
I now wish to sample points from this plot itself, say i want the y corresponding to x=0.89. How to achieve that using Matlab? Is there a special function in-built in Matlab?
Yes, there's a built-in function for that: interp1:
vq = interp1(x,v,xq) returns interpolated values of a 1-D function at specific query points using linear interpolation. Vector x contains the sample points, and v contains the corresponding values, v(x). Vector xq contains the coordinates of the query points.
[...]
See the linked documentation for further options. For example, you can specify the interpolation method (default is linear), or whether you want to extrapolate (i.e. allow for xq values to lie outside the original x range).
I'm probably being a little dense but I'm not very mathsy and can't seem to understand the covariance element of creating multivariate data.
I'm after two columns of random data (representing two correlated variables).
I think I am right in needing to use the mvnrnd function and I understand that 'mu' must be a column of my mean vectors. As I need 4 distinct classes within my data these are going to be (1, 1) (-1 1) (1 -1) and (-1 -1). I assume I will have to do the function 4x with a different column of mean vectors each time and then combine them to get my full data set.
I don't understand what I should put for SIGMA - Matlab help tells me that it must be 'a d-by-d symmetric positive semi-definite matrix, or a d-by-d-by-n array' i.e. a covariance matrix. I don't understand how I create a covariance matrix for numbers that I am yet to generate.
Any advice would be greatly appreciated!
Assuming that I understood your case properly, I would go this way:
data = [normrnd(0,1,5000,1),normrnd(0,1,5000,1)]; %% your starting data series
MU = mean(data,1);
SIGMA = cov(data);
Now, it should be possible to feed mvnrnd with MU and SIGMA:
r = mvnrnd(MU,SIGMA,5000);
plot(r(:,1),r(:,2),'+') %% in case you wanna plot the results
I hope this helps.
I think your aim is to generate the simulated multivariate gaussian distributed data. For example, I use
k = 6; % feature dimension
mu = rand(1,k);
sigma = 10*eye(k,k);
unit matrix by 10 times is a symmetric positive semi-definite matrix. And the gaussian distribution will be more round than other type of sigma.
then you can use it as the above example of mvnrnd function and see the plot.
I have a Matlab figure I want to use in a paper. This figure contains multiple cdfplots.
Now the problem is that I cannot use the markers because the become very dense in the plot.
If i want to make the samples sparse I have to drop some samples from the cdfplot which will result in a different cdfplot line.
How can I add enough markers while maintaining the actual line?
One method is to get XData/YData properties from your curves follow solution (1) from #ephsmith and set it back. Here is an example for one curve.
y = evrnd(0,3,100,1); %# random data
%# original data
subplot(1,2,1)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
%# reduced data
subplot(1,2,2)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
xdata = get(h,'XData');
ydata = get(h,'YData');
set(h,'XData',xdata(1:5:end));
set(h,'YData',ydata(1:5:end));
Another method is to calculate empirical CDF separately using ECDF function, then reduce the results before plotting with PLOT.
y = evrnd(0,3,100,1); %# random data
[f, x] = ecdf(y);
%# original data
subplot(1,2,1)
plot(x,f,'*')
%# reduced data
subplot(1,2,2)
plot(x(1:5:end),f(1:5:end),'r*')
Result
I know this is potentially unnecessary given MATLAB's built-in functions (in the Statistics Toolbox anyway) but it may be of use to other viewers who do not have access to the toolbox.
The empirical CMF (CDF) is essentially the cumulative sum of the empirical PMF. The latter is attainable in MATLAB via the hist function. In order to get a nice approximation to the empirical PMF, the number of bins must be selected appropriately. In the following example, I assume that 64 bins is good enough for your data.
%# compute a histogram with 64 bins for the data points stored in y
[f,x]=hist(y,64);
%# convert the frequency points in f to proportions
f = f./sum(f);
%# compute the cumulative sum of the empirical PMF
cmf = cumsum(f);
Now you can choose how many points you'd like to plot by using the reduced data example given by yuk.
n=20 ; % number of total data markers in the curve graph
M_n = round(linspace(1,numel(y),n)) ; % indices of markers
% plot the whole line, and markers for selected data points
plot(x,y,'b-',y(M_n),y(M_n),'rs')
verry simple.....
try reducing the marker size.
x = rand(10000,1);
y = x + rand(10000,1);
plot(x,y,'b.','markersize',1);
For publishing purposes I tend to use the plot tools on the figure window. This allow you to tweak all of the plot parameters and immediately see the result.
If the problem is that you have too many data points, you can:
1). Plot using every nth sample of the data. Experiment to find an n that results in the look you want.
2). I typically fit curves to my data and add a few sparsely placed markers to plots of the fits to differentiate the curves.
Honestly, for publishing purposes I have always found that choosing different 'LineStyle' or 'LineWidth' properties for the lines gives much cleaner results than using different markers. This would also be a lot easier than trying to downsample your data, and for plots made with CDFPLOT I find that markers simply occlude the stairstep nature of the lines.