Empirical quantiles in MATLAB

Does anyone know how to calculate the empirical quantiles of a distribution in MATLAB? Specifically, I have trouble working with the empiricalQuantiles() function and need to calculate empirical quantiles of a rolling population (a matrix that is, say, 49x1025 for every 100 points).
If you can also explain how to calculate the inverse of the empirical distribution (which should give approximately the same answer), that would be great.

% Simulating empirical data
empiricalData=randn(50000,1);
% Quantile evaluation
% For instance: Median
y = quantile(empiricalData,[.50]);
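For the rolling case in the question, one possible approach is to slide a window over the data and call quantile on each block. The following is only a sketch under assumptions the question does not state: the data are stored as a 1025-by-49 matrix X with time along the rows, the window is 100 rows long and advances one row at a time, and each window is pooled into a single population. The names X, p, win, and Q are illustrative.

```matlab
% Hypothetical data: 1025 time points (rows) by 49 series (columns)
rng(0);
X = randn(1025, 49);

p    = [0.05 0.50 0.95];          % probabilities of interest (assumed)
win  = 100;                       % window length in rows (assumed)
nWin = size(X, 1) - win + 1;      % number of window positions

Q = zeros(nWin, numel(p));        % one row of quantiles per window
for t = 1:nWin
    block = X(t:t+win-1, :);      % current 100-by-49 window
    % Pool the whole window into one sample and take its quantiles
    Q(t, :) = reshape(quantile(block(:), p), 1, []);
end
```

If each column should be treated separately instead of pooled, quantile(block, p, 1) returns one set of quantiles per column.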

Related

Inverse cumulative distribution function in MATLAB given an empirical PDF

Is it possible to generate random numbers with a distribution that depends on empirical probability data? I imagine this is possible by taking the inverse cumulative distribution function. I have seen some examples where this is done in MATLAB (the software that I'm using) but all of those examples have an underlying analytic form for the probability. Here I have only the PDF. For instance, I have data of probabilities for a particular event. Most of the probabilities are zero and hence not unique, but not all.
My goal is to generate the random numbers and then figure out what the distribution is. I'd really appreciate if people can help clear up my thinking here.
EDIT:
I think I want something like:
cdf=cumsum(pdf); % calculate cdf from empirical pdf
M=length(cdf);
xq=linspace(0,1,M);
invcdf=interp1(cdf,xq,xq); % calculate inverse cdf, i.e., x
but how do I take into account that a lot of the values of the pdf are zero and not unique? Is this even the right approach?
I am basing my answer on "Inverse empirical cumulative distribution function" from the MathWorks File Exchange. See that link for other suggestions for solving your problem.
% y - input: data set
% q - input: desired quantile (can be a scalar or a vector)
% xq - output: ICDF at specified quantile
[f, x] = ecdf(y);
xq = zeros(size(q));
for ii = 1:length(q)
    xq(ii) = min(x(q(ii) <= f));
end
I'd eliminate the for loop if you're only using scalars. Also, there may be a more efficient way to vectorize the for loop, but this should at least get you started.
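For the original goal of drawing random numbers from an empirical PDF with many zero entries, inverse-transform sampling on the discrete CDF may be enough, and it does not require the CDF values to be unique. This is a minimal sketch, not the File Exchange method: the support x and probabilities p are made-up illustration values, and the index is found by counting how many CDF entries lie strictly below each uniform draw.

```matlab
rng(1);
x = 1:6;                          % hypothetical support points
p = [0 0.2 0 0 0.5 0.3];          % hypothetical empirical probabilities (many zeros)

cdf = cumsum(p) / sum(p);         % discrete CDF, flat wherever p is zero
cdf(end) = 1;                     % guard against rounding in the last entry

u = rand(1e5, 1);                 % uniform draws in (0,1)
% For each draw, count CDF entries strictly below it; the next index is
% the first support point whose CDF value reaches the draw.
idx = sum(bsxfun(@gt, u, cdf), 2) + 1;
samples = x(idx);                 % zero-probability points are never selected
```

Repeated CDF values are harmless here because a flat stretch of the CDF simply maps to the first point after it that carries probability mass.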

Approximate continuous probability distribution in Matlab

Suppose I have a continuous probability distribution, e.g., Normal, on a support A. Suppose that there is a Matlab code that allows me to draw random numbers from such a distribution, e.g., this.
I want to build a Matlab code to "approximate" this continuous probability distribution with a probability mass function spanning over r points.
This means that I want to write a Matlab code to:
(1) Select r points from A. Let us call these points a1,a2,...,ar. These points will constitute the new discretised support.
(2) Construct a probability mass function over a1,a2,...,ar. This probability mass function should "well" approximate the original continuous probability distribution.
Could you help by providing also an example? This is a similar question asked for Julia.
Here are some of my thoughts. Suppose that the continuous probability distribution of interest is one-dimensional. One way to go could be:
(1) Draw 10^6 random numbers from the continuous probability distribution of interest and store them in a column vector D.
(2) Suppose that r=10. Compute the 10th, 20th, ..., 90th percentiles of D. Find the median of the points falling in each of the 10 bins obtained. Call these median points a1,...,ar.
How can I construct the probability mass function from here?
Also, how can I generalise this procedure to more than one dimension?
Update using histcounts: I thought about using histcounts. Do you think it is a valid option? For many dimensions I can use this.
clear
rng default
%(1) Draw P random numbers from the standard normal distribution
P=10^6;
X = randn(P,1);
%(2) Apply histcounts
[N,edges] = histcounts(X);
%(3) Construct the new discrete random variable
%(3.1) The support of the discrete random variable is the collection of the midpoints of each bin
supp=zeros(size(N,2),1);
for j=2:size(N,2)+1
    supp(j-1)=(edges(j)-edges(j-1))/2+edges(j-1);
end
%(3.2) The probability mass function of the discrete random variable is the
%number of X within each bin divided by P
pmass=N/P;
%(4) Check if the approximation is OK
%(4.1) Find the CDF of the discrete random variable
CDF_discrete=zeros(size(N,2),1);
for h=2:size(N,2)+1
    CDF_discrete(h-1)=sum(X<=edges(h))/P;
end
%(4.2) Plot empirical CDF of the original random variable and CDF_discrete
ecdf(X)
hold on
scatter(supp, CDF_discrete)
I don't know if this is what you're after, but maybe it can help. Recall that P(X = x) = 0 for any point x of a continuous probability distribution; that is, the pointwise probability of X mapping to x is infinitesimally small and thus regarded as 0.
What you could do instead, to approximate it by a discrete probability space, is define some points (x_1, x_2, ..., x_n) and let their discrete probabilities be the integral of the PDF (of your continuous probability distribution) over some range, that is
P(x_1) = P(X \in (-\infty, x_1_end)), P(x_2) = P(X \in (x_1_end, x_2_end)), ..., P(x_n) = P(X \in (x_(n-1)_end, +\infty))
:-)
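The binning idea above can be sketched for a standard normal when the CDF is available in closed form, so no simulation is needed. This is only an illustration under assumptions: r = 10 equal-probability bins, the bin median as the representative point, and the Statistics Toolbox functions norminv and normcdf. The names pEdges, edges, pmass, and supp are illustrative.

```matlab
r = 10;                                   % number of support points (assumed)

pEdges = linspace(0, 1, r + 1);           % bin edges in probability space
edges  = norminv(pEdges);                 % bin edges on the real line, -Inf to +Inf

% Probability mass of each bin: integral of the PDF over the bin
pmass = diff(normcdf(edges));             % equals 1/r for every bin here

% Representative point of each bin: the bin median
supp = norminv((pEdges(1:end-1) + pEdges(2:end)) / 2);
```

Unequal bins work the same way: choose any increasing edges that start at -Inf and end at +Inf, and diff(normcdf(edges)) still sums to one.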

Non parametric estimate of cdf in Matlab

I have a vector A in MATLAB of dimension Nx1. I want to get a non-parametric estimate of the CDF at each point in A and store all the values in a vector B of dimension Nx1. What options do I have?
I have read about ecdf and ksdensity, but it is not clear to me what the differences, pros, and cons are. Any direction would be appreciated.
This doesn't exactly answer your question, but you can compute the empirical CDF very simply:
A = randn(1,1e3); % example Gaussian data
x_cdf = sort(A);
y_cdf = (1:numel(A))/numel(A);
plot(x_cdf, y_cdf) % plot CDF
This works because, by definition, each sample contributes to the (empirical) CDF with an increment of 1/N. That is, for values smaller than the minimum sample the CDF equals 0; for values between the minimum sample and the next highest sample it equals 1/N, etc.
The advantage of this approach is that you know exactly what is being done.
If you need to evaluate the empirical CDF at prescribed x-axis values:
A = randn(1,1e3); % example Gaussian data
x_cdf = -5:.1:5;
y_cdf = sum(bsxfun(@le, A(:), x_cdf), 1)/numel(A);
plot(x_cdf, y_cdf) % plot CDF
If you have prescribed y-axis values, the corresponding x-axis values are by definition the quantiles of the (empirical) distribution:
A = randn(1,1e3); % example Gaussian data
y_cdf = 0:.01:1;
x_cdf = quantile(A, y_cdf);
plot(x_cdf, y_cdf) % plot CDF
You want ecdf, not ksdensity.
ecdf computes the empirical distribution function of your data set. This converges to the cumulative distribution function of the underlying population as the sample size increases.
ksdensity computes a kernel density estimation from your data. This converges to the probability density function of the underlying population as the sample size increases.
The PDF tells you how likely you are to get values near a given value. It wiggles up and down over your domain, going up near more likely values and falling near less likely values. The CDF tells you how likely you are to get values below a given value. So it always starts at zero at the left end of your domain and increases monotonically to one at the right end of your domain.
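To get the Nx1 vector B asked for in the question, both estimates can be evaluated at the sample points themselves. The sketch below is one way to do it, assuming the Statistics Toolbox for ksdensity; the example data and the names B_ecdf and B_ks are illustrative. The first estimate is the step-function empirical CDF, the second is a smoothed CDF obtained from the kernel estimate through the 'Function','cdf' option.

```matlab
rng(2);
A = randn(500, 1);                         % example data, N-by-1

% Empirical CDF at each sample: fraction of samples <= A(i)
B_ecdf = sum(bsxfun(@le, A.', A), 2) / numel(A);

% Kernel-smoothed CDF evaluated at the same points
B_ks = ksdensity(A, A, 'Function', 'cdf');
```

B_ecdf jumps by 1/N at every sample and reaches exactly 1 at the largest one, while B_ks is smooth and depends on the kernel bandwidth.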

How to calculate the spectral density of a matrix of data using MATLAB

I do not work in signal processing, but in my area I need the spectral density of a matrix of data. I get quite confused at the level of detail.
%matrix H is given.
corr=xcorr2(H); %get the correlation
spec=fft2(corr); % Wiener-Khinchin Theorem
In MATLAB, xcorr2 calculates the correlation function of this matrix. The lag ranges from -N+1 to N-1, so if the size of matrix H is N by N, then the size of corr is 2N-1 by 2N-1. For discretized data, should I use all of corr or only half of it?
Another problem: I think the Wiener-Khinchin theorem is basically stated for continuous functions. I have always thought of the discretized FT as an approximation to the continuous FT, or, if you like, as a tool for calculating it. If you use the MATLAB built-in function 'fft', you should divide the final result by \delta x.
Could any kind soul who knows this area well share some MATLAB code with me?
Basically, approximating a continuous FT by a Discretized FT is the same as approximating an integral by a finite sum.
We will first discuss the 1D case, then we'll discuss the 2D case.
Let's look at the Wiener-Khinchin theorem (for example here).
It states that:
"For the discrete-time case, the power spectral density of the function with discrete values x[n] is
S(f) = \sum_{k=-\infty}^{+\infty} r_{xx}[k] e^{-j 2\pi k f},
where
r_{xx}[k] = E[ x[n] x^*[n-k] ]
is the autocorrelation function of x[n]."
1) You can see already that the sum is taken from -infty to +infty in the calculation of S(f)
2) Now considering the Matlab fft - You can see (command 'edit fft' in Matlab), that it is defined as :
X(k) = sum_{n=1}^N x(n)*exp(-j*2*pi*(k-1)*(n-1)/N), 1 <= k <= N.
which is exactly the operation you need in order to calculate the power spectral density at a frequency f.
Note that, for continuous functions, S(f) will be a continuous function. For discretized functions, S(f) will be discrete.
Now that we know all that, it can easily be extended to the 2D case. Indeed, the structure of fft2 matches the structure of the right-hand side of the Wiener-Khinchin theorem for the 2D case.
However, you will need to divide your result by N*M, where N is the number of sample points in x and M is the number of sample points in y.
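A small numerical check can make the 2D statement concrete. The sketch below rests on several assumptions: H is a small random test matrix, the full (2N-1)-by-(2M-1) autocorrelation is used rather than half of it, and conv2 is swapped in for xcorr2 so that no toolbox is needed (conv2(H, rot90(conj(H),2)) gives the same array as xcorr2(H)). It compares the FFT of the autocorrelation with the squared magnitude of the zero-padded FFT of H, both divided by N*M.

```matlab
rng(3);
H = randn(8, 6);                           % hypothetical data matrix
[N, M] = size(H);

% Full 2D autocorrelation, zero lag at the centre element (N, M)
c = conv2(H, rot90(conj(H), 2));

% Move zero lag to index (1,1) before the FFT, then normalise by N*M
S_corr = real(fft2(ifftshift(c))) / (N * M);

% Direct route: squared magnitude of the FFT of H zero-padded to the same size
S_per = abs(fft2(H, 2*N - 1, 2*M - 1)).^2 / (N * M);

maxDiff = max(abs(S_corr(:) - S_per(:)));  % should be at rounding level
```

The ifftshift step matters: without it the zero lag sits at the centre of c and the FFT picks up a linear phase factor, so the result is no longer real.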

How do I compute the Inverse gaussian distribution from given CDF?

I want to compute the parameters mu and lambda for the Inverse Gaussian Distribution given the CDF.
By 'given the CDF' I mean that I am given the data AND the (estimated) quantile for the data, i.e.
Quantile - Value
0.01 - 10
0.5 - 12
0.7 - 13
Now I want to find the inverse Gaussian distribution for this data so that I can, e.g., look up the quantile for the value 11 based on my distribution.
How can I find out the values mu and lambda?
The only solution I can think of is using Gradient descent to find the best mu and lambda using RMSE as an error measure.
Isn't there a better solution?
Comment: Matlab's MLE-Algorithm is not an option, since it does not use the quantile data.
Since all you really want to do is estimate the quantiles of the distribution at unknown values, and you have a lot of data points, you can simply interpolate the values you want to look up.
quantile_estimate = interp1(values, quantiles, value_of_interest);
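As an illustration, here is the lookup applied to the three rows of the table in the question. With so few points, linear interpolation is crude; the real data set would supply many more (value, quantile) pairs.

```matlab
values    = [10 12 13];           % values from the question's table
quantiles = [0.01 0.5 0.7];       % their estimated quantile levels
value_of_interest = 11;

% Linear interpolation between the tabulated points
quantile_estimate = interp1(values, quantiles, value_of_interest);
% 11 lies halfway between 10 and 12, so the result is 0.01 + 0.49/2 = 0.255
```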
Following @mpiktas's suggestion here, I implemented a gradient descent algorithm to estimate mu and lambda:
(1) Make an initial guess using MLE.
(2) Learn mu and lambda using gradient descent with RMSE as the error measure.
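The quantile-matching fit can be sketched in a few lines. Note that this swaps in fminsearch (a derivative-free Nelder-Mead search) for gradient descent and uses a rough starting guess instead of MLE, so it is not the implementation described above. The inverse Gaussian CDF is written out from its standard closed form; the helper names Phi, igcdf, and obj, and the starting point t0, are assumptions made for illustration.

```matlab
vals = [10 12 13];                % values from the question
qs   = [0.01 0.5 0.7];            % their estimated quantile levels

% Standard normal CDF via erfc (base MATLAB)
Phi = @(z) 0.5 * erfc(-z / sqrt(2));

% Closed-form inverse Gaussian CDF with mean mu and shape lam
igcdf = @(x, mu, lam) Phi(sqrt(lam ./ x) .* (x ./ mu - 1)) + ...
        exp(2 * lam ./ mu) .* Phi(-sqrt(lam ./ x) .* (x ./ mu + 1));

% RMSE between model CDF and target quantile levels; optimise over
% log-parameters so that mu and lambda stay positive
obj = @(t) sqrt(mean((igcdf(vals, exp(t(1)), exp(t(2))) - qs).^2));

t0   = log([median(vals), 100 * median(vals)]);   % rough start (assumed)
tHat = fminsearch(obj, t0);

mu     = exp(tHat(1));
lambda = exp(tHat(2));
q11    = igcdf(11, mu, lambda);   % estimated quantile level of the value 11
```

The closed form multiplies a very large exponential by a very small tail probability, so it can overflow when lambda/mu is large; for such cases a numerically safer CDF routine would be needed.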
The following article explains in detail how to compute quantiles (the inverse CDF) for the inverse Gaussian distribution:
Giner, G, and Smyth, GK (2016). statmod: probability calculations for the inverse Gaussian distribution. R Journal. http://arxiv.org/abs/1603.06687
Code for the R language is contained in the R package statmod available from CRAN. For example:
> library(statmod)
> qinvgauss(0.01, lower.tail=FALSE)
[1] 4.98
computes the 0.01 upper tail quantile of the standard IG distribution.