Is it possible to define your own probability density function in MATLAB or Octave and use it
for generating random numbers?
MATLAB and Octave have built-in functions such as rand and randn for drawing points at random from uniform or normal distributions, but there seems to be no documentation on how to define my very own probability density function.
Sampling from an arbitrary distribution is not always trivial. For well-known distributions there are tricks to implement them, and most of them are implemented in the Statistics Toolbox, as Oli said.
If your distribution of interest is of a difficult form, there are many sampling algorithms that may help you, such as rejection sampling, slice sampling, or the Metropolis–Hastings algorithm. A rejection-sampling sketch is shown below.
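Here is a minimal rejection-sampling sketch in MATLAB/Octave; the target density f, the interval [a,b], and the bound M are all illustrative assumptions:

    % Rejection sampling from an unnormalized density f on [a,b],
    % assuming f(x) <= M everywhere on [a,b].
    f = @(x) exp(-x.^2/2) .* (1 + sin(3*x).^2);  % example target (unnormalized)
    a = -4; b = 4; M = 2;  % M = 2 bounds f: exp(-x^2/2) <= 1 and 1+sin^2 <= 2
    n = 1e4;               % number of samples wanted
    samples = zeros(n,1);
    i = 0;
    while i < n
        x = a + (b - a)*rand;    % propose uniformly on [a,b]
        if rand*M <= f(x)        % accept with probability f(x)/M
            i = i + 1;
            samples(i) = x;
        end
    end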
If your distribution is discrete, or can be approximated fairly well by a discrete distribution, then you can just do multinomial sampling using randsample, for example:
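A quick sketch with randsample (Statistics Toolbox); the values and weights here are made up:

    % Draw 1000 values from {1,2,3,4} with the given probabilities.
    v = [1 2 3 4];
    p = [0.1 0.2 0.3 0.4];
    s = randsample(v, 1000, true, p);   % true = sample with replacement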
If you have the Statistics Toolbox, you can use random(), as it supports a lot of useful distributions out of the box.
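For example (the distribution names and parameters here are just illustrations):

    % random() draws from named distributions (Statistics Toolbox).
    r1 = random('Exponential', 2, 1000, 1);   % 1000 draws with mean 2
    r2 = random('Beta', 2, 5, 1000, 1);       % 1000 Beta(2,5) draws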
I've had to do that a few times recently, and it's not exactly an easy thing to accomplish. My favorite technique was to use inverse transform sampling.
The idea is quite simple:

1. construct the CDF of your target density;
2. draw a value u from a uniform random number generator on [0, 1];
3. find the value x that your CDF maps to u, i.e. x = CDF^-1(u).

A sketch of this on a numerical grid follows.
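This is a minimal MATLAB/Octave sketch of inverse transform sampling for a density tabulated on a grid; the example density is an arbitrary placeholder:

    % Inverse transform sampling for a pdf tabulated on a grid.
    x = linspace(-5, 5, 1e4);
    f = exp(-abs(x));                 % example unnormalized pdf
    f = f / trapz(x, f);              % normalize numerically
    F = cumtrapz(x, f);               % numerical CDF
    [F, idx] = unique(F);             % interp1 needs strictly increasing F
    x = x(idx);
    u = rand(1e4, 1);                 % uniform draws on (0,1)
    samples = interp1(F, x, u);       % invert the CDF by interpolation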
I am trying to fit a custom distribution to a large dataset (~500,000 measurements) using SciPy. I have derived a theoretical PDF based on some other factors, but neither by hand nor with symbolic integration software can I find an exact form of the CDF.

Currently, simply evaluating 1000 random samples from my custom distribution is expensive, which I believe is due to the need to invert an unknown CDF. If I cannot find an explicit form of the CDF and its inverse, is there anything else I can do to speed up use of this distribution?

I've used Maple, MATLAB, and SymPy to try to determine a CDF, yet none gives a result. I also tried down-sampling my data while still retaining the tail attributes, but this still required so much data that doing anything with the distribution was slow.
My distribution is a sub-class of SciPy's rv_continuous class.
Thanks for any advice.
This sounds like you want to sample from a kernel density estimate of the probability distribution. While SciPy does offer a Gaussian kernel package, for that many measurements you would be much better off using sklearn's implementation. A good resource with code examples can be found on Jake VanderPlas's blog.
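The underlying idea is language-agnostic: sampling from a Gaussian KDE amounts to picking one of your data points at random and jittering it with kernel-bandwidth noise. A sketch in MATLAB/Octave, to match the code elsewhere on this page (the data and bandwidth are placeholders):

    % Sampling from a Gaussian KDE: a mixture of Gaussians at the data points.
    data = [randn(500,1); 4 + randn(500,1)];  % placeholder data
    h = 0.4;                                  % kernel bandwidth (assumed chosen)
    n = 1000;
    idx = randi(numel(data), n, 1);           % pick source points uniformly
    samples = data(idx) + h*randn(n, 1);      % add Gaussian kernel noise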
The pdf for the multivariate normal distribution in MATLAB is mvnpdf(...). What about the case where multiple variables are uniformly distributed: Is there a function to describe their joint distribution analogous to the multivariate normal distribution? If there is no such function, is there a trick to handle this case?
The simplest way for several variables to be uniformly distributed is for them to be mutually independent; in that case you simply have a uniform distribution over a hypercube in the space spanned by the variables. To get samples from this distribution, you just generate samples for each of the variables separately, as below.
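In MATLAB/Octave this is a one-liner (the dimensions here are arbitrary):

    % 1000 samples from the uniform distribution on the unit cube in 3-D;
    % each column is an independent U(0,1) coordinate.
    samples = rand(1000, 3);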
The point where a "trick" might be necessary is if you have dependencies between the variables even though the marginal distribution of each of them is still uniform. In this case you have to describe the dependency structure, and I'm not aware of any standard way to do this (analogous to the way dependencies between normally distributed variables are described by a correlation matrix).
Of course such distributions exist. For two dimensions, one possibility is a joint distribution that looks like a solution to the "eight rooks" problem: all mass placed, in equal parts, on the cells of a permutation matrix, so that every row and every column receives the same total probability. Another one derives from the introductory MATLAB example, the magic square: normalize the square so that its entries sum to 1 and use it as a joint probability table.
Both of these examples are discrete distributions, but can be produced at arbitrary granularity, or simply interpreted as piecewise constant continuous distributions.
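The magic-square construction can be checked in a few lines (a minimal sketch):

    % A discrete joint distribution with uniform marginals from magic(8):
    % every row and column of magic(8) sums to the same constant, so after
    % normalization each marginal is uniform (all masses equal 1/8).
    M = magic(8);
    P = M / sum(M(:));      % joint pmf on an 8x8 grid
    sum(P, 1)               % column marginals: all 1/8
    sum(P, 2)'              % row marginals: all 1/8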
As you can see, there are many possibilities for a multivariate distribution whose marginal distributions are each uniform. The question you have to answer for yourself is what kind of dependencies, if any, you are interested in.
If I'm understanding the question properly, we want to calculate the pdf of a multivariate uniform distribution. By definition, the pdf is constant for all values in the support of the distribution. Thus to calculate the pdf, all that is required is to calculate the normalizing constant, which is given by the inverse of the volume of the support. That is to say, the pdf is given by

f(x) = 1 / volume(A)

where A is the support set and x is an element of A. If an analytic expression for volume(A) is not available, a numerical integrator or a Monte Carlo estimate can be employed, as sketched below.
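A Monte Carlo sketch of that normalizing constant; the region (a unit disk) is just an assumed example:

    % Estimate the area of A = unit disk inside the box [-1,1]^2;
    % the constant pdf value on A is then 1/area.
    N = 1e6;
    xy = 2*rand(N, 2) - 1;                  % uniform points in the box
    area = 4 * mean(sum(xy.^2, 2) <= 1);    % box area times hit fraction (~pi)
    pdfval = 1 / area;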
Is it possible to fit a curve to the histogram above in MATLAB?
The histogram is not normalized or anything like that.
I know that there is a function called histfit, but can I use it here?
Try this FileExchange submission:
ALLFITDIST - Fit all valid parametric probability distributions to data.
--- UPDATE ---
ALLFITDIST is no longer available on the MATLAB File Exchange.
You can try this instead:
FITMETHIS - finds best-fitting distribution to data vector, including non-parametric.
If you know the underlying distribution (e.g. skewed Gaussian, etc.), you can manually do a maximum likelihood estimate of the distribution's parameters and then plot the resulting density on top of your histogram. However, you need to normalize your histogram so that it shows empirical probabilities rather than raw counts.
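A sketch of that workflow, assuming (for illustration only) a plain normal model; normfit and normpdf are from the Statistics Toolbox:

    % Normalize the histogram to a density and overlay an MLE normal fit.
    data = randn(1000, 1);                    % placeholder data
    histogram(data, 'Normalization', 'pdf'); hold on
    [muHat, sigmaHat] = normfit(data);        % maximum likelihood estimates
    xg = linspace(min(data), max(data), 200);
    plot(xg, normpdf(xg, muHat, sigmaHat), 'LineWidth', 2); hold off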
I think what you want is to fit a distribution, not an arbitrary curve that might not have a finite area under it. The data look censored on the right tail, but overall they may fit a lognormal or Gamma distribution pretty well. If you have the Statistics Toolbox, try gamfit or lognfit for starters.
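Both fits are one-liners (Statistics Toolbox; the data vector is assumed positive):

    % Maximum likelihood fits for two candidate right-skewed models.
    phatGamma = gamfit(data);    % [shape, scale] of a Gamma fit
    phatLogn  = lognfit(data);   % [mu, sigma] of log(data) for a lognormal fit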
See also kernel density estimation: http://en.wikipedia.org/wiki/Kernel_density
I am trying to learn kernel density estimation from the basics. Does anyone have a simple routine for 1-D KDE? That would be a great help. Thanks.
If you have the Statistics Toolbox in MATLAB, you can use ksdensity to estimate the pdf/cdf using kernel smoothing. Here's an example:
    data = [randn(2000,1); 4 + randn(2000,1)];   % create a bimodal Gaussian distribution
    x = linspace(-4, 8, 1e4);                    % points at which to evaluate the density
    pF = ksdensity(data, x, 'function', 'pdf');  % evaluate the pdf of the data points
If you plot it, e.g. plot(x,pF), you should see a smooth bimodal density with modes near 0 and 4.
You can also get the cumulative distribution or the inverse cumulative distribution, or change the kernel that is used. You can look up the list of options from the link provided. This should help you get started :)
I am using Perl to model a random variable (Y) which is the sum of some ~15-40k independent Bernoulli random variables (X_i), each with a different success probability (p_i). Formally, Y=Sum{X_i} where Pr(X_i=1)=p_i and Pr(X_i=0)=1-p_i.
I am interested in quickly answering queries such as Pr(Y<=k) (where k is given).
Currently, I use random simulations to answer such queries: I randomly draw each X_i according to its p_i, then sum all X_i values to get Y'. I repeat this process a few thousand times and return the fraction of runs in which Y' <= k.
Obviously, this is not totally accurate, although accuracy greatly increases as the number of simulations I use increases.
Can you think of a reasonable way to get the exact probability?
First, I would avoid using the rand built-in for this purpose which is too dependent on the underlying C library implementation to be reliable (see, for example, my blog post pointing out that the range of rand on Windows has cardinality 32,768).
To use the Monte Carlo approach, I would start with a known-good random generator, such as Rand::MersenneTwister, or just use one of Random.org's services, and pre-compute a CDF for Y, assuming Y is pretty stable. If each Y is only used once, pre-computing the CDF is obviously pointless.
To quote Wikipedia:
In probability theory and statistics, the Poisson binomial distribution is the discrete probability distribution of a sum of independent Bernoulli trials.
In other words, it is the probability distribution of the number of successes in a sequence of n independent yes/no experiments with success probabilities p1, …, pn. (emphasis mine)
Closed-Form Expression for the Poisson-Binomial Probability Density Function might be of interest. The article is behind a paywall:
and we discuss several of its advantages regarding computing speed and implementation and in simplifying analysis, with examples of the latter including the computation of moments and the development of new trigonometric identities for the binomial coefficient and the binomial cumulative distribution function (cdf).
As far as I recall, shouldn't this end up asymptotically as a normal distribution? See also this newsgroup thread: http://newsgroups.derkeiler.com/Archive/Sci/sci.stat.consult/2008-05/msg00146.html
If so, you can use Statistics::Distrib::Normal.
To obtain the exact solution, you can exploit the fact that the probability distribution of a sum of two or more independent random variables is the convolution of their individual distributions. Convolution is a bit expensive, but it has to be recalculated only when the p_i change.

Once you have the probability distribution, you can easily obtain the CDF by taking the cumulative sum of the probabilities; see the sketch below.
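A sketch of this exact computation, written in MATLAB/Octave to match the code elsewhere on this page (the same loop translates directly to Perl arrays); the p_i are placeholders:

    % Exact Poisson-binomial pmf by iterated convolution: start with the
    % distribution of an empty sum and fold in one Bernoulli(p_i) at a time.
    p = rand(1, 1000);                      % placeholder success probabilities
    pmf = 1;                                % P(Y = 0) = 1 for an empty sum
    for i = 1:numel(p)
        pmf = conv(pmf, [1 - p(i), p(i)]);  % pmf(j) = P(Y = j-1)
    end
    cdf = cumsum(pmf);                      % Pr(Y <= k) is cdf(k+1)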