Fitting a distribution to data - MATLAB

I am trying to fit a distribution to some data I've collected from microscopy images. We know that the peak at about 152 is due to a Poisson process. I'd like to fit a distribution to the large density in the center of the image, while ignoring the high intensity data. I know how to fit a Normal distribution to the data (red curve), but it doesn't do a good job of capturing the heavy tail on the right. Although the Poisson distribution should be able to model the tail to the right, it doesn't do a very good job either (green curve), because the mode of the distribution is at 152.
PD = fitdist(data, 'poisson');
The Poisson distribution with lambda = 152 looks very Gaussian-like.
Does anyone have an idea how to fit a distribution that will do a good job of capturing the right-tail of the data?
Link to an image showing the data and my attempts at distribution fitting.

The distribution looks a bit like an ex-Gaussian (see the green line in the first Wikipedia figure), that is, the distribution of the sum of an independent normal and an exponential random variable.
On a side note, are you aware that, although the event counts of a Poisson process are Poisson distributed, the waiting times between the events are exponentially distributed? If Gaussian noise is added to your measurement, an ex-Gaussian distribution is theoretically possible. (Of course, this does not mean that it is also plausible.)
A tutorial on fitting the ex-Gaussian with MATLAB can be found in
Lacouture Y, Cousineau D. (2008)
How to use MATLAB to fit the ex‐Gaussian and other probability functions to a distribution of response times.
Tutorials in Quantitative Methods for Psychology 4 (1), p. 35‐45.
http://www.tqmp.org/Content/vol04-1/p035/p035.pdf
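As a rough illustration of what such a fit involves, here is a minimal sketch of a maximum-likelihood ex-Gaussian fit that uses only base MATLAB functions (fminsearch, erfc, erfcx). It is not the code from the tutorial: the synthetic data, the starting values and all variable names are assumptions made for this example.

```matlab
% Synthetic stand-in for the measured intensities (hypothetical values):
% normal(150, 5) noise plus an exponential component with mean 20.
n    = 5000;
data = 150 + 5*randn(n, 1) - 20*log(rand(n, 1));

% log(erfc(z)) written so that it neither underflows for large positive z
% (erfcx branch) nor overflows for large negative z (erfc branch).
logerfc = @(z) log(erfcx(max(z, 0))) - max(z, 0).^2 + log(erfc(min(z, 0)));

% Log-density of the ex-Gaussian with parameters mu, s (sigma), t (tau).
logpdf = @(x, mu, s, t) -log(2*t) + s^2/(2*t^2) - (x - mu)/t ...
                        + logerfc((mu + s^2/t - x) ./ (sqrt(2)*s));

% Negative log-likelihood; p = [mu, log(sigma), log(tau)] keeps sigma and
% tau positive during the unconstrained search.
nll = @(p) -sum(logpdf(data, p(1), exp(p(2)), exp(p(3))));

% Rough starting values taken from the sample mean and standard deviation.
s0 = std(data);
p0 = [mean(data) - 0.8*s0, log(0.6*s0), log(0.8*s0)];

pHat     = fminsearch(nll, p0);
muHat    = pHat(1);
sigmaHat = exp(pHat(2));
tauHat   = exp(pHat(3));
```

The fitted density on a grid xg is then exp(logpdf(xg, muHat, sigmaHat, tauHat)), which can be drawn over a normalized histogram of the data. To ignore the high-intensity values, one option is to drop them from data before fitting, although that biases the tail estimate.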

Take a look at this: http://blogs.mathworks.com/pick/2012/02/10/finding-the-best/
It reviews the following File Exchange (FEX) submission about fitting distributions: http://www.mathworks.com/matlabcentral/fileexchange/34943

Related

Matlab Random Distribution Generation without overlapping

I need to generate fibers in a certain size of box or beam. The distribution will be random and without overlapping. The algorithm is shown in the image below along with the result.
I am able to generate a random distribution in MATLAB, but I can't figure out how to avoid overlapping as shown in the algorithm. I will use the result in the Ansys simulation software for analysis.
I took the algorithm from another reference but modified the parameters as follows: fiber length 12 mm, fiber diameter 35 µm, box size 40 mm x 40 mm x 160 mm, fiber volume fraction 2%, and around 443500 fibers within the box.
The coding is beyond my expertise. Can anyone help me write the code for this algorithm in MATLAB?
I used the attached algorithm for fiber insertion without overlap, and it works.

Fitting a gaussian to data with Matlab

I want to produce a figure like the following one (found in a paper)
I think it is done using histfit
However, histfit doesn't really work with my data. The bars exceed the curve. My data is not really normally distributed, but I want all the bins to be inside the curve except some outliers. Is there any way to fit a Gaussian and plot it like in the above figure?
Edit
This is what histfit(data) has given
I want to fit a Gaussian to it and keep some values as outliers. I need to use only a normal distribution, as it is going to be used in a Kalman filter based on the assumption that the data is normally distributed. The fact that it is not really normally distributed will certainly affect the performance of the filter, but I have to feed it first with the parameters of a normal distribution, i.e. mean and std.
I'm not sure you understand how a fit works. If your data is roughly Gaussian, the function plots the curve fitted to the values; some bars will be above it and some below. It all depends on the parameters estimated from the entire data set (histfit fits the normal by estimating the mean and standard deviation of the data, not by minimizing least squares against the bars). You can't force the fit to look different; this is the result of the fitting process. If your data is not normally distributed, the goodness of fit is poor. Without more info or data, this is the best I can answer :)
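If the goal is only to get a mean and std for the Kalman filter that are not dragged around by a few outliers, one possible approach (an assumption on my side, not something histfit does) is to use robust estimates: the median for the center and the rescaled median absolute deviation (MAD) for the spread. A minimal sketch in base MATLAB, with synthetic data and hypothetical variable names:

```matlab
% Synthetic stand-in for the measurements: normal quantiles (mean 5, std 2)
% plus a few gross outliers. All values here are hypothetical.
p    = ((1:2000)' - 0.5) / 2000;
data = [5 + 2*sqrt(2)*erfinv(2*p - 1); 40*ones(40, 1)];

% Classical estimates, which is what a plain normal fit reports.
muAll    = mean(data);
sigmaAll = std(data);

% Robust estimates: the median, and the median absolute deviation (MAD)
% rescaled by 1.4826 so that it matches the std for normal data.
muRob    = median(data);
sigmaRob = 1.4826 * median(abs(data - muRob));

% Flag points more than 3 robust standard deviations from the center.
outlierMask = abs(data - muRob) > 3*sigmaRob;
```

Here muRob and sigmaRob stay close to the parameters of the bulk of the data, while muAll and sigmaAll are inflated by the outliers. The 3-sigma cutoff is an arbitrary choice for the example.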

Resampling data with minimal loss of information in time-domain

I am trying to resample/recreate already recorded data for plotting purposes. I thought this is the best place to ask the question (besides dsp.se).
The data is sampled at a high frequency and contains too many data points to plot in the time domain (not enough memory). I want to resample it with minimal loss. The sampling interval of the resulting data doesn't need to be uniform (again, it is for plotting purposes, not analysis), although the input data is equally sampled.
When we use the regular resample command from MATLAB/Octave, it can distort stiff pieces of the curve.
What is the best approach here?
For reference, I put two pictures (found on tex.se):
The first image is a regular resample.
The second image is better-resampled data that behaves well around peaks.
You should try this set of files from the File Exchange. It computes an optimal lookup table based on either a maximum number of points or a given error. You can choose from natural, linear, or spline for the interpolation method. Spline will have the smallest table size but is slower than linear. I don't use natural unless I have a really good reason.
Sincerely,
Jason
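A different and simpler technique than the lookup-table approach above is min/max decimation: split the signal into bins and keep only the smallest and largest sample of each bin, so peaks survive in the plot. This is a minimal sketch under assumed values (synthetic signal, bin size of 1000 samples); it is not the File Exchange code.

```matlab
% Synthetic stand-in for the recorded signal: a slow sine with one narrow spike.
n = 100000;
t = (0:n-1)' / 1000;
y = sin(2*pi*0.5*t);
y(50000) = 5;                        % single-sample peak that must survive

binSize = 1000;                      % samples per plotting bin (assumed value)
nBins   = floor(n / binSize);        % trailing samples beyond nBins*binSize are dropped
Y       = reshape(y(1:nBins*binSize), binSize, nBins);   % one column per bin

% Row index of the minimum and of the maximum inside each bin.
[~, iMin] = min(Y, [], 1);
[~, iMax] = max(Y, [], 1);

% Convert to indices into y; unique() sorts them and drops duplicates.
offset = (0:nBins-1) * binSize;
idx    = unique([iMin + offset, iMax + offset]);

tDec = t(idx);
yDec = y(idx);

plot(tDec, yDec);
```

The result has at most two points per bin and a non-uniform time axis, which is fine for plotting but not meant for further analysis.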

Histogram computational efficiency

I am trying to plot a 2 GB matrix using MATLAB hist on a computer with 4 GB RAM. The operation is taking hours. Are there ways to increase the performance of the computation, by pre-sorting the data, pre-determining bin sizes, breaking the data into smaller groups, deleting the raw data as the data is added to bins, etc?
Also, after the data is plotted, I need to adjust the binning to ensure the curve is smooth. This requires starting over and re-binning the raw data. I assume the strategy involving the least computation would be to first bin the data using very small bins and then manipulate the bin size of the output, rather than re-binning the raw data. What is the best way to adjust bin sizes post-binning (assuming the bin sizes can only grow and not shrink)?
I don't like answers to StackOverflow questions of the form "well, even though you asked how to do X, you don't really want to do X, you really want to do Y, so here's a solution to Y".
But that's what I am going to do here. I think such an answer is justified in this rare instance because the answer below is in accord with sound practices in statistical analysis and because it avoids the current problem in front of you, which is crunching 2 GB of data.
If you want to represent the distribution of a population using a non-parametric density estimator, and you wish to avoid poor computational performance, a kernel density estimator (KDE) will do the job far better than a histogram.
To begin with, there's a clear preference for KDEs versus histograms among the majority of academic and practicing statisticians. Among the numerous texts on this topic, one that I think is particularly good is An introduction to kernel density estimation.
Reasons why KDE is preferred to a histogram:
1. The shape of a histogram is strongly influenced by the choice of the total number of bins, yet there is no authoritative technique for calculating or even estimating a suitable value. (If you have any doubts about this, just plot a histogram from some data, then watch the entire shape of the histogram change as you adjust the number of bins.)
2. The shape of the histogram is strongly influenced by the choice of location of the bin edges.
3. A histogram gives a density estimate that is not smooth.
KDE completely eliminates histogram properties 2 and 3. Although KDE doesn't produce a density estimate with discrete bins, an analogous parameter, the "bandwidth", must still be supplied.
To calculate and plot a KDE, you need to pass in two parameter values along with your data:
kernel function: the most common options (all available in the MATLAB kde function) are uniform, triangular, biweight, triweight, Epanechnikov, and normal. Among these, Gaussian (normal) is probably the most often used.
bandwidth: the choice of value for the bandwidth will almost certainly have a huge effect on the quality of your KDE. Therefore, sophisticated computation platforms like MATLAB, R, etc. include utility functions (e.g., based on a risk function such as MISE) to estimate the bandwidth given the other parameters.
KDE in MATLAB
kde.m is the function in MATLAB that implements KDE:
[h, fhat, xgrid] = kde(x, 401);
Notice that bandwidth and kernel are not supplied when calling kde.m. For the bandwidth, kde.m wraps a function for bandwidth selection; for the kernel function, Gaussian is used.
But will using KDE in place of a histogram solve or substantially eliminate the very slow performance given your 2 GB dataset?
It certainly should.
In your question, you stated that the lagging performance occurred during plotting. A KDE does not require mapping each of thousands (millions?) of data points to a symbol, color, and specific location on a canvas; instead, it plots a single smooth line. And because the entire data set doesn't need to be rendered one point at a time on the canvas, the points don't need to be stored (in memory!) while the plot is created and rendered.
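If kde.m is not at hand, the two ingredients (a Gaussian kernel and an automatically chosen bandwidth) can be sketched in base MATLAB. This is only an illustration and not what kde.m does internally: the bandwidth here comes from the simple normal-reference rule of thumb, 1.06*std*n^(-1/5), and the data and variable names are made up for the example.

```matlab
% Synthetic stand-in for the data: normal quantiles with mean 3 and std 1.
n = 2000;
p = ((1:n)' - 0.5) / n;
x = 3 + sqrt(2)*erfinv(2*p - 1);

% Bandwidth from the normal-reference rule of thumb for a Gaussian kernel.
h = 1.06 * std(x) * n^(-1/5);

% Evaluate the estimate on a 401-point grid, one grid point at a time,
% so that no n-by-401 matrix is ever held in memory.
xgrid = linspace(min(x) - 3*h, max(x) + 3*h, 401);
fhat  = zeros(size(xgrid));
for j = 1:numel(xgrid)
    u       = (xgrid(j) - x) / h;
    fhat(j) = sum(exp(-0.5*u.^2)) / (n*h*sqrt(2*pi));
end

plot(xgrid, fhat);
```

This direct evaluation costs n operations per grid point, so for very large data sets a binned or FFT-based implementation is the more realistic choice. With the Statistics Toolbox, ksdensity covers the same ground.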
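On the re-binning part of the question: if you do stay with a histogram, the idea of binning once into very fine bins and then merging them can be sketched as follows. The counts are accumulated chunk by chunk, so the raw data never has to be in memory all at once, and coarser bins are obtained by summing groups of adjacent fine bins. The data, range, chunk size and variable names below are assumptions for the example; in practice each chunk would be read from disk.

```matlab
% Synthetic stand-in for the large data set: deterministic values in [0, 1).
x  = mod((1:100000)' * 0.6180339887, 1).^2;
lo = 0;  hi = 1;                     % assumed known data range
nFine = 1000;                        % number of fine bins
w     = (hi - lo) / nFine;           % fine bin width

% Accumulate fine-bin counts one chunk at a time.
counts = zeros(nFine, 1);
chunk  = 20000;
for s = 1:chunk:numel(x)
    xs     = x(s:min(s + chunk - 1, numel(x)));
    idx    = min(floor((xs - lo) / w) + 1, nFine);   % bin index, clamped at the top edge
    counts = counts + accumarray(idx, 1, [nFine, 1]);
end

% Merge every k adjacent fine bins into one coarse bin (k must divide nFine).
k             = 10;
coarse        = sum(reshape(counts, k, []), 1)';
coarseCenters = lo + ((1:nFine/k)' - 0.5) * (k*w);

bar(coarseCenters, coarse, 1);
```

Changing k only touches the small counts vector, so trying different bin widths is cheap. As the question already notes, bins can only be made coarser this way, never finer than the initial fine bins.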

Fit A Curve to a Histogram

Is there any way to fit a curve to the histogram above in MATLAB?
The histogram is not normalized or anything like that.
I know that there is a function called histfit, but can I use it here?
Try this FileExchange submission:
ALLFITDIST - Fit all valid parametric probability distributions to data.
--- UPDATE ---
ALLFITDIST is no longer available on the MATLAB File Exchange.
You can try this instead:
FITMETHIS - finds best-fitting distribution to data vector, including non-parametric.
If you know the underlying distribution (e.g., a skewed Gaussian), you can manually compute a maximum likelihood estimate of its parameters and then plot the resulting distribution on top of your histogram. However, you need to normalize your histogram so that you see empirical densities instead of raw counts.
I think what you want is to fit a distribution, not just any curve, which might not have a finite area under it. The data looks like it's censored on the right tail, but overall it may fit a log-normal or Gamma distribution pretty well. If you have the Statistics Toolbox, try gamfit or lognfit for a start.
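The normalize-then-overlay step can be sketched in base MATLAB for the log-normal case, where the maximum likelihood estimates have a closed form (mean and standard deviation of log(x)), so lognfit is not needed. The data below is synthetic and the variable names are hypothetical; whether a log-normal suits the real data is a separate question.

```matlab
% Synthetic stand-in for the data: log-normal quantiles (mu = 1, sigma = 0.5).
p = ((1:5000)' - 0.5) / 5000;
x = exp(1 + 0.5*sqrt(2)*erfinv(2*p - 1));

% Histogram counts, normalized so that the bar areas sum to 1.
[counts, centers] = hist(x, 40);
binWidth = centers(2) - centers(1);
dens     = counts / (numel(x) * binWidth);

% Closed-form maximum likelihood estimates for the log-normal.
muHat    = mean(log(x));
sigmaHat = std(log(x), 1);           % second argument 1: normalize by n

% Fitted density on a grid, drawn over the normalized histogram.
xg     = linspace(min(x), max(x), 400);
pdfHat = exp(-(log(xg) - muHat).^2 / (2*sigmaHat^2)) ./ (xg * sigmaHat * sqrt(2*pi));

bar(centers, dens, 1);
hold on
plot(xg, pdfHat, 'r', 'LineWidth', 2);
hold off
```

Without the division by numel(x)*binWidth, the bars would show raw counts and the density curve would sit far below them.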
See also Kernel density estimation
http://en.wikipedia.org/wiki/Kernel_density