I'd like to fit my empirical data to a Poisson distribution curve.
I have a given mean, say 2.3, and my empirical data.
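import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import poisson

def fit_poisson(data=None, network=None, mu=2.3):
    sns.set_theme()
    fig, ax = plt.subplots(1, 1)
    # integer support between the 1% and 99% quantiles of Poisson(mu)
    x = np.arange(poisson.ppf(0.01, mu),
                  poisson.ppf(0.99, mu))
    sns.histplot(data, stat='density')
    plt.plot(x, poisson.pmf(x, mu))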
It plots:
Apparently, there is a range issue in y here. Maybe a problem with lambda? How do I properly fit my empirical histogram to a Poisson distribution curve with the same mean?
Poisson random variables are discrete: their y value is "probability", not "density". By default, histplot does not assume that your data are discrete, and in this case it is choosing bins with binwidth < 1.
Because density normalization forces the area of all bars to sum to 1, that means the density value for the bar containing observations of a certain value will be greater than the probability mass on that value.
There are two relevant parameters here:
stat="probability" will make the heights of the bars sum to 1, so they will match the PMF (assuming binwidth <= 1, so that only one unique value appears in each bar)
discrete=True, which sets binwidth=1 (and aligns the center of each bar with integral values)
sns.histplot(data, stat='probability', discrete=True, shrink=.8)
I've also added shrink=0.8, which draws the bars a bit narrower than the binwidth; this helps emphasize the discrete nature of the data.
(Note that with discrete=True (implying binwidth=1), density and probability normalization will do the same thing so that's actually all you need, but Probability is the right y axis label to use here).
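To make the normalization argument concrete, here is a minimal sketch in plain Python (no seaborn) that computes bar heights for integer data by hand. The helper names bar_heights and poisson_pmf and the toy data are made up for illustration, and the code assumes binwidth <= 1 so that each bar holds a single integer value; it only mimics the arithmetic, not how histplot actually chooses bins.

```python
import math
from collections import Counter

def poisson_pmf(k, mu):
    """Poisson probability mass at the non-negative integer k."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def bar_heights(data, binwidth):
    """Bar heights for integer data under both normalizations.

    Assumes binwidth <= 1, so every occupied bar contains exactly one
    distinct integer value (hypothetical helper, for illustration only).
    """
    n = len(data)
    counts = Counter(data)
    # "probability": heights sum to 1
    probability = {k: c / n for k, c in counts.items()}
    # "density": heights times binwidth sum to 1
    density = {k: p / binwidth for k, p in probability.items()}
    return probability, density

data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]  # toy sample, not real data
prob, dens_half = bar_heights(data, binwidth=0.5)
_, dens_one = bar_heights(data, binwidth=1.0)

for k in sorted(prob):
    print(k, prob[k], dens_half[k], dens_one[k], round(poisson_pmf(k, 2.3), 4))
```

With binwidth=0.5 every density bar is twice as tall as the corresponding probability, which is the kind of y-range mismatch seen in the question; with binwidth=1 the two normalizations coincide.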
Related
I'm new to MATLAB. I want to take a vector of data, normalize it, and plot it as a normal distribution. I have code that normalizes my data and plots it as a histogram, but the result does not look normally distributed. Can someone point me in the right direction? The code below normalizes the data:
subplot(3, 1, 1)
[x, y] = hist(data, 50);
bar(y, x/trapz(y, x))
So this normalizes my histogram but does not make a normally distributed curve. The data is not random and is stored as a vector.
What you probably need is histfit, which produces a histogram and fits a distribution to it at the same time. You can choose the number of bins and the distribution to fit as arguments to the function. The example below uses 100 bins for the histogram and fits a normal curve to your data:
histfit(data, 100, 'normal')
The default number of bins is the square root of the number of elements in data, rounded up, and the default distribution is normal. See the histfit documentation for the full list of options.
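For readers who want to see the arithmetic, here is a rough plain-Python sketch of the two defaults just mentioned: the bin count as the rounded-up square root of the sample size, and a normal curve scaled so that it can be overlaid on a count histogram. The function names and the synthetic data are hypothetical, and using the sample mean and sample standard deviation as the fitted parameters is an assumption, not a statement about histfit's internals.

```python
import math
import random
import statistics

def default_bins(n):
    """Square root of the sample size, rounded up."""
    return math.ceil(math.sqrt(n))

def scaled_normal_curve(data, nbins, npoints=200):
    """Normal density scaled to histogram counts (n * binwidth * pdf).

    Assumption: parameters are the sample mean and sample standard deviation.
    """
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / nbins
    dist = statistics.NormalDist(mu, sigma)
    xs = [lo + (hi - lo) * i / (npoints - 1) for i in range(npoints)]
    ys = [len(data) * width * dist.pdf(x) for x in xs]
    return xs, ys

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(1000)]  # synthetic example data
nbins = default_bins(len(data))
xs, ys = scaled_normal_curve(data, nbins)
print(nbins, max(ys))
```

The scaling factor n * binwidth is what makes the curve comparable to bar heights measured in counts rather than in density units.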
I have a list of data that I am trying to fit to a polynomial and I am trying to plot the 95% confidence bands for the parameters as well (in Matlab).
If my data are x and y
f=fit(x,y,'poly2')
plot(f,x,y)
ci=confint(f,0.95);
a_ci=ci(1,:);
b_ci=ci(2,:);
I do not know how to proceed after that to get the minimum and maximum band around my data. Does anyone know how to do that?
I can see that you have the Curve Fitting Toolbox installed, which is good, because you need it for the following code to work.
Basic fit of example data
Let's define some example data and a possible fit function. (I could also have used poly2 here, but I wanted to keep it a bit more general.)
xdata = (0:0.1:1)'; % column vector!
noise = 0.1*randn(size(xdata));
ydata = xdata.^2 + noise;
f = fittype('a*x.^2 + b');
fit1 = fit(xdata, ydata, f, 'StartPoint', [1,1])
plot(fit1, xdata, ydata)
Side note: plot() is not our usual plot function, but a method of the cfit-object fit1.
Confidence intervals of the fitted parameters
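Our fit uses the data to determine the coefficients a, b of the underlying model f(x) = a*x^2 + b. You already did this, but for completeness here is how you can read out the uncertainty of the coefficients for any confidence level. confint returns a 2-by-n array with one column per coefficient (in the order given by coeffnames, here alphabetical); the top row holds the lower bounds and the bottom row the upper bounds. That is why a is read from ci(:,1), and so on. Note that ci(1,:), as in your snippet, returns the lower bounds of all coefficients rather than the interval for a.
names = coeffnames(fit1) % check the coefficient order!
ci = confint(fit1, 0.95); % 2 sigma interval
a_ci = ci(:,1) % [lower; upper] bounds for a
b_ci = ci(:,2) % [lower; upper] bounds for b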
By default, Matlab uses 2σ (0.95) confidence intervals. Some people (physicists) prefer to quote the 1σ (0.68) intervals.
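As a quick sanity check on the "1σ ≈ 0.68, 2σ ≈ 0.95" shorthand, the coverage of a k-sigma interval under a normal distribution can be computed with the error function. This is a small Python sketch; the helper name coverage is made up, and it assumes normally distributed estimates, whereas bounds computed from a fit with few data points are typically based on a t-distribution and will differ somewhat.

```python
import math

def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1.0, 1.96, 2.0):
    print(k, round(coverage(k), 4))
```

Strictly speaking, a level of 0.95 corresponds to about 1.96σ, and 2σ to about 0.954, so "2σ" is a convenient approximation.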
Confidence and Prediction Bands
It's a good habit to plot confidence bands or prediction bands around the data – especially when the coefficients are correlated! But you should take a moment to think about which one of the two you want to plot:
Prediction band: If I take a new measurement value, where would I expect it to lie? In Matlab terms, this is called the “observation band”.
Confidence band: Where do I expect the true value to lie? In Matlab terms, this is called the “functional band”.
As with the coefficient’s confidence intervals, Matlab uses 2σ bands by default, and the physicists among us switch this to 1σ intervals. By its nature, the prediction band is bigger, because it is the combination of the error of the model (the confidence band!) and the error of the measurement.
There is another distinction to make, one that I don’t fully understand. Both Matlab and Wikipedia make that distinction.
Pointwise: How big is the prediction/confidence band for a single measurement/true value? In virtually all cases I can think of, this is what you would want to ask as a physicist.
Simultaneous: How big do you have to make the prediction/confidence band if you want a set of all new measurements/all prediction points to lie within the band with a given confidence?
In my personal opinion, the “simultaneous band” is not a band! For a measurement with n points, it should be n individual error bars!
The prediction/confidence distinction and the pointwise/simultaneous distinction give you a total of four options for “the” band around the plot. Matlab makes the 2σ pointwise prediction band easily accessible, but what you seem to be interested in is the 2σ pointwise confidence band. It is a bit more cumbersome to plot, because you have to specify dummy data over which to evaluate the band:
x_dummy = linspace(min(xdata), max(xdata), 100);
figure(1); clf(1);
hold all
plot(xdata,ydata,'.')
plot(fit1) % by default, evaluates the fit over the current XLim
% use "functional" (confidence!) band; use "simultaneous"=off
conf1 = predint(fit1,x_dummy,0.95,'functional','off');
plot(x_dummy, conf1, 'r--')
hold off
Note that the confidence band at x=0 equals the confidence interval of the fit-coefficient b!
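The relation between the two pointwise bands, and the remark about x=0, can be checked numerically. Below is a rough plain-Python sketch for the model a*x^2 + b, which is linear in its parameters. The function names and the fixed "noise" values are made up for illustration, and the results are standard errors only: to turn them into bands they would still have to be multiplied by a suitable t-quantile. This is textbook linear least squares, not a description of what predint does internally.

```python
import math

def fit_a_x2_plus_b(xs, ys):
    """Least-squares fit of y = a*x**2 + b. Returns (a, b, s2, cov)."""
    u = [x * x for x in xs]                 # the model is linear in u = x^2
    n = len(xs)
    su, suu = sum(u), sum(v * v for v in u)
    sy, suy = sum(ys), sum(v * y for v, y in zip(u, ys))
    det = n * suu - su * su
    a = (n * suy - su * sy) / det
    b = (suu * sy - su * suy) / det
    resid = [y - (a * v + b) for v, y in zip(u, ys)]
    s2 = sum(r * r for r in resid) / (n - 2)  # residual variance estimate
    # covariance of (a, b): s2 * inverse of [[suu, su], [su, n]]
    cov = [[s2 * n / det, -s2 * su / det],
           [-s2 * su / det, s2 * suu / det]]
    return a, b, s2, cov

def band_std_errors(x0, s2, cov):
    """Pointwise standard errors at x0: (confidence, prediction)."""
    u0 = x0 * x0
    # variance of the fitted curve at x0: g^T C g with g = (u0, 1)
    var_fit = cov[0][0] * u0 * u0 + 2 * cov[0][1] * u0 + cov[1][1]
    se_conf = math.sqrt(var_fit)            # uncertainty of the model only
    se_pred = math.sqrt(var_fit + s2)       # model plus measurement noise
    return se_conf, se_pred

xs = [i / 10 for i in range(11)]
noise = [0.05, -0.12, 0.08, 0.02, -0.07, 0.11, -0.03, 0.06, -0.09, 0.04, -0.01]
ys = [x * x + e for x, e in zip(xs, noise)]  # illustrative data, fixed noise

a, b, s2, cov = fit_a_x2_plus_b(xs, ys)
se_conf0, se_pred0 = band_std_errors(0.0, s2, cov)
print(a, b)
print(se_conf0, math.sqrt(cov[1][1]), se_pred0)
```

At x=0 the model reduces to b, so the confidence standard error equals the standard error of b, and the prediction standard error is always larger because it adds the residual variance.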
Extrapolation
If you want to extrapolate to x-values that are not covered by the range of your data, you can evaluate the fit and the prediction/confidence band for a bigger range:
x_range = [0, 2];
x_dummy = linspace(x_range(1), x_range(2), 100);
figure(1); clf(1);
hold all
plot(xdata,ydata,'.')
xlim(x_range)
plot(fit1)
conf1 = predint(fit1,x_dummy,0.68,'functional','off');
plot(x_dummy, conf1, 'r--')
hold off
I currently have a vector of calculated probability densities, i.e.
probden = (0.0008, 0.0016, 0.0048, 0.0064, 0.0072, ... , 1.0936, ... , 0.0072, 0.0064, 0.0048, 0.0016, 0.0008)
The list of calculated probability densities should be in the shape of a normal distribution.
I also have a list of the same length containing the bin for each probability density.
I am trying to create a histogram in which each probability density is drawn at its bin on the x-axis.
If I use the function hist, it only shows how many probability density values fall into each bin.
How should I approach this?
Thanks!
The function that goes hand in hand with hist is bar.
In your case, you already have your histogram/distribution values, so there is no need to call hist; you can call bar directly:
bar( YourvectorOfBins , probden )
I have a question on plotting probability distribution and cumulative distribution curves using Matlab. I apologize for asking a noob question but I am new to Matlab, having only used it for a few hours.
I have a set of data which has the size range for the sand particles found on a beach in millimeters (e.g: >2.00, 1.00–2.00, 0.50–1.00, <0.50).
The corresponding percentages of sand particles in each size range are as follows:
(e.g.: 30, 25.5, 35.9, 8.6).
How am I supposed to input the values in the Matlab system for it to plot the probability distribution and cumulative distribution curves on the same plot with different colors? Percentage should be the y-axis and the size range should be the x-axis.
If your dataset is literally 4 points, then you can simply enter them literally. For example, if my dataset were {(A, 1), (B, 2), (C, 3)}, then we could simply set y = [1, 2, 3] and x = {'A', 'B', 'C'}.
For distributions, you should take a look at the sum and cumsum functions.
For plotting, take a look at bar for frequency plots and plot for cumulative plots (this is just my preference). The documentation contains information on setting colors.
For plotting on the same graph, look at hold. To label your plot and your axes, look at xlabel, ylabel, and title.
Matlab has a good FAQ on setting the actual values that are displayed on each axis. For example, I can plot my dataset above by plotting the y vector only, and then setting the X tick labels to 'A', 'B', and 'C'.
I'd be careful with the cumulative distribution function (CDF). It might make more sense to reorder your data in increasing particle size (see fliplr() function) otherwise your CDF's interpretation will be suspect.
The cumsum() function can get your CDF from the given probability mass function (PMF).
label={'<0.50','0.50-1.00','1.00-2.00','>2.00'}';
pmf = [0.086 0.359 0.255 0.30]';
cdf = cumsum(pmf);
bar(pmf) % PMF
set(gca,'XTickLabel',label)
title('Sand Particle Size Distribution')
xlabel('Sand particle size (mm)')
figure
stairs(cdf,'ks-','LineWidth',2.0) % CDF
set(gca,'XTick',1:length(label),'XTickLabel',label)
ylabel('Percentile')
xlabel('Sand particle size (mm)')
ylim([0 1])
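The reordering and cumulative-sum step is language independent. As a minimal sketch in plain Python, using the percentages from the question (the variable names are made up), the data are first reversed into increasing particle size and then accumulated:

```python
from itertools import accumulate

# values as given in the question, from largest to smallest size class
labels = ['>2.00', '1.00-2.00', '0.50-1.00', '<0.50']
percent = [30.0, 25.5, 35.9, 8.6]

# reorder to increasing particle size so the CDF reads left to right
labels_inc = labels[::-1]
pmf = [p / 100 for p in percent[::-1]]

# running total of the PMF gives the CDF
cdf = list(accumulate(pmf))

for label, p, c in zip(labels_inc, pmf, cdf):
    print(label, p, round(c, 3))
```

The last CDF value should come out as 1 (up to floating-point rounding), which is a handy check that the percentages were entered correctly.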
I have a set of samples, S, and I want to find its PDF. The problem is when I use ksdensity I get values greater than one!
[f,xi] = ksdensity(S)
In array f, most of the values are greater than one! Would you please tell me what the problem can be? Thanks for your help.
For example:
S=normrnd(0.3035, 0.0314,1,1000);
ksdensity(S)
ksdensity, as the name says, estimates a probability density function over a continuous variable. Probability densities can be larger than 1; in fact, they can take any non-negative value. The constraint on probabilities is that their sum over an exhaustive range of possibilities has to be 1. For probability densities, the constraint is that the integral over the whole range of values is 1.
A crude approximation of an integral of the pdf estimated by ksdensity can be obtained in Matlab like this:
sum(f) * min(diff(xi))
assuming that the values in xi are equally spaced. The value of this expression should be approximately 1.
If in your application you believe this approximation is not close enough to 1, you might want to specify the grid of estimation points (second parameter pts) such that the spacing is finer or the range is wider than the one automatically generated by ksdensity.
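The same point can be illustrated without a kernel estimate. The sketch below, in plain Python, evaluates the exact normal density with the mean and standard deviation from the question's example on an equally spaced grid, and approximates its integral by a sum times the grid spacing. The grid of 100 points over ±5 standard deviations is an arbitrary choice for illustration and is not what ksdensity generates.

```python
import statistics

mu, sigma = 0.3035, 0.0314            # parameters from the example in the question
dist = statistics.NormalDist(mu, sigma)

# equally spaced evaluation grid (arbitrary illustrative choice)
npoints = 100
lo, hi = mu - 5 * sigma, mu + 5 * sigma
step = (hi - lo) / (npoints - 1)
xi = [lo + i * step for i in range(npoints)]
f = [dist.pdf(x) for x in xi]

peak = max(f)                          # density value, well above 1
integral = sum(f) * step               # crude Riemann-sum approximation

print(peak, integral)
```

Because the standard deviation is so small, the density has to be tall for the area under it to equal 1; the peak is far above 1 while the approximate integral stays close to 1.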