Histogram (hist) not starting (and ending) in zero - matlab

I'm using the Matlab function "hist" to estimate the probability density function of a realization of a random process I have.
I'm actually:
1) taking the histogram of h0
2) normalizing its area in order to get 1
3) plotting the normalized curve.
The problem is that, no matter how many bins I use, the histogram never starts from 0 and never goes back to 0, whereas I would really like that kind of behavior.
The code I use is the following:
Nbin = 36;
[n,x0] = hist(h0,Nbin);
edge = find(n~=0,1,'last');
Step = x0(edge)/Nbin;
Scale_factor = sum(Step*n);
PDF_h0 = n/Scale_factor;
hist(h0 ,Nbin) %plot the histogram
figure;
plot(a1,p_rice); %plot the theoretical curve in blue
hold on;
plot(x0, PDF_h0,'red'); %plot the normalized curve obtained from the histogram
And the plots I get are:

If your problem is that the plotted red curve does not go to zero, you can solve that by adding initial and final points with y-axis value 0. It seems from your code that the x-axis separation is Step, so it would be:
plot([x0(1)-Step x0 x0(end)+Step], [0 PDF_h0 0], 'red')
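A slightly fuller sketch of the same idea, in case it helps. Note that Step in the question is the last non-empty bin center divided by Nbin, which matches the real bin width only when the data start near zero, so this version takes the width from the bin centers instead. The sample h0 here is made up for illustration; binWidth, xPad and yPad are names introduced for this sketch.

```matlab
% Made-up sample standing in for h0 (assumption: any real-valued vector works)
h0 = sqrt(randn(1, 5000).^2 + randn(1, 5000).^2);

Nbin = 36;
[n, x0] = hist(h0, Nbin);          % counts and bin centers
binWidth = x0(2) - x0(1);          % hist returns equally spaced centers
PDF_h0 = n / (sum(n) * binWidth);  % scale so the bars enclose unit area

% Pad one empty bin on each side so the curve starts and ends at zero
xPad = [x0(1) - binWidth, x0, x0(end) + binWidth];
yPad = [0, PDF_h0, 0];

plot(xPad, yPad, 'red');
```

The padded points sit exactly one bin width outside the data range, so the area under the bars is unchanged.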

Shouldn't fitdist(data,'Lognormal') give the same result and plot as fitdist(log(data),'Normal') in Matlab?

I am trying to fit a distribution curve to the histogram of some data. (I have used some model data here instead because it is difficult to upload the actual data. I have included the complete code after my question.)
Because the histogram looks normally distributed when I plotted the x-axis in logscale, I transform the data first before fitting a normal distribution to it and I got the following results:
>>pdn=fitdist(log(data),'Normal')
pdn =
Normal distribution
mu = -0.334458 [-0.34704, -0.321876]
sigma = 0.351478 [0.342804, 0.360605]
When I plotted out the pdf with the histogram, I got this:
The result seems reasonable to me. Then I discovered that Matlab's fitdist() already has a 'Lognormal' option, so I don't really need to transform my data first, and this is what I got:
>>pdln = fitdist(data,'Lognormal')
pdln =
Lognormal distribution
mu = -0.334458 [-0.34704, -0.321876]
sigma = 0.351478 [0.342804, 0.360605]
Exactly the same mean and standard deviation as I have got before. However, when I plotted it out with the histogram, I got a different curve:
This curve fits better to the data but the positions of the mean and the mean+/-std points are not as I have expected (i.e. mean at the peak and the mean+/-std at the same levels).
Which brings me to my question: why would fitdist(data,'Lognormal') give the same result as fitdist(log(data),'Normal') but a different plot? I have looked through the Matlab help pages and I still cannot understand why, or where my mistakes are. Please help.
My aim for all this is to get some numerical parameters about the distributions of my data under different conditions and compare them to see if there is any difference. At the moment, I am not certain which way would give me reliable estimates of the means and standard deviations.
The code for the graphs is below:
%random data in lognormal distribution
mu=-0.335742;
sigma=0.35228;
data=lognrnd(mu,sigma,[3000 1]);
%make histogram
interval=0.1;
svalue=sort(data);
bx(1)=interval/2;
i=2;
while bx(i-1)<=max(svalue)
bx(i)=bx(i-1)+interval;
i=i+1;
end
by=hist(svalue,bx);
subplot(211)
h = bar(bx,by,'hist');
set(h,'FaceColor',[.9 .9 .9]);
set(gca,'xlim',[0.05 10]);
xticks=[0.05 0.1 0.2 0.5 1 2 5 10];
set(gca,'xscale','log','xminortick','on')
set(gca,'xtick',xticks)
ylabel('counts')
subplot(212)
h = bar(bx,by,'hist');
set(h,'FaceColor',[.9 .9 .9]);
set(gca,'xlim',[0.05 10]);
xticks=[0.05 0.1 0.2 0.5 1 2 5 10];
set(gca,'xscale','log','xminortick','on')
set(gca,'xtick',xticks)
ylabel('counts')
% fit distribution curves
pdf_x = 0:0.01:max(data);
max_by=max(by); % for scaling the pdf to the histogram
% case 1 - PDF fitted using fitdist(log(data),'Normal')
subplot(211)
hold on
pdn = fitdist(log(data),'Normal')
pdf_y = pdf(pdn,log(pdf_x));
h1=plot(pdf_x,pdf_y./max(pdf_y).*max_by,'-k');
range=[exp(pdn.mu-pdn.sigma) exp(pdn.mu+pdn.sigma)];
h2=plot(exp(pdn.mu),pdf(pdn,(pdn.mu))./max(pdf_y).*max_by,'sk') ;
h3=plot(range,pdf(pdn,log(range))./max(pdf_y).*max_by,'ok') ;
title('PDF fitted using fitdist(log(data),''Normal'')');
legend([h1 h2 h3],'pdf','mean','mean+/-std');
% case 2 - PDF fitted using fitdist(data,'Lognormal')
subplot(212)
hold on
pdln = fitdist(data,'Lognormal')
pdf_y = pdf(pdln,pdf_x);
h1=plot(pdf_x,pdf_y./max(pdf_y).*max_by,'-b');
range=[exp(pdln.mu-pdln.sigma) exp(pdln.mu+pdln.sigma)];
h2=plot(exp(pdln.mu),pdf(pdln,exp(pdln.mu))./max(pdf_y).*max_by,'sb');
h3=plot(range,pdf(pdln,range)./max(pdf_y).*max_by,'ob') ;
title('PDF fitted using fitdist(data,''Lognormal'')');
legend([h1 h2 h3],'pdf','mean','mean+/-std');
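One thing that may account for the two different curves (a sketch under stated assumptions, not a verified diagnosis of the plots above): the density of a lognormal variable in x is the normal density evaluated at log(x) divided by x. Case 1 plots the normal density of log(x) against x without that 1/x factor, so it peaks at exp(mu), while the lognormal density peaks at exp(mu - sigma^2). The snippet below writes both densities out by hand, using the fitted values quoted in the question, so it does not need the Statistics Toolbox. The variable names are introduced here for illustration.

```matlab
mu = -0.334458;                    % fitted values quoted in the question
sigma = 0.351478;
x = linspace(0.05, 3, 500);

% Normal density evaluated at log(x), as plotted in case 1
normalInLog = exp(-(log(x) - mu).^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi));

% Lognormal density in x: same expression with the 1/x change-of-variable factor
lognormalInX = normalInLog ./ x;

[~, iN] = max(normalInLog);
[~, iL] = max(lognormalInX);
peakNormal = x(iN);                % close to exp(mu), the median of the data
peakLognormal = x(iL);             % close to exp(mu - sigma^2), the mode

plot(x, normalInLog / max(normalInLog), '-k', ...
     x, lognormalInX / max(lognormalInX), '-b');
legend('normal pdf of log(x)', 'lognormal pdf of x');
```

Under this reading both fits return the same mu and sigma because they are the same model; only the axis on which the density is expressed differs.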

MATLAB Histogram Problems

I need to calculate the probability density of the function 1/(2-x), where x is a randomly generated number between 0 and 1.
My code looks like this:
n=10^6;
b=10^2;
x=(1./(2-(rand(1,n))));
a=histfit(x,b,'kernel'); % Requires Statistics and Machine Learning Toolbox
xlim([0.5 1.0])
And I get a decent graph that looks like this:
As it may be apparent, there are a couple of problems with this:
MATLAB draws a fit that differs from my histogram, because it counts in the empty space outside the [0.5 1] range of the function as well. This results in a distorted fit towards the edges. (The reason you don't see said empty space is because I entered an xlim there)
I don't know how I could divide every value in the Y-axis by 10^6, which would give me my probability density.
Thanks in advance.
To solve both of your problems, I suggest using hist instead of histfit (note that if you have R2014b or later, you should use histogram instead) to first get the values of your histogram, and then fitting a regression to them and plotting the result:
n=10^6;
b=10^2;
x=(1./(2-(rand(1,n))));
[counts,centers]=hist(x,b);
density = counts./trapz(centers, counts);
%// Thanks to @Arpi for this correction
polyFitting = polyfit(centers,density,3);
polyPlot = polyval(polyFitting,centers);
figure
bar(centers,density)
hold on
plot(centers,polyPlot,'r','LineWidth',3)
You can also up the resolution by adjusting b, which is set to 100 currently. Also try different regressions to see which one you prefer.
1. A better result can be obtained by using ksdensity and specifying the support of the distribution.
2. By using hist you have access to the counts and centers, so the normalization to get a density is straightforward.
Code to demonstrate the suggestions:
rng(125)
n=10^6;
b=10^2;
x=(1./(2-(rand(1,n))));
subplot(1,2,1)
a = histfit(x,b,'kernel');
title('Original')
xlim([0.5 1.0])
[f,c] = hist(x,b);
% normalization to get density
f = f/trapz(c,f);
% kernel density
pts = linspace(0.5, 1, 100);
[fk,xk] = ksdensity(x, pts, 'Support', [0.5, 1]);
subplot(1,2,2)
bar(c,f)
hold on
plot(xk,fk, 'red', 'LineWidth', 2)
title('Improved')
xlim([0.5 1.0])
Comparing the results:
EDIT: If you do not like the endings:
pts = linspace(0.5, 1, 500);
[fk,xk] = ksdensity(x, pts, 'Support', [0.5, 1]);
bar(c,f)
hold on
plot(xk(2:end-1),fk(2:end-1), 'red', 'LineWidth', 2)
title('Improved_2')
xlim([0.5 1.0])
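For this particular x the exact density can be worked out, which gives a reference curve to check any of the fits against. If U is uniform on [0, 1] and x = 1/(2-U), then U = 2 - 1/x, so the density of x is dU/dx = 1/x^2 on [0.5, 1]. The sketch below also shows the normalization asked about in the question: dividing the counts by n gives probabilities per bin, and dividing by the bin width as well gives a density. binWidth, exactDensity and relErr are names introduced for this sketch.

```matlab
n = 10^6;
b = 10^2;
x = 1 ./ (2 - rand(1, n));

[counts, centers] = hist(x, b);
binWidth = centers(2) - centers(1);
density = counts / (n * binWidth);   % counts/n alone would be probabilities, not a density

exactDensity = 1 ./ centers.^2;      % from the change of variables described above
relErr = max(abs(density - exactDensity) ./ exactDensity);

bar(centers, density, 1)
hold on
plot(centers, exactDensity, 'r', 'LineWidth', 2)
xlim([0.5 1.0])
```

relErr is the largest relative deviation between the histogram and the exact curve; it shrinks as n grows.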

Plot a grid of Gaussians with Matlab

With the following code I'm able to draw the plot of a single 2D-Gaussian function:
x=linspace(-3,3,1000);
y=x';
[X,Y]=meshgrid(x,y);
z=exp(-(X.^2+Y.^2)/2);
surf(x,y,z);shading interp
This is the produced plot:
However, I'd like to plot a grid having a specified number x of these 2D-Gaussians.
Think of the following picture as an above view of the plot I'd like to produce (where in particular the grid is made of 5x5 2D-Gaussians). Each Gaussian should be weighed by a coefficient such that if it's negative the Gaussian is pointing towards negative values of the z axis (black points in the grid below) and if it's positive it's as in the above image (white points in the grid below).
Let me provide some mathematical details. The grid corresponds to a mixture of 2D-Gaussians summed as in the following equation:
In which each Gaussian has its own mean and deviation.
Note that each Gaussian of the mixture should be put in a determined (X,Y) coordinate, in such a way that they are equally distant from each other. e.g think of the central Gaussian in (0,0) then the other ones should be in (-1,1) (0,1) (1,1) (-1,0) (1,0) (-1,-1) (0,-1) (1,-1) in the case of a grid with dimension 3x3.
Can you provide me (and explain to me) how can I do such a plot?
Thanks in advance for the help.
As you said yourself, place one Gaussian at each grid center and sum them. As an example that only varies the means (x and y are the vectors from your question):
[X,Y]=meshgrid(x,y); % //mesh
g_centers = -3:3;
[x_g,y_g] = meshgrid(g_centers,g_centers); % //grid of centers (coarser)
mu = [x_g(:) , y_g(:)]; % // mesh of centers in column
z = zeros(size(X));
for i = 1:size(mu,1)
z= z + exp(-((X-mu(i,1)).^2+(Y-mu(i,2)).^2)/( 2* .001) );
end
surf(X,Y,z);shading interp
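The code above does not include the signed coefficients mentioned in the question. A minimal sketch of how they could be added is below. The checkerboard pattern of +1 and -1 weights, the width sigma = 0.25 and the 5x5 layout are assumptions chosen only for illustration; replace w with whatever coefficients you actually have.

```matlab
x = linspace(-3, 3, 121);
[X, Y] = meshgrid(x, x);                 % fine mesh for plotting

centers = -2:2;                          % 5x5 grid of equally spaced centers
[cx, cy] = meshgrid(centers, centers);
w = (-1).^(cx + cy);                     % assumed weights: alternating +1 / -1
sigma = 0.25;                            % assumed common standard deviation

z = zeros(size(X));
for k = 1:numel(cx)
    % Negative weights make the Gaussian point towards negative z
    z = z + w(k) * exp(-((X - cx(k)).^2 + (Y - cy(k)).^2) / (2 * sigma^2));
end

surf(X, Y, z); shading interp
```

Viewed from above with view(2), the positive and negative bumps show up as light and dark spots on the grid.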

Relative Frequency Histograms and Probability Density Functions

The function called DicePlot simulates rolling 10 dice 5000 times.
The function calculates the sum of the values of the 10 dice for each roll, which gives a 1 × 5000 vector, and plots a relative frequency histogram whose bin edges are chosen so that each bin represents one possible value of the sum of the dice.
The mean and standard deviation of the 1 × 5000 sums are then computed, and the probability density function of the normal distribution with that mean and standard deviation is plotted on top of the relative frequency histogram.
Below is my code so far - What am I doing wrong? The graph shows up but not the extra red line on top? I looked at answers like this, and I don't think I'll be plotting anything like the Gaussian function.
% function[]= DicePlot()
for roll=1:5000
diceValues = randi(6,[1, 10]);
SumDice(roll) = sum(diceValues);
end
distr=zeros(1,6*10);
for i = 10:60
distr(i)=histc(SumDice,i);
end
bar(distr,1)
Y = normpdf(X)
xlabel('sum of dice values')
ylabel('relative frequency')
title(['NumDice = ',num2str(NumDice),' , NumRolls = ',num2str(NumRolls)]);
end
It is supposed to look like
But it looks like
The red line is not there because you aren't plotting it. Look at the documentation for normpdf: it computes the pdf, it doesn't plot it. So your problem is how to add this line to the plot. The answer to that problem is to google "matlab hold on".
Here's some code to get you going in the right direction:
% Normalize your distribution
normalizedDist = distr/sum(distr);
bar(normalizedDist ,1);
hold on
% Setup your density function using the mean and std of your sample data
mu = mean(SumDice);
stdv = std(SumDice);
xx = linspace(0,60);
yy = normpdf(xx,mu,stdv);
% Plot pdf
h = plot(xx,yy,'r'); set(h,'linewidth',1.5);
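Putting the pieces together, here is a self-contained sketch of the whole task. It is one possible way to do it, not the required structure of DicePlot: the dice are rolled in a single randi call, the counts come from accumarray, and the normal pdf is written out by hand so that no toolbox function is needed. numDice, numRolls, values and relFreq are names introduced for this sketch.

```matlab
numDice = 10;                            % values taken from the question
numRolls = 5000;

% Each column is one roll of all dice; sum down the columns
sums = sum(randi(6, numDice, numRolls), 1);

% One bin per possible value of the sum (10 to 60)
values = numDice:6 * numDice;
counts = accumarray(sums(:) - numDice + 1, 1, [numel(values) 1])';
relFreq = counts / numRolls;             % relative frequencies sum to 1

mu = mean(sums);
sd = std(sums);
xx = linspace(numDice, 6 * numDice, 400);
yy = exp(-(xx - mu).^2 / (2 * sd^2)) / (sd * sqrt(2 * pi));  % normal pdf

bar(values, relFreq, 1);
hold on
plot(xx, yy, 'r', 'LineWidth', 1.5);
xlabel('sum of dice values')
ylabel('relative frequency')
title(['NumDice = ', num2str(numDice), ' , NumRolls = ', num2str(numRolls)]);
```

Because each bin has width 1, the relative frequencies and the pdf are on the same vertical scale without further rescaling.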

How to fit a curve by a series of segmented lines in Matlab?

I have a simple loglog curve as above. Is there some function in Matlab which can fit this curve by segmented lines and show the start and end points of these line segments? I have checked the curve fitting toolbox in Matlab. It seems to fit a curve with either a single line or some function. I do not want to fit the curve with one line only.
If there is no direct function, any alternative that achieves the same goal is fine with me. My goal is to fit the curve by segmented lines and get the locations of the end points of these segments.
First of all, your problem is not called curve fitting. Curve fitting is when you have data, and you find the best function that describes it, in some sense. You, on the other hand, want to create a piecewise linear approximation of your function.
I suggest the following strategy:
1. Split manually into sections. The section size should depend on the derivative: large derivative -> small section.
2. Sample the function at the nodes between the sections.
3. Find a linear interpolation that passes through the points mentioned above.
Here is an example of code that does that. You can see that the red line (interpolation) is very close to the original function, despite the small number of sections. This happens due to the adaptive section size.
function fitLogLog()
x = 2:1000;
y = log(log(x));
%# Find section sizes, by using an inverse of the approximation of the derivative
numOfSections = 20;
indexes = round(linspace(1,numel(y),numOfSections));
derivativeApprox = diff(y(indexes));
inverseDerivative = 1./derivativeApprox;
weightOfSection = inverseDerivative/sum(inverseDerivative);
totalRange = max(x(:))-min(x(:));
sectionSize = weightOfSection.* totalRange;
%# The relevant nodes
xNodes = x(1) + [ 0 cumsum(sectionSize)];
yNodes = log(log(xNodes));
figure;plot(x,y);
hold on;
plot (xNodes,yNodes,'r');
scatter (xNodes,yNodes,'r');
legend('log(log(x))','adaptive linear interpolation');
end
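To see how much the placement of the nodes matters, here is a small sketch that evaluates the piecewise linear approximation with interp1 and compares two node layouts on the same function. Spacing the nodes evenly in log(x) is an assumption made here as a simple stand-in for "smaller sections where the derivative is large"; it is not the weighting used in the function above.

```matlab
x = 2:1000;
y = log(log(x));
numNodes = 8;

% Layout A: nodes evenly spaced in x
nodesUniform = linspace(x(1), x(end), numNodes);

% Layout B: nodes evenly spaced in log(x), so they crowd near small x
nodesLog = exp(linspace(log(x(1)), log(x(end)), numNodes));
nodesLog([1 end]) = [x(1) x(end)];       % pin the end points exactly

% Piecewise linear approximations through the sampled nodes
approxUniform = interp1(nodesUniform, log(log(nodesUniform)), x, 'linear');
approxLog = interp1(nodesLog, log(log(nodesLog)), x, 'linear');

maxErrUniform = max(abs(approxUniform - y));
maxErrLog = max(abs(approxLog - y));

plot(x, y, 'b', x, approxUniform, 'k--', x, approxLog, 'r');
legend('log(log(x))', 'uniform nodes', 'log-spaced nodes');
```

The node vectors are exactly the start and end points of the line segments, which is the output asked for in the question.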
Andrey's adaptive solution provides a more accurate overall fit. If what you want is segments of a fixed length, however, then here is something that should work, using a method that also returns a complete set of all the fitted values. Could be vectorized if speed is needed.
Nsamp = 1000; %number of data samples on x-axis
x = [1:Nsamp]; %this is your x-axis
Nlines = 5; %number of lines to fit
fx = exp(-10*x/Nsamp); %generate something like your current data, f(x)
gx = NaN(size(fx)); %this will hold your fitted lines, g(x)
joins = round(linspace(1, Nsamp, Nlines+1)); %define equally spaced breaks along the x-axis
dx = diff(x(joins)); %x-change
df = diff(fx(joins)); %f(x)-change
m = df./dx; %gradient for each section
for i = 1:Nlines
x1 = joins(i); %start point
x2 = joins(i+1); %end point
gx(x1:x2) = fx(x1) + m(i)*(0:dx(i)); %compute line segment
end
subplot(2,1,1)
h(1,:) = plot(x, fx, 'b', x, gx, 'k', joins, gx(joins), 'ro');
title('Normal Plot')
subplot(2,1,2)
h(2,:) = loglog(x, fx, 'b', x, gx, 'k', joins, gx(joins), 'ro');
title('Log Log Plot')
for ip = 1:2
subplot(2,1,ip)
set(h(ip,:), 'LineWidth', 2)
legend('Data', 'Piecewise Linear', 'Location', 'NorthEastOutside')
legend boxoff
end
This is not an exact answer to this question, but since I arrived here based on a search, I'd like to answer the related question of how to create (not fit) a piecewise linear function that is intended to represent the mean (or median, or some other function) of interval data in a scatter plot.
First, a related but more sophisticated alternative using regression, which apparently has some MATLAB code listed on the Wikipedia page, is Multivariate adaptive regression splines.
The solution here is to just calculate the mean on overlapping intervals to get points:
function [x, y] = intervalAggregate(Xdata, Ydata, aggFun, intStep, intOverlap)
% intOverlap in [0, 1); 0 for no overlap of intervals, etc.
% intStep this is the size of the interval being aggregated.
minX = min(Xdata);
maxX = max(Xdata);
minY = min(Ydata);
maxY = max(Ydata);
intInc = intOverlap*intStep; %How far we advance each iteration.
if intOverlap <= 0
intInc = intStep;
end
nInt = ceil((maxX-minX)/intInc); %Number of aggregations
parfor i = 1:nInt
xStart = minX + (i-1)*intInc;
xEnd = xStart + intStep;
intervalIndices = find((Xdata >= xStart) & (Xdata <= xEnd));
x(i) = aggFun(Xdata(intervalIndices));
y(i) = aggFun(Ydata(intervalIndices));
end
For instance, to calculate the mean over some paired X and Y data I had handy with intervals of length 0.1 having roughly 1/3 overlap with each other (see scatter image):
[x,y] = intervalAggregate(Xdat, Ydat, @mean, 0.1, 0.333)
x =
Columns 1 through 8
0.0552 0.0868 0.1170 0.1475 0.1844 0.2173 0.2498 0.2834
Columns 9 through 15
0.3182 0.3561 0.3875 0.4178 0.4494 0.4671 0.4822
y =
Columns 1 through 8
0.9992 0.9983 0.9971 0.9955 0.9927 0.9905 0.9876 0.9846
Columns 9 through 15
0.9803 0.9750 0.9707 0.9653 0.9598 0.9560 0.9537
We see that as x increases, y tends to decrease slightly. From there, it is easy enough to draw line segments and/or perform some other kind of smoothing.
(Note that I did not attempt to vectorize this solution; a much faster version is possible if Xdata is sorted.)
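For the simpler case of intervals that do not overlap, a loop-free version can be sketched with accumarray. This is only an illustration of the idea, not a drop-in replacement for intervalAggregate: it does not handle overlap, accumarray returns 0 for any interval that contains no samples, and the data below are made up.

```matlab
% Made-up paired data (assumption: column vectors of equal length)
Xdata = linspace(0, 1, 1001)';
Ydata = 1 - 0.1 * Xdata.^2;
intStep = 0.1;                                   % interval length

% Index of the interval each sample falls into, with no overlap
bin = floor((Xdata - min(Xdata)) / intStep) + 1;

x = accumarray(bin, Xdata, [], @mean)';          % mean X per interval
y = accumarray(bin, Ydata, [], @mean)';          % mean Y per interval

plot(Xdata, Ydata, '.', x, y, 'r-o');
```

Passing @median instead of @mean gives the median per interval, in the same way as the aggFun argument above.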