In my project, I used a generic cosine function to fit my data:
cos_fun = #(p, theta) p(1) + p(2) * cos(theta - p(3))
p = nlinfit(x,y,cos_fun,[1 1 0])
As a result, p has three values, which are y-offset, amplitude and phase.
Can I draw a smooth cosine curve using these three parameters?
TL;DR: It is possible to both fit and plot the curve, with and without requiring toolboxes. All cases presented below.
Plotting
Plotting the function follows directly from the function used to obtain your parameters using plot(). Notice that you control the smoothness of the plotted function based on the step size for the domain (see step below).
In the figure, the results obtained from the nlinfit() (Toolbox required) are the same as "SSE" obtained without a toolbox using fminsearch().
% Plot (No toolbox Required)
step = 0.01; % smaller is smoother
Xrng = 0:step:12;
figure, hold on, box on
plot(Xdata,Ydata,'b.','DisplayName','Data')
plot(Xrng,cos_fun(p_SSE,Xrng),'r--','DisplayName','SSE')
plot(Xrng,cos_fun(p_SAE,Xrng),'k--','DisplayName','SAE')
legend('show')
As pointed out in the comment by #Daniel, you can also make the plot with nlintool() but this requires the Statistics Toolbox.
nlintool(Xdata,Ydata,cos_fun,[1 1 0]) % toolbox required
Fitting
Using nlinfit(): (Statistics Toolbox Required)
pNL = nlinfit(Xdata,Ydata,cos_fun,[1 1 0]) % same as SSE approach below
A Toolbox Free Approach:
You can construct a convex error function to minimize and return the global optimum using fminsearch() as a down and dirty approach. For example, the sum of squared error or the sum of absolute error will be convex.
% MATLAB R2019a
% Generate Example Data
sigma = 0.5; % increase this for more variable data (more noise)
Xdata = [repmat(1:10,1,4)].';
Ydata = cos(Xdata)+sigma*randn(length(Xdata),1);
% Function Evaluation
cos_fun=#(p,x) p(1) + p(2).*cos(x-p(3));
% Error Functions
SSEh =#(p) sum((cos_fun(p,Xdata)-Ydata).^2); % sum of squared error
SAEh =#(p) sum(abs(cos_fun(p,Xdata)-Ydata)); % sum of absolute error
Of course, these will give you different errors for the same parameter.
% Test
SSEh([1 1 0])
SAEh([1 1 0])
But you then call fminsearch() given an initial guess for the parameters, p0, to obtain the parameters that minimize your chosen error function. Since SSEh and SAEh are both convex with respect to p, there's no need to do this multiple times and save the best one since for every p0, you'll get the same answer.
p0 = [1 1 0.25]; % Initial starting point
[p_SSE, SSE] = fminsearch(SSEh,p0)
[p_SAE, SAE] = fminsearch(SAEh,p0)
You fit slightly different curves depending on the error function.
Notice that SSEh(pNL) and SSEh(p_SSE) are the same since pNL equals p_SSE since nlinfit() estimates the coefficients "using iterative least squares estimation."
Related
Suppose I have a vector t = [0 0.1 0.9 1 1.4], and a vector x = [1 3 5 2 3]. How can I compute the derivative of x with respect to time that has the same length as the original vectors?
I should not use any symbolic operations. The command diff(x)./diff(t) does not produce a vector of the same length. Should I first interpolate the x(t) function and then take its derivative?
Different approaches exist to calculate the derivative at the same points as your initial data:
Finite differences: Use a central difference scheme at your inner points and a forward/backward scheme at your first/last point
or
Curve fitting: Fit a curve through your points, calculate the derivative of this fitted function and sample them at the same points as the original data. Typical fitting functions are polynomials or spline functions.
Note that the curve fitting approach gives better results, but needs more tuning options and is slower (~100x).
Demonstration
As an example, I will calculate the derivative of a sine function:
t = 0:0.1:1;
y = sin(t);
Its exact derivative is well known:
dy_dt_exact = cos(t);
The derivative can approximately been calculated as:
Finite differences:
dy_dt_approx = zeros(size(y));
dy_dt_approx(1) = (y(2) - y(1))/(t(2) - t(1)); % forward difference
dy_dt_approx(end) = (y(end) - y(end-1))/(t(end) - t(end-1)); % backward difference
dy_dt_approx(2:end-1) = (y(3:end) - y(1:end-2))./(t(3:end) - t(1:end-2)); % central difference
or
Polynomial fitting:
p = polyfit(t,y,5); % fit fifth order polynomial
dp = polyder(p); % calculate derivative of polynomial
The results can be visualised as follows:
figure('Name', 'Derivative')
hold on
plot(t, dy_dt_exact, 'DisplayName', 'eyact');
plot(t, dy_dt_approx, 'DisplayName', 'finite difference');
plot(t, polyval(dp, t), 'DisplayName', 'polynomial');
legend show
figure('Name', 'Error')
hold on
plot(t, abs(dy_dt_approx - dy_dt_exact)/max(dy_dt_exact), 'DisplayName', 'finite difference');
plot(t, abs(polyval(dp, t) - dy_dt_exact)/max(dy_dt_exact), 'DisplayName', 'polynomial');
legend show
The first graph shows the derivatives itself and the second graph plots the relative errors made by both methods.
Discussion
One clearly sees that the curve fitting method gives better results than the finite differences, but it is ~100x slower. The curve fitting methods has a relative error of order 10^-5. Note that the finite differences approach becomes better when your data is sampled more densely or you use a higher order scheme. The disadvantage of the curve fitting approach is that one has to choose a good polynomial order. Spline functions may be better suited in general.
A 10x faster sampled dataset, i.e. t = 0:0.01:1;, results in the following graphs:
How to normalize a histogram such that the area under the probability density function is equal to 1?
My answer to this is the same as in an answer to your earlier question. For a probability density function, the integral over the entire space is 1. Dividing by the sum will not give you the correct density. To get the right density, you must divide by the area. To illustrate my point, try the following example.
[f, x] = hist(randn(10000, 1), 50); % Create histogram from a normal distribution.
g = 1 / sqrt(2 * pi) * exp(-0.5 * x .^ 2); % pdf of the normal distribution
% METHOD 1: DIVIDE BY SUM
figure(1)
bar(x, f / sum(f)); hold on
plot(x, g, 'r'); hold off
% METHOD 2: DIVIDE BY AREA
figure(2)
bar(x, f / trapz(x, f)); hold on
plot(x, g, 'r'); hold off
You can see for yourself which method agrees with the correct answer (red curve).
Another method (more straightforward than method 2) to normalize the histogram is to divide by sum(f * dx) which expresses the integral of the probability density function, i.e.
% METHOD 3: DIVIDE BY AREA USING sum()
figure(3)
dx = diff(x(1:2))
bar(x, f / sum(f * dx)); hold on
plot(x, g, 'r'); hold off
Since 2014b, Matlab has these normalization routines embedded natively in the histogram function (see the help file for the 6 routines this function offers). Here is an example using the PDF normalization (the sum of all the bins is 1).
data = 2*randn(5000,1) + 5; % generate normal random (m=5, std=2)
h = histogram(data,'Normalization','pdf') % PDF normalization
The corresponding PDF is
Nbins = h.NumBins;
edges = h.BinEdges;
x = zeros(1,Nbins);
for counter=1:Nbins
midPointShift = abs(edges(counter)-edges(counter+1))/2;
x(counter) = edges(counter)+midPointShift;
end
mu = mean(data);
sigma = std(data);
f = exp(-(x-mu).^2./(2*sigma^2))./(sigma*sqrt(2*pi));
The two together gives
hold on;
plot(x,f,'LineWidth',1.5)
An improvement that might very well be due to the success of the actual question and accepted answer!
EDIT - The use of hist and histc is not recommended now, and histogram should be used instead. Beware that none of the 6 ways of creating bins with this new function will produce the bins hist and histc produce. There is a Matlab script to update former code to fit the way histogram is called (bin edges instead of bin centers - link). By doing so, one can compare the pdf normalization methods of #abcd (trapz and sum) and Matlab (pdf).
The 3 pdf normalization method give nearly identical results (within the range of eps).
TEST:
A = randn(10000,1);
centers = -6:0.5:6;
d = diff(centers)/2;
edges = [centers(1)-d(1), centers(1:end-1)+d, centers(end)+d(end)];
edges(2:end) = edges(2:end)+eps(edges(2:end));
figure;
subplot(2,2,1);
hist(A,centers);
title('HIST not normalized');
subplot(2,2,2);
h = histogram(A,edges);
title('HISTOGRAM not normalized');
subplot(2,2,3)
[counts, centers] = hist(A,centers); %get the count with hist
bar(centers,counts/trapz(centers,counts))
title('HIST with PDF normalization');
subplot(2,2,4)
h = histogram(A,edges,'Normalization','pdf')
title('HISTOGRAM with PDF normalization');
dx = diff(centers(1:2))
normalization_difference_trapz = abs(counts/trapz(centers,counts) - h.Values);
normalization_difference_sum = abs(counts/sum(counts*dx) - h.Values);
max(normalization_difference_trapz)
max(normalization_difference_sum)
The maximum difference between the new PDF normalization and the former one is 5.5511e-17.
hist can not only plot an histogram but also return you the count of elements in each bin, so you can get that count, normalize it by dividing each bin by the total and plotting the result using bar. Example:
Y = rand(10,1);
C = hist(Y);
C = C ./ sum(C);
bar(C)
or if you want a one-liner:
bar(hist(Y) ./ sum(hist(Y)))
Documentation:
hist
bar
Edit: This solution answers the question How to have the sum of all bins equal to 1. This approximation is valid only if your bin size is small relative to the variance of your data. The sum used here correspond to a simple quadrature formula, more complex ones can be used like trapz as proposed by R. M.
[f,x]=hist(data)
The area for each individual bar is height*width. Since MATLAB will choose equidistant points for the bars, so the width is:
delta_x = x(2) - x(1)
Now if we sum up all the individual bars the total area will come out as
A=sum(f)*delta_x
So the correctly scaled plot is obtained by
bar(x, f/sum(f)/(x(2)-x(1)))
The area of abcd`s PDF is not one, which is impossible like pointed out in many comments.
Assumptions done in many answers here
Assume constant distance between consecutive edges.
Probability under pdf should be 1. The normalization should be done as Normalization with probability, not as Normalization with pdf, in histogram() and hist().
Fig. 1 Output of hist() approach, Fig. 2 Output of histogram() approach
The max amplitude differs between two approaches which proposes that there are some mistake in hist()'s approach because histogram()'s approach uses the standard normalization.
I assume the mistake with hist()'s approach here is about the normalization as partially pdf, not completely as probability.
Code with hist() [deprecated]
Some remarks
First check: sum(f)/N gives 1 if Nbins manually set.
pdf requires the width of the bin (dx) in the graph g
Code
%http://stackoverflow.com/a/5321546/54964
N=10000;
Nbins=50;
[f,x]=hist(randn(N,1),Nbins); % create histogram from ND
%METHOD 4: Count Densities, not Sums!
figure(3)
dx=diff(x(1:2)); % width of bin
g=1/sqrt(2*pi)*exp(-0.5*x.^2) .* dx; % pdf of ND with dx
% 1.0000
bar(x, f/sum(f));hold on
plot(x,g,'r');hold off
Output is in Fig. 1.
Code with histogram()
Some remarks
First check: a) sum(f) is 1 if Nbins adjusted with histogram()'s Normalization as probability, b) sum(f)/N is 1 if Nbins is manually set without normalization.
pdf requires the width of the bin (dx) in the graph g
Code
%%METHOD 5: with histogram()
% http://stackoverflow.com/a/38809232/54964
N=10000;
figure(4);
h = histogram(randn(N,1), 'Normalization', 'probability') % hist() deprecated!
Nbins=h.NumBins;
edges=h.BinEdges;
x=zeros(1,Nbins);
f=h.Values;
for counter=1:Nbins
midPointShift=abs(edges(counter)-edges(counter+1))/2; % same constant for all
x(counter)=edges(counter)+midPointShift;
end
dx=diff(x(1:2)); % constast for all
g=1/sqrt(2*pi)*exp(-0.5*x.^2) .* dx; % pdf of ND
% Use if Nbins manually set
%new_area=sum(f)/N % diff of consecutive edges constant
% Use if histogarm() Normalization probability
new_area=sum(f)
% 1.0000
% No bar() needed here with histogram() Normalization probability
hold on;
plot(x,g,'r');hold off
Output in Fig. 2 and expected output is met: area 1.0000.
Matlab: 2016a
System: Linux Ubuntu 16.04 64 bit
Linux kernel 4.6
For some Distributions, Cauchy I think, I have found that trapz will overestimate the area, and so the pdf will change depending on the number of bins you select. In which case I do
[N,h]=hist(q_f./theta,30000); % there Is a large range but most of the bins will be empty
plot(h,N/(sum(N)*mean(diff(h))),'+r')
There is an excellent three part guide for Histogram Adjustments in MATLAB (broken original link, archive.org link),
the first part is on Histogram Stretching.
I am trying to simulate a network of mobile robots that uses artificial potential fields for movement planning to a shared destination xd. This is done by generating a series of m-files (one for each robot) from a symbolic expression, as this seems to be the best way in terms of computational time and accuracy. However, I can't figure out what is going wrong with my gradient computation: the analytical gradient that is being computed seems to be faulty, while the numerical gradient is calculated correctly (see the image posted below). I have written a MWE listed below, which also exhibits this problem. I have checked the file generating part of the code, and it does return a correct function file with a correct gradient. But I can't figure out why the analytic and numerical gradient are so different.
(A larger version of the image below can be found here)
% create symbolic variables
xd = sym('xd',[1 2]);
x = sym('x',[2 2]);
% create a potential function and a gradient function for both (x,y) pairs
% in x
for i=1:size(x,1)
phi = norm(x(i,:)-xd)/norm(x(1,:)-x(2,:)); % potential field function
xvector = reshape(x.',1,size(x,1)*size(x,2)); % reshape x to allow for gradient computation
grad = gradient(phi,xvector(2*i-1:2*i)); % compute the gradient
gradx = grad(1);grady=grad(2); % split the gradient in two components
% create function file names
gradfun = strcat('GradTester',int2str(i),'.m');
phifun = strcat('PotTester',int2str(i),'.m');
% generate two output files
matlabFunction(gradx, grady,'file',gradfun,'outputs',{'gradx','grady'},'vars',{xvector, xd});
matlabFunction(phi,'file',phifun,'vars',{xvector, xd});
end
clear all % make sure the workspace is empty: the functions are in the files
pause(0.1) % ensure the function file has been generated before it is called
% these are later overwritten by a specific case, but they can be used for
% debugging
x = 0.5*rand(2);
xd = 0.5*rand(1,2);
% values for the Stackoverflow case
x = [0.0533 0.0023;
0.4809 0.3875];
xd = [0.4087 0.4343];
xp = x; % dummy variable to keep x intact
% compute potential field and gradient for both (x,y) pairs
for i=1:size(x,1)
% create a grid centered on the selected (x,y) pair
xGrid = (x(i,1)-0.1):0.005:(x(i,1)+0.1);
yGrid = (x(i,2)-0.1):0.005:(x(i,2)+0.1);
% preallocate the gradient and potential matrices
gradx = zeros(length(xGrid),length(yGrid));
grady = zeros(length(xGrid),length(yGrid));
phi = zeros(length(xGrid),length(yGrid));
% generate appropriate function handles
fun = str2func(strcat('GradTester',int2str(i)));
fun2 = str2func(strcat('PotTester',int2str(i)));
% compute analytic gradient and potential for each position in the xGrid and
% yGrid vectors
for ii = 1:length(yGrid)
for jj = 1:length(xGrid)
xp(i,:) = [xGrid(ii) yGrid(jj)]; % select the position
Xvec = reshape(xp.',1,size(x,1)*size(x,2)); % turn the input into a vector
[gradx(ii,jj),grady(ii,jj)] = fun(Xvec,xd); % compute gradients
phi(jj,ii) = fun2(Xvec,xd); % compute potential value
end
end
[FX,FY] = gradient(phi); % compute the NUMERICAL gradient for comparison
%scale the numerical gradient
FX = FX/0.005;
FY = FY/0.005;
% plot analytic result
subplot(2,2,2*i-1)
hold all
xlim([xGrid(1) xGrid(end)]);
ylim([yGrid(1) yGrid(end)]);
quiver(xGrid,yGrid,-gradx,-grady)
contour(xGrid,yGrid,phi)
title(strcat('Analytic result for position ',int2str(i)));
xlabel('x');
ylabel('y');
subplot(2,2,2*i)
hold all
xlim([xGrid(1) xGrid(end)]);
ylim([yGrid(1) yGrid(end)]);
quiver(xGrid,yGrid,-FX,-FY)
contour(xGrid,yGrid,phi)
title(strcat('Numerical result for position ',int2str(i)));
xlabel('x');
ylabel('y');
end
The potential field I am trying to generate is defined by an (x,y) position, in my code called xd. x is the position matrix of dimension N x 2, where the first column represents x1, x2, and so on, and the second column represents y1, y2, and so on. Xvec is simply a reshaping of this vector to x1,y1,x2,y2,x3,y3 and so on, as the matlabfunction I am generating only accepts vector inputs.
The gradient for robot i is being calculated by taking the derivative w.r.t. x_i and y_i, these two components together yield a single derivative 'vector' shown in the quiver plots. The derivative should look like this, and I checked that the symbolic expression for [gradx,grady] indeed looks like that before an m-file is generated.
To fix the particular problem given in the question, you were actually calculating phi in such a way that meant you doing gradient(phi) was not giving the correct results compared to the symbolic gradient. I'll try and explain. Here is how you created xGrid and yGrid:
% create a grid centered on the selected (x,y) pair
xGrid = (x(i,1)-0.1):0.005:(x(i,1)+0.1);
yGrid = (x(i,2)-0.1):0.005:(x(i,2)+0.1);
But then in the for loop, ii and jj were used like phi(jj,ii) or gradx(ii,jj), but corresponding to the same physical position. This is why your results were different. Another problem you had was you used gradient incorrectly. Matlab assumes that [FX,FY]=gradient(phi) means that phi is calculated from phi=f(x,y) where x and y are matrices created using meshgrid. You effectively had the elements of phi arranged differently to that, an so gradient(phi) gave the wrong answer. Between reversing the jj and ii, and the incorrect gradient, the errors cancelled out (I suspect you tried doing phi(jj,ii) after trying phi(ii,jj) first and finding it didn't work).
Anyway, to sort it all out, on the line after you create xGrid and yGrid, put this in:
[X,Y]=meshgrid(xGrid,yGrid);
Then change the code after you load fun and fun2 to:
for ii = 1:length(xGrid) %// x loop
for jj = 1:length(yGrid) %// y loop
xp(i,:) = [X(ii,jj);Y(ii,jj)]; %// using X and Y not xGrid and yGrid
Xvec = reshape(xp.',1,size(x,1)*size(x,2));
[gradx(ii,jj),grady(ii,jj)] = fun(Xvec,xd);
phi(ii,jj) = fun2(Xvec,xd);
end
end
[FX,FY] = gradient(phi,0.005); %// use the second argument of gradient to set spacing
subplot(2,2,2*i-1)
hold all
axis([min(X(:)) max(X(:)) min(Y(:)) max(Y(:))]) %// use axis rather than xlim/ylim
quiver(X,Y,gradx,grady)
contour(X,Y,phi)
title(strcat('Analytic result for position ',int2str(i)));
xlabel('x');
ylabel('y');
subplot(2,2,2*i)
hold all
axis([min(X(:)) max(X(:)) min(Y(:)) max(Y(:))])
quiver(X,Y,FX,FY)
contour(X,Y,phi)
title(strcat('Numerical result for position ',int2str(i)));
xlabel('x');
ylabel('y');
I have some other comments about your code. I think your potential function is ill-defined, which is causing all sorts of problems. You say in the question that x is an Nx2 matrix, but you potential function is defined as
norm(x(i,:)-xd)/norm(x(1,:)-x(2,:));
which means if N was three, you'd have the following three potentials:
norm(x(1,:)-xd)/norm(x(1,:)-x(2,:));
norm(x(2,:)-xd)/norm(x(1,:)-x(2,:));
norm(x(3,:)-xd)/norm(x(1,:)-x(2,:));
and I don't think the third one makes sense. I think this could be causing some confusion with the gradients.
Also, I'm not sure if there is a reason to create the .m file functions in your real code, but they are not necessary for the code you posted.
My intention is to plot the original mathematical function values from the differential equation of the second order below:
I(thetadbldot)+md(g-o^2asin(ot))sin(theta)=0
where thetadbldot is the second derivative of theta with respect to t and m,d,I,g,a,o are given constants. Initial conditions are theta(0)=pi/2 and thetadot(0)=0.
My issue is that my knowledge and tutoring is limited to storing the values of the derivatives and returning them, not values from the original mathematical function in the equation. Below you can see a code that calculates the differential in Cauchy-form and gives me the derivatives. Does anyone have suggestions what to do? Thanks!
function xdot = penduluma(t,x)
% The function penduluma(t,x) calculates the differential
% I(thetadbldot)+md(g-o^2asin(ot))sin(theta)=0 where thetadbldot is the second
% derivative of theta with respect to t and m,d,I,g,a,o are given constants.
% For the state-variable form, x1=theta and x2=thetadot. x is a 2x1 vector on the form
% [theta,thetadot].
m=1;d=0.2;I=0.1;g=9.81;a=0.1;o=4;
xdot = [x(2);m*d*(o^2*a*sin(o*t)-g)*sin(x(1))/I];
end
options=odeset('RelTol', 1e-6);
[t,xa]=ode45(#penduluma,[0,20],[pi/2,0],options);
% Then the desired vector from xa is plotted to t. As it looks now the desired
% values are not found in xa however.
Once you have the angle, you can calculate the angular velocity and acceleration using diff:
options=odeset('RelTol', 1e-6);
[t,xa]=ode45(#penduluma,[0,20],[pi/2,0],options);
x_ddot = zeros(size(t));
x_ddot(2:end) = diff(xa(:,2))./diff(t);
plot(t,xa,t,x_ddot)
legend('angle','angular velocity','angular acceleration')
which gives the following plot in Octave (should be the same in MATLAB):
Alternatively, you can work it out using your original differential equation:
x_ddot = -m*d*(o^2*a*sin(o*t)-g).*sin(xa(:,1))/I;
which gives a similar result:
How to normalize a histogram such that the area under the probability density function is equal to 1?
My answer to this is the same as in an answer to your earlier question. For a probability density function, the integral over the entire space is 1. Dividing by the sum will not give you the correct density. To get the right density, you must divide by the area. To illustrate my point, try the following example.
[f, x] = hist(randn(10000, 1), 50); % Create histogram from a normal distribution.
g = 1 / sqrt(2 * pi) * exp(-0.5 * x .^ 2); % pdf of the normal distribution
% METHOD 1: DIVIDE BY SUM
figure(1)
bar(x, f / sum(f)); hold on
plot(x, g, 'r'); hold off
% METHOD 2: DIVIDE BY AREA
figure(2)
bar(x, f / trapz(x, f)); hold on
plot(x, g, 'r'); hold off
You can see for yourself which method agrees with the correct answer (red curve).
Another method (more straightforward than method 2) to normalize the histogram is to divide by sum(f * dx) which expresses the integral of the probability density function, i.e.
% METHOD 3: DIVIDE BY AREA USING sum()
figure(3)
dx = diff(x(1:2))
bar(x, f / sum(f * dx)); hold on
plot(x, g, 'r'); hold off
Since 2014b, Matlab has these normalization routines embedded natively in the histogram function (see the help file for the 6 routines this function offers). Here is an example using the PDF normalization (the sum of all the bins is 1).
data = 2*randn(5000,1) + 5; % generate normal random (m=5, std=2)
h = histogram(data,'Normalization','pdf') % PDF normalization
The corresponding PDF is
Nbins = h.NumBins;
edges = h.BinEdges;
x = zeros(1,Nbins);
for counter=1:Nbins
midPointShift = abs(edges(counter)-edges(counter+1))/2;
x(counter) = edges(counter)+midPointShift;
end
mu = mean(data);
sigma = std(data);
f = exp(-(x-mu).^2./(2*sigma^2))./(sigma*sqrt(2*pi));
The two together gives
hold on;
plot(x,f,'LineWidth',1.5)
An improvement that might very well be due to the success of the actual question and accepted answer!
EDIT - The use of hist and histc is not recommended now, and histogram should be used instead. Beware that none of the 6 ways of creating bins with this new function will produce the bins hist and histc produce. There is a Matlab script to update former code to fit the way histogram is called (bin edges instead of bin centers - link). By doing so, one can compare the pdf normalization methods of #abcd (trapz and sum) and Matlab (pdf).
The 3 pdf normalization method give nearly identical results (within the range of eps).
TEST:
A = randn(10000,1);
centers = -6:0.5:6;
d = diff(centers)/2;
edges = [centers(1)-d(1), centers(1:end-1)+d, centers(end)+d(end)];
edges(2:end) = edges(2:end)+eps(edges(2:end));
figure;
subplot(2,2,1);
hist(A,centers);
title('HIST not normalized');
subplot(2,2,2);
h = histogram(A,edges);
title('HISTOGRAM not normalized');
subplot(2,2,3)
[counts, centers] = hist(A,centers); %get the count with hist
bar(centers,counts/trapz(centers,counts))
title('HIST with PDF normalization');
subplot(2,2,4)
h = histogram(A,edges,'Normalization','pdf')
title('HISTOGRAM with PDF normalization');
dx = diff(centers(1:2))
normalization_difference_trapz = abs(counts/trapz(centers,counts) - h.Values);
normalization_difference_sum = abs(counts/sum(counts*dx) - h.Values);
max(normalization_difference_trapz)
max(normalization_difference_sum)
The maximum difference between the new PDF normalization and the former one is 5.5511e-17.
hist can not only plot an histogram but also return you the count of elements in each bin, so you can get that count, normalize it by dividing each bin by the total and plotting the result using bar. Example:
Y = rand(10,1);
C = hist(Y);
C = C ./ sum(C);
bar(C)
or if you want a one-liner:
bar(hist(Y) ./ sum(hist(Y)))
Documentation:
hist
bar
Edit: This solution answers the question How to have the sum of all bins equal to 1. This approximation is valid only if your bin size is small relative to the variance of your data. The sum used here correspond to a simple quadrature formula, more complex ones can be used like trapz as proposed by R. M.
[f,x]=hist(data)
The area for each individual bar is height*width. Since MATLAB will choose equidistant points for the bars, so the width is:
delta_x = x(2) - x(1)
Now if we sum up all the individual bars the total area will come out as
A=sum(f)*delta_x
So the correctly scaled plot is obtained by
bar(x, f/sum(f)/(x(2)-x(1)))
The area of abcd`s PDF is not one, which is impossible like pointed out in many comments.
Assumptions done in many answers here
Assume constant distance between consecutive edges.
Probability under pdf should be 1. The normalization should be done as Normalization with probability, not as Normalization with pdf, in histogram() and hist().
Fig. 1 Output of hist() approach, Fig. 2 Output of histogram() approach
The max amplitude differs between two approaches which proposes that there are some mistake in hist()'s approach because histogram()'s approach uses the standard normalization.
I assume the mistake with hist()'s approach here is about the normalization as partially pdf, not completely as probability.
Code with hist() [deprecated]
Some remarks
First check: sum(f)/N gives 1 if Nbins manually set.
pdf requires the width of the bin (dx) in the graph g
Code
%http://stackoverflow.com/a/5321546/54964
N=10000;
Nbins=50;
[f,x]=hist(randn(N,1),Nbins); % create histogram from ND
%METHOD 4: Count Densities, not Sums!
figure(3)
dx=diff(x(1:2)); % width of bin
g=1/sqrt(2*pi)*exp(-0.5*x.^2) .* dx; % pdf of ND with dx
% 1.0000
bar(x, f/sum(f));hold on
plot(x,g,'r');hold off
Output is in Fig. 1.
Code with histogram()
Some remarks
First check: a) sum(f) is 1 if Nbins adjusted with histogram()'s Normalization as probability, b) sum(f)/N is 1 if Nbins is manually set without normalization.
pdf requires the width of the bin (dx) in the graph g
Code
%%METHOD 5: with histogram()
% http://stackoverflow.com/a/38809232/54964
N=10000;
figure(4);
h = histogram(randn(N,1), 'Normalization', 'probability') % hist() deprecated!
Nbins=h.NumBins;
edges=h.BinEdges;
x=zeros(1,Nbins);
f=h.Values;
for counter=1:Nbins
midPointShift=abs(edges(counter)-edges(counter+1))/2; % same constant for all
x(counter)=edges(counter)+midPointShift;
end
dx=diff(x(1:2)); % constast for all
g=1/sqrt(2*pi)*exp(-0.5*x.^2) .* dx; % pdf of ND
% Use if Nbins manually set
%new_area=sum(f)/N % diff of consecutive edges constant
% Use if histogarm() Normalization probability
new_area=sum(f)
% 1.0000
% No bar() needed here with histogram() Normalization probability
hold on;
plot(x,g,'r');hold off
Output in Fig. 2 and expected output is met: area 1.0000.
Matlab: 2016a
System: Linux Ubuntu 16.04 64 bit
Linux kernel 4.6
For some Distributions, Cauchy I think, I have found that trapz will overestimate the area, and so the pdf will change depending on the number of bins you select. In which case I do
[N,h]=hist(q_f./theta,30000); % there Is a large range but most of the bins will be empty
plot(h,N/(sum(N)*mean(diff(h))),'+r')
There is an excellent three part guide for Histogram Adjustments in MATLAB (broken original link, archive.org link),
the first part is on Histogram Stretching.