Histogram with logarithmic bins and normalized - matlab

I want to make a histogram of every column of a matrix, but I want the bins to be logarithmic and also normalized. And after I create the histogram I want to make a fit on it without showing the bars. This is what I have tried:
y=histogram(x,'Normalized','probability');
This gives me the histogram normalized, but I don't know how to make the bins logarithmic.

There are two different ways of creating a logarithmic histogram:
1. Compute the histogram of the logarithm of the data. This is probably the nicest approach, as you let the software decide how many bins to create, etc. The x-axis now matches the log of your data rather than the data itself. For fitting a function this is likely beneficial, but for display it could be confusing. Here I change the tick mark labels to show the actual values, keeping the tick marks themselves at their original positions:
y = histogram(log(x),'Normalization','probability');
h = gca;
h.XTickLabels = exp(h.XTick);
2. Determine your own bin edges, on a logarithmic scale. Here you need to decide how many bins you need, depending on the number of samples and the distribution of samples.
b = 2.^(1:0.25:3);
y = histogram(x,b,'Normalization','probability');
set(gca,'XTick',b) % This just puts the tick marks in between bars so you can see what we did.
Method 1 lets MATLAB determine the number of bins and the bin edges automatically from the input data. Hence it is not suitable for creating multiple matching histograms. For that case, use method 2. The bin edges can be obtained more simply this way:
N = 10; % number of bins
start = min(x); % first bin edge
stop = max(x); % last bin edge
b = 2.^linspace(log2(start),log2(stop),N+1);
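As a rough sketch of how method 2 could be applied to every column of a matrix and drawn as lines instead of bars, the snippet below uses histcounts so that nothing is plotted until the end. The matrix X and its lognormal test data are made up for illustration, and the data is assumed to be strictly positive so that the logarithm is defined.

```matlab
% Hypothetical example data: three columns of strictly positive samples
rng(0);
X = exp(randn(1000, 3));

% Shared logarithmic bin edges covering all columns
N = 10;                                            % number of bins
edges = logspace(log10(min(X(:))), log10(max(X(:))), N + 1);
edges([1 end]) = [min(X(:)), max(X(:))];           % guard against rounding at the ends

% Geometric bin centers, which sit mid-bin on a log axis
centers = sqrt(edges(1:end-1) .* edges(2:end));

% Normalized counts for each column, computed without drawing bars
P = zeros(N, size(X, 2));
for k = 1:size(X, 2)
    P(:, k) = histcounts(X(:, k), edges, 'Normalization', 'probability');
end

% One line per column on a logarithmic x-axis
semilogx(centers, P, '.-');
xlabel('x'); ylabel('probability');
```

Because every column shares the same edges, the curves are directly comparable, and the points in centers and P can be passed to whatever fitting routine you prefer.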

I think the correct syntax would be Normalization.
To make an axis logarithmic, you have to change the scale on the axes object. Note that this changes how the axis is drawn, not the bin edges.
For example:
ha = axes;
y = histogram( x,'Normalization','probability' );
ha.YScale = 'log';

Related

How to automatically normalize multiple histograms to get to the same maximum level?

I have multiple histograms generated from various samples that need to be combined in the end. What I have found is that I am not getting good results at the combination stage because different plots have different max values, but if I normalize them to somewhat similar values I get a good result.
For example the below three plots:
Now, as can be seen, one of the plots peaks at around 0.067 while the other two peak at around 0.4. I cannot combine them in this state, but after looking at the plots visually I know that if I multiply the first plot by 0.6 I get this:
Now they are at the same level and can be displayed together.
I am doing this visually for every result. Would it be possible to automate this? It's not always like this: sometimes the first and second inputs (plots) are low but the third one is peaked, and I would have to divide the third plot by a certain value, which I only know after visually looking at the plots.
MATLAB's histogram function has some normalization types built in. You can normalize by the number of counts or by the histogram area, among others, but you cannot yet normalize to a maximum value, which is probably what you want.
I recommend computing the histograms without plotting using histcounts, then normalizing them to a common maximum (1, for example), and then plotting them together or separately as bar plots.
Example:
% generate example data
a = randn(100, 1) + 5;
b = randn(100, 1) * 4 + 8;
nbins = 0:20;
% compute histograms
[na, edges] = histcounts(a, nbins);
centers = mean([edges(1:end-1);edges(2:end)]);
nb = histcounts(b, nbins);
% normalize histograms to maximum equals 1
na = na / max(na);
nb = nb / max(nb);
% plot as bar plots with specified colors (or however you want to plot them)
figure;
bar_handle = bar(centers', [na',nb']);
bar_handle(1).FaceColor = 'r';
bar_handle(2).FaceColor = 'g';
title('histogram normalized to max');
and it looks like
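To automate the same idea for any number of samples, one possibility is to keep the samples in a cell array and normalize each histogram inside a loop. This is only a sketch: the cell array samples and its contents are invented for illustration, and it assumes every sample has at least one value inside the bin range (otherwise max(n) would be zero).

```matlab
% Hypothetical samples of different sizes and spreads
rng(0);
samples = {randn(100, 1) + 5, randn(100, 1) * 4 + 8, randn(300, 1) * 2 + 12};

% Common bin edges and centers for all histograms
edges = 0:20;
centers = (edges(1:end-1) + edges(2:end)) / 2;

% One column of H per sample, each scaled so that its maximum is 1
H = zeros(numel(centers), numel(samples));
for k = 1:numel(samples)
    n = histcounts(samples{k}, edges);
    H(:, k) = n / max(n);
end

% Grouped bar plot of all normalized histograms
bar(centers, H);
title('histograms normalized to max');
```

No manual scale factor is needed, since each column is divided by its own peak.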

How to make smooth plot with matrix that don't have the same column and line [duplicate]

Let's say we have the following data:
A1= [41.3251
18.2350
9.9891
36.1722
50.8702
32.1519
44.6284
60.0892
58.1297
34.7482
34.6447
6.7361
1.2960
1.9778
2.0422];
A2=[86.3924
86.4882
86.1717
85.8506
85.8634
86.1267
86.4304
86.6406
86.5022
86.1384
86.5500
86.2765
86.7044
86.8075
86.9007];
When I plot the above data using plot(A1,A2);, I get this graph:
Is there any way to make the graph look smooth like a cubic plot?
Yes, you can: interpolate between the key points. This requires a bit of trickery, though. Blindly using any of MATLAB's interpolation commands won't work, because they require the independent variable (the x-axis in your case) to increase, which your data does not. Therefore you'll have to create a dummy list of values that spans from 1 up to the number of elements in A1 (or A2, as they're equal in size) to act as the independent axis, and then interpolate both arrays independently against that dummy list using a finer spacing. The finer spacing is controlled by the total number of new points you want in the plot. These points lie within the range of the dummy list, and the spacing between them shrinks as you increase the total number of points. As a general rule, the more points you add, the smoother the plot should look. Once you've done that, plot the two interpolated arrays against each other.
Here's some code for you to run. We will use interp1 to do most of the work. The function linspace creates the finer grid of points over the dummy list. N is the total number of points you want to plot. I've made it 500 for now, meaning that the curve is evaluated at 500 points interpolated from your original data. Experiment by increasing (or decreasing) N to see what effect this has on the smoothness of the plot.
I'll also be using the Piecewise Cubic Hermite Interpolating Polynomial, or pchip, as the interpolation method. It is a shape-preserving piecewise cubic interpolant, closely related to (but not the same as) cubic spline interpolation, which interp1 offers as 'spline'. Assuming that A1 and A2 are already created:
%// Specify number of interpolating points
N = 500;
%// Specify dummy list of points
D = 1 : numel(A1);
%// Generate finer grid of points
NN = linspace(1, numel(A1), N);
%// Interpolate each set of points independently
A1interp = interp1(D, A1, NN, 'pchip');
A2interp = interp1(D, A2, NN, 'pchip');
%// Plot the data
plot(A1interp, A2interp);
I now get the following:
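A variation on the dummy-index idea, not part of the answer above, is to parameterize the curve by cumulative chord length instead of by point index, so that widely separated points get proportionally more interpolated samples. The sketch below uses only the first five points of the question's data and assumes that no two consecutive points coincide (otherwise the parameter would not be strictly increasing).

```matlab
% First five points of the data from the question
A1 = [41.3251; 18.2350; 9.9891; 36.1722; 50.8702];
A2 = [86.3924; 86.4882; 86.1717; 85.8506; 85.8634];

% Cumulative chord length along the polyline acts as the independent variable
t = [0; cumsum(hypot(diff(A1), diff(A2)))];

% Finer grid over the same parameter range
N = 500;
tt = linspace(0, t(end), N);

% Interpolate each coordinate independently against the chord length
A1s = interp1(t, A1, tt, 'pchip');
A2s = interp1(t, A2, tt, 'pchip');

% Smooth curve together with the original points
plot(A1s, A2s, '-', A1, A2, 'o');
```

Since A1 spans a much larger range than A2 here, the chord length is dominated by A1. Rescaling both coordinates to a comparable range before computing t may give a more balanced result.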

Matlab: plotting frequency distribution with a curve

I have to plot 10 frequency distributions on one graph. In order to keep things tidy, I would like to avoid making a histogram with bins and would prefer having lines that follow the contour of each histogram plot.
I tried the following
[counts, bins] = hist(data);
plot(bins, counts)
But this gives me a very inexact and jagged line.
I read about ksdensity, which gives me a nice curve, but it changes the scaling of my y-axis and I need to be able to read the frequencies from the y-axis.
Can you recommend anything else?
You're using the default number of bins for your histogram and, I will assume, for your kernel density estimation calculations.
Depending on how many data points you have, that will certainly not be optimal, as you've discovered. The first thing to try is to calculate the optimum bin width, which gives the smoothest curve while preserving the underlying PDF as well as possible.
If you still don't like how smooth the resulting plot is, you could try using the bins output from hist as a further input to ksdensity. Perhaps something like this:
[kcounts,kbins] = ksdensity(data,bins,'npoints',length(bins));
I don't have your data, so you may have to play with the parameters a bit to get exactly what you want.
Alternatively, you could try fitting a spline through the points that you get from hist and plotting that instead.
Some code:
data = randn(1,1e4);
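optN = sshist(data); % sshist is not part of base MATLAB; it is assumed to be a third-party function on the path that returns an optimal number of bins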
figure(1)
[N,Center] = hist(data);
[Nopt,CenterOpt] = hist(data,optN);
[f,xi] = ksdensity(data,CenterOpt);
dN = mode(diff(Center));
dNopt = mode(diff(CenterOpt));
plot(Center,N/dN,'.-',CenterOpt,Nopt/dNopt,'.-',xi,f*length(data),'.-')
legend('Default','Optimum','ksdensity')
The result:
Note that the "optimum" bin width preserves some of the fine structure of the distribution (I had to run this a couple times to get the spikes) while the ksdensity gives a smooth curve. Depending on what you're looking for in your data, that may be either good or bad.
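On the concern that ksdensity changes the y-axis scaling: a density can be converted back to expected counts per bin by multiplying by the number of samples and the bin width. The sketch below illustrates that rescaling on made-up normal data. It assumes the Statistics Toolbox is available for ksdensity, and the choice of 20 bins is arbitrary.

```matlab
% Hypothetical data
rng(0);
data = randn(1, 1e4);

% Histogram counts with a fixed number of equal-width bins
[counts, edges] = histcounts(data, 20);
centers = (edges(1:end-1) + edges(2:end)) / 2;
binWidth = edges(2) - edges(1);

% Kernel density estimate (integrates to roughly 1)
[f, xi] = ksdensity(data);

% Rescale the density to the same units as the counts
fCounts = f * numel(data) * binWidth;

% Counts as markers, rescaled density as a smooth line
plot(centers, counts, 'o', xi, fCounts, '-');
ylabel('frequency (counts per bin)');
```

With this scaling the smooth curve can be read against the same frequency axis as the histogram counts.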
How about interpolating with splines?
nbins = 10; %// number of bins for original histogram
n_interp = 500; %// number of values for interpolation
[counts, bins] = hist(data, nbins);
bins_interp = linspace(bins(1), bins(end), n_interp);
counts_interp = interp1(bins, counts, bins_interp, 'spline');
plot(bins, counts) %// original histogram
figure
plot(bins_interp, counts_interp) %// interpolated histogram
Example: let
data = randn(1,1e4);
Original histogram:
Interpolated:
Following your code, the y axis in the above figures gives the count, not the probability density. To get probability density you need to normalize:
normalization = 1/(bins(2)-bins(1))/sum(counts);
plot(bins, counts*normalization) %// original histogram
plot(bins_interp, counts_interp*normalization) %// interpolated histogram
Check: total area should be approximately 1:
>> trapz(bins_interp, counts_interp*normalization)
ans =
1.0009

Matlab cdfplot: how to control the spacing of the marker spacing

I have a Matlab figure I want to use in a paper. This figure contains multiple cdfplots.
Now the problem is that I cannot use markers because they become very dense in the plot.
If I want to make the samples sparse, I have to drop some samples from the cdfplot, which will result in a different cdfplot line.
How can I add enough markers while maintaining the actual line?
One method is to get the XData/YData properties from your curve, subsample them following solution (1) from @ephsmith, and set them back. Here is an example for one curve.
y = evrnd(0,3,100,1); %# random data
%# original data
subplot(1,2,1)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
%# reduced data
subplot(1,2,2)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
xdata = get(h,'XData');
ydata = get(h,'YData');
set(h,'XData',xdata(1:5:end),'YData',ydata(1:5:end)); %# set both together so their lengths always match
Another method is to calculate empirical CDF separately using ECDF function, then reduce the results before plotting with PLOT.
y = evrnd(0,3,100,1); %# random data
[f, x] = ecdf(y);
%# original data
subplot(1,2,1)
plot(x,f,'*')
%# reduced data
subplot(1,2,2)
plot(x(1:5:end),f(1:5:end),'r*')
Result
I know this is potentially unnecessary given MATLAB's built-in functions (in the Statistics Toolbox anyway) but it may be of use to other viewers who do not have access to the toolbox.
The empirical CMF (CDF) is essentially the cumulative sum of the empirical PMF. The latter is attainable in MATLAB via the hist function. In order to get a nice approximation to the empirical PMF, the number of bins must be selected appropriately. In the following example, I assume that 64 bins is good enough for your data.
%# compute a histogram with 64 bins for the data points stored in y
[f,x]=hist(y,64);
%# convert the frequency points in f to proportions
f = f./sum(f);
%# compute the cumulative sum of the empirical PMF
cmf = cumsum(f);
Now you can choose how many points you'd like to plot by using the reduced data example given by yuk.
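Put together, the toolbox-free approach might look like the sketch below. It uses histcounts rather than hist, and the data y and the marker step of 8 are arbitrary choices for illustration.

```matlab
% Hypothetical data
rng(2);
y = randn(500, 1);

% Empirical PMF from a 64-bin histogram
[counts, edges] = histcounts(y, 64);
centers = (edges(1:end-1) + edges(2:end)) / 2;
pmf = counts / sum(counts);

% Empirical CMF as the cumulative sum of the PMF
cmf = cumsum(pmf);

% Full curve as a line, with a marker on every 8th point only
plot(centers, cmf, 'b-');
hold on
plot(centers(1:8:end), cmf(1:8:end), 'rs');
hold off
```

The line keeps all 64 points, so its shape is unchanged, while the markers are drawn from a sparse subset of them.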
n=20 ; % number of total data markers in the curve graph
M_n = round(linspace(1,numel(y),n)) ; % indices of markers
% plot the whole line, and markers for selected data points
plot(x,y,'b-',x(M_n),y(M_n),'rs')
Very simple: try reducing the marker size.
x = rand(10000,1);
y = x + rand(10000,1);
plot(x,y,'b.','markersize',1);
For publishing purposes I tend to use the plot tools on the figure window. This allows you to tweak all of the plot parameters and immediately see the result.
If the problem is that you have too many data points, you can:
1). Plot using every nth sample of the data. Experiment to find an n that results in the look you want.
2). I typically fit curves to my data and add a few sparsely placed markers to plots of the fits to differentiate the curves.
Honestly, for publishing purposes I have always found that choosing different 'LineStyle' or 'LineWidth' properties for the lines gives much cleaner results than using different markers. This would also be a lot easier than trying to downsample your data, and for plots made with CDFPLOT I find that markers simply occlude the stairstep nature of the lines.
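If markers are still wanted, newer MATLAB releases offer another option that none of the answers above use: line objects have a MarkerIndices property (assumed here to be available, i.e. R2016b or later) that draws markers at selected data points while leaving the line itself untouched. The sketch below builds a simple empirical CDF by sorting, so it does not need the Statistics Toolbox. The data and the choice of 20 markers are illustrative.

```matlab
% Hypothetical data
rng(1);
y = randn(1000, 1);

% Simple empirical CDF: sorted values against their cumulative fraction
xs = sort(y);
F = (1:numel(xs))' / numel(xs);

% Indices of the 20 points that should carry a marker
idx = round(linspace(1, numel(xs), 20));

% Full line through all points, markers only at the selected indices
p = plot(xs, F, '-o', 'MarkerIndices', idx);
xlabel('x'); ylabel('F(x)');
```

This draws the CDF as a connected line rather than a strict stairstep, which is barely visible with this many points. Whether the same property can be set on the handle returned by cdfplot depends on that handle being an ordinary line object.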