Matlab: plotting frequency distribution with a curve - matlab

I have to plot 10 frequency distributions on one graph. In order to keep things tidy, I would like to avoid making a histogram with bins and would prefer having lines that follow the contour of each histogram plot.
I tried the following
[counts, bins] = hist(data);
plot(bins, counts)
But this gives me a very inexact and jagged line.
I read about ksdensity, which gives me a nice curve, but it changes the scaling of my y-axis and I need to be able to read the frequencies from the y-axis.
Can you recommend anything else?

You're using the default number of bins for your histogram and, I will assume, for your kernel density estimation calculations.
Depending on how many data points you have, that will probably not be optimal, as you've discovered. The first thing to try is to calculate the optimum bin width, which gives the smoothest curve while preserving the underlying PDF as well as possible.
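If you don't want to pull in a separate optimiser, a rule of thumb can give a reasonable starting bin count. The sketch below uses Scott's normal-reference rule (bin width = 3.49*sigma*n^(-1/3)). That is a swapped-in stand-in, not necessarily the method this answer has in mind, and the data and the names binWidth and nBins are made up for illustration.

```matlab
rng(1);                                   % fixed seed so the example repeats
data = randn(1, 1e4);                     % illustrative data (assumption)
n = numel(data);

% Scott's normal-reference rule for the bin width (assumed rule of thumb)
binWidth = 3.49 * std(data) * n^(-1/3);
nBins = max(1, ceil((max(data) - min(data)) / binWidth));

[counts, centers] = hist(data, nBins);    % counts per bin at the chosen resolution
plot(centers, counts, '.-')
xlabel('value'), ylabel('count per bin')
```

The rule assumes roughly normal data, so for strongly skewed or multi-modal data treat nBins only as a first guess.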
If you still don't like how smooth the resulting plot is, you could try using the bins output from hist as a further input to ksdensity. Perhaps something like this:
[kcounts,kbins] = ksdensity(data,bins,'npoints',length(bins));
I don't have your data, so you may have to play with the parameters a bit to get exactly what you want.
Alternatively, you could try fitting a spline through the points that you get from hist and plotting that instead.
Some code:
data = randn(1,1e4);
optN = sshist(data); % sshist is a third-party function (not built in) that returns an optimal bin count
figure(1)
[N,Center] = hist(data);
[Nopt,CenterOpt] = hist(data,optN);
[f,xi] = ksdensity(data,CenterOpt);
dN = mode(diff(Center));
dNopt = mode(diff(CenterOpt));
plot(Center,N/dN,'.-',CenterOpt,Nopt/dNopt,'.-',xi,f*length(data),'.-')
legend('Default','Optimum','ksdensity')
The result:
Note that the "optimum" bin width preserves some of the fine structure of the distribution (I had to run this a couple times to get the spikes) while the ksdensity gives a smooth curve. Depending on what you're looking for in your data, that may be either good or bad.
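If you want the y-axis to read as plain counts per bin, one option is to rescale the density by the number of samples times the histogram bin width, since a density integrates to 1 while the bin counts sum to N. Unlike the plot above, which divides the counts by the bin width, this keeps the raw counts and scales the density instead. A minimal sketch, assuming ksdensity from the Statistics Toolbox is available and using made-up data and an arbitrary 30 bins:

```matlab
rng(0);                                % fixed seed so the sketch is repeatable
data = randn(1, 1e4);                  % illustrative data (assumption)

[counts, centers] = hist(data, 30);    % raw counts in 30 equal-width bins
binWidth = centers(2) - centers(1);

[f, xi] = ksdensity(data);             % density estimate, integrates to about 1
fCounts = f * numel(data) * binWidth;  % rescale density to expected counts per bin

plot(centers, counts, '.-', xi, fCounts, '-')
legend('hist counts', 'ksdensity scaled to counts')
ylabel('count per bin')
```

Note that the scaled curve is only comparable to histograms that use the same bin width.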

How about interpolating with splines?
nbins = 10; %// number of bins for original histogram
n_interp = 500; %// number of values for interpolation
[counts, bins] = hist(data, nbins);
bins_interp = linspace(bins(1), bins(end), n_interp);
counts_interp = interp1(bins, counts, bins_interp, 'spline');
plot(bins, counts) %// original histogram
figure
plot(bins_interp, counts_interp) %// interpolated histogram
Example: let
data = randn(1,1e4);
Original histogram:
Interpolated:
Following your code, the y axis in the above figures gives the count, not the probability density. To get probability density you need to normalize:
normalization = 1/(bins(2)-bins(1))/sum(counts);
plot(bins, counts*normalization) %// original histogram
plot(bins_interp, counts_interp*normalization) %// interpolated histogram
Check: total area should be approximately 1:
>> trapz(bins_interp, counts_interp*normalization)
ans =
1.0009
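One caveat with 'spline': cubic splines can overshoot between bins and even dip below zero in sparse tails. If that shows up in your plots, the shape-preserving 'pchip' method of interp1 may be worth trying, as it does not overshoot the data. A sketch with made-up data (the variable names counts_spline and counts_pchip are only for illustration):

```matlab
rng(5);
data = randn(1, 1e4);                       % illustrative data (assumption)
[counts, bins] = hist(data, 10);
bins_interp = linspace(bins(1), bins(end), 500);

counts_spline = interp1(bins, counts, bins_interp, 'spline'); % smooth, may overshoot
counts_pchip  = interp1(bins, counts, bins_interp, 'pchip');  % stays within the data range

plot(bins, counts, 'o', bins_interp, counts_spline, '-', bins_interp, counts_pchip, '--')
legend('hist counts', 'spline', 'pchip')
```

The pchip curve is a little less smooth (its second derivative is not continuous), which is the price for never going negative.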

Related

Histogram with logarithmic bins and normalized

I want to make a histogram of every column of a matrix, but I want the bins to be logarithmic and also normalized. And after I create the histogram I want to make a fit on it without showing the bars. This is what I have tried:
y=histogram(x,'Normalized','probability');
This gives me the histogram normalized, but I don't know how to make the bins logarithmic.
There are two different ways of creating a logarithmic histogram:
1. Compute the histogram of the logarithm of the data. This is probably the nicest approach, as you let the software decide how many bins to create, etc. The x-axis now matches the log of your data rather than the data itself. For fitting a function this is likely beneficial, but for display it could be confusing. Here I change the tick mark labels to show the actual value, keeping the tick marks themselves at their original positions:
y = histogram(log(x),'Normalization','probability');
h = gca;
h.XTickLabels = exp(h.XTick);
2. Determine your own bin edges, on a logarithmic scale. Here you need to decide how many bins you need, depending on the number of samples and the distribution of samples.
b = 2.^(1:0.25:3);
y = histogram(x,b,'Normalization','probability');
set(gca,'XTick',b) % This just puts the tick marks in between bars so you can see what we did.
Method 1 lets MATLAB determine the number of bins and the bin edges automatically depending on the input data. Hence it is not suitable for creating multiple matching histograms. For that case, use method 2. The bin edges can be obtained more simply this way:
N = 10; % number of bins
start = min(x); % first bin edge
stop = max(x); % last bin edge
b = 2.^linspace(log2(start),log2(stop),N+1);
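Since the question also asks for a curve without bars, the same log-spaced edges can be fed to histcounts and the result plotted as a line. A rough sketch, assuming strictly positive data (log2 needs x > 0) and using made-up lognormal samples; edges, p and centers are names introduced here:

```matlab
rng(2);
x = exp(randn(1, 5000));                        % illustrative positive data (assumption)
N = 10;                                         % number of bins

edges = 2.^linspace(log2(min(x)), log2(max(x)), N+1);
edges([1 end]) = [min(x) max(x)];               % guard against round-off at the ends

p = histcounts(x, edges, 'Normalization', 'probability');
centers = sqrt(edges(1:end-1) .* edges(2:end)); % geometric centre of each bin

semilogx(centers, p, 'o-')                      % line only, no bars
xlabel('x'), ylabel('probability per bin')
```

Reusing the same edges for every column of the matrix keeps the resulting curves directly comparable.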
I think the correct option name is 'Normalization', not 'Normalized'.
To make the plot logarithmic, you have to change the scale of the axes object.
For example:
ha = axes;
y = histogram( x,'Normalization','probability' );
ha.YScale = 'log';

Smoothing the curve

Below is my code for the plot. How can I make the plot smoother?
len1 = [25, 250, 500, 750, 1000];
for k1 = 1:length(len1)
standard_deviation1(k1) = std(resdphs(1:5000, len1(k1)));
end
f10 = [110, 100, 90, 80, 70];
figure(3),plot(f10, standard_deviation1);xlabel('frequency'); ylabel('standarddev');
grid
As stated in the comments, you can first try to apply a moving average to your data which applies local smoothing to overlapping windows in your data. However, for this to be successful, you must have a higher point density to achieve good smoothing. Currently, your plot only has a few points uniformly spaced at 500 units and so moving average will significantly alter the way the plot looks. I'll show you an example soon.
Let's get back to the method at hand. First, apply linear interpolation between each of the points to get a higher point density. After you apply linear interpolation, you can apply the moving average operation with conv. However, what will happen is that in between your keypoints will exist artificial data that isn't representative of your problem. I'd also like to mention that this plot is for aesthetic purposes and the data in between the keypoints should not be used for any critical decisions.
If you simply want to plot the points, consider not using plot and using stem instead. In any case, use interp1 as the base method for interpolating in between the keypoints. Once you do that, you can apply a moving average by convolution - specifically, use a kernel that has a small amount of filter taps that are all equally weighted. Something like a 5-tap window or 7-tap window may suffice.
Using the variables that you declared above:
%// Specify number of total points
num_points = 300;
%// Specify moving average window
move_size = 7;
%// Specify the denser x coordinates and interpolate the y values onto them
xpts = linspace(min(f10), max(f10), num_points);
out = interp1(f10, standard_deviation1, xpts, 'linear');
%// Apply moving average
kernel = (1/move_size)*(ones(1,move_size));
out_smooth = conv(out, kernel, 'same');
%// Also apply moving average on the raw data itself for demonstration
out_smooth_raw = conv(standard_deviation1, kernel, 'same');
%// Plot everything
plot(f10, standard_deviation1, f10, out_smooth_raw, 'x-', xpts, out_smooth);
legend('Original Data', 'Smoothed Data - Raw', 'Smoothed Data - Interpolated');
Let's do this with some example data:
f10 = 0 : 500 : 5000;
rng(123); %// Set seed for reproducibility
standard_deviation1 = rand(1,numel(f10));
Using the above data and with the above code, we get this plot:
As you can see, applying a moving average on your data without interpolation significantly alters the data because of the resolution. If you apply interpolation first, then apply a moving average, you will see that you get a somewhat better representation of your original data with the corners smoothed. Bear in mind that the data at the beginning and at the end of the smoothened result will be meaningless as you would be taking the moving average of windows with zeros padded into the data to allow the calculations to work.
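If the edge values matter, one common workaround (not part of the code above) is to divide the 'same' convolution by the convolution of a vector of ones, so that each output sample is averaged over only the taps that actually overlap the data. A small sketch on a constant signal, where the effect is easy to see; y_same, weight and y_fixed are names made up for this example:

```matlab
y = 5 * ones(1, 20);                            % constant test signal (assumption)
move_size = 7;
kernel = ones(1, move_size) / move_size;

y_same  = conv(y, kernel, 'same');              % zero-padded: edges are pulled toward 0
weight  = conv(ones(size(y)), kernel, 'same');  % fraction of the window inside the data
y_fixed = y_same ./ weight;                     % renormalised moving average

disp([y_same(1:4); y_fixed(1:4)])               % compare the first few samples
```

The corrected edge samples are averages over fewer points, so they are noisier than the interior, but they are no longer biased toward zero.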

ksdensity of frequency histogram

I would like to plot multiple frequency histograms on one graph, however, my frequency plot is jagged and not pretty. As shown below with this code:
mmin = min([Data]);
mmax = max([Data]);
figure(1);n = hist(Data, x);
f = n/sum(n);
plot(x,f,'r','LineWidth',3)
To make it smooth, I looked into ksdensity and created the graph below based on this code:
[f,xi] = ksdensity(data);
figure(1)
plot(xi,f);
However, I noticed that my graph is no longer plotting frequency on the y-axis. Is there any way to correct for this change using ksdensity? I really like how the graph looks compared to my frequency histogram and would like to keep using ksdensity, unless there is a better suggestion.
Thank you!
Data Sample:
https://www.dropbox.com/s/4ax2cuvugvqxjh6/splicing_attempt2_normalized_combined.txt?dl=0
The trick here is that I don't think you are calculating the frequency correctly in your histogram: you are neglecting the bin width. Your frequency should be the number of SNPs per position, which requires dividing by the number of (possibly fractional) positions per bin.
Try this:
Data = rand(1, 1e4);
figure(1);
[n, c] = hist(Data, 20);
dc = abs(c(2) - c(1));
f = n./(dc * sum(n));
plot(c,f,'r','LineWidth',3)
[~,f_kde,xi] = kde(Data);
line(xi,f_kde);
I don't have the Statistics Toolbox, so I'm using the File Exchange kde function instead, but both work the same way.
If the first graph is indeed what you are after, then do a little algebra-fu: instead of dividing the histogram values by the bin width, multiply the ksdensity values by that same bin width.
As I mention in my other histogram answer, there are numerous methods for choosing optimal bin widths for a histogram. I chose 20 here for expediency.
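To make that rescaling concrete without any toolbox, here is a rough sketch that hand-rolls a Gaussian kernel density estimate and multiplies it by the bin width, so it lands on the same relative-frequency axis as n/sum(n). The bandwidth uses Silverman's rule of thumb, which is an assumption of this sketch and not necessarily what ksdensity or the File Exchange kde would pick. The data is made up, and the implicit expansion needs R2016b or later.

```matlab
rng(3);
Data = randn(1, 2000);                      % illustrative data (assumption)
n = numel(Data);

[cnt, c] = hist(Data, 20);
dc = c(2) - c(1);                           % bin width
relFreq = cnt / n;                          % fraction of samples per bin, sums to 1

h  = 1.06 * std(Data) * n^(-1/5);           % Silverman's rule-of-thumb bandwidth
xi = linspace(min(Data) - 3*h, max(Data) + 3*h, 200);
u  = (xi(:) - Data(:).') / h;               % 200-by-n matrix of scaled distances
dens = sum(exp(-0.5 * u.^2), 2).' / (n * h * sqrt(2*pi));  % density, area about 1

kdeRelFreq = dens * dc;                     % density times bin width = fraction per bin

plot(c, relFreq, 'r.-', xi, kdeRelFreq, 'b-')
legend('n/sum(n)', 'KDE * bin width')
```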

Matlab cdfplot: how to control the marker spacing

I have a Matlab figure I want to use in a paper. This figure contains multiple cdfplots.
Now the problem is that I cannot use the markers because they become very dense in the plot.
If I want to make the samples sparse I have to drop some samples from the cdfplot, which will result in a different cdfplot line.
How can I add enough markers while maintaining the actual line?
One method is to get the XData/YData properties from your curves, downsample them following solution (1) from @ephsmith, and set them back. Here is an example for one curve.
y = evrnd(0,3,100,1); %# random data
%# original data
subplot(1,2,1)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
%# reduced data
subplot(1,2,2)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
xdata = get(h,'XData');
ydata = get(h,'YData');
set(h,'XData',xdata(1:5:end));
set(h,'YData',ydata(1:5:end));
Another method is to calculate empirical CDF separately using ECDF function, then reduce the results before plotting with PLOT.
y = evrnd(0,3,100,1); %# random data
[f, x] = ecdf(y);
%# original data
subplot(1,2,1)
plot(x,f,'*')
%# reduced data
subplot(1,2,2)
plot(x(1:5:end),f(1:5:end),'r*')
Result
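On newer releases (R2016b and later, as far as I know) line objects also have a MarkerIndices property, which places markers on a subset of points while the line itself is still drawn through all of them. A sketch that computes the empirical CDF by hand so it runs without the Statistics Toolbox; the data and the spacing of 10 are arbitrary choices:

```matlab
rng(7);
y = randn(100, 1);                       % illustrative data (assumption)
xs = sort(y);                            % empirical CDF: sorted values ...
F  = (1:numel(xs))' / numel(xs);         % ... against cumulative proportion

idx = 1:10:numel(xs);                    % show a marker on every 10th point only
h = plot(xs, F, '-*', 'MarkerIndices', idx);
xlabel('x'), ylabel('F(x)')
```

Because no points are dropped, the curve is identical to the full-resolution one; only the markers are thinned out.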
I know this is potentially unnecessary given MATLAB's built-in functions (in the Statistics Toolbox anyway) but it may be of use to other viewers who do not have access to the toolbox.
The empirical CMF (CDF) is essentially the cumulative sum of the empirical PMF. The latter is attainable in MATLAB via the hist function. In order to get a nice approximation to the empirical PMF, the number of bins must be selected appropriately. In the following example, I assume that 64 bins is good enough for your data.
%# compute a histogram with 64 bins for the data points stored in y
[f,x]=hist(y,64);
%# convert the frequency points in f to proportions
f = f./sum(f);
%# compute the cumulative sum of the empirical PMF
cmf = cumsum(f);
Now you can choose how many points you'd like to plot by using the reduced data example given by yuk.
n=20 ; % number of total data markers in the curve graph
M_n = round(linspace(1,numel(y),n)) ; % indices of markers
% plot the whole line, and markers for selected data points
plot(x,y,'b-',x(M_n),y(M_n),'rs')
Very simple: try reducing the marker size.
x = rand(10000,1);
y = x + rand(10000,1);
plot(x,y,'b.','markersize',1);
For publishing purposes I tend to use the plot tools on the figure window. This allows you to tweak all of the plot parameters and immediately see the result.
If the problem is that you have too many data points, you can:
1). Plot using every nth sample of the data. Experiment to find an n that results in the look you want.
2). I typically fit curves to my data and add a few sparsely placed markers to plots of the fits to differentiate the curves.
Honestly, for publishing purposes I have always found that choosing different 'LineStyle' or 'LineWidth' properties for the lines gives much cleaner results than using different markers. This would also be a lot easier than trying to downsample your data, and for plots made with CDFPLOT I find that markers simply occlude the stairstep nature of the lines.
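As a rough illustration of the line-style approach, the sketch below draws three empirical CDFs as black stair-step lines that differ only in 'LineStyle'. The ECDF is computed by hand to avoid the toolbox, and the three sample sets and the legend labels are made up:

```matlab
rng(6);
samples = {randn(1, 200), 0.5 + randn(1, 200), 1 + 1.5*randn(1, 200)}; % assumed data
styles  = {'-', '--', ':'};              % one line style per curve
hLine   = gobjects(1, numel(samples));   % preallocate graphics handles

figure; hold on
for k = 1:numel(samples)
    xs = sort(samples{k});               % sorted values
    F  = (1:numel(xs)) / numel(xs);      % cumulative proportion
    hLine(k) = stairs(xs, F, 'LineStyle', styles{k}, 'LineWidth', 1.5, 'Color', 'k');
end
hold off
legend('A', 'B', 'C', 'Location', 'southeast')
```

This also prints well in greyscale, which markers and colours often do not.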

Smoothing of histogram with a low-pass filter in MATLAB

I have an image and my aim is to binarize the image. I have filtered the image with a low pass Gaussian filter and have computed the intensity histogram of the image.
I now want to perform smoothing of the histogram so that I can obtain the threshold for binarization. I used a low pass filter but it did not work. This is the filter I used.
h = fspecial('gaussian', [8 8],2);
Can anyone help me with this? What is the process with respect to smoothing of a histogram?
imhist(Ig);
Thanks a lot for all your help.
I've been working on a very similar problem recently, trying to compute a threshold in order to exclude noisy background pixels from MRI data prior to performing other computations on the images. What I did was fit a spline to the histogram to smooth it while maintaining an accurate fit of the shape. I used the splinefit package from the file exchange to perform the fitting. I computed a histogram for a stack of images treated together, but it should work similarly for an individual image. I also happened to use a logarithmic transformation of my histogram data, but that may or may not be a useful step for your application.
[my_histogram, xvals] = hist(reshape(image_volume, 1, []), number_of_bins);
my_log_hist = log(my_histogram);
my_log_hist(~isfinite(my_log_hist)) = 0; % Get rid of -Inf values that arise from empty bins (log of zero = -Inf)
figure(1), plot(xvals, my_log_hist, 'b');
hold on
breaks = linspace(0, max_pixel_intensity, numberofbreaks);
xx = linspace(0, max_pixel_intensity, max_pixel_intensity+1);
pp = splinefit(xvals, my_log_hist, breaks, 'r');
plot(xx, ppval(pp, xx), 'r');
Note that the spline is differentiable and you can use ppdiff to get the derivative, which is useful for finding maxima and minima to help pick an appropriate threshold. The numberofbreaks is set to a relatively low number so that the spline will smooth the histogram. I used linspace in the example to pick the breaks, but if you know that some portion of the histogram exhibits much greater curvature than elsewhere, you'd want to have more breaks in that region and fewer elsewhere in order to accurately capture the shape of the histogram.
To smooth the histogram you need to use a 1-D filter. This is easily done using the filter function. Here is an example:
I = imread('pout.tif');
h = imhist(I);
smooth_h = filter(normpdf(-4:4, 0,1),1,h);
Of course you can use any smoothing kernel you choose. A moving average (mean) over 8 bins would simply be ones(1,8)/8.
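One thing to watch with filter: it is causal, so a symmetric 9-tap kernel shifts the smoothed histogram by 4 bins, which would bias any threshold read off it. Using conv with 'same' keeps the peaks in place. The sketch below builds the Gaussian taps by hand (so no normpdf is needed) and picks the valley between two modes. The synthetic bimodal intensities and the assumption of one mode in each half of the range are only for illustration, not a general thresholding method.

```matlab
rng(4);
% Synthetic 8-bit "pixel intensities" with two populations (assumed data)
pix = [60 + 15*randn(1, 20000), 180 + 20*randn(1, 20000)];
pix = min(max(round(pix), 0), 255);
h = histcounts(pix, -0.5:1:255.5);       % one bin per grey level 0..255

g = exp(-0.5 * ((-8:8) / 2).^2);         % Gaussian taps, sigma = 2 bins
g = g / sum(g);                          % normalise so counts are preserved
h_smooth = conv(h, g, 'same');           % centred smoothing, no shift

% Assumption: one mode in each half of the intensity range
[~, p1] = max(h_smooth(1:128));
[~, p2] = max(h_smooth(129:256));
p2 = p2 + 128;                           % index into the full histogram
[~, v] = min(h_smooth(p1:p2));           % valley between the two peaks
threshold = (p1 + v - 1) - 1;            % convert 1-based index to grey level

plot(0:255, h, ':', 0:255, h_smooth, '-')
legend('raw histogram', 'smoothed')
```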
Since your goal here is just to find the threshold to binarize an image you could just use the graythresh function which uses Otsu's method.