I have raw observations of 500 numeric values (ranging from 1 to 25000) in a text file, I wish to make a frequency distribution in MATLAB. I did try the histogram (hist), however I would prefer a frequency distribution curve than blocks and bars.
Any help is appreciated !
If you pass two output parameters to HIST, you will get both the x-axis and y-axis values. Then you can plot the data as you like. For instance,
[counts, bins] = hist(mydata);
plot(bins, counts); %# get a line plot of the histogram
You could try Kernel smoothing density estimate
Related
I have to plot 10 frequency distributions on one graph. In order to keep things tidy, I would like to avoid making a histogram with bins and would prefer having lines that follow the contour of each histogram plot.
I tried the following
[counts, bins] = hist(data);
plot(bins, counts)
But this gives me a very inexact and jagged line.
I read about ksdensity, which gives me a nice curve, but it changes the scaling of my y-axis and I need to be able to read the frequencies from the y-axis.
Can you recommend anything else?
You're using the default number of bins for your histogram and, I will assume, for your kernel density estimation calculations.
Depending on how many data points you have, that will certainly not be optimal, as you've discovered. The first thing to try is to calculate the optimum bin width to give the smoothest curve while simultaneously preserving the underlying PDF as best as possible. (see also here, here, and here);
If you still don't like how smooth the resulting plot is, you could try using the bins output from hist as a further input to ksdensity. Perhaps something like this:
[kcounts,kbins] = ksdensity(data,bins,'npoints',length(bins));
I don't have your data, so you may have to play with the parameters a bit to get exactly what you want.
Alternatively, you could try fitting a spline through the points that you get from hist and plotting that instead.
Some code:
data = randn(1,1e4);
optN = sshist(data);
figure(1)
[N,Center] = hist(data);
[Nopt,CenterOpt] = hist(data,optN);
[f,xi] = ksdensity(data,CenterOpt);
dN = mode(diff(Center));
dNopt = mode(diff(CenterOpt));
plot(Center,N/dN,'.-',CenterOpt,Nopt/dNopt,'.-',xi,f*length(data),'.-')
legend('Default','Optimum','ksdensity')
The result:
Note that the "optimum" bin width preserves some of the fine structure of the distribution (I had to run this a couple times to get the spikes) while the ksdensity gives a smooth curve. Depending on what you're looking for in your data, that may be either good or bad.
How about interpolating with splines?
nbins = 10; %// number of bins for original histogram
n_interp = 500; %// number of values for interpolation
[counts, bins] = hist(data, nbins);
bins_interp = linspace(bins(1), bins(end), n_interp);
counts_interp = interp1(bins, counts, bins_interp, 'spline');
plot(bins, counts) %// original histogram
figure
plot(bins_interp, counts_interp) %// interpolated histogram
Example: let
data = randn(1,1e4);
Original histogram:
Interpolated:
Following your code, the y axis in the above figures gives the count, not the probability density. To get probability density you need to normalize:
normalization = 1/(bins(2)-bins(1))/sum(counts);
plot(bins, counts*normalization) %// original histogram
plot(bins_interp, counts_interp*normalization) %// interpolated histogram
Check: total area should be approximately 1:
>> trapz(bins_interp, counts_interp*normalization)
ans =
1.0009
I have the following code, which I use to obtain the graph below. How can I determine the probability density of the values as I want my Y-Axis label to be Probability density or,do I have to normalise the Y-values?
Thanks
% thresh_strain contains a Normally Distributed set of numbers
[mu_j,sigma_j] = normfit(thresh_strain);
x=linspace(mu_j-4*sigma_j,mu_j+4*sigma_j,200);
pdf_x = 1/sqrt(2*pi)/sigma_j*exp(-(x-mu_j).^2/(2*sigma_j^2));
plot(x,pdf_x);
Your figure as it stands is correct - the area under the curve is 1. It does not need to be normalised.
You can check this by plotting the cumulative distribution function:
plot(x,(x(2)-x(1)).*cumsum(pdf_x));
The y-axis in your figure needs to be relabeled as it is not "number of dents". "Probability density" is an acceptable label.
I'm using pwelch to plot a power spectral density. I want to use the format
pwelch=(x,window,noverlap,nfft,fs,'onesided')
but with a log scale on the x axis.
I've also tried
[P,F]=(x,window,noverlap,nfft,fs);
plot(F,P)
but it doesn't give the same resulting plot as above. Therefore,
semilogx(F,P)
isn't a good solution.
OK, so to start, I've never heard of this function or this method. However, I was able to generate the same plot that the function produced using output arguments instead. I ran the example from the help text.
EXAMPLE:
Fs = 1000; t = 0:1/Fs:.296;
x = cos(2*pi*t*200)+randn(size(t)); % A cosine of 200Hz plus noise
pwelch(x,[],[],[],Fs,'twosided'); % Uses default window, overlap & NFFT.
That produces this plot:
I then did: plot(bar,10*log10(foo)); grid on; to produce the linear version (same exact plot, minus labels):
or
semilogx(bar,10*log10(foo)); grid on; for the log scale on the x-axis.
I don't like that the x-scale is sampled linearly but displayed logarithmically (that's a word right?), but it seems to look ok.
Good enough?
I'm trying to find the maximum peak on a power spectral density plot created in Matlab. I can create the plot just fine but am having difficulty correctly marking it. I use the find peaks and max function to find it but Matlab cannot correctly mark it. It finds the correct height but marks it a little to the left or right. Here is the code:
data = load ('EEGData(test1).txt', '-ascii');
figure(1)
plot(data)
Y =fft(data,251);
Pyy = Y.*conj(Y)/251;
f = 1000/251*(0:127);
figure(2)
plot(f,Pyy(1:128))
title('Power spectral density')
xlabel('Frequency (Hz)')
[a,b] = findpeaks(Pyy(1:128));
MAX = max(a);
hold on
plot(f(b), MAX,'or')
any help would be greatly appreciated.
When I tested your code as is by replacing data with
data=randn(251,1);
...I found that the local peak locations indicated by the red o markers were in the correct locations. It was just that all peaks were marked at the height of the maximum peak.
I'm not 100% sure what you are trying to do, but it looks as if you are trying to just find the maximum peak. If this is the case then you do not need the findpeaks function. Just replace the last few lines of your code with the following ...
[MAX, MAXidx] = max(Pyy(1:128));
hold on
plot(f(MAXidx), MAX,'or')
I have a Matlab figure I want to use in a paper. This figure contains multiple cdfplots.
Now the problem is that I cannot use the markers because the become very dense in the plot.
If i want to make the samples sparse I have to drop some samples from the cdfplot which will result in a different cdfplot line.
How can I add enough markers while maintaining the actual line?
One method is to get XData/YData properties from your curves follow solution (1) from #ephsmith and set it back. Here is an example for one curve.
y = evrnd(0,3,100,1); %# random data
%# original data
subplot(1,2,1)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
%# reduced data
subplot(1,2,2)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
xdata = get(h,'XData');
ydata = get(h,'YData');
set(h,'XData',xdata(1:5:end));
set(h,'YData',ydata(1:5:end));
Another method is to calculate empirical CDF separately using ECDF function, then reduce the results before plotting with PLOT.
y = evrnd(0,3,100,1); %# random data
[f, x] = ecdf(y);
%# original data
subplot(1,2,1)
plot(x,f,'*')
%# reduced data
subplot(1,2,2)
plot(x(1:5:end),f(1:5:end),'r*')
Result
I know this is potentially unnecessary given MATLAB's built-in functions (in the Statistics Toolbox anyway) but it may be of use to other viewers who do not have access to the toolbox.
The empirical CMF (CDF) is essentially the cumulative sum of the empirical PMF. The latter is attainable in MATLAB via the hist function. In order to get a nice approximation to the empirical PMF, the number of bins must be selected appropriately. In the following example, I assume that 64 bins is good enough for your data.
%# compute a histogram with 64 bins for the data points stored in y
[f,x]=hist(y,64);
%# convert the frequency points in f to proportions
f = f./sum(f);
%# compute the cumulative sum of the empirical PMF
cmf = cumsum(f);
Now you can choose how many points you'd like to plot by using the reduced data example given by yuk.
n=20 ; % number of total data markers in the curve graph
M_n = round(linspace(1,numel(y),n)) ; % indices of markers
% plot the whole line, and markers for selected data points
plot(x,y,'b-',y(M_n),y(M_n),'rs')
verry simple.....
try reducing the marker size.
x = rand(10000,1);
y = x + rand(10000,1);
plot(x,y,'b.','markersize',1);
For publishing purposes I tend to use the plot tools on the figure window. This allow you to tweak all of the plot parameters and immediately see the result.
If the problem is that you have too many data points, you can:
1). Plot using every nth sample of the data. Experiment to find an n that results in the look you want.
2). I typically fit curves to my data and add a few sparsely placed markers to plots of the fits to differentiate the curves.
Honestly, for publishing purposes I have always found that choosing different 'LineStyle' or 'LineWidth' properties for the lines gives much cleaner results than using different markers. This would also be a lot easier than trying to downsample your data, and for plots made with CDFPLOT I find that markers simply occlude the stairstep nature of the lines.