How do I fit distributions to data sets in matlab? - matlab

I'm trying to fit a distribution to my data using MATLAB, but I'm having a lot of trouble. Here's what I've done so far:
A = load('homicide_crime.txt') % A is a two column array the first column is the year and the second column is crime in that year
norm_crime = (A(:,2)-mean(A(:,2)))/std(A(:,2));
[f,x]=hist(norm_crime,20);
plot(x,f/trapz(x,f))
y=normpdf(x,0,1);
hold on
plot(x,y)
This is the resulting plot.
So I then tried the Distribution Fitter app, which gave me this.
Neither of these looks right, since the peaks aren't aligned and the fit is too small.
Here is the data set for anyone interested:
https://pastebin.com/CyddrN1R
Any help is much appreciated.

Actually, I think you are confusing data transformation with distribution fitting.
DATA TRANSFORMATION
In this approach, the data is manipulated through a non-linear transformation in order to achieve a perfect fit. This forces your data to follow the chosen distribution. To accomplish this with a normal distribution, all you have to do is apply the following code:
A = load('homicide_crime.txt');
years = A(:,1);
crimes = A(:,2);
figure(),histfit(crimes);
ranks = tiedrank(crimes); % named "ranks" to avoid shadowing the built-in rank function
p = ranks ./ (numel(ranks) + 1);
crimes_normal = norminv(p,0,1);
figure(),histfit(crimes_normal);
Using the following manipulation:
crimes_normal = (crimes - mean(crimes)) ./ std(crimes);
that can also be written as:
crimes_normal = zscore(crimes);
you modify your observations so that they have mu=0 and sigma=1, but this is far from making them perfectly fit a normal distribution.
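To make the difference concrete, here is a minimal sketch on synthetic data. The exponential sample and the helper name skew are assumptions for illustration, not part of your data set. It uses only base MATLAB: sort stands in for tiedrank (valid when there are no ties) and erfcinv reproduces norminv(p,0,1).

```matlab
rng(0);                                  % fixed seed so the run is repeatable
x = -log(rand(1000,1));                  % skewed (exponential) sample, a hypothetical stand-in for crimes

% Sample skewness, written out so no toolbox is needed
skew = @(v) mean((v - mean(v)).^3) / mean((v - mean(v)).^2)^1.5;

% z-score: shifts and rescales, but keeps the shape of the distribution
z = (x - mean(x)) ./ std(x);

% Rank-based transform: maps the ranks onto normal quantiles
[~, order] = sort(x);
r = zeros(size(x));
r(order) = (1:numel(x))';                % rank of each observation (no ties assumed)
p = r ./ (numel(x) + 1);
g = -sqrt(2) * erfcinv(2*p);             % equivalent to norminv(p,0,1)

fprintf('skewness: raw %.3f, z-score %.3f, rank-transformed %.3f\n', ...
        skew(x), skew(z), skew(g));
```

The z-scored sample has mean 0 and standard deviation 1 but exactly the skewness of the raw data, while the rank-transformed sample is symmetric by construction.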
DISTRIBUTION FITTING
In this approach, the parameters of the chosen distribution are calculated over the given dataset, and then random observations are drawn. On one side you have your empirical observations, and on the other side you have your fitted data. A goodness-of-fit test can finally tell you how well empirical observations fit the given distribution comparing them to theoretical observations.
Since you are working with a normal distribution, you know that it is fully described by two parameters: mu and sigma. Hence:
A = load('homicide_crime.txt');
years = A(:,1);
crimes_emp = A(:,2);
[mu,sigma] = normfit(crimes_emp);
% you can also use
% mu = mean(crimes_emp);
% sigma = std(crimes_emp);
% to achieve the same result
[f,x] = hist(crimes_emp);
crimes_the = normpdf(x,mu,sigma) .* (x(2) - x(1)); % pdf times bin width, to match the relative frequencies plotted below
figure();
bar(x, (f ./ sum(f)));
hold on;
plot(x,crimes_the,'-r','LineWidth',2);
hold off;
And this returns something very close to the problem you originally noticed. As you can clearly see, without even running a Kolmogorov-Smirnov or an Anderson-Darling test, your data doesn't fit a normal distribution well.
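If you do want a number rather than a visual impression, the Kolmogorov-Smirnov distance can be computed by hand. The sketch below is only an illustration on synthetic samples (both samples and all variable names are assumptions). It measures the largest gap between the empirical CDF and the CDF of the fitted normal. For proper p-values, kstest, lillietest or adtest from the Statistics and Machine Learning Toolbox are the usual route.

```matlab
rng(0);                                     % fixed seed so the run is repeatable
n = 500;
samples = {randn(n,1), -log(rand(n,1))};    % a normal and a skewed sample (hypothetical data)
D = zeros(1,2);

for s = 1:2
    x = sort(samples{s});
    mu = mean(x);
    sigma = std(x);
    F = 0.5 * erfc(-(x - mu) / (sigma*sqrt(2)));   % CDF of the fitted normal at each point
    k = (1:n)';
    % largest gap between the empirical CDF steps and the fitted CDF
    D(s) = max(max(k/n - F), max(F - (k-1)/n));
end

fprintf('KS distance: normal sample %.3f, skewed sample %.3f\n', D(1), D(2));
```

The skewed sample sits much further from its fitted normal than the genuinely normal one, which is the same conclusion the plot suggests for your data.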

You can try a non-parametric density estimation method. I used kernel density estimation (KDE) with the default normal kernel to obtain the result shown below. The MATLAB function for this is ksdensity(); see its documentation for details.
A = load('homicide_crime.txt') % load data
years = A(:,1);
values = A(:,2);
[f0,x0] = hist(values,100); % plot the actual histogram
[f1,x1,b1] = ksdensity(values); % KDE with automatically assigned bandwidth
[f2,x2,b2] = ksdensity(values,'Bandwidth',b1*0.6); % 60% of initial bandwidth (b1)
[f3,x3,b3] = ksdensity(values,'Bandwidth',b2*0.6); % 60% of previous bandwidth (b2) = 36% of initial bandwidth (b1)
[f4,x4,b4] = ksdensity(values,'Bandwidth',b3*0.6); % 60% of previous bandwidth (b3) = 21.6% of initial bandwidth (b1)
figure; hold on;
bar(x0, f0/(sum(f0)*(x0(2)-x0(1))) ); % histogram normalized to unit area, comparable with the KDE curves
plot(x1, f1, 'y')
plot(x2, f2, 'c')
plot(x3, f3, 'g')
plot(x4, f4, 'r','linewidth',3) % Final fit
In the code above, I first plot the histogram and then compute the KDE without a user-specified bandwidth. This gives an oversmoothed fit (yellow curve). After a few trials, each reducing the bandwidth to 60% of the previous one, I got the closest fit (red curve). You can tune the bandwidth further to get an even better fit.

Related

FFT: why the reconstructions from different frequency-domain data produce the same result

Does the (i)fftshift operation, which changes the position of each value, have any effect on the reconstructed image?
And when using zero-filling, does cutting the data in the frequency domain also make no difference?
A MATLAB demonstration:
I = imread('cameraman.tif');
% making 3 different frequency data
kraw = fft2(I);
kshift = fftshift(kraw);
kcut = kshift(:,1:end-64);
imshow(abs([kraw,kshift,kcut]),[])
% reconstructing
ToImage = @(x) uint8(abs(x));
Rraw = ToImage(ifft2(kraw));
Rshift = ToImage(ifft2(kshift));
Rcut = ToImage(ifft2(kcut,size(I,1),size(I,2)));
imshow([I,Rraw,Rshift,Rcut])
% measure the difference
ssim_raw = ssim(uint8(abs(Rraw)),I);
ssim_shift = ssim(uint8(abs(Rshift)),I);
ssim_cut = ssim(uint8(abs(Rcut)),I);
title(['SSIM: 1-----|-----',num2str(ssim_raw),'----|-----',num2str(ssim_shift),'----|-----',num2str(ssim_cut)])
I can't run MATLAB right now, but the general answer is that they have to produce different results. The DFT is an isomorphism, which means that there is one and only one spectrum for any image and one and only one image for any spectrum.
You should probably look at the differences between the complex-valued results, before taking the magnitude. For instance, an fftshift in the frequency domain is equivalent to a linear phase multiplication in the spatial domain, which does not affect the magnitude. The cut example surprises me, so I suspect it's a result of how the SSIM metric works. I am not familiar with it, so I can't give any specifics.
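The fftshift point can be checked numerically. This is a small sketch on a random even-sized array (a stand-in for the image, chosen so it runs without the Image Processing Toolbox). For even dimensions, shifting the spectrum multiplies the reconstruction by a checkerboard of +1/-1, so taking abs hides the difference entirely.

```matlab
I = rand(8, 6);                          % any even-sized test "image" (assumed stand-in for cameraman.tif)
[M, N] = size(I);

% Reconstruct from the shifted spectrum, without taking the magnitude
Rshift = ifft2(fftshift(fft2(I)));

% Predicted result: the original multiplied by (-1)^(m+n)
[m, n] = ndgrid(0:M-1, 0:N-1);
checker = (-1).^(m + n);

err_phase = max(abs(Rshift(:) - I(:).*checker(:)));   % complex result vs prediction
err_mag   = max(abs(abs(Rshift(:)) - abs(I(:))));     % magnitudes vs original

fprintf('phase-model error %.2e, magnitude error %.2e\n', err_phase, err_mag);
```

Both errors are at rounding level, which is why Rshift looks identical to the original once uint8(abs(x)) is applied.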

Remove baseline drift with piecewise cubic spline algorithm using MATLAB

I have a signal from which I want to remove the baseline drift using the piecewise cubic spline algorithm in MATLAB.
d=load(file, '-mat');
t=1:length(a);
xq1=1:0.01:length(a);
p = pchip(t,a,xq1);
s = spline(t,a,xq1);
%
figure, hold on, plot(a, 'g'), plot(t,a,'o',xq1,p,'-',xq1,s,'-.')
legend('Sample Points','pchip','spline','Location','SouthEast')
But I can't see any baseline removal. The original data lies exactly on the interpolated curve.
The same happens with another signal: no baseline is removed.
The question is how I can "use piecewise cubic spline interpolation to remove baseline drift" in MATLAB.
Thank you
It seems likely that you are looking to fit a polynomial to your data to estimate baseline drift due to thermal variations. The problem with spline is that it will always fit your data perfectly (as will pchip), because it is an interpolation technique. You probably want a coarser fit, which you can get using polyfit. The following code sample shows how you could use polyfit to estimate the drift. In this case I fit a 3rd-order polynomial.
% generate some fake data
t = 0:60;
trend = 0.003*t.^2;
x = trend + sin(0.1*2*pi*t) + randn(1,numel(t))*0.5;
% estimate trend using polyfit
p_est = polyfit(t,x,3);
trend_est = polyval(p_est,t);
% plot results
plot(t,x,t,trend,t,trend_est,t,x-trend_est);
legend('data','trend','estimated trend','trend removed','Location','NorthWest');
Update
If you have the curve fitting toolbox you can fit a cubic spline with an extra smoothing constraint. In the example above you could use
trend_est = fnval(csaps(t,x,0.01),t);
instead of polyfit and polyval. You will have to play with the smoothing parameter, 0 being completely linear and 1 giving the same results as spline.
I think you should reduce the number of points at which you calculate the spline fit (this avoids over-fitting) and then interpolate the fit back onto the original x-data.
t = 0:60;
trend = 0.003*t.^2;
x = trend + sin(0.1*2*pi*t) + randn(1,numel(t))*0.5;
figure;hold on
plot(t,x,'.-')
%coarser x-data
t2 = unique([t(1):10:t(end) t(end)]); % coarse knots spanning the full range of t, so interp1 below returns no NaNs (quick and dirty; you probably want to do better than this)
%spline fit here
p = pchip(t,x,t2);
s = spline(t,x,t2);
plot(t2,s,'-.','color' ,'g')
%interpolate back
trend=interp1(t2,s,t);
%remove the trend
plot(t,x-trend,'-.','color' ,'c')
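A variation on the same idea, sketched under stated assumptions: instead of picking single samples as knots (which carry the full noise and oscillation), average the signal over blocks and run the cubic spline through the block means. The block length of 10 samples is an assumption chosen to match one period of the test sine, and the noise term is left out so the result is repeatable. This is not the method from either answer above, only an illustration of spline-based baseline removal in base MATLAB.

```matlab
t = 0:60;
trend = 0.003*t.^2;                        % known drift, for checking the estimate
x = trend + sin(0.1*2*pi*t);               % test signal (noise omitted for repeatability)

blockLen = 10;                             % assumed knot spacing: one sine period
nBlocks = floor(numel(t)/blockLen);
tk = zeros(1, nBlocks);
xk = zeros(1, nBlocks);
for k = 1:nBlocks
    idx = (k-1)*blockLen + (1:blockLen);
    tk(k) = mean(t(idx));                  % knot position: block centre
    xk(k) = mean(x(idx));                  % knot value: block average
end

baseline = spline(tk, xk, t);              % cubic spline through the knots (extrapolates at the ends)
detrended = x - baseline;

fprintf('max baseline error: %.4f\n', max(abs(baseline - trend)));
```

With real data the block length is the knob to tune: too short and the spline follows the signal itself, too long and it cannot follow the drift.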

In Matlab, How to divide multivariate Gaussian distributions to separate Gaussians?

I have an image with a multivariate Gaussian distribution in its histogram. I want to segment the image into two regions so that each follows a normal distribution, like the red and blue curves shown in the histogram. I know a Gaussian mixture model could potentially work for that. I tried the fitgmdist function and then clustered the two parts, but it still does not work well. Any suggestions will be appreciated.
Below is the MATLAB code for my approach.
% Read Image
I = imread('demo.png');
I = rgb2gray(I);
data = im2double(I(:)); % convert to double in [0,1] before fitting
% Fit a gaussian mixture model
obj = fitgmdist(data,2);
idx = cluster(obj,data);
cluster1 = data(idx == 1,:);
cluster2 = data(idx == 2,:);
% Display Histogram
histogram(cluster1)
histogram(cluster2)
Your solution is correct.
The way you are displaying your histograms poorly represents the detected distributions:
Normalize the bin sizes, because histogram is a frequency count.
Make the axes limits consistent (or plot on the same axes).
These two small changes show that you're actually getting a pretty good distribution fit.
histogram(cluster1,0:.01:1); hold on;
histogram(cluster2,0:.01:1);
Re-fit a Gaussian curve to each cluster
Once you have your clusters, if you treat them as independent distributions, you can smooth the tails where the two distributions merge.
gcluster1 = fitdist(cluster1,'Normal');
gcluster2 = fitdist(cluster2,'Normal');
x_values = 0:.01:1;
y1 = pdf(gcluster1,x_values);
y2 = pdf(gcluster2,x_values);
plot(x_values,y1);hold on;
plot(x_values,y2);
How are you trying to use this 'model'? If the data is constant, why don't you measure the means/variances of the two Gaussians separately?
And if you are trying to generate new values from this mixed distribution, then you can look into a mixture model with weights given to each of the above distributions.
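As a rough sketch of what generating values from such a weighted mixture looks like in base MATLAB: first pick a component according to the weights, then draw from that component's Gaussian. The weights, means and standard deviations below are made-up placeholders, not values fitted to your image.

```matlab
rng(0);                                   % fixed seed so the run is repeatable
w     = [0.3; 0.7];                       % mixture weights (hypothetical, must sum to 1)
mu    = [0.2; 0.7];                       % component means (hypothetical)
sigma = [0.05; 0.1];                      % component standard deviations (hypothetical)
n = 1e5;

% Step 1: choose component 1 with probability w(1), otherwise component 2
comp = 1 + (rand(n,1) > w(1));

% Step 2: draw each sample from the Gaussian of its chosen component
samples = mu(comp) + sigma(comp) .* randn(n,1);

fprintf('sample mean %.3f (mixture mean %.3f)\n', mean(samples), w' * mu);
```

If you already have a fitted gmdistribution object from fitgmdist, its random method does the same job using the fitted weights and parameters.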

Matlab: plotting frequency distribution with a curve

I have to plot 10 frequency distributions on one graph. In order to keep things tidy, I would like to avoid making a histogram with bins and would prefer having lines that follow the contour of each histogram plot.
I tried the following
[counts, bins] = hist(data);
plot(bins, counts)
But this gives me a very inexact and jagged line.
I read about ksdensity, which gives me a nice curve, but it changes the scaling of my y-axis and I need to be able to read the frequencies from the y-axis.
Can you recommend anything else?
You're using the default number of bins for your histogram and, I will assume, for your kernel density estimation calculations.
Depending on how many data points you have, that will certainly not be optimal, as you've discovered. The first thing to try is to calculate the optimum bin width, which gives the smoothest curve while preserving the underlying PDF as well as possible (see also here, here, and here).
If you still don't like how smooth the resulting plot is, you could try using the bins output from hist as a further input to ksdensity. Perhaps something like this:
[kcounts,kbins] = ksdensity(data,bins,'npoints',length(bins));
I don't have your data, so you may have to play with the parameters a bit to get exactly what you want.
Alternatively, you could try fitting a spline through the points that you get from hist and plotting that instead.
Some code:
data = randn(1,1e4);
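optN = sshist(data); % sshist is a third-party function (not built into MATLAB) that estimates the optimal number of bins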
figure(1)
[N,Center] = hist(data);
[Nopt,CenterOpt] = hist(data,optN);
[f,xi] = ksdensity(data,CenterOpt);
dN = mode(diff(Center));
dNopt = mode(diff(CenterOpt));
plot(Center,N/dN,'.-',CenterOpt,Nopt/dNopt,'.-',xi,f*length(data),'.-')
legend('Default','Optimum','ksdensity')
The result:
Note that the "optimum" bin width preserves some of the fine structure of the distribution (I had to run this a couple times to get the spikes) while the ksdensity gives a smooth curve. Depending on what you're looking for in your data, that may be either good or bad.
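On the y-axis concern from the question: a density curve can be put back on the frequency scale by multiplying it by the number of observations and the bin width of the histogram you want to compare against. The sketch below uses a hand-written Gaussian kernel estimate so that it runs without any toolbox. The bandwidth rule (Silverman's rule of thumb) and all variable names are assumptions, and the output of ksdensity can be scaled in the same way.

```matlab
rng(0);                                        % fixed seed so the run is repeatable
data = randn(1, 1e4);                          % example data, as in the code above

[counts, centers] = hist(data, 20);            % reference histogram with 20 bins
binWidth = centers(2) - centers(1);

% Simple Gaussian kernel density estimate
h  = 1.06 * std(data) * numel(data)^(-1/5);    % rule-of-thumb bandwidth (an assumption)
xi = linspace(min(data), max(data), 200);
f  = zeros(size(xi));
for k = 1:numel(xi)
    f(k) = mean(exp(-0.5*((xi(k) - data)/h).^2)) / (h*sqrt(2*pi));
end

% Density -> expected count per bin of width binWidth
kdeCounts = f * numel(data) * binWidth;
```

Plotting with plot(centers, counts, 'o', xi, kdeCounts, '-') then gives a smooth line whose y-axis reads in counts per bin, the same scale as the histogram.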
How about interpolating with splines?
nbins = 10; %// number of bins for original histogram
n_interp = 500; %// number of values for interpolation
[counts, bins] = hist(data, nbins);
bins_interp = linspace(bins(1), bins(end), n_interp);
counts_interp = interp1(bins, counts, bins_interp, 'spline');
plot(bins, counts) %// original histogram
figure
plot(bins_interp, counts_interp) %// interpolated histogram
Example: let
data = randn(1,1e4);
Original histogram:
Interpolated:
Following your code, the y axis in the above figures gives the count, not the probability density. To get probability density you need to normalize:
normalization = 1/(bins(2)-bins(1))/sum(counts);
plot(bins, counts*normalization) %// original histogram
plot(bins_interp, counts_interp*normalization) %// interpolated histogram
Check: total area should be approximately 1:
>> trapz(bins_interp, counts_interp*normalization)
ans =
1.0009

MATLAB Fitting Function

I am trying to fit a line to some data without using polyfit and polyval. I got some good help already on how to implement this and I have gotten it to work with a simple sin function. However, when applied to the function I am trying to fit, it does not work. Here is my code:
clear all
clc
lb=0.001; %lowerbound of data
ub=10; %upperbound of data
step=.1; %step-size through data
a=.03;
la=1482/120000; %1482 is speed of sound in water and 120kHz
ep1=.02;
ep2=.1;
x=lb:step:ub;
r_sq_des=0.90; %desired value of r^2 for the fit of data without noise present
i=1;
for x=lb:step:ub
G(i,1)= abs(sin((a/la)*pi*x*(sqrt(1+(1/x)^2)-1)));
N(i,1)=2*rand()-1;
Ghat(i,1)=(1+ep1*N(i,1))*G(i,1)+ep2*N(i,1);
r(i,1)=x;
i=i+1;
end
x=r;
y=G;
V=[x.^0];
Vfit=[x.^0];
for i=1:1:1000
V = [x.^i V];
c = V \ y;
Vfit = [x.^i Vfit];
yFit=Vfit*c;
plot(x,y,'o',x,yFit,'--')
drawnow
pause
end
The first two sections are just defining variables and the function. The second for loop is where I am making the fit. As you can see, I have it pause after every nth order in order to see the fit.
I changed your fit formula a bit. I got the same answers but quickly got a warning that the matrix was singular. There is no sense in continuing past the point where the inversion is singular.
Depending on what you are doing, you can usually change variables or change domains. This doesn't do a lot better, but it seemed to help a little bit.
I increased the number of samples by a factor of 10, since the initial part of the curve didn't look sampled finely enough.
I added a weighting variable, but it is set to equal weights in the code below. Attempts to deweight the tail didn't help as much as I hoped.
Probably not really a solution, but perhaps it will help with a few more knobs/variables.
...
step=.01; %step-size through data
...
x=r;
y=G;
t=x.*sqrt(1+x.^(-2));
t=log(t);
V=[ t.^0];
w=ones(size(t));
for i=1:1:1000
% Solve for the coefficients c such that yhat = V*c approximates y:
% y = V*c
% V'*y = V'*V * c
% c = (V'*V) \ V'*y
V = [t.^i V];
c = (V'*diag(w.^2)*V ) \ (V'*diag(w.^2)*y) ;
yFit=V*c;
subplot(211)
plot(t,y,'o',t,yFit,'--')
subplot(212)
plot(x,y,'o',x,yFit,'--')
drawnow
pause
end
It looks more like a frequency estimation problem, and trying to fit an unknown frequency with a polynomial tends to be touch and go. Replacing the polynomial basis with a quick sin/cos basis didn't seem to do too badly.
V = [sin(t*i) cos(t*i) V];
Unless you specifically need a polynomial basis, you can apply your knowledge of the problem domain to find other potential basis functions for your fit, or to attempt to make the domain in which you are performing the fit more linear.
As dennis mentioned, a different set of basis functions might do better. However, you can improve the polynomial fit by using QR factorisation, rather than just \, to solve the matrix equation. It is a badly conditioned problem no matter what you do, however, and using smooth basis functions won't allow you to accurately reproduce the sharp corners in the actual function.
clear all
close all
clc
lb=0.001; %lowerbound of data
ub=10; %upperbound of data
step=.1; %step-size through data
a=.03;
la=1482/120000; %1482 is speed of sound in water and 120kHz
ep1=.02;
ep2=.1;
x=logspace(log10(lb),log10(ub),100)';
r_sq_des=0.90; %desired value of r^2 for the fit of data without noise present
y=abs(sin(a/la*pi*x.*(sqrt(1+(1./x).^2)-1)));
N=2*rand(size(x))-1;
Ghat=(1+ep1*N).*y+ep2*N;
V=[x.^0];
xs=(lb:.01:ub)';
Vfit=[xs.^0];
for i=1:1:20%length(x)-1
V = [x.^i V];
Vfit = [xs.^i Vfit];
[Q,R]=qr(V,0);
c = R\(Q'*y);
yFit=Vfit*c;
plot(x,y,'o',xs,yFit)
axis([0 10 0 1])
drawnow
pause
end
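To put numbers on the conditioning remarks above, the following sketch compares the condition number of the Vandermonde matrix V with that of V'*V (used by the normal equations), and shows how centring and scaling x helps. The degree of 5, the uniform sampling and the variable names are assumptions for illustration. The x.^(deg:-1:0) expression relies on implicit expansion (MATLAB R2016b or later).

```matlab
a = 0.03;
la = 1482/120000;
x = linspace(0.001, 10, 100)';
y = abs(sin(a/la*pi*x.*(sqrt(1 + (1./x).^2) - 1)));

deg = 5;                                  % polynomial degree (assumed for this illustration)
V = x .^ (deg:-1:0);                      % Vandermonde matrix, highest power first

% Least-squares solution via economy-size QR, and via backslash for comparison
[Q, R] = qr(V, 0);
c_qr = R \ (Q' * y);
c_bs = V \ y;

condV      = cond(V);                     % conditioning seen by QR / backslash
condNormal = cond(V' * V);                % conditioning of the normal equations, roughly condV^2

% Centre and scale x before building the basis
xs = (x - mean(x)) / std(x);
Vs = xs .^ (deg:-1:0);
condScaled = cond(Vs);

fprintf('cond(V) = %.2e, cond(V''*V) = %.2e, cond(scaled V) = %.2e\n', ...
        condV, condNormal, condScaled);
```

Working on V directly avoids squaring the condition number, and the scaled basis is better conditioned still. Neither changes the fact that a smooth polynomial cannot follow the sharp corners of abs(sin(...)).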