I have data like this:
I have plotted a histogram for each month:
I cannot figure out which probability distribution fits it best.
I tried to fit a gamma distribution, but the results are bad:
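import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

# Method-of-moments estimates: shape = mean^2 / var, scale = var / mean
alpha_mom = precip_mean ** 2 / precip_var
beta_mom = precip_var / precip_mean

unemployment.Jan.hist(density=True, bins=20)  # `normed` is deprecated in favour of `density`
x = np.linspace(0, 10)
# The third positional argument of gamma.pdf is `loc`, so the scale must be passed by keyword
plt.plot(x, gamma.pdf(x, alpha_mom[2], scale=beta_mom[2]))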
Can someone guide me on which distribution will fit the above data? I could not figure it out.
Here are the results of the fit for the month of January:
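One way to approach this kind of question, sketched here under the assumption that the monthly values are positive and roughly continuous, is to fit several candidate distributions by maximum likelihood with `scipy.stats` and rank them by a goodness-of-fit measure. The `sample` array below is synthetic stand-in data, not the data from the question, and the candidate list is only illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one month's values (hypothetical; replace with real data)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=1.5, size=500)

# Candidate distributions to compare (an illustrative choice, not exhaustive)
candidates = {
    "gamma": stats.gamma,
    "lognorm": stats.lognorm,
    "weibull_min": stats.weibull_min,
}

results = {}
for name, dist in candidates.items():
    # Fix loc at 0 so that only the shape and scale are estimated
    params = dist.fit(sample, floc=0)
    loglik = np.sum(dist.logpdf(sample, *params))
    k = len(params) - 1  # loc is fixed, so it is not a free parameter
    aic = 2 * k - 2 * loglik
    # Kolmogorov-Smirnov distance between the sample and the fitted CDF
    ks = stats.kstest(sample, dist.cdf, args=params).statistic
    results[name] = {"params": params, "aic": aic, "ks": ks}

# Lower AIC and lower KS distance both indicate a closer fit
for name, r in sorted(results.items(), key=lambda kv: kv[1]["aic"]):
    print(f"{name:12s} AIC={r['aic']:.1f} KS={r['ks']:.3f}")
```

Neither measure proves that a distribution is the right one; they only rank the candidates that were tried, so a visual check against the histogram is still worthwhile.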
Related
I have previously posted this on the Mathworks Community, but am reposting here for a wider audience...
I have a 1-dimensional histogram, to which I want to fit Gaussians:
In the above example I need to find the centres of the 4 dominant peaks, however, the number of peaks may vary in a different Histogram.
Below is a MWE of my approach:
bins = 2000;
fsc_hist = histogram(FSC_data.FSC_HF,bins);hold on;
%% smooth data to get rid of discretization
fscValues = fsc_hist.Values;
binStep = (fsc_hist.BinLimits(2)-fsc_hist.BinLimits(1))/fsc_hist.NumBins;
binCenters = binStep * [0:fsc_hist.NumBins-1];
smoothValues = smooth(binCenters, fscValues, 0.1, 'rloess');
%% fit GMM
expectedPeaks = 4;
gmm = fitgmdist(smoothValues, expectedPeaks, 'RegularizationValue', 0.1);
Which returns the following GMM result:
Gaussian mixture distribution with 4 components in 1 dimensions
Component 1: Mixing proportion: 0.294734 Mean: 0.2417
Component 2: Mixing proportion: 0.152275 Mean: 41.9369
Component 3: Mixing proportion: 0.344658 Mean: 6.8231
Component 4: Mixing proportion: 0.208333 Mean: 24.6758
Obviously, the calculated mean values of the Gaussians are not correct.
Where is my approach going wrong? I believe that either my first input to the fitgmdist function must somehow be normalised, or that I need to post-process the output. So far, my attempts have failed.
What's happening is that the mixture model is giving you the means of Gaussians fitted to the bin counts, not to the data values. Instead of passing the (smoothed) histogram to fitgmdist, pass the raw FSC_data.FSC_HF data as the first argument.
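The difference between fitting the samples and fitting the counts can be illustrated outside MATLAB. The following is a minimal sketch in Python, assuming three well-separated synthetic peaks; `fit_gmm_1d` is a hypothetical helper written for this illustration (plain EM in one dimension), not a port of `fitgmdist`. It also shows that, when only a histogram is available, repeating each bin centre by its count gives pseudo-samples that can be fitted instead.

```python
import numpy as np

def fit_gmm_1d(x, k, n_iter=300):
    """Fit a 1-D Gaussian mixture to samples x with plain EM (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    # Initialise means from evenly spaced quantiles, with equal weights
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        diff2 = (x[:, None] - means) ** 2
        dens = np.exp(-0.5 * diff2 / variances) / np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        weights = nk / x.size
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    order = np.argsort(means)
    return weights[order], means[order], variances[order]

# Synthetic data with peaks at 10, 30 and 60 (hypothetical, not the FSC data)
rng = np.random.default_rng(0)
raw = np.concatenate([rng.normal(m, 2.0, 1000) for m in (10, 30, 60)])

# Fit the raw samples: the means are in the units of the data
_, means_raw, _ = fit_gmm_1d(raw, 3)

# If only a histogram exists, expand bin centres by their counts first
counts, edges = np.histogram(raw, bins=200)
centers = (edges[:-1] + edges[1:]) / 2
pseudo = np.repeat(centers, counts)
_, means_hist, _ = fit_gmm_1d(pseudo, 3)

print(means_raw)
print(means_hist)
```

Passing `counts` itself to the fit would instead model the distribution of bin heights, which is why the means reported in the question are not peak positions.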
I am trying to fit a distribution curve to the histogram of some data. (I have used some model data here instead because it is difficult to upload the actual data. I have included the complete code after my question.)
Because the histogram looks normally distributed when I plot the x-axis on a log scale, I transformed the data first before fitting a normal distribution to it, and I got the following results:
>>pdn=fitdist(log(data),'Normal')
pdn =
Normal distribution
mu = -0.334458 [-0.34704, -0.321876]
sigma = 0.351478 [0.342804, 0.360605]
When I plotted out the pdf with the histogram, I got this:
The result seems reasonable to me. Then I discovered that the Matlab fitdist() already has a 'Lognormal' option, so I don't really need to transform my data first. This is what I got:
>>pdln = fitdist(data,'Lognormal')
pdln =
Lognormal distribution
mu = -0.334458 [-0.34704, -0.321876]
sigma = 0.351478 [0.342804, 0.360605]
Exactly the same mean and standard deviation as I have got before. However, when I plotted it out with the histogram, I got a different curve:
This curve fits better to the data but the positions of the mean and the mean+/-std points are not as I have expected (i.e. mean at the peak and the mean+/-std at the same levels).
Which brings me to my question: why would fitdist(data,'Lognormal') give the same result as fitdist(log(data),'Normal') but a different plot? I have looked through the Matlab help pages and I still cannot understand why, or where my mistakes are. Please help.
My aim for all this is to get some numerical parameters about the distributions of my data under different conditions and compare them to see if there is any difference. At the moment, I am not certain which way would give me reliable estimates of the means and standard deviations.
The code for the graphs is below:
%random data in lognormal distribution
mu=-0.335742;
sigma=0.35228;
data=lognrnd(mu,sigma,[3000 1]);
%make histogram
interval=0.1;
svalue=sort(data);
bx(1)=interval/2;
i=2;
while bx(i-1)<=max(svalue)
bx(i)=bx(i-1)+interval;
i=i+1;
end
by=hist(svalue,bx);
subplot(211)
h = bar(bx,by,'hist');
set(h,'FaceColor',[.9 .9 .9]);
set(gca,'xlim',[0.05 10]);
xticks=[0.05 0.1 0.2 0.5 1 2 5 10];
set(gca,'xscale','log','xminortick','on')
set(gca,'xtick',xticks)
ylabel('counts')
subplot(212)
h = bar(bx,by,'hist');
set(h,'FaceColor',[.9 .9 .9]);
set(gca,'xlim',[0.05 10]);
xticks=[0.05 0.1 0.2 0.5 1 2 5 10];
set(gca,'xscale','log','xminortick','on')
set(gca,'xtick',xticks)
ylabel('counts')
% fit distribution curves
pdf_x = 0:0.01:max(data);
max_by=max(by); % for scaling the pdf to the histogram
% case 1 - PDF fitted using fitdist(log(data),'Normal')
subplot(211)
hold on
pdn = fitdist(log(data),'Normal')
pdf_y = pdf(pdn,log(pdf_x));
h1=plot(pdf_x,pdf_y./max(pdf_y).*max_by,'-k');
range=[exp(pdn.mu-pdn.sigma) exp(pdn.mu+pdn.sigma)];
h2=plot(exp(pdn.mu),pdf(pdn,(pdn.mu))./max(pdf_y).*max_by,'sk') ;
h3=plot(range,pdf(pdn,log(range))./max(pdf_y).*max_by,'ok') ;
title('PDF fitted using fitdist(log(data),''Normal'')');
legend([h1 h2 h3],'pdf','mean','mean+/-std');
% case 2 - PDF fitted using fitdist(data,'Lognormal')
subplot(212)
hold on
pdln = fitdist(data,'Lognormal')
pdf_y = pdf(pdln,pdf_x);
h1=plot(pdf_x,pdf_y./max(pdf_y).*max_by,'-b');
range=[exp(pdln.mu-pdln.sigma) exp(pdln.mu+pdln.sigma)];
h2=plot(exp(pdln.mu),pdf(pdln,exp(pdln.mu))./max(pdf_y).*max_by,'sb');
h3=plot(range,pdf(pdln,range)./max(pdf_y).*max_by,'ob') ;
title('PDF fitted using fitdist(data,''Lognormal'')');
legend([h1 h2 h3],'pdf','mean','mean+/-std');
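A likely explanation for the two different curves is that both fits estimate the same mu and sigma of log(data), while the plotted densities are taken with respect to different variables: the normal pdf is a density in log(x), and the lognormal pdf is a density in x, so they differ by a factor of 1/x. A minimal sketch in Python, using the fitted values quoted above and `scipy.stats` rather than MATLAB (the grid `x` is an arbitrary choice), checks this relation and shows that exp(mu) is the median of the data rather than its mean or the peak of the pdf.

```python
import numpy as np
from scipy import stats

mu, sigma = -0.334458, 0.351478  # fitted values quoted in the question

x = np.linspace(0.05, 5, 500)
# Density of log(data), evaluated at log(x): a normal curve on the log axis
pdf_log_axis = stats.norm.pdf(np.log(x), loc=mu, scale=sigma)
# Density of the data itself: scipy's lognorm uses s=sigma and scale=exp(mu)
pdf_lin_axis = stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))

# The two densities differ by the Jacobian of the log transform, 1/x
assert np.allclose(pdf_lin_axis, pdf_log_axis / x)

median = np.exp(mu)                         # exp(mu) is the median, not the mean
mode = np.exp(mu - sigma**2)                # location of the lognormal pdf peak
mean = np.exp(mu + sigma**2 / 2)            # mean of the data
std = mean * np.sqrt(np.exp(sigma**2) - 1)  # standard deviation of the data
print(median, mode, mean, std)
```

For comparing conditions, mu and sigma describe the data on the log scale, while `mean` and `std` above describe it on the original scale; either pair is a consistent summary as long as the same one is used throughout.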
I can't get my mind around the concept of how to calculate bias and variance from a random set.
I have created the code to generate a random normal set of numbers.
% Generate random w, x, and noise from standard Gaussian
w = randn(10,1);
x = randn(600,10);
noise = randn(600,1);
and then extract the y values
y = x*w + noise;
After that I split my data into a training (100) and test (500) set
% Split data set into a training (100) and a test set (500)
x_train = x([ 1:100],:);
x_test = x([101:600],:);
y_train = y([ 1:100],:);
y_test = y([101:600],:);
train_l = length(y_train);
test_l = length(y_test);
Then I calculated the w for a specific value of lambda (1.2)
lambda = 1.2;
% Calculate the optimal w
A = x_train'*x_train+lambda*train_l*eye(10,10);
B = x_train'*y_train;
w_train = A\B;
Finally, I am computing the square error:
% Compute the mean squared error on both the training and the
% test set
sum_train = sum((x_train*w_train - y_train).^2);
MSE_train = sum_train/train_l;
sum_test = sum((x_test*w_train - y_test).^2);
MSE_test = sum_test/test_l;
I know that if I create a vector of lambda (I have already done that) over some iterations I can plot the average MSE_train and MSE_test as a function of lambda, where then I will be able to verify that large differences between MSE_test and MSE_train indicate high variance, thus overfit.
But, what I want to do extra, is to calculate the variance and the bias^2.
Taken from Ridge Regression Notes at page 7, it guides us how to calculate the bias and the variance.
My question is: should I follow its steps on the whole random dataset (600) or on the training set? I think the bias^2 and the variance should be calculated on the training set. Also, in Theorem 2 (page 7 again) the bias is calculated as the negative product of lambda, W, and beta. Is beta my original w (w = randn(10,1))?
Sorry for the long post, but I really want to understand how the concept works in practice.
UPDATE 1:
Ok, so following the previous paper didn't generate any good results. So, I took the standard form of Ridge Regression Bias-Variance which is:
Based on that, I created (I used the test set):
% Bias and Variance
sum_bias=sum((y_test - mean(x_test*w_train)).^2);
Bias = sum_bias/test_l;
sum_var=sum((mean(x_test*w_train)- x_test*w_train).^2);
Variance = sum_var/test_l;
But, after 200 iterations and for 10 different lambdas this is what I get, which is not what I expected.
Where in fact, I was hoping for something like this:
sum_bias=sum((y_test - mean(x_test*w_train)).^2); Bias = sum_bias/test_l
Why have you squared the difference between y_test and y_predicted = x_test*w_train?
I don't believe your formula for the bias is correct. In your question, the 'bias term' shown in blue above is the bias^2. Your formula, however, is neither the bias nor the bias^2, since you have squared the individual residuals rather than the entire bias.
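In the usual decomposition, the bias and the variance are expectations over repeated training sets, so a single train/test split cannot provide them: bias^2 compares the average prediction with the noise-free target, and the variance measures how predictions scatter around that average. The following is a minimal Monte Carlo sketch in Python under the same setup as the question (standard Gaussian w, x and noise, 100 training points, the same ridge normal equations). The number of repetitions and the lambda grid are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, n_rep = 10, 100, 500, 200
w_true = rng.standard_normal(d)
x_test = rng.standard_normal((n_test, d))
f_test = x_test @ w_true  # noise-free targets at the test points

def ridge_fit(x, y, lam):
    # Same normal equations as in the question: (X'X + lam*n*I) w = X'y
    n = x.shape[0]
    return np.linalg.solve(x.T @ x + lam * n * np.eye(x.shape[1]), x.T @ y)

def bias_variance(lam):
    preds = np.empty((n_rep, n_test))
    for r in range(n_rep):
        # Draw a fresh training set on every repetition
        x_tr = rng.standard_normal((n_train, d))
        y_tr = x_tr @ w_true + rng.standard_normal(n_train)
        preds[r] = x_test @ ridge_fit(x_tr, y_tr, lam)
    mean_pred = preds.mean(axis=0)
    # Bias^2: squared gap between the average prediction and the true function
    bias2 = np.mean((mean_pred - f_test) ** 2)
    # Variance: spread of the predictions around their own average
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for lam in (0.001, 0.1, 1.2, 10.0):
    b2, v = bias_variance(lam)
    print(f"lambda={lam:<6} bias^2={b2:.4f} variance={v:.4f}")
```

With this setup, bias^2 should grow and the variance should shrink as lambda increases, which is the shape of the expected plot.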
I've made a GMModel using fitgmdist. The idea is to produce two gaussian distributions on the data and use that to predict their labels. How can I determine if a future data point fits into one of those distributions? Am I misunderstanding the purpose of a GMModel?
clear;
load C:\Users\Daniel\Downloads\data1 data;
% Mixed Gaussian
GMModel = fitgmdist(data(:, 1:4),2)
Produces
GMModel =
Gaussian mixture distribution with 2 components in 4 dimensions
Component 1:
Mixing proportion: 0.509709
Mean: 2.3254 -2.5373 3.9288 0.4863
Component 2:
Mixing proportion: 0.490291
Mean: 2.5161 -2.6390 0.8930 0.4833
Edit:
clear;
load C:\Users\Daniel\Downloads\data1 data;
% Mixed Gaussian
GMModel = fitgmdist(data(:, 1:4),2);
P = posterior(GMModel, data(:, 1:4));
X = round(P)
blah = X(:, 1)
dah = data(:, 5)
Y = max(mean(blah == dah), mean(~blah == dah))
I don't understand why you round the posterior values. Here is what I would do after fitting a mixture model.
P = posterior(GMModel, data(:, 1:4));
[~,Y] = max(P,[],2);
Now Y contains the labels, that is, the index of the Gaussian each data point belongs to in the maximum a posteriori (MAP) sense. It is important to align the labels before evaluating the classification error, since renumbering might happen: Gaussian component 1 in the true labels might be component 2 in the clustering produced, and so on. That may be why you are getting accuracy varying from 51% to 95%, in addition to other subtle problems.
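The MAP assignment and the label alignment can be sketched as follows, in Python with NumPy rather than MATLAB. The posterior matrix `P` and the labels `truth` are made-up values for illustration, labels are 0-based here (MATLAB's would be 1-based), and `aligned_accuracy` is a hypothetical helper that tries every relabelling, which is only practical for a small number of components.

```python
import numpy as np
from itertools import permutations

def map_labels(posterior):
    # MAP assignment: index of the component with the largest posterior
    return np.argmax(posterior, axis=1)

def aligned_accuracy(pred, truth, k):
    # Try every relabelling of the k components and keep the best match
    best_acc, best_perm = -1.0, None
    for perm in permutations(range(k)):
        relabelled = np.asarray(perm)[pred]
        acc = np.mean(relabelled == truth)
        if acc > best_acc:
            best_acc, best_perm = acc, perm
    return best_acc, best_perm

# Hypothetical posterior matrix (rows sum to 1) and true labels in {0, 1}
P = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.6, 0.4]])
truth = np.array([1, 1, 0, 0, 0])

pred = map_labels(P)  # [0, 0, 1, 1, 0]
acc, perm = aligned_accuracy(pred, truth, k=2)
print(acc, perm)
```

In this toy case the clustering numbers the components the other way round from the true labels, so the unaligned accuracy is 20% while the aligned accuracy is 80%.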
I want to compare the actual distribution of a time series with the normal law having same mean and std deviation. The interval on which I am computing this Gaussian distribution starts at the min and ends at the max values of the time series.
The problem is that I obtain a gaussian which has the classic bell shape but is shifted upwards, since the integral of the normal pdf is about 9.
Here is my code:
N = 30; %number of segments in the interval
DISTR = struct('interval',NaN(size(shares.ret,1),N-1),'perf',NaN(size(shares.ret,1),N-1),'normal',NaN(size(shares.ret,1),N-1));
first_ret = table2array(rowfun(@(x) find(~isnan(x),1),shares.ret,'SeparateInputs',false)); %THIS LINE allows the distribution to be calculated for EACH FUND over its own time horizon
for i = 1:size(shares.ret,1)
xbins = linspace(min(shares.ret{i,first_ret(i):end},[],2),max(shares.ret{i,first_ret(i):end},[],2),N);
y = (xbins(2)-xbins(1))/2;
DISTR.interval(i,:) = xbins(1:end-1)+y;
DISTR.perf(i,:) = hist(shares.ret{i,first_ret(i):end},DISTR.interval(i,1:end))/sum(~isnan(shares.ret{i,first_ret(i):end}),2);
DISTR.normal(i,:) = normpdf(DISTR.interval(i,:),mean(shares.ret{i,first_ret(i):end}),std(shares.ret{i,first_ret(i):end}));
end
Here I found a similar question that I didn't understand how to adapt to my case.
Any suggestion/help will be really appreciated.
Thanks
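A probable cause of the mismatch is a difference in scale: relative frequencies sum to 1 across the bins, whereas a pdf integrates to 1, so pdf values sampled at the bin centres sum to roughly 1 divided by the bin width. A minimal sketch in Python, assuming a synthetic return series in place of `shares.ret` (the variable names and the bin count are illustrative), shows the two equivalent ways of putting the histogram and the normal curve on the same scale.

```python
import numpy as np
from scipy import stats

# Hypothetical return series standing in for one row of shares.ret
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.001, scale=0.02, size=2000)

n_bins = 29
counts, edges = np.histogram(returns, bins=n_bins)
width = edges[1] - edges[0]
centers = edges[:-1] + width / 2

# Option 1: turn relative frequencies into a density by dividing by the bin width
rel_freq = counts / counts.sum()
density = rel_freq / width

normal_pdf = stats.norm.pdf(centers, loc=returns.mean(), scale=returns.std(ddof=1))

# Option 2: keep the relative frequencies and scale the pdf by the bin width
normal_prob = normal_pdf * width

print(density.sum() * width)  # area under the histogram
print(normal_prob.sum())      # approximate normal mass between min and max
```

Either `density` against `normal_pdf` or `rel_freq` against `normal_prob` gives curves on a common scale; mixing `rel_freq` with `normal_pdf` reproduces the upward shift described in the question.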