sum of MATLAB gaussian distribution is greater than 1 - matlab

I want to compare the actual distribution of a time series with the normal law having same mean and std deviation. The interval on which I am computing this Gaussian distribution starts at the min and ends at the max values of the time series.
The problem is that I obtain a gaussian which has the classic bell shape but is shifted upwards, since the integral of the normal pdf is about 9.
Here it is my code:
N = 30; %number of segments in the interval
DISTR = struct('interval',NaN(size(shares.ret,1),N-1),'perf',NaN(size(shares.ret,1),N-1),'normal',NaN(size(shares.ret,1),N-1));
first_ret = table2array(rowfun(#(x) find(~isnan(x),1),shares.ret,'SeparateInputs',false)); %THIS LINE allows to calculate the distribution for EACH FUND on his own time horizon
for i = 1:size(shares.ret,1)
xbins = linspace(min(shares.ret{i,first_ret(i):end},[],2),max(shares.ret{i,first_ret(i):end},[],2),N);
y = (xbins(2)-xbins(1))/2;
DISTR.interval(i,:) = xbins(1:end-1)+y;
DISTR.perf(i,:) = hist(shares.ret{i,first_ret(i):end},DISTR.interval(i,1:end))/sum(~isnan(shares.ret{i,first_ret(i):end}),2);
DISTR.normal(i,:) = normpdf(DISTR.interval(i,:),mean(shares.ret{i,first_ret(i):end}),std(shares.ret{i,first_ret(i):end}));
end
Here I found a similar question that I didn't understand how to adapt to my case.
Any suggestion/help will be really appreciated.
Thanks

Related

Matlab: generate random numbers from custom made probability density function

I have a dataset with 3-hourly precipitation amounts for the month of January in the period 1977-1983 (see attachment). However, I want to generate precipitation data for the period 1984-1990 based upon these data. Therefore, I was wondering if it would be possible to create a custom made probability density function of the precipitation amounts (1977-1983) and from this, generate random numbers (precipitation data) for the desired period (1984-1990).
Is this possible in Matlab and could someone help me by doing so?
Thanks in advance!
A histogram will give you an estimate of the PDF -- just divide the bin counts by the total number of samples. From there you can estimate the CDF by integrating. Finally, you can choose a uniformly distributed random number between 0 and 1 and estimate the argument of the CDF that would yield that number. That is, if y is the random number you choose, then you want to find x such that CDF(x) = y. The value of x will be a random number with the desired PDF.
If you have 'Statistics and Machine Learning Toolbox', you can evaluate the PDF of the data with 'Kernel Distribution' method:
Percip_pd = fitdist(Percip,'Kernel');
Then use it for generating N random numbers from the same distribution:
y = random(Percip_pd,N,1);
Quoting #AnonSubmitter85:
"estimate the CDF by integrating. Finally, you can choose a uniformly
distributed random number between 0 and 1 and estimate the argument of
the CDF that would yield that number. That is, if y is the random
number you choose, then you want to find x such that CDF(x) = y. The
value of x will be a random number with the desired PDF."
%random sampling
N=10; %number of resamples
pdf = normrnd(0, 1, 1,100); %your pdf
s = cumsum(pdf); %its cumulative distribution
r = rand(N,1); %random numbers between 0 and 1
for ii=1:N
inds = find(s>r(ii));
indeces(ii)=inds(1); %find first value greater than the random number
end
resamples = pdf(indeces) %the resamples

Normal distribution with the minimum values in Matlab

I tried to generate 1000 the random values in normal distribution by the normrnd function.
A = normrnd(4,1,[1000 1]);
I would like to set the minimum value is 2. However, that function just can define the mean and sd. How can I set the minimum value is 2 ?
You can't. Gaussian or normally distributed numbers are in a bell curve, with the tails tailing off to infinity. What you can do is "censor" them by eliminating every number beyond a cut-off.
Since you choose mean = 4 and sigma = 1, you will end up ~95% elements of A fall within range [2,6]. The number of elements whose values smaller than 2 is about 2.5%. If you consider this figure is small, you can wrap these elements to a minimum value. For example:
A = normrnd(4,1,[1000 1]);
A(A < 2) = A(A<2) + 2 - min(A(A<2))
Of course, it is technically not gaussian distribution. However if you have total control of mean and sigma, you can get a "more gaussian like" distribution by adding an offset to A:
A = A + 2 - min(A)
Note: This assumes you can have an arbitrarily set standard deviation, which may not be the case
As others have said, you cannot specify a lower bound for a true Gaussian. However, you can generate a Gaussian and estimate 1-p percent of values to be above and then ignore p percent of values (which will fall outside your cutoff).
For example, in the following code, I am generating a Gaussian where 95% of data-points fall above 2. Then I am removing all points below 2, knowing that 5% of data will be removed.
This is a solution because setting as p gets closer to zero, your chances of getting uncensored sample data that follows your Gaussian curve and is entirely above your cutoff goes to 100% (Really it's defined by the p/n ratio, but if n is fixed this is true).
n = 1000; % number of samples
cutoff = 2; % Cutoff point for min-value
mu = 4; % Mean
p = .05; % Percentile you would like to cutoff
z = -sqrt(2) * erfcinv(p*2); % compute z score
sigma = (cutoff - mu)/z; % compute standard deviation
A = normrnd(mu,sigma,[n 1]);
I would recommend removing values below the cutoff rather than re-attributing them to the lower bound of your distribution, but that is up to you.
A(A<cutoff) = []; % removes all values of A less than cutoff
If you want to be symmetrical (which you should to prevent sample skew) the following should work.
A(A>(2*mu-cutoff)) = [];

Computing a moving average

I need to compute a moving average over a data series, within a for loop. I have to get the moving average over N=9 days. The array I'm computing in is 4 series of 365 values (M), which itself are mean values of another set of data. I want to plot the mean values of my data with the moving average in one plot.
I googled a bit about moving averages and the "conv" command and found something which i tried implementing in my code.:
hold on
for ii=1:4;
M=mean(C{ii},2)
wts = [1/24;repmat(1/12,11,1);1/24];
Ms=conv(M,wts,'valid')
plot(M)
plot(Ms,'r')
end
hold off
So basically, I compute my mean and plot it with a (wrong) moving average. I picked the "wts" value right off the mathworks site, so that is incorrect. (source: http://www.mathworks.nl/help/econ/moving-average-trend-estimation.html) My problem though, is that I do not understand what this "wts" is. Could anyone explain? If it has something to do with the weights of the values: that is invalid in this case. All values are weighted the same.
And if I am doing this entirely wrong, could I get some help with it?
My sincerest thanks.
There are two more alternatives:
1) filter
From the doc:
You can use filter to find a running average without using a for loop.
This example finds the running average of a 16-element vector, using a
window size of 5.
data = [1:0.2:4]'; %'
windowSize = 5;
filter(ones(1,windowSize)/windowSize,1,data)
2) smooth as part of the Curve Fitting Toolbox (which is available in most cases)
From the doc:
yy = smooth(y) smooths the data in the column vector y using a moving
average filter. Results are returned in the column vector yy. The
default span for the moving average is 5.
%// Create noisy data with outliers:
x = 15*rand(150,1);
y = sin(x) + 0.5*(rand(size(x))-0.5);
y(ceil(length(x)*rand(2,1))) = 3;
%// Smooth the data using the loess and rloess methods with a span of 10%:
yy1 = smooth(x,y,0.1,'loess');
yy2 = smooth(x,y,0.1,'rloess');
In 2016 MATLAB added the movmean function that calculates a moving average:
N = 9;
M_moving_average = movmean(M,N)
Using conv is an excellent way to implement a moving average. In the code you are using, wts is how much you are weighing each value (as you guessed). the sum of that vector should always be equal to one. If you wish to weight each value evenly and do a size N moving filter then you would want to do
N = 7;
wts = ones(N,1)/N;
sum(wts) % result = 1
Using the 'valid' argument in conv will result in having fewer values in Ms than you have in M. Use 'same' if you don't mind the effects of zero padding. If you have the signal processing toolbox you can use cconv if you want to try a circular moving average. Something like
N = 7;
wts = ones(N,1)/N;
cconv(x,wts,N);
should work.
You should read the conv and cconv documentation for more information if you haven't already.
I would use this:
% does moving average on signal x, window size is w
function y = movingAverage(x, w)
k = ones(1, w) / w
y = conv(x, k, 'same');
end
ripped straight from here.
To comment on your current implementation. wts is the weighting vector, which from the Mathworks, is a 13 point average, with special attention on the first and last point of weightings half of the rest.

Power spectral density of FFT

I have a piece of code that gets the FFT of a part of the signal and I'm now trying to get the PSD,
Fs = 44100;
cj = sqrt(-1);
%T=.6;
dt = 1/Fs;
left=test(:,1);
right=test(:,2);
time = 45;
interval =.636;
w_range = time*Fs: (time+interval)*Fs-1;
I = left(w_range);
Q = right(w_range);
n = interval * Fs;
f = -Fs/2:Fs/n:Fs/2-Fs/n;
s = I+cj.*Q;
% Smooth the signal ss = smooth(s,201);
sf = (fftshift(fft(ss(1:n)))); %FFT of signal
figure(1) plot(f,((20*log10((abs(sf))./max(abs(sf))))))
From my understanding, in order to get the PSD I just need to raise sf to the power of 2, or is there anything more I need to perform?
Technically yes, you can obtain the power-spectral density (PSD) of a periodic signal by taking the squared-magnitude of its FFT. Note that if you are going to plot it on a logarithmic decibel scale, there is really no difference between 20*log10(abs(sf)) or 10*log10(abs(sf).^2).
There is however generally more to it in the sense that the PSD estimate computed in this way tends to have a fairly large variance. There are a number of techniques which can be used to improve the estimate. A simple one consists of applying a window to sections of data, perform the FFT, then averaging the resulting PSDs (i.e. averaging the squared-magnitudes).
You are perfectly right. You just have to built the square of the absolute values.

Find only relevant points in MATLAB

I have a MATLAB function that finds charateristic points in a sample. Unfortunatley it only works about 90% of the time. But when I know at which places in the sample I am supposed to look I can increase this to almost 100%. So I would like to know if there is a function in MATLAB that would allow me to find the range where most of my results are, so I can then recalculate my characteristic points. I have a vector which stores all the results and the right results should lie inside a range of 3% between -24.000 to 24.000. Wheras wrong results are always lower than the correct range. Unfortunatley my background in statistics is very rusty so I am not sure how this would be called.
Can somebody give me a hint what I would be looking for? Is there a function build into MATLAB that would give me the smallest possible range where e.g. 90% of the results lie.
EDIT: I am sorry if I didn't make my question clear. Everything in my vector can only range between -24.000 and 24.000. About 90% of my results will be in a range which spans approximately 1.44 ([24-(-24)]*3% = 1.44). These are very likely to be the correct results. The remaining 10% are outside of that range and always lower (why I am not sure taking then mean value is a good idea). These 10% are false and result from blips in my input data. To find the remaining 10% I want to repeat my calculations, but now I only want to check the small range.
So, my goal is to identify where my correct range lies. Delete the values I have found outside of that range. And then recalculate my values, not on a range between -24.000 and 24.000, but rather on a the small range where I already found 90% of my values.
The relevant points you're looking for are the percentiles:
% generate sample data
data = [randn(900,1) ; randn(50,1)*3 + 5; ; randn(50,1)*3 - 5];
subplot(121), hist(data)
subplot(122), boxplot(data)
% find 5th, 95th percentiles (range that contains 90% of the data)
limits = prctile(data, [5 95])
% find data in that range
reducedData = data(limits(1) < data & data < limits(2));
Other approachs exist to detect outliers, such as the IQR outlier test and the three standard deviation rule, among many others:
%% three standard deviation rule
z = 3;
bounds = z * std(data)
reducedData = data( abs(data-mean(data)) < bounds );
and
%% IQR outlier test
Q = prctile(data, [25 75]);
IQ = Q(2)-Q(1);
%a = 1.5; % mild outlier
a = 3.0; % extreme outlier
bounds = [Q(1)-a*IQ , Q(2)+a*IQ]
reducedData = data(bounds(1) < data & data < bounds(2));
BTW if you want to get the z value (|X|<z) that corresponds to 90% area under the curve, use:
area = 0.9; % two-tailed probability
z = norminv(1-(1-area)/2)
Maybe you should try mean value (in matlab: mean) and standard deviation (in matlab: std)?
What is the statistic distribution of your data?
See also this wiki page, section "Interpretation and application".
In general for almost every distribution, very useful Chebyshev's inequalities take place.
In most of the cases this should work:
meanval = mean(data)
stDev = std(data)
and probably the most (75%) of your values will be placed in range:
<meanVal - 2*stDev, meanVal + 2*stDev>
it seems like maybe you want to find the number x in [-24,24] that maximizes the number of sample points in [x,x+1.44]; probably the fastest way to do this involves a sort of the sample points, which is ultimately nlog(n) time; a cheesy approximation would be as follows:
brkpoints = linspace(-24,24-1.44,n_brkpoints); %choose n_brkpoints big, but < # of sample points?
n_count = histc(data,[brkpoints,inf]); %count # data points between breakpoints;
accbins = 1.44 / (brkpoints(2) - brkpoints(1); %# of bins to accumulate;
cscount = cumsum(n_count); %half of the boxcar sum computation;
boxsum = cscount - [zeros(accbins,1);cscount(1:end-accbins)]; %2nd half;
[dum,maxi] = max(boxsum); %which interval has the maximal # counts?
lorange = brkpoints(maxi); %the lower range;
hirange = lorange + 1.44
this solution does fudge some of the corner case stuff about the bottom and top bin, etc.
note that if you're going to go by the Chebyshev inequality route, Petunin's Inequality is probably applicable, and will give a slight boost.