Is kurtosis in excess when using the function from the pyspark module? - pyspark

When using the kurtosis function from the pyspark module, pyspark.sql.functions.kurtosis(col), is the result in excess of the normal distribution? i.e., is 3 already subtracted from the kurtosis to yield k - 3?
Or would we have to calculate the excess ourselves?

I could be wrong, but since pyspark returns negative values for its kurtosis, I assume it reports excess kurtosis, i.e. it has already subtracted 3 in its calculation.
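One way to sanity-check this reasoning locally, without a Spark cluster, is to compare against SciPy, whose kurtosis defaults to the Fisher (excess) definition: for normally distributed data, the excess form hovers around 0 and the raw form around 3, so a pyspark result near 0 (or negative) on near-normal data points to the excess definition. A sketch, not a Spark test:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)
data = rng.standard_normal(100_000)

excess = kurtosis(data)             # Fisher definition: kurtosis - 3
raw = kurtosis(data, fisher=False)  # Pearson definition: plain kurtosis

print(excess)  # close to 0 for normal data
print(raw)     # close to 3
```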

Related

Kurtosis function in Julia

So I've been playing around with Julia, and I've discovered that the function to calculate the kurtosis of a probability distribution is implemented differently between Julia and MATLAB.
In Julia, do:
using Distributions
dist = Beta(3, 5)
x = rand(dist, 10000)
kurtosis(x) # gives a value around -0.42
In MATLAB do:
x = betarnd(3, 5, [1, 10000]);
kurtosis(x) % gives a value around 2.60
What's happening here? Why is the kurtosis different between the two languages?
As explained here: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
We often use excess kurtosis (kurtosis - 3) so that the excess kurtosis of a normal distribution becomes zero. As shown in the Distributions.jl docs, that is what kurtosis(x) computes in Julia.
MATLAB does not use the excess measure (there is even a note in its docs that mentions this potential issue).
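The two numbers are consistent with this: the theoretical excess kurtosis of Beta(3, 5) works out to about -0.4145, so the raw kurtosis is about 2.5855, matching the MATLAB output above. A quick check with SciPy, which (like Julia) uses the excess form:

```python
from scipy.stats import beta

# The 'k' moment in SciPy is the Fisher (excess) kurtosis
excess = float(beta(3, 5).stats(moments='k'))

print(excess)      # ≈ -0.4145, matching Julia's kurtosis()
print(excess + 3)  # ≈ 2.5855, matching MATLAB's kurtosis()
```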

Limiting the number of decimal points taken by matlab for calculation

I was checking the similarity of two random matrices using two methods. After addition, the sum of the elements of the first matrix came to 0.7095, and the same 0.7095 for the second matrix, but when I tried to find the difference of the sums, instead of zero it gave a value very close to zero. Later I checked in the workspace and found out that the first number is actually 0.709485632903040 and the second is 0.709485632903037. It is extremely important for me that the difference vector be zero, as I use that zero in later stages of my program. If MATLAB did the calculation to a precision of 4 or 5 digits, I could achieve that. I want to limit the calculation to 4 or 5 digits; I used digits(4) but it is not working. I want MATLAB to do the calculation to a precision of 4 decimal places only (not just for display, but for the calculation itself). Is there a way to do that?
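No answer is quoted here, but the usual fix for this kind of floating-point mismatch is not to lower the arithmetic precision; it is to round before comparing, or to compare with an explicit tolerance. A minimal Python sketch of both approaches, using the two numbers from the question:

```python
import math

a = 0.709485632903040
b = 0.709485632903037

# Round both values to 4 decimal places before comparing
print(round(a, 4) == round(b, 4))  # True

# Or compare with an explicit tolerance instead of forcing lower precision
print(math.isclose(a, b, abs_tol=1e-12))  # True
```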

Python and Matlab compute variances differently. Am I using the correct functions? [duplicate]

I'm trying to convert MATLAB code to NumPy and found that NumPy's std function gives a different result.
In MATLAB:
std([1,3,4,6])
ans = 2.0817
In NumPy:
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, dividing by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us to change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This removes some (but probably not all) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by @hbaderts gives further mathematical details.
The standard deviation is the square root of the variance. The variance of a random variable X is defined as

    Var(X) = E[(X - E[X])^2]

An estimator for the variance would therefore be

    s_n^2 = (1/n) * sum_{i=1}^{n} (x_i - x̄)^2

where x̄ denotes the sample mean. For randomly selected x_i, it can be shown that this estimator does not converge to the real variance, but to

    ((n-1)/n) * Var(X)

If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator

    s^2 = (1/(n-1)) * sum_{i=1}^{n} (x_i - x̄)^2

which will converge to Var(X). The correction term n/(n-1) is also called Bessel's correction.
Now, by default, MATLAB's std calculates the unbiased estimator with the correction term n-1. NumPy, however (as @ajcr explained), calculates the biased estimator with no correction term by default. The parameter ddof lets you set any correction term n-ddof. By setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB's std accepts a second parameter w, which specifies the weighting scheme. The default, w=0, results in the divisor n-1 (unbiased estimator), while w=1 uses n as the divisor (biased estimator).
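The effect of Bessel's correction is easy to see numerically: averaging the biased estimator over many small samples drawn from a unit-variance normal converges to (n-1)/n (0.8 for n = 5), while the corrected estimator converges to 1. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal((100_000, 5))  # many samples of size n = 5

biased = np.var(samples, axis=1, ddof=0).mean()    # divides by n
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divides by n - 1

print(biased)    # ≈ 0.8, i.e. (n-1)/n times the true variance of 1
print(unbiased)  # ≈ 1.0
```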
For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population.
The ddof correction is included for samples in order to counterbalance the bias introduced by estimating the mean from the same data.
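In terms of the original example, the two conventions differ only in the divisor, and both can be reproduced by hand:

```python
import numpy as np

x = np.array([1, 3, 4, 6])
n = len(x)
ss = ((x - x.mean()) ** 2).sum()  # sum of squared deviations from the mean

pop_std = np.sqrt(ss / n)           # divisor n: matches np.std(x, ddof=0)
sample_std = np.sqrt(ss / (n - 1))  # divisor n - 1: matches np.std(x, ddof=1) and MATLAB

print(pop_std)     # 1.8027756377319946
print(sample_std)  # 2.0816659994661326
```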

How do I calculate in Matlab the 95% confidence interval with lsqcurvefit?

Due to some problems in MATLAB with fixed parameters, I had to switch from the standard fit command to lsqcurvefit.
For the normal fit command, one of the output parameters is gof, from which I can calculate the +/- of each parameter and the r^2 value.
That should be possible for the lsqcurvefit as well. But I don't get it as one of the output parameters.
Or to put my question in other words: how do I calculate the +/- of a fit parameter from lsqcurvefit?
Can someone help me with that?
Thanks, Niko
Yep. Get all the output parameters of lsqcurvefit and use them in nlparci like so:
[x,resnorm,residual,exitflag,output,lambda,jacobian] = ...
    lsqcurvefit(@myfun, x0, xdata, ydata);
conf = nlparci(x, residual, 'jacobian', jacobian)
Now conf contains an N x 2 matrix for your N fit parameters. Each row of conf gives the upper and lower 95% confidence interval for the corresponding parameter.
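For comparison, the same kind of Jacobian-based interval can be sketched in Python with scipy.optimize.curve_fit, whose returned covariance matrix plays the role of nlparci's inputs. The model and data below are made up purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t

def model(x, a, b):
    return a * x + b

xdata = np.linspace(0, 10, 50)
rng = np.random.default_rng(1)
ydata = model(xdata, 2.0, 1.0) + 0.1 * rng.standard_normal(xdata.size)

popt, pcov = curve_fit(model, xdata, ydata)

# 95% confidence interval: estimate +/- t * standard error
dof = xdata.size - popt.size
tval = t.ppf(0.975, dof)
se = np.sqrt(np.diag(pcov))
conf = np.column_stack([popt - tval * se, popt + tval * se])

print(conf)  # one [lower, upper] row per fitted parameter
```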

Kurtosis of a normal distribution

According to what I read from here, the kurtosis of a normal distribution should be around 3. However, when I use the kurtosis function provided by MATLAB, I could not verify it:
data1 = randn(1,20000);
v1 = kurtosis(data1)
It seems that the kurtosis of a normal distribution is around 0. I was wondering what's wrong with it. Thanks!
EDIT
I am using MATLAB 2012b.
If your MATLAB returned values near 0, that would be a strong indication that it was computing excess kurtosis, which is defined to be kurtosis minus three.
However, my MATLAB doesn't actually do that:
MATLAB>> data1 = randn(1,20000);
MATLAB>> kurtosis(data1)
ans =
2.9825
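The two conventions differ only by an offset of 3; computing the raw fourth standardized moment by hand makes this explicit. A NumPy sketch (not MATLAB's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)
data1 = rng.standard_normal(200_000)

# Raw (Pearson) kurtosis: fourth central moment over squared variance
centered = data1 - data1.mean()
m2 = np.mean(centered ** 2)
m4 = np.mean(centered ** 4)
raw = m4 / m2**2

print(raw)      # ≈ 3, what MATLAB's kurtosis() returns for normal data
print(raw - 3)  # ≈ 0, the excess form used by Julia, SciPy, and (apparently) pyspark
```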