I have values for several percentile ranges: 0.49 for the 10th-25th percentile, 1.36 for the 25th-50th percentile (this is the peak), 0.18 for the 50th-75th percentile, and 0.15 for the >90th percentile.
I want to interpolate the values for the ranges >5, 5th-10th, and 75th-90th percentile. How can I do that in MATLAB?
If I assume a normal distribution whose peak is 1.36 (25th-50th percentile), as shown in the attached figure, how do I interpolate the values of the unknown percentile ranges?
Interpolating in order to find percentile values does not look like a good approach to me. If you are dealing with a normal distribution and its parameters (mu and sigma) are known, what you are looking for is the norminv function (official documentation: https://mathworks.com/help/stats/norminv.html).
X = norminv(P,mu,sigma) computes the inverse of the normal CDF using
the corresponding mean mu and standard deviation sigma at the
corresponding probabilities in P. The parameters in sigma must be
positive, and the values in P must lie in the interval [0 1].
For example, this is how you find the interval that contains 95% of the values of a standard normal distribution:
norminv([0.025 0.975],0,1)
This is how you find the 99th percentile of a normal distribution with mu=10 and sigma=3.5:
norminv(0.99,10,3.5)
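For readers without the Statistics Toolbox, the same two lookups can be sketched in Python using only the standard library. This is an illustrative equivalent, not part of the original answer: statistics.NormalDist and its inv_cdf method stand in for norminv here, and inv_cdf takes one probability at a time, strictly between 0 and 1.

```python
from statistics import NormalDist

# Interval containing the central 95% of a standard normal distribution
standard = NormalDist(mu=0.0, sigma=1.0)
central_95 = [standard.inv_cdf(p) for p in (0.025, 0.975)]

# 99th percentile of a normal distribution with mu=10 and sigma=3.5
p99 = NormalDist(mu=10.0, sigma=3.5).inv_cdf(0.99)

print(central_95)
print(p99)
```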
If you don't know those parameters, you can estimate them from the data you actually have. The parameters of the normal family are the mean and the standard deviation; once they are known, the underlying distribution is fully described. Specifically:
The mean of a normal distribution is halfway between the 25th and the 75th percentile. Average these two values to approximate it.
In a normal distribution, the difference between the 25th and the 75th percentile is about 1.35 times its standard deviation. So take the difference between the aforementioned values and divide it by 1.35 in order to obtain an approximation of the standard deviation.
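The two rules of thumb above can be sketched as follows. This is a minimal Python illustration under the assumption that the data really are normal; the function name normal_from_quartiles and the quartile values 8 and 12 are hypothetical and chosen only for the example.

```python
from statistics import NormalDist

def normal_from_quartiles(q25, q75):
    """Approximate a normal distribution from its 25th and 75th percentiles."""
    # The mean lies halfway between the two quartiles.
    mu = (q25 + q75) / 2.0
    # The interquartile range of a normal distribution is about 1.35 sigma;
    # compute the exact factor from the standard normal instead of hard-coding it.
    iqr_in_sigmas = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)
    sigma = (q75 - q25) / iqr_in_sigmas
    return NormalDist(mu, sigma)

# Hypothetical quartile values, for illustration only
fitted = normal_from_quartiles(q25=8.0, q75=12.0)

# Once mu and sigma are estimated, any other percentile follows from the inverse CDF
percentiles = {p: fitted.inv_cdf(p / 100.0) for p in (5, 10, 90)}

print(fitted.mean, fitted.stdev)
print(percentiles)
```

The estimate is only as good as the normality assumption: for skewed data the midpoint of the quartiles is not the mean.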
If you want to go with a linear interpolation, have a look at interp1 (https://mathworks.com/help/matlab/ref/interp1.html).
I am trying to convert MATLAB code to NumPy and found that NumPy gives a different result for the std function.
In MATLAB:
std([1,3,4,6])
ans = 2.0817
In NumPy:
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us to change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias of the sample variance by dividing by N-1. This gets rid of some (but probably not all) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by @hbaderts gives further mathematical details.
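The standard deviation is the square root of the variance. The variance of a random variable X is defined as
$$\operatorname{Var}(X) = \sigma^2 = E\left[(X - E[X])^2\right]$$
An estimator for the variance would therefore be
$$s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
where $\bar{x}$ denotes the sample mean. For randomly selected $x_i$, it can be shown that the expected value of this estimator is not the real variance $\sigma^2$, but
$$\frac{n-1}{n} \sigma^2$$
If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
whose expected value is $\sigma^2$. The correction term $n-1$ is also called Bessel's correction.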
By default, MATLAB's std calculates the unbiased estimator with the correction term n-1. NumPy, however (as @ajcr explained), calculates the biased estimator with no correction term by default. The parameter ddof lets you set any correction term n-ddof. By setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB lets you pass a second parameter w, which specifies the "weighting scheme". The default, w=0, results in the correction term n-1 (unbiased estimator), while for w=1 the divisor is simply n (biased estimator).
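To make the role of the divisor concrete, here is a small pure-Python sketch. The helper std_with_ddof is hypothetical and written only for illustration; it mirrors the n-ddof divisor described above and is checked against the standard library's pstdev (divides by n) and stdev (divides by n-1).

```python
import math
from statistics import pstdev, stdev

def std_with_ddof(values, ddof=0):
    """Standard deviation using the divisor n - ddof (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))

data = [1, 3, 4, 6]
biased = std_with_ddof(data, ddof=0)     # divides by n, like NumPy's default
corrected = std_with_ddof(data, ddof=1)  # divides by n - 1, like MATLAB's default

print(biased, pstdev(data))
print(corrected, stdev(data))
```

The two printed pairs should agree with the NumPy and MATLAB results quoted in the question, roughly 1.8028 and 2.0817.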
For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population.
ddof=1 is used for samples in order to counterbalance the bias that dividing by N would otherwise introduce.
Let us say I have n vectors of data of unequal lengths. All of these vectors are similar (range, etc.; see the example below), and each can be fitted to a specific probability distribution. How can we average these distributions? Does it make sense? If yes, how do I go about programming it?
For example:
For n=2: for data1, with 400 samples, I get a normal distribution with range 1 to 5, mean 3, and standard deviation 0.75.
For data2, with 500 samples, I get a normal distribution with range 0.95 to 5.2, mean 3.05, and standard deviation 0.78.
What test can I do in MATLAB to test the spread of a histogram? For example, in the given set of histograms, I am only interested in 1,2,3,5 and 7 (going from left to right, top to bottom) because they are less spread out. How can I obtain a value that will tell me if a histogram is positively skewed?
It may be possible using Chi-Squared tests but I am not sure what the MATLAB code will be for that.
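You can use the standard definition of sample skewness. In other words, you can use:
$$g = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}}$$
where $\bar{x}$ is the mean of the data and $n$ is the number of entries. You compute the mean of your data and use the above equation to calculate the skewness. Positive and negative skewness look like so:
[Figure: a negatively skewed distribution (longer left tail) next to a positively skewed distribution (longer right tail). Source: Wikipedia]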
As such, the larger the value, the more positively skewed it is. The more negative the value, the more negatively skewed it is.
Now to compute the mean of your histogram data, it's quite simple. You simply do a weighted sum of the histogram entries and divide by the total number of entries. Given that your histogram is stored in h, the bin centres of your histogram are stored in x, you would do the following. What I will do here is assume that you have bins from 0 up to N-1 where N is the total number of bins in the histogram... judging from your picture:
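x = 0:numel(h)-1; % bin centres 0..N-1, judging from your pictures (h must be a row vector too)
num_entries = sum(h(:));
mu = sum(h.*x) / num_entries;
% Weight each bin's deviation from the mean by its count h
skew = ((1/num_entries)*(sum(h.*(x - mu).^3))) / ...
((1/(num_entries-1))*(sum(h.*(x - mu).^2)))^(3/2);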
skew would contain the numerical measure of skewness for a histogram that follows that formula. Therefore, with your problem statement, you will want to look for skewness numbers that are positive and large. I can't really comment on what threshold you should look at, but look for positive numbers that are much larger than most of the histograms that you have.
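As a cross-check of the calculation, here is a small Python sketch of skewness computed from histogram counts. It assumes, as above, that the bin centres are 0 to N-1 unless given explicitly; the function name and the three example histograms are hypothetical. Each bin's deviation from the mean is weighted by its count.

```python
def histogram_skewness(counts, centres=None):
    """Skewness of binned data: each bin centre is weighted by its count."""
    if centres is None:
        centres = range(len(counts))  # assumption: bins are centred at 0..N-1
    centres = list(centres)
    n = sum(counts)
    mean = sum(c * x for c, x in zip(counts, centres)) / n
    third_moment = sum(c * (x - mean) ** 3 for c, x in zip(counts, centres)) / n
    variance = sum(c * (x - mean) ** 2 for c, x in zip(counts, centres)) / (n - 1)
    return third_moment / variance ** 1.5

# Hypothetical histograms, for illustration only
symmetric = histogram_skewness([1, 4, 6, 4, 1])
right_tailed = histogram_skewness([10, 6, 3, 1, 1])  # mass on the left, long right tail
left_tailed = histogram_skewness([1, 1, 3, 6, 10])   # mirror image of the previous one

print(symmetric, right_tailed, left_tailed)
```

A symmetric histogram gives zero, the right-tailed one a positive value, and its mirror image the same value with the opposite sign, which is a quick sanity check before choosing a threshold.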
I have two matrices whose entries follow a Gaussian distribution. Each is 3x3. I want to estimate upper and lower thresholds for these matrices. I denote the mean and standard deviation of each matrix by μ1, μ2 and σ1, σ2. The high and low thresholds are
T_high=(μ1+μ2)./2+k1*(σ1+σ2)./2
T_low=(μ1+μ2)./2-k2*(σ1+σ2)./2
where k1 and k2 are constants.
My question is: is my formula correct? Since this is a Gaussian distribution, k1=k2, right? This is my code; could you check it for me?
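% mu and sigma stand for μ and σ (MATLAB identifiers must be ASCII)
mu1=mean(v1(:));     % first matrix
sigma1=std2(v1(:));
mu2=mean(v2(:));     % second matrix
sigma2=std2(v2(:));
k1=1; k2=1;
T_high=(mu1+mu2)./2+k1*(sigma1+sigma2)./2;
T_low=(mu1+mu2)./2-k2*(sigma1+sigma2)./2;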
In the formula you are using, the joint standard deviation is wrong. It should be (writing mu and sigma for μ and σ, since MATLAB identifiers must be ASCII):
T_high=(mu1+mu2)./2+k1*sqrt((sigma1^2+sigma2^2)/2);
T_low=(mu1+mu2)./2-k2*sqrt((sigma1^2+sigma2^2)/2);
Since you treat all 18 pixels as belonging to the same distribution, why not use the following:
v=[v1(:);v2(:)];   % pool both matrices into one vector
mu=mean(v);
sigma=std(v);
k1=1; k2=1;
T_high=mu+k1*sigma;
T_low=mu-k2*sigma;
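The pooled approach can also be sketched in Python with the standard library. The two 3x3 matrices below are hypothetical values, flattened row by row, and k is set to 1 on both sides as in the answer; statistics.stdev divides by n-1, which matches the default of MATLAB's std.

```python
from statistics import mean, stdev

# Hypothetical 3x3 matrices, flattened row by row (illustrative values only)
v1 = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1]
v2 = [5.4, 5.0, 5.2, 5.6, 5.1, 5.3, 5.5, 4.9, 5.2]

pooled = v1 + v2        # treat all 18 values as one sample
mu = mean(pooled)
sigma = stdev(pooled)   # sample standard deviation (divides by n - 1)

k = 1.0                 # same multiplier above and below the mean
t_high = mu + k * sigma
t_low = mu - k * sigma

print(t_low, mu, t_high)
```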
I have come up with MATLAB code to plot a probability density and a cumulative graph. I have also used MATLAB to compute the standard deviation and the mean.
My next task is to find the 15th and 85th percentiles of the cumulative graph. I tried to use 'prctile (prob, 15)' to calculate the 15th percentile, but it does not give the same value as the one I read off the graph.
Are there any other ways to find the 15th and 85th percentiles?
This should give you the 15th and 85th percentile values as you see them in your cumulative graph (MATLAB variable names cannot start with a digit, hence p15 and p85):
p15 = prob(find(prob<prctile(prob,15),1));
p85 = prob(find(prob>prctile(prob,85),1,'last'));
There are several ways to calculate a percentile (see http://en.wikipedia.org/wiki/Percentile).
The gotcha here is that MATLAB and Excel don't agree (Excel uses the definition employed by the National Institute of Standards and Technology in the US, which is also the default for R). This is worth considering if you swap data and analysis between MATLAB and Excel.
Use the prctile function if you have the Statistics Toolbox (type help prctile).
http://www.mathworks.com/help/stats/prctile.html
Alternatively, write it yourself! A percentile is simply the value in the sorted data closest to the rank you want. For example, if you have 1000 values, your 15th percentile will be the (15/100)*1000 = 150th value. Make sure you sort the data from smallest to largest.
There is a special way to deal with values that fall in between samples, but these depend on the definition you use. Some take the nearest, others take the average between two samples, and some others calculate how close they are to the samples and take a value that is linearly proportional to that.
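The "write it yourself" approach can be sketched in Python as below. Both functions are hypothetical helpers showing two of the conventions mentioned above: nearest rank, and linear interpolation between neighbouring samples. Other tools, including MATLAB's prctile and Excel, may use slightly different conventions and so return slightly different values.

```python
import math

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the ceil(p * n / 100)-th smallest value."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def percentile_linear(values, p):
    """Percentile by linear interpolation between the two closest samples."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1) / 100  # fractional zero-based index
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

data = list(range(1, 1001))  # hypothetical data: the 1000 values 1..1000

p15_rank = percentile_nearest_rank(data, 15)  # the 150th sorted value
p85_rank = percentile_nearest_rank(data, 85)  # the 850th sorted value
p15_linear = percentile_linear(data, 15)

print(p15_rank, p85_rank, p15_linear)
```

With 1000 values the two conventions differ by less than one sample spacing; the gap grows for small data sets, which is where the choice of definition starts to matter.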