Normal Probability Plot interpretation [closed] - matlab

I have a very basic question. What is the basis of the normal probability plot, i.e. what do the probabilities represent? I am testing for a standard normal distribution. My normplot (in MATLAB) showed the values falling more or less on a straight line, BUT the probability of 0.5 corresponded to a value other than zero.
My question is, how do I interpret this? Does it mean that my data is normally distributed but has a non-zero mean (i.e. not standard normal), or does this probability reflect something else? I tried Google, and one link said the probabilities are the cumulative probabilities from the z-table, but I can't figure out what to make of that.
Also, in MATLAB, is it the case that as long as the values fall on the line drawn by the program (the red dotted line), they come from a normal distribution? In one of my graphs the dotted line is very steep but the values fit on it; does this mean that the one or two values lying far outside the line are just outliers?
I'm very new to stats, so please help!
Thanks!

My question is, how do I interpret this? Does it mean that my data is normally distributed but has a non-zero mean (i.e. not standard normal), or does this probability reflect something else?
You are correct. If you run normplot and the data falls very close to the fitted line, your data's cumulative distribution function is very close to that of a normal distribution. The 0.5 CDF point corresponds to the mean of the fitted normal distribution (which looks like about 0.002 in your case).
The reason you get a straight line is that the y-axis is nonlinear: it is "warped" in such a way that a perfect Gaussian cumulative distribution maps to a straight line. Specifically, the y-axis marks are spaced linearly in the inverse error function (equivalently, the inverse normal CDF).
If the ends of the plot have steeper slopes than the fitted line, your distribution has shorter tails than a normal distribution, i.e. there are fewer extreme values, perhaps due to some physical constraint that prevents excessive variation from the mean.
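For instance, here is a minimal sketch (assuming the Statistics Toolbox; the mean and standard deviation are illustrative). The points fall on a straight line because the data is normal, but the 0.5 probability mark lines up with the non-zero mean rather than with zero:
rng(0);                          % for reproducibility
x = 0.002 + 0.01*randn(500,1);   % normal data: mean 0.002, std 0.01
normplot(x)                      % straight line; p = 0.5 sits near x = 0.002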

The normal distribution is a density function. The probability of any single value is 0, because the total probability (= 1) is distributed over an infinite number of values (it is a continuous function).
What the graph of the normal distribution shows is how the probability is distributed (y-axis) over the values (x-axis). So what you can get from the graph is the probability of an interval: between two points, from -infinity to a point, or from a point to +infinity. That probability is obtained by integrating the density function from point 1 to point 2.
But you don't have to do this integral yourself, because you have the z-table. The z-table gives you the probability of x lying between -infinity and x (after applying the equation that relates x to z).
I don't have MATLAB here, but I guess the straight line you mention relates to the cumulative distribution function, which tells you the probability of x lying in [-infinity, x], and is determined by the integral of the density from -infinity to the value of x (or obtained from the z-table).
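For concreteness, if you have the Statistics Toolbox, normcdf replaces the z-table lookup entirely (a small illustrative sketch):
p = normcdf(1.96)              % ~0.975, i.e. P(Z <= 1.96)
p = normcdf(2) - normcdf(-2)   % ~0.954, i.e. P(-2 <= Z <= 2)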
Sorry if my English was bad.
I hope I was helpful.


Pearson correlation coefficient

This question of mine is not tightly related to MATLAB, but is relevant to it:
I'm trying to figure out how to fill in the matrix [[a,b,c],[d,e,f]] in a few nontrivial ways so that as many entries as possible in
corrcoef([a,b,c],[d,e,f])
are zero. My attempts yield NaN results in most cases.
Given the current comments, you are trying to understand how two series of random draws from two distributions can have zero correlation. Specifically, exercise 4.6.9 to which you refer mentions draws from two normal distributions.
An issue with your approach is that you are hoping to link a theoretical property to experimentation, in this case using MATLAB. And, as you seem to have noticed, unless you look at specific degenerate cases, the experimentation will fail. That is because although the true correlation parameter rho in the exercise might be zero, a sample of random draws will almost always have some non-zero sample correlation. Here is an illustration; as you'll notice if you run it, the actual correlations span the whole spectrum between -1 and 1, even though their average is zero (as it should be, since the two pseudorandom streams are uncorrelated):
n = 1e4;
experiment = nan(n,1);
for i = 1:n
    % sample correlation of two independent 4-sample uniform draws
    r = corrcoef(rand(4,1), rand(4,1));
    experiment(i) = r(2);   % off-diagonal entry of the 2x2 correlation matrix
end
hist(experiment);
title(sprintf('Average correlation: %.4f', mean(experiment)));
If you look at the definition of the Pearson correlation on Wikipedia, you will see that the only way it can be zero is when the numerator is zero, i.e. E[(X-Xbar)(Y-Ybar)] = 0. Though this might hold asymptotically, you will be hard-pressed to find a non-degenerate case where it happens in a small sample. Nevertheless, to show that you can derive some such degenerate cases, let's dig a bit further. If you want the expectation of this product to be zero, you can make either the left-hand or the right-hand factor zero whenever the other is non-zero. For one factor to be zero, the draw must be exactly equal to the average of the draws. We can therefore imagine creating such a pair of variables using this technique:
we create two vectors of 4 values, and alternate which draw will be equal to the average.
let's say we want X to average 1 and Y to average 2, and we make even-indexed draws equal to the average for X, and odd-indexed draws equal to the average for Y.
one such pair would be: X=[0,1,2,1], Y=[2,0,2,4], and you can check that corrcoef([0,1,2,1],[2,0,2,4]) does in fact produce an identity matrix. This is because every time a component of X differs from its average of 1, the corresponding component of Y equals its average of 2.
another example, where the average of X is 3 and that of Y is 4: X=[3,-5,3,11], Y=[1008,4,-1000,4], etc.
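You can verify both constructions directly; each call should return the 2x2 identity matrix, i.e. exactly zero off-diagonal correlation:
corrcoef([0,1,2,1], [2,0,2,4])
corrcoef([3,-5,3,11], [1008,4,-1000,4])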
If you wanted to know how to create samples from non-correlated distributions altogether, that would be an entirely different question, though (perhaps) more interesting in terms of understanding statistics. If this is your case, and given that the exercise you mention discusses normal distributions, I would suggest you take a look at generating antithetic variables using the Box-Muller transform.
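As a rough sketch of that direction (not the exercise's method): the Box-Muller transform turns two independent U(0,1) draws into two independent, hence uncorrelated, N(0,1) draws; negating them gives the antithetic pairs.
n  = 1e5;
u1 = rand(n,1);  u2 = rand(n,1);   % two independent uniform samples
r  = sqrt(-2*log(u1));
z1 = r.*cos(2*pi*u2);              % z1 and z2 are independent N(0,1)
z2 = r.*sin(2*pi*u2);
corrcoef(z1, z2)                   % off-diagonal entries near zero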
Happy randomizing!

Why does the randn function return a number larger than 1? [closed]

I thought randn returns a random number belonging to a normal distribution with mean 0 and standard deviation 1, so I expected to get a number in the range (0, 1). But what I get is a number outside that range.
What am I doing wrong?
You are thinking of a uniform distribution. A normal distribution can, in theory, produce very large numbers, with very low probability.
randn has a mean of 0 and a standard deviation of 1. The normal distribution is the bell-curve / Gaussian shape, with the highest probability density at the mean and the probability falling off on a scale set by the standard deviation.
What you are looking for is rand, which samples from a uniform distribution, giving numbers bounded between 0 and 1 with equal probability everywhere in that interval.
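A quick comparison (the percentage is a property of the standard normal distribution, not of any particular run):
u = rand(1,5)     % all values strictly between 0 and 1
z = randn(1,5)    % can be negative or exceed 1; about 32% of draws have |z| > 1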
You're confusing the normal distribution with the uniform distribution.
Another possible source of confusion:
A normal distribution with mean 0 and variance 1 is often denoted N(0,1). This is sometimes called the standard normal distribution and implies that samples are drawn from all real numbers, i.e., the range (−∞,+∞), with mean 0 and variance 1. The standard deviation is also 1 in this case, but note that the notation specifies the variance (a common point of confusion). The transformation N(μ,σ²) = μ + σ·N(0,1), where μ is the mean, σ² is the variance, and σ is the standard deviation, is very useful.
Similarly, a continuous uniform distribution over the open interval (0,1) is often denoted U(0,1). This is often called the standard uniform distribution and implies that samples are drawn uniformly from just the range (0,1). The analogous transformation is U(a,b) = a + (b − a)·U(0,1), where a and b are the edges of the scaled interval.
Note that the 0's and 1's in these two notations do not represent the same things at all; they are merely the parameters that describe each distribution. The set of values each distribution is sampled from is called its support.
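Both transformations are one-liners in MATLAB (mu, sigma, a, b below are illustrative values):
mu = 3; sigma = 2;
x = mu + sigma*randn(1e5,1);   % samples from N(3,4): variance sigma^2 = 4
a = -1; b = 5;
y = a + (b - a)*rand(1e5,1);   % samples from U(-1,5)
[mean(x) var(x)]               % approximately [3 4]
[min(y) max(y)]                % within (-1, 5)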

Detect incorrect points in a homogeneous surface

In my project I have huge surfaces of 20,000 points computed by an algorithm. This algorithm sometimes makes an error, computing 1 or more points in a small area incorrectly.
This error cannot be fixed in the algorithm itself, but needs to be detected afterwards.
The error can be seen in the next figure:
As you can see, there is a wrongly computed point that not only breaks the homogeneity of the surface, but also ruins the aesthetics of the plot (which is also important in the project).
Sometimes there can be more than one such point, but in general no more than 5 or 6. The error is always in the Z axis, so there is no need to check X and Y.
I have been racking my brain to find a somewhat generic algorithm to detect these points.
I thought that maybe I could take patches of the surface, average the Z values, and then flag points outside the variance... but I don't think that will always work.
Any ideas?
NOTE: I don't want someone to write code for me, just an idea.
PS: relevant code for the above image:
[x,y] = meshgrid(-2:.07:2);
Z = x.*exp(-x.^2-y.^2);     % smooth test surface
subplot(1,2,1)
surf(x,y,Z,gradient(Z))     % correct surface
subplot(1,2,2)
Z(35,35) = Z(35,35)+0.3;    % inject one bad point
surf(x,y,Z,gradient(Z))     % surface with the error
The standard trick is to use a Laplacian, looking for the largest outliers. (This is not unlike what Mohsen posted as an answer, but is actually a bit easier.) You could probably even do it with conv2, so it would be quite efficient.
I could offer a few ways to implement the idea. A simple one is to use my gridfit tool, found on the File Exchange. (Gridfit essentially uses a Laplacian for its smoothing operation.) Fit the surface with all points included, then look for the single point that was perturbed the most by the fit. Exclude it, then rerun the fit, again looking for the largest outlier. (With gridfit, you can use weights to give points a zero weight, a simple way to exclude a point or list of points.) When the largest perturbation that was needed is small enough, you can decide to stop the process. A nice thing is gridfit will also impute new values for the outliers, filling in all of the holes.
A second approach is to use the Laplacian directly, in more of a filtering approach. Here, you simply compute a value at each point that is the average of its neighbors to the left, right, above, and below. The single value most in disagreement with its computed average is replaced with a new value; or you can use a weighted average of the new value and the old one there. Again, iterate until the process no longer produces a change larger than some tolerance. (This is the basis of an old outlier detection and correction scheme that I recall from the Fortran IMSL libraries, probably dating back roughly 30 years.)
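A minimal sketch of the conv2 variant (this is not gridfit; the iteration and stopping logic described above are left out):
lap = [0 1 0; 1 -4 1; 0 1 0];     % discrete Laplacian kernel
L = conv2(Z, lap, 'same');        % large |L| where Z disagrees with its neighbors
[~, idx] = max(abs(L(:)));
[r, c] = ind2sub(size(L), idx)    % for the example surface this should flag (35,35)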
Since your function seems to vary smoothly, these abrupt changes can be detected by looking at the derivatives. You can
Take the derivative in one direction
Calculate the mean and standard deviation of the derivative
Find the bad points by looking for values that are further from the mean than a certain multiple of the standard deviation
Here is the code
U = diff(Z);                        % derivative along the first dimension
V = (U - mean(U(:)))/std(U(:));     % standardized derivative (z-scores)
surf(x(2:end,:),y(2:end,:),V)       % visualize the standardized derivative
V = [zeros(1,size(V,2)); V];        % pad so V matches the size of Z
V(abs(V)<10) = 0;                   % zero out everything below the threshold
V = sign(V);                        % +1 at the jump up, -1 at the jump back down
W = cumsum(V);                      % nonzero only between a +1/-1 jump pair
[I,J] = find(W);
outliers = [I, J];
For your example, the plot of V has a peak of about 21.7 at the bad point, while the second-largest peak is only about 1.95, so a threshold of 10 is reasonable.
and running the code returns
outliers =
35 35
The cumsum step is needed for cases where a patch of adjacent points, rather than a single one, is incorrect.

Controlled random number/dataset generation in MATLAB

Say I have a 1x1x1 cube spanning the coordinates (0,0,0) to (1,1,1). I want to generate a random set of points (assume 10 points) within this cube which are somewhat uniformly distributed (i.e. within a certain minimum and maximum distance from each other, and also not too close to the boundaries). How do I go about this without using loops? If this is not possible using vector/matrix operations, then a solution with loops will also do.
Let me provide some more background about my problem (this will clarify what exactly I need and why). I want to integrate a function, F(x,y,z), over a polyhedron. I want to do it numerically as follows:
$\int_V F(x,y,z)\,dV \approx \sum_{i} F(x_i,y_i,z_i)\, V_i$
Here, $F(x_i,y_i,z_i)$ is the value of the function at the point $(x_i,y_i,z_i)$ and $V_i$ is the weight associated with that point. So to calculate the integral accurately, I need a set of random points that are neither too close to each other nor too far apart (sorry, but I don't know this range myself; I will only be able to figure it out with a parametric study once I have working code). Also, I need to do this for a 3D mesh with multiple polyhedra, so I want to avoid loops to speed things up.
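For illustration, with equal weights $V_i = V/n$ the sum above reduces to plain Monte Carlo integration over the unit cube (F below is just a stand-in integrand):
F = @(x,y,z) x.*y + z.^2;             % stand-in integrand; exact integral is 7/12
n = 1e5;
p = rand(3,n);                        % n uniform points in the unit cube
I = mean(F(p(1,:),p(2,:),p(3,:)))     % cube volume is 1, so the mean approximates the integral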
Check out this nice random-vectors-with-fixed-sum generator on the File Exchange (FEX).
The code "generates m random n-element column vectors of values, [x1;x2;...;xn], each with a fixed sum, s, and subject to a restriction a<=xi<=b. The vectors are randomly and uniformly distributed in the n-1 dimensional space of solutions. This is accomplished by decomposing that space into a number of different types of simplexes (the many-dimensional generalizations of line segments, triangles, and tetrahedra.) The 'rand' function is used to distribute vectors within each simplex uniformly, and further calls on 'rand' serve to select different types of simplexes with probabilities proportional to their respective n-1 dimensional volumes. This algorithm does not perform any rejection of solutions - all are generated so as to already fit within the prescribed hypercube."
Use pts = rand(3,10), where each column corresponds to one point and each row corresponds to the coordinate along one axis (x, y, z).
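If you also want the points to stay away from the boundaries, one simple variation is to rescale into a shrunken cube (m is an illustrative margin; enforcing a minimum pairwise distance would still require rejection sampling or a tool like the FEX submission above):
m = 0.1;                          % illustrative margin from each face
pts = m + (1 - 2*m)*rand(3,10);   % 10 points in [m, 1-m]^3, one per column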

Metric rectification using the dual degenerate conic in MATLAB [closed]

I'm implementing metric rectification of an image with projective distortion in the following manner:
From the original image I find two sets of parallel lines and compute their intersection points (the vanishing points, i.e. the images of points at infinity).
I select five non-collinear points on a circle, fit a conic to them, and then check where that conic intersects the imaged line at infinity defined by the aforementioned vanishing points.
I use those points to find the distorted dual degenerate conic.
Theoretically, since the distorted conic is determined by C*' = H C* H' (C* is the dual degenerate conic, ' denotes the transpose, H is my homography), I should be able to run SVD to determine H. Undistorted, C* is a 3x3 identity matrix with the last diagonal element set to zero. However, if I run SVD, I don't get ones on the diagonal. For some matrices I can avoid this by using a Cholesky factorization instead (which factors as C*' = HH', which, at least for this purpose, is mostly okay), but that requires a positive definite matrix. Is there a way to distribute the scale inside the diagonal matrix returned by SVD equally into the U and V' matrices while keeping them the same (e.g. U = V)?
I'm using MATLAB for this. I'm sure I'm missing something obvious...
The lack of positive definiteness of the resulting matrices was due to noise: the image used had too much radial distortion, making even the selection of many points on the circle fairly useless in this approach.
The point missed in the SVD approach was to remove the scale from the diagonal component by left- and right-multiplying by the square root of the diagonal matrix (with the last diagonal element set to 1: that singular value should be zero, but a zero component there would not yield correct results).
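A minimal sketch of that fix (variable names are illustrative; Cstar is the imaged dual degenerate conic, assumed numerically symmetric positive semidefinite so that U and V coincide up to sign):
[U, S, ~] = svd(Cstar);   % Cstar symmetric => U equals V
s = diag(S);
s(3) = 1;                 % third singular value is ~0; set to 1 so H is invertible
H = U*diag(sqrt(s));      % then Cstar ~ H*diag(1,1,0)*H'
Hrect = inv(H);           % applying Hrect to the image performs the rectification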