Pearson correlation coefficient - MATLAB

This question of mine is not strictly about MATLAB, but it is relevant to it:
I'm looking for a few nontrivial ways to fill in the matrix [[a,b,c],[d,e,f]] so that as many entries as possible in
corrcoef([a,b,c],[d,e,f])
are zero. My attempts yield a NaN result in most cases.

Given the current comments, you are trying to understand how two series of random draws from two distributions can have zero correlation. Specifically, exercise 4.6.9 to which you refer mentions draws from two normal distributions.
An issue with your approach is that you are hoping to establish a link between a theoretical property and experimentation, in this case using MATLAB. And, as you seem to have noticed, unless you look at specific degenerate cases, your experiments will fail. That is because, although the true correlation parameter rho in the exercise might be zero, a sample of random draws will almost always have some non-zero level of correlation. Here is an illustration; as you'll notice if you run it, the actual correlations span the whole spectrum between -1 and 1, despite their average being zero (as it should be, since the two pseudorandom generators are uncorrelated):
n = 1e4;
experiment = nan(n,1);
for i = 1:n
    r = corrcoef(rand(4,1), rand(4,1));   % correlation of two small independent uniform samples
    experiment(i) = r(2);                 % off-diagonal entry of the 2x2 correlation matrix
end
hist(experiment);
title(sprintf('Average correlation: %.4f', mean(experiment)));
If you look at the definition of the Pearson correlation on Wikipedia, you will see that the only way it can be zero is when the numerator is zero, i.e. E[(X-Xbar)(Y-Ybar)]=0. Though this might hold asymptotically, you will be hard-pressed to find a non-degenerate case where it happens in a small sample. Nevertheless, to show that you can derive some such degenerate cases, let's dig a bit further. If you want the expectation of this product to be zero, you can make either the left or the right factor zero whenever the other is non-zero. For one factor to be zero, the draw must be exactly equal to the average of the draws. Therefore we can imagine creating such a pair of variables with this technique:
We create two vectors of 4 draws each, and alternate which draw is equal to the average.
Let's say we want X to average 1 and Y to average 2, and we make the even-indexed draws equal to the average for X and the odd-indexed draws equal to the average for Y.
One such generation would be X=[0,1,2,1], Y=[2,0,2,4], and you can check that corrcoef([0,1,2,1],[2,0,2,4]) does in fact produce an identity matrix. This is because every time a component of X differs from its average of 1, the corresponding component of Y is equal to its average of 2.
Another example, where the average of X is 3 and that of Y is 4, is X=[3,-5,3,11], Y=[1008,4,-1000,4], etc.
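As a quick sanity check (the snippet is mine, just verifying the two constructions above), both pairs indeed give a zero off-diagonal entry:
X1 = [0, 1, 2, 1];    Y1 = [2, 0, 2, 4];         % means 1 and 2
X2 = [3, -5, 3, 11];  Y2 = [1008, 4, -1000, 4];  % means 3 and 4
corrcoef(X1, Y1)   % identity matrix: the off-diagonal correlation is exactly 0
corrcoef(X2, Y2)   % identity matrix again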
If you wanted to know how to create samples from non-correlated distributions altogether, that would be an entirely different question, though (perhaps) more interesting in terms of understanding statistics. If that is your case, and given that the exercise you mention discusses normal distributions, I would suggest you look at generating antithetic variables using the Box-Muller transform.
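For completeness, here is a minimal sketch of the Box-Muller transform mentioned above (my own illustration; it produces pairs of independent standard normal draws, which are uncorrelated by construction up to sampling noise):
n  = 1e5;
u1 = rand(n,1);
u2 = rand(n,1);
z1 = sqrt(-2*log(u1)) .* cos(2*pi*u2);   % standard normal
z2 = sqrt(-2*log(u1)) .* sin(2*pi*u2);   % standard normal, independent of z1
corrcoef(z1, z2)                          % off-diagonal entries close to 0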
Happy randomizing!


Should I perform data centering before applying SVD?

I have to use SVD in MATLAB to obtain a reduced version of my data.
I've read that the function svds(X,k) performs the SVD and returns the k largest singular values and the corresponding singular vectors. There is no mention in the documentation of whether the data have to be normalized.
By normalization I mean both subtraction of the mean value and division by the standard deviation.
When I implemented PCA, I used to normalize in that way. But I know that it is not needed when using the MATLAB function pca(), because it computes the covariance matrix using cov(), which implicitly centers the data.
So, the question is: I need the projection matrix to reduce my n-dimensional data to k dimensions via SVD. Should I normalize the training data (and therefore apply the same normalization to new data before projecting it), or not?
Thanks
Essentially, the answer is yes: you should typically perform normalization. The reason is that features can have very different scalings, and we usually do not want that scaling to dominate when judging how distinct the features are.
Suppose we have two features x and y, both with variance 1, but where x has a mean of 1 and y has a mean of 1000. Then the matrix of samples will look like
n = 500; % samples
x = 1 + randn(n,1);
y = 1000 + randn(n,1);
svd([x,y])
The problem is that the scale of y (without normalizing) essentially washes out the small variations in x. Specifically, if we just examine the singular values of [x,y], we might be inclined to conclude that the data is essentially one-dimensional, i.e. that x is just a scaled copy of y (since one of the singular values is much smaller than the other). But we know that is not the case, since x was generated independently.
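To make this concrete, here is a small sketch (building on the snippet above; the exact numbers vary between runs) comparing the singular values before and after removing the column means:
n = 500;
x = 1 + randn(n,1);
y = 1000 + randn(n,1);
A = [x, y];
svd(A)                           % one singular value dwarfs the other
svd(A - repmat(mean(A), n, 1))   % after centering, both are of comparable size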
In fact, you will often find that you only see the "real" structure in a signal once you remove the mean. At the extreme end, imagine that we have some feature
z = 1e6 + sin(t)
Now if somebody just gave you those numbers, you might look at the sequence
z = 1000000.54, 1000000.2, 1000000.4, ...
and just think, "that signal is boring, it basically is just 1e6 plus some round off terms...". But once we remove the mean, we see the signal for what it actually is... a very interesting and specific one indeed. So long story short, you should always remove the means and scale.
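A tiny sketch of that last example (the choice of t is mine):
t = linspace(0, 4*pi, 1000);
z = 1e6 + sin(t);
subplot(2,1,1); plot(t, z);            % looks essentially flat around 1e6
subplot(2,1,2); plot(t, z - mean(z));  % the sinusoid becomes clearly visible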
It really depends on what you want to do with your data. Centering and scaling can be helpful to obtain principal components that are representative of the shape of the variations in the data, irrespective of scaling. I would say it is mostly needed if you want to use the principal components themselves, particularly if you want to visualize them. It can also help during classification, since your scores will then be normalized, which may help your classifier. However, it depends on the application: in some applications the energy also carries useful information that one should not discard, so there is no general answer!
Now you write that all you need is "the projection matrix useful to reduce my n-dim data to k-dim ones by SVD". In this case, no need to center or scale anything:
[U,~] = svd(TrainingData);
ReducedData = U(:,1:k)'*TestData;   % project onto the first k left singular vectors
will do the job. The svds may be worth considering when your TrainingData is huge (in both dimensions) so that svd is too slow (if it is huge in one dimension, just apply svd to the gram matrix).
It depends!!!
A common use in signal processing where it makes no sense to normalize is noise reduction via dimensionality reduction of correlated signals, where all the features are contaminated with random Gaussian noise of the same variance. In that case, if the magnitude of a certain feature is twice as large, its SNR is also approximately twice as large, so normalizing the features makes no sense: it would just amplify the parts with the worse SNR and shrink the parts with the better SNR. You also don't need to subtract the mean in that case (as you would in PCA); the mean (or DC component) isn't different from any other frequency.
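A minimal sketch of that scenario (a toy construction of mine, not from the original answer): one underlying signal observed in several channels with very different gains, each contaminated by Gaussian noise of the same variance. Truncating the SVD to the dominant component denoises the data without any centering or scaling; normalizing first would only amplify the channels with the worst SNR.
t = (0:0.01:10)';
s = sin(2*pi*0.5*t);                              % underlying signal
gains = [1, 5, 20, 50, 100];                      % channel magnitudes differ widely
X = s*gains + 0.5*randn(numel(t), numel(gains));  % same noise variance in every channel
[U, S, V] = svd(X, 'econ');
Xdenoised = U(:,1) * S(1,1) * V(:,1)';            % keep only the dominant component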

Which scaling technique does it use?

I have a matrix X of size 100x2000 (double). I want to know which kind of scaling technique is applied to X in the following command, and why it does not use the z-score for scaling:
X = X./repmat(sqrt(sum(X.^2)),size(X,1),1);
That scaling comes from linear algebra: it is what we call normalizing to a unit vector. Assuming that each row is an observation and each column is a feature, what's happening here is that each feature is normalized over all observations, so that the overall length / magnitude of that feature, taken across all observations, is 1.
The denominator computes, for each feature, its norm (magnitude) over all observations. Once these magnitudes are known, every feature value of every observation is divided by the corresponding feature's magnitude.
The reason why unit vectors are often employed is to describe a point in feature space with respect to a set of basis vectors. Normalizing to unit vectors gives you a canonical way to represent each component in feature space, so what's probably happening here is that the observations are being transformed such that each component / feature is represented in terms of a set of basis vectors, where each basis vector corresponds to one feature in the data.
Check out the Wikipedia article on Unit Vectors for more details: http://en.wikipedia.org/wiki/Unit_vector
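A small sketch to make the operation concrete (the final check is mine): every column of the result ends up with unit Euclidean norm.
X  = rand(100, 2000);                               % toy data, same shape as in the question
Xn = X ./ repmat(sqrt(sum(X.^2)), size(X,1), 1);    % normalize each column to unit length
max(abs(sqrt(sum(Xn.^2)) - 1))                      % ~0 up to round-off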

Spreading one matrix's elements to another with weighted random numbers in MATLAB

So I was trying to spread the elements of one matrix (generated with poissrnd) into another by using some bigger (wider?) probability function (for example, 100 different possibilities with different weights), plot both of them, and see whether the fluctuations went down after the spreading. After seeing that it doesn't work right (the fluctuations got bigger), I tried to identify what I did wrong on a really simple example. After testing it for a really long time I still can't understand what's wrong. The example goes like this:
I generate a vector with poissrnd and a vector for the spreading (filled with zeros at the start).
Each element of the Poisson vector tells me how many numbers (0.1 of the element's value) to generate from the possible options, which are [1,2,3] with corresponding weights [0.2,0.5,0.2].
I spread what I got into the other vector over 3 elements: the corresponding one (the k-th), the one before it, and the one after it (so, for example, if k=3, most should go into the 3rd element of the other vector, and the rest into the 2nd and 4th elements).
I plot both the 0.1*poiss vector and the vector after spreading, to compare whether the fluctuations went down.
The way I generate weighted numbers is from this thread: Weighted random numbers in MATLAB
and this is the code I'm using:
clear all
clc
eta=0.1;
N=200;
fot=10000000;
ix=linspace(-100,100,N);
mn =poissrnd(fot/N, 1, N);
dataw=zeros(1,N);
a=1:3;
w=[.25,.5,.25];
for k = 1:N
    [~,R] = histc(rand(1,eta*mn(1,k)), cumsum([0;w(:)./sum(w)]));
    R = a(R);
    przydz = histc(R, a);
    if (k>1) && (k<N)
        dataw(1,k)   = dataw(1,k)   + przydz(1,2);
        dataw(1,k-1) = dataw(1,k-1) + przydz(1,1);
        dataw(1,k+1) = dataw(1,k+1) + przydz(1,3);
    elseif k==1
        dataw(1,k)   = dataw(1,k)   + przydz(1,2);
        dataw(1,N)   = dataw(1,N)   + przydz(1,1);
        dataw(1,k+1) = dataw(1,k+1) + przydz(1,3);
    else
        dataw(1,k)   = dataw(1,k)   + przydz(1,2);
        dataw(1,k-1) = dataw(1,k-1) + przydz(1,1);
        dataw(1,1)   = dataw(1,1)   + przydz(1,3);
    end
end
plot(ix,eta*mn,'g',ix,dataw,'r')
The fluctuations are still bigger, and I can't identify what's wrong... Is the method for generating weighted numbers wrong in this situation? Because it doesn't seem so. The way I accumulate data from the first vector seems fine too. Is there another way I could do it (so I could then optimize it for using 'bigger' probability functions)?
Sorry for my terrible English.
[EDIT]:
Here is a simple picture to show what I meant (I hope it's understandable).
How about trying the negative binomial distribution? It is often used as an over-dispersed analogue of the Poisson distribution. Additional links can be found in this paper, as well as some apparatus in its supplement.
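A minimal sketch of what that could look like with the question's fot and N (nbinrnd is from the Statistics Toolbox; matching the Poisson mean via p = r/(r + mu) is my own choice of parametrisation):
mu = fot/N;                   % same target mean as poissrnd(fot/N, ...)
r  = 10;                      % dispersion parameter: variance = mu + mu^2/r
p  = r/(r + mu);
mn_nb = nbinrnd(r, p, 1, N);  % over-dispersed analogue of the Poisson draws
[var(mn_nb), mu]              % sample variance vs. the Poisson variance (= mu)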

Explaining the corr2 function in MATLAB

Can someone explain the correlation function corr2 in MATLAB to me? I know that it is for comparing the similarity of 2D objects, but in the equation I have doubts about what A and B are (probably the matrices being compared), and also about Amn and Bmn.
I'm not sure how MATLAB executes this function, because I have found in several cases that the correlation is not computed over the entire image (matrix); instead, the image is divided into blocks and the blocks of one picture are compared with the blocks of the other picture.
In MATLAB's documentation, the corr2 equation is given without any reference explaining where it comes from, unlike other functions in the documentation, which refer to the book the formula is taken from and where it is explained.
The correlation coefficient is a number representing the similarity between 2 images in terms of their respective pixel intensities.
As you pointed out this function is used to calculate this coefficient:
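The formula in question, as given in the corr2 documentation, is:
$r = \dfrac{\sum_m \sum_n (A_{mn}-\bar{A})(B_{mn}-\bar{B})}{\sqrt{\left(\sum_m \sum_n (A_{mn}-\bar{A})^2\right)\left(\sum_m \sum_n (B_{mn}-\bar{B})^2\right)}}$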
Here A and B are the images you are comparing, while the subscript indices m and n refer to the pixel location in the image. Basically, what MATLAB does is compute, for every pixel location in both images, the difference between the intensity value at that pixel and the mean intensity of the whole image, denoted by a letter with a bar over it.
As Kostya pointed out, typing edit corr2 in the command window will show you the code used by MATLAB to compute the correlation coefficient. The formula is basically this:
a = a - mean2(a);
b = b - mean2(b);
r = sum(sum(a.*b))/sqrt(sum(sum(a.*a))*sum(sum(b.*b)));
where:
a is the input image and b is the image you wish to compare to a.
If we break down the formula, we see that a - mean2(a) and b-mean2(b) are the elements in the numerator of the above equation. mean2(a) is equivalent to mean(mean(a)) or mean(a(:)), that is the mean intensity of the whole image. This is only calculated once.
The 3rd line of code calculates the coefficient. Here sum(sum(a.*b)) calculates the double sum present in the formula element-wise, that is, considering each pixel location separately. Be aware that sum(a) calculates the sum of each column individually, hence in order to get a single value you need to apply sum twice.
Pretty much the same happens in the denominator, except that the calculations are performed on (a - mean2(a)).^2 and (b - mean2(b)).^2. You can see this as a kind of normalization, in which you account for the spread of pixel intensities within each individual image.
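As a quick sketch (the random test matrices are mine; corr2 itself requires the Image Processing Toolbox), you can verify that the three lines above reproduce corr2 exactly:
A = rand(64);  B = rand(64);
a = A - mean2(A);  b = B - mean2(B);
r_manual = sum(sum(a.*b)) / sqrt(sum(sum(a.*a)) * sum(sum(b.*b)));
abs(r_manual - corr2(A, B))   % ~0 up to round-off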
As for your last comment, you can break an image down into small blocks and calculate the correlation coefficient on each of them; that might save some time for very large images, but since everything is vectorized the calculation is already quite fast. It might be useful in distributed processing, I guess. Of course, the correlation coefficient between 2 blocks of the images is not necessarily identical to that of the whole images.
For the sake of curiosity you can look at this paper which highlights some caveats in using the correlation coefficient for image comparison.
Hope that makes things a bit clearer!

Controlled random number/dataset generation in MATLAB

Say I have a cube of dimensions 1x1x1 spanning the coordinates (0,0,0) to (1,1,1). I want to generate a random set of points (assume 10 points) within this cube which are somewhat uniformly distributed (i.e. within a certain minimum and maximum distance from each other, and also not too close to the boundaries). How do I go about this without using loops? If this is not possible using vector/matrix operations, then a solution with loops will also do.
Let me provide some more background details about my problem (This will help in terms of what I exactly need and why). I want to integrate a function, F(x,y,z), inside a polyhedron. I want to do it numerically as follows:
$\int_V F(x,y,z)\,dV \approx \sum_{i} F(x_i,y_i,z_i)\, V_i$
Here, $F(x_i,y_i,z_i)$ is the value of the function at the point $(x_i,y_i,z_i)$ and $V_i$ is the corresponding weight. So to calculate the integral accurately, I need to identify a set of random points which are neither too close to each other nor too far from each other (sorry, but I don't know yet what this range is; I will only be able to figure it out through a parametric study once I have a working code). Also, I need to do this for a 3D mesh with multiple polyhedra, hence I want to avoid loops to speed things up.
Check out this nice random-vectors-with-fixed-sum generator on the MATLAB File Exchange (FEX).
The code "generates m random n-element column vectors of values, [x1;x2;...;xn], each with a fixed sum, s, and subject to a restriction a<=xi<=b. The vectors are randomly and uniformly distributed in the n-1 dimensional space of solutions. This is accomplished by decomposing that space into a number of different types of simplexes (the many-dimensional generalizations of line segments, triangles, and tetrahedra.) The 'rand' function is used to distribute vectors within each simplex uniformly, and further calls on 'rand' serve to select different types of simplexes with probabilities proportional to their respective n-1 dimensional volumes. This algorithm does not perform any rejection of solutions - all are generated so as to already fit within the prescribed hypercube."
Use i = rand(3,10), where each column corresponds to one point and each row corresponds to the coordinate along one axis (x, y, z).
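Building on that, here is a loop-based sketch (my own, not from the original answers) that also keeps the points away from the cube faces and enforces a minimum pairwise separation by rejection; margin and dmin are arbitrary placeholder values:
npts   = 10;     % number of points
margin = 0.1;    % keep points at least this far from every face of the cube
dmin   = 0.15;   % minimum allowed distance between any two points
P = zeros(3, npts);
count = 0;
while count < npts   % note: may loop for a long time if dmin/margin are too restrictive
    cand = margin + (1 - 2*margin)*rand(3,1);   % uniform in [margin, 1-margin]^3
    if count == 0 || all(sqrt(sum((P(:,1:count) - repmat(cand, 1, count)).^2)) >= dmin)
        count = count + 1;
        P(:,count) = cand;
    end
end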