I have a graph of N vertices, where each vertex represents a place. I also have one vector per user, each with N coefficients, where a coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited.
E.g. for the graph:
the vector:
v1 = {100, 50, 0, 30, 0}
would mean that we spent:
100 secs at vertex 1,
50 secs at vertex 2, and
30 secs at vertex 4
(vertices 3 & 5 were not visited, thus the 0s).
I want to run k-means clustering, and I've chosen cosine_distance = 1 - cosine_similarity as the distance metric, where cosine_similarity is the standard formula

cosine_similarity(a, b) = sum(a .* b) / (length(a) * length(b))

as described here.
But I noticed the following. Assume k=2 and one of the vectors is:
v1 = {90,0,0,0,0}
In the process of solving the optimization problem of minimizing the total distance from candidate centroids, assume that at some point, 2 candidate centroids are:
c1 = {90,90,90,90,90}
c2 = {1000, 1000, 1000, 1000, 1000}
Running the cosine_distance formula for (v1, c1) and (v1, c2) we get exactly the same distance of 0.5527864045 for both.
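To make this concrete, here is a quick way to reproduce those numbers (a sketch in plain MATLAB, with cosine_distance as defined above):

v1 = [90 0 0 0 0];
c1 = [90 90 90 90 90];
c2 = [1000 1000 1000 1000 1000];
cos_dist = @(a, b) 1 - dot(a, b) / (norm(a) * norm(b));
cos_dist(v1, c1)   % 0.5528, i.e. 1 - 1/sqrt(5)
cos_dist(v1, c2)   % 0.5528 -- identical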
I would assume that v1 is more similar (closer) to c1 than c2. Apparently this is not the case.
Q1. Why is this assumption wrong?
Q2. Is the cosine distance a correct distance function for this case?
Q3. What would be a better one given the nature of the problem?
Let's divide cosine similarity into parts and see how and why it works.
Cosine between 2 vectors - a and b - is defined as:
cos(a, b) = sum(a .* b) / (length(a) * length(b))
where .* is an element-wise multiplication and length() is the Euclidean norm (the vector's magnitude, not its number of elements). The denominator is here just for normalization, so let's simply call it L. With it, our function turns into:
cos(a, b) = sum(a .* b) / L
which, in its turn, may be rewritten as:
cos(a, b) = (a[1]*b[1] + a[2]*b[2] + ... + a[n]*b[n]) / L =
= a[1]*b[1]/L + a[2]*b[2]/L + ... + a[n]*b[n]/L
Let's get a bit more abstract and replace x * y / L with a function g(x, y) (L is constant here, so we don't pass it as an argument). Our cosine function thus becomes:
cos(a, b) = g(a[1], b[1]) + g(a[2], b[2]) + ... + g(a[n], b[n])
That is, each pair of elements (a[i], b[i]) is treated separately, and the result is simply the sum over all pairs. This is good for your case, because you don't want different pairs (different vertices) to interfere with each other: if user1 visited only vertex2, and user2 only vertex1, they have nothing in common, and the similarity between them should be zero. What you actually don't like is how the similarity between individual pairs - i.e. the function g() - is calculated.
With the cosine function, the similarity between individual pairs looks like this:
g(x, y) = x * y / L
where x and y represent the time users spent on the vertex. And here's the main question: does multiplication represent the similarity between individual pairs well? I don't think so. A user who spent 90 seconds on some vertex should be close to a user who spent, say, 70 or 110 seconds there, but much farther from users who spent 1000 or 0 seconds there. Multiplication (even normalized by L) is totally misleading here. What does it even mean to multiply two time periods?
The good news is that it is you who designs the similarity function. We have already decided that we are satisfied with the independent treatment of pairs (vertices), and we only want the individual similarity function g(x, y) to do something reasonable with its arguments. And what is a reasonable function for comparing time periods? I'd say subtraction is a good candidate:
g(x, y) = abs(x - y)
This is not a similarity function but a distance function - the closer the values are to each other, the smaller the result of g() - but eventually the idea is the same, so we can interchange them when needed.
We may also want to increase impact of large mismatches by squaring the difference:
g(x, y) = (x - y)^2
Hey! We've just reinvented (mean) squared error! We can now stick to MSE to calculate distance, or we can proceed finding good g() function.
Sometimes we may want not to amplify, but to smooth the difference. In this case we can use a log (with 1 added inside so that identical values give log(1) = 0 instead of log(0) = -Inf):
g(x, y) = log(1 + abs(x - y))
We can use special treatment for zeros like this:
g(x, y) = sign(x)*sign(y)*abs(x - y) # sign(0) will turn whole expression to 0
Or we can go back from distance to similarity by inverting the difference (again adding 1, so that identical values give a similarity of 1 rather than a division by zero):
g(x, y) = 1 / (1 + abs(x - y))
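To make these candidates concrete, here is a sketch of them as anonymous MATLAB functions operating element-wise on two user vectors a and b (the +1 guards are the additions discussed above, not part of the bare formulas):

g_abs  = @(a, b) abs(a - b);                        % absolute difference (a distance)
g_sq   = @(a, b) (a - b).^2;                        % squared difference; summing gives SSE
g_log  = @(a, b) log(1 + abs(a - b));               % smoothed difference; +1 avoids log(0)
g_zero = @(a, b) sign(a) .* sign(b) .* abs(a - b);  % zero whenever either user skipped the vertex
d      = @(g, a, b) sum(g(a, b));                   % total distance = sum over all vertices
d(g_sq, [90 0 0 0 0], [90 90 90 90 90])             % 4 * 90^2 = 32400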
Note that in the last few options we haven't used a normalization factor. In fact, you can come up with a good normalization for each case, or just omit it - normalization is not always needed or helpful. For example, in the cosine similarity formula, if you change the normalization constant L = length(a) * length(b) to L = 1, you will get different, but still reasonable, results. E.g., for any fixed vector v with positive components:
cos(v, [90, 90, 90]) == cos(v, [1000, 1000, 1000]) # measuring angle only
cos_no_norm(v, [90, 90, 90]) < cos_no_norm(v, [1000, 1000, 1000]) # measuring both - angle and magnitude
Summarizing this long and mostly boring story, I would suggest rewriting cosine similarity/distance to use some kind of difference between variables in two vectors.
Cosine similarity is meant for the case where you do not want to take length into account, but only the angle.
If you want to also include length, choose a different distance function.
Cosine distance is closely related to squared Euclidean distance (the only distance for which k-means is really defined); which is why spherical k-means works.
The relationship is quite simple:
The squared Euclidean distance sum_i (x_i - y_i)^2 can be expanded into sum_i x_i^2 + sum_i y_i^2 - 2 * sum_i x_i*y_i. If both vectors are normalized, i.e. length does not matter, then the first two terms are 1 each. In this case, squared Euclidean distance is 2 - 2 * cos(x, y)!
In other words: Cosine distance is squared Euclidean distance with the data normalized to unit length.
If you don't want to normalize your data, don't use Cosine.
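A quick numerical check of this relationship, using the vectors from the question (a sketch; any vectors would do):

x = [90 0 0 0 0];  y = [90 90 90 90 90];
xn = x / norm(x);  yn = y / norm(y);          % normalize to unit length
sum((xn - yn).^2)                             % squared Euclidean distance: 1.1056
2 - 2 * dot(x, y) / (norm(x) * norm(y))       % 2 - 2*cos(x,y): 1.1056, the same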
Q1. Why is this assumption wrong?
As we see from the definition, cosine similarity measures angle between 2 vectors.
In your case, vector v1 lies flat along the first dimension, while c1 and c2 both point in exactly the same direction (along the all-ones diagonal) and therefore form the same angle with v1, so the cosine similarity has to be the same.
Note that the trouble lies with c1 and c2 pointing in the same direction: any v1 will have the same cosine similarity with both of them.
Q2. Is the cosine distance a correct distance function for this case?
As we see from the example in hand, probably not.
Q3. What would be a better one given the nature of the problem?
Consider Euclidean Distance.
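For the example in the question, plain Euclidean distance separates the two centroids exactly as intuition suggests (a quick check in MATLAB):

v1 = [90 0 0 0 0]; c1 = [90 90 90 90 90]; c2 = [1000 1000 1000 1000 1000];
norm(v1 - c1)   % 180      -- v1 is much closer to c1
norm(v1 - c2)   % ~2197.3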
Related
Let's say I have the following two vectors:
x = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
y = [(10-1).*rand(7,1) + 1; randi(10,1,1)];
The first seven elements are continuous values in the range [1,10]. The last element is an integer in the range [1,10].
Now I would like to compute the Euclidean distance between x and y. I think the integer element is a problem, because all the other elements can get very close while the integer element always has spacings of one. So there is a bias towards the integer element.
How can I calculate something like a normalized euclidean distance on it?
According to Wolfram Alpha and the following answer from Cross Validated, the normalized Euclidean distance is defined by:
d(x, y) = (1/2) * var(x - y) / (var(x) + var(y))
You can calculate it with MATLAB by using:
0.5*(std(x-y)^2) / (std(x)^2+std(y)^2)
Alternatively, you can use:
0.5*((norm((x-mean(x))-(y-mean(y)))^2)/(norm(x-mean(x))^2+norm(y-mean(y))^2))
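The two expressions give the same value, since the (n-1) factors inside std and the squared norms cancel; a quick check (sketch):

x = rand(8, 1); y = rand(8, 1);
d1 = 0.5 * (std(x - y)^2) / (std(x)^2 + std(y)^2);
d2 = 0.5 * ((norm((x - mean(x)) - (y - mean(y)))^2) / (norm(x - mean(x))^2 + norm(y - mean(y))^2));
abs(d1 - d2)   % ~0, up to floating-point error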
I would rather normalise x and y before calculating the distance and then vanilla Euclidean would suffice.
In your example
x_norm = (x -1) / 9; % normalised x
y_norm = (y -1) / 9; % normalised y
dist = norm(x_norm - y_norm); % Euclidean distance between normalised x, y
However, I am not sure about whether having an integer element contributes to some sort of bias but we have already gotten kind of off-topic for stack overflow :)
From Euclidean Distance - raw, normalized and double‐scaled coefficients
SYSTAT, Primer 5, and SPSS provide Normalization options for the data so as to permit an investigator to compute a distance coefficient which is essentially “scale free”. Systat 10.2’s normalised Euclidean distance produces its “normalisation” by dividing each squared discrepancy between attributes or persons by the total number of squared discrepancies (or sample size).

Frankly, I can see little point in this standardization – as the final coefficient still remains scale-sensitive. That is, it is impossible to know whether the value indicates high or low dissimilarity from the coefficient value alone.
I am implementing logistic regression using batch gradient descent. There are two classes into which the input samples are to be classified. The classes are 1 and 0. While training the data, I am using the following sigmoid function:
t = 1 ./ (1 + exp(-z));
where
z = x*theta
And I am using the following cost function to calculate cost, to determine when to stop training.
function cost = computeCost(x, y, theta)
htheta = sigmoid(x*theta);
cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
end
I am getting the cost at each step to be NaN as the values of htheta are either 1 or zero in most cases. What should I do to determine the cost value at each iteration?
This is the gradient descent code for logistic regression:
function [theta,cost_history] = batchGD(x,y,theta,alpha)
cost_history = zeros(1000,1);
for iter=1:1000
htheta = sigmoid(x*theta);
new_theta = zeros(size(theta,1),1);
for feature=1:size(theta,1)
new_theta(feature) = theta(feature) - alpha * sum((htheta - y) .* x(:,feature));
end
theta = new_theta;
cost_history(iter) = computeCost(x,y,theta);
end
end
There are two possible reasons why this may be happening to you.
The data is not normalized
This is because when you apply the sigmoid / logistic function to your hypothesis, the output probabilities are almost all approximately 0 or 1, and with your cost function, log(1 - 1), i.e. log(0), will produce -Inf. The accumulation of all of these individual terms in your cost function will eventually lead to NaN.
Specifically, if y = 0 for a training example and the output of your hypothesis is x, where x is a very small number close to 0, the first part of the cost function becomes 0*log(x) = 0*(-Inf), which produces NaN. Similarly, if y = 1 and the output of your hypothesis is close to 1, then 1 - htheta is a very small number x, the second part again gives 0*log(x), and we get NaN. Simply put, the output of your hypothesis is either very close to 0 or very close to 1.
This is most likely due to the fact that the dynamic range of each feature is widely different and so a part of your hypothesis, specifically the weighted sum of x*theta for each training example you have will give you either very large negative or positive values, and if you apply the sigmoid function to these values, you'll get very close to 0 or 1.
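You can see how quickly the sigmoid saturates with a one-liner (using the sigmoid from the question):

sigmoid = @(z) 1 ./ (1 + exp(-z));
sigmoid([-50 -5 0 5 50])   % ~[0.0000 0.0067 0.5000 0.9933 1.0000]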
One way to combat this is to normalize the data in your matrix before training with gradient descent. A typical approach is to normalize to zero mean and unit variance. Given an input feature x_k, where k = 1, 2, ..., n and you have n features, the new normalized feature x_k^{new} can be found by:
x_k^{new} = (x_k - m_k) / s_k
m_k is the mean of feature k and s_k is the standard deviation of feature k. This is also known as standardizing data. You can read up on more details about this in another answer I gave here: How does this code for standardizing data work?
Because you are using the linear algebra approach to gradient descent, I'm assuming you have prepended your data matrix with a column of all ones. Knowing this, we can normalize your data like so:
mX = mean(x,1);
mX(1) = 0;
sX = std(x,[],1);
sX(1) = 1;
xnew = bsxfun(@rdivide, bsxfun(@minus, x, mX), sX);
The mean and standard deviations of each feature are stored in mX and sX respectively. You can learn how this code works by reading the post I linked to you above. I won't repeat that stuff here because that isn't the scope of this post. To ensure proper normalization, I've made the mean and standard deviation of the first column 0 and 1 respectively. xnew contains the new normalized data matrix. Use xnew with your gradient descent algorithm instead.

Now, once you find the parameters, to perform any predictions you must normalize any new test instances with the mean and standard deviation from the training set. Because the parameters learned are with respect to the statistics of the training set, you must apply the same transformations to any test data you want to submit to the prediction model.
Assuming you have new data points stored in a matrix called xx, you would normalize them and then perform the predictions:
xxnew = bsxfun(@rdivide, bsxfun(@minus, xx, mX), sX);
Now that you have this, you can perform your predictions:
pred = sigmoid(xxnew*theta) >= 0.5;
You can change the threshold of 0.5 to whatever you believe best determines whether examples belong to the positive or the negative class.
The learning rate is too large
As you mentioned in the comments, once you normalize the data the costs appear to be finite, but then suddenly go to NaN after a few iterations. Normalization can only get you so far. If your learning rate alpha is too large, each iteration will overshoot in the direction of the minimum, making the cost oscillate or even diverge, which appears to be what is happening. In your case, the cost is diverging or increasing at each iteration to the point where it is so large that it can't be represented in floating-point precision.
As such, one other option is to decrease your learning rate alpha until you see that the cost function decreases at each iteration. A popular method for determining the best learning rate is to perform gradient descent on a range of logarithmically spaced values of alpha, see what the final cost function value is for each, and choose the learning rate that results in the smallest cost.
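As a sketch, that sweep could look like this (using the batchGD from the question and the normalized data xnew from above):

alphas = logspace(-5, 0, 6);              % candidate learning rates
final_cost = zeros(size(alphas));
for a = 1:numel(alphas)
    [~, ch] = batchGD(xnew, y, zeros(size(xnew, 2), 1), alphas(a));
    final_cost(a) = ch(end);              % cost after the last iteration
end
[~, best] = min(final_cost);
alphas(best)                              % learning rate with the smallest final cost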
Using the two facts above together should allow gradient descent to converge quite nicely, assuming that the cost function is convex. In this case for logistic regression, it most certainly is.
Let's assume you have an observation where:
the true value is y_i = 1
your model is quite extreme and says that P(y_i = 1) = 1
Then your cost function will get a value of NaN because you're adding 0 * log(0), which is undefined. Hence:
Your formula for the cost function has a problem (there is a subtle 0, infinity issue)!
As @rayryeng pointed out, 0 * log(0) produces NaN because 0 * Inf isn't kosher. This is actually a huge problem: if your algorithm believes it can predict a value perfectly, it incorrectly assigns a cost of NaN.
Instead of:
cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
You can avoid multiplying 0 by infinity by instead writing your cost function in Matlab as:
y_logical = y == 1;
cost = sum(-log(htheta(y_logical))) + sum( - log(1 - htheta(~y_logical)));
The idea is if y_i is 1, we add -log(htheta_i) to the cost, but if y_i is 0, we add -log(1 - htheta_i) to the cost. This is mathematically equivalent to -y_i * log(htheta_i) - (1 - y_i) * log(1- htheta_i) but without running into numerical problems that essentially stem from htheta_i being equal to 0 or 1 within the limits of double precision floating point.
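A quick check that the rewritten cost stays finite exactly where the original produces NaN (a toy example; note the prediction of exactly 1):

htheta = [0.9; 1.0; 0.2];
y      = [1;   1;   0  ];
sum(-y .* log(htheta) - (1 - y) .* log(1 - htheta))              % NaN, from the 0 * log(0) term
y_logical = y == 1;
sum(-log(htheta(y_logical))) + sum(-log(1 - htheta(~y_logical))) % 0.3285, finite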
It happened to me because of an indeterminate form of the type:
0*log(0)
This can happen when one of the predicted values Y equals either 0 or 1.
In my case the solution was to add an if statement to the python code as follows:
y * np.log(Y) + (1 - y) * np.log(1 - Y) if (Y != 1 and Y != 0) else 0
This way, when the actual value (y) and the predicted one (Y) are equal, no cost needs to be computed, which is the expected behavior.
(Notice that when a given Y is converging to 0 the left addend is canceled (because of y=0) and the right addend tends toward 0. The same happens when Y converges to 1, but with the opposite addend.)
(There is also a very rare scenario, which you probably won't need to worry about, where y = 0 and Y = 1 or vice versa, but if your dataset is standardized and the weights are properly initialized, it won't be an issue.)
I have measured a variable x at equidistant long intervals (every 10 min) and a variable y at non-equidistant short intervals (somewhere between every 30 s and 90 s). Timestamps (datenum) for both x and y are available, but they are never equal, so intersect doesn't work. How can I aggregate y (e.g. mean(y(...)) over the interval x(i+1) - x(i)) so I can compare the two (e.g. plot them against each other, or plot them with the same time vector)?
/edit 1: Confused x and y in my last but one sentence.
/edit 2: I feel like I didn't give you enough information in the original question, sorry for that.
Many of you suggest interpolation. x is an average wind speed over a period of 10 minutes, not a distinct measurement. So if I say time = 07:10 and x = 3 m/s, that means mean(x) = 3 m/s for the period from 07:00 to 07:10. This is why I think it's probably not the best idea to interpolate it. y is one of many (very noisy) other variables, and I want to find out the influence of (mean) x on y. So I would either like to assign many values of y to one measurement of x (in that 10-minute period), or assign a mean(y) to that one measurement of x. I assume the solutions are quite similar, code-wise.
Reading the edited (2x) question:
You are trying to estimate the value of x at some point that you don't have the measurement for. You have measurements before and after. The only thing you can do is to interpolate. What method you choose is somewhat harder to decide.
Your options are:
piecewise constant interpolation (what I think you are suggesting)
linear interpolation
spline
others
/edit: If you just want to get an average value of y between two x measurements, I suggest the following (tx and ty being the timestamps of x and y):
new_y = zeros(size(x));
new_y(1) = mean(y(ty <= tx(1)));                       % all y samples up to the first x timestamp
for ii = 2:length(x)
    new_y(ii) = mean(y(ty > tx(ii-1) & ty <= tx(ii))); % y samples in (tx(ii-1), tx(ii)]
end
Maybe an even better solution would be using hist:
n = hist(ty,tx)
Vector n contains the number of values of ty that are closest to each value in tx. Since both are monotonic, n tells you how to group the values in y. Then you can use mat2cell to put y into a cell array where each cell corresponds to one measurement of x; the second parameter n specifies how many values to put in each cell.
new_y = mat2cell(y,n)
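From there, getting one aggregated y per measurement of x is a one-liner (a sketch, assuming you want the mean, as in the question; mean_y is just an illustrative name):

mean_y = cellfun(@mean, new_y);   % one mean(y) per 10-minute x measurement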
To aggregate values, use accumarray:
accumarray(fix(ty(:) / T) + 1, y, [], @mean)
Here y is the sampled signal, ty is the timestamp array, and T is the time interval of the aggregated values (for example, T = 10 / (24 * 60) ≈ 0.0069 days for a 10-minute interval, since datenum units are days).
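A toy example of how the binning works (assumed values; ty in datenum days):

T  = 10 / (24 * 60);                 % 10-minute bins
ty = [0.0005; 0.004; 0.009; 0.012];  % sample timestamps
y  = [1; 3; 5; 7];
accumarray(fix(ty(:) / T) + 1, y, [], @mean)   % -> [2; 6], one mean per bin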
You can interpolate data from x to non-equidistant timestamps (or vice versa) (see interp1 function) and compare results.
Plot:
plot(Time_x, x, Time_y, y)
Here's a simple example of using the 1-d interpolation.
% make two example functions on different x bases.
x1 = [0:.023:10];
x2 = [0:1:10];
y1 = x1.^2/10;
y2 = 10 - x2.^1.3;
% convert both to a common x base (x1 in this case).
y2i = interp1(x2,y2,x1);
plot(x1,y1,x1,y2i)
Use linear interpolation!
It's easy and fun to do yourself. The idea is: since you know the timestamps for x, the values of x, and the values of y (but the timestamps for y don't match those of x), you can use linear interpolation (or higher order if you need it) to interpolate/"update" the values of y as if they occurred at the timestamps of x. After that you can plot both x and the interpolated y values against the same x timestamp vector.
see: http://en.wikipedia.org/wiki/Linear_interpolation
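A minimal sketch for the case in the question (tx and ty are assumed names for the datenum stamps of x and y):

yi = interp1(ty, y, tx, 'linear');   % resample y onto x's timestamps
plot(tx, x, tx, yi)                  % now both share the same time base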
Hi,
I would like to create a correlation matrix between the two data sets presented above that ignores any appearance of zeros (the green color in the picture above). Does anyone know the most efficient way to produce a smooth result?
Is there any correlation method that can identify the similarity point by point, so that the results have the "shape" of the original matrix?
Thank you.
Note: I do not have the MATLAB Statistics Toolbox.
2. Is there any correlation method that can identify the similarity point by point, so that the results have the "shape" of the original matrix?
Let's start with your second point, because it is clearer what you want there. You want a point-by-point comparison of two images, say A and B. This boils down to measuring the similarity of two scalars a and b. Let's assume these scalars are from the interval [0, Q], where Q depends on your image format (Q == 1 and Q == 255 are common in MATLAB).
Now, the simplest measure of distance is the difference d = |a - b|. You might want to normalize this to [0, 1] and also invert the values to measure similarity instead of distance. In Matlab:
S = 1 - abs(A - B) / Q;
You mentioned ignoring the zeros in the images. Well, you need to define what similarity measure you expect for a zero. One possibility is to set the similarity to zero whenever one pixel is zero:
S(A == 0 | B == 0) = 0;
You could also say that the similarity there is undefined and set the similarity to NaN:
S(A == 0 | B == 0) = nan;
Of course, you can also say that the mismatch between 10 and 11 is as bad as the mismatch between 100 and 110. In this case, you can take the distance relative to the sum a + b (known as Bray-Curtis normalization or the normalized Euclidean metric):
D = abs(A - B) ./ (A + B)
S = 1 - D / max(D(:));
You run into problems if both matrices have a zero-valued pixel at the same location. Again, there are several possibilities: you can augment the sum with a small positive value alpha (e.g. alpha = 1e-6), which prevents division by zero: D = abs(A - B) ./ (alpha + A + B).
Another option is to ignore the undefined values in D (0/0 gives NaN) and add your 'zero-processing' here, i.e.
D = abs(A - B) ./ (A + B)
D(A == 0 | B == 0) = nan;
S = 1 - D / max(D(:));
You see, there are plenty of possibilities.
1. I would like to create a correlation matrix [...]
You should definitely think more about this point and come up with a better description of what to compute. If your matrices are of size m x m, you have m^2 variables. From these you could compute an m^2 x m^2 correlation matrix, which measures the correlation of every pixel with every other pixel. Such a matrix has the largest values on the diagonal (for a covariance matrix these would be the variances; for a correlation matrix they are all 1). However, I would not suggest computing the correlation matrix if you only have two realisations.
Another option is to measure the similarity of rows or columns in the two images. Then you end up with a vector 1 x m of correlation coefficients.
However, I do not know how to compute a correlation matrix of size m x m from two inputs of size m x m, which has the largest values in the diagonal.
To just get a general correlation coefficient I'd use corr2. From the docs:
r = corr2(A,B)
Returns the correlation coefficient r between A and B, where A and B are matrices or vectors of the same size. r is a scalar double.
Roughly, I believe it's just calculating corr(A(:), B(:)).
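A quick sanity check of that claim (note that corr2 requires the Image Processing Toolbox; the toolbox-free equivalent below uses only base MATLAB):

A = magic(4); B = A + randn(4);
corr2(A, B)
a = A(:) - mean(A(:)); b = B(:) - mean(B(:));
sum(a .* b) / sqrt(sum(a.^2) * sum(b.^2))   % same value, computed by hand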