Alternate approach for pdist() from scipy in Julia?

My objective is to replicate the functionality of pdist() from SciPy in Julia.
I tried the Distances.jl package to compute pairwise distances between observations, but the results are not the same, as the examples below show.
Python Example:
from scipy.spatial.distance import pdist
a = [[1,2], [3,4], [5,6], [7,8]]
b = pdist(a)
print(b)
output --> array([2.82842712, 5.65685425, 8.48528137, 2.82842712, 5.65685425, 2.82842712])
Julia Example:
using Distances
a = [1 2; 3 4; 5 6; 7 8]
dist_function(x) = pairwise(Euclidean(), x, dims = 1)
dist_function(a)
output -->
4×4 Array{Float64,2}:
0.0 2.82843 5.65685 8.48528
2.82843 0.0 2.82843 5.65685
5.65685 2.82843 0.0 2.82843
8.48528 5.65685 2.82843 0.0
With reference to the above examples:
Does pdist() from SciPy have its metric set to 'euclidean' by default?
How should I approach this problem to replicate the results in Julia?
Documentation reference for pdist(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
Thanks in advance!

According to the documentation page you linked, to get the same form as Julia from Python (yes, I know, this is the reverse of your question), you can pass the result to squareform. That is, in your example, add
from scipy.spatial.distance import squareform
squareform(b)
Also, yes, from the same documentation page, you can see that the 'metric' parameter defaults to 'euclidean' if not explicitly defined.
For the reverse situation, note that the Python vector is just the upper-triangular elements of the matrix (for a 'proper' distance metric the resulting distance matrix is symmetric, so the lower triangle is redundant).
So you can simply collect the upper-triangular elements into a vector, as in the sketch below.
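A minimal Julia sketch of that, collecting the upper triangle in the row-major order pdist uses, so the result matches SciPy element for element:
using Distances
a = [1 2; 3 4; 5 6; 7 8]
D = pairwise(Euclidean(), a, dims = 1)       # full 4x4 symmetric distance matrix
n = size(D, 1)
b = [D[i, j] for i in 1:n-1 for j in i+1:n]  # upper triangle, row by row
# b ≈ [2.82843, 5.65685, 8.48528, 2.82843, 5.65685, 2.82843]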

For (1), the answer is yes as per the documentation you linked, which says at the top
scipy.spatial.distance.pdist(X, metric='euclidean', *args, **kwargs)
indicating that the metric arg is indeed set to 'euclidean' by default.
I'm not sure I understand your second question: the results are the same. The only difference seems to be that scipy returns the upper triangle as a vector, so if it's just about doing that, have a look at: https://discourse.julialang.org/t/vector-of-upper-triangle/7764
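Alternatively, a sketch that builds the condensed vector directly, without allocating the full n-by-n matrix (euclidean is the pairwise convenience function Distances.jl exports):
using Distances
a = [1 2; 3 4; 5 6; 7 8]
n = size(a, 1)
b = [euclidean(a[i, :], a[j, :]) for i in 1:n-1 for j in i+1:n]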

Related

MATLAB fmincon constraining vector elements

Thanks for reading this. I have a Matlab function 'myfun' that returns a scalar for a given input vector X. Now I am trying to minimize this function using fmincon, but I have trouble constraining my output vector elements.
X0=1:1:10;
fhandle = @myfun;
lb=X0(1)*ones(length(X0),1);
ub=X0(end)*ones(length(X0),1);
[X]=fmincon(fhandle,X0,[],[],[],[],lb,ub);
First off, the elements cannot be smaller than X0(1) or larger than X0(end).
So far so good, I think, but I have two more constraints for my output vector for which I cannot find a solution by searching the questions here. The first one is
X(1)=X0(1)
and
X(end)=X0(end)
So the first and last elements must be set as constants.
My final constraint has to do with the change in value from element i to i+1: it has to be limited to a certain value E, and element i must always be less than or equal to element i+1
X(i)<=X(i+1)
X(i+1)-X(i)<=E
An example output X with the following inputs X0 and E would be
X0=1:1:10;
E=3;
X=[1 1.1 1.2 1.4 1.7 2.0 2.7 4.7 7 10]
If somebody has tips on which parts/functions of fmincon or other minimization functions in Matlab to use, much appreciated!
PS: Reading the full post again, I realize that the two constraints I'm looking for imply the first one
Your question consists of two parts:
Applying equality constraints on the design variables:
Set the lower bound and upper bound to the same value:
ub(1) = lb(1);      % forces X(1) = X0(1)
lb(end) = ub(end);  % forces X(end) = X0(end)
Applying inequality constraints (X(i+1)-X(i)<=E):
Reformulate your equations in the following matrix form:
A*X <= B
with
A = zeros(9, 10);
A(:, 1:9) = -eye(9);
A(:, 2:10) = A(:, 2:10) + eye(9);
B = ones(9, 1)*E;
Then you can call fmincon as follows:
[X]=fmincon(fhandle,X0,A,B,[],[],lb,ub);
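Note that A above only encodes X(i+1) - X(i) <= E; the monotonicity constraint X(i) <= X(i+1) needs extra rows of the form X(i) - X(i+1) <= 0. A minimal sketch putting everything together (myfun is your objective function):
n  = 10;
X0 = 1:1:10;
E  = 3;
lb = X0(1)*ones(n, 1);
ub = X0(end)*ones(n, 1);
ub(1)   = lb(1);      % pin X(1)   = X0(1)
lb(end) = ub(end);    % pin X(end) = X0(end)
% row i of A1 encodes X(i+1) - X(i), so A1*X <= E enforces the step limit
A1 = [-eye(n-1), zeros(n-1, 1)] + [zeros(n-1, 1), eye(n-1)];
A  = [A1; -A1];                       % -A1*X <= 0 enforces X(i) <= X(i+1)
B  = [E*ones(n-1, 1); zeros(n-1, 1)];
X  = fmincon(@myfun, X0, A, B, [], [], lb, ub);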

Why does the matrix inverse computed by incremental update not match the pseudo-inverse? [duplicate]

I have an m*n matrix D and I am calculating the pseudo-inverse using the formula inv(D'*D)*D', but it is not generating the same result as pinv(D). I need the term inv(D'*D) for an incremental operation; all of my accuracy depends on inv(D'*D), which is not correct. Is there an alternative way to get inv(D'*D) accurately?
% D is a 3x4 matrix copied from a blog for demonstration purposes. My original matrix has the same problem, but it is too large to post here.
D = -[1/sqrt(2) 1 1/sqrt(2) 0;0 1/sqrt(2) 1 1/sqrt(2);-1/sqrt(2) 0 1/sqrt(2) 1];
B1 = pinv(D)
B2 = D'*inv(D*D')
B1 =
-0.353553390593274 0.000000000000000 0.353553390593274
-0.375000000000000 -0.176776695296637 0.125000000000000
-0.176776695296637 -0.250000000000000 -0.176776695296637
0.125000000000000 -0.176776695296637 -0.375000000000000
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND =
1.904842e-017.
B2 =
-0.250000000000000 0 0.500000000000000
-0.500000000000000 0 0
0.250000000000000 -0.500000000000000 0
0 0 -0.750000000000000
I need inv(D'*D) to do an incremental operation. In my problem, at step 1 a new row is added at the last position of D, and at step 2 the first row of D is removed. I want to find the inverse for the final D using the inverse calculated before these two steps. More precisely, have a look here:
B = inv(D'*D); % if i can calculate it accurately then further work is as follows
D1 = [D;Lr]; %Lr is last row to be added
BLr = B-((B*Lr'*Lr*B)/(1+Lr*B*Lr')); % Row addition formula
Fr = D1(1,:); % First row to be removed
D2 = removerows(D1,1);
BFr = BLr+ ((BLr*Fr'*Fr*BLr)/(1-Fr*BLr*Fr')); % row deletion formula
B = BFr;
Y = BFr*D2;
The formulas inv(D'*D)*D' and D'*inv(D*D') that you are using for the Moore-Penrose pseudoinverse are only valid if D has full column rank or full row rank, respectively.
This is not true in your case, as the warning "Matrix is close to singular" shows.
The Matlab pinv command works for arbitrary D, even if the matrix has neither full row nor full column rank.
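You can check this directly on the example matrix (a quick check; rank estimates the numerical rank via the SVD):
rank(D)   % 2 for this D: less than both size(D,1) = 3 and size(D,2) = 4,
          % so D has neither full row rank nor full column rank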
Try running cond(D) on your matrix and see what the condition number is. The higher the number, the more ill-conditioned your matrix is. Similarly, you can run cond(D'*D). A matrix can be full rank and still be ill-conditioned. On paper, an ill-conditioned matrix is still invertible. However, when you attempt to directly invert an ill-conditioned matrix on a computer, small precision errors caused by quantization and other effects can cause wildly unpredictable results in the solution.
For the above stated reason, there is usually a better (more numerically stable) way to achieve what you are after than computing the inverse directly. Many of these involve matrix decomposition techniques such as the SVD. If you help us understand why you need inv(D'*D), it would be easier to point you towards the appropriate alternative. For example, if you just need the pseudo-inverse, go ahead and use pinv(), even though it differs from your result using inv(). The pinv() function and the \ (mldivide) backslash operator are much more numerically stable tools than inv().
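For example, a minimal sketch of computing the pseudo-inverse via a truncated SVD, which is essentially what pinv does internally (the tolerance mirrors pinv's documented default):
[U, S, V] = svd(D, 'econ');
s   = diag(S);
tol = max(size(D)) * eps(norm(D));               % pinv's documented default tolerance
r   = sum(s > tol);                              % numerical rank (2 for this D)
Dp  = V(:, 1:r) * diag(1./s(1:r)) * U(:, 1:r)';  % matches pinv(D) up to roundoff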
See the official documentation at http://www.mathworks.com/help/matlab/ref/pinv.html .
If A x ~ b, the solution x = pinv(A) * b produces the minimum-norm solution, but x = A\b doesn't. See the numerical example at the link above.
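A quick illustration of that difference, using a made-up underdetermined system (both residuals are essentially zero, but pinv returns the solution with the smallest norm):
A = [1 2 3; 4 5 6];     % 2x3, full row rank: infinitely many solutions
b = [1; 2];
x1 = pinv(A)*b;         % minimum-norm solution
x2 = A\b;               % a basic solution with at most rank(A) nonzeros
[norm(A*x1 - b), norm(A*x2 - b)]   % both essentially zero
[norm(x1), norm(x2)]               % norm(x1) <= norm(x2)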

Remove highly correlated components

I have a problem removing highly correlated components. Can I ask how to do this?
For example, I have 40 instances with 20 randomly created features. Features 2 and 18 are highly correlated with feature 4, and feature 6 is highly correlated with feature 10. How can I remove the highly correlated (redundant) features, i.e. 2, 18 and 10? Essentially, I need the indices of the remaining features: 1, 3, 4, 5, 6, ..., 9, 11, ..., 17, 19, 20.
Matlab codes:
x = randn(40,20);
x(:,2) = 2.*x(:,4);
x(:,18) = 3.*x(:,4);
x(:,6) = 100.*x(:,10);
x_corr = corr(x);
size(x_corr)
figure, imagesc(x_corr),colorbar
The correlation matrix x_corr looks like this (figure omitted: imagesc plot of x_corr):
edit:
I worked out a way:
x_corr = x_corr - diag(diag(x_corr));
[x_corrX, x_corrY] = find(x_corr>0.8);
for i = 1:size(x_corrX,1)
xx = find(x_corrY == x_corrX(i));
x_corrX(xx,:) = 0;
x_corrY(xx,:) = 0;
end
x_corrX = unique(x_corrX);
x_corrX = x_corrX(2:end);
im = setxor(x_corrX, (1:20)');
Am I right? Or if you have a better idea, please post it. Thanks.
edit2: Is this method the same as using PCA?
It seems quite clear that this idea of yours, to simply remove highly correlated variables from the analysis, is NOT the same as PCA. PCA is a good way to reduce the rank of what seems to be a complicated problem into one that turns out to have only a few independent things happening. PCA uses an eigenvalue (or SVD) decomposition to achieve that goal.
Anyway, you might have a problem. For example, suppose that A is highly correlated to B, and B is highly correlated to C. However, it need not be true that A and C are highly correlated. Since correlation can be viewed as a measure of the angle between those vectors in their corresponding high dimensional vector space, this can be easily made to happen.
As a trivial example, I'll create two variables, A and B, that are correlated at a "moderate" level.
n = 50;
A = rand(n,1);
B = A + randn(n,1)/2;
corr([A,B])
ans =
1 0.55443
0.55443 1
So here 0.55 is the correlation. I'll create C to be virtually the average of A and B. It will be highly correlated by your definition.
C = (A + B)/2 + randn(n,1)/100;
corr([A,B,C])
ans =
1 0.55443 0.80119
0.55443 1 0.94168
0.80119 0.94168 1
Clearly C is the bad guy here. But if one were to simply look at the pair [A,C] and remove A from the analysis, then do the same with the pair [B,C] and then remove B, we would have made the wrong choices. And this was a trivially constructed example.
In fact, the eigenvalues of the correlation matrix might be of interest.
[V,D] = eig(corr([A,B,C]))
V =
-0.53056 -0.78854 -0.311
-0.57245 0.60391 -0.55462
-0.62515 0.11622 0.7718
D =
2.5422 0 0
0 0.45729 0
0 0 0.00046204
The fact that D has two significant diagonal elements and one tiny one tells us that this is really a two-variable problem. What PCA will not easily tell us, though, is which variable to remove, and the problem would only be less clear with more variables and many interactions between all of them.
I think woodchips' answer is quite good. But when you're using eigenvalues, you can run into trouble: if the dataset is large enough, there will always be some small eigenvalues, but you won't be sure what they tell you.
Instead, consider grouping your data by a simple clustering method. It's easy to implement in Matlab.
http://www.mathworks.de/de/help/stats/cluster-analysis-1-1.html
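A minimal sketch of that idea, assuming the Statistics Toolbox: treat 1 - |corr| as a dissimilarity between features, cluster hierarchically, and keep one representative per cluster (a cutoff of 0.2 roughly corresponds to grouping features whose |correlation| exceeds 0.8):
x = randn(40, 20);
x(:, 2) = 2*x(:, 4);  x(:, 18) = 3*x(:, 4);  x(:, 6) = 100*x(:, 10);
d = 1 - abs(corr(x));                    % 0 for perfectly correlated features
Z = linkage(squareform(d), 'complete');  % hierarchical clustering of the features
g = cluster(Z, 'cutoff', 0.2, 'criterion', 'distance');
[~, keep] = unique(g);                   % one representative index per cluster
keep = sort(keep)                        % indices of the retained features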
edit:
If you disregard the points that woodchips made, your solution is okay as an algorithm.

What's the most idiomatic way to create a vector with a 1 at index i?

In Matlab, suppose I would like to create a zero vector of length L, except with a 1 at index i?
For example, something like:
>> mostlyzeros(6, 3)
ans =
0 0 1 0 0 0
The purpose is so I can use it as a 'selection' vector which I'll multiply element-wise with another vector.
The simplest way I can think of is this:
a = (1:N)==m;
where N>=m. Having said that, if you want to use the resulting vector as a "selection vector", I don't know why you'd multiply two vectors elementwise, as I would expect that to be relatively slow and inefficient. If you want to get a vector containing only the m-th value of vector v in the m-th position, this would be a more straightforward method:
b = ((1:N)==m)*v(m);
Although the most natural method would have to be this:
b(N)=0;
b(m)=v(m);
assuming that b isn't defined before this (if b is already defined, you need to use zeros rather than just assigning the Nth value as zero). In my experience, creating a zero vector or matrix that didn't exist before is most easily done by assigning its last element to be zero; this is also useful for extending a matrix or vector.
I'm having a hard time thinking of anything more sensible than:
Vec = zeros(1, L);
Vec(i) = 1;
But I'd be happy to be proven wrong!
UPDATE: The one-liner solution provided by @GlenO is very neat! However, be aware that if efficiency is the chief criterion, then a few speed tests on my machine indicate that the simple method proposed in this answer and the other two answers is 3 or 4 times faster...
NEXT UPDATE: Ah! So that's what you mean by "selection vectors". @GlenO has given a good explanation of why, for this operation, a vector of ones and zeros is not idiomatic Matlab - however you choose to build it.
PS: Try to avoid using i as an index variable, since it is actually a Matlab function (the imaginary unit).
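To illustrate the point about selection vectors (a small sketch): logical indexing picks the element directly, so no elementwise multiply is needed:
v   = [10 20 30 40 50 60];
sel = (1:numel(v)) == 3;   % logical selection vector: 0 0 1 0 0 0
v .* sel                   % elementwise product: 0 0 30 0 0 0
v(sel)                     % logical indexing: just 30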
Just for the fun of it, another one-liner:
function [out] = mostlyzeros(idx, L)
out([L, idx]) = [0 1];  % implicitly grows out to length L (zero-filled) and sets out(idx) = 1
I can think of:
function mat = mostlyones(m, n)
mat = zeros(1, m);
mat(n) = 1;
Also, one thing to note: in MATLAB, indexing starts from one, not from zero. So your function call should have been mostlyzeros(6, 3)
I would simply create a zero-vector and change whatever value you like to one:
function a = zeroWithOne(numOfZeros, pos)
a = zeros(numOfZeros, 1);
a(pos) = 1;
Another one-line option, which should be fast, is:
vec = sparse(1, ii, 1, 1, L);
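One caveat (a small note): sparse gives back a sparse matrix rather than an ordinary dense vector, so convert with full if a dense result is needed:
vec = sparse(1, 3, 1, 1, 6);   % 1x6 sparse with a single 1 at index 3
v = full(vec)                  % dense equivalent: 0 0 1 0 0 0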