Dot product of two columns containing sparse vectors - pyspark

I have a pyspark dataframe with two columns containing sparse vectors. I want to create a third column (Column C) containing the element-wise combination of these two columns (in the example below, each element of Column A divided by the matching element of Column B).
Column A:
(262144,[45252,99197,108625],[1.0,2.0,1.0])
Column B:
(262144,[45252,99197,108625],[1.252762968495368,1.6945957207744073,1.252762968495368])
I'm looking to get:
Column C:
(262144,[45252,99197,108625],[1/1.252762968495368,2/1.6945957207744073,1/1.252762968495368])
How can I do it in the most scalable way? Thank you.
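A minimal sketch of the per-row arithmetic, in plain Python so that it runs without a Spark session. The function name `sparse_elementwise_divide` is hypothetical, and the rule of dropping indices that are missing or zero in Column B is an assumption, not something stated in the question.

```python
def sparse_elementwise_divide(size, idx_a, val_a, idx_b, val_b):
    """Divide sparse vector A by sparse vector B on the indices they share.

    Assumption: an index that is absent (or zero) in B is dropped from the
    result, because dividing by an implicit zero is undefined.
    """
    b_lookup = dict(zip(idx_b, val_b))
    out_idx, out_val = [], []
    for i, a in zip(idx_a, val_a):
        b = b_lookup.get(i)
        if b is not None and b != 0.0:
            out_idx.append(i)
            out_val.append(a / b)
    return size, out_idx, out_val


# The vectors from the question, written as (size, indices, values).
size, idx, vals = sparse_elementwise_divide(
    262144,
    [45252, 99197, 108625], [1.0, 2.0, 1.0],
    [45252, 99197, 108625],
    [1.252762968495368, 1.6945957207744073, 1.252762968495368],
)
print(size, idx, vals)
```

In PySpark, one possible way to apply this is to wrap the function in a `udf` with return type `VectorUDT()`, read `.size`, `.indices` and `.values` from each input `SparseVector`, and build a new `SparseVector` from the result. A Python UDF processes rows one at a time, so it may not be the most scalable option.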

Related

Most efficient way to divide Sparse Vector element by an int in a Pyspark Dataframe

I have a pyspark dataframe that contains a column with a SparseVector in each row. I'm looking to divide all the elements of that sparse vector by the length of the list held in another column.
What would be the most efficient way to do this?
Thank you.

Spark Scala: compute Pearson correlation of 2 Vectors

I have a DataFrame where each row has 2 vector columns.
Like
ratings1 | ratings2
Vector | Vector
I'm trying to understand how I can use the Statistics.corr method to compute a similarity score between those two columns for each row.
So far, I have realised that corr doesn't accept two Vectors as parameters. So, should I create an RDD for each pair of Vectors?
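For reference, the per-row quantity being asked about can be sketched directly. This is a language-neutral illustration in plain Python, not the Spark `Statistics.corr` API; the function name `pearson` and the choice to return NaN for a constant vector are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant vector has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


r_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(r_same, r_opposite)
```

A function of this shape could be applied once per row (for example through a user-defined function over the two vector columns), which avoids building a separate RDD for every pair of vectors.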

How can one make seaborn clustermap cluster rows and cols jointly

How can one constrain the row and column clustering to be the same (e.g. for finding groups in a pair-wise matrix)?
According to the docs, you can turn row or column clustering on or off, but the two are independent of each other.
It seems that pre-computing the linkage and then feeding it into both the rows and the columns works. Since my D matrix is symmetric, the linkage will be identical for both rows and columns.
This can be accomplished with the following code:
import seaborn
from scipy.cluster.hierarchy import linkage
link = linkage(D)  # D being the measurement
seaborn.clustermap(D, row_linkage=link, col_linkage=link)  # same linkage on both axes
Maybe you can first calculate the correlation matrix of D and then plot the clustermap.
corr_df = D.corr()  # D must be a pandas DataFrame here
seaborn.clustermap(corr_df.to_numpy())  # as_matrix() was removed in pandas 1.0

How to compute the outer product of two binary vectors

I am generating a random binary matrix with a specific number of ones in each row. Now, I want to take each row in the matrix and multiply it by its transpose (i.e. row1'*row1).
So, I am using row1=rnd_mat(1,:) to get the first row. However, in the multiplication step I get this error:
"Both logical inputs must be scalar. To compute elementwise TIMES, use TIMES (.*) instead."
I don't want an element-wise product; I want to generate a matrix using the outer product. When I wrote row1 out manually as [0 0 1 ...] and computed the outer product, I got the matrix I wanted.
So, does anyone have some ideas on how I can do this?
Matrix multiplication of logical matrices or vectors is not supported in MATLAB. That is the reason why you are getting that error. You need to convert your matrix into double or another valid numeric input before attempting to do that operation. Therefore, do something like this:
rnd_mat = double(rnd_mat); %// Cast to double
row1 = rnd_mat(1,:);
result = row1.'*row1;
What you are essentially computing is the outer product of two vectors. If you want to avoid casting to double, consider using bsxfun to do the job for you instead:
result = bsxfun(@times, row1.', row1);
This way, you don't need to cast your matrix before doing the outer product. Remember, the outer product of two vectors is simply an element-wise multiplication of two matrices: one matrix has a copy of the row vector in every row, while the other has a copy of the column vector in every column.
bsxfun automatically broadcasts each row vector and column vector so that we produce two matrices of compatible dimensions, and performs an element by element multiplication, thus producing the outer product.
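The broadcasting idea can be illustrated outside MATLAB as well. The following is a minimal sketch in plain Python (the function name `outer_product` is hypothetical): every element of the column vector is multiplied by every element of the row vector, which is what the two broadcast matrices amount to.

```python
def outer_product(col, row):
    """Outer product: entry (i, j) is col[i] * row[j]."""
    return [[c * r for r in row] for c in col]


# A logical (boolean) row, as in the question; Python multiplies
# booleans as the integers 0 and 1, so no explicit cast is needed.
row1 = [False, False, True, True]
result = outer_product(row1, row1)
for line in result:
    print(line)
```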

Find a new matrix considering the values of one of the columns

I have an m-rows by n-columns matrix in MATLAB. One of the columns (#2) contains the VIP values, and I want to obtain the matrix that contains only the rows of the original matrix where VIP > 1.
How can I find this new matrix?
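The selection being asked for is a row filter on one column. Below is a minimal sketch in plain Python, with hypothetical data and a hypothetical helper name; in MATLAB the same idea is usually written with logical indexing, e.g. `M(M(:,2) > 1, :)`.

```python
def rows_where_column_exceeds(matrix, col, threshold):
    """Keep only the rows whose value in column `col` is above `threshold`."""
    return [row for row in matrix if row[col] > threshold]


# Hypothetical 4x3 matrix; the second column (index 1 in Python,
# column #2 in MATLAB's 1-based indexing) holds the VIP values.
M = [
    [10, 0.5, 3],
    [20, 1.7, 4],
    [30, 1.0, 5],
    [40, 2.2, 6],
]
filtered = rows_where_column_exceeds(M, 1, 1)
print(filtered)
```

Note that the comparison is strict, so the row with VIP exactly equal to 1 is excluded.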