divide Matrix by a number - pyspark

I have a BlockMatrix and I would like to divide this matrix by a number (e.g. by 2). However, the pyspark.mllib matrix library does not offer any function for dividing by a number, only for the dot product and for addition/subtraction. How can I divide each entry in the BlockMatrix by a number?

Have you tried mapping over the blocks? BlockMatrix itself has no map or element-wise division method, but its blocks attribute is an RDD of ((blockRowIndex, blockColIndex), matrix) tuples, so you can divide each block's local matrix and rebuild the BlockMatrix:

from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.linalg.distributed import BlockMatrix

def divide_block(block):
    (i, j), mat = block
    arr = mat.toArray() / 2  # NumPy divides every entry; note this densifies sparse blocks
    return ((i, j), DenseMatrix(arr.shape[0], arr.shape[1], arr.ravel(order='F')))

matrix = BlockMatrix(matrix.blocks.map(divide_block),
                     matrix.rowsPerBlock, matrix.colsPerBlock)
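For instance, with a hypothetical toy matrix (assuming a live SparkContext sc):

from pyspark.mllib.linalg import Matrices

# a 2x2 matrix held in a single block; values are given column-major
blocks = sc.parallelize([((0, 0), Matrices.dense(2, 2, [2.0, 4.0, 6.0, 8.0]))])
matrix = BlockMatrix(blocks, 2, 2)
# after the division above, matrix.toLocalMatrix() holds [[1.0, 3.0], [2.0, 4.0]]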

Related

Cosine similarity of two sparse vectors in Scala Spark

I have a dataframe with two columns where each row holds a SparseVector. I am trying to find a proper way to calculate the cosine similarity (or just the dot product) of the two vectors in each row.
However, I haven't been able to find any library or tutorial that does this for sparse vectors.
The only way I found is the following:
Create a k X n matrix, where the n items are described as k-dimensional vectors. To represent each item as a k-dimensional vector, you can use ALS, which represents each entity in a latent factor space. The dimension of this space (k) can be chosen by you. This k X n matrix can be represented as RDD[Vector].
Convert this k X n matrix to RowMatrix.
Use columnSimilarities() function to get a n X n matrix of similarities between n items.
I feel it is overkill to calculate all the cosine similarities for every pair when I need them only for specific pairs in my (quite big) dataframe.
In Spark 3 there is now a dot method on a SparseVector object, which takes another vector as its argument.
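A minimal sketch of using it (assuming Spark 3.x, per the note above):

import org.apache.spark.ml.linalg.Vectors

val v1 = Vectors.sparse(4, Array(0, 2), Array(1.0, 3.0))
val v2 = Vectors.sparse(4, Array(2, 3), Array(2.0, 4.0))
val dp = v1.dot(v2)  // only index 2 overlaps: 3.0 * 2.0 = 6.0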
If you want to do this in earlier versions, you could create a user defined function that follows this algorithm:
Take intersection of your vectors' indices.
Get two subarrays of your vectors' values based on the indices from the intersection.
Do pairwise multiplication of the elements of those two subarrays.
Sum the values resulting from that pairwise multiplication.
Here's my implementation of it:

import org.apache.spark.ml.linalg.SparseVector

def dotProduct(vec: SparseVector, vecOther: SparseVector): Double = {
  val commonIndices = vec.indices intersect vecOther.indices
  // note: reduce throws on an empty intersection; see the follow-up below
  commonIndices.map(x => vec(x) * vecOther(x)).reduce(_ + _)
}
I guess you know how to turn it into a Spark UDF from here and apply it to your dataframe's columns.
And if you normalize your sparse vectors with org.apache.spark.ml.feature.Normalizer before computing your dot product, you'll get cosine similarity in the end (by definition).
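(Recall that cosine(u, v) = (u . v) / (||u|| * ||v||), so once both vectors have unit L2 norm, their dot product is exactly the cosine similarity.)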
Great answer above @Sergey-Zakharov, +1.
A few add-ons:
The reduce doesn't work on empty sequences.
Make sure to compute the L2 normalization.
import org.apache.spark.ml.feature.Normalizer

val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)
val l2NormData = normalizer.transform(df_features)
and
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.udf

val dotProduct = udf { (v1: SparseVector, v2: SparseVector) =>
  // reduceOption handles an empty intersection, returning 0.0
  v1.indices.intersect(v2.indices).map(x => v1(x) * v2(x)).reduceOption(_ + _).getOrElse(0.0)
}
and then
import org.apache.spark.sql.functions.{broadcast, col}

val df = dfA.crossJoin(broadcast(dfB))
  .withColumn("dot", dotProduct(col("featuresA"), col("featuresB")))
If the number of vectors you want to calculate the dot product with is small, cache the RDD[Vector] table. Create a new table [cosine_vectors] that is a filter on the original table to only select the vectors you want the cosine similarities for. Broadcast join those two together and calculate.
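A rough sketch of that approach, with assumed names (df, id, normFeatures, wantedIds) and the dotProduct UDF from above:

val wanted = df.filter(col("id").isin(wantedIds: _*))   // only the vectors you care about
  .withColumnRenamed("id", "idB")
  .withColumnRenamed("normFeatures", "normFeaturesB")
val sims = df.crossJoin(broadcast(wanted))              // the small side is broadcast
  .withColumn("cosine", dotProduct(col("normFeatures"), col("normFeaturesB")))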

Spark out of memory when reducing by key

I'm working on an algorithm that requires math operations on large matrices. Basically, the algorithm involves the following steps:
Inputs: two vectors u and v of size n.
1. For each vector, compute the pairwise Euclidean distances between elements of the vector. Return two matrices E_u and E_v.
2. To each entry of the two matrices, apply a function f. Return two matrices M_u and M_v.
3. Find the eigenvalues and eigenvectors of M_u. Return e_i, ev_i for i = 0,...,n-1.
4. Compute the outer product of each eigenvector. Return a matrix O_i = ev_i * transpose(ev_i), i = 0,...,n-1.
5. Adjust each eigenvalue with e_i = e_i + delta_i, where delta_i = sum of all elements of (elementwise product of O_i and M_v) / (2*mu), and mu is a parameter.
6. Finally, return the matrix A = elementwise sum of (e_i * O_i) over i = 0,...,n-1.
The issue I'm facing is mainly memory when n is large (15000 or more), since all the matrices here are dense. My current implementation may not be the best, and it only partially works.
I used a RowMatrix for M_u and obtained the eigendecomposition via SVD.
The resulting U factor of the SVD is a row matrix whose columns are the ev_i's, so I have to manually transpose it so that its rows become the ev_i's. The resulting e vector holds the eigenvalues e_i.
Since a previous attempt to map each row ev_i directly to O_i failed with an out-of-memory error, I'm currently doing

val R = U.map { case (i, ev_i) =>
    (i, ev_i.toArray.zipWithIndex)   // add an index for each element of the vector
  }
  .flatMapValues(x => x)
  .join(U)                           // the eigenvector column is appended
  .map { case (eigenVecId, ((vecElement, elementId), eigenVec)) =>
    (elementId, (eigenVecId, vecElement * eigenVec))
  }
To compute the adjusted e_i's in step 5 above, M_v is stored as an RDD of tuples (i, denseVector). Then:
val deltaRdd = R.join(M_v)
  .map { case (j, ((i, row_j_of_O_i), row_j_of_M_v)) =>
    (i, row_j_of_O_i.t * DenseVector(row_j_of_M_v.toArray) / (2 * mu))
  }
  .reduceByKey(_ + _)
Finally, to compute A, again due to the memory issue, I have to first join rows from the different RDDs and then reduce by key. Specifically:
val R_rearranged = R.map { case (j, (i, row_j_of_O_i)) => (i, (j, row_j_of_O_i)) }
val termsForA = R_rearranged.join(deltaRdd)
val A = termsForA.map { case (i, ((j, row_j_of_O_i), delta_i)) =>
    (j, (delta_i + e(i)) * row_j_of_O_i)
  }
  .reduceByKey(_ + _)
The above implementation works up to termsForA: if I execute an action on termsForA, like termsForA.take(1).foreach(println), it succeeds. But if I execute an action on A, like A.count(), an OOM error occurs on the driver.
I tried tuning Spark's configuration to increase the driver memory as well as the parallelism level, but it all failed.
Use an IndexedRowMatrix instead of a RowMatrix; it helps with the conversions and the transpose.
Suppose your IndexedRowMatrix is Irm:

svd = Irm.computeSVD(k, True)  # computeU=True so the U factor is returned
U = svd.U
# transpose via CoordinateMatrix so the rows become the eigenvectors
U = U.toCoordinateMatrix().transpose().toIndexedRowMatrix()
You can convert Irm to BlockMatrix for multiplication with another distributed BlockMatrix.
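For example (a minimal sketch; other is assumed to be another distributed row matrix):

block_m = Irm.toBlockMatrix()                      # default 1024x1024 blocks
product = block_m.multiply(other.toBlockMatrix())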
I guess at some point Spark decided there was no need to carry out the operations on executors and did all the work on the driver. Actually, termsForA would fail as well in an action like count. Somehow I made it work by broadcasting deltaRdd and e.
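In case it helps, a hypothetical sketch of that fix: collect the small per-index values to the driver and broadcast them, so the final step needs no join:

val deltaMap = sc.broadcast(deltaRdd.collectAsMap())  // i -> delta_i, small
val eBc = sc.broadcast(e)                             // the eigenvalues
val A = R_rearranged.map { case (i, (j, row_j_of_O_i)) =>
  (j, (deltaMap.value(i) + eBc.value(i)) * row_j_of_O_i)
}.reduceByKey(_ + _)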

Matrix indices for creating sparse matrix

I want to create a 4 by 4 sparse matrix A. I want to assign values (e.g. 1) to the following entries:
A(2,1), A(3,1), A(4,1)
A(2,2), A(3,2), A(4,2)
A(2,3), A(3,3), A(4,3)
A(2,4), A(3,4), A(4,4)
According to the manual page, I know that I should store the indices by row and column respectively. That is, for row indices,
r=[2,2,2,2,3,3,3,3,4,4,4,4]
Also, for column indices
c=[1,2,3,4,1,2,3,4,1,2,3,4]
Since I want to assign 1 to each of these entries, I use
value = ones(1,length(r))
Then, my sparse matrix will be
Matrix = sparse(r,c,value,4,4)
My problem is this:
Indeed, I want to construct a square matrix of arbitrary dimension. Say it is a 10 by 10 matrix; then my column vector will be
[1,2,..., 10, 1,2, ..., 10, 1,...,10, 1,...10]
For row vector, it will be
[2,2,...,2,3,3,...,3,...,10, 10, ...,10]
I would like to ask if there is a quick way to build these column and row vectors efficiently. Thanks in advance.
I think the question aims at creating the vectors r and c in an easy way.

n = 4;
c = repmat(1:n, 1, n-1);                % 1..n repeated n-1 times
r = reshape(repmat(2:n, n, 1), 1, []);  % each of 2..n repeated n times
value = ones(1, numel(r));
Matrix = sparse(r, c, value, n, n);

This creates your specified vectors for general n.
However, as pointed out by others, mostly-full sparse matrices are not very efficient due to overhead. If I recall correctly, a sparse matrix offers advantages only if the density is below roughly 25%. Having everything filled except the first row will result in slower performance.
You can also sparsify a matrix after creating its full version:

A = ones(10, 10);
A(1,:) = 0;
B = sparse(A);
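Note that this allocates the full dense matrix first, so it only saves memory afterwards; for very large matrices, building r, c, and value directly (as above) avoids ever creating the dense version.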

Finding the largest vector inside a matrix

I am trying to find the largest vector inside a matrix composed of vectors with MATLAB, but I am having some difficulties, so I would be very thankful for any help. I have this:
The matrix paths (the solution of a Dijkstra function) is a 1000x1000 matrix whose values are vectors of 1 row and varying numbers of columns (when a vector has more than 10 columns, it displays as "1x11 double", "1x12 double", etc.). The matrix paths has this form:
      1                2                               3               ...
1     1                <1x20 double>                   <1x16 double>
2     <1x20 double>    2                               [2,870,183,492,641,863,611,3]
3     <1x16 double>    [3,611,863,641,492,183,870,2]   3
4     <1x25 double>    <1x12 double>                   <1x14 double>
...
At first I thought of just finding the largest vector in the matrix with
B = max(length(paths))
However, MATLAB returns B = 1000, a value which is feasible but not likely. When trying to find the position of the vector with
[row,column] = find(length(paths) == B)
MATLAB returns row = 1, column = 1, which is surely wrong. I suspect it is a problem with how MATLAB handles the data: it seems not to treat the entries of the matrix as vectors, because when I enter
length(paths(3,2))
it returns 1, although as I understand it should return 8. Also, when I enter
paths(3,2)
it returns [1x8 double], but I expect to see the whole vector. I don't know what to do; maybe a for loop? I really do not know whether MATLAB treats the data in the matrix as vectors or as simple double values.
The cell with the largest vector can be found using cellfun and numel to get the number of elements in each numeric matrix stored in the cells of paths:
vecLens = cellfun(@numel, paths);
[maxLen, im] = max(vecLens(:));
[rowMax, colMax] = ind2sub(size(vecLens), im)
This builds a 1000x1000 numeric matrix vecLens containing the sizes, max gets the linear index of the largest element, and ind2sub translates that into row/column indices. (Since paths is a cell array, paths(3,2) returns a 1x1 cell; use paths{3,2} to get the vector itself.)
A note on length: it gives you the size of the largest dimension. The size of paths is 1000x1000, so length(paths) is 1000. My advice: don't ever use length; use size, specifying the dimension you want.
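For example:

length(paths)    % 1000 -- the largest dimension, whichever it is
size(paths, 1)   % 1000 -- explicitly the number of rows
size(paths, 2)   % 1000 -- explicitly the number of columns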
If multiple vectors are the same length, you get the first one with the above approach. To get all of them (starting after the max command):
maxMask = vecLens == maxLen;
if nnz(maxMask) > 1
    [rowMax, colMax] = find(maxMask);
else
    [rowMax, colMax] = ind2sub(size(vecLens), im)
end
or just
[rowMax,colMax] = find(vecLens==maxLen);

Unitize the columns of matrix

Given a d x n matrix (d-dimensional, n objects), I would like to compute the unit-length vector of each column (i.e. every column of the resulting matrix should have unit length).
How can I do it without looping over every column?
I'm assuming you're using the L2 norm. In that case,
normalizedVector = bsxfun(@rdivide, vector, sqrt(sum(vector.^2, 1)));
will have unit length along each column.
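As a quick sanity check (a small sketch with a random matrix):

M = rand(3, 5);                               % d = 3 dimensions, n = 5 objects
U = bsxfun(@rdivide, M, sqrt(sum(M.^2, 1)));
sqrt(sum(U.^2, 1))                            % every column norm is 1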