I'm wondering what would be a good way to multiply each column of a RowMatrix
by an integer (each row being multiplied by a different integer).
I know I could, for example, create a diagonal mllib Matrix containing
the values a1 ... an (ai being the coefficient I want to multiply the ith
column of the RowMatrix by), and then use mllib's matrix multiplication (multiplying a RowMatrix by a Matrix yields a RowMatrix). However, this is probably not efficient, and it does not show how to
operate on a RowMatrix directly.
I'm new to writing functions on RowMatrix objects; I tried looking at some of
the existing ones and was a bit confused.
Thanks for your help.
It's unclear whether you want to multiply each row or each column by a different integer. Your title and second paragraph say each column, but your first sentence says each row. Regardless, these sorts of operations are probably most easily implemented by calling .rows and operating on the underlying RDD[Vector]. For instance:
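import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def multiplyColumns(m: RowMatrix, xs: Array[Double]): RowMatrix = {
  // Scale element j of every row by xs(j).
  val newRowsRdd = m.rows.map { row =>
    Vectors.dense(row.toArray.zip(xs).map { case (a, b) => a * b })
  }
  new RowMatrix(newRowsRdd)
}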
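If the intent was instead to scale each row by its own factor, the position of the row is needed as well. Here is a minimal local sketch of both variants on plain arrays, assuming the rows are held as a Seq[Array[Double]]. The names ScalingSketch, scaleColumns and scaleRows are hypothetical and not part of Spark. On a RowMatrix the column variant corresponds to m.rows.map, and the row variant would probably go through m.rows.zipWithIndex to recover each row's position.

```scala
object ScalingSketch {
  // Multiply column j of every row by xs(j).
  def scaleColumns(rows: Seq[Array[Double]], xs: Array[Double]): Seq[Array[Double]] =
    rows.map { row =>
      require(row.length == xs.length, "one factor per column is expected")
      row.zip(xs).map { case (a, b) => a * b }
    }

  // Multiply every element of row i by xs(i).
  def scaleRows(rows: Seq[Array[Double]], xs: Array[Double]): Seq[Array[Double]] = {
    require(rows.length == xs.length, "one factor per row is expected")
    rows.zip(xs).map { case (row, factor) => row.map(_ * factor) }
  }
}
```

Note that the order of rows in a RowMatrix is not guaranteed to be meaningful, so the row variant is usually safer on an IndexedRowMatrix, where every row carries an explicit index.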
I have a dataframe with two columns, each holding a sparse vector in every row. I am trying to find a proper way to calculate the cosine similarity (or just the dot product) of the two vectors in each row.
However, I haven't been able to find any library or tutorial to do it for Sparse vectors.
The only way I found is the following:
Create a k X n matrix, where n items are described as k-dimensioned vectors. For representing each item as a k dimension vector, you can use ALS which represents each entity in a latent factor space. The dimension of this space (k) can be chosen by you. This k X n matrix can be represented as RDD[Vector].
Convert this k X n matrix to RowMatrix.
Use columnSimilarities() function to get a n X n matrix of similarities between n items.
I feel it is overkill to calculate the cosine similarity for every pair when I need it only for the specific pairs in my (quite big) dataframe.
Spark 3 added a dot method to SparseVector, which takes another vector as its argument.
If you want to do this in earlier versions, you could create a user-defined function that follows this algorithm:
Take the intersection of your vectors' indices.
Get two subarrays of your vectors' values based on the indices from the intersection.
Do pairwise multiplication of the elements of those two subarrays.
Sum the values resulting from that pairwise multiplication.
Here's my implementation of it:
import org.apache.spark.ml.linalg.SparseVector

def dotProduct(vec: SparseVector, vecOther: SparseVector): Double = {
  val commonIndices = vec.indices intersect vecOther.indices
  commonIndices.map(x => vec(x) * vecOther(x)).reduce(_ + _)
}
I guess you know how to turn it into a Spark UDF from here and apply it to your dataframe's columns.
And if you normalize your sparse vectors with org.apache.spark.ml.feature.Normalizer before computing your dot product, you'll get cosine similarity in the end (by definition).
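To make the arithmetic concrete without a Spark session, here is a small local sketch of the same steps on plain arrays. It assumes each sparse vector is given as parallel arrays of distinct indices and their values, which is the layout SparseVector exposes through indices and values. The object and method names are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```scala
object SparseDotSketch {
  // Dot product of two sparse vectors given as (indices, values) pairs.
  def dot(idxA: Array[Int], valA: Array[Double],
          idxB: Array[Int], valB: Array[Double]): Double = {
    val b = idxB.zip(valB).toMap
    // Only indices present in both vectors contribute; sum is 0.0 when none match.
    idxA.zip(valA).collect { case (i, v) if b.contains(i) => v * b(i) }.sum
  }

  // L2 norm of the stored values (the implicit zeros contribute nothing).
  def norm(values: Array[Double]): Double =
    math.sqrt(values.map(v => v * v).sum)

  // Cosine similarity: the dot product divided by the product of the L2 norms.
  def cosine(idxA: Array[Int], valA: Array[Double],
             idxB: Array[Int], valB: Array[Double]): Double = {
    val denom = norm(valA) * norm(valB)
    if (denom == 0.0) 0.0 else dot(idxA, valA, idxB, valB) / denom
  }
}
```

Dividing by the norms at the end is equivalent to normalizing both vectors first and then taking the dot product.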
Great answer above, @Sergey-Zakharov, +1.
A few additions:
reduce throws an exception on an empty sequence, which happens when the two vectors share no indices.
Make sure to apply L2 normalization.
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.{broadcast, col, udf}

val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)
val l2NormData = normalizer.transform(df_features)
and
val dotProduct = udf { (v1: SparseVector, v2: SparseVector) =>
  v1.indices.intersect(v2.indices).map(x => v1(x) * v2(x)).reduceOption(_ + _).getOrElse(0.0)
}
and then
val df = dfA.crossJoin(broadcast(dfB))
  .withColumn("dot", dotProduct(col("featuresA"), col("featuresB")))
If the number of vectors you want to calculate the dot product with is small, cache the RDD[Vector] table. Create a new table [cosine_vectors] that is a filter on the original table to only select the vectors you want the cosine similarities for. Broadcast join those two together and calculate.
I'm working on an algorithm that requires math operations on large matrices. Basically, the algorithm involves the following steps:
Inputs: two vectors u and v of size n
1. For each vector, compute the pairwise Euclidean distances between its elements. Return two matrices E_u and E_v.
2. For each entry in the two matrices, apply a function f. Return two matrices M_u and M_v.
3. Find the eigenvalues and eigenvectors of M_u. Return e_i, ev_i for i = 0,...,n-1.
4. Compute the outer product of each eigenvector with itself. Return a matrix O_i = ev_i*transpose(ev_i), i = 0,...,n-1.
5. Adjust each eigenvalue with e_i = e_i + delta_i, where delta_i = sum of all elements of (elementwise product of O_i and M_v) / (2*mu), and mu is a parameter.
6. Finally, return a matrix A = elementwise sum (e_i * O_i) over i = 0,...,n-1.
The main issue I'm facing is memory when n is large (15000 or more), since all matrices here are dense. My current implementation may not be the best, and it only partially works.
I used a RowMatrix for M_u and got the eigendecomposition using SVD.
The resulting U factor of the SVD is a RowMatrix whose columns are the ev_i's, so I have to manually transpose it so that its rows become ev_i. The resulting vector e holds the eigenvalues e_i.
Since a previous attempt at directly mapping each row ev_i to O_i failed with an out-of-memory error, I'm currently doing
val R = U
  .map { case (i, ev_i) => (i, ev_i.toArray.zipWithIndex) } // add an index to each element of the vector
  .flatMapValues(x => x)
  .join(U) // the eigenvector column is appended
  .map { case (eigenVecId, ((vecElement, elementId), eigenVec)) =>
    (elementId, (eigenVecId, vecElement * eigenVec))
  }
To compute the adjusted e_i's in step 5 above, M_v is stored as an RDD of tuples (i, denseVector). Then
val deltaRdd = R.join(M_v)
  .map { case (j, ((i, row_j_of_O_i), row_j_of_M_v)) =>
    (i, row_j_of_O_i.t * DenseVector(row_j_of_M_v.toArray) / (2 * mu))
  }
  .reduceByKey(_ + _)
Finally, to compute A, again because of memory issues, I first have to join rows from different RDDs and then reduce by key. Specifically,
val R_rearranged = R.map { case (j, (i, row_j_of_O_i)) => (i, (j, row_j_of_O_i)) }
val termsForA = R_rearranged.join(deltaRdd)
val A = termsForA
  .map { case (i, ((j, row_j_of_O_i), delta_i)) => (j, (delta_i + e(i)) * row_j_of_O_i) }
  .reduceByKey(_ + _)
The above implementation works up to the termsForA step: an action on termsForA, such as termsForA.take(1).foreach(println), succeeds. But an action on A, such as A.count(), fails with an OOM error on the driver.
I tried tuning Spark's configuration to increase driver memory as well as the parallelism level, but every attempt failed.
Use IndexedRowMatrix instead of RowMatrix; it helps with conversions and transposition.
Suppose your IndexedRowMatrix is Irm (PySpark syntax):
svd = Irm.computeSVD(k, True)
U = svd.U
U = U.toCoordinateMatrix().transpose().toIndexedRowMatrix()
You can convert Irm to BlockMatrix for multiplication with another distributed BlockMatrix.
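Separately from the choice of matrix type, the adjustment step in the question may not need the n x n outer products at all. Since O_i = ev_i*transpose(ev_i), the sum of all elements of the elementwise product of O_i and M_v equals transpose(ev_i) * M_v * ev_i, a quadratic form that needs only one matrix-vector product per eigenvector. Below is a small local sketch on plain arrays that checks this identity. The names are hypothetical, and M_v is held as a dense array of rows purely for illustration.

```scala
object DeltaSketch {
  type Mat = Array[Array[Double]]

  // Direct form: build O = v * v^T, then sum the elementwise product with m.
  def deltaViaOuterProduct(v: Array[Double], m: Mat, mu: Double): Double = {
    val n = v.length
    val o = Array.tabulate(n, n)((j, k) => v(j) * v(k))
    val total = (for (j <- 0 until n; k <- 0 until n) yield o(j)(k) * m(j)(k)).sum
    total / (2 * mu)
  }

  // Quadratic form: v^T * (m * v), which never materializes the outer product.
  def deltaViaQuadraticForm(v: Array[Double], m: Mat, mu: Double): Double = {
    val mv = m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
    v.zip(mv).map { case (a, b) => a * b }.sum / (2 * mu)
  }
}
```

With M_v stored by rows, m * v is one dot product per row, so in a distributed setting the eigenvectors could be broadcast rather than joined. Whether that fits in memory depends on n and on how many eigenvectors are kept.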
I guess at some point Spark decided there was no need to carry out the operations on the executors and did all the work on the driver. Actually, termsForA would also fail on an action like count. Somehow I made it work by broadcasting deltaRdd and e.
I have an RDD of the form
(3,CompactBuffer((-0.063763,0.060122,0.250393), (0.006971,-0.096478,0.123718), (-0.198281,-0.079444,-0.015460)))
I need to calculate the average of the vectors in the CompactBuffer with a reduce action, something like:
val averagevector = filteredvectors.reduce((a,b) => sum(b)/b.size)
My averagevector should be something like (3, (avg(1), avg(2), avg(3))) where avg(1) is the average of all the first elements in the CompactBuffer shown above.
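A minimal local sketch of the component-wise average follows. It assumes each vector has been converted to an Array[Double] of the same length; the name AverageSketch.averageVector is hypothetical. The reduce adds the vectors element by element, and the division by the count happens once at the end.

```scala
object AverageSketch {
  // Component-wise mean of a non-empty collection of equal-length vectors.
  def averageVector(vectors: Iterable[Array[Double]]): Array[Double] = {
    require(vectors.nonEmpty, "cannot average an empty collection")
    // Sum the vectors element by element.
    val sums = vectors.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    sums.map(_ / vectors.size)
  }
}
```

On the RDD this would probably be applied per key with mapValues, for example filteredvectors.mapValues(vs => AverageSketch.averageVector(vs.map(t => Array(t._1, t._2, t._3)))), assuming the buffer holds 3-tuples of Double as in the sample shown.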
Is there an efficient way to reduce a block matrix to a sum of all of its values? I'm looking to calculate the Euclidean distance between two block matrices (d2, as defined in the response here https://math.stackexchange.com/questions/507742/distance-similarity-between-two-matrices).
As a follow up, there doesn't appear to be a simple way to subtract two block matrices. Is there any way to multiply each by a constant?
Edit: Found a workaround for subtraction. V, W, and H are the three matrices. negOneBlock is a matrix the size of V that contains only negative ones.
V.add((W.multiply(H)).multiply(negOneBlock))
Applying a sum to each block and then reducing should be quite efficient.
import org.apache.spark.mllib.linalg.distributed._
def sum(mat: BlockMatrix) = mat.blocks.map(_._2.toArray.sum).sum
where _.blocks creates an RDD[((Int, Int), Matrix)], _._2 extracts the Matrix, and toArray.sum aggregates all values in the block. For data like:
val mat: BlockMatrix = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(0, 10, 1.0), MatrixEntry(10, 1024, 2.0),
  MatrixEntry(3000, 10, 3.0))
)).toBlockMatrix(128, 128)
sum(mat)
we get the expected result, which is 6.0.
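The same map-then-reduce pattern extends to the distance itself. The sketch below assumes d2 is the entrywise Euclidean (Frobenius) distance, that is, the square root of the sum of squared differences. It works on plain maps keyed by block coordinates rather than on a real BlockMatrix, and all names in it are hypothetical. A block missing on one side is treated as all zeros.

```scala
object BlockDistanceSketch {
  // Blocks are keyed by (blockRow, blockCol) and hold their values as a flat
  // array, mirroring the shape of BlockMatrix.blocks.
  type Blocks = Map[(Int, Int), Array[Double]]

  def squaredDistance(a: Blocks, b: Blocks): Double =
    // toSeq keeps equal per-block sums from being collapsed by Set semantics.
    (a.keySet ++ b.keySet).toSeq.map { key =>
      (a.get(key), b.get(key)) match {
        case (Some(x), Some(y)) => x.zip(y).map { case (p, q) => (p - q) * (p - q) }.sum
        case (Some(x), None)    => x.map(v => v * v).sum
        case (None, Some(y))    => y.map(v => v * v).sum
        case (None, None)       => 0.0
      }
    }.sum

  def distance(a: Blocks, b: Blocks): Double = math.sqrt(squaredDistance(a, b))
}
```

On real data the per-key pairing could be done with fullOuterJoin on the two blocks RDDs. Spark 2.0 and later also provide BlockMatrix.subtract, after which only the sum of squares over the blocks remains.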
I have a question about finding the index of the maximum value along each row of a matrix. How can I do this in Spark Scala? This function would be like argmax in numpy in Python.
What's the type of your matrix? If it's a RowMatrix, you can access the RDD of its row vectors using rows.
Then it's a simple matter of finding the maximum of each vector in this RDD[Vector], if I understand correctly: myMatrix.rows.map{_.toArray.max} gives the maximum value of each row, and myMatrix.rows.map{_.argmax} gives its index.
If you have a DenseMatrix, note that toArray returns its elements in column-major order, so grouping that array by numCols does not yield rows. Iterating over the rows with rowIter (available since Spark 2.0) avoids the layout question:
myMatrix.rowIter.map{_.toArray.max}
I think you will have to get the values as an array to get the maximum value.
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
val result = dm.toArray.max
println(result)
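Note that the snippet above returns the maximum over the whole matrix, not one index per row. Here is a minimal local sketch of a per-row argmax that works directly on the column-major array returned by toArray. The name ArgmaxSketch.argmaxPerRow is hypothetical, and the matrix dimensions are passed in explicitly.

```scala
object ArgmaxSketch {
  // `values` holds a numRows x numCols matrix in column-major order,
  // so element (i, j) lives at index j * numRows + i.
  def argmaxPerRow(values: Array[Double], numRows: Int, numCols: Int): Array[Int] = {
    require(values.length == numRows * numCols, "size mismatch")
    require(numCols > 0, "at least one column is needed")
    Array.tabulate(numRows) { i =>
      // Column index holding the largest value in row i.
      (0 until numCols).maxBy(j => values(j * numRows + i))
    }
  }
}
```

For the matrix dm above this would be called as ArgmaxSketch.argmaxPerRow(dm.toArray, dm.numRows, dm.numCols).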