I have a BlockMatrix and I would like to divide this matrix by a number (e.g. by 2). However, the pyspark.mllib matrix library does not offer any function for dividing by a number, only for dot products and for addition/subtraction. How can I manage to divide each entry in the BlockMatrix by a number?
Have you tried map? BlockMatrix itself does not expose an element-wise map, but you can map over its underlying blocks RDD and rebuild the matrix.
It works like this:
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

matrix = BlockMatrix(...)

def divide_block_by_2(block):
    (i, j), mat = block  # block = ((blockRowIndex, blockColIndex), sub-matrix)
    arr = mat.toArray() / 2.0
    return ((i, j), Matrices.dense(mat.numRows, mat.numCols, arr.ravel(order='F')))

matrix = BlockMatrix(matrix.blocks.map(divide_block_by_2),
                     matrix.rowsPerBlock, matrix.colsPerBlock)
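If the matrix is small enough, you can sanity-check the result locally with matrix.toLocalMatrix().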
I'm wondering what would be a good way of multiplying each column of a RowMatrix by an integer (each row being multiplied by a different integer).
I know I could, for example, create a diagonal mllib Matrix containing the values a1 ... an (ai being the coefficient I want to multiply the ith column of the RowMatrix by), and then just use mllib's matrix multiplication (multiplying a RowMatrix by a Matrix, which yields a RowMatrix as the result). However, this is probably not efficient and does not show how to operate on a RowMatrix directly.
I'm new to writing functions on RowMatrices; I tried looking at some of the already existing ones and was a bit confused.
Thanks for your help
It's unclear whether you want to multiply each row or each column by a different integer. Your title and second paragraph say each column, but your first sentence says each row. Regardless, these sorts of operations are probably most easily implemented by calling .rows and operating on the underlying RDD[Vector]. For instance:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def multiplyColumns(m: RowMatrix, xs: Array[Double]): RowMatrix = {
  // scale the k-th entry of every row by xs(k)
  val newRowsRdd = m.rows.map {
    row => Vectors.dense(row.toArray.zip(xs).map { case (a, b) => a * b })
  }
  new RowMatrix(newRowsRdd)
}
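A quick usage sketch (the input RowMatrix and the scaling factors here are placeholders, not taken from your code):
// scale column 0 by 2.0 and column 1 by 3.0
val scaled = multiplyColumns(someRowMatrix, Array(2.0, 3.0))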
For an input DataFrame, the intent is to generate only half of the self-cartesian product. Given that the cartesian product results in a symmetric matrix, we only really need to calculate either the upper or the lower triangular portion above (resp. below) the diagonal, which is set to zeros.
The DataFrame crossJoin:
val df3 = df2.crossJoin(df2)
will generate the FULL cartesian product, which we do not want.
Given the similarity matrix is symmetric with 1's along the diagonal, we do not need to calculate the upper half or the diagonal itself; only the LOWER triangle below the diagonal of 0's is needed.
Any suggestions on how to obtain the result with the least computation?
The following is not a perfect answer: it still generates the full cartesian product first. But at least the output results are correct.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

/** Generate schema for the cartesian product of an input dataframe */
def joinSchema(df: DataFrame) =
  StructType(df.schema.fields.map {
    f => StructField(s"${f.name}_a", f.dataType, f.nullable)
  } ++ df.schema.fields.map { f => StructField(s"${f.name}_b", f.dataType, f.nullable) }
  )

// Create the cartesian product via crossJoin and re-apply the suffixed schema
val schema = joinSchema(dfIn)
val df3 = dfIn.crossJoin(dfIn)
val cartesianDf = spark.createDataFrame(df3.rdd, schema)
cartesianDf.createOrReplaceTempView("cartesian")

// Retain only the lower triangular entries below the diagonal
val halfDf = spark.sql("select * from cartesian where id_a < id_b")
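A possible way to avoid the post-hoc filter, assuming dfIn has the id column used above, is to push the id_a < id_b predicate into a self-join condition (a sketch, not tested against your data):

import org.apache.spark.sql.functions.col

// non-equi self-join: only pairs with a.id < b.id are emitted
val half = dfIn.as("a").join(dfIn.as("b"), col("a.id") < col("b.id"))

Note that Spark may still plan this as a nested-loop/cartesian join internally, so the main saving is in the size of the output rather than in the number of comparisons.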
I'm working on an algorithm that requires math operations on large matrices. Basically, the algorithm involves the following steps:
Inputs: two vectors u and v of size n
For each vector, compute the pairwise Euclidean distances between elements of the vector. Return two matrices E_u and E_v
For each entry in the two matrices, apply a function f. Return two matrices M_u, M_v
Find the eigenvalues and eigenvectors of M_u. Return e_i, ev_i for i = 0,...,n-1
Compute the outer product of each eigenvector. Return a matrix O_i = ev_i*transpose(ev_i), i = 0,...,n-1
Adjust each eigenvalue with e_i = e_i + delta_i, where delta_i = (sum of all elements of the elementwise product of O_i and M_v)/(2*mu), where mu is a parameter
Finally, return a matrix A = elementwise sum of (e_i * O_i) over i = 0,...,n-1
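In other words, the final matrix is A = sum over i of (e_i + delta_i) * O_i, with O_i = ev_i*transpose(ev_i) and delta_i = sum(O_i elementwise-times M_v)/(2*mu).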
The issue I'm facing is mainly memory when n is large (15000 or more), since all matrices here are dense. My current way of implementing this may not be the best, and it only partially worked.
I used a RowMatrix for M_u and got its eigendecomposition via SVD.
The resulting U factor of the SVD is a row matrix whose columns are the ev_i's, so I have to manually transpose it so that its rows become the ev_i's. The resulting vector e holds the eigenvalues e_i.
Since a previous attempt at directly mapping each row ev_i to O_i failed with an out-of-memory error, I'm currently doing:
val R = U.map {
    case (i, ev_i) => (i, ev_i.toArray.zipWithIndex) // add an index for each element of the vector
  }
  .flatMapValues(x => x)
  .join(U) // the eigenvector column is appended
  .map { case (eigenVecId, ((vecElement, elementId), eigenVec)) =>
    (elementId, (eigenVecId, vecElement * eigenVec))
  }
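So R ends up keyed by elementId j, with values (i, row j of O_i), since row j of O_i is just ev_i(j) * ev_i.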
To compute the adjusted e_i's of step 5 above, M_v is stored as an RDD of tuples (i, denseVector). Then:
val deltaRdd = R.join(M_v)
  .map {
    case (j, ((i, row_j_of_O_i), row_j_of_M_v)) =>
      (i, row_j_of_O_i.t * DenseVector(row_j_of_M_v.toArray) / (2 * mu))
  }
  .reduceByKey(_ + _)
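Each term here is the dot product of row j of O_i with row j of M_v, so after reduceByKey the value for key i is delta_i = sum(O_i elementwise-times M_v)/(2*mu), matching step 5.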
Finally, to compute A, again due to memory issues, I have to first join rows from the different RDDs and then reduce by key. Specifically:
val R_rearranged = R.map { case (j, (i, row_j_of_O_i)) => (i, (j, row_j_of_O_i)) }
val termsForA = R_rearranged.join(deltaRdd)
val A = termsForA.map {
    case (i, ((j, row_j_of_O_i), delta_i)) => (j, (delta_i + e(i)) * row_j_of_O_i)
  }
  .reduceByKey(_ + _)
The above implementation works up to the step of termsForA, meaning that if I execute an action on termsForA, like termsForA.take(1).foreach(println), it succeeds. But if I execute an action on A, like A.count(), an OOM error occurs on the driver.
I tried tuning Spark's configuration to increase driver memory as well as the parallelism level, but all attempts failed.
Use IndexedRowMatrix instead of RowMatrix; it will help with the conversions and the transpose.
Suppose your IndexedRowMatrix is Irm
svd = Irm.computeSVD(k, True)
U = svd.U
U = U.toCoordinateMatrix().transpose().toIndexedRowMatrix()
You can convert Irm to BlockMatrix for multiplication with another distributed BlockMatrix.
I guess at some point Spark decided there was no need to carry out operations on the executors and did all the work on the driver. Actually, termsForA would fail as well in an action like count. Somehow I made it work by broadcasting deltaRdd and e.
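Roughly, that broadcast variant looks like this (a sketch only; the exact names and types are assumptions, with deltaRdd holding (i, delta_i) pairs and e being the local vector of eigenvalues from the SVD):

// deltaRdd has only one entry per eigenvector, so it is small enough to collect
val deltaMap = sc.broadcast(deltaRdd.collectAsMap()) // Map[Int, Double]
val eLocal = sc.broadcast(e.toArray)                 // eigenvalues as a local array

// build A directly from R, without another join
val A = R.map { case (j, (i, row_j_of_O_i)) =>
    (j, row_j_of_O_i * (eLocal.value(i) + deltaMap.value(i)))
  }
  .reduceByKey(_ + _)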
Given a d x n matrix (d-dimensional, n objects), I would like to compute the unit-length vector of each column (i.e. the resultant matrix should have unit length in every column).
How can I do it without looping over every column?
I'm assuming you're using the L2 norm. In that case,
normalizedVector = bsxfun(@rdivide, vector, sqrt(sum(vector.^2, 1)));
will have unit length along each column.
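sum(vector.^2, 1) produces a 1-by-n row vector of squared column norms, and bsxfun expands it down the rows so every element gets divided by its own column's norm.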