Convert RDD of Matrix to RDD of Vector - scala

I have an RDD[Matrix[Double]] and want to convert it to RDD[Vector] (each row in the Matrix will be converted to a Vector).
I've seen related answers such as "Convert Matrix to RowMatrix in Apache Spark using Scala", but that converts a single Matrix to an RDD of Vector, whereas my case is an RDD of Matrix.

Use flatMap with a function that converts each Matrix to a Seq[Vector]:
import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
import org.apache.spark.rdd.RDD
// from https://stackoverflow.com/a/28172826/1206998
def toSeqOfVector(m: Matrix): Seq[Vector] = {
  // toArray is column-major, so each group of numRows values is one column
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  rows.map(row => new DenseVector(row.toArray))
}
val matrices: RDD[Matrix] = ??? // your input
val vectors: RDD[Vector] = matrices.flatMap(toSeqOfVector)
Note: I didn't test this code, but this is the principle.
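To see why grouping by numRows and then transposing yields rows, here is a minimal plain-Scala sketch with no Spark dependency. The object name ColumnMajorRows and the helper rowsOf are hypothetical, introduced only for illustration; the input array mimics the column-major layout that Matrix.toArray returns.

```scala
object ColumnMajorRows {
  // Split a column-major value array into rows.
  // values: matrix entries stored column by column; numRows: number of rows in the matrix.
  def rowsOf(values: Array[Double], numRows: Int): Seq[Array[Double]] = {
    // Each group of numRows consecutive values is one column.
    val columns: Seq[Seq[Double]] = values.grouped(numRows).map(_.toSeq).toSeq
    // Transposing the list of columns gives the list of rows.
    columns.transpose.map(_.toArray)
  }

  def main(args: Array[String]): Unit = {
    // The 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] in column-major order.
    val values = Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0)
    rowsOf(values, numRows = 2).foreach(r => println(r.mkString("[", ", ", "]")))
  }
}
```

In the Spark version, each of these row arrays would be wrapped in a DenseVector before flatMap flattens the per-matrix sequences into a single RDD.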

Related

Similarity matrix using a spark dataframe

For an input DataFrame, the intent is to generate only half of the self-cartesian product. Given that the cartesian product results in a symmetric matrix, we only really need to calculate either the upper or the lower triangular portion above (resp. below) the diagonal that is set to zeros.
The DataFrame crossJoin:
val df3 = df2.crossJoin(df2)
will generate the full matrix, which we do not want.
Given that the similarity matrix is symmetric with 1's along the diagonal, we do not need to calculate the upper half or the diagonal itself; only the entries strictly below the diagonal are needed.
Any suggestions on how to obtain the result with the least computation?
The following is not a perfect answer: it still generates the full cartesian product first. But at least the output is correct.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}
/** Generate schema for cartesian product of an input dataframe */
def joinSchema(df: DataFrame): StructType =
  StructType(
    df.schema.fields.map { f => StructField(s"${f.name}_a", f.dataType, f.nullable) } ++
    df.schema.fields.map { f => StructField(s"${f.name}_b", f.dataType, f.nullable) }
  )
// Create the cartesian product via crossJoin
val schema = joinSchema(dfIn)
val df3 = dfIn.crossJoin(dfIn)
val cartesianDf = spark.createDataFrame(df3.rdd, schema)
cartesianDf.createOrReplaceTempView("cartesian")
// Retain only the entries on one side of the diagonal
val triangularDf = spark.sql("select * from cartesian where id_a < id_b")
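An alternative worth trying is to put the ordering condition directly into the join, so the full product is never registered as a table. The sketch below assumes the input has a column named id (as the query above implies); the object name HalfCartesian and the helper halfPairs are hypothetical. Spark still has to compare pairs for a non-equi join like this, so treat it as a way to simplify the code rather than a guaranteed reduction in work.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object HalfCartesian {
  // Self-join that keeps each unordered pair of rows once and drops the diagonal.
  // Assumes df has an orderable column named "id".
  def halfPairs(df: DataFrame): DataFrame = {
    // Rename every column with an _a / _b suffix, mirroring the schema used above.
    val left = df.columns.map(c => col(s"a.$c").as(s"${c}_a"))
    val right = df.columns.map(c => col(s"b.$c").as(s"${c}_b"))
    df.as("a")
      .join(df.as("b"), col("a.id") < col("b.id"))
      .select(left ++ right: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("half-cartesian").getOrCreate()
    import spark.implicits._

    // Hypothetical input data: an id plus one value column.
    val dfIn = Seq((1, 0.5), (2, 1.5), (3, 2.5)).toDF("id", "value")
    halfPairs(dfIn).show()
    spark.stop()
  }
}
```

For n input rows this keeps n(n-1)/2 pairs instead of n^2.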

How to convert RDD[Double] to Vector in Scala Spark

I have an IndexedRowMatrix of doubles. I want to compute the sum of each row of the matrix and save the results to a Vector. After that I want to broadcast this vector.
I am creating an RDD of Doubles, which contains the sums, but I cannot turn it into a vector.
So, the question basically is how to create the Vector I want from the IndexedRowMatrix.
Collect to the driver and construct a vector:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
val sc: SparkContext = ???
val rdd: RDD[Double] = ???
val vec: Vector = Vectors.dense(rdd.collect)
val broadcastVec = sc.broadcast(vec)
References:
https://spark.apache.org/docs/2.1.0/mllib-data-types.html#local-vector
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
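For the IndexedRowMatrix case from the question, a minimal sketch of the whole pipeline might look like the following. The object name RowSums and the helper broadcastRowSums are hypothetical. It sorts by row index before collecting, since the rows of an IndexedRowMatrix are not guaranteed to arrive in index order, and it assumes the indices are contiguous from 0 so that position i in the vector corresponds to row i.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

object RowSums {
  // Sum each row, order the sums by row index, and broadcast them as a local Vector.
  def broadcastRowSums(sc: SparkContext, mat: IndexedRowMatrix): Broadcast[Vector] = {
    val sums: Array[Double] = mat.rows
      .map(row => (row.index, row.vector.toArray.sum)) // (row index, row sum)
      .sortByKey()                                     // order by row index
      .values
      .collect()                                       // bring the sums to the driver
    sc.broadcast(Vectors.dense(sums))
  }
}
```

Collecting is only reasonable when the number of rows fits comfortably in driver memory.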

Scala Breeze DenseMatrix to SparseMatrix conversion

I'm struggling to find a way to quickly convert a DenseMatrix to a SparseMatrix.
I tried flattening the DenseMatrix to an array, converting it to a sparse matrix, and then reshaping it, but this is not possible since there is no reshape function.
val dm = DenseMatrix((1,2,3),(0,0,0),(0,0,0))
val sm = CSCMatrix(dm.toArray)
sm.reshape(3,3)
error: value reshape is not a member of breeze.linalg.CSCMatrix[Int]
How about something like this:
val dm = DenseMatrix((1,2,3),(0,0,0),(0,0,0))
val sm = CSCMatrix.tabulate(dm.rows, dm.cols)(dm(_, _))
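If you want explicit control over which entries get stored, another option is Breeze's CSCMatrix.Builder. The sketch below is an assumption-laden illustration rather than a benchmarked recommendation: the object name DenseToSparse and the helper toCSC are hypothetical, and it handles only DenseMatrix[Int], adding just the non-zero entries.

```scala
import breeze.linalg.{CSCMatrix, DenseMatrix}

object DenseToSparse {
  // Copy only the non-zero entries of a dense Int matrix into a CSC matrix.
  def toCSC(dm: DenseMatrix[Int]): CSCMatrix[Int] = {
    val builder = new CSCMatrix.Builder[Int](dm.rows, dm.cols)
    for {
      c <- 0 until dm.cols   // walk column by column, matching CSC layout
      r <- 0 until dm.rows
      if dm(r, c) != 0       // skip zeros so they are not stored
    } builder.add(r, c, dm(r, c))
    builder.result()
  }

  def main(args: Array[String]): Unit = {
    val dm = DenseMatrix((1, 2, 3), (0, 0, 0), (0, 0, 0))
    val sm = toCSC(dm)
    println(sm)
  }
}
```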

Calculate Spearman Correlation on a Spark DataFrame

I would like to run a Spearman correlation on data that is currently in a Spark DataFrame. Currently, only the Pearson correlation calculation is available to operate on columns in a DataFrame. It appears that I can do a Spearman correlation using Spark's MLlib, but I need to pass two RDD[Double] to the function. The columns I want to compare are Double according to the current schema.
Is there a way to select the columns I want and turn them into arrays of Doubles so that I can use the MLlib correlation function to get the Spearman correlation coefficient?
You can simply select columns of interest, extract values and compute statistics:
import sqlContext.implicits._
import org.apache.spark.mllib.stat.Statistics
// Generate some random data
val rnd = new scala.util.Random(1)
val df = sc.parallelize(Seq.fill(1000)((rnd.nextGaussian(), rnd.nextGaussian()))).toDF("x", "y")
// Select columns and extract values
val rddX = df.select($"x").rdd.map(_.getDouble(0))
val rddY = df.select($"y").rdd.map(_.getDouble(0))
val correlation: Double = Statistics.corr(rddX, rddY, "spearman")
You should be able to do something like this:
val firstRDD: RDD[Double] = yourDF.select("field1").rdd.map(row => row.getDouble(0))
val secondRDD: RDD[Double] = yourDF.select("field2").rdd.map(row => row.getDouble(0))
val corr = Statistics.corr(firstRDD, secondRDD, "spearman")

spark conversion from RowMatrix to CoordinateMatrix in scala

I am trying to perform some operations in Distributed matrix space in Scala Spark, and I am wondering if there is any straightforward way to convert a distributed RowMatrix to a distributed CoordinateMatrix?
For example, if
rdd_input: Array[org.apache.spark.mllib.linalg.Vector] = Array([10], [0], [0], [40])
The conversion of the RowMatrix to a 2 x 2 dense matrix was accomplished by:
rdd_input.zipWithIndex.groupBy{case (x, i) => i / 2}.map(_._2.map(_._1))
If the Matrix is sparse, how can I convert this matrix to a CoordinateMatrix?
Any help is appreciated.