I have an IndexedRowMatrix of doubles. I want to compute the sum of each row of the matrix and save the results to a Vector. After that I want to broadcast this vector.
I am creating an RDD of Doubles, which contains the sums, but I cannot turn it into a vector.
So, the question basically is how to create the Vector I want from the IndexedRowMatrix.
Collect to the driver and construct a vector:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val sc: SparkContext = ???
val rdd: RDD[Double] = ???

val vec: Vector = Vectors.dense(rdd.collect)
val broadcastVec = sc.broadcast(vec)
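If the sums still have to be computed from the IndexedRowMatrix itself, here is a minimal sketch, assuming `mat` is your IndexedRowMatrix and the row indices fit comfortably on the driver:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

val mat: IndexedRowMatrix = ???

// Sum each row's entries, keeping the row index so the driver can restore the order.
val sums = mat.rows
  .map(row => (row.index, row.vector.toArray.sum))
  .collect()
  .sortBy(_._1)

val rowSums: Vector = Vectors.dense(sums.map(_._2))
val broadcastRowSums = sc.broadcast(rowSums)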
References:
https://spark.apache.org/docs/2.1.0/mllib-data-types.html#local-vector
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
I have an RDD[Matrix[Double]] and want to convert it to an RDD[Vector] (each row of each Matrix becomes a Vector).
I've seen related answers like Convert Matrix to RowMatrix in Apache Spark using Scala, but those go from a single Matrix to an RDD of Vectors, while my case is an RDD of Matrices.
Use flatMap with a function that converts each Matrix to a Seq[Vector]:
// from https://stackoverflow.com/a/28172826/1206998
import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}

def toSeqOfVector(m: Matrix): Seq[Vector] = {
  val columns = m.toArray.grouped(m.numRows) // values are column-major, so group into columns
  val rows = columns.toSeq.transpose         // skip this if you want a column-major RDD
  rows.map(row => new DenseVector(row.toArray))
}
val matrices: RDD[Matrix] = ??? // your input
val vectors: RDD[Vector] = matrices.flatMap(toSeqOfVector)
Note: I didn't test this code, but this is the principle
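For example (a hypothetical check, not part of the original answer), a 2 x 3 dense matrix should yield one Vector per row:

import org.apache.spark.mllib.linalg.Matrices

val m = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)) // column-major storage
toSeqOfVector(m)
// Seq([1.0,3.0,5.0], [2.0,4.0,6.0])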
I am trying to calculate the distance between the row in a dataframe and a vector( org.apache.spark.ml.linalg.Vector).
I plan to do anomaly detection with the K-Means algorithm, so I have the cluster center, which is a Vector, and I want to calculate its distance to each row of a dataframe, but I got the error below:
Vectors.sqdist(v1,centerid)
<console>:54: error: type mismatch;
 found   : scala.collection.immutable.Vector[org.apache.spark.sql.Row]
 required: org.apache.spark.ml.linalg.Vector
How to convert the Vector[org.apache.spark.sql.Row] to org.apache.spark.ml.linalg.Vector?
You can use VectorAssembler to convert your Row to a features Vector. Try this:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???
val assembler = new VectorAssembler().setInputCols(Array("yourInputColumns")).setOutputCol("features")
assembler.transform(df)
As output you will get a DataFrame with an additional "features" column of type org.apache.spark.ml.linalg.Vector.
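With the features column in place, the squared distance the question asks about can be computed per row. A rough sketch, assuming `centerid` is an org.apache.spark.ml.linalg.Vector and `assembler` is the VectorAssembler defined above:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

val centerid: Vector = ???  // your K-Means cluster center
val sqDistToCenter = udf((v: Vector) => Vectors.sqdist(v, centerid))

val withDistance = assembler.transform(df)
  .withColumn("sqdist", sqDistToCenter(col("features")))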
I'm struggling to find a way to quickly convert a DenseMatrix to a SparseMatrix.
I tried flattening the DenseMatrix to an array, converting it to a sparse matrix and then reshaping it, but this is not possible since there is no reshape function:
val dm = DenseMatrix((1,2,3),(0,0,0),(0,0,0))
val sm = CSCMatrix(dm.toArray)
sm.reshape(3,3)
error: value reshape is not a member of breeze.linalg.CSCMatrix[Int]
How about something like this:
import breeze.linalg.{CSCMatrix, DenseMatrix}

val dm = DenseMatrix((1, 2, 3), (0, 0, 0), (0, 0, 0))
val sm = CSCMatrix.tabulate(dm.rows, dm.cols)(dm(_, _))
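If you want more control over which entries are stored, an alternative (untested sketch along the same lines) is to fill a CSCMatrix.Builder with only the non-zero entries of the dense matrix:

val builder = new CSCMatrix.Builder[Int](rows = dm.rows, cols = dm.cols)
for (r <- 0 until dm.rows; c <- 0 until dm.cols if dm(r, c) != 0)
  builder.add(r, c, dm(r, c))
val sparse: CSCMatrix[Int] = builder.result()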
I started using Breeze a few weeks ago and I am not able to do something that seems simple. I want to transform a Transpose into a DenseMatrix, for example:
val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0)) // DenseMatrix
val meanCols = mean(matrix(::, *)) // Transpose
val meanColsDM = meanCols.toDenseMatrix // Error: value toDenseMatrix is not a member of breeze.linalg.Transpose
I thought about looping over the Transpose to build an array and then creating the DenseMatrix from it (1 row, 2 cols using the matrix from the example), but I wonder if there is a simpler way to obtain the same thing.
I need this so I can then concatenate the mean of the columns with other matrices; I did not put that code in the example as it is not the source of the problem.
meanCols is a Transpose[DenseVector[Double]], which is just a wrapper for a DenseVector[Double]. If you want the result in a matrix with one row and two columns, you can transpose it again with .t to get a DenseVector[Double] and then convert that to a matrix with .toDenseMatrix:
scala> import breeze.linalg._, breeze.stats.mean
import breeze.linalg._
import breeze.stats.mean
scala> val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0))
matrix: breeze.linalg.DenseMatrix[Double] =
1.0 3.5
3.0 2.0
scala> val meanCols = mean(matrix(::, *))
meanCols: breeze.linalg.Transpose[breeze.linalg.DenseVector[Double]] = ...
scala> val meanColsDM = meanCols.t.toDenseMatrix
meanColsDM: breeze.linalg.DenseMatrix[Double] = 2.0 2.75
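For the concatenation mentioned in the question, a hedged follow-up: once the means are in a 1 x 2 DenseMatrix, they can be stacked onto any matrix with the same number of columns using vertcat:

val stacked = DenseMatrix.vertcat(meanColsDM, matrix)
// 3 x 2 matrix whose first row holds the column means (2.0, 2.75)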
I would like to run a Spearman correlation on data that is currently in a Spark DataFrame. Currently, only the Pearson correlation calculation is available to operate on columns in a DataFrame. It appears that I can do a Spearman correlation using Spark's MLlib, but I need to pass two RDD[Double] to the function. The columns I want to compare are Double according to the current schema.
Is there a way to select the columns I want and turn them into RDD[Double] so that I can use the MLlib correlation function to get the Spearman correlation coefficient?
You can simply select columns of interest, extract values and compute statistics:
import sqlContext.implicits._
import org.apache.spark.mllib.stat.Statistics

// Generate some random data
val rng = new scala.util.Random(1)
val df = sc.parallelize(Seq.fill(1000)((rng.nextGaussian, rng.nextGaussian))).toDF("x", "y")

// Select columns and extract values
val rddX = df.select($"x").rdd.map(_.getDouble(0))
val rddY = df.select($"y").rdd.map(_.getDouble(0))

val correlation: Double = Statistics.corr(rddX, rddY, "spearman")
You should be able to do something like this:
val firstRDD: RDD[Double] = yourDF.select("field1").rdd.map(row => row.getDouble(0))
val secondRDD: RDD[Double] = yourDF.select("field2").rdd.map(row => row.getDouble(0))
val corr = Statistics.corr(firstRDD, secondRDD, "spearman")
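If you are on Spark 2.2 or later, a hedged alternative is to stay in the DataFrame API: assemble the two columns into a vector column and use org.apache.spark.ml.stat.Correlation, which also supports Spearman (sketch only, column names taken from the snippet above):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val assembled = new VectorAssembler()
  .setInputCols(Array("field1", "field2"))
  .setOutputCol("features")
  .transform(yourDF)

val Row(corrMatrix: org.apache.spark.ml.linalg.Matrix) =
  Correlation.corr(assembled, "features", "spearman").head
// corrMatrix(0, 1) is the Spearman coefficient between field1 and field2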