Converting a [(Int, Seq[Double])] RDD to LabeledPoint - scala

I have an RDD of the following format and would like to convert it into a LabeledPoint RDD in order to process it in MLlib:
Test: RDD[(Int, Seq[Double])] = Array((1,List(1.0,3.0,8.0)),(2,List(3.0,3.0,8.0)),(1,List(2.0,3.0,7.0)),(1,List(5.0,5.0,9.0)))
I tried with map:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
Test.map(x=> LabeledPoint(x._1, Vectors.sparse(x._2)))
but I get this error
mllib.linalg.Vector cannot be applied to (Seq[scala.Double])
So presumably the Seq element needs to be converted first but I don't know into what.

There are a few problems here:
the label should be a Double, not an Int
SparseVector requires the number of elements, indices, and values (see the sketch below)
none of the vector constructors accepts a List of Double directly
your data looks dense, not sparse
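For reference, a minimal sketch of what Vectors.sparse expects, in case the data actually were sparse (the size, indices and values here are made up for illustration):
import org.apache.spark.mllib.linalg.Vectors
// size 8, non-zero entries at indices 0, 2 and 7
val sv = Vectors.sparse(8, Array(0, 2, 7), Array(1.0, 3.0, 8.0))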
One possible solution:
val rdd = sc.parallelize(Array(
  (1, List(1.0, 3.0, 8.0)),
  (2, List(3.0, 3.0, 8.0)),
  (1, List(2.0, 3.0, 7.0)),
  (1, List(5.0, 5.0, 9.0))))

rdd.map { case (k, vs) =>
  LabeledPoint(k.toDouble, Vectors.dense(vs.toArray))
}
and another:
rdd.collect { case (k, v :: vs) =>
  LabeledPoint(k.toDouble, Vectors.dense(v, vs: _*))
}

As you can see in LabeledPoint's documentation, its constructor takes a Double as the label and a Vector as the features (DenseVector or SparseVector). If you look at the constructors of those two implementing classes, they take an Array, so you need to convert your Seq to an Array.
import org.apache.spark.mllib.linalg.{Vector, Vectors, DenseVector}
import org.apache.spark.mllib.regression.LabeledPoint
val rdd = sc.parallelize(Array(
  (1, Seq(1.0, 3.0, 8.0)),
  (2, Seq(3.0, 3.0, 8.0)),
  (1, Seq(2.0, 3.0, 7.0)),
  (1, Seq(5.0, 5.0, 9.0))))

val x = rdd.map {
  case (a: Int, b: Seq[Double]) => LabeledPoint(a, new DenseVector(b.toArray))
}
x.take(2).foreach(println)
//(1.0,[1.0,3.0,8.0])
//(2.0,[3.0,3.0,8.0])

Related

Spark merge/combine arrays in groupBy/aggregate

The following Spark code correctly demonstrates what I want to do and generates the correct output with a tiny demo data set.
When I run this same general type of code on a large volume of production data, I am having runtime problems. The Spark job runs on my cluster for ~12 hours and fails out.
Just glancing at the code below, it seems inefficient to explode every row only to merge it back down. In the given test data set, the fourth row, with three values in array_value_1 and three values in array_value_2, explodes to 3*3 = nine rows.
So, in a larger data set, a row with five such array columns, and ten values in each column, would explode out to 10^5 exploded rows?
Looking at the provided Spark functions, there are no out-of-the-box functions that do what I want. I could supply a user-defined function. Are there any speed drawbacks to that?
import scala.collection.JavaConverters._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, collect_set, explode}
import org.apache.spark.sql.types._

val sparkSession = SparkSession.builder
  .master("local")
  .appName("merge list test")
  .getOrCreate()

val schema = StructType(
  StructField("category", IntegerType) ::
  StructField("array_value_1", ArrayType(StringType)) ::
  StructField("array_value_2", ArrayType(StringType)) ::
  Nil)

val rows = List(
  Row(1, List("a", "b"), List("u", "v")),
  Row(1, List("b", "c"), List("v", "w")),
  Row(2, List("c", "d"), List("w")),
  Row(2, List("c", "d", "e"), List("x", "y", "z"))
)

val df = sparkSession.createDataFrame(rows.asJava, schema)

val dfExploded = df
  .withColumn("scalar_1", explode(col("array_value_1")))
  .withColumn("scalar_2", explode(col("array_value_2")))

// This will output 19: 2*2 + 2*2 + 2*1 + 3*3 = 19
logger.info(s"dfExploded.count()=${dfExploded.count()}")

val dfOutput = dfExploded.groupBy("category").agg(
  collect_set("scalar_1").alias("combined_values_1"),
  collect_set("scalar_2").alias("combined_values_2"))

dfOutput.show()
Exploding may be inefficient, but fundamentally the operation you are trying to implement is simply expensive. Effectively it is just another groupByKey and there is not much you can do here to make it better. Since you use Spark > 2.0 you can collect_list directly and flatten:
import org.apache.spark.sql.functions.{collect_list, udf}

val flatten_distinct = udf(
  (xs: Seq[Seq[String]]) => xs.flatten.distinct)

df
  .groupBy("category")
  .agg(
    flatten_distinct(collect_list("array_value_1")),
    flatten_distinct(collect_list("array_value_2"))
  )
In Spark >= 2.4 you can replace udf with composition of built-in functions:
import org.apache.spark.sql.functions.{array_distinct, flatten}
val flatten_distinct = (array_distinct _) compose (flatten _)
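This composed function is applied the same way as the udf above; equivalently, the built-ins can be written inline (a quick sketch over the same df):
import org.apache.spark.sql.functions.{array_distinct, collect_list, flatten}

df
  .groupBy("category")
  .agg(
    array_distinct(flatten(collect_list("array_value_1"))).alias("combined_values_1"),
    array_distinct(flatten(collect_list("array_value_2"))).alias("combined_values_2")
  )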
It is also possible to use a custom Aggregator, but I doubt any of these will make a huge difference.
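For completeness, a rough sketch of what such a typed Aggregator could look like (the object name is made up, and the encoders shown are just one possible choice):
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// Merges the string arrays of a group into one distinct sequence.
object DistinctConcat extends Aggregator[Seq[String], Set[String], Seq[String]] {
  def zero: Set[String] = Set.empty[String]
  def reduce(acc: Set[String], xs: Seq[String]): Set[String] = acc ++ xs
  def merge(a: Set[String], b: Set[String]): Set[String] = a ++ b
  def finish(acc: Set[String]): Seq[String] = acc.toSeq
  def bufferEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]
  def outputEncoder: Encoder[Seq[String]] = ExpressionEncoder[Seq[String]]()
}

// In Spark 3.0+ it could be wrapped as an untyped aggregate function, e.g.:
// df.groupBy("category").agg(udaf(DistinctConcat)(col("array_value_1")))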
If the sets are relatively large and you expect a significant number of duplicates, you could try to use aggregateByKey with mutable sets instead:
import scala.collection.mutable.{Set => MSet}
import org.apache.spark.sql.functions.struct
import sparkSession.implicits._

val rdd = df
  .select($"category", struct($"array_value_1", $"array_value_2"))
  .as[(Int, (Seq[String], Seq[String]))]
  .rdd

val agg = rdd
  .aggregateByKey((MSet[String](), MSet[String]()))(
    { case ((accX, accY), (xs, ys)) => (accX ++= xs, accY ++= ys) },
    { case ((accX1, accY1), (accX2, accY2)) => (accX1 ++= accX2, accY1 ++= accY2) }
  )
  .mapValues { case (xs, ys) => (xs.toArray, ys.toArray) }
  .toDF

Create a SparseVector from the elements of RDD

Using Spark in Scala, I have a data structure val rdd of type RDD[((x: Int, y: Int), cov: Double)], where each element of the RDD represents an element of a matrix, with x representing the row, y representing the column, and cov representing the value of that element.
I need to create SparseVectors from the rows of this matrix. So I decided to first convert the rdd to RDD[(x: Int, (y: Int, cov: Double))] and then use groupByKey to put all the elements of a specific row together, like this:
val rdd2 = rdd.map{case ((x,y),cov) => (x, (y, cov))}.groupByKey()
Now I need to create the SparseVectors:
val N = 7 //Vector Size
val spvec = {(x: Int,y: Iterable[(Int, Double)]) => new SparseVector(N.toLong, Array(y.map(el => el._1.toInt)), Array(y.map(el => el._2.toDouble)))}
val vecs = rdd2.map(spvec)
However, this is the error that pops up.
type mismatch; found :Iterable[Int] required:Int
type mismatch; found :Iterable[Double] required:Double
I am guessing that y.map(el => el._1.toInt) returns an Iterable, which Array(...) cannot be applied to. I would appreciate it if someone could help with how to do this.
The simplest solution is to convert to RowMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val rdd: RDD[((Int, Int), Double)] = ???

val vs: RDD[org.apache.spark.mllib.linalg.SparseVector] = new CoordinateMatrix(
  rdd.map { case ((x, y), cov) => MatrixEntry(x, y, cov) }
).toRowMatrix.rows.map(_.toSparse)
If you want to preserve row indices you can use toIndexedRowMatrix instead:
import org.apache.spark.mllib.linalg.distributed.IndexedRow

new CoordinateMatrix(
  rdd.map { case ((x, y), cov) => MatrixEntry(x, y, cov) }
).toIndexedRowMatrix.rows.map { case IndexedRow(i, vs) => (i, vs.toSparse) }
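For comparison, the original groupByKey approach can also be made to work directly; a small sketch (the vector size N must be known, and the indices have to be sorted before building the SparseVector):
import org.apache.spark.mllib.linalg.SparseVector

val N = 7 // vector size
val vecs = rdd
  .map { case ((x, y), cov) => (x, (y, cov)) }
  .groupByKey()
  .mapValues { entries =>
    val sorted = entries.toSeq.sortBy(_._1)
    new SparseVector(N, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }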

Spark Scala: Vector Dataframe to RDD of values

I have a spark dataframe that has a vector in it:
org.apache.spark.sql.DataFrame = [sF: vector]
and I'm trying to convert it to a RDD of values:
org.apache.spark.rdd.RDD[(Double, Double)]
However, I haven't been able to convert it properly. I've tried:
val m2 = m1.select($"sF").rdd.map{case Row(v1, v2) => (v1.toString.toDouble, v2.toString.toDouble)}
and it compiles, but I get a runtime error:
scala.MatchError: [[-0.1111111111111111,-0.2222222222222222]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
when I do:
m2.take(10).foreach(println)
Is there something I'm doing wrong?
Assuming you want the first two values of the vectors present in the sF column, maybe this will work:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

val m2 = m1
  .select($"sF")
  .map { case Row(v: Vector) => (v(0), v(1)) }
You are getting an error because case Row(v1, v2) does not match the contents of the rows in your DataFrame: you are expecting two values in each row (v1 and v2), but each row contains only one value, a Vector.
Note: you don't need to call .rdd if you are going to do a .map operation.
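That said, on Spark 2.x, where a DataFrame is a Dataset[Row] and .map needs an encoder, going through .rdd is an easy way to get the RDD[(Double, Double)] directly. A sketch, assuming sF holds mllib vectors (on newer ML pipelines the class may be org.apache.spark.ml.linalg.Vector instead):
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val m2: RDD[(Double, Double)] = m1
  .select($"sF")
  .rdd
  .map { case Row(v: Vector) => (v(0), v(1)) }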

How to convert a map to Spark's RDD

I have a data set which is in the form of some nested maps, and its Scala type is:
Map[String, (LabelType,Map[Int, Double])]
The first String key is a unique identifier for each sample, and the value is a tuple that contains the label (which is -1 or 1), and a nested map which is the sparse representation of the non-zero elements which are associated with the sample.
I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms.
It's easy to write this data into a file with LibSVM's sparse encoding, and then load it in Spark:
writeMapToLibSVMFile(data_map, "libsvm_data.txt") // Implemented somewhere else
val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[4]")
val sc = new SparkContext(conf)
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "libsvm_data.txt")
// Split the data into training and test sets
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
I know it should be as easy to directly load the data variable from data_map, but I don't know how.
Any help is appreciated!
I guess you want something like this
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// If you know this upfront, otherwise it can be computed
// using flatMap:
// trainMap.values.flatMap(_._2.keys).max + 1
val nFeatures: Int = ???

val trainMap = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
  "x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))

val trainRdd: RDD[(String, LabeledPoint)] = sc
  // Convert Map to Seq so it can be passed to parallelize
  .parallelize(trainMap.toSeq)
  .map { case (id, (labelInt, values)) =>
    // Convert nested map to Seq so it can be passed to Vectors.sparse
    val features = Vectors.sparse(nFeatures, values.toSeq)
    // Convert label to Double so it can be used for LabeledPoint
    val label = labelInt.toDouble
    (id, LabeledPoint(label, features))
  }
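If you only need the LabeledPoints for training (not the identifiers), you can then drop the keys, e.g.:
val trainingData: RDD[LabeledPoint] = trainRdd.values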
It can be done in two ways:
1) sc.textFile("libsvm_data.txt").map(s => createObject())
2) Convert the map into a collection of objects and use sc.parallelize()
The first one is preferable.
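The first way needs a parser in place of the createObject() placeholder; a rough sketch of what that could look like (hand-rolled purely for illustration, since MLUtils.loadLibSVMFile from the question already does exactly this):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val nFeatures = 10 // assumed; must be at least the largest feature index
val data = sc.textFile("libsvm_data.txt").map { line =>
  val parts = line.trim.split("\\s+")
  val label = parts.head.toDouble
  val (indices, values) = parts.tail.map { kv =>
    val Array(i, v) = kv.split(":")
    (i.toInt - 1, v.toDouble) // libsvm feature indices are 1-based
  }.sortBy(_._1).unzip // SparseVector expects increasing indices
  LabeledPoint(label, Vectors.sparse(nFeatures, indices, values))
}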

Spark: Summary statistics

I am trying to use Spark summary statistics as described at: https://spark.apache.org/docs/1.1.0/mllib-statistics.html
According to the Spark docs:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.DenseVector
val observations: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
I have a problem building the observations: RDD[Vector] object. I tried:
scala> val data:Array[Double] = Array(1, 2, 3, 4, 5)
data: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val v = new DenseVector(data)
v: org.apache.spark.mllib.linalg.DenseVector = [1.0,2.0,3.0,4.0,5.0]
scala> val observations = sc.parallelize(Array(v))
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] = ParallelCollectionRDD[3] at parallelize at <console>:19
scala> val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
<console>:21: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
Questions:
1) How should I cast DenseVector to Vector?
2) In the real program, instead of an array of doubles, I have to get statistics on a collection that I get from an RDD using:
def countByKey(): Map[K, Long]
//Count the number of elements for each key, and return the result to the master as a Map.
So I have to do:
myRdd.countByKey().values.map(_.toDouble)
Which does not make much sense, because instead of working with RDDs I now have to work with regular Scala collections, which at some point stop fitting into memory. All the advantages of Spark's distributed computation are lost.
How can I solve this in a scalable manner?
Update
In my case I have:
val cnts: org.apache.spark.rdd.RDD[Int] = prodCntByCity.map(_._2) // get product counts only
val doubleCnts: org.apache.spark.rdd.RDD[Double] = cnts.map(_.toDouble)
How do I convert doubleCnts into observations: RDD[Vector]?
1) You don't need to cast, you just need to add a type ascription:
val observations = sc.parallelize(Array(v: Vector))
2) Use aggregateByKey (map all the keys to 1, and reduce by summing) rather than countByKey.
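A minimal sketch of that idea (myRdd stands for the pair RDD from the question); the counts stay distributed as an RDD instead of coming back to the driver as a Map:
val cnts = myRdd
  .map { case (k, _) => (k, 1L) }
  .aggregateByKey(0L)(_ + _, _ + _)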
DenseVector has a compressed method, so you can change the RDD[DenseVector] to RDD[Vector] like this:
val st = observations.map(x => x.compressed)
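And for the update's question, converting doubleCnts: RDD[Double] into observations: RDD[Vector] can be done with a one-element dense vector per value (a sketch):
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val observations: RDD[Vector] = doubleCnts.map(d => Vectors.dense(d))
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)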