Spark Scala: Vector DataFrame to RDD of values

I have a Spark DataFrame that has a vector column in it:
org.apache.spark.sql.DataFrame = [sF: vector]
and I'm trying to convert it to an RDD of values:
org.apache.spark.rdd.RDD[(Double, Double)]
However, I haven't been able to convert it properly. I've tried:
val m2 = m1.select($"sF").rdd.map{case Row(v1, v2) => (v1.toString.toDouble, v2.toString.toDouble)}
and it compiles, but I get a runtime error:
scala.MatchError: [[-0.1111111111111111,-0.2222222222222222]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
when I do:
m2.take(10).foreach(println)
Is there something I'm doing wrong?

Assuming you want the first two values of the vectors present in the sF column, maybe this will work:
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
val m2 = m1
.select($"sF")
.map { case Row(v: Vector) => (v(0), v(1)) }
You are getting an error because case Row(v1, v2) does not match the contents of the rows in your DataFrame: you are expecting two values per row (v1 and v2), but each row contains only one value, a Vector.
Note: you don't need to call .rdd if you are going to do a .map operation.
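If you do want the explicit RDD[(Double, Double)] from the question, the same pattern match works after .rdd. A minimal sketch, assuming sF holds mllib Vectors with at least two elements:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Keep .rdd here so the result is an RDD[(Double, Double)] rather than a DataFrame.
val m2: org.apache.spark.rdd.RDD[(Double, Double)] =
  m1.select($"sF").rdd.map { case Row(v: Vector) => (v(0), v(1)) }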

Related

Transpose RDD[Vector] to change records to attributes for a CSV of size 500,000 x 50

I would like to read a CSV file and transpose it to measure the correlation between attributes. But when I transpose it I get the error below:
not enough arguments for method transpose: (implicit asTraversable:
org.apache.spark.mllib.linalg.Vector =>
scala.collection.GenTraversableOnce[B])Seq[Seq[B]]. Unspecified value
parameter asTraversable.
Error occurred in an application involving default arguments.
val file = "/data.csv"
val data = sc.textFile(file).map(line => Vectors.dense(line.split (",").map(_.toDouble).distinct))
val transposedData = sc.parallelize(data.collect.toSeq.transpose)
val correlMatrix: Matrix = Statistics.corr(transposedData, "pearson")
println(correlMatrix.toString)
The data RDD is a collection of org.apache.spark.mllib.linalg.Vector, i.e. a collection of objects, but transpose requires a collection of collections.
data.collect.toSeq simply gives you a Seq[Vector], which cannot be transposed.
The following code should work for you:
val data = sc.textFile(file).map(line => line.split(",").map(_.toDouble))
val untransposedData = data.map(Vectors.dense(_)) // one Vector per record, if you still need that form
val transposedData = sc.parallelize(data.collect.toSeq.transpose).map(x => Vectors.dense(x.toArray))
val correlMatrix: Matrix = Statistics.corr(transposedData, "pearson")
println(correlMatrix.toString)
Note: distinct is removed because it would make the two-dimensional matrix uneven, which would lead to another issue.
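For reference, here is a self-contained sketch of the corrected pipeline on a tiny in-memory matrix; the sample values are made up, and sc is assumed to be a SparkContext as in spark-shell:
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.stat.Statistics

// Three made-up records with four attributes each, standing in for data.csv.
val data = sc.parallelize(Seq(
  Array(1.0, 2.0, 3.0, 4.0),
  Array(2.0, 4.0, 6.0, 8.0),
  Array(3.0, 5.0, 7.0, 9.0)))

// Transpose on the driver, then turn each transposed row back into a Vector.
val transposedData = sc.parallelize(data.collect.toSeq.transpose)
  .map(x => Vectors.dense(x.toArray))

// Pearson correlation between the columns of transposedData (a 3 x 3 Matrix here).
val correlMatrix: Matrix = Statistics.corr(transposedData, "pearson")
println(correlMatrix.toString)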

How to create DataFrame from Scala's List of Iterables?

I have the following Scala value:
val values: List[Iterable[Any]] = Traces().evaluate(features).toList
and I want to convert it to a DataFrame.
When I try the following:
sqlContext.createDataFrame(values)
I got this error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
sqlContext.createDataFrame(values)
Why?
That's what Spark's implicits object is for. It allows you to convert common Scala collection types into a DataFrame / Dataset / RDD.
Here is an example with Spark 2.0, but it exists in older versions too:
import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
Edit: Just realised you were after a 2d list. Here is something I tried in spark-shell: I converted the 2d List to a List of Tuples and used the implicit conversion to DataFrame:
val values = List(List("1", "One"), List("2", "Two"), List("3", "Three"), List("4", "4")).map(x => (x(0), x(1)))
import spark.implicits._
val df = values.toDF
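If you want named columns rather than the default _1 and _2, the same toDF call also accepts column names; the names here are just for illustration:
val df = values.toDF("id", "label")
df.show()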
Edit2: The original question by MTT was how to create a Spark dataframe from a Scala list for a 2d list, for which this is a correct answer. The original question is https://stackoverflow.com/revisions/38063195/1.
The question was later changed to match an accepted answer. I'm adding this edit so that anyone looking for something similar to the original question can find it.
As zero323 mentioned, we need to first convert List[Iterable[Any]] to List[Row], then put the rows in an RDD, and prepare a schema for the Spark data frame.
To convert List[Iterable[Any]] to List[Row], we can say
val rows = values.map { x => Row(x.toSeq: _*) }
and then, having a schema (here called schema), we can make the RDD
val rdd = sparkContext.makeRDD(rows)
and finally create the Spark data frame
val df = sqlContext.createDataFrame(rdd, schema)
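Putting those steps together, a minimal sketch with a concrete schema; the field names and types are made up for illustration and should match the shape of your own data, with sparkContext and sqlContext assumed available as in spark-shell:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Stand-in for Traces().evaluate(features).toList: two string fields per row.
val values: List[Iterable[Any]] = List(Seq("1", "One"), Seq("2", "Two"))

// One Row per inner iterable.
val rows = values.map { x => Row(x.toSeq: _*) }

// Schema matching the shape of the rows; adjust names and types to your data.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("label", StringType, nullable = true)))

val rdd = sparkContext.makeRDD(rows)
val df = sqlContext.createDataFrame(rdd, schema)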
Simplest approach:
val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")
In Spark 2 we can use a Dataset by just converting the list to a DS with the toDS API:
val ds = list.flatMap(_.split(",")).toDS() // Records split by comma
or
val ds = list.toDS()
This is more convenient than using an RDD or DataFrame.
The most concise way I've found:
val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))

Converting a [(Int, Seq[Double])] RDD to LabeledPoint

I have an RDD of the following format and would like to convert it into a LabeledPoint RDD in order to process it in MLlib:
Test: RDD[(Int, Seq[Double])] = Array((1,List(1.0,3.0,8.0)), (2,List(3.0,3.0,8.0)), (1,List(2.0,3.0,7.0)), (1,List(5.0,5.0,9.0)))
I tried with map
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
Test.map(x=> LabeledPoint(x._1, Vectors.sparse(x._2)))
but I get this error
mllib.linalg.Vector cannot be applied to (Seq[scala.Double])
So presumably the Seq element needs to be converted first but I don't know into what.
There are a few problems here:
the label should be a Double, not an Int
SparseVector requires the number of elements, the indices, and the values
none of the vector constructors accepts a List of Doubles
your data looks dense, not sparse
One possible solution:
val rdd = sc.parallelize(Array(
  (1, List(1.0, 3.0, 8.0)),
  (2, List(3.0, 3.0, 8.0)),
  (1, List(2.0, 3.0, 7.0)),
  (1, List(5.0, 5.0, 9.0))))

rdd.map { case (k, vs) =>
  LabeledPoint(k.toDouble, Vectors.dense(vs.toArray))
}
and another:
rdd.collect { case (k, v :: vs) =>
  LabeledPoint(k.toDouble, Vectors.dense(v, vs: _*))
}
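Regarding the SparseVector point above: if your features really were sparse, Vectors.sparse takes the vector length plus parallel arrays of indices and values, for example (values made up):
// A length-3 vector with non-zeros 1.0 at index 0 and 8.0 at index 2.
val sv = Vectors.sparse(3, Array(0, 2), Array(1.0, 8.0))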
As you can see in LabeledPoint's documentation, its constructor takes a Double as the label and a Vector as the features (DenseVector or SparseVector). However, if you look at the constructors of both of those subclasses, they take an Array, so you need to convert your Seq to an Array.
import org.apache.spark.mllib.linalg.{Vector, Vectors, DenseVector}
import org.apache.spark.mllib.regression.LabeledPoint
val rdd = sc.parallelize(Array(
  (1, Seq(1.0, 3.0, 8.0)),
  (2, Seq(3.0, 3.0, 8.0)),
  (1, Seq(2.0, 3.0, 7.0)),
  (1, Seq(5.0, 5.0, 9.0))))

val x = rdd.map {
  case (a: Int, b: Seq[Double]) => LabeledPoint(a, new DenseVector(b.toArray))
}
x.take(2).foreach(println)
//(1.0,[1.0,3.0,8.0])
//(2.0,[3.0,3.0,8.0])

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in spark. For that I need a tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the use of the identity function, which replicates the functionality of flatten on an RDD, and the nifty underscore notation in reduceByKey() for conciseness.
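Applied to the snippet from the question, a minimal end-to-end sketch (assuming data.txt exists and each line is counted as a key):
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))                   // RDD[Map[String, Int]]
val counts = pairs.flatMap(identity).reduceByKey(_ + _)   // RDD[(String, Int)]
counts.take(5).foreach(println)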

Spark DataFrame zipWithIndex

I am using a DataFrame to read in .parquet files, but then turning them into an RDD to do the normal processing I want to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a dataframe to RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do, which is essentially to get each value together with its column index?
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but I'm getting stuck on the last part, as I'm not sure how to do the equivalent of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
An important thing to note here is that extracting type information becomes tricky: Row.toSeq returns Seq[Any], so the resulting RDD is RDD[(Any, Int)].
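If you also want the column names rather than just positional indices, one option is to zip each row with the schema's field names. A sketch, assuming the same dataSplit DataFrame as above:
val fieldNames = dataSplit.schema.fieldNames  // Array[String], one entry per column

val indexed = dataSplit.rdd.flatMap { r =>
  r.toSeq.zipWithIndex.map { case (value, i) => (fieldNames(i), i, value) }
}  // RDD[(String, Int, Any)]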