In Spark-Scala, how to copy Array of Lists into DataFrame? - scala

I am familiar with Python and I am learning Spark-Scala.
I want to build a DataFrame which has the structure described by this syntax:
// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
(1.1, Vectors.dense(1.1, 0.1)),
(0.2, Vectors.dense(1.0, -1.0)),
(3.0, Vectors.dense(1.3, 1.0)),
(1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:
http://spark.apache.org/docs/latest/ml-pipeline.html
Currently my data is in an array which I had pulled out of a DF:
val my_a = gspc17_df.collect().map{row => Seq(row(2),Vectors.dense(row(3).asInstanceOf[Double],row(4).asInstanceOf[Double]))}
The structure of my array is very similar to the above DF:
my_a: Array[Seq[Any]] =
Array(
List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How to copy data from my array into a DataFrame which has the above structure?
I tried this syntax:
val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:
<console>:105: error: inferred type arguments [Seq[Any]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
<console>:105: error: type mismatch;
found : scala.collection.mutable.WrappedArray[Seq[Any]]
required: Seq[A]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
scala>

The first problem here is that you use List to store row data. List is a homogeneous data structure, and since the only common supertype of Any (row(2)) and DenseVector is Any (Object), you end up with a Seq[Any].
The next issue is that you use row(2) at all. Since Row is effectively a collection of Any, this operation doesn't return any useful type, and the result couldn't be stored in a DataFrame without providing an explicit Encoder.
From a more Spark-idiomatic perspective it is not a good approach either. Calling collect just to transform data shouldn't require any comment, and mapping over Rows just to create Vectors doesn't make much sense either.
Assuming that there is no type mismatch you can use VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array(df.columns(3), df.columns(4)))
.setOutputCol("features")
assembler.transform(df).select(df.columns(2), "features")
or, if you really want to handle this manually, a UDF:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
val toVec = udf((x: Double, y: Double) => Vectors.dense(x, y))
df.select(col(df.columns(2)), toVec(col(df.columns(3)), col(df.columns(4))))
In general I would strongly recommend getting familiar with Scala before you start using it with Spark.
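For the case where the data really is already local (a small array on the driver), a minimal sketch of the tuple-based fix for the first problem might look like the following. The session setup, the sample numbers, and the names raw, rows and myDf are illustrative assumptions, not taken from the question:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("tuples-sketch").getOrCreate()

// Hypothetical local data standing in for the collected rows: (label, x, y).
val raw: Array[(Double, Double, Double)] = Array(
  (-1.48, -0.0042, -0.0032),
  (-0.06, -0.0085, -0.0069),
  (1.01, -0.0003, -0.0065)
)

// A tuple is a Product with a known type per field, unlike Seq[Any],
// so createDataFrame can derive a schema for it.
val rows: Seq[(Double, Vector)] =
  raw.toSeq.map { case (label, x, y) => (label, Vectors.dense(x, y)) }

val myDf = spark.createDataFrame(rows).toDF("label", "features")
myDf.printSchema()
```

The key difference from the original attempt is only the element type: Seq[(Double, Vector)] satisfies the A <: Product bound from the error message, while Seq[Seq[Any]] does not.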

Related

Input Spark Scala Dataframe Column as Vector

Relatively new to scala and the Spark API kit but I have a question trying to make use of the vector assembler
http://spark.apache.org/docs/latest/ml-features.html#vectorassembler
to then make use of matrix correlations
https://spark.apache.org/docs/2.1.0/mllib-statistics.html#correlations
The dataframe column is of dtype linalg.Vector
val assembler = new VectorAssembler()
val trainwlabels3 = assembler.transform(trainwlabels2)
trainwlabels3.dtypes(0)
res90: (String, String) = (features,org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7)
and yet calling this to an RDD for the statistics tool throws a mismatch error.
val data: RDD[Vector] = sc.parallelize(
trainwlabels3("features")
)
<console>:80: error: type mismatch;
found : org.apache.spark.sql.Column
required: Seq[org.apache.spark.mllib.linalg.Vector]
Thanks in advance for any help.
You should just select:
val features = trainwlabels3.select($"features")
Convert to RDD
val featuresRDD = features.rdd
and map:
featuresRDD.map(_.getAs[Vector]("features"))
This should work for you:
val rddForStatistics = new VectorAssembler()
.transform(trainwlabels2)
.select($"features")
.as[Vector] //turns Dataset[Row] (a.k.a DataFrame) to DataSet[Vector]
.rdd
However, you should avoid RDDs and figure out how to do what you want with the DataFrame-based API (in the spark.ml package) because working with RDDs is all but deprecated in MLlib.
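Note that the error in the question asks for org.apache.spark.mllib.linalg.Vector, while VectorAssembler produces org.apache.spark.ml.linalg.Vector. A hedged sketch of bridging the two with Vectors.fromML (available in Spark 2.x) before calling the RDD-based statistics; the toy columns a and b and the local session are assumptions made for illustration:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("corr-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input with two numeric columns.
val df = Seq((1.0, 2.0), (2.0, 4.1), (3.0, 5.9)).toDF("a", "b")

val assembled = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")
  .transform(df)

// Pull the ml Vector out of each Row and convert it to the old mllib type.
val data: RDD[OldVector] = assembled
  .select("features")
  .rdd
  .map(row => OldVectors.fromML(row.getAs[MLVector]("features")))

val corr = Statistics.corr(data, "pearson")
println(corr)
```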

How to create DataFrame from Scala's List of Iterables?

I have the following Scala value:
val values: List[Iterable[Any]] = Traces().evaluate(features).toList
and I want to convert it to a DataFrame.
When I try the following:
sqlContext.createDataFrame(values)
I got this error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
sqlContext.createDataFrame(values)
Why?
That's what the Spark implicits object is for. It allows you to convert common Scala collection types into a DataFrame / Dataset / RDD.
Here is an example with Spark 2.0, but it exists in older versions too:
import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
Edit: Just realised you were after 2d list. Here is something I tried on spark-shell. I converted a 2d List to List of Tuples and used implicit conversion to DataFrame:
val values = List(List("1", "One") ,List("2", "Two") ,List("3", "Three"),List("4","4")).map(x =>(x(0), x(1)))
import spark.implicits._
val df = values.toDF
Edit2: The original question by MTT was How to create spark dataframe from a scala list for a 2d list for which this is a correct answer. The original question is https://stackoverflow.com/revisions/38063195/1
The question was later changed to match an accepted answer. Adding this edit so that if someone else looking for something similar to the original question can find it.
As zero323 mentioned, we first need to convert List[Iterable[Any]] to List[Row], then put the rows in an RDD and prepare a schema for the Spark DataFrame.
To convert List[Iterable[Any]] to List[Row], we can write
val rows = values.map{x => Row(x.toSeq:_*)}
and then, given a schema named schema, we can make an RDD
val rdd = sparkContext.makeRDD[Row](rows)
and finally create a Spark DataFrame
val df = sqlContext.createDataFrame(rdd, schema)
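Put together, the Row-plus-schema steps could be sketched as below. The sample values and the two-field schema (id, name) are hypothetical, since the question does not say what Traces().evaluate returns; the schema has to match the actual contents of each inner collection:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("rows-sketch").getOrCreate()

// Hypothetical stand-in for the question's List[Iterable[Any]].
val values: List[Iterable[Any]] = List(List(1, "One"), List(2, "Two"))

// Row.apply takes varargs, so each Iterable is converted to a Seq first.
val rows: List[Row] = values.map(x => Row(x.toSeq: _*))

// The schema supplies the type information that Any has lost.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val rdd = spark.sparkContext.makeRDD(rows)
val df = spark.createDataFrame(rdd, schema)
df.show()
```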
Simplest approach:
val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")
In Spark 2 we can use a Dataset by converting the list with the toDS API:
val ds = list.flatMap(_.split(",")).toDS() // Records split by comma
or
val ds = list.toDS()
This is more convenient than an RDD or a DataFrame.
The most concise way I've found:
val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))

How to convert RDD[Row] to RDD[Vector]

I'm trying to implement k-means method using scala.
I created an RDD something like this:
val df = sc.parallelize(data).groupByKey().collect().map((chunk)=> {
sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe =>{
dataframe.selectExpr(
"avg(time) as avg_time",
"variance(size) as var_size",
"variance(time) as var_time",
"count(size) as examples"
).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards
At least a couple of issues here:
1. No, you really cannot cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL. A Vector is not a native Spark SQL type.
2. There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations, but KMeans expects a series of individual data points in the form of a Vector (which encapsulates an Array[Double]). So why are you supplying sums and averages to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map{ row => row.getAs[Double]("colname") }
val vector = Vectors.dense(doubVals.collect())
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to KMeans.
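As an alternative to collecting one column into a single Vector, each Row can be turned into its own Vector, which yields the RDD[Vector] that KMeans asks for. This is a sketch of that different, per-row approach, not the method from the answer above; the column names are taken from the question's selectExpr, while the sample rows, k = 2 and the iteration count are made-up assumptions:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("kmeans-sketch").getOrCreate()
import spark.implicits._

// Hypothetical aggregated rows with the question's column names.
val aggregated = Seq(
  (1.0, 0.5, 0.2, 10L),
  (1.1, 0.4, 0.3, 12L),
  (9.0, 4.0, 3.5, 50L),
  (9.2, 4.1, 3.4, 48L)
).toDF("avg_time", "var_size", "var_time", "examples")

// Read every field with its real type, then pack the doubles into one Vector per row.
val vectors: RDD[Vector] = aggregated.rdd.map { row =>
  Vectors.dense(
    row.getAs[Double]("avg_time"),
    row.getAs[Double]("var_size"),
    row.getAs[Double]("var_time"),
    row.getAs[Long]("examples").toDouble // count(...) produces a Long
  )
}

val model = KMeans.train(vectors, 2, 10) // k = 2, at most 10 iterations
model.clusterCenters.foreach(println)
```

The ClassCastException in the question comes from asking for a Vector at index 0 when that field actually holds a Double; reading each field with its own type avoids it.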

Spark DataFrame zipWithIndex

I am using a DataFrame to read in .parquet files, but then turning them into an RDD to do the normal processing I want to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a dataframe to RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Anyone know how to do what I am trying to do, essentially trying to get the value and the column index.
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but getting stuck on the last part as not sure how to do the same of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
An important thing to note here is that extracting type information becomes tricky: Row.toSeq returns Seq[Any], so the resulting RDD is RDD[(Any, Int)].
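A small sketch of what that looks like in practice, using a toy two-column DataFrame (an assumption made here in place of the parquet input). It also shows pairing each value with its column name from the schema, which is often more useful than the bare index:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("zip-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data standing in for the parquet file.
val dataSplit = Seq((1, "a"), (2, "b")).toDF("id", "tag")

// Each Row becomes a Seq[Any], so the element type of the values is lost.
val columnIndex: RDD[(Any, Int)] = dataSplit.rdd.flatMap(r => r.toSeq.zipWithIndex)

// Variant: attach the column name instead of the position.
val names: Array[String] = dataSplit.schema.fieldNames
val columnName: RDD[(Any, String)] = dataSplit.rdd.flatMap(r => r.toSeq.zip(names))

columnIndex.collect().foreach(println)
```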

Spark: Summary statistics

I am trying to use Spark summary statistics as described at: https://spark.apache.org/docs/1.1.0/mllib-statistics.html
According to Spark docs :
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.DenseVector
val observations: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
I have a problem building observations:RDD[Vector] object. I try:
scala> val data:Array[Double] = Array(1, 2, 3, 4, 5)
data: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val v = new DenseVector(data)
v: org.apache.spark.mllib.linalg.DenseVector = [1.0,2.0,3.0,4.0,5.0]
scala> val observations = sc.parallelize(Array(v))
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] = ParallelCollectionRDD[3] at parallelize at <console>:19
scala> val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
<console>:21: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
Questions:
1) How should I cast DenseVector to Vector?
2) In the real program, instead of an array of doubles, I have to get statistics on a collection that I get from an RDD using:
def countByKey(): Map[K, Long]
//Count the number of elements for each key, and return the result to the master as a Map.
So I have to do:
myRdd.countByKey().values.map(_.toDouble)
Which does not make much sense, because instead of working with RDDs I now have to work with regular Scala collections, which at some point stop fitting into memory. All the advantages of Spark's distributed computation are lost.
How to solve this in scalable manner?
Update
In my case I have:
val cnts: org.apache.spark.rdd.RDD[Int] = prodCntByCity.map(_._2) // get product counts only
val doubleCnts: org.apache.spark.rdd.RDD[Double] = cnts.map(_.toDouble)
How to convert doubleCnts into observations: RDD[Vector] ?
1) You don't need to cast, you just need to type:
val observations = sc.parallelize(Array(v: Vector))
2) Use aggregateByKey (map each key to 1, and reduce by summing) rather than countByKey.
DenseVector has a compressed function, so you can change the RDD[DenseVector] to an RDD[Vector] as follows:
val st =observations.map(x=>x.compressed)
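For the Update in the question, a hedged sketch of keeping everything distributed: count per key with aggregateByKey, wrap each count in a one-element Vector, and pass the result to colStats. The sample pairs and the local session are assumptions; annotating the result as RDD[Vector] is what sidesteps the invariance error:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `sc` already exists.
val spark = SparkSession.builder().master("local[1]").appName("colstats-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical (city, product) pairs.
val myRdd: RDD[(String, Int)] = sc.parallelize(Seq(
  ("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5), ("c", 6)
))

// Distributed replacement for countByKey: add 1 per record, then merge partial counts.
val counts: RDD[(String, Long)] = myRdd.aggregateByKey(0L)((acc, _) => acc + 1L, _ + _)

// Vectors.dense returns Vector (not DenseVector), so the RDD has the required type.
val observations: RDD[Vector] = counts.map { case (_, n) => Vectors.dense(n.toDouble) }

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)
```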