How to create DataFrame from Scala's List of Iterables?

I have the following Scala value:
val values: List[Iterable[Any]] = Traces().evaluate(features).toList
and I want to convert it to a DataFrame.
When I try the following:
sqlContext.createDataFrame(values)
I get this error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
sqlContext.createDataFrame(values)
Why?

That's what the spark implicits object is for. It allows you to convert common Scala collection types into a DataFrame / Dataset / RDD.
Here is an example with Spark 2.0, but it exists in older versions too:
import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
Edit: Just realised you were after a 2D list. Here is something I tried in spark-shell: I converted the 2D List to a List of Tuples and used the implicit conversion to a DataFrame:
val values = List(List("1", "One"), List("2", "Two"), List("3", "Three"), List("4", "4")).map(x => (x(0), x(1)))
import spark.implicits._
val df = values.toDF
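A small follow-up: toDF also accepts column names, so the tuples can be given meaningful headers (the names below are purely illustrative):
val namedDf = values.toDF("id", "name")
namedDf.show()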
Edit 2: The original question by MTT was "How to create a Spark DataFrame from a Scala list" for a 2D list, for which this is a correct answer. The original question is https://stackoverflow.com/revisions/38063195/1
The question was later changed to match an accepted answer. I'm adding this edit so that anyone looking for something similar to the original question can find it.

As zero323 mentioned, we need to first convert List[Iterable[Any]] to List[Row], then put the rows in an RDD and prepare a schema for the Spark DataFrame.
To convert List[Iterable[Any]] to List[Row], we can write:
val rows = values.map{x => Row(x:_*)}
Then, with a schema in hand (called schema below), we can make an RDD:
val rdd = sparkContext.makeRDD(rows)
and finally create the Spark DataFrame:
val df = sqlContext.createDataFrame(rdd, schema)
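Putting the pieces together, here is a minimal sketch. The sample data and the two-string-column schema are illustrative only, since List[Iterable[Any]] carries no type information; adapt the field names and types to your actual data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Stand-in for Traces().evaluate(features).toList
val values: List[Iterable[Any]] = List(Seq("1", "One"), Seq("2", "Two"))

// One Row per Iterable; values are rendered as strings to match the schema below
val rows = values.map(x => Row(x.toSeq.map(_.toString): _*))

// The schema must match the arity and types of each Row
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val rdd = sparkContext.makeRDD(rows)
val df = sqlContext.createDataFrame(rdd, schema)
df.show()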

Simplest approach:
val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")

In Spark 2 we can use a Dataset, simply by converting the list to a Dataset with the toDS API:
val ds = list.flatMap(_.split(",")).toDS() // Records split by comma
or
val ds = list.toDS()
This is more convenient than an RDD or a DataFrame.
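A hedged follow-up: toDS also works with case classes (Record below is purely illustrative), which gives a Dataset with named, typed columns instead of a single value column:
// In a compiled application, define the case class at the top level, not inside a method
case class Record(id: Int, name: String)

import spark.implicits._
val typedDs = List(Record(1, "One"), Record(2, "Two")).toDS()
typedDs.show()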

The most concise way I've found:
val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))

Related

Spark: convert an RDD[LabeledPoint] to a Dataframe to apply MinMaxScaler, and after scaling get the normalized RDD[LabeledPoint]

I'm using RDD[LabeledPoint] in my code. But now I have to normalize data using the MinMax method.
I saw that exist in ml library the MinMaxScaler, but this works with DataFrames: org.apache.spark.ml.feature.MinMaxScaler.
Because the full code was already written with RDDs, I think I could do the following steps so I don't have to change anything else:
Convert the RDD[LabeledPoint] to DataFrame
Apply MinMaxScaler to the DataFrame
Convert the DataFrame to the RDD[LabeledPoint]
The thing is that I don't know how to do it. I don't have column names (though the feature vector in the LabeledPoint has 9 dimensions), and I also couldn't adapt other examples to my case. For instance, the code in:
https://stackoverflow.com/a/36909553/5081366
or Scaling each column of a dataframe
I would appreciate your help!
Finally, I am able to answer my own question!
Where allData is an RDD[LabeledPoint]:
// The following import doesn't work externally because the implicits object is defined inside the SQLContext class
val sqlContext = SparkSession
.builder()
.appName("Spark In Action")
.master("local")
.getOrCreate()
import sqlContext.implicits._
// Create a DataFrame from RDD[LabeledPoint]
val all = allData.map(e => (e.label, e.features))
val df_all = all.toDF("labels", "features")
// MinMaxScaler instance with min = 0 and max = 1
val scaler = new MinMaxScaler()
.setInputCol("features")
.setOutputCol("featuresScaled")
.setMax(1)
.setMin(0)
// Scaling
var df_scaled = scaler.fit(df_all).transform(df_all)
// Drop the unscaled column
df_scaled = df_scaled.drop("features")
// Convert DataFrame to RDD[LabeledPoint]
val rdd_scaled = df_scaled.rdd.map(row => LabeledPoint(
row.getAs[Double]("labels"),
row.getAs[Vector]("featuresScaled")
))
I hope this will help someone else!
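One caveat worth adding (not part of the original answer): in Spark 2.x the ml MinMaxScaler outputs an org.apache.spark.ml.linalg.Vector, while the mllib LabeledPoint expects an org.apache.spark.mllib.linalg.Vector. If the last step throws a ClassCastException, a conversion along these lines should help:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val rdd_scaled = df_scaled.rdd.map { row =>
  // Read the scaled features as an ml vector, then convert it to an mllib vector
  val mlVec = row.getAs[org.apache.spark.ml.linalg.Vector]("featuresScaled")
  LabeledPoint(row.getAs[Double]("labels"), Vectors.fromML(mlVec))
}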

Input Spark Scala Dataframe Column as Vector

I'm relatively new to Scala and the Spark API, but I have a question about trying to make use of the VectorAssembler
http://spark.apache.org/docs/latest/ml-features.html#vectorassembler
to then make use of matrix correlations
https://spark.apache.org/docs/2.1.0/mllib-statistics.html#correlations
The DataFrame column is of type linalg.Vector:
val assembler = new VectorAssembler()
val trainwlabels3 = assembler.transform(trainwlabels2)
trainwlabels3.dtypes(0)
res90: (String, String) = (features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7)
and yet passing this into an RDD for the statistics tool throws a mismatch error:
val data: RDD[Vector] = sc.parallelize(
trainwlabels3("features")
)
<console>:80: error: type mismatch;
found : org.apache.spark.sql.Column
required: Seq[org.apache.spark.mllib.linalg.Vector]
Thanks in advance for any help.
You should just select:
val features = trainwlabels3.select($"features")
Convert to RDD
val featuresRDD = features.rdd
and map:
featuresRDD.map(_.getAs[Vector]("features"))
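If the goal is the RDD-based mllib correlation API from the question, keep in mind that in Spark 2 the assembled column holds org.apache.spark.ml.linalg vectors while mllib's Statistics expects mllib.linalg vectors. A hedged conversion sketch:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Convert each ml vector to its mllib counterpart before calling Statistics.corr
val mllibVectors = featuresRDD.map { r =>
  Vectors.fromML(r.getAs[org.apache.spark.ml.linalg.Vector]("features"))
}
val corrMatrix = Statistics.corr(mllibVectors)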
This should work for you:
val rddForStatistics = new VectorAssembler()
.transform(trainwlabels2)
.select($"features")
.as[Vector] // turns Dataset[Row] (a.k.a. DataFrame) into Dataset[Vector]
.rdd
However, you should avoid RDDs and figure out how to do what you want with the DataFrame-based API (in the spark.ml package) because working with RDDs is all but deprecated in MLlib.
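For completeness, a sketch of computing the correlation matrix without leaving the DataFrame-based API, assuming Spark 2.2+ where org.apache.spark.ml.stat.Correlation is available:
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// trainwlabels3 is the DataFrame with the assembled "features" column from above
val Row(corrMatrix: Matrix) = Correlation.corr(trainwlabels3, "features").head
println(corrMatrix.toString)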

In Spark-Scala, how to copy Array of Lists into DataFrame?

I am familiar with Python and I am learning Spark-Scala.
I want to build a DataFrame whose structure is described by this syntax:
// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
(1.1, Vectors.dense(1.1, 0.1)),
(0.2, Vectors.dense(1.0, -1.0)),
(3.0, Vectors.dense(1.3, 1.0)),
(1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:
http://spark.apache.org/docs/latest/ml-pipeline.html
Currently my data is in an array which I pulled out of a DataFrame:
val my_a = gspc17_df.collect().map{row => Seq(row(2),Vectors.dense(row(3).asInstanceOf[Double],row(4).asInstanceOf[Double]))}
The structure of my array is very similar to the above DF:
my_a: Array[Seq[Any]] =
Array(
List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How can I copy data from my array into a DataFrame with the above structure?
I tried this syntax:
val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:
<console>:105: error: inferred type arguments [Seq[Any]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
<console>:105: error: type mismatch;
found : scala.collection.mutable.WrappedArray[Seq[Any]]
required: Seq[A]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
scala>
The first problem here is that you use List to store row data. List is a homogeneous data structure, and since the only common type of Any (row(2)) and DenseVector is Any (Object), you end up with a Seq[Any].
The next issue is that you use row(2) at all. Since Row is effectively a collection of Any, this operation doesn't return any useful type, and the result couldn't be stored in a DataFrame without providing an explicit Encoder.
From a more Spark-ish perspective it is not a good approach either. Calling collect just to transform data shouldn't require any comment, and mapping over Rows just to create Vectors doesn't make much sense either.
Assuming that there is no type mismatch, you can use a VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array(df.columns(3), df.columns(4)))
.setOutputCol("features")
assembler.transform(df).select(df.columns(2), "features")
or, if you really want to handle this manually, a UDF:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
val toVec = udf((x: Double, y: Double) => Vectors.dense(x, y))
df.select(col(df.columns(2)), toVec(col(df.columns(3)), col(df.columns(4))))
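Either result can then be renamed to match the (label, features) layout from the ml docs, e.g. (a small hedged follow-up, not part of the original answer):
val training = assembler.transform(df)
  .select(df.columns(2), "features")
  .toDF("label", "features")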
In general I would strongly recommend getting familiar with Scala before you start using it with Spark.

How to convert RDD[Row] to RDD[Vector]

I'm trying to implement k-means method using scala.
I created an RDD something like this:
val df = sc.parallelize(data).groupByKey().collect().map((chunk)=> {
sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe =>{
dataframe.selectExpr(
"avg(time) as avg_time",
"variance(size) as var_size",
"variance(time) as var_time",
"count(size) as examples"
).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast it by doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards
At least a couple of issues here:
No, you really cannot cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL. A Vector is not a native Spark SQL type.
There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations, but KMeans expects a series of individual data points in the form of a Vector (which encapsulates an Array[Double]). So then, why are you supplying sums and averages to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map { row => row.getAs[Double]("colname") }
val vector = Vectors.dense(doubVals.collect())
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to KMeans.
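Note, though, that KMeans wants an RDD[Vector] with one Vector per data point, not one big Vector. A hedged sketch of that shape, assuming the four aggregated columns of each row are the intended features:
import org.apache.spark.mllib.linalg.Vectors

val rdd_final_Vector = rdd_final.map { row =>
  // One feature vector per aggregated row; count(size) comes back as a Long
  Vectors.dense(
    row.getAs[Double]("avg_time"),
    row.getAs[Double]("var_size"),
    row.getAs[Double]("var_time"),
    row.getAs[Long]("examples").toDouble
  )
}
val model = kmeans.run(rdd_final_Vector)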

Spark DataFrame zipWithIndex

I am using a DataFrame to read in .parquet files, but then I turn them into an RDD to do the normal processing I wanted to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a DataFrame to an RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do, essentially getting the value and the column index?
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but I'm getting stuck on the last part, as I'm not sure how to do the equivalent of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
An important thing to note here is that extracting type information becomes tricky: Row.toSeq returns Seq[Any], so the resulting RDD is RDD[(Any, Int)].
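If you also want the column name rather than just the positional index, a hedged variant is to zip each row with the schema's field names (using dataSplit from the question above):
val fieldNames = dataSplit.schema.fieldNames
// RDD[(Any, String)]: each value paired with the name of its column
val valuesWithNames = dataSplit.rdd.flatMap(r => r.toSeq.zip(fieldNames))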