Converting Scala mutable arrays to a spark dataframe - scala

I have three mutable arrays defined as:
import scala.collection.mutable.ArrayBuffer
var quartile_1 = ArrayBuffer[Double]()
var quartile_3 = ArrayBuffer[Double]()
var id = ArrayBuffer[String]()
quartile_1 and quartile_3 are information at id level and I am currently computing them as:
def func1(x: org.apache.spark.sql.Row) {
val apQuantile = df_auth_for_qnt.where($"id" === x(0).toString).stat.approxQuantile("tran_amt", Array(0.25, 0.75), 0.001)
quartile_1 += apQuantile(0)
quartile_3 += apQuantile(1)
id += x(0).toString()
}
val cardNumList = df_auth_for_qnt_gb.where($"tran_cnt" > 8).select("card_num_1").collect.foreach(func1)
Is there a better approach than appending them to mutable arrays? My goal is to have the quantile data, id available as a dataframe - so that I can do further joins.

Mutable structures like ArrayBuffer are evil, especially in parallelizable contexts. Here they can be avoided quite easily.
func1 can return a tuple of (String, Array[Double]), where the first element corresponds to the id (former id buffer) and the second element is the quartiles returned from approxQuantile:
def func1(x: Row): (String, Array[Double]) = {
val cardNum1 = x(0).toString
val quartiles = df_auth_for_qnt.where($"id" === cardNum1).stat.approxQuantile("tran_amt", Array(0.25, 0.75), 0.001)
(cardNum1, quartiles)
}
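Now, using functional chaining, we can obtain an immutable result structure. Note that func1 queries df_auth_for_qnt, and a DataFrame cannot be used inside a transformation of another Dataset, so the ids are collected to the driver first (as in the question) and mapped there.
As a DataFrame (toDF on a local Seq needs import spark.implicits._):
val resultDf = df_auth_for_qnt_gb.where($"tran_cnt" > 8).select("card_num_1").collect().map(func1).toSeq.toDF("id", "quartiles")
Or as a Map[String, Array[Double]] with the same associations as in the tuples returned from func1:
val resultMap = df_auth_for_qnt_gb.where($"tran_cnt" > 8).select("card_num_1").collect().map(func1).toMap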
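To make the "return a tuple instead of appending to buffers" idea concrete without needing a Spark session, here is a minimal plain-Scala sketch. The names QuartilesSketch, quantile, quartilesFor and quartilesById are made up for illustration, and quantile is an exact nearest-rank calculation standing in for approxQuantile, not Spark's algorithm.

```scala
object QuartilesSketch {
  // Hypothetical stand-in for approxQuantile: exact nearest-rank quantile
  // of a sorted, non-empty sample.
  def quantile(sorted: Vector[Double], p: Double): Double = {
    val rank = math.ceil(p * sorted.length).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Returns the result for one id instead of appending to shared buffers.
  def quartilesFor(id: String, amounts: Seq[Double]): (String, Array[Double]) = {
    val sorted = amounts.sortWith(_ < _).toVector
    (id, Array(quantile(sorted, 0.25), quantile(sorted, 0.75)))
  }

  // The whole result is one immutable Map built in a single expression.
  def quartilesById(amountsById: Map[String, Seq[Double]]): Map[String, Array[Double]] =
    amountsById.map { case (id, amounts) => quartilesFor(id, amounts) }
}
```

Because every call returns its own value, nothing is shared between calls, so the order in which ids are processed no longer matters.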

Related

ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast

I am using an Aggregator to apply some custom merge on a DataFrame after grouping its records by their primary key:
case class Player(
pk: String,
ts: String,
first_name: String,
date_of_birth: String
)
case class PlayerProcessed(
var ts: String,
var first_name: String,
var date_of_birth: String
)
// Custom Aggregator - this is just for the example, the actual one is more complex
object BatchDedupe extends Aggregator[Player, PlayerProcessed, PlayerProcessed] {
def zero: PlayerProcessed = PlayerProcessed("0", null, null)
def reduce(bf: PlayerProcessed, in : Player): PlayerProcessed = {
bf.ts = in.ts
bf.first_name = in.first_name
bf.date_of_birth = in.date_of_birth
bf
}
def merge(bf1: PlayerProcessed, bf2: PlayerProcessed): PlayerProcessed = {
bf1.ts = bf2.ts
bf1.first_name = bf2.first_name
bf1.date_of_birth = bf2.date_of_birth
bf1
}
def finish(reduction: PlayerProcessed): PlayerProcessed = reduction
def bufferEncoder: Encoder[PlayerProcessed] = Encoders.product
def outputEncoder: Encoder[PlayerProcessed] = Encoders.product
}
val ply1 = Player("12121212121212", "10000001", "Rogger", "1980-01-02")
val ply2 = Player("12121212121212", "10000002", "Rogg", null)
val ply3 = Player("12121212121212", "10000004", null, "1985-01-02")
val ply4 = Player("12121212121212", "10000003", "Roggelio", "1982-01-02")
val seq_users = sc.parallelize(Seq(ply1, ply2, ply3, ply4)).toDF.as[Player]
val grouped = seq_users.groupByKey(_.pk)
val non_sorted = grouped.agg(BatchDedupe.toColumn.name("deduped"))
non_sorted.show(false)
This returns:
+--------------+--------------------------------+
|key |deduped |
+--------------+--------------------------------+
|12121212121212|{10000003, Roggelio, 1982-01-02}|
+--------------+--------------------------------+
Now, I would like to order the records based on ts before aggregating them. From here I understand that .sortBy("ts") does not guarantee the order after the .groupByKey(_.pk), so I was trying to apply the .sortBy between the .groupByKey and the .agg.
The output of the .groupByKey(_.pk) is a KeyValueGroupedDataset[String,Player], where the second element of each group is an Iterator. So to apply some sorting logic there I convert it into a Seq:
val sorted = grouped.mapGroups{case(k, iter) => (k, iter.toSeq.sortBy(_.ts))}.agg(BatchDedupe.toColumn.name("deduped"))
sorted.show(false)
However, the output of .mapGroups after adding the sorting logic is a Dataset[(String, Seq[Player])]. So when I try to invoke the .agg function on it I am getting the following exception:
Caused by: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line050e0d37885948cd91f7f7dd9e3b4da9311.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Player
How could I convert the output of my .mapGroups(...) back into a KeyValueGroupedDataset[String,Player]?
I tried to cast back to Iterator as follows:
val sorted = grouped.mapGroups{case(k, iter) => (k, iter.toSeq.sortBy(_.ts).toIterator)}.agg(BatchDedupe.toColumn.name("deduped"))
But this approach produced the following exception:
UnsupportedOperationException: No Encoder found for Iterator[Player]
- field (class: "scala.collection.Iterator", name: "_2")
- root class: "scala.Tuple2"
How else can I add the sort logic between the .groupByKey and .agg methods?
Based on the discussion above, the purpose of the Aggregator is to get the latest field values per Player by ts, ignoring null values.
This can be achieved fairly easily by aggregating each field individually with max_by, here via a SQL expression. With that there is no need for a custom Aggregator or a mutable aggregation buffer.
import org.apache.spark.sql.functions._
val players: Dataset[Player] = ...
// aggregate all columns except the key individually by ts
// NULLs will be ignored (SQL standard)
val aggColumns = players.columns
.filterNot(_ == "pk")
.map(colName => expr(s"max_by($colName, if(isNotNull($colName), ts, null))").as(colName))
val aggregatedPlayers = players
.groupBy(col("pk"))
.agg(aggColumns.head, aggColumns.tail: _*)
.as[Player]
On the most recent versions of Spark you can also use the built-in max_by function (it was added to org.apache.spark.sql.functions in 3.3, as far as I know):
import org.apache.spark.sql.functions._
val players: Dataset[Player] = ...
// aggregate all columns except the key individually by ts
// NULLs will be ignored (SQL standard)
val aggColumns = players.columns
.filterNot(_ == "pk")
.map(colName => max_by(col(colName), when(col(colName).isNotNull, col("ts"))).as(colName))
val aggregatedPlayers = players
.groupBy(col("pk"))
.agg(aggColumns.head, aggColumns.tail: _*)
.as[Player]
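To see what "latest value per field by ts, ignoring nulls" should produce for the sample rows in the question, here is a small plain-collections sketch. It is not Spark code: the names LatestNonNull, latest and dedupe are invented for this illustration, and it assumes ts is never null and compares as a string.

```scala
object LatestNonNull {
  final case class Player(pk: String, ts: String, first_name: String, date_of_birth: String)

  // Value of `field` taken from the row with the greatest ts among the rows
  // where that field is non-null; null if every row has null for it.
  def latest(rows: Seq[Player])(field: Player => String): String = {
    val candidates = rows.filter(p => field(p) != null)
    if (candidates.isEmpty) null else field(candidates.maxBy(_.ts))
  }

  // One merged Player per primary key, each field resolved independently.
  def dedupe(players: Seq[Player]): Seq[Player] =
    players.groupBy(_.pk).toSeq.map { case (pk, rows) =>
      Player(pk, latest(rows)(_.ts), latest(rows)(_.first_name), latest(rows)(_.date_of_birth))
    }
}
```

For the four rows from the question, this picks ts and date_of_birth from the 10000004 row and first_name from the 10000003 row, because the 10000004 row has a null first_name.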

Unable to flatten array of DataFrames

I have an array of DataFrames that I obtain by using randomSplit() in this manner:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
I'll be iterating over folds using a for loop, where I will drop the ith entry inside folds and store it separately. Then I will use all the others together as another DataFrame, as in my code below:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))
for (i <- folds.indices) {
var ts = folds
val testSet = ts(i)
ts = ts.drop(i)
var trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
for (j <- ts.indices) {
trainSet = trainSet.union(ts(j))
}
}
While this does serve my purpose, I was also trying another approach where I would still separate folds into ts and testSet, and then use the flatten function for the remaining inside ts to create another DataFrame using something like this:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))
for (i <- folds.indices) {
var ts = folds
val testSet = ts(i)
ts = ts.drop(i)
var trainSet = ts.flatten
}
But at the initialization of the trainSet line, I get an error that: No Implicits Found for parameter asTrav: Dataset[Row] => Traversable[U_]. I have also done import spark.implicits._ after initializing the SparkSession.
My end goal with the creation of trainSet after flatten is to retrieve a DataFrame created after joining (union) the other Dataset[Row]s inside ts. I'm not sure where I'm going wrong.
I'm using Spark 2.4.5 with Scala 2.11.12
EDIT 1: Added how I read the Dataframe
I'm not sure what your intention is here, but instead of using mutable variables and flattening you can do a recursive iteration like this:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
val testSet = spark.createDataFrame(Seq.empty)
val trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
go(folds, Array.empty)
def go(items: Array[Dataset[Row]], result: Array[Dataset[Row]]): Array[Dataset[Row]] = items match {
case arr @ Array(_, _*) =>
val res = arr.map { t =>
trainSet.union(t)
}
go(arr.tail, result ++ res)
case Array() => result
}
As far as I can see from the use case, testSet is not actually used anywhere in the loop body.
I have replaced that for loop with a simple reduce:
val trainSet = ts.reduce((a,b) => a.union(b))
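One thing to watch in the question's loop: ts.drop(i) drops the first i folds, not the fold at index i. Below is a minimal sketch of the whole split with no vars, written generically so it runs without Spark. FoldSplits and trainTestSplits are hypothetical names, and the union function is passed in by the caller.

```scala
object FoldSplits {
  // For each index i, pairs the union of all other folds (train set)
  // with fold i (test set). Assumes at least two folds, since reduce
  // fails on an empty collection.
  def trainTestSplits[A](folds: Seq[A])(union: (A, A) => A): Seq[(A, A)] =
    folds.indices.map { i =>
      val testSet = folds(i)
      // patch(i, Nil, 1) removes exactly the element at index i
      val others = folds.patch(i, Nil, 1)
      (others.reduce(union), testSet)
    }
}
```

With the DataFrames from randomSplit this would presumably be called as FoldSplits.trainTestSplits(folds.toSeq)(_ union _), giving one (trainSet, testSet) pair per fold.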

Return type to assign to val for RDDs

I am playing around with Spark code to learn more about shuffling. I wrote the following code to see how stages are formed when there is an if-else statement. I have declared val result so that the result can be assigned to it later in the if statement, but I am not sure which type to declare it with.
Is there an abstract class that goes with all the RDDs?
val conf = new SparkConf().setMaster("local").setAppName("spark shuffle")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000).map(i => (i%1000, i))
val x = d.reduceByKey(_+_)
val count = 1
val result: RDD // What is the correct return type here?
if(count == 1)
{
result= d.rightOuterJoin(x)
result.collect()
}
d is an RDD[(Int, Int)].
Doing a reduceByKey then gives an RDD of the same type, with the values combined per key.
Doing a rightOuterJoin then gives you an RDD[(Int, (Option[Int], Int))], i.e. for each key the left and right values, with the left value wrapped in an Option because it may be missing.
So doing a collect gives you an Array with the same element type.
The API documentation is not easy to follow for all these functions: there are a lot of generic types and a lot of implicits. I would recommend that you either use an IDE which will hint the types for you, or else use a tool that gives you a console in which you can try snippets.
You can avoid the assignment altogether (if you kept it, result would have to be a var, not a val):
val conf = new SparkConf().setMaster("local").setAppName("spark shuffle")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000).map(i => (i%1000, i))
val x = d.reduceByKey(_+_)
val count = 1
if (count == 1) {
d.rightOuterJoin(x).collect()
}
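If you do want a named result, remember that if/else is an expression in Scala, so a val can be initialised with it directly. The sketch below models rightOuterJoin on plain Seqs purely to show where the Option in the type comes from. It is not Spark code, and JoinTypeSketch and its sample data are invented for the illustration. With real RDDs the declared type would be RDD[(Int, (Option[Int], Int))], as described above.

```scala
object JoinTypeSketch {
  // Plain-collections model of a right outer join: every right-hand entry
  // appears in the output, and the left value is None when the key has no
  // match on the left.
  def rightOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (Option[V], W))] =
    right.flatMap { case (k, w) =>
      val matches: Seq[V] = left.collect { case (key, v) if key == k => v }
      if (matches.isEmpty) Seq((k, (Option.empty[V], w)))
      else matches.map(v => (k, (Some(v), w)))
    }

  val d: Seq[(Int, Int)] = Seq((0, 10), (1, 11))
  val x: Seq[(Int, Int)] = Seq((1, 100), (2, 200))
  val count = 1

  // if/else yields a value, so no var and no later assignment is needed.
  val result: Seq[(Int, (Option[Int], Int))] =
    if (count == 1) rightOuterJoin(d, x) else Seq.empty
}
```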

How can i split a string of dataframe schema into each Structs

I want to split the schema of a dataframe into a collection. I am trying this, but the schema is printed out as a string. Is there any way I can split it into a collection per StructType so that I can manipulate it (like taking only the array columns from the output)? I am trying to flatten a complex multi-level struct + array dataframe.
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql._
val test = sqlContext.read.json(sc.parallelize(Seq("""{"a":1,"b":[2,3],"d":[2,3]}""")))
test.printSchema
val flattened = test.withColumn("b", explode($"d"))
flattened.printSchema
def identifyArrayColumns(dataFrame : DataFrame) = {
val output = for ( d <- dataFrame.collect()) yield
{
d.schema
}
output.toList
}
identifyArrayColumns(test)
Output currently is
identifyArrayColumns: (dataFrame: org.apache.spark.sql.DataFrame)List[org.apache.spark.sql.types.StructType]
res58: List[org.apache.spark.sql.types.StructType] = List(StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true)))
It is one full string, so I cannot filter only the array columns. If I do a foreach(println), I get only one line:
scala> output.foreach(println)
StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true))
What I want is each of the StructTypes as a single element in a collection.
You can simply filter the fields of the DataFrame's schema for fields with type array - no need to inspect the DataFrame's data for this:
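import org.apache.spark.sql.types.{StructField, StructType}
// call as identifyArrayColumns(test.schema)
def identifyArrayColumns(schema: StructType): List[StructField] = {
schema.fields.filter(_.dataType.typeName == "array").toList
}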
NOTE that this is a "shallow" solution that only returns the array fields directly under "root". If you also want to find arrays nested inside structs, you need to traverse the schema recursively and produce this filtered result, something like:
// can be converted into a tail-recursive method by adding another argument to accumulate results
def identifyArrayColumns(schema: StructType): List[StructField] = {
val arrays = schema.fields.filter(_.dataType.typeName == "array").toList
val deeperArrays = schema.fields.flatMap {
case f @ StructField(_, s: StructType, _, _) => identifyArrayColumns(s)
case _ => List()
}
arrays ++ deeperArrays
}

How to add to an Immutable map : Scala

I have a ResultSet object returned from Hive using JDBC.
I am trying to store the values in a resultset in a Scala Immutable Map.
How can I add these values to an immutable Map while iterating over the ResultSet with a while loop?
val m : Map[String, String] = null
while ( resultSet.next() ) {
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
m += (col -> data) // This Gives Reassignment error
}
I propose:
Iterator.continually(resultSet.next()) // advance the cursor before each read
.takeWhile(identity) // stop once next() returns false
.map { _ =>
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
col -> data
}.toMap
Instead of thinking "let's init an empty collection and fill it", which is imho the mutable way to think, this proposal thinks in terms of "let's declare how to build a collection with those elements in it and be done" :-)
You might want to use scala.collection.Iterator[A] so that you can create an immutable Map out of your Java ResultSet.
val myMap : Map[String, String] = new Iterator[(String, String)] {
override def hasNext = resultSet.next()
override def next() = {
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
col -> data
}
}.toMap
Otherwise you have to use mutable scala.collection.mutable.Map.
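One caveat with an Iterator whose hasNext calls resultSet.next() directly: every hasNext call advances the cursor, so calling it twice in a row skips a row. A slightly more defensive sketch is below. ResultSetSketch, RowIterator and columnTypes are hypothetical names, not part of any library; only java.sql.ResultSet and its next() and getString methods are real API.

```scala
import java.sql.ResultSet

object ResultSetSketch {
  // Iterator over a ResultSet whose hasNext can be called repeatedly
  // without skipping rows: the cursor is advanced at most once per element.
  final class RowIterator[A](rs: ResultSet)(read: ResultSet => A) extends Iterator[A] {
    private var advanced = false     // has the cursor been moved for the upcoming element?
    private var rowAvailable = false // result of the last rs.next() call

    def hasNext: Boolean = {
      if (!advanced) {
        rowAvailable = rs.next()
        advanced = true
      }
      rowAvailable
    }

    def next(): A = {
      if (!hasNext) throw new NoSuchElementException("ResultSet is exhausted")
      advanced = false
      read(rs)
    }
  }

  // Builds the immutable Map in one expression, as in the answers above.
  def columnTypes(rs: ResultSet): Map[String, String] =
    new RowIterator[(String, String)](rs)(r =>
      r.getString("col_name") -> r.getString("data_type")
    ).toMap
}
```

The resulting Map is immutable, and the only mutable state left is the cursor bookkeeping hidden inside the iterator.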