How to unpack multiple keys in a Spark DataSet

How to unpack multiple keys in a Spark DataSet - scala

I have the following DataSet, with the following structure.
case class Person(age: Int, gender: String, salary: Double)
I want to determine the average salary by gender and age, thus I group the DS by both keys. I've encountered two main problems, one is that both keys are mixed in a single one, but I want to keep them in two different columns, the other is that the aggregated column gets a silly long name and I can't figure out how to rename it (apparently as and alias won't work) all of this using the DS API.
val df = sc.parallelize(List(Person(100000.00, "male", 27),
Person(120000.00, "male", 27),
Person(95000, "male", 26),
Person(89000, "female", 31),
Person(250000, "female", 51),
Person(120000, "female", 51)
)).toDF.as[Person]
df.groupByKey(p => (p.gender, p.age)).agg(typed.avg(_.salary)).show()
+-----------+------------------------------------------------------------------------------------------------+
| key| TypedAverage(line2503618a50834b67a4b132d1b8d2310b12.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Person)|
+-----------+------------------------------------------------------------------------------------------------+
|[female,31]| 89000.0...
|[female,51]| 185000.0...
| [male,27]| 110000.0...
| [male,26]| 95000.0...
+-----------+------------------------------------------------------------------------------------------------+

Aliasing is an untyped action, so you must retype it after. And the only way to unpack the key is to do it after, via a select or something:
df.groupByKey(p => (p.gender, p.age))
.agg(typed.avg[Person](_.salary).as("average_salary").as[Double])
.select($"key._1",$"key._2",$"average_salary").show

The easiest way to achieve both goals is to map() from the aggregation result to the Person instance again:
.map{case ((gender, age), salary) => Person(gender, age, salary)}
The result will look best if slightly re-arrange the order of arguments in the case class'es constructor:
case class Person(gender: String, age: Int, salary: Double)
+------+---+--------+
|gender|age| salary|
+------+---+--------+
|female| 31| 89000.0|
|female| 51|185000.0|
| male| 27|110000.0|
| male| 26| 95000.0|
+------+---+--------+
Full code:
import session.implicits._
val df = session.sparkContext.parallelize(List(
Person("male", 27, 100000),
Person("male", 27, 120000),
Person("male", 26, 95000),
Person("female", 31, 89000),
Person("female", 51, 250000),
Person("female", 51, 120000)
)).toDS
import org.apache.spark.sql.expressions.scalalang.typed
df.groupByKey(p => (p.gender, p.age))
.agg(typed.avg(_.salary))
.map{case ((gender, age), salary) => Person(gender, age, salary)}
.show()

Related

How to select columns that exist in case classes from DataFrame

Given a spark DataFrame with columns "id", "first", "last", "year"
val df=sc.parallelize(Seq(
(1, "John", "Doe", 1986),
(2, "Ive", "Fish", 1990),
(4, "John", "Wayne", 1995)
)).toDF("id", "first", "last", "year")
and case class
case class IdAndLastName(
id: Int,
last:String )
I would like to only select columns in case class which are id and last. In other words, I would like to have this output df.select("id","last") by using case class. I am avoiding hardcoding the attributes. Could you please help me how can I achieve this in a compact way.

You can create explictly an encoder for the case class (usually this happens implicitly here). Then you can get the field names from the encoder and use them in the select statement:
val fieldnames = Encoders.product[IdAndLastName].schema.fieldNames
df.select(fieldnames.head, fieldnames.tail:_*).show()
Output:
+---+-----+
| id| last|
+---+-----+
| 1| Doe|
| 2| Fish|
| 4|Wayne|
+---+-----+

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col
val cols = Encoders.product[IdAndLastName].schema.fieldNames.map(col)
df.select(cols: _*).show()

Convert MapPartitionsRDD to DataFrame and grouping data by 2 keys

I have a dataframe which looks like this:
country | user | count
----------------------
Germany | Sarah| 2
China | Paul | 1
Germany | Alan | 3
Germany | Paul | 1
...
What I am trying to do is to convert this dataframe to another which looks like this:
dimension | value
--------------------------------------------
Country | [Germany -> 4, China -> 1]
--------------------------------------------
User | [Sarah -> 2, Paul -> 2, Alan -> 3]
...
At first I tried to do it by this:
var newDF = Seq.empty[(String, Map[String,Long])].toDF("dimension", "value")
df.collect()
.foreach(row => { Array(0,1)
.map(pos =>
newDF = newDF.union(Seq((df.columns.toSeq(pos).toString, Map(row.mkString(",").split(",")(pos) -> row.mkString(",").split(",")(2).toLong))).toDF())
)
})
val newDF2 = newDF.groupBy("dimension").agg(collect_list("value")).as[(String, Seq[Map[String, Long]])].map {case (id, list) => (id, list.reduce(_ |+| _))}.toDF("dimension", "value")
But the collect() was killing my driver. Therefore, I have tried to do it like this:
class DimItem[T](val dimension: String, val value: String, val metric: T)
val items: RDD[DimItem[Long]] = df.rdd.flatMap(row => {
dims.zipWithIndex.map{case (dim, i) =>
new DimItem(dim, row(i).toString, row(13).asInstanceOf[Long])
}
})
// with the format [ DimItem(Country, Germany, 2), DimItem(User, Sarah, 2)], ...
val itemsGrouped: RDD[((String, String), Iterable[DimItem[Long]])] = items.groupBy(x => (x.dimension, x.value))
val aggregatedItems: RDD[DimItem[Long]] = itemsGrouped.map{case (key, items) => new DimItem(key._1, key._2, items.reduce((a,b) => a.metric + b.metric)}
The idea is to save in an RDD objects like (Country, China, 1), (Country, Germany, 3), (Country, Germany, 1), ... and then group it by the 2 first keys (Country, China), (Country, Germany), ... Once grouped, sum the count they have. Ex: having (Country, Germany, 3), (Country, Germany, 1) will become (Country, Germany, 4).
But once I get here, it tells me that in items.reduce() there is a mismatch: it expects a DimItem[Long] but gets a Long.
Next step will be to group it by the key "dimension" and create the Map[String, Int]()format in the column "value" and convert it to a DF.
I have 2 questions.
First: is this last code correct?
Second: How can I convert this MapPartitionsRDD into a DF?

Here is one solution based on dataframe API:
import org.apache.spark.sql.functions.{lit, map_from_arrays, collect_list}
def transform(df :DataFrame, colName: String) : DataFrame =
df.groupBy(colName)
.agg{sum("count").as("sum")}
.agg{
map_from_arrays(
collect_list(colName),
collect_list("sum")
).as("value")
}.select(lit(colName).as("dimension"), $"value")
val countryDf = transform(df, "country")
val userDf = transform(df, "user")
countryDf.unionByName(userDf).show(false)
// +---------+----------------------------------+
// |dimension|value |
// +---------+----------------------------------+
// |Country |[Germany -> 6, China -> 1] |
// |User |[Sarah -> 2, Alan -> 3, Paul -> 2]|
// +---------+----------------------------------+
Analysis: first we get the sum by country and user grouping by country and user respectively. Next we add one more custom aggregation to the pipeline which collects the previous results into a map. Map will be populated via map_from_arrays function found in Spark 2.4.0. The keys/values of the map we collect them with collect_list. Finally we union the two dataframes to populate the final results.

Sample a different number of random rows for every group in a dataframe in spark scala

The goal is to sample (without replacement) a different number of rows in a dataframe for every group. The number of rows to sample for a specific group is in another dataframe.
Example: idDF is the dataframe to sample from. The groups are denoted by the ID column. The dataframe, planDF specifies the number of rows to sample for each group where "datesToUse" denotes the number of rows, and "ID" denotes the group. "totalDates" is the total number of rows for that group and may or may not be useful.
The final result should have 3 rows sampled from the first group (ID 1), 2 rows sampled from the second group (ID 2) and 1 row sampled from the third group (ID 3).
val idDF = Seq(
(1, "2017-10-03"),
(1, "2017-10-22"),
(1, "2017-11-01"),
(1, "2017-10-02"),
(1, "2017-10-09"),
(1, "2017-12-24"),
(1, "2017-10-20"),
(2, "2017-11-17"),
(2, "2017-11-12"),
(2, "2017-12-02"),
(2, "2017-10-03"),
(3, "2017-12-18"),
(3, "2017-11-21"),
(3, "2017-12-13"),
(3, "2017-10-08"),
(3, "2017-10-16"),
(3, "2017-12-04")
).toDF("ID", "date")
val planDF = Seq(
(1, 3, 7),
(2, 2, 4),
(3, 1, 6)
).toDF("ID", "datesToUse", "totalDates")
this is an example of what a resultant dataframe should look like:
+---+----------+
| ID| date|
+---+----------+
| 1|2017-10-22|
| 1|2017-11-01|
| 1|2017-10-20|
| 2|2017-11-12|
| 2|2017-10-03|
| 3|2017-10-16|
+---+----------+
So far, I tried to use the sample method for DataFrame: https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/DataFrame.html
Here is an example that would work for an entire data frame.
def sampleDF(DF: DataFrame, datesToUse: Int, totalDates: Int): DataFrame = {
val fraction = datesToUse/totalDates.toFloat.toDouble
DF.sample(false, fraction)
}
I cant figure out how to use something like this for each group. I tried joining the planDF table to the idDF table and using a window partition.
Another idea I had was to somehow make a new column with randomly labeled True / false and then filter on that column.

Another option staying entirely in Dataframes would be to compute probabilities using your planDF, join with idDF, append a column of random numbers and then filter. Helpfully, sql.functions has a rand function.
import org.apache.spark.sql.functions._
import spark.implicits._
val probabilities = planDF.withColumn("prob", $"datesToUse" / $"totalDates")
val dfWithProbs = idDF.join(probabilities, Seq("ID"))
.withColumn("rand", rand())
.where($"rand" < $"prob")
(You'll want to double check that that isn't integer division.)

With the assumption that your planDF is small enough to be collected, you can use Scala's foldLeft to traverse the id list and accumulate the sample Dataframes per id:
import org.apache.spark.sql.{Row, DataFrame}
def sampleByIdDF(DF: DataFrame, id: Int, datesToUse: Int, totalDates: Int): DataFrame = {
val fraction = datesToUse.toDouble / totalDates
DF.where($"id" === id ).sample(false, fraction)
}
val emptyDF = Seq.empty[(Int, String)].toDF("ID", "date")
val planList = planDF.rdd.collect.map{ case Row(x: Int, y: Int, z: Int) => (x, y, z) }
// planList: Array[(Int, Int, Int)] = Array((1,3,7), (2,2,4), (3,1,6))
planList.foldLeft( emptyDF ){
case (accDF: DataFrame, (id: Int, num: Int, total: Int)) =>
accDF union sampleByIdDF(idDF, id, num, total)
}
// res1: org.apache.spark.sql.DataFrame = [ID: int, date: string]
// res1.show
// +---+----------+
// | ID| date|
// +---+----------+
// | 1|2017-10-03|
// | 1|2017-11-01|
// | 1|2017-10-02|
// | 1|2017-12-24|
// | 1|2017-10-20|
// | 2|2017-11-17|
// | 2|2017-11-12|
// | 2|2017-12-02|
// | 3|2017-11-21|
// | 3|2017-12-13|
// +---+----------+
Note that method sample() does not necessarily generate the exact number of samples specified in the method arguments. Here's a relevant SO Q&A.
If your planDF is large, you might have to consider using RDD's aggregate, which has the following signature (skipping the implicit argument):
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
It works somewhat like foldLeft, except that it has one accumulation operator within a partition and an additional one to comine results from different partitions.

Spark SQL - Generate array of arrays from the sql function

I want to create an array of arrays. This is my data table:
// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)
// For SQL usage we need to register the table
df.registerTempTable("df")
I want to create an array of integer column "age". For that I use "collect_list":
sqlContext.sql("SELECT collect_list(age) as age from df").show
But now I want to generate an array containing multiple arrays as created above:
sqlContext.sql("SELECT collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show
But this does not work , or use the function org.apache.spark.sql.functions.array. Any ideas?

Ok, things can't get more simple. Let's consider the same data you are working on and go step by step from there
// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)
// For SQL usage we need to register the table
df.registerTempTable("df")
sqlContext.sql("select collect_list(age) as age from df").show
// +----------------+
// | age|
// +----------------+
// |[21, 26, 52, 31]|
// +----------------+
sqlContext.sql("select collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show
As the error message says :
org.apache.spark.sql.AnalysisException: No handler for Hive udf class
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Exactly one argument is expected..; line 1 pos 52 [...]
collest_list takes just one argument. Let's check the documentation here.
It actually takes one argument ! But let's go further in the documentation of the functions object. You seem to have noticed that the array function allows you to create a new array column out of a Column or a repeated Column parameter. So let's use that :
sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df").show(false)
The array function create indeed a column from the column list create before-hand by collect_list on both age and salary :
// +-------------------------------------------------------------------+
// |arrayInt |
// +-------------------------------------------------------------------+
// |[WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450)]|
// +-------------------------------------------------------------------+
Where do we go from here ?
You have to remember that a Row from a DataFrame is just another collection wrapped by a Row.
The first thing I'll do is work on that collection. So How do we flatten a WrappedArray[WrappedArray[Int]] ?
Scala is kind of magical you just need to use .flatten
import scala.collection.mutable.WrappedArray
val firstRow: mutable.WrappedArray[mutable.WrappedArray[Int]] =
sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df")
.first.get(0).asInstanceOf[WrappedArray[WrappedArray[Int]]]
// res26: scala.collection.mutable.WrappedArray[scala.collection.mutable.WrappedArray[Int]] =
// WrappedArray(WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450))
firstRow.flatten
// res27: scala.collection.mutable.IndexedSeq[Int] = ArrayBuffer(21, 26, 52, 31, 905, 1130, 1890, 1450)
Now let's wrap it in a UDF so we can use it on the DataFrame :
def flatten(array: WrappedArray[WrappedArray[Int]]) = array.flatten
sqlContext.udf.register("flatten", flatten(_: WrappedArray[WrappedArray[Int]]))
Since we registered the UDF, we can now use it inside the sqlContext :
sqlContext.sql("select flatten(array(collect_list(age), collect_list(salary))) as arrayInt from df").show(false)
// +---------------------------------------+
// |arrayInt |
// +---------------------------------------+
// |[21, 26, 52, 31, 905, 1130, 1890, 1450]|
// +---------------------------------------+
I hope this helps !

Let's create the DataFrame the way have created above.
// A case class for our sample table
import org.apache.spark.sql.functions._
case class Testing(name: String, age: Int, salary: Int)
// Create an RDD with some data
val x = sc.parallelize(Array(
Testing(null, 21, 905),
Testing("Noelia", 26, 1130),
Testing("Pilar", 52, 1890),
Testing("Roberto", 31, 1450)
))
// Convert RDD to a DataFrame
val df = spark.createDataFrame(x)
Here we can use array_union function to achieve the desired result. array_unionfunction will return the union of all elements from the input arrays. This function is available since spark 2.4.0
// Scala Ref : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
// Pyspark Ref : https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_union
df.select(collect_list("age").as("age"), collect_list("salary").as("salary"))
.withColumn("new_col", array_union($"age", $"salary")).show(truncate=false)
// Output
+----------------+-----------------------+---------------------------------------+
|age |salary |new_col |
+----------------+-----------------------+---------------------------------------+
|[21, 26, 52, 31]|[905, 1130, 1890, 1450]|[21, 26, 52, 31, 905, 1130, 1890, 1450]|
+----------------+-----------------------+---------------------------------------+
I hope this helps.

reduceBykey Spark maintain order

My input dataset looks like
id1, 10, v1
id2, 9, v2
id2, 34, v3
id1, 6, v4
id1, 12, v5
id2, 2, v6
and I want output
id1; 6,v4 | 10,v1 | 12,v5
id2; 2,v6 | 9,v2 | 34,v3
This is such that
id1: array[num(i),value(i)] where num(i) should be sorted
What I have tried:
Get id and 2nd column as key, sortByKey, but since it's a string,
sorting doesn't happen like a int, but as string
Get 2nd column as key, sortByKey, then get id and key and 2nd
column in value, reduceByKey. But in this case, while doing
reduceByKey; order is not preserved. Even groupByKey is not preventing
the order. Actually this is expected.
Any help will be appreciated.

Since you didn't provide any information about input type I assume it is RDD[(String, Int, String)]:
val rdd = sc.parallelize(
("id1", 10, "v1") :: ("id2", 9, "v2") ::
("id2", 34, "v3") :: ("id1", 6, "v4") ::
("id1", 12, "v5") :: ("id2", 2, "v6") :: Nil)
rdd
.map{case (id, x, y) => (id, (x, y))}
.groupByKey
.mapValues(iter => iter.toList.sortBy(_._1))
.sortByKey() // Optional if you want id1 before id2
Edit:
To get an output you've described in the comments you can replace function passed to mapValues with something like this:
def process(iter: Iterable[(Int, String)]): String = {
iter.toList
.sortBy(_._1)
.map{case (x, y) => s"$x,$y"}
.mkString("|")
}