Transposing DataFrame columns in Spark Scala [duplicate]

This question already has answers here: How to pivot Spark DataFrame? (10 answers). Closed 4 years ago.
I am finding it hard to transpose columns in a DataFrame.
Given below are the base DataFrame and the expected output:
Student Class Subject Grade
Sam 6th Grade Maths A
Sam 6th Grade Science A
Sam 7th Grade Maths A-
Sam 7th Grade Science A
Rob 6th Grade Maths A
Rob 6th Grade Science A-
Rob 7th Grade Maths A-
Rob 7th Grade Science B
Rob 7th Grade AP A
Expected output:
Student  Class      Math_Grade  Science_Grade  AP_Grade
Sam      6th Grade  A           A
Sam      7th Grade  A-          A
Rob      6th Grade  A           A-
Rob      7th Grade  A-          B              A
Please suggest what is the best way to solve this.

You can group the DataFrame by Student, Class and pivot Subject as follows:
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF on a local Seq

val df = Seq(
("Sam", "6th Grade", "Maths", "A"),
("Sam", "6th Grade", "Science", "A"),
("Sam", "7th Grade", "Maths", "A-"),
("Sam", "7th Grade", "Science", "A"),
("Rob", "6th Grade", "Maths", "A"),
("Rob", "6th Grade", "Science", "A-"),
("Rob", "7th Grade", "Maths", "A-"),
("Rob", "7th Grade", "Science", "B"),
("Rob", "7th Grade", "AP", "A")
).toDF("Student", "Class", "Subject", "Grade")
df.
groupBy("Student", "Class").pivot("Subject").agg(first("Grade")).
orderBy("Student", "Class").
show
// +-------+---------+----+-----+-------+
// |Student| Class| AP|Maths|Science|
// +-------+---------+----+-----+-------+
// | Rob|6th Grade|null| A| A-|
// | Rob|7th Grade| A| A-| B|
// | Sam|6th Grade|null| A| A|
// | Sam|7th Grade|null| A-| A|
// +-------+---------+----+-----+-------+
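If you also need the exact column names from the expected output, you can rename the pivoted columns afterwards. A minimal sketch building on the df above (the "_Grade" suffixing is just one assumed naming convention):
// Rename every pivoted subject column to <Subject>_Grade
val pivoted = df.groupBy("Student", "Class").pivot("Subject").agg(first("Grade"))
val renamed = Seq("Maths", "Science", "AP").foldLeft(pivoted) { (acc, subject) =>
  acc.withColumnRenamed(subject, s"${subject}_Grade")
}
renamed.orderBy("Student", "Class").show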

You can simply group by Student and Class and pivot on Subject:
import org.apache.spark.sql.functions.{first, desc, asc}
import spark.implicits._

case class StudentRecord(Student: String, `Class`: String, Subject: String, Grade: String)

val rows = Seq(
  StudentRecord("Sam", "6th Grade", "Maths", "A"),
  StudentRecord("Sam", "6th Grade", "Science", "A"),
  StudentRecord("Sam", "7th Grade", "Maths", "A-"),
  StudentRecord("Sam", "7th Grade", "Science", "A"),
  StudentRecord("Rob", "6th Grade", "Maths", "A"),
  StudentRecord("Rob", "6th Grade", "Science", "A-"),
  StudentRecord("Rob", "7th Grade", "Maths", "A-"),
  StudentRecord("Rob", "7th Grade", "Science", "B"),
  StudentRecord("Rob", "7th Grade", "AP", "A")
).toDF()

rows.groupBy("Student", "Class")
  .pivot("Subject")
  .agg(first("Grade"))
  .orderBy(desc("Student"), asc("Class"))
  .show()
/**
* +-------+---------+----+-----+-------+
* |Student| Class| AP|Maths|Science|
* +-------+---------+----+-----+-------+
* | Sam|6th Grade|null| A| A|
* | Sam|7th Grade|null| A-| A|
* | Rob|6th Grade|null| A| A-|
* | Rob|7th Grade| A| A-| B|
* +-------+---------+----+-----+-------+
*/
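If the set of subjects is known up front, you can also pass the pivot values explicitly. This avoids the extra job Spark otherwise runs to collect the distinct Subject values and fixes the column order. A sketch reusing rows from above:
rows.groupBy("Student", "Class")
  .pivot("Subject", Seq("Maths", "Science", "AP"))
  .agg(first("Grade"))
  .orderBy("Student", "Class")
  .show()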

Related

How to Replace Variable with Level in tbl_summary

A reviewer asked that, rather than listing both genders in the table, we include just one. Gender would be replaced with Female, and the proportion that was female would appear under each Treatment.
library(gtsummary)
d<-tibble::tribble(
~Gender, ~Treatment,
"Male", "A",
"Male", "B",
"Female", "A",
"Male", "C",
"Female", "B",
"Female", "C")
d %>% tbl_summary(by = Treatment)
One way to do this would be to remove the first row of table_body and keep only the second row; this retains only the information on Female and matches the table you provided in the comments.
library(gtsummary)
library(dplyr)  # needed for filter()
d <- tibble::tribble(
  ~Gender, ~Treatment,
  "Male", "A",
  "Male", "B",
  "Female", "A",
  "Male", "C",
  "Female", "B",
  "Female", "C")
t1 <- d %>% filter(Gender == "Female") %>% tbl_summary(by = Treatment)
t1$table_body <- t1$table_body[2, ]
t1

join datasets with different dimensions - how to aggregate data properly

I am working on a complex piece of logic where I need to redistribute a quantity from one dataset to another.
This question is a continuation of this question.
In the example below I introduce several new dimensions. After aggregating and distributing all the quantities I expect the same total quantity, but I am seeing some differences.
See the example below
package playground
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, round, sum}
object sample3 {
val spark = SparkSession
.builder()
.appName("Sample app")
.master("local")
.getOrCreate()
val sc = spark.sparkContext
final case class Owner(a: Long, b: String, c: Long, d: Short, e: String, f: String, o_qtty: Double)
// notice column d is not present in Invoice
final case class Invoice(c: Long, a: Long, b: String, e: String, f: String, i_qtty: Double)
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.OFF)
import spark.implicits._
val ownerData = Seq(
Owner(11, "A", 666, 2017, "x", "y", 50),
Owner(11, "A", 222, 2018, "x", "y", 20),
Owner(33, "C", 444, 2018, "x", "y", 20),
Owner(33, "C", 555, 2018, "x", "y", 120),
Owner(22, "B", 555, 2018, "x", "y", 20),
Owner(99, "D", 888, 2018, "x", "y", 100),
Owner(11, "A", 888, 2018, "x", "y", 100),
Owner(11, "A", 666, 2018, "x", "y", 80),
Owner(33, "C", 666, 2018, "x", "y", 80),
Owner(11, "A", 444, 2018, "x", "y", 50),
)
val invoiceData = Seq(
Invoice(444, 33, "C", "x", "y", 10),
Invoice(999, 22, "B", "x", "y", 200),
Invoice(666, 11, "A", "x", "y", 15),
Invoice(555, 22, "B", "x", "y", 200),
Invoice(888, 11, "A", "x", "y", 12),
)
val owners = spark
.createDataset(ownerData)
.as[Owner]
.cache()
val invoices = spark
.createDataset(invoiceData)
.as[Invoice]
.cache()
val p1 = owners
.join(invoices, Seq("a", "c", "e", "f", "b"))
.selectExpr(
"a",
"d",
"b",
"e",
"f",
"c",
"IF(o_qtty-i_qtty < 0,o_qtty,o_qtty - i_qtty) AS qtty",
"IF(o_qtty-i_qtty < 0,0,i_qtty) AS to_distribute"
)
val p2 = owners
.join(invoices, Seq("a", "c", "e", "f", "b"), "left_outer")
.filter(row => row.anyNull)
.drop(col("i_qtty"))
.withColumnRenamed("o_qtty", "qtty")
val distribute = p1
.groupBy("a", "d", "b", "e", "f")
.agg(sum(col("to_distribute")).as("to_distribute"))
val proportion = p2
.groupBy("a", "d", "b", "e", "f")
.agg(sum(col("qtty")).as("proportion"))
val result = p2
.join(distribute, Seq("a", "d", "b", "e", "f"))
.join(proportion, Seq("a", "d", "b", "e", "f"))
.withColumn(
"qtty",
round(
((col("to_distribute") / col("proportion")) * col("qtty")) + col(
"qtty"
),
2
)
)
.drop("to_distribute", "proportion")
.union(p1.drop("to_distribute"))
result.show(false)
result.selectExpr("SUM(qtty)").show()
owners.selectExpr("SUM(o_qtty)").show()
/*
+---+----+---+---+---+---+-----+
|a |d |b |e |f |c |qtty |
+---+----+---+---+---+---+-----+
|11 |2018|A |x |y |222|27.71|
|33 |2018|C |x |y |555|126.0|
|33 |2018|C |x |y |666|84.0 |
|11 |2018|A |x |y |444|69.29|
|11 |2017|A |x |y |666|35.0 |
|33 |2018|C |x |y |444|10.0 |
|22 |2018|B |x |y |555|20.0 |
|11 |2018|A |x |y |888|88.0 |
|11 |2018|A |x |y |666|65.0 |
+---+----+---+---+---+---+-----+
+---------+
|sum(qtty)|
+---------+
| 525.0|
+---------+
+-----------+
|sum(o_qtty)|
+-----------+
| 640.0|
+-----------+
*/
}
}
Also, note that the aggregation must not produce any negative quantity.
Here is the code where changes were necessary:
val distribute = p1
.groupBy("a","b", "e", "f") // now we don't need to aggregate by field "d"
.agg(sum(col("to_distribute")).as("to_distribute"))
val proportion = p2
.groupBy("a","b", "e", "f") // now we don't need to aggregate by field "d"
.agg(sum(col("qtty")).as("proportion"))
// Here we remove "d" from the join
// If the distribution is null(there is no data in invoices for that owner)
// then we keep the original "qtty"
// column "d" from p2 dataframe was renamed as "year"
val result = p2
.join(distribute, Seq("a","b", "e", "f"),"left_outer")
.join(proportion, Seq("a","b", "e", "f"))
.selectExpr("a","b","e","f","c","IF(ROUND( ((to_distribute/proportion) * qtty) + qtty, 2) IS NULL,qtty,ROUND( ((to_distribute/proportion) * qtty) + qtty, 2)) AS qtty","d AS year")
.union(p1.withColumn("year",col("d")).drop("d","to_distribute"))
.orderBy(col("b"))
****EXPECTED OUTPUT****
+---+---+---+---+---+-----+----+
|a |b |e |f |c |qtty |year|
+---+---+---+---+---+-----+----+
|11 |A |x |y |444|80.0 |2018|
|11 |A |x |y |222|32.0 |2018|
|11 |A |x |y |666|65.0 |2018|
|11 |A |x |y |888|88.0 |2018|
|11 |A |x |y |666|35.0 |2017|
|22 |B |x |y |555|20.0 |2018|
|33 |C |x |y |555|126.0|2018|
|33 |C |x |y |444|10.0 |2018|
|33 |C |x |y |666|84.0 |2018|
|99 |D |x |y |888|100.0|2018|
+---+---+---+---+---+-----+----+
+---------+
|sum(qtty)|
+---------+
| 640.0|
+---------+
+-----------+
|sum(o_qtty)|
+-----------+
| 640.0|
+-----------+
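As a sanity check, one could compare the redistributed total against the original total programmatically. A minimal sketch, assuming the result and owners datasets defined above:
// Both totals should come out to 640.0 if the redistribution preserves the quantity
// (small drift is possible because of the per-row ROUND(..., 2)).
val redistributedTotal = result.agg(sum("qtty")).first.getDouble(0)
val originalTotal = owners.agg(sum("o_qtty")).first.getDouble(0)
println(s"redistributed = $redistributedTotal, original = $originalTotal")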

Spark (Scala) filter array of structs without explode

I have a dataframe with a key column and a column containing an array of structs. Each row's array looks something like this:
[
{"id" : 1, "someProperty" : "xxx", "someOtherProperty" : "1", "propertyToFilterOn" : 1},
{"id" : 2, "someProperty" : "yyy", "someOtherProperty" : "223", "propertyToFilterOn" : 0},
{"id" : 3, "someProperty" : "zzz", "someOtherProperty" : "345", "propertyToFilterOn" : 1}
]
Now I would like to do two things:
1. Filter on "propertyToFilterOn" = 1
2. Apply some logic on other properties - for example concatenate
So that the result is:
[
{"id" : 1, "newProperty" : "xxx_1"},
{"id" : 3, "newProperty" : "zzz_345"}
]
I know how to do it with explode, but explode also requires a groupBy on the key when putting it back together. And since this is a streaming DataFrame, I would also have to put a watermark on it, which I am trying to avoid.
Is there any other way to achieve this without using explode? I am sure there is some Scala magic that can achieve this!
Thanks!
Spark 2.4+ introduced many higher-order functions for arrays (see https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html).
import org.apache.spark.sql.functions._  // collect_list, struct, expr

val dataframe = Seq(
("a", 1, "xxx", "1", 1),
("a", 2, "yyy", "223", 0),
("a", 3, "zzz", "345", 1)
).toDF( "grouping_key", "id" , "someProperty" , "someOtherProperty", "propertyToFilterOn" )
.groupBy("grouping_key")
.agg(collect_list(struct("id" , "someProperty" , "someOtherProperty", "propertyToFilterOn")).as("your_array"))
dataframe.select("your_array").show(false)
+----------------------------------------------------+
|your_array |
+----------------------------------------------------+
|[[1, xxx, 1, 1], [2, yyy, 223, 0], [3, zzz, 345, 1]]|
+----------------------------------------------------+
You can filter elements within an array using the array filter higher order function like this:
val filteredDataframe = dataframe.select(expr("filter(your_array, your_struct -> your_struct.propertyToFilterOn == 1)").as("filtered_arrays"))
filteredDataframe.show(false)
+----------------------------------+
|filtered_arrays |
+----------------------------------+
|[[1, xxx, 1, 1], [3, zzz, 345, 1]]|
+----------------------------------+
for the "other logic" your talking about you should be able to use the transform higher order array function like so:
val transformedDataframe = filteredDataframe
  .select(expr("transform(filtered_arrays, your_struct -> struct(concat(your_struct.someProperty, '_', your_struct.someOtherProperty)))"))
but there are issues with returning structs from the transform function as described in this post:
http://mail-archives.apache.org/mod_mbox/spark-user/201811.mbox/%3CCALZs8eBgWqntiPGU8N=ENW2Qvu8XJMhnViKy-225ktW+_c0czA#mail.gmail.com%3E
so you are better off using the Dataset API for the transform, like so:
case class YourStruct(id: String, someProperty: String, someOtherProperty: String)
case class YourArray(filtered_arrays: Seq[YourStruct])
case class YourNewStruct(id: String, newProperty: String)
val transformedDataset = filteredDataframe.as[YourArray]
  .map(_.filtered_arrays.map(ys => YourNewStruct(ys.id, ys.someProperty + "_" + ys.someOtherProperty)))
transformedDataset.show(false)
+--------------------------+
|value |
+--------------------------+
|[[1, xxx_1], [3, zzz_345]]|
+--------------------------+
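On newer Spark versions (3.x), the filter and the struct-building transform can also be combined in a single SQL expression with named_struct, avoiding the Dataset round trip. A sketch, assuming the same column and field names as above (not verified on 2.4, where the thread linked earlier reports problems with structs inside transform):
import org.apache.spark.sql.functions.expr
// filter keeps propertyToFilterOn == 1; transform builds the new struct per element
val combined = dataframe.select(
  expr("""transform(
            filter(your_array, s -> s.propertyToFilterOn == 1),
            s -> named_struct('id', s.id, 'newProperty', concat(s.someProperty, '_', s.someOtherProperty))
          )""").as("new_array"))
combined.show(false)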

Split Set into multiple Sets Scala

I have a Set[String] and a number devider: Int. I need to split the set arbitrarily into pieces, each of size devider. Examples:
1.
Set: "a", "bc", "ds", "fee", "s"
devider: 2
result:
Set1: "a", "bc"
Set2: "ds", "fee"
Set3: "s"
2.
Set: "a", "bc", "ds", "fee", "s", "ff"
devider: 3
result:
Set1: "a", "bc", "ds"
Set2: "fee", "s", "ff"
3.
Set: "a", "bc", "ds"
devider: 4
result:
Set1: "a", "bc", "ds"
What is the idiomatic way to do it in Scala?
You probably want something like:
Set("a", "bc", "ds", "fee", "s").grouped(2).toSet
The problem is that a Set, by definition, has no order, so there's no telling which elements will be grouped together.
Set( "a", "bc", "ds", "fee", "s").grouped(2).toSet
//res0: Set[Set[String]] = Set(Set(s, bc), Set(a, ds), Set(fee))
To get them grouped in a particular fashion you'll need to change the Set to one of the ordered collections, order the elements as required, group them, and transition everything back to Sets.
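A minimal sketch of that round trip, assuming alphabetical ordering is acceptable and a devider of 2:
val devider = 2
val grouped: List[Set[String]] =
  Set("a", "bc", "ds", "fee", "s").toList.sorted   // impose an order
    .grouped(devider)                              // chunks of size devider
    .map(_.toSet)                                  // back to Sets
    .toList
// grouped: List(Set(a, bc), Set(ds, fee), Set(s))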
To get deterministic groups you can first convert the Set to an ordered collection such as a List and group that:
val pn = List("a", "bc", "ds", "fee", "s").grouped(2).toSet
println(pn)

Get values from an RDD

I created an RDD with the following format using Scala:
Array[(String, (Array[String], Array[String]))]
How can I get a list of the first Array (Array[1]) from this RDD?
The data for the first line is:
// Array[(String, (Array[String], Array[String]))]
Array(
(
966515171418,
(
Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0),
Array(4579866236, 4579866226, 2015-07-29 04:16:22, 37, 1, 1, 966515171418, 966515183264, 420500052424347, 0, 3083, 9, 5072, 5084, 2, 1, 0, 0)
)
)
)
Assuming you have something like this (just paste into a spark-shell):
val a = Array(
("966515171418",
(Array("4579848447", "4579848453", "2015-07-29 03:27:28", "44", "1", "1", "966515171418", "966515183263", "420500052424347", "0", "52643", "9", "5067", "5084", "2", "1", "0", "0"),
Array("4579866236", "4579866226", "2015-07-29 04:16:22", "37", "1", "1", "966515171418", "966515183264", "420500052424347", "0", "3083", "9", "5072", "5084", "2", "1", "0", "0")))
)
val rdd = sc.makeRDD(a)
then you get the first array using
scala> rdd.first._2._1
res9: Array[String] = Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0)
which means the first row (which is a Tuple2), then the 2nd element of the tuple (which is again a Tuple2), then the 1st element.
Using pattern matching
scala> rdd.first match { case (_, (array1, _)) => array1 }
res30: Array[String] = Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0)
If you want to get it of all rows, just use map():
scala> rdd.map(_._2._1).collect()
which puts the results of all rows into an array.
Another option is to use pattern matching in map():
scala> rdd.map { case (_, (array1, _)) => array1 }.collect()
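The same pattern extends to the second array, or to flattening all values across rows. A small sketch using the rdd defined above:
// Second array of every row:
val secondArrays = rdd.map { case (_, (_, array2)) => array2 }.collect()
// All values of the first arrays, flattened into a single RDD[String]:
val flattened = rdd.flatMap { case (_, (array1, _)) => array1 }
flattened.take(3).foreach(println)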