I am working on some complex logic where I need to redistribute a quantity from one dataset to another.
This question is a continuation of an earlier question.
In the example below I introduce several new dimensions. After aggregating and distributing all the quantities I expect the same total quantity, but I end up with some differences.
See the example below:
package playground
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, round, sum}
object sample3 {
val spark = SparkSession
.builder()
.appName("Sample app")
.master("local")
.getOrCreate()
val sc = spark.sparkContext
final case class Owner(a: Long,
b: String,
c: Long,
d: Short,
e: String,
f: String,
o_qtty: Double)
// notice column d is not present in Invoice
final case class Invoice(c: Long,
a: Long,
b: String,
e: String,
f: String,
i_qtty: Double)
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.OFF)
import spark.implicits._
val ownerData = Seq(
Owner(11, "A", 666, 2017, "x", "y", 50),
Owner(11, "A", 222, 2018, "x", "y", 20),
Owner(33, "C", 444, 2018, "x", "y", 20),
Owner(33, "C", 555, 2018, "x", "y", 120),
Owner(22, "B", 555, 2018, "x", "y", 20),
Owner(99, "D", 888, 2018, "x", "y", 100),
Owner(11, "A", 888, 2018, "x", "y", 100),
Owner(11, "A", 666, 2018, "x", "y", 80),
Owner(33, "C", 666, 2018, "x", "y", 80),
Owner(11, "A", 444, 2018, "x", "y", 50),
)
val invoiceData = Seq(
Invoice(444, 33, "C", "x", "y", 10),
Invoice(999, 22, "B", "x", "y", 200),
Invoice(666, 11, "A", "x", "y", 15),
Invoice(555, 22, "B", "x", "y", 200),
Invoice(888, 11, "A", "x", "y", 12),
)
val owners = spark
.createDataset(ownerData)
.as[Owner]
.cache()
val invoices = spark
.createDataset(invoiceData)
.as[Invoice]
.cache()
val p1 = owners
.join(invoices, Seq("a", "c", "e", "f", "b"))
.selectExpr(
"a",
"d",
"b",
"e",
"f",
"c",
"IF(o_qtty-i_qtty < 0,o_qtty,o_qtty - i_qtty) AS qtty",
"IF(o_qtty-i_qtty < 0,0,i_qtty) AS to_distribute"
)
val p2 = owners
.join(invoices, Seq("a", "c", "e", "f", "b"), "left_outer")
.filter(row => row.anyNull)
.drop(col("i_qtty"))
.withColumnRenamed("o_qtty", "qtty")
val distribute = p1
.groupBy("a", "d", "b", "e", "f")
.agg(sum(col("to_distribute")).as("to_distribute"))
val proportion = p2
.groupBy("a", "d", "b", "e", "f")
.agg(sum(col("qtty")).as("proportion"))
val result = p2
.join(distribute, Seq("a", "d", "b", "e", "f"))
.join(proportion, Seq("a", "d", "b", "e", "f"))
.withColumn(
"qtty",
round(
((col("to_distribute") / col("proportion")) * col("qtty")) + col(
"qtty"
),
2
)
)
.drop("to_distribute", "proportion")
.union(p1.drop("to_distribute"))
result.show(false)
result.selectExpr("SUM(qtty)").show()
owners.selectExpr("SUM(o_qtty)").show()
/*
+---+----+---+---+---+---+-----+
|a |d |b |e |f |c |qtty |
+---+----+---+---+---+---+-----+
|11 |2018|A |x |y |222|27.71|
|33 |2018|C |x |y |555|126.0|
|33 |2018|C |x |y |666|84.0 |
|11 |2018|A |x |y |444|69.29|
|11 |2017|A |x |y |666|35.0 |
|33 |2018|C |x |y |444|10.0 |
|22 |2018|B |x |y |555|20.0 |
|11 |2018|A |x |y |888|88.0 |
|11 |2018|A |x |y |666|65.0 |
+---+----+---+---+---+---+-----+
+---------+
|sum(qtty)|
+---------+
| 525.0|
+---------+
+-----------+
|sum(o_qtty)|
+-----------+
| 640.0|
+-----------+
*/
}
}
Also, note that the aggregation must not produce any negative quantity.
Below I show only the code where changes were necessary.
val distribute = p1
.groupBy("a","b", "e", "f") // now we don't need to aggregate by field "d"
.agg(sum(col("to_distribute")).as("to_distribute"))
val proportion = p2
.groupBy("a","b", "e", "f") // now we don't need to aggregate by field "d"
.agg(sum(col("qtty")).as("proportion"))
// Here we remove "d" from the join
// If the distribution is null (there is no data in invoices for that owner)
// then we keep the original "qtty".
// Column "d" from the p2 dataframe is renamed to "year".
val result = p2
.join(distribute, Seq("a","b", "e", "f"),"left_outer")
.join(proportion, Seq("a","b", "e", "f"))
.selectExpr("a","b","e","f","c","IF(ROUND( ((to_distribute/proportion) * qtty) + qtty, 2) IS NULL,qtty,ROUND( ((to_distribute/proportion) * qtty) + qtty, 2)) AS qtty","d AS year")
.union(p1.withColumn("year",col("d")).drop("d","to_distribute"))
.orderBy(col("b"))
****EXPECTED OUTPUT****
+---+---+---+---+---+-----+----+
|a |b |e |f |c |qtty |year|
+---+---+---+---+---+-----+----+
|11 |A |x |y |444|80.0 |2018|
|11 |A |x |y |222|32.0 |2018|
|11 |A |x |y |666|65.0 |2018|
|11 |A |x |y |888|88.0 |2018|
|11 |A |x |y |666|35.0 |2017|
|22 |B |x |y |555|20.0 |2018|
|33 |C |x |y |555|126.0|2018|
|33 |C |x |y |444|10.0 |2018|
|33 |C |x |y |666|84.0 |2018|
|99 |D |x |y |888|100.0|2018|
+---+---+---+---+---+-----+----+
+---------+
|sum(qtty)|
+---------+
| 640.0|
+---------+
+-----------+
|sum(o_qtty)|
+-----------+
| 640.0|
+-----------+
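To double-check results like the ones above, a minimal sanity check can be added at the end of main. This is only a sketch, assuming the result and owners datasets defined above and the sum/col imports already present in the file:
// Sketch: the redistributed total must equal the original total (up to rounding)
// and no quantity may be negative, as required above.
val redistributedTotal = result.agg(sum("qtty")).first().getDouble(0)
val originalTotal = owners.agg(sum("o_qtty")).first().getDouble(0)
assert(math.abs(redistributedTotal - originalTotal) < 0.01)
assert(result.filter(col("qtty") < 0).count() == 0)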
A reviewer asked that rather than have both genders listed in the table, to just include one. So, Gender would be replaced with Female and the proportions of the gender that was female would be under each Treatment.
library(gtsummary)
d<-tibble::tribble(
~Gender, ~Treatment,
"Male", "A",
"Male", "B",
"Female", "A",
"Male", "C",
"Female", "B",
"Female", "C")
d %>% tbl_summary(by = Treatment)
One way to do this would be to remove the first row of table_body and keep only the second row; that retains only the information on Female. This matches the table you provided in the comments.
library(gtsummary)
d<-tibble::tribble(
~Gender, ~Treatment,
"Male", "A",
"Male", "B",
"Female", "A",
"Male", "C",
"Female", "B",
"Female", "C")
t1 <- d %>% filter(Gender == "Female") %>% tbl_summary(by = Treatment)
t1$table_body <- t1$table_body[2,]
t1
I have a dataframe with a key column and a column containing an array of structs. In each row the array column looks something like this:
[
{"id" : 1, "someProperty" : "xxx", "someOtherProperty" : "1", "propertyToFilterOn" : 1},
{"id" : 2, "someProperty" : "yyy", "someOtherProperty" : "223", "propertyToFilterOn" : 0},
{"id" : 3, "someProperty" : "zzz", "someOtherProperty" : "345", "propertyToFilterOn" : 1}
]
Now I would like to do two things:
Filter on "propertyToFilterOn" = 1
Apply some logic on other
properties - for example concatenate
So that the result is:
[
{"id" : 1, "newProperty" : "xxx_1"},
{"id" : 3, "newProperty" : "zzz_345"}
]
I know how to do it with explode, but explode also requires a groupBy on the key when putting things back together, and since this is a streaming DataFrame I would also have to put a watermark on it, which I am trying to avoid.
Is there any other way to achieve this without using explode? I am sure there is some Scala magic that can achieve this!
Thanks!
Spark 2.4+ added many higher-order functions for arrays (see https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html).
val dataframe = Seq(
("a", 1, "xxx", "1", 1),
("a", 2, "yyy", "223", 0),
("a", 3, "zzz", "345", 1)
).toDF( "grouping_key", "id" , "someProperty" , "someOtherProperty", "propertyToFilterOn" )
.groupBy("grouping_key")
.agg(collect_list(struct("id" , "someProperty" , "someOtherProperty", "propertyToFilterOn")).as("your_array"))
dataframe.select("your_array").show(false)
+----------------------------------------------------+
|your_array |
+----------------------------------------------------+
|[[1, xxx, 1, 1], [2, yyy, 223, 0], [3, zzz, 345, 1]]|
+----------------------------------------------------+
You can filter elements within an array using the array filter higher order function like this:
val filteredDataframe = dataframe.select(expr("filter(your_array, your_struct -> your_struct.propertyToFilterOn == 1)").as("filtered_arrays"))
filteredDataframe.show(false)
+----------------------------------+
|filtered_arrays |
+----------------------------------+
|[[1, xxx, 1, 1], [3, zzz, 345, 1]]|
+----------------------------------+
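As a side note, on Spark 3.x the same filter can be written without a SQL string, using the Column-based filter function in the Scala API. A small sketch, reusing the dataframe and column names from above:
import org.apache.spark.sql.functions.{col, filter}
// Same filtering as the expr() version, expressed with the typed higher-order function.
val filteredViaFunction = dataframe.select(
  filter(col("your_array"), s => s.getField("propertyToFilterOn") === 1).as("filtered_arrays")
)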
for the "other logic" your talking about you should be able to use the transform higher order array function like so:
val transformedDataframe = filteredDataframe
.select(expr("transform(filtered_arrays, your_struct -> struct(concat(your_struct.someProperty, '_', your_struct.someOtherProperty)))"))
However, there are issues with returning structs from the transform function, as described in this post:
http://mail-archives.apache.org/mod_mbox/spark-user/201811.mbox/%3CCALZs8eBgWqntiPGU8N=ENW2Qvu8XJMhnViKy-225ktW+_c0czA#mail.gmail.com%3E
so you are best off using the Dataset API for the transform, like so:
case class YourStruct(id:String, someProperty: String, someOtherProperty: String)
case class YourArray(filtered_arrays: Seq[YourStruct])
case class YourNewStruct(id:String, newProperty: String)
val transformedDataset = filteredDataframe.as[YourArray].map(_.filtered_arrays.map(ys => YourNewStruct(ys.id, ys.someProperty + "_" + ys.someOtherProperty)))
transformedDataset.show(false)
+--------------------------+
|value |
+--------------------------+
|[[1, xxx_1], [3, zzz_345]]|
+--------------------------+
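If you would rather stay in SQL, one possible workaround (a sketch I have not verified against the thread linked above) is to name the struct fields explicitly with named_struct inside transform; the column names are the ones used earlier in this answer:
import org.apache.spark.sql.functions.expr
// Sketch: named_struct keeps the field names, avoiding anonymous struct columns.
val transformedDf = filteredDataframe.select(
  expr("transform(filtered_arrays, s -> named_struct('id', s.id, 'newProperty', concat(s.someProperty, '_', s.someOtherProperty)))")
    .as("transformed_arrays")
)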
This question already has answers here: How to pivot Spark DataFrame? (10 answers). Closed 4 years ago.
I am finding it hard to transpose columns in a DataFrame.
Given below are the base DataFrame and the expected output.
Student Class Subject Grade
Sam 6th Grade Maths A
Sam 6th Grade Science A
Sam 7th Grade Maths A-
Sam 7th Grade Science A
Rob 6th Grade Maths A
Rob 6th Grade Science A-
Rob 7th Grade Maths A-
Rob 7th Grade Science B
Rob 7th Grade AP A
Expected output:
Student Class Math_Grade Science_Grade AP_Grade
Sam 6th Grade A A
Sam 7th Grade A- A
Rob 6th Grade A A-
Rob 7th Grade A- B A
Please suggest the best way to solve this.
You can group the DataFrame by Student, Class and pivot Subject as follows:
import org.apache.spark.sql.functions._
val df = Seq(
("Sam", "6th Grade", "Maths", "A"),
("Sam", "6th Grade", "Science", "A"),
("Sam", "7th Grade", "Maths", "A-"),
("Sam", "7th Grade", "Science", "A"),
("Rob", "6th Grade", "Maths", "A"),
("Rob", "6th Grade", "Science", "A-"),
("Rob", "7th Grade", "Maths", "A-"),
("Rob", "7th Grade", "Science", "B"),
("Rob", "7th Grade", "AP", "A")
).toDF("Student", "Class", "Subject", "Grade")
df.
groupBy("Student", "Class").pivot("Subject").agg(first("Grade")).
orderBy("Student", "Class").
show
// +-------+---------+----+-----+-------+
// |Student| Class| AP|Maths|Science|
// +-------+---------+----+-----+-------+
// | Rob|6th Grade|null| A| A-|
// | Rob|7th Grade| A| A-| B|
// | Sam|6th Grade|null| A| A|
// | Sam|7th Grade|null| A-| A|
// +-------+---------+----+-----+-------+
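If the exact headers from the expected output are needed, the pivoted columns can be renamed afterwards. A small sketch, with the pivoted result bound to a hypothetical val named pivoted:
val pivoted = df.groupBy("Student", "Class").pivot("Subject").agg(first("Grade"))
// Rename the generated columns to the headers requested in the question.
val renamed = pivoted.
  withColumnRenamed("Maths", "Math_Grade").
  withColumnRenamed("Science", "Science_Grade").
  withColumnRenamed("AP", "AP_Grade")
renamed.orderBy("Student", "Class").show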
You can simply group by the relevant columns and use pivot.
case class StudentRecord(Student: String, `Class`: String, Subject: String, Grade: String)
val rows = Seq(
  StudentRecord("Sam", "6th Grade", "Maths", "A"),
  StudentRecord("Sam", "6th Grade", "Science", "A"),
  StudentRecord("Sam", "7th Grade", "Maths", "A-"),
  StudentRecord("Sam", "7th Grade", "Science", "A"),
  StudentRecord("Rob", "6th Grade", "Maths", "A"),
  StudentRecord("Rob", "6th Grade", "Science", "A-"),
  StudentRecord("Rob", "7th Grade", "Maths", "A-"),
  StudentRecord("Rob", "7th Grade", "Science", "B"),
  StudentRecord("Rob", "7th Grade", "AP", "A")
).toDF()
rows.groupBy("Student", "Class").pivot("Subject").agg(first("Grade")).orderBy(desc("Student"), asc("Class")).show()
/**
* +-------+---------+----+-----+-------+
* |Student| Class| AP|Maths|Science|
* +-------+---------+----+-----+-------+
* | Sam|6th Grade|null| A| A|
* | Sam|7th Grade|null| A-| A|
* | Rob|6th Grade|null| A| A-|
* | Rob|7th Grade| A| A-| B|
* +-------+---------+----+-----+-------+
*/
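A minor optimisation worth noting: if the set of subjects is known in advance, passing it to pivot lets Spark skip the extra job it otherwise runs to compute the distinct pivot values. A sketch using the same rows DataFrame:
import org.apache.spark.sql.functions.{asc, desc, first}
// Sketch: explicit pivot values avoid an extra pass over the data.
rows.groupBy("Student", "Class")
  .pivot("Subject", Seq("Maths", "Science", "AP"))
  .agg(first("Grade"))
  .orderBy(desc("Student"), asc("Class"))
  .show()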
I am working with a DataFrame created from this json:
{"id" : "1201", "name" : "satish", "age" : "25"},
{"id" : "1202", "name" : "krishna", "age" : "28"},
{"id" : "1203", "name" : "amith", "age" : "39"},
{"id" : "1204", "name" : "javed", "age" : "23"},
{"id" : "1205", "name" : "mendy", "age" : "25"},
{"id" : "1206", "name" : "rob", "age" : "24"},
{"id" : "1207", "name" : "prudvi", "age" : "23"}
Initially the Dataframe looks like this:
+---+----+-------+
|age| id| name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203| amith|
| 23|1204| javed|
| 25|1205| mendy|
| 24|1206| rob|
| 23|1207| prudvi|
+---+----+-------+
What I need is to group all students with the same age, ordering them depending on their id. This is how I'm approaching this so far:
Note: I'm pretty sure there is a more efficient way than adding a new column using withColumn("newCol", ...) and then using select("newCol"), but I don't know how to do it better.
val conf = new SparkConf().setAppName("SimpleApp").set("spark.driver.allowMultipleContexts", "true").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.read.json("students.json")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id"))).select("List")
The output I am getting is this:
[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([24,1206,rob])]
[WrappedArray([23,1204,javed])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([28,1202,krishna])]
[WrappedArray([39,1203,amith])]
Now, how can I filter the rows that have more than one element? That is, I want my final dataframe to be:
[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
My best approach so far is:
val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id")))
val filterd = mergedDF.withColumn("count", count("age").over(Window.partitionBy("age"))).filter($"count" > 1).select("newCol")
But I must be missing something, because the result is not the expected one:
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([25,1201,satish])]
[WrappedArray([25,1201,satish], [25,1205,mendy])]
You can use size() to filter your data:
import org.apache.spark.sql.functions.{col,size}
mergedDF.filter(size(col("newCol"))>1).show(false)
+---+----+------+-----------------------------------+
|age|id |name |newCol |
+---+----+------+-----------------------------------+
|23 |1207|prudvi|[[23,1204,javed], [23,1207,prudvi]]|
|25 |1205|mendy |[[25,1201,satish], [25,1205,mendy]]|
+---+----+------+-----------------------------------+
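As an alternative to the window approach, the same result can be built with a plain groupBy, which avoids the intermediate cumulative lists entirely. A sketch, assuming the same df read from students.json above:
import org.apache.spark.sql.functions.{col, collect_list, size, sort_array, struct}
// One list per age; sort_array orders the structs (by age, then id, then name),
// and only groups with more than one student are kept.
val grouped = df
  .groupBy("age")
  .agg(sort_array(collect_list(struct("age", "id", "name"))).as("newCol"))
  .filter(size(col("newCol")) > 1)
grouped.show(false)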
I am reading the following JSON file into a DataFrame in Spark:
{"id" : "a", "country" : "uk", "date" : "2016-01-01"}
{"id" : "b", "country" : "uk", "date" : "2016-01-02"}
{"id" : "c", "country" : "fr", "date" : "2016-02-01"}
{"id" : "d", "country" : "de", "date" : "2016-03-01"}
{"id" : "e", "country" : "tk", "date" : "2016-04-01"}
{"id" : "f", "country" : "be", "date" : "2016-05-01"}
{"id" : "g", "country" : "nl", "date" : "2016-06-01"}
{"id" : "h", "country" : "uk", "date" : "2016-06-01"}
I then apply groupBy to it and pivot on date; here's the (pseudo)code:
val df = spark.read.json("file.json")
val dfWithFormattedDate = df.withColumn("date", date_format(col("date"), "yyyy-MM"))
dfWithFormattedDate.groupBy("country").pivot("date").agg(countDistinct("id").alias("count")).orderBy("country")
This gives me a DataFrame with country and the pivoted dates (months) as columns. I would then like to order the results in descending order of total count. However, I don't have count as one of the columns, and I can't apply pivot after calling count() on the groupBy, since that returns a Dataset rather than a RelationalGroupedDataset. I have tried the following as well:
dfWithFormattedDate.groupBy("country").pivot("date").count()
This does not give me a count column either. Is there any way I can have both the count and the pivoted dates in the resulting Dataset, so that I can order by count descending?
Update
Here's the current output:
country|2016-01|2016-02|2016-03| ....
fr | null | 1 | null |
be | null | null | null |
uk | 2 | null | null |
Here's the expected output:
country|count|2016-01|2016-02|2016-03| ....
uk | 3 | 2 | null | null |
fr | 1 | null | 1 | null |
be | 1 | null | null | null |
As you can see, I need a count column in the result and the rows ordered by count in descending order. Ordering without explicitly having a count column is fine as well.
If our starting point is this DataFrame:
import org.apache.spark.sql.functions.{date_format ,col, countDistinct}
val result = df.withColumn("date", date_format(col("date"), "yyyy-MM"))
.groupBy("country").pivot("date").agg(countDistinct("id").alias("count"))
.na.fill(0)
We can then simply calculate the row sum over all columns except the country column:
import org.apache.spark.sql.functions.desc
val test = result.withColumn("count",
result.columns.filter(_!="country")
.map(c => col(c))
.reduce((x, y) => x + y))
.orderBy(desc("count"))
test.show()
+-------+-------+-------+-------+-------+-------+-------+-----+
|country|2016-01|2016-02|2016-03|2016-04|2016-05|2016-06|count|
+-------+-------+-------+-------+-------+-------+-------+-----+
| uk| 2| 0| 0| 0| 0| 1| 3|
| be| 0| 0| 0| 0| 1| 0| 1|
| de| 0| 0| 1| 0| 0| 0| 1|
| tk| 0| 0| 0| 1| 0| 0| 1|
| nl| 0| 0| 0| 0| 0| 1| 1|
| fr| 0| 1| 0| 0| 0| 0| 1|
+-------+-------+-------+-------+-------+-------+-------+-----+
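An alternative, if you prefer to keep the nulls in the pivoted cells instead of filling them with 0, is to compute the per-country total separately and join it back before ordering. A sketch reusing dfWithFormattedDate from the question:
import org.apache.spark.sql.functions.{countDistinct, desc}
// Total distinct ids per country, computed once and joined onto the pivoted frame.
val totals = dfWithFormattedDate.groupBy("country").agg(countDistinct("id").as("count"))
val pivoted = dfWithFormattedDate.groupBy("country").pivot("date").agg(countDistinct("id"))
val ordered = pivoted.join(totals, "country").orderBy(desc("count"))
ordered.show()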