Create new Dataframe column based on existing Dataframes in Spark - scala

There are two DataFrames. I need to populate a new column in DF1, say Flag, based on the conditions below.
DF1
+------+-------------------+
|AMOUNT|Brand |
+------+-------------------+
| 47.88| Parle |
| 40.92| Parle |
| 83.82| Parle |
|106.58| Parle |
| 90.51| Flipkart |
| 11.48| Flipkart |
| 18.47| Flipkart |
| 40.92| Flipkart |
| 30.0| Flipkart |
+------+-------------------+
DF2
+--------------------+-------+----------+
| Brand | P1 | P2 |
+--------------------+-------+----------+
| Parle| 37.00 | 100.15 |
| Flipkart| 10.0 | 30.0 |
+--------------------+-------+----------+
If the amount for a brand in DF1, say Parle, is less than the P1 value in DF2 for that brand (Amount < P1), the flag is low; if P1 <= Amount <= P2, the flag is mid; and if Amount > P2, the flag is high.
The same applies to the other brands.
DF1 is very large and DF2 is very small.
expected output
+------+-------------------+----------------+
|AMOUNT|Brand | Flag |
+------+-------------------+----------------+
| 47.88| Parle | mid |
| 40.92| Parle | mid |
| 83.82| Parle | mid |
|106.58| Parle | high |
| 90.51| Flipkart | high |
| 11.48| Flipkart | mid |
| 18.47| Flipkart | mid |
| 40.92| Flipkart | high |
| 30.0| Flipkart | mid |
+------+-------------------+----------------+
I know that I can do a join and get the results, but how should I frame the logic in Spark?

A simple left join followed by nested when calls from the built-in functions should give you the desired result:
import org.apache.spark.sql.functions._
df1.join(df2, Seq("Brand"), "left")
.withColumn("Flag", when(col("AMOUNT") < col("P1"), "low").otherwise(
when(col("AMOUNT") >= col("P1") && col("AMOUNT") <= col("P2"), "mid").otherwise(
when(col("AMOUNT") > col("P2"), "high").otherwise("unknown"))))
.select("AMOUNT", "Brand", "Flag")
.show(false)
which should give you
+------+--------+----+
|AMOUNT|Brand |Flag|
+------+--------+----+
|47.88 |Parle |mid |
|40.92 |Parle |mid |
|83.82 |Parle |mid |
|106.58|Parle |high|
|90.51 |Flipkart|high|
|11.48 |Flipkart|mid |
|18.47 |Flipkart|mid |
|40.92 |Flipkart|high|
|30.0 |Flipkart|mid |
+------+--------+----+
I hope the answer is helpful

I think this is also doable with a udf.
val df3 = df1.join(df2, Seq("Brand"), "left")
import org.apache.spark.sql.functions._
val mapper = udf((amount: Double, p1: Double, p2: Double) => if (amount < p1) "low" else if (amount > p2) "high" else "mid")
df3.withColumn("Flag", mapper(df3("AMOUNT"), df3("P1"), df3("P2")))
.select("AMOUNT", "Brand", "Flag")
.show(false)
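
To check the thresholds independently of Spark, the banding rule that both answers apply after the join can be sketched in plain Python. This is only an illustration: `flag_for` and the `thresholds` dictionary are hypothetical names, and the sample values are copied from DF1 and DF2 in the question.

```python
# Hypothetical stand-in for DF2: brand -> (P1, P2), values taken from the question.
thresholds = {"Parle": (37.00, 100.15), "Flipkart": (10.0, 30.0)}

def flag_for(amount, brand):
    """Return low/mid/high for an amount, or 'unknown' when the brand has no thresholds."""
    if brand not in thresholds:
        return "unknown"  # plays the role of an unmatched row in the left join
    p1, p2 = thresholds[brand]
    if amount < p1:
        return "low"
    if amount > p2:
        return "high"
    return "mid"  # P1 <= amount <= P2

# A few rows from DF1, including the boundary case 30.0 == P2 for Flipkart.
rows = [(47.88, "Parle"), (106.58, "Parle"), (11.48, "Flipkart"),
        (30.0, "Flipkart"), (40.92, "Flipkart")]
flags = [flag_for(amount, brand) for amount, brand in rows]
print(flags)
```

Both boundaries are inclusive for mid, so an amount exactly equal to P1 or P2 is flagged mid, as in the expected output for 30.0.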

Join a DF with another two with conditions - Scala Spark

I am trying to join a DF with one of two others based on a condition. I have the following DFs.
DF1 is the DF that I want to join with df_cond1 and df_cond2.
If the DF1 InfoNum column is NBC I want to join with df_cond1, else if the DF1 InfoNum column is BBC I want to join with df_cond2, but I don't know how to do this.
DF1
+-------------+----------+-------------+
| Date | InfoNum | Sport |
+-------------+----------+-------------+
| 31/11/2020 | NBC | football |
| 11/01/2020 | BBC | tennis |
+-------------+----------+-------------+
df_cond1
+-------------+---------+-------------+
| Periodicity | Info | Description |
+-------------+---------+-------------+
| Monthly | NBC | DATAquality |
+-------------+---------+-------------+
df_cond2
+-------------+---------+-------------+
| Periodicity | Info | Description |
+-------------+---------+-------------+
| Daily | BBC | InfoIndeed |
+-------------+---------+-------------+
final_df
+-------------+----------+-------------+-------------+
| Date | InfoNum | Sport | Description |
+-------------+----------+-------------+-------------+
| 31/11/2020 | NBC | football | DATAquality |
| 11/01/2020 | BBC | tennis | InfoIndeed |
+-------------+----------+-------------+-------------+
I have been searching but didn't find a good solution, can you help me?
Here is how you can do the join: union the two condition DataFrames, then join once on the key.
import spark.implicits._ // needed for toDF and the $"..." syntax; assumes a SparkSession named spark
val df = Seq(
  ("31/11/2020", "NBC", "football"),
  ("11/01/2020", "BBC", "tennis")
).toDF("Date", "InfoNum", "Sport")
val df_cond1 = Seq(
  ("Monthly", "NBC", "DATAquality")
).toDF("Periodicity", "Info", "Description")
val df_cond2 = Seq(
  ("Daily", "BBC", "InfoIndeed")
).toDF("Periodicity", "Info", "Description")
df.join(df_cond1.union(df_cond2), $"InfoNum" === $"Info")
  .drop("Info", "Periodicity")
  .show(false)
Output:
+----------+-------+--------+-----------+
|Date |InfoNum|Sport |Description|
+----------+-------+--------+-----------+
|31/11/2020|NBC |football|DATAquality|
|11/01/2020|BBC |tennis |InfoIndeed |
+----------+-------+--------+-----------+
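
The idea behind this answer (concatenate the two condition tables, then do a single keyed join) can be sketched without Spark. The snippet below is a plain-Python illustration only; `lookup` and `final_rows` are hypothetical names, and it assumes each Info value appears in at most one of the condition tables.

```python
# Rows copied from the question: (Date, InfoNum, Sport).
df = [("31/11/2020", "NBC", "football"), ("11/01/2020", "BBC", "tennis")]
# Condition tables: (Periodicity, Info, Description).
df_cond1 = [("Monthly", "NBC", "DATAquality")]
df_cond2 = [("Daily", "BBC", "InfoIndeed")]

# "Union" the two condition tables and index them by Info.
lookup = {info: description for _, info, description in df_cond1 + df_cond2}

# Inner join on InfoNum == Info, keeping only the Description column.
final_rows = [
    (date, info_num, sport, lookup[info_num])
    for date, info_num, sport in df
    if info_num in lookup
]
print(final_rows)
```

Because the join key already selects the right condition table, no explicit if/else on InfoNum is needed.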

Iterate over Dataframe & Recursive filters

I have two dataframes, "MinNRule" and "SampleData".
MinNRule provides rule information that determines how SampleData needs to be processed:
1. Aggregate "Sample Data" on columns defined in MinNRule.MinimumNPopulation and MinNRule.OrderOfOperation
2. Check if Aggregate.Entity >= MinNRule.MinimumNValue
   a. For all Entities that do not meet the MinNRule.MinimumNValue, remove from population
   b. For all Entities that meet the MinNRule.MinimumNValue, keep in population
3. Perform 1 through 2 for the next MinNRule.OrderOfOperation using the 2.b dataset
MinNRule
| MinimumNGroupName | MinimumNPopulation | MinimumNValue | OrderOfOperation |
|:-----------------:|:------------------:|:-------------:|:----------------:|
| Group1 | People by Facility | 6 | 1 |
| Group1 | People by Project | 4 | 2 |
SampleData
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
| F1 | P2 | 373865 |
| F1 | P2 | 120672 |
| F1 | P2 | 369407 |
| F2 | P4 | 121705 |
| F2 | P4 | 211807 |
| F2 | P4 | 408041 |
| F2 | P4 | 415579 |
Proposed Steps:
1. Read MinNRule, read rule with OrderOfOperation=1
   a. GroupBy Facility, Count on People
   b. Aggregate SampleData by 1.a and compare to MinimumNValue=6
| Facility | Count | MinNPass |
|:--------: |:-------: |:--------: |
| F1 | 7 | Y |
| F2 | 4 | N |
2. Select MinNPass='Y' rows and filter the initial dataframe down to those entities (F2 gets dropped)
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
| F1 | P2 | 373865 |
| F1 | P2 | 120672 |
| F1 | P2 | 369407 |
3. Read MinNRule, read rule with OrderOfOperation=2
   a. GroupBy Project, Count on People
   b. Aggregate SampleData by 3.a and compare to MinimumNValue=4
| Project | Count | MinNPass |
|:--------: |:-------: |:--------: |
| P1 | 4 | Y |
| P2 | 3 | N |
4. Select MinNPass='Y' rows and filter dataframe in 3 down to those entities (P2 gets dropped)
5. Print Final Result
| Facility | Project | PeopleID |
|:--------: |:-------: |:--------: |
| F1 | P1 | 166152 |
| F1 | P1 | 425906 |
| F1 | P1 | 332127 |
| F1 | P1 | 241630 |
Ideas:
I have been thinking of moving MinNRule to a LocalIterator, looping through it, and "filtering" SampleData
I am not sure how to pass the result at the end of one loop iteration over to the next
Still learning PySpark, unsure if this is the correct approach.
I am using Azure Databricks
IIUC, since the rules df only defines the rules, it must be small and can be collected to the driver to drive the operations on the main data.
One approach to get the desired result is to collect the rules df and pass it to the reduce function:
data = MinNRule.orderBy('OrderOfOperation').collect()
from pyspark.sql.functions import *
from functools import reduce
dfnew = reduce(lambda df, rules: df.groupBy(col(rules.MinimumNPopulation.split('by')[1].strip())).\
agg(count(col({'People':'PeopleID'}.get(rules.MinimumNPopulation.split('by')[0].strip()))).alias('count')).\
filter(col('count')>=rules.MinimumNValue).drop('count').join(df,rules.MinimumNPopulation.split('by')[1].strip(),'inner'), data, sampleData)
dfnew.show()
+-------+--------+--------+
|Project|Facility|PeopleID|
+-------+--------+--------+
| P1| F1| 166152|
| P1| F1| 425906|
| P1| F1| 332127|
| P1| F1| 241630|
+-------+--------+--------+
Alternatively, you can loop through the collected rules and get the same result; the performance is the same in both cases.
import pyspark.sql.functions as f
mapped_cols = {'People':'PeopleID'}
data = MinNRule.orderBy('OrderOfOperation').collect()
for i in data:
    cnt, grp = i.MinimumNPopulation.split('by')
    cnt = mapped_cols.get(cnt.strip())
    grp = grp.strip()
    sampleData = sampleData.groupBy(f.col(grp)).agg(f.count(f.col(cnt)).alias('count')).\
        filter(f.col('count')>=i.MinimumNValue).drop('count').join(sampleData,grp,'inner')
sampleData.show()
+-------+--------+--------+
|Project|Facility|PeopleID|
+-------+--------+--------+
| P1| F1| 166152|
| P1| F1| 425906|
| P1| F1| 332127|
| P1| F1| 241630|
+-------+--------+--------+
Note: You have to parse the rules grammar manually, and the parsing will need updating whenever that grammar changes.
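
To see how the result of one rule feeds into the next, the same rule-by-rule filtering can be traced in plain Python on the sample data from the question. This is a sketch rather than a Spark implementation: `apply_rules` is a hypothetical helper, and it assumes every rule has the form "People by <column>" so that counting rows per group is equivalent to counting PeopleID.

```python
from collections import Counter

# Rules copied from MinNRule in the question.
rules = [
    {"MinimumNPopulation": "People by Facility", "MinimumNValue": 6, "OrderOfOperation": 1},
    {"MinimumNPopulation": "People by Project", "MinimumNValue": 4, "OrderOfOperation": 2},
]

# SampleData copied from the question.
sample = [
    {"Facility": f, "Project": p, "PeopleID": pid}
    for f, p, pid in [
        ("F1", "P1", 166152), ("F1", "P1", 425906), ("F1", "P1", 332127), ("F1", "P1", 241630),
        ("F1", "P2", 373865), ("F1", "P2", 120672), ("F1", "P2", 369407),
        ("F2", "P4", 121705), ("F2", "P4", 211807), ("F2", "P4", 408041), ("F2", "P4", 415579),
    ]
]

def apply_rules(rows, rules):
    """Apply each rule in OrderOfOperation order; each rule sees the previous rule's output."""
    for rule in sorted(rules, key=lambda r: r["OrderOfOperation"]):
        group_col = rule["MinimumNPopulation"].split(" by ")[1].strip()
        counts = Counter(row[group_col] for row in rows)
        # Keep only rows whose group meets the minimum count.
        rows = [row for row in rows if counts[row[group_col]] >= rule["MinimumNValue"]]
    return rows

result = apply_rules(sample, rules)
print([(r["Facility"], r["Project"], r["PeopleID"]) for r in result])
```

Reassigning `rows` inside the loop is what carries the surviving population from one rule to the next, which is the same role `sampleData` plays in the loop version above.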

how to find which date the consecutive column status "Complete" started with in a 7day period

From the input below, I need to get a date for which the status has been 'complete' on each of the past 7 consecutive days.
Requirement:
1. Go back 8 days (this is easy).
2. So we are on 20190111 in the data frame below. I need to check day by day from 20190111 back to 20190104 (a 7-day period) and get a date on which the status has been 'complete' for 7 consecutive days. So we should get 20190108.
I need this in Spark Scala.
input
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
| 9|20190109| pending|
| 10|20190110|complete|
| 11|20190111|complete|
| 12|20190112| pending|
| 13|20190113|complete|
| 14|20190114|complete|
| 15|20190115| pending|
| 16|20190116| pending|
| 17|20190117| pending|
| 18|20190118| pending|
| 19|20190119| pending|
+---+--------+--------+
output
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
for >= spark 2.4
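import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark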
val df= Seq((1,"20190101","complete"),(2,"20190102","complete"),
(3,"20190103","complete"),(4,"20190104","complete"), (5,"20190105","complete"),(6,"20190106","complete"),(7,"20190107","complete"),(8,"20190108","complete"),
(9,"20190109", "pending"),(10,"20190110","complete"),(11,"20190111","complete"),(12,"20190112", "pending"),(13,"20190113","complete"),(14,"20190114","complete"),(15,"20190115", "pending") , (16,"20190116", "pending"),(17,"20190117", "pending"),(18,"20190118", "pending"),(19,"20190119", "pending")).toDF("id","date","status")
val df1= df.select($"id", to_date($"date", "yyyyMMdd").as("date"), $"status")
val win = Window.orderBy("id")
// coalesce lag_status and status to remove null
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
// create an integer column denoting whether the status for the current day equals the status for the previous day
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
val df_new= df4.where($"previous_7_sum"===8).select($"date").select(explode(sequence(date_sub($"date",7), $"date")).as("date"))
val df5=df4.join(df_new, Seq("date"), "inner").select($"id", concat_ws("",split($"date".cast("string"), "-")).as("date"), $"status")
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
+---+--------+--------+
for spark < 2.4
use udf instead of built in array function "sequence"
val df1= df.select($"id", $"date".cast("integer").as("date"), $"status")
val win = Window.orderBy("id")
// coalesce lag_status and status to remove null
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
// create an integer column denoting whether the status for the current day equals the status for the previous day
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
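// note: plain integer arithmetic on yyyyMMdd values only works while the 8-day window stays inside one month
val ud1= udf((col1:Int) => {
((col1-7).to(col1 )).toArray})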
val df_new= df4.where($"previous_7_sum"===8)
.withColumn("dt_arr", ud1($"date"))
.select(explode($"dt_arr" ).as("date"))
val df5=df4.join(df_new, Seq("date"), "inner").select($"id", concat_ws("",split($"date".cast("string"), "-")).as("date"), $"status")
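
As a cross-check of the expected date, the run detection can be sketched in plain Python. Note the assumptions: unlike the answer above, which compares each status with the previous row's status, this sketch tests for 'complete' explicitly, and it uses a window of 8 calendar days to match the 8 rows in the expected output. `run_end_dates` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Statuses copied from the question: days 1..19 of January 2019.
pending_days = {9, 12, 15, 16, 17, 18, 19}
statuses = {
    date(2019, 1, day): "pending" if day in pending_days else "complete"
    for day in range(1, 20)
}

def run_end_dates(statuses, days=8):
    """Return dates d such that d and the previous days-1 calendar days are all 'complete'."""
    ends = []
    for d in sorted(statuses):
        window = [d - timedelta(days=k) for k in range(days)]
        # A missing day counts as not complete, so short histories never qualify.
        if all(statuses.get(w) == "complete" for w in window):
            ends.append(d)
    return ends

ends = run_end_dates(statuses)
print([d.strftime("%Y%m%d") for d in ends])
```

Using real dates instead of integers keeps the window correct across month boundaries.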

Scala — GroupBy column in specific formatting

DF1 is what I have now, and I want to make DF1 look like DF2.
Desired Output:
DF1 DF2
+---------+-------------------+ +---------+------------------------------+
| ID | Category | | ID | Category |
+---------+-------------------+ +---------+------------------------------+
| 31898 | Transfer | | 31898 | Transfer (e-Transfer) |
| 31898 | e-Transfer | =====> | 32614 | Transfer (e-Transfer + IMT) |
| 32614 | Transfer | =====> | 33987 | Transfer (IMT) |
| 32614 | e-Transfer + IMT | +---------+------------------------------+
| 33987 | Transfer |
| 33987 | IMT |
+---------+-------------------+
Code:
val df = DF1.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
val DF2 = df.withColumn("Category", $"CategorySet"(0) ($"CategorySet"(1)))
The code is not working. How can I solve it? If there is a better way to do the same thing, I am open to it. Thank you in advance.
You can try this:
val sliceRight = udf((array : Seq[String], from : Int) => " (" + array.takeRight(from).mkString(",") +")")
val df2 = df.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
df2.withColumn("Category", concat($"CategorySet"(0),sliceRight($"CategorySet",lit(1))))
.show(false)
Output:
+-----+----------------------------+---------------------------+
|ID |CategorySet |Category |
+-----+----------------------------+---------------------------+
|33987|[Transfer, IMT] |Transfer (IMT) |
|32614|[Transfer, e-Transfer + IMT]|Transfer (e-Transfer + IMT)|
|31898|[Transfer, e-Transfer] |Transfer (e-Transfer) |
+-----+----------------------------+---------------------------+
The same answer with a slight modification, using only built-in functions:
df.groupBy("ID").agg(collect_set(col("Category")).as("Category")).withColumn("Category", concat(col("Category")(0),lit(" ("),col("Category")(1), lit(")"))).show
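
The target formatting itself ("first category (remaining categories)") can be illustrated in plain Python. This is only a sketch with hypothetical names (`grouped`, `formatted`); it keeps categories in first-seen order, whereas Spark's collect_set does not guarantee element order, so in Spark the "Transfer" entry may not always come first.

```python
# Rows copied from DF1 in the question: (ID, Category).
rows = [
    (31898, "Transfer"), (31898, "e-Transfer"),
    (32614, "Transfer"), (32614, "e-Transfer + IMT"),
    (33987, "Transfer"), (33987, "IMT"),
]

# Group distinct categories per ID, preserving first-seen order.
grouped = {}
for id_, category in rows:
    categories = grouped.setdefault(id_, [])
    if category not in categories:
        categories.append(category)

# First category, followed by the rest in parentheses (if any).
formatted = {
    id_: f"{cats[0]} ({', '.join(cats[1:])})" if len(cats) > 1 else cats[0]
    for id_, cats in grouped.items()
}
print(formatted)
```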

How to filter one big dataframe many times (equal to the small df's row count) by another small dataframe (row by row)?

I have two Spark dataframes, dfA and dfB.
I want to filter dfA by each row of dfB, which means that if dfB has 10000 rows, I need to filter dfA 10000 times with 10000 different filter conditions generated from dfB. Then, after each filter, I need to collect the filter result as a column in dfB.
dfA dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows, nearly one billion rows
So my expected result is
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
My stupid solution is
def tempFunction(id:Int,dfA:DataFrame,dfB:DataFrame): DataFrame ={
val dfa = dfA.filter("id ="+ id)
val dfb = dfB.filter("id ="+ id)
val arr = dfb.groupBy("id")
.agg(collect_list(struct("min_value1","max_value1")))
.collect()
val rangArray = arr(0)(1).asInstanceOf[Seq[Row]] // get range array of id
// initial a resultDF to store each query's results
val min_value1 = rangArray(0).get(0).asInstanceOf[Int]
val max_value1 = rangArray(0).get(1).asInstanceOf[Int]
val s = "value1 between "+min_value1+" and "+ max_value1
var resultDF = dfa.filter(s).groupBy("id")
.agg(collect_list("value1").as("results"),
min("value1").as("min_value1"),
max("value1").as("max_value1"))
for( i <-1 to rangArray.length-1){
val temp_min_value1 = rangArray(i).get(0).asInstanceOf[Int]
val temp_max_value1 = rangArray(i).get(1).asInstanceOf[Int]
val query = "value1 between "+temp_min_value1+" and "+ temp_max_value1
val tempResultDF = dfa.filter(query).groupBy("id")
.agg(collect_list("value1").as("results"),
min("value1").as("min_value1"),
max("value1").as("max_value1"))
resultDF = resultDF.union(tempResultDF)
}
return resultDF
}
def myFunction():DataFrame = {
val dfA = spark.read.parquet(routeA)
val dfB = spark.read.parquet(routeB)
val idArrays = dfB.select("id").distinct().collect()
// initial result
var resultDF = tempFunction(idArrays(0).get(0).asInstanceOf[Int],dfA,dfB)
//tranverse all id
for(i<-1 to idArrays.length-1){
val tempDF = tempFunction(idArrays(i).get(0).asInstanceOf[Int],dfA,dfB)
resultDF = resultDF.union(tempDF)
}
return resultDF
}
You may not want to read my brute-force code; its idea is:
finalResult = null;
for each id in dfB:
    for each query condition of this id:
        tempResult = query dfA
        union tempResult to finalResult
I've tried my algorithm; it took almost 50 hours.
Does anybody have a more efficient way? Thanks very much.
Assuming that your dfB is a small dataset, I would suggest the solution below.
Try using a broadcast join like this:
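import org.apache.spark.sql.functions.{broadcast, col, collect_list}
dfA.as("dfA") // aliases so that col("dfA.id") and col("dfB.id") can be resolved
  .join(broadcast(dfB.as("dfB")),
    col("dfA.id") === col("dfB.id") && col("dfA.value1") >= col("dfB.min_value1") && col("dfA.value1") <= col("dfB.max_value1"))
  .groupBy(col("dfB.id"), col("dfB.min_value1"), col("dfB.max_value1")) // one group per dfB row
  .agg(collect_list(col("dfA.value2")).as("results"))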
A broadcast join is like a map-side join: it ships the smaller dataset to every executor, which improves performance by avoiding the sort-and-shuffle phase that a regular join would need.
Some things I would suggest avoiding:
Avoid collect() on large datasets. When a collect operation is issued on an RDD, the whole dataset is copied to the driver.
If your data is too big, you might get an out-of-memory error.
Try using take() or takeSample() instead.
When two dataframes/datasets are involved in a calculation, a join is needed, so a join is a necessary step for you. The important question is when to join.
I would suggest aggregating and reducing the rows in the dataframes as much as possible before joining, as that reduces shuffling.
In your case you can reduce only dfA, as you need dfB exactly as it is, with a column added from dfA where the condition is met.
So you can groupBy id and aggregate dfA so that you get one row per id, and then perform the join. After that you can use a udf function for the calculation logic.
Comments are provided for clarity and explanation.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
//udf function to filter only the collected value2 which has value1 within range of min_value1 and max_value1
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row])=> list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue).map(_.getAs[Int]("value2")))
dfA.groupBy("id") //grouping by id
.agg(collect_list(struct("value1", "value2")).as("collection")) //collecting all the value1 and value2 as structs
.join(dfB, Seq("id"), "right") //joining both dataframes with id
.select(col("id"), col("min_value1"), col("max_value1"), selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) //calling the udf function defined above
which should give you
+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results |
+---+----------+----------+------------------------------+
|1 |0 |3 |[4345, 3434, 4676, 3454] |
|1 |5 |9 |[5778, 5674, 3456, 6590, 5461]|
|2 |1 |4 |[2343, 4946, 4353, 4354] |
|2 |6 |8 |[8695, 6587, 5688] |
+---+----------+----------+------------------------------+
I hope the answer is helpful
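
The shape of this second approach (group dfA once per id, then evaluate every dfB range against that id's group) can be traced in plain Python on the sample rows from the question. This is an illustrative sketch only; `by_id` and `results` are hypothetical names, and at the real data size the per-id groups would of course live in Spark rather than in a local dictionary.

```python
from collections import defaultdict

# dfA rows rebuilt from the question: (id, value1, value2).
values_id1 = [4345, 3434, 4676, 3454, 9765, 5778, 5674, 3456, 6590, 5461, 4656]
values_id2 = [2324, 2343, 4946, 4353, 4354, 3234, 8695, 6587, 5688]
dfA = [(1, v1, v2) for v1, v2 in enumerate(values_id1)] + \
      [(2, v1, v2) for v1, v2 in enumerate(values_id2)]

# dfB rows from the question: (id, min_value1, max_value1).
dfB = [(1, 0, 3), (1, 5, 9), (2, 1, 4), (2, 6, 8)]

# Step 1: group dfA by id once (the groupBy + collect_list step).
by_id = defaultdict(list)
for id_, value1, value2 in dfA:
    by_id[id_].append((value1, value2))

# Step 2: for each dfB row, keep value2 where value1 lies in the inclusive range (the udf step).
results = [
    (id_, lo, hi, [v2 for v1, v2 in by_id[id_] if lo <= v1 <= hi])
    for id_, lo, hi in dfB
]
for row in results:
    print(row)
```

dfA is scanned once to build the groups, instead of once per dfB row as in the brute-force loop.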