Join a DF with two others based on a condition - Scala Spark

I am trying to join a DF with two other DFs using a condition. I have the following DFs.
DF1 is the DF that I want to join with df_cond1 and df_cond2.
If the DF1 InfoNum column is NBC, I want to join with df_cond1; if it is BBC, I want to join with df_cond2, but I don't know how to do this.
DF1
+-------------+----------+-------------+
| Date | InfoNum | Sport |
+-------------+----------+-------------+
| 31/11/2020 | NBC | football |
| 11/01/2020 | BBC | tennis |
+-------------+----------+-------------+
df_cond1
+-------------+---------+-------------+
| Periodicity | Info | Description |
+-------------+---------+-------------+
| Monthly | NBC | DATAquality |
+-------------+---------+-------------+
df_cond2
+-------------+---------+-------------+
| Periodicity | Info | Description |
+-------------+---------+-------------+
| Daily | BBC | InfoIndeed |
+-------------+---------+-------------+
final_df
+-------------+----------+-------------+-------------+
| Date | InfoNum | Sport | Description |
+-------------+----------+-------------+-------------+
| 31/11/2020 | NBC | football | DATAquality |
| 11/01/2020 | BBC | tennis | InfoIndeed |
+-------------+----------+-------------+-------------+
I have been searching but haven't found a good solution. Can you help me?

Here is how you can do the join:
import spark.implicits._

val df = Seq(
  ("31/11/2020", "NBC", "football"),
  ("1/01/2020", "BBC", "tennis")
).toDF("Date", "InfoNum", "Sport")

val df_cond1 = Seq(
  ("Monthly", "NBC", "DATAquality")
).toDF("Periodicity", "Info", "Description")

val df_cond2 = Seq(
  ("Daily", "BBC", "InfoIndeed")
).toDF("Periodicity", "Info", "Description")

// the two lookup DFs share the same schema, so union them and join once on InfoNum === Info
df.join(df_cond1.union(df_cond2), $"InfoNum" === $"Info")
  .drop("Info", "Periodicity")
  .show(false)
Output:
+----------+-------+--------+-----------+
|Date |InfoNum|Sport |Description|
+----------+-------+--------+-----------+
|31/11/2020|NBC |football|DATAquality|
|1/01/2020 |BBC |tennis |InfoIndeed |
+----------+-------+--------+-----------+
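If df_cond1 and df_cond2 ever stop sharing a schema, the union approach no longer applies; a sketch of an alternative (assuming the same DataFrames as above) is to left join each lookup separately and coalesce the two Description columns:
import org.apache.spark.sql.functions.coalesce

// Rename the lookup columns so the two joins don't collide.
val c1 = df_cond1.select($"Info".as("Info1"), $"Description".as("Desc1"))
val c2 = df_cond2.select($"Info".as("Info2"), $"Description".as("Desc2"))

val final_df = df
  .join(c1, $"InfoNum" === $"Info1", "left")
  .join(c2, $"InfoNum" === $"Info2", "left")
  .withColumn("Description", coalesce($"Desc1", $"Desc2"))
  .drop("Info1", "Desc1", "Info2", "Desc2")

final_df.show(false)
This produces the same final_df as above, and a null Description would surface any InfoNum value that matches neither lookup.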

Related

pyspark dataframe check if string contains substring

I need help implementing the Python logic below on a PySpark DataFrame.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
from pyspark.sql import functions as F

df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with a simple join expression.
from pyspark.sql import functions as f

df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
    .withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
    .drop('sub_string') \
    .show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+

How to find the date on which a consecutive "complete" status started within a 7-day period

I need to get, from the input below, a date on which there has been a consecutive 'complete' status for the past 7 days.
Requirement:
1. Go back 8 days (this is easy).
2. So we are on 20190111 in the data frame below. I need to check day by day from 20190111 back to 20190104 (a 7-day period) and get the date on which the status has been 'complete' for 7 consecutive days. So we should get 20190108.
I need this in Spark (Scala).
input
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
| 9|20190109| pending|
| 10|20190110|complete|
| 11|20190111|complete|
| 12|20190112| pending|
| 13|20190113|complete|
| 14|20190114|complete|
| 15|20190115| pending|
| 16|20190116| pending|
| 17|20190117| pending|
| 18|20190118| pending|
| 19|20190119| pending|
+---+--------+--------+
output
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
For Spark >= 2.4:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (1, "20190101", "complete"), (2, "20190102", "complete"), (3, "20190103", "complete"), (4, "20190104", "complete"),
  (5, "20190105", "complete"), (6, "20190106", "complete"), (7, "20190107", "complete"), (8, "20190108", "complete"),
  (9, "20190109", "pending"), (10, "20190110", "complete"), (11, "20190111", "complete"), (12, "20190112", "pending"),
  (13, "20190113", "complete"), (14, "20190114", "complete"), (15, "20190115", "pending"), (16, "20190116", "pending"),
  (17, "20190117", "pending"), (18, "20190118", "pending"), (19, "20190119", "pending")
).toDF("id", "date", "status")

val df1 = df.select($"id", to_date($"date", "yyyyMMdd").as("date"), $"status")
val win = Window.orderBy("id")
// coalesce lag_status and status to remove the null on the first row
val df2 = df1.select($"*", lag($"status", 1).over(win).as("lag_status"))
  .withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
// integer flag: 1 if the current day's status equals the previous day's status
val df3 = df2.select($"*", ($"status" === $"lag_stat").cast("integer").as("status_flag"))
val win1 = Window.orderBy($"id".desc).rangeBetween(0, 7)
val df4 = df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
val df_new = df4.where($"previous_7_sum" === 8)
  .select(explode(sequence(date_sub($"date", 7), $"date")).as("date"))
val df5 = df4.join(df_new, Seq("date"), "inner")
  .select($"id", concat_ws("", split($"date".cast("string"), "-")).as("date"), $"status")
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
+---+--------+--------+
For Spark < 2.4, use a udf instead of the built-in array function "sequence":
val df1 = df.select($"id", $"date".cast("integer").as("date"), $"status")
val win = Window.orderBy("id")
// coalesce lag_status and status to remove the null on the first row
val df2 = df1.select($"*", lag($"status", 1).over(win).as("lag_status"))
  .withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
// integer flag: 1 if the current day's status equals the previous day's status
val df3 = df2.select($"*", ($"status" === $"lag_stat").cast("integer").as("status_flag"))
val win1 = Window.orderBy($"id".desc).rangeBetween(0, 7)
val df4 = df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
// build the 8-day date range (as integers) with a udf instead of sequence()
val ud1 = udf((col1: Int) => ((col1 - 7) to col1).toArray)
val df_new = df4.where($"previous_7_sum" === 8)
  .withColumn("dt_arr", ud1($"date"))
  .select(explode($"dt_arr").as("date"))
val df5 = df4.join(df_new, Seq("date"), "inner")
  .select($"id", concat_ws("", split($"date".cast("string"), "-")).as("date"), $"status")
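Requirement 1 from the question (going back 8 days) is not shown above; a minimal sketch, assuming the parsed date column from the Spark >= 2.4 version (df1), would be:
import org.apache.spark.sql.functions.{date_sub, max}

// Reference date: 8 days before the latest date in the frame.
val refDate = df1.select(date_sub(max($"date"), 8).as("ref_date"))
refDate.show(false)
With the sample data the latest date is 20190119, so this yields 20190111, matching the question.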

Create dummy variable columns in a PySpark data frame

I have a spark data frame like:
|---------------------|------------------------------|
| Brand | Model |
|---------------------|------------------------------|
| Hyundai | Elentra,Creta |
|---------------------|------------------------------|
| Hyundai | Creta,Grand i10,Verna |
|---------------------|------------------------------|
| Maruti | Eritga,S-cross,Vitara Brezza|
|---------------------|------------------------------|
| Maruti | Celerio,Eritga,Ciaz |
|---------------------|------------------------------|
I want a data frame like this:
|---------------------|---------|--------|--------------|--------|---------|
| Brand | Model0 | Model1 | Model2 | Model3 | Model4 |
|---------------------|---------|--------|--------------|--------|---------|
| Hyundai | Elentra | Creta | Grand i10 | Verna | null |
|---------------------|---------|--------|--------------|--------|---------|
| Maruti | Ertiga | S-Cross| Vitara Brezza| Celerio| Ciaz |
|---------------------|---------|--------|--------------|--------|---------|
I have used this code:
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Brand", StringType()), StructField("Model", StringType())])
tempCSV = spark.read.csv("PATH\\Cars.csv", sep='|', schema=schema)
tempDF = tempCSV.select(
        "Brand",
        f.split("Model", ",").alias("Model"),
        f.posexplode(f.split("Model", ",")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "Brand",
        f.concat(f.lit("Model"), f.col("pos").cast("string")).alias("name"),
        f.expr("Model[pos]").alias("val")
    )\
    .groupBy("Brand").pivot("name").agg(f.first("val")).toPandas()
But I'm not getting the desired result. Instead of the second table above, it gives:
|---------------------|---------|--------|--------------|
| Brand | Model0 | Model1 | Model2 |
|---------------------|---------|--------|--------------|
| Hyundai | Elentra | Creta | Grand i10 |
|---------------------|---------|--------|--------------|
| Maruti | Ertiga | S-Cross| Vitara Brezza|
|---------------------|---------|--------|--------------|
Thanks in advance.
This is happening because you are pivoting on pos, which repeats within the same Brand group.
You can use row_number() and then pivot your data to generate the desired result.
Here is sample code on top of the data you have provided.
from pyspark.sql import functions as f

df = sqlContext.createDataFrame([('Hyundai', "Elentra,Creta"), ("Hyundai", "Creta,Grand i10,Verna"),
                                 ("Maruti", "Eritga,S-cross,Vitara Brezza"), ("Maruti", "Celerio,Eritga,Ciaz")], ("Brand", "Model"))
tmpDf = df.select("Brand", f.split("Model", ",").alias("Model"),
                  f.posexplode(f.split("Model", ",")).alias("pos", "val"))
tmpDf.createOrReplaceTempView("tbl")
seqDf = sqlContext.sql("select Brand, Model, pos, val, row_number() over(partition by Brand order by pos) as rnk from tbl")
seqDf.groupBy('Brand').pivot('rnk').agg(f.first('val')).show()
This will generate the following result.
+-------+-------+-------+-------+---------+-------------+----+
| Brand| 1| 2| 3| 4| 5| 6|
+-------+-------+-------+-------+---------+-------------+----+
| Maruti| Eritga|Celerio|S-cross| Eritga|Vitara Brezza|Ciaz|
|Hyundai|Elentra| Creta| Creta|Grand i10| Verna|null|
+-------+-------+-------+-------+---------+-------------+----+

Scala — GroupBy column in specific formatting

DF1 is what I have now, and I want to make DF1 look like DF2.
Desired Output:
DF1 DF2
+---------+-------------------+ +---------+------------------------------+
| ID | Category | | ID | Category |
+---------+-------------------+ +---------+------------------------------+
| 31898 | Transfer | | 31898 | Transfer (e-Transfer) |
| 31898 | e-Transfer | =====> | 32614 | Transfer (e-Transfer + IMT) |
| 32614 | Transfer | =====> | 33987 | Transfer (IMT) |
| 32614 | e-Transfer + IMT | +---------+------------------------------+
| 33987 | Transfer |
| 33987 | IMT |
+---------+-------------------+
Code:
val df = DF1.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
val DF2 = df.withColumn("Category", $"CategorySet"(0) ($"CategorySet"(1)))
The code is not working; how can I solve it? And if there is a better way to do the same thing, I am open to it. Thank you in advance.
You can try this:
import org.apache.spark.sql.functions.{udf, collect_set, concat, lit}

val sliceRight = udf((array: Seq[String], from: Int) => " (" + array.takeRight(from).mkString(",") + ")")

val df2 = df.groupBy("ID").agg(collect_set("Category").as("CategorySet"))
df2.withColumn("Category", concat($"CategorySet"(0), sliceRight($"CategorySet", lit(1))))
  .show(false)
Output:
+-----+----------------------------+---------------------------+
|ID |CategorySet |Category |
+-----+----------------------------+---------------------------+
|33987|[Transfer, IMT] |Transfer (IMT) |
|32614|[Transfer, e-Transfer + IMT]|Transfer (e-Transfer + IMT)|
|31898|[Transfer, e-Transfer] |Transfer (e-Transfer) |
+-----+----------------------------+---------------------------+
Answer with a slight modification:
df.groupBy("ID").agg(collect_set(col("Category")).as("Category"))
  .withColumn("Category", concat(col("Category")(0), lit(" ("), col("Category")(1), lit(")")))
  .show
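On Spark 2.4+ the same result can be produced without a udf, using the built-in slice and array_join functions; a sketch assuming the same df as above:
import org.apache.spark.sql.functions.{collect_set, concat, lit, col, expr}

df.groupBy("ID")
  .agg(collect_set("Category").as("CategorySet"))
  .withColumn("Category",
    concat(col("CategorySet")(0), lit(" ("),
      expr("array_join(slice(CategorySet, 2, size(CategorySet)), ',')"), lit(")")))
  .drop("CategorySet")
  .show(false)
Both functions require Spark 2.4+; on older versions the udf above is the simpler route.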

Create new Dataframe column based on existing Dataframes in Spark

There are two DFs, and I need to populate a new column in DF1, say Flag, based on the conditions below.
DF1
+------+-------------------+
|AMOUNT|Brand |
+------+-------------------+
| 47.88| Parle |
| 40.92| Parle |
| 83.82| Parle |
|106.58| Parle |
| 90.51| Flipkart |
| 11.48| Flipkart |
| 18.47| Flipkart |
| 40.92| Flipkart |
| 30.0| Flipkart |
+------+-------------------+
DF2
+--------------------+-------+----------+
| Brand | P1 | P2 |
+--------------------+-------+----------+
| Parle| 37.00 | 100.15 |
| Flipkart| 10.0 | 30.0 |
+--------------------+-------+----------+
If the amount for, say, brand Parle in DF1 is less than the P1 value in DF2 for "Parle" (Amount < P1), then the flag will be low; if P1 <= Amount <= P2, the flag will be mid; and if Amount > P2, it will be high.
Similarly for the other brands.
DF1 holds very large data while DF2 is very small.
expected output
+------+-------------------+----------------+
|AMOUNT|Brand | Flag |
+------+-------------------+----------------+
| 47.88| Parle | mid |
| 40.92| Parle | mid |
| 83.82| Parle | mid |
|106.58| Parle | high |
| 90.51| Flipkart | high |
| 11.48| Flipkart | mid |
| 18.47| Flipkart | mid |
| 40.92| Flipkart | high |
| 30.0| Flipkart | mid |
+------+-------------------+----------------+
I know that I can do a join and get the results, but how should I frame the logic in Spark?
A simple left join and the nested when inbuilt function should get you your desired result:
import org.apache.spark.sql.functions._

df1.join(df2, Seq("Brand"), "left")
  .withColumn("Flag", when(col("AMOUNT") < col("P1"), "low").otherwise(
    when(col("AMOUNT") >= col("P1") && col("AMOUNT") <= col("P2"), "mid").otherwise(
      when(col("AMOUNT") > col("P2"), "high").otherwise("unknown"))))
  .select("AMOUNT", "Brand", "Flag")
  .show(false)
which should give you
+------+--------+----+
|AMOUNT|Brand |Flag|
+------+--------+----+
|47.88 |Parle |mid |
|40.92 |Parle |mid |
|83.82 |Parle |mid |
|106.58|Parle |high|
|90.51 |Flipkart|high|
|11.48 |Flipkart|mid |
|18.47 |Flipkart|mid |
|40.92 |Flipkart|high|
|30.0 |Flipkart|mid |
+------+--------+----+
I hope the answer is helpful
This is also doable with a udf:
import org.apache.spark.sql.functions._

val df3 = df1.join(df2, Seq("Brand"), "left")
val mapper = udf((amount: Double, p1: Double, p2: Double) =>
  if (amount < p1) "low" else if (amount > p2) "high" else "mid")

df3.withColumn("Flag", mapper(df3("AMOUNT"), df3("P1"), df3("P2")))
  .select("AMOUNT", "Brand", "Flag")
  .show(false)
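Since the question notes that DF1 is very large while DF2 is very small, a broadcast hint on df2 keeps the join map-side and avoids shuffling DF1; a sketch along the lines of the first answer, assuming the same df1 and df2:
import org.apache.spark.sql.functions.{broadcast, col, when}

df1.join(broadcast(df2), Seq("Brand"), "left")
  .withColumn("Flag",
    when(col("AMOUNT") < col("P1"), "low")
      .when(col("AMOUNT") > col("P2"), "high")
      .when(col("AMOUNT").between(col("P1"), col("P2")), "mid")
      .otherwise("unknown")) // brands missing from df2 end up here
  .select("AMOUNT", "Brand", "Flag")
  .show(false)
Spark may broadcast small tables automatically below spark.sql.autoBroadcastJoinThreshold, but the explicit hint makes the intent clear.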