I need to do a vertical check on my dataset in PySpark to flag only the rows that match a certain condition.
In detail: I have to flag only the rows where a "PURCHASE + Seller" is preceded by a "SALE + Customer" (see the example below).
Example:
Input
+---+----------+----------+----------+
| id|order_type|Initiative|      date|
+---+----------+----------+----------+
|  1|  PURCHASE|    Seller|2022-02-11|
|  1|  PURCHASE|    Seller|2022-02-10|
|  1|  PURCHASE|    Seller|2022-02-09|
|  1|      SALE|  Customer|2022-02-08|
|  1|      SALE|  Customer|2022-02-07|
|  1|      SALE|  Customer|2022-02-06|
|  1|  PURCHASE|    Seller|2022-02-05|
|  1|      SALE|  Customer|2022-02-04|
|  1|  PURCHASE|    Seller|2022-02-03| <- note this row
|  2|  PURCHASE|  Customer|2022-02-11|
+---+----------+----------+----------+
Output
+---+----------+----------+----------+----+--------------------+
| id|order_type|Initiative|      date|flag|difference (in days)|
+---+----------+----------+----------+----+--------------------+
|  1|  PURCHASE|    Seller|2022-02-11|   1|                   3|
|  1|  PURCHASE|    Seller|2022-02-10|   1|                   2|
|  1|  PURCHASE|    Seller|2022-02-09|   1|                   1|
|  1|      SALE|  Customer|2022-02-08|   0|                    |
|  1|      SALE|  Customer|2022-02-07|   0|                    |
|  1|      SALE|  Customer|2022-02-06|   0|                    |
|  1|  PURCHASE|    Seller|2022-02-05|   1|                   1|
|  1|      SALE|  Customer|2022-02-04|   0|                    |
|  1|  PURCHASE|    Seller|2022-02-03|   0|                    | <- condition not satisfied
|  2|  PURCHASE|  Customer|2022-02-11|   0|                    |
+---+----------+----------+----------+----+--------------------+
Here's my implementation:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import Window
df = spark.createDataFrame(
    [
        ("1", "PURCHASE", "Seller", "2022-02-11"),
        ("1", "PURCHASE", "Seller", "2022-02-10"),
        ("1", "PURCHASE", "Seller", "2022-02-09"),
        ("1", "SALE", "Customer", "2022-02-08"),
        ("1", "SALE", "Customer", "2022-02-07"),
        ("1", "SALE", "Customer", "2022-02-06"),
        ("1", "PURCHASE", "Seller", "2022-02-05"),
        ("1", "SALE", "Customer", "2022-02-04"),
        ("1", "PURCHASE", "Seller", "2022-02-03"),
        ("2", "PURCHASE", "Customer", "2022-02-11"),
    ],
    ["id", "order_type", "Initiative", "date"],
)
df = df.withColumn("date", F.col("date").cast(DateType()))
df.show()

# Candidate "SALE + Customer" rows that can precede a purchase.
sale_df = df.filter((F.lower(F.col("order_type")) == "sale") & (F.lower(F.col("Initiative")) == "customer"))
sale_df.show()

# Keep only the most recent preceding sale per purchase row.
row_window = Window.partitionBy(
    "df.id",
    "df.order_type",
    "df.Initiative",
    "df.date",
).orderBy(F.col("s.date").desc())

final_df = (
    df.alias("df")
    .join(
        sale_df.alias("s"),
        on=(
            (F.col("s.date") < F.col("df.date"))
            & (F.lower(F.col("df.order_type")) == "purchase")
            & (F.lower(F.col("df.Initiative")) == "seller")
        ),
        how="left",
    )
    .withColumn("row_num", F.row_number().over(row_window))
    .filter(F.col("row_num") == 1)
    .withColumn("day_diff", F.datediff(F.col("df.date"), F.col("s.date")))
    .withColumn(
        "flag",
        F.when(
            F.col("s.id").isNull(),
            F.lit(0),
        ).otherwise(F.lit(1)),
    )
    .select("df.*", "flag", "day_diff")
    .orderBy(F.col("df.id").asc(), F.col("df.date").desc())
)
final_df.show()
OUTPUTS:
+---+----------+----------+----------+
| id|order_type|Initiative| date|
+---+----------+----------+----------+
| 1| PURCHASE| Seller|2022-02-11|
| 1| PURCHASE| Seller|2022-02-10|
| 1| PURCHASE| Seller|2022-02-09|
| 1| SALE| Customer|2022-02-08|
| 1| SALE| Customer|2022-02-07|
| 1| SALE| Customer|2022-02-06|
| 1| PURCHASE| Seller|2022-02-05|
| 1| SALE| Customer|2022-02-04|
| 1| PURCHASE| Seller|2022-02-03|
| 2| PURCHASE| Customer|2022-02-11|
+---+----------+----------+----------+
+---+----------+----------+----------+
| id|order_type|Initiative| date|
+---+----------+----------+----------+
| 1| SALE| Customer|2022-02-08|
| 1| SALE| Customer|2022-02-07|
| 1| SALE| Customer|2022-02-06|
| 1| SALE| Customer|2022-02-04|
+---+----------+----------+----------+
final output:
+---+----------+----------+----------+----+--------+
| id|order_type|Initiative| date|flag|day_diff|
+---+----------+----------+----------+----+--------+
| 1| PURCHASE| Seller|2022-02-11| 1| 3|
| 1| PURCHASE| Seller|2022-02-10| 1| 2|
| 1| PURCHASE| Seller|2022-02-09| 1| 1|
| 1| SALE| Customer|2022-02-08| 0| null|
| 1| SALE| Customer|2022-02-07| 0| null|
| 1| SALE| Customer|2022-02-06| 0| null|
| 1| PURCHASE| Seller|2022-02-05| 1| 1|
| 1| SALE| Customer|2022-02-04| 0| null|
| 1| PURCHASE| Seller|2022-02-03| 0| null|
| 2| PURCHASE| Customer|2022-02-11| 0| null|
+---+----------+----------+----------+----+--------+
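An alternative that avoids the self-join is to carry the date of the most recent preceding SALE + Customer row along each id with a window. A minimal sketch, assuming dates are unique within each id; on the sample data it should produce the same flag and day_diff values:

from pyspark.sql import functions as F
from pyspark.sql import Window

# All rows of the same id that come strictly before the current row by date.
w = Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, -1)

# Date of the latest preceding SALE + Customer row, if any.
last_sale = F.max(
    F.when(
        (F.lower(F.col("order_type")) == "sale") & (F.lower(F.col("Initiative")) == "customer"),
        F.col("date"),
    )
).over(w)

window_df = (
    df.withColumn("last_sale_date", last_sale)
    .withColumn(
        "flag",
        F.when(
            (F.lower(F.col("order_type")) == "purchase")
            & (F.lower(F.col("Initiative")) == "seller")
            & F.col("last_sale_date").isNotNull(),
            F.lit(1),
        ).otherwise(F.lit(0)),
    )
    .withColumn("day_diff", F.when(F.col("flag") == 1, F.datediff(F.col("date"), F.col("last_sale_date"))))
    .drop("last_sale_date")
    .orderBy(F.col("id").asc(), F.col("date").desc())
)
window_df.show()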
Related
I need to solve the following problem without GraphFrames. Please help.
Input Dataframe
|-----------+-----------+--------------|
| ID | prev | next |
|-----------+-----------+--------------|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 2 | null |
| 9 | 9 | null |
|-----------+-----------+--------------|
Output Dataframe
|-----------+------------|
| bill_id | item_id |
|-----------+------------|
| 1 | [1, 2, 3] |
| 9 | [9] |
|-----------+------------|
This is probably quite inefficient, but it works. It is inspired by how GraphFrames computes connected components: basically, join the DataFrame with itself on the prev column until prev doesn't get any lower, then group.
df = sc.parallelize([(1, 1, 2), (2, 1, 3), (3, 2, None), (9, 9, None)]).toDF(['ID', 'prev', 'next'])
df.show()
+---+----+----+
| ID|prev|next|
+---+----+----+
| 1| 1| 2|
| 2| 1| 3|
| 3| 2|null|
| 9| 9|null|
+---+----+----+
converged = False
count = 0
while not converged:
    step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'), 'prev', 'left').cache()
    print('step', count)
    step.show()
    converged = step.where('prev != lower_prev').count() == 0
    df = step.selectExpr('ID', 'lower_prev as prev')
    print('df', count)
    df.show()
    count += 1
step 0
+----+---+----+----------+
|prev| ID|next|lower_prev|
+----+---+----+----------+
| 2| 3|null| 1|
| 1| 2| 3| 1|
| 1| 1| 2| 1|
| 9| 9|null| 9|
+----+---+----+----------+
df 0
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
step 1
+----+---+----------+
|prev| ID|lower_prev|
+----+---+----------+
| 1| 3| 1|
| 1| 1| 1|
| 1| 2| 1|
| 9| 9| 9|
+----+---+----------+
df 1
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
from pyspark.sql import functions as F

df.groupBy('prev').agg(F.collect_set('ID').alias('item_id')).withColumnRenamed('prev', 'bill_id').show()
+-------+---------+
|bill_id| item_id|
+-------+---------+
| 1|[1, 2, 3]|
| 9| [9]|
+-------+---------+
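One caveat with the loop above: every pass extends the lineage of df through another self-join, which can slow down planning for deeper chains. A hedged sketch of the same loop with localCheckpoint() to truncate the lineage (prints dropped for brevity):

converged = False
count = 0
while not converged:
    step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'), 'prev', 'left')
    converged = step.where('prev != lower_prev').count() == 0
    # localCheckpoint() materializes the intermediate result and cuts the
    # lineage that the repeated self-joins would otherwise keep growing.
    df = step.selectExpr('ID', 'lower_prev as prev').localCheckpoint()
    count += 1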
I have a dataframe df
+-----+--------+----------+-----+
|count|currency| date|value|
+-----+--------+----------+-----+
| 3| GBP|2021-01-14| 4|
| 102| USD|2021-01-14| 3|
| 234| EUR|2021-01-14| 5|
| 28| GBP|2021-01-16| 5|
| 48| USD|2021-01-16| 7|
| 68| EUR|2021-01-15| 6|
| 20| GBP|2021-01-15| 1|
| 33| EUR|2021-01-17| 2|
| 106| GBP|2021-01-17| 10|
+-----+--------+----------+-----+
I have a separate dataframe, exchange_rate, with exchange rates relative to USD:
val exchange_rate = spark.read.format("csv").load("/Users/khan/data/exchange_rate.csv")
exchange_rate.show()
+----------+------------+--------+
|INSERTTIME|EXCAHNGERATE|CURRENCY|
+----------+------------+--------+
|2021-01-14|    0.731422|     GBP|
|2021-01-14|    0.784125|     EUR|
|2021-01-15|    0.701922|     GBP|
|2021-01-15|    0.731422|     EUR|
|2021-01-16|    0.851422|     GBP|
|2021-01-16|    0.721128|     EUR|
|2021-01-17|    0.771621|     GBP|
|2021-01-17|    0.751426|     EUR|
+----------+------------+--------+
I want to convert the GBP and EUR values in df to USD by looking up the exchange rate for the corresponding date in the exchange_rate dataframe.
My idea: import com.currency_converter.CurrencyConverter from http://xavierguihot.com/currency_converter/#com.currency_converter.CurrencyConverter$
Is there a simpler way to do it?
You can use a correlated subquery (a fancy way of doing joins):
df.createOrReplaceTempView("df1")
exchange_rate.createOrReplaceTempView("df2")
val result = spark.sql("""
  select count, 'USD' as currency, date, value,
         value * coalesce(
           (select min(df2.EXCAHNGERATE)
            from df2
            where df1.date = df2.INSERTTIME and df1.currency = df2.CURRENCY),
           1 -- use 1 as exchange rate if no exchange rate found
         ) as converted
  from df1
""")
result.show
+-----+--------+----------+-----+---------+
|count|currency| date|value|converted|
+-----+--------+----------+-----+---------+
| 3| USD|2021-01-14| 4| 2.925688|
| 102| USD|2021-01-14| 3| 3.0|
| 234| USD|2021-01-14| 5| 3.920625|
| 28| USD|2021-01-16| 5| 4.25711|
| 48| USD|2021-01-16| 7| 7.0|
| 68| USD|2021-01-15| 6| 4.388532|
| 20| USD|2021-01-15| 1| 0.701922|
| 33| USD|2021-01-17| 2| 1.502852|
| 106| USD|2021-01-17| 10| 7.71621|
+-----+--------+----------+-----+---------+
You can join both DataFrames and operate row by row:
val dfJoin = df1.join(df2, df1.col("date") === df2.col("INSERTTIME") &&
  df1.col("currency") === df2.col("CURRENCY"), "left")
dfJoin.withColumn("USD", col("value") * col("EXCAHNGERATE")).show()
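The same left join can also be written with the DataFrame API in PySpark, falling back to a rate of 1 for rows that have no matching exchange rate. A minimal sketch, assuming df and exchange_rate are loaded as PySpark DataFrames with the column names shown above:

from pyspark.sql import functions as F

converted = (
    df.join(
        exchange_rate,
        (df["date"] == exchange_rate["INSERTTIME"]) & (df["currency"] == exchange_rate["CURRENCY"]),
        "left",
    )
    # Rows with no matching rate (the USD rows) fall back to a rate of 1.
    .withColumn(
        "converted",
        F.col("value") * F.coalesce(F.col("EXCAHNGERATE").cast("double"), F.lit(1.0)),
    )
    .select("count", F.lit("USD").alias("currency"), "date", "value", "converted")
)
converted.show()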
I have the following data:
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
This could be generated by the following code:
val df = sc.parallelize(Array(
  (1, 20, "aaa"),
  (1, 5, "ggg"),
  (2, 3, "ccc"),
  (1, 20, "ppp"),
  (1, 5, "ddd"),
  (2, 20, "eee"),
  (2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
Group by user_id and time, order by time, and rank each group. Thanks~
To rank the rows you can use the dense_rank window function, and the ordering can be achieved with a final orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank}
val w = Window.partitionBy("user_id").orderBy("user_id", "time")
val result = df
.withColumn("order_id", dense_rank().over(w))
.orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the order within the item column is not guaranteed.
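For reference, the same approach in PySpark, with item added to the final sort so the displayed order becomes deterministic. A minimal sketch, assuming the same column names:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("user_id").orderBy("time")

result = (
    df.withColumn("order_id", F.dense_rank().over(w))
    # Adding item to the sort makes the displayed row order deterministic.
    .orderBy("user_id", "time", "item")
)
result.show()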
My input:
val df = sc.parallelize(Seq(
  ("0", "car1", "success"),
  ("0", "car1", "success"),
  ("0", "car3", "success"),
  ("0", "car2", "success"),
  ("1", "car1", "success"),
  ("1", "car2", "success"),
  ("0", "car3", "success")
)).toDF("id", "item", "status")
My intermediate groupBy output looks like this:
val df2 = df.groupBy("id", "item").agg(count("item").alias("occurences"))
+---+----+----------+
| id|item|occurences|
+---+----+----------+
| 0|car3| 2|
| 0|car2| 1|
| 0|car1| 2|
| 1|car2| 1|
| 1|car1| 1|
+---+----+----------+
The output I would like is the sum of occurrences of the other items for the same id, skipping the occurrences value of the current item.
For example, in the output table below, car3 appeared 2 times for id "0", car2 appeared 1 time, and car1 appeared 2 times.
So for id "0", the sum of other occurrences for its "car3" item would be car2 (1) + car1 (2) = 3.
For the same id "0", the sum of other occurrences for its "car2" item would be car3 (2) + car1 (2) = 4.
This continues for the rest. Sample output:
+---+----+----------+----------------------+
| id|item|occurences| other_occurences_sum |
+---+----+----------+----------------------+
| 0|car3| 2| 3 |<- (car2+car1) for id 0
| 0|car2| 1| 4 |<- (car3+car1) for id 0
| 0|car1| 2| 3 |<- (car3+car2) for id 0
| 1|car2| 1| 1 |<- (car1) for id 1
| 1|car1| 1| 1 |<- (car2) for id 1
+---+----+----------+----------------------+
That's a perfect target for a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val w = Window.partitionBy("id")
df2.withColumn(
"other_occurences_sum", sum($"occurences").over(w) - $"occurences"
).show
// +---+----+----------+--------------------+
// | id|item|occurences|other_occurences_sum|
// +---+----+----------+--------------------+
// | 0|car3| 2| 3|
// | 0|car2| 1| 4|
// | 0|car1| 2| 3|
// | 1|car2| 1| 1|
// | 1|car1| 1| 1|
// +---+----+----------+--------------------+
where sum($"occurences").over(w) is the sum of all occurrences for the current id. Of course, a join is also valid:
df2.join(
df2.groupBy("id").agg(sum($"occurences") as "total"), Seq("id")
).select(
$"*", ($"total" - $"occurences") as "other_occurences_sum"
).show
// +---+----+----------+--------------------+
// | id|item|occurences|other_occurences_sum|
// +---+----+----------+--------------------+
// | 0|car3| 2| 3|
// | 0|car2| 1| 4|
// | 0|car1| 2| 3|
// | 1|car2| 1| 1|
// | 1|car1| 1| 1|
// +---+----+----------+--------------------+
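The same window trick in PySpark, as a minimal sketch assuming df2 has the columns shown above:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("id")

# Total occurrences per id minus the current row's own count.
df2.withColumn(
    "other_occurences_sum",
    F.sum("occurences").over(w) - F.col("occurences"),
).show()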
I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A:
+----+-------+------+
|id |COUNTRY| MONTH|
+----+-------+------+
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
+----+-------+------+
And a dataframe B:
+-------+------+------+
|COLUMN |VALUE | PRIO |
+-------+------+------+
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
+-------+------+------+
The idea is to apply the "rules" of dataframe B to dataframe A in order to get this result:
dataframe A':
+----+-------+------+------+
|id |COUNTRY| MONTH| PRIO |
+----+-------+------+------+
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
+----+-------+------+------+
I tried something like this:
dfB.collect.foreach { r =>
  val dfAp = dfA.where(col(r.getAs[String]("COLUMN")) === r.getAs[String]("VALUE"))
  dfAp.withColumn("PRIO", lit(r.getAs[Int]("PRIO")))
}
But I'm sure it's not the right way.
What is the right strategy to solve this problem in Spark?
Working under the assumption that the set of rules is reasonably small (possible concerns are the size of the data and the size of the generated expression, which in the worst-case scenario can crash the planner), the simplest solution is to collect the rules locally and map them to a SQL expression:
import org.apache.spark.sql.functions.{coalesce, col, lit, min, when}

val df = Seq(
  (1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
  (5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")

val rules = Seq(
  ("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")

// One WHEN branch per rule, with 20 as the fallback when nothing matches.
val prio = coalesce(rules.as[(String, String, Int)].collect.map {
  case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)

df.withColumn("PRIO", prio).show()
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
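For example, to take the smallest matching priority instead of the first match, the same construction in PySpark could look like this minimal sketch, where rules_local is a hypothetical name for the rules collected to the driver:

from pyspark.sql import functions as F

# rules_local stands in for the collected rules, i.e. (COLUMN, VALUE, PRIO) tuples.
rules_local = [("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)]

# least() skips nulls, so unmatched rules drop out and 20 acts as the default.
prio_least = F.least(*([F.when(F.col(c) == v, F.lit(p)) for c, v, p in rules_local] + [F.lit(20)]))

df.withColumn("PRIO", prio_least).show()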
With a larger set of rules you could:

Melt the data to convert it to a long format (using a melt helper; see the stack-based sketch after the output below):
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")

Join by COLUMN and VALUE, then aggregate PRIO by id with an appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
  .groupBy("id")
  .agg(min("PRIO").alias("PRIO"))

Outer join the output with df by id, filling missing priorities with the default:
df.join(priorities, Seq("id"), "leftouter").na.fill(20)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
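If a melt helper is not available in your Spark version, the same long format can be produced with the SQL stack expression. A minimal PySpark sketch, assuming the COUNTRY and MONTH columns are both strings as in the example above:

# Unpivot COUNTRY and MONTH into (COLUMN, VALUE) pairs without a melt helper.
df_long = df.selectExpr(
    "id",
    "stack(2, 'COUNTRY', COUNTRY, 'MONTH', MONTH) as (COLUMN, VALUE)",
)
df_long.show()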
Let's assume the set of rules in dataframe B is limited.
I have created a dataframe "df" for the table below:
+---+-------+-----+
| id|COUNTRY|MONTH|
+---+-------+-----+
|  1|     US|    1|
|  2|     FR|    1|
|  4|     DE|    1|
|  5|     DE|    2|
|  3|     DE|    3|
+---+-------+-----+
Using a UDF:
val code = udf { (x: String, y: Int) =>
  if (x == "US") "5" else if (x == "FR") "15" else if (y == 3) "2" else "20"
}
df.withColumn("PRIO", code($"COUNTRY", $"MONTH")).show()
Output:
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
|  1|     US|    1|   5|
|  2|     FR|    1|  15|
|  4|     DE|    1|  20|
|  5|     DE|    2|  20|
|  3|     DE|    3|   2|
+---+-------+-----+----+