PySpark Column Creation by queuing filtered past rows

In PySpark, I want to add a new column to an existing table that, for each row, stores the last K texts from that user which had label 1.
Example:
Index | user_name | text | label |
0 | u1 | t0 | 0 |
1 | u1 | t1 | 1 |
2 | u2 | t2 | 0 |
3 | u1 | t3 | 1 |
4 | u2 | t4 | 0 |
5 | u2 | t5 | 1 |
6 | u2 | t6 | 1 |
7 | u1 | t7 | 0 |
8 | u1 | t8 | 1 |
9 | u1 | t9 | 0 |
After adding the new column (text_list), the table should look as follows, storing the last K = 2 matching texts for each user.
Index | user_name | text | label | text_list |
0 | u1 | t0 | 0 | [] |
1 | u1 | t1 | 1 | [] |
2 | u2 | t2 | 0 | [] |
3 | u1 | t3 | 1 | [t1] |
4 | u2 | t4 | 0 | [] |
5 | u2 | t5 | 1 | [] |
6 | u2 | t6 | 1 | [t5] |
7 | u1 | t7 | 0 | [t3, t1] |
8 | u1 | t8 | 1 | [t3, t1] |
9 | u1 | t9 | 0 | [t8, t3] |
A naïve way to do this would be to loop through each row and maintain a queue for each user (roughly sketched below), but the table could have millions of rows. Can we do this without looping, in a more scalable and efficient way?
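For reference, the loop-and-queue idea would look roughly like this in plain Python (a sketch only; rows is a hypothetical in-memory list of row dicts, which is exactly the part that does not scale):
from collections import defaultdict, deque

K = 2
queues = defaultdict(lambda: deque(maxlen=K))           # one queue per user, newest label-1 text first
text_lists = []
for row in sorted(rows, key=lambda r: r["Index"]):      # rows: hypothetical in-memory data
    text_lists.append(list(queues[row["user_name"]]))   # snapshot of past label-1 texts for this row
    if row["label"] == 1:
        queues[row["user_name"]].appendleft(row["text"])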

If you are using Spark version >= 2.4, here is a way you can try. Let's say df is your dataframe.
df.show()
# +-----+---------+----+-----+
# |Index|user_name|text|label|
# +-----+---------+----+-----+
# | 0| u1| t0| 0|
# | 1| u1| t1| 1|
# | 2| u2| t2| 0|
# | 3| u1| t3| 1|
# | 4| u2| t4| 0|
# | 5| u2| t5| 1|
# | 6| u2| t6| 1|
# | 7| u1| t7| 0|
# | 8| u1| t8| 1|
# | 9| u1| t9| 0|
# +-----+---------+----+-----+
Two steps:
1. get a list of structs of the text and label columns over a window using collect_list
2. filter the array to keep entries where label = 1, extract the text values, sort the array in descending order using sort_array, and take the first two elements using slice
It would look something like this:
from pyspark.sql.functions import col, collect_list, struct, expr, sort_array, slice
from pyspark.sql.window import Window
# window: all rows from the start of the partition up to the row before the current one
w = Window.partitionBy('user_name').orderBy('Index').rowsBetween(Window.unboundedPreceding, -1)
df = (df
      .withColumn('text_list', collect_list(struct(col('text'), col('label'))).over(w))
      # keep structs with label = 1, take their text, sort descending and keep the first 2
      .withColumn('text_list', slice(sort_array(expr("FILTER(text_list, value -> value.label = 1).text"), asc=False), 1, 2))
     )
df.sort('Index').show()
# +-----+---------+----+-----+---------+
# |Index|user_name|text|label|text_list|
# +-----+---------+----+-----+---------+
# | 0| u1| t0| 0| []|
# | 1| u1| t1| 1| []|
# | 2| u2| t2| 0| []|
# | 3| u1| t3| 1| [t1]|
# | 4| u2| t4| 0| []|
# | 5| u2| t5| 1| []|
# | 6| u2| t6| 1| [t5]|
# | 7| u1| t7| 0| [t3, t1]|
# | 8| u1| t8| 1| [t3, t1]|
# | 9| u1| t9| 0| [t8, t3]|
# +-----+---------+----+-----+---------+

Thanks to the solution posted here. I modified it slightly (since it assumed the text field can be sorted) and finally arrived at a working solution. Here it is:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, lit, collect_list, slice, reverse

K = 2
# all rows strictly before the current row, per user
windowPast = (Window.partitionBy("user_name")
                    .orderBy("Index")
                    .rowsBetween(Window.unboundedPreceding, Window.currentRow - 1))

(df
 .withColumn("text_list",
             collect_list(when(col("label") == 1, col("text")).otherwise(lit(None)))  # nulls are skipped by collect_list
             .over(windowPast))
 .withColumn("text_list", slice(reverse(col("text_list")), 1, K))  # newest first, keep at most K
 .sort(col("Index"))
 .show())
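For reuse, the same idea can be wrapped in a small helper. This is only a sketch; the function name and keyword arguments (with_last_k_positive_texts, user_col, and so on) are illustrative and not part of the original answer:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def with_last_k_positive_texts(df, k, user_col="user_name", order_col="Index",
                               text_col="text", label_col="label", out_col="text_list"):
    # all rows strictly before the current row, per user
    w = (Window.partitionBy(user_col)
               .orderBy(order_col)
               .rowsBetween(Window.unboundedPreceding, -1))
    collected = F.collect_list(
        F.when(F.col(label_col) == 1, F.col(text_col))  # rows with other labels become null and are dropped
    ).over(w)
    return df.withColumn(out_col, F.slice(F.reverse(collected), 1, k))

# result = with_last_k_positive_texts(df, K).sort("Index")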

Related

PySpark Window Function with Conditional Reset

I have a dataframe like this
| user_id | activity_date |
| -------- | ------------ |
| 49630701 | 1/1/2019 |
| 49630701 | 1/10/2019 |
| 49630701 | 1/28/2019 |
| 49630701 | 2/5/2019 |
| 49630701 | 3/10/2019 |
| 49630701 | 3/21/2019 |
| 49630701 | 5/25/2019 |
| 49630701 | 5/28/2019 |
| 49630701 | 9/10/2019 |
| 49630701 | 1/1/2020 |
| 49630701 | 1/10/2020 |
| 49630701 | 1/28/2020 |
| 49630701 | 2/10/2020 |
| 49630701 | 3/10/2020 |
What I need to create is the "Group" column. The logic: for every user, keep the same group # while the cumulative date difference stays at or below 30 days; whenever the cumulative date difference exceeds 30 days, increment the group # and reset the cumulative date difference to zero.
| user_id | activity_date | Group |
| -------- | ------------ | ----- |
| 49630701 | 1/1/2019 | 1 |
| 49630701 | 1/10/2019 | 1 |
| 49630701 | 1/28/2019 | 1 |
| 49630701 | 2/5/2019 | 2 | <- Cumulative date diff till here is 35, which is greater than 30, so increment the Group by 1 and reset the cumulative diff to 0
| 49630701 | 3/10/2019 | 3 |
| 49630701 | 3/21/2019 | 3 |
| 49630701 | 5/25/2019 | 4 |
| 49630701 | 5/28/2019 | 4 |
| 49630701 | 9/10/2019 | 5 |
| 49630701 | 1/1/2020 | 6 |
| 49630701 | 1/10/2020 | 6 |
| 49630701 | 1/28/2020 | 6 |
| 49630701 | 2/10/2020 | 7 |
| 49630701 | 3/10/2020 | 7 |
I tried the below code with a loop, but it is not efficient and runs for hours. Is there a better way to achieve this? Any help would be really appreciated.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag, datediff, when, lit, rank

df = spark.read.table('excel_file')
df1 = df.select(col("user_id"), col("activity_date")).distinct()
partitionWindow = Window.partitionBy("user_id").orderBy(col("activity_date").asc())
lagTest = lag(col("activity_date"), 1, "0000-00-00 00:00:00").over(partitionWindow)
df1 = df1.select(col("*"), (datediff(col("activity_date"), lagTest)).cast("int").alias("diff_val_with_previous"))
df1 = df1.withColumn('diff_val_with_previous', when(col('diff_val_with_previous').isNull(), lit(0)).otherwise(col('diff_val_with_previous')))
distinctUser = [i['user_id'] for i in df1.select(col("user_id")).distinct().collect()]
rankTest = rank().over(partitionWindow)
df2 = df1.select(col("*"), rankTest.alias("rank"))

interimSessionThreshold = 30
totalSessionTimeThreshold = 30

rowList = []
for x in distinctUser:
    tempDf = df2.filter(col("user_id") == x).orderBy(col('activity_date'))
    cumulDiff = 0
    group = 1
    startBatch = True
    len_df = tempDf.count()
    dp = 0
    for i in range(1, len_df + 1):
        r = tempDf.filter(col("rank") == i)
        dp = r.select("diff_val_with_previous").first()[0]
        cumulDiff += dp
        if (dp <= interimSessionThreshold) & (cumulDiff <= totalSessionTimeThreshold):
            startBatch = False
            rowList.append([r.select("user_id").first()[0], r.select("activity_date").first()[0], group])
        else:
            group += 1
            cumulDiff = 0
            startBatch = True
            dp = 0
            rowList.append([r.select("user_id").first()[0], r.select("activity_date").first()[0], group])

ddf = spark.createDataFrame(rowList, ['user_id', 'activity_date', 'group'])
I can think of two solutions, but neither of them matches exactly what you want:
from pyspark.sql import functions as F, Window
df.withColumn(
    "idx", F.monotonically_increasing_id()
).withColumn(
    "date_as_num", F.unix_timestamp("activity_date")
).withColumn(
    "group",
    F.min("idx").over(
        Window.partitionBy("user_id").orderBy("date_as_num").rangeBetween(-60 * 60 * 24 * 30, 0)
    )
).withColumn(
    "group", F.dense_rank().over(Window.partitionBy("user_id").orderBy("group"))
).show()
+--------+-------------+----------+-----------+-----+
| user_id|activity_date| idx|date_as_num|group|
+--------+-------------+----------+-----------+-----+
|49630701| 2019-01-01| 0| 1546300800| 1|
|49630701| 2019-01-10| 1| 1547078400| 1|
|49630701| 2019-01-28| 2| 1548633600| 1|
|49630701| 2019-02-05| 3| 1549324800| 2|
|49630701| 2019-03-10| 4| 1552176000| 3|
|49630701| 2019-03-21| 5| 1553126400| 3|
|49630701| 2019-05-25| 6| 1558742400| 4|
|49630701| 2019-05-28|8589934592| 1559001600| 4|
|49630701| 2019-09-10|8589934593| 1568073600| 5|
|49630701| 2020-01-01|8589934594| 1577836800| 6|
|49630701| 2020-01-10|8589934595| 1578614400| 6|
|49630701| 2020-01-28|8589934596| 1580169600| 6|
|49630701| 2020-02-10|8589934597| 1581292800| 7|
|49630701| 2020-03-10|8589934598| 1583798400| 8|
+--------+-------------+----------+-----------+-----+
or
df.withColumn(
    "group",
    F.datediff(
        F.col("activity_date"),
        F.lag("activity_date").over(Window.partitionBy("user_id").orderBy("activity_date")),
    ),
).withColumn(
    "group", F.sum("group").over(Window.partitionBy("user_id").orderBy("activity_date"))
).withColumn(
    "group", F.floor(F.coalesce(F.col("group"), F.lit(0)) / 30)
).withColumn(
    "group", F.dense_rank().over(Window.partitionBy("user_id").orderBy("group"))
).show()
+--------+-------------+-----+
| user_id|activity_date|group|
+--------+-------------+-----+
|49630701| 2019-01-01| 1|
|49630701| 2019-01-10| 1|
|49630701| 2019-01-28| 1|
|49630701| 2019-02-05| 2|
|49630701| 2019-03-10| 3|
|49630701| 2019-03-21| 3|
|49630701| 2019-05-25| 4|
|49630701| 2019-05-28| 4|
|49630701| 2019-09-10| 5|
|49630701| 2020-01-01| 6|
|49630701| 2020-01-10| 6|
|49630701| 2020-01-28| 7|
|49630701| 2020-02-10| 7|
|49630701| 2020-03-10| 8|
+--------+-------------+-----+
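If an exact match to the cumulative-reset rule is required, one more option (a sketch only, assuming Spark >= 3.0 with pandas/Arrow available and a df that carries just user_id plus a string activity_date in M/d/yyyy form) is to push the per-user loop into groupBy().applyInPandas, so the sequential logic runs once per user inside a pandas group instead of row by row on the driver:
import pandas as pd
from pyspark.sql import functions as F

def assign_groups(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("activity_date")
    gaps = pd.to_datetime(pdf["activity_date"]).diff().dt.days.fillna(0)
    group, cumul, out = 1, 0, []
    for gap in gaps:
        cumul += gap
        if cumul > 30:        # cumulative gap exceeded 30 days: new group, reset the counter
            group += 1
            cumul = 0
        out.append(group)
    return pdf.assign(group=out)

grouped = (
    df.withColumn("activity_date", F.to_date("activity_date", "M/d/yyyy"))
      .groupBy("user_id")
      .applyInPandas(assign_groups, schema="user_id long, activity_date date, group int")
)
grouped.orderBy("user_id", "activity_date").show()
Under those assumptions this should reproduce the Group column from the expected table above.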

Add column elements to a Dataframe Scala Spark

I have two dataframes, and I want every id from the first one to end up with a row for every name from the second one.
My dataframes look like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
2 | a | 2
2 | d | 4
name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It seems it's more than a simple join.
val df = df1.select("id").distinct().crossJoin(df2).join(
  df1,
  Seq("name", "id"),
  "left"
).orderBy("id", "name")
df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+
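For anyone doing the same from PySpark, a hedged equivalent of the answer above (assuming df1 holds id/name/rate and df2 holds the name list, as in the Scala snippet) would be:
result = (
    df1.select("id").distinct()
       .crossJoin(df2)
       .join(df1, ["name", "id"], "left")
       .orderBy("id", "name")
)
result.show()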

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
|---------------------|------------------|------------------|
| Customer | Month | Sales |
|---------------------|------------------|------------------|
| A | 3 | 40 |
|---------------------|------------------|------------------|
| A | 2 | 50 |
|---------------------|------------------|------------------|
| B | 1 | 20 |
|---------------------|------------------|------------------|
I need it in the format as below
|---------------------|------------------|------------------|------------------|
| Customer | Month 1 | Month 2 | Month 3 |
|---------------------|------------------|------------------|------------------|
| A | 0 | 50 | 40 |
|---------------------|------------------|------------------|------------------|
| B | 20 | 0 | 0 |
|---------------------|------------------|------------------|------------------|
Can you please help me out to solve this problem in PySpark?
This should help. I am assuming you are using SUM to aggregate the values from the original DF:
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2=(df.withColumn('COLUMN_LABELS',F.concat(F.lit('Month '),F.col('Month')))
.groupby('Customer')
.pivot('COLUMN_LABELS')
.agg(F.sum('Sales'))
.fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+
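If the month labels are known in advance, they can also be passed explicitly to pivot, which skips the extra pass Spark otherwise runs to discover the distinct pivot values (a minor variation on the same code, assuming the three labels below):
>>> df2 = (df.withColumn('COLUMN_LABELS', F.concat(F.lit('Month '), F.col('Month')))
...          .groupby('Customer')
...          .pivot('COLUMN_LABELS', ['Month 1', 'Month 2', 'Month 3'])
...          .agg(F.sum('Sales'))
...          .fillna(0))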

subtract two columns with null in spark dataframe

I am new to Spark. I have a dataframe df:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | null |
+----------+------------+-----------+
| 5 | null | null |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
When subtracting the two columns, one column has null, so the resulting column is also null.
df.withColumn("Sub", col("Column1") - col("Column2"))
Expected output should be:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | 4 |
+----------+------------+-----------+
| 5 | null | 5 |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
I don't want to replace Column2 with 0; it should stay null.
Can someone help me with this?
You can use the when function as follows:
import org.apache.spark.sql.functions._
df.withColumn("Sub", when(col("Column1").isNull(), lit(0)).otherwise(col("Column1")) - when(col("Column2").isNull(), lit(0)).otherwise(col("Column2")))
You should have the final result as:
+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
| 1| 2|-1.0|
| 4| null| 4.0|
| 5| null| 5.0|
| 6| 8|-2.0|
+-------+-------+----+
You can coalesce nulls to zero on both columns and then do the subtraction:
val df = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(2)),
  (Some(4), None),
  (Some(5), None),
  (Some(6), Some(8))
).toDF("A", "B")
df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show
+---+----+---+
| A| B|Sub|
+---+----+---+
| 1| 2| 1|
| 4|null| 4|
| 5|null| 5|
| 6| 8| 2|
+---+----+---+
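The same coalesce idea in PySpark would look roughly like this (a sketch, reusing the question's Column1/Column2 names):
from pyspark.sql import functions as F

df.withColumn(
    "Sub",
    F.abs(F.coalesce(F.col("Column1"), F.lit(0)) - F.coalesce(F.col("Column2"), F.lit(0)))
).show()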

how to output multiple (key,value) in spark map function

The format of the input data looks like this:
+--------------------+-------------+--------------------+
| StudentID| Right | Wrong |
+--------------------+-------------+--------------------+
| studentNo01 | a,b,c | x,y,z |
+--------------------+-------------+--------------------+
| studentNo02 | c,d | v,w |
+--------------------+-------------+--------------------+
And the format of the output should look like this:
+--------------------+---------+
| key | value|
+--------------------+---------+
| studentNo01,a | 1 |
+--------------------+---------+
| studentNo01,b | 1 |
+--------------------+---------+
| studentNo01,c | 1 |
+--------------------+---------+
| studentNo01,x | 0 |
+--------------------+---------+
| studentNo01,y | 0 |
+--------------------+---------+
| studentNo01,z | 0 |
+--------------------+---------+
| studentNo02,c | 1 |
+--------------------+---------+
| studentNo02,d | 1 |
+--------------------+---------+
| studentNo02,v | 0 |
+--------------------+---------+
| studentNo02,w | 0 |
+--------------------+---------+
Right means 1, Wrong means 0.
I want to process this data using a Spark map function or UDF, but I don't know how to deal with it. Can you help me, please? Thank you.
Use split and explode twice, then take the union:
val df = List(
("studentNo01","a,b,c","x,y,z"),
("studentNo02","c,d","v,w")
).toDF("StudenID","Right","Wrong")
+-----------+-----+-----+
| StudenID|Right|Wrong|
+-----------+-----+-----+
|studentNo01|a,b,c|x,y,z|
|studentNo02| c,d| v,w|
+-----------+-----+-----+
val pair = (
  df.select('StudenID, explode(split('Right, ",")))
    .select(concat_ws(",", 'StudenID, 'col).as("key"))
    .withColumn("value", lit(1))
).unionAll(
  df.select('StudenID, explode(split('Wrong, ",")))
    .select(concat_ws(",", 'StudenID, 'col).as("key"))
    .withColumn("value", lit(0))
)
+-------------+-----+
| key|value|
+-------------+-----+
|studentNo01,a| 1|
|studentNo01,b| 1|
|studentNo01,c| 1|
|studentNo02,c| 1|
|studentNo02,d| 1|
|studentNo01,x| 0|
|studentNo01,y| 0|
|studentNo01,z| 0|
|studentNo02,v| 0|
|studentNo02,w| 0|
+-------------+-----+
You can convert it to an RDD as follows:
val rdd = pair.map(r => (r.getString(0),r.getInt(1)))
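For completeness, a hedged PySpark sketch of the same split/explode/union approach (assuming the input columns are StudentID, Right and Wrong, as in the question):
from pyspark.sql import functions as F

right = (df.select("StudentID", F.explode(F.split("Right", ",")).alias("item"))
           .select(F.concat_ws(",", "StudentID", "item").alias("key"), F.lit(1).alias("value")))
wrong = (df.select("StudentID", F.explode(F.split("Wrong", ",")).alias("item"))
           .select(F.concat_ws(",", "StudentID", "item").alias("key"), F.lit(0).alias("value")))
pair = right.unionByName(wrong)

# and, mirroring the last step above, as an RDD of (key, value) pairs:
rdd = pair.rdd.map(lambda r: (r["key"], r["value"]))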