Design a saturated sum with window functions - pyspark

In a PySpark dataframe I have a column with the values 1, -1 and 0, representing the "startup", "shutdown" and "other" events of an engine. I want to build a column with the state of the engine: 1 when the engine is on and 0 when it's off, something like:
+---+-----+-----+
|seq|event|state|
+---+-----+-----+
|  1|    1|    1|
|  2|    0|    1|
|  3|   -1|    0|
|  4|    0|    0|
|  5|    0|    0|
|  6|    1|    1|
|  7|   -1|    0|
+---+-----+-----+
If 1s and -1s always alternate, this can easily be done with a window function such as
sum('event').over(Window.orderBy('seq'))
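Spelled out on the example data, that simple case would be something like (a minimal sketch, only valid while 1s and -1s strictly alternate):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [(1, 1), (2, 0), (3, -1), (4, 0), (5, 0), (6, 1), (7, -1)],
    ["seq", "event"])
df.withColumn("state", F.sum("event").over(Window.orderBy("seq"))).show()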
It can happen, though, that I have some spurious 1s or -1s, which I want to ignore if I am already in state 1 or 0, respectively. I would thus like to be able to do something like:
+---+-----+-----+
|seq|event|state|
+---+-----+-----+
|  1|    1|    1|
|  2|    0|    1|
|  3|    1|    1|
|  4|   -1|    0|
|  5|    0|    0|
|  6|   -1|    0|
|  7|    1|    1|
+---+-----+-----+
That would require a "saturated" sum function that never goes above 1 or below 0, or some other approach which I am not able to imagine at the moment.
Does somebody have any ideas?

You can achieve the desired result by using the last function to fill forward the most recent state change.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([
    (1, 1),
    (2, 0),
    (3, -1),
    (4, 0)
], ["seq", "event"])

w = Window.orderBy('seq')

# Replace zeros with nulls so they can be ignored easily.
df = df.withColumn('helperCol', F.when(df.event != 0, df.event))
# Fill state changes forward in a new column.
df = df.withColumn('state', F.last(df.helperCol, ignorenulls=True).over(w))
# Replace -1 values with 0.
df = df.replace(-1, 0, ['state'])
df.show()
This produces:
+---+-----+---------+-----+
|seq|event|helperCol|state|
+---+-----+---------+-----+
| 1| 1| 1| 1|
| 2| 0| null| 1|
| 3| -1| -1| 0|
| 4| 0| null| 0|
+---+-----+---------+-----+
helperCol need not be added to the dataframe; I have only included it to make the process more readable.
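If you prefer, the whole thing collapses into a single expression without the helper column (a sketch under the same assumptions):
df = df.withColumn(
    'state',
    F.last(F.when(df.event != 0, df.event), ignorenulls=True).over(w)
).replace(-1, 0, ['state'])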

Related

PySpark Column Creation by queuing filtered past rows

In PySpark, I want to make a new column in an existing table that stores the last K texts for a particular user that had label 1.
Example-
Index | user_name | text | label |
0     | u1        | t0   | 0     |
1     | u1        | t1   | 1     |
2     | u2        | t2   | 0     |
3     | u1        | t3   | 1     |
4     | u2        | t4   | 0     |
5     | u2        | t5   | 1     |
6     | u2        | t6   | 1     |
7     | u1        | t7   | 0     |
8     | u1        | t8   | 1     |
9     | u1        | t9   | 0     |
The table after adding the new column (text_list) should be as follows, storing the last K = 2 messages for each user.
Index | user_name | text | label | text_list |
0     | u1        | t0   | 0     | []        |
1     | u1        | t1   | 1     | []        |
2     | u2        | t2   | 0     | []        |
3     | u1        | t3   | 1     | [t1]      |
4     | u2        | t4   | 0     | []        |
5     | u2        | t5   | 1     | []        |
6     | u2        | t6   | 1     | [t5]      |
7     | u1        | t7   | 0     | [t3, t1]  |
8     | u1        | t8   | 1     | [t3, t1]  |
9     | u1        | t9   | 0     | [t8, t3]  |
A naïve way to do this would be to loop through each row and maintain a queue for each user. But the table could have millions of rows. Can we do this without looping in a more scalable, efficient way?
If you are using Spark version >= 2.4, there is a way you can try. Let's say df is your dataframe.
df.show()
# +-----+---------+----+-----+
# |Index|user_name|text|label|
# +-----+---------+----+-----+
# | 0| u1| t0| 0|
# | 1| u1| t1| 1|
# | 2| u2| t2| 0|
# | 3| u1| t3| 1|
# | 4| u2| t4| 0|
# | 5| u2| t5| 1|
# | 6| u2| t6| 1|
# | 7| u1| t7| 0|
# | 8| u1| t8| 1|
# | 9| u1| t9| 0|
# +-----+---------+----+-----+
Two steps:
1. Get a list of structs of the columns text and label over a window using collect_list.
2. Filter the array where label = 1 and take the text value, sort the array in descending order using sort_array, and keep the first two elements using slice.
It would be something like this:
from pyspark.sql.functions import col, collect_list, struct, expr, sort_array, slice
from pyspark.sql.window import Window

# window: from the user's first row up to the row before the current row
w = Window.partitionBy('user_name').orderBy('Index').rowsBetween(Window.unboundedPreceding, -1)

df = (df
      .withColumn('text_list', collect_list(struct(col('text'), col('label'))).over(w))
      .withColumn('text_list', slice(sort_array(expr("FILTER(text_list, value -> value.label = 1).text"), asc=False), 1, 2))
)
df.sort('Index').show()
# +-----+---------+----+-----+---------+
# |Index|user_name|text|label|text_list|
# +-----+---------+----+-----+---------+
# | 0| u1| t0| 0| []|
# | 1| u1| t1| 1| []|
# | 2| u2| t2| 0| []|
# | 3| u1| t3| 1| [t1]|
# | 4| u2| t4| 0| []|
# | 5| u2| t5| 1| []|
# | 6| u2| t6| 1| [t5]|
# | 7| u1| t7| 0| [t3, t1]|
# | 8| u1| t8| 1| [t3, t1]|
# | 9| u1| t9| 0| [t8, t3]|
# +-----+---------+----+-----+---------+
Thanks to the solution posted here. I modified it slightly (since it assumed the text field can be sorted) and finally arrived at a working solution. Here it is:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, collect_list, slice, reverse, lit

K = 2

# Window over all of the user's rows strictly before the current row.
windowPast = (Window.partitionBy("user_name")
              .orderBy("Index")
              .rowsBetween(Window.unboundedPreceding, -1))

(df
 # collect_list drops nulls, so rows with label != 1 contribute nothing
 .withColumn("text_list",
             collect_list(when(col("label") == 1, col("text")).otherwise(lit(None)))
             .over(windowPast))
 # newest first, keep at most K entries
 .withColumn("text_list", slice(reverse(col("text_list")), 1, K))
 .sort(col("Index"))
 .show())

How to enrich dataframe by adding columns in specific condition in pyspark?

I have two different dataframes:
users:
+-------+--------+--------+
|user_id|movie_id|timestep|
+-------+--------+--------+
|    100|    1000|20200728|
|    101|    1001|20200727|
|    101|    1002|20200726|
+-------+--------+--------+
movies:
+--------+---------+--------------------------+
|movie_id|    title|                     genre|
+--------+---------+--------------------------+
|    1000|Toy Story|Adventure|Animation|Chil..|
|    1001|  Jumanji|Adventure|Children|Fantasy|
|    1002| Iron Man|   Action|Adventure|Sci-Fi|
+--------+---------+--------------------------+
How can I get a dataframe in the following format, so that I can build each user's taste profile and compare users by a similarity score?
+-------+------+---------+---------+--------+-----+
|user_id|Action|Adventure|Animation|Children|Drama|
+-------+------+---------+---------+--------+-----+
|    100|     0|        1|        1|       1|    0|
|    101|     1|        2|        0|       1|    0|
+-------+------+---------+---------+--------+-----+
First, you need to split your "genre" column.
from pyspark.sql import functions as F

movies = movies.withColumn("genre", F.explode(F.split("genre", r"\|")))
# escape the | with \ because split uses a regex
then you join
user_movie = users.join(movies, on='movie_id')
and you pivot
user_movie.groupBy("user_id").pivot("genre").agg(F.count("*")).fillna(0).show()
+-------+------+---------+---------+--------+-------+------+
|user_id|Action|Adventure|Animation|Children|Fantasy|Sci-Fi|
+-------+------+---------+---------+--------+-------+------+
| 100| 0| 1| 1| 1| 0| 0|
| 101| 1| 2| 0| 1| 1| 1|
+-------+------+---------+---------+--------+-------+------+
FYI: the Drama column does not appear because there is no Drama genre in the movies dataframe. But with your full data, you will have one column for each genre.
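If you need a fixed set of genre columns no matter which genres actually occur in movies (for example, to always get a Drama column), pivot also accepts an explicit list of values. A sketch, where the genres list is an illustrative list you would maintain yourself:
genres = ["Action", "Adventure", "Animation", "Children", "Drama"]  # illustrative, not exhaustive
(user_movie
    .groupBy("user_id")
    .pivot("genre", genres)  # produces exactly these columns, in this order
    .agg(F.count("*"))
    .fillna(0)
    .show())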

How to make a transform of return type DataFrame => DataFrame which will give the product of 2 columns as the value and column1_column2 as the name

Input
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|  1|  0|  1|
+---+---+---+---+
Output
+---+---+---+---+---+---+---+
| id|  a|  b|  c|a_b|a_c|b_c|
+---+---+---+---+---+---+---+
|  1|  1|  0|  1|  0|  1|  0|
+---+---+---+---+---+---+---+
Basically I have a sequence of pairs, Seq((a,b),(a,c),(b,c)),
and the values of the new columns will be col(a)*col(b), col(a)*col(c) and col(b)*col(c).
I know how to add these columns to a dataframe, but I am not able to make a transform of return type DataFrame => DataFrame.
Is this what you want?
Take a look at the API page. You will save yourself some time :)
val df = Seq((1, 1, 0, 1))
  .toDF("id", "a", "b", "c")
  .withColumn("a_b", $"a" * $"b")
  .withColumn("a_c", $"a" * $"c")
  .withColumn("b_c", $"b" * $"c")
Output:
+---+---+---+---+---+---+---+
| id| a| b| c|a_b|a_c|b_c|
+---+---+---+---+---+---+---+
| 1| 1| 0| 1| 0| 1| 0|
+---+---+---+---+---+---+---+
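The snippet above adds the columns directly; to get an actual DataFrame => DataFrame transform driven by the list of pairs, you can fold the pairs over the dataframe (in Scala, a foldLeft over the Seq passed to Dataset.transform). For comparison with the rest of this thread, here is a sketch in PySpark; the helper name product_columns is made up, and DataFrame.transform requires Spark 3.0+ (on older versions just call the function directly):
from functools import reduce
from pyspark.sql import functions as F

pairs = [("a", "b"), ("a", "c"), ("b", "c")]

def product_columns(df):
    # Add one column per pair, named "x_y" and valued col(x) * col(y).
    return reduce(
        lambda acc, p: acc.withColumn(p[0] + "_" + p[1], F.col(p[0]) * F.col(p[1])),
        pairs, df)

result = df.transform(product_columns)  # a DataFrame -> DataFrame transform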

How do I transform a Spark dataframe so that my values become column names? [duplicate]

This question already has answers here: How to pivot Spark DataFrame? (10 answers)
I'm not sure of a good way to phrase the question, but an example will help. Here is the dataframe that I have with the columns: name, type, and count:
+------+------+-------+
| Name | Type | Count |
+------+------+-------+
| a    | 0    | 5     |
| a    | 1    | 4     |
| a    | 5    | 5     |
| a    | 4    | 5     |
| a    | 2    | 1     |
| b    | 0    | 2     |
| b    | 1    | 4     |
| b    | 3    | 5     |
| b    | 4    | 5     |
| b    | 2    | 1     |
| c    | 0    | 5     |
| c    | ...  | ...   |
+------+------+-------+
I want to get a new dataframe structured like this where the Type column values have become new columns:
+------+---+-----+---+---+---+---+
| Name | 0 | 1   | 2 | 3 | 4 | 5 |   <- number columns are the Type values from the input
+------+---+-----+---+---+---+---+
| a    | 5 | 4   | 1 | 0 | 5 | 5 |
| b    | 2 | 4   | 1 | 5 | 5 | 0 |
| c    | 5 | ... |   |   |   |   |
+------+---+-----+---+---+---+---+
The columns here are [Name,0,1,2,3,4,5].
Do this by using the pivot function in Spark.
val df2 = df.groupBy("Name").pivot("Type").sum("Count")
Here, if the Name and the Type are the same for two rows, the Count values are simply added together, but other aggregations are possible as well.
Resulting dataframe when using the example data in the question:
+----+---+----+----+----+----+----+
|Name| 0| 1| 2| 3| 4| 5|
+----+---+----+----+----+----+----+
| c| 5|null|null|null|null|null|
| b| 2| 4| 1| 5| 5|null|
| a| 5| 4| 1|null| 5| 5|
+----+---+----+----+----+----+----+
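For reference, the same pivot in PySpark, plus a different aggregation in case you do not want the counts summed (a sketch, assuming df holds the Name/Type/Count dataframe above):
from pyspark.sql import functions as F

df.groupBy("Name").pivot("Type").sum("Count").show()         # same as the Scala version
df.groupBy("Name").pivot("Type").agg(F.max("Count")).show()  # e.g. keep the largest Count instead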

How to output multiple (key, value) pairs in a Spark map function

The input data looks like this:
+-----------+-----+-----+
|  StudentID|Right|Wrong|
+-----------+-----+-----+
|studentNo01|a,b,c|x,y,z|
|studentNo02|  c,d|  v,w|
+-----------+-----+-----+
And the output should look like this:
+-------------+-----+
|          key|value|
+-------------+-----+
|studentNo01,a|    1|
|studentNo01,b|    1|
|studentNo01,c|    1|
|studentNo01,x|    0|
|studentNo01,y|    0|
|studentNo01,z|    0|
|studentNo02,c|    1|
|studentNo02,d|    1|
|studentNo02,v|    0|
|studentNo02,w|    0|
+-------------+-----+
Right means 1 and Wrong means 0.
I want to process these data using a Spark map function or a UDF, but I don't know how to do it. Can you help me, please? Thank you.
Use split and explode twice and do the union
val df = List(
  ("studentNo01", "a,b,c", "x,y,z"),
  ("studentNo02", "c,d", "v,w")
).toDF("StudentID", "Right", "Wrong")
+-----------+-----+-----+
|  StudentID|Right|Wrong|
+-----------+-----+-----+
|studentNo01|a,b,c|x,y,z|
|studentNo02|  c,d|  v,w|
+-----------+-----+-----+
val pair = (
  df.select('StudentID, explode(split('Right, ",")))
    .select(concat_ws(",", 'StudentID, 'col).as("key"))
    .withColumn("value", lit(1))
).unionAll(
  df.select('StudentID, explode(split('Wrong, ",")))
    .select(concat_ws(",", 'StudentID, 'col).as("key"))
    .withColumn("value", lit(0))
)
+-------------+-----+
| key|value|
+-------------+-----+
|studentNo01,a| 1|
|studentNo01,b| 1|
|studentNo01,c| 1|
|studentNo02,c| 1|
|studentNo02,d| 1|
|studentNo01,x| 0|
|studentNo01,y| 0|
|studentNo01,z| 0|
|studentNo02,v| 0|
|studentNo02,w| 0|
+-------------+-----+
You can convert it to an RDD of pairs as follows:
val rdd = pair.rdd.map(r => (r.getString(0), r.getInt(1)))
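Since the title mentions the map function: the same output can also be produced with a single flatMap that emits several (key, value) pairs per input row. A PySpark sketch for comparison (assuming a DataFrame df with the question's columns StudentID, Right and Wrong):
# flatMap: one input row fans out into several (key, value) tuples
rdd = df.rdd.flatMap(lambda r:
    [(r["StudentID"] + "," + x, 1) for x in r["Right"].split(",")] +
    [(r["StudentID"] + "," + x, 0) for x in r["Wrong"].split(",")])
rdd.toDF(["key", "value"]).show()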