How to enrich dataframe by adding columns in specific condition in pyspark? - pyspark

I have a two different dataframes:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How to get a dataframe in the following format? So I can get user's taste profile for comparing different users by their similarity score?
+-------+---------+---------+---------+--------+-----+
|user_id| Action |Adventure|Animation|Children|Drama|
+-------+---------+---------+---------+---------+----+
| 100 | 0 | 1 | 1 | 1 | 0 |
| 101 | 1 | 2 | 0 | 1 | 0 |
+-------+---------+---------+---------+--------+-----+

First, you need to split your "genre" column.
from pyspark.sql import functions as F
movies = movies.withColumn("genre", F.explode(F.split("genre", '\|')))
# use \ in front of | because split use regex
then you join
user_movie = users.join(movies, on='movie_id')
and you pivot
user_movie.groupBy("user_id").pivot("genre").agg(F.count("*")).fillna(0).show()
+-------+------+---------+---------+--------+-------+------+
|user_id|Action|Adventure|Animation|Children|Fantasy|Sci-Fi|
+-------+------+---------+---------+--------+-------+------+
| 100| 0| 1| 1| 1| 0| 0|
| 101| 1| 2| 0| 1| 1| 1|
+-------+------+---------+---------+--------+-------+------+
FYI : Drama column does not appear because there is no drama "genre" in the movies dataframe. But with your full data, you will have one column for each genre.

Related

PySpark Relate multiple rows

I have a problem where I need to relate rows to each other. I have tried many things but I am now completly stuck. I have tried partitioning, lag, groupbys but nothing works.
The rows below the ID 26 wil relate to the MPAN of 26
ID | MPAN | Value
---------------------
26 | 12345678 | Hello
27 | 99900234 | Bye
30 | 77563820 | Help
33 | 89898937 | Stuck
26 | 54877273 | Need a genius
29 | 54645643 | So close
30 | 22222222 | Thanks
e.g.
ID | MPAN | Value | Relation
----------------------------------------
26 | 12345678 | Hello | NULL
27 | 99900234 | Bye | 12345678
30 | 77563820 | Help | 12345678
33 | 89898937 | Stuck | 12345678
26 | 54877273 | Genius | NULL
29 | 54645643 | So close | 54877273
30 | 22222222 | Thanks | 54877273
This code below only works for previous row and not the LAG for the 26 record
df = spark.read.load('abfss://Files/', format='parquet')
df = df.withColumn("identity", F.monotonically_increasing_id())
win = Window.orderBy("identity")
condition = F.col("Prop_0") != '026'
df = df.withColumn("FlagY", F.when(condition, mpanlookup))
df.show()
As I said in my comment, you need a column to maintain the order. In your example, you used monotonically_increasing_id to create that "ordering" column, but that is absurd because
The function is non-deterministic because its result depends on partition IDs.
Assuming you have a proper "ordering" column :
df.show()
+---+---+--------+-------------+
|idx| ID| MPAN| Value|
+---+---+--------+-------------+
| 1| 26|12345678|Hello |
| 2| 27|99900234|Bye |
| 3| 30|77563820|Help |
| 4| 33|89898937|Stuck |
| 5| 26|54877273|Need a genius|
| 6| 29|54645643|So close |
| 7| 30|22222222|Thanks |
+---+---+--------+-------------+
you can simply do that with last function :
from pyspark.sql import functions as F, Window
df.withColumn(
"Relation",
F.last(F.when(F.col("ID") == 26, F.col("MPAN")), ignorenulls=True).over(
Window.orderBy("idx")
),
).show()
+---+---+--------+-------------+--------+
|idx| ID| MPAN| Value|Relation|
+---+---+--------+-------------+--------+
| 1| 26|12345678|Hello |12345678|
| 2| 27|99900234|Bye |12345678|
| 3| 30|77563820|Help |12345678|
| 4| 33|89898937|Stuck |12345678|
| 5| 26|54877273|Need a genius|54877273|
| 6| 29|54645643|So close |54877273|
| 7| 30|22222222|Thanks |54877273|
+---+---+--------+-------------+--------+

PySpark Column Creation by queuing filtered past rows

In PySpark, I want to make a new column in an existing table that stores the last K texts for a particular user that had label 1.
Example-
Index | user_name | text | label |
0 | u1 | t0 | 0 |
1 | u1 | t1 | 1 |
2 | u2 | t2 | 0 |
3 | u1 | t3 | 1 |
4 | u2 | t4 | 0 |
5 | u2 | t5 | 1 |
6 | u2 | t6 | 1 |
7 | u1 | t7 | 0 |
8 | u1 | t8 | 1 |
9 | u1 | t9 | 0 |
The table after the new column (text_list) should be as follows, storing last K = 2 messages for each user.
Index | user_name | text | label | text_list |
0 | u1 | t0 | 0 | [] |
1 | u1 | t1 | 1 | [] |
2 | u2 | t2 | 0 | [] |
3 | u1 | t3 | 1 | [t1] |
4 | u2 | t4 | 0 | [] |
5 | u2 | t5 | 1 | [] |
6 | u2 | t6 | 1 | [t5] |
7 | u1 | t7 | 0 | [t3, t1] |
8 | u1 | t8 | 1 | [t3, t1] |
9 | u1 | t9 | 0 | [t8, t3] |
A naïve way to do this would be to loop through each row and maintain a queue for each user. But the table could have millions of rows. Can we do this without looping in a more scalable, efficient way?
If you are using spark version >= 2.4, there is a way you can try. Let's say df is your dataframe.
df.show()
# +-----+---------+----+-----+
# |Index|user_name|text|label|
# +-----+---------+----+-----+
# | 0| u1| t0| 0|
# | 1| u1| t1| 1|
# | 2| u2| t2| 0|
# | 3| u1| t3| 1|
# | 4| u2| t4| 0|
# | 5| u2| t5| 1|
# | 6| u2| t6| 1|
# | 7| u1| t7| 0|
# | 8| u1| t8| 1|
# | 9| u1| t9| 0|
# +-----+---------+----+-----+
Two steps :
get list of struct of column text and label over a window using collect_list
filter array where label = 1 and get the text value, descending-sort the array using sort_array and get the first two elements using slice
It would be something like this
from pyspark.sql.functions import col, collect_list, struct, expr, sort_array, slice
from pyspark.sql.window import Window
# window : first row to row before current row
w = Window.partitionBy('user_name').orderBy('index').rowsBetween(Window.unboundedPreceding, -1)
df = (df
.withColumn('text_list', collect_list(struct(col('text'), col('label'))).over(w))
.withColumn('text_list', slice(sort_array(expr("FILTER(text_list, value -> value.label = 1).text"), asc=False), 1, 2))
)
df.sort('Index').show()
# +-----+---------+----+-----+---------+
# |Index|user_name|text|label|text_list|
# +-----+---------+----+-----+---------+
# | 0| u1| t0| 0| []|
# | 1| u1| t1| 1| []|
# | 2| u2| t2| 0| []|
# | 3| u1| t3| 1| [t1]|
# | 4| u2| t4| 0| []|
# | 5| u2| t5| 1| []|
# | 6| u2| t6| 1| [t5]|
# | 7| u1| t7| 0| [t3, t1]|
# | 8| u1| t8| 1| [t3, t1]|
# | 9| u1| t9| 0| [t8, t3]|
# +-----+---------+----+-----+---------+
Thanks to the solution posted here. I modified it slightly (since it assumed text field can be sorted) and was finally able to come to a working solution. Here it is:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, collect_list, slice, reverse
K = 2
windowPast = Window.partitionBy("user_name").orderBy("Index").rowsBetween(Window.unboundedPreceding, Window.currentRow-1)
df.withColumn("text_list", collect_list\
(when(col("label")==1,col("text"))\
.otherwise(F.lit(None)))\
.over(windowPast))\
.withColumn("text_list", slice(reverse(col("text_list")), 1, K))\
.sort(F.col("Index"))\
.show()

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
|---------------------|------------------|------------------|
| Customer | Month | Sales |
|---------------------|------------------|------------------|
| A | 3 | 40 |
|---------------------|------------------|------------------|
| A | 2 | 50 |
|---------------------|------------------|------------------|
| B | 1 | 20 |
|---------------------|------------------|------------------|
I need it in the format as below
|---------------------|------------------|------------------|------------------|
| Customer | Month 1 | Month 2 | Month 3 |
|---------------------|------------------|------------------|------------------|
| A | 0 | 50 | 40 |
|---------------------|------------------|------------------|------------------|
| B | 20 | 0 | 0 |
|---------------------|------------------|------------------|------------------|
Can you please help me out to solve this problem in PySpark?
This should help , i am assumming you are using SUM to aggregate vales from the originical DF
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2=(df.withColumn('COLUMN_LABELS',F.concat(F.lit('Month '),F.col('Month')))
.groupby('Customer')
.pivot('COLUMN_LABELS')
.agg(F.sum('Sales'))
.fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+

Removing alphabets from alphanumeric values present in column of dataframe of spark

The two column of dataframe looks like.
SKU | COMPSKU
PT25M | PT10M
PT3H | PT20M
TH | QR12
S18M | JH
spark with scala
How can i remove all alphabets and only numbers retain..
Expected output:
25|10
3|20
0|12
18|0
You could also do it this way.
df.withColumn(
"SKU",
when(regexp_replace(col("SKU"),"[a-zA-Z]","")==="",0
).otherwise(regexp_replace(col("SKU"),"[a-zA-Z]",""))
).withColumn(
"COMPSKU",
when(regexp_replace(col("COMPSKU"),"[a-zA-Z]","")==="", 0
).otherwise(regexp_replace(col("COMPSKU"),"[a-zA-Z]",""))
).show()
/*
+-----+-------+
| SKU|COMPSKU|
+-----+-------+
| 25 | 10 |
| 3 | 20 |
| 0 | 12 |
| 18 | 0 |
+-----+-------+
*/
Try with regexp_replace function then use case when otherwise statement to replace empty values with 0.
Example:
df.show()
/*
+-----+-------+
| SKU|COMPSKU|
+-----+-------+
|PT25M| PT10M|
| PT3H| PT20M|
| TH| QR12|
| S18M| JH|
+-----+-------+
*/
df.withColumn("SKU",regexp_replace(col("SKU"),"[a-zA-Z]","")).
withColumn("COMPSKU",regexp_replace(col("COMPSKU"),"[a-zA-Z]","")).
withColumn("SKU",when(length(trim(col("SKU")))===0,lit(0)).otherwise(col("SKU"))).
withColumn("COMPSKU",when(length(trim(col("COMPSKU")))===0,lit(0)).otherwise(col("COMPSKU"))).
show()
/*
+---+-------+
|SKU|COMPSKU|
+---+-------+
| 25| 10|
| 3| 20|
| 0| 12|
| 18| 0|
+---+-------+
*/

Spark dataframe: Pivot and Group based on columns

I have input dataframe as below with id, app, and customer
Input dataframe
+--------------------+-----+---------+
| id|app |customer |
+--------------------+-----+---------+
|id1 | fw| WM |
|id1 | fw| CS |
|id2 | fw| CS |
|id1 | fe| WM |
|id3 | bc| TR |
|id3 | bc| WM |
+--------------------+-----+---------+
Expected output
Using pivot and aggregate - make app values as column name and put aggregated customer names as list in the dataframe
Expected dataframe
+--------------------+----------+-------+----------+
| id| bc | fe| fw |
+--------------------+----------+-------+----------+
|id1 | 0 | WM| [WM,CS]|
|id2 | 0 | 0| [CS] |
|id3 | [TR,WM] | 0| 0 |
+--------------------+----------+-------+----------+
What have i tried ?
val newDF =
df.groupBy("id").pivot("app").agg(expr("coalesce(first(customer),0)")).drop("app").show()
+--------------------+-----+-------+------+
| id|bc | fe| fw|
+--------------------+-----+-------+------+
|id1 | 0 | WM| WM|
|id2 | 0 | 0| CS|
|id3 | TR | 0| 0|
+--------------------+-----+-------+------+
Issue : In my query , i am not able to get the list of customer like [WM,CS] for "id1" under "fw" (as shown in expected output) , only "WM" is coming. Similarly, for "id3" only "TR" is appearing - instead a list should appear with value [TR,WM] under "bc" for "id3"
Need your suggestion to get the list of customer under each app respectively.
You can use collect_list if you can bear with an empty List at cells where it should be zero:
df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id| bc| fe| fw|
+---+--------+----+--------+
|id3|[TR, WM]| []| []|
|id1| []|[WM]|[CS, WM]|
|id2| []| []| [CS]|
+---+--------+----+--------+
Using CONCAT_WS we can explode array and can remove the square brackets.
df.groupBy("id").pivot("app").agg(concat_ws(",",collect_list("customer")))