Spark dataframe - transform rows with same ID to columns - pyspark

I want to transform the source dataframe below (using pyspark):
+---+---+-------+
|Key| ID|segment|
+---+---+-------+
|  1|  A|     m1|
|  2|  A|     m1|
|  3|  B|     m1|
|  4|  C|     m2|
|  1|  D|     m1|
|  2|  E|     m1|
|  3|  F|     m1|
|  4|  G|     m2|
|  1|  J|     m1|
|  2|  J|     m1|
|  3|  J|     m1|
|  4|  K|     m2|
+---+---+-------+
Into the result dataframe below:
+---+----+----+
| ID|key1|key2|
+---+----+----+
|  A|   1|   2|
|  B|   3|   -|
|  C|   4|   -|
|  D|   1|   -|
|  F|   3|   -|
|  G|   4|   -|
|  J|   1|   2|
|  J|   1|   3|
|  J|   2|   3|
|  K|   4|   -|
+---+----+----+
In other words: I want to highlight the "pairs" in the dataframe. If I have more than one key for the same ID, I would like to show each relation on a different line.
Thank you for your help

Use window functions. I assume - means a one-member group. If not, you can use a when/otherwise condition to blank the 1s out.
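For reference, the source table from the question can be recreated as a DataFrame like this (a sketch; it assumes an active SparkSession available as the usual spark variable, with the values taken straight from the question):

# Recreate the question's source data (Key, ID, segment)
df = spark.createDataFrame(
    [(1, "A", "m1"), (2, "A", "m1"), (3, "B", "m1"), (4, "C", "m2"),
     (1, "D", "m1"), (2, "E", "m1"), (3, "F", "m1"), (4, "G", "m2"),
     (1, "J", "m1"), (2, "J", "m1"), (3, "J", "m1"), (4, "K", "m2")],
    ["Key", "ID", "segment"])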
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy(F.desc('Key'))
df = (df.withColumn('key2', F.lag('segment').over(w))   # preceding segment for each row (null if none)
        .withColumn('key2', F.col('key2').isNotNull())  # boolean: does a preceding row exist?
        .withColumn('key2', F.sum(F.col('key2').cast('integer'))
                             .over(w.rowsBetween(Window.currentRow, Window.unboundedFollowing)) + 1)  # cumulative groups
        .orderBy('ID', 'Key')                            # reorder the frame
     )
df.show()
+---+---+-------+----+
|Key| ID|segment|key2|
+---+---+-------+----+
| 1| A| m1| 2|
| 2| A| m1| 2|
| 3| B| m1| 1|
| 4| C| m2| 1|
| 1| D| m1| 1|
| 2| E| m1| 1|
| 3| F| m1| 1|
| 4| G| m2| 1|
| 1| J| m1| 2|
| 2| J| m1| 3|
| 3| J| m1| 3|
| 4| K| m2| 1|
+---+---+-------+----+
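As mentioned above, the 1s (one-member groups) could be blanked out with a when/otherwise condition. A minimal sketch, continuing from the df built in this answer; it sets them to null, and F.lit('-') could be used instead of None if the '-' marker from the desired output is preferred:

# Blank out key2 for one-member groups (key2 == 1)
df = df.withColumn('key2', F.when(F.col('key2') == 1, None).otherwise(F.col('key2')))
df.show()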

Related

How to build a rank based on threshold in Spark?

Suppose I have a dataframe:
val df = Seq(
  (1, "A"), (1, "B"), (1, "C"), (1, "D"),
  (1, "E"), (1, "F"), (1, "G"), (1, "H"),
  (2, "I"), (2, "J"), (2, "J"), (2, "J"),
  (3, "K")
).toDF("id", "code")
I need to rank it based on ids and with respect to some threshold. Example:
threshold = 3
id code rank
1 A 1
1 B 1
1 C 1 -- threshold has been reached
1 D 2
1 E 2
1 F 2 -- threshold has been reached
1 G 3
1 H 3
2 I 1
2 J 1
2 J 1 -- threshold has been reached
2 J 2
3 K 1
How can I do it?
I can create a simple rank:
df.withColumn("rank", dense_rank().over(Window.orderBy("id")))
But how to split ranked groups by threshold?
A solution that does not require moving all data into one partition:
// get the largest number of rows sharing the same id
val maxGroupSize = df.groupBy("id").count().agg(max("count")).first().getLong(0)

val threshold = 3

// round maxGroupSize up to the next multiple of threshold
var f = maxGroupSize
while (f % threshold > 0) f = f + 1

df.withColumn("tmp1", 'id * f)
  .withColumn("tmp2", dense_rank().over(Window.partitionBy("id").orderBy("code")) - 1)
  .withColumn("tmp3", 'tmp1 + 'tmp2)
  .withColumn("rank", ('tmp3 / threshold).cast("int"))
Result:
+---+----+----+----+----+----+
| id|code|tmp1|tmp2|tmp3|rank|
+---+----+----+----+----+----+
| 1| A| 9| 0| 9| 3|
| 1| B| 9| 1| 10| 3|
| 1| C| 9| 2| 11| 3|
| 1| D| 9| 3| 12| 4|
| 1| E| 9| 4| 13| 4|
| 1| F| 9| 5| 14| 4|
| 1| G| 9| 6| 15| 5|
| 1| H| 9| 7| 16| 5|
| 2| I| 18| 0| 18| 6|
| 2| J| 18| 1| 19| 6|
| 3| K| 27| 0| 27| 9|
+---+----+----+----+----+----+
The downside of this approach is that the ranks are not consecutive.
It would be possible to fix this with another window
df.withColumn("rank2", dense_rank().over(Window.orderBy("rank")))
but this would again move all data to a single executor.

Change value on duplicated rows using Pyspark, keeping the first record as is

How can I change the status column value on rows that contain duplicate records on specific columns, keeping the first one (with the lowest id) as A? For example:
logic:
if the account_id and user_id combination already exists, the status is E; the first record (lowest id) is A
if the user_id exists but the account_id is different, the status is I; the first record (lowest id) is A
input sample:
+---+----------+-------+
| id|account_id|user_id|
+---+----------+-------+
|  1|         a|      1|
|  2|         a|      1|
|  3|         b|      1|
|  4|         c|      2|
|  5|         c|      2|
|  6|         c|      2|
|  7|         d|      3|
|  8|         d|      3|
|  9|         e|      3|
+---+----------+-------+
output sample:
+---+----------+-------+------+
| id|account_id|user_id|status|
+---+----------+-------+------+
|  1|         a|      1|     A|
|  2|         a|      1|     E|
|  3|         b|      1|     I|
|  4|         c|      2|     A|
|  5|         c|      2|     E|
|  6|         c|      2|     E|
|  7|         d|      3|     A|
|  8|         d|      3|     E|
|  9|         e|      3|     I|
+---+----------+-------+------+
I think I need to group into multiple datasets and join them back, compare, and change the values. I think I'm overthinking it. Help?
Thanks!!
Two window functions will help you identify the duplicates and rank them.
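For reference, the input sample from the question can be recreated like this (a sketch; it assumes an active SparkSession available as spark):

# Recreate the question's input sample (id, account_id, user_id)
df = spark.createDataFrame(
    [(1, "a", 1), (2, "a", 1), (3, "b", 1),
     (4, "c", 2), (5, "c", 2), (6, "c", 2),
     (7, "d", 3), (8, "d", 3), (9, "e", 3)],
    ["id", "account_id", "user_id"])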
from pyspark.sql import functions as F
from pyspark.sql import Window as W

(df
    # Distinguishes between "first occurrence" vs "2nd occurrence" and so on
    .withColumn('rank', F.rank().over(W.partitionBy('account_id', 'user_id').orderBy('id')))
    # Detecting if there is no duplication per pair of 'account_id' and 'user_id'
    .withColumn('count', F.count('*').over(W.partitionBy('account_id', 'user_id')))
    # Building status based on conditions
    .withColumn('status', F
        .when(F.col('count') == 1, 'I')  # if there is only one record, status is 'I'
        .when(F.col('rank') == 1, 'A')   # if there is more than one record, the first occurrence is 'A'
        .otherwise('E')                  # finally, the other occurrences are 'E'
    )
    .orderBy('id')
    .show()
)
# Output
# +---+----------+-------+----+-----+------+
# | id|account_id|user_id|rank|count|status|
# +---+----------+-------+----+-----+------+
# | 1| a| 1| 1| 2| A|
# | 2| a| 1| 2| 2| E|
# | 3| b| 1| 1| 1| I|
# | 4| c| 2| 1| 3| A|
# | 5| c| 2| 2| 3| E|
# | 6| c| 2| 3| 3| E|
# | 7| d| 3| 1| 2| A|
# | 8| d| 3| 2| 2| E|
# | 9| e| 3| 1| 1| I|
# +---+----------+-------+----+-----+------+
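If only the columns from the output sample are wanted, the helper columns can be dropped at the end. A small follow-up, assuming the chained expression above is kept as a DataFrame named result (i.e. without the final .show()):

# 'result' is assumed to be the DataFrame built above, before .show()
result.drop('rank', 'count').orderBy('id').show()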

Pyspark Crosstab Pivot Challenge / Problem

I unfortunately could not find a solution for my exact problem. It is related to pivot and crosstab, but I could not solve it with these functions.
I have the feeling I am missing an intermediate table, but I somehow cannot come up with a solution.
Problem description:
I have a table of customers indicating which categories they have bought products from: if a customer bought a product from a category, the category ID is shown next to their name.
There are 4 categories (1 to 4) and 3 customers (A, B, C):
+--------+----------+
|customer| category |
+--------+----------+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 4|
| C| 1|
| C| 3|
| C| 4|
+--------+----------+
The table is DISTINCT, meaning each combination of customer and category appears only once.
What I want is a crosstab by category where I can easily read, e.g., how many of those who bought from category 1 also bought from category 4.
Desired results table:
+--------+---+---+---+---+
| | 1 | 2 | 3 | 4 |
+--------+---+---+---+---+
| 1| 3| 1| 2| 2|
| 2| 1| 1| 1| 0|
| 3| 2| 1| 2| 1|
| 4| 2| 0| 1| 1|
+--------+---+---+---+---+
Reading examples:
row1 column1 : total number of customers who bought product 1 (A, B, C)
row1 column2 : number of customers who bought product 1 and 2 (A)
row1 column3 : number of customers who bought product 1 and 3 (A, C)
etc.
As you can see the table is mirrored by its diagonal.
Any suggestions on how to create the desired table?
Additional challenge:
How to get the results as %?
For the first row the results would then be: | 100% | 33% | 66% | 66% |
Many thanks in advance!
You can join the input data with itself using customer as the join criterion. This returns all combinations of categories that exist for a given customer. After that you can use crosstab to get the result.
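For reference, the input table from the question can be recreated like this (a sketch; it assumes an active SparkSession available as spark):

# Recreate the question's customer/category table
df = spark.createDataFrame(
    [("A", 1), ("A", 2), ("A", 3),
     ("B", 1), ("B", 4),
     ("C", 1), ("C", 3), ("C", 4)],
    ["customer", "category"])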
df2 = df.withColumnRenamed("category", "cat1") \
    .join(df.withColumnRenamed("category", "cat2"), "customer") \
    .crosstab("cat1", "cat2") \
    .orderBy("cat1_cat2")
df2.show()
Output:
+---------+---+---+---+---+
|cat1_cat2| 1| 2| 3| 4|
+---------+---+---+---+---+
| 1| 3| 1| 2| 2|
| 2| 1| 1| 1| 0|
| 3| 2| 1| 2| 1|
| 4| 2| 0| 1| 2|
+---------+---+---+---+---+
To get the relative frequency you can sum over each row and then divide each element by this sum.
df2.withColumn("sum", sum(df2[col] for col in df2.columns if col != "cat1_cat2")) \
    .select("cat1_cat2", *(F.round(df2[col] / F.col("sum"), 2).alias(col) for col in df2.columns if col != "cat1_cat2")) \
    .show()
Output:
+---------+----+----+----+----+
|cat1_cat2| 1| 2| 3| 4|
+---------+----+----+----+----+
| 1|0.38|0.13|0.25|0.25|
| 2|0.33|0.33|0.33| 0.0|
| 3|0.33|0.17|0.33|0.17|
| 4| 0.4| 0.0| 0.2| 0.4|
+---------+----+----+----+----+
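If the percentages should instead be relative to the row's own category total (the diagonal entry), as in the question's 100% / 33% / 66% / 66% example, one option is to divide each row by its diagonal value. A sketch, assuming df2 from above and the functions import as F; for the first row it gives 1.0, 0.33, 0.67, 0.67 (the question's percentages, up to rounding):

cat_cols = [c for c in df2.columns if c != "cat1_cat2"]

# Pick the diagonal entry of each row, i.e. the column whose name equals cat1_cat2
diag = F.coalesce(*[F.when(F.col("cat1_cat2") == c, F.col(c)) for c in cat_cols])

df2.select(
    "cat1_cat2",
    *[F.round(F.col(c) / diag, 2).alias(c) for c in cat_cols]
).show()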

How to aggregate contiguous rows in pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max()-min() and groupby but that resulted in segment A being 8-1 and gave the wrong answer.
In sqlite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this...
SELECT
  COUNT(*) FILTER (WHERE a.user <>
    (SELECT b.user
     FROM foobar AS b
     WHERE a.timestamp > b.timestamp
     ORDER BY b.timestamp DESC
     LIMIT 1))
    OVER (ORDER BY timestamp) c,
  user,
  timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in SQL made that easy to finish.
Any ideas on how to scale this and do it in pyspark? I can't seem to find an adequate substitute for the "count(*) where(...)" that SQLite offered.
We can do this:
Create the DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, ordered by timestamp. The dummy column is used so that we can apply the row_number window function across the entire DataFrame.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a sub group within each user group here.
(1) For each user group, compute the difference between the current row's row_number and the previous row's row_number. Any difference larger than 1 indicates a new contiguous group. This gives the column diff; note that the first row in each group gets a value of -1 (from the fillna(-1) in the code below).
(2) We then assign null to every row with diff == 1. This gives the column diff2.
(3) Next, we use the last function to fill the rows where diff2 is null with the last non-null value of diff2. This gives subgroupid, which is the sub group we want within each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
We now group by both user and subgroupid to compute the time each user spent on each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
s = "(max('timestamp') - min('timestamp'))"
df = df.groupBy(['user', 'subgroupid']).agg(eval(s))
s = s.replace("'","")
df = df.groupBy('user').sum(s).select('user', F.col("sum(" + s + ")").alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+
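For reference, the same idea can be written more compactly by flagging rows where the user changes and turning a cumulative sum of those flags into segment ids. This is only a sketch, and df here means the original (user, timestamp) DataFrame created at the start; like the dummy-column approach above it still relies on a global ordering by timestamp (a single-partition window), so it does not remove the scalability concern for billions of rows:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_all = Window.orderBy('timestamp')  # global ordering: still a single-partition window

segments = (df
    .withColumn('is_new', (F.lag('user').over(w_all) != F.col('user')).cast('int'))
    .fillna(1, subset=['is_new'])    # the very first row starts a segment
    .withColumn('segment', F.sum('is_new').over(
        w_all.rowsBetween(Window.unboundedPreceding, Window.currentRow))))

(segments
    .groupBy('user', 'segment')
    .agg((F.max('timestamp') - F.min('timestamp')).alias('ttl'))
    .groupBy('user')
    .agg(F.sum('ttl').alias('ttl'))
    .show())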

Adding a Column in DataFrame from another column of same dataFrame Pyspark

I have a PySpark dataframe df, like the following:
+---+----+---+
| id|name| c|
+---+----+---+
| 1| a| 5|
| 2| b| 4|
| 3| c| 2|
| 4| d| 3|
| 5| e| 1|
+---+----+---+
I want to add a column match_name that takes the value from the name column of the row where id == c.
Is it possible to do this with the withColumn() function?
Currently I have to create two dataframes and then perform a join, which is inefficient on a large dataset.
Expected Output:
+---+----+---+----------+
| id|name| c|match_name|
+---+----+---+----------+
| 1| a| 5| e|
| 2| b| 4| d|
| 3| c| 2| b|
| 4| d| 3| c|
| 5| e| 1| a|
+---+----+---+----------+
Yes, it is possible, with when:
from pyspark.sql.functions import when, col

condition = col("id") == col("match")
result = df.withColumn("match_name", when(condition, col("name")))
result.show()
+---+----+-----+----------+
| id|name|match|match_name|
+---+----+-----+----------+
|  1|   a|    3|      null|
|  2|   b|    2|         b|
|  3|   c|    5|      null|
|  4|   d|    4|         d|
|  5|   e|    1|      null|
+---+----+-----+----------+
You may also use otherwise to provide a different value if the condition is not met.
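Note that when only looks at values within the same row, so the snippet above can only fill match_name on rows where id happens to equal the lookup column of that same row (hence the nulls). To produce the question's expected output, where name is looked up from the row whose id equals this row's c, a join across rows is still needed. A minimal self-join sketch, assuming the question's df with columns id, name, c and an active SparkSession available as spark:

from pyspark.sql import functions as F

# Sample data as in the question (columns id, name, c)
df = spark.createDataFrame(
    [(1, "a", 5), (2, "b", 4), (3, "c", 2), (4, "d", 3), (5, "e", 1)],
    ["id", "name", "c"])

# Look up, for each row, the name of the row whose id equals this row's c
lookup = df.select(F.col("id").alias("c"), F.col("name").alias("match_name"))
result = df.join(lookup, on="c", how="left").select("id", "name", "c", "match_name")
result.orderBy("id").show()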