Adding a column to a DataFrame from another column of the same DataFrame in PySpark

I have a PySpark DataFrame df like the following:
+---+----+---+
| id|name| c|
+---+----+---+
| 1| a| 5|
| 2| b| 4|
| 3| c| 2|
| 4| d| 3|
| 5| e| 1|
+---+----+---+
I want to add a column match_name that holds the value of the name column from the row where id == c.
Is it possible to do this with withColumn()?
Currently I have to create two DataFrames and then perform a join, which is inefficient on a large dataset.
Expected Output:
+---+----+---+----------+
| id|name| c|match_name|
+---+----+---+----------+
| 1| a| 5| e|
| 2| b| 4| d|
| 3| c| 2| b|
| 4| d| 3| c|
| 5| e| 1| a|
+---+----+---+----------+

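Yes, it is possible with when (this example uses a DataFrame whose third column is named match instead of c):
from pyspark.sql.functions import when, col
# True only where a row's own id equals its own match value
condition = col("id") == col("match")
result = df.withColumn("match_name", when(condition, col("name")))
result.show()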
id name match match_name
1 a 3 null
2 b 2 b
3 c 5 null
4 d 4 d
5 e 1 null
You may also use otherwise to provide a different value if the condition is not met.
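Note that the when condition above compares id and match within a single row, so match_name is only filled where a row's own values are equal. The expected output in the question instead takes name from a different row, which is a lookup by key. A minimal plain-Python sketch of that lookup, assuming id is unique (the names rows, name_by_id and result are illustrative and not from the original post):

```python
# Rows mirror the question's DataFrame: (id, name, c).
rows = [(1, "a", 5), (2, "b", 4), (3, "c", 2), (4, "d", 3), (5, "e", 1)]

# Build an id -> name lookup table (the role the second copy plays in a join).
name_by_id = {row_id: name for row_id, name, _ in rows}

# For each row, fetch the name of the row whose id equals this row's c.
# dict.get returns None when no id matches, similar to a null from a left join.
result = [(row_id, name, c, name_by_id.get(c)) for row_id, name, c in rows]

for row in result:
    print(row)
```

In Spark, a lookup like this is usually expressed as a self-join of the DataFrame on c == id.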

Related

How to create transition matrix with groupby in pyspark

I have a pyspark dataframe that looks like this
import pandas as pd
so = pd.DataFrame({'id': ['a','a','a','a','b','b','b','b','c','c','c','c'],
'time': [1,2,3,4,1,2,3,4,1,2,3,4],
'group':['A','A','A','A','A','A','A','A','B','B','B','B'],
'value':['S','C','C','C', 'S','C','H', 'H', 'S','C','C','C']})
df_so = spark.createDataFrame(so)
df_so.show()
+---+----+-----+-----+
| id|time|group|value|
+---+----+-----+-----+
| a| 1| A| S|
| a| 2| A| C|
| a| 3| A| C|
| a| 4| A| C|
| b| 1| A| S|
| b| 2| A| C|
| b| 3| A| H|
| b| 4| A| H|
| c| 1| B| S|
| c| 2| B| C|
| c| 3| B| C|
| c| 4| B| C|
+---+----+-----+-----+
I would like to create the "transition matrix" of value by group.
The transition matrix indicates the probability of, e.g., going from value S to value C within each id as time progresses.
Example:
For group A:
We have 6 movements in total
S->C occurs once for id==a and once for id==b, so S to C is (1+1)/6
C->S is 0, since within an id there is no transition from C to S
C->C is 2/6
C->H is 1/6
H->H is 1/6
We can do the same for group B.
Is there a way to do this in pyspark?
First I use lag to build the source column (the left side of the transition) for each row, then count the frequency per group, source and value (the target) and divide it by the total count in the group.
from pyspark.sql import Window, functions as F
# previous value within each id, ordered by time
lagw = Window.partitionBy(['group', 'id']).orderBy('time')
frqw = Window.partitionBy(['group', 'source', 'value'])
ttlw = Window.partitionBy('group')
df = (df_so.withColumn('source', F.lag('value').over(lagw))
.withColumn('transition_p', F.count('source').over(frqw) / F.count('source').over(ttlw)))
df.show()
# +---+----+-----+-----+------+------------+
# | id|time|group|value|source|transition_p|
# +---+----+-----+-----+------+------------+
# | c| 1| B| S| null| 0.0|
# | c| 3| B| C| C| 0.666666666|
# | c| 4| B| C| C| 0.666666666|
# | c| 2| B| C| S| 0.333333333|
# | b| 1| A| S| null| 0.0|
# .....
If I understand correctly what you want in the end, you can pivot the result into a matrix:
(df.filter(df.group == 'A')
.groupby('source')
.pivot('value')
.agg(F.first('transition_p'))
).show()
# +------+---------+---------+---------+
# |source| C| H| S|
# +------+---------+---------+---------+
# | null| null| null| 0.0|
# | C|0.3333333|0.1666666| null|
# | S|0.3333333| null| null|
# | H| null|0.1666666| null|
# +------+---------+---------+---------+
By definition, each row of a transition matrix T sums to one, which differs from your calculation.
To calculate the transition matrix (as defined on Wikipedia), first calculate the frequency table. The code should be run after selecting the group subset.
Count the number of transitions from A to B:
import pandas as pd
from pyspark.sql import Window, functions as F
df = spark.createDataFrame(pd.DataFrame({'id': ['a','a','a','a','b','b','b','b'],
'time': [1,2,3,4,1,2,3,4],
'page':['S','C','C','C', 'S','C','H', 'H']}))
win1 = Window.partitionBy(["id"]).orderBy("time")
df = df.withColumn("page_next", F.lead("page",1).over(win1))
# the last row of each id has no successor, so drop it
df = df.where(F.col("page_next").isNotNull())
Find all ordered pairs of nodes, then join the observed counts onto them:
nodes = df.select("page").drop_duplicates()
paths = nodes.crossJoin(nodes).toDF("page", "page_next")
data = df.groupby("page", "page_next").agg(F.count(F.col("id")).alias("cnts"))
path_cnts = paths.join(data, on=["page", "page_next"], how="left").fillna(0)
freq_matrix = path_cnts.groupby("page").pivot("page_next").agg(F.first("cnts"))
This should return the frequency matrix, where each cell contains the number of transitions observed from row node A to column node B.
Normalize each row to sum to 1:
node_names = freq_matrix.columns[1:]
row_sum = sum([freq_matrix[node] for node in node_names])
trans_matrix = freq_matrix.select("page", *((freq_matrix[node] / row_sum).alias(node) for node in node_names))
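If you want the transition matrix as per your definition, divide each cell by the total number of transitions instead, i.e. df.count() after the null filter above (data.count() only gives the number of distinct page pairs).
This approach makes less use of groupby, so it may be slower.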
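To check either Spark result on a small sample, the two normalisations discussed above can be compared in plain Python. This is only a sketch on the group A rows from the question; the names sequences, counts, share_of_all and row_normalised are illustrative and not from the original answers.

```python
from collections import Counter, defaultdict

# Group A from the question: (id, time, value).
rows = [
    ("a", 1, "S"), ("a", 2, "C"), ("a", 3, "C"), ("a", 4, "C"),
    ("b", 1, "S"), ("b", 2, "C"), ("b", 3, "H"), ("b", 4, "H"),
]

# Collect each id's values in time order.
sequences = defaultdict(list)
for row_id, _, value in sorted(rows, key=lambda r: (r[0], r[1])):
    sequences[row_id].append(value)

# Count consecutive (source, target) pairs within each id.
counts = Counter()
for values in sequences.values():
    counts.update(zip(values, values[1:]))

total = sum(counts.values())

# Definition used in the question: share of all transitions in the group.
share_of_all = {pair: n / total for pair, n in counts.items()}

# Row-stochastic definition: share of the transitions leaving each source.
row_totals = Counter()
for (source, _), n in counts.items():
    row_totals[source] += n
row_normalised = {pair: n / row_totals[pair[0]] for pair, n in counts.items()}
```

With the question's definition, S->C is 2/6; with the row-normalised definition it is 2/2, because every transition leaving S goes to C.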

pyspark: Auto filling in implicit missing values

I have a dataframe
user day amount
a 2 10
a 1 14
a 4 5
b 1 4
The maximum value of day is 4 and the minimum is 1. I want to add a row with amount 0 for every missing day of every user, so the DataFrame above becomes:
user day amount
a 2 10
a 1 14
a 4 5
a 3 0
b 1 4
b 2 0
b 3 0
b 4 0
How could I do that in PySpark? Many thanks.
Here is one approach. Get the min and max values first, then group on the user column and pivot, add the missing day columns, fill all nulls with 0, and finally stack the columns back into rows:
min_max = df.agg(F.min("day"),F.max("day")).collect()[0]
df1 = df.groupBy("user").pivot("day").agg(F.first("amount").alias("amount")).na.fill(0)
missing_cols = [F.lit(0).alias(str(i)) for i in range(min_max[0],min_max[1]+1)
if str(i) not in df1.columns ]
df1 = df1.select("*",*missing_cols)
#+----+---+---+---+---+
#|user| 1| 2| 4| 3|
#+----+---+---+---+---+
#| b| 4| 0| 0| 0|
#| a| 14| 10| 5| 0|
#+----+---+---+---+---+
#the next step is inspired from https://stackoverflow.com/a/37865645/9840637
arr = F.explode(F.array([F.struct(F.lit(c).alias("day"), F.col(c).alias("amount"))
for c in df1.columns[1:]])).alias("kvs")
(df1.select(["user"] + [arr])
.select(["user"]+ ["kvs.day", "kvs.amount"]).orderBy("user")).show()
+----+---+------+
|user|day|amount|
+----+---+------+
| a| 1| 14|
| a| 2| 10|
| a| 4| 5|
| a| 3| 0|
| b| 1| 4|
| b| 2| 0|
| b| 4| 0|
| b| 3| 0|
+----+---+------+
Note: since the day column was pivoted, its dtype might have changed, so you may have to cast it back to the original dtype.
Another way to do this is to use sequence, the array functions and explode (Spark 2.4+):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy(F.lit(0))
df.withColumn("boundaries", F.sequence(F.min("day").over(w),F.max("day").over(w),F.lit(1)))\
.groupBy("user").agg(F.collect_list("day").alias('day'),F.collect_list("amount").alias('amount')\
,F.first("boundaries").alias("boundaries")).withColumn("boundaries", F.array_except("boundaries","day"))\
.withColumn("day",F.flatten(F.array("day","boundaries"))).drop("boundaries")\
.withColumn("zip", F.explode(F.arrays_zip("day","amount")))\
.select("user","zip.day", F.when(F.col("zip.amount").isNull(),\
F.lit(0)).otherwise(F.col("zip.amount")).alias("amount")).show()
#+----+---+------+
#|user|day|amount|
#+----+---+------+
#| a| 2| 10|
#| a| 1| 14|
#| a| 4| 5|
#| a| 3| 0|
#| b| 1| 4|
#| b| 2| 0|
#| b| 3| 0|
#| b| 4| 0|
#+----+---+------+
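Both answers above produce the same set of rows: every (user, day) combination between the global minimum and maximum day, with 0 where no amount was recorded. A small plain-Python sketch of that target, useful for checking the Spark output on sample data (the names all_days, amount_by_key and filled are illustrative and not from the original answers):

```python
# Rows mirror the question's DataFrame: (user, day, amount).
rows = [("a", 2, 10), ("a", 1, 14), ("a", 4, 5), ("b", 1, 4)]

# Global day range, shared by all users.
days = [day for _, day, _ in rows]
all_days = range(min(days), max(days) + 1)
users = sorted({user for user, _, _ in rows})

# Known amounts, keyed by (user, day).
amount_by_key = {(user, day): amount for user, day, amount in rows}

# Every user/day combination, defaulting to 0 when the day is missing.
filled = [
    (user, day, amount_by_key.get((user, day), 0))
    for user in users
    for day in all_days
]
```

In Spark terms, this corresponds to cross-joining the distinct users with the full day range and left-joining the original data, which is a third way to express the same idea.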

How to aggregate contiguous rows in pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max()-min() and groupby but that resulted in segment A being 8-1 and gave the wrong answer.
In sqlite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this...
SELECT
COUNT(*) FILTER (WHERE a.user <>
( SELECT b.user
FROM foobar AS b
WHERE a.timestamp > b.timestamp
ORDER BY b.timestamp DESC
LIMIT 1
))
OVER (ORDER BY timestamp) c,
user,
timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in sql made that easy to finish.
Any ideas on how to scale this and do it in pyspark? I can't seem to find an adequate substitute for the "count(*) where(...)" that sqlite offers.
We can do it as follows.
Create the DataFrame:
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, ordered by timestamp. The constant dummy column gives the window a partition key, so that row_number can be applied across the whole DataFrame.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a subgroup within each user group here.
(1) For each user group, compute the difference between the current row's row_number and the previous row's row_number. Any difference larger than 1 indicates the start of a new contiguous group. This gives the column diff; note that the first row in each user group has no previous row and is filled with -1.
(2) We then assign null to every row with diff == 1. This gives the column diff2.
(3) Next, we use the last function to fill the rows where diff2 is null with the last non-null value of diff2. This gives the column subgroupid.
This is the subgroup we want to create for each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
We now group by both user and subgroupid to compute the time each user spent in each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
# time spent in each contiguous segment
df = df.groupBy('user', 'subgroupid').agg((F.max('timestamp') - F.min('timestamp')).alias('segment_time'))
# total over all segments of each user
df = df.groupBy('user').agg(F.sum('segment_time').alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+
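One caveat, as far as I can tell: subgroupid above is built from the size of the gap in row numbers, so two segments of the same user that follow gaps of equal size would receive the same id and be merged. Numbering segments by each change of user avoids that. The plain-Python sketch below shows the intended result on the sample data, with itertools.groupby standing in for the segment id; it illustrates the logic only and is not a scalable replacement for the Spark code.

```python
from collections import defaultdict
from itertools import groupby

# Rows mirror the question's DataFrame: (user, timestamp).
rows = [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5),
        ("A", 6), ("A", 7), ("A", 8)]

# Order all rows by timestamp, as the window over the whole DataFrame does.
rows_sorted = sorted(rows, key=lambda r: r[1])

# groupby starts a new group each time the user changes,
# so each group is exactly one contiguous segment.
total_time = defaultdict(int)
for user, segment in groupby(rows_sorted, key=lambda r: r[0]):
    timestamps = [ts for _, ts in segment]
    total_time[user] += timestamps[-1] - timestamps[0]

print(dict(total_time))
```

In Spark, the same numbering is commonly obtained by flagging rows whose user differs from the previous row (via lag) and taking a running sum of that flag.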

Determining Number of Joint Sessions Per Product Pair

I have this data-frame:
from pyspark.mllib.linalg.distributed import IndexedRow
rows = sc.parallelize([[1, "A"], [1, 'B'] , [1, "A"], [2, 'A'], [2, 'C'] ,[3,'A'], [3, 'B']])
rows_df = rows.toDF(["session_id", "product"])
rows_df.show()
+----------+-------+
|session_id|product|
+----------+-------+
| 1| A|
| 1| B|
| 1| A|
| 2| A|
| 2| C|
| 3| A|
| 3| B|
+----------+-------+
I want to know how many joint sessions each product pair has. The same product can appear in a session multiple times, but I only want one count per session per product pair.
Sample Output:
+---------+---------+-----------------+
|product_a|product_b|num_join_sessions|
+---------+---------+-----------------+
| A| B| 2|
| A| C| 1|
| B| A| 2|
| B| C| 0|
| C| A| 1|
| C| B| 0|
+---------+---------+-----------------+
I'm lost on how to implement this in pyspark.
Getting the joint session count for pairs that have joint sessions is fairly easy. You can achieve this by joining the DataFrame to itself on session_id and filtering out the rows where the products are the same.
Then you group by the product pairs and count the distinct session_ids.
import pyspark.sql.functions as f
rows_df.alias("l").join(rows_df.alias("r"), on="session_id", how="inner")\
.where("l.product != r.product")\
.groupBy(f.col("l.product").alias("product_a"), f.col("r.product").alias("product_b"))\
.agg(f.countDistinct("session_id").alias("num_join_sessions"))\
.show()
#+---------+---------+-----------------+
#|product_a|product_b|num_join_sessions|
#+---------+---------+-----------------+
#| A| C| 1|
#| C| A| 1|
#| B| A| 2|
#| A| B| 2|
#+---------+---------+-----------------+
(Side note: if you want ONLY unique pairs of products, change the != to < in the where function.)
The tricky part is that you also want the pairs that don't have joint sessions. This can be done, but it won't be efficient because you will need the Cartesian product of all products.
Nevertheless, here is one approach:
Start with the above and RIGHT join it to the Cartesian product of the distinct product pairs.
rows_df.alias("l").join(rows_df.alias("r"), on="session_id", how="inner")\
.where("l.product != r.product")\
.groupBy(f.col("l.product").alias("product_a"), f.col("r.product").alias("product_b"))\
.agg(f.countDistinct("session_id").alias("num_join_sessions"))\
.join(
rows_df.selectExpr("product AS product_a").distinct().crossJoin(
rows_df.selectExpr("product AS product_b").distinct()
).where("product_a != product_b").alias("pairs"),
on=["product_a", "product_b"],
how="right"
)\
.fillna(0)\
.sort("product_a", "product_b")\
.show()
#+---------+---------+-----------------+
#|product_a|product_b|num_join_sessions|
#+---------+---------+-----------------+
#| A| B| 2|
#| A| C| 1|
#| B| A| 2|
#| B| C| 0|
#| C| A| 1|
#| C| B| 0|
#+---------+---------+-----------------+
Note: the sort is not needed, but I included it to match the order of the desired output.
I believe this should do it:
import pyspark.sql.functions as F
joint_sessions = rows_df.withColumnRenamed(
'product', 'product_a'
).join(
rows_df.withColumnRenamed('product', 'product_b'),
on='session_id',
how='inner'
).filter(
F.col('product_a') != F.col('product_b')
).groupBy(
'product_a',
'product_b'
).agg(
F.countDistinct('session_id').alias('num_join_sessions')
).select(
'product_a',
'product_b',
'num_join_sessions'
)
joint_sessions.show()
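The last snippet only returns pairs that share at least one session; the zero-count pairs from the sample output still need the Cartesian product described earlier. For checking results on a small sample, the full expected table can be sketched in plain Python (the names products_by_session, counts and result are illustrative and not from the original answers):

```python
from collections import Counter
from itertools import permutations

# Rows mirror the question's DataFrame: (session_id, product).
rows = [(1, "A"), (1, "B"), (1, "A"), (2, "A"), (2, "C"), (3, "A"), (3, "B")]

# Distinct products per session, so repeats inside a session count once.
products_by_session = {}
for session_id, product in rows:
    products_by_session.setdefault(session_id, set()).add(product)

# Count each ordered pair once per session in which both products appear.
counts = Counter()
for products in products_by_session.values():
    counts.update(permutations(sorted(products), 2))

# All ordered pairs of distinct products; a Counter yields 0 for unseen pairs.
all_products = sorted({product for _, product in rows})
result = [(a, b, counts[(a, b)]) for a, b in permutations(all_products, 2)]

for row in result:
    print(row)
```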

combining lag with row computation in windowing apache spark

Assume there is a DataFrame as follows:
a| b|
1| 3|
1| 5|
2| 6|
2| 9|
2|14|
I want to produce a final dataframe like this
a| b| c
1| 3| 0
1| 5| -2
2| 6| -6
2| 9| -10
2| 14| -17
The value of c is computed for every row except the first as a - b + c of the previous row. I tried to use lag as well as rowsBetween, but without success, since the c column does not exist yet and ends up filled with arbitrary values.
val w = Window.partitionBy().orderBy($"a", $"b")
df.withColumn("c", lead($"a", 1, 0).over(w) - lead($"b", 1, 0).over(w) + lead($"c", 1, 0).over(w))
You can't reference c while calculating c. What you need is a cumulative sum, which could simply be:
df.withColumn("c", sum(lag($"a" - $"b", 1, 0).over(w)).over(w)).show
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 0|
| 1| 5| -2|
| 2| 6| -6|
| 2| 9|-10|
| 2| 14|-17|
+---+---+---+
Note that this is inefficient because the window has no partition column, so Spark has to move all rows into a single partition to compute it.
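To see why the recurrence reduces to a cumulative sum: with c = 0 on the first row, each later c is the running total of a - b over all earlier rows. A short sketch of that arithmetic, written in Python rather than Scala to keep it self-contained (the names diffs, lagged and c_values are illustrative):

```python
from itertools import accumulate

# Rows mirror the question's DataFrame: (a, b), ordered by a then b.
rows = [(1, 3), (1, 5), (2, 6), (2, 9), (2, 14)]
rows_sorted = sorted(rows)

# a - b for every row.
diffs = [a - b for a, b in rows_sorted]

# Shift by one row with a default of 0, like lag(a - b, 1, 0).
lagged = [0] + diffs[:-1]

# The running total of the lagged differences gives c.
c_values = list(accumulate(lagged))

result = [(a, b, c) for (a, b), c in zip(rows_sorted, c_values)]
for row in result:
    print(row)
```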