How to sequentially iterate rows in a PySpark DataFrame - pyspark

I have a Spark DataFrame like this:
+-------+------+-----+---------------+
|Account|nature|value| time|
+-------+------+-----+---------------+
| a| 1| 50|10:05:37:293084|
| a| 1| 50|10:06:46:806510|
| a| 0| 50|11:19:42:951479|
| a| 1| 40|19:14:50:479055|
| a| 0| 50|16:56:17:251624|
| a| 1| 40|16:33:12:133861|
| a| 1| 20|17:33:01:385710|
| b| 0| 30|12:54:49:483725|
| b| 0| 40|19:23:25:845489|
| b| 1| 30|10:58:02:276576|
| b| 1| 40|12:18:27:161290|
| b| 0| 50|12:01:50:698592|
| b| 0| 50|08:45:53:894441|
| b| 0| 40|17:36:55:827330|
| b| 1| 50|17:18:41:728486|
+-------+------+-----+---------------+
I want to compare the nature column of one row to other rows with the same Account and value (looking forward), and add a new column named Repeated. The new column gets true for both rows if nature changed, from 1 to 0 or vice versa. For example, the above dataframe should look like this:
+-------+------+-----+---------------+--------+
|Account|nature|value| time|Repeated|
+-------+------+-----+---------------+--------+
| a| 1| 50|10:05:37:293084| true |
| a| 1| 50|10:06:46:806510| true|
| a| 0| 50|11:19:42:951479| true |
| a| 0| 50|16:56:17:251624| true |
| b| 0| 50|08:45:53:894441| true |
| b| 0| 50|12:01:50:698592| false|
| b| 1| 50|17:18:41:728486| true |
| a| 1| 40|16:33:12:133861| false|
| a| 1| 40|19:14:50:479055| false|
| b| 1| 40|12:18:27:161290| true|
| b| 0| 40|17:36:55:827330| true |
| b| 0| 40|19:23:25:845489| false|
| b| 1| 30|10:58:02:276576| true|
| b| 0| 30|12:54:49:483725| true |
| a| 1| 20|17:33:01:385710| false|
+-------+------+-----+---------------+--------+
My solution is that I have to do a group by or window on the Account and value columns; then, in each group, compare the nature of each row to the nature of the other rows and, as a result of that comparison, fill the Repeated column.
I did this calculation with Spark window functions, like this:
windowSpec = Window.partitionBy("Account","value").orderBy("time")
df.withColumn("Repeated", coalesce(f.when(lead(df['nature']).over(windowSpec)!=df['nature'],lit(True)).otherwise(False))).show()
The result was like this, which is not the result that I wanted:
+-------+------+-----+---------------+--------+
|Account|nature|value| time|Repeated|
+-------+------+-----+---------------+--------+
| a| 1| 50|10:05:37:293084| false|
| a| 1| 50|10:06:46:806510| true|
| a| 0| 50|11:19:42:951479| false|
| a| 0| 50|16:56:17:251624| false|
| b| 0| 50|08:45:53:894441| false|
| b| 0| 50|12:01:50:698592| true|
| b| 1| 50|17:18:41:728486| false|
| a| 1| 40|16:33:12:133861| false|
| a| 1| 40|19:14:50:479055| false|
| b| 1| 40|12:18:27:161290| true|
| b| 0| 40|17:36:55:827330| false|
| b| 0| 40|19:23:25:845489| false|
| b| 1| 30|10:58:02:276576| true|
| b| 0| 30|12:54:49:483725| false|
| a| 1| 20|17:33:01:385710| false|
+-------+------+-----+---------------+--------+
UPDATE:
To explain more: if we suppose the first Spark DataFrame is named "df", the following pandas-style pseudocode is what I want to do in each group of "Account" and "value":
a = df.withColumn('repeated', lit(False))
# within each group:
for i in range(len(group)):
    for j in range(i + 1, len(group)):
        if a.loc[i, 'nature'] != a.loc[j, 'nature'] and a.loc[j, 'repeated'] == False:
            a.loc[i, 'repeated'] = True
            a.loc[j, 'repeated'] = True
            break  # pair each row with at most one opposite row
Would you please guide me on how to do that using a PySpark window?
Any help is really appreciated.

You actually need to guarantee that the order you see in your dataframe is the actual order. Can you do that? You need a column to sequence the events so that what happened did happen in that order. Inserting new data into a dataframe doesn't guarantee its order.
A window and lag (or lead) will allow you to look at a neighboring row's value and make the required adjustment.
FYI: I use coalesce here because the boundary row of each partition has no neighboring value to compare with; use the second parameter of coalesce as you see fit for what should happen with that row in the account.
If you need an ordering column, look at the monotonically_increasing_id function. It may help you create the order-by value that is required for us to look at this data deterministically.
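For example, a minimal sketch (my own addition, assuming df is the dataframe from the question and has no reliable ordering column of its own):
from pyspark.sql import functions as F

# monotonically_increasing_id gives increasing (not consecutive) ids based on the
# current partition layout, so use it only if that arrival order is acceptable
df_with_order = df.withColumn("order", F.monotonically_increasing_id())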
from pyspark.sql.functions import lag, lead, lit, coalesce
from pyspark.sql.window import Window
spark.sql("create table nature (Account string,nature int, value int, order int)");
spark.sql("insert into nature values ('a', 1, 50,1), ('a', 1, 40,2),('a',0,50,3),('b',0,30,4),('b',0,40,5),('b',1,30,6),('b',1,40,7)")
windowSpec = Window.partitionBy("Account").orderBy("order")
nature = spark.table("nature");
nature.withColumn("Repeated", coalesce( lead(nature['nature']).over(windowSpec) != nature['nature'], lit(True)) ).show()
+-------+------+-----+-----+--------+
|Account|nature|value|order|Repeated|
+-------+------+-----+-----+--------+
| b| 0| 30| 4| false|
| b| 0| 40| 5| true|
| b| 1| 30| 6| false|
| b| 1| 40| 7| true|
| a| 1| 50| 1| false|
| a| 1| 40| 2| true|
| a| 0| 50| 3| true|
+-------+------+-----+-----+--------+
EDIT:
It's not clear from your description if I should look forward or backward. I have changed my code to look forward a row, as this is consistent with Account 'B' in your output. However, it doesn't seem like the logic for Account 'A' is identical to the logic for 'B' in your sample output. (Or I don't understand a subtlety of starting on '1' instead of starting on '0'.) If you want to look forward a row use lead; if you want to look back a row use lag.
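For reference, the backward-looking variant just swaps lead for lag over the same window (a sketch based on the code above):
from pyspark.sql.functions import lag, coalesce, lit

nature.withColumn(
    "Repeated",
    coalesce(lag(nature['nature']).over(windowSpec) != nature['nature'], lit(True))
).show()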

Problem solved.
Even though this approach costs a lot, it works.
def check(part):
    # 'part' is a pandas DataFrame holding one (Account, value) group;
    # it is expected to already contain a 'repeated' column (initialized to False)
    df = part
    size = len(df)
    for i in range(size):
        if df.loc[i, 'repeated'] == True:
            continue
        else:
            for j in range(i + 1, size):
                if (df.loc[i, 'nature'] != df.loc[j, 'nature']) & (df.loc[j, 'repeated'] == False):
                    df.loc[j, 'repeated'] = True
                    df.loc[i, 'repeated'] = True
                    break
    return df

df.groupby("Account", "value").applyInPandas(
    check, schema="Account string, nature int, value long, time string, repeated boolean"
).show()
Update1:
Another solution, without any iterations: within each group, sorting by nature ascending and descending and XOR-ing the two orderings flags exactly as many rows of each nature as can be paired with the opposite nature.
def check(df):
    df = df.sort_values('verified_time')
    df['index'] = df.index
    df['IS_REPEATED'] = 0
    df1 = df.sort_values(['nature'], ascending=[True]).reset_index(drop=True)
    df2 = df.sort_values(['nature'], ascending=[False]).reset_index(drop=True)
    # XOR of the ascending and descending orderings marks the paired rows
    df1['IS_REPEATED'] = df1['nature'] ^ df2['nature']
    df3 = df1.sort_values(['index'], ascending=[True])
    df = df3.drop(['index'], axis=1)
    return df

df = df.groupby("account", "value").applyInPandas(check, schema=gf.get_schema('trx'))
UPDATE2:
Solution with Spark window:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

def is_repeated_feature(df):
    # number the rows within each (account, value, nature) partition
    windowPartition = Window.partitionBy("account", "value", "nature").orderBy("nature")
    df_1 = df.withColumn("rank", F.row_number().over(windowPartition))
    # count rows and sum nature over the whole (account, value) group
    w = (Window
         .partitionBy("account", "value")
         .orderBy("nature")
         .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
    df_1 = df_1.withColumn("count_nature", F.count("nature").over(w))
    df_1 = df_1.withColumn("sum_nature", F.sum("nature").over(w))
    # min_val = min(count of 0s, count of 1s) = number of (0, 1) pairs in the group
    df_2 = df_1.withColumn(
        "min_val",
        when(df_1.sum_nature > (df_1.count_nature - df_1.sum_nature),
             df_1.count_nature - df_1.sum_nature).otherwise(df_1.sum_nature))
    df_2 = df_2.withColumn("more_than_one", when(df_2.count_nature > 1, '1').otherwise('0'))
    # a row is flagged when its rank within its nature falls inside the paired range
    df_2 = df_2.withColumn(
        "is_repeated",
        when((df_2.more_than_one == 1) & (df_2.count_nature > df_2.sum_nature) &
             (df_2.rank <= df_2.min_val), '1')
        .otherwise('0'))
    return df_2
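Usage would presumably be along these lines (a sketch, assuming df carries the lower-case account, value, nature and time columns the function refers to):
df_flagged = is_repeated_feature(df)
df_flagged.select('account', 'nature', 'value', 'time', 'is_repeated').show()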

Related

How to aggregate contiguous rows in pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max()-min() and groupby but that resulted in segment A being 8-1 and gave the wrong answer.
In SQLite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this...
SELECT
  COUNT(*) FILTER (WHERE a.user <>
    ( SELECT b.user
      FROM foobar AS b
      WHERE a.timestamp > b.timestamp
      ORDER BY b.timestamp DESC
      LIMIT 1
    ))
  OVER (ORDER BY timestamp) c,
  user,
  timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in SQL made that easy to finish.
Any ideas on how to scale this and do it in PySpark? I can't seem to find an adequate substitute for the "count(*) where(...)" that SQLite offered.
We can do this:
Create the DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, ordered by timestamp. The dummy column is used so that we can apply the window function row_number over the whole dataframe.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a sub-group within each user group here.
(1) For each user group, compute the difference between the current row's row_number and the previous row's row_number. Any difference larger than 1 indicates a new contiguous group. This produces column diff; note that the first row in each group gets a value of -1.
(2) We then assign null to every row with diff == 1. This produces column diff2.
(3) Next, we use the last function to fill the rows where diff2 is null with the last non-null value of diff2. This produces subgroupid.
This is the sub-group we want to create for each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
We now group by both user and subgroupid to compute the time each user spent on each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
s = "(max('timestamp') - min('timestamp'))"
df = df.groupBy(['user', 'subgroupid']).agg(eval(s))
s = s.replace("'","")
df = df.groupBy('user').sum(s).select('user', F.col("sum(" + s + ")").alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+
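As a side note, the eval-based aggregation above can also be written directly with column expressions, which avoids evaluating strings; a minimal sketch using the same column names:
from pyspark.sql import functions as F

per_segment = df.groupBy('user', 'subgroupid').agg(
    (F.max('timestamp') - F.min('timestamp')).alias('segment_time')
)
per_segment.groupBy('user').agg(F.sum('segment_time').alias('total_time')).show()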

Determining Number of Joint Sessions Per Product Pair

I have this data-frame:
from pyspark.mllib.linalg.distributed import IndexedRow
rows = sc.parallelize([[1, "A"], [1, 'B'] , [1, "A"], [2, 'A'], [2, 'C'] ,[3,'A'], [3, 'B']])
rows_df = rows.toDF(["session_id", "product"])
rows_df.show()
+----------+-------+
|session_id|product|
+----------+-------+
| 1| A|
| 1| B|
| 1| A|
| 2| A|
| 2| C|
| 3| A|
| 3| B|
+----------+-------+
I want to know how many joint sessions each product pair has together. The same products can be in a session multiple times, but I only want one count per session per product pair.
Sample Output:
+---------+---------+-----------------+
|product_a|product_b|num_join_sessions|
+---------+---------+-----------------+
| A| B| 2|
| A| C| 1|
| B| A| 2|
| B| C| 0|
| C| A| 1|
| C| B| 0|
+---------+---------+-----------------+
I'm lost on how to implement this in pyspark.
Getting the joint session count for pairs that have joint sessions is fairly easy. You can achieve this by joining the DataFrame to itself on session_id and filtering out the rows where the products are the same.
Then you group by the product pairs and count the distinct session_ids.
import pyspark.sql.functions as f

rows_df.alias("l").join(rows_df.alias("r"), on="session_id", how="inner")\
    .where("l.product != r.product")\
    .groupBy(f.col("l.product").alias("product_a"), f.col("r.product").alias("product_b"))\
    .agg(f.countDistinct("session_id").alias("num_join_sessions"))\
    .show()
#+---------+---------+-----------------+
#|product_a|product_b|num_join_sessions|
#+---------+---------+-----------------+
#| A| C| 1|
#| C| A| 1|
#| B| A| 2|
#| A| B| 2|
#+---------+---------+-----------------+
(Side note: if you want ONLY unique pairs of products, change the != to < in the where function.)
The tricky part is that you also want the pairs that don't have joint sessions. This can be done, but it won't be efficient because you will need to get a Cartesian product of every product pairing.
Nevertheless, here is one approach:
Start with the above and RIGHT join it with the Cartesian product of the distinct product pairs.
rows_df.alias("l").join(rows_df.alias("r"), on="session_id", how="inner")\
.where("l.product != r.product")\
.groupBy(f.col("l.product").alias("product_a"), f.col("r.product").alias("product_b"))\
.agg(f.countDistinct("session_id").alias("num_join_sessions"))\
.join(
rows_df.selectExpr("product AS product_a").distinct().crossJoin(
rows_df.selectExpr("product AS product_b").distinct()
).where("product_a != product_b").alias("pairs"),
on=["product_a", "product_b"],
how="right"
)\
.fillna(0)\
.sort("product_a", "product_b")\
.show()
#+---------+---------+-----------------+
#|product_a|product_b|num_join_sessions|
#+---------+---------+-----------------+
#| A| B| 2|
#| A| C| 1|
#| B| A| 2|
#| B| C| 0|
#| C| A| 1|
#| C| B| 0|
#+---------+---------+-----------------+
Note: the sort is not needed, but I included it to match the order of the desired output.
I believe this should do it:
import pyspark.sql.functions as F
joint_sessions = rows_df.withColumnRenamed(
    'product', 'product_a'
).join(
    rows_df.withColumnRenamed('product', 'product_b'),
    on='session_id',
    how='inner'
).filter(
    F.col('product_a') != F.col('product_b')
).groupBy(
    'product_a',
    'product_b'
).agg(
    F.countDistinct('session_id').alias('num_join_sessions')
).select(
    'product_a',
    'product_b',
    'num_join_sessions'
)

joint_sessions.show()

Improve the efficiency of Spark SQL in repeated calls to groupBy/count. Pivot the outcome

I have a Spark DataFrame consisting of columns of integers. I want to tabulate each column and pivot the outcome by the column names.
In the following toy example, I start with this DataFrame df
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 1| 1| 1| 0| 2|
| 1| 1| 1| 1| 1|
| 2| 2| 2| 3| 3|
| 0| 0| 0| 0| 1|
| 1| 1| 1| 0| 0|
| 3| 3| 3| 2| 2|
| 0| 1| 1| 1| 0|
+---+---+---+---+---+
Each cell can only contain one of {0, 1, 2, 3}. Now I want to tabulate the counts in each column. Ideally, I would have a column for each label (0, 1, 2, 3), and a row for each column. I do:
val output = df.columns.map(cs =>
  df.select(cs).groupBy(cs).count().orderBy(cs)
    .withColumnRenamed(cs, "severity")
    .withColumnRenamed("count", "counts")
    .withColumn("window", lit(cs))
)
I get an Array of DataFrames, one for each column of df. Each of these dataframes has 4 rows (one for each outcome). Then I do:
val longOutput = output.reduce(_ union _) // flatten the array to produce one dataframe
longOutput.show()
to collapse the Array.
+--------+------+------+
|severity|counts|window|
+--------+------+------+
| 0| 2| a|
| 1| 3| a|
| 2| 1| a|
| 3| 1| a|
| 0| 1| b|
| 1| 4| b|
| 2| 1| b|
| 3| 1| b|
...
And finally, I pivot on the original column names
longOutput.cache()
val results = longOutput.groupBy("window").pivot("severity").agg(first("counts"))
results.show()
+------+---+---+---+---+
|window| 0| 1| 2| 3|
+------+---+---+---+---+
| e| 2| 2| 2| 1|
| d| 3| 2| 1| 1|
| c| 1| 4| 1| 1|
| b| 1| 4| 1| 1|
| a| 2| 3| 1| 1|
+------+---+---+---+---+
However the reduction piece took 8 full seconds on the toy example. It ran for over 2 hours on my actual data which had 1000 columns and 400,000 rows before I terminated it. I am running locally on a machine with 12 cores and 128G of RAM. But clearly, what I'm doing is slow on even a small amount of data, so machine size is not in itself the problem. The column groupby/count took only 7 minutes on the full data set. But then I can't do anything with that Array[DataFrame].
I tried several ways of avoiding the union. I tried writing out my array to disk, but that failed due to a memory problem after several hours of effort. I also tried to adjust memory allowances in Zeppelin.
So I need a way of doing the tabulation that does not give me an Array of DataFrames, but rather a simple data frame.
The problem with your code is that you trigger one Spark job per column and then a big union. In general, it's much faster to try to keep everything within a single job.
In your case, instead of dividing the work, you could explode the dataframe to do everything in one pass like this:
df
  .select(array(df.columns.map(c => struct(lit(c) as "name", col(c) as "value")): _*) as "a")
  .select(explode('a))
  .select($"col.name" as "name", $"col.value" as "value")
  .groupBy("name")
  .pivot("value")
  .count()
  .show()
The first select is the only part that's a bit tricky. It creates an array of structs where each column name is paired with its value. Then we explode it (one row per element of the array) and finally compute a basic pivot.
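For anyone doing this in PySpark, a rough equivalent of the same one-pass trick might look like the sketch below (my own variant, not from the answer above; it assumes the columns share a compatible type so they can sit in one array):
from pyspark.sql import functions as F

exploded = df.select(
    F.explode(
        F.array(*[F.struct(F.lit(c).alias("name"), F.col(c).alias("value")) for c in df.columns])
    ).alias("col")
).select("col.name", "col.value")

exploded.groupBy("name").pivot("value").count().show()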

How to create a sequence of events (column values) per some other column?

I have a Spark data frame as shown below -
val myDF = Seq(
  (1, "A", 100, 0, 0),
  (1, "E", 200, 0, 0),
  (1, "", 300, 1, 49),
  (2, "A", 200, 0, 0),
  (2, "C", 300, 0, 0),
  (2, "D", 100, 0, 0)
).toDF("visitor", "channel", "timestamp", "purchase_flag", "amount")
scala> myDF.show
+-------+-------+---------+-------------+------+
|visitor|channel|timestamp|purchase_flag|amount|
+-------+-------+---------+-------------+------+
| 1| A| 100| 0| 0|
| 1| E| 200| 0| 0|
| 1| | 300| 1| 49|
| 2| A| 200| 0| 0|
| 2| C| 300| 0| 0|
| 2| D| 100| 0| 0|
+-------+-------+---------+-------------+------+
I would like to create a sequence dataframe for every visitor from myDF that traces the visitor's path to purchase, ordered by the timestamp dimension.
The output dataframe should look like below (-> can be any delimiter) -
+-------+---------------------+
|visitor|channel sequence |
+-------+---------------------+
| 1| A->E->purchase |
| 2| D->A->C->no_purchase|
+-------+---------------------+
To make things clear, visitor 2 has been exposed to channel D, then A, and then C; and he does not make a purchase.
Hence the sequence is to be formed as D->A->C->no_purchase.
NOTE: Whenever a purchase happens, channel value goes blank and purchase_flag is set to 1.
I want to do this using a Scala UDF in Spark so that I can re-apply the method to other datasets.
Here's how it is done using a udf function:
val myDF = Seq(
  (1, "A", 100, 0, 0),
  (1, "E", 200, 0, 0),
  (1, "", 300, 1, 49),
  (2, "A", 200, 0, 0),
  (2, "C", 300, 0, 0),
  (2, "D", 100, 0, 0)
).toDF("visitor", "channel", "timestamp", "purchase_flag", "amount")

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def sequenceUdf = udf((struct: Seq[Row], purchased: Seq[Int]) =>
  struct.map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
    .sortBy(_._2)
    .map(_._1)
    .filterNot(_ == "")
    .mkString("->") + { if (purchased.contains(1)) "->purchase" else "->no_purchase" })

myDF.groupBy("visitor")
  .agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
  .select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
  .show(false)
which should give you
+-------+--------------------+
|visitor|channel sequence |
+-------+--------------------+
|1 |A->E->purchase |
|2 |D->A->C->no_purchase|
+-------+--------------------+
You can make it as generic as you need; this is just a demo of how you could proceed.
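If you would rather stay in PySpark and avoid a UDF altogether, a rough sketch of the same idea (my own variant, assuming an equivalent PySpark DataFrame named myDF and Spark 2.4+ for array_sort and the higher-order functions) could be:
from pyspark.sql import functions as F

result = (
    myDF.groupBy("visitor")
    .agg(
        # collect (timestamp, channel) structs; array_sort orders them by timestamp
        F.array_sort(F.collect_list(F.struct("timestamp", "channel"))).alias("path"),
        F.max("purchase_flag").alias("purchased"),
    )
    # keep the channel names in order, dropping the blank channel of the purchase row
    .withColumn("channels", F.expr("filter(transform(path, x -> x.channel), c -> c != '')"))
    .withColumn(
        "channel sequence",
        F.concat(
            F.array_join("channels", "->"),
            F.when(F.col("purchased") == 1, F.lit("->purchase")).otherwise(F.lit("->no_purchase")),
        ),
    )
    .select("visitor", "channel sequence")
)
result.show(truncate=False)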

Adding a column in a DataFrame from another column of the same DataFrame in PySpark

I have a PySpark dataframe df, like the following:
+---+----+---+
| id|name| c|
+---+----+---+
| 1| a| 5|
| 2| b| 4|
| 3| c| 2|
| 4| d| 3|
| 5| e| 1|
+---+----+---+
I want to add a column match_name that takes its value from the name column of the row where id == c.
Is it possible to do it with the withColumn() function?
Currently I have to create two dataframes and then perform a join, which is inefficient on a large dataset.
Expected Output:
+---+----+---+----------+
| id|name| c|match_name|
+---+----+---+----------+
| 1| a| 5| e|
| 2| b| 4| d|
| 3| c| 2| b|
| 4| d| 3| c|
| 5| e| 1| a|
+---+----+---+----------+
Yes, it is possible, with when:
from pyspark.sql.functions import when, col

condition = col("id") == col("match")
result = df.withColumn("match_name", when(condition, col("name")))
result.show()
+---+----+-----+----------+
| id|name|match|match_name|
+---+----+-----+----------+
| 1| a| 3| null|
| 2| b| 2| b|
| 3| c| 5| null|
| 4| d| 4| d|
| 5| e| 1| null|
+---+----+-----+----------+
You may also use otherwise to provide a different value if the condition is not met.
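If you do end up needing the exact lookup from the question (take name from the row whose id equals the current row's c), a when inside withColumn only sees the current row, so a cross-row lookup generally needs a self-join; a minimal sketch with the question's column names:
from pyspark.sql import functions as F

lookup = df.select(F.col('id').alias('c'), F.col('name').alias('match_name'))
result = df.join(lookup, on='c', how='left')
result.orderBy('id').show()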