PySpark window function with filter

I have the following DataFrame with columns: ["id", "timestamp", "x", "y"]:
+---+----------+---+---+
| id| timestamp| x| y|
+---+----------+---+---+
| 0|1443489380|100| 1|
| 0|1443489390|200| 0|
| 0|1443489400|300| 0|
| 1|1443489410|400| 1|
| 1|1443489550|100| 1|
| 2|1443489560|600| 0|
| 2|1443489570|200| 0|
| 2|1443489580|700| 1|
+---+----------+---+---+
I have defined the following Window:
from pyspark.sql import Window
w = Window.partitionBy("id").orderBy("timestamp")
I would like to extract only the first and last row of data in the window w. How can I accomplish this?

If you want the first and last values on the same row, one way is to use pyspark.sql.functions.first() over an ascending and a descending window:
from pyspark.sql import Window
import pyspark.sql.functions as f

w1 = Window.partitionBy("id").orderBy("timestamp")
w2 = Window.partitionBy("id").orderBy(f.col("timestamp").desc())  # sorted descending

df.select(
    "id",
    *([f.first(c).over(w1).alias("first_" + c) for c in df.columns if c != "id"] +
      [f.first(c).over(w2).alias("last_" + c) for c in df.columns if c != "id"])
)\
    .distinct()\
    .show()
#+---+---------------+-------+-------+--------------+------+------+
#| id|first_timestamp|first_x|first_y|last_timestamp|last_x|last_y|
#+---+---------------+-------+-------+--------------+------+------+
#| 0| 1443489380| 100| 1| 1443489400| 300| 0|
#| 1| 1443489410| 400| 1| 1443489550| 100| 1|
#| 2| 1443489560| 600| 0| 1443489580| 700| 1|
#+---+---------------+-------+-------+--------------+------+------+
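An alternative sketch (my addition, assuming the same df as above): keep a single ascending window but give it an explicit frame spanning the whole partition, so first() and last() both see the entire group:
from pyspark.sql import Window
import pyspark.sql.functions as f

# Frame covering the entire partition, so last() can see the final row as well
w_full = (Window.partitionBy("id")
                .orderBy("timestamp")
                .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.select(
    "id",
    *([f.first(c).over(w_full).alias("first_" + c) for c in df.columns if c != "id"] +
      [f.last(c).over(w_full).alias("last_" + c) for c in df.columns if c != "id"])
).distinct().show()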


How to sequentially iterate rows in a PySpark DataFrame

I have a Spark DataFrame like this:
+-------+------+-----+---------------+
|Account|nature|value| time|
+-------+------+-----+---------------+
| a| 1| 50|10:05:37:293084|
| a| 1| 50|10:06:46:806510|
| a| 0| 50|11:19:42:951479|
| a| 1| 40|19:14:50:479055|
| a| 0| 50|16:56:17:251624|
| a| 1| 40|16:33:12:133861|
| a| 1| 20|17:33:01:385710|
| b| 0| 30|12:54:49:483725|
| b| 0| 40|19:23:25:845489|
| b| 1| 30|10:58:02:276576|
| b| 1| 40|12:18:27:161290|
| b| 0| 50|12:01:50:698592|
| b| 0| 50|08:45:53:894441|
| b| 0| 40|17:36:55:827330|
| b| 1| 50|17:18:41:728486|
+-------+------+-----+---------------+
I want to compare the nature column of one row to other rows with the same Account and value, looking forward, and add a new column named Repeated. The new column gets true for both rows if nature changed from 1 to 0 or vice versa. For example, the above dataframe should look like this:
+-------+------+-----+---------------+--------+
|Account|nature|value| time|Repeated|
+-------+------+-----+---------------+--------+
| a| 1| 50|10:05:37:293084| true |
| a| 1| 50|10:06:46:806510| true|
| a| 0| 50|11:19:42:951479| true |
| a| 0| 50|16:56:17:251624| true |
| b| 0| 50|08:45:53:894441| true |
| b| 0| 50|12:01:50:698592| false|
| b| 1| 50|17:18:41:728486| true |
| a| 1| 40|16:33:12:133861| false|
| a| 1| 40|19:14:50:479055| false|
| b| 1| 40|12:18:27:161290| true|
| b| 0| 40|17:36:55:827330| true |
| b| 0| 40|19:23:25:845489| false|
| b| 1| 30|10:58:02:276576| true|
| b| 0| 30|12:54:49:483725| true |
| a| 1| 20|17:33:01:385710| false|
+-------+------+-----+---------------+--------+
My approach is to group by (or window over) the Account and value columns; then, within each group, compare the nature of each row to the nature of the other rows and, as a result of the comparison, fill the Repeated column. I did this calculation with Spark window functions, like this:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("Account", "value").orderBy("time")
df.withColumn("Repeated", f.coalesce(f.when(f.lead(df['nature']).over(windowSpec) != df['nature'], f.lit(True)).otherwise(False))).show()
The result was like this, which is not the result that I wanted:
+-------+------+-----+---------------+--------+
|Account|nature|value| time|Repeated|
+-------+------+-----+---------------+--------+
| a| 1| 50|10:05:37:293084| false|
| a| 1| 50|10:06:46:806510| true|
| a| 0| 50|11:19:42:951479| false|
| a| 0| 50|16:56:17:251624| false|
| b| 0| 50|08:45:53:894441| false|
| b| 0| 50|12:01:50:698592| true|
| b| 1| 50|17:18:41:728486| false|
| a| 1| 40|16:33:12:133861| false|
| a| 1| 40|19:14:50:479055| false|
| b| 1| 40|12:18:27:161290| true|
| b| 0| 40|17:36:55:827330| false|
| b| 0| 40|19:23:25:845489| false|
| b| 1| 30|10:58:02:276576| true|
| b| 0| 30|12:54:49:483725| false|
| a| 1| 20|17:33:01:385710| false|
+-------+------+-----+---------------+--------+
UPDATE:
To explain more, suppose the first Spark DataFrame is named "df". Below, as pandas-style pseudocode, is what I want to do within each group of "Account" and "value":
a = df.withColumn('repeated', lit(False))
for i in range(len(group)):
    for j in range(i + 1, len(group)):
        if a.loc[i, 'nature'] != a.loc[j, 'nature'] and a.loc[j, 'repeated'] == False:
            a.loc[i, 'repeated'] = True
            a.loc[j, 'repeated'] = True
Would you please guide me on how to do this using a PySpark window?
Any help is really appreciated.
You actually need to guarantee that the order you see in your dataframe is the actual order. Can you do that? You need a column that sequences what happened in the order it happened; inserting new data into a dataframe doesn't guarantee its order.
A window and lag/lead will allow you to look at an adjacent row's value and make the required adjustment.
FYI: I use coalesce here because the boundary row of each partition has no neighbouring value to compare with. Consider using the second parameter of coalesce to control what should happen with that row in each account.
If you don't have such a column, look at the monotonically increasing id function. It may help you create the ordering column that is required for us to look at this data deterministically.
from pyspark.sql.functions import lead, lit, coalesce
from pyspark.sql.window import Window

spark.sql("create table nature (Account string, nature int, value int, order int)")
spark.sql("insert into nature values ('a', 1, 50, 1), ('a', 1, 40, 2), ('a', 0, 50, 3), ('b', 0, 30, 4), ('b', 0, 40, 5), ('b', 1, 30, 6), ('b', 1, 40, 7)")

windowSpec = Window.partitionBy("Account").orderBy("order")
nature = spark.table("nature")
nature.withColumn("Repeated", coalesce(lead(nature['nature']).over(windowSpec) != nature['nature'], lit(True))).show()
+-------+------+-----+-----+--------+
|Account|nature|value|order|Repeated|
+-------+------+-----+-----+--------+
| b| 0| 30| 4| false|
| b| 0| 40| 5| true|
| b| 1| 30| 6| false|
| b| 1| 40| 7| true|
| a| 1| 50| 1| false|
| a| 1| 40| 2| true|
| a| 0| 50| 3| true|
+-------+------+-----+-----+--------+
EDIT:
It's not clear from your description whether I should look forward or backward. I have changed my code to look forward a row, as this is consistent with account 'b' in your output. However, the logic for account 'a' doesn't seem identical to the logic for 'b' in your sample output (or I don't understand a subtlety of starting on '1' instead of starting on '0'). If you want to look forward a row use lead; if you want to look back a row use lag.
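If there is no natural ordering column at all, here is a minimal sketch of the monotonically increasing id idea mentioned above (my addition; it assumes the DataFrame df already arrives in the order you trust):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach a surrogate ordering column; the ids are increasing but not consecutive,
# and they only reflect the order the rows currently arrive in.
df_ordered = df.withColumn("order", F.monotonically_increasing_id())

windowSpec = Window.partitionBy("Account", "value").orderBy("order")
df_ordered.withColumn(
    "Repeated",
    F.coalesce(F.lead("nature").over(windowSpec) != F.col("nature"), F.lit(True))
).show()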
Problem solved.
Even though this approach is costly, it works.
def check(part):
    # 'part' is the pandas DataFrame for one (Account, value) group;
    # it is assumed to already carry a boolean 'repeated' column initialised to False
    df = part
    size = len(df)
    for i in range(size):
        if df.loc[i, 'repeated'] == True:
            continue
        else:
            for j in range(i + 1, size):
                if (df.loc[i, 'nature'] != df.loc[j, 'nature']) & (df.loc[j, 'repeated'] == False):
                    df.loc[j, 'repeated'] = True
                    df.loc[i, 'repeated'] = True
                    break
    return df

df.groupby("Account", "value").applyInPandas(check, schema="Account string, nature int, value long, time string, repeated boolean").show()
Update1:
Another solution without any iterations.
def check(df):
    # pandas logic applied per (account, value) group
    df = df.sort_values('verified_time')
    df['index'] = df.index
    df['IS_REPEATED'] = 0
    df1 = df.sort_values(['nature'], ascending=[True]).reset_index(drop=True)
    df2 = df.sort_values(['nature'], ascending=[False]).reset_index(drop=True)
    df1['IS_REPEATED'] = df1['nature'] ^ df2['nature']
    df3 = df1.sort_values(['index'], ascending=[True])
    df = df3.drop(['index'], axis=1)
    return df

# gf.get_schema('trx') is the author's own helper that returns the output schema string
df = df.groupby("account", "value").applyInPandas(check, schema=gf.get_schema('trx'))
UPDATE2:
Solution with Spark window:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

def is_repeated_feature(df):
    windowPartition = Window.partitionBy("account", "value", 'nature').orderBy('nature')
    df_1 = df.withColumn('rank', F.row_number().over(windowPartition))
    w = (Window
         .partitionBy('account', 'value')
         .orderBy('nature')
         .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
    df_1 = df_1.withColumn("count_nature", F.count('nature').over(w))
    df_1 = df_1.withColumn('sum_nature', F.sum('nature').over(w))
    df_1 = df_1.select('*')
    df_2 = df_1.withColumn('min_val',
                           when((df_1.sum_nature > (df_1.count_nature - df_1.sum_nature)),
                                (df_1.count_nature - df_1.sum_nature)).otherwise(df_1.sum_nature))
    df_2 = df_2.withColumn('more_than_one', when(df_2.count_nature > 1, '1').otherwise('0'))
    df_2 = df_2.withColumn('is_repeated',
                           when(((df_2.more_than_one == 1) & (df_2.count_nature > df_2.sum_nature) &
                                 (df_2.rank <= df_2.min_val)), '1')
                           .otherwise('0'))
    return df_2
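A minimal usage sketch (my addition; it assumes df already contains the account, value and nature columns referenced above):
df_flagged = is_repeated_feature(df)
df_flagged.select('account', 'value', 'nature', 'is_repeated').show()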

Calculate running total in PySpark dataframes and break the loop when a condition occurs

I have a Spark dataframe where I need to calculate a running total of the amount, partitioned by col_x. Whenever a negative amount occurs in col_y, I should break the running total of the previous records and restart the running total from the current row.
Sample dataset and expected output: not reproduced here; the answers below reconstruct the data.
How can I achieve this with dataframes using PySpark?
Another way:
Create an index
import sys
import pyspark.sql.functions as f
from pyspark.sql import Window

df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Regenerate the columns
df = df.select('index', 'value.*')
Create groups bounded by negative values
w = Window.partitionBy().orderBy('index').rowsBetween(-sys.maxsize, 0)
df = df.withColumn('cat', f.min('Col_y').over(w))
Cumulative sum within groups
y = Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y), 2)).sort('index').drop('cat', 'index').show()
Outcome
+-----+-------------------+------+
|Col_x| Col_y|cumsum|
+-----+-------------------+------+
| ID1|-17.899999618530273| -17.9|
| ID1| 21.899999618530273| 4.0|
| ID1| 236.89999389648438| 240.9|
| ID1| 4.989999771118164|245.89|
| ID1| 610.2000122070312|856.09|
| ID1| -35.79999923706055| -35.8|
| ID1| 21.899999618530273| -13.9|
| ID1| 17.899999618530273| 4.0|
+-----+-------------------+------+
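One caveat with the steps above (my note, not part of the original answer): Window.partitionBy() with no keys pulls every row into a single partition. If the data really is keyed by Col_x, a hedged variant is to partition both windows by Col_x as well:
import pyspark.sql.functions as f
from pyspark.sql import Window

# Same grouping trick, but scoped per Col_x so Spark can parallelise across keys
w = Window.partitionBy('Col_x').orderBy('index').rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn('cat', f.min('Col_y').over(w))

y = Window.partitionBy('Col_x', 'cat').orderBy('index').rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y), 2)).sort('index').drop('cat', 'index').show()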
I am hoping that in a real scenario you will have a timestamp column to order the data; here I am ordering by a line number obtained from zipWithIndex purely for the sake of explanation.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *

data = [
    ("ID1", -17.9),
    ("ID1", 21.9),
    ("ID1", 236.9),
    ("ID1", 4.99),
    ("ID1", 610.2),
    ("ID1", -35.8),
    ("ID1", 21.9),
    ("ID1", 17.9)
]
schema = StructType([
    StructField('Col_x', StringType(), True),
    StructField('Col_y', FloatType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")

w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x', 'group').orderBy('index')

df_final = spark.sql("select value.Col_x, round(value.Col_y,1) as Col_y, index from valuewithorder")

# Group the data into different groups based on the occurrence of negative values
df_final = df_final.withColumn("valueChange", (f.col('Col_y') < 0).cast("int")) \
    .fillna(0, subset=["valueChange"]) \
    .withColumn("indicator", (~(f.col("valueChange") == 0)).cast("int")) \
    .withColumn("group", f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))

# Cumulative sum over the (Col_x, group) partition
df_cum_sum = df_final.withColumn("Col_z", f.sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x, Col_y, round(Col_z,1) as Col_z from FinalCumSum")
df_cum_sum.show()
Results of the final and intermediate datasets:
>>> df_cum_sum.show()
+-----+-----+-----+
|Col_x|Col_y|Col_z|
+-----+-----+-----+
| ID1|-17.9|-17.9|
| ID1| 21.9| 4.0|
| ID1|236.9|240.9|
| ID1| 5.0|245.9|
| ID1|610.2|856.1|
| ID1|-35.8|-35.8|
| ID1| 21.9|-13.9|
| ID1| 17.9| 4.0|
+-----+-----+-----+
>>> df_final.show()
+-----+-----+-----+-----------+---------+-----+
|Col_x|Col_y|index|valueChange|indicator|group|
+-----+-----+-----+-----------+---------+-----+
| ID1|-17.9| 0| 1| 1| 1|
| ID1| 21.9| 1| 0| 0| 1|
| ID1|236.9| 2| 0| 0| 1|
| ID1| 5.0| 3| 0| 0| 1|
| ID1|610.2| 4| 0| 0| 1|
| ID1|-35.8| 5| 1| 1| 2|
| ID1| 21.9| 6| 0| 0| 2|
| ID1| 17.9| 7| 0| 0| 2|
+-----+-----+-----+-----------+---------+-----+
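For reference, a minimal sketch of the same indicator/group idea using only the DataFrame API, without temp views (my addition; it reuses the df_1 built above from zipWithIndex):
import pyspark.sql.functions as f
from pyspark.sql.window import Window

w_run = Window.partitionBy('Col_x').orderBy('index').rowsBetween(Window.unboundedPreceding, 0)

df_z = (df_1.select('index', 'value.*')
            .withColumn('group', f.sum((f.col('Col_y') < 0).cast('int')).over(w_run))
            .withColumn('Col_z', f.round(f.sum('Col_y').over(
                Window.partitionBy('Col_x', 'group').orderBy('index')
                      .rowsBetween(Window.unboundedPreceding, 0)), 1)))
df_z.orderBy('index').select('Col_x', 'Col_y', 'Col_z').show()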

Fill the data based on the date ranges in Spark

I have a sample dataset, and I want to fill the missing dates with quantity 0 based on a start date and an end date (from 2016-01-01 to 2016-01-08).
id,date,quantity
1,2016-01-03,10
1,2016-01-04,20
1,2016-01-06,30
1,2016-01-07,20
2,2016-01-02,10
2,2016-01-03,10
2,2016-01-04,20
2,2016-01-06,20
2,2016-01-07,20
Based on the solution from the link below I was able to implement a partial solution.
Filling missing dates in spark dataframe column
Can someone please suggest how to fill in every date from start_date through end_date (inclusive) for each id?
id,date,quantity
1,2016-01-01,0
1,2016-01-02,0
1,2016-01-03,10
1,2016-01-04,20
1,2016-01-05,0
1,2016-01-06,30
1,2016-01-07,20
1,2016-01-08,0
2,2016-01-01,0
2,2016-01-02,10
2,2016-01-03,10
2,2016-01-04,20
2,2016-01-05,0
2,2016-01-06,20
2,2016-01-07,20
2,2016-01-08,0
From Spark 2.4, use the sequence function to generate all dates from 2016-01-01 to 2016-01-08.
Then join to the original dataframe and use coalesce to get the quantity and id values.
Example:
from pyspark.sql.functions import lit

df1 = spark.sql("select explode(sequence(date('2016-01-01'),date('2016-01-08'),INTERVAL 1 DAY)) as date").\
    withColumn("quantity", lit(0)).\
    withColumn("id", lit(1))
df1.show()
#+----------+--------+---+
#| date|quantity| id|
#+----------+--------+---+
#|2016-01-01| 0| 1|
#|2016-01-02| 0| 1|
#|2016-01-03| 0| 1|
#|2016-01-04| 0| 1|
#|2016-01-05| 0| 1|
#|2016-01-06| 0| 1|
#|2016-01-07| 0| 1|
#|2016-01-08| 0| 1|
#+----------+--------+---+
df.show()
#+---+----------+--------+
#| id| date|quantity|
#+---+----------+--------+
#| 1|2016-01-03| 10|
#| 1|2016-01-04| 20|
#| 1|2016-01-06| 30|
#| 1|2016-01-07| 20|
#+---+----------+--------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
exprs = ['date'] + [coalesce(col('df.' + c), col('df1.' + c)).alias(c) for c in df1.columns if c != 'date']

df1.alias("df1").\
    join(df.alias("df"), ['date'], 'left').\
    select(*exprs).\
    orderBy("date").\
    show()
#+----------+--------+---+
#| date|quantity| id|
#+----------+--------+---+
#|2016-01-01| 0| 1|
#|2016-01-02| 0| 1|
#|2016-01-03| 10| 1|
#|2016-01-04| 20| 1|
#|2016-01-05| 0| 1|
#|2016-01-06| 30| 1|
#|2016-01-07| 20| 1|
#|2016-01-08| 0| 1|
#+----------+--------+---+
Update:
df = spark.createDataFrame([(1,'2016-01-03',10),(1,'2016-01-04',20),(1,'2016-01-06',30),(1,'2016-01-07',20),
                            (2,'2016-01-02',10),(2,'2016-01-03',10),(2,'2016-01-04',20),(2,'2016-01-06',20),
                            (2,'2016-01-07',20)], ["id","date","quantity"])
df1 = df.selectExpr("id").distinct()\
        .selectExpr("id", "explode(sequence(date('2016-01-01'),date('2016-01-08'),INTERVAL 1 DAY)) as date")\
        .withColumn("quantity", lit(0))

from pyspark.sql.functions import *
from pyspark.sql.types import *

exprs = [coalesce(col('df.' + c), col('df1.' + c)).alias(c) for c in df1.columns]
df2 = df1.alias("df1").join(df.alias("df"),
                            (col("df1.date") == col("df.date")) & (col("df1.id") == col("df.id")),
                            'left').select(*exprs)
df2.orderBy("id", "date").show()
#+---+----------+--------+
#| id| date|quantity|
#+---+----------+--------+
#| 1|2016-01-01| 0|
#| 1|2016-01-02| 0|
#| 1|2016-01-03| 10|
#| 1|2016-01-04| 20|
#| 1|2016-01-05| 0|
#| 1|2016-01-06| 30|
#| 1|2016-01-07| 20|
#| 1|2016-01-08| 0|
#| 2|2016-01-01| 0|
#| 2|2016-01-02| 10|
#| 2|2016-01-03| 10|
#| 2|2016-01-04| 20|
#| 2|2016-01-05| 0|
#| 2|2016-01-06| 20|
#| 2|2016-01-07| 20|
#| 2|2016-01-08| 0|
#+---+----------+--------+
If you specifically want to fill the null values with 0, then fillna also works well.
import pyspark.sql.functions as f
from pyspark.sql import Window
df2 = df.select('id').distinct() \
    .withColumn('date', f.expr("explode(sequence(date('2016-01-01'), date('2016-01-08'), INTERVAL 1 days)) as date"))
df2.join(df, ['id', 'date'], 'left').fillna(0).orderBy('id', 'date').show(20, False)
+---+----------+--------+
|id |date |quantity|
+---+----------+--------+
|1 |2016-01-01|0 |
|1 |2016-01-02|0 |
|1 |2016-01-03|10 |
|1 |2016-01-04|20 |
|1 |2016-01-05|0 |
|1 |2016-01-06|30 |
|1 |2016-01-07|20 |
|1 |2016-01-08|0 |
|2 |2016-01-01|0 |
|2 |2016-01-02|10 |
|2 |2016-01-03|10 |
|2 |2016-01-04|20 |
|2 |2016-01-05|0 |
|2 |2016-01-06|20 |
|2 |2016-01-07|20 |
|2 |2016-01-08|0 |
+---+----------+--------+
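If you are on a Spark version older than 2.4 (no sequence function), here is a hedged sketch of the same idea using spark.range and date_add instead (my addition; n_days and the cast to string are assumptions so the join keys match the string dates in df):
from pyspark.sql import functions as f

n_days = 8  # covers 2016-01-01 .. 2016-01-08
calendar = spark.range(n_days).select(
    f.expr("cast(date_add(date('2016-01-01'), cast(id as int)) as string)").alias("date"))

df2 = df.select('id').distinct().crossJoin(calendar)
df2.join(df, ['id', 'date'], 'left').fillna(0, subset=['quantity']).orderBy('id', 'date').show()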

How to create a dataframe based on the first appearing date and additional columns for each id

I am trying to create a dataframe with the following condition:
I have multiple IDs, multiple columns with defaults (0 or 1) and a start-date column. I would like to get a dataframe with the defaults that appear on the first start date (default_date) for each id.
The original df looks like this:
+----+-----+-----+-----+-----------+
|id |def_a|def_b|deb_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 1| 2019-01-31|
| 02| 1| 1| 0| 2018-12-31|
| 03| 1| 1| 1| 2018-10-31|
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 0| 2018-07-31|
| 03| 1| 1| 1| 2019-05-31|
+----+-----+-----+-----+-----------+
This is how I would like to have it:
+----+-----+-----+-----+-----------+
|id |def_a|def_b|deb_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 1| 2018-07-31|
+----+-----+-----+-----+-----------+
I tried the following code:
val w = Window.partitionBy($"id").orderBy($"date".asc)
val result = join3.withColumn("rn", row_number.over(w))
  .where($"def_a" === 1 || $"def_b" === 1 || $"deb_c" === 1)
  .filter($"rn" >= 1)
  .drop("rn")
result.show
I would be grateful for any help.
This should work for you. First compute the minimum date per id on the original df, then join the new df2 back with df.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 1, 0, 1, "2019-01-31"),
  (2, 1, 1, 0, "2018-12-31"),
  (3, 1, 1, 1, "2018-10-31"),
  (1, 1, 0, 1, "2018-09-30"),
  (2, 1, 1, 0, "2018-08-31"),
  (3, 1, 1, 0, "2018-07-31"),
  (3, 1, 1, 1, "2019-05-31"))
  .toDF("id", "def_a", "def_b", "deb_c", "date")

val w = Window.partitionBy($"id").orderBy($"date".asc)

val df2 = df.withColumn("date", $"date".cast("date"))
  .withColumn("min_date", min($"date").over(w))
  .select("id", "min_date")
  .distinct()

df.join(df2, df("id") === df2("id") && df("date") === df2("min_date"))
  .select(df("*"))
  .show
And the output should be:
+---+-----+-----+-----+----------+
| id|def_a|def_b|deb_c| date|
+---+-----+-----+-----+----------+
| 1| 1| 0| 1|2018-09-30|
| 2| 1| 1| 0|2018-08-31|
| 3| 1| 1| 0|2018-07-31|
+---+-----+-----+-----+----------+
By the way, I believe you had a little mistake in your expected results: it should be (3, 1, 1, 0, 2018-07-31), not (3, 1, 1, 1, 2018-07-31).
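For readers following the rest of this page in PySpark, here is a rough equivalent sketch of the same min-date-per-id idea (my addition; it assumes a PySpark DataFrame df with the columns shown above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id")

# Keep only the row(s) whose date equals the minimum date within their id
result = (df.withColumn("date", F.col("date").cast("date"))
            .withColumn("min_date", F.min("date").over(w))
            .where(F.col("date") == F.col("min_date"))
            .drop("min_date"))
result.show()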

Different aggregate operations on different columns in PySpark

I am trying to apply different aggregation functions to different columns in a PySpark dataframe. Following some suggestions on Stack Overflow, I tried this:
the_columns1 = ["product1","product2"]
the_columns2 = ["customer1","customer2"]
exprs = [mean(col(d)) for d in the_columns1, count(col(c)) for c in the_columns2]
followed by
df.groupby(*group).agg(*exprs)
where "group" is a column not present in either the_columns or the_columns2. This does not work. How to do different aggregation functions on different columns?
You are very close already; instead of putting the two comprehensions inside one list, concatenate them so you have a single flat list of expressions:
exprs = [mean(col(d)) for d in the_columns1] + [count(col(c)) for c in the_columns2]
Here is a demo:
import pyspark.sql.functions as F
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 2| 1|
| 1| 2| 2| 2|
| 2| 3| 3| 3|
| 2| 4| 3| 4|
+---+---+---+---+
cols = ['b']
cols2 = ['c', 'd']
exprs = [F.mean(F.col(x)) for x in cols] + [F.count(F.col(x)) for x in cols2]
df.groupBy('a').agg(*exprs).show()
+---+------+--------+--------+
| a|avg(b)|count(c)|count(d)|
+---+------+--------+--------+
| 1| 1.5| 2| 2|
| 2| 3.5| 2| 2|
+---+------+--------+--------+
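For simple cases, a hedged alternative (not from the original answer) is to pass a column-to-function mapping to agg; it avoids building the expression list by hand, at the cost of less control over aliases:
# Equivalent idea using the dict form of agg: keys are column names,
# values are SQL aggregate function names.
aggs = {**{x: 'mean' for x in cols}, **{x: 'count' for x in cols2}}
df.groupBy('a').agg(aggs).show()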