PySpark: keep state within tasks

This is related to this question: Pyspark dataframe column value dependent on value from another row, but this one is even more complicated.
I have a dataframe:
columns = ['id','seq','manufacturer']
data = [("1",1,"Factory"), ("1",2,"Sub-Factory-1"), ("1",3,"Order"),("1",4,"Sub-Factory-1"),("2",1,"Factory"), ("2",2,"Sub-Factory-1"), ("2",5,"Sub-Factory-1"),("3",1, "Sub-Factory-1"),("3",2,"Order"), ("3",4, "Sub-Factory-1"), ("4", 1,"Factory"), ("4",3, "Sub-Factory-1"),("4",4, "Sub-Factory-1"),("5",1,"Sub-Factory-1"), ("5",2, "Sub-Factory-1"), ("5", 6,"Order"), ("6",2,"Factory"), ("6",3, "Order"), ("6",4,"Sub-Factory-1"), ("6", 6,"Sub-Factory-1"), ("6",7,"Order"), ("7",1,"Sub-Factory-1"), ("7",2,"Factory" ), ("7", 3,"Order"), ("7", 4,"Sub-Factory-1"),("7",5,"Factory"), ("7",8, "Sub-Factory-1"),("7",10,"Sub-Factory-1")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.orderBy('id','seq').show(40)
+---+---+-------------+
| id|seq| manufacturer|
+---+---+-------------+
| 1| 1| Factory|
| 1| 2|Sub-Factory-1|
| 1| 3| Order|
| 1| 4|Sub-Factory-1|
| 2| 1| Factory|
| 2| 2|Sub-Factory-1|
| 2| 5|Sub-Factory-1|
| 3| 1|Sub-Factory-1|
| 3| 2| Order|
| 3| 4|Sub-Factory-1|
| 4| 1| Factory|
| 4| 3|Sub-Factory-1|
| 4| 4|Sub-Factory-1|
| 5| 1|Sub-Factory-1|
| 5| 2|Sub-Factory-1|
| 5| 6| Order|
| 6| 2| Factory|
| 6| 3| Order|
| 6| 4|Sub-Factory-1|
| 6| 6|Sub-Factory-1|
| 6| 7| Order|
| 7| 1|Sub-Factory-1|
| 7| 2| Factory|
| 7| 3| Order|
| 7| 4|Sub-Factory-1|
| 7| 5| Factory|
| 7| 8|Sub-Factory-1|
| 7| 10|Sub-Factory-1|
+---+---+-------------+
What I want to do is assign hierarchical values to another column (not saying it's the best idea) that I can use with the logic from Pyspark dataframe column value dependent on value from another row. So, within each id group and in seq order, only the first Sub-Factory should be attributed to a Factory, and only if there is a Factory above that Sub-Factory within the same id and seq order.
So the end result should look like:
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
| 6| 7| Order| 0|
| 7| 1|Sub-Factory-1| 0|
| 7| 2| Factory| 1|
| 7| 3| Order| 0|
| 7| 4|Sub-Factory-1| 1|
| 7| 5| Factory| 1|
| 7| 8|Sub-Factory-1| 1|
| 7| 10|Sub-Factory-1| 0|
+---+---+-------------+-------+
The dataset is large, so I can't use something like df.collect() and then loop over the data, because that runs out of memory. My first idea was to use an accumulator like:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

acc = sc.accumulator(0)

def myFunc(manufacturer):
    if manufacturer == 'Factory':
        acc.value = 1
        return 1
    elif manufacturer == 'Sub-Factory-1' and acc.value == 1:
        acc.value = 0
        return 1
    else:
        return 0

myFuncUDF = F.udf(myFunc, IntegerType())
df = df.withColumn('test', myFuncUDF(col('manufacturer')))
But it's a bad idea, since an accumulator's value cannot be accessed within tasks: executors can only add to it, and only the driver can read it.
A window function also solves it if I want to attribute every Sub-Factory that follows a Factory within the same id, but here only the first such Sub-Factory should be attributed. Any ideas? For reference, here is the window-function version:

from pyspark.sql.window import Window
from pyspark.sql.functions import *
df_mod = df.filter(df.manufacturer == 'Sub-Factory-1')
W = Window.partitionBy("id").orderBy("seq")
df_mod = df_mod.withColumn("rank",rank().over(W))
df_mod = df_mod.filter(col('rank') == 1)
df_mod2 = df.filter(col('manufacturer') == 'Factory')\
.select('id', 'seq', col('manufacturer').alias('Factory_chk_2'))
df_f = df\
.join(df_mod, ['id', 'seq'], 'left')\
.select('id', 'seq', df.manufacturer, 'rank')\
.join(df_mod2, 'id', 'left')\
.select('id', df.seq, df.manufacturer, 'rank', 'Factory_chk_2')\
.withColumn('Factory_chk', when(df.manufacturer=='Factory', 1))\
.withColumn('Factory_chk_2', when(col('Factory_chk_2')=='Factory', 1))\
.withColumn('checker',when(col('Factory_chk_2')=='1', coalesce(col('rank'),col('Factory_chk'))).otherwise(lit(0)))\
.select('id', 'seq', 'manufacturer', 'checker')\
.na.fill(value=0)\
.orderBy('id', 'seq')
df_f.show()
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
+---+---+-------------+-------+
only showing top 20 rows
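Not part of the original post, but one possible way to get the checker column with window functions alone: open a new "block" at every Factory row within an id, then flag only the first Sub-Factory-1 inside each block. This is a hedged sketch (factory_block and sub_rank are made-up helper column names):

from pyspark.sql import Window
import pyspark.sql.functions as F

w_id = Window.partitionBy('id').orderBy('seq')

# Running count of Factory rows: each Factory opens a new "block" within its id
df2 = df.withColumn(
    'factory_block',
    F.sum(F.when(F.col('manufacturer') == 'Factory', 1).otherwise(0)).over(w_id)
)

# Number the Sub-Factory-1 rows within each (id, factory_block), ordered by seq
w_block = Window.partitionBy('id', 'factory_block').orderBy('seq')
df2 = df2.withColumn(
    'sub_rank',
    F.sum(F.when(F.col('manufacturer') == 'Sub-Factory-1', 1).otherwise(0)).over(w_block)
)

result = df2.withColumn(
    'checker',
    F.when(F.col('manufacturer') == 'Factory', 1)
     .when((F.col('manufacturer') == 'Sub-Factory-1')
           & (F.col('factory_block') > 0)      # a Factory appeared earlier in this id
           & (F.col('sub_rank') == 1), 1)      # and this is the first Sub-Factory-1 since then
     .otherwise(0)
).select('id', 'seq', 'manufacturer', 'checker')

result.orderBy('id', 'seq').show(40)

Since everything runs through windows partitioned by id, nothing needs to be collected to the driver.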

Related

Window function based on a condition

I have the following DF:
|-----------------------|
|Date | Val | Cond|
|-----------------------|
|2022-01-08 | 2 | 0 |
|2022-01-09 | 4 | 1 |
|2022-01-10 | 6 | 1 |
|2022-01-11 | 8 | 0 |
|2022-01-12 | 2 | 1 |
|2022-01-13 | 5 | 1 |
|2022-01-14 | 7 | 0 |
|2022-01-15 | 9 | 0 |
|-----------------------|
For every date, I need to sum the Val of the two most recent dates with Cond = 1 before it; my expected output is:
|-----------------|
|Date | Sum |
|-----------------|
|2022-01-08 | 0 | No sum, because there are not two dates with Cond = 1 before this date
|2022-01-09 | 0 | No sum, because there are not two dates with Cond = 1 before this date
|2022-01-10 | 0 | No sum, because there are not two dates with Cond = 1 before this date
|2022-01-11 | 10 | (4+6)
|2022-01-12 | 10 | (4+6)
|2022-01-13 | 8 | (2+6)
|2022-01-14 | 7 | (5+2)
|2022-01-15 | 7 | (5+2)
|-----------------|
I've tried to get the output DF using this code:
df = df.where("Cond = 1").withColumn(
    "ListView",
    f.collect_list("Val").over(windowSpec.rowsBetween(-2, -1))
)
But when I use .where("Cond = 1"), I exclude the dates where Cond equals zero.
I found the following answer, but it didn't help me:
Window.rowsBetween - only consider rows fulfilling a specific condition (e.g. not being null)
How can I achieve my expected output using window functions?
The MVCE:
from pyspark.sql.types import StructType, StructField, DateType, IntegerType

data_1 = [
    ("2022-01-08", 2, 0),
    ("2022-01-09", 4, 1),
    ("2022-01-10", 6, 1),
    ("2022-01-11", 8, 0),
    ("2022-01-12", 2, 1),
    ("2022-01-13", 5, 1),
    ("2022-01-14", 7, 0),
    ("2022-01-15", 9, 0)
]
schema_1 = StructType([
    StructField("Date", DateType(), True),
    StructField("Val", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
The following should do the trick (but I'm sure it can be further optimized).
Setup:
from pyspark.sql import Window
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_1 = [
    ("2022-01-08", 2, 0),
    ("2022-01-09", 4, 1),
    ("2022-01-10", 6, 1),
    ("2022-01-11", 8, 0),
    ("2022-01-12", 2, 1),
    ("2022-01-13", 5, 1),
    ("2022-01-14", 7, 0),
    ("2022-01-15", 9, 0),
    ("2022-01-16", 9, 0),
    ("2022-01-17", 9, 0)
]
schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Val", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
df_1 = df_1.withColumn('Date', to_date("Date", "yyyy-MM-dd"))
+----------+---+----+
| Date|Val|Cond|
+----------+---+----+
|2022-01-08| 2| 0|
|2022-01-09| 4| 1|
|2022-01-10| 6| 1|
|2022-01-11| 8| 0|
|2022-01-12| 2| 1|
|2022-01-13| 5| 1|
|2022-01-14| 7| 0|
|2022-01-15| 9| 0|
|2022-01-16| 9| 0|
|2022-01-17| 9| 0|
+----------+---+----+
Create a new DF only with Cond==1 rows to obtain the sum of two consecutive rows with that condition:
windowSpec = Window.partitionBy("Cond").orderBy("Date")
df_2 = df_1.where(df_1.Cond == 1).withColumn(
    "Sum",
    sum("Val").over(windowSpec.rowsBetween(-1, 0))
).withColumn('date_1', col('date')).drop('date')
+---+----+---+----------+
|Val|Cond|Sum| date_1|
+---+----+---+----------+
| 4| 1| 4|2022-01-09|
| 6| 1| 10|2022-01-10|
| 2| 1| 8|2022-01-12|
| 5| 1| 7|2022-01-13|
+---+----+---+----------+
Do a left join to get the sum into the original data frame, and set the sum to zero for the rows with Cond==0:
df_3 = df_1.join(df_2.select('sum', col('date_1')), df_1.Date == df_2.date_1, "left").drop('date_1').fillna(0)
+----------+---+----+---+
| Date|Val|Cond|sum|
+----------+---+----+---+
|2022-01-08| 2| 0| 0|
|2022-01-09| 4| 1| 4|
|2022-01-10| 6| 1| 10|
|2022-01-11| 8| 0| 0|
|2022-01-12| 2| 1| 8|
|2022-01-13| 5| 1| 7|
|2022-01-14| 7| 0| 0|
|2022-01-15| 9| 0| 0|
|2022-01-16| 9| 0| 0|
|2022-01-17| 9| 0| 0|
+----------+---+----+---+
Do a cumulative sum on the condition column:
df_3=df_3.withColumn('cond_sum', sum('cond').over(Window.orderBy('Date')))
+----------+---+----+---+--------+
| Date|Val|Cond|sum|cond_sum|
+----------+---+----+---+--------+
|2022-01-08| 2| 0| 0| 0|
|2022-01-09| 4| 1| 4| 1|
|2022-01-10| 6| 1| 10| 2|
|2022-01-11| 8| 0| 0| 2|
|2022-01-12| 2| 1| 8| 3|
|2022-01-13| 5| 1| 7| 4|
|2022-01-14| 7| 0| 0| 4|
|2022-01-15| 9| 0| 0| 4|
|2022-01-16| 9| 0| 0| 4|
|2022-01-17| 9| 0| 0| 4|
+----------+---+----+---+--------+
Finally, for each partition where the cond_sum is greater than 1, use the max sum for that partition:
df_3.withColumn('sum', when(df_3.cond_sum > 1, max('sum').over(Window.partitionBy('cond_sum'))).otherwise(0)).show()
+----------+---+----+---+--------+
| Date|Val|Cond|sum|cond_sum|
+----------+---+----+---+--------+
|2022-01-08| 2| 0| 0| 0|
|2022-01-09| 4| 1| 0| 1|
|2022-01-10| 6| 1| 10| 2|
|2022-01-11| 8| 0| 10| 2|
|2022-01-12| 2| 1| 8| 3|
|2022-01-13| 5| 1| 7| 4|
|2022-01-14| 7| 0| 7| 4|
|2022-01-15| 9| 0| 7| 4|
|2022-01-16| 9| 0| 7| 4|
|2022-01-17| 9| 0| 7| 4|
+----------+---+----+---+--------+
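For reference, the four steps above can be chained into a single pipeline. This is a sketch of the same logic, not part of the original answer:

from pyspark.sql import Window
import pyspark.sql.functions as F

w_cond = Window.partitionBy('Cond').orderBy('Date').rowsBetween(-1, 0)
w_all = Window.orderBy('Date')

# Sum of the current and previous Cond == 1 rows, keyed by their date
df_sums = df_1.where(F.col('Cond') == 1).select(
    F.col('Date').alias('date_1'),
    F.sum('Val').over(w_cond).alias('Sum')
)

result = (
    df_1.join(df_sums, df_1.Date == df_sums.date_1, 'left')
        .drop('date_1')
        .fillna(0, subset=['Sum'])
        .withColumn('cond_sum', F.sum('Cond').over(w_all))
        .withColumn('Sum', F.when(F.col('cond_sum') > 1,
                                  F.max('Sum').over(Window.partitionBy('cond_sum')))
                            .otherwise(0))
)
result.show()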

Percentile over a specific column

I have the below dataframe.
scala> df.show
+---+------+---+
| M|Amount| Id|
+---+------+---+
| 1| 5| 1|
| 1| 10| 2|
| 1| 15| 3|
| 1| 20| 4|
| 1| 25| 5|
| 1| 30| 6|
| 2| 2| 1|
| 2| 4| 2|
| 2| 6| 3|
| 2| 8| 4|
| 2| 10| 5|
| 2| 12| 6|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
| 3| 4| 4|
| 3| 5| 5|
| 3| 6| 6|
+---+------+---+
created by
val df=Seq( (1,5,1), (1,10,2), (1,15,3), (1,20,4), (1,25,5), (1,30,6), (2,2,1), (2,4,2), (2,6,3), (2,8,4), (2,10,5), (2,12,6), (3,1,1), (3,2,2), (3,3,3), (3,4,4), (3,5,5), (3,6,6) ).toDF("M","Amount","Id")
Here M is the base (grouping) column, and Id is the rank within each M group based on Amount.
I am trying to compute the percentile within each M group, but only over the last three values of Amount.
I am using the below code to find the percentile for a whole group, but how can I target just the last three values?
df.withColumn("percentile",percentile_approx(col("Amount") ,lit(.5)) over Window.partitionBy("M"))
Expected Output
+---+------+---+-----------------------------------+
| M|Amount| Id| percentile |
+---+------+---+-----------------------------------+
| 1| 5| 1| percentile(Amount) whose (Id-1) |
| 1| 10| 2| percentile(Amount) whose (Id-1,2) |
| 1| 15| 3| percentile(Amount) whose (Id-1,3) |
| 1| 20| 4| percentile(Amount) whose (Id-2,4) |
| 1| 25| 5| percentile(Amount) whose (Id-3,5) |
| 1| 30| 6| percentile(Amount) whose (Id-4,6) |
| 2| 2| 1| percentile(Amount) whose (Id-1) |
| 2| 4| 2| percentile(Amount) whose (Id-1,2) |
| 2| 6| 3| percentile(Amount) whose (Id-1,3) |
| 2| 8| 4| percentile(Amount) whose (Id-2,4) |
| 2| 10| 5| percentile(Amount) whose (Id-3,5) |
| 2| 12| 6| percentile(Amount) whose (Id-4,6) |
| 3| 1| 1| percentile(Amount) whose (Id-1) |
| 3| 2| 2| percentile(Amount) whose (Id-1,2) |
| 3| 3| 3| percentile(Amount) whose (Id-1,3) |
| 3| 4| 4| percentile(Amount) whose (Id-2,4) |
| 3| 5| 5| percentile(Amount) whose (Id-3,5) |
| 3| 6| 6| percentile(Amount) whose (Id-4,6) |
+---+------+---+-----------------------------------+
This seems a little bit tricky to me as I am still learning Spark. Expecting answers from enthusiasts here.
Adding orderBy("Amount") and rowsBetween(-2,0) to the Window definition gets the required result:
orderBy sorts the rows within each group by Amount
rowsBetween takes only the current row and the two rows before into account when calculating the percentile
val w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2,0)
df.withColumn("percentile",PercentileApprox.percentile_approx(col("Amount") ,lit(.5))
.over(w))
.orderBy("M", "Amount") //not really required, just to make the output more readable
.show()
prints
+---+------+---+----------+
| M|Amount| Id|percentile|
+---+------+---+----------+
| 1| 5| 1| 5|
| 1| 10| 2| 5|
| 1| 15| 3| 10|
| 1| 20| 4| 15|
| 1| 25| 5| 20|
| 1| 30| 6| 25|
| 2| 2| 1| 2|
| 2| 4| 2| 2|
| 2| 6| 3| 4|
| 2| 8| 4| 6|
| 2| 10| 5| 8|
| 2| 12| 6| 10|
| 3| 1| 1| 1|
| 3| 2| 2| 1|
| 3| 3| 3| 2|
| 3| 4| 4| 3|
| 3| 5| 5| 4|
| 3| 6| 6| 5|
+---+------+---+----------+
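Since the rest of this thread is PySpark, here is a rough PySpark equivalent of the Scala answer above, assuming Spark 3.1+ where percentile_approx is exposed in pyspark.sql.functions (a sketch, not part of the original answer):

from pyspark.sql import Window
import pyspark.sql.functions as F

# Same frame as the Scala answer: current row plus the two rows before it, per M group
w = Window.partitionBy('M').orderBy('Amount').rowsBetween(-2, 0)

df.withColumn('percentile', F.percentile_approx('Amount', 0.5).over(w)) \
  .orderBy('M', 'Amount') \
  .show()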

PySpark: counting rows based on current row value

I have a DataFrame with a column "Speed". Can I efficiently add a column with, for each row, the number of rows in the DataFrame whose "Speed" is within +2 of that row's "Speed"?
results = spark.createDataFrame([[1],[2],[3],[4],[5],
[4],[5],[4],[5],[6],
[5],[6],[1],[3],[8],
[2],[5],[6],[10],[12]],
['Speed'])
results.show()
+-----+
|Speed|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 4|
| 5|
| 4|
| 5|
| 6|
| 5|
| 6|
| 1|
| 3|
| 8|
| 2|
| 5|
| 6|
| 10|
| 12|
+-----+
You could use a window function:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Order the window by Speed and look at the range [0; +2]
w = Window.orderBy('Speed').rangeBetween(0, 2)
# For each row, count the rows whose Speed falls within [Speed, Speed + 2]
results = results.withColumn('count+2', F.count('Speed').over(w)).orderBy('Speed')
results.show()
+-----+-------+
|Speed|count+2|
+-----+-------+
|    1|      6|
|    1|      6|
|    2|      7|
|    2|      7|
|    3|     10|
|    3|     10|
|    4|     11|
|    4|     11|
|    4|     11|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    6|      4|
|    6|      4|
|    6|      4|
|    8|      2|
|   10|      2|
|   12|      1|
+-----+-------+
Note: the window function counts the current row itself. You could correct this by subtracting 1 from the count column:
results = results.withColumn('count+2',F.count('Speed').over(w)-1).orderBy('Speed')

Pyspark - Ranking columns keeping ties

I'm looking for a way to rank columns of a dataframe preserving ties. Specifically for this example, I have a pyspark dataframe as follows where I want to generate ranks for colA & colB (though I want to support being able to rank N number of columns)
+--------+----------+-----+----+
| Entity| id| colA|colB|
+--------+----------+-----+----+
| a|8589934652| 21| 50|
| b| 112| 9| 23|
| c|8589934629| 9| 23|
| d|8589934702| 8| 21|
| e| 20| 2| 21|
| f|8589934657| 2| 5|
| g|8589934601| 1| 5|
| h|8589934653| 1| 4|
| i|8589934620| 0| 4|
| j|8589934643| 0| 3|
| k|8589934618| 0| 3|
| l|8589934602| 0| 2|
| m|8589934664| 0| 2|
| n| 25| 0| 1|
| o| 67| 0| 1|
| p|8589934642| 0| 1|
| q|8589934709| 0| 1|
| r|8589934660| 0| 1|
| s| 30| 0| 1|
| t| 55| 0| 1|
+--------+----------+-----+----+
What I'd like is a way to rank this dataframe where tied values receive the same rank such as:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+--------+----------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 21| 2| 3|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+--------+----------+-----+----+---------+---------+
My current implementation with the first dataframe looks like:
def getRanks(mydf, cols=None, ascending=False):
    from pyspark import Row
    # This takes a dataframe and a list of columns to rank
    # If no list is provided, it ranks *all* columns
    # returns a new dataframe
    def addRank(ranked_rdd, col, ascending):
        # This assumes an RDD of the form (Row(...), list[...])
        # it orders the rdd by col, finds the order, then adds that to the
        # list
        myrdd = ranked_rdd.sortBy(lambda (row, ranks): row[col],
                                  ascending=ascending).zipWithIndex()
        return myrdd.map(lambda ((row, ranks), index): (row, ranks +
                                                        [index + 1]))
    myrdd = mydf.rdd
    fields = myrdd.first().__fields__
    ranked_rdd = myrdd.map(lambda x: (x, []))
    if (cols is None):
        cols = fields
    for col in cols:
        ranked_rdd = addRank(ranked_rdd, col, ascending)
    rank_names = [x + "_rank" for x in cols]
    # Hack to make sure columns come back in the right order
    ranked_rdd = ranked_rdd.map(lambda (row, ranks): Row(*row.__fields__ +
                                                         rank_names)(*row + tuple(ranks)))
    return ranked_rdd.toDF()
which produces:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+--------+----------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 3| 3|
| d|8589934702| 8| 21| 4| 4|
| e| 20| 2| 21| 5| 5|
| f|8589934657| 2| 5| 6| 6|
| g|8589934601| 1| 5| 7| 7|
| h|8589934653| 1| 4| 8| 8|
| i|8589934620| 0| 4| 9| 9|
| j|8589934643| 0| 3| 10| 10|
| k|8589934618| 0| 3| 11| 11|
| l|8589934602| 0| 2| 12| 12|
| m|8589934664| 0| 2| 13| 13|
| n| 25| 0| 1| 14| 14|
| o| 67| 0| 1| 15| 15|
| p|8589934642| 0| 1| 16| 16|
| q|8589934709| 0| 1| 17| 17|
| r|8589934660| 0| 1| 18| 18|
| s| 30| 0| 1| 19| 19|
| t| 55| 0| 1| 20| 20|
+--------+----------+-----+----+---------+---------+
As you can see, the function getRanks() takes a dataframe, specifies the columns to be ranked, sorts them, and uses zipWithIndex() to generate an ordering or rank. However, I can't figure out a way to preserve ties.
This Stack Overflow post is the closest solution I've found:
rank-users-by-column. But it appears to only handle one column (I think).
Thanks so much for the help in advance!
EDIT: column 'id' is generated from calling monotonically_increasing_id() and in my implementation is cast to a string.
You're looking for dense_rank.
First let's create our dataframe:
df = spark.createDataFrame(sc.parallelize([["a",8589934652,21,50],["b",112,9,23],["c",8589934629,9,23],
["d",8589934702,8,21],["e",20,2,21],["f",8589934657,2,5],
["g",8589934601,1,5],["h",8589934653,1,4],["i",8589934620,0,4],
["j",8589934643,0,3],["k",8589934618,0,3],["l",8589934602,0,2],
["m",8589934664,0,2],["n",25,0,1],["o",67,0,1],["p",8589934642,0,1],
["q",8589934709,0,1],["r",8589934660,0,1],["s",30,0,1],["t",55,0,1]]
), ["Entity","id","colA","colB"])
We'll define two window specifications:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("colA"))
wB = Window.orderBy(psf.desc("colB"))
df = df.withColumn(
"colA_rank",
psf.dense_rank().over(wA)
).withColumn(
"colB_rank",
psf.dense_rank().over(wB)
)
+------+----------+----+----+---------+---------+
|Entity| id|colA|colB|colA_rank|colB_rank|
+------+----------+----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+------+----------+----+----+---------+---------+
I'll also pose an alternative:
for cols in data.columns[2:]:
lookup = (data.select(cols)
.distinct()
.orderBy(cols, ascending=False)
.rdd
.zipWithIndex()
.map(lambda x: x[0] + (x[1], ))
.toDF([cols, cols+"_rank_lookup"]))
name = cols + "_ranks"
data = data.join(lookup, [cols]).withColumn(name,col(cols+"_rank_lookup")
+ 1).drop(cols + "_rank_lookup")
Not as elegant as dense_rank() and I'm uncertain as to performance implications.
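Since the original goal was to rank an arbitrary list of columns, the dense_rank call above can also be wrapped in a small loop. This is a hedged sketch, not from either answer, and add_dense_ranks is a made-up helper name:

from pyspark.sql import Window
import pyspark.sql.functions as psf

def add_dense_ranks(df, cols, ascending=False):
    # Adds a "<col>_rank" column for each requested column; tied values share a rank.
    for c in cols:
        order = psf.asc(c) if ascending else psf.desc(c)
        df = df.withColumn(c + "_rank", psf.dense_rank().over(Window.orderBy(order)))
    return df

df = add_dense_ranks(df, ["colA", "colB"])

Note that each un-partitioned Window.orderBy pulls the data into a single partition, so this is fine for ranking a handful of columns but may be slow on very large data.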

Filtering rows based on subsequent row values in spark dataframe [duplicate]

I have a dataframe (Spark):
id value
3 0
3 1
3 0
4 1
4 0
4 0
I want to create a new dataframe:
3 0
3 1
4 1
I need to remove all the rows after the first 1 (in value) for each id. I tried window functions on a Spark dataframe (Scala) but couldn't find a solution; it seems I am going in the wrong direction.
I am looking for a solution in Scala. Thanks.
Output using monotonically_increasing_id
scala> val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data: org.apache.spark.sql.DataFrame = [id: int, value: int]
scala> val minIdx = dataWithIndex.filter($"value" === 1).groupBy($"id").agg(min($"idx")).toDF("r_id", "min_idx")
minIdx: org.apache.spark.sql.DataFrame = [r_id: int, min_idx: bigint]
scala> dataWithIndex.join(minIdx,($"r_id" === $"id") && ($"idx" <= $"min_idx")).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
The solution won't work if we applied a sort transformation to the original dataframe, because monotonically_increasing_id() is then generated based on the original DF rather than the sorted DF. I had missed that requirement before.
All suggestions are welcome.
One way is to use monotonically_increasing_id() and a self-join:
val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data.show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 3| 0|
| 4| 1|
| 4| 0|
| 4| 0|
+---+-----+
Now we generate a column named idx with an increasing Long:
val dataWithIndex = data.withColumn("idx", monotonically_increasing_id())
// dataWithIndex.cache()
Now we get the min(idx) for each id where value = 1:
val minIdx = dataWithIndex
.filter($"value" === 1)
.groupBy($"id")
.agg(min($"idx"))
.toDF("r_id", "min_idx")
Now we join the min(idx) back to the original DataFrame:
dataWithIndex.join(
minIdx,
($"r_id" === $"id") && ($"idx" <= $"min_idx")
).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
Note: monotonically_increasing_id() generates its value based on the partition of the row. This value may change each time dataWithIndex is re-evaluated. In my code above, because of lazy evaluation, it's only when I call the final show that monotonically_increasing_id() is evaluated.
If you want to force the value to stay the same, for example so you can use show to evaluate the above step-by-step, uncomment this line above:
// dataWithIndex.cache()
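For readers following along in PySpark rather than Scala, here is a rough translation of the approach above (a sketch, not part of the original answer):

from pyspark.sql import functions as F

data = spark.createDataFrame([(3, 0), (3, 1), (3, 0), (4, 1), (4, 0), (4, 0)],
                             ["id", "value"])

# Tag each row with an increasing id so the first value == 1 per id can be located
data_idx = data.withColumn("idx", F.monotonically_increasing_id())

min_idx = (data_idx.filter(F.col("value") == 1)
                   .groupBy("id")
                   .agg(F.min("idx").alias("min_idx"))
                   .withColumnRenamed("id", "r_id"))

# Keep only the rows up to (and including) that first 1 for each id
(data_idx.join(min_idx, (data_idx.id == min_idx.r_id) & (data_idx.idx <= min_idx.min_idx))
         .select("id", "value")
         .show())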
Hi, I found a solution using Window and a self-join.
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
data.show
scala> data.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 3| 0| 1|
| 4| 1| 6|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
val sort_df=data.sort($"sorted")
scala> sort_df.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 1|
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 4|
| 4| 0| 5|
| 4| 1| 6|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
var window = Window.partitionBy("id").orderBy("sorted")
val sort_idx = sort_df.select($"*", row_number().over(window).as("count_index"))
val minIdx = sort_idx.filter($"value" === 1).groupBy("id").agg(min("count_index")).toDF("idx", "min_idx")
val result_id = sort_idx.join(minIdx, ($"id" === $"idx") && ($"count_index" <= $"min_idx"))
result_id.show
+---+-----+------+-----------+---+-------+
| id|value|sorted|count_index|idx|min_idx|
+---+-----+------+-----------+---+-------+
| 1| 0| 7| 1| 1| 2|
| 1| 1| 8| 2| 1| 2|
| 2| 1| 10| 1| 2| 1|
| 3| 0| 1| 1| 3| 3|
| 3| 0| 2| 2| 3| 3|
| 3| 1| 3| 3| 3| 3|
| 4| 0| 4| 1| 4| 3|
| 4| 0| 5| 2| 4| 3|
| 4| 1| 6| 3| 4| 3|
+---+-----+------+-----------+---+-------+
Still looking for a more optimized solution. Thanks.
You can simply use groupBy like this
val df2 = df1.groupBy("id","value").count().select("id","value")
Here your df1 is
id value
3 0
3 1
3 0
4 1
4 0
4 0
And resultant dataframe is df2 which is your expected output like this
id value
3 0
3 1
4 1
4 0
Use the isin method and filter as below:
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
val idFilter = List(1, 2)
data.filter($"id".isin(idFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
Ex: filter based on val
val valFilter = List(0)
data.filter($"value".isin(valFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 0| 1|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 0| 9|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+