PySpark - Get the first and last column occurrence of a value in a Spark dataframe

I have a dataframe as below; for each row I need the column position of the first and last occurrence of the value 0, and of non-zero values:
Id Col1 Col2 Col3 Col4
1 1 0 0 2
2 0 0 0 0
3 4 2 2 4
4 2 5 9 0
5 0 4 0 0
Expected Result:
Id Col1 Col2 Col3 Col4 First_0 Last_0 First_non_zero Last_non_zero
1 1 0 0 2 2 3 1 4
2 0 0 0 0 1 4 0 0
3 4 2 2 4 0 0 1 4
4 2 5 9 0 4 4 1 3
5 0 4 0 0 1 4 2 2

Here is one way to use pyspark's F.array(), F.greatest() and F.least():
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1,1,0,0,2), (2,0,0,0,0), (3,4,2,2,4), (4,2,5,9,0), (5,0,4,0,0)],
    ['Id','Col1','Col2','Col3','Col4']
)
df.show()
#+---+----+----+----+----+
#| Id|Col1|Col2|Col3|Col4|
#+---+----+----+----+----+
#| 1| 1| 0| 0| 2|
#| 2| 0| 0| 0| 0|
#| 3| 4| 2| 2| 4|
#| 4| 2| 5| 9| 0|
#| 5| 0| 4| 0| 0|
#+---+----+----+----+----+
# column names involved in the calculation
cols = df.columns[1:]
# create an array column `arr_0` holding the 1-based position i+1 wherever F.col(cols[i]) == 0 (null otherwise)
# then take the least and greatest values to identify first_0 and last_0
# fill with 0 when none of the columns is 0
df.withColumn('arr_0', F.array([F.when(F.col(cols[i])==0, i+1) for i in range(len(cols))])) \
  .withColumn('first_0', F.least(*[F.col('arr_0')[i] for i in range(len(cols))])) \
  .withColumn('last_0', F.greatest(*[F.col('arr_0')[i] for i in range(len(cols))])) \
  .fillna(0, subset=['first_0', 'last_0']) \
  .show()
#+---+----+----+----+----+------------+-------+------+
#| Id|Col1|Col2|Col3|Col4| arr_0|first_0|last_0|
#+---+----+----+----+----+------------+-------+------+
#| 1| 1| 0| 0| 2| [, 2, 3,]| 2| 3|
#| 2| 0| 0| 0| 0|[1, 2, 3, 4]| 1| 4|
#| 3| 4| 2| 2| 4| [,,,]| 0| 0|
#| 4| 2| 5| 9| 0| [,,, 4]| 4| 4|
#| 5| 0| 4| 0| 0| [1,, 3, 4]| 1| 4|
#+---+----+----+----+----+------------+-------+------+
If you are using pyspark 2.4, you can also try F.array_min() and F.array_max():
df.withColumn('arr_0', F.array([F.when(F.col(cols[i])==0, i+1) for i in range(len(cols))])) \
  .select('*', F.array_min('arr_0').alias('first_0'), F.array_max('arr_0').alias('last_0')) \
  .fillna(0, subset=['first_0', 'last_0']) \
  .show()
#+---+----+----+----+----+------------+-------+------+
#| Id|Col1|Col2|Col3|Col4| arr_0|first_0|last_0|
#+---+----+----+----+----+------------+-------+------+
#| 1| 1| 0| 0| 2| [, 2, 3,]| 2| 3|
#| 2| 0| 0| 0| 0|[1, 2, 3, 4]| 1| 4|
#| 3| 4| 2| 2| 4| [,,,]| 0| 0|
#| 4| 2| 5| 9| 0| [,,, 4]| 4| 4|
#| 5| 0| 4| 0| 0| [1,, 3, 4]| 1| 4|
#+---+----+----+----+----+------------+-------+------+
You can do the same for first_non_zero and last_non_zero; a sketch follows below.
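For completeness, here is a minimal sketch of the non-zero case, following the exact same pattern (arr_nz is just an illustrative column name; the fillna(0) mirrors the all-zero row in the expected output above):
df.withColumn('arr_nz', F.array([F.when(F.col(cols[i])!=0, i+1) for i in range(len(cols))])) \
  .withColumn('first_non_zero', F.least(*[F.col('arr_nz')[i] for i in range(len(cols))])) \
  .withColumn('last_non_zero', F.greatest(*[F.col('arr_nz')[i] for i in range(len(cols))])) \
  .fillna(0, subset=['first_non_zero', 'last_non_zero']) \
  .show()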

Related

How to build a rank based on threshold in Spark?

Suppose I have a dataframe:
val df = Seq(
(1,"A"),
(1,"B"),
(1,"C"),
(1,"D"),
(1,"E"),
(1,"F"),
(1,"G"),
(1,"H"),
(2,"I"),
(2,"J"),
(2,"J"),
(2,"J"),
(3,"K"),
).toDF("id", "code")
I need to rank it based on ids and with respect to some threshold. Example:
threshold = 3
id code rank
1 A 1
1 B 1
1 C 1 -- threshold has been reached
1 D 2
1 E 2
1 F 2 -- threshold has been reached
1 G 3
1 H 3
2 I 1
2 J 1
2 J 1 -- threshold has been reached
2 J 2
3 K 1
How can I do it?
I can create a simple rank:
df.withColumn("rank", dense_rank().over(Window.orderBy("id")))
But how to split ranked groups by threshold?
A solution that does not require moving all the data into one partition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// get the largest group size, i.e. the largest number of rows sharing the same id
val maxGroupSize = df.groupBy("id").count().agg(max("count")).first().getLong(0)
val threshold = 3
// round maxGroupSize up to the next multiple of threshold
var f = maxGroupSize
while (f % threshold > 0) f = f + 1
df.withColumn("tmp1", 'id * f)
  .withColumn("tmp2", dense_rank().over(Window.partitionBy("id").orderBy("code")) - 1)
  .withColumn("tmp3", 'tmp1 + 'tmp2)
  .withColumn("rank", ('tmp3 / threshold).cast("int"))
Result:
+---+----+----+----+----+----+
| id|code|tmp1|tmp2|tmp3|rank|
+---+----+----+----+----+----+
| 1| A| 9| 0| 9| 3|
| 1| B| 9| 1| 10| 3|
| 1| C| 9| 2| 11| 3|
| 1| D| 9| 3| 12| 4|
| 1| E| 9| 4| 13| 4|
| 1| F| 9| 5| 14| 4|
| 1| G| 9| 6| 15| 5|
| 1| H| 9| 7| 16| 5|
| 2| I| 18| 0| 18| 6|
| 2| J| 18| 1| 19| 6|
| 3| K| 27| 0| 27| 9|
+---+----+----+----+----+----+
The downside of this approach is that the ranks are not consecutive.
It would be possible to fix this with another window
df.withColumn("rank2", dense_rank().over(Window.orderBy("rank")))
but this would again move all data to a single executor.

Window function based on a condition

I have the following DF:
|-----------------------|
|Date | Val | Cond|
|-----------------------|
|2022-01-08 | 2 | 0 |
|2022-01-09 | 4 | 1 |
|2022-01-10 | 6 | 1 |
|2022-01-11 | 8 | 0 |
|2022-01-12 | 2 | 1 |
|2022-01-13 | 5 | 1 |
|2022-01-14 | 7 | 0 |
|2022-01-15 | 9 | 0 |
|-----------------------|
For every date, I need to sum the Val of the two previous dates where Cond = 1; my expected output is:
|-----------------|
|Date | Sum |
|-----------------|
|2022-01-08 | 0 | No sum because there aren't two earlier dates with cond = 1
|2022-01-09 | 0 | No sum because there aren't two earlier dates with cond = 1
|2022-01-10 | 0 | No sum because there aren't two earlier dates with cond = 1
|2022-01-11 | 10 | (4+6)
|2022-01-12 | 10 | (4+6)
|2022-01-13 | 8 | (2+6)
|2022-01-14 | 7 | (5+2)
|2022-01-15 | 7 | (5+2)
|-----------------|
I've tried to get the output DF using this code:
df = df.where("Cond= 1").withColumn(
"ListView",
f.collect_list("Val").over(windowSpec.rowsBetween(-2, -1))
)
But when I use .where("Cond = 1") I exclude the dates where Cond equals zero.
I found the following answer but didn't help me:
Window.rowsBetween - only consider rows fulfilling a specific condition (e.g. not being null)
How can I achieve my expected output using window functions?
The MVCE:
data_1 = [
    ("2022-01-08",2,0),
    ("2022-01-09",4,1),
    ("2022-01-10",6,1),
    ("2022-01-11",8,0),
    ("2022-01-12",2,1),
    ("2022-01-13",5,1),
    ("2022-01-14",7,0),
    ("2022-01-15",9,0)
]
schema_1 = StructType([
    StructField("Date", DateType(), True),
    StructField("Val", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
The following should do the trick (but I'm sure it can be further optimized).
Setup:
from pyspark.sql.functions import to_date, col, sum, when, max
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.window import Window
data_1 = [
    ("2022-01-08",2,0),
    ("2022-01-09",4,1),
    ("2022-01-10",6,1),
    ("2022-01-11",8,0),
    ("2022-01-12",2,1),
    ("2022-01-13",5,1),
    ("2022-01-14",7,0),
    ("2022-01-15",9,0),
    ("2022-01-16",9,0),
    ("2022-01-17",9,0)
]
schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Val", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
df_1 = df_1.withColumn('Date', to_date("Date", "yyyy-MM-dd"))
+----------+---+----+
| Date|Val|Cond|
+----------+---+----+
|2022-01-08| 2| 0|
|2022-01-09| 4| 1|
|2022-01-10| 6| 1|
|2022-01-11| 8| 0|
|2022-01-12| 2| 1|
|2022-01-13| 5| 1|
|2022-01-14| 7| 0|
|2022-01-15| 9| 0|
|2022-01-16| 9| 0|
|2022-01-17| 9| 0|
+----------+---+----+
Create a new DF only with Cond==1 rows to obtain the sum of two consecutive rows with that condition:
windowSpec = Window.partitionBy("Cond").orderBy("Date")
df_2 = df_1.where(df_1.Cond==1).withColumn(
    "Sum",
    sum("Val").over(windowSpec.rowsBetween(-1, 0))
).withColumn('date_1', col('date')).drop('date')
+---+----+---+----------+
|Val|Cond|Sum| date_1|
+---+----+---+----------+
| 4| 1| 4|2022-01-09|
| 6| 1| 10|2022-01-10|
| 2| 1| 8|2022-01-12|
| 5| 1| 7|2022-01-13|
+---+----+---+----------+
Do a left join to get the sum into the original data frame, and set the sum to zero for the rows with Cond==0:
df_3 = df_1.join(df_2.select('sum', col('date_1')), df_1.Date == df_2.date_1, "left").drop('date_1').fillna(0)
+----------+---+----+---+
| Date|Val|Cond|sum|
+----------+---+----+---+
|2022-01-08| 2| 0| 0|
|2022-01-09| 4| 1| 4|
|2022-01-10| 6| 1| 10|
|2022-01-11| 8| 0| 0|
|2022-01-12| 2| 1| 8|
|2022-01-13| 5| 1| 7|
|2022-01-14| 7| 0| 0|
|2022-01-15| 9| 0| 0|
|2022-01-16| 9| 0| 0|
|2022-01-17| 9| 0| 0|
+----------+---+----+---+
Do a cumulative sum on the condition column:
df_3=df_3.withColumn('cond_sum', sum('cond').over(Window.orderBy('Date')))
+----------+---+----+---+--------+
| Date|Val|Cond|sum|cond_sum|
+----------+---+----+---+--------+
|2022-01-08| 2| 0| 0| 0|
|2022-01-09| 4| 1| 4| 1|
|2022-01-10| 6| 1| 10| 2|
|2022-01-11| 8| 0| 0| 2|
|2022-01-12| 2| 1| 8| 3|
|2022-01-13| 5| 1| 7| 4|
|2022-01-14| 7| 0| 0| 4|
|2022-01-15| 9| 0| 0| 4|
|2022-01-16| 9| 0| 0| 4|
|2022-01-17| 9| 0| 0| 4|
+----------+---+----+---+--------+
Finally, for each partition where the cond_sum is greater than 1, use the max sum for that partition:
df_3.withColumn('sum', when(df_3.cond_sum > 1, max('sum').over(Window.partitionBy('cond_sum'))).otherwise(0)).show()
+----------+---+----+---+--------+
| Date|Val|Cond|sum|cond_sum|
+----------+---+----+---+--------+
|2022-01-08| 2| 0| 0| 0|
|2022-01-09| 4| 1| 0| 1|
|2022-01-10| 6| 1| 10| 2|
|2022-01-11| 8| 0| 10| 2|
|2022-01-12| 2| 1| 8| 3|
|2022-01-13| 5| 1| 7| 4|
|2022-01-14| 7| 0| 7| 4|
|2022-01-15| 9| 0| 7| 4|
|2022-01-16| 9| 0| 7| 4|
|2022-01-17| 9| 0| 7| 4|
+----------+---+----+---+--------+

How to remove the first set of zero-valued columns (or rows) in spark and scala

Hello, I am new to Spark and I have two dataframes such that:
+--------------+-------+-------+-------+-------+-------+-------+-------+
| Region| 3/7/20| 3/8/20| 3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
+--------------+-------+-------+-------+-------+-------+-------+-------+
| Paris| 0| 0| 0| 1| 7| 0| 5|
+--------------+-------+-------+-------+-------+-------+-------+-------+
+----------+-------+
| Period|Reports|
+----------+-------+
|2020/07/20| 0|
|2020/07/21| 0|
|2020/07/22| 0|
|2020/07/23| 8|
|2020/07/24| 0|
|2020/07/25| 1|
+----------+-------+
How can I drop the first consecutive 0-valued columns 3/7/20, 3/8/20 and 3/9/20, without deleting the column 3/12/20?
Similarly, for the second dataframe, how can I remove the leading rows 2020/07/20, 0 and 2020/07/21, 0 and 2020/07/22, 0 without deleting the later row 2020/07/24, 0?
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df=Seq(("0","0","0","1","7","0","5")).toDF("3/7/20","3/8/20","3/9/20","3/10/20","3/11/20","3/12/20","3/13/20")
var columnsAndValues = df.columns.flatMap { c => Array(lit(c), col(c)) }
df.printSchema()
val df1 = df.withColumn("myMap", map(columnsAndValues:_*))
  .select(explode($"myMap"))
  .toDF("Region","Paris")
val windowSpec = Window.partitionBy(lit("A")).orderBy(lit("A"))
df1.withColumn("row_number", row_number().over(windowSpec))
  .withColumn("lag", lag("Paris", 1, 0).over(windowSpec))
  .withColumn("lead", lead("Paris", 1, 0).over(windowSpec))
  .where(($"lag" > 0) or ($"Paris" > 0)).show()
/*
+-------+-----+----------+---+----+
| Region|Paris|row_number|lag|lead|
+-------+-----+----------+---+----+
|3/10/20| 1| 4| 0| 7|
|3/11/20| 7| 5| 1| 0|
|3/12/20| 0| 6| 7| 5|
|3/13/20| 5| 7| 0| 0|
+-------+-----+----------+---+----+
*/
val df2=Seq(("2020/07/20","0"),("2020/07/21","0"),("2020/07/22","0"),("2020/07/23","8"),("2020/07/24","0"),("2020/07/25","1")).toDF("Period","Reports")
df2.withColumn("row_number",row_number.over(windowSpec))
.withColumn("lag", lag("Reports", 1, 0).over(windowSpec))
.withColumn("lead", lead("Reports", 1, 0).over(windowSpec))
.where((($"lag">0) or ($"Reports"> 0)) and ($"row_number">1)).show()
/*
+----------+-------+----------+---+----+
| Period|Reports|row_number|lag|lead|
+----------+-------+----------+---+----+
|2020/07/23| 8| 4| 0| 0|
|2020/07/24| 0| 5| 8| 1|
|2020/07/25| 1| 6| 0| 0|
+----------+-------+----------+---+----+
*/

pyspark: Auto filling in implicit missing values

I have a dataframe
user day amount
a 2 10
a 1 14
a 4 5
b 1 4
As you can see, the maximum value of day is 4 and the minimum is 1. I want to fill amount with 0 for all missing days of all users, so the above dataframe becomes:
user day amount
a 2 10
a 1 14
a 4 5
a 3 0
b 1 4
b 2 0
b 3 0
b 4 0
How could I do that in PySpark? Many thanks.
Here is one approach. You can get the min and max day values first, then group on the user column and pivot, then add the missing day columns and fill all nulls with 0, and finally stack them back:
from pyspark.sql import functions as F
min_max = df.agg(F.min("day"), F.max("day")).collect()[0]
df1 = df.groupBy("user").pivot("day").agg(F.first("amount").alias("amount")).na.fill(0)
missing_cols = [F.lit(0).alias(str(i)) for i in range(min_max[0], min_max[1]+1)
                if str(i) not in df1.columns]
df1 = df1.select("*", *missing_cols)
#+----+---+---+---+---+
#|user| 1| 2| 4| 3|
#+----+---+---+---+---+
#| b| 4| 0| 0| 0|
#| a| 14| 10| 5| 0|
#+----+---+---+---+---+
# the next step is inspired by https://stackoverflow.com/a/37865645/9840637
arr = F.explode(F.array([F.struct(F.lit(c).alias("day"), F.col(c).alias("amount"))
                         for c in df1.columns[1:]])).alias("kvs")
(df1.select(["user"] + [arr])
    .select(["user"] + ["kvs.day", "kvs.amount"]).orderBy("user")).show()
+----+---+------+
|user|day|amount|
+----+---+------+
| a| 1| 14|
| a| 2| 10|
| a| 4| 5|
| a| 3| 0|
| b| 1| 4|
| b| 2| 0|
| b| 4| 0|
| b| 3| 0|
+----+---+------+
Note: since the day column was pivoted, its dtype may have changed (the stacked day values come from column names, so they are strings), and you may have to cast it back to the original dtype; see the sketch below.
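For example, a minimal sketch of that cast, assuming day was originally an integer (stacked is just an illustrative name):
stacked = (df1.select(["user"] + [arr])
              .select("user", "kvs.day", "kvs.amount"))
stacked = stacked.withColumn("day", F.col("day").cast("int"))  # day held column names, hence strings
stacked.orderBy("user", "day").show()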
Another way to do this is to use sequence, array functions and explode (Spark 2.4+):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy(F.lit(0))
(df.withColumn("boundaries", F.sequence(F.min("day").over(w), F.max("day").over(w), F.lit(1)))
   .groupBy("user").agg(F.collect_list("day").alias('day'),
                        F.collect_list("amount").alias('amount'),
                        F.first("boundaries").alias("boundaries"))
   .withColumn("boundaries", F.array_except("boundaries", "day"))
   .withColumn("day", F.flatten(F.array("day", "boundaries")))
   .drop("boundaries")
   .withColumn("zip", F.explode(F.arrays_zip("day", "amount")))
   .select("user", "zip.day",
           F.when(F.col("zip.amount").isNull(), F.lit(0)).otherwise(F.col("zip.amount")).alias("amount"))
   .show())
#+----+---+------+
#|user|day|amount|
#+----+---+------+
#| a| 2| 10|
#| a| 1| 14|
#| a| 4| 5|
#| a| 3| 0|
#| b| 1| 4|
#| b| 2| 0|
#| b| 3| 0|
#| b| 4| 0|
#+----+---+------+

Filtering rows based on subsequent row values in spark dataframe [duplicate]

I have a dataframe(spark):
id value
3 0
3 1
3 0
4 1
4 0
4 0
I want to create a new dataframe:
3 0
3 1
4 1
I need to remove all the rows after the first 1 (in value) for each id. I tried window functions in a Spark dataframe (Scala) but couldn't find a solution; it seems I am going in the wrong direction.
I am looking for a solution in Scala. Thanks.
Output using monotonically_increasing_id
scala> val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data: org.apache.spark.sql.DataFrame = [id: int, value: int]
scala> val minIdx = dataWithIndex.filter($"value" === 1).groupBy($"id").agg(min($"idx")).toDF("r_id", "min_idx")
minIdx: org.apache.spark.sql.DataFrame = [r_id: int, min_idx: bigint]
scala> dataWithIndex.join(minIdx,($"r_id" === $"id") && ($"idx" <= $"min_idx")).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
The solution won't work if we apply a sort transformation to the original dataframe; in that case monotonically_increasing_id() is generated based on the original DF rather than the sorted DF. I had missed that requirement before.
All suggestions are welcome.
One way is to use monotonically_increasing_id() and a self-join:
val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data.show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 3| 0|
| 4| 1|
| 4| 0|
| 4| 0|
+---+-----+
Now we generate a column named idx with an increasing Long:
val dataWithIndex = data.withColumn("idx", monotonically_increasing_id())
// dataWithIndex.cache()
Now we get the min(idx) for each id where value = 1:
val minIdx = dataWithIndex
.filter($"value" === 1)
.groupBy($"id")
.agg(min($"idx"))
.toDF("r_id", "min_idx")
Now we join the min(idx) back to the original DataFrame:
dataWithIndex.join(
minIdx,
($"r_id" === $"id") && ($"idx" <= $"min_idx")
).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
Note: monotonically_increasing_id() generates its value based on the partition of the row. This value may change each time dataWithIndex is re-evaluated. In my code above, because of lazy evaluation, it's only when I call the final show that monotonically_increasing_id() is evaluated.
If you want to force the value to stay the same, for example so you can use show to evaluate the above step-by-step, uncomment this line above:
// dataWithIndex.cache()
Hi, I found a solution using Window and a self-join.
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
scala> data.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 3| 0| 1|
| 4| 1| 6|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
val sort_df=data.sort($"sorted")
scala> sort_df.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 1|
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 4|
| 4| 0| 5|
| 4| 1| 6|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
val window = Window.partitionBy("id").orderBy("sorted")
val sort_idx = sort_df.select($"*", row_number().over(window).as("count_index"))
val minIdx = sort_idx.filter($"value" === 1).groupBy("id").agg(min("count_index")).toDF("idx", "min_idx")
val result_id = sort_idx.join(minIdx, ($"id" === $"idx") && ($"count_index" <= $"min_idx"))
result_id.show
+---+-----+------+-----------+---+-------+
| id|value|sorted|count_index|idx|min_idx|
+---+-----+------+-----------+---+-------+
| 1| 0| 7| 1| 1| 2|
| 1| 1| 8| 2| 1| 2|
| 2| 1| 10| 1| 2| 1|
| 3| 0| 1| 1| 3| 3|
| 3| 0| 2| 2| 3| 3|
| 3| 1| 3| 3| 3| 3|
| 4| 0| 4| 1| 4| 3|
| 4| 0| 5| 2| 4| 3|
| 4| 1| 6| 3| 4| 3|
+---+-----+------+-----------+---+-------+
Still looking for a more optimized solution. Thanks.
You can simply use groupBy like this
val df2 = df1.groupBy("id","value").count().select("id","value")
Here your df1 is
id value
3 0
3 1
3 0
4 1
4 0
4 0
And the resulting dataframe df2, which is your expected output, is like this:
id value
3 0
3 1
4 1
4 0
Use the isin method and filter as below:
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
val idFilter = List(1, 2)
data.filter($"id".isin(idFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
Example: filter based on value
val valFilter = List(0)
data.filter($"value".isin(valFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 0| 1|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 0| 9|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+