How to flag last rows from window using Pyspark

My goal is to create a new column is_end: when a row is among the last rows of the window and its p_uuid is null, then is_end = 1, otherwise 0. I don't know how to combine the when() and last() functions.
I tried several times to combine them with windows, but I always get errors.
df = spark.createDataFrame([
(1, 110, None, '2019-09-28'),
(2, 110, None, '2019-09-28'),
(3, 110, 'aaa', '2019-09-28'),
(4, 110, None, '2019-09-17'),
(5, 110, None, '2019-09-17'),
(6, 110, 'bbb', '2019-09-17'),
(7, 110, None, '2019-09-01'),
(8, 110, None, '2019-09-01'),
(9, 110, None, '2019-09-01'),
(10, 110, None, '2019-09-01'),
(11, 110, 'ccc', '2019-09-01'),
(12, 110, None, '2019-09-01'),
(13, 110, None, '2019-09-01'),
(14, 110, None, '2019-09-01')
],
['idx', 'u_uuid', 'p_uuid', 'timestamp']
)
df.show()
My dataframe:
+---+------+------+----------+
|idx|u_uuid|p_uuid| timestamp|
+---+------+------+----------+
| 1| 110| null|2019-09-28|
| 2| 110| null|2019-09-28|
| 3| 110| aaa|2019-09-28|
| 4| 110| null|2019-09-17|
| 5| 110| null|2019-09-17|
| 6| 110| bbb|2019-09-17|
| 7| 110| null|2019-09-01|
| 8| 110| null|2019-09-01|
| 9| 110| null|2019-09-01|
| 10| 110| null|2019-09-01|
| 11| 110| ccc|2019-09-01|
| 12| 110| null|2019-09-01|
| 13| 110| null|2019-09-01|
| 14| 110| null|2019-09-01|
+---+------+------+----------+
w = Window.partitionBy("u_uuid").orderBy(col("timestamp"))
df.withColumn("p_uuid", when( lag(F.col("p_uuid").isNull()).over(w), 1).otherwise(0))
What I'm looking for:
+---+------+------+----------+------+
|idx|u_uuid|p_uuid| timestamp|is_end|
+---+------+------+----------+------+
| 1| 110| null|2019-09-28| 0|
| 2| 110| null|2019-09-28| 0|
| 3| 110| aaa|2019-09-28| 0|
| 4| 110| null|2019-09-17| 0|
| 5| 110| null|2019-09-17| 0|
| 6| 110| bbb|2019-09-17| 0|
| 7| 110| null|2019-09-01| 0|
| 8| 110| null|2019-09-01| 0|
| 9| 110| null|2019-09-01| 0|
| 10| 110| null|2019-09-01| 0|
| 11| 110| ccc|2019-09-01| 0|
| 12| 110| null|2019-08-29| 1|
| 13| 110| null|2019-08-29| 1|
| 14| 110| null|2019-08-29| 1|
+---+------+------+----------+------+

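Below is PySpark code for your case. The window is ordered by idx because timestamp contains ties, and the frame runs from the current row to the end of the partition, so last(..., ignorenulls=True) is null only when no later row has a p_uuid:
from pyspark.sql import Window, functions as F

w = (Window
.partitionBy("u_uuid")
.orderBy("idx")
.rowsBetween(Window.currentRow, Window.unboundedFollowing))
# is_end = 1 when p_uuid is null on this row and on every later row of the window
df.withColumn("is_end", F.when(F.last("p_uuid", True).over(w).isNull() & F.col("p_uuid").isNull(), F.lit(1)).otherwise(F.lit(0)))\
.show()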
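To check the expected flags without a Spark session, the rule from the question can be modelled in plain Python. This is only a sketch: the helper name flag_trailing_nulls is made up for illustration, and it assumes the rows are already in window order (here, by idx).

```python
# Plain-Python model of the is_end flag (no Spark needed).
# Assumption: the list is already sorted in window order (by idx).
def flag_trailing_nulls(values):
    """Return 1 for each trailing None (no non-None value after it), else 0."""
    flags = [0] * len(values)
    # Walk backwards and stop at the first non-null value seen from the end.
    for i in range(len(values) - 1, -1, -1):
        if values[i] is not None:
            break
        flags[i] = 1
    return flags


# p_uuid column from the question, idx 1..14
p_uuid = [None, None, "aaa", None, None, "bbb",
          None, None, None, None, "ccc", None, None, None]
is_end = flag_trailing_nulls(p_uuid)
print(is_end)
```

Only the three rows after 'ccc' are flagged, which is the pattern shown in the desired output.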

Related

Repeated values in pyspark

I have a dataframe in pyspark with three columns:
df1 = spark.createDataFrame([
('a', 3, 4.2),
('a', 7, 4.2),
('b', 7, 2.6),
('c', 7, 7.21),
('c', 11, 7.21),
('c', 18, 7.21),
('d', 15, 9.0),
], ['model', 'number', 'price'])
df1.show()
+-----+------+-----+
|model|number|price|
+-----+------+-----+
| a| 3| 4.2|
| a| 7| 4.2|
| b| 7| 2.6|
| c| 7| 7.21|
| c| 11| 7.21|
| c| 18| 7.21|
| d| 15| 9.0|
+-----+------+-----+
Is there a way in pyspark to display only the rows whose value in the column 'price' is repeated?
Like in df2:
df2 = spark.createDataFrame([
('a', 3, 4.2),
('a', 7, 4.2),
('c', 7, 7.21),
('c', 11, 7.21),
('c', 18, 7.21),
], ['model', 'number', 'price'])
df2.show()
+-----+------+-----+
|model|number|price|
+-----+------+-----+
| a| 3| 4.2|
| a| 7| 4.2|
| c| 7| 7.21|
| c| 11| 7.21|
| c| 18| 7.21|
+-----+------+-----+
I tried this, but it didn't work:
df = df1.groupBy("model","price").count().filter("count > 1")
df2 = df1.where((df.model == df1.model) & (df.price == df1.price))
df2.show()
It also included the values that are not repeated:
+-----+------+-----+
|model|number|price|
+-----+------+-----+
| a| 3| 4.2|
| a| 7| 4.2|
| b| 7| 2.6|
| c| 7| 7.21|
| c| 11| 7.21|
| c| 18| 7.21|
| d| 15| 9.0|
+-----+------+-----+

You can do so with a window function. We partition by price, take a count and filter count > 1.
from pyspark.sql import Window
from pyspark.sql import functions as f
w = Window().partitionBy('price')
df1.withColumn('_c', f.count('price').over(w)).filter('_c > 1').drop('_c').show()
+-----+------+-----+
|model|number|price|
+-----+------+-----+
| a| 3| 4.2|
| a| 7| 4.2|
| c| 7| 7.21|
| c| 11| 7.21|
| c| 18| 7.21|
+-----+------+-----+
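The same count-then-filter idea can be sketched in plain Python to make the logic explicit. This is an illustration only, not Spark code: Counter plays the role of the count over the price partition.

```python
from collections import Counter

# Same rows as df1 in the question: (model, number, price)
rows = [
    ("a", 3, 4.2), ("a", 7, 4.2), ("b", 7, 2.6),
    ("c", 7, 7.21), ("c", 11, 7.21), ("c", 18, 7.21),
    ("d", 15, 9.0),
]

# Count how often each price occurs (the "partition by price, count" step).
price_counts = Counter(price for _, _, price in rows)

# Keep only rows whose price occurs more than once (the "filter count > 1" step).
repeated = [row for row in rows if price_counts[row[2]] > 1]
print(repeated)
```

Every row keeps its own columns; only the per-price count decides whether the row survives the filter.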

How to get value from previous group in spark?

I need to get the value of the previous group in Spark and set it on the current group.
How can I achieve that?
I must order by count instead of TEXT_NUM.
Ordering by TEXT_NUM is not possible because events repeat in time, as counts 10 and 11 show.
I'm trying with the following code:
val spark = SparkSession.builder()
.master("spark://spark-master:7077")
.getOrCreate()
val df = spark
.createDataFrame(
Seq[(Int, String, Int)](
(0, "", 0),
(1, "", 0),
(2, "A", 1),
(3, "A", 1),
(4, "A", 1),
(5, "B", 2),
(6, "B", 2),
(7, "B", 2),
(8, "C", 3),
(9, "C", 3),
(10, "A", 1),
(11, "A", 1)
))
.toDF("count", "TEXT", "TEXT_NUM")
val w1 = Window
.orderBy("count")
.rangeBetween(Window.unboundedPreceding, -1)
df
.withColumn("LAST_VALUE", last("TEXT_NUM").over(w1))
.orderBy("count")
.show()
Result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| 0|
| 2| A| 1| 0|
| 3| A| 1| 1|
| 4| A| 1| 1|
| 5| B| 2| 1|
| 6| B| 2| 2|
| 7| B| 2| 2|
| 8| C| 3| 2|
| 9| C| 3| 3|
| 10| A| 1| 3|
| 11| A| 1| 1|
+-----+----+--------+----------+
Desired result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| null|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| A| 1| 0|
| 5| B| 2| 1|
| 6| B| 2| 1|
| 7| B| 2| 1|
| 8| C| 3| 2|
| 9| C| 3| 2|
| 10| A| 1| 3|
| 11| A| 1| 3|
+-----+----+--------+----------+

Consider using the window function last(columnName, ignoreNulls) to forward-fill the nulls in a column that holds the previous "text_num" at group boundaries, as shown below:
val df = Seq(
(0, "", 0), (1, "", 0),
(2, "A", 1), (3, "A", 1), (4, "A", 1),
(5, "B", 2), (6, "B", 2), (7, "B", 2),
(8, "C", 3), (9, "C", 3),
(10, "A", 1), (11, "A", 1)
).toDF("count", "text", "text_num")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w1 = Window.orderBy("count")
val w2 = w1.rowsBetween(Window.unboundedPreceding, 0)
df.
withColumn("prev_num", lag("text_num", 1).over(w1)).
withColumn("last_change", when($"text_num" =!= $"prev_num", $"prev_num")).
withColumn("last_value", last("last_change", ignoreNulls=true).over(w2)).
show
/*
+-----+----+--------+--------+-----------+----------+
|count|text|text_num|prev_num|last_change|last_value|
+-----+----+--------+--------+-----------+----------+
| 0| | 0| null| null| null|
| 1| | 0| 0| null| null|
| 2| A| 1| 0| 0| 0|
| 3| A| 1| 1| null| 0|
| 4| A| 1| 1| null| 0|
| 5| B| 2| 1| 1| 1|
| 6| B| 2| 2| null| 1|
| 7| B| 2| 2| null| 1|
| 8| C| 3| 2| 2| 2|
| 9| C| 3| 3| null| 2|
| 10| A| 1| 3| 3| 3|
| 11| A| 1| 1| null| 3|
+-----+----+--------+--------+-----------+----------+
*/
The intermediate columns are kept in the output for reference. Just drop them if they aren't needed.
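The two steps (lag to detect a group boundary, then carrying the last non-null value forward) can be traced in plain Python. This is a sketch of the logic only; the function name previous_group_value is hypothetical and the input is assumed to be sorted by count.

```python
def previous_group_value(values):
    """For each row, return the value of the previous group (None for the first group)."""
    result = []
    prev = None         # value on the previous row (what lag(..., 1) returns)
    last_change = None  # previous value at the most recent group boundary
    for i, value in enumerate(values):
        if i > 0 and value != prev:
            # Boundary: the value changed, so remember what it was before.
            last_change = prev
        # Carrying last_change along is the forward fill that
        # last(..., ignoreNulls = true) performs over the running frame.
        result.append(last_change)
        prev = value
    return result


# text_num column from the question, ordered by count 0..11
text_num = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 1, 1]
last_value = previous_group_value(text_num)
print(last_value)
```

Because the comparison is against the previous row rather than a grouping on text_num, the second run of 1s (counts 10 and 11) correctly picks up 3.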

Set literal value over Window if condition suited Spark Scala

I need to check a condition over a window:
- If the column IND_DEF is 20 in some row, then I want to set the column premium to 1 for every row of the window (partition) that row belongs to.
My initial Dataframe looks like this:
+--------+----+-------+-----+-------+
|policyId|name|premium|state|IND_DEF|
+--------+----+-------+-----+-------+
| 1| BK| null| KT| 40|
| 1| AK| -31| null| 30|
| 1| VZ| null| IL| 20|
| 2| VK| 32| LI| 7|
| 2| CK| 25| YNZ| 10|
| 2| CK| 0| null| 5|
| 2| VK| 30| IL| 25|
+--------+----+-------+-----+-------+
And I want to achieve this:
+--------+----+-------+-----+-------+
|policyId|name|premium|state|IND_DEF|
+--------+----+-------+-----+-------+
| 1| BK| 1| KT| 40|
| 1| AK| 1| null| 30|
| 1| VZ| 1| IL| 20|
| 2| VK| 32| LI| 7|
| 2| CK| 25| YNZ| 10|
| 2| CK| 0| null| 5|
| 2| VK| 30| IL| 25|
+--------+----+-------+-----+-------+
I am trying the following code, but it does not work:
val df_946 = Seq [(Int, String, Integer, String, Int)]((1,"VZ",null,"IL",20),(1, "AK", -31,null,30),(1,"BK", null,"KT",40),(2,"CK",0,null,5),(2,"CK",25,"YNZ",10),(2,"VK",30,"IL",25),(2,"VK",32,"LI",7)).toDF("policyId", "name", "premium", "state","IND_DEF").orderBy("policyId")
val winSpec = Window.partitionBy("policyId").orderBy("policyId")
val df_947 = df_946.withColumn("premium",when(col("IND_DEF") === 20,lit(1).over(winSpec)).otherwise(col("premium")))

You can generate an array of IND_DEF values via collect_list for each window partition and recreate column premium based on the array_contains condition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, None, 40),
(1, Some(-31), 30),
(1, None, 20),
(2, Some(32), 7),
(2, Some(30), 10)
).toDF("policyId", "premium", "IND_DEF")
val win = Window.partitionBy($"policyId")
df.
withColumn("indList", collect_list($"IND_DEF").over(win)).
withColumn("premium", when(array_contains($"indList", 20), 1).otherwise($"premium")).
drop($"indList").
show
// +--------+-------+-------+
// |policyId|premium|IND_DEF|
// +--------+-------+-------+
// | 1| 1| 40|
// | 1| 1| 30|
// | 1| 1| 20|
// | 2| 32| 7|
// | 2| 30| 10|
// +--------+-------+-------+
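The collect-then-check idea can be sketched in plain Python: gather the IND_DEF values per policyId first, then rewrite premium for every row of a policy whose set contains 20. This is only a model of the logic with the answer's sample rows, with None standing in for the Scala None values.

```python
from collections import defaultdict

# Same sample rows as the answer: (policyId, premium, IND_DEF)
rows = [
    (1, None, 40),
    (1, -31, 30),
    (1, None, 20),
    (2, 32, 7),
    (2, 30, 10),
]

# Step 1: collect the IND_DEF values of each partition (like collect_list over the window).
ind_by_policy = defaultdict(set)
for policy_id, _, ind_def in rows:
    ind_by_policy[policy_id].add(ind_def)

# Step 2: set premium to 1 where the partition contains 20 (like array_contains).
updated = [
    (policy_id, 1 if 20 in ind_by_policy[policy_id] else premium, ind_def)
    for policy_id, premium, ind_def in rows
]
print(updated)
```

The condition is evaluated per partition, not per row, which is why all three rows of policy 1 change even though only one of them has IND_DEF equal to 20.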

Adding column with sum of all rows above in same grouping

I need to create a 'rolling count' column which takes the previous count and adds the new count for each day and company. I have already organized and sorted the dataframe into groups of ascending dates per company with the corresponding count. I also added a 'ix' column which indexes each grouping, like so:
+--------------------+--------------------+-----+---+
| Normalized_Date| company|count| ix|
+--------------------+--------------------+-----+---+
|09/25/2018 00:00:...|[5c40c8510fb7c017...| 7| 1|
|09/25/2018 00:00:...|[5bdb2b543951bf07...| 9| 1|
|11/28/2017 00:00:...|[593b0d9f3f21f9dd...| 7| 1|
|11/29/2017 00:00:...|[593b0d9f3f21f9dd...| 60| 2|
|01/09/2018 00:00:...|[593b0d9f3f21f9dd...| 1| 3|
|04/27/2018 00:00:...|[593b0d9f3f21f9dd...| 9| 4|
|09/25/2018 00:00:...|[593b0d9f3f21f9dd...| 29| 5|
|11/20/2018 00:00:...|[593b0d9f3f21f9dd...| 42| 6|
|12/11/2018 00:00:...|[593b0d9f3f21f9dd...| 317| 7|
|01/04/2019 00:00:...|[593b0d9f3f21f9dd...| 3| 8|
|02/13/2019 00:00:...|[593b0d9f3f21f9dd...| 15| 9|
|04/01/2019 00:00:...|[593b0d9f3f21f9dd...| 1| 10|
+--------------------+--------------------+-----+---+
The output I need would simply add up all the counts up to that date for each company. Like so:
+--------------------+--------------------+-----+---+------------+
| Normalized_Date| company|count| ix|RollingCount|
+--------------------+--------------------+-----+---+------------+
|09/25/2018 00:00:...|[5c40c8510fb7c017...| 7| 1| 7|
|09/25/2018 00:00:...|[5bdb2b543951bf07...| 9| 1| 9|
|11/28/2017 00:00:...|[593b0d9f3f21f9dd...| 7| 1| 7|
|11/29/2017 00:00:...|[593b0d9f3f21f9dd...| 60| 2| 67|
|01/09/2018 00:00:...|[593b0d9f3f21f9dd...| 1| 3| 68|
|04/27/2018 00:00:...|[593b0d9f3f21f9dd...| 9| 4| 77|
|09/25/2018 00:00:...|[593b0d9f3f21f9dd...| 29| 5| 106|
|11/20/2018 00:00:...|[593b0d9f3f21f9dd...| 42| 6| 148|
|12/11/2018 00:00:...|[593b0d9f3f21f9dd...| 317| 7| 465|
|01/04/2019 00:00:...|[593b0d9f3f21f9dd...| 3| 8| 468|
|02/13/2019 00:00:...|[593b0d9f3f21f9dd...| 15| 9| 483|
|04/01/2019 00:00:...|[593b0d9f3f21f9dd...| 1| 10| 484|
+--------------------+--------------------+-----+---+------------+
I figured the lag function would be of use, and I was able to get each row of rollingcount with ix > 1 to add the count directly above it with the following code:
w = Window.partitionBy('company').orderBy(F.unix_timestamp('Normalized_Date','MM/dd/yyyy HH:mm:ss aaa').cast('timestamp'))
refined_DF = solutionDF.withColumn("rn", F.row_number().over(w))
solutionDF = refined_DF.withColumn('RollingCount',F.when(refined_DF['rn'] > 1, refined_DF['count'] + F.lag(refined_DF['count'],count= 1 ).over(w)).otherwise(refined_DF['count']))
which yields the following df:
+--------------------+--------------------+-----+---+------------+
| Normalized_Date| company|count| ix|RollingCount|
+--------------------+--------------------+-----+---+------------+
|09/25/2018 00:00:...|[5c40c8510fb7c017...| 7| 1| 7|
|09/25/2018 00:00:...|[5bdb2b543951bf07...| 9| 1| 9|
|11/28/2017 00:00:...|[593b0d9f3f21f9dd...| 7| 1| 7|
|11/29/2017 00:00:...|[593b0d9f3f21f9dd...| 60| 2| 67|
|01/09/2018 00:00:...|[593b0d9f3f21f9dd...| 1| 3| 61|
|04/27/2018 00:00:...|[593b0d9f3f21f9dd...| 9| 4| 10|
|09/25/2018 00:00:...|[593b0d9f3f21f9dd...| 29| 5| 38|
|11/20/2018 00:00:...|[593b0d9f3f21f9dd...| 42| 6| 71|
|12/11/2018 00:00:...|[593b0d9f3f21f9dd...| 317| 7| 359|
|01/04/2019 00:00:...|[593b0d9f3f21f9dd...| 3| 8| 320|
|02/13/2019 00:00:...|[593b0d9f3f21f9dd...| 15| 9| 18|
|04/01/2019 00:00:...|[593b0d9f3f21f9dd...| 1| 10| 16|
+--------------------+--------------------+-----+---+------------+
I just need it to sum all of the counts ix rows above it. I have tried using a udf to figure out the 'count' input into the lag function, but I keep getting a "'Column' object is not callable" error, plus it doesn't do the sum of all of the rows. I have also tried using a loop but that seems impossible because it will make a new dataframe each time through, plus I would need to join them all afterwards. There must be an easier and simpler way to do this. Perhaps a different function than lag?

lag returns a single row at a fixed offset before the current one, but you need a range of rows to calculate the cumulative sum. Therefore you have to define the window frame with rangeBetween (or rowsBetween). Have a look at the example below:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
('09/25/2018', '5c40c8510fb7c017', 7, 1),
('09/25/2018', '5bdb2b543951bf07', 9, 1),
('11/28/2017', '593b0d9f3f21f9dd', 7, 1),
('11/29/2017', '593b0d9f3f21f9dd', 60, 2),
('01/09/2018', '593b0d9f3f21f9dd', 1, 3),
('04/27/2018', '593b0d9f3f21f9dd', 9, 4),
('09/25/2018', '593b0d9f3f21f9dd', 29, 5),
('11/20/2018', '593b0d9f3f21f9dd', 42, 6),
('12/11/2018', '593b0d9f3f21f9dd', 317, 7),
('01/04/2019', '593b0d9f3f21f9dd', 3, 8),
('02/13/2019', '593b0d9f3f21f9dd', 15, 9),
('04/01/2019', '593b0d9f3f21f9dd', 1, 10)
]
columns = ['Normalized_Date', 'company','count', 'ix']
df=spark.createDataFrame(l, columns)
df = df.withColumn('Normalized_Date', F.to_date(df.Normalized_Date, 'MM/dd/yyyy'))
w = Window.partitionBy('company').orderBy('Normalized_Date').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('Rolling_count', F.sum('count').over(w))
df.show()
Output:
+---------------+----------------+-----+---+-------------+
|Normalized_Date| company|count| ix|Rolling_count|
+---------------+----------------+-----+---+-------------+
| 2018-09-25|5c40c8510fb7c017| 7| 1| 7|
| 2018-09-25|5bdb2b543951bf07| 9| 1| 9|
| 2017-11-28|593b0d9f3f21f9dd| 7| 1| 7|
| 2017-11-29|593b0d9f3f21f9dd| 60| 2| 67|
| 2018-01-09|593b0d9f3f21f9dd| 1| 3| 68|
| 2018-04-27|593b0d9f3f21f9dd| 9| 4| 77|
| 2018-09-25|593b0d9f3f21f9dd| 29| 5| 106|
| 2018-11-20|593b0d9f3f21f9dd| 42| 6| 148|
| 2018-12-11|593b0d9f3f21f9dd| 317| 7| 465|
| 2019-01-04|593b0d9f3f21f9dd| 3| 8| 468|
| 2019-02-13|593b0d9f3f21f9dd| 15| 9| 483|
| 2019-04-01|593b0d9f3f21f9dd| 1| 10| 484|
+---------------+----------------+-----+---+-------------+
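rangeBetween and rowsBetween give the same result here because every date is unique within a company. They differ when the ordering column has ties: a range frame ending at the current row includes all rows with the same ordering value, while a rows frame stops at the current physical row. The plain-Python sketch below models both on made-up data with one duplicated date (the data and function names are hypothetical).

```python
def running_sum_rows(pairs):
    """ROWS frame: sum of all values up to and including the current physical row."""
    total = 0
    out = []
    for _, value in pairs:
        total += value
        out.append(total)
    return out


def running_sum_range(pairs):
    """RANGE frame: sum of all values whose ordering key is <= the current row's key."""
    return [sum(v for k, v in pairs if k <= key) for key, _ in pairs]


# Hypothetical (date, count) rows for one company, sorted by date, with a tie on 2018-01-02.
pairs = [("2018-01-01", 5), ("2018-01-02", 3), ("2018-01-02", 4), ("2018-01-03", 1)]

rows_result = running_sum_rows(pairs)
range_result = running_sum_range(pairs)
print(rows_result)
print(range_result)
```

With the tie, the rows frame gives 8 and then 12 for the two rows dated 2018-01-02, while the range frame gives 12 for both.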

Try this.
You need the sum of all preceding rows to current row in the window frame.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.WindowSpec
import org.apache.spark.sql.functions._
val df = Seq(
("5c40c8510fb7c017", 7, 1),
("5bdb2b543951bf07", 9, 1),
("593b0d9f3f21f9dd", 7, 1),
("593b0d9f3f21f9dd", 60, 2),
("593b0d9f3f21f9dd", 1, 3),
("593b0d9f3f21f9dd", 9, 4),
("593b0d9f3f21f9dd", 29, 5),
("593b0d9f3f21f9dd", 42, 6),
("593b0d9f3f21f9dd", 317, 7),
("593b0d9f3f21f9dd", 3, 8),
("593b0d9f3f21f9dd", 15, 9),
("593b0d9f3f21f9dd", 1, 10)
).toDF("company", "count", "ix")
scala> df.show(false)
+----------------+-----+---+
|company |count|ix |
+----------------+-----+---+
|5c40c8510fb7c017|7 |1 |
|5bdb2b543951bf07|9 |1 |
|593b0d9f3f21f9dd|7 |1 |
|593b0d9f3f21f9dd|60 |2 |
|593b0d9f3f21f9dd|1 |3 |
|593b0d9f3f21f9dd|9 |4 |
|593b0d9f3f21f9dd|29 |5 |
|593b0d9f3f21f9dd|42 |6 |
|593b0d9f3f21f9dd|317 |7 |
|593b0d9f3f21f9dd|3 |8 |
|593b0d9f3f21f9dd|15 |9 |
|593b0d9f3f21f9dd|1 |10 |
+----------------+-----+---+
scala> val overColumns = Window.partitionBy("company").orderBy("ix").rowsBetween(Window.unboundedPreceding, Window.currentRow)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@3ed5e17c
scala> val outputDF = df.withColumn("RollingCount", sum("count").over(overColumns))
outputDF: org.apache.spark.sql.DataFrame = [company: string, count: int ... 2 more fields]
scala> outputDF.show(false)
+----------------+-----+---+------------+
|company |count|ix |RollingCount|
+----------------+-----+---+------------+
|5c40c8510fb7c017|7 |1 |7 |
|5bdb2b543951bf07|9 |1 |9 |
|593b0d9f3f21f9dd|7 |1 |7 |
|593b0d9f3f21f9dd|60 |2 |67 |
|593b0d9f3f21f9dd|1 |3 |68 |
|593b0d9f3f21f9dd|9 |4 |77 |
|593b0d9f3f21f9dd|29 |5 |106 |
|593b0d9f3f21f9dd|42 |6 |148 |
|593b0d9f3f21f9dd|317 |7 |465 |
|593b0d9f3f21f9dd|3 |8 |468 |
|593b0d9f3f21f9dd|15 |9 |483 |
|593b0d9f3f21f9dd|1 |10 |484 |
+----------------+-----+---+------------+

Apache Spark - Scala API - Aggregate on sequentially increasing key

I have a data frame that looks something like this:
val df = sc.parallelize(Seq(
(3,1,"A"),(3,2,"B"),(3,3,"C"),
(2,1,"D"),(2,2,"E"),
(3,1,"F"),(3,2,"G"),(3,3,"G"),
(2,1,"X"),(2,2,"X")
)).toDF("TotalN", "N", "String")
+------+---+------+
|TotalN| N|String|
+------+---+------+
| 3| 1| A|
| 3| 2| B|
| 3| 3| C|
| 2| 1| D|
| 2| 2| E|
| 3| 1| F|
| 3| 2| G|
| 3| 3| G|
| 2| 1| X|
| 2| 2| X|
+------+---+------+
I need to aggregate the strings by concatenating them together based on the TotalN and the sequentially increasing ID (N). The problem is there is not a unique ID for each aggregation I can group by. So, I need to do something like "for each row look at the TotalN, loop through the next N rows and concatenate, then reset".
+------+------+
|TotalN|String|
+------+------+
| 3| ABC|
| 2| DE|
| 3| FGG|
| 2| XX|
+------+------+
Any pointers much appreciated.
Using Spark 2.3.1 and the Scala Api.

Try this:
val df = spark.sparkContext.parallelize(Seq(
(3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
(2, 1, "D"), (2, 2, "E"),
(3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
(2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")
df.createOrReplaceTempView("data")
val sqlDF = spark.sql(
"""
| SELECT TotalN d, N, String, ROW_NUMBER() over (order by TotalN) as rowNum
| FROM data
""".stripMargin)
sqlDF.withColumn("key", $"N" - $"rowNum")
.groupBy("key").agg(collect_list('String).as("texts")).show()

The solution is to calculate a grouping variable with the row_number function, which can then be used in a later groupBy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
var w = Window.orderBy("TotalN")
df.withColumn("GeneratedID", $"N" - row_number.over(w)).show
+------+---+------+-----------+
|TotalN| N|String|GeneratedID|
+------+---+------+-----------+
| 2| 1| D| 0|
| 2| 2| E| 0|
| 2| 1| X| -2|
| 2| 2| X| -2|
| 3| 1| A| -4|
| 3| 2| B| -4|
| 3| 3| C| -4|
| 3| 1| F| -7|
| 3| 2| G| -7|
| 3| 3| G| -7|
+------+---+------+-----------+
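The grouping-key trick can be traced in plain Python: within one run, N and the row number both increase by 1, so their difference stays constant, and it changes when N resets to 1. The sketch below assumes the list position stands in for an explicit ordering column that preserves the original row order; that column is an assumption, since the question's data has none.

```python
from itertools import groupby

# Same rows as the question: (TotalN, N, String), in their original order.
rows = [
    (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
    (2, 1, "D"), (2, 2, "E"),
    (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
    (2, 1, "X"), (2, 2, "X"),
]

# N minus the 1-based row number is constant inside each run of increasing N.
keyed = [(n - row_num, total_n, s)
         for row_num, (total_n, n, s) in enumerate(rows, start=1)]

# Group consecutive rows by that key and concatenate the strings.
result = []
for _, group in groupby(keyed, key=lambda item: item[0]):
    members = list(group)
    result.append((members[0][1], "".join(s for _, _, s in members)))
print(result)
```

The keys for this data are 0, -3, -5 and -8, one per run, so the four groups come out in their original order.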
