How to sum in PySpark?

I have the table below and I just want to sum columns _10 and _12, but I am getting an error.
+------+---+---+
|    _2|_10|_12|
+------+---+---+
|yearID|  H| 3B|
|  2004|  0|  0|
|  2006|  0|  0|
|  2007|  0|  0|
|  2008|  0|  0|
|  2009|  0|  0|
|  2010|  0|  0|
|  1954|131|  6|
|  1955|189|  9|
|  1956|200| 14|
|  1957|198|  6|
|  1958|196|  4|
|  1959|223|  7|
|  1960|172| 11|
|  1961|197| 10|
|  1962|191|  6|
|  1963|201|  4|
|  1964|187|  2|
|  1965|181|  1|
|  1966|168|  1|
|  1967|184|  3|
|  1968|174|  4|
|  1969|164|  3|
|  1970|154|  1|
|  1971|162|  3|
|  1972|119|  0|
|  1973|118|  1|
|  1974| 91|  0|
|  1975|109|  2|
|  1976| 62|  0|
+------+---+---+

I am not sure exactly what you mean by sum. If you want to sum the values of a whole column, you can use the agg function. If you want to add _10 + _12 row by row and create a new column, use the withColumn function. For example:
>>> data = sc.parallelize([
... ('yearID','H','3B'),
... ('2004','0','0'),
... ('2006','0','0'),
... ('2007','0','0'),
... ('2008','0','0'),
... ('2009','0','0'),
... ('2010','0','0'),
... ('1954','131','6'),
... ('1955','189','9'),
... ('1956','200','14'),
... ('1957','198','6')
... ])
>>>
>>> cols = ['_2','_10','_12']
>>>
>>> df = spark.createDataFrame(data,cols)
>>>
>>> df.show()
+------+---+---+
| _2|_10|_12|
+------+---+---+
|yearID| H| 3B|
| 2004| 0| 0|
| 2006| 0| 0|
| 2007| 0| 0|
| 2008| 0| 0|
| 2009| 0| 0|
| 2010| 0| 0|
| 1954|131| 6|
| 1955|189| 9|
| 1956|200| 14|
| 1957|198| 6|
+------+---+---+
>>> df.agg({'_10':'sum','_12':'sum'}).show()
+--------+--------+
|sum(_12)|sum(_10)|
+--------+--------+
| 35.0| 718.0|
+--------+--------+
>>> df.withColumn('new_col', df['_10']+df['_12']).show()
+------+---+---+-------+
| _2|_10|_12|new_col|
+------+---+---+-------+
|yearID| H| 3B| null|
| 2004| 0| 0| 0.0|
| 2006| 0| 0| 0.0|
| 2007| 0| 0| 0.0|
| 2008| 0| 0| 0.0|
| 2009| 0| 0| 0.0|
| 2010| 0| 0| 0.0|
| 1954|131| 6| 137.0|
| 1955|189| 9| 198.0|
| 1956|200| 14| 214.0|
| 1957|198| 6| 204.0|
+------+---+---+-------+
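Note that the _10 and _12 columns in this demo are strings, so Spark implicitly casts them to double when aggregating (which is why the totals come out as 35.0 and 718.0), and the embedded header row ('H', '3B') turns into null. A small cleanup sketch, reusing the _2/_10/_12 column names from above, would drop that header row and cast explicitly before summing:
from pyspark.sql import functions as F

# Drop the row that still holds the original CSV header and cast to integers
clean = (df.filter(F.col('_2') != 'yearID')
           .withColumn('_10', F.col('_10').cast('int'))
           .withColumn('_12', F.col('_12').cast('int')))

# Column totals
clean.agg(F.sum('_10').alias('sum_H'), F.sum('_12').alias('sum_3B')).show()

# Row-wise sum into a new column
clean.withColumn('new_col', F.col('_10') + F.col('_12')).show()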

Related

Window function based on a condition

I have the following DF:
|-----------------------|
|Date | Val | Cond|
|-----------------------|
|2022-01-08 | 2 | 0 |
|2022-01-09 | 4 | 1 |
|2022-01-10 | 6 | 1 |
|2022-01-11 | 8 | 0 |
|2022-01-12 | 2 | 1 |
|2022-01-13 | 5 | 1 |
|2022-01-14 | 7 | 0 |
|2022-01-15 | 9 | 0 |
|-----------------------|
For every date, I need to sum the Val of the two most recent earlier rows where Cond = 1. My expected output is:
|-----------------|
|Date | Sum |
|-----------------|
|2022-01-08 | 0 | No sum, because there are not two earlier dates with Cond = 1
|2022-01-09 | 0 | No sum, because there are not two earlier dates with Cond = 1
|2022-01-10 | 0 | No sum, because there are not two earlier dates with Cond = 1
|2022-01-11 | 10 | (4+6)
|2022-01-12 | 10 | (4+6)
|2022-01-13 | 8 | (2+6)
|2022-01-14 | 7 | (5+2)
|2022-01-15 | 7 | (5+2)
|-----------------|
I've tried to get the output DF using this code:
df = df.where("Cond = 1").withColumn(
    "ListView",
    f.collect_list("Val").over(windowSpec.rowsBetween(-2, -1))
)
But when I use .where("Cond = 1") I exclude the dates where Cond equals zero.
I found the following answer, but it didn't help me:
Window.rowsBetween - only consider rows fulfilling a specific condition (e.g. not being null)
How can I achieve my expected output using window functions?
The MVCE:
from pyspark.sql.types import StructType, StructField, DateType, IntegerType

data_1 = [
    ("2022-01-08", 2, 0),
    ("2022-01-09", 4, 1),
    ("2022-01-10", 6, 1),
    ("2022-01-11", 8, 0),
    ("2022-01-12", 2, 1),
    ("2022-01-13", 5, 1),
    ("2022-01-14", 7, 0),
    ("2022-01-15", 9, 0)
]
schema_1 = StructType([
    StructField("Date", DateType(), True),
    StructField("Val", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
The following should do the trick (but I'm sure it can be further optimized).
Setup:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import to_date, col, when, sum, max  # sum/max shadow the Python builtins here
from pyspark.sql.window import Window

data_1 = [
    ("2022-01-08", 2, 0),
    ("2022-01-09", 4, 1),
    ("2022-01-10", 6, 1),
    ("2022-01-11", 8, 0),
    ("2022-01-12", 2, 1),
    ("2022-01-13", 5, 1),
    ("2022-01-14", 7, 0),
    ("2022-01-15", 9, 0),
    ("2022-01-16", 9, 0),
    ("2022-01-17", 9, 0)
]
schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Val", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
df_1 = df_1.withColumn('Date', to_date("Date", "yyyy-MM-dd"))
df_1.show()
+----------+---+----+
| Date|Val|Cond|
+----------+---+----+
|2022-01-08| 2| 0|
|2022-01-09| 4| 1|
|2022-01-10| 6| 1|
|2022-01-11| 8| 0|
|2022-01-12| 2| 1|
|2022-01-13| 5| 1|
|2022-01-14| 7| 0|
|2022-01-15| 9| 0|
|2022-01-16| 9| 0|
|2022-01-17| 9| 0|
+----------+---+----+
Create a new DF with only the Cond == 1 rows, and compute the sum over each pair of consecutive rows with that condition:
windowSpec = Window.partitionBy("Cond").orderBy("Date")
df_2 = df_1.where(df_1.Cond == 1).withColumn(
    "Sum",
    sum("Val").over(windowSpec.rowsBetween(-1, 0))
).withColumn('date_1', col('date')).drop('date')
df_2.show()
+---+----+---+----------+
|Val|Cond|Sum| date_1|
+---+----+---+----------+
| 4| 1| 4|2022-01-09|
| 6| 1| 10|2022-01-10|
| 2| 1| 8|2022-01-12|
| 5| 1| 7|2022-01-13|
+---+----+---+----------+
Do a left join to get the sum into the original data frame, and set the sum to zero for the rows with Cond==0:
df_3 = df_1.join(df_2.select('sum', col('date_1')),
                 df_1.Date == df_2.date_1, "left").drop('date_1').fillna(0)
df_3.show()
+----------+---+----+---+
| Date|Val|Cond|sum|
+----------+---+----+---+
|2022-01-08| 2| 0| 0|
|2022-01-09| 4| 1| 4|
|2022-01-10| 6| 1| 10|
|2022-01-11| 8| 0| 0|
|2022-01-12| 2| 1| 8|
|2022-01-13| 5| 1| 7|
|2022-01-14| 7| 0| 0|
|2022-01-15| 9| 0| 0|
|2022-01-16| 9| 0| 0|
|2022-01-17| 9| 0| 0|
+----------+---+----+---+
Do a cumulative sum on the condition column:
df_3 = df_3.withColumn('cond_sum', sum('cond').over(Window.orderBy('Date')))
df_3.show()
+----------+---+----+---+--------+
| Date|Val|Cond|sum|cond_sum|
+----------+---+----+---+--------+
|2022-01-08| 2| 0| 0| 0|
|2022-01-09| 4| 1| 4| 1|
|2022-01-10| 6| 1| 10| 2|
|2022-01-11| 8| 0| 0| 2|
|2022-01-12| 2| 1| 8| 3|
|2022-01-13| 5| 1| 7| 4|
|2022-01-14| 7| 0| 0| 4|
|2022-01-15| 9| 0| 0| 4|
|2022-01-16| 9| 0| 0| 4|
|2022-01-17| 9| 0| 0| 4|
+----------+---+----+---+--------+
Finally, for each partition where the cond_sum is greater than 1, use the max sum for that partition:
df_3.withColumn('sum', when(df_3.cond_sum > 1, max('sum').over(Window.partitionBy('cond_sum'))).otherwise(0)).show()
+----------+---+----+---+--------+
| Date|Val|Cond|sum|cond_sum|
+----------+---+----+---+--------+
|2022-01-08| 2| 0| 0| 0|
|2022-01-09| 4| 1| 0| 1|
|2022-01-10| 6| 1| 10| 2|
|2022-01-11| 8| 0| 10| 2|
|2022-01-12| 2| 1| 8| 3|
|2022-01-13| 5| 1| 7| 4|
|2022-01-14| 7| 0| 7| 4|
|2022-01-15| 9| 0| 7| 4|
|2022-01-16| 9| 0| 7| 4|
|2022-01-17| 9| 0| 7| 4|
+----------+---+----+---+--------+
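For reference, a more compact sketch of the same idea (it assumes Spark 2.4+ for element_at and uses an unpartitioned ordered window, which pulls all rows into a single partition): collect the Val of every earlier Cond == 1 row into a list, then add its last two entries whenever at least two exist.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# All rows strictly before the current one, in date order
w = Window.orderBy('Date').rowsBetween(Window.unboundedPreceding, -1)

# collect_list skips nulls, so this keeps only the prior Cond == 1 values
prev_vals = F.collect_list(F.when(F.col('Cond') == 1, F.col('Val'))).over(w)

result = df_1.withColumn(
    'Sum',
    F.when(F.size(prev_vals) >= 2,
           F.element_at(prev_vals, -1) + F.element_at(prev_vals, -2))
     .otherwise(0)
)
result.show()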

Pyspark keep state within tasks

This is related to this question: Pyspark dataframe column value dependent on value from another row, but this one gets even more complicated.
I have a dataframe:
columns = ['id', 'seq', 'manufacturer']
data = [
    ("1", 1, "Factory"), ("1", 2, "Sub-Factory-1"), ("1", 3, "Order"), ("1", 4, "Sub-Factory-1"),
    ("2", 1, "Factory"), ("2", 2, "Sub-Factory-1"), ("2", 5, "Sub-Factory-1"),
    ("3", 1, "Sub-Factory-1"), ("3", 2, "Order"), ("3", 4, "Sub-Factory-1"),
    ("4", 1, "Factory"), ("4", 3, "Sub-Factory-1"), ("4", 4, "Sub-Factory-1"),
    ("5", 1, "Sub-Factory-1"), ("5", 2, "Sub-Factory-1"), ("5", 6, "Order"),
    ("6", 2, "Factory"), ("6", 3, "Order"), ("6", 4, "Sub-Factory-1"), ("6", 6, "Sub-Factory-1"), ("6", 7, "Order"),
    ("7", 1, "Sub-Factory-1"), ("7", 2, "Factory"), ("7", 3, "Order"), ("7", 4, "Sub-Factory-1"),
    ("7", 5, "Factory"), ("7", 8, "Sub-Factory-1"), ("7", 10, "Sub-Factory-1")
]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.orderBy('id', 'seq').show(40)
+---+---+-------------+
| id|seq| manufacturer|
+---+---+-------------+
| 1| 1| Factory|
| 1| 2|Sub-Factory-1|
| 1| 3| Order|
| 1| 4|Sub-Factory-1|
| 2| 1| Factory|
| 2| 2|Sub-Factory-1|
| 2| 5|Sub-Factory-1|
| 3| 1|Sub-Factory-1|
| 3| 2| Order|
| 3| 4|Sub-Factory-1|
| 4| 1| Factory|
| 4| 3|Sub-Factory-1|
| 4| 4|Sub-Factory-1|
| 5| 1|Sub-Factory-1|
| 5| 2|Sub-Factory-1|
| 5| 6| Order|
| 6| 2| Factory|
| 6| 3| Order|
| 6| 4|Sub-Factory-1|
| 6| 6|Sub-Factory-1|
| 6| 7| Order|
| 7| 1|Sub-Factory-1|
| 7| 2| Factory|
| 7| 3| Order|
| 7| 4|Sub-Factory-1|
| 7| 5| Factory|
| 7| 8|Sub-Factory-1|
| 7| 10|Sub-Factory-1|
+---+---+-------------+
What I want to do is assign hierarchical values to another column (not saying it's the best idea) that I can use with the logic from Pyspark dataframe column value dependent on value from another row. So, within each id group ordered by seq, I want only the first Sub-Factory to be attributed to a Factory, and only if there is a Factory above that Sub-Factory within the same id and seq order.
So the end result should look like:
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
| 6| 7| Order| 0|
| 7| 1|Sub-Factory-1| 0|
| 7| 2| Factory| 1|
| 7| 3| Order| 0|
| 7| 4|Sub-Factory-1| 1|
| 7| 5| Factory| 1|
| 7| 8|Sub-Factory-1| 1|
| 7| 10|Sub-Factory-1| 0|
+---+---+-------------+-------+
The dataset is large, so I can't use something like df.collect() and then loop over the data, because that runs out of memory. My first idea was to use an accumulator like:
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

acc = sc.accumulator(0)

def myFunc(manufacturer):
    if manufacturer == 'Factory':
        acc.value = 1
        return 1
    elif manufacturer == 'Sub-Factory-1' and acc.value == 1:
        acc.value = 0
        return 1
    else:
        return 0

myFuncUDF = F.udf(myFunc, IntegerType())
df = df.withColumn('test', myFuncUDF(col('manufacturer')))
But it's a bad idea, since an accumulator's value cannot be read from within tasks.
A window function would also solve it if I wanted to attribute all Sub-Factories below a Factory within the same id, but here only the first Sub-Factory should get attributed. Any ideas?
from pyspark.sql.window import Window
from pyspark.sql.functions import *
df_mod = df.filter(df.manufacturer == 'Sub-Factory-1')
W = Window.partitionBy("id").orderBy("seq")
df_mod = df_mod.withColumn("rank",rank().over(W))
df_mod = df_mod.filter(col('rank') == 1)
df_mod2 = df.filter(col('manufacturer') == 'Factory')\
.select('id', 'seq', col('manufacturer').alias('Factory_chk_2'))
df_f = df\
.join(df_mod, ['id', 'seq'], 'left')\
.select('id', 'seq', df.manufacturer, 'rank')\
.join(df_mod2, 'id', 'left')\
.select('id', df.seq, df.manufacturer, 'rank', 'Factory_chk_2')\
.withColumn('Factory_chk', when(df.manufacturer=='Factory', 1))\
.withColumn('Factory_chk_2', when(col('Factory_chk_2')=='Factory', 1))\
.withColumn('checker',when(col('Factory_chk_2')=='1', coalesce(col('rank'),col('Factory_chk'))).otherwise(lit(0)))\
.select('id', 'seq', 'manufacturer', 'checker')\
.na.fill(value=0)\
.orderBy('id', 'seq')
df_f.show()
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
+---+---+-------------+-------+
only showing top 20 rows
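For what it's worth, here is an alternative sketch without the joins (just a sketch, using the df and column names from the question): number the Factory blocks inside each id with a running count, then flag the Factory row itself and the first Sub-Factory-1 row inside each block that actually starts with a Factory.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_seq = Window.partitionBy('id').orderBy('seq')

# Running count of Factory rows per id; rows before the first Factory get group 0
with_grp = df.withColumn(
    'factory_grp',
    F.sum(F.when(F.col('manufacturer') == 'Factory', 1).otherwise(0)).over(w_seq)
)

# Running count of Sub-Factory-1 rows inside each (id, factory_grp) block
w_grp = Window.partitionBy('id', 'factory_grp').orderBy('seq')
with_rank = with_grp.withColumn(
    'sub_rank',
    F.sum(F.when(F.col('manufacturer') == 'Sub-Factory-1', 1).otherwise(0)).over(w_grp)
)

# Factory rows are always 1; a Sub-Factory-1 row is 1 only if it is the first one
# in a block that has a Factory above it (factory_grp > 0)
result = with_rank.withColumn(
    'checker',
    F.when(F.col('manufacturer') == 'Factory', 1)
     .when((F.col('manufacturer') == 'Sub-Factory-1')
           & (F.col('factory_grp') > 0)
           & (F.col('sub_rank') == 1), 1)
     .otherwise(0)
).drop('factory_grp', 'sub_rank')

result.orderBy('id', 'seq').show(40)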

pySpark windows partition sortby instead of order by (exclamation marks)

This is my current dataset:
+----------+--------------------+---------+--------+
|session_id| timestamp| item_id|category|
+----------+--------------------+---------+--------+
| 1|2014-04-07 10:51:...|214536502| 0|
| 1|2014-04-07 10:54:...|214536500| 0|
| 1|2014-04-07 10:54:...|214536506| 0|
| 1|2014-04-07 10:57:...|214577561| 0|
| 2|2014-04-07 13:56:...|214662742| 0|
| 2|2014-04-07 13:57:...|214662742| 0|
| 2|2014-04-07 13:58:...|214825110| 0|
| 2|2014-04-07 13:59:...|214757390| 0|
| 2|2014-04-07 14:00:...|214757407| 0|
| 2|2014-04-07 14:02:...|214551617| 0|
| 3|2014-04-02 13:17:...|214716935| 0|
| 3|2014-04-02 13:26:...|214774687| 0|
| 3|2014-04-02 13:30:...|214832672| 0|
| 4|2014-04-07 12:09:...|214836765| 0|
| 4|2014-04-07 12:26:...|214706482| 0|
| 6|2014-04-06 16:58:...|214701242| 0|
| 6|2014-04-06 17:02:...|214826623| 0|
| 7|2014-04-02 06:38:...|214826835| 0|
| 7|2014-04-02 06:39:...|214826715| 0|
| 8|2014-04-06 08:49:...|214838855| 0|
+----------+--------------------+---------+--------+
I want to get the difference between the timestamp of the current row and the timestamp of the previous row.
So I converted the timestamp as follows:
data = data.withColumn('time_seconds',data.timestamp.astype('Timestamp').cast("long"))
data.show()
Next, I tried the following:
my_window = Window.partitionBy().orderBy("session_id")
data = data.withColumn("prev_value", F.lag(data.time_seconds).over(my_window))
data = data.withColumn("diff", F.when(F.isnull(data.time_seconds - data.prev_value), 0)
                               .otherwise(data.time_seconds - data.prev_value))
data.show()
This is what I got:
+----------+-----------+---------+--------+------------+----------+--------+
|session_id|  timestamp|  item_id|category|time_seconds|prev_value|    diff|
+----------+-----------+---------+--------+------------+----------+--------+
|         1| 2014-04-07|214536502|       0|  1396831869|      null|       0|
|         1| 2014-04-07|214536500|       0|  1396832049|1396831869|     180|
|         1| 2014-04-07|214536506|       0|  1396832086|1396832049|      37|
|         1| 2014-04-07|214577561|       0|  1396832220|1396832086|     134|
|  10000001| 2014-09-08|214854230|       S|  1410136538|1396832220|13304318|
|  10000001| 2014-09-08|214556216|       S|  1410136820|1410136538|     282|
|  10000001| 2014-09-08|214556212|       S|  1410136836|1410136820|      16|
|  10000001| 2014-09-08|214854230|       S|  1410136872|1410136836|      36|
|  10000001| 2014-09-08|214854125|       S|  1410137314|1410136872|     442|
|  10000002| 2014-09-08|214849322|       S|  1410167451|1410137314|   30137|
|  10000002| 2014-09-08|214838094|       S|  1410167611|1410167451|     160|
|  10000002| 2014-09-08|214714721|       S|  1410167694|1410167611|      83|
|  10000002| 2014-09-08|214853711|       S|  1410168818|1410167694|    1124|
|  10000003| 2014-09-05|214853090|       3|  1409880735|1410168818| -288083|
|  10000003| 2014-09-05|214851326|       3|  1409880865|1409880735|     130|
|  10000003| 2014-09-05|214853094|       3|  1409881043|1409880865|     178|
|  10000004| 2014-09-05|214853090|       3|  1409886885|1409881043|    5842|
|  10000004| 2014-09-05|214851326|       3|  1409889318|1409886885|    2433|
|  10000004| 2014-09-05|214853090|       3|  1409889388|1409889318|      70|
|  10000004| 2014-09-05|214851326|       3|  1409889428|1409889388|      40|
+----------+-----------+---------+--------+------------+----------+--------+
only showing top 20 rows
I was hoping that the session_id would come out in numerical order instead of what that gave me.
Is there any way to make the session_id come out in numerical order (as in 1, 2, 3, ...) instead of (1, 10000001, ...)?
Thank you so much.
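A sketch of one way to get that ordering, assuming session_id is stored as a string (which would explain why 10000001 sorts right after 1): cast it to a numeric type before ordering. Partitioning the lag window by session also keeps the diff from crossing session boundaries; drop that change if you really do want one global sequence.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Cast the id so it sorts numerically instead of lexicographically
data = data.withColumn('session_id', F.col('session_id').cast('long'))

# Lag within each session, ordered by time, so diffs stay inside a session
my_window = Window.partitionBy('session_id').orderBy('time_seconds')
data = data.withColumn('prev_value', F.lag('time_seconds').over(my_window))
data = data.withColumn(
    'diff',
    F.coalesce(F.col('time_seconds') - F.col('prev_value'), F.lit(0))
)

data.orderBy('session_id', 'time_seconds').show()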

Filtering on multiple columns in Spark dataframes

Suppose I have a dataframe in Spark as shown below -
val df = Seq(
  (0, 0, 0, 0.0),
  (1, 0, 0, 0.1),
  (0, 1, 0, 0.11),
  (0, 0, 1, 0.12),
  (1, 1, 0, 0.24),
  (1, 0, 1, 0.27),
  (0, 1, 1, 0.30),
  (1, 1, 1, 0.40)
).toDF("A", "B", "C", "rate")
Here is how it looks -
scala> df.show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 0| 0| 0| 0.0|
| 1| 0| 0| 0.1|
| 0| 1| 0|0.11|
| 0| 0| 1|0.12|
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
| 1| 1| 1| 0.4|
+---+---+---+----+
A, B, and C are the advertising channels in this case. 0 and 1 represent the absence and presence of a channel, respectively. The 2^3 = 8 combinations appear in the dataframe.
I want to filter the records from this dataframe that show the presence of exactly 2 channels at a time (AB, AC, BC). Here is how I want my output to be -
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
+---+---+---+----+
I can write 3 statements to get the output by doing -
scala> df.filter($"A" === 1 && $"B" === 1 && $"C" === 0).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
+---+---+---+----+
scala> df.filter($"A" === 1 && $"B" === 0 && $"C" === 1).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 0| 1|0.27|
+---+---+---+----+
scala> df.filter($"A" === 0 && $"B" === 1 && $"C" === 1).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 0| 1| 1| 0.3|
+---+---+---+----+
However, I want to achieve this using either a single statement that does the job or a function that gets me the output.
I was thinking of using a case statement to match the values. However, in general my dataframe might consist of more than 3 channels -
scala> df.show()
+---+---+---+---+----+
| A| B| C| D|rate|
+---+---+---+---+----+
| 0| 0| 0| 0| 0.0|
| 0| 0| 0| 1| 0.1|
| 0| 0| 1| 0| 0.1|
| 0| 0| 1| 1|0.59|
| 0| 1| 0| 0| 0.1|
| 0| 1| 0| 1|0.89|
| 0| 1| 1| 0|0.39|
| 0| 1| 1| 1| 0.4|
| 1| 0| 0| 0| 0.0|
| 1| 0| 0| 1|0.99|
| 1| 0| 1| 0|0.49|
| 1| 0| 1| 1| 0.1|
| 1| 1| 0| 0|0.79|
| 1| 1| 0| 1| 0.1|
| 1| 1| 1| 0| 0.1|
| 1| 1| 1| 1| 0.1|
+---+---+---+---+----+
In this scenario I would want my output as -
scala> df.show()
+---+---+---+---+----+
| A| B| C| D|rate|
+---+---+---+---+----+
| 0| 0| 1| 1|0.59|
| 0| 1| 0| 1|0.89|
| 0| 1| 1| 0|0.39|
| 1| 0| 0| 1|0.99|
| 1| 0| 1| 0|0.49|
| 1| 1| 0| 0|0.79|
+---+---+---+---+----+
which shows rates for paired presence of channels => (AB, AC, AD, BC, BD, CD).
Kindly help.
One way could be to sum the columns and then filter only when the result of the sum is 2.
import org.apache.spark.sql.functions._
df.withColumn("res", $"A" + $"B" + $"C").filter($"res" === lit(2)).drop("res").show
The output is:
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
+---+---+---+----+

How to enumerate the rows of a dataframe? Spark Scala

I have a dataframe (renderDF) like this:
+------+---+-------+
| uid|sid|renders|
+------+---+-------+
| david| 0| 0|
|rachel| 1| 0|
|rachel| 3| 0|
|rachel| 2| 0|
| pep| 2| 0|
| pep| 0| 1|
| pep| 1| 1|
|rachel| 0| 1|
| rick| 1| 1|
| ross| 0| 3|
| rick| 0| 3|
+------+---+-------+
I want to use a window function to achieve this result
+------+---+-------+-----------+
| uid|sid|renders|row_number |
+------+---+-------+-----------+
| david| 0| 0| 1 |
|rachel| 1| 0| 2 |
|rachel| 3| 0| 3 |
|rachel| 2| 0| 4 |
| pep| 2| 0| 5 |
| pep| 0| 1| 6 |
| pep| 1| 1| 7 |
|rachel| 0| 1| 8 |
| rick| 1| 1| 9 |
| ross| 0| 3| 10 |
| rick| 0| 3| 11 |
+------+---+-------+-----------+
I tried:
val windowRender = Window.partitionBy('sid).orderBy('Renders)
renderDF.withColumn("row_number", row_number() over windowRender)
But it doesn't do what I need.
Is the partition my problem?
try this:
val dfWithRownumber = renderDF.withColumn("row_number", row_number.over(Window.partitionBy(lit(1)).orderBy("renders")))