PySpark: Incremental Row Counter

I am having difficulty implementing this existing answer:
PySpark - get row number for each row in a group
Consider the following:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# create df
df = spark.createDataFrame(sc.parallelize([
[1, 'A', 20220722, 1],
[1, 'A', 20220723, 1],
[1, 'B', 20220724, 2],
[2, 'B', 20220722, 1],
[2, 'C', 20220723, 2],
[2, 'B', 20220724, 3],
]),
['ID', 'State', 'Time', 'Expected'])
# rank
w = Window.partitionBy('State').orderBy('ID', 'Time')
df = df.withColumn('rn', F.row_number().over(w))
df = df.withColumn('rank', F.rank().over(w))
df = df.withColumn('dense', F.dense_rank().over(w))
# view
df.show()
+---+-----+--------+--------+---+----+-----+
| ID|State| Time|Expected| rn|rank|dense|
+---+-----+--------+--------+---+----+-----+
| 1| A|20220722| 1| 1| 1| 1|
| 1| A|20220723| 1| 2| 2| 2|
| 1| B|20220724| 2| 1| 1| 1|
| 2| B|20220722| 1| 2| 2| 2|
| 2| B|20220724| 3| 3| 3| 3|
| 2| C|20220723| 2| 1| 1| 1|
+---+-----+--------+--------+---+----+-----+
How can I get the expected value and also sort the dates correctly such that they are ascending?

You restart your count for each new ID value, which means the ID field is your partition field, not State.
Here is an approach using the sum window function.
import sys
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the question's dataframe (df above)
data_sdf. \
    withColumn('st_notsame',
               func.coalesce(func.col('state') != func.lag('state').over(wd.partitionBy('id').orderBy('time')),
                             func.lit(True)).cast('int')
               ). \
    withColumn('rank',
               func.sum('st_notsame').over(wd.partitionBy('id').orderBy('time', 'state').rowsBetween(-sys.maxsize, 0))
               ). \
    show()
# +---+-----+--------+--------+----------+----+
# | id|state| time|expected|st_notsame|rank|
# +---+-----+--------+--------+----------+----+
# | 1| A|20220722| 1| 1| 1|
# | 1| A|20220723| 1| 0| 1|
# | 1| B|20220724| 2| 1| 2|
# | 2| B|20220722| 1| 1| 1|
# | 2| C|20220723| 2| 1| 2|
# | 2| B|20220724| 3| 1| 3|
# +---+-----+--------+--------+----------+----+
You first flag all the consecutive occurrences of the state as 0 and the others as 1 - this enables you to do a running sum.
Then use the sum window with an unbounded lookback for each ID to get your desired ranking.
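As a side note, -sys.maxsize works, but an equivalent and arguably clearer way to express the unbounded frame is Window.unboundedPreceding. A minimal sketch of the same running sum, assuming data_sdf already has the st_notsame column from the answer above:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# same running sum as above, with an explicit unbounded-preceding frame
w = wd.partitionBy('id').orderBy('time', 'state') \
      .rowsBetween(wd.unboundedPreceding, wd.currentRow)

data_sdf.withColumn('rank', func.sum('st_notsame').over(w)).show()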

Related

Get groups with duplicated values in PySpark

For example, if we have the following dataframe:
df = spark.createDataFrame([['a', 1], ['a', 1],
['b', 1], ['b', 2],
['c', 2], ['c', 2], ['c', 2]],
['col1', 'col2'])
+----+----+
|col1|col2|
+----+----+
| a| 1|
| a| 1|
| b| 1|
| b| 2|
| c| 2|
| c| 2|
| c| 2|
+----+----+
I want to mark groups based on col1 where values in col2 repeat themselves. I have an idea to find the difference between the group size and the count of distinct values:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

window = Window.partitionBy('col1')
df.withColumn('col3', F.count('col2').over(window)).\
    withColumn('col4', F.approx_count_distinct('col2').over(window)).\
    select('col1', 'col2', (F.col('col3') - F.col('col4')).alias('col3')).show()
Maybe you have a better solution. My expected output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| a| 1| 1|
| a| 1| 1|
| b| 1| 0|
| b| 2| 0|
| c| 2| 2|
| c| 2| 2|
| c| 2| 2|
+----+----+----+
As you can see all groups where col3 is equal to zero have only unique values in col2.
For this, you can compute the statistics grouped by col1 and col2 together: count the rows in each (col1, col2) partition and subtract 1.
df = df.withColumn('col3', F.expr('count(*) over (partition by col1,col2) - 1'))
df.show(truncate=False)
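For reference, here is the same logic expressed with the DataFrame window API instead of a SQL expression; a minimal sketch using the df from the question:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# count the rows in each (col1, col2) partition and subtract 1,
# same as the count(*) over (partition by col1, col2) - 1 expression above
w = Window.partitionBy('col1', 'col2')
df = df.withColumn('col3', F.count(F.lit(1)).over(w) - 1)
df.show(truncate=False)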

How to compare value of one row with all the other rows in PySpark on grouped values

Problem statement
Consider the following data (see code generation at the bottom)
+-----+-----+-------+--------+
|index|group|low_num|high_num|
+-----+-----+-------+--------+
| 0| 1| 1| 1|
| 1| 1| 2| 2|
| 2| 1| 3| 3|
| 3| 2| 1| 3|
+-----+-----+-------+--------+
Then, for a given index, I want to count how many of the low_num values in its group that row's high_num is greater than.
For instance, consider the second row, with index: 1. Index: 1 is in group: 1 and its high_num is 2. That high_num is greater than the low_num on index 0, equal to the low_num on index 1, and smaller than the low_num on index 2. So the high_num of index: 1 is greater than a low_num across the group exactly once, and I want the value in the answer column to be 1.
Dataset with desired output
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
Dataset generation code
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .getOrCreate()
)
## Example df
## Note the inclusion of "desired" which is the desired output.
df = spark.createDataFrame(
    [
        (0, 1, 1, 1, 0),
        (1, 1, 2, 2, 1),
        (2, 1, 3, 3, 2),
        (3, 2, 1, 3, 1)
    ],
    schema=["index", "group", "low_num", "high_num", "desired"]
)
Pseudocode that might have solved the problem
The pseudocode might look like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w_spec = Window.partitionBy("group").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)

## F.collect_list_when does not exist
## F.current_col does not exist
## Probably wouldn't work like this anyway
ddf = df.withColumn("Counts",
    F.size(F.collect_list_when(
        F.current_col("high_number") > F.col("low_number"), 1
    ).otherwise(None).over(w_spec))
)
You can do a filter on the collect_list, and check its size:
import pyspark.sql.functions as F
df2 = df.withColumn(
'desired',
F.expr('size(filter(collect_list(low_num) over (partition by group), x -> x < high_num))')
)
df2.show()
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
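If you prefer to avoid SQL expressions entirely, the same count can also be obtained with a plain self-join and aggregation. A rough sketch, assuming the df from the generation code above is in scope (the desired column is only carried through for comparison):
import pyspark.sql.functions as F

# one row per (group, low_num) to compare against
others = df.select(F.col("group").alias("g"), F.col("low_num").alias("other_low"))

# for each row, count how many low_num values in the same group are below its high_num
counts = (
    df.join(others, on=[df["group"] == others["g"], others["other_low"] < df["high_num"]], how="left")
      .groupBy("index", "group", "low_num", "high_num", "desired")
      .agg(F.count("other_low").alias("counted"))
      .orderBy("index")
)
counts.show()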

Spark Dataframe: Group and rank rows on a certain column value

I am trying to rank rows into groups based on the "ID" column, whose numbering counts up from 1 to some maximum and then resets to 1.
So, the first three rows have a continuous numbering on "ID"; hence these should be grouped with group rank = 1. Rows four and five are in another group, group rank = 2.
The rows are sorted by the "rownum" column. I am aware of the row_number window function, but I don't think I can apply it to this use case as there is no constant window. I can only think of looping through each row in the dataframe, but I am not sure how I can update a column when the numbering resets to 1.
val df = Seq(
(1, 1 ),
(2, 2 ),
(3, 3 ),
(4, 1),
(5, 2),
(6, 1),
(7, 1),
(8, 2)
).toDF("rownum", "ID")
df.show()
The expected result is the group_rank_of_ID column shown in the answer output below.
You can do it with two window functions: the first one flags where a new group starts, the second one calculates a running sum:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("increase", $"ID" > lag($"ID", 1).over(Window.orderBy($"rownum")))
  .withColumn("group_rank_of_ID", sum(when($"increase", lit(0)).otherwise(lit(1))).over(Window.orderBy($"rownum")))
  .drop($"increase")
  .show()
gives:
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
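Since the rest of this thread is PySpark, here is a rough Python translation of the same two-step approach, assuming an equivalent PySpark DataFrame named df with columns rownum and ID:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.orderBy("rownum")

# flag rows where ID does not increase (the lag is null on the first row,
# so the when(...) falls through to 1 there), then take a running sum
result = (
    df.withColumn("increase", F.col("ID") > F.lag("ID", 1).over(w))
      .withColumn("group_rank_of_ID",
                  F.sum(F.when(F.col("increase"), F.lit(0)).otherwise(F.lit(1))).over(w))
      .drop("increase")
)
result.show()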
As @Prithvi noted, we can use lag here.
The tricky part is that, in order to use a window function such as lag, we need to at least provide an ordering.
Consider
val prevID = lag('ID, 1, -1) over Window.orderBy('rownum)
val isNewGroup = 'ID <= prevID cast "integer"
val group_rank_of_ID = sum(isNewGroup) over Window.orderBy('rownum)
/* you can try
df.withColumn("intermediate", prevID).show
// ^^^^^^^^^^^^-- can be `isNewGroup`, or other vals
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 0|
| 2| 2| 0|
| 3| 3| 0|
| 4| 1| 1|
| 5| 2| 1|
| 6| 1| 2|
| 7| 1| 3|
| 8| 2| 3|
+------+---+----------------+
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID + 1).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
*/

How to automatically drop constant columns in pyspark?

I have a Spark dataframe in PySpark and I need to drop all constant columns from it. Since I don't know which columns are constant, I cannot manually unselect them; I need an automatic procedure. I am surprised I was not able to find a simple solution on Stack Overflow.
Example:
import pandas as pd
import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
d = {'col1': [1, 2, 3, 4, 5],
     'col2': [1, 2, 3, 4, 5],
     'col3': [0, 0, 0, 0, 0],
     'col4': [0, 0, 0, 0, 0]}
df_panda = pd.DataFrame(data=d)
df_spark = spark.createDataFrame(df_panda)
df_spark.show()
Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 1| 0| 0|
| 2| 2| 0| 0|
| 3| 3| 0| 0|
| 4| 4| 0| 0|
| 5| 5| 0| 0|
+----+----+----+----+
Desired output:
+----+----+
|col1|col2|
+----+----+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
+----+----+
What is the best way to automatically drop constant columns in pyspark?
Count distinct values in each column first and then drop columns that contain only one distinct value:
import pyspark.sql.functions as f
cnt = df_spark.agg(*(f.countDistinct(c).alias(c) for c in df_spark.columns)).first()
cnt
# Row(col1=5, col2=5, col3=1, col4=1)
df_spark.drop(*[c for c in cnt.asDict() if cnt[c] == 1]).show()
+----+----+
|col1|col2|
+----+----+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
+----+----+
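For reuse, the same answer can be wrapped in a small helper; a minimal sketch (the function name drop_constant_columns is just illustrative):
import pyspark.sql.functions as f

def drop_constant_columns(sdf):
    # count distinct values per column, then keep only columns with more than one
    cnt = sdf.agg(*(f.countDistinct(c).alias(c) for c in sdf.columns)).first()
    return sdf.select(*[c for c in sdf.columns if cnt[c] > 1])

drop_constant_columns(df_spark).show()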

Pass Distinct value of one Dataframe into another Dataframe

I want to take the distinct values of a column from DataFrame A and pass them into DataFrame B's explode
function to create repeated rows (in DataFrame B) for each distinct value.
distinctSet = targetDf.select('utilityId').distinct()
utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId()))
Function:
import pyspark.sql.functions as psf
from pyspark.sql.types import ArrayType, LongType

assign_utilityId = psf.udf(
    lambda id: [x for x in id],
    ArrayType(LongType()))
How do I pass the distinctSet values to assign_utilityId?
Update
+---------+
|utilityId|
+---------+
| 101|
| 101|
| 102|
+---------+
+-----+------+--------+
|index|status|timeSlot|
+-----+------+--------+
| 0| SUN| 0|
| 0| SUN| 1|
I want to take the unique values from DataFrame 1 and create a new column in DataFrame 2, like this:
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
| 0| SUN| 0| 101|
| 0| SUN| 1| 101|
| 0| SUN| 0| 102|
| 0| SUN| 1| 102|
+-----+------+--------+---------+
We don't need a udf for this. I have tried it with some sample input, please check:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
| 1|
| 2|
| 3|
| 2|
| 3|
+----+
>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 2| 3|
| 3| 4|
+----+----+
>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]
>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2| col3|
+----+----+---------+
| 1| 2|[1, 2, 3]|
| 2| 3|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
| 1| 2| 1|
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 1|
| 2| 3| 2|
| 2| 3| 3|
| 3| 4| 1|
| 3| 4| 2|
| 3| 4| 3|
+----+----+--------+
df = sqlContext.createDataFrame(sc.parallelize([(101,), (101,), (102,)]), ['utilityId'])
df2 = sqlContext.createDataFrame(sc.parallelize([(0, 'SUN', 0), (0, 'SUN', 1)]), ['index', 'status', 'timeSlot'])
rdf = df.distinct()
# a join with no condition is a cartesian (cross) join; recent Spark versions
# may require df2.crossJoin(rdf) or enabling spark.sql.crossJoin.enabled
>>> df2.join(rdf).show()
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
| 0| SUN| 0| 101|
| 0| SUN| 0| 102|
| 0| SUN| 1| 101|
| 0| SUN| 1| 102|
+-----+------+--------+---------+
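Mapping this back to the question's own dataframe names (targetDf and utilisationFrequencyTable, as used in the question), a sketch of the same cross-join idea:
# take the distinct utilityId values and attach every one of them
# to every row of the frequency table via a cross join
distinct_ids = targetDf.select('utilityId').distinct()
result = utilisationFrequencyTable.crossJoin(distinct_ids)
result.show()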