Making different groups of sequential null data in PySpark

I have data that contains sequential nulls, and I want to assign those sequential null rows to different groups.
I have data like below:
group_num days     time usage
1         20200101 1    10
1         20200101 2    10
1         20200101 3    null
2         20200102 1    30
2         20200102 2    null
2         20200102 3    null
2         20200102 4    50
2         20200102 5    null
3         20200105 10   null
3         20200105 11   null
3         20200105 12   5
What I want to do is assign each null in usage to a null_group. Consecutive null rows should share the same null group, and nulls that are not consecutive, or that belong to a different group_num, should get different null groups.
group_num days     time usage null_group
1         20200101 1    10
1         20200101 2    10
1         20200101 3    null  group1
2         20200102 1    30
2         20200102 2    null  group2
2         20200102 3    null  group2
2         20200102 4    50
2         20200102 5    null  group3
3         20200105 10   null  group4
3         20200105 11   null  group4
3         20200105 12   5
Alternatively, a new DataFrame that contains only the null rows with their groups would also work:
group_num days     time usage null_group
1         20200101 3    null  group1
2         20200102 2    null  group2
2         20200102 3    null  group2
2         20200102 5    null  group3
3         20200105 10   null  group4
3         20200105 11   null  group4
null_group can also be numeric, like below:
group_num days     time usage null_group
1         20200101 3    null  1
2         20200102 2    null  2
2         20200102 3    null  2
2         20200102 5    null  3
3         20200105 10   null  4
3         20200105 11   null  4
Can anyone help with this problem? I thought I could do this with PySpark's window functions, but it didn't work very well. I have to use PySpark because the original data is too large to handle in plain Python.

This looks a bit complicated, but the last two parts are only there to display the result correctly. The main logic goes like this:
1) calculate the "time" difference "dt" between rows (it needs to be 1 for rows that belong to the same "null_group")
2) generate a "key" from the "usage" and "dt" columns
3) use the trick to label consecutive rows (originally described for pandas: https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/)
4) rename and manipulate the labels to get the desired result
Full solution:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('group_num').orderBy('time')
w_cumsum = w.rowsBetween(Window.unboundedPreceding, 0)

# main logic
df_tmp = (
    df
    .withColumn('dt', F.coalesce(F.col('time') - F.lag('time').over(w), F.lit(1)))
    .withColumn('key', F.concat_ws('-', 'usage', 'dt'))
    .withColumn('prev_key', F.lag('key').over(w))
    .withColumn('diff', F.coalesce((F.col('key') != F.col('prev_key')).cast('int'), F.lit(1)))
    .withColumn('cumsum', F.sum('diff').over(w_cumsum))
    .withColumn('null_group_key',
                F.when(F.isnull('usage'), F.concat_ws('-', 'group_num', 'cumsum')).otherwise(None))
)

# map to generate the required group names
# note: monotonically_increasing_id() is increasing but not guaranteed consecutive,
# so on a multi-partition DataFrame the generated numbers can have gaps
df_map = (
    df_tmp
    .select('null_group_key')
    .distinct()
    .dropna()
    .sort('null_group_key')
    .withColumn('null_group', F.concat(F.lit('group'), F.monotonically_increasing_id() + F.lit(1)))
)

# rename and display as needed
(
    df_tmp
    .join(df_map, 'null_group_key', 'left')
    .fillna('', 'null_group')
    .select('group_num', 'days', 'time', 'usage', 'null_group')
    .sort('group_num', 'time')
    .show()
)
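For reference, a minimal sketch of how the sample data from the question could be built so the solution can be run end-to-end (the column names follow the question's tables; everything else is just scaffolding):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample rows copied from the question's first table
rows = [
    (1, 20200101, 1, 10), (1, 20200101, 2, 10), (1, 20200101, 3, None),
    (2, 20200102, 1, 30), (2, 20200102, 2, None), (2, 20200102, 3, None),
    (2, 20200102, 4, 50), (2, 20200102, 5, None),
    (3, 20200105, 10, None), (3, 20200105, 11, None), (3, 20200105, 12, 5),
]
df = spark.createDataFrame(rows, ['group_num', 'days', 'time', 'usage'])

With this df, the snippet above should reproduce the group1 ... group4 labels from the desired output, as long as the small mapping DataFrame ends up in a single partition (see the note on monotonically_increasing_id() above).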

Related

Create a range of dates in a pyspark DataFrame

I have the following abstracted DataFrame (my original DF has 60+ billion rows):
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected output is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes over a period of time, all the dates between those two dates must carry the values from the previous date (to see this more clearly, look at Id 2).
I know that I can do this in many ways (window functions, a UDF, ...), but my question is: since my original DF has more than 60 billion rows, what is the best approach, performance-wise, for this processing?
I think the best approach (performance-wise) is an inner join (probably with broadcasting). If you are worried about the number of records, I suggest you run them in batches (split by number of records, by date, or even by a random key). The general idea is just to avoid processing everything at once.
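To make that idea concrete, below is a rough sketch (not the answerer's exact code) of one way the join-plus-fill could look, assuming Spark 2.4+ for F.sequence. The sample data mirrors the question, and the batching advice above still applies on top of this:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, '2021-02-01', 10, 2), (1, '2021-02-05', 8, 4), (2, '2021-02-03', 2, 0),
     (1, '2021-02-07', 12, 5), (2, '2021-02-05', 1, 3)],
    ['Id', 'Date', 'Val1', 'Val2'],
).withColumn('Date', F.to_date('Date'))

# calendar of every date each Id should cover
calendar = (
    df.groupBy('Id')
      .agg(F.min('Date').alias('start'), F.max('Date').alias('end'))
      .withColumn('Date', F.explode(F.sequence('start', 'end')))
      .select('Id', 'Date')
)

# join the calendar back and forward-fill the last known values per Id
w = Window.partitionBy('Id').orderBy('Date')
filled = (
    calendar.join(df, ['Id', 'Date'], 'left')
            .withColumn('Val1', F.last('Val1', ignorenulls=True).over(w))
            .withColumn('Val2', F.last('Val2', ignorenulls=True).over(w))
)
filled.orderBy('Id', 'Date').show()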

PySpark UDF does not return expected result

I have a Databricks dataframe with multiple columns and a UDF that generates the contents of a new column, based on values from other columns.
A sample of the original dataset is:
interval_group_id control_value pulse_value device_timestamp
2797895314 5 5 2020-09-12 09:08:44
0 5 5 2020-09-12 09:08:45
0 6 5 2020-09-12 09:08:46
0 0 5 2020-09-12 09:08:47
Now I am trying to add a new column, called group_id, based on some logic with the columns above. My UDF code is:
@udf('integer')
def udf_calculate_group_id_new(interval_group_id, prev_interval_group_id, control_val, pulse_val):
    if interval_group_id != 0:
        return interval_group_id
    elif control_val >= pulse_val and prev_interval_group_id != 0:
        return prev_interval_group_id
    else:
        return -1
The new column is added to my dataframe with:
df = df.withColumn(
    'group_id',
    udf_calculate_group_id_new(
        df.interval_group_id,
        lag(col('interval_group_id')).over(Window.orderBy('device_timestamp')),
        df.control_value,
        df.pulse_value,
    )
)
My expected results are:
interval_group_id control_value pulse_value device_timestamp group_id
2797895314 5 5 2020-09-12 09:08:44 2797895314
0 5 5 2020-09-12 09:08:45 2797895314
0 6 5 2020-09-12 09:08:46 2797895314
0 0 5 2020-09-12 09:08:47 -1
However, the results of adding the new group_id column are:
interval_group_id control_value pulse_value device_timestamp group_id
2797895314 5 5 2020-09-12 09:08:44 null
0 5 5 2020-09-12 09:08:45 null
0 6 5 2020-09-12 09:08:46 -1
0 0 5 2020-09-12 09:08:47 -1
My goal is to propagate the value 2797895314 down the group_id column, based on the conditions mentioned above, but somehow this doesn't happen and the results are incorrectly populated with null and -1.
Is this a bug with UDFs, or is my understanding of how UDFs work incorrect? Or am I just bad at coding?

How to perform a pivot() and write.parquet on each partition of pyspark dataframe?

I have a spark dataframe df as below:
key date val col3
1 1 10 1
1 2 12 1
2 1 5 1
2 2 7 1
3 1 30 2
3 2 20 2
4 1 12 2
4 2 8 2
5 1 0 2
5 2 12 2
I want to:
1) df_pivot = df.groupBy(['date', 'col3']).pivot('key').sum('val')
2) df_pivot.write.parquet('location')
But my data can get really big with millions of unique keys and unique col3.
Is there any way to do the above operations per partition of col3?
Eg: For partition where col3==1, do the pivot and write the parquet
Note: I do not want to use a for loop!

Spark - Grouping 2 Dataframe Rows in only 1 row [duplicate]

I have the following dataframe
id col1 col2 col3 col4
1 1 10 100 A
1 1 20 101 B
1 1 30 102 C
2 1 10 80 D
2 1 20 90 E
2 1 30 100 F
2 1 40 104 G
So, I want to return a new dataframe in which I can have, in only one row, the values for the same (col1, col2), and also create a new column with some operation over both col3 columns, for example:
id(1) col1(1) col2(1) col3(1) col4(1) id(2) col1(2) col2(2) col3(2) col4(2) new_column
1 1 10 100 A 2 1 10 80 D (100-80)*100
1 1 20 101 B 2 1 20 90 E (101-90)*100
1 1 30 102 C 2 1 30 100 F (102-100)*100
- - - - - 2 1 40 104 G -
I tried ordering and grouping by (col1, col2), but the grouping returns a RelationalGroupedDataset on which I cannot do anything apart from aggregation functions, so I would appreciate any help. I'm using Scala 2.11. Thanks!
What about joining the df with itself?
Something like:
df.as("left")
.join(df.as("right"), Seq("col1", "col2"), "outer")
.where($"left.id" =!= $"right.id")

KDB: Aggregate across consecutive rows with common label

I would like to sum across consecutive rows that share the same label. Any very simple ways to do this?
Example: I start with this table...
qty flag
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
... and would like to generate...
qty flag
1 OFF
5 ON
11 OFF
4 ON
One method:
q)show t:flip`qty`flag!(1 3 2 2 9 4;`OFF`ON`ON`OFF`OFF`ON)
qty flag
--------
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
q)show result:select sum qty by d:sums differ flag,flag from t
d flag1| qty
----------| ---
1 OFF | 1
2 ON | 5
3 OFF | 11
4 ON | 4
Then to get it in the format you require:
q)`qty`flag#0!result
qty flag
--------
1 OFF
5 ON
11 OFF
4 ON
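As an aside, this "running sum of differ" idea is essentially the same consecutive-row labelling trick used in the PySpark answer at the top of this page. A rough PySpark sketch of the kdb+ example, assuming a hypothetical idx column that preserves row order (Spark DataFrames have no implicit row order like kdb+ tables):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# idx is only there to define the row order
t = spark.createDataFrame(
    [(0, 1, 'OFF'), (1, 3, 'ON'), (2, 2, 'ON'), (3, 2, 'OFF'), (4, 9, 'OFF'), (5, 4, 'ON')],
    ['idx', 'qty', 'flag'],
)

w = Window.orderBy('idx')
result = (
    t
    # marker for a flag change, equivalent to q's `differ flag`
    .withColumn('change', F.coalesce((F.col('flag') != F.lag('flag').over(w)).cast('int'), F.lit(1)))
    # running sum of the markers labels each run of consecutive equal flags
    .withColumn('run_id', F.sum('change').over(w.rowsBetween(Window.unboundedPreceding, 0)))
    .groupBy('run_id', 'flag')
    .agg(F.sum('qty').alias('qty'))
    .orderBy('run_id')
    .select('qty', 'flag')
)
result.show()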