I have a Databricks dataframe with multiple columns and a UDF that generates the contents of a new column, based on values from other columns.
A sample of the original dataset is:
interval_group_id control_value pulse_value device_timestamp
2797895314 5 5 2020-09-12 09:08:44
0 5 5 2020-09-12 09:08:45
0 6 5 2020-09-12 09:08:46
0 0 5 2020-09-12 09:08:47
Now I am trying to add a new column, called group_id, based on some logic with the columns above. My UDF code is:
from pyspark.sql.functions import udf

@udf('integer')
def udf_calculate_group_id_new(interval_group_id, prev_interval_group_id, control_val, pulse_val):
    if interval_group_id != 0:
        return interval_group_id
    elif control_val >= pulse_val and prev_interval_group_id != 0:
        return prev_interval_group_id
    else:
        return -1
And the new column being added to my dataframe is done with:
from pyspark.sql.functions import col, lag
from pyspark.sql.window import Window

df = df.withColumn(
    'group_id',
    udf_calculate_group_id_new(
        df.interval_group_id,
        lag(col('interval_group_id')).over(Window.orderBy('device_timestamp')),
        df.control_value,
        df.pulse_value
    )
)
My expected results are:
interval_group_id control_value pulse_value device_timestamp group_id
2797895314 5 5 2020-09-12 09:08:44 2797895314
0 5 5 2020-09-12 09:08:45 2797895314
0 6 5 2020-09-12 09:08:46 2797895314
0 0 5 2020-09-12 09:08:47 -1
However, the results of adding the new group_id column are:
interval_group_id control_value pulse_value device_timestamp group_id
2797895314 5 5 2020-09-12 09:08:44 null
0 5 5 2020-09-12 09:08:45 null
0 6 5 2020-09-12 09:08:46 -1
0 0 5 2020-09-12 09:08:47 -1
My goal is to propagate the value 2797895314 down the group_id column based on the conditions above, but somehow this doesn't happen and the results are incorrectly populated with null and -1.
Is this a bug with UDFs, or is my expectation of how UDFs work incorrect? Or am I just bad at coding?
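For what it's worth, the UDF's branch logic can be traced row by row in plain Python (a hand trace, not Spark; prev_interval_group_id is None for the first row, mimicking lag, and a None guard is added for that case). Two things stand out: rows where lag sees the raw previous interval_group_id of 0 fall through to -1 (the UDF never sees the freshly computed group_id), and 2797895314 does not fit in a signed 32-bit integer, which would explain the nulls when the UDF's return type is declared as 'integer':

```python
def calculate_group_id(interval_group_id, prev_interval_group_id, control_val, pulse_val):
    # same branch logic as the UDF in the question (None guard added for the first row)
    if interval_group_id != 0:
        return interval_group_id
    elif prev_interval_group_id is not None and control_val >= pulse_val and prev_interval_group_id != 0:
        return prev_interval_group_id
    else:
        return -1

# sample rows: (interval_group_id, control_value, pulse_value)
rows = [(2797895314, 5, 5), (0, 5, 5), (0, 6, 5), (0, 0, 5)]

results, prev = [], None
for igid, control, pulse in rows:
    results.append(calculate_group_id(igid, prev, control, pulse))
    prev = igid  # lag() hands over the raw previous interval_group_id, not the computed group_id

print(results)                 # [2797895314, 2797895314, -1, -1]
print(2797895314 > 2**31 - 1)  # True: too large for a 32-bit 'integer' return type
```

The trace matches the observed output once the two overflowing values are nulled, and shows why row 3 cannot receive 2797895314 from a one-row lag.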
Related
I have data containing sequential nulls, and I want to assign those sequential null rows to distinct groups.
I have data like below:
group_num  days      time  usage
1          20200101  1     10
1          20200101  2     10
1          20200101  3     null
2          20200102  1     30
2          20200102  2     null
2          20200102  3     null
2          20200102  4     50
2          20200102  5     null
3          20200105  10    null
3          20200105  11    null
3          20200105  12    5
What I want to do with this data is turn the null rows of usage into groups.
Consecutive null rows should share the same null group, while nulls that are not consecutive, or that have a different group_num, should get different null groups.
group_num  days      time  usage  null_group
1          20200101  1     10
1          20200101  2     10
1          20200101  3     null   group1
2          20200102  1     30
2          20200102  2     null   group2
2          20200102  3     null   group2
2          20200102  4     50
2          20200102  5     null   group3
3          20200105  10    null   group4
3          20200105  11    null   group4
3          20200105  12    5
Or maybe make a new dataset that contains only the null rows, each with its group:
group_num  days      time  usage  null_group
1          20200101  3     null   group1
2          20200102  2     null   group2
2          20200102  3     null   group2
2          20200102  5     null   group3
3          20200105  10    null   group4
3          20200105  11    null   group4
null_group can also be changed to something numeric, like below:
group_num  days      time  usage  null_group
1          20200101  3     null   1
2          20200102  2     null   2
2          20200102  3     null   2
2          20200102  5     null   3
3          20200105  10    null   4
3          20200105  11    null   4
Can anyone help with this problem? I thought I could do it with PySpark's window functions, but it didn't work very well. I think I have to use PySpark because the original data is too large to handle in plain Python.
This looks a bit complicated, but the last two parts are just for displaying the result correctly. The main logic goes like this:
calculate the "time" difference "dt" between rows (it needs to be 1 for rows in the same "null_group")
generate a "key" from the "usage" and "dt" columns
use the trick for labelling consecutive rows (originally for pandas: https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/)
rename and manipulate the labels to get the desired result
Full solution:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('group_num').orderBy('time')
w_cumsum = w.rowsBetween(Window.unboundedPreceding, 0)

# main logic
df_tmp = (
    df
    # "dt": step to the previous row's time (1 for the first row of each partition)
    .withColumn('dt', F.coalesce(F.col('time') - F.lag('time').over(w), F.lit(1)))
    # rows stay in the same run only while "usage" and "dt" both repeat
    .withColumn('key', F.concat_ws('-', 'usage', 'dt'))
    .withColumn('prev_key', F.lag('key').over(w))
    .withColumn('diff', F.coalesce((F.col('key') != F.col('prev_key')).cast('int'), F.lit(1)))
    # running sum of the change flags labels each consecutive run
    .withColumn('cumsum', F.sum('diff').over(w_cumsum))
    .withColumn('null_group_key',
                F.when(F.isnull('usage'), F.concat_ws('-', 'group_num', 'cumsum')).otherwise(None))
)

# map to generate the required group names
df_map = (
    df_tmp
    .select('null_group_key')
    .distinct()
    .dropna()
    .sort('null_group_key')
    .withColumn('null_group', F.concat(F.lit('group'), F.monotonically_increasing_id() + F.lit(1)))
)

# rename and display as needed
(
    df_tmp
    .join(df_map, 'null_group_key', 'left')
    .fillna('', 'null_group')
    .select('group_num', 'days', 'time', 'usage', 'null_group')
    .sort('group_num', 'time')
    .show()
)
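The consecutive-row labelling trick at the heart of this (compare each key with the previous one, then take a running sum of the change flags) can be sketched in plain Python, independent of Spark. The rows below are one group_num partition from the sample data, already ordered by time; they mirror the dt/key/cumsum columns above:

```python
# one partition's rows, already ordered by time: (time, usage)
rows = [(1, 30), (2, None), (3, None), (4, 50), (5, None)]

labels = []
prev_key, prev_time, run_id = None, None, 0
for time, usage in rows:
    dt = 1 if prev_time is None else time - prev_time
    key = (usage, dt)       # rows share a run only while usage and step both repeat
    if key != prev_key:
        run_id += 1         # running sum of "key changed" flags
    labels.append(run_id)
    prev_key, prev_time = key, time

print(labels)  # [1, 2, 2, 3, 4]
```

The two consecutive nulls share label 2, while the isolated null at time 5 gets its own label 4, which is exactly the group2/group3 split in the expected output.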
I have the following abstracted DataFrame (my original DF has 60+ billion rows):
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected output is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes between two dates, all the dates in between must carry the values from the previous date (to see this clearly, look at Id 2).
I know I can do this in many ways (window functions, UDFs, ...), but my doubt is: since my original DF has more than 60 billion rows, what is the best-performing approach for this processing?
I think the best approach (performance-wise) is an inner join (probably with broadcasting). If you are worried about the number of records, I suggest running them in batches (split by record count, by date, or even by a random key). The general idea is just to avoid running everything at once.
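As an illustration of the target transformation itself (independent of the join strategy), here is a plain-Python sketch that forward-fills the missing dates per Id using the question's sample rows; it assumes the rows are already sorted by (Id, Date):

```python
from datetime import date, timedelta

# sample rows from the question: (Id, Date, Val1, Val2), sorted by (Id, Date)
rows = [
    (1, date(2021, 2, 1), 10, 2),
    (1, date(2021, 2, 5), 8, 4),
    (1, date(2021, 2, 7), 12, 5),
    (2, date(2021, 2, 3), 2, 0),
    (2, date(2021, 2, 5), 1, 3),
]

filled = []
for i, (id_, d, v1, v2) in enumerate(rows):
    filled.append((id_, d, v1, v2))
    # carry the current values forward until the next observation for the same Id
    if i + 1 < len(rows) and rows[i + 1][0] == id_:
        day = d + timedelta(days=1)
        while day < rows[i + 1][1]:
            filled.append((id_, day, v1, v2))
            day += timedelta(days=1)

filled.sort()
for r in filled:
    print(r)  # 10 rows, matching the expected output
```

In Spark the same effect is usually obtained by joining against a generated date range and filling with the last observed values, which is where the broadcast-join suggestion above comes in.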
I have a spark dataframe df as below:
key  date  val  col3
1    1     10   1
1    2     12   1
2    1     5    1
2    2     7    1
3    1     30   2
3    2     20   2
4    1     12   2
4    2     8    2
5    1     0    2
5    2     12   2
I want to:
1) df_pivot = df.groupBy(['date', 'col3']).pivot('key').sum('val')
2) df_pivot.write.parquet('location')
But my data can get really big, with millions of unique keys and unique col3 values.
Is there any way to do the above operations per partition of col3?
Eg: For partition where col3==1, do the pivot and write the parquet
Note: I do not want to use a for loop!
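For reference, the pivot in step 1 can be simulated on the sample data in plain Python. This is only a sketch of what groupBy(['date', 'col3']).pivot('key').sum('val') produces, not Spark code:

```python
from collections import defaultdict

# sample rows: (key, date, val, col3)
rows = [
    (1, 1, 10, 1), (1, 2, 12, 1),
    (2, 1, 5, 1),  (2, 2, 7, 1),
    (3, 1, 30, 2), (3, 2, 20, 2),
    (4, 1, 12, 2), (4, 2, 8, 2),
    (5, 1, 0, 2),  (5, 2, 12, 2),
]

# group by (date, col3); each distinct key becomes a column holding sum(val)
pivot = defaultdict(dict)
for key, dt, val, col3 in rows:
    cell = pivot[(dt, col3)]
    cell[key] = cell.get(key, 0) + val

for group in sorted(pivot):
    print(group, pivot[group])
```

Note that the set of pivot columns depends only on the keys present within each col3 partition, which is why a per-partition pivot produces narrower outputs than pivoting the whole dataframe at once.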
This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I have the following dataframe
id col1 col2 col3 col4
1 1 10 100 A
1 1 20 101 B
1 1 30 102 C
2 1 10 80 D
2 1 20 90 E
2 1 30 100 F
2 1 40 104 G
So, I want to return a new dataframe in which I have, in only one row, the values for the same (col1, col2) pair, and also create a new column with some operation over both col3 columns, for example:
id(1) col1(1) col2(1) col3(1) col4(1) id(2) col1(2) col2(2) col3(2) col4(2) new_column
1 1 10 100 A 2 1 10 80 D (100-80)*100
1 1 20 101 B 2 1 20 90 E (101-90)*100
1 1 30 102 C 2 1 30 100 F (102-100)*100
- - - - - 2 1 40 104 G -
I tried ordering and grouping by (col1, col2), but grouping returns a RelationalGroupedDataset on which I cannot do anything apart from aggregation functions, so I will appreciate any help. I'm using Scala 2.11. Thanks!
What about joining the df with itself?
Something like:
df.as("left")
.join(df.as("right"), Seq("col1", "col2"), "outer")
.where($"left.id" =!= $"right.id")
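To see what the self-join produces on the sample data, here is the same idea sketched in plain Python (rather than Scala, purely for illustration): pair rows on (col1, col2) across different ids and compute (col3_left - col3_right) * 100. An id-ordering condition keeps each pair once, whereas the =!= condition above keeps both directions:

```python
# sample rows: (id, col1, col2, col3, col4)
rows = [
    (1, 1, 10, 100, 'A'), (1, 1, 20, 101, 'B'), (1, 1, 30, 102, 'C'),
    (2, 1, 10, 80, 'D'), (2, 1, 20, 90, 'E'),
    (2, 1, 30, 100, 'F'), (2, 1, 40, 104, 'G'),
]

# join the dataset with itself on (col1, col2), keeping pairs with different ids
pairs = [
    (left, right, (left[3] - right[3]) * 100)
    for left in rows
    for right in rows
    if (left[1], left[2]) == (right[1], right[2]) and left[0] < right[0]
]

for left, right, new_column in pairs:
    print(left, right, new_column)
```

Row G has no (col1, col2) match on the other id, which is why an outer join (as in the Scala snippet) is needed to keep it as the dash row of the expected output.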
If I have an input as below:
sno name time
1 hello 1
1 hello 2
1 hai 3
1 hai 4
1 hai 5
1 how 6
1 how 7
1 are 8
1 are 9
1 how 10
1 how 11
1 are 12
1 are 13
1 are 14
I want to combine the fields having similar values in name as the below output format:
sno name timestart timeend
1 hello 1 2
1 hai 3 5
1 how 6 7
1 are 8 9
1 how 10 11
1 are 12 14
The input is sorted by time, and only records that have the same name over consecutive time intervals must be merged.
I am trying to do this using Spark, but since I am new to Spark I cannot figure out a way to do it with Spark functions. Any suggestions on the approach will be appreciated.
I tried writing a user-defined function and applying maps to the data frame, but I could not come up with the right logic for the function.
PS: I am trying to do this using Scala Spark.
One way to do so would be to use a plain SQL query.
Let's say df is your input dataframe.
val viewName = s"dataframe"
df.createOrReplaceTempView(viewName)

def query(viewName: String): String =
  s"SELECT sno, name, MIN(time) AS timestart, MAX(time) AS timeend FROM $viewName GROUP BY sno, name"

spark.sql(query(viewName))
You can of course use the DataFrame API instead. That would be something like:
df.groupBy($"sno", $"name")
  .agg(min($"time").as("timestart"), max($"time").as("timeend"))
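One caveat: a plain GROUP BY on name merges all rows with the same name, while the expected output keeps each consecutive run separate (two "how" groups and two "are" groups). Merging only consecutive runs needs a gaps-and-islands step; a plain-Python sketch of that logic on the sample data (illustrative only, not Spark code):

```python
# sample rows, already sorted by time: (sno, name, time)
rows = [
    (1, 'hello', 1), (1, 'hello', 2),
    (1, 'hai', 3), (1, 'hai', 4), (1, 'hai', 5),
    (1, 'how', 6), (1, 'how', 7),
    (1, 'are', 8), (1, 'are', 9),
    (1, 'how', 10), (1, 'how', 11),
    (1, 'are', 12), (1, 'are', 13), (1, 'are', 14),
]

# collapse each run of consecutive identical (sno, name) rows into one record
merged = []
for sno, name, time in rows:
    if merged and merged[-1][0] == sno and merged[-1][1] == name:
        merged[-1] = (sno, name, merged[-1][2], time)  # extend the run: bump timeend
    else:
        merged.append((sno, name, time, time))         # start a new run

for rec in merged:
    print(rec)  # (sno, name, timestart, timeend) -- matches the expected output
```

In Spark SQL the equivalent run label is usually built with something like row_number() differences or a running sum over a "name changed" flag, and then grouped on that label together with sno and name.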