aggregating with a condition in groupby spark dataframe - scala

I have a dataframe
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1 12 13 12 13 1 [1.5,3.5] 4 4.5
1 12 13 12 13 1 null 4.5 5
1 12 13 12 13 1 null 5 5.5
1 12 13 12 13 1 null 5.5 6
1 13 14 12 13 2 null 6 6.5
1 13 14 13 14 2 null 6.5 null
2 13 14 13 14 2 [0.5,1.5] 2.5 3.5
2 13 14 13 14 2 null 3.5 4
2 13 14 13 14 2 null 4 null
So I want to apply a condition while aggregating in groupBy: if we group by col("id") and col("detector"), I want to check whether lag_interval in that group has any non-null value. If it does, the aggregation should produce two columns, min("lag_interval.col1") and max("lead_gpsdt").
If the condition is not met, I want
min("gpsdt"), max("lead_gpsdt")
This is the approach I am using to get the data, and I want to add the condition to it:
df.groupBy("detector","id").agg(first("lat-long").alias("start_coordinate"),
last("lat-long").alias("end_coordinate"),struct(min("gpsdt"), max("lead_gpsdt")).as("interval"))
Expected output:
id interval start_coordinate end_coordinate
1 [1.5,6] [12,13] [13,14]
1 [6,6.5] [13,14] [13,14]
2 [0.5,4] [13,14] [13,14]
For more explanation:
If we look at one of the groups that groupBy("id","detector") takes out: when any value of col("lag_interval") in that group is non-null, we need an aggregation like min(lag_interval.col1), max(lead_gpsdt).
This condition applies to the following set of data:
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1 12 13 12 13 1 [1.5,3.5] 4 4.5
1 12 13 12 13 1 null 4.5 5
1 12 13 12 13 1 null 5 5.5
1 12 13 12 13 1 null 5.5 6
And if all values of col("lag_interval") in that group are null, we need the aggregation output to be
min("gpsdt"), max("lead_gpsdt")
This condition applies to the following set of data:
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1 13 14 12 13 2 null 6 6.5
1 13 14 13 14 2 null 6.5 null

The conditional requirement you have can be solved by using the simple when built-in function, as suggested below:
import org.apache.spark.sql.functions._

df.groupBy("id", "detector")
  .agg(
    struct(
      // min("lag_interval.col1") is null only when every lag_interval in the
      // group is null, so fall back to min("gpsdt") in that case
      when(isnull(min("lag_interval.col1")), min("gpsdt"))
        .otherwise(min("lag_interval.col1")).as("min"),
      max("lead_gpsdt").as("max")
    ).as("interval")
  )
which should give you the following output:
+---+--------+----------+
|id |detector|interval |
+---+--------+----------+
|2 |2 |[0.5, 4.0]|
|1 |2 |[6.0, 6.5]|
|1 |1 |[1.5, 6.0]|
+---+--------+----------+
And I guess you must already have an idea of how to do first("lat-long").alias("start_coordinate") and last("lat-long").alias("end_coordinate"), as you have done.
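For completeness, here is a rough sketch of folding the coordinate columns into the same aggregation; it builds the coordinates from the lat and long columns with struct (assuming that is what your "lat-long" column holds), and keep in mind that first and last are not deterministic unless the rows inside each group are already in the order you expect.
import org.apache.spark.sql.functions._

df.groupBy("id", "detector")
  .agg(
    // first/last depend on the row order within each group
    first(struct(col("lat"), col("long"))).as("start_coordinate"),
    last(struct(col("lat"), col("long"))).as("end_coordinate"),
    struct(
      when(isnull(min("lag_interval.col1")), min("gpsdt"))
        .otherwise(min("lag_interval.col1")).as("min"),
      max("lead_gpsdt").as("max")
    ).as("interval")
  )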
I hope the answer is helpful

Related

PySpark Identify Duplicates based on Partial Matches from a Column

I am looking for an approach to rank the duplicates based on a partial match in a PySpark DataFrame column and to tag the id of the first match in a newly generated column.
Input Dataframe:
id  name             class
1   Roger Fernandes  12
2   Kevin Kingsely   11
3   Fernandes Roger  13
4   Jack Sparrow     14
5   Roger thinker    16
6   Ro seman         17
Output Dataframe:
id  name             class  duplicate  duplicate_id
1   Roger Ferna      12     yes        1
2   Kevin Kingsely   11     no         None
3   Ferna Roger      13     yes        1
4   Jack Sparrow     14     no         None
5   Roger think      16     yes        1
6   Ro seman         17     no         None
Note: we can consider anything with at least a 50% partial match to be a duplicate.
Tried:
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

# order_by_column is whatever ordering is intended; partitioning by the full
# name only groups exact duplicates, not partial matches
partition_dataframe = Window.partitionBy(["name"]).orderBy(order_by_column)
data_frame = data_frame.withColumn("Duplicated_Rank", rank().over(partition_dataframe))
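The window above only ranks exact name duplicates, so it cannot catch the partial matches. As a rough sketch of one possible direction (written in Scala to match the main question on this page; PySpark has the same functions, and array_intersect needs Spark 2.4+), the names can be tokenized and self-joined, calling two rows a match when at least half of a row's tokens appear in an earlier row. The DataFrame name df and the threshold are assumptions, and flagging the first occurrence itself (as in the expected output) would still need one extra step:
import org.apache.spark.sql.functions._

// Tokenize the names once; "tokens" is an added helper column.
val tokenized = df.withColumn("tokens", split(lower(col("name")), "\\s+"))

// Self-join each row against earlier rows and keep pairs where at least half
// of the row's tokens also appear in the earlier name (threshold is adjustable).
val matches = tokenized.alias("a")
  .join(tokenized.alias("b"), col("b.id") < col("a.id"))
  .filter(size(array_intersect(col("a.tokens"), col("b.tokens"))) * 2 >= size(col("a.tokens")))
  .groupBy(col("a.id").as("id"))
  .agg(min(col("b.id")).as("duplicate_id"))

// Tag every row with the id of its earliest partial match, if any.
val result = df.join(matches, Seq("id"), "left")
  .withColumn("duplicate", when(col("duplicate_id").isNotNull, "yes").otherwise("no"))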

Create a range of dates in a pyspark DataFrame

I have the following abstracted DataFrame (my original DF has more than 60 billion rows):
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected output is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes over a period of time, all the dates between those two dates must carry the values from the earlier date. (To see it more clearly, look at Id 2.)
I know that I can do this in many ways (window function, UDF, ...), but my doubt is: since my original DF has more than 60 billion rows, what is the best approach for this processing?
I think the best approach (performance-wise) is to perform an inner join (probably with broadcasting). If you are worried about the number of records, I suggest you run them in batches (split by number of records, by date, or even by a random key). The general idea is just to avoid running everything at once.
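A rough sketch of that join idea (written in Scala to match the main question on this page; PySpark has the same DataFrame calls). It assumes the input DataFrame is called df, that Date is a DateType column, and Spark 2.4+ for sequence:
import org.apache.spark.sql.functions._

// One calendar row per Id and per day between that Id's first and last Date.
val perIdCalendar = df
  .groupBy("Id")
  .agg(min("Date").as("start"), max("Date").as("end"))
  .select(col("Id"), explode(sequence(col("start"), col("end"))).as("cal_date"))

// Join every calendar day to all records on or before it, then keep only the
// most recent record per (Id, day): max over a struct picks the latest Date.
val filled = perIdCalendar.alias("cal")
  .join(df.alias("src"),
    col("cal.Id") === col("src.Id") && col("src.Date") <= col("cal.cal_date"))
  .groupBy(col("cal.Id"), col("cal.cal_date"))
  .agg(max(struct(col("src.Date"), col("src.Val1"), col("src.Val2"))).as("latest"))
  .select(col("Id"), col("cal_date").as("Date"),
    col("latest.Val1").as("Val1"), col("latest.Val2").as("Val2"))
Broadcasting only helps if one side of the join stays small; otherwise running it in batches of Ids or date ranges, as suggested above, keeps each job bounded.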

Postgres- select all rows matching the first 10 distinct ids

Have a happy new year!
I'm looking to keep all rows in my table for the first 10 distinct IDs, not just the first 10 rows ordered by id.
I don't know how to do this, though. Your input will be of great help!
SELECT * FROM test_id;
id
----
1
3
5
7
9
11
13
15
17
19
21
23
25
27
29
31
33
35
(18 rows)
WITH ranked_ids AS (
    -- dense_rank keeps counting distinct ids even if an id appears on more than one row
    SELECT *, dense_rank() OVER (ORDER BY id) AS rank FROM test_id
)
SELECT * FROM ranked_ids WHERE rank <= 10;
id | rank
----+------
1 | 1
3 | 2
5 | 3
7 | 4
9 | 5
11 | 6
13 | 7
15 | 8
17 | 9
19 | 10
(10 rows)

reshape dataframe from column to rows in scala

I want to reshape a dataframe in Spark using Scala. I found that most of the examples use groupBy and pivot. In my case I don't want to use groupBy. This is how my dataframe looks:
tagid timestamp value
1 1 2016-12-01 05:30:00 5
2 1 2017-12-01 05:31:00 6
3 1 2017-11-01 05:32:00 4
4 1 2017-11-01 05:33:00 5
5 2 2016-12-01 05:30:00 100
6 2 2017-12-01 05:31:00 111
7 2 2017-11-01 05:32:00 109
8 2 2016-12-01 05:34:00 95
And I want my dataframe to look like this:
timestamp 1 2
1 2016-12-01 05:30:00 5 100
2 2017-12-01 05:31:00 6 111
3 2017-11-01 05:32:00 4 109
4 2017-11-01 05:33:00 5 NA
5 2016-12-01 05:34:00 NA 95
I used pivot without groupBy and it throws an error:
df.pivot("tagid")
error: value pivot is not a member of org.apache.spark.sql.DataFrame
How do I convert this? Thank you.
pivot is defined on a RelationalGroupedDataset, not on a DataFrame, so you need a groupBy before it. Doing the following should solve your issue:
import org.apache.spark.sql.functions._

df.groupBy("timestamp").pivot("tagid").agg(first(col("value")))
You should get the final dataframe as:
+-------------------+----+----+
|timestamp |1 |2 |
+-------------------+----+----+
|2017-11-01 05:33:00|5 |null|
|2017-11-01 05:32:00|4 |109 |
|2017-12-01 05:31:00|6 |111 |
|2016-12-01 05:30:00|5 |100 |
|2016-12-01 05:34:00|null|95 |
+-------------------+----+----+
For more information you can check out the Databricks blog.
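A small side note, not something the answer above relies on: if the distinct tag ids are already known, they can be passed to pivot explicitly, which spares Spark the extra job it otherwise runs just to collect the pivot values; the Seq(1, 2) below simply mirrors the sample data.
import org.apache.spark.sql.functions._

// Explicit pivot values avoid an extra pass over the data to discover them.
df.groupBy("timestamp").pivot("tagid", Seq(1, 2)).agg(first(col("value")))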

Replace DataFrame rows with most recent data based on key

I have a dataframe that looks like this:
user_id val date
1 10 2015-02-01
1 11 2015-01-01
2 12 2015-03-01
2 13 2015-02-01
3 14 2015-03-01
3 15 2015-04-01
I need to run a function that calculates (let's say) the sum of vals chronologically by date: for a given cutoff date, if a user has a more recent row on or before that date, use it; if not, keep the older one.
For example, if I run the function with the date 2015-03-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 14 2015-03-01
Giving me a sum of 36.
If I run the function with the date 2015-04-15, then the table will be:
user_id val date
1 10 2015-02-01
2 12 2015-03-01
3 15 2015-04-01
(User 3's row was replaced with a more recent date).
I know this is fairly esoteric, but I thought I could bounce this off all of you, as I have been trying to think of a simple way of doing this.
try this:
In [36]: df.loc[df.date <= '2015-03-15']
Out[36]:
user_id val date
0 1 10 2015-02-01
1 1 11 2015-01-01
2 2 12 2015-03-01
3 2 13 2015-02-01
4 3 14 2015-03-01
In [39]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').agg({'date':'last', 'val':'last'}).reset_index()
Out[39]:
user_id date val
0 1 2015-02-01 10
1 2 2015-03-01 12
2 3 2015-03-01 14
or:
In [40]: df.loc[df.date <= '2015-03-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[40]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 14 2015-03-01
In [41]: df.loc[df.date <= '2015-04-15'].sort_values('date').groupby('user_id').last().reset_index()
Out[41]:
user_id val date
0 1 10 2015-02-01
1 2 12 2015-03-01
2 3 15 2015-04-01