I have the below DataFrame in Spark, where I need to detect a key change (on column rec) and create a new column called groupId. For example, the first and second rows belong to one group until the same record type (D) is encountered again, so the 1st and 2nd rows get the same groupId.
rec amount date
D 250 20220522
C 110 20220522
D 120 20220522
C 100 20220522
C 50 20220522
D 50 20220522
D 50 20220522
D 50 20220522
EXPECTED OUTPUT
rec amount date groupId
D 250 20220522 1
C 110 20220522 1
D 120 20220522 2
C 100 20220522 2
C 50 20220522 2
D 50 20220522 3
D 50 20220522 4
D 50 20220522 5
I have tried many ways but couldn't achieve the desired output. I am not sure what I am doing incorrectly here; below is what I have tried:
WindowSpec window = Window.orderBy("date");
Dataset<Row> dataset4 = data
    .withColumn("nextRow", functions.lead("rec", 1).over(window))
    .withColumn("prevRow", functions.lag("rec", 1).over(window))
    .withColumn("groupId",
        functions.when(functions.col("nextRow")
                .equalTo(functions.col("prevRow")),
            functions.dense_rank().over(window)
        ));
Can someone please point out what I am doing incorrectly here?
Window functions do not quite work like that; here is a workaround, though it might not be the best one.
First, keep track of what the starting value is:
// the offset to apply if the very first row is a "C" record
val different = if (df.first().getString(0) == "C") 1 else 0
We assign a value of 0 to C and a value of 1 to D:
.withColumn("other", when(col("rec").equalTo("C"), 0).otherwise(1))
Then, we create a unique id (because we do not have a combination of columns that uniquely identifies a row):
.withColumn("id", expr("row_number() over (order by date)"))
Then, we compute a cumulative sum of that flag:
.withColumn("group_id",
sum("other").over(Window.orderBy("id").partitionBy("date")) + different
)
I partitioned by date here; you can remove that, but performance might degrade severely since the whole dataset would then be processed in a single window partition. Finally, we drop id; the final result:
+---+------+--------+-----+--------+
|rec|amount|date |other|group_id|
+---+------+--------+-----+--------+
|D |250 |20220522|1 |1 |
|C |110 |20220522|0 |1 |
|D |120 |20220522|1 |2 |
|C |100 |20220522|0 |2 |
|C |50 |20220522|0 |2 |
|D |50 |20220522|1 |3 |
|D |50 |20220522|1 |4 |
|D |50 |20220522|1 |5 |
+---+------+--------+-----+--------+
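For completeness, here are the steps above assembled into a single sketch (assuming your input DataFrame is called df and has the rec, amount and date columns shown above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// offset by 1 if the very first record is a "C", so the groups still start at 1
val different = if (df.first().getString(0) == "C") 1 else 0

val result = df
  .withColumn("other", when(col("rec").equalTo("C"), 0).otherwise(1)) // every D marks a new group
  .withColumn("id", expr("row_number() over (order by date)"))        // surrogate ordering id
  .withColumn("group_id",
    sum("other").over(Window.partitionBy("date").orderBy("id")) + different)
  .drop("id")

result.show(false)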
Good luck!
I have a dataframe looking like this (just some example values):
| id | timestamp                  | mode    | trip | journey | value |
|----|----------------------------|---------|------|---------|-------|
| 1  | 2021-09-12 23:59:19.717000 | walking | 1    | 1       | 1.21  |
| 1  | 2021-09-12 23:59:38.617000 | walking | 1    | 1       | 1.36  |
| 1  | 2021-09-12 23:59:38.617000 | driving | 2    | 1       | 1.65  |
| 2  | 2021-09-11 23:52:09.315000 | walking | 4    | 6       | 1.04  |
I want to create new columns which I fill with the previous and next mode. Something like this:
| id | timestamp                  | mode    | trip | journey | value | prev    | next    |
|----|----------------------------|---------|------|---------|-------|---------|---------|
| 1  | 2021-09-12 23:59:19.717000 | walking | 1    | 1       | 1.21  | bus     | driving |
| 1  | 2021-09-12 23:59:38.617000 | walking | 1    | 1       | 1.36  | bus     | driving |
| 1  | 2021-09-12 23:59:38.617000 | driving | 2    | 1       | 1.65  | walking | walking |
| 2  | 2021-09-11 23:52:09.315000 | walking | 4    | 6       | 1.0   | walking | driving |
I have tried to partition by id, trip, journey and mode, ordered by timestamp. Then I tried to use lag() and lead(), but I am not sure these can look into other partitions. I came across Window.unboundedPreceding and Window.unboundedFollowing, but I am not sure I completely understand how these work. In my mind, if I partition the data as explained above, I always just need the last value of mode from the previous partition to fill prev; to fill next I could reorder the partition from ascending to descending on the timestamp and then do the same. However, I am unsure how to get the last value of the previous partition.
I have tried this:
w = Window.partitionBy("id", "journey", "trip").orderBy(col("timestamp").asc())
w_prev = w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("prev", first("mode").over(w_prev))
Code examples and explanations using PySpark would be greatly appreciated!
So, based on what I could understand, you could do something like this:
Create a partition based on id and journey (within each journey there are multiple trips), order by trip and then timestamp, and simply use lead and lag to get the output!
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window().partitionBy('id', 'journey').orderBy('trip', 'timestamp')
df.withColumn('prev', F.lag('mode', 1).over(w)) \
  .withColumn('next', F.lead('mode', 1).over(w)) \
  .show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:19.717000|walking|1 |1 |1.21 |null |walking|
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |walking|driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
EDIT:
As the OP asked, you can do this to achieve it:
# Used for taking the latest record from same id, trip, journey
w = Window().partitionBy('id', 'trip', 'journey').orderBy(F.col('timestamp').desc())
# Used to calculate prev and next mode
w1 = Window().partitionBy('id', 'journey').orderBy('trip')
# First take only the latest rows for a particular combination of id, trip, journey
# Second, use the filtered rows to get prev and next modes
df2 = df.withColumn('rn', F.row_number().over(w)) \
    .filter(F.col('rn') == 1) \
    .withColumn('prev', F.lag('mode', 1).over(w1)) \
    .withColumn('next', F.lead('mode', 1).over(w1)) \
    .drop('rn')
df2.show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |null |driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
# Finally, join the calculated DF with the original DF to get prev and next mode
final_df = df.alias('a').join(df2.alias('b'), ['id', 'trip', 'journey'], how='left') \
    .select('a.*', 'b.prev', 'b.next')
final_df.show(truncate=False)
Output:
+---+----+-------+--------------------------+-------+-----+-------+-------+
|id |trip|journey|timestamp |mode |value|prev |next |
+---+----+-------+--------------------------+-------+-----+-------+-------+
|1 |1 |1 |2021-09-12 23:59:19.717000|walking|1.21 |null |driving|
|1 |1 |1 |2021-09-12 23:59:38.617000|walking|1.36 |null |driving|
|1 |2 |1 |2021-09-12 23:59:38.617000|driving|1.65 |walking|null |
|2 |4 |6 |2021-09-11 23:52:09.315000|walking|1.04 |null |null |
+---+----+-------+--------------------------+-------+-----+-------+-------+
How to find & populate the 3rd highest Amount & populate the same 3rd highest Amount into Cut_of_3 (new col) and repeat the same into that related ID, if there is no 3rd highest Amount for that ID, need to populate the 100 into that Related ID. Pls find sample dataset & Expecting result. Thanks in Advance.!
Sample Dataset:-
ID Status Date Amount
1 New 01/05/20 20
1 Assigned 02/05/20 30
1 In-Progress 02/05/20 50
2 New 02/05/20 30
2 Removed 03/05/20 20
3 New 09/05/20 50
3 Assigned 09/05/20 20
3 In-Progress 10/05/20 30
3 Closed 10/05/20 10
4 New 10/05/20 20
4 Assigned 10/05/20 30
Expected Result:
ID Status Date Amount Cut_of_3
1 New 01/05/20 20 20
1 Assigned 02/05/20 30 20
1 In-Progress 02/05/20 50 20
2 New 02/05/20 30 100
2 Removed 03/05/20 20 100
3 New 09/05/20 50 35
3 Assigned 09/05/20 35 35
3 In-Progress 10/05/20 40 35
3 Closed 10/05/20 10 35
4 New 10/05/20 20 100
4 Assigned 10/05/20 30 100
Here is how you can achieve this with window functions:
val window = Window.partitionBy("ID").orderBy("ID")
// collect Amount as a list per ID, sort descending, and take the third value
df.withColumn("Cut_of_3", sort_array(collect_list($"Amount").over(window), false)(2))
  // if there is no third value it returns null; replace null with 100
  .na.fill(100, Seq("Cut_of_3"))
  .sort("ID")
  .show(false)
Output:
+---+-----------+--------+------+--------+
|ID |Status |Date |Amount|Cut_of_3|
+---+-----------+--------+------+--------+
|1 |New |01/05/20|20 |20 |
|1 |Assigned |02/05/20|30 |20 |
|1 |In-Progress|02/05/20|50 |20 |
|2 |New |02/05/20|30 |100 |
|2 |Removed |03/05/20|20 |100 |
|3 |New |09/05/20|50 |20 |
|3 |Assigned |09/05/20|20 |20 |
|3 |In-Progress|10/05/20|30 |20 |
|3 |Closed |10/05/20|10 |20 |
|4 |New |10/05/20|20 |100 |
|4 |Assigned |10/05/20|30 |100 |
+---+-----------+--------+------+--------+
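As a side note, the (2) element lookup above relies on out-of-bounds array access returning null. A small variant of the same idea that makes the fallback to 100 explicit (just a sketch, using the same df and column names):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// a partition-only window: the frame is the whole partition for each ID
val byId = Window.partitionBy("ID")

// all Amounts of the ID, sorted descending
val amountsDesc = sort_array(collect_list(col("Amount")).over(byId), asc = false)

df.withColumn("Cut_of_3",
    // take the 3rd highest only when it exists, otherwise fall back to 100
    when(size(amountsDesc) >= 3, amountsDesc.getItem(2)).otherwise(lit(100)))
  .sort("ID")
  .show(false)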
I'm trying to create a comparison matrix using a Spark dataframe, and am starting by creating a single column dataframe with one row per value:
val df = List(1, 2, 3, 4, 5).toDF
From here, what I need to do is create a new column for each row and insert (for now) a random number in each space, like this:
Item 1 2 3 4 5
------ --- --- --- --- ---
1 0 7 3 6 2
2 1 0 4 3 1
3 8 6 0 4 4
4 8 8 1 0 9
5 9 5 3 6 0
Any assistance would be appreciated!
Consider transposing the input DataFrame df using the .pivot() function, like the following:
val output = df.groupBy("item").pivot("item").agg((rand()*100).cast(DataTypes.IntegerType))
This will generate a new DataFrame with a random Integer value corresponding to the row value (null otherwise).
+----+----+----+----+----+----+
|item|1 |2 |3 |4 |5 |
+----+----+----+----+----+----+
|1 |9 |null|null|null|null|
|3 |null|null|2 |null|null|
|5 |null|null|null|null|6 |
|4 |null|null|null|26 |null|
|2 |null|33 |null|null|null|
+----+----+----+----+----+----+
If you don't want the null values, you can consider applying a UDF later.
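For example, one possible sketch of filling the nulls with random integers without a UDF, using coalesce with rand (this assumes the pivoted DataFrame above is named output and its first column is item):

import org.apache.spark.sql.functions.{coalesce, col, rand}
import org.apache.spark.sql.types.DataTypes

// fill every null cell with its own random integer; non-null cells (the diagonal) are kept
val filled = output.columns.filter(_ != "item").foldLeft(output) { (acc, c) =>
  acc.withColumn(c, coalesce(col(c), (rand() * 100).cast(DataTypes.IntegerType)))
}

filled.show(false)

Each null cell gets an independent random draw, since rand() is evaluated per row for every column.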
As the title says, I want to reject rows so that I do not create duplicates.
The first step is to not join on values that have multiple rows in the second table.
Here is an example if needed:
Table a:
aa |bb |
---|----|
1 |111 |
2 |222 |
Table h:
hh |kk |
---|----|
1 |111 |
2 |111 |
3 |222 |
Using Normal Left join:
SELECT
*
FROM a
LEFT JOIN h
ON a.bb = h.kk
;
I get:
aa |bb |hh |kk |
---|----|---|----|
1 |111 |1 |111 |
1 |111 |2 |111 |
2 |222 |3 |222 |
I want to get rid of the first two rows, where aa = 1.
...
The second step would be another query, probably with some CASE, where from table a I filter out only those rows which have more than one matching row in table h.
From those I want to create table c, which would contain:
aa |bb |
---|----|
1 |111 |
Can someone help me please?
Thank you.
To get only the 1:1 joins
SELECT a.aa, MIN(h.hh) AS hh, MIN(h.kk) AS kk
FROM a
LEFT JOIN h ON a.bb = h.kk
GROUP BY a.aa, a.bb
HAVING COUNT(h.kk) = 1;
To get only the 1:n joins
SELECT a.aa, a.bb
FROM a
LEFT JOIN h ON a.bb = h.kk
GROUP BY a.aa, a.bb
HAVING COUNT(h.kk) > 1;
I have two Data Frames:
DF1:
ID | Col1 | Col2
1 a aa
2 b bb
3 c cc
DF2:
ID | Col1 | Col2
1 ab aa
2 b bba
4 d dd
How can I join these two DFs so that the result is:
Result:
1 ab aa
2 b bba
3 c cc
4 d dd
My code is:
val df = DF1.join(DF2, Seq("ID"), "outer")
.select($"ID",
when(DF1("Col1").isNull, lit(0)).otherwise(DF1("Col1")).as("Col1"),
when(DF1("Col2").isNull, lit(0)).otherwise(DF2("Col2")).as("Col2"))
.orderBy("ID")
And it works, but I don't want to specify each column, because I have large files.
So, is there any way to update the dataframe (and add records that are new in the second DF) without specifying each column?
A simple leftanti join of df1 with df2, with the result unioned back into df2, should get your desired output:
df2.union(df1.join(df2, Seq("ID"), "leftanti")).orderBy("ID").show(false)
which should give you
+---+----+----+
|ID |Col1|Col2|
+---+----+----+
|1 |ab |aa |
|2 |b |bba |
|3 |c |cc |
|4 |d |dd |
+---+----+----+
The solution doesn't match the logic you have in your code, but it generates the expected result.