I want to reshape a DataFrame in Spark using Scala. I found that most examples use groupBy and pivot. In my case I don't want to use groupBy. This is how my DataFrame looks:
  tagid timestamp           value
1 1     2016-12-01 05:30:00 5
2 1     2017-12-01 05:31:00 6
3 1     2017-11-01 05:32:00 4
4 1     2017-11-01 05:33:00 5
5 2     2016-12-01 05:30:00 100
6 2     2017-12-01 05:31:00 111
7 2     2017-11-01 05:32:00 109
8 2     2016-12-01 05:34:00 95
And I want my DataFrame to look like this:
  timestamp           1  2
1 2016-12-01 05:30:00 5  100
2 2017-12-01 05:31:00 6  111
3 2017-11-01 05:32:00 4  109
4 2017-11-01 05:33:00 5  NA
5 2016-12-01 05:34:00 NA 95
I used pivot without groupBy and it throws an error:
df.pivot("tagid")
error: value pivot is not a member of org.apache.spark.sql.DataFrame.
How do I convert this? Thank you.
Doing the following should solve your issue:
df.groupBy("timestamp").pivot("tagid").agg(first($"value"))
The final DataFrame should look like this:
+-------------------+----+----+
|timestamp |1 |2 |
+-------------------+----+----+
|2017-11-01 05:33:00|5 |null|
|2017-11-01 05:32:00|4 |109 |
|2017-12-01 05:31:00|6 |111 |
|2016-12-01 05:30:00|5 |100 |
|2016-12-01 05:34:00|null|95 |
+-------------------+----+----+
For more information, you can check out the Databricks blog.
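For reference, a minimal self-contained sketch (assuming spark-shell, so spark.implicits._ is in scope; column names are taken from the question):

import org.apache.spark.sql.functions.first

// recreate the example data from the question
val df = Seq(
  (1, "2016-12-01 05:30:00", 5),
  (1, "2017-12-01 05:31:00", 6),
  (1, "2017-11-01 05:32:00", 4),
  (1, "2017-11-01 05:33:00", 5),
  (2, "2016-12-01 05:30:00", 100),
  (2, "2017-12-01 05:31:00", 111),
  (2, "2017-11-01 05:32:00", 109),
  (2, "2016-12-01 05:34:00", 95)
).toDF("tagid", "timestamp", "value")

// pivot is defined on RelationalGroupedDataset, so it must follow a groupBy
df.groupBy("timestamp").pivot("tagid").agg(first($"value")).show(false)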
Let's consider a dataset:
name   age
Max    33
Adam   32
Zim    41
Muller 62
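For reference, a minimal sketch to create this dataset as x (assuming spark-shell, so the implicits needed for toDF and $ are in scope):

val x = Seq(("Max", 33), ("Adam", 32), ("Zim", 41), ("Muller", 62)).toDF("name", "age")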
Now, if we run this query on dataset x:
x.as("a").join(x.as("b")).where(
$"b.age" - $"a.age" <= 10 and
$"a.age" > $"b.age").show()
name age name age
Max  33  Zim  41
Adam 32  Max  33
Adam 32  Zim  41
That is my desired result.
Now, conceptually if I have a very big dataset, I might want to use bucketing to reduce search space.
So, doing bucketing with:
val buck_x = x.withColumn("buc_age", floor($"age"/ 10))
which gives me:
name   age buc_age
Max    33  3
Adam   32  3
Zim    41  4
Muller 62  6
After the explode,
val exp_x = buck_x.withColumn("buc_age", explode(array($"buc_age" - 1, $"buc_age", $"buc_age" + 1)))
I get the following result:
name   age buc_age
Max    33  2
Max    33  3
Max    33  4
Adam   32  2
Adam   32  3
Adam   32  4
Zim    41  3
Zim    41  4
Zim    41  5
Muller 62  5
Muller 62  6
Muller 62  7
Now, after the final query,
exp_x.as("a").join(exp_x.as("b")).where(
$"a.buc_age" === $"b.buc_age" and
$"b.age" - $"a.age" <= 10 and
$"b.age" > $"a.age").show()
I get the following result.
name age buc_age name age buc_age
Max  33  3       Zim  41  3
Max  33  4       Zim  41  4
Adam 32  2       Max  33  2
Adam 32  3       Zim  41  3
Adam 32  3       Max  33  3
Adam 32  4       Zim  41  4
Adam 32  4       Max  33  4
Clearly, this is not what I expected; I am getting more rows than I should. How do I solve this while still using bucketing?
Drop your bucketing columns and then select distinct rows, essentially undoing the duplication caused by explode:
exp_x.select(exp_x.columns.map(c => col(c).as(c + "_a")) : _*).join(exp_x.select(exp_x.columns.map(c => col(c).as(c + "_b")) : _*)).where(
  $"buc_age_a" === $"buc_age_b" and
  $"age_b" - $"age_a" <= 10 and
  $"age_b" > $"age_a").
  drop("buc_age_a", "buc_age_b").
  distinct.
  show
+------+-----+------+-----+
|name_a|age_a|name_b|age_b|
+------+-----+------+-----+
| Adam| 32| Zim| 41|
| Adam| 32| Max| 33|
| Max| 33| Zim| 41|
+------+-----+------+-----+
There is really no need for an explode.
Instead, this approach unions two inner self joins. The two joins find cases where:
A and B are in the same bucket, and B is older
B is one bucket more, but no more than 10 years older
This should perform better than using the explode, since fewer comparisons are performed (because the sets being joined here are one third of the exploded size).
import org.apache.spark.sql.functions.floor

val namesDF = Seq(("Max", 33), ("Adam", 32), ("Zim", 41), ("Muller", 62)).toDF("name", "age")
val buck_x = namesDF.withColumn("buc_age", floor($"age" / 10))
// same bucket where b is still older
val same = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" === $"b.buc_age" && $"b.age" > $"a.age"), "inner")
// different buckets -- b is one bucket higher but still no more than 10 ages different
val diff = buck_x.as("a").join(buck_x.as("b"), ($"a.buc_age" + 1 === $"b.buc_age" && $"b.age" <= $"a.age" + 10), "inner")
val result = same.union(diff)
The result (you can do a drop to remove excess columns like in Charlie's answer):
result.show(false)
+----+---+-------+----+---+-------+
|name|age|buc_age|name|age|buc_age|
+----+---+-------+----+---+-------+
|Adam|32 |3 |Max |33 |3 |
|Max |33 |3 |Zim |41 |4 |
|Adam|32 |3 |Zim |41 |4 |
+----+---+-------+----+---+-------+
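If you also want the output without the duplicate column names, one hedged sketch (assuming the same buck_x as above, run in spark-shell) is to suffix each side's columns before joining and then drop the bucket columns, similar to Charlie's answer:

// rename the columns on each side so the joined result has unique names
val a = buck_x.toDF(buck_x.columns.map(_ + "_a"): _*)
val b = buck_x.toDF(buck_x.columns.map(_ + "_b"): _*)

val sameClean = a.join(b, $"buc_age_a" === $"buc_age_b" && $"age_b" > $"age_a", "inner")
val diffClean = a.join(b, $"buc_age_a" + 1 === $"buc_age_b" && $"age_b" <= $"age_a" + 10, "inner")

sameClean.union(diffClean)
  .drop("buc_age_a", "buc_age_b")
  .show(false)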
Given a list of ids/keys and a set of corresponding values for a constant column:
q)ikeys: 1 2 3 5;
q)ivals: 100 100.5 101.5 99.5;
What is the fastest way to update the `toupd column in the following table so that the rows matching the given ikeys are updated to the new values in ivals? That is, this table:
q) show tab;
ikeys | `toupd `noupd
------|--------------
1     | 0.5    1
2     | 100.5  2
3     | 500.5  4
4     | 400.5  8
5     | 400.5  16
6     | 600.5  32
7     | 700.5  64
is updated to:
q) show restab;
ikeys | `toupd `noupd
------|--------------
1     | 100    1
2     | 100.5  2
3     | 101.5  4
4     | 400.5  8
5     | 99.5   16
6     | 600.5  32
7     | 700.5  64
Furthermore, is there a canonical method with which one could update multiple columns in this manner?
Thanks.
A dot amend is another approach, and one which generalises more easily to more than one column. It can also take advantage of amend-in-place, which is the most memory-efficient approach as it doesn't create a duplicate copy of the table in memory (this assumes the table is a global).
ikeys:1 2 3 5
ivals:100 100.5 101.5 99.5
tab:([ikeys:1+til 7]toupd:.5 100.5 500.5 400.5 400.5 600.5 700.5;noupd:1 2 4 8 16 32 64)
q).[tab;(([]ikeys);`toupd);:;ivals]
ikeys| toupd noupd
-----| -----------
1    | 100   1
2    | 100.5 2
3    | 101.5 4
4    | 400.5 8
5    | 99.5  16
6    | 600.5 32
7    | 700.5 64
/amend in place
.[`tab;(([]ikeys);`toupd);:;ivals]
/generalise to two columns
q).[tab;(([]ikeys);`toupd`noupd);:;flip(ivals;1000 2000 3000 4000)]
ikeys| toupd noupd
-----| -----------
1    | 100   1000
2    | 100.5 2000
3    | 101.5 3000
4    | 400.5 8
5    | 99.5  4000
6    | 600.5 32
7    | 700.5 64
/you could amend in place here too
.[`tab;(([]ikeys);`toupd`noupd);:;flip(ivals;1000 2000 3000 4000)]
Here are two different ways of doing it.
tab lj ([ikeys] toupd: ivals)
or
m: ikeys
update toupd: ivals from tab where ikeys in m
I'm sure there are plenty more ways. If you want to find out which is fastest for your purpose (and your data), try using q)\t:1000 yourCodeHere for large tables and see which suits you best.
As for which is the canonical way for multiple columns, I imagine it would be the update, but it's a matter of personal preference, just do whatever is fastest.
A dictionary is also a common method of updating values given a mapping. Indexing the dictionary with the ikeys column gives the new values and then we fill in nulls with the old toupd column values.
q)show d:ikeys!ivals
1| 100
2| 100.5
3| 101.5
5| 99.5
q)update toupd:toupd^d ikeys from tab
ikeys| toupd noupd
-----| -----------
1    | 100   1
2    | 100.5 2
3    | 101.5 4
4    | 400.5 8
5    | 99.5  16
6    | 600.5 32
7    | 700.5 64
It is also worth noting that the update with the where clause is not guaranteed to work in all cases, e.g. if you have more mapping values than appear in your ikeys column:
q)m:ikeys:1 2 3 5 7 11
q)ivals:100 100.5 101.5 99.5 100 100
q)update toupd: ivals from tab where ikeys in m
'length
How to find & populate the 3rd highest Amount & populate the same 3rd highest Amount into Cut_of_3 (new col) and repeat the same into that related ID, if there is no 3rd highest Amount for that ID, need to populate the 100 into that Related ID. Pls find sample dataset & Expecting result. Thanks in Advance.!
Sample Dataset:-
ID Status      Date     Amount
1  New         01/05/20 20
1  Assigned    02/05/20 30
1  In-Progress 02/05/20 50
2  New         02/05/20 30
2  Removed     03/05/20 20
3  New         09/05/20 50
3  Assigned    09/05/20 20
3  In-Progress 10/05/20 30
3  Closed      10/05/20 10
4  New         10/05/20 20
4  Assigned    10/05/20 30
Expected result:
ID Status      Date     Amount Cut_of_3
1  New         01/05/20 20     20
1  Assigned    02/05/20 30     20
1  In-Progress 02/05/20 50     20
2  New         02/05/20 30     100
2  Removed     03/05/20 20     100
3  New         09/05/20 50     35
3  Assigned    09/05/20 35     35
3  In-Progress 10/05/20 40     35
3  Closed      10/05/20 10     35
4  New         10/05/20 20     100
4  Assigned    10/05/20 30     100
Here is how you can achieve this with window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, sort_array}

val window = Window.partitionBy("ID").orderBy("ID")

// collect the Amounts per ID as a list, sort it descending and take the third value
df.withColumn("Cut_of_3", sort_array(collect_list($"Amount").over(window), false)(2))
  // if there is no third value this yields null, so replace null with 100
  .na.fill(100, Seq("Cut_of_3"))
  .sort("ID")
  .show(false)
Output:
+---+-----------+--------+------+--------+
|ID |Status |Date |Amount|Cut_of_3|
+---+-----------+--------+------+--------+
|1 |New |01/05/20|20 |20 |
|1 |Assigned |02/05/20|30 |20 |
|1 |In-Progress|02/05/20|50 |20 |
|2 |New |02/05/20|30 |100 |
|2 |Removed |03/05/20|20 |100 |
|3 |New |09/05/20|50 |20 |
|3 |Assigned |09/05/20|20 |20 |
|3 |In-Progress|10/05/20|30 |20 |
|3 |Closed |10/05/20|10 |20 |
|4 |New |10/05/20|20 |100 |
|4 |Assigned |10/05/20|30 |100 |
+---+-----------+--------+------+--------+
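As an alternative sketch (not part of the answer above): rank the amounts per ID with a second window, broadcast the amount ranked third to every row of that ID, and fall back to 100 where it does not exist. The same df as above is assumed.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lit, max, row_number, when}

val byId = Window.partitionBy("ID")
val byAmountDesc = Window.partitionBy("ID").orderBy(col("Amount").desc)

df.withColumn("rank", row_number().over(byAmountDesc))
  // the amount on the row ranked 3rd, or null if the ID has fewer than 3 rows
  .withColumn("Cut_of_3",
    coalesce(max(when(col("rank") === 3, col("Amount"))).over(byId), lit(100)))
  .drop("rank")
  .sort("ID")
  .show(false)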
I have a dataframe:
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1  12  13   12      13       1        [1.5,3.5]    4     4.5
1  12  13   12      13       1        null         4.5   5
1  12  13   12      13       1        null         5     5.5
1  12  13   12      13       1        null         5.5   6
1  13  14   12      13       2        null         6     6.5
1  13  14   13      14       2        null         6.5   null
2  13  14   13      14       2        [0.5,1.5]    2.5   3.5
2  13  14   13      14       2        null         3.5   4
2  13  14   13      14       2        null         4     null
I want to apply a condition while aggregating after a groupBy: if we group by col("id") and col("detector"), I want to check whether lag_interval in that group has any non-null value. If it does, then in the aggregation I want two columns:
min("lag_interval.col1") and max("lead_gpsdt")
If the above condition is not met, then I want:
min("gpsdt"), max("lead_gpsdt")
Using this approach, I want to get the data with that condition applied:
df.groupBy("detector","id").agg(first("lat-long").alias("start_coordinate"),
last("lat-long").alias("end_coordinate"),struct(min("gpsdt"), max("lead_gpsdt")).as("interval"))
Desired output:
id interval start_coordinate end_coordinate
1  [1.5,6]  [12,13]          [13,14]
1  [6,6.5]  [13,14]          [13,14]
2  [0.5,4]  [13,14]          [13,14]
For more explanation:
Looking at one of the groups that groupBy("id", "detector") pulls out: if any value in col("lag_interval") within that group is not null, then we need to aggregate with min(lag_interval.col1) and max(lead_gpsdt).
This condition applies to the following set of data:
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1  12  13   12      13       1        [1.5,3.5]    4     4.5
1  12  13   12      13       1        null         4.5   5
1  12  13   12      13       1        null         5     5.5
1  12  13   12      13       1        null         5.5   6
And if all values of col("lag_interval") are null in that group of data, then we need the aggregation output to be
min("gpsdt"),max("lead_gpsdt")
This condition applies to the following set of data:
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1  13  14   12      13       2        null         6     6.5
1  13  14   13      14       2        null         6.5   null
The conditional dilemma that you have can be solved by using the simple when inbuilt function, as suggested below:
import org.apache.spark.sql.functions._
df.groupBy("id","detector")
.agg(
struct(
when(isnull(min("lag_interval.col1")), min("gpsdt")).otherwise(min("lag_interval.col1")).as("min"),
max("lead_gpsdt").as(("max"))
).as("interval")
)
which should give you the following output:
+---+--------+----------+
|id |detector|interval |
+---+--------+----------+
|2 |2 |[0.5, 4.0]|
|1 |2 |[6.0, 6.5]|
|1 |1 |[1.5, 6.0]|
+---+--------+----------+
And I guess you must already have an idea of how to add first("lat-long").alias("start_coordinate") and last("lat-long").alias("end_coordinate"), as you have done before; a combined sketch follows.
I hope the answer is helpful.
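For completeness, a hedged sketch combining the two (the column name "lat-long" and the DataFrame df are taken from the question as-is):

import org.apache.spark.sql.functions._

df.groupBy("id", "detector")
  .agg(
    first("lat-long").alias("start_coordinate"),
    last("lat-long").alias("end_coordinate"),
    struct(
      // use min of gpsdt only when every lag_interval.col1 in the group is null
      when(isnull(min("lag_interval.col1")), min("gpsdt"))
        .otherwise(min("lag_interval.col1")).as("min"),
      max("lead_gpsdt").as("max")
    ).as("interval")
  )
  .show(false)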
This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 6 years ago.
I have a dataframe with columns named c1, c2, c3, c4. I want to group it on one column, use an agg function (e.g. min/max) on another column, and then get the corresponding values of the other columns based on the result of the agg function.
Example:
c1 c2 c3 c4
1 23 1 1
1 45 2 2
1 91 3 3
1 90 4 4
1 71 5 5
1 42 6 6
1 72 7 7
1 44 8 8
1 55 9 9
1 21 0 0
Should result in:
c1 c2 c3 c4
1 91 3 3
Let the dataframe be df:
df.groupBy($"c1").agg(max($"c2"), ??, ??)
Can someone please help with what should go in place of ?? above?
I know a solution to this problem using RDDs. I wanted to explore whether it can be solved in an easier way using the DataFrame/Dataset API.
You can do this in two steps:
calculate the aggregated data frame;
join the data frame back with the original data frame and filter based on the condition;
so:
val maxDF = df.groupBy("c1").agg(max($"c2").as("maxc2"))
// maxDF: org.apache.spark.sql.DataFrame = [c1: int, maxc2: int]
df.join(maxDF, Seq("c1")).where($"c2" === $"maxc2").drop($"maxc2").show
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| 1| 91| 3| 3|
+---+---+---+---+
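As an alternative sketch, the approach from the linked duplicate ("How to select the first row of each group?") does it in one pass with a window function (assuming the same df as above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

// rank rows within each c1 group by c2 descending, then keep only the top row
val w = Window.partitionBy("c1").orderBy(desc("c2"))

df.withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .drop("rn")
  .show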