How to get the name of the group with maximum value of parameter? [duplicate] - scala

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have a DataFrame df like this one:
df =
name group influence
A 1 2
B 1 3
C 1 0
A 2 5
D 2 1
For each distinct value of group, I want to extract the value of name that has the maximum value of influence.
The expected result is this one:
group max_name max_influence
1 B 3
2 A 5
I know how to get the max value, but I don't know how to get max_name.
df.groupBy("group").agg(max("influence").as("max_influence"))
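For reference, the "groupBy with structs" approach that the answer below compares against looks roughly like this (a sketch, not part of the original thread; structs compare field by field, so the row with the largest influence wins within each group):
import org.apache.spark.sql.functions.{col, max, struct}

// max over a (influence, name) struct picks, per group, the row with the top influence
val byStruct = df
  .groupBy("group")
  .agg(max(struct(col("influence"), col("name"))).as("m"))
  .select(col("group"), col("m.name").as("max_name"), col("m.influence").as("max_influence"))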

There is a good alternative to groupBy with structs: window functions, which are sometimes considerably faster.
For your example I would try the following:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
import spark.implicits._  // for the 'columnName symbol syntax; assumes a SparkSession named spark

val w = Window.partitionBy('group)
val res = df.withColumn("max_influence", max('influence).over(w))
  .filter('influence === 'max_influence)
res.show
+----+-----+---------+-------------+
|name|group|influence|max_influence|
+----+-----+---------+-------------+
| A| 2| 5| 5|
| B| 1| 3| 3|
+----+-----+---------+-------------+
Now all you need to do is drop the redundant columns and rename the remaining ones.
Hope it helps.
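That final cleanup step might look like this (a minimal sketch; column names taken from the example output above):
// drop the now-redundant influence column and rename to match the expected output
val finalDf = res
  .drop("influence")
  .withColumnRenamed("name", "max_name")
  .select("group", "max_name", "max_influence")
finalDf.show()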

Related

Sum values of specific rows if fields are the same

Hi, I'm trying to sum the values of one column across all rows of a dataframe whose 'ID' matches.
For example:
ID Gender value
1  Male   5
1  Male   6
2  Female 3
3  Female 0
3  Female 9
4  Male   10
How do I get the following table?
ID Gender value
1  Male   11
2  Female 3
3  Female 9
4  Male   10
In the example above, the ID with value 1 now appears just once and its values have been summed (the same applies to ID 3).
Thanks
I'm new to PySpark and still learning. I've tried count(), select and groupby(), but nothing has produced what I'm trying to do.
try this:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

df = (
    df
    # adds the per-ID total as a new column on every row
    .withColumn('value', f.sum(f.col('value')).over(Window.partitionBy(f.col('ID'))))
)
Link to the documentation about window operations: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html
You can use a simple groupBy, with the sum function:
from pyspark.sql import functions as F

(
    df
    .groupby("ID", "Gender")  # sum rows with the same ID and Gender
    # .groupby("ID")          # use this line instead to sum rows with the same ID, even if they have different Gender
    .agg(F.sum("value").alias("value"))
)
The result is:
+---+------+-----+
| ID|Gender|value|
+---+------+-----+
| 1| Male| 11|
| 2|Female| 3|
| 3|Female| 9|
| 4| Male| 10|
+---+------+-----+

Spark aggregation with window functions

I have a spark df which I need to use to identify the last active record for each primary key based on a snapshot date. An example of what I have is:
A B C Snap
1 2 3 2019-12-29
1 2 4 2019-12-31
where the primary key is formed by fields A and B. I need to create a new field to indicate which record is active (the one with the latest snap among rows with the same PK). So I need something like this:
A B C Snap       activity
1 2 3 2019-12-29 false
1 2 4 2019-12-31 true
I have done this by creating an auxiliary df and then joining it with the first one to bring back the active indicator, but my original df is very big and I need something better in terms of performance. I have been thinking about window functions, but I don't know how to implement them.
Once I have this, I need to create a new field with the end date of the record: it should be filled only when activity is false, by subtracting 1 day from the snap date of the latest record in the same PK group. I would need something like this:
A B C Snap       activity end
1 2 3 2019-12-29 false    2019-12-30
1 2 4 2019-12-31 true
You can compute row_number ordered by Snap in descending order; the row numbered 1 is the last active snap:
df.selectExpr(
    '*',
    'row_number() over (partition by A, B order by Snap desc) = 1 as activity'
).show()
+---+---+---+----------+--------+
| A| B| C| Snap|activity|
+---+---+---+----------+--------+
| 1| 2| 4|2019-12-31| true|
| 1| 2| 3|2019-12-29| false|
+---+---+---+----------+--------+
Edit: to get the end date for each group, use max window function on Snap:
import pyspark.sql.functions as f

df.withColumn(
    'activity',
    f.expr('row_number() over (partition by A, B order by Snap desc) = 1')
).withColumn(
    'end',
    f.expr('case when activity then null else max(date_add(to_date(Snap), -1)) over (partition by A, B) end')
).show()
+---+---+---+----------+--------+----------+
| A| B| C| Snap|activity| end|
+---+---+---+----------+--------+----------+
| 1| 2| 4|2019-12-31| true| null|
| 1| 2| 3|2019-12-29| false|2019-12-30|
+---+---+---+----------+--------+----------+
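If you need the same logic in the Scala DataFrame API (the parent question's language), a rough equivalent sketch, assuming the same column names, would be:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, date_add, max, row_number, to_date, when}

val byPk = Window.partitionBy("A", "B")
val latestFirst = byPk.orderBy(col("Snap").desc)

val result = df
  // the most recent snap per (A, B) is the active record
  .withColumn("activity", row_number().over(latestFirst) === 1)
  // for inactive rows, end = (latest snap in the group) - 1 day; active rows stay null
  .withColumn("end", when(!col("activity"), max(date_add(to_date(col("Snap")), -1)).over(byPk)))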

Pyspark : How to split pipe-separated column into multiple rows? [duplicate]

This question already has answers here:
How to split pipe-separated column into multiple rows?
(2 answers)
Closed 2 years ago.
I have a dataframe that contains the following:
movieId / movieName / genre
1 example1 action|thriller|romance
2 example2 fantastic|action
I would like to obtain a second dataframe (from the first one), that contains the following:
movieId / movieName / genre
1 example1 action
1 example1 thriller
1 example1 romance
2 example2 fantastic
2 example2 action
How can we do it using pyspark?
Use the split function, which returns an array, and then the explode function on that array.
Example:
df.show(10,False)
#+-------+---------+-----------------------+
#|movieid|moviename|genre |
#+-------+---------+-----------------------+
#|1 |example1 |action|thriller|romance|
#+-------+---------+-----------------------+
from pyspark.sql.functions import *
df.withColumnRenamed("genre", "genre1").\
    withColumn("genre", explode(split(col("genre1"), '\\|'))).\
    drop("genre1").\
    show()
#+-------+---------+--------+
#|movieid|moviename| genre|
#+-------+---------+--------+
#| 1| example1| action|
#| 1| example1|thriller|
#| 1| example1| romance|
#+-------+---------+--------+

Spark and Scala, add new column with value from another dataframe by mapping common key [duplicate]

This question already has answers here:
How to join two DataFrames and change column for missing values?
(3 answers)
How to do left outer join in spark sql?
(3 answers)
Closed 5 years ago.
I have 2 dataframes.
df1 =
dep-code rank
abc 1
bcd 2
df2 =
some cols... dep-code
abc
bcd
abc
I want to add a new column rank to df2, taken from df1 where df1.dep-code = df2.dep-code.
The expected result:
some cols... dep-code rank
abc 1
bcd 2
abc 1
That's a simple join:
df2.join(df1, "dep-code")
With the following inputs:
df1 with the join column and the desired rank column:
+--------+----+
|dep-code|rank|
+--------+----+
| abc| 1|
| bcd| 2|
+--------+----+
df2 with the join column plus an extra one (aColumn):
+----------+--------+
| aColumn|dep-code|
+----------+--------+
| some| abc|
| someother| bcd|
|yetAnother| abc|
+----------+--------+
You'll retrieve the output below:
+--------+----------+----+
|dep-code| aColumn|rank|
+--------+----------+----+
| abc| some| 1|
| abc|yetAnother| 1|
| bcd| someother| 2|
+--------+----------+----+
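If df2 could contain dep-code values with no match in df1 and those rows should be kept (the situation in the linked duplicates), a left outer join does that; a minimal sketch:
// keep every row of df2; rows whose dep-code is missing from df1 get a null rank
val withRank = df2.join(df1, Seq("dep-code"), "left_outer")
withRank.show()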

Get Unique records in Spark [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have a dataframe df as mentioned below:
**customers** **product** **val_id** **rule_name** **rule_id** **priority**
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
2 B X DEF 456 3
1 A 1 DEF 456 2
I want to create a new dataframe df2 that has only unique customer ids. Since the rule_name and rule_id columns differ for the same customer, I want to pick, for each customer, the record with the highest priority. My final outcome should be:
**customers** **product** **val_id** **rule_name** **rule_id** **priority**
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
Can anyone please help me achieve this using Spark Scala? Any help will be appreciated.
You basically want to select rows with extreme values in a column. This is a really common issue, so there's even a whole tag greatest-n-per-group. Also see this question SQL Select only rows with Max Value on a Column which has a nice answer.
Here's an example for your specific case.
Note that this could select multiple rows for a customer, if there are multiple rows for that customer with the same (minimum) priority value.
This example is in pyspark, but it should be straightforward to translate to Scala
from pyspark.sql import functions as F

# find the best (minimum) priority for each customer; this DF has only two columns
cusPriDF = df.groupBy("customers").agg(F.min(df["priority"]).alias("priority"))
# now join back to choose only those rows and get all columns back
bestRowsDF = df.join(cusPriDF, on=["customers", "priority"], how="inner")
To create df2, first order df by priority and then take the first row for each unique customer id, like this:
import org.apache.spark.sql.functions.first
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(c => first(c).as(c))
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail: _*)
df2.show
It would give you the expected output:
+----------+--------+-------+----------+--------+---------+
| customers| product| val_id| rule_name| rule_id| priority|
+----------+--------+-------+----------+--------+---------+
| 1| A| 1| ABC| 123| 1|
| 3| Z| r| ERF| 789| 2|
| 2| B| X| ABC| 123| 2|
+----------+--------+-------+----------+--------+---------+
Corey beat me to it, but here's the Scala version:
import org.apache.spark.sql.functions.min
import spark.implicits._  // for toDF; assumes a SparkSession named spark

val df = Seq(
  (1,"A","1","ABC",123,1),
  (3,"Z","r","ERF",789,2),
  (2,"B","X","ABC",123,2),
  (2,"B","X","DEF",456,3),
  (1,"A","1","DEF",456,2)).toDF("customers","product","val_id","rule_name","rule_id","priority")

val priorities = df.groupBy("customers").agg(min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers","priority"), "inner")
top_rows.show()
+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
| 1| 1| A| 1| ABC| 123|
| 3| 2| Z| r| ERF| 789|
| 2| 2| B| X| ABC| 123|
+---------+--------+-------+------+---------+-------+
You will have to use a min aggregation on the priority column, grouping the dataframe by customers, then inner join the original dataframe with the aggregated dataframe and select the required columns.
import org.apache.spark.sql.functions.min

val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
  .withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show
You should then have the desired result.
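Since this question was closed as a duplicate of "How to select the first row of each group?", a window-function variant is also possible; a sketch equivalent to the min-priority joins above, reusing the df built in the Scala answer:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// keep, for each customer, the single row with the smallest priority value
// (priority 1 is the highest priority here); ties are resolved arbitrarily
val w = Window.partitionBy("customers").orderBy("priority")
val df2 = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
df2.show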