Sum values of specific rows if fields are the same - pyspark

Hi, I'm trying to sum the values of one column across all rows of a dataframe that share the same 'ID'.
For example:
ID  Gender  value
1   Male    5
1   Male    6
2   Female  3
3   Female  0
3   Female  9
4   Male    10
How do I get the following table
ID  Gender  value
1   Male    11
2   Female  3
3   Female  9
4   Male    10
In the example above, ID 1 now appears just once and its values have been summed (the same goes for ID 3).
Thanks
I'm new to PySpark and still learning. I've tried count(), select() and groupby(), but none of them produced the result I'm after.

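Try this. A window keeps every input row, so after replacing each value with the total for its ID you still need to drop the now-identical rows:
from pyspark.sql import Window
from pyspark.sql import functions as f
df = (
    df
    .withColumn('value', f.sum(f.col('value')).over(Window.partitionBy(f.col('ID'))))
    .distinct()  # collapse the duplicated rows left behind by the window
)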
Link to documentation about Window operation https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html

You can use a simple groupBy, with the sum function:
from pyspark.sql import functions as F
(
df
.groupby("ID", 'Gender') # sum rows with same ID and Gender
# .groupby("ID") # use this line instead if you want to sum rows with the same ID, even if they have different Gender
.agg(F.sum('value').alias('value'))
)
The result is:
+---+------+-----+
| ID|Gender|value|
+---+------+-----+
|  1|  Male|   11|
|  2|Female|    3|
|  3|Female|    9|
|  4|  Male|   10|
+---+------+-----+
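To see why the two answers behave differently, here is a minimal plain-Python sketch (not Spark code) that models both on the question's sample data. The names `windowed` and `grouped` are only illustrative. A window-style sum leaves one output row per input row, while a groupBy-style sum leaves one row per key:

```python
from collections import defaultdict

# Sample rows from the question: (ID, Gender, value)
rows = [
    (1, "Male", 5), (1, "Male", 6), (2, "Female", 3),
    (3, "Female", 0), (3, "Female", 9), (4, "Male", 10),
]

# Total per ID, the quantity both answers compute.
totals = defaultdict(int)
for id_, _gender, value in rows:
    totals[id_] += value

# Window-style: every input row survives, carrying its ID's total.
windowed = [(id_, gender, totals[id_]) for id_, gender, _value in rows]

# groupBy-style: one output row per (ID, Gender) key.
grouped_totals = defaultdict(int)
for id_, gender, value in rows:
    grouped_totals[(id_, gender)] += value
grouped = sorted((id_, gender, total) for (id_, gender), total in grouped_totals.items())

print(len(windowed), len(grouped))
```

In this model the window result only matches the grouped result after removing duplicate rows, which is why the groupBy version is the more direct fit for the question.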

Related

How to join two dataframes and subtract two columns

I have two dataframes that look like the ones below.
I am trying to find the difference between the two amounts for each ID.
Dataframe 1:
ID    I     Amt
1     null  200
null  2     200
3     null  600
Dataframe 2:
ID    I     Amt
2     null  300
3     null  400
Output
Df
ID  Amt(df2-df1)
2   100
3   -200
The query below doesn't work; the subtraction fails:
df = df1.join(df2, df1["coalesce(ID, I)"] == df2["coalesce(ID, I)"], 'inner').select
((df1["amt)"]) – (df2["amt”])), df1["coalesce(ID, I)"].show())
I would do a couple of things differently. To make it easier to know what column is in what dataframe, I would rename them. I would also do the coalesce outside of the join itself.
import org.apache.spark.sql.functions.coalesce
val left = df1.withColumn("joinKey", coalesce($"ID", $"I")).select($"joinKey", $"Amt".alias("DF1_AMT"))
val right = df2.withColumn("joinKey", coalesce($"ID", $"I")).select($"joinKey", $"Amt".alias("DF2_AMT"))
val joined = left.join(right, "joinKey")
Then you can easily perform your calculation:
joined.withColumn("DIFF",$"DF2_AMT" - $"DF1_AMT").show
+-------+-------+-------+------+
|joinKey|DF1_AMT|DF2_AMT|  DIFF|
+-------+-------+-------+------+
|      2|    200|    300| 100.0|
|      3|    600|    400|-200.0|
+-------+-------+-------+------+
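The same logic can be traced in a small plain-Python sketch (not Spark code) on the question's data. The helper `coalesce` here is a hypothetical stand-in for SQL's COALESCE, and the sketch assumes each join key appears at most once per dataframe:

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)

# Rows from the question: (ID, I, Amt); None stands in for null.
df1 = [(1, None, 200), (None, 2, 200), (3, None, 600)]
df2 = [(2, None, 300), (3, None, 400)]

# Build the join key first, then index each side by it.
amt1 = {coalesce(id_, i): amt for id_, i, amt in df1}
amt2 = {coalesce(id_, i): amt for id_, i, amt in df2}

# Inner join: keep only keys present on both sides, then subtract.
diff = {key: amt2[key] - amt1[key] for key in sorted(amt1.keys() & amt2.keys())}
print(diff)
```

Key 1 drops out because it has no match in the second dataframe, which is what the inner join in the answer does as well.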

Apache spark aggregation: aggregate column based on another column value

I am not sure if I am asking this correctly and maybe that is the reason why I didn't find the correct answer so far. Anyway, if it will be duplicate I will delete this question.
I have following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column, take the max value of "last_updated", and for the "count" column keep the value from the row where "last_updated" is at its max. In this case the result should be:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look something like this:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there a function I can use to get "count" based on the "last_updated" column?
I am using spark 2.4.0.
Thanks for any help
You have two options; the first is the better one, as far as I understand.
OPTION 1
Apply a window function partitioned by id and add a column holding the max of last_updated over that window. Then keep only the rows where last_updated equals that max, drop the original last_updated column, and rename the max column as desired.
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can group first and then join the result back to the original dataframe
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190101|    3|
|  1|    20190201|    2|
|  1|    20190301|    1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
|  1|    1|    20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
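One thing worth knowing about both options is what happens on ties. The plain-Python sketch below (not Spark code) models the "compute the max per id, then keep matching rows" logic. The rows for id 2 are hypothetical and were added only to show a tie:

```python
# (id, last_updated, count); the id 2 rows are hypothetical, added to show a tie.
rows = [
    (1, 20190101, 3), (1, 20190201, 2), (1, 20190301, 1),
    (2, 20190101, 7), (2, 20190101, 8),
]

# Step 1: max last_updated per id (the window max / the groupBy aggregate).
max_updated = {}
for id_, last_updated, _count in rows:
    max_updated[id_] = max(last_updated, max_updated.get(id_, last_updated))

# Step 2: keep every row whose last_updated equals its id's max
# (the where clause in option 1 / the join in option 2).
result = [row for row in rows if row[1] == max_updated[row[0]]]
print(result)
```

If two rows share the latest last_updated for an id, both survive. If you need exactly one row per id, you would have to add a tie-breaking rule of your own.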

How to spread repeated values for the same ID into separate columns

I have data like the sample below. For each ID, I want to take the values from one column and put each of them into its own new column.
Actual
ID  Brandid
1   234
1   122
1   134
2   122
3   234
3   122
Expected
ID  BRANDID_1  BRANDID_2  BRANDID_3
1   234        122        134
2   122        -          -
3   234        122        -
You can use pivot after a groupBy, but first create a column holding the future column names, using row_number over a Window to number the rows consecutively within each ID. Here is one way:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# create the window on ID; row_number needs an orderBy, so order by a constant F.lit(1).
# Note: Spark does not guarantee that this keeps the original row order within each ID
w = Window.partitionBy('ID').orderBy(F.lit(1))
# create the column with future columns name to pivot on
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')))
# groupby the ID and pivot on the created column
.groupBy('ID').pivot('pv')
# in aggregation, you need a function so we use first
.agg(F.first('Brandid')))
and you get
pv_df.show()
+---+---------+---------+---------+
| ID|Brandid_1|Brandid_2|Brandid_3|
+---+---------+---------+---------+
|  1|      234|      122|      134|
|  3|      234|      122|     null|
|  2|      122|     null|     null|
+---+---------+---------+---------+
EDIT: to get the columns in the order the OP requested, you can use lpad. First define the total number of digits you want:
nb_pad = 3
and in the method above replace F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')) with
F.concat(F.lit('Brandid_'), F.lpad(F.row_number().over(w).cast('string'), nb_pad, "0"))
If you don't know how many digits you need (here the numbers are padded to a length of 3), you can compute it from the largest number of rows per ID:
nb_pad = len(str(df.groupBy('ID').count().select(F.max('count')).collect()[0][0]))
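The reason padding matters is that the pivoted column names are strings, so they sort lexicographically. The plain-Python sketch below (not Spark code) shows the effect. The count of 12 rows per ID is a hypothetical value chosen so the problem shows up, and `str.zfill` stands in for `lpad(..., "0")`:

```python
max_rows_per_id = 12  # hypothetical: the largest number of rows any ID has

# Unpadded names sort as strings, so 10 lands before 2.
plain = [f"Brandid_{n}" for n in range(1, max_rows_per_id + 1)]
plain_sorted = sorted(plain)

# Pad to the number of digits of the largest row number.
nb_pad = len(str(max_rows_per_id))
padded = [f"Brandid_{str(n).zfill(nb_pad)}" for n in range(1, max_rows_per_id + 1)]
padded_sorted = sorted(padded)

print(plain_sorted[:4])
print(padded_sorted[:4])
```

With padding, string order and numeric order agree, so the pivoted columns come out in the expected sequence.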

How to get the name of the group with maximum value of parameter? [duplicate]

I have a DataFrame df like this one:
df =
name group influence
A 1 2
B 1 3
C 1 0
A 2 5
D 2 1
For each distinct value of group, I want to extract the value of name that has the maximum value of influence.
The expected result is this one:
group max_name max_influence
1 B 3
2 A 5
I know how to get the max value, but I don't know how to get max_name.
df.groupBy("group").agg(max("influence").as("max_influence"))
A good alternative to groupBy with structs is window functions, which are sometimes considerably faster.
For your example I would try the following:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
val w = Window.partitionBy('group)
val res = df.withColumn("max_influence", max('influence).over(w))
.filter('influence === 'max_influence)
res.show
+----+-----+---------+-------------+
|name|group|influence|max_influence|
+----+-----+---------+-------------+
|   A|    2|        5|            5|
|   B|    1|        3|            3|
+----+-----+---------+-------------+
Now all you need to do is drop the unneeded columns and rename the remaining ones.
Hope it helps.
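The "groupBy with structs" approach mentioned above is not shown in the answer. As I understand it, the idea is to take the max of an (influence, name) pair per group, since pairs compare field by field. The plain-Python sketch below (not Spark code) models that idea with tuples; treat it as an illustration of the comparison rule, not as the Spark API:

```python
# Rows from the question: (name, group, influence)
rows = [("A", 1, 2), ("B", 1, 3), ("C", 1, 0), ("A", 2, 5), ("D", 2, 1)]

# Keep the largest (influence, name) pair per group. Tuples compare
# element by element, so influence decides and name only breaks ties.
best = {}
for name, group, influence in rows:
    candidate = (influence, name)
    if group not in best or candidate > best[group]:
        best[group] = candidate

# Reorder into (group, max_name, max_influence), as in the expected output.
result = [(group, name, influence) for group, (influence, name) in sorted(best.items())]
print(result)
```

Unlike the window filter, this always yields exactly one row per group, with ties on influence resolved by the name ordering.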

Splitting Dataframe into two DataFrame

I have a dataframe that contains both unique and repeated records, judged by number. I want to split it into two dataframes: the first should contain only the unique rows and the second all the repeated rows. For example:
id  name    number
1   Shan    101
2   Shan    101
3   John    102
4   Michel  103
The two resulting dataframes should look like this:
Unique
id  name    number
3   John    102
4   Michel  103
Repeated
id  name    number
1   Shan    101
2   Shan    101
The solution you tried could probably get you there.
Your data looks like this
val df = sc.parallelize(Array(
(1, "Shan", 101),
(2, "Shan", 101),
(3, "John", 102),
(4, "Michel", 103)
)).toDF("id","name","number")
Then you yourself suggest grouping and counting. Since repeats are defined by number, you can do it like this (col comes from org.apache.spark.sql.functions):
val repeatedNumbers = df.groupBy("number").count.where(col("count")>1).withColumnRenamed("number","repeated").drop("count")
then you could actually get all the way by doing something like this afterwards:
val repeated = df.join(repeatedNumbers, repeatedNumbers("repeated")===df("number")).drop("repeated")
val distinct = df.except(repeated)
repeated.show
+---+----+------+
| id|name|number|
+---+----+------+
|  1|Shan|   101|
|  2|Shan|   101|
+---+----+------+
distinct.show
+---+------+------+
| id|  name|number|
+---+------+------+
|  4|Michel|   103|
|  3|  John|   102|
+---+------+------+
Hope it helps.
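The count-then-split idea can be checked quickly in plain Python (not Spark code) on the question's data. The sketch counts how often each number occurs and then routes each row by that count:

```python
from collections import Counter

# Rows from the question: (id, name, number)
rows = [(1, "Shan", 101), (2, "Shan", 101), (3, "John", 102), (4, "Michel", 103)]

# Count occurrences of each number (the groupBy + count step).
counts = Counter(number for _id, _name, number in rows)

# Route each row by how often its number occurs.
repeated = [row for row in rows if counts[row[2]] > 1]
unique = [row for row in rows if counts[row[2]] == 1]

print(repeated)
print(unique)
```

Every row lands in exactly one of the two lists, so the two results together always add up to the original data.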