Hi, I'm trying to sum the values of one column across all rows of a DataFrame where 'ID' matches.
For example
ID  Gender  value
1   Male    5
1   Male    6
2   Female  3
3   Female  0
3   Female  9
4   Male    10
How do I get the following table?
ID  Gender  value
1   Male    11
2   Female  3
3   Female  9
4   Male    10
In the example above, the ID with value 1 is now shown just once and its values have been summed up (same for ID 3).
Thanks
I'm new to PySpark and still learning. I've tried count(), select and groupBy(), but nothing has produced what I'm trying to do.
try this:
from pyspark.sql import Window
from pyspark.sql import functions as f

df = (
    df
    .withColumn('value', f.sum(f.col('value')).over(Window.partitionBy(f.col('ID'))))
)
Link to the documentation for Window operations: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.Window.html
You can use a simple groupBy, with the sum function:
from pyspark.sql import functions as F
(
df
.groupby("ID", 'Gender') # sum rows with same ID and Gender
# .groupby("ID") # use this line instead if you want to sum rows with the same ID, even if they have different Gender
.agg(F.sum('value').alias('value'))
)
The result is:
+---+------+-----+
| ID|Gender|value|
+---+------+-----+
| 1| Male| 11|
| 2|Female| 3|
| 3|Female| 9|
| 4| Male| 10|
+---+------+-----+
This question already has answers here:
How to join two DataFrames and change column for missing values?
(3 answers)
How to do left outer join in spark sql?
(3 answers)
Closed 5 years ago.
I have 2 DataFrames.
df1 =
dep-code  rank
abc       1
bcd       2
df2 =
some cols...  dep-code
              abc
              bcd
              abc
I want to add a new column rank to df2, matching on df1.dep-code = df2.dep-code.
result -
some cols...  dep-code  rank
              abc       1
              bcd       2
              abc       1
That's a simple join:
df2.join(df1, "dep-code")
With the following inputs:
df1 with the join column and the desired rank column:
+--------+----+
|dep-code|rank|
+--------+----+
| abc| 1|
| bcd| 2|
+--------+----+
df2 with the join column plus an extra one (aColumn):
+----------+--------+
| aColumn|dep-code|
+----------+--------+
| some| abc|
| someother| bcd|
|yetAnother| abc|
+----------+--------+
You'll retrieve the output below:
+--------+----------+----+
|dep-code| aColumn|rank|
+--------+----------+----+
| abc| some| 1|
| abc|yetAnother| 1|
| bcd| someother| 2|
+--------+----------+----+
This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have a DataFrame df like this one:
df =
name  group  influence
A     1      2
B     1      3
C     1      0
A     2      5
D     2      1
For each distinct value of group, I want to extract the value of name that has the maximum value of influence.
The expected result is this one:
group  max_name  max_influence
1      B         3
2      A         5
I know how to get the max value but I don't know how to get max_name.
df.groupBy("group").agg(max("influence").as("max_influence"))
There is a good alternative to groupBy with structs: window functions, which are sometimes really faster.
For your example I would try the following:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

val w = Window.partitionBy('group)
val res = df.withColumn("max_influence", max('influence).over(w))
  .filter('influence === 'max_influence)
res.show
+----+-----+---------+-------------+
|name|group|influence|max_influence|
+----+-----+---------+-------------+
| A| 2| 5| 5|
| B| 1| 3| 3|
+----+-----+---------+-------------+
Now all you need to do is drop the useless columns and rename the remaining ones.
Hope it'll help.
I have a PySpark DataFrame (the original DataFrame) with the data below (all columns have string datatype):
id  Value
1   103
2   1504
3   1
I need to create a new, modified DataFrame with padding in the value column, so that the length of this column is 4 characters. If the length is less than 4 characters, add 0's to the data as shown below:
id  Value
1   0103
2   1504
3   0001
Can someone help me out? How can I achieve it using a PySpark DataFrame? Any help will be appreciated.
You can use lpad from the functions module:
from pyspark.sql.functions import lpad
>>> df.select('id',lpad(df['value'],4,'0').alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| 0103|
| 2| 1504|
| 3| 0001|
+---+-----+
Using PySpark lpad function in conjunction with withColumn:
import pyspark.sql.functions as F
dfNew = dfOrigin.withColumn('Value', F.lpad(dfOrigin['Value'], 4, '0'))
This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have a dataframe df as mentioned below:
customers  product  val_id  rule_name  rule_id  priority
1          A        1       ABC        123      1
3          Z        r       ERF        789      2
2          B        X       ABC        123      2
2          B        X       DEF        456      3
1          A        1       DEF        456      2
I want to create a new dataframe df2 with only unique customer ids. Since the rule_name and rule_id columns differ for the same customer, I want to pick the records with the highest priority for each customer, so my final outcome should be:
customers  product  val_id  rule_name  rule_id  priority
1          A        1       ABC        123      1
3          Z        r       ERF        789      2
2          B        X       ABC        123      2
Can anyone please help me achieve this using Spark Scala? Any help will be appreciated.
You basically want to select rows with extreme values in a column. This is a really common issue, so there's even a whole tag greatest-n-per-group. Also see this question SQL Select only rows with Max Value on a Column which has a nice answer.
Here's an example for your specific case.
Note that this could select multiple rows for a customer, if there are multiple rows for that customer with the same (minimum) priority value.
This example is in PySpark, but it should be straightforward to translate to Scala:
from pyspark.sql import functions as F

# find the best (minimum) priority for each customer; this DF has only two columns
cusPriDF = df.groupBy("customers").agg( F.min(df["priority"]).alias("priority") )
# now join back to choose only those rows and get all columns back
bestRowsDF = df.join(cusPriDF, on=["customers","priority"], how="inner")
To create df2 you have to first order df by priority and then find unique customers by id, like this:
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(col => first(col).as(col))
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail:_*)
df2.show
It would give you expected output:
+----------+--------+-------+----------+--------+---------+
| customers| product| val_id| rule_name| rule_id| priority|
+----------+--------+-------+----------+--------+---------+
| 1| A| 1| ABC| 123| 1|
| 3| Z| r| ERF| 789| 2|
| 2| B| X| ABC| 123| 2|
+----------+--------+-------+----------+--------+---------+
Corey beat me to it, but here's the Scala version:
val df = Seq(
(1,"A","1","ABC",123,1),
(3,"Z","r","ERF",789,2),
(2,"B","X","ABC",123,2),
(2,"B","X","DEF",456,3),
(1,"A","1","DEF",456,2)).toDF("customers","product","val_id","rule_name","rule_id","priority")
val priorities = df.groupBy("customers").agg( min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers","priority"), "inner")
+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
| 1| 1| A| 1| ABC| 123|
| 3| 2| Z| r| ERF| 789|
| 2| 2| B| X| ABC| 123|
+---------+--------+-------+------+---------+-------+
You will have to use min aggregation on priority column grouping the dataframe by customers and then inner join the original dataframe with the aggregated dataframe and select the required columns.
val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
.withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show
You should have the desired result.