I have two different columns, Gender and Country:
Gender : Men, Women
Country : America, India, Australia
I need to find out, in PySpark, what percentage of men belongs to each country (India, America, Australia), and likewise what percentage of women belongs to each country.
Thanks in advance.
Here is one way to do it:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [
        ("female", "America"),
        ("female", "India"),
        ("female", "Australia"),
        ("male", "America"),
        ("male", "America"),
        ("female", "Australia"),
        ("male", "Australia"),
        ("male", "India"),
        ("male", "India"),
    ],
    ["Gender", "Country"],
)
df.show()

percent_df = (
    df.groupBy("Gender", "Country")
    # how many rows of each gender fall in each country
    .agg(F.count("Gender").alias("gender_country_count"))
    # total rows per gender, attached to every (Gender, Country) row
    .withColumn(
        "total_gender_count",
        F.sum("gender_country_count").over(Window.partitionBy("Gender")),
    )
    # share of each country within the gender, as a percentage
    .withColumn(
        "gender_percent",
        (F.col("gender_country_count") / F.col("total_gender_count")) * 100,
    )
    # .drop("gender_country_count")
    # .drop("total_gender_count")
)
percent_df.show()
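With the sample data above, the percentages work out as follows (the exact row order from show() may differ):
+------+---------+--------------------+------------------+--------------+
|Gender|  Country|gender_country_count|total_gender_count|gender_percent|
+------+---------+--------------------+------------------+--------------+
|female|  America|                   1|                 4|          25.0|
|female|    India|                   1|                 4|          25.0|
|female|Australia|                   2|                 4|          50.0|
|  male|  America|                   2|                 5|          40.0|
|  male|Australia|                   1|                 5|          20.0|
|  male|    India|                   2|                 5|          40.0|
+------+---------+--------------------+------------------+--------------+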
I have a cars table with this data:
country    car       price
Germany    Mercedes  30000
Germany    BMW       20000
Germany    Opel      15000
Japan      Honda     20000
Japan      Toyota    15000
I need to get the country, car and price from the table, with the highest price for each country:
country    car       price
Germany    Mercedes  30000
Japan      Honda     20000
I saw a similar question, but the solution there is in SQL; I want the DSL format of that for PySpark dataframes (link in case: Get records based on column max value).
You need row_number and a filter to achieve your result, like below:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc

df = spark.createDataFrame(
    [
        ("Germany", "Mercedes", 30000),
        ("Germany", "BMW", 20000),
        ("Germany", "Opel", 15000),
        ("Japan", "Honda", 20000),
        ("Japan", "Toyota", 15000),
    ],
    ("country", "car", "price"),
)

# rank cars within each country by price, highest first
df1 = df.withColumn(
    "row_num",
    row_number().over(Window.partitionBy("country").orderBy(desc("price"))),
)

# keep only the top-priced car per country
df2 = df1.filter(df1.row_num == 1).drop("row_num")
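Calling df2.show() should then return the two rows from your expected output:
df2.show()
+-------+--------+-----+
|country|     car|price|
+-------+--------+-----+
|Germany|Mercedes|30000|
|  Japan|   Honda|20000|
+-------+--------+-----+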
My base dataframe looks like this:
HeroNamesDF
id gen name surname supername
1 1 Clarc Kent BATMAN
2 1 Bruce Smith BATMAN
3 2 Clark Kent SUPERMAN
And then I have another one with the corrections: CorrectionsDF
id gen attribute value
1 1 supername SUPERMAN
1 1 name Clark
2 1 surname Wayne
My approach to the problem was to do this:
CorrectionsDF.select("id", "gen").distinct().collect().map(r => {
  val id = r(0)
  val gen = r(1)
  val corrections = CorrectionsDF.filter(col("id") === lit(id) and col("gen") === lit(gen))
  val candidates = HeroNamesDF.filter(col("id") === lit(id) and col("gen") === lit(gen))
  candidates.columns.map(column => {
    val change = corrections.where(col("attribute") === lit(column)).select("id", "gen", "value")
    candidates.select("id", "gen", column)
      .join(change, Seq("id", "gen"), "full")
      .withColumn(column, when(col("value").isNotNull, col("value")).otherwise(col(column)))
      .drop("value")
  }).reduce((df1, df2) => df1.join(df2, Seq("id", "gen")))
})
Expected output:
id gen name surname supername
1 1 Clark Kent SUPERMAN
2 1 Bruce Wayne BATMAN
3 2 Clark Kent SUPERMAN
And I would like to get rid of the .collect() but I can't make it work.
If I understood the example correctly, one left join combined with a group by should be sufficient in your case. With the group by we generate a map, using
collect_list and map_from_arrays, which contains the corrections for every id/gen pair, e.g. {"name" : "Clark", "supername" : "SUPERMAN"} for id=1/gen=1:
import org.apache.spark.sql.functions.{collect_list, map_from_arrays, coalesce, first}
import spark.implicits._

val hdf = (load hero df)
val cdf = (load corrections df)

hdf.join(cdf, Seq("id", "gen"), "left")
  .groupBy(hdf("id"), hdf("gen"))
  .agg(
    map_from_arrays(
      collect_list("attribute"), // the keys
      collect_list("value")      // the values
    ).as("m"),
    first("name").as("name"),
    first("surname").as("surname"),
    first("supername").as("supername")
  )
  .select(
    $"id",
    $"gen",
    coalesce($"m".getItem("name"), $"name").as("name"),
    coalesce($"m".getItem("surname"), $"surname").as("surname"),
    coalesce($"m".getItem("supername"), $"supername").as("supername")
  )
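For reference, a rough PySpark sketch of the same idea (assuming HeroNamesDF and CorrectionsDF are available as PySpark dataframes with the columns shown above) might look like this:

from pyspark.sql import functions as F

result = (
    HeroNamesDF.join(CorrectionsDF, ["id", "gen"], "left")
    .groupBy("id", "gen")
    .agg(
        F.map_from_arrays(
            F.collect_list("attribute"),  # the keys
            F.collect_list("value"),      # the values
        ).alias("m"),
        F.first("name").alias("name"),
        F.first("surname").alias("surname"),
        F.first("supername").alias("supername"),
    )
    .select(
        "id",
        "gen",
        F.coalesce(F.col("m")["name"], F.col("name")).alias("name"),
        F.coalesce(F.col("m")["surname"], F.col("surname")).alias("surname"),
        F.coalesce(F.col("m")["supername"], F.col("supername")).alias("supername"),
    )
)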
I'm looking for a way to do this without a UDF, and I am wondering if it's possible. Let's say I have a DF as follows:
Buyer_name Buyer_state CoBuyer_name CoBuyers_state Price Date
Bob CA Joe CA 20 010119
Stacy IL Jamie IL 50 020419
... about 3 million more rows...
And I want to turn it to:
Buyer_name Buyer_state Price Date
Bob CA 20 010119
Joe CA 20 010119
Stacy IL 50 020419
Jamie IL 50 020419
...
Edit: I could also:
Create two dataframes, removing the "Buyer" columns from one and the "CoBuyer" columns from the other.
Rename the "CoBuyer" columns in the second dataframe to the "Buyer" column names.
Concatenate both dataframes.
You can group struct(Buyer_name, Buyer_state) and struct(CoBuyer_name, CoBuyer_state) into an Array which is then expanded using explode, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("Bob", "CA", "Joe", "CA", 20, "010119"),
("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")
df.
withColumn("Buyers", array(
struct($"Buyer_name".as("_1"), $"Buyer_state".as("_2")),
struct($"CoBuyer_name".as("_1"), $"CoBuyer_state".as("_2"))
)).
withColumn("Buyer", explode($"Buyers")).
select(
$"Buyer._1".as("Buyer_name"), $"Buyer._2".as("Buyer_state"), $"Price", $"Date"
).show
// +----------+-----------+-----+------+
// |Buyer_name|Buyer_state|Price| Date|
// +----------+-----------+-----+------+
// | Bob| CA| 20|010119|
// | Joe| CA| 20|010119|
// | Stacy| IL| 50|020419|
// | Jamie| IL| 50|020419|
// +----------+-----------+-----+------+
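If you need the same thing from PySpark, a rough equivalent of this array/struct/explode approach (assuming a PySpark dataframe df with the same columns) would be:

from pyspark.sql import functions as F

result = (
    df.withColumn(
        "Buyer",
        F.explode(F.array(
            F.struct(F.col("Buyer_name").alias("name"), F.col("Buyer_state").alias("state")),
            F.struct(F.col("CoBuyer_name").alias("name"), F.col("CoBuyer_state").alias("state")),
        )),
    )
    .select(
        F.col("Buyer.name").alias("Buyer_name"),
        F.col("Buyer.state").alias("Buyer_state"),
        "Price",
        "Date",
    )
)
result.show()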
This sounds like an unpivot operation to me, which can be accomplished with the union function in Scala:
val df = Seq(
("Bob", "CA", "Joe", "CA", 20, "010119"),
("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")
val df_new = df.select("Buyer_name", "Buyer_state", "Price", "Date").union(df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date"))
df_new.show
Thanks to Leo for providing the dataframe definition which I've re-used.
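The PySpark version of the union approach is essentially the same; a minimal sketch, assuming a PySpark dataframe df with the columns from the question:

# union is positional, so the result keeps the Buyer_* column names
df_new = df.select("Buyer_name", "Buyer_state", "Price", "Date").union(
    df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date")
)
df_new.show()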
I have a dataframe DF with the following structure :
ID, DateTime, Latitude, Longitude, otherArgs
I want to group my data by ID and time window, and keep information about the location (for example, the mean of the grouped latitudes and the mean of the grouped longitudes).
I successfully got a new dataframe with data grouped by ID and time using:
DF.groupBy($"ID",window($"DateTime","2 minutes")).agg(max($"ID"))
But I lose my location data doing that.
What I am looking for is something that would look like this for example:
DF.groupBy($"ID",window($"DateTime","2 minutes"),mean("latitude"),mean("longitude")).agg(max($"ID"))
Returning only one row for each ID and time window.
EDIT :
Sample input :
DF : ID, DateTime, Latitude, Longitude, otherArgs
0 , 2018-01-07T04:04:00 , 25.000, 55.000, OtherThings
0 , 2018-01-07T04:05:00 , 26.000, 56.000, OtherThings
1 , 2018-01-07T04:04:00 , 26.000, 50.000, OtherThings
1 , 2018-01-07T04:05:00 , 27.000, 51.000, OtherThings
Sample output :
DF : ID, window(DateTime), Latitude, Longitude
0 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 25.5, 55.5
1 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 26.5, 50.5
Here is what you can do: you need to use mean with the aggregation.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType
import spark.implicits._

val df = Seq(
  (0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
  (0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
  (1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
  (1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
  // convert String to DateType for DateTime
  .withColumn("DateTime", $"DateTime".cast(DateType))
df.groupBy($"id", window($"DateTime", "2 minutes"))
.agg(
mean("Latitude").as("lat"),
mean("Longitude").as("long")
)
.show(false)
Output:
+---+---------------------------------------------+----+----+
|id |window |lat |long|
+---+---------------------------------------------+----+----+
|1 |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|26.5|50.5|
|0 |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|25.5|55.5|
+---+---------------------------------------------+----+----+
You should use the .agg() method for the aggregation.
Perhaps this is what you mean?
DF
.groupBy(
'ID,
window('DateTime, "2 minutes")
)
.agg(
mean("latitude").as("latitudeMean"),
mean("longitude").as("longitudeMean")
)
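In case you need it from PySpark, a rough equivalent (assuming a dataframe df with the same columns, and casting DateTime to a timestamp so the 2-minute window can be computed) would be:

from pyspark.sql import functions as F

result = (
    df.withColumn("DateTime", F.col("DateTime").cast("timestamp"))
    .groupBy("ID", F.window("DateTime", "2 minutes"))
    .agg(
        F.mean("Latitude").alias("latitudeMean"),
        F.mean("Longitude").alias("longitudeMean"),
    )
)
result.show(truncate=False)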
from pyspark.sql.functions import count, sum

aggregrated_table = df_input.groupBy('city', 'income_bracket') \
    .agg(
        count('suburb').alias('suburb'),
        sum('population').alias('population'),
        sum('gross_income').alias('gross_income'),
        sum('no_households').alias('no_households'))
I would like to group by city and income bracket, but within each city certain suburbs have different income brackets. How do I group by the most frequently occurring income bracket per city?
for example:
city1 suburb1 income_bracket_10
city1 suburb1 income_bracket_10
city1 suburb2 income_bracket_10
city1 suburb3 income_bracket_11
city1 suburb4 income_bracket_10
city1 would be grouped under income_bracket_10.
Using a window function before aggregating might do the trick:
from pyspark.sql import Window
import pyspark.sql.functions as psf

w = Window.partitionBy('city')

aggregrated_table = df_input.withColumn(
    "count",  # how often each (city, income_bracket) pair occurs
    psf.count("*").over(Window.partitionBy('city', 'income_bracket'))
).withColumn(
    "rn",  # rank the brackets within each city by that count
    psf.dense_rank().over(w.orderBy(psf.desc("count")))
).filter("rn = 1").groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'))
You can also use a window function after aggregating, since the aggregation keeps a count of (city, income_bracket) occurrences; see the sketch below.
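For example, a minimal sketch of that after-aggregation variant, reusing psf and Window from above and assuming one input row per suburb, so that count('suburb') doubles as the occurrence count:

agg = df_input.groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'))

# keep, per city, the income bracket with the most suburbs
most_common = agg.withColumn(
    "rn",
    psf.row_number().over(Window.partitionBy('city').orderBy(psf.desc('suburb')))
).filter("rn = 1").drop("rn")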
You don't necessarily need Window functions:
import pyspark.sql.functions as F

aggregrated_table = (
    df_input.groupby("city", "suburb", "income_bracket")
    .count()
    .withColumn("count_income", F.array("count", "income_bracket"))
    .groupby("city", "suburb")
    .agg(F.max("count_income").getItem(1).alias("most_common_income_bracket"))
)
I think this does what you require. I don't really know if it performs better than the window based solution.
For PySpark version >= 3.4, you can use the mode function directly to get the most frequent element per group:
from pyspark.sql import functions as f

df = spark.createDataFrame([
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
    schema=("course", "year", "earnings"))

df.groupby("course").agg(f.mode("year")).show()
+------+----------+
|course|mode(year)|
+------+----------+
| Java| 2012|
|dotNET| 2012|
+------+----------+
https://github.com/apache/spark/blob/7f1b6fe02bdb2c68d5fb3129684ca0ed2ae5b534/python/pyspark/sql/functions.py#L379
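Applied to the columns from the question, that would be roughly as follows (assuming the df_input dataframe from above and PySpark >= 3.4):

from pyspark.sql import functions as f

# most frequently occurring income bracket per city
df_input.groupBy("city").agg(
    f.mode("income_bracket").alias("most_common_income_bracket")
).show()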
The solution by mfcabrera gave wrong results when F.max was used on the F.array column, because the values in the ArrayType are treated as strings and integer max didn't work as expected.
The solution below worked:
from pyspark.sql import Window
from pyspark.sql import functions as f

w = Window.partitionBy('city', 'suburb').orderBy(f.desc("count"))

aggregrated_table = (
    input_df.groupby("city", "suburb", "income_bracket")
    .count()
    .withColumn("max_income", f.row_number().over(w))
    .filter(f.col("max_income") == 1).drop("max_income")
)
aggregrated_table.display()