I have a cars table with the following data:
country  car       price
Germany  Mercedes  30000
Germany  BMW       20000
Germany  Opel      15000
Japan    Honda     20000
Japan    Toyota    15000
I need to get the country, car and price from the table, with the highest price for each country:
country  car       price
Germany  Mercedes  30000
Japan    Honda     20000
I saw a similar question, but the solution there is in SQL; I want the DataFrame DSL equivalent for PySpark (link in case: Get records based on column max value).
You need row_number and a filter to achieve your result, like below:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc

df = spark.createDataFrame(
    [
        ("Germany", "Mercedes", 30000),
        ("Germany", "BMW", 20000),
        ("Germany", "Opel", 15000),
        ("Japan", "Honda", 20000),
        ("Japan", "Toyota", 15000)],
    ("country", "car", "price"))

# Rank cars within each country, highest price first
df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("country").orderBy(desc("price"))))
# Keep only the top-ranked (highest-priced) car per country
df2 = df1.filter(df1.row_num == 1).drop('row_num')
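A quick check of the result (a sketch, assuming the df2 defined above) returns exactly the two rows you expect:

df2.show()
# +-------+--------+-----+
# |country|     car|price|
# +-------+--------+-----+
# |Germany|Mercedes|30000|
# |  Japan|   Honda|20000|
# +-------+--------+-----+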
There is a scenario in which the SCHOOL_GROUP column of streaming data needs to be updated based on a mapping table (a static dataframe).
Matching logic needs to be applied on the AREA and SCHOOL_GROUP columns of the streaming DF (teachersInfoDf) against the SPLIT_CRITERIA and SCHOOL_CODE columns of the static DF (mappingDf) to fetch SCHOOL.
teachersInfoDf (Streaming Data):
FNAME     LNAME  DOB   GENDER  SALARY  SCHOOL_GROUP  AREA
Williams  Kylie  1996  M       2000    ABCD          CNTRL-1
Maria     Brown  1992  F       2000    ABCD          CNTRL-5
John      Snow   1997  M       5000    XYZA          MM-RLBH1
Merry     Ely    1993  F       1000    PQRS          J-20
Michael   Rose   1998  M       1000    XYZA          DAY-20
Andrew    Simen  1990  M       1000    STUV          LVL-20
John      Dear   1997  M       5000    PQRS          P-RLBH1
mappingDf (Mapping Table data-Static):
SCHOOL_CODE  SPLIT_CRITERIA                                                                                 SCHOOL
ABCD         (AREA LIKE 'CNTRL-%')                                                                          GROUP-1
XYZA         (AREA IN ('KK-DSK','DAY-20','MM-RLBH1','KM-RED1','NN-RLBH2'))                                  MULTI
PQRS         (AREA LIKE 'P-%' OR AREA LIKE 'V-%' OR AREA LIKE 'J-%')                                        WEST
STUV         (AREA NOT IN ('SS_ENDO2','SS_GRTGED','SS_GRTMMU','PC_ENDO1','PC_ENDO2','GRTENDO','GRTENDO1'))  CORE
Required Dataframe:
FNAME     LNAME  DOB   GENDER  SALARY  SCHOOL_GROUP  AREA
Williams  Kylie  2006  M       2000    GROUP-1       CNTRL-1
Maria     Brown  2002  F       2000    GROUP-1       CNTRL-5
John      Snow   2007  M       5000    MULTI         MM-RLBH1
Merry     Ely    2003  F       1000    WEST          J-20
Michael   Rose   2002  M       1000    MULTI         DAY-20
Andrew    Simen  2008  M       1000    CORE          LVL-20
John      Dear   2007  M       5000    WEST          P-RLBH1
How can I achieve that using Spark SQL?
(I know we can't show streaming data like this; the streaming DF examples are for reference only.)
(For now, I created a static DF to apply the logic.)
I am using the approach below but getting an error:
def deriveSchoolOnArea: UserDefinedFunction = udf((area: String, SPLIT_CRITERIA: String, SCHOOL: String) => {
  if (area == null || SPLIT_CRITERIA == null || SCHOOL == null) {
    return null
  }
  val splitCriteria = SPLIT_CRITERIA.replace("AREA", area)
  val query = """select """" + SCHOOL + """" AS SCHOOL from dual where """ + splitCriteria
  print(query)

  val dualDf = spark.sparkContext.parallelize(Seq("dual")).toDF()
  dualDf.createOrReplaceGlobalTempView("dual")
  print("View Created")

  val finalHosDf = spark.sql(query)
  print("Query Executed")

  var finalSchool = ""
  if (finalHosDf.isEmpty) {
    return null
  } else {
    finalSchool = finalHosDf.select(col("SCHOOL")).first.getString(0)
  }
  print(finalSchool)
  finalSchool
})
val dfJoin = teachersInfoDf.join(mappingDf,mappingDf("SCHOOL_CODE") === teachersInfoDf("SCHOOL_GROUP"), "left")
val dfJoin2 = dfJoin.withColumn("SCHOOL_GROUP", coalesce(deriveSchoolOnArea(col("area"), col("SPLIT_CRITERIA"), col("SCHOOL")), col("SCHOOL_GROUP")))
dfJoin2.show(false)
But I am getting the error below:
dfJoin2.show(false)
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2459)
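The Task not serializable error comes from referencing the SparkSession (spark.sql, spark.sparkContext) inside the UDF: the UDF closure is shipped to the executors, and the session cannot be serialized or used there. One possible workaround, shown as a minimal PySpark sketch rather than the question's Scala and not part of the original code: since mappingDf is small and static, collect it on the driver and turn each SPLIT_CRITERIA string into an expr() condition chained with when(), so no session is needed inside the row-level logic. Column and variable names are taken from the question.

from pyspark.sql import functions as F

# Build one chained CASE-like column from the static mapping rows
# (assumes mappingDf is small enough to collect to the driver).
school_col = None
for row in mappingDf.collect():
    cond = (F.col("SCHOOL_GROUP") == row["SCHOOL_CODE"]) & F.expr(row["SPLIT_CRITERIA"])
    school_col = (F.when(cond, F.lit(row["SCHOOL"])) if school_col is None
                  else school_col.when(cond, F.lit(row["SCHOOL"])))

# Fall back to the original SCHOOL_GROUP when no criteria matches.
result = teachersInfoDf.withColumn(
    "SCHOOL_GROUP", F.coalesce(school_col, F.col("SCHOOL_GROUP")))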
I am very new to Spark and Scala programming and I have a problem that I hope some smart people can help me to solve.
I have a table named users with 4 columns: status, user_id, name, date
Rows are:
status user_id name date
active 1 Peter 2020-01-01
active 2 John 2020-01-01
active 3 Alex 2020-01-01
inactive 1 Peter 2020-02-01
inactive 2 John 2020-01-01
I need to select only the active users. Two users were inactivated, but only one was inactivated for the same date as his active row.
What I aim to do is filter out rows with inactive status (this I can do) and also filter out inactivated users whose inactivation row matches the columns of their active row. Peter was inactivated for a different date, so he is not filtered out. The desired result would be:
1 Peter 2020-01-01
3 Alex 2020-01-01
Rows with inactive status are filtered out. John was inactivated for the same date, so his active row is filtered out too.
The closest I have come is filtering out users that have inactive status:
val users = spark.table("db.users")
.filter(col("status").notEqual("Inactive"))
.select("user_id", "name", "date")
Any ideas or suggestions how to solve this?
Thanks!
First find the inactive rows with a group by on user and date, then join that result back to the original df and keep only the rows with no match.
val df2 = df.groupBy('user_id, 'date).agg(max('status).as("status"))
.filter("status = 'inactive'")
.withColumnRenamed("status", "inactive")
df.join(df2, Seq("user_id", "date"), "left")
.filter('inactive.isNull)
.select(df.columns.head, df.columns.tail: _*)
.show()
+------+-------+-----+----------+
|status|user_id| name| date|
+------+-------+-----+----------+
|active| 1|Peter|2020-01-01|
|active| 3| Alex|2020-01-01|
+------+-------+-----+----------+
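The same left join + isNull filter can also be expressed more directly as a left anti join. A minimal PySpark sketch of that idea (variable names are mine, mirroring the df above):

from pyspark.sql import functions as F

# (user_id, date) pairs that have an inactive record.
inactive = df.filter(F.col("status") == "inactive").select("user_id", "date")

# Keep only rows whose (user_id, date) never appears as inactive.
active_only = df.join(inactive, ["user_id", "date"], "left_anti")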
The code below is resulting in issues because it has a column named after the keyword limit. If I remove the column 'limit' from the select list, the script works fine.
There are two tables, A and B. Table A has the following contents:

ID  Day         Name     Description  limit
1   2016-09-01  Sam      Retail       100
2   2016-01-28  Chris    Retail       200
3   2016-02-06  ChrisTY  Retail       50
4   2016-02-26  Christa  Retail       10
3   2016-12-06  ChrisTu  Retail       200
4   2016-12-31  Christi  Retail       500
Table B has the following contents:

ID  SkEY
1   1.1
2   1.2
3   1.3
Tried code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

ABC2 = spark.sql(
    "select * From A where day ='{0}'".format(i[0])
)

Join = ABC2.join(
    Tab2,
    (
        ABC2.ID == Tab2.ID
    )
)\
    .select(
        Tab2.skey,
        ABC2.Day,
        ABC2.Name,
        ABC2.limit,
    )\
    .withColumn('newcol1', lit(''))\
    .withColumn('newcol2', lit('A'))

ABC2.show()

ABC = spark.sql(
    "select distinct day from A where day= '2016-01-01' "
)
Expected result: the join output should include the limit column as well. How can we amend the code so that limit is also selected?
It worked this way. I am not sure of the functional reason, but it was successful: rename limit to an alias first and then alias it back afterwards.

Working code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

ABC2 = spark.sql(
    "select Day,Name,Description,limit as liu From A where day ='{0}'".format(i[0])
)

Join = ABC2.join(Tab2, (ABC2.ID == Tab2.ID))\
    .selectExpr(
        "skey as skey",
        "Day as Day",
        "Name as Name",
        "liu as limit",
    )\
    .withColumn('newcol1', lit(''))\
    .withColumn('newcol2', lit('A'))

ABC2.show()
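A simpler alternative (my suggestion, not part of the original answer) is to keep the column named limit and avoid attribute access: ABC2.limit resolves to the DataFrame.limit method rather than the column, but col("limit") works in the DataFrame API, and a backtick-escaped name works in SQL. A hypothetical sketch reusing the names from the question (ABC2, Tab2):

from pyspark.sql.functions import col, lit

# Select the "limit" column via col() instead of the ABC2.limit attribute.
Join = ABC2.join(Tab2, ABC2.ID == Tab2.ID) \
    .select(Tab2.skey, ABC2.Day, ABC2.Name, col("limit")) \
    .withColumn("newcol1", lit("")) \
    .withColumn("newcol2", lit("A"))

# In SQL, escape the name with backticks instead of renaming it:
ABC3 = spark.sql("select ID, Day, Name, Description, `limit` from A")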
I have 2 dataframes as below.
The goal is to find the new rows in df2 whose column values do not exist in dataframe 1.
I have tried joining the two dataframes with id as the join condition and checking that the other column values are not equal, as below.
But it does not work.
Could someone please assist?
df1: This dataframe is like a master table
id amt city date
abc 100 City1 9/26/2018
abc 100 City1 9/25/2018
def 200 City2 9/26/2018
ghi 300 City3 9/26/2018
df2: Dataframe 2, a new dataset that comes in every day.
id amt city date
abc 100 City1 9/27/2018
def null City2 9/26/2018
ghi 300 City3 9/26/2018
Result: Come up with a result dataframe as below:
id amt city date
abc 100 City1 9/27/2018
def null City2 9/26/2018
Code I tried:
val writeDF = df1.join(df2, df1.col("id") === df2.col("id"))
  .where(df1.col("amt") =!= df2.col("amt"))
  .where(df1.col("city") =!= df2.col("city"))
  .where(df1.col("date") =!= df2.col("date"))
  .select($"df2.*")
DataFrame method df1.except(df2) will return all of the rows in df1 that are not present in df2.
Source: Spark 2.2.0 Docs
The except method can be used, as described in the Scala docs:
dataFrame1.except(dataFrame2)
will return another dataframe containing the rows of dataFrame1 that are not in dataFrame2.
You need to use the except method to achieve this:
df2.except(df1).show
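In PySpark the equivalent is subtract (except is a Python keyword; exceptAll keeps duplicates). A quick sketch with the question's sample data, which returns exactly the two expected rows:

# Build the sample frames from the question and diff them.
df1 = spark.createDataFrame(
    [("abc", 100, "City1", "9/26/2018"),
     ("abc", 100, "City1", "9/25/2018"),
     ("def", 200, "City2", "9/26/2018"),
     ("ghi", 300, "City3", "9/26/2018")],
    ["id", "amt", "city", "date"])

df2 = spark.createDataFrame(
    [("abc", 100, "City1", "9/27/2018"),
     ("def", None, "City2", "9/26/2018"),
     ("ghi", 300, "City3", "9/26/2018")],
    ["id", "amt", "city", "date"])

# Rows of df2 not present in df1: the new abc row and the def row with null amt.
df2.subtract(df1).show()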
aggregrated_table = df_input.groupBy('city', 'income_bracket') \
.agg(
count('suburb').alias('suburb'),
sum('population').alias('population'),
sum('gross_income').alias('gross_income'),
sum('no_households').alias('no_households'))
I would like to group by city and income bracket, but within each city certain suburbs have different income brackets. How do I group by the most frequently occurring income bracket per city?
for example:
city1 suburb1 income_bracket_10
city1 suburb1 income_bracket_10
city1 suburb2 income_bracket_10
city1 suburb3 income_bracket_11
city1 suburb4 income_bracket_10
Would be grouped by income_bracket_10
Using a window function before aggregating might do the trick:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w = Window.partitionBy('city')
aggregrated_table = df_input.withColumn(
"count",
psf.count("*").over(w)
).withColumn(
"rn",
psf.row_number().over(w.orderBy(psf.desc("count")))
).filter("rn = 1").groupBy('city', 'income_bracket').agg(
psf.count('suburb').alias('suburb'),
psf.sum('population').alias('population'),
psf.sum('gross_income').alias('gross_income'),
psf.sum('no_households').alias('no_households'))
You can also use a window function after aggregating, since you are keeping a count of (city, income_bracket) occurrences; see the sketch below.
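A minimal sketch of that "window after aggregating" variant (my own illustration, assuming the same df_input columns as the question; the suburb count doubles as the occurrence count per bracket):

from pyspark.sql import Window
import pyspark.sql.functions as psf

# Aggregate per (city, income_bracket) first ...
per_bracket = df_input.groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'))

# ... then keep only the most frequent bracket per city.
w = Window.partitionBy('city').orderBy(psf.desc('suburb'))
aggregrated_table = (per_bracket
    .withColumn('rn', psf.row_number().over(w))
    .filter('rn = 1')
    .drop('rn'))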
You don't necessarily need Window functions:
import pyspark.sql.functions as F

aggregrated_table = (
    df_input.groupby("city", "suburb", "income_bracket")
    .count()
    .withColumn("count_income", F.array("count", "income_bracket"))
    .groupby("city", "suburb")
    .agg(F.max("count_income").getItem(1).alias("most_common_income_bracket"))
)
I think this does what you require. I don't really know if it performs better than the window based solution.
For PySpark version >= 3.4 you can use the mode function directly to get the most frequent element per group:
from pyspark.sql import functions as f
df = spark.createDataFrame([
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
    schema=("course", "year", "earnings"))

df.groupby("course").agg(f.mode("year")).show()
+------+----------+
|course|mode(year)|
+------+----------+
| Java| 2012|
|dotNET| 2012|
+------+----------+
https://github.com/apache/spark/blob/7f1b6fe02bdb2c68d5fb3129684ca0ed2ae5b534/python/pyspark/sql/functions.py#L379
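Applied to the question's df_input (assuming Spark >= 3.4 and the column names from the question), this collapses to a one-liner:

# Most frequent income bracket per city, using the built-in mode aggregate.
most_common = df_input.groupBy("city").agg(
    f.mode("income_bracket").alias("most_common_income_bracket"))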
The solution by mfcabrera gave wrong results when F.max was used on the F.array column, because the values in the ArrayType are treated as strings and the integer max did not work as expected.
The solution below worked:
from pyspark.sql import Window
from pyspark.sql import functions as f

w = Window.partitionBy('city', 'suburb').orderBy(f.desc('count'))

aggregrated_table = (
    input_df.groupby('city', 'suburb', 'income_bracket')
    .count()
    .withColumn('max_income', f.row_number().over(w))
    .filter(f.col('max_income') == 1).drop('max_income')
)
aggregrated_table.display()