How to execute Spark SQL using withColumn on a streaming dataframe? - scala

There is a scenario in which the SCHOOL_GROUP column of the streaming data needs to be updated based on a mapping table (a static dataframe).
Matching logic needs to be applied between the AREA and SCHOOL_GROUP columns of the streaming DF (teachersInfoDf) and the SPLIT_CRITERIA and SCHOOL_CODE columns of the static DF (mappingDf) to fetch SCHOOL.
teachersInfoDf (Streaming Data):

FNAME    | LNAME | DOB  | GENDER | SALARY | SCHOOL_GROUP | AREA
Williams | Kylie | 1996 | M      | 2000   | ABCD         | CNTRL-1
Maria    | Brown | 1992 | F      | 2000   | ABCD         | CNTRL-5
John     | Snow  | 1997 | M      | 5000   | XYZA         | MM-RLBH1
Merry    | Ely   | 1993 | F      | 1000   | PQRS         | J-20
Michael  | Rose  | 1998 | M      | 1000   | XYZA         | DAY-20
Andrew   | Simen | 1990 | M      | 1000   | STUV         | LVL-20
John     | Dear  | 1997 | M      | 5000   | PQRS         | P-RLBH1
mappingDf (Mapping Table data - Static):

SCHOOL_CODE | SPLIT_CRITERIA                                                                                | SCHOOL
ABCD        | (AREA LIKE 'CNTRL-%')                                                                         | GROUP-1
XYZA        | (AREA IN ('KK-DSK','DAY-20','MM-RLBH1','KM-RED1','NN-RLBH2'))                                 | MULTI
PQRS        | (AREA LIKE 'P-%' OR AREA LIKE 'V-%' OR AREA LIKE 'J-%')                                       | WEST
STUV        | (AREA NOT IN ('SS_ENDO2','SS_GRTGED','SS_GRTMMU','PC_ENDO1','PC_ENDO2','GRTENDO','GRTENDO1')) | CORE
Required Dataframe:

FNAME    | LNAME | DOB  | GENDER | SALARY | SCHOOL_GROUP | AREA
Williams | Kylie | 1996 | M      | 2000   | GROUP-1      | CNTRL-1
Maria    | Brown | 1992 | F      | 2000   | GROUP-1      | CNTRL-5
John     | Snow  | 1997 | M      | 5000   | MULTI        | MM-RLBH1
Merry    | Ely   | 1993 | F      | 1000   | WEST         | J-20
Michael  | Rose  | 1998 | M      | 1000   | MULTI        | DAY-20
Andrew   | Simen | 1990 | M      | 1000   | CORE         | LVL-20
John     | Dear  | 1997 | M      | 5000   | WEST         | P-RLBH1
How can I achieve that using Spark SQL?
(I know we can't show data like this in streaming; the streaming DF examples are for reference only.)
(For now, I created a static DF to apply the logic; a sketch of such a setup follows.)
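For reference, a minimal way to build these two frames as static test data from the samples above (column types here are assumptions for illustration):

import spark.implicits._

val teachersInfoDf = Seq(
  ("Williams", "Kylie", 1996, "M", 2000, "ABCD", "CNTRL-1"),
  ("Maria",    "Brown", 1992, "F", 2000, "ABCD", "CNTRL-5"),
  ("John",     "Snow",  1997, "M", 5000, "XYZA", "MM-RLBH1"),
  ("Merry",    "Ely",   1993, "F", 1000, "PQRS", "J-20"),
  ("Michael",  "Rose",  1998, "M", 1000, "XYZA", "DAY-20"),
  ("Andrew",   "Simen", 1990, "M", 1000, "STUV", "LVL-20"),
  ("John",     "Dear",  1997, "M", 5000, "PQRS", "P-RLBH1")
).toDF("FNAME", "LNAME", "DOB", "GENDER", "SALARY", "SCHOOL_GROUP", "AREA")

val mappingDf = Seq(
  ("ABCD", "(AREA LIKE 'CNTRL-%')", "GROUP-1"),
  ("XYZA", "(AREA IN ('KK-DSK','DAY-20','MM-RLBH1','KM-RED1','NN-RLBH2'))", "MULTI"),
  ("PQRS", "(AREA LIKE 'P-%' OR AREA LIKE 'V-%' OR AREA LIKE 'J-%')", "WEST"),
  ("STUV", "(AREA NOT IN ('SS_ENDO2','SS_GRTGED','SS_GRTMMU','PC_ENDO1','PC_ENDO2','GRTENDO','GRTENDO1'))", "CORE")
).toDF("SCHOOL_CODE", "SPLIT_CRITERIA", "SCHOOL")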
I am using the approach below but am getting an error:
def deriveSchoolOnArea: UserDefinedFunction = udf((area: String, SPLIT_CRITERIA: String, SCHOOL: String) => {
  if (area == null || SPLIT_CRITERIA == null || SCHOOL == null) {
    return null
  }
  val splitCriteria = SPLIT_CRITERIA.replace("AREA", area)
  val query = """select """" + SCHOOL + """" AS SCHOOL from dual where """ + splitCriteria
  print(query)
  val dualDf = spark.sparkContext.parallelize(Seq("dual")).toDF()
  dualDf.createOrReplaceGlobalTempView("dual")
  print("View Created")
  val finalHosDf = spark.sql(query)
  print("Query Executed")
  var finalSchool = ""
  if (finalHosDf.isEmpty) {
    return null
  } else {
    finalSchool = finalHosDf.select(col("SCHOOL")).first.getString(0)
  }
  print(finalSchool)
  finalSchool
})
val dfJoin = teachersInfoDf.join(mappingDf, mappingDf("SCHOOL_CODE") === teachersInfoDf("SCHOOL_GROUP"), "left")
val dfJoin2 = dfJoin.withColumn("SCHOOL_GROUP",
  coalesce(deriveSchoolOnArea(col("area"), col("SPLIT_CRITERIA"), col("SCHOOL")), col("SCHOOL_GROUP")))
dfJoin2.show(false)
But I am getting the error below:
dfJoin2.show(false)
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2459)
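As a side note, the Task not serializable error comes from using the SparkSession (spark.sql and view creation) inside the UDF. One possible alternative, sketched below under the assumption that mappingDf is small and static, is to collect the mapping rows on the driver and fold them into a single when/expr column expression instead of a UDF:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, expr, lit, when}

// Collect the static mapping rows on the driver (assumes mappingDf is small).
val rules = mappingDf.select("SCHOOL_CODE", "SPLIT_CRITERIA", "SCHOOL").collect()

// Fold the rules into one nested when(...) expression; each SPLIT_CRITERIA string
// (e.g. "(AREA LIKE 'CNTRL-%')") is parsed by expr() against the streaming columns.
val derivedSchool: Column = rules.foldLeft(lit(null).cast("string")) { (acc, r) =>
  val code     = r.getString(0)
  val criteria = r.getString(1)
  val school   = r.getString(2)
  when(col("SCHOOL_GROUP") === code && expr(criteria), lit(school)).otherwise(acc)
}

// Fall back to the original SCHOOL_GROUP when no rule matches.
val result = teachersInfoDf.withColumn("SCHOOL_GROUP", coalesce(derivedSchool, col("SCHOOL_GROUP")))

With the sample data above this should yield the required SCHOOL_GROUP values (GROUP-1, MULTI, WEST, CORE) while keeping everything as column expressions, which also works on a streaming dataframe.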

Related

Get records based on column max value - in PySpark

I have a cars table with data:

country | car      | price
Germany | Mercedes | 30000
Germany | BMW      | 20000
Germany | Opel     | 15000
Japan   | Honda    | 20000
Japan   | Toyota   | 15000
I need to get country, car, and price from the table, with the highest price for each country:

country | car      | price
Germany | Mercedes | 30000
Japan   | Honda    | 20000
I saw a similar question, but the solution there is in SQL; I want the DSL equivalent for PySpark dataframes (link in case: Get records based on column max value).
You need row_number and a filter to achieve your result, like below:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc

df = spark.createDataFrame(
    [
        ("Germany", "Mercedes", 30000),
        ("Germany", "BMW", 20000),
        ("Germany", "Opel", 15000),
        ("Japan", "Honda", 20000),
        ("Japan", "Toyota", 15000),
    ],
    ("country", "car", "price"),
)

df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("country").orderBy(desc("price"))))
df2 = df1.filter(df1.row_num == 1).drop("row_num")
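With the sample data above, df2.show() should then return the two expected rows (row order may vary):

+-------+--------+-----+
|country|     car|price|
+-------+--------+-----+
|Germany|Mercedes|30000|
|  Japan|   Honda|20000|
+-------+--------+-----+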

How to apply changes stored on a dataframe to another dataframe?

My base dataframe looks like this:
HeroNamesDF
id gen name surname supername
1 1 Clarc Kent BATMAN
2 1 Bruce Smith BATMAN
3 2 Clark Kent SUPERMAN
And then I have another one with the corrections: CorrectionsDF
id gen attribute value
1 1 supername SUPERMAN
1 1 name Clark
2 1 surname Wayne
My approach to the problem was to do this:
CorrectionsDF.select("id", "gen").distinct().collect().map(r => {
  val id = r(0)
  val gen = r(1)
  val corrections = CorrectionsDF.filter(col("id") === lit(id) and col("gen") === lit(gen))
  val candidates = HeroNamesDF.filter(col("id") === lit(id) and col("gen") === lit(gen))
  candidates.columns.map(column => {
    val change = corrections.where(col("attribute") === lit(column)).select("id", "gen", "value")
    candidates.select("id", "gen", column)
      .join(change, Seq("id", "gen"), "full")
      .withColumn(column, when(col("value").isNotNull, col("value")).otherwise(col(column)))
      .drop("value")
  }).reduce((df1, df2) => df1.join(df2, Seq("id", "gen")))
})
Expected output:
id gen name surname supername
1 1 Clark Kent SUPERMAN
2 1 Bruce Wayne BATMAN
3 2 Clark Kent SUPERMAN
And I would like to get rid of the .collect() but I can't make it work.
If I understood the example correctly, one join combined with a groupBy should be sufficient in your case. With the groupBy we generate a map, using collect_list and map_from_arrays, that contains the corrections for every id/gen pair, e.g. {"supername" : "SUPERMAN", "name" : "Clark"} for id 1 / gen 1:
import org.apache.spark.sql.functions.{coalesce, collect_list, first, map_from_arrays}
import spark.implicits._

val hdf = HeroNamesDF   // the base dataframe
val cdf = CorrectionsDF // the corrections dataframe

hdf.join(cdf, Seq("id", "gen"), "left")
  .groupBy(hdf("id"), hdf("gen"))
  .agg(
    map_from_arrays(
      collect_list("attribute"), // the keys
      collect_list("value")      // the values
    ).as("m"),
    first("name").as("name"),
    first("surname").as("surname"),
    first("supername").as("supername")
  )
  .select(
    $"id",
    $"gen",
    coalesce($"m".getItem("name"), $"name").as("name"),
    coalesce($"m".getItem("surname"), $"surname").as("surname"),
    coalesce($"m".getItem("supername"), $"supername").as("supername")
  )
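To try the snippet quickly, the two frames can be built from the sample data in the question (a throwaway test setup; the variable names simply mirror the question):

import spark.implicits._

val HeroNamesDF = Seq(
  (1, 1, "Clarc", "Kent", "BATMAN"),
  (2, 1, "Bruce", "Smith", "BATMAN"),
  (3, 2, "Clark", "Kent", "SUPERMAN")
).toDF("id", "gen", "name", "surname", "supername")

val CorrectionsDF = Seq(
  (1, 1, "supername", "SUPERMAN"),
  (1, 1, "name", "Clark"),
  (2, 1, "surname", "Wayne")
).toDF("id", "gen", "attribute", "value")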

Dataframe join in Scala on multiple columns when some columns might be null

I have 2 dataframes as below.
The goal is to find the new rows in df2 whose column values do not exist in dataframe 1.
I have tried to join the two dataframes with id as the join condition and to check that the other column values are not equal, as below, but it does not work.
Could someone please assist?
df1: This dataframe is like a master table
id amt city date
abc 100 City1 9/26/2018
abc 100 City1 9/25/2018
def 200 City2 9/26/2018
ghi 300 City3 9/26/2018
df2: Dataframe 2, the new dataset that comes in every day.
id amt city date
abc 100 City1 9/27/2018
def null City2 9/26/2018
ghi 300 City3 9/26/2018
Result: Come up with a result dataframe as below:
id amt city date
abc 100 City1 9/27/2018
def null City2 9/26/2018
Code I tried:
val writeDF = df1.join(df2, df1.col("id") === df2.col("id"))
  .where(df1.col("amt") =!= df2.col("amt"))
  .where(df1.col("city") =!= df2.col("city"))
  .where(df1.col("date") =!= df2.col("date"))
  .select($"df2.*")
The DataFrame method df1.except(df2) will return all of the rows in df1 that are not present in df2.
Source: Spark 2.2.0 docs
The except method can be used as mentioned in the Scala docs:
dataFrame1.except(dataFrame2)
will return another dataframe containing the rows of dataFrame1 that are not in dataFrame2.
You need to use the except method to achieve this:
df2.except(df1).show
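Putting that together for the frames in this question (writeDF is just an illustrative name; except performs a row-wise set difference over all columns, so any changed value makes the row show up):

val writeDF = df2.except(df1) // rows present in df2 but not in df1
writeDF.show(false)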

spark Group By data-frame columns without aggregation [duplicate]

This question already has answers here:
How to aggregate values into collection after groupBy?
(3 answers)
Closed 4 years ago.
I have a csv file in hdfs: /hdfs/test.csv. I would like to group the data below using Spark & Scala, grouping the A1...AN columns based on the A1 column, so that all the rows are grouped as in the output below.
Input:
name  A1    A1  A2  A3..AN
JACK  ABCD  0   1   0   1
JACK  LMN   0   1   0   3
JACK  ABCD  2   9   2   9
JAC   HBC   1   T   5   21
JACK  LMN   0   4   3   T
JACK  HBC   E7  4W  5   8
Output I need in Spark Scala:
JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK, LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
You can achieve this by putting the A columns into an array:

import org.apache.spark.sql.functions.{array, col, collect_set, concat_ws}

val aCols = 1.to(250).map(x => col(s"A$x"))
val concatCol = concat_ws(",", array(aCols: _*))

val groupedDf = df.withColumn("aConcat", concatCol)
  .groupBy("name", "A")
  .agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._ // for the $"..." column syntax

val grouped = someDF
  .groupBy($"name", $"A")
  .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
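For reference, someDF above corresponds to the input table from the question; a minimal way to build it for a quick test (column names are assumptions, and the stray "JAC" row is spelled "JACK" here to match the expected output):

import spark.implicits._

val someDF = Seq(
  ("JACK", "ABCD", "0", "1", "0", "1"),
  ("JACK", "LMN",  "0", "1", "0", "3"),
  ("JACK", "ABCD", "2", "9", "2", "9"),
  ("JACK", "HBC",  "1", "T", "5", "21"),
  ("JACK", "LMN",  "0", "4", "3", "T"),
  ("JACK", "HBC",  "E7", "4W", "5", "8")
).toDF("name", "A", "A1", "A2", "A3", "A4")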

Find diff between two data frames based on primary key in Spark Scala

I have two data frames in Spark.
I am doing df1.except(df2) to find whether any columns have changed between the two data frames.
df1 is like this:
|001000900|aaaaa BELLOWS CORPORATION||N|
|001000905|ddddd DEPARTMENT OF LABOR AND EMPLOYMENT SECURITY|BUREAU OF COMPLIANCE|N|
|001001049|gggg RAVIOLI MFG CO INC|SPINELLI BKY RAVIOLI PASTRY SP|N|
|001001130|dddd ANGELES UNIFIED SCHOOL DISTRICT|TRANSPORTATION BRANCH|N|
|001001143|ffff MUSIC PARTIES, INC||N|
|001001155|BOSTON BRASS AND IRON CO||N|
|001001171|HANCOCK MARINE, INC.||N|
|001001184|TRILLION CORPORATION||N|
|001001192|HAWAII STATE CHIROPRACTIC ASSOCIATION INC||N|
|001001379|THE FRUIT SQUARE PEOPLE INC|L & M BAKERY|N|
|001001416|J & S MARKET||N|
df2 is like below
|001000145|PARADISE TAN||N|
|001000306|SHRUT & ASCH LEATHER COMPANY, INC.||N|
|001000355|HARRISON SPECIALTY CO., INC.||N|
|001000363|LOUIS M. GERSON CO., INC.||N|
|001000467|SAVE THE SEA TURTLES INTERNATIONAL|ADOPT THE BEACH HI|N|
|001000504|DIRIGO SPICE CORPORATION|CUNNINGHAM SPICE|N|
|001000744|FREEDMAN THREAD COMPANY|COLONIAL THREAD CO|N|
|001000756|AFFORDABLE AIR CONDITIONING|P R ENTERPRISE|N|
|001000900|CLIFLEX BELLOWS CORPORATION||N|
|001000905|FLORIDA DEPARTMENT OF LABOR AND EMPLOYMENT SECURITY|BUREAU OF COMPLIANCE|N|
|001001049|SPINELLI RAVIOLI MFG CO INC|SPINELLI BKY RAVIOLI PASTRY SP|N|
|001001130|LOS ANGELES UNIFIED SCHOOL DISTRICT|TRANSPORTATION BRANCH|N|
|001001143|TOSCO MUSIC PARTIES, INC||N|
|001001155|BOSTON BRASS AND IRON CO||N|
But what I want is to find the diff between the two data frames based on one column, something like below.
I want my output like below:
|dunsnumber|filler1| businessname| tradestylename|registeredaddressindicator|
+----------+-------+--------------------+--------------------+--------------------------+
| 001001130| |dddd ANGELES UNIF...|TRANSPORTATION BR...| N|
| 001000900| |aaaaa BELLOWS COR...| | N|
| 001000905| |ddddd DEPARTMENT ...|BUREAU OF COMPLIANCE| N|
| 001001143| |ffff MUSIC PARTIE...| | N|
| 001001049| |gggg RAVIOLI MFG ...|SPINELLI BKY RAVI...| N|
+----------+-------+--------------------+--------------------+
Here is my code
import org.apache.spark.sql.functions._
val textRdd1 = sc.textFile("/home/cloudera/TRF/PCFP/INCR")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split("\\|", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)
val textRdd2 = sc.textFile("/home/cloudera/TRF/PCFP/MAIN")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split("\\|", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)
val diffAnyColumnDF = df1.except(df2).where(df1.col("dunsnumber") ===
df2.col("dunsnumber")).show()
So only if my primary key 'dunsnumber' matches do I want to check whether any columns have changed for that primary key.
I hope my question is clear.
Dataframe doesn't have a subtract method. You can use an alternative approach though:
convert the data to RDDs, use the subtract method, and then get back to your dataframes.
Hi, so this has worked for me:
val diffAnyColumnDF = df1.except(df2)
val addDF= diffAnyColumnDF.join(df2, Seq("dunsnumber")).show()