Apache Spark merge two identical DataFrames summing all rows and columns

I have two DataFrames with identical column names but different numbers of rows. Each row is identified by an ID and a Date, as follows:
First table (the one with all the IDs available):
ID  Date        Amount A
1   2021-09-01  100
1   2021-09-02  50
2   2021-09-01  70
Second table (a smaller version including only some IDs):
ID  Date        Amount A
2   2021-09-01  50
2   2021-09-02  30
What I would like to have is a single table with the following output:
ID  Date        Amount A
1   2021-09-01  100
1   2021-09-02  50
2   2021-09-01  120
2   2021-09-02  30
Thanks in advance.

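Approach 1: Using a Join
You may join both tables on ID and Date and add the amounts. Use a full outer join so that rows present in only one table are kept, and COALESCE so that a missing amount counts as 0 instead of turning the sum into NULL. The examples assume the amount column is named AmountA.
Using Spark SQL
Ensure your DataFrames are accessible as views:
    firstDf.createOrReplaceTempView("first_df")
    secondDf.createOrReplaceTempView("second_df")
Execute the following on your Spark session:
    val outputDf = sparkSession.sql("<insert sql below here>")
    SELECT
        COALESCE(first_df.ID, second_df.ID) AS ID,
        COALESCE(first_df.Date, second_df.Date) AS Date,
        COALESCE(first_df.AmountA, 0) + COALESCE(second_df.AmountA, 0) AS AmountA
    FROM
        first_df
    FULL OUTER JOIN
        second_df ON first_df.ID = second_df.ID AND
                     first_df.Date = second_df.Date
Using the Scala API
    // Joining on Seq("ID", "Date") keeps a single ID and a single Date column
    val outputDf = firstDf.alias("first_df")
      .join(
        secondDf.alias("second_df"),
        Seq("ID", "Date"),
        "full_outer"
      ).selectExpr(
        "ID",
        "Date",
        "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
      )
Using the PySpark API
    # Joining on ["ID", "Date"] keeps a single ID and a single Date column
    outputDf = (
        firstDf.alias("first_df")
        .join(
            secondDf.alias("second_df"),
            ["ID", "Date"],
            "full_outer"
        ).selectExpr(
            "ID",
            "Date",
            "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
        )
    )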
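To see how the join type matters here, the following is a plain-Python model (not Spark code) of the join semantics on the question's sample data. The helper names `sql_add`, `left_join_sum` and `full_outer_join_sum` are made up for illustration, `None` stands in for SQL NULL, and the model assumes each (ID, Date) pair occurs at most once per table. It suggests that a left join alone loses the (2, 2021-09-02) row and yields NULL for keys missing from the second table, whereas a full outer join with COALESCE-style defaults produces the requested output.

```python
# Plain-Python model of the join semantics; None plays the role of SQL NULL.
first = {(1, "2021-09-01"): 100, (1, "2021-09-02"): 50, (2, "2021-09-01"): 70}
second = {(2, "2021-09-01"): 50, (2, "2021-09-02"): 30}


def sql_add(a, b):
    """SQL-style addition: any NULL operand makes the result NULL."""
    return None if a is None or b is None else a + b


def left_join_sum(left, right):
    """LEFT JOIN on the key, then left.amount + right.amount."""
    return {key: sql_add(amount, right.get(key)) for key, amount in left.items()}


def full_outer_join_sum(left, right):
    """FULL OUTER JOIN on the key, with COALESCE(amount, 0) on both sides."""
    keys = sorted(set(left) | set(right))
    return {key: left.get(key, 0) + right.get(key, 0) for key in keys}


left_result = left_join_sum(first, second)
full_result = full_outer_join_sum(first, second)

for (id_, date), amount in full_result.items():
    print(id_, date, amount)
```

In `left_result`, both rows for ID 1 come out as `None` and the key (2, "2021-09-02") is absent, which is the behaviour to watch for when choosing the join type.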
Approach 2: Using a Union then aggregating by sum
Stack the rows of both tables and sum the amounts per ID and Date. Rows that exist in only one table are kept unchanged.
Using Spark SQL
Ensure your DataFrames are accessible as views:
    firstDf.createOrReplaceTempView("first_df")
    secondDf.createOrReplaceTempView("second_df")
Execute the following on your Spark session:
    val outputDf = sparkSession.sql("<insert sql below here>")
    SELECT
        ID,
        Date,
        SUM(AmountA) as AmountA
    FROM (
        SELECT ID, Date, AmountA FROM first_df UNION ALL
        SELECT ID, Date, AmountA FROM second_df
    ) t
    GROUP BY
        ID,
        Date
Using the Scala API
    import org.apache.spark.sql.functions.sum

    // union matches columns by position, hence the explicit select on both sides
    val outputDf = firstDf.select("ID", "Date", "AmountA")
      .union(secondDf.select("ID", "Date", "AmountA"))
      .groupBy("ID", "Date")
      .agg(
        sum("AmountA").alias("AmountA")
      )
Using the PySpark API
    from pyspark.sql import functions as F

    # union matches columns by position, hence the explicit select on both sides
    outputDf = (
        firstDf.select("ID", "Date", "AmountA")
        .union(secondDf.select("ID", "Date", "AmountA"))
        .groupBy("ID", "Date")
        .agg(
            F.sum("AmountA").alias("AmountA")
        )
    )
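The union-then-aggregate logic can be checked against the question's expected output with a small plain-Python model (not Spark code). The function name `union_all_then_sum` is hypothetical, and rows are represented as (ID, Date, AmountA) tuples purely for illustration.

```python
from collections import defaultdict

# Sample rows from the question, as (ID, Date, AmountA) tuples.
first_rows = [(1, "2021-09-01", 100), (1, "2021-09-02", 50), (2, "2021-09-01", 70)]
second_rows = [(2, "2021-09-01", 50), (2, "2021-09-02", 30)]


def union_all_then_sum(*tables):
    """Concatenate all rows (UNION ALL), then SUM(amount) GROUP BY (id, date)."""
    totals = defaultdict(int)
    for table in tables:
        for id_, date, amount in table:
            totals[(id_, date)] += amount
    # Sort by key only to get a stable, readable order.
    return [(id_, date, total) for (id_, date), total in sorted(totals.items())]


merged = union_all_then_sum(first_rows, second_rows)
for row in merged:
    print(*row)
```

One consequence of this design is that duplicate (ID, Date) rows within a single table are summed as well, and the same pattern extends to more than two tables.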
Let me know if this works for you.

Related

Spark data frame rows between 1 preceding and 1 preceding

I am trying to rewrite the following piece of a Teradata query with the Spark Scala DataFrame API.
Input query:
    SELECT Min(activitydate)
    OVER (
        ORDER BY cust_id, site_group, activitydate
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS pread
    FROM table_name
What is the equivalent of ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING in Spark Scala?
This is what I tried:
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag
    val window = Window.orderBy('cust_id, 'site_group, 'activitydate)
    val df2 = df1.withColumn("pread", lag('activitydate, 1).over(window))
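For what it is worth, a frame of ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING contains at most one row (the previous one), so MIN over that frame returns the previous row's value, which is also what lag(col, 1) returns over the same ordering. The plain-Python model below (not Spark code) checks that equivalence; the function names and the sample dates are made up, and `None` stands in for NULL.

```python
def min_over_previous_row(values):
    """MIN(x) OVER (ORDER BY ... ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)."""
    result = []
    for i in range(len(values)):
        # The frame covers only index i - 1; it is empty for the first row.
        frame = [v for v in values[max(i - 1, 0):i] if v is not None]
        # MIN ignores NULLs and returns NULL for an empty frame.
        result.append(min(frame) if frame else None)
    return result


def lag_one(values):
    """lag(x, 1): the previous row's value, NULL (None) for the first row."""
    return [None] + values[:-1] if values else []


# Hypothetical activity dates, already sorted by the window's ORDER BY.
dates = ["2021-01-03", "2021-01-07", None, "2021-02-01"]
print(min_over_previous_row(dates))
print(lag_one(dates))
```

In Spark itself, the literal translation of the frame would be `min` over the same window with `rowsBetween(-1, -1)`; under this model the `lag` attempt produces the same column.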

SparkSQL Select with multiple columns, then join?

I'm unfamiliar with Spark SQL, but I want to select multiple columns in this query and then join the two DataFrames. The primary key column is ID from df.
    val count1 = df.select(size($"col1").as("col1Name"))
    val count2 = df.select(size($"col2").as("col2Name"))
Ultimately I want a table with ID, count1 and count2. How can I achieve this?

I believe what you are trying to do is count two columns from df. You can do this as follows:
    // Register the DataFrame as a temporary view (registerTempTable is deprecated since Spark 2.0)
    df.createOrReplaceTempView("temp_table")
    // Example of using Spark SQL
    val newdf = spark.sql("select id, count(col1) as count1, count(col2) as count2 from temp_table group by id")
    // The resulting DataFrame can be used for further operations
    newdf.show(false)

Adding columns in between a table while joining

I need to add new columns with constant values while joining two tables using PySpark. Using lit isn't solving the issue.
There are two tables, A and B.
Table A:
ID  Day         Name     Description
1   2016-09-01  Sam      Retail
2   2016-01-28  Chris    Retail
3   2016-02-06  ChrisTY  Retail
4   2016-02-26  Christa  Retail
3   2016-12-06  ChrisTu  Retail
4   2016-12-31  Christi  Retail
Table B:
ID  SkEY
1   1.1
2   1.2
3   1.3
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.functions import lit
    from pyspark import HiveContext

    hiveContext = HiveContext(sc)
    ABC2 = spark.sql(
        "select * From A where day ='{0}'".format(i[0])
    )
    Join = ABC2.join(
        Tab2,
        (
            ABC2.ID == Tab2.ID
        )
    ).select(
        Tab2.skey,
        ABC2.Day,
        ABC2.Name,
        ABC2.withColumn('newcol1', lit('')),
        ABC2.withColumn('newcol2', lit('A')),
        ABC2.Description
    )
    Join.select(
        "skey",
        "Day",
        "Name",
        "newcol1",
        "newcol2",
        "Description"
    ).write.mode("append").format("parquet").insertInto("Table")
    ABC = spark.sql(
        "select distinct day from A where day= '2016-01-01' "
    )
The above code fails even though the new columns and their constant values are defined with lit. newcol1 needs to be null and newcol2 needs to be 'A'.
The new table should be loaded with the columns in the order shown above, including the new columns with constant values.

Rewrite your Join DataFrame so that the constant columns are added with withColumn after the select, not inside it:
    Join = ABC2.join(Tab2, (ABC2.ID == Tab2.ID))\
        .select(Tab2.skey, ABC2.Day, ABC2.Name, ABC2.Description)\
        .withColumn('newcol1', lit(""))\
        .withColumn('newcol2', lit("A"))
    # lit("") is an empty string; use lit(None).cast("string") if newcol1 must be a real NULL
You can then select the columns in the order you like before writing:
    Join.select(
        "skey",
        "Day",
        "Name",
        "newcol1",
        "newcol2",
        "Description"
    ).write.mode("append").format("parquet").insertInto("Table")

Calling .hql file directly from spark

I was trying to run .hql files as shown below, but I am getting a NoViableAltException error:
    val QUERY = fromFile(s"$SQLDIR/select_cust_info.hql").getLines.mkString
    sqlContext.sql(s"$QUERY").show()
Can you please help me run it?
As requested, select_cust_info.hql looks like this:
    set hive.execution.engine=mr;
    --new records
    insert into cust_info_stage
    select row_number () over () + ${hiveconf:maxid} as row_id , name, age, sex, country , upd_date, create_date
    from ${hiveconf:table} r
    left join cust_dim d on id=uid
    where not exists ( select 1 from cust_info c where c.id=r.id);
    --upd record
    insert into cust_info_stage
    select row_id , name, age, sex, country , upd_date, create_date
    from ${hiveconf:table} r
    inner join cust_info_stage on
    left join cust_dim d on id=uid
    where not exists ( select 1 from cust_info c where c.id=r.id);
    !quit
The above HQL is just a sample; I want to call such .hql files from sqlContext.
The next thing I will check is how to pass the variables to sqlContext when the .hql files use hiveconf variables.

You can try the code below to run an .hql file in PySpark 2+:
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlCtx = SQLContext(sc)

    # Read the whole file into one string
    with open("/home/hadoop/test/abc.hql") as fr:
        query = fr.read()
    print(query)

    # sql() parses a single statement, so the file should contain exactly one
    results = sqlCtx.sql(query)
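As far as I know, `sql()` parses one statement per call, so a script like the sample in the question (several statements, `--` comment lines, `${hiveconf:...}` placeholders and a trailing `!quit`) would need some preprocessing first. Here is a minimal plain-Python sketch of one way to do that. The function name `split_hql`, the shortened sample script and the value `cust_raw` are hypothetical, and the naive split assumes there is no `;` inside string literals.

```python
import re


def split_hql(text, hiveconf):
    """Turn the text of an .hql script into a list of single SQL statements."""
    # Replace ${hiveconf:name} with the supplied value; a missing name raises KeyError.
    text = re.sub(r"\$\{hiveconf:(\w+)\}", lambda m: str(hiveconf[m.group(1)]), text)
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip full-line comments and CLI-only commands such as !quit.
        if stripped.startswith("--") or stripped.startswith("!"):
            continue
        kept.append(line)
    # Naive split: assumes no ';' inside string literals.
    return [s.strip() for s in "\n".join(kept).split(";") if s.strip()]


# Shortened, made-up script in the style of the question's sample.
sample = """set hive.execution.engine=mr;
--new records
insert into cust_info_stage
select name, age from ${hiveconf:table} r;
!quit
"""

statements = split_hql(sample, {"table": "cust_raw"})
for statement in statements:
    print(statement)
    # With a live session, each statement could then be run: sqlCtx.sql(statement)
```

Each element of `statements` is a single statement with the placeholders already filled in, so the statements can be passed to `sqlCtx.sql` one at a time.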

Scala spark is better than pyspark?

I was translating PySpark code to Scala Spark, expecting the Scala version to perform better.
But Scala Spark is taking more time than PySpark.
Can anyone spot issues with these two queries when they are executed in Scala Spark?
Query 1:
    sqlContext.sql(
      SELECT a.pair, a.bi_count, a.uni_count,
             unigram_table.uni_count as uni_count_2,
             (log(a.bi_count) - log(a.uni_count) - log(unigram_table.uni_count)) as score
      FROM (
        SELECT * FROM bigram_table
        JOIN unigram_table ON bigram_table.parent = unigram_table.token
      ) as a
      JOIN unigram_table ON a.child = unigram_table.token
      WHERE a.bi_count > 4000
      ORDER BY score DESC
      limit 400000
    )
Execution time in pyspark - 3 min
Execution time in Scala spark - 3 min
Query 2:
    sqlContext.sql(
      SELECT pair, tri_count,
             (log(tri_count) - log(count1) - log(count2) - log(unigram_table.uni_count)) as score
      FROM (
        SELECT pair, tri_count, count1, child1, child2,
               unigram_table.uni_count as count2
        FROM (
          SELECT pair, child1, child2, tri_count,
                 unigram_table.uni_count as count1
          FROM trigram_table
          JOIN unigram_table ON trigram_table.parent = unigram_table.token
        ) as a
        JOIN unigram_table ON a.child1 = unigram_table.token
      ) as b
      JOIN unigram_table ON b.child2 = unigram_table.token
      WHERE tri_count > 3000
      ORDER BY score DES
    )
Execution time in pyspark - 3 min
Execution time in Scala spark - 12 min
