Adding columns to a table while joining - pyspark

I need to add new columns with constant values while joining two tables
using PySpark. Using lit isn't solving the issue in PySpark.
**** Table A ****
There are two tables, A and B. Table A is as follows:

ID  Day         Name     Description
1   2016-09-01  Sam      Retail
2   2016-01-28  Chris    Retail
3   2016-02-06  ChrisTY  Retail
4   2016-02-26  Christa  Retail
3   2016-12-06  ChrisTu  Retail
4   2016-12-31  Christi  Retail

**** Table B ****

ID  SKey
1   1.1
2   1.2
3   1.3
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

ABC2 = spark.sql(
    "select * from A where day = '{0}'".format(i[0])
)

Join = ABC2.join(
    Tab2,
    (
        ABC2.ID == Tab2.ID
    )
).select(
    Tab2.skey,
    ABC2.Day,
    ABC2.Name,
    ABC2.withColumn('newcol1', lit('')),
    ABC2.withColumn('newcol2', lit('A')),
    ABC2.Description
)

Join.select(
    "skey",
    "Day",
    "Name",
    "newcol1",
    "newcol2",
    "Description"
).write.mode("append").format("parquet").insertInto("Table")

ABC = spark.sql(
    "select distinct day from A where day = '2016-01-01'"
)
ABC=spark.sql(
"select distinct day from A where day= '2016-01-01' "
)
The above code still results in errors even after defining the new columns
and their constant values with lit. Also, newcol1 needs to take a null value and
newcol2 the constant 'A'.
The new table should be loaded with the columns in the same order as shown
above, including the new columns with their constant values.

Rewrite your Join DataFrame as:

Join = ABC2.join(Tab2, (ABC2.ID == Tab2.ID))\
    .select(Tab2.skey, ABC2.Day, ABC2.Name)\
    .withColumn('newcol1', lit(""))\
    .withColumn('newcol2', lit("A"))

You can then Join.select in whatever order you like, so your code will look like:

Join = ABC2.join(Tab2, (ABC2.ID == Tab2.ID))\
    .select(Tab2.skey, ABC2.Day, ABC2.Name, ABC2.Description)\
    .withColumn('newcol1', lit(""))\
    .withColumn('newcol2', lit("A"))
Join.select(
"skey",
"Day",
"Name",
"newcol1",
"newcol2",
"Description"
).write.mode("append").format("parquet").insertInto("Table")
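For reference, here is a minimal self-contained PySpark sketch of the same pattern on toy data (the DataFrame and column names below are illustrative, not the original Hive tables). It also uses lit(None).cast("string") for newcol1, since lit("") produces an empty string rather than a true null:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for Table A and Table B
a = spark.createDataFrame(
    [(1, "2016-09-01", "Sam", "Retail"), (2, "2016-01-28", "Chris", "Retail")],
    ["ID", "Day", "Name", "Description"],
)
b = spark.createDataFrame([(1, 1.1), (2, 1.2)], ["ID", "SKey"])

# Join first, add the constant columns with withColumn, then fix the column order
joined = (
    a.join(b, a.ID == b.ID)
     .select(b.SKey.alias("skey"), a.Day, a.Name, a.Description)
     .withColumn("newcol1", lit(None).cast("string"))  # true null
     .withColumn("newcol2", lit("A"))                  # constant 'A'
)

joined.select("skey", "Day", "Name", "newcol1", "newcol2", "Description").show()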

Related

Apache Spark merge two identical DataFrames summing all rows and columns

I have two dataframes with identical column names but different number of rows, each of them identified by an ID and Date, as follows:
First table (the one with all the IDs available):

ID  Date        Amount A
1   2021-09-01  100
1   2021-09-02  50
2   2021-09-01  70

Second table (a smaller version including only some IDs):

ID  Date        Amount A
2   2021-09-01  50
2   2021-09-02  30

What I would like to have is a single table with the following output:

ID  Date        Amount A
1   2021-09-01  100
1   2021-09-02  50
2   2021-09-01  120
2   2021-09-02  30
Thanks in advance.
Approach 1: Using a Join
You may join both tables and sum the amounts on matching rows. Note that a plain left join would drop the rows that exist only in the second table and yield nulls where there is no match, so a full outer join with coalesce is used below.
Using spark sql
Ensure your dataframes are accessible as temp views:
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your spark session:
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
    COALESCE(first_df.ID, second_df.ID) AS ID,
    COALESCE(first_df.Date, second_df.Date) AS Date,
    COALESCE(first_df.AmountA, 0) + COALESCE(second_df.AmountA, 0) AS AmountA
FROM
    first_df
FULL OUTER JOIN
    second_df ON first_df.ID = second_df.ID AND
                 first_df.Date = second_df.Date
Using Scala api
val outputDf = firstDf.alias("first_df")
    .join(
        secondDf.alias("second_df"),
        Seq("ID", "Date"),
        "full_outer"
    ).selectExpr(
        "ID",
        "Date",
        "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
    )
Using pyspark api
outputDf = (
    firstDf.alias("first_df")
    .join(
        secondDf.alias("second_df"),
        ["ID", "Date"],
        "full_outer"
    ).selectExpr(
        "ID",
        "Date",
        "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
    )
)
Approach 2: Using a Union then aggregate by sum
Using spark sql
Ensure your dataframes are accessible as temp views:
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your spark session:
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
ID,
Date,
SUM(AmountA) as AmountA
FROM (
SELECT ID, Date, AmountA FROM first_df UNION ALL
SELECT ID, Date, AmountA FROM second_df
) t
GROUP BY
ID,
Date
Using Scala api
val outputDf = firstDf.select("ID","Date","AmountA")
.union(secondDf.select("ID","Date","AmountA"))
.groupBy("ID","Date")
.agg(
sum("AmountA").alias("AmountA")
)
Using Pyspark api
from pyspark.sql import functions as F

outputDf = (
    firstDf.select("ID", "Date", "AmountA")
    .union(secondDf.select("ID", "Date", "AmountA"))
    .groupBy("ID", "Date")
    .agg(
        F.sum("AmountA").alias("AmountA")
    )
)
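For a quick check, here is a hedged end-to-end sketch in PySpark using the toy data from the question (the variable names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

firstDf = spark.createDataFrame(
    [(1, "2021-09-01", 100), (1, "2021-09-02", 50), (2, "2021-09-01", 70)],
    ["ID", "Date", "AmountA"],
)
secondDf = spark.createDataFrame(
    [(2, "2021-09-01", 50), (2, "2021-09-02", 30)],
    ["ID", "Date", "AmountA"],
)

# Union both tables, then sum the amounts per (ID, Date)
outputDf = (
    firstDf.union(secondDf)
    .groupBy("ID", "Date")
    .agg(F.sum("AmountA").alias("AmountA"))
)

outputDf.orderBy("ID", "Date").show()
# Expected: (1, 2021-09-01, 100), (1, 2021-09-02, 50),
#           (2, 2021-09-01, 120), (2, 2021-09-02, 30)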
Let me know if this works for you.

Spark scala window count max

I have the following df:

result  state  clubName
win     XYZ    club1
win     XYZ    club2
win     XYZ    club1
win     PQR    club3

I need, per state, the clubName with the maximum number of wins.
val byState = Window.partitionBy("state").orderBy('state)
I tried creating a window, but it does not help.
Expected result: something like this in SQL
select temp.res
(select count(result) as res
from table
group by clubName) temp
group by state
e.g.

state  max_count_of_wins  clubName
XYZ    2                  club1
You can get the win count for each club, then assign a rank for each club ordered by wins, and filter those rows with rank = 1.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, desc, row_number, when}

val df2 = df.withColumn(
    "wins",
    // number of 'win' rows per (state, clubName)
    count(when(col("result") === "win", 1))
        .over(Window.partitionBy("state", "clubName"))
).withColumn(
    "rn",
    // rank clubs within each state by win count
    row_number().over(Window.partitionBy("state").orderBy(desc("wins")))
).filter("rn = 1").selectExpr("state", "wins as max_count_of_wins", "clubName")

df2.show
+-----+-----------------+--------+
|state|max_count_of_wins|clubName|
+-----+-----------------+--------+
| PQR| 1| club3|
| XYZ| 2| club1|
+-----+-----------------+--------+
You can also express the same thing in SQL with Spark SQL, after registering the dataframe as a temp view. Note that the original nested GROUP BY formulation would not return the club name reliably, so a window function is used to pick the top club per state:
df.createOrReplaceTempView("Table1")
spark.sql("""
    SELECT state, max_count_of_wins, clubName
    FROM (
        SELECT state, clubName, COUNT(1) AS max_count_of_wins,
               ROW_NUMBER() OVER (PARTITION BY state ORDER BY COUNT(1) DESC) AS rn
        FROM Table1
        WHERE result = 'win'
        GROUP BY state, clubName
    ) t
    WHERE rn = 1
""")
where the dataframe df is registered as the temp view Table1.
p.s. if you want to try it yourself use the initialization
CREATE TABLE Table1
(`result` varchar(3), `state` varchar(3), `clubName` varchar(5))
;
INSERT INTO Table1
(`result`, `state`, `clubName`)
VALUES
('win', 'XYZ', 'club1'),
('win', 'XYZ', 'club2'),
('win', 'XYZ', 'club1'),
('win', 'PQR', 'club3')
;
on http://sqlfiddle.com.
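Since most of this page is PySpark, here is a minimal PySpark equivalent of the window approach above; it assumes a DataFrame df with result, state and clubName columns and uses the toy data from the question:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data matching the question
df = spark.createDataFrame(
    [("win", "XYZ", "club1"), ("win", "XYZ", "club2"),
     ("win", "XYZ", "club1"), ("win", "PQR", "club3")],
    ["result", "state", "clubName"],
)

df2 = (
    df.withColumn(
        "wins",
        F.count(F.when(F.col("result") == "win", 1))
         .over(Window.partitionBy("state", "clubName"))
    )
    .withColumn(
        "rn",
        F.row_number().over(Window.partitionBy("state").orderBy(F.desc("wins")))
    )
    .filter("rn = 1")
    .selectExpr("state", "wins as max_count_of_wins", "clubName")
)

df2.show()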

Calling .hql file directly from Spark

I was trying to run hql files like below, but I am getting the error NoViableAltException:
val QUERY = scala.io.Source.fromFile(s"$SQLDIR/select_cust_info.hql").getLines.mkString
sqlContext.sql(s"$QUERY").show()
Can you please help me understand how to run it?
As requested, the select_cust_info.hql would look like this:
set hive.execution.engine=mr;
--new records
insert into cust_info_stage
select row_number () over () + ${hiveconf:maxid} as row_id , name, age, sex, country , upd_date, create_date
from ${hiveconf:table} r
left join cust_dim d on id=uid
where not exists ( select 1 from cust_info c where c.id=r.id);
--upd record
insert into cust_info_stage
select row_id , name, age, sex, country , upd_date, create_date
from ${hiveconf:table} r
inner join cust_info_stage on
left join cust_dim d on id=uid
where not exists ( select 1 from cust_info c where c.id=r.id);
!quit
The above hql is just a sample; I want to call such hqls from sqlContext.
The next thing I will check is how to pass hiveconf variables to sqlContext if the .hqls have them defined within.
You can try the code below to run an hql file in PySpark v2+:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

with open("/home/hadoop/test/abc.hql") as fr:
    query = fr.read()

print(query)
results = sqlCtx.sql(query)
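On the follow-up about ${hiveconf:...} variables: spark.sql runs a single statement and does not expand hiveconf placeholders, so one workaround is to substitute them yourself and split the file on semicolons before executing each statement. A rough sketch, assuming a simple .hql with no semicolons inside string literals (the file path and variable values are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Values you would otherwise pass with --hiveconf (placeholders)
hiveconf = {"maxid": "1000", "table": "cust_info_new"}

with open("/home/hadoop/test/select_cust_info.hql") as fr:
    text = fr.read()

# Substitute ${hiveconf:name} placeholders
for name, value in hiveconf.items():
    text = text.replace("${hiveconf:" + name + "}", value)

# Drop '--' comment lines, 'set ...' directives and '!quit', then run each statement
for stmt in text.split(";"):
    lines = [ln for ln in stmt.splitlines()
             if ln.strip() and not ln.strip().startswith("--")]
    cleaned = "\n".join(lines).strip()
    if cleaned and not cleaned.lower().startswith("set ") and cleaned != "!quit":
        spark.sql(cleaned)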

Deduplication problems in Pyspark

I have one dataframe with many rows of id, date and other information. It contains 2,856,134 records. A count distinct of ID results in 1,552,184 records.
Using this:
DF2 = sorted(DF.groupBy(DF.id).max('date').alias('date').collect())
Gives me the max date per ID, and results in 1,552,184 records, which matches the above. So far so good.
I try to join DF2 back to DF where id = id and max_date = date:
df3 = DF2.join(DF,(DF2.id==DF.id)&(DF2.Max_date==DF.date),"left")
This results in 2,358,316 records - which is different than the original amount.
I changed the code to:
df3 = DF2.join(DF,(DF2.id==DF.id)&(DF2.Max_date==DF.date),"left").dropDuplicates()
This results in 1,552,508 records (which is odd, since it should return 1,552,184, matching the de-duplicated DF2 above).
Any idea what's happening here? I presume it's something to do with my join function.
Thanks!
It's because your table 2 has duplicate entries. For example:

Table1    Table2
_______   _________
1         2
2         2
3         5
4         6

SELECT Table1.Id, Table2.Id
FROM Table1
LEFT OUTER JOIN Table2 ON Table1.Id = Table2.Id

Results:
1, null
2, 2
2, 2
3, null
4, null
I hope this will help you in solving your problem
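In PySpark terms, a minimal sketch of the usual fix, assuming the columns are literally named id and date: keep the aggregated side unique on the join keys and de-duplicate on those keys (not on whole rows) so the join cannot fan out:

from pyspark.sql import functions as F

# One row per id with its max date; rename the key to avoid ambiguity in the join
DF2 = DF.groupBy("id").agg(F.max("date").alias("max_date")) \
        .withColumnRenamed("id", "id2")

# Join back to DF on (id, max_date == date); an inner join is enough here,
# since every id in DF2 comes from DF
df3 = (
    DF.join(DF2, (DF.id == DF2.id2) & (DF.date == DF2.max_date), "inner")
      .drop("id2", "max_date")
      .dropDuplicates(["id", "date"])   # collapse identical (id, date) rows from DF
)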

Is Scala Spark better than PySpark?

I was translating PySpark to Scala Spark, expecting Scala Spark to perform well, but Scala Spark is taking more time than PySpark.
Can anyone please spot issues with these two queries when they are executed in Scala Spark?
Query 1:
sqlContext.sql("""
    SELECT a.pair, a.bi_count, a.uni_count,
           unigram_table.uni_count AS uni_count_2,
           (log(a.bi_count) - log(a.uni_count) - log(unigram_table.uni_count)) AS score
    FROM (
        SELECT * FROM bigram_table
        JOIN unigram_table ON bigram_table.parent = unigram_table.token
    ) AS a
    JOIN unigram_table ON a.child = unigram_table.token
    WHERE a.bi_count > 4000
    ORDER BY score DESC
    LIMIT 400000
""")
Execution time in PySpark: 3 min
Execution time in Scala Spark: 3 min
Query 2:
sqlContext.sql("""
    SELECT pair, tri_count,
           (log(tri_count) - log(count1) - log(count2) - log(unigram_table.uni_count)) AS score
    FROM (
        SELECT pair, tri_count, count1, child1, child2, unigram_table.uni_count AS count2
        FROM (
            SELECT pair, child1, child2, tri_count, unigram_table.uni_count AS count1
            FROM trigram_table
            JOIN unigram_table ON trigram_table.parent = unigram_table.token
        ) AS a
        JOIN unigram_table ON a.child1 = unigram_table.token
    ) AS b
    JOIN unigram_table ON b.child2 = unigram_table.token
    WHERE tri_count > 3000
    ORDER BY score DESC
""")
Execution time in PySpark: 3 min
Execution time in Scala Spark: 12 min