I was translating PySpark code to Scala Spark, expecting Scala Spark to perform at least as well.
But Scala Spark is taking more time than PySpark.
Can anyone spot issues with these two queries when they are executed in Scala Spark?
Query 1:
sqlContext.sql("""
  SELECT a.pair, a.bi_count, a.uni_count,
         unigram_table.uni_count AS uni_count_2,
         (log(a.bi_count) - log(a.uni_count) - log(unigram_table.uni_count)) AS score
  FROM (SELECT * FROM bigram_table
        JOIN unigram_table ON bigram_table.parent = unigram_table.token) AS a
  JOIN unigram_table ON a.child = unigram_table.token
  WHERE a.bi_count > 4000
  ORDER BY score DESC
  LIMIT 400000
""")
Execution time in PySpark: 3 min
Execution time in Scala Spark: 3 min
Query 2:
sqlContext.sql("""
  SELECT pair, tri_count,
         (log(tri_count) - log(count1) - log(count2) - log(unigram_table.uni_count)) AS score
  FROM (SELECT pair, tri_count, count1, child1, child2,
               unigram_table.uni_count AS count2
        FROM (SELECT pair, child1, child2, tri_count,
                     unigram_table.uni_count AS count1
              FROM trigram_table
              JOIN unigram_table ON trigram_table.parent = unigram_table.token) AS a
        JOIN unigram_table ON a.child1 = unigram_table.token) AS b
  JOIN unigram_table ON b.child2 = unigram_table.token
  WHERE tri_count > 3000
  ORDER BY score DESC
""")
Execution time in PySpark: 3 min
Execution time in Scala Spark: 12 min
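One way to narrow this down is to compare the plans Catalyst produces in each language; identical SQL should yield identical physical plans, so any difference points at the real cause of the slowdown rather than the language binding. A minimal sketch, assuming query2 is a variable holding the second SQL string above (a hypothetical name):
// Build the DataFrame without collecting it, then inspect the plan.
val q2 = sqlContext.sql(query2)  // query2: the SQL string from Query 2 above
q2.explain(true)                 // prints the parsed, analyzed, optimized and physical plans
// Run the same explain from PySpark and diff the two physical plans.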
Here is a piece of a Teradata query that I am trying to rewrite with the Spark Scala DataFrame API.
Input query:
SELECT MIN(activitydate)
       OVER (ORDER BY cust_id, site_group, activitydate
             ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS pread
FROM table_name
What is the equivalent of ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING in Spark Scala?
This is what I tried:
val window = Window.orderBy('cust_id, 'site_group, 'activitydate)
val df2 = df1.withColumn("pread", lag('activitydate, 1).over(window))
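A frame of exactly one preceding row can be written two ways in Spark: with lag over the same ordering, or with a literal rowsBetween(-1, -1) frame. A minimal sketch, assuming df1 has the columns named in the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, min}

// Ordering that mirrors the Teradata OVER clause
val w = Window.orderBy("cust_id", "site_group", "activitydate")

// Option 1: lag reads the value from exactly one row back
val withLag = df1.withColumn("pread", lag("activitydate", 1).over(w))

// Option 2: literal translation of ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
val withFrame = df1.withColumn("pread", min("activitydate").over(w.rowsBetween(-1, -1)))
Both produce the same pread column; note that a window with only ORDER BY and no PARTITION BY moves all rows to a single partition in Spark.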
I have two dataframes with identical column names but a different number of rows, each row identified by an ID and a Date, as follows:
First table (the one with all the IDs available):
ID | Date       | Amount A
1  | 2021-09-01 | 100
1  | 2021-09-02 | 50
2  | 2021-09-01 | 70
Second table (a smaller version including only some IDs):
ID | Date       | Amount A
2  | 2021-09-01 | 50
2  | 2021-09-02 | 30
What I would like to have is a single table with the following output:
ID | Date       | Amount A
1  | 2021-09-01 | 100
1  | 2021-09-02 | 50
2  | 2021-09-01 | 120
2  | 2021-09-02 | 30
Thanks in advance.
Approach 1: Using a Join
You may join both tables and sum the amounts of matching rows, coalescing the missing side to 0 so that unmatched rows keep their original amount.
Using Spark SQL
Ensure your dataframes are accessible as temp views
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your spark session
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
    first_df.ID,
    first_df.Date,
    first_df.AmountA + COALESCE(second_df.AmountA, 0) AS AmountA
FROM
    first_df
LEFT JOIN
    second_df ON first_df.ID = second_df.ID
             AND first_df.Date = second_df.Date
Using the Scala API
val outputDf = firstDf.alias("first_df")
  .join(
    secondDf.alias("second_df"),
    Seq("ID", "Date"),
    "left"
  ).selectExpr(
    "ID",
    "Date",
    "first_df.AmountA + coalesce(second_df.AmountA, 0) as AmountA"
  )
Using the PySpark API
outputDf = (
    firstDf.alias("first_df")
    .join(
        secondDf.alias("second_df"),
        ["ID", "Date"],
        "left"
    ).selectExpr(
        "ID",
        "Date",
        "first_df.AmountA + coalesce(second_df.AmountA, 0) as AmountA"
    )
)
Approach 2: Using a Union then aggregate by sum
Using Spark SQL
Ensure your dataframes are accessible as temp views
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your spark session
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
ID,
Date,
SUM(AmountA) as AmountA
FROM (
SELECT ID, Date, AmountA FROM first_df UNION ALL
SELECT ID, Date, AmountA FROM second_df
) t
GROUP BY
ID,
Date
Using the Scala API
import org.apache.spark.sql.functions.sum

val outputDf = firstDf.select("ID", "Date", "AmountA")
  .union(secondDf.select("ID", "Date", "AmountA"))
  .groupBy("ID", "Date")
  .agg(
    sum("AmountA").alias("AmountA")
  )
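For reference, a quick check against the sample tables above (a sketch; outputDf is the DataFrame built just above):
outputDf.orderBy("ID", "Date").show()
// +---+----------+-------+
// | ID|      Date|AmountA|
// +---+----------+-------+
// |  1|2021-09-01|    100|
// |  1|2021-09-02|     50|
// |  2|2021-09-01|    120|
// |  2|2021-09-02|     30|
// +---+----------+-------+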
Using the PySpark API
from pyspark.sql import functions as F

outputDf = (
    firstDf.select("ID", "Date", "AmountA")
    .union(secondDf.select("ID", "Date", "AmountA"))
    .groupBy("ID", "Date")
    .agg(
        F.sum("AmountA").alias("AmountA")
    )
)
Let me know if this works for you.
I was trying to run HQL files like the one below, but I am getting the error NoViableAltException.
val QUERY = fromFile(s"$SQLDIR/select_cust_info.hql").getLines.mkString
sqlContext.sql(s"$QUERY").show()
Can you please help with how to run it?
As requested, select_cust_info.hql looks like this:
set hive.execution.engine=mr;
--new records
insert into cust_info_stage
select row_number () over () + ${hiveconf:maxid} as row_id , name, age, sex, country , upd_date, create_date
from ${hiveconf:table} r
left join cust_dim d on id=uid
where not exists ( select 1 from cust_info c where c.id=r.id);
--upd record
insert into cust_info_stage
select row_id , name, age, sex, country , upd_date, create_date
from ${hiveconf:table} r
inner join cust_info_stage on
left join cust_dim d on id=uid
where not exists ( select 1 from cust_info c where c.id=r.id);
!quit
The HQL above is just a sample; I want to call such HQL files from sqlContext.
The next thing I will check is how to pass the hiveconf variables defined inside the .hql files through sqlContext.
You can try the code below to run an HQL file in PySpark 2+:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

with open("/home/hadoop/test/abc.hql") as fr:
    query = fr.read()
    print(query)
    results = sqlCtx.sql(query)
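That said, the file shown in the question contains several semicolon-separated statements, a set command, comment lines, ${hiveconf:...} placeholders, and a !quit directive, while sqlContext.sql expects a single statement; the getLines.mkString call also glues lines together without any separator, and either of these can produce a NoViableAltException. A rough Scala sketch of one way to pre-process such a file before handing each statement to sqlContext (the splitting rules and the params values are assumptions, not a general HiveQL parser):
import scala.io.Source

// Hypothetical values for the ${hiveconf:...} placeholders in the file
val params = Map("maxid" -> "1000", "table" -> "cust_info_incoming")

val raw = Source.fromFile(s"$SQLDIR/select_cust_info.hql").getLines
  .filterNot(_.trim.startsWith("--"))   // drop comment lines
  .filterNot(_.trim.startsWith("!"))    // drop beeline directives such as !quit
  .mkString("\n")

// Substitute ${hiveconf:name} with the values above
val substituted = params.foldLeft(raw) { case (sql, (k, v)) =>
  sql.replace(s"$${hiveconf:$k}", v)
}

// Run each ';'-terminated statement separately; SET statements are skipped
substituted.split(";").map(_.trim)
  .filter(_.nonEmpty)
  .filterNot(_.toLowerCase.startsWith("set "))
  .foreach(stmt => sqlContext.sql(stmt))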
I want to filter records from the target table whose date is greater than the min(date) of the source table (with a common id in both tables).
val cm_record_rdd=hiveContext.sql("select t1.* from target t1 left outer join source t2 on t1.id=t2.id")
val min_date_rdd=hiveContext.sql("select min(date) as min_date from source");
val src_rdd = hiveContext.sql("select * from source");
How can I filter the records of cm_record_rdd with target.date >= source.min_date?
I tried the steps below:
src_rdd.filter(cm_record_rdd("start_dt") >= min(src_rdd("date")))
src_rdd.filter(cm_record_rdd("start_dt") >= min_date_rdd("min_date"))
Nothing worked.
Solution:
val min_date = hiveContext.sql("select min(date) as min_date from source").collect.head.get(0)
cm_record_rdd.filter(cm_record_rdd("start_dt") >= min_date)
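If you would rather not collect the minimum to the driver, the same filter can be pushed into SQL by joining the aggregate in. A sketch using the table and column names from the question (adjust date/start_dt to the actual schema), assuming a Spark version that accepts an explicit CROSS JOIN:
// Same result without collect(): join the one-row min(date) aggregate in
val filtered = hiveContext.sql("""
  SELECT t1.*
  FROM target t1
  LEFT OUTER JOIN source t2 ON t1.id = t2.id
  CROSS JOIN (SELECT MIN(date) AS min_date FROM source) m
  WHERE t1.date >= m.min_date
""")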
Given the following data:
sequence | amount
       1 | 100000
       1 |  20000
       2 |  10000
       2 |  10000
I'd like to write a SQL query that gives me the sum of the current sequence, plus the sum of the previous sequence, like so:
sequence | current | previous
       1 |  120000 |        0
       2 |   20000 |   120000
I know the solution likely involves windowing functions but I'm not too sure how to implement it without subqueries.
SQL Fiddle
select
seq,
amount,
lag(amount::int, 1, 0) over(order by seq) as previous
from (
select seq, sum(amount) as amount
from sa
group by seq
) s
order by seq
If your sequence is sequential, without holes, you can simply do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount) from mytable t2 WHERE t2.sequence = t1.sequence - 1)
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence
Otherwise, instead of t2.sequence = t1.sequence - 1 you could do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount)
from mytable t2
WHERE t2.sequence = (SELECT MAX(t3.sequence)
FROM mytable t3
WHERE t3.sequence < t1.sequence))
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence;
You can see both approaches in this fiddle