Calling a .hql file directly from Spark - Scala

I was trying to run .hql files as shown below, but I am getting a NoViableAltException:
val QUERY = fromFile(s"$SQLDIR/select_cust_info.hql").getLines.mkString
sqlContext.sql(s"$QUERY").show()
Can you please help with how to run it?
As requested, select_cust_info.hql looks like this:
set hive.execution.engine=mr;
--new records
insert into cust_info_stage
select row_number () over () + ${hiveconf:maxid} as row_id , name, age, sex, country , upd_date, create_date
from ${hiveconf:table} r
left join cust_dim d on id=uid
where not exists ( select 1 from cust_info c where c.id=r.id);
--upd record
insert into cust_info_stage
select row_id , name, age, sex, country , upd_date, create_date
from ${hiveconf:table} r
inner join cust_info_stage on
left join cust_dim d on id=uid
where not exists ( select 1 from cust_info c where c.id=r.id);
!quit
The HQL above is just a sample; I want to call such .hql files from sqlContext.
The next thing I will check is: if the .hql files have hiveconf variables defined within them, how do I pass those variables in through sqlContext?

You can try the code below to run an .hql file in PySpark 2+:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

# read the whole file and hand it to the SQL context
with open("/home/hadoop/test/abc.hql") as fr:
    query = fr.read()

print(query)
results = sqlCtx.sql(query)
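Note that sqlContext.sql executes exactly one statement, so a file such as the sample above (which contains a set command, comments, several insert statements and !quit) will make the parser fail with errors like NoViableAltException. Below is a minimal Scala sketch of one way to handle this, assuming statements are terminated by ; (and no string literal contains a semicolon) and that the hiveconf placeholders are substituted from a Map you supply yourself; the Map values are hypothetical:

import scala.io.Source

// Hypothetical substitutions for the ${hiveconf:...} placeholders in the file.
val hiveconf = Map("maxid" -> "100", "table" -> "cust_info_updates")

val raw = Source.fromFile(s"$SQLDIR/select_cust_info.hql").getLines
  .filterNot(_.trim.startsWith("--"))   // drop comment lines
  .filterNot(_.trim.startsWith("set ")) // drop hive 'set' commands
  .filterNot(_.trim.startsWith("!"))    // drop beeline commands such as !quit
  .mkString("\n")

// Replace ${hiveconf:name} placeholders with the supplied values.
val substituted = hiveconf.foldLeft(raw) { case (sql, (name, value)) =>
  sql.replace("${hiveconf:" + name + "}", value)
}

// Run each ';'-terminated statement separately.
substituted.split(";").map(_.trim).filter(_.nonEmpty).foreach { stmt =>
  sqlContext.sql(stmt)
}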

Related

Convert pyspark code to snowflake to create row level policy in snowflake

I am trying to convert PySpark code to Snowflake to create a row-level policy in Snowflake.
I am new to Snowflake and not sure how to add the split and case statement to a Snowflake row access policy.
PySpark code:
df=df.withColumn('first_part',upper(split(col('id'),'#').getItem(0)))\
.withColumn('last_part',split(col('id'),'#').getItem(1))
df.createOrReplaceTempView("df_table")
df1.createOrReplaceTempView("df1_table")
joined_df=spark.sql("select c.*, case when first_part == 'emp' then '1' else p.flag end as flag, p.agreement_date from df_table c left join df1_table p on c.last_part = p.empid")
Snowflake part
create or replace row access policy policy.policy_row_hi as (col1 varchar) returns boolean ->
exists
(select 1
from schema.table1 t1
inner join schema.table2 t2 on (t2.oid = t1.oid)
where t2.empid = col1
and t1.flag = '1'
);

Apache Spark merge two identical DataFrames summing all rows and columns

I have two dataframes with identical column names but a different number of rows; each row is identified by an ID and a Date, as follows:
First table (the one with all the IDs available):

ID   Date         Amount A
1    2021-09-01   100
1    2021-09-02   50
2    2021-09-01   70
Second table (a smaller version including only some IDs):

ID   Date         Amount A
2    2021-09-01   50
2    2021-09-02   30
What I would like to have is a single table with the following output:
ID   Date         Amount A
1    2021-09-01   100
1    2021-09-02   50
2    2021-09-01   120
2    2021-09-02   30
Thanks in advance.
Approach 1: Using a Join
You may join both tables and sum the amounts on matching rows. Since some (ID, Date) pairs appear in only one of the tables, a full outer join with coalesce is needed so that those rows are kept rather than dropped or turned into nulls.
Using Spark SQL
Ensure your dataframes are accessible as temporary views:
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your spark session
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
    COALESCE(first_df.ID, second_df.ID) AS ID,
    COALESCE(first_df.Date, second_df.Date) AS Date,
    COALESCE(first_df.AmountA, 0) + COALESCE(second_df.AmountA, 0) AS AmountA
FROM
    first_df
FULL OUTER JOIN
    second_df ON first_df.ID = second_df.ID AND
                 first_df.Date = second_df.Date
Using the Scala API
val outputDf = firstDf.alias("first_df")
  .join(
    secondDf.alias("second_df"),
    Seq("ID", "Date"),
    "full"
  ).selectExpr(
    "ID",   // the join columns are already coalesced by the column-name join
    "Date",
    "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
  )
Using the PySpark API
outputDf = (
    firstDf.alias("first_df")
    .join(
        secondDf.alias("second_df"),
        ["ID", "Date"],
        "full"
    ).selectExpr(
        "ID",
        "Date",
        "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
    )
)
Approach 2: Using a Union then aggregate by sum
Using Spark SQL
Ensure your dataframes are accessible as temporary views:
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your spark session
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
    ID,
    Date,
    SUM(AmountA) as AmountA
FROM (
    SELECT ID, Date, AmountA FROM first_df
    UNION ALL
    SELECT ID, Date, AmountA FROM second_df
) t
GROUP BY
    ID,
    Date
Using the Scala API
import org.apache.spark.sql.functions.sum

val outputDf = firstDf.select("ID", "Date", "AmountA")
  .union(secondDf.select("ID", "Date", "AmountA"))
  .groupBy("ID", "Date")
  .agg(
    sum("AmountA").alias("AmountA")
  )
Using the PySpark API
from pyspark.sql import functions as F

outputDf = (
    firstDf.select("ID", "Date", "AmountA")
    .union(secondDf.select("ID", "Date", "AmountA"))
    .groupBy("ID", "Date")
    .agg(
        F.sum("AmountA").alias("AmountA")
    )
)
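With the sample data above, either approach should reproduce the expected output. A quick sanity check on the Scala outputDf (the exact show() formatting may differ on your setup):

outputDf.orderBy("ID", "Date").show()
// +---+----------+-------+
// | ID|      Date|AmountA|
// +---+----------+-------+
// |  1|2021-09-01|    100|
// |  1|2021-09-02|     50|
// |  2|2021-09-01|    120|
// |  2|2021-09-02|     30|
// +---+----------+-------+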
Let me know if this works for you.

converting sql to dataframe api

How can the SQL below be converted to Spark? I attempted the conversion below but saw this error:
Error evaluating method : '$eq$eq$eq': Method threw
'java.lang.RuntimeException' exception.
I am also not sure how to represent the correlated condition where sp1.cart_id = sp.cart_id in a Spark query.
select distinct
o.order_id
, 'PENDING'
from shopping sp
inner join order o
on o.cart_id = sp.cart_id
where o.order_date = (select max(sp1.order_date)
from shopping sp1
where sp1.cart_id = sp.cart_id)
SHOPPING_DF
.select(
"ORDER_ID",
"PENDING")
.join(ORDER_DF, Seq("CART_ID"), "inner")
.filter(col(ORDER_DATE) === SHOPPING_DF.groupBy("CART_ID").agg(max("ORDER_DATE")))
If this query is rewritten as a simple join against a subquery on shopping that uses the window function max to determine the latest order date for each cart_id, it can be expressed in SQL as:
SELECT DISTINCT
o.order_id,
'PENDING'
FROM
order o
INNER JOIN (
SELECT
cart_id,
MAX(order_date) OVER (
PARTITION BY cart_id
) as order_date
FROM
shopping
) sp ON sp.cart_id = o.cart_id AND
sp.order_date = o.order_date
This may be run on your Spark session to achieve the same results. Converted to the Spark API, it could be written as:
ORDER_DF.alias("o")
  .join(
    SHOPPING_DF.selectExpr(
      "cart_id",
      "MAX(order_date) OVER (PARTITION BY cart_id) as order_date"
    ).alias("sp"),
    Seq("cart_id", "order_date"),
    "inner"
  )
  .selectExpr(
    "o.order_id",
    "'PENDING' as PENDING"
  ).distinct()
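The same thing can be expressed with the DataFrame window API instead of a SQL window expression. A minimal sketch, assuming the same ORDER_DF and SHOPPING_DF with lower-case column names as above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// Latest order_date per cart_id, computed with a window instead of a SQL expression.
val latestPerCart = SHOPPING_DF
  .withColumn("order_date", max("order_date").over(Window.partitionBy("cart_id")))
  .select("cart_id", "order_date")

val result = ORDER_DF.alias("o")
  .join(latestPerCart.alias("sp"), Seq("cart_id", "order_date"), "inner")
  .selectExpr("o.order_id", "'PENDING' as PENDING")
  .distinct()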
Let me know if this works for you.

Loop through the list which has queries to be executed and appended to dataframe

I need to loop through each element in the list, run the query against the database, and append the result to the same dataframe (df). Could you please let me know how to achieve this?
PS: I am using Spark with Scala for this.
List((select * from table1 where a=10 ) as rules,
(select * from table1 where b=10) as rules,
(select * from table1 where c=10 ) as rules)
Thank you.
As you load data from the same table (table1), you can simply combine the conditions with or in the where clause:
val df = spark.sql("select * from table1 where a=10 or b=10 or c=10")
If the queries are on different tables, you can load each query into a dataframe and then union them:
val queries = List(
  "select * from table1 where a=10",
  "select * from table1 where b=10",
  "select * from table1 where c=10"
)
val df = queries.map(spark.sql).reduce(_ union _)
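If you also need to know which query produced each row (the list in the question tags each query as rules), you can pair each query with a label before the union. A minimal sketch, with hypothetical rule labels:

import org.apache.spark.sql.functions.lit

// Hypothetical labels paired with each query so the origin of every row is kept.
val labelledQueries = List(
  ("rule_a", "select * from table1 where a=10"),
  ("rule_b", "select * from table1 where b=10"),
  ("rule_c", "select * from table1 where c=10")
)

val df = labelledQueries
  .map { case (label, query) => spark.sql(query).withColumn("rule", lit(label)) }
  .reduce(_ union _)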

Adding columns in between a table while joining

I need to add new columns with constant values while joining two tables using PySpark. Using lit isn't solving the issue in PySpark.
There are two tables, A and B. Table A is as follows:
ID Day Name Description
1 2016-09-01 Sam Retail
2 2016-01-28 Chris Retail
3 2016-02-06 ChrisTY Retail
4 2016-02-26 Christa Retail
3 2016-12-06 ChrisTu Retail
4 2016-12-31 Christi Retail
Table B is as follows:
ID SkEY
1 1.1
2 1.2
3 1.3
from pyspark.sql import sparksession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark import HiveContext
hiveContext= HiveContext(sc)
ABC2 = spark.sql(
"select * From A where day ='{0}'".format(i[0])
)
Join = ABC2.join(
Tab2,
(
ABC2.ID == Tab2.ID
)
).select(
Tab2.skey,
ABC2.Day,
ABC2.Name,
ABC2.withColumn('newcol1, lit('')),
ABC2.withColumn('newcol2, lit('A')),
ABC2.Description
)
Join.select(
"skey",
"Day",
"Name",
"newcol1",
"newcol2",
"Description"
).write.mode("append").format("parquet").insertinto("Table")
ABC=spark.sql(
"select distinct day from A where day= '2016-01-01' "
)
The above code results in errors even after defining the new columns and their constant values with lit; newcol1 needs to take a null value and newcol2 the constant 'A'.
The new table should be loaded with the columns in the same order as presented, including the new columns with their constant values.
Rewrite your Join DF as:
Join = ABC2.join(Tab2, (ABC2.ID == Tab2.ID))\
    .select(Tab2.skey, ABC2.Day, ABC2.Name)\
    .withColumn('newcol1', lit(None).cast('string'))\
    .withColumn('newcol2', lit('A'))
You can then Join.select in the order you like, so your code will look like:
Join = ABC2.join(Tab2, (ABC2.ID == Tab2.ID))\
    .select(Tab2.skey, ABC2.Day, ABC2.Name, ABC2.Description)\
    .withColumn('newcol1', lit(None).cast('string'))\
    .withColumn('newcol2', lit('A'))
Join.select(
    "skey",
    "Day",
    "Name",
    "newcol1",
    "newcol2",
    "Description"
).write.mode("append").format("parquet").insertInto("Table")