Execute SQL stored in dataframe using pyspark

I have a list of SQL statements stored in a Hive table column. I have to fetch one SQL at a time from the Hive table and execute it. I'm getting the SQL as a dataframe, but can anyone tell me how to execute the SQL stored as a dataframe?
The column parameter_value contains the SQL:
extract_sql = spark.sql(""" select parameter_value
from schema_name.table_params
where project_name = 'some_projectname'
and sub_project_name = 'some_sub_project'
and parameter_name = 'extract_sql' """)
Now extract_sql contains the SQL. How do I execute it?

You can do it as follows:
sqls = spark.sql(""" select parameter_value
from schema_name.table_params
where project_name = 'some_projectname'
and sub_project_name = 'some_sub_project'
and parameter_name = 'extract_sql' """).collect()
for sql in sqls:
    spark.sql(sql[0]).show()
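collect() brings the matching rows back to the driver, and sql[0] reads the first (and only) column of each Row. As a small sketch using the same sqls result as above, you can also access that column by name, which reads a bit more clearly:
for row in sqls:
    # parameter_value holds the SQL text to execute
    spark.sql(row["parameter_value"]).show()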

Related

How to Convert T-SQL IF statement to Databricks PySpark

I have the following code in T-SQL
IF NOT EXISTS ( SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'airports' AND COLUMN_NAME = 'airport_region') SELECT * FROM airports;
I would like to convert the above T-SQL to Pyspark.
I have the following dataframe
df = df1.createOrReplaceTempView('airports')
My attempt at converting the above is as follows:
sql("""IF NOT EXISTS(SELECT * FROM airports where table = airports and COLUMN = 'airport_region') select * from airports""")
The above gives me a ParseException: error.
Any thoughts?
I have reproduced the above. If you want to do it in PySpark, you can check the column in Python and execute the SQL inside the if:
df.createOrReplaceTempView("sample1")
if('name1' not in df.columns):
spark.sql("select * from sample1").show()
If you want to do it with a SQL query, you can try the approach below.
First get the column names as a dataframe and save them as a temporary view; then, using that view, select from the required table only if your column name does not exist in it.
column_names = spark.sql("show columns in sample1")
column_names.createOrReplaceTempView("tempcols")
spark.sql("select * from sample1 where not exists (select * from tempcols where col_name='name1')").show()
To guard on column existence in Python (the T-SQL runs the query only when airport_region does not exist), try this:
if('airport_region' not in df1.columns):
    <do stuff>

Check the Metastore for the Table availability in Spark

I want to check whether the table is available in a specific Hive database before using the select query below.
How do I get this information from the metastore?
sparkSession.sql("select * from db_name.table_name")
You can run the commands below before running your operation on the table:
sparkSession.sql("use databaseName");
val df = sparkSession.sql("show tables like 'tableName'")
if(df.head(1).isEmpty == false){
  //write the code
}
Try this:
# This will give the list of tables in the database as a dataframe.
tables = sparkSession.sql('SHOW TABLES IN ' + db_name)
# You can collect them to bring in a list of Row items,
tables_list = tables.select('tableName').collect()
# then convert them into a Python list of strings.
tables_list_arr = []
for table in tables_list:
    tables_list_arr.append(str(table['tableName']))
# Then execute your command only if the table exists.
# (table_name is assumed to hold the name of the table you want to check.)
if(table_name in tables_list_arr):
    sparkSession.sql("select * from db_name.table_name")

Upsert in RDBMS (MySQL) table using Pyspark DataFrames and JDBC

I'm trying to perform merge(upsert) operation on MySql using Pyspark DataFrames and JDBC connection.
I followed the article below, which is in Scala, to do the same ( https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-apache-spark-part-2-45a9d49d0f43 ).
But I need to perform the upsert using PySpark, and I got stuck iterating through the PySpark DataFrame to call an upsert function like the one below. I need to pass the source dataframe's rows as parameters and perform the SQL upsert.
(In simple terms: performing the SQL upsert using a PySpark dataframe.)
def upsertToDelta(id, name, price, purchase_date):
    try:
        connection = mysql.connector.connect(host='localhost',
                                             database='Electronics',
                                             user='pynative',
                                             password='pynative##29')
        cursor = connection.cursor()
        mySql_insert_query = """MERGE INTO targetTable USING
            VALUES (%s, %s, %s, %s) as INSROW((Id, Name, Price, Purchase_date)
            WHEN NOT MATCHED THEN INSERT VALUES (INSROW.Id,INSROW.Price,INSROW.Purchase,INSROW.Purchase_date)
            WHEN MATCHED THEN UPDATE SET set Name=INSROW.Name"""
        recordTuple = (id, name, price, purchase_date)
        cursor.execute(mySql_insert_query, recordTuple)
        connection.commit()
        print("Record inserted successfully into test table")
    except mysql.connector.Error as error:
        print("Failed to insert into MySQL table {}".format(error))
dataFrame.writeStream \
.format("delta") \
.foreachBatch(upsertToDelta) \
.outputMode("update") \
.start()
Any help on this is highly appreciated.
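A minimal sketch of one common approach for a batch DataFrame (an assumption-laden sketch, not the article's or the asker's code): MySQL has no MERGE statement, so the upsert is usually written as INSERT ... ON DUPLICATE KEY UPDATE, and foreachPartition lets each executor open a single connection and upsert its rows. Table, column names, and credentials below are taken from the question, and Id is assumed to be the primary (or a unique) key:
import mysql.connector

def upsert_partition(rows):
    # One connection per partition, using the credentials from the question.
    connection = mysql.connector.connect(host='localhost',
                                         database='Electronics',
                                         user='pynative',
                                         password='pynative##29')
    cursor = connection.cursor()
    # MySQL-style upsert; assumes Id is the primary (or a unique) key.
    upsert_sql = """INSERT INTO targetTable (Id, Name, Price, Purchase_date)
                    VALUES (%s, %s, %s, %s)
                    ON DUPLICATE KEY UPDATE Name = VALUES(Name),
                                            Price = VALUES(Price),
                                            Purchase_date = VALUES(Purchase_date)"""
    for row in rows:
        cursor.execute(upsert_sql, (row['Id'], row['Name'], row['Price'], row['Purchase_date']))
    connection.commit()
    cursor.close()
    connection.close()

dataFrame.foreachPartition(upsert_partition)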

DB2 Update statement not working using JDBC

I have a few rows stored in a source table (defined as $schema.$sourceTable in the UPDATE query below). This table has 3 columns: TABLE_NAME, PERMISSION_TAG_COL, PT_DEPLOYED.
I have an update statement stored in a string like:
var update_PT_Deploy = s"UPDATE $schema.$sourceTable SET PT_DEPLOYED = 'Y' WHERE TABLE_NAME = '$tableName';"
My source table does have rows whose TABLE_NAME equals $tableName (a parameter), since I inserted them using another function of my program. The default value of PT_DEPLOYED when I inserted those rows was NULL.
I'm trying to execute update using JDBC in the following manner:
println(update_PT_Deploy)
val preparedStatement: PreparedStatement = connection.prepareStatement(update_PT_Deploy)
val row = preparedStatement.execute()
println(row)
println("row updated in table successfully")
preparedStatement.close()
The above piece of code does not throw any exception, but when I query my table in a tool like DBeaver, the NULL value of PT_DEPLOYED does not get updated to Y.
If I execute the same query as in update_PT_Deploy inside DBeaver, the query works and the table updates. I am sure I am following the correct steps.

How to delete data from Hive external table for Non-Partition column?

I have created an external table in Hive partitioned by client and month.
The requirement is to delete the data for ID=201 from that table, but the table is not partitioned by the ID column.
I have tried to do it with INSERT OVERWRITE, but it's not working.
We are using Spark 2.2.0.
How can I solve this problem?
val sqlDF = spark.sql("select * from db.table")
val newSqlDF1 = sqlDF.filter(!col("ID").isin("201") && col("month").isin("062016"))
val columns = newSqlDF1.schema.fieldNames.mkString(",")
newSqlDF1.createOrReplaceTempView("myTempTable")
spark.sql(s"INSERT OVERWRITE TABLE db.table PARTITION(client, month) select ${columns} from myTempTable")