How to create a temporary table in Snowflake based on a PySpark dataframe - pyspark

I can read a Snowflake table into a PySpark dataframe using sqlContext:
sql = """select * from table1"""
df = (sqlContext.read
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .option("query", sql)
    .load())
How do I create a temporary table in snowflake (using pyspark code) and insert values from this pyspark dataframe (df)?

Just save it as usual, using the Snowflake format:
snowflake_options = {
    ...
    'sfDatabase': 'dbabc',
    'dbtable': 'tablexyz',
    ...
}
(df
    .write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .save()
)
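If the target table already exists, you may also want to pass an explicit save mode. A minimal sketch, assuming the connector honours the standard Spark save modes:
(df
    .write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .mode("append")   # or "overwrite" to replace existing rows (assumes standard Spark SaveMode handling)
    .save()
)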

I don't believe this can be done. At least not the way you want.
You can, technically, create a temporary table; but persisting it is something that I have had a great deal of difficulty finding how to do (i.e. I haven't). If you run the following:
spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, 'create temporary table tmp_table (id int, value text)')
you'll notice that it returns a Java object indicating the temp table was created successfully; but once you try to run any further statements against it, you'll get nasty errors that mean it no longer exists. Somehow we mere mortals would need to find a way to access and persist the Snowflake session through the JVM API. That being said, I also think that would run contrary to the Spark paradigm.
If you really need the special-case performance boost of running transformations on Snowflake instead of bringing it all into Spark, just keep everything in Snowflake to begin with by either
Using CTEs in the query, or
Using the runQuery API described above to create "temporary" permanent/transient tables and designing Snowflake queries that insert directly into those and then clean them up (DROP them) when you are done, as sketched below.
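A hedged sketch of that second option, reusing the runQuery call shown above and the normal dbtable write path; the table name tmp_table_xyz and its columns are placeholders I've made up:
utils = spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils

# create a transient table (rather than a session-scoped temporary one) so it survives
# beyond the connection that created it
utils.runQuery(snowflake_options, 'create transient table tmp_table_xyz (id int, value text)')

# load the dataframe into it
(df
    .write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .option("dbtable", "tmp_table_xyz")
    .mode("append")
    .save()
)

# ...run whatever Snowflake-side queries you need against tmp_table_xyz...

# clean up when you are done
utils.runQuery(snowflake_options, 'drop table if exists tmp_table_xyz')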

Related

How to execute a update query in spark sql temp tables

I am trying the code below, but it throws an error that I am unable to understand:
df.registerTempTable("Temp_table")
spark.sql("Update Temp_table set column_a='1'")
Currently Spark SQL does not support UPDATE statements. The workaround is to create a Delta Lake / Iceberg table from your Spark dataframe and execute your SQL query directly on that table, as sketched below.
For iceberg implementation refer to :
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
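For Delta Lake, a minimal sketch might look like this (assuming Delta Lake is installed and configured on the cluster; the table name temp_delta_table is a placeholder):
# save the dataframe as a Delta table instead of a temp view
df.write.format("delta").mode("overwrite").saveAsTable("temp_delta_table")

# Delta tables accept UPDATE statements through Spark SQL
spark.sql("UPDATE temp_delta_table SET column_a = '1'")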

Synapse spark select column with space

I am trying to read a Synapse table which has spaces in column names.
Reading the table works as long as I select columns without spaces or special characters:
%%spark
val df = spark.read.synapsesql("<Pool>.<schema>.<table>").select("TYPE", "Year").limit(100)
df.show()
OUTPUT:
+------+----+
| TYPE|Year|
+------+----+
|BOUGHT|LAST|
|BOUGHT|LAST|
|BOUGHT|LAST|
|BOUGHT|LAST|
When I start selecting columns with spaces I am getting errors. I have tried many variants:
.select(col("""`Country Code`"""))
.select(col("`Country Code`"))
.select(col("""[Country Code]"""))
.select(col("Country Code"))
.select($"`Country Code`")
.select("`Country Code`")
.select("""`Country Code`""")
All of them return this error:
ERROR: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'Country'.
If I omit the backticks in the select, for example:
.select("[Country Code]")
ERROR: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name '[Country Code]'.
With backticks, Spark in Synapse takes only the first word as the column name.
Any experience?
The select on its own will work; adding show (or any other action like count) will not. There does seem to be an issue with the Synapse synapsesql API. The Invalid column name 'Country' error is coming from the SQL engine because there seems to be no way to pass square brackets back to it. Also, parquet files do not support spaces in column names, so it's probably connected.
The workaround is simply to not use spaces in column names. Fix up the tables in a previous Synapse pipeline step if required. I'll have a look into it, but there may be no other answer.
If you want to rename existing columns in the database you can use sp_rename, eg
EXEC sp_rename 'dbo.countries.country Type', 'countryType', 'COLUMN';
This code has been tested on a Synapse dedicated SQL pool.
That particular API (synapsesql.read) cannot handle views, unfortunately. You would have to materialise it, e.g. using a CTAS in a prior Synapse Pipeline step. The API is useful for simple patterns (get table -> process -> put back) but is pretty limited. You can't even manage table distribution (hash, round_robin, replicate), indexing (clustered columnstore, clustered index, heap) or partitioning, but you never know, they might add to it one day. I'll be keeping an eye out during the next MS conference anyway.
I have created a function to run queries over JDBC. Thanks to this I was able to read from the view. I have also added sample code showing how to get the password from Key Vault using TokenLibrary.
def spark_query(db, query):
    jdbc_hostname = "<synapse_db>.sql.azuresynapse.net"
    user = "<spark_db_client>"
    password = "<strong_password>"
    # password_from_kv = TokenLibrary.getSecret("<Linked_Key_Vault_Service_Name>", "<Key_Vault_Key_Name>", "<Key_Vault_Name>")
    return spark.read.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{jdbc_hostname}:1433;databaseName={db};user={user};password={password}") \
        .option("query", query) \
        .load()
Then I have created VIEW with column names without spaces:
CREATE VIEW v_my_table
AS
SELECT [Country code] as country_code from my_table
Granted access to <spark_db_client>:
GRANT SELECT ON v_my_table to <spark_db_client>
After all this preparation I was able to read the data through the VIEW and save it to a Spark database:
query = """
SELECT country_code FROM dbo.v_my_table
"""
df = spark_query(db="<my_database>", query=query)
spark.sql("CREATE DATABASE IF NOT EXISTS spark_poc")
df.write.mode("overwrite").saveAsTable("spark_poc.my_table")
df.registerTempTable("my_table")
These are <placeholder_variables>.

Using loop to create spark SQL queries

I am trying to create some Spark SQL queries for different tables which I have collected in a list. I want to create SQL queries for all the tables present in the Hive database. The Hive context has been initialized. Following is my approach:
tables = spark.sql("show tables in survey_db")
# registering dataframe as temp view with 2 columns - tableName and db name
tables.createOrReplaceTempView("table_list")
# collecting my table names in a list
table_array= spark.sql("select collect_list(tableName) from table_list").collect()[0][0]
# array values(list)
table_array= [u'survey',u'market',u'customer']
I want to create spark SQL queries for the table names stored in table_array. for example:
for i in table_array:
    spark.sql("select * from survey_db.'i'")
I can't use shell scripting as I have to write a PySpark script for this. Please advise if spark.sql queries can be created using a loop/map. Thanks everyone.
You can achieve the same as follows:
sql_list = [f"select * from survey_db.{table}" for table in table_array]
for sql in sql_list:
    df = spark.sql(sql)
    df.show()
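For completeness, a minimal end-to-end sketch that also builds the table list itself (assuming a Hive database named survey_db; show tables returns a tableName column):
tables_df = spark.sql("show tables in survey_db")
table_array = [row.tableName for row in tables_df.collect()]

for table in table_array:
    df = spark.sql(f"select * from survey_db.{table}")
    df.show()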

How to get Create Statement of Table in some other database in Spark using JDBC

Problem statement:
I have an Impala database where multiple tables are present.
I am creating a Spark JDBC connection to Impala and loading these tables into a Spark dataframe for my validations like this, which works fine:
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","tablename")
.load()
Now the next step, and my actual problem, is that I need to find the CREATE statement which was used to create the tables in Impala itself.
Since I cannot run a command like the one below (it gives an error), is there any way I can fetch the SHOW CREATE TABLE output for tables present in Impala?
val df = spark.read.format("jdbc")
.option("url","url")
.option("dbtable","show create table tablename")
.load()
Perhaps you can use Spark SQL "natively" to execute something like
val createstmt = spark.sql("show create table <tablename>")
The resulting dataframe will have a single column (type string) which contains a complete CREATE TABLE statement.
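For example (a PySpark sketch of the same idea, assuming the table is visible to Spark's catalog), the DDL string can then be pulled out of that single row:
ddl = spark.sql("show create table <tablename>").collect()[0][0]
print(ddl)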
But, if you still choose to go the JDBC route, there is always the option of using the good old JDBC interface directly. Scala can call Java libraries natively, after all...
import java.sql.{Connection, DriverManager, ResultSet, Statement}
val conn: Connection = DriverManager.getConnection("url")
val stmt: Statement = conn.createStatement()
val rs: ResultSet = stmt.executeQuery("show create table <tablename>")
// ...etc...

Read from a hive table and write back to it using spark sql

I am reading a Hive table using Spark SQL and assigning it to a scala val
val x = sqlContext.sql("select * from some_table")
Then I do some processing on the dataframe x and finally come up with a dataframe y, which has the exact same schema as the table some_table.
Finally I am trying to insert overwrite the y dataframe into the same Hive table some_table:
y.write.mode(SaveMode.Overwrite).insertInto("some_table")
Then I am getting the error
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from
I tried creating an INSERT SQL statement and firing it using sqlContext.sql(), but it gave me the same error.
Is there any way I can bypass this error? I need to insert the records back to the same table.
Hi, I tried doing as suggested, but I'm still getting the same error:
val x = sqlContext.sql("select * from incremental.test2")
val y = x.limit(5)
y.registerTempTable("temp_table")
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("incremental.test2")
scala> dy.write.mode("overwrite").insertInto("incremental.test2")
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;
Actually you can also use checkpointing to achieve this. Since it breaks data lineage, Spark is not able to detect that you are reading and overwriting in the same table:
sqlContext.sparkContext.setCheckpointDir(checkpointDir)
val ds = sqlContext.sql("select * from some_table").checkpoint()
ds.write.mode("overwrite").saveAsTable("some_table")
You should first save your DataFrame y in a temporary table
y.write.mode("overwrite").saveAsTable("temp_table")
Then you can overwrite rows in your target table
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("some_table")
You should first save your DataFrame y as a parquet file:
y.write.parquet("temp_table")
Afterwards, load it back:
val parquetFile = sqlContext.read.parquet("temp_table")
And finally insert the data into your table:
parquetFile.write.insertInto("some_table")
In the context of Spark 2.2:
This error means that our process is reading from and writing to the same table.
Normally, this should work, as the process writes to a .hiveStaging... directory first.
The error occurs with the saveAsTable method, because it overwrites the entire table instead of individual partitions.
The error should not occur with the insertInto method, because it overwrites partitions, not the whole table.
One reason this happens is that the Hive table has the following Spark TBLPROPERTIES in its definition. The problem goes away for the insertInto method if you remove them (see the sketch after the list):
'spark.sql.partitionProvider'
'spark.sql.sources.provider'
'spark.sql.sources.schema.numPartCols'
'spark.sql.sources.schema.numParts'
'spark.sql.sources.schema.part.0'
'spark.sql.sources.schema.part.1'
'spark.sql.sources.schema.part.2'
'spark.sql.sources.schema.partCol.0'
'spark.sql.sources.schema.partCol.1'
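A hedged PySpark sketch of checking and removing those properties through Spark SQL (assuming the table is called some_table; the same DDL can also be run from the Hive CLI):
# list the current table properties
spark.sql("show tblproperties some_table").show(truncate=False)

# drop the Spark-specific properties listed above (repeat for the remaining keys)
spark.sql("alter table some_table unset tblproperties if exists ('spark.sql.partitionProvider', 'spark.sql.sources.provider')")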
https://querydb.blogspot.com/2019/07/read-from-hive-table-and-write-back-to.html
When we upgraded our HDP to 2.6.3, Spark was updated from 2.2 to 2.3, which resulted in the error below:
Caused by: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
at org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:906)
This error occurs for jobs wherein we are reading from and writing to the same path, e.g. jobs with SCD logic.
Solution:
Set --conf "spark.sql.hive.convertMetastoreOrc=false"
or update the job so that it writes data to a temporary table, then reads from the temporary table and inserts it into the final table.
https://querydb.blogspot.com/2020/09/orgapachesparksqlanalysisexception.html
Read the data from the Hive table in Spark:
import org.apache.hadoop.io.WritableComparable
import org.apache.hadoop.mapreduce.InputFormat
import org.apache.hive.hcatalog.data.HCatRecord
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat
val hconfig = new org.apache.hadoop.conf.Configuration()
HCatInputFormat.setInput(hconfig, "dbname", "tablename")
val inputFormat = (new HCatInputFormat).asInstanceOf[InputFormat[WritableComparable[_], HCatRecord]].getClass
val data = sc.newAPIHadoopRDD(hconfig, inputFormat, classOf[WritableComparable[_]], classOf[HCatRecord])
You'll also get the error "Cannot overwrite a path that is also being read from" in a case where you are doing this:
You are running an "insert overwrite" into a Hive TABLE "A" from a VIEW "V" (that executes your logic),
and that VIEW also references the same TABLE "A". I found this out the hard way, as the VIEW was deeply nested code that was querying "A" as well. Bummer.
It is like cutting the very branch on which you are sitting :-(
What you need to keep in mind before doing the below is that the Hive table you are overwriting should have been created by Hive DDL, not by
Spark (df.write.saveAsTable("<table_name>")).
If that is not the case, this won't work.
I tested this in Spark 2.3.0:
val tableReadDf=spark.sql("select * from <dbName>.<tableName>")
val updatedDf=tableReadDf.<transformation> //any update/delete/addition
updatedDf.createOrReplaceTempView("myUpdatedTable")
spark.sql("""with tempView as(select * from myUpdatedTable) insert overwrite table
<dbName>.<tableName> <partition><partition_columns> select * from tempView""")
This is a good solution for me:
Extract the RDD and schema from the DataFrame.
Create a new clone DataFrame.
Overwrite the table.
private def overWrite(df: DataFrame): Unit = {
  val schema = df.schema
  val rdd = df.rdd
  // Rebuilding the DataFrame from its RDD and schema breaks the lineage,
  // so Spark no longer treats the target table as something it is still reading from.
  val dfForSave = spark.createDataFrame(rdd, schema)
  dfForSave.write
    .mode(SaveMode.Overwrite)
    .insertInto(s"${tableSource.schema}.${tableSource.table}")
}