Spark to Synapse "truncate" not working as expected - pyspark

I have a simple requirement: write a dataframe from Spark (Databricks) to a Synapse dedicated pool table and keep refreshing (truncating) it on a daily basis without dropping it.
The documentation suggests using truncate with overwrite mode, but that doesn't seem to be working as expected for me, as I continue to see the table creation date getting updated.
I am using:
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", synapse_jdbc) \
    .option("tempDir", tempDir) \
    .option("useAzureMSI", "true") \
    .option("dbTable", table_name) \
    .option("truncate", "true") \
    .mode("overwrite") \
    .save()
But there doesn't seem to be any difference whether I use truncate or not: the creation date/time of the table in Synapse gets updated every time I execute the above from Databricks. Can anyone please help with this? What am I missing?
I already have a workaround that works, but it feels more like a workaround than a proper solution:
.option("preActions", "truncate table " + table_name) \
.mode("append") \

I tried to reproduce your scenario in my environment, and truncate is not working for me with the Synapse connector either.
While researching this issue I found that not all options are supported by the Synapse connector. The official Microsoft documentation provides the list of supported options, such as dbTable, query, user, password, url, encrypt=true, jdbcDriver, tempDir, tempCompression, forwardSparkAzureStorageCredentials, useAzureMSI, enableServicePrincipalAuth, etc.
The truncate option is supported by the jdbc format, not by the Synapse connector.
When I change the format from com.databricks.spark.sqldw to jdbc, it works fine.
My code:
df.write \
    .format("jdbc") \
    .option("url", synapse_jdbc) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", table_name) \
    .option("tempDir", tempdir) \
    .option("truncate", "true") \
    .mode("overwrite") \
    .save()
First execution and second execution: (screenshots not shown)
Conclusion: both times the code is executed, the table creation time stays the same, which means overwrite is not dropping the table; it is truncating it.

Related


Azure databricks - Do we have postgres connector for spark
Also, how do I upsert/update records in Postgres using Spark on Databricks?
I am using Spark 3.1.1.
When trying to write using mode=overwrite, it truncates the table but records are not getting inserted.
I am new to this. Please help.
You don't need a separate connector for PostgreSQL - it works via the standard JDBC connector, and the PostgreSQL JDBC driver should be included in the Databricks runtime - check the release notes for your specific runtime. So you just need to form a correct JDBC URL as described in the documentation (the Spark documentation also has examples of URLs for PostgreSQL).
Something like this:
df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()
Regarding the UPSERT, it's not so simple, not only for PostgreSQL but also for other databases:
- either you do a full join, take the entries from your dataset where they exist and the rest from the database, and then overwrite - but this is very expensive because you're reading the full database and writing it back,
- or you do a left join with the database (you need to read it again), go down to the RDD level with .foreachPartition/.foreach, and form a series of INSERT/UPDATE operations depending on whether the data exists in the database or not - it's doable, but you need more experience.
Specifically for PostgreSQL you can convert this (the foreach) into INSERT ... ON CONFLICT so you won't need to read the full database (see their wiki for more information about this operation); a rough sketch follows below.
Another approach: write your data into a temporary table, and then via JDBC issue a MERGE command to incorporate your changes into the table. This is a more "lightweight" method, from my point of view.
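A minimal sketch of the .foreachPartition + INSERT ... ON CONFLICT idea, assuming a hypothetical target table schema.tablename(id, value) with id as the primary key, the psycopg2 driver available on the workers, and placeholder connection details:

import psycopg2

def upsert_partition(rows):
    # One connection per partition; each row becomes an INSERT that falls
    # back to an UPDATE when the primary key already exists.
    conn = psycopg2.connect(host="dbserver", dbname="db",
                            user="username", password="password")
    cur = conn.cursor()
    for row in rows:
        cur.execute(
            """
            INSERT INTO schema.tablename (id, value)
            VALUES (%s, %s)
            ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value
            """,
            (row["id"], row["value"]),
        )
    conn.commit()
    cur.close()
    conn.close()

df.foreachPartition(upsert_partition)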

Is there anything similar to MySQL triggers in IBM Netezza?

As you can see in the title, I want to know whether there is a function similar to MySQL's triggers. What I actually want to do is import data from IBM Netezza databases using Sqoop's incremental mode. Below is the Sqoop script that I'm going to use.
sqoop job --create dhjob01 -- import --connect jdbc:netezza://10.100.3.236:5480/TEST \
--username admin --password password \
--table testm \
--incremental lastmodified \
--check-column 'modifiedtime' --last-value '1995-07-18' \
--target-dir /user/dhlee/nz_sqoop_test \
-m 1
As the official Sqoop documentation says, I can gather data from RDBMSs in incremental mode by creating a Sqoop import job and executing it repeatedly.
Anyway, the point is, I need a function like MySQL triggers so that I can update the modified date whenever tables in Netezza are updated. And if you have any good idea for gathering the data incrementally, please tell me. Thank you.
Unfortunately there isn't anything similar to triggers available. I would recommend modifying the relevant UPDATE commands to include setting a column to CURRENT_TIMESTAMP.
In Netezza you have something even better:
- deleted records are still possible to see: http://dwgeek.com/netezza-recover-deleted-rows.html/
- the INSERT and DELETE TXIDs are rising numbers (and visible on all records as described above)
- updates are really a delete plus an insert
Can you follow me?
This is the screenshot I got after inserting and deleting some rows (image not shown).

Hive table created with Spark not visible from HUE / Hive GUI

I am creating a Hive table from Scala using the following code:
val spark = SparkSession
  .builder()
  .appName("self service")
  .enableHiveSupport()
  .master("local")
  .getOrCreate()

spark.sql("CREATE TABLE default.TEST_TABLE (C1 INT)")
The table must be getting created successfully, because if I run this code twice I receive an error saying the table already exists.
However, when I try to access this table from the GUI (HUE), I cannot see any table in Hive, so it seems it's being saved in a different path than the one used by Hive in HUE to get this information.
Do you know what I should do to be able to see the tables I create from my code in the HUE/Hive web GUI?
Any help will be much appreciated.
Thank you very much.
It seems to me you have not added hive-site.xml to the proper path.
hive-site.xml has the properties that Spark needs to connect successfully with Hive, and you should add it to the directory
SPARK_HOME/conf/
You can also add this file by using spark.driver.extraClassPath and giving the directory where the file exists. For example, in a spark-submit for a PySpark script:
/usr/bin/spark2-submit \
    --conf spark.driver.extraClassPath=<directory containing hive-site.xml> \
    --master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
    --executor-cores n myScript.py

How to change a mongo-spark connection configuration from a databricks python notebook

I succeeded in connecting to MongoDB from Spark, using the mongo-spark connector, from a Databricks notebook in Python.
Right now I am configuring the MongoDB URI in an environment variable, but that is not flexible, since I want to change the connection parameters right in my notebook.
I read in the connector documentation that it is possible to override any values set in the SparkConf.
How can I override the values from python?
You don't need to set anything in the SparkConf beforehand*.
You can pass any configuration options to the DataFrame reader or writer, e.g.:
df = sqlContext.read \
    .option("uri", "mongodb://example.com/db.coll") \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load()
* This was added in 0.2
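The same pattern should work when writing as well; a small sketch (the URI is the same placeholder as above, and the options may need adjusting for your connector version):

df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://example.com/db.coll") \
    .mode("append") \
    .save()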

Exception after Setting property 'spark.sql.hive.metastore.jars' in 'spark-defaults.conf'

Given below are the versions of Spark & Hive I have installed on my system:
Spark : spark-1.4.0-bin-hadoop2.6
Hive : apache-hive-1.0.0-bin
I have configured the Hive installation to use MySQL as the Metastore. The goal is to access the MySQL Metastore & execute HiveQL queries inside spark-shell (using HiveContext).
So far I am able to execute HiveQL queries by accessing the Derby Metastore (as described here; I believe Spark 1.4 comes bundled with Hive 0.13.1, which in turn uses the internal Derby database as the Metastore).
Then I tried to point spark-shell to my external Metastore (MySQL in this case) by setting the property given below (as suggested here) in $SPARK_HOME/conf/spark-defaults.conf:
spark.sql.hive.metastore.jars /home/mountain/hv/lib:/home/mountain/hp/lib
I have also copied $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf, but I am getting the following exception when I start spark-shell:
mountain#mountain:~/del$ spark-shell
Spark context available as sc.
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
org/apache/hadoop/hive/ql/session/SessionState when creating Hive client
using classpath: file:/home/mountain/hv/lib/, file:/home/mountain/hp/lib/
Please make sure that jars for your version of hive and hadoop are
included in the paths passed to spark.sql.hive.metastore.jars.
Am I missing something, or am I not setting the property spark.sql.hive.metastore.jars correctly?
Note: verified on Linux Mint.
If you are setting properties in spark-defaults.conf, Spark will take those settings only when you submit your job using spark-submit.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal, run your job, say wordcount.py:
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE, then you should use the config() method. Here we set the Kafka jar packages:
spark = SparkSession.builder \
    .appName('Hello Spark') \
    .master('local[3]') \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
    .getOrCreate()
A corrupted version of hive-site.xml will also cause this... please copy the correct hive-site.xml.