Azure Databricks - Do we have a Postgres connector for Spark?

Do we have a Postgres connector for Spark on Azure Databricks?
Also, how do I upsert/update records in Postgres using Spark on Databricks?
I am using Spark 3.1.1.
When trying to write with mode=overwrite, it truncates the table but the records are not inserted.
I am new to this. Please help.

You don't need a separate connector for PostgreSQL - it works via the standard JDBC connector, and the PostgreSQL JDBC driver should be included in the Databricks runtime (check the release notes for your specific runtime version). So you just need to form a correct JDBC URL as described in the documentation (the Spark documentation also has examples of URLs for PostgreSQL).
Something like this:
df.write \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.save()
Regarding the UPSERT, it's not so simple - not only for PostgreSQL, but for other databases as well:
- Either you do a full join, take the rows from your dataset where they exist and the rest from the database, and then overwrite the table - but this is very expensive because you're reading the full table and writing it back.
- Or you do a left join against the database (you need to read it again), drop down to the RDD level with .foreachPartition/.foreach, and issue a series of INSERT/UPDATE statements depending on whether each row already exists in the database - doable, but it requires more experience.
Specifically for PostgreSQL you can turn that foreach logic into INSERT ... ON CONFLICT so you won't need to read the full table (see the PostgreSQL wiki for more information about this operation); a short sketch follows.
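For illustration, a minimal sketch of that foreachPartition approach, assuming psycopg2 is installed on the cluster and a target table with placeholder columns id (the primary key) and value:
import psycopg2

def upsert_partition(rows):
    # Runs on the executors: one connection per partition
    con = psycopg2.connect(host="dbserver", dbname="mydb",
                           user="username", password="password")
    cur = con.cursor()
    for row in rows:
        cur.execute(
            "INSERT INTO schema.tablename (id, value) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value",
            (row["id"], row["value"]))
    con.commit()
    con.close()

df.foreachPartition(upsert_partition)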
Another approach - write your data into a temporary (staging) table, and then issue a single statement via JDBC to merge your changes into the target table (MERGE where the database supports it, or INSERT ... ON CONFLICT on PostgreSQL). This is the more "lightweight" method from my point of view; see the sketch below.
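A rough sketch of that staging-table variant (table and column names are placeholders):
# 1) Bulk-load the dataframe into a staging table over JDBC
df.write \
  .format("jdbc") \
  .option("url", "jdbc:postgresql://dbserver/mydb") \
  .option("dbtable", "schema.tablename_staging") \
  .option("user", "username") \
  .option("password", "password") \
  .mode("overwrite") \
  .save()

# 2) Fold the staging table into the target with a single statement from the driver
import psycopg2
con = psycopg2.connect(host="dbserver", dbname="mydb", user="username", password="password")
cur = con.cursor()
cur.execute("""
    INSERT INTO schema.tablename (id, value)
    SELECT id, value FROM schema.tablename_staging
    ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value
""")
con.commit()
con.close()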

Related

Spark to Synapse "truncate" not working as expected

I have a simple requirement: write a dataframe from Spark (Databricks) to a Synapse dedicated pool table and keep refreshing (truncating) it on a daily basis without dropping it.
The documentation suggests using truncate with overwrite mode, but that doesn't seem to work as expected for me, as I keep seeing the table creation date getting updated.
I am using
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", synapse_jdbc) \
.option("tempDir", tempDir) \
.option("useAzureMSI", "true") \
.option("dbTable", table_name) \
.mode("overwrite") \
.option("truncate","true") \
.save()
But there doesn't seem to be any difference whether I use truncate or not. The creation date/time of the table in Synapse gets updated every time I execute the above from Databricks. Can anyone please help with this? What am I missing?
I already have a workaround that works, but it feels more like a hack:
.option("preActions", "truncate table "+table_name) \
.mode("append") \
I tried to reproduce your scenario in my environment, and truncate is not working for me with the Synapse connector either.
While researching this issue I found that not all options are supported by the Synapse connector. The official Microsoft documentation provides the list of supported options, such as dbTable, query, user, password, url, encrypt=true, jdbcDriver, tempDir, tempCompression, forwardSparkAzureStorageCredentials, useAzureMSI, enableServicePrincipalAuth, etc.
The truncate option is supported by the jdbc format, not the Synapse connector.
When I changed the format from com.databricks.spark.sqldw to jdbc, it worked fine.
My Code:
df.write.format("jdbc")
.option("url",synapse_jdbc)
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", table_name)
.option("tempDir", tempdir)
.option("truncate","true")
.mode("overwrite")
.save()
First execution and second execution (screenshots omitted):
Conclusion: in both runs the table creation time is the same, which means overwrite is not dropping the table - it is truncating it.

Delete redshift table from within databricks using pyspark

I tried to connect to a Redshift system table called stv_sessions, and I can read the data into a dataframe.
The stv_sessions table is a Redshift system table that holds the process IDs of all currently running queries.
To terminate a running query we can do this:
select pg_terminate_backend(pid)
While this works when I connect to Redshift directly (using Aginity), I get insufficient privilege errors when trying to run it from Databricks.
Simply put, I don't know how to run this query from a Databricks notebook.
This is what I have tried so far:
kill_query = "select pg_terminate_backend('12345')"
some_random_df_i_created.write.format("com.databricks.spark.redshift") \
  .option("url", redshift_url) \
  .option("dbtable", "stv_sessions") \
  .option("tempdir", temp_dir_loc) \
  .option("forward_spark_s3_credentials", True) \
  .options("preactions", kill_query) \
  .mode("append") \
  .save()
Please let me know if the methodology I am following is correct.
Thank you
Databricks purposely does not pre-include this driver. You need to download the official Amazon Redshift JDBC driver, upload it to Databricks, and attach the library to your cluster (v1.2.12 or lower is recommended with Databricks clusters). Then use JDBC URLs of the form:
val jdbcUsername = "REPLACE_WITH_YOUR_USER"
val jdbcPassword = "REPLACE_WITH_YOUR_PASSWORD"
val jdbcHostname = "REPLACE_WITH_YOUR_REDSHIFT_HOST"
val jdbcPort = 5439
val jdbcDatabase = "REPLACE_WITH_DATABASE"
val jdbcUrl = s"jdbc:redshift://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}"
Then try putting jdbcUrl in place of your redshift_url.
That may be the only reason you are getting privilege issues.
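In PySpark, a rough sketch of the same write using that rebuilt URL (mirroring the call from the question, but with .option used for each key) could look like this:
jdbc_url = "jdbc:redshift://{host}:5439/{db}?user={user}&password={pwd}".format(
    host="REPLACE_WITH_YOUR_REDSHIFT_HOST", db="REPLACE_WITH_DATABASE",
    user="REPLACE_WITH_YOUR_USER", pwd="REPLACE_WITH_YOUR_PASSWORD")

some_random_df_i_created.write \
  .format("com.databricks.spark.redshift") \
  .option("url", jdbc_url) \
  .option("dbtable", "stv_sessions") \
  .option("tempdir", temp_dir_loc) \
  .option("forward_spark_s3_credentials", "true") \
  .option("preactions", kill_query) \
  .mode("append") \
  .save()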
Link 1: https://docs.databricks.com/_static/notebooks/redshift.html
Link 2: https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#installation
Another reason could be that the Redshift-Databricks connector only uses SSL (encryption in flight), and it is possible that IAM roles on your Redshift cluster only allow certain users to delete tables.
Apologies if none of this helps your case.

How to speed up spark df.write jdbc to postgres database?

I am new to Spark and am attempting to speed up appending the contents of a dataframe (which can have between 200k and 2M rows) to a Postgres database using df.write:
df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()
I tried increasing the batchsize, but that didn't help; completing this task still took ~4 hours. The AWS EMR snapshots (not included here) show more details about how the job ran; the task saving the dataframe to the Postgres table was assigned to only one executor, which I found strange. Would speeding this up involve dividing the task between executors?
Also, I have read Spark's performance tuning docs, but increasing batchsize and queryTimeout has not seemed to improve performance. (I tried calling df.cache() in my script before df.write, but the runtime was still 4 hours.)
Additionally, my AWS EMR hardware setup and spark-submit settings are:
Master Node (1): m4.xlarge
Core Nodes (2): m5.xlarge
spark-submit --deploy-mode client --executor-cores 4 --num-executors 4 ...
Spark is a distributed data processing engine, so when you process your data or save it to a file system, it uses all of its executors to perform the task.
Spark JDBC is slow because when you establish a JDBC connection, one of the executors establishes the link to the target database, resulting in slow speeds and failures.
To overcome this problem and speed up data writes to the database you need to use one of the following approaches:
Approach 1:
In this approach you use the Postgres COPY command to speed up the write operation. This requires the psycopg2 library on your EMR cluster.
The documentation for the COPY utility is here.
If you want to know the benchmark differences and why COPY is faster, visit here!
Postgres also suggests using the COPY command for bulk inserts. So, how do you bulk insert a Spark dataframe?
To implement faster writes, first save your Spark dataframe to the EMR file system in CSV format, repartitioning the output so that no file contains more than 100k rows.
# Repartition your dataframe dynamically based on the number of rows in df
df.repartition(10).write.option("maxRecordsPerFile", 100000).mode("overwrite").csv("path/to/save/data")
Now read the files using Python and execute a COPY command for each file.
import psycopg2

# Paths of the part files written above; you can also list them with the os module
file = 'path/to/save/data/part-00000_0.csv'
file1 = 'path/to/save/data/part-00000_1.csv'

# Define a function that copies one file into the target table
def execute_copy(fileName):
    con = psycopg2.connect(database=dbname, user=user, password=password, host=host, port=port)
    cursor = con.cursor()
    with open(fileName) as f:  # copy_from expects an open file object
        cursor.copy_from(f, 'table_name', sep=",")
    con.commit()
    con.close()
To gain an additional speed boost, since you are using an EMR cluster, you can leverage Python multiprocessing to copy more than one file at a time.
from multiprocessing import Pool, cpu_count

with Pool(cpu_count()) as p:
    print(p.map(execute_copy, [file, file1]))
This is the recommended approach, as Spark JDBC can't be tuned to gain higher write speeds due to connection constraints.
Approach 2:
Since you are already using an AWS EMR cluster, you can always leverage Hadoop capabilities to perform your table writes faster.
So here we will use sqoop export to export the data from EMRFS to the Postgres DB.
# If you are using S3 as your source path
sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir s3://mybucket/myinputfiles/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16
# If you are using EMRFS as your source path
sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir /path/to/save/data/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16
Why sqoop?
Because sqoop opens multiple connections to the database based on the number of mappers specified. So if you specify -m 8, there will be 8 concurrent connection streams writing data to Postgres.
Also, for more information on using sqoop, go through this AWS blog, SQOOP Considerations, and the SQOOP documentation.
If you can hack your way around with code, then Approach 1 will definitely give you the performance boost you seek; if you are comfortable with Hadoop components like Sqoop, then go with the second approach.
Hope it helps!
Spark-side tuning => repartition the DataFrame so that multiple executors write to the DB in parallel:
(df
    .repartition(10)  # number of concurrent connections from Spark to PostgreSQL
    .write.format('jdbc').options(
        url=psql_url_spark,
        driver=spark_env['PSQL_DRIVER'],
        dbtable="{schema}.{table}".format(schema=schema, table=table),
        user=spark_env['PSQL_USER'],
        password=spark_env['PSQL_PASS'],
        batchsize=2000000,
        queryTimeout=690
    ).mode(mode).save())
PostgreSQL-side tuning =>
You will need to bump up the parameters below on PostgreSQL:
- max_connections determines the maximum number of concurrent connections to the database server. The default is typically 100 connections.
- shared_buffers determines how much memory is dedicated to PostgreSQL for caching data.
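As a quick sanity check before changing these, you can read the current values from Python (a sketch using psycopg2; connection details are placeholders):
import psycopg2

con = psycopg2.connect(host="dbserver", dbname="mydb", user="username", password="password")
cur = con.cursor()
for name in ("max_connections", "shared_buffers"):
    cur.execute("SHOW " + name)
    print(name, "=", cur.fetchone()[0])
con.close()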
To solve the performance issue, you generally need to resolve the below 2 bottlenecks:
1. Make sure the Spark job is writing the data to the DB in parallel.
To achieve this, make sure you have a partitioned dataframe. Use df.repartition(n) to partition the dataframe so that each partition is written to the DB in parallel.
Note - a large number of executors will also lead to slow inserts, so start with 5 partitions and increase the number of partitions by 5 until you get optimal performance.
2. Make sure the DB has enough compute, memory, and storage for ingesting bulk data.
That repartitioning the dataframe gives better write performance is a known answer, but there is an optimal way of repartitioning your dataframe.
Since you are running this process on an EMR cluster, first find out the instance type and the number of cores running on each of your core (slave) instances, and specify the number of dataframe partitions accordingly.
In your case you are using m5.xlarge (2 core nodes), each with 4 vCPUs, which means 4 threads per instance. So 8 partitions will give you an optimal result when you are dealing with huge data.
Note: the number of partitions should be increased or decreased based on your data size.
Note: batch size is also something you should consider in your writes - the bigger the batch size, the better the performance.
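Putting the partition-count and batch-size notes together, a hedged sketch (the option values are illustrative starting points, not tuned numbers; numPartitions caps the concurrent JDBC connections and batchsize sets the rows per insert batch):
df.repartition(8) \
  .write.format("jdbc") \
  .option("url", psql_url_spark) \
  .option("driver", spark_env['PSQL_DRIVER']) \
  .option("dbtable", "{schema}.{table}".format(schema=schema, table=table)) \
  .option("user", spark_env['PSQL_USER']) \
  .option("password", spark_env['PSQL_PASS']) \
  .option("numPartitions", 8) \
  .option("batchsize", 10000) \
  .mode("append") \
  .save()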

Is there anything similar to a MySQL trigger in IBM Netezza?

As you can see from the title, I want to know whether there is something similar to MySQL's trigger function. What I actually want to do is import data from IBM Netezza databases using sqoop incremental mode. Below is the sqoop script I'm going to use.
sqoop job --create dhjob01 -- import --connect jdbc:netezza://10.100.3.236:5480/TEST \
--username admin --password password \
--table testm \
--incremental lastmodified \
--check-column 'modifiedtime' --last-value '1995-07-18' \
--target-dir /user/dhlee/nz_sqoop_test \
-m 1
As the official Sqoop documentation says, I can gather data from RDBMSs in incremental mode by creating a sqoop import job and executing it repeatedly.
Anyway, the point is that I need something like a MySQL trigger so that the modified date is updated whenever tables in Netezza are updated. And if you have any good idea for gathering the data incrementally, please tell me. Thank you.
Unfortunately there isn't anything similar to triggers available. I would recommend modifying the relevant UPDATE commands to set a column to CURRENT_TIMESTAMP.
In Netezza you have something even better:
- Deleted records are still visible: http://dwgeek.com/netezza-recover-deleted-rows.html/
- The INSERT and DELETE TXIDs are ever-increasing numbers (and visible on all records, as described above)
- Updates are really a delete plus an insert
Can you follow me?
This is the screenshot (not included here) I got after I inserted and deleted some rows.

Spark SQL build for hive?

I have downloaded the Spark 1.3.1 release; the package type is "Pre-built for Hadoop 2.6 and later".
Now I want to run the Scala code below using spark-shell, so I followed these steps:
1. bin/spark-shell
2. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Now the problem is, if I verify it in the Hue browser with
select * from src;
then I get a
table not found exception
which means the table was not created. How do I configure Hive with spark-shell to make this work? I want to use Spark SQL, and I also need to read and write data from Hive.
I heard somewhere that we need to copy the hive-site.xml file into the Spark directory.
Can someone please explain the steps for configuring Spark SQL with Hive?
Thanks
Tushar
Indeed, the hive-site.xml direction is correct. Take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables.
Also, it sounds like you want to create a Hive table from Spark; for that, look at "Saving to Persistent Tables" in the same document.
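For what it's worth, on more recent Spark versions (the 1.3-era HiveContext API differs a bit) saving a persistent table from PySpark looks roughly like this, assuming hive-site.xml has been copied into Spark's conf/ directory:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive-example") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

# Persist a dataframe as a table in the Hive metastore so it is visible from Hue
df = spark.sql("SELECT key, value FROM src")
df.write.mode("overwrite").saveAsTable("src_copy")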