Can PySpark use JDBC to pass ALTER TABLE - PostgreSQL

I would like to pass an ALTER TABLE command to my PostgreSQL database after I load data from a Databricks notebook using PySpark. I know that I can pass a query using spark.read.jdbc, but in this case I would like to add a unique constraint once the data has loaded. The purpose is to speed up the load into the database by creating the unique index only after the data is in the table.

Spark is a framework for data processing, so its API is mostly built around read and write operations against data sources. In your case you have DDL statements to execute, and Spark isn't meant to perform such operations.
The better option is to keep the DDL step separate from the data processing you do in Spark SQL. Here you can add one more PostgreSQL job to perform such operations.
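For example, a minimal sketch of such a follow-up step, assuming psycopg2 is available on the driver and treating the connection details, table name, and constraint name as placeholders:

import psycopg2

# Placeholder connection details; run this after the Spark load has finished.
conn = psycopg2.connect(host="<host>", port=5432, dbname="<database>",
                        user="<user>", password="<password>")
conn.autocommit = True
with conn.cursor() as cur:
    # Add the unique constraint only once the data is loaded.
    cur.execute("ALTER TABLE my_table ADD CONSTRAINT my_table_id_uq UNIQUE (id)")
conn.close()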

I was experiencing this exact problem in Redshift. After reviewing the doc on JDBC connections, it looks like you can do something like this:
%sql
ALTER TABLE <jdbcTable> {SOME ACTIONS}
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:<databaseServerType>://<jdbcHostname>:<jdbcPort>",
  dbtable "<jdbcDatabase>.atable",
  user "<jdbcUsername>",
  password "<jdbcPassword>"
)

Related

Incrementally loading into a Synapse table using Spark

I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and save it in Parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process the data and load it into the destination tables.
The initial load is fairly easy using df.write.saveAsTable('orders'); however, I am running into some issues doing the incremental loads that follow the initial load. In particular, I have not been able to find a way to reliably insert/update information into an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using df.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
A solution indicated by this suggests creating a temporary table and then inserting from it, but that still results in the above error.
Another solution in this post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but upon doing this Spark gives me FileNotFound errors regarding metadata.
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
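For reference, a minimal sketch of the read-modify-overwrite pattern described above (the table name, path, and merge logic are placeholders, not a working solution):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

existing = spark.read.table('orders')            # current contents of the target table
increment = spark.read.parquet('<adls_path>')    # new/changed rows extracted to ADLS

# Stand-in for the actual insert/update logic applied to the DataFrame.
merged = existing.unionByName(increment)

# This is the step that fails with
# "Cannot overwrite table 'orders' that is also being read from".
merged.write.saveAsTable('orders', mode='overwrite', format='parquet')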
All suggestions are greatly appreciated.

Create a database in PySpark using Python APIs only

I'm unable to locate any API to create a database in PySpark. All I can find is the SQL-based approach.
The catalog API doesn't mention a Python method to create a database.
Is this even possible? I am trying to avoid using SQL.
I guess it might be a suboptimal solution, but you can call a CREATE DATABASE statement using SparkSession's sql method, like this:
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
It's not a pure PySpark API, but this way you don't have to switch context to SQL completely just to create a database :)
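If you want to stay in Python afterwards, you can at least verify the result through the catalog API (a small sketch; test_db follows the example above):

# List database names via the Python catalog API and check the new one exists.
databases = [db.name for db in spark.catalog.listDatabases()]
print("test_db" in databases)  # True once the database has been created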

How to perform upsert operation in PostgreSQL on Cloud Data Fusion

I want to perform an upsert before writing to PostgreSQL on Cloud Data Fusion. I can easily write with the sink plugin, but I can't find how to do the update if the value already exists. Thanks.
As pointed out by @ShipraSarkar, there is no option in the PostgreSQL sink plugin for an upsert operation. Instead, PostgreSQL's INSERT ... ON CONFLICT statement can be used for this. As you rightly pointed out, you can first put the data in a staging table using the sink plugin and then run a query using the PostgreSQL executor to merge (i.e. upsert) it into the target table.
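A rough sketch of the kind of merge query the PostgreSQL executor could run (the table and column names here are hypothetical):

-- Upsert rows from the staging table into the target table.
INSERT INTO target_table (id, value)
SELECT id, value FROM staging_table
ON CONFLICT (id)
DO UPDATE SET value = EXCLUDED.value;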

Is this the right use case for KSQL

I am using Kafka Streams (KStreams) and need to de-duplicate the data. The source ingests duplicate data for several reasons, e.g. the data itself is duplicated or records are re-partitioned.
I am currently using Redis for this use case, where data is stored roughly as below:
id#object  list-of-applications-that-processed-this-id-and-object
As KSQL is implemented on top of RocksDB, which is also a key-value database, can I use KSQL for this use case?
At the time of successful processing, I would add an entry to KSQL. At the time of reception, I would check for the existence of the id in KSQL.
Is this a correct use case for KSQL's design in the event-processing world?
If you want to use ksqlDB as a cache, you can create a TABLE using the topic as its data source. Note that a CREATE TABLE statement by itself only declares a schema (it does not pull any data into ksqlDB yet).
CREATE TABLE inputTable <schemaDefinition> WITH(kafka_topic='...');
Check out the docs for more details: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-table/
To pull in the data, you can create a second table via:
CREATE TABLE cache AS SELECT * FROM inputTable;
This will run a query in the background that reads the input data and puts the result into the ksqlDB server. Because the query is a simple SELECT *, it effectively pulls in all data from the topic. You can now issue "pull queries" (i.e., lookups) against the result to use TABLE cache as desired: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/
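For example, assuming the table is keyed on a column named id (a hypothetical column name), a pull query could look like:
SELECT * FROM cache WHERE id = 'some-id';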
Future work:
We are currently working on adding "source tables" (cf https://github.com/confluentinc/ksql/pull/7474) that will make this setup simpler. If you declare a source table, you can do the same with a single statement instead of two:
CREATE SOURCE TABLE cache <schemaDefinition> WITH(kafka_topic='...');

PySpark save to Redshift table with "overwrite" mode results in dropping table?

I am using PySpark in AWS Glue to load data from S3 files into a Redshift table. In the code I used mode("overwrite") and got an error stating "can't drop table because other objects depend on the table". It turned out there is a view created on top of that table. It seems the "overwrite" mode actually drops and re-creates the Redshift table and then loads the data. Is there any option that only truncates the table instead of dropping it?
AWS Glue uses the Databricks spark-redshift connector (it's not documented anywhere, but I verified it empirically). The spark-redshift connector's documentation mentions:
Overwriting an existing table: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.
There is a related discussion in line with your question, where they have used truncate instead of overwrite; it is a combination of Lambda and Glue. Please refer here for the detailed discussion and code samples. Hope this helps.
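If you are calling the connector directly from PySpark, a hedged sketch of the truncate-then-append idea looks like this (the option names follow the spark-redshift connector documentation; the URL, table, tempdir, and IAM role values are placeholders):

# "preactions" runs before the COPY; combined with append mode this truncates
# the table instead of dropping it, so dependent views survive.
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3://<bucket>/tmp/") \
    .option("aws_iam_role", "<redshift-copy-role-arn>") \
    .option("preactions", "TRUNCATE TABLE my_table;") \
    .mode("append") \
    .save()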