Batch Insert from Dataframe to DB ignoring failed row in Pyspark

I am trying to insert a Spark DataFrame into PostgreSQL using a JDBC write. The PostgreSQL table has a unique constraint on one of its columns. When the DataFrame contains a row that violates the constraint, the entire batch is rejected and the Spark session closes with the error "duplicate key value violates unique constraint". The error itself is correct, since the data is a duplicate that already exists in the database:
org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:148
What I need is for the rows that do not violate the constraint to be inserted and the failing rows to be ignored, without failing the entire batch.
The code used is:
mode = "Append"
url = "jdbc:postgresql://IP/DB name"
properties = {"user": "username", "password": "password"}
(
    DF.write
    .option("numPartitions", partitions_for_parallelism)
    .option("batchsize", batch_size)
    .jdbc(url=url, table="table name", mode=mode, properties=properties)
)
How can I do this?

Unfortunately, Spark offers no out-of-the-box solution for this. I see several possible approaches:
(1) Implement the conflict-resolution logic yourself inside a foreachPartition function. For example, catch the constraint-violation exception and report it to the log.
(2) Drop the constraint on the PostgreSQL table and use an autogenerated primary key, which allows duplicate rows to be stored in the database. Deduplication can then be implemented as part of each SQL query, or run as a daily or hourly job.
(3) If no other system or process writes to the PostgreSQL table except your Spark job, you can use a join to filter out of the Spark DataFrame all rows that already exist in the table before writing it.
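The join-based filter in the last option might look roughly like the sketch below. It is only an illustration: the function name drop_existing_rows and the key_columns parameter are hypothetical, and it assumes the existing table has already been loaded into a DataFrame (for example with spark.read.jdbc, using the same url and properties as the write).

```python
def drop_existing_rows(new_df, existing_df, key_columns):
    """Return the rows of new_df whose key is not present in existing_df.

    new_df and existing_df are Spark DataFrames; key_columns is the list of
    column names covered by the unique constraint (an assumption here).
    """
    # Only the key columns are needed from the table that is already stored.
    existing_keys = existing_df.select(*key_columns).distinct()
    # A left anti join keeps the left-side rows that have no match on the right.
    return new_df.join(existing_keys, on=key_columns, how="left_anti")
```

The result can then be written with the same .jdbc(...) call as in the question. Duplicates inside the new DataFrame itself would still violate the constraint, so a dropDuplicates(key_columns) before the write may also be needed, and the approach is only safe while nothing else writes to the table between the read and the write.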
I hope these ideas are helpful.
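As a rough sketch of the foreachPartition idea: instead of catching the exception row by row, the variant below lets PostgreSQL skip the conflicting rows with ON CONFLICT DO NOTHING, which is a different technique from the one described above and requires PostgreSQL 9.5 or later. It assumes the psycopg2 driver is installed on every executor. The helper names, the dsn string and the table and column names in the usage note are all hypothetical.

```python
def build_insert_sql(table, columns, conflict_column):
    """Build an INSERT statement that skips rows hitting the unique constraint."""
    column_list = ", ".join(columns)
    return (
        f"INSERT INTO {table} ({column_list}) VALUES %s "
        f"ON CONFLICT ({conflict_column}) DO NOTHING"
    )


def make_partition_writer(dsn, table, columns, conflict_column, page_size=1000):
    """Return a function that can be passed to DataFrame.foreachPartition."""
    sql = build_insert_sql(table, columns, conflict_column)

    def write_partition(rows):
        # Materialise the partition; each Spark Row supports lookup by column name.
        values = [tuple(row[c] for c in columns) for row in rows]
        if not values:
            return
        # Imported here so the import runs on the executor, not on the driver.
        import psycopg2
        from psycopg2.extras import execute_values

        conn = psycopg2.connect(dsn)
        try:
            # "with conn" commits on success and rolls back on error.
            with conn, conn.cursor() as cur:
                execute_values(cur, sql, values, page_size=page_size)
        finally:
            conn.close()

    return write_partition
```

It could be used as DF.foreachPartition(make_partition_writer(dsn, "events", ["event_id", "payload"], "event_id")). The table and column names are interpolated into the SQL text directly, so they must come from trusted configuration, not from user input.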

That is not possible if you have a unique constraint on the target. There is currently no upsert mode with these techniques, so you need to design around this limitation.

Related

pyspark - insert generated primary key in dataframe

I have a dataframe and, for each row, I want to insert the row into a Postgres database and get the generated primary key back in the dataframe. I can't find a good way to do this.
I tried with an RDD but it doesn't work (pg8000 get inserted id into dataframe).
I think it is possible with this process:
(1) loop over dataframe.collect() to run the SQL inserts
(2) run a SQL select to build a second dataframe
(3) join the first dataframe with the second
But I don't think this is efficient.
Do you have any ideas?
I'm using PySpark in an AWS Glue job. Thanks.

The only things you can optimize are the data insertion and the connectivity.
As you mentioned, there are two operations in total: inserting the data and collecting the data that was inserted. As far as I understand, neither Spark JDBC nor a Python connector such as psycopg2 will return the primary key of the rows you inserted, so you need to fetch it separately.
Back to your question:
You don't need a for loop to do the inserting, or .collect() to convert back to Python objects. You can use the Spark PostgreSQL JDBC driver to write the dataframe directly:
df\
.write.mode('append').format('jdbc')\
.option('driver', 'org.postgresql.Driver')\
.option('url', url)\
.option('dbtable', table_name)\
.option('user', user)\
.option('password', password)\
.save()
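The second operation, fetching the generated keys separately, could be sketched as below. This is an assumption-laden illustration, not part of the original answer: the function name, match_columns and pk_column are hypothetical, and it only works if match_columns identify each inserted row uniquely and df does not already contain a column named pk_column.

```python
def attach_generated_keys(spark, df, url, table_name, properties,
                          match_columns, pk_column):
    """Read the table back over JDBC and join the generated key onto df.

    match_columns: columns that uniquely identify a row (an assumption).
    pk_column: name of the database-generated primary key column.
    """
    stored = spark.read.jdbc(url=url, table=table_name, properties=properties)
    # Keep only the generated key plus the columns used for matching.
    keys = stored.select(pk_column, *match_columns)
    # A left join keeps every row of df, with a null key if no match is found.
    return df.join(keys, on=match_columns, how="left")
```

This would be called after the write has finished, so that the rows are already present in the table when it is read back.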

ADF mapping data flow only inserting, never updating

I have an ADF data flow that will only insert. It never updates rows.
Below is a screenshot of the flow, and the Alter Row task that sets the insert/Update policies.
[Screenshot: data flow]
[Screenshot: alter row task]
There is a source table for new data and a destination table.
A lookup is done against the key of the destination table.
Two columns are then generated, a hash of the source data & hash of the destination data.
In the alter row task, the policies are as follows:
Insert: if the lookup found no matching id.
Update: if lookup found a matching id and the checksums do not match (i.e. user exists but data is different between the source and existing record).
Otherwise it should do nothing.
The Sink allows insert and updates:
Even so, the first run inserts all records, but the second run inserts all the records again, even if they already exist.
I think I am misunderstanding the process, so I would appreciate any expertise or advice.

Thank you, Joel Cochran, for your valuable input. I reproduced the scenario and am posting this as an answer to help other community members.
If you are using the upsert method in the sink, add an Alter Row transformation with "Upsert if" and write the expression for the upsert condition.
If you are using insert and update as the update methods in the sink, then use both "Insert if" and "Update if" conditions in the Alter Row transformation, so that rows are inserted and updated in the sink according to those conditions.

How to upsert/Delete the DB2 source table data using Pyspark/SQL/DataFrames SPARK RDD's?

I'm trying to upsert/delete some of the values in a DB2 source table, which is an existing table in DB2. Is this possible using PySpark/Spark SQL/DataFrames?

There is no direct way to update or delete rows in a relational database from a PySpark job, but there are workarounds.
(1) Create an identical empty table (a secondary table) in the relational database, insert data into it from the PySpark job, and write a DML trigger that performs the desired DML operation on your primary table.
(2) Create a dataframe (e.g. a) in Spark that is a copy of your existing relational table, merge it with the current dataframe (e.g. b), and create a new dataframe (e.g. c) that holds the latest changes. Then truncate the relational database table and reload it from dataframe c.
These are just workarounds, not an optimal solution for large amounts of data.
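The merge step of workaround (2) could be sketched as follows, with a as existing_df, b as changes_df and the return value as c. The function name and key_columns are hypothetical, and the sketch assumes both DataFrames have the same columns and that a row in b should fully replace the row in a with the same key.

```python
def merge_latest(existing_df, changes_df, key_columns):
    """Build the merged DataFrame: rows of changes_df win over existing_df.

    existing_df: copy of the relational table (dataframe a).
    changes_df: current data holding inserts and updates (dataframe b).
    """
    changed_keys = changes_df.select(*key_columns)
    # Existing rows whose key does not appear in the changes stay as they are.
    untouched = existing_df.join(changed_keys, on=key_columns, how="left_anti")
    # unionByName matches columns by name rather than by position.
    return untouched.unionByName(changes_df)
```

Because Spark evaluates lazily, the merged DataFrame still depends on reading the original table. It would need to be materialised first (for example written to a staging table) before the original table is truncated and reloaded, otherwise the reload could read from an already emptied table.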

How to insert rows into cassandra if they don't exist using spark- cassandra driver?

I want to write to Cassandra from a data frame using the spark-cassandra connector, but skip any row that already exists (i.e. one with the same primary key; I know upserts normally happen, but I don't want the other columns to change). Is there a way to do that?
Thanks.!

You can use the ifNotExists WriteConf option which was introduced in this pr.
It works like so:
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf
val writeConf = WriteConf(ifNotExists = true)
rdd.saveToCassandra(keyspaceName, tableName, writeConf = writeConf)

You can do
sparkConf.set("spark.cassandra.output.ifNotExists", "true")
With this config, if the partition key and clustering columns match a row that already exists in Cassandra, the write is ignored; otherwise the write is performed.
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/insert_r.html#reference_ds_gp2_1jp_xj__if-not-exists
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters

Srinu, this all boils down to "read before write", whether or not you are using Spark.
But there is an IF NOT EXISTS clause:
If the column exists, it is updated. The row is created if none
exists. Use IF NOT EXISTS to perform the insertion only if the row
does not already exist. Using IF NOT EXISTS incurs a performance hit
associated with using Paxos internally. For information about Paxos,
see Cassandra 2.1 documentation or Cassandra 2.0 documentation.
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/insert_r.html

Apache Spark - Error persisting Dataframe to MemSQL database using JDBC driver

I'm currently facing an issue while trying to save an Apache Spark DataFrame loaded from an Apache Spark temp table to a distributed MemSQL database.
The trick is that I cannot use MemSQLContext connector for the moment. So I'm using JDBC driver.
Here is my code:
//store suppliers data from temp table into a dataframe
val suppliers = sqlContext.read.table("tmp_SUPPLIER")
//append data to the target table
suppliers.write.mode(SaveMode.Append).jdbc(url_memsql, "R_SUPPLIER", prop_memsql)
Here is the error message (occurring during the suppliers.write statement):
java.sql.SQLException: Distributed tables must either have a PRIMARY or SHARD key.
Note:
The R_SUPPLIER table has exactly the same fields and datatypes as the temp table and has a primary key set.
FYI, here are some clues:
R_SUPPLIER script:
CREATE TABLE R_SUPPLIER
(
    SUP_ID INT NOT NULL PRIMARY KEY,
    SUP_CAGE_CODE CHAR(5) NULL,
    SUP_INTERNAL_SAP_CODE CHAR(5) NULL,
    SUP_NAME VARCHAR(255) NULL,
    SHARD KEY(SUP_ID)
);
The suppliers.write statement worked once, but at that time the data was loaded into the DataFrame with a sqlContext.read.jdbc command rather than sqlContext.sql (the data was stored in a remote database and not in a local Apache Spark temp table).
Did anyone face the same issue, please?

Are you getting that error when you run the create table, or when you run the suppliers.write code? That is an error that you should only get when creating a table. Therefore if you are hitting it when running suppliers.write, your code is probably trying to create and write to a new table, not the one you created before.
