How to upsert/delete DB2 source table data using PySpark / Spark SQL / DataFrames / Spark RDDs? - pyspark

I'm trying to upsert/delete some of the values in a DB2 source table, which is an existing table on DB2. Is this possible using PySpark / Spark SQL / DataFrames?

There is no direct way to update/delete rows in a relational database from a PySpark job, but there are workarounds.
(1) You can create an identical empty table (a secondary table) in the relational database, insert the data into that secondary table with the PySpark job, and write a DML trigger that performs the desired DML operation on your primary table.
(2) You can create a DataFrame (e.g. a) in Spark that is a copy of your existing relational table, merge it with the DataFrame holding the current changes (e.g. b), and build a new DataFrame (e.g. c) that contains the latest state. Then truncate the relational table and reload it from c, as in the sketch below.
These are just workarounds, not an optimal solution for huge amounts of data.
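A minimal sketch of workaround (2), assuming a hypothetical DB2 JDBC URL, table name (MYSCHEMA.MYTABLE) and key column (ID), and assuming b has the same schema as the table; rows that should be deleted are handled simply by leaving them out of c:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2_upsert_sketch").getOrCreate()

jdbc_url = "jdbc:db2://db2host:50000/MYDB"   # hypothetical connection details
props = {"user": "dbuser", "password": "dbpass",
         "driver": "com.ibm.db2.jcc.DB2Driver"}

# a: copy of the existing DB2 table
a = spark.read.jdbc(url=jdbc_url, table="MYSCHEMA.MYTABLE", properties=props)

# b: DataFrame holding the incoming changes (new and updated rows)
b = spark.read.parquet("s3a://bucket/changes/")   # or any other source

# c: latest state = rows of a that are not being changed, plus the rows of b;
# rows to be deleted are not included in either part.
untouched = a.join(b.select("ID"), on="ID", how="left_anti")
c = untouched.unionByName(b)

# Spark reads lazily, so materialize c to a staging location before touching
# the source table; otherwise the truncate would run before c is computed.
c.write.mode("overwrite").parquet("s3a://bucket/staging/mytable/")
c_final = spark.read.parquet("s3a://bucket/staging/mytable/")

# Truncate the DB2 table (keeping its definition) and reload it from c.
(c_final.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "MYSCHEMA.MYTABLE")
    .option("user", "dbuser")
    .option("password", "dbpass")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("truncate", "true")
    .mode("overwrite")
    .save())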

Related

Insert into Memory Optimized Table from non optimized

I have two databases.
The primary database has DDL triggers, so I can't create memory-optimized tables there. So I created a secondary database and created a table there with memory-optimized on. Now, in a procedure on the primary database, I need to insert a copy of the data from another table into this memory-optimized table.
For example:
INSERT INTO InMemory.dbo.DestTable_InMem SELECT * FROM #T;
And I get:
A user transaction that accesses memory optimized tables or natively compiled modules cannot access more than one user database or databases model and msdb, and it cannot write to master.
Are there any workarounds for this?
I cannot move my procedure to the second database.
There is no other way than using a natively compiled procedure to INSERT, UPDATE or DELETE in an in-memory table.
See: A Guide to Query Processing for Memory-Optimized Tables
To move data from one DB to the other, the source table must exist locally.
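As a rough illustration of that last point (a sketch only, with hypothetical database, table and procedure names): first copy the rows into a regular on-disk staging table that lives in the InMemory database, then load the memory-optimized table from that local table in a separate step, e.g. via a procedure defined in the InMemory database, so the final load never spans two user databases.

import pyodbc

# Hypothetical connection string; adjust server, driver and authentication.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=InMemory;Trusted_Connection=yes;"
)
cur = conn.cursor()

# Step 1: stage the data into a regular (on-disk) table inside the InMemory
# database; this statement touches no memory-optimized table, so it may read
# from the other database.
cur.execute("INSERT INTO dbo.Staging_Dest SELECT * FROM PrimaryDb.dbo.SourceTable;")
conn.commit()

# Step 2: in a separate transaction that only touches the InMemory database,
# load the memory-optimized table from the local staging table (here via a
# hypothetical procedure defined in that database).
cur.execute("EXEC dbo.LoadDestTableFromStaging;")
conn.commit()
conn.close()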

SCD2 Implementation in Redshift using AWS Glue PySpark

I have a requirement to move data from S3 to Redshift. Currently I am using Glue for the work.
Current Requirement:
Compare the primary key of each record in the Redshift table with the incoming file; if a match is found, close the old record's end date (update it from the high date to the current date) and insert the new one.
If no primary key match is found, insert the new record.
Implementation:
I have implemented it in Glue using PySpark with the following steps:
I created DataFrames that cover three scenarios:
Where a match is found, update the existing record's end date to the current date.
Insert the new record into the Redshift table where a PPK match is found.
Insert the new record into the Redshift table where no PPK match is found.
Finally, union all three DataFrames into one and write the result to the Redshift table.
With this approach, both the old record (which still has the high date value) and the new record (which was updated with the current date value) are present.
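A condensed plain-PySpark sketch of those three scenarios (the column names id, start_date, end_date, the S3 paths and the Redshift connection options are all hypothetical); the append at the end is exactly why the old high-date row is still there:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

# Hypothetical inputs: current dimension rows and the incoming file.
existing = spark.read.parquet("s3://my-bucket/existing/")   # stand-in for the Redshift table
incoming = spark.read.parquet("s3://my-bucket/incoming/")

today = F.current_date()
high_date = F.to_date(F.lit("9999-12-31"))

existing_keys = existing.select("id")
incoming_keys = incoming.select("id")

# 1) PPK match found: close the old record by setting end_date to today.
closed = (existing.join(incoming_keys, "id", "left_semi")
          .withColumn("end_date", today))

# 2) PPK match found: insert the new version of the record, open-ended.
new_versions = (incoming.join(existing_keys, "id", "left_semi")
                .withColumn("start_date", today)
                .withColumn("end_date", high_date))

# 3) No PPK match: brand-new records.
brand_new = (incoming.join(existing_keys, "id", "left_anti")
             .withColumn("start_date", today)
             .withColumn("end_date", high_date))

# Union the three scenarios and append to Redshift; because this is an append,
# the original row with the high end_date is never removed.
result = closed.unionByName(new_versions).unionByName(brand_new)
(result.write.format("jdbc")
    .option("url", "jdbc:redshift://example-host:5439/dev")  # hypothetical connection
    .option("dbtable", "public.dim_table")
    .option("user", "user")
    .option("password", "password")
    .mode("append")
    .save())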
Is there a way to delete the old record with the high date value using PySpark? Please advise.
We have successfully implemented the desired functionality using AWS RDS [PostgreSQL] as the database service and Glue as the ETL service. My suggestion would be that instead of computing the delta in Spark DataFrames, it is a far easier and more elegant solution to create stored procedures and call them from the PySpark Glue job.
[for example: S3 bucket -> staging table -> target table]
In addition, if your execution logic runs in less than 10 minutes, I would suggest using a Python shell job with external libraries such as psycopg2 / SQLAlchemy for the DB operations, as sketched below.
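A minimal sketch of that pattern from a Glue Python shell job, assuming a hypothetical PostgreSQL procedure merge_staging_into_target() that moves the rows from the staging table into the target table:

import psycopg2

# Hypothetical connection details; in Glue these would normally come from a
# Glue connection or AWS Secrets Manager rather than being hard-coded.
conn = psycopg2.connect(
    host="mydb.example.com",
    port=5432,
    dbname="reporting",
    user="etl_user",
    password="etl_password",
)

try:
    with conn, conn.cursor() as cur:
        # Assumes a previous step has already loaded the S3 data into the
        # staging table; the database-side procedure then does the merge.
        cur.execute("CALL merge_staging_into_target();")  # hypothetical procedure
finally:
    conn.close()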

postgres - move one part of partitioned table to new db as independent table

I have a PostgreSQL DB with a partitioned table. Can somebody help me understand how to convert one part of this partitioned table into a regular table in a new DB?
When I search on Google I only find information about partitioning :/ How should I phrase the search for the operation I want to perform?

Clearing records in HBase table

We are creating a Disaster Recovery System for HBase tables. Because of the restrictions we are not able to use the fancy methods to maintain the replica of the table. We are using Export/Import statements to get the data into HDFS and using that to create tables in the DR Servers.
While importing the data into the HBase table, we use the truncate command to clear the table and then load the data fresh. But the truncate statement takes a long time to delete the rows. Are there any other, more effective statements to clear the entire table?
(truncate takes 33 min for ~2500000 records)
Disable -> drop -> create the table again, maybe? I don't know whether drop also takes too long.
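If you want to script that disable -> drop -> create route, a rough sketch with the happybase client (the Thrift host, table name and column family are hypothetical, and a Thrift server must be running for the cluster):

import happybase

connection = happybase.Connection("hbase-thrift-host")   # hypothetical Thrift host

table_name = "dr_table"        # hypothetical table name
families = {"cf": dict()}      # hypothetical column family definition

# Drop the table entirely (disable first), then recreate it empty before
# running the Import again.
if table_name.encode() in connection.tables():
    connection.disable_table(table_name)
    connection.delete_table(table_name)

connection.create_table(table_name, families)
connection.close()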

Talend, Postgres and sequences

I have a JPA application and associated Postgres schema that was designed with a single sequence table (not a sequence). I am trying to populate several tables using Talend tPostgresqlOutput. These tables have keys that are sequenced by the JPA application. I am at a loss to work out how to read a sequence number from the table, update it and then use the sequence number to key a record on an insert with Talend. I can work it through with a sequence, but this is a table.