Need to Understand the CDC flow of Debezium Postgres Source Connector - postgresql

Folks,
I am trying to understand the CDC process of Debezium Postgres source connector as i suspect data loss. Here is my scenario, I have a connector name "deb-connector" that pulls data from about 10 tables via the replication slot "deb-1" and plugin "wal2json", everything works fine and all good. Now i wanted to add 2 more tables to the sync up, so i created a new connector named "deb-connector-updt" with additional 2 tables hop on to the same rep slot "deb-1". It took some time initially (as it normally does) for the snapshot to finish and running. Here is my question,
As far as the snapshot process is concerned, my understanding is that its gonna take a new snapshot for the second connector (pls correct me if i am wrong).?
If it does, the initial process would be READ events for all records in the db and all records (2.5K recs) should have flown to kafka no matter what, right? But i don't see them, instead i see only the cdc records (23 recs) flowing.
This can only happen if it doesn't take new snapshot and basically start from where it left off (that makes sense only if the snapshot is tied to the slot). (or) Should i create a new slot whenever i want to add/remove tables list from the config (that doesn't make sense at all)?
I would highly appreciate if the experts can answer this! Thank you!

Related

Audit data changes with Debezium

I have a use case where I want to audit the DB table data changes into another table for compliance purposes. Primarily, any changes to the data like Inserts/Updates/Deletes should be audited. I found different options like JaVers, Hibernate Envers, Database triggers, and Debezium.
I am avoiding using JaVers, and Hibernate Envers as this will not capture any data change that happens through direct SQL queries and any data change that happens through other applications. The other issue I see is we need to add the audit-related code to the main application code in the same transaction boundary.
I am also avoiding the usage of database triggers as we are not using triggers at all for any of the deployments.
Then I left with Debezium which is promising. But, the only concern that I have is that we need to use Kafka to leverage Debezium. Is Kafka's usage is necessary to use Debezium if both the primary table and the audit table sit in the same DB instance?
Debezium is perfect for auditing, but given it is a source Connector, it represents just one part of the data pipeline in your use case. You will capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or local disk and then you need a Sink Connector (i.e. Camel Kafka SQL or JDBC, kafka-connect-jdbc) to insert into the target table.
For the same transaction boundary requirement you can use the Outbox Pattern if the eventual consistency is fine. There is also an Outbox Event Router SMT component that is part of the project.
Note that Debezium can also run embedded in a standalone Java application, storing the offset on local disk, but you lose the HA capability given by KafkaConnect running in distributed mode. With the embedded mode, you are also swtiching from a configuration-driven approach to a code-driven one.
I found Debezium to be a very comprehensive solution, and it is open source backed by Redhat. That gives it not only the credibility, but the security that it is going to be supported.
It provides a rich configuration to whitelist, blacklist databases/tables/columns (with wild card patterns), along with controls to limit data in really large columns.
Since it is driven from BinLogs, you not only get the current state, you also get the previous state. This is ideal for audit trails, and you can customize building a proper Sync to elastic topics (one for table).
Use of Kafka is necessary to account for HA and latency when bulk updates are made on DB, even though Primary and Audit tables are in same DB.

Best way to unload delta data from DB2 transactional tables?

Sorry if this has been asked before. I am hoping to win some time this way :)
What would be the best way to unload delta data from a DB2 source database that has been optimized for OLTP? E.g. by analyzing the redo files, as with Oracle Logminer?
Background: we want near-realtime ETL, and a full table unload every 5 minutes is not feasible.
this is more about the actual technology behind accessing DB2 than about determining the deltas to load into the (Teradata) target.
Ie, we want to unload all records since last unload timestamp.
many many thanks!
Check out IBM InfoSphere Data Replication.
Briefly:
There are 3 replication solutions: CDC, SQL & Q replication.
All 3 solutions read Db2 transaction logs using the same db2ReadLog API, which anyone may use for custom implementation. All other things like staging & transformation of the data changes got from logs, transportation and target application of data are different for each method.

Bidirectional Replication Design: best way to script and execute unmatched row on Source DB to multiple subscriber DBs, sequentially or concurrently?

Thank you for help or suggestion offered.
I am trying to build my own multi-master replication on Postgresql 10 in Windows, for a situation which cannot use any of the current 3rd party tools for PG multimaster replication, which can also involve another DB platform in a subscriber group (Sybase ADS). I have the following logic to create bidirectional replication, partially inspired by Bucardo's logic, between 1 publisher and 2 subscribers:
When INSERT, UPDATE, or DELETE is made on Source table, Source table Trigger adds row to created meta table on Source DB that will act as a replication transaction to be performed on the 2 subscriber DBs which subcribe to it.
A NOTIFY signal will be sent to a service, or script written in Python or some scripting language will monitor for changes in the metatable or trigger execution and be able to do a table compare or script the statement to run on each subscriber database.
***I believe that triggers on the subscribers will need to be paused to keep them from pushing their received statements to their subscribers, i.e. if node A and node B both subscribe to each other's table A, then an update to node A's table A should replicate to node B's table A without then replicating back to table A in a bidirectional "ping-pong storm".
There will be a final compare between tables and the transaction will be closed. Re-enable triggers on subscribers if they were paused/disabled when pushing transactions from step 2 addendum.
This will hopefully be able to be done bidirectionally, in order of timestamp, in FIFO order, unless I can figure out a to create child processes to run the synchronizations concurrently.
For this, I am trying to figure out the best way to setup the service logic---essentially Step 2 above, which has apparently been done using a daemon in Linux, but I have to work in Windows, making it run as, or resembling, a service/agent---or come up with a reasonably easy and efficient design to send the source DBs statements to the subscribers DBs.
Does anyone see that this plan is faulty or may not work?
Disclaimer: I don't know anything about Postgresql but have done plenty of custom replication.
The main problem with bidirectional replication is merge issues.
If the same key is used in both systems with different attributes, which one gets to push their change? If you nominate a master it's easier. Then the slave just gets overwritten every time.
How much latency can you handle? It's much easier to take the 'notify' part out and just have a five minute windows task scheduler job that inspects log tables and pushes data around.
In other words, this kind of pattern:
Change occurs in a table. A database trigger on that table notes the change and writes the PK of the table to a change log table. A ReplicationBatch column in the log table is set to NULL by default
A windows scheduled task inspects all change log tables to find all changes that happened since the last run and 'reserves' these records by setting their replication state to a replication batch number
i.e. you run a UPDATE LogTable Set ReplicationBatch=BatchNumber WHERE ReplicationState IS NULL
All records that have been marked are replicated
you run a SELECT * FROM LogTable WHERE ReplicationState=RepID to get the records to be processed
When complete, the reserved records are marked as complete so the next time around only subsequent changes are replicated. This completion flag might be in the log table or it might be in a ReplicaionBatch number table
The main point is that you need to reserve records for replication, so that as you are replicating them out, additional log records can be added in from the source without messing up the batch
Then periodically you clear out the log tables.

IBM DB2 and IBM IMS Change Data Capture Capabilities

I'd like to understand if the CDC enabled IBM IMS segments and IBM DB2 table sources would be able to provide both the before and after snapshot change values (like the Oracle .OLD and .NEW values in trigger) so that both could be used for further processing.
Note:
We are supposed to retrieve these values through Informatica PowerExchange and process and push to targets.
As of now, we need to know would we be able to retrieve both before snapshot and after snapshot values from IBM DB2 and IBM IMS (.OLD and .NEW as in Oracle triggers - not an exact similar example, but mentioned just as an example to understand)
Any help is much appreciated, Thanks.
I don't believe CDC captures before data in its change messages that it compiles from the DBMS log data. It's main purpose is to issue the minimum number of commands needed to replicate the data from one database to another. You'll want to take a snapshot of your replica database prior to processing the change messages if you want to preserve the state of data such that you can query it.
Alternatively for Db2, it's probably easier to work with the temporal tables feature added in Db2 10 as that allows you to define what changes should drive a snapshot. You can then access the temporal data using a temporal SQL query.
SELECT … FROM…period specification
Example trigger with old and new referencing...
CREATE TRIGGER danny117
NO CASCADE BEFORE Update ON mylib.myfile
REFERENCING NEW AS N old as O
FOR EACH ROW
-- don't let the claim change and force upper case
--just do something automatically on update blah...
BEGIN ATOMIC
SET N.claim = ucase(O.claim);
END
w.r.t PowerExchange 9.1.0 & 9.6:
Before snapshot data can't be processed via the powerexchange for DB2 database. Recently I worked on a migration project and I thought like the Oracle CDC which uses SCN numbers there should be something for db2 to start the logger from any desired point. But to my surprise Inforamtica global support confirmed that before snapshot data can't be captured by PowerExchange.
They talk about materialize and de-materialize targets which was out of my knowledge at that time, later I found out they meant to export and import of history data.
Even if you have table with CDC enanbled, you can't capture the data before snapshot from PWX.
DB2 reads capture data from the DB2-logs which has a marking for the operation like U/I/D that's enough for PowerExchange to progress.

Using Kafka for Data Integration with Updates & Deletes

So a little background - we have a large number of data sources ranging from RDBMS's to S3 files. We would like to synchronize and integrate this data with other various data warehouses, databases, etc.
At first, this seemed like the canonical model for Kafka. We would like to stream the data changes through Kafka to the data output sources. In our test case we are capturing the changes with Oracle Golden Gate and successfully pushing the changes to a Kafka queue. However, pushing these changes through to the data output source has proven challenging.
I realize that this would work very well if we were just adding new data to the Kafka topics and queues. We could cache the changes and write the changes to the various data output sources. However this is not the case. We will be updating, deleting, modifying partitions, etc. The logic for handling this seems to be much more complicated.
We tried using staging tables and joins to update/delete the data but I feel that would become quite unwieldy quickly.
This comes to my question - are there any different approaches we could go about handling these operations? Or should we totally move in a different direction?
Any suggestions/help is much appreciated. Thank you!
There are 3 approaches you can take:
Full dump load
Incremental dump load
Binlog replication
Full dump load
Periodically, dump your RDBMS data source table into a file, and load that into the datawarehouse, replacing the previous version. This approach is mostly useful for small tables, but is very simple to implement, and supports updates and deletes to the data easily.
Incremental dump load
Periodically, get the records that changed since your last query, and send them to be loaded to the data warehouse. Something along the lines of
SELECT *
FROM my_table
WHERE last_update > #{last_import}
This approach is slightly more complex to implement, because you have to maintain the state ("last_import" in the snippet above), and it does not support deletes. It can be extended to support deletes, but that makes it more complicated. Another disadvantage of this approach that it requires your tables to have a last_update column.
Binlog replication
Write a program that continuously listens to the binlog of your RDBMS and sends these updates to be loaded to an intermediate table in the data warehouse, containing the updated values of the row, and whether it is a delete operation or update/create. Then write a query that periodically consolidates these updates to create a table that mirrors the original table. The idea behind this consolidation process is to select, for each id, the last (most advanced) version as seen in all the updates, or in the previous version of the consolidated table.
This approach is slightly more complex to implement, but allows achieving high performance even on large tables and supports updates and deletes.
Kafka is relevant to this approach in that it can be used as a pipeline for the row updates between the binlog listener and the loading to the data warehouse intermediate table.
You can read more about these different replication approaches in this blog post.
Disclosure: I work in Alooma (a co-worker wrote the blog post linked above, and we provide data-pipelines as a service, solving problems like this).