Debezium for Postgres and WAL flush - apache-kafka

I have Debezium running in a container, capturing all changes to PostgreSQL database records. But I am unable to understand a couple of things about how Debezium works. When Debezium starts for the first time, it takes a snapshot of the database and then starts streaming based on the WAL, according to the Debezium documentation:
PostgreSQL normally purges write-ahead log (WAL) segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the PostgreSQL connector first connects to a particular PostgreSQL database, it starts by performing a consistent snapshot of each of the database schemas. After the connector completes the snapshot, it continues streaming changes from the exact point at which the snapshot was made. This way, the connector starts with a consistent view of all of the data, and does not omit any changes that were made while the snapshot was being taken.
The connector is tolerant of failures. As the connector reads changes and produces events, it records the WAL position for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), upon restart the connector continues reading the WAL where it last left off. This includes snapshots. If the connector stops during a snapshot, the connector begins a new snapshot when it restarts.
But there is a gap here, or maybe not.
Once the snapshot of the database is complete and the connector is streaming from the WAL, if the connector goes down and the WAL is purged/flushed before it comes back up, how does Debezium ensure data integrity?

Since Debezium uses a replication slot on PG, Debezium constantly reports back to PG the LSN up to which it has consumed, so that PG can flush the WAL files up to that point. This information is stored in pg_replication_slots. When Debezium starts up again, it reads the restart_lsn for its slot and requests the changes that happened after that value. This is how Debezium ensures data integrity.
Note that if, for some reason, that LSN is no longer available in the WAL files, there is no way to get it back, meaning data loss has happened.
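If you want to see this for yourself, you can inspect the slot directly. Below is a minimal sketch, assuming the psycopg2 driver and a slot named debezium; the connection details and slot name are placeholders for your environment:

```python
# Rough sketch: inspect the Debezium replication slot's position in PostgreSQL.
# Connection details and the slot name "debezium" are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="mydb", user="postgres", password="secret")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT slot_name, restart_lsn, confirmed_flush_lsn, active
        FROM pg_replication_slots
        WHERE slot_name = %s
        """,
        ("debezium",),
    )
    print(cur.fetchone())  # e.g. ('debezium', '0/16B3748', '0/16B3780', True)

# Comparing pg_current_wal_lsn() with confirmed_flush_lsn shows how far the slot
# lags; while the slot exists, PostgreSQL retains the WAL needed from restart_lsn on.
```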

Related

Can I initiate an ad-hoc Debezium snapshot without a signaling table?

I am running a Debezium connector for PostgreSQL. The snapshot.mode I use is initial, since I don't want to re-snapshot just because the connector has been restarted. However, during development I want to restart the snapshot process, as the messages expire from Kafka before they have been read.
If I delete and recreate the connector via the Kafka Connect REST API, this doesn't do anything, as the information in the offset/status/config topics is preserved. I have to delete and recreate those topics and restart the whole Connect cluster to trigger another snapshot.
Am I missing a more convenient way of doing this?
You will also need a new name for the connector, as well as a new database.server.name value in the connector config, since that is the key under which the offset information is stored. It should be almost like deploying a connector for the first time again.
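For illustration, a minimal sketch of registering the connector again through the Kafka Connect REST API with a new name and a new database.server.name; the URL, hostnames, credentials and names below are placeholders:

```python
# Rough sketch: register the connector under a new name and a new
# database.server.name so Kafka Connect has no stored offsets for it.
# All names, hostnames and credentials are placeholders.
import requests

new_connector = {
    "name": "inventory-connector-v2",  # new connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "secret",
        "database.dbname": "inventory",
        "database.server.name": "inventory-v2",  # new server name => new offset key and topic prefix
        "snapshot.mode": "initial",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=new_connector)
resp.raise_for_status()
print(resp.json())
```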

Purpose of Debezium's Postgres connector offsets?

I need some explanation of how Postgres' pg_logical plugin and Debezium's Postgres connector work together in the context of offsets on the Debezium side.
As far as I understand, when reading records from the WAL, pg_logical will remove them from the replication slot. So as soon as Debezium reads a record, it is gone from the slot forever. Still, there is a way to "remember" the last successfully read record on the Debezium side. This is implemented via an offset storage.
I mean, it seems to make sense for the case where Postgres fails and, after restart, the slot contains already-consumed records that the Debezium connector needs to skip, but besides that?
I thought that recovering from a failure of Debezium, including losing already-read WAL, would involve using the externally backed-up offsets somehow. But how? The replication slot will already have discarded the records that were consumed before the connector failure...
Thanks a lot for enlightening me!
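(For context on where those offsets live: Kafka Connect keeps them in its internal offsets topic. A rough sketch of peeking at that topic, assuming the kafka-python client and the default topic name connect-offsets; the broker address is a placeholder:)

```python
# Rough sketch: peek at the offsets Kafka Connect stores for source connectors.
# Assumes the default internal topic name "connect-offsets" (set by
# offset.storage.topic in the worker config); broker address is a placeholder.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "connect-offsets",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of silence
)

for msg in consumer:
    key = msg.key.decode() if msg.key else None
    value = msg.value.decode() if msg.value else None
    # Keys look like ["connector-name", {...source partition...}];
    # values contain the last committed position (e.g. LSN) for that connector.
    print(key, "=>", value)
```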

Debezium SQL Server Connector Kafka Initial Snapshot

According to the Debezium SQL Server Connector documentation, the initial snapshot only fires on the connector's first run.
However, if I delete the connector and create a new one with the same name, the initial snapshot does not run either.
Is this by design or a known issue?
Any help appreciated
Kafka Connect stores details about connectors, such as their snapshot status and ingest progress, even after they've been deleted. If you recreate a connector with the same name, it will assume it's the same connector and will try to continue from where the previous one got to.
If you want a connector to start from scratch (i.e. run snapshot etc) then you need to give the connector a new name. (Technically, you could also go into Kafka Connect and muck about with the internal data to remove the data for the connector of the same name, but that's probably a bad idea)
Give your connector a new database.server.name value or create a new topic. The reason the snapshot doesn't fire again is that Kafka Connect already has stored offsets for that connector/server name indicating the snapshot has completed, so the connector simply resumes from them.
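If you really do want to go the "muck about with the internal data" route instead of renaming the connector, the usual trick is to publish a tombstone for the connector's key into the Connect offsets topic. This is only a sketch under several assumptions (kafka-python client, default topic name connect-offsets, and a guessed key layout); inspect the actual keys in your offsets topic first, since the produced key must match the stored one byte for byte:

```python
# Rough sketch of resetting a source connector's stored offsets by producing
# a tombstone (null value) for its key to the Connect offsets topic.
# Topic name, connector name and key layout are assumptions -- verify them
# against the real keys in your offsets topic before doing this.
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Connect source-connector offset keys look like
# ["<connector name>", {<source partition>}]; for the Debezium SQL Server
# connector the partition is typically {"server": "<database.server.name>"}.
# Compact separators so the bytes match Connect's JSON converter output.
offset_key = json.dumps(["sqlserver-connector", {"server": "sqlserver1"}],
                        separators=(",", ":"))

producer.send("connect-offsets", key=offset_key.encode(), value=None)
producer.flush()
# Then delete and recreate the connector (same name) so it starts with no offsets.
```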

Apache Flink - duplicate message processing during job deployments, with ActiveMQ as source

Given,
I have a Flink job that reads from an ActiveMQ source and writes to a MySQL database, keyed on an identifier. I have enabled checkpoints for this job every second and point the checkpoints to a Minio instance; I have verified that the checkpoints are written under the job id. I deploy this job on OpenShift (Kubernetes underneath) and can scale it up/down as and when required.
Problem
When the job is deployed (rolling) or goes down due to a bug/error, and there were unconsumed messages in ActiveMQ or messages unacknowledged by Flink (but already written to the database), then when the job recovers (or a new job is deployed) it processes already-processed messages, resulting in duplicate records inserted into the database.
Question
Shouldn't the checkpoints help the job recover from where it left off?
Should I take a checkpoint before I (rolling) deploy a new job?
What happens if the job quits with an error or a cluster failure?
As the job id keeps changing on every deployment, how does the recovery happen?
Edit: As I cannot expect idempotency from the database, can I write a database-specific (upsert) query that updates if the given record is present and inserts if not, to avoid duplicates being saved into the database (exactly-once)?
The JDBC sink currently only supports at-least-once, meaning you get duplicate messages upon recovery. There is currently a draft to add support for exactly-once, which would probably be released with Flink 1.11.
Shouldn't the checkpoints help the job recover from where it left off?
Yes, but the time between the last successful checkpoint and recovery could produce the observed duplicates. I gave a more detailed answer on a somewhat related topic.
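For reference, a minimal sketch of what enabling checkpointing looks like in a (Py)Flink job; the interval and mode are placeholders, and the checkpoint storage (your Minio/S3 bucket) is still configured via the state backend / flink-conf.yaml:

```python
# Rough sketch: enable checkpointing so source offsets become part of
# Flink's fault-tolerance state. Interval and mode are placeholders.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(1000)  # checkpoint every second, as in the question
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

# On recovery, Flink restores from the latest completed checkpoint and replays
# everything after it -- hence duplicates when the sink is only at-least-once.
```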
Should I take a checkpoint before I (rolling) deploy a new job?
Absolutely. You should actually use cancel with savepoint. That is the only reliable way to change the topology. Additionally, cancel with savepoint avoids any duplicates in the data, as it gracefully shuts down the job.
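A sketch of triggering cancel with savepoint through the Flink REST API (equivalent to flink cancel -s <dir> <jobid> on the CLI); the JobManager URL, job id and target directory are placeholders:

```python
# Rough sketch: trigger "cancel with savepoint" via the Flink REST API.
# JobManager URL, job id and savepoint directory are placeholders.
import requests

JOBMANAGER = "http://flink-jobmanager:8081"
JOB_ID = "a1b2c3d4e5f6..."  # placeholder job id

resp = requests.post(
    f"{JOBMANAGER}/jobs/{JOB_ID}/savepoints",
    json={"target-directory": "s3://checkpoints/savepoints", "cancel-job": True},
)
resp.raise_for_status()
trigger_id = resp.json()["request-id"]

# Poll this endpoint until the savepoint completes, then redeploy the new job
# version with `flink run -s <savepoint-path> ...` so it resumes exactly there.
status = requests.get(f"{JOBMANAGER}/jobs/{JOB_ID}/savepoints/{trigger_id}").json()
print(status)
```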
What happens if the job quits with an error or a cluster failure?
It should automatically restart (depending on your restart settings). It would use the latest checkpoint for recovery. That would most certainly result in duplicates.
As the job id keeps changing on every deployment, how does the recovery happen?
You usually point explicitly to the same checkpoint directory (on S3?).
As I cannot expect idempotency from the database, is upsert the only way to achieve exactly-once processing?
Currently, I do not see a way around it. It should change with 1.11.
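If you go the upsert route in the meantime, a sketch of what the sink statement could look like for MySQL, assuming the table has a unique key on the identifier; table, column and connection details are made up:

```python
# Rough sketch of a MySQL upsert for the sink, so replayed records overwrite
# rather than duplicate. Assumes a unique key on event_id; names and
# connection details are made up.
import pymysql

conn = pymysql.connect(host="mysql", user="app", password="secret", database="appdb")
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO events (event_id, payload, updated_at)
            VALUES (%s, %s, NOW())
            ON DUPLICATE KEY UPDATE payload = VALUES(payload), updated_at = NOW()
            """,
            ("evt-123", '{"amount": 42}'),
        )
    conn.commit()
finally:
    conn.close()
```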

Do I need to archive postgres WAL records if I am already streaming them to a standby server?

I have a Postgres master node which is streaming WAL records to a standby slave node. The slave database runs in read-only mode and has a copy of all data on the master node. It can be switched to master by creating a recovery.conf file in /tmp.
On the master node I am also archiving WAL records. I am just wondering if this is necessary if they are already streamed to the slave node? The archived WAL records are 27GB at this point. The disk will fill eventually.
A standby server is no backup; it only protects you from hardware failure on the primary.
Just imagine that somebody deletes data or drops a table by mistake; you won't be able to recover from that without a backup.
Create a job that regularly cleans up archived WALs if they exceed a certain age.
Once you have a full backup, you can purge the WAL files that precede it.
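A minimal sketch of such a cleanup job (PostgreSQL also ships pg_archivecleanup for this); the archive path and retention are placeholders, and nothing newer than your latest full base backup should ever be removed:

```python
# Rough sketch of a WAL-archive cleanup job: delete archived segments older
# than a retention window. Path and retention are placeholders -- never delete
# WAL newer than your most recent full base backup.
import os
import time

ARCHIVE_DIR = "/var/lib/postgresql/wal_archive"  # placeholder path
RETENTION_DAYS = 7                                # placeholder retention

cutoff = time.time() - RETENTION_DAYS * 86400
for name in os.listdir(ARCHIVE_DIR):
    path = os.path.join(ARCHIVE_DIR, name)
    if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
        os.remove(path)
        print("removed", path)
```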
The idea is to preserve the WAL files for PITR (point-in-time recovery) in case your server crashes.
If your primary server crashes, you can certainly use your hot standby and make it the primary, but at that point you have to build another server (as a hot standby). Typically you don't want to build it using streaming replication.
You will use the full backup plus WAL backups to build that server and then proceed, instead of relying on streaming replication.
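For completeness, a sketch of taking such a full base backup with pg_basebackup; host, user and target directory are placeholders:

```python
# Rough sketch: take a full base backup with pg_basebackup via subprocess.
# Host, user and target directory are placeholders.
import subprocess

subprocess.run(
    [
        "pg_basebackup",
        "-h", "primary-host",      # primary to back up
        "-U", "replication_user",  # role with REPLICATION privilege
        "-D", "/backups/base",     # empty target directory
        "-X", "stream",            # include the WAL needed to make the backup consistent
        "-P",                      # show progress
    ],
    check=True,
)
```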