I need some explanation of how Postgres' pg_logical plugin and Debezium's Postgres connector work together in the context of offsets on the Debezium side.
As far as I understand, when records are read from the WAL, pg_logical removes them from the replication slot. So as soon as Debezium reads a record, it is gone from the slot forever. Still, there is a way to "remember" the last successfully read record on the Debezium side: this is implemented via offset storage.
I mean, it seems to make sense for the case where Postgres failed and after a restart the slot contains already-consumed records that the Debezium connector needs to skip, but besides that?
I thought that recovering from a failure of Debezium, including losing already-read WAL records, would somehow involve the externally backed-up offsets. But how? The replication slot will already have discarded the records that were consumed before the connector failure...
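For context, this is roughly how I peek at what Connect has stored for the connector (the topic name connect-offsets is only the distributed-mode default, and the broker address is from my local setup):

    # dump the Connect offsets topic; the keys identify the connector and the values
    # contain the last committed position (for the Postgres connector, an LSN)
    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic connect-offsets --from-beginning \
      --property print.key=true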
Thanks a lot for enlightening me!
My question is split in two. I've read "Kafka Connect - Delete Connector with configs?". I'd like to completely remove a connector, offsets and all, so I can recreate it with the same name later. Is this possible? To my understanding, a tombstone message will kill this connector indefinitely.
The second part is: is there a way to have the kafka-connect container automatically delete all the connectors it created when it is brought down?
Thanks
There is no such command to completely clean up connector state. For sink connectors, you can use kafka-consumer-groups to reset the connector's offsets. For source connectors, it's not as straightforward: you'll need to manually produce data into the Connect-managed offsets topic.
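A rough sketch of both cases (connector, topic, and key details below are made up; look at what is actually in your offsets topic before writing to it):

    # Sink connector: Connect creates a consumer group named connect-<connector name>.
    # The connector must be stopped/deleted first, otherwise the reset is rejected.
    kafka-consumer-groups --bootstrap-server localhost:9092 \
      --group connect-my-sink --topic my-topic \
      --reset-offsets --to-earliest --execute

    # Source connector: produce a tombstone (null value) for that connector's key into
    # the offsets topic, e.g. with kcat (-Z turns the empty value into a null)
    echo '["my-source",{"server":"dbserver1"}]#' | \
      kcat -P -b localhost:9092 -t connect-offsets -K '#' -Z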
The config and status topics also persist historical data, but shouldn't prevent you from recreating the connector with the same name/details.
The Connect containers published by Confluent and Debezium always use distributed mode. You'll need to override the entrypoint of the container to use standalone mode so that connector metadata is not persisted in Kafka topics (this won't be fault tolerant, but it'll be fine for testing).
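For example, something along these lines with the Confluent image (assuming connect-standalone is on the image's PATH; the property file names are illustrative):

    docker run --rm \
      -v "$PWD/worker.properties:/tmp/worker.properties" \
      -v "$PWD/my-connector.properties:/tmp/my-connector.properties" \
      --entrypoint connect-standalone \
      confluentinc/cp-kafka-connect \
      /tmp/worker.properties /tmp/my-connector.properties

In standalone mode the offsets go to a local file (offset.storage.file.filename in the worker properties), so nothing is left behind in Kafka topics once the container is removed.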
I have Debezium in a container, capturing all changes to PostgreSQL database records. But I am unable to understand a couple of things about how Debezium works. When Debezium starts for the first time, it takes a snapshot of the database and then starts streaming from the WAL, according to the Debezium documentation:
PostgreSQL normally purges write-ahead log (WAL) segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the PostgreSQL connector first connects to a particular PostgreSQL database, it starts by performing a consistent snapshot of each of the database schemas. After the connector completes the snapshot, it continues streaming changes from the exact point at which the snapshot was made. This way, the connector starts with a consistent view of all of the data, and does not omit any changes that were made while the snapshot was being taken.
The connector is tolerant of failures. As the connector reads changes and produces events, it records the WAL position for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), upon restart the connector continues reading the WAL where it last left off. This includes snapshots. If the connector stops during a snapshot, the connector begins a new snapshot when it restarts.
But there is a gap here, or maybe not.
Once the snapshot of the database is complete and the connector is streaming from the WAL, if the connector goes down and the WAL is purged/flushed before it comes back up, how does Debezium ensure data integrity?
Since Debezium uses a replication slot on PG, it constantly reports back to PG the LSN up to which it has consumed, so that PG can recycle the WAL files. This information is stored in the pg_replication_slots view. When Debezium starts up again, it reads the restart_lsn from that view and requests the changes that happened after that value. This is how Debezium ensures data integrity.
Note that if, for some reason, that LSN is no longer available in the WAL files, there is no way to get it back, meaning data loss has happened.
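You can look at those positions yourself; a quick check, assuming the default slot name debezium and an illustrative database name:

    # restart_lsn = oldest WAL the slot still needs; confirmed_flush_lsn = position the
    # client has confirmed it consumed up to
    psql -d mydb -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'debezium'"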
I am running a Debezium connector to PostgreSQL. The snapshot.mode I use is initial, since I don't want to re-snapshot just because the connector has been restarted. However, during development I want to trigger the snapshot again, as the messages expire from Kafka before they have been read.
If I delete and recreate the connector via the Kafka Connect REST API, this doesn't do anything, as the information in the offset/status/config topics is preserved. I have to delete and recreate those topics and restart the whole Connect cluster to trigger another snapshot.
Am I missing a more convenient way of doing this?
You will also need a new name for the connector, as well as a new database.server.name value in the connector config, since the stored offsets are keyed by them. It should almost be like deploying a connector for the first time again.
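Roughly like this; all names and connection details below are examples:

    curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
      "name": "inventory-connector-v2",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "database.server.name": "dbserver1v2",
        "snapshot.mode": "initial"
      }
    }'

Because the stored offsets are keyed by the connector name and the source partition (which includes database.server.name), the new connector finds no prior offset and performs a fresh snapshot.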
I have used the Kafka source connector to get documents from Couchbase into Kafka. These documents are then replicated to MongoDB.
Couchbase --> Source Connector --> Kafka --> Sink Connector --> Mongo
If the source connector is down, how can I sync all the documents to Kafka again?
Is there any get-and-touch functionality that can again event out to the Kafka topic all the changes made during the down period?
If you're asking about processing the document changes that occurred while the source connector was down, then you don't need to do anything. Kafka Connect stores the state (offsets) of the source connector and will restore the task state and continue from where it left off. The Couchbase source connector supports this, as we can see in the code here, which is then used here to initialize the DCP stream with the saved offsets.
If you're asking how to reset the connector and re-stream the entire bucket from the beginning, that's actually not as easy. As far as I know, there is no built-in way in Kafka to reset a connector's offsets; there is a KIP under review related to that: KIP-199. Barring official support, the best ways I know of to reset the connector state are either to change the config to use a different topic for saving the offsets (which is hacky and leaves the old offsets behind as a potential problem), or to actually edit the saved offsets as described here. I would never advocate doing either of those on a production system, so use your own judgement.
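For the first workaround, the change is a single worker setting; the topic name below is illustrative:

    # in connect-distributed.properties (or your worker config), point Connect at a fresh
    # offsets topic and restart the worker; the old topic keeps the stale offsets around
    offset.storage.topic=connect-offsets-reset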
I am trying to stream changes in my Postgres database using the Kafka Connect JDBC Connector. I am running into issues on startup, as the database is quite big and the query dies every time because rows change while it is running.
What is the best practice for starting off the JDBC Connector on really huge tables?
Assuming you can't pause the workload on the database you're streaming from so that the initialisation can complete, I would look at Debezium.
In fact, depending on your use case, I would look at Debezium regardless :) It lets you do true CDC against Postgres (and MySQL and MongoDB), and it is a Kafka Connect plugin just like the JDBC Connector, so you retain all the benefits of that.
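As a quick sanity check that it really is just another plugin on the same worker (port 8083 is the default Connect REST port):

    # lists every connector class found on the worker's plugin.path, e.g. both
    # io.debezium.connector.postgresql.PostgresConnector and the JDBC source connector
    curl -s http://localhost:8083/connector-plugins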