Can I initiate an ad-hoc Debezium snapshot without a signaling table?

I am running a Debezium connector against PostgreSQL with snapshot.mode set to initial, since I don't want to re-snapshot just because the connector has been restarted. During development, however, I sometimes do want to re-run the snapshot, because messages expire from Kafka before they have been read.
Deleting and recreating the connector via the Kafka Connect REST API accomplishes nothing, because the information in the offset/status/config topics is preserved. To trigger another snapshot I have to delete and recreate those topics and restart the whole Connect cluster.
Am I missing a more convenient way of doing this?

You will need a new name for the connector, as well as a new database.server.name in the connector config, because the stored offsets are keyed by those values. It should be almost like deploying the connector for the first time again.
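For example, a minimal sketch of recreating the connector under a new name via the REST API (the connector name, host, and database settings here are hypothetical):

    # Delete the old connector; its offsets remain in the offsets topic, keyed by the old name
    curl -X DELETE http://localhost:8083/connectors/inventory-connector

    # Create a replacement with a new name and a new database.server.name,
    # so Connect finds no stored offsets and runs a fresh snapshot
    curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
      "name": "inventory-connector-v2",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "database.server.name": "dbserver-v2",
        "snapshot.mode": "initial"
      }
    }'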

Related

Kafka connect - completely removing a connector

My question is split in two. I've read Kafka Connect - Delete Connector with configs?. I'd like to completely remove a connector, offsets and all, so I can recreate it with the same name later. Is this possible? To my understanding, a tombstone message will kill this connector indefinitely.
The second part is: is there a way to have the kafka-connect container automatically delete all the connectors it created when it is brought down?
Thanks
There is no single command to completely clean up connector state. For sink connectors, you can use kafka-consumer-groups to reset the connector's consumer offsets. For source connectors, it's not as straightforward: you'll need to manually produce data into the Connect-managed offsets topic.
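As a sketch (the group and topic names here are assumptions; the offsets topic name is whatever your worker's offset.storage.topic is set to):

    # Sink connectors consume with a group named connect-<connector-name>;
    # reset it to the beginning to re-deliver everything (delete or stop the connector first)
    kafka-consumer-groups --bootstrap-server localhost:9092 \
      --group connect-my-sink --reset-offsets --to-earliest \
      --all-topics --execute

    # For a source connector, produce a NULL value for its key into the
    # Connect offsets topic. Inspect the topic first to find the exact key.
    # Example with kcat, where -Z turns the empty value into a NULL:
    kcat -P -Z -b localhost:9092 -t connect-offsets -K '|' <<'EOF'
    ["my-source",{"server":"dbserver1"}]|
    EOF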
The config and status topics also persist historical data, but shouldn't prevent you from recreating the connector with the same name/details.
The Connect containers published by Confluent and Debezium always use distributed mode. You'll need to override the entrypoint of the container to use standalone mode so that connector metadata is not persisted in Kafka topics (this won't be fault tolerant, but it'll be fine for testing).
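A rough sketch of that override (the image tag and file paths are assumptions; worker.properties must set bootstrap.servers and offset.storage.file.filename, which is where standalone mode keeps its offsets):

    # Run a standalone worker so offsets live in a local file instead of Kafka topics;
    # wiping state is then just a matter of removing the container/file
    docker run --rm \
      --entrypoint connect-standalone \
      -v "$PWD/worker.properties:/tmp/worker.properties" \
      -v "$PWD/connector.properties:/tmp/connector.properties" \
      confluentinc/cp-kafka-connect:7.5.0 \
      /tmp/worker.properties /tmp/connector.properties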

How to run ad-hoc snapshots on Debezium

I have Debezium running in a container, capturing all changes to PostgreSQL database records.
How do I delete all the Kafka topics that have already been created, and initiate an ad-hoc snapshot from the beginning for all configured tables?
You can delete them with kafka-topics --delete, just like any other topics. The Debezium ones typically match your database server/schema/table names. You'll also need to find the internal offsets topic created by the Kafka Connect framework.
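For example (the topic names here are hypothetical; Debezium names topics <database.server.name>.<schema>.<table>, and the internal topic names come from the worker's offset/config/status storage settings):

    # Delete a Debezium change topic
    kafka-topics --bootstrap-server localhost:9092 --delete --topic dbserver1.public.customers

    # Delete the Connect-internal offsets topic so the snapshot position is forgotten
    kafka-topics --bootstrap-server localhost:9092 --delete --topic connect-offsets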
With Docker, though, if you restart Kafka and Zookeeper and they don't have volumes attached, they'll lose everything, which is easier for ad-hoc development.
Also, you don't need Zookeeper anymore as of Kafka 3.3.1, where KRaft mode became production-ready.

Debezium: Check if snapshot is complete for postgres initial_only

I'm using Debezium postgres connector v1.4.2.Final.
I'm using snapshot.mode=initial_only, where I only want to get the table(s) snapshot and not stream the incremental changes. Once the snapshot is completed, I want to stop/kill the connector. How can I find out if the snapshotting is complete and that it's safe to kill the connector?
I'm using this to add new tables to an existing connector. To do that, I'm trying this:
1. Kill the original connector (snapshot.mode=initial)
2. Start a new connector with snapshot.mode=initial_only for the new tables
3. Stop the new connector once snapshotting is complete
4. Start the original connector after adding the new tables to table.whitelist
Please check the JMX metrics. Verify whether this one, https://debezium.io/documentation/reference/1.5/connectors/postgresql.html#connectors-snaps-metric-snapshotcompleted_postgresql, would suit your needs.
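For example, with the jmxterm CLI (the JMX port and server name are assumptions; the Connect JVM must be started with remote JMX enabled):

    # Read Debezium's snapshot-completed flag from its MBean
    echo "get -b debezium.postgres:type=connector-metrics,context=snapshot,server=dbserver1 SnapshotCompleted" | \
      java -jar jmxterm-1.0.2-uber.jar -l localhost:1099 -n

Once the attribute reports true, it should be safe to delete the initial_only connector.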

Running Source Connector on Demand and Not Based on poll.interval.ms

I have a table that is updated once or twice a day, but I want the data to be pushed to Kafka immediately after the table is updated. Is it possible to avoid running the connector every poll.interval.ms, and instead run it only after the table is updated (sync on demand, or trigger the sync some other way after the table update)?
I apologize if this question is stupid... Can a sink connector run on one Kafka cluster but pull messages from another Kafka cluster and insert them into Postgres? I'm not talking about replicating messages from cluster A to cluster B and then inserting messages from cluster B into Postgres. I'm talking about a connector running on cluster B but pulling messages from cluster A and writing them to Postgres.
Thanks!
If you use log-based change data capture (Debezium, etc.) then you capture changes as soon as they happen, without needing to re-query the database. If you use query-based CDC then you do have to query the database on a polling interval. For query-based vs log-based CDC, see this blog or talk.
One option would be to use the Kafka Connect REST API to control the connector - but you're kind of going against the streaming paradigm here and will start to find awkward edges in doing this. For example, when do you decide to pause the connector? How do you determine that it's ingested all the changes? etc.
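As a sketch of that approach (the connector name and host are placeholders), the REST API exposes pause/resume endpoints:

    # Pause the connector once you judge it has caught up...
    curl -X PUT http://localhost:8083/connectors/my-connector/pause

    # ...and resume it whenever you know the table has been updated
    curl -X PUT http://localhost:8083/connectors/my-connector/resume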
Using log-based CDC is low-impact on the source system and commonly the route that people go.
Kafka Connect does not run on your Kafka cluster; it runs as its own cluster. Physically, it can be co-located for a dev/sandbox environment (this ref arch is useful for production). See also the talk "Running Kafka Connect".
So in your example, "Cluster B" is actually a Kafka Connect cluster - and it would be configured to read from Kafka cluster "A", and that is fine.
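Concretely, it's the Connect worker's own config that decides which Kafka cluster it talks to. A minimal sketch of a distributed worker properties file (hostnames are placeholders):

    # connect-distributed.properties for a Connect cluster reading from cluster A
    bootstrap.servers=kafka-cluster-a:9092
    group.id=my-connect-cluster
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    # The internal topics also live on cluster A
    offset.storage.topic=connect-offsets
    config.storage.topic=connect-configs
    status.storage.topic=connect-status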

Debezium SQL Server Connector Kafka Initial Snapshot

According to the Debezium SQL Server Connector documentation, the initial snapshot fires only on the connector's first run.
However, if I delete the connector and create a new one with the same name, the initial snapshot does not run either.
Is this by design, or a known issue?
Any help appreciated.
Kafka Connect stores details about connectors such as their snapshot status and ingest progress even after they've been deleted. If you recreate it with the same name it will assume it's the same connector and thus will try to continue from where the previous connector got to.
If you want a connector to start from scratch (i.e. run snapshot etc) then you need to give the connector a new name. (Technically, you could also go into Kafka Connect and muck about with the internal data to remove the data for the connector of the same name, but that's probably a bad idea)
Give your connector a new database.server.name value, which will also produce new topics. The snapshot doesn't fire again because the offsets stored for your connector already record that the snapshot completed, so it resumes streaming from there instead.
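You can see this for yourself by reading the offsets topic back (the topic name depends on the worker's offset.storage.topic setting):

    # Print the stored keys/values; the entry keyed by your connector name
    # records that its snapshot already completed
    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic connect-offsets --from-beginning \
      --property print.key=true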