Running Source Connector on Demand and Not Based on poll.interval.ms - apache-kafka

I have a table that is updated once or twice a day, but I want the data to be pushed to Kafka immediately after the table is updated. Is it possible to avoid running the connector every poll.interval.ms, and instead run it only after the table is updated (sync on demand, or trigger the sync in some other way after the table update)?
I apologize if this question is stupid... Can a sink connector run on one Kafka cluster but pull messages from another Kafka cluster and insert them into Postgres? I'm not talking about replicating messages from Cluster A to Cluster B and then inserting messages from Cluster B into Postgres. I'm talking about a connector running on Cluster B but pulling messages from Cluster A and writing them to Postgres.
Thanks!

If you use log-based change data capture (Debezium, etc) then you capture changes as soon as they are there, without needing to re-query the database. If you use query-based CDC then you do have to query the database on a polling interval. For query-based vs log-based CDC see this blog or talk.
One option would be to use the Kafka Connect REST API to control the connector - but you're kind of going against the streaming paradigm here and will start to find awkward edges in doing this. For example, when do you decide to pause the connector? How do you determine that it's ingested all the changes? etc.
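As a rough illustration of the REST-based approach, here is a minimal Python sketch that builds the pause and resume calls for a connector. The base URL (port 8083 is the usual default for the Connect REST listener) and the connector name `jdbc-source` are assumptions for the example. Deciding when to call these remains the awkward part described above.

```python
import urllib.request

# Assumption: the Connect worker's REST API is reachable here.
CONNECT_URL = "http://localhost:8083"


def build_request(connector, action):
    """Build (but do not send) the REST call that pauses or resumes a connector."""
    if action not in ("pause", "resume"):
        raise ValueError(f"unsupported action: {action}")
    # Kafka Connect exposes PUT /connectors/{name}/pause and .../resume
    url = f"{CONNECT_URL}/connectors/{connector}/{action}"
    return urllib.request.Request(url, method="PUT")


def send(request):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical flow: resume after the table update, pause again later.
    print(send(build_request("jdbc-source", "resume")))
```

A job that updates the table could call `resume` when it finishes, but something would still have to decide when it is safe to `pause` again.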
Using log-based CDC is low-impact on the source system and commonly the route that people go.
Kafka Connect does not run on your Kafka cluster. Kafka Connect runs as its own cluster. Physically, it can be co-located for purposes of dev/sandbox environment (this ref arch is useful for production). See also this talk "Running Kafka Connect".
So in your example, "Cluster B" is actually a Kafka Connect cluster - and it would be configured to read from Kafka cluster "A", and that is fine.
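To make that concrete, here is a small Python sketch that renders a distributed worker configuration whose bootstrap.servers points at Kafka cluster "A". The host names, group id, and topic names are hypothetical placeholders; only the property keys are real Kafka Connect worker settings.

```python
def render_properties(settings):
    """Render a dict as Java-style .properties lines."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Hypothetical values: the Connect workers run wherever you like ("Cluster B"),
# but they talk to the brokers of Kafka cluster "A".
worker_settings = {
    "bootstrap.servers": "kafka-a-1:9092,kafka-a-2:9092",
    "group.id": "connect-cluster-b",
    "config.storage.topic": "connect-configs",
    "offset.storage.topic": "connect-offsets",
    "status.storage.topic": "connect-status",
}

if __name__ == "__main__":
    print(render_properties(worker_settings))
```

With a setup like this, the worker's internal topics also live in cluster "A", since that is the cluster the worker connects to.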

Related

Can I initiate an ad-hoc Debezium snapshot without a signaling table?

I am running a Debezium connector to PostgreSQL. The snapshot.mode I use is initial, since I don't want to resnapshot just because the connector has been restarted. However, during development I want to restart the process, as the messages expire from Kafka before they have been read.
If I delete and recreate the connector via Kafka Connect REST API, this doesn't do anything, as the information in the offset/status/config topics is preserved. I have to delete and recreate them when restarting the whole connect cluster to trigger another snapshot.
Am I missing a more convenient way of doing this?
You will also need a new name for the connector, as well as a new database.server.name in the connector config, because the stored offset information is tied to them. It should be almost like deploying a connector for the first time again.
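A minimal Python sketch of that idea: derive a fresh connector definition from an existing one by changing only the connector name and database.server.name. The helper name, the suffix scheme, and the sample values are made up for illustration.

```python
import copy


def fresh_connector(definition, suffix):
    """Return a copy of a connector definition with a new name and server name."""
    new_definition = copy.deepcopy(definition)
    new_definition["name"] = f"{definition['name']}_{suffix}"
    old_server = definition["config"]["database.server.name"]
    new_definition["config"]["database.server.name"] = f"{old_server}_{suffix}"
    return new_definition


# Hypothetical existing definition, in the shape POSTed to /connectors.
existing = {
    "name": "inventory",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.server.name": "dbserver1",
        "snapshot.mode": "initial",
    },
}

if __name__ == "__main__":
    print(fresh_connector(existing, "v2"))
```

Note that changing database.server.name also changes the topic names the connector writes to, so downstream consumers would need to follow.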

Same consumer group (s3 sink connector) across two different kafka connect cluster

I'm migrating Kafka connectors from an ECS cluster to a new cluster running on Kubernetes. I successfully migrated the Postgres source connectors over by deleting them and recreating them on the exact replication slots. They keep writing to the same topics in the same Kafka cluster. And the S3 connector in the old cluster continues to read from those and write records into S3. Everything works as usual.
But now to move the AWS s3 sink connectors, I first created a non-critical s3 connector in the new cluster with the same name as the one in the old cluster. I was going to wait a few minutes before deleting the old one to avoid missing data. To my surprise, it looks like (based on the UI provided by akhq.io) the one worker on that new s3 connector joins with the existing same consumer group. I was fully expecting to have duplicated data. Based on the Confluent doc,
"All Workers in the cluster use the same three internal topics to share connector configurations, offset data, and status updates. For this reason all distributed worker configurations in the same Connect cluster must have matching config.storage.topic, offset.storage.topic, and status.storage.topic properties."
So from this "same Connect cluster", I thought having the same consumer group id only works within the same connect cluster. But from my observation, it seems like you could have multiple consumers in different clusters belonging to the same consumer group?
Based on this article __consumer_offsets is used by consumers, and unlike other hidden "offset" related topics, it doesn't have any cluster name designation.
Does that mean I could simply create the S3 sink connectors in the new Kubernetes cluster and then delete the ones in the ECS cluster without duplicating or missing data (as long as they have the same name, and therefore the same consumer group)? I'm not sure if this is the right pattern people usually use.
I'm not familiar with using a Kafka Connect Cluster but I understand that it is a cluster of connectors that is independent of the Kafka cluster.
In that case, since the connectors are using the same Kafka cluster and you are just moving them from ECS to k8s, it should work as you describe. The consumer offsets and the internal Kafka Connect offsets are stored in the Kafka cluster, so it doesn't really matter where the connectors run as long as they connect to the same Kafka cluster. They should restart from the same position, or behave as additional replicas of the same connector, regardless of where they are running.
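The reason the two connectors end up in one group can be sketched in a few lines of Python. By default a sink connector's consumer group id is derived from the connector name by prefixing it with "connect-", with no reference to the Connect cluster it runs in. The connector name below is a hypothetical example.

```python
def sink_consumer_group(connector_name):
    """Default consumer group id used by a Kafka Connect sink connector."""
    return f"connect-{connector_name}"


# The same connector name deployed on two different Connect clusters
# (hypothetical name) maps to the same group in the shared Kafka cluster.
old_ecs_group = sink_consumer_group("s3-sink-events")
new_k8s_group = sink_consumer_group("s3-sink-events")

if __name__ == "__main__":
    print(old_ecs_group, new_k8s_group, old_ecs_group == new_k8s_group)
```

Because both resolve to the same group, Kafka splits the topic partitions between the old and new tasks instead of delivering every record to both.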

Kafka sink from multiple independent brokers

I want to aggregate changes from multiple databases into one so I thought to run a Debezium connector and a Kafka server/broker next to each database, and use a Kafka sink connector to consume from all those Kafkas to write into one database.
The question is: can I use a single instance of a Kafka sink connector to consume, at the same time, from multiple Kafka brokers which are independent (not a cluster)?
Running a Kafka broker next to each database sounds very complicated. And a single Kafka connect worker that connects to different Kafka broker clusters does not seem to be supported, as far as I can see.
If you go down this path, it may make more sense to use something like Kafka MirrorMaker to copy your local topics to a single main Kafka cluster, and then use a Kafka Connect Sink to read all the copied topics from one worker and write to a central DB.
Ultimately, running a broker next to each source database is pretty complicated. From what you described, it sounds like you have some connectivity between your different databases, but it is limited and possibly prone to disconnects. Have you considered alternative designs?
DB Replication: Use your DB vendor's native async replication to just copy the data to a single target DB. The remote region is always read-only, replication should not slow down your source DB (depends on the DB, of-course). And async DB replication can usually handle some network disconnections and latency.
Local Debezium: Run a process with Debezium next to each DB, and save all events to a file. Copy the files to some central server or to a cloud storage service like S3. Finally, import these files into a central DB. This would basically skip Kafka completely.
You can point the Connect property files at whatever bootstrap.servers you want.
The brokers listed in that property must all be part of a single "cluster" (even if it is a single broker), which is determined by the brokers' zookeeper.connect property.

What is the relationship between connectors and tasks in Kafka Connect?

We've been using Kafka Connect for a while on a project, currently entirely using only the Confluent Kafka Connect JDBC connector. I'm struggling to understand the role of 'tasks' in Kafka Connect, and specifically with this connector. I understand 'connectors'; they encompass a bunch of configuration about a particular source/sink and the topics they connect from/to. I understand that there's a 1:Many relationship between connectors and tasks, and the general principle that tasks are used to parallelize work. However, how can we understand when a connector will/might create multiple tasks?
In the source connector case, we are using the JDBC connector to pick up source data by timestamp and/or a primary key, and so this seems in its very nature sequential. Indeed, all of our source connectors only ever seem to have one task. What would ever trigger Kafka Connect to create more than one task? Currently we are running Kafka Connect in distributed mode, but only with one worker; if we had multiple workers, might we get multiple tasks per connector, or are the two not related?
In the sink connector case, we are explicitly configuring each of our sink connectors with tasks.max=1, and so unsurprisingly we only ever see one task for each connector there too. If we removed that configuration, presumably we could/would get more than one task. Would this mean the messages on our input topic might be consumed out of sequence? In which case, how is data consistency for changes assured?
Also, from time to time, we have seen situations where a single connector and task will both enter the FAILED state (because of input connectivity issues). Restarting the task will remove it from this state, and restart the flow of data, but the connector remains in FAILED state. How can this be - isn't the connector's state just the aggregate of all its child tasks?
A task is a thread that performs the actual sourcing or sinking of data.
The number of tasks per connector is determined by the implementation of the connector. Take a Debezium source connector for MySQL as an example: since one MySQL instance writes to exactly one binlog file at a time, and a file has to be read sequentially, one connector generates exactly one task.
For sink connectors, on the other hand, the number of tasks is capped by tasks.max and should match the number of partitions of the topic; each partition is read by only one task at a time, so any tasks beyond the partition count would sit idle.
The task distribution among workers is determined by task rebalance which is a very similar process to Kafka consumer group rebalance.
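To illustrate how partitions relate to sink tasks, here is a simplified Python sketch that spreads partitions over tasks round-robin. This is only an illustration of the idea, not the assignment algorithm Kafka actually uses, and the function name is made up.

```python
def assign_partitions(num_partitions, num_tasks):
    """Spread topic partitions across sink tasks, one owner per partition."""
    assignment = {task: [] for task in range(num_tasks)}
    for partition in range(num_partitions):
        assignment[partition % num_tasks].append(partition)
    return assignment


if __name__ == "__main__":
    # 6 partitions over 2 tasks: each task reads 3 partitions.
    print(assign_partitions(6, 2))
    # 3 partitions over 5 tasks: two tasks get nothing to read.
    print(assign_partitions(3, 5))
```

Since every partition has exactly one owning task, records within a partition are still consumed in order; only the relative order across partitions is not guaranteed.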

Kafka design questions - Kafka Connect vs. own consumer/producer

I need to understand when to use Kafka connect vs. own consumer/producer written by developer. We are getting Confluent Platform. Also to achieve fault tolerant design do we have to run the consumer/producer code ( jar file) from all the brokers ?
Kafka connect is typically used to connect external sources to Kafka i.e. to produce/consume to/from external sources from/to Kafka.
Anything that you can do with a connector can be done through Producer+Consumer.
Readily available Connectors only ease connecting external sources to Kafka without requiring the developer to write the low-level code.
Some points to remember:
If the source and sink are both the same Kafka cluster, a connector doesn't make sense.
If you are doing change data capture (CDC) from a database and pushing the changes to Kafka, you can use a database source connector.
Resource constraints: Kafka Connect is a separate process, so double-check what you can trade off between resources and ease of development.
If you are writing your own connector, that is well and good, unless someone has already written it. If you are using third-party connectors, you need to check how well they are maintained and/or if support is available.
do we have to run the consumer/producer code ( jar file) from all the brokers ?
Don't run client code on the brokers. Let all memory and disk access be reserved for the broker process.
when to use Kafka connect vs. own consumer/producer
In my experience, these factors should be taken into consideration
You're planning on deploying and monitoring Kafka Connect anyway, and have the available resources to do so. Again, these don't run on the broker machines
You don't plan on changing the connector code very often, because you must restart the whole Connect worker JVM, which would be running other connectors that don't need to be restarted
You aren't able to integrate your own producer/consumer code into your existing applications or simply would rather have a simpler produce/consume loop
Having structured data not tied to a particular binary format is preferred
The connector, whether you write your own or use a community one, is well tested and configurable for your use cases
Connect has limited options for fault tolerance compared to the raw producer/consumer APIs, which in turn have the drawback of more code, depending on which other libraries are being used
Note: Confluent Platform is still the same Apache Kafka
Kafka Connect:
Kafka Connect is an open-source framework with two types of connectors: Sink and Source. It is used to move data between Kafka and external systems such as databases, in either direction, and so helps integrate various other systems with Kafka. It also helps in tracking changes from databases into Kafka (Change Data Capture (CDC), as mentioned in one of the answers). It maintains offsets so that it can resume reading or writing from a particular position.
For more details, you can refer to https://docs.confluent.io/current/connect/index.html
The Producer/Consumer:
The Producer and Consumer are just client applications that use Kafka to produce messages to and consume messages from topics. They are used where we want to distribute the data to consumers in one or more consumer groups. Kafka also tracks the offsets, and therefore the lag, for each consumer group.
No, you don't need to run any producer/consumer while running Kafka Connect. If you want to check that there is no data loss, you can run a consumer while running source connectors. In the case of sink connectors, the already produced data can be verified in your database by running the appropriate select queries.