Distributed Official MongoDB Kafka Source Connector with Multiple Tasks Not Working

I am running Apache Kafka on my Windows machine with two Kafka Connect workers (ports 8083 and 8084) and one topic with three partitions (replication factor of one).
My issue is that I can see failover to the other Kafka Connect worker whenever I shut one of them down, but load balancing is not happening because the number of tasks is always one.
I am using the official MongoDB Kafka connector as a source (change stream) with tasks.max=6.
I tried updating MongoDB from multiple threads so that it would push more data into Kafka Connect and perhaps make Kafka Connect create more tasks. Even under a higher volume of data, the task count remains one.
How did I confirm that only one task is running? Through the API "http://localhost:8083/connectors/mongodb-connector/status":
Response:
{
  "name": "mongodb-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "xx.xx.xx.xx:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "xx.xx.xx.xx:8083"
    }
  ],
  "type": "source"
}
Am I missing something here? Why are more tasks not created?
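The check described above can also be scripted. The following is a minimal sketch, assuming the status payload has the shape shown in the question; the helper name summarize_tasks is hypothetical and the sample payload is embedded so the snippet runs without a live worker.

```python
import json
from collections import Counter

# Sample payload shaped like the response of GET /connectors/<name>/status.
SAMPLE_STATUS = """
{
  "name": "mongodb-connector",
  "connector": {"state": "RUNNING", "worker_id": "xx.xx.xx.xx:8083"},
  "tasks": [
    {"id": 0, "state": "RUNNING", "worker_id": "xx.xx.xx.xx:8083"}
  ],
  "type": "source"
}
"""


def summarize_tasks(status):
    """Return (task_count, tasks_per_state, tasks_per_worker) for a status payload."""
    tasks = status.get("tasks", [])
    per_state = Counter(task["state"] for task in tasks)
    per_worker = Counter(task["worker_id"] for task in tasks)
    return len(tasks), dict(per_state), dict(per_worker)


count, per_state, per_worker = summarize_tasks(json.loads(SAMPLE_STATUS))
print(count, per_state, per_worker)
```

If load balancing were happening, the per-worker summary would list more than one worker_id.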

It seems this is the expected behavior of the official MongoDB Kafka source connector. This is the answer I got on another forum from Ross Lawley (MongoDB developer):
Prior to 1.2.0 only a single task was supported by the sink connector.
The Source connector still only supports a single task, this is because it uses a single Change Stream cursor. This is enough to watch and publish changes cluster wide, database wide or down to a single collection.
I raised this ticket: https://jira.mongodb.org/browse/KAFKA-121
I got the following response:
The source connector will only ever produce a single task.
This is by design as the source connector is backed by a change stream. Change streams internally use the same data as used by replication engine and as such should be able to scale as the database does.
There are no plans to allow multiple cursors, however, should you feel that this is not meeting your requirements, then you can configure multiple connectors and each would have its own change stream cursor.
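The suggestion to configure multiple connectors could look roughly like the sketch below, which builds one connector definition per collection so that each gets its own change stream. The connection URI, database name, collection names, and the connector naming scheme are assumptions made for illustration; each resulting payload would be submitted to the Connect REST API separately.

```python
import json


def build_source_connectors(connection_uri, database, collections):
    """Build one MongoDB source connector definition per collection."""
    connectors = []
    for collection in collections:
        connectors.append({
            # Hypothetical naming scheme; any unique connector name works.
            "name": f"mongodb-source-{database}-{collection}",
            "config": {
                "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
                # The source connector only ever runs a single task.
                "tasks.max": "1",
                "connection.uri": connection_uri,
                "database": database,
                "collection": collection,
            },
        })
    return connectors


# Assumed values, for illustration only.
payloads = build_source_connectors(
    "mongodb://localhost:27017", "inventory", ["orders", "customers"]
)
for payload in payloads:
    print(json.dumps(payload, indent=2))
```

With several connectors, the Connect cluster can place their single tasks on different workers, which gives the load distribution the question was after.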

Related

Kafka Connect - connectors stop after no data for some time

I am running Kafka Connect in distributed mode on Kubernetes with 3 sink connectors, Kafka -> S3.
When data flows into Kafka and at least one of the connectors has data to read, everything works fine.
But during periods when there is no data to read, for a few hours for example, and none of the connectors needs to read any data, all the connectors stop (the /connectors endpoint on the REST API shows an empty list). So when new data eventually comes in, it is not read unless I manually start the connectors.
Is this common behavior or am I missing something? I can add additional information about the setup if needed.

Based on the comments, your config.storage.topic was not created with cleanup.policy=compact. Kafka is therefore deleting the stored connector configurations once they age out of the topic; it is not stopping idle connector tasks. Once the configs are deleted from the topic, the REST API no longer returns the connectors from the /connectors endpoint.
Refer to the documentation for the appropriate configuration of the internal Connect topics:
https://kafka.apache.org/documentation/#connect
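A quick way to spot this problem is to compare the worker's internal topic names against the topics' cleanup.policy. The sketch below is illustrative only: it assumes the topic configs have already been fetched into a plain dict (for example from a describe command), the topic names are made up, and a missing cleanup.policy is treated as the broker default of delete.

```python
INTERNAL_TOPIC_KEYS = (
    "config.storage.topic",
    "offset.storage.topic",
    "status.storage.topic",
)


def find_misconfigured_topics(worker_props, topic_configs):
    """Return a message for every internal Connect topic that is not compacted."""
    problems = []
    for key in INTERNAL_TOPIC_KEYS:
        topic = worker_props.get(key)
        if topic is None:
            continue
        # Assumption: an unset cleanup.policy means the default, "delete".
        policy = topic_configs.get(topic, {}).get("cleanup.policy", "delete")
        if policy != "compact":
            problems.append(
                f"{key}={topic} has cleanup.policy={policy}, expected compact"
            )
    return problems


# Hypothetical worker settings and topic configs.
worker_props = {
    "config.storage.topic": "connect-configs",
    "offset.storage.topic": "connect-offsets",
    "status.storage.topic": "connect-status",
}
topic_configs = {
    "connect-configs": {"cleanup.policy": "delete"},
    "connect-offsets": {"cleanup.policy": "compact"},
    "connect-status": {"cleanup.policy": "compact"},
}
problems = find_misconfigured_topics(worker_props, topic_configs)
for problem in problems:
    print(problem)
```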

What is the relationship between connectors and tasks in Kafka Connect?

We've been using Kafka Connect for a while on a project, currently entirely using only the Confluent Kafka Connect JDBC connector. I'm struggling to understand the role of 'tasks' in Kafka Connect, and specifically with this connector. I understand 'connectors'; they encompass a bunch of configuration about a particular source/sink and the topics they connect from/to. I understand that there's a 1:Many relationship between connectors and tasks, and the general principle that tasks are used to parallelize work. However, how can we understand when a connector will/might create multiple tasks?
In the source connector case, we are using the JDBC connector to pick up source data by timestamp and/or a primary key, and so this seems in its very nature sequential. Indeed, all of our source connectors only ever seem to have one task. What would ever trigger Kafka Connect to create more than one task? Currently we are running Kafka Connect in distributed mode, but only with one worker; if we had multiple workers, might we get multiple tasks per connector, or are the two not related?
In the sink connector case, we are explicitly configuring each of our sink connectors with tasks.max=1, and so unsurprisingly we only ever see one task for each connector there too. If we removed that configuration, presumably we could/would get more than one task. Would this mean the messages on our input topic might be consumed out of sequence? In which case, how is data consistency for changes assured?
Also, from time to time, we have seen situations where a single connector and task will both enter the FAILED state (because of input connectivity issues). Restarting the task will remove it from this state, and restart the flow of data, but the connector remains in FAILED state. How can this be - isn't the connector's state just the aggregate of all its child tasks?

A task is a thread that performs the actual sourcing or sinking of data.
The number of tasks per connector is determined by the connector's implementation. Take a Debezium source connector for MySQL as an example: since one MySQL instance writes to exactly one binlog file at a time and the file has to be read sequentially, one connector generates exactly one task.
For sink connectors, by contrast, the number of tasks should generally match the number of partitions of the topic; sink tasks form a consumer group, so tasks beyond the partition count would sit idle.
The task distribution among workers is determined by task rebalance which is a very similar process to Kafka consumer group rebalance.
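To make the "determined by the implementation" point concrete, here is a rough Python model of what a connector's task-configuration step does. It is not Kafka Connect code (real connectors implement this in Java); the function names and the units config key are hypothetical.

```python
def source_task_configs(base_config, work_units, max_tasks):
    """Model a source connector that spreads independent work units over tasks.

    The connector never creates more tasks than it has units of work,
    regardless of how large tasks.max is.
    """
    if not work_units:
        return []
    num_tasks = min(max_tasks, len(work_units))
    groups = [[] for _ in range(num_tasks)]
    for index, unit in enumerate(work_units):
        groups[index % num_tasks].append(unit)
    # "units" is a made-up config key for this illustration.
    return [dict(base_config, units=",".join(group)) for group in groups]


def sink_task_configs(base_config, max_tasks):
    """Model a sink connector: every task gets the same config, and the
    consumer group later divides the topic partitions among the tasks."""
    return [dict(base_config) for _ in range(max_tasks)]


# Three tables with tasks.max=6 still give only three tasks.
table_tasks = source_task_configs({"name": "jdbc"}, ["orders", "customers", "items"], 6)
# A single sequential log (binlog, change stream) gives exactly one task.
log_tasks = source_task_configs({"name": "cdc"}, ["binlog"], 6)
sink_tasks = sink_task_configs({"name": "s3-sink"}, 3)
print(len(table_tasks), len(log_tasks), len(sink_tasks))
```

In this model tasks.max is only an upper bound: the connector decides how many task configurations to return, and the workers then distribute those tasks among themselves.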

Running Source Connector on Demand and Not Based on poll.interval.ms

I have a table that is updated once or twice a day, but I want the data to be pushed to Kafka immediately after the table is updated. Is it possible to avoid running the connector every poll.interval.ms, and instead run it only after the table is updated (sync on demand, or trigger the sync in some other way after the table update)?
I apologize if this question is stupid... Can a sink connector run on one Kafka cluster, but pull messages from another Kafka cluster and insert them into Postgres? I'm not talking about replicating messages from Cluster A to Cluster B and then inserting messages from Cluster B into Postgres. I'm talking about a connector running on Cluster B but pulling messages from Cluster A and writing them to Postgres.
Thanks!

If you use log-based change data capture (Debezium, etc) then you capture changes as soon as they are there, without needing to re-query the database. If you use query-based CDC then you do have to query the database on a polling interval. For query-based vs log-based CDC see this blog or talk.
One option would be to use the Kafka Connect REST API to control the connector - but you're kind of going against the streaming paradigm here and will start to find awkward edges in doing this. For example, when do you decide to pause the connector? How do you determine that it's ingested all the changes? etc.
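For completeness, controlling a connector through the REST API could be sketched as follows, using the pause and resume endpoints. The worker address and the connector name jdbc-source are assumptions; the request is only built here, and the send helper would perform the actual call against a running worker.

```python
import urllib.request


def control_request(base_url, connector, action):
    """Build a PUT request for /connectors/<name>/pause or /resume."""
    if action not in ("pause", "resume"):
        raise ValueError(f"unsupported action: {action}")
    url = f"{base_url.rstrip('/')}/connectors/{connector}/{action}"
    return urllib.request.Request(url, method="PUT")


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Assumed worker address and connector name.
req = control_request("http://localhost:8083", "jdbc-source", "pause")
print(req.get_method(), req.full_url)
# send(req)  # uncomment when a Connect worker is reachable
```

This only toggles the connector; deciding when all changes have been ingested, the awkward part mentioned above, is still left to the caller.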
Using log-based CDC is low-impact on the source system and commonly the route that people go.
Kafka Connect does not run on your Kafka cluster. Kafka Connect runs as its own cluster. Physically, it can be co-located for purposes of dev/sandbox environment (this ref arch is useful for production). See also this talk "Running Kafka Connect".
So in your example, "Cluster B" is actually a Kafka Connect cluster - and it would be configured to read from Kafka cluster "A", and that is fine.

Kafka Connector - distributed - load balancing tasks

I am running a development environment for Confluent Kafka (Community edition, version 3.0.1-2.11) on Windows.
I am trying to achieve load balancing of tasks between 2 instances of the connector. I am running Kafka ZooKeeper, the server, REST services, and 2 instances of Connect distributed on the same machine.
The only difference between the properties files for the connectors is the REST port, since they are running on the same machine.
I don't create topics for connector offsets, config, and status. Should I?
I have custom code for the sink connector.
When I create a worker for my sink connector, I do this by executing a POST request
POST http://localhost:8083/connectors
toward any of the running connectors. Checking whether the worker is loaded is done at the URL
GET http://localhost:8083/connectors
My sink connector has System.out.println() lines in the code, with which I can follow the output of my code in the console log.
When my worker is running, I can see that only one instance of the connector is executing code. If I terminate one connector, the other instance takes over the worker and execution resumes. However, this is not what I want.
My goal is for both connector instances to run the worker code so that they can share the load between them.
I've tried to go over some open source connectors to see whether there is anything specific in how connector code is written, but with no success.
I've made several different attempts to tackle this problem, but with no success.
I could rewrite my business code to work around this, but I'm pretty sure I'm missing something that is not obvious to me.
Recently I commented on Robin Moffatt's answer of this question.

From the sounds of it, your custom code is not spawning the number of tasks that you expect.
Make sure that you have set tasks.max to a value greater than 1 in your config.
Make sure that your connector's taskConfigs method returns the appropriate number of task configurations.
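As a rough illustration of the first point, the connector could be created with an explicit tasks.max in the POST body. The connector name, class name, and topic below are hypothetical placeholders for the custom sink; the sketch only builds the request and does not send it.

```python
import json
import urllib.request


def create_connector_request(base_url, name, connector_class, topics, tasks_max):
    """Build the POST /connectors request for a sink connector."""
    if tasks_max < 2:
        raise ValueError("tasks.max must be greater than 1 to share load")
    body = {
        "name": name,
        "config": {
            "connector.class": connector_class,
            "topics": topics,
            "tasks.max": str(tasks_max),
        },
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/connectors",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical connector name, class, and topic.
create_req = create_connector_request(
    "http://localhost:8083", "my-sink", "com.example.MySinkConnector", "my-topic", 2
)
print(create_req.get_method(), create_req.full_url)
print(create_req.data.decode("utf-8"))
```

Setting tasks.max alone is not enough: the connector still has to return that many task configurations, which is the second point above.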
References:
https://opencredo.com/blogs/kafka-connect-source-connectors-a-detailed-guide-to-connecting-to-what-you-love/
https://docs.confluent.io/current/connect/devguide.html
https://enfuse.io/a-diy-guide-to-kafka-connectors/

Kafka sink connector: No tasks assigned, even after restart

I am using Confluent 3.2 in a set of Docker containers, one of which is running a kafka-connect worker.
For reasons yet unclear to me, two of my four connectors - to be specific, hpgraphsl's MongoDB sink connector - stopped working. I was able to identify the main problem: The connectors did not have any tasks assigned, as could be seen by calling GET /connectors/{my_connector}/status. The other two connectors (of the same type) were not affected and were happily producing output.
I tried three different methods to get my connectors running again via the REST API:
1. Pausing and resuming the connectors
2. Restarting the connectors
3. Deleting and then creating the connector under the same name, using the same config
None of the methods worked. I finally got my connectors working again by:
Deleting and creating the connector under a different name, say my_connector_v2 instead of my_connector
What is going on here? Why am I not able to restart my existing connector and get it to start an actual task? Is there any stale data on the kafka-connect worker or in some kafka-connect-related topic on the Kafka brokers that needs to be cleaned?
I have filed an issue on the specific connector's GitHub repo, but I feel like this might actually be a general bug related to the internals of kafka-connect. Any ideas?

I have faced this issue. It can happen when the worker does not have enough resources to start a SinkTask or SourceTask.
The memory allocated to the worker may sometimes be too low. By default, workers are allocated about 250 MB. Please increase this. Below is an example that allocates 2 GB of memory to a worker running in distributed mode.
KAFKA_HEAP_OPTS="-Xmx2G" sh $KAFKA_SERVICE_HOME/connect-distributed $KAFKA_CONFIG_HOME/connect-avro-distributed.properties
