Kafka sink connector: No tasks assigned, even after restart - mongodb

I am using Confluent 3.2 in a set of Docker containers, one of which is running a kafka-connect worker.
For reasons yet unclear to me, two of my four connectors - to be specific, hpgraphsl's MongoDB sink connector - stopped working. I was able to identify the main problem: The connectors did not have any tasks assigned, as could be seen by calling GET /connectors/{my_connector}/status. The other two connectors (of the same type) were not affected and were happily producing output.
I tried three different methods to get my connectors running again via the REST API (the corresponding calls are sketched below):
Pausing and resuming the connectors
Restarting the connectors
Deleting and then re-creating the connector under the same name, using the same config
None of the methods worked. I finally got my connectors working again by:
Deleting and creating the connector under a different name, say my_connector_v2 instead of my_connector
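For reference, these are roughly the REST calls involved; the worker URL and the connector name are placeholders:
curl -X GET http://localhost:8083/connectors/my_connector/status     # inspect connector and task state
curl -X PUT http://localhost:8083/connectors/my_connector/pause      # pause
curl -X PUT http://localhost:8083/connectors/my_connector/resume     # resume
curl -X POST http://localhost:8083/connectors/my_connector/restart   # restart the connector
curl -X DELETE http://localhost:8083/connectors/my_connector         # delete before re-creating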
What is going on here? Why am I not able to restart my existing connector and get it to start an actual task? Is there any stale data on the kafka-connect worker or in some kafka-connect-related topic on the Kafka brokers that needs to be cleaned?
I have filed an issue on the specific connector's GitHub repo, but I feel like this might actually be a general bug related to the internals of Kafka Connect. Any ideas?

I have faced this issue. It can happen when there are not enough resources for a SinkTask or SourceTask to start.
The memory allocated to the worker may be too low; by default the worker gets only a small heap (around 256 MB). Please increase it. Below is an example that allocates 2 GB of heap to a worker running in distributed mode.
KAFKA_HEAP_OPTS="-Xmx2G" sh $KAFKA_SERVICE_HOME/connect-distributed $KAFKA_CONFIG_HOME/connect-avro-distributed.properties

Related

Kafka connect-distributed mode fault tolerance not working

I have created a Kafka Connect cluster with 3 EC2 machines and started 3 connectors (debezium-postgres source), one on each machine, each reading a different set of tables from the Postgres source. On one of the machines I also started the S3 sink connector. So the changed data from Postgres is moved to the Kafka broker via the source connectors (3), and the S3 sink connector consumes these messages and pushes them to an S3 bucket.
The cluster is working fine and so are the connectors. When I pause one of the connectors running on one of the EC2 machines, I expected its tasks to be taken over by another connector (postgres-debezium) running on another machine. But that's not happening.
I installed Kafdrop as well to monitor the brokers. I can see the three internal topics connect-offsets, connect-status and connect-configs being populated with the necessary offsets, configs and status (when I pause, a paused-status message appears).
But somehow the other connectors are not taking over the tasks when I pause one.
In what scenario does a connector take over the tasks of a failed one? Is pausing the right way to test this, or do we need to produce an actual error on one of the connectors for the takeover to happen?
Please guide.
Sounds like it's working as expected.
Pausing has nothing to do with the fault tolerance settings and it'll completely stop the tasks. There's nothing to rebalance until unpaused.
The fault tolerance settings (dead letter queue, skip+log, or halt) are for when there are actual runtime exceptions in the connector that you cannot control through the API, for example a database or S3 network/authentication exception, or a serialization error in the Kafka client.
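For reference, a hedged sketch of what those error-handling settings look like on a sink connector's config; the connector name, topic and dead letter queue topic are made up, only the error-handling keys are shown, and the rest of the S3 sink's required settings are omitted (these properties need Kafka Connect 2.0 or newer):
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/my-s3-sink/config \
  -d '{
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        "errors.tolerance": "all",
        "errors.log.enable": "true",
        "errors.log.include.messages": "true",
        "errors.deadletterqueue.topic.name": "dlq-my-s3-sink",
        "errors.deadletterqueue.topic.replication.factor": "1"
      }'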

How to convince Kafka brokers to assign workers to my topic?

We've got a (Confluent) managed Kafka installation with 5 brokers and 2 connector hosts. I have two topics which never get any consumers assigned to them, despite repeated starting and stopping of the connectors which are supposed to listen to them. This configuration was running until recently (and no, nothing has changed; we've done an audit to confirm).
What, if anything, can I do to force assignment of consumers to these topics?
This problem occurred for two reasons: (1) I had mistakenly installed new versions of the PostgreSQL and SQL Server JDBC connectors (which conflicted with the already-installed versions), and (2) I did not understand that source connectors do not have consumer groups assigned to them.
Getting past the "well duh, of course sources don't have consumers", another vital piece of information (thank you to the support team at Confluent for this) is that you can see where your connector is up to by reading the internal topic connect-offsets. Once you have that, you can check your actual DB query and see what it is returning, then (if necessary) reset your connector's offset.
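A minimal sketch of reading that topic with the console consumer; the topic name connect-offsets is the default (set by the worker's offset.storage.topic property) and the bootstrap server is a placeholder:
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic connect-offsets --from-beginning \
  --property print.key=true
Each record's key identifies the connector and its source partition, and the value holds the last committed offset. Resetting a source connector's offset is typically done by producing a tombstone (null value) for that key back to the same topic while the connector is stopped.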

Kafka 2.0 - Multiple Kerberos Principals in KafkaConnect Connectors

We are currently using HDF (Hortonworks DataFlow) 3.3.1, which bundles Kafka 2.0.0. The problem is with running multiple connectors with different configurations (Kerberos principals) on the same Kafka Connect cluster.
In this Kafka version, all connectors are supposed to use the same consumer/producer properties, which are set in the worker configuration with the consumer.* or producer.* prefix. But as I stated, we have multiple users (apps) running their own connectors, and we can't use a single Kerberos principal to allow reads on all topics.
So I just wanted to check with the experts whether there is any way this security limitation can be overcome. The only option I can think of is to run a different Kafka Connect cluster for each Kafka user (different principals), but what implications would that have if we run many Kafka Connect clusters on the same nodes? Will it have any impact in terms of resources (Java heap etc.), or is this the only way (standard procedure) to handle this?
PS: In later releases (2.3+) this problem is fixed via KAFKA-8265 and these settings can be overridden per connector, but even if we upgrade to the latest HDF we will only get Kafka 2.1, which will not solve the issue.
Thanks for your help !!
I think upgrading is your best option to get the linked feature. As I commented, you can go get the latest Kafka versions on your own... Hortonworks/Cloudera doesn't offer support for Connect anyway. They'd rather you use Spark/Flink/NiFi (I think Storm is no longer around?).
what implications would that have if we run many Kafka Connect clusters on the same nodes? Will it have any impact in terms of resources (Java heap etc.)?
Heap is the main one (for batching in sink connectors). Network and CPU load could also come into play, depending on the rate of messages.
As long as the REST ports of each cluster's worker processes aren't colliding, you can co-locate them, but each Connect cluster will need its own group.id and its own set of internal config, offset and status topics so the clusters don't step on each other.
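For reference, once on 2.3+, the per-connector override from KAFKA-8265 looks roughly like this. Everything below is a hedged sketch: the policy value, connector name, topic, keytab path and principal are placeholders, and the stock FileStreamSinkConnector merely stands in for a real connector class.
# In each worker's config (e.g. connect-distributed.properties), allow overrides:
#   connector.client.config.override.policy=All
# Then an individual connector can carry its own client credentials:
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/team-a-sink/config \
  -d '{
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "file": "/tmp/team-a.out",
        "topics": "team-a-topic",
        "consumer.override.security.protocol": "SASL_PLAINTEXT",
        "consumer.override.sasl.kerberos.service.name": "kafka",
        "consumer.override.sasl.jaas.config": "com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true storeKey=true keyTab=\"/etc/security/keytabs/team-a.keytab\" principal=\"team-a@EXAMPLE.COM\";"
      }'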

Kafka Connector - distributed - load balancing tasks

I am running a development environment for Confluent Kafka, Community edition, on Windows, version 3.0.1-2.11.
I am trying to achieve load balancing of tasks between 2 instances of the connector. I am running ZooKeeper, the Kafka server, the REST services and 2 instances of Connect distributed on the same machine.
The only difference between the properties files for the two Connect instances is the REST port, since they are running on the same machine.
I don't create topics for connector offsets, config, status. Should I?
I have custom code for sink connector.
When I create a worker for my sink connector, I do this by executing a POST request
POST http://localhost:8083/connectors
toward either of the running Connect instances. Checking whether the worker is loaded is done at the URL
GET http://localhost:8083/connectors
My sink connector has System.out.println() lines in the code, so I can follow its output in the console log.
When my worker is running I can see that only one Connect instance is executing the code. If I terminate that instance, the other one takes over the work and execution resumes. However, this is not what I want.
My goal is for both Connect instances to run the worker code so that they can share the load between them.
I've tried going over some open-source connectors to see whether there is anything specific about how connector code should be written, but with no success.
I've made several different attempts to tackle this problem, but with no success.
I could rewrite my business code to work around this, but I'm pretty sure I'm missing something that is not obvious to me.
Recently I commented on Robin Moffatt's answer to this question.
From the sounds of it, your custom code is not correctly spawning the number of tasks that you are expecting.
Make sure that you've set tasks.max > 1 in your config.
Make sure that your connector's taskConfigs() implementation actually returns the appropriate number of task configurations, as sketched below.
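A hedged sketch of submitting a connector with more than one task; the connector name, class and topic are placeholders, and the class must still return multiple configurations from taskConfigs() for extra tasks to actually appear:
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8083/connectors \
  -d '{
        "name": "my-sink",
        "config": {
          "connector.class": "com.example.MySinkConnector",
          "topics": "my-topic",
          "tasks.max": "2"
        }
      }'
# Then verify how many tasks were actually created:
curl http://localhost:8083/connectors/my-sink/status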
References:
https://opencredo.com/blogs/kafka-connect-source-connectors-a-detailed-guide-to-connecting-to-what-you-love/
https://docs.confluent.io/current/connect/devguide.html
https://enfuse.io/a-diy-guide-to-kafka-connectors/

Kafka partitions and offsets disappeared

My Kafka clients are running in the GCP App Engine Flex environment with autoscaling enabled (GCP keeps the instance count to at least two, and it has mostly been 2 due to low CPU usage). The consumer groups running in those 2 VMs have been consuming messages from various topics with 20 partitions for several months. Recently I noticed that partitions in older topics shrank to just 1 (!) and the offsets for that consumer group were reset to 0. The topic-[partition] directories were also gone from the kafka-logs directory. Strangely, recently created topic partitions are intact. I have 3 different environments (all in GCP) and this happened to all three. We didn't see any lost messages or data problems, but we want to understand what happened to avoid it happening again.
The Kafka broker and ZooKeeper are running on the same, single GCP Compute Engine instance (I know it's not best practice and I plan to improve this), and I suspect it has something to do with a machine restart wiping out some information. However, I verified that the data files are written under the /opt/bitnami/(kafka|bitnami) directory and not /tmp, which can be cleared by machine restarts.
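One way to confirm where the broker actually keeps its data is to check log.dirs in the broker configuration; the path to server.properties below assumes a typical Bitnami layout and may differ on your image:
grep '^log\.dirs' /opt/bitnami/kafka/config/server.properties
# List the topic-partition directories in whatever location that reports:
ls -d "$(grep '^log\.dirs' /opt/bitnami/kafka/config/server.properties | cut -d= -f2)"/*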
Spring Kafka 1.1.3
kafka client 0.10.1.1
single node kafka broker 0.10.1.0
single node zookeeper 3.4.9
Any insights on this will be appreciated!
Bitnami developer here. I could reproduce the issue and track it down to an init script that was clearing the content of the tmp/kafka-logs/ folder.
We released a new revision of the kafka installers, virtual machines and cloud images fixing the issue. The revision that includes the fix is 1.0.0-2.