Kafka Connect assigns same task to multiple workers

I'm using Kafka Connect in distributed mode. A strange behavior I have observed multiple times now is that, after some time (it can be hours, it can be days), what appears to be a balancing error happens: the same tasks get assigned to multiple workers. As a result, they run concurrently and, depending on the nature of the connector, fail or produce "unpredictable" outputs.
The simplest configuration I was able to use to reproduce the behavior is: two Kafka Connect workers, two connectors, each connector with one task only. Kafka Connect is deployed into Kubernetes. Kafka itself is in Confluent Cloud. Both Kafka Connect and Kafka are of the same version (5.3.1).
Relevant messages from the log:
Worker A:
[2019-10-30 12:44:23,925] INFO [Worker clientId=connect-1, groupId=some-kafka-connect-cluster] Successfully joined group with generation 488 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:469)
[2019-10-30 12:44:23,926] INFO [Worker clientId=connect-1, groupId=some-kafka-connect-cluster] Joined group at generation 488 and got assignment: Assignment{error=0, leader='connect-1-d5c19893-b33c-4f07-85fb-db9736795759', leaderUrl='http://10.16.0.15:8083/', offset=250, connectorIds=[some-hdfs-sink, some-mqtt-source], taskIds=[some-hdfs-sink-0, some-mqtt-source-0], revokedConnectorIds=[], revokedTaskIds=[], delay=0} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1397)
Worker B:
[2019-10-30 12:44:23,930] INFO [Worker clientId=connect-1, groupId=some-kafka-connect-cluster] Successfully joined group with generation 488 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:469)
[2019-10-30 12:44:23,936] INFO [Worker clientId=connect-1, groupId=some-kafka-connect-cluster] Joined group at generation 488 and got assignment: Assignment{error=0, leader='connect-1-d5c19893-b33c-4f07-85fb-db9736795759', leaderUrl='http://10.16.0.15:8083/', offset=250, connectorIds=[some-mqtt-source], taskIds=[some-mqtt-source-0], revokedConnectorIds=[], revokedTaskIds=[], delay=0} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1397)
In the above log extracts you can see that the same task (some-mqtt-source-0) is assigned to two workers. After this message, I can also see log messages from the task instances on both workers.
This behavior doesn't depend on the connector (I observed it with other tasks as well). It also doesn't happen immediately after the workers are started, but only after some time.
My question is what can be the cause of this behavior?
EDIT 1:
I've tried running 3 workers, instead of two, thinking that it might be a distributed consensus issue. It appears not to be, and having 3 workers doesn't fix the issue.
EDIT 2:
I've noticed that just before worker A is assigned a task that originally ran on worker B, worker B observes an error joining the group. For example, if tasks get "duplicated" in generation N, worker B would not have a "Successfully joined group with generation N" message in its logs. Moreover, between generation N-1 and N+1, worker B typically logs errors like Attempt to heartbeat failed for since member id and Group coordinator bx-xxx-xxxxx.europe-west1.gcp.confluent.cloud:9092 (id: 1234567890 rack: null) is unavailable or invalid. Worker B typically joins generation N+1 shortly after generation N (sometimes as little as about 3 seconds later). So it is now clear what triggers the behavior. However:
although I understand that temporary issues like these can happen and are probably normal in the general case, why doesn't rebalancing fix the issue once all workers successfully join the next generation? More rebalances do follow, but they don't distribute the tasks correctly and the "duplicates" stay forever (until I restart the workers).
it appears that in some periods a rebalance happens only once every several hours, and in other periods it happens every 5 minutes (precise to the second); what could be the reason? What is normal?
what could be the reason for "Group coordinator is unavailable or invalid" errors, given that I use Confluent Cloud, and are there any configuration parameters that can be tweaked in Kafka Connect to make it more resilient to this error? I know there are session.timeout.ms and heartbeat.interval.ms, but the documentation is so minimal that it is not even clear what the practical impact of changing these parameters to smaller or bigger values is.
EDIT 3:
I observed that the issue is not critical for sink tasks: although the same sink tasks get assigned to multiple workers, the corresponding consumers are assigned to different partitions as they normally should be, and everything works almost as it should; I simply get more tasks than I originally asked for. For source tasks, however, the behavior is breaking: tasks run concurrently and compete for resources on the source side.
EDIT 4:
Meanwhile, I downgraded Kafka Connect to version 2.2 (Confluent Platform 5.2.3), a pre-"Incremental Cooperative Rebalancing" version. It has been working fine for the last 2 days. So, I assume the behavior is related to the new rebalancing mechanism.

As mentioned in the comments, Jira KAFKA-9184 was filed to address this problem, and it has been resolved.
The fix is available in versions 2.3.2 and above.
The answer is therefore: upgrading to a recent version should prevent this problem from occurring.
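If an upgrade is not immediately possible, a workaround sometimes used (not part of the original answer, so verify it against your setup) is to switch the workers back to the pre-2.3 eager rebalancing protocol in the distributed worker configuration, which is effectively what the downgrade in EDIT 4 achieved:
connect.protocol=eager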

Related

Delete the kafka connect topics without stopping the process

I was running a Kafka Connect worker in distributed mode (it's a test cluster). I wanted to reset the default connect-* topics, so I removed them without stopping the worker. After the worker restart, I'm getting this error:
ERROR [Worker clientId=connect-1, groupId=debezium-cluster1] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:324)
org.apache.kafka.common.config.ConfigException:
Topic 'connect-offsets' supplied via the 'offset.storage.topic' property is required to have 'cleanup.policy=compact' to guarantee consistency and durability of source connector offsets,
but found the topic currently has 'cleanup.policy=delete'.
Continuing would likely result in eventually losing source connector offsets and problems restarting this Connect cluster in the future.
Change the 'offset.storage.topic' property in the Connect worker configurations to use a topic with 'cleanup.policy=compact'.
Deleting the internal topics while the workers are still running sounds risky. The workers have internal state, which now no longer matches the state in the Kafka brokers.
A safer approach would be to shut down the workers (or at least shut down all the connectors), delete the topics, and restart the workers/connectors.
It looks like the topics got auto-created, perhaps by the workers when you deleted them mid-flight.
You could manually apply the configuration change to the topic as suggested, or you could also specify a new set of topics for the worker to use (connect01- for example) and let the workers recreate them correctly.
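For illustration only (the broker address and topic names are placeholders, and older tool versions take --zookeeper instead of --bootstrap-server), the first option could look like this:
kafka-configs.sh --bootstrap-server broker:9092 --entity-type topics --entity-name connect-offsets --alter --add-config cleanup.policy=compact
The second option amounts to pointing the worker at a fresh set of internal topics in its distributed properties and letting it create them with the correct cleanup policy, for example:
offset.storage.topic=connect01-offsets
config.storage.topic=connect01-configs
status.storage.topic=connect01-status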

Duplicate message consumption in Kafka due to auto-downscaling/deletion of pods

Background
We have a simple producer/consumer style application with Kafka as the message broker and Consumer Processes running as Kubernetes pods. We have defined two topics namely the in-topic and the out-topic. A set of consumer pods that belong to the same consumer group read messages from the in-topic, perform some work and finally write out the same message (key) to the out-topic once the work is complete.
Issue Description
We noticed that there are duplicate messages being written out to the out-topic by the consumers that are running in the Kubernetes pods. To rephrase, two different consumers are consuming the same messages from the in-topic twice and thus publishing the same message twice to the out-topic as well. We analyzed the issue and can safely conclude that this issue only occurs when pods are auto-downscaled/deleted by Kubernetes.
In fact, an interesting observation we have is that if any message is read by two different consumers from the in-topic (and thus published twice in the out-topic), the given message is always the last message consumed by one of the pods that was downscaled. In other words, if a message is consumed twice, the root cause is always the downscaling of a pod.
We can conclude that a pod is getting downscaled after a consumer writes the message to the out-topic but before Kafka can commit the offset to the in-topic.
Consumer configuration
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "3600000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
ZooKeeper/broker logs:
[2021-04-07 02:42:22,708] INFO [GroupCoordinator 0]: Preparing to rebalance group PortfolioEnrichmentGroup14 in state PreparingRebalance with old generation 1 (__consumer_offsets-17) (reason: removing member PortfolioEnrichmentConsumer13-9aa71765-2518-493f-a312-6c1633225015 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2021-04-07 02:42:23,331] INFO [GroupCoordinator 0]: Stabilized group PortfolioEnrichmentGroup14 generation 2 (__consumer_offsets-17) (kafka.coordinator.group.GroupCoordinator)
[2021-04-07 02:42:23,335] INFO [GroupCoordinator 0]: Assignment received from leader for group PortfolioEnrichmentGroup14 for generation 2 (kafka.coordinator.group.GroupCoordinator)
What we tried
Looking at the logs, it was clear that rebalancing takes place because of heartbeat expiration. We added the following configuration parameters to increase the heartbeat interval and also increase the session timeout:
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "900000");
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "512");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");
However, this did not solve the issue. Looking at the broker logs, we can confirm that the issue is due to the downscaling of pods.
Question : What could be causing this behavior where a message is consumed twice when a pod gets downscaled?
Note: I already understand the root cause of the issue; however, considering that a consumer is a long-lived process running in an infinite loop, how and why is Kubernetes downscaling/killing a pod before the consumer commits the offset? How do I tell Kubernetes not to remove a running pod from a consumer group until all Kafka commits are completed?
"What could be causing this behavior where a message is consumed twice when a pod gets downscaled?"
You have provided the answer already yourself: "[...] that a pod is getting downscaled after a consumer writes the message to the out-topic but before Kafka can commit the offset to the in-topic."
As the message was processed but its offset was not committed, another pod re-processes the same message after the downscaling happens. Remember that adding or removing a consumer from a consumer group always initiates a rebalance. You now have first-hand experience of why this should generally be avoided as much as feasible. Depending on the Kafka version, a rebalance will cause every single consumer in the consumer group to stop consuming until the rebalance is done.
To solve your issue, I see two options:
Only remove running pods out of the Consumer Group when they are idle
Reduce the consumer configuration auto.commit.interval.ms to 1, as this defaults to 5 seconds. This will only work if you set enable.auto.commit to true (see the sketch below).
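For the second option, a minimal sketch of the relevant settings in the same props style as the question (the interval value is purely illustrative):
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");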
If you want your consumer to commit messages before exiting, you would need to handle the exit signal in your consumer. A lot of languages support this. Have a look at this thread on how to do it in Java: How to finish kafka consumer safety?(Is there meaning to call thread#join inside shutdownHook ? ).
That being said, please note that there is no 100% guarantee of achieving exactly-once. Your process can be killed forcefully by the OS before it is even given time to run any exit cleanup (kill -9 <process_id>).
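As a rough, self-contained sketch of the shutdown-hook approach mentioned above (this is not the poster's code; the topic name and configuration values are placeholders, and enable.auto.commit is switched off because the loop commits manually):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class GracefulShutdownConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "example-group");              // placeholder
        props.put("enable.auto.commit", "false");            // commit manually below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        final Thread mainThread = Thread.currentThread();

        // SIGTERM (e.g. a Kubernetes pod shutdown) runs the JVM shutdown hooks
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            consumer.wakeup();          // makes the blocked poll() throw WakeupException
            try {
                mainThread.join();      // wait until the poll loop has committed and closed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        try {
            consumer.subscribe(Collections.singletonList("in-topic"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process the record and produce to the out-topic here
                }
                consumer.commitSync();  // commit only after processing succeeded
            }
        } catch (WakeupException e) {
            // expected during shutdown, nothing else to do
        } finally {
            consumer.commitSync();      // commit whatever was processed last
            consumer.close();
        }
    }
}

With a Kubernetes terminationGracePeriodSeconds long enough to cover one poll-process-commit cycle, this closes the window in which a processed message is left uncommitted.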

Kafka : Failed to update metadata after 60000 ms with only one broker down

We have a kafka producer configured as -
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092
serializer.class=kafka.serializer.StringEncoder
request.required.acks=1
request.timeout.ms=30000
batch.num.messages=25
message.send.max.retries=3
producer.type=async
compression.codec=snappy
The replication factor is 3 and the total number of partitions is currently 108.
The rest of the properties are defaults.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down, and our producer started to log
"Failed to update metadata after 60000 ms". Nothing else was in the log; we were only seeing this error. At intervals, a few requests were getting blocked, even though the producer was async.
This issue was resolved when the broker was again up and running.
What can be the reason for this? As per my understanding, one broker being down should not affect the system as a whole.
Posting the answer for someone who might face this issue -
The reason is an older version of the Kafka producer. Kafka producers take the bootstrap servers as a list. In older versions, when fetching metadata, the producer tries to connect to the servers in round-robin fashion. So, if one of the brokers is down, the requests going to that broker fail, and this message appears.
Solution:
Upgrade to a newer producer version.
Alternatively, reduce the metadata.fetch.timeout.ms setting: this ensures the main thread is not blocked for long and send fails sooner. The default value is 60000 ms. This is not needed in newer versions.
Note: the Kafka send method blocks until the producer is able to write to its buffer.
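For reference, a rough equivalent of the configuration above on the newer Java producer would look something like this (property names are the current producer ones; the values simply mirror the question where a direct equivalent exists and are not recommendations):
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092,broker4:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
acks=1
retries=3
compression.type=snappy
max.block.ms=60000
In the newer producer, max.block.ms is the setting that bounds how long send() can block while waiting for metadata.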
I got the same error because I forgot to create the topic. Once I created the topic the issue was resolved.

Kafka consumer group keep moving to PreparingRebalance state and stops consuming

We have a Kafka Streams consumer group that keeps moving to the PreparingRebalance state and stops consuming.
The pattern is as follows:
Consumer group is running and stable for around 20 minutes
New consumers (members) start to appear in the group state without any clear reason; these new members originate from only a small number of VMs (not the same VMs each time), and they keep joining
Group state changes to PreparingRebalance
All consumers stop consuming, showing these logs:
"Group coordinator ... is unavailable or invalid, will attempt rediscovery"
The consumers on the VMs that generated the extra members show these logs:
Offset commit failed on partition X at offset Y: The coordinator is not aware of this member.
Failed to commit stream task X since it got migrated to another thread already. Closing it as zombie before triggering a new rebalance.
Detected task Z that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Will try to rejoin the consumer group.
We kill all consumer processes on all VMs, the group moves to Empty with 0 members, we start the processes and we're back to step 1
Kafka version is 1.1.0, streams version is 2.0.0
We took thread dumps from the misbehaving consumers, and didn't see more consumer threads than configured.
We tried restarting the Kafka brokers and cleaning the ZooKeeper cache.
We suspect that the issue has to do with missing heartbeats, but the default heartbeat interval is 3 seconds and message handling times are nowhere near that.
Anyone encountered a similar behaviour?

Kafka 0.10.0.1 partition reassignment after broker failure

I'm testing kafka's partition reassignment as a precursor to launching a production system. I have several topics with 9 partitions each and a replication factor of 3. I've killed one of the brokers to simulate a failure condition and verified that some topics became under replicated (verification done via a fork of yahoo's kafka manager modified to allow adding a version 0.10.0.1 cluster).
I then started a new broker with a different id. I would now like to distribute partitions to this new broker. I attempted to use kafka manager's reassign partitions functionality however that did not work (possibly due to an improperly modified fork).
I saw that Kafka comes with a bin/kafka-reassign-partitions.sh script, but the docs say that I have to manually write out the partition reassignments for each topic in JSON format. Is there a way to handle this without manually deciding which brokers the partitions must go to?
Hmm, what a coincidence that I was doing exactly the same thing today. I don't have an answer you're probably going to like, but I achieved what I wanted in the end.
Ultimately, what I did was execute the kafka-reassign-partitions command with the reassignment that the same tool proposed. But in whatever it generated, I just replaced the new broker id with the old failed broker id. For some reason the generated JSON moved everything around.
This will fail (or rather never complete) because the old broker has passed on. I then had to delete the reassignment operation in zookeeper (znode: admin/reassign_partitions or something).
Then I restarted kafka on the new broker and it magically picked up as leader of the partition that was looking for a new replacement leader.
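For reference, the generate-then-execute flow described above looks roughly like this on a 0.10.x cluster (the ZooKeeper address, file names, topic, and broker ids are all illustrative):
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 --topics-to-move-json-file topics.json --broker-list "1,2,5" --generate
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 --reassignment-json-file reassignment.json --execute
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 --reassignment-json-file reassignment.json --verify
Here topics.json lists the topics to move, e.g. {"version":1,"topics":[{"topic":"my-topic"}]}, and reassignment.json is the proposal printed by --generate, which is where the broker ids can be edited before running --execute.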
I'll let you know if everything is still working tomorrow and if I still have a job ;-)