Delete the Kafka Connect topics without stopping the process - apache-kafka

I was running a Kafka Connect worker in distributed mode (it's a test cluster). I wanted to reset the default connect-* topics, so I deleted them without stopping the worker. After restarting the worker, I'm getting this error:
ERROR [Worker clientId=connect-1, groupId=debezium-cluster1] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:324)
org.apache.kafka.common.config.ConfigException:
Topic 'connect-offsets' supplied via the 'offset.storage.topic' property is required to have 'cleanup.policy=compact' to guarantee consistency and durability of source connector offsets,
but found the topic currently has 'cleanup.policy=delete'.
Continuing would likely result in eventually losing source connector offsets and problems restarting this Connect cluster in the future.
Change the 'offset.storage.topic' property in the Connect worker configurations to use a topic with 'cleanup.policy=compact'.

Deleting the internal topics while the workers are still running sounds risky. The workers have internal state, which now no longer matches the state in the Kafka brokers.
A safer approach would be to shut down the workers (or at least shut down all the connectors), delete the topics, and then restart the workers/connectors.
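For reference, once the workers are stopped, the internal topics can be deleted with the Java AdminClient. This is only a sketch, assuming the default topic names and a broker reachable at localhost:9092:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class DeleteConnectInternalTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        try (AdminClient admin = AdminClient.create(props)) {
            // Default Connect internal topic names; adjust if your worker config overrides them.
            admin.deleteTopics(Arrays.asList("connect-offsets", "connect-configs", "connect-status"))
                 .all()
                 .get(); // block until the deletions complete
        }
    }
}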

It looks like the topics got auto-created, perhaps by the workers when you deleted them mid-flight.
You could manually apply the suggested configuration change to the topic, or you could specify a new set of topics for the worker to use (connect01- for example) and let the worker recreate them correctly.
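If you go the route of fixing the existing topic, a minimal sketch of applying cleanup.policy=compact to connect-offsets with the Java AdminClient (again assuming a broker at localhost:9092; kafka-configs.sh can make the same change from the command line):

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class CompactConnectOffsetsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "connect-offsets");
            // Set cleanup.policy=compact, which Connect requires for its offset storage topic.
            AlterConfigOp setCompact =
                    new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(topic, Collections.singletonList(setCompact));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}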

Related

Duplicate messages when using Kafka MirrorMaker during problems on the source cluster

We have a remote Kafka cluster that belongs to an external service, from which we pull data into our internal Kafka cluster using MirrorMaker.
The following situation occurred: on the external service's side, one of the cluster's brokers went down for technical reasons.
The following appeared in the MirrorMaker logs:
...
ERROR [Consumer clientId=XXX-1, groupId=YYY] Offset commit failed on partition PARTITION_NAME at offset 123456: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
WARN Failed to commit offsets because the consumer group has rebalanced and assigned partitions to another instance. If you see this regularly, it could indicate that you need to either increase the consumer's session.timeout.ms or reduce the number of records handled on each iteration with max.poll.records (kafka.tools.MirrorMaker$)
...
The consumers then reconnected to live nodes in the cluster and continued reading messages.
The problem is that, because of the broker failure on the external Kafka side, messages could be read but their offsets could not be committed. As a result, after the rebalance those messages were read again and duplicates appeared in our internal cluster.
Are there any ways to avoid duplicates in the internal cluster in this situation (apart from those indicated in the log warning)?
Perhaps there are some consumer configuration parameters that would help solve the duplicate problem.

Kafka Streams app threads fail transactions and are fenced and restarted after Kafka broker restart

We are noticing Streams Apps threads fail transactions during rolling restarts of our Kafka Brokers. The transaction failure causes stream thread fencing which in turn causes a restart of the thread and re-balancing. The re-balancing causes some delay in processing. Our goal is to make broker restarts as smooth as possible and prevent processing delays as much as possible.
For our rolling Broker restarts we use the controlled.shutdown=true configuration, and before each restart we wait for all partitions to be in-sync across all replicas.
For our Streams Apps we have properly configured group.instance.id and an appropriate session.timeout.ms so that rolling restarts of the streams apps themselves are smooth and without re-balances.
From the Kafka Streams app logs I have identified a sequence of events leading up to the fencing:
Broker starts shutting down
App logs error producing to topic due to NOT_LEADER_OR_FOLLOWER
App heartbeats fail because the group coordinator is on the restarting broker
App discovers a new group coordinator (this bounces a bit between the restarting broker and the live brokers)
App stabilizes
Broker starting up again
App fails to do fetch request to starting broker due to FETCH_SESSION_ID_NOT_FOUND
App discovers starting broker as transaction coordinator
App transaction fails due to one of two reasons:
InvalidProducerEpochException: Producer attempted to produce with an old epoch.
ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one
Stream threads end up in fatal error state, get fenced and restarted which causes a rebalance.
What could be causing the two exceptions that cause stream thread transactions to fail? My intuition is that the broker starting up is assigned as transaction coordinator before it has synced its transaction states with the in-sync brokers. This could explain old epochs or different transactional ids to be known by that broker.
How can we further identify what is going wrong here and how it can be improved?
You can set request.timeout.ms in Kafka Streams, which will make the Streams API wait for a longer period of time. Only if the Kafka broker does not come back within that period will it throw an exception, which can then be handled using a ProductionExceptionHandler, as described in Handling exceptions in Kafka Streams.
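As a rough illustration of that suggestion (not the asker's exact setup): raising the embedded producer's request.timeout.ms via the Streams config and plugging in a custom ProductionExceptionHandler could look like the sketch below. The handler name, application id, broker address and timeout value are made up for the example, and whether CONTINUE is acceptable depends on your delivery guarantees.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;

public class TolerantProductionExceptionHandler implements ProductionExceptionHandler {

    @Override
    public ProductionExceptionHandlerResponse handle(ProducerRecord<byte[], byte[]> record,
                                                     Exception exception) {
        // Log the failed record and keep the stream thread alive;
        // returning FAIL would surface the error and stop the thread instead.
        System.err.println("Failed to produce to " + record.topic() + ": " + exception);
        return ProductionExceptionHandlerResponse.CONTINUE;
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no-op
    }

    // Example Streams configuration wiring up the handler and a longer producer request timeout.
    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
        // Give the embedded producer more time to ride out a broker restart.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 120000);
        props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
                  TolerantProductionExceptionHandler.class);
        return props;
    }
}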

Create Producer when the first broker in the list of brokers is down

I have a multi-node Kafka cluster which I use for consuming and producing.
In my application, I use confluent-kafka-go(1.6.1) to create producers and consumers. Everything works great when I produce and consume messages.
This is how I configure my bootstrap server list
"bootstrap.servers":"localhost:9092,localhost:9093,localhost:9094"
But the moment I start putting the brokers' IP addresses in bootstrap.servers, and the first broker in the list is down, the producer repeatedly fails to be created with:
Failed to initialize Producer ID: Local: Timed out
If I remove the IP of the failed node, producing and consuming messages work.
If the broker is down after I create the producer/consumer, they continue to be usable by switching over to other nodes.
How should I configure bootstrap.servers in such a way that the producer will be created using the available nodes?
You shouldn't really be running 3 brokers on the same machine anyway, but using multiple unique servers works fine for me when the first is down (and the cluster elects a different leader if it needs to), so it sounds like you either lost the leader of your topic partitions or you've lost the Controller. Enabling retries on the producer should let it recover by itself (by making a new metadata request for the partition leaders).
Overall, it's just a CSV; there's no other way to configure that property itself. You could stick a reverse proxy in front of the brokers that resolves only to healthy nodes, but then you'd be conflicting with a potential DNS cache.
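For what it's worth, these are plain client properties (in confluent-kafka-go the keys are passed through to librdkafka under the same names, as far as I know). A hedged sketch with the Java client, using made-up broker addresses and a placeholder topic:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MultiBrokerProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // List every broker; the client only needs one reachable address from this list
        // to fetch full cluster metadata, so a single dead node should not be fatal.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "10.0.0.1:9092,10.0.0.2:9092,10.0.0.3:9092");      // hypothetical addresses
        // Retries let the producer refresh metadata and find the new partition leaders.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value")); // placeholder topic
        }
    }
}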

What happens if Zookeeper fails completely?

We have set up a Kafka/ZooKeeper cluster consisting of 3 brokers. We have one producer sending messages to one specific Kafka topic and a few consumer groups reading from said topic. Those consumers perform a leader election among themselves via ZooKeeper (independent of Kafka).
The versions used are:
Kafka: 0.9.0.1
Zookeeper: 3.4.6 (included in the Kafka-Package)
All processes are managed by Supervisor. So far, everything works just fine. What we tried now (for testing purposes) was to simply kill off all Zookeeper processes and see what happens.
As we expected, our consumer processes couldn't connect to Zookeeper anymore. But unexpectedly, the Kafka Brokers still worked. Our producer didn't complain at all and was still able to write into the topic. While I couldn't use kafka/bin/kafka-topics.sh or similar, since they all require a zookeeper-parameter, I could still see the actual size of the topic-log grow. After restarting the zookeeper processes, everything again worked just like before.
What we couldn't figure out is now... what actually happened there?
We thought Kafka would require a working ZooKeeper connection, and we couldn't find any explanation for this behaviour online.
When you have a single ZooKeeper node and it goes down, the brokers cannot contact ZooKeeper; once a broker discovers that ZooKeeper is not reachable, the broker itself also becomes unavailable, and hence so do the producer and consumer.
In the producer's case, it starts dropping (rejecting) records. In the consumer's case, records that were read but not yet acknowledged may end up being processed again once the broker is back up and ready.
With a 3-node ZooKeeper ensemble, one node failure is acceptable because the quorum is still satisfied, but it cannot afford two node failures, which would lead to the consequences above.

Apache Kafka Storm, persistence during maintenance

I have Ubuntu 14.04 LTS. I use a Node.js -> Kafka -> Storm -> MongoDB chain. During initial development everything went well, and messages ended up stored in MongoDB.
On the Kafka side, I have one ZooKeeper and broker0 on kafka1, and broker1 on kafka2. On the Storm side, ZooKeeper, Nimbus and DRPC are located on storm1; the Supervisor and a worker are located on storm2.
Now the question is what happens when I update storm1 and storm2. I stopped all processes on storm1 and storm2, assuming Kafka would buffer the messages coming from Node.js. After I restarted both storm1 and storm2 and redeployed the topology, I found that the messages sent while storm1 and storm2 were down had been lost. So it seems Kafka did not persist the messages during the Storm maintenance period.
My understanding was that Kafka would remember the offset of the last message for which it received an acknowledgement.
In short, how can I prevent messages from being lost while Storm is under maintenance?