Does a broker going down while producing messages cause an exception on the Kafka producer side? - apache-kafka

I am testing the following scenario.
I am producing messages to a sink, which is a Kafka cluster with three brokers.
If some brokers go down, does the producing side run into any issue because of the broker failure?
When I tested this locally using Flink, I generated messages and sank them to Kafka, and I had three Kafka brokers. When I brought the number of brokers down to two, there were no problems. And obviously, when all the brokers go down, the producer-side app throws an exception.
So, based on these observations, I think the producer-side app can stay alive without any errors as long as at least one broker remains. Is my assumption correct?
Below is my producer-side configuration.
acks = 1
batch.size = 16384
compression.type = lz4
connections.max.idle.ms = 540000
delivery.timeout.ms = 120000
enable.idempotence = false
key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
linger.ms = 0
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
The replication factor is two and I have three partitions for each topic.
Any help will be appreciated.
Thanks.

It all depends on your requirements and your producer configuration. At the moment, yes you can have 2 out of 3 brokers alive and your producer will continue as normal.
This is because you have acks=1 which means only the leader has to acknowledge the message before it is considered successful. The followers don't have to acknowledge the message.
You should also check whether you have changed min.insync.replicas at the broker or topic level configuration. The default is 1, meaning only 1 in-sync replica is needed for a broker to allow acks=all requests.
Side note: you have replication=2; I'd change this so partitions are replicated across all 3 brokers.
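As a hedged sketch of what this answer describes: with acks=all on the producer and replication.factor=3 plus min.insync.replicas=2 on the topic, an acknowledged write survives one broker going down. Class and broker names below are placeholders, not taken from the question.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class DurableProducerFactory {
    public static KafkaProducer<byte[], byte[]> create() {
        Properties props = new Properties();
        // Placeholder bootstrap servers; list all three brokers so the client
        // can still connect when one of them is down.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        // acks=all: the leader waits for all in-sync replicas before acknowledging.
        // With replication.factor=3 and min.insync.replicas=2 on the topic, one
        // broker can be down without acknowledged writes being lost.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        return new KafkaProducer<>(props);
    }
}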

I'm not sure I understood your question, but in the Kafka client API there are some retriable exceptions (such as NotLeaderForPartition, or an unreachable/unknown host).
So your producer will retry until it reaches the first of these two limits:
retries: https://kafka.apache.org/documentation/#producerconfigs_retries
delivery.timeout.ms: https://kafka.apache.org/documentation/#producerconfigs_delivery.timeout.ms
So, using the default values:
retries > 2 billion &
delivery.timeout.ms = 2 minutes
your producer will retry for only 2 minutes, then the send fails with a TimeoutException.
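To make that failure visible, you can pass a callback to send(); once delivery.timeout.ms expires, the callback receives the exception. A minimal sketch, assuming a producer configured as in the question; the topic name, key, and value are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.TimeoutException;

public class SendWithCallback {
    public static void send(KafkaProducer<byte[], byte[]> producer) {
        byte[] key = "my-key".getBytes(StandardCharsets.UTF_8);      // placeholder
        byte[] value = "my-value".getBytes(StandardCharsets.UTF_8);  // placeholder
        producer.send(new ProducerRecord<>("my-topic", key, value), (metadata, exception) -> {
            if (exception instanceof TimeoutException) {
                // Retries of retriable errors did not succeed within delivery.timeout.ms.
                System.err.println("Delivery timed out after exhausting retries: " + exception);
            } else if (exception != null) {
                System.err.println("Non-retriable send failure: " + exception);
            }
        });
    }
}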

Related

__consumer_offsets topic with very big partitions

I am using Kafka 2.0.0
Some partitions of the __consumer_offsets topic are 500-700 GB and have more than 5000-7000 segments. These segments are older than 2-3 months.
There are no errors in the logs, and that topic uses the default compact cleanup policy.
What could the problem be?
Maybe a config or a consumer problem? Or maybe a bug in Kafka 2.0.0?
What checks could I do?
My settings:
log.cleaner.enable=true
log.cleanup.policy = [delete]
log.retention.bytes = -1
log.segment.bytes = 268435456
log.retention.hours = 72
log.retention.check.interval.ms = 300000
offsets.commit.required.acks = -1
offsets.commit.timeout.ms = 5000
offsets.load.buffer.size = 5242880
offsets.retention.check.interval.ms = 600000
offsets.retention.minutes = 10080
offsets.topic.compression.codec = 0
offsets.topic.num.partitions = 50
offsets.topic.replication.factor = 3
offsets.topic.segment.bytes = 104857600
Try restarting the cluster. It will resolve the issue, but it takes a lot of time to rebalance because of the size of the topic.
The log cleaner threads may have crashed in your brokers. Restarting the brokers restarts those threads and cleaning will start again.
log.cleaner.threads defaults to one in Kafka. Increase it, so that if one thread crashes there is still another one running.
If this is the case, there should be messages about it in the server logs.
Could it be that you have an application looping and creating a different consumer group every time?
You could use the command below to look inside your __consumer_offsets topic and check whether consumer group names repeat; maybe you have some users creating many consumer groups in loops or running console consumers...
echo "exclude.internal.topics=false" > /tmp/consumer.config
./kafka-console-consumer.sh --consumer.config /tmp/consumer.config \
--formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" \
--bootstrap-server localhost:9092 --topic __consumer_offsets --from-beginning
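Alternatively, only as a hedged sketch, the AdminClient (available from Kafka 2.0) can list consumer groups programmatically; an ever-growing list of generated group ids would point to the looping-group problem described above. The bootstrap address is a placeholder.

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupListing;

public class ListGroups {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Print every consumer group the brokers know about; a very large or
            // steadily growing list suggests groups are being created in a loop.
            for (ConsumerGroupListing group : admin.listConsumerGroups().all().get()) {
                System.out.println(group.groupId());
            }
        }
    }
}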

How to properly configure Kafka Stream to ensure brokers resiliency?

Recently I tried to ensure that my Kafka Streams (version 2.0.0) application stays up and running (streams resume) after the Kafka brokers come back up following an unspecified downtime of all of the brokers (downtime more like a few hours than a few seconds).
It did not come back with the default configuration (retries at 0): after all brokers had been killed, all my streams changed state to ERROR or even DEAD and stopped working:
INFO o.a.k.s.p.internals.StreamThread - stream-thread [MyStream-0936f6a6-c9f4-4591-9b25-534abc65b8d1-StreamThread-24] State transition from RUNNING to PENDING_SHUTDOWN
and then:
INFO o.a.k.s.p.internals.StreamThread - stream-thread [MyStream-0936f6a6-c9f4-4591-9b25-534abc65b8d1-StreamThread-24] State transition from PENDING_SHUTDOWN to DEAD
And some of the streams went straight to ERROR:
stream-client [MyStream-88a8fe9a-d565-43e3-acb5-20cccc6b4a88] State transition from RUNNING to ERROR
As a result, considering the state transitions (https://static.javadoc.io/org.apache.kafka/kafka-streams/2.0.0/org/apache/kafka/streams/KafkaStreams.State.html), I can't resume in any way other than restarting the application.
What I did find was this topic:
Kafka Streams stops listening from topic and processing messages when broker goes down
It answers my question with the advice to increase my streams retries as well as the retry.backoff.ms config. This is what I did (increased retries to Int.MaxValue and retry.backoff.ms to 1000) and found that there are some performance issues with this approach, and I keep getting a recurring error in the logs, FETCH_SESSION_ID_NOT_FOUND, which I cannot really find information about.
Are there any other ways of achieving broker resiliency than increasing the retries number? I can accept some messages being lost after a Kafka broker failure, and I don't really need to retry if a message cannot be produced/consumed. I was thinking of manually restarting the stream after a broker failure, but I'm not really sure how to catch the "broker down exception". What do you think?
This is my streams config:
buffered.records.per.partition = 1000
cache.max.bytes.buffering = 10485760
commit.interval.ms = 30000
connections.max.idle.ms = 540000
default.key.serde = ... Serdes$StringSerde
default.production.exception.handler = ... DefaultProductionExHandler
default.timestamp.extractor = ... FailOnInvalidTimestamp
default.value.serde = ... Serdes$StringSerde
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
num.standby.replicas = 0
num.stream.threads = 8
partition.grouper = ... DefaultPartitionGrouper
poll.ms = 100
processing.guarantee = at_least_once
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
replication.factor = 1
request.timeout.ms = 40000
retries = 0
retry.backoff.ms = 1000
rocksdb.config.setter = null
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
state.cleanup.delay.ms = 600000
state.dir = /tmp/kafka-state
topology.optimization = all
upgrade.from = null
windowstore.changelog.additional.retention.ms = 86400000
Increasing retries is the correct way to handle broker downtime.
I am not sure why increasing the number of retries would have a performance impact, though. Can you elaborate on this?
Besides this, there is nothing like a "broker down exception". You can register an uncaught exception handler on the KafkaStreams instance. It informs you about a dying thread, and thus you can react to it by restarting the KafkaStreams client.
You could also try to change the default.production.exception.handler and skip over some production errors. Note that you might lose all buffered messages in the client in this case. (Also note that you cannot skip all errors with the handler -- some errors are considered fatal and KafkaStreams will shut down regardless of your config.)
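A minimal sketch of the restart-on-dead-thread idea, assuming the topology and streams configuration are built elsewhere; in the 2.0.x API the handler is a plain Thread.UncaughtExceptionHandler. This is only an illustration, not the library's prescribed pattern.

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

public class RestartableStreams {
    private final Topology topology;   // assumed to be built elsewhere
    private final Properties config;   // assumed streams configuration
    private volatile KafkaStreams streams;

    public RestartableStreams(Topology topology, Properties config) {
        this.topology = topology;
        this.config = config;
    }

    public synchronized void start() {
        streams = new KafkaStreams(topology, config);
        // Fires when a StreamThread dies, e.g. after retries are exhausted.
        streams.setUncaughtExceptionHandler((thread, throwable) -> {
            System.err.println("Stream thread " + thread.getName() + " died: " + throwable);
            restart();
        });
        streams.start();
    }

    private void restart() {
        // Close the old client and start a fresh one on a separate thread, so we
        // do not block (or deadlock on) the dying stream thread itself.
        final KafkaStreams old = streams;
        new Thread(() -> {
            old.close();
            start();
        }).start();
    }
}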

Apache Kafka: Lowering `request.timeout.ms` causes metadata fetch failures?

I have a 9 broker, 5 node Zookeeper Kafka setup.
In order to reduce the time for reporting failures, we had set request.timeout.ms to 3000. However, with this setting, I'm observing some weird behavior.
Occasionally, I'm seeing client (producer) getting an error:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
This doesn't happen always. There are some producers that work just fine.
When I bumped up the request.timeout.ms value, I didn't see any errors.
Any idea why lowering request.timeout.ms causes metadata fetch timeouts?
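For reference, a minimal sketch of the two producer settings in play, assuming the Java client: the 60000 ms in the error message corresponds to the default of max.block.ms, which bounds how long send() waits for metadata, while request.timeout.ms bounds a single request.

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class TimeoutSettings {
    public static Properties build() {
        Properties props = new Properties();
        // Per-request response timeout (the value lowered to 3000 in this setup).
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "3000");
        // Maximum time send()/partitionsFor() may block waiting for metadata;
        // its default of 60000 ms is the figure in "Failed to update metadata after 60000 ms".
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");
        return props;
    }
}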

Kafka Consumer - receiving messages Inconsistently

I can send and receive messages on the command line against a local Kafka installation. I can also send messages through Java code, and those messages show up in a Kafka command prompt. I also have Java code for the Kafka consumer. The code received messages yesterday; it doesn't receive any messages this morning, however. The code has not been changed. I am wondering whether the property configuration is quite right or not. Here is my configuration:
The Producer:
bootstrap.servers - localhost:9092
group.id - test
key.serializer - StringSerializer.class.getName()
value.serializer - StringSerializer.class.getName()
and the ProducerRecord is set as
ProducerRecord<String, String>("test", "mykey", "myvalue")
The Consumer:
zookeeper.connect - "localhost:2181"
group.id - "test"
zookeeper.session.timeout.ms - 500
zookeeper.sync.time.ms - 250
auto.commit.interval.ms - 1000
key.deserializer - org.apache.kafka.common.serialization.StringDeserializer
value.deserializer - org.apache.kafka.common.serialization.StringDeserializer
and for Java code:
// Old (ZooKeeper-based) high-level consumer API
Map<String, Integer> topicCount = new HashMap<>();
topicCount.put("test", 1);  // one stream for the "test" topic
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams =
        consumer.createMessageStreams(topicCount);
List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get("test");
What is missing?
A number of things could be going on.
First, your consumer's ZooKeeper session timeout is very low, which means the consumer may be experiencing many "soft failures" due to garbage collection pauses. When this happens, the consumer group rebalances, which can pause consumption. And if this is happening very frequently, the consumer could get into a state where it never consumes messages because it's constantly being rebalanced. I suggest increasing the ZooKeeper session timeout to 30 seconds to see if this resolves the issue. If so, you can experiment with setting it lower.
Second, can you confirm new messages are being produced to the "test" topic? Your consumer will only consume new messages that it hasn't committed yet. It's possible the topic doesn't have any new messages.
Third, do you have other consumers in the same consumer group that could be processing the messages? If one consumer is experiencing frequent soft failures, other consumers will be assigned its partitions.
Finally, you're using the "old" consumer which will eventually be removed. If possible, I suggest moving to the "new" consumer (KafkaConsumer.java) which was made available in Kafka 0.9. Although I can't promise this will resolve your issue.
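If you do migrate, a minimal sketch of the new KafkaConsumer for the same "test" topic and group might look like the following (assuming a reasonably recent client that supports poll(Duration); the bootstrap address is a placeholder):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NewConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
        // Start from the beginning of the topic if the group has no committed offsets yet.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}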
Hope this helps.

Kafka rolling restart: Data is lost

As part of our current Kafka cluster, high-availability (HA) testing is being done. The objective is: while a producer job is pushing data to a particular partition of a topic, all the brokers in the Kafka cluster are restarted sequentially (stop the first broker, restart it, and after the first broker comes up, do the same steps for the second broker, and so on). The producer job pushes around 7 million records over about 30 minutes while this test is going on. At the end of the job, it was noticed that around 1000 records are missing.
Below are specifics of our Kafka cluster: (kafka_2.10-0.8.2.0)
-3 Kafka brokers each with 2 100GB mounts
Topic was created with:
-Replication factor of 3
-min.insync.replicas=2
server.properties:
broker.id=1
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/drive1,/drive2
num.partitions=1
num.recovery.threads.per.data.dir=1
log.flush.interval.messages=10000
log.retention.hours=1
log.segment.bytes=1073741824
log.retention.check.interval.ms=1800000
log.cleaner.enable=false
zookeeper.connect=ZK1:2181,ZK2:2181,ZK3:2181
zookeeper.connection.timeout.ms=10000
advertised.host.name=XXXX
auto.leader.rebalance.enable=true
auto.create.topics.enable=false
queued.max.requests=500
delete.topic.enable=true
controlled.shutdown.enable=true
unclean.leader.election=false
num.replica.fetchers=4
controller.message.queue.size=10
Producer.properties (async producer with the new producer API)
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
acks=all
buffer.memory=33554432
compression.type=snappy
batch.size=32768
linger.ms=5
max.request.size=1048576
block.on.buffer.full=true
reconnect.backoff.ms=10
retry.backoff.ms=100
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
Can someone share any info about Kafka clusters and HA to ensure that data will not be lost while rolling-restarting Kafka brokers?
Also, here is my producer code. This is a fire-and-forget kind of producer; we are not handling failures explicitly as of now. It works fine for millions of records. I see the problem only when Kafka brokers are restarted as explained above.
public void sendMessage(List<byte[]> messages, String destination, Integer partition, String kafkaDBKey) {
    // Fire-and-forget: no callback is passed, so send failures are never observed.
    for (byte[] message : messages) {
        producer.send(new ProducerRecord<byte[], byte[]>(destination, partition, kafkaDBKey.getBytes(), message));
    }
}
By increasing the default retries value from 0 to 4000 on the producer side, we are able to send data successfully without losing any.
retries=4000
Due to this setting, there is a possibility of sending the same message twice, and messages may be out of sequence by the time the consumer receives them (the second message might arrive before the first). But for our current problem that is not an issue, and it is handled on the consumer side to ensure everything is in order.
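As a follow-up sketch (my own illustration, not part of the original answer): with retries enabled, attaching a callback to the same fire-and-forget send at least makes any remaining failures visible in the logs during a rolling restart. The class name is a placeholder; the producer is assumed to be configured as in Producer.properties above.

import java.util.List;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LoggingSender {
    private final KafkaProducer<byte[], byte[]> producer;

    public LoggingSender(KafkaProducer<byte[], byte[]> producer) {
        this.producer = producer;
    }

    // Same shape as sendMessage above, but failed sends are logged instead of dropped.
    public void sendMessage(List<byte[]> messages, String destination, Integer partition, String kafkaDBKey) {
        Callback logOnFailure = (metadata, exception) -> {
            if (exception != null) {
                System.err.println("Send to " + destination + " failed: " + exception);
            }
        };
        for (byte[] message : messages) {
            producer.send(
                new ProducerRecord<>(destination, partition, kafkaDBKey.getBytes(), message),
                logOnFailure);
        }
    }
}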