Why is replica fetch bytes greater than max message bytes allowed? - apache-kafka

Recently, we faced an issue in our Kafka cluster where we overrode the max.message.bytes value for a topic (which had a replication factor of 3) to a value larger than replica.fetch.max.bytes. We did not see issues immediately, but when a message (replica.fetch.max.bytes < message size < max.message.bytes) was produced later, we started seeing the error below in our logs.
Replication is failing due to a message that is greater than replica.fetch.max.bytes for partition [<topic-name>,1]. This generally occurs when the max.message.bytes has been overridden to exceed this value and a suitably large message has also been sent. To fix this problem increase replica.fetch.max.bytes in your broker config to be equal or larger than your settings for max.message.bytes, both at a broker and topic level
Since we did not want to restart our Kafka brokers and perform a rolling restart of the cluster immediately, we temporarily decreased the replication factor to 1 (not highly available, I know).
So, are there any use cases where such a configuration is actually useful? If yes, what are they? Also, are there better ways to mitigate this problem instead of stopping replication?

My guess is that since max.message.bytes is per topic (and thus stored in ZooKeeper and updatable at any point in time), while replica.fetch.max.bytes is per broker, it cannot be checked or guaranteed that a topic's max.message.bytes is <= a broker's replica.fetch.max.bytes.
I also found an old ticket regarding this very problem:
https://issues.apache.org/jira/browse/KAFKA-1844
The Kafka broker on startup checks that the configured replica.fetch.max.bytes >= message.max.bytes. But users can override message.max.bytes per topic, and no such validation happens for the per-topic message.max.bytes. If users configure message.max.bytes > replica.fetch.max.bytes, followers won't be able to fetch data.
Also from the documentation about replica.fetch.max.bytes, it seems that in some cases it would still work:
This is not an absolute maximum, if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that progress can be made. The maximum record batch size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config).
So all in all, it doesn't seem to make sense to allow it, and it is a known issue.
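If you want to guard against this gap yourself, you can compare the two settings before overriding a topic. Below is a minimal sketch using the Java AdminClient; the bootstrap address, topic name ("my-topic"), and broker id ("0") are illustrative assumptions, not values from the question.

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ReplicaFetchCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // hypothetical topic
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");      // assumed broker id

            Map<ConfigResource, Config> configs =
                admin.describeConfigs(Arrays.asList(topic, broker)).all().get();

            long maxMessageBytes =
                Long.parseLong(configs.get(topic).get("max.message.bytes").value());
            long replicaFetchMaxBytes =
                Long.parseLong(configs.get(broker).get("replica.fetch.max.bytes").value());

            // This is the check KAFKA-1844 says the broker does not perform for per-topic overrides.
            if (maxMessageBytes > replicaFetchMaxBytes) {
                System.out.printf("WARNING: max.message.bytes (%d) exceeds replica.fetch.max.bytes (%d); "
                        + "replication of large messages will stall%n",
                    maxMessageBytes, replicaFetchMaxBytes);
            }
        }
    }
}

Running such a check against every broker before applying a topic override would have caught the situation described in the question.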

Brokers allocate a buffer of replica.fetch.max.bytes for each partition they replicate. If replica.fetch.max.bytes is set to 1 MiB and you have 1000 partitions, about 1 GiB of RAM is required.
When the value of message.max.bytes (or the topic-level max.message.bytes) is greater than replica.fetch.max.bytes, it can create situations where a batch won't fit into the allocated buffer. Hence it is important to keep replica.fetch.max.bytes greater than or equal to message.max.bytes. Otherwise the broker will still accept messages but fail to replicate them, leading to potential data loss.
The value of max.message.bytes is usually increased to get higher throughput, or because individual messages are larger than the 1 MB default.
Please ensure that the number of partitions multiplied by the size of the largest message does not exceed the available memory.
As for the solution, replica.fetch.max.bytes is a read-only broker-level config, so a restart will be required.
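As a rough back-of-the-envelope check for the memory point above, here is a tiny sketch of the arithmetic; the partition count is just an example.

public class ReplicaFetchMemoryEstimate {
    public static void main(String[] args) {
        long replicaFetchMaxBytes = 1024L * 1024;   // 1 MiB, the default
        int replicatedPartitions = 1000;            // example partition count

        // Worst-case fetch-buffer footprint: one buffer per replicated partition.
        long estimatedBytes = replicaFetchMaxBytes * replicatedPartitions;
        System.out.printf("~%.2f GiB of fetch buffer memory%n",
            estimatedBytes / (1024.0 * 1024 * 1024));
    }
}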

Related

Kafka config replica.fetch.max.bytes on a per-topic level

I would like to set a Kafka cluster to only allow large messages on a particular topic. From the docs I see that if I wanted to do this at the level of the entire cluster I could do so by setting message.max.bytes to allow a larger amount of data on the broker and replica.fetch.max.bytes to allow it to be replicated, but my understanding is that this would increase memory usage for all topics in my cluster, not just the one that I know can receive large messages. There is also a topic-level setting max.message.bytes that controls the maximum size of messages, but I don't see a topic-level setting controlling the maximum data size of replication operations. It seems strange that one of these closely tied settings is not configurable at a topic level; perhaps I'm missing where such a setting is, or there is another way to accomplish these goals?
replica.fetch.max.bytes can only be set on the broker level. However, you can set max.partition.fetch.bytes on the consumer side:
The maximum amount of data per-partition the server will return. Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress. The maximum record batch size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config). See fetch.max.bytes for limiting the consumer request size.
Note that this is a per-partition configuration, meaning that if you set it to a large number and you also have a lot of partitions, it can consume a lot of memory.
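For completeness, here is a minimal sketch of setting that consumer property in Java; the bootstrap address, group id, and 10 MB limit are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class LargeMessageConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "large-message-group");        // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        // Allow up to 10 MB per partition per fetch; this buffer is per assigned
        // partition, so memory use grows with the number of partitions.
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 10 * 1024 * 1024);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll as usual
        }
    }
}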

Confusion in using log.retention.bytes parameter in logging Topic data in Apache Kafka

"log.retention.bytes" is the parameter we are using to retain the logs of topic messages and I had given value as 1073741824.
I had referred the Kafka documentation, where it says the size given in "log.retention.bytes" is per partition, so that means suppose if I have 20 partitions for all the topics I am using, then total size of bytes that Kafka will retain is 20*1073741824 according to the documentation.
But what clarity I need is
Will Kafka retain 20*1073741824 bytes for all the topics?
(or)
Will Kafka retain 20*1073741824 bytes per topic?
log.retention.bytes is the parameter used to limit the size of the log for each topic partition. By default, the log size is unlimited.
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
If you set log.retention.bytes = 1 GB, Kafka will trigger a clean-up activity when the partition size reaches 1 GB. Remember that it is not the topic size; it is the partition size.
Kafka gives you another option to configure the retention period, i.e. log.retention.ms. The default retention period is seven days. If you want to change the duration, you can specify your own value for the log.retention.ms configuration.
If you specify both configurations, the clean-up will start when either of the criteria is met.
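To make the per-partition point concrete, here is a small sketch of the arithmetic; the topic count is purely illustrative.

public class RetentionEstimate {
    public static void main(String[] args) {
        long logRetentionBytes = 1_073_741_824L; // 1 GiB, enforced per partition
        int partitionsPerTopic = 20;             // from the question
        int topicCount = 3;                      // hypothetical number of topics

        // Each topic with 20 partitions can retain up to 20 * 1 GiB, and every
        // topic gets that allowance independently, so disk usage scales with the
        // total number of partitions across all topics.
        long perTopic = logRetentionBytes * partitionsPerTopic;
        long total = perTopic * topicCount;
        System.out.printf("Per topic: ~%d bytes; across %d topics: ~%d bytes%n",
            perTopic, topicCount, total);
    }
}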

Is this possible? producer batch.size * max.request.size > broker max.message.bytes

The average message size is small, but sizes vary.
Average message size: 1 KB
1 MB messages arrive at an arbitrary rate, so the producer's max.request.size = 1 MB.
The broker's max.message.bytes = 2 MB.
My questions:
To avoid produce-size errors, does the user have to set batch.size <= 2 MB?
Or does the producer library decide the batch size automatically to avoid errors (even if the user sets a large batch.size)?
Thanks.
Below are the definitions of the related configs in question.
Producer config
batch.size: the producer will attempt to batch records up to batch.size bytes before sending them to Kafka (assuming batch.size is configured to take precedence over linger.ms). Default: 16384 bytes.
max.request.size: the maximum size of a request in bytes. This setting limits the number of record batches the producer will send in a single request, to avoid sending huge requests. It is also effectively a cap on the maximum record batch size. Default: 1048576 bytes.
Broker config
message.max.bytes: the largest record batch size allowed by Kafka. Default: 1000012 bytes.
replica.fetch.max.bytes: the number of bytes a follower broker attempts to fetch per partition when replicating; it must be large enough for messages to be replicated correctly within the cluster.
To answer your questions:
To avoid producer send errors, you don't need to set batch.size to 2 MB, as this would delay the transmission of your small messages. You can keep batch.size sized according to the average message size and how much you want to batch.
If you don't specify batch.size, it takes the default value, which is 16384 bytes.
So basically, you will have to configure the producer's max.request.size >= 2 MB, and the broker's message.max.bytes and replica.fetch.max.bytes >= 2 MB.
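Putting the producer side of that together, here is a minimal sketch; the bootstrap address is an assumption, and the comment about oversized records reflects the usual Java producer behaviour of giving a record larger than batch.size a batch of its own rather than rejecting it.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class MixedSizeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Requests (and therefore record batches) may be up to 2 MB; the broker's
        // message.max.bytes and replica.fetch.max.bytes must also be >= 2 MB.
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 2 * 1024 * 1024);

        // batch.size can stay near the average message size; a record larger than
        // this is typically placed in a batch of its own rather than rejected,
        // as long as it fits under max.request.size.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // producer.send(...) as usual
        }
    }
}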
This query arises because there are various settings available around batching. Let me attempt to make them clear:
Kafka Setting: message.max.bytes and fetch.max.bytes
The Kafka broker limits the maximum size (total size of messages in a batch, if messages are published in batches) of a message that can be produced, configured by the cluster-wide property message.max.bytes (defaults to 1 MB). A producer that tries to send a message larger than this will receive an error back from the broker, and the message will not be accepted. As with all byte sizes specified on the broker, this configuration deals with compressed message size, which means that producers can send messages that are much larger than this value uncompressed, provided they compress it under the configured message.max.bytes size.
Note: This setting can be overridden by a specific topic (but with name max.message.bytes).
The maximum message size, message.max.bytes, configured on the Kafka broker must be coordinated with the fetch.max.bytes and max.partition.fetch.bytes configurations on consumer clients (the latter defaults to 1 MB). These configure the maximum number of bytes of messages to attempt to fetch per request and per partition. If they are smaller than message.max.bytes, consumers that encounter larger messages may fail to fetch those messages, resulting in a situation where the consumer gets stuck and cannot proceed.
The configuration setting replica.fetch.max.bytes (defaults to 1 MB) determines the rough amount of memory you will need for each partition on a broker.
Producer Setting: max.request.size
This setting controls the size of a produce request sent by the producer. It caps both the size of the largest message that can be sent and the number of messages that the producer can send in one request. For example, with a default maximum request size of 1 MB, the largest message you can send is 1 MB, or the producer can batch 1,000 messages of 1 KB each into one request.
In addition, the broker has its own limit on the size of the largest message it will accept (message.max.bytes). It is usually a good idea to have these configurations match, so the producer will not attempt to send messages of a size that will be rejected by the broker.
Note that message.max.bytes (broker level) and max.request.size (producer level) put a cap on the maximum size of a request/batch, but batch.size (which should be lower than the previous two) and linger.ms are the settings that actually govern the size of the batch.
Producer Setting: batch.size and linger.ms
When multiple records are sent to the same partition, the producer will batch them together. The parameter batch.size controls the maximum amount of memory in bytes (not the number of messages!) that will be used for each batch. When a batch is full, all the messages in the batch will be sent. This helps throughput on both the client and the server.
A small batch size will make batching less common and may reduce throughput. A very large size may use memory a bit more wastefully as we will always allocate a buffer of the specified batch size in anticipation of additional messages.
The linger.ms (defaults to 0) setting controls the amount of time to wait for additional messages before sending the current batch.
By default, the producer will send messages as soon as there is a sender thread available to send them, even if there's just one message in the batch (note that batch.size only specifies the maximum limit on the size of a batch). By setting linger.ms higher than 0, we instruct the producer to wait a few milliseconds to add additional messages to the batch before sending it to the brokers, even if a sender thread is available. This increases latency but also increases throughput (because we send more messages at once, there is less overhead per message).
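To illustrate the batch.size/linger.ms trade-off described above, here is a minimal producer sketch; the 32 KB batch size and 20 ms linger are illustrative values, not recommendations from the answer.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Upper bound on the memory used per batch (bytes, not record count).
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
        // Wait up to 20 ms for more records before sending a partially full batch:
        // a little more latency in exchange for better throughput.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}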

Increase the number of messages read by a Kafka consumer in a single poll

Kafka consumer has a configuration max.poll.records which controls the maximum number of records returned in a single call to poll() and its default value is 500. I have set it to a very high number so that I can get all the messages in a single poll.
However, the poll returns only a few thousand messages (roughly 6000) in a single call even though the topic has many more. How can I further increase the number of messages read by a single consumer?
You can increase the consumer poll() batch size by increasing max.partition.fetch.bytes, but as per the documentation it is still limited by fetch.max.bytes, which also needs to be increased to match the required batch size. The documentation also mentions message.max.bytes (broker config) and max.message.bytes (topic config), which restrict the batch size on the broker side. So one way is to increase all of these properties based on the batch size you require.
In the consumer config, max.partition.fetch.bytes defaults to 1048576:
The maximum amount of data per-partition the server will return. Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress. The maximum record batch size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config). See fetch.max.bytes for limiting the consumer request size
In the consumer config, fetch.max.bytes defaults to 52428800:
The maximum amount of data the server should return for a fetch request. Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress. As such, this is not an absolute maximum. The maximum record batch size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config). Note that the consumer performs multiple fetches in parallel.
In the broker config, message.max.bytes defaults to 1000012:
The largest record batch size allowed by Kafka. If this is increased and there are consumers older than 0.10.2, the consumers' fetch size must also be increased so that they can fetch record batches this large.
In the latest message format version, records are always grouped into batches for efficiency. In previous message format versions, uncompressed records are not grouped into batches and this limit only applies to a single record in that case.
This can be set per topic with the topic level max.message.bytes config.
In the topic config, max.message.bytes defaults to 1000012:
The largest record batch size allowed by Kafka. If this is increased and there are consumers older than 0.10.2, the consumers' fetch size must also be increased so that they can fetch record batches this large.
In the latest message format version, records are always grouped into batches for efficiency. In previous message format versions, uncompressed records are not grouped into batches and this limit only applies to a single record in that case.
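A minimal sketch of raising the consumer-side limits together, on the assumption that the broker/topic limits already allow batches of that size; the specific numbers and group id are illustrative, not values from the question.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BigPollConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "big-poll-group");             // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Cap on the number of records returned by a single poll().
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50_000);
        // Raise the per-partition and per-request byte limits so that the record
        // cap, not the byte caps, is what limits the poll.
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 16 * 1024 * 1024);
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 128 * 1024 * 1024);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll as usual
        }
    }
}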
Most probably your payload is limited by max.partition.fetch.bytes, which is 1 MB by default. Refer to the Kafka consumer configuration.
Here's good detailed explanation:
MAX.PARTITION.FETCH.BYTES
This property controls the maximum number of bytes the server will return per partition. The default is 1 MB, which means that when KafkaConsumer.poll() returns ConsumerRecords, the record object will use at most max.partition.fetch.bytes per partition assigned to the consumer. So if a topic has 20 partitions, and you have 5 consumers, each consumer will need to have 4 MB of memory available for ConsumerRecords. In practice, you will want to allocate more memory as each consumer will need to handle more partitions if other consumers in the group fail. max.partition.fetch.bytes must be larger than the largest message a broker will accept (determined by the message.max.bytes property in the broker configuration), or the broker may have messages that the consumer will be unable to consume, in which case the consumer will hang trying to read them.
Another important consideration when setting max.partition.fetch.bytes is the amount of time it takes the consumer to process data. As you recall, the consumer must call poll() frequently enough to avoid session timeout and subsequent rebalance. If the amount of data a single poll() returns is very large, it may take the consumer longer to process, which means it will not get to the next iteration of the poll loop in time to avoid a session timeout. If this occurs, the two options are either to lower max.partition.fetch.bytes or to increase the session timeout.
Hope it helps!
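The poll-frequency concern above is easiest to see in the shape of the poll loop itself. A minimal sketch, where the topic name, group id, and the 8 MB limit are illustrative assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "poll-loop-group");           // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 8 * 1024 * 1024);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));          // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The larger each poll() result, the longer until the next poll();
                    // keep per-record work cheap or lower the fetch limits, as the
                    // quoted passage above suggests.
                    System.out.printf("%s-%d@%d%n", record.topic(), record.partition(), record.offset());
                }
            }
        }
    }
}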

Kafka optimal retention and deletion policy

I am fairly new to Kafka, so forgive me if this question is trivial. I have a very simple setup for the purposes of timing tests, as follows:
Machine A -> writes to topic 1 (Broker) -> Machine B reads from topic 1
Machine B -> writes message just read to topic 2 (Broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop, filling up the space on my small broker very quickly. I'm experimenting with setting different values for log.retention.ms, log.retention.bytes, log.segment.bytes and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but that seemed to degrade performance; then I set them to the maximum my broker could take before being completely full, but again performance degrades when a deletion occurs. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!
Apache Kafka uses a log data structure to manage its messages. A log is basically an ordered set of segments, where a segment is a collection of messages. Apache Kafka provides retention at the segment level instead of at the message level. Hence, Kafka keeps removing the oldest segments from a log once they violate the retention policies.
Apache Kafka provides us with the following retention policies:
Time Based Retention
Under this policy, we configure the maximum time a segment (and hence its messages) can live for. Once a segment has exceeded the configured retention time, it is marked for deletion or compaction depending on the configured cleanup policy. The default retention time for segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:
Configures retention time in milliseconds
log.retention.ms=1680000
Used if log.retention.ms is not set
log.retention.minutes=1680
Used if log.retention.minutes is not set
log.retention.hours=168
Size based Retention
In this policy, we configure the maximum size of the log for a topic partition. Once the log reaches this size, Kafka starts removing segments from its oldest end. This policy is not popular as it does not provide good visibility into message expiry. However, it can come in handy in scenarios where we need to control the size of a log due to limited disk space.
Here are the parameters that you can set in your Kafka broker properties file:
Configures maximum size of a Log
log.retention.bytes=104857600
So, according to your use case, you should configure log.retention.bytes so that your disk does not get full.
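If you only need tighter retention on the two test topics rather than on the whole broker, the same limits can be set per topic via retention.bytes and retention.ms. A minimal sketch with the Java AdminClient, assuming Kafka 2.3+ and using "topic1" as a stand-in topic name:

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TopicRetentionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "topic1"); // hypothetical topic

            // Per-partition size cap (100 MiB) and time cap (1 hour) for this topic only;
            // whichever limit is reached first triggers deletion of old segments.
            AlterConfigOp sizeCap = new AlterConfigOp(
                new ConfigEntry("retention.bytes", String.valueOf(100L * 1024 * 1024)),
                AlterConfigOp.OpType.SET);
            AlterConfigOp timeCap = new AlterConfigOp(
                new ConfigEntry("retention.ms", String.valueOf(60L * 60 * 1000)),
                AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> updates = new HashMap<>();
            updates.put(topic, Arrays.asList(sizeCap, timeCap));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}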