Kafka fetch max bytes doesn't work as expected - apache-kafka

I have a topic holding 1 GB of messages. A Kafka consumer starts consuming these messages. What can I do to prevent the consumer from consuming all messages at once? I tried to set fetch.max.bytes on the broker to 30 MB to allow only 30 MB of messages in each poll. The broker doesn't seem to honor that and tries to give all messages to the consumer at once, causing an out-of-memory error in the consumer. How can I resolve this issue?

Kafka configurations can be quite overwhelming. Typically in Kafka, multiple configurations can work together to achieve a result. This brings flexibility, but flexibility comes with a price.
From the documentation of fetch.max.bytes:
Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress.
On the consumer side alone, there are more configurations to consider for bounding the consumer's memory usage, including:
max.poll.records: limits the number of records retrieved in a single call to poll. Default is 500.
max.partition.fetch.bytes: limits the number of bytes fetched per partition. This should not be a problem as the default is 1MB.
As per KIP-81, the memory usage in practice should be roughly min(num_brokers * fetch.max.bytes, max.partition.fetch.bytes * num_partitions).
Also, in the same KIP:
The consumer (Fetcher) delays decompression until the records are returned to the user, but because of max.poll.records, it may end up holding onto the decompressed data from a single partition for a few iterations.
I'd suggest you also tune these parameters; hopefully this will get you to the desired state.
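To get a feel for the KIP-81 expression above, here is a minimal sketch that just evaluates it. The function name and the example cluster (3 brokers, 200 assigned partitions, fetch.max.bytes lowered to 30 MB) are my own assumptions for illustration; the defaults are the documented consumer defaults. Treat the result as a rough bound, not a guarantee, since a single oversized record batch can still exceed it.

```python
MB = 1024 * 1024

def estimate_fetch_buffer_bytes(num_brokers, num_partitions,
                                fetch_max_bytes=50 * MB,
                                max_partition_fetch_bytes=1 * MB):
    """Evaluate the KIP-81 style bound on fetched-but-unprocessed data."""
    per_broker_bound = num_brokers * fetch_max_bytes
    per_partition_bound = num_partitions * max_partition_fetch_bytes
    return min(per_broker_bound, per_partition_bound)

# Hypothetical consumer: 3 brokers, 200 assigned partitions, fetch.max.bytes = 30 MB.
estimate = estimate_fetch_buffer_bytes(3, 200, fetch_max_bytes=30 * MB)
print(estimate // MB, "MB")
```

With these numbers the per-broker term (3 * 30 MB) is the smaller one, so lowering fetch.max.bytes on the consumer is what moves the bound.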


Kafka: Throughput of producing to thousands of topics with different message rate

The task is routing messages from a single huge source topic to many (a few thousand) destination topics. The overall rate is a few million records per second. Our setup barely handles this load now, and we are looking for a way to optimise it. However, it does not seem to have reached any hardware or network limit, so I suppose it can be improved. Latency isn't important (a delay of a few minutes is fine), and the average message size is less than 1 KiB.
The most obvious way to increase throughput is to make batch.size and linger.ms larger. The problem is that the message rate differs between destination topics: depending on the destination, the rate may vary from a few messages per second to hundreds of thousands per second.
As I understand it (please correct me if I'm wrong), batch.size is a per-partition parameter. So if we set batch.size too high we will run out of memory, because it is multiplied by the number of destination topics, even if each of them has only one partition. On the other hand, if batch.size is small, the producer will send requests to the broker too often. In each app instance we use a single producer for all destination topics (a ProduceRequest can include batches for different topics). The only way to set this parameter differently per topic is to use a separate producer per topic, but that means thousands of threads and many context switches.
Can we set a minimum size of actual ProduceRequest, i.e. like batch.size, but for overall batches in the request, i.e. something opposite to max.request.size?
Or is there any way to increase throughput of producer?
The problem looks solvable, and it seems we have solved it. Streaming to 3k topics is not a big problem for Kafka, but there are some things you should take care of:
1. The Kafka producer tries to allocate batch.size * number_of_destination_partitions of memory at startup. If batch.size is 10 MB and you have 3k topics with 1 partition per topic, the producer will require at least ~30 GB at startup (source code).
So the more destination partitions you have, the smaller batch.size has to be, or the more memory you need. We chose a small batch.size.
2. The message rate per destination topic doesn't affect overall performance. The Kafka producer sends several batches per request. This is where max.request.size comes into play (source code, maxSize is max.request.size): the higher max.request.size, the more batches can be sent per request. It is important to understand that reaching batch.size or linger.ms doesn't instantly trigger sending the batch to the broker. As soon as a batch reaches batch.size or linger.ms, it is marked as sendable and will be processed later together with other batches (source code).
Moreover, batch.size and linger.ms are not the only reasons to mark a batch as sendable (check the previous link). And this is where the batches are actually sent (source code). That's why the same event rate per destination topic is not required, but there are still some nuances, which are described next.
2.1. A few words about linger.ms. I can't say for sure how it acts in this scenario. On the one hand, the larger it is, the longer the producer will wait to collect messages for a given partition, and the more data for that partition will be sent per request. On the other hand, it seems that the smaller it is, the more batches for different partitions can be packed into one request. So there is no certainty about which is better.
3. Although the Kafka producer is able to send more than one batch per request, it can't send more than one batch per request for one specific partition. That's why, if the message rate is skewed across destination topics, you have to increase the partition count of the most loaded ones to increase throughput. But always remember that increasing the partition count leads to an increase in memory usage.
Actually, the information above helped us solve our performance problems, but there may be other nuances that we don't know about yet.
I hope it will be useful.
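To make points 1 and 3 above more concrete, here is a small sketch. It is a simplified model I made up for illustration, not the producer's actual code: worst_case_batch_memory just multiplies batch.size by the partition count, and pack_requests greedily packs ready batches under two rules taken from the answer (at most one batch per partition per request, total size bounded by max.request.size). It ignores brokers, retries and in-flight limits.

```python
def worst_case_batch_memory(batch_size_bytes, num_partitions):
    """Memory needed if every partition holds one full batch."""
    return batch_size_bytes * num_partitions

def pack_requests(ready_batches, max_request_size):
    """Pack (partition, size_bytes) batches into requests.

    Simplified rules: one batch per partition per request, and the sum of
    batch sizes stays within max_request_size (a single oversized batch is
    still sent on its own).
    """
    requests = []
    pending = list(ready_batches)
    while pending:
        request, used, seen, leftover = [], 0, set(), []
        for partition, size in pending:
            fits = used + size <= max_request_size or not request
            if partition not in seen and fits:
                request.append((partition, size))
                seen.add(partition)
                used += size
            else:
                leftover.append((partition, size))
        requests.append(request)
        pending = leftover
    return requests

# Point 1: batch.size = 10 MB with 3000 single-partition topics.
total = worst_case_batch_memory(10 * 1024 * 1024, 3000)
print(round(total / 1024 ** 3, 1), "GiB")

# Point 3: three ready batches on one hot partition need three requests...
skewed = [("hot-0", 16384)] * 3 + [("cold-0", 16384), ("cold-1", 16384)]
print(len(pack_requests(skewed, 1024 * 1024)))
# ...while spreading the hot topic over three partitions fits in one.
spread = [("hot-0", 16384), ("hot-1", 16384), ("hot-2", 16384),
          ("cold-0", 16384), ("cold-1", 16384)]
print(len(pack_requests(spread, 1024 * 1024)))
```

In this toy model the skewed case needs three requests and the spread case one, which is the reason given above for adding partitions to the busiest topics.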

How does one Kafka consumer read from more than one partition?

I would like to know how one consumer consumes from more than one partition, specifically, in what order are messages read from the different partitions?
I had a peek at the source code (Consumer, Fetcher) but I can't really follow all of it.
This is what I thought would happen:
Partitions are read sequentially. That is: all the messages in one partition will be read before continuing to the next. If we reach max.poll.records without consuming the whole partition, the next fetch will continue reading the current partition until it is exhausted, before going on to the next.
I tried setting max.poll.records to a relatively low number and seeing what happens.
If I send messages to a topic and then start a consumer, all the messages are read from one partition before continuing to the next, even if the number of messages in that partition is higher than max.poll.records.
Then I tried to see if I could "lock" the consumer in one partition, by sending messages to that partition continuously (using JMeter). But I couldn't do it: messages from other partitions were also being read.
The consumer polls for messages from its assigned partitions in a greedy round-robin way.
E.g. if max.poll.records is set to 100 and there are 2 assigned partitions, A and B, the consumer will try to poll 100 records from A. If partition A has fewer than 100 available messages, it will take what it needs from partition B to fill up to 100 messages.
Although this is not ideal, this way no partition will be starved.
This also explains why ordering is not guaranteed between partitions.
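The A/B example above can be played through with a toy model. This is only a sketch of the greedy part, assuming the records are already buffered on the client; the poll function and the buffers dict are hypothetical names, not the real consumer internals, and the rotation between partitions across polls is left out.

```python
from collections import deque

def poll(buffers, max_poll_records):
    """Drain buffered records greedily, partition by partition."""
    out = []
    for partition, records in buffers.items():
        while records and len(out) < max_poll_records:
            out.append((partition, records.popleft()))
        if len(out) == max_poll_records:
            break
    return out

# Partition A has only 60 records available, B has 100.
buffers = {"A": deque(range(60)), "B": deque(range(100))}
batch = poll(buffers, max_poll_records=100)
from_a = sum(1 for partition, _ in batch if partition == "A")
from_b = sum(1 for partition, _ in batch if partition == "B")
print(from_a, from_b)
```

Here the poll returns 60 records from A and tops up with 40 from B, leaving 60 buffered records of B for the next call.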
I have read the KIP mentioned in the answer to the question linked in the comments and I think I finally understood how the consumer works.
There are two main configuration options that affect how data is consumed:
max.partition.fetch.bytes: the maximum amount of data that the server will return for a given partition
max.poll.records: the maximum amount of records that are returned each time the consumer polls
The process of fetching from each partition is greedy and proceeds in a round-robin way. Greedy means that as many records as possible will be retrieved from each partition; if all records in a partition occupy less than max.partition.fetch.bytes, then all of them will be fetched; otherwise, only max.partition.fetch.bytes will be fetched.
Now, not all the fetched records will be returned in a poll call. Only max.poll.records will be returned.
The remaining records will be retained for the next call to poll.
Moreover, if the number of retained records is less than max.poll.records, the poll method will start a new round of fetching (pre-fetching) before returning. This means that, usually, the consumer is processing records while new records are being fetched.
If some partitions receive considerably more messages than others, this could lead to the less active partitions not being processed for long periods of time.
The only downside to this approach is that it could lead to some partitions going unconsumed for an extended amount of time when there is a large imbalance between the partition's respective message rates. For example, suppose that a consumer with max messages set to 1 fetches data from partitions A and B. If the returned fetch includes 1000 records from A and no records from B, the consumer will have to process all 1000 available records from A before fetching on partition B again.
In order to prevent this, we could reduce max.partition.fetch.bytes.
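As a rough illustration of why reducing max.partition.fetch.bytes helps, here is some back-of-the-envelope arithmetic. It assumes uncompressed records of a fixed average size and ignores batch boundaries, so the numbers are only indicative; both function names are made up for this sketch.

```python
import math

def records_per_partition_fetch(max_partition_fetch_bytes, avg_record_bytes):
    """Approximate number of records one partition contributes per fetch."""
    return max(1, max_partition_fetch_bytes // avg_record_bytes)

def polls_until_buffer_drained(buffered_records, max_poll_records):
    """Number of poll() calls needed to hand out the buffered records."""
    return math.ceil(buffered_records / max_poll_records)

avg_record = 1024  # assumed 1 KiB records
for limit in (1024 * 1024, 64 * 1024):
    buffered = records_per_partition_fetch(limit, avg_record)
    polls = polls_until_buffer_drained(buffered, max_poll_records=1)
    print(limit, buffered, polls)
```

With max.poll.records=1, a 1 MiB partition limit means about 1024 polls on the busy partition before the quiet one is fetched again; with 64 KiB it is about 64.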

How does Kafka decide which records are contained in the consumer poll loop when there are more than `max.poll.records` records left?

I have a Kafka consumer group consuming several topics (each topic has more than one partition). All topics contain a considerable amount of records on each partition.
I'm currently trying to make sense of the behavior when the consumer initially starts consuming.
In particular, I'd like to know how the broker decides which records reach the client first.
The following aspects are noteworthy:
There are a lot more records than the consumer can process in one single roundtrip (i.e. more records than the consumer's max.poll.records configuration)
There are records from several topics and several partitions that the consumer has to read
I naively assumed that the broker returns records for each topic in each poll loop, so that the consumer reads all the topics at a similar pace. This doesn't seem to be the case though. Apparently it prioritizes records for a single topic at a time, switching the topic without an obvious pattern (at least that's what I'm seeing in the metrics of my consumer).
I couldn't find anything in the consumer config parameters that allows me to change this behavior. It's not really a problem, because all messages get read eventually. But I would like to understand the behavior in more detail.
So my question is: How does the broker decide which records end up in the result of a consumer's poll loop?
Consumers fetch records from Kafka using Fetch requests.
If you look at the protocol, this API is pretty complex and has many fields, but we can focus on a few fields that are relevant to your questions:
max_wait_ms: This indicates how long the broker should wait in case there's no/not enough records available. This is configurable using fetch.max.wait.ms.
min_bytes: This indicates the minimum amount of data (the size of records) the broker needs to have before it responds. This is configurable using fetch.min.bytes.
max_bytes: This indicates the maximum size of a response. This is configurable using fetch.max.bytes.
As soon as the broker hits one of these limits, it will send a response back.
The Fetch request also indicates which partitions the consumer wants to read. For each partition, there is partition_max_bytes that indicates the maximum size to return for that partition. This is configurable using max.partition.fetch.bytes.
In the past, Fetch requests contained the full list of partitions. The broker would iterate the list in order until it reached one of the limits mentioned above.
Since 1.1 (KIP-227), it's a bit more complicated as consumers use fetch sessions to avoid sending the full list in every fetch request. To keep it simple, brokers use FetchSessions to keep an iterator on the partition list to ensure records are fetched from all partitions fairly.
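The older, in-order behaviour is easy to sketch, and it shows why one topic can dominate a response. The following is a simplified model with made-up names (build_fetch_response, available, requested); it only covers max_bytes and partition_max_bytes, leaves out min_bytes and max_wait_ms, and does not model the fetch-session rotation added by KIP-227.

```python
def build_fetch_response(available, requested, max_bytes, partition_max_bytes):
    """Walk the requested partitions in order and collect batch sizes.

    available: dict partition -> list of record batch sizes in log order.
    The very first batch of the response is always returned, even if it is
    larger than the limits, so the consumer can make progress.
    """
    response = {}
    total = 0
    for partition in requested:
        taken = 0
        sizes = []
        for batch in available.get(partition, []):
            first_in_response = total == 0
            over_partition = taken + batch > partition_max_bytes
            over_response = total + batch > max_bytes
            if (over_partition or over_response) and not first_in_response:
                break
            sizes.append(batch)
            taken += batch
            total += batch
        if sizes:
            response[partition] = sizes
        if total >= max_bytes:
            break
    return response

available = {"t-0": [400, 400, 400], "t-1": [400], "t-2": [400]}
response = build_fetch_response(available, ["t-0", "t-1", "t-2"],
                                max_bytes=1000, partition_max_bytes=800)
print(response)
```

In this example only t-0 makes it into the response: its two batches use 800 of the 1000 bytes, and the next 400-byte batch from t-1 or t-2 no longer fits.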
Now let's look at the client side ...
At this point, you may have noticed that I've not mentioned max.poll.records. This setting is only used on the client side. Consumers try to fetch records efficiently. So even if you set max.poll.records=1, a consumer may fetch records in large batches, keep them in memory and only return 1 record each time poll() is called. This avoids sending many small requests and overloading brokers unnecessarily.
The consumer also keeps track of the records it has in memory. If it already has records for a partition, it may leave that partition out of the next Fetch request.
So while each Fetch response may not include data from all partitions, over a period of time, all partitions should be fetched fairly.
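The client-side behaviour described above can be sketched with a small simulation. Everything here (the BufferedConsumer class, fetch_chunk, the in-memory "log") is a hypothetical stand-in for illustration and much simpler than Fetcher.java: fetched chunks are queued in arrival order, poll() hands out at most max_poll_records, and a new fetch only names partitions that have nothing buffered.

```python
from collections import deque

class BufferedConsumer:
    def __init__(self, log, max_poll_records, fetch_chunk):
        self.log = {p: deque(records) for p, records in log.items()}
        self.completed = deque()   # FIFO of (partition, fetched records)
        self.max_poll_records = max_poll_records
        self.fetch_chunk = fetch_chunk
        self.fetch_requests = []   # partitions named in each simulated fetch

    def _fetch(self):
        buffered = {p for p, _ in self.completed}
        wanted = [p for p in self.log if p not in buffered and self.log[p]]
        if not wanted:
            return
        self.fetch_requests.append(wanted)
        for p in wanted:
            chunk = deque()
            for _ in range(min(self.fetch_chunk, len(self.log[p]))):
                chunk.append(self.log[p].popleft())
            self.completed.append((p, chunk))

    def poll(self):
        self._fetch()
        out = []
        while self.completed and len(out) < self.max_poll_records:
            partition, chunk = self.completed[0]
            out.append((partition, chunk.popleft()))
            if not chunk:
                self.completed.popleft()
        return out

consumer = BufferedConsumer({"A": range(6), "B": range(2)},
                            max_poll_records=1, fetch_chunk=3)
seen = []
while True:
    records = consumer.poll()
    if not records:
        break
    seen.extend(records)
print(len(seen), consumer.fetch_requests)
```

In this run eight poll() calls return one record each, but only two fetches are issued, and the second one asks for partition A alone because B still has buffered records.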
I've simplified the process to keep it short but if you want to dive into this logic, I'd recommend checking the following classes:
Fetcher.java: This is the client side logic that determines what to fetch from brokers and what to return in poll().
ReplicaManager.scala: This is the server side logic that determines what to return in a Fetch response. See fetchMessages().
FetchSession.scala: This is the session logic introduced by KIP-227

Increase the number of messages read by a Kafka consumer in a single poll

Kafka consumer has a configuration max.poll.records which controls the maximum number of records returned in a single call to poll() and its default value is 500. I have set it to a very high number so that I can get all the messages in a single poll.
However, the poll returns only a few thousand messages (roughly 6000) in a single call even though the topic has many more. How can I further increase the number of messages read by a single consumer?
You can increase the consumer's poll() batch size by increasing max.partition.fetch.bytes, but as per the documentation it is still limited by fetch.max.bytes, which also needs to be increased to match the required batch size. The documentation also mentions another property that restricts the batch size: message.max.bytes in the broker config (max.message.bytes in the topic config). So one way is to increase all of these properties based on your required batch size.
In Consumer config max.partition.fetch.bytes default value is 1048576
The maximum amount of data per-partition the server will return. Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress. The maximum record batch size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config). See fetch.max.bytes for limiting the consumer request size
In Consumer Config fetch.max.bytes default value is 52428800
The maximum amount of data the server should return for a fetch request. Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress. As such, this is not an absolute maximum. The maximum record batch size accepted by the broker is defined via message.max.bytes (broker config) or max.message.bytes (topic config). Note that the consumer performs multiple fetches in parallel.
In Broker config message.max.bytes default value is 1000012
The largest record batch size allowed by Kafka. If this is increased and there are consumers older than 0.10.2, the consumers' fetch size must also be increased so that they can fetch record batches this large.
In the latest message format version, records are always grouped into batches for efficiency. In previous message format versions, uncompressed records are not grouped into batches and this limit only applies to a single record in that case.
This can be set per topic with the topic level max.message.bytes config.
In Topic config max.message.bytes default value is 1000012
The largest record batch size allowed by Kafka. If this is increased and there are consumers older than 0.10.2, the consumers' fetch size must also be increased so that they can fetch record batches this large.
In the latest message format version, records are always grouped into batches for efficiency. In previous message format versions, uncompressed records are not grouped into batches and this limit only applies to a single record in that case.
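To see how these limits interact, here is a rough estimate of how many records a single fetch can carry. It is a simplification under my own assumptions: one broker, uncompressed records of a fixed average size, no batch overhead. The partition count and record size in the example are made up, since the question doesn't state them.

```python
def max_records_per_fetch(num_partitions, avg_record_bytes,
                          max_partition_fetch_bytes=1048576,
                          fetch_max_bytes=52428800,
                          max_poll_records=500):
    """Upper bound on records returned by one poll() served by one fetch."""
    byte_budget = min(fetch_max_bytes,
                      num_partitions * max_partition_fetch_bytes)
    return min(max_poll_records, byte_budget // avg_record_bytes)

# Hypothetical topic: 6 partitions, 1 KiB records, max.poll.records set very high.
print(max_records_per_fetch(6, 1024, max_poll_records=1_000_000))
# Same topic after raising max.partition.fetch.bytes to 10 MiB.
print(max_records_per_fetch(6, 1024, max_partition_fetch_bytes=10 * 1048576,
                            max_poll_records=1_000_000))
```

With the default 1 MiB per partition the byte budget, not max.poll.records, is the limiting factor in this example, which is consistent with seeing only a few thousand records per poll.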
Most probably your payload is limited by max.partition.fetch.bytes, which is 1MB by default. Refer to Kafka Consumer configuration.
Here's good detailed explanation:
This property controls the maximum number of bytes the server will return per partition. The default is 1 MB, which means that when KafkaConsumer.poll() returns ConsumerRecords, the record object will use at most max.partition.fetch.bytes per partition assigned to the consumer. So if a topic has 20 partitions, and you have 5 consumers, each consumer will need to have 4 MB of memory available for ConsumerRecords. In practice, you will want to allocate more memory as each consumer will need to handle more partitions if other consumers in the group fail. max.partition.fetch.bytes must be larger than the largest message a broker will accept (determined by the max.message.size property in the broker configuration), or the broker may have messages that the consumer will be unable to consume, in which case the consumer will hang trying to read them. Another important consideration when setting max.partition.fetch.bytes is the amount of time it takes the consumer to process data. As you recall, the consumer must call poll() frequently enough to avoid session timeout and subsequent rebalance. If the amount of data a single poll() returns is very large, it may take the consumer longer to process, which means it will not get to the next iteration of the poll loop in time to avoid a session timeout. If this occurs, the two options are either to lower max.partition.fetch.bytes or to increase the session timeout.
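The 20 partitions / 5 consumers arithmetic from the quote, including the "other consumers fail" case, can be checked with a few lines. The function name is made up, and an even spread of partitions over consumers is assumed.

```python
import math

def consumer_records_memory(num_partitions, num_consumers,
                            max_partition_fetch_bytes=1048576):
    """Bytes of ConsumerRecords one consumer may hold for its partitions."""
    partitions_per_consumer = math.ceil(num_partitions / num_consumers)
    return partitions_per_consumer * max_partition_fetch_bytes

MB = 1048576
print(consumer_records_memory(20, 5) // MB, "MB with 5 consumers")
print(consumer_records_memory(20, 4) // MB, "MB after one consumer fails")
```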
Hope it helps!

How to set Kafka Producer message rate per second?

I am reading a CSV file and passing its rows as input to my Kafka producer. Now I want my Kafka producer to produce messages at a rate of 100 messages per second.
Take a look at linger.ms and batch.size properties of Kafka Producer.
You have to adjust these properties correspondingly to get desired rate.
The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out. However in some circumstances the client may want to reduce the number of requests even under moderate load. This setting accomplishes this by adding a small amount of artificial delay; that is, rather than immediately sending out a record the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together. This can be thought of as analogous to Nagle's algorithm in TCP. This setting gives the upper bound on the delay for batching: once we get batch.size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up. This setting defaults to 0 (i.e. no delay). Setting linger.ms=5, for example, would have the effect of reducing the number of requests sent but would add up to 5ms of latency to records sent in the absence of load.
If you like stream processing then akka-streams has nice support for throttling: http://doc.akka.io/docs/akka/current/java/stream/stream-quickstart.html#time-based-processing
Then the akka-stream-kafka (aka reactive-kafka) library allows you to connect the two together: http://doc.akka.io/docs/akka-stream-kafka/current/home.html
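If you don't want to pull in a streaming library, the same throttling idea can be sketched as a plain pacing loop in front of the producer. This is not a Kafka feature and not something the answers above propose; send_at_rate and its send argument are hypothetical, where send would be whatever callable hands a record to your producer. The clock and sleep parameters exist only so the loop can be tested without waiting.

```python
import time

def send_at_rate(records, send, rate_per_sec,
                 clock=time.monotonic, sleep=time.sleep):
    """Call send(record) for each record, at most rate_per_sec times a second."""
    interval = 1.0 / rate_per_sec
    next_slot = clock()
    for record in records:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)          # wait until this record's time slot
        send(record)
        next_slot += interval     # schedule against the slot, not "now"

# Example: pace three CSV rows at 100 per second; print stands in for the producer.
send_at_rate(["row-1", "row-2", "row-3"], print, rate_per_sec=100)
```

Scheduling against next_slot instead of sleeping a fixed interval after each send keeps the average rate close to the target even when send itself takes some time.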
In the Kafka JVM producer, throughput depends on multiple factors, and it is most commonly measured in MB/sec rather than msg/sec. In your example, if each row of your CSV is, say, 1 MB in size, then you need to tune your producer configs to achieve 100 MB/sec in order to reach your target throughput of 100 msg/sec. While tuning producer configs, you have to take into consideration the value of your batch.size config (measured in bytes). If it is set too low, the producer sends requests more often and spends more time waiting for replies from the server. This tends to reduce the producer's throughput, although it can improve latency. If you are using an async, callback-based producer, your overall throughput will be limited by how many requests the producer can send before waiting for a reply from the server, which is determined by max.in.flight.requests.per.connection.
If you set batch.size too high, producer throughput can also be affected, since after waiting for the linger.ms period the Kafka producer sends all the messages in a batch for that particular partition to the broker at once. Also, a bigger batch.size means a bigger buffer.memory, which might put pressure on GC.
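The msg/sec to MB/sec conversion is simple enough to write down. The helper names below are made up, 16384 is the documented default for batch.size, and the 1 KiB row size is just an assumed example.

```python
def required_bytes_per_sec(msgs_per_sec, avg_msg_bytes):
    """Byte rate the producer must sustain for a target message rate."""
    return msgs_per_sec * avg_msg_bytes

def records_per_batch(batch_size_bytes, avg_msg_bytes):
    """Roughly how many records fit into one batch (at least one)."""
    return max(1, batch_size_bytes // avg_msg_bytes)

MB = 1024 * 1024
# 100 msg/sec of 1 MB rows versus 100 msg/sec of 1 KiB rows.
print(required_bytes_per_sec(100, 1 * MB) // MB, "MB/sec")
print(required_bytes_per_sec(100, 1024) // 1024, "KiB/sec")
# With the default batch.size of 16384 bytes and 1 KiB rows.
print(records_per_batch(16384, 1024), "records per batch")
```

For small rows, 100 msg/sec is a very low byte rate, so producer tuning alone will not hold the rate at exactly 100; it only determines how the records are grouped into requests.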