How does Kafka decide which records are contained in the consumer poll loop when there are more than `max.poll.records` records left? - apache-kafka

I have a Kafka consumer group consuming several topics (each topic has more than one partition). All topics contain a considerable amount of records on each partition.
I'm currently trying to make sense of the behavior when the consumer initially starts consuming.
In particular, I'd like to know how the broker decides which records reach the client first.
The following aspects are noteworthy:
There are a lot more records than the consumer can process in one single roundtrip (i.e. more records than the consumer's max.poll.records configuration)
There are records from several topics and several partitions that the consumer has to read
I naively assumed that the broker returns records for each topic in each poll loop, so that the consumer reads all the topics at a similar pace. This doesn't seem to be the case though. Apparently it prioritizes records for a single topic at a time, switching the topic without an obvious pattern (at least that's what I'm seeing in the metrics of my consumer).
I couldn't find anything in the consumer config parameters that allows me to change this behavior. It's not really a problem, because all messages get read eventually. But I would like to understand the behavior in more detail.
So my question is: How does the broker decide which records end up in the result of a consumer's poll loop?

Consumers fetch records from Kafka using Fetch requests.
If you look at the protocol, this API is pretty complex and has many fields, but we can focus on a few that are relevant to your question:
max_wait_ms: This indicates how long the broker should wait in case there are no (or not enough) records available. This is configurable using fetch.max.wait.ms.
min_bytes: This indicates how much data (the total size of records) the broker needs to accumulate before responding. This is configurable using fetch.min.bytes.
max_bytes: This indicates the maximum size of a response. This is configurable using fetch.max.bytes.
As soon as the broker hits one of these limits, it will send a response back.
The Fetch request also indicates which partitions the consumer wants to read. For each partition, there is a partition_max_bytes field that indicates the maximum size to return for that partition. This is configurable using max.partition.fetch.bytes.
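To make the mapping concrete, here is a rough consumer-configuration sketch (the values, group id and addresses are arbitrary placeholders, not recommendations) showing the client-side settings that feed these Fetch request fields:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

// max_wait_ms: how long the broker may wait when there isn't enough data yet
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
// min_bytes: how much data the broker should accumulate before responding
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);
// max_bytes: upper bound on the size of the whole Fetch response
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 50 * 1024 * 1024);
// partition_max_bytes: upper bound on the data returned for each partition
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1024 * 1024);

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
```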
In the past, Fetch requests contained the full list of partitions. The broker would iterate the list in order until it reached one of the limits mentioned above.
Since 1.1 (KIP-227), it's a bit more complicated as consumers use fetch sessions to avoid sending the full list in every fetch request. To keep it simple, brokers use FetchSessions to keep an iterator on the partition list to ensure records are fetched from all partitions fairly.
Now let's look at the client side ...
At this point, you may have noticed that I've not mentioned max.poll.records. This setting is only used on the client side. Consumers try to fetch records efficiently. So even if you set max.poll.records=1, a consumer may fetch records in large batches, keep them in memory and only return 1 record each time poll() is called. This avoids sending many small requests and overloading brokers unnecessarily.
The consumer also keeps track of the records it has in memory. If it already has buffered records for a partition, it does not include that partition in the next Fetch request.
So while each Fetch response may not include data from all partitions, over a period of time all partitions should be fetched fairly.
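As a rough illustration (reusing the props from the sketch above, with a placeholder topic name), setting max.poll.records=1 caps what each poll() returns even though the underlying Fetch responses may deliver far more data into the consumer's buffer:

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

// max.poll.records is purely client-side: it caps what poll() hands back,
// not what the Fetcher pulls and buffers from the brokers.
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    // records.count() is at most 1 here, regardless of how many records
    // the last Fetch response contained.
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("%s-%d @ %d%n", record.topic(), record.partition(), record.offset());
    }
}
```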
I've simplified the process to keep it short but if you want to dive into this logic, I'd recommend checking the following classes:
Fetcher.java: This is the client side logic that determines what to fetch from brokers and what to return in poll().
ReplicaManager.scala: This is the server side logic that determines what to return in a Fetch response. See fetchMessages().
FetchSession.scala: This is the session logic introduced by KIP-227

Related

How does one Kafka consumer read from more than one partition?

I would like to know how one consumer consumes from more than one partition, specifically, in what order are messages read from the different partitions?
I had a peek at the source code (Consumer, Fetcher) but I can't really follow all of it.
This is what I thought would happen:
Partitions are read sequentially. That is: all the messages in one partition will be read before continuing to the next. If we reach max.poll.records without consuming the whole partition, the next fetch will continue reading the current partition until it is exhausted, before going on to the next.
I tried setting max.poll.records to a relatively low number and seeing what happens.
If I send messages to a topic and then start a consumer, all the messages are read from one partition before continuing to the next, even if the number of messages in that partition is higher than max.poll.records.
Then I tried to see if I could "lock" the consumer in one partition, by sending messages to that partition continuously (using JMeter). But I couldn't do it: messages from other partitions were also being read.
The consumer is polling for messages from its assigned partitions in a greedy round-robin way.
e.g. if max.poll.records is set to 100 and there are 2 assigned partitions, A and B, the consumer will try to poll 100 records from A. If partition A doesn't have 100 available messages, it will poll whatever is left to complete the 100 from partition B.
Although this is not ideal, this way no partition will be starved.
This also explains why ordering is not guaranteed across partitions.
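If you want to see this for yourself, a small sketch like the following (assuming a KafkaConsumer<String, String> named consumer already subscribed to the topic, with max.poll.records set fairly low) prints how many records each poll() drew from each partition:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    Map<TopicPartition, Integer> perPartition = new HashMap<>();
    for (ConsumerRecord<String, String> record : records) {
        perPartition.merge(new TopicPartition(record.topic(), record.partition()), 1, Integer::sum);
    }
    // Typically one partition dominates a given poll (greedy), but over many
    // polls every assigned partition shows up (round-robin).
    System.out.println(perPartition);
}
```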
I have read the KIP mentioned in the answer to the question linked in the comments and I think I finally understood how the consumer works.
There are two main configuration options that affect how data is consumed:
max.partition.fetch.bytes: the maximum amount of data that the server will return for a given partition
max.poll.records: the maximum amount of records that are returned each time the consumer polls
The process of fetching from each partition is greedy and proceeds in a round-robin way. Greedy means that as many records as possible will be retrieved from each partition; if all records in a partition occupy less than max.partition.fetch.bytes, then all of them will be fetched; otherwise, only max.partition.fetch.bytes will be fetched.
Now, not all the fetched records will be returned in a poll call. Only max.poll.records will be returned.
The remaining records will be retained for the next call to poll.
Moreover, if the number of retained records is less than max.poll.records, the poll method will start a new round of fetching (pre-fetching) before returning. This means that, usually, the consumer is processing records while new records are being fetched.
If some partitions receive considerably more messages than others, this could lead to the less active partitions not being processed for long periods of time.
The only downside to this approach is that it could lead to some partitions going unconsumed for an extended amount of time when there is a large imbalance between the partition's respective message rates. For example, suppose that a consumer with max messages set to 1 fetches data from partitions A and B. If the returned fetch includes 1000 records from A and no records from B, the consumer will have to process all 1000 available records from A before fetching on partition B again.
In order to prevent this, we could reduce max.partition.fetch.bytes.
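For example (assuming a Properties object named props used to build the consumer; the value is only illustrative):

```java
// Hypothetical mitigation: lower the per-partition fetch size so one very busy
// partition cannot fill the fetch buffer and starve the quieter ones.
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 64 * 1024); // 64 KB instead of the 1 MB default
```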

Kafka fetch max bytes doesn't work as expected

I have a topic with about 1 GB of messages. A Kafka consumer decides to consume these messages. What could I do to prohibit the consumer from consuming all messages at once? I tried to set fetch.max.bytes on the broker to 30 MB to allow only 30 MB of messages in each poll. The broker doesn't seem to honor that and tries to give all messages at once to the consumer, causing an out-of-memory error on the consumer. How can I resolve this issue?
Kafka configurations can be quite overwhelming. Typically in Kafka, multiple configurations can work together to achieve a result. This brings flexibility, but flexibility comes with a price.
From the documentation of fetch.max.bytes:
Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress.
On the consumer side alone, there are more configurations to consider for bounding memory usage, including:
max.poll.records: limits the number of records retrieved in a single call to poll. Default is 500.
max.partition.fetch.bytes: limits the number of bytes fetched per partition. This should not be a problem as the default is 1MB.
As per the information in KIP-81, the memory usage in practice should be something like min(num brokers * max.fetch.bytes, max.partition.fetch.bytes * num_partitions).
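For example, with hypothetical values of 3 brokers, fetch.max.bytes = 50 MB (the default), 20 assigned partitions and max.partition.fetch.bytes = 1 MB (the default), that bound works out to min(3 × 50 MB, 20 × 1 MB) = 20 MB.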
Also, in the same KIP:
The consumer (Fetcher) delays decompression until the records are returned to the user, but because of max.poll.records, it may end up holding onto the decompressed data from a single partition for a few iterations.
I'd suggest you also tune these parameters and hopefully this will get you into the desired state.

KafkaProducer send a list of messages or break list into individual messages

Is it okay to batch 100 messages into a single object and send those objects to kafka or should I split those 100 messages into individual messages and then put them in kafka
Say for example, I have an object that contains a List. I can put 100 strings in that list and send the object to kafka. Is it better to do it that way or should i split the list of strings and send individual strings to kafka instead
What are some pros and cons to the above approaches
Batching is generally good for asynchronous processing, until you need to partially process the batch in case of errors.
If you are processing an order and the list of 100 items belongs to that order, send them together, as they will be processed together. If you are sending 100 orders and will process them independently, send them one by one, as an error in one order should not block the others.
As for message sizes, Kafka has some message size limits, but these are configurable.
You definitely need to clarify your question.
You want to send a huge message that is larger than the max.message.bytes configuration of your Kafka broker (let's assume you can't change it). You break it down and put it back together on the consumer side.
This would require working around the limitations of Kafka as it stands today. For example:
Should your consumer process all these 100 strings as if they were one batch? When should your consumer decide to commit the offsets for these messages? Is your consumer processing idempotent? Do you have one consumer or multiple consumer instances? What if the 100 strings were split across 5 partitions? Which consumer gets which subset of these 100 strings?
One approach is to create 100 messages, all with the same batch id, like so:
(batch1:message1, batch1:message2, batch1:message3)
On the consumer side collect all these messages with the same key
(batch1: (message1, message2, message3))
But how would you know when the batch ends? Does the sequence message1, message2, message3 matter?
So you do something like this:
(batch1:message1of3, batch1:message2of3, batch1:message3of3)
Now, what if you received message1of3 and message2of3 but not message3of3? How long do you wait for it?
As you can see, at each step there are multiple ways to go about this and you will have to make the choices that are right for your problem. Perhaps you will use timeouts; perhaps in your case batches are interleaved, like this:
(batch1:message1of3, batch2:message2of5, batch1:message2of3...)
Expect to make some compromises. With Kafka, your consumer group is guaranteed to receive all messages, and while it's running, each consumer is assigned one or more partitions (meaning a single partition is not assigned to more than one consumer at the same time). Kafka will also assign messages with the same key to the same partition. With these two properties in mind you can design a system that consumes messages in batches, with some obvious trade-offs and limitations.
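To make the batch-id-plus-sequence idea concrete, here is a rough producer-side sketch (topic name, key format and payloads are made up for illustration). Using the batch id as the record key sends every part to the same partition, so their relative order is preserved for the consumer reassembling them:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    String batchId = "batch1";
    int total = 3;
    for (int i = 1; i <= total; i++) {
        // Same key for every part => same partition => parts arrive in order.
        // The "iofN" suffix lets the consumer detect when the batch is complete.
        String value = "message" + i + "of" + total; // e.g. "message1of3"
        producer.send(new ProducerRecord<>("my-topic", batchId, value));
    }
}
```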

How does a kafka process schedule writes to different partition?

Imagine a scenario where we have 3 partitions belonging to 3 different topics on a machine which runs a kafka process/broker. This broker will receive messages for all three partitions. It will store them on different log subdirectories. My question is how does the kafka broker schedule these writes? How does it decide which partition/topic will be written next?
For ordering over requests, here is roughly how the broker internally handles produce requests:
There are a number of network threads that pull bytes off the network layer and convert them into internal requests. These requests are then placed in a FIFO request queue, from where the IO threads pull them and append the contained messages to the relevant partitions. So, in short, messages are processed in the order in which they are received.
Looking through the code, I am unsure whether there may be potential for a race condition here, where a smaller request could "overtake" a larger request that was sent immediately before it. However, even if this were possible, it is an extremely unlikely edge case that I can't see ever occurring for a single producer. Maybe someone with a better understanding of the code can weigh in here?
As for ordering of batched messages within one request: the request stores messages internally in a HashMap keyed by TopicPartition. Since, as far as I am aware, a Scala HashMap does not preserve the insertion order of its elements, I don't think there are any guarantees around the order in which multiple partitions in one request get processed - which is fine, as ordering is only guaranteed to be preserved within a partition.
Within each partition, messages are processed in the order they were given to the producer before sending.

Apache Kafka order of messages with multiple partitions

As per the Apache Kafka documentation, ordering of messages can only be achieved within a partition of a topic. In this case, what parallelism benefit are we getting, and isn't it then equivalent to traditional MQs?
In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned based on user_id, and consider 4 messages having user_ids 1, 2, 3 and 4. Assume that you have a "users" topic with 4 partitions.
Since partitioning is based on user_id, assume that message having user_id 1 will go to partition 1, message having user_id 2 will go to partition 2 and so on..
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case as soon as 4 messages are pushed, they are immediately consumed by the consumers.
If you had 2 consumers for the topic instead of 4, then each consumer will be handling 2 partitions and the consuming throughput will be almost half.
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
i.e., if consumption is very slow in partition 2 and very fast in partition 4, then the message with user_id 4 will be consumed before the message with user_id 2. This is how Kafka is designed.
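As a rough sketch of that scenario (broker address and payloads are placeholders), keying each record by user_id means the default partitioner hashes the key, so all messages for one user land in the same partition and stay ordered relative to each other, while different users can be consumed in parallel:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    for (int userId = 1; userId <= 4; userId++) {
        // The key (user_id) determines the partition, so each user's events
        // are totally ordered within their partition, but not across users.
        producer.send(new ProducerRecord<>("users", String.valueOf(userId), "event for user " + userId));
    }
}
```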
I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the former, there's not much you can do: you should use 1 partition and lose all the parallelism ability.
But in the latter case, you might consider partitioning your messages by some key; all messages for that key will then arrive at one partition (they might actually go to a different partition if you resize the topic, but that's a different case), which guarantees that all messages for that key are in order.
In Kafka, messages with the same key, from the same producer, are delivered to the consumer in order.
On top of that, data within a partition is stored in the order in which it is written; therefore, data read from a partition is read in order for that partition.
So if you want your messages in order across multiple partitions, you really need to group your messages by a key, so that messages with the same key go to the same partition and, within that partition, the messages are ordered.
In a nutshell, you will need to design a two-level solution like the above to get messages ordered across multiple partitions.
You may consider having a field that carries the timestamp/date at the time the dataset is created at the source.
Once the data is consumed, you can load it into a database. The data then needs to be sorted at the database level before the dataset is used for any use case. Well, this is an attempt to help you think in multiple ways.
Let's say the message key is the timestamp generated at the time the data is created, and the value is the actual message string.
As and when a message is picked up by the consumer, it is written into HBase with the row key set to the Kafka key and the value set to the Kafka value.
Since HBase is a sorted map, having the timestamp as the key automatically keeps the data in order. You can then serve the data from HBase to the downstream apps.
In this way you are not losing the parallelism of Kafka. You also retain the ability to sort and apply multiple processing steps to the data at the database level.
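A very rough sketch of that pipeline (the HBase table, column family and qualifier names are made up, error handling is omitted, and a KafkaConsumer<String, String> named consumer subscribed to the topic is assumed) might look like this, with the Kafka key used as the HBase row key so the table stays sorted by timestamp:

```java
import java.time.Duration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

Configuration hbaseConf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(hbaseConf);
     Table table = connection.getTable(TableName.valueOf("events"))) { // hypothetical table
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            // Row key = Kafka key (a timestamp), so HBase keeps rows sorted by it.
            Put put = new Put(Bytes.toBytes(record.key()));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(record.value()));
            table.put(put);
        }
    }
}
```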
Note: no distributed message broker guarantees overall ordering. If you insist on that, you may need to consider another message broker, or use a single partition in Kafka, which is not a good idea. Kafka is all about parallelism, achieved by increasing partitions or consumer groups.
Traditional MQs work in a way such that once a message has been processed, it gets removed from the queue. A message queue allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually allow for some level of transaction when pulling a message off, to ensure that the desired action was executed before the message gets removed.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
Well, this is an old thread, but still relevant, hence decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then the same strict ordering should be maintained while consuming the messages. There is absolutely no point in ordering messages in the queue if they are not consumed in that order. Kafka allows the best of both worlds: it preserves the order of messages within a partition from production through consumption, while allowing parallelism between multiple partitions. Hence, if you need:
Absolute ordering of all events published on a topic: use a single partition. You will not have parallelism, nor do you need it (again, parallelism and strict ordering don't go together).
Ordering only among related messages: go for multiple partitions and consumers, and use consistent hashing (keying) to ensure that all messages which need to follow a relative order go to a single partition.