Concept of record vs request vs batch in Kafka

I've seen these terms used interchangeably while reading about the parameters of the producer API. Do they refer to the same thing, or is there a conceptual difference?

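A record is the data or message to be sent.
A batch is a group of records for the same topic partition that are sent together. Its size is capped in bytes (batch.size), not by a fixed number of records.
A request is what the producer sends over the network to a broker. A single produce request can carry several batches, typically one per partition led by that broker, which the broker then writes to the topic.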
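One way to picture the relationship is a toy model, sketched below in plain Python with no Kafka client involved. The `Record` class, the `LEADERS` map and the broker names are all made up for illustration; the real producer does this grouping internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    topic: str
    partition: int
    value: bytes


# Hypothetical cluster layout: which broker leads each (topic, partition).
LEADERS = {
    ("orders", 0): "broker-1",
    ("orders", 1): "broker-2",
    ("orders", 2): "broker-1",
}


def build_requests(records):
    """Group records into per-partition batches, then batches into per-broker requests."""
    batches = defaultdict(list)
    for record in records:
        batches[(record.topic, record.partition)].append(record)

    requests = defaultdict(dict)
    for topic_partition, batch in batches.items():
        # One request per broker, holding one batch per partition it leads.
        requests[LEADERS[topic_partition]][topic_partition] = batch
    return {broker: dict(per_partition) for broker, per_partition in requests.items()}


records = [Record("orders", p, f"msg-{i}".encode()) for i, p in enumerate([0, 1, 0, 2, 1])]
requests = build_requests(records)
```

Here five records end up in three batches (one per partition) and two requests (one per leader broker).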

Related

Difference between kafka batch and kafka request

I was not able to find a satisfactory answer anywhere, so apologies if this question looks trivial:
In Kafka, on the producer side, can a request contain multiple batches for different partitions?
I see the words "batch" and "request" used as synonyms in the documentation, and I was hoping to find some clarity on this.
If yes, how does this affect the ack policy?
Are acks on a per-batch or per-request basis?
A Kafka request (and response) is a message sent over the network between a Kafka client and a broker. The Kafka protocol uses many types of requests; you can find them all in the Kafka protocol documentation.
The Produce and Fetch requests are used to exchange records. Both contain Kafka batches, in the RECORDS field of the protocol description. A Kafka batch groups several records together and saves some bytes by sharing metadata across all of them. You can find the exact format of a batch in the documentation.
TLDR:
Requests/responses are the full messages exchanged between Kafka clients and brokers. Some requests contain Kafka batches that are groups of records.
I'm not sure whether you are asking about the producer or the consumer side. Here is some information that might answer your question.
On producer side:
By default, the Kafka producer accumulates records into a batch of up to 16 KB (batch.size) per partition.
By default, the producer allows up to 5 requests in flight per broker connection (max.in.flight.requests.per.connection), so several batches can be on their way to Kafka at the same time. Meanwhile, the producer starts accumulating data for the next batches.
The acks config controls how many replicas must acknowledge a write before the request is considered successful.
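To make the size-based batching concrete, here is a rough sketch of an accumulator in plain Python. It is a simplification and not the producer's actual code: the `Accumulator` class is hypothetical, it counts only payload bytes (the real client also accounts for record overhead), and it ignores time-based sending.

```python
BATCH_SIZE = 16 * 1024  # mirrors the 16 KB default mentioned above


class Accumulator:
    """Toy per-partition accumulator that closes a batch when it would exceed batch_size."""

    def __init__(self, batch_size=BATCH_SIZE):
        self.batch_size = batch_size
        self.open = {}    # partition -> records in the batch still being filled
        self.ready = []   # closed batches waiting to be sent: (partition, records)

    def append(self, partition, value):
        batch = self.open.setdefault(partition, [])
        used = sum(len(v) for v in batch)
        if batch and used + len(value) > self.batch_size:
            # The open batch is full: close it and start a new one.
            self.ready.append((partition, batch))
            batch = self.open[partition] = []
        batch.append(value)

    def flush(self):
        """Close every non-empty open batch."""
        for partition, batch in self.open.items():
            if batch:
                self.ready.append((partition, batch))
        self.open = {}


acc = Accumulator()
for _ in range(5):
    acc.append(0, b"x" * 5000)
acc.flush()
batch_sizes = [len(batch) for _, batch in acc.ready]
```

With 5000-byte records, three fit under 16 KB and the fourth opens a new batch, so the number of records per batch follows from the byte limit rather than being fixed.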
On consumer side:
By default, each call the consumer makes to poll() returns at most 500 records (max.poll.records).
Also by default, the Kafka consumer auto-commits offsets every 5 seconds (auto.commit.interval.ms).
This means the consumer commits the offsets of all the records returned by the poll() calls made during the previous 5 seconds.
Hope this helps!

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to Kafka. I have a Java microservice using a Kafka stream that consumes messages from a Kafka topic written by a producer and processes them. The commit interval is set using auto.commit.interval.ms. My question is: if the microservice crashes before the commit, what happens to the messages that were processed but not committed? Will there be duplicate records? And if so, how can the duplication be resolved?
Kafka supports exactly-once semantics (built on transactions), which guarantees that records get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides several delivery semantics. Which one to choose depends on the use case you have implemented.
If your concern is that messages must not be lost by the consumer service, you should go with at-least-once delivery semantics.
Now, answering your question on the basis of at-least-once delivery semantics:
If your consumer service crashes before committing the Kafka message, the message is delivered again once the service is back up and running, because the offset for that partition was never committed. An offset is committed for a partition only once the consumer has processed the message. Put simply, a committed offset marks the message as processed, so Kafka will not deliver it again to that consumer group for that partition.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or where deduplication is possible on the consumer side. For example, with a unique key in each message, a duplicate can be rejected when it is written to the database.
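As a rough illustration of the unique-key approach, the sketch below uses an in-memory SQLite table with a primary key so that a redelivered message is rejected on insert. The `events` table, the `id` and `payload` fields and the `handle` function are assumptions made up for this example; the same idea applies to any database with a unique constraint.

```python
import sqlite3


def handle(conn, message):
    """Store the message; return False if its id was already stored (a redelivery)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO events (id, payload) VALUES (?, ?)",
                (message["id"], message["payload"]),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists, so this message was processed before.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, payload TEXT)")

# The third message simulates a redelivery after a crash before the offset commit.
delivered = [
    {"id": "a1", "payload": "created"},
    {"id": "a2", "payload": "updated"},
    {"id": "a1", "payload": "created"},
]
results = [handle(conn, m) for m in delivered]
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The duplicate is detected by the database rather than by Kafka, which is the point of handling deduplication on the consumer side.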
There are three main types of delivery semantics:
At most once:
Offsets are committed as soon as the message is received by the consumer.
This is risky: if processing fails, the message is lost.
At least once:
Offsets are committed after the message has been processed, so this is usually the preferred option.
If processing fails, the message is read again because its offset has not been committed.
The drawback is that a message may be processed more than once, so make sure your processing is idempotent. (Yes, your application has to handle duplicates; Kafka won't help here.)
Idempotent means that processing the same message again does not affect your system.
Exactly once:
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API.
That is not your case.
You can choose among the semantics above according to your requirements.
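The difference between at-most-once and at-least-once comes down to whether the offset is committed before or after processing. The following is only a toy simulation in plain Python (no Kafka client, and the `consume` function and its parameters are made up) of a consumer that crashes once and restarts from its last committed offset.

```python
def consume(messages, commit_first, crash_at):
    """Process messages with one simulated crash at offset crash_at, then restart."""
    committed = 0     # offset the consumer resumes from after a restart
    effects = []      # processing side effects that actually happened
    crashed = False
    while committed < len(messages):
        offset = committed
        if commit_first:
            committed = offset + 1                # at most once: commit, then process
        if offset == crash_at and not crashed:
            crashed = True
            if not commit_first:
                # Side effect happened, but the crash prevents the commit.
                effects.append(messages[offset])
            continue                              # simulated crash and restart
        effects.append(messages[offset])
        if not commit_first:
            committed = offset + 1                # at least once: process, then commit
    return effects


at_most_once = consume(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = consume(["a", "b", "c"], commit_first=False, crash_at=1)
```

With commit-first, message "b" is skipped after the restart; with process-first, "b" is processed twice, which is why idempotent processing matters.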

High Performing Kafka Consumer

We have a requirement to consume from a Kafka topic. The topic is provided by the producer team and we have no control over it. The producer publishes a huge volume of messages, more than our consumer can keep up with. However, we only need 5-10% of the volume produced. Currently, the consumer deserializes each message and, based on certain attributes, drops 90-95% of them. The consumer lags by 5-10 lakh (500,000 to 1,000,000) messages for most of the day. We even tried 5 consumers with 30 threads each, without much success.
Is there any way to subscribe the consumer to the topic with some filter criteria, so that we only receive the messages we are interested in?
Any help or guidance would be highly appreciated.
It is not possible to filter messages without consuming and even partially deserializing all of them.
Broker-side filtering is not supported, though it has been discussed for a long time (https://issues.apache.org/jira/browse/KAFKA-6020).
You mentioned that you do not control the producer. However, if you can get the producer to add the attribute you filter by to a message header, you can save yourself the parsing of the message body. You still need to read all the messages, but the parsing can be CPU intensive, so skipping that helps with lag.
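A minimal sketch of the header-based approach, in plain Python with messages modeled as (headers, body) pairs: the `event-type` header name and the JSON body format are assumptions for illustration, since the actual attribute and serialization depend on what the producer team sends.

```python
import json


def filter_by_header(messages, header_key, wanted):
    """Yield parsed bodies only for messages whose header matches; never parse the rest."""
    for headers, body in messages:
        # headers is a list of (key, bytes) pairs; the body stays raw bytes until needed.
        value = dict(headers).get(header_key)
        if value in wanted:
            yield json.loads(body)


messages = [
    ([("event-type", b"order")], b'{"id": 1}'),
    ([("event-type", b"click")], b"large payload that is never parsed"),
    ([("event-type", b"order")], b'{"id": 3}'),
]
kept = list(filter_by_header(messages, "event-type", {b"order"}))
```

The second body is not even valid JSON, which shows that the expensive parsing step is skipped for messages that get dropped. Every message is still read from the broker, as noted above.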

What atomicity guarantees - if any - does Kafka have regarding batch writes?

We're now moving one of our services from pushing data through legacy communication tech to Apache Kafka.
The current logic is to send a message to IBM MQ and retry if errors occur. I want to replicate that, but I have no idea what guarantees the broker provides in that scenario.
Let's say I send 100 messages in a batch through the producer in the Java client library. Assuming the batch reaches the cluster, is it possible that only part of it is accepted (e.g. a disk is full, or some partitions I write to are under-replicated)? Can I detect that problem from my producer and retry only the messages that weren't accepted?
I searched for "kafka atomicity guarantee" but came up empty; maybe there's a well-known term for it.
When you say you send 100 messages in one batch, do you mean that you want to control the number of messages yourself, or are you OK with letting the producer batch a certain number of messages and then send the batch?
I'm not sure you can control the number of messages in one producer batch: the API queues and batches them for you, but with no guarantee that they all end up in the same batch (I'll check that, though).
If you're OK with letting the API batch messages for you, here are some clues about how they are acknowledged.
On the producer side, Kafka offers some reliability guarantees for writes (including batch writes).
As stated in this SlideShare deck:
https://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign (83)
The original list of messages is partitioned (randomly if the default partitioner is used) based on their destination partitions/topics, i.e. split into smaller batches.
Each post-split batch is sent to the respective leader broker/ISR (the individual send()’s happen sequentially), and each is acked by its respective leader broker according to request.required.acks
So, regarding atomicity: given the behavior above, I don't think the whole batch is treated as atomic. You could try sending all the messages with the same key so that they go to the same partition, which might make the write atomic, but I'm not sure about that.
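To illustrate why a logical batch of 100 messages can partially succeed, here is a toy model in plain Python in which the messages are split per partition and each per-partition batch is acknowledged on its own. The `send_all` and `flaky_send` functions and the modulo partitioning are made up for this sketch. In the real Java client, send() reports success or failure per record through its callback or returned future, and the producer retries transient errors itself.

```python
from collections import defaultdict


def send_all(messages, partition_of, send_batch):
    """Split messages by partition, send each batch, return messages whose batch failed."""
    batches = defaultdict(list)
    for message in messages:
        batches[partition_of(message)].append(message)

    failed = []
    for partition, batch in batches.items():
        # Each per-partition batch is acknowledged (or rejected) independently.
        if not send_batch(partition, batch):
            failed.extend(batch)
    return failed


attempts = defaultdict(int)


def flaky_send(partition, batch):
    """Hypothetical broker call: partition 2 rejects its first attempt only."""
    attempts[partition] += 1
    return not (partition == 2 and attempts[partition] == 1)


messages = list(range(100))
first_failed = send_all(messages, lambda m: m % 4, flaky_send)
second_failed = send_all(first_failed, lambda m: m % 4, flaky_send)
```

Only the 25 messages bound for the failing partition need to be resent; the other 75 were already accepted, so the original 100 were not written atomically.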
If you need more clarity about the acknowledgment rules when producing, here is how it works, as stated at https://docs.confluent.io/current/clients/producer.html :
You can control the durability of messages written to Kafka through the acks setting.
The default value of "1" requires an explicit acknowledgement from the partition leader that the write succeeded.
The strongest guarantee that Kafka provides is with "acks=all", which guarantees that not only did the partition leader accept the write, but it was successfully replicated to all of the in-sync replicas.
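Note that the quoted default of "1" applies to older clients; since Apache Kafka 3.0 the producer defaults to acks=all with idempotence enabled.
You can also look into the producer's enable.idempotence setting if you want to avoid duplicates while producing.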
Yannick

Kafka instead of REST for communication between microservices

I want to change the communication between (micro)-services from REST to Kafka.
I'm not sure about the topics and wanted to hear some opinions about that.
Consider the following setup:
I have an API-Gateway that provides CRUD functions via REST for web applications. So I have 4 endpoints which users can call.
The API-Gateway produces the requests and consumes the responses from the second service.
The second service consumes the requests, accesses the database to execute the CRUD operations, and produces the result.
How many topics should I create?
Do I have to create 8 (2 per endpoint (request/response)) or is there a better way to do it?
I would like to hear about your experience, or get links to talks or documentation on this.
The short answer to this question is: it depends on your design.
You can use a single topic for all your operations, or several topics for different operations. However, you should know the following:
You have to produce messages to Kafka in the order in which they were created, and you must consume them in the same order, to maintain consistency. Messages sent to Kafka are ordered within a topic partition; Kafka does not order messages across different topic partitions. Let's say you created an item and then deleted it. If you consume the delete message before the create message, you get an error. In this scenario, you must send both messages to the same topic partition to ensure that the delete message is consumed after the create message.
Please note that there is always a trade-off between consistency and throughput. If you use a single topic partition and send all your messages to it, you get consistency but you cannot consume messages quickly, because you read from that partition sequentially and only get the next message once the previous one has been consumed. To increase throughput, you can use multiple topics or divide the topic into partitions. For both solutions you must implement some logic on the producer side to preserve consistency: related messages must go to the same topic partition.
For instance, you can give the topic one partition per entity type and send the CRUD messages for a given entity type to the same partition. I don't know whether that ensures consistency in your scenario, but it is an option. You need to find the logic that provides consistency across multiple topics or topic partitions; it depends on your case. If you can find it, you get both consistency and throughput.
For your case, I would use a single topic with multiple partitions, and on the producer side I would send related messages to the same topic partition.
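Sending related messages to the same partition is usually done by giving them the same key. The sketch below shows the idea in plain Python with a stand-in hash: `partition_for` and the `item-N` keys are hypothetical, and zlib.crc32 is used only because it is stable across runs (the Java client's default partitioner uses a murmur2 hash of the serialized key instead).

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition deterministically, so equal keys share a partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
events = [
    ("item-1", "create"),
    ("item-2", "create"),
    ("item-1", "update"),
    ("item-1", "delete"),
]

partitions = defaultdict(list)
for key, operation in events:
    # Appending preserves the produce order within each partition.
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, operation))

item_1_ops = [op for key, op in partitions[partition_for("item-1", NUM_PARTITIONS)] if key == "item-1"]
```

All operations on item-1 land in one partition in their original order, while different items can spread across partitions and be consumed in parallel.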
--regards