Kafka publishing in multiple threads to the same partition - apache-kafka

I have a few thousand records to post to Kafka on the same partition in one transaction. I am doing this using Spring's KafkaTemplate. To improve the performance of my current logic, I am considering publishing to Kafka from multiple threads. All the events to be published have the same key and are intended to go to the same partition. Will using multiple threads result in offset conflicts between them? Should I stick to one thread doing all the publishing?

The transaction is bound to the thread, so you'll end up with multiple transactions.
Have you tried increasing the linger.ms producer property?
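For reference, a minimal sketch (not the poster's actual setup; the bootstrap address, values and transaction-id prefix are only illustrative) of a Spring Kafka producer factory tuned for batching, which often helps more than adding threads:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.core.DefaultKafkaProducerFactory;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.core.ProducerFactory;

    @Configuration
    public class BatchingProducerConfig {

        @Bean
        public ProducerFactory<String, String> producerFactory() {
            Map<String, Object> props = new HashMap<>();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.LINGER_MS_CONFIG, 50);          // wait up to 50 ms to fill a batch
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);  // 64 KB batches per partition
            DefaultKafkaProducerFactory<String, String> pf = new DefaultKafkaProducerFactory<>(props);
            pf.setTransactionIdPrefix("tx-");                        // enables transactional sends via KafkaTemplate
            return pf;
        }

        @Bean
        public KafkaTemplate<String, String> kafkaTemplate(ProducerFactory<String, String> pf) {
            return new KafkaTemplate<>(pf);
        }
    }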

We are using a multi-threaded approach in a Spring app to publish messages to the same Kafka topic, and no issues have been reported so far. Kafka is a commit-log based system: the broker appends new messages to the partition log and assigns the offsets itself; consumers track their own positions against those offsets.
Your approach is the same as multiple producers sending messages simultaneously to a topic with the same key. Kafka can handle this scenario since there is an elected partition leader that serializes all writes to the partition.
There is also producer-side buffering: produced messages are batched in the producer's buffer and flushed when the batch fills up or the linger time expires. So Kafka already has mechanisms to handle a flood of messages with the same key.
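To illustrate what the multi-threaded setup looks like, a sketch only (topic name, key and thread count are made up, and transactions are left out): a single KafkaTemplate is thread-safe and can be shared by every publishing thread.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.springframework.kafka.core.KafkaTemplate;

    public class ParallelPublisher {
        // KafkaProducer (and therefore KafkaTemplate) is thread-safe, so one
        // instance can be shared by all threads; the broker assigns offsets on append.
        private final KafkaTemplate<String, String> template;

        public ParallelPublisher(KafkaTemplate<String, String> template) {
            this.template = template;
        }

        public void publishAll(Iterable<String> payloads) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (String payload : payloads) {
                // Same key => same partition. Note that the order in which the
                // threads' sends reach the partition log is not defined.
                pool.submit(() -> template.send("my-topic", "process-42", payload));
            }
            pool.shutdown();
        }
    }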

Related

Is it possible to control how often a Spring Kafka Message Listener switches between its assigned partitions?

When a Spring Kafka MessageListener is consuming messages from multiple partitions, it keeps processing messages from one partition until there are no more, and only then continues with the next partition (based on my observations).
Is it possible to set a maximum number of messages/batches and tell the listener to switch to the next partition sooner rather than later?
This would improve fairness and let it consume evenly from all assigned partitions.
"switch faster to the next partition, consume evenly from all assigned partitions"
I don't think Kafka has any consumer properties for this (see the Kafka consumer config docs).
It is a bit of an odd requirement, though. You can think of a partition replica in Kafka as a log file. Your consumer's poll runs in one thread, so for better performance it should read from one file, and the next poll can read from another file, rather than splitting each poll to consume evenly across many partitions. Eventually you still need to consume all of the messages on the topic anyway.
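To make the observed behaviour concrete, here is a minimal sketch with the plain KafkaConsumer API (consumer configuration and subscription are assumed): within a single poll the records come back grouped by partition, and how many come from each partition depends on what the fetch returned.

    import java.time.Duration;
    import java.util.List;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    class PollExample {
        // Assumes a consumer that is already configured and subscribed.
        static void pollOnce(KafkaConsumer<String, String> consumer) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            // One poll can contain records from several assigned partitions; how many
            // come from each depends on the broker fetch, which is why consumption
            // often looks like "one partition at a time".
            Set<TopicPartition> partitionsInThisPoll = records.partitions();
            for (TopicPartition tp : partitionsInThisPoll) {
                List<ConsumerRecord<String, String>> fromOnePartition = records.records(tp);
                // process fromOnePartition ...
            }
        }
    }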

What is the difference between Pulsar and Kafka with regard to consumption?

In order to consume data from Kafka, we can have multiple consumers on a topic, totally decoupled. What, then, is meant by "no shared consumption" on the page (https://streaml.io/blog/pulsar-streaming-queuing) that describes the differences between Kafka and Pulsar?
In his blog, Sijie is referring to shared messaging as queuing. With queuing messaging, multiple consumers are created to receive messages from a single topic. Which consumer gets the message is completely random.
The issue with implementing this messaging pattern on Kafka lies in the way Kafka consumers mark what they have consumed. Kafka consumers use what's called a high watermark for consumer offsets. That means a consumer can only say, "I've processed up to this point," rather than, "I've processed this message."
Consider the scenario in which multiple Kafka consumers from the same consumer group are processing the same topic partition and one of the consumers fails due to an exception while the others succeed. Because Kafka does not have a built-in way to acknowledge a single message, and only uses a high-water mark, the failed message would be erroneously marked as consumed, when in fact it failed and needs to be either reprocessed or published to an error queue, etc.
To avoid this situation you would need just a single consumer per partition, which limits the consumption throughput of the topic. That in turn requires you to increase the number of partitions to meet your throughput needs.
There is a detailed explanation in this blog post.
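To see why commits are positional, a small sketch with the plain consumer API (topic, partition and offset are placeholders): a commit records a position in the partition, not an acknowledgement of one particular message.

    import java.util.Collections;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    class CommitExample {
        static void commitPosition(KafkaConsumer<String, String> consumer) {
            TopicPartition tp = new TopicPartition("orders", 0);
            // Committing offset 43 says "everything before offset 43 is processed".
            // There is no way to acknowledge a single message in the middle.
            consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(43L)));
        }
    }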

Kafka message partitioning by key

We have a business process/workflow that is started when an initial event message is received and closed when the last message is processed. We have up to 100,000 processes executed each day. My problem is that the messages belonging to a specific process have to be processed in the same order they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes continue. For this kind of situation I am thinking of using Kafka. The first solution that came to my mind was to partition the topic by message key, with the ProcessId as the key. That way I could be sure all messages of a process end up in the same partition and Kafka would guarantee their order. As I am new to Kafka, what I managed to figure out is that partitions have to be created in advance, and that makes everything too difficult. So my questions are:
1) When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
2) There can be more than 100,000 active partitions on the topic; is that a problem?
3) Can a partition be deleted after all messages from it were read?
4) Can you suggest other approaches to my problem?
When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
You need to specify the number of partitions when creating a topic. New partitions won't be created automatically (as is the case with topic creation); you have to change the number of partitions using the topic tool.
More info: https://kafka.apache.org/documentation/#basic_ops_modify_topic
As soon as you increase the number of partitions, producers refresh their metadata and consumers rebalance, after which they start producing to and consuming from the new partitions.
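If you would rather do this from code than with the topic tool, a sketch using the Kafka AdminClient (topic name and target partition count are made up):

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class AddPartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // Partition counts can only be increased, never decreased.
                admin.createPartitions(Map.of("process-events", NewPartitions.increaseTo(10)))
                     .all()
                     .get(); // block until the broker has applied the change
            }
        }
    }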
There can be more than 100,000 active partitions on the topic; is that a problem?
Yes, having that many partitions will increase overall latency.
Go through how-choose-number-topics-partitions-kafka-cluster for guidance on deciding the number of partitions.
Can a partition be deleted after all messages from it were read?
Deleting a partition would lead to data loss, and the keys of the remaining data would no longer be distributed correctly, so new messages would not be directed to the same partitions as existing messages with the same key. That is why Kafka does not support decreasing the partition count of a topic.
The Kafka documentation also states:
Kafka does not currently support reducing the number of partitions for a topic.
I think you have chosen the wrong feature for your task.
In general, partitioning is used for load balancing.
Incoming messages are distributed over the given number of partitions according to the partitioning strategy configured on the producer. In short, the default strategy just calculates i = hash(key) mod number_of_partitions and puts the message into the i-th partition. You can read more about strategies here.
Message ordering is guaranteed only within a partition. With two messages from different partitions you have no guarantee which one reaches the consumer first.
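To illustrate the default strategy described above (a sketch; topic name and key are made up): all records sent with the same key hash to the same partition, where their order is preserved.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The default partitioner hashes the key, so both records of
                // "process-12345" land on the same partition, in send order.
                producer.send(new ProducerRecord<>("process-events", "process-12345", "step-1"));
                producer.send(new ProducerRecord<>("process-events", "process-12345", "step-2"));
            }
        }
    }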
You could probably use consumer groups instead. A group is a consumer-side concept:
Each group consumes all messages from the topic independently.
A group can consist of one consumer, or more if you need it.
You can have many groups and add a new group (in fact, add a new consumer with a new groupId) dynamically.
Since you can stop or pause any consumer, you can manually stop all consumers belonging to a given group. I suppose there is no single command to do that, but I'm not sure. In any case, if you have a single consumer in each group you can stop it easily.
If you want to remove a group, you just shut down and discard the related consumers. No action is needed on the broker side.
As a drawback, you end up with 100,000 consumers reading a (single) topic, which is a heavy network load at the very least.
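A sketch of the per-process group idea (group id, topic name and the skip-or-process logic are assumptions): each process gets its own groupId, so its consumer reads the whole topic independently and can be started or stopped on its own.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ProcessGroupConsumer {
        public static void main(String[] args) {
            String processId = args.length > 0 ? args[0] : "12345"; // hypothetical process id
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "process-" + processId); // one group per process
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("process-events"));
                while (true) {
                    for (ConsumerRecord<String, String> msg : consumer.poll(Duration.ofSeconds(1))) {
                        // This group sees every message; handle only those belonging
                        // to this process and ignore the rest.
                    }
                }
            }
        }
    }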

How to maintain ordering of messages in a Kafka active-active site

I have a business requirement to maintain message ordering across active-active sites, and I am planning to use Kafka for this.
The producer puts messages into JMS/MQ, which are then consumed into Kafka.
So when the producer places a batch of 1 million messages in MQ/JMS, is it possible to maintain the sequence of messages in a geographically distributed active-active Kafka cluster?
(assuming we have one partition and one consumer per topic)
Thanks in advance
Yes, the order of messages per partition of a topic is preserved. Between different topics there are no guarantees. So if your entire batch is sent to the same single-partition topic by one producer, yes the order will be preserved. There are some nuances of the configuration that you should be aware of, for instance the ordering guarantee will not hold if max inflight requests per connection > 1 and retries are enabled. The defaults, however, are safe. For more details look for "max.in.flight.requests.per.connection" in https://kafka.apache.org/documentation/#configuration
If your setup has redundant producers with failover, then you may want to consider enabling idempotence.
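As a sketch of the settings mentioned above (the bootstrap address is a placeholder), enabling idempotence preserves per-partition ordering even with retries and more than one in-flight request:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public final class OrderingSafeProducerProps {
        // Serializers etc. omitted; this only shows the ordering-related settings.
        static Properties build() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
            props.put(ProducerConfig.ACKS_CONFIG, "all");                        // required by idempotence
            props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);  // up to 5 is safe when idempotent
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);         // retries no longer reorder
            return props;
        }
    }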

Kafka - Multiple consumers (only one active) on same group/topic

Is it possible to have multiple copies of an application listen to the same Kafka group/topic so that only one is reading it at a time, but the other ones will start working if the main one crashes/stops reading?
I need to make an application highly available but can't tolerate doubling the traffic to the data store on the other end of the application by having multiple copies actively running.
FYI - Technically I'm using MapR streams but it adheres to the Kafka API and functionality, in case anyone knows a MapR stream-specific feature that helps the situation.
It is possible. If multiple consumers are in the same consumer group, then when the group subscribes to a topic, Kafka assigns the partitions among those consumers: each partition is consumed by only one consumer in the group.
So you could give your topic a single partition; then only one consumer will consume messages and the others will sit idle. Once that consumer shuts down, a group rebalance is triggered: Kafka does the partition assignment again, and a new consumer takes over the work, processing messages from the last offset committed by the old consumer.
And if your workload supports parallel processing, you can run many copies of the app doing the same work and give the topic multiple partitions. The copies will be assigned different partitions and process different messages, which speeds up processing while still tolerating failures. As said above, if some consumers fail, Kafka takes care of it for you by reassigning their partitions to the remaining working consumers. So everything will be fine.
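A minimal sketch of this pattern with the plain consumer API (group id and topic name are made up): run the same code in every copy of the application. With a one-partition topic and a shared group.id, only one copy is assigned the partition at a time; if it dies, a rebalance hands the partition to another copy, which resumes from the last committed offset.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class HotStandbyConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-app");   // same group id in every copy
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("single-partition-topic"));
                while (true) {
                    for (ConsumerRecord<String, String> msg : consumer.poll(Duration.ofSeconds(1))) {
                        // Only the copy that currently owns the partition receives records.
                        // Process msg here; offsets are auto-committed by default.
                    }
                }
            }
        }
    }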