librdkafka producer's internal queues - how do they work? - apache-kafka

I had a few questions about a Go Kafka producer that uses librdkafka.
These are based on the logs I am seeing in the producer log when I set debug: all.
The producer spends some time building message sets once the batch threshold is reached or linger.ms has passed. However, what happens almost all the time is that messages are moved from the partition queue to the xmit queue. I was trying to find some documentation on this, but could not find much, so I wanted to check if I can get some help here. My questions are the following (a rough config sketch follows the list):
a) Do the application's produce calls write to a partition-specific queue (or queues)?
b) Is there one xmit queue and one partition queue per partition?
c) What triggers a transfer from the partition queue to the xmit queue, and why do we need two queues?
d) When the Kafka producer is creating message sets for a partition, does it block all other operations for that partition (such as moving messages from the partition queue to the xmit queue)? In short, while message sets are being built for a partition, can new messages sneak into the xmit queue, or is it blocked?
e) How many threads work on creating message sets - one per producer, or one per partition?
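
For context, here is roughly how I am configuring the producer (a minimal sketch assuming the confluent-kafka-go client; the broker address, topic name and thresholds are placeholders). These are the batching-related settings whose effect I see in the debug log:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	// Batching-related settings (librdkafka config keys); values are placeholders.
	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092", // placeholder broker
		"debug":              "all",            // produces the queue / message-set log lines discussed above
		"linger.ms":          50,               // how long to wait before building a message set
		"batch.num.messages": 10000,            // max messages per message set
	})
	if err != nil {
		panic(err)
	}
	defer p.Close()

	topic := "test-topic" // placeholder topic
	for i := 0; i < 100000; i++ {
		// Produce() is asynchronous: it only enqueues the message on the
		// producer's internal (per-partition) queue and returns immediately.
		_ = p.Produce(&kafka.Message{
			TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
			Value:          []byte(fmt.Sprintf("message-%d", i)),
		}, nil)
	}
	p.Flush(15000) // wait up to 15s for the internal queues to drain
}
```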

Related

Is it possible to control how often a Spring Kafka Message Listener switches between its assigned partitions?

When a Spring Kafka MessageListener is consuming messages from multiple partitions, it keeps processing messages from one partition until there are no more messages, and only after that does it continue with the next partition (based on my observations).
Is it possible to set a max number of messages/batches and tell the Listener to switch faster to the next partition rather than later?
This would improve fairness and consume evenly from all assigned partitions.
I don't think Kafka has any properties for this (see the kafka consumer config reference).
It's a bit odd to want that. You can think of a partition replica in Kafka as a log file. Your consumer poll runs in one thread; for better performance it should consume from one file, and the next poll can consume from another file, rather than splitting each poll across many partitions to consume evenly, right? Eventually, you still need to consume all of the messages on the topic.

Messages vanish between kafka producer and consumer

I have a very simple embedded Kafka application: one producer thread and two consumer threads that write to a Postgres DB. The three threads run in a single process. I am using librdkafka to implement my consumer and producer, and I run Apache Kafka as the broker. Message size is approximately 2 kB. I have two counters: one incremented when I write (rd_kafka_produce) and another incremented when I read (rd_kafka_consume_batch). If I run my producer fast enough (over 30000 messages/second), the producer counter ends up much larger than the consumer counter (by 15% or so if I run for 30 seconds). So I am losing messages somewhere. My first question is how to debug such a problem? The second is what is the most probable cause, and how can I fix it?
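
One thing worth ruling out when the two counters diverge: a produce call that fails locally (for example because the internal queue is full) can still bump a naive "produced" counter even though nothing was sent. Below is a minimal sketch of counting only broker-confirmed deliveries; it uses the Go confluent-kafka-go client rather than the C API from the question, and the broker address, topic and message counts are placeholders.

```go
package main

import (
	"sync/atomic"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// producedOK counts only messages whose delivery was confirmed by the broker;
// deliveryFailed counts messages that were rejected or never enqueued.
var producedOK, deliveryFailed int64

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder
	})
	if err != nil {
		panic(err)
	}
	defer p.Close()

	// Drain delivery reports; a message only "counts" once the broker acks it.
	go func() {
		for e := range p.Events() {
			if m, ok := e.(*kafka.Message); ok {
				if m.TopicPartition.Error != nil {
					atomic.AddInt64(&deliveryFailed, 1)
				} else {
					atomic.AddInt64(&producedOK, 1)
				}
			}
		}
	}()

	topic := "test-topic" // placeholder
	for i := 0; i < 1000000; i++ {
		err := p.Produce(&kafka.Message{
			TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
			Value:          make([]byte, 2048), // ~2 kB payload, as in the question
		}, nil)
		if err != nil {
			// e.g. local queue full: this message was never sent at all.
			atomic.AddInt64(&deliveryFailed, 1)
		}
	}
	p.Flush(30000) // give outstanding messages time to be delivered
}
```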

Kafka publishing in multiple threads to the same partition

I have a few thousand records to post to Kafka on the same partition in one transaction. I am doing this using Spring's KafkaTemplate. In order to improve the performance of my current logic, I am thinking of doing the Kafka publishing in multiple threads. All the events to be published have the same key and are intended to go to the same partition. Will using multiple threads result in offset conflicts among the threads? Should I stick to one thread doing all the publishing?
The transaction is bound to the thread so you'll end up with multiple transactions.
Have you tried increasing the linger.ms producer property?
We are using a multi-threaded approach in a Spring app to publish messages to the same Kafka topic, and no issue has been reported yet. Kafka is a commit-log based system: it appends new messages to the log and hands out offsets that consumers use to track their progress.
Your approach is the same as multiple producers sending messages to a topic simultaneously with the same key. Kafka can handle this scenario, since there is an elected partition leader.
Also, produced messages are buffered on the producer side and flushed when the buffer fills up (or the linger time passes), so Kafka already has mechanisms to cope with a bombardment of messages with the same key.
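
To make the buffering point concrete, here is a minimal sketch (in Go with the confluent-kafka-go client rather than Spring, purely as an illustration) of several concurrent publishers sharing one producer and writing with the same key: the client buffers and batches according to linger.ms, and offsets are assigned by the partition leader in append order, so concurrent senders do not conflict on offsets. Topic, key and broker address are placeholders.

```go
package main

import (
	"fmt"
	"sync"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	// One shared producer; librdkafka producer handles are safe for concurrent use.
	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder
		"linger.ms":         20,               // let the client batch messages before sending
	})
	if err != nil {
		panic(err)
	}
	defer p.Close()

	topic := "orders"                         // placeholder
	key := []byte("same-key-for-all-events")  // same key => same partition

	var wg sync.WaitGroup
	for t := 0; t < 4; t++ { // 4 concurrent publishers
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				// The partition leader assigns offsets in append order,
				// so concurrent senders cannot "conflict" on offsets.
				_ = p.Produce(&kafka.Message{
					TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
					Key:            key,
					Value:          []byte(fmt.Sprintf("worker-%d-event-%d", worker, i)),
				}, nil)
			}
		}(t)
	}
	wg.Wait()
	p.Flush(15000)
}
```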

Apache NiFi - Asynchronously process messages after Kafka consumer

Currently we are using Apache NiFi to consume messages via a Kafka consumer. The output of the Kafka consumer is connected to a DB processor, which takes the messages from the queue (from the consumer) and runs a stored proc / processing on them. So the DB processor works on one message from the queue at a time; I can set the DB processor to run in parallel with n threads, but each thread still works on one message at a time.
I am looking to do something like the below:
1. A processor after the consumer just takes messages from the queue and waits until it has a batch of, say, 1000 messages.
2. As soon as it has 1000 messages, or 60 seconds have passed with fewer than 1000 messages, it pushes them to another processor (e.g. a DB stored proc) that runs the business logic on that group of messages.
3. Mainly, I want the above to be multithreaded, i.e. if we get 3000 messages, the first processor reads them as 3 batches and pushes them to the DB processor in parallel.
So is there any such processor that can do point 2 above, i.e. just read messages and push them onward based on batch/time rules?
If you can leverage NiFi's record processors, then using ConsumeKafkaRecord with a batch size of 1000 followed by PutDatabaseRecord will give you behavior similar to what you are describing.
If you won't always have enough messages available in the Kafka topic at the time of consuming, then adding MergeContent or MergeRecord in the middle would let you wait for a certain amount of time or number of messages.

How does storm (with multiple worker nodes) guarantee message processing while reading from a kafka topic

I have a storm setup that picks up messages from a kafka topic and processes and persists them.
I want to understand how Storm guarantees message processing in such a scenario.
Consider the below scenario:
I have configured multiple supervisors + workers for a Storm cluster.
The KafkaSpout reads a message from the topic and then passes it on to a bolt. The bolt acks upon completion and the spout moves forward to the next message.
I have 2 supervisors running, each of which runs 3 workers.
From what I understand, each worker on every supervisor is capable of processing a message.
So, at any given time, 6 messages are being processed in parallel in the Storm cluster.
What if the second message fails, either due to a worker shutdown or a supervisor shutdown?
ZooKeeper is already pointing to the 7th message for the consumer group.
In such a scenario, how will the second message get processed?
I guess there is some misunderstanding. The following claims seem to be wrong:
The bolt acks upon completion and the spout moves forward to the next message.
at any given time, 6 messages are being processed in parallel in the Storm cluster
=> A spout does not wait for acks; it fetches tuples over and over again at maximum speed, regardless of the processing speed of the bolts, as long as new messages are available in Kafka (or did you limit the number of tuples in flight via max.spout.pending?). Thus, many messages are processed in parallel (even if only #executors are given to a UDF, many other messages are buffered in internal Storm queues).
As far as I know (but I am not 100% sure), KafkaSpout "orders" the incoming acks and only moves the offset forward once all consecutive acks are available - i.e., message 7 is not acked to Kafka if the Storm ack for message 6 is not there yet. Thus, KafkaSpout can re-emit message 6 if it fails. Recall that Storm does not give any ordering guarantees.