Ensure that all messages are consumed in Kafka - apache-kafka

I has a consumer server in Apache-Kafka and its consuming the messages.
However in the poll loop, I have a time consuming process (db access).
So my questions are:
How to make sure that I read all the messages in a given timeframe on the consumer?
Do I need to increase the number of consumer clients?

Related

Kafka set maximum records consumed in a minute

I'm creating a scraper. The producer sends data to Kafka's topic with the information about links to be scraped. The consumer is an AWS Lambda function that will be triggered when a message is received on that topic.
To avoid blocking, I want to add a cap on the maximum number of messages consumed in a given time. For example, I just want to consume only 5 messages in a minute. While the producer should keep pushing the messages to Kafka.
How can I achieve this?

How does a Kafka Consumer behave if a Producer goes down. What happens to the data in the interval when the producer goes down

I just want to know how the Consumer is able to consume data when the producer is down. Let's say Producer keeps sending logs to the consumer at a steady rate and then the producer goes down from 8AM- 6PM. How does the consumer work in such a case and is there a way that the consumer can get the data that would have been sent during 8am - 6pm if the producer was up.
In Apache Kafka there is no relationship between how producer and consumer behaves.
Acting as a messaging system, Kafka allows to decoupling producer from a consumer providing an asynchronous communication channel.
The producer can send messages at its own pace and the consumer can read these messages in real time or later at its own pace (different from the producer one).
The messages are saved in a topic living in the Kafka cluster, and each message has a position in the topic partition (offset).
Of course, it's possible to tune when messages are deleted from the topic if the consumer isn't online for long time reading the messages.
You can set to store messages for very long time (days, weeks, months) and after that they will be deleted; or you can set to store messages based on time (so deleting the ones older than a time).
Furthermore, the consumer is also able to rewind the stream of messages in the topic, actually re-reading the messages if needed.
Finally, the consumer can also seek to a specific position in the topic partition based on offset or specifiying a time.
The Kafka doc has a nice diagram which I copied below. It shows the novelty of Kafka in a succinct way.
Without Kafka, the situation is something like this. We have multiple servers, e.g. Frontend servers, DB servers, Chat servers etc. On the other side, we have probably different metrics and monitoring tools (e.g. DB monitor, UI monitor etc.). Direct one-to-one communications between different servers and collectors might work out for smaller systems, but it breaks down pretty quickly after the system has surpassed a a certain threshold, in terms of scalability. Kafka solves this problem by decoupling the senders and receivers. Both of them talk through the Kafka brokers instead of talking to each other.
So, in your case the consumer would simply ask the broker if there's any new data on the topic it's subscribing to. As the producer is down, and assuming there is no data in the queue, broker would reply, there's nothing to be consumed.. So, the consumer would be perpetually polling in a fixed interval, in an endless loop and do nothing. Whenever the producer comes up and starts pumping out data, consumer would start receiving (and processing) it. There are more involved use cases when you might be losing data if retention period for particular topic is over, and the consumer hasn't processed the backlog. But I don't think that's a concern for you at this point of your journey.

What is the difference between pulsar and kafka in regards to consumption?

In order to consume data from Kafka, we can have multiple consumers on a topic, totally decoupled. Then, what is meant by no shared consumption on the page(https://streaml.io/blog/pulsar-streaming-queuing) which shares differences between kafka and pulsar?
In his blog, Sijie is referring to shared messaging as queuing. With queuing messaging, multiple consumers are created to receive messages from a single topic. Which consumer gets the message is completely random.
The issue with implementing the messaging pattern with Kafka lies in way that Kafka consumers mark that they’ve consumed a message. Kafka consumers use what’s called a high watermark for consumer offsets. That means that a consumer can only say, “I’ve processed up to this point” rather than, “I’ve processed this message.”
Consider the scenario in which multiple Kafka consumers from the same consumer group were processing from the same topic partition and one of the consumers fails due to an exception while the other succeed. Because Kafka does not a have a built-in way to only acknowledge a single message, and only uses a high-water mark, the failed message would be erronously marked as consumed when in fact it failed and needs to be either reprocessed or published to an error queue, etc.
In order to avoid this situation, you would need to have just a single consumer per partition which limits the comsumption throughput of the topic. Which in turn requires you to increase the number of partitions in order to meet your throughput needs.
There is a detailed explanation in this blog post

Can we have retention period of zero in Kafka broker?

Does retention period of zero makes sense in kafka borker?
We want to quickly forward message from producer to consumer via kafka broker. From buffercache/pagecache on broker machine without flushing to disk. We do not need replication and assume our broker will never crash.
When a message is produced to a Kafka topic it is written to the disk. Once the message has been consumed, the offset of this message is committed by the consumer (if you are using the high-level consumer API) however, there is no functionality that deletes only the messages that have been consumed (many consumers may subscribe to the same topic and some of them might have consumed that message while some others might have not).
What I would suggest in your case is to set a short retention period (which by default is set to 7 days) but allow a reasonable amount of time in order to allow your consumer to consume the messages. To do this, you simply need to configure the following parameter in server.properties:
log.retention.ms=X
Note that there is no guarantee that the deleted message(s) have been successfully consumed by your consumer(s). For example, if you set the retention period to 2 seconds (i.e. log.retention.ms=2000) and your consumer crashes, then every message which is sent to the topic while the consumer is down will be lost.

Kafka throttle producer based on consumer lag

Is there any way to pause or throttle a Kafka producer based on consumer lag or other consumer issues? Would the producer need to determine itself if there is consumer lag then perform throttling itself?
Kafka is build on Pub/Sub design. Producer publish the message to centralized topic. Multiple consumers can subscribe to that topic. Since multiple consumers are involve you cannot decide on producer speed. One consumer can be slow other can be fast. Also it is against the design principle otherwise both system will become tightly couple. If you have use case of throttling may be you should evaluate other framework like direct rest call.
Producer and Consumer are decoupled.
Producer push data to Kafka topics (partitions topic), that are stored in Kafka Brokers. Producer doesn't know who and how often consume messages.
Consumer consume data from Brokers. Consumer doesn't know how many producers produce the messages. Even the same messages can be consumed by several consumers that are in different groups. In example some consumer can consume faster than the other.
You can read more about Producer and Consumer in Apache Kafka webpage
It is not possible to throttle the producer/producers weighing on performance of consumers.
In my scenario I don't want to loose events if the disk size is
exceeded before a message is consumed
To tackle your issue, you have to depend on the parallelism offering by the Kafka. Your Kafka topic should have multiple partitions and producers has to use different keys to populate the topic. So your data will be distributed across multiple partitions and bringing a consumer group you can manage load within a group of consumers. All data within a partition can be processed in order, that may be relevant since you are dealing with event processing.