Throttling of messages on consumer side - apache-kafka

I am beginner level at kafka and have developed consumer for kafka messages which looks good right now.
Though there is a requirement came along while testing of consumer that may be some throttling of messages will be needed at consumer side.
The consumer (.net core, using confluent), after receiving messages, calls api and api processes the message. As part this process, It has few number of read and write to database.
The scenario is, Consumer may receive millions or atleast few thousand of messages daily. This makes load on DB side as part of processing.
So I am thinking to put some throttling on receiving messages on kafka consumer so the DB will not be overloaded. I have checked the option for poll but seems its not all that I want.
For example, within 10 minutes, consumer can receive 100k messages only. Something like that.
Could anybody please suggest how to implement throttling of messages on kafka consumer or is there any better way that this can be handled?

I investigated more and come to know from expert that "throttling on consumer side is not easy to implement, since kafka consumer is implemented in such way to read and process messages as soon as they are available in kafka topic. So, speed is a benefit in kafka world :)"
Seems I can not do much at kafka consumer side. I am thinking to see on the other side and may be separating reads (to replica) and writes to the database can help.

Related

Why does kafka consumer poll the broker?

Currently learning about Kafka architecture and I'm confused as to why the consumer polls the broker. Why wouldn't the consumer simply subscribe to the broker and supply some callback information and wait for the broker to get a record? Then when the broker gets a relevant record, look up who needs to know about it and look at the callback information to dispatch the messages? This would reduce the number of network operations hugely.
Kafka can be used as a messaging service, but it is not the only possible usecase. You could also treat it as a remote file that can have bytes (records) read on demand.
Also, if notification mechanism were to implemented in message-broker fashion as you suggest, you'd need to handle slow consumers. Kafka leaves all control to consumers, allowing them to consume at their own speed.

Are Kafka and Kafka Streams right tools for our case?

I'm new to Kafka and will be grateful for any advice
We are updating a legacy application together with moving it from IBM MQ to something different.
Application currently does the following:
Reads batch XML messages (up to 5 MB)
Parses it to something meaningful
Processes data parallelizing this procedure somehow manually for parts of the batch. Involves some external legacy API calls resulting in DB changes
Sends several kinds of email notifications
Sends some reply to some other queue
input messages are profiled to disk
We are considering using Kafka with Kafka Streams as it is nice to
Scale processing easily
Have messages persistently stored out of the box
Built-in partitioning, replication, and fault-tolerance
Confluent Schema Registry to let us move to schema-on-write
Can be used for service-to-service communication for other applications as well
But I have some concerns.
We are thinking about splitting those huge messages logically and putting them to Kafka this way, as from how I understand it - Kafka is not a huge fan of big messages. Also it will let us parallelize processing on partition basis.
After that use Kafka Streams for actual processing and further on for aggregating some batch responses back using state store. Also to push some messages to some other topics (e.g. for sending emails)
But I wonder if it is a good idea to do actual processing in Kafka Streams at all, as it involves some external API calls?
Also I'm not sure what is the best way to handle the cases when this external API is down for any reason. It means temporary failure for current and all the subsequent messages. Is there any way to stop Kafka Stream processing for some time? I can see that there are Pause and Resume methods on the Consumer API, can they be utilized somehow in Streams?
Is it better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages together? Sounds like an overcomplication
Is Kafka a good tool for these purposes at all?
Overall I think you would be fine using Kafka and probably Kafka Streams as well. I would recommend using streams for any logic you need to do i.e. filtering or mapping that you have todo. Where you would want to write with a connector or a standard producer.
While it is ideal to have smaller messages I have seen streams users have messages in the GBs.
You can make remote calls, to send and email, from a Kafka Streams Processor but that is not recommend. It would probably be better to write the event to send an email to an output topic and use a normal consumer to read and send the messages. This would also take care of your concern about the API being down as you can always remember the last offset in case and restart from there. Or use the Pause and Resume methods.

Using a Kafka consumer in order for a message to be consumed by exactly once semantics

I am new to Kafka and I am seeking guidance on how to use Kafka in order to implement the following message pattern:
First, I want the message to be asynchronous and furthermore it needs to be "consumed" i.e. a single consumer should consume it and other consumers won't be able to consume it thereafter.
A use case of this message pattern is when you have multiple instances of a "delivery service" and you want only one of these instances to consume the message (this assumes one cannot leverage idempotency for some reason).
Can someone please advise how to configure the Kafka Consumer in order to achieve the above?
I think you're essentially looking to use Kafka as a traditional message queue (e.g. Rabbit MQ) where in the message gets removed after consumption. There has been quite a lot of debate on this. As it is always the case, there are merits and demerits on both sides of the fence.
The answers on this post are more or less against the idea ...
However...
This article talks about an approach on how you could possibly try and make it work. The messages won't really be deleted but the approach is quite similar. It is a fairly comprehensive post that covers the overhead and the optimisations that you could explore to make it more efficient.
I hope this helps!
Great question and its something a lot of us struggle with when deploying and using Kafka. In fact, there are a number of times where a project I was working on tried to use Kafka for the use case you described with very little success.
In a nutshell, there are a few Message Exchange Patterns that you come across when dealing with messaging:
Request->Reply
Publish/Subscribe
Queuing (which is what you are trying to do)
Without digging too deep into why, Kafka was really built simply for Publish/Subscribe. There are other products that implement the other features separately and one that actually does all three.
So a question I have for you is would you be open to using something other than Kafka for this project?
You may use spring kafka to do this. Spring Kafka takes care of lot of configurations and boiler plate code. Check example here https://www.baeldung.com/spring-kafka. This should get your started.
Also, you may need to read on how Kafka actually works. The messages that you publish to the Topics in Kafka are natively asynchronous. Your producers don't worry about who consumes it or what happens to the messages once published.
Then consumers in your delivery services should subscribe to the topics. If you want your delivery services to consume a message only once, then the consumers for your delivery services should be in the same group (same group id). Kafka takes care of making sure that the message that was consumed by one of the Consumers (in a same group) won't be available to other Consumers.
The default message retention period is seven days which is configurable in Kafka.

What ways can a Consumer consume message in Kafka?

If there is a Kafka server "over there" somewhere across the network I would assume that there might be two ways that a Consumer could consume the messages:
By first of all 'subscribing' to the Topic and in effect telling the Kafka server where it is listening so that when a new message is Produced, Kafka proactively sends the message to the Consumer, across the network.
The Consumer has to poll the Kafka server asking for any new messages, using the offset of the messages it has currently taken.
Is this how Kafka works, and is the option configurable for which one to use?
I'm expanding my comment into an answer.
Reading through the consumer documentation, Kafka only supports option 2 as you've described it. It is the consumers responsibility to get messages from the Kafka server. In the 0.9.x.x Consumer this is accomplished by the poll() method. The Consumer polls the Kafka Server and returns messages if there are any. There are several reasons I believe they've chosen to avoid supporting option 1.
It limits the complexity needed in the Kafka Server. It's not the Server's responsibility to push messages to a consumer, it just holds the messages and waits till a consumer fetches them.
If the Kafka Server was pushing all messages to the consumers, it could overwhelm a consumer. Say a Producer was pushing messaging into a Kafka Server 10 msg/sec, but a certain Consumer could only process 2 msg/sec. If the Kafka Server attempted to push every message it received to that Consumer, the Consumer would quickly be overwhelmed by the number of messages it receives.
There's probably other reasons, but at the moment those were the two I thought about.

Jboss Messaging. sending one message per time

We are using JBOSS 5.1.0, we using topic for storing our messages. And our client is making a durable subscription to get those messages.
Everything is working fine, but one issue is we are getting data from TCP client, we are processing and keeping it in topic, it is sending around 10 messages per second, and our client is reading one message at a time. There is a huge gap between that, and after sometime JBOSS Topic have many messages and it crashes saying out of memory.
IS there any workaround for this.
Basically the producer is producing 10x more messages than consumer can handle. If this situation is stable (not only during peak), this will never work.
If you limit the producer to send only one message per second (which is of course possible, e.g. check out RateLimiter), what will you do with extra messages on the producer side? If they are not queueing up in the topic, they will queue up on the producer side.
You have few choices:
somehow tune your consumer to process messages faster, so the topic is never filled up
tune the topic to use persistent storage. This is much better. Not only the topic won't store everything in memory, but you might also get transactional behaviour (messages are durable)
put a queue of messages that you want to set to the topic and process one message per second. That queue must be persistent and must be able to keep more messages than the topic currently can