Trigger Kafka Consumer on receiving data - apache-kafka

I have a producer application that sends data to a Kafka topic only occasionally, whenever it receives data from a source. I also have a consumer application (Spark) that runs continuously and receives data from Kafka whenever the producer sends it.
Because the consumer runs all the time, resources are wasted while it sits idle. Since my producer sends data only occasionally, is there a way to trigger the consumer application only when the Kafka topic receives data?

Sounds like you shouldn't be using Spark and should instead run a serverless solution that can be triggered to run code on Kafka events.
Otherwise, run a cron job that checks consumer lag. Define a lag threshold at which to submit your job, and batch-read from Kafka only when that threshold is reached.
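
As a rough sketch of the cron approach, the check could boil down to comparing each partition's log end offset with the group's committed offset. The function names, the dictionaries of offsets, and the threshold are all hypothetical; in practice the offsets would come from a Kafka client or from the output of the kafka-consumer-groups tool, and treating a partition with no committed offset as starting from 0 is an assumption made here for simplicity.

```python
from typing import Dict


def total_lag(end_offsets: Dict[int, int], committed_offsets: Dict[int, int]) -> int:
    """Sum (log end offset - committed offset) over all partitions."""
    lag = 0
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset is read from 0.
        committed = committed_offsets.get(partition, 0)
        lag += max(end - committed, 0)
    return lag


def should_submit_job(
    end_offsets: Dict[int, int],
    committed_offsets: Dict[int, int],
    threshold: int,
) -> bool:
    """True when enough unread records have piled up to justify a batch run."""
    return total_lag(end_offsets, committed_offsets) >= threshold
```

The cron script would call should_submit_job on each run and submit the Spark batch job only when it returns True.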

Related

Why does kafka consumer poll the broker?

I'm currently learning about Kafka architecture and I'm confused as to why the consumer polls the broker. Why wouldn't the consumer simply subscribe to the broker, supply some callback information, and wait for the broker to get a record? Then, when the broker gets a relevant record, it could look up who needs to know about it and use the callback information to dispatch the message. This would hugely reduce the number of network operations.
Kafka can be used as a messaging service, but that is not its only possible use case. You could also treat it as a remote file from which bytes (records) can be read on demand.
Also, if a notification mechanism were implemented in message-broker fashion as you suggest, the broker would need to handle slow consumers. Kafka leaves all control to the consumers, allowing them to consume at their own speed.

Throttling of messages on consumer side

I am a beginner with Kafka and have developed a consumer for Kafka messages, which works well so far.
However, a requirement came up while testing the consumer: some throttling of messages may be needed on the consumer side.
The consumer (.NET Core, using the Confluent client) receives messages and calls an API, which processes each message. As part of this processing, the API performs a few reads and writes against a database.
The consumer may receive millions, or at least a few thousand, messages daily. Processing them puts load on the database.
So I am thinking of throttling message consumption in the Kafka consumer so that the database is not overloaded. I have looked at the options for poll, but they don't seem to cover everything I want.
For example, the consumer should receive no more than 100k messages within 10 minutes, or something like that.
Could anybody please suggest how to implement throttling in a Kafka consumer, or is there a better way to handle this?
I investigated further and learned from an expert that "throttling on consumer side is not easy to implement, since kafka consumer is implemented in such way to read and process messages as soon as they are available in kafka topic. So, speed is a benefit in kafka world :)"
It seems I cannot do much on the Kafka consumer side. I am now looking at the other side: separating reads (to a replica) from writes to the database may help.
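
For what it's worth, a simple limit such as "100k messages per 10 minutes" can be sketched in the consumer's own loop with a fixed-window counter. This is only an illustration under stated assumptions, not a feature of the Confluent client: the class name WindowThrottle and the method acquire_or_wait_time are hypothetical, and it is written in Python rather than .NET to keep it short.

```python
import time
from typing import Callable


class WindowThrottle:
    """Allow at most `limit` messages per `window_seconds` (fixed window)."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.window_start = clock()
        self.count = 0

    def acquire_or_wait_time(self) -> float:
        """Claim a slot and return 0.0, or return seconds until the window resets."""
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            # The previous window has expired: start a new one.
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return 0.0
        return self.window_seconds - (now - self.window_start)
```

In the consume loop, a non-zero return value means the loop should wait that long before handling the next message. Waiting too long between polls can exceed the client's maximum poll interval (max.poll.interval.ms in the Java client) and trigger a rebalance, so clients that expose pause and resume on partitions may be a better fit than a plain sleep.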

Consuming Kafka messages by two separate applications (storm and spark streaming)

We have developed an ingestion application using Storm which consumes Kafka messages (time series sensor data) and saves those messages into Cassandra. We use a NiFi workflow to do this.
I am now going to develop a separate Spark Streaming application which needs to consume those Kafka messages as a source. Would there be a problem with two applications reading from the same Kafka topic? Should I duplicate the Kafka messages in NiFi to another topic for my Spark Streaming application to use? That would add overhead, though.
From the Kafka documentation:
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
In your case, this means that your second application just has to use a different consumer group, so that both applications receive the same messages.
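
To illustrate why separate consumer groups do not interfere with each other, here is a toy in-memory model in which each group keeps its own read position in the same log. It is not a Kafka client; the TinyTopic class and the group names are made up for the illustration, and real Kafka tracks offsets per partition as well as per group.

```python
from typing import Dict, List


class TinyTopic:
    """Toy single-partition log with one committed offset per consumer group."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.group_offsets: Dict[str, int] = {}

    def produce(self, record: str) -> None:
        # Records are only appended; reading never removes them.
        self.log.append(record)

    def poll(self, group_id: str) -> List[str]:
        """Return every record this group has not seen yet, then advance its offset."""
        start = self.group_offsets.get(group_id, 0)
        records = self.log[start:]
        self.group_offsets[group_id] = len(self.log)
        return records
```

Polling as "storm-ingest" and then as "spark-streaming" returns the same records to both, because reading only advances the offset of the group that polled.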

What ways can a Consumer consume message in Kafka?

If there is a Kafka server "over there" somewhere across the network, I would assume that there might be two ways that a Consumer could consume the messages:
1. By first 'subscribing' to the Topic, in effect telling the Kafka server where it is listening, so that when a new message is produced, Kafka proactively sends the message to the Consumer across the network.
2. By polling the Kafka server for any new messages, using the offset of the messages it has already taken.
Is this how Kafka works, and is it configurable which option to use?
I'm expanding my comment into an answer.
Reading through the consumer documentation, Kafka only supports option 2 as you've described it. It is the consumer's responsibility to get messages from the Kafka server. In the 0.9.x.x Consumer this is accomplished by the poll() method: the Consumer polls the Kafka server, which returns messages if there are any. I believe there are several reasons they've chosen not to support option 1.
It limits the complexity needed in the Kafka server. It's not the server's responsibility to push messages to a consumer; it just holds the messages and waits until a consumer fetches them.
If the Kafka server pushed all messages to the consumers, it could overwhelm a consumer. Say a Producer was pushing messages into a Kafka server at 10 msg/sec, but a certain Consumer could only process 2 msg/sec. If the Kafka server attempted to push every message it received to that Consumer, the Consumer would quickly be overwhelmed by the number of messages it receives.
There are probably other reasons, but those are the two I could think of at the moment.
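
The 10 msg/sec versus 2 msg/sec example can be put into numbers with a small simulation. This is a simplified model written for this answer, not a description of Kafka internals; the function names and the one-second time step are assumptions.

```python
from typing import Tuple


def simulate_push(produce_rate: int, consume_rate: int, seconds: int) -> int:
    """Push model: return how many messages sit in the consumer's buffer at the end."""
    consumer_buffer = 0
    for _ in range(seconds):
        consumer_buffer += produce_rate  # the server pushes everything it receives
        consumer_buffer -= min(consume_rate, consumer_buffer)  # consumer works at its own pace
    return consumer_buffer


def simulate_pull(produce_rate: int, consume_rate: int, seconds: int) -> Tuple[int, int]:
    """Pull model: return (backlog left on the server, largest batch the consumer held)."""
    server_backlog = 0
    largest_batch = 0
    for _ in range(seconds):
        server_backlog += produce_rate
        fetched = min(consume_rate, server_backlog)  # consumer asks only for what it can handle
        server_backlog -= fetched
        largest_batch = max(largest_batch, fetched)
    return server_backlog, largest_batch
```

After 60 simulated seconds at these rates, the push model leaves 480 unprocessed messages in the consumer's buffer, while in the pull model the same 480 messages wait on the server and the consumer never holds more than 2 at a time.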

Jboss Messaging. sending one message per time

We are using JBoss 5.1.0 with a topic for storing our messages. Our client makes a durable subscription to get those messages.
Everything is working fine, with one issue: we receive data from a TCP client, process it, and put it in the topic. The TCP client sends around 10 messages per second, while our client reads one message at a time. There is a huge gap between the two, and after some time the JBoss topic holds so many messages that it crashes with an out-of-memory error.
Is there any workaround for this?
Basically, the producer is producing 10x more messages than the consumer can handle. If this is the steady state (not only during peaks), it will never work.
If you limit the producer to sending only one message per second (which is of course possible, e.g. check out RateLimiter), what will you do with the extra messages on the producer side? If they are not queueing up in the topic, they will queue up on the producer side.
You have a few choices:
Somehow tune your consumer to process messages faster, so the topic never fills up.
Tune the topic to use persistent storage. This is much better. Not only will the topic not keep everything in memory, but you might also get transactional behaviour (messages are durable).
Put the messages that you want to send to the topic into a queue and process one message per second. That queue must be persistent and must be able to hold more messages than the topic currently can.
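
As a minimal sketch of the last option, a persistent FIFO queue can be backed by a small database file. The example below uses Python's built-in sqlite3 module purely for illustration; the PersistentQueue class, its table name, and its methods are hypothetical, and a JBoss application would more likely use a Java equivalent.

```python
import sqlite3
from typing import Optional


class PersistentQueue:
    """FIFO queue stored in SQLite; pass a file path to survive restarts."""

    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" is only for trying this out; a real queue needs a file path.
        self.conn = sqlite3.connect(path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS queue ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
            )

    def put(self, body: str) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self.conn:
            self.conn.execute("INSERT INTO queue (body) VALUES (?)", (body,))

    def take(self) -> Optional[str]:
        """Remove and return the oldest message, or None if the queue is empty."""
        with self.conn:
            row = self.conn.execute(
                "SELECT id, body FROM queue ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
            return row[1]

    def __len__(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM queue").fetchone()[0]
```

A sender loop would then call take() once per second and publish the result to the topic, while the TCP handler calls put() as fast as data arrives.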