Need Kafka consumer which fetches data in batches - apache-kafka

I have searched for Kafka batch consumers and I didn't find any valuable information.
Use case:
Producer will produce data very frequently, and on the consumer side we will consume the data and post it to Facebook and Google, which have limits on how much data can be posted.
Let me know if it is possible to pause the consumer from consuming data for a specific time until the other APIs have consumed the data from the consumer.
Note: This can be achieved easily with Storm, but I am not looking for that solution. We can also configure fetch byte sizes in Kafka, but that won't serve the purpose.

There are a couple of ways you could do this:
Option #1: Employ one consumer thread that handles all data consumption and hands the messages off to a blocking queue consumed by a worker thread pool. In doing so you can easily scale the worker processes and consumers, but the offset commit management will be a little harder in this case.
Option #2: Simply invoke the KafkaConsumer.pause() and KafkaConsumer.resume() methods to pause and resume fetching from specific partitions and implement your own logic.
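For example, a minimal sketch of Option #2 with the Java client; the broker address, topic name, group id, and the rate-limit helpers are assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PausingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("group.id", "rate-limited-group");       // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));  // hypothetical topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    boolean downstreamBusy = postToExternalApi(record.value());
                    if (downstreamBusy) {
                        // Stop fetching from all assigned partitions; poll() keeps the
                        // group session alive but returns no records while paused.
                        consumer.pause(consumer.assignment());
                    }
                }
                if (downstreamReady()) {
                    // Resume fetching once the external APIs can accept data again.
                    consumer.resume(consumer.paused());
                }
            }
        }
    }

    // Hypothetical helpers standing in for the rate-limited Facebook/Google calls.
    private static boolean postToExternalApi(String payload) { return false; }
    private static boolean downstreamReady() { return true; }
}
```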

Related

Delayed packet consumption in Kafka

Is it possible for consumers to pick up packets after a time defined in the packet, or how can we achieve this in Kafka?
I found a related question, but it didn't help. As I see it, Kafka is based on sequential reads from the file system and can only be used to read topics straight through, keeping message ordering. Am I right?
The same is possible with RabbitMQ.
If I understand the question, you would need to consume the data, deserialize it, and inspect the time field. Then append it to some priority queue data structure and start a background timer thread that checks whether events from this queue should be processed further, without blocking the Kafka consumer.
The only downside to this approach is that you then need to worry about processing and committing "shorter time" events that are read by the consumer while waiting for previously consumed "longer time" events. Otherwise, a restart of your client will drop all events from the in-memory queue and start consuming after the last committed record.
You might be able to work around this using a persistent "outbox pattern" database table, or otherwise by tracking offsets and processed records manually and seeking past any duplicates.
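A rough sketch of that idea with the Java client, using a DelayQueue as the priority structure and a background thread as the "timer"; the broker, topic, group id, and the assumption that the record timestamp carries the "process at" time are all made up for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DelayedProcessor {

    // Wraps a record so it only becomes available once its delay has elapsed.
    static class DelayedEvent implements Delayed {
        final String payload;
        final long processAtMillis;  // time carried in the message itself (assumption)

        DelayedEvent(String payload, long processAtMillis) {
            this.payload = payload;
            this.processAtMillis = processAtMillis;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(processAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    public static void main(String[] args) {
        DelayQueue<DelayedEvent> queue = new DelayQueue<>();

        // Background worker: take() blocks until the head event's delay has expired.
        Thread worker = new Thread(() -> {
            while (true) {
                try {
                    DelayedEvent event = queue.take();
                    System.out.println("Processing delayed event: " + event.payload);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        worker.start();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        props.put("group.id", "delayed-consumer");         // hypothetical
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("delayed-events"));  // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // A real message would need deserializing and inspecting its own time field;
                    // here the record timestamp stands in for it.
                    queue.put(new DelayedEvent(record.value(), record.timestamp()));
                }
                // Note: committing here acknowledges events still waiting in memory,
                // which is exactly the caveat described above.
            }
        }
    }
}
```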

Is it better to keep a Kafka Producer open or to create a new one for each message?

I have data coming in through RabbitMQ. The data is coming in constantly, multiple messages per second.
I need to forward that data to Kafka.
In my RabbitMQ delivery callback, where I receive the data from RabbitMQ, I have a Kafka producer that immediately sends the received messages to Kafka.
My question is very simple. Is it better to create a Kafka producer outside of the callback method and use that one producer for all messages, or should I create the producer inside the callback method and close it after the message is sent, which means creating a new producer for each message?
It might be a naive question, but I am new to Kafka and so far I have not found a definitive answer on the internet.
EDIT: I am using the Java Kafka client.
Creating a Kafka producer is an expensive operation, so using the Kafka producer as a singleton is good practice for both performance and resource utilization.
For Java clients, this is from the docs:
The producer is thread safe and should generally be shared among all threads for best performance.
For librdkafka based clients (confluent-dotnet, confluent-python etc.), I can link this related issue with this quote from the issue:
Yes, creating a singleton service like that is a good pattern. you definitely should not create a producer each time you want to produce a message - it is approximately 500,000 times less efficient.
The Kafka producer is stateful. It holds metadata (periodically synced from the brokers), a message send buffer, etc., so creating a producer for each message is impractical.
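As an illustration with the Java client, a minimal sketch of sharing one producer across all callbacks; the broker address, topic name, and wiring are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaForwarder {

    // One producer for the whole application; KafkaProducer is thread safe.
    private static final Producer<String, String> PRODUCER = createProducer();

    private static Producer<String, String> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    // Called from the RabbitMQ delivery callback for every incoming message.
    public static void forward(String message) {
        PRODUCER.send(new ProducerRecord<>("forwarded-topic", message));  // hypothetical topic
    }

    // Close the producer once, on application shutdown, not after each message.
    public static void shutdown() {
        PRODUCER.close();
    }
}
```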

Opening Kafka streams dynamically from a queue consumer

We have a use case where, based on a work item arriving on a worker queue, we would need to use the message metadata to decide which Kafka topic to stream our data from. We would have maybe fewer than 100 worker nodes deployed, and each worker node can have a configurable number of threads to receive messages from the queue. So if a worker has "n" threads, we could end up opening Kafka streams to "n" different topics. (n is usually less than 10.)
Once the worker is done processing the message, we would need to close the stream as well.
The worker can receive the next message once it has acked the first message, at which point I need to open a Kafka stream for another topic.
Also, every Kafka stream needs to scan all the partitions (around 5-10) of the topic to filter by a certain attribute.
Can a flow like this work for Kafka streams, or is this not an optimal approach?
I am not sure if I fully understand the use case, but it seems to be a "simple" copy-data-from-topic-A-to-topic-B use case, i.e., no data processing/modification. The logic deciding what to copy from the input to the output topic seems complex, though, and thus Kafka Streams (i.e., Kafka's stream processing library) might not be the best fit, as you need more flexibility.
However, using plain KafkaConsumers and KafkaProducers should allow you to implement what you want.
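A rough sketch of that approach with plain clients, opening a short-lived consumer per work item; the broker, topic names, group id, and the key-based filter are placeholders standing in for the real metadata and attribute:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WorkItemCopier {

    // Invoked per work item; the source topic and filter value come from the work item metadata.
    static void copyFiltered(String sourceTopic, String targetTopic,
                             String filterValue, Producer<String, String> producer) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        props.put("group.id", "worker-" + sourceTopic);     // hypothetical per-topic group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(sourceTopic));
            // A real worker would keep polling until it has scanned everything it needs;
            // a single poll is shown here for brevity.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                if (filterValue.equals(record.key())) {  // stand-in for "filter by attribute"
                    producer.send(new ProducerRecord<>(targetTopic, record.key(), record.value()));
                }
            }
        }  // consumer is closed here, once the work item is done
    }
}
```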

Is kafka consumer sequential or parallel?

In my application, there are multiple enterprises. Each enterprise logs in and performs some action, such as uploading data; the Kafka producer then takes the data and sends it to the topic. On the other side, a Kafka consumer consumes data from the topic, performs the business logic, and persists it into the database.
In this case, everything is perfect when a single enterprise logs in, but when multiple enterprises log in, Kafka consumes sequentially.
How can I make the process parallel for multiple client requests?
Thanks in advance.
As mentioned in previous answers, you can use multiple partitions.
Another option is to take advantage of threading (ThreadPoolExecutor), so the flow would be:
receive message -> hand off to a parallel thread to do the required logic -> ack message.
Please ensure you have throttling (using thread pool executors) to protect application performance.
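A minimal sketch of that flow with the Java client and an ExecutorService; the pool size, broker, topic, group id, and the business logic are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ParallelWorkerConsumer {
    public static void main(String[] args) {
        // Bounded pool acts as the throttle: at most 8 records processed in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(8);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        props.put("group.id", "enterprise-consumers");     // hypothetical
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("uploads"));  // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Hand each record off to the pool so slow business logic
                    // does not block the poll loop.
                    pool.submit(() -> process(record.value()));
                }
                // Commit ("ack") after handing records off; with async processing this
                // trades delivery guarantees for throughput.
                consumer.commitAsync();
            }
        }
    }

    private static void process(String payload) {
        // Placeholder for the business logic / database persistence.
    }
}
```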
If that topic only has one partition, it's sequential on the consumer side. Multiple producers for one partition have no guarantees on ordering.
Consumers and producers will batch messages and process them in chunks.
On the other side, a Kafka consumer consumes data from the topic, performs the business logic, and persists it into the database.
I suggest not using a regular consumer for this. Please research Kafka Connect and see if your database is supported.

Is there a way to throttle data received by a Kafka consumer?

I was reading through docs and found a max.poll.interval.ms property but it doesn't seem to be the config that I need.
Basically, I need something like a min.poll.interval.ms to tell the consumer to poll for records every n seconds.
In conjunction with max.poll.records, I can ensure that my services are processing the right amount of load.
It doesn't work this way.
You need to invoke Consumer.poll(...) periodically (in a loop) to get new records, if any have appeared.
If you do record processing and receiving (poll) in the same thread, then if the processing takes too long, your consumer will be thrown out of the consumer group and another one will get its partitions.
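For example, a rough sketch of pacing the polls yourself with the plain Java consumer; the interval, broker, topic, group id, and max.poll.records value are illustrative assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PacedConsumer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        props.put("group.id", "throttled-group");           // hypothetical
        props.put("max.poll.records", "100");               // cap the batch size per poll
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("load-topic"));  // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
                    process(record.value());
                }
                // Poll roughly every n seconds; keep this well below max.poll.interval.ms
                // or the consumer will be kicked out of the group.
                Thread.sleep(5_000);
            }
        }
    }

    private static void process(String payload) {
        // Placeholder for the actual record handling.
    }
}
```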
An alternative is to use Kafka Streams if you do not want to manage that yourself. Starting stream applications on different instances with the same application id will provide some kind of load balancing.