Delayed packet consumptions in kafka - apache-kafka

is it possible to pick the packets by consumers after defined time in the packet by kafka consumer or how can we achieve this in kafka?
Found related question, but it didn't help. As I see: Kafka is based on sequential reads from file system and can be used only to read topics straightforward keeping message ordering. Am I right?
same is possible with rabbitMQ.

If I understand the question, you would need to consume the data, deserialize it and inspect the time field. Then append to some priority queue data structure and start a background timer thread to check if events from this queue should further be processed, and not block the Kafka consumer.
The only downside to this approach is that you then need to worry about processing and committing "shorter time" events that are read by the consumer while waiting for previously consumed "longer time". Otherwise, a restart of your client will drop all events from an in memory queue and start consuming after the last committed record.
You might be able to workaround this using a persistent "outbox pattern" database table, or otherwise tracking offsets and processed records manually, and seeking past any duplicates

Related

How to consume Kafka's messages on a single consumer?

I need to implement a system that when the application starts a thread consumes all the messages generated during the shutdown of the service, it means that in parallel the application must consume the messages starting from the last message read by the thread that is in charge of consuming the old messages.
Is there a solution to this problem on kafka?
I'm not writing the language I'm using because I think it's a kafka feature.
EDIT:
Suppose we start the machine with consumers at 18:00 from 00:00 must take all messages from 00:00 to 18:00 the consumer assigned to read old messages and in parallel the other consumers start reading messages from 18:00 onward
This is how consumers work by default. You also have to be mindful about the retention of messages, as if that process doesn't restart after a certain amount of time you might lose messages. Kafka can retain data forever but it costs $$$, you need to find out what is the right retention for you.
From your comment, what you describe (multiple consumers consuming the same messages) happens when they have different consumer group ids. If you use the same consumer group, messages won't be processed twice during normal operation.
I need to warn you: Kafka is very complex technology, do not use it unless you know properly how consumers and producers work in detail. I would suggest you to pick at bare minimum the Kafka Definitive Guide before using it, unless you are ok with all kinds of failure scenarios.
Also, by default kafka guarantees "deliver at least once". If you want to be sure that you process messages exactly once, please read Exactly-Once Semantics Are Possible: Here’s How Kafka Does It, and know that this also depends on what you do while processing messages. If you touch a database, it might be better to use something on the DB that guarantees uniqueness (a kind of idempotency) so each message is processed once.

Kafka consumer design to process huge volume of data with multi instance

I am trying to design Kafka consumers, and I have a road block on how to design the process. I am thinking of two options:
1. Process records directly from Kafka.
2. Staging table write from Kafka and process records.
Approach 1: Process Key messages on the go from Kafka:
• Read messages one at a time from Kafka & if no records to process break the loop (configurable messages to process)
• Execute business rules.
• Apply changes to consumer database.
• Update Kafka offset to read after processing message.
• Insert into staging table (used for PD guide later on)
Questions with above approach:
• Is it OK to subscribe to a partition and keep the lock open on Kafka partition until configurable messages are processed
and then apply business rules, apply changes to database. All happens in the same process, any performance issues doing this way ?
• Is it OK to manually commit the offset to Kafka? (Performance issues with manual offset commit).
Approach 2: Staging table write from Kafka and process records
Process 1: Consuming events from Kafka and put in staging table.
Process 2: Reading staging table (configurable rows), execute business rules, apply consumer database changes
& update the status of processed records in staging table. (we may have multiple process to do this step)
I see a lot of downside on this approach:
• We are missing the advantage of offset handling provided by Kafka and we are doing manual update of processed records in staging table.
• Locking & Blocking on staging tables for multi instance, as we are trying to insert & do updates after processing in the same staging table
(note: I can design separate tables and move this data there and process them but that could is introducing multiple processes again.
How can I design Kafka with multi instance consumer and huge data to process, which design is appropriate, is it good to read data on the go from Kafka and process the messages or stage it to a table and write another job to process these messages ?
This is how I think we can get the best throughput without worrying about the loss of messages-
Maximize the number of partitions.
Deploy the consumers(at max the number of partitions, even less if your consumers can operate multi-threaded without any problem.)
Read single-threadedly from within each consumer(with auto offset commit) and put the messages in a Blocking Queue which you can control based upon the number of actual processing threads in each consumer.
If the processing fails, you can retry for success or else put messages in a dead-letter queue. Don't forget the implementation of shut down hookups for processing already consumed messages.
If you want to ensure ordering like processing events with the same key one after the another or on any other factor from a single partition, you can use a deterministic executor. I have written a basic ExecutorService in Java that can execute multiple messages in a deterministic way without compromising on the multi-threading of logically separate events. Link- https://github.com/mukulbansal93/deterministic-threading
To answer your questions-
Is it ok to subscribe to a partition and keep the lock open on Kafka partition until configurable messages are processed and then apply business rules, apply changes to database. All happens in the same process, any performance issues doing this way? I don't see much performance issues here as you are processing in bulk. However, it is possible that one of your consumed messages is taking a long time while others get processes. In that case, you will not read other messages from Kafka leading to a performance bottleneck.
Is it ok to manually commit the offset to Kafka? (Performance issues with manual offset commit). This is definitely going to be the least throughput approach as offset committing is an expensive operation.
The first approach where you consume the data and update a table accordingly sounds like the right way.
Kafka guarantees
At least once: you may get the same message twice.
that means that you need the messages to be idempotent -> set amount to x and not add an amount to the previous value.
order (per partition): Kafka promise that you consume messages in the same order the messages were produced - per partition. Like a queue per partition.
if when you say "Execute business rules" you need to also read previous writes, that means you need to process them one by one.
How to define the partitions
If you define one partition you won't have a problem with conflicts but you will only have one consumer and that doesn't scale.
if you arbitrarily define multiple partitions then you may lose the order.
why is that a problem?
you need to define the partitions according to your business model:
For example, let's say that every message updates some user's DB. when you process a message you want to read the user row, check some fields, and then update (or not) according to that field.
that means that if you define the partition by user-id -> (user-id % number of partitions)
you guarantee that you won't have a race condition between two updates on the same user and you can scale to multiple machines/processes/threads. each consumer in-charge of some set of users but it's always the same users.
The design of your consumer depends on your usecase.
If there are other downstream processes that is expecting the same data and has a limitation to connect to your kafka cluster. In this case having a staging table is a good idea.
I think in your case approach 1 with a little alteration is a good way to go.
However you dont need to break the loop if there are no new messages in the topic.
Also, theres a consumer property that helps to configure the number of records that you want to poll from kafka in a single request (default 500) you might want to change it to a lower number if each message takes a long time to process (To avoid timeout or unwanted repartitioning issues).
Since you mentioned the amount of data is huge I would recommend having more partitions for concurrency if processing order doesnot matter for you. Concurrency can be achieved my creating a consumer group with instance count no more than the number of partitions for the topic. (If the consumer instance count is more than the number of partitions the extra instances will be ideal)
If order does matter, The producer should ideally send logically grouped messages with the same message key so that all messages with the same key land in the same partition.
About offset commiting, if you sync commit each message to kafka you will definitely have performance impact. Usually in offset is commited for each consumed batch of record. eg poll 500 records-> process -> commit the batch of records.
However, If you need to send out a commit for each message you might want to opt for Async Commit.
Additionally, when partitions are assigned to a consumer group instance it doesnot lock the partitions. Other consumer groups can subscribe to the same topic and consume messages concurrently.

How to handle various failure conditions in Kafka

Issue we were facing:
In our system we were logging a ticket in database with status NEW and also putting it in the kafka queue for further processing. The processors pick those tickets from kafka queue, do processing and update the status accordingly. We found that some tickets are left in NEW state forever. So we were guessing whether tickets are failing to get produced in the queue or are no getting consumed.
Message loss / duplication scenarios (and some other related points):
So I started to dig exhaustively to know in what all ways we can face message loss and duplication in Kafka. Below I have listed all possible message loss and duplication scenarios that I can find in this post:
How data loss can occur in different approaches to handle all replicas down
Handle by waiting for leader to come online
Messages sent between all replica down and leader comes online are lost.
Handle by electing new broker as a leader once it comes online
If new broker is out of sync from previous leader, all data written between the
time where this broker went down and when it was elected the new leader will be
lost. As additional brokers come back up, they will see that they have committed
messages that do not exist on the new leader and drop those messages.
How data loss can occur when leader goes down, while other replicas may be up
In this case, the Kafka controller will detect the loss of the leader and elect a new leader from the pool of in sync replicas. This may take a few seconds and result in LeaderNotAvailable errors from the client. However, no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.
When a consumer may miss to consume a message
If Kafka is configured to keep messages for a day and a consumer is down for a period of longer than a day, the consumer will lose messages.
Evaluating different approaches to consumer consistency
Message might not be processed when consumer is configured to receive each message at most once
Message might be duplicated / processed twice when consumer is configured to receive each message at least once
No message is processed multiple times or left unprocessed if consumer is configured to receive each message exactly once.
Kafka provides below guarantees as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers.
Kafka makes the following guarantees about data consistency and availability:
Messages sent to a topic partition will be appended to the commit log in the order they are sent,
a single consumer instance will see messages in the order they appear in the log,
a message is ‘committed’ when all in sync replicas have applied it to their log, and
any committed message will not be lost, as long as at least one in sync replica is alive.
Approach I came up with:
After reading several articles, I felt I should do following:
If message is not enqueued, producer should resend
For this producer should listen for acknowledgement for each message sent. If no ackowledement is received, it can retry sending message
Producer should be async with callback:
As explained in last example here
How to avoid duplicates in case of producer retries sending
To avoid duplicates in queue, set enable.idempotence=true in producer configs. This will make producer ensure that exactly one copy of each message is sent. This requires following properties set on producer:
max.in.flight.requests.per.connection<=5
retries>0
acks=all (Obtain ack when all brokers has committed message)
Producer should be transactional
As explained here.
Set transactional id to unique id:
producerProps.put("transactional.id", "prod-1");
Because we've enabled idempotence, Kafka will use this transaction id as part of its algorithm to deduplicate any message this producer sends, ensuring idempotency.
Use transactions semantics: init, begin, commit, close
As explained here:
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
producer.commitTransaction();
} catch(ProducerFencedException e) {
producer.close();
} catch(KafkaException e) {
producer.abortTransaction();
}
Consumer should be transactional
consumerProps.put("isolation.level", "read_committed");
This ensures that consumer don't read any transactional messages before the transaction completes.
Manually commit offset in consumer
As explained here
Process record and save offsets atomically
Say by atomically saving both record processing output and offsets to any database. For this we need to set auto commit of database connection to false and manually commit after persisting both processing output and offset. This also requires setting enable.auto.commit to false.
Read initial offset (say to read after recovery from cache) from database
Seek consumer to this offset and then read from that position.
Doubts I have:
(Some doubts might be primary and can be resolved by implementing code. But I want words from experienced kafka developer.)
Does the consumer need to read the offset from database only for initial (/ first after consumer recovery) read or for all reads? I feel it needs to read offset from database only on restarts, as explained here
Do we have to opt for manual partitioning? Does this approach works only with auto partitioning off? I have this doubt because this example explains storing offset in MySQL by specifying partitions explicitly.
Do we need both: Producer side kafka transactions and consumer side database transactions (for storing offset and processing records atomically)? I feel for producer idempotence, we need producer to have unique transaction id and for that we need to use kafka transactional api (init, begin, commit). And as a counterpart, consumer also need to set isolation.level to read_committed. However can we ensure no message loss and duplicate processing without using kafka transactions? Or they are absolutely necessary?
Should we persist offset to external db as explained above and here
or send offset to transaction as explained here (also I didnt get what does it exactly mean by sending offset to transaction)
or follow sync async commit combo explained here.
I feel message loss / duplication scenarios 1 and 2 are handled by points 1 to 4 of approach I explained above.
I feel message loss / duplication scenario 3 is handled by point 6 of approach I explained above.
How do we implement different consumer consistency approaches as stated in message loss / duplication scenario 4? Is their any configuration or it needs to be implemented inside custom logic inside consumer?
Message loss / duplication scenario 5 says: "Kafka provides below guarantees as long as you are producing to one partition and consuming from one partition."? Is it something to concern about while building correct system?
Is any consideration unnecessary/redundant in the approach I came up with above? Also did I miss any necessary consideration? Did I miss any message loss / duplication scenarios?
Is their any other standard / recommended / preferable approach to ensure no message loss and duplicate processing than what I have thought above?
Do I have to actually code above approach using kafka APIs? or is there any high level API built atop kafka API which allows to easily ensure no message loss and duplicate processing?
Looking at issue we were facing (as stated at very beginning), we were thinking if we can recover any lost/unprocessed messages from files in which kafka stores messages. However that isnt correct, right?
(Extremely sorry for such an exhaustive post but wanted to write question which will ask all related question at one place allowing to build big picture of how to build system around kafka.)

Is there a way to throttle data received by a Kafka consumer?

I was reading through docs and found a max.poll.interval.ms property but it doesn't seem to be the config that I need.
Basically, I need something like a min.poll.interval.ms to tell the consumer to poll for records every n second.
In conjunction with max.poll.records, I can ensure that my services are processing the right amount of load.
It doesn't work this way.
You need to invoke Consumer.poll(...) periodically (in a loop), to get new records if any have appeared.
If you do record processing and receving (poll) in the same thread, then if the processing takes too long, your consumer will be thrown out of consumer group and another one will get the partitions.
An alternative is to use kafka-streams if you do not want to do that. Starting stream applications on different instances with the same application id will provide some kind of load balancing.

Need Kafka consumer whch fetch data in batch

I have searched about Kafka batch consumer and I didn't find any valuable information.
Use case :
Producer will produce data very frequently and at consumer site we will consume data and from consumer & we will be posting data to Facebook and Google which have limits of data which can be posted.
Let me know if it is possible to pause consumer to consume data for specific time till other APIs consumes data from Consumer.
Note : This can be achieved by storm easily but I am not looking for this solution. We can also configure byte size in kafka but that won't serve the purpose.
There is a couple ways you could do this:
Option #1: Employ one consumer thread that handles all data consumption and hands off the messages to a blocking queue consumed by a worker thread pool. In dosing so could you easily scale the worker processes and consumers. But the offset commit management will be a little harder in this case.
Option #2: Simply invoke KafkaConsumer.pause() and KafkaConsumer.resume() methods to pause and resume fetching from specific partitions to implement your own logic.