Kafka consumer design to process huge volume of data with multi instance - apache-kafka

I am trying to design Kafka consumers, and I have a road block on how to design the process. I am thinking of two options:
1. Process records directly from Kafka.
2. Staging table write from Kafka and process records.
Approach 1: Process Key messages on the go from Kafka:
• Read messages one at a time from Kafka & if no records to process break the loop (configurable messages to process)
• Execute business rules.
• Apply changes to consumer database.
• Update Kafka offset to read after processing message.
• Insert into staging table (used for PD guide later on)
Questions with above approach:
• Is it OK to subscribe to a partition and keep the lock open on Kafka partition until configurable messages are processed
and then apply business rules, apply changes to database. All happens in the same process, any performance issues doing this way ?
• Is it OK to manually commit the offset to Kafka? (Performance issues with manual offset commit).
Approach 2: Staging table write from Kafka and process records
Process 1: Consuming events from Kafka and put in staging table.
Process 2: Reading staging table (configurable rows), execute business rules, apply consumer database changes
& update the status of processed records in staging table. (we may have multiple process to do this step)
I see a lot of downside on this approach:
• We are missing the advantage of offset handling provided by Kafka and we are doing manual update of processed records in staging table.
• Locking & Blocking on staging tables for multi instance, as we are trying to insert & do updates after processing in the same staging table
(note: I can design separate tables and move this data there and process them but that could is introducing multiple processes again.
How can I design Kafka with multi instance consumer and huge data to process, which design is appropriate, is it good to read data on the go from Kafka and process the messages or stage it to a table and write another job to process these messages ?

This is how I think we can get the best throughput without worrying about the loss of messages-
Maximize the number of partitions.
Deploy the consumers(at max the number of partitions, even less if your consumers can operate multi-threaded without any problem.)
Read single-threadedly from within each consumer(with auto offset commit) and put the messages in a Blocking Queue which you can control based upon the number of actual processing threads in each consumer.
If the processing fails, you can retry for success or else put messages in a dead-letter queue. Don't forget the implementation of shut down hookups for processing already consumed messages.
If you want to ensure ordering like processing events with the same key one after the another or on any other factor from a single partition, you can use a deterministic executor. I have written a basic ExecutorService in Java that can execute multiple messages in a deterministic way without compromising on the multi-threading of logically separate events. Link- https://github.com/mukulbansal93/deterministic-threading
To answer your questions-
Is it ok to subscribe to a partition and keep the lock open on Kafka partition until configurable messages are processed and then apply business rules, apply changes to database. All happens in the same process, any performance issues doing this way? I don't see much performance issues here as you are processing in bulk. However, it is possible that one of your consumed messages is taking a long time while others get processes. In that case, you will not read other messages from Kafka leading to a performance bottleneck.
Is it ok to manually commit the offset to Kafka? (Performance issues with manual offset commit). This is definitely going to be the least throughput approach as offset committing is an expensive operation.

The first approach where you consume the data and update a table accordingly sounds like the right way.
Kafka guarantees
At least once: you may get the same message twice.
that means that you need the messages to be idempotent -> set amount to x and not add an amount to the previous value.
order (per partition): Kafka promise that you consume messages in the same order the messages were produced - per partition. Like a queue per partition.
if when you say "Execute business rules" you need to also read previous writes, that means you need to process them one by one.
How to define the partitions
If you define one partition you won't have a problem with conflicts but you will only have one consumer and that doesn't scale.
if you arbitrarily define multiple partitions then you may lose the order.
why is that a problem?
you need to define the partitions according to your business model:
For example, let's say that every message updates some user's DB. when you process a message you want to read the user row, check some fields, and then update (or not) according to that field.
that means that if you define the partition by user-id -> (user-id % number of partitions)
you guarantee that you won't have a race condition between two updates on the same user and you can scale to multiple machines/processes/threads. each consumer in-charge of some set of users but it's always the same users.

The design of your consumer depends on your usecase.
If there are other downstream processes that is expecting the same data and has a limitation to connect to your kafka cluster. In this case having a staging table is a good idea.
I think in your case approach 1 with a little alteration is a good way to go.
However you dont need to break the loop if there are no new messages in the topic.
Also, theres a consumer property that helps to configure the number of records that you want to poll from kafka in a single request (default 500) you might want to change it to a lower number if each message takes a long time to process (To avoid timeout or unwanted repartitioning issues).
Since you mentioned the amount of data is huge I would recommend having more partitions for concurrency if processing order doesnot matter for you. Concurrency can be achieved my creating a consumer group with instance count no more than the number of partitions for the topic. (If the consumer instance count is more than the number of partitions the extra instances will be ideal)
If order does matter, The producer should ideally send logically grouped messages with the same message key so that all messages with the same key land in the same partition.
About offset commiting, if you sync commit each message to kafka you will definitely have performance impact. Usually in offset is commited for each consumed batch of record. eg poll 500 records-> process -> commit the batch of records.
However, If you need to send out a commit for each message you might want to opt for Async Commit.
Additionally, when partitions are assigned to a consumer group instance it doesnot lock the partitions. Other consumer groups can subscribe to the same topic and consume messages concurrently.

Related

Guaranteed ordering of messages across a Kafka cluster

I have read dozens of articles about Kafka message ordering and still don't see an out-of-the-box solution to my very common need - publishing messages with a sequentially-incrementing ID and consuming them in that same order.
Kafka preserves message order within a partition. But what enterprise-grade solution would ever use a single partition for critical data (single point of data loss failure, reduced throughput without parallelism, etc.)? So the challenge is how to consume messages in order across a multi-partitioned topic.
Doing blockchain analytics, we harvest sequentially-incrementing blocks of data from blockchain nodes and then publish them to our Kafka topic. Key = block number, Value = block data. Block numbers start at 0 and increment by 1 for eternity.
Our analytics code needs to consume those messages IN ORDER (block 1, block 2, block 3, etc.). If a Smart contract get created on a blockchain in block 2 and then a transaction on it occurs in block 3, our analytics code would fail if we processed block 3 before block 2 ("no contract found error", for example).
Some more info about our use case.
The topic with block data will never be purged. This will grow to several TB and will have millions of messages on it. Though most consumers won't use this directly, it still servers as our off-chain copy of a blockchain and may fulfill future needs within our software.
We have a SQL database table which stores the stateful information about how much of a blockchain we've analyzed (example, highest block # is 25,555,555).
For guaranteed ordering, most articles recommend Kafka Streams and KTables. If we use in-memory KTables, then we face major challenges (can't store TB of data in-memory, rebuilding the KTable at startup would take days, etc.)
If we use persisted KTables, then we're bloating our disk usage (several TB of data duplicated across the source topic and the KTable).
We can create a secondary "operational" single-partition topic [with a relatively short data retention time] and stream the data to that in order, and then have our consumers pull data from that topic. But this is exactly the opposite of out-of-the-box and we'd like to avoid doing this for the hundreds of blockchains and messaging needs we have. It'll become and administrative debacle.
This seems like a technical need that thousands of companies have had since the creation of Kafka (like what messaging queues have done for decades). Is there no out-of-the-box solution for a KafkaListener to receive messages in order based on a numeric Key [in a multi-partition topic]?
publishing messages with a sequentially-incrementing ID and consuming them in that same order
A single partition is the only way to accomplish this when using Kafka.
One alternative design, from a blockchain perspective, would be to key by wallet address, for example, then you have ordered events per wallet. But then if you have transactions between wallets, there is no guarantee the "other wallet" from that withdraw/deposit event-value will exist, so you will need some other state-store (e.g. KTable) for all known wallet addresses before fully processing such events.
The topic with block data will never be purged. This will grow to several TB
Partition segments are not distributed. If you had one partition, that means you're limited to the size of one HDD.
Similarly, RocksDB or in-memory state-stores will have the same problem. But, the interface for those are pluggable and can be replaced, with some tradeoffs for processing ordering guarantees.

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using kafka where I want to ensure that duplicate records are not inserted into topic created.
I found that iteration is possible in consumer. Is there any way by which we can do this in producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
Use a single-writer per partition and every time you get a network
error check the last message in that partition to see if your last
write succeeded
Include a primary key (UUID or something) in the
message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply
by optionally integrating support for this on the server.
The existing
high-level consumer doesn't expose a lot of the more fine grained
control of offsets (e.g. to reset your position). We will be working
on that soon

Producing a batch message

Let's say there is a batch API for performing tasks List[T]. In order to do the job all the tasks needs to be pushed to kafka. There are 2 ways to do that :
1) Pushing List as a message in kafka
2) Pushing individual task T in kafka
I believe approach 1 would be better since i don't have to push the messages to kafka mutiple times for a single batch call. Can some one please tell me if there is any harm in such approach ?
A Kafka producer can batch together individual messages sent within a short time window (the particular config is linger.ms), so the cost of sending individual messages is probably a lot lower than you think.
Probably a more important factor to consider is how the consumer is going to consume messages. What should happen if the consumer cannot process one of the tasks, for example? If the consumer is just just going to call some other batch-based API which succeeds or fails as a batch, the a single message containing a list of tasks would be a perfectly good fit. On the other hand if the consumer ultimately has to process tasks individually then sending individual messages is probably a better fit, and will probably save you from having to implement some sort of retry logic in your consumer, because you can probably configure Kafka to behave with the semantics you need.
Starting from Kafka v0.11 you can also use transactions in the producer to publish your entire batch atomically. i.e. you begin the transaction, then publish your tasks message by message, finally you commit the transaction. Even though the messages can be sent to kafka in multiple batches, they will only become visible to consumers once you commit the transaction, as long as your consumers are running in read-committed mode.
Option 1 is the preferred method in Kafka so long as the entire batch should always stay together. If you publish a List of records as a batch then they will be stored as a batch, they will be (optionally) compressed as a batch yielding better compression, and they will be fetched by consumers as a batch yielding fewer fetch requests.
If you send individual messages then you will have to give them a common key or they will get spread out over different partitions and possibly be sent out of order, or to different consumers of a consumer group.

Merging ordered Kafka topics into a single ordered topic

I have N topics as input, each with messages added in ascending delivery date order. Topics can vary widely in message count, date range, partitioning strategy. But I know that all partitions for every topic will independently be in date order.
I want to merge all N topics priority-queue style into a new single topic T. T also has whatever partition count and strategy it wants since the only requirement is that each individual partition of T is still in date order on its own. I then feed T to partition-aware consumers which will consume them and idle between due dates since I want each message to be delivered on or closely thereafter its delivery date. This whole pipeline can stream forever.
I expect tuning issues with exactly how partitions amongst all the N input topics and the single T output topic are distributed, and advice which affects that specifically is welcome but right now I'm mainly interested in the overall viability of doing this at all using only Kafka topics, not a RDB or Key-value store. So some extra I/O moving messages between non-optimal topic partitions is okay.
Is this doable with the 0.9 consumer where I can control knowing which partitions are assigned to each consumer, so I can let auto-rebalancing occur while endlessly peek/merge-to-T/commit-offset the oldest message on each actual partition? I must have partition awareness to have a chance of this working.
Due to needing shared merge state (the last date added to T), is it better to stick with multiple partition-aware consumers in a single process, parallel processes or multiple servers given where that state will need to be? I favor keeping the state onboard in shared memory not networked in ZK or whatever. On a restart I can get it once and maintain it while running if on a single machine.
Am I overlooking any Kafka features that would make what I describe easier or more efficient, like some atomic message move between topics? I know I am going against the grain of its design and this scenario is similar to TS.

Apache Kafka order of messages with multiple partitions

As per Apache Kafka documentation, the order of the messages can be achieved within the partition or one partition in a topic. In this case, what is the parallelism benefit we are getting and it is equivalent to traditional MQs, isn't it?
In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned based on user_id and consider 4 messages having user_ids 1,2,3 and 4. Assume that you have an "users" topic with 4 partitions.
Since partitioning is based on user_id, assume that message having user_id 1 will go to partition 1, message having user_id 2 will go to partition 2 and so on..
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case as soon as 4 messages are pushed, they are immediately consumed by the consumers.
If you had 2 consumers for the topic instead of 4, then each consumer will be handling 2 partitions and the consuming throughput will be almost half.
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
ie, if consumption is very slow in partition 2 and very fast in partition 4, then message with user_id 4 will be consumed before message with user_id 2. This is how Kafka is designed.
I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the first, then there's no much you can do, you should use 1 partition and lose all the parallelism ability.
But if the second case, you might consider partitioning your messages by some key and thus all messages for that key will arrive to one partition (they actually might go to another partition if you resize topic, but that's a different case) and thus will guarantee that all messages for that key are in order.
In kafka Messages with the same key, from the same Producer, are delivered to the Consumer in order
another thing on top of that is, Data within a Partition will be stored in the order in which it is written therefore, data read from a Partition will be read in order for that partition
So if you want to get your messages in order across multi partitions, then you really need to group your messages with a key, so that messages with same key goes to same partition and with in that partition the messages are ordered.
In a nutshell, you will need to design a two level solution like above logically to get the messages ordered across multi partition.
You may consider having a field which has the Timestamp/Date at the time of creation of the dataset at the source.
Once, the data is consumed you can load the data into database. The data needs to be sorted at the database level before using the dataset for any usecase. Well, this is an attempt to help you think in multiple ways.
Let's consider we have a message key as the timestamp which is generated at the time of creation of the data and the value is the actual message string.
As and when a message is picked up by the consumer, the message is written into HBase with the RowKey as the kafka key and value as the kafka value.
Since, HBase is a sorted map having timestamp as a key will automatically sorts the data in order. Then you can serve the data from HBase for the downstream apps.
In this way you are not loosing the parallelism of kafka. You also have the privilege of processing sorting and performing multiple processing logics on the data at the database level.
Note: Any distributed message broker does not guarantee overall ordering. If you are insisting for that you may need to rethink using another message broker or you need to have single partition in kafka which is not a good idea. Kafka is all about parallelism by increasing partitions or increasing consumer groups.
Traditional MQ works in a way such that once a message has been processed, it gets removed from the queue. A message queue allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually allow for some level of transaction when pulling a message off, to ensure that the desired action was executed, before the message gets removed, but once a message has been processed, it gets removed from the queue.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
Well, this is an old thread, but still relevant, hence decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then the same strict ordering should be maintained while consuming the messages. There is absolutely no point in ordering message in queue, but not while consuming it. Kafka allows best of both worlds. It allows ordering the message within a partition right from the generation till consumption while allowing parallelism between multiple partition. Hence, if you need
Absolute ordering of all events published on a topic, use single partition. You will not have parallelism, nor do you need (again parallel and strict ordering don't go together).
Go for multiple partition and consumer, use consistent hashing to ensure all messages which need to follow relative order goes to a single partition.