Guaranteed ordering of messages across a Kafka cluster - apache-kafka

I have read dozens of articles about Kafka message ordering and still don't see an out-of-the-box solution to my very common need - publishing messages with a sequentially-incrementing ID and consuming them in that same order.
Kafka preserves message order within a partition. But what enterprise-grade solution would ever use a single partition for critical data (single point of data loss failure, reduced throughput without parallelism, etc.)? So the challenge is how to consume messages in order across a multi-partitioned topic.
Doing blockchain analytics, we harvest sequentially-incrementing blocks of data from blockchain nodes and then publish them to our Kafka topic. Key = block number, Value = block data. Block numbers start at 0 and increment by 1 for eternity.
Our analytics code needs to consume those messages IN ORDER (block 1, block 2, block 3, etc.). If a smart contract gets created on a blockchain in block 2 and then a transaction on it occurs in block 3, our analytics code would fail if we processed block 3 before block 2 (a "no contract found" error, for example).
Some more info about our use case.
The topic with block data will never be purged. It will grow to several TB and hold millions of messages. Though most consumers won't use it directly, it still serves as our off-chain copy of a blockchain and may fulfill future needs within our software.
We have a SQL database table which stores the stateful information about how much of a blockchain we've analyzed (example, highest block # is 25,555,555).
For guaranteed ordering, most articles recommend Kafka Streams and KTables. If we use in-memory KTables, then we face major challenges (can't store TB of data in-memory, rebuilding the KTable at startup would take days, etc.)
If we use persisted KTables, then we're bloating our disk usage (several TB of data duplicated across the source topic and the KTable).
We can create a secondary "operational" single-partition topic [with a relatively short data retention time], stream the data to it in order, and then have our consumers pull data from that topic. But this is exactly the opposite of out-of-the-box, and we'd like to avoid doing this for the hundreds of blockchains and messaging needs we have. It'll become an administrative debacle.
This seems like a technical need that thousands of companies have had since the creation of Kafka (like what messaging queues have done for decades). Is there no out-of-the-box solution for a KafkaListener to receive messages in order based on a numeric Key [in a multi-partition topic]?

publishing messages with a sequentially-incrementing ID and consuming them in that same order
A single partition is the only way to accomplish this when using Kafka.
One alternative design, from a blockchain perspective, would be to key by wallet address, for example, then you have ordered events per wallet. But then if you have transactions between wallets, there is no guarantee the "other wallet" from that withdraw/deposit event-value will exist, so you will need some other state-store (e.g. KTable) for all known wallet addresses before fully processing such events.
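As a rough illustration of keying by wallet address, the sketch below routes events to partitions by a hash of the key. The `partition_for` helper, the partition count and the wallet names are made up for the example, and `zlib.crc32` only stands in for the murmur2 hash that Kafka's default partitioner uses for keyed records.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. crc32 is a stand-in hash, not Kafka's own."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(events):
    """Append (wallet, payload) events to per-partition logs in send order."""
    partitions = defaultdict(list)
    for seq, (wallet, payload) in enumerate(events):
        partitions[partition_for(wallet)].append((seq, wallet, payload))
    return partitions


events = [
    ("wallet-a", "deposit 5"),
    ("wallet-b", "deposit 1"),
    ("wallet-a", "withdraw 2"),
    ("wallet-c", "deposit 9"),
    ("wallet-b", "withdraw 1"),
]
log = produce(events)

for partition, records in sorted(log.items()):
    print(partition, records)
```

Every event for a given wallet ends up in the same partition, in send order. Events for different wallets may sit in different partitions, so nothing is guaranteed about their relative order.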
The topic with block data will never be purged. This will grow to several TB
A partition's segments are not distributed across brokers; each replica of a partition lives entirely on one broker. If you had one partition, that means you're limited to the size of one disk.
Similarly, RocksDB or in-memory state stores will have the same problem. But the interface for those is pluggable and can be replaced, with some tradeoffs for processing ordering guarantees.

Related

Kafka consumer design to process huge volume of data with multi instance

I am trying to design Kafka consumers, and I have a road block on how to design the process. I am thinking of two options:
1. Process records directly from Kafka.
2. Staging table write from Kafka and process records.
Approach 1: Process Key messages on the go from Kafka:
• Read messages one at a time from Kafka & if no records to process break the loop (configurable messages to process)
• Execute business rules.
• Apply changes to consumer database.
• Update Kafka offset to read after processing message.
• Insert into staging table (used for PD guide later on)
Questions with above approach:
• Is it OK to subscribe to a partition and keep the lock open on the Kafka partition until the configurable number of messages is processed, and then apply business rules and apply changes to the database? Everything happens in the same process; are there any performance issues doing it this way?
• Is it OK to manually commit the offset to Kafka? (Performance issues with manual offset commit).
Approach 2: Staging table write from Kafka and process records
Process 1: Consuming events from Kafka and put in staging table.
Process 2: Read the staging table (configurable rows), execute business rules, apply consumer database changes, and update the status of processed records in the staging table (we may have multiple processes doing this step).
I see a lot of downside on this approach:
• We are missing the advantage of offset handling provided by Kafka and we are doing manual update of processed records in staging table.
• Locking and blocking on the staging table with multiple instances, as we are trying to insert and do updates after processing in the same staging table (note: I could design separate tables, move the data there and process it, but that again introduces multiple processes).
How can I design Kafka consumers with multiple instances and a huge volume of data to process? Which design is appropriate: reading data on the go from Kafka and processing the messages, or staging it to a table and writing another job to process them?
This is how I think we can get the best throughput without worrying about the loss of messages:
Maximize the number of partitions.
Deploy the consumers (at most as many as there are partitions, or fewer if your consumers can operate multi-threaded without any problem).
Read single-threaded from within each consumer (with auto offset commit) and put the messages in a blocking queue, whose size you can control based on the number of actual processing threads in each consumer.
If the processing fails, you can retry, or else put the messages in a dead-letter queue. Don't forget to implement shutdown hooks that finish processing already consumed messages.
If you want to ensure ordering like processing events with the same key one after the another or on any other factor from a single partition, you can use a deterministic executor. I have written a basic ExecutorService in Java that can execute multiple messages in a deterministic way without compromising on the multi-threading of logically separate events. Link- https://github.com/mukulbansal93/deterministic-threading
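To make the idea of a deterministic executor concrete, here is a minimal sketch in Python. It is not the code from the linked repository; `KeyedExecutor`, the lane count and the `handle` function are hypothetical names for this example. Each key is hashed to one single-threaded lane, so tasks with the same key run in submission order while different keys can run in parallel.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor, wait


class KeyedExecutor:
    """Runs tasks that share a key in submission order; other keys may run in parallel."""

    def __init__(self, lanes: int = 4):
        # One single-threaded executor per lane keeps each lane strictly FIFO.
        self._lanes = [ThreadPoolExecutor(max_workers=1) for _ in range(lanes)]

    def submit(self, key: str, fn, *args):
        # The same key always hashes to the same lane.
        lane = self._lanes[zlib.crc32(key.encode("utf-8")) % len(self._lanes)]
        return lane.submit(fn, *args)

    def shutdown(self):
        for lane in self._lanes:
            lane.shutdown(wait=True)


processed = {}  # key -> values in the order they were handled


def handle(key, value):
    processed.setdefault(key, []).append(value)


executor = KeyedExecutor(lanes=3)
futures = [
    executor.submit(key, handle, key, i)
    for i in range(30)
    for key in ("user-1", "user-2", "user-3")
]
wait(futures)
executor.shutdown()
print(processed["user-1"][:5])
```

The tradeoff is that one slow key delays every other key that hashes to the same lane.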
To answer your questions-
Is it OK to subscribe to a partition and keep the lock open on the Kafka partition until the configurable number of messages is processed, and then apply business rules and apply changes to the database? All happens in the same process, any performance issues doing it this way? I don't see many performance issues here, as you are processing in bulk. However, it is possible that one of your consumed messages takes a long time while the others get processed. In that case, you will not read further messages from Kafka, leading to a performance bottleneck.
Is it OK to manually commit the offset to Kafka? (Performance issues with manual offset commit.) This is definitely going to be the lowest-throughput approach, as offset committing is an expensive operation.
The first approach where you consume the data and update a table accordingly sounds like the right way.
Kafka guarantees
At least once: you may get the same message twice.
That means your message handling needs to be idempotent: set the amount to x rather than add an amount to the previous value.
Order (per partition): Kafka promises that you consume messages in the same order in which they were produced, per partition. It is like a queue per partition.
If "Execute business rules" also requires reading previous writes, you need to process the messages one by one.
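The at-least-once point above can be illustrated with a tiny sketch. The message shape and the handler names are invented for the example. A "set" handler gives the same result when a message is redelivered, while an "add" handler does not.

```python
def apply_set(balances, msg):
    """Idempotent: replaying the message leaves the same result."""
    balances[msg["user"]] = msg["amount"]


def apply_add(balances, msg):
    """Not idempotent: every replay changes the result again."""
    balances[msg["user"]] = balances.get(msg["user"], 0) + msg["amount"]


def replay(handler, messages):
    """Apply all messages to an empty state and return the final state."""
    balances = {}
    for msg in messages:
        handler(balances, msg)
    return balances


msg = {"user": "alice", "amount": 100}
delivered_once = [msg]
delivered_twice = [msg, msg]  # at-least-once delivery may redeliver

print(replay(apply_set, delivered_twice))  # {'alice': 100}
print(replay(apply_add, delivered_twice))  # {'alice': 200}
```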
How to define the partitions
If you define one partition you won't have a problem with conflicts but you will only have one consumer and that doesn't scale.
If you arbitrarily define multiple partitions, then you may lose the order.
Why is that a problem?
You need to define the partitions according to your business model:
For example, let's say that every message updates some user's row in the DB. When you process a message, you want to read the user row, check some fields, and then update (or not) according to those fields.
That means that if you define the partition by user ID, i.e. (user-id % number of partitions),
you guarantee that you won't have a race condition between two updates on the same user, and you can scale to multiple machines/processes/threads. Each consumer is in charge of some set of users, but it's always the same users.
The design of your consumer depends on your usecase.
If there are other downstream processes that expect the same data but cannot connect to your Kafka cluster, then having a staging table is a good idea.
I think in your case approach 1 with a little alteration is a good way to go.
However, you don't need to break the loop if there are no new messages in the topic.
Also, there's a consumer property, max.poll.records (default 500), that sets the maximum number of records returned by a single poll. You might want to lower it if each message takes a long time to process, to avoid timeouts and unwanted rebalances.
Since you mentioned the amount of data is huge, I would recommend having more partitions for concurrency if processing order does not matter for you. Concurrency can be achieved by creating a consumer group with an instance count no greater than the number of partitions for the topic. (If the consumer instance count is greater than the number of partitions, the extra instances will be idle.)
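A simplified picture of why extra instances sit idle: the sketch below hands out partitions round-robin. It is an assumption-laden toy, not Kafka's actual assignor, and the `assign` function and consumer names are hypothetical.

```python
def assign(partitions, consumers):
    """Spread partitions over consumers round-robin (simplified, not Kafka's assignor)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
group = ["c1", "c2", "c3", "c4", "c5", "c6"]

assignment = assign(partitions, group)
idle = [consumer for consumer, parts in assignment.items() if not parts]

print(assignment)
print(idle)  # ['c5', 'c6']
```

Each partition is owned by exactly one consumer in the group, so with four partitions only four of the six instances receive any work.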
If order does matter, the producer should ideally send logically grouped messages with the same message key so that all messages with the same key land in the same partition.
About offset committing: if you synchronously commit after each message, you will definitely see a performance impact. Usually the offset is committed once per consumed batch of records, e.g. poll 500 records -> process -> commit the batch.
However, if you need to send a commit for each message, you might want to opt for an asynchronous commit.
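To show the difference in the number of commit round trips, here is a small simulation. `StubConsumer` is a hypothetical in-memory stand-in and not the Kafka client API; it only counts how often `commit` is called.

```python
class StubConsumer:
    """Hypothetical in-memory stand-in for a consumer; not the Kafka client API."""

    def __init__(self, records):
        self._records = list(records)
        self._position = 0
        self.committed = 0      # offset of the next record to read after a restart
        self.commit_calls = 0

    def poll(self, max_records):
        end = min(self._position + max_records, len(self._records))
        batch = [(offset, self._records[offset]) for offset in range(self._position, end)]
        self._position = end
        return batch

    def commit(self, next_offset):
        self.committed = next_offset
        self.commit_calls += 1


def run(consumer, max_records, commit_every_record):
    """Consume everything, committing per record or once per polled batch."""
    while True:
        batch = consumer.poll(max_records)
        if not batch:
            break
        for offset, _value in batch:
            # Business logic would go here.
            if commit_every_record:
                consumer.commit(offset + 1)
        if not commit_every_record:
            consumer.commit(batch[-1][0] + 1)  # one commit for the whole batch


per_batch = StubConsumer(range(1000))
run(per_batch, max_records=500, commit_every_record=False)

per_record = StubConsumer(range(1000))
run(per_record, max_records=500, commit_every_record=True)

print(per_batch.commit_calls, per_record.commit_calls)  # 2 1000
```

Both runs end with the same committed offset. The batch variant gets there with far fewer commits, at the cost of reprocessing up to one batch after a crash.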
Additionally, when partitions are assigned to a consumer group instance, it does not lock the partitions. Other consumer groups can subscribe to the same topic and consume messages concurrently.

How to efficiently repair data in large kafka / kafka streams applications

Project:
The application I am working on processes financial transaction (order and trade) data, several million records per day.
The data is fed into a Kafka topic.
Kafka Streams microservices aggregate the information (e.g. number of trades per stock), and this data is consumed by other software. In addition, the data is persisted in MongoDB.
Problem:
The data sent to the topic sometimes needs to be modified, e.g. price changes due to a bug or misconfiguration.
Since Kafka is append-only, I do the correction in MongoDB, and after that the corrected data is piped into a new Kafka topic, leading to a complete recalculation of the downstream aggregations.
However, this process causes scalability concerns, as more and more data needs to be replayed over time.
Question
I am considering splitting the large Kafka topic into daily topics, so that only a single day's topic needs to be replayed in most cases of data repair.
My question is if this is a plausible way to address this problem or if there are better solutions to it.
Data repair, or error handling in general, with Kafka depends heavily on the use case. In our case we built our system on the CQRS + event sourcing principles (generic description here), and as a result we use "compensating events" for data repair (i.e. an event that amends the effects of another event), so that the system eventually becomes consistent.
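A minimal sketch of what a compensating event could look like for the price-correction case from the question. The event types, field names and the `volume_per_stock` aggregation are assumptions made for the illustration; the answer does not describe its actual schema.

```python
from collections import defaultdict


def volume_per_stock(events):
    """Fold the event log into traded value per stock."""
    totals = defaultdict(float)
    for event in events:
        if event["type"] == "trade":
            totals[event["stock"]] += event["price"] * event["qty"]
        elif event["type"] == "price_corrected":
            # Compensating event: back out the wrong price and apply the right one.
            delta = event["new_price"] - event["old_price"]
            totals[event["stock"]] += delta * event["qty"]
    return dict(totals)


log = [
    {"type": "trade", "id": 1, "stock": "ACME", "price": 10.0, "qty": 5},
    {"type": "trade", "id": 2, "stock": "ACME", "price": 100.0, "qty": 2},  # wrong price
    {"type": "price_corrected", "ref": 2, "stock": "ACME",
     "old_price": 100.0, "new_price": 10.0, "qty": 2},
]

print(volume_per_stock(log))  # {'ACME': 70.0}
```

The original events stay in the log untouched. Only the small correction is appended, so downstream aggregations can adjust without a full replay.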

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using Kafka where I want to ensure that duplicate records are not inserted into the topic.
I found that iteration is possible in consumer. Is there any way by which we can do this in producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
1. Use a single-writer per partition and every time you get a network
error check the last message in that partition to see if your last
write succeeded
2. Include a primary key (UUID or something) in the
message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.
I think there are two improvements that would make this a lot easier:
1. Producer idempotence could be done automatically and much more cheaply
by optionally integrating support for this on the server.
2. The existing
high-level consumer doesn't expose a lot of the more fine grained
control of offsets (e.g. to reset your position). We will be working
on that soon
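The last alternative in the FAQ excerpt, storing the offset with the loaded data and deduplicating on the topic/partition/offset combination, might look roughly like this. The sketch uses an in-memory SQLite database; the table name, column names and sample records are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE loaded (
        topic         TEXT    NOT NULL,
        partition_id  INTEGER NOT NULL,
        record_offset INTEGER NOT NULL,
        payload       TEXT    NOT NULL,
        PRIMARY KEY (topic, partition_id, record_offset)
    )
    """
)


def store(record):
    """Insert a record; a redelivered (topic, partition, offset) is silently ignored."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO loaded VALUES (?, ?, ?, ?)",
            record,
        )


deliveries = [
    ("blocks", 0, 41, "a"),
    ("blocks", 0, 42, "b"),
    ("blocks", 0, 42, "b"),  # redelivered after a consumer restart
    ("blocks", 1, 42, "c"),  # same offset, different partition: not a duplicate
]
for record in deliveries:
    store(record)

row_count = conn.execute("SELECT COUNT(*) FROM loaded").fetchone()[0]
print(row_count)  # 3
```

Because the primary key is the record's coordinates in Kafka, the write is idempotent without a separate offset checkpoint.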

Designing Kafka Topics - Many Topics vs One Big Topic

Considering a stream of different events the recommended way would be
one big topic containing all events
multiple topics for different types of events
Which option would be better?
I understand that when messages are not in the same partition of a topic there is no ordering guarantee, but are there any other factors to be considered when making this decision?
A topic is a logical abstraction and should contain messages of the same type. Let's say you monitor a website and capture click stream events, and on the other hand you have a database that publishes its changes into a changelog topic. You should have two different topics, because click stream events are not related to your database changelog.
This has multiple advantages:
your data will have different formats, and you will need different (de)serializers to write and read the data (using a single topic you would need a hybrid serializer and you would not get type safety when reading data)
you will have different consumer applications: one application might be interested in click stream events only, while a second application is only interested in the database changelog and a third application is interested in both. If you have multiple topics, applications one and two only subscribe to the topics they are interested in. If you have a single topic, applications one and two need to read everything and filter out the stuff they are not interested in, increasing broker, network, and client load
As @Matthias J. Sax said before, there is no silver bullet here. But we have to take several aspects into account.
The deciding condition: ordered delivery
If your application needs guaranteed ordered delivery, you need to work with only one topic, plus the same key for those messages whose order must be guaranteed.
If ordering is not mandatory, the game starts...
Is the schema the same for all messages?
Would consumers be interested in the same types of events?
What is going to happen on the consumer side? Are we reducing or increasing complexity in terms of implementation, maintainability, error handling...?
Is horizontal scalability important for us? More topics often means more partitions available, which means more capacity to scale horizontally. It also allows finer-grained scaling configuration on the broker side, because we can choose how many partitions to add per event type, and on the consumer side, how many consumers to run per event type.
Does it make sense to parallelise consumption per message type?
...
Technically speaking, if we allow consumers to choose which types of events to consume, we potentially reduce the network bandwidth spent sending undesired messages from the broker to the consumer, as well as the number of deserialisations for all of them (CPU usage, which over time means more free resources, lower energy cost...).
It is also worth remembering that splitting different types of messages into different topics doesn't mean having to consume them with different Kafka consumers, because a consumer can consume from several topics at the same time.
Well, there's no clear answer to this question, but I have the feeling that with Kafka, because of its many features, if ordered delivery is not needed we should split our messages per type into different topics.

Merging ordered Kafka topics into a single ordered topic

I have N topics as input, each with messages added in ascending delivery date order. Topics can vary widely in message count, date range, partitioning strategy. But I know that all partitions for every topic will independently be in date order.
I want to merge all N topics priority-queue style into a new single topic T. T can also have whatever partition count and strategy it wants, since the only requirement is that each individual partition of T is still in date order on its own. I then feed T to partition-aware consumers which will consume the messages and idle between due dates, since I want each message to be delivered on or shortly after its delivery date. This whole pipeline can stream forever.
I expect tuning issues with exactly how partitions amongst all the N input topics and the single T output topic are distributed, and advice which affects that specifically is welcome but right now I'm mainly interested in the overall viability of doing this at all using only Kafka topics, not a RDB or Key-value store. So some extra I/O moving messages between non-optimal topic partitions is okay.
Is this doable with the 0.9 consumer where I can control knowing which partitions are assigned to each consumer, so I can let auto-rebalancing occur while endlessly peek/merge-to-T/commit-offset the oldest message on each actual partition? I must have partition awareness to have a chance of this working.
Due to needing shared merge state (the last date added to T), is it better to stick with multiple partition-aware consumers in a single process, parallel processes or multiple servers given where that state will need to be? I favor keeping the state onboard in shared memory not networked in ZK or whatever. On a restart I can get it once and maintain it while running if on a single machine.
Am I overlooking any Kafka features that would make what I describe easier or more efficient, like some atomic message move between topics? I know I am going against the grain of its design and this scenario is similar to TS.
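The priority-queue merge described in the question can be sketched in memory with a k-way merge. This only illustrates the merge step on made-up data; the lists below stand in for input partitions and nothing here talks to Kafka.

```python
import heapq
from datetime import date

# Hypothetical input: each list is one partition, already sorted by delivery date.
partitions = [
    [(date(2024, 1, 1), "a1"), (date(2024, 1, 5), "a2")],
    [(date(2024, 1, 2), "b1"), (date(2024, 1, 3), "b2"), (date(2024, 1, 9), "b3")],
    [(date(2024, 1, 4), "c1")],
]


def merge_by_date(sorted_partitions):
    """Lazily yield messages from all partitions in ascending date order."""
    return heapq.merge(*sorted_partitions, key=lambda message: message[0])


merged = list(merge_by_date(partitions))
print([name for _, name in merged])
# ['a1', 'b1', 'b2', 'c1', 'a2', 'b3']
```

In a live stream the merge can only emit a message once it knows the current head of every input partition, so an input with no pending message stalls the output until it produces one or some timeout policy decides to skip it.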