Kafka: topic compaction notification? - apache-kafka

I was given the following architecture that I'm trying to improve.
I receive a stream of DB changes which end up in a compacted topic. The stream is basically key/value pairs and the keyspace is large (~4 GB).
The topic is consumed by one Kafka Streams process that stores the data in RocksDB (separate for each consumer/shard). The processor does two different things:
join the data into another stream.
check if a message from the topic is a new key or an update to an existing one. If it is an update it sends the old key/value and the new key/value pair to a different topic (updates are rare).
The construct has a couple of problems:
The two different functionalities of the stream processor belong to different teams and should not be part of the same code base. They are put together to save memory. If we separated them we would have to duplicate the RocksDB stores.
I would prefer to use a normal KTable join instead of the handcrafted join that's currently in the code.
RocksDB seems to be a bit of overkill if the data is already persisted in a topic. We are currently running into some performance issues and I assume it would be faster if we just kept everything in memory.
Question 1:
Is there a way to hook into the compaction process of a compacted topic? I would like a notification (to a different topic) for every key that is actually compacted (including the old and new value).
If this is somehow possible I could easily split the code bases apart and simplify the join.
Question 2:
Any other idea on how this can be solved more elegantly?

Your overall design makes sense.
About your join semantics: I guess you need to stick with the Processor API, as a regular KTable join cannot provide what you want. It's also not possible to hook into the compaction process.
However, Kafka Streams also supports in-memory state stores: https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html#state-stores
RocksDB is used by default to allow the state to be larger than available main memory. Spilling to disk with RocksDB is not needed for reliability (the changelog topic covers that); however, it has the advantage that stores can be recreated more quickly if an instance comes back online on the same machine, because it is not required to re-read the whole changelog topic.
Whether you want to split the app into two is your own decision; it comes down to how many resources you are willing to provide.
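To make the in-memory option concrete, here is a minimal sketch of the Processor API wired to an in-memory key-value store that distinguishes new keys from updates and forwards old and new value for updates. The topic names, store name, and String serdes are assumptions for illustration, not taken from the question:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class InMemoryUpdateDetector {

    // Processor that treats a key it has seen before as an "update"
    // and forwards old and new value downstream; unseen keys are only stored.
    static class UpdateDetector implements Processor<String, String, String, String> {
        private KeyValueStore<String, String> store;
        private ProcessorContext<String, String> context;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.store = context.getStateStore("change-store");
        }

        @Override
        public void process(Record<String, String> record) {
            String oldValue = store.get(record.key());
            if (oldValue != null) {
                // Update: forward old and new value to the sink topic.
                context.forward(record.withValue(oldValue + "|" + record.value()));
            }
            store.put(record.key(), record.value());
        }
    }

    public static void main(String[] args) {
        Topology topology = new Topology();
        topology.addSource("changes", "db-changes");              // assumed input topic
        topology.addProcessor("detect", UpdateDetector::new, "changes");
        // In-memory store instead of the default RocksDB-backed store:
        topology.addStateStore(Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("change-store"),
                Serdes.String(), Serdes.String()), "detect");
        topology.addSink("updates", "key-updates", "detect");     // assumed output topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "update-detector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(topology, props).start();
    }
}
```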

Related

Guaranteed ordering of messages across a Kafka cluster

I have read dozens of articles about Kafka message ordering and still don't see an out-of-the-box solution to my very common need - publishing messages with a sequentially-incrementing ID and consuming them in that same order.
Kafka preserves message order within a partition. But what enterprise-grade solution would ever use a single partition for critical data (single point of data loss failure, reduced throughput without parallelism, etc.)? So the challenge is how to consume messages in order across a multi-partitioned topic.
Doing blockchain analytics, we harvest sequentially-incrementing blocks of data from blockchain nodes and then publish them to our Kafka topic. Key = block number, Value = block data. Block numbers start at 0 and increment by 1 for eternity.
Our analytics code needs to consume those messages IN ORDER (block 1, block 2, block 3, etc.). If a smart contract gets created on a blockchain in block 2 and then a transaction on it occurs in block 3, our analytics code would fail if we processed block 3 before block 2 ("no contract found" error, for example).
Some more info about our use case.
The topic with block data will never be purged. This will grow to several TB and will have millions of messages on it. Though most consumers won't use this directly, it still serves as our off-chain copy of a blockchain and may fulfill future needs within our software.
We have a SQL database table which stores the stateful information about how much of a blockchain we've analyzed (example, highest block # is 25,555,555).
For guaranteed ordering, most articles recommend Kafka Streams and KTables. If we use in-memory KTables, then we face major challenges (can't store TB of data in-memory, rebuilding the KTable at startup would take days, etc.)
If we use persisted KTables, then we're bloating our disk usage (several TB of data duplicated across the source topic and the KTable).
We can create a secondary "operational" single-partition topic [with a relatively short data retention time] and stream the data to that in order, and then have our consumers pull data from that topic. But this is exactly the opposite of out-of-the-box, and we'd like to avoid doing this for the hundreds of blockchains and messaging needs we have. It'll become an administrative debacle.
This seems like a technical need that thousands of companies have had since the creation of Kafka (like what messaging queues have done for decades). Is there no out-of-the-box solution for a KafkaListener to receive messages in order based on a numeric Key [in a multi-partition topic]?
publishing messages with a sequentially-incrementing ID and consuming them in that same order
A single partition is the only way to accomplish this when using Kafka.
One alternative design, from a blockchain perspective, would be to key by wallet address, for example; then you have ordered events per wallet. But if you have transactions between wallets, there is no guarantee the "other wallet" from that withdraw/deposit event value will exist, so you will need some other state store (e.g. a KTable) of all known wallet addresses before fully processing such events.
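To illustrate the per-key ordering idea, a minimal producer sketch keyed by wallet address (topic name, key, and values are assumptions); Kafka's default partitioner sends equal keys to the same partition, so events for one wallet stay in publish order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class WalletEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = wallet address: all events of one wallet land in the same
            // partition and are therefore consumed in publish order per wallet.
            producer.send(new ProducerRecord<>("wallet-events", "wallet-abc", "deposit:5"));
            producer.send(new ProducerRecord<>("wallet-events", "wallet-abc", "withdraw:2"));
        }
    }
}
```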
The topic with block data will never be purged. This will grow to several TB
Partition segments are not distributed. If you had one partition, that means you're limited to the size of one HDD.
Similarly, RocksDB or in-memory state stores will have the same problem. But the interfaces for those are pluggable and can be replaced, with some tradeoffs for processing ordering guarantees.

Schema registry incompatible changes

In all the documentation it is clearly described how to handle compatible changes with Schema Registry compatibility types.
But how do you introduce incompatible changes without directly disturbing the downstream consumers, so that they can migrate at their own pace?
We have the following situation (see image) where the producer is producing the same message in both schema versions:
[Image: the producer publishes the same message in both schema versions]
The problem is how to migrate the apps and the sink connector in a controlled way, where business continuity is important and the consumers are not allowed to process the same message (in the new format).
consumers are not allowed to process the same message (in the new format).
Your consumers need to be aware of the old format while consuming the new one; they need to understand what it means to consume the "same message". That's up to you to code, not something Connect or other consumers can automatically determine, with or without a Registry.
In my experience, the best approach to prevent duplicate record processing across various topics is to persist a unique id (UUID) as part of each record, across all schema versions, and then query some source of truth for what has or has not been processed already. When records have not been processed yet, insert their ids into that system after processing them.
This may require placing a stream processing application that filters already-processed records out of a topic before the sink connector consumes it.
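A rough sketch of that filtering idea with Kafka Streams; the topic names, the idOf() extraction, and the alreadyProcessed() lookup against your source of truth are hypothetical placeholders, not part of the answer above:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class AlreadyProcessedFilter {
    // Hypothetical lookup against your source of truth (DB, cache, ...).
    static boolean alreadyProcessed(String recordId) {
        return false; // placeholder
    }

    // Hypothetical extraction of the UUID carried in every schema version.
    static String idOf(String value) {
        return value; // placeholder
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders-v2", Consumed.with(Serdes.String(), Serdes.String()))
               // Drop records whose id was already processed via the old topic/format.
               .filterNot((key, value) -> alreadyProcessed(idOf(value)))
               .to("orders-v2-clean", Produced.with(Serdes.String(), Serdes.String()));
        // The sink connector would then consume "orders-v2-clean" instead of "orders-v2".

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```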
I figure what you are looking for is kind of an equivalent to a topic offset, but spanning multiple topics. Technically this is not provided by Kafka, and with good reason I'd like to add. The solution would be very specific to each use case, but I figure it all boils down to introducing your own functional offset attribute in both streams.
Consumers will have to maintain state about which messages have been processed and, when switching to the other topic, filter out messages that were already processed from the first one. You could use your own sequence numbering or timestamps to keep track of progress across topics. Using a sequence makes it easier to keep track of the progress, as only one value needs to be stored on the consumer end; using UUIDs or other non-sequential ids will potentially require a more complex state-keeping mechanism.
Keep in mind that switching to a new topic will probably mean that lots of messages will have to be skipped and depending on the amount this might cause a delay that you need to be willing to accept.
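As a rough sketch of the sequence-number variant (topic name, the sequence carried in the key, and the checkpoint helpers are assumptions for illustration): the consumer keeps the highest processed sequence number and skips anything at or below it after switching to the new topic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SequenceAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        long lastProcessedSeq = loadCheckpoint(); // hypothetical: read progress from your own store

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders-v2")); // the new topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    long seq = Long.parseLong(record.key()); // assumed: key carries the functional sequence number
                    if (seq <= lastProcessedSeq) {
                        continue; // already handled via the old topic -> skip
                    }
                    // process(record) ...
                    lastProcessedSeq = seq;
                    saveCheckpoint(lastProcessedSeq); // hypothetical: persist progress
                }
            }
        }
    }

    static long loadCheckpoint() { return -1L; }
    static void saveCheckpoint(long seq) { }
}
```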

Apache Kafka - KStream and KTable hard disk space requirements

I am trying to better understand what happens at the level of resources when you create a KStream and a KTable. Below, I will mention some conclusions that I have come to, as I understand them (feel free to correct me).
Firstly, every topic has a number of partitions and all the messages in those partitions are stored in the hard disk(s) in continuous order.
A KStream does not need to store the messages that are read from a topic again in another location, because the offset is sufficient to retrieve those messages from the topic it is connected to.
(Is this correct?)
The question regards the KTable. As I understand it, a KTable, in contrast with a KStream, updates every message with the same key. In order to do that, you have to either store the messages that arrive from the topic externally in a static table, or read the whole message queue each time a new message arrives. The latter does not seem very efficient in terms of time performance. Is the first approach I presented correct?
read all the message queue, each time a new message arrives.
All messages are only read at the fresh start of the application. Once the app reads up to the latest offset, it's just updating the table like any other consumer
How disk usage is determined ultimately depends on the state store you've configured for the application, along with its own settings. For example, in-memory vs rocksdb vs an external state store interface that you've written on your own
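For example, a sketch of how the backing store can be chosen when materializing a KTable; the topic names, store names, and String serdes are assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

public class TableStoreChoice {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Default: RocksDB-backed store, which spills to disk under state.dir.
        KTable<String, String> onDisk = builder.table("topic-a",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("table-a"));

        // Explicitly in-memory: no local disk usage, but the whole table must fit
        // in RAM and is rebuilt from the changelog topic on restart.
        KTable<String, String> inMemory = builder.table("topic-b",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String>as(Stores.inMemoryKeyValueStore("table-b"))
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));
    }
}
```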

Is it ok to use Apache Kafka "infinite retention policy" as a base for an Event sourced system with CQRS?

I'm currently evaluating options for designing/implementing an Event Sourcing + CQRS architectural approach to system design. Since we want to use Apache Kafka for other aspects (normal pub-sub messaging + stream processing), the next logical question would be: "Can we use the Apache Kafka store as an event store for CQRS?", or more importantly, would that be a smart decision?
Right now I'm unsure about this.
This source seems to support it: https://www.confluent.io/blog/okay-store-data-apache-kafka/
This other source recommends against that: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
In my current tests/experiments, I'm having problems similar to those described by the 2nd source, those are:
Recomposing an entity: Kafka doesn't seem to support fast retrieval/searching of specific events within a topic. For example, retrieving all commands related to an order's history, which is necessary for the reconstruction of the entity's instance, seems to require scanning all the topic's events and filtering only those matching some entity instance identifier, which is a no-go. [This other person seems to have arrived at a similar conclusion: Query Kafka topic for specific record -- that is, it is just not possible (without relying on some hacky trick)]
Write consistency: Kafka doesn't support transactional atomicity on its store, so it seems common practice to just put a DB with some locking approach (usually optimistic locking) in front, before asynchronously exporting the events to the Kafka queue (I can live with this though; the first problem is much more crucial to me).
The partition problem: The Kafka documentation mentions that the order guarantee exists only within a topic's partition. At the same time it also says that the partition is the basic unit of parallelism; in other words, if you want to parallelize work, spread the messages across partitions (and brokers, of course). But this is a problem, because an event store in an event-sourced system needs the order guarantee, so this means I'm forced to use only 1 partition for this use case if I absolutely need the order guarantee. Is this correct?
Even though this question is a bit open-ended, it really comes down to this: Have you used Kafka as the main event store of an event-sourced system? How have you dealt with the problem of recomposing entity instances out of their command history (given that the topic has millions of entries, scanning the whole set is not an option)? Did you use only 1 partition, sacrificing potential concurrent consumers (given that the order guarantee is restricted to a specific topic partition)?
Any specific or general feedback would be greatly appreciated, as this is a complex topic with several considerations.
Thanks in advance.
EDIT
There was a similar discussion 6 years ago here:
Using Kafka as a (CQRS) Eventstore. Good idea?
Consensus back then was also divided, and a lot of people who suggest this approach is convenient mention how Kafka deals natively with huge amounts of real-time data. Nevertheless, the problem (for me at least) isn't related to that, but to how poorly Kafka's capabilities fit the task of rebuilding an entity's state: either by modeling topics as entity instances (where the exponential explosion in the number of topics is undesirable), or by modeling topics as entity types (where the number of events within the topic makes reconstruction very slow/impractical).
Your understanding is mostly correct:
Kafka has no search, and definitely not by key. There's a seek-to-timestamp, but it's imperfect and not good for what you're trying to do.
Kafka actually supports a limited form of transactions these days (see exactly-once), although they are of no use if you interact with any system outside of Kafka.
The unit of anything in Kafka (event ordering, availability, replication) is a partition. There are no guarantees across partitions of the same topic.
All of this doesn't stop applications from using Kafka as the source of truth for their state, so long as:
your problem can be "sharded" into topic partitions, so you don't care about the order of events across partitions
you're willing to "replay" an entire partition if/when you lose your local state, as bootstrap
you use log-compacted topics to try and keep a bound on their size (because you will need to replay them to bootstrap, see the point above)
Both Samza and (IIUC) Kafka Streams back their state stores with log-compacted Kafka topics. Internally to Kafka, offset and consumer group management is stored as a log-compacted topic, with brokers holding a "materialized view" in memory: when ownership of a partition of __consumer_offsets moves between brokers, the new leader replays the partition to rebuild this view.
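For reference, a log-compacted topic as described above is just a topic with cleanup.policy=compact; a minimal AdminClient sketch, where the topic name, partition count, and replication factor are assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Compaction keeps only the latest value per key, bounding the topic size.
            NewTopic topic = new NewTopic("order-state", 12, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```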
I was in several projects that used Kafka as long-term storage. Kafka has no problem with it, especially with the latest versions, which introduced something called tiered storage; in a cloud environment this gives you the possibility to transfer the older data to slower/cheaper storage.
And you should not worry that much about transactions; in today's IT there are other concepts to deal with it, like Event Sourcing and Bounded Context. Yes, you have to design your applications differently; how is explained in this video.
But you are right, your choices for querying this data will be limited. The easiest way is to use Kafka Streams and a KTable, but this will be a key/value database, so you can only ask questions about your data by primary key.
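A sketch of that key/value-only access pattern with Kafka Streams interactive queries; the topic name, store name, and String serdes are assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrderLookup {
    // Materializes a KTable over the event topic so it can be queried by key.
    static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.table("orders",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("orders-store"));
        return builder;
    }

    // Point lookup by primary key; no secondary indexes, no scans.
    static String latestState(KafkaStreams streams, String orderId) {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("orders-store",
                        QueryableStoreTypes.keyValueStore()));
        return store.get(orderId);
    }
}
```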
Your next best choice is to implement the query part of CQRS with the help of frameworks like Akka Projection. I wrote a blog about how you can use Akka Projection with Elasticsearch, which you can find here and here.

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using Kafka where I want to ensure that duplicate records are not inserted into the topic created.
I found that iteration is possible on the consumer side. Is there any way by which we can do this in the producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of the Kafka FAQ, which describes some approaches on how to avoid duplication during data production (i.e. on the producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition, and every time you get a network error check the last message in that partition to see if your last write succeeded.
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position, then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically, it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine-grained control of offsets (e.g. to reset your position). We will be working on that soon.
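As a side note, the quoted FAQ predates the idempotent producer, which now exists and can be enabled in the producer config; it de-duplicates broker-side retries of the same batch, though it does not protect against the application re-sending a record deliberately. A minimal config sketch, where the topic name and values are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Broker de-duplicates retried batches, so network errors no longer
        // produce duplicate records within a partition.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required for idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key-1", "value-1"));
        }
    }
}
```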