Is it OK to use Apache Kafka's "infinite retention policy" as a base for an event-sourced system with CQRS?

I'm currently evaluating options for designing/implementing an Event Sourcing + CQRS architectural approach to system design. Since we want to use Apache Kafka for other aspects (normal pub-sub messaging + stream processing), the next logical question is: "Can we use Apache Kafka as the event store for CQRS?", or more importantly, would that be a smart decision?
Right now I'm unsure about this.
This source seems to support it: https://www.confluent.io/blog/okay-store-data-apache-kafka/
This other source recommends against that: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
In my current tests/experiments, I'm running into problems similar to those described by the second source, namely:
- Recomposing an entity: Kafka doesn't seem to support fast retrieval/search of specific events within a topic. For example, fetching all commands related to an order's history (necessary to reconstruct the entity's instance) seems to require scanning all the topic's events and filtering only those matching some entity instance identifier, which is a no-go (see the sketch after this list). This other person seems to have arrived at a similar conclusion: Query Kafka topic for specific record -- that is, it is just not possible (without relying on some hacky trick).
- Write consistency: Kafka doesn't support transactional atomicity on its store, so it seems to be common practice to put a DB with some locking approach (usually optimistic locking) in front, and asynchronously export the events to the Kafka queue (I can live with this, though; the first problem is much more crucial to me).
- The partition problem: the Kafka documentation mentions that the order guarantee exists only within a topic's partition. At the same time, it also says the partition is the basic unit of parallelism; in other words, if you want to parallelize work, spread the messages across partitions (and brokers, of course). But this is a problem, because an event store in an event-sourced system needs the order guarantee, so I'm forced to use only 1 partition for this use case if I absolutely need the order guarantee. Is this correct?
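To make the first point concrete, this is roughly all the plain consumer API offers for rebuilding one entity: a full scan of the topic with client-side filtering. The topic name, the single-partition assumption, and using the record key as the entity identifier are illustrative assumptions, not a recommendation.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

public class EntityRebuild {
  // Collects all events whose key matches entityId by scanning the topic
  // from the beginning: O(total events), regardless of the entity's size.
  public static List<String> eventsFor(String entityId) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");

    List<String> history = new ArrayList<>();
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition("orders", 0);
      consumer.assign(Collections.singletonList(tp));
      consumer.seekToBeginning(Collections.singletonList(tp)); // full scan
      long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
      while (consumer.position(tp) < end) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
          if (entityId.equals(rec.key())) { // client-side filter; there is no index
            history.add(rec.value());
          }
        }
      }
    }
    return history;
  }
}
```

There is no server-side equivalent of `SELECT * FROM events WHERE entity_id = ?`; the filtering always happens in the client after fetching everything.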
Even though this question is a bit open-ended, it really comes down to this: have you used Kafka as the main event store of an event-sourced system? How have you dealt with the problem of recomposing entity instances from their command history (given that the topic has millions of entries, scanning the whole set is not an option)? Did you use only 1 partition, sacrificing potential concurrent consumers (given that the order guarantee is restricted to a single topic partition)?
Any specific or general feedback would be greatly appreciated, as this is a complex topic with several considerations.
Thanks in advance.
EDIT
There was a similar discussion 6 years ago here:
Using Kafka as a (CQRS) Eventstore. Good idea?
Consensus back then was also divided, and a lot of the people who suggested this approach is convenient mentioned how Kafka deals natively with huge amounts of real-time data. Nevertheless, the problem (for me at least) isn't related to that, but to how inconvenient Kafka's capabilities are for rebuilding an entity's state: either by modeling topics as entity instances (where the exponential explosion in the number of topics is undesirable), or by modeling topics as entity types (where the number of events within the topic makes reconstruction very slow/impractical).

Your understanding is mostly correct:
- Kafka has no search, and definitely none by key. There's a seek-to-timestamp, but it's imperfect and not good for what you're trying to do.
- Kafka actually supports a limited form of transactions these days (see exactly-once), although they will be of no use if you interact with any system outside of Kafka.
- The unit of everything in Kafka (event ordering, availability, replication) is a partition. There are no guarantees across partitions of the same topic.
None of this stops applications from using Kafka as the source of truth for their state, so long as:
- your problem can be "sharded" into topic partitions, so you don't care about the order of events across partitions (see the sketch after this list);
- you're willing to "replay" an entire partition as a bootstrap if/when you lose your local state;
- you use log-compacted topics to try and keep a bound on their size (because you will need to replay them to bootstrap; see the previous point).
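As a rough illustration of the first and third points (topic name, counts, and keys are made up): using the entity id as the record key makes Kafka's default partitioner hash all events of one entity to the same partition, where they are totally ordered, and creating the topic with `cleanup.policy=compact` bounds how much a bootstrap replay has to read.

```java
import java.util.*;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.*;

public class ShardedEventLog {
  public static void main(String[] args) throws InterruptedException, ExecutionException {
    Properties common = new Properties();
    common.put("bootstrap.servers", "localhost:9092");

    // 1. Log compaction keeps at least the latest record per key, which
    //    bounds the size of the partition you may have to replay.
    try (AdminClient admin = AdminClient.create(common)) {
      NewTopic topic = new NewTopic("order-events", 3, (short) 1) // dev-sized
          .configs(Map.of("cleanup.policy", "compact"));
      admin.createTopics(List.of(topic)).all().get();
    }

    // 2. Keying by entity id shards the stream: all events of order-42 land
    //    in one partition and stay ordered there. No order across partitions.
    Properties producerProps = new Properties();
    producerProps.putAll(common);
    producerProps.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    producerProps.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    try (Producer<String, String> producer = new KafkaProducer<>(producerProps)) {
      producer.send(new ProducerRecord<>("order-events", "order-42", "OrderCreated"));
      producer.send(new ProducerRecord<>("order-events", "order-42", "OrderPaid"));
    }
  }
}
```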
Both Samza and (IIUC) Kafka Streams back their state stores with log-compacted Kafka topics. Internally to Kafka, offset and consumer-group management is stored as a log-compacted topic, with brokers holding a "materialized view" in memory: when ownership of a partition of __consumer_offsets moves between brokers, the new leader replays the partition to rebuild this view.

I have been on several projects that use Kafka as long-term storage, and Kafka has no problem with it, especially with the latest versions: they introduced something called tiered storage, which gives you the possibility, in a cloud environment, to transfer older data to slower/cheaper storage.
And you should not worry that much about transactions; in today's IT there are other concepts to deal with this, like Event Sourcing and Bounded Contexts. Yes, you have to think differently when you are designing your applications. How? That is explained in this video.
But you are right that your options for querying this data will be limited. The easiest way is to use Kafka Streams and a KTable, but that is a key/value store, so you can only ask questions about your data over the primary key.
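For completeness, a minimal sketch of such a key/value view, assuming a topic `orders` keyed by order id (the topic, store, and application id names are made up): the KTable materializes the latest value per key, and interactive queries allow lookups by that key only.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrdersView {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-view");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    // Materialize the topic as a table: latest value per key, queryable by key.
    builder.table("orders",
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("orders-store"));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();

    // Interactive query (once the instance is RUNNING): primary-key lookup only.
    ReadOnlyKeyValueStore<String, String> store = streams.store(
        StoreQueryParameters.fromNameAndType("orders-store",
            QueryableStoreTypes.keyValueStore()));
    System.out.println(store.get("order-42"));
  }
}
```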
Your next best choice is to implement the query part of CQRS with the help of frameworks like Akka Projection. I wrote a blog about how you can use Akka Projection with Elasticsearch, which you can find here and here.

Related

Schema registry incompatible changes

All the documentation clearly describes how to handle compatible changes with the Schema Registry's compatibility types.
But how do you introduce incompatible changes without directly disturbing the downstream consumers, so that they can migrate at their own pace?
We have the following situation, where the producer is producing the same message in both schema versions:
[Image: producer writing each message in both the old and the new schema version]
The problem is how to migrate the apps and the sink connector in a controlled way, where business continuity is important and the consumers are not allowed to process the same message again (in the new format).
"the consumers are not allowed to process the same message (in the new format)"
Your consumers need to be aware of the old format while consuming the new one; they need to understand what it means to consume the "same message". That's up to you to code; it is not something Connect or other consumers can automatically determine, with or without a Registry.
In my experience, the best approach to prevent duplicate record processing across various topics is to persist unique ids (UUIDs) as part of each record, across all schema versions, and then query some source of truth for what has or hasn't been processed already. When a record has not been processed, insert its id into that system after the record has been processed.
This may require placing a stream processing application that filters already-processed records out of a topic before the sink connector consumes it.
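A minimal sketch of such a filter with plain clients; the topic names are made up, the record key is assumed to carry the UUID, and an in-memory set stands in for the real source of truth (in practice, a database lookup plus insert):

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;

public class DedupFilter {
  // Stand-in for the real source of truth (e.g. a database table of ids).
  private static final Set<String> processedIds = new HashSet<>();

  public static void main(String[] args) {
    Properties cProps = new Properties();
    cProps.put("bootstrap.servers", "localhost:9092");
    cProps.put("group.id", "dedup-filter");
    cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    Properties pProps = new Properties();
    pProps.put("bootstrap.servers", "localhost:9092");
    pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // Reads "raw-events" and feeds the deduplicated "clean-events" topic
    // that the sink connector then consumes.
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
         KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
      consumer.subscribe(List.of("raw-events"));
      while (true) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
          String uuid = rec.key(); // assumption: the record key carries the UUID
          if (processedIds.add(uuid)) { // only forward ids we haven't seen yet
            producer.send(new ProducerRecord<>("clean-events", uuid, rec.value()));
          }
        }
      }
    }
  }
}
```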
I figure what you are looking for is a kind of equivalent to a topic offset, but spanning multiple topics. Technically this is not provided by Kafka, and with good reason, I'd like to add. The solution would be very specific to each use case, but I figure it all boils down to introducing your own functional offset attribute in both streams.
Consumers will have to maintain state regarding which messages have been processed and, when switching to another topic, filter out messages that were already processed from the other topic. You could use your own sequence numbering or timestamps to keep track of progress across topics. Using a sequence makes it easier to keep track of progress, as only one value needs to be stored at the consumer end. Using UUIDs or other non-sequence ids will potentially require a more complex state-keeping mechanism.
Keep in mind that switching to a new topic will probably mean that lots of messages have to be skipped, and depending on the amount this might cause a delay that you need to be willing to accept.
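A rough sketch of the sequence-number variant, under the assumption that every record value starts with a monotonically increasing sequence id (the "<seq>:<payload>" encoding and topic name are made up for illustration):

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SeqTrackingConsumer {
  private long lastProcessedSeq; // load from durable local state on startup

  // Hypothetical encoding: the record value is "<seq>:<payload>".
  private static long extractSeq(String value) {
    return Long.parseLong(value.substring(0, value.indexOf(':')));
  }

  public void consume(KafkaConsumer<String, String> consumer) {
    consumer.subscribe(List.of("orders-v2")); // the new topic
    while (true) {
      for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
        long seq = extractSeq(rec.value());
        if (seq <= lastProcessedSeq) continue; // already processed via orders-v1
        handle(rec);                           // business logic
        lastProcessedSeq = seq;                // persist this value durably
      }
    }
  }

  private void handle(ConsumerRecord<String, String> rec) { /* ... */ }
}
```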

Use transactional API and exactly-once with regular Producers and Consumers

The Confluent documents that I was able to find all focus on Kafka Streams applications when it comes to exactly-once/transactions/idempotence.
However, the transaction APIs were introduced at the "regular" Producer/Consumer level, and all the explanations and diagrams focus on them.
I was wondering whether it's OK to use those APIs directly, without Kafka Streams.
I do understand the consequences of Kafka's processing boundaries and guarantees, and I'm OK with violating them. I don't need a 100% exactly-once guarantee; it's OK to have a duplicate once in a while, for example when I read from/write to external systems.
The problem I'm facing is that I need to create an ETL pipeline for a big data project where we are getting a lot of duplicates when the apps are restarted/relocated to different hosts automatically by Kubernetes.
In general, it's not a problem to have some duplicates; it's a pipeline for analytics where duplicates are acceptable, but if the issue can be mitigated at least on the Kafka side, that would be great. Will using the transactional API guarantee exactly-once for Kafka at least (to make sure that re-processing doesn't happen when reassignments/shut-downs/scaling activities occur)?
Switching to Kafka Streams is not an option because we are quite late in the project.
Exactly-once semantics is achievable with regular producers and consumers as well; Kafka Streams is itself built on top of these clients.
We can use an idempotent producer to achieve this.
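Yes, the transactional API is usable directly with plain clients. A minimal consume-transform-produce sketch might look as follows (topic names, ids, and the toUpperCase transform are made up): `enable.idempotence` removes duplicates caused by the producer's internal retries, and the transaction commits the output records and the consumed offsets atomically, so a restart replays the batch instead of duplicating it.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;

public class TransactionalPipeline {
  public static void main(String[] args) {
    Properties pProps = new Properties();
    pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    pProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // no dupes from retries
    pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "etl-1");  // stable per instance
    pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
    pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");

    Properties cProps = new Properties();
    cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "etl-group");
    cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted txns
    cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
    cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
    cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(pProps);
         KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps)) {
      producer.initTransactions();
      consumer.subscribe(List.of("input"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        if (records.isEmpty()) continue;
        producer.beginTransaction();
        try {
          Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
          for (ConsumerRecord<String, String> rec : records) {
            producer.send(new ProducerRecord<>("output", rec.key(), rec.value().toUpperCase()));
            offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                        new OffsetAndMetadata(rec.offset() + 1));
          }
          // Offsets commit atomically with the produced records.
          producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
          producer.commitTransaction();
        } catch (Exception e) {
          // Discard the output; seek the consumer back to the last committed
          // offsets before retrying (elided here).
          producer.abortTransaction();
        }
      }
    }
  }
}
```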
When dealing with external systems, it is important to ensure that we don't produce the same message again and again using producer.send(). Idempotence applies to internal retries by the Kafka clients, but doesn't take care of duplicate calls to send().
When we produce messages that arrive from a source, we need to ensure that the source itself doesn't produce a duplicate message. For example, if it is a database, use a WAL, maintain the last read offset for that WAL, and restart from that point. Debezium, for example, does that; you may check whether it supports your data source.

Kafka: topic compaction notification?

I was given the following architecture that I'm trying to improve.
I receive a stream of DB changes which end up in a compacted topic. The stream is basically key/value pairs and the keyspace is large (~4 GB).
The topic is consumed by one Kafka Streams process that stores the data in RocksDB (separate for each consumer/shard). The processor does two different things:
- join the data into another stream;
- check whether a message from the topic is a new key or an update to an existing one. If it is an update, it sends the old key/value pair and the new key/value pair to a different topic (updates are rare).
The construct has a couple of problems:
The two different functionalities of the stream processor belong to different teams and should not be part of the same code base. They were put together to save memory; if we separated them, we would have to duplicate the RocksDB stores.
I would prefer to use a normal KTable join instead of the handcrafted join that's currently in the code.
RocksDB seems to be a bit of overkill if the data is already persisted in a topic. We are currently running into some performance issues, and I assume it would be faster if we just kept everything in memory.
Question 1:
Is there a way to hook into the compaction process of a compacted topic? I would like a notification (to a different topic) for every key that is actually compacted (including the old and new value).
If this is somehow possible I could easily split the code bases apart and simplify the join.
Question 2:
Any other idea on how this can be solved more elegantly?
Your overall design makes sense.
About your join semantics: I guess you need to stick with the Processor API, as a regular KTable cannot provide what you want. It's also not possible to hook into the compaction process.
However, Kafka Streams also supports in-memory state stores: https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html#state-stores
RocksDB is used by default to allow the state to be larger than the available main memory. Spilling to disk with RocksDB is not about reliability; it also has the advantage that stores can be recreated more quickly if an instance comes back online on the same machine, as it's not required to re-read the whole changelog topic.
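Switching to an in-memory store is a small change. A minimal sketch, with made-up topic and store names; note that fault tolerance still comes from the changelog topic rather than from local disk:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

public class InMemoryStoreExample {
  public static void build(StreamsBuilder builder) {
    // Back the table with an in-memory store instead of the default RocksDB.
    // A restarted instance has to replay the changelog to rebuild its state.
    KTable<String, String> changes = builder.table(
        "db-changes",
        Materialized.<String, String>as(Stores.inMemoryKeyValueStore("changes-store")));
  }
}
```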
Whether you want to split the app into two is your own decision, depending on how many resources you are willing to provide.

Akka Stream Kafka vs Kafka Streams

I am currently working with Akka Streams Kafka to interact with Kafka, and I was wondering what the differences are compared with Kafka Streams.
I know that the Akka-based approach implements the Reactive Streams specification and handles back-pressure, functionality that Kafka Streams seems to be lacking.
What would be the advantage of using Kafka Streams over Akka Streams Kafka?
Your question is very general, so I'll give a general answer from my point of view.
First, I've got two usage scenarios:
- cases where I'm reading data from Kafka, processing it, and writing some output back to Kafka; for these I'm using Kafka Streams exclusively;
- cases where either the data source or sink is not Kafka; for those I'm using Akka Streams.
This already allows me to answer the part about back-pressure: for the first scenario above, there is a back-pressure mechanism in Kafka Streams.
Let's now focus only on the first scenario described above, and see what I would lose if I decided to stop using Kafka Streams:
- Some of my stream processor stages need a persistent (distributed) state store; Kafka Streams provides it for me. This is something that Akka Streams doesn't provide.
- Scaling: Kafka Streams automatically balances the load as soon as a new instance of a stream processor is started, or as soon as one gets killed. This works inside the same JVM as well as across other nodes: scaling up and out. This is not provided by Akka Streams.
Those are the biggest differences that matter to me. I hope this makes sense to you!
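As a minimal sketch of those two points (topic, store, and application id names are made up): the count below lives in a fault-tolerant state store that Kafka Streams provisions itself, and starting a second copy of the very same program makes the two instances split the partitions between them automatically.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class ClicksPerUser {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Every instance started with this application.id joins the same group;
    // Kafka rebalances partitions (and their state) among them automatically.
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "clicks-per-user");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> clicks = builder.stream("clicks");
    // The count lives in a state store backed by a changelog topic;
    // persistence and recovery are handled by Kafka Streams itself.
    clicks.groupByKey()
          .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("clicks-store"));

    new KafkaStreams(builder.build(), props).start();
  }
}
```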
The big advantage of Akka Streams over Kafka Streams is the possibility to implement very complex processing graphs that can be cyclic, with fan-in/out and feedback loops. Kafka Streams only allows acyclic graphs, if I am not wrong; it would be very complicated to implement a cyclic processing graph on top of Kafka Streams.
I found this article to give a good summary of the distributed design concerns that Kafka Streams addresses (complementing Akka Streams):
https://www.beyondthelines.net/computing/kafka-streams/
- Message ordering: Kafka maintains a sort of append-only log where it stores all the messages. Each message has a sequence id, also known as its offset. The offset is used to indicate the position of a message in the log. Kafka Streams uses these message offsets to maintain ordering.
- Partitioning: Kafka splits a topic into partitions, and each partition is replicated among different brokers. Partitioning allows spreading the load, and replication makes the application fault-tolerant (if a broker is down, the data is still available). That's good for data partitioning, but we also need to distribute the processing in a similar way. Kafka Streams uses the processor topology, which relies on Kafka's group management: the same group management used by the Kafka consumer to distribute load evenly among consumers (this work is mainly managed by the brokers).
- Fault tolerance: data replication ensures data fault tolerance. Group management has fault tolerance built in, as it redistributes the workload among the remaining live instances.
- State management: Kafka Streams provides local storage backed up by a Kafka changelog topic, which uses log compaction (keeping only the latest value for a given key).
- Reprocessing: when starting a new version of the app, we can reprocess the logs from the start to compute the new state, then redirect the traffic to the new instance and shut down the old application.
- Time management: "Stream data is never complete and can always arrive out-of-order", therefore one must distinguish event time from processing time and handle it correctly.
The author also says: "Using this change-log topic Kafka Stream is able to maintain a 'table view' of the application state."
My take is that this applies mostly to an enterprise application where the "application state" is ... small.
For a data science application working with "big data", the "application state" produced by a combination of data munging, machine-learning models, and the business logic that orchestrates all of this will likely not be managed well with Kafka Streams.
Also, I am thinking that using a "pure functional event sourcing runtime" like https://github.com/notxcain/aecor will help make the mutations explicit and separate the application logic from the technology used to manage the persistent form of the state, through the principled management of state mutation and IO "effects" (functional programming).
In other words, the business logic does not become tangled with the Kafka APIs.
Akka Streams emerged as a dataflow-centric abstraction over the Akka Actors model. It is a high-performance library built for the JVM, designed especially for general-purpose microservices.
Kafka Streams, on the other hand, is a client library used to process unbounded data. It is used to read data from Kafka topics, process it, and write the results to new topics.
Well, I have used both of those, and I have a pretty good idea about their strengths and weaknesses.
If you are solely concentrated on Kafka and you don't have much experience with stream processing, Kafka Streams is a good out-of-the-box solution that helps you understand the streaming concepts. Its Achilles heel, in my opinion, is its data store: RocksDB, which backs stateful scenarios with KTables or internal state stores.
If you use the Kafka Streams library, RocksDB installs itself in the background transparently, which is great for a beginner but troublesome for an experienced developer. RocksDB is a key/value database, like Cassandra; it has most of Cassandra's strengths but also its weaknesses, a major one being that you can only query things by primary key, which for most real-life scenarios is a huge limitation. There are some means to implement your own data store, but they are not that well documented and can be a great challenge. Also, RocksDB is really great at loading single values, but if you have to iterate over things, after a dataset size of 100,000 the performance degrades significantly.
Unfortunately, because RocksDB is embedded so deeply in Kafka Streams, it is also not that easy to implement a CQRS solution with it.
And, as mentioned above, it has no concept of back-pressure, while the Kafka consumer hands records over one by one; in a scenario where you have to scale out, that can be a real bottleneck. And be really careful about the statement that Kafka Streams does not need a back-pressure mechanism; as this Netflix blog points out, it can cause really unpleasant effects.
"By the following morning, alerts were received regarding high memory consumption and GC latencies, to the point where the service was unresponsive to HTTP requests. An investigation of the JVM memory dump revealed an internal Kafka message concurrent queue whose size had grown uncontrollably to over 1.3 million elements.
The cause for this abnormal queue growth is due to Spring KafkaListener’s lack of native back-pressure support."
So what are the advantages and disadvantages of Akka Streams compared to Kafka Streams? Well, first of all, Akka is not that much of an out-of-the-box framework: you have to understand the concepts much better, and it is not coupled to a single persistence option; you can choose whatever you want. It has direct support for the CQRS pattern (Akka Projection), so you are not bound to querying your data only over the primary key. The Akka developers thought a lot about scaling out and back-pressure, and committed a lot of code to the Kafka code base to improve performance.
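For the back-pressure point, a minimal Alpakka Kafka (Akka Streams Kafka) sketch, assuming Akka 2.6; the topic and group names are made up. Records are fetched from Kafka only as fast as the downstream stage signals demand:

```java
import akka.actor.ActorSystem;
import akka.kafka.ConsumerSettings;
import akka.kafka.Subscriptions;
import akka.kafka.javadsl.Consumer;
import akka.stream.javadsl.Sink;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BackpressuredConsumer {
  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("alpakka-demo");

    ConsumerSettings<String, String> settings =
        ConsumerSettings.create(system, new StringDeserializer(), new StringDeserializer())
            .withBootstrapServers("localhost:9092")
            .withGroupId("demo-group");

    // Demand flows upstream (Reactive Streams semantics): a slow downstream
    // stage automatically throttles how fast records are pulled from Kafka.
    Consumer.plainSource(settings, Subscriptions.topics("events"))
        .map(record -> record.value())
        .runWith(Sink.foreach(System.out::println), system);
  }
}
```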
So if you are only working with Kafka and are new to stream processing, you can use Kafka Streams, but be prepared that at some point you may hit a wall and have to switch to Akka Streams.
If you want to see working details/examples, I have two blogs about it; you can check those out: blog1, blog2.

Reliable usage of DirectKafkaAPI

I am planning to develop a reliable streaming application based on the direct Kafka API. I will have one producer and another consumer. I wanted to know what the best approach is to achieve reliability in my consumer. I can employ two solutions:
- Increasing the retention time of messages in Kafka
- Using write-ahead logs
I am a bit confused regarding the usage of write-ahead logs in the direct Kafka API, as there is no receiver, but the documentation indicates:
"Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. "
So I wanted to know what the best approach is: does it suffice to increase the TTL of messages in Kafka, or do I also have to enable write-ahead logs?
I guess it would be good practice to avoid one of the above, since the backup data (retained messages, checkpoint files) can be lost and then recovery could fail.
The direct approach eliminates the data-duplication problem: as there is no receiver, there is no need for write-ahead logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
Also, the direct approach by default supports exactly-once message delivery semantics, and it does not use Zookeeper; offsets are tracked by Spark Streaming within its checkpoints.
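A minimal sketch of the direct approach with checkpointing, using the spark-streaming-kafka-0-10 Java API (the topic, group id, and checkpoint path are made up):

```java
import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;

public class DirectStreamApp {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("direct-kafka");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
    jssc.checkpoint("hdfs:///checkpoints/direct-kafka"); // offsets live here, not in ZK

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");
    kafkaParams.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    kafkaParams.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    kafkaParams.put("group.id", "direct-app");
    kafkaParams.put("auto.offset.reset", "earliest");

    // No receiver, hence no write-ahead log: on recovery, each batch re-reads
    // its exact offset range straight from Kafka.
    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("events"), kafkaParams));

    stream.map(ConsumerRecord::value).print();

    jssc.start();
    jssc.awaitTermination();
  }
}
```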