Backing up Kafka in a Non Idempotent Architecture - apache-kafka

We're heading up the architecture of an event-sourced solution that uses both Kafka Streams and plain Kafka consumers and producers. The "workers" are not idempotent; instead, we rely on consumer group offsets. Kafka acts as our single source of truth.
Looking through options for true backup solutions for Kafka (replication is not backup), we've seen the variety of S3 connectors out there, and we checked out Confluent's Replicator as well as the upcoming MirrorMaker 2. None, to my knowledge, offers true protection against topic deletion, nor point-in-time restoration.
As a small-ish startup, we're keen on not operating Kafka in-house, although disk-level snapshots (not offered by any provider known to me) seem to be the way forward.
Having looked at AWS MSK, Confluent and CloudKarafka, I am interested in hearing about your experiences, pros/cons and solutions in a similar (or at least approximate) architecture.

Related

Kafka (MSK) and MirrorMaker2 disaster recovery procedures for producers and consumers

I have two questions I'm hoping someone with experience in MSK/Kafka and MirrorMaker2 can help with.
Currently we have a production MSK 2.7.0 cluster with 3 brokers and roughly 1T in topic data. We use the Debezium plugin for most things, a few jdbc/mysql sink connectors and then a handful of random consumers so far. For DR purposes, I'm considering adding a second MSK cluster of the same size and using MirrorMaker2 to replicate everything to it. I've done a fair amount of searching and reading about how others might be approaching DR for Kafka. It seems that MM2 is the standard.
I've seen conflicting views on whether active/standby or active/active is recommended. It seems that active/active would be ideal but it comes with a lot of considerations for producers and consumers, mostly when event ordering is important. Curious if anyone can elaborate on that, and how realistic it would be in setting up that topology. Event order is important for most of our cases.
For an active/standby configuration, it's not clear to me after what I've read what to plan for in the event the primary cluster goes down permanently and all of the consumers/producers have to migrate over to the new cluster. There's a lot written about how MM2 replicates its own offset data, but I'm not finding much about what a consumer needs to account for when being moved over to the replicated topic. I'm especially interested in what it would mean to move the Debezium connectors over, and if it has a mechanism built in for such a thing or what I should expect.

Use transactional API and exactly-once with regular Producers and Consumers

The Confluent documents that I was able to find all focus on Kafka Streams applications when it comes to exactly-once/transactions/idempotence.
However, the APIs for transactions were introduced at the "regular" Producer/Consumer level, and all the explanations and diagrams focus on them.
I was wondering whether it's OK to use those APIs directly, without Kafka Streams.
I do understand the consequences of Kafka processing boundaries and the guarantees, and I'm OK with violating them. I don't need a 100% exactly-once guarantee; it's OK to have a duplicate once in a while, for example when I read from or write to external systems.
The problem I'm facing is that I need to create an ETL pipeline for a Big Data project where we are getting a lot of duplicates when the apps are restarted or relocated to different hosts automatically by Kubernetes.
In general, it's not a problem to have some duplicates; it's a pipeline for analytics where duplicates are acceptable. But if the issue can be mitigated at least on the Kafka side, that would be great. Will using the transactional API guarantee exactly-once at least within Kafka (to make sure that re-processing doesn't happen during reassignments, shutdowns or scaling activities)?
Switching to Kafka Streams is not an option because we are quite late in the project.
Exactly-once semantics is also achievable with regular producers and consumers. Kafka Streams is itself built on top of these clients.
We can use an idempotent producer to help achieve this. It removes duplicates caused by the producer's internal retries; for a consume-process-produce loop, the transactional API additionally commits the consumed offsets atomically with the produced records.
When dealing with external systems, it is important to ensure that we don't produce the same message again and again using producer.send(). Idempotence applies to internal retries by Kafka clients but doesn't take care of duplicate calls to send().
When we produce messages that arrive from a source, we need to ensure that the source doesn't produce a duplicate message. For example, if it is a database, use a WAL, maintain the last read offset for that WAL, and restart from that point. Debezium, for example, does that. You may check whether it supports your data source.
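The difference between internal retries and repeated send() calls can be illustrated with a small toy model. This is only a sketch in plain Python, not the Kafka client API: `PartitionLog`, `ToyProducer` and the `simulated_retries` parameter are hypothetical names, and the real broker-side sequence check is stricter than the one shown here.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy partition that deduplicates by (producer_id, sequence number)."""
    records: list = field(default_factory=list)
    last_seq: dict = field(default_factory=dict)  # producer_id -> last accepted sequence

    def append(self, producer_id, seq, value):
        # A retry re-sends an already accepted sequence number and is dropped.
        if seq <= self.last_seq.get(producer_id, -1):
            return False
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True


class ToyProducer:
    """Hypothetical stand-in for a producer with idempotence enabled."""

    def __init__(self, producer_id, log):
        self.producer_id = producer_id
        self.log = log
        self.next_seq = 0

    def send(self, value, simulated_retries=0):
        # Every send() call is assigned a fresh sequence number.
        seq = self.next_seq
        self.next_seq += 1
        # Internal retries reuse the same sequence number.
        for _ in range(1 + simulated_retries):
            self.log.append(self.producer_id, seq, value)


log = PartitionLog()
producer = ToyProducer(producer_id=1, log=log)
producer.send("order-created", simulated_retries=2)  # retried internally: stored once
producer.send("order-paid")
producer.send("order-paid")  # the application called send() twice: stored twice
print(log.records)
```

In this model the retried record is stored once, while the repeated application-level send() produces a second copy, which is why the source side still has to avoid emitting the same message twice.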

Is it ok to use Apache Kafka "infinite retention policy" as a base for an Event sourced system with CQRS?

I'm currently evaluating options for designing and implementing an Event Sourcing + CQRS architecture. Since we want to use Apache Kafka for other aspects (normal pub-sub messaging + stream processing), the next logical question would be: "Can we use the Apache Kafka store as the event store for CQRS?" Or, more importantly, would that be a smart decision?
Right now I'm unsure about this.
This source seems to support it: https://www.confluent.io/blog/okay-store-data-apache-kafka/
This other source recommends against that: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
In my current tests/experiments, I'm having problems similar to those described by the second source:
- Recomposing an entity: Kafka doesn't seem to support fast retrieval/searching of specific events within a topic. For example, retrieving all commands related to an order's history, which is necessary to reconstruct the entity's instance, seems to require scanning all of the topic's events and keeping only those matching some entity instance identifier, which is a no-go. [This other person seems to have arrived at a similar conclusion: Query Kafka topic for specific record -- that is, it is just not possible (without relying on some hacky trick)]
- Write consistency: Kafka doesn't support transactional atomicity on its store, so it seems a common practice to put a DB with some locking approach (usually optimistic locking) in front, and then asynchronously export the events to the Kafka queue. (I can live with this, though; the first problem is much more crucial to me.)
- The partition problem: The Kafka documentation mentions that the "order guarantee" exists only within a topic's partition. At the same time, it also says that the partition is the basic unit of parallelism; in other words, if you want to parallelize work, spread the messages across partitions (and brokers, of course). But this is a problem, because an "event store" in an event-sourced system needs the order guarantee, which means I'm forced to use only one partition for this use case if I absolutely need the order guarantee. Is this correct?
Even though this question is a bit open, it really comes down to this: Have you used Kafka as your main event store in an event-sourced system? How have you dealt with the problem of recomposing entity instances out of their command history (given that the topic has millions of entries, scanning the whole set is not an option)? Did you use only one partition, sacrificing potential concurrent consumers (given that the order guarantee is restricted to a specific topic partition)?
Any specific or general feedback would be greatly appreciated, as this is a complex topic with several considerations.
Thanks in advance.
EDIT
There was a similar discussion 6 years ago here:
Using Kafka as a (CQRS) Eventstore. Good idea?
Consensus back then was also divided, and a lot of the people who suggest this approach is convenient mention how Kafka deals natively with huge amounts of real-time data. Nevertheless, the problem (for me at least) isn't related to that; it is more about how inconvenient Kafka's capabilities are for rebuilding an entity's state, either by modelling topics as entity instances (where the explosion in the number of topics is undesired), or by modelling topics as entity types (where the number of events within the topic makes reconstruction very slow/impractical).
your understanding is mostly correct:
- kafka has no search, and definitely not by key. there's a seek to timestamp, but it's imperfect and not good for what you're trying to do.
- kafka actually supports a limited form of transactions these days (see exactly-once), although they are of no use once you interact with any system outside of kafka.
- the unit of everything in kafka (event ordering, availability, replication) is a partition. there are no guarantees across partitions of the same topic.
none of this stops applications from using kafka as the source of truth for their state, so long as:
- your problem can be "sharded" into topic partitions, so that you don't care about the order of events across partitions
- you're willing to "replay" an entire partition as bootstrap if/when you lose your local state
- you use log-compacted topics to try to keep a bound on their size (because you will need to replay them to bootstrap, see the point above)
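As a rough illustration of the "sharding" point, the sketch below routes events to partitions by entity key, so that each entity's events stay ordered within one partition even though nothing is guaranteed across partitions. It is a plain-Python model, not Kafka client code: `partition_for`, `history` and the sample events are hypothetical, and CRC32 is only a stand-in for the hash used by Kafka's default partitioner.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count for the example


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Events in the order they were produced, keyed by entity id.
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]

# Each partition is an append-only list; the index plays the role of the offset.
partitions = defaultdict(list)
for key, event in events:
    partitions[partition_for(key)].append((key, event))


def history(key: str) -> list:
    # Reading the entity's partition in offset order yields its events in order.
    return [event for k, event in partitions[partition_for(key)] if k == key]


print(history("order-1"))
print(history("order-2"))
```

Note that `history` still scans a whole partition, which is exactly the replay cost described above; keying only guarantees the per-entity ordering, not fast lookup.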
both samza and (IIUC) kafka-streams back their state stores with log-compacted kafka topics. internally to kafka offset and consumer group management is stored as a log compacted topic with brokers holding a "materialized view" in memory - when ownership of a partition of __consumer_offsets moves between brokers the new leader replays the partition to rebuild this view.
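The replay of a log-compacted topic into a "materialized view" can be sketched as follows. This is a simplified model in plain Python, not broker code: `compact`, `replay` and the sample changelog are hypothetical, and real compaction runs in the background and retains delete markers for a configurable time before dropping them.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order."""
    latest_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


def replay(log):
    """Rebuild the in-memory view by applying records in offset order."""
    view = {}
    for key, value in log:
        if value is None:
            # A None value is treated as a tombstone: the key was deleted.
            view.pop(key, None)
        else:
            view[key] = value
    return view


# Hypothetical changelog: (key, committed offset) pairs, oldest first.
changelog = [
    ("group-a/topic-0", 5),
    ("group-b/topic-0", 2),
    ("group-a/topic-0", 9),
    ("group-b/topic-0", None),
    ("group-a/topic-0", 12),
]

compacted = compact(changelog)
print(replay(compacted))
```

Replaying the compacted log gives the same view as replaying the full log, but with fewer records to read, which is what keeps the bootstrap time bounded.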
I have worked on several projects that use Kafka as long-term storage, and Kafka has no problem with it. This is especially true of the latest versions of Kafka, which introduced tiered storage, giving you the possibility in cloud environments to move older data to slower/cheaper storage.
You should not worry that much about transactions either. In today's IT there are other concepts for dealing with that, like Event Sourcing and Bounded Context. Yes, you have to think differently when you are designing your applications; how to do that is explained in this video.
But you are right that your options for querying this data will be limited. The easiest way is to use Kafka Streams and a KTable, but this will be a key/value database, so you can only query your data by primary key.
Your next best choice is to implement the Query part of CQRS with the help of frameworks like Akka Projection. I wrote blog posts about how you can use Akka Projection with Elasticsearch, which you can find here and here.

Akka Stream Kafka vs Kafka Streams

I am currently working with Akka Stream Kafka to interact with Kafka, and I was wondering what the differences are compared with Kafka Streams.
I know that the Akka based approach implements the reactive specifications and handles back-pressure, functionality that kafka streams seems to be lacking.
What would be the advantage of using kafka streams over akka streams kafka?
Your question is very general, so I'll give a general answer from my point of view.
First, I've got two usage scenarios:
- cases where I'm reading data from Kafka, processing it and writing some output back to Kafka; for these I'm using Kafka Streams exclusively.
- cases where either the data source or sink is not Kafka; for those I'm using Akka Streams.
This already allows me to answer the part about back-pressure: for the first scenario above, there is a back-pressure mechanism in Kafka Streams.
Let's now focus only on the first scenario. Let's see what I would lose if I decided to stop using Kafka Streams:
- some of my stream processor stages need a persistent (distributed) state store; Kafka Streams provides it for me. It is something that Akka Streams doesn't provide.
- scaling: Kafka Streams automatically rebalances the load as soon as a new instance of a stream processor is started, or as soon as one gets killed. This works inside the same JVM as well as on other nodes: scaling up and out. This is not provided by Akka Streams.
Those are the biggest differences that matter to me, I'm hoping that it makes sense to you!
The big advantage of Akka Streams over Kafka Streams would be the possibility to implement very complex processing graphs that can be cyclic, with fan-in/fan-out and feedback loops. Kafka Streams only allows acyclic graphs, if I am not wrong. It would be very complicated to implement a cyclic processing graph on top of Kafka Streams.
Found this article to give a good summary of distributed design concerns that Kafka Streams provides (complements Akka Streams).
https://www.beyondthelines.net/computing/kafka-streams/
message ordering: Kafka maintains a sort of append only log where it stores all the messages, Each message has a sequence id also known as its offset. The offset is used to indicate the position of a message in the log. Kafka streams uses these message offsets to maintain ordering.
partitioning: Kafka splits a topic into partitions and each partition is replicated among different brokers. The partitioning allows to spread the load and replication makes the application fault-tolerant (if a broker is down the data are still available). That’s good for data partitioning but we also need to distribute the processes in a similar way. Kafka Streams uses the processor topology that relies on Kafka group management. This is the same group management that is used by the Kafka consumer to distribute load evenly among brokers (This work is mainly managed by the brokers).
Fault tolerance: data replication ensures data fault tolerance. Group management has fault tolerance built-in as it redistributes the workload among remaining live broker instances.
State management: Kafka Streams provides local storage backed by a Kafka change-log topic which uses log compaction (keeps only the latest value for a given key).
Reprocessing: When starting a new version of the app, we can reprocess the logs from the start to compute the new state, then redirect the traffic to the new instance and shut down the old application.
Time management: “Stream data is never complete and can always arrive out-of-order” therefore one must distinguish the event time vs processed time and handle it correctly.
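The event-time point can be made concrete with a small sketch: records are grouped into one-minute tumbling windows by the timestamp they carry, so a late arrival still lands in the window it belongs to. This is plain Python rather than the Kafka Streams windowing API; `WINDOW_SECONDS`, `window_start` and the sample records are assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling window size


def window_start(event_time: int) -> int:
    # Align the event's own timestamp to the start of its window.
    return event_time - (event_time % WINDOW_SECONDS)


# (event_time_in_seconds, value) in arrival order.
# The third record arrives after a record from a later window, i.e. out of order.
arrivals = [(5, "a"), (65, "b"), (50, "c"), (130, "d")]

counts = defaultdict(int)
for event_time, _value in arrivals:
    # Bucketing by event time, not by arrival (processing) time.
    counts[window_start(event_time)] += 1

print(dict(counts))
```

Had the records been bucketed by arrival order instead, the late record "c" would have been counted in the wrong window.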
Author also says "Using this change-log topic Kafka Stream is able to maintain a “table view” of the application state."
My take is that this applies mostly to an enterprise application where the "application state" is ... small.
For a data science application working with "big data", the "application state" produced by a combination of data munging, machine learning models and business logic to orchestrate all of this will likely not be managed well with Kafka Streams.
Also, am thinking that using a "pure functional event sourcing runtime" like https://github.com/notxcain/aecor will help make the mutations explicit and separate the application logic from the technology used to manage the persistent form of the state through the principled management of state mutation and IO "effects" (functional programming).
In other words the business logic does not become tangled with the Kafka apis.
Akka Streams emerged as a dataflow-centric abstraction for the Akka Actors model.
It is a high-performance library built for the JVM and specially designed for general-purpose microservices.
Kafka Streams, by contrast, is a client library used to process unbounded data. It is used to read data from Kafka topics, process it, and write the results to new topics.
Well, I have used both of these and I have a pretty good idea of their strengths and weaknesses.
If you are solely concentrating on Kafka and you don't have much experience with stream processing, Kafka Streams is a good out-of-the-box solution that helps you understand the streaming concepts. Its Achilles' heel, in my opinion, is its datastore, RocksDB, which supports stateful scenarios with KTable or internal state stores.
If you use the Kafka Streams library, RocksDB is set up transparently in the background, which is great for a beginner but troublesome for an experienced developer. RocksDB is a key/value database like Cassandra; it has most of the strengths of Cassandra but also its weaknesses. A major one is that you can only query things by primary key, which is a huge limitation for most real-life scenarios. There are some means to implement your own datastore, but they are not that well documented and could be a great challenge. Also, RocksDB is really great at loading single values, but if you have to iterate over things, performance degrades significantly beyond a dataset size of 100,000.
Unfortunately, because RocksDB is embedded so deeply in Kafka Streams, it is also not that easy to implement a CQRS solution with it.
And as mentioned above, it has no concept of back-pressure: the Kafka consumer hands over records one by one, and in a scenario where you have to scale out that can become a real bottleneck. Be really careful about the statement that Kafka Streams does not need a back-pressure mechanism; as this Netflix blog points out, its absence can cause really unpleasant effects:
"By the following morning, alerts were received regarding high memory consumption and GC latencies, to the point where the service was unresponsive to HTTP requests. An investigation of the JVM memory dump revealed an internal Kafka message concurrent queue whose size had grown uncontrollably to over 1.3 million elements.
The cause for this abnormal queue growth is due to Spring KafkaListener’s lack of native back-pressure support."
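The failure mode in the quote is an unbounded in-memory queue between a fast producer and a slow consumer. A minimal sketch of the alternative, using only the Python standard library, is a bounded queue whose blocking put() slows the producer down to the consumer's pace. The queue size, record count and sleep duration below are arbitrary assumptions, and this is not how Akka Streams or any Kafka client implements back-pressure internally.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=10)  # the bound is what provides back-pressure
max_depth = 0


def produce(n):
    global max_depth
    for i in range(n):
        buffer.put(i)  # blocks while the queue is full
        max_depth = max(max_depth, buffer.qsize())
    buffer.put(None)  # sentinel: no more records


def consume(out):
    while True:
        item = buffer.get()
        if item is None:
            break
        time.sleep(0.001)  # simulate slow processing
        out.append(item)


processed = []
producer_thread = threading.Thread(target=produce, args=(200,))
consumer_thread = threading.Thread(target=consume, args=(processed,))
producer_thread.start()
consumer_thread.start()
producer_thread.join()
consumer_thread.join()

print(len(processed), max_depth)
```

With `queue.Queue()` left unbounded, the same producer would enqueue all records immediately and the queue depth would grow with the input size, which is the pattern the memory dump in the quote revealed.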
So what are the advantages and disadvantages of Akka Streams compared to Kafka Streams? First of all, Akka is not as much of an out-of-the-box framework, so you have to understand the concepts much better. It is not coupled to a single persistence option; you can choose whatever you want. It has direct support for the CQRS pattern (Akka Projection), so you are not bound to querying your data only by primary key. The Akka developers have thought a lot about scaling out and back-pressure, and have committed a lot of code to the Kafka code base to improve performance.
So if you are only working with Kafka and are new to stream processing, you can use Kafka Streams, but be prepared that at some point you may hit a wall and switch to Akka Streams.
If you want to see working details/examples, I have two blog posts about it that you can check out: blog1 blog2

Is Kafka useful if we have less messages to process

Is Kafka useful if we have only a few messages to process? If I have 1000 messages per second to process, is Kafka feasible?
As any experienced software engineer will say, it depends ;-). There are many factors to consider. Here is just a sample:
Do you need to have these messages persisted? If not, then probably Kafka is not what you're looking for.
Even if you require persistence, it doesn't mean that Kafka can handle your throughput requirements (although my gut feeling says it can cope with your volume). The only way to determine that is to run performance tests with your message volumes against Kafka and see how it copes. It's also quite possible that other brokers like ActiveMQ can handle your volumes as well. Then it comes down to how appropriate the broker is for your use case (e.g., event sourcing). Check out Kafka's docs to see how Kafka is used in the industry.
You have to keep in mind that Kafka is currently not as popular as other brokers such as ActiveMQ. So even if Kafka is useful to your scenario, you could have a hard time finding help on Kafka questions/issues you'll have along the way.