How to query the event repository in a microservice Event Sourcing architecture with Spring Cloud Stream Kafka

CLARIFICATION: Notice that this question is different from this one: How to implement a microservice Event Driven architecture with Spring Cloud Stream Kafka and Database per service
This one is about using Kafka as the only repository (of events), with no DB needed. The other one is about using a database (MariaDB) per service plus Kafka.
I would like to implement an Event Sourcing architecture to handle distributed transactions:
    OrdersService <------------> | Kafka Event Store | <------------> PaymentsService
                    subscribe/                           subscribe/
                       find                                 find
OrdersService receives an order request and stores the new Order in the broker.
    private OrderBusiness orderBusiness;

    @PostMapping
    public Order createOrder(@RequestBody Order order) {
        logger.debug("createOrder()");
        // do whatever
        // Publish the new Order with state = pending
        order.setState(PENDING);
        try {
            orderSource.output().send(MessageBuilder.withPayload(order).build());
        } catch (Exception e) {
            // Pass the exception as the last argument so the stack trace is logged
            logger.error("Failed to publish order", e);
        }
        return order;
    }
This is my main doubt: how can I query a Kafka broker? Imagine I want to search for orders by user, date, state, etc.

Short answer: you cannot query the broker but you could exploit Kafka's Streams API and "Interactive Queries".
Long answer: The access pattern for reading Kafka topics is a linear scan, not random lookups. Of course, you can reposition at any time via seek(), but only by offset or time. Also, topics are sharded into partitions, and data is (by default) hash-partitioned by key (the data model is key-value pairs). So there is a notion of a key, but the broker offers no lookup by key or by any other field.
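To make the access pattern concrete, here is a minimal sketch in plain Java. It is a toy model, not the Kafka client API: `ToyTopic`, `Rec` and all method names are made up for illustration, and `hashCode` stands in for the hash function of Kafka's default partitioner. It shows that a positioned read is addressed only by partition and offset, while a search on anything else has to visit every record.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a partitioned, append-only topic. Not the Kafka client API.
class ToyTopic {
    // One key-value record; its offset is its index within the partition.
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    private final List<List<Rec>> partitions = new ArrayList<>();

    ToyTopic(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
    }

    // Stand-in for hash partitioning by key (an assumption: hashCode instead of Kafka's hash).
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends to the partition chosen by the key and returns the record's offset.
    long send(String key, String value) {
        List<Rec> partition = partitions.get(partitionFor(key));
        partition.add(new Rec(key, value));
        return partition.size() - 1;
    }

    // Positioned read: cheap, but addressed by (partition, offset) only.
    Rec read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    // "Query" on record content: every record in every partition is visited.
    List<Rec> scanForValueContaining(String needle) {
        List<Rec> hits = new ArrayList<>();
        for (List<Rec> partition : partitions)
            for (Rec r : partition)
                if (r.value.contains(needle)) hits.add(r);
        return hits;
    }
}
```

Searching orders by user or state corresponds to `scanForValueContaining`: its cost grows with the size of the whole topic, which is why the broker is a poor fit for ad hoc queries.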
However, you can use Kafka's Streams API, which allows you to build an app that holds the current state as a materialized view (basically a cache), based on Kafka topics that remain the ground truth. "Interactive Queries" allows you to query this materialized view.
For more details, see these two blog posts:
https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
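As a rough illustration of the materialized-view idea from this answer, here is a sketch in plain Java. It does not use the Kafka Streams API: in a real application the map below would be a state store populated by Kafka Streams and read through Interactive Queries. `OrderView`, `OrderEvent` and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy materialized view over an order event log. Not the Kafka Streams API.
class OrderView {
    // Hypothetical event shape: every event carries the full current order state.
    static final class OrderEvent {
        final String orderId;
        final String user;
        final String state;
        OrderEvent(String orderId, String user, String state) {
            this.orderId = orderId;
            this.user = user;
            this.state = state;
        }
    }

    // Latest event per order id: this map is the "view" (basically a cache).
    private final Map<String, OrderEvent> latestByOrderId = new LinkedHashMap<>();

    // Called once per event, in log order; a later event overwrites an earlier one.
    void apply(OrderEvent event) {
        latestByOrderId.put(event.orderId, event);
    }

    // Point lookup by key, the kind of read a key-value state store supports.
    String stateOf(String orderId) {
        OrderEvent e = latestByOrderId.get(orderId);
        return e == null ? null : e.state;
    }

    // Query on non-key fields, answered from the view instead of the log.
    List<String> findOrderIds(String user, String state) {
        List<String> ids = new ArrayList<>();
        for (OrderEvent e : latestByOrderId.values())
            if (e.user.equals(user) && e.state.equals(state)) ids.add(e.orderId);
        return ids;
    }
}
```

The topic stays the source of truth: dropping the view and applying all events again from the first offset rebuilds the same map.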

Related

How can Kafka Streams be used for Event sourcing?

I read about how event sourcing can be achieved by using Apache Kafka as the event broker. (Link to the Confluent article)
If we take a look at this picture, it shows how events are written into Kafka, and then Kafka Streams is used to create views in the database.
My question here is how can we use Kafka Streams for this? If I'm correct, it is a client library, so we need something that uses it, like a microservice called "Aggregate Service".
Is this the right approach to implement such a design? Would it scale well?
Kafka Streams must first consume events from Kafka that have been "sourced" by some other process using a plain Kafka producer library.
Kafka Streams applications can only scale up to the number of partitions in their source topics, as they're built on the base Kafka consumer API: instances beyond that number have no partition to read and sit idle.
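The partition limit can be illustrated with a small plain-Java sketch. The round-robin spread below is a simplification chosen for illustration, not Kafka's actual assignment strategy, and `PartitionAssignment` is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

// Toy partition assignment. A simplification, not Kafka's real assignor.
class PartitionAssignment {
    // Spreads partition ids 0..numPartitions-1 over the instances round-robin.
    static List<List<Integer>> assign(int numPartitions, int numInstances) {
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) result.add(new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) result.get(p % numInstances).add(p);
        return result;
    }

    // Counts instances that received no partition and therefore do no work.
    static long idleInstances(int numPartitions, int numInstances) {
        return assign(numPartitions, numInstances).stream().filter(List::isEmpty).count();
    }
}
```

With 4 partitions, going from 2 to 4 instances adds parallelism, but a fifth and sixth instance only add idle capacity.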
In that diagram, Kafka Streams is being used as a projection from the event store (the write-model for this application) to a read-model (a view of the data that's more optimized for performing queries).
The write side of the application could well be a service that receives commands and writes to an event store (which could be a DB purposely designed for this like EventStore, or some other datastore being utilized with such patterns as it satisfies the contract for an event store). The broad contract for an event store is that it allows appending an event for some entity and provides a means to retrieve all events for a given entity after some point (often "the beginning of time", though it's also not uncommon to have some snapshot store, in which case that point is derived from the latest snapshot).
Kafka is usable as an event store, especially if there are fairly few entities being event-sourced relative to the number of partitions: otherwise the "retrieve all events for a given entity" operation implies filtering out events for other entities, which at some point becomes prohibitively inefficient.
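The event store contract described above can be sketched as a small Java interface. The names (`EventStore`, `SingleLogEventStore`, `recordsScanned`) are hypothetical, and the implementation deliberately keeps all entities in one shared log, the way many entities would share a Kafka partition, to show where the filtering cost comes from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal contract for an event store.
interface EventStore {
    // Appends an event for an entity and returns its sequence number within that entity.
    long append(String entityId, String event);

    // Returns the entity's events with sequence number >= fromSeq, in order.
    List<String> eventsFor(String entityId, long fromSeq);
}

// All entities share one log, similar to many entities sharing one partition.
class SingleLogEventStore implements EventStore {
    private static final class Entry {
        final String entityId;
        final long seq;
        final String event;
        Entry(String entityId, long seq, String event) {
            this.entityId = entityId;
            this.seq = seq;
            this.event = event;
        }
    }

    private final List<Entry> log = new ArrayList<>();

    // Counts how many log entries reads have visited, to make the cost visible.
    long recordsScanned = 0;

    @Override
    public long append(String entityId, String event) {
        long seq = log.stream().filter(e -> e.entityId.equals(entityId)).count();
        log.add(new Entry(entityId, seq, event));
        return seq;
    }

    @Override
    public List<String> eventsFor(String entityId, long fromSeq) {
        List<String> out = new ArrayList<>();
        for (Entry e : log) {
            recordsScanned++; // entries of other entities are read and thrown away
            if (e.entityId.equals(entityId) && e.seq >= fromSeq) out.add(e.event);
        }
        return out;
    }
}
```

Every call to `eventsFor` walks the whole shared log, so the more entities share it, the more of each read is spent skipping other entities' events. A snapshot store only moves `fromSeq` forward; it does not remove the scan.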
If not using Kafka as the event store but using Kafka Streams as a projection, then you'd likely have one of:
(high-level, e.g. using something like Akka Persistence to manage the event store; disclaimer: I am employed by Lightbend which maintains Akka and provides commercial support and consulting around Akka) a projection from the event store publishing events to a Kafka topic to which Kafka Streams subscribes
(low-level, e.g. a hand-rolled library for treating a regular DB as an event store) change-data-capture (e.g. Debezium for MySQL/Postgres/etc.) publishing updates to the event store tables to a Kafka topic to which Kafka Streams subscribes

Can compacted Kafka topic be used as key-value database?

In many articles, I've read that compacted Kafka topics can be used as a database. However, when looking at the Kafka API, I cannot find methods that allow me to query a topic for a value based on a key.
So, can a compacted Kafka topic be used as a (high performance, read-only) key-value database?
In my architecture I want to feed a component with a compacted topic. And I'm wondering whether that component needs to have a replica of that topic in its local database, or whether it can use that compacted topic as a key value database instead.
Compacted Kafka topics themselves and the basic Consumer/Producer Kafka APIs are not suitable as a key-value database. They are, however, widely used as a backing store to persist KV database/cache data, for instance in a write-through approach. If you need to re-warm your cache for some reason, just replay the entire topic to repopulate it.
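Here is a minimal plain-Java sketch of the "replay the topic to repopulate the cache" idea. It models the topic as an in-memory list of key-value records rather than using a Kafka consumer, and treats a record with a null value as a delete marker (tombstone), as compacted topics do. `CompactedLog`, `rec` and `replay` are made-up names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of rebuilding a cache from a keyed log. Not the Kafka consumer API.
class CompactedLog {
    // Convenience factory; SimpleEntry is used because it accepts a null value.
    static Map.Entry<String, String> rec(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Replays records in log order: the last value per key wins, a null value deletes the key.
    static Map<String, String> replay(List<Map.Entry<String, String>> records) {
        Map<String, String> cache = new LinkedHashMap<>();
        for (Map.Entry<String, String> r : records) {
            if (r.getValue() == null) {
                cache.remove(r.getKey());
            } else {
                cache.put(r.getKey(), r.getValue());
            }
        }
        return cache;
    }
}
```

Compaction only shortens the replay by discarding superseded records in the background; the lookup by key still happens in the cache that the replay builds, not in the topic.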
In the Kafka world you have the Kafka Streams API, which allows you to expose the state of your application (for your KV use case it could be the latest state of an order) by means of queryable state stores. A state store is an abstraction of a KV database and is by default implemented using a fast KV database called RocksDB. In case of disaster it is fully recoverable because its full data is persisted in a Kafka topic, so it's resilient enough to be a source of the data for your use case.
Imagine that this is your Kafka Streams Application architecture:
To be able to query these Kafka Streams state stores you need to bundle an HTTP Server and REST API in your Kafka Streams applications to query its local or remote state store (Kafka distributes/shards data across multiple partitions in a topic to enable parallel processing and high availability, and so does Kafka Streams). Because Kafka Streams API provides the metadata for you to know in which instance the key resides, you can surely query any instance and, if the key exists, a response can be returned regardless of the instance where the key lives.
With this approach, you can kill two birds with one stone:
Do stateful stream processing at scale with Kafka Streams
Expose its state to external clients in a KV Database query pattern style
All in a real-time, highly performant, distributed and resilient architecture.
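The routing described above (any instance can answer, because metadata tells it which instance holds the key) can be sketched in plain Java. This is a toy model with hypothetical names (`StoreCluster`, `Instance`, `ownerOf`); in a real application the owner lookup would come from the Kafka Streams metadata API (in recent versions, `KafkaStreams#queryMetadataForKey`) and the forwarding would be an HTTP call between instances.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a sharded, queryable state store. Not the Kafka Streams API.
class StoreCluster {
    static final class Instance {
        final String hostPort;
        final Map<String, String> localStore = new HashMap<>();
        Instance(String hostPort) { this.hostPort = hostPort; }
    }

    private final List<Instance> instances = new ArrayList<>();

    // Number of queries that arrived at an instance that did not own the key.
    int forwardedQueries = 0;

    StoreCluster(String... hostPorts) {
        for (String hostPort : hostPorts) instances.add(new Instance(hostPort));
    }

    // Stand-in for the metadata lookup: which instance owns this key.
    Instance ownerOf(String key) {
        return instances.get(Math.floorMod(key.hashCode(), instances.size()));
    }

    // Writes land only in the owning instance's local store.
    void put(String key, String value) {
        ownerOf(key).localStore.put(key, value);
    }

    // A query may arrive at any instance: answer locally or forward to the owner.
    String query(int receivingInstance, String key) {
        Instance receiver = instances.get(receivingInstance);
        Instance owner = ownerOf(key);
        if (owner != receiver) {
            forwardedQueries++; // in a real app: a remote call to owner.hostPort
        }
        return owner.localStore.get(key);
    }
}
```

The client does not need to know the sharding: it can call any instance and gets the same answer, at the price of one extra hop when the key lives elsewhere.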
The images were sourced from a wider article by Robert Schmid where you can find additional details and a prototype to implement queryable state stores with Kafka Streams.
Notable mention:
If you are not in the mood to implement all of this using the Kafka Streams API, take a look at ksqlDB from Confluent which provides an even higher level abstraction on top of Kafka Streams just using a cool and simple SQL dialect to achieve the same sort of use case using pull queries. If you want to prototype something really quickly, take a look at this answer by Robin Moffatt or even this blog post to get a grip on its simplicity.
While ksqlDB is not part of the Apache Kafka project, it's open-source, free and is built on top of the Kafka Streams API.

Kafka Streams - Best way to do lookups in remote store via interactive queries?

I have a bit of confusion and I would like some clarification. I have something I'm working on. I want to have one Kafka Streams topology that will have five separate KStreams reading from their own respective topic and dumping that data into a large monolithic topic. Next I'll have a GlobalKTable that will read from that monolithic topic and materialize a global store let's say called lookupStore. I want to have this materialized global store as basically a "lookup" table for other Kafka Streams applications. I've done some reading on exposing this with an RPC layer with the application.server configuration which will be in the form of some unique host:port.
Now I want to have however many separate microservices, each a Kafka Streams application, that will process events from a KStream and then do a lookup on lookupStore via an interactive query. For instance, a .filter() operation based on whether the lookup on that lookupStore returned a value or not. So here's my confusion... let's assume I hardcode that exposed RPC layer on host:port; how do I query lookupStore specifically? If this was in the same topology/local instance you could just do something like lookupStore.get("key")... but how do you do this within a remote Kafka Streams instance?
Or does connecting to that RPC layer expose that state store to the remote application so that it "knows" of it and you can query the lookupStore as if it were a local instance? Is this feasible or am I going down the wrong path?
If your microservices (which are streams applications) share the same Kafka cluster as the main streaming app (that generates the GlobalKTable), then they can access the topic backing the table and do a KTable join or lookupStore.get("key") locally. Also, it is not recommended to do remote API calls within a stream application to do lookups, because of latency. If the two Kafka clusters are different, then you could explore replicating the topics (GlobalKTable and state store changelog topics) using something like MirrorMaker.
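To show what a local lookup buys you, here is a plain-Java sketch of the filter-by-lookup step from the question. The `lookupStore` is modeled as an in-memory map that each service has built from the shared topic; `LookupJoin` and `enrich` are hypothetical names, and this is not the Kafka Streams join API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy stream-table lookup. Not the Kafka Streams API.
class LookupJoin {
    // For each event key, looks up the local table; events with no match are dropped
    // (the filter from the question), matches are emitted together with the looked-up value.
    static List<String> enrich(List<String> eventKeys, Map<String, String> lookupStore) {
        List<String> out = new ArrayList<>();
        for (String key : eventKeys) {
            String value = lookupStore.get(key); // in-process lookup, no network call per event
            if (value != null) out.add(key + " -> " + value);
        }
        return out;
    }
}
```

Because the table is local to every service instance, the per-event cost is a map lookup rather than a remote call, which is the latency argument made in the answer.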

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams. More specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others with experience using these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual it depends on the use case when to use KafkaStreams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, KafkaStreams is built on top of KafkaProducers/Consumers, so everything that is possible with KafkaStreams is also possible with plain Consumers/Producers.
I would say the KafkaStreams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want to have even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with ease. So, let's assume (a contrived example) you have a topic containing customer orders and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to(), or KTable.toStream().to() in current versions), and finally, using Kafka Connect, the messages will be stored into the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API, but it would take much more coding.
In a data processing pipeline, you can do the consume-process-produce cycle in the same transaction. So, in the above example, Kafka Streams can ensure exactly-once semantics from the source topic to the filtered output topic; extending that guarantee all the way to the DB and Elasticsearch also depends on the sink connectors writing idempotently. Within Kafka, no duplicate messages are introduced by network glitches and retries. This feature is especially useful when you are doing aggregates such as the count of orders at the level of an individual product. In such scenarios duplicates will always give you a wrong result.
You can also enrich your incoming data with very low latency. Let's assume in the above example, you want to enrich the order data with the customer email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it in the streaming application using a KTable or GlobalKTable. Now all you need to do is a simple local lookup in the KTable for the customer email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams, and as the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events come within a one-hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
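The windowed join from the last scenario can be sketched in plain Java to show what the Streams DSL takes off your hands. This is a batch-style toy over two in-memory lists with hypothetical names (`WindowedJoin`, `Event`, `paidOrders`); a real stream join works incrementally and keeps the window state in a state store, which the sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy time-windowed join of orders and payments. Not the Kafka Streams API.
class WindowedJoin {
    static final class Event {
        final String orderId;
        final long timestampMs;
        Event(String orderId, long timestampMs) {
            this.orderId = orderId;
            this.timestampMs = timestampMs;
        }
    }

    static final long ONE_HOUR_MS = 60 * 60 * 1000L;

    // Returns the ids of orders that have a payment for the same order id whose timestamp
    // lies within windowMs of the order's timestamp, whether it came before or after.
    static List<String> paidOrders(List<Event> orders, List<Event> payments, long windowMs) {
        List<String> out = new ArrayList<>();
        for (Event order : orders) {
            for (Event payment : payments) {
                boolean sameOrder = order.orderId.equals(payment.orderId);
                boolean inWindow = Math.abs(order.timestampMs - payment.timestampMs) <= windowMs;
                if (sameOrder && inWindow) {
                    out.add(order.orderId);
                    break; // one matching payment is enough
                }
            }
        }
        return out;
    }
}
```

With plain consumers you would also have to buffer both sides for an hour, expire old entries and survive restarts yourself; that bookkeeping is the state the answer says Kafka Streams stores for you.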

Using Apache Kafka to maintain data integrity across databases in microservices architecture

Has anyone used Apache Kafka to maintain data integrity across a microservice architecture where each service has its own database? I have been searching around and there were some posts that mentioned using Kafka, but I'm looking for more details, such as how Kafka was used. Do you have to write code for the producer and consumer (say, the Customer database as producer and the Orders database as consumer, so that if a Customer is deleted in the Customer database, the Orders database somehow needs to know so it can delete all Orders for that Customer as well)?
Yes, you'll need to write that processing code.
For example, one database would be connected to a CDC reader to emit all changes to a stream (the producer), which could be fed into a KTable or custom consumer to write upserts/deletes into a local cache of another service. I say it ought to be a cache rather than a database because when the service restarts, you potentially miss some events, or duplicate others, so the source of the materialized view should ideally be Kafka itself (via a compacted topic).
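As a rough sketch of the processing code in question, here is the consumer side for the Customer/Orders example in plain Java. The class and method names (`OrdersByCustomerCache`, `onOrderUpserted`, `onCustomerDeleted`) are hypothetical, and the change events are passed in as method calls instead of being read from a topic. The handlers are written to be idempotent, since the answer notes that events may be duplicated after a restart.

```java
import java.util.HashMap;
import java.util.Map;

// Toy local cache in the Orders service, fed by change events. Not a Kafka API.
class OrdersByCustomerCache {
    private final Map<String, String> customerIdByOrderId = new HashMap<>();

    // Upsert: applying the same event twice leaves the cache unchanged.
    void onOrderUpserted(String orderId, String customerId) {
        customerIdByOrderId.put(orderId, customerId);
    }

    // Customer deleted upstream: drop all of that customer's orders.
    // Returns how many orders were removed; a duplicate event removes none.
    int onCustomerDeleted(String customerId) {
        int before = customerIdByOrderId.size();
        customerIdByOrderId.values().removeIf(customerId::equals);
        return before - customerIdByOrderId.size();
    }

    int size() {
        return customerIdByOrderId.size();
    }
}
```

Idempotent handlers cover duplicated events; missed events are a different problem, which is why the answer suggests rebuilding the cache from Kafka itself rather than treating it as the system of record.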