How can Kafka Streams be used for Event sourcing? - apache-kafka

I read about how event sourcing can be achieved by using Apache Kafka as the event broker. (Link to the confluent article)
If we take a look at this picture, it shows how an event is written into Kafka, and then Kafka Streams is used to create views in the database.
My question here is: how can we use Kafka Streams for this? If I'm correct, it is a client library, so we need something that runs it, like a microservice called "Aggregate Service".
Is this the right approach to implement such design? Would it scale well?

Kafka Streams must first consume events from Kafka that have been "sourced" by some other process using a plain Kafka producer library.
Kafka Streams applications can only scale up to the number of partitions in their source topics, as they're built on the base Kafka consumer API.
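For illustration, a minimal sketch (topic name, key, and event payload are all hypothetical) of the "sourcing" side: a plain producer appending domain events that a Kafka Streams application would later consume.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by entity id so all events for one entity land in the same partition
            // (and therefore stay ordered). The topic name "order-events" is made up.
            producer.send(new ProducerRecord<>("order-events", "order-42",
                    "{\"type\":\"OrderPlaced\",\"orderId\":\"order-42\"}"));
            producer.flush();
        }
    }
}
```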

In that diagram, Kafka Streams is being used as a projection from the event store (the write-model for this application) to a read-model (a view of the data that's more optimized for performing queries).
The write side of the application could well be a service that receives commands and writes to an event store (which could be a DB purpose-built for this, like EventStore, or some other datastore used in a way that satisfies the contract for an event store). The broad contract for an event store is that it allows appending an event for some entity and provides a means to retrieve all events for a given entity after some point (often "the beginning of time", though it's also not uncommon to have a snapshot store, in which case that point is derived from the latest snapshot).
Kafka is usable as an event store, especially if there are fairly few entities being event-sourced relative to the number of partitions: otherwise the "retrieve all events for a given entity" operation implies filtering out events for other entities, which at some point becomes prohibitively inefficient.
If not using Kafka as the event store but using Kafka Streams as a projection, then you'd likely have one of:
(high-level, e.g. using something like Akka Persistence to manage the event store; disclaimer: I am employed by Lightbend which maintains Akka and provides commercial support and consulting around Akka) a projection from the event store publishing events to a Kafka topic to which Kafka Streams subscribes
(low-level, e.g. a hand-rolled library for treating a regular DB as an event store) change-data-capture (e.g. Debezium for MySQL/Postgres/etc.) publishing updates to the event store tables to a Kafka topic to which Kafka Streams subscribes
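In either case, the projection itself can be quite small. As a rough sketch (not a definitive implementation; topic names, serdes, and the aggregation logic are assumptions), a Kafka Streams projection could fold the published events into a read-model keyed by entity id, which Kafka Connect or downstream consumers can then push into a query-optimized database:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ProjectionApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // "order-events" is the topic fed by the event store / CDC pipeline (hypothetical name).
        KTable<String, String> readModel = builder
                .stream("order-events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                // Fold each entity's events into its current view; real code would
                // deserialize the event and apply it to a typed state object.
                .aggregate(() -> "", (entityId, event, view) -> view + event,
                        Materialized.with(Serdes.String(), Serdes.String()));

        // Publish the view to a topic that e.g. a Kafka Connect JDBC sink writes to the read DB.
        readModel.toStream().to("order-read-model", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-projection");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```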

Related

Can compacted Kafka topic be used as key-value database?

In many articles, I've read that compacted Kafka topics can be used as a database. However, when looking at the Kafka API, I cannot find methods that allow me to query a topic for a value based on a key.
So, can a compacted Kafka topic be used as a (high performance, read-only) key-value database?
In my architecture I want to feed a component with a compacted topic. And I'm wondering whether that component needs to have a replica of that topic in its local database, or whether it can use that compacted topic as a key value database instead.
Compacted Kafka topics themselves and the basic Consumer/Producer Kafka APIs are not suitable as a key-value database. They are, however, widely used as a backing store to persist KV database/cache data, e.g. in a write-through approach. If you need to re-warm your cache for some reason, just replay the entire topic to repopulate it.
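A minimal sketch of that warm-up path, assuming string keys/values and a topic name of my own choosing: read the compacted topic from the beginning into a local map, keeping only the latest value per key and dropping tombstones.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CacheWarmup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cache-warmup");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Map<String, String> cache = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-cache-topic")); // hypothetical compacted topic
            // Poll a few times to catch up; a real implementation would compare
            // the consumer's positions to the end offsets instead.
            for (int i = 0; i < 10; i++) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) cache.remove(record.key()); // tombstone
                    else cache.put(record.key(), record.value());
                }
            }
        }
        System.out.println("Warmed cache with " + cache.size() + " keys");
    }
}
```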
In the Kafka world you have the Kafka Streams API, which allows you to expose the state of your application (for your KV use case, it could be the latest state of an order) by means of queryable state stores. A state store is an abstraction over a KV database; state stores are actually implemented using a fast embedded KV database called RocksDB and, in case of disaster, are fully recoverable because their full data is persisted in a Kafka changelog topic. That makes them resilient enough to be a source of the data for your use case.
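For instance (topic and store names are mine, not from the question), a compacted topic can be materialized as a queryable, RocksDB-backed state store roughly like this:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class OrderStoreApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Materialize the compacted "orders" topic (name made up) as a local state store
        // named "orders-store"; Kafka Streams maintains a changelog for it, so the store
        // can be rebuilt after a failure.
        builder.table("orders",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("orders-store"));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-kv-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```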
Imagine that this is your Kafka Streams Application architecture:
To be able to query these Kafka Streams state stores, you need to bundle an HTTP server and REST API into your Kafka Streams applications to query their local or remote state stores (Kafka distributes/shards data across multiple partitions in a topic to enable parallel processing and high availability, and so does Kafka Streams). Because the Kafka Streams API provides the metadata to know on which instance a key resides, you can query any instance and, if the key exists, a response can be returned regardless of which instance the key lives on.
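Continuing the sketch above (the store name "orders-store" is my own), the interactive-query side looks roughly like this; the HTTP layer itself is left to you and is only hinted at in the comments:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrderQueryService {
    private final KafkaStreams streams; // the running instance from the previous sketch

    public OrderQueryService(KafkaStreams streams) {
        this.streams = streams;
    }

    public String lookup(String orderId) {
        // Which application instance hosts this key's partition?
        KeyQueryMetadata metadata =
                streams.queryMetadataForKey("orders-store", orderId, Serdes.String().serializer());
        System.out.println("Key " + orderId + " is hosted by " + metadata.activeHost());

        // If the key lives on this instance, read the local store; otherwise a real
        // service would forward the request (e.g. over HTTP) to metadata.activeHost().
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("orders-store",
                        QueryableStoreTypes.keyValueStore()));
        return store.get(orderId);
    }
}
```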
With this approach, you can kill two birds with one stone:
Do stateful stream processing at scale with Kafka Streams
Expose its state to external clients in a KV Database query pattern style
All in a real-time, highly performant, distributed and resilient architecture.
The images were sourced from a wider article by Robert Schmid where you can find additional details and a prototype to implement queryable state stores with Kafka Streams.
Notable mention:
If you are not in the mood to implement all of this using the Kafka Streams API, take a look at ksqlDB from Confluent, which provides an even higher-level abstraction on top of Kafka Streams, using a cool and simple SQL dialect to achieve the same sort of use case with pull queries. If you want to prototype something really quickly, take a look at this answer by Robin Moffatt or even this blog post to get a grip on its simplicity.
While ksqlDB is not part of the Apache Kafka project, it's open source, free, and built on top of the Kafka Streams API.

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams. More specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case whether to use the Kafka Streams API or the plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, Kafka Streams is built on top of KafkaProducers/Consumers, so everything that is possible with Kafka Streams is also possible with plain Consumers/Producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all of the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
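To make that concrete, here is a hedged sketch of the plain-client version of a trivial "read, transform, write" loop; the topic names and the uppercase transformation are placeholders. Everything below besides the one business-logic line (polling, forwarding, flushing, offset handling) is what the Streams DSL hides from you.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class PlainClientPipeline {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "uppercase-pipeline");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("input-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // The "business logic" is one line; everything else is plumbing.
                    producer.send(new ProducerRecord<>("output-topic",
                            record.key(), record.value().toUpperCase()));
                }
                producer.flush();
            }
        }
    }
}
```

For comparison, the same pipeline in the Streams DSL is roughly builder.stream("input-topic").mapValues(v -> v.toUpperCase()).to("output-topic") plus the application config; polling, producing and committing are handled by the library.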
Here are some of the scenarios where you might prefer Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.toStream().to()), and finally, using Kafka Connect, the messages will be stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API too, but it would be much more coding.
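A hedged sketch of that filter step (topic names and the city check are made up; a real application would use a proper order serde instead of naive string matching):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class CityFilterApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("customer-orders", Consumed.with(Serdes.String(), Serdes.String()))
               // Keep only orders delivered to Berlin; a real app would deserialize the
               // order payload rather than string-match the JSON.
               .filter((orderId, orderJson) -> orderJson.contains("\"deliveryCity\":\"Berlin\""))
               .to("berlin-orders", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "city-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
        // Kafka Connect JDBC and Elasticsearch sink connectors would then read "berlin-orders".
    }
}
```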
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactions from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregations such as the count of orders per individual product; in such scenarios, duplicates will always give you a wrong result.
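Enabling this in Kafka Streams is a configuration change rather than new code. A minimal sketch (the per-product count is illustrative; the constant name below is the one used in Kafka 3.0+):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;

public class OrderCountApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Count orders per product; with exactly-once enabled, retries and rebalances
        // will not double-count. For brevity, assume the record value is the product id;
        // a real app would extract it from the order payload.
        builder.stream("customer-orders", Consumed.with(Serdes.String(), Serdes.String()))
               .groupBy((orderId, productId) -> productId,
                        Grouped.with(Serdes.String(), Serdes.String()))
               .count();

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Turns on transactional consume-process-produce.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        new KafkaStreams(builder.build(), props).start();
    }
}
```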
You can also enrich your incoming data with very low latency. Let's assume, in the above example, you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now all you need to do is a simple local lookup in the KTable for the customer email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams, and also, as the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
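A rough sketch of that enrichment with a GlobalKTable (topic names and the joined value format are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class OrderEnrichmentApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Orders keyed by customer id; customers come from a compacted topic.
        KStream<String, String> orders =
                builder.stream("customer-orders", Consumed.with(Serdes.String(), Serdes.String()));
        GlobalKTable<String, String> customers =
                builder.globalTable("customers", Consumed.with(Serdes.String(), Serdes.String()));

        orders.join(customers,
                    (customerId, order) -> customerId,            // which table key to look up
                    (order, customer) -> order + " | " + customer) // local RocksDB lookup, no network call
              .to("enriched-orders", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```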
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events come within a one-hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
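A hedged sketch of that windowed join (topic names and the joined value are illustrative; the JoinWindows factory used below is the Kafka 3.0+ one):

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class OrderPaymentJoinApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> payments =
                builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()));

        // Join on the order id key; events arriving within one hour of each other match.
        // The join state lives in local RocksDB stores backed by changelog topics.
        orders.join(payments,
                    (order, payment) -> order + " | " + payment,
                    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)),
                    StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
              .to("paid-orders", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-payment-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```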

Is Kafka a message queue and can Kafka be used as the database?

Some places mention that Kafka is publish-subscribe messaging. Other sources mention that Kafka is a message queue. May I ask what the differences between those are, and can Kafka be used as a database?
There are two patterns named Publish-Subscribe and Message Queue. There are some places that discuss the differences, e.g. here.
Kafka supports both of these patterns. For the publish-subscribe pattern, Kafka has publishers and subscribers: a publisher sends messages to a topic, and any subscriber can subscribe to and receive messages from that topic. For the queueing pattern, Kafka has a concept named Consumer Group: within the same consumer group, all consumers share the work, hence balancing the workload.
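The difference comes down to the group.id: consumers sharing a group id split the partitions between them (queueing), while consumers with different group ids each receive every message (publish-subscribe). A minimal, hedged sketch (topic name and group names are made up):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Run several copies with group.id = "workers" and they share the partitions (queue).
        // Give each copy its own group.id and every copy sees all messages (pub-sub).
        props.put(ConsumerConfig.GROUP_ID_CONFIG, args.length > 0 ? args[0] : "workers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d key=%s value=%s%n",
                            record.partition(), record.key(), record.value());
                }
            }
        }
    }
}
```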
Because of this flexible design, Kafka is broadly used to implement many software patterns when designing a system.
Personally, I would not call Kafka itself a database, but you can use Kafka as storage, especially through mechanisms such as log compaction. Ref1 Ref2
Kafka is at its base a storage system, like a database but without indexes, where every query is a full scan of your data. Kafka stores data in files that cannot be modified. For example, if you use event sourcing, you can save all events of your system in Kafka and reprocess all of them when your system has a bug.
Imagine that Kafka can split a very large file (10 TB or more) across multiple servers and provide a way to read that file in a distributed manner using partitions (the more partitions you have, the more applications can read in parallel).
Because it is a storage system, Kafka can also be used as a message queue or as a publish-subscribe system.

Producer-consumer processing pattern for Kafka processing

I'm implementing a streaming pipeline that resembles the illustration below:
*K-topic1* ---> processor1 ---> *K-topic2* ---> processor2 -->
*K-topic3* ---> processor3 --> *K-topic4*
The K-topic components represent Kafka topics, and the processor components represent code (Python/Java).
For the processor component, the intention is to read/consume data from the topic, perform some processing/ETL on it, and persist the results to the next topic in the chain as well as persistent store such as S3.
I have a question regarding the design approach.
The way I see it, each processor component should encapsulate both consumer and producer functionality.
Would the best approach be to have a Processor module/class that contains both KafkaConsumer and KafkaProducer instances? To date, most examples I've seen have separate consumer and producer components which are run separately and would entail running double the number of components, as opposed to encapsulating producers and consumers within each Processor object.
Any suggestions/references are welcome.
This question is different from
Designing a component both producer and consumer in Kafka
as that question specifically mentions using Samza which is not the case here.
the intention is to read/consume data from the topic, perform some processing/ETL on it, and persist the results to the next topic in the chain
This is exactly the strength of Kafka Streams and/or KSQL. You could use the Processor API, but from what you describe, I think you'll only need the Streams DSL API.
persist the results to the next topic in the chain as well as persistent store such as S3.
From the above topic, you can use a Kafka Connect Sink for getting the topic data into these other external systems. There is no need to write a consumer to do this for you.
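A hedged sketch of one such stage with the Streams DSL (the application id and the transformation are placeholders); the S3 part would be a Kafka Connect S3 sink reading K-topic2 rather than code in the processor:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class Processor1 {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Consume K-topic1, apply the ETL step, publish to K-topic2.
        // Consuming, producing, commits, and rebalancing are handled by the library.
        builder.stream("K-topic1", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(value -> value.trim().toLowerCase()) // placeholder for the real ETL logic
               .to("K-topic2", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "processor1");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```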

Kafka Consumer API vs Streams API for event filtering

Should I use the Kafka Consumer API or the Kafka Streams API for this use case? I have a topic with a number of consumer groups consuming off it. This topic contains one kind of record: a JSON message with a type field buried inside it. Some messages will be consumed by some consumer groups and not by others; one consumer group will probably not be consuming many messages at all.
My question is:
Should I use the Consumer API, then on each event read the type field and drop or process the event based on that type field?
Or should I filter using the Streams API's filter method and a predicate?
After I consume an event, the plan is to process that event (DB delete, update, or other, depending on the service); then, if there is a failure, I will produce to a separate queue which I will reprocess later.
Thank you.
This seems more a matter of opinion. I personally would go with Streams/KSQL: likely less code that you would have to maintain. You can have another intermediary topic that contains the cleaned-up data, to which you can then attach a Connect sink, other consumers, or other Streams and KSQL processes. Using Streams, you can scale a single application across different machines, store state, have standby replicas, and more, which would be a pain to do all yourself.
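For the type-field routing described in the question, a hedged sketch using the Streams DSL's split/branch operators (Kafka 2.8+); topic names and the type checks are assumptions, and a real application would parse the JSON with a proper serde rather than string-matching:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.Consumed;

public class TypeRouterApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Route each event type to its own cleaned-up topic; the interested services
        // (or further Streams/KSQL stages) consume those topics, and failures can still
        // be produced to a separate retry topic as planned in the question.
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .split()
               .branch((key, json) -> json.contains("\"type\":\"ORDER_CREATED\""),
                       Branched.withConsumer(ks -> ks.to("order-created-events")))
               .branch((key, json) -> json.contains("\"type\":\"ORDER_DELETED\""),
                       Branched.withConsumer(ks -> ks.to("order-deleted-events")))
               .noDefaultBranch(); // everything else is dropped by this application

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "type-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```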