KTable initialization and persistence - apache-kafka

This is more of an architectural question. I'm learning about Event-Driven Architecture and Streaming Systems with Apache Kafka. I've learned about Event Sourcing and CQRS and have some basic questions regarding implementation.
For example, consider a streaming application where we are monitoring vehicular events of drivers registered in our system. These events will be coming in as a KStream. The drivers registered in the system will be in a KTable, and we need to join the events and drivers to derive some output.
Assume that a microservice inserts a new driver into the system by writing the data to a Cassandra table, from which it reaches the KTable topic via Change Data Capture.
Since Kafka topics have a TTL associated with them, how do we make sure that the driver records are not dropped?
I understand that Kafka has a persistent state store that can maintain the required state, but can I depend on it like a Cassandra table? Is there a size consideration?
If the whole application, all Kafka brokers and all consumer nodes are terminated, can the application be restarted without loss of driver records in the KTable?
If the streaming application is Kubernetes based, how would I maintain the persistent disk volumes of each container and correctly attach them as containers come and go?
Would it be preferable to join the event stream with the driver table in Cassandra using Spark Streaming or Flink? Can Spark and Flink still maintain data locality, given that their streaming consumers are distributed by Kafka partition while the Cassandra data is distributed by some other scheme I am not familiar with?
EDIT: I realized that Spark and Flink would pull data from Cassandra on their respective nodes depending on which keys they hold. Kafka Streams has the advantage that the stream and the KTable being joined are already data-local.

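KTables don't have a TTL of their own. They are meant to be built from compacted topics (cleanup.policy=compact), where the latest record for each key is retained indefinitely instead of being removed by time-based retention.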
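A toy model of log compaction may make this concrete. The sketch below is plain Python, not the Kafka API: the `compact` function and the sample driver records are made up for illustration, and real compaction runs in the background on closed log segments and keeps tombstones around for a configurable time before dropping them.

```python
from typing import List, Optional, Tuple

# (key, value); a value of None stands in for a tombstone (delete marker).
Record = Tuple[str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the most recent record per key, preserving log order."""
    # Later occurrences overwrite earlier ones, so this maps key -> last index.
    latest_index = {key: i for i, (key, _) in enumerate(log)}
    kept = [rec for i, rec in enumerate(log) if latest_index[rec[0]] == i]
    # A key whose latest record is a tombstone disappears entirely.
    return [(key, value) for key, value in kept if value is not None]


driver_log = [
    ("driver-1", "Alice, license A"),
    ("driver-2", "Bob, license B"),
    ("driver-1", "Alice, license C"),  # update supersedes the first record
    ("driver-3", "Carol, license D"),
    ("driver-3", None),                # tombstone: driver-3 was deleted
]

compacted = compact(driver_log)
print(compacted)
# [('driver-2', 'Bob, license B'), ('driver-1', 'Alice, license C')]
```

However old the record for driver-2 is, it survives because nothing newer exists for that key. That is the property a driver table relies on.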
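Yes, you need to preserve the state directories used by persistent Kafka Streams state stores. Because those stores live on disk, no records should be lost across broker or application restarts unless you actively clear the state directories on the hosts running the application instances. If a state directory is lost anyway, Kafka Streams rebuilds the store from its backing topic, at the cost of a slower restart.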
Spark and Flink do not integrate with Kafka Streams state stores, and they have their own locality considerations. I believe Flink offers RocksDB-backed state, and both can broadcast data for remote joins. Otherwise, joining on Kafka record keys requires that both topics have the same number of partitions, so that matching partitions are assigned to the same instances/executors, similar to Kafka Streams joins.
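To see why matching partition counts matter, here is a small sketch. It assumes keys are assigned by hashing modulo the partition count; `partition_for` is a hypothetical helper that uses CRC32 as a stand-in (Kafka's default partitioner uses murmur2), and the partition counts 6 and 4 are arbitrary.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; not Kafka's actual partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"driver-{i}" for i in range(100)]

# Both topics have 6 partitions: a key gets the same partition number in each,
# so one instance sees both the events and the driver record for that key.
events_p = {k: partition_for(k, 6) for k in keys}
drivers_p = {k: partition_for(k, 6) for k in keys}
co_partitioned = all(events_p[k] == drivers_p[k] for k in keys)

# The driver topic has 4 partitions instead: many keys now land on different
# partition numbers, so a purely local join would miss them.
drivers_p4 = {k: partition_for(k, 4) for k in keys}
mismatched = [k for k in keys if events_p[k] != drivers_p4[k]]

print(co_partitioned)        # True
print(len(mismatched) > 0)   # True
```

The same reasoning applies to any engine that wants to join two Kafka topics without shuffling: same key, same partitioning function, same partition count.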

Related

What happens internally when we run a kSQL query?

I am entirely new to Apache Kafka and kSQL. I have a question that I have not been able to find an answer to.
My current understanding is that events generated by a producer are stored in Kafka topics in serialized form (as bytes). If I create a stream to consume that data and then run a kSQL query that uses, say, the COUNT() function, will the output of that query be persisted in Kafka topics?
If that is the case, won't it incur a storage cost?
Behind the scenes, it runs a Kafka Streams topology.
Any persisted streams or aggregated tables, in your case, indeed occupy storage.
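As a rough illustration of where that storage comes from, the sketch below models a COUNT aggregation as a running table plus the stream of updates it emits. It is plain Python with a made-up `count_by_key` function, not ksqlDB or Kafka Streams code, and it ignores record caching, which can reduce how many intermediate updates are actually written.

```python
from collections import defaultdict


def count_by_key(events):
    """Return the final counts and the list of updates emitted along the way."""
    counts = defaultdict(int)
    changelog = []
    for key in events:
        counts[key] += 1
        # Each input record produces a new (key, count) update record.
        changelog.append((key, counts[key]))
    return dict(counts), changelog


table, changelog = count_by_key(["a", "b", "a", "a"])
print(table)      # {'a': 3, 'b': 1}
print(changelog)  # [('a', 1), ('b', 1), ('a', 2), ('a', 3)]
```

The table has two rows, but four update records were produced. How many of them stay on disk depends on the cleanup policy of the topic they are written to.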

Can compacted Kafka topic be used as key-value database?

In many articles, I've read that compacted Kafka topics can be used as a database. However, when looking at the Kafka API, I cannot find methods that allow me to query a topic for a value based on a key.
So, can a compacted Kafka topic be used as a (high performance, read-only) key-value database?
In my architecture I want to feed a component with a compacted topic. And I'm wondering whether that component needs to have a replica of that topic in its local database, or whether it can use that compacted topic as a key value database instead.
Compacted Kafka topics on their own, together with the basic Consumer/Producer APIs, are not suitable as a key-value database. They are, however, widely used as a backing store to persist the data of a KV database or cache, for instance in a write-through approach. If you need to warm up your cache again for some reason, just replay the entire topic to repopulate it.
In the Kafka world you have the Kafka Streams API, which allows you to expose the state of your application (for your KV use case it could be the latest state of an order) by means of queryable state stores. A state store is an abstraction of a KV database and, by default, is implemented using a fast embedded KV database called RocksDB. In case of disaster it is fully recoverable because its full data is persisted in a Kafka topic, so it is resilient enough to serve as the source of data for your use case.
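The write-through and replay idea can be sketched in a few lines. This is a toy model in plain Python: the `WriteThroughCache` class is hypothetical, a list stands in for the compacted topic, and a value of None is treated as a delete marker.

```python
class WriteThroughCache:
    """In-memory KV cache that appends every write to a log before applying it."""

    def __init__(self, log):
        self.log = log    # stand-in for the compacted topic
        self.data = {}

    def put(self, key, value):
        self.log.append((key, value))  # persist first...
        self.data[key] = value         # ...then update the cache

    def get(self, key):
        return self.data.get(key)

    @classmethod
    def rebuild(cls, log):
        """Re-create the cache contents by replaying the whole log in order."""
        cache = cls(log)
        for key, value in log:
            if value is None:
                cache.data.pop(key, None)  # delete marker removes the key
            else:
                cache.data[key] = value
        return cache


topic = []
cache = WriteThroughCache(topic)
cache.put("order-1", "CREATED")
cache.put("order-1", "SHIPPED")
cache.put("order-2", "CREATED")

# Simulate losing the cache and warming it up again from the topic.
restarted = WriteThroughCache.rebuild(topic)
print(restarted.get("order-1"))  # SHIPPED
print(restarted.get("order-2"))  # CREATED
```

Note that lookups by key go to the cache, not to the topic. The topic is only read sequentially, which is what the Consumer API is designed for.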
Imagine that this is your Kafka Streams Application architecture:
To query these Kafka Streams state stores, you need to bundle an HTTP server and a REST API into your Kafka Streams application so that it can serve lookups against its local state store or forward them to a remote one (Kafka distributes/shards data across multiple partitions in a topic to enable parallel processing and high availability, and so does Kafka Streams). Because the Kafka Streams API provides the metadata that tells you on which instance a key resides, you can query any instance and, if the key exists, get a response regardless of the instance where the key lives.
With this approach, you can kill two birds with one stone:
Do stateful stream processing at scale with Kafka Streams
Expose its state to external clients in a KV Database query pattern style
All in a real-time, highly performant, distributed and resilient architecture.
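The routing described above can be modelled roughly as follows. This is a simplified sketch in plain Python, not Kafka Streams code: the `Instance` class, the two-instance cluster and the CRC32-based `partition_for` helper are all assumptions for illustration. In a real application the owner lookup would come from the Kafka Streams metadata API (for example `KafkaStreams#queryMetadataForKey`) and the forwarding step would be an HTTP call.

```python
import zlib

NUM_PARTITIONS = 4


def partition_for(key: str) -> int:
    # Stand-in hash; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class Instance:
    """One application instance holding the state for some partitions."""

    def __init__(self, name, partitions, cluster):
        self.name = name
        self.partitions = set(partitions)
        self.store = {}          # local state store for the owned partitions
        self.cluster = cluster   # shared list of all instances (the "metadata")
        cluster.append(self)

    def owner_of(self, key):
        p = partition_for(key)
        return next(i for i in self.cluster if p in i.partitions)

    def query(self, key):
        owner = self.owner_of(key)
        if owner is self:
            return self.store.get(key)  # local lookup, no network hop
        return owner.query(key)         # would be a remote HTTP request


cluster = []
a = Instance("a", [0, 1], cluster)
b = Instance("b", [2, 3], cluster)

orders = {"order-1": "SHIPPED", "order-2": "CREATED", "order-3": "PAID"}
for key, value in orders.items():
    a.owner_of(key).store[key] = value  # each key lives on exactly one instance

# Either instance answers every key, wherever the key actually lives.
print(all(a.query(k) == v and b.query(k) == v for k, v in orders.items()))  # True
```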
The images were sourced from a wider article by Robert Schmid where you can find additional details and a prototype to implement queryable state stores with Kafka Streams.
Notable mention:
If you are not in the mood to implement all of this using the Kafka Streams API, take a look at ksqlDB from Confluent which provides an even higher level abstraction on top of Kafka Streams just using a cool and simple SQL dialect to achieve the same sort of use case using pull queries. If you want to prototype something really quickly, take a look at this answer by Robin Moffatt or even this blog post to get a grip on its simplicity.
While ksqlDB is not part of the Apache Kafka project, it's open-source, free and is built on top of the Kafka Streams API.

How does Kinesis achieve Kafka style Consumer Groups?

In Kafka, I can split my topic into many partitions. I cannot have more consumers than partitions in Kafka, because the partition is used as a way to scale out a topic. If I have more load, I can increase the number of partitions, which will allow me to increase the number of consumers, which will allow me to have more threads / processes processing on a given topic.
In Kafka, there is a concept of a Consumer Group. If we have 10 consumer groups on a single topic, each consumer group will have the opportunity to process every message in the topic. The consumer group still takes advantage of the scalability from the partitions (i.e. each consumer group can have up to 'n' consumers, where 'n' is the number of partitions on a topic). This is the beauty of Kafka: scalability and multi-channel reading are two separate concepts with two separate knobs to turn.
In Kinesis, we are told that if you use the Kinesis Client Library you can get the same functionality as consumer groups by defining different Kinesis Applications. In other words, we can have different Kinesis Applications independently streaming all records from the same stream at different times.
We are also told that "Amazon Kinesis Client Library (KCL) automatically creates an Amazon DynamoDB table for each Amazon Kinesis Application to track and maintain state information such as resharding events and sequence number checkpoints."
OK, So I'm getting ready to start reading through the KCL code here, but I'm hoping someone can answer these questions to save me some time.
How does the KCL actually do this?
Are there diagrams somewhere explaining the process?
If I started a new Kinesis Application (MyKinesisApp1) after a record was already produced and consumed by all prior Kinesis Applications, will the new Kinesis Application (MyKinesisApp1) still have an opportunity to consume that record? In other words, does Kinesis remove the record from its stream after it has been processed, or does it leave it there for the 7 days no matter what?
I have seen this question here but it doesn't answer my question, especially my third question! Also, this question makes a direct comparison between two similar technologies. It will help people who know Kafka learn Kinesis more quickly.
In the KCL configuration, there is an "appName" setting, which corresponds to "Application Name" and plays the same role as a "consumer group" in Kafka. For each consumer group (i.e. Kinesis Streams consumer application) there is a DynamoDB table. You can see an example DynamoDB table here (the KCL appName is 'quickstats-development'): AWS Kinesis leaseOwner confusion
Not as far as I know. Kinesis Streams is conceptually similar to Kafka, but I am not aware of diagrams that explain the process.
Yes. Each KCL application (the equivalent of a Kafka consumer group) has its own DynamoDB table, so different Kinesis consumer applications can consume the same record independently. A checkpoint in Kinesis corresponds to an offset in Kafka: the checkpoint stored in DynamoDB is the cursor marking how far the application has read in a Kinesis shard. Reading a record does not remove it from the stream; it stays there until the stream's retention period expires. Read this answer for a similar example: https://stackoverflow.com/a/42833193/1622134
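A minimal model of this behaviour, under some simplifying assumptions: a single shard, list positions instead of sequence numbers, and a plain attribute standing in for the DynamoDB table. The `Stream` and `Application` classes are hypothetical and are not the KCL API. It also assumes the new application is configured to start from the oldest retained record (the TRIM_HORIZON initial position) rather than from the latest one.

```python
class Stream:
    """A single-shard stream; reading never removes records."""

    def __init__(self):
        self.records = []

    def put(self, data):
        self.records.append(data)


class Application:
    """One consumer application (one appName) with its own checkpoint."""

    def __init__(self, name, stream):
        self.name = name
        self.stream = stream
        self.checkpoint = 0  # stand-in for the checkpoint kept in DynamoDB

    def poll(self):
        batch = self.stream.records[self.checkpoint:]
        self.checkpoint = len(self.stream.records)  # advance only this app's cursor
        return batch


stream = Stream()
stream.put("r1")
stream.put("r2")

app0 = Application("MyKinesisApp0", stream)
print(app0.poll())  # ['r1', 'r2']
print(app0.poll())  # []  (app0 has already checkpointed past both records)

# A new application started afterwards has its own cursor and still sees everything.
app1 = Application("MyKinesisApp1", stream)
print(app1.poll())  # ['r1', 'r2']
```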

Why we require Apache Kafka with NoSQL databases?

Apache Kafka is a real-time messaging service. It stores streams of data safely in a distributed and fault-tolerant way. We can filter streaming data as it arrives from producers. I don't understand why we need NoSQL databases such as MongoDB to store the same data that is already in Apache Kafka. The real question is: why do we store the same data in both a NoSQL database and Apache Kafka?
I think that if we need a NoSQL database, we can collect streams of data from clients directly in MongoDB without using Apache Kafka. But most big data architectures prefer to put Apache Kafka between the data source and the NoSQL database.
What are the advantages of that for real systems?
This architecture has several advantages:
Kafka as Data Integration Bus
It helps distribute data between several producers and many consumers easily. Here Apache Kafka serves as a data integration message bus.
Kafka as Data Buffer
Putting Kafka in front of your "end" data stores, such as MongoDB or MySQL, makes it act as a natural data buffer, so you are able to deploy, maintain and redeploy your consumer services independently. While your service is down for maintenance, Kafka keeps storing all incoming data, which is quite useful.
Kafka as a Short Time Data Storage
You don't have to store everything in Kafka: very often you use Kafka topics with time-based retention, which means all data older than a configured age is deleted by Kafka automatically. For example, you may have a Kafka topic with 1 week of retention (so it holds only 1 week of data) while the same data lives on in long-term storage such as a classic SQL database or Cassandra.
Kafka as a Long Time Data Storage
On the other hand, you can use Apache Kafka as a long-term storage system. Compacted topics keep only the latest value for each key, so your topic becomes a store of the latest state of your app.

Join streaming based on key - Spark/Kafka

Suppose Spark provides two streams, and one of them is not 100% in sync with the other; records may arrive at different times. We need to join the streams by key. Is there any way to do this without any persistence?
I don't think it is possible. Kafka Streams ships with built-in support for interpreting the data in a Kafka topic as a continuously updated table. In the Kafka Streams DSL this is achieved via the so-called KTable.
These KTables are backed by state stores in Kafka Streams. These state stores are local to your application (more precisely: they are local to the instances of your application, of which there can be one or many), which means that interacting with these state stores does not require talking over the network, so read and write operations are very fast. If you decide not to persist data, you might start losing information that you need.
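To illustrate why the join needs state at all, here is a toy stream-table join in plain Python. The `join_events` function and the sample records are invented for this sketch, and records are processed strictly in arrival order, whereas Kafka Streams orders its processing by record timestamp.

```python
def join_events(records):
    """Inner-join stream events against the latest table value for the same key.

    Each record is (kind, key, value), where kind is "driver" for a table
    update or "event" for a stream record.
    """
    driver_store = {}  # local state: latest driver record per key
    output = []
    for kind, key, value in records:
        if kind == "driver":
            driver_store[key] = value  # upsert into the table
        else:
            driver = driver_store.get(key)
            if driver is not None:     # no match yet means no output
                output.append((key, value, driver))
    return output


records = [
    ("driver", "d1", "Alice"),
    ("event", "d1", "speeding"),
    ("event", "d2", "hard-brake"),  # arrives before d2's driver record
    ("driver", "d2", "Bob"),
    ("event", "d2", "speeding"),
]

print(join_events(records))
# [('d1', 'speeding', 'Alice'), ('d2', 'speeding', 'Bob')]
```

Without `driver_store` there would be nothing to look up when an event arrives, and the out-of-sync "hard-brake" event shows what is lost when the table side lags behind the stream.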