Why we require Apache Kafka with NoSQL databases? - mongodb

Apache Kafka is an real-time messaging service. It stores streams of data safely in distributed and fault-tolerant. We can filter streaming data when comming producer. I don't understant that why we need NoSQL databases like as MongoDB to store same data in Apache Kafka. The true question is that why we store same data in a NoSQL database and Apache Kafka?
I think if we need a NoSQL database, we can collect streams of data from clients in MongoDB at first without the use of Apache Kafka. But, most of big data architecture preference using Apache Kafka between data source and NoSQL database.(see)
What is the advantages of that for real systems?

This architecture has several advantages:
Kafka as Data Integration Bus
It helps distribute data between several producers and many consumers easily. Here Apache Kafka serves as an "data" integration message bus.
Kafka as Data Buffer
Putting Kafka in front of your "end" data storages like MongoDB or MySQL acts like a natural data buffer. So you are able to deploy/maintain/redeploy your consumer services independently. At the time your service is down for maintanance Kafka is still storing all incoming data, that is quite useful.
Kafka as a Short Time Data Storage
You don't have to store everything in Kafka: very often you use Kafka topics with retention. It means all data older than some value will be deleted by Kafka automatically. So, for example you may have Kafka topic with 1 week retention (so you store 1 week of data only) but at the same time your data lives in long time storage services like classic SQL-DBs or Cassandra etc.
Kafka as a Long Time Data Storage
On the other hand you can use Apache Kafka as a long term storage system. Using compacted topics enables you to store only the last value for each key. So your topic becomes a last state storage of your app.

Related

KTable initialization and persistence

This is more of an architectural question. I'm learning about Event-Driven Architecture and Streaming Systems with Apache Kafka. I've learned about Event Sourcing and CQRS and have some basic questions regarding implementation.
For example, consider a streaming application where we are monitoring vehicular events of drivers registered in our system. These events will be coming in as a KStream. The drivers registered in the system will be in a KTable, and we need to join the events and drivers to derive some output.
Assume that we insert a new driver in the system by a microservice, which pushes the data in a Cassandra table and then to the KTable topic by Change Data Capture.
since Kafka topics have a TTL associated with them, how do we make sure that the driver records are not dropped?
I understand that Kafka has a persistent state store that can maintain the required state, but can I depend on it like a Cassandra table? Is there a size consideration?
If the whole application, and all kafka brokers and consumer nodes are terminated, can the application be restarted without loss of driver records in the KTable?
If the streaming application is Kubernetes based, how would I maintain the persistent disk volumes of each container and correctly attach them as containers come and go?
Would it be preferable to join the event stream with the driver table in Cassandra using Spark Streaming or Flink? Can Spark and Flink still maintain data locality as their streaming consumers will be distributed by Kafka partitions, and the Cassandra data by I don't know what?
EDIT: - I realized Spark and Flink would be pulling data from Cassandra on the respective nodes depending on what keys they have. Kafka Streaming has the advantage that the Stream and KTable to join will already be data local.
KTables don't have a TTL since they are built from compacted topics (infinite retention).
Yes, you need to maintain storage directories for persistent Kafka StateStores. Since those stores would be on-disk, no records should be dropped from them upon broker/app restarts until you actively clear the state directories from the app instance hosts.
Spark/Flink do not integrate with Kafka Streams stores, and have their own locality considerations. I do believe Flink offers RocksDB state, and both broadcast data for remote-joins, otherwise, joining Kafka record keys requires both topics have matching partition counts - this way partitions are assigned to the same instances/executors, similar to Kafka Streams joins.

RabbitMq and KStreams for Data Aggregation

I'm trying to solve the problem of data denormalization before indexing to the Elasticsearch. Right now, my Postgres 11 database is configured with pgoutput plugin and Debezium with Postgresql Connector is streaming the log changes to RabbitMq which are then aggregated by doing a reverse lookup on the db and feeding to the Elasticsearch.
Although, this works okay, the lookup at the App layer to aggregate the data is expensive and taking a lot of execution time (the query is already refined but it has about 10 joins making it sloppy).
The other alternative I explored was to use KStreams for data aggregation. My knowledge on Apache Kafka is minimal and thus I'm here. My question here is it a requirement to have Apache Kafka as the broker to be able to utilize the Java KStreams API or can it be leveraged with any broker such as RabbitMq? I'm unsure about this because all the articles talk about Kafka Topics and Key Value pairs which are specific to Apache Kafka.
If there is a better way to solve the data denormalization problem, I'm open to it too.
Thanks
Kafka Steams is only for Kafka. You're more than welcome to use Kafka Streams between Debezium and the process that consumes any topic (the Postgres connector that writes to RabbitMQ?)
You can use Spark, Flink, or Beam for stream processing on other messaging queues, but Debezium requires Kafka so start with tools around that.
Spark, for example, has an Elasticsearch writer library; not sure about the others.

Can compacted Kafka topic be used as key-value database?

In many articles, I've read that compacted Kafka topics can be used as a database. However, when looking at the Kafka API, I cannot find methods that allow me to query a topic for a value based on a key.
So, can a compacted Kafka topic be used as a (high performance, read-only) key-value database?
In my architecture I want to feed a component with a compacted topic. And I'm wondering whether that component needs to have a replica of that topic in its local database, or whether it can use that compacted topic as a key value database instead.
Compacted kafka topics themselves and basic Consumer/Producer kafka APIs are not suitable for a key-value database. They are, however, widely used as a backstore to persist KV Database/Cache data, i.e: in a write-through approach for instance. If you need to re-warmup your Cache for some reason, just replay the entire topic to repopulate.
In the Kafka world you have the Kafka Streams API which allows you to expose the state of your application, i.e: for your KV use case it could be the latest state of an order, by the means of queryable state stores. A state store is an abstraction of a KV Database and are actually implemented using a fast KV database called RocksDB which, in case of disaster, are fully recoverable because it's full data is persisted in a kafka topic, so it's quite resilient as to be a source of the data for your use case.
Imagine that this is your Kafka Streams Application architecture:
To be able to query these Kafka Streams state stores you need to bundle an HTTP Server and REST API in your Kafka Streams applications to query its local or remote state store (Kafka distributes/shards data across multiple partitions in a topic to enable parallel processing and high availability, and so does Kafka Streams). Because Kafka Streams API provides the metadata for you to know in which instance the key resides, you can surely query any instance and, if the key exists, a response can be returned regardless of the instance where the key lives.
With this approach, you can kill two birds in a shot:
Do stateful stream processing at scale with Kafka Streams
Expose its state to external clients in a KV Database query pattern style
All in a real-time, highly performant, distributed and resilient architecture.
The images were sourced from a wider article by Robert Schmid where you can find additional details and a prototype to implement queryable state stores with Kafka Streams.
Notable mention:
If you are not in the mood to implement all of this using the Kafka Streams API, take a look at ksqlDB from Confluent which provides an even higher level abstraction on top of Kafka Streams just using a cool and simple SQL dialect to achieve the same sort of use case using pull queries. If you want to prototype something really quickly, take a look at this answer by Robin Moffatt or even this blog post to get a grip on its simplicity.
While ksqlDB is not part of the Apache Kafka project, it's open-source, free and is built on top of the Kafka Streams API.

Why do we need a database when using Apache Kafka?

According to the schema data comes to Kafka, then to stream and Mapr-DB.
After storing data in DB, user can display data on the map.
Question is, why we use DB to dispaly data on the map if Kafka is already DB.
It seems to me more slowly to get realtime data from Mapr-DB that from Kafka.
What do you think, why this example uses this appoarch?
The core abstraction Kafka provides for a stream of records is known as topic. You can imagine topics as the tables in a database. A database (Kafka) can have multiple tables (topics). Like in databases, a topic can have any kind of records depending on the usecase. But note that Kafka is not a database.
Also note that in most cases, you would have to configure a retention policy. This means that messages at some point will be deleted based on a configurable time or size based retention policy. Therefore, you need to store the data into a persistent storage system and in this case, this is your Database.
You can read more about how Kafka works in this blog post.

Kafka and microservices - Architecture questions

In a Microservices based architecture, who writes to Kafka? services themselves or the Microservices databases? I've been thinking about this and see pros and cons to both approaches but leaning towards having database write to Kafka topics because
Database and data in the Kafka topic won't go out of sync in case write to Kafka fails for whatever reason
Application teams won't have to have one more step to worry about
Applications can keep focusing on the core function rather than worrying about Kafka.
Thanks for your inputs
As cricket_007 has been saying, databases typically cannot write to Apache Kafka themselves; instead, you'd need a change data capturing services such as Debezium in order to stream data changes from the database into Kafka (disclaimer: I'm the lead of Debezium).
Such an approach allows to ensure (eventual) consistency between a service's own database and Kafka messages sent to other services. On specific CDC application I'd recommend to look into is the outbox pattern. The idea there is to not capture changes to the service's actual business tables, but instead work with a separate "outbox table", into which the service writes specific messages meant for consumption by other services. CDC would then be used to sent these events from that table to Kafka.
This approach avoids exposing internal data structures to outside consumers while also avoiding the issues of "dual writes" which a service would suffer from when directly writing to its database and Kafka. In Debezium there's some means of built-in support for the outbox pattern via a message transformation that helps to route the events from the outbox table into event-type specific Kafka topics.
Not all services need a database, they just emit data (logs, metrics, sensors, etc)
So, the answer would be either.
Plus, I'm not sure what database directly can export to Kafka, so you'd have some other service like Debezium deployed which would be polling those CDC records off the database
Application developers still have to "worry" about how to deserialize their data, how many partitions are in the topic so they can scale out consumption, manage offsets, among other things