Why do we need a database when using Apache Kafka? - apache-kafka

According to the schema, data comes into Kafka, then into a stream, and then into MapR-DB.
After the data is stored in the DB, a user can display it on the map.
The question is: why use a DB to display the data on the map if Kafka already acts as a database?
It seems to me that getting realtime data from MapR-DB is slower than getting it from Kafka.
What do you think, why does this example use this approach?

The core abstraction Kafka provides for a stream of records is the topic. You can imagine topics as the tables of a database: a database (Kafka) can have multiple tables (topics), and, as in databases, a topic can hold any kind of records depending on the use case. But note that Kafka is not a database.
Also note that in most cases you would have to configure a retention policy. This means that messages will at some point be deleted, based on a configurable time- or size-based retention policy. Therefore, you need to store the data in a persistent storage system, which in this case is your database.
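For illustration, here is a minimal sketch, assuming a local broker, of setting a time-based retention policy when creating a topic with the Java AdminClient (the topic name, partition count and the one-week value are placeholders):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicWithRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                // Example topic: 3 partitions, replication factor 1, records kept for 7 days
                NewTopic topic = new NewTopic("events", 3, (short) 1)
                        .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }

Anything older than the configured window is deleted by the broker on its own, which is exactly why a separate persistent store is needed for data you want to keep indefinitely.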
You can read more about how Kafka works in this blog post.

Related

What is the best practice to enrich data when synchronizing data with Kafka Connect

I am thinking about solutions to enrich data from Kafka.
Right now I am implementing the MongoDB Kafka connector to sync all changes to Kafka. The connector uses a change stream to watch the oplog and publish changes to Kafka. The relationship between a Mongo collection and a Kafka topic is 1:1.
On the consumer side, when it pulls data, it gets a reference id that we need to join against another collection to get the full data.
To join data between collections, I see the two solutions below.
1. When the consumer pulls data, it goes back to the Mongo database to fetch the data or join the collections according to the reference key.
With this approach, my concern is the number of connections I would need to open back to the Mongo database.
2. Using Kafka Streams to join data across topics.
For the second solution, I would like to know how to keep the master data in the topics forever and how to maintain records in topics like DB tables, so that each row has a unique key and, when a change arrives on the topic, we can update the record.
If you have any other solutions, please let me know.
Your consumer can do whatever it wants. You may need to increase various Kafka timeout configs depending on your database lookups, though.
Kafka topics can be infinitely retained with retention.ms=-1, or by compaction. When you use compaction, the topic acts similarly to a KV store (but as a log). To get an actual lookup store, you can build a KTable, then join a topic stream against it.
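As a rough sketch of that idea (the topic names, serdes and the string-concatenation joiner are made up for illustration), a Kafka Streams stream-table join might look like this:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class EnrichmentJoin {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Master data from a compacted topic, materialized as a lookup table (latest value per key)
            KTable<String, String> customers =
                    builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

            // Change events whose key is the reference id
            KStream<String, String> orders =
                    builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

            // Stream-table join: each event is enriched with the current master record
            orders.join(customers, (order, customer) -> order + " | " + customer)
                  .to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-join");   // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            new KafkaStreams(builder.build(), props).start();
        }
    }

Since the table side is backed by a compacted topic, it keeps exactly one (the latest) value per key, which is the "maintain records like DB tables" behaviour asked about above.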
This page covers various join patterns in Kafka Streams - https://developer.confluent.io/learn-kafka/kafka-streams/joins/
You can also use ksqlDB.

How to create Kafka events from a database?

There is a legacy service that writes values to the database.
I need to convert the values to events and then send them to Kafka.
I'm going to make a service that checks for new records at a fixed delay and sends them, also writing the ids of the submitted records to a technical table, but maybe there is some other way, a best practice or a pattern.
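Purely to illustrate the polling idea described here (the table name, column names, JDBC URL and topic are all invented), such a fixed-delay job might look roughly like the sketch below; the Debezium approach in the answer avoids hand-rolled polling like this:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.Properties;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PollingPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleWithFixedDelay(() -> {
                // Hypothetical schema: pick up rows that have not been published yet
                try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/legacy"); // placeholder
                     PreparedStatement select = conn.prepareStatement(
                             "SELECT id, payload FROM legacy_values WHERE published = false");
                     ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        long id = rs.getLong("id");
                        producer.send(new ProducerRecord<>("legacy-values", String.valueOf(id), rs.getString("payload")));
                        // Mark the row as sent (the "technical table" idea from the question)
                        try (PreparedStatement update = conn.prepareStatement(
                                "UPDATE legacy_values SET published = true WHERE id = ?")) {
                            update.setLong(1, id);
                            update.executeUpdate();
                        }
                    }
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }, 0, 10, TimeUnit.SECONDS);
        }
    }

Note that this simple version can resend or miss rows around failures, which is one reason CDC tools such as Debezium are usually preferred.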
You may want to look into Debezium, which implements Change Data Capture (CDC) on relational and NoSQL data stores and streams the changes into Kafka.
https://github.com/debezium/debezium
https://debezium.io/documentation

Using Apache Kafka to maintain data integrity across databases in microservices architecture

Has anyone used Apache Kafka to maintain data integrity across a microservice architecture in which each service has its own database? I have been searching around and there were some posts that mentioned using Kafka, but I'm looking for more detail on how Kafka was used. Do you have to write code for the producer and consumer (say, with the Customer database as producer and the Orders database as consumer, so that if a Customer is deleted in the Customer database, the Orders database somehow knows and deletes all Orders for that Customer as well)?
Yes, you'll need to write that processing code.
For example, one database would be connected to a CDC reader to emit all changes to a stream (the producer), which could be fed into a KTable or a custom consumer that writes upserts/deletes into a local cache of another service. The reason it ought to be a cache rather than a database is that, when the service restarts, you could miss some events or duplicate others, so the source of the materialized view should ideally be Kafka itself (via a compacted topic).
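For the consumer half, a minimal sketch (the topic name, group id and the two helper methods are hypothetical) of the Orders side reacting to customer changes could look like this; a delete from the upstream database typically shows up as a tombstone record, i.e. a key with a null value:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CustomerChangeListener {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-service");          // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("customers")); // changelog topic fed by CDC, name assumed
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        if (record.value() == null) {
                            // Tombstone: the customer was deleted upstream, so clean up its orders
                            deleteOrdersForCustomer(record.key());
                        } else {
                            upsertLocalCustomerCopy(record.key(), record.value());
                        }
                    }
                }
            }
        }

        static void deleteOrdersForCustomer(String customerId) { /* service-specific */ }
        static void upsertLocalCustomerCopy(String customerId, String value) { /* service-specific */ }
    }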

Why do we require Apache Kafka with NoSQL databases?

Apache Kafka is a real-time messaging service. It stores streams of data safely in a distributed and fault-tolerant way, and we can filter the streaming data as it arrives from the producer. I don't understand why we need a NoSQL database such as MongoDB to store the same data that is in Apache Kafka. The real question is: why do we store the same data in both a NoSQL database and Apache Kafka?
I think that if we need a NoSQL database, we could collect the streams of data from clients directly into MongoDB, without using Apache Kafka. Yet most big data architectures prefer to put Apache Kafka between the data source and the NoSQL database. (see)
What are the advantages of that for real systems?
This architecture has several advantages:
Kafka as Data Integration Bus
It helps distribute data between several producers and many consumers easily. Here Apache Kafka serves as a data-integration message bus.
Kafka as Data Buffer
Putting Kafka in front of your "end" data stores such as MongoDB or MySQL gives you a natural data buffer, so you are able to deploy/maintain/redeploy your consumer services independently. While a service is down for maintenance, Kafka keeps storing all incoming data, which is quite useful.
Kafka as Short-Term Data Storage
You don't have to store everything in Kafka: very often you use Kafka topics with retention, meaning that all data older than some threshold is deleted by Kafka automatically. For example, you may have a Kafka topic with 1-week retention (so you store only 1 week of data), while at the same time the data lives on in long-term storage services such as classic SQL databases, Cassandra, etc.
Kafka as Long-Term Data Storage
On the other hand, you can use Apache Kafka as a long-term storage system. Compacted topics let you store only the last value for each key, so the topic becomes a latest-state store for your app.
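To make the contrast concrete, here is a small sketch of declaring the two kinds of topics (topic names and settings are invented, and it assumes the same AdminClient/NewTopic setup shown in the retention example near the top of this page):

    // Short-term topic: records older than one week are deleted automatically
    NewTopic weekOfEvents = new NewTopic("events", 3, (short) 1)
            .configs(Map.of("retention.ms", "604800000"));

    // Long-term "latest state" topic: compaction keeps only the last record per key
    NewTopic latestState = new NewTopic("app-state", 3, (short) 1)
            .configs(Map.of("cleanup.policy", "compact"));

    admin.createTopics(List.of(weekOfEvents, latestState)).all().get();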

Read data from KSQL tables

maybe this is a beginner question but what is the recommended way to read data produced in KSQL?
Let's assume I do some stream processing and write the data to a KSQL table. Now I want to access this data via a Spring application (e.g. fan-out some live data via a websocket). My first guess here was to use Spring Kafka and just subscribe to the underlying topic. Or should I use Kafka Streams?
Another use-case could be to do stream processing and write the results to a Redis store (e.g. for a webservice which always returns current values). What would be the approach here?
Thanks!
The results of KSQL queries are stored in Kafka topics, so you can access them from third-party applications by reading from the result topic.
If the query result is a table, the resulting Kafka topic is a changelog topic, meaning that you can read it into a table in a third-party system such as Cassandra or Redis. That table will always hold the latest results, and you can query it from web services.
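For the Spring use-case mentioned in the question, subscribing to that result topic is enough. As a rough sketch (the topic name PAGEVIEWS_BY_REGION is assumed, and it presumes a Spring Boot app with spring-kafka on the classpath and consumer properties set in application.properties), a listener could look like:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class KsqlResultListener {

        // Each record is the latest value for its key in the KSQL table's changelog topic
        @KafkaListener(topics = "PAGEVIEWS_BY_REGION", groupId = "webapp")
        public void onUpdate(ConsumerRecord<String, String> record) {
            // e.g. push the update out over a websocket, or write it into Redis
            System.out.printf("key=%s value=%s%n", record.key(), record.value());
        }
    }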
Check out our Clickstream demo, where we push the results into Elastic for visualization. The visualized values are the latest values in the corresponding tables.
https://github.com/confluentinc/ksql/tree/master/ksql-clickstream-demo#clickstream-analysis