What is the best practice to enrich data when synchronizing data with Kafka Connect - apache-kafka

I am thinking about solutions to enrich data from Kafka.
Currently I am implementing the MongoDB Kafka source connector to sync all changes to Kafka. The connector uses a change stream to watch the oplog and publishes changes to Kafka. The relationship between a Mongo collection and a Kafka topic is 1:1.
On the consumer side, when it pulls data, it gets a reference id that needs to be joined against another collection to fetch the related data.
To join data between collections, I see two options:
1. When consumers pull data, they go back to the Mongo database to fetch the referenced documents or join the collections using the reference key.
With this approach, I am concerned about the number of connections going back to the Mongo database.
2. Use Kafka Streams to join data across topics.
For the second option, I would like to know how to keep that master data in the topics forever, and how to maintain records in a topic like rows in a database table, so that each record has a unique key and incoming changes update the existing record.
If you have any other solutions, please let me know.

Your consumer can do whatever it wants. You may need to increase various Kafka timeout configs depending on your database lookups, though.
Kafka topics can be retained indefinitely with retention.ms=-1, or by using compaction. With compaction, a topic acts similarly to a KV store (but as a log). To get an actual lookup store, you can build a KTable, then join a topic stream against it.
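For instance, a minimal Kafka Streams sketch of that pattern, using string values for brevity and placeholder topic names ("customers" as the compacted master-data topic, "orders" as the change stream, both keyed by the reference id):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class EnrichmentJoin {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "mongo-enrichment");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // compacted topic holding the master data, read as an ever-updating table
            KTable<String, String> customers = builder.table("customers");

            // change events coming from the Mongo source connector
            KStream<String, String> orders = builder.stream("orders");

            // each order is joined against the latest customer record for its key
            orders.join(customers, (order, customer) -> order + "|" + customer)
                  .to("orders-enriched");

            new KafkaStreams(builder.build(), props).start();
        }
    }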
This page covers various join patterns in Kafka Streams - https://developer.confluent.io/learn-kafka/kafka-streams/joins/
You can also use ksqlDB

Related

Set kafka message key to source database name in Debezium Postgresql

We are trying to collect changes from a number of Postgresql databases using Debezium.
The idea is to create a single topic with a number of partitions equal to the number of databases - each database gets its own partition, because order of events matters.
We managed to reroute events to a single topic using topic routing, but to be able to partition events by database I need to set the message key properly.
Question: Is there a way we can set the Kafka message key to be equal to the source database name?
My thoughts:
Maybe there is a way to set the message key globally per connector configuration?
The database name can be found in the message, but it's a nested property, payload.source.name. I didn't find a way to extract the value from a nested property.
Any thoughts?
Thank you in advance!
You'd need to write/find a Connect transform that can extract nested fields and set the message key, or if you don't mind duplicating data within Kafka topics, you can use Kafka Streams / KsqlDB, etc to do the same.
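If you go the Kafka Streams route, a rough sketch could look like the following; it assumes the rerouted Debezium values are JSON strings and uses made-up topic names. selectKey() plus writing to a new topic is what duplicates the data under the key you want:

    import java.util.Properties;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class RekeyByDatabase {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rekey-by-database");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            ObjectMapper mapper = new ObjectMapper();
            StreamsBuilder builder = new StreamsBuilder();

            builder.<String, String>stream("debezium-rerouted")
                   // pull the nested payload.source.name field out of the JSON value
                   .selectKey((key, value) -> {
                       try {
                           return mapper.readTree(value).at("/payload/source/name").asText();
                       } catch (Exception e) {
                           return key; // keep the original key if the value cannot be parsed
                       }
                   })
                   // writing to a new topic materializes the new key (and duplicates the data)
                   .to("debezium-keyed-by-database");

            new KafkaStreams(builder.build(), props).start();
        }
    }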
Overall, I don't think one topic + one partition per database is a good design for scalability of consumers. Sure, it'll keep order, but it's not much overhead to simply create one topic per database with only one partition. Then make consumers read all topics using a regex pattern rather than needing to assign to specific/all partitions in one topic.

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, Kafka Streams is built on top of KafkaProducer/Consumer, so everything that is possible with Kafka Streams is also possible with plain consumers/producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain consumers/producers. Now we could start a long discussion on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain consumers/producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
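For a sense of that plumbing, here is a minimal sketch of a plain consumer loop (the topic and group id are made up for illustration):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class PlainConsumerLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    // you own the poll loop, offset handling, rebalancing concerns, etc.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // business logic lives here, mixed in with the plumbing
                        System.out.printf("%s -> %s%n", record.key(), record.value());
                    }
                }
            }
        }
    }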
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.to()), and finally, using Kafka Connect, the messages will be stored in the database table and Elasticsearch. You could do the same thing using the core Producer / Consumer API as well, but it would require much more code.
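A rough sketch of just the filtering step described above (the topic names and the naive string match on the city are purely illustrative; a real application would use proper serdes and a typed order object):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class FilterOrdersByCity {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-orders-by-city");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // orders arrive as JSON strings; keep only the ones for the chosen delivery city
            builder.<String, String>stream("customer-orders")
                   .filter((orderId, orderJson) -> orderJson.contains("\"deliveryCity\":\"Berlin\""))
                   .to("berlin-orders"); // picked up by the JDBC and Elasticsearch sink connectors

            new KafkaStreams(builder.build(), props).start();
        }
    }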
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactions from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates, such as counting orders at the level of an individual product. In such scenarios, duplicates will always give you a wrong result.
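In Kafka Streams this is largely a configuration switch rather than hand-written transaction code; a minimal sketch, assuming the line is added to the Properties object used in the surrounding sketches and that your brokers are recent enough to support it:

    // turn on transactional, exactly-once processing for the whole topology
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);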
You can also enrich your incoming data with very low latency. Let's assume, in the above example, you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now, all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data here will be stored in the embedded RocksDB that comes with Kafka Streams, and also, as the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
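A sketch of that local lookup using a GlobalKTable loaded from a compacted topic (the topic names and the customer-id extraction are made-up placeholders):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.GlobalKTable;
    import org.apache.kafka.streams.kstream.KStream;

    public class EnrichOrdersWithEmail {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrich-orders");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // compacted topic with the latest email per customer id, loaded into local RocksDB
            GlobalKTable<String, String> customerEmails = builder.globalTable("customer-emails");

            KStream<String, String> orders = builder.stream("orders");

            // local lookup instead of a remote REST call per order
            orders.join(customerEmails,
                        (orderId, order) -> extractCustomerId(order), // map each order to its lookup key
                        (order, email) -> order + "|" + email)
                  .to("orders-with-email");

            new KafkaStreams(builder.build(), props).start();
        }

        // placeholder: a real application would parse the order payload properly
        static String extractCustomerId(String order) {
            return order.split(",")[0];
        }
    }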
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events come within a one-hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
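A sketch of such a one-hour windowed join (topic names are illustrative, and ofTimeDifferenceWithNoGrace assumes a reasonably recent Kafka Streams version):

    import java.time.Duration;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.JoinWindows;
    import org.apache.kafka.streams.kstream.KStream;

    public class OrderPaymentJoin {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-payment-join");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            KStream<String, String> orders = builder.stream("orders");     // keyed by order id
            KStream<String, String> payments = builder.stream("payments"); // keyed by order id

            // only orders whose payment arrives within one hour (either side) are emitted;
            // the window state lives in the local RocksDB store mentioned above
            orders.join(payments,
                        (order, payment) -> order + "|" + payment,
                        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)))
                  .to("paid-orders");

            new KafkaStreams(builder.build(), props).start();
        }
    }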

Why do we need a database when using Apache Kafka?

According to the schema, data comes to Kafka, then to a stream, and then to MapR-DB.
After the data is stored in the DB, the user can display it on the map.
The question is, why do we use a DB to display data on the map if Kafka is already a DB?
It seems to me slower to get real-time data from MapR-DB than from Kafka.
What do you think, why does this example use this approach?
The core abstraction Kafka provides for a stream of records is known as a topic. You can imagine topics as the tables in a database. A database (Kafka) can have multiple tables (topics). As in databases, a topic can have any kind of records depending on the use case. But note that Kafka is not a database.
Also note that in most cases you would have to configure a retention policy. This means that messages will be deleted at some point, based on a configurable time- or size-based retention policy. Therefore, you need to store the data in a persistent storage system, and in this case that is your database.
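For illustration, retention is a per-topic setting; a sketch of creating a topic with seven-day time-based retention via the AdminClient (the topic name and values are just examples):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicWithRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // events on this topic are deleted after 7 days (time-based retention)
                NewTopic positions = new NewTopic("vehicle-positions", 3, (short) 1)
                        .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
                admin.createTopics(List.of(positions)).all().get();
            }
        }
    }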
You can read more about how Kafka works in this blog post.

Using Apache Kafka to maintain data integrity across databases in microservices architecture

Has anyone used Apache Kafka to maintain data integrity across a microservice architecture in which each service has its own database? I have been searching around and there were some posts that mentioned using Kafka, but I'm looking for more details, such as how Kafka was used. Do you have to write code for the producer and consumer (say, with the Customer database as producer and the Orders database as consumer, so that if a customer is deleted in the Customer database, the Orders database somehow needs to know that and delete all orders for that customer as well)?
Yes, you'll need to write that processing code
For example, one database would be connected to a CDC reader that emits all changes to a stream (the producer), which could be fed into a KTable or a custom consumer to write upserts/deletes into a local cache of another service. The reason I say it ought to be a cache rather than a database is that, when the service restarts, you could potentially miss some events or duplicate others, so the source of the materialized view should ideally be Kafka itself (via a compacted topic).
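As a sketch of that idea, the consuming service could materialize a compacted CDC topic into a locally queryable KTable (the topic, store, and key names below are made up for illustration):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class CustomerCacheInOrderService {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-service-customer-cache");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // compacted CDC topic: latest state per customer id, tombstones (null values) for deletes
            builder.table("customers-cdc",
                    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("customers-store"));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();

            // once the instance is RUNNING, the Orders service can look customers up locally
            // instead of calling the Customer service's database
            ReadOnlyKeyValueStore<String, String> customerStore = streams.store(
                    StoreQueryParameters.fromNameAndType("customers-store",
                            QueryableStoreTypes.keyValueStore()));
            System.out.println(customerStore.get("customer-42")); // null once the customer is deleted
        }
    }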

KSQL query and tables storage

I was looking for documentation about where KSQL stores queries and tables. For example, since KSQL was built to work with Kafka, when I create a table from a topic, or when I write a query, where are the tables or the query results stored? More specifically, does KSQL use some kind of pointers to events inside segments within the topic partitions, or does it duplicate the events when I create a table from a topic, for example?
The queries that have been run or are active are persisted back into a Kafka topic.
A SELECT statement has no persistent state; it acts as a consumer.
A CREATE STREAM/TABLE command will potentially create many topics, resulting in duplication, manipulation, and filtering of the input topic out to a given destination topic. For any stateful operations, the results would be stored in a RocksDB instance on the KSQL server(s).
Since KSQL is built on Kafka Streams, you can refer to the wiki on Kafka Streams Internal Data Management