I have several orders databases (e.g. OrderUSA, OrderUK, OrderIndia, ...) in Postgres, and all of the databases have the same schema and tables. I want to merge the tbl_orders tables from these databases into one. I am writing a Debezium connector for each database. Is it possible to use the same topic for all of these Debezium connectors? It would be convenient to have one topic with the consolidated data, which I could then feed to the sink connector. Please advise so that I can move this model to production.
Yes, it is possible to have one topic containing data for all databases.
Whether you should do that depends. This Reddit post may be helpful to you: https://www.reddit.com/r/apachekafka/comments/q8a7sj/topic_strategies_when_to_split_into_multiple/
Ultimately, it depends on your business requirements and whether it is logical for all order data to be in one topic.
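If you do go that route, the usual mechanism is to give every connector a RegexRouter SMT that rewrites its per-database topic name to one shared topic. Below is a rough sketch for the OrderUSA connector only (hostnames, credentials, the regex, and the consolidated topic name tbl_orders_all are all assumptions; I am also using the Debezium 2.x property topic.prefix, which older versions call database.server.name). You would register one connector like this per database, changing only the connection details and the prefix:

```json
{
  "name": "orders-usa-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres-usa",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "OrderUSA",
    "topic.prefix": "orderusa",
    "table.include.list": "public.tbl_orders",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": ".*\\.tbl_orders",
    "transforms.route.replacement": "tbl_orders_all"
  }
}
```

One thing to verify before production: primary keys may collide across the per-country databases, so consider adding a source/country field to the records (an SMT can do this too) before the sink connector writes the merged data.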
I would like to stream real-time data from SQL Server to Kafka directly, and I found there is a SQL Server connector provided by https://debezium.io/docs/connectors/sqlserver/
The documentation says that it will create one topic for each table. I am trying to understand the architecture: I have 500 clients, which means 500 databases, and each of them has 500 tables. Does that mean it will create 250,000 topics, or do I need a separate Kafka cluster for each client, where each cluster/node has 500 topics based on the number of tables in the database?
Is this the best way to send SQL data to Kafka, or should we send an event to a Kafka queue through code whenever there is an insert/update/delete on a table?
With Debezium you are stuck with a one-table-to-one-topic mapping. However, there are creative ways to work around it.
Based on the description, it looks like you have some sort of product with a SQL Server backend that has 500 tables, and this product is used by 500 or more clients, each with their own instance of the database.
You can create a connector for one client, read all 500 tables, and publish them to Kafka. At this point you will have 500 Kafka topics. You can route the data from all of the other database instances to the same 500 topics by creating a separate connector for each client / database instance. I am assuming that, since this is the backend database for a product, the table names, schema names, etc. are all the same, so the Debezium connectors will generate the same topic names for the tables. If that is not the case, you can use the topic routing SMT.
You can differentiate the data in Kafka by adding a few metadata fields to the records. This can easily be done in the connector with SMTs; the fields could be client_id, client_name, or something similar.
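As a sketch of those two ideas together (the regex, field name, and client value are assumptions, and this is only the transforms fragment of one client's connector config): the router strips the per-client prefix from Debezium's default <server>.<schema>.<table> topic name so that every client's connector publishes to the same per-table topics, and InsertField stamps each record with the client it came from.

```json
{
  "transforms": "route,addClient",
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "^[^.]+\\.(.*)$",
  "transforms.route.replacement": "$1",
  "transforms.addClient.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.addClient.static.field": "client_id",
  "transforms.addClient.static.value": "client_001"
}
```

If you also use Debezium's ExtractNewRecordState (unwrap) SMT, place it before InsertField in the chain so that client_id lands on the flattened row rather than on the CDC envelope.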
As for your other question,
Is this the best way to send SQL data to Kafka, or should we send an event to a Kafka queue through code whenever there is an insert/update/delete on a table?
The answer is "it depends!".
If it is a simple transactional application, I would simply write the data to the database and not worry about anything else.
The answer also depends on why you want to deliver data to Kafka. If you are looking to deliver data / business events to Kafka to perform downstream business processing that requires transactional integrity and strict SLAs, writing the data from the application may make sense. However, if you are publishing data to Kafka to make it available to others for analytical or other purposes, the Kafka Connect approach makes sense.
There is a licensed alternative, Qlik Replicate, which is capable of something very similar.
Good day, and apologies for my poor English. I have an issue: can you help me understand how I can use Kafka and Kafka Streams like a database?
My problem is that I have several microservices, and each service keeps its data in its own database. For reporting purposes I need to collect the data in one place, and for this I chose Kafka. I use Debezium (change data capture), so each table in the relational databases becomes a topic in Kafka. I wrote an application with Kafka Streams that joins the streams with each other, and so far so good. For example, I have topics for ORDER and ORDER_DETAILS; at some point an event will arrive that needs to be joined against these topics, but I don't know when that will be: maybe after minutes, or after months, or after years. How can I still read the data in the ORDER and ORDER_DETAILS topics after a month or a year? Is it the right approach to keep data in a topic indefinitely? Can you give me some advice, or point me to possible solutions?
The event will come as soon as there is a change in the database.
Typically, the changes to the database tables are pushed as messages to the topic.
Each and every update to the database will be a Kafka message. Since there is a message for every update, you might be interested only in the latest value (update) for any given key, which will usually be the primary key.
In this case, you can maintain the topic infinitely (retention.ms=-1) but compact (cleanup.policy=compact) it in order to save space.
You may also be interested in configuring segment.ms and/or segment.bytes for further tuning the topic retention.
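For example, with the stock tooling that could look like the following (topic name, bootstrap address, and the segment sizes are placeholders):

```bash
# Compact the topic and keep the latest value per key indefinitely.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name dbserver1.public.orders \
  --add-config cleanup.policy=compact,retention.ms=-1

# Roll segments sooner, so closed segments become eligible for compaction earlier.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name dbserver1.public.orders \
  --add-config segment.ms=604800000,segment.bytes=104857600
```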
According to the diagram, data goes to Kafka, then to a stream, and then into MapR-DB.
After the data is stored in the DB, the user can display it on the map.
The question is: why do we use a DB to display data on the map if Kafka is already a DB?
It seems to me that it is slower to get real-time data from MapR-DB than from Kafka.
What do you think: why does this example use this approach?
The core abstraction Kafka provides for a stream of records is known as a topic. You can imagine topics as the tables in a database: a database (Kafka) can have multiple tables (topics). As in databases, a topic can hold any kind of records, depending on the use case. But note that Kafka is not a database.
Also note that in most cases you would have to configure a retention policy. This means that messages will at some point be deleted, based on a configurable time- or size-based retention policy. Therefore, you need to store the data in a persistent storage system, and in this case that is your database.
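For illustration (topic name and limits are placeholders), a typical time- or size-based policy looks like this; old segments are deleted once either limit is exceeded, so anything you need long-term has to live in the downstream database:

```bash
# Keep at most ~7 days or ~1 GiB of data per partition, whichever is reached first.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name map-events \
  --add-config retention.ms=604800000,retention.bytes=1073741824
```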
You can read more about how Kafka works in this blog post.
After doing a lot of reading on the web I am finally reaching out to this forum. My challenge is to denormalize transactional data, sourced from a database via CDC into Kafka, before writing it out to a NoSQL database, in this case Cassandra. What is the best way to join the transactional data with lookups from the master tables? The issue I have is that there are maybe 5 to 10 lookup tables per transactional table.
A proof of concept using KSQL taught me to A) load the lookup tables as KTables, B) repartition the transactional stream, and finally C) perform the join and write the result into a new topic.
Following this approach, 5 or 10 lookup tables will generate a lot of data being shuffled around the cluster. I know the Streams DSL has the concept of a GlobalKTable, but that only works when the lookup tables are relatively small, and in addition I prefer a higher-level language like KSQL. Is there a better approach?
What you need is for ksqlDB to support non-key joins. So you should up-vote this issue that tracks that feature: https://github.com/confluentinc/ksql/issues/4424
Until then, your approach of repartitioning the transactional stream to match the key of the lookup tables is the only viable solution.
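Until that lands, a minimal sketch of the repartition-then-join pattern in recent ksqlDB syntax could look like the following (all topic, stream, table, and column names are made up, and Avro values with Schema Registry are assumed; repeat the table-plus-join step for each of your 5 to 10 lookup tables):

```sql
-- One lookup table per master table, keyed by the column the join needs.
CREATE TABLE customers (
    customer_id VARCHAR PRIMARY KEY,
    customer_name VARCHAR
  ) WITH (KAFKA_TOPIC='cdc.public.customers', VALUE_FORMAT='AVRO');

-- The transactional stream from CDC, currently keyed by its own primary key.
CREATE STREAM orders (
    order_id VARCHAR KEY,
    customer_id VARCHAR,
    amount DOUBLE
  ) WITH (KAFKA_TOPIC='cdc.public.orders', VALUE_FORMAT='AVRO');

-- Repartition the stream so its key matches the lookup table's key...
CREATE STREAM orders_by_customer AS
  SELECT * FROM orders PARTITION BY customer_id EMIT CHANGES;

-- ...then perform the stream-table join and write the denormalized result to a
-- new topic that the Cassandra sink connector can consume.
CREATE STREAM orders_enriched AS
  SELECT o.customer_id, o.order_id, o.amount, c.customer_name
  FROM orders_by_customer o
  JOIN customers c ON o.customer_id = c.customer_id
  EMIT CHANGES;
```

Each additional lookup table means another CREATE TABLE plus another join stage, which is exactly where the repartitioning traffic you describe comes from; there is no way around that until non-key joins are supported.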
We recently started using the Confluent Kafka JDBC connector to import RDBMS data.
With the default configuration settings, it seems that one topic is created for every table in the schema.
I would like to know if there is any way to:
Create a topic per schema rather than one per table. And if topic-per-schema is possible, can schema evolution (with Schema Registry) still be supported on a per-table basis?
If topic-per-schema is not possible, are there any guidelines on how to manage hundreds or thousands of topics, given that there will be a one-to-one mapping between tables and topics?
Thanks in advance,
Create a topic per schema rather than one per table.
No - it's either n tables -> n topics, or 1 query -> 1 topic.
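For reference, the 1 query -> 1 topic shape looks roughly like this (connection details, the query, and the column names are placeholders); when query is set, everything the statement returns goes to the single topic named by topic.prefix:

```json
{
  "name": "jdbc-orders-rollup",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://dbhost:5432/sales",
    "connection.user": "connect",
    "connection.password": "********",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "query": "SELECT o.id, o.updated_at, o.total, c.name AS customer_name FROM orders o JOIN customers c ON o.customer_id = c.id",
    "topic.prefix": "sales-orders"
  }
}
```

The trade-off is that you get one schema and one offset per query instead of per table, so this only helps when a denormalized query per source is acceptable.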
any guidelines on how to manage hundreds or thousands of topics?
Adopt a standard naming pattern for them. Use topic-specific configuration as required.
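As a concrete illustration (the convention and values below are examples, not a standard), a pattern such as <env>.<source-system>.<schema>.<table> keeps thousands of table topics predictable, and per-topic overrides can then be applied wherever a table needs to differ from the cluster defaults:

```bash
# Topics follow the convention: prod.erp.public.customers, prod.erp.public.orders, ...
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name prod.erp.public.orders \
  --add-config retention.ms=2592000000,min.insync.replicas=2
```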