How can I publish and consume changes from thousands of tables in an RDBMS with Apache NiFi - apache-kafka

You have an RDBMS with thousands of tables (~9,000). Changes in these tables should be published to other consumers through Apache Kafka. I know Apache NiFi as a data routing and transformation tool that can pull from nearly any source, and I am thinking of a graphical interface where I can select a table/object from the source system and publish its records to Apache Kafka. But what would Change Data Capture of so many tables look like in Apache NiFi, and also within Apache Kafka? Would I have to create ~9,000 processors (one per table) and publish each table to its own dedicated topic? Or is there a more elegant way to do this?

Related

SQL Server Data to Kafka in real time

I would like to stream real-time data from SQL Server to Kafka directly, and I found there is a SQL Server connector provided by https://debezium.io/docs/connectors/sqlserver/
The documentation says that it will create one topic for each table. I am trying to understand the architecture: I have 500 clients, which means I have 500 databases, and each of them has 500 tables. Does that mean it will create 250,000 topics, or do I need a separate Kafka cluster for each client, with each cluster/node holding 500 topics based on the number of tables in the database?
Is this the best way to send SQL data to Kafka, or should we send an event to Kafka from application code whenever there is an insert/update/delete on a table?
With Debezium you are stuck with a one-table-to-one-topic mapping. However, there are creative ways to get around it.
Based on the description, it looks like you have some sort of product with a SQL Server backend containing 500 tables. This product is used by 500 or more clients, and each of them has their own instance of the database.
You can create a connector for one client, read all 500 tables, and publish them to Kafka. At this point you will have 500 Kafka topics. You can then route the data from all other database instances to the same 500 topics by creating a separate connector for each client/database instance. I am assuming that, since this is the backend database for a product, the table names, schema names, etc. are all the same, and the Debezium connectors will generate the same topic names for the tables. If that is not the case, you can use the topic routing SMT.
You can differentiate the data in Kafka by adding a few metadata columns to the records. This can easily be done in the connector with SMTs. The metadata columns could be client_id, client_name, or something else.
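To make this concrete, below is a rough sketch of what one per-client connector could look like in Kafka Connect standalone .properties form, combining the topic re-routing and a static client_id field. The host names, the "client042"/"shared" naming, and the omission of the SQL Server history/connection options are illustrative assumptions, and exact option names vary across Debezium versions:

    name=client042-sqlserver-connector
    connector.class=io.debezium.connector.sqlserver.SqlServerConnector
    database.hostname=client042-db.example.com
    database.port=1433
    database.user=debezium
    database.password=change-me
    database.dbname=productdb
    # logical server name, used as the topic prefix for this client
    database.server.name=client042
    # re-route this client's per-table topics onto the shared per-table topics
    transforms=reroute,addclient
    transforms.reroute.type=org.apache.kafka.connect.transforms.RegexRouter
    transforms.reroute.regex=client042\\.(.*)
    transforms.reroute.replacement=shared.$1
    # tag every record value with the client it came from
    transforms.addclient.type=org.apache.kafka.connect.transforms.InsertField$Value
    transforms.addclient.static.field=client_id
    transforms.addclient.static.value=client042

Each additional client then gets another copy of this configuration with its own connection details and client_id, while the shared.* topics remain the same 500 per-table topics.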
As for your other question,
Is this the best way to send SQL data to Kafka, or should we send an event to Kafka from application code whenever there is an insert/update/delete on a table?
The answer is "it depends!".
If it is a simple transactional application, I would simply write the data to the database and not worry about anything else.
The answer also depends on why you want to deliver data to Kafka. If you are looking to deliver data/business events to Kafka for downstream business processing that requires transactional integrity and strict SLAs, writing the data from the application may make sense. However, if you are publishing data to Kafka to make it available to others for analytics or other purposes, the Kafka Connect approach makes sense.
There is a licensed alternative, Qlik Replicate, which is capable of something very similar.

Using Apache Kafka to maintain data integrity across databases in microservices architecture

Has anyone used Apache Kafka to maintain data integrity across a microservices architecture in which each service has its own database? I have been searching around, and there were some posts that mentioned using Kafka, but I'm looking for more details on how Kafka was used. Do you have to write code for the producer and the consumer (say, with the Customer database as the producer and the Orders database as the consumer, so that if a Customer is deleted in the Customer database, the Orders database somehow knows about it and deletes all Orders for that Customer as well)?
Yes, you'll need to write that processing code.
For example, one database would be connected to a CDC reader that emits all changes to a stream (the producer), which could be fed into a KTable or a custom consumer that writes upserts/deletes into a local cache of another service. The reason I say it ought to be a cache rather than a database is that when the service restarts you may miss some events or duplicate others, so the source of the materialized view should ideally be Kafka itself (via a compacted topic).
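A very rough sketch of that idea in Kafka Streams (the "customers" topic, the store name, and the String serdes are placeholder assumptions; real CDC events would need proper deserialization):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class CustomerCacheApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-customer-cache");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

            StreamsBuilder builder = new StreamsBuilder();

            // "customers" is assumed to be a compacted topic fed by CDC;
            // a tombstone (null value) removes the key from the local store.
            KTable<String, String> customers = builder.table(
                    "customers",
                    Consumed.with(Serdes.String(), Serdes.String()),
                    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("customers-store"));

            // other topology logic (e.g. joins against this table) would go here

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            // on restart the store is rebuilt from the compacted topic,
            // so the cache cannot drift away from the source of truth
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }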

Kafka and microservices - Architecture questions

In a microservices-based architecture, who writes to Kafka: the services themselves, or the microservices' databases? I've been thinking about this and see pros and cons to both approaches, but I'm leaning towards having the database write to Kafka topics because:
The database and the data in the Kafka topic won't go out of sync if a write to Kafka fails for whatever reason.
Application teams won't have one more step to worry about.
Applications can keep focusing on their core function rather than worrying about Kafka.
Thanks for your input.
As cricket_007 has been saying, databases typically cannot write to Apache Kafka themselves; instead, you'd need a change data capture service such as Debezium in order to stream data changes from the database into Kafka (disclaimer: I'm the lead of Debezium).
Such an approach lets you ensure (eventual) consistency between a service's own database and the Kafka messages sent to other services. One specific CDC application I'd recommend looking into is the outbox pattern. The idea there is not to capture changes to the service's actual business tables, but instead to work with a separate "outbox table", into which the service writes specific messages meant for consumption by other services. CDC is then used to send these events from that table to Kafka.
This approach avoids exposing internal data structures to outside consumers while also avoiding the "dual write" issues a service would suffer from when writing directly to both its database and Kafka. Debezium has built-in support for the outbox pattern via a message transformation that routes the events from the outbox table into event-type-specific Kafka topics.
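As a sketch of how that routing piece is typically wired up (the table name is a placeholder, the column names follow Debezium's default outbox layout, and exact SMT option names may differ between Debezium versions), the relevant fragment of a connector configuration could look like this:

    # the outbox table is assumed to use Debezium's default column layout:
    # id, aggregatetype, aggregateid, type, payload
    table.include.list=public.outbox
    transforms=outbox
    transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
    # one topic per aggregate type, e.g. "outbox.event.order", "outbox.event.customer"
    transforms.outbox.route.topic.replacement=outbox.event.${routedByValue}

The service itself only inserts a row into the outbox table inside the same local transaction as its business change; it never writes to Kafka directly.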
Not all services need a database; some just emit data (logs, metrics, sensors, etc.).
So, the answer would be: either one.
Plus, I'm not sure which databases can export directly to Kafka, so you'd have some other service such as Debezium deployed that polls those CDC records off the database.
Application developers still have to "worry" about how to deserialize their data, how many partitions are in the topic so they can scale out consumption, how to manage offsets, and other things.
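To illustrate, even a minimal consumer already forces those choices: which deserializers to use, a consumer group so consumption can scale out across partitions, and how offsets are committed (broker address, topic, and group id below are made-up placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class OrdersConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            // consumer group: add more instances to scale out, one partition per consumer at most
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-service");
            // deserialization choice (String here; Avro/JSON would need different deserializers)
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // offsets are committed explicitly after processing
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("customer-events")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // application-specific handling goes here
                        System.out.printf("key=%s value=%s%n", record.key(), record.value());
                    }
                    consumer.commitSync(); // commit only after the batch has been processed
                }
            }
        }
    }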

Oracle GoldenGate for Big Data Kafka adapter grouping data to Kafka

Source: Oracle Database
Target: Kafka
I am moving data from source to target with the Oracle GoldenGate for Big Data adapter. The data moves fine, but when I insert 5 records they end up as a single message in the topic.
I want to control how they are grouped: if I make 5 inserts, I need five separate entries in the Kafka topic.
Kafka handler, GoldenGate for Big Data version 12.3.1.
I insert five records in the source, and in Kafka I get all the inserts like below:
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"I","op_ts":"2017-10-24 08:52:01.000000","current_ts":"2017-10-24T12:52:04.960000","pos":"00000000030000001263","after":{"TEST_ID":2,"TEST_NAME":"Francis","TEST_NAME_AR":"Francis"}}
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"I","op_ts":"2017-10-24 08:52:01.000000","current_ts":"2017-10-24T12:52:04.961000","pos":"00000000030000001437","after":{"TEST_ID":3,"TEST_NAME":"Ashfak","TEST_NAME_AR":"Ashfak"}}
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"U","op_ts":"2017-10-24 08:55:04.000000","current_ts":"2017-10-24T12:55:07.252000","pos":"00000000030000001734","before":{"TEST_ID":null,"TEST_NAME":"Francis"},"after":{"TEST_ID":null,"TEST_NAME":"updatefrancis"}}
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"D","op_ts":"2017-10-24 08:56:11.000000","current_ts":"2017-10-24T12:56:14.365000","pos":"00000000030000001865","before":{"TEST_ID":2}}
{"table":"MYSCHEMATOPIC.ELASTIC_TEST","op_type":"U","op_ts":"2017-10-24 08:57:43.000000","current_ts":"2017-10-24T12:57:45.817000","pos":"00000000030000002152","before":{"TEST_ID":3},"after":{"TEST_ID":4}}
I would recommend using the Kafka Connect Handler, since it then registers the data's schema with the Confluent Schema Registry, making it much easier to stream onwards to targets such as Elasticsearch (using Kafka Connect).
In Kafka each record from Oracle will be one Kafka message.
I made the change below in the .props file:
gg.handler.kafkahandler.mode=op
And it worked!
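For context, that mode setting sits alongside the other Kafka handler properties in the GoldenGate for Big Data .props file; a minimal sketch could look like the lines below (apart from the mode property, the property names are quoted from memory of the 12.3 handler documentation, so verify them against your installation):

    gg.handlerlist=kafkahandler
    gg.handler.kafkahandler.type=kafka
    gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
    gg.handler.kafkahandler.format=json
    # op = one Kafka message per operation, tx = one message per transaction
    gg.handler.kafkahandler.mode=op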

Build a solution for Kafka+Spark for RDBMS data

My current project is on mainframes with DB2 as its database. We have 70 databases with nearly 60 tables in each of them. Our architect has proposed a plan to use Kafka with Spark Streaming for processing the data. How good is Kafka at reading data from RDBMS tables? Do we read the data from the tables directly with Kafka, or is there another way to get the data from the RDBMS into Kafka?
If there is a better solution, your suggestions would help a lot.
Do not read directly from the database; it will create additional load. I would suggest two approaches:
Send new data both to the database and to Kafka, or send it to Kafka first and then consume it for processing.
Read data from the database write-ahead log (I know this is possible for MySQL with Maxwell, but I am not sure about DB2) and send it to Kafka for further processing.
You can use Spark Streaming or Kafka Streams depending on your needs.
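For the Spark side, once the change events are in Kafka (via either approach above), they can be read with a plain Kafka source; here is a minimal sketch using Spark Structured Streaming (broker address and topic name are placeholders, and the console sink stands in for your real processing):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class Db2ChangeStream {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("db2-cdc-stream")
                    .getOrCreate();

            // Requires the spark-sql-kafka connector package on the classpath.
            // Reads change events that a CDC tool has already published to Kafka.
            Dataset<Row> changes = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")
                    .option("subscribe", "db2.changes") // hypothetical topic name
                    .load();

            // Kafka records arrive as binary key/value; cast them to strings before parsing.
            changes.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
                    .writeStream()
                    .format("console")
                    .start()
                    .awaitTermination();
        }
    }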