ksqlDB: Best way to create Tables from Debezium source topics? - apache-kafka

I would like to create Tables in ksqlDB from Debezium source topics, with the ultimate aim of performing a left join on these tables and efficiently outputting materialized views to a downstream database using the JDBC sink connector.
The Debezium source topics have not had any transforms applied (such as ExtractNewRecordState), so contain a 'before' and 'after' property, as described in the Debezium documentation here.
The reason for not applying the ExtractNewRecordState transform (which would presumably simplify matters) is that the source CDC topics may be used for various purposes and it does not appear possible to create multiple topics off the same source database table (since topic names are automatically determined by Debezium and depend on the database server, schema and table name as described here).
The best approach I have found so far is to:
1. Create a stream in ksqlDB from the raw Debezium input, e.g.:
CREATE STREAM user_stream WITH (KAFKA_TOPIC='mssql.dbo.user', VALUE_FORMAT='AVRO');
2. Create a second stream selecting the required fields from the 'after' property of the first stream, e.g.:
CREATE STREAM user_stream2 AS SELECT AFTER->user_id, AFTER->username, AFTER->email FROM user_stream EMIT CHANGES;
3. Finally, convert the second stream to a table as described here, namely:
SELECT user_id,
  LATEST_BY_OFFSET(username) AS username,
  LATEST_BY_OFFSET(email) AS email
FROM user_stream2
GROUP BY user_id
EMIT CHANGES;
These steps must be repeated to generate each Table, at which point a join can be performed on the Tables to produce an output.
This seems quite long-winded, with a lot of intermediate steps. Performance also seems sluggish. Is there a better and/or more direct way to generate materialized views using ksqlDB and Debezium? Can any of the steps be cut out, and/or should I be using a different approach in step 3 (such as a windowing function)?
I'm particularly keen to ensure that the approach taken is the most efficient from a performance and resource usage perspective.
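As a point of reference for what the three steps compute, here is a minimal Python sketch (not ksqlDB code) that folds raw Debezium envelopes into a latest-row-per-key table. The function name materialize and the sample rows are hypothetical; only the 'before'/'after'/'op' envelope fields come from Debezium. It assumes a delete event carries a null 'after' and the old row in 'before'.

```python
from typing import Any, Dict, Iterable, Optional

Row = Dict[str, Any]


def materialize(events: Iterable[Optional[Dict[str, Any]]], key: str) -> Dict[Any, Row]:
    """Fold Debezium change events into a latest-row-per-key table."""
    table: Dict[Any, Row] = {}
    for event in events:
        if event is None:  # tombstone record: nothing to unwrap
            continue
        after = event.get("after")
        before = event.get("before")
        if after is not None:  # insert, update or snapshot read
            table[after[key]] = after
        elif before is not None:  # delete: 'after' is null
            table.pop(before[key], None)
    return table


# Hypothetical sample events for the user table in the question.
ann = {"user_id": 1, "username": "ann", "email": "ann@example.com"}
bob = {"user_id": 2, "username": "bob", "email": "bob@example.com"}
ann_v2 = {"user_id": 1, "username": "ann", "email": "ann@new.example.com"}

events = [
    {"op": "c", "before": None, "after": ann},
    {"op": "c", "before": None, "after": bob},
    {"op": "u", "before": ann, "after": ann_v2},
    {"op": "d", "before": bob, "after": None},
]

users = materialize(events, "user_id")
print(users)
```

The delete branch is worth noting: a pipeline that only reads the 'after' struct has no row to read for a delete, so deletes need separate handling if the table must shrink.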

Related

Is this the right use case for KSQL?

I am using KStreams and need to de-duplicate the data. The source ingests duplicated data for several reasons, e.g. the data itself is duplicated, or re-partitioning.
I currently use Redis for this use case, with data stored as follows:
id#object list-of-applications-processed-this-id-and-this-object
As KSQL is implemented on top of RocksDB, which is also a key-value database, can I use KSQL for this use case?
After successful processing, I would add an entry to KSQL. On receipt, I would check whether the id exists in KSQL.
Is this a correct use case for KSQL's design in the event-processing world?
If you want to use ksqlDB as a cache, you can create a TABLE using the topic as the data source. Note that a CREATE TABLE statement by itself only declares a schema (it does not pull any data into ksqlDB yet).
CREATE TABLE inputTable <schemaDefinition> WITH(kafka_topic='...');
Check out the docs for more details: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-table/
To pull in the data, you can create a second table via:
CREATE TABLE cache AS SELECT * FROM inputTable;
This will run a query in the background that reads the input data and puts the result into the ksqlDB server. Because the query is a simple SELECT *, it effectively pulls in all data from the topic. You can now issue "pull queries" (i.e., lookups) against the result to use TABLE cache as desired: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/
Future work:
We are currently working on adding "source tables" (cf https://github.com/confluentinc/ksql/pull/7474) that will make this setup simpler. If you declare a source table, you can do the same with a single statement instead of two:
CREATE SOURCE TABLE cache <schemaDefinition> WITH(kafka_topic='...');
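The check-then-record flow from the question can be sketched in plain Python, independent of whether the lookups are served by Redis or by pull queries against a ksqlDB table. The class ProcessedStore and its method names are hypothetical; the in-memory dict stands in for the key-value store, and the key layout follows the id#object format from the question.

```python
from typing import Dict, Set


class ProcessedStore:
    """In-memory stand-in for the key-value store used for de-duplication."""

    def __init__(self) -> None:
        # "id#object" -> names of applications that already processed it
        self._seen: Dict[str, Set[str]] = {}

    @staticmethod
    def _key(record_id: str, obj: str) -> str:
        return f"{record_id}#{obj}"

    def already_processed(self, record_id: str, obj: str, app: str) -> bool:
        return app in self._seen.get(self._key(record_id, obj), set())

    def mark_processed(self, record_id: str, obj: str, app: str) -> None:
        self._seen.setdefault(self._key(record_id, obj), set()).add(app)


store = ProcessedStore()
first_seen = store.already_processed("42", "order", "billing")
store.mark_processed("42", "order", "billing")
second_seen = store.already_processed("42", "order", "billing")
other_app = store.already_processed("42", "order", "shipping")
print(first_seen, second_seen, other_app)
```

With a real store, the lookup and the write are separate operations, so two instances handling the same id at the same time could both pass the check. Whether that matters depends on how the input is partitioned.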

Audit data changes with Debezium

I have a use case where I want to audit the DB table data changes into another table for compliance purposes. Primarily, any changes to the data like Inserts/Updates/Deletes should be audited. I found different options like JaVers, Hibernate Envers, Database triggers, and Debezium.
I am avoiding JaVers and Hibernate Envers because they will not capture data changes made through direct SQL queries or through other applications. The other issue I see is that we would need to add the audit-related code to the main application code in the same transaction boundary.
I am also avoiding the usage of database triggers as we are not using triggers at all for any of the deployments.
That leaves me with Debezium, which looks promising. My only concern is that we need to use Kafka to leverage Debezium. Is Kafka necessary for using Debezium if both the primary table and the audit table sit in the same DB instance?
Debezium is perfect for auditing, but given it is a source connector, it represents just one part of the data pipeline in your use case. You will capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or local disk, and then you need a sink connector (e.g. Camel Kafka SQL or JDBC, kafka-connect-jdbc) to insert into the target table.
For the same transaction boundary requirement you can use the Outbox Pattern if the eventual consistency is fine. There is also an Outbox Event Router SMT component that is part of the project.
Note that Debezium can also run embedded in a standalone Java application, storing the offset on local disk, but you lose the HA capability given by Kafka Connect running in distributed mode. With the embedded mode, you are also switching from a configuration-driven approach to a code-driven one.
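Whichever sink is used, the mapping from a change event to an audit row is small. Below is a minimal Python sketch, assuming the standard Debezium envelope fields op, ts_ms, before and after; the function name to_audit_row and the audit column names are hypothetical.

```python
from typing import Any, Dict

# Debezium operation codes mapped to readable audit labels.
OP_NAMES = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "SNAPSHOT"}


def to_audit_row(event: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one change event into a row for a hypothetical audit table."""
    return {
        "operation": OP_NAMES.get(event["op"], "UNKNOWN"),
        "changed_at_ms": event.get("ts_ms"),
        "old_values": event.get("before"),  # None for inserts
        "new_values": event.get("after"),   # None for deletes
    }


# Hypothetical update event on a customer row.
sample = {
    "op": "u",
    "ts_ms": 1700000000000,
    "before": {"id": 7, "status": "open"},
    "after": {"id": 7, "status": "closed"},
}
audit_row = to_audit_row(sample)
print(audit_row)
```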
I found Debezium to be a very comprehensive solution, and it is open source, backed by Red Hat. That gives it not only credibility but also the assurance that it will continue to be supported.
It provides a rich configuration to whitelist, blacklist databases/tables/columns (with wild card patterns), along with controls to limit data in really large columns.
Since it is driven by the binlog, you get not only the current state but also the previous state. This is ideal for audit trails, and you can build a custom sink to Elastic topics (one per table).
Using Kafka is necessary to account for HA and latency when bulk updates are made to the DB, even though the primary and audit tables are in the same DB.

Kafka Connect and Custom Queries

I'm interested in using the Kafka JDBC source connector to publish to Kafka when an Invoice gets created. On the source end, the data is split across two tables, Invoice and InvoiceLine.
Is this possible using custom queries? What would the query look like?
Also, since it's polling, could what gets published contain one or more invoices in a topic?
Thanks
Yes, you can use custom queries. From the docs:
Custom Query: The source connector supports using custom queries instead of copying whole tables. With a custom query, one of the other update automatic update modes can be used as long as the necessary WHERE clause can be correctly appended to the query. Alternatively, the specified query may handle filtering to new updates itself; however, note that no offset tracking will be performed (unlike the automatic modes where incrementing and/or timestamp column values are recorded for each record), so the query must track offsets itself.
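As a rough illustration of what such a connector configuration could look like, the Python snippet below builds the JSON payload for the Kafka Connect REST API. The column names (line_id, invoice_id, customer_id, product, amount), the connection URL, the connector name and the topic name are all placeholders, not taken from the question. The join is wrapped in a subquery on the assumption that the connector appends its own WHERE clause in incrementing mode.

```python
import json

# Wrap the join in a subquery so an appended WHERE clause stays valid.
query = (
    "SELECT * FROM ("
    "SELECT l.line_id, i.invoice_id, i.customer_id, l.product, l.amount "
    "FROM Invoice i JOIN InvoiceLine l ON l.invoice_id = i.invoice_id"
    ") AS invoice_lines"
)

config = {
    "name": "invoice-source",  # placeholder connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/sales",  # placeholder
        "mode": "incrementing",
        "incrementing.column.name": "line_id",  # assumed strictly increasing
        "query": query,
        "topic.prefix": "invoices",  # used as the topic name when a query is set
        "poll.interval.ms": "5000",
    },
}

print(json.dumps(config, indent=2))
```

Each row returned by the query becomes one record, so under these assumptions an invoice with several lines would show up as several records, and a single poll may pick up rows from more than one invoice.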

How to de-normalize data in Kafka?

I have a MySQL database with ~20 tables. The data is normalized.
Considering this example:
book -> book_authors <- authors
we try to stream the books info eg.:
{book_id:3, title='Red', authors:[{id:3, name:'Mary'}, {id:4, name:'John'}]}
One case where we see a serious problem: if an author's name changes, we have to re-generate all their books.
I'm using Debezium to post the change log for each table in Kafka.
I am unable to find an elegant solution for data denormalization, e.g. for adding it to Elasticsearch, MongoDB, etc.
I identified two solutions, but both seem to fail:
1. De-normalize data into a new MySQL table, at source, and use Debezium to stream only this new table. This might not be possible, and we would have to invest a lot of effort in changing the code of the source system.
2. Join the streams in Kafka, though I didn't manage to make it work. It seems that Kafka does not allow joining on a non-primary-key field. This seems a common situation with N-to-N relations.
Did anyone find a solution to data denormalization and publish data into a Kafka stream? This seems to be a common problem and I couldn't find any solution yet.
Try publishing the changes from Debezium to the topics book, book_authors and authors in its raw form, which creates three disjoint streams.
Create a simple consumer application that subscribes to all three topics. Upon receiving a message on any of the topics, it queries the database to obtain the latest snapshot of the referenced entities, merges the data together, and publishes the denormalised version onto a new merged_book_authors topic. Downstream consumers can read directly from the merged topic.
A minor variation of the above: rather than querying the database for each Debezium change, which may be slow, build a materialised view using a fast key-value or document store such as Redis. This is a little more work, but will (1) improve the throughput of the overall pipeline and (2) take the load off the system-of-record database.
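The materialised-view variation can be sketched in Python with plain dicts standing in for the key-value store. The class BookView and its method names are hypothetical; each method returns the denormalised documents that would be published to the merged topic. Note how a change to one author re-emits every book linked to that author, which is the case the question is worried about.

```python
from typing import Any, Dict, List, Set


class BookView:
    """Materialised view over book, authors and book_authors changes."""

    def __init__(self) -> None:
        self.books: Dict[int, str] = {}               # book_id -> title
        self.authors: Dict[int, str] = {}             # author_id -> name
        self.book_authors: Dict[int, Set[int]] = {}   # book_id -> author ids

    def _render(self, book_id: int) -> Dict[str, Any]:
        author_ids = sorted(self.book_authors.get(book_id, set()))
        return {
            "book_id": book_id,
            "title": self.books.get(book_id),
            "authors": [{"id": a, "name": self.authors.get(a)} for a in author_ids],
        }

    def upsert_book(self, book_id: int, title: str) -> List[Dict[str, Any]]:
        self.books[book_id] = title
        return [self._render(book_id)]

    def link(self, book_id: int, author_id: int) -> List[Dict[str, Any]]:
        self.book_authors.setdefault(book_id, set()).add(author_id)
        return [self._render(book_id)]

    def upsert_author(self, author_id: int, name: str) -> List[Dict[str, Any]]:
        self.authors[author_id] = name
        # Re-emit every book that references this author.
        affected = sorted(b for b, ids in self.book_authors.items() if author_id in ids)
        return [self._render(b) for b in affected]


view = BookView()
view.upsert_book(3, "Red")
view.upsert_author(3, "Mary")
view.upsert_author(4, "John")
view.link(3, 3)
view.link(3, 4)
updates = view.upsert_author(4, "Jon")  # author rename
print(updates)
```

The scan in upsert_author is linear in the number of books. A reverse map from author id to book ids would avoid that at the cost of one more structure to keep in sync.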

Storing relational data in Apache Flink as State and querying by a property

I have a database with Tables T1(id, name, age) and T2(id, subject).
Flink receives all updates from the database as an event stream using something like Debezium. The tables are related to each other, and the required data can be extracted by joining T1 with T2 on id. Currently the whole state of the database is stored in Flink MapState with id as the key. The problem is that I now need to select rows from T1 by name, without using id. It seems I need an index on T1(name) to make this faster. Is there any way to index it automatically, without having to manually create an index for each table? What is the recommended way of doing this? I know about SQL streaming on tables, but I require support for updates to the tables. By the way, I use Flink with Scala. Any pointers/suggestions would be appreciated.
My understanding is that you are connecting T1 and T2, and storing some representation (in MapState) of the data from these two streams in keyed state, keyed by id. It sounds like T1 and T2 are evolving over time, and you want to be able to interactively query the join at any time by specifying a name.
One idea would be to broadcast in the name(s) you want to select, and use a KeyedBroadcastProcessFunction to process them. In its processBroadcastElement method you could use ctx.applyToKeyedState to compute the results by extracting data from the MapState records (which would have to be held in this operator). I suspect you will want to use the names as the keys in these MapState records, so that you don't have to iterate over all of the entries in each map to find the items of interest.
You will find a somewhat similar example of this pattern in https://training.data-artisans.com/exercises/ongoingRides.html.
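The idea of keying the records by name amounts to keeping a secondary index alongside the primary map and updating both on every change event. The Python sketch below shows only that bookkeeping; it is not Flink code, and the class IndexedTable and its methods are hypothetical. In Flink, the two dicts would roughly correspond to state entries maintained by the operator.

```python
from typing import Dict, List, Set, Tuple


class IndexedTable:
    """T1(id, name, age) with a secondary index on name."""

    def __init__(self) -> None:
        self.rows: Dict[int, Tuple[str, int]] = {}   # id -> (name, age)
        self.by_name: Dict[str, Set[int]] = {}       # name -> ids

    def upsert(self, row_id: int, name: str, age: int) -> None:
        self.delete(row_id)  # drop the stale index entry if the name changed
        self.rows[row_id] = (name, age)
        self.by_name.setdefault(name, set()).add(row_id)

    def delete(self, row_id: int) -> None:
        old = self.rows.pop(row_id, None)
        if old is None:
            return
        ids = self.by_name[old[0]]
        ids.discard(row_id)
        if not ids:
            del self.by_name[old[0]]

    def find_by_name(self, name: str) -> List[int]:
        return sorted(self.by_name.get(name, set()))


t1 = IndexedTable()
t1.upsert(1, "ann", 30)
t1.upsert(2, "ann", 41)
t1.upsert(1, "bob", 31)  # update moves id 1 from "ann" to "bob"
print(t1.find_by_name("ann"), t1.find_by_name("bob"))
```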