Build real-time Data Warehouse using Apache Kafka & Mysql - apache-kafka

We have a data warehouse on MySQL with dimensions, fact tables and some aggregate tables.
I want to implement the same structure using Apache Kafka where Source would be another MySQL server. I want to address the below requirements
Add foreign keys to fact tables
Add aggregate tables
Handle schema changes
I read somewhere that dimension-fact table architecture cannot be created well using Kafka because star schema(dimensions and fact tables) should be updated in order(first load dimensions then fact tables).
If a fact goes into fact table before its dimension in dim table then FK would be NULL.
How we can handle these problems using Kafka Streams(using Python)?
Is it possible to do this using only Source & Sink connector?

Related

Is it right use-case of KSql

I am using KStreams where I need to de-duplicate the data. Source ingests duplicated data due to many reasons i.e data itself duplicate, re-partitioning.
Currently using Redis for this use-case where data is stored something as below
id#object list-of-applications-processed-this-id-and-this-object
As KSQL is implemented on top of RocksDB which is also a Key-Value database, can I use KSql for this use case?
At the time of successful processing, I would add an entry to KSQL. At the time of reception, I will have to check the existence of the id in KSQL.
Is it correct use case as per KSql design in the event processing world?
If you want to use to use ksqlDB as a cache, you can create a TABLE using the topic as data source. Note that a CREATE TABLE statement by itself, does only declare a schema (it does not pull in any data into ksqlDB yet).
CREATE TABLE inputTable <schemaDefinition> WITH(kafka_topic='...');
Check out the docs for more details: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-table/
To pull in the data, you can create a second table via:
CREATE TABLE cache AS SELECT * FROM inputTable;
This will run a query in the background, that read the input data and puts the result into the ksqlDB server. Because the query is a simple SELECT * it effectively pulls in all data from the topic. You can now issue "pull queries" (i.e, lookups) against the result to use TABLE cache as desired: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/
Future work:
We are currently working on adding "source tables" (cf https://github.com/confluentinc/ksql/pull/7474) that will make this setup simpler. If you declare a source table, you can do the same with a single statement instead of two:
CREATE SOURCE TABLE cache <schemaDefinition> WITH(kafka_topic='...');

Audit data changes with Debezium

I have a use case where I want to audit the DB table data changes into another table for compliance purposes. Primarily, any changes to the data like Inserts/Updates/Deletes should be audited. I found different options like JaVers, Hibernate Envers, Database triggers, and Debezium.
I am avoiding using JaVers, and Hibernate Envers as this will not capture any data change that happens through direct SQL queries and any data change that happens through other applications. The other issue I see is we need to add the audit-related code to the main application code in the same transaction boundary.
I am also avoiding the usage of database triggers as we are not using triggers at all for any of the deployments.
Then I left with Debezium which is promising. But, the only concern that I have is that we need to use Kafka to leverage Debezium. Is Kafka's usage is necessary to use Debezium if both the primary table and the audit table sit in the same DB instance?
Debezium is perfect for auditing, but given it is a source Connector, it represents just one part of the data pipeline in your use case. You will capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or local disk and then you need a Sink Connector (i.e. Camel Kafka SQL or JDBC, kafka-connect-jdbc) to insert into the target table.
For the same transaction boundary requirement you can use the Outbox Pattern if the eventual consistency is fine. There is also an Outbox Event Router SMT component that is part of the project.
Note that Debezium can also run embedded in a standalone Java application, storing the offset on local disk, but you lose the HA capability given by KafkaConnect running in distributed mode. With the embedded mode, you are also swtiching from a configuration-driven approach to a code-driven one.
I found Debezium to be a very comprehensive solution, and it is open source backed by Redhat. That gives it not only the credibility, but the security that it is going to be supported.
It provides a rich configuration to whitelist, blacklist databases/tables/columns (with wild card patterns), along with controls to limit data in really large columns.
Since it is driven from BinLogs, you not only get the current state, you also get the previous state. This is ideal for audit trails, and you can customize building a proper Sync to elastic topics (one for table).
Use of Kafka is necessary to account for HA and latency when bulk updates are made on DB, even though Primary and Audit tables are in same DB.

ksqlDB: Best way to create Tables from a Debezium source topics?

I would like to create Tables in ksqlDB from Debezium source topics, with the ultimate aim of performing a left join on these tables and efficiently outputting materialized views to a downstream database using the JDBC sink connector.
The Debezium source topics have not had any transforms applied (such as ExtractNewRecordState), so contain a 'before' and 'after' property, as described in the Debezium documentation here.
The reason for not applying the ExtractNewRecordState transform (which would presumably simplify matters) is that the source CDC topics may be used for various purposes and it does not appear possible to create multiple topics off the same source database table (since topic names are automatically determined by Debezium and depend on the database server, schema and table name as described here).
The best approach I have found so far is to:
create a stream in ksqlDB from the raw Debezium input, e.g.:
CREATE STREAM user_stream WITH (KAFKA_TOPIC='mssql.dbo.user', VALUE_FORMAT='AVRO');
create a second stream selecting the required fields from the 'after' property of the first stream, e.g.:
CREATE STREAM user_stream2 AS SELECT AFTER->user_id, AFTER->username, AFTER->email FROM user_stream EMIT CHANGES;
finally, convert the second stream to a table as described here, namely:
SELECT user_id,
LATEST_BY_OFFSET(username) AS username,
LATEST_BY_OFFSET(email) AS email
FROM user_stream2
GROUP BY user_id
EMIT CHANGES;
These steps must be repeated to generate each Table, at which point a join can be performed on the Tables to produce an output.
This seems quite long-winded, with a lot of intermediate steps. Performance also seems sluggish. Is there a better and/or more direct way to generated materialized views using ksqlDB and Debezium? Can any of the steps be cut out and/or should I be using a different approach in step 3 (such as a windowing function)?
I'm particularly keen to ensure that the approach taken is the most efficient from a performance and resource usage perspective.

Build table of tables from other databases in Postgres - (Multiple-Server Parallel Query Execution?)

I am trying to find the best solution to build a database relation. I need something to create a table that will contain data split across other tables from different databases. All the tables got exactly the same structure (same column number, names and types).
In the single database, I would create a parent table with partitions. However, the volume of the data is too big to do it in a single database that's why I am trying to do a split. From the Postgres documentation what I think I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment the only solution I think to implement is to build API of databases address and use it to get data across the network into the main parent database when needed. I also found Postgres external extension called Citus that might do the job but I don't know how to implement the unique key across multiple databases (or Shards like Citus call it).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you use unique keys across shards if it is the distribution column, or if it is a composite key and contains the distribution column.
You can also use distributed-partitioned table in citus. That is a partitioned table on some column (timestamp ?) and hash distributed table on some other column (like what you use in your existing approach). Query parallelization and data collection would be handled by Citus for you.

data integration-, multiple databases, unique incremental SOR_id using talend

I'm trying to integrate multiple databases using talend and in turn have an SOR_id for each table for auditing purposes. is it possible to map between multiple source tables simultaneously to destination table having an SOR_id which is meant to be auto incremented? Would I have incremental values for each source tables rows
I have approached this using another way as shown in the image so that my SOR_id can be accounted for.