Best way to join two (or more) kafka topics in KSQL emitting changes from all topics? - apache-kafka

We have a "microservices" platform and we are using debezium for change data capture from databases on these platforms which is working nicely.
Now, we'd like to make it easy for us to join these topics and stream the results into a new topic which could be consumed by multiple services.
Disclaimer: this assumes v0.11 ksqldb and cli (seems like much of this might not work in older versions)
Example of two tables from two database instances that stream into Kafka topics:
-- source identity microservice (postgres)
CREATE TABLE public.user_entity (
id varchar(36) NOT NULL,
first_name varchar(255) NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
-- source organization microservice (postgres)
CREATE TABLE public.user_info (
id varchar(36) NOT NULL,
user_entity_id varchar(36) NOT NULL,
business_unit varchar(255) NOT NULL,
cost_center varchar(255) NOT NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
Option 1 : Streams
CREATE STREAM stream_user_info_by_user_entity_id
AS SELECT * FROM stream_user_info
PARTITION BY user_entity_id
EMIT CHANGES;
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN stream_user_info_by_user_entity_id ui WITHIN 365 DAYS ON ue.id = ui.user_entity_id
EMIT CHANGES;
Notice WITHIN 365 DAYS: conceptually these tables could go a very long time without being changed, so the window would have to be effectively infinite. This looks fishy and hints that this is not a good way to do this.
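To make the window concern concrete, here is a toy Python model of a windowed left join. It is only a sketch of the idea, not ksqlDB's implementation: the function name, the sample records, and the one-sided iteration over the left stream are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def windowed_left_join(left, right, within):
    """Toy left join: pair each left record with right records that share
    its key and whose timestamps lie within `within` of the left timestamp."""
    results = []
    for l_ts, l_key, l_val in left:
        matches = [r_val for r_ts, r_key, r_val in right
                   if r_key == l_key and abs(l_ts - r_ts) <= within]
        if matches:
            results.extend((l_key, l_val, m) for m in matches)
        else:
            # LEFT JOIN keeps the left row even when nothing matches
            results.append((l_key, l_val, None))
    return results

# Hypothetical data: the user_info row last changed about 580 days before
# the user_entity row did.
user_entity = [(datetime(2021, 1, 1), "u1", "Alice")]
user_info = [(datetime(2019, 6, 1), "u1", "Finance")]

joined_short = windowed_left_join(user_entity, user_info, timedelta(days=365))
joined_long = windowed_left_join(user_entity, user_info, timedelta(days=3650))
```

With the 365-day window the business unit comes back as None, because the matching row falls outside the window. Only the much larger window finds it, which is why any fixed window feels wrong for slowly changing rows.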
Option 2 : Tables
CREATE TABLE ktable_user_info_by_user_entity_id (
user_entity_id VARCHAR PRIMARY KEY,
business_unit VARCHAR,
cost_center VARCHAR
)
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
We no longer need the window WITHIN 365 DAYS, so this feels more correct. However, this only emits a change when a message arrives on the stream, not when the table is updated.
In this example:
User updates first_name -> change is emitted
User updates business_unit -> no change emitted
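The asymmetry in this example can be sketched with a toy Python model of a stream-table join. This is an illustration of the behaviour described above, not ksqlDB code. The event layout and function name are assumptions.

```python
def stream_table_join(events):
    """events: ordered list of (side, key, value), side is "stream" or "table"."""
    table = {}
    emitted = []
    for side, key, value in events:
        if side == "table":
            # A table-side record only updates state; nothing is emitted.
            table[key] = value
        else:
            # A stream-side record looks up the current table state and emits.
            emitted.append((key, value, table.get(key)))
    return emitted

events = [
    ("table", "u1", "Finance"),    # initial business_unit
    ("stream", "u1", "Alice"),     # first_name set: a row is emitted
    ("table", "u1", "Marketing"),  # business_unit updated: nothing emitted
    ("stream", "u1", "Alicia"),    # first_name updated: emitted with latest table state
]
output = stream_table_join(events)
```

The change to Marketing only becomes visible downstream once the next stream-side record for the same key arrives.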
Perhaps there might be a way to create a merged stream partitioned by the user_entity_id and join to child tables which would hold the current state, which leads me to....
Option 3 : Merged stream and tables
-- "master" change stream with merged stream output
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
CREATE STREAM stream_user_entity_by_id
AS SELECT * FROM stream_user_entity
PARTITION BY id
EMIT CHANGES;
CREATE TABLE ktable_user_entity_by_id (
id VARCHAR PRIMARY KEY,
first_name VARCHAR
) with (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
This one looks the best, but it has a lot of moving components: for each table we have 2 streams, 1 insert query, and 1 ktable. Another potential issue here might be a hidden race condition where the stream emits the change before the table is updated under the covers.
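A toy Python model of this option shows why every change produces output. It is a sketch only. It assumes each table is updated before the matching change record is joined, which is exactly the ordering the race-condition worry is about. The dictionaries merely stand in for the two ktables.

```python
def merged_change_join(changes):
    """changes: ordered list of (source, user_entity_id, payload_dict)."""
    user_entity = {}  # stands in for ktable_user_entity_by_id
    user_info = {}    # stands in for ktable_user_info_by_user_entity_id
    emitted = []
    for source, uid, payload in changes:
        # Assumption: the table update lands before the change record is joined.
        if source == "user_entity":
            user_entity[uid] = payload
        else:
            user_info[uid] = payload
        # Every record on the merged change stream triggers a lookup in both tables.
        ue = user_entity.get(uid, {})
        ui = user_info.get(uid, {})
        emitted.append((uid, ue.get("first_name"), ui.get("business_unit")))
    return emitted

changes = [
    ("user_entity", "u1", {"first_name": "Alice"}),
    ("user_info", "u1", {"business_unit": "Finance"}),
    ("user_info", "u1", {"business_unit": "Marketing"}),
]
rows = merged_change_join(changes)
```

If the lookup ran before the table update instead, the emitted row would carry the previous value. That is the stale result the race condition would produce.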
Option 4 : More merged tables and streams
CREATE STREAM stream_user_entity_changes_enriched
AS SELECT
ue.id AS user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_entity_by_id ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
CREATE STREAM stream_user_info_changes_enriched
AS SELECT
ui.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_info_by_user_entity_id ui
LEFT JOIN ktable_user_entity_by_id ue ON ui.user_entity_id = ue.id
EMIT CHANGES;
CREATE STREAM stream_user_changes_enriched (user_entity_id VARCHAR, first_name VARCHAR, business_unit VARCHAR, cost_center VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes_enriched', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_entity_changes_enriched;
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_info_changes_enriched;
This is conceptually the same as the earlier one, but the "merging" happens after the joins. Conceivably, this might eliminate any potential race condition because we're selecting primarily from the streams instead of the tables.
The downside is that the complexity is even worse than option 3 and writing and tracking all these streams for any joins with more than two tables would be kind of mind numbing...
Question :
What method is the best for this use case, and/or are we attempting to do something that ksql shouldn't be used for? Would we be better off just offloading this to a traditional RDBMS or Spark alternatives?

I'm going to attempt to answer my own question, only accept if upvoted.
The answer is: Option 3
Here are the reasons why this would be the best option for this use case, though this is perhaps subjective:
Streams partitioned by primary and foreign keys are common and simple.
Tables based on these streams are common and simple.
Tables used in this way will not cause a race condition.
All options have merits, e.g. if you don't care about emitting all the changes or if the data behaves like streams (logs or events) instead of slow changing dimensions (sql tables).
As for "race conditions", the word "table" tricks you into thinking that you are actually processing and persisting data. In reality they are not physical tables; they behave more like sub-queries on streams. Note: aggregation tables, which actually produce topics, might be an exception (I would suggest that is a different topic, but would love to see comments).
In the end (syntax may have some slight bugs):
---------------------------------------------------------
-- shared objects (likely to be used by multiple queries)
---------------------------------------------------------
-- shared streams wrapping topics
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
-- shared keyed streams (i like to think of them as "indexes")
CREATE STREAM stream_user_entity_by_id AS
SELECT * FROM stream_user_entity PARTITION BY id
EMIT CHANGES;
CREATE STREAM stream_user_info_by_user_entity_id AS
SELECT * FROM stream_user_info PARTITION BY user_entity_id
EMIT CHANGES;
-- shared keyed tables (inferring columns with schema registry)
CREATE TABLE ktable_user_entity_by_id (id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
CREATE TABLE ktable_user_info_by_user_entity_id (user_entity_id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
---------------------------------------------------------
-- query objects (specific to the produced data)
---------------------------------------------------------
-- "master" change stream (include all tables in join)
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
-- pretty simple looking query
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
The "shared" objects are basically the streaming schema (temptation is to create for all our topics, but that's another question) and the second portion is like the query schema. It ultimately is a functional, clean, and repeatable pattern.

I like your approach number 3. I have tried to use it to merge streams with different primary keys into one master stream and then group them in a materialized view.
The join seems to work, but I ended up in the same situation as with a regular stream-table join: I see changes in the master stream, but those changes only propagate downstream (to the GROUP BY table) when they affect the first table in the stream-table join, not the others.
So basically what I have achieved is the following:
debezium --> create 3 streams: A, B, AB (AB is a matching table between IDs in A and IDs in B, used in Postgres to make an n-to-n join)
streams A, B, and AB are repartitioned by one ID (A_id) and merged into one stream. In this step all elements of B are assigned a virtual A_id, since it is not relevant to them
the 3 KTables are created (I still keep wondering why; is this a sort of self-join?)
a materialized view (table) is created by grouping the master stream after joining it to the 3 KTables
When a change in A, B, or AB happens, I see changes in the master stream too, but the materialized view is only updated when changes on stream B occur. Of course, destroying the table and recreating it makes it "up-to-date".
Are you facing the same problem?

Related

Generating a materialized join table with a many-to-one column in ksqldb

thanks in advance for any input!
I have the requirement of retrieving data from 4 of the below databases via 1 HTTP request. I've chosen to create a materialized table with KSQLDB which will contain all relevant data from the 4 database tables. My API Gateway will then query that table using KSQLDB's rest API.
My struggle is in creating 1 KSQLDB table to show information for a purchase order webpage, which consists of data from all 4 of the services shown below:
vendor_tbl
contract_tbl (has vendorId column referencing vendor_table's PK)
services_tbl (has contractId column referencing contract_table's PK)
purchase_order_tbl (has both vendorId && contractId columns referencing vendor/contract table's PKs)
The issue lies mainly with the services table, because it has a many-to-one relationship with contracts. 1 service is for 1 contract, but 1 contract can have many services, which is common.
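The many-to-one shape can be sketched in plain Python with hypothetical sample rows. The first result is what a row-per-service join yields, with contract columns repeated. The second groups service IDs under their contract. All field values here are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows; field names follow the tables listed above.
contracts = {"c1": {"vendorId": "v1"}}
services = [
    {"id": "s1", "contractId": "c1"},
    {"id": "s2", "contractId": "c1"},
]

# Row-per-service join: the contract and vendor columns repeat for every service.
joined = [
    (s["id"], s["contractId"], contracts[s["contractId"]]["vendorId"])
    for s in services
]

# One row per contract, with that contract's service IDs collected into a list.
by_contract = defaultdict(list)
for s in services:
    by_contract[s["contractId"]].append(s["id"])
```

The first shape is keyed by service ID, the second by contract ID. Which of the two is the primary key of the final table is the crux of the problem.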
The structure as is works perfectly in the RDBMS context, but I'm struggling to find any way using KSQLDB to create 1 materialized table containing all of the service IDs for the PO, along with data from the corresponding contract, vendor && PO tables...
What I have tried:
1.
I have tried creating 2 streams, 1 to join 2/4 of the streams each and then "daisy-chaining" the 2 joined streams as such:
CREATE STREAM po_vendor_join AS SELECT * FROM po_stream p INNER JOIN vendors_tbl v ON p.vendorId = v.id;
CREATE STREAM service_contract_join AS SELECT * FROM services_stream s INNER JOIN contracts_tbl c ON s.contractId = c.id;
This works, but of course there are duplicate entries in the service_contract_join stream. The next step is to create a table from these 2 streams, but I cannot do this because of the aggregation/GROUP BY requirements, which require that every column in the table be part of the aggregation. I understand that I would have duplicates in OTHER columns (the vendor PK and contract PK are referenced in multiple tables, for instance), but the PK itself is UNIQUE and needs to be the id of the services_tbl in this case, since there are multiple services for a PO. (Note that I have also tried OUTER LEFT && RIGHT joins on the streams.)
2.
I have tried "staggering" between table/stream, as such:
CREATE TABLE vendors_tbl (id VARCHAR PRIMARY KEY, name VARCHAR) WITH (KAFKA_TOPIC='vendors', VALUE_FORMAT='json', PARTITIONS=1);
CREATE STREAM contracts_stream (id VARCHAR, vendorId VARCHAR) WITH (KAFKA_TOPIC='contracts', VALUE_FORMAT='json', PARTITIONS=1);
And then rendering a joined stream between the table/stream like so:
CREATE STREAM po_vendor_join
AS SELECT p.*, v.*
FROM po_stream p
INNER JOIN vendors_tbl v ON p.vendorId = v.id;
But then when trying to join the stream, of course I am met with the same restrictions as in attempt # 1.
3.
I have tried making all 4 of the services tables, and simply modeling them as non-queryable tables.
The problem with this approach arises when I try to create a join table between 2 of the tables, like below:
CREATE TABLE PO_HOLLISTIC_AGGREGATE
AS
SELECT * FROM services_contracts_join_tbl scj
INNER JOIN po_vendors_join_tbl pvj ON scj.c_vendorId = pvj.v_id;
I receive the error here stating that the join needs to be on the PK of the right table, but again, the PK required here would be of the services table because there are multiple.
This makes me conclude that the only way this would be possible with KSQLDB is if I stored the service PK on the other 3 streams, which wouldn't be doable either really because of the aggregation restrictions when creating a table via joined streams.
I'd appreciate any ideas, thanks again!

How to correctly GROUP BY on jdbc sources

I have a Kafka stream with user_id and want to produce another stream with user_id and number of records in a JDBC table.
Following is how I tried to achieve this (I'm new to flink, so please correct me if that's not how things are supposed to be done). The issue is that flink ignores all updates to JDBC table after the job has started.
As far as I understand the answer to this is to use lookup joins but flink complains that lookup joins are not supported on temporal views. Also tried to use versioned views without much success.
What would be the correct approach to achieve what I want?
CREATE TABLE kafka_stream (
user_id STRING,
event_time TIMESTAMP(3) METADATA FROM 'timestamp',
WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
-- ...
)
-- NEXT SQL --
CREATE TABLE jdbc_table (
user_id STRING,
checked_at TIMESTAMP,
PRIMARY KEY(user_id) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
-- ...
)
-- NEXT SQL --
CREATE TEMPORARY VIEW checks_counts AS
SELECT user_id, count(*) as num_checks
FROM jdbc_table
GROUP BY user_id
-- NEXT SQL --
INSERT INTO output_kafka_stream
SELECT
kafka_stream.user_id,
checks_counts.num_checks
FROM kafka_stream
LEFT JOIN checks_counts ON kafka_stream.user_id = checks_counts.user_id
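The behaviour described in this question can be modelled with a small Python sketch. It assumes, as the question reports, that the JDBC table is read once when the job starts, and contrasts that with a lookup performed for every incoming event. The function and the timeline format are hypothetical and are not Flink APIs.

```python
def run_job(initial_counts, timeline, lookup_per_event):
    db = dict(initial_counts)        # the external table, which keeps changing
    snapshot = dict(initial_counts)  # what a one-off scan sees at job start
    output = []
    for kind, user_id, value in timeline:
        if kind == "db_update":
            db[user_id] = value      # the table changes while the job is running
        else:
            # An "event" record arriving on the Kafka stream is joined here.
            source = db if lookup_per_event else snapshot
            output.append((user_id, source.get(user_id)))
    return output

timeline = [
    ("event", "u1", None),
    ("db_update", "u1", 2),   # num_checks for u1 goes from 1 to 2
    ("event", "u1", None),
]
scan_once = run_job({"u1": 1}, timeline, lookup_per_event=False)
lookup = run_job({"u1": 1}, timeline, lookup_per_event=True)
```

The scan-once run keeps reporting the count seen at start-up, while the per-event lookup picks up the update.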

Create Table without data aggregation

I just started to use the ksqlDB Confluent feature, and it stood out that it is not possible to proceed with the following command: CREATE TABLE AS SELECT A, B, C FROM [STREAM_A] [EMIT CHANGES];
I wonder why this is not possible or if there's a way of doing it?
Data aggregation here feels like a heavy process for such a simple need.
Edit 1: Source is a STREAM and not a TABLE.
The field types are:
String
Integers
Record
Let me share an example of the executed command that returns an error as a result.
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
[EMIT CHANGES];
create a table with a smaller dataset and fewer fields than the original topic
I think the confusion here is that you talk about a TABLE, but you're actually creating a STREAM. The two are different types of object.
A STREAM is an unbounded series of events - just like a Kafka topic. The only difference is that a STREAM has a declared schema.
A TABLE is state, for a given key. It's the same as KTable in Kafka Streams if you're familiar with that.
Both are backed by Kafka topics.
So you can do this - note that it's creating a STREAM not a TABLE
CREATE STREAM test_stream
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL;
If you really want to create a TABLE then use the LATEST_BY_OFFSET aggregation, assuming you're using id as your key:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, LATEST_BY_OFFSET(timestamp)
, LATEST_BY_OFFSET(servicename)
, LATEST_BY_OFFSET(content->assignedcontent)
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
GROUP BY id;
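As a rough mental model of what this aggregation produces, the Python sketch below keeps the most recent non-filtered record per id. It is a simplification: it overwrites the whole row at once rather than tracking each column separately, and the sample records are made up.

```python
def latest_by_offset(stream, key_field):
    """Fold an append-only stream into a table holding the latest row per key."""
    table = {}
    for record in stream:  # records are assumed to be in offset order
        if record.get("assignedcontent") is None:
            continue       # mirrors the WHERE ... IS NOT NULL filter
        # Later records for the same key replace earlier ones.
        table[record[key_field]] = {
            k: v for k, v in record.items() if k != key_field
        }
    return table

created_stream = [
    {"id": 1, "servicename": "a", "assignedcontent": "x"},
    {"id": 2, "servicename": "b", "assignedcontent": None},  # filtered out
    {"id": 1, "servicename": "c", "assignedcontent": "y"},   # replaces the first row
]
test_table = latest_by_offset(created_stream, "id")
```

The stream keeps all three events, while the table holds a single current row for id 1.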

Key while creating KSQL Stream

1) Is a key required on the stream on which you want to perform an aggregate function? I have read several blogs, and also a recommendation from Confluent, that a KEY is required for aggregation functions to work.
CREATE STREAM Employee (EmpId BIGINT, EmpName VARCHAR,
DeptId BIGINT, SAL BIGINT) WITH (KAFKA_TOPIC='EmpTopic',
VALUE_FORMAT='JSON');
While defining the above Stream, I have not defined any KEY (ROWKEY is NULL). The underlying topic 'EmpTopic' also does not have a KEY.
I am performing aggregation function on the Stream.
CREATE TABLE SALBYDEPT AS
SELECT DeptId,
SUM(SAL)
FROM Employee
GROUP BY DeptId;
Please confirm whether performing Aggregation function on the above Stream requires a KEY on 'Employee' stream ie NOT NULL ROWKEY on 'Employee' Stream
2) As per Confluent documentation, "Windowing lets you control how to group records that have the same key for stateful operations, like aggregations or joins, into time spans. KSQL tracks windows per record key". Please help me understand the meaning of the above statement. Is it required the Stream should have NOT NULL KEY?
3) Will JOIN on Stream-Table retain the KEY
CREATE TABLE users
(registertime BIGINT,
userid VARCHAR,
gender VARCHAR,
regionid VARCHAR)
WITH (KAFKA_TOPIC = 'users',
VALUE_FORMAT='JSON',
KEY = 'userid');
CREATE STREAM pageviews
(viewtime BIGINT,
userid VARCHAR,
pageid VARCHAR)
WITH (KAFKA_TOPIC='pageviews',
VALUE_FORMAT='DELIMITED',
KEY='pageid',
TIMESTAMP='viewtime');
CREATE STREAM pageviews_transformed as
SELECT viewtime,
userid,
pageid,
TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
FROM pageviews;
CREATE STREAM pageviews_enriched AS
SELECT pv.viewtime,
pv.userid AS userid,
pv.pageid,
pv.timestring,
u.gender,
u.regionid,
u.interests,
u.contactinfo
FROM pageviews_transformed pv
LEFT JOIN users u ON pv.userid = u.userid;
Will JOIN on Stream-Table retain the 'UserId' as ROWKEY in the new Stream 'pageviews_enriched'
4) I have seen several examples from Confluent on Github where Stream used in JOIN is not KEY'ed. But as per the documentation, Stream should have NOT NULL ROWKEY participating the JOIN. Please confirm to have NOT NULL ROWKEY in the Stream.
Stream-Stream join and Stream-Table join. In the below example I am performing JOIN on Stream with NULL ROWKEY and Table. Is this valid?
CREATE TABLE users
(registertime BIGINT,
userid VARCHAR,
gender VARCHAR,
regionid VARCHAR)
WITH (KAFKA_TOPIC = 'users',
VALUE_FORMAT='JSON',
KEY = 'userid');
CREATE STREAM pageviews
(viewtime BIGINT,
userid VARCHAR,
pageid VARCHAR)
WITH (KAFKA_TOPIC='pageviews',
VALUE_FORMAT='DELIMITED',
TIMESTAMP='viewtime');
CREATE STREAM pageviews_transformed as
SELECT viewtime,
userid,
pageid,
TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
FROM pageviews;
CREATE STREAM pageviews_enriched AS
SELECT pv.viewtime,
pv.userid AS userid,
pv.pageid,
pv.timestring,
u.gender,
u.regionid,
u.interests,
u.contactinfo
FROM pageviews_transformed pv
LEFT JOIN users u ON pv.userid = u.userid;
CREATE TABLE SALBYDEPT AS
SELECT DeptId,
SUM(SAL)
FROM Employee
GROUP BY DeptId;
Please confirm whether performing Aggregation function on the above Stream requires a KEY on 'Employee' stream ie NOT NULL ROWKEY on 'Employee' Stream
You do not need a key on this stream. The key of the created table will be DeptId.
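A toy Python version of the same aggregation illustrates the point: the incoming record key is never consulted, and the result is keyed by the grouping column. The sample rows are hypothetical.

```python
from collections import defaultdict

def sum_by_dept(records):
    totals = defaultdict(int)
    for _key, value in records:  # the record key is ignored entirely
        totals[value["DeptId"]] += value["SAL"]
    return dict(totals)

# Every record has a null key, as in the Employee stream above.
employee = [
    (None, {"EmpId": 1, "DeptId": 10, "SAL": 100}),
    (None, {"EmpId": 2, "DeptId": 10, "SAL": 50}),
    (None, {"EmpId": 3, "DeptId": 20, "SAL": 70}),
]
salbydept = sum_by_dept(employee)
```

The resulting dictionary is keyed by DeptId even though no input record carried a key.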
As per Confluent documentation, "Windowing lets you control how to group records that have the same key for stateful operations, like aggregations or joins, into time spans. KSQL tracks windows per record key". Please help me understand the meaning of the above statement. Is it required the Stream should have NOT NULL KEY?
This means that when you create an aggregation you can do so over a time window, and that time window is part of the message key. For example, instead of aggregating all employee SAL (sales?), you could choose to do so over a time window, perhaps every hour or day. In that case you would have the aggregate key (DeptId), combined with the window key (e.g. for hourly 2019-06-23 06:00:00, 2019-06-23 07:00:00, 2019-06-23 08:00:00 etc)
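The combined key can be sketched in Python as a tuple of the grouping column and the window start. This is a toy model of an hourly tumbling window over hypothetical records, not ksqlDB's internal key format.

```python
from collections import defaultdict
from datetime import datetime

def hourly_sum(records):
    totals = defaultdict(int)
    for ts, dept_id, sal in records:
        # Truncate the timestamp to the hour to get the window start.
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        # The aggregate key is the pair (DeptId, window start).
        totals[(dept_id, window_start)] += sal
    return dict(totals)

records = [
    (datetime(2019, 6, 23, 6, 5), 10, 100),
    (datetime(2019, 6, 23, 6, 40), 10, 50),
    (datetime(2019, 6, 23, 7, 10), 10, 70),
]
windowed = hourly_sum(records)
```

The same DeptId appears under two keys, one per hourly window, which is what "tracks windows per record key" amounts to.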
Will JOIN on Stream-Table retain the KEY
It will retain the stream's key, unless you include a PARTITION BY in the DDL.
I have seen several examples from Confluent on Github where Stream used in JOIN is not KEY'ed. But as per the documentation, Stream should have NOT NULL ROWKEY participating the JOIN. Please confirm to have NOT NULL ROWKEY in the Stream.
Do you have a link to the specific documentation you're referencing? Whilst a table does need to be keyed, a stream does not (KSQL may handle this under the covers; I'm not sure).

KSQL - Determining When a Table Is Loaded

How can I determine when KSQL has fully loaded my data from a Kafka topic into my table?
GOAL: Take 2 Kafka topics, join them and write the results to a new Kafka topic.
EXAMPLE:
I am using Ksql's Rest API to issue the following commands.
CREATE TABLE MyTable (A1 VARCHAR, A2 VARCHAR) WITH (kafka_topic='topicA', key='A1', value_format='json');
CREATE STREAM MyStream (B1 varchar, B2 varchar) WITH (kafka_topic='topicB', value_format='json');
CREATE STREAM MyDestination WITH (Kafka_topic='topicC', PARTITIONS = 1, value_format='json') AS SELECT a.A1 as A1, a.A2 as A2, b.B1 as B1, b.B2 as B2 FROM MyStream b left join MyTable a on a.A1 = b.B1;
PROBLEM: topicC only has data from topicB, and all joined values are null.
Although I receive back a status of SUCCESS from the create table command, it appears that the data has not fully loaded into the table. Consequently the result of the 3rd command only has data from the stream and does not include data from the table. If I artificially delay before executing the join command, then the resulting topic will correctly have data from both topics. How can I determine when my table is loaded, and it is safe to execute the join command?
This is indeed a great question. At this point KSQL doesn't have a way to automatically execute a stream-table join only once the table is fully loaded, though this would indeed be a useful feature. A more general and related problem is discussed here: https://github.com/confluentinc/ksql/issues/1751
Tables in KSQL (and the underlying Kafka Streams) have a time dimension, i.e., they evolve over time. For a stream-table join, each stream record is joined with the "correct" table version (i.e., tables are versioned by time).
In the upcoming CP 5.1 release, you can "pre-load" the table by ensuring that all record timestamps of the table topic are smaller than the record timestamps of the stream topic. This tells KSQL that it needs to process the table topic data first, advancing the table's timestamp-version accordingly, before it can start joining.
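The timestamp rule can be illustrated with a toy Python model that merges both topics and processes records in timestamp order. It is a sketch of the idea only. The integer timestamps, the sample values, and the tie-break that lets table records go first are assumptions made for the example.

```python
def join_by_timestamp(table_records, stream_records):
    # Tag each record with its side (0 = table, 1 = stream) and sort by
    # timestamp; on equal timestamps the table record sorts first.
    merged = sorted(
        [(ts, 0, k, v) for ts, k, v in table_records]
        + [(ts, 1, k, v) for ts, k, v in stream_records]
    )
    table, out = {}, []
    for _ts, side, key, value in merged:
        if side == 0:
            table[key] = value  # advance the table to this point in time
        else:
            # Join against the table version as of this timestamp.
            out.append((key, value, table.get(key)))
    return out

stream_records = [(5, "k1", "b2")]
preloaded = join_by_timestamp([(1, "k1", "a2")], stream_records)
too_late = join_by_timestamp([(10, "k1", "a2")], stream_records)
```

When the table record carries the smaller timestamp the join finds it. When it carries a larger one, the stream record is joined against an empty table and the joined value is None, matching the nulls seen in topicC.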
For more details, check out: https://www.confluent.io/resources/streams-tables-two-sides-same-coin