Create Table without data aggregation - apache-kafka

I just started using Confluent's ksqlDB, and I noticed that it is not possible to run a command like the following: CREATE TABLE AS SELECT A, B, C FROM [STREAM_A] [EMIT CHANGES];
Why is this not possible, or is there a way of doing it?
Requiring data aggregation here feels like a heavyweight step for what should be a simple operation.
Edit 1: Source is a STREAM and not a TABLE.
The field types are:
String
Integers
Record
Here is an example of a command I executed that returns an error:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
EMIT CHANGES;

In short: create a table with a smaller dataset and fewer fields than the original topic.
I think the confusion here is that you talk about a TABLE, but you're actually creating a STREAM. The two are different types of object.
A STREAM is an unbounded series of events - just like a Kafka topic. The only difference is that a STREAM has a declared schema.
A TABLE is state, for a given key. It's the same as KTable in Kafka Streams if you're familiar with that.
Both are backed by Kafka topics.
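To make the difference concrete, here is a minimal sketch (the topic name users and its columns are hypothetical, not from the question): the same topic can be registered as a STREAM, which gives you every event, or as a TABLE, which gives you the latest value per key.
-- every event on the topic, in order
CREATE STREAM users_stream (id VARCHAR KEY, name VARCHAR)
WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON');
-- current state per id, built from the same topic
CREATE TABLE users_table (id VARCHAR PRIMARY KEY, name VARCHAR)
WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON');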
So you can do this (note that it creates a STREAM, not a TABLE):
CREATE STREAM test_stream
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL;
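To sanity-check that the new stream is populating, a quick push query can be run against it (a minimal check; the LIMIT just stops it after a few rows):
SELECT * FROM test_stream EMIT CHANGES LIMIT 5;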
If you really want to create a TABLE, then use the LATEST_BY_OFFSET aggregation, assuming you're using id as your key:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, LATEST_BY_OFFSET(timestamp)
, LATEST_BY_OFFSET(servicename)
, LATEST_BY_OFFSET(content->assignedcontent)
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
GROUP BY id;
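Because this table is materialized by the aggregation, its current state can also be read with a pull query (a sketch; 'some-id' is a placeholder, assuming id is the key and its values are strings):
SELECT * FROM test_table WHERE id = 'some-id';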

Related

Generating a materialized join table with a many-to-one column in ksqldb

Thanks in advance for any input!
I have a requirement to retrieve data from the 4 database tables below via 1 HTTP request. I've chosen to create a materialized table with ksqlDB which will contain all relevant data from the 4 database tables. My API Gateway will then query that table using ksqlDB's REST API.
My struggle is in creating 1 ksqlDB table to back a purchase order webpage, which consists of data from all 4 of the services shown below:
vendor_tbl
contract_tbl (has vendorId column referencing vendor_table's PK)
services_tbl (has contractId column referencing contract_table's PK)
purchase_order_tbl (has both vendorId && contractId columns referencing vendor/contract table's PKs)
The issue lies mainly with the services table, because it has a many-to-one relationship with contracts. 1 service is for 1 contract, but 1 contract can have many services, which is common.
The structure as-is works perfectly in the RDBMS context, but I'm struggling to find any way using ksqlDB to create 1 materialized table containing all of the service IDs for the PO, along with data from the corresponding contract, vendor, and PO tables...
What I have tried:
1.
I have tried creating 2 streams, each joining 2 of the 4 sources, and then "daisy-chaining" the 2 joined streams, as such:
CREATE STREAM po_vendor_join AS SELECT * FROM po_stream p INNER JOIN vendors_tbl v ON p.vendorId = v.id;
CREATE STREAM service_contract_join AS SELECT * FROM services_stream s INNER JOIN contracts_tbl c ON s.contractId = c.id;
This works, but of course there are duplicate entries in the service_contract_join stream. The next step would be to create a table from these 2 streams, but I cannot do this because of the aggregation/GROUP BY requirements, which require that every non-key column in the table be part of an aggregation (a sketch of a table shape that satisfies this restriction appears after this question). I understand that I would have duplicates in OTHER columns (the vendor PK and contract PK are referenced in multiple tables, for instance), but the PK itself is UNIQUE, and it needs to be the id of services_tbl in this case, since there are multiple services per PO. (Note that I have also tried LEFT and RIGHT OUTER joins on the streams.)
2.
I have tried "staggering" between table/stream, as such:
CREATE TABLE vendors_tbl (id VARCHAR PRIMARY KEY, name VARCHAR) WITH (KAFKA_TOPIC='vendors', VALUE_FORMAT='json', PARTITIONS=1);
CREATE STREAM contracts_stream (id VARCHAR, vendorId VARCHAR) WITH (KAFKA_TOPIC='contracts', VALUE_FORMAT='json', PARTITIONS=1);
And then rendering a joined stream between the table/stream like so:
CREATE STREAM po_vendor_join
AS SELECT p.*, v.*
FROM po_stream p
INNER JOIN vendors_tbl v ON p.vendorId = v.id;
But then, when trying to join the stream, of course I am met with the same restrictions as in attempt #1.
3.
I have tried making all 4 of the sources into tables, simply modeling them as non-queryable tables.
The problem with this approach arises when I try to create a join table between 2 of the tables, like below:
CREATE TABLE PO_HOLLISTIC_AGGREGATE
AS
SELECT * FROM services_contracts_join_tbl scj
INNER JOIN po_vendors_join_tbl pvj ON scj.c_vendorId = pvj.v_id;
I receive an error here stating that the join needs to be on the PK of the right table, but again, the PK required here would have to be that of the services table, because there are multiple services.
This makes me conclude that the only way this would be possible with ksqlDB is if I stored the service PK on the other 3 streams, which wouldn't really be doable either because of the aggregation restrictions when creating a table from joined streams.
I'd appreciate any ideas, thanks again!
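To make the aggregation restriction from attempt 1 concrete, here is a rough sketch of a table shape ksqlDB will accept over the joined stream: group by the key you want (the service id) and wrap every other column in an aggregate such as LATEST_BY_OFFSET, the same pattern shown in the first answer above. The column names (s_id, s_contractId, c_vendorId) are assumed, not taken from an actual schema, and this does not by itself resolve the many-to-one modelling problem:
-- keyed by the service id; every other column must sit inside an aggregate
CREATE TABLE service_contract_tbl AS
SELECT s_id
, LATEST_BY_OFFSET(s_contractId) AS contract_id
, LATEST_BY_OFFSET(c_vendorId) AS vendor_id
FROM service_contract_join
GROUP BY s_id;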

KSQL Rolling Sum

We have created a KSQL stream that has id, sale, and event as fields, where "event" can have two values, "sale" and "endSale". We used the below command to create the stream.
CREATE STREAM sales (id BIGINT, sale BIGINT, data VARCHAR)
WITH (KAFKA_TOPIC='raw_topic', VALUE_FORMAT='json');
We want to aggregate on the "sale" parameter. So, we have used SUM(sale) to aggregate it. We have used the below command to create a table
CREATE TABLE sales_total WITH (VALUE_FORMAT='JSON', KEY_FORMAT='JSON') AS
SELECT ID, SUM(SALE) AS TOTAL_SALES
FROM SALES GROUP BY ID EMIT CHANGES;
Now, we want to keep the sum aggregating as long as "event = sale". And when "event = endSale", we want to publish the value of SUM(SALE) AS TOTAL_SALES to a different topic and reset the value of SUM(SALE) AS TOTAL_SALES to 0.
How to achieve the above scenario?
Is there a way to achieve this using UDAF? Can we pass custom values to UDAF "aggregate" function?
According to this link
An aggregate function that takes N input rows and returns one output value. During the function call, the state is retained for all input records, which enables aggregating results. When you implement a custom aggregate function, it's called a User-Defined Aggregate Function (UDAF).
How do we pass N input rows to the UDAF? Please share any link or example.

Best way to join two (or more) kafka topics in KSQL emitting changes from all topics?

We have a "microservices" platform and we are using debezium for change data capture from databases on these platforms which is working nicely.
Now, we'd like to make it easy for us to join these topics and stream the results into a new topic which could be consumed by multiple services.
Disclaimer: this assumes ksqlDB v0.11 and its CLI (it seems like much of this might not work in older versions).
Example of two tables from two database instances that stream into Kafka topics:
-- source identity microservice (postgres)
CREATE TABLE public.user_entity (
id varchar(36) NOT NULL,
first_name varchar(255) NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
-- source organization microservice (postgres)
CREATE TABLE public.user_info (
id varchar(36) NOT NULL,
user_entity_id varchar(36) NOT NULL,
business_unit varchar(255) NOT NULL,
cost_center varchar(255) NOT NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
Option 1 : Streams
CREATE STREAM stream_user_info_by_user_entity_id
AS SELECT * FROM stream_user_info
PARTITION BY user_entity_id
EMIT CHANGES;
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN stream_user_info_by_user_entity_id ui WITHIN 365 DAYS ON ue.id = ui.user_entity_id
EMIT CHANGES;
Notice the WITHIN 365 DAYS: conceptually these tables could go a very long time without being changed, so this window would need to be effectively infinite. This looks fishy and seems to hint that this is not a good way to do it.
Option 2 : Tables
CREATE TABLE ktable_user_info_by_user_entity_id (
user_entity_id VARCHAR PRIMARY KEY,
business_unit VARCHAR,
cost_center VARCHAR
)
with (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
We no longer need the WITHIN 365 DAYS window, so this feels more correct. However, this only emits a change when a message is sent to the stream, not to the table.
In this example:
User updates first_name -> change is emitted
User updates business_unit -> no change emitted
Perhaps there might be a way to create a merged stream partitioned by the user_entity_id and join to child tables which would hold the current state, which leads me to....
Option 3 : Merged stream and tables
-- "master" change stream with merged stream output
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
CREATE STREAM stream_user_entity_by_id
AS SELECT * FROM stream_user_entity
PARTITION BY id
EMIT CHANGES;
CREATE TABLE ktable_user_entity_by_id (
id VARCHAR PRIMARY KEY,
first_name VARCHAR
) with (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
This one looks the best, but it appears to have a lot of moving components: for each table we have 2 streams, 1 insert query, and 1 ktable. Another potential issue here might be a hidden race condition where the stream emits the change before the table is updated under the covers.
Option 4 : More merged tables and streams
CREATE STREAM stream_user_entity_changes_enriched
AS SELECT
ue.id AS user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_entity_by_id ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
CREATE STREAM stream_user_info_changes_enriched
AS SELECT
ui.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_info_by_user_entity_id ui
LEFT JOIN ktable_user_entity_by_id ue ON ui.user_entity_id = ue.id
EMIT CHANGES;
CREATE STREAM stream_user_changes_enriched (user_entity_id VARCHAR, first_name VARCHAR, business_unit VARCHAR, cost_center VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes_enriched', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_entity_changes_enriched;
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_info_changes_enriched;
This is conceptually the same as the earlier one, but the "merging" happens after the joins. Conceivably, this might eliminate any potential race condition because we're selecting primarily from the streams instead of the tables.
The downside is that the complexity is even worse than option 3, and writing and tracking all these streams for any join with more than two tables would be kind of mind-numbing...
Question :
Which method is best for this use case, and/or are we attempting to do something that ksql shouldn't be used for? Would we be better off just offloading this to a traditional RDBMS or to Spark alternatives?
I'm going to attempt to answer my own question; I'll only accept it if upvoted.
The answer is: Option 3
Here are the reasons why, for this use case, this would be the best, though this is perhaps subjective:
Streams partitioned by primary and foreign keys are common and simple.
Tables based on these streams are common and simple.
Tables used in this way will not create a race condition.
All options have merits, e.g. if you don't care about emitting all the changes, or if the data behaves like streams (logs or events) instead of slowly changing dimensions (SQL tables).
As for "race conditions", the word "table" tricks your mind into thinking that you are actually processing and persisting data. In reality, they are not physical tables; they behave more like sub-queries over streams. Note: aggregation tables, which actually produce topics, might be an exception (which I would suggest is a different topic, but I would love to see comments).
In the end (syntax may have some slight bugs):
---------------------------------------------------------
-- shared objects (likely to be used by multiple queries)
---------------------------------------------------------
-- shared streams wrapping topics
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
-- shared keyed streams (i like to think of them as "indexes")
CREATE STREAM stream_user_entity_by_id AS
SELECT * FROM stream_user_entity PARTITION BY id
EMIT CHANGES;
CREATE STREAM stream_user_info_by_user_entity_id AS
SELECT * FROM stream_user_info PARTITION BY user_entity_id
EMIT CHANGES;
-- shared keyed tables (inferring columns with schema registry)
CREATE TABLE ktable_user_entity_by_id (id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
CREATE TABLE ktable_user_info_by_user_entity_id (user_entity_id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
---------------------------------------------------------
-- query objects (specific to the produced data)
---------------------------------------------------------
-- "master" change stream (include all tables in join)
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
-- pretty simple looking query
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
The "shared" objects are basically the streaming schema (temptation is to create for all our topics, but that's another question) and the second portion is like the query schema. It ultimately is a functional, clean, and repeatable pattern.
I like your approach number 3. Indeed, I have tried to use that one to merge streams with different primary keys into one master stream and then group them in a materialized view.
The join seems to work, but I ended up in the same situation as with a regular stream-table join: I see changes in the master stream, but somehow those changes are only propagated downstream (to the GROUP BY table) when they affect the first table in the stream-table join, and not the others.
So basically what I have achieved is the following:
Debezium --> create 3 streams: A, B, AB (AB is a mapping table between ids in A and ids in B, used in Postgres to make an n-to-n join)
Streams A, B, and AB are repartitioned by one id (A_id) and merged into one stream. In this step all elements of B get assigned a virtual A_id, since it is not relevant to them.
The 3 KTables are created (I still keep wondering why; is this a sort of self-join?)
A materialized view (table) is created by grouping the master stream after joining it to the 3 KTables.
When a change in A, B, or AB happens, I see changes in the master stream too, but the materialized view is only updated when changes on stream B occur. Of course, destroying the table and recreating it makes it "up to date".
Are you facing the same problem?

Can I convert from Table to Stream in KSQL?

I am working with Kafka and KSQL. I would like to find the last row within 5 minutes for each DEV_NAME (the ROWKEY). Therefore, I have created a stream and an aggregated table for further joining.
With the KSQL below, I have created the table that finds the last row within 5 minutes for each DEV_NAME:
CREATE TABLE TESTING_TABLE AS
SELECT ROWKEY AS DEV_NAME, max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY ROWKEY;
Then, I would like to join them together:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
However, it produced this error:
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (org.apache.kafka.streams.kstream.TimeWindowedSerializer) is not compatible to the actual key type (key type: org.apache.kafka.connect.data.Struct). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
It seems the WINDOW TUMBLING function changed my ROWKEY format
(e.g. DEV_NAME_11508 -> DEV_NAME_11508 : Window{start=157888092000 end=-}).
Therefore, without setting the Serdes, could I convert the table to a stream and PARTITION BY DEV_NAME?
As you've identified, the issue is that your table is a windowed table, meaning the key of the table is windowed, and you cannot look up into a windowed table with a non-windowed key.
Your table, as it stands, will generate one unique row per ROWKEY for each 5-minute window. Yet it seems like you don't care about anything but the most recent window. It may be that you don't need the windowing in the table, e.g.
CREATE TABLE TESTING_TABLE AS
SELECT
ROWKEY AS DEV_NAME,
max(ROWTIME) as LAST_TIME
FROM TESTING_STREAM
WHERE ROWTIME > (UNIX_TIMESTAMP() - 300000)
GROUP BY ROWKEY;
This will track the max timestamp per key, ignoring any timestamp that is over 5 minutes old. (Of course, this check is only done at the time the event is received; the row isn't removed after 5 minutes.)
Also, this join:
CREATE STREAM TESTING_S_2 AS
SELECT *
FROM TESTING_S S
INNER JOIN TESTING_T T
ON S.ROWKEY = T.ROWKEY
WHERE
S.ROWTIME = T.LAST_TIME;
Almost certainly isn't doing what you think and wouldn't work in the way you want due to race conditions.
It's not clear what you're trying to achieve. Adding more information about your source data and required output may help people to provide you with a solution.
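On the title question itself: KSQL has no direct TABLE-to-STREAM conversion, but a table built with CREATE TABLE AS is backed by a Kafka topic (by default named after the table), so one option is to declare a stream over that topic and repartition it. This is only a rough sketch, assuming the non-windowed version of TESTING_TABLE above with JSON values:
-- read the table's backing topic as a stream of change events
CREATE STREAM TESTING_TABLE_CHANGES (DEV_NAME VARCHAR, LAST_TIME BIGINT)
WITH (KAFKA_TOPIC='TESTING_TABLE', VALUE_FORMAT='JSON');
-- repartition by DEV_NAME for downstream joins
CREATE STREAM TESTING_BY_DEV_NAME AS
SELECT * FROM TESTING_TABLE_CHANGES
PARTITION BY DEV_NAME;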

KSQL - Determining When a Table Is Loaded

How can I determine when KSQL has fully loaded my data from a Kafka topic into my table?
GOAL: Take 2 Kafka topics, join them and write the results to a new Kafka topic.
EXAMPLE:
I am using Ksql's Rest API to issue the following commands.
CREATE TABLE MyTable (A1 VARCHAR, A2 VARCHAR) WITH (kafka_topic='topicA', key='A1', value_format='json');
CREATE STREAM MyStream (B1 varchar, B2 varchar) WITH (kafka_topic='topicB', value_format='json');
CREATE STREAM MyDestination WITH (Kafka_topic='topicC', PARTITIONS = 1, value_format='json') AS SELECT a.A1 as A1, a.A2 as A2, b.B1 as B1, b.B2 as B2 FROM MyStream b left join MyTable a on a.A1 = b.B1;
PROBLEM: topicC only has data from topicB, and all joined values are null.
Although I receive back a status of SUCCESS from the create table command, it appears that the data has not fully loaded into the table. Consequently, the result of the 3rd command only has data from the stream and does not include data from the table. If I artificially delay before executing the join command, then the resulting topic correctly has data from both topics. How can I determine when my table is loaded and it is safe to execute the join command?
This is indeed a great question. At this point, KSQL doesn't have a way to automatically execute a stream-table join only once the table is fully loaded; that would be a useful feature. A more general and related problem is discussed here: https://github.com/confluentinc/ksql/issues/1751
Tables in KSQL (and the underlying Kafka Streams) have a time dimension, i.e., they evolve over time. For a stream-table join, each stream record is joined with the "correct" table version (i.e., tables are versioned by time).
In the upcoming CP 5.1 release, you can "pre-load" the table by ensuring that all record timestamps of the table topic are smaller than the record timestamps of the stream topic. This tells KSQL that it needs to process the table topic data first and advance the table's timestamp-version accordingly before it can start joining.
For more details, check out: https://www.confluent.io/resources/streams-tables-two-sides-same-coin
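One concrete way to control those timestamps is the TIMESTAMP property in the WITH clause, which tells KSQL to use a value column as the record timestamp instead of the Kafka message time. A rough sketch, assuming both topics carry an epoch-millisecond column hypothetically named event_ts:
-- event_ts is a hypothetical column; it drives the record timestamps used for join ordering
CREATE TABLE MyTable (A1 VARCHAR, A2 VARCHAR, event_ts BIGINT) WITH (kafka_topic='topicA', key='A1', value_format='json', timestamp='event_ts');
CREATE STREAM MyStream (B1 VARCHAR, B2 VARCHAR, event_ts BIGINT) WITH (kafka_topic='topicB', value_format='json', timestamp='event_ts');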