Key while creating KSQL Stream - apache-kafka

1) Is a KEY required on a Stream on which you want to perform an aggregate function? I have read several blogs, and also a recommendation from Confluent, saying that a KEY is required for aggregation functions to work.
CREATE STREAM Employee (EmpId BIGINT, EmpName VARCHAR,
DeptId BIGINT, SAL BIGINT) WITH (KAFKA_TOPIC='EmpTopic',
VALUE_FORMAT='JSON');
While defining the above Stream, I have not defined any KEY (ROWKEY is NULL). The underlying topic 'EmpTopic' also does not have a KEY.
I am performing an aggregation function on the Stream.
CREATE TABLE SALBYDEPT AS
SELECT DeptId,
SUM(SAL)
FROM Employee
GROUP BY DeptId;
Please confirm whether performing an aggregation function on the above Stream requires a KEY on the 'Employee' stream, i.e. a NOT NULL ROWKEY on the 'Employee' Stream.
2) As per the Confluent documentation, "Windowing lets you control how to group records that have the same key for stateful operations, like aggregations or joins, into time spans. KSQL tracks windows per record key". Please help me understand the meaning of this statement. Is it required that the Stream have a NOT NULL KEY?
3) Will a JOIN on Stream-Table retain the KEY?
CREATE TABLE users
(registertime BIGINT,
userid VARCHAR,
gender VARCHAR,
regionid VARCHAR)
WITH (KAFKA_TOPIC = 'users',
VALUE_FORMAT='JSON',
KEY = 'userid');
CREATE STREAM pageviews
(viewtime BIGINT,
userid VARCHAR,
pageid VARCHAR)
WITH (KAFKA_TOPIC='pageviews',
VALUE_FORMAT='DELIMITED',
KEY='pageid',
TIMESTAMP='viewtime');
CREATE STREAM pageviews_transformed AS
SELECT viewtime,
userid,
pageid,
TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
FROM pageviews;
CREATE STREAM pageviews_enriched AS
SELECT pv.viewtime,
pv.userid AS userid,
pv.pageid,
pv.timestring,
u.gender,
u.regionid,
u.interests,
u.contactinfo
FROM pageviews_transformed pv
LEFT JOIN users u ON pv.userid = u.userid;
Will the Stream-Table JOIN retain 'UserId' as the ROWKEY in the new Stream 'pageviews_enriched'?
4) I have seen several examples from Confluent on GitHub where the Stream used in a JOIN is not KEY'ed. But as per the documentation, a Stream participating in a JOIN should have a NOT NULL ROWKEY. Please confirm whether the Stream must have a NOT NULL ROWKEY.
There are Stream-Stream joins and Stream-Table joins. In the example below I am performing a JOIN between a Stream with a NULL ROWKEY and a Table. Is this valid?
CREATE TABLE users
(registertime BIGINT,
userid VARCHAR,
gender VARCHAR,
regionid VARCHAR)
WITH (KAFKA_TOPIC = 'users',
VALUE_FORMAT='JSON',
KEY = 'userid');
CREATE STREAM pageviews
(viewtime BIGINT,
userid VARCHAR,
pageid VARCHAR)
WITH (KAFKA_TOPIC='pageviews',
VALUE_FORMAT='DELIMITED',
TIMESTAMP='viewtime');
CREATE STREAM pageviews_transformed AS
SELECT viewtime,
userid,
pageid,
TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
FROM pageviews;
CREATE STREAM pageviews_enriched AS
SELECT pv.viewtime,
pv.userid AS userid,
pv.pageid,
pv.timestring,
u.gender,
u.regionid,
u.interests,
u.contactinfo
FROM pageviews_transformed pv
LEFT JOIN users u ON pv.userid = u.userid;

CREATE TABLE SALBYDEPT AS
SELECT DeptId,
SUM(SAL)
FROM Employee
GROUP BY DeptId;
Please confirm whether performing an aggregation function on the above Stream requires a KEY on the 'Employee' stream, i.e. a NOT NULL ROWKEY on the 'Employee' Stream.
You do not need a key on this stream. The key of the created table will be DeptId.
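For example, once the table is created you can confirm its key with DESCRIBE EXTENDED (standard KSQL; the exact output varies by version):
DESCRIBE EXTENDED SALBYDEPT;
The output includes the table's key field and the backing Kafka topic.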
As per the Confluent documentation, "Windowing lets you control how to group records that have the same key for stateful operations, like aggregations or joins, into time spans. KSQL tracks windows per record key". Please help me understand the meaning of this statement. Is it required that the Stream have a NOT NULL KEY?
This means that when you create an aggregation you can do so over a time window, and that time window is part of the message key. For example, instead of aggregating all employee SAL (sales?), you could choose to do so over a time window, perhaps every hour or day. In that case you would have the aggregate key (DeptId) combined with the window key (e.g. for hourly windows: 2019-06-23 06:00:00, 2019-06-23 07:00:00, 2019-06-23 08:00:00, etc.).
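For illustration, a minimal sketch of a windowed version of your aggregation (the name SALBYDEPT_HOURLY and the TOTAL alias are mine):
CREATE TABLE SALBYDEPT_HOURLY AS
SELECT DeptId,
SUM(SAL) AS TOTAL
FROM Employee
WINDOW TUMBLING (SIZE 1 HOURS)
GROUP BY DeptId;
Each resulting row is keyed by DeptId plus the hourly window it falls into.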
Will a JOIN on Stream-Table retain the KEY?
It will retain the stream's key, unless you include a PARTITION BY in the DDL.
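For example, a sketch (the stream name is mine) that re-keys the join output on pageid instead of keeping the stream's key:
CREATE STREAM pageviews_enriched_by_pageid AS
SELECT pv.viewtime,
pv.userid,
pv.pageid AS pageid,
u.gender
FROM pageviews_transformed pv
LEFT JOIN users u ON pv.userid = u.userid
PARTITION BY pageid;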
I have seen several examples from Confluent on GitHub where the Stream used in a JOIN is not KEY'ed. But as per the documentation, a Stream participating in a JOIN should have a NOT NULL ROWKEY. Please confirm whether the Stream must have a NOT NULL ROWKEY.
Do you have a link to the specific documentation you're referencing? Whilst a table does need to be keyed, a stream does not (KSQL may handle this under the covers; I'm not sure).
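For what it's worth, if you do want a stream explicitly keyed on the join column, you can re-key it yourself with PARTITION BY; a sketch (the target stream name is mine):
CREATE STREAM pageviews_by_userid AS
SELECT * FROM pageviews
PARTITION BY userid;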

Related

How to correctly GROUP BY on jdbc sources

I have a Kafka stream with user_id and want to produce another stream with user_id and the number of records in a JDBC table.
Following is how I tried to achieve this (I'm new to Flink, so please correct me if that's not how things are supposed to be done). The issue is that Flink ignores all updates to the JDBC table after the job has started.
As far as I understand, the answer to this is to use lookup joins, but Flink complains that lookup joins are not supported on temporal views. I also tried to use versioned views without much success.
What would be the correct approach to achieve what I want?
CREATE TABLE kafka_stream (
user_id STRING,
event_time TIMESTAMP(3) METADATA FROM 'timestamp',
WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
-- ...
)
-- NEXT SQL --
CREATE TABLE jdbc_table (
user_id STRING,
checked_at TIMESTAMP,
PRIMARY KEY(user_id) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
-- ...
)
-- NEXT SQL --
CREATE TEMPORARY VIEW checks_counts AS
SELECT user_id, count(*) as num_checks
FROM jdbc_table
GROUP BY user_id
-- NEXT SQL --
INSERT INTO output_kafka_stream
SELECT
kafka_stream.user_id,
checks_counts.num_checks
FROM kafka_stream
LEFT JOIN checks_counts ON kafka_stream.user_id = checks_counts.user_id
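For reference, here is a minimal sketch of the lookup-join pattern mentioned above. It assumes a processing-time attribute is added to kafka_stream (e.g. proc_time AS PROCTIME()), and it joins the base jdbc_table rather than the aggregated view, so the count would have to be computed on the result instead:
SELECT
k.user_id,
j.checked_at
FROM kafka_stream AS k
LEFT JOIN jdbc_table FOR SYSTEM_TIME AS OF k.proc_time AS j
ON k.user_id = j.user_id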

Create Table without data aggregation

I just started to use the ksqlDB Confluent feature, and it stood out that it is not possible to proceed with the following command: CREATE TABLE AS SELECT A, B, C FROM [STREAM_A] [EMIT CHANGES];
I wonder why this is not possible, or whether there's a way of doing it.
Requiring data aggregation here feels like a heavy process for a simple solution.
Edit 1: Source is a STREAM and not a TABLE.
The field types are:
String
Integers
Record
Let me share an example of the executed command that returns an error:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
[EMIT CHANGES];
The goal is to create a table with a smaller dataset and fewer fields than the original topic.
I think the confusion here is that you talk about a TABLE, but you're actually creating a STREAM. The two are different types of object.
A STREAM is an unbounded series of events - just like a Kafka topic. The only difference is that a STREAM has a declared schema.
A TABLE is state, for a given key. It's the same as KTable in Kafka Streams if you're familiar with that.
Both are backed by Kafka topics.
So you can do this; note that it's creating a STREAM, not a TABLE:
CREATE STREAM test_stream
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL;
If you really want to create a TABLE then use the LATEST_BY_OFFSET aggregation, assuming you're using id as your key:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, LATEST_BY_OFFSET(timestamp)
, LATEST_BY_OFFSET(servicename)
, LATEST_BY_OFFSET(content->assignedcontent)
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
GROUP BY id;

Best way to join two (or more) kafka topics in KSQL emitting changes from all topics?

We have a "microservices" platform and we are using debezium for change data capture from databases on these platforms which is working nicely.
Now, we'd like to make it easy for us to join these topics and stream the results into a new topic which could be consumed by multiple services.
Disclaimer: this assumes v0.11 ksqldb and cli (seems like much of this might not work in older versions)
Example of two tables from two database instances that stream into Kafka topics:
-- source identity microservice (postgres)
CREATE TABLE public.user_entity (
id varchar(36) NOT NULL,
first_name varchar(255) NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
-- source organization microservice (postgres)
CREATE TABLE public.user_info (
id varchar(36) NOT NULL,
user_entity_id varchar(36) NOT NULL,
business_unit varchar(255) NOT NULL,
cost_center varchar(255) NOT NULL,
PRIMARY KEY (id)
);
-- ksql stream
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
Option 1: Streams
CREATE STREAM stream_user_info_by_user_entity_id
AS SELECT * FROM stream_user_info
PARTITION BY user_entity_id
EMIT CHANGES;
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN stream_user_info_by_user_entity_id ui WITHIN 365 DAYS ON ue.id = ui.user_entity_id
EMIT CHANGES;
Notice the WITHIN 365 DAYS: conceptually these tables could go a very long time without being changed, so this window would have to be technically infinite. This looks fishy and seems to hint that this is not a good way to do it.
Option 2: Tables
CREATE TABLE ktable_user_info_by_user_entity_id (
user_entity_id VARCHAR PRIMARY KEY,
business_unit VARCHAR,
cost_center VARCHAR
)
with (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
SELECT
user_entity_id,
first_name,
business_unit,
cost_center
FROM stream_user_entity ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
We no longer need the WITHIN 365 DAYS window, so this feels more correct. However, this only emits a change when a message is sent to the stream, not to the table.
In this example:
User updates first_name -> change is emitted
User updates business_unit -> no change emitted
Perhaps there might be a way to create a merged stream partitioned by the user_entity_id and join to child tables which would hold the current state, which leads me to....
Option 3: Merged stream and tables
-- "master" change stream with merged stream output
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
CREATE STREAM stream_user_entity_by_id
AS SELECT * FROM stream_user_entity
PARTITION BY id
EMIT CHANGES;
CREATE TABLE ktable_user_entity_by_id (
id VARCHAR PRIMARY KEY,
first_name VARCHAR
) with (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
This one looks the best, but appears to have a lot of moving components: for each table we have 2 streams, 1 insert query, and 1 ktable. Another potential issue here might be a hidden race condition where the stream emits the change before the table is updated under the covers.
Option 4: More merged tables and streams
CREATE STREAM stream_user_entity_changes_enriched
AS SELECT
ue.id AS user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_entity_by_id ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;
CREATE STREAM stream_user_info_changes_enriched
AS SELECT
ui.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_info_by_user_entity_id ui
LEFT JOIN ktable_user_entity_by_id ue ON ui.user_entity_id = ue.id
EMIT CHANGES;
CREATE STREAM stream_user_changes_enriched (user_entity_id VARCHAR, first_name VARCHAR, business_unit VARCHAR, cost_center VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes_enriched', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_entity_changes_enriched;
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_info_changes_enriched;
This is conceptually the same as the earlier one, but the "merging" happens after the joins. Conceivably, this might eliminate any potential race condition because we're selecting primarily from the streams instead of the tables.
The downside is that the complexity is even worse than option 3, and writing and tracking all these streams for any join with more than two tables would be kind of mind-numbing...
Question:
What method is best for this use case, and/or are we attempting to do something that ksql shouldn't be used for? Would we be better off just offloading this to a traditional RDBMS or Spark alternatives?
I'm going to attempt to answer my own question, only accept if upvoted.
The answer is: Option 3
Here are the reasons why, for this use case, this would be the best option (though that is perhaps subjective):
Streams partitioned by primary and foreign keys are common and simple.
Tables based on these streams are common and simple.
Tables used in this way will not create a race condition.
All options have merits, e.g. if you don't care about emitting all the changes, or if the data behaves like streams (logs or events) instead of slowly changing dimensions (SQL tables).
As for "race conditions", the word "table" tricks your mind that you are actually processing and persisting data. In reality is that they are not actually physical tables, they actually behave more like sub-queries on streams. Note: It might be an exception for aggregation tables which actually produce topics (which I would suggest is a different topic, but would love to see comments)
In the end (syntax may have some slight bugs):
---------------------------------------------------------
-- shared objects (likely to be used by multiple queries)
---------------------------------------------------------
-- shared streams wrapping topics
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');
-- shared keyed streams (I like to think of them as "indexes")
CREATE STREAM stream_user_entity_by_id AS
SELECT * FROM stream_user_entity PARTITION BY id
EMIT CHANGES;
CREATE STREAM stream_user_info_by_user_entity_id AS
SELECT * FROM stream_user_info PARTITION BY user_entity_id
EMIT CHANGES;
-- shared keyed tables (inferring columns with schema registry)
CREATE TABLE ktable_user_entity_by_id (id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
CREATE TABLE ktable_user_info_by_user_entity_id (user_entity_id VARCHAR PRIMARY KEY)
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');
---------------------------------------------------------
-- query objects (specific to the produced data)
---------------------------------------------------------
-- "master" change stream (include all tables in join)
CREATE STREAM stream_user_changes (user_entity_id VARCHAR)
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;
-- pretty simple looking query
SELECT
uec.user_entity_id,
ue.first_name,
ui.business_unit,
ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id
EMIT CHANGES;
The "shared" objects are basically the streaming schema (temptation is to create for all our topics, but that's another question) and the second portion is like the query schema. It ultimately is a functional, clean, and repeatable pattern.
I like your approach number 3. Indeed, I have tried to use it to merge streams with different primary keys into one master stream and then group them into a materialized view.
The join seems to work, but I ended up in the same situation as with a regular stream-table join: I see changes in the master stream, but somehow those changes are only triggered downstream (to the GROUP BY table) when they affect the first table in the table-stream join and not the others.
So basically what I have achieved is the following:
Debezium --> create 3 streams: A, B, AB (AB is a matching table between ids in A and ids in B, used in Postgres to make an n-to-n join)
Streams A, B, AB are repartitioned by one id (A_id) and merged into one stream. In this step all elements of B get assigned a virtual A_id, since it is not relevant to them.
The 3 KTables are created (I still keep wondering: why? Is this a sort of self-join?)
A materialized view (table) is created by grouping the master stream after joining it to the 3 KTables.
When a change in A, B, or AB happens, I see changes in the master stream too, but the materialized view is only updated when changes on stream B occur. Of course, destroying the table and recreating it makes it "up-to-date".
Are you facing the same problem?

Rowkey as Concatenated in Create Table from a Stream in ksqlDB

The stream is:
CREATE STREAM SENSORS_KSTREAM (sensorid INT,
serialnumber VARCHAR,
mfgdate VARCHAR,
productname VARCHAR,
customerid INT,
locationid INT,
macaddress VARCHAR,
installationdate VARCHAR)
WITH (KAFKA_TOPIC='SENSORS_DETAILS', VALUE_FORMAT='AVRO', KEY='sensorid');
the table I created with this is:
CREATE TABLE SENSORS_KTABLE AS
SELECT sensorid, serialnumber, mfgdate, productname, customerid, locationid, macaddress, installationdate, COUNT(*) AS TOTAL
FROM SENSORS_KSTREAM WINDOW TUMBLING (SIZE 1 MINUTES)
GROUP BY sensorid, serialnumber, mfgdate, productname, customerid, locationid, macaddress, installationdate;
The ROWKEY produced is not what I want.
I want only SENSORID as the rowkey.
Can anyone help me do this?
Thanks in advance.
PS:
I am using Confluent 5.4.0 standalone.
ksqlDB stores the primary key of a table in the key of the underlying Kafka message. This is crucial to ensure important things like consistent partition assignment for the same key, and log compaction.
ksqlDB does not support compound keys, though this is a feature being worked on. So in the meantime, when you group by multiple columns, ksqlDB does the best it can and builds the compound key you've encountered. Not great, but it actually works for many use cases.
The statement you have above is creating a table with many columns in the primary key - and they're all currently getting serialized into a single STRING value.
You're asking to have only SENSORID in the key... but your GROUP BY clause makes all the columns that come after it part of the key.
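For example (illustrative values; the exact separator depends on the KSQL version), the single STRING key produced by that GROUP BY looks something like:
1|+|SN-0001|+|2019-11-05|+|TempSensor|+|42|+|7|+|00:1B:44:11:3A:B7|+|2020-01-15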
It seems to me that you have a topic that contains a stream of updated values for sensors. That being the case I would suggest looking into two options:
If each row in your input topic contains all the data for each sensor, then why not just import it as a TABLE rather than a STREAM:
CREATE TABLE SENSORS_KSTREAM (sensorid INT,
serialnumber VARCHAR,
mfgdate VARCHAR,
productname VARCHAR,
customerid INT,
locationid INT,
macaddress VARCHAR,
installationdate VARCHAR)
WITH (KAFKA_TOPIC='SENSORS_DETAILS', VALUE_FORMAT='AVRO', KEY='sensorid');
Alternatively, maybe LATEST_BY_OFFSET might be of use to capture the latest value for each column:
CREATE TABLE SENSORS_KTABLE AS
SELECT sensorid,
LATEST_BY_OFFSET(serialnumber),
LATEST_BY_OFFSET(mfgdate),
LATEST_BY_OFFSET(productname),
LATEST_BY_OFFSET(customerid),
LATEST_BY_OFFSET(locationid),
LATEST_BY_OFFSET(macaddress),
LATEST_BY_OFFSET(installationdate)
FROM SENSORS_KSTREAM WINDOW TUMBLING (SIZE 1 MINUTES)
GROUP BY sensorid;
LATEST_BY_OFFSET was only introduced a couple of releases ago, so you may need to upgrade.
Hopefully these two options will help you get where you need to be.

ksql - is it possible to create stream from multiple topics and get full event payload?

We have a requirement to listen on multiple topics and look for a specific field in each topic's events. Each event is in JSON format and is guaranteed to have a few fixed fields. We need to filter events from all these topics by looking at a specific field in each event payload. If this field's value matches a certain format, then the events from the different topics should be sent to one fixed topic, which can be further processed by another consumer.
I was looking at whether KSQL can help in this scenario: we create a stream from multiple topics, filter the data based on a fixed column in the KSQL stream, and push it to a new topic.
The questions I have are:
1) Is it possible to create a KSQL stream from multiple topics?
2) Is it possible to get a topic's complete event payload as one column in a KSQL stream?
At a high level (with made-up KSQL syntax), I am looking for something like:
CREATE STREAM my_all_topics (myFixedField1 varchar, eventPayload varchar) WITH (value_format = 'json', kafka_topic_LIST='topic1, topic2, topic3');
CREATE STREAM mytopic_stream (myFixedField1 varchar, eventPayload varchar) with (kafka_topic='my-final-topic-name', value_format='json')
as select myFixedField1, eventPayload from my_all_topics where myFixedField1 like 'myprefix%';
You can't do it quite how you want—a KSQL STREAM is sourced from one and only one Kafka topic.
But you could use KSQL's INSERT INTO feature to achieve what you want.
Model your source topics:
CREATE STREAM source_a (myFixedField1 varchar, eventPayload varchar) WITH (kafka_topic='topic_a', value_format='json')
CREATE STREAM source_b (myFixedField1 varchar, eventPayload varchar) WITH (kafka_topic='topic_b', value_format='json')
CREATE STREAM source_c (myFixedField1 varchar, eventPayload varchar) WITH (kafka_topic='topic_c', value_format='json')
Create the target topic, based on the first source topic:
CREATE STREAM mytopic_stream AS SELECT myFixedField1, eventPayload from source_a where myFixedField1 like 'myprefix%';
Specify insertion to the target topic from the remaining source topics:
INSERT INTO mytopic_stream SELECT myFixedField1, eventPayload from source_b where myFixedField1 like 'myprefix%';
INSERT INTO mytopic_stream SELECT myFixedField1, eventPayload from source_c where myFixedField1 like 'myprefix%';
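To sanity-check the combined stream, run a continuous query against it (in the KSQL version these docs cover, a bare SELECT is a continuous query; on newer ksqlDB you would append EMIT CHANGES):
SELECT * FROM mytopic_stream;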
See also
https://www.youtube.com/watch?v=z508VDdtp_M
https://docs.confluent.io/current/ksql/docs/tutorials/basics-local.html#insert-into
I don't know for certain, but it seems you would be able to combine streams with a JOIN.
CREATE STREAM mytopic_stream AS
SELECT A.*, B.*, C.*
FROM stream_A A
JOIN stream_B B ON A.key = B.key_for_A
JOIN stream_C C ON A.key = C.key_for_A
If you haven't already registered the Kafka topics with KSQL then you will have to take care of that step first.
You can consume your data from multiple topics and send it to a common topic acting as a collector, and then process that stream as usual. Just take care of the compatibility of the key and value types.