I am trying to group events by one of their properties and over time using KSQL windowed aggregation, specifically the session window.
I have a STREAM created from a Kafka topic, with the TIMESTAMP property properly specified.
When I try to create a STREAM with session windowing using a query like:
CREATE STREAM SESSION_STREAM AS
SELECT ...
FROM EVENT_STREAM
WINDOW SESSION (5 MINUTES)
GROUP BY ...;
I always get the error:
Your SELECT query produces a TABLE. Please use CREATE TABLE AS SELECT statement instead.
Is it possible to create a STREAM with a Windowed Aggregation?
When I try, as suggested, to create a TABLE and then a STREAM containing all the session-start events, with a query like:
CREATE STREAM SESSION_START_STREAM AS
SELECT *
FROM SESSION_TABLE
WHERE WINDOWSTART=WINDOWEND;
KSQL informs me that:
KSQL does not support persistent queries on windowed tables
How can I create a STREAM of events that start a session window in KSQL?
Your CREATE STREAM statement, if switched to a CREATE TABLE statement, will create a table that is constantly being updated. The sink topic SESSION_STREAM will contain the stream of changes to the table, i.e. its changelog.
ksqlDB models this as a TABLE, because it has TABLE semantics, i.e. only a single row can exist in the table with any specific key. However, the changelog will contain the STREAM of changes that have been applied to the table.
If what you want is a topic containing all the sessions, then something like this will create that:
-- create a stream with a new 'data' topic:
CREATE STREAM DATA (USER_ID INT)
WITH (kafka_topic='data', value_format='json');
-- create a table that tracks user interactions per session:
CREATE TABLE SESSIONS AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
WINDOW SESSION (5 SECONDS)
GROUP BY USER_ID;
This will create a SESSIONS topic that contains the changes to the SESSIONS table, i.e. its changelog.
If you want to convert this into a stream of session-start events, then unfortunately ksqlDB doesn't yet allow you to create a stream directly from the table, but you can create a stream over the table's changelog:
-- Create a stream over the existing `SESSIONS` topic.
-- Note it states the window_type is 'Session'.
CREATE STREAM SESSION_STREAM (ROWKEY INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
-- Create a stream of window start events:
CREATE STREAM SESSION_STARTS AS
SELECT * FROM SESSION_STREAM
WHERE WINDOWSTART = WINDOWEND;
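As a quick sanity check, something like the following push query should emit one row per session as each session starts:
-- Watch session-start events as they arrive:
SELECT * FROM SESSION_STARTS EMIT CHANGES;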
Note, with the upcoming 0.10 release you'll be able to name the key column in the SESSION_STREAM correctly:
CREATE STREAM SESSION_STREAM (USER_ID INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
Related
I have successfully created a Debezium source connector for Oracle and pulled some data into Kafka topics.
After that I created a stream on the existing topic that was generated by the Debezium connector.
CREATE STREAM "MySourceTable" WITH (KAFKA_TOPIC='server1.DEBEZIUM.MySourceTable', VALUE_FORMAT='AVRO');
--Stream was created successfully.
After that I try to create a new stream with some of the data extracted:
CREATE OR REPLACE STREAM "CustomTable1" WITH (KAFKA_TOPIC='server2.DEBEZIUM.CustomTable1', VALUE_FORMAT='AVRO', PARTITIONS=1, KEY_FORMAT='AVRO') AS
SELECT after->customTableId AS "key_id",
AS_VALUE(after->customTableId) AS "id",
after->customTableHref AS "href",
after->customTableName AS "name",
after->customTableType AS "type",
after->id AS "mySourceTableId"
FROM "MySourceTable"
WHERE (after->customTableId IS NOT NULL) EMIT CHANGES;
--Stream was created successfully.
Then I delete the value in the customTableId column of the MySourceTable table, so it becomes null in my source DB. I can capture changes to other columns, like customTableHref and customTableName, and see the result in the sink DB. But the records whose id is now null in the source DB still have their old, non-null id in the sink DB, and they still exist in my sink table (id is the primary key in my sink table). How can I solve this problem?
I am getting data into the source topic with keys and values in Avro format. I have two more streams downstream in the pipeline before I send the data to MySQL. But since I am using KSQL functions, the key from the source is not carried over into the downstream streams.
The keys are composite. Is there any way I can apply transformations to the key values and still keep the keys in the downstream streams?
I have tried multiple approaches and know that a plain SELECT in a CREATE STREAM ... AS SELECT statement keeps the key intact, but applying any transformation removes the key from the stream.
Example:
-- source stream
CREATE STREAM TEST_1
(COL1 STRING, COL2 STRING, COL3 STRING)
WITH (KAFKA_TOPIC='TEST_1', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='AVRO');
-- Second stream
CREATE STREAM TEST_2
WITH (KAFKA_TOPIC='TEST_2', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='AVRO') AS
SELECT CLEAR(COL1) AS COL1, CLEAR(COL2) AS COL2, CLEAR(COL3) AS COL3
FROM TEST_1;
Here TEST_1 comes from a database. It has a composite key (COL1, COL2).
But when I create stream TEST_2, the composite key is not carried over from TEST_1.
I need the keys for tombstone records so that delete operations work in MySQL. Some cleaning operations are performed on the keys.
I am using Confluent Kafka for this. Insert and update operations work fine, since I have a value-to-key SMT in the sink connectors.
Here CLEAR() is a custom UDF. Is there a way to achieve this?
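For reference, newer ksqlDB releases support PARTITION BY on expressions (and, more recently, on multiple expressions when the key format supports multiple key columns, e.g. Avro), so something along these lines might keep the cleaned keys on the downstream stream. This is only a sketch, and the generated key column names vary by version:
-- Sketch only: re-key the output on the cleaned key columns.
-- Requires a ksqlDB version with multi-expression PARTITION BY and a
-- key format that supports multiple key columns, such as Avro.
CREATE STREAM TEST_2
  WITH (KAFKA_TOPIC='TEST_2', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='AVRO', KEY_FORMAT='AVRO') AS
  SELECT CLEAR(COL3) AS COL3
  FROM TEST_1
  PARTITION BY CLEAR(COL1), CLEAR(COL2);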
We are trying to send data from MySQL to Elasticsearch (ETL) through Kafka.
In MySQL we have multiple tables which we need to aggregate into a specific format before we can send the data to Elasticsearch.
For that we used Debezium to connect to MySQL and Elasticsearch, and transformed the data through KSQL.
We created streams for both tables, then repartitioned them and created a table for one entity, but after joining we didn't get the data from both tables.
We are trying to join two MySQL tables through KSQL and send the result to Elasticsearch using Debezium.
Table 1: items
Table 2: item_images
CREATE STREAM items_from_debezium (id integer, tenant_id integer, name string, sku string, barcode string, qty integer, type integer, archived integer)
WITH (KAFKA_TOPIC='mySqlTesting.orderhive.items',VALUE_FORMAT='json');
CREATE STREAM images_from_debezium (id integer,item_id integer,image string, thumbnail string)
WITH (KAFKA_TOPIC='mySqlTesting.orderhive.item_images',VALUE_FORMAT='json');
CREATE STREAM items_flat
WITH (KAFKA_TOPIC='ITEMS_REPART', VALUE_FORMAT='json', PARTITIONS=1) AS
SELECT * FROM items_from_debezium PARTITION BY id;
CREATE STREAM images_flat
WITH (KAFKA_TOPIC='IMAGES_REPART', VALUE_FORMAT='json', PARTITIONS=1) AS
SELECT * FROM images_from_debezium PARTITION BY item_id;
CREATE TABLE item_images (id integer,item_id integer,image string, thumbnail string)
WITH (KAFKA_TOPIC='IMAGES_REPART',VALUE_FORMAT='json',KEY='item_id');
SELECT item_images.id,item_images.image,item_images.thumbnail,items_flat.id,items_flat.name,items_flat.sku,items_flat.barcode,items_flat.type,items_flat.archived,items_flat.qty
FROM items_flat left join item_images on items_flat.id=item_images.item_id
limit 10;
We are expecting data from both tables, but we are getting nulls for the item_images columns.
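If it helps with debugging, one way to check whether the repartitioned topics are actually keyed the way the join expects (a stream-table join only matches records whose keys are equal) is to print them from the KSQL CLI:
-- Inspect the record keys of the repartitioned topics:
PRINT 'ITEMS_REPART' FROM BEGINNING;
PRINT 'IMAGES_REPART' FROM BEGINNING;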
Is it possible in KSQL to stream out the old and new values from a table? We'd like to use a table as a store of values and, when one changes, stream out a "reversal" value (the previous value, tagged in some way) together with the new value, so that we can handle just the delta in downstream systems.
Kafka tables are generally used for storing the latest value per key. So, for example, if a record with key '123' exists in the table and a new record with the same key '123' but a different column value arrives on the topic, it will override (upsert) the existing value in the table.
So it's probably not a great idea to do this with a table.
Your use case is not entirely clear to me, but my suggestion would be to have some mechanism, either in the stream's source or based on timestamps, to deal with the delta feed.
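To make the upsert behaviour concrete, here is a minimal sketch (with made-up stream and table names) showing that a table keeps only the latest value per key:
-- Hypothetical example stream and table:
CREATE STREAM events (id VARCHAR KEY, val INT)
  WITH (kafka_topic='events', value_format='JSON', partitions=1);
CREATE TABLE latest_events AS
  SELECT id, LATEST_BY_OFFSET(val) AS val
  FROM events
  GROUP BY id;
INSERT INTO events (id, val) VALUES ('123', 1);
INSERT INTO events (id, val) VALUES ('123', 2);
-- latest_events now holds a single row for key '123', with val = 2.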
Yes, it's possible. It does require some juggling.
Create a table to keep the last state:
create table v1_mux_connection_ping_ta
as
select
assetid,
LATEST_BY_OFFSET(pingable) pingable
from v1_mux_connection_ping_st_parse
group by assetid;
The problem is that it also emits no-change events. A solution is to translate the table to a stream:
CREATE STREAM v1_mux_connection_ping_ta_s
(assetId VARCHAR KEY, pingable VARCHAR)
WITH (kafka_topic='V1_MUX_CONNECTION_PING_TA', value_format='JSON');
To arrive at only the changed values:
create table d_opt_details as
select
s.assetId,
LATEST_BY_OFFSET(s.pingable) new,
LATEST_BY_OFFSET(s.pingable, 2)[1] old
from v1_mux_connection_ping_ta_s s
group by
s.assetId;
create table opt_details as
select
s.assetId, s.new as pingable
from d_opt_details s
where new != old;
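To watch the resulting delta feed, a simple push query on the final table should do:
SELECT * FROM opt_details EMIT CHANGES;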
I created this record:
new ProducerRecord(topic = "petstore-msg-topic", key = msg.username, value = s"${msg.route},${msg.time}")
Now I want to do something like this:
CREATE STREAM petstorePages (KEY, route VARCHAR, time VARCHAR) \
WITH (KAFKA_TOPIC='petstore-msg-topic', VALUE_FORMAT='DELIMITED');
Is there a way to access the key when creating the stream, or do I have to include the key in the value as well?
It's added automatically and is called ROWKEY:
KSQL adds the implicit columns ROWTIME and ROWKEY to every stream and table, which represent the corresponding Kafka message timestamp and message key.
https://docs.confluent.io/current/ksql/docs/syntax-reference.html#id16
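For example (a sketch using the older KSQL syntax the linked docs describe, with the key left out of the value schema), the key is then available as ROWKEY in queries:
CREATE STREAM petstorePages (route VARCHAR, time VARCHAR)
  WITH (KAFKA_TOPIC='petstore-msg-topic', VALUE_FORMAT='DELIMITED');
SELECT ROWKEY, route, time FROM petstorePages;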