KSQL UNIX_TIMESTAMP function is not dynamic on streams created with queries

I'm creating the following stream:
CREATE STREAM riderLocations (profileId VARCHAR, latitude DOUBLE, longitude DOUBLE, publishtime VARCHAR)
WITH (kafka_topic='locations', value_format='json', partitions=1);
and then this other one:
CREATE STREAM INVENTORY WITH (KAFKA_TOPIC='locations_in')
AS select * FROM riderLocations
where STRINGTOTIMESTAMP(publishtime, 'yyyy-MM-dd HH:mm:ss.SSSZ') < UNIX_TIMESTAMP();
when I execute the command: select * from inventory emit changes;
it only returns the messages whose publish date is earlier than the moment the INVENTORY stream was created.
How can I force the UNIX_TIMESTAMP value to be re-evaluated so that my INVENTORY stream stays up to date?

The timestamp is evaluated once, when the persistent query is created, not per record at runtime.
If you really want messages earlier than "now", you'll need to include that condition as a WHERE clause in each SELECT query you run.
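For example, a minimal sketch reusing the stream from the question: each time you run this push query, UNIX_TIMESTAMP() is evaluated afresh for that run, so the cutoff is the moment the query starts rather than the moment a derived stream was created.
-- Run the filter per query instead of baking it into a derived stream
SELECT *
FROM riderLocations
WHERE STRINGTOTIMESTAMP(publishtime, 'yyyy-MM-dd HH:mm:ss.SSSZ') < UNIX_TIMESTAMP()
EMIT CHANGES;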

Related

Create Table without data aggregation

I just started to use the ksqlDB Confluent feature, and I noticed that it is not possible to run the following command: CREATE TABLE AS SELECT A, B, C FROM [STREAM_A] [EMIT CHANGES];
I wonder why this is not possible, or whether there's a way of doing it?
Requiring data aggregation here feels like a heavyweight step for a simple need.
Edit 1: Source is a STREAM and not a TABLE.
The field types are:
String
Integers
Record
Here's an example of an executed command that returns an error:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
[EMIT CHANGES];
The goal is to create a table with a smaller dataset and fewer fields than the original topic.
I think the confusion here is that you talk about a TABLE, but you're actually creating a STREAM. The two are different types of object.
A STREAM is an unbounded series of events - just like a Kafka topic. The only difference is that a STREAM has a declared schema.
A TABLE is state, for a given key. It's the same as KTable in Kafka Streams if you're familiar with that.
Both are backed by Kafka topics.
So you can do this; note that it's creating a STREAM, not a TABLE:
CREATE STREAM test_stream
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, timestamp
, servicename
, content->assignedcontent
FROM created_stream
WHERE content->assignedcontent IS NOT NULL;
If you really want to create a TABLE then use the LATEST_BY_OFFSET aggregation, assuming you're using id as your key:
CREATE TABLE test_table
WITH (KEY_FORMAT='JSON',VALUE_FORMAT='AVRO')
AS
SELECT id
, LATEST_BY_OFFSET(timestamp)
, LATEST_BY_OFFSET(servicename)
, LATEST_BY_OFFSET(content->assignedcontent)
FROM created_stream
WHERE content->assignedcontent IS NOT NULL
GROUP BY id;

KSQL Rolling Sum

We have created a ksql stream that has id, sale, and event as parameters, where "event" can have two values, "sale" and "endSale". We used the below command to create the stream.
CREATE STREAM sales (id BIGINT, sale BIGINT , data VARCHAR)
WITH (KAFKA_TOPIC='raw_topic', VALUE_FORMAT='json');
We want to aggregate on the "sale" parameter, so we have used SUM(sale). We used the below command to create a table:
CREATE TABLE sales_total WITH (VALUE_FORMAT='JSON', KEY_FORMAT='JSON') AS
SELECT ID, SUM(SALE) AS TOTAL_SALES
FROM SALES GROUP BY ID EMIT CHANGES;
Now, we want to keep the sum aggregating as long as "event = sale". When "event = endSale", we want to publish the value of SUM(SALE) AS TOTAL_SALES to a different topic and reset it to 0.
How can we achieve the above scenario?
Is there a way to achieve this using a UDAF? Can we pass custom values to a UDAF's "aggregate" function?
According to this link
An aggregate function that takes N input rows and returns one output value. During the function call, the state is retained for all input records, which enables aggregating results. When you implement a custom aggregate function, it's called a User-Defined Aggregate Function (UDAF).
How do we pass N input rows to the UDAF? Please share a link to an example.
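For the first half of the requirement (aggregating only while event = sale), a minimal sketch is to filter before aggregating. This assumes the event type is exposed as a column named event (the DDL above declares data VARCHAR instead), and it does not handle the reset on endSale, which would still need a custom UDAF or a downstream consumer:
-- Sketch only: sums 'sale' events per id; resetting on 'endSale' is not covered here
CREATE TABLE sales_total_filtered WITH (VALUE_FORMAT='JSON', KEY_FORMAT='JSON') AS
SELECT id, SUM(sale) AS total_sales
FROM sales
WHERE event = 'sale'
GROUP BY id
EMIT CHANGES;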

ksqlDB shows wrong `yyyy-MM-dd HH:mm:ss.SSSSSS` format for timestamp

I used Kafka Connect to stream a table from MySQL to a Kafka topic. The table contained some columns with the datetime(6) column type, holding values like 1611290740285818.
When I converted this timestamp to string using ksqlDB using:
SELECT TIMESTAMPTOSTRING(my_timestamp, 'yyyy-MM-dd HH:mm:ss.SSSSSS','UTC') AS DT6
FROM my_topic
EMIT CHANGES;
The displayed string was +53114-10-20 14:12:20.712000, while the expected time was 2021-01-22 04:45:40.285818.
Can anyone advise what was wrong with my query?
@Aydin is correct in their answer. The BIGINT value you shared is microseconds, and ksqlDB's TIMESTAMPTOSTRING function expects milliseconds. The time format string that you specified just tells ksqlDB how to format the timestamp, not how to interpret it. Here's an example:
-- Create a sample stream
ksql> CREATE STREAM TMP (TS BIGINT) WITH (KAFKA_TOPIC='TMP', PARTITIONS=1, VALUE_FORMAT='AVRO');
Message
----------------
Stream created
----------------
-- Populate it with example data
ksql> INSERT INTO TMP VALUES (1611290740285818);
-- Query the stream from the beginning
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' from 'earliest' to 'earliest'.
-- Reproduce the described behaviour
ksql> SELECT TS, TIMESTAMPTOSTRING(TS, 'yyyy-MM-dd HH:mm:ss.SSSSSS','UTC') FROM TMP EMIT CHANGES;
+--------------------+------------------------------+
|TS |KSQL_COL_0 |
+--------------------+------------------------------+
|1611290740285818 |+53029-10-09 09:11:25.818000 |
^CQuery terminated
By dividing the microseconds by 1000 they become milliseconds, and the function behaves as you would expect:
ksql> SELECT TS,
TIMESTAMPTOSTRING(TS/1000, 'yyyy-MM-dd HH:mm:ss.SSS','UTC')
FROM TMP
EMIT CHANGES;
+------------------+-------------------------+
|TS |KSQL_COL_0 |
+------------------+-------------------------+
|1611290740285818 |2021-01-22 04:45:40.285 |
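If you need to keep the full microsecond precision in the output, one possible workaround (a sketch, assuming your ksqlDB version provides CONCAT, CAST, and LPAD) is to format the millisecond part and append the leftover microsecond digits yourself:
-- Sketch: TS % 1000 holds the sub-millisecond digits; zero-pad them to 3 places
SELECT TS,
CONCAT(TIMESTAMPTOSTRING(TS/1000, 'yyyy-MM-dd HH:mm:ss.SSS','UTC'),
LPAD(CAST(TS % 1000 AS VARCHAR), 3, '0')) AS DT6
FROM TMP
EMIT CHANGES;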

Key while creating KSQL Stream

1) Is a key required on a stream on which you want to perform an aggregate function? I have read several blogs, and also a recommendation from Confluent, saying that a KEY is required for aggregation functions to work.
CREATE STREAM Employee (EmpId BIGINT, EmpName VARCHAR,
DeptId BIGINT, SAL BIGINT) WITH (KAFKA_TOPIC='EmpTopic',
VALUE_FORMAT='JSON');
While defining the above stream, I have not defined any KEY (ROWKEY is NULL). The underlying topic 'EmpTopic' also does not have a key.
I am performing an aggregation function on the stream.
CREATE TABLE SALBYDEPT AS
SELECT DeptId,
SUM(SAL)
FROM Employee
GROUP BY DeptId;
Please confirm whether performing an aggregation function on the above stream requires a KEY on the 'Employee' stream, i.e. a NOT NULL ROWKEY on the 'Employee' stream.
2) As per the Confluent documentation, "Windowing lets you control how to group records that have the same key for stateful operations, like aggregations or joins, into time spans. KSQL tracks windows per record key". Please help me understand the meaning of the above statement. Is it required that the stream have a NOT NULL KEY?
3) Will a Stream-Table JOIN retain the KEY?
CREATE TABLE users
(registertime BIGINT,
userid VARCHAR,
gender VARCHAR,
regionid VARCHAR)
WITH (KAFKA_TOPIC = 'users',
VALUE_FORMAT='JSON',
KEY = 'userid');
CREATE STREAM pageviews
(viewtime BIGINT,
userid VARCHAR,
pageid VARCHAR)
WITH (KAFKA_TOPIC='pageviews',
VALUE_FORMAT='DELIMITED',
KEY='pageid',
TIMESTAMP='viewtime');
CREATE STREAM pageviews_transformed as
SELECT viewtime,
userid,
pageid,
TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
FROM pageviews;
CREATE STREAM pageviews_enriched AS
SELECT pv.viewtime,
pv.userid AS userid,
pv.pageid,
pv.timestring,
u.gender,
u.regionid
FROM pageviews_transformed pv
LEFT JOIN users u ON pv.userid = u.userid;
Will the Stream-Table JOIN retain 'UserId' as the ROWKEY in the new stream 'pageviews_enriched'?
4) I have seen several examples from Confluent on GitHub where the stream used in a JOIN is not keyed. But as per the documentation, a stream should have a NOT NULL ROWKEY to participate in a JOIN. Please confirm whether the stream must have a NOT NULL ROWKEY.
Stream-Stream joins and Stream-Table joins: in the below example I am performing a JOIN between a stream with a NULL ROWKEY and a table. Is this valid?
CREATE TABLE users
(registertime BIGINT,
userid VARCHAR,
gender VARCHAR,
regionid VARCHAR)
WITH (KAFKA_TOPIC = 'users',
VALUE_FORMAT='JSON',
KEY = 'userid');
CREATE STREAM pageviews
(viewtime BIGINT,
userid VARCHAR,
pageid VARCHAR)
WITH (KAFKA_TOPIC='pageviews',
VALUE_FORMAT='DELIMITED',
TIMESTAMP='viewtime');
CREATE STREAM pageviews_transformed as
SELECT viewtime,
userid,
pageid,
TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
FROM pageviews;
CREATE STREAM pageviews_enriched AS
SELECT pv.viewtime,
pv.userid AS userid,
pv.pageid,
pv.timestring,
u.gender,
u.regionid
FROM pageviews_transformed pv
LEFT JOIN users u ON pv.userid = u.userid;
Please confirm whether performing an aggregation function on the above stream requires a KEY on the 'Employee' stream, i.e. a NOT NULL ROWKEY on the 'Employee' stream.
You do not need a key on this stream. The key of the created table will be DeptId.
As per the Confluent documentation, "Windowing lets you control how to group records that have the same key for stateful operations, like aggregations or joins, into time spans. KSQL tracks windows per record key". Please help me understand the meaning of the above statement. Is it required that the stream have a NOT NULL KEY?
This means that when you create an aggregation you can do so over a time window, and that time window becomes part of the message key. For example, instead of aggregating all employee SAL (sales?) values ever seen, you could choose to do so over a time window, perhaps every hour or day. In that case you would have the aggregate key (DeptId) combined with the window key (e.g., for hourly: 2019-06-23 06:00:00, 2019-06-23 07:00:00, 2019-06-23 08:00:00, etc.).
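As an illustration (a sketch, with illustrative names, not a drop-in statement for your setup), an hourly variant of the earlier aggregation would use ksqlDB's WINDOW TUMBLING clause, and the window then forms part of the resulting table's key:
-- Sketch: one SUM per DeptId per one-hour window
CREATE TABLE SALBYDEPT_HOURLY AS
SELECT DeptId,
SUM(SAL) AS TOTAL_SAL
FROM Employee
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY DeptId;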
Will a Stream-Table JOIN retain the KEY?
It will retain the stream's key, unless you include a PARTITION BY in the DDL.
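For example, a sketch of re-keying the join output (the exact PARTITION BY syntax varies a little across KSQL/ksqlDB versions, and the stream name here is illustrative):
-- Sketch: PARTITION BY changes the output key (here to pageid instead of the join key)
CREATE STREAM pageviews_enriched_by_page AS
SELECT pv.viewtime,
pv.userid AS userid,
pv.pageid,
u.gender,
u.regionid
FROM pageviews_transformed pv
LEFT JOIN users u ON pv.userid = u.userid
PARTITION BY pv.pageid;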
I have seen several examples from Confluent on GitHub where the stream used in a JOIN is not keyed. But as per the documentation, a stream should have a NOT NULL ROWKEY to participate in a JOIN. Please confirm whether the stream must have a NOT NULL ROWKEY.
Do you have a link to the specific documentation you're referencing? Whilst a table does need to be keyed, a stream does not (KSQL may handle this under the covers; I'm not sure).

Rolling up events to a minute in Apache Beam

So, I have a stream of data with this structure (apologies that it's expressed in SQL):
CREATE TABLE github_events
(
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
user_id bigint,
org jsonb,
created_at timestamp
);
In SQL, I would rollup this data up to a minute like this:
1. Create a roll-up table for this purpose:
CREATE TABLE github_events_rollup_minute
(
created_at timestamp,
event_count bigint
);
2. And populate it with INSERT/SELECT:
INSERT INTO github_events_rollup_minute(
created_at,
event_count
)
SELECT
date_trunc('minute', created_at) AS created_at,
COUNT(*) AS event_count
FROM github_events
GROUP BY 1;
In Apache Beam, I am trying to roll up events to the minute, i.e. count the total number of events received in that minute according to the event's timestamp field, producing:
Timestamp (in YYYY-MM-DDThh:mm): event_count
So, if later in the pipeline we receive more events with the same overlapping timestamp (due to delays in receiving events, as the customer might be offline), we just need to take the roll-up count and increment the count for that timestamp.
This will allow us to simply increment the count for YYYY-MM-DDThh:mm by event_count in the application.
Events might be delayed, but they'll always have the timestamp field.
I would like to accomplish the same thing in Apache Beam. I am very new to it and feel that I am missing something that would allow me to do this, even though I've read the Apache Beam Programming Guide multiple times.
Take a look at the sections on Windowing and Triggers. What you're describing is fixed-time windows with allowed late data. The general shape of the pipeline sounds like:
Read input github_events data
Window into fixed windows of 1 minute, allowing late data
Count events per-window
Output the result to github_events_rollup_minute
The WindowedWordCount example project demonstrates this pattern.
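If you'd rather express the roll-up declaratively, Beam also has a SQL dialect in which fixed windows are written with TUMBLE. A minimal sketch, assuming github_events has been registered as a Beam SQL table with created_at as its event-time column:
-- Sketch (Beam SQL): count events per one-minute fixed window
SELECT
TUMBLE_START(created_at, INTERVAL '1' MINUTE) AS created_at,
COUNT(*) AS event_count
FROM github_events
GROUP BY TUMBLE(created_at, INTERVAL '1' MINUTE);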