KSQL Windowed Aggregation Stream, Session ending - apache-kafka

I am grouping events coming from a Kafka topic by one of their properties and over time, using the KSQL Windowed Aggregation, specifically the Session Window.
I have been able to create a stream of "session start signals" as described in this answer.
-- create a stream with a new 'data' topic:
CREATE STREAM DATA (USER_ID INT)
WITH (kafka_topic='data', value_format='json', partitions=2);
-- create a table that tracks user interactions per session:
CREATE TABLE SESSIONS AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
WINDOW SESSION (5 SECONDS)
GROUP BY USER_ID;
-- Create a stream over the existing `SESSIONS` topic.
CREATE STREAM SESSION_STREAM (ROWKEY INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
-- Create a stream of window start events:
CREATE STREAM SESSION_STARTS AS
SELECT * FROM SESSION_STREAM
WHERE WINDOWSTART = WINDOWEND;
Would it be possible to create a stream of "session end signals" every time the Windowed Aggregation ends?

I'm assuming by this you mean you want to emit an event/row when a session window hasn't seen any new messages that fit into the session for the 5 seconds you've configured for the window?
I don't think this is possible at present.
Because the source data can have records that are out-of-order, i.e. an event with a timestamp much earlier than rows already processed, a session window cannot be 'closed' once the 5 SECONDS window has elapsed.
Existing sessions will, by default, be closed after 24 hours if no new data is received that should be included in the session. This can be controlled by setting a GRACE PERIOD in the window definition.
This closing of windows once the grace period has elapsed does not result in any row being output at present. However, KLIP 10 - Add Suppress to KSQL may give you what you want once it is implemented.
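For reference, here is how the SESSIONS table above could be declared with an explicit grace period. This is only a sketch; the GRACE PERIOD clause requires a ksqlDB version that supports it, and the 10-minute value is purely illustrative:
-- Sketch: close sessions 10 minutes after they end instead of the
-- default 24 hours (GRACE PERIOD support depends on the ksqlDB version).
CREATE TABLE SESSIONS AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
WINDOW SESSION (5 SECONDS, GRACE PERIOD 10 MINUTES)
GROUP BY USER_ID;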

Related

optimal way to stream trades data out of a postgres database

I have a table with a very simple schema:
(
instrument varchar(20) not null,
ts timestamp not null,
price double precision not null,
quantity double precision not null,
direction integer not null,
id serial
constraint trades_pkey
primary key
);
It stores a list of trades done on various instruments.
You can have multiple trades at a single timestamp, and the timestamps are not regular; it's possible to have 10 entries in the same millisecond and then nothing for 2 seconds, etc.
When the client starts, I would like to accomplish two things:
Load the last hour of data.
Stream all the new updates.
The client processes the trades one by one, as if they were coming from a queue. They are sorted by instrument, and each instrument has its own queue; each trade is expected to be the one that follows the previous one.
Solution A:
I did a query to find the id at now - 1 hour, then queried all rows with id >= start id, and then looped to get all rows with id > last id.
This does not work:
the row ids and timestamps do not match: sometimes an older timestamp gets a higher row id, etc. I guess this is due to writes being done on multiple threads. Getting data by id doesn't guarantee I will get the trades in order, and while I can sort one batch I receive, I can't be sure that the next batch will not contain an older row.
Solution B:
I can make a query loop that takes the last timestamp received, subtracts 1 second and queries again, etc. I can sort the data in the client and, for each instrument, discard all rows older than the last one processed.
Not very efficient, but that will work.
Solution C:
I can make a query per instrument (there are 22 of them), ordered by timestamp. Can 22 subqueries be grouped into a single one?
Or, is there another solution?
You could try a bigserial (auto-incrementing) column to ensure each row is numbered in order as it is inserted.
Since this number is handled by Postgres, you should get a guaranteed ordering on your data.
On the client side you just store (maybe in a separate metadata table) the latest serial number you have seen, then query everything larger than that and keep your metadata table up to date.
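A minimal sketch of that approach, assuming the table is named trades (inferred from the trades_pkey constraint), a new bigserial column named seq, and a :last_seq bind parameter as the client-side bookmark:
-- Add an auto-incrementing column; Postgres backs it with a sequence.
ALTER TABLE trades ADD COLUMN seq bigserial;

-- On each poll, fetch only rows newer than the last serial number seen
-- (:last_seq is the bookmark the client persists between polls).
SELECT instrument, ts, price, quantity, direction, seq
FROM trades
WHERE seq > :last_seq
ORDER BY seq;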

How can I select only data within a specific window in KSQL?

I have a table with a tumbling window, e.g.
CREATE TABLE total_transactions_per_1_days AS
SELECT
    sender,
    count(*) AS count,
    sum(amount) AS total_amount,
    histogram(recipient) AS recipients
FROM
    completed_transactions
WINDOW TUMBLING (
    SIZE 1 DAYS
)
GROUP BY sender;
Now I need to select only the data from the current window, i.e. windowstart <= current time and windowend >= current time. Is it possible? I could not find any example.
Depends what you mean when you say 'select data' ;)
ksqlDB supports two main query types, (see https://docs.ksqldb.io/en/latest/concepts/queries/).
If what you want is a pull query, i.e. a traditional SQL query where you want to pull back the current window as a one-time result, then what you want may be possible, though pull queries are a recent feature and not fully featured yet. As of version 0.10 you can only look up a known key. For example, if sender is the key of the table, you could run a query like:
SELECT * FROM total_transactions_per_1_days
WHERE sender = some_value
AND WindowStart <= UNIX_TIMESTAMP()
AND WindowEnd >= UNIX_TIMESTAMP();
This would require the table to have processed data with a timestamp close to the current wall clock time for it to pull back data, i.e. if the system was lagging, or if you were processing historic or delayed data, this would not work.
Note: the above query will work on ksqlDB v0.10. Your success on older versions may vary.
There are plans to extend the functionality of pull queries. So keep an eye out for updates to ksqlDB.

KSQL table group by with only one output within the given time

Hi, I have created a stream which has the following values from the topic:
"id VARCHAR, src_ip VARCHAR, message VARCHAR"
Now I need to see if failed_login repeats more than 3 times in a given time window and, if so, raise an alert. So I have created a table as below:
CREATE TABLE 231_console_failure AS \
SELECT src_ip, count(*) \
FROM console_failure \
WINDOW TUMBLING (SIZE 30 SECONDS) \
WHERE message = 'failed_login' \
GROUP BY src_ip \
HAVING count(*) > 3;
Now when I use my Python script to consume from the topic '231_console_failure', I get None continuously when there is no match.
And when there is a match, i.e. more than 3 in 30 seconds, it gives that value. But if there are, say, 10 attempts in 30 seconds, the consumer fetches 7 messages whose counts run from 4 to 10.
I know I can handle this in the script by ignoring the None values and taking only the highest count in the given time. But is there any way to create a stream from the above table which will contain only the matched messages, grouped, in KSQL?
This isn't currently possible in KSQL, but there is an enhancement request open if you want to upvote/track it: https://github.com/confluentinc/ksql/issues/1030
For now, per the same ticket, you can experiment with cache.max.bytes.buffering and commit.interval.ms to vary how often the aggregate is emitted.
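For example, in the KSQL CLI you could set these properties before (re)creating the table; the values below are purely illustrative, not recommendations:
-- A smaller cache and a shorter commit interval make the aggregate emit
-- updates more often; 0 bytes disables record caching entirely.
SET 'cache.max.bytes.buffering' = '0';
SET 'commit.interval.ms' = '2000';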

finding last 5 minutes page views

I have a Kafka topic named page_views and a stream named pageviews. Now I want to calculate the page views over the last 5 minutes. I am using KSQL.
Tried with
SELECT after->pageview_id FROM pageviews WHERE after->pageview_id >= NOW() - INTERVAL 10 MINUTE;
and
SELECT AFTER ->pageview_id FROM pageviews WHERE after->pageview_id >= sysdate - 5/(24*60);
but they are not working. This is a nested Avro schema.
You can use a HOPPING window to emulate a sliding window in KSQL. For a hopping window you specify the window size, which in this case is 5 minutes, and an advance value which indicates how the window moves, i.e. slides (for example, every 1 second). So you can write queries like this:
CREATE STREAM foo AS SELECT after->pageview_id AS pv_id FROM pageviews;
CREATE TABLE bar AS SELECT pv_id, COUNT(pv_id) FROM foo WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 SECOND) GROUP BY pv_id;
For more information on HOPPING WINDOW refer to the following pages:
https://docs.confluent.io/current/ksql/docs/developer-guide/syntax-reference.html#ksql-statements
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows

KSQL Hopping Window : accessing only oldest subwindow

I am tracking the rolling sum of a particular field by using a query which looks something like this:
SELECT id, SUM(quantity) AS quantity from stream \
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) \
GROUP BY id;
Now, for every input tick, it seems to return 6 different aggregated values, which I guess are for the following time periods:
[start, start+60] seconds
[start+10, start+60] seconds
[start+20, start+60] seconds
[start+30, start+60] seconds
[start+40, start+60] seconds
[start+50, start+60] seconds
What if I am interested in getting only the [start, start+60] seconds result for every tick that comes in? Is there any way to get ONLY that?
Because you specify a hopping window, each record falls into multiple windows and all windows need to be updated when processing a record. Updating only one window would be incorrect and the result would be wrong.
Compare the Kafka Streams docs about hopping windows (Kafka Streams is KSQL's internal runtime engine): https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
Update
Kafka Streams is adding proper sliding window support via KIP-450 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-450%3A+Sliding+Window+Aggregations+in+the+DSL). This should allow sliding windows to be added to ksqlDB later, too.
I was in a similar situation, and creating a user-defined function to access only the window where collect_list(column).size() equals the window's number of periods appears to be a promising track.
In the UDF, use a List parameter to receive the list of values collected for one of your aggregate's base columns. Then check whether the list size equals the hopping window's number of periods, and return null otherwise.
From this, create a table selecting the data and transforming it with the UDF.
Then create a table from this latest table and filter out null values on the transformed column.
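A rough sketch of the shape this could take. FULL_WINDOW_ONLY is a hypothetical UDF you would have to implement yourself, returning its input list only when it covers all 6 sub-windows of the 1-minute / 10-second hopping window and null otherwise; depending on your KSQL version you may need an intermediate table that materialises the COLLECT_LIST before applying the UDF:
-- Sketch only: FULL_WINDOW_ONLY is a hypothetical user-defined function.
CREATE TABLE windowed_quantities AS
SELECT id,
       FULL_WINDOW_ONLY(COLLECT_LIST(quantity), 6) AS full_window_quantities
FROM stream
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS)
GROUP BY id;

-- Keep only the rows where the full [start, start+60] window was present.
CREATE TABLE oldest_window_only AS
SELECT * FROM windowed_quantities
WHERE full_window_quantities IS NOT NULL;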