I have a Kafka topic named page_views and a stream over it named pageviews. Now I want to count the page views from the last 5 minutes. I am using KSQL.
Tried with
SELECT after->pageview_id FROM pageviews WHERE after->pageview_id >= NOW() - INTERVAL 10 MINUTE;
and
SELECT after->pageview_id FROM pageviews WHERE after->pageview_id >= sysdate - 5/(24*60);
but neither works. This is a nested Avro schema.
You can use a HOPPING window to emulate a sliding window in KSQL. For a hopping window you specify the window size, which in this case is 5 minutes, and an advance value that indicates how the window moves, i.e. slides (for example, every 1 second). So you can write a query like this:
CREATE STREAM foo AS
  SELECT after->pageview_id AS pv_id
  FROM pageviews;

CREATE TABLE bar AS
  SELECT pv_id, COUNT(pv_id) AS cnt
  FROM foo
  WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 SECOND)
  GROUP BY pv_id;
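To watch the per-window counts as they update, you can query the resulting table. A minimal sketch, assuming a recent ksqlDB version (older KSQL releases used a plain SELECT without EMIT CHANGES):

SELECT pv_id, WINDOWSTART, WINDOWEND, cnt
FROM bar
EMIT CHANGES;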
For more information on HOPPING WINDOW refer to the following pages:
https://docs.confluent.io/current/ksql/docs/developer-guide/syntax-reference.html#ksql-statements
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
I have a table with a tumbling window, e.g.
CREATE TABLE total_transactions_per_1_days AS
  SELECT
    sender,
    count(*) AS count,
    sum(amount) AS total_amount,
    histogram(recipient) AS recipients
  FROM
    completed_transactions
  WINDOW TUMBLING (
    SIZE 1 DAYS
  )
  GROUP BY sender;
Now I need to select only data from the current window, i.e. windowstart <= current time and windowend >= current time. Is it possible? I could not find any example.
Depends what you mean when you say 'select data' ;)
ksqlDB supports two main query types (see https://docs.ksqldb.io/en/latest/concepts/queries/).
If what you want is a pull query, i.e. a traditional SQL query where you want to pull back the current window as a one-time result, then what you want may be possible, though pull queries are a recent feature and not fully featured yet. As of version 0.10 you can only look up a known key. For example, if sender is the key of the table, you could run a query like:
SELECT * FROM total_transactions_per_1_days
WHERE sender = some_value
AND WindowStart <= UNIX_TIMESTAMP()
AND WindowEnd >= UNIX_TIMESTAMP();
This would require the table to have processed data with a timestamp close to the current wall clock time for it to pull back data, i.e. if the system was lagging, or if you were processing historic or delayed data, this would not work.
Note: the above query will work on ksqlDB v0.10. Your success on older versions may vary.
There are plans to extend the functionality of pull queries, so keep an eye out for updates to ksqlDB.
I am grouping events coming from a Kafka topic by one of their properties and over time, using the KSQL windowed aggregation, specifically the session window.
I have been able to create a stream of "session start signals" as described in this answer.
-- create a stream with a new 'data' topic:
CREATE STREAM DATA (USER_ID INT)
WITH (kafka_topic='data', value_format='json', partitions=2);
-- create a table that tracks user interactions per session:
CREATE TABLE SESSIONS AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
WINDOW SESSION (5 SECONDS)
GROUP BY USER_ID;
-- Create a stream over the existing `SESSIONS` topic.
CREATE STREAM SESSION_STREAM (ROWKEY INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
-- Create a stream of window start events:
CREATE STREAM SESSION_STARTS AS
SELECT * FROM SESSION_STREAM
WHERE WINDOWSTART = WINDOWEND;
Would it be possible to create a stream of "session end signals" every time the windowed aggregation ends?
I'm assuming by this you mean you want to emit an event/row when a session window hasn't seen any new messages that fit into the session for the 5 seconds you've configured for the window?
I don't think this is possible at present.
Because the source data can have records that are out-of-order, i.e. an event with a timestamp much earlier than rows already processed, a session window cannot be 'closed' as soon as the 5 SECONDS window has elapsed.
Existing sessions will, by default, be closed after 24 hours if no new data is received that should be included in the session. This can be controlled by setting a GRACE PERIOD in the window definition.
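For example, a sketch of the aggregation above with an explicit grace period; the 10 MINUTES value is illustrative, and the syntax assumes a ksqlDB version that supports GRACE PERIOD in the window definition:

-- close sessions 10 minutes after the last event, instead of the 24 hour default:
CREATE TABLE SESSIONS_WITH_GRACE AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
WINDOW SESSION (5 SECONDS, GRACE PERIOD 10 MINUTES)
GROUP BY USER_ID;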
This closing of windows once the grace period has elapsed does not currently result in any row being output. However, KLIP 10 - Add Suppress to KSQL may give you what you want once it is implemented.
Hi, I have created a stream which has the following values from the topic:
"id VARCHAR, src_ip VARCHAR, message VARCHAR"
Now I need to check if failed_login repeats more than 3 times in a given time window and, if so, raise an alert. So I have created a table as below:
CREATE TABLE 231_console_failure AS \
SELECT src_ip, count(*) \
FROM console_failure \
WINDOW TUMBLING (SIZE 30 SECONDS) \
WHERE message = 'failed_login' \
GROUP BY src_ip \
HAVING count(*) > 3;
Now when I use my Python script to consume from the topic 231_console_failure, I continuously get None when there is no match.
And when there is a match, i.e. more than 3 in 30 seconds, it gives that value. But say there are 10 attempts in 30 seconds: the consumer then fetches 7 messages, where the count in each message ranges from 4 to 10.
I know I can handle this in the script by skipping the None values and taking only the highest count in a given time. But is there any way to create a stream from the above table which will contain only the matched messages per group in KSQL?
This isn't currently possible in KSQL, but there is an enhancement request open if you want to upvote/track it: https://github.com/confluentinc/ksql/issues/1030
For now, per the same ticket, you can experiment with cache.max.bytes.buffering and commit.interval.ms to vary how often the aggregate is emitted.
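For example, a minimal sketch of tuning these from the KSQL CLI before creating the table; the values here are arbitrary, and a larger buffer plus a longer commit interval means fewer intermediate results are emitted:

-- buffer up to ~10 MB of aggregation results before forwarding them downstream
SET 'cache.max.bytes.buffering' = '10000000';
-- commit (and hence emit) at most every 30 seconds
SET 'commit.interval.ms' = '30000';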
I was checking out the Spark window function to count page hits per 30 seconds, but it keeps adding the value of the previous window too.
Suppose at 12:00:30 the count is 10 and at 12:01:00 the count is 10.
But Spark gives the output as 20, adding the previous window's value. I'm using Kafka-Spark streaming.
val rs = words.reduceByKeyAndWindow((x, y) => (x._1 + y._1, x._2 + y._2), Durations.seconds(30))
Please help: how can I reset the value, the way a tumbling window works in Kafka's KSQL?
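Since the question refers to KSQL's tumbling windows, here is a minimal sketch of that behaviour for comparison; page_hits and page_id are assumed names, not from the question. Because tumbling windows do not overlap, the count resets every 30 seconds:

-- page_hits / page_id are hypothetical; substitute your own stream and column
CREATE TABLE hits_per_30s AS
SELECT page_id, COUNT(*) AS hits
FROM page_hits
WINDOW TUMBLING (SIZE 30 SECONDS)
GROUP BY page_id;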
I am tracking the rolling sum of a particular field by using a query which looks something like this:
SELECT id, SUM(quantity) AS quantity from stream \
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) \
GROUP BY id;
Now, for every input tick, it seems to return six different aggregated values, which I guess are for the following time periods:
[start, start+60] seconds
[start+10, start+60] seconds
[start+20, start+60] seconds
[start+30, start+60] seconds
[start+40, start+60] seconds
[start+50, start+60] seconds
What if I am interested in getting only the [start, start+60] seconds result for every tick that comes in? Is there any way to get ONLY that?
Because you specify a hopping window, each record falls into multiple windows, and all of those windows must be updated when the record is processed. Updating only one window would be incorrect and the result would be wrong. For example, with SIZE 1 MINUTE and ADVANCE BY 10 SECONDS, a record with timestamp 12:00:25 falls into the six windows [11:59:30, 12:00:30], [11:59:40, 12:00:40], [11:59:50, 12:00:50], [12:00:00, 12:01:00], [12:00:10, 12:01:10], and [12:00:20, 12:01:20].
Compare the Kafka Streams docs about hopping windows (Kafka Streams is KSQL's internal runtime engine): https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
Update
Kafka Streams is adding proper sliding window support via KIP-450 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-450%3A+Sliding+Window+Aggregations+in+the+DSL). This should make it possible to add sliding windows to ksqlDB later, too.
I was in a similar situation, and creating a user-defined function that keeps only the window whose collect_list(column) size equals the window duration (measured in advance periods) appears to be a promising approach.
In the UDF, take a List of the values of one of your aggregated columns. If the size of the list equals the number of advance periods in the hopping window, return the aggregate; otherwise return null.
From this, create a table selecting the data and transforming it with the UDF.
Then create a table from this latest table and filter out the null values on the transformed column, as shown in the sketch below.
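A minimal sketch of that approach, built on the query from the question. full_window_sum is a hypothetical UDF, and, as described above, it assumes one event per advance period, so a full 1-minute window holds 6 values:

-- full_window_sum(list, n) is a hypothetical UDF: it returns the sum of
-- the list when list.size() == n, and NULL otherwise
CREATE TABLE quantity_all_windows AS
SELECT id, full_window_sum(COLLECT_LIST(quantity), 6) AS quantity
FROM stream
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS)
GROUP BY id;

-- keep only the rows produced by full windows
CREATE TABLE quantity_full_windows AS
SELECT id, quantity
FROM quantity_all_windows
WHERE quantity IS NOT NULL;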