KSQL table group by with only one output within the given time - apache-kafka

Hi, I have created a stream which has the following values from the topic:
"id VARCHAR, src_ip VARCHAR, message VARCHAR"
Now I need to check whether failed_login repeats more than 3 times within a given time window and, if so, raise an alert. So I have created a table as below:
CREATE TABLE 231_console_failure AS \
SELECT src_ip, count(*) \
FROM console_failure \
WINDOW TUMBLING (SIZE 30 SECONDS) \
WHERE message = 'failed_login' \
GROUP BY src_ip \
HAVING count(*) > 3;
Now when I use my Python script to consume from the '231_console_failure' topic, I continuously get None when there is no match.
When there is a match, i.e. more than 3 failures in 30 seconds, it gives me that value. But if there are, say, 10 attempts in 30 seconds, the consumer fetches 7 messages whose counts range from 4 to 10.
I know I can handle this in the script by skipping the None values and taking only the highest count in the given window. But is there any way to create a stream from the above table which will contain only the matched messages for the GROUP BY in KSQL?

This isn't currently possible in KSQL, but there is an enhancement request open if you want to upvote/track it: https://github.com/confluentinc/ksql/issues/1030
For now, per the same ticket, you can experiment with cache.max.bytes.buffering and commit.interval.ms to vary how often the aggregate is emitted.
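For example, in the KSQL CLI you could set these as session properties before creating the table (the values below are only illustrative):
-- larger values mean results are buffered longer, so fewer intermediate rows reach the output topic
SET 'commit.interval.ms' = '30000';
SET 'cache.max.bytes.buffering' = '10000000';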

Related

Spring batch, compare current processed record with the rest of chunk records

I need to plan a solution for this case. I have a table like the one below, and I have to reduce the number of records that share Product+Service+Origin to the minimum possible set of date ranges:
ID | PRODUCT | SERVICE | ORIGIN | STARTDATE  | ENDDATE
1  | 100001  | 500     | 1      | 10/01/2023 | 15/01/2023
2  | 100001  | 500     | 1      | 12/01/2023 | 18/01/2023
I have to read all the records and, while processing them, check the date intervals in order to unify them:
Record A (10/01/2023 - 15/01/2023) and Record B (12/01/2023 - 18/01/2023): this results in updating the record with ID 1 so that its dates cover the widest range spanned by the two records, 10/01/2023 - 18/01/2023 (extending one of the ranges to the "right" or "left" when necessary).
Another case:
ID | PRODUCT | SERVICE | ORIGIN | STARTDATE  | ENDDATE
1  | 100001  | 500     | 1      | 10/01/2023 | 15/01/2023
2  | 100001  | 500     | 1      | 12/01/2023 | 14/01/2023
In this case the date range of Record 1 is the widest, so we should delete Record 2.
And of course, duplicate date ranges are deleted as well.
Right now we have implemented a chunk step to make this possible:
Reader: reads the data ordered by the common fields (Product-Service-Origin).
Processor: saves the records in a HashMap<String, List> in the job context while the combination "Product+Service+Origin" stays the same. When it detects a new combination, it takes the current list, makes all the pairwise comparisons within it, marks records with auxiliary "delete" or "update" flags, and sends the full list to the writer (after first starting a new list in the map for the new combination of common fields).
Writer: groups the records to delete and to update and calls child writers to execute the statements.
Well, this is how the software has worked for several years, but soon we will have to handle massive numbers of records per case, and the "solution" of using a map in the JobContext has to change.
I was wondering whether Spring Batch has some feature for processing this type of situation that I could use.
Anyway, I am thinking about changing the step where we insert all these records and doing the date-range checks one by one in the processor, but I think the commit interval would then have to be 1 so that each record can be checked against all previously processed records (the table is initially empty when we execute this job). Any other commit interval would check against the database but not against the previously processed, not-yet-committed items, which would make the processing incorrect.
All these cases can have 0-n records sharing Product+Service+Origin.
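To make the intended unification concrete, the merge of overlapping ranges per Product+Service+Origin can be sketched as a plain SQL "gaps and islands" query (the table name REGISTERS and the exact column names are only illustrative, not our real schema):
-- assign each row the latest ENDDATE seen so far within its Product+Service+Origin group
WITH ordered AS (
  SELECT r.*,
         MAX(ENDDATE) OVER (
           PARTITION BY PRODUCT, SERVICE, ORIGIN
           ORDER BY STARTDATE
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_max_end
  FROM REGISTERS r
),
-- start a new "island" whenever a row does not overlap any of the ranges before it
islands AS (
  SELECT o.*,
         SUM(CASE WHEN prev_max_end IS NULL OR STARTDATE > prev_max_end THEN 1 ELSE 0 END)
           OVER (PARTITION BY PRODUCT, SERVICE, ORIGIN
                 ORDER BY STARTDATE
                 ROWS UNBOUNDED PRECEDING) AS island_id
  FROM ordered o
)
-- each island collapses into one row: the earliest start and the latest end
SELECT PRODUCT, SERVICE, ORIGIN,
       MIN(STARTDATE) AS STARTDATE,
       MAX(ENDDATE)   AS ENDDATE
FROM islands
GROUP BY PRODUCT, SERVICE, ORIGIN, island_id;
Each island groups the rows whose ranges overlap (or touch), so both examples above collapse into a single row covering the widest range.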
Sorry for my English; it's difficult to explain this in another language.

KSQL Windowed Aggregation Stream, Session ending

I am grouping events coming from a Kafka topic by one of their properties and over time, using the KSQL Windowed Aggregation, specifically the Session Window.
I have been able to create a stream of "session start signals" as described in this answer.
-- create a stream with a new 'data' topic:
CREATE STREAM DATA (USER_ID INT)
WITH (kafka_topic='data', value_format='json', partitions=2);
-- create a table that tracks user interactions per session:
CREATE TABLE SESSIONS AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
WINDOW SESSION (5 SECONDS)
GROUP BY USER_ID;
-- Create a stream over the existing `SESSIONS` topic.
CREATE STREAM SESSION_STREAM (ROWKEY INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
-- Create a stream of window start events:
CREATE STREAM SESSION_STARTS AS
SELECT * FROM SESSION_STREAM
WHERE WINDOWSTART = WINDOWEND;
Would it be possible to create a stream of "session end signals" every time the windowed aggregation ends?
I'm assuming by this you mean you want to emit an event/row when a session window hasn't seen any new messages that fit into the session for the 5 seconds you've configured for the window?
I don't think this is possible at present.
Because the source data can contain out-of-order records, i.e. an event with a timestamp much earlier than rows already processed, a session window cannot be 'closed' as soon as the 5 SECONDS of inactivity have elapsed.
Existing sessions will, by default, be closed after 24 hours if no new data is received that should be included in the session. This can be controlled by setting a GRACE PERIOD in the window definition.
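For example, the table definition above could become something like this (assuming a KSQL/ksqlDB version that supports the GRACE PERIOD clause; the 10 minutes is just an illustrative value):
CREATE TABLE SESSIONS AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
-- GRACE PERIOD controls how long late, out-of-order data is still accepted before the window is closed
WINDOW SESSION (5 SECONDS, GRACE PERIOD 10 MINUTES)
GROUP BY USER_ID;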
The closing of windows once the grace period has elapsed does not currently result in any row being output. However, KLIP 10 - Add Suppress to KSQL may give you what you want once it is implemented.

KSQL Query returning unexpected values in simple aggregation

I am getting unexpected results from a KSQL query against a KTable that is itself defined by a Kafka topic. The KTABLE is "Trades" and it is backed by the compacted topic "localhost.dbo.TradeHistory". It is supposed to contain the latest information for a stock trade keyed by a TradeId. The topic's key is TradeId. Each trade has an AccountId and I'm trying to construct a query to get the SUM of the Amount(s) of the trades grouped by account.
The Definition of the Trades KTABLE
ksql> create table Trades(TradeId int, AccountId int, Spn int, Amount double) with (KAFKA_TOPIC = 'localhost.dbo.TradeHistory', VALUE_FORMAT = 'JSON', KEY = 'TradeId');
...
ksql> describe extended Trades;
Name : TRADES
Type : TABLE
Key field : TRADEID
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : JSON
Kafka topic : localhost.dbo.TradeHistory (partitions: 1, replication: 1)
Field | Type
---------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
TRADEID | INTEGER
ACCOUNTID | INTEGER
SPN | INTEGER
AMOUNT | DOUBLE
---------------------------------------
Local runtime statistics
------------------------
consumer-messages-per-sec: 0 consumer-total-bytes: 3709 consumer-total-messages: 39 last-message: 2019-10-12T20:52:16.552Z
(Statistics of the local KSQL server interaction with the Kafka topic localhost.dbo.TradeHistory)
The Configuration of the localhost.dbo.TradeHistory Topic
/usr/bin/kafka-topics --zookeeper zookeeper:2181 --describe --topic localhost.dbo.TradeHistory
Topic:localhost.dbo.TradeHistory PartitionCount:1 ReplicationFactor:1 Configs:min.cleanable.dirty.ratio=0.01,delete.retention.ms=100,cleanup.policy=compact,segment.ms=100
Topic: localhost.dbo.TradeHistory Partition: 0 Leader: 1 Replicas: 1 Isr: 1
In my test, I'm adding messages to the localhost.dbo.TradeHistory topic with TradeId 2 that simply change the amount of the trade. Only the Amount is updated; the AccountId remains 1.
The messages in the localhost.dbo.TradeHistory topic
/usr/bin/kafka-console-consumer --bootstrap-server broker:9092 --property print.key=true --topic localhost.dbo.TradeHistory --from-beginning
... (earlier values redacted) ...
2 {"TradeHistoryId":47,"TradeId":2,"AccountId":1,"Spn":1,"Amount":106.0,"__table":"TradeHistory"}
2 {"TradeHistoryId":48,"TradeId":2,"AccountId":1,"Spn":1,"Amount":107.0,"__table":"TradeHistory"}
The dump of the topic, above, shows the Amount of Trade 2 (in Account 1) changing from 106.0 to 107.0.
The KSQL Query
ksql> select AccountId, count(*) as Count, sum(Amount) as Total from Trades group by AccountId;
1 | 1 | 106.0
1 | 0 | 0.0
1 | 1 | 107.0
The question is: why does the KSQL query shown above return an "intermediate" value each time I publish a trade update? As you can see, the Count and Total fields show 0 and 0.0, and then the KSQL query immediately "corrects" them to 1 and 107.0. I'm a bit confused by this behavior.
Can anyone explain it?
Many thanks.
Thanks for your question. I've added an answer to our knowledge base: https://github.com/confluentinc/ksql/pull/3594/files.
When KSQL sees an update to an existing row in a table it internally emits a CDC event, which contains the old and new value.
Aggregations handle this by first undoing the old value, before applying the new value.
So, in the example above, when the second insert happens, KSQL first undoes the old value. This results in the COUNT going down by 1, and the SUM going down by the old value of 106.0, i.e. going down to zero.
Then KSQL applies the new row value, which sees the COUNT going up by 1 and the SUM going up by the new value 107.0.
By default, KSQL is configured to buffer results for up to 2 seconds, or 10MB of data, before flushing the results to Kafka. This is why you may see a slight delay on the output when inserting values in this example. If both output rows are buffered together then KSQL will suppress the first result. This is why you often do not see the intermediate row being output. The configurations commit.interval.ms and cache.max.bytes.buffering, which are set to 2 seconds and 10MB, respectively, can be used to tune this behaviour. Setting either of these settings to zero will cause KSQL to always output all intermediate results.
If you are seeing these intermediate results output every time, then it's likely you have set one, or both, of these settings to zero.
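To check what your CLI session currently has these set to, and to drop any session-level overrides, you can run something like:
SHOW PROPERTIES;
UNSET 'cache.max.bytes.buffering';
UNSET 'commit.interval.ms';
Note that UNSET only removes overrides made in the current CLI session; values overridden in the KSQL server configuration have to be changed there instead.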
We have a GitHub issue to enhance KSQL to make use of Kafka Streams' suppression functionality, which would give users more control over how results are materialized.

Why does join use rows that were sent after watermark of 20 seconds?

I’m using watermark to join two streams as you can see below:
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")
val join_df = order_wm
.join(invoice_wm, order_wm.col("s_order_id") === invoice_wm.col("order_id"))
My understanding of the above code is that it will keep each of the streams for 20 seconds. But when I feed one stream now and the other more than 20 seconds later, they still get joined. It seems like Spark is holding the data in memory even after the watermark has expired. I even tried a delay of 45 seconds and the rows were still joined.
This is creating confusion in my mind regarding watermarks.
But when I feed one stream now and the other more than 20 seconds later, they still get joined.
That's possible, since the time measured is not the time at which events arrive, but the time inside the watermarked field, i.e. tstamp_trans. You have to make sure that the latest time seen in tstamp_trans is more than 20 seconds after the rows that will participate in the join.
Quoting the doc from: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking
In other words, you will have to do the following additional steps in the join.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input. This constraint can be defined in one of the two ways.
Time range join conditions (e.g. ...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR),
Join on event-time windows (e.g. ...JOIN ON leftTimeWindow = rightTimeWindow).
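Applied to the join in the question, the first option (a time-range join condition) could look roughly like the following Spark SQL sketch. It assumes order_wm and invoice_wm are registered as temp views over the watermarked streams, and that "invoices arrive at most 20 seconds after their order" is the constraint you actually want; adjust the interval to your case:
SELECT o.*, i.*
FROM order_wm AS o
JOIN invoice_wm AS i
  ON o.s_order_id = i.order_id
 -- the event-time range lets Spark know when buffered rows can no longer match and can be dropped
 AND i.tstamp_trans BETWEEN o.tstamp_trans
                        AND o.tstamp_trans + INTERVAL 20 SECONDS
With this constraint in place (in addition to the watermarks already defined), old join state is eventually cleaned up instead of being kept indefinitely.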

KSQL Hopping Window : accessing only oldest subwindow

I am tracking the rolling sum of a particular field by using a query which looks something like this:
SELECT id, SUM(quantity) AS quantity from stream \
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) \
GROUP BY id;
Now, for every input tick, it seems to return 6 different aggregated values, which I guess are for the following time periods:
[start, start+60] seconds
[start+10, start+60] seconds
[start+20, start+60] seconds
[start+30, start+60] seconds
[start+40, start+60] seconds
[start+50, start+60] seconds
What if I am interested in getting only the [start, start+60] seconds result for every tick that comes in? Is there any way to get ONLY that?
Because you specify a hopping window, each record falls into multiple windows and all windows need to be updated when processing a record. Updating only one window would be incorrect and the result would be wrong.
Compare the Kafka Streams docs about hopping windows (Kafka Streams is KSQL's internal runtime engine): https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
Update
Kafka Streams is adding proper sliding window support via KIP-450 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-450%3A+Sliding+Window+Aggregations+in+the+DSL). This should allow sliding windows to be added to ksqlDB later, too.
I was in a similar situation, and creating a user-defined function that keeps only the window where collect_list(column).size() equals the window duration (in periods) looks like a promising approach.
In the UDF, take a List parameter to receive the collected values of one of your aggregation's base columns; if the list size equals the number of periods in the hopping window, return the aggregate, otherwise return null.
From this, create a table that selects the data and transforms it with the UDF.
Then create another table on top of that one and filter out the null values on the transformed column.
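A rough KSQL sketch of that idea, where input_stream stands for the stream in the question and FULL_WINDOW_SUM is the hypothetical UDF described above (it receives the collected list and returns the sum only for the window it considers complete, NULL otherwise):
-- aggregate per hopping window, collecting the raw values so the (hypothetical) UDF can inspect them
CREATE TABLE rolling AS
  SELECT id, FULL_WINDOW_SUM(COLLECT_LIST(quantity)) AS quantity
  FROM input_stream
  WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS)
  GROUP BY id;
-- keep only the rows where the UDF produced a value
CREATE TABLE full_windows AS
  SELECT id, quantity
  FROM rolling
  WHERE quantity IS NOT NULL;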