EMIT FINAL with tumbling window - apache-kafka

Use case: take the messages from the KSQL stream (backed by topic visitor_topic1) and push them into a new Kafka topic (final_visitor) every 1 minute.
CREATE TABLE final_visitors_per_min
  WITH (KAFKA_TOPIC='final_visitor', KEY_FORMAT='JSON', PARTITIONS=3, REPLICAS=3)
AS SELECT
  id,
  visitorName
FROM vister_List_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY id
EMIT CHANGES;
So I created a table that reads from the stream (over visitor_topic1) with a tumbling window of SIZE 1 MINUTE and EMIT CHANGES.
With EMIT CHANGES, topic 2 gets a message immediately whenever topic 1 receives one; it does not wait 1 minute before sending.
With EMIT FINAL, no message is emitted to topic 2 at all.
Does anyone have a suggestion? Where is the problem? I want to receive the messages with a 1-minute delay.
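For reference, a hedged sketch of what the EMIT FINAL variant could look like (assuming a ksqlDB version that supports EMIT FINAL on windowed aggregations; LATEST_BY_OFFSET is used because a grouped query can only project grouping columns and aggregates, and the grace period is an illustrative choice). One caveat worth checking: EMIT FINAL emits a window's result only once a later record advances stream time past the window end plus grace period, so on a quiet topic nothing is emitted until newer data arrives.

-- Hedged sketch, not the confirmed fix for the query above
CREATE TABLE final_visitors_per_min
  WITH (KAFKA_TOPIC='final_visitor', KEY_FORMAT='JSON', PARTITIONS=3, REPLICAS=3)
AS SELECT
  id,
  LATEST_BY_OFFSET(visitorName) AS visitorName
FROM vister_List_stream
WINDOW TUMBLING (SIZE 1 MINUTE, GRACE PERIOD 0 SECONDS)
GROUP BY id
EMIT FINAL;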

Related

Kafka Stream aggregate function sinking each joined record in an incremental way, instead of a single list of aggregated records

I’m having some issues with my Kafka Streams implementation in production. I’ve implemented a function that takes a KTable and a KStream and yields another KTable with aggregated results based on the join of these two inputs. The idea is to iterate over a list in the KStream input, join each entry with the KTable, aggregate the matched KTable events into a list, and sink to a topic a record containing the original KStream event and the list of joined KTable events (1-to-N join).
Context
This is how my component interacts with its context. MovementEvent contains a list of transaction_ids that should match the transaction_id of TransactionEvent, where the joiner should match & generate a new event (Sinked Event) with the original MovementEvent and a list of the matched TransactionEvent.
For reference, the Movement topic has 12 million records, while the Transaction topic has 21 million.
Implementation
import java.util.function.BiFunction;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class SinkEventProcessor implements BiFunction<
        KTable<TransactionKey, Transaction>,
        KStream<String, Movement>,
        KTable<SinkedEventKey, SinkedEvent>> {

    @Override
    public KTable<SinkedEventKey, SinkedEvent> apply(final KTable<TransactionKey, Transaction> transactionTable,
                                                     final KStream<String, Movement> movementStream) {
        return movementStream
                // [A]
                .flatMap((movementKey, movement) -> movement
                        .getTransactionIds()
                        .stream()
                        .distinct()
                        .map(transactionId -> new KeyValue<>(
                                TransactionKey.newBuilder()
                                        .setTransactionId(transactionId)
                                        .build(),
                                movement))
                        .toList())
                // [B]
                .join(transactionTable, (movement, transaction) -> Pair.newBuilder()
                        .setMovement(movement)
                        .setTransaction(transaction)
                        .build())
                // [C]
                .groupBy((transactionKey, pair) -> SinkedEventKey.newBuilder()
                        .setMovementId(pair.getMovement().getMovementId())
                        .build())
                // [D]
                .aggregate(SinkedEvent::new, (key, pair, collectable) ->
                        collectable.setMovement(pair.getMovement())
                                .addTransaction(pair.getTransaction()));
    }
}
[A] I have started the implementation by iterating the Movement KStream, extracting the transactionId and creating a TransactionKey to use as the new key for the following operation, to facilitate the join with each transactionId present in the Movement entity. This operation returns a KStream<TransactionKey, Movement>
[B] Joins the formerly transformed KStream and adds each value to an intermediate pair. Returns a KStream<TransactionKey, Pair>.
[C] Groups the pairs by movementId and constructs the new key (SinkedEventKey) for the sink operation.
[D] Aggregates into the result object (SinkedEvent) by adding the transaction to the list. This operation will also sink to the topic as a KTable<SinkedEventKey, SinkedEvent>
Problem
The problem starts when we begin processing the stream: the sink operation of the processor generates more records than it should. For instance, for a Movement with 4 transaction_id values, the output topic ends up looking like this:
partition | offset | count of [TransactionEvent] | expected count
----------|--------|------------------------------|---------------
0         | 1      | 1                            | 4
0         | 2      | 2                            | 4
0         | 3      | 4                            | 4
0         | 4      | 4                            | 4
And the same happens for other records (e.g. a Movement with 13 transaction_id values yields 13 messages). So, for some reason I can't comprehend, the aggregate operation sinks on every update instead of waiting, collecting into the list, and sinking only once.
I've tried to reproduce it in a development cluster with exactly the same settings, to no avail. Everything seems to work properly when I try to reproduce it there (a Movement with 8 transactions produces only 1 record), but whenever I bring it to production it doesn't work as intended. I'm not sure what I'm missing; any help?
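A hedged note, not a confirmed diagnosis: a non-windowed KTable aggregation logically produces an update per input record, and how many of those intermediate updates actually reach the output topic depends on Kafka Streams record caching and the commit interval, which can behave very differently under production load than on an idle development cluster. A minimal sketch of the settings involved (property names are the standard Kafka Streams configs; the values are illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

final Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sink-event-processor");       // hypothetical application id
// A larger cache coalesces more intermediate aggregate updates before forwarding them downstream.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
// The cache is flushed (and pending updates emitted) at least once per commit interval.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000);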

How to ensure outer NULL join results output in spark streaming if the future events are delayed

In a scenario of Spark stream-stream outer join:
val left = spark.readStream.format("delta").load("...")
.withWatermark("enqueuedTime", "1 hour")
val right = spark.readStream.format("delta").load("...")
.withWatermark("enqueuedTime", "1 hour")
val res = left.as("left").join(right.as("right"),
expr("left.key = right.key AND (left.enqueuedTime BETWEEN right.enqueuedTime - INTERVAL 1 hour AND right.enqueuedTime + INTERVAL 1 hour)"),
"left_outer")
res.writeStream(....)
Given some data in the left and right streams, how can I ensure that a record like:
2, left_value1, 2022-04-18T12:39:49.370+0000, NULL, NULL, NULL
is output after a given period of time, even if new events aren't flowing through the stream?
I'm only able to get it if new events arrive on both streams, like:
INSERT into left_df VALUES ("004", "left_df_value", current_timestamp() + INTERVAL 5 hours);
INSERT into right_df VALUES ("004", "right_df_value", current_timestamp() + INTERVAL 5 hours);
With those, Spark advances the watermarks and understands that it is now safe to output the null-padded record. But how can I still output it after some kind of timeout, without new records arriving on both streams?
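One workaround to consider (a sketch under assumptions, not a confirmed fix): Spark only advances a watermark when new rows arrive on that input, so a common pattern is to union a low-volume heartbeat source into each side so that event time keeps moving even when real traffic pauses; the heartbeat rows are filtered out after the join. The (key, value, enqueuedTime) layout below is an assumption mirroring the example above.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Reshape the built-in "rate" source (one row per second) into the assumed layout of the real inputs.
def heartbeat(spark: SparkSession): DataFrame =
  spark.readStream
    .format("rate")
    .option("rowsPerSecond", "1")
    .load()
    .select(
      lit("heartbeat").as("key"),
      lit(null).cast("string").as("value"),
      col("timestamp").as("enqueuedTime"))

// Union the heartbeat into the raw input *before* withWatermark so its timestamps also drive the watermark:
// val left = rawLeft.unionByName(heartbeat(spark)).withWatermark("enqueuedTime", "1 hour")
// ...and drop key = 'heartbeat' rows from the join result.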

KSQLDB Push query doesn't emit all changes

I'm new to ksqlDB and I started experimenting with very basic aggregation. I have created a topic that receives a few messages every second; those get streamed and then output to a materialised table.
I can see my stream is up to date with the topic, but the materialised table doesn't output each change. I couldn't find anywhere in the docs how to emit every update, rather than just one every 1 or 2 seconds.
This is my setup:
Topic (JSON):
key: username, value: { balance_change }
Stream:
create stream balance_stream (user varchar key, balance_change bigint)
with (kafka_topic='balance', value_format='JSON');
Materialised Table:
create table balance_table as
select user,
sum(balance_change) balance
from balance_stream
group by user
emit changes;
And as you can see in the video below, it's very slow to get the total balance for my user bob:
https://www.youtube.com/watch?v=0HmCA3ueUo0
How can I get all the updates from my balance_table?
There are fewer than 900 messages in the table, so summing all the values should be instant.
Name : BALANCE_TABLE
Type : TABLE
Timestamp field : Not set - using <ROWTIME>
Key format : KAFKA
Value format : JSON
Kafka topic : BALANCE_TABLE (partitions: 1, replication: 1)
Statement : CREATE TABLE BALANCE_TABLE WITH (KAFKA_TOPIC='BALANCE_TABLE', PARTITIONS=1, REPLICAS=1) AS SELECT
BALANCE_STREAM.USER USER,
SUM(BALANCE_STREAM.BALANCE_CHANGE) BALANCE
FROM BALANCE_STREAM BALANCE_STREAM
GROUP BY BALANCE_STREAM.USER
EMIT CHANGES;
Field | Type
------------------------------------------
USER | VARCHAR(STRING) (primary key)
BALANCE | BIGINT
------------------------------------------
Queries that write from this TABLE
-----------------------------------
CTAS_BALANCE_TABLE_155 (RUNNING) : CREATE TABLE BALANCE_TABLE WITH (KAFKA_TOPIC='BALANCE_TABLE', PARTITIONS=1, REPLICAS=1) AS SELECT BALANCE_STREAM.USER USER, SUM(BALANCE_STREAM.BALANCE_CHANGE) BALANCE FROM BALANCE_STREAM BALANCE_STREAM GROUP BY BALANCE_STREAM.USER EMIT CHANGES;
For query topology and execution plan please run: EXPLAIN <QueryId>
Local runtime statistics
------------------------
messages-per-sec: 0.28 total-messages: 814 last-message: 2021-06-17T16:08:43.091Z
(Statistics of the local KSQL server interaction with the Kafka topic BALANCE_TABLE)
Consumer Groups summary:
Consumer Group : _confluent-ksql-ksqldbquery_CTAS_BALANCE_TABLE_155
Kafka topic : balance
Max lag : 0
Partition | Start Offset | End Offset | Offset | Lag
------------------------------------------------------
0 | 0 | 21230 | 21230 | 0
------------------------------------------------------
The issue was with the default server settings; it can be fixed by adjusting the value of 'commit.interval.ms'.
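A hedged sketch of what that adjustment might look like on the ksqlDB server (the ksql.streams.* prefix passes the settings through to the underlying Kafka Streams runtime; the values are illustrative):

# ksql-server.properties
ksql.streams.commit.interval.ms=100
# Optionally disable record caching so every update is forwarded rather than coalesced.
ksql.streams.cache.max.bytes.buffering=0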

Processing duplicates on different partition and offset in kafka consumer application

We are trying to consume records from our producers. The topic has 2 partitions as of now, but that may be increased in the future to improve our throughput. We consume with 2 consumer threads, but we are getting duplicates. Our producers say they have included a key as well, but that still doesn't fix the issue, and we're not sure why.
From the consumer end, because of the duplicates, our whole process cycle gets triggered twice, which we want to avoid. Our concern is that if we increase partitions in the future, the duplicates will increase as well.
Our process cycle :
Getting records from the stream --> upsert into a table based on key --> fetch records based on key and insert them into a table --> call API and update records
Log :
coming from stream :004582777into offset 500405and partition 0
coming from stream :004582777into offset 499525and partition 1
skipping tax id 004582777
skipping tax id 004582777
coming from stream :002402419into offset 499526and partition 1
coming from stream :002402419into offset 500406and partition 0
skipping tax id 002402419
skipping tax id 002402419
coming from stream :010546936into offset 499527and partition 1
coming from stream :010546936into offset 500407and partition 0
skipping tax id 010546936
skipping tax id 010546936
coming from stream :010646378into offset 500408and partition 0
coming from stream :010646378into offset 499528and partition 1
skipping tax id 010646378
skipping tax id 010646378
coming from stream :010866219into offset 500409and partition 0
coming from stream :010866219into offset 499529and partition 1
skipping tax id 010866219
skipping tax id 010866219
coming from stream :019541747into offset 499530and partition 1
coming from stream :019541747into offset 500410and partition 0
skipping tax id 019541747
skipping tax id 019541747
coming from stream :020438119into offset 500411and partition 0
coming from stream :020438119into offset 499531and partition 1
skipping tax id 020438119
skipping tax id 020438119
coming from stream :020594385into offset 499532and partition 1
coming from stream :020594385into offset 500412and partition 0
skipping tax id 020594385
skipping tax id 020594385
coming from stream :043514479into offset 500413and partition 0
coming from stream :043514479into offset 499533and partition 1
skipping tax id 043514479
skipping tax id 043514479
coming from stream :030446242into offset 500414and partition 0
coming from stream :030446242into offset 499534and partition 1
record count is more than zero :1 for tax id:030446242 <--- we are calling the API 2 times because of the 2 occurrences
record count is more than zero :1 for tax id:030446242
How can we make sure to process only one occurrence of a record even when we get duplicates from different partitions? Because both records are processed by the 2 consumer threads in parallel, for some records both instances get captured in the table and for some only one.
Code :
@KafkaListener(topics = "${app.topic}", groupId = "${app.group_id_config}")
public void run(ConsumerRecord<String, GenericRecord> record, Acknowledgment acknowledgement) throws Exception {
    try {
        prov_tin_number = record.value().get("providerTinNumber").toString();
        prov_tin_type = record.value().get("providerTINType").toString();
        enroll_type = record.value().get("enrollmentType").toString();
        vcp_prov_choice_ind = record.value().get("vcpProvChoiceInd").toString();
        error_flag = "";
        dataexecutor.peStremUpsertTbl(prov_tin_number, prov_tin_type, enroll_type, vcp_prov_choice_ind, error_flag,
                record.partition(), record.offset());
        acknowledgement.acknowledge();
    } catch (Exception ex) {
        System.out.println(record);
        System.out.println(ex.getMessage());
    }
}
getting duplicate from different partition
Kafka knows nothing about the data; you will get all records at all partitions/offsets.
You can add an implementation of RecordFilterStrategy to the listener container factory to filter out duplicates: https://docs.spring.io/spring-kafka/docs/2.6.2/reference/html/#filtering-messages
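A minimal sketch of that suggestion, assuming Spring for Apache Kafka and that the producer really does set the tax id as the record key; the in-memory set here is only illustrative and is not a production-grade deduplication store:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.avro.generic.GenericRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, GenericRecord> kafkaListenerContainerFactory(
        ConsumerFactory<String, GenericRecord> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, GenericRecord> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // Remember recently seen keys; returning true from the strategy discards the record
    // before it ever reaches the @KafkaListener method.
    Set<String> seenKeys = ConcurrentHashMap.newKeySet();
    factory.setRecordFilterStrategy(rec -> !seenKeys.add(rec.key()));
    return factory;
}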

Why Spark Structured Streaming window aggregation evaluates after each trigger

With Spark 2.2.0, I am reading data from Kafka with 2 columns, "textcol" and "time". The "time" column holds the latest processing time. I want to get the count of unique values of "textcol" within a fixed window duration of 20 seconds. My trigger duration is 10 seconds.
For example, if within one 20-sec window trigger1 has textcol=a and trigger2 has textcol=b, then I expect the output below after 20 sec:
textcol cnt
a 1
b 1
I used the code below for dataset ds:
ds.groupBy(functions.col("textcol"),
functions.window(functions.col("time"), "20 seconds"))
.agg(functions.count("textcol").as("cnt"))
.writeStream().trigger(Trigger.ProcessingTime("10 seconds"))
.outputMode("update")
.format("console").start();
But I am getting output twice due to 2 triggers after 20 sec
Trigger1:
textcol cnt
a 1
Trigger2:
textcol cnt
b 1
So why does the window not aggregate the results and output them once after 20 sec, instead of emitting on every 10-sec trigger?
Is there any other way to achieve it in spark structured streaming?
change your .outputMode("update") to .outputMode("complete")
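Another option the question itself asks about, sketched under the assumption that the "time" column is usable as event time: adding a watermark and switching to "append" output mode makes Spark emit each 20-second window exactly once, after the watermark has passed the window's end, rather than on every trigger.

import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.Trigger;

ds.withWatermark("time", "10 seconds")                       // tolerated lateness; illustrative value
  .groupBy(functions.window(functions.col("time"), "20 seconds"),
           functions.col("textcol"))
  .agg(functions.count("textcol").as("cnt"))
  .writeStream()
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .outputMode("append")                                      // each window is emitted once, when it is final
  .format("console")
  .start();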