Kafka Streams timestamp synchronization in a KStream/KTable join

I have an inner KStream/KTable join with the following sequence of messages:
table_evt_at_t1 --> stream_evt_at_t2 --> table_evt_at_t3 --> stream_evt_at_t4
the join produces:
(stream_evt_at_t2, table_evt_at_t1) + (stream_evt_at_t4, table_evt_at_t3)
So far, everything is OK.
The unexpected result shows up when I reset the streams application (with kafka-streams-application-reset.sh) and replay all the events:
(stream_evt_at_t2, table_evt_at_t3) + (stream_evt_at_t4, table_evt_at_t3)
It seems that Kafka Streams doesn't take the timestamps into account when processing the events: it populates the KTable and then processes the KStream, fetching the latest KTable value (table_evt_at_t3) for both KStream events.
Note that I am using Kafka Streams 2.3.1, a custom TimestampExtractor, and the property max.task.idle.ms = 10 * 1000L, as KIP-353 suggests.
Is this the expected behaviour?

The first result the join produces is expected behavior, since KStream-KTable joins are not windowed but timestamp-synchronized.
The result after a reset/replay is also expected behavior, since a KTable only keeps the latest value for a given key, and "table_evt_at_t3" is that latest value ("table_evt_at_t1" has already been overwritten).
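For reference, a minimal sketch of such a setup (topic names, serdes, and the trivial pass-through timestamp extractor are placeholders, not the asker's actual code):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StreamTableJoinDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-table-join-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // KIP-353: let a task idle up to 10s while some of its input buffers
        // are empty, so processing can follow event timestamps across topics.
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 10 * 1000L);

        StreamsBuilder builder = new StreamsBuilder();

        KTable<String, String> table = builder.table("table-topic",
                Consumed.with(Serdes.String(), Serdes.String())
                        .withTimestampExtractor((record, prev) -> record.timestamp()));

        KStream<String, String> stream = builder.stream("stream-topic",
                Consumed.with(Serdes.String(), Serdes.String())
                        .withTimestampExtractor((record, prev) -> record.timestamp()));

        stream.join(table, (streamValue, tableValue) -> streamValue + "," + tableValue)
              .to("join-output");

        new KafkaStreams(builder.build(), props).start();
    }
}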

Related

Can't join between stream and stream in ksqldb

I would like to ask about a problem where two streams will not join in ksqldb.
Situations and Problems
I have created two ksqldb streams from topics containing events from different databases (postgresql, mssql) and am joining them on a specific column in both streams.
To help you understand, I will call the two streams stream1 and stream2, and the join target column target_col.
type               name      join target column
stream in ksqldb   stream1   target_col
stream in ksqldb   stream2   target_col
The problem is that these two streams are not joined by the query below.
select * from stream1 join stream2 within 1 minutes on stream1.target_col=stream2.target_col emit changes
1. Meeting the join requirements.
According to the official ksqldb documentation, co-partitioning is essential for a join and imposes the following three requirements; I confirmed that both streams satisfy them.
Co-partitioning Requirements (source)
1. The input records for the join must have the same key schema.
-> The describe stream1 and describe stream2 commands confirmed that the join key schema of stream1 and stream2 is the same (string).
2. The input records must have the same number of partitions on both sides.
-> The partition count for both streams was specified identically in the statement (CREATE STREAM ~ WITH(PARTITIONS=1, ~ )) when the streams were declared. The source topic each stream subscribes to also has exactly one partition.
3. Both sides of the join must have the same partitioning strategy.
-> The source topic each stream subscribes to has one partition, so all records land in the same partition regardless of partitioning strategy; differing strategies therefore shouldn't matter.
2. Time difference between records.
The timestamp and partition number of the records were verified through the pseudocolumns.
The queries used are as follows:
select target_col, rowtime, rowpartition from stream1 emit changes
select target_col, rowtime, rowpartition from stream2 emit changes
When the join key column has the same value, the partition number is the same (e.g. 0) and the record timestamps are no more than 2 seconds apart.
Therefore, I don't think the time window (1 minute) of the query in question (select * from stream1 join stream2 within 1 minutes on stream1.target_col=stream2.target_col emit changes) is the problem.
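As a cross-check outside ksqlDB, a plain Java consumer can print the partition and timestamp that ROWPARTITION and ROWTIME reflect by default; a minimal sketch (the topic name and bootstrap address are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TimestampInspector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "timestamp-inspector");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("source-topic-1"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // These are the raw values ksqlDB exposes as ROWPARTITION/ROWTIME.
                System.out.printf("partition=%d timestamp=%d key=%s%n",
                        record.partition(), record.timestamp(), record.key());
            }
        }
    }
}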
3. Data Acquisition
Here's how the data reaches the topics that the two streams subscribe to:
postgresql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream1
mssql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream2
Because I use data from different databases, I used the appropriate JDBC driver jar (mssql-jdbc-7.2.1.jre8.jar, postgresql-42.3.1.jar) for each database in the same Kafka Connect instance.
I built the Kafka ecosystem using the official Confluent Docker images (zookeeper, broker, connect, ksqldb-server, ksqldb-cli).
In this situation, please advise if there is anything I can do to solve the join problem.
Thank you.

Is it possible to specify a Kafka Streams topology starting sequence

Let's say I have Topology A that streams from Source A to Stream A, and Topology B that streams from Source B to Stream B (used as Table B).
Then I have a stream/table join that joins Stream A and Table B.
As expected, the join only triggers when something arrives in Stream A and there's a correlating record in Table B.
I have an architecture where the source topics are still being populated while the Kafka Streams app is down, and messages always arrive in source B before source A.
I am finding that when I restart Kafka Streams (by redeploying the app), the topology that feeds Stream A can start BEFORE the topology that feeds Table B.
As a result, the join won't trigger.
I know this is probably the expected behaviour; there's no coordination between separate topologies.
I was wondering if there is a mechanism, a delay or something, that can ORDER/sequence the start of the topologies?
Once they are up, they are fine, as I can ensure the messages arrive in the right order.
I think you want to try setting max.task.idle.ms to something greater than the default (0), maybe 30 seconds? It's tough to give a precise answer, so you'll have to experiment some.
HTH,
Bill
If you need to trigger a downstream result from both sides of the join, you have to do a KTable-to-KTable join. From the javadoc:
"The join is computed by (1) updating the internal state of one KTable and (2) performing a lookup for a matching record in the current (i.e., processing time) internal state of the other KTable. This happens in a symmetric way, i.e., for each update of either this or the other input KTable the result gets updated."
EDIT: Even if you do a stream-to-KTable join, which triggers only when a new stream event is emitted on the left side of the join (KTable updates do not emit downstream events), when you start the topology Streams will try to re-synchronize processing using the timestamps of the input events, so there should not be a race condition between the consumption rates of the KTable source and the stream topic. BUT, my understanding is that this is best-effort: e.g. if two events have exactly the same timestamp, Streams cannot deduce which should be processed first.
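For illustration, a minimal sketch of the table-table variant (topic names and serdes are placeholders):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;

public class TableTableJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Reading both sources as tables means an update on either side
        // re-evaluates the join and emits an updated result downstream.
        KTable<String, String> tableA = builder.table("source-a",
                Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> tableB = builder.table("source-b",
                Consumed.with(Serdes.String(), Serdes.String()));

        tableA.join(tableB, (a, b) -> a + "|" + b)
              .toStream()
              .to("joined-output");
    }
}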

Consumer-side behavior when using CoGroupByKey in Apache Beam

I have a Beam job that reads data from 2 Kafka streams and joins them using a common key present in both. I am not using the key Kafka partitions by to do the join. So essentially Kafka partitions the data in both streams by some key; my consumer/Beam job consumes this data from the two streams, extracts the actual key I want to join on into a PCollection, and then runs CoGroupByKey.
I see the join happen for several events, but when I query for specific events, I do not see the join happen. I have applied the same windowing to both streams. This makes me question whether a consumer gets the right data from the two streams to perform the join. Say consumer 0 consumes from partition 0 of both streams: is there a chance that Kafka partitions the data using some key x and my consumer 0 does not get the matching records from both streams to join? I was told that CoGroupByKey ensures that the right data lands in each consumer, but I am not able to visualize this. How does using CoGroupByKey affect behavior on the input side?
CoGroupByKey will join data across all input partitions. I suspect the issue is windowing: are the unjoined items in the same window? (CoGroupByKey does not join across windows, so items that land in separate windows do not get joined. You could look at using session windows if fixed windows don't work.)
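For concreteness, a minimal sketch of the CoGroupByKey shape in the Beam Java SDK (names are placeholders; it assumes both inputs were already keyed by the join key and windowed identically):

import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class CoGroupByKeySketch {
    // stream1 and stream2 must already be keyed by the join key and share
    // the same windowing. CoGroupByKey reshuffles by that key, so which
    // Kafka partition a record originally came from does not matter.
    static PCollection<KV<String, CoGbkResult>> join(
            PCollection<KV<String, String>> stream1,
            PCollection<KV<String, String>> stream2,
            TupleTag<String> tag1,
            TupleTag<String> tag2) {
        return KeyedPCollectionTuple.of(tag1, stream1)
                                    .and(tag2, stream2)
                                    .apply(CoGroupByKey.create());
    }
}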

KStream-KTable LeftJoin: join occurred while KTable is not fully loaded

I am trying to use a KStream-KTable leftJoin to enrich items from topic A with topic B. Topic A is my KStream, and topic B is my KTable, which has around 23M records. The keys of the two topics do not match, so I have to turn the KStream of topic B into a KTable (with a new key) using a reducer.
Here is my code:
KTable<String, String> ktable = streamsBuilder
    .stream("TopicB", Consumed.withTimestampExtractor(new customTimestampsExtractor()))
    .filter((key, value) -> {...})
    .transform(new KeyTransformer()) // generate a new key
    .groupByKey()
    .reduce((aggValue, newValue) -> {...});

streamsBuilder
    .stream("TopicA")
    .filter((key, value) -> {...})
    .transform(...)
    .leftJoin(ktable, new ValueJoiner({...}))
    .transform(...)
    .to("result");
1) The KTable initialization is slow (around 2000 msg/s); is this normal? My topic has only 1 partition. Any way to improve the performance?
I tried setting the following to reduce write throughput, but it doesn't seem to improve much:
CACHE_MAX_BYTES_BUFFERING_CONFIG = 10 * 1024 * 1024
COMMIT_INTERVAL_MS_CONFIG = 15 * 1000
2) The join occurs before the KTable has finished loading from Topic B.
Here are the offsets at the time the join occurred (CURRENT-OFFSET/LOG-END-OFFSET):
Topic A: 32725/32726 (Lag 1)
Topic B: 1818686/23190390 (Lag 21371704)
I checked the timestamp of the record from Topic A that failed; it is from 4 days ago, while the last record of Topic B that was processed is from 6 days ago.
As I understand it, Kafka Streams processes records based on timestamp, so I don't understand why, in my case, the KStream (Topic A) didn't wait until the KTable (Topic B) was loaded up to the point of 4 days ago before triggering the join.
I also tried making the timestamp extractor return 0, but that didn't work either.
Updated: When setting the timestamp to 0, I am getting the following error:
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException: This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producerID are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception.
I also tried setting max.task.idle.ms to > 0 (3 seconds and 30 minutes), but I still got the same error.
Updated: I fixed the 'UnknownProducerIdException' error by setting the customTimestampsExtractor to return a time 6 days ago, which is still earlier than the records from Topic A. I think (not sure) that setting it to 0 triggered retention on the changelog, which caused this error. However, the join still isn't working: it still happens before the KTable finishes loading. Why is that?
I am using Kafka Streams 2.3.0.
Am I doing anything wrong here? Many thanks.
1. The KTable initialization is slow (around 2000 msg/s); is this normal?
This depends on your network, and I think the limiting factor is the consumption rate of TopicB. The two configs you use, CACHE_MAX_BYTES_BUFFERING_CONFIG and COMMIT_INTERVAL_MS_CONFIG, choose a trade-off between how much output the KTable produces (since the KTable changelog is a stream of revisions) and how much latency you accept when updating the KTable's underlying topic and downstream processors. Take a detailed look at the Kafka Streams caching config for state stores and the blog section Tables, Not Triggers.
I think a good way to increase the consumption rate of TopicB is to add more partitions.
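As a sketch, those two configs map onto StreamsConfig like this (values taken from the question; note this only shifts the trade-off, it does not raise the consumption rate):

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class ThroughputTuning {
    public static Properties tunedConfig() {
        Properties props = new Properties();
        // Bigger cache: more KTable updates are compacted in memory before
        // being forwarded downstream and written to the changelog.
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
        // Longer commit interval: the cache is flushed less often.
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 15 * 1000L);
        return props;
    }
}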
KStream.leftJoin(KTable, ...) is always a table lookup: it always joins the current stream record with the latest updated record in the KTable, and it does not take stream time into account when deciding whether to join. If you want to consider stream time when joining, take a look at KStream-KStream joins (sketched below).
In your case this lag is the lag of TopicB; it does not mean the KTable is not fully loaded. The KTable is only 'not fully loaded' during the state-restore process, when its state is rebuilt from the underlying changelog topic before your streams app actually runs; in that case you would not be able to join at all, because the app does not start processing until the state is fully restored.
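A minimal sketch of the windowed stream-stream alternative (the 7-day window and serdes are assumptions, using the 2.3-era API):

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;

public class StreamStreamJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> streamA = builder.stream("TopicA");
        KStream<String, String> streamB = builder.stream("TopicB");

        // Unlike a table lookup, a stream-stream join is windowed on event
        // time: two records join only if their timestamps fall within the
        // window of each other.
        streamA.leftJoin(streamB,
                (a, b) -> a + "|" + b,
                JoinWindows.of(Duration.ofDays(7)),
                Joined.with(Serdes.String(), Serdes.String(), Serdes.String()))
               .to("result");
    }
}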

Understanding max.task.idle.ms in Kafka Streams for a KStream-KTable join

I need help understanding Kafka Streams behavior when max.task.idle.ms is used, in Kafka 2.2.
I have a KStream-KTable join where the KStream has been re-keyed:
KStream stream1 = builder.stream("topic1", Consumed.with(myTimeExtractor));
KStream stream2 = builder.stream("topic2", Consumed.with(myTimeExtractor));

KTable table = stream1
    .groupByKey()
    .aggregate(myInitializer, myAggregator, Materialized.as("myStore"));

stream2.selectKey((k, v) -> v)
    .through("rekeyedTopic")
    .join(table, myValueJoiner)
    .to("enrichedTopic");
All topics have 10 partitions, and for testing I've set max.task.idle.ms to 2 minutes. myTimeExtractor updates the event time of messages only if they are labelled "snapshot": each snapshot message in stream1 gets its event time set to some constant T, and messages in stream2 get their event time set to T+1.
There are 200 messages present in each of topic1 and topic2 when I call KafkaStreams#start, all labelled "snapshot", and no message is added thereafter. I can see that within a second or so both myStore and rekeyedTopic get filled up. Since the event time of the messages in the table is lower than the event time of the messages in the stream, my understanding (from reading https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization) is that I should see the result of the join (in enrichedTopic) shortly after myStore and rekeyedTopic are filled up. In fact, I should be able to fill up rekeyedTopic first, and as long as myStore gets filled up less than 2 minutes after that, the join should still produce the expected result.
This is not what happens. What happens is that myStore and rekeyedTopic get filled up within the first second or so, then nothing happens for 2 minutes, and only then does enrichedTopic get filled with the expected messages.
I don't understand why there is a pause of 2 minutes before enrichedTopic gets filled, since everything is "ready" long before. What am I missing?
Based on the documentation, which states:
max.task.idle.ms - Maximum amount of time a stream task will stay idle when not
all of its partition buffers contain records, to avoid potential out-of-order
record processing across multiple input streams.
I would say it's possibly due to some of the partition buffers NOT containing records, so the task is basically waiting, up to the time you configured for the property, to avoid out-of-order processing.
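The question's myTimeExtractor isn't shown, but a hypothetical extractor of the kind described (a fixed event time for "snapshot"-labelled messages) might look like this:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class SnapshotTimestampExtractor implements TimestampExtractor {
    // Hypothetical stand-in for the constant T from the question.
    private static final long SNAPSHOT_TIME = 1_000_000_000L;

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        // Only "snapshot"-labelled messages get their event time rewritten;
        // everything else keeps the producer/broker timestamp.
        if (String.valueOf(record.value()).contains("snapshot")) {
            return SNAPSHOT_TIME;
        }
        return record.timestamp();
    }
}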