I am trying to understand the difference between JoinWindows.of and JoinWindows.until while doing a left join. For example:
stream1.leftJoin(stream2, someValueJoiner, JoinWindows.of(Duration.ofMinutes(2)).until(Duration.ofMinutes(5).toMillis()))
My understanding, as per the documentation, is that as long as the time difference between records in Stream1 and Stream2 is less than 2 minutes, a successful join will be performed without dropping anything from the streams.
My question here is: what is the use of the window retention period of 5 minutes?
The window retention period is a lower bound for how long the window is kept and accepts new input data. This is required to handle out-of-order records. Joins are based on event-time and thus it's not guaranteed that all records are processed in timestamp order. In fact, Kafka Streams processes records in offset order.
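For illustration, here is a minimal sketch of such a windowed left join (the topic names, value types and joiner are placeholders, not from the question; note that on newer Kafka Streams versions until() is deprecated in favour of a grace period plus store retention):

// Hedged sketch: join records whose timestamps differ by at most 2 minutes,
// keeping window state around for at least 5 minutes to absorb out-of-order records.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream1 = builder.stream("stream1-topic");
KStream<String, String> stream2 = builder.stream("stream2-topic");

stream1.leftJoin(
        stream2,
        (leftValue, rightValue) -> leftValue + "/" + rightValue,   // joiner placeholder
        JoinWindows.of(Duration.ofMinutes(2))                      // max timestamp difference for a match
                   .until(Duration.ofMinutes(5).toMillis()))       // lower bound on window state retention
       .to("joined-output");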
Related
I have a situation where I’m loading data into Kafka. I would like to process the records in discrete 10-minute buckets. But bear in mind that the record timestamps come from the producers, so they may not be perfectly in order, which means I can’t simply use the standard Kafka consumer approach, since that would result in records outside of my discrete bucket.
Is it possible to use partitions for this? I could look at the timestamp of each record before placing it in the topic, using that to select the appropriate partition. But I don’t know if Kafka supports ad hoc named partitions.
They aren't "named" partitions. Sure, you could define a topic with 6 partitions (10-minute "buckets", ignoring hours and days) and a Partitioner subclass that computes which partition the record timestamp should go into with a simple math function. However, this is really only useful for ordering and doesn't address the fact that you would need to consume from two partitions for every non-exact 10-minute interval. E.g. records at minute 11 (partition 1) would need to be consumed alongside records from minutes 1-9 (partition 0).
Overall, it sounds like you want the sliding/hopping windowing features of Kafka Streams, not the plain Consumer API. And this will work with any number of partitions, without writing a custom Producer Partitioner.
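As a rough sketch (the topic names and serdes here are placeholders, not from the question), a tumbling 10-minute event-time window with a grace period for late records could look like this:

// Hedged sketch: count records per key in 10-minute event-time buckets,
// tolerating up to 5 minutes of out-of-order data via the grace period.
StreamsBuilder builder = new StreamsBuilder();
builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
       .groupByKey()
       .windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ofMinutes(5)))
       .count()
       .toStream((windowedKey, count) -> windowedKey.key() + "@" + windowedKey.window().startTime())
       .to("bucketed-counts", Produced.with(Serdes.String(), Serdes.Long()));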
I have a Kafka Streams application that performs windowing (using original event time, not wallclock time) via stream joins of e.g. 1 day.
If I bring up this topology and reprocess the data from the start (as in a lambda-style architecture), will this window keep that old data?
For example: if today is 2022-01-09, and I'm receiving data from 2021-03-01, will this old data enter the table, or will it be rejected from the start?
In that case - what strategies can be done to reprocess this data?
UPDATE: Using Kafka Streams 2.5.0
Updated answer for the OP's Kafka Streams version 2.5:
When using event time, Kafka Streams will behave independently of the wallclock time, as long as no events contain the wallclock time. You should not have configured a WallclockTimestampExtractor as your timestamp extractor.
Kafka Streams will assign your input topic partitions to stream tasks, which will consume the partitions one event at a time. On any given topic, at most one partition will be assigned to a stream task. Time-windowed aggregations are carried out for each stream task separately. Kafka Streams uses an internal timestamp called "observedStreamTime" for each aggregation to keep track of the maximum timestamp seen so far. Incoming records are checked for their timestamp in comparison to the observedStreamTime. If they are older than the retention + grace period of the configured time window store, they will be dropped. Otherwise, they will be aggregated according to the configuration. The implementation can be found at https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L108-L175
This processing will always yield the same result if the Kafka Streams application is reset. It is independent of the execution time of the processing. If events are dropped, the corresponding metrics are changed.
There is one caveat with this approach when multiple topics are consumed. The observedStreamTime will reflect the highest timestamp of all partitions read by the stream task. If you have two topics (maybe because you want to join them) and one contains considerably younger data than the other (maybe because the latter received no new data), the observedStreamTime will be dominated by the younger topic. Events of the older topic might be dropped if the time window configuration does not have a long enough retention or grace period. See the JavaDoc of TimeWindows for the configuration options: https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindows.java
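As a hedged sketch of those options (topic name, durations, serdes and store name are placeholders), the window size, grace period and store retention can be configured like this:

// Hedged sketch: 1-day windows that accept up to 2 days of out-of-order data,
// with the window store retained long enough to cover window size + grace.
StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.Long()))
       .groupByKey()
       .windowedBy(TimeWindows.of(Duration.ofDays(1)).grace(Duration.ofDays(2)))
       .reduce(Long::sum,
               Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("daily-agg")
                           .withRetention(Duration.ofDays(3)));   // must be >= window size + grace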
In your example the old data will be accepted, as long as the stream time has not progressed too far. Reprocessing the whole data set should work, since it will progress linearly through your topic. If old data arrives for a time window that stream time has already passed by more than the window size plus grace period, Kafka Streams will reject the record. In that case Kafka Streams will also issue an error message and adjust its metrics accordingly. So this behaviour should be easy to pick up.
I suggest trying out this reprocessing if feasible and watching the logs and metrics.
Let's consider a topic with multiple partitions and messages written in event-time order without any particular partitioning scheme. A Kafka Streams application does some transformations on these messages, then groups by some key, and then aggregates messages by an event-time window with a given grace period.
Each task could process incoming messages at a different speed (e.g., because running on servers with different performance characteristics). This means that after groupBy shuffle, event-time ordering will not be preserved between messages in the same partition of the internal topic when they originate from different tasks. After a while, this event-time skew could become larger than the grace period, which would lead to dropping messages originating from the lagging task.
Increasing the grace period doesn't seem like a valid option because it would delay emitting the final aggregation result. Apache Flink handles this by emitting the lowest watermark on partitions merge.
Should this be a real concern, especially when processing large amounts of historical data, or am I missing something? Does Kafka Streams offer a solution to deal with this scenario?
UPDATE: My question is not about KStream-KStream joins but about single KStream event-time aggregation preceded by a stream shuffle.
Consider this code snippet:
stream
.mapValues(...)
.groupBy(...)
.windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofSeconds(10)))
.aggregate(...)
I assume the mapValues() operation could be slow for some tasks for whatever reason, and because of that, tasks process messages at a different pace. When a shuffle happens at the aggregate() operator, task 0 could have processed messages up to time t while task 1 is still at t-skew, but messages from both tasks end up interleaved in a single partition of the internal topic (corresponding to the grouping key).
My concern is that when skew is large enough (more than 10 seconds in my example), messages from the lagging task 1 will be dropped.
Basically, a task/processor maintains a stream-time, which is defined as the highest timestamp of any record already polled. This stream-time is then used for different purposes in Kafka Streams (e.g. Punctuator, Windowed Aggregation, etc.).
[Windowed Aggregation]
As you mentioned, the stream-time is used to determine whether a record should be accepted, i.e. record_accepted = end_window_time(current record) + grace_period > observed stream_time.
As you described, if several tasks run in parallel to shuffle messages based on a grouping key, and some tasks are slower than others (or some partitions are offline), this will create out-of-order messages. Unfortunately, I'm afraid that the only way to deal with that is to increase the grace_period.
This is actually the eternal trade-off between Availability and Consistency.
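In terms of the snippet from the question, that would simply mean widening the grace period (the 5-minute value below is just an example, not a recommendation):

stream
    .mapValues(...)   // unchanged user logic from the question
    .groupBy(...)
    .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofMinutes(5)))  // tolerate up to 5 minutes of skew, at the cost of later final results
    .aggregate(...);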
[Behaviour for KStream-KStream and KStream-KTable Joins]
When you are performing a join operation with Kafka Streams, an internal Task is assigned to the "same" partition over multiple co-partitioned topics. For example, Task 0 will be assigned to TopicA-Partition0 and TopicB-Partition0.
The fetched records are buffered per partition into internal queues that are managed by Tasks. So, each queue contains all records for a single partition waiting for processing.
Then, records are polled one by one from these queues and processed by the topology instance. Specifically, the record returned by the poll is the one from the non-empty queue with the lowest timestamp.
In addition, if a queue is empty, the task may become idle for a period of time so that no more records are polled from any queue. The maximum amount of time a Task will stay idle can be configured with the stream config max.task.idle.ms.
This mechanism allows synchronizing consumption across co-partitioned topics. By default, max.task.idle.ms is set to 0. This means a Task will never wait for more data from a partition, which may lead to records being filtered because the stream-time will potentially increase more quickly.
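A hedged sketch of enabling that waiting behaviour (the application id, bootstrap server and the 30-second value are placeholders):

// Let a task wait up to 30 seconds for data on an empty partition before
// advancing with the other partitions, to improve timestamp synchronization.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 30_000L);
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();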
I am trying to use a KStream-KTable leftJoin to enrich items from topic A with topic B. Topic A is my KStream, and topic B is my KTable, which has around 23M records. The keys from both topics are not matched, so I have to re-key the KStream (topic B) into a KTable using a reducer.
Here is my code:
KTable<String, String> ktable = streamsBuilder
.stream("TopicB", Consumed.withTimestampExtractor(new customTimestampsExtractor()))
.filter((key, value) -> {...})
.transform(new KeyTransformer()) // generate new key
.groupByKey()
.reduce((aggValue, newValue) -> {...});
streamsBuilder
.stream("TopicA")
.filter((key, value) -> {...})
.transform(...)
.leftJoin(ktable, (streamValue, tableValue) -> {...})
.transform(...)
.to("result");
1) The KTable initialization is slow (around 2000 msg/s); is this normal? My topic only has 1 partition. Is there any way to improve the performance?
I tried setting the following to reduce write throughput, but it doesn't seem to improve things much:
CACHE_MAX_BYTES_BUFFERING_CONFIG = 10 * 1024 * 1024
COMMIT_INTERVAL_MS_CONFIG = 15 * 1000
2) The join occurs before the KTable has finished loading from Topic B.
Here are the offsets when the join occurred (CURRENT-OFFSET/LOG-END-OFFSET):
Topic A: 32725/32726 (Lag 1)
Topic B: 1818686/23190390 (Lag 21371704)
I checked the timestamp of the record from Topic A that failed; it is a record from 4 days ago, and the last record of Topic B that was processed is from 6 days ago.
As I understand it, Kafka Streams processes records based on timestamp, so I don't understand why, in my case, the KStream (Topic A) didn't wait until the KTable (Topic B) was fully loaded up to the point of 4 days ago before triggering the join.
I also tried setting the timestamp extractor to return 0, but it doesn't work either.
Updated: When setting the timestamp to 0, I am getting the following error:
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException: This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producerID are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception.
I also tried setting max.task.idle.ms to > 0 (3 seconds and 30 minutes), but I am still getting the same error.
Updated: I fixed the 'UnknownProducerIdException' error by setting the customTimestampsExtractor to 6 days ago, which is still earlier than the records from Topic A. I think (not sure) setting it to 0 triggered retention on the changelog, which caused this error. However, the join is still not working: it still happens before the KTable finishes loading. Why is that?
I am using Kafka Streams 2.3.0.
Am I doing anything wrong here? Many thanks.
1. The KTable initialization is slow (around 2000 msg/s), is this normal?
This depends on your network, and I think the limitation is the consumption rate of TopicB. The two configs you use, CACHE_MAX_BYTES_BUFFERING_CONFIG and COMMIT_INTERVAL_MS_CONFIG, are there to choose the trade-off between how much output of the KTable you want to produce (because the KTable changelog is a stream of revisions) and how much latency you accept when updating the KTable to the underlying topic and downstream processors. Take a detailed look at the Kafka Streams caching config for state stores and the "Tables, Not Triggers" part of that blog post.
I think the best way to increase the consumption rate of TopicB is to add more partitions.
KStream.leftJoin(KTable, ...) is always a table lookup: it always joins the current stream record with the latest updated record in the KTable, and it will not take stream time into account when deciding whether to join or not. If you want to consider stream time when joining, take a look at KStream-KStream joins.
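For comparison, a hedged sketch of a KStream-KStream join, which does take record timestamps into account via the join window (the window size, value types and joiner here are placeholders):

// Hedged sketch: windowed stream-stream join that only matches records whose
// timestamps are within 10 minutes of each other.
KStream<String, String> streamA = streamsBuilder.stream("TopicA");
KStream<String, String> streamB = streamsBuilder.stream("TopicB");
streamA.join(streamB,
        (aValue, bValue) -> aValue + "|" + bValue,
        JoinWindows.of(Duration.ofMinutes(10)))
       .to("joined-result");

Note that, as in your KTable approach, both streams would first need to be re-keyed so they are co-partitioned on the join key.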
In your case this lag is the lag of TopicB; it does not mean the KTable is not fully loaded. Your KTable is only "not fully loaded" while it is in the state restore process, i.e. while it reads from the underlying changelog topic of the KTable to restore the current state before actually running your stream app; in that case you will not be able to do the join, because the stream app is not running until the state is fully restored.
I need help understanding Kafka Streams behavior when max.task.idle.ms is used in Kafka 2.2.
I have a KStream-KTable join where the KStream has been re-keyed:
KStream stream1 = builder.stream("topic1", Consumed.with(myTimeExtractor));
KStream stream2 = builder.stream("topic2", Consumed.with(myTimeExtractor));
KTable table = stream1
.groupByKey()
.aggregate(myInitializer, myAggregator, Materialized.as("myStore"));
stream2.selectKey((k,v)->v)
.through("rekeyedTopic")
.join(table, myValueJoiner)
.to("enrichedTopic");
All topics have 10 partitions, and for testing I've set max.task.idle.ms to 2 minutes. myTimeExtractor updates the event time of messages only if they are labelled "snapshot": each snapshot message in stream1 gets its event time set to some constant T, and messages in stream2 get their event time set to T+1.
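For reference, a hedged sketch of what such an extractor might look like (the "snapshot" check and the constant T are placeholders for the poster's actual logic):

// Hypothetical extractor: give "snapshot" messages a fixed event time T,
// and fall back to the record's own timestamp otherwise.
public class MyTimeExtractor implements TimestampExtractor {
    private static final long T = 1_600_000_000_000L; // placeholder constant

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        String value = String.valueOf(record.value());
        if (value.contains("snapshot")) {
            return T; // stream2 would return T + 1 here instead
        }
        return record.timestamp();
    }
}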
There are 200 messages present in each of topic1 and topic2 when I call KafkaStreams#start, all labelled "snapshot", and no message is added thereafter. I can see that within a second or so both myStore and rekeyedTopic get filled up. Since the event time of the messages in the table is lower than the event time of the messages in the stream, my understanding (from reading https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization) is that I should see the result of the join (in enrichedTopic) shortly after myStore and rekeyedTopic are filled up. In fact, I should be able to fill up rekeyedTopic first, and as long as myStore gets filled up less than 2 minutes after that, the join should still produce the expected result.
This is not what happens. What happens is that myStore and rekeyedTopic get filled up within the first second or so, then nothing happens for 2 minutes and only then enrichedTopic gets filled with the expected messages.
I don't understand why there is a pause of 2 minutes before enrichedTopic gets filled, since everything is "ready" long before. What am I missing?
Based on the documentation, which states:
max.task.idle.ms - Maximum amount of time a stream task will stay idle when not
all of its partition buffers contain records, to avoid potential out-of-order
record processing across multiple input streams.
I would say it's possibly due to some of the partition buffers NOT containing records, so the task is basically waiting, up to the time you have configured for the property, in order to avoid out-of-order processing.