Kafka Windowing Streams Junk Data With Windowed KTable Serdes

I am working on a Kafka streaming dashboard where I need a count of the daily data for the last 24 hours.
stream.groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofHours(24)).grace(Duration.ofMinutes(5)))
    .aggregate(() -> "0", (key, value, aggregate) -> value, Materialized.with(Serdes.String(), Serdes.String()))
    .suppress(untilWindowCloses(unbounded()))
    .toStream().to("Test_windowing1");
In the above code I am trying to view the data of the past 24 hours with windowing, but the output before windowing and after windowing is:
Before Windowing
Key: 2023010612
Value: {"Title":"HOURLY_COUNT","Count":"19967","Key":"2023010612"}
After Windowing
Key: 2023010612����
Value: {"Title":"HOURLY_COUNT","Count":"19967","Key":"2023010612"}
I want windowing to drop data older than the past 2 days, and I also want the key after windowing to look the same as it does before windowing.
Thank you in advance.
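For what it's worth, a minimal sketch (not a confirmed fix; assumes the usual org.apache.kafka.streams imports) of mapping the Windowed<String> key back to its plain String form before writing to the output topic, so the key stays readable:

stream.groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofHours(24)).grace(Duration.ofMinutes(5)))
    .aggregate(() -> "0", (key, value, aggregate) -> value,
        Materialized.with(Serdes.String(), Serdes.String()))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    // toStream can take a mapper that turns the Windowed<String> key back into the plain String key
    .toStream((windowedKey, value) -> windowedKey.key())
    .to("Test_windowing1", Produced.with(Serdes.String(), Serdes.String()));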

Related

Bucketizing Kafka Data with Partitions

I have a situation where I'm loading data into Kafka. I would like to process the records in discrete 10-minute buckets. But bear in mind that the record timestamps come from the producers, so they may not arrive in perfect order; I can't simply use the standard Kafka consumer approach, since that would put records outside of their discrete bucket.
Is it possible to use partitions for this? I could look at the timestamp of each record before placing it in the topic, using that to select the appropriate partition. But I don't know if Kafka supports ad hoc named partitions.
They aren't "named" partitions. Sure, you could define a topic with 6 partitions (10-minute "buckets", ignoring hours and days) and a Partitioner subclass that computes which partition a record's timestamp should go into with a simple math function; however, this is really only useful for ordering and doesn't address that you need to consume from two partitions for every non-exact 10-minute interval. E.g. reading up to minute 11 (partition 1) would also require consuming records from minutes 1-9 (partition 0).
Overall, it sounds like you want the sliding/hopping windowing features of Kafka Streams, not the plain Consumer API. And this will work with any number of partitions, without writing custom producer Partitioners.
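For instance, a minimal Kafka Streams sketch (topic name is a placeholder) that counts records in discrete 10-minute event-time buckets instead of routing them to hand-picked partitions:

StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-events")                             // placeholder topic name
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))    // tumbling 10-minute buckets, keyed by event time
    .count();                                              // one count per key per bucket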

KStream-KTable LeftJoin, Join occured while KTable is not fully loaded

I am trying to use a KStream-KTable leftJoin to enrich items from topic A with topic B. Topic A is my KStream, and topic B is my KTable, which has around 23M records. The keys of the two topics do not match, so I have to convert the KStream (topic B) into a KTable using a reducer.
Here is my code:
KTable<String, String> ktable = streamsBuilder
    .stream("TopicB", Consumed.withTimestampExtractor(new customTimestampsExtractor()))
    .filter((key, value) -> {...})
    .transform(new KeyTransformer()) // generate new key
    .groupByKey()
    .reduce((aggValue, newValue) -> {...});

streamsBuilder
    .stream("TopicA")
    .filter((key, value) -> {...})
    .transform(...)
    .leftJoin(ktable, new ValueJoiner({...}))
    .transform(...)
    .to("result");
1) The KTable initialization is slow (around 2000 msg/s); is this normal? My topic has only 1 partition. Is there any way to improve the performance?
I tried to set the following to reduce write throughput, but it doesn't seem to improve much:
CACHE_MAX_BYTES_BUFFERING_CONFIG = 10 * 1024 * 1024
COMMIT_INTERVAL_MS_CONFIG = 15 * 1000
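(For reference, a rough sketch of how these might be set via StreamsConfig; my actual setup may differ slightly:)

Properties props = new Properties();
// cache up to 10 MB of KTable updates before they are flushed downstream
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024);
// commit (and flush the cache) every 15 seconds
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 15 * 1000);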
2) The join occurs while the KTable has not finished loading from Topic B.
Here are the offsets when the join occurred (CURRENT-OFFSET/LOG-END-OFFSET):
Topic A: 32725/32726 (Lag 1)
Topic B: 1818686/23190390 (Lag 21371704)
I checked the timestamp of the record from Topic A that failed: it is a record from 4 days ago, and the last record of Topic B that was processed is from 6 days ago.
As I understand it, Kafka Streams processes records based on timestamp, so I don't understand why, in my case, the KStream (Topic A) didn't wait until the KTable (Topic B) was fully loaded up to the point 4 days ago before triggering the join.
I also tried setting the timestamp extractor to return 0, but that doesn't work either.
Updated: When setting the timestamp to 0, I am getting the following error:
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException: This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producerID are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception.
I also tried setting max.task.idle.ms to > 0 (3 seconds and 30 minutes), but I still get the same error.
Updated: I fixed the 'UnknownProducerIdException' error by setting the customTimestampsExtractor to 6 days ago, which is still earlier than the records from Topic A. I think (not sure) that setting it to 0 triggers retention on the changelog, which caused this error. However, the join is still not working: it still happens before the KTable has finished loading. Why is that?
I am using Kafka Streams 2.3.0.
Am I doing anything wrong here? Many thanks.
1. The KTable initialization is slow (around 2000 msg/s); is this normal?
This depends on your network, and I think the limit is the consumption rate of TopicB. The two configs you used, CACHE_MAX_BYTES_BUFFERING_CONFIG and COMMIT_INTERVAL_MS_CONFIG, choose the trade-off between how much KTable output you want to produce (because a KTable changelog is a stream of revisions) and how much latency you accept when updating the KTable to the underlying topic and downstream processors. Take a detailed look at the Kafka Streams caching config for state stores and the "Tables, Not Triggers" part of that blog post.
I think the best way to increase the consumption rate of TopicB is to add more partitions.
KStream.leftJoin(KTable, ...) is always a table lookup: it always joins the current stream record with the latest updated record in the KTable, and it does not take stream time into account when deciding whether to join or not. If you want stream time to be considered when joining, take a look at the KStream-KStream join.
In your case this lag is the lag of TopicB; it does not mean the KTable is not fully loaded. Your KTable is only "not fully loaded" while it is in the state-restore process, i.e. when it is reading from the underlying changelog topic of the KTable to restore the current state before actually running your stream app; in that case you will not be able to do the join, because the stream app does not run until the state is fully restored.
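For comparison, a minimal sketch of a KStream-KStream left join (topic names, serdes, and the joiner are placeholders), which does take the records' timestamps into account through the join window:

KStream<String, String> left = builder.stream("topicA", Consumed.with(Serdes.String(), Serdes.String()));
KStream<String, String> right = builder.stream("topicB", Consumed.with(Serdes.String(), Serdes.String()));

left.leftJoin(right,
        (leftValue, rightValue) -> leftValue + "|" + rightValue,  // placeholder joiner; rightValue may be null in a left join
        JoinWindows.of(Duration.ofMinutes(5)),                    // only join records whose timestamps are within 5 minutes of each other
        Joined.with(Serdes.String(), Serdes.String(), Serdes.String()))
    .to("joined-output");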

Kafka - consuming messages based on timestamp

I'm kind of new to Kafka but need to implement logic for the consumer to consume from a particular topic based on timestamp. Another use case is for me to be able to consume a particular time range (for example from 10:00 to 10:20). The range will always be divisible by 5 minutes - meaning I won't need to consume, for example, from 10:00 to 10:04. The logic I was thinking of would be as follows:
create a table where I store timestamp and Kafka messageId (timestamp | id)
create a console/service which does the following every 5 minutes:
Get all partitions for a topic
Query all partitions for min offset value (a starting point)
Store the offset and timestamp in the table
Now if everything is alright I should have something like this in the table:
10:00 | 0
10:05 | 100
10:10 | 200
HH:mm | (some number)
Now having this I could start the consumer at any time and knowing the offsets I should be able to consume just what I need.
Does it look right or have I made a flaw somewhere? Or maybe there is a better way of achieving the required result? Any thoughts or suggestions would be highly appreciated.
P.S.: one of my colleagues suggested using partitions and working with each partition separately... Meaning if I have a topic and the replica count is, for example, 5 - then I'd need to save offsets 5 times for my topic for every interval (once per partition). And then the consumer would also need to account for the partitions and consume based on the offsets I got for each partition. But this would introduce additional complexity which I am trying to avoid...
Thanks in advance!
BR,
Mike
No need for tables.
You can use the seek method of a Consumer instance to move each partition to a specific offset.
Partitioning might work... 12 partitions of 5 minute message intervals
I don't think replication addresses your problem.
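A rough sketch of that seek-based approach with the plain Java consumer (assumes consumer is an already-subscribed KafkaConsumer with an assignment, and startTimestampMs is the start of the desired range), using offsetsForTimes to find the first offset at or after the timestamp for each partition:

// map every assigned partition to the start-of-range timestamp
Map<TopicPartition, Long> query = new HashMap<>();
for (TopicPartition tp : consumer.assignment()) {
    query.put(tp, startTimestampMs);
}
// look up the earliest offset whose record timestamp is >= startTimestampMs
Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
offsets.forEach((tp, oat) -> {
    if (oat != null) {                   // null means no record at or after that timestamp
        consumer.seek(tp, oat.offset());
    }
});
// poll as usual and stop once record timestamps pass the end of the range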

Calculating delta values using kafka streams

I have some metrics written to a Kafka topic with a timestamp. I need to calculate the delta between the current value and the previous value of each metric. I would like to do this via the Kafka Streams API or KSQL, to scale better than my current solution.
What I have now is a simple Kafka producer/consumer in Python that reads one metric at a time and calculates the delta against the previous value stored in a Redis database.
Some example code to accomplish this via Kafka streams API would be highly appreciated.
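Not from an official recipe, but one possible Kafka Streams sketch (topic, store, and serde choices are made up for illustration; assumes the usual org.apache.kafka.streams imports): use transformValues with a key-value state store that remembers the previous value per metric key and emits the difference.

StreamsBuilder builder = new StreamsBuilder();

// state store holding the last seen value per metric key
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("last-values"),
        Serdes.String(), Serdes.Double()));

builder.stream("metrics", Consumed.with(Serdes.String(), Serdes.Double()))
    .transformValues(() -> new ValueTransformerWithKey<String, Double, Double>() {
        private KeyValueStore<String, Double> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, Double>) context.getStateStore("last-values");
        }

        @Override
        public Double transform(String key, Double value) {
            Double previous = store.get(key);
            store.put(key, value);
            return previous == null ? null : value - previous;   // delta vs. the previous value for this key
        }

        @Override
        public void close() {}
    }, "last-values")
    .filter((key, delta) -> delta != null)                       // drop the first value per key (no previous to diff against)
    .to("metric-deltas", Produced.with(Serdes.String(), Serdes.Double()));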

Difference between JoinWindows.of vs JoinWindows.until in Kafka Streams?

I am trying to understand the difference between JoinWindows.of vs JoinWindows.until while doing a left join. For example:
Stream1.leftJoin(Stream2, SomeJoinerValue, JoinWindows.of(2 mins).until(5 mins))
My understanding, as per the documentation, is that as long as the time difference between Stream1 & Stream2 is less than 2 mins, a successful join will be performed without dropping anything from the streams.
My question here is: what is the use of the window retention period of 5 mins?
The window retention period is a lower bound for how long the window is kept and accepts new input data. This is required to handle out-of-order records. Joins are based on event-time, and thus it's not guaranteed that all records are processed in timestamp order. In fact, Kafka Streams processes records in offset order.
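Spelled out as code (placeholder value joiner; note that until takes milliseconds in the Java API, and is the older pre-grace-period style used in the question):

stream1.leftJoin(stream2,
        (v1, v2) -> v1 + "," + v2,                             // placeholder ValueJoiner
        JoinWindows.of(Duration.ofMinutes(2))                  // join records whose timestamps differ by at most 2 minutes
                   .until(Duration.ofMinutes(5).toMillis()));  // keep the window state for at least 5 minutes to admit late records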