How can I deduplicate streams in KSQL? - apache-kafka

I have a Kafka topic / stream that sometimes receives duplicates of events. How can I deduplicate the stream in KSQL?

De-duplicating a stream is not currently possible in raw KSQL. You might be able to write a UDF for this.
Note that a table will only store the latest update (message) for a given key. Depending on your use case, that could be helpful.

Related

What exactly is a data record in Kafka Streams?

So I've read enough tutorials and official documentation, but everything that I've found on data records is pretty much copy-pasted from one source to another:
A stream partition is an ordered, replayable, and fault-tolerant sequence of immutable data records, where a data record is defined as a key-value pair.
Each stream partition is a totally ordered sequence of data records and maps to a Kafka topic partition. A data record in the stream maps to a Kafka message from that topic.
So what exactly is a data record? Since it maps to a Kafka message, is it safe to say that it is pretty much the same thing, or is it another object that carries some information regarding the Kafka message?
A data record is nothing but a message, structured as a key-value pair like name=smith or id=101.
Stream is a high-level term used in the context of Kafka Streams, and Kafka Streams is a high-level API built on top of the core kafka-clients API to provide some additional functionality.
Generally, a stream is a flow of data; in this case it is a collection of messages or data records.
So when you say data record, it means a Kafka message only; it is not some other object that holds information (or metadata) about the Kafka message. If you want to store such additional information, termed metadata, it is usually stored in the headers of the Kafka message/data record.
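To make that concrete, here is a minimal producer-side sketch in Java; the topic name, key, value, and header name are illustrative assumptions, not anything prescribed by Kafka.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

// A data record is just a key-value pair; anything extra about it goes into headers.
ProducerRecord<String, String> record = new ProducerRecord<>("users", "id", "101");
record.headers().add("source", "device-7".getBytes(StandardCharsets.UTF_8));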

Reading already partitioning topic in Kafka Streams DSL

Repartitioning a high-volume topic in Kafka Streams could be very expensive. One solution is to partition the topic by a key on the producer’s side and ingest an already partitioned topic in the Streams app.
Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?
Let me clarify my question. Suppose I have a simple aggregation like this (details omitted for brevity):
builder
.stream("messages")
.groupBy((key, msg) -> msg.field)
.count();
Given this code, Kafka Streams would read the messages topic and immediately write the messages back to an internal repartitioning topic, this time partitioned by msg.field as the key.
One simple way to render this round trip unnecessary is to write the original messages topic partitioned by msg.field in the first place. But Kafka Streams knows nothing about how the messages topic is partitioned, and I've found no way to tell it how the topic is partitioned without causing a real repartition.
Note that I'm not trying to eliminate the partitioning step completely as the topic has to be partitioned to compute keyed aggregations. I just want to shift the partitioning step upstream from the Kafka Streams application to the original topic producers.
What I'm looking for is basically something like this:
builder
.stream("messages")
.assumeGroupedBy((key, msg) -> msg.field)
.count();
where assumeGroupedBy would mark the stream as already partitioned by msg.field. I understand this solution is kind of fragile and would break on a partitioning key mismatch, but it solves one of the problems when processing really large volumes of data.
Update after question was updated: If your data is already partitioned as needed, and you simply want to aggregate the data without incurring a repartitioning operation (both are true for your use case), then all you need is to use groupByKey() instead of groupBy(). Whereas groupBy() always results in repartitioning, its sibling groupByKey() assumes that the input data is already partitioned as needed as per the existing message key. In your example, groupByKey() would work if key == msg.field.
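For illustration only, and assuming your producers already write msg.field as the record key into messages (that part is an assumption about your setup), the aggregation then reduces to:

builder
.stream("messages")
.groupByKey()   // uses the existing record key, so no repartition topic is created
.count();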
Original answer below:
Repartitioning a high-volume topic in Kafka Streams could be very expensive.
Yes, that's right—it could be very expensive (e.g., when high volume means millions of events per second).
Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?
Kafka Streams does not repartition the data unless you instruct it; e.g., with the KStream#groupBy() function. Hence there is no need to tell it "not to partition" as you say in your question.
One solution is to partition the topic by a key on the producer’s side and ingest an already partitioned topic in the Streams app.
Given this workaround idea of yours, my impression is that your motivation for asking is something else (you must have a specific situation in mind), but your question text does not make it clear what that could be. Perhaps you need to update your question with more details?

How to remove duplicate input messages using Kafka Streams

I have a topic wherein I get a burst of events from various devices. There are n devices, each of which emits a weather report every s seconds.
The problem is that these devices emit 5-10 records of the same value every s seconds. So if you look at the output in the Kafka topic for a single device, it is as follows:
For device1:
t1,t1,t1,t1 (in the same moment, then a gap of s seconds) t2,t2,t2,t2 (then a gap of s seconds) t3,t3,t3,t3
However, I want to remove these duplicate records in Kafka that come as a burst of events.
I want to consume it as follows:
t1,t2,t3,...
I was trying to use the windowing and KTable concepts that the Kafka Streams API provides, but it doesn't seem possible. Any ideas?
You might want to use Kafka's log compaction. But in order to use it, you should have the same key for all the duplicated messages and a different key for non-duplicate messages. Have a look at this:
https://kafka.apache.org/documentation/#compaction
Would it be an option to read the topic into a KTable, using t as the key? The duplicated values would be treated as upserts rather than inserts, which would effectively drop them. Then write the KTable back out to another topic.
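A minimal Kafka Streams sketch of that idea, assuming default String serdes and made-up topic names (weather-events in, weather-events-deduped out):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();
// The table keeps only the latest value per key, so records sharing a key
// collapse into a single row (upsert rather than insert semantics).
KTable<String, String> latest = builder.table("weather-events");
latest.toStream().to("weather-events-deduped");

Note this only deduplicates per key, so it assumes the duplicated messages share a key (for example device ID plus reading time t); distinct readings must not reuse a key or they would be collapsed as well.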
Step 1:
Produce the same key for all messages that are logically duplicates.
Step 2:
If you don't need near real-time processing with this topic as an input, use cleanup.policy=compact. It will produce "eventual" deduplication (which may be delayed for a long time).
Otherwise, use exactly-once Kafka Streams deduplication. Here are DSL and Transformer examples.
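For reference, here is a rough sketch of the Transformer-style deduplication, under the Step 1 assumption that logical duplicates share a key. The store name "seen-keys", the topic names, and the String serdes are made up for illustration, and a real implementation should also expire old entries (for example by using a windowed store with retention) so the state does not grow forever.

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Drops every record whose key has already been forwarded once.
public class DeduplicationTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private KeyValueStore<String, String> seen;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        seen = (KeyValueStore<String, String>) context.getStateStore("seen-keys");
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        if (seen.get(key) != null) {
            return null;                  // duplicate of an earlier record: drop it
        }
        seen.put(key, value);             // first occurrence: remember it and forward it
        return KeyValue.pair(key, value);
    }

    @Override
    public void close() { }
}

Wiring it into a topology with the state store attached:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("seen-keys"),
        Serdes.String(), Serdes.String()));
builder.<String, String>stream("weather-events")
        .transform(() -> new DeduplicationTransformer(), "seen-keys")
        .to("weather-events-deduped");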

Can we create kafka time windowed stream from historical data?

I have some historical data, and each record has its own timestamp. I would like to read the records, feed them into a Kafka topic, and then use Kafka Streams to process them in a time-windowed manner.
Now the question is: when I create a Kafka Streams time-windowed aggregation processor, how can I tell Kafka to use the timestamp field in the record to build the time windows, instead of the wall-clock time?
You need to create a custom TimestampExtractor that will extract the value from the record itself - there's an example of this in the documentation, and here too. I also found this gist which looks relevant.
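As a rough illustration, a custom extractor could look like the sketch below; the WeatherReport value type and its getTimestampMillis() accessor are hypothetical stand-ins for whatever your deserialized records actually carry.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class RecordFieldTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Use the event time embedded in the record instead of broker/wall-clock time.
        WeatherReport report = (WeatherReport) record.value();   // hypothetical value type
        return report.getTimestampMillis();                      // epoch millis from the data itself
    }
}

The extractor is then registered through the Streams configuration, for example with props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, RecordFieldTimestampExtractor.class);.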

Query Kafka topic for specific record

Is there an elegant way to query a Kafka topic for a specific record? The REST API that I'm building gets an ID and needs to look up records associated with that ID in a Kafka topic. One approach is to check every record in the topic via a custom consumer and look for a match, but I'd like to avoid the overhead of reading a bunch of records. Does Kafka have a fast, built-in filtering capability?
The only fast way to search for a record in Kafka (to oversimplify) is by partition and offset. The new producer class can return, via futures, the partition and offset into which a message was written. You can use these two values to very quickly retrieve the message.
So if you make the ID out of the partition and offset, then you can implement your fast query. Otherwise, not so much. This means that the ID for an object isn't part of your data model, but rather is generated by the Kafka-knowledgeable code.
Maybe that works for you, maybe it doesn't.
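For what it's worth, here is a minimal sketch of the lookup side, assuming the partition and offset were decoded from the request ID; the broker address, topic name, and String deserializers are illustrative.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

int partition = 3;      // decoded from the request ID (illustrative value)
long offset = 42L;      // decoded from the request ID (illustrative value)

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    TopicPartition tp = new TopicPartition("records", partition);
    consumer.assign(Collections.singletonList(tp));   // manual assignment, no consumer group needed
    consumer.seek(tp, offset);                        // jump straight to the known offset
    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
        if (r.offset() == offset) {
            System.out.println(r.value());            // the record the ID pointed at
            break;
        }
    }
}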
This might be late for you, but it will help others who come across this question: there is now KSQL, an open-source streaming SQL engine for Kafka.
https://github.com/confluentinc/ksql/