How to remove duplicate input messages using Kafka Streams


I have a topic that receives bursts of events from various devices. There are n devices, each of which emits a weather report every s seconds.
The problem is that each device emits 5-10 records with the same value every s seconds. The output in the Kafka topic for a single device looks like this:
For device1:
t1,t1,t1,t1 (at the same moment, then a gap of s seconds) t2,t2,t2,t2 (then a gap of s seconds) t3,t3,t3,t3
I want to remove the duplicate records that arrive in each burst.
I want to consume as follows:
t1,t2,t3,...
I was trying to use the windowing and KTable concepts that the Kafka Streams API provides, but it doesn't seem possible. Any ideas?

You might want to use Kafka's log compaction. To use it, all duplicated messages must share the same key, and non-duplicate messages must have different keys. Have a look at the documentation:
https://kafka.apache.org/documentation/#compaction

Would it be an option to read the topic into a KTable using t as the key? The duplicated values would be treated as upserts rather than inserts, which would effectively drop them. Then write the KTable to another topic.

Step 1:
Produce all messages that are logically duplicates with the same key.
Step 2:
If you don't need near real-time processing with this topic as an input, use cleanup.policy=compact. It will produce "eventual" deduplication (may be delayed for a long time).
Otherwise, use exactly-once kafka streams deduplication. Here are DSL and Transformer examples.
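The streaming variant comes down to remembering what was last emitted per key and dropping repeats. Here is a minimal plain-Java sketch of just that logic, with no Kafka dependency. The names `BurstDeduplicator` and `Reading` are hypothetical, and the in-memory map only stands in for what would be a fault-tolerant state store in a real Kafka Streams application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java, no Kafka dependency.
class BurstDeduplicator {
    // One weather reading; deviceId and timestamp together identify a burst.
    static final class Reading {
        final String deviceId;
        final long timestamp;
        final String value;

        Reading(String deviceId, long timestamp, String value) {
            this.deviceId = deviceId;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Last emitted timestamp per device; stands in for a Kafka Streams state store.
    private final Map<String, Long> lastEmitted = new HashMap<>();

    // True if the reading starts a new burst for its device,
    // false if it repeats the timestamp that was emitted last.
    boolean accept(Reading r) {
        Long previous = lastEmitted.get(r.deviceId);
        if (previous != null && previous == r.timestamp) {
            return false;
        }
        lastEmitted.put(r.deviceId, r.timestamp);
        return true;
    }

    // Keeps the first reading of every burst, in input order.
    static List<Reading> dedupe(List<Reading> input) {
        BurstDeduplicator state = new BurstDeduplicator();
        List<Reading> out = new ArrayList<>();
        for (Reading r : input) {
            if (state.accept(r)) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Reading> burst = Arrays.asList(
            new Reading("device1", 1L, "20C"),
            new Reading("device1", 1L, "20C"),
            new Reading("device1", 2L, "21C"),
            new Reading("device1", 2L, "21C"));
        for (Reading r : dedupe(burst)) {
            System.out.println(r.deviceId + " t" + r.timestamp);
        }
    }
}
```

This only compares against the most recent timestamp per device, which matches the burst pattern described in the question. If duplicates can arrive interleaved with newer readings, you would need to remember more than one timestamp per device.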

Related

How to scale to thousands of producer-consumer pairs in Kafka?

I have a usecase where I want to have thousands of producers writing messages which will be consumed by thousands of corresponding consumers. Each producer's message is meant for exactly one consumer.
Going through the core concepts here and here, it seems like each consumer-producer pair should have its own topic. Is this the correct understanding? I also looked into consumer groups, but it seems they are more for parallelizing consumption.
Right now I have multiple producer-consumer pairs sharing very few topics, but because of that (I think) I have to read a lot of messages in the consumer and filter them by key to find the specific producer's messages. As my system scales this might take a lot of time. Also, if I ever have to delete the checkpoint this will be even more problematic, as the consumer starts reading from the very beginning.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups etc? Both producers and consumers are spark streaming/batch applications. Thanks.

Each producer's message is meant for exactly one consumer
Assuming you commit the offsets, and don't allow retries, this is the expected behavior of all Kafka consumers (or rather, consumer groups)
seems like each consumer-producer pair should have its own topic
Not really. As you said, you have a many-to-many relationship of clients. You do not need to have a known pair ahead of time; a producer could send data with no expected consumer, and any consumer application(s) in the future should then be able to subscribe to that topic for the data they are interested in.
sharing very few topics, but because of that (i think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time
The consumption would take linearly more time on a higher production rate, yes, and partitions are the way to solve for that. Beyond that, you need faster network and processing. You still need to consume and deserialize in order to filter, so the filter is not the bottleneck here.
Is creating thousands of topics the solution for this?
Ultimately depends on your data, but I'm guessing not.

Is creating thousands of topics the solution for this? Or is there any
other way to use concepts like partitions, consumer groups etc? Both
producers and consumers are spark streaming/batch applications.
What's the reason you want to have thousands of consumers, or an explicit one-to-one relationship? As mentioned earlier, only one consumer within a consumer group will process a message. This is normal.
If, however, you are trying to make your record processing extremely concurrent, then instead of using very high partition counts or very large consumer groups, you should use something like Parallel Consumer (PC).
By using PC, you can process all your keys in parallel, regardless of how long each one takes to process, and you can be as concurrent as you wish.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages at a very high frequency to topics whose retention time is 10 hours. These messages are real-time updates, and the key is the ID of the element whose value has changed. So the topic acts as a changelog and will contain many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in the compacted topic is very likely, even with the log compaction frequency set as high as possible.
Twice the number of topics on the Kafka server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that the consumer can see it (extra topics), or we need consumers to execute a KSQL SELECT against the KSQL REST server to query the table (not as fast and performant as the Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on the Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we have a large number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.

By using KTables, Kafka Streams will create 1 topic on the Kafka server per KTable, which will result in a huge number of topics since we have a large number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
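To make the point about compaction concrete, here is a small plain-Java sketch (no Kafka dependency) of why a compacted changelog is cheaper to bootstrap from. Replaying the full log and replaying only the last entry per key produce the same table. The names `ChangelogReplay` and `Entry` are hypothetical, and `compact` only models the end state that log compaction converges to, not how or when the broker actually performs it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: models a changelog as an in-memory list.
class ChangelogReplay {
    // One changelog record; a null value is treated as a delete (tombstone).
    static final class Entry {
        final int key;
        final String value;

        Entry(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Replays the log in offset order and keeps the latest value per key.
    static Map<Integer, String> latestValues(List<Entry> log) {
        Map<Integer, String> table = new HashMap<>();
        for (Entry e : log) {
            if (e.value == null) {
                table.remove(e.key);
            } else {
                table.put(e.key, e.value);
            }
        }
        return table;
    }

    // Keeps only the last entry for every key, preserving the original order.
    static List<Entry> compact(List<Entry> log) {
        Map<Integer, Integer> lastIndex = new HashMap<>();
        for (int i = 0; i < log.size(); i++) {
            lastIndex.put(log.get(i).key, i);
        }
        List<Entry> out = new ArrayList<>();
        for (int i = 0; i < log.size(); i++) {
            if (lastIndex.get(log.get(i).key) == i) {
                out.add(log.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> log = Arrays.asList(
            new Entry(1, "a"), new Entry(2, "b"), new Entry(1, "c"),
            new Entry(2, null), new Entry(3, "d"));
        List<Entry> compacted = compact(log);
        System.out.println("full log entries: " + log.size());
        System.out.println("compacted entries: " + compacted.size());
        System.out.println("same table: " + latestValues(log).equals(latestValues(compacted)));
    }
}
```

The consumer does the same amount of work per record either way. Compaction simply bounds the number of records to roughly one per live key instead of 10 hours of updates.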

Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer a group.id and commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will resume reading from the last committed offset for that group.id.
So the problem of taking a lot of time occurs only initially (during first time). So long as you have the file, you don't need to consume from beginning.
If the file is missing or has been deleted, just call seekToBeginning on the KafkaConsumer and build it again.
You need to store these key-values somewhere for retrieval anyway, so why can't it be a persistent store?
If you want to use Kafka Streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S: There will be a file called .checkpoint in the directory which stores the offsets. If the topic is deleted in the meantime, you get an OffsetOutOfRangeException. You may want to avoid this, perhaps by using an UncaughtExceptionHandler.
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
it is better to use a Consumer with a persistent file rather than Streams for this, because of the simplicity it offers.
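A rough sketch of the file-backed idea in plain Java, with `java.util.Properties` standing in for MapDB. `LocalSnapshot` and the `__next_offset` entry are made-up names. For brevity the sketch keeps the next offset in the same file rather than committing it to Kafka under a group.id, which is a simplification I am assuming here, and the actual KafkaConsumer polling loop is left out.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration only: a file-backed "latest values" table.
class LocalSnapshot {
    // Reserved entry that stores where to resume reading (assumed name).
    static final String OFFSET_KEY = "__next_offset";

    final Properties table = new Properties();
    long nextOffset = 0L;

    // Applies one consumed record: upsert the value and advance the resume position.
    void apply(long offset, String key, String value) {
        table.setProperty(key, value);
        nextOffset = offset + 1;
    }

    // Writes the table and the resume position to the file.
    void save(File file) {
        Properties copy = new Properties();
        copy.putAll(table);
        copy.setProperty(OFFSET_KEY, Long.toString(nextOffset));
        try (OutputStream out = new FileOutputStream(file)) {
            copy.store(out, "latest values and next offset");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loads a snapshot; a missing file yields an empty table at offset 0,
    // which corresponds to the "seek to beginning and rebuild" case.
    static LocalSnapshot load(File file) {
        LocalSnapshot snapshot = new LocalSnapshot();
        if (!file.exists()) {
            return snapshot;
        }
        try (InputStream in = new FileInputStream(file)) {
            snapshot.table.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String stored = (String) snapshot.table.remove(OFFSET_KEY);
        snapshot.nextOffset = stored == null ? 0L : Long.parseLong(stored);
        return snapshot;
    }

    public static void main(String[] args) {
        File file = new File(System.getProperty("java.io.tmpdir"), "latest-values.properties");
        LocalSnapshot snapshot = load(file);
        snapshot.apply(snapshot.nextOffset, "42", "some value");
        snapshot.save(file);
        System.out.println("resume from offset " + load(file).nextOffset);
    }
}
```

On restart you would load the file, seek the consumer to `nextOffset`, and continue from there, so the long replay happens only when the file is missing.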

Kafka Streams Sort Within Processing Time Window

I wonder if there's any way to sort records within a window using Kafka Streams DSL or Processor API.
Imagine the following situation as an example (arbitrary one, but similar to what I need):
There is a Kafka topic of some events, let's say user clicks. Let's say topic has 10 partitions. Messages are partitioned by key, but each key is unique, so it's sort of a random partitioning. Each record contains a user id, which is used later to repartition the stream.
We consume the stream and publish each message to another topic, partitioning the record by its user id (repartitioning the original stream by user id).
Then we consume this repartitioned stream and store the consumed records in a local state store windowed by 10 minutes. All clicks of a particular user are always in the same partition, but order is not guaranteed, because the original topic had 10 partitions.
I understand the windowing model of Kafka Streams, and that time is advanced when new records come in, but I need this window to use processing time, not the event time, and then when window is expired, I need to be able to sort buffered events, and emit them in that order to another topic.
Notice:
We need to be able to flush/process records within the window using processing time, not the event time. We can't wait for the next click to advance the time, because it may never happen.
We need to remove all the records from the store as soon as the window is sorted and flushed.
If application crashes, we need to recover (in the same or another instance of the application) and process all the windows that were not processed, without waiting for new records to come for a particular user.
I know Kafka Streams 1.0.0 allows using wall-clock time in the Processor API, but I'm not sure what the right way to implement what I need would be (more importantly, taking into account the recovery requirement described above).

You can see my answer to a similar question here:
https://stackoverflow.com/a/44345374/7897191
Since your message keys are already unique you can ignore my comments about de-duplication.
Now that KIP-138 (wall-clock punctuation semantics) has been released in 1.0.0 you should be able to implement the outlined algorithm without issues. It uses the Processor API. I don't know of a way of doing this with only the DSL.
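For orientation, here is a rough plain-Java sketch of the buffering logic such a processor would run. It has no Kafka dependency, and `WallClockSortBuffer`, `Click`, `add` and `punctuate` are hypothetical names rather than the Processor API itself. The wall-clock time is passed in explicitly so the behaviour is easy to follow. In a real topology the two maps would have to live in a fault-tolerant state store so that unflushed windows survive a crash.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: per-user buffering with wall-clock expiry.
class WallClockSortBuffer {
    static final class Click {
        final String userId;
        final long eventTime;

        Click(String userId, long eventTime) {
            this.userId = userId;
            this.eventTime = eventTime;
        }
    }

    private final long windowMs;
    // userId -> clicks buffered for the user's open window.
    private final Map<String, List<Click>> buffered = new TreeMap<>();
    // userId -> wall-clock time at which the user's window was opened.
    private final Map<String, Long> openedAt = new TreeMap<>();

    WallClockSortBuffer(long windowMs) {
        this.windowMs = windowMs;
    }

    // Buffers a click; the first click of a user opens that user's window.
    void add(Click click, long nowMs) {
        buffered.computeIfAbsent(click.userId, k -> new ArrayList<>()).add(click);
        openedAt.putIfAbsent(click.userId, nowMs);
    }

    // Meant to be called periodically with the wall-clock time. Flushes every
    // window that has been open for at least windowMs, sorted by event time,
    // and removes it from the buffer.
    List<Click> punctuate(long nowMs) {
        List<Click> out = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = openedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> window = it.next();
            if (nowMs - window.getValue() >= windowMs) {
                List<Click> clicks = buffered.remove(window.getKey());
                clicks.sort((a, b) -> Long.compare(a.eventTime, b.eventTime));
                out.addAll(clicks);
                it.remove();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        WallClockSortBuffer buffer = new WallClockSortBuffer(10L);
        buffer.add(new Click("u1", 30L), 0L);
        buffer.add(new Click("u1", 10L), 1L);
        for (Click c : buffer.punctuate(10L)) {
            System.out.println(c.userId + " " + c.eventTime);
        }
    }
}
```

Because expiry depends only on the time passed to `punctuate`, a window is flushed even if no further click ever arrives for that user.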

How can I consume data sequentially (in order of timestamp) from a multi-partition Kafka topic

I know that Kafka cannot guarantee ordering of data when a topic has multiple partitions. But my problem is this: I need multiple partitions on an event topic (user activities generating events), since I want multiple consumer groups to consume the data from the topic.
But there are times when I need to bootstrap the entire data, i.e. read the complete data from beginning to end and rebuild my graph of events from the historical messages in Kafka. Then I lose the ordering, which creates a problem.
One approach might be to process it in a Map-Reduce paradigm where I map the data based on time and order it and consume it.
Is there anybody who has faced similar situation / problem and who would like to help me out with the right approach / solution.
Thanks in advance.

As per the Kafka documentation, global ordering across partitions is not guaranteed, so you can create N partitions with N consumers. Partition by the type of data, i.e. all data of category A should go to one partition. Because the order of messages is maintained within a partition, you can consume each partition in a separate consumer and process its data in order.
I have gone through some blogs that suggest buffering the messages and applying sorting logic to them, but this does not seem to be good practice: one partition may be slow, so a message may arrive late, and you would need to re-sort your messages every time a new message arrives.

Apache Kafka order of messages with multiple partitions

As per the Apache Kafka documentation, message order is guaranteed only within a single partition of a topic. In that case, what parallelism benefit are we getting? Isn't it equivalent to traditional MQs?

In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned based on user_id and consider 4 messages having user_ids 1, 2, 3 and 4. Assume that you have a "users" topic with 4 partitions.
Since partitioning is based on user_id, assume that message having user_id 1 will go to partition 1, message having user_id 2 will go to partition 2 and so on..
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case as soon as 4 messages are pushed, they are immediately consumed by the consumers.
If you had 2 consumers for the topic instead of 4, then each consumer will be handling 2 partitions and the consuming throughput will be almost half.
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
That is, if consumption is very slow in partition 2 and very fast in partition 4, then the message with user_id 4 will be consumed before the message with user_id 2. This is how Kafka is designed.

I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the first, then there's not much you can do: you should use 1 partition and lose all the parallelism.
But in the second case, you might consider partitioning your messages by some key, so that all messages for that key arrive at one partition (they might actually go to another partition if you resize the topic, but that's a different case). That guarantees that all messages for that key are in order.
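To illustrate the second case, here is a small plain-Java sketch of key-based routing with no Kafka dependency. `KeyPartitioning`, `partitionFor` and `route` are hypothetical names, and `String.hashCode()` is only a stand-in: Kafka's default partitioner hashes the serialized key bytes with murmur2, so the actual partition numbers will differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: routes keyed messages to partitions.
class KeyPartitioning {
    // Stand-in for the producer's partitioner: for a fixed partition count,
    // the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Appends each {key, value} pair, formatted as "key:value", to the
    // partition chosen by its key. Append order within a partition is kept.
    static List<List<String>> route(String[][] keyedMessages, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] message : keyedMessages) {
            int partition = partitionFor(message[0], numPartitions);
            partitions.get(partition).add(message[0] + ":" + message[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        String[][] messages = {
            {"user1", "login"}, {"user2", "login"}, {"user1", "click"}, {"user1", "logout"}
        };
        List<List<String>> partitions = route(messages, 4);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
    }
}
```

All messages for one key end up in a single partition in the order they were sent, while different keys can be spread across partitions and consumed in parallel. Changing `numPartitions` changes the mapping, which is the resize caveat mentioned above.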

In Kafka, messages with the same key, from the same producer, are delivered to the consumer in order.
On top of that, data within a partition is stored in the order in which it is written; therefore, data read from a partition is read in order for that partition.
So if you want to get your messages in order across multiple partitions, you really need to group your messages by key, so that messages with the same key go to the same partition and are ordered within that partition.
In a nutshell, you will need to design a two-level solution like the one above to get messages ordered across multiple partitions.

You may consider having a field which has the Timestamp/Date at the time of creation of the dataset at the source.
Once the data is consumed, you can load it into a database. The data needs to be sorted at the database level before the dataset is used for any use case. This is an attempt to help you think about the problem in multiple ways.
Let's assume the message key is the timestamp generated when the data was created, and the value is the actual message string.
As each message is picked up by the consumer, it is written into HBase with the Kafka key as the RowKey and the Kafka value as the value.
Since HBase is a sorted map, having the timestamp as the key automatically sorts the data. You can then serve the data from HBase to the downstream apps.
This way you are not losing the parallelism of Kafka. You also gain the ability to sort and apply multiple kinds of processing logic to the data at the database level.
Note: no distributed message broker guarantees overall ordering. If you insist on it, you may need to rethink using another message broker, or use a single partition in Kafka, which is not a good idea. Kafka is all about parallelism, achieved by increasing partitions or consumer groups.
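The sorted-store idea can be tried out without HBase. The sketch below uses a `TreeMap` as a stand-in sorted map. `TimestampSortedStore`, `put` and `scan` are hypothetical names, and grouping messages that share a timestamp into a list is my own assumption, since a plain timestamp key would otherwise overwrite earlier rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration only: a TreeMap stands in for a sorted store such as HBase.
class TimestampSortedStore {
    // Creation timestamp -> messages created at that timestamp.
    // TreeMap keeps its keys in ascending order.
    private final TreeMap<Long, List<String>> rows = new TreeMap<>();

    // Called by any consumer, in whatever order messages happen to arrive.
    synchronized void put(long timestamp, String message) {
        rows.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(message);
    }

    // Returns all messages ordered by timestamp. Messages with the same
    // timestamp keep their arrival order.
    synchronized List<String> scan() {
        List<String> out = new ArrayList<>();
        for (List<String> sameTimestamp : rows.values()) {
            out.addAll(sameTimestamp);
        }
        return out;
    }

    public static void main(String[] args) {
        TimestampSortedStore store = new TimestampSortedStore();
        // Arrival order differs from creation order, as it may across partitions.
        store.put(3L, "third");
        store.put(1L, "first");
        store.put(2L, "second");
        System.out.println(store.scan());
    }
}
```

Consumers can write in any order and in parallel. The ordering is applied only when the data is read back.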

Traditional MQ works in a way such that once a message has been processed, it gets removed from the queue. A message queue allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually allow for some level of transaction when pulling a message off, to ensure that the desired action was executed, before the message gets removed, but once a message has been processed, it gets removed from the queue.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8

Well, this is an old thread, but still relevant, hence decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then the same strict ordering should be maintained while consuming the messages. There is absolutely no point in ordering messages in the queue but not while consuming them. Kafka gives you the best of both worlds: it keeps messages ordered within a partition from generation through consumption, while allowing parallelism across multiple partitions. Hence, if you need:
Absolute ordering of all events published on a topic: use a single partition. You will not have parallelism, nor do you need it (again, parallelism and strict ordering don't go together).
Relative ordering only: go for multiple partitions and consumers, and use consistent hashing to ensure that all messages which need to follow a relative order go to a single partition.
