Combining data from multiple Kafka topics into a single Kafka topic - apache-kafka

I have N Kafka topics, each containing data with a timestamp. I need to combine them into a single topic in sorted timestamp order, where the data is sorted within the partition. I have found one way to do this:
Combine all the Kafka topic data in Cassandra (because of its fast writes) with the clustering order set to DESCENDING. This will combine them all, but the limitation is that if a record arrives late, after the timed window of accumulation has passed, it won't be sorted.
Is there any other appropriate way to do this? If not, is there any way to improve my solution?
Thanks

It's not clear why you need Kafka to sort on timestamps. Typically this is done only at consumption time, for each batch of messages.
For example, create a Kafka Streams process that reads from all topics, build a GlobalKTable, and enable Interactive Queries.
When you query, you sort the data on the client side, regardless of how it is ordered in the topic.
This way, you are not limited to a single, ordered partition.
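As a minimal sketch of the first point, here is a plain consumer that sorts each polled batch on the client side (broker address, topic names, serialization, and the process() handler are illustrative assumptions):

import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "merged-reader");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("topic-1", "topic-2", "topic-3"));  // the N source topics
    while (true) {
        ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
        List<ConsumerRecord<String, String>> records = new ArrayList<>();
        batch.forEach(records::add);
        // Sort the batch by record timestamp before handing it downstream.
        records.sort(Comparator.comparingLong(ConsumerRecord::timestamp));
        records.forEach(r -> process(r));  // process(...) is your own handler
    }
}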
Alternatively, I would write to something other than Cassandra (due to my lack of deep knowledge of it), for example Couchbase or CockroachDB.
Then, when you query those later, add an ORDER BY on the timestamp.

Related

Reading an already partitioned topic in Kafka Streams DSL

Repartitioning a high-volume topic in Kafka Streams could be very expensive. One solution is to partition the topic by a key on the producer's side and ingest the already partitioned topic in the Streams app.
Is there a way to tell the Kafka Streams DSL that my source topic is already partitioned by the given key and that no repartitioning is needed?
Let me clarify my question. Suppose I have a simple aggregation like this (details omitted for brevity):
builder
    .stream("messages")
    .groupBy((key, msg) -> msg.field)
    .count();
Given this code, Kafka Streams would read the messages topic and immediately write the messages back to an internal repartition topic, this time partitioned by msg.field as the key.
One simple way to render this round trip unnecessary is to write the original messages topic partitioned by msg.field in the first place. But Kafka Streams knows nothing about the messages topic's partitioning, and I've found no way to tell it how the topic is partitioned without causing an actual repartition.
Note that I'm not trying to eliminate the partitioning step completely as the topic has to be partitioned to compute keyed aggregations. I just want to shift the partitioning step upstream from the Kafka Streams application to the original topic producers.
What I'm looking for is basically something like this:
builder
    .stream("messages")
    .assumeGroupedBy((key, msg) -> msg.field)
    .count();
where assumeGroupedBy would mark the stream as already partitioned by msg.field. I understand this solution is somewhat fragile and would break on a partitioning-key mismatch, but it solves one of the problems of processing really large volumes of data.
Update after the question was updated: If your data is already partitioned as needed and you simply want to aggregate it without incurring a repartitioning step (both are true for your use case), then all you need is to use groupByKey() instead of groupBy(). Whereas groupBy() always results in repartitioning, its sibling groupByKey() assumes that the input data is already partitioned as needed by the existing message key. In your example, groupByKey() would work if key == msg.field.
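A minimal sketch of that change, assuming your producers already key the messages by msg.field (i.e., key == msg.field):

builder
    .stream("messages")
    .groupByKey()   // trusts the existing key; no repartition topic is created
    .count();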
Original answer below:
Repartitioning a high-volume topic in Kafka Streams could be very expensive.
Yes, that's right: it could be very expensive (e.g., when high volume means millions of events per second).
Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?
Kafka Streams does not repartition the data unless you instruct it to, e.g., with the KStream#groupBy() function. Hence, there is no need to tell it "not to repartition", as you say in your question.
One solution is to partition the topic by a key on the producer's side and ingest the already partitioned topic in the Streams app.
Given this workaround idea of yours, my impression is that your motivation for asking is something else (you must have a specific situation in mind), but your question text does not make it clear what that could be. Perhaps you need to update your question with more details?

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer, and I'm wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use plain KafkaProducer/Consumer clients. I would not dare to pick one over the other in general terms.
First of all, Kafka Streams is built on top of the Kafka producer/consumer clients, so everything that is possible with Kafka Streams is also possible with plain consumers/producers.
I would say the Kafka Streams API is less complex but also less flexible than the plain consumers/producers. Now we could start a long discussion on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you develop applications with plain consumers/producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
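For contrast, here is a rough sketch of the plumbing the DSL hides, assuming pre-configured consumer and producer clients and illustrative topic names; note the subscribe/poll/send/flush cycle mentioned above:

consumer.subscribe(List.of("input"));
while (true) {
    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(100))) {
        String result = rec.value().toUpperCase();   // your actual business logic goes here
        producer.send(new ProducerRecord<>("output", rec.key(), result));
    }
    producer.flush();   // plus offset commits, error handling, and rebalance listeners in real code
}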
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with ease. Let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders by delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders by city using the Streams DSL filter function, write the filtered data to a separate Kafka topic (using KStream.to() or KTable.to()), and finally use Kafka Connect to move the messages into the database table and Elasticsearch. You can do the same thing with the core Producer / Consumer API, but it would require much more code.
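A minimal sketch of that filtering step, assuming a hypothetical Order type with a getCity() accessor and illustrative topic names:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("customer-orders");
orders
    .filter((orderId, order) -> "Berlin".equals(order.getCity()))  // keep only the delivery city you care about
    .to("filtered-orders");  // Kafka Connect sinks then move this topic into the DB and Elasticsearch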
In a data processing pipeline, you can do the consume-process-produce cycle within a single transaction. So, in the above example, Kafka will ensure exactly-once semantics and a transaction from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are computing aggregates, such as the count of orders per individual product; in such scenarios, duplicates will always give you a wrong result.
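For reference, exactly-once processing in Kafka Streams is enabled with a single configuration setting (the exactly_once_v2 value requires brokers on version 2.5 or newer; the application ID and bootstrap address below are illustrative):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-pipeline");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Wraps each consume-process-produce cycle in a Kafka transaction.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);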
You can also enrich your incoming data with very low latency. Let's assume, in the above example, that you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now, all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data is stored in the embedded RocksDB that ships with Kafka Streams, and because the KTable is backed by a Kafka topic, the data in your streaming application is continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
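A minimal sketch of that lookup, assuming hypothetical Order, Customer, and EnrichedOrder types and a compacted customers topic keyed by customer ID:

KStream<String, Order> orders = builder.stream("orders");
GlobalKTable<String, Customer> customers = builder.globalTable("customers");

orders.join(
        customers,
        (orderId, order) -> order.getCustomerId(),                           // which key to look up
        (order, customer) -> new EnrichedOrder(order, customer.getEmail()))  // local RocksDB lookup, no network call
      .to("enriched-orders");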
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment is delayed or that the payment event arrives before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events arrive within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for the one-hour window, and that state is kept in the RocksDB store of Kafka Streams.
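And a minimal sketch of that windowed join, assuming both streams are keyed by order ID (types and topic names are illustrative):

KStream<String, Order> orders = builder.stream("orders");
KStream<String, Payment> payments = builder.stream("payments");

orders.join(
        payments,
        (order, payment) -> new PaidOrder(order, payment),  // emitted only when both sides arrive within the window
        JoinWindows.ofTimeDifferenceWithGrace(Duration.ofHours(1), Duration.ofMinutes(5)))
      .to("paid-orders");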

How to remove duplicate input messages using Kafka Streams

I have a topic where I get bursts of events from various devices. There are n devices, each of which emits a weather report every s seconds.
The problem is that these devices emit 5-10 records of the same value every s seconds. So if you look at the output in the Kafka topic for a single device, it is as follows:
For device1:
t1,t1,t1,t1 (in the same moment, then a gap of s seconds) t2,t2,t2,t2 (then a gap of s seconds) t3,t3,t3,t3
However, I want to remove these duplicate records in Kafka that come as a burst of events.
I want to consume as follows:
t1,t2,t3,...
I was trying to use the windowing and KTable concepts that the Kafka Streams API provides, but it doesn't seem possible. Any ideas?
You might want to use Kafka's log compaction, but in order to use it you should have the same key for all the duplicated messages and a different key for non-duplicate messages. Have a look at this:
https://kafka.apache.org/documentation/#compaction
Would it be an option to read the topic into a KTable, using t as the key? The duplicated values would be treated as upserts rather than inserts, which would effectively drop them. Then write the KTable out to another topic.
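That would be a very small topology. A minimal sketch (topic names are illustrative); note that it is the KTable's record caching (controlled by the commit interval and cache size) that collapses rapid-fire upserts for the same key before they are forwarded downstream:

builder.table("weather-reports")          // records with an existing key are upserts, not inserts
       .toStream()
       .to("weather-reports-deduped");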
Step 1:
Produce the same key with all messages that are logical duplicates.
Step 2:
If you don't need near-real-time processing of this topic, use cleanup.policy=compact. It will produce "eventual" deduplication (which may be delayed for a long time).
Otherwise, use exactly-once Kafka Streams deduplication. Here are DSL and Transformer examples.
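A minimal sketch of the stateful variant, using the Processor API (Kafka Streams 3.3+, where process() returns a stream) with a key-value store; store and topic names are illustrative, and duplicates are assumed to share a key as per Step 1. A production version would also expire old entries, e.g. with a windowed store:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.processor.api.*;
import org.apache.kafka.streams.state.*;

StreamsBuilder builder = new StreamsBuilder();
StoreBuilder<KeyValueStore<String, Long>> dedupStore = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("seen-keys"), Serdes.String(), Serdes.Long());
builder.addStateStore(dedupStore);

builder.<String, String>stream("weather-reports")
       .process(() -> new Processor<String, String, String, String>() {
           private KeyValueStore<String, Long> store;
           private ProcessorContext<String, String> context;

           @Override
           public void init(ProcessorContext<String, String> ctx) {
               context = ctx;
               store = ctx.getStateStore("seen-keys");
           }

           @Override
           public void process(Record<String, String> record) {
               if (store.get(record.key()) == null) {   // first occurrence of this key
                   store.put(record.key(), record.timestamp());
                   context.forward(record);             // pass it through once
               }                                        // otherwise silently drop the duplicate
           }
       }, "seen-keys")
       .to("weather-reports-deduped");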

How can I consume data sequentially (in order of timestamp) from a multi-partitioned Kafka topic

I know that Kafka cannot guarantee the ordering of data when a topic has multiple partitions. But my problem is this: I need the event topic (user activities generating events) to have multiple partitions, since I want multiple consumer groups to consume the data from the topic.
But there are times when I need to bootstrap the entire data set, i.e., read the complete data from beginning to end and rebuild my graph of events from the historical messages in Kafka, and then I lose the ordering, which creates a problem.
One approach might be to process it in a Map-Reduce fashion, where I map the data based on time, order it, and then consume it.
Is there anybody who has faced a similar situation/problem and would like to help me out with the right approach/solution?
Thanks in advance.
As per the Kafka documentation, global ordering across partitions is not guaranteed, so you can create N partitions with N consumers. Create partitions based on the type of data, i.e., all data of category A goes to one partition. Since the order of messages is maintained within a partition, you can consume those messages with a dedicated consumer and process the data.
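A minimal sketch of keying by category on the producer side (topic name and accessors are illustrative); all records with the same key land in the same partition, in send order:

producer.send(new ProducerRecord<>(
        "user-events",
        event.getCategory(),   // key: the same category always hashes to the same partition
        event.toJson()));      // value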
I have gone through some blogs that suggest buffering the messages and applying sorting logic to them, but this does not seem to be good practice: one of the partitions may be slow, a message may arrive late in some cases, and you would need to re-sort your messages every time a new one arrives.

Query Kafka topic for specific record

Is there an elegant way to query a Kafka topic for a specific record? The REST API that I'm building gets an ID and needs to look up records associated with that ID in a Kafka topic. One approach is to check every record in the topic via a custom consumer and look for a match, but I'd like to avoid the overhead of reading a bunch of records. Does Kafka have a fast, built-in filtering capability?
The only fast way to search for a record in Kafka (to oversimplify) is by partition and offset. The producer can return, via a future, the partition and offset to which a message was written. You can use these two values to retrieve the message very quickly.
So if you construct the ID out of the partition and offset, then you can implement your fast query. Otherwise, not so much. This means that the ID for an object isn't part of your data model, but rather is generated by Kafka-aware code.
Maybe that works for you, maybe it doesn't.
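A minimal sketch of that round trip (topic name and client setup are illustrative): capture the coordinates from the producer's future, then seek straight to them with a consumer:

// At write time: the returned metadata carries the record's coordinates.
RecordMetadata meta = producer.send(new ProducerRecord<>("records", id, payload)).get();
String recordId = meta.partition() + "-" + meta.offset();   // e.g. expose this as the REST resource ID

// At read time: jump directly to that partition and offset.
TopicPartition tp = new TopicPartition("records", meta.partition());
consumer.assign(List.of(tp));
consumer.seek(tp, meta.offset());
ConsumerRecords<String, String> hit = consumer.poll(Duration.ofSeconds(1));  // first record is the match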
This might be late for you, but it will help others who come across this question: there is now KSQL, Kafka's open-source streaming SQL engine:
https://github.com/confluentinc/ksql/