Is it possible to filter data from a topic instead of moving its content in Kafka Streams? - apache-kafka

I have two kafka clusters, Server1 and Server2.
My goal is to send the filtered data from Server1 to Server2.
Here is my simple example.
The topic1 in Server1 has data like the following.

Server 1
offset: 1 2 3 4 5 6 7 ...
data:   a b c a a b c ...

Server 2 (desired result)
offset: 1 2 3 4 ...
data:   a a a a ...
What I want to do is filter the records containing a and send them to Server 2, so that the result in Server 2 looks like the example above.
I know this is simple business logic and can easily be achieved with the filter method of the Kafka Streams API.
However, in my real case the records are much larger than in this example.
So I don't think it is a good idea to filter and forward the original data, because it would be almost entirely duplicated between the two servers. Instead, it would be better if I could send only the filtered indexes (offsets) to Server 2, so that the data itself is not duplicated between the two clusters.
I have googled Kafka Streams filtering but have found nothing that covers my case.
I would appreciate any hints or ideas on how to solve my problem.
Or is this impossible with Kafka Streams?
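For reference, the plain filter-and-forward that the question mentions would look roughly like the sketch below in the Kafka Streams DSL. The topic names and the value check are assumptions, and a Kafka Streams application only talks to a single cluster, so getting the filtered topic from Server1 over to Server2 would still need a separate replication step (e.g. MirrorMaker).

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "server1:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("topic1");

        // keep only records whose value is "a" and write them to an output topic
        source.filter((key, value) -> "a".equals(value))
              .to("filtered-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}
```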

Related

Apache Kafka Data Status ( Message Status )

I am a beginner in Kafka and am trying to create a chat application with features like forward, read, and delivered statuses.
Let me describe my approaches first so that you have an idea of whether I am on the right path.
Approach 1:
Define a topic 'some_name' having 3 partitions.
These partitions denote the following:
Partition 1 : Send
Partition 2 : Delivered
Partition 3 : Read
Here the messages first go through the 1st partition; once the client provides a callback we dequeue the message from the first partition and enqueue it to the 2nd, and so on for the read part.
Approach 2:
In this approach there would be just 1 topic with a single partition, and if Kafka provided a flag for each record (a flag denoting whether it has been consumed by any consumer), I could set that flag for read/delivered.
What I have tried:
I have tried the first approach by maintaining 3 partitions, but on the consumer side I wasn't able to read data from all 3 partitions together; the consumer kept returning null.
These are the approaches I have in mind, and I am looking forward to exploring more. I could really use help with new approaches or the best way to overcome this.
Thanks.
I'm not sure what you mean by the consumer returning nulls. The default behavior of subscribing to a topic is getting assigned all of its partitions.
Regardless, you cannot "dequeue" data and move records across topic partitions, so you may want to reconsider your design, for example using the Transactional Outbox pattern.
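To illustrate the default assignment behavior, here is a bare-bones consumer sketch; the broker address, topic name, and group id are assumptions. Simply subscribing to the topic is enough to receive records from all three partitions, as long as this is the only consumer in the group.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ChatStatusConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "chat-status");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribing assigns this consumer all 3 partitions of the topic
            // (as long as it is the only member of its group)
            consumer.subscribe(Collections.singletonList("some_name"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```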

Distribute messages on single Kafka topic to specific consumer

Avro-encoded messages on a single Kafka topic, single partition. Each of these messages is to be consumed by a specific consumer only. For example, with messages a1, a2, b1 and c1 on this topic and 3 consumers named A, B and C, each consumer would get all the messages, but ultimately A would consume only a1 and a2, B only b1, and C only c1.
I want to know how this is typically solved when using Avro on Kafka:
1. leave it to the consumers to deserialize the message, then use some application logic to decide whether to consume or drop the message
2. use partitioning logic to make each message go to a particular partition, then set up each consumer to listen to only a single partition
3. set up another 3 topics and a tiny Kafka Streams application that does the filtering + routing from the main topic to these 3 specific topics
4. make use of Kafka headers to inject an identifier for downstream consumers to filter on
It looks like each of the options has its pros and cons. I want to know if there is a convention that people follow, or some other way of solving this.
It depends...
If you only have a single partitioned topic, the only option is to let each consumer read all the data and filter client-side for the data it is interested in. In this case, each consumer would need to use a different group.id to isolate the consumers from each other.
Option 2 is certainly possible if you can control the input topic you are reading from. You might still use a different group.id for each consumer, as it seems the consumers represent different applications that should be isolated from each other. The question is still whether this is a good model: the idea of partitions is to provide horizontal scale-out and data-parallel processing, and if each application reads from only one partition, that doesn't align with this model. You also need to know which data goes into which partition, on both the producer side and the consumer side, to get the mapping right. Hence, it implies a "coordination" between producer and consumer, which seems undesirable.
Option 3 seems to indicate that you cannot control the input topic and thus want to branch the data into multiple topics. This is a good approach in general, as topics are a logical categorization of data. However, it would be even better to have 3 topics for the different data to begin with! If you cannot have 3 input topics from the start, Option 3 is still a good conceptual setup; it just won't provide much of a performance benefit, because the Kafka Streams application has to read and write each record once. The saving is that each application only consumes from one topic, so redundant reads are avoided -- if you had, let's say, 100 applications (each interested in only 1/100 of the data) you could cut the load significantly, from a 99x read overhead down to a 1x read and 1x write overhead. In your case you don't really cut down much, as you go from a 2x read overhead to a 1x read + 1x write overhead. Additionally, you need to manage the Kafka Streams application itself.
Option 4 seems to be orthogonal, because it answers the question of how the filtering works, and headers can be used in Option 1 and Option 3 to do the actual filtering/branching.
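As a rough sketch of the "tiny Kafka Streams application" from Option 3, the topology below fans one input topic out into three consumer-specific topics. The topic names are made up, and the routing here is keyed on a prefix of the record key, since a plain DSL filter does not see headers (reading a header as in Option 4 would require the Processor API instead).

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class RouterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "router-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // fan the single input topic out into three consumer-specific topics,
        // routing on an assumed key prefix ("a", "b", "c")
        input.filter((k, v) -> k != null && k.startsWith("a")).to("topic-a");
        input.filter((k, v) -> k != null && k.startsWith("b")).to("topic-b");
        input.filter((k, v) -> k != null && k.startsWith("c")).to("topic-c");

        new KafkaStreams(builder.build(), props).start();
    }
}
```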
The data in the topic is just bytes; Avro shouldn't matter.
Since you only have one partition, only one consumer of a group can actively read the data.
If you only want to process certain offsets, you must either seek to them manually or skip over messages in your poll loop and commit those offsets.
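A rough sketch of the skip-in-the-poll-loop idea, assuming the interesting offsets are somehow known up front; the broker address, topic name, partition, and offset set are made up for illustration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetSkippingConsumer {
    public static void main(String[] args) {
        Set<Long> wantedOffsets = Set.of(3L, 7L, 42L); // offsets this consumer cares about (assumed)

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumer-A");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("input-topic", 0); // single partition
            consumer.assign(Collections.singletonList(tp));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    if (wantedOffsets.contains(record.offset())) {
                        System.out.println("processing " + record.value());
                    }
                    // records at other offsets are simply skipped
                }
                consumer.commitSync(); // commit the position after each poll
            }
        }
    }
}
```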

Kafka consume from 2 topics and take equal number of messages

I've run into a specific requirement and would like to hear people's views; I certainly don't want to re-invent the wheel.
I've got 2 Kafka topics - A and B.
A and B would be filled with messages at different ingest rate.
For example: A could be filled with 10K messages first, followed by B. Or in some cases A and B would be filled with messages at the same time. The ingest process is something we have no control over; it's like a 3rd-party upstream system for us.
I need to pick up the messages from these 2 topics and mix them at equal proportion.
For example: if the configured size is 50, then I should pick up 50 from A and 50 from B (or wait until I have them) and then send them off to another Kafka topic as 100 messages (with equal proportions of A and B).
I was wondering what the best way to solve this is. Although I was looking at the join semantics of KStreams and KTables, I'm not quite convinced this is a valid use case for a join (because there's no key in the messages that joins these 2 streams or tables).
Can this be done without Kafka Streams, with a vanilla Kafka consumer (perhaps with some batching)? Thoughts?
With Spring, create 2 @KafkaListeners, one for A, one for B; set the container ack mode to MANUAL and add the Acknowledgment to the method signature.
In each listener, accumulate records until you get 50 then pause the listener container (so that Kafka won't send any more, but the consumer stays alive).
You might need to set the max.poll.records to 1 to better control consumption.
When you have 50 in each; combine and send.
Commit the offsets by calling acknowledge() on the last Acknowledgment received from A and from B.
Resume the containers.
Repeat.
Deferring the offset commits will avoid record loss in the event of a server crash while you are in the accumulating stage.
When you have lots of messages in both topics, you can skip the pause/resume part.
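A rough sketch of this recipe using Spring for Apache Kafka is below. The topic names, listener ids, output topic, and batch size are assumptions; the listener container factory is assumed to be configured with AckMode.MANUAL (and possibly max.poll.records=1, as noted above); and the concurrency handling is simplified: the two listener threads share the buffers, and the acknowledgments are deferred until the combined batch has been sent.

```java
import java.util.ArrayList;
import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.KafkaListenerEndpointRegistry;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class EqualProportionCombiner {

    private static final int BATCH_SIZE = 50; // "configured size" from the question

    private final List<String> bufferA = new ArrayList<>();
    private final List<String> bufferB = new ArrayList<>();
    private Acknowledgment lastAckA;
    private Acknowledgment lastAckB;

    @Autowired
    private KafkaListenerEndpointRegistry registry;

    @Autowired
    private KafkaTemplate<String, String> template;

    @KafkaListener(id = "listenerA", topics = "A")
    public synchronized void onA(String value, Acknowledgment ack) {
        bufferA.add(value);
        lastAckA = ack;
        if (bufferA.size() >= BATCH_SIZE) {
            registry.getListenerContainer("listenerA").pause(); // stop fetching more from A
        }
        maybeFlush();
    }

    @KafkaListener(id = "listenerB", topics = "B")
    public synchronized void onB(String value, Acknowledgment ack) {
        bufferB.add(value);
        lastAckB = ack;
        if (bufferB.size() >= BATCH_SIZE) {
            registry.getListenerContainer("listenerB").pause();
        }
        maybeFlush();
    }

    private void maybeFlush() {
        if (bufferA.size() >= BATCH_SIZE && bufferB.size() >= BATCH_SIZE) {
            // combine 50 + 50 and forward to the output topic
            List<String> combined = new ArrayList<>();
            combined.addAll(bufferA.subList(0, BATCH_SIZE));
            combined.addAll(bufferB.subList(0, BATCH_SIZE));
            combined.forEach(v -> template.send("out", v));

            // commit offsets only after the combined batch has been sent
            lastAckA.acknowledge();
            lastAckB.acknowledge();

            // remove the sent records; anything fetched before pause() took effect stays buffered
            bufferA.subList(0, BATCH_SIZE).clear();
            bufferB.subList(0, BATCH_SIZE).clear();
            registry.getListenerContainer("listenerA").resume();
            registry.getListenerContainer("listenerB").resume();
        }
    }
}
```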

Filter Stream (Kafka?) to Dynamic Number of Clients

We are designing a big data processing chain. We are currently looking at NiFi/MiNiFi to ingest data, do some normalization, and then export to a DB. After the normalization we plan to fork the data so that we can have a real-time feed which can be consumed by clients using different filters.
We are looking at both NiFi and/or Kafka to send data to the clients, but are having design problems with both of them.
With NiFi we are considering adding a websocket server which listens for new connections and adds their filter to a custom stateful processor block. That block would filter the data, tag it with the appropriate socket id if it matched a user's filter, and then generate X flow files to be sent to the matching clients. That part seems like it would work, but we would also like to be able to queue data in the event that a client connection drops for a short period of time.
As an alternative we are looking at Kafka, but it doesn't appear to support the idea of a dynamic number of clients connecting. If we used Kafka Streams to filter the data, it appears we would need 1 topic per client, which would likely eventually overrun our ZooKeeper instance. Or we could use NiFi to filter and then insert into different partitions, sort of like here: Kafka work queue with a dynamic number of parallel consumers. But there is still a limit to the number of partitions that are supported, correct? Not to mention we would have to juggle the producers and consumers to read from the right partitions as we scale up and down.
Am I missing anything for either NiFi or Kafka? Or is there another project out there I should consider for this filtered data sending?

Kafka: Can a single producer produce 2 different records to 2 different topics?

I have two types of records, let's call it X and Y. I want record X to go to TopicX and record Y to go to TopicY.
1) Do I need two different producers?
2) Is it better to have 2 partitions instead of 2 different topics?
3) How can I avoid having two different producers, for better network usage?
Thank you!
If you are using the same key/value serializers (and other producer properties), you can use the same producer. The ProducerRecord contains the topic it should be sent to.
Common practice is to have one topic per message type. For partitioning, some IDs are used (clientId, sessionId, ...). So, if the records you want to send have different logic, then it is better to use different topics.
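A small sketch of the single-producer case (the broker address, topic names, and record contents are placeholders): the target topic is part of each ProducerRecord, so one KafkaProducer instance is enough as long as the same serializers fit both record types.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TwoTopicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // the target topic is set per record, so the same producer
            // instance can write to both TopicX and TopicY
            producer.send(new ProducerRecord<>("TopicX", "key1", "record of type X"));
            producer.send(new ProducerRecord<>("TopicY", "key2", "record of type Y"));
            producer.flush();
        }
    }
}
```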