Kafka Producer design - multiple topics - apache-kafka

I'm trying to implement a Kafka producer/consumer model, and am deliberating whether creating a separate publisher thread per topic would be preferred over having a single publisher handle multiple topics. Any help would be appreciated.
PS: I'm new to Kafka

By separate publisher thread, I think you mean separate producer objects. If so:
Since messages are stored as key-value pairs in Kafka, different topics can have different key-value types.
So if your Kafka topics have different key-value types, for example:
Topic1 - key:String, value:Student
Topic2 - key:Long, value:Teacher
and so on, then you should use multiple producers, because the KafkaProducer class asks you for the key and value serializers when you construct the object.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // broker address, adjust as needed
props.put("key.serializer", StringSerializer.class);
props.put("value.serializer", LongSerializer.class);
KafkaProducer<String, Long> producer = new KafkaProducer<>(props);
That said, you could also write a generic serializer that handles all the types. But it is better to know beforehand what you will be doing with the producer.
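For illustration, here is a minimal sketch of such a generic serializer, assuming a JSON library such as Jackson is on the classpath (the class name is made up):

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

// Serializes any value type to JSON bytes, so one producer configuration can cover many topics.
public class JsonSerializer<T> implements Serializer<T> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, T data) {
        try {
            return data == null ? null : mapper.writeValueAsBytes(data);
        } catch (Exception e) {
            throw new SerializationException("Failed to serialize value for topic " + topic, e);
        }
    }
    // configure() and close() have default implementations in recent kafka-clients versions.
}

The trade-off is that you lose compile-time type safety on the producer, which is why knowing the key/value types up front is usually preferable.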

I prefer the Keep It Simple, Stupid (KISS) approach for obvious reasons: one producer (or multiple producers) per topic.
From Wikipedia,
The KISS principle states that most systems work best if they are kept simple rather than made complicated; therefore, simplicity should be a key goal in design, and unnecessary complexity should be avoided.
As for the possibility of one producer supporting multiple topics: it is technically possible, since the destination topic is set on each record, but it works against keeping things simple.
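For reference, a minimal sketch of a single producer writing to two topics, reusing a KafkaProducer<String, Long> like the one shown above (the topic names are made up):

import org.apache.kafka.clients.producer.ProducerRecord;

// One producer instance, two destination topics; the topic name travels with each ProducerRecord.
producer.send(new ProducerRecord<>("topic-a", "key-1", 1L));
producer.send(new ProducerRecord<>("topic-b", "key-2", 2L));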

Starting with version 2.5, you can use a RoutingKafkaTemplate to select the producer at runtime, based on the destination topic name.
https://docs.spring.io/spring-kafka/reference/html/#routing-template
A single publisher can handle multiple topics, and with this approach you can customize the producer configuration per topic.
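A sketch adapted from the linked Spring Kafka reference, assuming a Spring Boot setup that already provides a default ProducerFactory; the topic pattern and serializer choice are just examples, and the bean method would sit in a @Configuration class:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.support.GenericApplicationContext;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.core.RoutingKafkaTemplate;

@Bean
public RoutingKafkaTemplate routingTemplate(GenericApplicationContext context,
        ProducerFactory<Object, Object> pf) {
    // Clone the default factory with a different value serializer for the "bytes-*" topics.
    Map<String, Object> configs = new HashMap<>(pf.getConfigurationProperties());
    configs.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
    DefaultKafkaProducerFactory<Object, Object> bytesPF = new DefaultKafkaProducerFactory<>(configs);
    context.registerBean("bytesPF", DefaultKafkaProducerFactory.class, () -> bytesPF);

    // Topics matching "bytes-.*" use the byte[] factory; everything else uses the default factory.
    Map<Pattern, ProducerFactory<Object, Object>> map = new LinkedHashMap<>();
    map.put(Pattern.compile("bytes-.*"), bytesPF);
    map.put(Pattern.compile(".+"), pf);
    return new RoutingKafkaTemplate(map);
}

The template then picks the producer factory by matching the destination topic name at send time, e.g. routingTemplate.send("bytes-audit", someByteArray).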

I think a separate producer for each topic would be preferable because, if for some reason a particular producer goes down, only the corresponding topic is impacted and all the remaining topics keep working smoothly without any problem.
If we create one publisher for all topics and that publisher goes down for some reason, then all the topics are impacted.

Related

How to scale to thousands of producer-consumer pairs in Kafka?

I have a use case where I want to have thousands of producers writing messages which will be consumed by thousands of corresponding consumers. Each producer's message is meant for exactly one consumer.
Going through the core concepts here and here, it seems like each consumer-producer pair should have its own topic. Is this the correct understanding? I also looked into consumer groups, but it seems they are more for parallelizing consumption.
Right now I have multiple producer-consumer pairs sharing very few topics, but because of that (I think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time. Also, in the event I have to delete the checkpoint, this will be even more problematic as it starts reading from the very beginning.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups, etc.? Both producers and consumers are Spark streaming/batch applications. Thanks.
Each producer's message is meant for exactly one consumer
Assuming you commit the offsets, and don't allow retries, this is the expected behavior of all Kafka consumers (or rather, consumer groups)
seems like each consumer-producer pair should have its own topic
Not really. As you said, you have many-to-many relationship of clients. You do not need to have a known pair ahead of time; a producer could send data with no expected consumer, then any consumer application(s) in the future should be able to subscribe to that topic for the data they are interested in.
sharing very few topics, but because of that (I think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time
The consumption would take linearly more time on a higher production rate, yes, and partitions are the way to solve for that. Beyond that, you need faster network and processing. You still need to consume and deserialize in order to filter, so the filter is not the bottleneck here.
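To make the partition angle concrete, here is a minimal sketch of one consumer instance (topic, group id, and broker address are made up). Running several copies of this with the same group.id splits the topic's partitions among them, so consumption scales with the partition count rather than with more topics:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");             // assumed broker
props.put("group.id", "message-processors");                  // same group.id on every instance
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("events"));   // one shared, well-partitioned topic
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            // Each instance only sees the partitions assigned to it; a given key always
            // hashes to the same partition, so per-key ordering is preserved.
            System.out.printf("key=%s partition=%d value=%s%n", record.key(), record.partition(), record.value());
        }
    }
}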
Is creating thousands of topics the solution for this?
Ultimately depends on your data, but I'm guessing not.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups, etc.? Both producers and consumers are Spark streaming/batch applications.
What's the reason you want to have thousands of consumers, or to have an explicit 1-to-1 relationship? As mentioned earlier, only one consumer within a consumer group will process a given message. This is normal.
If, however, you are trying to make your record processing extremely concurrent, then instead of using very high partition counts or very large consumer groups, you should use something like Parallel Consumer (PC).
By using PC, you can process all your keys in parallel, regardless of how long processing takes, and you can be as concurrent as you wish.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
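As a rough sketch of wiring up Parallel Consumer, adapted from the project's documentation; the options builder, factory method, and poll callback signature vary between library versions, so treat this as an outline rather than exact API (the topic name is made up, and kafkaConsumer is assumed to be an already-configured KafkaConsumer<String, String>):

import java.util.Collections;
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelStreamProcessor;

ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
        .ordering(ParallelConsumerOptions.ProcessingOrder.KEY) // per-key ordering, keys processed in parallel
        .maxConcurrency(1000)                                  // far beyond the topic's partition count
        .consumer(kafkaConsumer)
        .build();

ParallelStreamProcessor<String, String> processor =
        ParallelStreamProcessor.createEosStreamProcessor(options);
processor.subscribe(Collections.singletonList("events"));
processor.poll(context -> System.out.println("concurrently processing " + context));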

Dynamically create and change Kafka topics with Flink

I'm using Flink to read and write data from different Kafka topics.
Specifically, I'm using the FlinkKafkaConsumer and FlinkKafkaProducer.
I'd like to know if it is possible to change the Kafka topics I'm reading from and writing to 'on the fly' based on either logic within my program, or the contents of the records themselves.
For example, if a record with a new field is read, I'd like to create a new topic and start diverting records with that field to the new topic.
Thanks.
If your topics follow a generic naming pattern, for example "topic-n*", your Flink Kafka consumer can automatically read from "topic-n1", "topic-n2", ... and so on as they are added to Kafka.
Flink 1.5 (FlinkKafkaConsumer09) added support for dynamic partition discovery & topic discovery based on regex. This means that the Flink-Kafka consumer can pick up new Kafka partitions without needing to restart the job and while maintaining exactly-once guarantees.
Consumer constructor that accepts subscriptionPattern: link.
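A minimal sketch of the pattern-based subscription with topic/partition discovery enabled; the connector class and property key are from the universal Flink Kafka connector, and the broker, group id, and pattern are assumptions, so adjust to the connector version you actually use:

import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class DynamicTopicReader {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "topic-n-reader");          // hypothetical group id
        // Re-check for new topics/partitions matching the pattern every 10 seconds.
        props.setProperty("flink.partition-discovery.interval-millis", "10000");

        FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(
                Pattern.compile("topic-n.*"), new SimpleStringSchema(), props);

        env.addSource(consumer).print();
        env.execute("dynamic-topic-reader");
    }
}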
Thinking more about the requirement,
The 1st step is: you start with one topic (for simplicity) and spawn more topics during runtime based on the data provided, directing the respective messages to these topics. This is entirely possible and will not require complicated code. Use the ZkClient API to check whether the topic name exists; if it does not, create a model topic with the new name and start pushing messages into it through a new producer tied to this new topic (a sketch of this check-and-create step is shown below). You don't need to restart the job to produce messages to a specific topic.
Your initial consumer becomes a producer (for the new topics) plus a consumer (for the old topic).
The 2nd step is: you want to consume messages from the new topics. One way could be to spawn a new job entirely. You could do this by creating a thread pool initially and supplying arguments to it.
Again, be careful with this: too much automation can overload the cluster in case of a looping bug. Think about the possibility of too many topics being created after some time if the input data is not controlled or is simply dirty. There could be better architectural approaches, as mentioned above in the comments.
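As mentioned in the answer above, the check-and-create step can be done with the ZkClient API; here is a minimal sketch of the same step using the newer AdminClient instead, which avoids talking to ZooKeeper directly (topic name, partition and replica counts, and the broker address are made up):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicBootstrapper {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");   // assumed broker

        String newField = "region-eu";                           // hypothetical value taken from an incoming record
        String topic = "events-" + newField;                     // hypothetical per-field topic name

        try (AdminClient admin = AdminClient.create(adminProps)) {
            if (!admin.listTopics().names().get().contains(topic)) {
                // Model topic: 3 partitions, replication factor 1 (illustrative values).
                admin.createTopics(Collections.singleton(new NewTopic(topic, 3, (short) 1))).all().get();
            }
            // A producer tied to this topic can now start sending records to it.
        }
    }
}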

Assign different group id to different consumers in the same application

I am aware of the parallelism advantages that Kafka Streams offers, which are leveraged if your parallelism needs are aligned with the partitioning of the topics.
I am considering having an application subscribe many consumers to different consumer groups so that each consumer consumes a replica of the whole topic.
Specifically, I am thinking of having multiple threads consume the same topic to provide different results, even though I know that I can express all my computation needs using the "chaining" computation paradigm that KStreams offers.
The reason why I am considering different threads is that I want multiple dynamically created KTable instances of the stream, each one working on the same stream (not a subset) and aggregating different results. Since it's dynamic, it can create a really heavy load that could be alleviated by adding thread parallelism. I believe the idea that each thread can work on its own Streams instance (and consumer group) is valid.
Of course, I can also add thread parallelism by having multiple threads consuming smaller subsets of the data and individually doing all the computations (e.g. each one maintaining subsets of all the different KTables), which will still provide concurrency.
So, there are two main points in my question:
Are KafkaStreams not generally suited for thread parallelism, meaning is the library not intended to be used that way?
In the case where threads are being used to consume a topic, would it be a better idea to make threads follow the general Kafka parallelism concept of working on different subsets of the data, therefore making thread parallelism an application-level analogue to scaling up using more instances?
But I am wondering would it be okay to have an application that subscribes many consumers to different consumer groups so that each consumer is consuming a replication of the whole topic.
What you could consider is running multiple KafkaStreams instances inside the same Java application. Each instance has its own StreamsConfig and thus its own application.id and consumer group id.
That said, depending on what your use case is, you might want to take a look at GlobalKTable (http://docs.confluent.io/current/streams/concepts.html#globalktable), which (slightly simplified) ensures that the data it reads from a Kafka topic is available in all instances of your Kafka Streams application. That is, this would allow you to "replicate the data globally" without having to run multiple KafkaStreams instances or the more complicated setup you asked about above.
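A minimal sketch of that setup, with two KafkaStreams instances in one JVM, each with its own application.id and therefore its own consumer group (topic names, application ids, the broker address, and the trivial topologies are placeholders):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

StreamsBuilder builder1 = new StreamsBuilder();
builder1.stream("input-topic").to("output-a");                        // placeholder topology #1

StreamsBuilder builder2 = new StreamsBuilder();
builder2.stream("input-topic").to("output-b");                        // placeholder topology #2

Properties props1 = new Properties();
props1.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-app-1"); // distinct app id = distinct consumer group
props1.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props1.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props1.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

Properties props2 = new Properties();
props2.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-app-2");
props2.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props2.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props2.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

// Both instances independently read the full "input-topic".
new KafkaStreams(builder1.build(), props1).start();
new KafkaStreams(builder2.build(), props2).start();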
Specifically I am considering having multiple threads consume the same topic to provide different kinds of results. Can I somehow define the consumer group that each KafkaStream consumer is listening to?
Hmm, perhaps you're looking at something else then.
You are aware that you can build multiple "chains" of computation from the same KStream and KTable instance?
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Long> input = builder.stream("input-topic");                 // illustrative topic name
KTable<String, Long> firstChain = input.filter((k, v) -> v != null).groupByKey().count();
// mapValues() on a KStream yields another KStream, not a KTable
KStream<String, String> secondChain = input.mapValues(v -> "value=" + v);
This would allow you to read a Kafka topic once but then compute different outcomes based on that topic.
Is this considered a bad idea in general?
If I understand you correctly I think there's a better and much simpler approach, see above. If you need something different, you may need to update/clarify your question.
Hope this helps!

Implement filtering for Kafka messages

I have started using Kafka recently and am evaluating it for a few use cases.
If we wanted to provide the capability of filtering messages for consumers (subscribers) based on message content, what is the best approach for doing this?
Say a topic named "Trades" is exposed by the producer, which has different trade details such as market name, creation date, price, etc.
Some consumers are interested in trades for specific markets and others are interested in trades after a certain date, etc. (content-based filtering).
As filtering is not possible on the broker side, what is the best possible approach for implementing the cases below:
If the filtering criteria are specific to a consumer, should we use a ConsumerInterceptor (though interceptors are suggested for logging purposes as per the documentation)?
If the filtering criteria (content-based filtering) are common among consumers, what should the approach be?
Listen to the topic, filter the messages locally, and write to a new topic (using either an interceptor or Streams)?
If I understand your question correctly, you have one topic and different consumers which are interested in specific parts of the topic. At the same time, you do not own those consumers and want to avoid having them read the whole topic and do the filtering by themselves?
For this, the only way to go is to build a new application that reads the whole topic, does the filtering (or actually splitting), and writes the data back into two (or more) different topics. The external consumers would consume from those new topics and only receive the data they are interested in.
Using Kafka Streams for this purpose would be a very good way to go. The DSL should offer everything you need.
As an alternative, you can just write your own application using KafkaConsumer and KafkaProducer to do the filtering/splitting manually in your user code. This would not be much different from using Kafka Streams, as a Kafka Streams application would do the exact same thing internally. However, with Streams your effort to get it done would be much lower.
I would not use interceptors for this. Even if it would work, it does not seem to be a good software design for your use case.
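For the Streams-based splitting described above, a minimal sketch, assuming the trades arrive as JSON strings; the topic names, application id, broker address, and the "market" field are all made up:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TradeSplitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "trade-splitter");     // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // A real application would use a proper Trade serde instead of raw JSON strings.
        KStream<String, String> trades = builder.stream("Trades");

        // Route trades for one market to a dedicated topic that interested consumers subscribe to.
        trades.filter((key, value) -> value != null && value.contains("\"market\":\"NYSE\""))
              .to("Trades-NYSE");                                             // hypothetical output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}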
Create your own interceptor class that implements org.apache.kafka.clients.consumer.ConsumerInterceptor, implement your logic in the onConsume method, and then set the 'interceptor.classes' config for the consumer.
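If you do go the interceptor route, here is a minimal sketch of such a class (the filtering condition and class name are made up). Note that the dropped records are still fetched and deserialized by the consumer, and their offsets are still committed:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerInterceptor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class MarketFilterInterceptor implements ConsumerInterceptor<String, String> {

    @Override
    public ConsumerRecords<String, String> onConsume(ConsumerRecords<String, String> records) {
        // Keep only records whose value mentions "NYSE" (made-up filtering condition).
        Map<TopicPartition, List<ConsumerRecord<String, String>>> kept = new HashMap<>();
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<String, String>> filtered = new ArrayList<>();
            for (ConsumerRecord<String, String> record : records.records(tp)) {
                if (record.value() != null && record.value().contains("NYSE")) {
                    filtered.add(record);
                }
            }
            kept.put(tp, filtered);
        }
        return new ConsumerRecords<>(kept);
    }

    @Override public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) { }
    @Override public void close() { }
    @Override public void configure(Map<String, ?> configs) { }
}

// Registration on the consumer:
// props.put("interceptor.classes", MarketFilterInterceptor.class.getName());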

Will there be duplication if I use two groups of Kafka-0.8.0 SimpleConsumers

This is with reference to SimpleConsumer Example and High Level Consumer Example.
As per the documentation, it seems to suggest that SimpleConsumers are responsible for managing the offsets themselves and they can choose to read a message multiple times or consume only a subset of the partitions in a topic. All this is possible because they can form their request and specify what offset they want.
Now, if I have two clusters of simple consumers and both use a different ZooKeeper to store the offsets, then it is very likely that both clusters will read duplicate messages. Is that understanding correct? To avoid duplication among them, they would have to use a single ZooKeeper cluster to store the offsets.
The concept of a consumer group applies only to the high-level consumer. So if I have two clusters of high-level consumers and both use the same group ID, then they will not get any duplicate messages.
Please suggest if the above is not correct.
Simple consumers don't use ZooKeeper to store the offsets. It's recommended not to use ZooKeeper as a store for saving processed record offsets.
The concept of a consumer group applies only to the high-level consumer. So if I have two clusters of high-level consumers and both use the same group ID, then they will not get any duplicate messages
What do you mean by two clusters? If both consumers belong to the same group (having the same group ID), then your statement is correct.
If you are using high-level consumers with the same group ID, then there will be no duplication of messages while consuming from the same topic.
If you are using simple consumers, it completely depends on how you are maintaining the offsets. If both consumers have their offsets in sync, i.e. they maintain the same offset level, then there won't be any duplication.
In your case, it may cause duplication since you are maintaining the offsets separately.