Consuming from multiple topics using single kafka stream - apache-kafka

Which one is recommended to use:
1. A single Kafka stream consuming from multiple topics
2. Different Kafka streams consuming from different topics (I've used this one already with no issues encountered)
Is it possible to achieve #1? And if yes, what are the implications?
And if I use 'EXACTLY_ONCE' settings, what kind of complexities will it bring?
Kafka version: 2.2.0-cp2

Is it possible to achieve #1 (a single Kafka stream consuming from multiple topics)?
Yes, you can use StreamsBuilder#stream(Collection<String> topics).
If the data that you want to process is spread across multiple topics, and these topics together constitute one single source, then you can use this; but not if you want to process those topics in parallel.
It is like one consumer subscribing to all of these topics, which also means one thread consuming all of them. When you call poll(), it returns ConsumerRecords from all the subscribed topics, not just one topic.
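To illustrate that poll() behaviour, here is a broker-free sketch in plain Python (the topic names and the in-memory "broker" are made up for illustration): a single subscriber sees records from every subscribed topic in one batch.

```python
from collections import defaultdict

def poll(subscribed_topics, broker_log):
    """Simulate a single consumer's poll(): records come back from
    every subscribed topic in one batch, not just one topic."""
    batch = defaultdict(list)
    for topic in subscribed_topics:
        batch[topic].extend(broker_log.get(topic, []))
    return dict(batch)

# Hypothetical in-memory "broker" contents
broker_log = {"orders": ["o1", "o2"], "payments": ["p1"], "audit": ["a1"]}

records = poll(["orders", "payments"], broker_log)
# One batch contains records from both subscribed topics
assert records == {"orders": ["o1", "o2"], "payments": ["p1"]}
```

Note the single call returns records interleaved from all subscribed topics; nothing in the batch tells you to process them in parallel.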
In Kafka Streams there is a term called Topology, which is basically an acyclic graph of sources, processors and sinks. A topology can contain sub-topologies.
Sub-topologies can then be executed as independent stream tasks through parallel threads (Reference).
Since each sub-topology can have its own source, which can be a topic, if you want parallel processing of these topics you have to break up your graph into sub-topologies.
If I use 'EXACTLY_ONCE' settings, what kind of complexities will it bring?
When messages reach the sink processor in a topology, their source must be committed, where a source can be a single topic or a collection of topics.
Whether it is multiple topics or one, we need to send offsets to the transaction from the producer, which is basically a Map<TopicPartition, OffsetMetadata> that should be committed when the messages are produced.
So I think it should not introduce any complexities, whether it is a single topic with 10 partitions or 10 topics with 1 partition each, because offsets are tracked at the TopicPartition level, not at the topic level.
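A sketch of why partition count, not topic count, is what matters here (plain Python standing in for Kafka's per-TopicPartition offset map; the topic names are illustrative):

```python
def offsets_to_commit(consumed):
    """Build the per-(topic, partition) offset map a transactional producer
    would send with the transaction. `consumed` is an iterable of
    (topic, partition, offset) tuples for processed records."""
    offsets = {}
    for topic, partition, offset in consumed:
        key = (topic, partition)
        # Commit the offset of the next record to read: last processed + 1
        offsets[key] = max(offsets.get(key, -1), offset + 1)
    return offsets

# One topic with 10 partitions vs. 10 single-partition topics:
one_topic = offsets_to_commit(("t", p, 5) for p in range(10))
ten_topics = offsets_to_commit((f"t{i}", 0, 5) for i in range(10))
# Both produce 10 (topic, partition) entries: the commit granularity is the same
assert len(one_topic) == len(ten_topics) == 10
```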

Related

How to publish messages intended for different consumers on the same topic of a Kafka server?

We have multiple consumers (separate microservices) for our topic, and each event we publish on the topic is intended for only one of those microservices at a time.
Can someone suggest the best approach to implement this?
E.g. I have partitions 0 & 1 in my Kafka topic, which is being consumed by CG-A and CG-B.
I am publishing something like this:
record-1 for CG-A, then record-2 for CG-B, then again record-3 for CG-A.
How do I make sure that CG-A consumes record-1 from that offset?
Producers and consumers are completely decoupled. Your producer cannot send records "to a consumer".
Consumers always read all records from the topic partitions they've been assigned, regardless of which processes produced into them.
If only certain records are meant for certain consumer groups, then that's processing logic unique to your own applications, applied after consumption from Kafka, i.e. add conditional statements to filter those events.
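A minimal sketch of that post-consumption filtering (the `target` field is a hypothetical convention; in practice it might live in a record header rather than the payload):

```python
def events_for(group, records):
    """Each application keeps only the events addressed to it."""
    return [r["value"] for r in records if r["target"] == group]

# Hypothetical records, all published to the same topic
records = [
    {"target": "CG-A", "value": "record-1"},
    {"target": "CG-B", "value": "record-2"},
    {"target": "CG-A", "value": "record-3"},
]

# Both groups consume everything; each keeps only its own events
assert events_for("CG-A", records) == ["record-1", "record-3"]
assert events_for("CG-B", records) == ["record-2"]
```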

Spring Kafka, subscribing to large number of topics using topic pattern

Are there any known limitations with the number of Kafka topics, and with the distribution of topics between consumer instances, when subscribing to Kafka topics using a topicPattern?
Our use case is that we need to subscribe to a large number of topics (potentially a few thousand). All the topics follow a naming convention and have only 1 partition. We don't know the list of topic names beforehand, and new topics that match the pattern can be created at any time. Our consumers should be able to consume messages from all the topics that match the pattern.
1. Do we have any limit on the number of topics that can be subscribed to this way using the topic pattern?
2. Can we scale up the number of consumer instances to read from more topics?
3. Will the topics be distributed among all the running consumer instances? In our local testing we observed that all the topics were read by a single consumer despite multiple consumer instances running in parallel. Only when the serving consumer instance was shut down were the topics picked up by another instance (not distributed among the available instances).
All the topics mentioned are single-partition topics.
1. I don't believe there is a hard limit, but...
2. No; you can't change concurrency at runtime, but you can set a larger concurrency than needed and you will have idle consumers waiting for assignment.
3. You would need to provide a custom partition assignor, or select one of the alternate provided ones; e.g. RoundRobinAssignor will probably work for you: https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy
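To see why the assignor choice matters for many single-partition topics, here is a simplified round-robin assignment in plain Python (consumer and topic names are made up; real assignors work on partitions and involve more bookkeeping):

```python
def round_robin_assign(topics, consumers):
    """Deal single-partition topics out to consumers one at a time,
    instead of letting one consumer take a whole contiguous range."""
    assignment = {c: [] for c in consumers}
    for i, topic in enumerate(sorted(topics)):
        assignment[consumers[i % len(consumers)]].append(topic)
    return assignment

topics = [f"events.region-{i}" for i in range(6)]
assignment = round_robin_assign(topics, ["c1", "c2", "c3"])
# Every consumer gets a share instead of one consumer taking everything
assert all(len(t) == 2 for t in assignment.values())
```

With the default range-style behaviour the reporter observed, one instance can end up owning every single-partition topic; round-robin spreads them evenly.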

Kafka throttle producer based on consumer lag

Is there any way to pause or throttle a Kafka producer based on consumer lag or other consumer issues? Would the producer need to determine by itself whether there is consumer lag, and then throttle itself?
Kafka is built on a pub/sub design. Producers publish messages to a centralized topic, and multiple consumers can subscribe to that topic. Since multiple consumers are involved, you cannot base the producer's speed on any one of them: one consumer can be slow, another can be fast. It is also against the design principle; otherwise both systems would become tightly coupled. If you have a throttling use case, maybe you should evaluate another approach, such as direct REST calls.
Producer and consumer are decoupled.
Producers push data to Kafka topics (partitioned topics), which are stored on Kafka brokers. A producer doesn't know who consumes the messages, or how often.
Consumers consume data from brokers. A consumer doesn't know how many producers produced the messages. The same messages can even be consumed by several consumers that are in different groups; for example, some consumers can consume faster than others.
You can read more about producers and consumers on the Apache Kafka webpage.
It is not possible to throttle producers based on the performance of consumers.
"In my scenario I don't want to lose events if the disk size is exceeded before a message is consumed"
To tackle your issue, you have to rely on the parallelism offered by Kafka. Your Kafka topic should have multiple partitions, and producers should use different keys to populate the topic. Then your data will be distributed across multiple partitions, and by bringing in a consumer group you can manage load within a group of consumers. All data within a partition is processed in order; that may be relevant since you are dealing with event processing.
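A sketch of how keyed production spreads the load (plain Python; Kafka's default partitioner actually hashes keys with murmur2, so CRC32 here is only a stand-in, and the key names are made up):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Stand-in for Kafka's key-hash partitioning (the real default is murmur2)."""
    return zlib.crc32(key) % num_partitions

# The same key always lands in the same partition, preserving per-key order...
assert partition_for(b"device-42", 6) == partition_for(b"device-42", 6)

# ...while many distinct keys spread across partitions, sharing the load
used = {partition_for(f"device-{i}".encode(), 6) for i in range(100)}
assert len(used) > 1
```

This is why using different keys matters: with a single key (or no key batching behaviour spreading records), all load can concentrate on one partition and thus one consumer in the group.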

Kafka consumer reading messages parallel

Can we have multiple consumers consume from a topic to achieve parallel processing in Kafka?
My use case is to read messages from a single partition in parallel.
Yes, you can process messages in parallel using many Kafka consumers, but no, it's not possible if you only have one partition.
Parallelism in Kafka consuming is defined by the number of partitions; you can easily re-partition your topic at any time to create more partitions.
Below is an example of how to process messages in parallel using rapids-kafka-client, a library that makes parallel Kafka consuming easier.
public static void main(String[] args) {
    ConsumerConfig.<String, String>builder()
        .prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
        .prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
        .prop(GROUP_ID_CONFIG, "stocks")
        .topics("stock_changed")
        .consumers(7)
        .callback((ctx, record) -> {
            System.out.printf("status=consumed, value=%s%n", record.value());
        })
        .build()
        .consume()
        .waitFor();
}
Simply put, we can't achieve parallelism within a single partition for consumers by default.
But you can try Akka Streams Kafka (Reactive Kafka). Do go through these docs.
The number of partitions defines the level of parallelism for reading from a Kafka topic, but reading is (more or less) only restricted by your network capacity.
A good pattern is to separate the reading and the processing of messages: one thread per topic-partition for reading, and multiple threads for processing those messages.
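A broker-free sketch of that pattern (a stubbed reader thread stands in for the per-partition poll loop; a pool of worker threads does the processing):

```python
import queue
import threading

def run(partition_records, workers=4):
    """One reader thread feeds a queue; `workers` threads process in parallel."""
    q = queue.Queue()
    processed, lock = [], threading.Lock()

    def reader():
        for rec in partition_records:  # stands in for a consumer.poll() loop
            q.put(rec)
        q.put(None)                    # sentinel: no more records

    def worker():
        while True:
            rec = q.get()
            if rec is None:
                q.put(None)            # re-queue sentinel for the other workers
                return
            with lock:
                processed.append(rec.upper())  # the "processing" step

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

assert sorted(run(["a", "b", "c"])) == ["A", "B", "C"]
```

Note that ordering across workers is not guaranteed; if per-key ordering matters, route records sharing a key to the same worker.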
You need multiple partitions to do this, or something like Parallel Consumer (PC) to subdivide the single partition.
However, it's recommended to have at least 3 partitions and at least three consumers running in a group, to utilise high availability. You can again use PC to process all these partitions, subdivided by key, in parallel.
PC directly solves this by sub-partitioning the input partitions by key and processing each key in parallel.
It also tracks per-record acknowledgement. Check out Parallel Consumer on GitHub (it's open source, BTW, and I'm the author).

Can I have multiple consumer reading from same Kafka Topic in parallel?

Currently I have one Kafka topic.
Now I need to run multiple consumers so that messages can be read and processed in parallel.
Is this possible?
I am using Python and the pykafka library.
consumer = topic.get_simple_consumer(consumer_group=b"charlie",
                                     auto_commit_enable=True)
This is delivering the same message to both consumers. I need to process each message only once.
You need to use BalancedConsumer instead of SimpleConsumer:
consumer = topic.get_balanced_consumer(consumer_group=b"charlie",
                                       auto_commit_enable=True)
You should also ensure that the topic you're consuming has at least as many partitions as the number of consumers you're instantiating.
Generally you need multiple partitions and multiple consumers to do this, or something like Parallel Consumer (PC) to subdivide the single partition.
However, it's recommended to have at least 3 partitions and at least three consumers running in a group, to utilise high availability. You can again use PC to process all these partitions, subdivided by key, in parallel.
PC directly solves this by sub-partitioning the input partitions by key and processing each key in parallel.
It also tracks per-record acknowledgement. Check out Parallel Consumer on GitHub (it's open source, BTW, and I'm the author).
Yes, you can have multiple consumers reading from the same topic in parallel, provided they use the same consumer group id, and the number of partitions of the topic should be greater than or equal to the number of consumers; otherwise some of the consumers will not be assigned any partitions and won't fetch any data.
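A sketch of the group semantics behind this answer (a simplified modulo assignment; real Kafka assignors differ in detail):

```python
def partitions_for(consumer_index, group_size, num_partitions):
    """Which partitions a consumer in a group would own (simplified)."""
    return [p for p in range(num_partitions) if p % group_size == consumer_index]

# Two consumers in one group split four partitions without overlap,
# so each message is processed once...
a = partitions_for(0, 2, 4)
b = partitions_for(1, 2, 4)
assert sorted(a + b) == [0, 1, 2, 3] and not set(a) & set(b)

# ...while a fifth consumer over four partitions would own nothing (idle)
assert partitions_for(4, 5, 4) == []
```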