Join multiple Kafka topics by key - apache-kafka

How can I write a consumer that joins multiple Kafka topics in a scalable way?
I have a topic that publishes events with a key, and a second topic that publishes other events, related to a subset of the first, with the same key. I would like to write a consumer that subscribes to both topics and performs some additional actions for the subset of keys that appears in both topics.
I can do this easily with a single consumer: read everything from both topics, maintain state locally, and perform the actions when both events have been read for a given key. But I need the solution to scale.
Ideally I need to tie the topics together so that they are partitioned the same way and the partitions are assigned to consumers in sync. How can I do this?
I know Kafka Streams joins topics together such that keys are allocated to the same nodes. How does it do that? P.S. I can't use Kafka Streams because I'm using Python.

Too bad you are on Python -- Kafka Streams would be a perfect fit :)
If you want to do this manually, you will need to implement your own PartitionAssignor -- this implementation must ensure that partitions are co-located in the assignment: assume you have 4 partitions per topic (let's call the topics A and B), then partitions A_0 and B_0 must be assigned to the same consumer (also A_1 and B_1, ...).
I hope the Python consumer allows you to specify a custom partition assignor via the config parameter partition.assignment.strategy.
This is the PartitionAssignor Kafka Streams uses: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamPartitionAssignor.java
Streams uses the concept of tasks -- a task gets partitions of different topics with the same partition number assigned. Streams also tries to do a "sticky assignment" -- ie, it doesn't move tasks (and thus partitions) during a rebalance if possible. For this, each consumer encodes its "old assignment" in the rebalance metadata.
Basically, the method #subscription() is called on each consumer that is alive. It will send the subscription information of the consumer (ie, to what topics a consumer wants to subscribe) plus optional metadata to the brokers.
In a second step, the leader of the consumer group will compute the actual assignment within #assign(). The responsible broker collects all the information given by #subscription() in the first phase of the rebalance and hands it to #assign(). Thus, the leader gets a global overview of the whole group and can ensure that partitions are assigned in a co-located manner.
In the last step, the broker receives the computed assignment from the leader and broadcasts it to all consumers of the group. This results in a call to #onAssignment() on each consumer.
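To illustrate the idea (this is not how Streams actually implements it), here is a minimal sketch of such a co-locating assignor, built on Kafka's AbstractPartitionAssignor. The exact signature varies across client versions, and the class name and round-robin distribution are illustrative assumptions:

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerPartitionAssignor.Subscription;
import org.apache.kafka.clients.consumer.internals.AbstractPartitionAssignor;
import org.apache.kafka.common.TopicPartition;

// Sketch only: assumes all subscribed topics have the same partition count.
public class CoPartitionedAssignor extends AbstractPartitionAssignor {

    @Override
    public String name() {
        return "copartitioned";
    }

    @Override
    public Map<String, List<TopicPartition>> assign(
            Map<String, Integer> partitionsPerTopic,
            Map<String, Subscription> subscriptions) {
        // Sort members so every rebalance computes the same mapping.
        List<String> members = new ArrayList<>(subscriptions.keySet());
        Collections.sort(members);

        Map<String, List<TopicPartition>> assignment = new HashMap<>();
        members.forEach(m -> assignment.put(m, new ArrayList<>()));

        // Give partition number p of *every* topic to the same member,
        // so A_p and B_p are always consumed together.
        int numPartitions = partitionsPerTopic.values().iterator().next();
        for (int p = 0; p < numPartitions; p++) {
            String member = members.get(p % members.size());
            for (String topic : partitionsPerTopic.keySet()) {
                assignment.get(member).add(new TopicPartition(topic, p));
            }
        }
        return assignment;
    }
}

You would then register it on every consumer via partition.assignment.strategy (the full class name), assuming your client supports pluggable assignors.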
This might also help:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Architecture
http://docs.confluent.io/current/streams/architecture.html

Related

Publisher which subscribes to its own topic

I'm currently designing an application which will have hundreds of log-compacted topics. Each topic is related to a failover group and should have a dynamic (e.g., changeable on demand) set of producers and consumers.
For example, let's say I have 3 failover instances related to topic T1. Each of those failover instances should have the same data / state (eventually consistent). And each of the instances may consume and produce messages on that topic.
As I understand it, I need to assign different group IDs to each consumer/producer in order to have every instance read the topic entirely.
Though, given that the number of readers and writers for a topic is not fixed, how is it possible to avoid reading one's own messages for that topic?
Sure, I could add a source ID to the message and just dismiss the message when the consumer figures out that it is about to read a message it previously produced itself. But I'd rather avoid the data transfer entirely.
Producers and consumers are independent processes. If you subscribe to the same topic that's being produced to without some extra processing logic, you'll end up with an infinite loop.
You also cannot have more active consumers in a group than partitions, so the dynamic consumer count will be limited by that.
need to assign different group IDs for each consumer/producer in order to have every instance read the topic entirely
Not necessarily. You've mentioned you have compacted topics, so I assume you are using Kafka Streams. In the Streams API, you can set num.standby.replicas to copy state store data across instances of the same application.id.
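If that's the case, a minimal sketch of such a configuration could look like this (application id, topic name, and replica count are illustrative assumptions):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class FailoverApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "failover-group-t1"); // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Keep 2 extra hot copies of every state store on other instances
        // sharing this application.id, so another instance can take over quickly.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2);

        StreamsBuilder builder = new StreamsBuilder();
        builder.table("T1"); // materializes the compacted topic as a local state store

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}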

How does co-partitioning ensure that partitions from 2 different topics end up assigned to the same Kafka Streams Task?

While I understand the prerequisite of co-partitioning as explained here: Why does co-partitioning of two Kstreams in kafka require same number of partitions for both the streams?, I do not understand the mechanism that makes sure that the partitions of each topic that hold the same keys get assigned to the same Kafka Streams task. I do not see how Kafka's consumer group would enable that.
The way I understand it, we have 2 independent consumer groups, which may actually have the same name, because it is the same Kafka Streams application, although the subscription to each topic is independent of the other.
Somehow, the consumers in each consumer group get assigned to the partitions that contain the same keys. I did not know that the assignment of consumers to partitions could be related to the content of the partitions. So far I thought it was random.
Can someone explain that part?
The way I understand it, we have 2 independent consumer groups, which may actually have the same name, because it is the same Kafka Streams application, although the subscription to each topic is independent of the other.
All members of a consumer group have the same "name" (ie, group.id) -- it is not possible to have two consumer groups with the same name. It would be one consumer group.
although the subscription to each topic is independent of the other
For KafkaConsumer it's possible to have different subscriptions for different members of the group (even if this should be a very rare scenario). For Kafka Streams, however, it is required that all members of the group (ie, application instances) execute the exact same Topology with the exact same input topics (ie, their subscriptions must be the same).
I did not know that the assignment of consumers to partitions could be related to the content of the partitions. So far I thought it was random.
That is correct.
From your own answer:
In other words, if the number of partitions is the same, and the partitioning strategy of each producer to the topics is the same, messages with the same key will be placed the same way across the partition range, which is assigned to the consumers in the same way, i.e. as a consecutive subset of partitions from each topic. Hence the same stream task will always have the partitions of both topics which hold the same keys.
That is also correct.
Note that Kafka Streams uses a special partition assignor (not the default ones the consumer offers) to ensure co-partitioning, stickiness (ie, state-store awareness), and to assign standby-tasks.
After refreshing my memory, I found the following two statements, which explain it all:
A consumer group has a unique id. Each consumer group is a subscriber to one or more Kafka topics.
Hence a consumer group may involve multiple topics with their partitions, and a strategy to assign them to the consumers of the group.
partition.assignment.strategy (from Kafka: The Definitive Guide)
A PartitionAssignor is a class that, given consumers and topics they subscribed to, decides which partitions will be assigned to which consumer. By default, Kafka has two assignment strategies:
Range: Assigns to each consumer a consecutive subset of partitions from each topic it subscribes to. So if consumers C1 and C2 are subscribed to two topics, T1 and T2, and each of the topics has three partitions, then C1 will be assigned partitions 0 and 1 from topics T1 and T2, while C2 will be assigned partition 2 from those topics. Because each topic has an uneven number of partitions and the assignment is done for each topic independently, the first consumer ends up with more partitions than the second. This happens whenever Range assignment is used and the number of consumers does not divide the number of partitions in each topic neatly.
In other words, if the number of partitions is the same, and the partitioning strategy of each producer to the topics is the same, messages with the same key will be placed the same way across the partition range, which is assigned to the consumers in the same way, i.e. as a consecutive subset of partitions from each topic. Hence the same stream task will always have the partitions of both topics which hold the same keys.
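To see why the producer side lines up, recall that Kafka's default partitioner hashes the serialized key with murmur2 and takes the result modulo the partition count. This toy snippet (illustrative only, reusing Kafka's own Utils class) shows that the same key maps to the same partition number whenever two topics have the same number of partitions:

import org.apache.kafka.common.utils.Utils;

public class SamePartitionDemo {
    // Same formula the default partitioner applies to non-null keys.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(); // illustrative key
        // Both topics have 3 partitions, so the key lands on the same
        // partition number in each of them.
        System.out.println("T1 -> partition " + partitionFor(key, 3));
        System.out.println("T2 -> partition " + partitionFor(key, 3));
    }
}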

If you have fewer consumers than partitions, what happens?

If you have fewer consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
What if you have multiple consumers on a given topic#partition? I guess the consumer has to somehow keep track of what messages it has already processed in case of duplicates?
In fact, each consumer belongs to a consumer group. When the Kafka cluster sends data to a consumer group, all records of a partition will be sent to a single consumer in the group.
If there are more partitions than consumers in a group, some consumers will consume data from more than one partition. If there are more consumers in a group than partitions, some consumers will get no data. If you add new consumer instances to the group, they will take over some partitions from old members. If you remove a consumer from the group (or the consumer dies), its partitions will be reassigned to other members.
Now let's take a look at your questions:
If you have fewer consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
NO. Some consumers in the same consumer group will consume data from more than one partition.
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
Kafka will take care of it. If new consumers join the group, or old consumers die, Kafka will rebalance.
What if you have multiple consumers on a given topic#partition?
You CANNOT have multiple consumers (in a consumer group) consuming data from a single partition. However, if there is more than one consumer group, the same partition can be consumed by one (and only one) consumer in each consumer group.
1) No, that means you will have one consumer handling more than one partition.
2) Kafka never assigns the same partition to more than one consumer in a group, because that would violate the ordering guarantee within a partition.
3) You could implement a ConsumerRebalanceListener in your client code; it gets called whenever partitions are assigned to or revoked from a consumer.
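As a minimal sketch (topic name, group id, and servers are made-up assumptions), such a listener could look like this:

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("demo-topic"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                        System.out.println("Revoked: " + parts);
                    }
                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                        System.out.println("Assigned: " + parts);
                    }
                });
        while (true) {
            consumer.poll(Duration.ofSeconds(1)); // polling drives the rebalance callbacks
        }
    }
}

Start several instances of this program with the same group id and watch the assignments move between them.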
You might want to take a look at this article, specifically the "Assigning partitions to consumers" part. In it I have a sample where you create a topic with 3 partitions and then a consumer with a ConsumerRebalanceListener telling you which consumer is handling which partition. Now you can play around with it by starting 1 or more consumers and see what happens. The sample code is on GitHub.
http://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html

How can Apache Kafka send messages to multiple consumer groups?

In the Kafka documentation:
Kafka handles this differently. Our topic is divided into a set of totally ordered partitions, each of which is consumed by one consumer at any given time. This means that the position of a consumer in each partition is just a single integer, the offset of the next message to consume. This makes the state about what has been consumed very small, just one number for each partition. This state can be periodically checkpointed. This makes the equivalent of message acknowledgements very cheap.
Yet, following their quick start guide in that same document, I was easily able to:
Create a topic with a single partition
Start a console-producer
Push a few messages
Start a consumer to consume --from-beginning
Start another consumer --from-beginning
And have both consumers successfully consume from the same partition.
But this seems at odds with the documentation above?
When using different consumer groups, consumers can easily consume the same partitions. You may consider group ids as different applications consuming a Kafka topic: multiple applications might want to use the data in a Kafka topic differently, and thus not conflict with each other. That's why two consumers may consume one partition (in fact, this is the only way two consumers can consume one partition).
And when you start a console consumer, it randomly generates a group id for it (link); thus these consumers are doing exactly what I just wrote.
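To make this concrete, here is a minimal sketch in Java (topic name, group ids, and servers are assumptions, not taken from the quick start): the two consumers differ only in group.id, so each one independently reads the whole single-partition topic.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TwoGroupsDemo {
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId); // the only difference
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // like --from-beginning
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> c = new KafkaConsumer<>(props);
        c.subscribe(Collections.singletonList("test")); // illustrative topic
        return c;
    }

    public static void main(String[] args) {
        // Each group tracks its own offset position, so both read everything.
        // Note: the first poll may return empty while the group is still joining.
        KafkaConsumer<String, String> app1 = consumerFor("group-a");
        KafkaConsumer<String, String> app2 = consumerFor("group-b");
        app1.poll(Duration.ofSeconds(5)).forEach(r -> System.out.println("a: " + r.value()));
        app2.poll(Duration.ofSeconds(5)).forEach(r -> System.out.println("b: " + r.value()));
    }
}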

Kafka: how to consume one topic in parallel

I read the Kafka documentation, but I still don't know how to consume one topic in parallel.
Suppose:
I have one topic, like "something happened" (don't split this topic), and I have many customers that want to consume it.
So what should I do so that multiple customers can consume it in parallel? Should I use partitioning and consumer groups?
I have one idea about this, but I'm not sure whether it is right.
Make many partitions for the same topic and map one partition to one customer, so one producer must produce the same message to each of these partitions, and every customer is in a different consumer group. Is that right?
Using partitions is the way to parallelize the consumption of a topic. Let's say you have 10 partitions for your topic; then you can have 10 consumers in the same consumer group, reading one partition each. If you have fewer consumers than partitions, they would each be responsible for more than one partition. If you have more consumers than partitions, there would be consumers who would not get any partition assigned to them and would have nothing to do except being available to replace another consumer who has died.
Each topic in Kafka can be organized into many partitions. Partitions allow for parallel consumption, increasing throughput.
A producer publishes messages to a topic using the Kafka producer client library, which balances the messages across the available partitions using a Partitioner. The broker to which the producer connects takes care of sending the message to the broker which is the leader of that partition, using the partition owner information in ZooKeeper. Consumers use Kafka's high-level consumer library (which implicitly handles broker leader changes, managing offset info in ZooKeeper, figuring out partition owner info, etc.) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1, C2, C3, started in that order), all belonging to the same consumer group, we can have different consumption models that allow read parallelism, as below.
Each consumer uses a single stream.
In this model, when C1 starts, all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams, so each stream will be assigned 5 partitions (depending on the rebalance algorithm it might also be 4 vs 6), and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream
(say C1 uses 3, C2 uses 3 and C3 uses 4). In this model, when C1 starts, all 10 partitions are assigned to its 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams, and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. In case the number of streams exceeds the number of partitions, some streams will not get any messages, as they will not be assigned any partitions.
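For the second model, a rough sketch against the old high-level consumer API (the 0.8-era ConsumerConnector, removed in later Kafka versions; group id, topic name, and stream count are illustrative assumptions) could look like this:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class MultiStreamConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "demo-group"); // illustrative
        ConsumerConnector connector =
                kafka.consumer.Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Ask for 3 streams of the topic; Kafka maps the partitions onto them.
        Map<String, Integer> topicCountMap = new HashMap<>();
        topicCountMap.put("my-topic", 3);
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(topicCountMap);

        // One thread per stream: concurrent consumption within a single consumer.
        for (final KafkaStream<byte[], byte[]> stream : streams.get("my-topic")) {
            new Thread(() -> {
                for (MessageAndMetadata<byte[], byte[]> msg : stream) {
                    System.out.println(new String(msg.message()));
                }
            }).start();
        }
    }
}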
Just to add to the list of answers, Confluent has a library to do this for you, like Rapids. The project is here:
https://github.com/confluentinc/parallel-consumer
It's open source.
Note: I'm the author.
#Lundahl did all the didactics, I'll give you a practical sample.
Create a topic for some meaning, e.g. news_events, with the parallelism your consumers will need (partitions). You can calculate that from the time to process one message, the number of messages you will have, and the time within which you want all the messages processed; e.g., if one message takes 100 ms to process and 100 messages arrive per second, you need at least 10 partitions (with 10 consumer threads) to keep up.
Let's create consumers for that topic. Say you want to read the news and your sister or brother does too, each at your own pace; then each of you needs a consumer group id. This way Kafka knows that threads a, b, c belong to one consumer group and d, e, f to the second one. Every consumer group will receive the same messages and process them at its own pace without affecting the others.
A message will arrive at one partition or another, never at two; by default Kafka uses round robin to choose the partition. Remember, all consumer groups can connect to and read data from all the same partitions.
I would suggest you use rapids-kafka-client, a library which does that parallelism stuff for you: choose a number of threads equal to the number of partitions you have, choose a consumer group, and see the magic happen.
public static void main(String[] args) {
    ConsumerConfig.<String, String>builder()
        .prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
        .prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
        .prop(GROUP_ID_CONFIG, "news-app")
        .topics("news_events")
        .consumers(7)
        .callback((ctx, record) -> {
            System.out.printf("status=consumed, value=%s%n", record.value());
        })
        .build()
        .consume()
        .waitFor();
}
You can read more about consumer groups, topics and partitions here
I assume what you want is parallel consumption between customers in a publish/subscribe fashion.
Besides that, you can also have parallel consumption within a single customer in order to scale the consumer application.
Parallel consumption between customers
If by "customers" you mean different organizations which are interested in consuming topic's messages independently, all you need is consumer groups.
This is a simple publish/subscribe pattern where each customer runs its own application and reads all the topic's messages without interfering with others.
Each customer application can be seen as a consumer group, made up of one or more Kafka consumers (whether running on a single node or spread across a cluster), all of them sharing the consumer group's identifier.
You achieve this goal regardless of partitions. In case the topic is partitioned, you don't need to worry about writing the same message to all partitions. Remember that in Kafka messages are durable: a message read by a Kafka consumer is not deleted and is available to be read by other Kafka consumers from a different consumer group (until it expires). Furthermore, partitions are not meant to work like this; they help scale storage of data (at a certain point all of a topic's data wouldn't fit into just one node) and scale consumer applications, as you can see below.
Parallel consumption within a single customer
You can further parallelize, or better said, scale the consumption of messages within a consumer group with, in fact, more Kafka consumers.
Imagine the topic is huge, producers write into it at a high rate, and the consumer group has only one consumer: this poor consumer may struggle to keep up with the message arrival rate, especially if message processing is time-consuming too.
That's the case where you need partitions and more consumers in your consumer group, so that Kafka will assign partitions to consumers to distribute reading load among them.
How partition assignment works has already been explained in other answers here, but basically, for a given consumer group:
each of the topic's partitions is assigned exclusively to one consumer,
a consumer might get assigned multiple partitions,
if there are more consumers than the topic's partitions, some of them will stay idle, as they won't get assigned any partition to consume from.
Remember that message ordering in Kafka is guaranteed only at partition level, so if you have many partitions and ordering matters, you need to choose the right message key to partition data according to your requirements.
For example, if you want messages to be ordered by device, a device_id would be your key: it guarantees that messages of the same device will be written to the same partition.
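A minimal producer sketch of that idea (topic name, field values, and servers are illustrative assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeviceKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String deviceId = "device-17"; // same key -> same partition -> ordered
            producer.send(new ProducerRecord<>("device-events", deviceId, "temperature=21.5"));
            producer.send(new ProducerRecord<>("device-events", deviceId, "temperature=21.7"));
        }
    }
}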