Merge multiple partitions into an ordered single partition

Let's say I have a topic with 3 partitions, each of which grows in increasing order:
1: {1,1,1,2,2}, 2: {1,2,3,3}, 3: {2,2,2,3,4}
Would it be possible to stream them all into a single ordered partition of another topic?
1: {1,1,1,1,2,2,2,2,2,2,3,3,3,4}
What would be the best approach to do that?

First, create a topic with one partition.
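Then use a single Kafka Streams or consumer application to consume the original topic and write to the new one. Note that the order in the output topic is not guaranteed: records arrive in order within each source partition, but the interleaving across partitions is arbitrary.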
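If a sorted result is required, one option is a k-way merge. The following is a minimal sketch, assuming each partition's contents are finite and already sorted (a snapshot rather than a live stream). The lists are the example partitions from the question, `merge_partitions` is a hypothetical helper, and no Kafka client is involved.

```python
import heapq


def merge_partitions(partitions):
    """K-way merge of individually sorted partitions into one sorted list."""
    # heapq.merge lazily yields the smallest head element among all inputs.
    return list(heapq.merge(*partitions))


partitions = [
    [1, 1, 1, 2, 2],  # partition 1
    [1, 2, 3, 3],     # partition 2
    [2, 2, 2, 3, 4],  # partition 3
]
merged = merge_partitions(partitions)
print(merged)
```

With a live topic the merging application cannot know whether a smaller value is still going to arrive on another partition, so it would have to buffer or wait (for example until every partition has a pending record) before emitting anything.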

Related

Kafka message partitioning by key

We have a business process/workflow that is started when an initial event message is received and closed when the last message is processed. Up to 100,000 processes are executed each day. My problem is that the messages belonging to a specific process have to be processed in the same order in which they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes have to continue. For this kind of situation I am thinking of using Kafka. The first solution that came to my mind was to partition the topic by message key, with the ProcessId as the key. This way I could be sure that all messages of a process end up in the same partition, and Kafka would guarantee their order. As I am new to Kafka, what I have managed to figure out is that partitions have to be created in advance, and that makes everything too difficult. So my questions are:
1) When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
2) There can be more than 100,000 active partitions on the topic. Is that a problem?
3) Can a partition be deleted after all messages from that topic have been read?
4) Maybe you can suggest other approaches to my problem?

When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
You need to specify the number of partitions when creating the topic. New partitions won't be created automatically (as is the case with topic creation); you have to change the number of partitions using the topic tool.
More Info: https://kafka.apache.org/documentation/#basic_ops_modify_topi
As soon as you increase the number of partitions, producers and consumers will be notified of the new partitions, which leads them to rebalance. Once rebalanced, they will start producing to and consuming from the new partitions.
There can be more than 100,000 active partitions on the topic. Is that a problem?
Yes, having this many partitions will increase overall latency.
Go through how-choose-number-topics-partitions-kafka-cluster to see how to decide on the number of partitions.
Can a partition be deleted after all messages from that topic have been read?
Deleting a partition would lead to data loss. The keys of the remaining data would also no longer be distributed correctly, so new messages would not be directed to the same partitions as existing messages with the same key. That's why Kafka does not support decreasing the partition count of a topic.
Also, the Kafka documentation states:
Kafka does not currently support reducing the number of partitions for a topic.
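The key-distribution problem can be illustrated with a small sketch. It assumes a hash-mod-N partitioner and uses `zlib.crc32` as a stand-in hash (Kafka's Java client uses murmur2 on the serialized key, so the actual partition numbers would differ). `partition_for` and the `process-N` keys are hypothetical names for illustration.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition index with hash(key) mod num_partitions."""
    # zlib.crc32 is a stand-in hash for this sketch, not Kafka's own hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"process-{i}" for i in range(1000)]
before = {k: partition_for(k, 3) for k in keys}  # topic with 3 partitions
after = {k: partition_for(k, 4) for k in keys}   # same topic grown to 4

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys now map to a different partition")
```

Any change of the partition count, whether up or down, remaps most keys, so messages produced after the change can land in a different partition than earlier messages with the same key.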

I think you have chosen the wrong feature to solve your task.
In general, partitioning is used for load balancing.
Incoming messages are distributed over the given number of partitions according to the partitioning strategy, which is configured on the producer. In short, the default strategy just calculates i = key_hash mod number_of_partitions and puts the message into the i-th partition. More about strategies you could read here
Message ordering is guaranteed only within a partition. For two messages from different partitions, there is no guarantee which one reaches the consumer first.
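A minimal sketch of why keying by process ID preserves per-process order: every message with the same key lands in the same partition, and a partition is an append-only sequence. The partition count, the sample messages and the `zlib.crc32` stand-in hash are assumptions for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed for illustration


def partition_for(key):
    # Stand-in hash; Kafka's Java client hashes the serialized key with murmur2.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# (process_id, step) pairs in the order they were produced.
produced = [("p1", 1), ("p2", 1), ("p1", 2), ("p3", 1), ("p2", 2), ("p1", 3)]

partitions = defaultdict(list)
for process_id, step in produced:
    partitions[partition_for(process_id)].append((process_id, step))


def steps_seen(process_id):
    """Steps of one process in the order a reader of its partition sees them."""
    log = partitions[partition_for(process_id)]
    return [step for pid, step in log if pid == process_id]


print(steps_seen("p1"))  # [1, 2, 3]
```

The relative order of messages from different processes is only preserved when they happen to share a partition.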
You could probably use consumer groups instead. The group is a consumer-side option.
Each group consumes all messages from the topic independently.
A group can consist of one consumer, or more if you need them.
You can create many groups and add a new group dynamically (in fact, you just add a new consumer with a new groupId).
As you can stop or pause any consumer, you can manually stop all consumers belonging to a given group. I suppose there is no single command to do that, but I'm not sure. Anyway, if you have a single consumer in each group, you can stop it easily.
If you want to remove a group, you just shut down and drop the related consumers. No action on the broker side is needed.
As a drawback, you'll get 100,000 consumers reading a (single) topic. That's a heavy network load, at the very least.
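The independence of groups comes from each group keeping its own read position. The following toy model (a hypothetical `Topic` class with a single partition, not the Kafka API) sketches that idea: pausing one group does not affect another.

```python
class Topic:
    """A single-partition log with one read offset per consumer group."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # group id -> next offset to read

    def produce(self, message):
        self.log.append(message)

    def poll(self, group_id, max_records=10):
        """Return the next records for a group and advance only its offset."""
        start = self.offsets.get(group_id, 0)
        records = self.log[start:start + max_records]
        self.offsets[group_id] = start + len(records)
        return records


topic = Topic()
for i in range(5):
    topic.produce(f"msg-{i}")

a_first = topic.poll("group-a")                 # group-a reads everything
b_first = topic.poll("group-b", max_records=2)  # group-b stops after 2 records
topic.produce("msg-5")
a_more = topic.poll("group-a")                  # group-a continues unaffected
b_rest = topic.poll("group-b")                  # group-b resumes where it stopped
```

In this model a "frozen" process corresponds to a group that simply stops polling; its offset stays put until it resumes.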

When to create new Consumer in ConsumerGroup

I am a newbie in the Kafka world and was reading about Consumer and ConsumerGroup. I get the difference between them and understand why we need ConsumerGroup in Kafka.
My question is: when should we decide to create a new Consumer within the same group?
When we have a huge amount of data?
Could someone help me understand a real use case?
Thanks

I think some very good points have already been mentioned and here are my few cents. As your primary question seems to be "When" to add a consumer in a group...
There are 2 scenarios I can think of:
1) One or more consumers in a consumer group are overloaded by consuming from multiple partitions, and you intend to distribute that load and increase parallelism. In this case, you could add consumers and trigger a rebalance.
2) The number of partitions in a topic is increasing. This is quite a tricky scenario and may disturb the existing consumers in some ways. Following are a few examples of when this might happen:
a) If the semantics of your data are changing, as partitioning a topic based on the semantics is quite a common use case
b) If the data volume is increasing and the semantics are also changing
c) If only the volume is increasing, which leads to Scenario 1
However, as you've pointed out in your question, if only the volume is increasing and the consumers in a group are nicely mapped to the partitions on a 1-to-1 basis, then you may be better off leaving things as they are. Otherwise, you might end up in Scenario 2b.
Hope this helps!

In Apache Kafka, the level of parallelism is defined by the number of partitions. The higher the number of partitions, the higher the level of parallelism one can achieve. Depending on the volume of data, you should set the number of partitions to the desired value. Note that you cannot have more active consumers in a group than the number of partitions.
For example, assume that you have a topic test with 5 partitions and a consumer group test-group. At any given time, only 5 consumers can be active within test-group. Say we've got 1000 messages in topic test; then each of the 5 active consumers will consume (approximately) 200 messages, assuming the messages are spread evenly over the partitions. If you run more than 5 consumers, the extra ones will be inactive, meaning that they won't consume any messages at all. Similarly, if you have fewer consumers than partitions, then some of your active consumers will consume messages from more than one partition.
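The effect of the consumer count can be sketched with a simple round-robin assignment. This is an illustration only: `assign` is a hypothetical helper, and Kafka's built-in assignors (range, round-robin, sticky) differ in which consumer gets which partition, though not in the fact that surplus consumers stay idle.

```python
def assign(partitions, consumers):
    """Spread partitions over consumers round-robin; surplus consumers get none."""
    if not consumers:
        return {}
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = list(range(5))  # topic "test" with 5 partitions

seven = assign(partitions, [f"c{i}" for i in range(7)])
idle = [c for c, owned in seven.items() if not owned]
two = assign(partitions, ["c0", "c1"])

print(idle)  # ['c5', 'c6']
print(two)   # {'c0': [0, 2, 4], 'c1': [1, 3]}
```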
Another, less straightforward, example would be the following (taken from):
In this scenario, we have two topics (A and B), each of which has 3 partitions. Two consumers belonging to the same consumer group are consuming messages from both topics.

As mentioned above, Kafka scales topic consumption by distributing partitions among the consumers of a consumer group. A consumer group is nothing but a set of consumers sharing a common identifier.
A consumer is responsible for consuming messages from one or more partitions. If there is a single consumer running in the consumer group, it will consume data from all partitions. If there are multiple consumers running within the same group, they share the load by consuming from different partitions.
The maximum number of active consumers in a group is equal to the number of partitions. If the number of consumers exceeds the number of partitions, the excess consumers will be idle.
Let's say there is a topic with 4 partitions and two consumer groups, A and B. Group A has two consumers, C1 and C2, each of which will consume from 2 partitions.
In consumer group B there are 4 consumers, so each consumer will consume from one partition.
When to use a single consumer or multiple consumers: it depends on the use case. If you want a consolidated output from the processing, where the calculations are based on the entire data in the topic, you should use a single consumer, unless you have post-processing logic to merge the output from each consumer.
If you are just reading the data and want to parallelize the process by distributing load, use multiple consumers.
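The "post-processing logic to merge the output" can be as small as combining partial results. A minimal sketch, assuming a counting job over a 4-partition topic whose contents are made up for illustration:

```python
from collections import Counter

# Records of a 4-partition topic (contents are made up for illustration).
partitions = [
    ["a", "b", "a"],
    ["b", "c"],
    ["a", "c", "c"],
    ["b"],
]

# Single consumer: one pass over all partitions gives the consolidated result.
single = Counter(record for p in partitions for record in p)

# Two consumers with two partitions each: each produces a partial result...
partial_1 = Counter(partitions[0] + partitions[1])
partial_2 = Counter(partitions[2] + partitions[3])
# ...and a post-processing step merges the partial results.
merged = partial_1 + partial_2

print(merged == single)  # True
```

This works because counting is mergeable; a computation that needs to see all records in one place (for example a global sort) does not split this easily.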

Topics, partitions and keys

I am looking for some clarification on the subject.
In the Kafka documentation I found the following:
Kafka only provides a total order over messages within a partition,
not between different partitions in a topic. Per-partition ordering
combined with the ability to partition data by key is sufficient for
most applications. However, if you require a total order over messages
this can be achieved with a topic that has only one partition, though
this will mean only one consumer process per consumer group.
So here are my questions:
Does it mean that if I want to have more than 1 consumer (from the same group) reading from one topic, I need to have more than 1 partition?
Does it mean I need the same number of partitions as consumers in the same group?
How many consumers can read from one partition?
I also have some questions regarding the relationship between keys and partitions with regard to the API. I only looked at the .NET APIs (especially the one from MS), but it looks like they mimic the Java API.
I see that when using a producer to send a message to a topic there is a key parameter, but when a consumer reads from a topic there is a partition number.
How are partitions numbered? Starting from 0 or 1?
What exactly is the relationship between a key and a partition?
As I understand it, some function of the key determines the partition. Is that correct?
If I have 2 partitions in a topic and want some particular messages to go to one partition and the other messages to go to another, should I use a specific key for one specific partition, and the rest for the other?
What if I have 3 partitions and want one type of message to go to one particular partition and the rest to the other 2?
How, in general, do I send messages to a particular partition so that a consumer knows where to read from?
Or am I better off with multiple topics?
Thanks in advance.

Does it mean if I want to have more than 1 consumer (from the same group) reading from one topic I need to have more than 1 partition?
Let's look at the following properties of Kafka:
each partition is consumed by exactly one consumer in the group
one consumer in the group can consume more than one partition
the number of active consumer processes in a group is <= the number of partitions
With these properties, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes.
To answer your question: yes, in the context of the same group, if you want to have N active consumers, you have to have at least N partitions.
Does it mean I need same amount of partitions as amount of consumers for the same group?
No. As explained in the first answer, a consumer can consume more than one partition, so you only need at least as many partitions as consumers.
How many consumers can read from one partition?
The number of consumers that can read from one partition at any given time is equal to the number of consumer groups subscribing to that topic, since each group has exactly one consumer on that partition.
Relationship between keys and partitions with regard to API
First, we must understand that the producer is responsible for choosing which partition within the topic each record is assigned to.
Now, let's see how the producer does so. First, let's look at the class definition of ProducerRecord.java:
public class ProducerRecord<K, V> {
    private final String topic;
    private final Integer partition;
    private final Headers headers;
    private final K key;
    private final V value;
    private final Long timestamp;
}
Here, the field that we have to understand from the class is partition.
From the ProducerRecord docs,
If a valid partition number is specified, that partition will be used when sending the record.
If no partition is specified but a key is present a partition will be chosen using a hash of the key.
If neither key nor partition is present a partition will be assigned in a round-robin fashion.
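The three rules quoted above can be sketched as a small decision function. This mirrors the documented rules only, not the real producer code: `PartitionChooser` is a hypothetical name, `zlib.crc32` is a stand-in for the client's actual key hash, and the keyless behaviour of real clients varies by version.

```python
import itertools
import zlib


class PartitionChooser:
    """Mirrors the three documented rules; not the real producer code."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def choose(self, key=None, partition=None):
        if partition is not None:
            # Rule 1: an explicitly specified, valid partition wins.
            if not 0 <= partition < self.num_partitions:
                raise ValueError(f"invalid partition {partition}")
            return partition
        if key is not None:
            # Rule 2: hash of the key (crc32 is a stand-in hash).
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        # Rule 3: neither key nor partition, so rotate through partitions.
        return next(self._round_robin)


chooser = PartitionChooser(3)
explicit = chooser.choose(key="order-1", partition=2)
keyed = [chooser.choose(key="order-1") for _ in range(3)]
keyless = [chooser.choose() for _ in range(4)]

print(explicit)  # 2
print(keyless)   # [0, 1, 2, 0]
```

Note that the explicit partition takes precedence even when a key is present, and that repeated sends with the same key always yield the same partition.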

Partitions increase the parallelism of a Kafka topic. Any number of consumers/producers can use the same partition; it's up to the application layer to define the protocol. Kafka guarantees delivery. Regarding the API, you may want to look at the Java docs, as they may be more complete. Based on my experience:
1) Partitions start from 0.
2) Keys may be used to send messages to the same partition, for example via hash(key) % num_partitions. The logic is pluggable into the producer: https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/Partitioner.html
3) Yes, but be careful not to end up with some key that results in a "dedicated" partition. For that, you may want to have a dedicated topic, for example a control topic and a data topic.
4) This seems to be the same question as 3.
5) I believe consumers should not make assumptions about the data based on the partition. The typical approach is to have a consumer group that can read from multiple partitions of a topic. If you want dedicated channels, it is better (safer and more maintainable) to use separate topics.
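For completeness, this is roughly what the routing asked about in the question (one message type on its own partition, the rest spread over the others) could look like as pluggable partitioning logic. It is a hypothetical sketch: `dedicated_partition`, the "control"/"data" type names and the `zlib.crc32` stand-in hash are assumptions, and the answer above recommends separate topics instead.

```python
import zlib


def dedicated_partition(message_type, key, num_partitions, dedicated_type="control"):
    """Send one message type to partition 0 and hash the rest over 1..n-1."""
    if num_partitions < 2:
        raise ValueError("need at least 2 partitions")
    if message_type == dedicated_type:
        return 0
    # Stand-in hash; spreads the remaining traffic over partitions 1..n-1.
    return 1 + zlib.crc32(key.encode("utf-8")) % (num_partitions - 1)


control = dedicated_partition("control", "any-key", 3)
routes = [dedicated_partition("data", f"k{i}", 3) for i in range(100)]

print(control)              # 0
print(sorted(set(routes)))  # only partitions from {1, 2}
```

The drawback is that every consumer then has to know this convention, which is the kind of coupling that separate topics avoid.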

Multiple topics on a single partition?

I was just curious and could not find any info on this. My question is can there be multiple topics on a single partition? If yes, how will they be produced in that partition or consumed by a consumer later? Or is it that one partition always holds one topic?

A Kafka partition belongs to only one topic. A topic is a higher-level construct that is broken into partitions, so a single partition is guaranteed never to belong to more than one topic.

In Kafka, one partition always holds data related to one topic. Having data of multiple topics in one partition is a rather unusual use case. If I understood your use case correctly and you want to store multiple datasets in one topic and one partition (which is not recommended, though), then you can add a flag field to the input data that indicates which dataset a record belongs to.
Hope This Helps!
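A minimal sketch of the flag-field idea, assuming records are dictionaries and the flag is a hypothetical "dataset" field (the field name and sample records are made up for illustration):

```python
from collections import defaultdict

# One log holding records from two datasets, told apart by a "dataset" flag.
log = [
    {"dataset": "orders", "value": 1},
    {"dataset": "users", "value": "alice"},
    {"dataset": "orders", "value": 2},
]


def split_by_dataset(records):
    """Group record values by their dataset flag, keeping log order."""
    groups = defaultdict(list)
    for record in records:
        groups[record["dataset"]].append(record["value"])
    return dict(groups)


by_dataset = split_by_dataset(log)
print(by_dataset)  # {'orders': [1, 2], 'users': ['alice']}
```

Every consumer still reads all records and discards the ones it does not need, which is part of why this layout is not recommended.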

Kafka distributing messages from a partition among consumers

I have a Kafka topic which currently has 3 partitions. I want my consumers to read from the same partition but each message should go to a different consumer in a round-robin fashion. Is it possible to achieve this?

In order to do that, you have to use a consumer group, which is provided out of the box with Kafka. You just have to specify the same group.id for your three consumers.
[edit] However, each consumer will then read from a different Kafka partition. I think that making different consumers of the same group read from the same partition is not possible if you're using only the Kafka API.
See more in the documentation : http://kafka.apache.org/documentation.html#intro_consumers

How about this: at the producer, the messages are routed based on some key. It is possible to route message 1 to partition 1, message 2 to partition 2, and message 3 to partition 3. Then you put three consumers into one group, so that consumer 1 consumes partition 1, consumer 2 consumes partition 2, and consumer 3 consumes partition 3.
By the way, how to implement it depends on which Kafka client you are using and what the messages are. You should give more details.

What you are describing defeats the purpose of partitions. Partitions are not designed for simple load balancing in Kafka. If you really want that, you have two options.
If you have control over the producer producing to the topic, do a simple mod 3 hash partitioning, so that the messages are distributed equally over the 3 partitions. Each of your consumers will then consume from one partition, which effectively means that each consumer reads every third message. That solves your problem.
If you cannot control the producer, consume from the topic in the normal way, then write a producer that applies simple mod 3 hash partitioning and produces the messages to a new topic. Consume from that new topic, and the rest is the same as in the first case.
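The mod 3 idea can be sketched as follows, assuming each message carries a running sequence number that the producer can use for partitioning (the sequence numbers and message names are made up for illustration):

```python
NUM_PARTITIONS = 3


def partition_for(sequence_number):
    """Mod-3 partitioning: message i goes to partition i % 3."""
    return sequence_number % NUM_PARTITIONS


partitions = {p: [] for p in range(NUM_PARTITIONS)}
for i in range(9):
    partitions[partition_for(i)].append(f"msg-{i}")

# With one consumer per partition, consumer p sees every third message.
for p, messages in partitions.items():
    print(f"consumer-{p}: {messages}")
```

Each consumer still works through its own partition at its own pace, so the messages are distributed round-robin but not necessarily processed in strict round-robin order in time.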
