Kafka one consumer with two different checkpoints - apache-kafka

I have a Kafka consumer project which consumes data from a specific Kafka topic. 90% of the records are processed as soon as I receive them, but the remaining 10% need to be processed with a delay.
Because these records have to be delayed, I can't commit their offsets, and that may cause Kafka to reassign the partitions to other nodes. To avoid that, I could read the same topic twice and delay the fetching in the second consumer, but that requires deserializing everything twice, so it comes with an overhead.
Is it possible to read the records using a single consumer but have two separate commits? It would basically be similar to having two different consumers in terms of commits: consumer.poll would be called on a single consumer, but there would be two consumer.commitSync calls for each batch. That would help me avoid the extra deserialization and also the network cost.
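Roughly, what I have in mind is something like this (a sketch using the plain Java KafkaConsumer; the topic name, the isDelayed check, and the process method are placeholders for my own logic):

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.*;

public class SplitCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "split-commit-demo");   // placeholder group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");     // commits are manual
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));    // placeholder topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));

                Map<TopicPartition, OffsetAndMetadata> fastOffsets = new HashMap<>();
                List<ConsumerRecord<String, String>> delayed = new ArrayList<>();

                for (ConsumerRecord<String, String> record : records) {
                    if (isDelayed(record)) {       // placeholder: the 10% that must wait
                        delayed.add(record);       // keep for later processing
                    } else {
                        process(record);           // placeholder: the 90% fast path
                        fastOffsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                    }
                }

                // First commit: only the offsets of the records processed immediately.
                // Note: an offset is a per-partition position, so committing past a
                // delayed record means it would not be redelivered after a restart.
                if (!fastOffsets.isEmpty()) {
                    consumer.commitSync(fastOffsets);
                }

                // ... later, after the delayed records are processed, a second
                // commitSync with their offsets would follow.
            }
        }
    }

    private static boolean isDelayed(ConsumerRecord<String, String> r) { return false; }
    private static void process(ConsumerRecord<String, String> r) { }
}
```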

Below are the things you can do to achieve the above-mentioned task.
Create a pipeline with two topics (T1, T2): push the 90% of messages that can be processed immediately to topic T1 and the remaining 10% to topic T2 (see the sketch below).
Make your Kafka consumer configurable, i.e. you can easily pass the polling interval, batch size, and batch timeout whenever you start your consumer.
Find a trigger, or if your second topic's consumption is time-based, schedule a cron job that starts and stops your consumer of topic T2 when it is required.
Regarding consumer groups, you can place both of your topics in the same group or in different ones. It's completely your choice.
This way you keep the topics clean, and every time you need to process the messages you can do it easily, since the pipeline only has to be set up once.
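For illustration, a minimal sketch of such a splitting pipeline using the plain Java consumer and producer APIs (topic names, bootstrap servers, and the needsDelay check are placeholders, not part of the original answer):

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;

import java.time.Duration;
import java.util.*;

public class SplitTopicPipeline {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");
        cProps.put("group.id", "splitter");
        cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(Collections.singletonList("source-topic"));   // placeholder
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    // route the ~10% of delayed records to T2, everything else to T1
                    String target = needsDelay(rec) ? "T2" : "T1";
                    producer.send(new ProducerRecord<>(target, rec.key(), rec.value()));
                }
                consumer.commitSync();
            }
        }
    }

    private static boolean needsDelay(ConsumerRecord<String, String> r) { return false; } // placeholder
}
```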

Related

Is there a limit on consumers and consumer groups?

Is there any limit on the number of consumers or consumer groups in Kafka?
I am planning to push 200 MB of data every 10 mins to a topic and have 200+ distinct consumers listen and consume from this topic. Is there any other recommended way to do this?
As Rohit's answer states, there's no such limit.
Regarding your issue, it seems like you want to achieve some kind of parallelization of consumption. If you launch 200 consumers with 200 different consumer groups, each consumer will read all the data independently, so you'll have 200 threads reading the same 200 MB every 10 minutes (200 x 200 MB = 40 GB received every 10 minutes). I guess you wanted every consumer to read 1 MB every 10 minutes with your approach, but that's not how it works.
If the logic implemented by each consumer is the same, you shouldn't declare more than one consumer group. If you declare two consumer groups, each one will read the same data and you'll just repeat the job done, duplicating the output. Set different consumer groups if the job to be done on the topic's records is different: for example, one consumer group must store the records into a database, while the other consumer group must visualize the data in Grafana. Those are two different processing mechanisms, so each one must read all the data on its own. This is not the only reason to declare different consumer groups, just one example; there are multiple justifications for declaring more than one consumer group for a topic.
Imagine a scenario where the only job to be done is storing the messages into a database. If you declare two consumer groups and launch your consumers, what you'll get is duplicate values stored in your database, as the first consumer group is doing exactly the same work as the second. Not only are you re-reading from Kafka, you are re-storing the same messages into the database.
In order to launch multiple consumers that efficiently share the work (so that, for example, launching 4 consumers means each one reads 50 MB), you must partition your topic.
Only one consumer thread from the same consumer group can read from a specific partition. If you have 4 partitions in that topic and 4 consumer threads that share the same consumer group, launching them will lead to each thread reading from one partition. If you launch two consumers, each will be assigned 2 partitions.
In this scenario, you do have a limit on the number of consumers reading concurrently if they share the same consumer group: the number of partitions of that topic. If you launch a 5th consumer thread, it will block/wait, because it wasn't assigned any partition. In the example, consumer 5 waits until a partition becomes available for it (so it may wait forever).
What I suggest is: decide how many consumer threads you'll need to consume the data and partition the topic based on that. If you, for example, partition the topic into 8 partitions, you'll be able to launch 8 consumers from the same consumer group. Each one will then read, more or less (depending on the producer's partitioner), 25 MB (200/8) of the incoming data, efficiently sharing the workload: each consumer will read from its own partition.
If you launch 200 consumers with 200 different consumer groups,
you'll just multiply the work to be done by 200, as every single consumer will read the data from start to end.
If you launch 200 consumers with the same consumer group and the topic has a single partition,
you'll have one thread doing all the work and 199 idle consumers.
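To make the partition-based scaling concrete, here is a minimal, hedged sketch of a worker that relies on a shared group.id; launching 8 copies of it against an 8-partition topic gives each copy roughly one partition (topic name and servers are placeholders):

```java
import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.*;

// Every copy of this process uses the SAME group.id, so Kafka splits the
// topic's partitions among the running copies instead of duplicating work.
public class SharedGroupWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ingest-workers");   // identical for all 8 workers
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));   // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    // each worker only sees records from the partitions assigned to it
                    System.out.printf("partition=%d offset=%d%n", rec.partition(), rec.offset());
                }
            }
        }
    }
}
```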
In Kafka, there is no limit on the number of Consumer groups for a particular topic. However, the increase in consumer groups increases network utilization.
Worth noting that newer versions of Kafka store consumer offsets in an internal Kafka topic called __consumer_offsets.

Kafka message partitioning by key

We have a business process/workflow that starts when the initial event message is received and closes when the last message is processed. We have up to 100,000 processes executed each day. My problem is that the messages belonging to a specific process have to be processed in the order they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes continue. For this kind of situation I am thinking of using Kafka. The first solution that came to my mind was to use topic partitioning by message key, with the ProcessId as the key. This way I could be sure that all of a process's messages would be partitioned together and Kafka would guarantee their order. As I am new to Kafka, what I managed to figure out is that partitions have to be created in advance, and that makes everything too difficult. So my questions are:
1) When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
2) There can be more than 100,000 active partitions on the topic, is that a problem?
3) Can a partition be deleted after all messages from that topic were read?
4) Maybe you can suggest other approaches to my problem?
When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
You need to specify the number of partitions when creating a topic. New partitions won't be created automatically (as is the case with topic creation); you have to change the number of partitions using the topic tool.
More Info: https://kafka.apache.org/documentation/#basic_ops_modify_topi
As soon as you increase the number of partitions, the producer and consumers will be notified of the new partitions, leading them to rebalance. Once rebalanced, the producer and consumers will start producing to and consuming from the new partitions.
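For reference, a hedged sketch of increasing a topic's partition count programmatically with the Java AdminClient (topic name, partition count, and servers are placeholders); this is the programmatic equivalent of the topic tool command in the linked docs:

```java
import org.apache.kafka.clients.admin.*;

import java.util.*;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // grow "my-topic" (placeholder) to 8 partitions;
            // the count can only be increased, never decreased
            admin.createPartitions(Collections.singletonMap("my-topic", NewPartitions.increaseTo(8)))
                 .all()
                 .get();
        }
    }
}
```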
There can be more than 100,000 active partitions on the topic, is that a problem?
Yes, having this many partitions will increase overall latency.
Go through how-choose-number-topics-partitions-kafka-cluster on how to decide the number of partitions.
Can a partition be deleted after all messages from that topic were read?
Deleting a partition would lead to data loss, and the remaining data's keys would no longer be distributed correctly, so new messages would not get directed to the same partitions as old existing messages with the same key. That's why Kafka does not support decreasing the partition count of a topic.
Also, the Kafka documentation states that
Kafka does not currently support reducing the number of partitions for a topic.
I suppose you chose the wrong feature to solve your task.
In general, partitioning is used for load balancing.
Incoming messages will be distributed over the given number of partitions according to the partitioning strategy, which is configured on the producer. In short, the default strategy just calculates i = key_hash mod number_of_partitions and puts the message into the i-th partition. You can read more about the strategies here.
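As a rough illustration of that formula (note: this is a simplification; the real DefaultPartitioner hashes the serialized key with murmur2 rather than hashCode):

```java
// Simplified illustration of how a keyed record is mapped to a partition:
// the same key always yields the same partition, so per-key ordering is preserved.
public class PartitionForKey {
    static int partitionFor(String key, int numPartitions) {
        // mask the sign bit so the result is non-negative, then take the modulo
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        System.out.println(partitionFor("process-42", numPartitions)); // always the same partition
        System.out.println(partitionFor("process-43", numPartitions)); // may land elsewhere
    }
}
```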
Message ordering is guaranteed only within a partition. With two messages from different partitions, you have no guarantee which one reaches the consumer first.
Probably you should use consumer groups instead. It's an option on the consumer.
Each group consumes all messages from the topic independently.
A group could consist of one consumer or more if you need it.
You could assign many groups and add a new group (in fact, add a new consumer with a new groupId) dynamically.
As you can stop/pause any consumer, you could manually stop all consumers belonging to a specified group. I suppose there is no single command to do that, but I'm not sure. Anyway, if you have a single consumer in each group, you can stop it easily.
If you want to remove a group, you just shut down and drop the related consumers. No action on the broker side is needed.
As a drawback, you'll get 100,000 consumers reading a (single) topic. That's a heavy network load, at the least.
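A hedged sketch of that group-per-process idea, assuming the group.id is derived from the business ProcessId and each group filters the topic for its own key (all names here are placeholders, not from the original answer):

```java
import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.*;

// One consumer group per workflow process: each group independently reads the
// whole topic and can be started or stopped without affecting the others.
public class PerProcessGroupConsumer {
    public static void main(String[] args) {
        String processId = args.length > 0 ? args[0] : "process-42";   // placeholder id

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "workflow-" + processId);                // new group per process
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("workflow-events"));   // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    if (processId.equals(rec.key())) {
                        // handle only this process's events, in arrival order
                    }
                }
            }
        }
    }
}
```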

Kafka repartitioning

From my understanding, partitions and consumers are tied together in a 1:1 relationship in which a single consumer processes a partition. However, is there a way to repartition in the middle of processing?
We are currently trying to optimize a process in which the topic gets consumed across a group, but there are cases in which the data processing needs to take longer on a certain consumer while others are already idle. It's like data cleansing, where a certain partition might no longer need cleansing while others require fuzzy matching, thereby adding complexity to the task a consumer performs.
Your understanding with regards to partitions and consumers is not quite right.
If you have N partitions, then you can have up to N consumers within the same consumer group, each of which reads from a single partition. When you have fewer consumers than partitions, some of the consumers will read from more than one partition. Also, if you have more consumers than partitions, some of the consumers will be inactive and will receive no messages at all.
If you have one consumer per partition, some of the partitions might receive more messages, which is why some of your consumers might be idle while others are still processing messages. Note that messages are not always inserted into topic partitions in a round-robin fashion, as messages with the same key are placed into the same partition.
In Kafka, topics are partitioned, and even though you can add partitions to a topic, there is no repartitioning: all the data already written to a partition stays there; new data will be distributed among the existing partitions (in a round-robin fashion if you do not define keys, otherwise one key will always land in the same partition, as long as you do not add partitions).
But if you have a consumer group, and you add or remove consumers from this group, there is a group rebalance where each consumer receives its share of partitions to consume from exclusively.
So if you have 3 partitions (with messages evenly distributed among them) and 2 consumers (in the same group), one consumer will have twice as many messages to handle as the other; with 3 consumers, each one will consume one partition; with 4 consumers, one will stay idle...
So as you already have evenly distributed messages (which is good), you should have as many consumers as you have partitions, and if that is still not fast enough you may add n partitions and n consumers. (For sure you could also try to optimize the consumer, but that is another story...)
Added to answer a comment:
Once a consumer -- from a given group -- is consuming a partition, it will continue to do so and will be the only one from the group consuming this partition, even if a lot of other consumers from the same group are idle. In one group a partition is never shared between consumers. (If the consumer crashes, another one will continue the work, and if a new consumer enters the group a rebalance will occur, but anyway only one consumer will work on one partition at a given time).
So one approach, as said in your comment, would be to distribute the load evenly over the partitions. Another approach would be to have a topic dedicated to expensive jobs, with a lot of partitions and a lot of consumers, and let the topic for non-expensive jobs have fewer consumers.
The last approach, which I would not recommend, would be to not use the consumer group features and to manage yourself how you consume from Kafka, using the assign and seek methods of the consumer. (See the KafkaConsumer JavaDoc for more information.) Spark Structured Streaming, for example, uses that approach, but it is much more complex...
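For completeness, a hedged sketch of that assign/seek approach with the plain Java KafkaConsumer (topic, partition number, and starting offset are placeholders):

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.*;

// No consumer group coordination is used here: the application itself decides
// which partition to read and from which offset.
public class ManualAssignConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // note: no group.id is required when partitions are assigned manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("jobs", 0);   // pick the partition yourself
            consumer.assign(Collections.singletonList(tp));      // bypasses group rebalancing
            consumer.seek(tp, 1234L);                            // start from a chosen offset

            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    // process records; progress must be tracked/stored by the application
                }
            }
        }
    }
}
```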

Parallel Producing and Consuming in Kafka

1. Consuming concurrently on the same topic and same partition
Suppose I have 100 partitions for a given topic (e.g. Purchases); I can easily consume these 100 partitions (e.g. Electronics, Clothing, etc.) in parallel using a consumer group with 100 consumers in it.
However, that assigns one consumer to each subset of the total data on Purchases. What if I just want to consume one subset of the data with 100 consumers concurrently? For example, all of my consumers just want to read the Electronics partition of the Purchases topic.
Is there a way they can consume this partition concurrently?
In general I just want all my consumers to receive the same data set concurrently.
From the information I've gathered, it seems to me that consumers CANNOT consume from replicas: Consuming from a replica
Can I produce the same data to multiple topics, like Purchase-1[Electronics] and Purchase-2[Electronics] so then I can consume them concurrently? Is this a recommended approach?
2. Producing concurrently on the same topic and same partition
When multiple producers are producing to the same topic and same partition, since we can only write to the partition leader and replicas are only there for fault-tolerance, does this mean there isn't any concurrency? (i.e. each commit must wait in line.)
If those 100 consumers belong to different consumer groups, they can consume from the same topic and partition simultaneously. In that case, you need to make sure each consumer is able to handle the load from the 100 partitions.
Producers can produce to the same topic partition at the same time, but the actual order of messages written to the partition is determined by the partition leader.
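To make the first point concrete, a hedged sketch of the fan-out setup: each of the 100 consumer processes runs this code with a unique group.id, so each receives the full stream independently (servers are placeholders; the topic name is taken from the question):

```java
import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.*;

// One consumer group per consumer instance: every instance reads everything.
public class FanOutConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "purchases-reader-" + UUID.randomUUID());   // unique per instance
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("Purchases"));   // topic from the question
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    // every instance sees every record; filter here if only one
                    // partition (e.g. the "Electronics" one) is of interest
                }
            }
        }
    }
}
```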
If you want to consume from a single partition in parallel, use something like Parallel Consumer (PC).
By using PC, you can process all your keys in parallel, regardless of how long it takes, and you can be as concurrent as you wish.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).

Kafka - Multiple consumers (only one active) on same group/topic

Is it possible to have multiple copies of an application listen to the same Kafka group/topic so that only one is reading it at a time, but the other ones will start working if the main one crashes/stops reading?
I need to make an application highly available but can't tolerate doubling the traffic to the data store on the other end of the application by having multiple copies actively running.
FYI - Technically I'm using MapR streams but it adheres to the Kafka API and functionality, in case anyone knows a MapR stream-specific feature that helps the situation.
It is possible. If multiple consumers are in the same consumer group, when the group subscribes to a topic, Kafka will do the partition assignment for your consumers: a partition can be consumed by only one consumer in the same group.
So you could set your topic to have only one partition; then only one consumer will consume messages while the others stay idle. Once that consumer shuts down, it triggers a group rebalance: Kafka does the partition assignment again, and in your case a new consumer will take over the work. It will process messages from the last offset committed by the old consumer.
And if your case supports parallel processing, you could run many processes (apps) doing the same work and give the topic multiple partitions. They will be assigned different partitions and process different messages, which speeds up your processing and also tolerates failover. As said above, if a consumer fails, Kafka will take care of it for you: it will assign its partitions to another working consumer. So everything will be OK.
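As a closing illustration, a hedged sketch of the active/standby setup described above, using the plain Kafka Java client (which MapR Streams claims API compatibility with); topic name, group id, and servers are placeholders:

```java
import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.*;

// The topic has a single partition and every application copy joins the SAME
// consumer group, so only one copy is assigned the partition at a time; the
// others stay idle until a rebalance promotes one of them.
public class ActiveStandbyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ha-app");          // identical in every copy of the app
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("single-partition-topic"));  // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    // only the currently active copy reaches this point
                }
                consumer.commitSync();   // a standby copy resumes from the last committed offset
            }
        }
    }
}
```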