I have a single Kafka client instance that consumes from 200 partitions. For debugging and sampling purposes, I now want it to consume from only the first 3 partitions.
Is there a way to do that?
Alternatively, I could consume from all partitions and drop messages that do not come from the first 3. Is there a way to find out which partition a message came from?
You can use KafkaConsumer.assign(java.util.Collection<TopicPartition> partitions) to assign a specific set of partitions to the consumer. To find out which partition a message came from, use ConsumerRecord.partition().
If you want to consume from only some of the partitions, you can also
implement org.apache.kafka.clients.consumer.internals.PartitionAssignor.
Apache Kafka's own tests already do this with MockPartitionAssignor (which extends PartitionAssignor).
Implement PartitionAssignor and register it through the "partition.assignment.strategy" consumer setting.
Reference: https://kafka.apache.org/documentation/#newconsumerconfigs
Since you haven't specified which consumer API you use, I am going to give an example based on the Python kafka-python library.
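from kafka import KafkaConsumer

consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['host:9092'])
for message in consumer:
    # keep only messages from the first three partitions
    if message.partition in [0, 1, 2]:
        print(message.value)  # do something with message.value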
If you really want to read from only a subset of partitions, something like
from kafka import KafkaConsumer, TopicPartition
consumer = KafkaConsumer(bootstrap_servers=['host:9092'])  # no topic here: assign() cannot be combined with a subscription
consumer.assign([TopicPartition('foobar', 2)])
next_message = next(consumer)
would work for you. But I would suggest the first approach. Whichever language you develop in, Kafka client libraries generally provide equivalents of the features above.
Related
Is there any way, apart from lag, to recognize the last Kafka message of every partition in a multi-partition topic with multiple consumers?
I know one way to identify the last message, which is through the AdminClient API or the Kafka consumer API, but I need a different method.
The only ways are to query the end offset of the partitions using admin/consumer, as you've found.
You could install Confluent REST Proxy and use HTTP calls, or use shell scripts and parse their output, but these are just abstractions over the same methods.
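Whichever client supplies the end offsets, the comparison itself is small. Here is a minimal sketch, assuming the end offsets are already available as a mapping from (topic, partition) to the offset the next produced message will receive (kafka-python's KafkaConsumer.end_offsets() returns a similar mapping keyed by TopicPartition). The helper name is_last_message and the sample values are hypothetical:

```python
def is_last_message(end_offsets, topic, partition, offset):
    """Return True if `offset` is the last record currently in the partition.

    `end_offsets` maps (topic, partition) to the offset the next produced
    record will get, so the last existing record sits at end offset - 1.
    """
    return offset == end_offsets[(topic, partition)] - 1


# Hypothetical snapshot of end offsets for a two-partition topic.
end_offsets = {("my-topic", 0): 42, ("my-topic", 1): 7}

print(is_last_message(end_offsets, "my-topic", 0, 41))
print(is_last_message(end_offsets, "my-topic", 1, 3))
```

Keep in mind that such a snapshot goes stale as soon as a producer writes again, so "last" only means "last at the time the end offsets were queried".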
=== Assume everything is from the consumer's point of view ===
I was reading a couple of Kafka articles and saw that the number of partitions is coupled to the number of microservice instances. For example, say I have 1 topic with 1 partition for my serviceA. The producer pushes messages to topic T1, partition P1, and on the consumer side (ServiceA1) I can read from T1, P1. If I spin up a new pod (ServiceA2) for higher throughput, the second instance will never receive any messages, because Kafka/ZooKeeper assigns an ID to each consumer and partition 1 is already taken by ServiceA1. So ServiceA2 and any further instances stay idle. To avoid this, Kafka recommends adding more partitions so that the number of consumers can be increased or decreased as needed.
I was also able to test this from the command line, and service2 never consumed any messages. If I shut down service1, service2 was able to pick up new messages. So if I spin up more pods, fail-safety/availability increases but throughput always stays the same.
Is my assumption correct? Am I missing anything? Now I feel like any standard messaging system will have the same problem. How do you scale a message-oriented system itself?
Every topic has at least one partition; if you don't define the partition count, it comes with only one partition by default. In your case, you have a consumer group that consists of two consumers. Within a group, each partition is read by only one consumer at a time. So the first consumer reads from the first (and only) partition, and there is no partition left for the second consumer to read from, so it sits idle. Only once the first consumer goes down does the second consumer start reading from that partition, beginning at the last committed offset.
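The behaviour described above can be illustrated with a toy model. This is only a sketch of the rule "within a group, a partition is owned by one consumer at a time", not Kafka's actual assignor; the function name assign_partitions is hypothetical and the consumer names are taken from the question:

```python
def assign_partitions(partitions, consumers):
    """Toy round-robin assignment: each partition goes to exactly one consumer."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# One partition, two consumers: the second consumer gets nothing.
one_partition = assign_partitions([0], ["serviceA1", "serviceA2"])
# Three partitions, two consumers: both consumers have work.
three_partitions = assign_partitions([0, 1, 2], ["serviceA1", "serviceA2"])
# serviceA1 has gone down: serviceA2 takes over the only partition.
after_failure = assign_partitions([0], ["serviceA2"])

print(one_partition)
print(three_partitions)
print(after_failure)
```

With a single partition, extra consumers only add failover capacity; throughput grows only once there are enough partitions for every consumer to own at least one.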
Please check the blogs and video below. They explain topics, consumers, and consumer groups in Kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this gives you an idea of how consumers and consumer groups work.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) from processing it (interpreting the message). If the consumption is simple enough, being limited to no more consuming instances than there are partitions need not constrain throughput.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
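As a rough in-process illustration of that split, the sketch below uses a thread pool as a stand-in for the separate processing service (the answer suggests HTTP calls to another service; the pool, the process function, and the sample messages are assumptions made only to keep the example self-contained):

```python
from concurrent.futures import ThreadPoolExecutor


def process(message):
    """Stand-in for the processing service; here it just upper-cases the payload."""
    return message.upper()


def consume_and_dispatch(messages, worker_count):
    """Hand each consumed message to a pool of workers and collect the results."""
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # map() returns results in input order even though the work itself
        # may run concurrently on different workers.
        return list(pool.map(process, messages))


# Hypothetical messages read from a single partition by a single consumer.
messages = ["created", "updated", "deleted"]
results = consume_and_dispatch(messages, worker_count=8)
print(results)
```

The number of workers is independent of the number of partitions, but the workers may finish in any order, so this only suits messages whose processing does not depend on one another.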
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
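To see why changing the partition count disturbs key-based routing, here is a small sketch. It uses crc32 purely as a stable stand-in hash from the standard library (Kafka clients use their own hash function), and the key names are made up:

```python
import zlib


def partition_for(key, partition_count):
    """Map a key to a partition with a stable hash (crc32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


# Hypothetical record keys.
keys = [f"record-{n}" for n in range(1000)]

# Keys whose partition changes when the topic grows from 3 to 4 partitions.
moved = [k for k in keys if partition_for(k, 3) != partition_for(k, 4)]
print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Any key in moved would have its newer events land in a different partition from its older ones, which is exactly the ordering problem described above.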
We just started replacing our messaging system with Kafka.
In a traditional MQ setup there is a cluster with one or more MQs inside it.
The MQ cluster/coordinator service delivers the messages to the clients.
There can be 10 services/clients consuming messages from a single MQ.
So if there are 10 messages in the MQ, each service/consumer/client can read and process 1 message.
As I now understand it, this is not possible in Kafka by design.
To achieve similar functionality in Kafka, I have to add at least as many partitions as there are clients/consumers/pods.
I have been studying Apache Kafka for a while now.
Let's consider the following example.
Say I have a topic with 3 partitions, a single producer, and a single consumer. I am producing my messages without specifying the key attribute.
So I know that on the producer side, when I publish a message, the strategy Kafka uses to assign it to one of those partitions is round-robin.
Now, what I want to know is: when I start a single consumer belonging to a certain consumer group and listening to that same topic, what strategy will it use to pull messages from the different partitions (as there are 3)?
Would it follow a similar round-robin model, where it sends a fetch request to the leader of partition 1, waits for the response, gets the response, and returns the records for processing, then sends a fetch request to the leader of partition 2, and so on?
If it follows some other strategy/algorithm, I would love to know what it is.
Thank you in advance.
There is no ordering guarantee outside of a partition, so in a way the algorithm used is moot to the end user and subject to change.
Today, nothing terribly complex happens in this case. The protocol shows you that a fetch request includes a partition, so you get a fetch per partition. That means the order depends on the consumer. A partition won't be starved, because fetch requests will happen for all partitions assigned to the consumer.
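What a consumer can rely on is ordering within each partition, not the interleaving between partitions. A minimal sketch of that check, using made-up (partition, offset) pairs rather than real consumer records (the function name ordered_within_partitions is hypothetical):

```python
from collections import defaultdict


def ordered_within_partitions(records):
    """Check that offsets increase within each partition, ignoring interleaving."""
    last_seen = defaultdict(lambda: -1)
    for partition, offset in records:
        if offset <= last_seen[partition]:
            return False
        last_seen[partition] = offset
    return True


# Hypothetical (partition, offset) pairs in the order a consumer received them.
received = [(0, 0), (2, 0), (0, 1), (1, 0), (2, 1), (1, 1)]
print(ordered_within_partitions(received))
```

The partitions are interleaved arbitrarily in this sample, yet the check passes because each partition's offsets still arrive in increasing order.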
I know Storm doesn't guarantee total ordering for Kafka topics, but I see in many documents that Storm guarantees consuming/processing messages in order at the partition level.
I am looking for a sample Storm topology that consumes/processes the messages of a Kafka topic while maintaining message order at the Kafka partition level. NOT total order, ONLY a partition-level ordering guarantee.
Please share if you know of any sample application. Thanks a lot!
Have you looked at Apache Storm examples here? https://github.com/apache/storm/tree/master/external/storm-kafka
You may want to start from the standard example and scale it based on your needs. Also, while defining the Scheme for the KafkaSpout, you may want to output some key as part of the tuple and later use a fields grouping on it.
I have a Kafka topic which currently has 3 partitions. I want my consumers to read from the same partition but each message should go to a different consumer in a round-robin fashion. Is it possible to achieve this?
In order to do that, you have to use a consumer group. It's provided out of the box with Kafka. You just have to specify the same group.id for your three consumers.
[edit] But each consumer will read from a different Kafka partition. I don't think it is possible to make different consumers of the same group read from the same partition if you're using only the Kafka API.
See more in the documentation: http://kafka.apache.org/documentation.html#intro_consumers
How about this: at the producer, the messages are routed based on some key. It is possible to route message 1 to partition 1, message 2 to partition 2, and message 3 to partition 3. Then you put three consumers in one group, so that consumer 1 consumes partition 1, consumer 2 consumes partition 2, and consumer 3 consumes partition 3.
By the way, how to implement this depends on which Kafka client you are using and what the messages are. You should give more details.
What you are describing defeats the purpose of partitions. Partitions are not designed for simple load balancing in Kafka. If you really want that, you have two options.
If you have control over the producer writing to the topic, do a simple mod 3 hash partitioning, so that the messages are distributed equally across the 3 partitions. Each of your consumers then consumes from one partition, which effectively means each consumer reads every third message. That solves your problem.
If you cannot control the producer, consume from the topic in the normal way, then write a producer that applies simple mod 3 hash partitioning and produces to a new topic. Consume from that new topic, and the situation is the same as in the first case.
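The mod 3 idea can be sketched without any Kafka client. This assumes each message carries an increasing sequence number (an assumption for illustration; the name partition_for_sequence is hypothetical) and that consumer N reads partition N:

```python
def partition_for_sequence(sequence_number, partition_count=3):
    """Pick a partition by taking the sequence number modulo the partition count."""
    return sequence_number % partition_count


# Nine hypothetical messages, identified only by their sequence numbers.
messages = list(range(9))

# Partition N is read by consumer N, so group the messages per partition.
per_consumer = {partition: [] for partition in range(3)}
for seq in messages:
    per_consumer[partition_for_sequence(seq)].append(seq)

print(per_consumer)
```

Each consumer ends up with every third message, which gives the round-robin effect asked for, at the cost of tying the scheme to exactly three partitions and three consumers.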