Is there any way to recognize the last Kafka message of every partition in a multi-partition topic with multiple consumers, apart from lag? - scala

Is there any way to recognize the last Kafka message of every partition in a multi-partition topic with multiple consumers, apart from lag?
I know one way to identify the last message, which is through the AdminClient API / Kafka consumer API, but I need to use a different method.

The only ways are to query the end offsets of the partitions using the admin or consumer client, as you've found.
You could install Confluent REST Proxy and use HTTP calls, or use the shell scripts and parse their output, but those are just abstractions over the same methods.
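For reference, here is a minimal sketch of the consumer-based approach in Scala (Kafka's Java client, Scala 2.13+ for scala.jdk.CollectionConverters); the broker address, group id, and topic name are placeholders. endOffsets() returns, per partition, the offset the next produced record would get, so a polled record is currently the last one in its partition when its offset equals that end offset minus one (new records can of course arrive later).

import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object LastMessageCheck extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // placeholder broker
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "end-offset-check")          // placeholder group id
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(java.util.Collections.singletonList("my-topic"))    // placeholder topic

  val records = consumer.poll(java.time.Duration.ofSeconds(1))
  // Ask the broker for the end offset of every partition we just received data from.
  val partitions = records.partitions().asScala.toSet
  val endOffsets = consumer.endOffsets(partitions.asJava).asScala

  records.asScala.foreach { record =>
    val tp = new TopicPartition(record.topic(), record.partition())
    // The record is the last one currently in its partition if offset == endOffset - 1.
    val isLast = endOffsets.get(tp).exists(end => record.offset() == end.longValue() - 1)
    if (isLast) println(s"Last message (for now) on $tp at offset ${record.offset()}")
  }
  consumer.close()
}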

Related

Instruct Kafka Consumer App To Start Reading From Offset

If I have an application AppA that contains a Kafka consumer class, is it possible to control this consumer's behaviour programmatically? For example, I may want to tell AppA over a REST API (or even via another topic) to wake up and begin consuming and processing messages from TopicB at offset or timestamp X and to stop at offset or timestamp Y. I may tell it to read the same sections of a topic repeatedly to perform different analyses of the data, and I might want the consumer to sit idle when it's not performing an instruction.
Is it possible to control a consumer in this fashion? Essentially, I'm interested to know whether I can read sections of topics on demand to produce processing/reports on their contents, kind of similar to querying a relational DB via an admin console, I guess.
Thanks in advance!
The Kafka consumer is able to consume topics at arbitrary positions.
You can use the seek() method to start consuming from a specific offset. You can also use the offsetsForTimes() method to find the offsets for a specific timestamp.
You can combine these two methods to consume specific sections of topics on demand.
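As a rough sketch of how those two calls could be combined (Scala with the Kafka Java client; the topic name "TopicB", partition 0, the broker address, and the two timestamps are placeholders, and the REST/trigger plumbing around it is omitted):

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object ReplaySection extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // placeholder broker
  props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")           // replay without committing
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

  val tp = new TopicPartition("TopicB", 0)
  val consumer = new KafkaConsumer[String, String](props)
  consumer.assign(java.util.Collections.singletonList(tp))

  val startTs = 1700000000000L // "X": start timestamp, placeholder
  val endTs   = 1700003600000L // "Y": stop timestamp, placeholder

  // Translate the start timestamp into an offset and seek to it.
  val startOffsets = consumer.offsetsForTimes(Map(tp -> java.lang.Long.valueOf(startTs)).asJava).asScala
  Option(startOffsets.getOrElse(tp, null)).foreach(ot => consumer.seek(tp, ot.offset()))

  var done = false
  while (!done) {
    val records = consumer.poll(Duration.ofMillis(500)).asScala
    if (records.isEmpty) done = true                  // stop on an empty poll, for simplicity
    records.foreach { r =>
      if (r.timestamp() > endTs) done = true          // stop once past "Y"
      else println(s"offset=${r.offset()} value=${r.value()}")
    }
  }
  consumer.close()
}

The same consumer could sit idle in a loop and only run a replay like this when an instruction arrives over whatever control channel (REST endpoint, control topic) AppA exposes.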

If I use Kafka as a simple message queue, is it really worth it?

=== Assume everything from the consumer's point of view ===
I was reading a couple of Kafka articles and saw that the number of partitions is coupled to the number of micro-service instances... E.g. say I use 1 topic with 1 partition for my serviceA: the producer pushes messages to topicT1, partitionP1, and on the consumer side (ServiceA1) I can read from t1, p1. If I spin up a new pod (ServiceA2) for higher throughput, the second instance will never receive any message, because Kafka/ZooKeeper assigns an id to each consumer and partition1 is already taken by serviceA1. So serviceA2 and any further instances stay idle... To avoid this hassle, Kafka recommends adding more partitions, so that the number of consumers can be increased/decreased based on need.
I was also able to verify this through the command line: service2 never consumed any message. Only when I shut down service1 was service2 able to pick up new messages... So if I spin up more pods, fail-safety/availability increases, but throughput always stays the same...
Is my assumption correct? Am I missing anything? Now I feel like any standard messaging system would have the same problem... How do you scale a message-oriented system itself?
Every topic has at least one partition; by default a topic gets only one partition if you don't set the partition count. In your case, you have a consumer group that consists of two consumers, and each consumer reads the log from a partition. Here, the first consumer reads the log from the first (and only) partition, and for the second consumer there is no partition left to consume, so it sits idle. Only once the first consumer goes down does the second consumer start reading data from the first partition, from the last committed offset.
Please check the blogs and videos below. They explain topics, consumers, and consumer groups in Kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this gives you an idea of how consumers and consumer groups work.
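To make that concrete, here is a minimal sketch (Scala, Kafka's Java client; the broker, group id, and topic names are placeholders) of what both ServiceA1 and ServiceA2 would run. The only thing that makes them a consumer group is that they share the same group.id; Kafka then spreads the topic's partitions across the live members, and with a single partition one member owns it while the other idles until a rebalance.

import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

object ServiceAConsumer {
  // Started once per pod (ServiceA1, ServiceA2, ...); all pods use the same group.id.
  def create(): KafkaConsumer[String, String] = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "serviceA")                 // shared group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("topicT1"))    // placeholder topic
    consumer
  }
}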
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) from processing it (interpreting the message). If the consumption is simple enough, being limited to no more consuming instances than there are partitions need not be a constraint.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
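A tiny, deliberately simplified illustration of that last point (this is not Kafka's actual default partitioner, which hashes the serialized key with murmur2, but the consequence of changing the partition count is the same):

object PartitionMath extends App {
  // Key-based partitioning: same key => same partition, as long as the count is stable.
  def partitionFor(key: String, numPartitions: Int): Int =
    math.abs(key.hashCode) % numPartitions

  val key = "customer-42"
  println(partitionFor(key, 4)) // partition chosen while the topic has 4 partitions
  println(partitionFor(key, 6)) // very likely a different partition after growing to 6,
                                // so new events for this key no longer land behind the old ones
}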
We just started replacing our messaging layer with Kafka.
In a traditional MQ there is a cluster with one or more queues inside it.
The MQ cluster/coordinator service delivers the messages to clients.
There can be 10 services/clients consuming messages from a single queue.
So if there are 10 messages in the queue, each service/consumer/client can read/process 1 message.
I now understand that, by design, this is not possible in Kafka.
To achieve similar functionality in Kafka I have to add at least as many partitions as there are clients/consumers/pods, as sketched below.
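A minimal sketch of that last step, assuming the Kafka AdminClient from Scala (the topic name, partition count, and broker are placeholders; you could equally use the kafka-topics shell script):

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreatePartitionedTopic extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker

  val admin = AdminClient.create(props)
  // 10 partitions so that up to 10 consumers in one group can each own a partition,
  // roughly matching the "10 clients reading one queue" MQ setup described above.
  val topic = new NewTopic("topicT1", 10, 1.toShort) // name, partitions, replication factor
  admin.createTopics(java.util.Collections.singletonList(topic)).all().get()
  admin.close()
}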

Processing Unprocessed Records in Kafka on Recovery/Rebalance

I'm using Spring Kafka to interface with my Kafka instance. Assume that I have a single topic with, say, 2+ partitions.
In the instances where, for example, my Spring Kafka-based application crashes (or even rebalances) and then comes back online with messages waiting in the topic, I'm currently using a strategy where the latest committed offsets for each partition are stored in an external store; when a consumer is assigned a partition, I look up the stored offset and seek to it to resume processing.
(This is based on a strategy I'd read about in an O'Reilly book.)
Is there a better way of handling this situation in order to implement "exactly once" semantics and not to miss any waiting messages? Or is there a better/more idiomatic way with Spring Kafka to handle this situation?
Thanks in advance.
Is there a reason you don't checkpoint your offsets to Kafka itself?
Generally, your options for "exactly once" processing are:
1. Store your offsets and your side effects together transactionally. This is only possible if your side effects go into a transaction-capable system (say, a database).
2. Use Kafka transactions. This is a simplified variant of option 1, as long as your side effects go to the same Kafka cluster you read from.
3. Come up with a scheme that allows you to detect and disregard duplicates downstream of your Kafka pipeline (a.k.a. idempotence).
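For what it's worth, here is a rough sketch of the seek-on-assignment strategy the question describes, written against the plain Java consumer API from Scala (Spring Kafka has its own rebalance-listener hooks, but the idea is the same). OffsetStore is a hypothetical external store; under option 1 it would be updated in the same transaction as the processed results.

import java.util.{Collection => JCollection}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{Consumer, ConsumerRebalanceListener}
import org.apache.kafka.common.TopicPartition

// Hypothetical external store keyed by topic-partition, e.g. a database table
// written in the same transaction as the side effects.
trait OffsetStore {
  def lastProcessedOffset(tp: TopicPartition): Option[Long]
}

class SeekToStoredOffsets(consumer: Consumer[_, _], store: OffsetStore)
    extends ConsumerRebalanceListener {

  override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit = {
    // Offsets are persisted together with the side effects, so nothing to flush here.
  }

  override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit = {
    // On every (re)assignment, resume right after the last offset we know was fully
    // processed; partitions without a stored offset fall back to the committed position.
    partitions.asScala.foreach { tp =>
      store.lastProcessedOffset(tp).foreach(off => consumer.seek(tp, off + 1))
    }
  }
}

// Usage:
// consumer.subscribe(java.util.Collections.singletonList("my-topic"),
//                    new SeekToStoredOffsets(consumer, myStore))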

Implementation of queues using kafka server

I want to implement a queue mechanism using Kafka, but I could not find anywhere whether it's possible to just peek at the data in the queue created for a topic without moving forward in it.
I want to read data from the queue and, based on different conditions, either remove the existing message or add another message to the queue. Also, is it possible to use a single Kafka server from different machines?
I referred to tutorialspoint for learning more about it.
Thanks in advance. Any leads would be appreciated.
Keep in mind that Kafka scales with multiple partitions per topic, and it doesn't give any ordering guarantee between partitions. So don't use Kafka if you want strict ordering. Within a consumer group, if you want n consumers per topic, you need to have at least n partitions.
Consumers don't remove messages; they commit the offset of a message. The default configuration in most clients is to auto-commit offsets on read. You can re-insert messages into the topic at any time, but you cannot skip a message and expect to process it later.
You can connect as many machines as you want to a Kafka server. Typically, you have multiple servers as a Kafka cluster, with replication for fault tolerance.
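To make the commit point concrete, here is a minimal sketch (Scala, Kafka's Java client; the broker, group id, and topic names are placeholders) of a "peek-then-commit" style: auto-commit is disabled, so a record that has been read but not committed will be delivered again to the group after a restart or rebalance.

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object PeekThenCommit extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // placeholder broker
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "queue-like-group")          // placeholder group id
  props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")           // commit manually
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(java.util.Collections.singletonList("my-queue-topic")) // placeholder topic

  consumer.poll(Duration.ofSeconds(1)).asScala.foreach { record =>
    // "Peek": the record has been read but nothing is committed yet.
    val handled = record.value() != null   // stand-in for the real business condition
    if (handled) {
      val tp = new TopicPartition(record.topic(), record.partition())
      // Committing offset + 1 marks this record as done for the consumer group.
      consumer.commitSync(Map(tp -> new OffsetAndMetadata(record.offset() + 1)).asJava)
    }
  }
  consumer.close()
}

Note that this only skips committing; the message itself stays in the topic until retention removes it, so "removing" a message really means committing past it (or re-producing a modified copy to another topic).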

Can kafka client select specific partition to consume?

I have a single Kafka client instance that is consuming from 200 partitions, and now I want it to consume only from the first 3 partitions, for debugging and sampling purposes.
Is there a way I can do that?
Alternatively, I could consume from all partitions and drop messages from partitions that are not among the first 3. Is there a way to find out which partition a message came from?
You can use KafkaConsumer.assign(java.util.Collection<TopicPartition> partitions) to assign a specific set of partitions. To find out the partition of a message you can use ConsumerRecord.partition().
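A short sketch of those two calls from Scala (Kafka's Java client; the broker address and topic name are placeholders):

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object SampleFirstPartitions extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // placeholder broker
  props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")           // sampling only, no commits
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)
  // assign() replaces a subscription with an explicit partition list, so only
  // partitions 0..2 of the topic are ever fetched.
  val sampled = (0 to 2).map(p => new TopicPartition("my-topic", p))
  consumer.assign(sampled.asJava)

  consumer.poll(Duration.ofSeconds(1)).asScala.foreach { record =>
    // record.partition() reports which partition the record came from.
    println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
  }
  consumer.close()
}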
If you want to consume only some of the partitions, you can implement org.apache.kafka.clients.consumer.internals.PartitionAssignor.
Apache Kafka's own tests already use a MockPartitionAssignor (which extends PartitionAssignor).
Implement PartitionAssignor and set it via the "partition.assignment.strategy" consumer config.
Reference: https://kafka.apache.org/documentation/#newconsumerconfigs
Since you haven't specified which consumer API you use, I am going to give an example based on the Python kafka-python library.
from kafka import KafkaConsumer

consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['host:9092'])

for message in consumer:
    if message.partition in [0, 1, 2]:
        print(message.value)  # do something with message.value
If you really want to read only from a subset of partitions, you can do something like this (with kafka-python, create the consumer without a topic subscription and assign the partitions explicitly):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['host:9092'])
consumer.assign([TopicPartition('foobar', 2)])
next_message = next(consumer)

But I would suggest the first approach. Irrespective of the language you develop in, all Kafka consumer clients implement the features above.