Consuming from added partitions in Apache Kafka - apache-kafka

I have a KafkaConsumer and use manual partition assignment. To distribute partitions I use consumer.partitionsFor(topicId) in regular intervals to detect added partitions, because the job runs forever and I want to support this case. However, this always returns initial list of partitions, unless I restart the consumer.
Is there a way detect added partitions from consumer? What to poll or listen for?

KafkaConsumer has a configuration property "metadata.max.age.ms", which defaults to 5 minutes. That means, that each call to consumer.partitionsFor will return cached copy for that time and only then fetch new metadata.
Setting the property to 0 fetches new metadata each time.

Maybe this is in the process of the partition reassign when you query by the assignment() method. You could sleep 5 seconds and then check this to test it.
Get the set of partitions currently assigned to this consumer. If subscription happened by directly assigning partitions using assign(Collection) then this will simply return the same partitions that were assigned. If topic subscription was used, then this will give the set of topic partitions currently assigned to the consumer (which may be none if the assignment hasn't happened yet, or the partitions are in the process of getting reassigned).

Related

Force assignment of ALL partitions

My topic has many partitions, many producers, but only one consumer. The consumer can run only for a short period of time, let's say one minute. During this period, I need to make sure it will consume all the records from all the partitions that were produced before the consumer was initialized, but ALSO the records produced during the minute the consumer was running.
The problem is that can't find the correct partition assignation strategy that guarantee I will get all the partitions. If I use consumer.Subscribe(topic), I will only get some partitions, not all. If I use consumer.Assign(partition) during the initialization, I WILL get all the active partitions, but if a new partition comes along and receives records, I will miss those.
The only solution I have so far is to re-do the assignments periodically (every 10 seconds).
If I use consumer.Subscribe(topic), I will only get some partitions, not all
It should get all. If you don't get all, then that means you more than likely have some un-closed consumer instance in the same consumer group that has already been assigned other partitions.
You can periodically run kafka-consumer-groups --describe command to inspect this.
Using assignment doesn't use the consumer group protocol, thus why it would work.
There is no guarantee Kafka can provide that your consumer will read any/all data between two time intervals. You'd need to track this on your own, and may potentially require your consumer instance to run for longer than you expect.

Kafka - timestamp order

Assume I'm using log.message.timestamp.type=LogAppendTime.
Also assume number of messages per topic/partition during first read:
topic0:partition0: 5
topic0:partition1: 0
topic0:partition2: 3
topic1:partition0: 2
topic1:partition1: 0
topic1:partition2: 4
and during second read:
topic0:partition0: 5
topic0:partition1: 2
topic0:partition2: 3
topic1:partition0: 2
topic1:partition1: 4
topic1:partition2: 4
If I read first message from each partition, does Kafka guarantee that reading again from each partition won't return a message that's older than those I read during first read?
Focus on topic0:partition1 and topic1:partition1 which didn't have any messages during first read, but have during second read.
Kafka guarantees message ordering at partition level, so your use case perfectly fits kafka's architecture.
There are some concepts to explain in here. First of all, you have the starting consumer position (when you first launch a new consumer group), defined by the auto.offset.reset parameter.
This will kick in only if there's no saved offset for that group, or if a saved offset is not valid anymore (f.e, if it was already deleted by retention policies). You should normally only worry for this if you launch a new consumer group (and you want to decide wether it starts from the oldest messages, or from the present - newest one).
Regarding your example, in normal conditions (there are no consumer shutdowns, etc), you have nothing to worry about. Consumers within a same consumer group will only read their messages once, no matter the number of partitions nor the number of consumers. These consumers remember their last read offset, and periodically save it in the _consumer_offsets topic.
There are 2 properties that define this periodical recording:
enable.auto.commit
Setting it to true (which is the default value) will allow the automatic commit to the _consumer_offsets topic.
auto.commit.interval.ms
Defines when the offsets are commited. For example, with a value of 10000, your consumer offsets will be stored every 10 seconds.
You can also set enable.auto.commit to false and store your offsets in your own way (f.e to a database, etc), but this is a more special use case.
The auto offset committing will allow you to stop your consumers, and start them again later without losing any message nor reprocessing already processed ones (it's like a mark in a book's page). If you don't stop your consumers (and without any errors from broker/zookeeper/consumers), even less worries for you.
For more info, you can take a look here: https://docs.confluent.io/current/clients/consumer.html#concepts
Hope it helps!

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages in a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates and the used key is the ID of the element whose value has changed. So the topic is acting as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in compacted topic is a very high possibility even with setting the log compaction frequency the highest possible.
x2 number of topics on Kakfa server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that consumer can see it (Extra topics), or we will need consumers to execute KSQL SELECT using to KSQL Rest Server and query the table (Not as fast and performant as Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer group.id and you commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read it from the last comitted offset for that group.id.
So the problem of taking a lot of time occurs only initially (during first time). So long as you have the file, you don't need to consume from beginning.
In case, if the file is not there or is deleted, just seekToBeginning in the KafkaConsumer and build it again.
Somewhere, you need to store this key-values for retrieval and why cannot it be a persistent store?
In case if you want to use Kafka streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S: There will be a file called .checkpoint in the directory which stores the offsets. In case if the topic is deleted in the middle you get OffsetOutOfRangeException. You may want to avoid this, perhaps by using UncaughtExceptionHandler
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
It is better to use Consumer with persistent file rather than Streams for this, because of simplicity it offers.

How to reinstate a Kafka Consumer which has been kicked out of the group?

I have a situation where I have a single kafka consumer which retrieves records from kafka using the poll mechanism. Sometimes this consumer gets kicked out of the consumer group due to failure to call poll within the session.timeout period which I have configured to 30s. My question is if this happens will a poll at some later point of time re-add the consumer to the group or do I need to do something else?
I am using kafka version 0.10.2.1
Edit: Aug 14 2018
Some more info. After I do a poll I never process the records in the same thread. I simply add all the records to a separate queue (serviced by a separate thread pool) for processing.
Poll will initiate a "join group" request, if the consumer is not a member of the group yet, and will result in consumer joining the group (unless some error situation prevents it). Note that depending on the group status (other members in the group, subscribed topics in the group) the consumer may or may not get the same partitions it was consuming from before it was kicked out. This would not be the case if the consumer is the only consumer in the group.
Consumer gets kicked out if it fails to send heart beat in designated time period. Every call to poll sends one heart beat to consumer group coordinator.
You need to look at how much time is it taking to process your single record. Maybe it is exceeding session.timeout.ms value which you have set as 30s. Try increasing that. Also keep max.poll.records to a lower value. This setting determines how many records are fetched after the call to poll method. If you fetch too many records then even if you keep session.timeout.ms to a large value your consumer might still get kicked out and group will enter rebalancing stage.
Vahid already mentioned what happens when a kicked-out consumer rejoins the group. You can also tune the below configuration so that consumer won't be kicked out of the group.
max.poll.records - which gives the pre-defined number of records in the poll loop (default: 500)
max.poll.interval.ms - which gives you the amount of time required to process the messages that are received in the poll. (default: 5 min)
You can see the impacts of updating the above configuration in KIP-62
Alternatively, you can use KafkaConsumer#assign mode as you've mentioned that you're using only one consumer. This mode won't do any re-balance.

kafka subscribe commit offset manually

I am using Kafka 9 and confused with the behavior of subscribe.
Why does it expects group.id with subscribe.
Do we need to commit the offset manually using commitSync. Even if don't do that I see that it always starts from the latest.
Is there a way a replay the messages from beginning.
Why does it expects group.id with subscribe?
The concept of consumer groups is used by Kafka to enable parallel consumption of topics - every message will be delivered once per consumer group, no matter how many consumers actually are in that group. This is why the group parameter is mandatory, without a group Kafka would not know how this consumer should be treated in relation to other consumers that might subscribe to the same topic.
Whenever you start a consumer it will join a consumer group, based on how many other consumers are in this consumer group it will then be assigned partitions to read from. For these partitions it then checks whether a list read offset is known, if one is found it will start reading messages from this point.
If no offset is found, the parameter auto.offset.reset controls whether reading starts at the earliest or latest message in the partition.
Do we need to commit the offset manually using commitSync? Even if
don't do that I see that it always starts from the latest.
Whether or not you need to commit the offset depends on the value you choose for the parameter enable.auto.commit. By default this is set to true, which means the consumer will automatically commit its offset regularly (how often is defined by auto.commit.interval.ms). If you set this to false, then you will need to commit the offsets yourself.
This default behavior is probably also what is causing your "problem" where your consumer always starts with the latest message. Since the offset was auto-committed it will use that offset.
Is there a way a replay the messages from beginning?
If you want to start reading from the beginning every time, you can call seekToBeginning, which will reset to the first message in all subscribed partitions if called without parameters, or just those partitions that you pass in.