Options for getting a specific item's produce and consume time stamps? - apache-kafka

Suppose I'm debugging an issue involving a single specific message that was produced and consumed. And I want to know when this message was produced and when it was consumed. What are my options for getting this info?
I guess when I construct a message I could include within it the current time. And when my consumer gets a message, it could write out a log entry.
But suppose I have many producer and consumer classes and none of the code is doing these things. Is there something already existing in kafka that could support finding out this info about a specific message without having to touch the implementation of these producers and consumers, something like the __consumer_offsets topic?

Kafka has built-in timestamp support for messages sent and this timestamp can be accessed via timestamp method of ConsumerRecord (link)
It can be configured with broker config (log.message.timestamp.type) or topic level config (message.timestamp.type). Its default value is CreateTime. You can also set this as LogAppendTime.
CreateTime: Timestamp is assigned when producer record is created (before sending).
LogAppendTime: broker will override the
timestamp with its current local time and append the message to the
log.
IMHO for consume timestamp your only option is to get currentTime of the system after message process is finished.
For more information about timestamp you can check this.

When it comes to consumption, there's no explicit way of specifying when message was consumed (also bear in mind that a single message can be consumed multiple times e.g. by consumers in different consumer groups).
However there are a few possible ways to track it on your own:
register the timestamps after the record has been received (after the .poll(...) call returns),
if using consumer groups, monitor consumer group's offset or see the value in __consumer_offsets (that would require you to deserialize internal format - see this answer for details (bear in mind, the timestamps of records in this topic correspond to consumer's commit timestamps, so they need to commit often enough to provide correct granurality,
log compaction + custom implementation: send the message with the same key and value that marks the consumption timestamp (however the message might still be re-read before the compaction happens).

Related

Kafka log compaction pointers

Reading about log compaction on a topic, I was wondering if there is any way for a consumer to get hold of any of the positions/offsets of the following?
end of the head
start of the tail
compaction cleaner point
Basically the point at which the compacted and non-compacted parts of the log meet?
I've read that there is a cleaner-offset-checkpoint file that sits on the broker at /var/lib/kafka/data/cleaner-offset-checkpoint but is the info in this file available to a consumer?
My use case is a consumer that will consume compacted keys one way and non-compacted keys another way.
thanks for any advice.
UPDATE:
thinking for example of a topic holding various customer events like here https://www.confluent.io/blog/put-several-event-types-kafka-topic/; new customer, customer updates name, customer updates address, etc. Log compaction, I believe, will leave one event per customer in the tail but still many events per customer in the head (assuming compaction is slower than message production..?) A new consumer of this topic would have to treat all compacted messages as CREATES, but then also treat non-compacted message as their more fine grained event? In any case I was wondering if a consumer could tell how far along a topic compaction has got, at any given time?
It's not possible, with the consumer api, no.
If you want to check that checkpoint file on disk, you could use Jssh, for example, to access a broker, and read the file. If it has offset data, you could then use seek methods, but keep in mind that the Log Cleaner thread may be actively running when you seek to or consume that data
A new consumer of this topic would have to treat all compacted messages as CREATES, but then also treat non-compacted message as their more fine grained event?
I don't think this is a valid use case. For a stream of customer updates, you'd just update a new customer model in a table via a streaming reduce function. If any consumer restarts, it'll have to always read from the beginning of the topic to rebuild its local state then continue reading any updates to those stored values, so doesn't make sense to skip past them all, or have two separate consumers
I also don't necessarily think you need different models. Some UUID would be unique, and every event can contain the full model of a "customer". Most fields can remain optional/nullable until they are provided with a new message with all those fields set (or not), and this defines a batch update since you can set/update/remove multiple attributes at once. If you need more granularity, that's also possible to define at the producer level by storing and looping over your attributes and producing individual "customer" objects with each new attribute

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages in a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates and the used key is the ID of the element whose value has changed. So the topic is acting as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in compacted topic is a very high possibility even with setting the log compaction frequency the highest possible.
x2 number of topics on Kakfa server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that consumer can see it (Extra topics), or we will need consumers to execute KSQL SELECT using to KSQL Rest Server and query the table (Not as fast and performant as Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer group.id and you commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read it from the last comitted offset for that group.id.
So the problem of taking a lot of time occurs only initially (during first time). So long as you have the file, you don't need to consume from beginning.
In case, if the file is not there or is deleted, just seekToBeginning in the KafkaConsumer and build it again.
Somewhere, you need to store this key-values for retrieval and why cannot it be a persistent store?
In case if you want to use Kafka streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S: There will be a file called .checkpoint in the directory which stores the offsets. In case if the topic is deleted in the middle you get OffsetOutOfRangeException. You may want to avoid this, perhaps by using UncaughtExceptionHandler
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
It is better to use Consumer with persistent file rather than Streams for this, because of simplicity it offers.

How to make restart-able producer?

Latest version of kafka support exactly-once-semantics (EoS). To support this notion, extra details are added to each message. This means that at your consumer; if you print offsets of messages they won't be necessarily sequential. This makes harder to poll a topic to read the last committed message.
In my case, consumer printed something like this
Offset-0 0
Offset-2 1
Offset-4 2
Problem: In order to write restart-able proudcer; I poll the topic and read the content of last message. In this case; last message would be offset#5 which is not a valid consumer record. Hence, I see errors in my code.
I can use the solution provided at : Getting the last message sent to a kafka topic. The only problem is that instead of using consumer.seek(partition, last_offset=1); I would use consumer.seek(partition, last_offset-2). This can immediately resolve my issue, but it's not an ideal solution.
What would be the most reliable and best solution to get last committed message for a consumer written in Java? OR
Is it possible to use local state-store for a partition? OR
What is the most recommended way to store last message to withstand producer-failure? OR
Are kafka connectors restartable? Is there any specific API that I can use to make producers restartable?
FYI- I am not looking for quick fix
In my case, multiple producers push data to one big topic. Therefore, reading entire topic would be nightmare.
The solution that I found is to maintain another topic i.e. "P1_Track" where producer can store metadata. Within a transaction a producer will send data to one big topic and P1_Track.
When I restart a producer, it will read P1_Track and figure out where to start from.
Thinking about storing last committed message in a database and using it when producer process restarts.

Kafka stream application not consume data after restart

After I did restart our Kafka cluster my application of Kafka streams didn't receive messages from input topic and I got an exception of "can׳t create internal topic". After some research, I did reset with the Kafka tool (to the input topic and the application) the tool is Kafka-streams-application-reset.sh.
Unfortunately, it didn't resolve the problem and I also got the exception again
From the error message, you can infer that the topic already exists and thus, cannot be created. The reason for the failure is, that the existing topic does not have the expected number of partitions (it has 1 instead of 150) -- if the number of partitions would match, Kafka Streams would just use the existing topic.
This can happen, if you have topic auto-create enabled at the brokers (and the topic was created with a wrong number of partitions), or if the number of partitions of your input topic changed. Kafka Streams does not automatically change the number of partitions for the repartition topic, because this might result in data corruption and thus lead to incorrect results.
One way to fix this, it to either manually delete this topic: note, that this might result in data loss and you should only do this, if you know that it is what you want.
Another (better way) would be, to reset the application cleanly using bin/kafka-streams-application-reste.sh in combination with KafkaStreams#cleanup().
Because you need to clean up the application and users should be aware of the implication, Kafka Streams fails to make user aware of the issue instead of "auto magically" take some actions that might be undesired from a user point of view.
Check out the docs for more details. There is also a blog post that explains application reset in details:
https://kafka.apache.org/11/documentation/streams/developer-guide/app-reset-tool.html
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/

Simple-Kafka-consumer message delivery duplication

I am trying to implement a simple Producer-->Kafka-->Consumer application in Java. I am able to produce as well as consume the messages successfully, but the problem occurs when I restart the consumer, wherein some of the already consumed messages are again getting picked up by consumer from Kafka (not all messages, but a few of the last consumed messages).
I have set autooffset.reset=largest in my consumer and my autocommit.interval.ms property is set to 1000 milliseconds.
Is this 'redelivery of some already consumed messages' a known problem, or is there any other settings that I am missing here?
Basically, is there a way to ensure none of the previously consumed messages are getting picked up/consumed by the consumer?
Kafka uses Zookeeper to store consumer offsets. Since Zookeeper operations are pretty slow, it's not advisable to commit offset after consumption of every message.
It's possible to add shutdown hook to consumer that will manually commit topic offset before exit. However, this won't help in certain situations (like jvm crash or kill -9). To guard againts that situations, I'd advise implementing custom commit logic that will commit offset locally after processing each message (file or local database), and also commit offset to Zookeeper every 1000ms. Upon consumer startup, both these locations should be queried, and maximum of two values should be used as consumption offset.