How to request data from producer at beginning position that does not exist in Kafka?

I have a database with time series data and this data is sent to Kafka.
Many consumers build aggregations and reporting based on this data.
My Kafka cluster stores data with TTL for 1 day.
But how can I build a new report and run a new consumer from offset 0, which no longer exists in Kafka but still exists in the source storage?
For example - some callback for the producer if I request an offset that does not exist in Kafka?
If it is not possible please advise other architectural solutions. I want to use the same codebase to aggregate this data.

"For example - some callback for the producer if I request an offset that does not exist in Kafka?"
If the data does not exist in Kafka, you cannot consume it, much less do any aggregation on top of it.
Moreover, there is no concept of a consumer requesting a producer. Producer sends data to Kafka broker(s) and consumers consume from those broker(s). There is no direct interaction between a producer and a consumer as such.
Since you say that the data still exists in the source DB, you can fetch your data from there and reproduce it to Kafka.
When you produce that data again, they will be new messages which will be eventually consumed by the consumers as usual.
In case you would like to differentiate between initial consumption and re-consumption, you can produce these messages to a new topic and have your consumers consume from it.
Another way is to increase your TTL (I suppose you mean retention in Kafka when you say TTL). Then you can seek back to a timestamp in the consumers using the offsetsForTimes(Map<TopicPartition,Long> timestampToSearch) and seek(TopicPartition topicPartition, long offset) methods.

Related

Kafka exactly-once producer consumer

I am implementing exactly-once semantics for a simple data pipeline, with Kafka as the message broker. I can force the Kafka producer to write each produced record exactly once by setting enable.idempotence=true.
However, on the consumption front I need to guarantee that the consumer reads each record exactly once (I am not interested in storing the consumed records to external system or to another Kafka topic just processing). To achieve this, I have to ensure that polled records are processed and their offsets are committed to __consumer_offsets topic atomically/transactionally (both succeed/fail together).
In that case, do I need to resort to the Kafka transaction APIs and create a transactional producer in the consumer polling loop, where inside the transaction I perform (1) processing of the consumed records and (2) committing their offsets, before closing the transaction? Or would the normal commitSync/commitAsync suffice?
"on the consumption front I need to guarantee that the consumer reads each record exactly once"
The answer from Gopinath explains well how you can achieve exactly-once between a KafkaProducer and a KafkaConsumer. These configurations (together with the use of the Transaction API in the KafkaProducer) guarantee that all data sent by the producer is stored in Kafka exactly once. However, they do not guarantee that the consumer reads the data exactly once. That depends on your offset management.
Anyway, I understand your question as asking how the consumer itself can process a consumed message exactly once.
For this you need to manage your offsets on your own in an atomic way. That means you need to build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally.
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
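A minimal sketch of that idea, assuming a hypothetical in-memory AtomicOffsetStore that stands in for the external store (in practice a database transaction would play this role). The processed result and the next offset are updated in one atomic step, so a record that is replayed after a crash or rebalance is detected and skipped. The class and method names are illustrative and not part of any Kafka API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for an external store that can update
// the processing result and the consumed offset in one atomic step.
class AtomicOffsetStore {
    private long runningTotal = 0;                                  // the "processed result"
    private final Map<Integer, Long> nextOffset = new HashMap<>();  // partition -> next offset to process

    // Applies a record only if its offset has not been processed yet.
    // Result and offset change together inside one synchronized method,
    // which plays the role of a database transaction here.
    synchronized boolean applyOnce(int partition, long offset, long value) {
        long expected = nextOffset.getOrDefault(partition, 0L);
        if (offset < expected) {
            return false; // replayed record: already reflected in runningTotal
        }
        runningTotal += value;
        nextOffset.put(partition, offset + 1);
        return true;
    }

    // Offset a restarted consumer would resume from for this partition.
    synchronized long resumeOffset(int partition) {
        return nextOffset.getOrDefault(partition, 0L);
    }

    synchronized long total() {
        return runningTotal;
    }
}
```

On restart, the consumer would read resumeOffset(partition) and pass it to KafkaConsumer#seek instead of relying on the offsets committed to Kafka.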
"do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop"
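The Transaction API is only available on the KafkaProducer. It does offer sendOffsetsToTransaction for committing consumed offsets as part of a producer transaction, but that targets read-process-write pipelines whose output goes back to Kafka, so as far as I am aware it does not by itself make arbitrary processing inside the consumer exactly-once.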
'Exactly-once' functionality in Kafka can be achieved by a combination of these 3 settings, each of which belongs to a different client:
isolation.level = read_committed (consumer)
transactional.id = <unique_id> (producer)
processing.guarantee = exactly_once (Kafka Streams)
More information on enabling the exactly-once functionality:
https://www.confluent.io/blog/simplified-robust-exactly-one-semantics-in-kafka-2-5/
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Using kafka for CQRS

Been reading a lot about kafka's use as an event store and a potential good candidate for CQRS.
I was wondering, since messages in kafka have a limited retention time, how will events be replayed after the messages were deleted from the disk where kafka retains messages?
Logically, when these messages are stored externally from kafka (after reading messages from kafka topics) in a db (sql/nosql), that would make more sense from an event store standpoint than kafka.
In light of the above, and assuming my understanding is correct, what is the real use case for Kafka in CQRS, given that Kafka was originally intended as just a high-throughput messaging system?
You can use Kafka as an event store and for CQRS. You can use Kafka Streams to process all events generated by commands, keep a snapshot of your entities in a changelog topic, and copy that changelog topic into one or more NoSQL databases that meet your requirements. All events can also be stored in a database (e.g., PostgreSQL). What's important to know is that Kafka can be used as a store (it stores files in a highly available way) or as a message queue.
Retention time: You can set the retention time as long as you want or even keep messages forever in the topic.
Using Kafka as the data store: Sure, you can. There is a feature named Log Compaction. Consider the following scenario:
Insert product with ID=10, Name=Apple, Price=10
Insert product with ID=20, Name=Orange, Price=20
Update product with ID=10, Price becomes 30
When log compaction is enabled on a topic, a background job periodically cleans up messages on that topic. The job checks whether several messages share the same key and, if so, keeps only the latest one. In the above scenario, the messages written to Kafka have the following format:
Message 1: Key=10, Name=Apple, Price=10
Message 2: Key=20, Name=Orange, Price=20
Message 3: Key=10, Name=Apple, Price=30 (every update now includes all fields so that it is self-contained)
After log compaction, the topic becomes:
Message 1: Key=20, Name=Orange, Price=20
Message 2: Key=10, Name=Apple, Price=30 (the latest record with ID=10 is kept)
In practice, log compaction is the feature that lets Kafka act as persistent data storage.
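The effect of compaction on the product scenario can be imitated with a small in-memory sketch. It only illustrates the "keep the latest record per key" rule and is not how the broker implements it. CompactionSketch and its record layout (a key plus a value string) are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CompactionSketch {
    // Each log record is a two-element array: {key, value}.
    // Returns one line per surviving record, ordered by each key's last write.
    static List<String> compact(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            String key = record[0];
            latest.remove(key);          // drop the older entry so the key moves to the end
            latest.put(key, record[1]);  // keep only the most recent value for this key
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : latest.entrySet()) {
            result.add("Key=" + e.getKey() + ", " + e.getValue());
        }
        return result;
    }
}
```

Feeding it the three product messages (keys 10, 20, 10) leaves the Orange record followed by the updated Apple record.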

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages at a very high frequency to topics whose retention time is 10 hours. These messages are real-time updates, and the key is the ID of the element whose value has changed. So the topic acts as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Duplicates in the compacted topic are very likely, even with log compaction set to run as frequently as possible.
Twice the number of topics on the Kafka server.
KSQL:
With KSQL we either have to write a KTable back out as a topic so that the consumer can see it (extra topics), or the consumers need to execute a KSQL SELECT against the KSQL REST server to query the table (not as fast and performant as the Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on the Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we have a large number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
"By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we have a big number of consumers."
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
"1 changelog topic + 1 compact topic"
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
"Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]"
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
"Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table."
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
"From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?"
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
"Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table."
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer a group.id and commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will resume from the last committed offset for that group.id.
So the problem of taking a lot of time occurs only the first time. As long as you have the file, you don't need to consume from the beginning.
If the file is missing or has been deleted, just call seekToBeginning on the KafkaConsumer and build it again.
You need to store these key-value pairs somewhere for retrieval anyway, so why not in a persistent store?
If you want to use Kafka Streams for whatever reason, an alternative (not as simple as the above) is to use a persistent store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S.: There will be a file called .checkpoint in the directory, which stores the offsets. If the topic is deleted in the middle, you get an OffsetOutOfRangeException. You may want to handle this, perhaps by using an UncaughtExceptionHandler.
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
It is better to use the Consumer with a persistent file rather than Streams for this, because of the simplicity it offers.

Kafka Topic ordering when scaling up the partitions

Suppose your producers create messages for the users of a system, and the order of the messages matters at the user level.
My producers add messages to a topic that has two partitions, and I hash the user_id so that all messages of a given user go to the same partition, which guarantees their order.
How can I scale up the system and add more partitions to the topic while keeping the order of the messages?
How does Kafka treat the messages that were already produced before the partitions were added?
What will happen to the messages that were consumed but whose offsets were not yet committed back to Kafka?
1. Buffer messages in an ordered set (e.g., a TreeSet) at the consumer client for a short window (1 minute or less) and emit them in order. Kafka only guarantees order within a single partition, and I think the producer cannot guarantee order either.
2. If the offset is not committed, the consumer will receive the same messages again once it restarts or the partition is reassigned. In any case, the consumer client should process messages idempotently, even if you have committed the offset.
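To illustrate why adding partitions affects per-user ordering, here is a simplified sketch. It assumes a plain user_id modulo partition-count mapping (partitionFor is a hypothetical helper). Kafka's default partitioner hashes the serialized key bytes instead, but the result likewise depends on the number of partitions.

```java
class PartitionSketch {
    // Simplified stand-in for a key-hashing partitioner:
    // maps a user id to a partition based on the current partition count.
    static int partitionFor(int userId, int numPartitions) {
        return Math.floorMod(userId, numPartitions);
    }
}
```

With two partitions, user 3 maps to partition 1. With three partitions, the same user maps to partition 0. Messages that were already written stay in their original partition, so after the change one user's messages can be spread across two partitions, and Kafka does not order messages across partitions.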

How to find out the latest offset of a Kafka topic to know when my reader is up-to-date with topic?

I have a server that needs to keep an in-memory cache of all users. Assuming the list won't be big (a couple hundred thousand items), I'd like to use a Kafka topic with keyed messages, where the key is a userId, to keep the current state of that list. The admin application will send a new user object to that topic when something changes. So when the server starts, it simply needs to read everything from that topic from the beginning and populate its cache.
The population phase takes about 20-30 seconds depending on the connection to Kafka, so the server must not come online until it has read everything from the topic and has an up-to-date cache (all the messages in the topic at the moment of start is considered up-to-date). But I don't see how to determine whether I have read everything from the Kafka stream, so that I can notify other services that the cache is populated and the server can start serving requests. I've read about the high watermark but don't see it exposed in the Java consumer API.
So how to find out the latest offset of a Kafka topic to know when my reader is up-to-date?
This assumes you are using the high-level consumer, which does not expose the high watermark.
As you mentioned: "all the messages in the topic at the moment of start is considered up-to-date".
When your application starts, you can do the following using the SimpleConsumer API:
Find the number of partitions in the topic by issuing a TopicMetadataRequest to any broker in the Kafka cluster.
Create a partition-to-latestOffset map, where the key is the partition and the value is the latest offset available in that partition.
Map<Integer, Long> offsetMap = new HashMap<>();
For each partition p in the topic:
A. Find the leader of partition p.
B. Send an OffsetRequest to the leader.
C. Get the latestOffset from the OffsetResponse.
D. Add an entry to offsetMap where the key is partition p and the value is latestOffset.
Start reading messages from Kafka using the high-level consumer:
A. For each message you get from the KafkaStream:
AA. Get the partition and offset of the message.
BB. if (offset >= offsetMap.get(partition) - 1) stop reading from this stream. The latest offset reported by the broker is the offset of the next message to be written, so the last existing message has offset latestOffset - 1.
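The stop condition in the steps above could be wrapped in a small helper like the following sketch. CatchUpTracker is a hypothetical class, and it assumes the captured end offset is the offset of the next message to be written, so the last existing message sits at end offset minus one.

```java
import java.util.HashMap;
import java.util.Map;

// Tracks, per partition, whether the reader has reached the end offsets captured at startup.
class CatchUpTracker {
    // partition -> end offset captured at startup, for partitions not yet caught up
    private final Map<Integer, Long> pending = new HashMap<>();

    // endOffsets: partition -> offset of the next message to be written.
    CatchUpTracker(Map<Integer, Long> endOffsets) {
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            if (e.getValue() > 0) { // a partition with end offset 0 is empty, nothing to read
                pending.put(e.getKey(), e.getValue());
            }
        }
    }

    // Call for every consumed message; returns true once all partitions are caught up.
    boolean onMessage(int partition, long offset) {
        Long end = pending.get(partition);
        if (end != null && offset >= end - 1) {
            pending.remove(partition); // reached the last message that existed at startup
        }
        return pending.isEmpty();
    }

    boolean isCaughtUp() {
        return pending.isEmpty();
    }
}
```

With the newer Java consumer, KafkaConsumer#endOffsets can supply the end offsets. Note that a partition whose messages have all been removed by retention never delivers a message, so a real implementation would also need to compare against the beginning offsets or the consumer's position.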
Hope this helps.