How does Kafka internally order the messages within a partition? Does it store them as it received from the producer? - apache-kafka

I wanted to understand the order that Kafka follows internally to place messages in a parition that it received from a bunch of varying producers.

Partition is a sharding for the topic. And each partition will be write into a separate file under the same directory holds the topic's name. Writing or reading into a file is sequential, that is the way partition maintains its order.
Does it store them as it received from the producer?
Yes, as soon message received it will be written into its buffer quite similar to some relational data bases write to write ahead log. Kafka uses operating systems page cache as a buffer to obtain high performance of reading and writing. Periodically depends on the configuration Kafka writes data into the file.

Related

How is it possible to aggregate messages from Kafka topic based on duration (e.g. 1h)?

We are streaming messages to a Kafka topic at a rate of a few hundred per second. Each message has a timestamp and a payload. Ultimately, we would like aggregate one hour worth of data - based on the timestamp of the message - into parquet files and upload them to a cheap remote storage (object-store).
A naive approach would be to have the consumer simply read the messages from the topic and do the aggregation/roll-up in memory, and once there is one hour worth of data, generate and upload the parquet file.
However, in case the consumer crashes or needs to be restarted, we would lose all data since the beginning of the current hour - if we use enable.auto.commit=true or enable.auto.commit=false and manually commit after a batch of messages.
A simple solution for the Consumer could be to keep reading until one hour worth of data is in memory, do the parquet file generation (and upload it), and only then call commitAsync() or commitSync() (using enable.auto.commit=false and use an external store to keep track of the offsets).
But this would lead to millions of messages not being committed for at least one hour. I am wondering if Kafka does even allow to "delay" the commit of messages for so many messages / so long time (I seem to remember to have read about this somewhere but for the life of me I cannot find it again).
Actual questions:
a) is there a limit to the number of messages (or duration) not being committed before Kafka possibly considers the Consumer to be broken or stops giving additional messages to the consumer? this seems counter-intuitive though, since what would be the purpose of enable.auto.commit=false and managing the offsets in the Consumer (with e.g. the help of an external database).
b) in terms of robustness/redundancy and scalability, it would be great to have more than one Consumer in the consumer group; if I understand correctly, it is never possible to have more than one Consumer per partition. If we then run more than one Consumer and configure multiple partitions per topic we cannot do this kind of aggregation/roll-up, since now messages will be distributed across Consumers. The only way to work-around this issue would be to have an additional (external) temporary storage for all those messages belonging to such one-hour group, correct?
You can configure Kafka Streams with a TimestampExtractor to aggregate data into different types of time-windows
into parquet files and upload them to a cheap remote storage (object-store).
Kafka Connect S3 sink, or Pinterest Secor tool, already do this

Apache Kafka - KStream and KTable hard disk space requirements

I am trying to, better, understand what happens in the level of resources when you create a KStream and a KTable. Below, I wil mention some conclusions that I have come to, as I understand them (feel free to correct me).
Firstly, every topic has a number of partitions and all the messages in those partitions are stored in the hard disk(s) in continuous order.
A KStream does not need to store the messages, that are read from a topic, again to another location, because the offset is sufficient to retrieve those messages from the topic which is connected to.
(Is this correct? )
The question regards the KTable. As I have understand, a KTable, in contrast with a KStream, updates every message with the with the same key. In order to do that, you have to either store externally the messages that arrive from the topic to a static table, or read all the message queue, each time a new message arrives. The later does not seem very efficient regarding time performance. Is the first approach I presented correct?
read all the message queue, each time a new message arrives.
All messages are only read at the fresh start of the application. Once the app reads up to the latest offset, it's just updating the table like any other consumer
How disk usage is determined ultimately depends on the state store you've configured for the application, along with its own settings. For example, in-memory vs rocksdb vs an external state store interface that you've written on your own

Using kafka for CQRS

Been reading a lot about kafka's use as an event store and a potential good candidate for CQRS.
I was wondering, since messages in kafka have a limited retention time, how will events be replayed after the messages were deleted from the disk where kafka retains messages?
Logically, when these messages are stored externally from kafka (after reading messages from kafka topics) in a db (sql/nosql), that would make more sense from an event store standpoint than kafka.
In lieu of above, given my understanding is correct, what is the real use case of kafka being used in CQRS even though the actual intent of kafka was just a high throughput messaging system?
You can use Kafka of event store and CQRS. You can use Kafka Stream to process all events generated by commands and store a snapshot of your entities in a changelog topic and store the changelog topic in a NOSQL one or more databases that meets your requirement. Also, all event can be store in a database(PostgresSql). What's important to know is that Kafka can be used as a store(its store files in high available way) or as a message query.
Retention time: You can set the retention time as long as you want or even keep messages forever in the topic.
Using Kafka as the data store: Sure, you can. There is a feature named Log Compaction. Let say the following scenario:
Insert product with ID=10, Name=Apple, Price=10
Insert product with ID=20, Name=Orange, Price=20
Update product with ID=10, Price becomes 30
When one topic is turned on the log compaction, a background job will periodically clean up messages on that topic. This job will check if any message has the same key then only keeps the final. With the above scenario, messages which are written to Kafka will the following format:
Message 1: Key=1, Name=Apple, Price=10
Message 2: Key=2, Name=Orange, Price=20
Message 3: Key=1, Name=Apple, Price=30 (Every update now includes all fields so it can self-contained)
After the log compaction, the topic will become:
Message 1: Key=2, Name=Orange, Price=20
Message 2: Key=1, Name=Apple, Price=30 (Keep the lastest record with the ID=1)
In reality, Kafka uses log compaction feature to make Kafka as the persistent data storage.

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages in a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates and the used key is the ID of the element whose value has changed. So the topic is acting as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in compacted topic is a very high possibility even with setting the log compaction frequency the highest possible.
x2 number of topics on Kakfa server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that consumer can see it (Extra topics), or we will need consumers to execute KSQL SELECT using to KSQL Rest Server and query the table (Not as fast and performant as Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer group.id and you commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read it from the last comitted offset for that group.id.
So the problem of taking a lot of time occurs only initially (during first time). So long as you have the file, you don't need to consume from beginning.
In case, if the file is not there or is deleted, just seekToBeginning in the KafkaConsumer and build it again.
Somewhere, you need to store this key-values for retrieval and why cannot it be a persistent store?
In case if you want to use Kafka streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S: There will be a file called .checkpoint in the directory which stores the offsets. In case if the topic is deleted in the middle you get OffsetOutOfRangeException. You may want to avoid this, perhaps by using UncaughtExceptionHandler
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
It is better to use Consumer with persistent file rather than Streams for this, because of simplicity it offers.

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using kafka where I want to ensure that duplicate records are not inserted into topic created.
I found that iteration is possible in consumer. Is there any way by which we can do this in producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
Use a single-writer per partition and every time you get a network
error check the last message in that partition to see if your last
write succeeded
Include a primary key (UUID or something) in the
message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply
by optionally integrating support for this on the server.
The existing
high-level consumer doesn't expose a lot of the more fine grained
control of offsets (e.g. to reset your position). We will be working
on that soon