Apache Kafka - KStream and KTable hard disk space requirements

I am trying to better understand what happens at the resource level when you create a KStream and a KTable. Below, I will mention some conclusions I have come to, as I understand them (feel free to correct me).
Firstly, every topic has a number of partitions, and all the messages in those partitions are stored on the hard disk(s) in sequential order.
A KStream does not need to store the messages it reads from a topic again in another location, because the offset is sufficient to retrieve those messages from the topic it is connected to.
(Is this correct?)
The question concerns the KTable. As I understand it, a KTable, in contrast with a KStream, updates the stored value whenever a message with the same key arrives. In order to do that, you have to either store externally the messages that arrive from the topic to a static table, or read all the message queue, each time a new message arrives. The latter does not seem very efficient regarding time performance. Is the first approach I presented correct?

read all the message queue, each time a new message arrives.
All messages are only read at the fresh start of the application. Once the app has read up to the latest offset, it just updates the table like any other consumer.
Disk usage ultimately depends on the state store you've configured for the application, along with its own settings: for example, in-memory vs. RocksDB vs. an external state store interface that you've written on your own.
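To make the "read everything once, then just update" behaviour concrete, here is a minimal sketch in plain Python. It is only a model of the idea, not Kafka Streams code: the list `topic` stands in for a partition, the dict `table` stands in for the state store, and the names `apply_record` and `materialize` are made up for illustration.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str]  # (key, value)


def apply_record(table: Dict[str, str], record: Record) -> None:
    """Upsert one record: a later value for the same key replaces the earlier one."""
    key, value = record
    table[key] = value


def materialize(log: Iterable[Record]) -> Dict[str, str]:
    """Replay the whole log once, as happens on a fresh start."""
    table: Dict[str, str] = {}
    for record in log:
        apply_record(table, record)
    return table


# Hypothetical topic contents, oldest record first.
topic = [("alice", "v1"), ("bob", "v1"), ("alice", "v2")]

table = materialize(topic)            # one full replay at startup
apply_record(table, ("carol", "v1"))  # afterwards, each new record is one cheap update
print(table)
```

The table holds one entry per key, so its size follows the number of distinct keys rather than the number of messages in the topic.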


Kafka client and aggregated events

In event-driven design we strive to find the events that we are interested in. Using Kafka we can easily subscribe (with a new group.id) to a topic and start consuming events. If the retention policy is the default one, we could also consume one-week-old messages if we specify auto.offset.reset=earliest. Right? But what if we want to start from the very beginning? I guess that a KTable should be used, but I'm not sure what will happen when a new client has subscribed to a stateful stream. Could you tell me whether it is true that the new subscriber will receive all aggregated messages?
You can't consume data that has been deleted.
That's why KTables are built on top of compacted topics, which store the latest value for each key and keep it indefinitely.
If you want to read the "current state" of the table, to get all aggregated messages, then you can use Interactive Queries.
not sure what will happen when a new client has subscribed to a stateful stream
It needs to read the entire compacted topic, starting from the beginning (the earliest available offset, not necessarily the first message ever produced), since it cannot easily know where in the topic the latest record for each unique key is.
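The effect of compaction on what a new reader sees can be sketched as follows. This is a simplified model under stated assumptions, not the broker's actual cleaner: records are hypothetical `(offset, key, value)` tuples, and `compact` is an illustrative helper that keeps only the last record per key while leaving offsets unchanged.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str, str]  # (offset, key, value)


def compact(log: List[Record]) -> List[Record]:
    """Keep only the record with the highest offset for each key."""
    latest_offset: Dict[str, int] = {}
    for offset, key, _value in log:
        latest_offset[key] = offset
    return [record for record in log if latest_offset[record[1]] == record[0]]


log = [
    (0, "k1", "a"),
    (1, "k2", "b"),
    (2, "k1", "c"),
    (3, "k3", "d"),
    (4, "k2", "e"),
]

compacted = compact(log)
earliest_available = compacted[0][0]  # where a brand-new reader would start
print(compacted, earliest_available)
```

Offsets 0 and 1 are gone after compaction, so "the beginning" for a new reader is offset 2, yet every key is still present with its latest value.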

Kafka log compaction pointers

Reading about log compaction on a topic, I was wondering if there is any way for a consumer to get hold of any of the positions/offsets of the following?
end of the head
start of the tail
compaction cleaner point
Basically the point at which the compacted and non-compacted parts of the log meet?
I've read that there is a cleaner-offset-checkpoint file that sits on the broker at /var/lib/kafka/data/cleaner-offset-checkpoint but is the info in this file available to a consumer?
My use case is a consumer that will consume compacted keys one way and non-compacted keys another way.
Thanks for any advice.
I am thinking, for example, of a topic holding various customer events like here https://www.confluent.io/blog/put-several-event-types-kafka-topic/; new customer, customer updates name, customer updates address, etc. Log compaction, I believe, will leave one event per customer in the tail but still many events per customer in the head (assuming compaction is slower than message production?). A new consumer of this topic would have to treat all compacted messages as CREATES, but then also treat non-compacted message as their more fine grained event? In any case I was wondering if a consumer could tell how far along a topic compaction has got, at any given time?
No, this is not possible with the consumer API.
If you want to check that checkpoint file on disk, you could use Jssh, for example, to access a broker and read the file. If it has offset data, you could then use the consumer's seek methods, but keep in mind that the Log Cleaner thread may be actively running when you seek to or consume that data.
A new consumer of this topic would have to treat all compacted messages as CREATES, but then also treat non-compacted message as their more fine grained event?
I don't think this is a valid use case. For a stream of customer updates, you'd just update a customer model in a table via a streaming reduce function. If any consumer restarts, it will always have to read from the beginning of the topic to rebuild its local state and then continue reading any updates to those stored values, so it doesn't make sense to skip past them all or to have two separate consumers.
I also don't necessarily think you need different models. Some UUID would be unique, and every event can contain the full model of a "customer". Most fields can remain optional/nullable until they are provided by a new message with all those fields set (or not), and this defines a batch update, since you can set/update/remove multiple attributes at once. If you need more granularity, that's also possible at the producer level, by looping over your stored attributes and producing an individual "customer" object for each new attribute.
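A rough sketch of the "one model, nullable fields, streaming reduce" idea, in plain Python rather than Kafka Streams. The field names, the customer ID `c1`, and the helpers `merge` and `reduce_events` are all hypothetical, and the sketch assumes that a `None` field means "not provided in this event" (how removals are encoded is left open here).

```python
from typing import Dict, Iterable, Optional, Tuple

Customer = Dict[str, Optional[str]]


def merge(current: Customer, update: Customer) -> Customer:
    """Reduce step: overlay the fields that the new event actually provides."""
    merged = dict(current)
    for field, value in update.items():
        if value is not None:  # assumption: None means "field not provided"
            merged[field] = value
    return merged


def reduce_events(events: Iterable[Tuple[str, Customer]]) -> Dict[str, Customer]:
    """Fold a stream of (customer_id, partial model) events into a table."""
    table: Dict[str, Customer] = {}
    for customer_id, update in events:
        table[customer_id] = merge(table.get(customer_id, {}), update)
    return table


events = [
    ("c1", {"name": "Ann", "address": None}),        # new customer
    ("c1", {"name": None, "address": "1 Main St"}),  # address update
    ("c1", {"name": "Anna", "address": None}),       # name update
]

customers = reduce_events(events)
print(customers)
```

Because the reduce runs over whatever records are still in the topic, the consumer does not need to know which of them survived compaction and which are recent fine-grained updates.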

Is there any way to ensure that duplicate records are not inserted in kafka topic?

I have been trying to implement a queuing mechanism using Kafka, where I want to ensure that duplicate records are not inserted into the topic.
I found that iteration is possible in the consumer. Is there any way we can do this in the producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
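Two of the approaches quoted above (a primary key in the message, and storing the offset with the loaded data) can be sketched in plain Python. This is an illustration under assumptions, not a Kafka client example: the message IDs, the topic name `orders`, and the helpers `dedupe_by_id` and `load` are invented, and a dict stands in for the target storage system.

```python
from typing import Dict, Iterable, List, Set, Tuple


def dedupe_by_id(messages: Iterable[Tuple[str, str]]) -> List[str]:
    """Drop messages whose ID (e.g. a UUID set by the producer) was already seen."""
    seen: Set[str] = set()
    unique: List[str] = []
    for message_id, payload in messages:
        if message_id in seen:
            continue  # duplicate write, e.g. caused by a producer retry
        seen.add(message_id)
        unique.append(payload)
    return unique


Position = Tuple[str, int, int]  # (topic, partition, offset)


def load(store: Dict[Position, str],
         records: Iterable[Tuple[str, int, int, str]]) -> None:
    """Store each record under its position, so re-loading it changes nothing."""
    for topic, partition, offset, payload in records:
        store.setdefault((topic, partition, offset), payload)


produced = [("id-1", "order A"), ("id-2", "order B"), ("id-1", "order A")]
unique_payloads = dedupe_by_id(produced)

store: Dict[Position, str] = {}
batch = [("orders", 0, 0, "order A"), ("orders", 0, 1, "order B")]
load(store, batch)
load(store, batch[1:])  # redelivery after restarting from an older checkpoint
print(unique_payloads, len(store))
```

In a real consumer the set of seen IDs would need to be bounded or persisted somewhere; the sketch keeps it in memory only to show the shape of the check.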

Kafka: topic compaction notification?

I was given the following architecture that I'm trying to improve.
I receive a stream of DB changes which end up in a compacted topic. The stream is basically key/value pairs and the keyspace is large (~4 GB).
The topic is consumed by one Kafka Streams process that stores the data in RocksDB (separate for each consumer/shard). The processor does two different things:
join the data into another stream.
check if a message from the topic is a new key or an update to an existing one. If it is an update it sends the old key/value and the new key/value pair to a different topic (updates are rare).
The construct has a couple of problems:
The two different functionalities of the stream processor belong to different teams and should not be part of the same code base. They are put together to save memory. If we separated them, we would have to duplicate the RocksDB stores.
I would prefer to use a normal KTable join instead of the handcrafted join that's currently in the code.
RocksDB seems to be a bit of overkill if the data is already persisted in a topic. We are currently running into some performance issues, and I assume it would be faster if we just kept everything in memory.
Question 1:
Is there a way to hook into the compaction process of a compacted topic? I would like a notification (to a different topic) for every key that is actually compacted (including the old and new value).
If this is somehow possible I could easily split the code bases apart and simplify the join.
Question 2:
Any other idea on how this can be solved more elegantly?
Your overall design makes sense.
About your join semantics: I guess you need to stick with the Processor API, as a regular KTable cannot provide what you want. It's also not possible to hook into the compaction process.
However, Kafka Streams also supports in-memory state stores: https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html#state-stores
RocksDB is used by default to allow the state to be larger than the available main memory. Spilling to disk with RocksDB is not primarily about reliability; however, it also has the advantage that stores can be recreated more quickly if an instance comes back online on the same machine, as it's not required to re-read the whole changelog topic.
Whether to split the app into two is your own decision, depending on how many resources you want to provide.
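Since compaction itself cannot be observed, the "new key or update" check has to stay in the processor, next to its state store. The following plain-Python sketch shows only that logic; it is not Processor API code. The dict `store` stands in for an in-memory state store, and `detect_updates` and the sample keys are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Update = Tuple[str, str, str]  # (key, old_value, new_value)


def detect_updates(store: Dict[str, str],
                   records: Iterable[Tuple[str, str]]) -> List[Update]:
    """Return an (old, new) notification for every record that hits an existing key."""
    updates: List[Update] = []
    for key, value in records:
        if key in store:
            # Existing key: this is what would be sent to the "updates" topic.
            updates.append((key, store[key], value))
        store[key] = value  # new keys and updates both end up in the store
    return updates


store: Dict[str, str] = {}
changes = [("row-1", "a"), ("row-2", "b"), ("row-1", "c")]
notifications = detect_updates(store, changes)
print(notifications, store)
```

The store has to hold the full keyspace either way, so moving it from RocksDB to memory changes where the roughly 4 GB lives, not whether it is needed.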

Delayed message consumption in Kafka

How can I produce/consume delayed messages with Apache Kafka? It seems that standard Kafka (and Java kafka-client) functionality doesn't have this feature. I know that I could implement it myself with a standard wait/notify mechanism, but that doesn't seem very reliable, so any advice and good practices are appreciated.
Found a related question, but it didn't help.
As I see it, Kafka is based on sequential reads from the file system and can only be used to read topics straight through, keeping message ordering. Am I right?
Indeed, Kafka's lowest-level structure is a partition, which is a sequence of events in a queue with incrementing offsets. You can't insert a record anywhere other than at the end, at the moment you produce it. There is no concept of delayed messages.
What do you want to achieve exactly?
Some possibilities in your case:
You want to push a message at a specific time (for example, an event "start job"). In this case, use a scheduled task (not from Kafka; use some standard way on your OS / language / custom app / whatever) to send the message at the given time. Consumers will then receive it at the proper time.
You want to send an event now, but it should not be taken into account by consumers yet. In this case, you can use a custom structure that includes a "time" in its payload. Consumers will have to understand this field and have custom processing to deal with it. For example: "start job at 2017-12-27T20:00:00Z". You could also use headers for this, but headers are not supported by all clients for now.
You can change the timestamp of the message sent. Internally, it would still be read in order, but some time-dependent functions would work differently, and the consumer could use the timestamp of the message for its action. This is much like the previous proposition, except that the timestamp is metadata of the event rather than part of the event payload. I would not use this personally; I only deal with timestamps when I proxy some events.
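The second option (a "time" field that consumers must honour) could look roughly like this on the consumer side. It is a sketch in plain Python with no Kafka calls: `DelayBuffer` is a made-up helper, due times are assumed to be plain numbers such as epoch seconds, and the caller is assumed to pass in the current time.

```python
import heapq
from typing import List, Tuple


class DelayBuffer:
    """Holds consumed messages until the time carried in their payload is reached."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seq = 0  # tie-breaker so equal due times keep arrival order

    def add(self, due_at: float, payload: str) -> None:
        heapq.heappush(self._heap, (due_at, self._seq, payload))
        self._seq += 1

    def pop_due(self, now: float) -> List[str]:
        """Return every buffered payload whose due time is not later than `now`."""
        due: List[str] = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


buffer = DelayBuffer()
buffer.add(20.0, "start job")  # consumed first, due later
buffer.add(10.0, "warm up")    # consumed second, due earlier

early = buffer.pop_due(now=5.0)
middle = buffer.pop_due(now=10.0)
late = buffer.pop_due(now=25.0)
print(early, middle, late)
```

Anything sitting in such a buffer exists only in the consumer's memory, so a real implementation would have to decide when offsets are committed to avoid losing held messages on a restart.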
For your last question: basically, yes, but with some notes:
Topics are actually split into partitions, and order is only preserved within a partition. All messages with the same key are sent to the same partition.
Most of the time, you only read from memory, except if you read old events. In that case, as those are read sequentially from disk, this is very fast.
You can choose where to begin reading (a given offset or a given time) and even change it at runtime.
You can parallelize reads across processes: multiple consumers can read the same topics without ever reading the same messages twice (each reading different partitions; see consumer groups).
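The first note (same key, same partition, order kept only within a partition) can be illustrated with a small sketch. The hash used here, `zlib.crc32`, is only a stand-in chosen for the example and is not the hash Kafka's own partitioner uses; `partition_for` and `route` are hypothetical helpers.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple

Message = Tuple[str, str]  # (key, value)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (stand-in hash, see lead-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages: Iterable[Message],
          num_partitions: int) -> DefaultDict[int, List[Message]]:
    """Append each message to the partition chosen by its key."""
    partitions: DefaultDict[int, List[Message]] = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [("a", "1"), ("b", "1"), ("a", "2"), ("c", "1"), ("a", "3")]
partitions = route(messages, num_partitions=3)

# All values for key "a" sit in a single partition, in production order.
a_partition = partitions[partition_for("a", 3)]
a_values = [value for key, value in a_partition if key == "a"]
print(a_values)
```

Ordering between keys that land in different partitions is not defined, which is why consumers in a group can each take a different partition and work in parallel.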