Kafka PersistentWindowStore rebalancing mechanics - apache-kafka

I am creating a 30-minute de-duplication store for a Kafka Streams application loosely based upon this confluent code (to solve a different problem to Kafka's exactly-once processing guarantee), and want to minimise topology startup time.
This code makes use of a persistent window store, which requires that I specify the number of log segments to make use of. Assuming I want to use 2 segments, and am using the default segment size of 1GB, does this mean that during rebalancing, the client will have to read 2GB of data before the application launches?

The segment parameter configures something different in Kafka Streams -- it is not related to the segments of the topics on the brokers; it just shares the same name.
For a windowed store, the retention time of the store is divided by the number of segments. If all data in a segment is older than the retention time, the complete segment is dropped and a new, empty segment is created. These segments exist only client-side.
The number of records that need to be restored depends only on the retention time (and your input data rate); it is independent of the segment size. The segment size only defines how fine-grained older records are expired.
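For illustration, here is a minimal sketch of how such a de-duplication store could be declared with the older Stores API that takes an explicit segment count (the store name, durations, and serdes are placeholders; Kafka Streams 2.1+ drops the segment parameter in favour of a Duration-based signature):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;
import java.util.concurrent.TimeUnit;

public class DedupStoreExample {
    public static StoreBuilder<WindowStore<String, Long>> dedupStore() {
        final long retentionMs = TimeUnit.MINUTES.toMillis(30); // how long "seen" keys are remembered
        final int numSegments = 2;                              // store-internal segments, not broker log segments
        final long windowSizeMs = TimeUnit.MINUTES.toMillis(30);
        // pre-2.1 signature with an explicit segment count; restore work depends on retention, not on numSegments
        return Stores.windowStoreBuilder(
                Stores.persistentWindowStore("dedup-store", retentionMs, numSegments, windowSizeMs, false),
                Serdes.String(),
                Serdes.Long());
    }
}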

Related

How does KStreams handle state store data when adding additional partitions?

I have one partition of data with one app instance and one local state store. It's been running for a time and has lots of stateful data. I need to update that to 5 partitions with 5 app instances. What happens to the one local state store when the partitions are added and the app is brought back online? Do I have to delete the local state store and start over? Will the state store be shuffled across the additional app instance state stores automatically according to the partitioning strategy?
Do I have to delete the local state store and start over?
That is the recommended way to handle it (cf. https://docs.confluent.io/platform/current/streams/developer-guide/app-reset-tool.html). As a matter of fact, if you change the number of input topic partitions and restart your application, Kafka Streams will fail with an error, because the state store has only one shard, while 5 shards would be expected given that you now have 5 input topic partitions.
Will the state store be shuffled across the additional app instance state stores automatically according to the partitioning strategy?
No. Also note that this applies to the data in your input topic as well. Thus, if you plan to partition your input data by key (i.e., when writing into the input topic upstream), old records would remain in their existing partition and thus would not be partitioned properly.
In general, it is recommended to over-partition your input topics upfront, so that you avoid needing to change the number of partitions later on. Thus, you might also consider going up to 10, or even 20, partitions instead of just 5.
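If you do wipe the local state and start over, the local-state part can be handled from the application itself; here is a minimal sketch (application id, bootstrap server, and topic names are placeholders -- resetting committed offsets and internal changelog/repartition topics is done separately with the application reset tool linked above):

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class WipeLocalStateExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("my-input-topic").to("my-output-topic");              // placeholder topology

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.cleanUp();  // deletes this instance's local state directory; must be called before start()
        streams.start();
    }
}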

Kafka KSQL equivalent of a VIEW for consumers that need a subset of the data

We are implementing an ETL in Kafka to load data from a single source into different target systems, each with its own consumer.
Every consumer needs a subset of the data, and for this we have the following topics:
topicA --> infinite retention, stores all the data from the source
topicB --> finite retention, populated by a KSQL statement with a WHERE clause
Example:
CREATE STREAM streamA WITH (KAFKA_TOPIC='topicA')
CREATE STREAM streamB WITH (KAFKA_TOPIC='topicB') AS SELECT * FROM streamA WHERE gender='MALE'
After that we have a sink connector or a consumer connected to topicB to consume only the data whose gender is male, or with some column names remapped.
Since we are running an initial import of a large amount of data, I would like to understand if there is any way to reduce the amount of storage required for streamB, since its data is just a replica of topicA.
In SQL I would implement this as a VIEW; how can I do that in KSQL?
My idea is to have a lower retention period for topicB, but this doesn't solve the issue with the initial load (e.g. if I have to load 10TB of data at the beginning, even with a 1-day retention period I would need 10TB + 5TB for one day). Is there any other solution?
I see the following options if you want to minimise the space that topicB takes up in your cluster:
Reduce your time-based retention setting for the topic, e.g. to 6 hours, 1 hour, or 30 minutes.
Use a size-based retention setting for the topic, e.g. 100MB per partition (see the sketch after this list for one way to apply both settings).
However, note that in each case it will be up to you to ensure your consumer is capable of consuming the data before the retention policy kicks in and deletes it. If data is deleted before the consumer can consume it, the consumer will log a warning.
Reduce the replication factor of the topic. You're hopefully running with a replication factor of at least 3 for your main 'golden truth' topic, so that it's resilient to machine failures, but you may be able to run with a lower factor for topicB, e.g. 2 or 1. This would halve or third the storage cost. Of course, if you lost a machine/disk during the process and you only had 1 replica, you'd lose data and need to recover from that.
Expand your Kafka cluster!
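As a minimal sketch of the first two options, the topic-level overrides retention.ms and retention.bytes can be applied with the Java AdminClient (the bootstrap server and the exact values are placeholders; the same change can also be made with the kafka-configs command-line tool):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TopicBRetentionExample {
    public static void main(final String[] args) throws Exception {
        final Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            final ConfigResource topicB = new ConfigResource(ConfigResource.Type.TOPIC, "topicB");
            final Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topicB, List.of(
                    // keep records for at most 1 hour ...
                    new AlterConfigOp(new ConfigEntry("retention.ms", "3600000"), AlterConfigOp.OpType.SET),
                    // ... and cap each partition at roughly 100MB
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "104857600"), AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}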

How does Apache Kafka use open file descriptors?

I wanted to know how Kafka uses open file descriptors. Why is it recommended to have a large number of open file descriptors? Does it impact producer and consumer throughput?
Brokers create and maintain file handles for each log segment file and for each network connection. The total number can be very large if the broker hosts many partitions and each partition has many log segment files.
I don't immediately see any performance decline caused by setting a large file-max, but page-cache misses do matter.
Kafka keeps one file descriptor open for every segment file, and it fails miserably if the limit is too low. I don't know if it affects consumer throughput, but I assume it doesn't, since Kafka appears to ignore the limit until it is reached.
The number of segment files is the number of partitions multiplied by some number that depends on the retention policy. The default retention policy is to start a new segment after one week (or 1GB, whichever occurs first) and to delete a segment when all data in it is more than one week old.
(Disclaimer: this answer is for Kafka 1.0, based on what I have learnt from one installation I have.)
As a rough estimate: if a broker hosts many partitions, it needs at least the following number of file descriptors just to track its log segment files:
(number of partitions) * (partition size / segment size)
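For a concrete, made-up example: a broker hosting 1,000 partitions with roughly 50GB retained per partition and the default 1GB segment size needs on the order of 1,000 * (50GB / 1GB) = 50,000 descriptors just for the segment log files; each segment's index files and every client and replication connection add to that, which is why limits of 100,000 or more are commonly recommended.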

How is Apache Kafka offset generated?

Went through
How is the kafka offset value computed?
From the kafka documentation on replication:
The purpose of adding replication in Kafka is for stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or more commonly, software upgrades.
From the kafka documentation on Efficiency:
The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks.
I did not see anywhere details regarding how the offset is generated for a topic. Will offsets be generated by a single machine in the cluster (in which case there is one master), or does Kafka have distributed logging that relies on some kind of clock synchronization and generates messages in a consistent order among all the nodes?
Any pointers or additional information will be helpful.
Offsets are not generated explicitly for each message, and messages also do not store their own offset.
A topic consists of partitions, and messages are written to partitions in chunks, called segments (on the file system, there is one folder per topic partition, and a segment corresponds to a file within that partition's folder).
Furthermore, an index is maintained per partition and stored alongside the segment files; it uses the offset of the first message of each segment as key and points to that segment. For all subsequent messages within a segment, the offset of a message can be computed from its logical position within the segment (plus the offset of the first message of the segment).
If you start a new topic (or rather a new partition), a first segment is generated and its start offset of zero is inserted into the index. Messages get written to the segment until it is full; then a new segment is started and its start offset gets added to the index -- the start offset of the new segment is simply the start offset of the previous segment plus the number of messages in that segment.
Thus, for each partition, the broker that hosts it (i.e., the leader for that partition) tracks the offsets for that partition by maintaining the index. If segments are deleted because the retention time has passed, the segment file gets deleted and its entry in the index is removed.
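To make this concrete (the topic name and offsets are made up for the example): segment files on disk are named after the base offset of their first message, so a partition folder such as my-topic-0 might contain 00000000000000000000.log for the segment starting at offset 0 and 00000000000000001500.log for the next segment starting at offset 1500, and a record at logical position 7 inside that second segment has offset 1500 + 7 = 1507.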

Kafka optimal retention and deletion policy

I am fairly new to kafka so forgive me if this question is trivial. I have a very simple setup for purposes of timing tests as follows:
Machine A -> writes to topic 1 (Broker) -> Machine B reads from topic 1
Machine B -> writes message just read to topic 2 (Broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop, filling up the space on my small broker very quickly. I'm experimenting with setting different values for log.retention.ms, log.retention.bytes, log.segment.bytes and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but it seemed this degraded performance; then I set them to the maximum my broker could take before being completely full, but again the performance degrades when a deletion occurs. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!
Apache Kafka uses a Log data structure to manage its messages. The Log data structure is basically an ordered set of Segments, where a Segment is a collection of messages. Apache Kafka provides retention at the Segment level instead of at the Message level. Hence, Kafka keeps removing the oldest Segments as they violate the retention policies.
Apache Kafka provides us with the following retention policies -
Time Based Retention
Under this policy, we configure the maximum time a Segment (and hence its messages) can live for. Once a Segment has spanned the configured retention time, it is marked for deletion or compaction, depending on the configured cleanup policy. The default retention time for Segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:
# Configures retention time in milliseconds
log.retention.ms=1680000
# Used if log.retention.ms is not set
log.retention.minutes=1680
# Used if log.retention.minutes is not set
log.retention.hours=168
Size based Retention
In this policy, we configure the maximum size of the Log data structure for a topic partition. Once the Log reaches this size, it starts removing Segments from its end. This policy is not popular, as it does not provide good visibility about message expiry. However, it can come in handy in a scenario where we need to control the size of a Log due to limited disk space.
Here are the parameters that you can set in your Kafka broker properties file:
# Configures maximum size of a Log
log.retention.bytes=104857600
So, according to your use case, you should configure log.retention.bytes so that your disk does not get full.
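As an illustrative (not tuned) starting point for a small broker like yours, the relevant broker properties could be combined as follows -- comparatively small segments so that old data can be deleted in small steps, a per-partition size cap well below the disk size, and the default delete delay; all values are placeholders to adjust for your disk size and throughput:
# cap each partition's Log at ~1GB
log.retention.bytes=1073741824
# roll small segments so deletion happens in small increments
log.segment.bytes=104857600
# additionally expire data older than 1 hour
log.retention.ms=3600000
# wait 1 minute before physically deleting a rolled segment file (the default)
log.segment.delete.delay.ms=60000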