Auto expiration and deletion of znodes in Zookeeper? - apache-zookeeper

Is there a way to have a znode auto expire and delete (under Zookeeper control) after some length of time? I don't need to keep a znode around after that point. I want to conserve resources.
If your answer is "no, your application has to handle that", would you kindly point me to some documentation that makes that clear? (Right now, I suspect this may be the case, but I don't want to assume it too quickly.)
If the answer is "not currently, but Zookeeper could be extended to do that", then I would be especially thankful for a suggestion of a good starting point for making such an enhancement.

According to Patrick on The Zookeeper Mailing List: ZNODE time to live:
There is no TTL like feature in the current implementation.
That was 26 Apr 2012, which would correspond to version 3.3.5 according to the list of Apache ZooKeeper Releases.
I skimmed the release notes for 3.3.6, 3.4.4, and 3.4.5 and found no mention of "TTL" or "time to live" or anything along those lines.

Zookeeper 3.5.3 contains support for TTL on znodes.
A new create call was added where you can provide a TTL for the node in milliseconds.
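For reference, a minimal sketch of creating a TTL node with the plain ZooKeeper Java client (the connection string, path, data, and TTL value are illustrative; as far as I know, TTL nodes also have to be enabled on the server, e.g. via the zookeeper.extendedTypesEnabled property):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class TtlNodeExample {
    public static void main(String[] args) throws Exception {
        // Connect to a local ZooKeeper ensemble (address and session timeout are illustrative).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        // Create a persistent node that the server may delete once it has not been
        // modified for the given TTL (in milliseconds) and has no children.
        Stat stat = new Stat();
        zk.create("/my-ttl-node",
                  "payload".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT_WITH_TTL,
                  stat,
                  60_000L); // 60-second TTL

        zk.close();
    }
}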

Related

Spring Kafka Consumer Configs - default values and at least once semantics

I am writing kafka consumer using spring-kafka template.
When I am instantiating consumers, Spring kafka takes in parameters like the following.
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, fetchMaxBytes);
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, maxPartitionFetchBytes);
I read the documentation and it looks like there are lots of other parameters that can be passed as consumer configs too. Interestingly, each of these parameters has a default value.
My questions are:
On what basis were these default values arrived at?
Is there a real need to change these values, and if so, which ones? (IMHO this is on a case-by-case basis, but I would still like to hear it from experts.)
The delivery semantic we have is at-least-once. For this at-least-once delivery semantic, can these configs be left untouched and still handle a high volume of data?
Any pointers or answers would be of great help in clarifying my doubts.
The default values are an attempt to serve most of the use cases around Kafka. However, it would be an illusion to assume that this many different configurations can be tuned to serve all use cases.
A good starting point for understanding the default values is the plain Kafka consumer configuration reference, and for Spring its own documentation. In the Confluent docs you will also find an "Importance" rating for each configuration; if the importance is set to high, it is recommended to really think about it. I have given some more background on the importance here.
at-least-once
For at-least-once semantics you want to control the commits of the consumed messages. For this, enable.auto.commit needs to be set to false (which is the default value since Spring Kafka version 2.3). In addition, the AckMode is by default set to BATCH, which is the basis for at-least-once semantics.
So, depending on your Spring version, it looks like you can leave the default configuration to achieve at-least-once semantics.
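As an illustration, here is a minimal sketch of a Spring Kafka listener container factory wired for at-least-once processing; the bootstrap server, group id, and bean name are made up, and the point is simply enable.auto.commit=false combined with AckMode.BATCH as described above:

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Configuration
public class AtLeastOnceConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        // Let the listener container commit offsets instead of the Kafka client itself.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(new DefaultKafkaConsumerFactory<>(props));
        // BATCH ack mode: offsets are committed after the records returned by a poll
        // have been processed, which gives at-least-once delivery.
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
        return factory;
    }
}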

KStreamWindowAggregate 2.0.1 vs 2.5.0: skipping records instead of processing

I've recently upgraded my kafka streams from 2.0.1 to 2.5.0. As a result I'm seeing a lot of warnings like the following:
org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate$KStreamWindowAggregateProcessor Skipping record for expired window. key=[325233] topic=[MY_TOPIC] partition=[20] offset=[661798621] timestamp=[1600041596350] window=[1600041570000,1600041600000) expiration=[1600059629913] streamTime=[1600145999913]
There seems to be new logic in the KStreamWindowAggregate class that checks whether a window has closed. If it has closed, the messages are skipped. In 2.0.1 these messages were still processed.
Question
Is there a way to get the same behavior as before? With this upgrade I'm seeing lots of gaps in my data and I'm not sure how to solve this, as previously these gaps were not seen.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expiring windows?
Update
While exploring further, I can indeed see that it is related to the grace period in ms. In my custom TimestampExtractor (which has the logic to use the timestamp from the payload instead of the normal record timestamp), I can see that for the expired-window warnings the incoming timestamp is indeed more than 24 hours ahead of the event time from the payload.
I assume this is caused by consumer lag of over 24 hours.
The timestamp extractor extract method has a partition time which according to the docs:
partitionTime: the highest extracted valid timestamp of the current record's partition (could be -1 if unknown)
so is this the create time of the record on the topic? And is there a way to influence this in a way that my records are no longer skipped?
In 2.0.1 these messages were still processed.
That is a little bit surprising (even if I would need to double-check the code), at least for the default config. By default, the store retention time is set to 24h, and thus in 2.0.1 messages older than 24h should also not have been processed, because the corresponding state had already been purged. If you changed the store retention time (via Materialized#withRetention) to a larger value, you would also need to increase the window grace period via the TimeWindows#grace() method accordingly.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expiring windows?
Not sure what you mean by this or how you actually do it. The old and new logic are similar with regard to how long a window is stored (the retention time config). The new part is the grace period, which you can increase to the same value as the retention time if you wish.
About "partition time": it is computed based on whatever the TimestampExtractor returns. For your case, it's the max of whatever you have extracted from the message payloads.

Kafka connect error handling and improved logging

I was trying to leverage some enhancements to Kafka Connect in the 2.0.0 release, as specified by this KIP https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect, and I came across this good blog post by Robin: https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues.
Here are my questions
I have set errors.tolerance=all in my connector config. If I understand correctly, it will not fail on bad records and will move forward. Is my understanding correct?
In my case, the consumer doesn't fail and stays in the RUNNING state (which is expected), but the consumer offsets don't move forward for the partitions with the bad records. Any guess why this may be happening?
I have set errors.log.include.messages and errors.log.enable to true for my connector but I don't see any additional logging for the failed records. The logs are similar to what I used to see before enabling these properties. I didn't see any message like this https://github.com/apache/kafka/blob/5a95c2e1cd555d5f3ec148cc7c765d1bb7d716f9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/errors/LogReporter.java#L67
Some Context:
In my connector, I do some transformations and validations for every record, and if any of these fail, I throw a RetriableException. Earlier I was throwing RuntimeException, but I changed to RetriableException after reading the comments for the RetryWithToleranceOperator class.
I have tried to keep it brief but let me know if any additional context is required.
Thanks so much in advance!
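For reference, a minimal sketch of the error-handling properties discussed above, expressed as a Java config map; the connector name, class, topics, and DLQ topic are placeholders, not taken from the question:

import java.util.HashMap;
import java.util.Map;

public class ErrorTolerantConnectorConfig {

    // KIP-298 settings that enable "log and continue" behavior for record-level failures.
    static Map<String, String> config() {
        Map<String, String> props = new HashMap<>();
        props.put("name", "my-sink-connector");
        props.put("connector.class", "com.example.MySinkConnector");
        props.put("topics", "my-topic");

        // Tolerate all record-level failures instead of failing the task.
        props.put("errors.tolerance", "all");
        // Log each failure, including the contents of the record that caused it.
        props.put("errors.log.enable", "true");
        props.put("errors.log.include.messages", "true");
        // Optionally route failed records to a dead letter queue (sink connectors only).
        props.put("errors.deadletterqueue.topic.name", "my-sink-dlq");

        return props;
    }
}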

Clean up changelog topic backing session windows

We are aggregating in session windows by using the following code:
.windowedBy(SessionWindows.with(...))
.aggregate(..., ..., ...)
The state store that is created for us automatically is backed by a changelog topic with cleanup.policy=compact.
When redeploying our topology, we found that restoring the state store took much longer than expected (10+ minutes). The explanation seems to be that even though a session has been closed, it is still present in the changelog topic.
We noticed that session windows have a default maintain duration of one day but even after the inactivity + maintain durations have been exceeded, it does not look like messages are removed from the changelog topic.
a) Do we need to manually delete "old" (by our definition) messages to keep the size of the changelog topic under control? (This may be the case as hinted to by [1].)
b) Would it be possible to somehow have the changelog topic created with cleanup.policy=compact,delete and would that even make sense?
[1] A session store seems to be created internally by Kafka Stream's UnwindowedChangelogTopicConfig (and not WindowedChangelogTopicConfig) which may make this comment from Kafka Streams - reducing the memory footprint for large state stores relevant: "For non-windowed store, there is no retention policy. The underlying topic is compacted only. Thus, if you know, that you don't need a record anymore, you would need to delete it via a tombstone. But it's a little tricky to achieve... – Matthias J. Sax Jun 27 '17 at 22:07"
You are hitting a bug. I just created a ticket for this: https://issues.apache.org/jira/browse/KAFKA-7101.
I would recommend that you modify the topic config manually to fix the issue in your deployment.
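Since the recommendation is to change the topic config manually, here is a rough sketch of doing that with the Java AdminClient; the changelog topic name and retention value are placeholders, and incrementalAlterConfigs needs brokers and clients on 2.3 or newer (on older versions you would use the kafka-configs CLI instead):

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ChangelogCleanupPolicyFix {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Internal changelog topics are named <application.id>-<store name>-changelog;
            // this one is a placeholder.
            ConfigResource changelog =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-app-my-session-store-changelog");

            Collection<AlterConfigOp> ops = Arrays.asList(
                    // Allow old segments to be deleted as well as compacted.
                    new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact,delete"),
                                      AlterConfigOp.OpType.SET),
                    // Retain data long enough to cover the window inactivity gap plus
                    // the maintain duration (24h here as a placeholder).
                    new AlterConfigOp(new ConfigEntry("retention.ms",
                                      String.valueOf(24L * 60 * 60 * 1000)),
                                      AlterConfigOp.OpType.SET));

            admin.incrementalAlterConfigs(Collections.singletonMap(changelog, ops)).all().get();
        }
    }
}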

Kafka Streams - Low-Level Processor API - RocksDB TimeToLive(TTL)

I'm experimenting with the low-level Processor API. I'm doing data aggregation on incoming records using the Processor API and writing the aggregated records to RocksDB.
However, I want records added to RocksDB to stay active only for a 24-hour period; after 24 hours a record should be deleted. This can be done by changing the TTL settings, but there is not much documentation where I can get some help on this.
How do I change the TTL value? What Java API should I use to set the TTL to 24 hours, and what is the current default TTL setting?
I believe this is not currently exposed via the api or configuration.
RocksDBStore passes a hard-coded TTL when opening a RocksDB:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBStore.java#L158
and the hardcoded value is simply TTL_SECONDS = TTL_NOT_USED (-1) (see line 79 in that same file).
There are currently two open tickets regarding exposing TTL support in the state stores: KAFKA-4212 and KAFKA-4273:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20text%20~%20%22rocksdb%20ttl%22
I suggest you comment on one of them, describing your use case, to get them moving forward.
In the interim, if you need the TTL functionality right now, state stores are pluggable and the RocksDBStore source is readily available, so you can fork it and set your TTL value (or, as the pull request associated with KAFKA-4273 proposes, source it from the configs).
I know this is not ideal and sincerely hope someone comes up with a more satisfactory answer.
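If you do go the fork route, the core change would be to open RocksDB with a real TTL; the TtlDB class from rocksdbjni provides that. A minimal standalone sketch of that API (the path and the 24h value are illustrative; this is only the RocksDB side, not a full Kafka Streams StateStore implementation):

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.TtlDB;

public class TtlRocksDbExample {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        try (Options options = new Options().setCreateIfMissing(true);
             // Open a TTL database: entries become eligible for removal during
             // compaction once they are older than the given TTL (in seconds).
             TtlDB db = TtlDB.open(options, "/tmp/ttl-store", 24 * 60 * 60, false)) {

            db.put("key".getBytes(), "value".getBytes());
            byte[] value = db.get("key".getBytes());
            System.out.println(new String(value));
        }
    }
}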