Kafka Streams - Synchronization of commit vs Rocks DB delete - apache-kafka

In the low-level Processor API, I want to delete a key from the store immediately after the corresponding value is forwarded downstream. In the event of a rebalance or a commit failure, would the delete performed on the store roll back by itself, or stay permanently deleted? If the latter, is there a way to synchronize the store delete with the commit? Would the above behavior differ with caching enabled on the store vs. not enabled?

The behavior is independent of caching. If you run with the "at-least-once" guarantee, the store will not roll back.
If you need stricter guarantees, you can enable "exactly-once" processing, which provides the synchronization with the store that you ask for.
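The guarantee is selected with a single Streams configuration property. A minimal sketch, assuming a hypothetical application id `my-app` and a local broker:

```properties
# Sketch of a Streams configuration; application.id and servers are placeholders.
application.id=my-app
bootstrap.servers=localhost:9092

# Default ("at_least_once"): after a failure, records since the last commit are
# reprocessed, but store mutations such as delete() are NOT rolled back.
# processing.guarantee=at_least_once

# "exactly_once": offset commits, output records, and state-store changelog
# writes are committed in one transaction, so the store stays in sync with
# the commit.
processing.guarantee=exactly_once
```

On newer Kafka versions (2.8+), `exactly_once_v2` is the recommended value for the same guarantee.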

Related

General question: Firestore offline persistence and synchronization

I could not find detailed information in the documentation. I have several questions regarding the offline persistence of firestore.
I understood that firestore locally caches everything and syncs back once online. My questions:
If I attach an onCompleteListener to my setDocument method, it only fires when the device is online and has network access. But with offline persistence enabled, how can I detect that data has successfully been written to the cache (is it always successful?!)? I see the data is immediately there without any listener ever triggering.
What if I write data to the cache while the device is offline, and then it comes back online and everything gets synced, but some sort of error happens (so the onSuccessListener would contain an error, while the persistence cache already has the data)? How do I know that offline and online data are ALWAYS in sync once network connection is restored on all devices?
What about race conditions? Let's say two users update a document at the "same time" while the device is offline. What happens once it comes back online?
But the most pressing question is: right now I continue with my program flow when the onSuccessListener fires, but it never does as long as the device is offline (showing an indefinite progress bar forever). I still need to continue with my program (that's why we have offline persistence). How do I do this?
How can I detect that data has successfully been written to the cache
This is the case when the statement that writes the data has completed. If writing to the local cache fails, an exception is thrown from that write statement.
Your second point is hard to summarize, but:
Firestore keeps the pending writes separate from the snapshots it returns for local reads, and will update the cached snapshot correctly both for successful and for rejected writes.
If you want to know whether the snapshot you read contains any pending writes, you can check the hasPendingWrites property of its metadata.
What about race conditions? Let's say two users update a document at the "same time" while the device is offline. What happens once it comes back online?
The last write wins. If that's not what you need, use security rules to enforce your requirements on the server.
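As a sketch using the Firestore web SDK (assuming an already-initialized Firestore instance `db`; the collection path and data are placeholders), the cached write shows up immediately in snapshots, and the metadata tells you whether it has reached the server yet:

```javascript
// Placeholder document path; db is an initialized Firestore instance.
const docRef = db.collection('cities').doc('LA');

// With persistence enabled, the local cache is updated synchronously;
// this promise only resolves once the SERVER acknowledges the write,
// so don't block the UI on it while offline.
docRef.set({ name: 'Los Angeles' });

// To react to the cached write immediately, listen to snapshots with
// metadata changes included: hasPendingWrites is true while the write
// exists only in the local cache.
docRef.onSnapshot({ includeMetadataChanges: true }, (snapshot) => {
  const source = snapshot.metadata.hasPendingWrites ? 'local cache' : 'server';
  console.log('data from', source, snapshot.data());
});
```

This is also the answer to the "indefinite progress bar" problem: drive the UI from the snapshot listener rather than from the write's completion callback.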

Preventing deletion of noncurrent objects

I'm storing backups in Cloud Storage. A desirable property of such a backup is to ensure the device being backed up cannot erase the backups, to protect against ransomware or similar threats. At the same time, it is desirable to allow the backup client to delete objects so old backups can be pruned. (Because the backups are encrypted, it isn't possible to use lifecycle management to do this.)
The solution that immediately comes to mind is to enable object versioning and use lifecycle rules to retain object versions (deleted files) for a certain amount of time. However, I cannot see a way to allow the backup client to delete the current version, but not historical versions. I thought it might be possible to do this with an IAM condition, but the conditional logic doesn't seem flexible enough to parse out the object version. Is there another way I've missed?
The only other solution that comes to mind is to create a second bucket, inaccessible to the backup client, and use a Cloud Function to replicate the first bucket. The downside of that approach is the duplicate storage cost.
To answer this:
However, I cannot see a way to allow the backup client to delete the current version, but not historical versions
When you delete a live object, object versioning will retain a noncurrent version of it. When deleting the noncurrent object version, you will have to specify the object name along with its generation number.
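As an illustration with gsutil (bucket, object name, and generation number are hypothetical):

```shell
# List all generations of an object, including noncurrent ones (-a):
gsutil ls -a gs://my-backups/db.dump.enc

# Deleting the live object merely archives it as a new noncurrent version:
gsutil rm gs://my-backups/db.dump.enc

# Permanently removing a noncurrent version requires naming its generation:
gsutil rm "gs://my-backups/db.dump.enc#1660000000000001"
```

The `#generation` suffix is how a specific noncurrent version is addressed; a plain delete without it only affects the live object.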
Just to add, you may want to consider using a transfer job to replicate your data on a separate bucket.
Either way, both approaches (object versioning and replicating buckets) will incur additional storage costs.

What happens to the Kafka state store when you use the application reset tool?

What happens to your state store when you run the Kafka streams application reset tool to reset the app to a particular timestamp (say T-n)?
The document reads:
"Internal topics: Delete the internal topic (this automatically deletes any committed offsets)"
(Internal topics are used internally by the Kafka Streams application while executing, for example, the changelog topics for state stores)
Does this mean that I lose the state of my state store/RocksDB as it was at T-n?
For example, let's say I was processing a "Session Window" on the state store at that timestamp. It looks like I'll lose all existing data within that window during an application reset.
Is there possibly a way to preserve the state of the Session Window when resetting an application?
In other words, is there a way to preserve the state of my state store or RocksDB (at T-n) during an application reset?
The reset tool itself will not touch the local state store; however, it will delete the corresponding changelog topics. So yes, you effectively lose your state.
Thus, to keep your local state in sync with the changelog you should delete the local state, too, and start with an empty state: https://docs.confluent.io/current/streams/developer-guide/app-reset-tool.html#step-2-reset-the-local-environments-of-your-application-instances
It is currently not possible to also reset the state to a specific point in time.
The only "workaround" might be to not use the reset tool but bin/kafka-consumer-groups.sh to only modify the input topic offsets. This way you preserve the changelog topics and local state stores. However, when you restart the app, the state will of course be in its latest state, not the state at T-n. Not sure if this is acceptable.
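A sketch of both paths (application id, topic name, and timestamp are placeholders; exact flag names vary slightly across Kafka versions):

```shell
# Full reset: deletes internal topics, including the state-store changelogs,
# so the changelog-backed state is gone.
bin/kafka-streams-application-reset.sh \
  --application-id my-app \
  --input-topics input-topic \
  --bootstrap-servers localhost:9092

# Workaround that preserves changelogs and local stores: with the application
# stopped, rewind only the input-topic offsets of its consumer group
# (the group id equals the application.id).
bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group my-app \
  --topic input-topic \
  --reset-offsets --to-datetime 2020-01-01T00:00:00.000 \
  --execute
```

Note that the offset reset only succeeds while the consumer group is inactive, i.e. the Streams application must be shut down first.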

Lucene merge stop resume

Our product requires real-time Lucene index merging on an embedded device. The device may get shut down at any time. My team is exploring the possibility of resuming a merge after system reboot. In our POC, all consumers from SegmentMerger are overridden by a customized codec, and the whole merge process is divided into many fine-grained steps. When each step is done, its state is saved to disk to avoid redoing it after resume. In testing, this works. However, I am not able to determine how robust this solution is, or whether it is built on a flawed foundation.
Thanks in advance for your response

SQL Server 2000 merge replication – Undo Reinitialize All subscriptions

We currently have one publisher and four subscribers using merge replication. Due to a change in the schema, somebody performed a “Reinitialize All subscriptions” action without checking the “Upload the changes at the subscriber before reinitializing” option. When the replication agent for the first server was started, the database was cleaned out (all tables dropped and recreated), and all of the changes since the last successful synchronization were lost. At this point we decided to disable the replication schedule completely. My question is: is there a way to undo the “Reinitialize All subscriptions” action? Preferably in such a way that the changes at the subscribers aren’t lost.
Thanks in advance,
David
We were able to restore a backup of the publisher database from before the reinitialize action. (This was done after creating a separate backup of the current publisher database.) After this we manually re-applied the changes made since the reinitialize action, from the database containing that action to the restored backup (we used Redgate SQL Data Compare). At this point we were able to start the replication process and everything worked as it should. So apparently the snapshot information is stored entirely inside the database to which it applies.
A special thanks to Hilary Cotter for pointing this out.