KStream/KsqlDb application with Persistent State Store in Kubernetes

Does anyone here have experience deploying a KStream/KsqlDb application with a persistent state store in a Kubernetes environment without losing auto-scalability? I.e., automatic creation of the state store volume and state for a new container, and rebalancing once a container is gone, without losing the topic-partition-to-data-volume mapping. Is it possible?
When a persistent state store disappears (or gets deleted), will KStream restore the state store from the changelog topic automatically, or do we have to manually reset the consumer offsets to earliest on the changelog topic consumer?

This is hard to achieve. However, you could use standby tasks to get HA for this case.
You don't need to do anything. Kafka Streams will automatically restore state from the changelog.
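To make the standby-task suggestion concrete, here is a minimal configuration sketch; the application id, broker address, replica count, and state directory (e.g. the pod's persistent-volume mount) are placeholder assumptions for your environment:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StandbyConfigExample {
    // Builds a Streams config with standby replicas enabled.
    static Properties streamsConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");    // placeholder broker
        // Keep one warm standby copy of each state store on another instance,
        // so a failed pod's tasks can move over without a full changelog restore.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // Point the state directory at the pod's persistent volume mount.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/kafka-streams");
        return props;
    }
}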

Related

What happens to the Kafka state store when you use the application reset tool?

What happens to your state store when you run the Kafka streams application reset tool to reset the app to a particular timestamp (say T-n)?
The document reads:
"Internal topics: Delete the internal topic (this automatically deletes any committed offsets)"
(Internal topics are used internally by the Kafka Streams application while executing, for example, the changelog topics for state stores)
Does this mean that I lose the state of my state store/RocksDB as it was at T-n?
For example, let's say I was processing a "Session Window" on the state store at that timestamp. It looks like I'll lose all existing data within that window during an application reset.
Is there possibly a way to preserve the state of the Session Window when resetting an application?
In other words, is there a way to preserve the state of my state store or RocksDB (at T-n) during an application reset?
The reset tool itself will not touch the local state store; however, it will delete the corresponding changelog topics. So yes, you effectively lose your state.
Thus, to keep your local state in sync with the changelog you should actually delete the local state, too, and start with an empty state: https://docs.confluent.io/current/streams/developer-guide/app-reset-tool.html#step-2-reset-the-local-environments-of-your-application-instances
It is currently not possible to also reset the state to a specific point.
The only "workaround" might be to not use the reset tool but bin/kafka-consumer-groups.sh to only modify the input topic offsets. This way you preserve the changelog topics and local state stores. However, when you restart the app the state will of course be in its last state. Not sure if this is acceptable.
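A rough sketch of that workaround in plain Java (instead of bin/kafka-consumer-groups.sh): it commits offsets for the input topic only, so the changelog topics and local stores are left untouched. The application must be stopped while this runs; the broker address, topic name, group id, and timestamp are placeholder assumptions:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class ResetInputOffsets {
    public static void main(String[] args) {
        String applicationId = "my-streams-app";   // application.id == group.id
        String inputTopic = "input-topic";         // placeholder input topic
        long timestamp = 1_500_000_000_000L;       // the point in time to rewind to (epoch millis)

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, applicationId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Only the input topic's partitions are touched; changelog topics stay intact.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor(inputTopic)) {
                partitions.add(new TopicPartition(inputTopic, info.partition()));
            }
            consumer.assign(partitions);

            // Find the offsets matching the target timestamp and commit them for the group.
            Map<TopicPartition, Long> query = new HashMap<>();
            for (TopicPartition tp : partitions) {
                query.put(tp, timestamp);
            }
            Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
            for (Map.Entry<TopicPartition, OffsetAndTimestamp> e
                    : consumer.offsetsForTimes(query).entrySet()) {
                if (e.getValue() != null) {
                    commits.put(e.getKey(), new OffsetAndMetadata(e.getValue().offset()));
                }
            }
            consumer.commitSync(commits);
        }
    }
}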

Kafka Streams - Synchronization of commit vs RocksDB delete

In the low-level Processor API, I want to delete a key from the store immediately after the corresponding value is forwarded downstream. In the event of a rebalance or a commit failure, would the delete performed on the store roll back by itself, or stay permanently deleted? If the latter, is there a way to synchronize the store delete with the commit? Would the above behavior differ with caching enabled on the store vs. not enabled?
The behavior is independent of caching, and if you run with "at-least-once" guarantee the store will not roll back.
If you need stricter guarantees you can enable "exactly-once" processing that will provide the synchronization with the store you ask for.
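To make this concrete, here is a minimal sketch of the forward-then-delete pattern using the older Processor API from that Kafka version, together with the exactly-once setting; the store name "my-store" and the String types are assumptions:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Forward the value downstream, then delete the key from the store. With
// at-least-once, the delete is not rolled back on a commit failure; enabling
// exactly-once ties the store update, the forward, and the commit together.
public class ForwardThenDeleteProcessor extends AbstractProcessor<String, String> {

    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        store = (KeyValueStore<String, String>) context.getStateStore("my-store"); // assumed store name
    }

    @Override
    public void process(String key, String value) {
        context().forward(key, value); // emit downstream first ...
        store.delete(key);             // ... then remove the key from the store
    }

    // Config fragment enabling exactly-once processing for the application.
    static Properties exactlyOnceProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}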

How to always consume from latest offset in kafka-streams

Our requirement is such that if a kafka-streams app is consuming a partition, it should start its consumption from the latest offset of that partition.
This seems doable using:
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
Now, let's say that using the above configuration, the kafka-streams app started consuming data from the latest offset of a partition. After some time, the app crashes. When the app comes back up, we want it to consume data from the latest offset of that partition, instead of from where it last left off reading.
But I can't find anything in the kafka-streams API that helps achieve this.
P.S. We are using kafka-1.0.0.
That is not supported out of the box.
The auto.offset.reset configuration only triggers if there are no committed offsets, and there is no config to change this behavior.
You could manipulate offsets manually before startup using bin/kafka-consumer-groups.sh though: the application.id is the group.id, and you could "seek to end" before you restart the application.
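A rough Java equivalent of that "seek to end" step, to run while the application is stopped; the broker address, topic name, and group id are placeholders:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class SeekToEndBeforeRestart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-app"); // the application.id acts as group.id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor("input-topic")) {
                partitions.add(new TopicPartition("input-topic", info.partition()));
            }
            consumer.assign(partitions);
            consumer.seekToEnd(partitions);

            // position() resolves the actual end offsets so they can be committed.
            Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
            for (TopicPartition tp : partitions) {
                commits.put(tp, new OffsetAndMetadata(consumer.position(tp)));
            }
            consumer.commitSync(commits);
        }
    }
}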
Update:
Since the 1.1.0 release, you can use the bin/kafka-streams-application-reset.sh tool to set starting offsets. To use the tool, the application must be offline. (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-171+-+Extend+Consumer+Group+Reset+Offset+for+Stream+Application)

Data Re-processing with specific starting point in Kafka Streams

I am investigating data reprocessing with Kafka Streams. There is a nice tool available for reprocessing data by resetting the streams application: the Application Reset Tool.
But this tool resets the application state to zero and reprocesses everything from scratch.
There are scenarios where we want to reprocess the data from a specific point, e.g.:
A bug fix in the current application
Updating the application with an additional processor and running it with the same application ID
Flink, for example, has a Savepoints concept, which can restore previous operator state and add new operators without any error.
I also referred to the following documents:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
I would like to know:
Is there any checkpointing type of mechanism available in KStream?
How can we re-run the Kafka Streams application from a specific point?
What happens if we change the code in one of the application instances and run with the old application ID?
Kafka Streams does not have a savepoint concept at this point (version 1.0).
Not at the moment (v1.0)
Yes. In the next release this will be part of the reset tool directly. In 1.0, you can use bin/kafka-consumer-groups.sh to commit a start offset for your application (note: application.id == group.id). For older Kafka versions, you could build a custom tool to commit start offsets.
In general, it breaks. Thus, you need to use a new application.id (it's a known issue and will be fixed in future releases).

How to delete an asset using the DAM Update Asset workflow in AEM 6.1?

I should be able to copy renditions of the asset from the worker instance to the master instance and then delete the asset in the worker instance,
using the DAM Update Asset offloading workflow.
In my opinion it's not good practice to modify the DAM Update Asset workflow on the worker instance:
This whole offloading is based on the Sling Discovery and eventing mechanism, which requires the offloaded asset to be sent back (read: reverse replication) to the leader instance.
Adding a step within the Update Asset workflow may cause issues with reverse replication of the asset.
You will have to build something independent of the offloading process to achieve this deletion. There are multiple ways to do it -
One possible way -
Have a JMS-based implementation to monitor reverse replication
If reverse replication is successful, either delete the asset or mark the asset for deletion (highly recommended)
If following the approach of marking assets for deletion, set up a cleanup task that runs only on the worker instance (scheduled at a convenient time). This cleanup task identifies the assets marked for deletion and processes them; a rough sketch of such a task follows below.
IMHO marking the asset for deletion is the better approach, as it's more performant and efficient. All assets are processed at once during off-peak time.
There are other ways to do this as well, but they would require a lot of custom code.
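To make the cleanup-task idea concrete, here is a minimal sketch using plain JCR APIs; how the task obtains its session, how it is scheduled, the markedForDeletion property, and the /content/dam search root are all assumptions, not prescribed names:

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

// Scheduled cleanup task for the worker instance: find assets that were marked
// for deletion after successful reverse replication and remove them.
public class MarkedAssetCleanupTask implements Runnable {

    private final Session session; // assumed to be provided, e.g. a service session

    public MarkedAssetCleanupTask(Session session) {
        this.session = session;
    }

    @Override
    public void run() {
        try {
            QueryManager qm = session.getWorkspace().getQueryManager();
            // JCR-SQL2 query for asset nodes carrying the hypothetical deletion marker.
            Query query = qm.createQuery(
                    "SELECT * FROM [dam:Asset] AS a "
                    + "WHERE ISDESCENDANTNODE(a, '/content/dam') "
                    + "AND a.[markedForDeletion] = 'true'",
                    Query.JCR_SQL2);
            NodeIterator nodes = query.execute().getNodes();
            while (nodes.hasNext()) {
                Node asset = nodes.nextNode();
                asset.remove();
            }
            session.save();
        } catch (RepositoryException e) {
            // Log and retry on the next scheduled run.
        }
    }
}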
Updates -
Tapping into Reverse Replication -
You need to get into the details of how reverse replication works.
Content to be reverse replicated is pushed to the OUTBOX.
If you look at /etc/replication/agents.publish/outbox/jcr:content on your local instance, the property transportUri defaults to repo://var/replication/outbox, i.e. content to be reverse replicated is pushed to /var/replication/outbox.
Now look at /libs/cq/replication/components/revagent/revagent.jsp; this is the logic that runs on the receiving instance.
Going through the above will give you a deeper understanding of how reverse replication works.
Now you have two options to implement what you want -
To check the replication status, tap into the replication queue as the code in /libs/cq/replication/components/revagent/revagent.jsp does. This is the code that executes on the author instance where the content is reverse replicated; in your case that is the leader instance. You will have to rework this code to make it work on the worker instance. To be more specific about the implementation, your code will update the line Agent agent = agentMgr.getAgents().get(id);, where id is the OUTBOX agent id.
Have an event listener monitor the outbox. Check the payload that arrives for replication and use it for your functionality (a rough sketch follows at the end of this answer).
What I have mentioned is a crude approach that doesn't cover the failover/recovery use cases, i.e. how you would handle the deletion if your replication queue is blocked for any reason and the images have not been pushed back to the leader.
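For the second option, here is a minimal sketch of a JCR observation listener watching the reverse-replication outbox on the worker instance; how the session is obtained, how an outbox entry is mapped back to its asset, and the markAssetForDeletion helper are assumptions:

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

// Watches /var/replication/outbox for new entries, i.e. content queued for
// reverse replication, and marks the corresponding asset for later cleanup.
public class OutboxListener implements EventListener {

    private static final String OUTBOX_PATH = "/var/replication/outbox";

    public void register(Session session) throws RepositoryException {
        ObservationManager om = session.getWorkspace().getObservationManager();
        om.addEventListener(this, Event.NODE_ADDED, OUTBOX_PATH, true, null, null, false);
    }

    @Override
    public void onEvent(EventIterator events) {
        while (events.hasNext()) {
            Event event = events.nextEvent();
            try {
                // Inspect the queued payload and decide whether the corresponding
                // asset can be marked for deletion once it reaches the leader.
                markAssetForDeletion(event.getPath());
            } catch (RepositoryException e) {
                // Log and continue; a blocked queue must not kill the listener.
            }
        }
    }

    private void markAssetForDeletion(String outboxEntryPath) {
        // Hypothetical helper: resolve the asset path from the outbox entry and
        // set the "marked for deletion" flag that the scheduled cleanup task picks up.
    }
}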