How to always consume from latest offset in kafka-streams - apache-kafka

Our requirement is such that if a kafka-stream app is consuming a partition, it should start its consumption from the latest offset of that partition.
This seems doable using
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
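For context, a minimal sketch of the configuration this line sits in (the application id and bootstrap servers are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");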
Now, let's say that using the above configuration, the kafka-stream app started consuming data from the latest offset for a partition. And after some time, the app crashes. When the app comes back up, we want it to consume data from the latest offset of that partition, instead of from where it left off reading.
But I can't find anything that helps achieve this using the kafka-streams API.
P.S. We are using kafka-1.0.0.

That is not supported out of the box.
The configuration auto.offset.reset only triggers if there are no committed offsets, and there is no config to change this behavior.
You could, however, manipulate offsets manually before startup
using bin/kafka-consumer-groups.sh: the application.id is used as the
group.id, so you could "seek to end" before you restart the application.
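A rough sketch of doing that "seek to end" step programmatically instead of via the CLI; the topic name, application id, and broker address are placeholders, and it must run while the Streams application is stopped:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class SeekToEndBeforeRestart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "my-app");                     // must equal the Streams application.id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor("input-topic")) {   // placeholder topic
                partitions.add(new TopicPartition(info.topic(), info.partition()));
            }
            consumer.assign(partitions);
            consumer.seekToEnd(partitions);
            for (TopicPartition tp : partitions) {
                consumer.position(tp);   // forces the lazy seekToEnd() to resolve the end offset
            }
            consumer.commitSync();       // commits the current (end) positions for the group
        }
    }
}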
Update:
Since the 1.1.0 release, you can use the bin/kafka-streams-application-reset.sh tool to set starting offsets. To use the tool, the application must be offline. (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-171+-+Extend+Consumer+Group+Reset+Offset+for+Stream+Application)

Related

Kafka last commit timestamp

Is there any way to get the time when any consumer group last committed offsets? In other words, can we find out the timestamp of the last commit? If there is, does the Kafka Java library allow it to be obtained?
I have tried to find this out but haven't found anything satisfactory.
The Offset Explorer client shows the timestamp of the last commit by a consumer group. Does anyone have a clue how it manages to do so, and how similar clients might fetch details that are not available through the programming APIs?

KStream/KsqlDb application with Persistent State Store in Kubernetes

Does anyone here have experience deploying a KStream/KsqlDb application with a persistent state store in a Kubernetes environment without losing auto-scalability? I.e., automatic creation of the state store volume and state for a new container, and rebalancing once a container is gone, without losing the topic-partition-to-data-volume mapping. Is it possible?
When a persistent state store disappears (or gets deleted), will KStream restore the state store from the changelog topic automatically, or do we have to manually reset the consumer offsets to earliest on the changelog topic consumer?
This is hard to achieve. However, you could use standby tasks to get HA for this case.
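A minimal sketch of enabling those standby tasks via num.standby.replicas (application id, broker, and state directory are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");              // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);               // keep one warm replica of each store on another instance
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");   // placeholder persistent-volume mount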
You don't need to do anything. Kafka Streams will automatically restore state from the changelog.

What happens to the Kafka state store when you use the application reset tool?

What happens to your state store when you run the Kafka streams application reset tool to reset the app to a particular timestamp (say T-n)?
The document reads:
"Internal topics: Delete the internal topic (this automatically deletes any committed offsets)"
(Internal topics are used internally by the Kafka Streams application while executing, for example, the changelog topics for state stores)
Does this mean that I lose the state of my state store/RocksDB as it was at T-n?
For example, let's say I was processing a "Session Window" on the state store at that timestamp. It looks like I'll lose all existing data within that window during an application reset.
Is there possibly a way to preserve the state of the Session Window when resetting an application?
In other words, is there a way to preserve the state of my state store or RocksDB (at T-n) during an application reset?
The reset tool itself will not touch the local state store; however, it will delete the corresponding changelog topics. So yes, you effectively lose your state.
Thus, to keep your local state in sync with the changelog, you should actually delete the local state, too, and start with an empty state: https://docs.confluent.io/current/streams/developer-guide/app-reset-tool.html#step-2-reset-the-local-environments-of-your-application-instances
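That step can also be done programmatically with KafkaStreams#cleanUp(); a minimal sketch, assuming the topology and configuration are built elsewhere as builder and props:

import org.apache.kafka.streams.KafkaStreams;

// builder (StreamsBuilder) and props (Properties) assumed to be defined as in your application
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.cleanUp();   // wipes this instance's local state stores; only allowed before start() or after close()
streams.start();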
It is currently not possible to also reset the state to a specific point.
The only "workaround" might be to not use the reset tool but bin/kafka-consumer-groups.sh to only modify the input topic offsets. This way you preserve the changelog topics and local state stores. However, when you restart the app the state will of course be in its last state. Not sure if this is acceptable.

Kafka Streams Application Updates

I've built a Kafka Streams application. It's my first one, so I'm moving out of a proof-of-concept mindset into a "how can I productionalize this?" mindset.
The tl;dr version: I'm looking for kafka streams deployment recommendations and tips, specifically related to updating your application code.
I've been able to find lots of documentation about how Kafka and the Streams API work, but I couldn't find anything on actually deploying a Streams app.
The initial deployment seems to be fairly easy - there is good documentation for configuring your Kafka cluster, then you must create topics for your application, and then you're pretty much fine to start it up and publish data for it to process.
But what if you want to upgrade your application later? Specifically, if the update contains a change to the topology. My application does a decent amount of data enrichment and aggregation into windows, so it's likely that the processing will need to be tweaked in the future.
My understanding is that changing the order of processing or inserting additional steps into the topology will cause the internal ids for each processing step to shift. This means that, at best, new state stores will be created and the previous state lost, and at worst, processing steps will read from an incorrect state store topic when starting up. This implies that you either have to reset the application or give the new version a new application id. But there are some problems with that:
If you reset the application or give a new id, processing will start from the beginning of source and intermediate topics. I really don't want to publish the output to the output topics twice.
Currently "in-flight" data would be lost when you stop your application for an upgrade (since that application would never start again to resume processing).
The only way I can think to mitigate this is to:
Stop data from being published to source topics. Let the application process all messages, then shut it off.
Truncate all source and intermediate topics.
Start new version of application with a new app id.
Start publishers.
This is "okay" for now since my application is the only one reading from the source topics, and intermediate topics are not currently used beyond feeding to the next processor in the same application. But, I can see this getting pretty messy.
Is there a better way to handle application updates? Or are my steps generally along the lines of what most developers do?
I think you have a full picture of the problem here and your solution seems to be what most people do in this case.
At the latest Kafka Summit, this question was asked after the talk by Gwen Shapira and Matthias J. Sax about Kubernetes deployment. The response was the same: if your upgrade contains topology modifications, rolling upgrades can't be done.
It looks like there is no KIP about this for now.

Data Re-processing with specific starting point in Kafka Streams

I am investigating data reprocessing with Kafka Streams. There is a nice tool available for data reprocessing by resetting the streaming application: the Application Reset Tool.
But this tool usually resets the application state to zero and reprocesses everything again from scratch.
There are scenarios in which we want to reprocess the data from a specific point, e.g.:
Bug fix in the current application
Updating the application with some additional processor and run with the same application ID
Flink, for example, has a Savepoints concept, which can restore previous operator states and allows adding new operators without any error.
I also referred to the following documents:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
I would like to know:
Is there any checkpointing type of mechanism available in KStream?
How can we re-run the Kafka Streams application from a specific point?
What happens if we change the code in one of the application instance and run with the old application ID?
Kafka Streams does not have a savepoint concept at this point (version 1.0).
Not at the moment (v1.0)
Yes. In the next release this will be part of the reset tool directly. In 1.0, you can use bin/kafka-consumer-groups.sh to commit a start offset for your application (note, application.id == group.id). For older Kafka versions, you could build a custom tool to commit start offsets (see the sketch at the end of this answer).
In general, it breaks. Thus, you need to use a new application.id (it's a known issue and will be fixed in future releases).
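Regarding point 2, a rough sketch of such a custom tool that commits a start offset for the application's group; the topic name, application id, broker, and the offset itself are placeholders, and it must run while the Streams application is stopped:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class CommitStartOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "my-app");                     // must equal the Streams application.id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        long startOffset = 1000L;   // placeholder; could be looked up via offsetsForTimes(), for example

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (PartitionInfo info : consumer.partitionsFor("input-topic")) {   // placeholder topic
                offsets.put(new TopicPartition(info.topic(), info.partition()),
                            new OffsetAndMetadata(startOffset));
            }
            consumer.commitSync(offsets);   // the Streams app will resume from these committed offsets
        }
    }
}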