Data Re-processing with specific starting point in Kafka Streams

I am investigating data reprocessing with Kafka Streams. There is a nice tool available for reprocessing data by resetting a Streams application: the Application Reset Tool.
However, this tool resets the application state to zero and reprocesses everything from scratch.
There are scenarios where we want to reprocess the data from a specific point instead, e.g.:
Bug fix in the current application
Updating the application with an additional processor and running it with the same application ID
Flink, for example, has the concept of savepoints, which can restore previous operator states and allow adding new operators without errors.
I also referred to the following documents:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
I would like to know:
Is there any checkpointing type of mechanism available in KStream?
How can we re-run the Kafka Streams application from a specific point?
What happens if we change the code in one of the application instances and run it with the old application ID?

Kafka Streams does not have a savepoint concept at this point (version 1.0).
Not at the moment (v1.0).
Yes. In the next release this will be part of the reset tool directly. In 1.0, you can use bin/kafka-consumer-groups.sh to commit a start offset for your application (note: application.id == group.id). For older Kafka versions, you could build a custom tool to commit start offsets; a rough sketch is shown below.
In general, it breaks. Thus, you need to use a new application.id (it's a known issue and will be fixed in future releases).
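As a rough illustration of such a custom tool (the topic name, partitions, group id, and timestamp below are all assumptions, not from the original answer), you could commit start offsets for the application's group with a plain consumer while the Streams application is stopped, e.g. using offsetsForTimes():

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class CommitStartOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // group.id must equal the Streams application.id
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-app");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        long startTimestamp = 1_500_000_000_000L; // reprocess everything from this point in time

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = Arrays.asList(
                    new TopicPartition("input-topic", 0),
                    new TopicPartition("input-topic", 1));
            consumer.assign(partitions);

            // Find the first offset at or after the chosen timestamp for each partition.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, startTimestamp));
            Map<TopicPartition, OffsetAndTimestamp> found = consumer.offsetsForTimes(query);

            // Commit those offsets under the application's group so the Streams app
            // resumes from them on its next start.
            Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
            found.forEach((tp, oat) -> {
                if (oat != null) {
                    commits.put(tp, new OffsetAndMetadata(oat.offset()));
                }
            });
            consumer.commitSync(commits);
        }
    }
}
```

On its next start, the Streams application will pick up these committed offsets instead of relying on auto.offset.reset.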

Related

Axon Event Published Multiple Times Over EventBus

Just want to confirm the intended behavior of Axon versus what I’m seeing in my application. We have a customized Kafka publisher integrated with the Axon framework, as well as a custom Cassandra-backed event store.
The issue I’m seeing is as follows: (1) I publish a command (e.g. CreateServiceCommand) which hits the constructor of the ServiceAggregate, and then (2) A ServiceCreatedEvent is applied to the aggregate. (3) We see the domain event persisted in the backend and published over the EventBus (where we have a Kafka consumer).
All well and good, but suppose I publish that same command again with the same aggregate identifier. I do see the ServiceCreatedEvent being applied to the aggregate in the debugger, but since a domain event record already exists with that key, nothing is persisted to the backend. Again, all well and good, however I see the ServiceCreatedEvent being published out to Kafka and consumed by our listener, which is unexpected behavior.
I’m wondering whether this is the expected behavior of Axon, or if our Kafka integrations ought to be ensuring we’re not publishing duplicate events over the EventBus.
Edit:
I swapped in Axon's JPA event store and saw the following log when attempting to issue a command to create the aggregate that already exists. This issue then is indeed due to a defect with our custom event store.
"oracle.jdbc.OracleDatabaseException: ORA-00001: unique constraint (R671659.UK8S1F994P4LA2IPB13ME2XQM1W) violated\n\n\tat oracle.jdbc.driver.T4CTTIoer11.processError
To be honest, the given explanation has a couple of holes which make it hard to pinpoint where the problem lies.
In short: no, Axon would not publish an event twice as a result of dispatching the exact same command a second time, although the details depend on your implementation. If the command creates an aggregate, you should see a constraint violation on the uniqueness of the aggregate identifier and (aggregate) sequence number. If it's a command that works on an existing aggregate, whether it is idempotent depends on your implementation.
From your transcript I guess you are talking about a command handler which creates an aggregate. Thus the behavior you are seeing strikes me as odd. Either the custom event store introduces this undesired behavior, or it's due to not using Axon's dedicated Kafka Extension.
Also note that using a single solution for event storage and message distribution, like Axon Server, will avoid the problem entirely. You would no longer need to configure any custom event handling and publication on Kafka at all, saving you development work and infrastructure coordination. Additionally, it provides the guarantees I discussed earlier. For more insights on the why/how of Axon Server, you can check this other SO response of mine.
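For reference, a minimal Axon 4-style sketch of the creation flow described above (package names follow Axon 4; the command and event field shapes are assumptions). With an event store that enforces the uniqueness constraint correctly, the second CreateServiceCommand for the same identifier fails and rolls back rather than publishing the event again:

```java
import org.axonframework.commandhandling.CommandHandler;
import org.axonframework.eventsourcing.EventSourcingHandler;
import org.axonframework.modelling.command.AggregateIdentifier;
import org.axonframework.modelling.command.AggregateLifecycle;
import org.axonframework.modelling.command.TargetAggregateIdentifier;

// Hypothetical shapes for the command and event mentioned in the question.
class CreateServiceCommand {
    @TargetAggregateIdentifier
    final String serviceId;
    CreateServiceCommand(String serviceId) { this.serviceId = serviceId; }
}

class ServiceCreatedEvent {
    final String serviceId;
    ServiceCreatedEvent(String serviceId) { this.serviceId = serviceId; }
}

public class ServiceAggregate {

    @AggregateIdentifier
    private String serviceId;

    protected ServiceAggregate() {
        // Required no-arg constructor for Axon.
    }

    @CommandHandler
    public ServiceAggregate(CreateServiceCommand cmd) {
        // The applied event is appended as (aggregateIdentifier, sequenceNumber 0).
        // A second creation with the same id should violate the event store's
        // uniqueness constraint and fail, rather than being silently dropped
        // while still going out on the EventBus.
        AggregateLifecycle.apply(new ServiceCreatedEvent(cmd.serviceId));
    }

    @EventSourcingHandler
    public void on(ServiceCreatedEvent evt) {
        this.serviceId = evt.serviceId;
    }
}
```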

What happens to the Kafka state store when you use the application reset tool?

What happens to your state store when you run the Kafka streams application reset tool to reset the app to a particular timestamp (say T-n)?
The document reads:
"Internal topics: Delete the internal topic (this automatically deletes any committed offsets)"
(Internal topics are used internally by the Kafka Streams application while executing, for example, the changelog topics for state stores)
Does this mean that I lose the state of my state store/RocksDB as it was at T-n?
For example, let's say I was processing a "Session Window" on the state store at that timestamp. It looks like I'll lose all existing data within that window during an application reset.
Is there possibly a way to preserve the state of the Session Window when resetting an application?
In other words, is there a way to preserve the state of my state store or RocksDB (at T-n) during an application reset?
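For context, the kind of session-windowed state the question refers to might be built along these lines (topic name, serdes, and the inactivity gap are assumptions; requires a Kafka Streams version where SessionWindows accepts a Duration):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.SessionWindows;

public class SessionWindowSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events =
                builder.stream("user-events", Consumed.with(Serdes.String(), Serdes.String()));

        // Counts per user session; the still-open sessions live in a session store
        // backed by RocksDB plus a changelog topic. Deleting that changelog via the
        // reset tool is what loses the in-flight window state.
        events.groupByKey()
              .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))
              .count();

        System.out.println(builder.build().describe());
    }
}
```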
The reset tool itself will not touch the local state store; however, it will delete the corresponding changelog topics. So yes, you effectively lose your state.
Thus, to keep your local state in sync with the changelog you should actually delete the local state, too, and start with an empty state (a sketch follows below): https://docs.confluent.io/current/streams/developer-guide/app-reset-tool.html#step-2-reset-the-local-environments-of-your-application-instances
It is currently not possible to also reset the state to a specific point.
The only "workaround" might be to not use the reset tool but bin/kafka-consumer-groups.sh to only modify the input topic offsets. This way you preserve the changelog topics and local state stores. However, when you restart the app the state will of course still be in its last state. Not sure if this is acceptable.

Kafka Streams Application Updates

I've built a Kafka Streams application. It's my first one, so I'm moving out of a proof-of-concept mindset into a "how can I productionalize this?" mindset.
The tl;dr version: I'm looking for kafka streams deployment recommendations and tips, specifically related to updating your application code.
I've been able to find lots of documentation about how Kafka and the Streams API work, but I couldn't find anything on actually deploying a Streams app.
The initial deployment seems to be fairly easy - there is good documentation for configuring your Kafka cluster, then you must create topics for your application, and then you're pretty much fine to start it up and publish data for it to process.
But what if you want to upgrade your application later? Specifically, if the update contains a change to the topology. My application does a decent amount of data enrichment and aggregation into windows, so it's likely that the processing will need to be tweaked in the future.
My understanding is that changing the order of processing or inserting additional steps into the topology will cause the internal ids for each processing step to shift. At best, new state stores are created and the previous state is lost; at worst, processing steps read from an incorrect state store topic when starting up (the sketch after the list below shows what these generated ids look like). This implies that you either have to reset the application or give the new version a new application ID. But there are some problems with that:
If you reset the application or give it a new ID, processing will start from the beginning of the source and intermediate topics. I really don't want to publish the output to the output topics twice.
Currently "in-flight" data would be lost when you stop your application for an upgrade (since that application would never start again to resume processing).
The only way I can think to mitigate this is to:
Stop data from being published to source topics. Let the application process all messages, then shut it off.
Truncate all source and intermediate topics.
Start new version of application with a new app id.
Start publishers.
This is "okay" for now since my application is the only one reading from the source topics, and intermediate topics are not currently used beyond feeding to the next processor in the same application. But, I can see this getting pretty messy.
Is there a better way to handle application updates? Or are my steps generally along the lines of what most developers do?
I think you have a full picture of the problem here, and your solution seems to be what most people do in this case.
During the latest Kafka Summit this question was asked after the talk by Gwen Shapira and Matthias J. Sax about Kubernetes deployment. The responses were the same: if your upgrade contains topology modifications, rolling upgrades can't be done.
It looks like there is no KIP about this for now.

How to always consume from latest offset in kafka-streams

Our requirement is such that if a kafka-streams app is consuming a partition, it should start its consumption from the latest offset of that partition.
This seems doable using
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
Now, let's say that with the above configuration the kafka-streams app started consuming data from the latest offset for a partition. And after some time, the app crashes. When the app comes back up, we want it to consume data from the latest offset of that partition, instead of from where it left off.
But I can't find anything that can help achieve it using kafka-streams api.
P.S. We are using kafka-1.0.0.
That is not supported out of the box.
The auto.offset.reset configuration only triggers if there are no committed offsets, and there is no config to change this behavior.
You could manipulate offsets manually before startup using bin/kafka-consumer-groups.sh, though: the application.id is the group.id, and you could "seek to end" before you restart the application. A rough sketch of doing this programmatically is shown below.
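A sketch of that "seek to end" step, run while the application is stopped (topic, partitions, and the application id are assumptions): it commits the current end offsets under the application's group so the Streams app starts from the latest offsets on its next run.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class SeekToEndTool {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-app"); // == application.id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = Arrays.asList(
                    new TopicPartition("input-topic", 0),
                    new TopicPartition("input-topic", 1));
            consumer.assign(partitions);
            consumer.seekToEnd(partitions);

            // position() forces the actual end offset lookup; commit it for the group.
            Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
            for (TopicPartition tp : partitions) {
                commits.put(tp, new OffsetAndMetadata(consumer.position(tp)));
            }
            consumer.commitSync(commits);
        }
    }
}
```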
Update:
Since the 1.1.0 release, you can use the bin/kafka-streams-application-reset.sh tool to set starting offsets. To use the tool, the application must be offline. (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-171+-+Extend+Consumer+Group+Reset+Offset+for+Stream+Application)

Using Kafka instead of Redis for the queue purposes

I have a small project that uses Redis for task queue purposes. Here is how it basically works.
I have two components in the system: a desktop client (there can be more than one) and a server-side app. The server-side app has a pool of tasks for the desktop client(s). When a client connects, the first available task from the pool is given to it. As each task has an id, when the desktop client comes back with the results, the server-side app can recognize the task by its id. Basically, I do the following in Redis (a sketch follows the list below):
Keep all the tasks as objects.
Keep queue (pool) of tasks in several lists: queue, provided, processing.
When a task is being provided to the desktop client, I use RPOPLPUSH in Redis to move the id from the queue list to the provided list.
When I get a response from the desktop client, I use LREM to remove the given task id from the provided list (if that fails, I got a task that was not provided, was already processed, or never existed, so I stop execution). Then I use LPUSH to add the task id to the processing list. Given that I have unique task ids (controlled at the level of my app), I avoid duplicates in the Redis lists.
When the task is finished (the result got from the desktop client is processed and somehow saved), I remove the task from the processing list and delete the task object from Redis.
If anything goes wrong at any step (i.e. the task gets stuck in the processing or provided list), I can move the task back to the queue list and re-process it.
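For reference, the Redis side of that workflow roughly corresponds to the following sketch using the Jedis client (the key names and the task-object key format are placeholders):

```java
import redis.clients.jedis.Jedis;

public class RedisTaskQueueSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Hand a task to a desktop client: atomically move its id from queue to provided.
            String taskId = jedis.rpoplpush("tasks:queue", "tasks:provided");

            // Client reports back: remove the id from provided (zero removals means the task
            // was never provided / already handled, so we stop) and mark it as processing.
            if (taskId != null && jedis.lrem("tasks:provided", 1, taskId) > 0) {
                jedis.lpush("tasks:processing", taskId);

                // ... handle the client's result and persist it somewhere ...

                // Done: drop it from processing and delete the task object itself.
                jedis.lrem("tasks:processing", 1, taskId);
                jedis.del("task:" + taskId);
            }
        }
    }
}
```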
Now, the question: is it possible to do something similar in Apache Kafka? I do not need the exact behavior as in Redis. All I need is to be able to provide a task to the desktop client (it shouldn't be possible to provide the same task twice) and to mark/change its state according to the actual processing status (new, provided, processing), so that I can control the process and restore tasks that were not processed due to some problem. If it's possible, could anyone please describe the applicable workflow?
It is possible for Kafka to act as a standard queue. Check the consumer group feature.
If the question is about appropriateness, please also refer to "Is Apache Kafka appropriate for use as a task queue?"
We are using Kafka as a task queue. One of the considerations that went in favor of Kafka was that it is already in our application ecosystem; we found that easier than adding one more component.
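A minimal sketch of that consumer-group approach (topic name, group id, and the processTask() helper are assumptions; uses the newer poll(Duration) signature): each record is delivered to exactly one consumer in the group, and committing offsets only after processing means unprocessed tasks are redelivered if a worker crashes, which roughly replaces the queue/provided/processing bookkeeping from the Redis setup.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TaskWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "task-workers");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("tasks"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    processTask(record.key(), record.value()); // hypothetical task handler
                }
                // Commit only after the batch is processed; a crash before this point
                // makes Kafka redeliver the uncommitted tasks to another worker.
                consumer.commitSync();
            }
        }
    }

    private static void processTask(String taskId, String payload) {
        // Placeholder for handing the task to a desktop client and handling the result.
        System.out.println("processing task " + taskId);
    }
}
```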