Kafka Streams Application Updates

I've built a Kafka Streams application. It's my first one, so I'm moving out of a proof-of-concept mindset and into a "how can I productionize this?" mindset.
The tl;dr version: I'm looking for Kafka Streams deployment recommendations and tips, specifically related to updating your application code.
I've been able to find lots of documentation about how Kafka and the Streams API work, but I couldn't find anything on actually deploying a Streams app.
The initial deployment seems fairly easy: there is good documentation for configuring your Kafka cluster, you create the topics your application needs, and then you can start the app and publish data for it to process.
But what if you want to upgrade your application later? Specifically, if the update contains a change to the topology. My application does a decent amount of data enrichment and aggregation into windows, so it's likely that the processing will need to be tweaked in the future.
My understanding is that changing the order of processing steps, or inserting additional ones, will shift the internal IDs that Kafka Streams generates for each step. At best, new state stores are created and the previous state is lost; at worst, processing steps read from an incorrect state store topic when starting up. This implies that you either have to reset the application or give the new version a new application ID. But there are some problems with that:
If you reset the application or give a new id, processing will start from the beginning of source and intermediate topics. I really don't want to publish the output to the output topics twice.
Currently "in-flight" data would be lost when you stop your application for an upgrade (since that application would never start again to resume processing).
The only way I can think of to mitigate this is to:
1. Stop data from being published to the source topics, let the application process all remaining messages, then shut it off (a clean-shutdown sketch follows the next paragraph).
2. Truncate all source and intermediate topics.
3. Start the new version of the application with a new application ID.
4. Restart the publishers.
This is "okay" for now since my application is the only one reading from the source topics, and intermediate topics are not currently used beyond feeding to the next processor in the same application. But, I can see this getting pretty messy.
Is there a better way to handle application updates? Or are my steps generally along the lines of what most developers do?

I think you have a full picture of the problem here and your solution seems to be what most people do in this case.
At the latest Kafka Summit, this question was asked after the talk by Gwen Shapira and Matthias J. Sax about Kubernetes deployments. The response was the same: if your upgrade contains topology modifications, rolling upgrades can't be done.
It looks like there is no KIP about this for now.
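You can see why reordering breaks things by printing the generated names with Topology#describe(). A minimal sketch (the topics and operations are made-up placeholders): every processor and auto-created state store carries a sequential index, so inserting or reordering a step shifts every name after it.

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.KStream;

    public class TopologyNames {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            // A hypothetical two-step topology over an "events" topic.
            KStream<String, String> events = builder.stream("events");
            events.filter((k, v) -> v != null)
                  .groupByKey()
                  .count()              // backed by an auto-named state store
                  .toStream()
                  .to("event-counts");

            Topology topology = builder.build();
            // Prints processor names like KSTREAM-FILTER-0000000001 and the
            // state store names derived from the same sequential counter.
            System.out.println(topology.describe());
        }
    }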

Related

Change Stream duplication if a microservice instance is replicated

I have implemented a MongoDB Change Stream in a Java microservice. When I run a replica of my microservice, I see the change stream watch listening twice, so the work is duplicated. Is there any way to stop this?
I gave a similar answer here; however, as this question is directly related to Java, I feel it is actually more relevant on this question. I assume what you are after is for each change to be processed only once among many replicated processes.
Doing this with strong guarantees is difficult but not impossible. I wrote about the details of one solution here: https://www.alechenninger.com/2020/05/building-kafka-like-message-queue-with.html
The solution is implemented in a proof-of-concept library written in Java, which you are free to use or fork (the blog post explains how it works).
It comes down to a few techniques:
Each process attempts to obtain a lock
Each lock (or each change) has an associated fencing token
Processing each change must be idempotent
While processing the change, the token is used to ensure ordered, effectively-once updates.
More details are in the blog post; a rough sketch of the fencing-token update follows.
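As a concrete illustration of the last two points, here is a minimal sketch of a fenced, idempotent update using the MongoDB Java driver. The URI, collection, and field names are assumptions for illustration, not part of the linked library:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    public class FencedUpdate {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017"); // assumed URI
            MongoCollection<Document> target = client.getDatabase("app").getCollection("projections");

            String documentId = "order-42"; // hypothetical change key
            long fencingToken = 17L;        // token obtained with the lock; higher = newer

            // Apply the change only if no holder with a newer (or equal) token has
            // written already. Replaying the same change with the same token is a
            // no-op, so the update is idempotent and stale writers are fenced out.
            // Assumes the document already exists with an initial fencingToken.
            target.updateOne(
                Filters.and(
                    Filters.eq("_id", documentId),
                    Filters.lt("fencingToken", fencingToken)),
                Updates.combine(
                    Updates.set("state", "processed"),
                    Updates.set("fencingToken", fencingToken)));
        }
    }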

Lucene merge stop resume

Our product requires real-time Lucene index merging on an embedded device. The device may be shut down at any time. My team is exploring the possibility of resuming a merge after a system reboot. In our POC, all consumers from SegmentMerger are overridden by a customized codec, and the whole merge process is divided into many fine-grained steps. When each step is done, its state is saved to disk to avoid redoing it after a resume. Testing shows this can work. However, I am not able to determine how robust this solution is, or whether it is built on a flawed foundation.
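One detail the robustness question usually hinges on is whether each per-step state save is atomic, so a crash mid-write never leaves a torn checkpoint behind. A generic sketch of the usual write-temp-then-rename pattern (the file layout and string encoding are made up, not taken from the POC):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class StepCheckpoint {
        // Write the completed step's state to a temp file, then atomically
        // rename it into place: after a crash either the old or the new
        // checkpoint is visible, never a partial write.
        public static void save(Path checkpoint, String stepState) throws IOException {
            Path tmp = checkpoint.resolveSibling(checkpoint.getFileName() + ".tmp");
            Files.write(tmp, stepState.getBytes(StandardCharsets.UTF_8));
            // On typical filesystems this replaces an existing checkpoint atomically.
            Files.move(tmp, checkpoint, StandardCopyOption.ATOMIC_MOVE);
        }

        public static String load(Path checkpoint) throws IOException {
            if (!Files.exists(checkpoint)) {
                return null; // no checkpoint yet: restart the merge from step zero
            }
            return new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8);
        }
    }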
Thanks in advance for your response

Data Re-processing with specific starting point in Kafka Streams

I am investigating data reprocessing with Kafka Streams. There is a nice tool for reprocessing data by resetting the streams application: the Application Reset Tool.
But this tool resets the application state to zero and reprocesses everything from scratch.
There are scenarios where we want to reprocess the data from a specific point, for example:
A bug fix in the current application
Updating the application with an additional processor and running it with the same application ID
Flink, by comparison, has the concept of savepoints, which can restore previous operator state and allow new operators to be added without error.
I also referred to the following documents:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
I would like to know:
1. Is there any checkpointing-type mechanism available in Kafka Streams?
2. How can we re-run a Kafka Streams application from a specific point?
3. What happens if we change the code in one of the application instances and run it with the old application ID?
Kafka Streams does not have a savepoint concept at this point (version 1.0):
1. Not at the moment (v1.0).
2. Yes. In the next release this will be part of the reset tool directly. In 1.0, you can use bin/kafka-consumer-groups.sh to commit a start offset for your application (note: application.id == group.id). For older Kafka versions, you could build a custom tool to commit start offsets; a sketch follows.
3. In general, it breaks. Thus, you need to use a new application.id (it's a known issue and will be fixed in a future release).
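A minimal sketch of such a custom tool, using a plain KafkaConsumer to commit a start offset for the application's group; the topic, partition, and offset are placeholders:

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Collections;
    import java.util.Properties;

    public class CommitStartOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker
            props.put("group.id", "my-streams-app");            // must equal the application.id
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("enable.auto.commit", "false");

            // Commit offset 12345 for partition 0 of the input topic; the Streams
            // app will resume from there on its next start. Run this only while
            // the Streams application is stopped, so the group is empty.
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("input-topic", 0);
                consumer.assign(Collections.singletonList(tp));
                consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(12345L)));
            }
        }
    }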

How to ensure that parallel queries to ext. system are executed only once and then cached

Server frameworks: Scala, Play 2.2, ReactiveMongo, Heroku
I think I have quite an interesting brain teaser for you:
In my trip-planning application I want to display a weather forecast on a map (similar to this). I'm using a paid REST service to query weather data. To speed up the user experience and reduce costs, I plan to cache the weather data for each location for one hour.
There are a few not-so-obvious things to consider:
Displaying one weather map might require querying up to 100 locations
The weather must be queried in parallel, because querying serially would take too long given network latency
However, launching 100 threads for each user request is not an option either (imagine just 5 users looking at a map at one time)
The solution is to have, say, 50 workers that query the weather for user requests
Multiple users might be viewing the same portion of the map
There is a possible race condition where one location is queried multiple times
However, each location should be queried only once and then cached
The application runs in a clustered environment, meaning there will be several Play instances
Coming from a Java EE background I can come up with a pretty good solution using the Java EE stack.
However, I wonder how to do this using something more natural to the Scala/Play stack: Akka. There is an example (google "heroku scala akka") for a similar problem, but it doesn't solve one issue: the race condition when multiple users query the same data at once.
How would you implement this?
EDIT: I have decided that the requirement to ensure that weather data is updated only once is not necessary. The situation would happen far too infrequently to be a real problem and all proposed solutions would bring too much overhead and complexity to the system to be viable.
Thanks everyone for your time and effort. I hope the answers to this question will help someone in the future with a similar problem.
In Akka you can choose from multiple routing strategies. ConsistentHashingRoutingLogic could serve you well in this situation. Since actors are single-threaded, you can easily maintain a cache in each actor. This routing logic ensures that two equal messages will always hit the same actor.
Each actor can work in the following way:
1. Check the local cache (for example an Apache Commons LRUMap); if found, return the result.
2. Check the global cache (distributed memcache or any other key-value store); if found, store the result in the local cache and return it.
3. Query the REST service.
4. Store the result in the global and local caches.
You can have a look at this question, which I based my answer on.
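The question's stack is Scala/Play, but for illustration here is a minimal sketch using Akka classic's Java routing API: a ConsistentHashingPool routes each location to a fixed worker, and each single-threaded worker keeps its own LRU cache. The message type, pool size, and REST call are placeholders, and the global-cache step from the list above is omitted for brevity:

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.routing.ConsistentHashingPool;
    import akka.routing.ConsistentHashingRouter.ConsistentHashMapper;

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class WeatherRouting {

        // Request for the weather at one location (hypothetical message type).
        public static final class WeatherRequest {
            public final String location;
            public WeatherRequest(String location) { this.location = location; }
        }

        public static class WeatherActor extends AbstractActor {
            // Actors process one message at a time, so this local LRU cache
            // needs no synchronization.
            private final Map<String, String> cache =
                    new LinkedHashMap<String, String>(16, 0.75f, true) {
                        @Override
                        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                            return size() > 1000; // keep at most 1000 locations
                        }
                    };

            @Override
            public Receive createReceive() {
                return receiveBuilder()
                        .match(WeatherRequest.class, req -> {
                            String weather = cache.get(req.location);
                            if (weather == null) {
                                weather = queryRestService(req.location);
                                cache.put(req.location, weather);
                            }
                            getSender().tell(weather, getSelf());
                        })
                        .build();
            }

            private String queryRestService(String location) {
                return "sunny"; // placeholder for the paid weather REST call
            }
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("weather");

            // Equal locations always hash to the same worker, so each location
            // is fetched only once per cache lifetime, even under concurrent
            // requests for the same map tile.
            ConsistentHashMapper mapper = new ConsistentHashMapper() {
                @Override
                public Object hashKey(Object message) {
                    return (message instanceof WeatherRequest)
                            ? ((WeatherRequest) message).location : null;
                }
            };

            ActorRef workers = system.actorOf(
                    new ConsistentHashingPool(50).withHashMapper(mapper)
                            .props(Props.create(WeatherActor.class)),
                    "weatherWorkers");

            workers.tell(new WeatherRequest("48.2,16.37"), ActorRef.noSender());
        }
    }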
I decided to post my JMS solution as well.
The controller that processes a weather request does the following:
Query the DB for weather data. If there are no locations with out-of-date data, reply immediately. Otherwise, continue:
Start listening on a topic (explained later).
For each location, check whether the weather for that location is already being updated.
If not, send a weather-update request message to a queue.
A certain number of workers (50?) listen to that queue.
A worker first marks the location's weather as being updated.
The worker retrieves the updated weather and updates the DB.
The worker sends a message to the topic with the weather data for that location.
When the controller receives (via the topic) weather updates for all out-of-date locations, it combines them with the up-to-date locations and replies.
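For completeness, a minimal sketch of the worker side of this design using the plain javax.jms API. The destination names and the fetch/DB step are placeholders, and obtaining the ConnectionFactory is provider-specific (e.g. a JNDI lookup), so it is passed in:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.JMSException;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import javax.jms.Topic;

    public class WeatherWorker {

        public static void run(ConnectionFactory factory) throws JMSException {
            Connection connection = factory.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            Queue requests = session.createQueue("weather.update.requests"); // placeholder name
            Topic updates = session.createTopic("weather.updates");          // placeholder name

            MessageConsumer consumer = session.createConsumer(requests);
            MessageProducer producer = session.createProducer(updates);
            connection.start();

            while (true) {
                // Steps from the list above: take one request off the queue,
                // fetch and persist the weather, then broadcast the result.
                TextMessage request = (TextMessage) consumer.receive();
                String location = request.getText();
                String weather = fetchAndStore(location); // placeholder: REST call + DB update
                producer.send(session.createTextMessage(location + "=" + weather));
            }
        }

        private static String fetchAndStore(String location) {
            return "sunny"; // placeholder
        }
    }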

Upgrading an app running on Lift Framework?

I've recently discovered the Lift framework and have read that it's stateful.
Therefore, if I had a high-traffic site running on Lift - say something that was running a chat application that required users to be logged in - and I wanted to upgrade my app, would doing so kick everyone out of chat and make them have to log in again?
None of the previous answers are correct. Many of the artefacts held within the LiftSession are non-serializable, so they can't be stuffed into a database. You have two options for doing rolling upgrades of stateful applications:
1) Session bleeding. Basically, you wean sessions away from one of the deployments until those sessions have ended or X duration passes, and then you remove the app from production while automatically rerouting traffic to another instance of Lift. Google around for rolling upgrades using HAProxy, as this should help you from the cluster perspective.
2) If your state is fairly trivial (mostly primitive-style types: ints, strings, etc.), then you could think about using ContainerVar/MigratableSession and clustering the state using Terracotta or similar. This comes with a range of limits, though, because it then uses the HTTPSession rather than the LiftSession.
You might want to check out chapter 15 of Lift in Action, which covers the latter solution in a fair amount of detail.
If you keep your state in memory and redeploy the web application, that state will be lost. You could save it to a database or a file before redeploying though and read it back from there.