Kubernetes deployment with Recreate strategy and maxSurge?

Kubernetes deployment with Recreate strategy and maxSurge? - kubernetes

Summary
Can I give a deployment the rollout strategy Recreate and also set a fixed maxSurge for the deployment?
More details
I am developing an application that runs in Kubernetes. The backend will have multiple replicas, and runs EF Core with database migrations. I understand there are several ways to solve this; here's my idea at the moment.
On a new release, I would like all replicas to be stopped. Then a single replica at a time should start, and for each replica there should be an init container that runs the migrations (if needed).
This seems to be possible, using the following two configuration values:
.spec.strategy.type==Recreate and
.spec.strategy.rollingUpdate.maxSurge==1
Is it possible to use these two together? If not, is there any way to control how many replicas a controller will start at once with the Recreate strategy?
"No! You should do this in a completely different way!"
Feel free to suggest other methods as well, if you think I am coming at this from the completely wrong angle.

Statefulset might help you in this case.
StatefulSets are valuable for applications that require one or more of the following.
Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling.
Ordered, automated rolling updates.

Related

Kubernetes control the order of scale and upgrade for a StatefulSet

I have the following scenario:
A StatefulSet with 1 replica
Update the template section and scale it in the same operation using helm as application manager
The order of operation is the following:
Scaling to 3
Update the replica with name 0
Because I cannot control to first update and after that scale, I am losing data because there is a specific logic in the new statefulset template.
Is there a way to control the ordering of those operations?
The service in question is Redis, we are trying to get from standalone mode (1 replica) to replication(HA) without losing data.

For the moment I resolved the problem using a helm pre-install job that is basically scaling the sts to zero, after that helm is coming with the update.

I am not Redis expert, but I think that solution below should help you.
I would try to install another Redis HA instance (B) next to the existing one (A), taking as a data source for B a snapshot of A's PV. This could to avoid losing your data. For more information you can read more about volume snapshots.
See also this related problem.

Liquibase with Kubernetes, how to prevent DB being left in a locked state

Firstly, yes I have read this https://www.liquibase.com/blog/using-liquibase-in-kubernetes
and I also read many SO threads where people are answering "I solved the issue by using init-container"
I understand that for most people this might have fixed the issue because the reason their pods were going down was because the migration was taking too long and k8s probes killed the pods.
But what about when a new deployment is applied and the previous deployment was stuck a failed state (k8s trying again and again to launches the pods without success) ?
When this new deployment is applied it will simply whip / replace all the failing pods and if this happens while Liquibase aquired the lock the pods (and its init containers) are killed and the DB will be left in a locked state requiring manual intervention.
Unless I missed something with k8s's init-container, using them doesn't really solve the issue described above right?
Is that the only solution currently available? What other solution could be used to avoid manual intervention ?
My first thought was to add some kind of custom code (either directly in the app before the Liquibase migration happens) or in init-container that would run before liquibase init-container runs to automatically unlock the DB if for example the lock is, let's say, 5 minutes old.
Would that be acceptable or will it cause other issues i'm not thinking about ?

How to reduce downtime caused by pulling images in the Kubernetes Recreate deployment strategy

Assuming I have a Kubernetes Deployment object with the Recreate strategy and I update the Deployment with a new container image version. Kubernetes will:
scale down/kill the existing Pods of the Deployment,
create the new Pods,
which will pull the new container images
so the new containers can finally run.
Of course, the Recreate strategy is exepected to cause a downtime between steps 1 and 4, where no Pod is actually running. However, step 3 can take a lot of time if the container images in question are or the container registry connection is slow, or both. In a test setup (Azure Kubernetes Services pulling a Windows container image from Docker Hub), I see it taking 5 minutes and more, which makes for a really long downtime.
So, what is a good option to reduce that downtime? Can I somehow get Kubernetes to pull the new images before killing the Pods in step 1 above? (Note that the solution should work with Windows containers, which are notoriously large, in case that is relevant.)
On the Internet, I have found this Codefresh article using a DaemonSet and Docker in Docker, but I guess Docker in Docker is no longer compatible with containerd.
I've also found this StackOverflow answer that suggests using an Azure Container Registry with Project Teleport, but that is in private preview and doesn't support Windows containers yet. Also, it's specific to Azure Kubernetes Services, and I'm looking for a more general solution.
Surely, this is a common problem that has a "standard" answer?
Update 2021-12-21: Because I've got a corresponding answer, I'll clarify that I cannot easily change the deployment strategy. The application in question does not support running Pods of different versions at the same time because it uses a database that needs to be migrated to the corresponding application version, without forwards or backwards compatibility.

Implement a "blue-green" deployment strategy. For instance, the service might be running and active in the "blue" state. A new deployment is created with a new container image, which deploys the "green" pods with the new container image. When all of the "green" pods are ready, the "switch live" step is run, which switches the active color. Very little downtime.
Obviously, this has tradeoffs. Your cluster will need more memory to run the additional transitional pods. The deployment process will be more complex.

Via https://www.reddit.com/r/kubernetes/comments/oeruh9/can_kubernetes_prepull_and_cache_images/, I've found these ideas:
Implement a DaemonSet that runs a "sleep" loop on all the images I need.
Use http://github.com/mattmoor/warm-image, which has no Windows support.
Use https://github.com/ContainerSolutions/ImageWolf, which says, "ImageWolf is currently alpha software and intended as a PoC - please don't run it in production!"
Use https://github.com/uber/kraken, which seems to be a registry, not a pre-pulling solution.
Use https://github.com/dragonflyoss/Dragonfly (now https://github.com/dragonflyoss/Dragonfly2), which also seems to do somethings completely different.
Use https://github.com/senthilrch/kube-fledged, which looks exactly right and more mature than the others, but has no Windows support.
Use https://github.com/dcherman/image-cache-daemon, which has no Windows support.
Use https://goharbor.io/blog/harbor-2.1/, which also seems to be a registry, not a pre-pulling solution.
Use https://openkruise.io/docs/user-manuals/imagepulljob/, which also looks right, but a) OpenKruise is huge and I'm not sure I want to install this just to preload images, and b) it seems it has no Windows support.
So, it seems I have to implement this on my own, with a DaemonSet. I still hope someone can provide a better answer than this one 🙂 .

Expressing that a service requires another

I'm new to k8s, so this question might be kind of weird, please correct me as necessary.
I have an application which requires a redis database. I know that I should configure it to connect to <redis service name>.<namespace> and the cluster DNS will get me to the right place, if it exists.
It feels to me like I want to express the relationship between the application and the database. Like I want to say that the application shouldn't be deployable until the database is there and working, and maybe that it's in an error state if the DB goes away. Is that something you'd normally do, and if so - how? I can think of other instances: like with an SQL database you might need to create the tables your app wants to use at init time.
Is the alternative to try to connect early and exit 1, so that the cluster keeps on retrying? Feels like that would work but it's not very declarative.

Design for resiliency
Modern applications and Kubernetes are (or should be) designed for resiliency. The applications should be designed without single point of failure and be resilient to changes in e.g. network topology. Also see Twelve factor-app: IV. Backing services.
This means that your Redis typically should be a cluster of e.g. 3 instances. It also means that your app should retry connections if connections fails - this can also happens same time after running - since upgrades of a cluster (or rolling upgrade of an app) is done by terminating one instance at a time meanwhile a new instance at a time is launched. E.g. the instance (of a cluster) that your app currently is connected to might go away and your app need to reconnect, perhaps establish a connection to a different instance in the same cluster.
SQL Databases and schemas
I can think of other instances: like with an SQL database you might need to create the tables your app wants to use at init time.
Yes, this is a different case. On Kubernetes your app is typically deployed with at least 2 replicas, or more (for high-availability reasons). You need to consider that when managing schema changes for your app. Common tools to manage the schema are Flyway or Liquibase and they can be run as Jobs. E.g. first launch a Job to create your DB-tables and after that deploy your app. And after some weeks you might want to change some tables and launch a new Job for this schema migration.

As you've seen, YAML objects can not express such dependencies. As suggested by #fabian-lopez, your application container may include an initContainer that would wait for dependencies to be available, before starting their main container.
Now, if you want a state machine, capable to provision a database, initialize its schema, maybe import some records, and only then create your application: you're looking for an operator. Then, you may use the operator-sdk ( https://github.com/operator-framework/operator-sdk ), or pretty much anything integrating with some Kubernetes cluster API.

I think Init Containers is something you could leverage for this use case

This is up to your application code, not something Kubernetes helps nor hinders.

Is it possible to run a single container Flink cluster in Kubernetes with high-availability, checkpointing, and savepointing?

I am currently running a Flink session cluster (Kubernetes, 1 JobManager, 1 TaskManager, Zookeeper, S3) in which multiple jobs run.
As we are working on adding more jobs, we are looking to improve our deployment and cluster management strategies. We are considering migrating to using job clusters, however there is reservation about the number of containers which will be spawned. One container per job is not an issue, but two containers (1 JM and 1 TM) per job raises concerns about memory consumption. Several of the jobs need high-availability and the ability to use checkpoints and restore from/take savepoints as they aggregate events over a window.
From my reading of the documentation and spending time on Google, I haven't found anything that seems to state whether or not what is being considered is really possible.
Is it possible to do any of these three things:
run both the JobManager and TaskManager as separate processes in the same container and have that serve as the Flink cluster, or
run the JobManager and TaskManager as literally the same process, or
run the job as a standalone JAR with the ability to recover from/take checkpoints and the ability to take a savepoint and restore from that savepoint?
(If anyone has any better ideas, I'm all ears.)

One of the responsibilities of the job manager is to monitor the task manager(s), and initiate restarts when failures have occurred. That works nicely in containerized environments when the JM and TMs are in separate containers; otherwise it seems like you're asking for trouble. Keeping the TMs separate also makes sense if you are ever going to scale up, though that may moot in your case.
What might be workable, though, would be to run the job using a LocalExecutionEnvironment (so that everything is in one process -- this is sometimes called a Flink minicluster). This path strikes me as feasible, if you're willing to work at it, but I can't recommend it. You'll have to somehow keep track of the checkpoints, and arrange for the container to be restarted from a checkpoint when things fail. And there are other things that may not work very well -- see this question for details. The LocalExecutionEnvironment wasn't designed with production deployments in mind.
What I'd suggest you explore instead is to see how far you can go toward making the standard, separate container solution affordable. For starters, you should be able to run the JM with minimal resources, since it doesn't have much to do.

Check this operator which automates the lifecycle of deploying and managing Flink in Kubernetes. The project is in beta but you can still get some idea about how to do it or directly use this operator if it fits your requirement. Here Job Manager and Task manager is separate kubernetes deployment.