Liquibase with Kubernetes: how to prevent the DB being left in a locked state

Firstly, yes, I have read https://www.liquibase.com/blog/using-liquibase-in-kubernetes
and I have also read many SO threads where people answer "I solved the issue by using an init container."
I understand that for most people this might have fixed the issue, because the reason their pods were going down was that the migration was taking too long and the k8s probes killed the pods.
But what about when a new deployment is applied while the previous deployment is stuck in a failed state (k8s trying again and again to launch the pods without success)?
When this new deployment is applied, it will simply wipe/replace all the failing pods, and if this happens while Liquibase holds the lock, the pods (and their init containers) are killed and the DB is left in a locked state, requiring manual intervention.
Unless I missed something about k8s init containers, using them doesn't really solve the issue described above, right?
Is that the only solution currently available? What other solution could be used to avoid manual intervention?
My first thought was to add some kind of custom code, either directly in the app before the Liquibase migration happens, or in an init container that runs before the Liquibase init container, to automatically unlock the DB if the lock is, say, more than 5 minutes old.
Would that be acceptable, or would it cause other issues I'm not thinking about?
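Something like the following is what I have in mind — a rough sketch in plain JDBC (the table and column names are Liquibase's defaults for the DATABASECHANGELOGLOCK table, but verify them against your database and Liquibase version; the connection settings are placeholders read from the environment):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

/**
 * Pre-migration step that releases a Liquibase lock only if it looks stale.
 * Table/column names are Liquibase defaults; adjust for your DB and version.
 */
public class StaleLockReleaser {

    private static final Duration MAX_LOCK_AGE = Duration.ofMinutes(5);

    public static void main(String[] args) throws Exception {
        // Connection details are assumed to come from the environment.
        String url = System.getenv("JDBC_URL");
        String user = System.getenv("DB_USER");
        String password = System.getenv("DB_PASSWORD");

        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement()) {

            ResultSet rs = stmt.executeQuery(
                "SELECT LOCKED, LOCKGRANTED FROM DATABASECHANGELOGLOCK WHERE ID = 1");

            if (rs.next() && rs.getBoolean("LOCKED")) {
                Timestamp granted = rs.getTimestamp("LOCKGRANTED");
                boolean stale = granted == null
                    || Duration.between(granted.toInstant(), Instant.now())
                               .compareTo(MAX_LOCK_AGE) > 0;

                if (stale) {
                    // Same statement the Liquibase docs suggest for a manual unlock.
                    stmt.executeUpdate(
                        "UPDATE DATABASECHANGELOGLOCK "
                        + "SET LOCKED = FALSE, LOCKGRANTED = NULL, LOCKEDBY = NULL "
                        + "WHERE ID = 1");
                    System.out.println("Released stale Liquibase lock (older than "
                        + MAX_LOCK_AGE.toMinutes() + " minutes).");
                }
            }
        }
    }
}
```

The obvious caveat is that the threshold has to be longer than the longest migration I ever expect, otherwise this would steal the lock from a migration that is merely slow.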

Related

How to start a POD in Kubernetes when another blocks an important resource?

I'm getting stuck in the configuration of a deployment. The problem is the following.
The application in the deployment uses a database that is stored in a file. While this database is open, it is locked (there is no way for multiple processes to get read/write access).
If I delete the running POD, the new one can't reach the ready state, because the database is still locked. I read about the preStop hook and tried to use it, without success.
I could delete the lock file, which seems to be pretty harsh. What's the right way to solve this in Kubernetes?
This really isn't different from running this process outside of Kubernetes. When the pod is killed, it will be given a chance to shut down cleanly, so the lock should be cleaned up. If the lock isn't cleaned up, there aren't many ways to determine whether the lock remains because of an unclean shutdown, an unhealthy node, or a network partition. So deleting the lock at pod startup does seem unwise.
I think the first step for you should be trying to determine why this lock file isn't getting cleaned up correctly. (Rather than trying to address the symptom.)
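One thing worth checking is whether the application actually releases the lock when it receives SIGTERM. As a purely hypothetical sketch (the lock-file path is made up, and your database may manage its own lock internally), a JVM shutdown hook that removes the lock during the graceful-termination window would look roughly like this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LockCleanup {

    public static void main(String[] args) {
        // Hypothetical location of the database lock file.
        Path lockFile = Path.of("/data/mydb.lock");

        // Kubernetes sends SIGTERM and waits terminationGracePeriodSeconds
        // (30s by default) before sending SIGKILL; a shutdown hook runs
        // during that window.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                Files.deleteIfExists(lockFile);
                System.out.println("Released lock file on shutdown.");
            } catch (IOException e) {
                System.err.println("Could not remove lock file: " + e);
            }
        }));

        // ... application work that holds the lock ...
    }
}
```

If the application already does something equivalent and the lock still survives, that points at unclean shutdowns (for example, the grace period expiring and SIGKILL arriving), which is exactly the case you can't reliably distinguish at startup.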

How to reduce downtime caused by pulling images in the Kubernetes Recreate deployment strategy

Assume I have a Kubernetes Deployment object with the Recreate strategy and I update the Deployment with a new container image version. Kubernetes will:
scale down/kill the existing Pods of the Deployment,
create the new Pods,
which will pull the new container images
so the new containers can finally run.
Of course, the Recreate strategy is expected to cause downtime between steps 1 and 4, where no Pod is actually running. However, step 3 can take a lot of time if the container images in question are large or the container registry connection is slow, or both. In a test setup (Azure Kubernetes Services pulling a Windows container image from Docker Hub), I see it taking 5 minutes and more, which makes for a really long downtime.
So, what is a good option to reduce that downtime? Can I somehow get Kubernetes to pull the new images before killing the Pods in step 1 above? (Note that the solution should work with Windows containers, which are notoriously large, in case that is relevant.)
On the Internet, I have found this Codefresh article using a DaemonSet and Docker in Docker, but I guess Docker in Docker is no longer compatible with containerd.
I've also found this StackOverflow answer that suggests using an Azure Container Registry with Project Teleport, but that is in private preview and doesn't support Windows containers yet. Also, it's specific to Azure Kubernetes Services, and I'm looking for a more general solution.
Surely, this is a common problem that has a "standard" answer?
Update 2021-12-21: Because I've got a corresponding answer, I'll clarify that I cannot easily change the deployment strategy. The application in question does not support running Pods of different versions at the same time because it uses a database that needs to be migrated to the corresponding application version, without forwards or backwards compatibility.
Implement a "blue-green" deployment strategy. For instance, the service might be running and active in the "blue" state. A new deployment is created with a new container image, which deploys the "green" pods with the new container image. When all of the "green" pods are ready, the "switch live" step is run, which switches the active color. Very little downtime.
Obviously, this has tradeoffs. Your cluster will need more memory to run the additional transitional pods. The deployment process will be more complex.
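To make the "switch live" step concrete, here is one possible sketch using the fabric8 Kubernetes client (the Service name, namespace, and the "color" label are assumptions, not something from your setup): keep a single Service in front of both Deployments and repoint its selector once the green pods are ready.

```java
import io.fabric8.kubernetes.api.model.ServiceBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class SwitchLive {

    public static void main(String[] args) {
        // Assumes the blue and green Deployments label their pods with "color".
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.services()
                  .inNamespace("default")   // assumed namespace
                  .withName("my-app")       // assumed Service name
                  .edit(svc -> new ServiceBuilder(svc)
                        .editSpec()
                            .addToSelector("color", "green")  // repoint traffic
                        .endSpec()
                        .build());
            System.out.println("Service selector now targets the green pods.");
        }
    }
}
```

The same switch can also be done with a plain kubectl patch of the Service's selector; the client-based version is just easier to wire into a deployment pipeline.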
Via https://www.reddit.com/r/kubernetes/comments/oeruh9/can_kubernetes_prepull_and_cache_images/, I've found these ideas:
Implement a DaemonSet that runs a "sleep" loop on all the images I need.
Use http://github.com/mattmoor/warm-image, which has no Windows support.
Use https://github.com/ContainerSolutions/ImageWolf, which says, "ImageWolf is currently alpha software and intended as a PoC - please don't run it in production!"
Use https://github.com/uber/kraken, which seems to be a registry, not a pre-pulling solution.
Use https://github.com/dragonflyoss/Dragonfly (now https://github.com/dragonflyoss/Dragonfly2), which also seems to do something completely different.
Use https://github.com/senthilrch/kube-fledged, which looks exactly right and more mature than the others, but has no Windows support.
Use https://github.com/dcherman/image-cache-daemon, which has no Windows support.
Use https://goharbor.io/blog/harbor-2.1/, which also seems to be a registry, not a pre-pulling solution.
Use https://openkruise.io/docs/user-manuals/imagepulljob/, which also looks right, but a) OpenKruise is huge and I'm not sure I want to install this just to preload images, and b) it seems it has no Windows support.
So, it seems I have to implement this on my own, with a DaemonSet. I still hope someone can provide a better answer than this one 🙂.

Will mongock work correctly with kubernetes replicas?

Mongock looks very promising. We want to use it inside a kubernetes service that has multiple replicas that run in parallel.
We are hoping that when our service is deployed, the first replica will acquire the mongockLock and all of its ChangeLogs/ChangeSets will be completed before the other replicas attempt to run them.
We have a single instance of mongodb running in our kubernetes environment, and we want the mongock ChangeLogs/ChangeSets to execute only once.
Will the mongockLock guarantee that only one replica will run the ChangeLogs/ChangeSets to completion?
Or do I need to enable transactions (or some other configuration)?
I am going to provide the short answer first and then the long one. I suggest you read the long one too, in order to understand it properly.
Short answer
By default, Mongock guarantees that the ChangeLogs/changeSets will be run only by one pod at a time. The one owning the lock.
Long answer
What really happens behind the scenes (if it's not configured otherwise) is that when a pod takes the lock, the others will try to acquire it too, but they can't, so they are forced to wait for a while (configurable, but 4 minutes by default) as many times as the lock is configured to allow (3 times by default). After this, if a pod is not able to acquire the lock and there are still pending changes to apply, Mongock will throw a MongockException, which should mean the JVM startup fails (which is what happens by default in Spring).
This is fine in Kubernetes, because it ensures the pods will be restarted.
So now, assuming the pods start again and changeLogs/changeSets are already applied, the pods start successfully because they don't even need to acquire the lock as there aren't pending changes to apply.
Potential problem with MongoDB without transaction support and frameworks like Spring
Now, assuming the lock and the mutual exclusion are clear, I'd like to point out a potential issue that needs to be mitigated by the changeLog/changeSet design.
This issue applies if you are in an environment such as Kubernetes, which has a pod initialisation time, your migration takes longer than that initialisation time, and the Mongock process is executed before the pod becomes ready/healthy (and is a condition for it). This last condition is highly desirable, as it ensures the application runs with the right version of the data.
In this situation, imagine the Pod starts the Mongock process. After the Kubernetes initialisation time, the process is still not finished, but Kubernetes stops the JVM abruptly. This means that some changeSets were successfully executed, some others were not even started (no problem, they will be processed in the next attempt), but one changeSet was partially executed and marked as not done. This is the potential issue. The next time Mongock runs, it will see the changeSet as pending and will execute it from the beginning. If you haven't designed your changeLogs/changeSets accordingly, you may experience some unexpected results, because part of the data process covered by that changeSet has already taken place and it will happen again.
This somehow needs to be mitigated, either with the help of mechanisms like transactions, with a changeLog/changeSet design that takes this into account, or both.
Mongock currently provides transactions with "all or nothing", but that doesn't really help much, as it will retry every time from scratch and will probably end up in an infinite loop. The next version, 5, will provide transactions per changeLogs and changeSets, which, together with good organisation, is the right solution for this.
Meanwhile, this issue can be addressed by following these design suggestions.
Just to follow up... Mongock's locking mechanism works fine with replicas. To solve the "long-running script" problem, we will run our Mongock scripts from a Kubernetes initContainer. K8s will wait for the initContainers to finish before it starts the pod's main service containers.
For transactions, we will follow the advice above of making our scripts idempotent.
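To illustrate what "idempotent" means for us in practice, here is a rough sketch (using the mongobee-style @ChangeLog/@ChangeSet annotations from Mongock 4 and assuming the driver lets you inject a MongoDatabase into the changeSet method; the collection and field names are made up): every changeSet is written so that re-running it after a partial execution produces the same end state, e.g. upserts and $set instead of plain inserts.

```java
import com.github.cloudyrock.mongock.ChangeLog;
import com.github.cloudyrock.mongock.ChangeSet;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;

@ChangeLog(order = "001")
public class DatabaseChangeLog {

    /**
     * Idempotent: an upsert with the same filter and $set produces the same
     * end state whether it runs once or is re-run after a partial execution.
     */
    @ChangeSet(order = "001", id = "add-default-settings", author = "me")
    public void addDefaultSettings(MongoDatabase db) {
        db.getCollection("settings").updateOne(
            new Document("_id", "feature-flags"),
            new Document("$set", new Document("darkMode", false)),
            new UpdateOptions().upsert(true));
    }
}
```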

Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialized, due to the resulting 404 error response.
I have noticed that during a deployment, Service Fabric apparently shuffles around the old version of the service, spinning up new primaries as needed to free up the VM it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process will fail, causing a rollback.
My problem is that, once I create a new storage account, I am still left with seemingly no way to bring the existing services back to healthy states. My existing services are using storage account URLs with AccountKeys that no longer exist in Azure. Attempts to upgrade fail because the old service instances can't instantiate due to the now-invalid configuration.
Are there any ways to deal with this situation?
The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
The recommended solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters, depending on the exact situation.
A third recommendation, or maybe something to improve in the future, would be to make the upgrade that changes the account information a configuration-only upgrade. This would ensure that SF tries to change the config in place without restarting the services (by default), which would prevent the existing services from failing over during the upgrade and encountering issues. This is demonstrated in this example.

Is it possible to run a single container Flink cluster in Kubernetes with high-availability, checkpointing, and savepointing?

I am currently running a Flink session cluster (Kubernetes, 1 JobManager, 1 TaskManager, Zookeeper, S3) in which multiple jobs run.
As we are working on adding more jobs, we are looking to improve our deployment and cluster management strategies. We are considering migrating to job clusters; however, there are reservations about the number of containers which will be spawned. One container per job is not an issue, but two containers (1 JM and 1 TM) per job raises concerns about memory consumption. Several of the jobs need high availability and the ability to use checkpoints and restore from/take savepoints, as they aggregate events over a window.
From my reading of the documentation and spending time on Google, I haven't found anything that seems to state whether or not what is being considered is really possible.
Is it possible to do any of these three things:
run both the JobManager and TaskManager as separate processes in the same container and have that serve as the Flink cluster, or
run the JobManager and TaskManager as literally the same process, or
run the job as a standalone JAR with the ability to recover from/take checkpoints and the ability to take a savepoint and restore from that savepoint?
(If anyone has any better ideas, I'm all ears.)
One of the responsibilities of the job manager is to monitor the task manager(s) and initiate restarts when failures have occurred. That works nicely in containerized environments when the JM and TMs are in separate containers; otherwise, it seems like you're asking for trouble. Keeping the TMs separate also makes sense if you are ever going to scale up, though that may be moot in your case.
What might be workable, though, would be to run the job using a LocalExecutionEnvironment (so that everything is in one process -- this is sometimes called a Flink minicluster). This path strikes me as feasible, if you're willing to work at it, but I can't recommend it. You'll have to somehow keep track of the checkpoints, and arrange for the container to be restarted from a checkpoint when things fail. And there are other things that may not work very well -- see this question for details. The LocalExecutionEnvironment wasn't designed with production deployments in mind.
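For completeness, the streaming analogue of that single-process path looks roughly like this (a sketch only — the checkpoint location, interval, and the toy pipeline are placeholders, and as said above this isn't something I'd recommend for production):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MiniClusterJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder checkpoint location; in your setup this would be S3.
        conf.setString("state.checkpoints.dir", "s3://my-bucket/checkpoints");

        // Job manager and task manager run as threads inside this single JVM.
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.createLocalEnvironment(1, conf);

        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Toy pipeline standing in for the real job.
        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)
           .print();

        env.execute("mini-cluster-job");
    }
}
```

Everything, including the embedded job manager and task manager, lives in one process here, so restarting the container from the latest checkpoint/savepoint after a failure is entirely up to you.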
What I'd suggest you explore instead is to see how far you can go toward making the standard, separate container solution affordable. For starters, you should be able to run the JM with minimal resources, since it doesn't have much to do.
Check out this operator, which automates the lifecycle of deploying and managing Flink in Kubernetes. The project is in beta, but you can still get some idea of how to do it, or use the operator directly if it fits your requirements. There, the JobManager and TaskManager are separate Kubernetes deployments.