Statefulset - Possible to Skip creation of pod 0 when it fails and proceed with the next one? - kubernetes

I currently do have a problem with the statefulset under the following condition:
I have a percona SQL cluster running with persistent storage and 2 nodes
now i do force both pods to fail.
first i will force pod-0 to fail
Afterwards i will force pod-1 to fail
Now the cluster is not able to recover without manual interference and possible dataloss
Why:
The statefulset is trying to bring pod-0 up first, however this one will not be brought online because of the following message:
[ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1
What i could do alternatively, but what i dont really like:
I could change ".spec.podManagementPolicy" to "Parallel" but this could lead to race conditions when forming the cluster. Thus i would like to avoid that, i basically like the idea of starting the nodes one after another
What i would like to have:
the possibility to have ".spec.podManagementPolicy":"OrderedReady" activated but with the possibility to adjust the order somehow
to be able to put specific pods into "inactive" mode so they are being ignored until i enable them again
Is something like that available? Does someone have any other ideas?

Unfortunately, nothing like that is available in standard functions of Kubernetes.
I see only 2 options here:
Use InitContainers to somehow check the current state on relaunch.
That will allow you to run any code before the primary container is started so you can try to use a custom script in order to resolve the problem etc.
Modify the database startup script to allow it to wait for some Environment Variable or any flag file and use PostStart hook to check the state before running a database.
But in both options, you have to write your own logic of startup order.

Related

How to create a cron job in a Kubernetes deployed app without duplicates?

I am trying to find a solution to run a cron job in a Kubernetes-deployed app without unwanted duplicates. Let me describe my scenario, to give you a little bit of context.
I want to schedule jobs that execute once at a specified date. More precisely: creating such a job can happen anytime and its execution date will be known only at that time. The job that needs to be done is always the same, but it needs parametrization.
My application is running inside a Kubernetes cluster, and I cannot assume that there always will be only one instance of it running at the any moment in time. Therefore, creating the said job will lead to multiple executions of it due to the fact that all of my application instances will spawn it. However, I want to guarantee that a job runs exactly once in the whole cluster.
I tried to find solutions for this problem and came up with the following ideas.
Create a local file and check if it is already there when starting a new job. If it is there, cancel the job.
Not possible in my case, since the duplicate jobs might run on other machines!
Utilize the Kubernetes CronJob API.
I cannot use this feature because I have to create cron jobs dynamically from inside my application. I cannot change the cluster configuration from a pod running inside that cluster. Maybe there is a way, but it seems to me there have to be a better solution than giving the application access to the cluster it is running in.
Would you please be as kind as to give me any directions at which I might find a solution?
I am using a managed Kubernetes Cluster on Digital Ocean (Client Version: v1.22.4, Server Version: v1.21.5).
After thinking about a solution for a rather long time I found it.
The solution is to take the scheduling of the jobs to a central place. It is as easy as building a job web service that exposes endpoints to create jobs. An instance of a backend creating a job at this service will also provide a callback endpoint in the request which the job web service will call at the execution date and time.
The endpoint in my case links back to the calling backend server which carries the logic to be executed. It would be rather tedious to make the job service execute the logic directly since there are a lot of dependencies involved in the job. I keep a separate database in my job service just to store information about whom to call and how. Addressing the startup after crash problem becomes trivial since there is only one instance of the job web service and it can just re-create the jobs normally after retrieving them from the database in case the service crashed.
Do not forget to take care of failing jobs. If your backends are not reachable for some reason to take the callback, there must be some reconciliation mechanism in place that will prevent this failure from staying unnoticed.
A little note I want to add: In case you also want to scale the job service horizontally you run into very similar problems again. However, if you think about what is the actual work to be done in that service, you realize that it is very lightweight. I am not sure if horizontal scaling is ever a requirement, since it is only doing requests at specified times and is not executing heavy work.

Liquibase with Kubernetes, how to prevent DB being left in a locked state

Firstly, yes I have read this https://www.liquibase.com/blog/using-liquibase-in-kubernetes
and I also read many SO threads where people are answering "I solved the issue by using init-container"
I understand that for most people this might have fixed the issue because the reason their pods were going down was because the migration was taking too long and k8s probes killed the pods.
But what about when a new deployment is applied and the previous deployment was stuck a failed state (k8s trying again and again to launches the pods without success) ?
When this new deployment is applied it will simply whip / replace all the failing pods and if this happens while Liquibase aquired the lock the pods (and its init containers) are killed and the DB will be left in a locked state requiring manual intervention.
Unless I missed something with k8s's init-container, using them doesn't really solve the issue described above right?
Is that the only solution currently available? What other solution could be used to avoid manual intervention ?
My first thought was to add some kind of custom code (either directly in the app before the Liquibase migration happens) or in init-container that would run before liquibase init-container runs to automatically unlock the DB if for example the lock is, let's say, 5 minutes old.
Would that be acceptable or will it cause other issues i'm not thinking about ?

How to start a POD in Kubernetes when another blocks an important resource?

I'm getting stuck in the configuration of a deployment. The problem is the following.
The application in the deployment is using a database which is stored in a file. While this database is open, it's locked (there's no way for read/write access for many).
If I delete the running POD the new one can't get in ready state, because the database is still locked. I read about preStop-Hook and tried to use it without success.
I could delete the lock file, which seems to be pretty harsh. What's the right way to solve this in Kubernetes?
This really isn't different than running this process outside of Kubernetes. When the pod is killed, it will be given a chance to shutdown cleanly. So the lock should be cleaned up. If the lock isn't cleaned up, there's not a lot of ways you can determined if the lock remains because an unclean shutdown was made, or a node is unhealthy, or if there is a network partition. So deleting the lock at pod startup does seem to be unwise.
I think the first step for you should be trying to determine why this lock file isn't getting cleaned up correctly. (Rather than trying to address the symptom.)

Is it possible to stop a job in Kubernetes without deleting it

Because Kubernetes handles situations where there's a typo in the job spec, and therefore a container image can't be found, by leaving the job in a running state forever, I've got a process that monitors job events to detect cases like this and deletes the job when one occurs.
I'd prefer to just stop the job so there's a record of it. Is there a way to stop a job?
1) According to the K8S documentation here.
Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here are the details for the failedJobsHistoryLimit property in the CronJobSpec.
This is another way of retaining the details of the failed job for a specific duration. The failedJobsHistoryLimit property can be set based on the approximate number of jobs run per day and the number of days the logs have to be retained. Agree that the Jobs will be still there and put pressure on the API server.
This is interesting. Once the job completes with failure as in the case of a wrong typo for image, the pod is getting deleted and the resources are not blocked or consumed anymore. Not sure exactly what kubectl job stop will achieve in this case. But, when the Job with a proper image is run with success, I can still see the pod in kubectl get pods.
2) Another approach without using the CronJob is to specify the ttlSecondsAfterFinished as mentioned here.
Another way to clean up finished Jobs (either Complete or Failed) automatically is to use a TTL mechanism provided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.
Not really, no such mechanism exists in Kubernetes yet afaik.
You can workaround is to ssh into the machine and run a: (if you're are using Docker)
# Save the logs
$ docker log <container-id-that-is-running-your-job> 2>&1 > save.log
$ docker stop <main-container-id-for-your-job>
It's better to stream log with something like Fluentd, or logspout, or Filebeat and forward the logs to an ELK or EFK stack.
In any case, I've opened this
You can suspend cronjobs by using the suspend attribute. From the Kubernetes documentation:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#suspend
Documentation says:
The .spec.suspend field is also optional. If it is set to true, all
subsequent executions are suspended. This setting does not apply to
already started executions. Defaults to false.
So, to pause a cron you could:
run and edit "suspend" from False to True.
kubectl edit cronjob CRON_NAME (if not in default namespace, then add "-n NAMESPACE_NAME" at the end)
you could potentially create a loop using "for" or whatever you like, and have them all changed at once.
you could just save the yaml file locally and then just run:
kubectl create -f cron_YAML
and this would recreate the cron.
The other answers hint around the .spec.suspend solution for the CronJob API, which works, but since the OP asked specifically about Jobs it is worth noting the solution that does not require a CronJob.
As of Kubernetes 1.21, there alpha support for the .spec.suspend field in the Job API as well, (see docs here). The feature is behind the SuspendJob feature gate.

Kubernetes different container args depending on number of pods in replica set

I want to scale an application with workers.
There could be 1 worker or 100, and I want to scale them seamlessly.
The idea is using replica set. However due to domain-specific reasons, the appropriate way to scale them is for each worker to know its: ID and the total number of workers.
For example, in case I have 3 workers, I'd have this:
id:0, num_workers:3
id:1, num_workers:3
id:2, num_workers:3
Is there a way of using kubernetes to do so?
I pass this information in command line arguments to the app, and I assume it would be fine having it in environment variables too.
It's ok on size changes for all workers to be killed and new ones spawned.
Before giving the kubernetes-specific answer, I wanted to point out that it seems like the problem is trying to push cluster-coordination down into the app, which is almost by definition harder than using a distributed system primitive designed for that task. For example, if every new worker identifies themselves in etcd, then they can watch keys to detect changes, meaning no one needs to destroy a running application just to update its list of peers, their contact information, their capacity, current workload, whatever interesting information you would enjoy having while building a distributed worker system.
But, on with the show:
If you want stable identifiers, then StatefulSets is the modern answer to that. Whether that is an exact fit for your situation depends on whether (for your problem domain) id:0 being "rebooted" still counts as id:0 or the fact that it has stopped and started now disqualifies it from being id:0.
The running list of cluster size is tricky. If you are willing to be flexible in the launch mechanism, then you can have a pre-launch binary populate the environment right before spawning the actual worker (that example is for reading from etcd directly, but the same principle holds for interacting with the kubernetes API, then launching).
You could do that same trick in a more static manner by having an initContainer write the current state of affairs to a file, which the app would then read in. Or, due to all Pod containers sharing networking, the app could contact a "sidecar" container on localhost to obtain that information via an API.
So far so good, except for the
on size changes for all workers to be killed and new one spawned
The best answer I have for that requirement is that if the app must know its peers at launch time, then I am pretty sure you have left the realm of "scale $foo --replicas=5" and entered into the "destroy the peers and start all afresh" realm, with kubectl delete pods -l some-label=of-my-pods; which is, thankfully, what updateStrategy: type: OnDelete does, when combined with the delete pods command.
In the end, I've tried something different. I've used kubernetes API to get the number of running pods with the same label. This is python code utilizing kubernetes python client.
import socket
from kubernetes import client
from kubernetes import config
config.load_incluster_config()
v1 = client.CoreV1Api()
with open(
'/var/run/secrets/kubernetes.io/serviceaccount/namespace',
'r'
) as f:
namespace = f.readline()
workers = []
for pod in v1.list_namespaced_pod(
namespace,
watch=False,
label_selector="app=worker"
).items:
workers.append(pod.metadata.name)
workers.sort()
num_workers = len(workers)
worker_id = workers.index(socket.gethostname())