I have a replication controller managing a single pod (1 replica) that takes ~10 minutes to start. As my application grows over time, that duration is going to increase.
My problem is that when I deploy a new version, the old pod is killed first, and only then does the new one start.
Is it possible to make Kubernetes not kill the old pod during a rolling update until the new pod is running?
It's okay for me to have multiple replicas if it is necessary, but that did not fix the issue.
The replication controller has livenessProbe and readinessProbe set correctly.
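For context, the probes look roughly like this (the image, path, port, and timings below are placeholders rather than my real values):
containers:
  - name: app
    image: my-app:latest          # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz            # placeholder endpoint
        port: 8080
      initialDelaySeconds: 600    # the app currently takes ~10 minutes to become ready
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 660
      periodSeconds: 30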
I kept searching, and it's not possible right now (13 Oct 2015), but I opened an issue you can follow: https://github.com/kubernetes/kubernetes/issues/15557.
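For anyone finding this later: the behaviour being asked for (bring the replacement up, wait for it to become Ready, only then kill the old pod) corresponds to a rolling update with maxUnavailable: 0 and maxSurge: 1. A rough sketch using the apps/v1 Deployment API, with placeholder names and image:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                   # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never kill the old pod before a replacement is Ready
      maxSurge: 1                # allow one extra pod during the rollout
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest   # placeholder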
I have a Job that runs out of memory, and Kubernetes keeps trying to run it again, despite it having no chance of succeeding, since it will use the same amount of memory every time. I want the Job to simply fail and sit there; I'll take care of creating a new one with a higher memory limit, if desired, and/or deleting the existing failed Job.
I have:
restartPolicy: Never
backoffLimit: 0
From the not-so-clear things I've read, setting backoffLimit to 1 might do the trick. But is that true? Would that make it restart once, or is the 1 the number of times it can be run, including the first attempt?
Should I switch from Jobs to bare Pods? The main issue with that is that I don't think Kubernetes will restart a bare Pod on another worker node should the one it's running on go down, and that's a situation where I'd want the job to automatically be restarted on another node.
backoffLimit should be 1, as shown below:
backoffLimit: 1
Setting backoffLimit to 0 is correct if the Job is supposed to run once and not be restarted:
backoffLimit: Specifies the number of retries before marking this job failed.
Switching your workload to a plain Pod would only make sense if you are not interested in restarts in combination with backoff limits.
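If it stays a Job, a minimal sketch of a spec that runs once and is simply left in the Failed state after an OOM might look like this (the name, image, and memory limit are placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot-job              # placeholder
spec:
  backoffLimit: 0                 # zero retries: the first failed attempt marks the Job failed
  template:
    spec:
      restartPolicy: Never        # don't let the kubelet restart the container in place either
      containers:
        - name: worker
          image: my-worker:latest # placeholder
          resources:
            limits:
              memory: 512Mi       # placeholder limit; raise it in a new Job if needed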
Firstly, yes, I have read this: https://www.liquibase.com/blog/using-liquibase-in-kubernetes
and I have also read many SO threads where people answer "I solved the issue by using an init container".
I understand that for most people this might have fixed the issue, because their pods were going down when the migration took too long and the Kubernetes probes killed them.
But what about when a new deployment is applied while the previous deployment is stuck in a failed state (Kubernetes trying again and again to launch the pods without success)?
When the new deployment is applied, it will simply wipe/replace all the failing pods, and if this happens while Liquibase holds the lock, the pods (and their init containers) are killed and the DB is left in a locked state, requiring manual intervention.
Unless I am missing something about Kubernetes init containers, using them doesn't really solve the issue described above, right?
Is that the only solution currently available? What other solutions could be used to avoid manual intervention?
My first thought was to add some kind of custom code, either directly in the app before the Liquibase migration happens, or in an init container that runs before the Liquibase init container, to automatically unlock the DB if the lock is, let's say, more than 5 minutes old.
Would that be acceptable, or would it cause other issues I'm not thinking about?
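To make the idea concrete, here is a rough sketch of such an init container, assuming a PostgreSQL database and Liquibase's standard DATABASECHANGELOGLOCK table; the image, host, credentials, and the 5-minute threshold are only examples:
initContainers:
  - name: clear-stale-liquibase-lock
    image: postgres:15                 # example; any image with psql works
    env:
      - name: PGPASSWORD
        valueFrom:
          secretKeyRef:
            name: db-credentials       # hypothetical secret holding the DB password
            key: password
    command: ["sh", "-c"]
    args:
      - |
        psql -h db-host -U app -d appdb -c "
          UPDATE databasechangeloglock
          SET locked = false, lockgranted = NULL, lockedby = NULL
          WHERE id = 1 AND locked = true
            AND lockgranted < now() - interval '5 minutes';"
  - name: liquibase-migration          # the existing Liquibase init container runs afterwards
    image: my-migrations:latest        # placeholder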
I have a cluster that includes a CronJob scheduled to run every 5 minutes.
We recently experienced an issue that incurred downtime and required manual recovery of the cluster. Although the cluster is now healthy again, this particular CronJob is failing to run with the following error:
Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.
I understand that the CronJob has 'missed' a number of scheduled jobs while the cluster was down, and that this has passed the threshold beyond which no further jobs will be scheduled.
How can I reset the number of missed start times and have these jobs scheduled again (without all the missed jobs suddenly being scheduled to run)?
Per the Kubernetes CronJob docs, there does not seem to be a way to cleanly resolve this. Setting the .spec.startingDeadlineSeconds value to a large number will re-schedule all missed occurrences that fall within the increased window.
My solution was just to kubectl delete cronjob x-y-z and recreate it, which worked as desired.
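For reference, this is roughly where .spec.startingDeadlineSeconds sits in the CronJob spec (assuming the batch/v1 API; the container and the 200-second value are placeholders):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: x-y-z                        # the CronJob from the question
spec:
  schedule: "*/5 * * * *"
  startingDeadlineSeconds: 200       # placeholder; only misses inside this window are counted
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: task
              image: busybox         # placeholder
              command: ["sh", "-c", "echo run"]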
I am running Cassandra as a Kubernetes pod. Each pod has one Cassandra container. We are running Cassandra version 3.11.4 with auto_bootstrap set to true. I have 5 nodes in production, holding 20 GB of data.
Because of maintenance activity, if I restart any Cassandra pod it takes 30 minutes to bootstrap before it comes into the UP and Normal state. In production, 30 minutes is a huge amount of time.
How can I reduce the boot-up time for the Cassandra pod?
Thank you!
If you're restarting an existing node and the data is still there, then it's not a bootstrap of the node - it's just a restart.
One of the potential problems is that you're not draining the node before the restart, so all commit logs need to be replayed on startup, and this can take a lot of time if you have a lot of data in the commit log (you can check system.log to see what Cassandra is doing at that time). So the solution could be to execute nodetool drain before stopping the node.
If the node is restarted because of a crash or something similar (so you can't drain it first), you can think in the direction of regularly flushing the data from the memtables, for example via nodetool flush, or by configuring tables with a periodic flush via the memtable_flush_period_in_ms option on the busiest tables. But be careful with that approach, as it may create a lot of small SSTables, and this will add more load on the compaction process.
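In a Kubernetes setup, one way to make sure the drain actually happens before a planned restart is a preStop hook on the Cassandra container. A rough sketch, assuming the stock cassandra image where nodetool is on the PATH; the grace period value is only an example and must be long enough for the drain to finish:
spec:
  terminationGracePeriodSeconds: 1800   # example value; give the drain time to complete
  containers:
    - name: cassandra
      image: cassandra:3.11.4
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "nodetool drain"]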
I have a working Kubernetes cluster (v1.4.6) with an active job that has a single failing pod (i.e. it is constantly restarted) - this is a test; the job should never reach completion.
If I restart the same cluster (e.g. reboot the node), the job is properly re-scheduled and continues to be restarted.
If I upgrade the cluster to v1.5.3, then the job is marked as completed once the cluster is up. The upgrade is basically the same as a restart - both use the same etcd cluster.
Is this the expected behavior when going to v1.5.x? If not, what can be done to have the job continue running?
I should provide a little background on my problem - the job is to ultimately become a driver in the update process and it is important to have it running (even in face of cluster restarts) until it achieves a certain goal. Is this possible using a job?
In v1.5.0 extensions/v1beta1.Jobs was deprecated in favor of batch/v1.Job, so simply upgrading the cluster without updating the job definition is expected to cause side effects.
See the Kubernetes CHANGELOG for a complete list of changes and deprecations in v1.5.0.
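Concretely, that means the job definition should target batch/v1 instead of extensions/v1beta1; a minimal sketch with placeholder names:
apiVersion: batch/v1              # previously extensions/v1beta1
kind: Job
metadata:
  name: update-driver             # placeholder
spec:
  template:
    spec:
      restartPolicy: OnFailure    # restart the failed container so the Job keeps running until it succeeds
      containers:
        - name: driver
          image: my-driver:latest # placeholder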