Kubernetes: active job is erroneously marked as completed after cluster upgrade - kubernetes

I have a working kubernetes cluster (v1.4.6) with an active job that has a single failing pod (e.g. it is constantly restarted) - this is a test, the job should never reach completion.
If I restart the same cluster (e.g. reboot the node), the job is properly re-scheduled and continues to be restarted
If I upgrade the cluster to v1.5.3, then the job is marked as completed once the cluster is up. The upgrade is basically the same as restart - both use the same etcd cluster.
Is this the expected behavior when going to v1.5.x? If not, what can be done to have the job continue running?
I should provide a little background on my problem - the job is to ultimately become a driver in the update process and it is important to have it running (even in face of cluster restarts) until it achieves a certain goal. Is this possible using a job?

In v1.5.0 extensions/v1beta1.Jobs was deprecated in favor of batch/v1.Job, so simply upgrading the cluster without updating the job definition is expected to cause side effects.
See the Kubernetes CHANGELOG for a complete list of changes and deprecations in v1.5.0.

Related

Airflow fault tolerance

I have 2 questions:
first, what does it mean that the Kubernetes executor is fault tolerance, in other words, what happens if one worker nodes gets down?
Second question, is it possible that the whole Airflow server gets down? if yes, is there a backup that runs automatically to continue the work?
Note: I have started learning airflow recently.
Thanks in advance
This is a theoretical question that faced me while learning apache airflow, I have read the documentation
but it did not mention how fault tolerance is handled
what does it mean that the Kubernetes executor is fault tolerance?
Airflow scheduler use a Kubernetes API watcher to watch the state of the workers (tasks) on each change in order to discover failed pods. When a worker pod gets down, the scheduler detect this failure and change the state of the failed tasks in the Metadata, then these tasks can be rescheduled and executed based on the retry configurations.
is it possible that the whole Airflow server gets down?
yes it is possible for different reasons, and you have some different solutions/tips for each one:
problem in the Metadata: the most important part in Airflow is the Metadata where it's the central point used to communicate between the different schedulers and workers, and it is used to save the state of all the dag runs and tasks, and to share messages between tasks, and to store variables and connections, so when it gets down, everything will fail:
you can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
you can deploy it on your K8S cluster but in HA mode (doc for postgresql)
problem with the scheduler: the scheduler is running as a pod, and the is a possibility to lose depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problem
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) for the scheduler to activate HA mode, in this case if a scheduler gets down, there will be other schedulers up
problem with webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problem
It's a stateless service, so you can create multiple replicas without any problem, if one gets down, you will access the UI/API using the other replicas

Liquibase causing PostgreSQL DB Lock - Microservices running as a pod in AKS

We are multiple microservices(With Liquibase) running on Azure AKS cluster as a pod.
Frequently we have noticed DB locks and pods will crash as it will fail in health checks.
Is there a way to overcome this scenario as it is impacting a lot. We have to manually unlock DB table, so that pod will start.
In one of the logs, I’ve noticed below error
I believe, it needs to be handled from Application(Springboot).
You can write a piece of code that executes at the start of application that will release the lock if found. Then the database connection won't fail.
Currently using the same for our environment.

Upgrade container for long running jobs in kubernetes

I have long running workers running in kubernetes - more than 5 hours. I want to update the container without interrupting the long running jobs. I want any newly started work off the queue to start with the new version of the release but I don't want to interrupt the currently running work.
BTW I'm not actually using Jobs, I'm using Deployments with workers that get work off a redis queue.
What is the best way to do to do a release without killing the long running work?
Have a huge timeout for SIGTERM
preStop hooks?
Another container in the pod that checks for the latest version and updates once work is done?

GridGain server deployment/Statefulset Termination grace period

I deployed gridgain cluster in google kubernetes cluster following[1]. I enabled native persistency using statefulset. In statefulset.yaml in [2] terminationGracePeriodSeconds set to 60000. What is the purpose of this large timeout?
When deleting pod using kubectl delete pod command it take very large time. What is the suitable value for terminationGracePeriodSeconds without loss any data.
[1]. https://www.gridgain.com/docs/latest/installation-guide/kubernetes/gke-deployment
[2]. https://www.gridgain.com/docs/latest/installation-guide/kubernetes/gke-deployment#creating-pod-configuration
I believe the reason behind setting it to 60000 was - do not rely on it. Prior to Ignite 2.9 there was an issue with the startup script that didn't bypass SYS SIGNAL to the underlying Java app, making it impossible to perform a graceful shutdown.
If a node is being restarted gracefully and IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN is enabled, Ignite will ensure that the node leave won't lead to a data loss. Sometimes a rebalance might take a while.
Keeping the above in mind: the hang issue might happen for Apache Ignite 2.8 and below, keeping the recommended terminationGracePeriodSeconds should be fine and never be used in practice (in a normal flow).

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I did the upgrade of concourse from 3.4.0 to 3.5.0, suddenly all running jobs changed their state from running to errored. I can see the string 'no workers' appearing at the start of their log now. Starting the jobs manually or triggered by the next changes didn't have any problem.
The upgrade of concourse itself was successful.
I was watching what bosh did at the time and I saw this change of job states took place all at once while either the web or the db VM was upgraded (I don't know which one). I am pretty sure that the worker VMs were not touched yet by bosh.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM it's possible that it was out of service for long enough that all workers expired. Workers continuously heartbeat and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished but if scheduling happened before they heartbeats, that would cause those errors.