Retiring or landing concourse workers - concourse

We have multiple workers on our Concourse cluster. On average, each worker has 130 containers. Retiring or landing a worker takes up to 45 minutes. We suspect it waits until all volumes have been removed from baggageclaim.
We are not sure why it takes that long. My understanding is that the worker stops accepting new work, waits until its running jobs are finished, and then lands or retires. It seems to be doing much more than that.
We are using concourse 3.3.4 with binary deployment.

After upgrading to 3.4.0 the problem seems resolved.

Related

Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Thank you for reading this SO question. It may seem long, but I'll try to include as much information as possible to help get an answer.
Summary
We are currently experiencing a scheduling issue with our Flink cluster.
The symptoms are that some/most/all (it depends, the symptoms are not always the same) of our tasks are shown as SCHEDULED but fail after a timeout. The jobs are then shown as RUNNING.
The failing exception is the following one:
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
After analysis, we assume (we cannot prove it, as there are not many logs for that part of the code) that the failure is due to a deadlock or race condition that occurs when several jobs are submitted to the Flink cluster at the same time, even though we have enough slots available in the cluster.
We actually see the error with 52 task slots available and 12 jobs that are not scheduled.
Additional information
Flink version: 1.13.1 commit a7f3192
Flink cluster in session mode
2 Job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM; memory limit set to 4 GB)
50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; no limits set)
Our Flink cluster is shut down every night and restarted every morning. The error seems to occur when a lot of jobs need to be scheduled. The jobs are configured to restore their state, and we do not see any issues with jobs that are scheduled and run correctly, so it really seems to be a scheduling issue.
Questions
Could it be that the issue described in FLINK-23409 is actually the same one, but occurs only when there is a race condition while scheduling several jobs?
Is there any way to increase logging in the scheduler to debug this issue?
Is it a known issue? If yes, is there any workaround/solution to resolve it?
P.S.: A while ago, I asked more or less the same question on the ML, but dropped it. I'm sorry if this is considered cross-posting; it's not intended. We are just opening a new thread as we have more information and the issue has reoccurred.

How to terminate only certain pods based on whether or not they have finished a certain task in Kubernetes?

I'm having trouble finding a solution that allows me to terminate only certain pods in a deployment.
The application running inside the pods does some processing which can take a lot of time to finish.
Let's say I have 10 tasks that are stored in a database and I issue a command to scale the deployment to 10 pods.
Let's say that after some time 3 of the pods have finished their tasks and are no longer required.
How can I scale down the deployment from 10 to 7 while terminating only the pods that have finished their tasks, and not the pods that are still processing?
I don't know if more details are needed, but I will happily edit the question if more are required to answer this kind of problem.
In this case Kubernetes Job might be better suited for this kind of task.
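As a rough sketch of what that could look like: with a Job, each pod simply exits when its task is done and is not restarted, so there is no need to pick which pods to terminate. The snippet below builds a batch/v1 Job manifest with one pod per task and prints it as JSON (which kubectl also accepts). The Job name, container name and image are hypothetical placeholders, and how each pod claims a task from the database is left to the application.

```python
import json


def build_job_manifest(name, image, task_count):
    """Build a batch/v1 Job manifest that runs one pod per task.

    `name` and `image` are hypothetical placeholders for illustration.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # The Job is complete once this many pods have exited successfully.
            "completions": task_count,
            # Run all tasks at the same time, like scaling to 10 pods.
            "parallelism": task_count,
            "template": {
                "spec": {
                    "containers": [{"name": "worker", "image": image}],
                    # A finished pod is not restarted, so it frees its resources.
                    "restartPolicy": "Never",
                }
            },
        },
    }


manifest = build_job_manifest("process-tasks", "example.com/task-worker:latest", 10)
print(json.dumps(manifest, indent=2))
```

The printed manifest could be saved to a file and applied with kubectl apply -f.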

How do we choose --nthreads and --nprocs per worker in Dask distributed running via helm on Kubernetes?

I'm running some I/O intensive Python code on Dask and want to increase the number of threads per worker. I've deployed a Kubernetes cluster that runs Dask distributed via helm. I see from the worker deployment template that the number of threads for a worker is set to the number of CPUs, but I'd like to set the number of threads higher unless that's an anti-pattern. How do I do that?
From this similar question, it looks like I can SSH into the Dask scheduler and spin up workers with dask-worker. But ideally I'd be able to configure the worker resources via helm, so that I don't have to interact with the scheduler other than submitting jobs to it via the Client.
Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. For more information, see link 1 (the best practices described in Dask's official documentation) and link 2.
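One way to keep the two in sync is to derive both from the same numbers. The following is only a sketch: the helper name, the scheduler address and the resource values are made up for illustration, and it is not how the helm chart itself builds the command.

```python
def worker_container_settings(scheduler_address, cpus, memory_gib):
    """Derive Kubernetes resources and dask-worker arguments from one source.

    All names and values here are illustrative assumptions.
    """
    resources = {
        "requests": {"cpu": str(cpus), "memory": f"{memory_gib}Gi"},
        "limits": {"cpu": str(cpus), "memory": f"{memory_gib}Gi"},
    }
    args = [
        "dask-worker",
        scheduler_address,
        "--nthreads", str(cpus),                # same as the CPU limit
        "--memory-limit", f"{memory_gib}GiB",   # same as the memory limit
    ]
    return resources, args


# Hypothetical scheduler address; 2 CPUs and 4 GiB per worker pod.
resources, args = worker_container_settings("tcp://dask-scheduler:8786", 2, 4)
print(resources)
print(" ".join(args))
```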
Threading in Python is a careful art and really depends on your code. To take the easy one first, --nprocs should almost certainly be 1; if you want more processes, launch more replicas instead. For the thread count, first remember that the GIL means only one thread can be running Python code at a time. So you only get concurrency gains in two main situations: 1) some threads are blocked on I/O, such as waiting to hear back from a database or web API, or 2) some threads are running non-GIL-bound C code inside NumPy or friends. In the second situation, you still can't get more concurrency than the number of CPUs, since that's how many slots there are to run at once, but the first can benefit from more threads than CPUs in some situations.
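To illustrate the first situation, here is a small standard-library sketch (it does not use Dask). It fakes a blocking I/O call with time.sleep, which releases the GIL while waiting, and compares one thread against eight. The function names, task count and delay are arbitrary choices for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_call(delay):
    """Stand-in for a database or web API call; sleeping releases the GIL."""
    time.sleep(delay)
    return delay


def run_with_threads(n_threads, n_tasks=8, delay=0.2):
    """Run n_tasks fake I/O calls on n_threads threads and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(fake_io_call, [delay] * n_tasks))
    return time.perf_counter() - start


one_thread = run_with_threads(1)
eight_threads = run_with_threads(8)
print(f"1 thread:  {one_thread:.2f}s")
print(f"8 threads: {eight_threads:.2f}s")
```

Because the threads spend their time waiting rather than running Python code, the eight-thread run finishes much sooner even on a machine with fewer than eight CPUs. CPU-bound pure-Python work would not show the same gain.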
Dask's helm chart currently has a limitation: it does not allow --nthreads to be set in the chart. I confirmed this with the Dask team and filed an issue: https://github.com/helm/charts/issues/18708.
In the meantime, use Dask Kubernetes for a higher degree of customization.

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I upgraded Concourse from 3.4.0 to 3.5.0, all running jobs suddenly changed their state from running to errored. I can see the string 'no workers' at the start of their logs now. Jobs started manually or triggered by subsequent changes ran without any problem.
The upgrade of concourse itself was successful.
I was watching what bosh did at the time and I saw this change of job states took place all at once while either the web or the db VM was upgraded (I don't know which one). I am pretty sure that the worker VMs were not touched yet by bosh.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM, it's possible that it was out of service long enough for all workers to expire. Workers heartbeat continuously, and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished, but if scheduling happened before they heartbeated again, that would cause those errors.
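The timing argument can be written down as a simplified model. This is not Concourse's implementation; the function name is made up, and the 30-second interval is only inferred from "two heartbeats take 1 minute".

```python
def workers_would_stall(web_downtime_seconds,
                        heartbeat_interval_seconds=30,
                        missed_heartbeats_allowed=2):
    """Simplified model: workers stall once the web node has been
    unreachable long enough for them to miss the allowed heartbeats.

    The default interval of 30 seconds is an assumption derived from
    "two missed heartbeats take 1 minute".
    """
    stall_after = heartbeat_interval_seconds * missed_heartbeats_allowed
    return web_downtime_seconds >= stall_after


# A short restart of the web VM versus a longer upgrade window.
print(workers_would_stall(20))
print(workers_would_stall(90))
```

Under this model, any outage of the single web VM that lasts a minute or more leaves the scheduler with no registered workers until they heartbeat again.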

Kubernetes: active job is erroneously marked as completed after cluster upgrade

I have a working Kubernetes cluster (v1.4.6) with an active job that has a single failing pod (i.e. it is constantly restarted). This is a test; the job should never reach completion.
If I restart the same cluster (e.g. reboot the node), the job is properly re-scheduled and continues to be restarted.
If I upgrade the cluster to v1.5.3, the job is marked as completed once the cluster is up. The upgrade is basically the same as a restart: both use the same etcd cluster.
Is this the expected behavior when going to v1.5.x? If not, what can be done to have the job continue running?
I should provide a little background on my problem - the job is to ultimately become a driver in the update process and it is important to have it running (even in face of cluster restarts) until it achieves a certain goal. Is this possible using a job?
In v1.5.0 extensions/v1beta1.Jobs was deprecated in favor of batch/v1.Job, so simply upgrading the cluster without updating the job definition is expected to cause side effects.
See the Kubernetes CHANGELOG for a complete list of changes and deprecations in v1.5.0.