I have a cluster where jobs are created depending on what my users do.
Sometimes there are 0 jobs running in parallel, and sometimes 20 to 100.
I have set the following limits for each container:
cpu limit: 512m
memory limit: 512Mi
cpu request: 256m
memory request: 128Mi
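In a Job manifest those values correspond to a resources block roughly like the one below (the job, container, and image names are placeholders, not taken from my cluster):
apiVersion: batch/v1
kind: Job
metadata:
  name: user-triggered-job            # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker                # placeholder name
          image: my-registry/worker:1.2.3   # placeholder, versioned tag as described
          resources:
            requests:
              cpu: 256m
              memory: 128Mi
            limits:
              cpu: 512m
              memory: 512Mi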
By default I have 2 nodes, each with:
7.91 allocatable CPU
10.16 GB allocatable memory
The node pool can scale to 5 nodes max.
But when the cluster starts to have 8 or more jobs in parallel, the new jobs become Pending, waiting for other jobs to finish.
If a job is scheduled immediately, it completes in 6 to 7 seconds.
But when the cluster starts to struggle at 8 or 10 jobs, each job takes approximately 20 seconds to complete, because it is blocked in the Pending or ContainerCreating state.
I use IfNotPresent as the imagePullPolicy and each image has a version tag.
Given my allocatable resources, I expected the cluster to start struggling at around 28 parallel jobs, then create a new node, and so on.
Why am I wrong?
Is it possible to force each container to start without going through the Pending state?
I have found an alternative scheduler, poseidon-firmament-alternate-scheduler, but I am not sure whether it can help me.
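For what it's worth, opting a pod into an alternative scheduler is just a matter of setting spec.schedulerName; a minimal sketch is shown below (the scheduler name here is an assumption, check the Poseidon/Firmament deployment for the actual value it registers):
apiVersion: v1
kind: Pod
metadata:
  name: job-pod-example          # placeholder name
spec:
  schedulerName: poseidon        # assumed name of the alternative scheduler
  containers:
    - name: worker               # placeholder name
      image: my-registry/worker:1.2.3   # placeholder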
Related
I have a Container Apps environment that contains 7 container apps; each container app has two containers inside, one for the API and another that works as a background worker.
When making a new revision, it was taking around 1 to 5 minutes to be provisioned; checking the system logs, I found an entry that says insufficient cpu.
If we remove all container apps inside the environment and start with one container app, the issue is gone and the revision takes a few seconds to start.
I am running an Airflow cluster on EKS on AWS. I have set up some scaling config for the worker setup: if CPU/memory usage is above 70%, Airflow spins up a new worker pod. However, I am facing an issue when these worker pods scale down. When worker pods start scaling down, two things happen:
If no task is running on a worker pod, it terminates within 40 seconds.
If any task is running on a worker pod, it terminates in about 8 minutes, and after one more minute I find the task marked as failed in the UI.
I have set up the two properties below in the Helm chart for worker pod termination.
workers:
  celery:
    ## if celery worker Pods are gracefully terminated
    ## - consider defining a `workers.podDisruptionBudget` to prevent there not being
    ##   enough available workers during graceful termination waiting periods
    ##
    ## graceful termination process:
    ##  1. prevent worker accepting new tasks
    ##  2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
    ##  3. send SIGTERM to worker
    ##  4. wait AT MOST `workers.terminationPeriod` for kill to finish
    ##  5. send SIGKILL to worker
    ##
    gracefullTermination: true

    ## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
    ##
    gracefullTerminationPeriod: 180

  ## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
  ## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
  ##   to understand with KubernetesPodOperator(), as Pods may continue running
  ##
  terminationPeriod: 120
From this I understand that the worker pod should shut down after 5 minutes at most, whether a task is running or not, so I am not sure why I see a total of 8 minutes for worker pod termination. My main question is whether there is any way to configure this so that a worker pod only terminates once the task running on it has finished execution. Since tasks in my DAGs can run anywhere from a few minutes to a few hours, I don't want to put a large value in gracefullTerminationPeriod. I would appreciate any solution around this.
Some more info: generally the long-running task is a PythonOperator that runs either a Presto SQL query or a Databricks job, via PrestoHook or DatabricksOperator respectively. I don't want these to receive SIGTERM before they complete their execution when the worker pod scales down.
This is not possible due to limitations on the Kubernetes side. More details are available here. However, using a large value for gracefullTerminationPeriod works; this is not what I intended to do, but it works better than I originally thought. When a large value of gracefullTerminationPeriod is set, workers don't actually wait around for the whole gracefullTerminationPeriod before terminating: if a worker pod is marked for termination, it terminates as soon as the number of tasks running on it reaches zero.
Until Kubernetes accepts the proposed changes and a new community Helm chart is released, I think this is the best solution that does not incur the cost of keeping workers up.
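In practice the workaround amounts to something like the following in the community chart's values (the period below is a placeholder; pick something longer than your longest expected task):
workers:
  celery:
    gracefullTermination: true
    ## placeholder value, large enough to cover the longest expected task; the worker
    ## still exits as soon as its running task count reaches zero
    gracefullTerminationPeriod: 36000
  terminationPeriod: 120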
I've deployed my application on GCP Kubernetes and at times, I need to delete a node from one of the node pools.
Once I run kubectl delete node <node-id>, it takes about half an hour to an hour for a new node to come up in its place, even when the nodes are gracefully stopped and then deleted, which is a long time. The auto-scaling is set at 1-3.
How do I make the node spawning process faster?
Any leads are appreciated!
Node version: 1.22.10-gke.600
Size: Number of nodes: 0
Autoscaling: On (1-5 nodes)
CPU target limit: 40%
Node zones: us-east1-b
It usually takes 1-2 minutes for a node to respawn (when the scale-up condition is met). But when you delete a node, there may simply be no need for a new one at that moment.
If you want it to spawn faster, either increase the traffic/load or decrease the CPU target in the HPA (below the current 40% target).
For more, you can check this answer.
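As an illustration, lowering the CPU target on the HPA would look roughly like this (the deployment name and the 30% value are assumptions, not taken from the question):
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa               # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                 # placeholder Deployment
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 30   # lower than the current 40% so scale-up (and new nodes) trigger sooner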
I'm having an issue allocating more than 1 CPU to a pod that is running code which needs more processing power.
I have set the container's CPU limit to 3,
and the container requests 2 CPUs with that limit of 3.
But when the container runs, it never goes over 1000m (1 CPU).
There is very little else running during this process, and KEDA will start new nodes if needed.
How can I assign more CPU power to this container?
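For context, the relevant part of the container spec looks roughly like this (pod, container, and image names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: cpu-heavy-pod            # placeholder name
spec:
  containers:
    - name: worker               # placeholder name
      image: my-registry/worker:latest   # placeholder image
      resources:
        requests:
          cpu: "2"               # request 2 CPUs
        limits:
          cpu: "3"               # allow up to 3 CPUs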
UPDATE
So I changed the default limit as suggested by moonkotte, but I can only ever get just over 1 CPU.
New nodes come online through KEDA when more containers are required.
Each node has 4 CPUs, so there are sufficient resources.
These are the details of one of the nodes while it is running one of the containers in question.
It just isn't using all the CPU allocated.
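For reference, the "default limit" change presumably refers to a namespace LimitRange; a minimal sketch of raising the default container CPU values might look like this (the name, namespace, and numbers are assumptions):
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults             # placeholder name
  namespace: workers             # placeholder namespace
spec:
  limits:
    - type: Container
      default:                   # limit applied when a container specifies none
        cpu: "3"
      defaultRequest:            # request applied when a container specifies none
        cpu: "2"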
Despite increasing the values of the variables that modify Airflow concurrency levels, I never get more than nine simultaneous pods.
I have an EKS cluster with two m4.large nodes, with capacity for 20 pods each. The whole system occupies 15 pods, so I have room for 25 more pods, but I never get more than nine.
I have created a scaling policy because the scheduler gets a bit stressed when 500 DAGs are thrown at it at the same time, but EKS just creates an additional node that does nothing except spread out the same nine pods.
I have also tested with two m4.2xlarge nodes, with capacity for almost 120 pods, and the result is the same, despite quadrupling the system's capacity and increasing the number of scheduler threads from 2 to 6.
These are the environment variable values that I use.
AIRFLOW__CORE__PARALLELISM = 1000
AIRFLOW__CORE__NON_POOLED_TASK_SLOT_COUNT = 1000
AIRFLOW__CORE__DAG_CONCURRENCY = 1000
AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE = 0
AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW = -1
What could be happening?
OK, I've found where the problem is: Kubernetes does not schedule pods well when they have no requests or limits. I have added requests and limits, and now the nodes fill up completely with 20 pods each.
Now I have another problem: the pods don't seem to disappear when they finish. The pods only print "Hello world", yet in dag_run there are DAGs that take anywhere from 49 seconds to 22 minutes. So even though there are more pods on each node, the whole system still takes more than 20 minutes to complete, just as before.
Something is wrong. If I have two nodes that can host 100 pods, and every pod takes a minute to finish, then running five hundred pods simultaneously should finish all the work in about five minutes. But it always takes between 16 and 20 minutes. The nodes are never filled to full capacity, and the pods finish their work but take some time to be deleted. What makes it so slow?
I use Airflow 1.10.9 with this configuration:
ENV AIRFLOW__CORE__PARALLELISM=100
ENV AIRFLOW__CORE__NON_POOLED_TASK_SLOT_COUNT=100
ENV AIRFLOW__CORE__DAG_CONCURRENCY=100
ENV AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=100
ENV AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE=0
ENV AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW=-1
ENV AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=10
ENV AIRFLOW__SCHEDULER__MAX_THREADS=6