I've deployed my application on GCP Kubernetes and at times, I need to delete a node from one of the node pools.
Once I run kubectl delete node <node-id>, it takes about half an hour to an hour for a replacement node to come up, even when the nodes are gracefully stopped and then deleted, which is far too long. Autoscaling is set to 1-3 nodes.
How do I make the node spawning process faster?
Any leads are appreciated!
Node version: 1.22.10-gke.600
Size: Number of nodes: 0
Autoscaling: On (1-5 nodes)
CPU target limit: 40%
Node zones: us-east1-b
It usually takes 1-2 minutes for a node to respawn (when the scale-up condition is met). But when you delete a node, there may simply be no need for a new one, so the autoscaler won't add it.
If you want a node to spawn faster, either increase the traffic/load or lower the CPU utilization target in the HPA (say, to 50% or less).
For more detail, you can check this answer.
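For reference, lowering that target looks something like this in a HorizontalPodAutoscaler manifest (a minimal sketch only; the Deployment name and the exact utilization value are placeholders, and the autoscaling/v2beta2 API version shown here matches a 1.22 cluster, while newer clusters can use autoscaling/v2):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa              # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # placeholder workload
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30  # a lower target makes scale-up trigger earlier

A lower target means more replicas for the same load, which in turn pushes the cluster autoscaler to add nodes sooner.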
Related
I'm having an issue allocating more than 1 CPU to a pod that is running code that requires more processing power.
I have set the container to request 2 CPUs with a limit of 3 CPUs,
but when running, the container never goes above 1000m (1 CPU).
There is very little else running during this process, and KEDA will start new nodes if needed.
How can I assign more CPU power to this container?
UPDATE
So I changed the Default Limit as suggested by moonkotte, but I can only ever get just over 1 CPU.
New nodes come online via KEDA when more containers are required.
Each node has 4 CPUs, so there are sufficient resources.
These are the details of one of the nodes; it is running one of the containers in question.
It just isn't using all the CPU allocated to it.
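For context, the requests and limits described above would sit in the pod spec roughly like this (a sketch; the pod, container and image names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: cpu-heavy-pod            # placeholder name
spec:
  containers:
  - name: worker                 # placeholder container
    image: example/worker:latest # placeholder image
    resources:
      requests:
        cpu: "2"                 # the scheduler reserves 2 CPUs on the node
      limits:
        cpu: "3"                 # the container may burst up to 3 CPUs

Keep in mind that a limit only caps what the container may use; it does not force the process to use more than one core.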
Despite increasing the values of the variables that modify Airflow concurrency levels, I never get more than nine simultaneous pods.
I have an EKS cluster with two m4.large nodes, with capacity for 20 pods each. The whole system occupies 15 pods, so I have room for 25 more pods, but they never reach more than nine.
I have created a scaling policy because the scheduler gets a bit stressed when 500 DAGs are thrown at it at the same time, but EKS just creates an additional instance whose only effect is to redistribute the same nine pods.
I have also tested with two m4.2xlarge nodes, with capacity for almost 120 pods, and the result is the same, despite quadrupling the system's capacity and increasing the number of scheduler threads from 2 to 6.
These are the environment variable values I am using:
AIRFLOW__CORE__PARALLELISM = 1000
AIRFLOW__CORE__NON_POOLED_TASK_SLOT_COUNT = 1000
AIRFLOW__CORE__DAG_CONCURRENCY = 1000
AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE = 0
AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW = -1
What could be happening?
OK, I've found where the problem is: Kubernetes does not schedule pods well without requests or limits. I added requests and limits, and now the nodes fill up completely with 20 pods each.
Now I have another problem. The pods don't seem to disappear when they finish. The pods only print "Hello world", yet in dag_run there are DAGs that take anywhere from 49 seconds to 22 minutes. So even though there are more pods on each node, the whole system still takes more than 20 minutes to complete, as before.
Something is wrong. If I have two nodes that can host 100 pods, and every pod takes a minute to finish, then running five hundred pods simultaneously should complete all the work in about five minutes. But it always takes between 16 and 20 minutes. The nodes are never filled to capacity with pods, and the pods finish their work but take some time to be deleted. What makes it so slow?
Use Airflow 1.10.9 with this configuration:
ENV AIRFLOW__CORE__PARALLELISM=100
ENV AIRFLOW__CORE__NON_POOLED_TASK_SLOT_COUNT=100
ENV AIRFLOW__CORE__DAG_CONCURRENCY=100
ENV AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=100
ENV AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE=0
ENV AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW=-1
ENV AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=10
ENV AIRFLOW__SCHEDULER__MAX_THREADS=6
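If the scheduler runs inside the cluster, the same values can also be set through the Deployment's env section rather than baked into the image (a sketch; the Deployment name, labels and image tag are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-scheduler               # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airflow-scheduler
  template:
    metadata:
      labels:
        app: airflow-scheduler
    spec:
      containers:
      - name: scheduler
        image: apache/airflow:1.10.9    # assumed image
        env:
        - name: AIRFLOW__CORE__PARALLELISM
          value: "100"
        - name: AIRFLOW__CORE__DAG_CONCURRENCY
          value: "100"
        - name: AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE
          value: "10"
        - name: AIRFLOW__SCHEDULER__MAX_THREADS
          value: "6"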
I have Sidekiq custom metrics coming from the Prometheus adapter, and I have set up an HPA using those queue metrics from Prometheus. When the number of jobs in the Sidekiq queue goes above, say, 1000, the HPA triggers 10 new pods, and each pod then executes 100 jobs from the queue. When the backlog drops to, say, 400 jobs, the HPA scales down. But when the scale-down happens, the HPA kills pods, say 4 of them, while those 4 pods are still running 30-50 jobs each. When the HPA deletes these 4 pods, the jobs running on them are terminated as well, and those jobs are marked as failed in Sidekiq.
What I want to achieve is to stop the HPA from deleting pods that are still executing jobs. Moreover, I want the HPA not to scale down even after the load drops to the minimum, and instead delete pods only when the Sidekiq queue metric reports 0 jobs.
Is there any way to achieve this?
Weird usage, honestly: you're wasting resources even while your traffic is in the cool-down phase, but since you didn't provide further details, here it is.
Actually, it's not possible to achieve exactly what you want, since the expected behavior is to scale in response to the load on your workload. The only way to approximate it (and this is not recommended) is to change the Kubernetes Controller Manager's horizontal-pod-autoscaler-downscale-stabilization flag to a higher value.
JFI, the doc warns you:
Note: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the scale of the replicas set may keep thrashing as usual.
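For illustration, on a self-managed control plane that flag is passed to kube-controller-manager, usually in its static pod manifest (a sketch only; the file path and the rest of the flags depend on how the cluster was installed, and on managed offerings such as GKE or EKS the flag is not accessible at all):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (typical kubeadm location)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-downscale-stabilization=30m   # default is 5m; a longer window delays scale-down
    # ...keep the existing flags unchanged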
As per the discussion and the work done by #Hb_1993, it can be done with a pre-stop hook to delay the eviction, where the delay is based on the operation time or on some logic that knows whether the processing is done or not.
A pre-stop hook is a lifecycle handler that is invoked before a pod is terminated; we can attach to this event and perform some logic, such as a ping check, to make sure that our pod has completed processing the current request.
PS: Use this solution with a pinch of salt, as it might not work in all cases or could produce unintended results.
To do this, we introduce a sleep in the preStop hook that delays the shutdown sequence.
More details can be found in this article.
https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
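A minimal sketch of such a delay in the pod template (the names, the sleep duration and the grace period are assumptions; ideally the sleep would be replaced by a script that waits for Sidekiq to drain its in-flight jobs):

apiVersion: v1
kind: Pod
metadata:
  name: sidekiq-worker                   # placeholder name
spec:
  terminationGracePeriodSeconds: 300     # must be longer than the preStop delay plus in-flight job time
  containers:
  - name: worker                         # placeholder container
    image: example/sidekiq-worker:latest # placeholder image
    lifecycle:
      preStop:
        exec:
          # delay shutdown so in-flight jobs can finish before SIGTERM reaches the worker
          command: ["/bin/sh", "-c", "sleep 120"]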
I have a Kubernetes cluster with 2 slave nodes and 1 master node. When a node goes down, it takes Kubernetes approximately 5 minutes to see that failure. I am using dynamic provisioning for volumes, and this detection time is a bit too long for me. How can I reduce the time it takes to detect a node failure?
I found a post about it:
https://fatalfailure.wordpress.com/2016/06/10/improving-kubernetes-reliability-quicker-detection-of-a-node-down/
At the bottom of the post, it says we can reduce the detection time by changing these parameters:
kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)
I can change the node-status-update-frequency parameter for the kubelet, but I don't have any controller-manager program or command on the CLI. How can I change those parameters? Any other suggestions for reducing the failure-detection time would be appreciated.
...but I don't have any controller-manager program or command on the
CLI. How can I change those parameters?
You can change/add those parameters in the controller-manager systemd unit file and restart the daemon. Please check the man pages for the controller-manager here.
If you deploy the controller-manager as a microservice (pod), check the manifest file for that pod and change the parameters in the container's command section (for example, like this).
It's actually kube-controller-manager. You may also decrease --attach-detach-reconcile-sync-period from 1m to 15 or 30 seconds for kube-controller-manager; this allows for speedier volume attach/detach actions. How you change those parameters depends on how you set up the cluster.
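For example, on a kubeadm-style cluster those parameters go into the kube-controller-manager static pod manifest on the master node (a sketch; the path and the rest of the command list depend on your setup, and the kubelet restarts the static pod automatically when the file changes):

# /etc/kubernetes/manifests/kube-controller-manager.yaml
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --node-monitor-period=2s
    - --node-monitor-grace-period=16s
    - --pod-eviction-timeout=30s
    - --attach-detach-reconcile-sync-period=30s
    # ...keep the existing flags; the kubelet flag (--node-status-update-frequency=4s) is set separately on each node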
I have 3 YARN NodeManagers in a YARN cluster, and an issue with vcore availability per YARN node.
For example, I have:
on the first node: 15 vcores available,
on the second node: no vcores available,
on the third node: 37 vcores available.
Now a job tries to start and fails with the error:
"Queue's AM resource limit exceeded"
Is this connected to the second node having no vcores available, or can I somehow increase the resource limit for the queue?
I also want to mention that I have the following setting:
yarn.scheduler.capacity.maximum-am-resource-percent=1.0
That means your drivers have exceeded the maximum memory configured under Max Application Master Resources. You can either increase the max memory available for AMs or decrease the driver memory in your jobs.
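Assuming the jobs are Spark jobs (the mention of drivers suggests so), a sketch of the second option is to lower the driver memory at submit time; in cluster mode the driver runs inside the YARN ApplicationMaster, so this directly lowers the AM's resource request (the job file and memory values below are placeholders):

# lowering driver memory lowers the AM request in cluster mode
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 4g \
  your_job.py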