Running multiple pods simultaneously takes a lot of time in Kubernetes

On my local machine, I am running multiple pods at the same time. The work takes a long time to complete, even though all the pods reach the Running state almost instantly. Each pod runs a Docker image (1.8 GB). When I run the pods serially, each run takes around 12 seconds, but when I run them in parallel, the time increases exponentially and is not even on par with the serial run. What could be the probable cause of this?
EDIT 1
The operation is really CPU intensive, with usage reaching above 90%. Is there a way to queue pods for CPU resources as they arrive, so that instead of all of them slowing down, each one executes quickly in turn?
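To make the trade-off in the question concrete, here is a rough back-of-the-envelope model in Python. It assumes an idealized machine where one job alone saturates the CPU and where jobs running together share the CPU perfectly evenly; the function names are made up for illustration. Under those assumptions, parallel runs cannot finish sooner than a queue would, and anything slower than this model points to extra overhead (context switching, memory pressure, disk I/O).

```python
def completion_times_queued(n_jobs, seconds_per_job):
    """Jobs run one after another: job k finishes after k * seconds_per_job."""
    return [seconds_per_job * (i + 1) for i in range(n_jobs)]


def completion_times_shared(n_jobs, seconds_per_job):
    """All jobs start together and split the CPU evenly (idealized assumption).

    Each job then progresses at 1/n_jobs speed, so they all finish together
    at n_jobs * seconds_per_job.
    """
    return [seconds_per_job * n_jobs] * n_jobs


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    queued = completion_times_queued(3, 12)
    shared = completion_times_shared(3, 12)
    print("queued:", queued, "mean:", mean(queued))
    print("shared:", shared, "mean:", mean(shared))
```

In this model the last job finishes at the same moment either way, but the queued jobs finish earlier on average, which is the effect the question is asking for.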

Related

Kubernetes Handling a Sudden Request of Processing Power (Such as a Python Script using 5 Processes)

I have a little scenario that I am running in my mind with the following setup:
A Django web server running in Kubernetes with the ability to autoscale resources (Google Kubernetes Engine), where I set the resource values to request nodes with 8 processing units (8 cores) and 16 GB of RAM.
Because it is a web server, I have my frontend that can call a Python script that executes with 5 Processes, and here's what I am worried about:
I know that if I run this script twice on my web server (located in the same container as my Django code), I am going to be using (to keep it simple) 10 processes/CPUs to execute this code.
So what would happen?
Would the first Python script be run on Pod 1 and the second Python script (since we used 5 out of the 8 processing units) trigger a Pod 2 and another node, then run on that new replica with full access to 5 new processes?
Or would the first Python script be run on Replica 1, and then the second Python script be throttled to 3 processing units because, perhaps, Kubernetes allocates based on CPU usage in the replica, not on how many processes I called the script with?
If your system has a Django application that launches scripts with subprocess or a similar mechanism, those will always be in the same container as the main server, in the same pod, on the same node. You'll never trigger any of the Kubernetes autoscaling capabilities. If the pod has resource limits set, you could get CPU utilization throttled, and if you exceed the configured memory limit, the pod could get killed off (with Django and all of its subprocesses together).
If you want to take better advantage of Kubernetes scheduling and resource management, you may need to restructure this application. Ideally you could run the Django server and each of the supporting tasks in a separate pod. You would then need a way to trigger the tasks and collect the results.
I'd generally build this by introducing a job queue system such as RabbitMQ or Celery into the mix. The Django application adds items to the queue, but doesn't directly do the work itself. Then you have a worker for each of the processes that reads the queue and does work. This is not directly tied to Kubernetes, and you could run this setup with a RabbitMQ or Redis installation and a local virtual environment.
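As a minimal sketch of that producer/worker split, here is the same shape using only the Python standard library. The in-process queue.Queue is just a stand-in for a real broker such as RabbitMQ or Redis, and run_workers and handler are hypothetical names, not part of Django or Celery.

```python
import queue
import threading


def run_workers(tasks, handler, num_workers=2):
    """Push tasks onto a shared queue and let a fixed pool of workers drain it."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: this worker has no more work
                work.task_done()
                return
            outcome = handler(item)
            with results_lock:
                results.append(outcome)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    # The "web server" side only enqueues; it never does the work itself.
    for task in tasks:
        work.put(task)
    for _ in threads:  # one sentinel per worker so each one exits
        work.put(None)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(sorted(run_workers([1, 2, 3, 4], lambda x: x * x)))
```

The point of the design is that the number of workers, not the number of incoming requests, bounds how much work runs at once.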
If you deploy this setup to Kubernetes, then each of the tasks can run in its own deployment, fed by the work queue. Each task could run up to its own configured memory and CPU limits, and they could run on different nodes. With a little extra work you can connect a horizontal pod autoscaler to scale the workers based on the length of the job queue, so if you're running behind processing one of the tasks, the HPA can cause more workers to get launched, which will create more pods, which can potentially allocate more nodes.
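For a feel of how the HPA turns a metric such as queue length per worker into a replica count, here is a small sketch of the scaling rule described in the Kubernetes documentation: desired = ceil(current replicas * current metric / target metric), clamped to the configured bounds. The function name and the default bounds are assumptions for illustration, and the real controller also applies a tolerance and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA rule: scale proportionally to metric / target, then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 2 workers, each seeing 150 queued jobs on average, target 100 per worker
    print(desired_replicas(2, 150, 100))
```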
The important detail here, though, is that a pod is the key unit of scaling; if all of your work stays within a single pod then you'll never trigger the horizontal pod autoscaler or the cluster autoscaler.

What will happen if the resources I requested are not enough while the job is running?

In Slurm, what will happen if the resources I requested are not enough while the job is running?
For example, myscript.sh contains #SBATCH --mem=10G; #SBATCH --cpus-per-task=2; python mytrain.py. After I run sbatch myscript.sh, the job is successfully allocated the requested CPUs (2) and memory (10 GB). But while the job is running, the program needs more than 10 GB of memory (for example, when loading a big video dataset), and I found that the job is not killed. It keeps working normally.
So my question is: are there any side effects when I underestimate the resources I need? (Memory seems okay, but is it still okay if the requested number of CPUs is not enough?)
Slurm can be configured to constrain jobs to their resource requests (the most usual setup), which does not seem to be the case on the cluster you are using.
If it were the case, your job would be killed when trying to use more memory than requested, and it would be limited to the physical CPUs you requested.
In your case, using more memory than requested can lead to memory exhaustion on the node on which your job is running, possibly getting your processes (but also possibly processes of other jobs on the same node!) killed by the OOM killer. Using more CPUs than requested means the processes started by your job will compete with the processes of other jobs for the same physical CPUs, leading to a general slow-down of all jobs on the node because of a large number of context switches. Jobs being slowed down can then exceed their maximum time and get killed.
Underestimating resources can thus lead to loss of your jobs. If nodes are shared among jobs, it can also lead to loss of jobs from other users.
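One way to avoid oversubscribing CPUs from inside the job is to size the worker pool from what was actually requested. Here is a minimal sketch in Python, assuming the job was submitted with --cpus-per-task (Slurm then exports SLURM_CPUS_PER_TASK to the job environment); the helper name allowed_cpus is made up for illustration.

```python
import os


def allowed_cpus(environ=None):
    """Return how many CPUs this process should plan to use."""
    env = os.environ if environ is None else environ
    requested = env.get("SLURM_CPUS_PER_TASK")
    if requested is not None:
        # Respect the Slurm request even if the node exposes more cores.
        return max(1, int(requested))
    if hasattr(os, "sched_getaffinity"):
        # On Linux, this reflects CPU pinning when the cluster enforces it.
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("worker processes to start:", allowed_cpus())
```

The result could then be passed to whatever controls parallelism in mytrain.py, for example the number of data-loading workers.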

Running multiple containers on the same Service Fabric node

I have a Windows Service Fabric node with 4 cores and I want to host 3 containerized stateless services on it, where each Windows container is allocated 1 core to read a message from a queue and process it. I ran some experiments and got these results:
1 container running on the node: message takes ~18 sec to be processed, avg CPU usage per container: 24.7%, memory usage: 1 GB
2 containers running on the node: message takes ~25 sec to be processed, avg CPU usage per container: 24.4%, memory usage: 1 GB
3 containers running on the node: message takes ~35 sec to be processed, avg CPU usage per container: 24.6%, memory usage: 1 GB
I thought that containers are supposedly isolated, and I expected the processing time to be constant at ~18 sec regardless of the number of containers, but in this case, it seems that adding one container affects the processing time in the other containers. Each container is set to use 1 core, so they shouldn't be overstepping to use each other's resources, and the CPU is not reaching full utilization. Even if the CPU were a bottleneck here, I'd expect that at least 2 containers would be able to run with ~18 sec processing time.
Is there a logical explanation for the results? Is it not possible to run multiple containers on the same Service Fabric host without affecting the performance of each when there are enough compute resources? How big could the Service Fabric overhead possibly be when trying to run multiple containers on the same node?
Thanks!
Your container is not only using CPU, but also memory and I/O (disk, network), which can also become bottlenecks.
To see the overhead of SF, run the containers outside of SF and see if it makes a difference.
Use a machine with more memory, and after that, try using an SSD drive. See if that increases performance.
To avoid process overhead, consider using a single container and have multiple threads do parallel message processing. Make sure to assign it 3 cores.
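The single-container, multi-threaded layout suggested above could look roughly like this. This is a Python sketch only (the original service is presumably .NET), and process_message is a hypothetical stand-in for the real per-message work.

```python
from concurrent.futures import ThreadPoolExecutor


def process_message(message):
    # Hypothetical placeholder for the real message handling.
    return message.upper()


def process_batch(messages, workers=3):
    """Handle several messages concurrently inside one process/container."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input messages.
        return list(pool.map(process_message, messages))


if __name__ == "__main__":
    print(process_batch(["first", "second", "third"]))
```

Note that in CPython, threads only help when the work is I/O bound; for CPU-bound processing, the closest equivalent would be a process pool, which brings back some of the per-process overhead.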

Airflow Memory Error: Task exited with return code -9

According to both Link1 and Link2, my Airflow DAG run is returning the error INFO - Task exited with return code -9 due to an out-of-memory issue. My DAG run has 10 tasks/operators, and each task simply:
makes a query to get one of my BigQuery tables, and
writes the results to a collection in my Mongo database.
The size of the 10 BigQuery tables ranges from 1MB to 400MB, and the total size of all 10 tables is ~1GB. My Docker container has a default of 2GB of memory and I've increased this to 4GB; however, I am still receiving this error from a few of the tasks. I am confused about this, as 4GB should be plenty of memory for this. I am also concerned because, in the future, these tables may become larger (a single table query could be 1-2GB), and I'd like to avoid these return code -9 errors at that time.
I'm not quite sure how to handle this issue, since the point of the DAG is to transfer data from BigQuery to Mongo daily, so the query results held in memory by the DAG's tasks are necessarily fairly large, given the size of the tables.
As you said, the error message you get corresponds to an out of memory issue.
Referring to the official documentation:
DAG execution is RAM limited. Each task execution starts with two
Airflow processes: task execution and monitoring. Currently, each node
can take up to 6 concurrent tasks. More memory can be consumed,
depending on the size of the DAG.
High memory pressure in any of the GKE nodes will lead the Kubernetes scheduler to evict pods from nodes in an attempt to relieve that pressure. While many different Airflow components are running within GKE, most don't tend to use much memory, so the case that happens most frequently is that a user uploaded a resource-intensive DAG. The Airflow workers run those DAGs, run out of resources, and then get evicted.
You can check it with the following steps:
In the Cloud Console, navigate to Kubernetes Engine -> Workloads
Click on airflow-worker, and look under Managed pods
If there are pods that show Evicted, click each evicted pod and look for the message "The node was low on resource: memory" at the top of the window.
What are the possible ways to fix the OOM issue?
Create a new Cloud Composer environment with a larger machine type than the current machine type.
Ensure that the tasks in the DAG are idempotent, which means that the result of running the same DAG run multiple times should be the same as the result of running it once.
Configure task retries by setting the number of retries on the task - this way when your task gets -9'ed by the scheduler it will go to up_for_retry instead of failed
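For the retry suggestion, a minimal sketch of the settings involved: retries and retry_delay are standard Airflow operator arguments, while the values 3 and 5 minutes are arbitrary assumptions to tune for your DAG.

```python
from datetime import timedelta

# Intended to be passed as default_args=... when constructing the DAG,
# so that every task in it inherits the same retry policy.
default_args = {
    "retries": 3,                         # re-run a killed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}
```

Retries only help if the tasks are idempotent, as noted above; they do not reduce the memory a single attempt needs.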
Additionally you can check the behavior of CPU:
In the Cloud Console, navigate to Kubernetes Engine -> Clusters
Locate Node Pools at the bottom of the page, and expand the default-pool section
Click the link listed under Instance groups
Switch to the Monitoring tab, where you can find CPU utilization
Ideally, the GCE instances shouldn't be running over 70% CPU at all times, or the Composer environment may become unstable during resource usage.
I hope you find the above pieces of information useful.
I am going to chunk the data so that less is loaded into any 1 task at any given time. I'm not sure yet whether I will need to use GCS/S3 for intermediary storage.
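A rough sketch of what that chunking could look like, so that only one batch is held in memory at a time. Here rows stands for any iterable of query results and write_batch is a hypothetical callback; in practice it might wrap something like pymongo's insert_many, but that wiring is an assumption and is not shown.

```python
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` items without materializing all rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def copy_in_batches(rows, write_batch, size=1000):
    """Stream rows to the destination batch by batch; return the row count."""
    total = 0
    for batch in chunked(rows, size):
        write_batch(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    written = []
    count = copy_in_batches(range(5), written.append, size=2)
    print(count, written)
```

This only keeps memory bounded if the source is itself read lazily, for example by iterating over paged query results rather than loading the whole table first.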

Prevent K8S HPA from deleting pod after load is reduced

I have Sidekiq custom metrics coming from the Prometheus adapter, and I have set up an HPA using those queue metrics. When the number of jobs in the Sidekiq queue goes above, say, 1000, the HPA triggers 10 new pods, and each pod then executes 100 jobs from the queue. When the number of jobs drops to, say, 400, the HPA scales down. But when the scale-down happens, the HPA kills pods; say 4 pods are killed. Those 4 pods were still running jobs, say 30-50 jobs each. When the HPA deletes these 4 pods, the jobs running on them are also terminated, and those jobs are marked as failed in Sidekiq.
So what I want to achieve is to stop the HPA from deleting pods that are still executing jobs. Moreover, I want the HPA not to scale down even after the load is reduced to the minimum, and instead to delete pods only when the Sidekiq queue metric reaches 0.
Is there any way to achieve this?
Weird usage, honestly: you're wasting resources even when your traffic is in the cool-down phase. But since you didn't provide further details, here it is.
Actually, it's not possible to achieve what you want, since the common behavior is to scale in order to support a growing load on your workload. The only way to achieve this (and it is not recommended) is to set the kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization to a higher value.
JFI, the doc warns you:
Note: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the scale of the replicas set may keep thrashing as usual.
As per the discussion and the work done by #Hb_1993, it can be done with a pre-stop hook that delays the eviction, where the delay is based on the operation time or on some logic that determines whether the processing is done.
A pre-stop hook is a lifecycle method which is invoked before a pod is evicted, and we can attach to this event and perform some logic, such as a ping check, to make sure that our pod has completed the processing of the current request.
PS: Use this solution with a pinch of salt, as it might not work in all cases or might produce unintended results.
To do this, we introduce a sleep in the preStop hook that delays the shutdown sequence.
More details can be found in this article.
https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
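As a sketch of what the preStop sleep amounts to, here is the relevant part of a pod spec built as a plain Python dict (which could then be dumped to YAML or JSON). The field names lifecycle, preStop, exec and terminationGracePeriodSeconds are standard Kubernetes pod spec fields; the helper names, the container name and image, and the 30/60 second values are assumptions for illustration.

```python
def container_with_prestop_sleep(name, image, sleep_seconds=30):
    """Container spec whose preStop hook sleeps before shutdown proceeds."""
    return {
        "name": name,
        "image": image,
        "lifecycle": {
            "preStop": {
                "exec": {"command": ["sh", "-c", f"sleep {sleep_seconds}"]}
            }
        },
    }


def pod_spec(container, grace_seconds):
    """Pod spec fragment; the grace period must cover the preStop sleep."""
    return {
        "terminationGracePeriodSeconds": grace_seconds,
        "containers": [container],
    }


if __name__ == "__main__":
    # "sidekiq-worker" and the image name are placeholders.
    spec = pod_spec(container_with_prestop_sleep("sidekiq-worker", "example/worker:latest"), 60)
    print(spec)
```

Keep in mind that the pod is killed once terminationGracePeriodSeconds runs out, so a fixed sleep only protects jobs that finish within that window.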