What will happen if the resources I requested are not enough while the job is running? - hpc

In Slurm, what will happen if the resources I requested are not enough while the job is running?
For example, myscript.sh contains #SBATCH --mem=10G, #SBATCH --cpus-per-task=2 and python mytrain.py. After I run sbatch myscript.sh, the job is successfully allocated the requested CPUs (2) and memory (10 GB). But while the job is running, the program needs more memory than 10 GB (for example when loading a big video dataset), and I found that the job is not killed; it keeps working normally.
So my question is: are there any side effects when I underestimate the resources I need? (Memory seems okay, but is it still okay if the requested number of CPUs is not enough?)

Slurm can be configured to constrain jobs to their resource requests (the most common setup), which does not seem to be the case on the cluster you are using.
If it were the case, your job would be killed when trying to use more memory than requested, and it would be limited to the physical CPUs you requested.
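For reference, on clusters where that enforcement is enabled, it is typically done with the cgroup task plugin; a rough sketch of the admin-side settings involved (for illustration only, your site's configuration may differ):
# in slurm.conf
TaskPlugin=task/cgroup
# in cgroup.conf
ConstrainCores=yes      # confine job processes to the allocated CPUs
ConstrainRAMSpace=yes   # enforce the requested memory limit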
In your case, using more memory than requested can lead to memory exhaustion on the node your job runs on, possibly causing your processes (but also possibly the processes of other jobs on the same node!) to be killed by the OOM killer. Using more CPUs than requested means the processes started by your job compete with the processes of other jobs for the same physical CPUs, leading to a general slowdown of all jobs on the node because of the large number of context switches. Jobs that are slowed down can then exceed their maximum allowed time and be killed.
Underestimating resources can thus lead to the loss of your jobs. If nodes are shared among jobs, it can also lead to the loss of other users' jobs.
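To size requests more accurately next time, you can check what a finished job actually used. A minimal sketch (assuming job accounting is enabled; seff is a contributed tool that not every site installs; <jobid> is a placeholder):
sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,Elapsed,TotalCPU   # peak memory (MaxRSS) vs. requested (ReqMem)
seff <jobid>                                                     # summary of CPU and memory efficiency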

Related

Does assigning more nodes to a job on a SLURM server increase available RAM?

I am working with a program that needs a lot of RAM. Currently I am running it on a SLURM cluster. Each node has 125 GB of RAM. When submitting the job to a single node, it eventually fails because it runs out of memory. My rather naive question, as I am new to working on servers, is:
Does assigning more nodes with the --nodes flag increase the RAM available to the submitted job?
For example:
When assigning 10 nodes instead of 1, with the directive below, the program fails at the same point as with one node.
#SBATCH --nodes=10
Is there some other way to combine RAM from multiple nodes for a single job?
Any and all advice is welcome!
That depends on your program, but most likely no.
To use multiple nodes on a Slurm cluster (or any cluster, for that matter), your program needs to be set up in a very specific way, i.e. it needs inter-node communication. This is usually done via MPI, and the whole program has to be designed around it.
So if your program uses MPI, it may be able to split the workload over several nodes. Even then, lower memory usage per node is not guaranteed, as that is usually not the goal of such parallelization.
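For illustration, a submission script for an MPI-enabled program spread over several nodes could look roughly like this (my_mpi_program is a placeholder; this only helps if the program itself distributes its data via MPI):
#!/bin/bash
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=1   # one MPI rank per node
#SBATCH --mem=120G            # memory requested on each node
srun ./my_mpi_program         # srun starts one rank per node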

Running multiple containers on the same Service Fabric node

I have a Windows Service Fabric node with 4 cores, and I want to host 3 containerized stateless services on it, where each Windows container is allocated 1 core to read a message from a queue and process it. I ran some experiments and got these results:
1 container running on the node: a message takes ~18 s to be processed; avg CPU usage per container: 24.7%; memory usage: 1 GB
2 containers running on the node: a message takes ~25 s to be processed; avg CPU usage per container: 24.4%; memory usage: 1 GB
3 containers running on the node: a message takes ~35 s to be processed; avg CPU usage per container: 24.6%; memory usage: 1 GB
I thought containers were supposed to be isolated, and I expected the processing time to stay constant at ~18 s regardless of the number of containers, but it seems that adding a container affects the processing time in the other containers. Each container is set to use 1 core, so they shouldn't be encroaching on each other's resources, and the CPU is not reaching full utilization. Even if the CPU were the bottleneck here, I'd expect at least 2 containers to be able to run with ~18 s processing time.
Is it not possible to run multiple containers on the same Service Fabric host without affecting each other's performance when there are enough compute resources? How big could the Service Fabric overhead possibly be when running multiple containers on the same node?
Thanks!
Your container is not only using CPU, but also memory and I/O (disk, network), which can also become bottlenecks.
To see the overhead of SF, run the containers outside of SF and see if it makes a difference.
Use a machine with more memory, and after that, try using an SSD drive. See if that increases performance.
To avoid process overhead, consider using a single container and have multiple threads do parallel message processing. Make sure to assign it 3 cores.
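As a rough way to measure the Service Fabric overhead and the effect of resource limits, you could run the same image directly with Docker (the image name and the limits below are placeholders):
docker run --rm --cpus=1 --memory=2g myprocessor:latest   # one container pinned to 1 core
# start a second and a third copy of the same command in parallel and compare the per-message processing times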

Airflow Memory Error: Task exited with return code -9

According to both Link1 and Link2, my Airflow DAG run is returning the error INFO - Task exited with return code -9 due to an out-of-memory issue. My DAG has 10 tasks/operators, and each task simply:
makes a query to get one of my BigQuery tables, and
writes the results to a collection in my Mongo database.
The sizes of the 10 BigQuery tables range from 1 MB to 400 MB, and the total size of all 10 tables is ~1 GB. My Docker container has the default 2 GB of memory, and I've increased this to 4 GB; however, I am still receiving this error from a few of the tasks. I am confused by this, as 4 GB should be plenty of memory. I am also concerned because, in the future, these tables may become larger (a single table query could be 1-2 GB), and I'd like to avoid these return code -9 errors then.
I'm not quite sure how to handle this issue, since the point of the DAG is to transfer data from BigQuery to Mongo daily, so the queries and the data held in memory by the DAG's tasks are necessarily fairly large, given the size of the tables.
As you said, the error message you get corresponds to an out of memory issue.
Referring to the official documentation:
DAG execution is RAM limited. Each task execution starts with two Airflow processes: task execution and monitoring. Currently, each node can take up to 6 concurrent tasks. More memory can be consumed, depending on the size of the DAG.
High memory pressure in any of the GKE nodes will lead the Kubernetes scheduler to evict pods from nodes in an attempt to relieve that pressure. While many different Airflow components are running within GKE, most don't tend to use much memory, so the case that happens most frequently is that a user uploaded a resource-intensive DAG. The Airflow workers run those DAGs, run out of resources, and then get evicted.
You can check this with the following steps (a kubectl equivalent is sketched after the list):
In the Cloud Console, navigate to Kubernetes Engine -> Workloads
Click on airflow-worker, and look under Managed pods
If there are pods that show Evicted, click each evicted pod and look for the message The node was low on resource: memory at the top of the window.
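The same check can be done from the command line with kubectl, assuming you have credentials for the GKE cluster behind your Composer environment (names in angle brackets are placeholders):
kubectl get pods --all-namespaces | grep -i evicted       # list evicted pods, if any
kubectl describe pod <evicted-pod> -n <namespace>         # look for "The node was low on resource: memory"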
What are the possible ways to fix OOM issue?
Create a new Cloud Composer environment with a larger machine type than the current machine type.
Ensure that the tasks in the DAG are idempotent, which means that the result of running the same DAG run multiple times should be the same as the result of running it once.
Configure task retries by setting the number of retries on the task; this way, when your task gets -9'ed by the scheduler, it will go to up_for_retry instead of failed.
Additionally you can check the behavior of CPU:
In the Cloud Console, navigate to Kubernetes Engine -> Clusters
Locate Node Pools at the bottom of the page, and expand the default-pool section
Click the link listed under Instance groups
Switch to the Monitoring tab, where you can find CPU utilization
Ideally, the GCE instances shouldn't be running over 70% CPU at all times, or the Composer environment may become unstable under heavy resource usage.
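If you prefer the command line, current usage can also be read with kubectl (this assumes metrics are available in the cluster; the namespace is a placeholder):
kubectl top nodes                                         # CPU and memory usage per node
kubectl top pods -n <namespace> | grep airflow-worker     # usage of the Airflow worker pods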
I hope you find the above pieces of information useful.
I am going to chunk the data so that less is loaded into any one task at any given time. I'm not sure yet whether I will need to use GCS/S3 for intermediary storage.
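As a sketch of that chunked approach (the dataset, table, bucket and collection names, and the MONGO_URI variable, are placeholders), BigQuery can export a table to GCS in multiple shards, which can then be loaded one at a time instead of holding the whole result in memory:
bq extract --destination_format=NEWLINE_DELIMITED_JSON 'mydataset.mytable' 'gs://my-bucket/mytable-*.json'
gsutil cp 'gs://my-bucket/mytable-000000000000.json' .    # fetch one shard at a time
mongoimport --uri="$MONGO_URI" --collection=mytable --file=mytable-000000000000.json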

Queries regarding celery scalability

I have a few questions regarding Celery. Please help me with them.
Do we need to put the project code on every celery worker? If yes, as I increase the number of workers and also update my code, what is the best way to update the code on all the worker instances (without manually pushing code to every instance every time)?
Does using -Ofair as a celery worker argument disable prefetching in workers even if I have set PREFETCH_LIMIT=8 or so?
IMPORTANT: Does the rabbitmq broker assign tasks to the workers, or do the workers pull tasks from the broker?
Does it make sense to have more than one celery worker (with as many subprocesses as the number of cores) on a system? I see a few people run multiple celery workers on a single system.
To add to the previous question, what's the performance difference between the two scenarios: a single worker (8 cores) on a system, or two workers (with concurrency 4)?
Please answer my questions. Thanks in advance.
Do we need to put the project code on every celery worker? If yes, as I increase the number of workers and also update my code, what is the best way to update the code on all the worker instances (without manually pushing code to every instance every time)?
Yes. A celery worker runs your code, and so naturally it needs access to that code. How you make the code accessible though is entirely up to you. Some approaches include:
Code updates and restarting of workers as part of deployment
If you run your celery workers in kubernetes pods this comes down to building a new docker image and upgrading your workers to the new image. Using rolling updates this can be done with zero downtime.
Scheduled synchronization from a repository and worker restarts by broadcast
If you run your celery workers in a more traditional environment, or for some reason you don't want to rebuild whole images, you can use some central file system available to all workers, where you update the files, e.g. by syncing a git repository on a schedule or on some trigger. It is important that you restart all celery workers so they reload the code. This can be done by remote control.
Dynamic loading of code for every task
For example, in omega|ml we provide lambda-style serverless execution of arbitrary python scripts which are dynamically loaded into the worker process.
To avoid module loading and dependency issues, it is important to keep max-tasks-per-child=1 and use the prefork pool. While this adds some overhead, it is a trade-off that we find easy to manage (in particular, we run machine learning tasks, so the small overhead of loading scripts and restarting workers after every task is not an issue).
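As a concrete illustration, a worker started in that mode might look like this (the app name proj is a placeholder):
celery -A proj worker --pool=prefork --max-tasks-per-child=1 --concurrency=4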
Does using -Ofair as a celery worker argument disable prefetching in workers even if I have set PREFETCH_LIMIT=8 or so?
-O fair stops workers from prefetching tasks unless there is an idle process. However, there is a quirk with rate limits that I recently stumbled upon. In practice I have not experienced a problem with either prefetching or rate limiting; as with any distributed system, it pays off to think about the effects of the asynchronous nature of execution (this is not particular to Celery but applies to all such systems).
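For reference, both knobs are visible on the command line; a sketch (proj is a placeholder, and the prefetch setting is the per-process multiplier):
celery -A proj worker -O fair --prefetch-multiplier=1   # fair scheduling, minimal prefetching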
IMPORTANT: Does the rabbitmq broker assign tasks to the workers, or do the workers pull tasks from the broker?
RabbitMQ does not know about the workers (nor do any of the other brokers supported by celery); it just maintains a queue of messages. That is, it is the workers that pull tasks from the broker.
A concern that may come up with this is: what if my worker crashes while executing tasks? There are several aspects to this. There is a distinction between the worker and the worker processes. The worker is the single process started to consume tasks from the broker; it does not execute any of the task code. The task code is executed by one of the worker processes. When using the prefork pool (which is the default), a failed worker process is simply restarted without affecting the worker as a whole or the other worker processes.
Does it make sense to have more than one celery worker (with as many subprocesses as number of cores) in a system? I see few people run multiple celery workers in a single system.
That depends on the scale and type of the workload you need to run. In general, CPU-bound tasks should be run on workers with a concurrency setting that doesn't exceed the number of cores. If you need to process more of these tasks than you have cores, run multiple workers to scale out. Note that if your CPU-bound task uses more than one core at a time (as is often the case in machine learning workloads / numerical processing), it is the total number of cores used per task, not the total number of tasks run concurrently, that should inform your decision.
To add to the previous question, what's the performance difference between the two scenarios: a single worker (8 cores) on a system, or two workers (with concurrency 4)?
Hard to say in general; it's best to run some tests. For example, if 4 concurrently running tasks use all the memory on a single node, adding another worker will not help. If, however, you have two queues, e.g. with different rates of arrival (say one for low-frequency but high-priority execution, another for high-frequency but low-priority), both of which can be run concurrently on the same node without concern for CPU or memory, a single node will do.
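For instance, the two-queue case could be served on a single node by two workers, each bound to its own queue (the queue names, concurrency values and app name proj are placeholders):
celery -A proj worker -Q high_priority --concurrency=2 -n high@%h
celery -A proj worker -Q bulk --concurrency=6 -n bulk@%h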

Running multiple pods simultaneously takes a lot of time in kubernetes

On my local machine, I am running multiple pods at the same time. They take a long time to complete, even though all the pods reach the running state almost instantly. Internally, I am running a Docker image (1.8 GB) in each pod. When I run the pods in serial order, each run takes around 12 s, but when running them in parallel the per-pod time increases sharply, to far more than the serial time. What could be the probable cause of this?
EDIT 1
The operation is really CPU-intensive, reaching above 90%. Is there a way to queue the pods as they arrive for CPU resources, so that instead of all of them slowing down, each executes quickly in turn?
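Setting an explicit CPU request on each pod is the usual way to get that queueing behaviour: the scheduler only starts a pod when a node has that much CPU free, and the remaining pods stay in Pending until capacity is released (a Kubernetes Job with parallelism set to the number of cores achieves much the same thing). A minimal sketch, with a placeholder image and a 1-core requirement per pod:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: worker-1
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: my-registry/my-image:1.0   # placeholder image
    resources:
      requests:
        cpu: "1"      # the scheduler reserves a full core before starting the pod
      limits:
        cpu: "1"      # the pod cannot use more than one core
EOF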