Problem
We are trying to build an inference API that loads a PyTorch ResNet-101 model on AWS EKS. The pod keeps getting OOM-killed due to high CPU and memory usage. Our logs show we need roughly a 900m CPU resource limit, and that is from testing with a single 1.8 MB image. Our DevOps team is not happy about it.
What we have tried
Currently we load the model with the standard PyTorch loading utilities. We also delete the model state dict after loading to reduce memory usage.
Is there any way to reduce the CPU usage of loading a PyTorch model?
Have you tried limiting the CPU available to the pods?
- name: pytorch-ml-model
  image: pytorch-cpu-hog-model-haha
  resources:
    limits:
      memory: "128Mi"
      cpu: "1000m" # Replace this with a CPU amount your DevOps team will be happy with
If the error is an OOM kill, you might want to consider allocating more memory per pod. As outsiders, we have no idea how much memory your model actually needs; I would suggest using a debugging tool like the PyTorch profiler to understand how much memory your inference use case requires.
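As a rough first pass before reaching for the full PyTorch profiler, Python's built-in tracemalloc can give you a peak-allocation number. A minimal sketch — the `load_model` below is a stand-in for your real `torch.load` call:

```python
import tracemalloc

def load_model():
    # Stand-in for your real loading code,
    # e.g. torch.load("resnet101.pth", map_location="cpu")
    return bytearray(50 * 1024 * 1024)  # pretend the weights take ~50 MiB

tracemalloc.start()
model = load_model()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak Python-level allocation: {peak / 1024 / 1024:.0f} MiB")
```

Caveat: tracemalloc only sees allocations made through the Python allocator. Memory allocated by native libraries such as libtorch will not show up, so treat this number as a lower bound and use the PyTorch profiler for the full picture.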
You might also want to consider using memory-optimized worker nodes, and applying node affinity through labels to ensure that inference pods are scheduled onto the memory-optimized nodes in your EKS cluster.
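As a sketch of that second suggestion, a nodeSelector is the simplest form of node affinity. The label name and value below are assumptions — use whatever label your memory-optimized node group actually carries:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-ml-model
spec:
  nodeSelector:
    workload-type: memory-optimized  # hypothetical label on your memory-optimized node group
  containers:
  - name: pytorch-ml-model
    image: pytorch-cpu-hog-model-haha
```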
Related
I am running a GPU server set up by following this document.
I have found that the GPU is used by deep learning work in a Jupyter notebook running in a CPU-only pod on the GPU node, as shown below.
There is no nvidia.com/gpu entry under Limits or Requests,
so I don't understand why the GPU is being used.
Limits:
  cpu:     2
  memory:  2000Mi
Requests:
  cpu:     2
  memory:  2000Mi
Is there a way to disable GPU for CPU pods?
Thank you.
Based on this topic on github:
This is currently not supported and we don't really have a plan to support it.
But...
you might want to take a look at the CUDA_VISIBLE_DEVICES environment variable that controls what devices a specific CUDA process can see:
https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
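As a sketch, you can hide the GPUs from a CPU-only pod by setting that variable to an empty string in the pod spec (the pod and image names below are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-only-notebook
spec:
  containers:
  - name: notebook
    image: jupyter/datascience-notebook  # placeholder image
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: ""  # empty list: CUDA processes see no devices and fall back to CPU
```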
Below you can see the setup I currently have. A Django app creates a set of requests as Celery tasks. Load is balanced using Istio across the gRPC server pods. The Python script processes each request and returns the result. Everything runs on AWS EKS, and HPA and cluster autoscaling are active.
The Python script is a CPU-intensive process, and depending on the request Django sends, its CPU and memory usage vary a lot. Visually inspecting it, each request can take anything between:
Best case (more common) -> 100Mi memory, 100m CPU -> the Python script takes a few seconds to process
To
Worst case (less common) -> 1000Mi memory, 10,000m CPU -> the Python script takes up to 3-4 minutes to process
Here are the current resources used for the gRPC server, which runs on a c5.2xlarge instance:
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 14
    memory: 16384Mi
Also, the gRPC server uses a ThreadPoolExecutor with max_workers=16, which means it can serve 16 requests at the same time.
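The effect of max_workers on how many requests run at once can be sketched with a plain ThreadPoolExecutor, independent of gRPC. The simulated task below is an assumption standing in for the real Python script:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class ConcurrencyGauge:
    """Tracks the peak number of tasks running at the same time."""
    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0
        self.peak = 0

    def task(self, _):
        with self._lock:
            self._current += 1
            self.peak = max(self.peak, self._current)
        time.sleep(0.05)  # stand-in for the CPU-intensive Python script
        with self._lock:
            self._current -= 1

def peak_concurrency(max_workers, n_tasks=8):
    gauge = ConcurrencyGauge()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(gauge.task, range(n_tasks)))
    return gauge.peak

# With max_workers=1 requests are processed strictly one at a time;
# with a larger pool they overlap, up to the pool size.
print(peak_concurrency(1))  # always 1
print(peak_concurrency(4))  # between 2 and 4
```

This is exactly the trade-off in the scenarios below: a pool of 1 gives predictable per-request latency, while a larger pool shares the pod's resources across overlapping requests.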
The issue is that I'm trying to use the least amount of resources while at the same time making sure each request doesn't take more than X minutes/seconds.
Scenarios I can think of:
Using the same resources as defined above and setting max_workers=1. This way I'm sure each pod only processes one request at a time, and I can more or less guarantee how long the worst case takes to process. However, it would be very expensive and probably not that scalable.
Using the same resources as defined above but setting max_workers=16 or higher. In this case each pod takes up a lot of memory and CPU, but at least each gRPC server can handle multiple requests at the same time. The issue is: what if a few of the worst-case requests hit the same pod? Then it would take a long time to process them.
Setting max_workers=1 and changing the resources to something like below. This way each pod still processes only one request at a time and uses minimal resources, but it can burst up to the limit for the rare cases. I guess it's not good practice for limits and requests to be that far apart.
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 100m
    memory: 100Mi
I'd be grateful if you can take a look at the scenarios above. Any/all thoughts are highly appreciated.
Thanks
Background
Recently my lab invested in GPU computation infrastructure; more specifically, two Titan V cards installed in a standard server machine. Currently the machine runs a completely unconfigured Windows Server. Everyone in my lab can log in and do whatever they want. From time to time the machine becomes completely unusable for others because someone has accidentally occupied all the available memory.
Since ML is growing here, I am looking for a better way to make use of our infrastructure.
Requirements
Multi-user. PhDs and students should be able to run their tasks.
Job queue or scheduling (preferably something like time-sliced scheduling)
Dynamic allocation of resources. If a single task is running, it is OK for it to use all the memory, but as soon as a second one starts they should share the resources.
Easy / Remote job submission: Maybe a webpage?
What I tried so far
I have a small test setup (consumer PC with GTX 1070) for experimenting. My internet research pointed me to SLURM and Kubernetes.
First of all I like the idea of a cluster management system, since it offers the option to extend the infrastructure in future.
SLURM was fairly easy to set up, but I was not able to set up anything like remote submission or time-sliced scheduling.
In the meantime I also tried Kubernetes. To me it offers far more interesting features, above all containerization. However, all these features make it more complicated to set up and understand. And again, I was not able to build something like remote submission.
My question
Has anyone faced the same problem and can share their solution? I have the feeling that Kubernetes is better prepared for the future.
If you need more information, let me know.
Thanks
Tim!
As far as I know, Kubernetes does not support GPU sharing, which was asked about here.
There is an ongoing discussion Is sharing GPU to multiple containers feasible? #52757
I was able to find a Docker image with examples that "supports sharing GPUs unofficially", available here: cvaldit/nvidia-k8s-device-plugin.
It can be used in the following way:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: 2 # requesting 2 GPUs
  - name: digits-container
    image: nvidia/digits:6.0
    resources:
      limits:
        nvidia.com/gpu: 2 # requesting 2 GPUs
That would expose 2 GPUs inside each container to run your job in, and also lock those GPUs from further use until the job ends.
I'm not sure how you would scale this for multiple users, other than limiting the maximum number of GPUs used per job.
Also, you can read about Schedule GPUs, which is still experimental.
I have a 4-core CPU, and I create a Kubernetes Pod with a CPU resource limit of 100m, which means it will occupy 1/10 of one core's power.
I'm wondering: since 100m is not even a full core, if my app is multithreaded, will its threads run in parallel? Or will all the threads run within that fraction of a core (100 millicores) only?
Can anyone explain the mechanism behind this?
The closest answer I found so far is this one:
For a single-threaded program, a cpu usage of 0.1 means that if you
could freeze the machine at a random moment in time, and look at what
each core is doing, there is a 1 in 10 chance that your single thread
is running at that instant. The number of cores on the machine does
not affect the meaning of 0.1. For a container with multiple threads,
the container's usage is the sum of its thread's usage (per previous
definition.) There is no guarantee about which core you run on, and
you might run on a different core at different points in your
container's lifetime. A cpu limit of 0.1 means that your usage is not
allowed to exceed 0.1 for a significant period of time. A cpu request
of 0.1 means that the system will try to ensure that you are able to
have a cpu usage of at least 0.1, if your thread is not blocking
often.
I think the above sounds quite logical. Based on my question, the 100m of CPU power can be spread across all the CPU cores, which means multithreading should work in Kubernetes.
Update:
In addition, this answer explains quite well that, although the app may effectively get less than one core's worth of power, the operating system's scheduler will still try to run its threads in parallel across cores, while never exceeding the CPU time specified (100m in this question).
Take a look at this documentation related to resources in Kubernetes:
You can use resources as described in the article:
To specify a CPU request for a Container, include the
resources:requests field in the Container resource manifest. To
specify a CPU limit, include resources:limits.
In this exercise, you create a Pod that has one Container. The
Container has a request of 0.5 CPU and a limit of 1 CPU. Here is the
configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"
To address your question directly:
No, with a 100m limit your threads are not going to run in parallel across multiple cores in any meaningful way. But you can expose more cores to your application's pod and then use multiple threads to take advantage of them.
The args section of the configuration file provides arguments for
the Container when it starts. The -cpus "2" argument tells the
Container to attempt to use 2 CPUs.
I had a close look at the GitHub Issue Thread in question. There is a bit of back and forth in the thread, but I think I made sense of it and would like to share a couple of things that seem to be missing from the answers so far:
100m is not the same as 1/10 of a core's power. It is an absolute quantity of CPU time and will remain the same regardless of the number of cores in the node.
While CPU time may well be given out on multiple cores of the node, true parallelism still depends on having a CPU limit well above the CPU required by a single thread. Otherwise, your multi-threaded application will run concurrently (i.e. threads take turns on the same CPU core) rather than in parallel (i.e. threads running on separate cores).
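The mechanism behind this is the kernel's CFS bandwidth controller: the kubelet translates the CPU limit into a quota of CPU time per scheduling period (the cgroup values cpu.cfs_quota_us and cpu.cfs_period_us). A small sketch of that arithmetic, assuming the default 100 ms period:

```python
def cfs_quota_us(cpu_limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU-time budget (in microseconds) a container may consume per period.

    This mirrors the value the kubelet derives from a CPU limit and writes
    into the container's cgroup as cpu.cfs_quota_us.
    """
    return cpu_limit_millicores * period_us // 1000

# A 100m limit buys 10 ms of CPU time per 100 ms period. Ten threads on
# ten cores would exhaust that budget in 1 ms and then all be throttled
# for the remaining 99 ms -- concurrency, but no useful parallelism.
print(cfs_quota_us(100))   # 10000 us = 10 ms per period
print(cfs_quota_us(2000))  # a 2-core limit: 200 ms of CPU time per period
```

The budget is fungible across cores, which is why the quote above says there is no guarantee about which core you run on.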
I recently switched our PostgreSQL cluster from a simple "bare-metal" (VM) workload to a containerized K8s cluster (also on VMs).
Currently we run zalando-incubator/postgres-operator and use Local Volumes with volumeMode: FileSystem; the volume itself is a "simple" xfs volume mounted on the host.
However, we have actually seen performance drops of up to 50% on the Postgres cluster inside k8s.
Some heavy join workloads perform far worse than on the old cluster, which did not use containers at all.
Is there a way to tune this behavior, or at least measure the I/O performance to find the actual bottleneck (i.e. what is a good way to measure I/O, etc.)?
Is there a way to tune the behavior
Be cognizant of two things that might be impacting your in-cluster behavior: increased cache thrashing and the inherent problem of running concurrent containers on a Node. If you haven't already tried it, you may want to use taints and tolerations to sequester your PG Pods away from other Pods and see if that helps.
what is a good way to measure I/O, etc.
I would expect the same iostat tools one is used to using would work on the Node, since no matter how much kernel namespace trickery is going on, it's still the Linux kernel.
Prometheus (and likely a ton of other such toys) surfaces some I/O specific metrics for containers, and I would presume they are at the scrape granularity, meaning you can increase the scrape frequency, bearing in mind the observation cost impacting your metrics :-(
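For example, assuming cAdvisor metrics are being scraped, per-container disk throughput can be graphed with queries along these lines (metric and label names are the usual cAdvisor ones, but verify them against your own setup):

```promql
rate(container_fs_writes_bytes_total{pod=~"postgres-.*"}[5m])
rate(container_fs_reads_bytes_total{pod=~"postgres-.*"}[5m])
```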
It appears new docker daemons ship with Prom metrics, although I don't know what version introduced that functionality. There is a separate page discussing the implications of high frequency metric collection. There also appears to be a Prometheus exporter for monitoring arbitrary processes, above and beyond the PostgreSQL specific exporter.
Venturing into opinion territory: it may be a very reasonable experiment to go head-to-head with ext4 versus a non-default FS like xfs. I can't even fathom how much extra production experience has gone into ext4, merely by virtue of almost every Linux on the planet deploying on it by default. You may have great reasons for using xfs, but I just wanted to ensure you had at least considered that xfs might have performance characteristics that make it problematic in a shared environment like a Kubernetes cluster.