We have a .NET 6.0 application running under Kubernetes. This application is a controller and starts up 10 worker processes. The issue we are experiencing is that these worker processes are frequently killed by Kubernetes with exit code 137. From my research, that indicates they were killed because they were consuming too much memory.
To make this issue even more difficult to troubleshoot, it only happens in our production environment after a period of time. Our production environment is also very locked down: the Docker images all run with a read-only root filesystem, as a non-root user, and with very low privileges. To monitor the application we created a dashboard that reports various things. The two pieces of data I will focus on are these:
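For readers unfamiliar with the 137 code, here is a minimal sketch of how it decodes, assuming the usual shell convention that an exit status above 128 means "128 + signal number". The helper name `describe_exit_code` is made up for illustration. Note that SIGKILL only says the process was killed forcibly; the OOM killer is the most common cause in a container, but not the only possible one.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable description.

    Assumes the shell convention: status > 128 means the process was
    terminated by signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```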
DotnetTotalMemory = GC.GetTotalMemory(false) / (1024 * 1024),
WorkingSetSize = process.WorkingSet64 / (1024 * 1024),
The interesting thing is that "DotnetTotalMemory" ranges anywhere from 200 MB to 400 MB, while "WorkingSetSize" starts out between 400 MB and 600 MB but at times jumps up to 1300 MB, even when "DotnetTotalMemory" is hovering at 200 MB.
Our quota is as follows:
resources:
  limits:
    cpu: '5'
    memory: 10Gi
  requests:
    cpu: 1250m
    memory: 5Gi
From what I have read, the limit amount is recognized as the "available system memory" for dotnet and is passed to it through some mechanism similar to docker run --memory=XX, correct?
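As far as I know, the mechanism is the container's cgroup: the memory limit is written to a cgroup file that the runtime can read from inside the container. A rough sketch of reading it follows, assuming the standard cgroup v2 and v1 mount paths. The function names are illustrative, and this is not how the .NET runtime itself is implemented.

```python
from pathlib import Path
from typing import Optional, Sequence

# Standard locations, assuming cgroups are mounted at /sys/fs/cgroup
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup memory limit; cgroup v2 writes 'max' when unlimited."""
    text = raw.strip()
    if text == "max":
        return None
    # Note: cgroup v1 reports a very large number instead of 'max'
    return int(text)


def container_memory_limit(
    candidates: Sequence[Path] = (CGROUP_V2_LIMIT, CGROUP_V1_LIMIT),
) -> Optional[int]:
    """Return the first readable memory limit in bytes, or None."""
    for path in candidates:
        try:
            return parse_limit(path.read_text())
        except OSError:
            # File missing or unreadable: try the next candidate
            continue
    return None


if __name__ == "__main__":
    print(container_memory_limit())
```

With the quota above, a pod on a cgroup-enabled node would be expected to report 10737418240 bytes (10Gi).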
I switched to Workstation GC and that seems to make the processes slightly more stable. Another thing I tried was setting the 'DOTNET_GCConserveMemory' environment variable to '9'; again, it seems to help somewhat. But I can't get past the fact that the process seems to have 1100 MB+ of memory that is not managed by the GC. Is there a way for me to reduce the working set used by these processes?
Related
Problem
We are trying to create an inference API that loads a PyTorch ResNet-101 model on AWS EKS. Apparently, it is always OOM-killed due to high CPU and memory usage. Our logs show we need a CPU resource limit of around 900m. Note that we only tested it using one 1.8 MB image. Our DevOps team didn't really like it.
What we have tried
Currently we are using the standard PyTorch model-loading module. We also clear the model state dict to reduce memory usage.
Is there any way to reduce CPU usage when loading a PyTorch model?
Have you tried limiting the CPU available to the pods?
- name: pytorch-ml-model
  image: pytorch-cpu-hog-model-haha
  resources:
    limits:
      memory: "128Mi"
      cpu: "1000m" # Replace this with CPU amount your devops guys will be happy about
If your error is OOM, you might want to consider allocating more memory per pod. We as outsiders have no idea how much memory you would need to run your models, so I would suggest using debugging tools like the PyTorch profiler to understand how much memory you need for your inference use case.
You might also want to consider using memory-optimized worker nodes and applying node affinity to the deployment through labels, to ensure that inference pods are scheduled on memory-optimized nodes in your EKS cluster.
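If the PyTorch profiler is not an option, a cruder stand-in is to check the process's peak resident memory around the load step using only the standard library. This is a different technique from the profiler, and `measure_load` and `peak_rss_mib` are names made up for this sketch. The load function is whatever you already call to load the model.

```python
import resource
import sys
from typing import Any, Callable, Tuple


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_load(load_fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run load_fn and return (its result, growth of peak RSS in MiB)."""
    before = peak_rss_mib()
    result = load_fn()
    after = peak_rss_mib()
    return result, after - before


if __name__ == "__main__":
    # Placeholder workload; replace the lambda with your model-loading call
    _, growth = measure_load(lambda: [0] * 5_000_000)
    print(f"peak RSS grew by {growth:.1f} MiB")
```

The peak value only ever increases, so it shows the high-water mark rather than current usage, which is the number that matters for choosing a memory limit.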
I am running a GPU server by referring to this document.
I have found that the GPU is used for DL work in a Jupyter notebook when I create a virtual environment in a CPU pod on the GPU node, as shown below.
There is clearly no nvidia.com/gpu entry in Limits or Requests, so I don't understand why the GPU is used.
Limits:
  cpu: 2
  memory: 2000Mi
Requests:
  cpu: 2
  memory: 2000Mi
Is there a way to disable GPU for CPU pods?
Thank you.
Based on this topic on GitHub:
This is currently not supported and we don't really have a plan to support it.
But...
you might want to take a look at the CUDA_VISIBLE_DEVICES environment variable that controls what devices a specific CUDA process can see:
https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
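A minimal sketch of that workaround from inside the notebook or script, assuming the process has not yet initialized CUDA. The helper name `hide_gpus` is hypothetical. The same variable could also be set as an environment variable in the pod spec instead.

```python
import os


def hide_gpus() -> None:
    """Make CUDA report no devices to this process.

    Must run before any CUDA-using library initializes the driver;
    setting it afterwards has no effect on an already-initialized process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


if __name__ == "__main__":
    hide_gpus()
    # Import the deep learning framework only after this point
    print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```

This only hides the devices from well-behaved CUDA processes; it does not prevent a user from unsetting the variable.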
Below you can see the setup that I currently have. A Django app creates a set of requests as Celery tasks. Load is balanced using Istio across the gRPC server pods. The Python script processes the request and returns it. Everything is on AWS EKS and HPA and cluster scaling is also active.
The Python script is a CPU-intensive process and, depending on the request that Django sends, its CPU and memory usage varies a lot. From visual inspection, each request can take anything between:
Best case (more common) -> 100m Memory, 100m CPU -> the python script takes a few seconds to process
To
Worst case (less common) -> 1000m Memory, 10,000m CPU -> the python script takes up to 3-4 minutes to process
Here are the current resources used for the gRPC server, which is on a c5.2xlarge instance:
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 14
    memory: 16384Mi
Also, the gRPC server has ThreadPoolExecutor with max_workers=16 which means it can respond to 16 requests at the same time.
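To make the effect of max_workers concrete, here is a small standard-library sketch (no gRPC involved) that counts how many fake request handlers run at the same time. `peak_concurrency` and the sleep-based handler are stand-ins invented for illustration, not part of the actual server.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(max_workers: int, n_requests: int,
                     work_seconds: float = 0.05) -> int:
    """Run n_requests fake handlers; return the most seen running at once."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def handle(_request_id: int) -> None:
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(work_seconds)  # stand-in for the real Python script
        with lock:
            running -= 1

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so every handler finishes before returning
        list(pool.map(handle, range(n_requests)))
    return peak


if __name__ == "__main__":
    # Never more than 16 handlers run at once; the rest wait in the queue
    print(peak_concurrency(max_workers=16, n_requests=40))
```

Requests beyond max_workers are queued rather than rejected, so a pod that receives several worst-case requests keeps the cheap ones waiting behind them.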
The issue is that I'm trying to use the least amount of resources while making sure each request doesn't take more than X minutes/seconds.
Scenarios that I can think of:
1. Use the same resources as defined above and set max_workers=1. This way I'm sure that each pod only processes one request at a time, and I can more or less guarantee how long the worst case would take to process. However, it would be super expensive and probably not that scalable.
2. Use the same resources as defined above but set max_workers=16 or a bigger number. In this case, even though each pod takes up a lot of memory and CPU, at least each gRPC server can handle multiple requests at the same time. However, what if a few of the worst-case requests hit the same pod? Then it would take a long time to process the requests.
3. Set max_workers=1 and change the resources to something like the below. This way each pod still only processes one request at a time and uses minimal resources, but it can go up to the limit for the rare cases. I guess it's not good practice for limits and requests to be that different.
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 100m
    memory: 100Mi
I'd be grateful if you can take a look at the scenarios above. Any/all thoughts are highly appreciated.
Thanks
I have a 4-core CPU. I create a Kubernetes Pod with a CPU resource limit of 100m, which means it will occupy 1/10 of a core's power.
I am wondering: since 100m is not even a full core, if my app is multithreaded, will its threads run in parallel? Or will all the threads run on that fraction of a core (100 millicores) only?
Can anyone explain the mechanism behind this further?
The closest answer I found so far is this one:
For a single-threaded program, a cpu usage of 0.1 means that if you
could freeze the machine at a random moment in time, and look at what
each core is doing, there is a 1 in 10 chance that your single thread
is running at that instant. The number of cores on the machine does
not affect the meaning of 0.1. For a container with multiple threads,
the container's usage is the sum of its thread's usage (per previous
definition.) There is no guarantee about which core you run on, and
you might run on a different core at different points in your
container's lifetime. A cpu limit of 0.1 means that your usage is not
allowed to exceed 0.1 for a significant period of time. A cpu request
of 0.1 means that the system will try to ensure that you are able to
have a cpu usage of at least 0.1, if your thread is not blocking
often.
I think the above sounds quite logical. Applied to my question, the 100m of CPU power will be spread across all the CPU cores, which means multithreading should work in Kubernetes.
Update:
In addition, this answer explains quite well that, although a thread might be running on a single core (or on less than one core's power, as in my question), the operating system's scheduler will still try to run the instructions in parallel, without exceeding the specified CPU capacity (100m in my question).
Take a look at this documentation on resources in Kubernetes:
You can use resources as described in the article:
To specify a CPU request for a Container, include the
resources:requests field in the Container resource manifest. To
specify a CPU limit, include resources:limits.
In this exercise, you create a Pod that has one Container. The
Container has a request of 0.5 CPU and a limit of 1 CPU. Here is the
configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"
In addition, regarding your question:
Yes, it is not going to run in parallel (threads on multiple cores). But you can expose a few cores to your application in the pod and then use multiple threads to execute it.
The args section of the configuration file provides arguments for
the Container when it starts. The -cpus "2" argument tells the
Container to attempt to use 2 CPUs.
I had a close look at the GitHub Issue Thread in question. There is a bit of back and forth in the thread, but I think I made sense of it and would like to share a couple of things that seem to be missing from the answers so far:
100m is not the same as 1/10 of core power. It is an absolute quantity of CPU time and remains the same regardless of the number of cores in the node.
While CPU time might well be given on multiple cores of the node, true parallelism still depends on having a CPU limit that is well over the CPU required by a single thread. Otherwise, your multi-threaded application will run concurrently (i.e. threads take turns on the same CPU core) rather than in parallel (i.e. threads running on separate cores).
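The "absolute quantity of CPU time" can be worked out with a little arithmetic. The sketch below assumes the limit is enforced through the Linux CFS quota with the default 100 ms period, which is the usual setup. The function names are made up for illustration.

```python
CFS_PERIOD_US = 100_000  # default CFS enforcement period: 100 ms


def cfs_quota_us(millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (microseconds) the container may use per period."""
    return millicores * period_us // 1000


def runtime_per_thread_us(millicores: int, busy_threads: int,
                          period_us: int = CFS_PERIOD_US) -> float:
    """Average CPU time per period for each of several always-busy threads.

    The quota is shared by every thread in the container, regardless of
    how many cores those threads happen to be scheduled on.
    """
    return cfs_quota_us(millicores, period_us) / busy_threads


if __name__ == "__main__":
    print(cfs_quota_us(100))              # 10000 us = 10 ms per 100 ms
    print(runtime_per_thread_us(100, 4))  # 2500.0 us per thread per period
```

So with a 100m limit, four busy threads can start on four cores at once, but together they use up the 10 ms budget after about 2.5 ms of wall-clock time and are then throttled until the next period begins.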
I pushed an app from GitHub to Bluemix after I created an Availability Monitoring Service instance. I see the following error message:
APP/0 Cannot calculate memory: insufficient memory remaining for heap.
Memory limit 512M < allocated memory 603532K
(-XX:ReservedCodeCacheSize=240M, -XX:MaxDirectMemorySize=10M,
-XX:MaxMetaspaceSize=35839K, -XX:CompressedClassSpaceSize=4492K, -Xss1M * 300 threads)
There seems to be enough memory available (1.250 GB/8 GB Used)
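Adding up the regions listed in the error message shows why no heap can be allocated. This is a sketch that takes the values straight from the log and assumes K and M mean KiB and MiB; `to_kib` is an illustrative helper. The result comes out 1K lower than the 603532K in the log, presumably because of rounding in the memory calculator.

```python
def to_kib(value: str) -> int:
    """Convert a JVM-style size such as '240M' or '35839K' to KiB."""
    number, unit = int(value[:-1]), value[-1].upper()
    if unit == "M":
        return number * 1024
    if unit == "K":
        return number
    raise ValueError(f"unsupported unit in {value!r}")


# Values copied from the error message
fixed_regions = {
    "ReservedCodeCacheSize": "240M",
    "MaxDirectMemorySize": "10M",
    "MaxMetaspaceSize": "35839K",
    "CompressedClassSpaceSize": "4492K",
}
thread_stacks_kib = 300 * to_kib("1M")  # -Xss1M * 300 threads

allocated_kib = sum(to_kib(v) for v in fixed_regions.values()) + thread_stacks_kib
limit_kib = to_kib("512M")

if __name__ == "__main__":
    print(allocated_kib, limit_kib, allocated_kib > limit_kib)
```

The 512M limit is the per-instance limit of the app, not the 8 GB quota of the organization, so non-heap memory alone already exceeds what the container is allowed to use.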
Currently Availability Monitoring does not instrument the application in any fashion, but merely monitors it externally (by making GET requests). This looks to be an issue with the APP itself.
I opened an issue, which resulted in a change: https://github.com/watsonwork/watsonwork-java-starter/commit/8b6c0abd8e8052f154cc703cdceb77f00d558404