What is the difference between requests and args in Kubernetes resource config?

I am trying to set up a deployment of containers in Kubernetes, and I want the resource utilization to be controlled. I am referring to this.
An example config from the docs -
resources:
  limits:
    memory: "200Mi"
  requests:
    memory: "100Mi"
command: ["stress"]
args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
But I am not able to clearly understand the difference between the requests and args fields. limits is fairly clear: the container should not use more than the limit amount of the resource.
What purpose does args serve, exactly? Here, it is stated that this is the resource the container would start with. So how is it different from requests?

resources:
  limits:
    memory: "200Mi"
  requests:
    memory: "100Mi"
The resources block has requests and limits fields.
It means a minimum of 100Mi of memory should be allocated to the container, and that this value is sufficient to run it. In case of a spike in traffic, it can burst its memory consumption up to 200Mi, which acts as an upper bound. If it exceeds 200Mi, the container will get killed/restarted.
The args are passed to the command (the stress container) as command-line arguments.
Stress Tool Docs
DockerImageForStress
It looks like stress is consuming the 150M of memory passed via the --vm-bytes arg.
I think that, with the help of the stress tool, the docs are trying to show that the container can consume memory anywhere between the requests and limits values.
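For reference, here is a rough sketch of what the full pod spec in that kind of example looks like (pod, container, and image names here are illustrative, not copied from the docs). It shows that command and args are fields of the container itself, siblings of resources, and play no part in the resource accounting:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo            # illustrative name
spec:
  containers:
  - name: memory-demo-ctr      # illustrative name
    image: some-stress-image   # any image that ships the stress tool
    resources:
      limits:
        memory: "200Mi"        # hard cap; exceeding it gets the container OOM-killed
      requests:
        memory: "100Mi"        # what the scheduler reserves for the container on the node
    command: ["stress"]        # the binary to run
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]   # flags handed to that binary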

Regarding args, since it's the only thing that wasn't answered in the duplicate answer:
args is not related to the resources definition; it just describes which arguments you pass to your Docker container on start.
In the case of the example, the image runs the stress tool, and the user decided to pass some memory-related arguments to it.
If the container used a different image, for example node, the args could be arguments to the Node.js code running inside the container.
Hope that answers your question.
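To illustrate that point with a hedged sketch (the image and flags below are made up, not taken from the question): command and args in Kubernetes roughly correspond to Docker's ENTRYPOINT and CMD, so they can carry any program and flags, entirely independent of the resources block:

containers:
- name: app                        # illustrative
  image: node:18                   # args are simply handed to whatever this image runs
  command: ["node"]                # overrides the image's ENTRYPOINT
  args: ["--max-old-space-size=256", "server.js"]   # overrides the image's CMD
  resources:                       # resource accounting lives here, separately
    requests:
      memory: "128Mi"
    limits:
      memory: "256Mi"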

Related

High working set size in dotnet under Kubernetes

We have an application running under Kubernetes that is .NET 6.0. This application is a controller and starts up 10 worker processes. The issue we are experiencing is that these worker processes are frequently killed by Kubernetes with exit code 137. From my research, that indicates they were killed because they are consuming too much memory.
To make this issue more difficult to troubleshoot, it only happens in our production environment after a period of time. Our production environment is also very locked down: the Docker images all run with a read-only root filesystem, as a non-root user and with very low privileges. So to monitor the application we created a dashboard that reports various things; the two I will focus on are these pieces of data:
DotnetTotalMemory = GC.GetTotalMemory(false) / (1024 * 1024),  // managed heap bytes (no forced collection), converted to MB
WorkingSetSize = process.WorkingSet64 / (1024 * 1024),         // OS working set of the process, converted to MB
The interesting thing is that "DotnetTotalMemory" ranges anywhere from 200 MB to 400 MB, while "WorkingSetSize" starts out between 400 MB and 600 MB but at times jumps up to 1300 MB, even when "DotnetTotalMemory" is hovering at 200 MB.
Our quota is as follows:
resources:
  limits:
    cpu: '5'
    memory: 10Gi
  requests:
    cpu: 1250m
    memory: 5Gi
From what I have read, the limit amount is recognized as the "available system memory" for dotnet and is passed to it through some mechanism similar to docker run --memory=XX, correct?
I switched to Workstation GC and that seems to make them slightly more stable. Another thing I tried was setting the 'DOTNET_GCConserveMemory' environment variable to '9'; again, it seems to help some. But I can't get past the fact that the process seems to have 1100 MB+ of memory that is not managed by the GC. Is there a way for me to reduce the working set used by these processes?
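For reference, a minimal sketch of where the settings mentioned above live in the pod spec (values mirror the question; this only shows where the knobs go, not what to set them to). The .NET runtime derives its notion of available memory from the cgroup limit that Kubernetes creates from limits.memory:

containers:
- name: worker                      # illustrative name
  resources:
    requests:
      memory: 5Gi
    limits:
      memory: 10Gi                  # the cgroup limit the .NET GC treats as available memory
  env:
  - name: DOTNET_gcServer
    value: "0"                      # 0 = Workstation GC, as tried above
  - name: DOTNET_GCConserveMemory
    value: "9"                      # 0-9; higher values trade throughput for a smaller heap, as tried above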

What is the purpose of the args: -cpus -"#" in container manifest?

This page provides the following example from a container resource manifest of a pod configuration file:
resources:
  limits:
    cpu: "1"
  requests:
    cpu: "0.5"
args:
- -cpus
- "2"
And there are some notes describing what this means:
The args section of the configuration file provides arguments for the
container when it starts. The -cpus "2" argument tells the Container
to attempt to use 2 CPUs.
Recall that by setting -cpu "2", you configured the Container to
attempt to use 2 CPUs, but the Container is only being allowed to use
about 1 CPU. The container's CPU use is being throttled, because the
container is attempting to use more CPU resources than its limit.
What does it mean to say that the container is being "told to attempt to use 2 CPUs"?
From the example and note, it seems to imply that the container is being told to use more resources than what the limit constrains, but why would anyone do that?
I was assuming that the args above indicate across how many CPUs the container can span its resource utilization, but that the aggregate utilization across all of those CPUs would have to remain within the specified limits. But given the notes on the page, it seems to not work the way I expected, and that the args: -cpus is used for some other purpose/benefit, but I can't seem to glean what that is.
This example is misleading. There is nothing special about the -cpus 2 in the args section. The args section is just a way to provide parameters to the container, the same way you would if running the container manually with docker run. In this example the command ends up being the equivalent of: docker run someimage:latest -cpus 2. The -cpus 2 portion gets passed along to the entrypoint of the image and (if supported) will be used by the application running inside that container.
I would imagine the aim was to show that you can have an application (container) that takes a cpu parameter to control how much CPU that application tries to consume, while still limiting the container's actual usage at the Kubernetes level using limits in the pod manifest.
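A hedged sketch of the container spec that docs example boils down to (the image name is a placeholder, not the one from the docs):

containers:
- name: cpu-demo-ctr             # illustrative name
  image: some-cpu-stress-image   # whatever CPU-stress image the docs example uses
  resources:
    limits:
      cpu: "1"                   # cgroup CPU quota: the container is throttled at ~1 CPU
    requests:
      cpu: "0.5"                 # what the scheduler reserves on the node
  args:
  - -cpus
  - "2"                          # just a flag for the stress program: spin up 2 busy workers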

PyTorch Inference High CPU Usage on Kubernetes

Problem
We are trying to create an inference API that loads a PyTorch ResNet-101 model on AWS EKS. Apparently, it always gets OOM-killed due to high CPU and memory usage. Our logs show we need around a 900m CPU resource limit. Note that we only tested it using a single 1.8 MB image. Our DevOps team didn't really like it.
What we have tried
Currently we are using the standard PyTorch model-loading approach. We also clear the model state dict to clean up memory usage.
Is there any method to reduce the CPU usage when loading a PyTorch model?
Have you tried limiting the CPU available to the pods?
- name: pytorch-ml-model
  image: pytorch-cpu-hog-model-haha
  resources:
    limits:
      memory: "128Mi"
      cpu: "1000m" # Replace this with the CPU amount your devops guys will be happy about
If your error is OOM, you might want to consider adding more memory per pod. We as outsiders have no idea how much memory your models require; I would suggest using debugging tools like the PyTorch profiler to understand how much memory you need for your inference use case.
You might also want to consider using memory-optimized worker nodes and applying deployment-node affinity through labels to ensure that inference pods are scheduled onto memory-optimized nodes in your EKS cluster.
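For the node-affinity suggestion, a minimal sketch using a nodeSelector (the label key/value is an assumption; you would label your memory-optimized node group accordingly, and the resource numbers are placeholders to be replaced after profiling):

spec:
  nodeSelector:
    workload-type: memory-optimized   # assumed label on the memory-optimized node group
  containers:
  - name: pytorch-ml-model
    image: pytorch-cpu-hog-model-haha
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"                 # placeholder; profile the model to pick real numbers
      limits:
        cpu: "1000m"
        memory: "4Gi"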

Is there a way to disable GPU for CPU pods?

I am running a GPU server by referring to this document.
I have found that the GPU is being used for DL work in a Jupyter notebook running in a CPU-only pod on the GPU node, as shown below.
Obviously there is no nvidia.com/gpu entry in Limits or Requests, so I don't understand why the GPU is being used.
Limits:
  cpu:     2
  memory:  2000Mi
Requests:
  cpu:     2
  memory:  2000Mi
Is there a way to disable GPU for CPU pods?
Thank you.
Based on this topic on github:
This is currently not supported and we don't really have a plan to support it.
But...
you might want to take a look at the CUDA_VISIBLE_DEVICES environment variable that controls what devices a specific CUDA process can see:
https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
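As a rough sketch of that workaround (not officially supported, as quoted above), the variable can be set to an invalid device index in the CPU-only pods so that CUDA sees no GPUs; the container and image names below are illustrative:

containers:
- name: cpu-notebook                    # illustrative name
  image: jupyter/datascience-notebook   # illustrative image
  env:
  - name: CUDA_VISIBLE_DEVICES
    value: "-1"                         # no valid device IDs, so CUDA applications see no GPUs
  resources:
    limits:
      cpu: 2
      memory: 2000Mi
    requests:
      cpu: 2
      memory: 2000Mi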

Properly define pod resources for a gRPC server if requests are very different in terms of usage

Below you can see the setup that I currently have. A Django app creates a set of requests as Celery tasks. Load is balanced across the gRPC server pods using Istio. The Python script processes each request and returns the result. Everything is on AWS EKS, and HPA and cluster autoscaling are also active.
The Python script is a CPU-intensive process, and depending on the request that Django sends, its CPU and memory usage varies a lot. Visually inspecting it, each request can take anything between:
Best case (more common) -> ~100Mi memory, 100m CPU -> the Python script takes a few seconds to process
to
Worst case (less common) -> ~1000Mi memory, 10,000m CPU -> the Python script takes up to 3-4 minutes to process
Here are the current resources used for the gRPC server, which runs on a c5.2xlarge instance:
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 14
    memory: 16384Mi
Also, the gRPC server has a ThreadPoolExecutor with max_workers=16, which means it can respond to 16 requests at the same time.
The issue is that I'm trying to use the least amount of resources while at the same time making sure each request doesn't take more than X minutes/seconds.
Scenarios that I can think of:
Using the same resources as defined above and setting max_workers=1. This way I'm sure that each pod only processes one request at a time, and I can somewhat guarantee how long the worst case takes to process. However, it would be super expensive and probably not that scalable.
Using the same resources as defined above but setting max_workers=16 or a bigger number. In this case, even though each pod takes up a lot of memory and CPU, at least each gRPC server can handle multiple requests at the same time. However, what if a few of the worst-case requests hit the same pod? Then it would take a long time to process those requests.
Setting max_workers=1 and modifying the resources to something like below. This way each pod still only processes one request at a time and uses the minimum resources, but it can burst up to the limit for the rare cases. I guess it's not good practice for limits and requests to be that different.
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 100m
    memory: 100Mi
I'd be grateful if you can take a look at the scenarios above. Any/all thoughts are highly appreciated.
Thanks