I have a 4-core CPU, and I create a Kubernetes Pod with a CPU resource limit of 100m, which means it will occupy 1/10 of one core's power.
I am wondering: in this case, since 100m is not even a full core, if my app is a multithreaded app, will my app's threads run in parallel? Or will all the threads run within that fraction of a core (100 millicores) only?
Can anyone explain the mechanism behind this?
The closest answer I found so far is this one:
For a single-threaded program, a cpu usage of 0.1 means that if you
could freeze the machine at a random moment in time, and look at what
each core is doing, there is a 1 in 10 chance that your single thread
is running at that instant. The number of cores on the machine does
not affect the meaning of 0.1. For a container with multiple threads,
the container's usage is the sum of its threads' usage (per the previous
definition). There is no guarantee about which core you run on, and
you might run on a different core at different points in your
container's lifetime. A cpu limit of 0.1 means that your usage is not
allowed to exceed 0.1 for a significant period of time. A cpu request
of 0.1 means that the system will try to ensure that you are able to
have a cpu usage of at least 0.1, if your thread is not blocking
often.
I think the above sounds quite logical. Based on my question, the 100m of CPU power will be spread across the CPU cores, which means multithreading should work in Kubernetes.
Update:
In addition, this answer explains quite well that, although the app might be running on a single core (or with less than one core's power, as per the question), the operating system's scheduler will still try to run the instruction streams in parallel, without exceeding the CPU quota (100m as per the question) specified.
Take a look at this documentation related to resources in Kubernetes:
You can use resources as described in the article:
To specify a CPU request for a Container, include the
resources:requests field in the Container resource manifest. To
specify a CPU limit, include resources:limits.
In this exercise, you create a Pod that has one Container. The
Container has a request of 0.5 CPU and a limit of 1 CPU. Here is the
configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"
In addition to your question:
No, with such a limit it is not going to run in parallel (threads on multiple cores). But you can expose more than one core to your application in the pod and then use multiple threads to execute it.
The args section of the configuration file provides arguments for
the Container when it starts. The -cpus "2" argument tells the
Container to attempt to use 2 CPUs.
I had a close look at the GitHub Issue Thread in question. There is a bit of back and forth in the thread, but I think I made sense of it and would like to share a couple of things that seem to be missing from the answers so far:
100m is not the same as 1/10 of a core's power. It is an absolute quantity of CPU time and will remain the same regardless of the number of cores in the node.
While that CPU time might well be given on multiple cores of the node, true parallelism still depends on having a CPU limit that is well over the CPU required by a single thread. Otherwise, your multi-threaded application will run concurrently (i.e. threads take turns on the same CPU core) rather than in parallel (i.e. threads running on separate cores).
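For illustration, here is a minimal sketch of a pod spec that leaves enough headroom for two busy threads to run truly in parallel; the name and image are placeholders, not from the question:

apiVersion: v1
kind: Pod
metadata:
  name: multithreaded-demo        # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/my-multithreaded-app:latest   # placeholder image
    resources:
      requests:
        cpu: "2"      # ask the scheduler for two cores' worth of CPU time
      limits:
        cpu: "2"      # two busy threads can now run on two cores at the same time

With a limit of 100m instead, the same two threads may still be placed on different cores by the OS, but their combined CPU time is throttled to 0.1 core per accounting period, so in practice they take turns rather than run in parallel.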
Related
Problem
We are trying to create an inference API that loads a PyTorch ResNet-101 model on AWS EKS. Apparently, it always gets killed (OOM) due to high CPU and memory usage. Our logs show we need around a 900m CPU resource limit. Note that we only tested it using a single 1.8 MB image. Our DevOps team didn't really like it.
What we have tried
Currently we are using the standard PyTorch model-loading module. We also clean the model state dict to reduce memory usage.
Is there any method to reduce the CPU usage of loading a PyTorch model?
Have you tried limiting the CPU available to the pods?
- name: pytorch-ml-model
  image: pytorch-cpu-hog-model-haha
  resources:
    limits:
      memory: "128Mi"
      cpu: "1000m" # Replace this with the CPU amount your DevOps guys will be happy about
If your error is OOM, you might want to consider adding more memory allocated per pod. We as outsiders have no idea how much memory you would require to execute your models; I would suggest using debugging tools like the PyTorch profiler to understand how much memory you need for your inference use case.
You might also want to consider using memory-optimized worker nodes and applying deployment-node affinity through labels to ensure that inference pods are allocated to memory-optimized nodes in your EKS cluster, as in the sketch below.
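To make the node-affinity suggestion concrete, here is a minimal sketch, assuming you have labeled your memory-optimized node group with a hypothetical label node-type=memory-optimized; the deployment name and sizes are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-inference            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-inference
  template:
    metadata:
      labels:
        app: pytorch-inference
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type       # hypothetical label on the memory-optimized node group
                operator: In
                values:
                - memory-optimized
      containers:
      - name: pytorch-ml-model
        image: pytorch-cpu-hog-model-haha   # image name reused from the snippet above
        resources:
          limits:
            cpu: "1000m"
            memory: "2Gi"            # illustrative; size this from the profiler output

The required... rule means the pod is only scheduled onto nodes carrying that label; a preferred... rule would make it a soft preference instead.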
I need to restrict an app/deployment to run on specific CPUs only (say 0-3, or just 1 or 2, etc.). I found out about the CPU Manager and tried to implement it with the static policy, but I am not able to achieve what I intend to.
I tried the following so far:
Enabled the CPU Manager static policy on the kubelet and verified that it is enabled
Reserved CPUs with the --reserved-cpus=0-3 option on the kubelet
Ran a sample nginx deployment with limits equal to requests and an integer CPU value, i.e. a QoS class of Guaranteed is ensured, and validated the CPU affinity with taskset -c -p $(pidof nginx)
So this restricts my nginx app to run on any CPUs other than the reserved ones (0-3), i.e. if my machine has 32 CPUs, the app can run on any of CPUs 4-31, and so can any other apps/deployments. As I understand it, the reserved CPUs 0-3 are set aside for system daemons, OS daemons, etc.
My questions:
Using the Kubernetes CPU Manager features, is it possible to pin an app/pod (in this case, my nginx app) to a specific CPU only (say 2, or 3, or 4-5)? If yes, how?
If point 1 is possible, can we perform the pinning at the container level too? Say Pod A has two containers, Container B and Container D. Is it possible to pin CPUs 0-3 to Container B and CPU 4 to Container D?
If none of this is possible using the Kubernetes CPU Manager, what alternatives are available at this point in time, if any?
As I understand your question, you want to dedicate a specific set of CPUs to each app/pod. From what I have searched, I was only able to find some documentation that might help; the other resource is a GitHub topic that I think describes a workaround for your problem.
As a disclaimer: based on what I've read, searched and understood, there is no direct solution for this issue, only workarounds. I am still searching further.
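To make the distinction concrete: with the kubelet's static CPU Manager policy, a container in a Guaranteed-QoS pod that requests an integer number of CPUs is given that many exclusive cores from the shared pool (i.e. outside --reserved-cpus), but the manifest cannot choose which core IDs those are. A minimal sketch (name and sizes are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: pinned-nginx        # placeholder name
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "2"            # integer CPU count ...
        memory: "256Mi"
      limits:
        cpu: "2"            # ... equal to the limit -> Guaranteed QoS, 2 exclusive cores
        memory: "256Mi"

Pinning to specific core IDs (e.g. exactly CPUs 4-5) is not something the CPU Manager exposes, so for that you would be left with node-level mechanisms such as cpusets/taskset, which is the kind of workaround referred to above.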
Below you can see the setup that I currently have. A Django app creates a set of requests as Celery tasks. Load is balanced using Istio across the gRPC server pods. A Python script processes each request and returns it. Everything is on AWS EKS, and HPA and cluster scaling are also active.
The Python script is a CPU-intensive process, and depending on the request that Django sends, the CPU and memory usage of the Python script varies a lot. Visually inspecting it, each request can take anything between:
Best case (more common) -> 100Mi memory, 100m CPU -> the Python script takes a few seconds to process
to
Worst case (less common) -> 1000Mi memory, 10,000m CPU -> the Python script takes up to 3-4 minutes to process
Here are the current resources for the gRPC server, which runs on a c5.2xlarge instance:
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 14
    memory: 16384Mi
Also, the gRPC server has a ThreadPoolExecutor with max_workers=16, which means it can respond to 16 requests at the same time.
The issue is that I'm trying to use the least amount of resources while making sure each request doesn't take more than X minutes/seconds.
Scenarios I can think of:
Use the same resources as defined above and set max_workers=1. This way I'm sure that each pod only processes one request at a time, and I can somewhat guarantee how long the worst case takes to process. However, it'd be super expensive and probably not that scalable.
Use the same resources as defined above but set max_workers=16 or a bigger number. In this case, even though each pod takes up a lot of memory and CPU, at least each gRPC server can handle multiple requests at the same time. However, what if a few of the worst-case requests hit the same pod? Then it'd take a long time to process those requests.
Set max_workers=1 and modify the resources to something like below. This way each pod still processes only 1 request at a time and uses minimal resources, but it can go up to the limit for the rare cases. I guess it's not good practice for the limits and requests to be that different.
resources:
  limits:
    cpu: 14
    memory: 16384Mi
  requests:
    cpu: 100m
    memory: 100Mi
I'd be grateful if you can take a look at the scenarios above. Any/all thoughts are highly appreciated.
Thanks
Background
Recently my lab invested in GPU computing infrastructure, more specifically two Titan V cards installed in a standard server machine. Currently the machine is running a not-at-all-configured Windows Server. Everyone from my lab can log in and do whatever they want. From time to time the machine becomes completely useless for others, because someone accidentally occupied all the available memory.
Since ML is growing here, I am looking for a better way to make use of our infrastructure.
Requirements
Multi-user. PhDs and students should be able to run their tasks.
Job queue or scheduling (preferably something like time-sliced scheduling)
Dynamic allocation of resources. If a single task is running, it is OK for it to use all the memory, but as soon as a second one is started they should share the resources.
Easy / remote job submission: maybe a web page?
What I tried so far
I have a small test setup (consumer PC with GTX 1070) for experimenting. My internet research pointed me to SLURM and Kubernetes.
First of all I like the idea of a cluster management system, since it offers the option to extend the infrastructure in future.
SLURM was fairly easy to set up, but I was not able to set up something like remote submission or time-sliced scheduling.
In the meantime I also tried to work with Kubernetes. To me it offers far more interesting features, above all containerization. However, all these features make it more complicated to set up and understand. And again, I was not able to build something like remote submission.
My question
Has someone faced the same problem and can report his/her solution? I have the feeling that Kubernetes is better prepared for the future.
If you need more information, let me know.
Thanks
Tim!
As far as my knowledge goes, Kubernetes does not support sharing of GPUs, which is what was asked here.
There is an ongoing discussion: Is sharing GPU to multiple containers feasible? #52757
I was able to find a Docker image with examples which "support share GPUs unofficially", available here: cvaldit/nvidia-k8s-device-plugin.
It can be used in the following way:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: 2 # requesting 2 GPUs
  - name: digits-container
    image: nvidia/digits:6.0
    resources:
      limits:
        nvidia.com/gpu: 2 # requesting 2 GPUs
That would expose 2 GPUs inside each container to run your job on, and also lock those 2 GPUs from further use until the job ends.
I'm not sure how you would scale this for multiple users, other than limiting the maximum number of GPUs they can use per job.
Also, you can read about Schedule GPUs, which is still experimental.
Coming from numerous years of running Node/Rails apps on bare metal, I was used to being able to run as many apps as I wanted on a single machine (say, a 2 GB DigitalOcean droplet could easily handle 10 apps without worry, given correct optimizations or a fairly low amount of traffic).
Thing is, using Kubernetes, the game sounds quite different. I've set up a "getting started" cluster with 2 standard VMs (3.75 GB each).
I assigned limits on a deployment with the following:
resources:
  requests:
    cpu: "64m"
    memory: "128Mi"
  limits:
    cpu: "128m"
    memory: "256Mi"
Then I witnessed the following:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default api 64m (6%) 128m (12%) 128Mi (3%) 256Mi (6%)
What does this 6% refer to?
I tried lowering the CPU limit to, like, 20Mi… and the app fails to start (obviously, not enough resources). The docs say it is a percentage of CPU. So, 20% of a 3.75 GB machine? Then where does this 6% come from?
I then increased the size of the node pool to n1-standard-2, and the same pod effectively takes 3% of the node. That sounds logical, but what does it actually refer to?
I still wonder which metric is being taken into account for this part.
The app seems to need a large amount of memory on startup, but then it uses only a minimal fraction of this 6%. I feel like I'm misunderstanding something, or misusing it all.
Thanks for any experienced tips/advice to help me get a better understanding.
Best
According to the docs, CPU requests (and limits) are always fractions of available CPU cores on the node that the pod is scheduled on (with a resources.requests.cpu of "1" meaning reserving one CPU core exclusively for one pod). Fractions are allowed, so a CPU request of "0.5" will reserve half a CPU for one pod.
For convenience, Kubernetes allows you to specify CPU resource requests/limits in millicores:
The expression 0.1 is equivalent to the expression 100m, which can be read as “one hundred millicpu” (some may say “one hundred millicores”, and this is understood to mean the same thing when talking about Kubernetes). A request with a decimal point, like 0.1 is converted to 100m by the API, and precision finer than 1m is not allowed.
As already mentioned in the other answer, resource requests are guaranteed. This means that Kubernetes will schedule pods in a way that the sum of all requests will not exceed the amount of resources actually available on a node.
So, by requesting 64m of CPU time in your deployment, you are actually requesting 64/1000 = 0.064 = 6.4% of one of the node's CPU cores. So that's where your 6% comes from. When upgrading to a VM with more CPU cores, the amount of available CPU resources increases, so on a machine with two available CPU cores, a request for 6.4% of one CPU's time amounts to 3.2% of the total CPU time of the two CPUs.
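For reference, the decimal and millicore notations describe the same quantity, so the following fragment requests and limits the same amount of CPU either way (container spec omitted, values illustrative):

resources:
  requests:
    cpu: "0.1"      # stored by the API as 100m
  limits:
    cpu: "100m"     # the same quantity in millicore notation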
The 6% of CPU means that 6% (the CPU request) of the node's CPU time is reserved for this pod. So it is guaranteed that it always gets at least this amount of CPU time. It can still burst up to 12% (the CPU limit) if there is CPU time left.
This means that if the limit is very low, your application will take more time to start up. Therefore a liveness probe may kill the pod before it is ready, because the application took too long to respond. To solve this you may have to increase the initialDelaySeconds or the timeoutSeconds of the liveness probe.
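A minimal sketch of such an adjustment (the endpoint, port and timings are placeholders, not taken from your deployment):

containers:
- name: api                    # placeholder name
  image: my-app:latest         # placeholder image
  livenessProbe:
    httpGet:
      path: /healthz           # placeholder health endpoint
      port: 8080
    initialDelaySeconds: 60    # give the CPU-throttled app more time to start up
    timeoutSeconds: 5
    periodSeconds: 10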
Also note that the resource requests and limits define how many resources your pod allocates, and not the actual usage.
The resource request is what your pod is guaranteed to get on a node. This means, that the sum of the requested resources must not be higher than the total amount of CPU/memory on that node.
The resource limit is the upper limit of what your pod is allowed to use. This means the sum of these limits can be higher than the actually available CPU/memory.
Therefore the percentages tell you how much of the node's total CPU and memory your pod allocates.
Link to the docs: https://kubernetes.io/docs/user-guide/compute-resources/
Some other notable things:
If your pod uses more memory than defined in the limit, it gets OOMKilled (out of memory).
If your pod uses more memory than it requested and the node runs out of memory, the pod might get OOMKilled in order to let other pods, which use less than their requested memory, survive.
If your application needs more CPU than requested it can burst up to the limit.
Your pod never gets killed just because it uses too much CPU.
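To tie these rules together, a minimal sketch (name, image and sizes are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo          # placeholder name
spec:
  containers:
  - name: app
    image: my-app:latest       # placeholder image
    resources:
      requests:
        cpu: "100m"            # guaranteed share, used by the scheduler
        memory: "128Mi"        # using more than this makes the pod a candidate for OOMKill if the node runs out of memory
      limits:
        cpu: "500m"            # CPU usage above this is throttled, never killed
        memory: "256Mi"        # exceeding this gets the container OOMKilled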