Can Kubernetes share a single GPU between pods?

Is it possible to share a single GPU between Kubernetes pods?

As the official docs say:
GPUs are only supposed to be specified in the limits section, which means:
- You can specify GPU limits without specifying requests, because Kubernetes will use the limit as the request value by default.
- You can specify GPUs in both limits and requests, but these two values must be equal.
- You cannot specify GPU requests without specifying limits.
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
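For reference, requesting a whole GPU in a pod spec looks like this (resource name as exposed by the NVIDIA device plugin; the image is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.0-base   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1   # whole GPUs only; fractional values are rejected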
Also, you can follow this discussion to get a little bit more information.

Yes, it is possible, at least with NVIDIA GPUs.
Just don't specify the GPU in the resource limits/requests. This way, containers from all pods will have full access to the GPU as if they were normal processes.
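A minimal sketch of that approach, assuming the node's default container runtime is the NVIDIA one (e.g. via nvidia-container-toolkit) and that it honors NVIDIA_VISIBLE_DEVICES; names and image are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod               # hypothetical name
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:11.0-base   # illustrative image
      env:
        # Assumption: the NVIDIA runtime exposes all GPUs when this is set,
        # even though no nvidia.com/gpu resource is claimed.
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
      # Note: with no resources.limits for nvidia.com/gpu, the scheduler does
      # not account for GPU usage, so multiple such pods can share the device.

Keep in mind that the scheduler is then unaware of GPU memory, so the containers can interfere with each other.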

Yes, it's possible by making some changes to the scheduler. Someone on GitHub kindly open-sourced their solution; take a look here: https://github.com/AliyunContainerService/gpushare-scheduler-extender
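Per that project's README, the extender schedules on GPU memory rather than on whole devices; a sketch of a pod requesting a slice (resource name as documented there, values illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-pod
spec:
  containers:
    - name: app
      image: my-cuda-app:latest      # hypothetical image
      resources:
        limits:
          # GPU memory to reserve; the unit (GiB or MiB) depends on how the
          # accompanying device plugin is configured.
          aliyun.com/gpu-mem: 3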

Yes, you can use nano-gpu to share NVIDIA GPUs.

The official docs say pods can't request a fraction of a GPU. If you are running machine-learning applications in multiple pods, look into Kubeflow; those folks have solved this issue.

Related

Can a dedicated GPU be shared with multiple Kubernetes pods?

Is there a way to share the GPU between multiple pods, or do we need a specific model of NVIDIA GPU?
Short answer, yes :)
Long answer below :)
There is no "built-in" solution to achieve that, but you can use many tools (plugins) to control the GPU. First, look at the official Kubernetes site:
Kubernetes includes experimental support for managing AMD and NVIDIA GPUs (graphical processing units) across several nodes.
This page describes how users can consume GPUs across different Kubernetes versions and the current limitations.
Also look at the limitations:
GPUs are only supposed to be specified in the limits section, which means:
- You can specify GPU limits without specifying requests because Kubernetes will use the limit as the request value by default.
- You can specify GPU in both limits and requests but these two values must be equal.
- You cannot specify GPU requests without specifying limits.
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
As you can see, this supports GPUs across several nodes. You can find a guide on how to deploy it.
Additionally, if you don't specify the GPU in the resource requests/limits, containers from all pods will have full access to the GPU as if they were normal processes. There is no need to do anything special in this case.
For more, see also this GitHub topic.

Guessing kubernetes limits for kubernetes deployments

Is there any way to correctly estimate the resource limits we need to set for deployments running on Kubernetes clusters?
Yes, you can guess that a single-threaded application most likely won't need more than 1 CPU.
For any other program: no, there is no easy way to guess it. Every application is different and reacts differently under different workloads.
The easiest way to figure out how many resources it needs is to run it and measure it.
Run some benchmarks/profilers and see how the application behaves, then make decisions based on that.
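For example, once you've measured steady-state and peak usage under a realistic load, you can encode it in the deployment's container spec (the numbers here are purely illustrative):

resources:
  requests:
    cpu: 250m         # roughly the steady-state usage observed while profiling
    memory: 256Mi
  limits:
    cpu: "1"          # observed peak plus some headroom
    memory: 512Mi     # a limit above the observed peak avoids OOM kills

Set requests near the typical usage so the scheduler packs nodes sensibly, and limits near the tolerable peak.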

In Kubernetes, how can I scale a Deployment to zero when idle

I'm running a fairly resource-intensive service on a Kubernetes cluster to support CI activities. Only a single replica is needed, but it uses a lot of resources (16 CPUs), and it's generally only needed during work hours (weekdays, roughly 8am-6pm). My cluster runs in a cloud and is set up with instance autoscaling, so if this service is scaled to zero, that instance can be terminated.
The service is third-party code that cannot be modified (well, not easily). It's a fairly typical HTTP service, other than the fact that its work is fairly CPU-intensive.
What options exist to automatically scale this Deployment down to zero when idle?
I'd rather not set up a schedule to scale it up/down during working hours, because occasionally CI activities are performed outside the normal hours. I'd like the scaling to be dynamic (for example, scale to zero when idle for more than 30 minutes, or scale to one when an incoming connection arrives).
Kubernetes supports scaling to zero only by means of an API call, since the Horizontal Pod Autoscaler supports scaling down to a minimum of 1 replica.
However, there are a few operators that let you get around that limitation, either by intercepting the requests coming to your pods or by inspecting certain metrics.
You can take a look at Knative or Keda.
They enable your application to be serverless, and they do so in different ways.
Knative, by means of Istio, intercepts the requests: if there's an active pod serving them, it redirects the incoming request to that one; otherwise, it triggers a scale-up.
By contrast, Keda best fits event-driven architectures, because it can inspect predefined metrics, such as lag, queue length, or custom metrics (collected from Prometheus, for example) and trigger the scaling.
Both support scale to zero when predefined conditions are met within an equally predefined window.
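As an illustration, a Keda ScaledObject that scales a Deployment to zero based on a Prometheus metric could look roughly like this (names, address, and query are hypothetical; check the schema against Keda's docs for your version):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ci-service-scaler              # hypothetical name
spec:
  scaleTargetRef:
    name: ci-service                   # the Deployment to scale
  minReplicaCount: 0                   # allow scaling to zero
  cooldownPeriod: 1800                 # seconds of inactivity before scaling to zero (~30 min)
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090    # hypothetical address
        threshold: "1"
        query: 'sum(rate(http_requests_total{service="ci-service"}[2m]))'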
Hope it helped.
I ended up implementing a custom solution: https://github.com/greenkeytech/zero-pod-autoscaler
Compared to Knative, it's more of a "toy" project, fairly small, and has no dependency on Istio. It's been working well for my use case, though I don't recommend others use it without being willing to adopt the code as their own.
There are a few ways this can be achieved; possibly the most "native" way is using Knative with Istio. Kubernetes by default allows you to scale to zero, but you need something that can broker the scale-up events based on an "input event", essentially something that supports an event-driven architecture.
You can take a look at the official documents here: https://knative.dev/docs/serving/configuring-autoscaling/
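A sketch of a Knative Service that scales to zero when idle (annotation names are taken from the Knative autoscaling docs; the image is hypothetical):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ci-service
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scale to zero
        autoscaling.knative.dev/window: "30m"    # how long traffic must be absent before scaling down
    spec:
      containers:
        - image: example.com/ci-service:latest   # hypothetical image

Knative's activator then buffers the first incoming request and brings a pod back up before forwarding it.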
The horizontal pod autoscaler currently doesn’t allow setting the minReplicas field to 0, so the autoscaler will never scale down to zero, even if the pods aren’t doing anything. Allowing the number of pods to be scaled down to zero can dramatically increase the utilization of your hardware.
When you run services that get requests only once every few hours or even days, it doesn’t make sense to have them running all the time, eating up resources that could be used by other pods.
But you still want to have those services available immediately when a client request comes in.
This is known as idling and un-idling. It allows pods that provide a certain service to be scaled down to zero. When a new request comes in, the request is blocked until the pod is brought up and then the request is finally forwarded to the pod.
Kubernetes currently doesn’t provide this feature yet, but it will eventually.
Based on the documentation, it does not support minReplicas=0 so far; read this thread: https://github.com/kubernetes/kubernetes/issues/69687. To set up the HPA properly, you can use this formula to work out the required number of pods:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
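For example, if 2 replicas currently average 200m CPU against a 100m target, then desiredReplicas = ceil[2 * (200 / 100)] = 4.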
You can also set up the HPA based on Prometheus metrics; follow this link:
https://itnext.io/horizontal-pod-autoscale-with-custom-metrics-8cb13e9d475
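For reference, a minimal HPA manifest using the autoscaling/v2 schema might look like this (names and numbers are illustrative; note that minReplicas cannot go below 1 here):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ci-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ci-service
  minReplicas: 1        # the HPA itself cannot scale to zero
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50    # target average CPU utilization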

Request all of the available huge-pages for a pod

How can I request kubernetes to allocate all of the available huge-pages on a node to my app pod?
My application uses huge pages. Before deploying to Kubernetes, I configured the Kubernetes nodes for huge pages, and I know that by specifying the limits and requests as described in the k8s docs, my app pod can make use of huge pages. But this tightly couples the node specs to my app configuration. If my app were to run on different nodes with different amounts of huge pages, I would have to keep overriding these values based on the target environment.
resources:
  limits:
    hugepages-2Mi: 100Mi
But then as per k8s doc, "Huge page requests must equal the limits. This is the default if limits are specified, but requests are not."
Is there a way I can ask k8s to allocate all available huge pages to my app pod, or just keep it dynamic, as with unspecified memory or CPU requests/limits?
From the design proposal, it looks like the huge pages resource request is fixed and not designed for scenarios where there can be multiple sizes:
We believe it is rare for applications to attempt to use multiple huge page sizes.
Although you're not trying to use multiple values but to modify them dynamically (once the pod is deployed), the sizes must be consistent; they are used for pod pre-allocation and to determine how to treat reserved resources in the scheduler.
This means the mechanism mostly assesses resources in a fixed way, expecting a somewhat uniform scenario (where node-level page sizes were previously set).
It looks like you're going to have to roll out different pod specs depending on your nodes' settings. For that, some "traditional" tainting or labeling of the nodes may help identify specific resources in a heterogeneous cluster.
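One sketch of that approach: label each node with its huge-page profile and pair every pod-spec variant with a matching nodeSelector (the label name is hypothetical):

kubectl label nodes node-1 hugepages-profile=large

Then, in the pod-spec variant sized for those nodes:

nodeSelector:
  hugepages-profile: large       # hypothetical label
resources:
  limits:
    hugepages-2Mi: 1Gi           # sized for the "large" node profile
    memory: 1Gi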

Kubernetes priority of removing pods in 1.8.1

We have a small problem with our Kubernetes cluster.
One of our applications is so demanding that it sometimes consumes all of our resources, and eventually some pods get killed. The real problem starts when system pods like flannel or the cache get removed.
Is there a recommended way to control what gets removed? How can we protect the system pods? Maybe someone has experience with this topic?
One of the ideas is to change the QoS class of all pods/apps from kube-system to "Guaranteed". But I'm afraid that this will not work well if we limit resources, even with a large margin.
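For context, a pod is assigned the Guaranteed QoS class only when every one of its containers sets CPU and memory limits equal to its requests; an illustrative snippet (numbers are placeholders):

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 100m        # limits equal to requests => Guaranteed QoS
    memory: 128Mi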
By the way, where can I read about the (default) resource requirements of system services? How can they be set during the cluster-creation phase?
The second idea is to set the eviction policy and/or taints and tolerations, but there is a worry that our key application would be (re)moved among the first. Unfortunately, it currently runs as a single copy, and its initialization can take up to several minutes, so switching between nodes is currently unacceptable and impossible.
The final idea is to use Priority and Preemption, but from what I see in the 1.8.1 documentation, it is still in the "alpha" phase, and I have serious concerns about the stability of this solution.
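For illustration, a PriorityClass (alpha v1alpha1 API in 1.8, scheduling.k8s.io/v1 in current releases) would look roughly like this, with hypothetical names and values:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: key-app-priority           # hypothetical name
value: 1000000                     # higher-priority pods are preempted/evicted last
globalDefault: false
description: "Priority for the key application"

Pods would then reference it via spec.priorityClassName: key-app-priority.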
Maybe there is something else I did not think about? I will be happy to hear other proposals.