AKS resource quota issue - pods are not being created - kubernetes

We have created a ResourceQuota object in AKS using the following YAML:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: {Namespacename}
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 2Gi
    limits.cpu: "2"
    limits.memory: 4Gi
and the respective changes are applied at the container level for requests and limits (memory and CPU). After applying these changes, pods are not getting created, and no error is shown.
Need guidance/help on this if anyone has faced this issue already.
Thanks.

Namespace requests/limits are settings that affect every pod/container in a given namespace as a whole.
What this means is that if you do not set your pod requests/limits correctly, you may not be able to deploy your pod to that quota-enabled namespace, regardless of the node's available resources.
Instead of throttling, you simply won't be allowed to deploy if the quota would be breached.
When you have namespace quotas set up, in order to deploy to that namespace your pod has to have requests/limits defined, or the pod will not be scheduled for deployment.
Please check AKS Performance: Resource Quotas for a full explanation with worked examples.
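For example, a pod like the following declares both requests and limits and fits inside the quota above (a minimal sketch; the pod name, container name, and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: quota-demo          # hypothetical name
spec:
  containers:
  - name: app               # hypothetical container
    image: nginx            # hypothetical image
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
Also note that when the pods come from a Deployment, a quota rejection does not surface on the Deployment or as a pending pod; it typically shows up as FailedCreate events on the ReplicaSet, so kubectl describe replicaset <name> or kubectl get events -n <namespace> reveals the "exceeded quota" or "must specify limits" message.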

Related

Verifying resources in a deployment yaml

In a deployment YAML, how can we verify that the resources we need for the running pods are guaranteed by Kubernetes?
Is there a way to figure that out?
Specify your resource requests in the deployment YAML. The kube-scheduler checks that the requested resources are available on a node before scheduling the pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: guestbook
      tier: frontend
  replicas: 3
  template:
    metadata:
      labels:
        app: guestbook
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: gcr.io/google-samples/gb-frontend:v4
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
How are Pods with resource requests scheduled? (Ref)
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
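You can inspect this capacity check on your own cluster; kubectl describe node lists the node's Allocatable resources and how much of them is already requested (the node name is a placeholder):
kubectl describe node <node-name>
# Relevant sections of the output:
#   Allocatable:          the CPU/memory the node can offer to Pods
#   Allocated resources:  the summed Requests and Limits of the Pods already scheduled there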
N.B.: However, if you want a container not to use more than its allowed resources, specify the limit too.
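For instance, the php-redis container above could declare limits alongside its requests (the limit values here are purely illustrative):
resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 200m
    memory: 200Mi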
There are QoS (Quality of Service) classes for running pods in k8s. The class that both guarantees and restricts requests and limits is qosClass: Guaranteed.
To make your pods' QoS class Guaranteed:
Every Container in the Pod must have a memory limit and a memory request.
For every Container in the Pod, the memory limit must equal the memory request.
Every Container in the Pod must have a CPU limit and a CPU request.
For every Container in the Pod, the CPU limit must equal the CPU request.
These restrictions apply to init containers and app containers equally.
Also check out the reference page for more info:
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
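As a sketch, a pod whose container sets limits equal to requests satisfies all of the conditions above (the name, image, and values are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo            # hypothetical name
spec:
  containers:
  - name: app               # hypothetical container
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 256Mi
kubectl get pod qos-demo -o jsonpath='{.status.qosClass}' should then print Guaranteed.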

How to tag nodes in GKE, then assign pods to nodes?

I have the standard 3-nodes pool in GKE.
I want to label 1 of these nodes to be something like dev or test, so all the pods in the namespaces dev, qa or stage are scheduled on that node, preferably.
The pods in the namespace prod would use the other available nodes, be it 2, 3 or more nodes available.
Basically I want to prevent the CPU/memory load of non-important deployments from affecting the prod deployments.
How can I do this automatically? Because when Google updates the nodes (for a software update) a new one is created.
If you want to restrict CPU/memory, using labels on nodes is not the right way to do this. Instead, set a quota on the dev/test namespace.
https://kubernetes.io/docs/concepts/policy/resource-quotas/
Basically, it would look something like this
apiVersion: v1
kind: ResourceQuota
metadata:
  name: low-priority
spec:
  hard:
    cpu: "5"
    memory: 1Gi
    pods: "10"
I have a blog post about this as well: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits
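Assuming the quota above is saved as low-priority-quota.yaml (the file name is just illustrative), it is applied per namespace and its consumption can be inspected like this:
kubectl apply -f low-priority-quota.yaml --namespace=dev
kubectl describe resourcequota low-priority --namespace=dev   # shows Used vs Hard for each resource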
Edit:
If you really want to do this, I think the best way would be to use node pools, and then tag the pods to go into a specific pool.
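GKE labels every node with its pool name under cloud.google.com/gke-nodepool, so a sketch of pinning dev/test workloads to a dedicated pool (the pool name below is a placeholder) would be:
# In the pod template of your dev/test Deployments:
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: low-priority-pool   # hypothetical pool name
Because the label is applied by GKE itself, it survives node upgrades and re-creations, which addresses the "automatically" part of the question.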

Moving Pod to another node automatically

Is it possible for a pod/deployment/statefulset to be moved to another node or be recreated on another node automatically if the first node fails? The pod in question is set to 1 replica. So is it possible to configure some sort of failover for Kubernetes pods? I've tried out pod affinity settings, but nothing is moved automatically and it has been around 10 minutes.
The YAML for the said pod is below:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-rbd-sc-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: ceph-rbd-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: ceph-rbd-pod-pvc-sc
  labels:
    app: ceph-rbd-pod-pvc-sc
spec:
  containers:
  - name: ceph-rbd-pod-pvc-sc
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /mnt/ceph_rbd
      name: volume
  nodeSelector:
    etiket: worker
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: ceph-rbd-sc-pvc
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            name: ceph-rbd-pod-pvc-sc
        topologyKey: "kubernetes.io/hostname"
Edit:
I managed to get it to work. But now I have another problem: the newly created pod on the other node is stuck in "ContainerCreating" and the old pod is stuck in "Terminating". I also get a Multi-Attach error for the volume, stating that the PV is still in use by the old pod. The situation is the same for any deployment/statefulset with a PV attached; the problem is resolved only when the failed node comes back online. Is there a solution for this?
Answer from coderanger remains valid regarding Pods. Answering to your last edit:
Your issue is with CSI.
When your Pod uses a PersistentVolume whose accessModes is RWO, and the Node hosting your Pod becomes unreachable, prompting the Kubernetes scheduler to terminate the current Pod and create a new one on another Node, your PersistentVolume can not be attached to the new Node.
The reason for this is that CSI introduced a kind of "lease", marking a volume as bound.
With previous CSI specs & implementations, this lock was not visible in terms of the Kubernetes API. If your ceph-csi deployment is recent enough, you should find a corresponding VolumeAttachment object that can be deleted to fix your issue:
# kubectl get volumeattachments -n ci
NAME                                                                   ATTACHER           PV                                         NODE                ATTACHED   AGE
csi-194d3cfefe24d5f22616fabd3d2fb2ce5f79b16bdca75088476c2902e7751794   rbd.csi.ceph.com   pvc-902c3925-11e2-4f7f-aac0-59b1edc5acf4   melpomene.xxx.com   true       14d
csi-24847171efa99218448afac58918b6e0bb7b111d4d4497166ff2c4e37f18f047   rbd.csi.ceph.com   pvc-b37722f7-0176-412f-b6dc-08900e4b210d   clio.xxx.com        true       90d
....
kubectl delete -n ci volumeattachment csi-xxxyyyzzz
Those VolumeAttachments are created by your CSI provisioner, before the device mapper attaches a volume.
They would be deleted only once the corresponding PV has been released from a given Node, according to its device mapper - which needs to be running, with kubelet up and the Node marked as Ready according to the API. Until then, other Nodes can't map it. There's no timeout: should a Node become unreachable due to network issues or an abrupt shutdown/force-off/reset, its RWO PVs are stuck.
See: https://github.com/ceph/ceph-csi/issues/740
One workaround for this would be not to use CSI, and rather stick with legacy StorageClasses, in your case installing rbd on your nodes.
Though last I checked (k8s 1.19.x) I couldn't manage to get it working, and I can't recall what was wrong. CSI tends to be "the way" to do it nowadays. Despite not being suitable for production use, sadly, unless you are running in an IaaS with auto-scale groups deleting Nodes from the Kubernetes API (eventually evicting the corresponding VolumeAttachments), or using some kind of MachineHealthCheck like OpenShift 4 implements.
A bare Pod is a single immutable object. It doesn't have any of these nice things. Related: never ever use bare Pods for anything. If you try this with a Deployment you should see it spawn a new one to get back to the requested number of replicas. If the new Pod is Unschedulable you should see events emitted explaining why. For example if only node 1 matches the nodeSelector you specified, or if another Pod is already running on the other node which triggers the anti-affinity.
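A minimal sketch of the same workload wrapped in a Deployment (names and the PVC reference taken from the question's YAML), so that a replacement Pod is created on another node when the original node fails:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ceph-rbd-pod-pvc-sc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ceph-rbd-pod-pvc-sc
  template:
    metadata:
      labels:
        app: ceph-rbd-pod-pvc-sc
    spec:
      containers:
      - name: ceph-rbd-pod-pvc-sc
        image: busybox
        command: ["sleep", "infinity"]
        volumeMounts:
        - mountPath: /mnt/ceph_rbd
          name: volume
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: ceph-rbd-sc-pvc
The RWO/Multi-Attach caveat from the previous answer still applies while the old node is unreachable.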

What are the possible causes for a pod container to be restarted due to Out of Memory Killed?

I have the following LimitRange applied in the namespace where my pod is running:
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:
      memory: 1Gi
    defaultRequest:
      memory: 256Mi
    type: Container
Kubernetes is restarting this container sometimes, with this error code:
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
According to the system monitoring (Grafana), the container was only consuming ~500Mb of memory at the time the kill signal was sent by Kubernetes.
Also, the node where the pod is running has a lot of available memory (it was using around 15% of its capacity at the time the container was restarted).
So is there any possible reason for Kubernetes to restart this container? This has already happened ~5-7 times over the last week.
The LimitRange k8s object is a "policy to constrain resources by Pod or Container in a namespace." So the containers in the namespace where the LimitRange object is created are consuming more than the limit specified in your LimitRange object. To test whether this is true, remove the LimitRange temporarily and check the real usage of ALL your namespace resources, not just one pod. After that you will be able to find the best limit config to fit the namespace.
In the k8s docs, you can find a good explanation and a lot of examples of how to restrict limits in your namespace.
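A few commands that help with this check (placeholders in angle brackets; kubectl top requires the metrics-server add-on):
kubectl get limitrange mem-limit-range -o yaml    # see the current default request/limit
kubectl describe pod <pod-name>                   # shows Last State: OOMKilled and the effective limits
kubectl top pod -n <namespace>                    # actual CPU/memory usage of every pod in the namespace
kubectl delete limitrange mem-limit-range         # temporarily remove the policy, as suggested above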

Removing default CPU request and limits on GCP Kubernetes

Kubernetes on Google Cloud Platform configures a default CPU request and limit.
I make use of DaemonSets, and DaemonSet pods should use as much CPU as possible.
Manually increasing the upper limit is possible, but the upper bound must be reconfigured whenever new nodes are added, and it must be set much lower than what is available on the node so that rolling updates can still schedule pods.
This requires a lot of manual actions and some resources are just not used most of the time. Is there a way to completely remove the default CPU limit so that pods can use all available CPUs?
GKE, by default, creates a LimitRange object named limits in the default namespace looking like this:
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
spec:
  limits:
  - defaultRequest:
      cpu: 100m
    type: Container
So, if you want to change this, you can either edit it:
kubectl edit limitrange limits
Or you can delete it altogether:
kubectl delete limitrange limits
Note: the policies in the LimitRange objects are enforced by the LimitRanger admission controller which is enabled by default in GKE.
Limit Range is a policy to constrain resource by Pod or Container in a namespace.
A limit range, defined by a LimitRange object, provides constraints that can:
Enforce minimum and maximum compute resources usage per Pod or Container in a namespace.
Enforce minimum and maximum storage request per PersistentVolumeClaim in a namespace.
Enforce a ratio between request and limit for a resource in a namespace.
Set default request/limit for compute resources in a namespace and automatically inject them to Containers at runtime.
You need to find the LimitRange resource of your namespace and remove the spec.limits.default.cpu and spec.limits.defaultRequest.cpu that are defined (or simply delete the LimitRange to remove all constraints).
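For instance, dropping just the CPU defaultRequest from the limits object shown above could look like this (a sketch using a JSON patch; the /0/ index assumes the single-entry limits list from that object):
kubectl patch limitrange limits --type=json \
  -p='[{"op": "remove", "path": "/spec/limits/0/defaultRequest/cpu"}]'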
The resource limitation can be configured in 2 ways.
At object level:
kubectl edit limitrange limits
This object is created by default with a defaultRequest of 100m CPU (1/10 of a CPU) per container. Note that a container exceeding its CPU limit is throttled, while a container exceeding its memory limit is killed (OOMKilled).
At manifest level:
Using a StatefulSet, DaemonSet, etc., through a YAML file, the limits are configured under
spec.containers[].resources
It looks like this:
spec:
  containers:
  - resources:
      limits:
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 200Mi
As mentioned you can modify the configuration or simply delete them to remove the limitations.
However, there are reasons why these limitations have been implemented.
I found a video from a Googler talking about it, take a look! [1]
On top of the Limit Range mentioned by Eduardo Baitello, you should also look out for admission controllers, which can intercept requests to the Kubernetes API and modify them (e.g. add limits and other defaults).
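Two quick checks for what might be injecting defaults into your pods (a sketch; both are standard kubectl calls):
kubectl get limitrange --all-namespaces       # defaults applied by the LimitRanger admission controller
kubectl get mutatingwebhookconfigurations     # custom admission webhooks that can mutate pod specs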