AWS EKS Node - Volume Attach Limit - kubernetes

We recently upgraded our EKS environment to v1.12.7. After the upgrade we noticed that there is now an "allocatable" resource called attachable-volumes-aws-ebs and in our environment we have many EBS volumes attached to each node (they were all generated dynamically via PVCs).
Yet on every node in the "allocated resources" section, it shows 0 volumes attached:
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         16
  ephemeral-storage:           96625420948
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      64358968Ki
  pods:                        234
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests           Limits
  --------                    --------           ------
  cpu                         5430m (33%)        5200m (32%)
  memory                      19208241152 (29%)  21358Mi (33%)
  attachable-volumes-aws-ebs  0                  0
Because of this, the scheduler is continuing to try and attach new volumes to nodes that already have 25 volumes attached.
How do we get kubernetes to recognize the volumes that are attached so that the scheduler can act accordingly?

First check your pods' status: they may be stuck in Pending, which is probably why your volumes are not being counted. Volumes can also get stuck in the Attaching state when you use many PersistentVolumeClaims.
Your volumes may also fail to attach because of a NodeWithImpairedVolumes=true:NoSchedule taint, at the same time as the node's attachable-volumes-aws-ebs count is exhausted.
Try to execute:
$ kubectl taint nodes node1 NodeWithImpairedVolumes:NoSchedule-
on every node that carries this taint (NodeWithImpairedVolumes=true:NoSchedule).
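If you are not sure which nodes carry it, one way to list taints per node is this sketch (the jsonpath expression may need adjusting for your kubectl version):
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}' | grep NodeWithImpairedVolumes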
If you use awsElasticBlockStore, there are some restrictions:
nodes on which Pods are running must be AWS EC2 instances
those instances need to be in the same region and availability-zone as the EBS volume
EBS only supports a single EC2 instance mounting a volume.
You can also use a count/* resource quota: an object is charged against the quota if it exists in server storage. These quotas are useful to protect against exhaustion of storage resources; for PVCs, use count/persistentvolumeclaims.
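For example, a minimal ResourceQuota sketch using count/persistentvolumeclaims (the name and the value of 25 are illustrative, chosen to match the node attach limit above):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pvc-count-quota                    # hypothetical name
spec:
  hard:
    count/persistentvolumeclaims: "25"     # cap how many PVCs can exist in the namespace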
Allocatable will be computed by the Kubelet and reported to the API server. It is defined to be:
[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved] - [Hard-Eviction-Threshold]
Note: Since kernel usage can fluctuate and is out of kubernetes control, it will be reported as a separate value (probably via the metrics API). Reporting kernel usage is out-of-scope for this proposal.
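As an illustration, these are the kubelet settings that feed into that formula; the amounts below are arbitrary examples, not recommendations:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:                  # [Kube-Reserved]
  cpu: 500m
  memory: 1Gi
systemReserved:                # [System-Reserved]
  cpu: 250m
  memory: 500Mi
evictionHard:                  # [Hard-Eviction-Threshold]
  memory.available: "200Mi"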

It seems like the current best option is using the EBS CSI driver with the new --volume-attach-limit flag
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/522
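A rough sketch of what that looks like on the CSI node DaemonSet (the container name, image tag and the limit of 25 are assumptions; with the Helm chart the same setting is typically exposed as a chart value rather than edited by hand):
containers:
- name: ebs-plugin                          # illustrative name
  image: amazon/aws-ebs-csi-driver:latest   # illustrative tag
  args:
  - --endpoint=$(CSI_ENDPOINT)
  - --volume-attach-limit=25                # override the per-node attach limit the driver reports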

Related

My kubernetes pods are Evicting with ephemeral-storage issue

I am running a k8s cluster with 8 workers and 3 master nodes, and my pods are being evicted repeatedly with ephemeral-storage issues.
Below is the error I am getting on Evicted pods:
Message: The node was low on resource: ephemeral-storage. Container xpaas-logger was using 30108Ki, which exceeds its request of 0. Container wso2am-gateway-am was using 406468Ki, which exceeds its request of 0.
To overcome the above error, I have added ephemeral storage limits and request to my namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limit-range
spec:
  limits:
  - default:
      ephemeral-storage: 2Gi
    defaultRequest:
      ephemeral-storage: 130Mi
    type: Container
Even after adding the above limits and requests to my namespace, my pod is reaching its limit and then being evicted.
Message: Pod ephemeral local storage usage exceeds the total limit of containers 2Gi.
How can I monitor my ephemeral storage, and where is it stored on my instance?
How can I configure Docker log rotation for my ephemeral storage based on size? Any suggestions?
"Ephemeral storage" here refers to space being used in the container filesystem that's not in a volume. Something inside your process is using a lot of local disk space. In the abstract this is relatively easy to debug: use kubectl exec to get a shell in the pod, and then use normal Unix commands like du to find where the space is going. Since it's space inside the pod, it's not directly accessible from the nodes, and you probably can't use tools like logrotate to try to manage it.
One specific cause of this I've run into in the past is processes configured to log to a file. In Kubernetes you should generally set your logging setup to log to stdout instead. This avoids this specific ephemeral-storage problem, but also avoids a number of practical issues around actually getting the log file out of the pod. kubectl logs will show you these logs and you can set up cluster-level tooling to export them to another system.
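Going back to the debugging step, here is a sketch of the kubectl exec / du flow (the pod and container names are placeholders; adjust the du options to whatever shell and tools your image contains):
# open a shell in the suspect container
$ kubectl exec -it my-pod -c my-container -- /bin/sh
# inside the container, find the biggest files and directories on the writable layer
$ du -a / 2>/dev/null | sort -rn | head -n 20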

Kubernetes: cpu request and total resources doubts

To better explain my doubts, I will give an example.
Example:
We have one worker node with 3 allocatable cpus and kubernetes has scheduled three pods on it:
pod_1 with 500m cpu request
pod_2 with 700m cpu request
pod_3 with 300m cpu request
On this worker node I can't schedule any other pods.
But if I check the real usage:
pod_1 cpu usage is 300m
pod_2: cpu usage is 600m
My question is:
Can pod_3 have a real usage of 500m, or will the requests of the other pods limit its CPU usage?
Thanks
Pietro
It doesn't matter what the real usage is - the "request" means how much resources are guaranteed to be available for the pod. Your workload might be using only a fraction of the requested resources - but what will really count is the "request" itself.
Example - Let's say you have a node with 1CPU core.
Pod A - 100m Request
Pod B - 200m Request
Pod C - 700m Request
Now no additional pod can be scheduled on the node, because the whole 1 CPU is already requested by the 3 pods. It doesn't really matter which fraction of the allocated resources each pod is actually using at any given time.
Another point worth noting is the "limit". A workload may exceed its requested resources, but it cannot exceed its limit. This is a very important mechanism to understand.
Kubernetes will schedule the pods based on the requests that you configure for the container(s) of the pod (via the spec of the respective Deployment or other kinds).
Here's an example:
For simplicity, let's assume only one container for the pod.
containers:
- name: "foobar"
  resources:
    requests:
      cpu: "300m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
If you ask for 300 millicpus as your request, Kubernetes will place the pod on a node that has at least 300 millicpus allocatable to that pod. If a node has less allocatable CPU available, the pod will not be placed on that node. Similarly, you can also set the value for memory request as well.
The limit works to restrict the resource use of the container. In the example above, the container will be terminated (OOM-killed) if it ends up using more than 512MiB of memory; when it is recreated, the pod will again only be placed on a node that has at least 300 millicpus allocatable (and if no such node exists, the pod will remain in the Pending state with FailedScheduling as the reason until a node with sufficient capacity is available).
Do note that the resource request matters only at the time of pod scheduling, not at runtime (meaning the actual consumption of resources will not trigger a re-scheduling of the pod, even if the container uses more resources than it requested, as long as it stays below the limit, if one is specified).
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-requests-are-scheduled
So, in summary,
The total of all your requests is what counts against the node's allocatable capacity, regardless of the actual runtime utilization of your pod (as long as the limit is not crossed)
You can request 300 millicpus but use only 100 millicpus, or even 400 millicpus; Kubernetes will still show the "allocated" value as 300
If your container crosses its memory limit, it will be terminated (OOM-killed) by Kubernetes; CPU usage above the limit is simply throttled
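If you want to confirm that a container was killed for exceeding its memory limit, one way (a sketch; the pod name is a placeholder) is to look at its last termination reason:
$ kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints "OOMKilled" if the container was terminated for exceeding its memory limit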

What is the default memory allocated for a pod

I am setting up a pod say test-pod on my google kubernetes engine. When I deploy the pod and see in workloads using google console, I am able to see 100m CPU getting allocated to my pod by default, but I am not able to see how much memory my pod has consumed. The memory requested section always shows 0 there. I know we can restrict memory limits and initial allocation in the deployment YAML. But I want to know how much default memory a pod gets allocated when no values are specified through YAML and what is the maximum limit it can avail?
If you have no resource requests on your pod, it can be scheduled anywhere at all, even on the busiest node in your cluster, as though you had requested 0 memory and 0 CPU. If you have no resource limits, it can consume all available memory and CPU on its node.
(If it’s not obvious, realistic resource requests and limits are a best practice!)
You can set limits on individual pods.
If not, you can set limits on the overall namespace.
By default, there are no limits.
But there are some tricks:
Here is a very nice view of this:
https://blog.balthazar-rouberol.com/allocating-unbounded-resources-to-a-kubernetes-pod
When deploying a pod in a Kubernetes cluster, you normally have 2 choices when it comes to resources allotment:
defining CPU/memory resource requests and limits at the pod level
defining default CPU/memory requests and limits at the namespace level using a LimitRange
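A minimal LimitRange sketch for that second option (the name and all amounts are illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: default-cpu-memory       # hypothetical name
spec:
  limits:
  - type: Container
    defaultRequest:              # used when a container declares no request
      cpu: 100m
      memory: 128Mi
    default:                     # used when a container declares no limit
      cpu: 500m
      memory: 256Mi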
From the Docker documentation (assuming you are using the Docker runtime):
By default, a container has no resource constraints and can use as
much of a given resource as the host’s kernel scheduler will allow
https://docs.docker.com/v17.09/engine/admin/resource_constraints/
Kubernetes pods' CPU and memory usage can be seen using the metrics-server service and the kubectl top pod command:
$ kubectl top --help
...
Available Commands:
...
pod Display Resource (CPU/Memory/Storage) usage of pods
...
Example in Minikube below:
minikube addons enable metrics-server
# wait 5 minutes for metrics-server to be up and running
$ kubectl top pod -n=kube-system
NAME                               CPU(cores)   MEMORY(bytes)
coredns-fb8b8dccf-6t5k8            6m           10Mi
coredns-fb8b8dccf-sjkvc            5m           10Mi
etcd-minikube                      37m          60Mi
kube-addon-manager-minikube        17m          20Mi
kube-apiserver-minikube            55m          201Mi
kube-controller-manager-minikube   30m          46Mi
kube-proxy-bsddk                   1m           11Mi
kube-scheduler-minikube            2m           12Mi
metrics-server-77fddcc57b-x2jx6    1m           12Mi
storage-provisioner                0m           15Mi
tiller-deploy-66b7dd976-d8hbk      0m           13Mi
This link has more information.
Kubernetes doesn’t provide default resource limits out-of-the-box. This means that unless you explicitly define limits, your containers can consume unlimited CPU and memory.
More details here: https://medium.com/@reuvenharrison/kubernetes-resource-limits-defaults-and-limitranges-f1eed8655474
The real problem in many of these cases is not that the nodes are too small, but that we have not accurately specified resource limits for the pods.
Resource limits are set on a per-container basis using the resources property of a containerSpec, which is a v1 api object of type ResourceRequirements. Each object specifies both “limits” and “requests” for the types of resources.
If you do not specify a memory limit for a container, one of the following situations applies:
The container has no upper bound on the amount of memory it uses. The container could use all of the memory available on the Node where it is running which in turn could invoke the OOM Killer. Further, in case of an OOM Kill, a container with no resource limits will have a greater chance of being killed.
The container is running in a namespace that has a default memory limit, and the container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the memory limit.
When you set a limit, but not a request, kubernetes defaults the request to the limit. If you think about it from the scheduler’s perspective it makes sense.
It is important to set resource requests correctly: setting them too low means nodes can get overloaded; setting them too high means nodes will sit idle.
Useful article: memory-limits.

"Limits" property ignored when deploying a container in a Kubernetes cluster

I am deploying a container in Google Kubernetes Engine with this YAML fragment:
spec:
  containers:
  - name: service
    image: registry/service-go:latest
    resources:
      requests:
        memory: "20Mi"
        cpu: "20m"
      limits:
        memory: "100Mi"
        cpu: "50m"
But it keeps taking 120m. Why is the "limits" property being ignored? Everything else is working correctly. If I request 200m, 200m are reserved, but the limit keeps being ignored.
My Kubernetes version is 1.10.7-gke.1
I only have the default namespace and when executing
kubectl describe namespace default
Name:         default
Labels:       <none>
Annotations:  <none>
Status:       Active

No resource quota.

Resource Limits
 Type       Resource  Min  Max  Default Request  Default Limit  Max Limit/Request Ratio
 ----       --------  ---  ---  ---------------  -------------  -----------------------
 Container  cpu       -    -    100m             -              -
Considering Resources Request Only
The Google Cloud console is reporting correctly; you probably have multiple containers in your pod, which is why. The value shown above is the sum of the resource requests declared in your (truncated) YAML file. You can verify this easily with kubectl.
First, verify the number of containers in your pod.
kubectl describe pod service-85cc4df46d-t6wc9
Then look at the description of the node via kubectl; you should see the same information as the console shows.
kubectl describe node gke-default-pool-abcdefgh...
What is the difference between a resource request and a limit?
You can imagine your cluster as a big square box. This is the total of your allocatable resources. When you drop a Pod into the big box, Kubernetes checks whether there is an empty space for the requested resources of the pod (does the small box fit in the big box?). If there is enough space available, it will schedule your workload on the selected node.
Resource limits are not taken into account by the scheduler. Everything is done at the kernel level with cgroups. The goal is to prevent workloads from taking all the CPU or memory on the node they are scheduled on.
If your resource requests == resource limits, then workloads cannot escape their "box" and are not able to use available CPU/memory next to them. In other words, the resources are guaranteed for the pod.
But if the limits are greater than your requests, this is called overcommitting resources. You are betting that all the workloads on the same node are not fully loaded at the same time (which is generally the case).
I recommend not overcommitting the memory resource: do not let the pod escape its "box" in terms of memory, as it can lead to OOM killing.
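As an illustration of the requests == limits case (the values are arbitrary), the container below can never use more memory than it asked for, so neighbours cannot push it into an OOM kill and it cannot starve them:
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 100m        # equal to the request: no CPU overcommit for this container
    memory: 256Mi    # equal to the request: no memory overcommit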
You can try logging into the node running your pod and run:
ps -Af | grep docker
You'll see the full command line that the kubelet sends to Docker. The memory limit shows up as something like --memory. Note that the request value for memory is only used by the Kubernetes scheduler, to determine whether the total requested by all pods/containers running on a node has been exceeded.
The CPU request shows up as the --cpu-shares flag. In this case it is not a hard limit, but again a way for the Kubernetes scheduler to avoid allocating containers/pods past that limit when running multiple containers/pods on a specific node. You can learn more about cpu-shares here and from the Kubernetes side here. So in essence, if you don't have enough workloads on the node, a container can always go over its CPU share if it needs to, and that's probably what you are seeing.
Docker has other ways of restricting CPU, such as cpu-period/cpu-quota and cpuset-cpus, but these were not used by Kubernetes as of this writing. I believe Mesos handles CPU/memory reservations and quotas somewhat better.
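For reference, this is roughly what those Docker flags look like in a hand-written docker run command (an illustration of the flags mentioned above only, not what the kubelet actually passes; values are arbitrary):
# --memory                     hard memory limit
# --cpu-shares                 relative CPU weight (a 300m request maps to roughly 300 * 1024 / 1000 ≈ 307 shares)
# --cpu-period / --cpu-quota   CFS quota, here 50% of one CPU (not used by Kubernetes at the time of that answer)
$ docker run -d --memory=512m --cpu-shares=307 --cpu-period=100000 --cpu-quota=50000 nginx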
Hope it helps.

Node not ready, pods pending

I am running a cluster on GKE and sometimes I get into a hanging state. Right now I was working with just two nodes and allowed the cluster to autoscale. One of the nodes has a NotReady status and simply stays in it. Because of that, half of my pods are Pending, because of insufficient CPU.
How I got there
I deployed a pod which has quite high CPU usage from the moment it starts. When I scaled it to 2, I noticed CPU usage was at 1.0; the moment I scaled the Deployment to 3 replicas, I expected to have the third one in Pending state until the cluster adds another node, then schedule it there.
What happened instead is the node switched to a NotReady status and all pods that were on it are now Pending.
However, the node does not restart or anything - it is just not used by Kubernetes. The GKE then thinks that there are enough resources as the VM has 0 CPU usage and won't scale up to 3.
I cannot manually SSH into the instance from console - it is stuck in the loading loop.
I can manually delete the instance and then it starts working - but I don't think that's the idea of fully managed.
One thing I noticed - not sure if related: in GCE console, when I look at VM instances, the Ready node is being used by the instance group and the load balancer (which is the service around an nginx entry point), but the NotReady node is only in use by the instance group - not the load balancer.
Furthermore, in kubectl get events, there was a line:
Warning CreatingLoadBalancerFailed {service-controller } Error creating load balancer (will retry): Failed to create load balancer for service default/proxy-service: failed to ensure static IP 104.199.xx.xx: error creating gce static IP address: googleapi: Error 400: Invalid value for field 'resource.address': '104.199.xx.xx'. Specified IP address is already reserved., invalid
I specified loadBalancerIP: 104.199.xx.xx in the definition of the proxy-service to make sure that on each restart the service gets the same (reserved) static IP.
Any ideas on how to prevent this from happening? So that if a node gets stuck in NotReady state it at least restarts - but ideally doesn't get into such state to begin with?
Thanks.
The first thing I would do is to define Resources and Limits for those pods.
Resources tell the cluster how much memory and CPU you think that the pod is going to use. You do this to help the scheduler to find the best location to run those pods.
Limits are crucial here: they are set to prevent your pods damaging the stability of the nodes. It's better to have a pod killed by an OOM than a pod bringing a node down because of resource starvation.
For example, in this case you're saying that you want 200m CPU (20%) for your pod, but if by any chance it goes above 300m (30%), you want it to be killed and a new one started.
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    resources:
      limits:
        cpu: 300m
        memory: 200Mi
      requests:
        cpu: 200m
        memory: 100Mi
You can read more here: http://kubernetes.io/docs/admin/limitrange/
I can speak for AWS: you can create dynamic scaling policies based on CPU and memory utilization.
A node goes into the NotReady state because it is out of memory, or perhaps short on CPU. You can create a custom memory metric that collects memory usage from all the worker nodes in the cluster and pushes it to CloudWatch.
You can follow this documentation- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html
A CPU metric already exists, so there is no need to create one; only the memory metric needs to be created for your cluster.
You can then create an alarm that fires when it goes above a certain threshold. Next, go to the Auto Scaling Group through the AWS console and add a scaling policy for your Auto Scaling Group, selecting the alarm you created and the number of instances to add accordingly.
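A rough AWS CLI sketch of those last two steps (all names, the metric namespace and the threshold are assumptions based on the monitoring-scripts setup linked above):
# scale-out policy attached to the worker Auto Scaling Group
$ aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-eks-workers \
    --policy-name scale-out-on-memory \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment 1

# alarm on the custom memory metric published by the monitoring scripts
$ aws cloudwatch put-metric-alarm \
    --alarm-name workers-memory-high \
    --namespace System/Linux \
    --metric-name MemoryUtilization \
    --statistic Average --period 300 --evaluation-periods 2 \
    --threshold 80 --comparison-operator GreaterThanThreshold \
    --alarm-actions <arn-returned-by-put-scaling-policy>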