Recently we've experienced issues with both our Non-Production and Production clusters where the nodes hit the 'System OOM encountered' condition.
The pods in the Non-Production cluster don't seem to be spread across the nodes; it looks like a single node is running all the pods and taking the whole load.
Also, the Pods are stuck in this status: 'Waiting: ContainerCreating'.
Any help/guidance with the above issues would be greatly appreciated. We are building more and more services in this cluster and want to make sure there are no instability or environment issues, and to put proper checks and configuration in place before we go live.
"I would recommend you manage container compute resources properly within your Kubernetes cluster. When creating a Pod, you can optionally specify how much CPU and memory (RAM) each Container needs to avoid OOM situations.
When Containers have resource requests specified, the scheduler can make better decisions about which nodes to place Pods on. And when Containers have their limits specified, contention for resources on a node can be handled in a specified manner. CPU specifications are in units of cores, and memory is specified in units of bytes.
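As a minimal sketch of this (the pod name, image, and values below are illustrative placeholders, not recommendations), requests and limits are set under resources in each container spec:

apiVersion: v1
kind: Pod
metadata:
  name: demo-app               # placeholder name
spec:
  containers:
  - name: demo-app
    image: nginx:1.21          # placeholder image
    resources:
      requests:
        cpu: "250m"            # quarter of a core; used by the scheduler for placement
        memory: "256Mi"        # amount the scheduler reserves on the node
      limits:
        cpu: "500m"            # CPU is throttled above this
        memory: "512Mi"        # the container is OOM-killed above this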
An event is produced each time the scheduler fails to place a pod; use the command below to see the events:
$ kubectl describe pod <pod-name> | grep Events
Also, read the official Kubernetes guide on “Configure Out Of Resource Handling”. Always make sure to:
reserve 10-20% of memory capacity for system daemons like kubelet and OS kernel
identify pods which can be evicted at 90-95% memory utilization to reduce thrashing and incidence of system OOM.
To facilitate this kind of setup, the kubelet would be launched with options like the ones below:
--eviction-hard=memory.available<xMi
--system-reserved=memory=yGi
Replacing x and y with actual memory values.
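If the kubelet is driven by a configuration file instead of flags, the same two settings can be expressed roughly as follows (the 500Mi and 1Gi values are placeholders for x and y, not recommendations):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"   # placeholder for x; evict pods when free memory drops below this
systemReserved:
  memory: "1Gi"               # placeholder for y; memory held back for the OS and system daemons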
Having Heapster container monitoring in place should be helpful for visualization.
Further reading: Kubernetes and Docker Administration
Unable to mount volumes for pod
"xxx-3615518044-6l1cf_xxx-qa(8a5d9893-230b-11e8-a943-000d3a35d8f4)":
timeout expired waiting for volumes to attach/mount for pod
"xxx-service-3615518044-6l1cf"/"xxx-qa"
That indicates your pod is having trouble mounting the volume specified in your configuration. This can often be a permissions issue. If you post your config files (like to a gist) with private info removed, we could probably be more helpful.
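For reference, the pieces worth checking (and posting) are usually the PersistentVolumeClaim and the volume section of the Pod or Deployment. A minimal sketch, with placeholder names, sizes, and paths, looks like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xxx-data                 # placeholder claim name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi               # placeholder size
---
apiVersion: v1
kind: Pod
metadata:
  name: xxx-service              # placeholder pod name
spec:
  containers:
  - name: xxx-service
    image: example/xxx:latest    # placeholder image
    volumeMounts:
    - name: data
      mountPath: /var/lib/xxx    # placeholder mount path
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: xxx-data        # must match an existing, bound PVC

In particular, a claimName that does not point at an existing, bound PVC is one common cause of this attach/mount timeout.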
As far as I understand from the VPA documentation, the Vertical Pod Autoscaler stops/restarts a pod based on the predicted request/limit lower/upper bounds and target.
In "Auto" mode it says the pod will be stopped and restarted. However, I don't see the point of making a prediction and restarting the pod while it is still working: although we know it might run out of resources eventually, it is still working, and we could wait to rescale it once it has actually run out of memory/CPU. Isn't it more efficient to just wait for the pod to run out of memory/CPU and then restart it with the new predicted request?
Is recovering from a dead container more costly than stopping and restarting the pod ourselves? If yes, in what ways?
Isn't it more efficient to just wait for the pod to go out of
memory/cpu and then restart it with the new predicted request?
In my opinion this is not the best solution. If a pod tries to use more CPU than its limit, the container's CPU use is throttled; if a container tries to use more memory than its limit, Kubernetes OOM-kills the container. Because limits can be overcommitted, the sum of pod limits can be higher than the node's capacity, so this can lead to memory exhaustion on the node and can cause the death of other workloads/pods.
Answering your question - VPA was designed to simplify those scenarios:
Vertical Pod Autoscaler (VPA) frees the users from necessity of
setting up-to-date resource limits and requests for the containers in
their pods. When configured, it will set the requests automatically
based on usage and thus allow proper scheduling onto nodes so that
appropriate resource amount is available for each pod. It will also
maintain ratios between limits and requests that were specified in
initial containers configuration.
In addition, VPA is not only responsible for scaling up but also for scaling down:
it can both down-scale pods that are over-requesting resources, and also up-scale pods that are under-requesting resources based on their usage over time.
Is recovering from a dead container more costly than stopping and
restarting the pod ourselves? If yes, in what ways?
Talking about the cost of recovering from a dead container - the main possible cost might be requests that get lost during the OOM-killing process, as per the official doc.
As per the official documentation, VPA operates in these modes (a minimal example manifest follows the list):
"Auto": VPA assigns resource requests on pod creation as well as
updates them on existing pods using the preferred update mechanism.
Currently this is equivalent to "Recreate".
"Recreate": VPA assigns resource requests on pod creation as well as
updates them on existing pods by evicting them when the requested
resources differ significantly from the new recommendation (respecting
the Pod Disruption Budget, if defined).
"Initial": VPA only assigns resource requests on pod creation and
never changes them later.
"Off": VPA does not automatically change resource requirements of the
pods.
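For context, a minimal VPA object selecting one of these modes might look like the sketch below; the names are placeholders and the API group assumes the upstream autoscaler CRDs are installed:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa            # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # placeholder Deployment whose pods VPA should size
  updatePolicy:
    updateMode: "Auto"        # one of "Auto", "Recreate", "Initial", "Off"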
NOTE:
VPA Limitations
A VPA recommendation might exceed available resources, such as your cluster capacity or your team's quota. Insufficient available resources may cause pods to go pending.
VPA in Auto or Recreate mode won’t evict pods with one replica as this would cause disruption.
Quick memory growth might cause the container to be OOM-killed. As OOM-killed pods aren't rescheduled, VPA won't apply the new resource recommendation.
Please also take a look at some of the VPA Known limitations:
Updating running pods is an experimental feature of VPA. Whenever VPA updates the pod resources the pod is recreated, which causes all
running containers to be restarted. The pod may be recreated on a
different node.
VPA does not evict pods which are not run under a controller. For such pods Auto mode is currently equivalent to Initial.
VPA reacts to most out-of-memory events, but not in all situations.
Additional resources:
VERTICAL POD AUTOSCALING: THE DEFINITIVE GUIDE
My service running in a pod outputs too many logs and causes low ephemeral storage. As a result, the pod is evicted and other services can't be deployed to k8s.
So how can I determine the pod's ephemeral-storage requests and limits to avoid this situation? I can't find any best practices about ephemeral storage.
Note that by default, if you have not set any limits on ephemeral-storage the pod has access to the entire disk of the node it is running on, so if you are certain that the pod is being evicted because of this, then you are certain that the pod consumed it all. You can check this from kubelet logs, as kubelet is the guy in charge of detecting this behavior and evicting the pod.
Now, here you have two options. Either you can set an ephemeral-storage limit, and make a controlled pod eviction, or just get an external volume, map it into the container, and get the logs outside of the node.
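A rough sketch of the first option (the pod name, image, and sizes are placeholders) is to request and cap ephemeral-storage in the container spec so the eviction stays controlled:

apiVersion: v1
kind: Pod
metadata:
  name: chatty-logger              # placeholder name
spec:
  containers:
  - name: app
    image: example/app:latest      # placeholder image
    resources:
      requests:
        ephemeral-storage: "1Gi"   # node-local disk the scheduler accounts for
      limits:
        ephemeral-storage: "2Gi"   # the kubelet evicts the pod if usage exceeds this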
You can also monitor the disk usage, as shubham_asati suggested, but if it is eating it all, it is eating it all. You are just going to watch how it gets filled up.
I guess ephemeral storage for a pod can be defined just like a CPU request/limit.
See https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#local-ephemeral-storage, but note that this feature is in beta as of Kubernetes 1.16.
To check namespace-level resource consumption, see https://kubernetes.io/docs/concepts/policy/resource-quotas/#storage-resource-quota.
You can set an ephemeral-storage request/limit for each pod.
Regarding your issue:
check namespace quotas for ephemeral storage using kubectl describe namespace
try du -sh / inside the container
then compare the storage usage from the two outputs.
You need to deploy Prometheus and Grafana to find out how much memory and CPU are being consumed by the pod, and then set the requests and limits on that pod accordingly.
Setting requests and limits for ephemeral storage is a newer feature and is still in beta. You might have to wait a few more months to use it.
However, if you are on k8s 1.18, you can already test setting requests and limits for ephemeral storage.
As a learner of Kubernetes concepts, how they work, and how to deploy with them, I have a couple of cases which I don't know how to achieve. I am looking for advice or some guidelines on how to achieve them.
I am using the Google Cloud Platform. The current flow is described below. A push to the Google Source Repository triggers Cloud Build, which creates a Docker image and pushes the image to the running cluster nodes.
Case 1: I want traffic to be routed to new pods once they are up and running, and the old pods to be killed only after each one has completed its in-flight requests. Zero downtime is what I'm looking to achieve.
Case 2: What will happen if the disk space of a running pod reaches 100%, or (in the Debian case) the inode count reaches full capacity? Will Kubernetes create new pods to handle this?
Case 3: How to manage pod to database connection limits?
Like the other answer says, use liveness and readiness probes. Basically, a new pod is added to the Service pool and only serves traffic after its readiness probe has passed. The old pod is removed from the Service pool, then drained, and then terminated. This happens in a rolling fashion, one pod at a time.
This really depends on the capacity of your cluster and the ability to schedule pods given the limits of the containers in them. For more about setting up limits for containers, refer to here. In terms of the inode limit, if you reach it on a node, the kubelet won't be able to run any more pods on that node. The kubelet eviction manager also has a mechanism whereby it evicts the pods using the most inodes. You can also configure your eviction thresholds on the kubelet.
This would be more of a limitation at the OS level combined with your stateful application's configuration. You can keep this configuration in a ConfigMap; for example, for something like MySQL the option would be max_connections.
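A hedged sketch of that idea for MySQL (the ConfigMap name and the value are placeholders; mount it over the server's conf.d directory so mysqld picks it up):

apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-config               # placeholder name
data:
  connections.cnf: |
    [mysqld]
    max_connections = 200          # placeholder; size this to what your pods can realistically open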
I can answer case 1 since I've done it myself.
Use Deployments with readinessProbes & livenessProbes, as in the sketch below.
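A minimal sketch of that setup (the name, image, port, and probe path are placeholders) combines a RollingUpdate strategy with both probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                              # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                  # never drop below the desired replica count
      maxSurge: 1                        # bring up one extra pod at a time
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 30  # time for in-flight requests to finish before an old pod dies
      containers:
      - name: web
        image: gcr.io/my-project/web:latest   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:                  # the pod only receives traffic once this passes
          httpGet:
            path: /healthz               # placeholder path
            port: 8080
        livenessProbe:                   # the container is restarted if this keeps failing
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10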
I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction, rescheduling keep happening over and over? If so, is there some metric that I can use to detect that this is happening?
A pod will be scheduled as long as there is an eligible node that can satisfy the requested resources, so if you do not specify a request the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, while a limit applies to a pod that is already running.
If you overcommit the actual resources on a node you will run into typical issues: if you overcommit on memory it will start to swap, and with CPU there will just be a general slowdown. Either way the node and the pods on it will become unresponsive. That is difficult to deal with, and tools like requests and limits set up sane boundaries that keep you from taking things quite this far; instead you will simply see the pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So the answers to your questions:
At most 10 pods will be scheduled into your node.
If there is no free memory on the node, the evicted pods will stay pending. Also, k8s can simply evict a pod that exceeds its limits when resources are needed for other pods and services.
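The "at most 10" figure follows from how defaulting works: when a container sets a memory limit but no request, Kubernetes copies the limit and uses it as the request, so each pod effectively requests 100MB against the 1GB node. A minimal illustration with placeholder names:

apiVersion: v1
kind: Pod
metadata:
  name: bursty-pod              # placeholder name
spec:
  containers:
  - name: app
    image: example/app:latest   # placeholder image
    resources:
      limits:
        memory: "100Mi"         # no request is given, so the request defaults to 100Mi as well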
pod will not start due to "No nodes are available that match all of the following predicates:: Insufficient cpu"
In the above question, I had an issue starting a deployment with 3 containers.
Upon further investigation, it appears there is only 27% of the CPU quota available - which seems very low. The rest of the CPU seems to be assigned to some default bundled containers.
How is this normally mitigated? Is a larger node required? Do limits need to be set manually? Are all those additional containers necessary?
1 cpu for a single node cluster is probably too small.
From the containers in the original answer, both the dashboard and fluentd can be removed:
the dashboard is just a web UI, which can go away if you use kubectl (which you should, IMO);
fluentd should be reading the log files on disk to ship them somewhere (GCP's log aggregation, I think).
The unnecessary containers should be tied to a Deployment or ReplicaSet, which can be listed with kubectl get deployment and kubectl get rs, respectively. You can then kubectl delete them.
Increasing the resources on the node should not change the requirements of those basic pods, meaning the additional capacity should be free for scheduling your own workloads.