I'm investigating a scaling issue that we suspect is related to memory, but after running a load test on a local machine the application doesn't seem to have a memory leak.
We are hosting the .NET Core application in Kubernetes, with a memory request of 800Mi and no memory limit.
As described in this article:
The trigger for garbage collection occurs when the system has low physical memory and gets a notification from the OS.
So does that mean the GC is unlikely to kick in until my nodes are low on memory if we do not set a memory limit, and that the application will eventually occupy most of the memory on the node?
Yes, that's exactly what can happen, both with .NET and with other pods.
Always set memory and CPU limits, since a container without them may have an impact on other pods, or see Configure Default Memory Requests and Limits for a Namespace to set namespace-wide defaults.
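As a rough illustration of the advice above, the sketch below sets a request and a limit on one container and adds a namespace-wide default through a LimitRange. Only the 800Mi request comes from the question; the names, image, namespace, CPU values, and the 1Gi limit are assumptions to be tuned for the real workload.

```yaml
# Sketch only: names, image, namespace, and limit values are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dotnet-app                 # hypothetical name
  namespace: my-namespace          # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dotnet-app
  template:
    metadata:
      labels:
        app: dotnet-app
    spec:
      containers:
      - name: app
        image: registry.example.com/dotnet-app:latest   # placeholder image
        resources:
          requests:
            memory: 800Mi          # the request mentioned in the question
            cpu: 500m              # assumed value
          limits:
            memory: 1Gi            # assumed value; the container is OOM-killed above this
            cpu: "1"               # assumed value; CPU is throttled above this
---
# Defaults applied to containers in the namespace that do not set their own values.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-mem-limits         # hypothetical name
  namespace: my-namespace
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 800Mi
    default:
      memory: 1Gi
```

With a limit in place, recent .NET runtimes (Core 3.0 and later, as far as I know) size the GC heap against the container limit rather than against the node's physical memory, so collections should be triggered well before the node itself runs low.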
@Martin is right, but I would like to provide some more insight on this topic.
Kubernetes best practices: Resource requests and limits is a very good guide explaining the idea behind these mechanisms with a detailed explanation and examples.
Also, Managing Resources for Containers will provide you with the official docs regarding:
- Requests and limits
- Resource types
- Resource requests and limits of Pod and Container
- Resource units in Kubernetes
- How Pods with resource requests are scheduled
- How Pods with resource limits are run, etc.
Bear in mind that it is very important to have a good strategy for calculating how many resources each container needs.
Optimally, your pods should be using exactly the amount of resources you requested but that's almost impossible to achieve. If the usage is lower than your request, you are wasting resources. If it's higher, you are risking performance issues. Consider a 25% margin up and down the request value as a good starting point. Regarding limits, achieving a good setting would depend on trying and adjusting. There is no optimal value that would fit everyone as it depends on many factors related to the application itself, the demand model, the tolerance to errors etc.
Finally, you can use metrics-server to get the CPU and memory usage of the pods, for example with kubectl top pods.
We have a deployment stack with about 20 microservices/pods. Each deployment goes into its own namespace. To make sure that CPU and memory are guaranteed for each pod and not shared, we set the request amounts equal to the limit amounts. Now we sometimes need to deploy more stacks into the same performance cluster, e.g. to test different releases of the same stack. The question is whether having more than one deployment in one cluster can invalidate the test results because of the shared network or for some other reason.
Initially we were thinking of creating one cluster for each performance test to make sure it is isolated and the test results are correct, but creating and maintaining a new cluster is very costly. We also thought about placing each deployment on its own node so that load testing one stack does not impact the others, but I'm not sure whether that really helps. Please share your knowledge on this, as Kubernetes is fairly new to us.
If the containers are running on the same underlying hosts then bleedthrough is always possible. If you set all pods into Guaranteed QoS mode (aka requests == limits) then it at least reduces the bleedthrough to a minimum. Running things on one cluster is always fine but if you want to truly reduce the crosstalk to zero then you would need dedicated workload nodes for each.
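A minimal sketch of both ideas in this answer: a pod in Guaranteed QoS (every container has CPU and memory requests equal to its limits) that is pinned to nodes reserved for one stack. The label and taint key `perf-stack`, the names, the image, and the resource values are all hypothetical; the nodes would have to be labelled and tainted accordingly beforehand.

```yaml
# Sketch only: names, label/taint keys, image, and sizes are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: service-a
  namespace: stack-a
spec:
  # Schedule only onto nodes labelled for this stack.
  nodeSelector:
    perf-stack: stack-a
  # Tolerate the taint that keeps other stacks off those nodes.
  tolerations:
  - key: perf-stack
    operator: Equal
    value: stack-a
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/service-a:1.0   # placeholder image
    resources:
      # requests == limits for both CPU and memory gives Guaranteed QoS
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```

Even with dedicated nodes, the stacks still share the cluster network and the control plane, so this reduces crosstalk rather than proving that it is zero.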
This is a purely theoretical question. A standard Kubernetes cluster is given with autoscaling in place. If memory goes above a certain targetMemUtilizationPercentage, then a new pod is started and it takes on part of the flow of requests coming to the contained service. The number of minReplicas is set to 1 and the number of maxReplicas is set to 5.
What happens when the number of pods that are online reaches the maximum (5 in our case) and requests from clients are still coming towards the node? Are these requests buffered somewhere or are they discarded? Can I take any actions to avoid request loss?
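For reference, the setup described in the question could be expressed roughly as the HorizontalPodAutoscaler below. The replica bounds come from the question; the target name and the 70% threshold are assumptions, and the `autoscaling/v2` API is assumed to be available in the cluster.

```yaml
# Sketch only: the Deployment name and the utilization threshold are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service               # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70     # percent of the pods' memory *request*
```

Utilization is measured against the containers' memory requests, so the target pods need requests set. Once 5 replicas are running, the autoscaler does nothing further; additional traffic is simply spread over those 5 pods.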
Kubernetes does not natively support message queue buffering. Depending on the scenario and the setup you use, your requests will most likely time out. To manage them efficiently you'll need a custom resource running inside the Kubernetes cluster.
In such situations it is very common to use a message broker, which ensures that communication between microservices is reliable and stable, that messages are managed and monitored within the system, and that messages don't get lost.
RabbitMQ, Kafka and Redis appear to be the most popular, but choosing the right one will depend heavily on your requirements and the features you need.
It is worth noting that, since Kubernetes essentially runs on Linux, Linux itself also manages and limits the requests coming in on a socket. You may want to read more about it here.
Another thing is that if you have pod limits set, or resources are lacking, pods will most likely be restarted or the cluster will become unstable. You can usually prevent this by configuring some kind of "circuit breaker" that limits the number of requests that can reach the backend, so that it is not overloaded. If the number of requests goes beyond the circuit breaker threshold, the excess requests are dropped.
It is better to drop some requests than to have a cascading failure.
I managed to test this scenario, and the requests that did not get processed received 503 Service Unavailable and 403 Forbidden responses.
Knative Serving actually does exactly this: https://github.com/knative/serving/
It buffers requests and informs autoscaling decisions based on in-flight request counts. It can also enforce a per-Pod maximum of in-flight requests and hold on to requests until newly scaled-up Pods come up, at which point Knative proxies the requests to them. It does this through a sidecar container named queue-proxy that is attached to its workload type called "Service".
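A minimal sketch of such a Knative Service, assuming Knative Serving is installed in the cluster. The name, image, and numbers are hypothetical; `containerConcurrency` is the per-Pod cap on in-flight requests that the queue-proxy sidecar enforces.

```yaml
# Sketch only: name, image, and numeric values are assumptions.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service
spec:
  template:
    metadata:
      annotations:
        # Upper bound on replicas; older Knative releases spell this key "maxScale".
        autoscaling.knative.dev/max-scale: "5"
    spec:
      # Hard limit of concurrent requests per Pod; excess requests are queued.
      containerConcurrency: 10
      containers:
      - image: registry.example.com/my-service:latest   # placeholder image
```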
How can I request kubernetes to allocate all of the available huge-pages on a node to my app pod?
My application uses huge-pages. Before deploying to kubernetes, I configured the kubernetes nodes for huge-pages, and I know that by specifying the limits & requests as mentioned in the k8s docs, my app pod can make use of huge-pages. But this tightly couples the node specs to my app configuration. If my app were to run on different nodes with different amounts of huge-pages, I would have to keep overriding these values based on the target environment.
resources:
  limits:
    hugepages-2Mi: 100Mi
But then as per k8s doc, "Huge page requests must equal the limits. This is the default if limits are specified, but requests are not."
Is there a way I can somehow request k8s to allocate all available huge-pages to my app pod OR just keep it dynamic like in case of an unspecified memory or cpu requests/limits?
From the design proposal, it looks like the huge pages resource request is fixed and not designed for scenarios where there can be multiple sizes:
We believe it is rare for applications to attempt to use multiple huge page sizes.
Although you're not trying to use multiple sizes but rather to modify the values dynamically (once the pod is deployed), the sizes must be consistent; they are used for pod pre-allocation and to determine how the scheduler treats reserved resources.
This means that it is mostly used to assess resources in a fixed way, expecting a somewhat uniform scenario (where node-level page sizes were previously set).
Looks like you're going to have to roll out different pod specs depending on your node settings. For that, some "traditional" tainting of the nodes might help identify specific resources in a heterogeneous cluster.
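One such per-node-type pod spec might look like the sketch below. The huge page amount is the fixed value from the question; the label and taint key `hugepages-profile`, the names, the image, and the memory values are hypothetical and would need one variant per node flavour.

```yaml
# Sketch only: label/taint keys, names, image, and memory sizes are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-app
spec:
  # Target only the nodes pre-configured with this huge page profile.
  nodeSelector:
    hugepages-profile: small
  tolerations:
  - key: hugepages-profile
    operator: Equal
    value: small
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/hugepages-app:latest   # placeholder image
    volumeMounts:
    - name: hugepage
      mountPath: /hugepages
    resources:
      limits:
        hugepages-2Mi: 100Mi       # requests default to the same value
        memory: 100Mi              # assumed value
      requests:
        memory: 100Mi              # assumed value
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```

A templating layer (Helm values or Kustomize overlays, for instance) could then hold only the `hugepages-2Mi` figure and the profile label per environment, which keeps the overrides small even though they cannot be made fully dynamic.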
We have a small problem with our Kubernetes cluster.
One of our applications is so demanding that it sometimes consumes all of our resources, and eventually some of the pods are killed. The real problem starts when system pods like flannel or cache are removed.
Is there a recommended way to control what is being removed? How can we "save" system pods? Maybe someone has experience with this topic?
One of the ideas is to change QoS for all pods/apps from the kube-system to "Guaranteed". But I'm afraid that this will not work well if we limit resources, even with a large margin.
By the way, where can I read about the (default) requirements of the system services? How can they be set at cluster creation time?
The second idea is to set an Eviction Policy and/or Taints and Tolerations, but we are worried that our key application would be one of the first to be (re)moved. Unfortunately it currently runs as a single copy and its initialization can take up to several minutes, so moving it between nodes is currently unacceptable.
The final idea is to use Priority and Preemption, but from what I see in the 1.8.1 documentation it is still in the "alpha" phase, and I have serious concerns about the stability of this solution.
Maybe there is something else I did not think about? I will be happy to hear other proposals.
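For reference, the Priority and Preemption idea would look roughly like the sketch below on later Kubernetes releases, where the API is stable as `scheduling.k8s.io/v1` (it is not usable as written on 1.8). The class name, value, pod name, and image are hypothetical.

```yaml
# Sketch only: assumes a cluster where scheduling.k8s.io/v1 is available.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: key-application            # hypothetical name
value: 100000                      # higher value = evicted and preempted later
globalDefault: false
description: "Single-replica application that is expensive to restart."
---
apiVersion: v1
kind: Pod
metadata:
  name: key-app
spec:
  priorityClassName: key-application
  containers:
  - name: app
    image: registry.example.com/key-app:latest   # placeholder image
    resources:
      # requests == limits also puts the pod in Guaranteed QoS
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi
```

On those releases, system components can use the built-in `system-cluster-critical` and `system-node-critical` classes, which rank above any user-defined class.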
I have a Google Cloud Load Balancer-backed ingress in my Google Kubernetes Engine cluster. I have an autoscaler set up to scale the number of replicas of my deployment based on CPU usage. Let's say I have set the CPU threshold to 50%.
When there is a burst of requests, the CPU usage goes to 100%. The autoscaler takes a few minutes to realize the high load, create more pods, create new nodes if necessary, and pass health checks. During this scaling period, some or the majority of requests fail with the 502 error due to timeouts. I would rather return a 503 error code immediately if the server is under heavy load instead of returning a 502 error code after the 30 second timeout.
Is it possible to have the load balancer direct traffic to the pods with the lowest CPU usage? Is it possible to return a 503 error code if none of the pods have a CPU usage below a certain threshold, say 80%?
What is standard practice for handling a large burst of traffic, and how should I go about resolving this issue in Kubernetes?
The first problem you describe (serving 503) is called "load shedding". Normally it is the application's responsibility to say: "oops, I'm overloaded, 503, slow down". If you move this responsibility to the client, it might be too slow to react to give you any reasonable protection, because its data will always lag behind. From a system reliability point of view, it's better to keep this logic in the server application.
The second problem is CPU-aware load balancing. One possible approach to this problem is called weighted round-robin - it's like regular round-robin, but it prefers less loaded nodes. If you install Istio in Kubernetes, you can select from a list of load balancing policies. One of them is weighted least request - it relies on the number of requests in flight, not directly on CPU, but if all your requests have about the same CPU cost, it might be a good proxy for CPU load.
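A minimal sketch of selecting the least-request policy through an Istio DestinationRule, assuming Istio is installed and the workload is in the mesh. The rule name and host are hypothetical.

```yaml
# Sketch only: name and host are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-lb
spec:
  host: my-service.default.svc.cluster.local   # hypothetical service
  trafficPolicy:
    loadBalancer:
      # Prefer endpoints with fewer requests in flight.
      # Older Istio releases name this policy LEAST_CONN.
      simple: LEAST_REQUEST
```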
Another possible solution is to use the Istio circuit breaker. You can configure how many concurrent requests may be made to your services, or you can use outlier detection, which detects failing instances of your service and stops sending traffic to them, which can improve your UX.
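A sketch of such a circuit breaker, again assuming Istio is installed. The host and every threshold are hypothetical and would need tuning under load; in practice these settings live in the same DestinationRule as any other traffic policy for that host.

```yaml
# Sketch only: host and all thresholds are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-circuit-breaker
spec:
  host: my-service.default.svc.cluster.local   # hypothetical service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100            # cap on connections to the service
      http:
        http1MaxPendingRequests: 50    # requests allowed to queue for a connection
        http2MaxRequests: 200          # cap on concurrent requests
    outlierDetection:
      consecutive5xxErrors: 5          # eject an endpoint after 5 consecutive 5xx
      interval: 10s                    # how often endpoints are evaluated
      baseEjectionTime: 30s            # minimum time an endpoint stays ejected
      maxEjectionPercent: 50           # never eject more than half the endpoints
```

When the connection pool limits are exceeded, the sidecar proxy rejects the excess requests with a 503 right away instead of letting them wait for a timeout, which is close to the fail-fast behaviour asked for in the question.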