I have a question based on my experience trying to implement memory requests/limits correctly in an OpenShift OKD cluster. I started by setting no request, then watching what the cluster metrics reported for memory use, then setting something close to that as the request. I ended up with nodes under high memory pressure, thrashing, and OOM kills. I have found I need to set the requests to something closer to the VIRT size in ‘top’ (which includes the program binary size) to keep performance up. Does this make sense? I'm confused by the asymmetry between the request (and apparent need) and the use reported in metrics.
You always need to leave a bit of memory headroom for overhead and memory spills. If the container exceeds its memory limit for any reason, whether from your application, from your binary, or from a garbage collection system, it will get killed. For example, this is common in Java apps, where you specify a heap size and need extra overhead for the garbage collector and other things such as:
Native JRE
Perm / metaspace
JIT bytecode
JNI
NIO
Threads
This blog explains some of them.
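To make the headroom idea concrete, here is a minimal sketch that adds up the heap and a few of the non-heap areas listed above and then applies a safety margin. The function name and every default figure (metaspace, code cache, thread count, stack size, direct buffers, 10% headroom) are assumptions for illustration, not measured values; replace them with numbers from your own JVM.

```python
# Sketch: estimate a container memory limit for a JVM process.
# All sizes are in MiB. The default figures are illustrative assumptions.

def jvm_container_limit_mib(heap_mib, metaspace_mib=128, code_cache_mib=64,
                            thread_count=50, thread_stack_mib=1,
                            direct_buffers_mib=32, headroom_fraction=0.10):
    """Sum the heap and the main non-heap areas, then add headroom."""
    non_heap = (metaspace_mib            # Perm / metaspace
                + code_cache_mib         # JIT-compiled code
                + thread_count * thread_stack_mib  # thread stacks
                + direct_buffers_mib)    # NIO direct buffers
    total = heap_mib + non_heap
    return int(round(total * (1 + headroom_fraction)))


if __name__ == "__main__":
    # A 512 MiB heap needs noticeably more than 512 MiB of container memory.
    print(jvm_container_limit_mib(heap_mib=512))
```

The point is only that the limit has to cover more than the heap; how much more depends on the application.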
What is the correct way of memory handling in OpenShift/Kubernetes?
If I create a project in OKD, how can I determine the optimal memory usage of pods? For example, say I use 1 deployment for 1-2 pods and each pod uses 300-500 MB of RAM (Spring Boot apps). Technically, 20 pods use around 6-10 GB of RAM, but as I see it, a project can sometimes have around 100-150 containers, which need at least 30-50 GB of RAM.
I also tried horizontal scaling and/or requests/limits, but each microservice still uses a lot of memory.
However, a pod requires around 500-700 MB of RAM to start; after the Spring container has started, it can live with around 300 MB, as mentioned.
So, I have 2 questions:
Is it possible to give a pod extra memory, but only for the first X minutes after it starts?
If not, then what is the best practice to handle memory shortage if I have limited memory (16 GB) and want to run 35-40 pods?
Thanks for the answer in advance!
Is it possible to give a pod extra memory, but only for the first X minutes after it starts?
You get this behavior when you set the limit to a higher value than the request. This allows pods to burst above their request, as long as they do not all need the extra memory at the same time.
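As a rough illustration of "unless they all need the memory at the same time", the sketch below totals requests and limits for a set of identical pods on one node. The function name and the figures (40 pods, 300 MiB request, 700 MiB limit, 16 GiB treated as fully allocatable) are assumptions based on the numbers in the question; a real node reserves part of its memory for the system.

```python
# Sketch: how far a node is overcommitted when limit > request.
# All sizes are in MiB. The example figures are assumptions.

def node_commitment(pods, request_mib, limit_mib, node_allocatable_mib):
    total_requests = pods * request_mib
    total_limits = pods * limit_mib
    spare = node_allocatable_mib - total_requests
    burst_size = limit_mib - request_mib
    if burst_size <= 0:
        max_bursts = 0  # limit == request leaves no room to burst
    else:
        # Pods that can sit at their limit while the rest stay at their request.
        max_bursts = max(0, min(pods, spare // burst_size))
    return {
        "schedulable": total_requests <= node_allocatable_mib,
        "requests_mib": total_requests,
        "limits_mib": total_limits,
        "max_concurrent_bursts": max_bursts,
    }


if __name__ == "__main__":
    print(node_commitment(pods=40, request_mib=300, limit_mib=700,
                          node_allocatable_mib=16 * 1024))
```

With these assumed numbers all 40 pods can be scheduled, but only about a quarter of them can be at their limit at once, so staggered startups matter.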
If not, then what is the best practice to handle memory shortage if I have limited memory (16 GB) and want to run 35-40 pods?
It is common to use some form of cluster autoscaler to add more nodes to your cluster if it needs more capacity. This is easy if you run in the cloud.
In general, Java and the JVM are memory hungry; consider some other technology if you want to use less memory. How much memory an application needs or uses depends entirely on the application, e.g. which data structures it uses.
I need to monitor the memory usage of my containers running on a Kubernetes cluster. After reading some articles, I found two recommended metrics: "container_memory_rss" and "container_memory_working_set_bytes".
The definitions of both metrics (from the cAdvisor code) are:
"container_memory_rss" : The amount of anonymous and swap cache memory
"container_memory_working_set_bytes": The amount of working set memory, this includes recently accessed memory, dirty memory, and kernel memory
I think both metrics represent the number of bytes of physical memory that the process uses. But the two values differ on my Grafana dashboard.
My question is:
What is the difference between two metrics?
Which metric is more appropriate for monitoring memory usage? Some posts say both, because when either of those metrics reaches the limit, the container is OOM killed.
You are right. I will try to address your questions in more detail.
What is the difference between two metrics?
container_memory_rss equals the value of total_rss from the /sys/fs/cgroup/memory/memory.stat file:
// The amount of anonymous and swap cache memory (includes transparent
// hugepages).
// Units: Bytes.
RSS uint64 `json:"rss"`
This is the total amount of anonymous and swap cache memory (including transparent hugepages), taken from total_rss in the memory.stat file. It should not be confused with the true resident set size, i.e. the amount of physical memory used by the cgroup: rss + file_mapped gives the resident set size of the cgroup. It does not include memory that is swapped out. It does include memory from shared libraries, as long as the pages from those libraries are actually in memory, and it includes all stack and heap memory.
container_memory_working_set_bytes (as already mentioned by Olesya) is the total usage minus inactive file memory. It is an estimate of how much memory cannot be evicted:
// The amount of working set memory, this includes recently accessed memory,
// dirty memory, and kernel memory. Working set is <= "usage".
// Units: Bytes.
WorkingSet uint64 `json:"working_set"`
Working Set is the current size, in bytes, of the Working Set of this process. The Working Set is the set of memory pages touched recently by the threads in the process.
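The "total usage minus inactive file" relationship can be sketched as below. This is not cAdvisor's code, just an illustration; it assumes the cgroup v1 key name total_inactive_file and uses made-up sample values, and the helper names are hypothetical.

```python
# Sketch: derive a working-set figure from cgroup v1 style statistics.
# Key names assume cgroup v1 (memory.stat); the sample values are invented.

SAMPLE_STAT = """\
total_cache 419430400
total_rss 209715200
total_inactive_file 314572800
total_active_file 104857600
"""


def parse_memory_stat(text):
    """Parse 'key value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes, stats):
    """Usage minus inactive file pages, never below zero."""
    inactive_file = stats.get("total_inactive_file", 0)
    return max(usage_bytes - inactive_file, 0)


if __name__ == "__main__":
    stats = parse_memory_stat(SAMPLE_STAT)
    usage = 650_000_000  # stands in for memory.usage_in_bytes
    print("rss:", stats["total_rss"])
    print("working set:", working_set_bytes(usage, stats))
```

The two numbers differ because the working set still counts active file cache and kernel memory, while rss counts only anonymous and swap cache memory.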
Which metric is more appropriate for monitoring memory usage? Some
posts say both, because when either of those metrics reaches the
limit, the container is OOM killed.
If you are limiting the resource usage of your pods, then you should monitor both, as either one can lead to an OOM kill when it reaches the resource limit.
I also recommend this article which shows an example explaining the below assertion:
You might think that memory utilization is easily tracked with
container_memory_usage_bytes, however, this metric also includes
cached (think filesystem cache) items that can be evicted under memory
pressure. The better metric is container_memory_working_set_bytes as
this is what the OOM killer is watching for.
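As a small illustration of watching the working set against the limit, the sketch below flags containers whose working set is close to their memory limit. The function name, the 90% threshold, and the sample data are assumptions; in practice the values would come from your metrics backend.

```python
# Sketch: flag containers whose working set is near the memory limit.
# The threshold and the sample numbers are assumptions.

def near_limit(samples, threshold=0.9):
    """samples maps container name -> (working_set_bytes, limit_bytes)."""
    flagged = []
    for name, (working_set, limit) in sorted(samples.items()):
        if limit <= 0:
            continue  # no limit set, nothing to compare against
        ratio = working_set / limit
        if ratio >= threshold:
            flagged.append((name, round(ratio, 2)))
    return flagged


if __name__ == "__main__":
    samples = {
        "api": (950, 1000),
        "db": (400, 1000),
        "batch": (500, 0),  # container without a memory limit
    }
    print(near_limit(samples))
```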
EDIT:
Adding some additional sources as a supplement:
A Deep Dive into Kubernetes Metrics — Part 3 Container Resource Metrics
#1744
Understanding Kubernetes Memory Metrics
Memory_working_set vs Memory_rss in Kubernetes, which one you should monitor?
Managing Resources for Containers
cAdvisor code
I have a web application (a SOAP service) running on a Tomcat 8 server in OpenShift. The payload is relatively small (5-10 elements) and the traffic is also light (300 calls per day, 5-10 threads at most at a time). I'm a little confused about Pod resource restrictions. How do I come up with the minimum and maximum CPU and memory limits for each pod if I'm going to run a minimum of 1 and a maximum of 3 pods for my application?
It's tricky to configure accurate limits without a performance test.
We can't know in advance how many resources your application needs per request. A good rule of thumb is to size the limits for the heaviest workload in your environment. Hitting the memory limit can trigger the OOM killer, so set it with some margin above your Tomcat heap plus static memory size.
A CPU limit, by contrast, will not kill your pod when it is reached; it only slows processing down.
My suggested starting points for each limit are as follows.
Memory: Tomcat (Java) memory size + 30% buffer
CPU: personally I think a CPU limit works against maximizing
performance and efficiency. Even when spare CPU is available and the
pod could use it to process requests as quickly as possible, the
limit prevents that. But if you need to spread resource usage evenly
and rein in an aggressive resource consumer, you can consider a CPU
limit.
This answer might not be what you wanted, but I hope it helps with your capacity planning.
I understand the use case of setting CPU request less than limit - it allows for CPU bursts in each container if the instances has free CPU hence resulting in max CPU utilization.
However, I cannot really find the use case for doing the same with memory. Most applications don't release memory after allocating it, so in effect they will grow up to the 'limit' (which has the same effect as setting request = limit). The only exception is containers running on an instance that already has all its memory allocated. I don't really see any pros in this, and the cons are more nondeterministic behaviour that is hard to monitor (one container having higher latencies than another due to heavy GC).
The only use case I can think of is a sharded in-memory cache, where you want to allow for a spike in memory usage. But even in this case one would be running the risk of one of the nodes underperforming.
Maybe not a real answer, but a point of view on the subject.
The difference between the CPU and memory limits is what happens when the limit is reached. In the case of CPU, the container keeps running but its CPU usage is throttled. If the memory limit is reached, the container gets killed and restarted.
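To show what "keeps running but is limited" means in numbers, here is a sketch of how a CPU limit maps onto a CFS quota. It assumes the default 100 ms CFS period and a single-threaded burst of work; the function names are hypothetical.

```python
# Sketch: CPU limit -> CFS quota, and the effect on a single-threaded burst.
# Assumes the default CFS period of 100 ms (100000 microseconds).

CFS_PERIOD_US = 100_000


def cfs_quota_us(cpu_limit_millicores, period_us=CFS_PERIOD_US):
    """CPU time the container may use per period, in microseconds."""
    return cpu_limit_millicores * period_us // 1000


def min_wall_time_us(cpu_work_us, cpu_limit_millicores, period_us=CFS_PERIOD_US):
    """Earliest completion time for one thread needing cpu_work_us of CPU."""
    if cpu_work_us <= 0:
        return 0
    quota = cfs_quota_us(cpu_limit_millicores, period_us)
    if quota >= period_us:
        return cpu_work_us  # one thread cannot use more than one core anyway
    full_periods = (cpu_work_us - 1) // quota  # periods whose quota is used up
    remaining = cpu_work_us - full_periods * quota
    return full_periods * period_us + remaining


if __name__ == "__main__":
    print(cfs_quota_us(500))               # quota for a 500m limit
    print(min_wall_time_us(200_000, 500))  # 200 ms of CPU work under 500m
```

Under these assumptions, 200 ms of CPU work with a 500m limit takes at least 350 ms of wall-clock time: the process is slowed down, not killed.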
In my use case, I often set the memory request to the amount of memory my application uses on average, and the limit to +25%. This allows me to avoid container killing most of the time (which is good) but of course it exposes me to memory overallocation (and this could be a problem as you mentioned).
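A minimal sketch of that rule of thumb, assuming you already have a series of memory samples in MiB. The function name and the sample values are made up; the last element of the result shows whether the observed peak would have exceeded the limit.

```python
# Sketch: request = average usage, limit = request + 25%.
# The sample series below are invented for illustration.
from statistics import mean


def request_and_limit_mib(samples_mib, limit_margin=0.25):
    request = int(round(mean(samples_mib)))
    limit = int(round(request * (1 + limit_margin)))
    peak_exceeds_limit = max(samples_mib) > limit
    return request, limit, peak_exceeds_limit


if __name__ == "__main__":
    # Stable footprint: the margin covers the peaks.
    print(request_and_limit_mib([280, 300, 320, 300]))
    # A startup spike: the average hides a peak above the limit.
    print(request_and_limit_mib([700, 320, 300, 280]))
```

The second case shows the weak spot of sizing from the average: a short startup peak can still exceed the limit and get the container killed.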
Actually the topic you mention is interesting and at the same time complex, just as Linux memory management is. As we know, when a process uses more memory than its limit, it quickly moves up the "to-kill" ladder of candidate processes. Going further, the purpose of the limit is to tell the kernel when it should consider the process a candidate for killing. Requests, on the other hand, are a direct statement that "my container will need this much memory", and beyond that they give the Scheduler valuable information about where the Pod can be scheduled (based on available Node resources).
If there is no memory request but a high limit, Kubernetes will default the request to the limit (this might result in a scheduling failure, even if the pod's real requirements could be met).
If you set a request but no limit, the container will use the default limit for the namespace (if there is none, it will be able to use all of the available Node memory).
By setting a memory request that is lower than the limit, you give your pods room for activity bursts. You also make sure that the memory available for the Pod to consume during a burst is a reasonable amount.
Setting memory limit == memory request is not desirable, simply because activity spikes will put the container on a highway to being OOM killed by the kernel. Memory, unlike CPU, cannot be throttled in Kubernetes, so under memory pressure an OOM kill is the most probable scenario (let's also remember that there is no swap partition).
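The defaulting rules described above can be sketched as follows. This is a simplification with hypothetical names: it models only a namespace default limit, whereas a real LimitRange can also carry a separate default request and is validated by the API server.

```python
# Sketch: effective memory request/limit after defaulting.
# None means "not set"; a limit of None means the container is bounded
# only by the node. This is a simplified model, not the Kubernetes code.

def effective_memory(request_mib=None, limit_mib=None,
                     namespace_default_limit_mib=None):
    if limit_mib is None:
        # No limit on the container: fall back to the namespace default.
        limit_mib = namespace_default_limit_mib
    if request_mib is None:
        # No request: it defaults to the (possibly defaulted) limit.
        request_mib = limit_mib
    return request_mib, limit_mib


if __name__ == "__main__":
    print(effective_memory(limit_mib=1024))   # request follows the limit
    print(effective_memory(request_mib=256,
                           namespace_default_limit_mib=512))
    print(effective_memory(request_mib=256))  # no limit at all
```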
Quoting Will Tomlin and his interesting article on Requests vs Limits which I highly recommend:
You might be asking if there’s reason to set limits higher than
requests. If your component has a stable memory footprint, you
probably shouldn’t since when a container exceeds its requests, it’s
more likely to be evicted if the worker node encounters a low memory
condition.
To summarize: there is no straightforward answer. You have to determine your memory requirements and use monitoring and alerting tools to stay in control and be ready to adjust the configuration according to your needs.
I'm testing running MongoDB on the Kubernetes platform, where I can limit the resources used by the running container.
Say I set the memory limit to 256 MB. The problem is that, for example, while making a backup, memory consumption increases to the limit and the container gets restarted by Kubernetes.
So the question is: is there a way to limit MongoDB's memory consumption in my case, so that it does not crash by exceeding the memory limit set by the platform?
I could of course increase the limit, but I'm interested in a principled solution and would like to understand this process better, because I don't really know how memory is consumed by MongoDB and the container OS. Is it possible to tune MongoDB or the underlying Linux OS to work inside the existing limits?
The limits that you have set are good enough for a MongoDB pod; these are the limits used by the community as well.
The only way I can think of to get around this for backups is to increase the memory limit, but it might still fail, because elsewhere on Stack Overflow people have reported OOM kills on VMs with several gigabytes of memory. MongoDB basically tries to use any and all memory that is made available to it.
There are also other ways to back up MongoDB: https://dba.stackexchange.com/questions/76130/how-to-backup-large-mongodb-database
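One reason for this appetite is the WiredTiger internal cache. MongoDB's documentation describes its default size as the larger of 50% of (RAM - 1 GB) and 256 MB. The sketch below just evaluates that formula; the function name is hypothetical, and whether mongod sees the container limit or the host's RAM depends on the MongoDB version, so treat the input as an assumption.

```python
# Sketch: default WiredTiger internal cache size, per the documented formula
# max(50% of (RAM - 1 GiB), 256 MiB). Sizes are in MiB.

def default_wiredtiger_cache_mib(ram_mib):
    return max(0.5 * (ram_mib - 1024), 256)


if __name__ == "__main__":
    # With 256 MiB visible, the default cache alone equals the whole limit.
    print(default_wiredtiger_cache_mib(256))
    # With 4 GiB visible, the default cache is 1.5 GiB.
    print(default_wiredtiger_cache_mib(4096))
```

The cache can be capped explicitly with the storage.wiredTiger.engineConfig.cacheSizeGB setting (or --wiredTigerCacheSizeGB), but the cache is not the only memory mongod uses, so the container limit still needs headroom on top of it.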
I am not sure how this aligns in the k8s world.