ECS, meaning of memoryReservation in task definition

I'm having a hard time understanding memoryReservation in an ECS task definition. The documentation says:
The soft limit (in MiB) of memory to reserve for the container. When
system memory is under contention, Docker attempts to keep the
container memory to this soft limit; however, your container can
consume more memory when needed, up to either the hard limit specified
with the memory parameter (if applicable), or all of the available
memory on the container instance, whichever comes first.
So what's the consequence of setting this value?
My uwsgi is getting killed because of memory, and I wonder if changing this value would help.

It seems like you have set the memory parameter somewhere, either at the task definition level or at the container level; the two settings below differ significantly.
Memory
The amount (in MiB) of memory to present to the container. If your container attempts to exceed the memory specified here, the container is killed.
MemoryReservation
The soft limit (in MiB) of memory to reserve for the container. When system memory is under heavy contention, Docker attempts to keep the container memory to this soft limit. However, your container can consume more memory.
So the best option is to specify MemoryReservation only; that helps avoid your container being killed when that limit is reached.
For example, if your container normally uses 128 MiB of memory, but
occasionally bursts to 256 MiB of memory for short periods of time,
you can set a memoryReservation of 128 MiB, and a memory hard limit of
300 MiB. This configuration would allow the container to only reserve
128 MiB of memory from the remaining resources on the container
instance, but also allow the container to consume more memory
resources when needed.
aws-properties-ecs-taskdefinition-containerdefinitions

It means that if you haven't specified the memory parameter in the task definition, then memoryReservation is the amount subtracted from the container instance's available memory when the task is placed. If you have specified memory, that value is subtracted instead. If you set both parameters, you must specify them such that memory > memoryReservation. As you will have understood from the docs, memory is the hard limit and memoryReservation is the soft limit: the soft limit is a reservation, and the hard limit is the boundary. When your container is under heavy load or contention and requires more memory than the soft limit, it is allowed to consume up to the hard limit, beyond which you may see the container being killed.
You can try increasing these limits to see whether things stabilize and the OOM killer stops killing your uwsgi. But also be cautious of any memory leaks in your code.
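For illustration, here is a minimal CloudFormation-style sketch of these two settings, using the 128/300 MiB figures from the quoted example (the logical name, family, container name, and image are placeholders, not from the question):

Resources:
  ExampleTaskDefinition:                  # placeholder logical name
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: uwsgi-task                  # placeholder family
      ContainerDefinitions:
        - Name: uwsgi                     # placeholder container name
          Image: example/uwsgi-app:latest # placeholder image
          MemoryReservation: 128          # soft limit, reserved on the container instance
          Memory: 300                     # hard limit, the container is killed above this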

Related

Is there an "initial memory allocation" different from the "limit" in Openshift/Kubernetes?

In Java, you have min heap space (-Xms) and max heap space (-Xmx). Min heap space is allocated to the JVM from the start; max heap space is the limit at which the JVM reports "out of heap space" when it is reached.
Are there similarly distinct values (initial and limit) for a pod in OpenShift/Kubernetes, or is the initial memory allocation always equal to the limit for some reason?
With modern Java versions (those that support -XX:+UseContainerSupport), the heap allocation within a K8s or OpenShift pod depends on the memory available to the container.
This is solely determined by the "containers.resources.limits.memory" value (as you speculated). Other values, e.g. "containers.resources.requests.memory", don't play a part in this. If no resource limit is set, the entire memory of the respective cluster node will be used for the initial heap-size ergonomics, which is a sure recipe for OOM kills.
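To illustrate that answer, here is a minimal pod sketch (names and values are placeholders, not from the original question); with -XX:+UseContainerSupport the JVM sizes its default maximum heap as a fraction of the container's memory limit (25% unless overridden, e.g. via -XX:MaxRAMPercentage):

apiVersion: v1
kind: Pod
metadata:
  name: java-app                          # placeholder name
spec:
  containers:
    - name: app
      image: example/java-app:latest      # placeholder image
      env:
        - name: JAVA_TOOL_OPTIONS         # picked up automatically by the JVM
          value: "-XX:MaxRAMPercentage=50.0"   # let the heap use half of the limit instead of the 25% default
      resources:
        requests:
          memory: 1Gi                     # only used for scheduling
        limits:
          memory: 2Gi                     # this is the value the JVM sees as available RAM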
In K8s, resources are defined as a request and a limit.
The request is the initial allocation, and when the pod reaches the memory limit it gets OOMKilled and afterwards restarts, so I think it is the same behaviour you described.

What is the difference between “container_memory_working_set_bytes” and “container_memory_rss” metric on the container

I need to monitor the memory usage of my containers running on a Kubernetes cluster. After reading some articles, there are two recommendations: "container_memory_rss" and "container_memory_working_set_bytes".
The definitions of both metrics (from the cAdvisor code) are:
"container_memory_rss" : The amount of anonymous and swap cache memory
"container_memory_working_set_bytes": The amount of working set memory, this includes recently accessed memory, dirty memory, and kernel memory
I think both metrics represent the amount of physical memory that the process uses, but the two values differ on my Grafana dashboard.
My questions are:
What is the difference between the two metrics?
Which metric is more appropriate for monitoring memory usage? Some posts say both, because when either of them reaches the limit, the container is OOM killed.
You are right. I will try to address your questions in more detail.
What is the difference between the two metrics?
container_memory_rss equals the value of total_rss from the /sys/fs/cgroup/memory/memory.stat file:
// The amount of anonymous and swap cache memory (includes transparent
// hugepages).
// Units: Bytes.
RSS uint64 `json:"rss"`
The total amount of anonymous and swap cache memory (it includes transparent hugepages), and it equals the value of total_rss from the memory.stat file. This should not be confused with the true resident set size or the amount of physical memory used by the cgroup. rss + file_mapped will give you the resident set size of the cgroup. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
container_memory_working_set_bytes (as already mentioned by Olesya) is the total usage minus the inactive file memory. It is an estimate of how much memory cannot be evicted:
// The amount of working set memory, this includes recently accessed memory,
// dirty memory, and kernel memory. Working set is <= "usage".
// Units: Bytes.
WorkingSet uint64 `json:"working_set"`
Working Set is the current size, in bytes, of the Working Set of this process. The Working Set is the set of memory pages touched recently by the threads in the process.
Which metrics are much proper to monitor memory usage? Some post said
both because one of those metrics reaches to the limit, then that
container is oom killed.
If you are limiting resource usage for your pods, then you should monitor both, as either can cause an OOM kill when it reaches the configured limit.
I also recommend this article which shows an example explaining the below assertion:
You might think that memory utilization is easily tracked with
container_memory_usage_bytes, however, this metric also includes
cached (think filesystem cache) items that can be evicted under memory
pressure. The better metric is container_memory_working_set_bytes as
this is what the OOM killer is watching for.
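Building on that last point, here is a sketch of a Prometheus alerting rule that compares the working set to the configured limit (this assumes kube-state-metrics is installed to provide kube_pod_container_resource_limits; exact label names can vary between versions):

groups:
  - name: container-memory
    rules:
      - alert: ContainerNearMemoryLimit
        # working set as a fraction of the container's memory limit
        expr: |
          container_memory_working_set_bytes{container!=""}
            / on (namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory"}
            > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} is above 90% of its memory limit"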
EDIT:
Adding some additional sources as a supplement:
A Deep Dive into Kubernetes Metrics — Part 3 Container Resource Metrics
#1744
Understanding Kubernetes Memory Metrics
Memory_working_set vs Memory_rss in Kubernetes, which one you should monitor?
Managing Resources for Containers
cAdvisor code

How do I calculate the WiredTiger cache size in a Docker container?

We run MongoDB mongod processes inside Docker containers in Kubernetes with clear memory limits.
I am trying to configure the mongod processes correctly for the imposed memory limits.
This is the information I could collect from the docs:
The memory usage of MongoDB is correlated with the WiredTiger cache size. It is calculated using the formula 50% of (RAM - 1 GB), or a minimum of 256 MB https://docs.mongodb.com/manual/core/wiredtiger/#memory-use
RAM is the total amount of ram available on the system. In the case of containerized nodes, it is the available memory to the container (since MongoDB 4.0.9) https://docs.mongodb.com/manual/faq/diagnostics/#must-my-working-set-size-fit-ram
“If you run mongod in a container (e.g. lxc, cgroups, Docker, etc.) that does not have access to all of the RAM available in a system, you must set storage.wiredTiger.engineConfig.cacheSizeGB to a value less than the amount of RAM available in the container.” https://docs.mongodb.com/manual/faq/diagnostics/#must-my-working-set-size-fit-ram
The docs state that increasing the WiredTiger cache size above the default should be avoided. https://docs.mongodb.com/manual/faq/diagnostics/#must-my-working-set-size-fit-ram
This information is a little unclear.
Do I leave the WiredTiger cache size at its default, or do I set it to "a value less than the amount of RAM available in the container"? How much lower should that value be? (A value higher than the default would also contradict the advice not to increase it above the default.)
The default is to allow the WiredTiger cache to use slightly less than half of the total RAM on the system.
The process normally determines the total RAM automatically by querying the underlying operating system.
In the case of a Docker container which has been allocated 16GB of RAM but is running on a host machine that has 128GB RAM, the system call will report 128GB. The default in this case would be 63GB, which obviously would cause a problem.
In general:
Use the default in situations where the system call reports the true memory available in the environment. This includes bare metal, most VMs, cloud providers, etc.
In containers where the amount of memory reported by the system call does not reflect the total amount available to the container, manually make the calculation for what the default would have been if it did, and use that value instead.
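For example, following that rule for the 16GB container mentioned above, a mongod.conf sketch (mongod.conf is YAML) might look like the following; note that, per the docs quoted in the question, MongoDB 4.0.9+ already detects the container limit itself, so the explicit setting mainly matters for older versions or as an override:

# mongod.conf: the container limit is 16 GB, so mimic the default formula
# 50% of (RAM - 1 GB) = 0.5 * (16 - 1) = 7.5 GB
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 7.5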

Request vs limit cpu in kubernetes/openshift

I have a dilemma in choosing the right request and limit settings for a pod in OpenShift. Some data:
during start up, the application requires at least 600 millicores to be able to fulfill the readiness check within 150 seconds.
after start up, 200 millicores should be sufficient for the application to stay in idle state.
So my understanding from documentation:
CPU Requests
Each container in a pod can specify the amount of CPU it requests on a node. The scheduler uses CPU requests to find a node with an appropriate fit for a container.
The CPU request represents a minimum amount of CPU that your container may consume, but if there is no contention for CPU, it can use all available CPU on the node. If there is CPU contention on the node, CPU requests provide a relative weight across all containers on the system for how much CPU time the container may use.
On the node, CPU requests map to Kernel CFS shares to enforce this behavior.
Noted: the scheduler refers to the CPU request when allocating the pod to a node, and that amount is then a guaranteed resource once allocated.
On the other hand, I might be allocating extra CPU, since the 600 millicores may only be required during startup.
So should I go for
resources:
  limits:
    cpu: 1
  requests:
    cpu: 600m
for a guaranteed resource, or
resources:
  limits:
    cpu: 1
  requests:
    cpu: 200m
for better CPU savings?
I think you didn't quite get the idea of Requests vs Limits; I would recommend you take a look at the docs before making that decision.
In a brief explanation,
Request is how much of the resource is virtually allocated to the container. It is a guarantee that you can use it when you need it, but it does not mean the amount is reserved exclusively for the container. That said, if you request 200MB of RAM but only use 100MB, the other 100MB can be "borrowed" by other containers when they consume all of their requested memory, and it will be "claimed back" when your container needs it.
Limit, in simple terms, is how much the container can consume, requested plus borrowed from other containers, before it is shut down for consuming too many resources.
If a Container exceeds its memory limit, it will probably be terminated.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
In simple terms, the limit is an absolute value; it should be equal to or higher than the request. Good practice is to avoid setting limits higher than requests for all containers, and to do so only for the specific workloads that need it. This is because most containers can then consume more resources (e.g. memory) than they requested, and pods will suddenly start to be evicted from the node in an unpredictable way, which is worse than having a fixed limit for each one.
There is also a nice post in the docker docs about resources limits.
The scheduling rule is the same for CPU and memory: K8s will only assign a pod to a node if the node has enough allocatable CPU and memory to fit all the resources requested by the containers within the pod.
The execution rule is a bit different:
Memory is a finite resource on the node and its capacity is an absolute limit; containers can't consume more than the node has.
CPU, on the other hand, is measured as CPU time. When you reserve CPU capacity, you are specifying how much CPU time a container can use; if the container needs more time than it requested, it can be throttled and placed in an execution queue until other containers have consumed their allocated time or finished their work. In summary this is very similar to memory, but it is very unlikely for a container to be killed for consuming too much CPU. The container will be able to use more CPU when the other containers do not use the full CPU time allocated to them. The main issue is that when a container uses more CPU than was allocated, the throttling degrades the performance of the application, and at a certain point it might stop working properly. If you do not provide limits, the container can start affecting other resources on the node.
Regarding the values to use, there is no right value or right formula; each application requires a different approach, and only by measuring multiple times can you find the right values. The advice I would give you is to identify the min and the max, set something in the middle, and then keep monitoring to see how it behaves; if you feel it is wasting or lacking resources, you can reduce or increase towards an optimal value. If the service is crucial, start with higher values and reduce them afterwards.
As for the readiness check, you should not use it to drive these values; you can delay readiness using the initialDelaySeconds parameter in the probe to give the pod's containers extra time to start.
PS: I quoted the terms "borrow" and "claim back" because the container is not actually borrowing from another container. In general, the node has a pool of memory and hands a chunk of that memory to a container when it needs it, so the memory is not technically borrowed from another container but from the pool.
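Putting that advice together, a rough sketch of the second option plus a delayed readiness probe (container name, image, port, and probe path are placeholders; the CPU figures and the ~150-second startup come from the question):

apiVersion: v1
kind: Pod
metadata:
  name: example-app                       # placeholder name
spec:
  containers:
    - name: app
      image: example/app:latest           # placeholder image
      resources:
        requests:
          cpu: 200m                       # steady-state need from the question
        limits:
          cpu: "1"                        # leaves headroom for the ~600m startup burst
      readinessProbe:
        httpGet:
          path: /healthz                  # placeholder endpoint
          port: 8080                      # placeholder port
        initialDelaySeconds: 150          # give the slow startup time before the first probe
        periodSeconds: 10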

What is the use case of setting memory request less than limit in K8s?

I understand the use case of setting the CPU request less than the limit: it allows for CPU bursts in each container if the instance has free CPU, resulting in maximum CPU utilization.
However, I cannot really find the use case for doing the same with memory. Most applications don't release memory after allocating it, so effectively applications will grow to use memory up to the limit (which has the same effect as setting request = limit). The only exception is containers running on an instance that already has all its memory allocated. I don't really see any pros in this, and the cons are more nondeterministic behaviour that is hard to monitor (one container having higher latencies than another due to heavy GC).
The only use case I can think of is a shared in-memory cache, where you want to allow for a spike in memory usage. But even in this case one would be running the risk of one of the nodes underperforming.
Maybe not a real answer, but a point of view on the subject.
The difference between the limit on CPU and on memory is what happens when the limit is reached. In the case of CPU, the container keeps running but its CPU usage is throttled. If the memory limit is reached, the container gets killed and restarted.
In my use case, I often set the memory request to the amount of memory my application uses on average, and the limit to +25%. This allows me to avoid the container being killed most of the time (which is good), but of course it exposes me to memory overallocation (which could be a problem, as you mentioned).
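As a concrete illustration of that rule of thumb (the 512Mi figure is just a stand-in for whatever the observed average is):

resources:
  requests:
    memory: 512Mi                         # roughly the application's average usage (placeholder figure)
  limits:
    memory: 640Mi                         # average plus ~25% headroom for spikes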
Actually, the topic you mention is interesting and at the same time complex, just as Linux memory management is. As we know, when a process uses more memory than its limit, it quickly moves up the "to-kill" ladder of candidate processes. In other words, the purpose of the limit is to tell the kernel when it should consider the process a candidate for killing. Requests, on the other hand, are a direct statement that "my container will need this much memory", and beyond that they provide valuable information to the scheduler about where the pod can be scheduled (based on available node resources).
If there is no memory request but a high limit, Kubernetes defaults the request to the limit (this might cause scheduling to fail even though the pod's real requirements could be met).
If you set a request but no limit, the container will use the default limit for the namespace (and if there is none, it will be able to use all of the available node memory).
Setting a memory request that is lower than the limit gives your pods room for activity bursts, and it also ensures that the memory available for the pod to consume during a burst is a reasonable amount.
Setting memory limit == memory request is not desirable simply because activity spikes will put the pod on a highway to being OOM killed by the kernel. Memory in Kubernetes cannot be throttled the way CPU can, so if there is memory pressure, that is the most probable scenario (let's also remember that there is no swap partition).
Quoting Will Tomlin and his interesting article on Requests vs Limits which I highly recommend:
You might be asking if there’s reason to set limits higher than
requests. If your component has a stable memory footprint, you
probably shouldn’t since when a container exceeds its requests, it’s
more likely to be evicted if the worker node encounters a low memory
condition.
To summarize - there is no straight and easy answer. You have to determine your memory requirements and use monitoring and alerting tools to have control and be ready to change/adjust the configuration accordingly to needs.