How many threads for a Kubernetes application?

I have a multithreaded application which does a CPU-intensive task that I want to run on Kubernetes.
The node I'm using has 56 cores and I set a request and limit of 2 cores for my pod.
Since it's CPU-intensive, typically there would be no point in having more threads than the # of cores (with hyperthreading, perhaps twice as many threads as cores), so I would allocate 2-4 threads and call it a day.
However, AFAIK, Kubernetes doesn't guarantee core affinity, so in the worst case my 2 cores' worth of CPU time could be spread evenly across all 56 cores, with each core doing my work only 2/56th of the time. If that happens and I allocate only 4 threads, then at least 52 of the 56 cores will be sitting idle as far as my application is concerned.
If I understand it correctly, this problem is not unique to Kubernetes and applies to any virtualized environment where the hardware resources are shared.
What is the best practice when it comes to dealing with this potential worst-case scenario? Do you ignore it and assume you have high locality, or do you plan for the worst, or something in between?

Thread scheduling, including affinity, is handled by the OS; Kubernetes usually doesn't affect it. If it matters for your workload, you can try the static CPU Manager policy in Kubernetes to get exclusive CPU cores for your app.
Alternatively, you can set a pod CPU limit of 56 and run 56 threads to utilize all of the node's cores. Assuming all other pods on this node set correct CPU requests, this should not affect them negatively.
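For illustration, here is a minimal sketch of a pod spec that would get exclusive cores under the static CPU Manager policy. It assumes the node's kubelet runs with --cpu-manager-policy=static; the name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-bound-app                                   # placeholder name
spec:
  containers:
  - name: worker
    image: registry.example.com/cpu-bound-app:latest    # placeholder image
    resources:
      requests:
        cpu: "2"          # integer CPU count, equal to the limit...
        memory: "2Gi"
      limits:
        cpu: "2"          # ...so the pod gets Guaranteed QoS and is
        memory: "2Gi"     # eligible for exclusive (pinned) cores

With a fractional CPU value, or with requests that differ from limits, the container would stay in the shared pool, so the integer request-equals-limit form is the important part here.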

Related

OpenShift/Kubernetes memory handling

What is the correct way to handle memory in OpenShift/Kubernetes?
If I create a project in OKD, how can I determine the optimal memory usage of the pods? For example, I use 1 deployment for 1-2 pods and each pod uses 300-500 MB of RAM (Spring Boot apps). So technically, 20 pods use around 6-10 GB of RAM, but as I see it, sometimes a project can have around 100-150 containers, which need at least 30-50 GB of RAM.
I also tried horizontal scaling and/or requests/limits, but each microservice still uses a lot of memory.
However, starting a pod requires around 500-700 MB of RAM; after the Spring container has started, it can live with around 300 MB as mentioned.
So, I have 2 questions:
Is it possible to give extra memory, but only for the first X minutes after each pod starts?
If not, then what is the best practice for handling memory shortage if I have limited memory (16 GB) and want to run 35-40 pods?
Thanks for the answer in advance!
Is it possible to give extra memory, but only for the first X minutes after each pod starts?
You do get this behavior when you set the limit to a higher value than the request. This allows pods to burst, unless they all need the memory at the same time.
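As a rough sketch (the numbers are placeholders, not tuned for your services), each container could request roughly its steady-state footprint and be allowed to burst above it during startup:

resources:
  requests:
    memory: "350Mi"    # roughly the steady-state usage you observed
  limits:
    memory: "768Mi"    # headroom for the 500-700 MB Spring Boot startup spike

The scheduler only reserves the request, so nodes can be packed based on steady-state usage while startup spikes borrow the slack, as long as not every pod restarts at the same time.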
If not, then what is the best practice for handling memory shortage if I have limited memory (16 GB) and want to run 35-40 pods?
It is common to use some form of cluster autoscaler to add more nodes to your cluster if it needs more capacity. This is easy if you run in the cloud.
In general, Java and the JVM are memory hungry; consider another technology if you want to use less memory. How much memory an application needs/uses depends entirely on your application, e.g. what data structures are used.

CPU limits (cores) more than 100% on nodes

I have just noticed on my kubernetes dashboard this:
CPU requests (cores): 0.66 (16.50%)
CPU limits (cores): 4.7 (117.50%)
I am quite confused as to why the limit is set to 117.50%. Is one of my services using too much? But wouldn't that show up in the requests? Looking at kubectl describe node, I don't see any service using more than 2% (there are 43 services, which would be 86% at most).
Thank you.
My approximate understanding is that Kubernetes lets you overcommit, that is, have resource limits on a particular node that add up to more than the node's capacity, so that you can be a little more efficient with your resource use.
For instance, suppose you're running deployments A and B, both of which require only 100 MB of memory (200 MB total) when they're idle, but require 1 GB of memory when they're actively processing a request. You could set things up to have each one of them run on a node with 1 GB of memory available. You could also put them on a single node with 1.5 GB of memory, assuming that A and B won't have to process traffic simultaneously, thereby saving yourself from a huge resource allocation.
This might be especially reasonable if you're using lots of microservices: you might even know that B can't process data until A has completed a request anyway, providing you a stronger guarantee things won't overlap and cause problems.
Whether Kubernetes overcommits resources depends on the quality of service (QoS) class your pods fall into, which is derived from the requests and limits you configure. For instance, you won't get overcommitment with the Guaranteed QoS class (requests equal to limits), but you may see it with the Burstable or BestEffort classes (BestEffort is what you get when you set no requests or limits at all).
You can read more about QoS classes in the Kubernetes documentation.
Limits (of all things) are allowed to overcommit the resources of the node. Requests cannot, so those should never add up to more than 100% of what's available. Basically, the idea is that "request" is a minimum requirement, while "limit" is a maximum burst range, and it's not very likely that everyone will burst at once. If it is likely for you, you should set your requests and limits to the same value.
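To make that concrete with invented numbers that roughly match your dashboard (0.66 cores requested being 16.5% suggests about 4 allocatable cores): containers shaped like the sketch below reserve only 15m each, so 43 of them request about 0.65 cores, while their limits add up to about 4.7 cores, well over 100% of the node:

resources:
  requests:
    cpu: "15m"      # 43 of these is ~0.65 core, ~16% of a 4-core node
  limits:
    cpu: "110m"     # 43 of these is ~4.7 cores, ~117% of a 4-core node

The scheduler only checks the requests, which is why the limits column on the dashboard can happily exceed 100%.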

How to determine resource limits for the OpenShift pods required for my Tomcat application?

I have a web application (SOAP service) running in a Tomcat 8 server in OpenShift. The payload size is relatively small, with 5-10 elements, and the traffic is also small (300 calls per day, 5-10 max concurrent threads). I'm a little confused about the pod resource restrictions. How do I come up with min and max CPU and memory limits for each pod if I'm going to use a minimum of 1 and a maximum of 3 pods for my application?
It's tricky to configure accurate limit values without a performance test, because we can't know in advance how many resources your application needs to process each request. A good rule of thumb is to base the limits on the heaviest workload in your environment. A memory limit can trigger the OOM killer, so you should set a value you can afford, based on your Tomcat heap plus static (non-heap) memory size.
A CPU limit, by contrast, will not kill your pod when it is reached, but it will slow processing down.
My suggested starting point for each limit is as follows (a concrete sketch of the memory value follows at the end of this answer).
Memory: Tomcat (Java) memory size + a 30% buffer.
CPU: personally, I think a CPU limit works against maximizing processing performance and efficiency. Even when CPU is available and the pod could use the full CPU resources to process requests as quickly as possible, the limit can get in the way. But if you need to spread resource usage evenly to suppress an aggressive resource eater, you can consider a CPU limit.
This answer might not be exactly what you wanted, but I hope it helps you think about your capacity planning.
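As a hedged sketch of that memory rule of thumb (all values are assumptions you would replace with your own heap and traffic measurements):

resources:
  requests:
    memory: "768Mi"   # e.g. a 512 MB Tomcat heap plus metaspace and thread stacks
    cpu: "250m"       # small reservation for a low-traffic SOAP service
  limits:
    memory: "1Gi"     # roughly the Java memory size + ~30% buffer; exceeding this triggers the OOM killer
    # no CPU limit, per the reasoning above

You would then watch actual usage under your heaviest workload and adjust from there.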

Azure Service Fabric reliable collections and memory

Let's say I'm running a Service Fabric cluster on 5 D1-class (1 core, 3.5 GB RAM, 50 GB SSD) VMs, and that I'm running 2 reliable services on this cluster, one stateless and one stateful. Let's assume the target replica set size is 3.
How do I calculate how much my reliable collections can hold?
Let's say I add one or more stateful services. Since I don't really know how the framework distributes services, do I need to take the most conservative approach and assume that all of my stateful services may end up on a single node, and that their cumulative memory needs to fit below the RAM available on a single machine?
TLDR - Estimating the expected capacity of a cluster is part art, part science. You can likely get a good lower bound which you may be able to push higher, but for the most part deploying things, running them, and collecting data under your workload's conditions is the best way to answer this question.
1) In general, the collections on a given machine are bounded by the amount of available memory or the amount of available disk space on a node, whichever is lower. Today we keep all data in the collections in memory and also persist it to disk. So the maximum amount that your collections across the cluster can hold is generally (Amount of available memory in the cluster) / (Target Replica Set Size).
Note that "Available Memory" is whatever is left over from other code running on the machines, including the OS. In your above example though you're not running across all of the nodes - you'll only be able to get 3 of them. So, (unrealistically) assuming 0 overhead from these other factors, you could expect to be able to put about 3.5 GB of data into that stateful service replica before you ran out of memory on the nodes on which it was running. There would still be 2 nodes in the cluster left empty.
Let's take another example. Let's say that it is about the same as your example above, except in this case you set up the stateful service to be partitioned. Let's say you picked a partition count of 5. So now on each node, you have a primary replica and 2 secondary replicas from other partitions. In this case, each partition would only be able to hold a maximum of around 1.16 GB of state, but now overall you can pack 5.83 GB of state into the cluster (since all nodes can now be utilized fully). Incidentally, just to prove out the math works, that's (3.5 GB of memory per node * 5 nodes in the cluster) [17.5] / (target replica set size of 3) = 5.83.
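Restating the rule of thumb from these examples as a formula (nothing new, just the same arithmetic in one place):

\[
\text{max state in cluster} \approx \frac{N_{\text{nodes}} \times M_{\text{usable per node}}}{\text{target replica set size}} = \frac{5 \times 3.5\ \text{GB}}{3} \approx 5.83\ \text{GB}
\]

with the per-partition share being that total divided by the partition count.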
In all of these examples, we've also assumed that memory consumption for all partitions and all replicas is the same. A lot of the time that turns out to not be true (at least temporarily) - some partitions can end up with more or less work to do and hence have uneven resource consumption. We also assumed that the secondaries were always the same as the primaries. In the case of the amount of state, it's probably fair to assume that these will track fairly evenly, though for other resource consumption it may not (just something to keep in mind). In the case of uneven consumption, this is really where the rest of Service Fabric's Cluster Resource Management will help, since we can come to know about the consumption of different replicas and pack them efficiently into the cluster to make use of the available space. Automatic reporting of consumption of resources related to state in the collections is on our radar and something we want to do, so in the future, this would be automatic but today you'd have to report this consumption on your own.
2) By default, we will balance the services according to the default metrics (more about metrics is here). So by default, the different replicas of those two different services could end up on the same machine, but in your example you'll end up with 4 nodes that each have 1 replica from one of the services and 1 node with two replicas, one from each service. This means that each service (each with 1 partition, as per your example) would only be able to consume 1.75 GB of memory, for a total of 3.5 GB in the cluster. This is again less than the total available memory of the cluster, since there are portions of nodes that you're not utilizing.
Note that this is the maximum possible consumption, and it presumes no consumption outside the service itself. Taking this as your maximum is not advisable. You'll want to reduce it for several reasons, but the most practical reason is to ensure that in the presence of upgrades and failures there's sufficient capacity left in the cluster. As an example, let's say that you have 5 Upgrade Domains and 5 Fault Domains. Now let's say that a fault domain's worth of nodes fails while you have an upgrade going on in an upgrade domain. This means that (a little less than) 40% of your cluster capacity can be gone at any time, and you probably want enough room left on the remaining nodes to continue. So if your cluster previously could hold 5.83 GB of state (from our prior calculations), in reality you probably don't want to put more than about 3.5 GB of state in it (roughly 60% of that figure), since with more than that the service may not be able to get back to 100% healthy (note also that we don't build replacement replicas immediately, so the nodes would have to be down for your ReplicaRestartWaitDuration before you ran into this case). More information about metrics, capacity, buffered capacity (which you can use to ensure that room is left on nodes for the failure cases), and fault and upgrade domains is covered in this article.
There are some other things that practically will limit the amount of state you'll be able to store. You'll want to do several things:
Estimate the size of your data. You can make a reasonable estimate up-front of how big your data is by calculating the size of each field your objects hold. Be sure to take into consideration 64-bit references. This will give you a lower-bound starting point.
Storage overhead. Each object you store in a collection will come with some overhead for storing that object. In the reliable collections depending on the collection and the operations currently in flight (copy, enumerations, updates, etc.) this overhead can range from between 100 and around 700 bytes per item (row) stored in the collections. Do know also that we're always looking for ways to reduce the amount of overhead we introduce.
We also strongly recommend running your service over some period of time and measuring actual resource consumption via performance counters. Simulating some sort of real workload and then measuring the actual usage of the metrics you care about will serve you pretty well. The reason we recommend this in particular is that you will be able to see consumption from things like which CLR object heap your objects end up placed in, how often the GC is running, whether there are leaks, or other things like this that will impact the amount of memory you can actually utilize.
I know that this has been a long answer but I hope you find it helpful and complete.

Questions about HPC on SLURM

I have a few questions about HPC. I have a code with serial and parallel sections. The parallel sections work on different chunks of memory, and at some point they communicate with each other. For this I use MPI on our cluster. SLURM is the resource manager. Below are the specifications of a node in the cluster.
Specifications of a node:
Processor: 2x Intel Xeon E5-2690 (16 cores / 32 threads in total)
Memory : 256 GB 1600MHz ECC
Disk : 2 x 600 GB 2.5" SAS (configured with raid 1)
Questions:
1) Do all cores on a node share the same memory (RAM)? If yes, do all of the cores access memory at the same speed?
2) Consider a case:
--nodes = 1
--ntasks-per-node = 1
--cpus-per-task = 16 (all cores on a node)
If all cores share the same memory (depending on the answer to question 1), will all cores be used, or will 15 of them sleep since OpenMP (for shared memory) is not used?
3) If the required memory is less than the total memory of a node, isn't it much better to use a single node, use OpenMP to achieve core-level parallelism, and avoid the time lost to communication between nodes? That is, use this
--nodes = 1
--ntasks-per-core = 1
instead of this:
--nodes = 16
--ntasks-per-node = 1
The rest of the questions are related to statements in this link.
Use core allocation if your application is CPU bound; the more processors you can throw at it the better!
Does this statement mean that --ntasks-per-core is good when cores don't access RAM too often?
Use socket allocation if memory access is what bottlenecks your application’s performance. Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.
I just don't get this. What I know is that all sockets, and all cores on those sockets, share the same memory. That is why I don't understand why the --ntasks-per-socket option is available.
Use node allocation if some node-wide resource is what bottlenecks your application. This is the case with applications that are relying heavily on access to disk or to networks resources. Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
Does this mean that, if the memory required is more than the total RAM of a single node, then it's better to use multiple nodes?
In order:
Yes, all cores share the same memory. But not usually at the same speed. Usually each processor (in your configuration you have 2 processors, or sockets) has memory that is 'closer' to it, and the Linux kernel will usually attempt to allocate from that nearby memory. This is not something that a user application usually has to worry about.
If it is a serial job, then yes, 15 cores will sit idle. If your job uses MPI, then it can use the other cores on the same node. In fact, MPI on the same node is usually much faster than MPI stretched across multiple nodes.
You can use OpenMP or MPI on a single node. I'm not sure about the speed difference, but if you are already familiar with MPI, I would just stick with that. The difference probably isn't that big. But, the difference between running MPI on a single node vs. multiple nodes is going to be large. Running MPI on a single node will be significantly faster than across multiple nodes.
Use core allocation if your application is CPU bound; the more processors you can throw at it the better!
This is likely targeting OpenMP or single node parallel jobs.
Use socket allocation if memory access is what bottlenecks your application’s performance. Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.
See the answer to 1. Though it is the same memory, cores on different sockets usually have separate buses to their local memory.
Use node allocation if some node-wide resource is what bottlenecks your application. This is the case with applications that are relying heavily on access to disk or to networks resources. Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
If you need more RAM than a single node can provide, then you have no choice but to divide your program and use MPI.