Distributed Resource Allocation Architecture - service

I am currently working on scaling a large-scale infrastructure that involves distributing complex calculations over a calculation farm (a cluster with a limited number of machines). The current system is based on a service-oriented architecture, whereby a limited number of services run on each machine in the cluster.
The resources used (CPU, memory) by each request sent to these services vary widely depending on the content of the request, but can be known (or at least predicted) in advance. In other words, it is possible to know, for a given request, the following:
Time it will take to process the request. (Can vary from ms to minutes to sometimes hours)
Maximum memory required to process the request. (From a few MB to several GB)
Maximum number of cores required to process the request. (Mostly mono-threaded, but sometimes multi-threaded)
Our current architecture is problematic because our 'scheduler' does not take any of those parameters into account. Because of this, we often run into issues where one particular server is occupied by very expensive/'incompatible' requests (in terms of memory usage, CPU cores used, etc.), so processing each of them becomes wildly inefficient, while other servers are occupied by relatively 'cheap' requests.
We would like to optimise this allocation process by moving our current infrastructure to a more modern orchestration system, such as Kubernetes (or another). The question I have at the moment is: given those requirements (efficient distribution of requests with varying resource requirements, known before processing the request), which currently available platforms could be a good fit for optimising this type of workflow?
Thanks,
Jon

Kubernetes seems a good fit for that type of workload. Each request could be run as a Job, which would run one or more containers to process the request. Each container can declare, ahead of time in its specification, the minimum amount of resources it will require (requests) and also a limit on those resources (e.g. maximum memory and maximum number of cores), and the Kubernetes scheduler can pick a node within the cluster that can satisfy and enforce these requirements.
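For example, a Job created per request might look roughly like this (a minimal sketch; the image, names, and resource figures are placeholders you would fill in from your per-request predictions):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: calc-request-12345                 # hypothetical name, e.g. derived from the request ID
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: calc-worker
        image: registry.example.com/calc-worker:latest    # placeholder image
        args: ["--request-id", "12345"]                    # placeholder arguments
        resources:
          requests:            # used by the scheduler to pick a node with enough room
            cpu: "2"           # predicted number of cores
            memory: 4Gi        # predicted peak memory
          limits:              # enforced at runtime (CPU is throttled, memory is OOM-killed)
            cpu: "2"
            memory: 6Gi
```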
This lets you stop worrying about where the workloads are actually running and focus on describing the requirements of each request accurately.

Related

Service Fabric - Stateful Service Memory Footprint

When using stateful services (Reliable Services model), we observe that the baseline memory footprint per process is approximately 250MB of unmanaged memory and approximately 20MB of managed memory (in this scenario, it is simply a Stateful Web API with no other code, created from the Visual Studio templates).
In our application, we use this model, hosting our services in a single Service Fabric Application (with ServicePackageActivationMode.SharedProcess). Our application partitions data by tenants (conceptually a customer) and namespaces within a tenant (a container for a subset of the customer’s data). So, there may be 0 to many tenants, and each tenant may have 0 to many namespaces, all hosted in a single process (with secondary replicas on other nodes).
So, we have one process per node with a baseline memory of approximately 300MB of unmanaged memory and 20MB of managed memory. Of note, we are using .NET 5 (migrating to 6) so none of the unmanaged memory is directly ours, rather it is the Service Fabric overhead.
Our goal is to isolate tenants to minimize noisy neighbors, as well as allow better load balancing, roll out upgrades on a per-customer basis, among other potential benefits. So, we are considering changing our hosting model such that, rather than multi-tenancy as we have now, we want to have a Service Fabric Application per tenant (still using “Shared Process”), thus if we have ‘N’ tenants, each node will have ‘N’ processes, rather than the single process we currently use.
What concerns us is the baseline memory footprint (i.e., the ~250-300MB of unmanaged memory) per process. Even in the multi-tenancy model, each process has a similar baseline memory overhead. We expect to have many tenants hosted within a single cluster. So, if there are 100 tenants, with the application-per-tenant model, we would have an ambient memory overhead of more than 24GB without any customer data being in-process. Testing with “Exclusive Process”, we see similar baseline memory footprints. Even stateless services exhibit a similar baseline memory footprint.
Since we leverage Reliable Collections (RCs), we have not attempted to use a Guest Executable model, nor have we tried Reliable Actors (probably not a good fit anyway, since we have I/O going to BLOB storage along with querying across a subset of instances). We are, however, researching alternatives to RCs to potentially decouple from Service Fabric.
My overall question then: Is the baseline memory footprint on a per-process basis just the cost of doing business with Service Fabric? I am curious if others have seen this same memory footprint, as well as if there are any ways to optimize (reduce) the memory.

How to determine resource limit for Openshift Pods required for my tomcat application?

I have a web application (SOAP service) running on a Tomcat 8 server in OpenShift. The payload size is relatively small with 5-10 elements, and the traffic is also small (300 calls per day, 5-10 max threads at a time). I'm a little confused about the Pod resource restrictions. How do I come up with min and max CPU and memory limits for each pod if I'm going to use a minimum of 1 and a maximum of 3 pods for my application?
It's tricky to configure accurate limit values without a performance test, because we can't know in advance how many resources your application needs per request. A good rule of thumb is to set the limits based on the heaviest workload in your environment. Hitting the memory limit can trigger the OOM killer, so you should set a value with headroom based on your Tomcat heap size plus static (non-heap) memory.
In contrast, hitting the CPU limit will not kill your pod; it only slows processing down.
My suggested starting point for each limit value is as follows (a concrete sketch follows the list).
Memory: Tomcat (Java) memory size + a 30% buffer
CPU: personally, I think a CPU limit does little to maximise processing performance and efficiency. Even when the pod has CPU to spare and could use the full CPU resources to process requests as quickly as possible, the limit setting can get in the way. But if you need to spread resource usage evenly to suppress an aggressive resource eater, you can consider a CPU limit.
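For illustration, a container resources block along these lines could be a starting point (a sketch only; the ~1 GiB heap figure, the millicore values, and the exact buffer applied are assumptions to replace with your own measurements):

```yaml
# Hypothetical starting point for a small Tomcat SOAP service (~300 calls/day).
# Assumes roughly a 1 GiB Java heap plus non-heap (metaspace, threads, native) memory.
resources:
  requests:
    memory: "1536Mi"   # heap + non-heap, plus ~30% buffer
    cpu: "250m"        # modest request for low, bursty traffic
  limits:
    memory: "1536Mi"   # exceeding this triggers the OOM killer, hence the buffer
    cpu: "1"           # optional: throttles rather than kills the pod when hit
```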
This answer might not be exactly what you wanted, but I hope it helps you with your capacity planning.

How to set the right cpu millicores for a container?

I want to configure the CPU cores optimally, without over- or under-allocation. How can I measure the required CPU millicores for a given container? It also raises the question of how much traffic a proxy will send to any given pod based on CPU consumption, so we can use the compute optimally.
Currently I send requests and monitor with:
kubectl top pod
Is there any tool that can measure requests, CPU and memory over time and suggest an optimal CPU recommendation for the pods?
Monitoring over time and per Pod: yes, there are suggestions at https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/ One of the more popular options is the Prometheus-Grafana combination - https://grafana.com/dashboards/315
As for automatic suggestion of the requests and limits, I don't think there is anything. Keep in mind Kubernetes already tries to balance giving each Pod what it needs without it taking too much. The limits and requests that you set are there to help it do this more safely. There are limits to automatic inference, as an under-resourced Pod can still work but respond a bit more slowly - it is up to you to decide what level of slowness you would tolerate. It is also up to you to decide what level of resource consumption is acceptable at peak load, as opposed to excessive consumption that might indicate a bug in your app or even an attack. There's a further limitation in that the metric units are themselves an attempt to approximate resource power, which can vary with the type of hardware (memory and CPUs can differ in mode of operation as well as quantity) and so can vary across clusters, or even across nodes in a cluster if the hardware isn't all equal.
What you are doing with top seems to me a good way to get started. You'll want to monitor resource usage for the cluster anyway, so keeping track of this and adjusting limits as you go is a good idea. If you can run the same app outside of Kubernetes, and read around to see what other apps using the same language do, that can help indicate whether there's anything you can do to improve utilisation (memory consumption on the JVM in containers, for example, famously requires some tweaking to get right).
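Once kubectl top (or Prometheus) shows you a stable picture, you can fold the observed numbers back into the pod spec. A sketch, assuming you observed roughly 150m CPU and 200Mi memory at peak (the figures and headroom factors here are assumptions to adjust against your own measurements, not recommendations):

```yaml
resources:
  requests:
    cpu: "150m"      # roughly the peak usage observed via `kubectl top pod`
    memory: "256Mi"  # observed peak plus a small buffer
  limits:
    cpu: "500m"      # leave room for bursts; exceeding this only throttles
    memory: "512Mi"  # hard cap; exceeding this gets the container OOM-killed
```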

Multiple node pools vs single pool with many machines vs big machines

We're moving all of our infrastructure to Google Kubernetes Engine (GKE) - we currently have 50+ AWS machines with lots of APIs, Services, Webapps, Database servers and more.
As we have already dockerized everything, it's time to start moving everything to GKE.
I have a question that may sound too basic, but I've been searching the Internet for a week and did not find any reasonable post about it.
Straight to the point, which of the following approaches is better and why:
Having multiple node pools with multiple machine types and always specify in which pool each deployment should be done; or
Having a single pool with lots of machines and let Kubernetes scheduler do the job without worrying about where my deployments will be done; or
Having BIG machines (in multiple zones to improve clusters' availability and resilience) and let Kubernetes deploy everything there.
A list of considerations, to be taken merely as hints; I do not pretend to describe best practice.
Each pod you add brings some overhead with it, but you gain flexibility and availability, making node failures and maintenance less impactful on production.
Nodes that are too small cause a big waste of resources, since sometimes it will not be possible to schedule a pod even though the total amount of free RAM or CPU across the nodes would be enough; you can think of this issue as similar to memory fragmentation.
I guess that the sizes of your Pods and their memory and CPU requests are not all similar, but I do not see this as a big issue in principle, nor as a reason to go for 1). I do not see why a big Pod should run only on big machines and a small one should be scheduled on small nodes. I would rather use 1) if you need different memory-GB/CPU-core ratios to support different workloads.
I would advise you to run some tests in the initial phase to understand the size of your biggest Pod and the average size of the workload, in order to choose the machine types properly. Note that having one Pod that exactly fits a node, dedicated to it, is not the right way to proceed (virtual machines exist for that kind of scenario), since resource fragmentation would easily make it impossible to schedule such a large Pod.
Consider that workload sizes will likely increase in the future, and that scaling vertically is not always immediate - you need to switch off machines and terminate Pods. I would oversize a bit to take this issue into account, and because scaling horizontally is much easier.
Talking about the machine type, you could decide to go for machines 5x the size of the biggest Pod you have (or 3x? or 10x?). Also oversize the number of nodes in the cluster a bit, to take into account overheads and fragmentation and in order to still have free resources.
Remember that you have a hard limit of 100 pods per node and 5000 nodes per cluster.
Remember that in GCP the network egress throughput cap depends on the number of vCPUs that a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance. However, each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine.
Regarding the prices of the virtual machines, notice that there is no difference in price between buying two machines of size x and one of size 2x. Avoid customising the size of machines, because it is rarely convenient; if you feel your workload needs more CPU or memory, go for a HighMem or HighCpu machine type.
P.S. Since you are going to build a pretty big cluster, check the size of the DNS.
I will add any further considerations that come to my mind; consider updating your question in the future with a description of the path you chose and the issues you faced.
1) makes a lot of sense: if you want, you can still let Kubernetes deployments treat it as one large pool (by not adding a nodeSelector/nodeAffinity), but you can have machines of different sizes, you can think about having a pool of spot instances, etc. And, after all, you can have pools that are tainted and therefore excluded from normal scheduling and available only to a particular set of workloads (see the sketch after item 3 below). In my opinion it is preferable to have some proficiency with this approach from the very beginning, yet with many provisioners it should be very easy to migrate from 2) to 1) anyway.
2) As explained above, it's effectively a subset of 1), so it's better to build up experience with the 1) approach from day one; but if you ensure your provisioning solution supports easy extension to the 1) model, then you can get away with starting with this simplified approach.
3) Big is nice, but "big" is relative. It depends on the requirements and the amount of your workloads. Remember that while you need to plan for the loss of a whole AZ anyway, it will be much more frequent to lose single nodes (reboots, decommissioning of underlying hardware, updates, etc.), so if you have more hosts, the impact of losing one will be smaller. The bottom line is that you need to find your own balance that makes sense for your particular scale. Maybe 50 nodes is too many, would 15 cut it? Who knows but you :)
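To make 1) concrete, here is a sketch of pinning one Deployment to a dedicated pool while everything without a nodeSelector keeps using the whole cluster (the pool name, labels, and image are hypothetical; GKE nodes carry a cloud.google.com/gke-nodepool label with their pool name):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: highmem-api                       # hypothetical memory-heavy workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: highmem-api
  template:
    metadata:
      labels:
        app: highmem-api
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: highmem-pool   # pin this workload to one pool
      containers:
      - name: api
        image: registry.example.com/highmem-api:latest   # placeholder image
        resources:
          requests:
            memory: 8Gi
            cpu: "1"
```

Deployments that omit the nodeSelector are still free to land on any pool, which is what lets approach 1) behave like approach 2) when you want it to.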

Azure Service Fabric reliable collections and memory

Let's say I'm running a Service Fabric cluster on 5 D1 class (1 core, 3.5GB RAM, 50GB SSD) VMs, and that I'm running 2 reliable services on this cluster, one stateless and one stateful. Let's assume that the replica target is 3.
How do I calculate how much my reliable collections can hold?
Let's say I add one or more stateful services. Since I don't really know how the framework distributes services, do I need to take the most conservative approach and assume that all of my stateful services may end up on a single node, and that their cumulative memory needs to be below the RAM available on a single machine?
TLDR - Estimating the expected capacity of a cluster is part art, part science. You can likely get a good lower bound which you may be able to push higher, but for the most part deploying things, running them, and collecting data under your workload's conditions is the best way to answer this question.
1) In general, the collections on a given machine are bounded by the amount of available memory or the amount of available disk space on a node, whichever is lower. Today we keep all data in the collections in memory and also persist it to disk. So the maximum amount that your collections across the cluster can hold is generally (Amount of available memory in the cluster) / (Target Replica Set Size).
Note that "Available Memory" is whatever is left over from other code running on the machines, including the OS. In your above example though you're not running across all of the nodes - you'll only be able to get 3 of them. So, (unrealistically) assuming 0 overhead from these other factors, you could expect to be able to put about 3.5 GB of data into that stateful service replica before you ran out of memory on the nodes on which it was running. There would still be 2 nodes in the cluster left empty.
Let's take another example. Let's say that it is about the same as your example above, except in this case you set up the stateful service to be partitioned. Let's say you picked a partition count of 5. So now on each node, you have a primary replica and 2 secondary replicas from other partitions. In this case, each partition would only be able to hold a maximum of around 1.16 GB of state, but now overall you can pack 5.83 GB of state into the cluster (since all nodes can now be utilized fully). Incidentally, just to prove out the math works, that's (3.5 GB of memory per node * 5 nodes in the cluster) [17.5] / (target replica set size of 3) = 5.83.
In all of these examples, we've also assumed that memory consumption for all partitions and all replicas is the same. A lot of the time that turns out to not be true (at least temporarily) - some partitions can end up with more or less work to do and hence have uneven resource consumption. We also assumed that the secondaries were always the same as the primaries. In the case of the amount of state, it's probably fair to assume that these will track fairly evenly, though for other resource consumption it may not (just something to keep in mind). In the case of uneven consumption, this is really where the rest of Service Fabric's Cluster Resource Management will help, since we can come to know about the consumption of different replicas and pack them efficiently into the cluster to make use of the available space. Automatic reporting of consumption of resources related to state in the collections is on our radar and something we want to do, so in the future, this would be automatic but today you'd have to report this consumption on your own.
2) By default, we will balance the services according to the default metrics (more about metrics is here). So by default, the different replicas of those two different services could end up on the same machine, but in your example you'll end up with 4 nodes with 1 replica from a service on them and then 1 node with two replicas from the two different services. This means that each service (each with 1 partition, as per your example) would only be able to consume 1.75 GB of memory, for a total of 3.5 GB in the cluster. This is again less than the total available memory of the cluster, since there are some portions of nodes that you're not utilizing.
Note that this is the maximum possible consumption, presuming no consumption outside the service itself. Taking this as your maximum is not advisable. You'll want to reduce it for several reasons, but the most practical reason is to ensure that in the presence of upgrades and failures there's sufficient available capacity in the cluster. As an example, let's say that you have 5 Upgrade Domains and 5 Fault Domains. Now let's say that a fault domain's worth of nodes fails while you have an upgrade going on in an upgrade domain. This means that (a little less than) 40% of your cluster capacity can be gone at any time, and you probably want enough room left over on the remaining nodes to continue. This means that if your cluster previously could hold 5.83 GB of state (from our prior calculations), in reality you probably don't want to put more than about 3.5 GB of state in it, since with more than that the service may not be able to get back to 100% healthy (note also that we don't build replacement replicas immediately, so the nodes would have to be down for your ReplicaRestartWaitDuration before you ran into this case). More information about metrics, capacity, buffered capacity (which you can use to ensure that room is left on nodes for the failure cases), and fault and upgrade domains is covered in this article.
There are some other things that practically will limit the amount of state you'll be able to store. You'll want to do several things:
Estimate the size of your data. You can make a reasonable estimate up-front of how big your data is by calculating the size of each field your objects hold. Be sure to take into consideration 64-bit references. This will give you a lower-bound starting point.
Storage overhead. Each object you store in a collection will come with some overhead for storing that object. In the reliable collections depending on the collection and the operations currently in flight (copy, enumerations, updates, etc.) this overhead can range from between 100 and around 700 bytes per item (row) stored in the collections. Do know also that we're always looking for ways to reduce the amount of overhead we introduce.
We also strongly recommend running your service over some period of time and measuring actual resource consumption via performance counters. Simulating some sort of real workload and then measuring the actual usage of the metrics you care about will serve you pretty well. The reason we recommend this in particular is that you will be able to see consumption from things like which CLR object heap your objects end up placed in, how often GC is running, if there's leaks, or other things like this which will impact the amount of memory you can actually utilize.
I know that this has been a long answer but I hope you find it helpful and complete.