Distribute children elements or the top-level parent elements among monitoring hosts - distributed-computing

I've a monitoring system that periodically heartbeats compute resources assigned to my company's customers.
By default, each customer is assigned one physical server, but some of the bigger customers are assigned multiple physical servers.
The monitoring fleet has a bunch of servers running a monitoring software. As the number of customers with >1 physical servers are increasing, I am worried about uneven distribution of load, noisy neighbor problems etc.
However, I am unsure which of the following approaches I should adopt:
Distribute individual servers among the monitoring hosts. As a result, servers assigned to single customer may be monitored by different hosts. Note that we don't save any server state in the RAM of the monitoring host, so this distribution will not affect the correctness of my monitoring logic.
Distribute individual clusters among the monitoring hosts and use a load balancing algorithm with weight = size of clusters.
Any help is appreciated.
[Note: I am sorry if this is not the right place for this question]

Related

Multiple node pools vs single pool with many machines vs big machines

We're moving all of our infrastructure to Google Kubernetes Engine (GKE) - we currently have 50+ AWS machines with lots of APIs, Services, Webapps, Database servers and more.
As we have already dockerized everything, it's time to start moving everything to GKE.
I have a question that may sound too basic, but I've been searching the Internet for a week and did not found any reasonable post about this
Straight to the point, which of the following approaches is better and why:
Having multiple node pools with multiple machine types and always specify in which pool each deployment should be done; or
Having a single pool with lots of machines and let Kubernetes scheduler do the job without worrying about where my deployments will be done; or
Having BIG machines (in multiple zones to improve clusters' availability and resilience) and let Kubernetes deploy everything there.
List of consideration to be taken merely as hints, I do not pretend to describe best practice.
Each pod you add brings with it some overhead, but you increase in terms of flexibility and availability making failure and maintenance of nodes to be less impacting the production.
Nodes too small would cause a big waste of resources since sometimes will be not possible to schedule a pod even if the total amount of free RAM or CPU across the nodes would be enough, you can see this issue similar to memory fragmentation.
I guess that the sizes of PODs and their memory and CPU request are not similar, but I do not see this as a big issue in principle and a reason to go for 1). I do not see why a big POD should run merely on big machines and a small one should be scheduled on small nodes. I would rather use 1) if you need a different memoryGB/CPUcores ratio to support different workloads.
I would advise you to run some test in the initial phase to understand which is the size of the biggest POD and the average size of the workload in order to properly chose the machine types. Consider that having 1 POD that exactly fit in one node and assign to it is not the right to proceed(virtual machine exist for this kind of scenario). Since fragmentation of resources would easily cause to impossibility to schedule a large node.
Consider that their size will likely increase in the future and to scale vertically is not always this immediate and you need to switch off machine and terminate pods, I would oversize a bit taking this issue into account and since scaling horizontally is way easier.
Talking about the machine type you can decide to go for a machine 5xsize the biggest POD you have (or 3x? or 10x?). Oversize a bit as well the numebr of nodes of the cluster to take into account overheads, fragmentation and in order to still have free resources.
Remember that you have an hard limit of 100 pods each node and 5000 nodes.
Remember that in GCP the network egress throughput cap is dependent on the number of vCPUs that a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance. However each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine.
Regarding the prices of the virtual machines notice that there is no difference in price buying two machines with size x or one with size 2x. Avoid to customise the size of machines because rarely is convenient, if you feel like your workload needs more cpu or mem go for HighMem or HighCpu machine type.
P.S. Since you are going to build a pretty big Cluster, check the size of the DNS
I will add any consideration that it comes to my mind, consider in the future to update your question with the description of the path you chose and the issue you faced.
1) makes a lot of sense as if you want, you can still allow kube deployments treat it as one large pool (by not adding nodeSelector/NodeAffinity) but you can have different machines of different sizes, you can think about having a pool of spot instances, etc. And, after all, you can have pools that are tainted and so forth excluded from normal scheduling and available to only a particular set of workloads. It is in my opinion preferred to have some proficiency with this approach from the very beginning, yet in case of many provisioners it should be very easy to migrate from 2) to 1) anyway.
2) As explained above, it's effectively a subset of 1) so better to build up exp with 1) approach from day 1, but if you ensure your provisioning solution supports easy extension to 1) model then you can get away with starting with this simplified approach.
3) Big is nice, but "big" is relative. It depends on the requirements and amount of your workloads. Remember that while you need to plan for loss of a whole AZ anyway, it will be much more frequent to loose single nodes (reboots, decommissions of underlying hardware, updates etc.) so if you have more hosts, impact of loosing one will be smaller. Bottom line is that you need to find your own balance, that makes sense for your particular scale. Maybe 50 nodes is too much, would 15 cut it? Who knows but you :)

Single Kubernetes/OpenShift cluster/instance across datacenters?

With the understanding that Ubernetes is designed to fully solve this problem, is it currently possible (not necessarily recommended) to span a single K8/OpenShift cluster across multiple internal corporate datacententers?
Additionally assuming that latency between data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
Example: Given 3 corporate DC's, deploy 1..* masters at each datacenter (as a single cluster) and have 1..* nodes at each DC with pods/rc's/services/... being spun up across all 3 DC's.
Has someone implemented something like this as a stop gap solution before Ubernetes drops and if so, how has it worked and what would be some considerations to take into account on running like this?
is it currently possible (not necessarily recommended) to span a
single K8/OpenShift cluster across multiple internal corporate
datacententers?
Yes, it is currently possible. Nodes are given the address of an apiserver and client credentials and then register themselves into the cluster. Nodes don't know (or care) of the apiserver is local or remote, and the apiserver allows any node to register as long as it has valid credentials regardless of where the node exists on the network.
Additionally assuming that latency between data centers is relatively
low and that infrastructure across the corporate data centers is
relatively consistent.
This is important, as many of the settings in Kubernetes assume (either implicitly or explicitly) a high bandwidth, low-latency network between the apiserver and nodes.
Example: Given 3 corporate DC's, deploy 1..* masters at each
datacenter (as a single cluster) and have 1..* nodes at each DC with
pods/rc's/services/... being spun up across all 3 DC's.
The downside of this approach is that if you have one global cluster you have one global point of failure. Even if you have replicated, HA master components, data corruption can still take your entire cluster offline. And a bad config propagated to all pods in a replication controller can take your entire service offline. A bad node image push can take all of your nodes offline. And so on. This is one of the reasons that we encourage folks to use a cluster per failure domain rather than a single global cluster.

In Oracle RAC, will an application be faster, if there is a subset of the code using a separate Oracle service to the same database?

For example, I have an application that does lots of audit trails writing. Lots. It slows things down. If I create a separate service on my Oracle RAC just for audit CRUD, would that help speed things up in my application?
In other words, I point most of the application to the main service listening on my RAC via SCAN. I take the subset of my application, the audit trail data manipulation, and point it to a separate service listening but pointing same schema as the main listener.
As with anything else, it depends. You'd need to be a lot more specific about your application, what services you'd define, your workloads, your goals, etc. Realistically, you'd need to test it in your environment to know for sure.
A separate service could allow you to segregate the workload of one application (the one writing the audit trail) from the workload of other applications by having different sets of nodes in the cluster running each service (under normal operation). That can help ensure that the higher priority application (presumably not writing the audit trail) has a set amount of hardware to handle its workload even if the lower priority thread is running at full throttle. Of course, since all the nodes are sharing the same disk, if the bottleneck is disk I/O, that segregation of workload may not accomplish much.
Separating the services on different sets of nodes can also impact how frequently a particular service is getting blocks from the local node's buffer cache rather than requesting them from the other node and waiting for them to be shipped over the interconnect. It's quite possible that an application that is constantly writing to log tables might end up spending quite a bit of time waiting for a small number of hot blocks (such as the right-most block in the primary key index for the log table) to get shipped back and forth between different nodes. If all the audit records are being written on just one node (or on a smaller number of nodes), that hot block will always be available in the local buffer cache. On the other hand, if writing the audit trail involves querying the database to get information about a change, separating the workload may mean that blocks that were in the local cache (because they were just changed) are now getting shipped across the interconnect, you could end up hurting performance.
Separating the services even if they're running on the same set of nodes may also be useful if you plan on managing them differently. For example, you can configure Oracle Resource Manager rules to give priority to sessions that use one service over another. That can be a more fine-grained way to allocate resources to different workloads than running the services on different nodes. But it can also add more overhead.

Apache Zookeeper: distribution of nodes across data centers

I am working on a brand new SolrCloud - ZooKeeper infrastructure.
Some background information:
all other services (mostly web site infrastructure) are distributed across two data centers, with active-active configurations.
at the network level, the servers are setup on extended LANs, with dark fibre across the data centers. So latency is at a minimum.
the SolrCloud - ZooKeeper infrastructure will be used by most of these applications.
I got a SolrCloud, and a ZooKeeper ensemble running. Implementation at this level is fine.
But I wonder how to distribute my ZooKeeper servers. I must have an odd number of servers, but I only have two data centers. If one fails, I have a 50-50 chance that I will lose majority.
What should I do? So far I have thought of:
requesting a third data center (not likely to happen, $$$!)
host two per data center and two on an external cloud provider (Amazon or ...?). Again $$$
set up an odd number at data center 1 and use an observer on site 2. What then happens if site 1 fails? Can SolrCloud work with only one observer?
If your requirement is to serve all search requests from a local data center (at which request was origin) then you don’t need to go for a cross data center ZooKeeper deployment.
Because a cross data center ZooKeeper deployment is only needed to survive a DC crash (it is most likely not going to happen, and that's why you pay $$$$), so in that case there isn't any need to spawn a ZooKeeper cluster in multiple data centers.
I got a third site to host the other ZooKeeper instance. This site is another office of my company, not a "full data center". So each site has one ZooKeeper instance.
What allowed me to have one cluster spread over three data centers was that they are close enough together to get a dark fiber between them. The latency is very low and does not impact ZooKeeper performance.
Then for Solr, I got full replicas on the two main data centers. The third office only hosts a ZooKeeper for quorum. Using full replicas, I have all the data in each data center. If my Solr needs to increase later, I will shard, but for now our index is small.
It has proven solid for four years now, with one failure. And it was at the third office, not in a data center.

Sharding vs DFS

As far as I understand sharding (e.g in MongoDB) and distributed file systems (e.g. HDFS in HBase or HyperTable) are different mechanisms that databases use to scale-out, however I wonder how do they compare?
Traditional sharding involves breaking tables into a small number of pieces and running each piece (or "shard") in a separate database on a separate machine. Because of the large shard size, this mechanism can be prone to imbalances due to hot spots and unequal growth as was evidenced by the Foursquare incident. Also, because each shard is run on a separate machine, these systems can experience availability problems if one of the machines goes down. To mitigate this problem, most sharding systems, including MongoDB, implement replica groups. Each machine is replaced by a set of three machines in a master plus two slaves configuration. This way if a machine goes down, there are two remaining replicas to serve the data. There are a couple of problems with this design: First, if a replica fails in a replica group, and the group is only left with two members, to bring the replication count back to three, the data on one of these two machines needs to be cloned. Since there are only two machines in the entire cluster that can be used to re-create the replica, there will be enormous drag on one of these two machines while re-replication is taking place, causing serious performance problems on the shard in question (it takes over two hours to copy 1TB over a gigabit link). The second problem is that when one of the replicas goes down, it needs to be replaced with a new machine. Even if there is plenty of spare capacity across the cluster to resolve the replication problem, that spare capacity cannot be used to rectify the situation. The only way to solve it is to replace the machine. This becomes very challenging from an operational standpoint as cluster sizes grow up into the hundreds or thousands of machines.
The Bigtable+GFS design solves these problems. First, the table data is broken down into much finer grained "tablets". A typical machine in a Bigtable cluster will often have 500+ tablets. If an imbalance occurs, resolving it is just a simple matter of migrating a small number of tablets from one machine to another. If a TabletServer goes down, because the data set is broken down and replicated with such fine granularity, there can be hundreds of machines that participate in the recovery process, which distributes the recovery burden and speeds recovery time. Also, because the data is not tied to a specific machine or machines, the spare capacity on all machines in the cluster can be applied to the failure. There is no operational requirement to replace the machine since any of the spare capacity throughout the cluster can be used to rectify replication imbalance.
Doug Judd
CEO, Hypertable Inc.