Can Apache Mesos 'master' nodes be co-located on the same machine as Mesos 'slave' nodes? Similarly (for high-availability (HA) deploys), can the Apache Zookeeper nodes used in Mesos 'master' election be deployed on the same machines as Mesos 'slave' nodes?
Mesos recommends 3 'masters' be used for HA deploys, and ZooKeeper recommends 5 nodes be used for its quorum election system. It would be nice to have these services running alongside Mesos 'slave' processes instead of committing 8 machines to effectively 'non-productive' tasks.
If such a setup is feasible, what are the pros/cons of such a setup?
Thanks!
You can definitely run a master, slave, and zk process all on the same node. You can even run multiple master and slave processes on the same node, provided you give them each unique ports, but that's only useful for a test cluster.
Typically we recommend running ZK on the same nodes as your masters, but if you have extra ZKs, you can certainly run them on slaves, or mix-and-match as you see fit, as long as all master/slave/framework nodes can reach the ZK nodes, and all slaves can reach the masters.
For a smaller cluster (<10 nodes) it could make sense to run a slave process on each master, especially since the standby masters won't be doing much. Even an active master for a small cluster uses only a small amount of cpu, memory, and network resources. Just make sure you adjust the --resources on that slave to account for the master's resource usage.
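As a rough sketch (hostnames and sizes here are placeholders, not recommendations), the slave on a master machine might be started with something like:

# Illustrative only: advertise fewer resources on this agent so the
# co-located master/ZK processes keep some headroom for themselves.
mesos-slave \
  --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --work_dir=/var/lib/mesos \
  --resources='cpus:7;mem:14336'   # e.g. on an 8-core / 16 GB box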
Once your cluster grows larger (especially >100 nodes) the network traffic to/from the master as well as its cpu/memory utilization becomes significant enough that you wouldn't want to run a mesos slave on the same node as the master. It should be fine to co-locate ZK with your master even at large scale.
You didn't specifically ask, but I'll also discuss where to run your framework schedulers (e.g. Spark, Marathon, or Chronos). These could be co-located with any of the other components, but they only really need to be able to reach the master and zk nodes, since all communication to slaves goes through the master. Some customers run the schedulers on master nodes, some run them on edge nodes (so users don't have access to the slaves), and others use meta-frameworks like Marathon to run other schedulers on slaves as Mesos tasks.
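As a hedged illustration of the meta-framework approach, a scheduler can be submitted to Marathon's /v2/apps REST endpoint like any other long-running app; the Marathon URL, scheduler binary, and resource sizes below are placeholders:

# Hypothetical example: run another framework's scheduler as a Marathon app.
curl -X POST http://marathon.example.com:8080/v2/apps \
  -H 'Content-Type: application/json' \
  -d '{
        "id": "/my-scheduler",
        "cmd": "/path/to/my-scheduler --master zk://zk1:2181/mesos",
        "cpus": 1,
        "mem": 1024,
        "instances": 1
      }'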
Related
I want to understand the possible impact of a master node failure in a k8s cluster that has only one master node with an internal etcd store.
As I understand it, all workload containers already running on worker nodes (including those from stateless Deployments and StatefulSets with persistent volume claims) would keep running until some container needs to be recreated, since they have no direct functional dependency on the master node or the etcd store for their core functions. The unavailability of the master node only affects control-plane operations for the cluster.
Is my understanding correct? If not, could you please explain the impact of the master node failure on my workload running on that cluster?
I understand that the best way to achieve HA for k8s cluster is to set up a multi-master cluster with possibly externalizing etcd stores also for decoupling of them. This question is to understand the exact impact of the master node failure to take an informed call before configuring a multi-master cluster.
etcd operates on a quorum system, so as long as the cluster still sees a majority it will continue operating. If the failed node was the current leader, the remaining members trigger an election after the heartbeat timeout.
For kube-apiserver, it's a horizontally scaled, stateless service, so losing one instance is not a problem, just like any other web app behind a load balancer. Most controllers are singletons, but a copy runs on every control-plane node and uses kube-apiserver for leader election, so, as with etcd, if the leader dies then a few seconds later another copy acquires the leader lock and takes over.
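For example (a sketch assuming a kubeadm-style cluster with etcd v3 and a recent Kubernetes release; the endpoints and certificate paths are placeholders), you can inspect both leader-election mechanisms:

# See which etcd member is currently the leader (IS LEADER column).
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status -w table

# See which control-plane node currently holds the controller-manager
# and scheduler leader locks (Lease objects in kube-system).
kubectl -n kube-system get leases kube-controller-manager kube-scheduler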
I see that most of the Kubernetes master components have a leader election process, except the API server. If only one node can be the leader at any point in time, why would we need more than 3 masters for a bigger k8s cluster?
The requirement of a minimum of 3 hosts comes from the fact that a Kubernetes HA cluster uses etcd for storing and syncing configuration, and etcd requires a minimum of 3 nodes to ensure HA. etcd needs a majority (quorum) of members to agree on writes, so a cluster of 2n+1 members tolerates the loss of n members: 3 members tolerate 1 failure, 5 tolerate 2, and so on.
In a single-master setup, the master node manages the etcd database, API server, controller manager and scheduler, along with the worker nodes. If that single master node fails, the cluster can no longer be managed: already-running pods keep running, but nothing new can be scheduled and failed workloads cannot be rescheduled until the master is restored.
A multi-master setup, by contrast, provides high availability for a single cluster: the control plane keeps working as long as a majority of masters is healthy, and API traffic can be spread across all of the masters.
A multi-master setup protects against a wide range of failure modes, from the loss of a single worker node to the failure of a master node's etcd service. By providing redundancy, a multi-master cluster offers a highly available system to your end users.
Do not use a cluster with two master replicas. Consensus on a two-replica cluster requires both replicas to be running when changing persistent state. As a result, both replicas are needed, and a failure of either replica turns the cluster into a majority-failure state. A two-replica cluster is thus inferior, in terms of HA, to a single-replica cluster.
Here is some useful documentation: kubernetes-ha-cluster, creating-ha-cluster.
Articles: ha-cluster, ha.
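As a rough sketch of how such a stacked-etcd HA control plane is typically bootstrapped with kubeadm (the load-balancer address is a placeholder, and the real join command and certificate key are printed by kubeadm init):

# On the first control-plane node: point the API server at a load balancer
# and upload the control-plane certificates for the other masters to fetch.
kubeadm init \
  --control-plane-endpoint "lb.example.com:6443" \
  --upload-certs

# On each additional control-plane node, use the join command printed above:
kubeadm join lb.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <key>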
A few days ago, I looked into why no pods were being scheduled to the master node, and found this question: Allow scheduling of pods on Kubernetes master?
It explains that this is because the master node is tainted with the "NoSchedule" effect, and gives the command to remove that taint.
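The command given there is roughly the following (the exact taint key depends on the Kubernetes version; newer releases use node-role.kubernetes.io/control-plane instead of .../master):

# Remove the NoSchedule taint from all master/control-plane nodes.
kubectl taint nodes --all node-role.kubernetes.io/master-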
But before I execute that command on my cluster, I want to understand why it was there in the first place.
Is there a reason why the master node should not run pods? Any best-practices it relates to?
The purpose of Kubernetes is to deploy applications easily and scale them based on demand. The pod is the basic unit that runs the application, and the number of pods can be increased or decreased based on high or low demand (Horizontal Pod Autoscaler).
These workload pods need to run on worker nodes, especially if you're looking at a big application where your cluster might scale up to hundreds of nodes based on demand (Cluster Autoscaler). The growing number of pods puts pressure on your nodes, and when it does, you can always add worker nodes to the cluster using the cluster autoscaler.
Suppose you made your master schedulable: high memory and CPU pressure then puts the master at risk of crashing, and mind you, you can't autoscale the master using the autoscaler. That way you're putting your whole cluster at risk. If you have a single master, you will not be able to schedule anything if the master crashes. If you have 3 masters and one of them crashes, the other two have to take over the extra load of scheduling and managing worker nodes, which increases the load on them and hence the risk of further failure.
Also, in the case of a larger cluster, you already need master nodes with substantial resources just to manage your worker nodes, so you can't put the additional load of running workloads on the master nodes as well. Please have a look at setting up large clusters in Kubernetes here.
If you have a manageable workload and you know it won't grow beyond a certain level, you can make the master schedulable. However, for a production cluster it is not recommended at all.
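If you do decide to run a specific workload there, a less invasive alternative to removing the taint entirely is to add a toleration to just that pod. A minimal sketch, assuming the classic node-role.kubernetes.io/master taint key (newer Kubernetes versions use node-role.kubernetes.io/control-plane) and a placeholder image:

# Let one specific pod tolerate the master taint instead of untainting
# the node for every workload.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: runs-on-master-demo
spec:
  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: demo
    image: nginx:1.25        # placeholder image
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
EOF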
The primary role of the master is cluster management, and many Kubernetes components already run on it. If pods were scheduled on the master without resource limits and consumed all its resources (CPU or memory), then the master, and in turn the whole cluster, would be at risk.
So when designing a highly available production cluster, a minimum of 3 master, 3 etcd, and 3 infra nodes are created, and application pods are not scheduled on these nodes. Separate worker nodes are added to carry the workload.
The master is intended for cluster management tasks and should not be used to run workloads. In development and test environments it is OK to schedule pods on master servers, but in production it is better to keep the master only for cluster-level management activities. Use worker nodes to schedule workloads.
I am running a Kubernetes cluster with 3 masters and 3 worker nodes.
I have found this, which auto-scales worker nodes based on pod status:
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws
But, I couldn't find any blog or add-on to auto-scale master nodes.
Is there any reason to auto-scale master nodes, if yes how can we do that?
There is no need to autoscale the master nodes. In practical terms, your worker nodes' responsibility is to run your workload, and your master nodes' responsibility is to make sure the worker nodes maintain the desired state of the cluster.
Now, end users send requests to your application (pods), and as the load increases the application needs to scale horizontally, so more pods should be spawned. If the resources on the worker nodes are insufficient to run those pods, more worker nodes should be spawned.
In a large cluster we do not run workloads on the master nodes, but we do need to make sure they are highly available so that there is no single point of failure for orchestrating the worker nodes. For that we can have a 3-master, multi-master cluster in place.
For worker nodes we worry about horizontal scalability, and for master nodes we worry about high availability.
But when building a large cluster, you need to provide adequate resources to the master nodes so they can handle the orchestration of the load on the worker nodes.
For more information on building large clusters, please refer to the official documentation:
https://kubernetes.io/docs/setup/cluster-large/
In a nutshell, you can even have one master for 1000 worker nodes if you provide enough resources to that node. So there is little reason to autoscale the masters compared to the challenges we would face in doing so.
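For reference, the worker-side autoscaling from the project linked in the question is configured per node group; a hedged sketch of the main cluster-autoscaler flags on AWS (the ASG name and bounds are placeholders, and in practice these are container args in a Deployment):

# Typical cluster-autoscaler flags for AWS; scales the worker ASG
# between 3 and 10 nodes based on pending pods.
cluster-autoscaler \
  --cloud-provider=aws \
  --nodes=3:10:my-worker-nodes-asg \
  --balance-similar-node-groups \
  --skip-nodes-with-system-pods=false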
I'm reading the Mesos Architecture docs which, ironically, don't actually specify which components are supposed to run on which VMs/physicals.
It looks like, to run Mesos in HA, you need several categories of components:
Mesos Masters
ZooKeeper instances (quorum)
Hadoop clusters (job nodes? name nodes?)
But there's never any mention of how many you need of each type.
So I ask: How many VMs/physicals do you need to run Mesos with HA, and what components should be deployed to each?
Did you have a look at the HA docs? To run Mesos in HA, you'll need the Mesos Masters and ZooKeeper. Any Hadoop-related configurations are out of scope for Mesos HA itself.
To have an HA setup, you'll need an odd number of nodes for the Masters and for ZooKeeper (because of the quorum mechanism). In our case, we're running 3 Master and 3 ZooKeeper nodes on 3 machines (one Master and one ZooKeeper instance per machine), and a number of Mesos Slaves/Agents on different machines.
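As a minimal sketch (hostnames are placeholders), each of the three machines would run a master pointed at the ZooKeeper ensemble, with a quorum of 2 out of 3:

# Run on each of the three master machines.
# quorum=2 means the masters keep working as long as 2 of the 3 are up.
mesos-master \
  --zk=zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/mesos \
  --quorum=2 \
  --work_dir=/var/lib/mesos \
  --cluster=my-ha-cluster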
Theoretically, the Slaves/Agents can run on the same machines as the Masters/ZooKeepers as well. I guess this is a matter of preference, the availability of machines, and your SLA needs.
If you want to run a large-scale production setup, it will probably make a lot of sense to separate even the Master and ZooKeeper instances onto different machines.
Further references:
http://mesos.apache.org/documentation/latest/operational-guide/
http://mesos.apache.org/documentation/latest/configuration/ (see "Master Options")