How do I set up an Active/Passive environment with two nodes in OpenShift? - kubernetes

I am trying to configure an Active/Passive cluster with two nodes (using OpenShift). The second, passive node should be a hot standby: it is up and running but does nothing until the first node dies. Then the passive node becomes active and a new passive node is started.
I have read the High Availability documentation, but it seems to cover only the theory. It also seems like overkill (I suspect there is an easier way to meet my goal).
Where would I start?

What you are asking for goes against the usual practice for Kubernetes/OpenShift. You wouldn't have hot standby nodes; you would always use all nodes in the cluster. You would then allow enough additional capacity in your cluster that losing a node doesn't cause a problem, because the other nodes have enough capacity to run the applications. In this scenario the Kubernetes scheduler automatically restarts any applications that were on the failed node on the other nodes in the cluster, without you needing to perform any explicit failover steps.
So don't try to do anything special: set up your cluster with the two nodes, with applications distributed across both. If you need to be able to run on a single node, make sure that node has enough capacity to run everything. If over time you add more applications and one node is no longer enough, add a third node, with all three in use in the normal case. You can then again handle the failure of a single node.
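The capacity rule above comes down to simple arithmetic. Here is a rough sketch of it, assuming identical nodes and a single scalar resource such as CPU cores; the function names are made up for illustration and are not part of any Kubernetes or OpenShift API.

```python
def survives_single_node_failure(node_capacity, total_demand, node_count):
    """Return True if the cluster can still run everything after losing one node.

    Assumes identical nodes and one scalar resource (e.g. CPU cores).
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    remaining_capacity = node_capacity * (node_count - 1)
    return total_demand <= remaining_capacity


def nodes_needed(node_capacity, total_demand):
    """Smallest node count that tolerates the loss of any one node."""
    if node_capacity <= 0:
        raise ValueError("node_capacity must be positive")
    # Ceiling division: nodes required to carry the load with nothing failed.
    full_nodes = -(-total_demand // node_capacity)
    # One extra node's worth of capacity covers a single failure.
    return max(full_nodes, 1) + 1
```

With 8-core nodes, a 6-core workload fits on two nodes and survives the loss of either; once the workload grows to 10 cores, a third node is needed.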

Related

Fair distribution of verticles on a Vert.x cluster of nodes

I've been experimenting with Vert.x high availability features to test horizontal scalability and resiliency. I have a cluster of several nodes based on Hazelcast. I'm creating verticles on any nodes via an HTTP API. Verticles have the HA flag set when they are created.
Testing scalability
If I have n nodes Nn loaded with HA-verticles and I add one additional node, no verticle is migrated from the existing nodes to the new one to balance the load. Is there a way to tell Vert.x to do so? I believe it's not so simple...
Testing resilience
If I have n nodes Nn loaded with HA-verticles and I kill one of the nodes, all the verticles from that node are migrated, but they all go to a single remaining node, which is not always the least loaded one. That destination node may become overloaded, putting the whole cluster at risk of a freeze or crash. Same question as before: is there a way to force Vert.x to balance the restarted verticles across all nodes, or at least to place them on the least loaded node?
Your observations are correct, there is no way:
to distribute verticles from a failed node over the rest of the nodes
to prevent starting verticles in a node that is already loaded
Improving the HA features is not on the Vert.x roadmap.
If, as it seems, you need more than basic failover, I would recommend using specialized infrastructure tools that can leverage information from monitoring systems and start/stop nodes as needed.
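For illustration only, here is a minimal sketch of the placement policy the question asks for: spreading a failed node's verticles over the survivors, always picking the least loaded one. This is not a Vert.x feature or API. The function name, the load numbers, and the idea of a per-verticle "cost" are all assumptions; an external tool would have to supply them from its own monitoring data.

```python
import heapq


def redistribute(orphaned, node_loads):
    """Assign each orphaned verticle to the currently least-loaded surviving node.

    orphaned:   list of (verticle_name, cost) tuples from the failed node
    node_loads: dict mapping surviving node name -> current load
    Returns a dict mapping verticle_name -> chosen node.
    """
    if not node_loads:
        raise ValueError("no surviving nodes")
    # Min-heap keyed on load, so the least-loaded node is always popped first.
    heap = [(load, node) for node, load in node_loads.items()]
    heapq.heapify(heap)
    placement = {}
    # Place the most expensive verticles first for a more even result.
    for name, cost in sorted(orphaned, key=lambda item: item[1], reverse=True):
        load, node = heapq.heappop(heap)
        placement[name] = node
        heapq.heappush(heap, (load + cost, node))
    return placement
```

This is a plain greedy heuristic. It does not guarantee an optimal balance, but it avoids piling every restarted verticle onto one node.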

Kubernetes different container args depending on number of pods in replica set

I want to scale an application with workers.
There could be 1 worker or 100, and I want to scale them seamlessly.
The idea is to use a ReplicaSet. However, for domain-specific reasons, the appropriate way to scale them is for each worker to know its ID and the total number of workers.
For example, in case I have 3 workers, I'd have this:
id:0, num_workers:3
id:1, num_workers:3
id:2, num_workers:3
Is there a way of using kubernetes to do so?
I pass this information in command line arguments to the app, and I assume it would be fine having it in environment variables too.
It's ok on size changes for all workers to be killed and new ones spawned.
Before giving the Kubernetes-specific answer, I want to point out that the problem seems to be an attempt to push cluster coordination down into the app, which is almost by definition harder than using a distributed system primitive designed for that task. For example, if every new worker registers itself in etcd, the workers can watch keys to detect changes, so no one needs to destroy a running application just to update its list of peers, their contact information, their capacity, current workload, or whatever other information you would find useful while building a distributed worker system.
But, on with the show:
If you want stable identifiers, then a StatefulSet is the modern answer to that. Whether it is an exact fit for your situation depends on whether (for your problem domain) id:0 being "rebooted" still counts as id:0, or whether the fact that it has stopped and started now disqualifies it from being id:0.
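StatefulSet pods are named with the StatefulSet name followed by a dash and an ordinal (for example worker-0, worker-1), and inside the pod the hostname matches the pod name. So a worker could derive its ID roughly as follows; the function name is hypothetical, and the total number of workers still has to come from somewhere else.

```python
def ordinal_from_hostname(hostname):
    """Extract the trailing ordinal from a StatefulSet pod name such as 'worker-2'."""
    name, sep, suffix = hostname.rpartition("-")
    if not sep or not name or not suffix.isdecimal():
        raise ValueError(f"not a StatefulSet-style pod name: {hostname!r}")
    return int(suffix)
```

Inside a pod this would typically be called as ordinal_from_hostname(socket.gethostname()).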
The running list of cluster size is tricky. If you are willing to be flexible in the launch mechanism, then you can have a pre-launch binary populate the environment right before spawning the actual worker (that example is for reading from etcd directly, but the same principle holds for interacting with the kubernetes API, then launching).
You could do that same trick in a more static manner by having an initContainer write the current state of affairs to a file, which the app would then read in. Or, due to all Pod containers sharing networking, the app could contact a "sidecar" container on localhost to obtain that information via an API.
So far so good, except for the "on size changes for all workers to be killed and new ones spawned" part.
The best answer I have for that requirement is that if the app must know its peers at launch time, then I am pretty sure you have left the realm of "scale $foo --replicas=5" and entered into the "destroy the peers and start all afresh" realm, with kubectl delete pods -l some-label=of-my-pods; which is, thankfully, what updateStrategy: type: OnDelete does, when combined with the delete pods command.
In the end, I tried something different. I used the Kubernetes API to get the number of running pods with the same label. This is Python code using the Kubernetes Python client.
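import socket
from kubernetes import client
from kubernetes import config
# Authenticate with the service account credentials mounted into the pod.
config.load_incluster_config()
v1 = client.CoreV1Api()
# The pod's own namespace is exposed by the service account mount.
with open(
    '/var/run/secrets/kubernetes.io/serviceaccount/namespace',
    'r'
) as f:
    namespace = f.readline().strip()
# Collect the names of all pods carrying the worker label.
workers = []
for pod in v1.list_namespaced_pod(
    namespace,
    watch=False,
    label_selector="app=worker"
).items:
    workers.append(pod.metadata.name)
# Every pod sorts the same list, so each one derives a distinct ID.
workers.sort()
num_workers = len(workers)
# Inside a pod the hostname equals the pod name.
worker_id = workers.index(socket.gethostname())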
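The ID assignment in that snippet depends only on the sorted list of pod names, so it can be pulled out into a pure function and tested without a cluster. A small sketch follows; the function name is made up for illustration.

```python
def assign_worker_id(pod_names, own_name):
    """Return (worker_id, num_workers) for own_name within the sorted pod_names."""
    workers = sorted(pod_names)
    if own_name not in workers:
        raise ValueError(f"{own_name!r} is not among the listed pods")
    return workers.index(own_name), len(workers)
```

Two caveats, as far as I can tell. The IDs are only consistent while every worker sees the same set of pods, so a worker started during a scale-up may briefly compute a different num_workers than its peers. Also, a label selector alone matches pods in any phase, so pending or terminating pods are counted unless they are filtered out.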

ejabberd cluster: Multi-master or Master-slave

So far what I've come across is this -
Setting up an ejabberd cluster in a master-slave configuration: there would be a single point of failure, and people have reported that even after fixing a failed master the cluster doesn't become operable again. Sometimes the ejabberd instance on every slave has to be revisited to get it working properly, or mnesia commands have to be entered again to make the master communicate with the slaves.
Setting up an ejabberd cluster in a multi-master configuration: any of the nodes can be taken out of the cluster without bringing the whole cluster down. There is no single point of failure, and this is also the way the official ejabberd documentation tells you to do it, via the join_cluster argument exposed in the ejabberdctl script. However, in this case all the data is replicated across both nodes, which is a big performance overhead in my opinion.
So it boils down to this.
What is the best/recommended/popular mode in which to set up an ejabberd cluster of 2 nodes, mostly with respect to performance but keeping other critical factors (fault tolerance, load balancing) in mind as well?
There is only a single mode in ejabberd. Basically, it works like what you describe as multi-master. Master-slave would basically be the same setup without any traffic sent to the second node by the load balancing mechanism.
So case 2 is the way to go.

Persistent storage for Apache Mesos

Recently I discovered Apache Mesos.
It all looks amazing in all those demos and examples. I can easily imagine how one would run stateless jobs; that fits the whole idea naturally.
But how do you deal with long-running jobs that are stateful?
Say, I have a cluster that consists of N machines (and that is scheduled via Marathon). And I want to run a postgresql server there.
That's it - at first I don't even want it to be highly available, but just simply a single job (actually Dockerized) that hosts a postgresql server.
1- How would one organize it? Constrain a server to a particular cluster node? Use some distributed FS?
2- DRBD, MooseFS, GlusterFS, NFS, CephFS: which of those play well with Mesos and services like postgres? (I'm thinking here of the possibility that Mesos/Marathon could relocate the service if it goes down)
3- Please tell me if my approach is wrong in terms of philosophy (DFS for data servers and some kind of switchover for servers like postgres on top of Mesos)
Question largely copied from Persistent storage for Apache Mesos, asked by zerkms on Programmers Stack Exchange.
Excellent question. Here are a few upcoming features in Mesos to improve support for stateful services, and corresponding current workarounds.
Persistent volumes (0.23): When launching a task, you can create a volume that exists outside of the task's sandbox and will persist on the node even after the task dies/completes. When the task exits, its resources -- including the persistent volume -- can be offered back to the framework, so that the framework can launch the same task again, launch a recovery task, or launch a new task that consumes the previous task's output as its input.
Current workaround: Persist your state in some known location outside the sandbox, and have your tasks try to recover it manually. Maybe persist it in a distributed filesystem/database, so that it can be accessed from any node.
Disk Isolation (0.22): Enforce disk quota limits on sandboxes as well as persistent volumes. This ensures that your storage-heavy framework won't be able to clog up the disk and prevent other tasks from running.
Current workaround: Monitor disk usage out of band, and run periodic cleanup jobs.
Dynamic Reservations (0.23): Upon launching a task, you can reserve the resources your task uses (including persistent volumes) to guarantee that they are offered back to you upon task exit, instead of going to whichever framework is furthest below its fair share.
Current workaround: Use the slave's --resources flag to statically reserve resources for your framework upon slave startup.
As for your specific use case and questions:
1a) How would one organize it? You could do this with Marathon, perhaps creating a separate Marathon instance for your stateful services, so that you can create static reservations for the 'stateful' role, such that only the stateful Marathon will be guaranteed those resources.
1b) Constrain a server to a particular cluster node? You can do this easily in Marathon, constraining an application to a specific hostname, or to any node with a specific attribute value (e.g. NFS_Access=true). See Marathon Constraints. If you only wanted to run your tasks on a specific set of nodes, you would only need to create the static reservations on those nodes. And if you need discoverability of those nodes, you should check out Mesos-DNS and/or Marathon's HAProxy integration.
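As a rough illustration of the constraint mentioned in 1b, here is a sketch that builds a minimal Marathon app definition with a CLUSTER constraint, which Marathon expresses as a [field, operator, value] triple. The app id, image name, hostname, and resource numbers are placeholders I made up; check them against your own cluster before posting anything to Marathon.

```python
import json


def marathon_app(app_id, image, constraint):
    """Build a minimal Marathon app definition pinned by a single constraint.

    constraint is a [field, operator, value] triple, for example
    ["hostname", "CLUSTER", "node1.example.com"] (hostname is hypothetical).
    """
    return {
        "id": app_id,
        "instances": 1,
        "cpus": 1,       # placeholder resource numbers
        "mem": 1024,
        "container": {"type": "DOCKER", "docker": {"image": image}},
        "constraints": [list(constraint)],
    }


def to_json(app):
    """Serialize the app definition as the JSON body Marathon expects."""
    return json.dumps(app, indent=2, sort_keys=True)
```

Pinning by attribute instead of hostname would look like ["NFS_Access", "CLUSTER", "true"], assuming the agents were started with that attribute.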
1c) Use some distributed FS? The data replication provided by many distributed filesystems would guarantee that your data can survive the failure of any single node. Persisting to a DFS would also provide more flexibility in where you can schedule your tasks, although at the cost of the difference in latency between network and local disk. Mesos has built-in support for fetching binaries from HDFS uris, and many customers use HDFS for passing executor binaries, config files, and input data to the slaves where their tasks will run.
2) DRBD, MooseFS, GlusterFS, NFS, CephFS? I've heard of customers using CephFS, HDFS, and MapRFS with Mesos. NFS would seem an easy fit too. It really doesn't matter to Mesos what you use as long as your task knows how to access it from whatever node where it's placed.
Hope that helps!

Is my RabbitMQ cluster Active Active or Active Passive?

I have created a cluster consisting of three RabbitMQ nodes using the join_cluster command.
i.e.
rabbitmqctl -n rabbit2@MYPC1 join_cluster rabbit2@MYPC1
(currently the cluster runs on a single computer)
Questions:
In the documents it says there is one implementation for active passive and one for active active.
What did I configure?
How do I know?
How can it be changed?
Is there a big performance trade off between Active Active & Active Passive?
What is the best practice to interact with active/active?
i.e. install a load balancer? apache that will round robin
What is the best practice to interact with active/passive?
if I interact with only the active - this is a single point of failure
Thanks.
I have been doing some research into availability options with RabbitMQ and while I am still fairly new, I'll attempt to answer your questions with the knowledge I do have. Please understand that these answers are not intended to be comprehensive.
Before getting to the questions and answers, I think it's worth pointing out that the terms Active/Active and Active/Passive do not really apply to a cluster running on a single computer. They are typically used to describe highly available clusters where you have a system of more than one logical server (in your case, multiple RabbitMQ clusters), shared/redundant storage, network capabilities, power, etc.
What did I configure?
Without any load balancing for the nodes in your cluster or queue mirroring you have neither, meaning you do not have a highly available cluster.
How do I know?
RabbitMQ does not provide any connection management so traffic with a failed node will not automatically be passed on to a different node, which is required for an active/active cluster. Without queue mirroring you do not have fully redundant nodes in your cluster, which is required for active/passive.
How can it be changed?
Even if you implement load balancing and/or queue mirroring you are missing a number of requirements to offer a highly-available RabbitMQ cluster. Primarily, with a RabbitMQ cluster you only have a single logical broker (at least two are required for an HA cluster).
Is there a big performance trade off between Active Active & Active Passive?
I think you will start seeing performance penalties as you start introducing data replication and/or redundancy, which would affect both Active/Active and Active/Passive. If you are using synchronous data replication then you will see a bigger performance hit than if you replicate data asynchronously. There's a lot more to it, but to me this feels like there may be a bigger performance hit by using Active/Active but this depends heavily on how fast all of the pieces are working together. In Active/Passive where you may be using asynchronous replication across servers your performance may appear better but in a failover situation you would need to wait for that replication to complete before you can switch to your secondary server.
What is the best practice to interact with active/active? i.e. install a load balancer? apache that will round robin
RabbitMQ recommends using a load balancer so that you do not have to leak details about the nodes in your cluster to the clients.
What is the best practice to interact with active/passive? if I interact with only the active - this is a single point of failure
It is a point of failure but with Active/Passive you can implement a failure strategy to retry the next available server or all remaining servers. With these strategies in place you can establish a scenario where the capabilities of your cluster are merely degraded while a failover is happening instead of totally unavailable. Also, you can interact with the passive side but the types of interactions may be very different (i.e. read-only access) since there may be fewer resources available on the passive side and there may be delays in data replication.
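The "retry the next available server" strategy could look roughly like this on the client side. This is a generic sketch, not tied to any RabbitMQ client library: the connect argument is a placeholder for whatever call your client uses to open a connection, and the host names in the usage note are hypothetical.

```python
def connect_with_failover(hosts, connect):
    """Try each host in order and return (host, connection) for the first success.

    connect is any callable that takes a host and either returns a connection
    or raises an exception; it stands in for a real client library call.
    """
    errors = {}
    for host in hosts:
        try:
            return host, connect(host)
        except Exception as exc:  # a real client would catch its own error types
            errors[host] = exc
    raise ConnectionError(f"all hosts failed: {errors}")
```

For example, connect_with_failover(["active-host", "passive-host"], my_connect) returns the passive host's connection only when the active one cannot be reached, so the service is degraded during a failover rather than unavailable.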
Here are some references used to gather this information:
High-Availability Cluster on Wikipedia
Clustering with RabbitMQ
Highly Available Queues in a RabbitMQ Cluster
High Availability in RabbitMQ