RabbitMQ Best Practices for High Availability on Cloud

I'm planning to deploy RabbitMQ on a Kubernetes Engine cluster. I see there are two kinds of location types, i.e. 1. Region 2. Zone
Could someone help me understand what kind of benefits I can expect from each location type? I believe a multi-zone setup
could help enhance network throughput, while a multi-region setup can ensure uninterrupted service even in case of regional failure events. Is this understanding correct? I'm looking for relevant justifications to choose a location type. Please help.

I'm planning to deploy RabbitMQ on a Kubernetes Engine cluster. I see there are two kinds of location types:
Region
Zone
Could someone help me understand what kind of benefits I can expect from each location type?
A zone (availability zone) is typically a datacenter.
A region is multiple zones located in the same geographical area. When deploying a "cluster" to a region, you typically have a VPC (virtual private cloud) network spanning three datacenters, and you spread your components across those zones/datacenters. The idea is that you should be fault tolerant to the failure of a whole datacenter while still having relatively low latency within your system.
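To make the zone-spreading idea concrete in Kubernetes terms, here is a minimal sketch using the official Python client; the `app=rabbitmq` label, image, and names below are my own illustrative assumptions, not something from the question:

```python
# Sketch: spread RabbitMQ pods evenly across zones so that losing one
# zone/datacenter only takes down a minority of the brokers.
# Assumes the official `kubernetes` Python client; label, image, and
# names are illustrative.
from kubernetes import client

spread = client.V1TopologySpreadConstraint(
    max_skew=1,                                  # zones may differ by at most one pod
    topology_key="topology.kubernetes.io/zone",  # well-known zone label
    when_unsatisfiable="DoNotSchedule",
    label_selector=client.V1LabelSelector(match_labels={"app": "rabbitmq"}),
)

pod_template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "rabbitmq"}),
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="rabbitmq", image="rabbitmq:3-management")],
        topology_spread_constraints=[spread],
    ),
)
```

With a constraint like this, the scheduler keeps the per-zone pod count within a skew of one, so losing a single zone only takes down roughly a third of a three-zone RabbitMQ cluster.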
While a multi-region setup can ensure uninterrupted service even in case of regional failure events. Is this understanding correct? I'm looking for relevant justifications to choose a location type.
When using multiple regions, e.g. in different parts of the world, this is typically done to be near the customer, e.g. to provide lower latency. CDN services are distributed to multiple geographical locations for the same reason. When deploying a service to multiple regions, communication between regions is typically done with asynchronous protocols, e.g. message queues, since latency may be too high for synchronous communication.
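To make the asynchronous pattern concrete, here is a hedged sketch of cross-region messaging with the `pika` RabbitMQ client; the host name, queue name, and payload are illustrative assumptions:

```python
# Sketch: fire-and-forget publishing to a durable RabbitMQ queue, so the
# producing region never blocks on the consuming region's latency or
# availability. Assumes the `pika` library; host, queue, and payload
# are illustrative.
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.eu-region.internal")  # assumed host
)
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)  # queue survives broker restarts
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=b'{"order_id": 42}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist message to disk
)
connection.close()
```

The producing region acknowledges work locally and never blocks on the remote region's round-trip time, which is exactly why queues tolerate inter-region latency better than synchronous RPC.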

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused about what it really means.
I understand, at a high level, that it means a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment, etc. are distributed across different services, whereas others refer to multiple instances of an Order service that serve customers and possibly use some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes, each storing/serving a part of the data, like records with primary keys from 'A-C' on the first node, 'D-F' on the second node, etc. At a high level this looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mean that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed systems" to refer to microservices architecture, horizontal scaling, partitioning (sharding), and anything in between.
I am reading about distributed systems and getting confused about what it really means.
As commented by @ReinhardMänner, a good general definition of a distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the mentioned examples, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X is a distributed system" does not inherently imply any such details; they must be explicitly specified for each DS. E.g., a distributed database does not necessarily mean the usage of sharding.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling existed with monoliths before microservices; it is basically achieved by division of compute resources.
Division of compute requires dealing with synchronization, node failure, and multiple clocks. But that is still cheaper than scaling vertically. That's where you might turn to consensus: implementing it in the application, using a dedicated service such as ZooKeeper, or abusing a DB table for that purpose.
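As a hedged sketch of the "abusing a DB table" option, here is lease-style leader election over a single row, using stdlib sqlite3; the schema, node IDs, and TTL are illustrative assumptions:

```python
# Sketch: leader election via a single-row lease in a relational table.
# Whoever updates the row holds the lease until it expires. This is the
# "abuse a DB table" approach; production systems usually prefer a
# dedicated service such as ZooKeeper or etcd. Schema and TTL are
# illustrative.
import sqlite3
import time

def try_acquire_lease(conn: sqlite3.Connection, node_id: str, ttl: float = 10.0) -> bool:
    now = time.time()
    with conn:  # transaction: the DB serializes competing writers
        conn.execute(
            "CREATE TABLE IF NOT EXISTS lease "
            "(id INTEGER PRIMARY KEY CHECK (id = 1), holder TEXT, expires REAL)"
        )
        cur = conn.execute(
            "UPDATE lease SET holder = ?, expires = ? WHERE expires < ? OR holder = ?",
            (node_id, now + ttl, now, node_id),
        )
        if cur.rowcount == 1:
            return True  # renewed our lease, or stole an expired one
        cur = conn.execute(
            "INSERT OR IGNORE INTO lease (id, holder, expires) VALUES (1, ?, ?)",
            (node_id, now + ttl),
        )
        return cur.rowcount == 1  # won the initial race

conn = sqlite3.connect("lease.db")
if try_acquire_lease(conn, node_id="node-a"):
    print("node-a is the leader")
```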
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute (effectively, a distributed system) and division of storage (sharding), as you mentioned, e.g. A-C, D-F, etc.
Sharding of storage does not define a distributed system: a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
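A minimal sketch of that range-based routing, mirroring the A-C / D-F example (node names and range boundaries are illustrative assumptions):

```python
# Sketch: route a record to a storage node by the range its key falls in,
# mirroring the A-C / D-F example above. Boundaries and node names are
# illustrative.
import bisect

# Upper bounds (inclusive) of each shard's key range, kept sorted.
BOUNDS = ["C", "F", "Z"]
NODES = ["node-1", "node-2", "node-3"]

def shard_for(key: str) -> str:
    idx = bisect.bisect_left(BOUNDS, key[0].upper())
    return NODES[idx]

assert shard_for("Alice") == "node-1"    # A-C
assert shard_for("Dave") == "node-2"     # D-F
assert shard_for("Mallory") == "node-3"  # G-Z
```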
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (= with no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all components are just threads in the same process.
(That could happen, e.g., if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)
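For completeness, here is a hedged sketch of the "approximate" rate limiting described above, with per-node counters that reconcile occasionally through Redis; key names, limits, and the sync interval are illustrative assumptions:

```python
# Sketch: "approximate" global rate limiting. Each node counts locally and
# only occasionally reconciles with a shared counter, trading accuracy for
# low latency. Assumes the `redis` Python client; key names, limit, and
# sync interval are illustrative. Window expiry/reset is omitted for brevity.
import time
import redis

GLOBAL_LIMIT = 1000  # requests per window, across all nodes
SYNC_EVERY = 0.5     # seconds between reconciliations

r = redis.Redis()
local_count = 0
last_sync = time.monotonic()
global_view = 0      # possibly stale snapshot of the global counter

def allow_request() -> bool:
    global local_count, last_sync, global_view
    now = time.monotonic()
    if now - last_sync >= SYNC_EVERY:
        # Push local increments and refresh our (stale) global view.
        global_view = r.incrby("rate:window", local_count)
        local_count = 0
        last_sync = now
    if global_view + local_count >= GLOBAL_LIMIT:
        return False  # may over- or under-enforce between syncs
    local_count += 1
    return True
```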

High availability feature within DB2 on cloud

As per the documentation, the high availability feature in DB2 on Cloud offers an additional redundant node within the same data center (availability zone) only. Why can't HA be provided at least across different AZs within the same region?
As Gilbert said, this is due to latency. The nodes are placed in the same datacenter because the HA replication is synchronous. They are kept on different power and networking pods to provide a level of isolation while still keeping them physically close.
For further physical isolation, there is the Disaster Recovery feature, where a node is added in a different datacenter altogether. This replication is asynchronous and the failovers are triggered manually by the user.
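The latency argument can be made concrete with a back-of-the-envelope calculation; the round-trip times below are illustrative assumptions, not DB2 measurements:

```python
# Sketch: why synchronous replication is kept within one datacenter.
# A commit cannot be acknowledged before the replica confirms it, so the
# maximum rate of strictly serial commits is bounded by the round-trip
# time. RTT figures are illustrative assumptions.
for label, rtt_s in [("same datacenter", 0.0005),
                     ("cross-AZ", 0.002),
                     ("cross-region", 0.030)]:
    print(f"{label}: <= {1 / rtt_s:,.0f} serial commits/sec")
# same datacenter: <= 2,000 serial commits/sec
# cross-AZ: <= 500 serial commits/sec
# cross-region: <= 33 serial commits/sec
```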

Multi-master Kubernetes nodes

I have a Kubernetes cluster on IBM Cloud Platform (not important, the question is related to Kubernetes itself).
If I wanted to replicate across different data centers in different regions, should I use multiple, different master nodes for different regions? What's the best approach in this case, and what would you suggest?
Thanks in advance,
I'll answer from an IBM Cloud perspective, since you are referring to data centers.
If you want to "replicate across different data centers in different regions", then you will need to create separate clusters in each of those data centers. Once you have done that, by definition you will have multiple masters (one for each of your clusters). So the short answer is yes, you will have multiple clusters (and masters).
See this doc for more info. In this case you're talking about scenario 3: https://console.bluemix.net/docs/containers/cs_clusters.html#planning_clusters
Note that you will need to provision a global load balancer to load balance between regions, as well as ensure your app can handle any data replication between regions that is needed.

Single Kubernetes/OpenShift cluster/instance across datacenters?

With the understanding that Ubernetes is designed to fully solve this problem, is it currently possible (not necessarily recommended) to span a single K8s/OpenShift cluster across multiple internal corporate datacenters?
Additionally, assuming that latency between data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
Example: Given 3 corporate DC's, deploy 1..* masters at each datacenter (as a single cluster) and have 1..* nodes at each DC with pods/rc's/services/... being spun up across all 3 DC's.
Has someone implemented something like this as a stopgap solution before Ubernetes drops, and if so, how has it worked, and what considerations should be taken into account when running like this?
is it currently possible (not necessarily recommended) to span a single K8s/OpenShift cluster across multiple internal corporate datacenters?
Yes, it is currently possible. Nodes are given the address of an apiserver and client credentials and then register themselves into the cluster. Nodes don't know (or care) whether the apiserver is local or remote, and the apiserver allows any node to register as long as it has valid credentials, regardless of where the node exists on the network.
Additionally, assuming that latency between data centers is relatively low and that infrastructure across the corporate data centers is relatively consistent.
This is important, as many of the settings in Kubernetes assume (either implicitly or explicitly) a high bandwidth, low-latency network between the apiserver and nodes.
Example: Given 3 corporate DC's, deploy 1..* masters at each datacenter (as a single cluster) and have 1..* nodes at each DC with pods/rc's/services/... being spun up across all 3 DC's.
The downside of this approach is that if you have one global cluster you have one global point of failure. Even if you have replicated, HA master components, data corruption can still take your entire cluster offline. And a bad config propagated to all pods in a replication controller can take your entire service offline. A bad node image push can take all of your nodes offline. And so on. This is one of the reasons that we encourage folks to use a cluster per failure domain rather than a single global cluster.

What are best practices for kubernetes geo distributed cluster?

What is the best practice for getting a geo-distributed cluster with asynchronous network channels?
I suspect I would need some "load balancer" which should redirect connections within its own DC. Do you know of anything like this already in place?
Second question: should we use one HA cluster or create a dedicated cluster for each DC?
The assumption of the kubernetes development team is that cross-cluster federation will be the best way to handle cross-zone workloads. The tooling for this is easy to imagine, but has not emerged yet. You can (on your own) set up regional or global load-balancers and direct traffic to different clusters based on things like GeoIP.
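As a hedged sketch of that GeoIP-based direction, using the `geoip2` library; the database path, country-to-cluster map, and endpoints are illustrative assumptions:

```python
# Sketch: route a client to the nearest regional cluster based on GeoIP.
# Assumes the `geoip2` library and a downloaded MaxMind GeoLite2 database;
# the path, country-to-cluster map, and endpoints are illustrative.
import geoip2.database
import geoip2.errors

CLUSTERS = {
    "US": "https://us.cluster.example.com",
    "DE": "https://eu.cluster.example.com",
    "JP": "https://ap.cluster.example.com",
}
DEFAULT = "https://us.cluster.example.com"

reader = geoip2.database.Reader("/var/lib/geoip/GeoLite2-Country.mmdb")

def cluster_for(client_ip: str) -> str:
    try:
        country = reader.country(client_ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return DEFAULT
    return CLUSTERS.get(country, DEFAULT)
```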
You should look into Byzantine clients. My team is currently working on a solution for erasure-coded storage in an asynchronous network that prevents some problems caused by faulty clients, but it relies on correct clients to establish a consistent state across the servers.
The network consists of a set of servers {P1, ..., Pn} and a set of clients {C1, ..., Cn}, which are all interactive Turing machines with running time bounded by a polynomial in a given security parameter. Servers and clients together are parties. There is an adversary, which is an interactive Turing machine with running time bounded by a polynomial. Servers and clients controlled by the adversary are called corrupted; otherwise, they are called honest. An adversary that controls up to t servers is called t-limited.
If protecting innocent clients from getting inconsistent values is a priority, then you should go down that route, but from the point of view of a client, problems caused by faulty clients don't really hurt the system.