How to achieve "manual" failover with Apache Artemis? - activemq-artemis

I've configured a small cluster with just one primary/backup group.
So far, failover and failback work as expected.
But this configuration is susceptible to network-isolation problems, and I don't think the pinger approach mitigates them sufficiently, so I would prefer the backup to just be there and receive updates, but not automatically fail over when the primary is unreachable.
Instead I want an intelligent human with better situational awareness to make the failover decision.
The decreased availability introduced by such a procedure is acceptable for us.
I've tried to get the backup to act this way (arbitrarily delaying failover) by using the following ha-policy > replication > slave parameters:
quorum-size
quorum-vote-wait
vote-retries
vote-retry-wait
but had no success so far.
Is it possible to somehow delay the automatic failover arbitrarily, and trigger the actual failover by changing the broker.xml?

ActiveMQ Artemis doesn't implement the functionality you're looking for - at least not in any automated way. I expect you could arbitrarily delay failover by setting quorum-size to something larger than the actual size of the cluster. However, there is no management operation to tell the backup broker to activate and become live. The only way you could do that would be to stop the broker, change the ha-policy to be a master and then restart the broker.
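For illustration only, here's roughly how those two pieces might look in broker.xml (assuming a 2.x broker using the replication ha-policy; the quorum-size value is arbitrary and the oversized-quorum workaround is untested):

<!-- backup broker: replication slave with a quorum-size larger than the real cluster,
     which should (in theory) keep the vote from ever succeeding -->
<ha-policy>
   <replication>
      <slave>
         <quorum-size>99</quorum-size>
      </slave>
   </replication>
</ha-policy>

<!-- manual "failover": stop the backup, swap its ha-policy for a master policy, restart -->
<ha-policy>
   <replication>
      <master>
         <check-for-live-server>false</check-for-live-server>
      </master>
   </replication>
</ha-policy>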

Related

Is there a way during replication to turn off the automatic start of the backup server when the live server fails?

I have a Live - Backup pair with the replication HA policy. I would like to manage the replica startup myself when the live server fails, leaving only data replication between them. Is it possible to somehow achieve this behavior?
One of the main goals of the live/backup pair is failover (i.e. automatically starting the backup when the live fails). There is no way to disable this functionality and still use a live/backup pair.
However, you could potentially use a broker-connection with the "mirror" configuration to get the results you want.
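As a rough sketch (element names per the 2.x broker-connections feature; the uri and connection name here are placeholders), the mirror configuration in broker.xml looks something like:

<broker-connections>
   <amqp-connection uri="tcp://dr-broker:61616" name="dr-mirror">
      <mirror/>
   </amqp-connection>
</broker-connections>

The mirror forwards messages, acknowledgements, and address/queue events to the remote broker, but the remote broker never activates on its own - you decide when to point clients at it, which is closer to the "replicate only, no automatic failover" behaviour asked about.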

Expected unavailability during Cloud SQL Postgres failovers and CPU/memory upgrades?

I have some experience with AWS RDS MySQL multi-AZ (HA). I'm looking at GCP Cloud SQL Postgres HA for a new project.
I'm trying to figure out how certain maintenance operations work, but can't work it out from the Cloud SQL docs.
How much unavailability does a failover cause?
How much unavailability does a CPU/memory upgrade cause?
After a failover, is it important to eventually "failback" to the original primary instance? Or can I leave it running on the standby instance indefinitely? (The Cloud SQL HA failover diagram makes it seem like the two instances aren't totally symmetric.)
Just FYI, here are the answers for AWS RDS:
Failover: usually under 70 seconds of unavailability before my application is able to issue queries again.
This is for planned failovers. (For unplanned failovers, it may take a little longer for RDS to detect that the primary instance is unresponsive before it actually initiates the failover.)
A lot of the failover lag is likely due to DNS. Using the AWS RDS Proxy service may reduce that time (they claim by ~80%). The Cloud SQL HA failover diagram shows both instances sharing a virtual IP, which might mean no DNS lag?
CPU/memory upgrade: I think AWS can accomplish this with a single failover's worth of unavailability. It upgrades the standby instance (no unavailability), performs a failover, then upgrades the other instance.
On RDS, I think the two instances that are part of the HA setup are symmetric. So if you fail over to the standby, it's fine to leave it that way. There's no need (as far as RDS is concerned) to fail back to the original.
To answer your following questions:
As you mentioned, the duration of the unavailability varies depending on whether it is a planned (manual) failover or an unplanned one. It's best to test by manually initiating a failover so you can see how long it takes your instance to respond; usually it takes a minute or so. For unplanned failovers, the docs state that when a failover occurs, any existing connections to the primary instance and read replicas are closed, and it takes approximately 2-3 minutes for connections to be reestablished.
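If you want to measure the planned case yourself, a manual failover can be kicked off from the CLI (the instance name is a placeholder):

gcloud sql instances failover my-ha-instance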
To address this question, you need to understand the requirements for your instance to allow failover:
The primary instance must be in a normal operating state (not stopped, undergoing maintenance, or performing a long-running Cloud SQL instance operation such as a backup, import or export operation).
That means a failover won't cover an instance upgrade: changing your hardware specs (CPU/memory) will incur downtime, so you should plan ahead when making these changes.
To understand the importance of failback, here's an excerpt from this link:
High availability solutions continuously replicate data to a remote site or cloud. In the event that a primary system goes down, the remote, secondary system can be spun up and users are rerouted. This process is commonly referred to as “failover,” and it reduces downtime to seconds or minutes.
However, failover isn’t a permanent state. Once primary servers are up and running, data and applications must be restored so normal operations can resume. This process is known as failback, and it is very important from a DR testing standpoint. Here’s why: Not all replication technology is created equally when it comes to failback. In some cases, failing back to production servers can be painfully slow.
UPDATE 1:
HA on Cloud SQL provisions the standby instance with the same specs as your primary, which is why you're billed double the price of a non-HA instance. Also, the importance of failback is not limited to any particular cloud provider. It is simply good practice to make sure that all operations return to your primary instance instead of being left on a standby instance. In that sense, failback (on Cloud SQL specifically) is really necessary to make sure that everything is back to normal after an outage.
UPDATE 2:
If you don't fail back, what could happen is that when there's an outage in the zone where your standby instance is running (you can't control which zone your standby instance comes from), you won't be able to do a failover, as the operation will be blocked. (See the docs.)
Unfortunately there's pretty much no way around it, as downtime is required whenever you change hardware; the procedure requires the instance to restart. Here's a link to see how long it would take.
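For reference, a CPU/memory change is just an instance patch, and it's the resulting restart that causes the downtime (instance name and tier are placeholders):

gcloud sql instances patch my-ha-instance --tier=db-custom-4-15360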
Additional resources: https://severalnines.com/database-blog/achieving-mysql-failover-failback-google-cloud-platform-gcp

ejabberd cluster: Multi-master or Master-slave

So far what I've come across is this -
Setting up an ejabberd cluster in a master-slave configuration leaves a single point of failure, and people have experienced issues where, even after fixing the master (if it goes down), the cluster doesn't become operable again. Also, sometimes the ejabberd instance on every slave has to be revisited to get it working properly, or mnesia commands have to be entered again to make the master communicate with the slaves.
Setting up an ejabberd cluster in a multi-master configuration means any of the nodes can be taken out of the cluster without bringing the whole cluster down. Basically, there is no single point of failure, and this is also the way the official ejabberd documentation tells you to do it, via the join_cluster command exposed in the ejabberdctl script. HOWEVER, in this case all the data is replicated across both nodes, which is a big performance overhead in my opinion.
So it boils down to this.
What is the best/recommended/most popular mode in which to set up a two-node ejabberd cluster, mostly with respect to performance, but keeping other critical factors (fault tolerance, load balancing) in mind as well?
There is only a single mode in ejabberd. Basically, it works like what you describe as multi-master; master-slave would basically be the same setup, just without any traffic being sent to the second node by the load-balancing mechanism.
So case 2 is the way to go.
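For completeness, the clustering setup from the official docs boils down to a couple of ejabberdctl calls (node names here are just examples):

# run on the second node, pointing it at the first
ejabberdctl join_cluster 'ejabberd@node1.example.com'

# and to take a node out of the cluster later
ejabberdctl leave_cluster 'ejabberd@node2.example.com'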

Is my RabbitMQ cluster Active Active or Active Passive?

I have created a cluster consisting of three RabbitMQ nodes using the join_cluster command.
i.e.
rabbitmqctl -n rabbit2@MYPC1 join_cluster rabbit2@MYPC1
(currently the cluster runs on a single computer)
Questions:
In the documentation it says there is one implementation for active/passive and one for active/active.
What did I configure?
How do I know?
How can it be changed?
Is there a big performance trade off between Active Active & Active Passive?
What is the best practice to interact with active/active?
i.e. install a load balancer? Apache that will round-robin?
What is the best practice to interact with active/passive?
if I interact with only the active node - this is a single point of failure
Thanks.
I have been doing some research into availability options with RabbitMQ and while I am still fairly new, I'll attempt to answer your questions with the knowledge I do have. Please understand that these answers are not intended to be comprehensive.
Before getting to the questions and answers, I think it's worth pointing out that I think using the terms Active/Active and Active/Passive in the context of a cluster running on a single computer does not really apply. Active/Active and Active/Passive are typically terms used to describe highly available clusters where you have a system of more than one logical server (in your case, multiple RabbitMQ clusters), shared/redundant storage, network capabilities, power, etc.
What did I configure?
Without any load balancing for the nodes in your cluster or queue mirroring you have neither, meaning you do not have a highly available cluster.
How do I know?
RabbitMQ does not provide any connection management so traffic with a failed node will not automatically be passed on to a different node, which is required for an active/active cluster. Without queue mirroring you do not have fully redundant nodes in your cluster, which is required for active/passive.
How can it be changed?
Even if you implement load balancing and/or queue mirroring you are missing a number of requirements to offer a highly-available RabbitMQ cluster. Primarily, with a RabbitMQ cluster you only have a single logical broker (at least two are required for an HA cluster).
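As a starting point, the queue-mirroring piece is just a policy (this enables classic mirrored queues on every queue; the policy name and the pattern are up to you):

rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'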
Is there a big performance trade off between Active Active & Active Passive?
I think you will start seeing performance penalties as you start introducing data replication and/or redundancy, which would affect both Active/Active and Active/Passive. If you are using synchronous data replication then you will see a bigger performance hit than if you replicate data asynchronously. There's a lot more to it, but to me this feels like there may be a bigger performance hit by using Active/Active but this depends heavily on how fast all of the pieces are working together. In Active/Passive where you may be using asynchronous replication across servers your performance may appear better but in a failover situation you would need to wait for that replication to complete before you can switch to your secondary server.
What is the best practice to interact with active/active? i.e. install a load balancer? apache that will round robin
RabbitMQ recommends using a load balancer so that you do not have to leak details about the nodes in your cluster to the clients.
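For example, a minimal TCP round-robin in front of the cluster could look like this with HAProxy (host names and ports are placeholders; the same idea applies to any TCP-capable load balancer, not just this one):

listen rabbitmq
    bind *:5672
    mode tcp
    balance roundrobin
    server rabbit1 rabbit1.example.com:5672 check
    server rabbit2 rabbit2.example.com:5672 check
    server rabbit3 rabbit3.example.com:5672 check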
What is the best practice to interact with active/passive? if I interact with only the active - this is a single point of failure
It is a point of failure but with Active/Passive you can implement a failure strategy to retry the next available server or all remaining servers. With these strategies in place you can establish a scenario where the capabilities of your cluster are merely degraded while a failover is happening instead of totally unavailable. Also, you can interact with the passive side but the types of interactions may be very different (i.e. read-only access) since there may be fewer resources available on the passive side and there may be delays in data replication.
Here are some references used to gather this information:
High-Availability Cluster on Wikipedia
Clustering with RabbitMQ
Highly Available Queues in a RabbitMQ Cluster
High Availability in RabbitMQ

Which PostgreSQL replication solution to use for my specific scenario

I need to replicate a PostgreSQL database server as follows:
Two servers are adjacent to each other - one is the master and the other the standby. If the master fails, the standby takes over. Replication from master to slave needs to be failsafe, hence synchronous. The standby will not be used for any querying unless it has become a master. So, no high-availability/load-balancing is required.
There is another backup server at a remote location. Data from the master server mentioned above will be replicated to this remote server asynchronously and in batches. Time is not a factor at all in this replication - a couple of hours is just fine. This server would be used just for backup.
I've studied the currently available replication solutions from the PostgreSQL docs as well as from Google, but can't decide which combination of synchronous-asynchronous solutions would I need.
The closest I came up with is using pgpool-II for scenario 1 and Mammoth for scenario 2. However, as pgpool is statement-based, what would happen to queries containing rand() and now()?
Please note that I'd rather use free and open-source replication tools.
Also, just a side question - according to scenario 1 above, when the master fails, the standby will take over. Would the master-slave roles stay reversed after that, or would the slave go back to its standby state once the master server recovers?
Any suggestion would be highly appreciated. Thanks.
I suggest using DRBD for scenario 1 and either 9.0 built-in replication or Slony for scenario 2.
Before PostgreSQL 9.1 (not yet released), there is no other synchronous replication solution available, and DRBD is widely established for this purpose. Together with Pacemaker or Heartbeat, which come with all the scripts needed for PostgreSQL monitoring and switchover, you have a very robust and fairly easy to manage solution. (In fact, I'd consider continuing to use DRBD even after 9.1 comes out; it's just a lot easier and has a longer track record.)
For the cross-site asynchronous replication, you could try the built-in replication of PostgreSQL 9.0, perhaps in conjunction with repmgr for monitoring and management. Alternatively, you could try the (now a bit) old-school Slony, but I'd guess it will be more complicated than your needs call for.
You didn't mention if the server in question was on a specific version or if this was a new project with the freedom to choose the version. The answers vary based on that information.
If you are starting with a clean slate, I would recommend designing based on the PostgreSQL 9.1 beta. The final version will be released long before you would be ready to go into a production environment and it has binary synchronous replication built-in.
I've been using the built-in asynchronous replication in PostgreSQL for years in almost the exact same scenario you describe and it has always been rock-solid for me. It's become even better with 9.0 with Hot standby and it's become much easier to configure and maintain. 9.1 provides the only missing piece you require.
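Roughly, the built-in replication boils down to a few settings (parameter names are the 9.0/9.1 ones; hosts, IPs, and the replication user are placeholders):

# postgresql.conf on the master
wal_level = hot_standby
max_wal_senders = 3
# 9.1 only, for synchronous replication to the adjacent standby:
synchronous_standby_names = 'standby1'

# pg_hba.conf on the master
host replication repuser 192.168.1.2/32 md5

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=master-host port=5432 user=repuser application_name=standby1'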
However, if you are trying to replicate an existing server, built-in asynchronous replication with aggressive settings for "checkpoint_timeout" and very frequent backups of unarchived WAL files could be sufficient until you can upgrade to 9.1.
The bottom line here is that you can get exactly what you want with stock PostgreSQL 9.1 - no third-party products required.
As for failover, it is not an automatic process; you'll need to handle that yourself. I would recommend that after a failover you keep the roles of the two machines switched until either the next failover event or a controlled manual failover during a scheduled outage in a slow period of use. Again, this is not automatic and must be managed by the administrator (via shell scripts, presumably).
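A typical manual promotion in that era is just a trigger file or, from 9.1 on, pg_ctl (paths are placeholders):

# recovery.conf on the standby
trigger_file = '/tmp/postgresql.trigger'

# when the administrator decides to fail over:
touch /tmp/postgresql.trigger          # 9.0
pg_ctl promote -D /var/lib/pgsql/data  # 9.1+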