Clustered ActiveMQ Artemis and producer/consumer load balancing configuration when broker fails - activemq-artemis

I have created an ActiveMQ Artemis cluster with two active brokers, and I have written a custom load balancer so I can initially distribute my queues in a static way according to my requirements and workload.
<connectors>
   <connector name="broker1-connector">tcp://myhost1:61616</connector>
   <connector name="broker2-connector">tcp://myhost2:62616</connector>
</connectors>
<cluster-connections>
   <cluster-connection name="myhost1-cluster">
      <connector-ref>broker1-connector</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>broker2-connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
My issue is that when broker1 is down, this topology lets me recreate its queues on broker2 to avoid losing messages (by using the connection string tcp://myhost1:61616,tcp://myhost2:62616 on the producer).
But when broker1 becomes available again my producer is unaware of that and still uses the connection to broker2 (if it matters, broker2's redistribution-delay is set to 0 and no consumers are registered). Is there a way, or some configuration, to make my producer resume writing only to broker1?
This also affects my consumers, which are initially connected to broker1. I am not sure if there is some way/configuration to make them transparently bounce between these brokers, or do I need to create two consumers (effectively one of them will be idle), each targeting the corresponding broker?

There is no way for the broker to tell a client that it should connect to another node joining the cluster.
My recommendation would be to use HA with failback so that when one node fails, all the clients connected to that node fail over to the backup, and when the original node comes back, all the clients fail back to the original node.
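On the client side, failover and failback are transparent as long as the client connects with an HA-aware URL. Here is a minimal sketch using the core JMS client, assuming your host names from above and a hypothetical queue name; ha=true lets the client receive topology updates from the cluster and reconnectAttempts=-1 keeps it retrying through the failover/failback window, so no application logic is needed to move the producer back to broker1:
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class FailbackAwareProducer {
    public static void main(String[] args) throws Exception {
        // ha=true: receive topology updates from the cluster;
        // reconnectAttempts=-1: retry indefinitely while failover/failback happens.
        ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory(
                "(tcp://myhost1:61616,tcp://myhost2:62616)?ha=true&reconnectAttempts=-1");

        try (Connection connection = cf.createConnection()) {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("my.queue"));
            producer.send(session.createTextMessage("hello")); // transparently re-routed during failover/failback
        }
    }
}
The same URL works for your consumers, so they should not need a second, idle connection per broker.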
You may also find that you don't actually need a cluster of 2 brokers. Many users never perform the performance testing necessary to confirm that clustering is even necessary in the first place. They simply assume that a cluster is necessary. Such an assumption can needlessly complicate a platform's architecture and waste valuable resources. The performance of ActiveMQ Artemis is quite good. A single node can handle millions of messages per second in certain use-cases.

Related

ActiveMQ Artemis migration to the pluggable quorum configuration

I have a question about ActiveMQ Artemis cluster migration to the pluggable quorum configuration.
Currently we have a cluster in the test environment which has 6 servers (3 master/slave pairs with classic replication), and I plan to migrate it to pluggable quorum voting. The version of Artemis is 2.23.1.
I have configured another (pretest) cluster with 3 ZooKeeper nodes and a primary/backup pair of Artemis nodes. It seems to work well, but it is a pretest environment where we perform some experiments, and there are no clients or workload. So I have decided to reconfigure the test cluster to use pluggable quorum voting.
At first I thought that we could simply change the role of each server from master to primary and from slave to backup.
Previous configuration was - master:
<ha-policy>
   <replication>
      <master>
         <check-for-live-server>true</check-for-live-server>
         <vote-on-replication-failure>true</vote-on-replication-failure>
         <quorum-size>2</quorum-size>
         <group-name>group-for-each-pair</group-name>
      </master>
   </replication>
</ha-policy>
Slave:
<ha-policy>
   <replication>
      <slave>
         <allow-failback>true</allow-failback>
         <group-name>group-for-each-pair</group-name>
      </slave>
   </replication>
</ha-policy>
The group name is used by the slave to determine which master it has to connect to.
Unfortunately, this setting does not work in the <primary> and <backup> sections: when I tried to configure it there I got an XSD validation error for broker.xml.
In the documentation there are some words about settings which are no longer needed in the pluggable quorum configuration:
There are some no longer needed classic replication configurations:
vote-on-replication-failure, quorum-vote-wait, vote-retries, vote-retries-wait, check-for-live-server
But there is nothing about <group-name>. Maybe it is a documentation issue.
New configuration is - primary:
<ha-policy>
   <replication>
      <primary>
         <manager>
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <property key="connect-string" value="zookeeper-amq1:2181,zookeeper-amq2:2181,zookeeper-amq3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
Backup:
<ha-policy>
   <replication>
      <backup>
         <manager>
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <property key="connect-string" value="zookeeper-amq1:2181,zookeeper-amq2:2181,zookeeper-amq3:2181"/>
            </properties>
         </manager>
         <allow-failback>true</allow-failback>
      </backup>
   </replication>
</ha-policy>
When I tried to start the cluster with these settings, I found that the backup servers try to connect to any primary server, and some of them cannot start, so I reverted to the old configuration.
I read the documentation and found some settings which could help:
<coordination-id>: used in a multi-primary configuration, and probably will not work in the <backup> section.
The namespace property in the Apache Curator settings. Maybe it can help split the servers into pairs, where each backup connects to its primary in the same namespace. But it may be designed for another purpose (to have one ZooKeeper ensemble for several separate clusters), and there could be some other problems.
Another option is to remove the 4 unnecessary ActiveMQ Artemis servers and use only 1 pair of servers. It would require client reconfiguration, but clients would continue to work with the 2 remaining servers even if all 6 remain in the connection string.
Is there a preferred way to migrate from classic replication to the pluggable quorum voting without changing cluster topology (6 servers)?
Any changes in this test environment (if successful) will be applied to the UAT and production clusters, which have the same topology, so we would prefer a smooth migration if possible.
I recommend just using group-name as you were before. For example on the primary:
<ha-policy>
   <replication>
      <primary>
         <manager>
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <property key="connect-string" value="zookeeper-amq1:2181,zookeeper-amq2:2181,zookeeper-amq3:2181"/>
            </properties>
         </manager>
         <group-name>group-for-each-pair</group-name>
      </primary>
   </replication>
</ha-policy>
And on the backup:
<ha-policy>
   <replication>
      <backup>
         <manager>
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <property key="connect-string" value="zookeeper-amq1:2181,zookeeper-amq2:2181,zookeeper-amq3:2181"/>
            </properties>
         </manager>
         <group-name>group-for-each-pair</group-name>
         <allow-failback>true</allow-failback>
      </backup>
   </replication>
</ha-policy>
That said, I strongly encourage you to execute performance tests with a single HA pair of brokers. A single broker can potentially handle millions of messages per second so it's likely that you don't need a cluster of 3 primary brokers. Also, if your applications are connected to the cluster nodes such that messages are produced on one node and consumed from another then having a cluster may actually reduce overall message throughput due to the extra "hops" a message has to take. Obviously this wouldn't be an issue for a single HA pair.
Finally, dropping from 6 brokers down to 2 would significantly reduce configuration and operational complexity, and it's likely to reduce infrastructure costs substantially as well. This is one of the main reasons we implemented pluggable quorum voting in the first place.

Consume directly from ActiveMQ Artemis replica

In a cluster scenario using the HA/data replication feature, is there a way for consumers to consume/fetch data from a slave node instead of always reaching out to the master node (the master of that particular queue)?
If you think about scalability, having all consumers call the single node that is the master of a specific queue means all traffic goes to that one node.
Kafka allows consumers to fetch data from the closest node if that node contains a replica of the leader; is there something similar in ActiveMQ?
In short, no. Consumers can only consume from an active broker, and slave brokers are not active; they are passive.
If you want to increase scalability you can add additional brokers (or HA broker pairs) to the cluster. That said, I would recommend careful benchmarking to confirm that you actually need additional capacity before increasing your cluster size. A single ActiveMQ Artemis broker can handle millions of messages per second depending on the use-case.
As I understand it, Kafka's semantics are quite different from a "traditional" message broker like ActiveMQ Artemis so the comparison isn't particularly apt.

Does ActiveMQ Artemis support master to master failover?

I have two ActiveMQ Artemis servers (server1 and server2). Both are masters and there is no slave in this case. Does Artemis support master-to-master failover? If yes, can anyone provide the broker configuration? Currently I have the following configuration in both servers' broker.xml files.
<ha-policy>
   <shared-store>
      <master>
         <failover-on-shutdown>true</failover-on-shutdown>
      </master>
   </shared-store>
</ha-policy>
Also, if possible, can you please provide sample client code to test the master-to-master failover scenario?
Failover support in ActiveMQ Artemis is provided by a master/slave pair as that is the only configuration where two brokers have the same journal data (either via shared-storage or replication). Failover between one master and another master is not supported.
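As for sample client code: assuming you reconfigure server1 and server2 as a master/slave pair (e.g. keep your <shared-store> master and add a corresponding slave), a rough test sketch with the core JMS client could look like the following. The host names, queue name, and the ha=true and reconnectAttempts=-1 URL parameters are illustrative; the idea is to send messages in a loop, stop the live broker mid-run, and verify the client reconnects to the backup and keeps going:
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class FailoverTest {
    public static void main(String[] args) throws Exception {
        // Both members of the master/slave pair go in the URL so the initial
        // connection succeeds no matter which one is currently live.
        ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory(
                "(tcp://server1:61616,tcp://server2:61616)?ha=true&reconnectAttempts=-1");

        try (Connection connection = cf.createConnection()) {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("test.queue"));

            // Send one message per second; stop the live broker mid-run and the
            // client should reconnect to the backup and keep sending.
            for (int i = 0; i < 120; i++) {
                producer.send(session.createTextMessage("message " + i));
                System.out.println("sent message " + i);
                Thread.sleep(1000);
            }
        }
    }
}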

Kafka Producer, multi DC failover support

I have two distinct Kafka clusters located in different data centers, DC1 and DC2. How can I organize Kafka producer failover between the two DCs? If the primary Kafka cluster (DC1) becomes unavailable, I want the producer to switch to the failover Kafka cluster (DC2) and continue publishing to it. The producer should also be able to switch back to the primary cluster once it is available. Are there any good patterns, existing libraries, approaches, or code examples?
Each partition of the Kafka topic your producer is publishing to has a separate leader, often spread across multiple brokers in the cluster, so the producer is connected to many "primary" brokers simultaneously. Should any one of them fail, another in-sync replica (ISR) will be elected leader and automatically take over. You do not need to do anything in your client app for it to reconnect to the new leader(s), retry any failed requests, and continue.
If this is for multi-data-center (MDC) failover then things get much more complicated, depending on whether the client apps die as well or keep running and just need their cluster connections to fail over. Offsets are not preserved across Kafka clusters, so while producers are simpler, consumers need to call offsetsForTimes() upon failover.
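To make that consumer-side step concrete, here is a hedged sketch (the hosts, topic, group, and overlap window are all hypothetical) of a consumer re-seeking by timestamp on the secondary cluster after a failover, since committed offsets from the primary cluster are meaningless there:
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TimestampFailoverConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "dc2-broker1:9092"); // secondary (DC2) cluster
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Roughly when the last messages were processed on the primary cluster;
        // going back a few minutes re-reads a small overlap rather than losing data.
        long failoverTimestamp = System.currentTimeMillis() - Duration.ofMinutes(5).toMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("events").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, failoverTimestamp));

            // Offsets differ between clusters, so translate "where we were" via timestamps.
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
            offsets.forEach((tp, oat) -> {
                if (oat != null) {
                    consumer.seek(tp, oat.offset());
                }
            });

            // ... poll() and process as usual against the secondary cluster
        }
    }
}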
For a great write-up of the MDC failover modes and best practices, see the MDC whitepaper here: https://www.confluent.io/white-paper/disaster-recovery-for-multi-datacenter-apache-kafka-deployments/
Since you asked only about producers: your app can detect that the primary cluster is down (say, after a certain number of retries) and then, instead of attempting to reconnect, connect to a broker list from the secondary cluster. Alternatively, you can redirect the DNS names of the broker-list hosts to point to the secondary cluster.
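Here is a hedged sketch of that "detect and switch broker lists" approach; the host names, topic, and timeout are purely illustrative, and switching back once DC1 recovers would need an extra periodic health check (or the DNS redirection mentioned above):
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DcFailoverProducer {
    private static final String DC1_BOOTSTRAP = "dc1-broker1:9092,dc1-broker2:9092";
    private static final String DC2_BOOTSTRAP = "dc2-broker1:9092,dc2-broker2:9092";

    private static KafkaProducer<String, String> newProducer(String bootstrap) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Bound how long the client retries internally before we treat the cluster as down.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "60000");
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) throws Exception {
        KafkaProducer<String, String> producer = newProducer(DC1_BOOTSTRAP);
        boolean onPrimary = true;

        for (int i = 0; i < 1000; i++) {
            ProducerRecord<String, String> record = new ProducerRecord<>("events", "msg-" + i);
            try {
                producer.send(record).get(); // block so a dead cluster surfaces as an exception
            } catch (Exception e) {
                // Current cluster is considered down: switch to the other DC and re-send.
                producer.close();
                onPrimary = !onPrimary;
                producer = newProducer(onPrimary ? DC1_BOOTSTRAP : DC2_BOOTSTRAP);
                producer.send(record).get();
            }
        }
        producer.close();
    }
}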

Kafka Producers/Consumers over WAN?

I have a Kafka cluster in a data center. A bunch of clients that may communicate across WANs (even the internet) will send/receive real-time messages to/from the cluster.
I read from Kafka's Documentation:
...It is possible to read from or write to a remote Kafka cluster over the WAN though TCP tuning will be necessary for high-latency links.
It is generally not advisable to run a single Kafka cluster that spans multiple datacenters as this will incur very high replication latency both for Kafka writes and Zookeeper writes and neither Kafka nor Zookeeper will remain available if the network partitions.
From what I understand here and here:
Producing over a WAN doesn't require ZK and is okay, just mind tweaks to TCP for high latency connections. Great! Check.
The High Level consumer APIs require ZK connections.
Aren't clients reading/writing to Kafka over a WAN then subject to the same limitations quoted above for clusters?
The statements you have highlighted are mostly targeted at the internal communication within the Kafka/ZooKeeper cluster, where evil things will happen during network partitions, which are much more common across a WAN.
Producers are isolated and, if there are network issues, should be able to buffer/retry based on your settings (see the sketch at the end of this answer).
High-level consumers are trickier since, as you note, they require a connection to ZooKeeper. When disconnects occur, there will be rebalancing and a higher chance that messages will get duplicated.
Keep in mind that the producer will need to be able to reach every Kafka broker, and the consumer will need to be able to reach all ZooKeeper nodes and Kafka brokers; a load balancer won't work.
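For reference, here is a hedged sketch of the producer-side buffering/retry and socket-buffer settings that typically get tuned for a high-latency WAN link, using the current Java producer configs; the values and host name are purely illustrative:
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class WanTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "remote-dc-broker:9092"); // placeholder host
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Buffering/retry knobs for a flaky or high-latency WAN link (values are illustrative):
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864");     // 64 MB of local buffering
        props.put(ProducerConfig.LINGER_MS_CONFIG, "50");               // batch more records per request
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "131072");          // larger batches amortize latency
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "300000"); // keep retrying for 5 minutes
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000");   // tolerate slow round trips

        // The client-side piece of the "TCP tuning" the docs mention: socket buffer sizes
        // (OS-level tuning of the link may also be needed).
        props.put(ProducerConfig.SEND_BUFFER_CONFIG, "1048576");        // send.buffer.bytes
        props.put(ProducerConfig.RECEIVE_BUFFER_CONFIG, "1048576");     // receive.buffer.bytes

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records as usual
        }
    }
}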