ActiveMQ Artemis master slave error when backup becomes live - activemq-artemis

I have a master/slave setup with 1 master and 2 slaves. When I kill the master, one of the slaves tries to become master but fails with the following exception:
2022/03/08 16:13:28.746 | mb | ERROR | 1-156 | o.a.a.a.c.server | | AMQ224000: Failure in initialisation: java.lang.IndexOutOfBoundsException: length(32634) exceeds src.readableBytes(32500) where src is: UnpooledHeapByteBuf(ridx: 78, widx: 32578, cap: 32578/32578)
at io.netty.buffer.AbstractByteBuf.checkReadableBounds(AbstractByteBuf.java:643)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1095)
at org.apache.activemq.artemis.core.message.impl.CoreMessage.reloadPersistence(CoreMessage.java:1207)
at org.apache.activemq.artemis.core.message.impl.CoreMessagePersister.decode(CoreMessagePersister.java:85)
at org.apache.activemq.artemis.core.message.impl.CoreMessagePersister.decode(CoreMessagePersister.java:28)
at org.apache.activemq.artemis.spi.core.protocol.MessagePersister.decode(MessagePersister.java:120)
at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.decodeMessage(AbstractJournalStorageManager.java:1336)
at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.lambda$loadMessageJournal$1(AbstractJournalStorageManager.java:1035)
at org.apache.activemq.artemis.utils.collections.SparseArrayLinkedList$SparseArray.clear(SparseArrayLinkedList.java:114)
at org.apache.activemq.artemis.utils.collections.SparseArrayLinkedList.clearSparseArrayList(SparseArrayLinkedList.java:173)
at org.apache.activemq.artemis.utils.collections.SparseArrayLinkedList.clear(SparseArrayLinkedList.java:227)
at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.loadMessageJournal(AbstractJournalStorageManager.java:990)
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:3484)
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:3149)
at org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:325)
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:4170)
I'm also observing a lot of messages like this one:
2022/03/08 16:13:28.745 | AMQ224009: Cannot find message 36,887,402,768
2022/03/08 16:13:28.745 | AMQ224009: Cannot find message 36,887,402,768
Master setup:
<ha-policy>
   <replication>
      <master>
         <check-for-live-server>true</check-for-live-server>
      </master>
   </replication>
</ha-policy>
<connectors>
   <connector name="connector-server-0">tcp://172.16.134.51:62616</connector>
   <connector name="connector-server-1">tcp://172.16.134.52:62616</connector>
   <connector name="connector-server-2">tcp://172.16.134.28:62616</connector>
</connectors>
<acceptors>
   <acceptor name="netty-acceptor">tcp://172.16.134.51:62616</acceptor>
   <acceptor name="invm">vm://0</acceptor>
</acceptors>
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>connector-server-0</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>connector-server-1</connector-ref>
         <connector-ref>connector-server-2</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
Slave 1 setup:
<ha-policy>
   <replication>
      <slave>
         <allow-failback>true</allow-failback>
      </slave>
   </replication>
</ha-policy>
<connectors>
   <connector name="connector-server-0">tcp://172.16.134.51:62616</connector>
   <connector name="connector-server-1">tcp://172.16.134.52:62616</connector>
   <connector name="connector-server-2">tcp://172.16.134.28:62616</connector>
</connectors>
<acceptors>
   <acceptor name="netty-acceptor">tcp://172.16.134.52:62616</acceptor>
   <acceptor name="invm">vm://0</acceptor>
</acceptors>
<cluster-connections>
   <cluster-connection name="cluster">
      <connector-ref>connector-server-1</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>connector-server-0</connector-ref>
         <connector-ref>connector-server-2</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
Slave 2 setup:
<ha-policy>
   <replication>
      <slave>
         <allow-failback>true</allow-failback>
      </slave>
   </replication>
</ha-policy>
<connectors>
   <connector name="connector-server-0">tcp://172.16.134.51:62616</connector>
   <connector name="connector-server-1">tcp://172.16.134.52:62616</connector>
   <connector name="connector-server-2">tcp://172.16.134.28:62616</connector>
</connectors>
<acceptors>
   <acceptor name="netty-acceptor">tcp://172.16.134.28:62616</acceptor>
   <acceptor name="invm">vm://0</acceptor>
</acceptors>
<cluster-connections>
   <cluster-connection name="cluster">
      <connector-ref>connector-server-2</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>connector-server-0</connector-ref>
         <connector-ref>connector-server-1</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
Could you please tell me what is not correct in my setup?
I'm using activemq-artemis version 2.17.0

I recommend you upgrade to the latest release and retry.
Also, I recommend simplifying your configuration to use just a single live/backup pair. The broker will only ever replicate data to one other broker. The second backup will be completely idle until either the master or current backup fails.
Lastly, using a single live/backup pair with the replication ha-policy is very dangerous due to the possibility of split-brain. I strongly recommend that you either use shared storage or, once you move to the latest release, configure pluggable quorum voting with ZooKeeper to mitigate the risk of split-brain.
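For reference, here is a rough sketch of what the ZooKeeper-backed pluggable quorum configuration can look like on the live broker in newer releases (2.18+). The connect-string value is illustrative, and the element names have shifted slightly across versions (e.g. primary/backup rather than master/slave), so treat this as an outline and check the documentation for the release you upgrade to:
<ha-policy>
   <replication>
      <primary>
         <manager>
            <!-- ZooKeeper/Curator-based manager; connect-string is a placeholder for your ZooKeeper ensemble -->
            <properties>
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
The backup broker would use the corresponding backup element with the same manager settings.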

Related

Example multiple servers with scale down

We are trying to use ActiveMQ Artemis within our Docker containers, but there is one scenario I am not able to get working. This is probably due to some bad configuration. Any help is appreciated (e.g. an example configuration).
Installation:
Docker instance containing an embedded ActiveMQ Artemis broker and a web application
The broker has clustering, HA and shared store defined
We start 3 docker instances
Scenario:
Add 200 messages to the queue in one of the web applications
I can see in the logging that all docker instances are handling the messages (this is as expected)
Kill one of the docker instances
Outcome of scenario:
Not all messages are being processed (every message on the queue should result in an item in the database)
Restarting the killed docker instance does not result in every message being processed.
Expected outcome:
When a node is down, another node picks up its messages
When a node comes back online, it helps pick up the messages
Questions:
HA scale-down probably does not work because I kill the server rather than shutting it down gracefully.
Does this only work with persistence on the file system or should this also work in an RDBMS?
Configuration:
Below is the configuration which is in every Docker instance; only the host name (project-two) and the HA settings (master/slave) differ per instance. There could be a typo below because I removed the customer-specific names from the configuration.
<configuration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns="urn:activemq"
               xsi:schemaLocation="urn:activemq /schema/artemis-server.xsd">
   <core xmlns="urn:activemq:core">
      <security-enabled>true</security-enabled>
      <jmx-management-enabled>true</jmx-management-enabled>
      <management-address>activemq.management</management-address>
      <persistence-enabled>true</persistence-enabled>
      <store>
         <database-store>
            <jdbc-driver-class-name>${artemis.databaseDriverClass}</jdbc-driver-class-name>
            <jdbc-connection-url>${artemis.databaseConnectionUrl}</jdbc-connection-url>
            <jdbc-user>${artemis.databaseUsername}</jdbc-user>
            <jdbc-password>${artemis.databasePassword}</jdbc-password>
            <bindings-table-name>ARTEMIS_BINDINGS</bindings-table-name>
            <message-table-name>ARTEMIS_MESSAGE</message-table-name>
            <page-store-table-name>ARTEMIS_PS</page-store-table-name>
            <large-message-table-name>ARTEMIS_LARGE_MESSAGES</large-message-table-name>
            <node-manager-store-table-name>ARTEMIS_NODE_MANAGER</node-manager-store-table-name>
         </database-store>
      </store>
      <connectors>
         <connector name="netty-connector">tcp://project-two:61617</connector>
      </connectors>
      <acceptors>
         <acceptor name="netty-acceptor">tcp://project-two:61617</acceptor>
      </acceptors>
      <!-- cluster information -->
      <broadcast-groups>
         <broadcast-group name="my-broadcast-group">
            <group-address>231.7.7.7</group-address>
            <group-port>9876</group-port>
            <broadcast-period>2000</broadcast-period>
            <connector-ref>netty-connector</connector-ref>
         </broadcast-group>
      </broadcast-groups>
      <discovery-groups>
         <discovery-group name="my-discovery-group">
            <group-address>231.7.7.7</group-address>
            <group-port>9876</group-port>
            <refresh-timeout>10000</refresh-timeout>
         </discovery-group>
      </discovery-groups>
      <cluster-connections>
         <cluster-connection name="my-cluster">
            <connector-ref>netty-connector</connector-ref>
            <retry-interval>500</retry-interval>
            <use-duplicate-detection>true</use-duplicate-detection>
            <message-load-balancing>ON_DEMAND</message-load-balancing>
            <max-hops>1</max-hops>
            <discovery-group-ref discovery-group-name="my-discovery-group"/>
         </cluster-connection>
      </cluster-connections>
      <security-settings>
      </security-settings>
      <!-- Settings for the redelivery -->
      <address-settings>
         <address-setting match="#">
            <redelivery-delay>5000</redelivery-delay>
            <max-delivery-attempts>2</max-delivery-attempts>
         </address-setting>
      </address-settings>
      <addresses>
      </addresses>
      <ha-policy>
         <shared-store>
            <slave/>
         </shared-store>
      </ha-policy>
   </core>
</configuration>
Docker makes deploying microservice applications very easy, but it has some limitations for a production environment. I would take a look at the ArtemisCloud.io operator, which provides a way to deploy the Apache ActiveMQ Artemis broker on Kubernetes.
The ArtemisCloud.io operator also supports message migration for broker scale-down; see https://artemiscloud.io/docs/tutorials/scaleup_and_scaledown/

ActiveMQ Artemis 2.21 Master/Slave Status

I have installed ActiveMQ Artemis 2.21.0 in a 1 master / 1 slave cluster with shared store. The cluster works fine; however, I don't know why the STATUS section under "Cluster Info" in the web console shows Lives: 2 when only one master is up. The backup count remains at 1, which is correct.
Can you help me with your feedback please?
The master configuration is as follows:
<connectors>
   <connector name="artemis">tcp://localhost:61616</connector>
</connectors>
<cluster-user>admin</cluster-user>
<cluster-password>admin</cluster-password>
<broadcast-groups>
   <broadcast-group name="bg-group1">
      <group-address>231.7.7.7</group-address>
      <group-port>9876</group-port>
      <broadcast-period>5000</broadcast-period>
      <connector-ref>artemis</connector-ref>
   </broadcast-group>
</broadcast-groups>
<discovery-groups>
   <discovery-group name="dg-group1">
      <group-address>231.7.7.7</group-address>
      <group-port>9876</group-port>
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>artemis</connector-ref>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>0</max-hops>
      <discovery-group-ref discovery-group-name="dg-group1"/>
   </cluster-connection>
</cluster-connections>
<ha-policy>
   <shared-store>
      <master>
         <failover-on-shutdown>true</failover-on-shutdown>
      </master>
   </shared-store>
</ha-policy>
The slave configuration is as follows:
<connectors>
   <connector name="artemis">tcp://localhost:61617</connector>
   <connector name="netty-live-connector">tcp://localhost:61616</connector>
</connectors>
<cluster-user>admin</cluster-user>
<cluster-password>admin</cluster-password>
<broadcast-groups>
   <broadcast-group name="bg-group1">
      <group-address>231.7.7.7</group-address>
      <group-port>9876</group-port>
      <broadcast-period>5000</broadcast-period>
      <connector-ref>artemis</connector-ref>
   </broadcast-group>
</broadcast-groups>
<discovery-groups>
   <discovery-group name="dg-group1">
      <group-address>231.7.7.7</group-address>
      <group-port>9876</group-port>
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>artemis</connector-ref>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>0</max-hops>
      <discovery-group-ref discovery-group-name="dg-group1"/>
   </cluster-connection>
</cluster-connections>
<ha-policy>
   <shared-store>
      <slave>
         <allow-failback>true</allow-failback>
      </slave>
   </shared-store>
</ha-policy>
You're using the default group-address and group-port (i.e. 231.7.7.7 & 9876 respectively) which means any other broker on the subnet also using the defaults will join that cluster. I recommend you try using a different group address and/or port and see if you still observe the same issue.
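For example, you might move both brokers off the defaults like this (the address and port values below are illustrative; the broadcast-group and discovery-group on every broker in the pair must agree on the same address/port combination):
<broadcast-groups>
   <broadcast-group name="bg-group1">
      <group-address>231.7.7.8</group-address>
      <group-port>9877</group-port>
      <broadcast-period>5000</broadcast-period>
      <connector-ref>artemis</connector-ref>
   </broadcast-group>
</broadcast-groups>
<discovery-groups>
   <discovery-group name="dg-group1">
      <group-address>231.7.7.8</group-address>
      <group-port>9877</group-port>
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>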

ActiveMQ Artemis two cluster in same network

I want to set up two clusters for Dev and QA (or prod and staging) in the same network. Each cluster consists of several nodes. The nodes are reachable on the network, and I want to prevent QA nodes from clustering with Dev nodes. What should I change to achieve this? The following is my config.
<connectors>
   <connector name="my-con">tcp://ip:61617</connector>
</connectors>
<acceptors>
   <acceptor name="my-acc">tcp://ip:61617</acceptor>
</acceptors>
<broadcast-groups>
   <broadcast-group name="bg-stage">
      <group-address>${udp-address:231.7.7.7}</group-address>
      <group-port>9877</group-port>
      <broadcast-period>100</broadcast-period>
      <connector-ref>my-con</connector-ref>
   </broadcast-group>
</broadcast-groups>
<discovery-groups>
   <discovery-group name="dg-stage">
      <group-address>${udp-address:231.7.7.7}</group-address>
      <group-port>9877</group-port>
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>my-con</connector-ref>
      <retry-interval>500</retry-interval>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <discovery-group-ref discovery-group-name="dg-stage"/>
   </cluster-connection>
</cluster-connections>
Do I need to use a different cluster-connection name, broadcast-group, and discovery-group?
Changing the group-port on the broadcast-group and discovery-group from the default 9876 to 9877 should be sufficient to isolate the clusters from each other. Changing the names in the configuration won't really impact anything other than your own ability to differentiate the configuration between the environments.
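As a sketch (names and values illustrative), the Dev brokers could stay on the default port while the staging/QA brokers keep your 9877, with each environment's discovery-group mirroring its own broadcast-group:
<!-- Dev brokers -->
<broadcast-group name="bg-dev">
   <group-address>231.7.7.7</group-address>
   <group-port>9876</group-port>
   <broadcast-period>100</broadcast-period>
   <connector-ref>my-con</connector-ref>
</broadcast-group>
<!-- QA/staging brokers -->
<broadcast-group name="bg-stage">
   <group-address>231.7.7.7</group-address>
   <group-port>9877</group-port>
   <broadcast-period>100</broadcast-period>
   <connector-ref>my-con</connector-ref>
</broadcast-group>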

Artemis Replication In Kubernetes Issues

I am deploying a simple master/slave replication configuration within a Kubernetes environment. They use static connectors. When I delete the master pod, the slave successfully takes over duties, but when the master pod comes back up the slave pod does not give up its live role, so I end up with two live servers. When this happens I notice they also form an internal bridge. I've run the exact same configurations locally, outside of Kubernetes, and the slave successfully steps down and goes back to being a slave when the master comes back up. Any ideas as to why this is happening? I am using Artemis version 2.6.4.
master broker.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<configuration xmlns="urn:activemq" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
   <jms xmlns="urn:activemq:jms">
      <queue name="jms.queue.acmadapter_to_acm_design">
         <durable>true</durable>
      </queue>
   </jms>
   <core xmlns="urn:activemq:core" xsi:schemaLocation="urn:activemq:core ">
      <acceptors>
         <acceptor name="netty-acceptor">tcp://0.0.0.0:61618</acceptor>
      </acceptors>
      <connectors>
         <connector name="netty-connector-master">tcp://artemis-service-0.artemis-service.falconx.svc.cluster.local:61618</connector>
         <connector name="netty-connector-backup">tcp://artemis-service2-0.artemis-service.falconx.svc.cluster.local:61618</connector>
      </connectors>
      <ha-policy>
         <replication>
            <master>
               <!--we need this for auto failback-->
               <check-for-live-server>true</check-for-live-server>
            </master>
         </replication>
      </ha-policy>
      <cluster-connections>
         <cluster-connection name="my-cluster">
            <connector-ref>netty-connector-master</connector-ref>
            <static-connectors>
               <connector-ref>netty-connector-master</connector-ref>
               <connector-ref>netty-connector-backup</connector-ref>
            </static-connectors>
         </cluster-connection>
      </cluster-connections>
   </core>
</configuration>
slave broker.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<configuration xmlns="urn:activemq" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
   <core xmlns="urn:activemq:core" xsi:schemaLocation="urn:activemq:core ">
      <acceptors>
         <acceptor name="netty-acceptor">tcp://0.0.0.0:61618</acceptor>
      </acceptors>
      <connectors>
         <connector name="netty-connector-backup">tcp://artemis-service2-0.artemis-service.test.svc.cluster.local:61618</connector>
         <connector name="netty-connector-master">tcp://artemis-service-0.artemis-service.test.svc.cluster.local:61618</connector>
      </connectors>
      <ha-policy>
         <replication>
            <slave>
               <allow-failback>true</allow-failback>
               <!-- not needed but tells the backup not to restart after failback as there will be > 0 backups saved -->
               <max-saved-replicated-journals-size>0</max-saved-replicated-journals-size>
            </slave>
         </replication>
      </ha-policy>
      <cluster-connections>
         <cluster-connection name="my-cluster">
            <connector-ref>netty-connector-backup</connector-ref>
            <static-connectors>
               <connector-ref>netty-connector-master</connector-ref>
               <connector-ref>netty-connector-backup</connector-ref>
            </static-connectors>
         </cluster-connection>
      </cluster-connections>
   </core>
</configuration>
It's likely that the master's journal is being lost when it is stopped. The journal (specifically the server.lock file) holds the node's unique identifier (which is shared by the replicated slave). If the journal is lost when the node is dropped then it has no way to pair with its slave when it comes back up, which would explain the behavior you're observing. Make sure the journal is on a persistent volume claim.
Also, it's worth noting that a single master/slave pair is not recommended due to the risk of split brain. In general, it's recommended that you have 3 master/slave pairs to establish a proper quorum.
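To verify where the journal lives, check the data directories in broker.xml and make sure they point at a path backed by a persistent volume claim rather than the container's ephemeral filesystem. A minimal sketch, where /var/lib/artemis-data is an illustrative PVC mount point:
<!-- all four directories assumed to sit on the PVC mounted at /var/lib/artemis-data -->
<paging-directory>/var/lib/artemis-data/paging</paging-directory>
<bindings-directory>/var/lib/artemis-data/bindings</bindings-directory>
<journal-directory>/var/lib/artemis-data/journal</journal-directory>
<large-messages-directory>/var/lib/artemis-data/large-messages</large-messages-directory>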

Wildfly 14 + Artemis pooledconnectionfactory jmsXA issue

I configured Artemis with the ha-policy below:
Machine1: artemis-server1-master
<ha-policy>
   <shared-store>
      <master>
         <failover-on-shutdown>true</failover-on-shutdown>
      </master>
   </shared-store>
</ha-policy>
Machine2: artemis-server2-master
<ha-policy>
   <shared-store>
      <master>
         <failover-on-shutdown>true</failover-on-shutdown>
      </master>
   </shared-store>
</ha-policy>
server1-slave & server2-slave
<ha-policy>
   <shared-store>
      <slave>
         <failover-on-shutdown>true</failover-on-shutdown>
         <restart-backup>true</restart-backup>
         <allow-failback>true</allow-failback>
      </slave>
   </shared-store>
</ha-policy>
In Wildfly, I configured the Artemis settings below:
<subsystem xmlns="urn:jboss:domain:messaging-activemq:4.0">
   <server name="default">
      <http-connector name="http-connector" socket-binding="http" endpoint="http-acceptor"/>
      <remote-connector name="remote-artemis1" socket-binding="remote-artemis1"/>
      <remote-connector name="remote-artemis2" socket-binding="remote-artemis2"/>
      <connection-factory name="RemoteConnectionFactory" entries="java:jboss/exported/jms/RemoteConnectionFactory" connectors="remote-artemis1 remote-artemis2" client-failure-check-period="1000" reconnect-attempts="-1" retry-interval="1000" ha="true"/>
      <pooled-connection-factory name="activemq-ra" entries="java:/JmsXA java:jboss/DefaultJMSConnectionFactory" connectors="remote-artemis1 remote-artemis2" client-id="Wildfly-14" transaction="xa" user="admin" password="admin" client-failure-check-period="1000" reconnect-attempts="-1" retry-interval="1000" ha="true"/>
   </server>
</subsystem>
In the EAR application, there is an MDB for receiving messages, and java:/jmsXA is used to look up the connection factory.
With the above configuration, after starting Wildfly, messages are processed and sent to the queue successfully (i.e. connections are created using the jmsXA connection factory). But if one of the Artemis masters is shut down while a few files are being processed, the MDB still receives messages successfully, yet the java:/jmsXA connection factory can no longer send messages, even though the other master is not down.