MSMQ Cluster losing messages on failover

I've got an MSMQ cluster set up with two nodes (active/passive) that share a drive.
Here are the tests I'm performing. I send recoverable messages to the queue. I then take the MSMQ cluster group offline and bring it online again.
Result: The messages are still there.
I then simulate failover by moving the group to node 2. Moves over successfully, but the messages aren't there.
I'm sending the messages as recoverable and the MSMQ cluster group has a drive that both nodes can access.
Anyone?
More Info:
The Quorum drive stays only on node 1.
I have two service/app groups. One MSMQ and one that is a generic service group.
Even more info:
When node 1 is active, I pump it full of messages. I fail over to node 2: zero messages in the queue on node 2. Then I fail back to node 1, and the messages are there on node 1.

You haven't clustered MSMQ or aren't using clustered MSMQ properly.
What you are looking at are the local MSMQ services.
http://blogs.msdn.com/b/johnbreakwell/archive/2008/02/18/clustering-msmq-applications-rule-1.aspx
Cheers
John
UPDATE:
OK, maybe the drive letter being used isn't consistent across the nodes.
What is the storage location being used by clustered MSMQ?
If you open this storage location up in Explorer from Node 1 AND Node 2 at the same time, are the folder contents exactly the same? If you create a text file via Node 1's Explorer window, does it appear after a refresh in Node 2's Explorer window?

Messages are stuck in ActiveMQ Artemis cluster queues

We have a problem with Apache ActiveMQ Artemis cluster queues. Sometimes messages begin to pile up in particular cluster queues. It usually happens 1-4 times per day, mostly in production (it has happened only once on one of the test environments in the last 90 days).
These messages are not delivered to consumers on other cluster brokers until we restart the cluster connector (or the entire broker).
The problem looks related to ARTEMIS-3809.
Our setup is: 6 servers in one environment (3 pairs of master/backup servers). Operating system is Linux (Red Hat).
We have tried to:
upgrade from 2.22.0 to 2.23.1
increase minLargeMessageSize on the cluster connectors to 1024000
The messages still get stuck in the cluster queues.
Another problem: I tried to configure min-large-message-size in the cluster-connection as described in the documentation, but it caused errors at startup (broker.xml did not pass XSD validation), so the only option was to specify minLargeMessageSize in the URL parameters of the connector for each cluster broker. I don't know whether this setting has any effect.
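For illustration, that URL-parameter workaround ends up looking roughly like this in broker.xml (the connector name and host below are placeholders, not our real values):
<connectors>
   <!-- minLargeMessageSize passed as a URI parameter on the cluster connector -->
   <connector name="cluster-connector">tcp://broker1.example.com:61616?minLargeMessageSize=1024000</connector>
</connectors>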
So we had to make a script which checks whether messages are stuck in the cluster queues and restarts the cluster connector.
How can we debug this situation?
When the messages are stuck, nothing suspicious is written to the log (no errors, no stack traces, etc.).
Which loggers (for which classes) should we set to debug or trace level to find out what is happening with the cluster connectors?
I believe you can remedy the situation by setting this on your cluster-connection:
<producer-window-size>-1</producer-window-size>
See ARTEMIS-3805 for more details.
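For reference, a minimal sketch of where that setting lives in broker.xml; the cluster name, connector refs, and the other values here are placeholders, and element order has to follow the broker.xml schema:
<cluster-connections>
   <cluster-connection name="my-cluster">
      <connector-ref>netty-connector</connector-ref>
      <use-duplicate-detection>true</use-duplicate-detection>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <!-- -1 disables producer flow control on the cluster bridge -->
      <producer-window-size>-1</producer-window-size>
      <static-connectors>
         <connector-ref>other-broker-connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>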
Generally speaking, moving messages around the cluster via the cluster-connection, while convenient, isn't terribly efficient (much less so for "large" messages). Ideally you would have a sufficient number of clients on each node to consume the messages that were originally produced there. If you don't have that many clients then you may want to re-evaluate the size of your cluster, as it may actually decrease overall message throughput rather than increase it.
If you're just using 3 HA pairs in order to establish a quorum for replication then you should investigate the recently added pluggable quorum voting, which allows integration with a 3rd-party component (e.g. ZooKeeper) for leader election, eliminating the need for a quorum of brokers.

How to add two more Kafka brokers on the local machine when the currently running Kafka broker already has data

I have one broker running on my local Windows machine, with 2-3 topics that have messages stored. I want to scale up by adding two more broker instances. I have followed all the steps to configure 3 brokers on the same machine by creating different properties files.
Broker 0 gets shut down when I start the broker 1 server, with the error below.
[2019-07-11 13:56:33,580] INFO Stopping serving logs in dir C:\kafka_2.12-2.2.1\data\kafka (kafka.log.LogManager)
[2019-07-11 13:56:33,585] ERROR Shutdown broker because all log dirs in C:\kafka_2.12-2.2.1\data\kafka have failed (kafka.log.LogManager)
Is it possible to add more brokers if my existing broker instance already has data?
Or do I need to delete the data directory and start broker 0 fresh? Is there any way to preserve the data without deleting it from the Kafka server?
Yes, you can add brokers to your cluster and migrate/spread data across all your brokers.
The Expanding your cluster section in the documentation details the steps to achieve this.
After starting the new brokers, you basically need to use the bin/kafka-reassign-partitions.sh tool (other 3rd party tools also exists) to move data onto them.
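As a rough sketch (the topic names are placeholders; on 2.2.x the tool still talks to ZooKeeper, while newer releases use --bootstrap-server instead of --zookeeper):
# topics.json lists the topics to spread over the new brokers, e.g. {"version":1,"topics":[{"topic":"my-topic"}]}
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file topics.json --broker-list "0,1,2" --generate
# save the proposed assignment it prints into reassign.json, then:
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassign.json --execute
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassign.json --verify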
Please note however that adding brokers on the same machine does not provide much resiliency: if the machine were to go down, all brokers would be affected. But if you just want to play around and learn about Kafka, that may be fine.
To run multiple brokers on the same physical machine, each broker's config must specify a unique broker.id, a different log.dirs (log directories cannot be shared between brokers), and a different port in listeners.
For example,
config/server{1,2,3}.properties
and in each config set different values:
broker.id=<id>
log.dirs=/data/kafka<id>
listeners=PLAINTEXT://localhost:909<id>
When all three brokers are up, new topics will be created across the cluster, but existing ones need to be reassigned manually (e.g. with kafka-reassign-partitions.sh, as described above).

During rolling upgrade/restart, how to detect when a kafka broker is "done"?

I need to automate a rolling restart of a kafka cluster (3 kafka brokers). I can easily do it manually - restart one after the other, while checking the log to see when it's fine (e.g., when the new process has joined the cluster).
What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, all topics up-to-date and such? In my restart script, I have access to the metrics, but to be frank, I did not really see one there which gives me a clear picture.
Another way to ask this: what would be a good "readiness" probe that does not simply check some TCP/IP port, but looks at the actual server state?
I would suggest exposing JMX metrics and tracking the following for cluster health (a minimal JMX polling sketch follows the list):
the controller count (must be 1 over the whole cluster)
under replicated partitions (should be zero for healthy cluster)
unclean leader elections (if you don't disable these in server.properties, make sure there are none in the metric counts)
ISR shrinks within a reasonable time period, like 10 minute window (should be none)
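For example, a minimal JMX polling sketch for the first two of those; the host and port are placeholders and assume remote JMX is enabled on the broker (e.g. via JMX_PORT), while the MBean names are the standard Kafka ones:
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerHealthCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port for the broker's remote JMX endpoint
        String url = "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Under-replicated partitions reported by this broker (should be 0 when healthy)
            Object urp = mbsc.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                    "Value");

            // Active controller count on this broker (exactly one broker in the cluster reports 1)
            Object controllers = mbsc.getAttribute(
                    new ObjectName("kafka.controller:type=KafkaController,name=ActiveControllerCount"),
                    "Value");

            System.out.println("UnderReplicatedPartitions=" + urp
                    + " ActiveControllerCount=" + controllers);
        }
    }
}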
Also, Yelp has tooling for rolling restarts implemented in Python; it requires Jolokia JMX agents installed on the brokers and polls the metrics to make sure some of the above conditions hold.
Assuming your cluster was healthy at the beginning of the restart operation, at a minimum, after each broker restart, you should ensure that the under-replicated partition count returns to zero before restarting the next broker.
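If you prefer a script over JMX, one simple check is the kafka-topics tool, which prints nothing once there are no under-replicated partitions (newer versions; older ones take --zookeeper instead of --bootstrap-server):
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions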
As the previous responders mentioned, there is existing code out there to automate this. I don't use Jolokia myself, but my solution (which I'm working on now) also uses JMX metrics.
Kafka Utils by Yelp is one of the best tools for detecting when a Kafka broker is "done". Specifically, kafka_rolling_restart is the tool: it gets broker details from ZooKeeper and URP (Under Replicated Partitions) metrics from each broker. When a broker is restarted, the total URP count across the Kafka cluster is collected periodically, and when it reaches zero, the next broker is restarted. The controller broker is restarted last.

MSMQ in cluster

I configured MSMQ to run in a cluster. The cluster consists of two Hyper-V virtual machines and uses common storage on a third virtual machine (all virtual machines share a Windows domain and see each other over the network). The Failover Cluster Manager snap-in shows that the MSMQ service is running. The non-clustered MSMQ services on the machines which are members of the cluster are shown as running in the Services snap-in. Now I try to send a message from a remote computer (from the third virtual machine) to the clustered MSMQ service and to the non-clustered MSMQ services. I use the following queue names:
FormatName:Direct=OS:{clustered-msmq-netbios-name}\private$\{queueName}
FormatName:Direct=TCP:{clustered-msmq-ip-address}\private$\{queueName}
FormatName:Direct=TCP:{non-clustered-msmq-ip-address}\private$\{queueName}
When the non-clustered MSMQ IP address is specified, the message is delivered to the non-clustered MSMQ instance. But when I try to reach the clustered MSMQ instance, the sent message stays in the outgoing message queue with the state "Waiting to connect" (Failed to connect Winsock socket). And the queue on the clustered MSMQ instance stays empty.
I tried to connect to the clustered MSMQ service with telnet, specifying the clustered MSMQ IP address and port 1801. It says "Could not open connection to the host, on port 1801: Connect failed".
Any idea?
Additional information: when I click the "Manage Message Queuing" menu item while both cluster servers are online, there is no Message Queuing item in the snap-in tree. When I pause one server (the second one), a Message Queuing item appears in the tree. And when the Message Queuing item is there, the messages start to be processed (I see them disappear from the outgoing queue on the sending server, but I don't see them appear on the receiving server).
It seems that you can manage clustered Message Queuing only from the cluster node that currently owns the role. On the node that is not currently active, there is no "Manage Message Queuing" menu item.
As for the issue of messages not being delivered to the clustered MSMQ instance: I simply reinstalled the MSMQ Windows feature on one of the cluster nodes and recreated the MSMQ cluster role. After these manipulations, message delivery just started working.

JBOSS messaging replicated queue

I am using JBoss Messaging in the following way:
1) Two JBoss instances using the 'all' (i.e. clustered) config
2) One replicated queue created on each JBoss instance with the same JNDI name (clustered = true); see the deployment sketch after this list
3) One producer attached locally to the queue on each instance (i.e. the producers on both nodes keep adding messages to this replicated queue)
4) One JBoss instance is marked as the "consumer node" and the message consumer for the queue is started only on this node (i.e. messages are consumed on only one node). There is logic which decides which JBoss instance is marked as the "consumer node"
5) The PostOffice used is clustered
6) The server peer is configured not to enforce message sequencing
7) Produced messages are non-persistent (deliveryMode = NON_PERSISTENT)
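For reference, the clustered queue in point 2 is deployed on each node with a descriptor along these lines (the queue name and JNDI name here are placeholders, following the usual JBoss Messaging clustered-destination example, not my exact config):
<mbean code="org.jboss.jms.server.destination.QueueService"
       name="jboss.messaging.destination:service=Queue,name=MyReplicatedQueue"
       xmbean-dd="xmdesc/Queue-xmbean.xml">
   <depends optional-attribute-name="ServerPeer">jboss.messaging:service=ServerPeer</depends>
   <depends>jboss.messaging:service=PostOffice</depends>
   <attribute name="Clustered">true</attribute>
   <attribute name="JNDIName">/queue/MyReplicatedQueue</attribute>
</mbean>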
But I am facing a problem with this: messages produced on the "non consumer node" do not get replicated to the queue on the "consumer node" and hence are not available for consumption.
I enabled logging and checked that the post office finds two queues but only delivers to the local queue, as it discovers that the remote queue is recoverable.
Any idea how to get this working?
FYI: I believe a message can be delivered to only one queue (local or remote). So I want a single distributed queue, but I am currently getting two different distributed queues (although their JNDI name is the same). Is this a problem? If yes, how do I solve it? WebLogic provides the option of creating a queue on the admin server, which makes a shared queue possible; what is the similar mechanism in JBoss Messaging? Or should I approach this as two queues which need to be synchronized? If so, how do I achieve synchronization between them?
Thanks for taking some time to help me!
Regards