I have just installed MSMQ in a cluster and am now testing how it behaves. It appears that when the active cluster node is switched, all messages that were in the queue are lost (even when we switch back to the original node). This seems like undesired behavior to me. I thought that all messages from the source node would be moved to the destination node on a node switch.
I tested the node switch via the Pause > Drain roles menu item and via the Move > Select node menu item.
I want to know whether the described behavior is how MSMQ in a cluster is supposed to behave, or whether it is a misconfiguration issue.
Update: I found a similar question here: MSMQ Cluster losing messages on failover. But the solution did not help in my situation.
It seems that I was sending messages to the queue that were not recoverable (as described here: https://blogs.msdn.microsoft.com/johnbreakwell/2009/06/03/i-restarted-msmq-and-all-my-messages-have-vanished). That is why these messages did not survive a service restart. Once I started sending messages with the Recoverable flag set, they survived both a service restart and a cluster node switch.
Related
A JMS inbound gateway is used for request processing on the worker side. The CustomMessageListenerContainer class is configured to limit the maximum number of back-off attempts.
In some scenarios, when the ActiveMQ server is not responding before the max-attempts limit is reached, the container is stopped with the message below.
"Stopping container for destination 'senExtractWorkerInGateway': back-off policy does not allow for further attempts."
I am wondering whether there is any configuration available to recover these containers once ActiveMQ is available again.
A sample configuration is given below:
<int-jms:inbound-gateway
id="senExtractWorkerInGateway"
container-class="com.test.batch.worker.CustomMessageListenerContainer"
connection-factory="jmsConnectionFactory"
correlation-key="JMSCorrelationID"
request-channel="senExtractProcessingWorkerRequestChannel"
request-destination-name="senExtractRequestQueue"
reply-channel="senExtractProcessingWorkerReplyChannel"
default-reply-queue-name="senExtractReplyQueue"
auto-startup="false"
concurrent-consumers="25"
max-concurrent-consumers="25"
reply-timeout="1200000"
receive-timeout="1200000"/>
You could probably emit an ApplicationEvent from the applyBackOffTime() of your CustomMessageListenerContainer when the super call returns false. This way you would know that something is wrong with the ActiveMQ connection. At that moment you also need to stop() your senExtractWorkerInGateway - just autowire it into some controlling service as a Lifecycle. When you are done fixing the connection problem, you just need to start this senExtractWorkerInGateway again. The CustomMessageListenerContainer will then be started automatically.
How to stop the ENTIRE cluster with sharding (spanning multiple machines - nodes) from one actor?
I know I can stop the actor system on 'this' node with context.system.terminate().
I know I can stop the local Sharding Region.
I found .prepareForFullClusterShutdown() but it doesn't actually stop the nodes.
I suppose there is no single command to do that, but there must be some way to do this.
There's no out-of-the-box way to do this that I'm aware of: the overall expectation is that there's an external control plane (e.g. kubernetes) which manages this.
However, one could have an actor on every node of the cluster that listens for membership events and also subscribes to a pubsub topic. This actor would track the current cluster membership and, when told to begin a cluster shutdown, it publishes a (e.g.) ShutdownCluster message to the topic and tracks which nodes leave. After some length of time (since distributed pubsub is at-most-once) if there are nodes besides this one that haven't left, it sends it again. Eventually, after all other nodes in the cluster have left, this actor then shuts down its node. When other nodes see a ShutdownCluster message, they immediately shut themselves down.
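That coordination loop can be sketched in plain Python (this is a simulation of the protocol only, not Akka code; the names `Node`, `LossyTopic`, and `shutdown_cluster` are illustrative, and `LossyTopic` stands in for at-most-once distributed pub/sub):

```python
import random

class Node:
    """A peer node: shuts itself down as soon as it receives ShutdownCluster."""
    def __init__(self, name):
        self.name = name
        self.left = False

    def deliver(self, msg):
        if msg == "ShutdownCluster":
            self.left = True

class LossyTopic:
    """Models at-most-once pub/sub: each publish reaches each subscriber
    only with some probability, so messages may be dropped."""
    def __init__(self, subscribers, delivery_rate=0.5, rng=None):
        self.subscribers = subscribers
        self.delivery_rate = delivery_rate
        self.rng = rng or random.Random(42)  # seeded for reproducibility

    def publish(self, msg):
        for sub in self.subscribers:
            if self.rng.random() < self.delivery_rate:
                sub.deliver(msg)

def shutdown_cluster(topic, peers, max_rounds=100):
    """Re-publish ShutdownCluster until every peer has left, then the
    coordinator's own node can safely terminate."""
    for round_no in range(1, max_rounds + 1):
        if all(p.left for p in peers):
            return round_no  # safe to terminate this node now
        topic.publish("ShutdownCluster")
    raise RuntimeError("some nodes never left the cluster")

peers = [Node(f"node-{i}") for i in range(5)]
topic = LossyTopic(peers)
rounds = shutdown_cluster(topic, peers)
print(f"all peers left after {rounds} rounds")
```

The retry loop is what compensates for the at-most-once delivery: without it, any dropped ShutdownCluster message would leave a node running forever.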
Of course, this sort of scheme will probably not play nicely with any form of external orchestration (whether it's a container scheduler like kubernetes, mesos, or nomad; or even something simple like monit which notices that the service isn't running and restarts it).
We are a young team building an application using Storm and Kafka.
We have a common ZooKeeper ensemble of 3 nodes, which is used by both Storm and Kafka.
I wrote test cases to test ZooKeeper failover:
1) Check that all three nodes are running and confirm that one is elected as the leader.
2) Using the ZooKeeper unix client, create a znode and set a value. Verify that the value is reflected on the other nodes.
3) Modify the znode: set a value on one node and verify that the other nodes have the change reflected.
4) Kill one of the follower nodes and make sure the leader is notified about the crash.
5) Kill the leader node. Verify that one of the other two nodes is elected as the new leader.
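For step 1, the leader check can be scripted against ZooKeeper's four-letter-word admin commands instead of eyeballing the client. A sketch (hostnames and port are assumptions; note that recent ZooKeeper versions require `srvr` to be whitelisted via `4lw.commands.whitelist`):

```python
import socket

def zk_command(host, port, cmd=b"srvr", timeout=5.0):
    """Send a ZooKeeper four-letter-word command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def parse_mode(srvr_output):
    """Extract the 'Mode:' value (leader / follower / standalone) from srvr output."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None

# Example usage against a live ensemble (hostnames are placeholders):
# modes = [parse_mode(zk_command(h, 2181)) for h in ("zk1", "zk2", "zk3")]
# assert modes.count("leader") == 1
```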
Do I need to add any more test cases? Any additional ideas/suggestions/pointers?
From the documentation:
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
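For reference, the timeout the passage mentions is, to my knowledge, configured in core-site.xml; the value shown below is the default (in milliseconds):

```xml
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>5000</value>
  <!-- Lower values detect a crashed NameNode faster but risk spurious failovers -->
</property>
```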
More on setting up automatic failover.
I am having an issue with MSMQ in a clustered environment. I have the following setup:
2 nodes set up in a Windows failover cluster; let's call them "Node A" and "Node B".
I have then set up a clustered instance of MSMQ; let's call it "MSMQ Instance".
I have also set up a clustered instance of the DTC; let's call it "DTC Instance".
Within the DTC instance, I have allowed access both locally and through the clustered instance; basically, I have turned off all authentication for testing.
I have also created a clustered instance of our in-house application; let's call it "Application Instance". Within this Application Instance, I have added other resources, which are other services the application uses, and also the Net.MSMQ adapter.
The Issue.......
When I cluster the Application Instance, it always seems to set the owner to the opposite node from the one I am working on: if I create the clustered instance on Node A, it always sets the current owner to Node B. However, that is not the issue.
The issue I have is that as long as the Application Instance is running on Node B, MSMQ seems to work.
The outbound queues are created locally, receive messages and are then processed through the MSMQ Cluster.
If I then fail over to Node A, MSMQ refuses to work. The outbound queues are not created and therefore no messages are processed.
I get an error in Event Viewer:
"The version check failed with the error: 'Unrecognized error -1072824309 (0xc00e000b)'. The version of MSMQ cannot be detected All operations that are on the queued channel will fail. Ensure that MSMQ is installed and is available"
If I then fail over back to Node B, it works.
The Application has been setup to use the MSMQ instance and all the permissions are correct.
Do I need to have a clustered instance of DTC, or can I just configure it as a resource within the MSMQ instance?
Can anybody shed any light on this, as I have hit a brick wall?
Yes, you will need to have a clustered DTC setup.
For your clustered MSMQ instance you will then need to configure the clustered DTC as a "dependency": right-click on MSMQ -> Properties -> Dependencies.
I do not know if this is mandatory in all cases, but on our cluster we also have a file share configured as a dependency for MSMQ. To my understanding, this ensures that temporary files needed by MSMQ are still available after a node switch.
Additionally, here are two articles that I found very helpful in setting up the cluster nodes. They might be helpful in confirming step-by-step that your configurations are correct:
"Building MSMQ cluster". You will find several other links in that article that will guide you further.
Microsoft also has a detailed document: "Deploying Message Queuing (MSMQ) 3.0 in a Server Cluster".
We are having MSMQ issues in a load balanced, high volume environment using NServiceBus.
Our environment looks as follows: 1 F5 distributing web traffic via round robin to 6 application servers. Each of these 6 servers uses a Bus.Send to 1 queue on a remote machine that resides on a cluster.
The event throughput during normal usage is approximately 5-10 per second, per server. So 30-60 events per second in the entire environment, depending on load.
The issue we're seeing is that 1 of the application boxes is able to send messages to the cluster queue, but the other 5 are not. Looking at the 5 boxes experiencing failure, the outgoing queue to the cluster is inactive.
There is also a high number of messages in the transactional dead-letter queue. When we purge that queue, the outgoing queue connects to the cluster; however, messages then accumulate as unacknowledged in the outgoing queue. This continues until they move into the transactional dead-letter queue again, and the outgoing queue changes state to inactive.
Interestingly, when we perform this purge operation, a different box becomes the 'good box'. So we are pretty sure the issue is not one bad box; it is that only one box at a time can reliably maintain a connection to the cluster queue.
Has anybody come across this before?
We have, and it was because of the issue described here: http://blogs.msdn.com/b/johnbreakwell/archive/2007/02/06/msmq-prefers-to-be-unique.aspx
Short version: every MSMQ installation is assigned a unique ID when MSMQ is installed. It is called the QMId and is located in the registry under
HKLM\Software\Microsoft\MSMQ\Parameters\MachineCache\QMId
It is used as an identifier when sending to a remote receiver, which in turn uses it to send ACKs back to the correct sender. The receiver, in your case the cluster, maintains a cache that maps QMIds to IPs. Our problem was that several of our workers had the SAME QMId. This meant the cluster sent all ACKs for messages from all of those machines to the first machine that sent a message. At some point, and after certain operations such as a restart of the MSMQ Windows service, the cache expires and ANOTHER machine magically "works".
So check your 6 servers and make sure none of them share the same QMId. Ours had the same value because they were all ghosted from a Windows image that was taken after MSMQ was installed.
The fix is easy, just reinstall the MSMQ feature on each machine to generate a new unique QMId.
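The failure mode is easy to reproduce in miniature. Here is a small Python sketch of the receiver-side QMId-to-address cache (illustrative names only, not MSMQ's actual implementation):

```python
class Receiver:
    """Models the receiving queue manager: it remembers the first address
    seen for each QMId and routes every ACK for that QMId there."""
    def __init__(self):
        self.qmid_to_addr = {}   # the "Machine Cache" analogue
        self.acks = {}           # address -> number of ACKs routed there

    def on_message(self, qmid, sender_addr):
        # First sender wins: a later sender reusing the same QMId is ignored.
        self.qmid_to_addr.setdefault(qmid, sender_addr)
        ack_addr = self.qmid_to_addr[qmid]
        self.acks[ack_addr] = self.acks.get(ack_addr, 0) + 1

receiver = Receiver()

# Two app servers cloned from the same image share one QMId...
receiver.on_message(qmid="ABC-123", sender_addr="10.0.0.1")
receiver.on_message(qmid="ABC-123", sender_addr="10.0.0.2")
# ...while a server with a unique QMId behaves normally.
receiver.on_message(qmid="XYZ-789", sender_addr="10.0.0.3")

# All ACKs for the shared QMId went to the first machine; 10.0.0.2 gets none,
# so its outgoing queue would sit unacknowledged and eventually go inactive.
print(receiver.acks)  # {'10.0.0.1': 2, '10.0.0.3': 1}
```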
If your machines were created from the same image, you probably have non-unique QMId values in the MachineCache registry key. You can fix this by running the following PowerShell script on each machine.
This can be done before the image is created, or on each machine after it is launched.
# Delete the cloned QMId and tell MSMQ to generate a fresh one on the next service start
Remove-ItemProperty -Path 'Registry::HKEY_LOCAL_MACHINE\Software\Microsoft\MSMQ\Parameters\MachineCache' -Name 'QMId'
Set-ItemProperty -Path 'Registry::HKEY_LOCAL_MACHINE\Software\Microsoft\MSMQ\Parameters' -Name 'SysPrep' -Value 1
Restart-Service -Name 'MSMQ'