NServiceBus failing while sending messages to an MSMQ cluster queue in a load-balanced environment

We are having MSMQ issues in a load-balanced, high-volume environment using NServiceBus.
Our environment looks as follows: one F5 load balancer distributing web traffic round-robin to 6 application servers. Each of these 6 servers uses Bus.Send to a single queue on a remote machine that resides on a cluster.
The throughput during normal usage is approximately 5-10 messages per second per server, so 30-60 messages per second across the entire environment, depending on load.
The issue we're seeing is that one of the application boxes is able to send messages to the cluster queue, but the other five are not. Looking at the five boxes experiencing failure, the outgoing queue to the cluster is inactive.
There is also a high number of messages in the transactional dead-letter queue. When we purge that queue, the outgoing queue connects to the cluster; however, unacknowledged messages then accumulate in the outgoing queue until they move into the transactional dead-letter queue again, and the outgoing queue changes state back to inactive.
Interestingly, when we perform this purge operation, a different box becomes the 'good box'. So we're pretty sure the issue is not one bad box; it's that only one box at a time can reliably maintain a connection to the cluster queue.
Has anybody come across this before?

We have, and it was because of the issue described here: http://blogs.msdn.com/b/johnbreakwell/archive/2007/02/06/msmq-prefers-to-be-unique.aspx
Short version: every MSMQ installation is assigned a unique ID, called the QMId, when MSMQ is installed. It is located in the registry under
HKLM\Software\Microsoft\MSMQ\Parameters\MachineCache\QMId
It is used as an identifier when sending to a remote receiver, which in turn uses it to send ACKs back to the correct sender. The receiver, in your case the cluster, maintains a cache that maps QMIds to IP addresses. Our problem was that several of our workers had the SAME QMId. This meant the cluster sent the ACKs for all messages from all the machines to the first machine that sent a message. At some point, and after some operations like an MSMQ Windows service restart, the cache expires and ANOTHER machine magically "works".
So check your 6 servers and make sure none of them has the same QMId. Ours had the same value because they were all ghosted from a Windows image that was taken after MSMQ was installed.
The fix is easy: just reinstall the MSMQ feature on each machine to generate a new unique QMId.
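A quick way to compare the IDs is to read the value from each box. Here is a minimal C# sketch, assuming remote registry access is enabled and you have administrative rights on the servers (QmidCheck is just a hypothetical name; the QMId is stored as a binary GUID):

using System;
using Microsoft.Win32;

class QmidCheck
{
    static void Main(string[] args)
    {
        // Compare the QMId of each server passed on the command line,
        // e.g. QmidCheck.exe app01 app02 app03
        foreach (var server in args)
        {
            using (var hklm = RegistryKey.OpenRemoteBaseKey(RegistryHive.LocalMachine, server))
            using (var key = hklm.OpenSubKey(@"Software\Microsoft\MSMQ\Parameters\MachineCache"))
            {
                var qmid = (byte[])key.GetValue("QMId");
                Console.WriteLine("{0}: {1}", server, BitConverter.ToString(qmid));
            }
        }
    }
}

Any two servers printing the same byte string share a QMId and need the fix below.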

If your machines are created from the same image, you probably have non-unique QMIds in the MachineCache registry key. You can fix this by running the following PowerShell script on each machine.
This can be done before the image is captured, or on each machine after it is launched.
# Remove the duplicated QMId value.
Remove-ItemProperty -Path 'Registry::HKEY_LOCAL_MACHINE\Software\Microsoft\MSMQ\Parameters\MachineCache' -Name 'QMId'
# Setting the SysPrep flag makes MSMQ generate a fresh QMId on its next start.
Set-ItemProperty -Path 'Registry::HKEY_LOCAL_MACHINE\Software\Microsoft\MSMQ\Parameters' -Name 'SysPrep' -Value 1 -Type DWord
# Restart the service so the new QMId is generated.
Restart-Service -Name 'MSMQ'

Related

MSMQ in cluster behavior on node switch

I just installed MSMQ in a cluster and am now testing how it behaves. It appears that when the active cluster node is switched, all messages which were in the queue are lost (even when we switch back to the original node). This seems like undesired behavior to me; I thought that all messages from the source node should be moved to the destination node on a node switch.
I tested the node switch via the Pause > Drain roles menu item and via the Move > Select node menu item.
I want to know whether the described behavior is how MSMQ in a cluster should behave, or whether it is a misconfiguration issue.
Update: found a similar question here: MSMQ Cluster losing messages on failover. But the solution did not help in my situation.
It turned out that I was sending messages to the queue that were not recoverable (as described here: https://blogs.msdn.microsoft.com/johnbreakwell/2009/06/03/i-restarted-msmq-and-all-my-messages-have-vanished). That's why those messages didn't survive a service restart. Once I sent messages with the Recoverable flag set, they survived a service restart and a cluster node switch.
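For illustration, a minimal C# sketch of sending with the Recoverable flag (the queue path is a placeholder):

using System.Messaging;

class RecoverableSend
{
    static void Main()
    {
        using (var queue = new MessageQueue(@".\private$\clustertest"))
        using (var message = new Message("some payload"))
        {
            // Persist the message to disk instead of keeping it in memory only,
            // so it survives a service restart or node switch.
            message.Recoverable = true;
            queue.Send(message);
        }
    }
}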

ZooKeeper Failover Strategies

We are a young team building an application using Storm and Kafka.
We have a common ZooKeeper ensemble of 3 nodes which is used by both Storm and Kafka.
I wrote a test case to test ZooKeeper failovers:
1) Check that all three nodes are running and confirm one is elected as leader.
2) Using the ZooKeeper command-line client, create a znode and set a value. Verify the value is reflected on the other nodes.
3) Modify the znode: set the value on one node and verify the other nodes have the change reflected.
4) Kill one of the worker nodes and make sure the master/leader is notified about the crash.
5) Kill the leader node. Verify that one of the other two nodes is elected as leader.
Do I need to add any more test cases? Any additional ideas/suggestions/pointers?
From the Hadoop HDFS high-availability documentation:
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
more on setting up automatic failover

LiveRebel Update Strategy

I am trying to utilize LiveRebel in my production environment. After most parts were configured, I tried to perform an update of my application from, let's say, version 1.1 to 1.3.
Does this mean that LiveRebel requires two server installations on 2 physical IP addresses? Can I have two servers on 2 virtual IP addresses?
Rolling restarts use request routing to achieve zero downtime for users. Sessions are first drained by waiting for old sessions to expire while routing new ones to an identical application on another server. When all sessions are drained, the application is updated while the other server handles the requests.
So, as you can see, for zero downtime you need an additional server to handle the requests while the application is updated. A full restart doesn't have that requirement, but it results in downtime for users.
As for the question about IPs: as long as the two (virtual) server machines can see each other, it doesn't really make much difference.

Clustered MSMQ - Invalid queue path name when sending

We have a two-node cluster running on Windows 2008 R2. I've installed MSMQ with the Message Queuing Server and Directory Service Integration options on both nodes. I've created a clustered MSMQ resource named TESTV0Msmq (we use transactional queues, so a DTC resource had been created previously).
The virtual resource resolves correctly when I ping it.
I created a small console executable in C# using the MessageQueue constructor to allow us to send basic messages (to both transactional and non-transactional queues).
From the active node these paths work:
.\private$\clustertest
{machinename}\private$\clustertest
but TESTV0Msmq\private$\clustertest returns "Invalid queue path name".
According to this article:
http://technet.microsoft.com/en-us/library/cc776600(WS.10).aspx
I should be able to do this?
In particular, queues can be created on a virtual server, and messages can be sent to them. Such queues are addressed using the VirtualServerName\QueueName syntax.
Sounds like the classic MSMQ clustering problem:
Clustering MSMQ applications - rule #1
If you can access ".\private$\clustertest" or "{machinename}\private$\clustertest" from the active node, then there is a queue called clustertest hosted by the LOCAL MSMQ queue manager. It doesn't work on the passive node because there is no queue called clustertest there yet; if you fail over the resource, it should fail.
You need to create the queue in the clustered resource instead. TESTV0Msmq\private$\clustertest fails because the queue was created on the local machine and not on the virtual server.
Cheers
John Breakwell
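For reference, once the queue has been created on the virtual server, a minimal C# sketch for sending to it could look like this. Using FormatName with DIRECT=OS is an assumption on my part: it bypasses path-name parsing entirely and is a common way to address remote private queues directly:

using System.Messaging;

class ClusterSend
{
    static void Main()
    {
        // Address the queue hosted by the clustered MSMQ resource directly.
        const string path = @"FormatName:DIRECT=OS:TESTV0Msmq\private$\clustertest";

        using (var queue = new MessageQueue(path))
        using (var tx = new MessageQueueTransaction())
        {
            tx.Begin();
            queue.Send("hello", tx); // a transactional queue requires a transaction
            tx.Commit();
        }
    }
}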

MSMQ redundancy

I'm looking into WCF/MSMQ.
Does anyone know how one handles redundancy with MSMQ? It is my understanding that the queue sits on the server, but what if the server goes down and is not recoverable? How does one prevent the messages from being lost?
Any good articles on this topic?
There is a good article on using MSMQ in the enterprise here.
Tip 8 is the one you should read.
"Using Microsoft's Windows Clustering tool, queues will failover from one machine to another if one of the queue server machines stops functioning normally. The failover process moves the queue and its contents from the failed machine to the backup machine. Microsoft's clustering works, but in my experience, it is difficult to configure correctly and malfunctions often. In addition, to run Microsoft's Cluster Server you must also run Windows Server Enterprise Edition—a costly operating system to license. Together, these problems warrant searching for a replacement.
One alternative to using Microsoft's Cluster Server is to use a third-party IP load-balancing solution, of which several are commercially available. These devices attach to your network like a standard network switch, and once configured, load balance IP sessions among the configured devices. To load-balance MSMQ, you simply need to setup a virtual IP address on the load-balancing device and configure it to load balance port 1801. To connect to an MSMQ queue, sending applications specify the virtual IP address hosted by the load-balancing device, which then distributes the load efficiently across the configured machines hosting the receiving applications. Not only does this increase the capacity of the messages you can process (by letting you just add more machines to the server farm) but it also protects you from downtime events caused by failed servers.
To use a hardware load balancer, you need to create identical queues on each of the servers configured to be used in load balancing, letting the load balancer connect the sending application to any one of the machines in the group. To add an additional layer of robustness, you can also configure all of the receiving applications to monitor the queues of all the other machines in the group, which helps prevent problems when one or more machines is unavailable. The cost for such queue-monitoring on remote machines is high (it's almost always more efficient to read messages from a local queue) but the additional level of availability may be worth the cost."
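To make the quoted tip concrete, here is a minimal C# sketch of sending through a load balancer's virtual IP. The address 10.0.0.50 and the queue name are hypothetical; an identical queue must exist on every server behind the balancer:

using System.Messaging;

class BalancedSend
{
    static void Main()
    {
        // DIRECT=TCP addresses the queue by IP; here that IP is the balancer's VIP.
        using (var queue = new MessageQueue(@"FormatName:DIRECT=TCP:10.0.0.50\private$\orders"))
        {
            queue.Send("hello"); // the balancer picks which server's queue receives this
        }
    }
}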
Not to be snide, but you kind of answered your own question. If the server is unrecoverable, then you can't recover the messages.
That being said, you might want to back up the message folder regularly. This TechNet article will tell you how to do it:
http://technet.microsoft.com/en-us/library/cc773213.aspx
Also, it will not back up express messages, so that is something you have to be aware of.
If you prefer, you might want to store the actual messages for processing in a database upon receipt, and have the service be the consumer in a producer/consumer pattern.
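A minimal sketch of that producer/consumer idea, assuming a SQL Server table InboundMessages with a Body column; the queue path, connection string, and table name are all placeholders:

using System.Messaging;
using System.Data.SqlClient;

class StoreAndForward
{
    static void Main()
    {
        using (var queue = new MessageQueue(@".\private$\orders"))
        using (var conn = new SqlConnection("Server=.;Database=Messaging;Integrated Security=true"))
        {
            conn.Open();
            // Tell the formatter how to deserialize the message body.
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });

            while (true)
            {
                using (var message = queue.Receive()) // blocks until a message arrives
                using (var cmd = new SqlCommand(
                    "INSERT INTO InboundMessages (Body) VALUES (@body)", conn))
                {
                    cmd.Parameters.AddWithValue("@body", (string)message.Body);
                    cmd.ExecuteNonQuery();
                    // A separate consumer can now process rows from the table at its own pace.
                }
            }
        }
    }
}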