Akka Cluster: Terminated vs MemberRemoved message (Scala)

I am using Akka Cluster and am having trouble figuring out when to use the Terminated message to detect that a member has left the cluster, versus using the UnreachableMember/MemberRemoved messages. What is the difference between the two, and what are some example use cases?

MemberRemoved is a cluster domain event, published when a member has been completely removed from the cluster.
You can subscribe to cluster membership change notifications using Cluster(system).subscribe.
On the other hand, there is DeathWatch, a mechanism for watching an actor to see whether it has terminated. When DeathWatch is used, the watcher receives a Terminated(watched) message once watched is terminated.
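A minimal classic-Akka sketch of both mechanisms (the actor and class names here are my own, for illustration):

    import akka.actor.{Actor, ActorLogging, ActorRef, Terminated}
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent._

    // Subscribes to cluster membership events.
    class ClusterListener extends Actor with ActorLogging {
      private val cluster = Cluster(context.system)

      override def preStart(): Unit =
        cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
          classOf[MemberRemoved], classOf[UnreachableMember])

      override def postStop(): Unit = cluster.unsubscribe(self)

      def receive = {
        case UnreachableMember(member) =>
          log.info("Member detected as unreachable: {}", member)
        case MemberRemoved(member, previousStatus) =>
          log.info("Member removed: {} (previous status: {})", member.address, previousStatus)
      }
    }

    // DeathWatch: the watcher receives Terminated(watched) when the watched
    // actor stops (or, for a remote watch, when its node is removed).
    class Watcher(watched: ActorRef) extends Actor with ActorLogging {
      context.watch(watched)

      def receive = {
        case Terminated(ref) =>
          log.info("Watched actor {} terminated", ref.path)
      }
    }

Roughly: cluster events tell you about nodes (members), while DeathWatch tells you about individual actors.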

Related

How to stop the entire Akka Cluster (Sharding)

How to stop the ENTIRE cluster with sharding (spanning multiple machines - nodes) from one actor?
I know I can stop the actor system on 'this' node with context.system.terminate().
I know I can stop the local Sharding Region.
I found .prepareForFullClusterShutdown() but it doesn't actually stop the nodes.
I suppose there is no single command to do that, but there must be some way to do this.
There's no out-of-the-box way to do this that I'm aware of: the overall expectation is that there's an external control plane (e.g. kubernetes) which manages this.
However, one could have an actor on every node of the cluster that listens for membership events and also subscribes to a pubsub topic. This actor would track the current cluster membership and, when told to begin a cluster shutdown, it publishes a (e.g.) ShutdownCluster message to the topic and tracks which nodes leave. After some length of time (since distributed pubsub is at-most-once) if there are nodes besides this one that haven't left, it sends it again. Eventually, after all other nodes in the cluster have left, this actor then shuts down its node. When other nodes see a ShutdownCluster message, they immediately shut themselves down.
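Here is a rough classic-actor sketch of that scheme, assuming Akka 2.6 with Cluster and Distributed PubSub available; the BeginClusterShutdown/ShutdownCluster messages and the topic name are invented for illustration:

    import akka.actor.{Actor, ActorLogging, Address, Timers}
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent._
    import akka.cluster.pubsub.DistributedPubSub
    import akka.cluster.pubsub.DistributedPubSubMediator.{Publish, Subscribe}
    import scala.concurrent.duration._

    case object BeginClusterShutdown // sent to the coordinator on one node
    case object ShutdownCluster      // published to every node via pub-sub
    case object Republish

    class ClusterShutdownCoordinator extends Actor with ActorLogging with Timers {
      private val cluster  = Cluster(context.system)
      private val mediator = DistributedPubSub(context.system).mediator
      private val Topic    = "cluster-shutdown"

      private var others     = Set.empty[Address] // other nodes still in the cluster
      private var initiating = false

      override def preStart(): Unit = {
        cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
          classOf[MemberUp], classOf[MemberRemoved])
        mediator ! Subscribe(Topic, self)
      }

      def receive = {
        case MemberUp(m) if m.address != cluster.selfAddress =>
          others += m.address

        case MemberRemoved(m, _) =>
          others -= m.address
          // The initiator is the last one out: terminate once every
          // other node has left the cluster.
          if (initiating && others.isEmpty) context.system.terminate()

        case BeginClusterShutdown =>
          initiating = true
          mediator ! Publish(Topic, ShutdownCluster)
          // Distributed pub-sub is at-most-once, so keep re-publishing
          // until all other nodes have actually left.
          timers.startTimerAtFixedRate(Republish, Republish, 30.seconds)

        case Republish if others.nonEmpty =>
          mediator ! Publish(Topic, ShutdownCluster)

        case ShutdownCluster if !initiating =>
          // A non-initiating node: leave the cluster gracefully, then
          // stop the actor system once removal completes.
          cluster.registerOnMemberRemoved(context.system.terminate())
          cluster.leave(cluster.selfAddress)

        case _ => // our own publication, SubscribeAck, idle Republish: ignore
      }
    }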
Of course, this sort of scheme will probably not play nicely with any form of external orchestration (whether it's a container scheduler like kubernetes, mesos, or nomad; or even something simple like monit which notices that the service isn't running and restarts it).

MSMQ in cluster behavior on node switch

I just installed MSMQ in a cluster and am now testing how it behaves. It appears that when the active cluster node is switched, all messages which were in the queue are lost (even when we switch back to the original node). To me this seems like undesired behavior; I thought that all messages from the source node should be moved to the destination node on a node switch.
I tested node switch via Pause > Drain roles menu item and via Move > Select node menu item.
I want to know whether the described behavior is how clustered MSMQ is supposed to behave, or whether it might be a misconfiguration issue.
Update: I found a similar question here: MSMQ Cluster losing messages on failover. But the solution did not help in my situation.
It turned out that I was sending messages to the queue that were not recoverable (as described here: https://blogs.msdn.microsoft.com/johnbreakwell/2009/06/03/i-restarted-msmq-and-all-my-messages-have-vanished). That's why these messages didn't survive a service restart. Once I sent messages with the Recoverable flag set, they started to survive service restarts and cluster node switches.

Partition is in quorum loss

I have a Service Fabric application that has a stateless web api and a stateful service with two partitions. The stateless web api defines a web api controller and uses ServiceProxy.Create to get a remoting proxy for the stateful service. The remoting call puts a message into a reliable queue.
The stateful service will dequeue the messages from the queue every X minutes.
I am looking at the Service Fabric explorer and my application has been in an error state for the past few days. When I drill down into the details the stateful service has the following error:
Error event: SourceId='System.FM', Property='State'. Partition is in quorum loss.
Looking at the explorer I see that my primary replica is up and running along with what seems to be a single ActiveSecondary, but the other two replicas show IdleSecondary and keep going into a Standby / In Build state. I cannot figure out why this is happening.
What are some of the reasons my other secondaries keep failing to get to an ActiveSecondary state / causing this quorum loss?
Try resetting the cluster.
I was facing the same issue with a single partition for my service, and the error was fixed by resetting the cluster.
Have you checked the Windows Event Log on the nodes for additional error messages?
I had a similar problem, except I was using a ReliableDictionary. Did you properly implement IEquatable<T> and IComparable<T>? I had a similar problem because my T had a dictionary field, and I was calling Equals on a dictionary directly, instead of comparing the keys and values. Same thing for GetHashCode.
The clue in the event logs was this message: Assert=Cannot update an item that does not exist (null). - it only happened when I edited a key in the ReliableDictionary.

How can I reproduce an Akka quarantine situation?

Working with Akka Cluster, I sometimes get a quarantine situation with an exception like this:
2016-03-22 10:01:37.090UTC WARN [ClusterSystem-akka.actor.default-dispatcher-2] Remoting - Association to [akka.tcp://ClusterSystem#10.10.80.26:2551] having UID [1417400423] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
The problem is that I can't reproduce it!
I just can't write a simple code snippet that provokes a quarantine event.
Is there any simple way to provoke a quarantine event?
Just start two seed nodes and stop the network on one of them; you will see this issue.
I have reproduced this issue: I created multiple VMs to form a cluster and stopped the network on one of the VMs.
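For reference, here is a minimal sketch of such a two-seed-node setup (the system name, host, and ports are assumptions; the config keys are for classic remoting, matching the akka.tcp address in the log above). Run it as two separate processes:

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    // Start this main twice, with ports 2551 and 2552, to form a
    // two-node cluster on one machine.
    object SeedNode {
      def main(args: Array[String]): Unit = {
        val port = args.headOption.getOrElse("2551")
        val config = ConfigFactory.parseString(s"""
          akka.actor.provider = "akka.cluster.ClusterActorRefProvider"
          akka.remote.netty.tcp.hostname = "127.0.0.1"
          akka.remote.netty.tcp.port = $port
          akka.cluster.seed-nodes = [
            "akka.tcp://ClusterSystem@127.0.0.1:2551",
            "akka.tcp://ClusterSystem@127.0.0.1:2552"
          ]
        """).withFallback(ConfigFactory.load())

        ActorSystem("ClusterSystem", config)
        // With both nodes up, freeze or disconnect one process (e.g.
        // kill -STOP, or drop its network) until the other node marks it
        // unreachable; once the unreachable node is downed (manually or
        // via auto-downing), its UID is quarantined and the warning appears.
      }
    }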

Clustered MSMQ Issue

I am having an issue with MSMQ in a clustered environment. I have the following setup:
Two nodes set up in a Windows failover cluster; let's call them "Node A" and "Node B".
I have then set up a clustered instance of MSMQ; let's call it "MSMQ Instance".
I have also set up a clustered instance of the DTC; let's call it "DTC Instance".
Within the DTC instance, I have allowed access both locally and through the clustered instance; basically, I have turned all authentication off for testing.
I have also created a clustered instance of our in-house application; let's call it "Application Instance". Within this Application Instance, I have added other resources, which are other services the application uses, as well as the Net.MSMQ adapter.
The Issue.......
When I cluster the Application Instance, it always sets the owner to the opposite node from the one I am using: if I create the clustered instance on Node A, it sets the current owner to Node B. However, that is not the issue.
The issue I have is that as long as the Application Instance is running on Node B, MSMQ seems to work.
The outbound queues are created locally, receive messages and are then processed through the MSMQ Cluster.
If I then fail over to Node A, MSMQ refuses to work. The outbound queues are not created and therefore no messages are being processed.
I get an error in Event Viewer:
"The version check failed with the error: 'Unrecognized error -1072824309 (0xc00e000b)'. The version of MSMQ cannot be detected All operations that are on the queued channel will fail. Ensure that MSMQ is installed and is available"
If I then failover back to Node B it works.
The Application has been setup to use the MSMQ instance and all the permissions are correct.
Do I need to have a clustered instance of DTC, or can I just configure it as a resource within the MSMQ instance?
Can anybody shed any light on this? I am at a brick wall with it.
Yes, you will need to have a clustered DTC set up.
For your clustered MSMQ instance you will then need to configure the clustered DTC as a "dependency": right-click on MSMQ -> Properties -> Dependencies.
I do not know if this is mandatory in all cases, but on our cluster we also have a file share configured as a dependency for MSMQ. To my understanding, this ensures that temporary files needed by MSMQ are still available after a node switch.
Additionally, here are two articles that I found very helpful in setting up the cluster nodes. They might be helpful in confirming step-by-step that your configurations are correct:
"Building MSMQ cluster". You will find several other links in that article that will guide you further.
Microsoft also has a detailed document: "Deploying Message Queuing (MSMQ) 3.0 in a Server Cluster".