Working with an Akka cluster, I sometimes get into a quarantine situation with an exception like this:
2016-03-22 10:01:37.090UTC WARN [ClusterSystem-akka.actor.default-dispatcher-2] Remoting - Association to [akka.tcp://ClusterSystem#10.10.80.26:2551] having UID [1417400423] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
The problem is that I can't reproduce it!
I just can't write a simple code snippet that provokes the quarantine event.
Is there any simple way to provoke a quarantine event?
Just start two seed nodes and cut the network on one of them; you will see this issue.
I have reproduced this issue: I created multiple VMs to form a cluster and then disabled the network on one of the VMs.
I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process, so I need a large machine, but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google container registry by the same computer using the same Google login. The older one works fine, but the newer one fails by getting stuck in an endless stream of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here, and why one of my images doesn't have this problem (it prints a few of these messages but gets past them) while the other does (thousands of these messages; it ran for over 24 hours before I killed it)?
If I ssh into a GCE instance, both versions of the container pull and run just fine. I suspect the INTEGRITY_RULE checking from the logs, but I know nothing about how that works.
MORE INFO: this comes down to "restart policy: never". Even a simple centos:7 container that prints "hello world", deployed from the console, triggers this if the restart policy is never. At least in the short term I can work around it in the entrypoint script, since the instance will be destroyed once the monitor realises the process has finished.
I suggest you try creating a third container focused on the metadata-service functionality, to isolate the issue. It may be that there's a timing difference between the two containers that isn't being overcome.
Make sure you can ‘curl’ the metadata service from the VM and that the request to the metadata service is using the VM's service account.
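One way to run that check from inside the VM or container is a minimal Python sketch like the one below (the metadata hostname and `Metadata-Flavor: Google` header are GCE's documented values; the `fetch_metadata` helper name is my own). Note that the metadata server itself returns 403 Forbidden when that header is missing, which is one possible source of the error above:

```python
import urllib.request
import urllib.error

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1/"

def fetch_metadata(path, timeout=2):
    """Fetch a value from the GCE metadata server, or return None on failure."""
    req = urllib.request.Request(
        METADATA_ROOT + path,
        # Required by GCE; without it the server responds 403 Forbidden.
        headers={"Metadata-Flavor": "Google"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, OSError):
        return None  # not on GCE, or metadata service unreachable

# Which service account is the VM (and thus the container) actually using?
email = fetch_metadata("instance/service-accounts/default/email")
print(email or "metadata service unreachable")
```

If this prints a service-account email on the VM but fails from inside the container, the problem is in the container's network path to the metadata service rather than in the image itself.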
I just installed MSMQ in a cluster and am now testing how it behaves. It appears that when the active cluster node is switched, all messages that were in the queue are lost (even when we switch back to the original node). This seems like undesired behavior to me; I thought all messages from the source node should be moved to the destination node on a node switch.
I tested node switch via Pause > Drain roles menu item and via Move > Select node menu item.
Is the described behavior how MSMQ in a cluster is supposed to behave, or might it be a misconfiguration issue?
Update. Found similar question here: MSMQ Cluster losing messages on failover. But the solution did not help in my situation.
It seems I was sending messages to the queue that were not recoverable (as described here: https://blogs.msdn.microsoft.com/johnbreakwell/2009/06/03/i-restarted-msmq-and-all-my-messages-have-vanished). That's why those messages didn't survive a service restart. When I send messages with the Recoverable flag set, they survive both a service restart and a cluster node switch.
I have a Service Fabric application that has a stateless web api and a stateful service with two partitions. The stateless web api defines a web api controller and uses ServiceProxy.Create to get a remoting proxy for the stateful service. The remoting call puts a message into a reliable queue.
The stateful service will dequeue the messages from the queue every X minutes.
I am looking at the Service Fabric Explorer, and my application has been in an error state for the past few days. When I drill down into the details, the stateful service shows the following error:
Error event: SourceId='System.FM', Property='State'. Partition is in
quorum loss.
Looking at the explorer, I see my primary replica up and running and what appears to be a single ActiveSecondary, but the other two replicas show IdleSecondary and keep cycling through Standby / In Build states. I cannot figure out why this is happening.
What are some of the reasons my other secondaries keep failing to get to an ActiveSecondary state / causing this quorum loss?
Try resetting the cluster.
I was facing the same issue with a single partition for my service, and the error was fixed by resetting the cluster.
Have you checked the Windows Event Log on the nodes for additional error messages?
I had a similar problem, except I was using a ReliableDictionary. Did you properly implement IEquatable<T> and IComparable<T>? In my case, my T had a dictionary field, and I was calling Equals on the dictionary directly instead of comparing its keys and values. Same thing for GetHashCode.
The clue in the event logs was this message: Assert=Cannot update an item that does not exist (null). It only happened when I edited a key in the ReliableDictionary.
I am using Akka clusters and having trouble figuring out when to use the Terminated message to detect that a member has left the cluster versus using the UnreachableMember/MemberRemoved messages. What is the difference between the two messages, and what are some example use cases?
MemberRemoved is a cluster event indicating that a member has been completely removed from the cluster.
You can subscribe to change notifications of the cluster membership by using Cluster(system).subscribe.
On the other hand, there is a DeathWatch, a mechanism to watch an actor to check if it is terminated. When DeathWatch is used, the watcher will receive a Terminated(watched) message when watched is terminated.
I'm interested in using Celery for an app I'm working on. It all seems pretty straightforward, but I'm a little confused about what I need to do if I have multiple load-balanced application servers. All of the documentation assumes the broker will be on the same server as the application. Currently, all of my application servers sit behind an Amazon ELB, and tasks need to be able to come from any one of them.
This is what I assume I need to do:
Run a broker server on a separate instance
Configure each application instance to connect to that broker server
Each application instance will also be a Celery worker (running celeryd)?
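Those steps can be sketched as a configuration module shared by every instance (the hostname, credentials, and vhost below are placeholders, not values from the question):

```python
# celeryconfig.py -- shared by every application instance and worker node
# (hostname "broker.internal" and the credentials are placeholders;
#  lowercase setting names are Celery 4+ style, older versions used BROKER_URL)

broker_url = "amqp://myuser:mypassword@broker.internal:5672/myvhost"

# Each application instance builds its Celery app from this module:
#   from celery import Celery
#   app = Celery("tasks")
#   app.config_from_object("celeryconfig")
#
# ...and each instance also runs a worker process against the same broker
# (modern Celery uses `celery -A tasks worker`; `celeryd` is the old
# standalone daemon name).
```

With the broker on its own instance, any application server can enqueue a task and any worker can pick it up.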
My only concern is: what happens if my broker instance dies? Can I run two broker instances somehow so I'm safe if one goes down?
Any tips or information on what to do in a setup like mine would be greatly appreciated. I'm sure I'm missing something or not understanding something.
For future reference, for those who do prefer to stick with RabbitMQ...
You can create a RabbitMQ cluster from 2 or more instances. Add those instances to your ELB and point your celeryd workers at the ELB. Just make sure you connect the right ports and you should be all set. Don't forget to allow your RabbitMQ machines to talk among themselves to run the cluster. This works very well for me in production.
One exception here: if you need to schedule tasks, you need a celerybeat process. For some reason, I wasn't able to connect celerybeat to the ELB and had to connect it to one of the instances directly. I opened an issue about it and it is supposed to be resolved (I haven't tested it yet). Keep in mind that only one celerybeat instance can run at a time, so it is already a single point of failure.
You are correct on all points.
To make the broker reliable, set up a clustered RabbitMQ installation, as described here:
http://www.rabbitmq.com/clustering.html
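On the client side, Celery (via kombu) can also be given the whole list of cluster nodes so that workers fail over between them. A sketch with placeholder hostnames and credentials (`broker_failover_strategy` is a real Celery setting, and "round-robin" is its default value):

```python
# celeryconfig.py -- broker failover across RabbitMQ cluster nodes
# (hostnames rabbit1/rabbit2 and the credentials are placeholders)

broker_url = [
    "amqp://myuser:mypassword@rabbit1.internal:5672/myvhost",
    "amqp://myuser:mypassword@rabbit2.internal:5672/myvhost",
]

# When the current broker becomes unreachable, the connection is retried
# against the next URL in the list.
broker_failover_strategy = "round-robin"
```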
Celery beat also doesn't have to be a single point of failure if you run it on every worker node with:
https://github.com/ybrs/single-beat