Using JEE7, Wildfly 8, ActiveMQ 5, Camel 2.13.2.
While developing, tickets occasionally get caught in some impossible circumstance and retry 5 or 10 times, cluttering up the logs.
Or, alternatively, I'll need to reboot the application server and will have to wait 5 minutes for inflight exchanges to time out.
I've started using hawtio and with the level of detail presented about Camel, it seems like I should be able to cancel/delete/flush/purge those and move on with my life.
There aren't many buttons to push. When I select a route, I can see a list of properties containing the offending inflight exchanges. But the "Destroy" button appears to only trigger the graceful shutdown.
Is there a way to purge/flush/delete tickets from hawtio? Any way at all?
You can lower the timeout from the default 300 seconds to a lower value.
http://camel.apache.org/graceful-shutdown.html
And if you use Camel 2.15.x onwards then hawtio has a new inflight exchanges sub tab, where you can see all the inflights. Though you cannot kill them.
There is no purge button, though, because inflight exchanges are not messages sitting still on a queue that you can drain. They are live Java threads doing work, so there is no single way of killing them cleanly.
There is a ticket in JIRA for a kill button that attempts a quicker shutdown, potentially leaving some threads still inflight, which may cause side effects because they could not shut down gracefully.
We have several ActiveMQ Artemis 2.17.0 clusters setup to replicate between data centres with mirroring.
Our previous failover had been an emergency, and it's likely the state had fallen out of sync. When we next performed our scheduled failover tests, weeks-old messages were sent to the consumers. I know that mirroring is asynchronous, so it is expected that synchronization may not be 100% all the time. However, these messages were far older than any synchronization delay could explain. It is worth noting that we've had several events which I expect might throw mirroring off: we had hit the NFS split brain issue as well as the past emergency failover.
As such, we are looking for a way to purge (or sync) all messages on the standby server after we know that there have been problems with the mirroring to prevent a similar scenario from happening. There are over 5,000 queues, so preferably the operation doesn't need to be run on a queue by queue basis.
Is there any way to accomplish this, either in ActiveMQ Artemis 2.17.0 or a later version?
There's no programmatic way to simply delete all the data from every queue on the broker. However, you can combine a few management operations (e.g. in a script) to get the same result. You can use the getQueueNames method to get the name of every queue and then pass those names to the destroyQueue(String) method.
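As a rough sketch of such a script: the snippet below only builds the management requests and leaves the transport to the caller. It assumes the broker's management API is reached through the Jolokia HTTP endpoint, that the broker is named "0.0.0.0" (a placeholder, use the name from your own broker.xml), and that `post` is a hypothetical helper you supply to send JSON to that endpoint. Only getQueueNames and destroyQueue(String) come from the answer above; everything else is illustrative.

```python
# Assumption: placeholder broker name; replace with the name from broker.xml.
BROKER_MBEAN = 'org.apache.activemq.artemis:broker="0.0.0.0"'


def read_queue_names_request(mbean=BROKER_MBEAN):
    """Jolokia 'read' request for the QueueNames attribute (getQueueNames)."""
    return {"type": "read", "mbean": mbean, "attribute": "QueueNames"}


def destroy_queue_request(queue_name, mbean=BROKER_MBEAN):
    """Jolokia 'exec' request that invokes destroyQueue(String) for one queue."""
    return {
        "type": "exec",
        "mbean": mbean,
        "operation": "destroyQueue(java.lang.String)",
        "arguments": [queue_name],
    }


def destroy_all_queues(post):
    """Destroy every queue on the broker.

    `post` is a hypothetical callable: it sends one JSON-serialisable
    Jolokia request (or a list of them) and returns the decoded response.
    """
    names = post(read_queue_names_request())["value"]
    # Jolokia accepts a JSON array as a bulk request, so thousands of
    # queues do not need one HTTP round trip each.
    post([destroy_queue_request(name) for name in names])
    return names
```

Note that destroyQueue removes the queues themselves, not just their messages, so try it against a test broker first.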
However, the simplest way to clear all the data would probably be to simply stop the broker, clear the data directory, and then restart the broker.
I'd like to have more technical details about the underlying mechanisms of calling Actors in Azure Service Fabric, which I can't easily find online. Actors are known for their single-threaded scope, so until a method execution has fully completed, no other client is allowed to call the actor.
To be more specific, I need to know what happens if an Actor is stuck for a while on a job initiated by one client call. How long are other clients supposed to wait until the job gets done? Seconds, minutes, hours?
Is there any time-out mechanism, and if so, is it somehow configurable?
What happens if the node where the actor is located crashes? Would the client receive an immediate error, or does ActorProxy somehow handle this situation and redirect the call to a newly created instance of the Actor on a healthy node?
There are quite a few SO answers with details about actor mechanisms, and more in the docs. I can point you to a few:
This one does not exactly answer your question, but I described a bit how the locking works: Start a thread inside an Azure Service Fabric actor?
Q: Is there any time-out mechanism, and if so is it somehow configurable?
Yes, there is a timeout. I answered that here: Acquisition of turn based concurrency lock for actor '{actorName}' timed out after {time}
The configuration docs are located here.
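To make the waiting behaviour concrete, here is a toy model in plain Python. It is not Service Fabric code: `TurnBasedActor` and `lock_timeout` are made-up names, and the timings are tiny so the demo runs quickly. It only shows the general shape, where a second caller waits for its turn and gets an error once the lock timeout expires.

```python
import threading
import time


class TurnBasedActor:
    """Toy actor: one call at a time; callers give up after lock_timeout seconds."""

    def __init__(self, lock_timeout):
        self._turn = threading.Lock()
        self._lock_timeout = lock_timeout

    def call(self, work):
        # Wait for our turn, but only for the configured timeout.
        if not self._turn.acquire(timeout=self._lock_timeout):
            raise TimeoutError("acquisition of turn-based concurrency lock timed out")
        try:
            return work()
        finally:
            self._turn.release()


def demo():
    actor = TurnBasedActor(lock_timeout=0.1)
    started = threading.Event()

    def slow_job():
        started.set()
        time.sleep(0.5)  # the "stuck" call from the first client
        return "done"

    first = threading.Thread(target=actor.call, args=(slow_job,))
    first.start()
    started.wait()  # make sure the first client already holds the turn

    try:
        actor.call(lambda: "never runs")
        outcome = "ran"
    except TimeoutError:
        outcome = "timed out"
    first.join()
    return outcome
```

In this model the second client is neither queued forever nor rejected immediately: it waits up to the timeout and then fails, which is the same shape as the error message quoted above.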
Q: What happens if the node where the actor is located crashes? Would the client receive an immediate error, or does ActorProxy somehow handle this situation and redirect the call to a newly created instance of the Actor on a healthy node?
Generally there is always another replica available when one replica goes down. New requests start moving to the new replica once SF promotes a secondary replica to Primary.
Regarding the communication: by default, SF Actors use .NET remoting, the same way Reliable Services do, and the behaviour is described very well in the docs. In summary, the proxy retries transient failures; if the client can't connect to the service (Actor), it retries until it reaches the connection timeout.
From the docs:
The service proxy handles all failover exceptions for the service partition it is created for. It re-resolves the endpoints if there are failover exceptions (non-transient exceptions) and retries the call with the correct endpoint. The number of retries for failover exceptions is indefinite. If transient exceptions occur, the proxy retries the call.
The Actor docs have more info. In summary, there are two points to keep in mind:
Message delivery is best effort.
Actors may receive duplicate messages from the same client.
That means that if a transient failure occurs while delivering a message, the proxy will retry. If the original message had in fact already been delivered, the actor receives it twice.
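One common way to live with this is to make the actor's methods idempotent. The sketch below is plain Python rather than Service Fabric code, and every name in it (`IdempotentActor`, `flaky_channel`, `send_with_retry`, the message id) is invented for illustration. It simulates a call whose first reply is lost, so the client retries a message that was already delivered.

```python
class IdempotentActor:
    """Toy actor that remembers message ids it has already applied."""

    def __init__(self):
        self.balance = 0
        self._seen = set()

    def deposit(self, message_id, amount):
        # A duplicate delivery must not change state a second time.
        if message_id not in self._seen:
            self._seen.add(message_id)
            self.balance += amount
        return self.balance


def flaky_channel(actor, message_id, amount):
    """Return a deliver() function whose first reply gets lost on the way back."""
    calls = {"n": 0}

    def deliver():
        calls["n"] += 1
        result = actor.deposit(message_id, amount)  # the message IS delivered
        return None if calls["n"] == 1 else result  # but the first reply is lost

    return deliver


def send_with_retry(deliver, attempts=3):
    """Retry until a reply arrives; a lost reply looks like a transient failure."""
    for _ in range(attempts):
        reply = deliver()
        if reply is not None:
            return reply
    raise RuntimeError("no reply after retries")
```

Without the `_seen` check, the retry in this simulation would apply the deposit twice.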
I am starting to think this is impossible now so hopefully somebody can give me some guidance.
In short, I have a Spring Boot application running Apache Camel routes. XA is configured using Atomikos and, as far as I can tell, all the XA-specific configuration is as it should be. When a route executes I can see a message removed from the JMS queue, a database insert executed using a #Transacted JPA component, and the message routed to an output JMS queue. All fine, and I can see log information that supports the transaction manager committing both the JMS and JPA bits.
The issue comes when I have an exception. I need to be able to attempt redelivery 3 times and, if that fails, route the message on to a failure queue, but not before the database insert is rolled back.
I have configured a TransactionErrorHandlerBuilder which sets the redelivery count to 3 and also manages the RedeliveryDelay. I can see all of that working, but I never manage to divert the message, after 3 delivery attempts, to the route that I have set up to deliver to the failure queue. I have set the DeadLetterUri to point to that route, but it seems that the transactionErrorHandler never makes use of it; Camel just tries to redeliver the message 3 times, over and over again, until I kill the route.
Is what I am asking not supported? I am really hoping I am missing something obvious.
(Camel 2.19)
(SpringBoot 1.5.3)
thanks
Paul
I've been working on a project of mine using Akka to create a real-time processing system which takes in the Twitter stream (for now) and uses actors to process said messages in various ways. I've been reading about similar architectures that others have built using Akka and this particular blog post caught my eye:
http://blog.goconspire.com/post/64901258135/akka-at-conspire-part-5-the-importance-of-pulling
Here they explain different issues that arise when pushing work (i.e. messages) to actors vs. having the actors pull work. To paraphrase the article, by pushing messages there is no built-in way to know which units of work were received by which worker, nor can that be reliably tracked. In addition, if a worker suddenly receives a large number of messages where each message is quite large, it might end up overwhelmed and the machine could run out of memory. Or, if the processing is CPU intensive, you could render your node unresponsive due to CPU thrashing. Furthermore, if the JVM crashes, you will lose all the messages that the actor(s) had in their mailboxes.
Pulling messages largely eliminates these problems. Since a specific actor must pull work from a coordinator, the coordinator always knows which unit of work each worker has; if a worker dies, the coordinator knows which unit of work to re-process. Messages also don’t sit in the workers’ mailboxes (since it's pulling a single message and processing it before pulling another one) so the loss of those mailboxes if the actor crashes isn't an issue. Furthermore, since each worker will only request more work once it completes its current task, there are no concerns about a worker receiving or starting more work than it can handle concurrently. Obviously there are also issues with this solution like what happens when the coordinator itself crashes but for now let's assume this is a non-issue. More about this pulling pattern can also be found at the "Let It Crash" website which the blog references:
http://letitcrash.com/post/29044669086/balancing-workload-across-nodes-with-akka-2
This got me thinking about a possible alternative to doing this pulling pattern which is to do pushing but with durable mailboxes. An example I was thinking of was implementing a mailbox that used RabbitMQ (other data stores like Redis, MongoDB, Kafka, etc would also work here) and then having each router of actors (all of which would be used for the same purpose) share the same message queue (or the same DB/collection/etc...depending on the data store used). In other words each router would have its own queue in RabbitMQ serving as a mailbox. This way, if one of the routees goes down, those that are still up can simply keep retrieving from RabbitMQ without too much worry that the queue will overflow since they are no longer using typical in-memory mailboxes. Also since their mailbox isn't implemented in-memory, if a routee crashes, the most messages that it could lose would just be the single one it was processing before the crash. If the whole router goes down then you could expect RabbitMQ (or whatever data store is being used) to handle an increased load until the router is able to recover and start processing messages again.
In terms of durable mailboxes, it seems that back in version 2.0, Akka was gravitating towards supporting these more actively since they had implemented a few that could work with MongoDB, ZooKeeper, etc. However, it seems that for whatever reason they abandoned the idea at some point since the latest version (2.3.2 as of the writing of this post) makes no mention of them. You're still able to implement your own mailbox by implementing the MessageQueue interface which gives you methods like enqueue(), dequeue(), etc... so making one that works with RabbitMQ, MongoDB, Redis, etc wouldn't seem to be a problem.
Anyways, just wanted to get your guys' and gals' thoughts on this. Does this seem like a viable alternative to doing pulling?
This question also spawned a rather long and informative thread on akka-user. In summary it is best to explicitly manage the work items to be processed by a (persistent) actor from which a variable number of worker actors pull new jobs, since that allows better resource management and explicit control over what gets processed and how retries are handled.
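A minimal sketch of that arrangement, in plain Python rather than Akka: the class and method names are invented, and persistence of the coordinator is deliberately left out. The point it shows is that a coordinator handing out one item per request always knows which worker holds which item, so a dead worker's item can be handed out again.

```python
from collections import deque


class Coordinator:
    """Hands out one work item per request and remembers who holds what."""

    def __init__(self, items):
        self._pending = deque(items)
        self._in_progress = {}  # worker id -> item currently held

    def request_work(self, worker_id):
        """Called by a worker when it is ready for its next item."""
        if worker_id in self._in_progress:
            raise RuntimeError(f"{worker_id} already holds a work item")
        if not self._pending:
            return None  # nothing to do right now
        item = self._pending.popleft()
        self._in_progress[worker_id] = item
        return item

    def work_done(self, worker_id):
        """Called by a worker after it has finished its current item."""
        return self._in_progress.pop(worker_id)

    def worker_died(self, worker_id):
        """Put the dead worker's item back at the front of the queue."""
        item = self._in_progress.pop(worker_id, None)
        if item is not None:
            self._pending.appendleft(item)
        return item
```

Because a worker only asks for more work after reporting the previous item done, no worker ever holds more than one item, and nothing sits in a worker-side mailbox waiting to be lost.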
We have a Pub/Sub system based on NServiceBus, where we have intermittent issues with messages getting stuck on the Publisher's outgoing queue indefinitely, rather than being transmitted to the Subscribers' input queues.
Points to note:
When we restart the Publisher Service and Subscriber services, message flow resumes normally for a while.
The problem seems to occur more often after a sustained period with no messages.
The publisher service resides on the LAN, the subscribers on the other side of a firewall.
Some messages get through! As mentioned, after service restarts things go fine for a while.
Using QueueExplorer, I can see the messages on the Outgoing queue have a state of WAITING.
Annoyingly our development environment does not exhibit this behaviour, but then again the publisher and subscribers all reside on the same LAN in this environment.
MSMQ messages being stuck in an outgoing queue is purely an MSMQ issue. Restarting the Publisher and Subscriber services should make no difference as they are not directly involved in message delivery. If you can fix the problem by ONLY restarting the Pub/Sub services and NOT the Message Queuing services then it looks like a resources/memory leak problem.
I imagine something like this happening:
1. Messages flow to the destination, which uses up kernel memory in storing them
2. For some reason, kernel memory runs out (too many messages, memory leak, whatever)
3. Destination now rejects new messages as they cannot be loaded into memory from the wire
4. Connection is reset and not re-connected until WaitTime value reached; Queue is "waiting" at this point
5. System loops through (3) and (4) until ...
6. Pub/Sub services are restarted and now there are sufficient resources for messages to be delivered
7. Goto (2)
Occasional messages get through when just enough kernel memory is temporarily freed up by one of the many services and device drivers that use it.
Item 4 of this blog post is the most likely culprit:
http://blogs.msdn.com/b/johnbreakwell/archive/2006/09/18/insufficient-resources-run-away-run-away.aspx
Cheers
John Breakwell
We had a similar scenario in production. It turned out we had migrated one of our subscriber endpoints to a new physical host and forgot to unsubscribe before shutting down the old endpoint. Our publisher was trying to deliver messages to both the old and new endpoints but could only reach the new one. Eventually the publisher's outbound queue grew so large that it started affecting all outgoing messages.
I have run into this issue as well. I know it is not Item 4, as I don't send anything to it before it gets stuck in the outgoing queue. If I let both publisher and subscriber sit for about 10 minutes before sending a message, it never leaves the outgoing queue. If I send a message before that amount of time, it flows fine. Also, if I restart the subscriber the message will then flow. This is reproducible every time I let them sit idle for 10 minutes.
I think I found the answer here, at least this fixed the issue I was having:
http://support.microsoft.com/kb/2554746
Also, in my case it had nothing to do with restarting, so don't let that throw you off. I did see the symptoms in netstat, and messages would initially go through when the client was first started up.
Just to throw my 2p in:
We had an issue where the message queuing service had some kind of memory leak and would consume large amounts of memory which it did not release.
This led to messages getting stuck for long periods of time, although they would eventually be delivered (sometimes after 3 days).
We have not bothered fixing this yet as it only happens when the service is under heavy load which does not happen often.