Request times out after being sent in AHC - TimeoutException

I am using the org.asynchttpclient library, version 2.12.3, in a service that makes a lot of downstream API calls to different services. Earlier I was using a single AHC instance for all the downstream service calls, configured with maxConnectionPool=1000 and maxConnectionsPerHost=500 and keep-alive=true. What happened was that the requests were sent from my client service to the target services, but many of them timed out with the message "Request timeout to <host> after <n> ms". What I observed is that the responses from the services arrived within the timeout duration, yet for some reason the requests still timed out.
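A minimal sketch of the kind of shared-client setup described above (not the original code; the 5-second request timeout and the URL are placeholders):
```java
import org.asynchttpclient.AsyncHttpClient;
import org.asynchttpclient.DefaultAsyncHttpClientConfig;
import org.asynchttpclient.Dsl;
import org.asynchttpclient.Response;

import java.util.concurrent.CompletableFuture;

public class SharedAhcClient {
    public static void main(String[] args) throws Exception {
        // Single shared client with the pool settings from the question.
        DefaultAsyncHttpClientConfig config = Dsl.config()
                .setMaxConnections(1000)        // total pool ("maxConnectionPool=1000")
                .setMaxConnectionsPerHost(500)  // per downstream host
                .setKeepAlive(true)             // reuse channels between requests
                .setRequestTimeout(5_000)       // placeholder: whole-request timeout in ms
                .build();

        try (AsyncHttpClient client = Dsl.asyncHttpClient(config)) {
            // Each downstream call is fully asynchronous; the request timeout above is
            // what produces the "Request timeout to <host> after <n> ms" exception.
            CompletableFuture<Response> future = client
                    .prepareGet("https://downstream.example.com/api") // placeholder URL
                    .execute()
                    .toCompletableFuture();
            System.out.println(future.join().getStatusCode());
        }
    }
}
```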
As an experiment I increased the number of AHC instances and used a separate instance for each downstream service, with the configuration otherwise unchanged. The timeouts decreased significantly and the overall response time of the downstream calls improved, but false timeouts are still seen.
I have also seen this issue happen when reusing an open channel; with a new channel it never seems to happen.
My guess is that a thread is blocked for some reason and cannot process the new responses arriving on the channel. Is this a known bug, or could there be an issue with my implementation? Can someone shed some light on this?

Related

Akka ask timeouts

We have an akka/scala app that has some naively written error handling that I need to fix.
The REST endpoint communicates with an internal actor that makes a remote call to create an order.
It does this using ask, and when the ask times out e.g. because of a network or comms error, we send the client a message over the REST endpoint that the request has failed.
The problem is that the internal actor has its own queuing/retry logic and it will continue to call the remote interface until the request succeeds.
So we have a situation where we've told the client that the request has failed, but it's really just queued (and will often eventually succeed). The client resubmits the request and we end up with hundreds of duplicate orders.
My question is: does akka support a generic way of rolling back or poisoning an ask message when the ask request times out?
There are a couple of approaches one can take. Neither is generic, as (especially once network communication is involved) any choice made in this area has some cases where it's exactly the wrong thing for the business logic:
If there's something about the orders that can be used to determine that two submitted orders are actually the same (e.g. a client-supplied correlation ID), that can be used in the actor to piggyback on the queuing/retry logic handling the earlier order. This requires some client-visible API changes.
It's also possible to include a "stop retrying and ignore if pulled from a queue after this time" field in the message; you can set this time based on the ask timeout duration.
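A rough sketch of that second approach using Akka's classic Java API (the app in question is Scala, and the message and field names here are invented for illustration): the order message carries an expiry derived from the ask timeout, and the actor drops expired work instead of retrying it.
```java
import akka.actor.AbstractActor;
import akka.actor.Props;

import java.time.Instant;

// Hypothetical order message carrying a client-supplied correlation ID plus an
// expiry deadline set by the REST layer from its ask timeout.
final class CreateOrder {
    final String correlationId;
    final Instant expiresAt;

    CreateOrder(String correlationId, Instant expiresAt) {
        this.correlationId = correlationId;
        this.expiresAt = expiresAt;
    }
}

class OrderActor extends AbstractActor {
    static Props props() {
        return Props.create(OrderActor.class, OrderActor::new);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(CreateOrder.class, order -> {
                    // "Stop retrying and ignore if pulled from the queue after this time":
                    // once the client has already been told the request failed, drop it.
                    if (Instant.now().isAfter(order.expiresAt)) {
                        return;
                    }
                    // ... otherwise call the remote order service with the existing
                    // queuing/retry logic, deduplicating on order.correlationId ...
                })
                .build();
    }
}
```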

What are the underlying mechanisms of actor->actor or service->actor calls, and how reliable are they?

I'd like to have more technical details about the underlying mechanisms of calling Actors in Azure Service Fabric, which I can't easily find online. Actors are known for their single-threaded, turn-based execution, so until a method call has fully completed, no other clients are allowed to call the actor.
To be more specific, I need to know what happens if an Actor is stuck for a while on a job initiated by one client's call. How long are other clients supposed to wait until the job gets done? Seconds, minutes, hours?
Is there any time-out mechanism, and if so, is it somehow configurable?
What happens if the node where the actor is located crashes? Would the client receive an immediate error, or does ActorProxy somehow handle this situation and redirect the call to a newly created instance of the Actor on a healthy node?
There are quite a few SO answers with details about actor mechanisms, and more in the docs; I can point you to a few:
This one does not exactly answer your question, but I described a bit how the locking works: Start a thread inside an Azure Service Fabric actor?
Q: Is there any time-out mechanism, and if so, is it somehow configurable?
Yes, there is a timeout; I have answered it here: Acquisition of turn based concurrency lock for actor '{actorName}' timed out after {time}
The configuration docs are located here.
Q: What happens if the node where the actor is located crashes? Would the client receive an immediate error, or does ActorProxy somehow handle this situation and redirect the call to a newly created instance of the Actor on a healthy node?
Generally there is always a replica available when one replica goes down; new requests will start moving to the new replica once SF promotes a secondary replica to Primary.
Regarding the communication: by default, SF Actors use .NET Remoting for communication, the same way as the reliable services. The behaviour is described very well here. In summary, it retries transient failures; if the client can't connect to the service (Actor), it will retry until it reaches the connection timeout.
From the docs:
The service proxy handles all failover exceptions for the service partition it is created for. It re-resolves the endpoints if there are failover exceptions (non-transient exceptions) and retries the call with the correct endpoint. The number of retries for failover exceptions is indefinite. If transient exceptions occur, the proxy retries the call.
The Actor docs have more info; in summary, there are two points to keep in mind:
Message delivery is best effort.
Actors may receive duplicate messages from the same client.
That means that if a transient failure occurs while delivering a message, the delivery will be retried even though the message may have already been delivered, causing duplicate messages.
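Because delivery is at-least-once, a common mitigation is to make the handling idempotent. This plain-Java sketch (not a Service Fabric API; the message ID is assumed to be supplied by the caller) only illustrates the idea; in a real Reliable Actor the set of processed IDs would live in persisted actor state:
```java
import java.util.HashSet;
import java.util.Set;

// Illustration of idempotent handling for at-least-once delivery.
class DeduplicatingHandler {
    // In a real actor this would be persisted state, not an in-memory set.
    private final Set<String> processedMessageIds = new HashSet<>();

    // messageId: a hypothetical identifier the caller attaches to each request.
    void handle(String messageId, Runnable work) {
        if (!processedMessageIds.add(messageId)) {
            return; // duplicate delivery of a message we already processed
        }
        work.run();
    }
}
```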

JsonTextReader.ReadAsync not reacting on cancellation

I am implementing an application that relies on Office365 API Streaming Subscriptions. When listening on subscriptions, I am reading the almost infinite response of a POST request. Practically, I am getting a stream that contains a single JSON object, which starts with some warm-up properties and has an array of events.
I am using JsonTextReader to parse the stream. There is a while (await jsonReader.ReadAsync(readCts.Token)) ... loop that seamlessly parses the stream.
It is working just great... well, almost.
At predefined intervals, I get a KeepAlive notification. The reader identifies them, so I want to use this to reset the readCts CancellationTokenSource timeout. If the notification does not arrive in time, the read operation should be canceled. So far so good. But timeout-based canceling works only if the underlying network connection is healthy (simulated by setting the timeout to less than the keep-alive event interval).
After the connection is interrupted, this async operation hangs and the cancellation token has no effect on it. The logical connection is lost as well; re-establishing the physical one does not resume the stream.
I have tried setting the HttpClient instance's timeout, but that had no effect either. Finally, I managed it by setting WinHttpHandler.ReceiveDataTimeout. But for that, I am using a separate HttpClient instance.
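For comparison, the same watchdog idea sketched in Java (purely illustrative, since the question is about .NET): rather than relying on a cancellation token, the timer closes the underlying connection when no keep-alive arrives in time, which forces the blocked read to fail with an I/O error.
```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Cancelling a read that is blocked on a dead connection often does nothing,
// so this watchdog closes the connection itself, which unblocks the reader.
class KeepAliveWatchdog implements Closeable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Closeable connection;   // whatever owns the network stream
    private final long timeoutMillis;
    private ScheduledFuture<?> pending;

    KeepAliveWatchdog(Closeable connection, long timeoutMillis) {
        this.connection = connection;
        this.timeoutMillis = timeoutMillis;
        reset();
    }

    // Call this every time a KeepAlive notification is parsed from the stream.
    synchronized void reset() {
        if (pending != null) pending.cancel(false);
        pending = scheduler.schedule(() -> {
            try {
                connection.close();   // unblocks the reader with an IOException
            } catch (IOException ignored) {
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public synchronized void close() {
        if (pending != null) pending.cancel(false);
        scheduler.shutdownNow();
    }
}
```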
1) Is the cancellation behavior described above normal?
2) I know that HttpClient instances should be reused, but in general these API calls are not running for hours, and I will probably have several such calls in parallel. Can I share one HttpClient instance, or do I need as many instances as I have parallel requests?
Thank you.

Semaphore error logged in Mobicents SIP servlet

We have an application written against Mobicents SIP Servlets; currently this is using v2.1.547, but I have also tested against v3.1.633 with the same behaviour noted.
Our application works as a B2BUA: we have an incoming SIP call, and we also place an outbound SIP call to an MRF which is executing VXML. These two SIP calls are associated with a single SipApplicationSession, which is the concurrency model we have configured.
The scenario which recreates this 100% of the time is as follows:
inbound call placed to our application (call is not answered)
outbound call placed to MRF
inbound call hangs up
application attempts to terminate the SipSession associated with the outbound call
I am seeing this being logged:
2015-12-17 09:53:56,771 WARN [SipApplicationSessionImpl] (MSS-Executor-Thread-14) Failed to acquire session semaphore java.util.concurrent.Semaphore#55fcc0cb[Permits = 0] for 30 secs. We will unlock the semaphore no matter what because the transaction is about to timeout. THIS MIGHT ALSO BE CONCURRENCY CONTROL RISK. app Session is5faf5a3a-6a83-4f23-a30a-57d3eff3281c;SipController
I am willing to believe that our application might somehow be triggering this behaviour, but I can't see how at the moment. I would have thought that acquiring/releasing the semaphore was entirely internal to the implementation, so it should ensure that nothing acquires the semaphore and never releases it?
Any pointers on how to get to the bottom of this would be appreciated, as I said it is 100% repeatable so getting logs etc is all possible.
It's hard to tell without seeing any logs or application code showing how you access the session and schedule messages to be sent. But if you use the same SipApplicationSession in an asynchronous manner, you may want to use our vendor-specific asynchronous API, SipSessionsUtilExt.scheduleAsynchronousWork (https://mobicents.ci.cloudbees.com/job/MobicentsSipServlets-Release/lastSuccessfulBuild/artifact/documentation/jsr289-extensions-apidocs/org/mobicents/javax/servlet/sip/SipSessionsUtilExt.html#scheduleAsynchronousWork(java.lang.String,%20org.mobicents.javax.servlet.sip.SipApplicationSessionAsynchronousWork)), which guarantees that access to the SipApplicationSession is serialized and avoids any concurrency issues.
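A sketch of what that might look like (the doAsynchronousWork callback shape is taken from the linked extension javadoc; verify it against your container version, and replace the BYE handling with whatever your B2BUA actually needs):
```java
import java.util.Iterator;

import javax.servlet.sip.SipApplicationSession;
import javax.servlet.sip.SipSession;

import org.mobicents.javax.servlet.sip.SipApplicationSessionAsynchronousWork;
import org.mobicents.javax.servlet.sip.SipSessionsUtilExt;

public class AsyncLegTermination {

    // sipSessionsUtil is the container-provided SipSessionsUtil cast to the
    // Mobicents extension interface; appSessionId identifies the B2BUA's
    // SipApplicationSession.
    public void terminateOutboundLeg(SipSessionsUtilExt sipSessionsUtil, String appSessionId) {
        sipSessionsUtil.scheduleAsynchronousWork(appSessionId,
                new SipApplicationSessionAsynchronousWork() {
                    @Override
                    public void doAsynchronousWork(SipApplicationSession appSession) {
                        // The container serializes this work against other messages for
                        // the same SipApplicationSession, so no extra locking is needed.
                        Iterator<?> sessions = appSession.getSessions("SIP");
                        while (sessions.hasNext()) {
                            SipSession sipSession = (SipSession) sessions.next();
                            // ... send BYE / terminate the outbound leg as required ...
                        }
                    }
                });
    }
}
```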

NServiceBus MSMQ messages intermittently get stuck on the Outgoing Queue

We have a Pub/Sub system based on NServiceBus, where we have intermittent issues with messages getting stuck in the Publisher's outgoing queue indefinitely, rather than being transmitted to the Subscribers' input queues.
Points to note:
When we restart the Publisher Service and Subscriber services, message flow resumes normally for a while.
The problem seems to occur more often if a sustained period of time between messages occurs.
The publisher service resides on the LAN, the subscribers on the other side of a firewall.
Some messages get through! As mentioned after service restarts, things go fine for a while.
Using QueueExplorer, I can see the messages on the Outgoing queue have a state of WAITING.
Annoyingly, our development environment does not exhibit this behaviour, but then again the publisher and subscribers all reside on the same LAN in that environment.
MSMQ messages being stuck in an outgoing queue is purely an MSMQ issue. Restarting the Publisher and Subscriber services should make no difference, as they are not directly involved in message delivery. If you can fix the problem by ONLY restarting the Pub/Sub services and NOT the Message Queuing service, then it looks like a resource/memory-leak problem.
I imagine something like this happening:
Messages flow to destination, which uses up kernel memory in storing them
For some reason, kernel memory runs out (too many messages, memory leak, whatever)
Destination now rejects new messages as they cannot be loaded into memory from the wire
Connection is reset and not re-connected until the WaitTime value is reached; the queue is "waiting" at this point
System loops through (3) and (4) until ...
Pub/Sub services are restarted and now there is sufficient resources for messages to be delivered
Goto (2)
Occasional messages get through when just enough kernel memory is temporarily freed up by one of the many services and device drivers that use it.
Item 4 of this blog post is the most likely culprit:
http://blogs.msdn.com/b/johnbreakwell/archive/2006/09/18/insufficient-resources-run-away-run-away.aspx
Cheers
John Breakwell
We had a similar scenario in production; it turned out we had migrated one of our subscriber endpoints to a new physical host and forgot to unsubscribe before shutting down the old endpoint. Our publisher was trying to deliver messages to both the old and new endpoints but could only reach the new one. Eventually the publisher's outbound queue grew so large that it started affecting all outgoing messages.
I have run into this issue as well, and I know it is not item 4, as I don't send anything to it before it gets stuck in the outgoing queue. If I let both publisher and subscriber sit idle for about 10 minutes before sending a message, it never leaves the outgoing queue. If I send a message before that amount of time, it flows fine. Also, if I restart the subscriber, the message will then flow. This is reproducible every time I let them sit idle for 10 minutes.
I think I found the answer here, at least this fixed the issue I was having:
http://support.microsoft.com/kb/2554746
Also, in my case it had nothing to do with restarting, so don't let that throw you off; I did exhibit the symptoms in netstat, and messages would initially go through when the client was first started up.
Just to throw my 2p in:
We had an issue where the Message Queuing service had some kind of memory leak and would consume large amounts of memory which it did not release.
This led to messages getting stuck for long periods of time, although they would eventually be delivered (sometimes after 3 days).
We have not bothered fixing this yet, as it only happens when the service is under heavy load, which does not happen often.