Handleing AskTimeoutException on Akka based application - scala

I have the following HTTP-based application that routes every request to an Akka Actor which uses a long chain of Akka Actors to process the request.
path("process-request") {
post {
val startedAtAsNano = System.nanoTime()
NonFunctionalMetrics.requestsCounter.inc()
NonFunctionalMetrics.requestsGauge.inc()
entity(as[Request]) { request =>
onComplete(distributor ? [Response](replyTo => Request(request, replyTo))) {
case Success(response) =>
NonFunctionalMetrics.requestsGauge.dec()
NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.OK.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
complete(response)
case Failure(ex) =>
NonFunctionalMetrics.requestsGauge.dec()
NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.INTERNAL_SERVER_ERROR.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
logger.warn(s"A general error occurred for request: $request, ex: ${ex.getMessage}")
complete(InternalServerError, s"A general error occurred: ${ex.getMessage}")
}
}
}
}
As you can see, I'm sending the distributor an ask request for response.
The problem is that on really high RPS, sometimes, the distributor fails with the following exception:
2022-04-16 00:36:26.498 WARN c.d.p.b.http.AkkaHttpServer - A general error occurred for request: Request(None,0,None,Some(EntitiesDataRequest(10606082,0,-1,818052,false))) with ex: Ask timed out on [Actor[akka://MyApp/user/response-aggregator-pool#1374579366]] after [5000 ms]. Message of type [com.dv.phoenix.common.pool.WorkerPool$Request]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
This is a typical non-informative Exception, the normal processing time is about 700 micros, 5 seconds its must be stuck somewhere at the pipeline since it cannot be that high.
I want to monitor this, I thought about adding Kamon integration which provides Akka Actors module with mailboxes, etc.
I tried to add the following configurations but its not worked for me:
https://kamon.io/docs/latest/instrumentation/akka/ask-pattern-timeout-warning/ (didn't show any effect)
Is there other suggestions to understand the cause for this issue on high RPS system?
Thanks!

The Kamon instrumentation is useful for finding how you got to the ask. It can be useful if you have a lot of places where an ask can time out, but otherwise it's not likely to tell you the problem.
This is because an ask timeout is nearly always a symptom of some other problem (the lone exception is if many asks could plausibly be done in a stream (e.g. in a mapAsync or ask stage) but aren't; that doesn't apply in this code). Assuming that the timeouts aren't caused by (e.g.) a database being down so you're getting no reply or a cluster failing (both of these are fairly obvious, thus my assumption), the cause of a timeout (any timeout, generally) is often having too many elements in a queue ("saturation").
But which queue? We'll start with the distributor, which is an actor processing messages one-at-a-time from its mailbox (which is a queue). When you say that the normal processing time is 700 micros, is that measuring the time the distributor spends handling a request (i.e. the time before it can handle the next request)? If so, and the distributor is taking 700 micros, but requests come in every 600 micros, this can happen:
time 0: request 0 comes in, processing starts in distributor (mailbox depth 0)
600 micros: request 1 comes in, queued in distributor's mailbox (mailbox depth 1)
700 micros: request 0 completes (700 micros latency), processing of request 1 begins (mailbox depth 0)
1200 micros: request 2 comes in, queued (mailbox depth 1)
1400 micros: request 1 completes (800 micros latency), processing of request 2 begins (mailbox depth 0)
1800 micros: request 3 comes in, queued (mailbox depth 1)
2100 micros: request 2 completes (900 micros latency), processing of request 3 begins (mailbox depth 0)
2400 micros: request 4 comes in, queued (mailbox depth 1)
2800 micros: request 3 completes (1000 micros latency), processing of request 4 begins (mailbox depth 0)
3000 micros: request 5 comes in, queued (mailbox depth 1)
3500 micros: request 4 completes (1100 micros latency), processing of request 5 begins (mailbox depth 0)
3600 micros: request 6 comes in, queued (mailbox depth 1)
4200 micros: request 7 comes in, queued, request 5 completes (1200 micros latency), processing of request 6 begins (mailbox depth 1)
4800 micros: request 8 comes in, queued (mailbox depth 2)
4900 micros: request 6 completes (1300 micros latency), processing of request 7 begins (mailbox depth 1)
5400 micros: request 9 comes in, queued (mailbox depth 2)
and so on: the latency and depth increase without bound. Eventually, the depth is such that requests spend 5 seconds (or more, even) in the mailbox.
Kamon has the ability to track the number of messages in the mailbox of an actor (it's recommended to only do this on specific actors). Tracking the mailbox depth of distributor in this case would show it growing without bound to confirm that this is happening.
If the distributor's mailbox is the queue that's getting too deep, first consider how request N can affect request N + 1. The one-at-a-time processing model of an actor is only strictly required when the response to a request can be affected by the request immediately prior to it. If a request only concerns some portion of the overall state of the system then that request can be handled in parallel with requests that do not concern any part of that portion. If there are distinct portions of the overall state such that no request is ever concerned with 2 or more portions, then responsibility for each portion of state can be offloaded to a specific actor and the distributor looks at each request only for long enough to determine which actor to forward the request to (note that this will typically not entail the distributor making an ask: it hands off the request and its the responsibility of the actor it hands off to (or that actor's designee...) to reply). This is basically what Cluster Sharding does under the hood, and it's also noteworthy that doing this will probably increase the latency under low load (because you are doing more work), but increases peak throughput by up to the number of portions of state.
If that's not a workable way to address the distributor's mailbox being saturated (viz. there's no good way to partition the state), then you can at least limit the time requests spend in the mailbox by including a "respond-by" field in the request message (e.g. for a 5 second ask timeout, you might require a response by 4900 millis after constructing the ask). When the distributor starts processing a message and the respond-by time has passed, it moves onto the next request: doing this effectively means that when the mailbox starts to saturate, the message processing rate increases.
Of course, it's possible that your distributor's mailbox isn't the queue that's getting saturated, or that if it is, it's not because the actor is spending too much time processing messages. It's possible that the distributor (or other actors needed for a response) aren't processing messages.
Actors run inside a dispatcher which has the ability to have some number of actors (or Future callbacks or other tasks, each of which can be viewed as equivalent to an actor which is spawned for processing a single message) processing a message at a given time. If there are more actors which have a message in their respective mailboxes than the number that can be processing a message, those actors are in a queue to be scheduled (note that this applies even if you happen to have a dispatcher which will spawn as many threads as it needs to process a message: since there are a limited number of CPU cores, the OS kernel scheduler's queue will take the role of the dispatcher queue). Kamon can track the depth of this queue. In my experience, it's more valuable to detect thread starvation (basically whether the time between task submission and when the task starts executing exceeds some threshold) is occurring. Lightbend's package of commercial tooling for use with Akka (disclaimer: I am employed by Lightbend) provides tools for detecting, with minimal overhead, whether starvation is occurring and providing other diagnostic information.
If thread starvation is being observed, and things like garbage collection pauses, or CPU throttling (e.g. due to running in a container) are ruled out, the primary cause of starvation is actors (or actor-like things) taking too long to process a message either because they are executing blocking I/O or are doing too much in the processing of a single message. If blocking I/O is the culprit, try to move the I/O to actors or futures running in a thread pool with far more threads than the number of CPU cores (some even advocate for an unbounded thread pool for this purpose). If it's a case of doing too much computation in processing a single message, look for spots in the processing where it makes sense to capture the state needed for the remainder of the computation in a message and send that message to yourself (this is basically equivalent to a coroutine yielding).

Related

Operating system computations for throughput

Can i get help with this? I can't seem to understand the question
"In this problem you are to compare reading a file using a single-threaded file server with a multi- threaded file server. It takes 16 msec to get a request for work, dispatch it, and do the rest of the necessary processing, assuming the data are in the block cache. If a disk operation is needed (assume a spinning disk drive with 1 head), as is the case one-fourth of the time, an additional 32 msec is required."
Can i get help with this?
I don't think so (I don't think there's enough information in the question for anyone to be able to understand it).
Example 1
The file server is single-threaded and handles asynchronous requests, and the "16 msec" is primarily "request delivery latency" (time between a process sending a request and the file server receiving the request). A process sends a single request asking to read from 1000 files, the file server receives this request, "immediately" sends back 750 replies (for file data that was cached) and sends a single request asking something (file system code, disk driver?) to fetch the remaining 250 things; then file server "immediately" waits for more requests while waiting for the reply from something (file system code, disk driver?) to complete the early 250 things. In this case you can say that throughput for single-threaded file server is virtually infinite (e.g. infinite throughput for "file data cache hit", which is the only thing that matters because you can make more requests while waiting for slow disk IO).
Example 2
The file server has 8 threads and handles synchronous requests. A single-threaded process sends 1 request (to read from 1 file) and then has to wait for the reply, the request is given to one of the file server's threads (doesn't matter which) and that thread takes an average of "16 + 32*0.25 = 24 msec" for the request to be handled before the process can make it's next request; and the process does this in a loop because it wants to read 1000 files. In this case throughput is "1/0.024 = 41.66 requests per second", which is extremely bad (primarily because the single-threaded process can't send requests fast enough to keep all threads of the multi-threaded server busy).
Example 3
The file server has 8 threads and handles synchronous requests. A process with 1000 threads send 1 request (to read from 1 file) from each of its threads. In this case we need to know how many CPUs there are (and how scheduler works) to determine anything about throughput. E.g. if there's only 2 CPUs then you're not going to get 8 file server threads running in parallel at the same time.

How does jmeter starts sending requests to server

If Thread: 100, Rampup: 1 and Loop count: 1 is the configuration, how will jmeter start sending requests to the server?
Request will be sent 1 req/sec or all requests will be sent all at once to server?
JMeter will send requests as fast as it can, to wit:
It will start all threads (virtual users) you define in Thread Group within the ramp-up period (in your case - 100 threads in 1 second)
Each thread (virtual user) will start executing Samplers which are present in the Thread Group upside down (or according to the Logic Controllers)
When there are no more samplers to execute or loops to iterate the thread will be shut down
When there are no more active threads left - JMeter test will end.
With regards to requests per second - it mostly depends on your application response time, i.e.
if you have 100 virtual users and response time is 1 second - you will get 100 requests/second
if you have 100 virtual users and response time is 2 seconds - you will get 50 requests/second
if you have 100 virtual users and response time is 500 milliseconds - you will get 200 requests/second
etc.
I would recommend increasing (and decreasing) the load gradually, this way you will be able to correlate increasing load with increasing throughput/response time/number of errors, etc. while releasing all threads at once will not tell you the full story (unless you're doing a form of spike testing, in this case consider using Synchronizing Timer)
JMeter's ramp-up period set as 1 means to start all 100 threads in 1 second.
This isn't recommended settings as describe below
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread was begun. If there are 30 threads and a ramp-up period of 120 seconds, then each successive thread will be delayed by 4 seconds.
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish (unless one wants that to happen).
Start with Ramp-up = number of threads and adjust up or down as needed.
See also Can i set ramp up period 0 in JMeter?
bear in mind that with low rampup and many threads, you may be limited by local resources, so your results may be a measurement of client capability rather than server.

Performance issue sending larger messages via akka remote actors

The responses to concurrent requests to remote actors were taking long time to respond, aka 1 request takes 300 ms, but 100 concurrent requests took almost 30 seconds to complete! So it almost looks like the requests are being executed sequentially! The request size is small, but response size was about 120 kB in JVM before serialization. But the response had deep nested case class.
The response times are similar when running on two different JVMs on same machine as well. But responses are fast when in same JVM (i.e. local actors). It is a single client making concurrent requests to one remote actor.
I see this log in akka debug logs. What does this indicate?
DEBUG test-app akka.remote.EndpointWriter - Drained buffer with
maxWriteCount: 50, fullBackoffCount: 546, smallBackoffCount: 2,
noBackoffCount: 1 , adaptiveBackoff: 2000
The logs show that write to send-buffer failed. This could indicate that
send-buffer is too small
receive-buffer on the remote actor's side is too small
network issues
The send buffer size and receive buffer size directly limits the number of concurrent requests and responses! Increase the send buffer and receive buffer sizes, on both client and server, to support the required concurrency in both client and server.
If the buffer size is not adequate, netty will wait for the buffer to be cleared before attempting to rewrite to the buffer. And by default there will be a backoff time too, and this can be configured as well.
The settings are under remote.netty.tcp:
akka {
remote {
netty.tcp {
# Sets the send buffer size of the Sockets,
# set to 0b for platform default
send-buffer-size = 1024000b
# Sets the receive buffer size of the Sockets,
# set to 0b for platform default
receive-buffer-size = 2048000b
}
# Controls the backoff interval after a refused write is reattempted.
# (Transports may refuse writes if their internal buffer is full)
backoff-interval = 1 ms
}
}
For full configuration see Akka reference config.

Why is UDP packet reception seemingly optimized mid-execution?

I'm running a client server configuration over Ethernet and measuring packet latency at both ends. The client (windows) is sending packets every 5 ms (confirmed with wire shark) as it should. Yet, the server (embedded linux) only receives packets at 5 ms intervals for a few seconds, at which point it stops for 300 ms. After this break the latency is only 20 us. After another period of about a few seconds it takes another break for 300 ms. This repeats indefinitely (300ms break, 20 us packet latency burst). It seems as if the server program is being optimized mid-execution to read IO in shorter bursts. Why is this happening?
Disclaimer: I haven't posted the code, as the client and server are small subsets of more complex applications, however, I am willing to factor it out if an obvious answer doesn't present itself.
This is UDP so there is no handshake or any flow control mechanism. Those 300 ms must be because of work the server is doing in the processing of the UDP messages received. During those 300 ms the server has surely lost around 60 messages that were not read from client.
You might probably want to check the server does not take more than 5 ms in processing each message if it uses one thread to process. If the server uses multi-threading to process the messages and the processing takes some time, even if it takes 1 ms, you might be in a situation where at some point all threads are competing for resources and they don't finish in time to read the next message. For the problem you are describing I would bet the server is multithreaded and you have that problem. I cannot assure that 100% for lack of info though. But in any case, you want to check the time it takes to process messages because you might be dealing with real-time requirements.
I spaced out the measurements to 1 in every 1000 packets and now it is behaving itself. I was using printf every 5ms which must've eventually filled the printf tx queue entirely. This then delayed the execution for 300ms. Once printf caught its breath, the program had a queue full of incoming packets and thus was seemingly receiving packets every 20 us.

Using many consumers in SQS Queue

I know that it is possible to consume a SQS queue using multiple threads. I would like to guarantee that each message will be consumed once. I know that it is possible to change the visibility timeout of a message, e.g., equal to my processing time. If my process spend more time than the visibility timeout (e.g. a slow connection) other thread can consume the same message.
What is the best approach to guarantee that a message will be processed once?
What is the best approach to guarantee that a message will be processed once?
You're asking for a guarantee - you won't get one. You can reduce probability of a message being processed more than once to a very small amount, but you won't get a guarantee.
I'll explain why, along with strategies for reducing duplication.
Where does duplication come from
When you put a message in SQS, SQS might actually receive that message more than once
For example: a minor network hiccup while sending the message caused a transient error that was automatically retried - from the message sender's perspective, it failed once, and successfully sent once, but SQS received both messages.
SQS can internally generate duplicates
Simlar to the first example - there's a lot of computers handling messages under the covers, and SQS needs to make sure nothing gets lost - messages are stored on multiple servers, and can this can result in duplication.
For the most part, by taking advantage of SQS message visibility timeout, the chances of duplication from these sources are already pretty small - like fraction of a percent small.
If processing duplicates really isn't that bad (strive to make your message consumption idempotent!), I'd consider this good enough - reducing chances of duplication further is complicated and potentially expensive...
What can your application do to reduce duplication further?
Ok, here we go down the rabbit hole... at a high level, you will want to assign unique ids to your messages, and check against an atomic cache of ids that are in progress or completed before starting processing:
Make sure your messages have unique identifiers provided at insertion time
Without this, you'll have no way of telling duplicates apart.
Handle duplication at the 'end of the line' for messages.
If your message receiver needs to send messages off-box for further processing, then it can be another source of duplication (for similar reasons to above)
You'll need somewhere to atomically store and check these unique ids (and flush them after some timeout). There are two important states: "InProgress" and "Completed"
InProgress entries should have a timeout based on how fast you need to recover in case of processing failure.
Completed entries should have a timeout based on how long you want your deduplication window
The simplest is probably a Guava cache, but would only be good for a single processing app. If you have a lot of messages or distributed consumption, consider a database for this job (with a background process to sweep for expired entries)
Before processing the message, attempt to store the messageId in "InProgress". If it's already there, stop - you just handled a duplicate.
Check if the message is "Completed" (and stop if it's there)
Your thread now has an exclusive lock on that messageId - Process your message
Mark the messageId as "Completed" - As long as this messageId stays here, you won't process any duplicates for that messageId.
You likely can't afford infinite storage though.
Remove the messageId from "InProgress" (or just let it expire from here)
Some notes
Keep in mind that chances of duplicate without all of that is already pretty low. Depending on how much time and money deduplication of messages is worth to you, feel free to skip or modify any of the steps
For example, you could leave out "InProgress", but that opens up the small chance of two threads working on a duplicated message at the same time (the second one starting before the first has "Completed" it)
Your deduplication window is as long as you can keep messageIds in "Completed". Since you likely can't afford infinite storage, make this last at least as long as 2x your SQS message visibility timeout; there is reduced chances of duplication after that (on top of the already very low chances, but still not guaranteed).
Even with all this, there is still a chance of duplication - all the precautions and SQS message visibility timeouts help reduce this chance to very small, but the chance is still there:
Your app can crash/hang/do a very long GC right after processing the message, but before the messageId is "Completed" (maybe you're using a database for this storage and the connection to it is down)
In this case, "Processing" will eventually expire, and another thread could process this message (either after SQS visibility timeout also expires or because SQS had a duplicate in it).
Store the message, or a reference to the message, in a database with a unique constraint on the Message ID, when you receive it. If the ID exists in the table, you've already received it, and the database will not allow you to insert it again -- because of the unique constraint.
AWS SQS API doesn't automatically "consume" the message when you read it with API,etc. Developer need to make the call to delete the message themselves.
SQS does have a features call "redrive policy" as part the "Dead letter Queue Setting". You just set the read request to 1. If the consume process crash, subsequent read on the same message will put the message into dead letter queue.
SQS queue visibility timeout can be set up to 12 hours. Unless you have a special need, then you need to implement process to store the message handler in database to allow it for inspection.
You can use setVisibilityTimeout() for both messages and batches, in order to extend the visibility time until the thread has completed processing the message.
This could be done by using a scheduledExecutorService, and schedule a runnable event after half the initial visibility time. The code snippet bellow creates and executes the VisibilityTimeExtender every half of the visibilityTime with a period of half the visibility time. (The time should to guarantee the message to be processed, extended with visibilityTime/2)
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
ScheduledFuture<?> futureEvent = scheduler.scheduleAtFixedRate(new VisibilityTimeExtender(..), visibilityTime/2, visibilityTime/2, TimeUnit.SECONDS);
VisibilityTimeExtender must implement Runnable, and is where you update the new visibility time.
When the thread is done processing the message, you can delete it from the queue, and call futureEvent.cancel(true) to stop the scheduled event.