Akka Stream Exception Thrown When Downloading File From S3 - scala

I am trying to download a file from S3 using the following code:
wsClient
.url(url)
.withMethod("GET")
.withHttpHeaders(my_headers: _*)
.withRequestTimeout(timeout)
.stream()
.map {
case AhcWSResponse(underlying) =>
underlying.bodyAsBytes
}
When I run this I get the following exception:
akka.stream.StreamLimitReachedException: limit of 13 reached
Is this because I am using bodyAsBytes? What does this error mean? I also see this warning message, which is probably related:
blockingToByteString is a blocking and unsafe operation!

This happens because when you use stream(), you need to consume the response body with bodyAsSource. It is important to do so; otherwise the unconsumed source backpressures the connection. body and bodyAsBytes are still implemented for streamed responses and do consume the source, but the implementor apparently chose to signal that you should have used execute() instead of stream() by limiting the body to 13 ByteStrings and a 50 ms timeout.
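A minimal sketch of the bodyAsSource approach, assuming Play WS 2.6 or later and an implicit Materializer and ExecutionContext in scope. The object name, the method name and the target path are hypothetical and not part of the question:

```scala
import java.nio.file.Path

import akka.stream.{IOResult, Materializer}
import akka.stream.scaladsl.FileIO
import play.api.libs.ws.WSClient

import scala.concurrent.{ExecutionContext, Future}

object S3DownloadSketch {
  // Streams the response body straight to `target` instead of buffering it
  // with bodyAsBytes, so the source returned by stream() is fully consumed.
  def downloadTo(ws: WSClient, url: String, target: Path)(
      implicit mat: Materializer, ec: ExecutionContext): Future[IOResult] =
    ws.url(url)
      .withMethod("GET")
      .stream()
      .flatMap { response =>
        // bodyAsSource is a Source[ByteString, _]; FileIO.toPath is the sink
        response.bodyAsSource.runWith(FileIO.toPath(target))
      }
}
```

If the whole body is needed in memory anyway, calling execute() instead of stream() and then reading bodyAsBytes avoids the limit described above.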

You are getting a StreamLimitReachedException because the number of incoming elements is larger than max.
val MAX_ALLOWED_SIZE = 100
// OK. Future will fail with a `StreamLimitReachedException`
// if the number of incoming elements is larger than max
val limited: Future[Seq[String]] =
mySource.limit(MAX_ALLOWED_SIZE).runWith(Sink.seq)
// OK. Collect up until max-th elements only, then cancel upstream
val ignoreOverflow: Future[Seq[String]] =
mySource.take(MAX_ALLOWED_SIZE).runWith(Sink.seq)
You can find more information about streaming process here

Related

Akka HTTP / Error Response entity was not subscribed after 1 second

I searched the other StackOverflow questions and answers about this error, but couldn't find a hint for solving this problem.
The Akka HTTP application runs for about 5 hours under high workload without problems, and then I start to get multiple:
Response entity was not subscribed after 1 second. Make sure to read the response `entity` body or call `entity.discardBytes()` on it -- in case you deal with `HttpResponse`, use the shortcut `response.discardEntityBytes()`. GET /api/name123 Empty -> 200 OK Default(142 bytes)
and later
The connection actor has terminated. Stopping now.
The actor only sends out API requests and then forwards the responses to another actor if they are successful. In case of failure, the request is added back to the todo stack and retried later. This is the main code:
private def makeApiRequest(id: String): Unit = {
val url = UrlBuilder(id)
val request = HttpRequest(method = HttpMethods.GET, uri = url)
val f: Future[(StatusCode, String)] = Http(context.system)
.singleRequest(request)
.flatMap(_.toStrict(2.seconds))
.flatMap { resp =>
Unmarshal(resp.entity).to[String].map((resp.status, _))
}
context.pipeToSelf(f) {
case Success(response) =>
API_HandleResponseSuccess(id, response._1, response._2)
case Failure(e) =>
API_HandleResponseFailure(id, e.getMessage)
}
}
I don't really understand why I get the "Response entity was not subscribed..." error, as I call Unmarshal(resp.entity).to[String] and would therefore think that no .discardEntityBytes() is needed. Or does it still need to be included somehow?
Side note: it is also confusing to me why the CPU performance doesn't stay constant.
Within the actor I track the response time of each request and regularly calculate the maximum number of parallel requests the given hardware can handle (capped at 120) to account for API response time fluctuations, so there should always be enough room to make the requests without starving that actor. In addition, this is the respective application.conf:
dispatcher-worker-io {
type = Dispatcher
executor = "thread-pool-executor"
thread-pool-executor {
fixed-pool-size = 120
keep-alive-time = 60s
allow-core-timeout = off
}
shutdown-timeout = 60s
throughput = 1
}
...
akka.http.client.host-connection-pool.max-connections = 180
akka.http.client.host-connection-pool.max-open-requests = 256
akka.http.client.host-connection-pool.max-retries = 0
Any ideas on why, after 5 hours without problems, I start to get the exceptions mentioned above?
or
Does anyone have an idea which part of the code shared above might lead to this non-linear CPU performance?
I also did several of these long runs, and it always ends like this: somehow it starts starving after 5 to 6 hours.
val AkkaVersion = "2.6.15"
val AkkaHttpVersion = "10.2.6"
Directly from the docs (https://doc.akka.io/docs/akka-http/current/client-side/request-level.html):
Always make sure you consume the response entity streams (of type
Source[ByteString,Unit]). Connect the response entity Source to a
Sink, or call response.discardEntityBytes() if you don’t care about
the response entity.
Read the Implications of the streaming nature of Request/Response
Entities section for more details.
If the application doesn’t subscribe to the response entity within
akka.http.host-connection-pool.response-entity-subscription-timeout,
the stream will fail with a TimeoutException: Response entity was not
subscribed after ....
You need to call .discardEntityBytes() in case of failure. Right now you only consume the entity on success.
Perhaps the high CPU load is caused by all these unfreed resources on the JVM plus the retries of all the failures.
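A minimal sketch of consuming the entity on every path, assuming Akka HTTP 10.2.x with a classic ActorSystem in implicit scope (the question uses context.system instead). The object and method names are hypothetical, and treating every status other than 200 OK as a failure is an assumption made for illustration:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, StatusCode, StatusCodes}
import akka.http.scaladsl.unmarshalling.Unmarshal

import scala.concurrent.Future

object ApiClientSketch {
  def fetchBody(request: HttpRequest)(
      implicit system: ActorSystem): Future[(StatusCode, String)] = {
    import system.dispatcher // ExecutionContext for the Future callbacks

    Http().singleRequest(request).flatMap { resp =>
      resp.status match {
        case StatusCodes.OK =>
          // Unmarshalling reads (and thereby consumes) the entity stream
          Unmarshal(resp.entity).to[String].map(body => (resp.status, body))
        case other =>
          // Not interested in the body: drain it so the pooled connection is freed
          resp.discardEntityBytes()
          Future.failed(new RuntimeException(s"Unexpected status: $other"))
      }
    }
  }
}
```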

Gatling: Producer and consumer users

I have a load test where three sets of users create something and a different set of users perform some actions on them.
What is the recommended way to co-ordinate this behaviour in Gatling?
I'm currently using an object which contains a LinkedBlockingQueue which the "producers" put the ID and consumers take, see below.
However, it causes the test to hang after ~20s (targeting 1tps).
I've also tried using poll with a timeout, but instead of hanging the poll almost always fails (after 30s) or causes a hang if the timeout is larger (1m+).
This seems to be because all the threads are blocked waiting for something from the queue so isn't compatible with the way Gatling tests run (i.e. not 1 thread per user). Is there a non-blocking way to wait in the Gatling DSL?
Producer.scala
// ...
scenario("Produce stuff")
.exec(/* HTTP call which extracts an ID*/)
.exec { session => Queue.ids.put(session("my-id").as[String]); session }
// ...
Consumer.scala
// ...
scenario("Consume stuff")
.exec(session => session.set("my-id", Queue.ids.take()))
.exec(/* HTTP call which uses the ID */)
// ...
Queue.scala
object Queue {
val ids = new LinkedBlockingQueue[String]()
}
As an alternative I've tried to use the application functionality but it seems a harder problem to ensure that each user picks a unique item from the app.
Acknowledging this is all a hack, my current solution in Consumer.scala is:
doIf(_ => Queue.ids.size() < MIN_COUNT)(
pause(30) // wait for 30s if queue is initially too small
)
.doWhile(_ => Queue.ids.size() >= MIN_COUNT)(
exec(session => session.set("my-id", Queue.ids.take()))
.exec(...)
.pause(30)
)
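The blocking problem described above comes from take(), which parks the calling thread until an element arrives. A minimal sketch of a non-blocking variant using only java.util.concurrent follows. The object name NonBlockingIds and its methods are hypothetical, and how the Option result is wired into the scenario (for example, pausing and retrying on None) is left open:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical replacement for the Queue object above.
object NonBlockingIds {
  private val ids = new ConcurrentLinkedQueue[String]()

  // Called by producers; the queue is unbounded, so this never blocks.
  def put(id: String): Unit = {
    ids.add(id)
    ()
  }

  // Called by consumers; returns immediately, with None when nothing is queued.
  def tryTake(): Option[String] = Option(ids.poll())
}
```

Because tryTake() never waits, a consumer that finds the queue empty can pause in the scenario and try again later instead of occupying a thread.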

Exceeded configured max-open-requests

Recently I started to build a small web processing service using Akka Streams. It's quite simple: I'm pulling URLs from Redis, downloading those URLs (they are images), processing the images, and then pushing them to S3 and some JSON to Redis.
I'm downloading a lot of different kinds of images from multiple sites, and I'm getting a whole bunch of errors like 404, Unexpected disconnect, Response Content-Length 17951202 exceeds the configured limit of 8388608, EntityStreamException: Entity stream truncation, and redirects. For redirects I'm invoking requestWithRedirects with the address found in the Location header of the response.
Part responsible for downloading is pretty much like this:
override lazy val http: HttpExt = Http()
def requestWithRedirects(request: HttpRequest, retries: Int = 10)(implicit akkaSystem: ActorSystem, materializer: FlowMaterializer): Future[HttpResponse] = {
TimeoutFuture(timeout, msg = "Download timed out!") {
http.singleRequest(request)
}.flatMap {
response => handleResponse(request, response, retries)
}.recoverWith {
case e: Exception if retries > 0 =>
requestWithRedirects(request, retries = retries - 1)
}
}
TimeoutFuture is quite simple: it takes a future and a timeout. If the future takes longer than the timeout, it returns another future that fails with a timeout exception.
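The question does not show TimeoutFuture itself. A minimal sketch of such a wrapper, using only the standard library, might look like the following. The object name, the method name and the timer thread are assumptions, not the asker's implementation:

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit, TimeoutException}

import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object TimeoutFutureSketch {
  // Single daemon thread used only to fire timeouts.
  private val timer = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "timeout-timer")
      t.setDaemon(true)
      t
    }
  })

  def withTimeout[T](timeout: FiniteDuration, msg: String)(future: Future[T])(
      implicit ec: ExecutionContext): Future[T] = {
    val promise = Promise[T]()

    // Fail the promise when the timeout elapses (no-op if already completed).
    val task = timer.schedule(new Runnable {
      def run(): Unit = {
        promise.tryFailure(new TimeoutException(msg))
        ()
      }
    }, timeout.toMillis, TimeUnit.MILLISECONDS)

    // Whichever happens first wins; the wrapped future itself is NOT cancelled.
    future.onComplete { result =>
      task.cancel(false)
      promise.tryComplete(result)
    }
    promise.future
  }
}
```

Note that a wrapper of this shape only fails the returned future. The wrapped future, and any request behind it, keeps running after the timeout fires.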
The problem I'm having is: after some time I'm getting an error:
Message: RuntimeException: Exceeded configured max-open-requests value of [128] akka.http.impl.engine.client.PoolInterfaceActor$$anonfun$receive$1.applyOrElse in PoolInterfaceActor.scala::109
akka.actor.Actor$class.aroundReceive in Actor.scala::467
akka.http.impl.engine.client.PoolInterfaceActor.akka$stream$actor$ActorSubscriber$$super$aroundReceive in PoolInterfaceActor.scala::46
akka.stream.actor.ActorSubscriber$class.aroundReceive in ActorSubscriber.scala::208
akka.http.impl.engine.client.PoolInterfaceActor.akka$stream$actor$ActorPublisher$$super$aroundReceive in PoolInterfaceActor.scala::46
akka.stream.actor.ActorPublisher$class.aroundReceive in ActorPublisher.scala::317
akka.http.impl.engine.client.PoolInterfaceActor.aroundReceive in PoolInterfaceActor.scala::46
akka.actor.ActorCell.receiveMessage in ActorCell.scala::516
akka.actor.ActorCell.invoke in ActorCell.scala::487
akka.dispatch.Mailbox.processMailbox in Mailbox.scala::238
akka.dispatch.Mailbox.run in Mailbox.scala::220
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec in AbstractDispatcher.scala::397
scala.concurrent.forkjoin.ForkJoinTask.doExec in ForkJoinTask.java::260
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask in ForkJoinPool.java::1339
scala.concurrent.forkjoin.ForkJoinPool.runWorker in ForkJoinPool.java::1979
scala.concurrent.forkjoin.ForkJoinWorkerThread.run in ForkJoinWorkerThread.java::107
I'm not sure what the problem could be, but I think some downloads were not finished properly and stay in a global connection pool, which after a while causes the error above. Any ideas what could be causing the problem, or how to find its root? I already tested 404 responses and Response Content-Length exceeds... errors, and they don't seem to be my troublemakers.
EDIT:
Most likely the problem is with my TimeoutFuture. I'm completing it with an error as described here https://stackoverflow.com/a/29330010/2963977 but in my opinion the future that is actually downloading an image never completes and keeps holding my connection pool resources.
I wonder why these settings don't have any impact in my case:
akka.http.client.connecting-timeout = 1 s
akka.http.client.idle-timeout = 1 s
akka.http.host-connection-pool.idle-timeout = 1 s
EDIT2:
Apparently timeouts are not supported yet. Here is my bug report
https://github.com/akka/akka/issues/17732#issuecomment-112315953

Use of Streams,Enumerator and Websockets in playframework

I'm looking for the proper way to use Play's Enumerator (play.api.libs.iteratee.Enumerator[A]) in my code. I have a stream of objects of type "InfoBlock" and I want to redirect it to a websocket. What I actually do is:
The data structure holding the blocks
private lazy val buf:mutable.Queue[InfoBlock] = new mutable.SynchronizedQueue[InfoBlock]
The callback to be used in the Enumerator
def getCallback: Future[Option[InfoBlock]] = Future{
if (!buf.isEmpty)
Some(buf.dequeue)
else
None}
Blocks are produced by another thread and added to the queue using:
buf += new InfoBlock(...)
Then in the controller I want to set up a websocket to stream that data, doing:
def stream = WebSocket.using[String]{ request =>
val in = Iteratee.consume[String]()
val enu:Enumerator[InfoBlock] = Enumerator.fromCallback1(
isFirst => extractor.getCallback
)
val out:Enumerator[String] = enu &> Enumeratee.map(blk => blk.author+" -> "+blk.msg)
(in,out)}
It works, but with a big problem: when a connection is opened it sends a bunch of blocks (about 50) and stops. If I open a new websocket I get another bunch of blocks, but no more. I tried setting a property on the JS WebSocket object, in particular:
websocket.binaryType = "arraybuffer"
because I thought using "blob" might be the cause, but I was wrong. The problem must be server side and I have no clue.
From the Enumerator ScalaDocs on Enumerator.fromCallback, describing the retriever function:
The input function. Returns a future eventually redeemed with Some value if there is input to pass, or a future eventually redeemed with None if the end of the stream has been reached.
This means that the enumerator will start by pulling everything off the queue. When the queue is empty, the callback will return None. The enumerator sees this as the end of the stream and sends a Done state downstream. It won't be looking for any more data.
Rather than using a mutable queue for message passing, try to push the Enumerator/Iteratee paradigm into your worker. Create an Enumerator that outputs instances of what you're creating, and have the iteratee pull from that instead. You can stick some enumeratees in the middle to do transforms if you need to.
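One possible way to do this is Concurrent.broadcast from the Play iteratee library, which returns an Enumerator together with a Channel to push into. This is a swapped-in technique and only a sketch. The InfoBlockFeed object and the InfoBlock stand-in are hypothetical, and the imported execution context is assumed to be available in the Play version in use:

```scala
import play.api.libs.iteratee.{Concurrent, Enumerator}
import play.api.libs.concurrent.Execution.Implicits.defaultContext

object InfoBlockFeed {
  // Stand-in for the asker's InfoBlock type
  case class InfoBlock(author: String, msg: String)

  // The enumerator stays open; elements arrive whenever something is pushed
  private val (blocks, channel) = Concurrent.broadcast[InfoBlock]

  // Called by the producing thread instead of `buf += ...`
  def publish(block: InfoBlock): Unit = channel.push(block)

  // What the websocket action would return as its `out` side
  val out: Enumerator[String] =
    blocks.map(blk => blk.author + " -> " + blk.msg)
}
```

Unlike the callback that returns None on an empty queue, this enumerator does not signal the end of the stream when there is momentarily nothing to send. A broadcast only reaches consumers that are attached when a block is pushed, so earlier blocks are not replayed to new websockets.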

Idiomatic way to continuously poll a HTTP server and dispatch to an actor

I need to write a client that continuously polls a web server for commands. A response from the server indicates that a command is available (in which case the response contains the command) or an instruction that no command is available, and you should fire off a new request for incoming commands.
I'm trying to figure out how to do it using spray-client and Akka, and I can think of ways to do it, but none of them look like they're the idiomatic way to get it done. So the question is:
what's the most sensible way to have a couple of threads poll the same web server for incoming commands and hand the commands off to an actor?
This example uses spray-client, scala futures, and Akka scheduler.
Implementation varies depending on desired behavior (execute many requests in parallel at the same time, execute in different intervals, send responses to one actor to process one response at a time, send responses to many actors to process in parallel... etc).
This particular example shows how to execute many requests in parallel at the same time, and then do something with each result as it completes, without waiting for any other requests that were fired off at the same time to complete.
The code below will execute two HTTP requests every 5 seconds to 0.0.0.0:9000/helloWorld and 0.0.0.0:9000/goodbyeWorld in parallel.
Tested in Scala 2.10, Spray 1.1-M7, and Akka 2.1.2:
Actual scheduling code that handles periodic job execution:
// Schedule a periodic task to occur every 5 seconds, starting as soon
// as this schedule is registered
system.scheduler.schedule(initialDelay = 0 seconds, interval = 5 seconds) {
val paths = Seq("helloWorld", "goodbyeWorld")
// perform an HTTP request to 0.0.0.0:9000/helloWorld and
// 0.0.0.0:9000/goodbyeWorld
// in parallel (possibly, depending on available cpu and cores)
val retrievedData = Future.traverse(paths) { path =>
val response = fetch(path)
printResponse(response)
response
}
}
Helper methods / boilerplate setup:
// Helper method to fetch the body of an HTTP endpoint as a string
def fetch(path: String): Future[String] = {
pipeline(HttpRequest(method = GET, uri = s"/$path"))
}
// Helper method for printing a future'd string asynchronously
def printResponse(response: Future[String]) {
// Alternatively, do response.onComplete {...}
for (res <- response) {
println(res)
}
}
// Spray client boilerplate
val ioBridge = IOExtension(system).ioBridge()
val httpClient = system.actorOf(Props(new HttpClient(ioBridge)))
// Register a "gateway" to a particular host for HTTP requests
// (0.0.0.0:9000 in this case)
val conduit = system.actorOf(
props = Props(new HttpConduit(httpClient, "0.0.0.0", 9000)),
name = "http-conduit"
)
// Create a simple pipeline to deserialize the request body into a string
val pipeline: HttpRequest => Future[String] = {
sendReceive(conduit) ~> unmarshal[String]
}
Some notes:
Future.traverse is used for running futures in parallel (they may complete in any order). Creating each future inside a for comprehension will execute one future at a time, waiting for each to complete.
// Executes `future1`, executes `future2` when `future1` is complete,
// then executes `future3` when `future2` completes. This assumes each
// right-hand side starts its future when evaluated (e.g. a method call);
// futures created beforehand are already running.
for {
oneThing <- future1
andThenAnother <- future2
andFinally <- future3 // `finally` is a reserved word in Scala
} yield (...)
system will need to be replaced with your actual Akka actor system.
system.scheduler.schedule in this case is executing an arbitrary block of code every 5 seconds -- there is also an overloaded version for scheduling messages to be sent to an actorRef.
system.scheduler.schedule(
initialDelay = 0 seconds,
interval = 30 minutes,
receiver = rssPoller, // an actorRef
message = "doit" // the message to send to the actorRef
)
For your particular case, printResponse can be replaced with an actor send instead: anActorRef ! response.
The code sample doesn't take into account failures -- a good place to handle failures would be in the printResponse (or equivalent) method, by using a Future onComplete callback: response.onComplete {...}
Perhaps obvious, but spray-client can be replaced with another http client, just replace the fetch method and accompanying spray code.
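The note on Future.traverse versus a for comprehension can be illustrated with the standard library alone. This is a sketch: fakeFetch is a hypothetical stand-in for the fetch helper above, and the 100 ms sleep simulates network latency:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TraverseSketch {
  // Hypothetical stand-in for `fetch`: completes after a simulated delay
  def fakeFetch(path: String): Future[String] = Future {
    Thread.sleep(100)
    s"body of /$path"
  }

  // All requests are started up front and run concurrently;
  // the resulting Seq keeps the order of `paths`.
  def parallel(paths: Seq[String]): Future[Seq[String]] =
    Future.traverse(paths)(fakeFetch)

  // The second request is only started once the first has completed.
  def sequential(a: String, b: String): Future[(String, String)] =
    for {
      first  <- fakeFetch(a)
      second <- fakeFetch(b)
    } yield (first, second)

  def main(args: Array[String]): Unit = {
    println(Await.result(parallel(Seq("helloWorld", "goodbyeWorld")), 5.seconds))
    println(Await.result(sequential("helloWorld", "goodbyeWorld"), 5.seconds))
  }
}
```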
Update: Full running code example is here:
git clone the repo, check out the specified commit SHA, run $ sbt run, navigate to 0.0.0.0:9000, and watch the console where sbt run was executed. It should print Hello World!\nGoodbye World! or Goodbye World!\nHello World! (the order is potentially random because of parallel Future.traverse execution).
You can use HTML5 Server-Sent Events. They are implemented in many Scala frameworks. For example, in xitrum the code looks like:
class SSE extends Controller {
def sse = GET("/sse") {
addConnectionClosedListener {
// The connection has been closed
// Unsubscribe from events, release resources etc.
}
future {
respondEventSource("command1")
//...
respondEventSource("command2")
//...
}
}
}
SSE is pretty simple and can be used by any client software, not only in a browser.
Akka is integrated in xitrum and we use it in a similar system. Since xitrum uses Netty as its async server, it is also good for processing thousands of requests with 10-15 threads.
This way your client keeps a connection to the server and reconnects when the connection is broken.