Stopping execution of an Enumerator on client cancellation of chunked response - scala

In the simple case of proxying a download from S3 to the client, I'm having trouble dealing with client disconnections mid-download.
```
val enumerator = Enumerator.outputStream { out =>
  val s3Object = s3Client.getObject(new GetObjectRequest(bucket, path))
  val in = new BufferedInputStream(s3Object.getObjectContent())
  val bufferedOut = new BufferedOutputStream(out)
  Iterator.continually(in.read).takeWhile(_ != -1).foreach(bufferedOut.write)
  in.close()
  bufferedOut.close()
}

Ok.chunked(enumerator.andThen(Enumerator.eof)).withHeaders(
  "Content-Disposition" -> s"attachment; filename=${name}"
)
```
This works (mostly) beautifully as long as the client doesn't cancel the download before it completes. Otherwise, on cancellation the enumerator keeps filling up with data until the download from S3 is complete. Several cancelled downloads can hog quite a bit of resources.
Are there any better patterns that can prevent this from happening?

Move closing the InputStream into an onDoneEnumerating block, and use Enumerator.fromStream instead of Enumerator.outputStream:
```
val s3Object = s3Client.getObject(new GetObjectRequest(bucket, path))
val in = s3Object.getObjectContent()

val enumerator = Enumerator.fromStream(in).onDoneEnumerating {
  in.close()
}
```
fromStream reads a chunk (8 KiB by default) at a time, so a BufferedInputStream is unnecessary.
Also, with outputStream, there is no way for the Iteratee to push back if it is slow to consume the data, which could result in a large amount of data (up to the size of the S3 object) buffering in memory. This is dangerous, because the application could run out of memory. With fromStream, the next read won't occur until the Iteratee is ready for more data.
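For completeness, here is a minimal sketch of the whole action wired this way; it assumes the s3Client, bucket, path and name values from the question, plus a small dedicated thread pool for the blocking S3 reads (the pool size is an arbitrary choice):
```
import java.util.concurrent.Executors

import com.amazonaws.services.s3.model.GetObjectRequest
import play.api.libs.iteratee.Enumerator
import play.api.mvc._

import scala.concurrent.ExecutionContext

// Assumption: blocking S3 reads get their own pool instead of Play's default context.
implicit val blockingEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

def download(bucket: String, path: String, name: String) = Action {
  val s3Object = s3Client.getObject(new GetObjectRequest(bucket, path))
  val in = s3Object.getObjectContent()

  val enumerator = Enumerator.fromStream(in).onDoneEnumerating {
    in.close() // runs whether the client finished or disconnected mid-download
  }

  Ok.chunked(enumerator.andThen(Enumerator.eof)).withHeaders(
    "Content-Disposition" -> s"attachment; filename=${name}"
  )
}
```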

Related

Alpakka S3 connector stream won't handle the load, throwing akka.stream.BufferOverflowException

I have an akka-http service and I am trying out the Alpakka S3 connector for uploading files. Previously I was using a temporary file and then uploading with the Amazon SDK. That approach required some adjustments to make the Amazon SDK more Scala-like, but it could handle even 1000 requests at once. Throughput wasn't amazing, but all of the requests went through eventually. Here is the code before the changes, with no Alpakka:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
storeUploadedFile("csv", tempDestination) {
case (metadata, file) =>
val uploadFuture = upload(file, file.toPath.getFileName.toString)
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
case class S3UploaderException(msg: String) extends Exception(msg)
def upload(file: File, key: String): Future[String] = {
val s3Client = AmazonS3ClientBuilder.standard()
.withCredentials(new DefaultAWSCredentialsProviderChain())
.withRegion(Regions.EU_WEST_3)
.build()
val promise = Promise[String]()
val listener = new ProgressListener() {
override def progressChanged(progressEvent: ProgressEvent): Unit = {
(progressEvent.getEventType: #unchecked) match {
case ProgressEventType.TRANSFER_FAILED_EVENT => promise.failure(S3UploaderException(s"Uploading a file with a key: $key"))
case ProgressEventType.TRANSFER_COMPLETED_EVENT |
ProgressEventType.TRANSFER_CANCELED_EVENT => promise.success(key)
}
}
}
val request = new PutObjectRequest("S3_BUCKET", key, file)
request.setGeneralProgressListener(listener)
s3Client.putObject(request)
promise.future
}
```
When I changed this to use the Alpakka connector, the code looks much nicer, as we can just connect the ByteSource and the Alpakka Sink together. However, this approach cannot handle such a big load. When I execute 1000 requests at once (10 kB files), fewer than 10% go through and the rest fail with this exception:
```
akka.stream.alpakka.s3.impl.FailedUpload: Exceeded configured max-open-requests value of [32].
This means that the request queue of this pool
(HostConnectionPoolSetup(bargain-test.s3-eu-west-3.amazonaws.com,443,ConnectionPoolSetup(
  ConnectionPoolSettings(4,0,5,32,1,30 seconds,ClientConnectionSettings(
    Some(User-Agent: akka-http/10.1.3),10 seconds,1 minute,512,None,
    WebSocketSettings(,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$4787/1279590204@4d809f4c),
    List(),ParserSettings(2048,16,64,64,8192,64,8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,
      Map(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, Content-MD5 -> 0,
          Date -> 0, If-Match -> 0, If-None-Match -> 0, User-Agent -> 32),
      false,true,akka.util.ConstantFun$$$Lambda$4534/1539966798@69c23cd4,
      akka.util.ConstantFun$$$Lambda$4534/1539966798@69c23cd4,
      akka.util.ConstantFun$$$Lambda$4535/297570074@6b426c59),
    None,TCPTransport),New,1 second),
  akka.http.scaladsl.HttpsConnectionContext@7e0f3726,akka.event.MarkerLoggingAdapter@74f3a78b)))
has completely filled up because the pool currently does not process requests fast enough
to handle the incoming request load. Please retry the request later. See
http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html
for more information.
```
Here is what the summary of a Gatling test looks like:
```
---- Response Time Distribution ----------------------------------------
t < 800 ms                               0 (  0%)
800 ms < t < 1200 ms                     0 (  0%)
t > 1200 ms                             90 (  9%)
failed                                 910 ( 91%)
```
When I execute 100 simultaneous requests, half of them fail. So that is still far from satisfying.
This is the new code:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
extractRequestContext { ctx =>
implicit val materializer = ctx.materializer
extractActorSystem { actorSystem =>
fileUpload("csv") {
case (metadata, byteSource) =>
val uploadFuture = byteSource.runWith(S3Uploader.sink("s3FileKey")(actorSystem, materializer))
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
}
def sink(s3Key: String)(implicit as: ActorSystem, m: Materializer) = {
val regionProvider = new AwsRegionProvider {
def getRegion: String = Regions.EU_WEST_3.getName
}
val settings = new S3Settings(MemoryBufferType, None, new DefaultAWSCredentialsProviderChain(), regionProvider, false, None, ListBucketVersion2)
val s3Client = new S3Client(settings)(as, m)
s3Client.multipartUpload("S3_BUCKET", s3Key)
}
```
The complete code with both endpoints can be seen here.
I have a couple of questions:
1) Is this a feature? Is this what we can call backpressure?
2) If I wanted this code to behave like the old approach with a temporary file (no failed requests and all of them finishing at some point), what would I have to do? I tried to implement a queue for the stream (link to the source below), but it made no difference. The code can be seen here.
(* DISCLAIMER * I am still a Scala newbie trying to quickly understand Akka Streams and find some workaround for the issue. There is a big chance that something simple is wrong in this code. * DISCLAIMER *)
It’s a backpressure feature.
Exceeded configured max-open-requests value of [32]: in the config, max-open-requests is set to 32 by default. Streaming is meant for working with large amounts of data, not for handling many requests per second.
The Akka developers had to pick something for max-open-requests. They chose 32 for some reason, and they had no idea what it would be used for. Will it be sending 1000 32 KB files or 1000 1 GB files at once? They don't know. But they still want to make sure that by default (and probably 80% of people use the defaults) the apps are handled gracefully and safely. So they had to limit the processing power.
You asked it to do 1000 "now", but I am pretty sure the AWS SDK did not send 1000 files simultaneously either; it used some queue, which may be a good approach for you too if you have many small files to upload.
But it is perfectly fine to tune it to your case!
If you know that your machine and the target can take care of more simultaneous connections, you can change the number to a higher value.
Also, for a lot of HTTP calls, use a cached host connection pool.
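If raising max-open-requests (akka.http.host-connection-pool.max-open-requests in application.conf) is not enough and you want the old behaviour back where every upload eventually succeeds, one option is the queue mentioned above: a single bounded queue with limited parallelism in front of the uploads. A minimal sketch under assumptions, with doUpload standing in for byteSource.runWith(S3Uploader.sink(...)) and the buffer size and parallelism picked arbitrarily:
```
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{ActorMaterializer, OverflowStrategy, QueueOfferResult}

import scala.concurrent.{Future, Promise}

object ThrottledUploads {
  implicit val system = ActorSystem("uploads")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // Stand-in for the real upload, e.g. byteSource.runWith(S3Uploader.sink(key)).
  def doUpload(key: String): Future[String] = ???

  // One shared queue: at most 16 uploads hit the S3 connection pool at a time;
  // everything else waits instead of overflowing max-open-requests.
  private val queue =
    Source.queue[(String, Promise[String])](bufferSize = 1000, OverflowStrategy.backpressure)
      .mapAsync(parallelism = 16) { case (key, promise) =>
        doUpload(key)
          .map(promise.success)
          .recover { case e => promise.failure(e) } // keep the stream alive on failed uploads
      }
      .toMat(Sink.ignore)(Keep.left)
      .run()

  def upload(key: String): Future[String] = {
    val promise = Promise[String]()
    queue.offer(key -> promise).flatMap {
      case QueueOfferResult.Enqueued => promise.future
      case other                     => Future.failed(new RuntimeException(s"Could not enqueue upload: $other"))
    }
  }
}
```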

Why don't Scala Futures run faster even with more threads in the thread pool?

I have the following algorithm in Scala:
1) Do an initial call to the db to initialize the cursor
2) Get 1000 entities from the db (returns a Future)
3) For every entity, make one additional request to the database and get the modified entity (returns a Future)
4) Transform the original entity
5) Put the transformed entity into the callback of the Future from step 3
6) Wait for all Futures
In Scala it is something like this:
```
val client = ...
val size = 1000

val init: Future = client.firstSearch(size) // request over network
val initResult = Await.result(init, 30.seconds)
var cursorId: String = initResult.getCursorId

while (!cursorId.isEmpty) {
  client.grabWithSize(cursorId).map { response =>
    val futures: Seq[Future] = response.getAllResults.map { result =>
      val grabbedOne: Future[Entity] = client.grabOneEntity(result.id) // request over network
      val resultMap: Map[String, Any] = buildMap(result)
      val transformed: Map[String, Any] = transform(resultMap) // no future here
      grabbedOne.map { grabbedOne =>
        buildMap(grabbedOne) == transformed
      }
    }
    Future.sequence(futures).map(_ => response.getNewCursorId)
  }
}

def buildMap(...): Map[String, Any] // sync call
```
I noticed that if I increase size, say, two times, every iteration of the while loop gets about 1.5x slower. But I do not see my CPU being loaded more; it stays near zero, yet the time still increases by about 1.5x. Why? I have set up:
```
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1024))
```
I think that not all the Futures are executed in parallel. But why? And how do I fix it?
I see that in your code the Futures don't block each other. It's more likely the database that is the bottleneck.
Is it possible to do a SQL join to get O(1) rather than O(n) database calls? (If you're using Slick, have a look at the joins section of the queries documentation.)
If the load is low, it's probably the connection pool that is maxed out; you'd need to increase it for both the database and the network.
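To illustrate the join suggestion with a hypothetical Slick schema (the table and column names below are made up, not taken from the question), one joined query replaces the N grabOneEntity round trips:
```
import slick.jdbc.PostgresProfile.api._

// Hypothetical tables; adjust to your real schema.
class Entities(tag: Tag) extends Table[(Long, String)](tag, "entities") {
  def id      = column[Long]("id", O.PrimaryKey)
  def payload = column[String]("payload")
  def *       = (id, payload)
}

class ModifiedEntities(tag: Tag) extends Table[(Long, String)](tag, "modified_entities") {
  def entityId = column[Long]("entity_id")
  def payload  = column[String]("payload")
  def *        = (entityId, payload)
}

val entities = TableQuery[Entities]
val modified = TableQuery[ModifiedEntities]

// One round trip for a whole page of entities instead of one call per entity.
val joinedQuery = (entities join modified on (_.id === _.entityId)).result
// db.run(joinedQuery) returns Future[Seq[((Long, String), (Long, String))]]
```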

Are Iteratees safe for managing resources?

Suppose I were reading from an InputStream.
How I would normally do it:
```
val inputStream = ...

try {
  doStuff(inputStream)
} finally {
  inputStream.close()
}
```
Whether or not doStuff throws an exception, we will close the InputStream.
How I would do it with iteratees:
```
val inputStream = ...

Enumerator.fromStream(inputStream)(Iteratee.foreach(doStuff))
```
Will the InputStream be closed (even if doStuff throws an exception)?
A little test:
```
val inputStream = new InputStream() { // returns 9, 8, ..., 0, then -1
  private var i = 10

  def read() = {
    i = math.max(0, i) - 1
    i
  }

  override def close() = println("closed") // looking for this
}

Enumerator.fromStream(inputStream)(Iteratee.foreach(a => 1 / 0)).onComplete(println)
```
We only see:
Failure(java.lang.ArithmeticException: / by zero)
The stream was never closed. Replace 1 / 0 with 1 / 1 and you'll see that the stream closes.
Of course, I could maintain a reference to the original stream and close it in case of failure, but AFAIK the idea of using iteratees is creating composable iteration without having to do that.
Is this expected behavior?
Is there a way to use iteratees so the resources are always disposed correctly?
Iteratees were designed specifically for safe resource management. See the first sentence of Iteratee IO: safe, practical, declarative input processing:
Iteratee IO is a style of incremental input processing with precise resource control.
The idea is that when your resource is only accessed through iteratees then the resource-owning code can tell exactly when the iteratee is finished with the resource and close it immediately. On the other hand, when iteration is managed manually (as with a traditional InputStream) the user of the resource is responsible for closing it. This can lead to leaks.
That said, there was a bug in Play 2.1 where fromStream didn't manage the closing of its underlying InputStream! This bug was fixed in Play 2.2.
You can see the fromStream code to see how the Enumerator was fixed by using onDoneEnumerating to close the resource when the iteratee is finished.
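If you are stuck on Play 2.1, a minimal workaround sketch is to attach the close yourself with onDoneEnumerating, the same way the fixed fromStream does internally:
```
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import play.api.libs.iteratee.{Enumerator, Iteratee}

val inputStream = ...

// Close the stream when enumeration finishes, whether the Iteratee
// completed normally or failed.
val enumerator = Enumerator.fromStream(inputStream).onDoneEnumerating {
  inputStream.close()
}

enumerator(Iteratee.foreach(doStuff))
```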

How to get InputStream from request in Play

I think this used to be possible in Play 1.x, but I can't find out how to do it in Play 2.x.
I know that Play is asynchronous and uses Iteratees. However, there is generally much better support for InputStreams.
(In this case, I will be using a streaming JSON parser like Jackson to process the request body.)
How can I get an InputStream from a chunked request body?
I was able to get this working with the following code:
```
// I think all of the parens and braces line up -- copied/pasted from code

val pos = new PipedOutputStream()
val pis = new PipedInputStream(pos)
val result = Promise[Either[Errors, String]]()

def outputStreamBodyParser = {
  BodyParser("outputStream") { requestHeader =>
    val length = requestHeader.headers(HeaderNames.CONTENT_LENGTH).toLong

    Future {
      result.completeWith(saveFile(pis, length)) // save returns Future[Either[Errors, String]]
    }

    Iteratee.fold[Array[Byte], OutputStream](pos) { (os, data) =>
      os.write(data)
      os
    }.map { os =>
      os.close()
      Right(os)
    }
  }
}

Action.async(parse.when(
  requestHeaders => {
    val maybeContentLength = requestHeaders.headers.get(HeaderNames.CONTENT_LENGTH)
    maybeContentLength.isDefined && allCatch.opt(maybeContentLength.get.toLong).isDefined
  },
  outputStreamBodyParser,
  requestHeaders => Future.successful(BadRequest("Missing content-length header")))) { request =>
  result.future.map {
    case Right(fileRef) => Ok(fileRef)
    case Left(errors) => BadRequest(errors)
  }
}
```
Play 2 is meant to be fully asynchronous, so this isn't easily possible or desirable. The problem with InputStream is that there is no push back: there is no way for the reader of the InputStream to communicate to the input that it wants more data without blocking on read. Technically it is possible to write an Iteratee that could read data and put it into an InputStream, and that would wait for a call to read on the InputStream before asking the Enumerator for more data, but it would be dangerous. You would have to make sure that the InputStream was closed properly, or the Enumerator would sit waiting forever (or until it times out). And the call to read must be made from a thread that is not running on the same ExecutionContext as the Enumerator and Iteratee, or the application could deadlock.
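Building on the PipedInputStream approach in the answer above, a minimal sketch of keeping the blocking read on its own pool (the names pis, saveFile and result come from that answer; the pool size is an assumption):
```
import java.util.concurrent.Executors

import scala.concurrent.{ExecutionContext, Future}

// A dedicated pool for the blocking InputStream reads, separate from the
// ExecutionContext that drives the Iteratee/Enumerator.
val blockingIO: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

// saveFile(pis, length) blocks on pis.read, so run it on blockingIO rather
// than on Play's default context:
Future {
  result.completeWith(saveFile(pis, length))
}(blockingIO)
```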

Play Framework WebSocket Async

I'm using a WebSocket endpoint exposed by my Play Framework controller. My client, however, will send a large byte array, and I'm a bit confused about how to handle this in my Iteratee. Here is what I have:
```
def myWSEndPoint(f: String => String) = WebSocket.async[Array[Byte]] { request =>
  Akka.future {
    val (out, chan) = Concurrent.broadcast[Array[Byte]]
    val in: Iteratee[Array[Byte], Unit] = Iteratee.foreach[Array[Byte]] {
      // How do I get the entire file?
    }
    (null, null)
  }
}
```
As can be seen in the code above, I'm stuck on how to handle the byte array as one request and send the response back as a String. My confusion is about the Iteratee.foreach call. Is this foreach over the byte array, or over the entire content of the request that I send as a byte array from my client? It is confusing!
Any suggestions?
Well... It depends. Is your client sending all the binary data at once, or is it (explicitly) sending it chunk by chunk?
-> If it's all at once, then everything will be in the first chunk (in that case, why a WebSocket? Why an Iteratee? An Action with a BodyParser would probably be more efficient for that).
-> If it's chunk by chunk, you have to keep every chunk you receive and concatenate them on close (on close, unless you have another way for the client to say: "Hey, I'm done!").
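For the chunk-by-chunk case, here is a minimal sketch on the Play 2.x iteratee API (decoding the payload to a String for f is an assumption, not from the answer): fold all incoming frames into one array and only act once the client closes or otherwise signals it is done.
```
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import play.api.libs.iteratee.{Concurrent, Iteratee}
import play.api.mvc.WebSocket

import scala.concurrent.Future

def myWSEndPoint(f: String => String) = WebSocket.async[Array[Byte]] { request =>
  Future {
    val (out, chan) = Concurrent.broadcast[Array[Byte]]

    // Fold every incoming frame into one buffer; the folded result is only
    // available once the client closes the socket (or otherwise signals
    // that it is done).
    val in = Iteratee
      .fold[Array[Byte], Array[Byte]](Array.empty[Byte])((acc, chunk) => acc ++ chunk)
      .map { whole =>
        val response = f(new String(whole, "UTF-8")) // process the complete payload
        // Note: if the client has already closed the connection, there is
        // nobody left to receive chan.push(response.getBytes("UTF-8")); to
        // reply over the same socket, the client must send a "done" marker
        // instead of closing.
        chan.eofAndEnd()
      }

    (in, out)
  }
}
```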