TimeoutException when consuming files from S3 with akka streams - scala

I'm trying to consume a bunch of files from S3 in a streaming manner using akka streams:
S3.listBucket("<bucket>", Some("<common_prefix>"))
  .flatMapConcat { r => S3.download("<bucket>", r.key) }
  .mapConcat(_.toList)
  .flatMapConcat(_._1)
  .via(Compression.gunzip())
  .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
  .map(_.utf8String)
  .runForeach { x => println(x) }
Without increasing akka.http.host-connection-pool.response-entity-subscription-timeout I get
java.util.concurrent.TimeoutException: Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. for the second file, just after printing the last line of the first file, when trying to access the first line of the second file.
I understand the nature of that exception. I don't understand why the request for the second file is already in progress while the first file is still being processed. I guess there's some buffering involved.
Any ideas how to get rid of that exception without having to increase akka.http.host-connection-pool.response-entity-subscription-timeout?

Instead of merging the processing of downloaded files into one stream with flatMapConcat you could try materializing the stream within the outer stream and fully process it there before emitting your output downstream. Then you shouldn't begin downloading (and fully processing) the next object until you're ready.
Generally you want to avoid having too many stream materializations to reduce overhead, but I suspect that would be negligible for an app performing network I/O like this.
Let me know if something like this works: (warning: untested)
S3.listBucket("<bucket>", Some("<common_prefix>"))
  .mapAsync(1) { result =>
    S3.download("<bucket>", result.key)
      .mapConcat(_.toList)
      .flatMapConcat(_._1)
      .via(Compression.gunzip())
      .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
      .map(_.utf8String)
      .runWith(Sink.seq)
  }
  .mapConcat(identity)
  .runForeach { x => println(x) }
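One design note on the sketch above: Sink.seq keeps every line of a file in memory before emitting it downstream. If that becomes a concern, the per-file work can be done entirely inside the inner stream and only a summary emitted downstream. A minimal, untested variant of the same idea, reusing the placeholder bucket/prefix names from the question:

S3.listBucket("<bucket>", Some("<common_prefix>"))
  .mapAsync(1) { result =>
    S3.download("<bucket>", result.key)
      .mapConcat(_.toList)
      .flatMapConcat(_._1)
      .via(Compression.gunzip())
      .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
      .map(_.utf8String)
      // do the per-file processing here and emit only a line count downstream
      .runFold(0L) { (count, line) => println(line); count + 1 }
  }
  .runForeach(count => println(s"processed $count lines"))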

Related

With Akka Stream, how to dynamically duplicate a flow?

I'm running a live video streaming server. There's an Array[Byte] video source. Note that I can't get 2 connections to my video source. I want every client connecting to my server to receive this same stream, with a buffer discarding the old frames.
I tried using a BroadcastHub like this:
val source =
  Source.fromIterator(() => myVideoStreamingSource.zipWithIndex)

val runnableGraph =
  source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)

runnableGraph.run().to(Sink.foreach { index =>
  println(s"client A reading frame #$index")
}).run()

runnableGraph.run().to(Sink.foreach { index =>
  println(s"client B reading frame #$index")
}).run()
I get:
client A reading frame #0
client B reading frame #1
client A reading frame #2
client B reading frame #3
We see that the main stream is partitioned between the two clients, whereas I'd expect both clients to be able to see all of the source stream's frames.
Did I miss something, or is there another solution?
The issue is the combination of Iterator with BroadcastHub. I assume your myVideoStreamingSource is something like:
val myVideoStreamingSource = Iterator("A","B","C","D","E")
I'll now quote from BroadcastHub.Sink:
Every new materialization of the [[Sink]] results in a new, independent hub, which materializes to its own [[Source]] for consuming the [[Sink]] of that materialization.
The issue for you here is that the data is not consumed from the iterator up front.
The thing with an iterator is that once you have consumed its data, you can't get back to the beginning again. Add to that the fact that both graphs run in parallel, and it looks like the elements are "divided" between the two, but that split is actually completely arbitrary. For example, if you add a sleep of one second between client A and client B, only client A will print anything.
In order to get this to work, you need to create a source that can be iterated from the beginning on every materialization, for example from a Seq or a List. The following will do:
val myVideoStreamingSource = Seq("A","B","C","D","E")
val source = Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)
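For completeness, a minimal, untested sketch of the question's two clients on top of that fixed source. The elements are (frame, index) tuples here, so the sinks pattern-match on the index; everything else is taken from the question as-is:

val myVideoStreamingSource = Seq("A", "B", "C", "D", "E")

val source =
  Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)

val runnableGraph =
  source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)

// Each run() builds a new, independent hub, and with a Seq each
// materialization starts from the first frame again, so both clients
// now see every element.
runnableGraph.run().to(Sink.foreach[(String, Int)] { case (_, index) =>
  println(s"client A reading frame #$index")
}).run()

runnableGraph.run().to(Sink.foreach[(String, Int)] { case (_, index) =>
  println(s"client B reading frame #$index")
}).run()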

Akka Sink never closes

I am uploading a single file to an SFTP server using Alpakka, but once the file is uploaded and I have gotten the success response, the Sink stays open. How do I drain it?
I started off with this:
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))

source
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(sink)(Keep.both).run()
  .map(_.wasSuccessful)
But that ends up never leaving the map step.
I tried to add a kill switch, but that seems to have had no effect (neither with shutdown nor abort):
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))

val (killswitch, result) = source
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(sink)(Keep.both).run()

result.map {
  killswitch.shutdown()
  _.wasSuccessful
}
Am I doing something fundamentally wrong? I just want one result.
EDIT: The settings sent in to toPath:
SftpSettings(InetAddress.getByName(host))
.withCredentials(FtpCredentials.create(amexUsername, amexPassword))
.withStrictHostKeyChecking(false)
By asking you to put Await.result(result, Duration.Inf) at the end, I wanted to check the theory expressed by A. Gregoris. So it's either:
your app exits before the Future completes, or
(if your app doesn't exit) the function in which you do this discards the result.
If your app doesn't exit, you can try using result.onComplete to do the necessary work.
I cannot see your whole code, but it seems to me that in the snippet you posted the result value is a Future that does not complete before your program execution ends, which is why the code in the map is not being executed either.
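A minimal sketch (untested) of that check, reusing the question's path, settings and data, and assuming an implicit ActorSystem/Materializer and ExecutionContext are in scope:

import scala.concurrent.Await
import scala.concurrent.duration.Duration

val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))

val (killswitch, result) = source
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(sink)(Keep.both)
  .run()

// Block until the upload's Future completes (fine for a quick check or a
// throwaway main), so the app cannot exit before the SFTP write finishes.
val ioResult = Await.result(result, Duration.Inf)
println(s"upload successful: ${ioResult.wasSuccessful}")

// Alternatively, if the app keeps running anyway, register a callback instead:
// result.onComplete(r => println(s"upload finished: $r"))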

Offer to queue with some initial delay

I want to offer a string sent in a load request to a queue after some initial delay, say 10 seconds.
If a subsequent request is made after a short delay (1 second) then everything works fine, but if requests are made continuously, e.g. from a script, then there is no delay.
Here is the sample code.
def load(randomStr: String) = Action { implicit request =>
  Source.single(randomStr)
    .delay(10.seconds, DelayOverflowStrategy.backpressure)
    .map { x =>
      println(x)
      queue.offer(x)
    }
    .runWith(Sink.ignore)
  Ok("")
}
I am not entirely sure that this is the correct way of doing what you want. There are some things you need to reconsider:
A delayed source has an initial buffer capacity of 16 elements. You can increase this via attributes, e.g. addAttributes(Attributes.inputBuffer(...)) (see the sketch after this list).
In your case the buffer cannot actually become full, because each call provides just one element.
Who is the caller of the Action? You are defining a DelayOverflowStrategy.backpressure strategy, but is the caller able to handle this?
On every call of the action you are creating a stream consisting of one element, so how is the backpressure helping here? It is applied to the stream processing, not to the offering to the queue.
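To illustrate the buffer point above, a minimal sketch (untested). randomStr and queue are taken from the question; queue is assumed to be a SourceQueue whose offer returns a Future, and the inputBuffer values are arbitrary:

import scala.concurrent.duration._
import akka.stream.{ Attributes, DelayOverflowStrategy }
import akka.stream.scaladsl.{ Sink, Source }

Source.single(randomStr)
  .delay(10.seconds, DelayOverflowStrategy.backpressure)
  // The delay stage buffers at most its input-buffer size (16 by default);
  // raising it only matters once more than 16 elements can be in flight.
  .addAttributes(Attributes.inputBuffer(initial = 64, max = 64))
  .mapAsync(1)(x => queue.offer(x))
  .runWith(Sink.ignore)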

Akka streams - shutdown stream with grouping without losing data

I have a source that groups elements and a sink that makes a batch request.
I'm using a KillSwitch to be able to shut down the graph at some arbitrary point in time. The problem is that the records of the latest incomplete batch emitted by the source are lost when switch.shutdown() is called.
val source = Source.tick(10.millis, 10.millis, "tick").grouped(500)
val (switch, _) = source
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(sink)(Keep.both).run()
Thread.sleep(3000) // wait some arbitrary time
switch.shutdown()
Is there a way to 'flush out' the incomplete batch when shutdown happens?
The behaviour of the kill switch shutdown is positional, as per its docs
After calling [[UniqueKillSwitch#shutdown()]] the running instance of
the [[Graph]] of [[FlowShape]] that materialized to the
[[UniqueKillSwitch]] will complete its downstream and cancel its
upstream (unless if finished or failed already in which case the
command is ignored).
See also more docs here.
Now the grouped stage will emit a partially filled group only at completion time, but not when cancelled.
This means that the graph below (grouped before killswitch) will behave like you observed
val switch =
  Source.tick(10.millis, 175.millis, "tick")
    .grouped(10)
    .viaMat(KillSwitches.single)(Keep.right)
    .toMat(Sink.foreach(println))(Keep.left)
    .run()
whilst the graph below (grouped after killswitch) will emit partial groups downstream at completion
val switch =
  Source.tick(10.millis, 175.millis, "tick")
    .viaMat(KillSwitches.single)(Keep.right)
    .grouped(10)
    .toMat(Sink.foreach(println))(Keep.left)
    .run()
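Applied to the question's own graph, that means moving grouped downstream of the kill switch, roughly like this (untested; sink is the question's batching sink, assumed to accept the grouped batches):

val (switch, _) =
  Source.tick(10.millis, 10.millis, "tick")
    .viaMat(KillSwitches.single)(Keep.right)
    .grouped(500) // now downstream of the switch, so the partial batch
                  // is emitted when shutdown() completes the stream
    .toMat(sink)(Keep.both).run()

Thread.sleep(3000) // wait some arbitrary time
switch.shutdown()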

very fast upstream followed by very slow downstream and out of memory issues

I've been taking my experiences gathered from fiddling with akka streams to a practical level and I'm in trouble.
Below is my attempt at writing a streaming pipeline that reads specific rows from a very large database table, then for each row sends an HTTP call to an external web service, and then inserts the received responses into another table:
(the database library is Slick and the HTTP client is Play Framework's)
val query = (catalogIndices joinLeft catalogSummeries on ((i, s) => i.productId === s.icecatId) filter (t => t._2.isEmpty) map (_._1)).result
val source = Source.fromPublisher(db.stream(query))
source
  .buffer(5, OverflowStrategy.backpressure)
  .mapAsync(5) { i =>
    icecatService.fetchCatalog(i.productId, Languages.en, IcecatService.Contents.all)
  }
  .map { r =>
    for {
      g   <- (r.json \ "data" \ "GeneralInfo").validate[GeneralInfo]
      img <- (r.json \ "data" \ "Image").validate[Image]
    } yield (g, img)
  }
  .collect {
    case JsSuccess((g, img), _) =>
      CatalogSummeryData(g.IcecatId, g.Title, g.Brand, g.ProductName, g.BrandPartCode,
        g.GTIN, g.Category.CategoryID, g.Category.Name.Value, img.LowPic, img.ThumbPic)
  }
  .grouped(100)
  .runForeach(s => db.run(catalogSummeries ++= s))
I can tell that the DB as the upstream is very fast and those web service calls are really slow. Also, if I increase the parallelism of the mapAsync call much further, all my futures time out.
My problem started when I migrated the entire platform of the project from a Vagrant virtual machine to my Windows 10 machine. Now everything in my code runs much, much faster, but this piece of streaming always results in GC errors, out-of-memory errors and the like. Back in the virtual machine I once left this code running and successfully streamed 4000 results. How should I tweak this code to get it to perform?
Edit: I must also mention that on my Windows machine absolutely no rows are inserted back before it all shuts down in failure.
EDIT: I receive many types of errors but here's a common one:
[ERROR] [SECURITY][09/27/2017 07:23:33.046] [application-scheduler-1]
[akka.actor.ActorSystemImpl(application)] Uncaught error from thread
[application-scheduler-1]: GC overhead limit exceeded, shutting down
ActorSystem[application]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at akka.dispatch.AbstractNodeQueue.<init>(AbstractNodeQueue.java:32)
at akka.actor.LightArrayRevolverScheduler$TaskQueue.<init>(LightArrayRevolverScheduler.scala:304)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:269)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
at java.lang.Thread.run(Thread.java:748)