Akka streams - shutdown stream with grouping without losing data - scala

I have a source that groups elements and a sink that makes a batch request,
I'm using KillSwitch to be able to shutdown the graph at some arbitrary point in time. The problem that records of the latest incomplete batch that source outputs are getting lost when switch.shutdown() is being called
val source = Source.tick(10.millis, 10.millis, "tick").grouped(500)
val (switch, _) = source.viaMat(KillSwitches.single)(Keep.right)
.toMat(sink)(Keep.both).run()
Thread.sleep(3000) // wait some arbitrary time
switch.shutdown()
Is there a way to 'flush out' the incomplete batch when shutdown happens?

The behaviour of the kill switch shutdown is positional, as per its docs
After calling [[UniqueKillSwitch#shutdown()]] the running instance of
the [[Graph]] of [[FlowShape]] that materialized to the
[[UniqueKillSwitch]] will complete its downstream and cancel its
upstream (unless if finished or failed already in which case the
command is ignored).
See also more docs here.
Now the grouped stage will emit a partially filled group only at completion time, but not when cancelled.
This means that the graph below (grouped before killswitch) will behave like you observed
val switch =
Source.tick(10.millis, 175.millis, "tick")
.grouped(10)
.viaMat(KillSwitches.single)(Keep.right)
.toMat(Sink.foreach(println))(Keep.left)
.run()
whilst the graph below (grouped after killswitch) will emit partial groups downstream at completion
val switch =
Source.tick(10.millis, 175.millis, "tick")
.viaMat(KillSwitches.single)(Keep.right)
.grouped(10)
.toMat(Sink.foreach(println))(Keep.left)
.run()

Related

Close GRPC Netty channel in Scala Spark App

I'm trying to cleanly close a set of channels created during processing of a Spark map function. When I introduce the "shutdown/awaitTermination" methods after the main processing part (right before returning the "Result"), I get errors in other calls. As if the channel was shutdown prematurely (due to Spark scheduling of the actual tasks, I guess). Any recommendation? I have the following flow:
val someDF = initialDF.mapPartitions(iterator => {
val caller = createChannel(certificate, URI, port)
val innerDF = iterator.map(row => {
//do stuff with caller created above
result
}).toDF()
If I don't shutdown the channel, it runs fine (except I get some error messages in unit testing). But if I create a new channel during execution after the code as above, I end up with the following error:
ERROR io.grpc.internal.ManagedChannelOrphanWrapper - ~~~ Channel ManagedChannelImpl{logId=41, target=blablabla:443} was not shutdown properly!!! ~~~
How should I shutdown these channels? I'm not too bright with Spark...
Thanks!

With Akka Stream, how to dynamically duplicate a flow?

I'm running a live video streaming server. There's an Array[Byte] video source. Note that I can't get 2 connections to my video source. I want every client connecting to my server to receive this same stream, with a buffer discarding the old frames.
I tried using a BroadcastHub like this :
val source =
Source.fromIterator(() => myVideoStreamingSource.zipWithIndex)
val runnableGraph =
source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)
runnableGraph.run().to(Sink.foreach { index =>
println(s"client A reading frame #$index")
}).run()
runnableGraph.run().to(Sink.foreach { index =>
println(s"client B reading frame #$index")
}).run()
I get :
client A reading frame #0
client B reading frame #1
client A reading frame #2
client B reading frame #3
We see that the main stream is partitioned between the two clients, whereas I'd expect my two client being able to see all the source stream's frames.
Did I miss something, or is there any other solution ?
The issue is the combination of Iterator with BroadcastHub. I assume you myVideoStreamingSource is something like:
val myVideoStreamingSource = Iterator("A","B","C","D","E")
I'll now quote from BroadcastHub.Sink:
Every new materialization of the [[Sink]] results in a new, independent hub, which materializes to its own [[Source]] for consuming the [[Sink]] of that materialization.
The issue here for you, is that it does not yet consume the data from the iterator.
The thing with iterator, is that once you consumed its data, you won't get back to the beginning again. Add to that the fact that both graphs run in parallel, it looks like it "divides" the elements between the two. But actually that is completely random. For example, if you add a sleep of 1 second between the Client A and Client B, so the only client that will print will be A.
In order to get that work, you need to create a source that is reversible. For example, Seq, or List. The following will do:
val myVideoStreamingSource = Seq("A","B","C","D","E")
val source = Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)

TimeoutException when consuming files from S3 with akka streams

I'm trying to consume a bunch of files from S3 in a streaming manner using akka streams:
S3.listBucket("<bucket>", Some("<common_prefix>"))
.flatMapConcat { r => S3.download("<bucket>", r.key) }
.mapConcat(_.toList)
.flatMapConcat(_._1)
.via(Compression.gunzip())
.via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
.map(_.utf8String)
.runForeach { x => println(x) }
Without increasing akka.http.host-connection-pool.response-entity-subscription-timeout I get
java.util.concurrent.TimeoutException: Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. for the second file, just after printing the last line of the first file, when trying to access the first line of the second file.
I understand the nature of that exception. I don't understand why the request to the second file is already in progress, while the first file is still being processed. I guess, there's some buffering involved.
Any ideas how to get rid of that exception without having to increase akka.http.host-connection-pool.response-entity-subscription-timeout?
Instead of merging the processing of downloaded files into one stream with flatMapConcat you could try materializing the stream within the outer stream and fully process it there before emitting your output downstream. Then you shouldn't begin downloading (and fully processing) the next object until you're ready.
Generally you want to avoid having too many stream materializations to reduce overhead, but I suspect that would be negligible for an app performing network I/O like this.
Let me know if something like this works: (warning: untested)
S3.listBucket("<bucket>", Some("<common_prefix>"))
.mapAsync(1) { result =>
val contents = S3.download("<bucket>", r.key)
.via(Compression.gunzip())
.via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
.map(_.utf8String)
.to(Sink.seq)(Keep.right)
.run()
contents
}
.mapConcat(identity)
.runForeach { x => println(x) }

Offer to queue with some initial display

I want to offer to queue a string sent in load request after some initial delay say 10 seconds.
If the subsequent request is made with some short interval delay(1 second) then everything works fine, but if it is made continuously like from a script then there is no delay.
Here is the sample code.
def load(randomStr :String) = Action { implicit request =>
Source.single(randomStr)
.delay(10 seconds, DelayOverflowStrategy.backpressure)
.map(x =>{
println(x)
queue.offer(x)
})
.runWith(Sink.ignore)
Ok("")
}
I am not entirely sure that this is the correct way of doing what you want. There are some things you need to reconsider:
a delayed source has an initial buffer capacity of 16 elements. You can increase this with addAttributes(initialBuffer)
In your case the buffer cannot actually become full because every time you provide just one element.
Who is the caller of the Action? You are defining a DelayOverflowStrategy.backpressure strategy but is the caller able to handle this?
On every call of the action you are creating a Stream consisting of one element, how is the backpressure here helping? It is applied on the stream processing and not on the offering to the queue

Proper way to stop Akka Streams on condition

I have been successfully using FileIO to stream the contents of a file, compute some transformations for each line and aggregate/reduce the results.
Now I have a pretty specific use case, where I would like to stop the stream when a condition is reached, so that it is not necessary to read the whole file but the process finishes as soon as possible. What is the recommended way to achieve this?
If the stop condition is "on the outside of the stream"
There is a advanced building-block called KillSwitch that you could use to do this: http://doc.akka.io/japi/akka/2.4.7/akka/stream/KillSwitches.html The stream would get shut down once the kill switch is notified.
It has methods like abort(reason) / shutdown etc, see here for it's API: http://doc.akka.io/japi/akka/2.4.7/akka/stream/SharedKillSwitch.html
Reference documentation is here: http://doc.akka.io/docs/akka/2.4.8/scala/stream/stream-dynamic.html#kill-switch-scala
Example usage would be:
val countingSrc = Source(Stream.from(1)).delay(1.second,
DelayOverflowStrategy.backpressure)
val lastSnk = Sink.last[Int]
val (killSwitch, last) = countingSrc
.viaMat(KillSwitches.single)(Keep.right)
.toMat(lastSnk)(Keep.both)
.run()
doSomethingElse()
killSwitch.shutdown()
Await.result(last, 1.second) shouldBe 2
If the stop condition is inside the stream
You can use takeWhile to express any condition really, though sometimes take or limit may be also enough "take 10 lnes".
If your logic is very advanced, you could build a special stage that handles that special logic using statefulMapConcat that allows to express literally anything - so you could complete the stream whenever you want to "from the inside".