Proper way to stop Akka Streams on condition - scala

I have been successfully using FileIO to stream the contents of a file, compute some transformations for each line and aggregate/reduce the results.
Now I have a pretty specific use case, where I would like to stop the stream when a condition is reached, so that it is not necessary to read the whole file but the process finishes as soon as possible. What is the recommended way to achieve this?

If the stop condition is "on the outside of the stream"
There is an advanced building block called KillSwitch that you could use to do this: http://doc.akka.io/japi/akka/2.4.7/akka/stream/KillSwitches.html The stream would get shut down once the kill switch is triggered.
It has methods like abort(reason) / shutdown() etc.; see here for its API: http://doc.akka.io/japi/akka/2.4.7/akka/stream/SharedKillSwitch.html
Reference documentation is here: http://doc.akka.io/docs/akka/2.4.8/scala/stream/stream-dynamic.html#kill-switch-scala
Example usage would be:
val countingSrc = Source(Stream.from(1))
  .delay(1.second, DelayOverflowStrategy.backpressure)
val lastSnk = Sink.last[Int]
val (killSwitch, last) = countingSrc
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(lastSnk)(Keep.both)
  .run()
doSomethingElse()
killSwitch.shutdown()
Await.result(last, 1.second) shouldBe 2
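The SharedKillSwitch linked above works the same way, except that a single switch can control several independently materialized streams at once. A minimal sketch, with illustrative names and sources that are not part of the original answer:
val sharedKs = KillSwitches.shared("file-readers") // hypothetical switch name

// Two unrelated streams, both routed through the shared switch
val done1 = Source(1 to 1000).via(sharedKs.flow).runWith(Sink.ignore)
val done2 = Source(1 to 1000).via(sharedKs.flow).runWith(Sink.ignore)

sharedKs.shutdown() // completes both streams
// sharedKs.abort(new RuntimeException("stop condition reached")) would fail them instead
Streams attached to the switch after shutdown() or abort() has been called are terminated immediately as well.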
If the stop condition is inside the stream
You can use takeWhile to express pretty much any condition, though sometimes take or limit may also be enough ("take 10 lines").
If your logic is very advanced, you could build a special stage that handles it using statefulMapConcat, which allows you to express literally anything - so you could complete the stream whenever you want "from the inside".
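For the original use case (stop reading the file once a condition is hit) a minimal takeWhile sketch could look like the following; the file path, the framing parameters and the "END" stop marker are made-up placeholders, not something from the question:
import java.nio.file.Paths
import scala.concurrent.Future
import akka.Done
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString

implicit val system = ActorSystem("stop-on-condition")
implicit val materializer = ActorMaterializer()

// Read lines until the (hypothetical) stop marker shows up, then complete the
// stream without consuming the rest of the file.
val done: Future[Done] =
  FileIO.fromPath(Paths.get("/tmp/data.txt")) // placeholder path
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024, allowTruncation = true))
    .map(_.utf8String)
    .takeWhile(line => !line.startsWith("END")) // placeholder condition
    .runWith(Sink.foreach(println))
Once the predicate fails, takeWhile completes the stage and cancels upstream, so the FileIO source stops reading and closes the file.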

Related

How do I nest Streams in Dart (map Streams to Stream events)?

Similar to this Flutter question, I want to nest Streams.
In Flutter, this can be achieved easily by nesting StreamBuilders; however, I do not want to use widgets. Instead, I want to solve the problem in Dart alone. (Nesting here means that one stream depends on the values from another stream and the two should be combined.)
Let me illustrate the problem:
Stream streamB(String a);
streamA: 'Hi' --- 'Hello' ---- 'Hey'
As you can see, I have a streamA that continuously emits events and a streamB that arises from the events streamA emits. In streamC, I want to be updated about every event from streamB.
Regular stream mapping
If I had valueB instead of streamB, I could simply use streamA.map((event) => valueB(event)); however, Stream.map can only handle synchronous values.
There is also Stream.asyncMap, however, that only works for Futures.
Then, there is also Stream.expand, but that works only for synchronous iterables.
Stream.asyncExpand
There is actually a Stream.asyncExpand method:
streamC = streamA.asyncExpand((event) => streamB(event));
However, this has the problem that the result stream (streamC) will only move on to the next event in the source stream (streamA) once the sub stream (streamB) of the previous event has closed. In the case of, say, Cloud Firestore, this will never work because the sub stream never closes.
Stream.concurrentAsyncExpand
Luckily, there is the stream_transform package!
streamC = streamA.concurrentAsyncExpand((event) => streamB(event));
This package provides concurrent async expand functionality. This way, the result stream does not wait for the sub streams to close.
However, it has the downside that previous sub streams are not automatically closed when a new event arrives in the source stream.
Thus, this is also not useful for Cloud Firestore.
Stream.switchMap
Also from the stream_transform package:
streamC = streamA.switchMap((event) => streamB(event));
This solves the problem I outlined above.

Offer to queue with some initial delay

I want to offer a string sent in a load request to a queue after some initial delay, say 10 seconds.
If subsequent requests are made with a short delay between them (1 second), everything works fine, but if they are made continuously, as from a script, then there is no delay.
Here is the sample code.
def load(randomStr: String) = Action { implicit request =>
  Source.single(randomStr)
    .delay(10.seconds, DelayOverflowStrategy.backpressure)
    .map { x =>
      println(x)
      queue.offer(x)
    }
    .runWith(Sink.ignore)
  Ok("")
}
I am not entirely sure that this is the correct way of doing what you want. There are some things you need to reconsider:
A delayed source has an initial buffer capacity of 16 elements. You can increase this with addAttributes(Attributes.inputBuffer(...)), as sketched below after this list.
In your case the buffer can never actually become full, because each call provides only a single element.
Who is the caller of the Action? You are defining a DelayOverflowStrategy.backpressure strategy, but is the caller able to handle backpressure?
On every call of the action you are creating a stream consisting of one element, so how is backpressure helping here? It is applied to the stream processing, not to the offering to the queue.
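A minimal sketch of the first point, assuming the same hypothetical Play Action and the previously materialized queue from the question; the buffer sizes are arbitrary:
import scala.concurrent.duration._
import akka.stream.{Attributes, DelayOverflowStrategy}
import akka.stream.scaladsl.{Sink, Source}

def load(randomStr: String) = Action { implicit request =>
  Source.single(randomStr)
    .delay(10.seconds, DelayOverflowStrategy.backpressure)
    // Raise the delay stage's default 16-element buffer (sizes are arbitrary)
    .addAttributes(Attributes.inputBuffer(initial = 64, max = 64))
    .map { x =>
      println(x)
      queue.offer(x)
    }
    .runWith(Sink.ignore)
  Ok("")
}
Note that this only addresses the buffer-size point: each call still materializes its own single-element stream, so the 10-second delay applies to each request independently rather than pacing the queue offers relative to one another.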

Akka streams - shutdown stream with grouping without losing data

I have a source that groups elements and a sink that makes a batch request.
I'm using KillSwitch to be able to shut down the graph at some arbitrary point in time. The problem is that the records of the latest incomplete batch that the source outputs are lost when switch.shutdown() is called.
val source = Source.tick(10.millis, 10.millis, "tick").grouped(500)
val (switch, _) = source
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(sink)(Keep.both)
  .run()
Thread.sleep(3000) // wait some arbitrary time
switch.shutdown()
Is there a way to 'flush out' the incomplete batch when shutdown happens?
The behaviour of the kill switch shutdown is positional, as per its docs
After calling [[UniqueKillSwitch#shutdown()]] the running instance of
the [[Graph]] of [[FlowShape]] that materialized to the
[[UniqueKillSwitch]] will complete its downstream and cancel its
upstream (unless if finished or failed already in which case the
command is ignored).
See also more docs here.
Now the grouped stage will emit a partially filled group only at completion time, but not when cancelled.
This means that the graph below (grouped before the kill switch) will behave as you observed:
val switch =
  Source.tick(10.millis, 175.millis, "tick")
    .grouped(10)
    .viaMat(KillSwitches.single)(Keep.right)
    .toMat(Sink.foreach(println))(Keep.left)
    .run()
whilst the graph below (grouped after the kill switch) will emit partial groups downstream at completion:
val switch =
  Source.tick(10.millis, 175.millis, "tick")
    .viaMat(KillSwitches.single)(Keep.right)
    .grouped(10)
    .toMat(Sink.foreach(println))(Keep.left)
    .run()
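A usage sketch for this second graph (the sleep duration is arbitrary, as in the question): after letting some ticks accumulate, calling shutdown makes grouped see the completion and emit the partially filled group before the stream finishes.
Thread.sleep(1000)  // let some ticks accumulate (arbitrary wait)
switch.shutdown()   // grouped sits downstream of the kill switch, so it receives the
                    // completion signal and flushes the partial group to Sink.foreach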

Spark::KMeans calls takeSample() twice?

I have a lot of data and I have experimented with partitions of cardinality [20k, 200k+].
I call it like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
and I see that initRandom() calls takeSample() once.
The takeSample() implementation doesn't seem to call itself recursively or anything like that, so I would expect KMeans() to call takeSample() once. So why does the monitor show two takeSample()s per KMeans()?
Note: I execute more KMeans() runs and they all invoke takeSample() twice, regardless of whether the data is .cache()'d or not.
Moreover, the number of partitions doesn't affect how many times takeSample() is called; it is constant at 2.
I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!
I brought this to the mailing list of the Spark devs, so I am updating:
The details of the 1st and 2nd takeSample() show that the same code is executed in both.
As suggested by Shivaram Venkataraman on Spark's mailing list:
I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.
// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
However, as one can see, the comment says this shouldn't happen often, yet it always happens for me, so if anyone has another idea, please let me know.
It was also suggested that this was a problem with the UI and that takeSample() was actually called only once, but that was just hot air.

Rx Extensions - Proper way to use delay to prevent unnecessary observables from executing?

I'm trying to use delay and amb to execute a sequence of the same task separated by time.
All I want is for a download attempt to execute some time in the future only if the same task failed before in the past. Here's how I have things set up, but contrary to what I'd expect, all three downloads seem to execute without delay.
Observable.amb([
  Observable.catch(redditPageStream, Observable.empty()).delay(0 * 1000),
  Observable.catch(redditPageStream, Observable.empty()).delay(30 * 1000),
  Observable.catch(redditPageStream, Observable.empty()).delay(90 * 1000),
  # Observable.throw(new Error('Failed to retrieve reddit page content')).delay(10000)
  # Observable.create(
  #   (observer) ->
  #     throw new Error('Failed to retrieve reddit page content')
  # )
]).defaultIfEmpty(Observable.throw(new Error('Failed to retrieve reddit page content')))
The full code can be found here.
I was hoping that the first successful observable would cancel out the ones still in delay.
Thanks for any help.
delay doesn't actually stop the execution of whatever you are doing; it just delays when the events are propagated. If you want to delay execution you would need to do something like:
redditPageStream.delaySubscription(1000)
Since your source produces immediately, the above delays the actual subscription to the underlying stream, effectively delaying when it begins producing.
I would suggest, though, that you use one of the retry operators to handle your retry logic rather than rolling your own through the amb operator.
redditPageStream.delaySubscription(1000).retry(3);
will give you a constant retry delay. However, if you want to implement a linear back-off approach, you can use the retryWhen() operator instead, which lets you apply whatever logic you want to the back-off.
redditPageStream.retryWhen(errors => {
  return errors
    // Only take 3 errors
    .take(3)
    // Use timer to implement a linear back off and flatten it
    .flatMap((e, i) => Rx.Observable.timer(i * 30 * 1000));
});
Essentially, retryWhen creates an Observable of errors; each event that makes it through is treated as a retry attempt. If you error or complete that stream, then it will stop retrying.