Skip errors in the infinite Stream - scala

I have an infinite fs2.Stream which may encounter errors. I'd like to skip those errors, doing nothing except perhaps logging them, and keep streaming further elements. Example:
//An example
val stream = fs2.Stream
  .awakeEvery[IO](1.second)
  .evalMap(_ => IO.raiseError(new RuntimeException))
In this specific case, I'd like to get an infinite fs2.Stream of Left(new RuntimeException) values, emitting one every second.
There is a Stream.attempt method, but it produces a stream that terminates after the first error is encountered. Is there a way to just skip errors and keep pulling further elements?
Attempting the effect itself, as in IO.raiseError(new RuntimeException).attempt, won't work in general since it would require attempting every effect in every place the stream pipeline is composed.
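For illustration, here is a minimal sketch of that behaviour (assuming fs2 2.x with cats-effect 2; the imports and implicits are only there to make the snippet runnable): .attempt turns the failure into a Left, but the stream still terminates at that point, so only a single element is ever emitted.
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
import cats.effect.{ContextShift, IO, Timer}
import fs2.Stream

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)

// .attempt converts the failure into a Left, but the stream still ends there
val attempted: Stream[IO, Either[Throwable, Unit]] = Stream
  .awakeEvery[IO](1.second)
  .evalMap(_ => IO.raiseError[Unit](new RuntimeException))
  .attempt

println(attempted.take(3).compile.toList.unsafeRunSync())
// prints a single Left rather than three: List(Left(java.lang.RuntimeException))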

There's no way to handle errors in the way you described.
When a stream encounters its first error, it is terminated. Please check this Gitter question.
You can handle it in two ways:
Attempt the effect (but you already mentioned it is not possible in your case).
Restart the stream after it is terminated:
val stream: Stream[IO, Either[Throwable, Unit]] = Stream
  .awakeEvery[IO](1.second)
  .evalMap(_ => IO.raiseError(new RuntimeException))
  .handleErrorWith(t => Stream(Left(t)) ++ stream) //put a Left into the stream and restart it

//since the stream will restart infinitely, take only the first 3 values
println(stream.take(3).compile.toList.unsafeRunSync())

Related

When is the materializer actually used in Akka Streams Flows, and when do we need to Keep values?

I'm trying to learn Akka Streams and I'm stuck on materialization here.
Every tutorial shows some basic source/via/to/run examples where the real difference between Keep.left and Keep.right is never explained. So I wrote this little piece of code, asked IntelliJ to add type annotations to the values, and started digging into the sources.
val single: Source[Int, NotUsed] = Source(Seq(1, 2, 3, 4, 5))
val flow: Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2)
val sink: Sink[Int, Future[Int]] = Sink.fold[Int, Int](0)(_ + _)

val run1: RunnableGraph[Future[Int]] =
  single.viaMat(flow)(Keep.right).toMat(sink)(Keep.right)

val run2: RunnableGraph[NotUsed] =
  single.viaMat(flow)(Keep.right).toMat(sink)(Keep.left)

val run3: RunnableGraph[(NotUsed, Future[Int])] =
  single.viaMat(flow)(Keep.right).toMat(sink)(Keep.both)

val run4: RunnableGraph[NotUsed] =
  single.viaMat(flow)(Keep.right).toMat(sink)(Keep.none)
So far I understand that at the end of the execution we may need the value produced by the Sink, which is of type Future[Int]. But I cannot think of any case where I would need to keep the other values.
In the third example it is possible to access both the left and right values of the materialized output.
run3.run()._2 onComplete {
  case Success(value) ⇒ println(value)
  case Failure(exception) ⇒ println(exception.getMessage)
}
It actually works exactly the same way if I change it to viaMat(flow)(Keep.left), or none, or both.
But in what scenarios could the materialized value be used within the graph? Why would we need it if the value is flowing through the graph anyway? Why would we need one of the values if we aren't going to keep it?
Could you please provide an example where changing from left to right will not just break compilation, but will actually make a difference to the program logic?
For most streams, you only care about the value at the end of the stream. Accordingly, most Source operators and nearly all of the standard Flow operators have a materialized value of NotUsed, and the syntactic sugar .runWith boils down to .toMat(sink)(Keep.right).run.
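For instance, a small sketch of that equivalence, reusing single, flow and sink from the question (and assuming an implicit materializer / ActorSystem is in scope): both runs hand back the fold sink's Future[Int].
// the sugared and desugared forms materialize the same Future[Int]
val viaRunWith: Future[Int] = single.via(flow).runWith(sink)
val viaToMat: Future[Int] = single.via(flow).toMat(sink)(Keep.right).run()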
Where one might care about the materialized value of a Source or Flow stage is when you want to be able to control a stage outside of the stream. An example of this is Source.actorRef, which allows you to send messages to an actor which get forwarded to the stream: you need the Source's materialized ActorRef in order to actually send a message to it. Likewise, you probably still want the materialized value of the Sink (whether to know that the stream processing happened (Future[Done]) or for an actual value at the end of the stream). In such a stream you'd probably have something like:
val stream: RunnableGraph[(ActorRef, Future[Done])] =
  Source.actorRef(...)
    .viaMat(calculateStuffFlow)(Keep.left) // propagates the ActorRef
    .toMat(Sink.foreach { ... })(Keep.both)

val (sendToStream, done) = stream.run()
Another reasonably common use-case for this is in the Alpakka Kafka integration, where it's possible for the consumer to have a controller as a materialized value: this controller allows you to stop consuming from a topic and not unsubscribe until any pending offset commits have happened.
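A rough sketch of that Alpakka Kafka pattern (consumerSettings, the topic name, and the implicit ActorSystem/ExecutionContext are assumed to be in scope; treat the exact API as approximate): Keep.both retains the consumer's Control alongside the sink's completion, and drainAndShutdown stops polling while letting in-flight elements finish.
import akka.Done
import akka.kafka.Subscriptions
import akka.kafka.scaladsl.Consumer
import akka.stream.scaladsl.{Keep, Sink}
import scala.concurrent.Future

val (control, streamDone) =
  Consumer
    .plainSource(consumerSettings, Subscriptions.topics("my-topic")) // placeholder settings/topic
    .map(record => record.value) // hand each record to the rest of the stream
    .toMat(Sink.ignore)(Keep.both) // (Consumer.Control, Future[Done])
    .run()

// later: stop consuming, but only complete once buffered records have been processed
val shutdown: Future[Done] = control.drainAndShutdown(streamDone)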

Detecting that back pressure is happening

My Akka HTTP application streams some data via server-sent events, and clients can request way more events than they can handle. The code looks something like this
complete {
  source.filter(predicate.isMatch)
    .buffer(1000, OverflowStrategy.dropTail)
    .throttle(20, 1.second)
    .map { evt => ServerSentEvent(evt) }
}
Is there a way to detect that a stage is backpressuring and somehow notify the client, preferably using the same sink (by emitting a different kind of output)? If that isn't possible, can I make the Akka framework call some sort of callback that deals with the fact through a control side-channel?
So, I'm not sure I understand your use case. Are you asking about back pressure at .buffer or at .throttle? Another part of my confusion is that you are suggesting emitting a new "control" element in a situation where the stream is already back pressured. So your control element might not be received for some time. Also, if you emit a control element every single time you receive back pressure you will likely create a flood of control elements.
One way to build this (overly naive) solution would be to use conflate.
val simpleSink: Sink[String, Future[Done]] =
  Sink.foreach(e => println(s"simple: $e"))

val cycleSource: Source[String, NotUsed] =
  Source.cycle(() => List("1", "2", "3", "4").iterator).throttle(5, 1.second)

val conflateFlow: Flow[String, String, NotUsed] =
  Flow[String].conflate((a, b) => "BACKPRESSURE CONTROL ELEMENT")

val backpressureFlow: Flow[String, String, NotUsed] =
  Flow[String]
    .buffer(10, OverflowStrategy.backpressure)
    .throttle(2, 1.second)

val backpressureTest =
  cycleSource.via(conflateFlow).via(backpressureFlow).to(simpleSink).run()
To turn this into a more usable example you could either:
Make some sort of call inside of .conflate (and then just drop one of the elements). Be careful not to do anything blocking, though. Perhaps just send a message that could be de-duplicated elsewhere; see the sketch at the end of this answer.
Write a custom graph stage. Doing something simple like this wouldn't be too difficult.
I think I'd have to understand more about the use case though. Take a look at all of the off-the-shelf backpressure-aware operators and see if one of them helps.
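As a hedged sketch of the first option (the Counted wrapper and countingConflate are made-up names, reusing the setup above): instead of replacing the element, keep the latest one and count how many were merged away while downstream was busy; a later stage can log that count or turn it into a single, de-duplicated control message.
// hypothetical: wrap each element with a count of how many were merged under backpressure
final case class Counted(value: String, merged: Long)

val countingConflate: Flow[String, Counted, NotUsed] =
  Flow[String].conflateWithSeed(v => Counted(v, 0L)) { (acc, next) =>
    // this aggregate function only runs while downstream is not pulling,
    // so every invocation is evidence of backpressure
    Counted(next, acc.merged + 1)
  }

You would then use countingConflate in place of conflateFlow and let the sink (or a later stage) decide what to do when merged is non-zero.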

Apache flink broadcast state gets flushed

I am using the broadcast pattern to connect two streams and read data from one to another. The code looks like this
class Broadcast extends BroadcastProcessFunction[MyObject, (String, Double), MyObject] {
  override def processBroadcastElement(in2: (String, Double),
      context: BroadcastProcessFunction[MyObject, (String, Double), MyObject]#Context,
      collector: Collector[MyObject]): Unit = {
    context.getBroadcastState(broadcastStateDescriptor).put(in2._1, in2._2)
  }

  override def processElement(obj: MyObject,
      readOnlyContext: BroadcastProcessFunction[MyObject, (String, Double), MyObject]#ReadOnlyContext,
      collector: Collector[MyObject]): Unit = {
    val theValue = readOnlyContext.getBroadcastState(broadcastStateDescriptor).get(obj.prop)
    //If I print the contents of the state here, sometimes it is empty.
    collector.collect(MyObject(/* new properties go here */))
  }
}
The state descriptor:
val broadcastStateDescriptor: MapStateDescriptor[String, Double] =
  new MapStateDescriptor[String, Double]("name_for_this", classOf[String], classOf[Double])
My execution code looks like this.
val streamA: DataStream[MyObject] = ...
val streamB: DataStream[(String, Double)] = ...
val broadcastedStream = streamB.broadcast(broadcastStateDescriptor)

streamA.connect(broadcastedStream).process(new Broadcast)
The problem is that in the processElement function the state is sometimes empty and sometimes not. The state should always contain data, since I am constantly streaming from a file that I know has data. I do not understand why the state is being flushed so that I cannot get the data.
I tried adding some printing in processBroadcastElement before and after putting the data into the state, and the result is the following:
0 - 1
1 - 2
2 - 3
.. all the way to 48 where it resets back to 0
UPDATE:
Something I noticed is that when I decrease the buffer timeout of the stream execution environment, the results are a bit better; when I increase it, the map is always empty.
env.setBufferTimeout(1) //better results
env.setBufferTimeout(200) //worse result (default is 100)
Whenever two streams are connected in Flink, you have no control over the timing with which Flink will deliver events from the two streams to your user function. So, for example, if there is an event available to process from streamA, and an event available to process from streamB, either one might be processed next. You cannot expect the broadcastedStream to somehow take precedence over the other stream.
There are various strategies you might employ to cope with this race between the two streams, depending on your requirements. For example, you could use a KeyedBroadcastProcessFunction and call applyToKeyedState on its broadcast-side Context to iterate over all existing keyed state whenever a new broadcast event arrives.
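A rough, hypothetical sketch of that idea (the String key type, someKeyedStateDescriptor, and the KeyedBroadcast class are illustrative assumptions, not code from the question):
import org.apache.flink.api.common.state.{KeyedStateFunction, ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.util.Collector

// some per-key state the function maintains (hypothetical)
val someKeyedStateDescriptor =
  new ValueStateDescriptor[Double]("per-key-value", classOf[Double])

class KeyedBroadcast
    extends KeyedBroadcastProcessFunction[String, MyObject, (String, Double), MyObject] {

  override def processBroadcastElement(
      in2: (String, Double),
      ctx: KeyedBroadcastProcessFunction[String, MyObject, (String, Double), MyObject]#Context,
      out: Collector[MyObject]): Unit = {
    ctx.getBroadcastState(broadcastStateDescriptor).put(in2._1, in2._2)
    // visit the state of every key seen so far, now that new broadcast data has arrived
    ctx.applyToKeyedState(
      someKeyedStateDescriptor,
      new KeyedStateFunction[String, ValueState[Double]] {
        override def process(key: String, state: ValueState[Double]): Unit = {
          // inspect or update this key's state here
        }
      })
  }

  override def processElement(
      obj: MyObject,
      ctx: KeyedBroadcastProcessFunction[String, MyObject, (String, Double), MyObject]#ReadOnlyContext,
      out: Collector[MyObject]): Unit = {
    // per-element logic as before
  }
}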
As David mentioned, the job could be restarting. I disabled checkpointing so I could see any exception thrown instead of Flink silently failing and restarting the job.
It turned out that there was an error while trying to parse the file, so the job kept restarting; thus the state was empty and Flink kept consuming the stream over and over again.

FS2: is it possible to complete Queue gracefully?

Suppose that I want to convert some legacy asynchronous API into FS2 Streams.
The API provides an interface with 3 callbacks: next element, success, error.
I'd like the Stream to emit all the elements and then complete upon receiving success or error callback.
FS2 guide (https://functional-streams-for-scala.github.io/fs2/guide.html) suggests using fs2.Queue for such situations,
and it works great for enqueueing, but all the examples I've seen so far expect that the stream returned by queue.dequeue will never complete -
so there's no obvious way to handle the success/error callbacks in my situation.
I've tried to use queue.dequeue.interruptWhen(...here goes the signal...), but if the success/error callback arrives before the client has read all the data from the stream,
the stream gets terminated prematurely while there are still unread elements. I'd like the consumer to finish reading them before the stream completes.
Is it possible to do that with FS2? With Akka Streams it's trivial - SourceQueueWithComplete has complete and fail methods.
UPDATE:
I was able to get a good enough result by wrapping elements in Option and treating None as a signal to stop reading the stream, and additionally by using a Promise to propagate errors:
queue.dequeue
  .interruptWhen(interruptingPromise.get)
  .takeWhile(_.isDefined)
  .map(_.get)
However, did I overlook a more natural way of doing such things?
One idiomatic way to do this is to create a Queue[Option[A]] instead of a Queue[A]. When enqueueing, wrap elements in Some, and explicitly enqueue None to signal completion. On the dequeueing side, do q.dequeue.unNoneTerminate, which gives you a Stream[F, A] that terminates once the Queue emits None.
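A minimal sketch of that pattern, assuming the fs2 2.x fs2.concurrent.Queue API and an implicit ContextShift[IO] in scope (the literal values are only for illustration):
import cats.effect.IO
import fs2.concurrent.Queue

val program: IO[List[Int]] =
  for {
    q   <- Queue.unbounded[IO, Option[Int]]
    _   <- q.enqueue1(Some(1))  // "next element" callback
    _   <- q.enqueue1(Some(2))  // "next element" callback
    _   <- q.enqueue1(None)     // "success" callback: signal completion
    out <- q.dequeue.unNoneTerminate.compile.toList
  } yield out                   // List(1, 2)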
Answer to your update: combine unNoneTerminate with rethrow, which takes a Stream[F, Either[Throwable, A]] and returns a Stream[F, A] that errors out with Stream.raiseError when it encounters a Throwable.
Your complete stack would then be a Stream[F, Either[Throwable, Option[A]]], and you unwrap it into a Stream[F, A] by calling .rethrow.unNoneTerminate.
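Continuing the sketch above under the same assumptions: the error callback would enqueue Left(error), the success callback Right(None), and each element Right(Some(a)).
// Either[Throwable, Option[Int]] -> rethrow -> Option[Int] -> unNoneTerminate -> Int
val consumed: fs2.Stream[IO, Int] =
  fs2.Stream.eval(Queue.unbounded[IO, Either[Throwable, Option[Int]]]).flatMap { q =>
    q.dequeue.rethrow.unNoneTerminate
  }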

"Accumulating" scalaz-stream channel

I'm trying to implement a scalaz-stream channel that accumulates statistics about the events it receives and, once complete, emits the final statistics.
To give a concrete, simplified example: imagine that you have a Process[Task, String] where each string is a word. I'd like to have a Channel[Task, String, (String, Int)] that, when applied to that initial process, would drain it, count the number of times each word occurs, and emit that.
I realise this is trivial through a fold:
input.foldMap(w => Map(w -> 1))
  .flatMap(m => Process.emitAll(m.toSeq))
  .maximumBy(_._2)
What I'm trying to write is a collection of standard accumulators that I can then just pipe my processes through - rather than explicitly folding, say, I'd write:
input.through(wordFrequency)
  .maximumBy(_._2)
I'm at a bit of a loss though - I can't work out how to do this without sharing state. Writing a Sink that accumulates into a Map[String, Int] is fairly simple, but there's no way to get the final state of the map and emit it once the sink has terminated.
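For what it's worth, a hedged sketch of one possibility (from memory of the scalaz-stream 0.7-era API, so names may need adjusting): since the accumulator needs no effects, it can be a pure Process1 built from process1.fold, which emits only after the upstream has halted and is applied with pipe rather than through.
import scalaz.stream.{Process, Process1, process1}

// accumulate counts; the fold emits its single Map only when the input ends
val wordFrequency: Process1[String, (String, Int)] =
  process1
    .fold(Map.empty[String, Int])((acc, w) => acc.updated(w, acc.getOrElse(w, 0) + 1))
    .flatMap(m => Process.emitAll(m.toSeq))

// usage: input.pipe(wordFrequency), no shared state required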