Stop the fs2-stream after a timeout - scala

I want to use a function similar to take(n: Int), but in the time dimension: consume(period: Duration). That is, I want the stream to terminate once a timeout elapses. I know I could compile the stream to something like IO[List[T]] and cancel it, but then I would lose the results. What I really want is to turn an endless stream into a bounded one while preserving the results.
More on the wider scope of the problem: I have an endless stream of events from a messaging broker, but I also have rotating credentials for connecting to it. So what I want is to consume the stream of events for some time, then stop, acquire new credentials, connect to the broker again to create a new stream, and concatenate the two streams into one.

There is a method that does exactly this:
/**
* Interrupts this stream after the specified duration has passed.
*/
def interruptAfter[F2[x] >: F[x]: Concurrent: Timer](duration: FiniteDuration): Stream[F2, O]
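For the rotating-credentials use case described above, a minimal sketch could look like the following (untested; Credentials, fetchCredentials and connect are hypothetical placeholders for your broker client, and the cats-effect 2 style implicits match the snippet further down):

import cats.effect.{ContextShift, IO, Timer}
import fs2.Stream
import scala.concurrent.duration._

// Hypothetical placeholder for whatever your broker client needs.
case class Credentials(token: String)

def eventsWithRotatingCredentials[A](
    fetchCredentials: IO[Credentials],
    connect: Credentials => Stream[IO, A]
)(implicit timer: Timer[IO], cs: ContextShift[IO]): Stream[IO, A] =
  Stream
    .repeatEval(fetchCredentials)                                // acquire fresh credentials
    .flatMap(creds => connect(creds).interruptAfter(10.minutes)) // consume for a while, then stop

The resulting stream is the concatenation of all the bounded windows, so no elements are lost between reconnects.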

If you want to control the interruption yourself rather than using a fixed duration, you can drive it with a SignallingRef, something like this:
import scala.util.Random
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

import cats.effect.{ContextShift, IO, Timer}
import fs2._
import fs2.concurrent.SignallingRef

implicit val ex = ExecutionContext.global
implicit val t: Timer[IO] = IO.timer(ex)
implicit val cs: ContextShift[IO] = IO.contextShift(ex)

// An effect that emits a random Long roughly once per second.
val effect: IO[Long] = IO.sleep(1.second).flatMap(_ => IO {
  val next = Random.nextLong()
  println("NEXT: " + next)
  next
})

val signal = SignallingRef[IO, Boolean](false).unsafeRunSync()

// After 10 seconds, set the signal so the main stream gets interrupted.
val timer = Stream.sleep(10.seconds).flatMap(_ =>
  Stream.eval(signal.set(true)).flatMap(_ =>
    Stream.emit(println("Finish")).covary[IO]))

val stream = timer concurrently
  Stream.repeatEval(effect).interruptWhen(signal)

stream.compile.drain.unsafeRunSync()
Also, if you want to keep the values that were produced before the interruption, you can push them into an unbounded fs2 Queue and read them back out of the queue as a stream.
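A rough sketch of that idea (untested; assumes fs2 2.x, where the read side of the queue is exposed as queue.dequeue, and reuses the implicits from the snippet above):

import cats.effect.{ContextShift, IO, Timer}
import fs2.Stream
import fs2.concurrent.Queue
import scala.concurrent.duration._

// Collect everything `effect` produces during a 10-second window.
def collectFor10Seconds(effect: IO[Long])(implicit t: Timer[IO], cs: ContextShift[IO]): IO[List[Long]] =
  for {
    queue   <- Queue.unbounded[IO, Long]
    results <- queue.dequeue                                                       // read values back out
                 .concurrently(Stream.repeatEval(effect).evalMap(queue.enqueue1))  // producer side
                 .interruptAfter(10.seconds)
                 .compile
                 .toList
  } yield results

Values enqueued right at the cutoff may be dropped, so treat this as an illustration rather than a guarantee.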

Change a materialized value in a source using the contents of the stream

Alpakka provides a great way to access dozens of different data sources. File-oriented sources such as HDFS and FTP are delivered as Source[ByteString, Future[IOResult]]. However, HTTP requests via Akka HTTP are delivered as entity streams of Source[ByteString, NotUsed]. In my use case, I would like to retrieve content from HTTP sources as Source[ByteString, Future[IOResult]] so I can build a unified resource fetcher that works across multiple schemes (hdfs, file, ftp and S3 in this case).
In particular, I would like to convert the Source[ByteString, NotUsed] source to Source[ByteString, Future[IOResult]], where I am able to calculate the IOResult from the incoming byte stream. There are plenty of methods like flatMapConcat and viaMat, but none seem to be able to extract details from the input stream (such as the number of bytes read) or initialise the IOResult structure properly. Ideally, I am looking for a method with the following signature that updates the IOResult as the stream comes in.
def matCalc(src: Source[ByteString, Any]): Source[ByteString, Future[IOResult]] = {
  src.someMatFoldMagic[ByteString, IOResult](IOResult.createSuccessful(0))((m, b) => m.withCount(m.count + b.length))
}
I can't recall any existing functionality that does this out of the box, but you can use the alsoToMat flow operator (surprisingly, I didn't find it in the Akka Streams docs, although it is documented in the source code and the Java API) together with Sink.fold to accumulate a value and hand it over at the very end. E.g.:
def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
  source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)
The thing is that alsoToMat combines the upstream materialized value with the one produced by the sink passed to alsoToMat. At the same time, the elements produced by the source are not affected by that sink:
def alsoToMat[Mat2, Mat3](that: Graph[SinkShape[Out], Mat2])(matF: (Mat, Mat2) ⇒ Mat3): ReprMat[Out, Mat3] =
  viaMat(alsoToGraph(that))(matF)
It's not hard to adapt this function to return an IOResult, which according to the source code is:
final case class IOResult(count: Long, status: Try[Done]) { ... }
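For example, a rough (untested) adaptation that counts the bytes with Sink.fold and maps the sink's materialized future into an IOResult could look like this:

import akka.stream.IOResult
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.{ExecutionContext, Future}

def matCalc(src: Source[ByteString, Any])(implicit ec: ExecutionContext): Source[ByteString, Future[IOResult]] =
  src.alsoToMat(
    Sink.fold(0L)((count, bytes: ByteString) => count + bytes.length) // count bytes on the side
  )((_, countF) => countF.map(IOResult.createSuccessful))             // turn the count into an IOResult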
One more thing to pay attention to: you want your source to be typed as
Source[ByteString, Future[IOResult]]
but if you want to carry this materialized value to the very end of the stream definition and then act on the completion of that future, that can be an error-prone approach. E.g., in this example I finish the work based on that future, so the last value is not processed:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object App extends App {
  private implicit val sys: ActorSystem = ActorSystem()
  private implicit val mat: ActorMaterializer = ActorMaterializer()
  private implicit val ec: ExecutionContext = sys.dispatcher

  val source: Source[Int, Any] = Source((1 to 5).toList)

  def magic(source: Source[Int, Any]): Source[Int, Future[Int]] =
    source.alsoToMat(Sink.fold(0)((acc, _) => acc + 1))((_, f) => f)

  val f = magic(source).throttle(1, 1.second).toMat(Sink.foreach(println))(Keep.left).run()
  f.onComplete(t => println(s"f1 completed - $t"))

  Await.ready(f, 5.minutes)

  mat.shutdown()
  sys.terminate()
}
This can be done by using a Promise for the materialized value propagation.
val completion = Promise[IOResult]()
val httpWithIoResult = http.mapMaterializedValue(_ => completion.future)
What is left now is to complete the completion promise when the relevant data becomes available.
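For instance, a hedged sketch (assuming http is the Source[ByteString, NotUsed] from the question and an ExecutionContext is in scope) that completes the promise from a byte-counting side sink attached with alsoTo:

import akka.NotUsed
import akka.stream.IOResult
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.{Future, Promise}

val completion = Promise[IOResult]()
val httpWithIoResult: Source[ByteString, Future[IOResult]] =
  http
    .alsoTo(
      Sink.fold(0L)((count, bytes: ByteString) => count + bytes.length)
        .mapMaterializedValue { countF =>
          // Complete the promise once all bytes have flowed through and been counted.
          completion.completeWith(countF.map(IOResult.createSuccessful))
          NotUsed
        })
    .mapMaterializedValue(_ => completion.future)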
An alternative approach would be to drop down to the GraphStage API, where you get lower-level control over materialized value propagation. But even there, a Promise is often the chosen implementation for materialized value propagation. Take a look at built-in operator implementations like Ignore.

Flink's broadcast state behavior

I am trying to play with Flink's broadcast state in a simple case.
I just want to multiply an integer stream by another integer provided via a broadcast stream.
The behavior of my broadcast is "weird": if I put too few elements in my input stream (like 10), nothing happens and my MapState is empty, but if I put in more elements (like 100) I get the behavior I want (the integer stream is multiplied by 2 here).
Why is the broadcast stream not taken into account when I provide too few elements?
How can I control when the broadcast stream takes effect?
Optional: I want to keep only the last element of my broadcast stream; is .clear() the right way to do that?
Thank you!
Here's my BroadcastProcessFunction:
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

import scala.collection.JavaConversions._

class BroadcastProcess extends BroadcastProcessFunction[Int, Int, Int] {

  override def processElement(value: Int, ctx: BroadcastProcessFunction[Int, Int, Int]#ReadOnlyContext, out: Collector[Int]) = {
    val currentBroadcastState = ctx.getBroadcastState(State.mapState).immutableEntries()
    if (currentBroadcastState.isEmpty) {
      out.collect(value)
    } else {
      out.collect(currentBroadcastState.last.getValue * value)
    }
  }

  override def processBroadcastElement(value: Int, ctx: BroadcastProcessFunction[Int, Int, Int]#Context, out: Collector[Int]) = {
    // Keep only last state
    ctx.getBroadcastState(State.mapState).clear()
    // Add state
    ctx.getBroadcastState(State.mapState).put("key", value)
  }
}
And my MapState:
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.scala._

object State {
  val mapState: MapStateDescriptor[String, Int] =
    new MapStateDescriptor(
      "State",
      createTypeInformation[String],
      createTypeInformation[Int]
    )
}
And my Main:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object Broadcast {
  def main(args: Array[String]): Unit = {
    val numberElements = 100

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val broadcastStream = env.fromElements(2).broadcast(State.mapState)

    val input = (1 to numberElements).toList
    val inputStream = env.fromCollection(input)

    val outputStream = inputStream
      .connect(broadcastStream)
      .process(new BroadcastProcess())

    outputStream.print()
    env.execute()
  }
}
Edit: I use Flink 1.5, and Broadcast State documentation is here.
Flink does not synchronize the ingestion of streams, i.e., streams produce data as soon as they can. This is true for regular and broadcast inputs. The BroadcastProcess function will not wait for the first broadcast element to arrive before ingesting the regular input.
When you put more numbers into the regular input, it simply takes more time to serialize, deserialize, and serve that input, so the broadcast element is already present by the time the first regular number arrives.
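If you need the regular elements to wait for the broadcast value, one possible workaround (a sketch that is not part of the original answer) is to buffer them inside the function until the broadcast state has been populated. Note that the buffer below lives in plain memory and is not checkpointed, so this is only suitable for a toy, parallelism-1 job like the one above:

import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

import scala.collection.mutable

class BufferingBroadcastProcess extends BroadcastProcessFunction[Int, Int, Int] {

  // Elements seen before the broadcast value arrived (not checkpointed!).
  private val pending = mutable.ListBuffer.empty[Int]

  override def processElement(value: Int, ctx: BroadcastProcessFunction[Int, Int, Int]#ReadOnlyContext, out: Collector[Int]): Unit = {
    val state = ctx.getBroadcastState(State.mapState)
    if (state.contains("key")) {
      val factor = state.get("key")
      pending.foreach(v => out.collect(factor * v)) // flush what was buffered
      pending.clear()
      out.collect(factor * value)
    } else {
      pending += value // broadcast value not there yet, hold on to the element
    }
  }

  override def processBroadcastElement(value: Int, ctx: BroadcastProcessFunction[Int, Int, Int]#Context, out: Collector[Int]): Unit = {
    ctx.getBroadcastState(State.mapState).clear() // keep only the latest broadcast value
    ctx.getBroadcastState(State.mapState).put("key", value)
  }
}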

How to abruptly stop an akka stream Runnable Graph?

I am not able to figure out how to stop an Akka Streams RunnableGraph immediately. How can a KillSwitch be used to achieve this? I started with Akka Streams just a few days ago. In my case I am reading lines from a file, doing some operations in a flow, and writing to the sink. What I want is to stop reading the file immediately whenever I want, which I hope should stop the whole running graph. Any ideas on this would be greatly appreciated.
Thanks in advance.
Since Akka Streams 2.4.3, there is an elegant way to stop the stream from the outside via KillSwitch.
Consider the following example, which stops the stream after 10 seconds.
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, DelayOverflowStrategy, KillSwitches}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}

import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Random

object ExampleStopStream extends App {
  implicit val system = ActorSystem("streams")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  val source = Source.
    fromIterator(() => Iterator.continually(Random.nextInt(100))).
    delay(500.millis, DelayOverflowStrategy.dropHead)
  val square = Flow[Int].map(x => x * x)
  val sink = Sink.foreach(println)

  val (killSwitch, done) =
    source.via(square).
      viaMat(KillSwitches.single)(Keep.right).
      toMat(sink)(Keep.both).run()

  system.scheduler.scheduleOnce(10.seconds) {
    println("Shutting down...")
    killSwitch.shutdown()
  }

  done.foreach { _ =>
    println("I'm done")
    Await.result(system.terminate(), 1.seconds)
  }
}
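If you want the stream to fail instead of completing normally, the same KillSwitch also provides abort:

// Fails the running stream with the given error instead of completing it.
killSwitch.abort(new RuntimeException("stop right now"))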
Another way is to have a service or shutdown hook that cancels the graph through the Cancellable it materializes:
val graph =
  Source.tick(FiniteDuration(0, TimeUnit.SECONDS), FiniteDuration(1, TimeUnit.SECONDS), Random.nextInt)
    .to(Sink.foreach(println))
val cancellable = graph.run()

cancellable.cancel()
The cancellable.cancel() call can also be registered via ActorSystem.registerOnTermination.
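For instance (assuming system is the ActorSystem the graph runs on):

// Cancel the tick source when the actor system shuts down.
system.registerOnTermination {
  cancellable.cancel()
}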

How to represent multiple incoming TCP connections as a stream of Akka streams?

I'm prototyping a network server using Akka Streams that will listen on a port, accept incoming connections, and continuously read data off each connection. Each connected client will only send data, and will not expect to receive anything useful from the server.
Conceptually, I figured it would be fitting to model the incoming events as one single stream that only incidentally happens to be delivered via multiple TCP connections. Thus, assuming that I have a case class Msg(msg: String) that represents each data message, what I want is to represent the entirety of incoming data as a Source[Msg, _]. This makes a lot of sense for my use case, because I can very simply connect flows & sinks to this source.
Here's the code I wrote to implement my idea:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.SourceShape
import akka.stream.scaladsl._
import akka.util.ByteString
import akka.NotUsed

import scala.concurrent.{ Await, Future }
import scala.concurrent.duration._

case class Msg(msg: String)

object tcp {

  val N = 2

  def main(argv: Array[String]) {
    implicit val system = ActorSystem()
    implicit val materializer = ActorMaterializer()

    val connections = Tcp().bind("0.0.0.0", 65432)

    val delim = Framing.delimiter(
      ByteString("\n"),
      maximumFrameLength = 256, allowTruncation = true
    )
    val parser = Flow[ByteString].via(delim).map(_.utf8String).map(Msg(_))

    val messages: Source[Msg, Future[Tcp.ServerBinding]] =
      connections.flatMapMerge(N, {
        connection =>
          println(s"client connected: ${connection.remoteAddress}")
          Source.fromGraph(GraphDSL.create() { implicit builder =>
            import GraphDSL.Implicits._

            val F = builder.add(connection.flow.via(parser))
            val nothing = builder.add(Source.tick(
              initialDelay = 1.second,
              interval = 1.second,
              tick = ByteString.empty
            ))

            F.in <~ nothing.out
            SourceShape(F.out)
          })
      })

    import scala.concurrent.ExecutionContext.Implicits.global

    Await.ready(for {
      _ <- messages.runWith(Sink.foreach {
        msg => println(s"${System.currentTimeMillis} $msg")
      })
      _ <- system.terminate()
    } yield (), Duration.Inf)
  }
}
This code works as expected. However, note the val N = 2, which is passed into the flatMapMerge call that ultimately combines the incoming data streams into one. In practice this means that I can only read from that many streams at a time.
I don't know how many connections will be made to this server at any given time. Ideally I would want to support as many as possible, but hardcoding an upper bound doesn't seem like the right thing to do.
My question, at long last, is: How can I obtain or create a flatMapMerge stage that can read from more than a fixed number of connections at one time?
As indicated by Viktor Klang's comments, I don't think this is possible with one stream. However, I think it would be possible to create a stream that can receive messages after materialization and use that as a "sink" for messages coming from the TCP connections.
First create the "sink" stream:
import akka.stream.OverflowStrategy

val sinkRef =
  Source
    .actorRef[Msg](Int.MaxValue, OverflowStrategy.fail)
    .to(Sink foreach { m => println(s"${System.currentTimeMillis} $m") })
    .run()
This sinkRef can be used by each Connection to receive the messages:
connections runForeach { conn =>
  Source
    .empty[ByteString]
    .via(conn.flow)
    .via(parser)
    .runForeach(msg => sinkRef ! msg)
}

Custom Receiver stalls worker in Spark Streaming

I am trying to write a Spark Streaming application with a custom receiver. It is supposed to simulate real-time input data by providing random values at a pre-defined interval. The (simplified) receiver looks as follows, with the corresponding Spark Streaming app code below:
import akka.actor.Actor
import org.apache.spark.streaming.receiver.ActorHelper

import scala.concurrent.duration._
import scala.util.Random

class SparkStreamingReceiver extends Actor with ActorHelper {
  private val random = new Random()

  override def preStart = {
    // The scheduler call needs an implicit ExecutionContext in scope.
    import context.dispatcher
    context.system.scheduler.schedule(500 milliseconds, 1000 milliseconds)({
      self ! ("string", random.nextGaussian())
    })
  }

  override def receive = {
    case data: (String, Double) => {
      store[(String, Double)](data)
    }
  }
}
val conf: SparkConf = new SparkConf()
conf.setAppName("Spark Streaming App")
  .setMaster("local")

val ssc: StreamingContext = new StreamingContext(conf, Seconds(2))

val randomValues: ReceiverInputDStream[(String, Double)] =
  ssc.actorStream[(String, Double)](Props(new SparkStreamingReceiver()), "Receiver")

randomValues.saveAsTextFiles("<<OUTPUT_PATH>>/randomValues")
Running this code, I see that the receiver is working ("Storing item" and "received single" log entries). However, saveAsTextFiles never outputs any values.
I can work around the problem by changing the master to run with two threads (local[2]), but if I register another instance of my receiver (which I intend to do), the problem reappears. More specifically, I need at least one thread more than the number of custom receivers registered in order to get any output.
It seems to me as though the worker threads are stalled by the receivers.
Can anyone explain this effect, and possibly how to fix my code?
Each receiver uses a compute slot. So 2 receivers will require 2 compute slots. If all the compute slots are taken by receivers, then there is no slot left to process the data. That is why "local" mode with 1 receiver, and "local[2]" with 2 receivers stalls the processing.
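So the fix is to give the local master more threads than there are receivers; for example, with two receivers:

// 2 slots for the receivers plus at least one for processing the received data
conf.setMaster("local[3]")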