I'm running a live video streaming server. There's an Array[Byte] video source. Note that I can't get 2 connections to my video source. I want every client connecting to my server to receive this same stream, with a buffer discarding the old frames.
I tried using a BroadcastHub like this:
val source =
  Source.fromIterator(() => myVideoStreamingSource.zipWithIndex)

val runnableGraph =
  source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)

runnableGraph.run().to(Sink.foreach { index =>
  println(s"client A reading frame #$index")
}).run()

runnableGraph.run().to(Sink.foreach { index =>
  println(s"client B reading frame #$index")
}).run()
I get:
client A reading frame #0
client B reading frame #1
client A reading frame #2
client B reading frame #3
We see that the main stream is partitioned between the two clients, whereas I'd expect both of my clients to be able to see all of the source stream's frames.
Did I miss something, or is there another solution?
The issue is the combination of Iterator with BroadcastHub. I assume your myVideoStreamingSource is something like:
val myVideoStreamingSource = Iterator("A","B","C","D","E")
I'll now quote from BroadcastHub.Sink:
Every new materialization of the [[Sink]] results in a new, independent hub, which materializes to its own [[Source]] for consuming the [[Sink]] of that materialization.
The issue here for you is that no data is consumed from the iterator until each graph is materialized, and both materializations pull from the same underlying iterator.
The thing about iterators is that once you have consumed their data, you cannot go back to the beginning. Add to that the fact that both graphs run in parallel, and it looks like the elements are "divided" between the two, but that split is actually completely arbitrary. For example, if you add a sleep of one second between client A and client B, only client A will print anything.
To make this work, you need a source that can be traversed from the start on every materialization, for example a Seq or a List. The following will do:
val myVideoStreamingSource = Seq("A","B","C","D","E")
val source = Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)
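For completeness, here is an untested end-to-end sketch of that fix, keeping the question's structure of materializing the graph once per client (assuming Akka 2.6+, where the implicit ActorSystem also provides the materializer):

import akka.actor.ActorSystem
import akka.stream.scaladsl.{BroadcastHub, Keep, Sink, Source}

object BroadcastHubExample extends App {
  implicit val system: ActorSystem = ActorSystem("broadcast-hub-example")

  // A collection that yields a fresh iterator on every traversal,
  // unlike a one-shot Iterator.
  val myVideoStreamingSource = Seq("A", "B", "C", "D", "E")

  val source =
    Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)

  val runnableGraph =
    source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)

  // Each run() materializes its own hub, but every materialization now starts
  // from the beginning of the Seq, so both clients see every frame.
  runnableGraph.run().to(Sink.foreach { case (_, index) =>
    println(s"client A reading frame #$index")
  }).run()

  runnableGraph.run().to(Sink.foreach { case (_, index) =>
    println(s"client B reading frame #$index")
  }).run()
}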
I'm trying to consume a bunch of files from S3 in a streaming manner using Akka Streams:
S3.listBucket("<bucket>", Some("<common_prefix>"))
  .flatMapConcat { r => S3.download("<bucket>", r.key) }
  .mapConcat(_.toList)
  .flatMapConcat(_._1)
  .via(Compression.gunzip())
  .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
  .map(_.utf8String)
  .runForeach { x => println(x) }
Without increasing akka.http.host-connection-pool.response-entity-subscription-timeout I get

java.util.concurrent.TimeoutException: Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it.

for the second file, just after printing the last line of the first file, when trying to access the first line of the second file.
I understand the nature of that exception. What I don't understand is why the request for the second file is already in progress while the first file is still being processed. I guess there's some buffering involved.
Any ideas how to get rid of that exception without having to increase akka.http.host-connection-pool.response-entity-subscription-timeout?
Instead of merging the processing of downloaded files into one stream with flatMapConcat you could try materializing the stream within the outer stream and fully process it there before emitting your output downstream. Then you shouldn't begin downloading (and fully processing) the next object until you're ready.
Generally you want to avoid having too many stream materializations to reduce overhead, but I suspect that would be negligible for an app performing network I/O like this.
Let me know if something like this works: (warning: untested)
S3.listBucket("<bucket>", Some("<common_prefix>"))
  .mapAsync(1) { result =>
    S3.download("<bucket>", result.key)
      .mapConcat(_.toList)                                   // unwrap the Option, as in your original stream
      .flatMapConcat(_._1)                                   // take the data Source from the (source, metadata) pair
      .via(Compression.gunzip())
      .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
      .map(_.utf8String)
      .runWith(Sink.seq)                                     // materialize and fully drain this file's stream
  }
  .mapConcat(identity)
  .runForeach { x => println(x) }
Similar to this Flutter question I want to nest Streams.
In Flutter, this can be achieved easily by nesting StreamBuilders, however, I do not want to use widgets. Instead, I want to solve the problem in Dart alone. (nesting here means that one stream depends on the values from another stream and these should be combined)
Let me illustrate the problem:
Stream streamB(String a);
streamA: 'Hi' --- 'Hello' ---- 'Hey'
As you can see, I have a streamA that continuously emits events and a streamB that arises from the events that streamA emits. In a streamC, I want to be updated about every event from streamB.
Regular stream mapping
If I had valueB instead of streamB, I could simply use streamA.map((event) => valueB(event)), however, Stream.map can only handle synchronous values.
There is also Stream.asyncMap, however, that only works for Futures.
Then, there is also Stream.expand, but that works only for synchronous iterables.
Stream.asyncExpand
There is actually a Stream.asyncExpand method:
streamC = streamA.asyncExpand((event) => streamB(event));
However, this has the problem that the result stream (streamC) will only move on to the next event in the source stream (streamA) once the sub-stream (streamB) of the previous event has closed. In the case of, say, Cloud Firestore, this will never happen, because the sub-stream never closes.
Stream.concurrentAsyncExpand
Luckily, there is the stream_transform package!
streamC = streamA.concurrentAsyncExpand((event) => streamB(event));
This package provides a concurrent async expand functionality. This way, the result stream does not wait for the sub streams to close.
However, this has the downside that the previous sub-streams are not automatically closed when a new event arrives on the source stream.
Thus, this is also not useful for Cloud Firestore.
Stream.switchMap
Also from the stream_transform package:
streamC = streamA.switchMap((event) => streamB(event));
This solves the problem I outlined above.
I have a map in Flink that gets activated once data comes through a stream.
I want to call that map even if no data comes through.
I moved the map into a function (an infinite function call), but then the Flink job never runs. And if I put it inside a map, it only gets activated if and when data comes through.
The idea is to have one map in an infinite loop checking some shared variable, and another Flink stream monitoring a Kafka queue; if data comes in, it is processed, it changes a shared variable that affects the infinite loop in some way, and it continues.
How do I call an infinite-loop map and run the Flink maps together?
I tried creating a CollectionMap with random data to activate the stream and a map to call the infinite loop, but it exits almost immediately even though there is a while(true) loop within the map.
In the IDE it works; when I push it to a local Flink instance it exits almost immediately, not staying in the loop.
Stream 1
val data_stream = env.addSource(myConsumer)
.map(x => {process(x)})
Stream 2
val elements = List[String]("Start")
var read = env.fromElements(elements).map(x => ProcessData.infinteLoop())
How do I call an infinite-loop map and run the Flink maps together?
You can create a window and a trigger and call the map every x seconds.
You can find the documentation here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html
Example:
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.{CountTrigger, PurgingTrigger}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

val data_stream = env.addSource(myConsumer)
  .map(x => { process(x) })

val window: DataStream[String] = data_stream
  .windowAll(GlobalWindows.create())
  .trigger(PurgingTrigger.of(CountTrigger.of[GlobalWindow](5)))
  .apply((w: GlobalWindow, x: Iterable[(Integer, String)], y: Collector[String]) => {})
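If you literally want the map to run every x seconds rather than after every 5 elements, a processing-time trigger can be swapped in for the CountTrigger. Here is an untested sketch along the same lines, assuming the same data_stream and element types as above:

import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.{ContinuousProcessingTimeTrigger, PurgingTrigger}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

// Fire (and purge) the global window every 5 seconds of processing time
// instead of after every 5 elements.
val timedWindow: DataStream[String] = data_stream
  .windowAll(GlobalWindows.create())
  .trigger(PurgingTrigger.of(ContinuousProcessingTimeTrigger.of[GlobalWindow](Time.seconds(5))))
  .apply((w: GlobalWindow, x: Iterable[(Integer, String)], y: Collector[String]) => {})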
Maybe the question is too simple, or at least it looks that way, but I have the following problem:
A. I execute spark-submit to run a Spark Streaming process.
ccc.foreachRDD(rdd => {
  rdd.repartition(20).foreachPartition(p => {
    val repo = getReposX
    t foreach (g => {
      .................
B. getReposX is a function that runs a query against MongoDB and recovers a Map with the key/value pairs needed by every executor of the process.
C. Inside each g in the foreach I use this "cached" map.
The problem, or the question, is that when anything changes in the Mongo collection, I don't detect the change, so I keep working with a Map that is not updated. My question is: how can I pick up those changes? Yes, I know that if I restart the spark-submit and the driver is executed again everything is OK, but otherwise I will never see the updates in my Map.
Any ideas or suggestion?
Regards.
Finally I developed a solution. First, let me explain the question in more detail, because what I really wanted to know is how to implement an object or "cache" that is refreshed every so often, or on some kind of command, without needing to restart the Spark Streaming process; that is, it should refresh while the job is alive.
In my case this "cache", or refreshed object, is a Singleton object that connects to a MongoDB collection to recover a HashMap that is used by each executor and cached in memory, as a good Singleton. The problem was that once the Spark Streaming job was submitted, the object was cached in memory but never refreshed unless the process was restarted. One could think of a broadcast variable, refreshed when some counter reaches 1000, but broadcast variables are read-only and cannot be modified. One could think of an accumulator, but accumulators can only be read by the driver.
Finally, my solution was to implement the following inside the initialization block of the object that loads the Mongo collection and the cache:
//Initialization Block
{
  val ex = new ScheduledThreadPoolExecutor(1)
  val task = new Runnable {
    def run() = {
      logger.info("Refresh - Initialization")
      initCache
    }
  }
  val f = ex.scheduleAtFixedRate(task, 0, TIME_REFRES, TimeUnit.SECONDS)
}
initCache is nothing more than a function that connects to Mongo and loads a collection:
var cache = mutable.HashMap[String, Description]()

def initCache(): mutable.HashMap[String, Description] = {
  val serverAddresses = Functions.getMongoServers(SERVER, PORT)
  val mongoConnectionFactory = new MongoCollectionFactory(serverAddresses, DATABASE, COLLECTION, USERNAME, PASSWORD)
  val collection = mongoConnectionFactory.getMongoCollection()
  val docs = collection.find.iterator()
  cache.clear()
  while (docs.hasNext) {
    val doc = docs.next
    cache.put(...............
  }
  cache
}
In this way, once the Spark Streaming job has been submitted, the object starts one additional task, which every X amount of time (1 or 2 hours in my case) refreshes the value of the singleton collection, so callers always recover the currently instantiated value:
def getCache(): mutable.HashMap[String, Description] = {
  cache
}
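For context, here is a hedged sketch of how the executors can then read the cache inside the streaming job, following the structure of the foreachPartition snippet from the question (CacheObject is a hypothetical name for the singleton above, and process stands in for the per-record logic):

ccc.foreachRDD(rdd => {
  rdd.repartition(20).foreachPartition(p => {
    // Each executor reads the singleton's current map; the scheduled task
    // inside the singleton keeps it fresh without restarting the job.
    val repo = CacheObject.getCache()
    p.foreach(g => {
      process(g, repo)  // hypothetical per-record processing using the cached map
    })
  })
})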
Given a queue like so:
val queue: Queue[Int] = async.boundedQueue[Int](1000)
I want to pull off this queue and stream it into a downstream Sink, in chunks of UP to 100.
queue.dequeue.chunk(100).to(downstreamConsumer)
This sort of works, but it will not empty the queue if I have, say, 101 messages: there will be 1 message left over until another 99 are pushed in. I want to take as many elements as I can off the queue, up to 100, as fast as my downstream process can handle them.
Is there an existing combinator available?
For this, you may need to monitor the size of the queue while dequeueing from it. Then, if the size reaches 0, you simply don't wait for any more elements. In fact, you may implement elastic sizing of the batches based on the size of the queue, i.e.:
val q = async.unboundedQueue[String]
val deq:Process[Task,(String,Int)] = q.dequeue zip q.size
val elasticChunk: Process1[(String,Int), Vector[String]] = ???
val downstreamConsumer : Sink[Task,Vector[String]] = ???
deq.pipe(elasticChunk) to downstreamConsumer
I actually solved this a different way than I had intended.
The scalaz-stream queue now contains a dequeueBatch method that allows dequeueing all values in the queue, up to N, or blocking otherwise.
https://github.com/scalaz/scalaz-stream/issues/338
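For reference, here is a minimal, untested usage sketch under the assumption that dequeueBatch(limit) emits batches (Seq[Int]) rather than single elements, so the downstream sink has to accept batches:

import scalaz.concurrent.Task
import scalaz.stream._
import scalaz.stream.async.mutable.Queue

val queue: Queue[Int] = async.boundedQueue[Int](1000)

// A batch-oriented sink standing in for the real downstream consumer.
val downstreamConsumer: Sink[Task, Seq[Int]] =
  Process.constant { (batch: Seq[Int]) =>
    Task.delay(println(s"processing ${batch.size} elements"))
  }

// Emits whatever is available in the queue, up to 100 elements per batch,
// blocking only when the queue is empty.
queue.dequeueBatch(100).to(downstreamConsumer).run.run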