An observable or operator that emits parallel streams but always includes the last emission from all streams - rx-java2

Is there an observable or an Rx operator that can take multiple observables (multiple streams) and present them together in a function, providing the most recent emission from each stream? The closest I could find was combineLatest, but that only emits once all of the streams have emitted an item. If I have 4 observables but only 3 have emitted items, I want to see those 3. Of those 3, if one is constantly emitting but the other two haven't emitted anything for some time, I still want to see the last emission from those 2 along with the latest from the one that is constantly emitting.
What I don't want is to have something that combines all the streams into one. They must remain separate.

You can combine them with combineLatest and provide a function to wrap the values into a custom class:
Observable.combineLatest(
    source1,
    source2,
    source3,
    source4,
    Function4<String, String, String, String, LatestResult> { t1, t2, t3, t4 ->
        LatestResult(t1, t2, t3, t4)
    })
    .subscribe { latestResult ->
        // Access the latest results here:
        println(latestResult.text1)
        println(latestResult.text2)
        println(latestResult.text3)
        println(latestResult.text4)
    }
data class LatestResult(val text1: String, val text2: String, val text3: String, val text4: String)
This example assumes all your observables emit strings, but you can easily change the types. Note that combineLatest only starts emitting once every source has emitted at least once; if you need output before then, you can seed each source with a default value via startWith so there is always a "latest" value for every stream.

Related

Multiple Listeners on RxJava2 BehaviorSubject with Inconsistent Results

I have a BehaviorSubject with three listeners that are subscribed prior to any emissions. I call .onNext() twice: A followed by B.
Two of the listeners appropriately receive A and then B, but the third listener gets B, then A. What could possibly explain this behavior? This is all on the same thread.
Here is some sample code (in Kotlin) that reproduces the results. Let me know if you need a Java version:
@Test
fun `rxjava test`() {
    val eventHistory1 = ArrayList<String>()
    val eventHistory2 = ArrayList<String>()
    val eventHistory3 = ArrayList<String>()
    val behaviorSubject = BehaviorSubject.create<String>()
    behaviorSubject.subscribe {
        eventHistory1.add(it)
    }
    behaviorSubject.subscribe {
        eventHistory2.add(it)
        if (it == "A") behaviorSubject.onNext("B")
    }
    behaviorSubject.subscribe {
        eventHistory3.add(it)
    }
    behaviorSubject.onNext("A")
    println(eventHistory1)
    println(eventHistory2)
    println(eventHistory3)
    assert(eventHistory1 == eventHistory2)
    assert(eventHistory2 == eventHistory3)
}
And here is the output from the test:
[A, B]
[A, B]
[B, A]
Subjects are not re-entrant, so calling onNext on the same subject that is currently servicing an onNext is undefined behavior. The javadoc warns about this case:
Calling onNext(Object), onError(Throwable) and onComplete() is required to be serialized (called from the same thread or called non-overlappingly from different threads through external means of serialization). The Subject.toSerialized() method available to all Subjects provides such serialization and also protects against reentrance (i.e., when a downstream Observer consuming this subject also wants to call onNext(Object) on this subject recursively).
In your particular case, signaling "B" happens first for the 3rd observer while it was about to signal "A" to it, hence the swapped order.
Use toSerialized on the subject to make sure this doesn't happen.
val behaviorSubject = BehaviorSubject.create<String>().toSerialized()

Akka Streams: How to update one field with the result of the future

I have an entity passing down the Akka stream, and it has one field that has to be updated during one of the flows.
Let's say case class Entity(f: Int)
The value to update the entity with comes from a Future.
Flow[Entity]
  .map({ entity ⇒
    entity.copy(
      f = // get result of the future
    )
  })
There are several options coming to my mind.
The first is to Await the result of the future. But in this case I would have to provide my own execution context, etc. How do I use the graph's execution context within a flow?
The second is to pass a tuple of (Entity, Future[Int]) to the next stage. But it would be easier to transform it into Future[(Entity, Int)] and then mapAsync it. Is there a way to transform a tuple containing a Future into a Future of a tuple within an Akka stream?
What would be a perfect solution to this simple problem?
How about something like:
Flow[Entity]
  .mapAsync(parallelism = 1) { entity =>
    // Future.map needs an implicit ExecutionContext in scope
    createIntFuture.map { int =>
      entity.copy(f = int)
    }
  }
?
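For reference, here is a minimal, self-contained sketch of that approach (not from the original answer): the Entity case class matches the question, while lookupInt is a hypothetical stand-in for whatever produces the question's Future[Int]:
import scala.concurrent.{ExecutionContext, Future}
import akka.NotUsed
import akka.stream.scaladsl.Flow
final case class Entity(f: Int)
// Hypothetical async lookup standing in for the question's Future[Int]
def lookupInt(entity: Entity)(implicit ec: ExecutionContext): Future[Int] =
  Future.successful(entity.f + 1)
// mapAsync needs a parallelism level, and Future.map needs an ExecutionContext
def updateFlow(implicit ec: ExecutionContext): Flow[Entity, Entity, NotUsed] =
  Flow[Entity].mapAsync(parallelism = 4) { entity =>
    lookupInt(entity).map(int => entity.copy(f = int))
  }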

How to clean up substreams in continuous Akka streams

Given I have a very long-running stream of events flowing through something as shown below. After a long time has passed, there will be lots of substreams that are no longer needed.
Is there a way to clean up a specific substream at a given time? For example, the substream created by id 3 should be cleaned up, and the state in the scan method discarded, at 13:00 (the expires property of Wid).
case class Wid(id: Int, v: String, expires: LocalDateTime)
test("Substream with scan") {
  val (pub, sub) = TestSource.probe[Wid]
    .groupBy(Int.MaxValue, _.id)
    .scan("")((a: String, b: Wid) => a + b.v)
    .mergeSubstreams
    .toMat(TestSink.probe[String])(Keep.both)
    .run()
}
TL;DR You can close a substream after some time. However, using input to dynamically set the time with built-in stages is another matter.
Closing a substream
To close a flow, you usually complete it (from upstream), but you can also cancel it (from downstream). For instance, the take(n: Int) flow will cancel once n elements have gone through.
Now, in the groupBy case, you cannot complete a substream, since the upstream is shared by all substreams, but you can cancel it. How you do so depends on what condition you want to put on its end.
However, be aware that groupBy removes inputs for subflows that have already been closed: if a new element with id 3 comes from upstream to the groupBy after the 3-substream has been closed, it will simply be ignored and the next element will be pulled in. The reason for this is probably that some elements might otherwise be lost between the closing and re-opening of the substream. Also, if your stream is supposed to run for a very long time, this will affect performance, because each element is checked against the list of closed substreams before being forwarded to the relevant (live) substream. You might want to implement your own stateful filter (say, with a Bloom filter) if you're not satisfied with that performance.
To close a substream, I usually use either take (if you want only a given number of elements, which is probably not the case on an infinite stream) or some kind of timeout: completionTimeout if you want a fixed time from materialization to closure, or idleTimeout if you want to close when no element has come through for some time. Note that these flows do not cancel the stream but fail it, so you have to catch the exception with a recover or recoverWith stage to change the failure into a cancel (recoverWith allows you to cancel without sending any last element, by recovering with Source.empty).
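As an illustration, here is a rough sketch (mine, not from the original answer) of how each per-id substream from the question could be closed after ten minutes without elements, using idleTimeout and recoverWith with Source.empty as described above:
import java.util.concurrent.TimeoutException
import scala.concurrent.duration._
import akka.NotUsed
import akka.stream.scaladsl.{Flow, Source}
// Fail the substream after 10 minutes without elements, then turn that
// failure into a clean cancellation by recovering with Source.empty
val perId: Flow[Wid, String, NotUsed] =
  Flow[Wid]
    .scan("")((acc, w) => acc + w.v)
    .idleTimeout(10.minutes)
    .recoverWith { case _: TimeoutException => Source.empty }
// Plugged into the question's pipeline:
// source.groupBy(Int.MaxValue, _.id).via(perId).mergeSubstreams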
Dynamically set the timeout
Now what you want is to set the closing time dynamically, according to the first passing element. This is more complicated because the materialization of streams is independent of the elements that pass through them. Indeed, in the usual case (without groupBy), streams are materialized before any element goes through them, so it makes no sense to use elements to materialize them.
I had similar issues in that question, and ended up using a modified version of groupBy with signature
paramGroupBy[K, OO, MM](maxSubstreams: Int, f: Out => K, paramSubflow: K => Flow[Out, OO, MM])
which allows you to define every substream using the key that defined it. This can be modified to take the first element (instead of the key) as the parameter.
Another (probably simpler, in your case) way would be to write your own stage that does exactly what you want: get end-time from first element and cancel the stream at that time. Here is an example implementation for this (I used a scheduler instead of setting a state):
import scala.concurrent.duration.FiniteDuration
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler, TimerGraphStageLogic}
object CancelAfterTimer
class CancelAfter[T](getTimeout: T => FiniteDuration) extends GraphStage[FlowShape[T, T]] {
  val in = Inlet[T]("CancelAfter.in")
  val out = Outlet[T]("CancelAfter.out")
  override val shape: FlowShape[T, T] = FlowShape(in, out)
  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new TimerGraphStageLogic(shape) with InHandler with OutHandler {
      override def onPush(): Unit = {
        val elem = grab(in)
        // schedule the cancellation timer based on the first element only
        if (!isTimerActive(CancelAfterTimer))
          scheduleOnce(CancelAfterTimer, getTimeout(elem))
        push(out, elem)
      }
      override def onTimer(timerKey: Any): Unit =
        completeStage() // this will cancel the upstream and complete the downstream
      override def onPull(): Unit = pull(in)
      setHandlers(in, out, this)
    }
}
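And a rough sketch (again mine, with a hypothetical timeUntilExpiry helper) of how this stage could be wired into the question's pipeline, cancelling each substream when the expires time of its first element is reached:
import java.time.{Duration => JDuration, LocalDateTime}
import scala.concurrent.duration._
// Hypothetical helper: remaining time until the element's expires timestamp
def timeUntilExpiry(w: Wid): FiniteDuration =
  JDuration.between(LocalDateTime.now, w.expires).toMillis.millis
TestSource.probe[Wid]
  .groupBy(Int.MaxValue, _.id)
  .via(new CancelAfter[Wid](timeUntilExpiry))
  .scan("")((a, b) => a + b.v)
  .mergeSubstreams
  .toMat(TestSink.probe[String])(Keep.both)
  .run()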

Handle different states

I was wondering if it is possible to maintain radically different states across an application. For example, could the update function of the first state call the one from the second state?
I do not recall going through any such example, nor did I find anything indicating otherwise... Based on the example from https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.html, I know of no reason why I wouldn't be able to have different trackStateFuncs with different States, and still update those thanks to their Key, as shown below:
def firstTrackStateFunc(batchTime: Time,
                        key: String,
                        value: Option[Int],
                        state: State[Long]): Option[(String, Long)] = {
  val sum = value.getOrElse(0).toLong + state.getOption.getOrElse(0L)
  val output = (key, sum)
  state.update(sum)
  Some(output)
}
and
def secondTrackStateFunc(batchTime: Time,
                         key: String,
                         value: Option[Int],
                         state: State[Int]): Option[(String, Int)] = {
  // disregard problems this example would cause
  val dif = value.getOrElse(0) - state.getOption.getOrElse(0)
  val output = (key, dif)
  state.update(dif)
  Some(output)
}
I think this is possible but remain unsure. I would like someone to confirm or refute this assumption...
I was wondering if it was possible to maintain radically different states across an application?
Every call to mapWithState on a DStream[(Key, Value)] can hold one State[T] object. This T needs to be the same for every invocation of mapWithState. In order to use different states, you can either chain mapWithState calls, where one's Option[U] output is another's input, or you can split the DStream and apply a different mapWithState call to each part. You cannot, however, access a different State[T] object inside another: they are isolated from one another, and one cannot mutate the state of the other.
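As a rough sketch of those two options (mine, not from the answer; pairs is a hypothetical DStream[(String, Int)] and the two trackStateFuncs are the ones from the question):
import org.apache.spark.streaming.StateSpec
import org.apache.spark.streaming.dstream.DStream
def sketch(pairs: DStream[(String, Int)]): Unit = {
  // Option 1: chain the calls -- the output of the first mapWithState
  // becomes (after adapting the types) the input of the second
  val sums: DStream[(String, Long)] =
    pairs.mapWithState(StateSpec.function(firstTrackStateFunc _))
  val difsOfSums: DStream[(String, Int)] =
    sums.map { case (k, v) => (k, v.toInt) }
      .mapWithState(StateSpec.function(secondTrackStateFunc _))
  // Option 2: split -- apply two independent mapWithState calls to the
  // same DStream, each holding its own isolated state
  val onlySums = pairs.mapWithState(StateSpec.function(firstTrackStateFunc _))
  val onlyDifs = pairs.mapWithState(StateSpec.function(secondTrackStateFunc _))
}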
@Yuval gave a great answer about chaining mapWithState functions. However, I have another approach. Instead of having two mapWithState calls, you can put both the sum and the diff in the same State[(Long, Int)].
In this case, you would only need one mapWithState function in which you update both. Something like this:
def trackStateFunc(batchTime: Time,
                   key: String,
                   value: Option[Int],
                   state: State[(Long, Int)]): Option[(String, (Long, Int))] = {
  val (prevSum, prevDif) = state.getOption.getOrElse((0L, 0))
  val sum = value.getOrElse(0).toLong + prevSum
  val dif = value.getOrElse(0) - prevDif
  val output = (key, (sum, dif))
  state.update((sum, dif))
  Some(output)
}
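Usage would then be a single mapWithState call on the keyed input (pairs is again a hypothetical DStream[(String, Int)]):
import org.apache.spark.streaming.StateSpec
import org.apache.spark.streaming.dstream.DStream
def run(pairs: DStream[(String, Int)]): DStream[(String, (Long, Int))] =
  pairs.mapWithState(StateSpec.function(trackStateFunc _))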

Partitioning of a Stream

I'm not sure if this is possible, but I want to partition a stream based on a condition that depends on the output of the stream. It will make sense with an example, I think.
I will create a bunch of orders which I will stream, since the actual use case is a stream of incoming orders, so it is not known up front what the next order will be, or even the full list of orders:
scala> case class Order(item : String, qty : Int, price : Double)
defined class Order
scala> val orders = List(Order("bike", 1, 23.34), Order("book", 3, 2.34), Order("lamp", 1, 9.44), Order("bike", 1, 23.34))
orders: List[Order] = List(Order(bike,1,23.34), Order(book,3,2.34), Order(lamp,1,9.44), Order(bike,1,23.34))
Now I want to partition/group these orders into one set which contains duplicate orders and another set which contains unique orders. So in the above example, when I force the stream, it should create two streams: one with the two orders for a bike (since they are the same) and another containing all the other orders.
I tried the following:
created the partitioning function:
scala> def matchOrders(o : Order, s : Stream[Order]) = s.contains(o)
matchOrders: (o: Order, s: Stream[Order])Boolean
then tried to apply it to the stream:
scala> val s : (Stream[Order], Stream[Order]) = orders.toStream.partition(matchOrders(_, s._1))
I got a NullPointerException, I guess because s is still null while it is being defined, so s._1 fails? I'm not sure. I've tried other ways but I'm not getting very far. Is there a way to achieve this partitioning?
That would not work anyway, because the first duplicate Order would have already gone to the unique Stream by the time you process its duplicate.
The best way is to create a Map[Order, Boolean] which tells you if an Order appears more than once in the original orders list.
val matchOrders = orders.groupBy(identity).mapValues(_.size > 1)
val s : (Stream[Order], Stream[Order]) = orders.toStream.partition(matchOrders(_))
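With the four example orders above, that split should come out as follows (expected contents sketched by hand, not copied from a REPL):
val (duplicated, unique) = orders.toStream.partition(matchOrders(_))
// duplicated: Stream(Order(bike,1,23.34), Order(bike,1,23.34))
// unique:     Stream(Order(book,3,2.34), Order(lamp,1,9.44))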
Note that you can only know that an order has no duplicates after your stream finishes. So since the standard Stream constructors require you to know whether the stream is empty, it seems they aren't lazy enough: you have to force your original stream to even begin building the no-duplicates stream. And of course if you do this, Helder Pereira's answer applies.