Multiple Listeners on RxJava2 BehaviorSubject with Inconsistent Results - rx-java2

I have a BehaviorSubject that has three listeners that are subscribed prior to any emissions. I .onNext() two things: A followed by B.
Two of the listeners appropriately receive A and then B. But the third listener gets B, A. What could possibly explain this behavior? This is all on the same thread.
Here is some sample code (in Kotlin) that reproduces the results. Let me know if you need a Java version:
import io.reactivex.subjects.BehaviorSubject
import org.junit.Test

@Test
fun `rxjava test`() {
    val eventHistory1 = ArrayList<String>()
    val eventHistory2 = ArrayList<String>()
    val eventHistory3 = ArrayList<String>()
    val behaviorSubject = BehaviorSubject.create<String>()

    behaviorSubject.subscribe {
        eventHistory1.add(it)
    }
    behaviorSubject.subscribe {
        eventHistory2.add(it)
        if (it == "A") behaviorSubject.onNext("B")
    }
    behaviorSubject.subscribe {
        eventHistory3.add(it)
    }

    behaviorSubject.onNext("A")

    println(eventHistory1)
    println(eventHistory2)
    println(eventHistory3)

    assert(eventHistory1 == eventHistory2)
    assert(eventHistory2 == eventHistory3)
}
And here is the output from the test:
[A, B]
[A, B]
[B, A]

Subjects are not re-entrant, so calling onNext on a subject that is currently servicing an onNext call is undefined behavior. The javadoc warns about this case:
Calling onNext(Object), onError(Throwable) and onComplete() is required to be serialized (called from the same thread or called non-overlappingly from different threads through external means of serialization). The Subject.toSerialized() method available to all Subjects provides such serialization and also protects against reentrance (i.e., when a downstream Observer consuming this subject also wants to call onNext(Object) on this subject recursively).
In your particular case, signaling "B" to the 3rd observer happens first, while the subject was still about to signal "A" to it, hence the swapped order.
Use toSerialized on the subject to make sure this doesn't happen:
val behaviorSubject = BehaviorSubject.create<String>().toSerialized()

Related

Atomic compareAndSet parameters are evaluated even when they're not used

I have the following code that sets an Atomic variable (both java.util.concurrent.atomic and monix.execution.atomic behave the same way):
import monix.execution.atomic.AtomicAny

class Foo {
  val s = AtomicAny(null: String)

  def foo() = {
    println("called")
    /* Side Effects */
    "foo"
  }

  def get(): String = {
    s.compareAndSet(null, foo())
    s.get
  }
}

val f = new Foo
f.get // Foo.s set from null to "foo", prints "called"
f.get // Foo.s not updated, but still prints "called"
The second time compareAndSet runs, it does not update the value, but foo is still called. This is a problem because foo has side effects (in my real code it creates an Akka actor, and I get an error because it tries to create a duplicate actor).
How can I make sure the second parameter is not evaluated unless it is actually used? (Preferably without using synchronized.)
I need to pass an implicit parameter to foo, so a lazy val would not work. E.g.
lazy val s = get() // Error: cannot provide implicit parameter

def foo()(implicit context: Context) = {
  println("called")
  /* Side Effects */
  "foo"
}

def get()(implicit context: Context): String = {
  s.compareAndSet(null, foo())
  s.get
}
Updated answer
The quick answer is to put this code inside an actor and then you don't have to worry about synchronisation.
If you are using Akka Actors you should never need to do your own thread synchronisation using low-level primitives. The whole point of the actor model is to limit the interaction between threads to just passing asynchronous messages. This provides all the thread synchronisation that you need and guarantees that an actor processes a single message at a time in a single-threaded manner.
You should definitely not have a function that is accessed simultaneously by multiple threads that creates a singleton actor. Just create the actor when you have the information you need and pass the ActorRef to any other actors that need it using dependency injection or a message. Or create the actor at the start and initialise it when the first message arrives (using context.become to manage the actor state).
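As a minimal sketch (classic Akka actors; the names FooSupervisor and FooWorker are invented for this example), creating the actor once on the first message and then switching behaviour with context.become could look like this:

import akka.actor.{Actor, ActorRef, Props}

// Sketch: the supervisor creates its child exactly once, when the first message
// that needs it arrives, then reuses the same ActorRef for all later messages.
class FooSupervisor extends Actor {
  def receive: Receive = {
    case msg =>
      val child = context.actorOf(Props[FooWorker](), "foo-worker") // created once
      child ! msg
      context.become(initialized(child))
  }

  def initialized(child: ActorRef): Receive = {
    case msg => child ! msg // no synchronisation needed: one message at a time
  }
}

class FooWorker extends Actor {
  def receive: Receive = { case msg => println(s"handling $msg") }
}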
Original answer
The simplest solution is just to use a lazy val to hold your instance of foo:
class Foo {
  lazy val foo = {
    println("called")
    /* Side Effects */
    "foo"
  }
}
This will create foo the first time it is used and after that will just return the same value.
If this is not possible for some reason, use an AtomicInteger initialised to 0 and then call incrementAndGet. If this returns 1 then it is the first pass through this code and you can call foo.
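A minimal sketch of that AtomicInteger guard, with foo() standing in for the side-effecting initialisation from the question:

import java.util.concurrent.atomic.AtomicInteger

// Sketch: only the caller that sees incrementAndGet() return 1 runs foo().
object InitOnce {
  private val passes = new AtomicInteger(0)

  def foo(): String = { println("called"); "foo" }

  def initOnce(): Unit = {
    if (passes.incrementAndGet() == 1) {
      foo() // only the very first caller ever reaches this branch
    }
  }
}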
Explanation:
Atomic operations such as compareAndSet require support from the CPU instruction set, and modern processors have single atomic instructions for such operations. In some cases (e.g. the cache line is held exclusively by this processor) the operation can be very fast. In other cases (e.g. the cache line is also in the cache of another processor) the operation can be significantly slower and can impact other threads.
The result is that the CPU must be holding the new value before the atomic instruction is executed. So the value must be computed before it is known whether it is needed or not.
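Put differently, compareAndSet takes ordinary by-value parameters, so the argument expression is evaluated at the call site before the method ever runs; only a by-name parameter defers evaluation. A tiny sketch of the difference:

// compareAndSet takes its arguments by value, so foo() is evaluated at the call
// site before the CAS even runs. By-name parameters (=>) are what defer evaluation:
def byValue(x: String): Unit = ()    // x is evaluated before the call
def byName(x: => String): Unit = ()  // x is evaluated only if the body uses it

byValue { println("evaluated eagerly"); "foo" } // prints "evaluated eagerly"
byName { println("never evaluated"); "foo" }    // prints nothing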

How to clean up substreams in continuous Akka streams

Given a very long-running stream of events flowing through something like the setup shown below, after a long time there will be lots of substreams that are no longer needed.
Is there a way to clean up a specific substream at a given time? For example, the substream created by id 3 should be cleaned up, and the state in the scan method discarded, at 13PM (the expires property of Wid).
case class Wid(id: Int, v: String, expires: LocalDateTime)

test("Substream with scan") {
  val (pub, sub) = TestSource.probe[Wid]
    .groupBy(Int.MaxValue, _.id)
    .scan("")((a: String, b: Wid) => a + b.v)
    .mergeSubstreams
    .toMat(TestSink.probe[String])(Keep.both)
    .run()
}
TL;DR You can close a substream after some time. However, using input to dynamically set the time with built-in stages is another matter.
Closing a substream
To close a flow, you usually complete it (from upstream), but you can also cancel it (from downstream). For instance, the take(n: Int) flow will cancel once n elements have gone through.
Now, in the groupBy case, you cannot complete a substream, since the upstream is shared by all substreams, but you can cancel it. How to do that depends on what condition you want to use to end it.
However, be aware that groupBy drops inputs for subflows that have already been closed: if a new element with id 3 arrives at the groupBy after the 3-substream has been closed, it will simply be ignored and the next element will be pulled in. The reason for this is probably that some elements might otherwise be lost between the closing and re-opening of a substream. Also, if your stream is supposed to run for a very long time, this will affect performance, because each element is checked against the list of closed substreams before being forwarded to the relevant (live) substream. You might want to implement your own stateful filter (say, with a Bloom filter) if you're not satisfied with that performance.
To close a substream, I usually use either take (if you want only a given number of elements, which is probably not the case for an infinite stream) or some kind of timeout: completionTimeout if you want a fixed time from materialization to closure, or idleTimeout if you want to close when no element has come through for some time. Note that these stages do not cancel the stream but fail it, so you have to catch the exception with a recover or recoverWith stage to turn the failure into a cancel (recoverWith lets you cancel without emitting any last element, by recovering with Source.empty).
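As a rough sketch, building on the test from the question (an implicit materializer is assumed, and the 10-second timeout is illustrative), closing idle substreams and turning the resulting failure into a clean completion could look like this:

import java.util.concurrent.TimeoutException
import scala.concurrent.duration._
import akka.stream.scaladsl.Source

val (pub, sub) = TestSource.probe[Wid]
  .groupBy(Int.MaxValue, _.id)
  .scan("")((a: String, b: Wid) => a + b.v)
  .idleTimeout(10.seconds)                                  // fail the substream after 10 idle seconds
  .recoverWith { case _: TimeoutException => Source.empty } // turn the failure into a clean completion
  .mergeSubstreams
  .toMat(TestSink.probe[String])(Keep.both)
  .run()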
Dynamically set the timeout
Now what you want is to set the closing time dynamically according to the first passing element. This is more complicated because the materialization of streams is independent of the elements that pass through them. Indeed, in the usual (without groupBy) case, streams are materialized before any element goes through them, so it makes no sense to use elements to materialize them.
I had similar issues in that question, and ended up using a modified version of groupBy with signature
paramGroupBy[K, OO, MM](maxSubstreams: Int, f: Out => K, paramSubflow: K => Flow[Out, OO, MM])
that allows you to define every substream using the key that defined it. This can be modified to take the first element (instead of the key) as the parameter.
Another (probably simpler, in your case) way would be to write your own stage that does exactly what you want: get the end time from the first element and cancel the stream at that time. Here is an example implementation (I used a scheduler instead of keeping state):
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler, TimerGraphStageLogic}
import scala.concurrent.duration.FiniteDuration

object CancelAfterTimer

class CancelAfter[T](getTimeout: T => FiniteDuration) extends GraphStage[FlowShape[T, T]] {
  val in = Inlet[T]("CancelAfter.in")
  val out = Outlet[T]("CancelAfter.out")
  override val shape: FlowShape[T, T] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new TimerGraphStageLogic(shape) with InHandler with OutHandler {
      override def onPush(): Unit = {
        val elem = grab(in)
        if (!isTimerActive(CancelAfterTimer))
          scheduleOnce(CancelAfterTimer, getTimeout(elem))
        push(out, elem)
      }

      override def onTimer(timerKey: Any): Unit =
        completeStage() // this will cancel the upstream and complete the downstream

      override def onPull(): Unit = pull(in)

      setHandlers(in, out, this)
    }
}
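A hypothetical way to wire this stage into the test from the question, deriving each substream's timeout from the expires field of its first element (the untilExpiry helper below is made up for this sketch):

import java.time.{Duration => JavaDuration, LocalDateTime}
import scala.concurrent.duration._

// How long from "now" until the first element's expiry (never negative).
def untilExpiry(w: Wid): FiniteDuration =
  math.max(0L, JavaDuration.between(LocalDateTime.now(), w.expires).toMillis).millis

val (pub, sub) = TestSource.probe[Wid]
  .groupBy(Int.MaxValue, _.id)
  .via(new CancelAfter[Wid](untilExpiry)) // cancel the substream once its first element expires
  .scan("")((a: String, b: Wid) => a + b.v)
  .mergeSubstreams
  .toMat(TestSink.probe[String])(Keep.both)
  .run()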

How to get truly atomic update for TrieMap.getOrElseUpdate

As I understand it, TrieMap.getOrElseUpdate is still not truly atomic, and this fix only addresses the returned result (it could return different instances for different callers before the fix), so the updater function might still be called several times, as the documentation (for 2.11.7) says:
Note: This method will invoke op at most once. However, op may be invoked without the result being added to the map if a concurrent process is also trying to add a value corresponding to the same key k.
(I've checked that manually for 2.11.7; it's still "at least once".)
How can I guarantee a one-time call (if I use TrieMap for factories)?
I think this solution should work for my requirements:
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap

trait LazyComp { val get: Int }

val map = new TrieMap[String, LazyComp]()
val count = new AtomicInteger() // just for the test, you don't need it

def getSingleton(key: String) = {
  val v = new LazyComp {
    lazy val get = {
      // compute something
      count.incrementAndGet() // just for the test, you don't need it
    }
  }
  map.putIfAbsent(key, v).getOrElse(v).get
}
I believe lazy val actually uses synchronized inside. Also, the code inside get should be safe from exceptions.
However, performance could be improved in future: SIP-20
Test:
scala> (0 to 10000000).par.map(_ => getSingleton("zzz")).last
res8: Int = 1
P.S. Java has the computeIfAbsent method on ConcurrentHashMap, which I could use as well.
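For completeness, here is a sketch of that ConcurrentHashMap alternative as it could look from Scala (2.12+, where the lambda converts to a java.util.function.Function); computeIfAbsent applies the mapping function at most once per key:

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Sketch: computeIfAbsent runs the mapping function atomically, at most once per key;
// concurrent callers for the same key block until the value is installed.
object ChmSingleton {
  private val cache = new ConcurrentHashMap[String, Int]()
  private val count = new AtomicInteger() // just for the test, as above

  def getSingleton(key: String): Int =
    cache.computeIfAbsent(key, _ => count.incrementAndGet())
}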

Stream as constructor arg sometimes fully evaluated during early class initialization

Streams can be used as class constructor arguments:
scala> (0 to 10).toStream.map(i => { println("bla" + i); -i })
bla0
res0: scala.collection.immutable.Stream[Int] = Stream(0, ?)
scala> class B(val a: Seq[Int]) { println(a.tail.head) }
defined class B
scala> new B(res0)
bla1
-1
res1: B = B@fdb84e
So the Stream does not get fully evaluated, even though it is handed in as a Seq argument and is already partly evaluated. Works as expected.
I have a class like this:
class HazelSimpleResultSet[T](col: Seq[T], comparator: Comparator[T]) extends HGRandomAccessResult[T] with CountMe {
  val foo: Int = -1 // col of type Stream[T] already fully evaluated here
  def count = col.size
  ....
}
where HGRandomAccessResult and CountMe are simple interfaces.
In most cases I want to use Streams as the col constructor argument, to avoid costly operations. In the debugger I can see that it works in some cases, since the value displayed for col remains Stream(xy, ?) with "tlVal = null", even after initialization of HazelSimpleResultSet.
Furthermore, for testing, I include println calls in the blocks that construct the Streams, like this:
keyvalues.foldLeft(Stream.empty[KeyType]){ case (a, b) => ({ println("evaluating "+ b); unpack[KeyType](b)}) #:: a}
in order to follow in the console exactly when the Stream is evaluated.
So, in some cases it works, but in others the Stream gets fully evaluated during the very first moments of initialization of HazelSimpleResultSet. I cannot see any relevant difference in the Streams handed in; I'm just sure they are unevaluated Streams up to that moment.
"Stepping into" with the debugger, I can see that it gets evaluated on the line of the class definition itself, before even reaching the class body, i.e. before initialization of any field.
EDIT:
I can define the class in a (suboptimal) way such that no field at all references the Stream, and I still get that behaviour.
The CountMe interface defines a count method, which calls col.size, which would then evaluate the whole Stream. I tried to define count in terms of a lazy val size, but that didn't make a difference.
I'm a bit at a loss as to why it doesn't work in some cases. Does anybody have any hints about hidden caveats of Streams?
EDIT:
An important note: the Stream object wraps some serious state that it needs in order to evaluate, i.e. a reference to a NoSQL database (Hazelcast).
Question: what are the caveats here? Is there something in particular I must take care of when my Stream carries stateful references necessary for evaluation?
If you create a Stream like this:
Stream({ println("eval 1"); 1 }, { println("eval 2"); 2 })
then you are actually calling Stream.apply which is implemented like this:
/** A stream consisting of given elements */
override def apply[A](xs: A*): Stream[A] = xs.toStream
which means that what actually happens is:
1. All elements are evaluated!
2. A Seq containing these elements is created.
3. A Stream is created out of this Seq.
So as you can see, if you create your Stream this way, all its elements are evaluated eagerly. This is not how you create a lazily evaluated Stream. What you probably want are the #:: and #::: operators, which evaluate their operands lazily. Look up the docs for their usage.
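For contrast, a quick sketch with the pre-2.13 scala.collection.immutable.Stream used in the question:

// Stream.apply evaluates every element before the Stream exists:
val eager = Stream({ println("eval 1"); 1 }, { println("eval 2"); 2 }) // prints both lines

// #:: evaluates only the head eagerly; the tail stays an unevaluated thunk:
val lazyTail = { println("eval 1"); 1 } #:: { println("eval 2"); 2 } #:: Stream.empty[Int]
// prints "eval 1" only
lazyTail.tail.head // forcing the tail prints "eval 2" and returns 2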

Scala concurrency on iterators as queues

I'm not really sure of the correct language of my problem, so feel free to provide me with the right terms.
Say I have a process A, which outputs an iterator (lazy evaluation)
This produces Iterator[A]
I then have another process B, which maps the events returning
Iterator[B]
This continues for several more processes
Iterator[A] -> Iterator[B] -> Iterator[C] -> ---
Now eventually I evaluate this stream into a List[Z].
This saves me the memory hit of having a List[A] -> List[B] -> List[C], etc.
Now I want to improve performance by introducing parallelisation, but I don't want to parallelise the evaluation of each element across the iterators, but rather each iterator stack. So in this case a thread for process A fills a Queue[A] for Iterator[A], a thread for process B takes from Queue[A], applies whatever mapping, and then adds to Queue[B] for Iterator[B] to read from.
Now I have done this before in other languages by designing my own Async queues, I was wondering what Scala has to solve this.
Here's a first-stab solution I made using an actor.
It's fully blocking, so maybe an implementation using futures could be developed:
import scala.actors.Actor
import scala.actors.Actor._
import scala.collection.mutable.SynchronizedQueue

case class AsyncIterator[T](iterator: Iterator[T]) extends Iterator[T] {
  private val queue = new SynchronizedQueue[T]()
  private var end = !iterator.hasNext

  def hasNext: Boolean = {
    if (end) false
    else if (!queue.isEmpty) true
    else hasNext // busy-wait until the producer catches up or the source ends
  }

  def next(): T = {
    while (queue.isEmpty) {
      if (end) throw new Exception("blah")
    }
    queue.dequeue()
  }

  private val producer: Actor = actor {
    loop {
      if (!iterator.hasNext) {
        end = true
        exit()
      } else {
        queue.enqueue(iterator.next())
      }
    }
  }

  producer.start()
}
Since you're open to alternative languages, how about Go?
There was a discussion recently about how to construct an event-driven pipeline, which would achieve the same thing as you describe but in a completely different way.
It's arguably easier to think about and design an event pipeline than it is to reason about lazy iterators because it becomes a data flow system in which the key question at each stage is 'what does this stage do with a single entity?' rather than 'how can I iterate efficiently over many entities?'
Once an event-driven pipeline has been implemented, the question of how to make it concurrent or parallel is moot - you've already done it.