Rx Replay and Join - system.reactive

I created a function and it works, but I don't understand why.
Task:
There are two streams:
Notifications stream N
Quotes stream Q
The function should pair up notifications with quotes under the following conditions:
When a new notification arrives on the N stream, it should be paired with the latest quote from the Q stream.
If notifications arrive before the first quote, they should all be paired with the first quote once it arrives.
If the N stream starts some time after the Q stream, it should still have access to the last quote from the Q stream.
Marble diagrams
1.
--N1--N2--N3-------------
---------------Q1--------
---------------N1-N2-N3--    (each paired with Q1)
2.
-----------N1----N2--N3-----
--Q1--Q2------Q3------------
-----------N1----N2--N3-----    (paired with Q2, Q3, Q3)
Now, this is my function:
//qs, ns - hot streams
var rqs = qs.replay(null, 1);
qs.connect();
rqs.connect();
ns.connect();

var cs = ns.join(rqs,
    _ => rqs,
    _ => qs,
    (n, q) => { return {n: n, q: q}; }
).distinctUntilChanged(x => x.n);
https://jsbin.com/zeyiyeg/edit?js,console
And here is what I don't understand.
Why is rqs always notified before qs?
I can't grasp the logic of join in this case: when a new notification and quote arrive, join opens the next window and waits until the _ => rqs duration stream completes. Why does it complete at all? rqs is a hot stream and should never complete.
Thanks.

The logic of the function above is as follows:
rqs and qs are hot streams that never complete, which is wrong here: the join duration selectors should return streams that do complete. As written, each and every quote and notification produces a result in the output stream, and a new window is created every time a new notification arrives but is never closed. In other words, this is wrong.
I still wonder why the rqs replay stream receives a value before the source qs stream, even though it was connected after qs.
//replay latest quotes hash {"A": latestPrice, "B": latestPrice}
var rqs = qs
    .scan({}, (x, v) => { x[v.t] = v; return x; })
    .replay(null, 1);
qs.connect();
rqs.connect();
ns.connect();

//group notifications by ticker
var cs = ns.groupBy(n => n.t).selectMany(x => {
    //quotes stream for this group's ticker
    var gqs = rqs.where(q => q[x.key]).map(q => q[x.key]);
    //wait till the first quote arrives, collecting all notifications until then
    var s = x.join(
        gqs.take(1),
        _ => gqs.take(1),
        _ => gqs.take(1),
        (n, q) => { return {n: n, q: q}; }
    );
    //s.subscribe(createObserver("s:" + x.key));
    //after the first pairing, pair every new notification with the latest quote
    return s.concat(x.combineLatest(gqs, (n, q) => {
        return {n: n, q: q};
    })).distinctUntilChanged(x => x.n);
});
//rqs.subscribe(createObserver("rqs"));
cs.subscribe(createObserver("cs"));
Bonus: quotes and notifications are grouped by ticker.

Related

Is it possible to extract the substream key in Akka Streams?

I can't seem to find any documentation on this, but I know that Akka Streams stores in memory the keys used to group a stream into substreams when calling groupBy. Is it possible to extract those keys from the substream? Say I create a bunch of substreams from my main stream, pass those through a fold that counts the objects in each substream, and then store the count in a class. Can I get the key of the substream to also pass to that class? Or is there a better way of doing this? I need to count each element per substream, but I also need to store which group the count belongs to.
A nice example is shown in the stream-cookbook:
val counts: Source[(String, Int), NotUsed] = words
  // split the words into separate streams first
  .groupBy(MaximumDistinctWords, identity)
  // pair each word with an initial count of 1
  .map(_ -> 1)
  // add counting logic to the streams
  .reduce((l, r) => (l._1, l._2 + r._2))
  // get a stream of word counts
  .mergeSubstreams
Then the following:
val words = Source(List("Hello", "world", "let's", "say", "again", "Hello", "world"))
counts.runWith(Sink.foreach(println))
Will print:
(world,2)
(Hello,2)
(let's,1)
(again,1)
(say,1)
Another example I thought of was counting numbers by their remainders. So the following, for example:
Source(0 to 101)
  .groupBy(10, x => x % 4)
  .map(e => e % 4 -> 1)
  .reduce((l, r) => (l._1, l._2 + r._2))
  .mergeSubstreams.to(Sink.foreach(println)).run()
will print:
(0,26)
(1,26)
(2,25)
(3,25)
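If the goal is to carry the key into a dedicated class rather than a tuple, the same pairing trick applies. A minimal sketch of that idea, assuming a hypothetical GroupCount case class and an implicit materializer in scope:

case class GroupCount(key: Int, count: Int) // hypothetical result class

Source(0 to 101)
  .groupBy(10, _ % 4)
  // attach the group key to each element so it survives the fold
  .map(e => GroupCount(e % 4, 1))
  .reduce((l, r) => GroupCount(l.key, l.count + r.count))
  .mergeSubstreams
  .runWith(Sink.foreach(println))

Each substream then emits a single GroupCount such as GroupCount(0, 26), so both the key and the count reach whatever class stores them.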

For comprehension and number of function creation

Recently I had an interview for a Scala Developer position, and I was asked the following question:
// matrix 100x100 (content unimportant)
val matrix = Seq.tabulate(100, 100) { case (x, y) => x + y }

// A
for {
  row <- matrix
  elem <- row
} print(elem)

// B
val func = print _
for {
  row <- matrix
  elem <- row
} func(elem)
and the question was: Which implementation, A or B, is more efficient?
We all know that for comprehensions can be translated to
// A
matrix.foreach(row => row.foreach(elem => print(elem)))
// B
matrix.foreach(row => row.foreach(func))
B can be written as matrix.foreach(row => row.foreach(print _))
The supposedly correct answer is B, because A will create the print function 100 more times.
I have checked the Language Specification but still fail to understand the answer. Can somebody explain this to me?
In short:
Example A is faster in theory; in practice you shouldn't be able to measure any difference, though.
Long answer:
As you already found out
for {xs <- xxs; x <- xs} f(x)
is translated to
xxs.foreach(xs => xs.foreach(x => f(x)))
This is explained in §6.19 SLS:
A for loop
for ( p <- e; p' <- e' ... ) e''
where ... is a (possibly empty) sequence of generators, definitions, or guards, is translated to
e .foreach { case p => for ( p' <- e' ... ) e'' }
Now, when one writes a function literal, one gets a new instance every time the literal is evaluated (§6.23 SLS). This means that
xs.foreach(x => f(x))
is equivalent to
xs.foreach(new scala.Function1 { def apply(x: T) = f(x)})
When you introduce a local function value
val g = f _; xxs.foreach(xs => xs.foreach(x => g(x)))
you are not introducing an optimization because you still pass a function literal to foreach. In fact the code is slower because the inner foreach is translated to
xs.foreach(new scala.Function1 { def apply(x: T) = g.apply(x) })
where an additional call to the apply method of g happens. Though, you can optimize when you write
val g = f _; xxs.foreach(xs => xs.foreach(g))
because the inner foreach now is translated to
xs.foreach(g)
which means that the function g itself is passed to foreach.
This would mean that B is faster in theory, because no anonymous function would need to be created each time the body of the for comprehension is executed. However, the optimization mentioned above (passing the function itself directly to foreach) is not applied to for comprehensions, because, as the spec says, the translation includes the creation of function literals, so unnecessary function objects are always created. (The compiler could optimize that as well, but it doesn't; optimizing for comprehensions is difficult and still does not happen in 2.11.) All in all this means that A is more efficient, but B would be more efficient if it were written without a for comprehension (so that no function literal is created for the innermost function).
Nevertheless, all of these rules apply only in theory, because in practice the scalac backend and the JVM itself can both do optimizations, not to mention optimizations done by the CPU. Furthermore, your example contains a syscall that is executed on every iteration; it is probably the most expensive operation here and outweighs everything else.
I'd agree with sschaef and say that A is the more efficient option.
Looking at the generated class files we get the following anonymous functions and their apply methods:
MethodA:
anonfun$2 -- row => row.foreach(new anonfun$2$$anonfun$1)
anonfun$2$$anonfun$1 -- elem => print(elem)
i.e. matrix.foreach(row => row.foreach(elem => print(elem)))
MethodB:
anonfun$3 -- x => print(x)
anonfun$4 -- row => row.foreach(new anonfun$4$$anonfun$2)
anonfun$4$$anonfun$2 -- elem => func(elem)
i.e. matrix.foreach(row => row.foreach(elem => func(elem)))
where func is just another indirection before the call to print. In addition, func needs to be looked up, i.e. through a method call on an instance (this.func()), for each row.
So for Method B, one extra object is created (func), and there is one additional function call per element.
The most efficient option would be
matrix.foreach(row => row.foreach(func))
as this has the least number of objects created and does exactly as you would expect.
Benchmark
Summary
Method A is nearly 30% faster than method B.
Link to code: https://gist.github.com/ziggystar/490f693bc39d1396ef8d
Implementation Details
I added method C (two while loops) and method D (fold, sum). I also increased the size of the matrix, used an IndexedSeq instead, and replaced the print with something less heavy (summing all entries).
Strangely, the while construct is not the fastest. But if one uses Array instead of IndexedSeq, it becomes the fastest by a large margin (factor 5, no boxing anymore). Using explicitly boxed integers, methods A, B, and C are all equally fast; in particular, they are 50% faster than the implicitly boxed versions of A and B.
Results
A
4.907797735
4.369745787
4.375195012000001
4.7421321800000005
4.35150636
B
5.955951859000001
5.925475619
5.939570085000001
5.955592247
5.939672226000001
C
5.991946029
5.960122757000001
5.970733164
6.025532582
6.04999499
D
9.278486201
9.265983922
9.228320372
9.255641645
9.22281905
verify results
999000000
999000000
999000000
999000000
>$ scala -version
Scala code runner version 2.11.0 -- Copyright 2002-2013, LAMP/EPFL
Code excerpt
val matrix = IndexedSeq.tabulate(1000, 1000) { case (x, y) => x + y }

def variantA(): Int = {
  var r = 0
  for {
    row <- matrix
    elem <- row
  } {
    r += elem
  }
  r
}

def variantB(): Int = {
  var r = 0
  val f = (x: Int) => r += x
  for {
    row <- matrix
    elem <- row
  } f(elem)
  r
}

def variantC(): Int = {
  var r = 0
  var i1 = 0
  while (i1 < matrix.size) {
    var i2 = 0
    val row = matrix(i1)
    while (i2 < row.size) {
      r += row(i2)
      i2 += 1
    }
    i1 += 1
  }
  r
}

def variantD(): Int = matrix.foldLeft(0)(_ + _.sum)
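For reference, a minimal sketch of the Array-based variant mentioned under Implementation Details; variantE and arrayMatrix are names introduced here for illustration, not part of the linked benchmark code:

val arrayMatrix: Array[Array[Int]] = Array.tabulate(1000, 1000) { case (x, y) => x + y }

// same double while loop as variantC, but over Array[Array[Int]],
// so the Int values stay unboxed
def variantE(): Int = {
  var r = 0
  var i1 = 0
  while (i1 < arrayMatrix.length) {
    val row = arrayMatrix(i1)
    var i2 = 0
    while (i2 < row.length) {
      r += row(i2)
      i2 += 1
    }
    i1 += 1
  }
  r
}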

Can I Pair Two Sequences Together by a Matching Key?

Let's say sequence one is going out to the web to retrieve the contents of sites 1, 2, 3, 4, 5 (but will return in unpredictable order).
Sequence two is going to a database to retrieve context about these same records 1, 2, 3, 4, 5 (but for the purposes of this example will return in unpredictable order).
Is there an Rx extension method that will combine these into one sequence when each matching pair is ready in both sequences? I.e., if the first sequence returns in the order 4, 2, 3, 5, 1 and the second sequence returns in the order 1, 4, 3, 2, 5, the merged sequence would be (4,4), (3,3), (2,2), (1,1), (5,5), with each pair emitted as soon as it is ready. I've looked at Merge and Zip, but they don't seem to be exactly what I'm looking for.
I wouldn't want to discard pairs that don't match, which I think rules out a simple .Where.Select combination.
var paired = Observable
    .Merge(aSource, bSource)
    .GroupBy(i => i)
    .SelectMany(g => g.Buffer(2).Take(1));
The test below gives the correct results. It's just taking ints at the moment; if you're using data with keys and values, you'll need to group by i.Key instead of i.
var aSource = new Subject<int>();
var bSource = new Subject<int>();
paired.Subscribe(g => Console.WriteLine("{0}:{1}", g.ElementAt(0), g.ElementAt(1)));
aSource.OnNext(4);
bSource.OnNext(1);
aSource.OnNext(2);
bSource.OnNext(4);
aSource.OnNext(3);
bSource.OnNext(3);
aSource.OnNext(5);
bSource.OnNext(2);
aSource.OnNext(1);
bSource.OnNext(5);
yields:
4:4
3:3
2:2
1:1
5:5
Edit in response to Brandon:
For the situation where the items are different classes (AClass and BClass), the following adjustment can be made.
using Pair = Tuple<AClass, BClass>;

var paired = Observable
    .Merge(aSource.Select(a => new Pair(a, null)), bSource.Select(b => new Pair(null, b)))
    .GroupBy(p => p.Item1 != null ? p.Item1.Key : p.Item2.Key)
    .SelectMany(g => g.Buffer(2).Take(1))
    .Select(g => new Pair(
        g.ElementAt(0).Item1 ?? g.ElementAt(1).Item1,
        g.ElementAt(0).Item2 ?? g.ElementAt(1).Item2));
So you have 2 observable sequences that you want to pair together?
Pair from Rxx along with GroupBy can help here. I think code similar to the following might do what you want:
var pairs = stream1.Pair(stream2)
    .GroupBy(pair => pair.Switch(source1 => source1.Key, source2 => source2.Key))
    .SelectMany(group => group.Take(2).ToArray()) // each group will have at most 2 results (1 left and 1 right)
    .Select(pair =>
    {
        T1 result1 = default(T1);
        T2 result2 = default(T2);
        foreach (var r in pair)
        {
            if (r.IsLeft) result1 = r.Left;
            else result2 = r.Right;
        }
        return new { result1, result2 };
    });
I've not tested it, and not added in anything for error handling, but I think this is what you want.

Is there an Iteratee-like concept which pulls data from multiple sources?

It is possible to pull on demand from a number (say two for simplicity) of sources using streams (lazy lists). Iteratees can be used to process data coming from a single source.
Is there an Iteratee-like functional concept for processing multiple input sources? I could imagine an Iteratee whose state signals which source it wants to pull from.
To do this using pipes you nest the Pipe monad transformer within itself, once for each producer you wish to interact with. For example:
import Control.Monad
import Control.Monad.Trans
import Control.Pipe

producerA, producerB :: (Monad m) => Producer Int m ()
producerA = mapM_ yield [1,2,3]
producerB = mapM_ yield [4,5,6]

consumes2 :: (Show a, Show b) => Consumer a (Consumer b IO) r
consumes2 = forever $ do
    a <- await        -- await from outer producer
    b <- lift await   -- await from inner producer
    lift $ lift $ print (a, b)
Just like a Haskell curried function of multiple variables, you partially apply it to each source using composition and runPipe:
consumes1 :: (Show b) => Consumer b IO ()
consumes1 = runPipe $ consumes2 <+< producerA
fullyApplied :: IO ()
fullyApplied = runPipe $ consumes1 <+< producerB
The above function, when run, outputs:
>>> fullyApplied
(1, 4)
(2, 5)
(3, 6)
This trick works for yielding or awaiting to any number of pipes upstream or downstream. It also works for proxies, the bidirectional analogs to pipes.
Edit: Note that this also works for any iteratee library, not just pipes. In fact, John Milikin and Oleg were the original advocates for this approach and I just stole the idea from them.
We're using Machines in Scala to pull in not just two, but an arbitrary number of sources.
Two examples of binary joins are provided by the library itself, on the Tee module: mergeOuterJoin and hashJoin. Here is what the code for hashJoin looks like (it reads the entire left input into a Map first, then streams the right input against it):
/**
 * A natural hash join according to keys of type `K`.
 */
def hashJoin[A, B, K](f: A => K, g: B => K): Tee[A, B, (A, B)] = {
  def build(m: Map[K, A]): Plan[T[A, B], Nothing, Map[K, A]] = (for {
    a  <- awaits(left[A])
    mp <- build(m + (f(a) -> a))
  } yield mp) orElse Return(m)
  for {
    m <- build(Map())
    r <- (awaits(right[B]) flatMap (b => {
      val k = g(b)
      if (m contains k) emit(m(k) -> b) else Return(())
    })) repeatedly
  } yield r
}
This code builds up a Plan which is "compiled" to a Machine with the repeatedly method. The type being built here is Tee[A, B, (A, B)], which is a machine with two inputs. You request input on the left or right with awaits(left) and awaits(right), and you output with emit.
There is also a Haskell version of Machines.
Conduit (and it can be built for Pipes, but that code hasn't been released yet) has a zip primitive that takes two upstreams and combines them into a stream of tuples.
Check out the pipes library, where vertical concatenation might do what you want. For example,
import Control.Pipe
import Control.Monad
import Control.Monad.State
import Data.Void

source0, source1 :: Producer Char IO ()
source0 = mapM_ yield "say"
source1 = mapM_ yield "what"

sink :: Show b => Consumer b IO ()
sink = forever $ await >>= \x -> lift $ print x

pipeline :: Pipe () Void IO ()
pipeline = sink <+< (source0 >> source1)
The sequencing operator (>>) vertically concatenates the sources, yielding the following output (on runPipe):
's'
'a'
'y'
'w'
'h'
'a'
't'

Nested lazy for-comprehension

I have a deeply "nested" for-comprehension, simplified to 3 levels below: x, y, and z. I was hoping that making only x a Stream would make the y and z computations lazy too:
val stream = for {
  x <- List(1, 2, 3).toStream
  y <- List("foo", "bar", "baz")
  z = {
    println("Processed " + x + y)
    x + y
  }
} yield z

stream take (2) foreach (doSomething)
But this computes all 3 elements, as evidenced by the 3 prints. I'd like to only compute the first 2, since those are all I take from the stream. I can work around this by calling toStream on the second List and so on. Is there a better way than calling that at every level of the for-comprehension?
What it prints is:
Processed 1foo
Processed 1bar
Processed 1baz
stream: scala.collection.immutable.Stream[String] = Stream(1foo, ?)
scala> stream take (2) foreach (println)
1foo
1bar
The head of a Stream is always strictly evaluated, which is why you see Processed 1foo etc. and not Processed 2foo etc. This is printed when you create the Stream, or more precisely, when the head of the stream is evaluated.
You are correct that if you only wish to process each resulting element one by one, then all the generators have to be Streams. You can get around calling toStream by making them Streams to start with, as in the example below.
stream is a Stream[String] and its head needs to be evaluated. If you don't want to calculate a value eagerly, you could either prepend a dummy value or, better, make your value stream lazy:
lazy val stream = for {
  x <- Stream(1, 2, 3)
  y <- Stream("foo", "bar", "baz")
  z = { println("Processed " + x + y); x + y }
} yield z
This does not do any "processing" until you take each value:
scala> stream take 2 foreach println
Processed 1foo
1foo
Processed 1bar
1bar
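For completeness, a minimal sketch of the other option mentioned above (prepending a dummy value); since #:: takes its tail by name, the for-comprehension is not evaluated until the tail is forced:

val stream = "dummy" #:: (for {
  x <- Stream(1, 2, 3)
  y <- Stream("foo", "bar", "baz")
  z = { println("Processed " + x + y); x + y }
} yield z)

// nothing has been processed at this point; forcing two real elements
// triggers exactly two "Processed ..." lines
stream drop 1 take 2 foreach println

Making the whole val lazy, as shown above, is usually the cleaner choice, since it avoids carrying a dummy element in the stream.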