Rx and buffering - System.Reactive

I'm trying to obtain the following observable (with a buffer capacity of 10 ticks):
Time     0    5    10   15   20   25   30   35   40
         |----|----|----|----|----|----|----|----|
Source    A B  C  D           E  F   G       H
Result               A                 E          H
                     B                 F
                     C                 G
                     D
Phase    |<-------->|--------|<-------->|<-------->|
              B          I        B          B
That is, the behavior is very similar to the Buffer operator, with the difference that the buffering phase is not aligned to fixed time slots but starts at the first symbol pushed during the idle phase. In the example above, the buffering phases (B in the diagram) start with the 'A', 'E' and 'H' symbols, while I marks the idle phase.
Is there a way to compose the observable or do I have to implement it from scratch?
Any help will be appreciated.

Try this:
IObservable<T> source = ...;
IScheduler scheduler = ...;
TimeSpan duration = ...;

IObservable<IList<T>> query = source
    .Publish(obs => obs
        .Buffer(() => obs.Take(1).IgnoreElements()
            .Concat(Observable.Return(default(T)).Delay(duration, scheduler))
            .Amb(obs.IgnoreElements())));
The buffer closing selector is called once at the start and then once whenever a buffer closes. The selector says "The buffer being started now should be closed duration after the first element of this buffer, or when the source completes, whichever occurs first."
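For concreteness, here is a minimal usage sketch of the same pattern (the Subject-based source, the char payload and the 10-second duration are illustrative placeholders, not part of the original answer):
// Hypothetical usage: buffers open at the first element after an idle period
// and close a fixed duration later.
var source = new Subject<char>();
var duration = TimeSpan.FromSeconds(10);
var scheduler = Scheduler.Default;

var query = source
    .Publish(obs => obs
        .Buffer(() => obs.Take(1).IgnoreElements()
            .Concat(Observable.Return(default(char)).Delay(duration, scheduler))
            .Amb(obs.IgnoreElements())));

query.Subscribe(buffer => Console.WriteLine("Buffer: " + string.Join(", ", buffer)));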
Edit: Based on your comments, if you want to make multiple subscriptions to query share a single subscription to source, you can do that by appending .Publish().RefCount() to the query.
IObservable<IList<T>> query = source
    .Publish(obs => obs
        .Buffer(() => obs.Take(1).IgnoreElements()
            .Concat(Observable.Return(default(T)).Delay(duration, scheduler))
            .Amb(obs.IgnoreElements())))
    .Publish()
    .RefCount();

Related

Rx: how to merge observables and only fire if all provided an OnNext in a specified timeframe

I have a number of observable event streams that are all providing events with timestamps. I don't care about the events individually; I need to know when they all fired within a specified timeframe.
For example:
Button one was clicked (don't care)
Button two was clicked (don't care)
Button one was clicked and within 5 seconds button two was clicked (I need this)
I tried "and then when" but I get old events and can't figure out how to filter them out if it is not within the time window.
Thanks!
Edit:
I attempted to create a marble diagram to clarify what I am trying to achieve...
I have a bunch of random event streams represented in the top portion. Some events fire more often than others. I only want to capture the groups of events that fired within a specified time window; in this example I used windows of 3 seconds. The events I want are highlighted in dark black; all other events should be ignored. I hope this helps explain the problem better.
Is sequence important? If so, the usual way of dealing with "this and then that" is flatMapLatest. You can achieve your other constraints by applying them to the stream passed to flatMapLatest. Consider the following example:
const fromClick = x => Rx.Observable.fromEvent(document.getElementById(x), 'click');
const a$ = fromClick('button1');
const b$ = fromClick('button2');
const sub = a$.flatMapLatest(() => b$.first().timeout(5000, Rx.Observable.never()))
.subscribe(() => console.log('condition met'));
Here we're saying "for each click on button 1, start listening to button 2, and return either the first click or nothing (if we hit the timeout)". Here's a working example: https://jsbin.com/zosipociti/edit?js,console,output
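Since the question is tagged for both .NET and JS, a rough Rx.NET sketch of the same idea might look like the following (Select + Switch plays the role of flatMapLatest; the Subject-based buttons are placeholders for real click streams, not part of the original answer):
// Sketch only: buttonOne/buttonTwo stand in for actual click observables.
var buttonOne = new Subject<Unit>();
var buttonTwo = new Subject<Unit>();

IObservable<Unit> conditionMet = buttonOne
    .Select(_ => buttonTwo.Take(1)
        .Timeout(TimeSpan.FromSeconds(5), Observable.Empty<Unit>()))
    .Switch(); // only the window opened by the latest button-1 click is observed

conditionMet.Subscribe(_ => Console.WriteLine("condition met"));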
You want to use the Sample operator. I'm not sure if you want .NET or JS sample code, since you tagged both.
Edit:
Here's a .NET sample. metaSample is an observable of 10 child observables. Each of the child observables produces the numbers 1 to 99 with random time gaps between them, anywhere between 0 and 200 milliseconds.
var random = new Random();
IObservable<IObservable<int>> metaSample = Observable.Generate(1, i => i <= 10, i => i + 1, i =>
    Observable.Generate(1, j => j < 100, j => j + 1, j => j,
        j => TimeSpan.FromMilliseconds(random.Next(200))));
We then Sample each of the child observables every second. This gives us the latest value that occurred in that one-second window. We then merge those sampled streams together:
IObservable<int> joined = metaSample
    .Select(o => o.Sample(TimeSpan.FromSeconds(1)))
    .Merge();
A marble diagram for 5 of them could look like this:
child1: --1----2--3-4---5----6
child2: -1-23---4--5----6-7--8
child3: --1----2----3-4-5--6--
child4: ----1-2--34---567--8-9
child5: 1----2--3-4-5--------6-
t : ------1------2------3-
------------------------------
result: ------13122--45345--5768--
So after 1 second it grabs the latest value from each child and emits it; after 2 seconds, the same. After 3 seconds, notice that child5 hasn't emitted anything, so only 4 numbers are emitted. Obviously with our parameters that's impossible, but it demonstrates how Sample behaves when there are no events.
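To actually watch the merged samples from a console app, one minimal way (a sketch, not part of the original answer) is to block on ForEachAsync:
// Print each sampled value; ForEachAsync returns a Task, and Wait() keeps
// the console app alive until the merged stream completes.
joined.ForEachAsync(i => Console.WriteLine("sampled: " + i)).Wait();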
This is the closest I have come to accomplishing this task... There has to be a cleaner way with GroupJoin, but I can't figure it out!
static void Main(string[] args)
{
    var random = new Random();

    var o1 = Observable.Interval(TimeSpan.FromSeconds(2)).Select(t => "A " + DateTime.Now.ToString("HH:mm:ss"));
    o1.Subscribe(Console.WriteLine);
    var o2 = Observable.Interval(TimeSpan.FromSeconds(3)).Select(t => "B " + DateTime.Now.ToString("HH:mm:ss"));
    o2.Subscribe(Console.WriteLine);
    var o3 = Observable.Interval(TimeSpan.FromSeconds(random.Next(3, 7))).Select(t => "C " + DateTime.Now.ToString("HH:mm:ss"));
    o3.Subscribe(Console.WriteLine);
    var o4 = Observable.Interval(TimeSpan.FromSeconds(random.Next(5, 10))).Select(t => "D " + DateTime.Now.ToString("HH:mm:ss"));
    o4.Subscribe(Console.WriteLine);

    var joined = o1
        .CombineLatest(o2, (s1, s2) => new { e1 = s1, e2 = s2 })
        .CombineLatest(o3, (s1, s2) => new { e1 = s1.e1, e2 = s1.e2, e3 = s2 })
        .CombineLatest(o4, (s1, s2) => new { e1 = s1.e1, e2 = s1.e2, e3 = s1.e3, e4 = s2 })
        .Sample(TimeSpan.FromSeconds(3));

    joined.Subscribe(e => Console.WriteLine($"{DateTime.Now}: {e.e1} - {e.e2} - {e.e3} - {e.e4}"));

    Console.ReadLine();
}

Unexpected Spark caching behavior

I've got a Spark program that essentially does this:
def foo(a: RDD[...], b: RDD[...]) = {
  val c = a.map(...)
  c.persist(StorageLevel.MEMORY_ONLY_SER)
  var current = b
  for (_ <- 1 to 10) {
    val next = some_other_rdd_ops(c, current)
    next.persist(StorageLevel.MEMORY_ONLY)
    current.unpersist()
    current = next
  }
  current.saveAsTextFile(...)
}
The strange behavior I'm seeing is that the Spark stages corresponding to val c = a.map(...) are happening 10 times. I would have expected that to happen only once because of the immediate caching on the next line, but that's not the case. When I look in the "Storage" tab of the running job, very few of the partitions of c are cached.
Also, 10 copies of that stage immediately show as "active". 10 copies of the stage corresponding to val next = some_other_rdd_ops(c, current) show up as pending, and they roughly alternate execution.
Am I misunderstanding how to get Spark to cache RDDs?
Edit: here is a gist containing a program to reproduce this: https://gist.github.com/jfkelley/f407c7750a086cdb059c. It expects as input the edge list of a graph (with edge weights). For example:
a b 1000.0
a c 1000.0
b c 1000.0
d e 1000.0
d f 1000.0
e f 1000.0
g h 1000.0
h i 1000.0
g i 1000.0
d g 400.0
Lines 31-42 of the gist correspond to the simplified version above. I get 10 stages corresponding to line 31 when I would only expect 1.
The problem here is that calling cache is lazy. Nothing will be cached until an action is triggered and the RDD is evaluated. All the call does is set a flag in the RDD to indicate that it should be cached when evaluated.
Unpersist, however, takes effect immediately. It clears the flag indicating that the RDD should be cached and also begins a purge of data from the cache. Since you only have a single action at the end of your application, this means that by the time any of the RDDs are evaluated, Spark does not see that any of them should be persisted!
I agree that this is surprising behaviour. The way that some Spark libraries (including the PageRank implementation in GraphX) work around this is by explicitly materializing each RDD between the calls to cache and unpersist. For example, in your case you could do the following:
def foo(a: RDD[...], b: RDD[...]) = {
  val c = a.map(...)
  c.persist(StorageLevel.MEMORY_ONLY_SER)
  var current = b
  for (_ <- 1 to 10) {
    val next = some_other_rdd_ops(c, current)
    next.persist(StorageLevel.MEMORY_ONLY)
    next.foreachPartition(x => {}) // materialize before unpersisting
    current.unpersist()
    current = next
  }
  current.saveAsTextFile(...)
}
Caching doesn't reduce stages, it just won't recompute the stage every time.
In the first iteration, in the stage's "Input Size" you can see that the data is coming from Hadoop, and that it reads shuffle input. In subsequent iterations, the data is coming from memory and no more shuffle input. Also, execution time is vastly reduced.
New map stages are created whenever shuffles have to be written, for example when there's a change in partitioning, in your case adding a key to the RDD.

RxJava: why are the same transformations recomputed for each observable branch?

Introduction
Consider a simple piece of Java code. It defines two observables a and b in terms of c, which itself is defined using d (a, b, c and d all have type Observable<Integer>):
d = Observable.range(1, 10);
c = d.map(t -> t + 1);
a = c.map(t -> t + 2);
b = c.map(t -> t + 3);
This code can be visualised with a diagram in which each arrow (-->) represents a transformation (the map method):
          .--> a
d --> c --|
          '--> b
If several chains of observables share a common part, then (in theory) new values of the common part could be calculated only once. In the example above, every new d value could be transformed (d --> c) only once and the result used for both a and b.
Question
In practice I observe that a transformation is computed once per chain in which it is used (test). In other words, the example above is more accurately drawn like this:
d --> c --> a
d --> c --> b
With resource-consuming transformations, every new subscription at the end of a chain causes the whole chain to be recomputed (a performance penalty).
Is there a proper way to force a transformation's result to be cached and computed only once?
My research
I found two solutions to this problem:
Pass unique identifiers together with the values and store transformation results in some external storage (external to the Rx library).
Use a Subject to implement a map-like function that hides the start of the observable chain. MapOnce code; test.
Both work. The second is simpler but smells like a hack.
You've identified hot and cold observables.
Observable.range returns a cold observable, though you're describing the resulting queries in a hierarchy as if they're hot; i.e., as if they'd share subscription side effects. They do not. Each time that you subscribe to a cold observable it may cause side effects. In your case, each time that you subscribe to range (or to queries established on range) it generates a range of values.
In the second point of your research, you've identified how to convert a cold observable into a hot observable; namely, using Subjects. (Though in .NET you don't use a Subject<T> directly; instead, you'd use an operator like Publish. I suspect RxJava has a similar operator and I'd recommend using it.)
Additional Details
The definition of hot by my interpretation, as described in detail in my blog post linked above, is when an observable doesn't cause any subscription side effects. (Note that a hot observable may multicast connection side effects when converting from cold to hot, but temperature only refers to the propensity of an observable to cause subscription side effects because that's all we really care about when talking about an observable's temperature in practice.)
The map operator (Select in .NET, mentioned in the conclusion of my blog post) returns an observable that inherits the temperature of its source, so in your bottom diagram c, a and b are cold because d is cold. If, hypothetically, you were to apply publish to d, then c, a and b would inherit the hot temperature from the published observable, meaning that subscribing to them wouldn't cause any subscription side effects. Thus publishing d converts a cold observable, namely range, into a hot observable.
    .--> c --> a
d --|
    '--> c --> b
However, your question was about how to share the computation of c as well as d. Even if you were to publish d, c would still be recomputed for both a and b for each notification from d. Instead, you want to share the results of c between a and b. I call an observable whose computation side effects you want to share "active". (I borrowed the term from the passive and active terminology used in neuroscience to describe electrochemical currents in neurons.)
In your top diagram, you're considering c to be active because it causes significant computation side effects, by your own interpretation. Note that c is active regardless of the temperature of d. To share the computation side effects of an active observable, perhaps surprisingly, you must use publish just like for a cold observable. This is because technically active computations are side effects in the same sense as cold observables, while passive computations have no side effects, just like hot observables. I've restricted the terms hot and cold to only refer to the initial computation side effects, which I call subscription side effects, because that's how people generally use them. I've introduced new terms, active and passive, to refer to computation side effects separately from subscription side effects.
The result is that these terms in practice just blend together intuitively. If you want to share the computation side effects of c, then simply publish it instead of d. By doing so, a and b implicitly become hot because map inherits subscription side effects, as stated previously. Therefore, you're effectively making the right side of the observable hot by publishing either d or c, but publishing c also shares its computation side effects.
If you publish c instead of d, then d remains cold, but it doesn't matter since c hides d from a and b. So by publishing c you're effectively publishing d as well. Therefore, applying publish anywhere within your observable makes the right side of the observable effectively hot. It doesn't matter where you introduce publish or how many observers or pipelines you're creating on the right side of the observable. However, choosing to publish c instead of d also shares the computation side effects of c, which technically completes the answer to your question. Q.E.D.
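To make this concrete, here is a small Rx.NET sketch (not from the original answer; the Console-printing Select merely stands in for c's expensive computation) showing how publishing c shares its work between a and b:
// d is cold; the "expensive" projection plays the role of c.
IObservable<int> d = Observable.Range(1, 10);

IObservable<int> result = d
    .Select(t =>
    {
        Console.WriteLine("computing c for " + t); // runs once per value because c is published
        return t + 1;
    })
    .Publish(c =>
    {
        IObservable<int> a = c.Select(t => t + 2);
        IObservable<int> b = c.Select(t => t + 3);
        return a.Merge(b); // both a and b are built from the single shared c
    });

result.Subscribe(Console.WriteLine);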
An Observable is lazily executed each time it is subscribed to (either explicitly or implicitly via composition).
This code shows how the source emits for a, b, and c:
Observable<Integer> d = Observable.range(1, 10)
.doOnNext(i -> System.out.println("Emitted from source: " + i));
Observable<Integer> c = d.map(t -> t + 1);
Observable<Integer> a = c.map(t -> t + 2);
Observable<Integer> b = c.map(t -> t + 3);
a.forEach(i -> System.out.println("a: " + i));
b.forEach(i -> System.out.println("b: " + i));
c.forEach(i -> System.out.println("c: " + i));
If you are okay with buffering (caching) the result, then it is as simple as using the .cache() operator:
Observable<Integer> d = Observable.range(1, 10)
.doOnNext(i -> System.out.println("Emitted from source: " + i))
.cache();
Observable<Integer> c = d.map(t -> t + 1);
Observable<Integer> a = c.map(t -> t + 2);
Observable<Integer> b = c.map(t -> t + 3);
a.forEach(i -> System.out.println("a: " + i));
b.forEach(i -> System.out.println("b: " + i));
c.forEach(i -> System.out.println("c: " + i));
Adding the .cache() to the source makes it so it only emits once and can be subscribed to many times.
For large or infinite data sources caching is not an option so multicasting is the solution to ensure the source only emits once.
The publish() and share() operators are a good place to start, but for simplicity, and since this is a synchronous example, I'll show the publish(function) overload, which is often the easiest to use.
Observable<Integer> d = Observable.range(1, 10)
        .doOnNext(i -> System.out.println("Emitted from source: " + i))
        .publish(oi -> {
            Observable<Integer> c = oi.map(t -> t + 1);
            Observable<Integer> a = c.map(t -> t + 2);
            Observable<Integer> b = c.map(t -> t + 3);
            return Observable.merge(a, b, c);
        });
d.forEach(System.out::println);
If a, b, c are wanted individually then we can wire everything up and "connect" the source when ready:
private static void publishWithConnect() {
    ConnectableObservable<Integer> d = Observable.range(1, 10)
            .doOnNext(i -> System.out.println("Emitted from source: " + i))
            .publish();

    Observable<Integer> c = d.map(t -> t + 1);
    Observable<Integer> a = c.map(t -> t + 2);
    Observable<Integer> b = c.map(t -> t + 3);

    a.forEach(i -> System.out.println("a: " + i));
    b.forEach(i -> System.out.println("b: " + i));
    c.forEach(i -> System.out.println("c: " + i));

    // now that we've wired up everything we can connect the source
    d.connect();
}
Or if the source is async we can use refCounting:
Observable<Integer> d = Observable.range(1, 10)
.doOnNext(i -> System.out.println("Emitted from source: " + i))
.subscribeOn(Schedulers.computation())
.share();
However, refCount (share is shorthand for publish().refCount()) allows race conditions, so it won't guarantee that all subscribers get the first values. It is usually only wanted for "hot" streams where subscribers come and go. For a "cold" source that we want every subscriber to receive in full, the previous solutions with cache() or publish()/publish(function) are the preferred approach.
You can learn more here: https://github.com/ReactiveX/RxJava/wiki/Connectable-Observable-Operators

Martin Odersky: Working hard to keep it simple

I was watching the talk given by Martin Odersky, recommended by himself in the Coursera Scala course, and I am quite curious about one aspect of it:
var x = 0
async { x = x + 1 }
async { x = x * 2 }
So I get that it can give 2 if the first statement gets executed first and then the second one:
x = 0;
x = x + 1;
x = x * 2; // this will be x = 2
I get how it can give 1:
x = 0;
x = x * 2;
x = x + 1 // this will be x = 1
However, how can it result in 0? Is it possible that the statements don't execute at all?
Sorry for such an easy question, but I'm really stuck on it.
You need to think about interleaved execution. Remember that the CPU needs to read the value of x before it can work on it. So imagine the following:
Thread 1 reads x (reading 0)
Thread 2 reads x (reading 0)
Thread 1 writes x + 1 (writing 1)
Thread 2 writes x * 2 (writing 0)
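That interleaving can be re-enacted step by step; the sketch below (in C#, purely illustrative and not from the original answer) makes the two reads and two writes explicit:
// Deterministic re-enactment of the interleaving listed above.
int x = 0;

int read1 = x;        // Thread 1 reads x (0)
int read2 = x;        // Thread 2 reads x (0)
x = read1 + 1;        // Thread 1 writes x + 1 (1)
x = read2 * 2;        // Thread 2 writes x * 2 (0)

Console.WriteLine(x); // prints 0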
I know this has already been answered, but maybe this is still useful:
Think of it as a sequence of atomic operations. The processor is doing one atomic operation at a time.
Here we have the following:
Read x
Write x
Add 1
Multiply 2
The following two sequences are guaranteed to happen in this order "within themselves":
Read x, Add 1, Write x
Read x, Multiply 2, Write x
However, if you are executing them in parallel, the time of execution of each atomic operation relative to any other atomic operation in the other sequence is random i.e. these two sequences interleave.
One of the possible orders of execution will produce 0, as shown in the answer by Paul Butcher.
Here is an illustration I found on the internet:
Each blue/purple block is one atomic operation; you can see how different orderings of the blocks give different results.
To solve this problem you can use the keyword "synchronized".
My understanding is that if you mark two blocks of code (e.g. two methods) with synchronized within the same object, then each block will own that object's lock while being executed, so the other block cannot run until the first has finished. However, if you have two synchronized blocks in two different objects, they can execute in parallel.

Rx - Notifications at a specified rate

I'm a newbie with Rx and I'm facing a problem with "shaping the notification traffic".
I wonder how I can notify observers with a given throughput; that is, I would like the "OnNext" method to be called no sooner than a given amount of time after the previous "OnNext" invocation.
For the sake of completeness: I want every element in the sequence to be notified.
For example, with a rate of 0.2 symbols/tick:
Tick:     0         10        20        30
          |---------|---------|---------|
Producer: A---B------C--D-----E-------F
Result:   A    B     C    D   E      F
          0    5     11   16  21     28
Is there a way to compose the observable, or do I have to implement my own Subject?
Thanks a lot
Yeah, just turn each value into an async sequence that does not complete until the delay has elapsed, and then concatenate them:
var delay = Observable.Empty<T>().Delay(TimeSpan.FromSeconds(2));
var rateLimited = source
    .Select(item => Observable.Return(item).Concat(delay))
    .Concat();
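A small usage sketch of the same idea with a concrete element type (the char values, the 2-second gap and the Timestamp-based logging are illustrative assumptions, not part of the original answer):
// Emit A, B, C, D with at least 2 seconds between consecutive notifications.
var source = new[] { 'A', 'B', 'C', 'D' }.ToObservable();
var delay = Observable.Empty<char>().Delay(TimeSpan.FromSeconds(2));

var rateLimited = source
    .Select(item => Observable.Return(item).Concat(delay))
    .Concat();

rateLimited.Timestamp()
    .Subscribe(x => Console.WriteLine($"{x.Timestamp:HH:mm:ss.fff} {x.Value}"));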