Merging multiple streams, keeping ordering and avoiding duplicates - system.reactive

I have a problem that I do not know how to handle beautifully with RX.
I have multiple streams that all supposedly contain the same elements.
However, each stream may lose messages (UDP is involved) or be late/early compared to the others. Each of these messages has a sequence number.
Now what I want to achieve is to get a single stream out of all those streams, without duplicates and keeping the message order. In other words, the same sequence number should not appear twice, and the values must only increase, never decrease.
When a message was lost on all the streams, I'm OK with losing it (as there is another TCP mechanism involved that allows me to ask explicitly for missing messages).
I am looking to do that in RxJava, but I guess my problem is not specific to Java.
Here's a marble diagram to help visualizing what I want to achieve:
[image: marble diagram]
You can see in that diagram that we are waiting for 2 on the first stream to output 3 from the second stream.
Likewise, 6 is only outputted once we receive 6 from the second stream because only at that point can we know for sure that 5 will never be received by any stream.

I wrote this code in the browser, so treat it as untested, but I think it should give you a good idea of how you could solve this.
public static IObservable<T> Sequenced<T>(
this IObservable<T> source,
Func<T, int> getSequenceNumber,
int sequenceBegin,
int sequenceRedundancy)
{
return Observable.Create<T>(observer =>
{
// The next sequence number in order.
var sequenceNext = sequenceBegin;
// The key is the sequence number.
// The value is (T, Count).
var counts = new SortedDictionary<int, Tuple<T, int>>();
return source.Subscribe(
value =>
{
var sequenceNumber = getSequenceNumber(value);
// If the sequence number for the current value is
// earlier in the sequence, just throw away this value.
if (sequenceNumber < sequenceNext)
{
return;
}
// Update counts based on the current value.
Tuple<T, int> count;
if (!counts.TryGetValue(sequenceNumber, out count))
{
count = Tuple.Create(value, 0);
}
count = Tuple.Create(count.Item1, count.Item2 + 1);
counts[sequenceNumber] = count;
// If the current count has reached sequenceRedundancy,
// that means any sequence values S such that
// sequenceNext <= S < sequenceNumber and S has not been
// seen yet will never be seen. So we emit everything
// we have seen up to this point, in order.
if (count.Item2 >= sequenceRedundancy)
{
var removal = counts.Keys
.TakeWhile(seq => seq <= sequenceNumber)
.ToList();
foreach (var seq in removal)
{
count = counts[seq];
observer.OnNext(count.Item1);
counts.Remove(seq);
}
// Everything up to sequenceNumber has been emitted or given up on.
sequenceNext = sequenceNumber + 1;
}
// Emit stored values as long as we keep having the
// next sequence value.
while (counts.TryGetValue(sequenceNext, out count))
{
observer.OnNext(count.Item1);
counts.Remove(sequenceNext);
sequenceNext++;
}
},
observer.OnError,
() =>
{
// Emit in order any remaining values.
foreach (var count in counts.Values)
{
observer.OnNext(count.Item1);
}
observer.OnCompleted();
});
});
}
If you have two streams IObservable<Message> A and IObservable<Message> B, you would use this by doing Observable.Merge(A, B).Sequenced(msg => msg.SequenceNumber, 1, 2).
For your example marble diagram, this would look like the following, where the source column shows the values emitted by Observable.Merge(A, B) and the counts column shows the contents of the SortedDictionary after each step of the algorithm. I am assuming that the "messages" of the original source sequence (without any lost values) are (A,1), (B,2), (C,3), (D,4), (E,5), (F,6), where the second component of each message is its sequence number.
source | counts
-------|-----------
(A,1) | --> emit A
(A,1) | --> skip
(C,3) | (3,(C,1))
(B,2) | (3,(C,1)) --> emit B,C and remove C
(D,4) | --> emit D
(F,6) | (6,(F,1))
(F,6) | (6,(F,2)) --> emit F and remove
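The trace above can be reproduced without any Rx library. Here is a minimal plain JavaScript sketch of the same bookkeeping. The names createSequencer, push and flush are made up for this illustration, and after a redundancy-triggered flush the sketch moves the expected number to one past the flushed value. Treat it as a sketch of the idea, not a port of the operator.

```javascript
// Plain JavaScript sketch of the bookkeeping behind a "sequenced" merge.
// createSequencer, push and flush are hypothetical names for this illustration.
function createSequencer(sequenceBegin, sequenceRedundancy, emit) {
  let sequenceNext = sequenceBegin; // next sequence number we expect to emit
  const pending = new Map();        // sequence number -> { value, count }

  // Emit buffered values for as long as the next expected number is present.
  function drain() {
    while (pending.has(sequenceNext)) {
      emit(pending.get(sequenceNext).value);
      pending.delete(sequenceNext);
      sequenceNext++;
    }
  }

  return {
    push(value, sequenceNumber) {
      if (sequenceNumber < sequenceNext) return; // duplicate or too late
      const entry = pending.get(sequenceNumber) || { value, count: 0 };
      entry.count++;
      pending.set(sequenceNumber, entry);

      // Seen on every stream: anything smaller that is still missing is lost,
      // so flush everything up to and including this number, in order.
      if (entry.count >= sequenceRedundancy) {
        const ready = [...pending.keys()]
          .filter(seq => seq <= sequenceNumber)
          .sort((a, b) => a - b);
        for (const seq of ready) {
          emit(pending.get(seq).value);
          pending.delete(seq);
        }
        sequenceNext = sequenceNumber + 1;
      }
      drain();
    },
    // Called on completion: emit whatever is left, in order.
    flush() {
      for (const seq of [...pending.keys()].sort((a, b) => a - b)) {
        emit(pending.get(seq).value);
      }
      pending.clear();
    }
  };
}

const emitted = [];
const sequencer = createSequencer(1, 2, v => emitted.push(v));
const mergedInput = [['A', 1], ['A', 1], ['C', 3], ['B', 2], ['D', 4], ['F', 6], ['F', 6]];
for (const [value, seq] of mergedInput) sequencer.push(value, seq);
sequencer.flush();
console.log(emitted.join(',')); // A,B,C,D,F
```

The input array is the merged sequence from the table, so the printed order matches the "emit" annotations there.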

A similar question came up a while ago, and I have a custom merge operator that, when given ordered streams, merges them in order but doesn't deduplicate.
Edit:
If you can "afford" it, you can use this custom merge and then distinctUntilChanged(Func1) to filter out subsequent messages with the same sequence number.
Observable<Message> messages = SortedMerge.create(
Arrays.asList(src1, src2, src3), (a, b) -> Long.compare(a.id, b.id))
.distinctUntilChanged(v -> v.id);
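To make the idea concrete, here is a small offline sketch in plain JavaScript that works on arrays instead of live streams. The helpers sortedMerge and distinctUntilChangedBy are hypothetical names written for this illustration only (they are not the SortedMerge operator above), and the sample ids are invented. A streaming version would additionally have to wait until every source has a value available before it can pick the smallest one.

```javascript
// Offline sketch: merge already-sorted arrays, then drop adjacent duplicates.
// sortedMerge and distinctUntilChangedBy are hypothetical helper names.
function sortedMerge(sources, compare) {
  const positions = sources.map(() => 0); // read position per source
  const result = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < sources.length; i++) {
      if (positions[i] >= sources[i].length) continue; // source exhausted
      if (best === -1 ||
          compare(sources[i][positions[i]], sources[best][positions[best]]) < 0) {
        best = i;
      }
    }
    if (best === -1) return result; // every source exhausted
    result.push(sources[best][positions[best]++]);
  }
}

function distinctUntilChangedBy(items, keyOf) {
  return items.filter((item, i) => i === 0 || keyOf(item) !== keyOf(items[i - 1]));
}

// Three lossy copies of the same id sequence (sample data for illustration).
const copy1 = [{ id: 1 }, { id: 2 }, { id: 4 }];
const copy2 = [{ id: 1 }, { id: 3 }, { id: 4 }, { id: 6 }];
const copy3 = [{ id: 2 }, { id: 6 }];
const mergedMessages = distinctUntilChangedBy(
  sortedMerge([copy1, copy2, copy3], (a, b) => a.id - b.id),
  m => m.id
);
console.log(mergedMessages.map(m => m.id).join(',')); // 1,2,3,4,6
```

Because the merge keeps equal ids next to each other, comparing each item only with its predecessor is enough to remove the duplicates.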

Related

RxJs: request list from server, consume values, re-request when we're almost out of values

I'm fetching a list of items from a REST api. The user interacts with each one via a click, and when there are only, say, a couple left unused, I'd like to repeat the request to get more items. I'm trying to do this using a proper RxJs (5) stream-oriented approach.
So, something like:
var userClick$ = Observable.fromEvent(button.nativeElement, 'click');
var needToExtend$ = new BehaviorSubject(1);
var list$ = needToExtend$
.flatMap( () => this.http.get("http://myserver/get-list") )
.flatMap( x => x['list'] );
var itemsUsed$ = userClick$.zip(list$, (click, item) => item);
itemsUsed$.subscribe( item => use(item) );
and then, to trigger a re-load when necessary:
list$.subscribe(
if (list$.isEmpty()) {
needToExtend$.next(1);
}
)
This last bit is wrong, and manually re-triggering doesn't seem very "stream-oriented" even if it did work as intended. Any ideas?
This is similar to Rxjs - Consume API output and re-query when cache is empty but I can't make assumptions about the length of the list returned by the API, and I'd like to re-request before the list is completely consumed. And the solution there feels a bit too clever. There must be a more readable way, right?
How about something like this:
const LIST_LIMIT = 3;
userClick$ = Observable.fromEvent(button.nativeElement, 'click');
list$ = this.http.get("http://myserver/get-list").map(r => r.list);
clickCounter$ = this.userClick$.scan((acc: number, val) => acc + 1, 0);
getList$ = new BehaviorSubject([]);
this.getList$
.switchMap(previousList => this.list$)
.switchMap(list => this.clickCounter$, (list, clickCount) => { return {list, clickCount}; })
.filter(({list, clickCount}) => clickCount >= list.length - LIST_LIMIT)
.map(({list, clickCount}) => list)
.subscribe(this.getList$);
The logic here is that you define a list getter stream and a signal to trigger it.
First, the signal causes switchMap to fetch a new list, which is then fed into another switchMap that resubscribes to a click counter. You combine the result of both streams and feed that to filter, which only emits when the click count is greater than or equal to the list length minus 3 (or whatever you want). Then the signal is subscribed to this whole stream so that it retriggers itself.
Edit: the biggest weakness of this is that you need to set the list value (for display) in a side effect rather than in subscription or with the async pipe. You can rearrange it and multicast though:
const LIST_LIMIT = 3;
userClick$ = Observable.fromEvent(button.nativeElement, 'click');
list$ = this.http.get("http://myserver/get-list").map(r => r.list);
clickCounter$: Observable<number> = this.userClick$.scan((acc: number, val) => acc + 1, 0).startWith(0);
getList$ = new BehaviorSubject([]);
refresh$ = this.getList$
.switchMap(list => this.clickCounter$
.filter(clickCount => list.length <= clickCount + LIST_LIMIT)
.first(),
(list, clickCount) => list)
.switchMap(previousList => this.list$)
.multicast(() => this.getList$);
this.refresh$.connect();
this.refresh$.subscribe(e => console.log(e));
This way has a few advantages, but may be a little less "readable". The pieces are mostly the same, but instead you go to the counter first and let that lead into the switch to the list fetch, and you multicast it to restart the counter.
I'm not clear on how you are tracking getting the next set of items so I will assume it is some form of paging for my answer. I also assume that you don't know the total number of items.
console.clear();
const pageSize = 5;
const pageBuffer = 2;
const data = [...Array(17).keys()]
function getData(page) {
const begin = pageSize * page
const end = begin + pageSize;
return Rx.Observable.of(data.slice(begin, end));
}
const clicks = Rx.Observable.interval(400);
clicks
.scan(count => ++count, 0)
.do(() => console.log('click'))
.map(count => {
const page = Math.floor(count / pageSize) + 1;
const total = page * pageSize;
return { total, page, count }
})
.filter(x => x.total - pageBuffer === x.count)
.startWith({ page: 0 })
.switchMap(x => getData(x.page))
.takeWhile(x => x.length > 0)
.subscribe(
x => { console.log('next: ', x); },
x => { console.log('error: ', x); },
() => { console.log('completed'); }
);
<script src="https://cdnjs.cloudflare.com/ajax/libs/rxjs/5.5.3/Rx.min.js"></script>
Here is an explanation:
Rx.Observable.interval(#): simulates the client click events
.scan(...): accumulates the click events
.map(...): calculates the page index and potential total item count (the actual count could be less, but it doesn't matter for our purposes).
.filter(...): only allow to pass through to get a new page of data if it has just hit the page buffer.
.startWith(...): get the first page without waiting for clicks. The +1 on the page calculation in the .map accounts for this.
.switchMap(...): get the next page of data.
.takeWhile(...): keep the stream open till we get an empty list.
So it will get an initial page and then go get a new page whenever the number of clicks comes within the designated buffer. Once all items have been retrieved (known by empty list) it will complete.
One thing I didn't figure out how to do is to complete the list when the page length is less than the page size. Not sure if it matters to you.
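The trigger arithmetic from the .map and .filter steps can be checked in isolation. This is a small sketch in plain JavaScript; pageToRequest is a hypothetical helper name, and the values 5 and 2 are the pageSize and pageBuffer from the snippet.

```javascript
// Which page (if any) should be requested after the given number of clicks?
// pageToRequest is a hypothetical helper mirroring the .map and .filter steps.
function pageToRequest(count, pageSize, pageBuffer) {
  const page = Math.floor(count / pageSize) + 1;
  const total = page * pageSize;
  // Request only at the moment the click count hits the buffer threshold.
  return total - pageBuffer === count ? page : null;
}

const pageRequests = [];
for (let count = 1; count <= 12; count++) {
  const page = pageToRequest(count, 5, 2);
  if (page !== null) pageRequests.push({ count, page });
}
console.log(JSON.stringify(pageRequests));
// [{"count":3,"page":1},{"count":8,"page":2}]
```

So with a page size of 5 and a buffer of 2, page 1 is requested on the third click and page 2 on the eighth, while page 0 comes from startWith.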

send message to set of channels in non-deterministic order

I'm building a Promela model in which one process sends a request to N other processes, waits for the replies, and then computes a value. Basically a typical map-reduce style execution flow. Currently my model sends requests in a fixed order. I'd like to generalize this to send them in a non-deterministic order. I've looked at the select statement, but that appears to select a single element non-deterministically.
Is there a good pattern for achieving this? Here is the basic structure of what I'm working with:
#define NUM_OBJECTS 2
chan obj_req[NUM_OBJECTS] = [0] of { mtype, chan };
This is the object process that responds to msgtype messages with some value that it computes.
proctype Object(chan request) {
chan reply;
end:
do
:: request ? msgtype(reply) ->
int value = 23
reply ! value
od;
}
This is the client. It sends a request to each of the objects in order 0, 1, 2, ..., and collects all the responses and reduces the values.
proctype Client() {
chan obj_reply = [0] of { int };
int value
// WOULD LIKE NON-DETERMINISM HERE
for (i in obj_req) {
obj_req[i] ! msgtype(obj_reply)
obj_reply ? value
// do something with value
}
}
And I start up the system like this
init {
atomic {
run Object(obj_req[0]);
run Object(obj_req[1]);
run Client();
}
}
From your question I gather that you want to assign a task to a given process in a randomised order, as opposed to simply assigning a random task to an ordered sequence of processes.
All in all, the solution for both approaches is very similar. I don't know whether the one I am going to propose is the most elegant approach, though.
#define NUM_OBJECTS 10
mtype = { ASSIGN_TASK };
chan obj_req[NUM_OBJECTS] = [0] of { mtype, chan, int };
init
{
byte i;
for (i in obj_req) {
run Object(i, obj_req[i]);
}
run Client();
};
proctype Client ()
{
byte i, id;
int value;
byte map[NUM_OBJECTS];
int data[NUM_OBJECTS];
chan obj_reply = [NUM_OBJECTS] of { byte, int };
d_step {
for (i in obj_req) {
map[i] = i;
}
}
// scramble task assignment map
for (i in obj_req) {
byte j;
select(j : 0 .. (NUM_OBJECTS - 1));
byte tmp = map[i];
map[i] = map[j];
map[j] = tmp;
}
// assign tasks
for (i in obj_req) {
obj_req[map[i]] ! ASSIGN_TASK(obj_reply, data[i]);
}
// out-of-order wait of data
for (i in obj_req) {
obj_reply ? id(value);
printf("Object[%d]: end!\n", id, value);
}
printf("client ends\n");
};
proctype Object(byte id; chan request)
{
chan reply;
int in_data;
end:
do
:: request ? ASSIGN_TASK(reply, in_data) ->
printf("Object[%d]: start!\n", id)
reply ! id(id)
od;
};
The idea is to have an array which acts like a map from the set of indexes to the starting position (or, equivalently, to the assigned task).
The map is then scrambled through a finite number of swap operations. After that, each object is assigned its own task in parallel, so they can all start more-or-less at the same time.
In the following output example, you can see that:
Objects are being assigned a task in a random order
Objects can complete the task in a different random order
~$ spin test.pml
Object[1]: start!
Object[9]: start!
Object[0]: start!
Object[6]: start!
Object[2]: start!
Object[8]: start!
Object[4]: start!
Object[5]: start!
Object[3]: start!
Object[7]: start!
Object[1]: end!
Object[9]: end!
Object[0]: end!
Object[6]: end!
Object[2]: end!
Object[4]: end!
Object[8]: end!
Object[5]: end!
Object[3]: end!
Object[7]: end!
client ends
timeout
#processes: 11
...
If one wants to assign a random task to each object rather than starting them randomly, then it suffices to change:
obj_req[map[i]] ! ASSIGN_TASK(obj_reply, data[i]);
into:
obj_req[i] ! ASSIGN_TASK(obj_reply, data[map[i]]);
Obviously, data should be initialised to some meaningful content first.
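The scrambling step described above always yields a permutation of the indexes, because it only ever swaps entries. Here is a minimal JavaScript sketch of that step; scrambleMap is a hypothetical name, and Math.random stands in for Promela's non-deterministic select, which is an assumption of this illustration only.

```javascript
// Sketch of the swap-based scramble. Math.random stands in for Promela's
// non-deterministic select; that substitution is an assumption of this sketch.
function scrambleMap(n, pick = max => Math.floor(Math.random() * (max + 1))) {
  const map = Array.from({ length: n }, (_, i) => i); // identity map
  for (let i = 0; i < n; i++) {
    const j = pick(n - 1); // any index in 0 .. n-1
    [map[i], map[j]] = [map[j], map[i]]; // swap, so no index is lost or repeated
  }
  return map;
}

const startOrder = scrambleMap(10);
console.log(startOrder.join(' '));
```

Whatever values the pick function returns, each index from 0 to n-1 appears exactly once in the result, so every object still receives exactly one request.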

Confusion over behavior of Publish().Refcount()

I've got a simple program here that displays the number of letters in various words. It works as expected.
static void Main(string[] args) {
var word = new Subject<string>();
var wordPub = word.Publish().RefCount();
var length = word.Select(i => i.Length);
var report =
wordPub
.GroupJoin(length,
s => wordPub,
s => Observable.Empty<int>(),
(w, a) => new { Word = w, Lengths = a })
.SelectMany(i => i.Lengths.Select(j => new { Word = i.Word, Length = j }));
report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));
word.OnNext("Apple");
word.OnNext("Banana");
word.OnNext("Cat");
word.OnNext("Donkey");
word.OnNext("Elephant");
word.OnNext("Zebra");
Console.ReadLine();
}
And the output is:
Apple 5
Banana 6
Cat 3
Donkey 6
Elephant 8
Zebra 5
I used Publish().RefCount() because "wordPub" is included in "report" twice. Without it, when a word is emitted, first one part of the report would get notified by a callback, and then the other part of the report would be notified, doubling the notifications. That is kind of what happens; the output ends up having 11 items rather than 6. At least that is what I think is going on. I think of using Publish().RefCount() in this situation as simultaneously updating both parts of the report.
However if I change the length function to ALSO use the published source like this:
var length = wordPub.Select(i => i.Length);
Then the output is this:
Apple 5
Apple 6
Banana 6
Cat 3
Banana 3
Cat 6
Donkey 6
Elephant 8
Donkey 8
Elephant 5
Zebra 5
Why can't the length function also use the same published source?
This was a great challenge to solve!
The conditions under which this happens are very subtle.
Apologies in advance for the long explanation, but bear with me!
TL;DR
Subscriptions to the published source are processed in order, but before any other subscription directly to the unpublished source. i.e. you can jump the queue!
With GroupJoin subscription order is important to determine when windows open and close.
My first concern would be that you are applying Publish().RefCount() to a subject.
This should be a no-op.
Subject<T> has no subscription cost.
So when you remove the Publish().RefCount() :
var word = new Subject<string>();
var wordPub = word;//.Publish().RefCount();
var length = word.Select(i => i.Length);
then you get the same issue.
So then I look to the GroupJoin (because my intuition suggests that Publish().RefCount() is a red herring).
For me, eyeballing this alone was too hard to rationalise, so I lean on a simple debugging tool I have used dozens of times over the years: a Trace or Log extension method.
public interface ILogger
{
void Log(string input);
}
public class DumpLogger : ILogger
{
public void Log(string input)
{
//LinqPad `Dump()` extension method.
// Could use Console.Write instead.
input.Dump();
}
}
public static class ObservableLoggingExtensions
{
private static int _index = 0;
public static IObservable<T> Log<T>(this IObservable<T> source, ILogger logger, string name)
{
return Observable.Create<T>(o =>
{
var index = Interlocked.Increment(ref _index);
var label = $"{index:0000}{name}";
logger.Log($"{label}.Subscribe()");
var disposed = Disposable.Create(() => logger.Log($"{label}.Dispose()"));
var subscription = source
.Do(
x => logger.Log($"{label}.OnNext({x.ToString()})"),
ex => logger.Log($"{label}.OnError({ex})"),
() => logger.Log($"{label}.OnCompleted()")
)
.Subscribe(o);
return new CompositeDisposable(subscription, disposed);
});
}
}
When I add the logging to your provided code it looks like this:
var logger = new DumpLogger();
var word = new Subject<string>();
var wordPub = word.Publish().RefCount();
var length = word.Select(i => i.Length);
var report =
wordPub.Log(logger, "lhs")
.GroupJoin(word.Select(i => i.Length).Log(logger, "rhs"),
s => wordPub.Log(logger, "lhsDuration"),
s => Observable.Empty<int>().Log(logger, "rhsDuration"),
(w, a) => new { Word = w, Lengths = a })
.SelectMany(i => i.Lengths.Select(j => new { Word = i.Word, Length = j }));
report.Subscribe(i => ($"{i.Word} {i.Length}").Dump("OnNext"));
word.OnNext("Apple");
word.OnNext("Banana");
word.OnNext("Cat");
word.OnNext("Donkey");
word.OnNext("Elephant");
word.OnNext("Zebra");
This will then output in my log something like the following
Log with Publish().RefCount() used
0001lhs.Subscribe()
0002rhs.Subscribe()
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe()
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe()
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Banana 6
...
However, when I remove the use of Publish().RefCount(), the new log output is as follows:
Log with only the Subject (no Publish().RefCount())
0001lhs.Subscribe()
0002rhs.Subscribe()
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe()
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe()
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Apple 6
OnNext
Banana 6
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
...
This gives us some insight, but the issue really becomes clear when we start annotating our logs with a logical list of subscriptions.
In the original (working) code with the RefCount our annotations might look like this
//word.Subscribers.Add(wordPub)
0001lhs.Subscribe() //wordPub.Subscribers.Add(0001lhs)
0002rhs.Subscribe() //word.Subscribers.Add(0002rhs)
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe() //wordPub.Subscribers.Add(0003lhsDuration)
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe() //wordPub.Subscribers.Add(0005lhsDuration)
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose() //wordPub.Subscribers.Remove(0003lhsDuration)
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Banana 6
So in this example, when word.OnNext("Banana"); is executed the chain of observers is linked in this order
wordPub
0002rhs
However, wordPub has child subscriptions!
So the real subscription list looks like
wordPub
    0001lhs
    0003lhsDuration
    0005lhsDuration
0002rhs
If we annotate the Subject only log we see where the subtlety lies
0001lhs.Subscribe() //word.Subscribers.Add(0001lhs)
0002rhs.Subscribe() //word.Subscribers.Add(0002rhs)
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe() //word.Subscribers.Add(0003lhsDuration)
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe() //word.Subscribers.Add(0005lhsDuration)
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Apple 6
OnNext
Banana 6
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
So in this example, when word.OnNext("Banana"); is executed the chain of observers is linked in this order
1. 0001lhs
2. 0002rhs
3. 0003lhsDuration
4. 0005lhsDuration
As the 0003lhsDuration subscription is activated after the 0002rhs, it won't see the "Banana" value to terminate the window until after the rhs has been sent the value, thus yielding it in the still open window.
Whew
As @francezu13k50 points out, the obvious and simple solution to your problem is to just use word.Select(x => new { Word = x, Length = x.Length });, but as I think you have given us a simplified version of your real problem (appreciated), I understand why this isn't suitable.
However, as I don't know what your real problem space is, I am not sure what to suggest as a solution, except that you have one with your current code, and now you should know why it works the way it does.
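The queue-jumping effect described in this answer can be shown with a toy model. This is a plain JavaScript sketch, not Rx: MiniSubject is a hypothetical stand-in that notifies its observers in subscription order, and the "published" case is modelled as a second MiniSubject that is itself the first subscriber of the source.

```javascript
// MiniSubject is a hypothetical stand-in for a subject: it notifies observers
// in the order they subscribed, using a snapshot taken when next() is called.
class MiniSubject {
  constructor() { this.observers = []; }
  subscribe(observer) { this.observers.push(observer); }
  next(value) { for (const observer of [...this.observers]) observer(value); }
}

function notificationOrder(usePublishedHub) {
  const log = [];
  const word = new MiniSubject();
  // With the hub, word gets one subscriber that fans out to the hub's children.
  const wordPub = usePublishedHub ? new MiniSubject() : word;
  if (usePublishedHub) word.subscribe(v => wordPub.next(v));

  wordPub.subscribe(() => log.push('lhs'));
  word.subscribe(() => log.push('rhs'));
  wordPub.subscribe(() => log.push('lhsDuration')); // subscribed later, as in GroupJoin

  word.next('Banana');
  return log;
}

console.log(notificationOrder(true).join(' > '));  // lhs > lhsDuration > rhs
console.log(notificationOrder(false).join(' > ')); // lhs > rhs > lhsDuration
```

In the first case the duration observer hears about "Banana" before rhs does, so the previous window is closed in time. In the second case rhs is notified while that window is still open.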
RefCount returns an Observable that stays connected to the source as long as there is at least one subscription to the returned Observable. When the last subscription is disposed, RefCount disposes its connection to the source, and reconnects when a new subscription is made. It might be the case with your report query that all subscriptions to 'wordPub' are disposed before the query is fulfilled.
Instead of the complicated GroupJoin query you could simply do :
var report = word.Select(x => new { Word = x, Length = x.Length });
Edit:
Change your report query to this if you want to use the GroupJoin operator :
var report =
wordPub
.GroupJoin(length,
s => wordPub,
s => Observable.Empty<int>(),
(w, a) => new { Word = w, Lengths = a })
.SelectMany(i => i.Lengths.FirstAsync().Select(j => new { Word = i.Word, Length = j }));
Because GroupJoin seems to be very tricky to work with, here is another approach for correlating the inputs and outputs of functions.
static void Main(string[] args) {
var word = new Subject<string>();
var length = new Subject<int>();
var report =
word
.CombineLatest(length, (w, l) => new { Word = w, Length = l })
.Scan((a, b) => new { Word = b.Word, Length = a.Word == b.Word ? b.Length : -1 })
.Where(i => i.Length != -1);
report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));
word.OnNext("Apple"); length.OnNext(5);
word.OnNext("Banana");
word.OnNext("Cat"); length.OnNext(3);
word.OnNext("Donkey");
word.OnNext("Elephant"); length.OnNext(8);
word.OnNext("Zebra"); length.OnNext(5);
Console.ReadLine();
}
This approach works if every input has 0 or more outputs subject to the constraints that (1) outputs only arrive in the same order as the inputs AND (2) each output corresponds to its most recent input. This is like a LeftJoin - each item in the first list (word) is paired with items in the right list (length) that subsequently arrive, up until another item in the first list is emitted.
Trying to use regular Join instead of GroupJoin. I thought the problem was that when a new word was created there was a race condition inside Join between creating a new window and ending the current one. So here I tried to eliminate that by pairing every word with a null signifying the end of the window. It doesn't work, just as the first version did not. How is it possible that a new window is created for each word without the previous one being closed first? Completely confused.
static void Main(string[] args) {
var lgr = new DelegateLogger(Console.WriteLine);
var word = new Subject<string>();
var wordDelimited =
word
.Select(i => Observable.Return<string>(null).StartWith(i))
.SelectMany(i => i);
var wordStart = wordDelimited.Where(i => i != null);
var wordEnd = wordDelimited.Where(i => i == null);
var report = Observable
.Join(
wordStart.Log(lgr, "word"), // starts window
wordStart.Select(i => i.Length),
s => wordEnd.Log(lgr, "expireWord"), // ends current window
s => Observable.Empty<int>(),
(l, r) => new { Word = l, Length = r });
report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));
word.OnNext("Apple");
word.OnNext("Banana");
word.OnNext("Cat");
word.OnNext("Zebra");
word.OnNext("Elephant");
word.OnNext("Bear");
Console.ReadLine();
}

Inline parsing of IObservable<byte>

I have an observable query that produces an IObservable<byte> from a stream that I want to parse inline. I want to be able to use different strategies depending on the data source to parse discrete messages from this sequence. Bear in mind I am still on the upward learning curve of RX. I have come up with a solution, but am unsure if there is a way to accomplish this using out-of-the-box operators.
First, I wrote the following extension method to IObservable:
public static IObservable<IList<T>> Parse<T>(
this IObservable<T> source,
Func<IObservable<T>, IObservable<IList<T>>> parsingFunction)
{
return parsingFunction(source);
}
This allows me to specify the message framing strategy in use by a particular data source. One data source might be delimited by one or more bytes while another might be delimited by both start and stop block patterns while another might use a length prefixing strategy. So here is an example of the Delimited strategy that I have defined:
public static class MessageParsingFunctions
{
public static Func<IObservable<T>, IObservable<IList<T>>> Delimited<T>(T[] delimiter)
{
if (delimiter == null) throw new ArgumentNullException("delimiter");
if (delimiter.Length < 1) throw new ArgumentException("delimiter must contain at least one element.");
Func<IObservable<T>, IObservable<IList<T>>> parser =
(source) =>
{
var shared = source.Publish().RefCount();
var windowOpen = shared.Buffer(delimiter.Length, 1)
.Where(buffer => buffer.SequenceEqual(delimiter))
.Publish()
.RefCount();
return shared.Buffer(windowOpen)
.Select(bytes =>
bytes
.Take(bytes.Count - delimiter.Length)
.ToList());
};
return parser;
}
}
So ultimately, as an example, I can use the code in the following fashion to parse discrete messages from the sequence any time the byte pattern for the string '<EOF>' is encountered in the sequence:
var messages = ...operators that surface an IObservable<byte>
.Parse(MessageParsingFunctions.Delimited(Encoding.ASCII.GetBytes("<EOF>")))
...further operators to package discrete messages along with additional metadata
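For comparison, the delimiter framing itself can be sketched without Rx. The following plain JavaScript works on an in-memory byte array; splitOnDelimiter, toBytes and toText are hypothetical helper names, and unlike a stream-based version this sketch simply drops a trailing partial message.

```javascript
// Split a byte sequence into messages whenever the delimiter pattern appears.
// splitOnDelimiter is a hypothetical helper; a trailing partial message is dropped.
function splitOnDelimiter(bytes, delimiter) {
  const messages = [];
  let current = [];
  for (const b of bytes) {
    current.push(b);
    const start = current.length - delimiter.length;
    // Does the tail of the current buffer equal the delimiter?
    const atDelimiter =
      start >= 0 && delimiter.every((d, i) => current[start + i] === d);
    if (atDelimiter) {
      messages.push(current.slice(0, start)); // strip the delimiter itself
      current = [];
    }
  }
  return messages;
}

const toBytes = text => Array.from(text, ch => ch.charCodeAt(0));
const toText = bytes => String.fromCharCode(...bytes);

const framed = splitOnDelimiter(toBytes('hello<EOF>world<EOF>par'), toBytes('<EOF>'));
console.log(framed.map(toText)); // [ 'hello', 'world' ]
```

The tail comparison plays the role of the sliding Buffer(delimiter.Length, 1) check, and the slice corresponds to the Take that removes the delimiter bytes.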
Questions:
Is there a more straightforward way to accomplish this using just out-of-the-box operators?
If not, would it be preferable to just define the different parsing functions (i.e. ParseDelimited, ParseLengthPrefixed, etc.) as local extensions instead of having a more generic Parse extension method that accepts a parsing function?
Thanks in advance!
Take a look at Rxx Parsers. Here's a related lab. For example:
IObservable<byte> bytes = ...;
var parsed = bytes.ParseBinary(parser =>
from next in parser
let magicNumber = parser.String(Encoding.UTF8, 3).Where(value => value == "RXX")
let header = from headerLength in parser.Int32
from header in next.Exactly(headerLength)
from headerAsString in header.Aggregate(string.Empty, (s, b) => s + " " + b)
select headerAsString
let message = parser.String(Encoding.UTF8)
let entry = from length in parser.Int32
from data in next.Exactly(length)
from value in data.Aggregate(string.Empty, (s, b) => s + " " + b)
select value
let entries = from count in parser.Int32
from entries in entry.Exactly(count).ToList()
select entries
select from _ in magicNumber.Required("The file's magic number is invalid.")
from h in header.Required("The file's header is invalid.")
from m in message.Required("The file's message is invalid.")
from e in entries.Required("The file's data is invalid.")
select new
{
Header = h,
Message = m,
Entries = e.Aggregate(string.Empty, (acc, cur) => acc + cur + Environment.NewLine)
});

RX misunderstood behavior

I have the repro code below, which demonstrates a problem in a more complex flow:
static void Main(string[] args)
{
var r = Observable.Range(1, 10).Finally(() => Console.WriteLine("Disposed"));
var x = Observable.Create<int>(o =>
{
for (int i = 1; i < 11; i++)
{
o.OnNext(i);
}
o.OnCompleted();
return Disposable.Create(() => Console.WriteLine("Disposed"));
});
var src = x.Publish().RefCount();
var a = src.Where(i => i % 2 == 0).Do(i => Console.WriteLine("Pair:" + i));
var b = src.Where(i => i % 2 != 0).Do(i => Console.WriteLine("Even:" + i));
var c = Observable.Merge(a, b);
using (c.Subscribe(i => Console.WriteLine("final " + i), () => Console.WriteLine("Complete")))
{
Console.ReadKey();
}
}
Running this snippet with r as src (var src = r.Publish().RefCount()) will produce all the numbers from 1 to 10.
Switching src to x (as in the example) will produce only the pairs, that is, only the output of the first observable to subscribe, unless I change Publish() to Replay().
Why? What is the difference between r and x?
Thanks.
Although I do not have the patience to sort through the Rx.NET source code to find exactly what implementation detail causes this exact behavior, I can provide the following insight:
The difference in behavior you are seeing is caused by a race condition. The racers in this case are the subscriptions of a and b which happen as a result of your subscription to the observable returned by Observable.Merge. You subscribe to c, which in turn subscribes to a and b. a and b are defined in terms of a Publish and RefCount of either x or r, depending on which case you choose.
Here's what's happening.
src = x
In this case, you are using a custom Observable. When subscribed to, your custom observable immediately and synchronously begins to onNext the numbers 1 through 10, and then calls onCompleted. Interestingly enough, this subscription is caused by your Publish().RefCount() Observable when it is subscribed to for the first time. It is subscribed to for the first time by a, because a is the first parameter to Merge. So, before Merge has even subscribed to b, your subscription has already completed. Merge subscribes to b, which is the RefCount observable. That observable has already completed, so Merge looks for the next Observable to merge. Since there are no more Observables to merge, and because all of the existing Observables have completed, the merged observable completes.
The values onNext'd through your custom observable have traveled through the "pairs" observable, but not the "evens" observable. Therefore, you end up with the following:
// "pairs" (has this been named incorrectly?)
[2, 4, 6, 8, 10]
src = r
In this case, you are using the built-in Range method to create an Observable. When subscribed to, this Range Observable does something that eventually ends up yielding the numbers 1 through 10. Interesting. We haven't a clue what's happening in that method, or when it's happening. We can, however, make some observations about it. If we look at what happens when src = x (above), we can see that only the first subscription takes effect because the observable is yielding immediately and synchronously. Therefore, we can determine that the Range Observable must not be yielding in the same manner, but instead allows the application's control flow to execute the subscription to b before any values are yielded. The difference between your custom Observable and this Range Observable is probably that the Range Observable is scheduling the yields to happen on the CurrentThread Scheduler.
How to avoid this kind of race condition:
var src = x.Publish(); // not ref count; x is the source observable from the question
var a = src.Where(...);
var b = src.Where(...);
var c = Observable.Merge(a, b);
var subscription = c.Subscribe(i => Console.WriteLine("final " + i), () => Console.WriteLine("Complete"));
// don't dispose of the subscription. The observable creates an auto-disposing subscription which will call dispose once `OnCompleted` or `OnError` is called.
src.Connect(); // connect to the underlying observable, *after* merge has subscribed to both a and b.
Notice that the solution to fixing the subscription to this composition of Observables was not to change how the source observable works, but instead to make sure your subscription logic isn't allowing any race conditions to exist. This is important, because trying to fix this problem in the Observable is simply changing behavior, not fixing the race. Had we changed the source and switched it out later, the subscription logic would still be buggy.
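The difference between connecting on the first subscription and connecting explicitly afterwards can be reproduced with a toy model. This is a plain JavaScript sketch, not Rx: Hub is a hypothetical stand-in for a published observable, and syncSource imitates a source that pushes all of its values synchronously on subscription.

```javascript
// A cold source that pushes 1..10 synchronously as soon as it is subscribed to.
function syncSource(observer) {
  for (let i = 1; i <= 10; i++) observer(i);
}

// Hub is a hypothetical stand-in for a published (connectable) observable.
class Hub {
  constructor(source) { this.source = source; this.observers = []; }
  subscribe(observer) { this.observers.push(observer); }
  connect() { this.source(v => this.observers.forEach(o => o(v))); }
}

function run(connectOnFirstSubscribe) {
  const hub = new Hub(syncSource);
  const seen = [];
  const subscribe = observer => {
    hub.subscribe(observer);
    // RefCount-like behaviour: the first subscriber triggers the connection.
    if (connectOnFirstSubscribe && hub.observers.length === 1) hub.connect();
  };
  subscribe(v => { if (v % 2 === 0) seen.push(v); }); // plays the role of a
  subscribe(v => { if (v % 2 !== 0) seen.push(v); }); // plays the role of b
  if (!connectOnFirstSubscribe) hub.connect();        // explicit connect, after both
  return seen;
}

console.log(run(true).join(','));  // 2,4,6,8,10
console.log(run(false).join(',')); // 1,2,3,4,5,6,7,8,9,10
```

When the first subscriber triggers the connection, the synchronous source has finished before the second observer is even registered. Connecting only after both are registered lets both see every value.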
I suspect it's the schedulers. This change causes the two to behave identically:
var x = Observable.Create<int>(o =>
{
NewThreadScheduler.Default.Schedule(() =>
{
for (int i = 1; i < 11; i++)
{
o.OnNext(i);
}
o.OnCompleted();
});
return Disposable.Create(() => Console.WriteLine("Disposed"));
});
Whereas using Scheduler.Immediate gives the same behavior as yours.