Reconstructing nested Observables - system.reactive

This is a beginner's question. I am streaming the output of several processes back to the client, and I want to store each stream in a separate file.
So what I want is this (a capital letter indicates the end of a nested stream):
S: -a-a-a-b-b-a-A-c-b-B-c-c-...-d--C...D-e--...-z-E-z--z--...
R: a-a-a-----a-A (complete)
b-b-------b-B (complete)
c-----c-c--------C (complete)
d------D (complete)
e--------E (complete)
.
.
.
(and many more nested streams to come)
So I want something like a dynamic factory of Observables. It is similar to using(), but as I understand it, using() creates a resource that lives as long as the source Observable, whereas I want to complete and close the file every time a nested stream completes.
IMPORTANT - I don't want to buffer in memory, as these are very long streams (output of long-running processes). So I would like to avoid buffer() and groupBy().

If you can determine from a value whether it is the final one for a particular letter, you can use groupBy with takeUntil on the inner groups (you may need a few try-catch blocks in the lambdas):
source
    .groupBy(v -> v.type)
    .flatMap(g ->
        Observable.using(
            () -> createOutput(g.getKey()),
            f -> g.takeUntil(v -> v.isEnd).doOnNext(v -> f.write(v)),
            f -> f.close()
        )
    )
    .subscribe(...)
takeUntil makes sure groupBy only keeps a sub-stream open as long as we expect values for it (and if the source is properly ordered, groupBy won't recreate the group).
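For reference, the snippets in this answer assume a value type and a per-stream sink roughly like the following (hypothetical names; a minimal Java sketch, not part of the original answer):

import java.io.Closeable;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

// Hypothetical value type: 'type' identifies the nested stream,
// 'isEnd' marks its final (capital-letter) value.
class Value {
    String type;
    boolean isEnd;
    String payload;
}

// Hypothetical per-stream sink: one file per stream key.
class Output implements Closeable {
    private final Writer writer;
    Output(String key) throws IOException {
        writer = new FileWriter(key + ".log");
    }
    void write(Value v) throws IOException {
        writer.write(v.payload);
        writer.write('\n');
    }
    @Override
    public void close() throws IOException {
        writer.close();
    }
}

// createOutput(key) in the first snippet would then simply be: new Output(key).
// The checked IOExceptions are why try-catch blocks are needed in the lambdas.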
If you really want to avoid groupBy, you have to manually track each open file and close them at the appropriate times:
Observable.using(
    HashMap::new,
    map -> source
        .doOnNext(v -> {
            Output o = map.get(v.type);
            if (o == null) {
                o = new Output(v.type);
                map.put(v.type, o);
            }
            o.write(v);
            if (v.isEnd) {
                o.close();
                map.remove(v.type);
            }
        }),
    map -> map.values().forEach(e -> e.close())
)
.subscribe(...);

Related

Killing the spark.sql

I am new to both Scala and Spark.
I have Scala code that executes queries in a while loop, one after the other.
What we need is: if a particular query takes more than a certain time, for example 10 minutes, we should be able to stop that query's execution and move on to the next one. For example:
do {
    var f = Future(
        spark.sql("some query")
    )
    f onSuccess {
        case suc => println("Query ran in 10 mins")
    }
    f onFailure {
        case fail => println("query took more than 10 mins")
    }
} while (some condition)
var result = Await.ready(f, Duration(10, TimeUnit.MINUTES))
I understand that when we call spark.sql, control is handed over to Spark, which I need to kill/stop once the duration is over so that I can get the resources back.
I have tried multiple things but I am not sure how to solve this.
Any help would be welcome, as I am stuck on this.
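One commonly suggested direction (a sketch of my own, not from the original post, assuming Spark's job-group API) is to tag each query with a job group, wait on a Future with a timeout, and cancel the group when the timeout fires:

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();
ExecutorService pool = Executors.newSingleThreadExecutor();
String groupId = "timed-query"; // hypothetical id, one per query

Future<List<Row>> f = pool.submit(() -> {
    // Job groups are tracked per thread, so tag the worker thread;
    // 'true' asks Spark to interrupt running tasks on cancellation.
    spark.sparkContext().setJobGroup(groupId, "query with timeout", true);
    // An action (collectAsList) is needed to actually run the jobs.
    return spark.sql("some query").collectAsList();
});

try {
    List<Row> rows = f.get(10, TimeUnit.MINUTES);
    System.out.println("Query ran in 10 mins");
} catch (TimeoutException e) {
    // Cancel all Spark jobs in the group to get the resources back.
    spark.sparkContext().cancelJobGroup(groupId);
    f.cancel(true);
    System.out.println("query took more than 10 mins");
} catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException(e);
}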

Combine Mono with every Flux element emitted

I have a Flux and a Mono as below:
Mono<MyRequest> req = request.bodyToMono(MyRequest.class);
Mono<List<String>> mono1 = req.map(r -> r.getList());
Flux<Long> flux1 = req.map(r -> r.getVals()) // getVals() returns a list of Long
    .flatMapMany(Flux::fromIterable);
Now for each number in flux1, I want to call a method whose params are the id from flux1 and the List<String> from mono1. Something like:
flux1.flatMap(id -> process(id, mono1))
But passing and processing the same mono1 results in the error Only one connection receive subscriber allowed. How can I achieve the above? Thanks!
Since both pieces of information are coming from the same source, you could run the whole thing in one pipeline and wrap both elements in a Tuple or, better, a domain object that has more meaning:
Mono<MyRequest> req = // ...
Flux<Tuple2<Long, List<String>>> tuples = req.flatMapMany(r ->
    Flux.fromIterable(r.getVals())
        .map(id -> Tuples.of(id, r.getList()))
);
// once there, you can map that with your process method like
tuples.map(tup -> process(tup.getT1(), tup.getT2()));
Note that this looks unusual, but that basically comes from the structure of the object you're receiving.
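If you do need two separate pipelines hanging off the same request, another option (my own suggestion, not part of the original answer) is to make the request Mono replayable with cache(), so both downstream subscriptions share one subscription to the body:

// Sketch: cache() subscribes upstream once and replays the value,
// avoiding "Only one connection receive subscriber allowed".
Mono<MyRequest> req = request.bodyToMono(MyRequest.class).cache();
Mono<List<String>> mono1 = req.map(MyRequest::getList);
Flux<Long> flux1 = req.map(MyRequest::getVals)
    .flatMapMany(Flux::fromIterable);
flux1.flatMap(id -> mono1.map(list -> process(id, list)));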

RxJS interleaving merged observables (priority queue?)

UPDATE
I think I've figured out the solution. I explain it in this video. Basically, use timeoutWith, and some tricks with zip (within zip).
https://youtu.be/0A7C1oJSJDk
If I have a single observable like this:
A-1-2--B-3-4-5-C--D--6-7-E
I want to treat the "numbers" as lower priority; they should wait until the "letters" fill up (a group of 2, for example) OR a timeout is reached, and only then emit. Maybe the following illustration (of the desired result) can help:
A------B-1-----C--D-2----E-3-4-5-6-7
I've been experimenting with some ideas... one of them: the first step is to split that stream (groupBy) into one substream containing letters and another containing numbers..., then "something in the middle" happens..., and finally those two (sub)streams get merged.
It's that "something in the middle" that I'm trying to figure out.
How do I achieve it? Is it even possible with RxJS (v5.5.6)? If not, what's the closest thing? What I want to avoid is the "numbers" flooding the stream and not giving the "letters" enough chance to be processed in a timely manner.
Probably this video I made of my efforts so far can clarify as well:
Original problem statement: https://www.youtube.com/watch?v=mEmU4JK5Tic
So far: https://www.youtube.com/watch?v=HWDI9wpVxJk&feature=youtu.be
The problem with my solution so far (delaying each emission in the "numbers" substream using .delay) is that it is suboptimal: it keeps clocking at a slow pace (10 seconds) even after the "letters" (sub)stream has ended (not completed -- there is no clear boundary here -- it just stops producing values for an indeterminate amount of time). What I really need is for the "numbers" substream to raise its pace (to 2 seconds) once that happens.
Unfortunately I don't know RxJS 5 that well; I use xstream myself (authored by one of the contributors to RxJS 5), which is a little simpler in terms of the number of operators.
With it I crafted the following example:
(Note: the operators are pretty much the same as in Rx 5; the main difference is flatten, which is more or less like switch but seems to handle synchronous streams differently.)
const xs = require("xstream").default;

const input$ = xs.of("A", 1, 2, "B", 3, 4, 5, "C", "D", 6, 7, "E");

const initialState = { $: xs.never(), count: 0, buffer: [] };

const state$ = input$.fold((state, value) => {
  const t = typeof value;
  if (t === "string") {
    // letters pass through immediately and bump the counter
    return {
      ...state,
      $: xs.of(value),
      count: state.count + 1
    };
  }
  if (state.count >= 2) {
    // enough letters seen: release the oldest buffered number,
    // or this one directly if the buffer is empty
    const l = state.buffer.length;
    return {
      ...state,
      $: l > 0 ? xs.of(state.buffer[0]) : xs.of(value),
      count: 0,
      buffer: l > 0 ? state.buffer.slice(1).concat(value) : state.buffer
    };
  }
  // otherwise hold the number back
  return {
    ...state,
    $: xs.never(),
    buffer: state.buffer.concat(value)
  };
}, initialState);

xs
  .merge(
    state$.map(s => s.$),
    // flush whatever is still buffered when the input ends
    state$
      .last()
      .map(s => xs.of.apply(xs, s.buffer))
  )
  .flatten()
  .subscribe({
    next: console.log
  });
Which gives me the result you are looking for.
It works by folding the stream on itself, looking at the type of the values and emitting a new stream depending on it. When you need to wait because not enough letters were dispatched, I emit an empty stream (no values, no errors, no completion) as a placeholder.
Instead of emitting this empty stream, you could emit something like:
xs.empty().endsWith(xs.periodic(timeout)).last().mapTo(value)
// stream that will emit a value only after a specified timeout.
// Because the streams are **not** flattened concurrently you can
// use this as a "pending" stream that may or may not be eventually
// consumed
where value is the last received number, in order to implement timeout-related conditions. However, you would then need to introduce some kind of reflexivity, with either a Subject in Rx or xs.imitate with xstream, because you would need to notify your state that the "pending" stream has been consumed, which makes the communication bi-directional, whereas streams / observables are unidirectional.
The key here is the use of timeoutWith, to switch to the more aggressive "pacer" when the "event" kicks in. In this case the "event" is "idle detected in the higher-priority stream".
The video: https://youtu.be/0A7C1oJSJDk
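For readers who prefer code to video, here is a rough translation of that idea into RxJava (my own sketch, not the exact RxJS code from the video; "letters" and "numbers" stand for the two substreams after the groupBy split, and the pacing values are made up):

import java.util.concurrent.TimeUnit;
import rx.Observable;

// assumed inputs: Observable<String> letters (high priority),
//                 Observable<Integer> numbers (low priority)

// 'idle' flips to true once 'letters' produces nothing for 5 seconds;
// timeout(...) with a fallback is RxJava's analogue of timeoutWith.
Observable<Boolean> idle = letters
    .map(x -> false)
    .timeout(5, TimeUnit.SECONDS, Observable.just(true));

// While letters are active, release one number every 10 seconds;
// once idle is detected, speed the pacer up to every 2 seconds.
Observable<Long> pacer = idle
    .distinctUntilChanged()
    .switchMap(isIdle -> Observable.interval(isIdle ? 2 : 10, TimeUnit.SECONDS));

// zip holds each number back until the pacer ticks.
Observable<Integer> paced = Observable.zip(numbers, pacer, (n, tick) -> n);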

Batching large result sets using Rx

I've got an interesting question for Rx experts. I have a relational table keeping information about events. An event consists of an id, a type, and the time at which it happened. In my code, I need to fetch all the events within a certain, potentially wide, time range.
SELECT * FROM events WHERE event.time > :before AND event.time < :after ORDER BY time LIMIT :batch_size
To improve reliability and deal with large result sets, I query the records in batches of size :batch_size. Now, I want to write a function that, given :before and :after, will return an Observable representing the result set.
Observable<Event> getEvents(long before, long after);
Internally, the function should query the database in batches. The distribution of events along the time scale is unknown, so the natural way to address batching is this:
fetch the first N records
if the result is not empty, use the last record's time as the new 'before' parameter and fetch the next N records; otherwise terminate
... and so on (the idea should be clear)
My question is:
Is there a way to express this function in terms of higher-level Observable primitives (filter/map/flatMap/scan/range etc.), without using subscribers explicitly?
So far I've failed to do this, and have come up with the following straightforward code instead:
private void observeGetRecords(long before, long after, Subscriber<? super Event> subscriber) {
    long start = before;
    while (start < after) {
        final List<Event> records;
        try {
            records = getRecordsByRange(start, after);
        } catch (Exception e) {
            subscriber.onError(e);
            return;
        }
        if (records.isEmpty()) break;
        records.forEach(subscriber::onNext);
        start = Iterables.getLast(records).getTime();
    }
    subscriber.onCompleted();
}

public Observable<Event> getRecords(final long before, final long after) {
    return Observable.create(subscriber -> observeGetRecords(before, after, subscriber));
}
Here, getRecordsByRange implements the SELECT query using DBI and returns a List. This code works fine, but it lacks the elegance of high-level Rx constructs.
NB: I know that I can return an Iterator as the result of a SELECT query in DBI. However, I don't want to do that, and prefer to run multiple queries instead. This computation does not have to be atomic, so the issues of transaction isolation are not relevant.
Although I don't fully understand why you want such time-reuse, here is how I'd do it:
BehaviorSubject<Long> start = BehaviorSubject.create(0L);

start
    .subscribeOn(Schedulers.trampoline())
    .flatMap(tstart ->
        getEvents(tstart, tstart + twindow)
            .publish(o ->
                o.takeLast(1)
                    .doOnNext(r -> start.onNext(r.time))
                    .ignoreElements()
                    .mergeWith(o)
            )
    )
    .subscribe(...)
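An alternative formulation without a Subject (my own sketch, reusing getRecordsByRange from the question) is to recurse lazily with defer, so the next SELECT only runs once the previous batch has been fully emitted:

// Lazy recursion with defer/concatWith (RxJava 1 style): each
// recursive call builds a cold Observable, and the query inside
// defer executes only when that Observable is subscribed to.
// (A checked exception from getRecordsByRange would need wrapping here.)
public Observable<Event> getEvents(long before, long after) {
    return Observable.defer(() -> {
        List<Event> batch = getRecordsByRange(before, after);
        if (batch.isEmpty()) {
            return Observable.empty();
        }
        long nextStart = Iterables.getLast(batch).getTime();
        return Observable.from(batch)
            .concatWith(getEvents(nextStart, after));
    });
}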

Combining parts of Stream

I've got an observable watching a log that is continuously being written to. Each line is a new onNext call. Sometimes the log outputs a single log item over multiple lines. Detecting this is easy; I just can't find the right Rx call.
I'd like to find a way to collect the single log items into a List of lines, and onNext the list when the single log item is complete.
Buffer doesn't seem right, as this isn't time-based; it's algorithm-based.
GroupBy might be what I want, but its documentation is confusing. It also seems that the observables it creates probably won't have onComplete called until the source observable completes.
This solution can't delay the log much (preferably not at all). I need to be reading the log as close to real time as possible, and order matters.
Any push in the right direction would be great.
This is a typical reactive parsing problem. You could use Rxx Parsers or, for a native solution, build your own state machine with either Scan or an async iterator. Scan is preferable for simple parsers and often uses a Scan-Where-Select pattern.
Async iterator state machine example: Turnstile
Scan parser example (untested):
IObservable<string> lines = ReadLines();

IObservable<IReadOnlyList<string>> parsed = lines.Scan(
    new
    {
        ParsingItem = (IEnumerable<string>)null,
        Item = (IEnumerable<string>)null
    },
    (state, line) =>
        // I'm assuming here that items never span lines partially.
        IsItem(line)
            ? IsItemLastLine(line)
                ? new
                {
                    ParsingItem = (IEnumerable<string>)null,
                    Item = (state.ParsingItem ?? Enumerable.Empty<string>()).Concat(new[] { line })
                }
                : new
                {
                    ParsingItem = (state.ParsingItem ?? Enumerable.Empty<string>()).Concat(new[] { line }),
                    Item = (IEnumerable<string>)null
                }
            : new
            {
                ParsingItem = (IEnumerable<string>)null,
                Item = (IEnumerable<string>)new[] { line }
            })
    .Where(result => result.Item != null)
    .Select(result => result.Item.ToList().AsReadOnly());
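The same Scan-Where-Select shape also translates to RxJava, in case that's your platform (a simplified sketch of mine, assuming a hypothetical isItemLastLine predicate, an Observable<String> lines, and that every line belongs to some item):

import java.util.ArrayList;
import java.util.List;
import rx.Observable;

// Hypothetical accumulator: 'pending' collects the lines of the item
// currently being parsed; 'item' is non-null only on the emission
// where an item has just been completed.
class ParseState {
    final List<String> pending;
    final List<String> item;
    ParseState(List<String> pending, List<String> item) {
        this.pending = pending;
        this.item = item;
    }
}

Observable<List<String>> parsed = lines
    .scan(new ParseState(new ArrayList<>(), null), (state, line) -> {
        List<String> pending = new ArrayList<>(state.pending);
        pending.add(line);
        return isItemLastLine(line)
            ? new ParseState(new ArrayList<>(), pending) // flush completed item
            : new ParseState(pending, null);             // keep accumulating
    })
    .filter(s -> s.item != null)
    .map(s -> s.item);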