Batching large result sets using Rx - reactive-programming

I've got an interesting question for Rx experts. I've a relational table keeping information about events. An event consists of id, type and time it happened. In my code, I need to fetch all the events within a certain, potentially wide, time range.
SELECT * FROM events WHERE event.time > :before AND event.time < :after ORDER BY time LIMIT :batch_size
To improve reliability and deal with large result sets, I query the records in batches of size :batch_size. Now, I want to write a function that, given :before and :after, will return an Observable representing the result set.
Observable<Event> getEvents(long before, long after);
Internally, the function should query the database in batches. The distribution of events along the time scale is unknown. So the natural way to address batching is this:
fetch first N records
if the result is not empty, use the last record's time as a new 'before' parameter, and fetch the next N records; otherwise terminate
if the result is not empty, use the last record's time as a new 'before' parameter, and fetch the next N records; otherwise terminate
... and so on (the idea should be clear)
My question is:
Is there a way to express this function in terms of higher-level Observable primitives (filter/map/flatMap/scan/range etc), without using the subscribers explicitly?
So far, I've failed to do this, and come up with the following straightforward code instead:
private void observeGetRecords(long before, long after, Subscriber<? super Event> subscriber) {
long start = before;
while (start < after) {
final List<Event> records;
try {
records = getRecordsByRange(start, after);
} catch (Exception e) {
subscriber.onError(e);
return;
}
if (records.isEmpty()) break;
records.forEach(subscriber::onNext);
start = Iterables.getLast(records).getTime();
}
subscriber.onCompleted();
}
public Observable<Event> getRecords(final long before, final long after) {
return Observable.create(subscriber -> observeGetRecords(before, after, subscriber));
}
Here, getRecordsByRange implements the SELECT query using DBI and returns a List. This code works fine, but lacks elegance of high-level Rx constructs.
NB: I know that I can return Iterator as a result of SELECT query in DBI. However, I don't want to do that, and prefer to run multiple queries instead. This computation does not have to be atomic, so the issues of transaction isolation are not relevant.

Although I don't fully understand why you want such time-reuse, here is how I'd do it:
BehaviorSubject<Long> start = BehaviorSubject.create(0L);
start
.subscribeOn(Schedulers.trampoline())
.flatMap(tstart ->
getEvents(tstart, tstart + twindow)
.publish(o ->
o.takeLast(1)
.doOnNext(r -> start.onNext(r.time))
.ignoreElements()
.mergeWith(o)
)
)
.subscribe(...)

Related

How to sequence a task to execute once all Single's in a collection complete

I am using Helidon DBClient transactions and have found myself in a situation where I end up with a list of Singles, List<Single<T>> and want to perform the next task only after completing all of the singles.
I am looking for something of equivalent to CompletableFuture.allOf() but with Single.
I could map each of the single toCompletableFuture() and then do a CompletableFuture.allOf() on top, but is there a better way? Could someone point me in the right direction with this?
--
Why did I end up with a List<Single>?
I have a collection of POJOs which I turn into named insert .execute() all within an open transaction. Since I .stream() the original collection and perform inserts using the .map() operator, I end up with a List when I terminate the stream to collect a List. None of the inserts might have actually been executed. At this point, I want to wait until all of the Singles have been completed before I proceed to the next stage.
This is something I would naturally do with a CompletableFuture.allOf(), but I do not want to change the API dialect for just this and stick to Single/Multi.
Single.flatMap, Single.flatMapSingle, Multi.flatMap will effectively inline the future represented by the publisher passed as argument.
You can convert a List<Single<T>> to Single<List<T>> like this:
List<Single<Integer>> listOfSingle = List.of(Single.just(1), Single.just(2));
Single<List<Integer>> singleOfList = Multi.just(listOfSingle)
.flatMap(Function.identity())
.collectList();
Things can be tricky when you are dealing with Single<Void> as Void cannot be instantiated and null is not a valid value (i.e. Single.just(null) throws a NullPointerException).
// convert List<Single<Void>> to Single<List<Void>>
Single<List<Void>> listSingle =
Multi.just(List.of(Single.<Void>empty(), Single.<Void>empty()))
.flatMap(Function.identity())
.collectList();
// convert Single<List<Void>> to Single<Void>
// Void cannot be instantiated, it needs to be casted from null
// BUT null is not a valid value...
Single<Void> single = listSingle.toOptionalSingle()
// convert Single<List<Void>> to Single<Optional<List<Void>>>
// then Use Optional.map to convert Optional<List<Void>> to Optional<Void>
.map(o -> o.map(i -> (Void) null))
// convert Single<Optional<Void>> to Single<Void>
.flatMapOptional(Function.identity());
// Make sure it works
single.forSingle(o -> System.out.println("ok"))
.await();

RxJS interleaving merged observables (priority queue?)

UPDATE
I think I've figured out the solution. I explain it in this video. Basically, use timeoutWith, and some tricks with zip (within zip).
https://youtu.be/0A7C1oJSJDk
If I have a single observable like this:
A-1-2--B-3-4-5-C--D--6-7-E
I want to put the "numbers" as lower priority; it should wait until the "letters" is filled up (a group of 2 for example) OR a timeout is reached, and then it can emit. Maybe the following illustration (of the desired result) can help:
A------B-1-----C--D-2----E-3-4-5-6-7
I've been experimenting with some ideas... one of them: first step is to split that stream (groupBy), one containing letters, and the other containing numbers..., then "something in the middle" happen..., and finally those two (sub)streams get merged.
It's that "something in the middle" what I'm trying to figure out.
How to achieve it? Is that even possible with RxJS (ver 5.5.6)? If not, what's the closest one? I mean, what I want to avoid is having the "numbers" flooding the stream, and not giving enough chance for the "letters" to be processed in timely manner.
Probably this video I made of my efforts so far can clarify as well:
Original problem statement: https://www.youtube.com/watch?v=mEmU4JK5Tic
So far: https://www.youtube.com/watch?v=HWDI9wpVxJk&feature=youtu.be
The problem with my solution so far (delaying each emission in "numbers" substream using .delay) is suboptimal, because it keeps clocking at slow pace (10 seconds) even after the "characters" (sub)stream has ended (not completed -- no clear boundary here -- just not getting more value for indeterminate amount of time). What I really need is, to have the "numbers" substream raise its pace (to 2 seconds) once that happen.
Unfortunately I don't know RxJs5 that much and use xstream myself (authored by one of the contributor to RxJS5) which is a little bit simpler in terms of the number of operators.
With this I crafted the following example:
(Note: the operators are pretty much the same as in Rx5, the main difference is with flatten wich is more or less like switch but seems to handle synchronous streams differently).
const xs = require("xstream").default;
const input$ = xs.of("A",1,2,"B",3,4,5,"C","D",6,7,"E");
const initialState = { $: xs.never(), count: 0, buffer: [] };
const state$ = input$
.fold((state, value) => {
const t = typeof value;
if (t === "string") {
return {
...state,
$: xs.of(value),
count: state.count + 1
};
}
if (state.count >= 2) {
const l = state.buffer.length;
return {
...state,
$: l > 0 ? xs.of(state.buffer[0]) : xs.of(value) ,
count: 0,
buffer: state.buffer.slice(1).concat(value)
};
}
return {
...state,
$: xs.never(),
buffer: state.buffer.concat(value),
};
}, initialState);
xs
.merge(
state$
.map(s => s.$),
state$
.last()
.map(s => xs.of.apply(xs, s.buffer))
)
.flatten()
.subscribe({
next: console.log
});
Which gives me the result you are looking for.
It works by folding the stream on itself, looking at the type of values and emitting a new stream depending on it. When you need to wait because not enough letters were dispatched I emit an emptystream (emits no value, no errors, no complete) as a "placeholder".
You could instead of emitting this empty stream emit something like
xs.empty().endsWith(xs.periodic(timeout)).last().mapTo(value):
// stream that will emit a value only after a specified timeout.
// Because the streams are **not** flattened concurrently you can
// use this as a "pending" stream that may or may not be eventually
// consumed
where value is the last received number in order to implement timeout related conditions however you would then need to introduce some kind of reflexivity with either a Subject in Rx or xs.imitate with xstream because you would need to notify your state that your "pending" stream has been consumed wich makes the communication bi-directionnal whereas streams / observables are unidirectionnal.
The key here the use of timeoutWith, to switch to the more aggresive "pacer", when the "events" kicks in. In this case the "event" is "idle detected in the higher-priority stream".
The video: https://youtu.be/0A7C1oJSJDk

Spark - how to handle with lazy evaluation in case of iterative (or recursive) function calls

I have a recursive function that needs to compare the results of the current call to the previous call to figure out whether it has reached a convergence. My function does not contain any action - it only contains map, flatMap, and reduceByKey. Since Spark does not evaluate transformations (until an action is called), my next iteration does not get the proper values to compare for convergence.
Here is a skeleton of the function -
def func1(sc: SparkContext, nodes:RDD[List[Long]], didConverge: Boolean, changeCount: Int) RDD[(Long] = {
if (didConverge)
nodes
else {
val currChangeCount = sc.accumulator(0, "xyz")
val newNodes = performSomeOps(nodes, currChangeCount) // does a few map/flatMap/reduceByKey operations
if (currChangeCount.value == changeCount) {
func1(sc, newNodes, true, currChangeCount.value)
} else {
func1(sc, newNode, false, currChangeCount.value)
}
}
}
performSomeOps only contains map, flatMap, and reduceByKey transformations. Since it does not have any action, the code in performSomeOps does not execute. So my currChangeCount does not get the actual count. What that implies, the condition to check for the convergence (currChangeCount.value == changeCount) is going to be invalid. One way to overcome is to force an action within each iteration by calling a count but that is an unnecessary overhead.
I am wondering what I can do to force an action w/o much overhead or is there another way to address this problem?
I believe there is a very important thing you're missing here:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Because of that accumulators cannot be reliably used for managing control flow and are better suited for job monitoring.
Moreover executing an action is not an unnecessary overhead. If you want to know what is the result of the computation you have to perform it. Unless of course the result is trivial. The cheapest action possible is:
rdd.foreach { case _ => }
but it won't address the problem you have here.
In general iterative computations in Spark can be structured as follows:
def func1(chcekpoinInterval: Int)(sc: SparkContext, nodes:RDD[List[Long]],
didConverge: Boolean, changeCount: Int, iteration: Int) RDD[(Long] = {
if (didConverge) nodes
else {
// Compute and cache new nodes
val newNodes = performSomeOps(nodes, currChangeCount).cache
// Periodically checkpoint to avoid stack overflow
if (iteration % checkpointInterval == 0) newNodes.checkpoint
/* Call a function which computes values
that determines control flow. This execute an action on newNodes.
*/
val changeCount = computeChangeCount(newNodes)
// Unpersist old nodes
nodes.unpersist
func1(checkpointInterval)(
sc, newNodes, currChangeCount.value == changeCount,
currChangeCount.value, iteration + 1
)
}
}
I see that these map/flatMap/reduceByKey transformations are updating an accumulator. Therefore the only way to perform all updates is to execute all these functions and count is the easiest way to achieve that and gives the lowest overhead compared to other ways (cache + count, first or collect).
Previous answers put me on the right track to solve a similar convergence detection problem.
foreach is presented in the docs as:
foreach(func) : Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
It seems like instead of using rdd.foreach() as a cheap action to trigger accumulator increments placed in various transformations, it should be used to do the incrementing itself.
I'm unable to produce a scala example, but here's a basic java version, if it can still help:
// Convergence is reached when two iterations
// return the same number of results
long previousCount = -1;
long currentCount = 0;
while (previousCount != currentCount){
rdd = doSomethingThatUpdatesRdd(rdd);
// Count entries in new rdd with foreach + accumulator
rdd.foreach(tuple -> accumulator.add(1));
// Update helper values
previousCount = currentCount;
currentCount = accumulator.sum();
accumulator.reset();
}
// Convergence is reached

Combining parts of Stream

I've got an observable watching a log that is continuously being written too. Each line is a new onNext call. Sometimes the log outputs a single log item over multiple lines. Detecting this is easy, I just can't find the right RX call.
I'd like to find a way to collect the single log items into a List of lines, and onNext the list when the single log item is complete.
Buffer doesn't seem right as this isn't time based, it's algorithm based.
GroupBy might be what I want, but the documentation is confusing for it. It also seems that the observables it creates probably won't have onComplete called until the completion of the source observable.
This solution can't delay the log much (preferably not at all). I need to be reading the log as close to real time as possible, and order matters.
Any push in the right direction would be great.
This is a typical reactive parsing problem. You could use Rxx Parsers, or for a native solution you can build your own state machine with either Scan or by defining an async iterator. Scan is preferable for simple parsers and often uses a Scan-Where-Select pattern.
Async iterator state machine example: Turnstile
Scan parser example (untested):
IObservable<string> lines = ReadLines();
IObservable<IReadOnlyList<string>> parsed = lines.Scan(
new
{
ParsingItem = (IEnumerable<string>)null,
Item = (IEnumerable<string>)null
},
(state, line) =>
// I'm assuming here that items never span lines partially.
IsItem(line)
? IsItemLastLine(line)
? new
{
ParsingItem = (IEnumerable<string>)null,
Item = (state.ParsingItem ?? Enumerable.Empty<string>()).Concat(line)
}
: new
{
ParsingItem = (state.ParsingItem ?? Enumerable.Empty<string>()).Concat(line),
Item = (List<string>)null
}
: new
{
ParsingItem = (IEnumerable<string>)null,
Item = new[] { line }
})
.Where(result => result.Item != null)
.Select(result => result.Item.ToList().AsReadOnly());

How to make Appstats show both small and read operations?

I'm profiling my application locally (using the Dev server) to get more information about how GAE works. My tests are comparing the common full Entity query and the Projection Query. In my tests both queries do the same query, but the Projection is specified with 2 properties. The test kind has 100 properties, all with the same value for each Entity, with a total of 10 Entities. An image with the Datastore viewer and the Appstats generated data is shown bellow. In the Appstats image, Request 4 is a memcache flush, Request 3 is the test database creation (it was already created, so no costs here), Request 2 is the full Entity query and Request 1 is the projection query.
I'm surprised that both queries resulted in the same amount of reads. My guess is that small and read operations and being reported the same by Appstats. If this is the case, I want to separate them in the reports. That's the queries related functions:
// Full Entity Query
public ReturnCodes doQuery() {
DatastoreService dataStore = DatastoreServiceFactory.getDatastoreService();
for(int i = 0; i < numIters; ++i) {
Filter filter = new FilterPredicate(DBCreation.PROPERTY_NAME_PREFIX + i,
FilterOperator.NOT_EQUAL, i);
Query query = new Query(DBCreation.ENTITY_NAME).setFilter(filter);
PreparedQuery prepQuery = dataStore.prepare(query);
Iterable<Entity> results = prepQuery.asIterable();
for(Entity result : results) {
log.info(result.toString());
}
}
return ReturnCodes.SUCCESS;
}
// Projection Query
public ReturnCodes doQuery() {
DatastoreService dataStore = DatastoreServiceFactory.getDatastoreService();
for(int i = 0; i < numIters; ++i) {
String projectionPropName = DBCreation.PROPERTY_NAME_PREFIX + i;
Filter filter = new FilterPredicate(DBCreation.PROPERTY_NAME_PREFIX + i,
FilterOperator.NOT_EQUAL, i);
Query query = new Query(DBCreation.ENTITY_NAME).setFilter(filter);
query.addProjection(new PropertyProjection(DBCreation.PROPERTY_NAME_PREFIX + 0, Integer.class));
query.addProjection(new PropertyProjection(DBCreation.PROPERTY_NAME_PREFIX + 1, Integer.class));
PreparedQuery prepQuery = dataStore.prepare(query);
Iterable<Entity> results = prepQuery.asIterable();
for(Entity result : results) {
log.info(result.toString());
}
}
return ReturnCodes.SUCCESS;
}
Any ideas?
EDIT: To get a better overview of the problem I have created another test, which do the same query but uses the keys only query instead. For this case, Appstats is correctly showing DATASTORE_SMALL operations in the report. I'm still pretty confused about the behavior of the projection query which should also be reporting DATASTORE_SMALL operations. Please help!
[I wrote the go port of appstats, so this is based on my experience and recollection.]
My guess is this is a bug in appstats, which is a relatively unmaintained program. Projection queries are new, so appstats may not be aware of them, and treats them as normal read queries.
For some background, calculating costs is difficult. For write ops, the cost are returned with the results, as they must be, since the app has no way of knowing what changed (which is where the write costs happen). For reads and small ops, however, there is a formula to calculate the cost. Each appstats implementation (python, java, go) must implement this calculation, including reflection or whatever is needed over the request object to determine what's going on. The APIs for doing this are not entirely obvious, and there's lots of little things, so it's easy to get it wrong, and annoying to get it right.