Spring Reactor | Batching the input without mutating - reactive-programming

I'm trying to batch the records constantly emitted from a streaming source (Kafka) and call my service in a batch of 100.
The input arrives one record at a time. I'm looking for the best way to achieve this reactively with Spring Reactor, without mutation and locking outside the pipeline.
Here is my naive attempt which simply reflects my sequential way of thinking:
Mono.just(input)
.subscribe(i -> {
batches.add(input);
if(batches.size() >= 100) {
// Invoke another reactive pipeline.
// Clear the batch (requires locking in order to be thread safe).
}
});
What's the best way to achieve batching on a streaming source using Reactor?

.buffer(100) or .bufferTimeout(100, Duration.ofSeconds(xxx)) comes to the rescue
Using Flux.buffer or Flux.bufferTimeout, you can gather a fixed number of elements into a List. bufferTimeout additionally emits the current List when the given duration elapses before the List is full.
StepVerifier.create(
Flux.range(0, 1000)
.buffer(100)
)
.expectNextCount(10)
.expectComplete()
.verify();
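To make the grouping concrete, here is a plain-Java model of what buffer(n) hands downstream for a finite source. It is only a sketch of the semantics, not Reactor's implementation, and the class and method names (BufferSketch, buffer) are made up for illustration. The point to notice is that the last, partially filled list is still emitted when the source completes.

```java
import java.util.ArrayList;
import java.util.List;

class BufferSketch {
    // Hypothetical helper: splits a finite source into lists of at most `size`
    // elements, mirroring the lists that Flux.buffer(size) emits.
    static <T> List<List<T>> buffer(List<T> source, int size) {
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>(size);
        for (T element : source) {
            current.add(element);
            if (current.size() == size) {
                batches.add(current);              // a full batch is emitted
                current = new ArrayList<>(size);   // start a fresh batch, no shared state
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);                  // partial batch flushed on completion
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            source.add(i);
        }
        for (List<Integer> batch : buffer(source, 100)) {
            System.out.println("batch of " + batch.size());
        }
    }
}
```

With an unbounded source such as Kafka there is no completion to flush the tail, which is why bufferTimeout is usually the safer choice there.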
Update for the use case
When the input arrives as a single value, for example through the invocation of a method with a parameter:
public void invokeMe(String element);
You can route all data through a UnicastProcessor and let it take care of the batching:
class Batcher {
final UnicastProcessor<String> processor = UnicastProcessor.create();
public void invokeMe(String element) {
processor.sink().next(element);
// or Mono.just(element).subscribe(processor);
}
public Flux<List<String>> listen() {
return processor.bufferTimeout(100, Duration.ofSeconds(5));
}
}
Batcher batcher = new Batcher();
StepVerifier.create(
batcher.listen()
)
.then(() -> Flux.range(0, 1000)
.subscribe(i -> batcher.invokeMe("" + i)))
.expectNextCount(10)
.thenCancel()
.verify();
This example shows how to provide a single entry point for receiving events and then listen to the results of the batching process.
Please note that UnicastProcessor allows only one subscriber, so it suits the model where one party is interested in the batching results and there are many data producers. If you have as many subscribers as producers, you may want to use one of the other processors: DirectProcessor, TopicProcessor, WorkQueueProcessor. In Reactor 3.4 and later, processors are deprecated in favor of the Sinks API. To learn more about Reactor Processors follow the link

How to use/control RxJava Observable.cache

I am trying to use the RxJava (RxJava 2) caching mechanism, but I can't figure out how the cache operator works or how I can control the cached contents.
I want to verify the cached data against some conditions before emitting new data.
For example:
someObservable.
repeat().
filter { it.age < maxAge }.
map { it.name }.
cache()
How can I check and filter the cached value, emit it if the check succeeds, and request a new value if it does not?
Since the value changes periodically, I need to verify that the cache is still valid before requesting a new one.
There is also an ObservableCache<T> class, but I can't find any resources on using it.
Any help would be much appreciated. Thanks.
This is not how replay/cache works. Please read the #replay/#cache documentation first.
replay
This operator returns a ConnectableObservable, which has some methods (#refCount/ #connect/ #autoConnect) for connecting to the source.
When #replay is applied without arguments, the source subscription is multicast and all values emitted since connection are replayed. The source subscription is lazy; connect to the source via #refCount/#connect/#autoConnect.
Returns a ConnectableObservable that shares a single subscription to the underlying ObservableSource that will replay all of its items and notifications to any future Observer.
Applying #replay without any connect method (#refCount/#connect/#autoConnect) will not emit any values on subscription.
A Connectable ObservableSource resembles an ordinary ObservableSource, except that it does not begin emitting items when it is subscribed to, but only when its connect method is called.
replay(1)#autoConnect(-1) / #refCount(1) / #connect
Applying replay(1) caches the last value and emits the cached value on each subscription. #autoConnect(-1) opens a connection immediately and keeps it open until a terminal event (onComplete, onError) happens. #refCount is similar, but disconnects from the source when all subscribers have disappeared. The #connect operator can be used when you need to wait until all subscriptions to the observable have been made, so that no values are missed.
usage
#replay(1) -- most of the time it should be used at the end of the observable chain.
sourceObs
.filter()
.map()
.replay(bufferSize)
.refCount(connectWhenXSubscribersSubscribed)
caution
Applying #replay without a buffer limit or expiration leads to memory leaks when your observable is infinite.
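The effect of the buffer limit can be pictured with a small plain-Java model. This is only a sketch of the idea behind replay(bufferSize), not RxJava code; the class BoundedReplaySketch and its methods are hypothetical. It keeps at most bufferSize values, so memory stays constant however long the stream runs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class BoundedReplaySketch<T> {
    private final int bufferSize;
    private final Deque<T> buffer = new ArrayDeque<>();

    BoundedReplaySketch(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.bufferSize = bufferSize;
    }

    // Called for every value the source emits.
    void onNext(T value) {
        if (buffer.size() == bufferSize) {
            buffer.removeFirst();   // evict the oldest value so memory stays bounded
        }
        buffer.addLast(value);
    }

    // The values a late subscriber would be handed when it subscribes.
    List<T> replayToNewSubscriber() {
        return new ArrayList<>(buffer);
    }

    public static void main(String[] args) {
        BoundedReplaySketch<Integer> replay = new BoundedReplaySketch<>(1);
        for (int i = 1; i <= 3; i++) {
            replay.onNext(i);
        }
        System.out.println(replay.replayToNewSubscriber());
    }
}
```

An unbounded replay corresponds to never evicting, which is where the leak on infinite sources comes from.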
cache / cacheWithInitialCapacity
These operators are similar to #replay with autoConnect(1). They cache every value and replay it on each subscription.
The operator subscribes only when the first downstream subscriber subscribes and maintains a single subscription towards this ObservableSource. In contrast, the operator family of replay() that return a ConnectableObservable require an explicit call to ConnectableObservable.connect().
Note: You sacrifice the ability to dispose the origin when you use the cache Observer so be careful not to use this Observer on ObservableSources that emit an infinite or very large number of items that will use up memory. A possible workaround is to apply takeUntil with a predicate or another source before (and perhaps after) the application of cache().
example
@Test
fun skfdsfkds() {
val create = PublishSubject.create<Int>()
val cacheWithInitialCapacity = create
.cacheWithInitialCapacity(1)
cacheWithInitialCapacity.subscribe()
create.onNext(1)
create.onNext(2)
create.onNext(3)
cacheWithInitialCapacity.test().assertValues(1, 2, 3)
cacheWithInitialCapacity.test().assertValues(1, 2, 3)
}
usage
Use the cache operator when you cannot control the connect phase.
This is useful when you want an ObservableSource to cache responses and you can't control the subscribe/dispose behavior of all the Observers.
caution
As with replay() the cache is unbounded and could lead to memory-leaks.
Note: The capacity hint is not an upper bound on cache size. For that, consider replay(int) in combination with ConnectableObservable.autoConnect() or similar.
further reading
https://blog.danlew.net/2018/09/25/connectable-observables-so-hot-right-now/
https://blog.danlew.net/2016/06/13/multicasting-in-rxjava/
If your event source (Observable) is an expensive operation, such as reading from a database, you shouldn't use Subject to observe the events, since that will repeat the expensive operation for each subscriber. Caching can also be risky with infinite streams due to "OutOfMemory" exceptions. A more appropriate solution may be ConnectableObservable, which only performs the source operation once, and broadcasts the updated value to all subscribers.
Here is a code sample. I didn't bother creating an infinite periodic stream or including error handling to keep the example simple. Let me know if it does what you need.
class RxJavaTest {
private final int maxValue = 50;
private final ConnectableObservable<Integer> source =
Observable.<Integer>create(
subscriber -> {
log("Starting Event Source");
subscriber.onNext(readFromDatabase());
subscriber.onNext(readFromDatabase());
subscriber.onNext(readFromDatabase());
subscriber.onComplete();
log("Event Source Terminated");
})
.subscribeOn(Schedulers.io())
.filter(value -> value < maxValue)
.publish();
void run() throws InterruptedException {
log("Starting Application");
log("Subscribing");
source.subscribe(value -> log("Subscriber 1: " + value));
source.subscribe(value -> log("Subscriber 2: " + value));
log("Connecting");
source.connect();
// Add sleep to give event source enough time to complete
log("Application Terminated");
sleep(4000);
}
private Integer readFromDatabase() throws InterruptedException {
// Emulate long database read time
log("Reading data from database...");
sleep(1000);
int randomValue = new Random().nextInt(2 * maxValue) + 1;
log(String.format("Read value: %d", randomValue));
return randomValue;
}
private static void log(Object message) {
System.out.println(
Thread.currentThread().getName() + " >> " + message
);
}
}
Here's the output:
main >> Starting Application
main >> Subscribing
main >> Connecting
main >> Application Terminated
RxCachedThreadScheduler-1 >> Starting Event Source
RxCachedThreadScheduler-1 >> Reading data from database...
RxCachedThreadScheduler-1 >> Read value: 88
RxCachedThreadScheduler-1 >> Reading data from database...
RxCachedThreadScheduler-1 >> Read value: 42
RxCachedThreadScheduler-1 >> Subscriber 1: 42
RxCachedThreadScheduler-1 >> Subscriber 2: 42
RxCachedThreadScheduler-1 >> Reading data from database...
RxCachedThreadScheduler-1 >> Read value: 37
RxCachedThreadScheduler-1 >> Subscriber 1: 37
RxCachedThreadScheduler-1 >> Subscriber 2: 37
RxCachedThreadScheduler-1 >> Event Source Terminated
Note the following:
Events only start firing once connect() is called on the source, not when observers subscribe to the source.
Database calls are only made once per event update
Filtered values are not emitted to subscribers
All subscribers are executed in the same thread
Application terminates before the events are processed due to concurrency. Normally your app will run in an event loop, so your app will remain responsive during slow operations.
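The first two points can be reproduced without RxJava in a tiny plain-Java model of the publish/connect idea. It is a simplified, single-threaded sketch, not how ConnectableObservable is implemented, and ConnectableSketch is a made-up name. Subscribing only registers an observer, and connect() runs the expensive source once and fans each value out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

class ConnectableSketch<T> {
    private final Supplier<List<T>> source;   // stands in for the expensive read
    private final List<Consumer<T>> subscribers = new ArrayList<>();

    ConnectableSketch(Supplier<List<T>> source) {
        this.source = source;
    }

    // Subscribing only registers the observer; the source is not touched yet.
    void subscribe(Consumer<T> subscriber) {
        subscribers.add(subscriber);
    }

    // connect() runs the source exactly once and broadcasts every value.
    void connect() {
        for (T value : source.get()) {
            for (Consumer<T> subscriber : subscribers) {
                subscriber.accept(value);
            }
        }
    }

    public static void main(String[] args) {
        ConnectableSketch<Integer> source = new ConnectableSketch<>(() -> {
            System.out.println("Reading data from database...");
            return List.of(42, 37);
        });
        source.subscribe(v -> System.out.println("Subscriber 1: " + v));
        source.subscribe(v -> System.out.println("Subscriber 2: " + v));
        source.connect();
    }
}
```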

SubscribeOn does not change the thread pool for the whole chain

I want to trigger a long-running operation via a REST request with WebFlux. The call should just return a notice that the operation has started. I want to run the long-running operation on a different scheduler (e.g. Schedulers.single()). To achieve that I used subscribeOn:
Mono<RecalculationRequested> recalculateAll() {
return provider.size()
.doOnNext(size -> log.info("Size: {}", size))
.doOnNext(size -> recalculate(size))
.map(RecalculationRequested::new);
}
private void recalculate(int toRecalculateSize) {
Mono.just(toRecalculateSize)
.flatMapMany(this::toPages)
.flatMap(page -> recalculate(page))
.reduce(new RecalculationResult(), RecalculationResult::increment)
.subscribeOn(Schedulers.single())
.subscribe(result -> log.info("Result of recalculation - success:{}, failed: {}",
result.getSuccess(), result.getFailed()));
}
private Mono<RecalculationResult> recalculate(RecalculationPage pageToRecalculate) {
return provider.findElementsToRecalculate(pageToRecalculate.getPageNumber(), pageToRecalculate.getPageSize())
.flatMap(this::recalculateSingle)
.reduce(new RecalculationResult(), RecalculationResult::increment);
}
private Mono<RecalculationResult> recalculateSingle(ElementToRecalculate elementToRecalculate) {
return recalculationTrigger.recalculate(elementToRecalculate)
.doOnNext(result -> {
log.info("Finished recalculation for element: {}", elementToRecalculate);
})
.doOnError(error -> {
log.error("Error during recalculation for element: {}", elementToRecalculate, error);
});
}
From the above I want to call:
private void recalculate(int toRecalculateSize)
on a different thread. However, it does not run on the single-thread scheduler; it uses a different thread pool. I would expect subscribeOn to change it for the whole chain. What should I change, and why, to execute it on the single-thread scheduler?
Just to mention - method:
provider.findElementsToRecalculate(...)
uses WebClient to get elements.
One caveat of subscribeOn is it does what it says: it runs the act of "subscribing" on the provided Scheduler. Subscribing flows from bottom to top (the Subscriber subscribes to its parent Publisher), at runtime.
Usually you see in documentation and presentations that subscribeOn affects the whole chain. That is because most operators / sources will not themselves change threads, and by default will start sending onNext/onComplete/onError signals from the thread from which they were subscribed to.
But as soon as one operator switches threads in that top-to-bottom data path, the reach of subscribeOn stops there. Typical example is when there is a publishOn in the chain.
The source of data in this case is reactor-netty and netty, which operate on their own threads and thus act as if there was a publishOn at the source.
For WebFlux, I'd say favor using publishOn in the main chain of operators, or alternatively use subscribeOn inside of inner chains, like inside flatMap.
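The effect described above can be imitated with plain executors. The sketch below is an analogy only, with no Reactor involved, and all names in it (SubscribeOnReachSketch, fetch, the "single" and "io" threads) are hypothetical. The subscription happens on one thread, but the source delivers on a thread of its own, so everything downstream of it runs there.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

class SubscribeOnReachSketch {
    private static ExecutorService namedThread(String name) {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);   // do not keep the JVM alive
            return t;
        });
    }

    static final ExecutorService SINGLE = namedThread("single"); // plays Schedulers.single()
    static final ExecutorService IO = namedThread("io");         // plays the client's own event loop

    // Stand-in for a source that always delivers on its own thread.
    static void fetch(Consumer<String> onNext) {
        IO.execute(() -> onNext.accept("element"));
    }

    // Returns { thread that subscribed, thread that received the value }.
    static String[] run() {
        CompletableFuture<String> subscribedOn = new CompletableFuture<>();
        CompletableFuture<String> deliveredOn = new CompletableFuture<>();
        SINGLE.execute(() -> {
            subscribedOn.complete(Thread.currentThread().getName());
            fetch(value -> deliveredOn.complete(Thread.currentThread().getName()));
        });
        return new String[] { subscribedOn.join(), deliveredOn.join() };
    }

    public static void main(String[] args) {
        String[] threads = run();
        System.out.println("subscribed on: " + threads[0]);
        System.out.println("delivered on:  " + threads[1]);
    }
}
```

Moving the downstream work back onto the desired thread requires an explicit hop after the source, which is the role publishOn plays in Reactor.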
As per the documentation, operators prefixed with doOn are sometimes referred to as having a "side-effect". They let you peek inside the sequence's events without modifying them.
If you want to chain the 'recalculate' step after 'provider.size()', do it with flatMap.

Use kafka to detect changes on values

I have a streaming application that continuously takes in a stream of coordinates along with some custom metadata that also includes a bitstring. This stream is produced onto a Kafka topic using the Producer API. Now another application needs to process this stream (Streams API), store a specific bit from the bitstring, and generate alerts when this bit changes.
Below is the continuous stream of messages that need to be processed
{"device_id":"1","status_bit":"0"}
{"device_id":"2","status_bit":"1"}
{"device_id":"1","status_bit":"0"}
{"device_id":"3","status_bit":"1"}
{"device_id":"1","status_bit":"1"} // need to generate alert with change: 0->1
{"device_id":"3","status_bit":"1"}
{"device_id":"2","status_bit":"1"}
{"device_id":"3","status_bit":"0"} // need to generate alert with change 1->0
Now I would like to write these alerts to another kafka topic like
{"device_id":1,"init":0,"final":1,"timestamp":"somets"}
{"device_id":3,"init":1,"final":0,"timestamp":"somets"}
I can save the current bit in the state store using something like
streamsBuilder
.stream("my-topic")
.mapValues((key, value) -> value.getStatusBit())
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
.reduce((oldAggValue, newMessageValue) -> newMessageValue, Materialized.as("bit-temp-store"));
but I don't understand how to detect a change relative to the existing bit. Do I need to query the state store somehow inside the processor topology? If yes, how? If no, what else could be done?
Any suggestions/ideas that I can try(maybe completely different from what I am thinking) are also appreciated. I am new to Kafka and thinking in terms of event driven streams is eluding me.
Thanks in advance.
I am not sure this is the best approach, but in a similar task I used an intermediate entity to capture the state change. In your case it would be something like
streamsBuilder.stream("my-topic").groupByKey()
.aggregate(DeviceState::new, new Aggregator<String, Device, DeviceState>() {
public DeviceState apply(String key, Device newValue, DeviceState state) {
// Recompute the flag on every record; otherwise it stays true after the first change.
state.setChanged(!newValue.getStatusBit().equals(state.getStatusBit()));
state.setStatusBit(newValue.getStatusBit());
state.setDeviceId(newValue.getDeviceId());
state.setKey(key);
return state;
}
}, TimeWindows.of(…) …).filter((s, t) -> (t.changed())).toStream();
In the resulting topic you will have the changes. You can also add some attributes to DeviceState to initialise it first, depending whether you want to send the event, when the first device record arrives, etc.
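The per-key logic of such an aggregator can be tried out in isolation. The following is a plain-Java sketch that uses a HashMap in place of the state store and the sample messages from the question as input. It is not Kafka Streams code, and BitChangeSketch and its alert format are assumptions made for illustration. Here the first record for a device does not raise an alert.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BitChangeSketch {
    // Stand-in for the state store: last seen status bit per device id.
    private final Map<String, String> lastBit = new HashMap<>();

    // Returns an alert such as "1:0->1" when the bit changed, or null otherwise.
    String process(String deviceId, String statusBit) {
        String previous = lastBit.put(deviceId, statusBit);
        if (previous != null && !previous.equals(statusBit)) {
            return deviceId + ":" + previous + "->" + statusBit;
        }
        return null;   // first record for this device, or no change
    }

    // Feeds {deviceId, statusBit} pairs through a fresh detector and collects the alerts.
    static List<String> alertsFor(String[][] records) {
        BitChangeSketch detector = new BitChangeSketch();
        List<String> alerts = new ArrayList<>();
        for (String[] record : records) {
            String alert = detector.process(record[0], record[1]);
            if (alert != null) {
                alerts.add(alert);
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"1", "0"}, {"2", "1"}, {"1", "0"}, {"3", "1"},
            {"1", "1"}, {"3", "1"}, {"2", "1"}, {"3", "0"}
        };
        System.out.println(alertsFor(records));
    }
}
```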

Function not executing properly after subscribe

I have a Mono on which I call doOnSuccess. Inside that callback I save the data to the DB (Couchbase, using ReactiveCouchbaseRepository). After that, I am not getting any logs for LINE 1 and LINE 2.
It works fine if I do not save the object, meaning I get the log for LINE 2.
Mono<User> result = context.getPayload(User.class);
result.doOnSuccess( user -> {
System.out.println("############I got the user"+user);
userRepository.save(user).doOnSuccess(user2->{
System.out.println("user saved"); // LINE 1
}).subscribe();
System.out.println("############"+user); // LINE2
}).subscribe();
Your code snippet is breaking a few rules you should follow closely:
You should not call subscribe from within a method/lambda that returns a reactive type such as Mono or Flux; this decouples the execution from the main task while both still operate on the same shared state. This often leads to issues because two parties end up trying to read the same stream. It's a bit like creating two separate threads that both try to read from the same stream.
You should not do I/O operations in doOnXYZ operators. Those are "side-effect" operators, meaning they are useful for things like logging or incrementing counters.
What you should do instead is chain Reactor operators to create a single reactive pipeline and return the reactive type for the final client to subscribe to. In a Spring WebFlux application, the HTTP clients (through the WebFlux engine) are the ones subscribing.
Your code snippet could look like this:
Mono<User> result = context.getPayload(User.class)
.doOnSuccess(user -> System.out.println("############Received user "+user))
.flatMap(user -> userRepository.save(user))
.doOnSuccess(user -> System.out.println("############ Saved "+user));
return result;

Callback when the query has finished processing in Siddhi

I am writing a small CEP program using Siddhi. I can add a callback whenever a given filter outputs a data like this
executionPlanRuntime.addCallback("query1", new QueryCallback() {
@Override
public void receive(long timeStamp, Event[] inEvents, Event[] removeEvents) {
EventPrinter.print(inEvents);
System.out.println("data received after processing");
}
});
But is there a way to know that the filter has finished processing and won't trigger the callback above any more? Something like didFinish. I think that would be the ideal place for shutting down the SiddhiManager and ExecutionPlanRuntime instances.
No. There is no such functionality, and it can't be supported in the future either. The rationale is that in real-time stream processing, queries process the incoming stream and emit an output stream. There is no concept of 'finished processing'; a query keeps processing events as long as there is input.
Since your requirement is to shut down SiddhiManager and ExecutionPlanRuntime, the recommended way is to do this inside some cleanup method of your program. Alternatively, you can write some Java code inside the callback to count responses, or wait for a timeout, and then call shutdown. Hope this helps!
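The "count responses or wait for a timeout" idea could look roughly like the sketch below. It uses a CountDownLatch from the JDK and a hypothetical Callback interface in place of Siddhi's QueryCallback, and the producer thread merely simulates events arriving. It assumes you know how many output events to expect, which is often not the case for a real stream.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class ShutdownAfterCountSketch {
    // Hypothetical stand-in for a query callback; not Siddhi's QueryCallback.
    interface Callback {
        void receive(String event);
    }

    // Delivers `sent` events to a callback and waits until `expected` of them
    // have arrived or the timeout has passed. Returns true if the expected
    // count was reached before the timeout.
    static boolean awaitEvents(int sent, int expected, long timeoutMillis) {
        CountDownLatch remaining = new CountDownLatch(expected);
        Callback callback = event -> remaining.countDown();

        Thread producer = new Thread(() -> {
            for (int i = 0; i < sent; i++) {
                callback.receive("event-" + i);
            }
        });
        producer.setDaemon(true);
        producer.start();

        try {
            return remaining.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        boolean complete = awaitEvents(3, 3, 2000);
        // Either way, this is the point to shut down the runtime and the manager.
        System.out.println("all expected events received: " + complete);
    }
}
```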