I would like help from those more experienced with Kafka Streams in Java in deciding between two paths I can follow. I have two working Java apps that take an inbound stream of integers, perform various calculations and tasks, and create four resultant outbound streams to different topics. The actual calculations/tasks are not important; I am concerned with the two possible methods I could use to define the handler that performs the math, and with any risks associated with my favorite.
Method 1 uses a separately defined function whose return type is Iterable (a List in practice).
Method 2 uses the more common inline style that places the function body within the KStream declaration.
I am very new to Kafka Streams and do not want to head down the wrong path. I like Method 1 because the code is very readable, easy to follow, and the handlers can be tested offline without needing to drive traffic through the streams.
Method 2 seems more common, but as the complexity grows, the code pollutes main(). Additionally, I am boxed in to testing the algorithms with stream traffic, which slows development.
Method 1: Separable handlers (partial):
// Take inbound stream from math-input and perform transformations A-D, then write out to 4 streams.
KStream<String, String> source = src_builder.stream("math-input");
source.flatMapValues(value -> transformInput_A(Arrays.asList(value.split("\\W+")))).to("math-output-A");
source.flatMapValues(value -> transformInput_B(Arrays.asList(value.split("\\W+")))).to("math-output-B");
source.flatMapValues(value -> transformInput_C(Arrays.asList(value.split("\\W+")))).to("math-output-C");
source.flatMapValues(value -> transformInput_D(Arrays.asList(value.split("\\W+")))).to("math-output-D");
// More code here, removed for brevity.
// Transformation handlers A, B, C, and D.
// ******************************************************************
// Perform data transformation using method A
public static Iterable<String> transformInput_A(List<String> str_array) {
    List<String> math_results = new ArrayList<>();
    // Imagine some very complex math here using the integer
    // values. This could be 50+ lines of code.
    for (int i = 0; i < str_array.size(); i++) {
        // grab values and perform ops
    }
    // Return results in string format
    return math_results;
}
// End of Transformation Method A
// ******************************************************************
// Imagine similar handlers for methods B, C, and D below.
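For what it's worth, the offline testability is easy to demonstrate: because the handler is a plain static method, it can be exercised with no broker, topology, or stream traffic at all. A minimal sketch, assuming JUnit 5 on the classpath and a hypothetical MathTopology class holding the handlers (the expected values depend on your actual math):
import static org.junit.jupiter.api.Assertions.assertNotNull;

import java.util.Arrays;
import org.junit.jupiter.api.Test;

class TransformInputATest {

    @Test
    void transformsSimpleInput() {
        // Hypothetical sample input; no Kafka involved anywhere.
        Iterable<String> results = MathTopology.transformInput_A(Arrays.asList("1", "2", "3"));

        // Replace with assertions against your real expected results.
        assertNotNull(results);
    }
}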
Method 2: Handlers internal to KStream declaration (partial):
// Take inbound stream from math-input and perform transformations A-D, then write out to 4 streams.
KStream<String, String> inputStream = src_builder.stream("math-input");
KStream<String, String> outputStream_A = inputStream.mapValues(new ValueMapper<String, String>() {
    @Override
    public String apply(String s) {
        String[] str_array = s.split("\\W+");
        String math_results = "";
        // Imagine some very complex math here using the integer
        // values. This could be 50+ lines of code.
        for (int i = 0; i < str_array.length; i++) {
            // grab values and perform ops
        }
        // Return results in string format
        return math_results;
    }
});
// Send the data to the outbound topic A.
outputStream_A.to("math-output-A");
KStream<String, String> outputStream_B ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_B.to("math-output-B");
KStream<String, String> outputStream_C ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_C.to("math-output-C");
KStream<String, String> outputStream_D ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_D.to("math-output-D");
Other than my desire to keep main() neat and push the complexity out of view, am I heading in the wrong direction with Method 1?
Related
How to compare a List to a Flux in a non-blocking way
Below is the code in a blocking way:
public static void main(String[] args) {
List<String> all = List.of("A", "B", "C", "D");
Flux<String> valid = Flux.just("A", "D");
Map<Boolean, List<String>> collect = all.stream()
.collect(Collectors.groupingBy(t -> valid.collectList().block().contains(t)));
System.out.println(collect.get(Boolean.TRUE));
System.out.println(collect.get(Boolean.FALSE));
}
How do I get it working in a non-blocking way?
Above is an example of what I am trying to do in a web application. I receive a list of objects, List<String> all. Then I query the database, which returns a Flux<String>. The Flux returned by the database will be a subset of List all. I need to prepare two lists: the items that are present in the valid Flux, and the items that are not.
EDIT:
I converted the Flux to a Mono and the List to a Mono:
public static void main(String[] args) {
Mono<List<String>> all = Mono.just(List.of("A", "B", "C", "D"));
Mono<List<String>> valid = Mono.just(List.of("A", "D"));
var exist = all.flatMap(a -> valid.map(v -> a.stream().collect(Collectors.groupingBy(v::contains))));
System.out.println(exist.block().get(Boolean.TRUE));
System.out.println(exist.block().get(Boolean.FALSE));
}
There is no straightforward way of achieving this in reactive programming without breaking some of its semantics.
If you reflect on what reactive programming tries to achieve and on your problem statement, you should notice that the two don't play well together.
Reactive programming, as the name suggests, is about reacting to events, which in your case would be valid items emitted from your datastore. In a typical situation, you would program your statement to compute some assertions around the emitted valid items and then emit these (or some other transformations) downstream. Unfortunately, you won't be able to compute the intersection and difference of the all and valid items without stopping at some point (otherwise, how would you know that an item you assumed non-valid is not emitted at some later point by the valid publisher?).
To achieve the desired behavior, then, you will have to lean on memory to buffer items and then trigger your validations.
Retrieving the valid items is achievable using the filterWhen operator paired with hasElement:
Flux<String> validItems = Flux.fromIterable(all)
.filterWhen(valid::hasElement);
To retrieve the invalid items, you can collect all and validItems merged together, then keep only the elements that appear exactly once:
Flux<String> inValidItems = Flux.fromIterable(all)
        .mergeWith(validItems)
        .collectList()
        .flatMapIterable(list -> list.stream()
                .filter(item -> Collections.frequency(list, item) == 1)
                .collect(Collectors.toList()));
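Putting the two snippets together, here is a minimal self-contained sketch (assuming only reactor-core on the classpath) that prints both partitions:
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

import reactor.core.publisher.Flux;

public class PartitionDemo {
    public static void main(String[] args) {
        List<String> all = List.of("A", "B", "C", "D");
        Flux<String> valid = Flux.just("A", "D");

        // Valid items: keep each element of all that valid emits.
        Flux<String> validItems = Flux.fromIterable(all)
                .filterWhen(valid::hasElement);

        // Invalid items: merge all with the valid items and keep the
        // elements that appear exactly once.
        Flux<String> inValidItems = Flux.fromIterable(all)
                .mergeWith(validItems)
                .collectList()
                .flatMapIterable(list -> list.stream()
                        .filter(item -> Collections.frequency(list, item) == 1)
                        .collect(Collectors.toList()));

        validItems.collectList().subscribe(v -> System.out.println("valid: " + v));     // valid: [A, D]
        inValidItems.collectList().subscribe(v -> System.out.println("invalid: " + v)); // invalid: [B, C]
    }
}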
Sorry, I am new to the reactive paradigm. Is it possible to use an AtomicReference to get the value of a Mono, since reactive code can run asynchronously and different events can run on different threads? Please see the sample below. I am also not sure whether this piece of code is considered reactive.
sample code:
public static void main(String[] a) {
AtomicReference<UserDTO> dto = new AtomicReference<>();
Mono.just(new UserDTO())
.doOnNext(d -> d.setUserId(123L))
.subscribe(d -> dto.set(d));
UserDTO result = dto.get();
dto.set(null);
System.out.println(result); // produce UserDTO(userId=123)
System.out.println(dto.get()); // produce null
}
The code snippet you have shared is not guaranteed to always work. There is no way to guarantee that the function inside doOnNext will run before dto.get(); you have created a race condition.
You can run the following code to simulate this.
AtomicReference<UserDTO> dto = new AtomicReference<>();
Mono.just(new UserDTO())
.delayElement(Duration.ofSeconds(1))
.doOnNext(d -> d.setUserId(123L))
.subscribe(dto::set);
UserDTO result = dto.get();
System.out.println(result); // produces null
To make this example fully reactive, you should print inside the subscribe operator:
Mono.just(new UserDTO())
.doOnNext(d -> d.setUserId(123L))
    .subscribe(System.out::println);
In a more "real world" example, your method would return a Mono<UserDTO>, and you would then perform transformations on it using the map or flatMap operators.
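As an illustration of that shape, here is a minimal sketch (UserDTO is the hypothetical class from the question; findUser is a made-up service method): the method returns the Mono, and the caller stays inside the chain instead of extracting the value:
import reactor.core.publisher.Mono;

public class UserService {

    // Return the publisher instead of trying to pull the value out of it.
    public Mono<UserDTO> findUser() {
        return Mono.just(new UserDTO())
                .doOnNext(d -> d.setUserId(123L));
    }

    public static void main(String[] args) {
        // The caller transforms and consumes inside the chain.
        new UserService().findUser()
                .map(UserDTO::getUserId)
                .subscribe(System.out::println); // prints 123
    }
}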
** EDIT **
If you are looking to make a blocking call within a reactive stream, this previous Stack Overflow question contains a good answer.
I am new to Spring WebFlux, and my current Spring Boot application uses a scheduler (annotated with @Scheduled) to read a list of data from the DB, call a REST API concurrently in batches, and then write to an event stream.
I want to achieve the same in Spring WebFlux.
Should I use @Scheduled or use schedulePeriodically from WebFlux?
How can I batch items from the DB into smaller sets (say 10 items) and concurrently call the REST API?
At present the app fetches at most 100 records in one scheduler run and then processes them. I am planning to shift to R2DBC; if I do so, can I limit the flow of data to, say, 100 records?
Thanks
1. Should I use @Scheduled or use schedulePeriodically from WebFlux?
@Scheduled is an annotation that is part of the Spring Framework's scheduling package, while schedulePeriodically is a function that is part of Reactor, so you can't really compare the two. I don't see any problem in using the annotation, since it is part of the core framework.
2. How can I batch items from DB into smaller sets (say 10 items) and concurrently call rest api?
By using the Flux#buffer functions, which emit a list of items whenever the buffer is full.
Flux.just("1", "2", "3", "4")
.buffer(2)
.doOnNext(list -> {
System.out.println(list.size());
}).subscribe()
Will print 2 each time.
3. At present the app fetches max 100 records in one scheduler run and then processes them. I am planning to shift to R2DBC; if I do so, can I limit the flow of data to 100?
Well, you can, as written before: you fetch, then buffer the responses into lists of 100. You can then place each list in its own Flux and emit the items again, or process each list of 100 items; up to you.
There are a lot of functions in the buffer family; check them out.
Flux#buffer
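As a concrete illustration of buffering plus concurrent calls, here is a minimal runnable sketch (assuming only reactor-core; Flux.range stands in for the DB query and the delayed Mono stands in for the REST call):
import java.time.Duration;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class BatchDemo {
    public static void main(String[] args) throws InterruptedException {
        Flux.range(1, 100)                    // stand-in for the DB query
                .buffer(10)                   // batches of 10 items
                .flatMap(batch -> Flux.fromIterable(batch)
                        // stand-in for a concurrent REST call per item
                        .flatMap(item -> Mono.just("processed " + item)
                                .delayElement(Duration.ofMillis(50))))
                .subscribe(System.out::println);

        Thread.sleep(2000); // crude wait so the async pipeline can finish in this demo
    }
}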
Flux.buffer will combine the stream's elements and emit lists of the given buffer size.
For batching purposes, you can also use Flux.expand or Mono.expand. You only have to provide a condition in expand to execute it again or to finally end it.
Here are the examples:
public static void main(String[] args) {
    List<String> list = new ArrayList<>();
    list.add("1");

    Flux.just(list)
        .buffer(2)
        .doOnNext(ls -> {
            System.out.println(ls.getClass());
            // Buffering a list returns a list of lists of String
            System.out.println(ls);
        }).subscribe();

    Flux.just(list)
        .expand(listObj -> {
            // Condition to finally end the batch
            if (listObj.size() > 4) {
                return Flux.empty();
            }
            // Can return as much data as you require
            list.add("a");
            return Flux.just(listObj);
        }).map(ls -> {
            // Here it returns the List<String> that was the original object
            // type, not a list of lists as in the buffer case
            System.out.println(ls.getClass());
            System.out.println(ls);
            return ls;
        }).subscribe();
}
Output:
class java.util.ArrayList
[[1]] // output of buffer: a list of lists
class java.util.ArrayList
[1]
class java.util.ArrayList
[1, a]
class java.util.ArrayList
[1, a, a]
class java.util.ArrayList
[1, a, a, a]
class java.util.ArrayList
[1, a, a, a, a]
I'm an RxJava newcomer, and I'm having some trouble wrapping my head around how to do the following.
I'm using Retrofit to invoke a network request that returns me a Single<Foo>, which is the type I ultimately want to consume via my Subscriber instance (call it SingleFooSubscriber).
Foo has an internal property, items, typed as List<String>.
If Foo.items is not empty, I would like to invoke separate, concurrent network requests for each of its values. (The actual results of these requests are inconsequential for SingleFooSubscriber, as the results will be cached externally.)
SingleFooSubscriber.onComplete() should be invoked only when Foo and all Foo.items have been fetched.
fetchFooCall
.subscribeOn(Schedulers.io())
// Approach #1...
// the idea here would be to "merge" the results of both streams into a single
// reactive type, but I'm not sure how this would work given that the item emissions
// could be far greater than one. Using zip here, I don't think it would ever
// complete.
.flatMap { foo ->
if(foo.items.isNotEmpty()) {
Observable.zip(
Observable.fromIterable(foo.items),
Observable.just(foo),
{ source1, source2 ->
// hmmmm...
}
).toSingle()
} else {
Single.just(foo)
}
}
// ...or Approach #2...
// I think this would result in the streams for Foo and items being handled sequentially,
// which is not really ideal because
// 1) I think it would entail nested streams (I get the feeling I should be using flatMap
//    instead)
// 2) and I'm not sure SingleFooSubscriber.onComplete() would depend on the completion of
//    the stream for items
.doOnSuccess { data ->
if(data.items.isNotEmpty()) {
// hmmmm...
}
}
.observeOn(AndroidSchedulers.mainThread())
.subscribe(
{ data -> /* onSuccess() */ },
{ error -> /* onError() */ }
)
Any thoughts on how to approach this would be greatly appreciated!
Bonus points: in trying to come up with a solution to this, I've begun to question the decision to use the Single reactive type vs. the Observable reactive type. Most (all, except this one Foo.items case?) of my streams actually revolve around consuming a single instance of something, so I leaned toward Single to represent my streams, as I thought it would add some semantic clarity to the code. Does anybody have any general guidance around when to use one vs. the other?
You need to nest flatMaps and then convert back to Single:
retrofit.getMainObject()
    .flatMap(v ->
        Flowable.fromIterable(v.items)
            .flatMap(w ->
                retrofit.getItem(w.id).doOnNext(x -> w.property = x)
            )
            .ignoreElements()      // discard the per-item results (they are cached externally)
            .toSingleDefault(v)    // re-emit the original Foo once every inner request completes
    )
Because toSingleDefault(v) only fires after ignoreElements() sees every inner request complete, onComplete() is invoked only when Foo and all Foo.items have been fetched, which is exactly the contract you asked for.
I'm experimenting with Reactive Extensions on various platforms, and one thing that annoys me a bit are the glitches.
Even though for UI code these glitches might not be that problematic, and usually one can find an operator that works around them, I still find debugging code more difficult in the presence of glitches: the intermediate results are not important to debug, but my mind does not know when a result is intermediate or "final".
Having worked a bit with pure functional FRP in Haskell and synchronous data-flow systems, it also 'feels' wrong, but that is of course subjective.
But when hooking RX to non-UI actuators (like motors or switches), I think glitches are more problematic. How would one make sure that only the correct value is sent to the external actuators?
Maybe this can be solved by some 'dispatcher' that knows when some 'external sensor' fired the initiating event, so that all internal events are handled before forwarding the final result(s) to the actuators. Something like described in the flapjax paper.
The question(s) I hope to get answers for are:
Is there something in RX that makes fixing glitches for synchronous notifications impossible?
If not, does a (preferably production-quality) library or approach exist for RX that fixes synchronous glitches? Especially for single-threaded JavaScript this might make sense?
If no general solution exists, how would RX be used to control external sensors/actuators without glitches at the actuators?
Let me give an example
Suppose I want to print a sequence of tuples (a,b) where the contract is
a = n
b = 10 * floor(n / 10)
where n is the natural number stream 0, 1, 2, ...
So I expect the following sequence
(a=0, b=0)
(a=1, b=0)
(a=2, b=0)
...
(a=9, b=0)
(a=10, b=10)
(a=11, b=10)
...
In RX, to make things more interesting, I will use filter for computing the b stream
var n = Observable
.Interval(TimeSpan.FromSeconds(1))
.Publish()
.RefCount();
var a = n.Select(t => "a=" + t);
var b = n.Where(t => t % 10 == 0)
.Select(t => "b=" + t);
var ab = a.CombineLatest(b, Tuple.Create);
ab.Subscribe(Console.WriteLine);
This gives what I believed to be a glitch (temporary violation of the invariant/contract):
(a=0, b=0)
(a=1, b=0)
(a=2, b=0)
...
(a=10, b=0) <-- glitch?
(a=10, b=10)
(a=11, b=10)
I realize that this is the correct behavior of CombineLatest, but I also thought this was called a glitch because in a real pure FRP system, you do not get these intermediate-invariant-violating results.
Note that in this example, I would not be able to use Zip, and also WithLatestFrom would give an incorrect result.
Of course I could just simplify this example into one monadic computation, never multicasting the n stream (this would mean not being able to filter, only map), but that's not the point: IMO in RX you always get a 'glitch' whenever you split and rejoin an observable stream:
s
/ \
a b
\ /
t
For example, in FlapJAX you don't get these problems.
Does any of this make sense?
Thanks a lot,
Peter
Update: Let me try to answer my own question in an RX context.
First of all, it seems my understanding of what a "glitch" is, was wrong. From a pure FRP standpoint, what looked like glitches in RX to me, seems actually correct behavior in RX.
So I guess that in RX we need to be explicit about the "time" at which we expect to actuate values combined from sensors.
In my own example, the actuator is the console, and the sensor the interval n.
So if I change my code
ab.Subscribe(Console.WriteLine);
into
ab.Sample(n).Subscribe(Console.WriteLine);
then only the "correct" values are printed.
This does mean that when we get an observable sequence that combines values from sensors, we must know all the original sensors, merge them all, and sample the values with that merged signal before sending any values to the actuators...
So an alternative approach would be to "lift" IObservable into a "Sensed" structure that remembers and merges the originating sensors, for example like this:
public struct Sensed<T>
{
public IObservable<T> Values;
public IObservable<Unit> Sensors;
public Sensed(IObservable<T> values, IObservable<Unit> sensors)
{
Values = values;
Sensors = sensors;
}
public IObservable<Unit> MergeSensors(IObservable<Unit> sensors)
{
return sensors == Sensors ? Sensors : Sensors.Merge(sensors);
}
public IObservable<T> MergeValues(IObservable<T> values)
{
return values == Values ? Values : Values.Merge(values);
}
}
And then we must port all the RX methods to this "Sensed" structure:
public static class Sensed
{
public static Sensed<T> Sensor<T>(this IObservable<T> source)
{
var hotSource = source.Publish().RefCount();
return new Sensed<T>(hotSource, hotSource.Select(_ => Unit.Default));
}
public static Sensed<long> Interval(TimeSpan period)
{
return Observable.Interval(period).Sensor();
}
public static Sensed<TOut> Lift<TIn, TOut>(this Sensed<TIn> source, Func<IObservable<TIn>, IObservable<TOut>> lifter)
{
return new Sensed<TOut>(lifter(source.Values), source.Sensors);
}
public static Sensed<TOut> Select<TIn, TOut>(this Sensed<TIn> source, Func<TIn, TOut> func)
{
return source.Lift(values => values.Select(func));
}
public static Sensed<T> Where<T>(this Sensed<T> source, Func<T, bool> func)
{
return source.Lift(values => values.Where(func));
}
public static Sensed<T> Merge<T>(this Sensed<T> source1, Sensed<T> source2)
{
return new Sensed<T>(source1.MergeValues(source2.Values), source1.MergeSensors(source2.Sensors));
}
public static Sensed<TOut> CombineLatest<TIn1, TIn2, TOut>(this Sensed<TIn1> source1, Sensed<TIn2> source2, Func<TIn1, TIn2, TOut> func)
{
return new Sensed<TOut>(source1.Values.CombineLatest(source2.Values, func), source1.MergeSensors(source2.Sensors));
}
public static IDisposable Actuate<T>(this Sensed<T> source, Action<T> next)
{
return source.Values.Sample(source.Sensors).Subscribe(next);
}
}
My example then becomes:
var n = Sensed.Interval(TimeSpan.FromMilliseconds(100));
var a = n.Select(t => "a=" + t);
var b = n.Where(t => t % 10 == 0).Select(t => "b=" + t);
var ab = a.CombineLatest(b, Tuple.Create);
ab.Actuate(Console.WriteLine);
And again only the "desired" values are passed to the actuator, but with this design, the originating sensors are remembered in the Sensed structure.
I'm not sure if any of this makes "sense" (pun intended); maybe I should just let go of my desire for pure FRP and live with it. After all, time is relative ;-)
Peter