Why is the Subscriber requesting a different number of elements in different cases? - publish-subscribe

I am learning reactive streams and the publish-subscribe utility, and I am using the default behaviour of the Publisher (Flux in my case) and Subscriber.
I have two scenarios, both having the same number of elements in the Flux. But when I analyse the logs, the onSubscribe method requests a different number of elements in each case: in one case it requests an unbounded number of elements, and in the other it requests 32 elements.
Here are the two cases and the logs:
System.out.println("*********Calling MapData************");
List<Integer> elements = new ArrayList<>();
Flux.just(1, 2, 3, 4)
.log()
.map(i -> i * 2)
.subscribe(elements::add);
//printElements(elements);
System.out.println("-------------------------------------");
System.out.println("Inside Combine Streams");
List<Integer> elems = new ArrayList<>();
Flux.just(10,20,30,40)
.log()
.map(x -> x * 2)
.zipWith(Flux.range(0, Integer.MAX_VALUE),
(two, one) -> String.format("First : %d, Second : %d \n", one, two))
.subscribe(new Consumer<String>() {
#Override
public void accept(String s) {
}
});
System.out.println("-------------------------------------");
and here are the logs:
*********Calling MapData************
[warn] LoggerFactory has not been explicitly initialized. Default system-logger will be used. Please invoke StaticLoggerBinder#setLog(org.apache.maven.plugin.logging.Log) with Mojo's Log instance at the early start of your Mojo
[info] | onSubscribe([Synchronous Fuseable] FluxArray.ArraySubscription)
[info] | request(unbounded)
[info] | onNext(1)
[info] | onNext(2)
[info] | onNext(3)
[info] | onNext(4)
[info] | onComplete()
-------------------------------------
Inside Combine Streams
[info] | onSubscribe([Synchronous Fuseable] FluxArray.ArraySubscription)
[info] | request(32)
[info] | onNext(10)
[info] | onNext(20)
[info] | onNext(30)
[info] | onNext(40)
[info] | onComplete()
[info] | cancel()
-------------------------------------
As I have not used any customised Subscriber implementation, why does the "MapData" case log "[info] | request(unbounded)" while the "Inside Combine Streams" case logs "[info] | request(32)"?
Please suggest.

First, you should know that this is the expected behavior.
Depending on the operators you're using, Reactor will apply different prefetching strategies:
some operators use default values like 32 or 256
some arrangements use the value you provided, if you added a buffering operator with a specific value
when Reactor can guess that the stream of values is finite, it requests an unbounded amount
In your second snippet, the 32 comes from the zip stage, whose default prefetch is 32.
You can always change this behavior by using the operator variants that take an int prefetch argument, or by implementing your own Subscriber with BaseSubscriber (which provides several useful methods for that), as sketched below.
The bottom line is that you often don't need to pay attention to that particular value; it can only be useful if you want to optimize that prefetch strategy for a particular data source.
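For illustration, here is a minimal sketch of the BaseSubscriber approach (assuming Reactor 3 on the classpath; the class name PrefetchDemo and the requested amounts are made up for the example):

import org.reactivestreams.Subscription;
import reactor.core.publisher.BaseSubscriber;
import reactor.core.publisher.Flux;

public class PrefetchDemo {
    public static void main(String[] args) {
        Flux.just(1, 2, 3, 4)
            .log()
            .map(i -> i * 2)
            .subscribe(new BaseSubscriber<Integer>() {
                @Override
                protected void hookOnSubscribe(Subscription subscription) {
                    // Take over the initial demand: request 2 instead of unbounded.
                    request(2);
                }

                @Override
                protected void hookOnNext(Integer value) {
                    // Keep pulling one element at a time after each delivery.
                    request(1);
                }
            });
    }
}

With such a subscriber, the log shows request(2) once at subscription time and request(1) after each onNext, instead of request(unbounded).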

Apache Beam (Python SDK): late (or early) events discarded by triggers. How to know how many were discarded and why?

I have a streaming pipeline connected to a PubSub subscription (with around 2 million elements every hour). I need to collect the elements in groups and then extract some information.
def expand(self, pcoll):
    return (
        pcoll
        | beam.WindowInto(FixedWindows(10),
                          trigger=AfterAny(AfterCount(2000), AfterProcessingTime(30)),
                          allowed_lateness=10,
                          trigger=AfterAny(
                              AfterCount(8000),
                              AfterProcessingTime(30),
                              AfterWatermark(
                                  early=AfterProcessingTime(60),
                                  late=AfterProcessingTime(60)
                              )
                          ),
                          allowed_lateness=60 * 60 * 24,
                          accumulation_mode=AccumulationMode.DISCARDING)
        | "Group by Key" >> beam.GroupByKey()
    )
I try my best to NOT miss any data, but I found out that I have around 4% missing data.
As you can see in the code, I trigger any time I hit 8k elements or every 30 seconds.
I allow lateness of 1 day, and it should trigger whether the pipeline is analyzing early or late events.
Still missing that 4%, though. So, is there a way to know whether the pipeline is discarding some data? How many elements? And for which reason?
Thank you so much in advance
First, I see you have two triggers in the sample code; I assume this is a typo, though.
It looks like you are dropping elements because you are not using Repeatedly, so all elements after the first trigger firing get lost. There's an official doc on this from Beam.
Allow me to post an example:
test_stream = (TestStream()
               .add_elements([
                   TimestampedValue('in_time_1', 0),
                   TimestampedValue('in_time_2', 0)])
               .advance_watermark_to(9)
               .advance_processing_time(9)
               .add_elements([TimestampedValue('late_but_in_window', 8)])
               .advance_watermark_to(10)
               .advance_processing_time(10)
               .add_elements([TimestampedValue('in_time_window2', 12)])
               .advance_watermark_to(20)  # Past window time
               .advance_processing_time(20)
               .add_elements([TimestampedValue('late_window_closed', 9),
                              TimestampedValue('in_time_window2_2', 12)])
               .advance_watermark_to_infinity())

class RecordFn(beam.DoFn):
    def process(self,
                element=beam.DoFn.ElementParam,
                timestamp=beam.DoFn.TimestampParam):
        yield ("key", (element, timestamp))

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with TestPipeline(options=options) as p:
    records = (p | test_stream
                 | beam.ParDo(RecordFn())
                 | beam.WindowInto(FixedWindows(10),
                                   allowed_lateness=0,
                                   # trigger=trigger.AfterCount(1),
                                   trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                                   accumulation_mode=trigger.AccumulationMode.DISCARDING)
                 | beam.GroupByKey()
                 | beam.Map(print)
               )
If we have the trigger trigger.Repeatedly(trigger.AfterCount(1)), all elements are fired as they come, with no dropped element (except late_window_closed, which is expected, as it was late):
('key', [('in_time_1', Timestamp(0)), ('in_time_2', Timestamp(0))]) # these two are together since they arrived together
('key', [('late_but_in_window', Timestamp(8))])
('key', [('in_time_window2', Timestamp(12))])
('key', [('in_time_window2_2', Timestamp(12))])
If we use trigger.AfterCount(1) (no Repeatedly), we only get the first pane fired in each window:
('key', [('in_time_1', Timestamp(0)), ('in_time_2', Timestamp(0))])
('key', [('in_time_window2', Timestamp(12))])
Note that both in_time_1 and in_time_2 appear in the first fired pane because they arrived at the same time (0); had one of them arrived later, it would have been dropped.

Expensive flatMap() operation on streams originating from Stream.emits()

I just encountered an issue with degrading fs2 performance using a stream of strings to be written to a file via text.utf8encode. I tried to change my source to use chunked strings to increase performance, but I observed performance degradation instead.
As far as I can see, it boils down to the following: invoking flatMap on a stream that originates from Stream.emits() can be very expensive. Time usage seems to grow exponentially with the size of the sequence passed to Stream.emits(). The code snippet below shows an example:
/*
Test done with scala 2.11.11 and fs2 version 0.10.0-M7.
*/
val rangeSize = 20000
val integers = (1 to rangeSize).toVector
// Note that the last flatMaps are just added to show extreme load for streamA.
val streamA = Stream.emits(integers).flatMap(Stream.emit(_))
val streamB = Stream.range(1, rangeSize + 1).flatMap(Stream.emit(_))
streamA.toVector // Uses approx. 25 seconds (!)
streamB.toVector // Uses approx. 15 milliseconds
Is this a bug, or should usage of Stream.emits() for large sequences be avoided?
TLDR: Allocations.
Longer answer:
Interesting question. I ran a JFR profile on both methods separately and looked at the results. The first thing that immediately caught my eye was the amount of allocations.
(The JFR allocation-profile screenshots for Stream.emit and Stream.range are not reproduced here.)
We can see that Stream.emit allocates a significant number of Append instances, the concrete implementation of Catenable[A], which is the type used by Stream.emits to fold:
private[fs2] final case class Append[A](left: Catenable[A], right: Catenable[A]) extends Catenable[A]
This actually comes from the way Catenable[A] is folded over:
foldLeft(empty: Catenable[B])((acc, a) => acc :+ f(a))
where :+ allocates a new Append object for each element. This means we're generating at least 20000 such Append objects.
There is also a hint in the documentation of Stream.range: emits produces the whole sequence in a single chunk instead of lazily, which may be bad if we're generating a big range:
/**
 * Lazily produce the range `[start, stopExclusive)`. If you want to produce
 * the sequence in one chunk, instead of lazily, use
 * `emits(start until stopExclusive)`.
 *
 * @example {{{
 * scala> Stream.range(10, 20, 2).toList
 * res0: List[Int] = List(10, 12, 14, 16, 18)
 * }}}
 */
def range(start: Int, stopExclusive: Int, by: Int = 1): Stream[Pure, Int] =
  unfold(start) { i =>
    if ((by > 0 && i < stopExclusive && start < stopExclusive) ||
        (by < 0 && i > stopExclusive && start > stopExclusive))
      Some((i, i + by))
    else None
  }
You can see that there is no additional wrapping here, only the integers that get emitted as part of the range. On the other hand, Stream.emits creates an Append object for every element in the sequence, with left containing the tail of the stream and right containing the current value we're at.
Is this a bug? I would say no, but I would definitely open this up as a performance issue to the fs2 library maintainers.

Parallelizing sequential for-loop for GPU

I have a for-loop, in which the current element of a vector depends on the previous elements, that I am trying to parallelize for a GPU in MATLAB.
A is an nx1 known vector
B is an nx1 output vector that is initialized to zeros.
The code is as follows:
for n = 1:size(A)
    B(n+1) = B(n) + A(n)*B(n) + A(n)^k + B(n)^2;
end
I have looked at this similar question and tried to find a simple closed form for the recurrence relation, but couldn't find one.
I could do a prefix sum as mentioned in the first link over the A(n)^k term, but I was hoping there would be another method to speed up the loop.
Any advice is appreciated!
P.S. My real code involves 3D arrays that index and sum along 2D slices, but any help for the 1D case should transfer to 3D.
A word "Parallelizing" sounds magically, but scheduling rules apply:
Your problem is not in spending efforts on trying to convert a pure SEQ-process into it's PAR-re-representation, but in handling the costs of doing so, if you indeed persist into going PAR at any cost.
m = size(A); %{
+---+---+---+---+---+---+---+---+---+---+---+---+---+ .. +---+
const A[] := | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | | M |
+---+---+---+---+---+---+---+---+---+---+---+---+---+ .. +---+
:
\
\
\
\
\
\
\
\
\
\
\
+---+---+---+---+---+ .. + .. +---+---+---+---+---+ .. +---+
var B[] := | 0 | 0 | 0 | 0 | 0 | : | 0 | 0 | 0 | 0 | 0 | | 0 |
+---+---+---+---+---+ .. : .. +---+---+---+---+---+ .. +---+ }%
%% : ^ :
%% : | :
for n = 1:m %% : | :
B(n+1) =( %% ====:===+ : .STO NEXT n+1
%% : :
%% v :
B(n)^2 %% : { FMA B, B, .GET LAST n ( in SEQ :: OK, local data, ALWAYS )
+ B(n) %% v B } ( in PAR :: non-local data. CSP + bcast + many distributed-caches invalidates )
+ B(n) * A(n) %% { FMA B, A,
+ A(n)^k %% ApK}
);
end
Once the SEQ-process data-dependency is recurrent (having a need to re-use the LAST B(n-1) for an assignment of the NEXT B(n)), any attempt to make such a SEQ calculation work in PAR has to introduce a system-wide communication of the known values. "New" values can get computed only after the respective "previous" B(n-1) has been evaluated and assigned -- through the pure serial SEQ chain of recurrent evaluation -- thus not before all the previous cells have been processed serially, as the LAST piece is always needed for the NEXT step (ref. the "crossroads" in the for()-loop iterator dependency-map above). Having this, all the rest have to wait in a "queue" to become able to do their two primitive .FMA-s + .STO of the result for the next one in the recurrence-indoctrinated "queue".
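To make that dependency explicit in math, the loop body is just the recurrence below (a plain restatement of the code above; B_1 = 0 comes from the zero-initialisation):

\[
B_{n+1} \;=\; B_n^{2} + \left(1 + A_n\right) B_n + A_n^{k},
\qquad B_1 = 0 .
\]

Had the recurrence been linear, i.e. of the form \( B_{n+1} = a_n B_n + c_n \), it could still be evaluated in parallel by a prefix scan, since affine maps compose associatively. The \( B_n^{2} \) term makes every step a quadratic map, whose compositions grow in degree (2, 4, 8, ...), so neither a scan nor a simple closed form is available, and the serial chain described above remains.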
Yes, one can "enforce" the formula to become PAR-executed, but the very costs of such LAST values being communicated "across" the PAR-execution fabric ( towards the NEXT ) is typically prohibitively expensive ( in terms of resources and accrued delays -- either damaging the SIMT-optimised scheduler latency-masking, or blocking all the threads until receiving their "neighbour"-assigned LAST-value that they rely on and cannot proceed without getting it first -- either of which effectively devastates any potential benefit from all the efforts invested into going PAR ).
Even just a pair of FMA-s is not enough code to justify add-on costs -- indeed an extremely small amount of work to do -- for all the PAR efforts.
Unless some very mathematically "dense" processing is in place, the additional costs are not easily amortised, and such an attempt to introduce a PAR-mode of computing exhibits nothing but a negative (adverse) effect, instead of any wished-for speedup. In all professional cases, one ought to express all the add-on costs during the Proof-of-Concept phase (a PoC), before deciding whether any feasible PAR-approach is possible at all, and how to achieve a speedup of >> 1.0 x.
Relying on advertised theoretical GFLOPS and TFLOPS is nonsense. Your actual GPU-kernel will never be able to repeat the advertised tests' performance figures (unless you run exactly the same optimised layout and code, which one does not need, does one?). One typically needs one's own specific algorithmisation, related to one's own problem domain, without artificially aligning all the toy-problem elements so that the GPU-silicon never has to wait for real data and can enjoy tweaked cache/register-based ILP-artifacts, which are practically not achievable in most real-world problem solutions. If there is one step to recommend: always evaluate an overhead-fair PoC first, to see whether there exists any chance for a speedup at all, before sinking resources and investing time and money into prototyping, detailed design & testing.
Recurrent and weak-processing GPU kernel-payloads will, in almost every case, fight hard just to amortise their additional overhead-times (the bidirectional data-transfers (H2D + D2H) plus the kernel-code related loads).

mongodb map reduce value.count

In mongodb, I have a map function as below:
var map = function() {
    emit(this.username, {count: 1, otherdata: otherdata});
};
and reduce function as below:
var reduce = function(key, values) {
    var total = 0; // accumulator
    values.forEach(function(value) {
        total += value.count; // note this line
    });
    return {count: total, otherdata: values[0].otherdata}; // please ignore otherdata
};
The problem is with the line noted:
total += value.count;
In my dataset, the reduce function is called 9 times, and the expected map-reduce result count is 8908.
With the line above, the result is correctly returned as 8908.
But if I changed the line to:
total += 1;
The returned result would be only 909, about 1/9 of the supposed result.
I also tried print(value.count), and the printed result is 1.
What explains this behavior?
Short answer: value.count is not always equal to one.
Long answer: this is the expected behavior of map-reduce: the reduce function aggregates the results of the map function. However, it aggregates the map output in small groups, producing intermediate results (sub-totals, in your case). Reduce functions are then run again on these intermediate results, as if they were direct results of the map function, and so on, until only one intermediate result is left for each key; that is the final result.
It can be seen as a pyramid of intermediate results :
emit(...)-|
|- reduce -> |
emit(...)-| |
| |- reduce ->|
emit(...)-| | |
| | |
emit(...)-|- reduce -> | |
| |-> reduce = final result
emit(...)-| |
|
emit(...)--- reduce ------------ >|
|
emit(...)-----------------reduce ->|
The number of reduce calls and their inputs is unpredictable and is meant to remain hidden.
That's why you have to provide a reduce function that returns data of the same type (same schema) as its input.
The reduce function does not only get called on the original input data, but also on its own output, until there is a final result. So it needs to be able to handle intermediate results, such as [{count: 5}, {count: 3}, {count: 4}] coming out of an earlier stage.
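To make this concrete, here is a minimal, hypothetical simulation of that pyramid in Java (the batch size of 1000 is invented; MongoDB's real grouping is unspecified and unpredictable, as noted above). It shows why total += value.count is correct while total += 1 undercounts:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class ReReduceDemo {

    // Mimics the emitted documents: {count: n}.
    static final class Value {
        final long count;
        Value(long count) { this.count = count; }
    }

    // Correct reduce: sums the counts carried by its inputs (total += value.count).
    static Value reduceCorrect(List<Value> values) {
        long total = 0;
        for (Value v : values) total += v.count;
        return new Value(total);
    }

    // Broken reduce: counts its inputs instead (total += 1).
    static Value reduceBroken(List<Value> values) {
        long total = 0;
        for (Value v : values) total += 1;
        return new Value(total);
    }

    // Repeatedly re-reduces batches of intermediate results, like the pyramid above.
    static Value runPyramid(List<Value> level, Function<List<Value>, Value> reduce, int batch) {
        while (level.size() > 1) {
            List<Value> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += batch) {
                next.add(reduce.apply(level.subList(i, Math.min(i + batch, level.size()))));
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        List<Value> emitted = Collections.nCopies(8908, new Value(1)); // 8908 emits of {count: 1}
        System.out.println(runPyramid(emitted, ReReduceDemo::reduceCorrect, 1000).count); // 8908
        System.out.println(runPyramid(emitted, ReReduceDemo::reduceBroken, 1000).count);  // 9, not 8908
    }
}

The exact wrong number depends on how the engine happens to batch the values (which is why you observed 909 rather than 9), but the shape of the failure is the same: once intermediate results are fed back in, counting the inputs is no longer the same as summing their counts.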

Scala - calculate average of SomeObj.double in a List[SomeObj]

I'm on my second evening of Scala, and I'm resisting the urge to write things in Scala the way I used to do them in Java, trying instead to learn all of the idioms. In this case I'm looking to compute an average using such things as closures, mapping, and perhaps list comprehensions. Irrespective of whether this is the best way to compute an average, I just want to know how to do these things in Scala, for learning purposes only.
Here's an example: the average method below is left pretty much unimplemented. I've got a couple of other methods for looking up the rating an individual userId gave, using the find method of TraversableLike (I think), but nothing more that is really Scala-specific. How would I compute an average given a List[RatingEvent], where RatingEvent.rating is a double value that I'd like to average across all values of that List, in a Scala-like manner?
package com.brinksys.liftnex.model
class Movie(val id : Int, val ratingEvents : List[RatingEvent]) {
def getRatingByUser(userId : Int) : Int = {
return getRatingEventByUserId(userId).rating
}
def getRatingEventByUserId(userId : Int) : RatingEvent = {
var result = ratingEvents find {e => e.userId == userId }
return result.get
}
def average() : Double = {
/*
fill in the blanks where an average of all ratingEvent.rating values is expected
*/
return 3.8
}
}
How would a seasoned scala pro fill in that method and use the features of scala to make it as concise as possible? I know how I would do it in java, which is what I want to avoid.
If I were doing it in Python, I assume the most Pythonic way would be:
sum([re.rating for re in ratingEvents]) / len(ratingEvents)
or, if I were forcing myself to use a closure (which is something I at least want to learn in Scala):
reduce(lambda x, y: x + y, [re.rating for re in ratingEvents]) / len(ratingEvents)
It's the usage of these types of things I want to learn in scala.
Your suggestions? Any pointers to good tutorials/reference material relevant to this are welcome :D
If you're going to be doing math on things, using List is not always the fastest way to go because List has no idea how long it is--so ratingEvents.length takes time proportional to the length. (Not very much time, granted, but it does have to traverse the whole list to tell.) But if you're mostly manipulating data structures and only occasionally need to compute a sum or whatever, so it's not the time-critical core of your code, then using List is dandy.
Anyway, the canonical way to do it would be with a fold to compute the sum:
(0.0 /: ratingEvents){_ + _.rating} / ratingEvents.length
// Equivalently, though more verbosely:
// ratingEvents.foldLeft(0.0)(_ + _.rating) / ratingEvents.length
or by mapping and then summing (2.8 only):
ratingEvents.map(_.rating).sum / ratingEvents.length
For more information on maps and folds, see this question on that topic.
You might calculate sum and length in one go, but I doubt that this helps except for very long lists. It would look like this:
val (s,l) = ratingEvents.foldLeft((0.0, 0))((t, r)=>(t._1 + r.rating, t._2 + 1))
val avg = s / l
I think for this example Rex' solution is much better, but in other use cases the "fold-over-tuple-trick" can be essential.
Since mean and other descriptive statistics like standard deviation or median are needed in different contexts, you could also use a small reusable implicit helper class to allow for more streamlined chained commands:
implicit class ImplDoubleVecUtils(values: Seq[Double]) {
def mean = values.sum / values.length
}
val meanRating = ratingEvents.map(_.rating).mean
It even seems to be possible to write this in a generic fashion for all number types.
A tail-recursive solution can achieve both a single traversal and avoid high memory-allocation rates:
def tailrec(input: List[RatingEvent]): Double = {
  @annotation.tailrec
  def go(next: List[RatingEvent], sum: Double, count: Int): Double = {
    next match {
      case Nil => sum / count
      case h :: t => go(t, sum + h.rating, count + 1)
    }
  }
  go(input, 0.0, 0)
}
Here are JMH measurements of the approaches from the above answers, on a list of a million elements:
[info] Benchmark Mode Score Units
[info] Mean.foldLeft avgt 0.007 s/op
[info] Mean.foldLeft:·gc.alloc.rate avgt 4217.549 MB/sec
[info] Mean.foldLeft:·gc.alloc.rate.norm avgt 32000064.281 B/op
...
[info] Mean.mapAndSum avgt 0.039 s/op
[info] Mean.mapAndSum:·gc.alloc.rate avgt 1690.077 MB/sec
[info] Mean.mapAndSum:·gc.alloc.rate.norm avgt 72000009.575 B/op
...
[info] Mean.tailrec avgt 0.004 s/op
[info] Mean.tailrec:·gc.alloc.rate avgt ≈ 10⁻⁴ MB/sec
[info] Mean.tailrec:·gc.alloc.rate.norm avgt 0.196 B/op
I can suggest 2 ways:
def average(x: Array[Double]): Double = x.foldLeft(0.0)(_ + _) / x.length
def average(x: Array[Double]): Double = x.sum / x.length
Both are fine, but in the first case, using fold, you are not limited to the "+" operation; you can also replace it with others (- or *, for example).