The problem: there is a stream of numeric values. Values are pushed in bursts, so 100 values can arrive very close to each other (time-wise), say every 5-10 ms, and then the stream may go quiet for a while before bursting again. The idea is to show the accumulated value (sum) over windows of at most 500 ms.
My first attempt was with Buffer(500ms), but this causes constant pumping of events (every 500 ms) with a sum of 0 (as the accumulated buffer is empty). It could be fixed by filtering out empty buffers, but I would really like to avoid that entirely and only open the buffering after a value is actually pushed following a period of "silence".
Additional restriction: the implementation is UniRx, which does not contain all the Rx operators, notably Window (which I suspect could be useful here), so the solution is limited to basic operators, including Buffer.
Since you just want the sum, using Buffer is overkill; we can use Scan or an aggregation instead.
var burstSum =
    source
        .Scan(0, (acc, current) => acc + current)    // running sum of the current burst
        .Throttle(TimeSpan.FromMilliseconds(500))    // wait for 500 ms of silence
        .Take(1)                                     // emit the final sum once
        .Repeat();                                   // then start over for the next burst
This will start a stream which accumulates the sum until the stream has been idle for at least 500ms.
But if we want to emit at least once per time bucket, we'll have to take a different path. We're making two assumptions:
The sum of time-intervals between elements should be equal to the time-interval between the first and last element.
Throttle will release the last value when the stream completes.
source
    .TimeInterval()                                                       // pair each value with the gap since the previous one
    .Scan((acc, cur) => new TimeInterval<int>(acc.Value + cur.Value, acc.Interval + cur.Interval))
    .TakeWhile(acc => acc.Interval <= TimeSpan.FromMilliseconds(500))     // cap the bucket at 500 ms of accumulated intervals
    .Throttle(TimeSpan.FromMilliseconds(500))                             // release the sum after silence or on completion
    .Select(acc => acc.Value)
    .Take(1)
    .Repeat();
I'm trying to play with Kafka Streams to aggregate some attributes of People.
I have a Kafka Streams test like this:
val factory = new ConsumerRecordFactory[Array[Byte], Character]("input", new ByteArraySerializer(), new CharacterSerializer())
var i = 0
while (i != 5) {
  testDriver.pipeInput(
    factory.create("input",
      Character(123, 12), 15 * 10000L))
  i += 1
}
val output = testDriver.readOutput....
I'm trying to group the values by key like this:
streamBuilder.stream[Array[Byte], Character](inputKafkaTopic)
  .filter((key, _) => key == null)
  .mapValues(character => PersonInfos(character.id, character.id2, character.age))   // case class
  .groupBy((_, value) => CharacterInfos(value.id, value.id2))                        // case class
  .count().toStream.print(Printed.toSysOut[CharacterInfos, Long])
When I run the code, I get this:
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 1
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 2
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 3
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 4
[KTABLE-TOSTREAM-0000000012]: CharacterInfos(123,12), 5
Why am I getting 5 rows instead of just one line with CharacterInfos and the count?
Doesn't groupBy just change the key?
If you use the TopologyTestDriver, caching is effectively disabled and thus every input record will always produce an output record. This is by design, because caching implies non-deterministic behavior, which makes it very hard to write an actual unit test.
If you deploy the code in a real application, the behavior will be different and caching will reduce the output load -- which intermediate results you will get is not defined (i.e., non-deterministic); compare Michael Noll's answer.
For your unit test, this should not really matter: you can either test for all output records (i.e., all intermediate results), or put all output records into a key-value map and only test the last emitted record per key (if you don't care about the intermediate results).
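A minimal sketch of that map-based approach, staying with the readOutput call already used in the test above; the "output" topic name and CharacterInfosDeserializer are illustrative assumptions:
// Hedged sketch: drain every output record and keep only the latest value per key.
import scala.collection.mutable
import org.apache.kafka.common.serialization.LongDeserializer

val latestByKey = mutable.Map.empty[CharacterInfos, Long]
var record = testDriver.readOutput("output", new CharacterInfosDeserializer(), new LongDeserializer())
while (record != null) {
  latestByKey(record.key()) = record.value()   // later (intermediate) results overwrite earlier ones
  record = testDriver.readOutput("output", new CharacterInfosDeserializer(), new LongDeserializer())
}
// Assert only on the final state, e.g. latestByKey(CharacterInfos(123, 12)) == 5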
Furthermore, you could use suppress() operator to get fine grained control over what output messages you get. suppress()—in contrast to caching—is fully deterministic and thus writing a unit test works well. However, note that suppress() is event-time driven, and thus, if you stop sending new records, time does not advance and suppress() does not emit data. For unit testing, this is important to consider, because you might need to send some additional "dummy" data to trigger the output you actually want to test for. For more details on suppress() check out this blog post: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers
Update: I didn't spot the line in the example code that refers to the TopologyTestDriver in Kafka Streams. My answer below is for the 'normal' KStreams application behavior, whereas the TopologyTestDriver behaves differently. See the answer by Matthias J. Sax for the latter.
This is expected behavior. Somewhat simplified, Kafka Streams by default emits a new output record as soon as a new input record is received.
When you are aggregating (here: counting) the input data, the aggregation result is updated (and thus a new output record produced) as soon as new input is received for the aggregation.
input record 1 ---> new output record with count=1
input record 2 ---> new output record with count=2
...
input record 5 ---> new output record with count=5
What to do about it: You can reduce the number of 'intermediate' outputs by configuring the size of the so-called record caches as well as the commit.interval.ms parameter. See Memory Management. However, how much reduction you will see depends not only on these settings but also on the characteristics of your input data, and because of that the extent of the reduction may also vary over time (think: could be 90% in the first hour of data, 76% in the second hour of data, etc.). That is, the reduction process is deterministic, but the resulting amount of reduction is difficult to predict from the outside.
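For illustration, a hedged sketch of the relevant settings (the application id, broker address, and values below are arbitrary placeholders that would need tuning for your workload):
// Hedged sketch: enable record caching and a longer commit interval to reduce
// the number of intermediate updates (values are illustrative only).
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "character-aggregation")                 // assumed application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")                     // assumed broker address
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, (10 * 1024 * 1024).toString)  // 10 MB record cache
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "1000")                              // commit (and flush caches) every 1 s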
Note: When doing windowed aggregations (like windowed counts) you can also use the Suppress() API so that the number of intermediate updates is not only reduced, but there will only ever be a single output per window. However, in your use case/code the aggregation is not windowed, so you cannot use the Suppress API.
To help you understand why the setup is this way: You must keep in mind that a streaming system generally operates on unbounded streams of data, which means the system doesn't know 'when it has received all the input data'. So even the term 'intermediate outputs' is actually misleading: at the time the second input record was received, for example, the system believes that the result of the (non-windowed) aggregation is '2' -- it's the correct result to the best of its knowledge at this point in time. It cannot predict whether (or when) another input record might arrive.
For windowed aggregations (where Suppress is supported) this is a bit easier, because the window size defines a boundary for the input data of a given window. Here, the Suppress() API lets you trade off between better latency with multiple outputs per window (the default behavior, Suppress disabled) and higher latency with only a single output per window (Suppress enabled). In the latter case, if you have 1h windows, you will not see any output for a given window until 1h later, so to speak. For some use cases this is acceptable, for others it is not.
I have an application that consumes work to do from an AWS topic. Work is added several times a day and my application quickly consumes it and the queue length goes back to 0. I am able to produce a metric for the length of the queue.
I would like a metric for the time since the length of queue was last zero. Any ideas how to get started?
Assuming a queue_size gauge that records the size of the queue, you can define a recording rule like this:
# Timestamp of the most recent `queue_size == 0` sample; else propagate the previous value
- record: last_empty_queue_timestamp
  expr: timestamp(queue_size == 0) or last_empty_queue_timestamp
Then you can compute the time since the last time the queue was empty as simply as:
timestamp(queue_size) - last_empty_queue_timestamp
Note however that because this is a gauge (and because of the limitations of sampling), you may end up with odd results. E.g. if one work item is added every minute, your sampling interval is one minute, and you sample exactly after the work items have been added, your queue may never (or very rarely) appear empty from the point of view of Prometheus. If that turns out to be an issue (or simply a concern), you may be better off having your application export a metric that is the last timestamp when something was added to an empty queue (basically what the recording rule above attempts to compute).
Similar to Alin's answer; upon revisiting this problem I found this from the Prometheus documentation:
https://prometheus.io/docs/practices/instrumentation/#timestamps,-not-time-since
If you want to track the amount of time since something happened, export the Unix timestamp at which it happened - not the time since it happened.
With the timestamp exported, you can use the expression time() - my_timestamp_metric to calculate the time since the event, removing the need for update logic and protecting you against the update logic getting stuck.
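As a hedged illustration of that advice from the application side, using the Prometheus Java simpleclient (the metric and method names below are assumptions, not part of the original setup):
// Hedged sketch: export the Unix timestamp of the last enqueue onto an empty queue,
// then query it with: time() - queue_last_enqueue_on_empty_timestamp_seconds
import io.prometheus.client.Gauge

val lastEnqueueOnEmpty: Gauge = Gauge.build()
  .name("queue_last_enqueue_on_empty_timestamp_seconds")
  .help("Unix timestamp of the last time work was added to an empty queue.")
  .register()

def onWorkAdded(previousQueueSize: Int): Unit =
  if (previousQueueSize == 0) lastEnqueueOnEmpty.setToCurrentTime()   // set the gauge to the current time in seconds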
I was wondering if it's possible to create a WindowAssigner that is similar to:
EventTimeSessionWindows.withGap(Time.seconds(1L))
Except I don't want the window to keep growing in event-time on each element. I want the beginning of the window to be defined at the first element received (for that key), and end exactly 1 second later, no matter how many elements arrive in that second.
So it would probably look like this hypothetically:
EventTimeSessionWindows.withMax(Time.seconds(1L))
Thanks!
There is no built-in window for this use case.
However, you can implement this with a GlobalWindow, which collects all incoming elements, and a Trigger that registers a timer when an element is received and the window is empty, i.e., the first element or the first element after the window was purged. The window collects new elements until the timer fires. At that point, the window is evaluated and purged.
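A hedged sketch of such a trigger (Scala; the class name, gap parameter, and usage line are illustrative, and a production version would also need to consider late elements and state cleanup):
// Hedged sketch: fire and purge a GlobalWindow exactly `gapMs` after the first
// element (in event time), regardless of how many elements arrive in between.
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.common.typeinfo.Types
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

class FixedGapFromFirstElementTrigger[T](gapMs: Long) extends Trigger[T, GlobalWindow] {
  private val deadlineDesc = new ValueStateDescriptor[java.lang.Long]("deadline", Types.LONG)

  override def onElement(element: T, timestamp: Long, window: GlobalWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    val deadline = ctx.getPartitionedState(deadlineDesc)
    if (deadline.value() == null) {                 // first element since the last purge
      deadline.update(timestamp + gapMs)
      ctx.registerEventTimeTimer(timestamp + gapMs)
    }
    TriggerResult.CONTINUE
  }

  override def onEventTime(time: Long, window: GlobalWindow,
                           ctx: Trigger.TriggerContext): TriggerResult = {
    val deadlineState = ctx.getPartitionedState(deadlineDesc)
    val deadline = deadlineState.value()
    if (deadline != null && deadline.longValue() == time) {
      deadlineState.clear()
      TriggerResult.FIRE_AND_PURGE                  // evaluate the window, then empty it
    } else {
      TriggerResult.CONTINUE
    }
  }

  override def onProcessingTime(time: Long, window: GlobalWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {
    val deadlineState = ctx.getPartitionedState(deadlineDesc)
    val deadline = deadlineState.value()
    if (deadline != null) ctx.deleteEventTimeTimer(deadline.longValue())
    deadlineState.clear()
  }
}
// Assumed usage: keyedStream.window(GlobalWindows.create()).trigger(new FixedGapFromFirstElementTrigger[MyEvent](1000L))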
What would a q implementation of Windows' waitfor look like?
I have a q process that feeds rows of a table, one at a time, to another process (say tickerplant).
feed:{h`.u.upd,x;};
feed each tbl;
I'd like to feed the next row if any one of the conditions is met:
the process receives the signal (next or ready) from the other process to do so;
OR
time waiting for signal runs out (timeout is specific to each row of tbl and will be provided as a column of tbl)
Think of tbl as containing the list of events to be published sequentially, either when the rest of the CEP/tickerplant is ready OR when the next event is due (timeout measured as the timespan delta between the last published event and the next one, i.e. update timeout:((1 _deltas time),0Wp) from tbl), whichever happens sooner.
I've tried while[.z.N<prevtstamp+timeout;wait()];feed x but while blocks the process from listening to async messages from the other process AFAIK.
Other solutions considered:
Checking for signals with .z.ts is too slow as \t can't go below 1ms precision.
Polling the tickerplant for next(ready) signal continuously from within the while loop would slow down the tickerplant.
One solution is to maintain i index of the current row of tbl, separate the feeder out into two processes each handling one condition separately and polling for the current i index. This sounds slow, compared to eaching rows.
kdb is a single-threaded application, so anything waitFor-like would amount to nothing more than a while loop.
1 ms precision is already quite high, and kdb is not designed for hard real-time work, so even 1 ms can easily be missed if the process is busy with something else.
If you need higher precision than that, you might need C or even driver-level help to achieve it.
You could also consider a design that avoids the real-time requirement :)
I have scenarios where I will need to process thousands of records at a time. Sometimes it might be in the hundreds, maybe up to 30,000 records. I was thinking of using Scala's parallel collections. So, just to understand the difference, I wrote a simple program like the one below:
object Test extends App {
  val list = (1 to 100000).toList
  Util.seqMap(list)
  Util.parMap(list)
}

object Util {
  def seqMap(list: List[Int]) = {
    val start = System.currentTimeMillis
    list.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken =" + (end - start))
    end - start
  }

  def parMap(list: List[Int]) = {
    val start = System.currentTimeMillis
    list.par.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken=" + (end - start))
    end - start
  }
}
I expected that running in parallel will be faster. However, the output I was getting was
time taken =32
time taken=127
machine config :
Intel i7 processor with 8 cores
16GB RAM
64bit Windows 8
What am I doing wrong? Is this not a correct scenario for parallel mapping?
The issue is that the operation you are performing is so fast (just adding two ints) that the overhead of doing the parallelization is more than the benefit. The parallelization only really makes sense if the operations are slower.
Think of it this way: if you had 8 friends and you gave each one an integer on a piece of paper and told them to add one, write the result down, and give it back to you, which you would record before giving them the next integer, you'd spend so much time passing messages back and forth that you could have just done all the adding yourself faster.
ALSO: Never do .par on a List because the parallelization procedure has to copy the entire list into a parallel collection and then copy the whole thing back out. If you use a Vector, then it doesn't have to do this extra work.
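For illustration, a small sketch of the Vector variant (note that on Scala 2.13+ the .par call additionally needs the scala-parallel-collections module):
// Hedged sketch: .par on a Vector splits the underlying structure directly,
// avoiding the List-to-parallel-collection copy and the copy back.
// Scala 2.13+: import scala.collection.parallel.CollectionConverters._
val vec = (1 to 100000).toVector
val parSum = vec.par.map(_ + 1).sum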
The overhead of parallelizing the list is more time-consuming than simply processing the x + 1 operations sequentially.
Now consider this modification, where we include an operation that takes roughly 1 millisecond:
case class Delay() {
Thread.sleep(1)
}
and replace
list.map(x => x + 1).toList.sum
with
list.map(_ => Delay()).toList
Now, for val list = (1 to 10000).toList (note 10000 instead of 100000), on a quad-core 8 GB machine:
scala> Util.parMap(list)
time taken=3451
res4: Long = 3451
scala> Util.seqMap(list)
time taken =10816
res5: Long = 10816
We can infer (or rather, guess) that for large collections with time-consuming operations, the overhead of parallelizing the collection no longer dominates the elapsed time, and the parallel version clearly outperforms sequential processing.
If you are benchmarking, consider using something like JMH to avoid the many pitfalls of measuring the way your program does. For example, the JIT may change your results dramatically, but only after some warm-up iterations.
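A minimal JMH sketch, assuming the sbt-jmh plugin (run with sbt "jmh:run"); the class and method names are illustrative:
// Hedged sketch: let JMH handle warm-up and measurement instead of currentTimeMillis.
import org.openjdk.jmh.annotations.{Benchmark, Scope, State}

@State(Scope.Benchmark)
class MapBenchmark {
  val list: List[Int] = (1 to 100000).toList
  val vector: Vector[Int] = list.toVector

  @Benchmark def seqSum: Int = list.map(_ + 1).sum        // sequential baseline
  @Benchmark def parSum: Int = vector.par.map(_ + 1).sum  // parallel variant on a Vector
}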
In my experience, parallel collections are normally slower if the input is not large enough: when the input is small, the initial split and the "putting together" at the end do not pay off.
So benchmark again, using lists of different sizes (try 30 000, 100 000, and 1 000 000).
Moreover, if you do numerical processing, consider using Array (instead of List) and while (instead of map). These are "more native" (= faster) on the underlying JVM, whereas in your current code you may largely be measuring the performance of the garbage collector. With an Array you can also store the result of the operation in place, as sketched below.
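A rough sketch of that approach, assuming the same 100000-element input:
// Hedged sketch: Array plus a while loop, updating in place and summing as we go.
val arr = (1 to 100000).toArray
var i = 0
var total = 0L
while (i < arr.length) {
  arr(i) = arr(i) + 1   // result stored "in place", no intermediate collections
  total += arr(i)
  i += 1
}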
Parallel collections initialize threads before performing the operation, and that takes some time.
So when you use parallel collections on a small number of elements, or when each operation is cheap, they will perform slower than their sequential counterparts.