SkipWhile to check if a condition holds continuously in RxJava - reactive-programming

There are several Observables of a continuous stream of accelerometer data. In one of the Observables I'd like to sample the data and receive only about 5 events per second, or in effect, skip x elements.
I tried sample and skipWhile. sample simply queues the data and delivers it, with a delay, in the order sent by the emitter.
skipWhile is closer to what I need, but it checks the condition only once: once skipWhile returns false, the stream continues and the condition is never checked again.
.skipWhile(imuSensorEvent -> {
    return imuSensorEvent.getTimestamp() - last < 100 * 1000000;
})
How would I use something like skipWhile and always check for the condition?
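A direct way to keep checking the condition is to replace skipWhile with filter, which evaluates its predicate for every element. A minimal RxJava 3 sketch of that idea, using interval-generated nanosecond timestamps as a stand-in for the IMU events (the 200 ms threshold corresponds to roughly 5 events per second):

import io.reactivex.rxjava3.core.Observable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class TimestampFilterSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the accelerometer stream: a nanosecond timestamp every 20 ms.
        Observable<Long> timestamps = Observable.interval(20, TimeUnit.MILLISECONDS)
                .map(i -> System.nanoTime());

        // Unlike skipWhile, filter runs its predicate for every element, so we can
        // keep only events at least 200 ms (in nanoseconds) after the last one kept.
        AtomicLong lastKept = new AtomicLong(0);
        timestamps.filter(ts -> {
            if (ts - lastKept.get() >= 200 * 1_000_000L) {
                lastKept.set(ts);
                return true;
            }
            return false;
        }).subscribe(ts -> System.out.println("kept: " + ts));

        Thread.sleep(2000); // keep the JVM alive long enough to observe output
    }
}

The throttleLast answer further down achieves a similar effect without managing state by hand.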

Related

Batching Kafka Events using Faust

I have a Kafka topic we will call ingest that receives an entry every x seconds. I have a process that I want to run on this data but it has to be run on 100 events at a time. Thus, I want to batch the entries together and send them to a new topic called batched-ingest. The two topics will look like this...
ingest = [entry, entry, entry, ...]
batched-ingest = [[entry_0, entry_1, ..., entry_99]]
What is the correct way to do this using faust? The solution I have right now is this...
import faust

app = faust.App("explore", value_serializer="raw")
ingest = app.topic('ingest')
ingest_batch = app.topic('ingest-batch')

@app.agent(ingest, sink=[ingest_batch])
async def test(stream):
    async for values in stream.take(10, within=1000):
        yield values
I am not sure if this is the correct way to do this in Faust. If so, what should I set within to in order to make it always wait until len(values) = 100?
As mentioned in the Faust take documentation, if you omit within from take(100, within=10), the code will block forever when only 99 messages have arrived and the hundredth never comes. To solve this, add the within timeout so that up to 100 values are processed within 10 seconds; if there is a 10-second period with no new events, the agent still processes whatever it has gathered.

How to process only last event in a period?

I receive a lot of events within an interval of a second. I want to process only the most recent event each second, e.g.
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 at 5 events per second: I want to process only event 5 in second 1, event 10 in second 2, and event 15 in second 3. I thought about Flowable, but it just introduces a delay between events, and debounce will not trigger if the event stream is constant.
It seems that throttleLast(1, TimeUnit.SECONDS) is what you need: it emits only the last item emitted by a reactive source during sequential time windows of a specified duration.
sample() operator rx docs
throttleLast() description
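A minimal RxJava 3 sketch of that behaviour, using an interval as a stand-in for the 5-events-per-second source:

import io.reactivex.rxjava3.core.Observable;
import java.util.concurrent.TimeUnit;

public class ThrottleLastExample {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in source: events 1, 2, 3, ... every 200 ms (5 per second).
        Observable<Long> events = Observable.interval(200, TimeUnit.MILLISECONDS)
                .map(i -> i + 1);

        // throttleLast keeps only the last item of each 1-second window,
        // so roughly events 5, 10, 15, ... come through.
        events.throttleLast(1, TimeUnit.SECONDS)
              .subscribe(e -> System.out.println("latest this second: " + e));

        Thread.sleep(4000); // let a few windows elapse before the JVM exits
    }
}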

KafkaConsumer: consume all available messages once, and exit

I want to create a KafkaConsumer, with Kafka 2.0.0, that consumes all available messages once, and exits immediately. This differs slightly from the standard console consumer utility because that utility waits for a specified timeout for new messages, and only exits once that timeout has expired.
This seemingly simple task turns out to be surprisingly hard with the KafkaConsumer. My gut reaction was the following pseudo-code:
consumer.assign(all partitions)
consumer.seekToBeginning(all partitions)
do
    result = consumer.poll(Duration.ofMillis(0))
    // onResult(result)
while result is not empty
However this does not work, as poll always returns an empty collection even though there are many messages on the topic.
Researching this, it looks like one reason may have been that assign/subscribe are considered lazy, and partitions are not assigned until a poll loop has completed (although I cannot find any support for this assertion in the docs). However, the following pseudo-code also returns an empty collection on each call to poll:
consumer.assign(all partitions)
consumer.seekToBeginning(all partitions)
// returns nothing
result = consumer.poll(Duration.ofMillis(0))
// returns nothing
result = consumer.poll(Duration.ofMillis(0))
// returns nothing
result = consumer.poll(Duration.ofMillis(0))
// deprecated poll also returns nothing
result = consumer.poll(0)
// returns nothing
result = consumer.poll(0)
// returns nothing
result = consumer.poll(0)
...
So clearly "laziness" is not the issue.
The javadoc states:
This method returns immediately if there are records available.
which seems to imply that the first pseudo-code above should work. However, it does not.
The only thing that seems to work is to specify a non-zero timeout on poll, and not just any non-zero value; for example, 1 doesn't work. This indicates that there is some non-deterministic behavior happening inside poll, which assumes that poll will always be called in an infinite loop and that it doesn't matter if it occasionally returns an empty collection despite the availability of messages. The code seems to confirm this, with various checks for timeout expiry sprinkled throughout the poll implementation and its callees.
So with the naive approach, a longer timeout is obviously required (ideally Long.MAX_VALUE, to avoid the non-deterministic behavior of a shorter poll interval), but unfortunately this causes the consumer to block on the last poll, which isn't desired in this situation. We are left with a trade-off between how deterministic we want the behavior to be and how long we have to wait for no reason on the last poll. How do we avoid this?
The only way to accomplish this seems to be with some additional logic that self-manages the offsets. Here is the pseudo-code:
consumer.assign(all partitions)
consumer.seekToBeginning(all partitions)
// record the current end offsets and poll until we get there
endOffsets = consumer.endOffsets(all partitions)
do
    result = consumer.poll(NONTRIVIAL_TIMEOUT)
    // onResult(result)
while for any partition p, consumer.position(p) < endOffsets[p]
and an implementation in Kotlin:
val topicPartitions = consumer.partitionsFor(topic).map { TopicPartition(it.topic(), it.partition()) }
consumer.assign(topicPartitions)
consumer.seekToBeginning(consumer.assignment())

// Snapshot the end offsets up front; we are done once every partition's position has reached them.
val endOffsets = consumer.endOffsets(consumer.assignment())
fun pendingMessages() = endOffsets.any { consumer.position(it.key) < it.value }

do {
    val records = consumer.poll(Duration.ofMillis(1000))
    onResult(records)
} while (pendingMessages())
The poll duration can now be set to a reasonable value (such as 1s) without concern of missing messages, since the loop continues until the consumer reaches the end offsets identified at the beginning of the loop.
There is one other corner case this deals with correctly: if the end offsets have changed, but there are actually no messages between the current offset and the end offset, then the poll will block and timeout. So it is important that the timeout not be set too low (otherwise the consumer will timeout before retrieving messages that are available) and it must also not be set too high (otherwise the consumer will take too long to timeout when retrieving messages that are not available). The latter situation can happen if those messages were deleted, or if the topic was deleted and recreated.
If there's no one producing concurrently, you can also use endOffsets to get the position of the last message, and consume until you reach it.
So, in pseudocode:
long currentOffset = -1
long endOffset = consumer.endOffset(partition)
while (currentOffset < endOffset) {
    records = consumer.poll(NONTRIVIAL_TIMEOUT) // discussed in your answer
    currentOffset = records.offsets().max()
}
This way we avoid hanging on the final non-zero-timeout poll, as we are always sure there is something left to receive.
You might need to add safeguards if your consumer's position is equal to end offset (as you'd get no messages there).
Also, you might want to set max.poll.records to 1, so you don't consume messages positioned after end offset, if someone is producing in parallel.
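A hedged Java sketch of this variant for a single partition (the topic name and partition number are placeholders, and the safeguards mentioned above still apply, e.g. for an empty partition or deleted messages). Note that ConsumerRecords has no offsets() helper, so the maximum offset is tracked from the records actually received, and endOffsets returns the offset after the last existing message:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;

public class DrainOnePartition {
    // Consumer construction/configuration is assumed; "ingest" / partition 0 are placeholders.
    static void drain(KafkaConsumer<String, String> consumer) {
        TopicPartition tp = new TopicPartition("ingest", 0);
        consumer.assign(Collections.singletonList(tp));
        consumer.seekToBeginning(Collections.singletonList(tp));

        // endOffsets gives the offset of the *next* message to be written,
        // so the last existing message has offset endOffset - 1.
        long endOffset = consumer.endOffsets(Collections.singletonList(tp)).get(tp);

        long maxSeenOffset = -1L;
        while (maxSeenOffset < endOffset - 1) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                maxSeenOffset = Math.max(maxSeenOffset, record.offset());
                // process(record) ...
            }
        }
    }
}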

how to get result of Kafka streams aggregate task and send the data to another service?

I use Kafka Streams to process real-time data and I need to do some aggregation over a windowed period.
I have two questions about the aggregate operation.
How do I get the aggregated data? I need to send it to a 3rd service.
After the aggregate operation, I can't send a message to the 3rd service; the code doesn't run.
Here is my code:
stream = builder.stream("topic");
windowedKStream = stream.map(XXXXX).groupByKey().windowedBy("5mins");
ktable = windowedKStream.aggregate(()->"", new Aggregator(K,V,result));
// my data is stored in the 'result' variable, but I can't get it at the end of the 5-minute window.
// I need to send 'result' to the 3rd service, but I don't know where to temporarily store it and then how to get it.
// below is the code that calls the 3rd service, but it is never executed (unreachable).
// I think it should run every 5 minutes when the window closes. But it doesn't.
result = httpclient.execute('result');
I guess you might want to do something like:
ktable.toStream().foreach((k,v) -> httpclient.execute(v));
Each time the KTable is updated (with caching disabled), the update record will be sent downstream, and foreach will be executed with v being the current aggregation result.
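A hedged sketch of that shape, assuming String keys and values, a plain 5-minute tumbling window, and default serdes (the exact types and window configuration from the question aren't known; httpClient stands in for the real call to the 3rd service):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

import java.time.Duration;

public class WindowedAggregateSketch {
    static void build(StreamsBuilder builder) {
        KStream<String, String> stream = builder.stream("topic");

        KTable<Windowed<String>, String> aggregated = stream
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
                .aggregate(
                        () -> "",                        // initializer
                        (key, value, agg) -> agg + value // aggregator
                );

        // Every update to the windowed KTable flows downstream as a record;
        // this is the place to hand the current aggregation result to the 3rd service.
        aggregated.toStream()
                  .foreach((windowedKey, result) -> httpClient(result));
    }

    static void httpClient(String result) {
        // placeholder for the real HTTP call
        System.out.println("sending: " + result);
    }
}

Keep in mind that with caching enabled, updates reach foreach per cache flush/commit rather than per input record, and each update carries the latest aggregate for the window, not necessarily the final value at window close.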

Usage of Monix Debounce Observable

I'm trying out some of the operations that I could do on the Observable from Monix. I came across this debounce operator and could not understand its behavior:
Observable.interval(5.seconds).debounce(2.seconds)
This one above just emits a Long every 5 seconds.
Observable.interval(2.seconds).debounce(5.seconds)
This one however does not emit anything at all. So what is the real purpose of the debounce operator and in which cases could I use it?
The term debounce comes from mechanical relays. You can think of it as a frequency filter: o.debounce(5.seconds) filters out any events that are emitted more frequently than once every 5 seconds.
An example of where I've used it is where I expect to get a batch of similar events in rapid succession, and my response to each event is the same. By debouncing I can reduce the amount of work I need to do by making the batch look like just one event.
It isn't useful in situations like your examples where the input frequency is constant, as the only possibilities are that it does nothing or it filters out everything.
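The same operator exists in RxJava, so for a concrete feel here is a minimal sketch (RxJava 3 assumed) of the burst-collapsing use case described above: a rapid burst of events is reduced to a single emission once the stream goes quiet.

import io.reactivex.rxjava3.core.Observable;
import java.util.concurrent.TimeUnit;

public class DebounceExample {
    public static void main(String[] args) throws InterruptedException {
        // A burst of 5 events 100 ms apart, then silence.
        Observable<Long> bursty = Observable.intervalRange(0, 5, 0, 100, TimeUnit.MILLISECONDS);

        // debounce(500 ms) stays silent while events arrive more often than once
        // per 500 ms, then emits only the last event of the burst.
        bursty.debounce(500, TimeUnit.MILLISECONDS)
              .subscribe(v -> System.out.println("emitted: " + v)); // prints "emitted: 4"

        Thread.sleep(2000); // wait out the quiet period so the emission is visible
    }
}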