How do I collect from a flux without closing the stream - reactive-programming

My usecase is to create an reactive endpoint like this :
public Flux<ServerEvent> getEventFlux(Long forId){
ServicePoller poller = new ServicePollerImpl();
Map<String,Object> params = new HashMap<>();
params.put("id", forId);
Flux<Long> interval = Flux.interval(Duration.ofMillis(pollDuration));
Flux<ServerEvent> serverEventFlux =
Flux.fromStream(
poller.getEventStream(url, params) //poll a given endpoint after a fixed duration.
);
Flux<ServerEvent> sourceFlux= Flux.zip(interval, serverEventFlux)
.map(Tuple2::getT2); // Zip the two streams.
/* Here I want to store data from sourceFlux into a collection whenever some data arrives without disturbing the downstream processing in Spring. So that I can access collection later on without polling again */
This sends back the data to front end as soon as it is available , however my second use case is to pool that data as it arrives into a separate collection , so that if a similar request arrives later on , I can offload the whole data from the pool without hitting the service again .
I tried to subscribe the flux , buffer , cache and collect the flux before returning from the original flux the controller , but all of that seems to close the stream hence Spring cant process it.
What are my options to tap into the flux and store values into a collection as and when they arrive without closing the flux stream ?
Exception encountered :
java.lang.IllegalStateException: stream has already been operated upon
or closed at
java.util.stream.AbstractPipeline.spliterator(AbstractPipeline.java:343)
~[na:1.8.0_171] at
java.util.stream.ReferencePipeline.iterator(ReferencePipeline.java:139)
~[na:1.8.0_171] at
reactor.core.publisher.FluxStream.subscribe(FluxStream.java:57)
~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE] at
reactor.core.publisher.Flux.subscribe(Flux.java:6873)
~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE] at
reactor.core.publisher.FluxZip$ZipCoordinator.subscribe(FluxZip.java:573)
~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE] at
reactor.core.publisher.FluxZip.handleBoth(FluxZip.java:326)
~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE]

poller.getEventStream returns a Java 8 stream that can be consumed only once. You can either convert the stream to a collection first or defer the execution of poller.getEventStream by using a supplier:
Flux.fromStream(
() -> poller.getEventStream(url, params)
);

Solution that worked for me as suggested by #a better oliver
public Flux<ServerEvent> getEventFlux(Long forId){
ServicePoller poller = new ServicePollerImpl();
Map<String,Object> params = new HashMap<>();
params.put("id", forId);
Flux<Long> interval = Flux.interval(Duration.ofMillis(pollDuration));
Flux<ServerEvent> serverEventFlux =
Flux.fromStream(
()->{
return poller.getEventStream(url, params).peek((se)->{reactSink.addtoSink(forId, se);});
}
);
Flux<ServerEvent> sourceFlux= Flux.zip(interval, serverEventFlux)
.map(Tuple2::getT2);
return sourceFlux;
}

Related

Mongo change-Stream with Spring resumeAt vs startAfter and fault tolerance in case of connection loss

Can't find an answer on stackOverflow, nor in any documentation,
I have the following change stream code(listen to a DB not a specific collection)
Mongo Version is 4.2
#Configuration
public class DatabaseChangeStreamListener {
//Constructor, fields etc...
#PostConstruct
public void initialize() {
MessageListenerContainer container = new DefaultMessageListenerContainer(mongoTemplate, new SimpleAsyncTaskExecutor(), this::onException);
ChangeStreamRequest.ChangeStreamRequestOptions options =
new ChangeStreamRequest.ChangeStreamRequestOptions(mongoTemplate.getDb().getName(), null, buildChangeStreamOptions());
container.register(new ChangeStreamRequest<>(this::onDatabaseChangedEvent, options), Document.class);
container.start();
}
private ChangeStreamOptions buildChangeStreamOptions() {
return ChangeStreamOptions.builder()
.returnFullDocumentOnUpdate()
.filter(newAggregation(match(where(OPERATION_TYPE).in(INSERT.getValue(), UPDATE.getValue(), REPLACE.getValue(), DELETE.getValue()))))
.resumeAt(Instant.now().minusSeconds(1))
.build();
}
//more code
}
I want the stream to start listening from system initiation time only, without taking anything prior in the op-log, will .resumeAt(Instant.now().minusSeconds(1)) work?
do I need to use starAfter method if so how can I found the latest resumeToken in the db?
or is it ready out of the box and I don't need to add any resume/start lines?
second question, I never stop the container(it should always live while app is running), In case of disconnection from the mongoDB and reconnection will the listener in current configuration continue to consume messages? (I am having a hard time simulation DB disconnection)
If it will not resume handling events, what do I need to change in the configuration so that the change stream will continue and will take all the event from the last received resumeToken prior to the disconnection?
I have read this great article on medium change stream in prodcution,
but it uses the cursor directly, and I want to use the spring DefaultMessageListenerContainer, as it is much more elegant.
So I will answer my own(some more dumb, some less :)...) questions:
when no resumeAt timestamp provided the change stream will start from current time, and will not draw any previous events.
resumeAfter event vs timestamp difference can be found here: stackOverflow answer
but keep in mind, that for timestamp it is inclusive of the event, so if you want to start from next event(in java) do:
private BsonTimestamp getNextEventTimestamp(BsonTimestamp timestamp) {
return new BsonTimestamp(timestamp.getValue() + 1);
}
In case of internet disconnection the change stream will not resume,
as such I recommend to take following approach in case of error:
private void onException() {
ScheduledExecutorService executorService = newSingleThreadScheduledExecutor();
executorService.scheduleAtFixedRate(() -> recreateChangeStream(executorService), 0, 1, TimeUnit.SECONDS);
}
private void recreateChangeStream(ScheduledExecutorService executorService) {
try {
mongoTemplate.getDb().runCommand(new BasicDBObject("ping", "1"));
container.stop();
startNewContainer();
executorService.shutdown();
} catch (Exception ignored) {
}
}
First I am creating a runnable scheduled task that always runs(but only 1 at a time newSingleThreadScheduledExecutor()), I am trying to ping the DB, after a successful ping I am stopping the old container and starting a new one, you can also pass the last timestamp you took so that you can get all events you might have missed
timestamp retrieval from event:
BsonTimestamp resumeAtTimestamp = changeStreamDocument.getClusterTime();
then I am shutting down the task.
also make sure the resumeAtTimestamp exist in oplog...

How to access a KStreams Materialized State Store from another Stream Processor

I need to be able to remove a record from a Ktable from a separate Stream Processor. Today I'm using aggregate() and passing a materialized state store. In a separate processor that reads from a "termination" topic, I'd like to query that materialized state store either in a .transform() or a different .aggregate() and 'remove' that key/value. Every time I try to access the materialized state from a separate stream processor, it keeps telling me either the store isn't added to the topology, so then I add it and run it again, then it tells me it's already be registered and errors out.
builder.stream("topic1").map().groupByKey().aggregate(() -> null,
(aggKey, newValue, aggValue) -> {
//add to the Ktable
return newValue;
},
stateStoreMaterialized);
and in a separate stream I want to delete a key from that stateStoreMaterialized
builder.stream("topic2")
.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
stateStoreDeleteTransformer will query the key and delete it.
//in ctor
KeyValueBytesStoreSupplier stateStoreSupplier = Stores.persistentKeyValueStore("store1");
stateStoreMaterialized = Materialized.<String, MyObj>as(stateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(mySerDe);
I don't have a terminal flag on my topic1 stream object value that can trigger a deletion. It has to come from another stream/topic.
When I try to use the same Materialized Store on two separate stream processors I get..
Invalid topology: Topic STATE_STORE-repartition has already been registered by another source.
at org.springframework.kafka.config.StreamsBuilderFactoryBean.start(StreamsBuilderFactoryBean.java:268)
Edit:
This is the 1st error I receive.
Caused by: org.apache.kafka.streams.errors.StreamsException: Processor KSTREAM-TRANSFORMVALUES-0000000012 has no access to StateStore store1 as the store is not connected to the processor. If you add stores manually via '.addStateStore()' make sure to connect the added store to the processor by providing the processor name to '.addStateStore()' or connect them via '.connectProcessorAndStateStores()'. DSL users need to provide the store name to '.process()', '.transform()', or '.transformValues()' to connect the store to the corresponding operator. If you do not add stores manually, please file a bug report at https://issues.apache.org/jira/projects/KAFKA.
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.getStateStore(ProcessorContextImpl.java:104)
at org.apache.kafka.streams.processor.internals.ForwardingDisabledProcessorContext.getStateStore(ForwardingDisabledProcessorContext.java:85)
So then I do this:
stateStoreSupplier = Stores.persistentKeyValueStore(STATE_STORE_NAME);
storeStoreBuilder = Stores.keyValueStoreBuilder(stateStoreSupplier, Serdes.String(), jsonSerDe);
stateStoreMaterialized = Materialized.as(stateStoreSupplier);
Then I get this error:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore 'state-store' is already added.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:520)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:512)
Here's the code that fixed my issue. As it turns out, order matters when building the streams. Had to set the materialized store first and then in subsequent lines of code, setup the transformer.
/**
* Create the streams using the KStreams DSL - a method to configure the stream and add any state stores.
*/
#Bean
public KafkaStreamsConfig setup() {
final JsonSerDe<Bus> ltaSerde = new JsonSerDe<>(Bus.class);
final StudentSerde<Student> StudentSerde = new StudentSerde<>();
//start lta stream
KStream<String, Bus> ltaStream = builder
.stream(ltaInputTopic, Consumed.with(Serdes.String(), ltaSerde));
final KStream<String, Student> statusStream = this.builder
.stream(this.locoStatusInputTopic,
Consumed.with(Serdes.String(),
StudentSerde));
//create lta store
KeyValueBytesStoreSupplier ltaStateStoreSupplier = Stores.persistentKeyValueStore(LTA_STATE_STORE_NAME);
final Materialized<String, Bus, KeyValueStore<Bytes, byte[]>> ltaStateStoreMaterialized =
Materialized.
<String, Bus>as(ltaStateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(ltaSerde);
KTable<String, Bus> ltaStateProcessor = ltaStream
//map and convert lta stream into Loco / LTA key value pairs
.groupByKey(Grouped.with(Serdes.String(), ltaSerde))
.aggregate(
//The 'aggregate' and 'reduce' functions ignore messages with null values FYI.
// so if the value after the groupbykey produces a null value, it won't be removed from the state store.
//which is why it's very important to send a message with some terminal flag indicating this value should be removed from the store.
() -> null, /* initializer */
(aggKey, newValue, aggValue) -> {
if (null != newValue.getAssociationEndTime()) { //if there is an endTime associated to this train/loco then remove it from the ktable
logger.trace("removing LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return null; //Returning null removes the record from the state store as well as its changelog topic. re: https://objectpartners.com/2019/07/31/slimming-down-your-kafka-streams-data/
}
logger.trace("adding LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return newValue;
}, /* adder */
ltaStateStoreMaterialized
);
// don't need builder.addStateStore(keyValueStoreStoreBuilder); and CANT use it
// because the ltaStateStoreMaterialized will already be added to the topology in the KTable aggregate method above.
// The below transformer can use the state store because it's already added (apparently) by the aggregate method.
// Add the KTable processors first, then if there are any transformers that need to use the store, add them after the KTable aggregate method.
statusStream.map((k, v) -> new KeyValue<>(v.getLocoId(), v))
.transform(locoStatusTransformerSupplier, ltaStateStoreSupplier.name())
.to("testing.outputtopic", Produced.with(Serdes.String(), StudentSerde));
return this; //can return anything except for void.
}
is stateStoreMaterialized and stateStoreSupplier.name() has the same name?
Use have a error in your topology
KStream.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
You have to supply new instant of StateStoreDeleteTransformer per ProcessContext in TransformerSupplier, like this:
KStream.transform(StateStoreDeleteTransformer::new, stateStoreSupplier.name())
or
KStream.transform(() -> StateStoreDeleteTransformerSupplier.get(), stateStoreSupplier.name())//StateStoreDeleteTransformerSupplier return new instant of StateStoreDeleteTransformer
in stateStoreDeleteTransformer how do you intent on using stateStoreMaterialized inside transformer directly?
I have the similar use case and I using a KeyValueStore<String, MyObj>
public void init(ProcessorContext context) {
kvStore = (KeyValueStore<String, MyObj>) context.getStateStore("store1");
}

Send message to Kafka when SessionWindows was started and ended

I want to send a message to the Kafka topic when new SessionWindow was created and when was ended. I have the following code
stream
.filter(user -> user.isAdmin)
.keyBy(user -> user.username)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
//what now? Trigger?
Now I want to send message when new session was started (with some metadata like web browser and timestamps, these informations are available in each element of stream) and send message to Kafka when session was ended (in this example 10 seconds after last element I think) with number of total requests.
It's possible in Flink? I think I should use some trigger but I don't know how and I can't find any example.
If You want to do this when the window is processed, then You can simply use the WindowProcessFunction, basically what You need to do is to add .process(new MyProcessFunction() to Your code. In the ProcessFunction You can have access to the whole window including its first (start) and last (end) element. You can simply use the Side output to just output the beginning and the end of the given window. You can then create a stream from side output and sink it to Kafka. More on Side outputs can be found here.
You can write a custom window trigger.
How to tell a new session is started?
You can create a ValueState with the default value to null, so in case the state value is null, it is a session start.
When the session ended?
Just before TriggerResult.FIRE.
Here is a demo based on ProcessingTimeTrigger of Flink, I only put the question-related logics here, you can check other details from the source code.
public class MyProcessingTimeTrigger extends Trigger<Object, TimeWindow> {
// a state which keeps a session start.
private final ValueStateDescriptor<Long> stateDescriptor = new ValueStateDescriptor<Long>("session-start", Long.class);
#Override
public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
ValueState<Long> state = ctx.getPartitionedState(stateDescriptor);
if(state.value() == null) {
// if value is null, it's a session start.
state.update(window.getStart());
}
ctx.registerProcessingTimeTimer(window.maxTimestamp());
return TriggerResult.CONTINUE;
}
#Override
public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
// here is a session end.
return TriggerResult.FIRE;
}
#Override
public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
ctx.getPartitionedState(stateDescriptor).clear();
ctx.deleteProcessingTimeTimer(window.maxTimestamp());
}
}

Flink Multiple Windows on same data

My flink application does the following
source: read data in form of records from Kafka
split: based on certain criteria
window : timewindow of 10seconds to aggregate into one bulkrecord
sink: dump these bulkrecords to elasticsearch
I am facing issue where flink consumer is not able to hold data for 10seconds, and is throwing the following exception:
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Size of the state is larger than the maximum permitted memory-backed state. Size=18340663 , maxSize=5242880
I cannot apply countWindow, because if the frequency of records is too slow, then the elasticsearch sink might be deferred for a long time.
My question:
Is it possible to apply a OR function of TimeWindow and CountWindow, which goes as
> if ( recordCount is 500 OR 10 seconds have elapsed)
> then dump data to flink
Not directly. But you can use a GlobalWindow with a custom triggering logic. Take a look at the source for the count trigger here.
Your triggering logic will look something like this.
private final ReducingStateDescriptor<Long> stateDesc =
new ReducingStateDescriptor<>("count", new Sum(), LongSerializer.INSTANCE);
private long triggerTimestamp = 0;
#Override
public TriggerResult onElement(String element, long l, GlobalWindow globalWindow, TriggerContext triggerContext) throws Exception {
ReducingState<Long> count = triggerContext.getPartitionedState(stateDesc);
// Increment window counter by one, when an element is received
count.add(1L);
// Start the timer when the first packet is received
if (count.get() == 1) {
triggerTimestamp = triggerContext.getCurrentProcessingTime() + 10000; // trigger at 10 seconds from reception of first event
triggerContext.registerProcessingTimeTimer(triggerTimestamp); // Override the onProcessingTime method to trigger the window at this time
}
// Or trigger the window when the number of packets in the window reaches 500
if (count.get() >= 500) {
// Delete the timer, clear the count and fire the window
triggerContext.deleteProcessingTimeTimer(triggerTimestamp);
count.clear();
return TriggerResult.FIRE;
}
return TriggerResult.CONTINUE;
}
You could also use the RocksDB state backend, but a custom Trigger will perform better.

Apache Beam: Error assigning event time using Withtimestamp

I have an unbounded Kafka stream sending data with the following fields
{"identifier": "xxx", "value": 10.0, "ts":"2019-01-16T10:51:26.326242+0000"}
I read the stream using the apache beam sdk for kafka
import org.apache.beam.sdk.io.kafka.KafkaIO;
pipeline.apply(KafkaIO.<Long, String>read()
.withBootstrapServers("kafka:9092")
.withTopic("test")
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
.updateConsumerProperties(ImmutableMap.of("group.id", "Consumer1"))
.commitOffsetsInFinalize()
.withoutMetadata()))
Since I want to window using event time ("ts" in my example), i parse the incoming string and assign "ts" field of the incoming datastream as the timestamp.
PCollection<Temperature> tempCollection = p.apply(new SetupKafka())
.apply(ParDo.of(new ReadFromTopic()))
.apply("ParseTemperature", ParDo.of(new ParseTemperature()));
tempCollection.apply("AssignTimeStamps", WithTimestamps.of(us -> new Instant(us.getTimestamp())));
The window function and the computation is applied as below:
PCollection<Output> output = tempCollection.apply(Window
.<Temperature>into(FixedWindows.of(Duration.standardSeconds(30)))
.triggering(AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(10))))
.withAllowedLateness(Duration.standardDays(1))
.accumulatingFiredPanes())
.apply(new ComputeMax());
I stream data into the input stream with a lag of 5 seconds from current utc time since in practical scenrios event timestamp is usually earlier than the processing timestamp.
I get the following error:
Cannot output with timestamp 2019-01-16T11:15:45.560Z. Output
timestamps must be no earlier than the timestamp of the current input
(2019-01-16T11:16:50.640Z) minus the allowed skew (0 milliseconds).
See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing
the allowed skew.
If I comment out the line for AssignTimeStamps, there are no errors but I guess, then it is considering the processing time.
How do I ensure my computation and windows are based on event time and not for processing time?
Please provide some inputs on how to handle this scenario.
To be able to use custom timestamp, first You need to implement CustomTimestampPolicy, by extending TimestampPolicy<KeyT,ValueT>
For example:
public class CustomFieldTimePolicy extends TimestampPolicy<String, Foo> {
protected Instant currentWatermark;
public CustomFieldTimePolicy(Optional<Instant> previousWatermark) {
currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
}
#Override
public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, Foo> record) {
currentWatermark = new Instant(record.getKV().getValue().getTimestamp());
return currentWatermark;
}
#Override
public Instant getWatermark(PartitionContext ctx) {
return currentWatermark;
}
}
Then you need to pass your custom TimestampPolicy, when you setting up your KafkaIO source using functional interface TimestampPolicyFactory
KafkaIO.<String, Foo>read().withBootstrapServers("http://localhost:9092")
.withTopic("foo")
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializerAndCoder(KafkaAvroDeserializer.class, AvroCoder.of(Foo.class)) //if you use avro
.withTimestampPolicyFactory((tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))
.updateConsumerProperties(kafkaProperties))
This line is responsible for creating a new timestampPolicy, passing a related partition and previous checkpointed watermark see the documentation
withTimestampPolicyFactory(tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))
Have you had a chance to try this using the time stamp policy, sorry I have not tried this one out myself, but I believe with 2.9.0 you should look at using the policy along with the KafkaIO read.
https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/kafka/KafkaIO.Read.html#withTimestampPolicyFactory-org.apache.beam.sdk.io.kafka.TimestampPolicyFactory-