How to handle backpressure when using Reactor Kafka? - reactive-programming

I'm using Reactor Kafka to both consume and produce Kafka events. My consumer is slow, so I need to handle backpressure.
However, no matter what value I pass to Subscription.request(), the publisher publishes all events from the topic immediately, overwhelming the consumer.
I'm using a custom Subscriber that requests a small initial number of items by calling Subscription.request() when it subscribes to KafkaReceiver.receive(). To my understanding, this is how I tell the publisher how many events my consumer initially wants.
My subscriber:
import org.reactivestreams.Subscription;
import reactor.core.publisher.BaseSubscriber;

public class KafkaEventSubscriber extends BaseSubscriber<EnrichedMetadata> {
    private final int numberOfItemsToRequestOnSubscribe;
    private final int numberOfItemsToRequestOnNext;

    public KafkaEventSubscriber(int numberOfItemsToRequestOnSubscribe,
                                int numberOfItemsToRequestOnNext) {
        this.numberOfItemsToRequestOnSubscribe = numberOfItemsToRequestOnSubscribe;
        this.numberOfItemsToRequestOnNext = numberOfItemsToRequestOnNext;
    }

    // Initial demand, signalled once when the subscription is established.
    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        subscription.request(numberOfItemsToRequestOnSubscribe);
    }

    // Additional demand, signalled after each received item.
    @Override
    protected void hookOnNext(EnrichedMetadata value) {
        request(numberOfItemsToRequestOnNext);
    }
}
How I use the subscriber:
kafkaReceiver.receive().map(ReceiverRecord::value).map(KafkaConsumer::acknowledge).subscribe(new KafkaEventSubscriber(10, 1));
I expect the KafkaReceiver to output 10 events before any call to the subscriber's onNext() method is done, but it outputs all events from the topic that have not already been acknowledged.
In short, the demand I signal through Subscription.request() seems to be ignored, so the backpressure measures I've taken have no effect.
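For reference, here is a minimal sketch of how request(n) bounds delivery under the Reactive Streams contract. It deliberately uses the JDK's java.util.concurrent.Flow interfaces instead of Reactor Kafka so that it runs without extra dependencies; RangePublisher, CountingSubscriber and deliveredCount are hypothetical names made up for this illustration, and the sketch does not model whatever polling or buffering KafkaReceiver does internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

class BoundedDemandDemo {

    /** Synchronous publisher of 0..count-1 that emits only as many items as were requested. */
    static final class RangePublisher implements Flow.Publisher<Integer> {
        private final int count;

        RangePublisher(int count) {
            this.count = count;
        }

        @Override
        public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
            subscriber.onSubscribe(new Flow.Subscription() {
                private int next = 0;
                private long demand = 0;
                private boolean draining = false;
                private boolean done = false;

                @Override
                public void request(long n) {
                    if (done) {
                        return;
                    }
                    if (n <= 0) { // non-positive demand is a protocol violation
                        done = true;
                        subscriber.onError(new IllegalArgumentException("request must be > 0"));
                        return;
                    }
                    demand += n;
                    if (draining) {
                        return; // re-entrant request from onNext; the outer loop picks it up
                    }
                    draining = true;
                    while (demand > 0 && next < count && !done) {
                        demand--;
                        subscriber.onNext(next++);
                    }
                    draining = false;
                    if (next == count && !done) {
                        done = true;
                        subscriber.onComplete();
                    }
                }

                @Override
                public void cancel() {
                    done = true;
                }
            });
        }
    }

    /** Mirrors the subscriber from the question: initial demand, then more after each item. */
    static final class CountingSubscriber implements Flow.Subscriber<Integer> {
        final List<Integer> received = new ArrayList<>();
        private final long initial;
        private final long perItem;
        private Flow.Subscription subscription;

        CountingSubscriber(long initial, long perItem) {
            this.initial = initial;
            this.perItem = perItem;
        }

        @Override
        public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(initial);
        }

        @Override
        public void onNext(Integer item) {
            received.add(item);
            if (perItem > 0) {
                subscription.request(perItem);
            }
        }

        @Override
        public void onError(Throwable t) {
            throw new IllegalStateException(t);
        }

        @Override
        public void onComplete() {
        }
    }

    /** Runs one subscription and reports how many items reached the subscriber. */
    static int deliveredCount(int total, long initial, long perItem) {
        CountingSubscriber subscriber = new CountingSubscriber(initial, perItem);
        new RangePublisher(total).subscribe(subscriber);
        return subscriber.received.size();
    }

    public static void main(String[] args) {
        System.out.println(deliveredCount(100, 10, 0)); // prints 10
        System.out.println(deliveredCount(100, 10, 1)); // prints 100
    }
}
```

Note the second case: a subscriber that calls request(1) from every onNext() replenishes its own demand, so a publisher that honors demand still delivers the whole stream, just never more than the outstanding request at a time. request(n) limits what is in flight, not the total.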

Related

Kafka streams event deduplication keeping last event in window

I'm using Kafka Streams to deduplicate events over short time windows (<= 1 minute).
First I tried to tackle the problem with the DSL API and the .suppress(Suppressed.untilWindowCloses(...)) operator but, since wall-clock time is not yet supported there (I've seen KIP-424), this operator is not viable for my use case.
Then I followed this official Confluent example, which uses the low-level Processor API. It works fine but has one major limitation for my use case: the single event (obtained by deduplication) is emitted at the beginning of the time window, and subsequent duplicate events are "suppressed". I need the reverse: a single event should be emitted at the end of the window.
I'm asking for suggestions on how to implement this use case with Processor API.
My idea was to use the Processor API with a custom Transformer and a Punctuator.
The transformer would store in a WindowStore the distinct keys received without returning any KeyValue. Simultaneously, I'd schedule a punctuator running with an interval equal to the size of the window in the WindowStore. This punctuator will iterate over the elements in the store and forward them downstream.
The following are some core parts of the logic:
DeduplicationTransformer (slightly modified from official Confluent example):
@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
    this.context = context;
    eventIdStore = (WindowStore<E, V>) context.getStateStore(this.storeName);
    // Schedule punctuator for this transformer.
    context.schedule(Duration.ofMillis(this.windowSizeMs), PunctuationType.WALL_CLOCK_TIME,
            new DeduplicationPunctuator<E, V>(eventIdStore, context, this.windowSizeMs));
}

@Override
public KeyValue<K, V> transform(final K key, final V value) {
    final E eventId = idExtractor.apply(key, value);
    if (eventId == null) {
        return KeyValue.pair(key, value);
    } else {
        if (!isDuplicate(eventId)) {
            rememberNewEvent(eventId, value, context.timestamp());
        }
        return null;
    }
}
DeduplicationPunctuator:
public DeduplicationPunctuator(WindowStore<E, V> eventIdStore, ProcessorContext context,
                               long retainPeriodMs) {
    this.eventIdStore = eventIdStore;
    this.context = context;
    this.retainPeriodMs = retainPeriodMs;
}

@Override
public void punctuate(long invocationTime) {
    LOGGER.info("Punctuator invoked at {}, searching from {}",
            new Date(invocationTime), new Date(invocationTime - retainPeriodMs));
    KeyValueIterator<Windowed<E>, V> it =
            eventIdStore.fetchAll(invocationTime - retainPeriodMs, invocationTime + retainPeriodMs);
    while (it.hasNext()) {
        KeyValue<Windowed<E>, V> next = it.next();
        LOGGER.info("Punctuator running on {}", next.key.key());
        context.forward(next.key.key(), next.value);
        // Delete from store with tombstone
        eventIdStore.put(next.key.key(), null, invocationTime);
        context.commit();
    }
    it.close();
}
Is this a valid approach?
With the previous code, I'm running some integration tests and I've some synchronization issues. How can I be sure that the start of the window will coincide with the Punctuator's scheduled interval?
Also as an alternative approach, I was wondering (I've googled with no result), if there is any event triggered by window closing to which I can attach a callback in order to iterate over store and publish only distinct events.
Thanks.
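One way to think about the synchronization question is to stop relying on the punctuator firing exactly at a window boundary and instead derive the window from each event's timestamp, flushing only windows that have already closed. The following is a plain-Java sketch of that idea with no Kafka Streams dependency: LastEventPerWindow, record and punctuate are hypothetical names standing in for the transformer, transform() and the Punctuator, and an in-memory TreeMap stands in for the WindowStore. Timestamps are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LastEventPerWindow {
    private final long windowSizeMs;
    // window start -> (event id -> last value seen in that window)
    private final TreeMap<Long, Map<String, String>> windows = new TreeMap<>();

    LastEventPerWindow(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    /** Aligns a non-negative timestamp to the start of its fixed-size window. */
    long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    /** Stands in for transform(): remember the latest value, emit nothing yet. */
    void record(String eventId, String value, long timestampMs) {
        windows.computeIfAbsent(windowStart(timestampMs), start -> new LinkedHashMap<>())
               .put(eventId, value);
    }

    /** Stands in for punctuate(): emit and drop every window that has closed by nowMs. */
    List<String> punctuate(long nowMs) {
        List<String> emitted = new ArrayList<>();
        // A window is closed when start + size <= now, i.e. start <= now - size.
        Map<Long, Map<String, String>> closed = windows.headMap(nowMs - windowSizeMs, true);
        for (Map<String, String> window : closed.values()) {
            window.forEach((id, value) -> emitted.add(id + "=" + value));
        }
        closed.clear(); // the view is backed by `windows`, so this removes the entries
        return emitted;
    }

    public static void main(String[] args) {
        LastEventPerWindow dedup = new LastEventPerWindow(60_000);
        dedup.record("a", "v1", 1_000);
        dedup.record("a", "v2", 30_000);  // overwrites v1 within window [0, 60000)
        dedup.record("b", "x", 59_000);
        dedup.record("a", "v3", 61_000);  // belongs to window [60000, 120000)
        System.out.println(dedup.punctuate(60_000));  // prints [a=v2, b=x]
        System.out.println(dedup.punctuate(120_000)); // prints [a=v3]
    }
}
```

With this shape, a punctuation that fires late or off-boundary only delays the emission; it never splits or merges windows, because window membership depends solely on the event timestamp.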

Kafka Processor API with exactly once semantic for each record processed

Scenario:
We are using the Kafka Processor API (not the DSL) to read records from a source topic; the stream processor writes records to one or more target topics.
We know exactly-once can be enabled for the entire processor by using:
props.put("isolation.level", "read_committed");
But we want to decide, based on the incoming record's key, whether we want exactly-once or at-least-once semantics.
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class StreamRouterProcessor implements Processor<String, String> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void process(String eventName, String eventMessage) { // called for each record
    }

    @Override
    public void close() {
    }
}
Is there a way to select exactly-once or at-least-once on the fly for each record being processed (perhaps for each record handled by the process() method above)?
To enable exactly-once semantics you should use the StreamsConfig.PROCESSING_GUARANTEE_CONFIG property. ConsumerConfig.ISOLATION_LEVEL_CONFIG (isolation.level) is a consumer config and should be used if you work with a raw Consumer.
It is not possible to choose the processing guarantee (exactly-once or at-least-once) at the message level.
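To make the application-level nature of the setting concrete, here is a small sketch that builds the configuration with plain java.util.Properties and the literal keys behind the constants ("processing.guarantee" for StreamsConfig.PROCESSING_GUARANTEE_CONFIG, with values "exactly_once" and "at_least_once"). The application ids, the bootstrap address, the "payment-" key prefix and the topic names are all made up for illustration. The routeTopic helper shows one possible workaround that the answer itself does not propose: split records by key into two topics and run two separately configured applications.

```java
import java.util.Properties;

class GuaranteeConfigSketch {

    /** Builds Streams properties; the guarantee applies to the whole application. */
    static Properties streamsProps(String applicationId, String guarantee) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("processing.guarantee", guarantee);     // newer releases also offer "exactly_once_v2"
        return props;
    }

    /** Hypothetical routing rule: keys with a "payment-" prefix need exactly-once. */
    static String routeTopic(String key) {
        boolean needsExactlyOnce = key != null && key.startsWith("payment-");
        return needsExactlyOnce ? "exactly-once-input" : "at-least-once-input";
    }

    public static void main(String[] args) {
        Properties strict = streamsProps("router-eos", "exactly_once");
        Properties relaxed = streamsProps("router-alo", "at_least_once");
        System.out.println(strict.getProperty("processing.guarantee"));  // prints exactly_once
        System.out.println(relaxed.getProperty("processing.guarantee")); // prints at_least_once
        System.out.println(routeTopic("payment-42"));                    // prints exactly-once-input
    }
}
```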

Detecting abandoned processes in Kafka Streams 2.0

I have this case: users collect orders as order lines. I implemented this with a Kafka topic containing order-change events; they are merged, stored in a local key-value store, and broadcast as a second topic with order versions.
I need to somehow react to abandoned orders: ones that were started but have had no change for at least the last x hours.
A simple solution could be to scan the local store every y minutes and post an order status change event to Abandoned. It seems I cannot access the store from outside a processor... It is also not very elegant. Any suggestions are welcome.
--edit
I cannot just add punctuation to the merge/validation transformer, because its output is different and should be routed elsewhere, like on this image (single Kafka Streams app):
so the "abandoned orders processor/transformer" will be a no-op for its input (the only trigger here is time). Another thing is that in such a case (as on the image) my transformer gets a ForwardingDisabledProcessorContext upon initialization, so I cannot emit any messages in the punctuator. I could pass in a kafkaTemplate bean and produce new messages with it, but then the whole processor/transformer is just an empty shell that exists only to access the local store...
This is the snippet of code I used:
public class AbandonedOrdersTransformer implements ValueTransformer<OrderEvent, OrderEvent> {
    private ProcessorContext context;
    private KeyValueStore<String, Order> stateStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext processorContext) {
        this.context = processorContext;
        stateStore = (KeyValueStore<String, Order>) processorContext.getStateStore(KafkaConfig.OPENED_ORDERS_STORE);
        // main scheduler
        this.context.schedule(TimeUnit.MINUTES.toMillis(5), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
            // try-with-resources closes the iterator even if forwarding throws
            try (KeyValueIterator<String, Order> iter = this.stateStore.all()) {
                while (iter.hasNext()) {
                    KeyValue<String, Order> entry = iter.next();
                    if (OrderStatuses.NEW.equals(entry.value.getStatus()) &&
                            (timestamp - entry.value.getLastChanged().getTime()) > TimeUnit.HOURS.toMillis(4)) {
                        // SEND ABANDON EVENT "event" (its construction is not shown in this snippet)
                        context.forward(entry.key, event);
                    }
                }
            }
            context.commit();
        });
    }

    @Override
    public OrderEvent transform(OrderEvent orderEvent) {
        // do nothing
        return null;
    }

    @Override
    public void close() {
        // do nothing
    }
}
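The detection rule itself (status NEW and idle for more than 4 hours) can be isolated from the Kafka Streams plumbing, which makes it easy to test and to call from whichever component ends up having access to the store. Below is a plain-Java sketch under that assumption; Order, Status, findAbandoned and demo are simplified, hypothetical stand-ins for the types in the snippet above, and a LinkedHashMap stands in for the key-value store.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AbandonedOrderScan {
    enum Status { NEW, CONFIRMED }

    /** Simplified stand-in for the stored order. */
    static final class Order {
        final Status status;
        final Instant lastChanged;

        Order(Status status, Instant lastChanged) {
            this.status = status;
            this.lastChanged = lastChanged;
        }
    }

    /** Returns the keys of orders that are still NEW and idle for longer than maxIdle. */
    static List<String> findAbandoned(Map<String, Order> store, Instant now, Duration maxIdle) {
        List<String> abandoned = new ArrayList<>();
        for (Map.Entry<String, Order> entry : store.entrySet()) {
            Order order = entry.getValue();
            boolean idleTooLong = Duration.between(order.lastChanged, now).compareTo(maxIdle) > 0;
            if (order.status == Status.NEW && idleTooLong) {
                abandoned.add(entry.getKey());
            }
        }
        return abandoned;
    }

    /** Small fixed scenario: one stale NEW order, one recent NEW order, one stale CONFIRMED order. */
    static List<String> demo() {
        Instant now = Instant.parse("2019-01-01T12:00:00Z");
        Map<String, Order> store = new LinkedHashMap<>();
        store.put("order-1", new Order(Status.NEW, now.minus(Duration.ofHours(5))));
        store.put("order-2", new Order(Status.NEW, now.minus(Duration.ofHours(1))));
        store.put("order-3", new Order(Status.CONFIRMED, now.minus(Duration.ofHours(9))));
        return findAbandoned(store, now, Duration.ofHours(4));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [order-1]
    }
}
```

Passing the current time in as a parameter, the way the punctuator receives its timestamp, keeps the rule deterministic in tests.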

Apache Flink: How to sink events to different Kafka topics depending on the event type?

I was wondering whether it is possible to use the Flink Kafka sink to write events to different topics depending on the event type.
Let's say that we have different types of events: notifications, messages and friend requests. We want to stream these events to different topics named notification-topic, messages-topic and friendsRequest-topic.
I tried many different ways to solve this problem, but still couldn't find the right solution. I heard that I could use a ProcessFunction, but how does it relate to my problem?
In case you are using Kafka:
FlinkKafkaProducer011<Event> producer = new FlinkKafkaProducer011<>(
        "default.topic",
        new KeyedSerializationSchema<Event>() {
            @Override
            public byte[] serializeKey(Event element) {
                return null; // or serialize element.getKey() to bytes
            }

            @Override
            public byte[] serializeValue(Event element) {
                return element.toBytes(); // assumes Event provides its own serialization
            }

            @Override
            public String getTargetTopic(Event element) {
                return element.getTopic();
            }
        },
        parameterTool.getProperties());
input.addSink(producer);
getTargetTopic is called for every event to get the topic the event should be routed to. The returned topic overrides "default.topic".
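What getTopic() returns is up to the event class. As a sketch of the mapping for the three event types in the question, the following plain-Java snippet (no Flink dependency) maps a type to its topic name; EventType and TopicRouter are hypothetical names, while the topic names are the ones given in the question.

```java
class TopicRouter {
    enum EventType { NOTIFICATION, MESSAGE, FRIEND_REQUEST }

    /** Maps an event type to the Kafka topic named in the question. */
    static String targetTopic(EventType type) {
        switch (type) {
            case NOTIFICATION:
                return "notification-topic";
            case MESSAGE:
                return "messages-topic";
            case FRIEND_REQUEST:
                return "friendsRequest-topic";
            default:
                throw new IllegalArgumentException("Unknown event type: " + type);
        }
    }

    public static void main(String[] args) {
        for (EventType type : EventType.values()) {
            System.out.println(type + " -> " + targetTopic(type));
        }
    }
}
```

An Event.getTopic() implementation could delegate to a function like this, so the routing table lives in one place.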

Samza: Delay processing of messages until timestamp

I'm processing messages from a Kafka topic with Samza. Some of the messages come with a timestamp in the future and I'd like to postpone the processing until after that timestamp. In the meantime, I'd like to keep processing other incoming messages.
What I tried is to have my Task queue the messages and implement WindowableTask to periodically check whether their timestamps allow them to be processed. The basic idea looks like this:
public class MyTask implements StreamTask, WindowableTask {
    private HashSet<MyMessage> waitingMessages = new HashSet<>();

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.add(parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        for (MyMessage message : waitingMessages) {
            if (message.getValidFromDateTime().isBeforeNow()) {
                // Do the processing and remove the message from the set
            }
        }
    }
}
This obviously has some downsides. I'd be losing my waiting messages in memory when I redeploy my task. So I'd like to know the best practice for delaying the processing of messages with Samza. Do I need to reemit the messages to the same topic again and again until I can finally process them? We're talking about delaying the processing for a few minutes up to 1-2 hours here.
An important thing to keep in mind when dealing with message queues is that they perform a very specific function in a system: they hold messages while the processor(s) are busy processing preceding messages. A properly functioning message queue is expected to deliver messages on demand. This implies that as soon as a message reaches the head of the queue, the next pull on the queue will yield that message.
Notice that delay is not a configurable part of the equation. Instead, delay is an output variable of a system with a queue. In fact, Little's Law offers some interesting insights into this.
So, in a system where a delay is necessary (for example, to join/wait for a parallel operation to complete), you should be looking at other methods. Typically a queryable database would make sense in this particular instance. If you find yourself keeping messages in a queue for a pre-set period of time, you're actually using the message queue as a database - a function it was not designed to provide. Not only is this risky, but it also has a high likelihood of hurting the performance of your message broker.
I think you could use Samza's key-value store to keep the state of your task instance instead of an in-memory Set.
It should look something like this:
public class MyTask implements StreamTask, WindowableTask, InitableTask {
    private KeyValueStore<String, MyMessage> waitingMessages;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) throws Exception {
        this.waitingMessages = (KeyValueStore<String, MyMessage>) context.getStore("messages-store");
    }

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
                        TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.put(parsedMessage.getId(), parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        KeyValueIterator<String, MyMessage> all = waitingMessages.all();
        try {
            while (all.hasNext()) {
                Entry<String, MyMessage> entry = all.next();
                // Only handle messages whose timestamp has passed, as in process()
                if (entry.getValue().getValidFromDateTime().isBeforeNow()) {
                    // Do the processing, then remove the message from the store
                    waitingMessages.delete(entry.getKey());
                }
            }
        } finally {
            all.close(); // store iterators should be closed after use
        }
    }
}
If you redeploy your task, Samza should recreate the state of the key-value store (Samza keeps the values in a special Kafka topic associated with the store). You of course need to provide some extra configuration for your store (messages-store in the example above).
You can read about the key-value store here (for the latest Samza version):
https://samza.apache.org/learn/documentation/0.14/container/state-management.html
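The process()/window() split described above can be sketched without Samza to show the control flow in isolation. In the following plain-Java illustration, DelayedMessageBuffer and Message are hypothetical names, an in-memory PriorityQueue ordered by due time stands in for the durable key-value store, and the current time is passed in explicitly so the behavior is deterministic. Ordering by due time is a design choice of this sketch, not something the answer prescribes: it lets window() stop at the first message that is not yet due instead of scanning everything.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class DelayedMessageBuffer {

    /** Simplified stand-in for the parsed message. */
    static final class Message {
        final String id;
        final long validFromMs;

        Message(String id, long validFromMs) {
            this.id = id;
            this.validFromMs = validFromMs;
        }
    }

    // Waiting messages, earliest due time first (in-memory stand-in for the store).
    private final PriorityQueue<Message> waiting =
            new PriorityQueue<>(Comparator.comparingLong((Message m) -> m.validFromMs));

    // Ids in the order they were handled, kept only so the demo can show the result.
    final List<String> processed = new ArrayList<>();

    /** Stands in for process(): handle due messages right away, park the rest. */
    void process(Message message, long nowMs) {
        if (message.validFromMs <= nowMs) {
            handle(message);
        } else {
            waiting.add(message);
        }
    }

    /** Stands in for window(): handle and remove every parked message that is now due. */
    void window(long nowMs) {
        while (!waiting.isEmpty() && waiting.peek().validFromMs <= nowMs) {
            handle(waiting.poll());
        }
    }

    int waitingCount() {
        return waiting.size();
    }

    private void handle(Message message) {
        processed.add(message.id);
    }

    public static void main(String[] args) {
        DelayedMessageBuffer buffer = new DelayedMessageBuffer();
        buffer.process(new Message("now", 1_000), 1_000);
        buffer.process(new Message("later", 5_000), 1_000);
        buffer.process(new Message("soon", 2_000), 1_000);
        buffer.window(2_500);
        System.out.println(buffer.processed); // prints [now, soon]
        buffer.window(5_000);
        System.out.println(buffer.processed); // prints [now, soon, later]
    }
}
```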