Detecting abandoned processes in Kafka Streams 2.0 - apache-kafka

I have this case: users collect orders as order lines. I implemented this with a Kafka topic containing order-change events; the changes are merged, stored in a local key-value store, and broadcast to a second topic as order versions.
I need to somehow react to abandoned orders - ones that were started but have had no change for at least the last x hours.
A simple solution would be to scan the local store every y minutes and post an event changing the order status to Abandoned. But it seems I cannot access the store from outside a processor... and it is also not very elegant coding. Any suggestions are welcome.
--edit
I cannot just add punctuation to the merge/validation transformer, because its output is different and should be routed elsewhere, like on this image (single Kafka Streams app):
so the "abandoned orders processor/transformer" will be a no-op for its input (the only trigger here is time). Another problem is that in such a case (as on the image) my transformer gets a ForwardingDisabledProcessorContext upon initialization, so I cannot emit any messages from the punctuator. I could just pass a kafkaTemplate bean in there and produce new messages directly, but then the whole processor/transformer is just an empty shell that exists only to access the local store...
This is a snippet of the code I used:
public class AbandonedOrdersTransformer implements ValueTransformer<OrderEvent, OrderEvent> {

    private ProcessorContext context;
    private KeyValueStore<String, Order> stateStore;

    @Override
    public void init(ProcessorContext processorContext) {
        this.context = processorContext;
        this.stateStore = (KeyValueStore<String, Order>) processorContext.getStateStore(KafkaConfig.OPENED_ORDERS_STORE);

        // main scheduler
        this.context.schedule(TimeUnit.MINUTES.toMillis(5), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
            KeyValueIterator<String, Order> iter = this.stateStore.all();
            while (iter.hasNext()) {
                KeyValue<String, Order> entry = iter.next();
                if (OrderStatuses.NEW.equals(entry.value.getStatus()) &&
                        (timestamp - entry.value.getLastChanged().getTime()) > TimeUnit.HOURS.toMillis(4)) {
                    // SEND ABANDON EVENT "event"
                    context.forward(entry.key, event);
                }
            }
            iter.close();
            context.commit();
        });
    }

    @Override
    public OrderEvent transform(OrderEvent orderEvent) {
        // do nothing
        return null;
    }

    @Override
    public void close() {
        // do nothing
    }
}
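One thing I could try (a hedged sketch, untested): wire the same logic through transform() with a Transformer instead of transformValues()/ValueTransformer, since only the latter hands the transformer a ForwardingDisabledProcessorContext. Apart from the names already used above (OrderEvent, Order, OrderStatuses, KafkaConfig.OPENED_ORDERS_STORE), everything here is illustrative:
import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class AbandonedOrdersEmitter implements Transformer<String, OrderEvent, KeyValue<String, OrderEvent>> {

    private ProcessorContext context;
    private KeyValueStore<String, Order> stateStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext processorContext) {
        this.context = processorContext;
        this.stateStore = (KeyValueStore<String, Order>) processorContext.getStateStore(KafkaConfig.OPENED_ORDERS_STORE);

        // Same wall-clock punctuator as above; forward() is allowed here because
        // the context comes from transform(), not transformValues().
        this.context.schedule(TimeUnit.MINUTES.toMillis(5), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, Order> iter = stateStore.all()) {
                while (iter.hasNext()) {
                    KeyValue<String, Order> entry = iter.next();
                    if (OrderStatuses.NEW.equals(entry.value.getStatus()) &&
                            (timestamp - entry.value.getLastChanged().getTime()) > TimeUnit.HOURS.toMillis(4)) {
                        // SEND ABANDON EVENT "event" (placeholder, as in the snippet above)
                        context.forward(entry.key, event);
                    }
                }
            }
            context.commit();
        });
    }

    @Override
    public KeyValue<String, OrderEvent> transform(String key, OrderEvent orderEvent) {
        return null; // no-op for the input; only the punctuator emits
    }

    @Override
    public void close() {
        // do nothing
    }
}
It would be wired with something like stream.transform(AbandonedOrdersEmitter::new, KafkaConfig.OPENED_ORDERS_STORE) instead of transformValues(...), so the abandon events could be routed separately from the (empty) pass-through output.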

Related

Kafka streams event deduplication keeping last event in window

I'm using Kafka Streams in a deduplication events problem over short time windows (<= 1 minute).
First I tried to tackle the problem using the DSL API with the .suppress(Suppressed.untilWindowCloses(...)) operator but, given that wall-clock time is not yet supported (I've seen KIP-424), this operator is not viable for my use case.
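For reference, the DSL route just mentioned looks roughly like this (a sketch with placeholder topic names and default serdes); untilWindowCloses emits one record per key only when the window closes, and closing is driven by stream time rather than wall-clock time:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("events");
input.groupByKey()
     .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ZERO))
     .reduce((oldValue, newValue) -> newValue)   // keep the last event seen per key in the window
     .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
     .toStream()
     .map((windowedKey, value) -> KeyValue.pair(windowedKey.key(), value))
     .to("deduplicated-events");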
Then I followed this official Confluent example in which the low-level Processor API is used. It was working fine but has one major limitation for my use case: the single event (obtained by deduplication) is emitted at the beginning of the time window, and subsequent duplicate events are "suppressed". In my use case I need the reverse of that, meaning that a single event should be emitted at the end of the window.
I'm asking for suggestions on how to implement this use case with Processor API.
My idea was to use the Processor API with a custom Transformer and a Punctuator.
The transformer would store in a WindowStore the distinct keys received without returning any KeyValue. Simultaneously, I'd schedule a punctuator running with an interval equal to the size of the window in the WindowStore. This punctuator will iterate over the elements in the store and forward them downstream.
The following are some core parts of the logic:
DeduplicationTransformer (slightly modified from official Confluent example):
@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
    this.context = context;
    eventIdStore = (WindowStore<E, V>) context.getStateStore(this.storeName);
    // Schedule punctuator for this transformer.
    context.schedule(Duration.ofMillis(this.windowSizeMs), PunctuationType.WALL_CLOCK_TIME,
            new DeduplicationPunctuator<E, V>(eventIdStore, context, this.windowSizeMs));
}

@Override
public KeyValue<K, V> transform(final K key, final V value) {
    final E eventId = idExtractor.apply(key, value);
    if (eventId == null) {
        return KeyValue.pair(key, value);
    } else {
        if (!isDuplicate(eventId)) {
            rememberNewEvent(eventId, value, context.timestamp());
        }
        return null;
    }
}
DeduplicationPunctuator:
public DeduplicationPunctuator(WindowStore<E, V> eventIdStore, ProcessorContext context,
                               long retainPeriodMs) {
    this.eventIdStore = eventIdStore;
    this.context = context;
    this.retainPeriodMs = retainPeriodMs;
}

@Override
public void punctuate(long invocationTime) {
    LOGGER.info("Punctuator invoked at {}, searching from {}", new Date(invocationTime), new Date(invocationTime - retainPeriodMs));
    KeyValueIterator<Windowed<E>, V> it =
            eventIdStore.fetchAll(invocationTime - retainPeriodMs, invocationTime + retainPeriodMs);
    while (it.hasNext()) {
        KeyValue<Windowed<E>, V> next = it.next();
        LOGGER.info("Punctuator running on {}", next.key.key());
        context.forward(next.key.key(), next.value);
        // Delete from store with tombstone
        eventIdStore.put(next.key.key(), null, invocationTime);
        context.commit();
    }
    it.close();
}
Is this a valid approach?
With the previous code, I'm running some integration tests and I have some synchronization issues. How can I be sure that the start of the window will coincide with the Punctuator's scheduled interval?
Also, as an alternative approach, I was wondering (I've googled with no result) whether there is any event triggered by window closing to which I can attach a callback, in order to iterate over the store and publish only distinct events.
Thanks.

How to handle backpressure when using Reactor Kafka?

I'm using Reactor Kafka to both consume and produce Kafka events. In the case of consuming events, my consumer is slow and therefore I need to handle backpressure.
However, I experience that no matter what I call Subscription.request() with, the publisher publishes all events from the topic immediately, thereby overwhelming the consumer.
I'm using a custom Subscriber that sets a small number of initial requests by calling Subscription.request() when I subscribe to KafkaReceiver.receive(). To my understanding, this is how I tell the publisher how many events my consumer initially wants.
My subscriber:
public class KafkaEventSubscriber extends BaseSubscriber<EnrichedMetadata> {

    private final int numberOfItemsToRequestOnSubscribe;
    private final int numberOfItemsToRequestOnNext;

    public KafkaEventSubscriber(int numberOfItemsToRequestOnSubscribe,
                                int numberOfItemsToRequestOnNext) {
        this.numberOfItemsToRequestOnSubscribe = numberOfItemsToRequestOnSubscribe;
        this.numberOfItemsToRequestOnNext = numberOfItemsToRequestOnNext;
    }

    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        subscription.request(numberOfItemsToRequestOnSubscribe);
    }

    @Override
    protected void hookOnNext(EnrichedMetadata value) {
        request(numberOfItemsToRequestOnNext);
    }
}
How I use the subscriber:
kafkaReceiver.receive().map(ReceiverRecord::value).map(KafkaConsumer::acknowledge).subscribe(new KafkaEventSubscriber(10, 1));
I expect the KafkaReceiver to output 10 events before any call to the subscriber's onNext() method is made, but the KafkaReceiver outputs all events from the topic that have not already been acknowledged.
No matter what we call Subscription.request() with, the publisher publishes all events from the topic immediately, not respecting the backpressure measures I've taken.
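One direction that might help (a hedged sketch, not a verified fix): keep the slow work inside a bounded pipeline so that the small request amounts come from the operators themselves. processSlowly and the receiver options are placeholders, and Schedulers.boundedElastic() assumes a recent Reactor version.
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;
import reactor.kafka.receiver.KafkaReceiver;
import reactor.kafka.receiver.ReceiverOptions;
import reactor.kafka.receiver.ReceiverRecord;

public class BoundedConsumer {

    public static void consume(ReceiverOptions<String, String> receiverOptions) {
        KafkaReceiver.create(receiverOptions)
                .receive()
                .limitRate(10)  // cap how many records are requested from the receiver at once
                .concatMap(record -> Mono.fromRunnable(() -> processSlowly(record))
                                .subscribeOn(Schedulers.boundedElastic())
                                // acknowledge only after the slow work has finished
                                .doFinally(signal -> record.receiverOffset().acknowledge()),
                        1)      // prefetch of 1: pull records from upstream one at a time
                .subscribe();
    }

    // Placeholder for the slow consumer logic from the question.
    private static void processSlowly(ReceiverRecord<String, String> record) {
    }
}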

Samza: Delay processing of messages until timestamp

I'm processing messages from a Kafka topic with Samza. Some of the messages come with a timestamp in the future and I'd like to postpone the processing until after that timestamp. In the meantime, I'd like to keep processing other incoming messages.
What I tried to do is have my Task queue the messages and implement WindowableTask to periodically check whether their timestamps allow processing them. The basic idea looks like this:
public class MyTask implements StreamTask, WindowableTask {

    private HashSet<MyMessage> waitingMessages = new HashSet<>();

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.add(parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        for (MyMessage message : waitingMessages) {
            if (message.getValidFromDateTime().isBeforeNow()) {
                // Do the processing and remove the message from the set
            }
        }
    }
}
This obviously has some downsides. I'd be losing my waiting messages in memory when I redeploy my task. So I'd like to know the best practice for delaying the processing of messages with Samza. Do I need to reemit the messages to the same topic again and again until I can finally process them? We're talking about delaying the processing for a few minutes up to 1-2 hours here.
It's important to keep in mind, when dealing with message queues, that they perform a very specific function in a system: they hold messages while the processor(s) are busy processing preceding messages. It is expected that a properly functioning message queue will deliver messages on demand. What this implies is that as soon as a message reaches the head of the queue, the next pull on the queue will yield the message.
Notice that delay is not a configurable part of the equation. Instead, delay is an output variable of a system with a queue. In fact, Little's Law offers some interesting insights into this.
So, in a system where a delay is necessary (for example, to join/wait for a parallel operation to complete), you should be looking at other methods. Typically a queryable database would make sense in this particular instance. If you find yourself keeping messages in a queue for a pre-set period of time, you're actually using the message queue as a database - a function it was not designed to provide. Not only is this risky, but it also has a high likelihood of hurting the performance of your message broker.
I think you could use Samza's key-value store to keep the state of your task instance instead of an in-memory Set.
It should look something like this:
public class MyTask implements StreamTask, WindowableTask, InitableTask {

    private KeyValueStore<String, MyMessage> waitingMessages;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) throws Exception {
        this.waitingMessages = (KeyValueStore<String, MyMessage>) context.getStore("messages-store");
    }

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
                        TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBefore(LocalDate.now())) {
            // Do the processing
        } else {
            waitingMessages.put(parsedMessage.getId(), parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        KeyValueIterator<String, MyMessage> all = waitingMessages.all();
        while (all.hasNext()) {
            MyMessage message = all.next().getValue();
            // Do the processing and remove the message from the set
        }
    }
}
If you redeploy your task, Samza should recreate the state of the key-value store (Samza keeps the values in a special Kafka topic related to the key-value store). Of course, you need to provide some extra configuration for your store (in the above example, for messages-store).
You could read about key-value store here (for the latest Samza version):
https://samza.apache.org/learn/documentation/0.14/container/state-management.html
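For reference, the store wiring in the job configuration looks roughly like this (the changelog stream name and the MyMessage serde factory are placeholders; check the linked page for the exact keys in your Samza version):
# Key-value store backing waitingMessages in the task above
stores.messages-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.messages-store.key.serde=string
stores.messages-store.msg.serde=my-message-serde
# Changelog stream so the store can be restored after a redeploy
stores.messages-store.changelog=kafka.my-job-messages-store-changelog
# Serde registrations referenced above
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.my-message-serde.class=com.example.MyMessageSerdeFactory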

How can I create a state store that is restorable from an existing changelog topic?

I am using the streams DSL to deduplicate a topic called users:
topology.addStateStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("users"), byteStringSerde, userSerde));
KStream<ByteString, User> users = topology.stream("users", Consumed.with(byteStringSerde, userSerde));
users.transform(() -> new Transformer<ByteString, User, KeyValue<ByteString, User>>() {
    private KeyValueStore<ByteString, User> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<ByteString, User>) context.getStateStore("users");
    }

    @Override
    public KeyValue<ByteString, User> transform(ByteString key, User value) {
        User user = store.get(key);
        if (user != null) {
            store.put(key, value);
            return new KeyValue<>(key, value);
        }
        return null;
    }

    @Override
    public KeyValue<ByteString, User> punctuate(long timestamp) {
        return null;
    }

    @Override
    public void close() {
    }
}, "users");
Given this code, Kafka Streams creates an internal changelog topic for the users store. I am wondering, is there some way I can use the existing users topic instead of creating an essentially identical changelog topic?
PS. I see that StreamsBuilder says this is possible:
However, no internal changelog topic is created since the original input topic can be used for recovery
But following the code to InternalStreamsBuilder#table() and InternalStreamsBuilder#createKTable(), I am not seeing how it's achieving this effect.
Not everything the DSL does is possible at the Processor API level -- it uses some internals that are not part of the public API to achieve what you describe.
It's the call to InternalTopologyBuilder#connectSourceStoreAndTopic() that does the trick (cf. InternalStreamsBuilder#table()).
For your use case of de-duplication, it seems you need two topics though (depending on what de-duplication logic you apply). Restoring via a changelog topic does key-based updates and thus does not consider values (which might be part of your de-duplication logic, too).
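If you can express the store through the DSL (which is where the quoted Javadoc comes from), a sketch of that route, using the serdes and types from your snippet, would be:
StreamsBuilder builder = new StreamsBuilder();

// The "users" store is materialized directly from the users topic, so no
// <application.id>-users-changelog topic is created; the source topic itself
// is replayed on restore.
KTable<ByteString, User> usersTable = builder.table(
        "users",
        Consumed.with(byteStringSerde, userSerde),
        Materialized.<ByteString, User, KeyValueStore<Bytes, byte[]>>as("users")
                .withKeySerde(byteStringSerde)
                .withValueSerde(userSerde));
Depending on the Streams version, you may also need to enable the topology.optimization config for the source-topic reuse to actually kick in; and as noted above, a table-based store gives you key-based restore semantics only.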

Kafka Streams: Appropriate way to find min value in a stream

I'm using Kafka Streams version 0.10.0.1, and trying to find the min value in a stream.
The incoming messages come from a topic called kafka-streams-topic and have a key and the value is a JSON payload that looks like this:
{"value":2334}
This is a simple payload but I want to find the min value of this JSON.
The outgoing message is just a number:
2334
and the key is also part of the message.
So if the incoming topic got:
key=1, value={"value":1000}
outgoing topic, named min-topic, would get
key=1,value=1000
another message comes through:
key=1, value={"value":100}
Because this is the same key, I would now like to produce a message with key=1, value=100, since this is smaller than the first message.
Now let's say we got:
key=2 value=99
A new message would be produced where:
key=2 and value=99, but key=1 and its associated value shouldn't change.
Additionally, if we got the message:
key=1 value=2000
no message should be produced, since this value is larger than the current value of 100.
This works but I'm wondering if this adheres to the intent of the API:
public class MinProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, Long> kvStore;
    private Gson gson = new Gson();

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        this.context.schedule(1000);
        kvStore = (KeyValueStore) context.getStateStore("Counts");
    }

    @Override
    public void process(String key, String value) {
        Long incomingPotentialMin = ((Double) gson.fromJson(value, Map.class).get("value")).longValue();
        Long minForKey = kvStore.get(key);
        System.out.printf("key: %s incomingPotentialMin: %s minForKey: %s \n", key, incomingPotentialMin, minForKey);
        if (minForKey == null || incomingPotentialMin < minForKey) {
            kvStore.put(key, incomingPotentialMin);
            context.forward(key, incomingPotentialMin.toString());
            context.commit();
        }
    }

    @Override
    public void punctuate(long timestamp) {}

    @Override
    public void close() {
        kvStore.close();
    }
}
Here is the code that actually runs the processor:
public class MinLauncher {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        StateStoreSupplier countStore = Stores.create("Counts")
                .withKeys(Serdes.String())
                .withValues(Serdes.Long())
                .persistent()
                .build();
        builder.addSource("source", "kafka-streams-topic")
                .addProcessor("process", () -> new MinProcessor(), "source")
                .addStateStore(countStore, "process")
                .addSink("sink", "min-topic", "process");
        KafkaStreams streams = new KafkaStreams(builder, KafkaStreamsProperties.properties("kafka-streams-min-poc"));
        streams.cleanUp();
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
Not sure what your exact input data and result is (maybe you can update your question with this information: what are your input records? what is your output? what "EXTRA messages [] are produced [] that [you] don't expect"?).
However, a few general clarifications (I can refine this answer later on if required).
You do your computation based on keys, so you should expect a result for each key (not sure if you have multiple different keys in your input).
You emit data in punctuate(), which is called periodically (based on the internally tracked stream time -- i.e., based on the timestamp values extracted from your input records via TimestampExtractor). Hence, you will write the current min value of each key to the topic whenever punctuate() gets called, and therefore you can have multiple updates per key that are all appended to your result topic. (Topics are append-only; if you write two messages with the same key, you see both -- there is no overwrite.)
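To make that concrete, a rough sketch (reusing the Counts store and the 0.10.0.x Processor API from your code) of what emitting from punctuate() would look like -- each punctuation forwards the current minimum of every key again, so the result topic accumulates one record per key per punctuation:
@Override
public void punctuate(long timestamp) {
    KeyValueIterator<String, Long> iter = kvStore.all();
    while (iter.hasNext()) {
        KeyValue<String, Long> entry = iter.next();
        // Forward the current minimum for this key; earlier records for the same
        // key stay in the topic because topics are append-only.
        context.forward(entry.key, entry.value.toString());
    }
    iter.close();
    context.commit();
}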