Apache Flink: How to sink events to different Kafka topics depending on the event type? - streaming

I was wondering whether it is possible to use the Flink Kafka sink to write events to different topics depending on the type of the event?
Let's say that we have different types of events: notifications, messages and friend requests. We want to stream these events to different topics named notification-topic, messages-topic and friendsRequest-topic.
I have tried many different ways to solve this problem, but still couldn't find the right solution. I heard that I could use a ProcessFunction, but how does it relate to my problem?

In case you are using Kafka:
FlinkKafkaProducer011<Event> producer = new FlinkKafkaProducer011<>(
        "default.topic",
        new KeyedSerializationSchema<Event>() {
            @Override
            public byte[] serializeKey(Event element) {
                return null; // or serialize element.getKey() to bytes
            }

            @Override
            public byte[] serializeValue(Event element) {
                return element.toBytes(); // serialize the event payload
            }

            @Override
            public String getTargetTopic(Event element) {
                return element.getTopic();
            }
        },
        parameterTool.getProperties());

input.addSink(producer);
It will call getTargetTopic for every event to get the topic to which the event should be routed; the returned topic overrides the "default.topic".
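For example, a minimal sketch of the event side (the Event class, its EventType field and the getTopic() mapping below are assumptions for illustration, not part of the question's code), mapping each event type to the topics named in the question:
// Hypothetical event that carries its own routing decision.
public class Event {

    public enum EventType { NOTIFICATION, MESSAGE, FRIEND_REQUEST }

    private final EventType type;
    private final byte[] payload;

    public Event(EventType type, byte[] payload) {
        this.type = type;
        this.payload = payload;
    }

    public byte[] toBytes() {
        return payload; // already-serialized event payload
    }

    // Maps the event type to the topic names from the question.
    public String getTopic() {
        switch (type) {
            case NOTIFICATION:   return "notification-topic";
            case MESSAGE:        return "messages-topic";
            case FRIEND_REQUEST: return "friendsRequest-topic";
            default:             return "default.topic";
        }
    }
}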

Related

Kafka Processor API with exactly once semantic for each record processed

Scenario:
We are using the Kafka Processor API (not the DSL) for reading records from a source topic; the stream processor will write records to one or more target topics.
We know exactly-once can be enabled for the entire processor by using:
props.put("isolation.level", "read_committed");
But we want to decide, based on the incoming record's key, whether we want exactly-once or at-least-once semantics.
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class StreamRouterProcessor implements Processor<String, String> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void process(String eventName, String eventMessage) { // called for each record
    }

    @Override
    public void close() {
    }
}
Is there a way to select exactly-once or at-least-once on the fly for each record being processed (perhaps per record handled by the process() method above)?
To enable exactly-once semantics you should use the StreamsConfig.PROCESSING_GUARANTEE_CONFIG property. ConsumerConfig.ISOLATION_LEVEL_CONFIG (isolation.level) is a consumer config and should only be used with a raw Consumer.
It is not possible to choose the processing guarantee (exactly-once or at-least-once) per message.
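As a minimal sketch (the application id and bootstrap servers below are placeholders), the guarantee is set once for the whole topology:
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-router-app");      // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder
// Applies to the entire topology; there is no per-record override.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);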

How to handle backpressure when using Reactor Kafka?

I'm using Reactor Kafka to both consume and produce Kafka events. When consuming events my consumer is slow, and therefore I need to handle backpressure.
However, I find that no matter what value I call Subscription.request() with, the publisher publishes all events from the topic immediately, therefore overwhelming the consumer.
To do this I'm using a custom Subscriber that requests a small initial number of items via Subscription.request() when it subscribes to KafkaReceiver.receive(). To my understanding this is how I tell the publisher how many events my consumer initially wants.
My subscriber:
public class KafkaEventSubscriber extends BaseSubscriber<EnrichedMetadata> {

    private final int numberOfItemsToRequestOnSubscribe;
    private final int numberOfItemsToRequestOnNext;

    public KafkaEventSubscriber(int numberOfItemsToRequestOnSubscribe,
                                int numberOfItemsToRequestOnNext) {
        this.numberOfItemsToRequestOnSubscribe = numberOfItemsToRequestOnSubscribe;
        this.numberOfItemsToRequestOnNext = numberOfItemsToRequestOnNext;
    }

    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        subscription.request(numberOfItemsToRequestOnSubscribe);
    }

    @Override
    protected void hookOnNext(EnrichedMetadata value) {
        request(numberOfItemsToRequestOnNext);
    }
}
How I use the subscriber:
kafkaReceiver.receive()
        .map(ReceiverRecord::value)
        .map(KafkaConsumer::acknowledge)
        .subscribe(new KafkaEventSubscriber(10, 1));
I expect the KafkaReceiver to emit 10 events before any call to the subscriber's onNext() method is made, but instead it emits all events from the topic that have not already been acknowledged.
No matter what value I pass to Subscription.request(), the publisher publishes all events from the topic immediately, not respecting the backpressure measures I've taken.

How to subscribe to multiple topics using the @KafkaListener annotation

I am using the Kafka message broker to publish and subscribe to events, using the Spring infrastructure. My requirement is to create one consumer that subscribes to multiple topics.
The following code works perfectly fine when it subscribes to a single topic.
@KafkaListener(topics = "com.customer.nike")
public void receive(String payload) {
    LOGGER.info("received payload='{}'", payload);
}
But I want it to subscribe to some pattern of topics, like:
@KafkaListener(topics = "com.customer.*.nike")
public void receive(String payload) {
    LOGGER.info("received payload='{}'", payload);
}
In this code the * will keep changing; it may be some numeric value like 1000, 1010 and so on. I also tried SpEL for this:
@KafkaListener(topics = "#{com.customer.*.nike}")
public void receive(String payload) {
    LOGGER.info("received payload='{}'", payload);
}
But this one is also not working for me.
Could someone help me subscribe to multiple topics?
Thanks in advance.
I use @KafkaListener(topics = "#{'${kafka.topics}'.split(',')}"), where kafka.topics is taken from my property file and contains the comma-separated topics my listener should listen to.
But maybe, during start-up, you can add logic that generates all the possible topics and assigns them to a variable, which can then be used as above (see the sketch below).
Update: a wildcard is possible, as Alexandre commented below.
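A minimal sketch of that start-up idea (the topicProvider bean and its buildTopics() method are made-up names for illustration, not an existing API):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

@Component("topicProvider")
public class TopicProvider {

    // Hypothetical: build every possible topic name at start-up,
    // e.g. com.customer.1000.nike, com.customer.1010.nike, ...
    public String[] buildTopics() {
        List<String> topics = new ArrayList<>();
        for (int customerId : Arrays.asList(1000, 1010)) {
            topics.add("com.customer." + customerId + ".nike");
        }
        return topics.toArray(new String[0]);
    }
}
The listener can then reference the bean through SpEL:
@KafkaListener(topics = "#{@topicProvider.buildTopics()}")
public void receive(String payload) {
    LOGGER.info("received payload='{}'", payload);
}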
Regarding the subscription to multiple topics, you can use topicPattern to achieve that:
The topic pattern for this listener. The entries can be 'topic
pattern', a 'property-placeholder key' or an 'expression'. The
framework will create a container that subscribes to all topics
matching the specified pattern to get dynamically assigned partitions.
The pattern matching will be performed periodically against topics
existing at the time of check. An expression must be resolved to the
topic pattern (String or Pattern result types are supported).
Mutually exclusive with topics() and topicPartitions().
@KafkaListener(topicPattern = "com.customer.*")
public void receive(String payload) {
    LOGGER.info("received payload='{}'", payload);
}
Regarding programmatic access to the topic name, you can use an @Header-annotated parameter to extract a specific header value, defined by KafkaHeaders, which in your case is RECEIVED_TOPIC:
The header containing the topic from which the message was received.
@KafkaListener(topics = "com.customer.nike")
public void receive(String payload, @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
    LOGGER.info("received payload='{}'", payload);
    LOGGER.info("received from topic: {}", topic);
}
If we have to fetch multiple topics from the application.properties file:
@KafkaListener(topics = { "${spring.kafka.topic1}", "${spring.kafka.topic2}" })
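That assumes entries along these lines in application.properties (the topic values are illustrative):
spring.kafka.topic1=com.customer.1000.nike
spring.kafka.topic2=com.customer.1010.nike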

Detecting abandoned processes in Kafka Streams 2.0

I have this case: users collect orders as order lines. I implemented this with a Kafka topic containing order-change events; they are merged, stored in a local key-value store and broadcast to a second topic as order versions.
I need to somehow react to abandoned orders, i.e. ones that were started but have had no change for at least the last x hours.
A simple solution could be to scan the local store every y minutes and post an order-status-change event setting the status to Abandoned. It seems I cannot access the store from outside a processor... and it is also not very elegant coding. Any suggestions are welcome.
--edit
I cannot just add punctuation to the merge/validation transformer, because its output is different and should be routed elsewhere, as in this image (a single Kafka Streams app):
So the "abandoned orders processor/transformer" will be a no-op for its input (the only trigger here is time). Another issue is that in such a case (as in the image) my transformer gets a ForwardingDisabledProcessorContext upon initialization, so I cannot emit any messages from the punctuator. I could just pass in a kafkaTemplate bean and produce new messages directly, but then the whole processor/transformer is just an empty shell used only to access the local store...
This is a snippet of the code I used:
public class AbandonedOrdersTransformer implements ValueTransformer<OrderEvent, OrderEvent> {

    private ProcessorContext context;
    private KeyValueStore<String, Order> stateStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext processorContext) {
        this.context = processorContext;
        this.stateStore = (KeyValueStore<String, Order>) processorContext.getStateStore(KafkaConfig.OPENED_ORDERS_STORE);

        // main scheduler
        this.context.schedule(TimeUnit.MINUTES.toMillis(5), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
            KeyValueIterator<String, Order> iter = this.stateStore.all();
            while (iter.hasNext()) {
                KeyValue<String, Order> entry = iter.next();
                if (OrderStatuses.NEW.equals(entry.value.getStatus()) &&
                        (timestamp - entry.value.getLastChanged().getTime()) > TimeUnit.HOURS.toMillis(4)) {
                    // SEND ABANDON EVENT "event"
                    context.forward(entry.key, event);
                }
            }
            iter.close();
            context.commit();
        });
    }

    @Override
    public OrderEvent transform(OrderEvent orderEvent) {
        // do nothing
        return null;
    }

    @Override
    public void close() {
        // do nothing
    }
}
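For context, a rough sketch of how such a transformer could be wired into the topology. The topic names, the Order serde and the store builder below are assumptions; the sketch also assumes the class is converted to a Transformer<String, OrderEvent, KeyValue<String, OrderEvent>> and attached with transform(), because transformValues() supplies the ForwardingDisabledProcessorContext mentioned above and disallows forwarding from the punctuator:
StreamsBuilder builder = new StreamsBuilder();

// Register the open-orders store (the Order serde is an assumption).
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(KafkaConfig.OPENED_ORDERS_STORE),
        Serdes.String(),
        orderSerde));

// Hypothetical topic names; transform() (not transformValues()) lets the
// punctuator call context.forward().
builder.<String, OrderEvent>stream("order-changes-topic")
        .transform(AbandonedOrdersTransformer::new, KafkaConfig.OPENED_ORDERS_STORE)
        .to("abandoned-orders-topic");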

Samza: Delay processing of messages until timestamp

I'm processing messages from a Kafka topic with Samza. Some of the messages come with a timestamp in the future and I'd like to postpone the processing until after that timestamp. In the meantime, I'd like to keep processing other incoming messages.
What I tried to do is have my Task queue the messages and implement WindowableTask so that it periodically checks whether the messages' timestamps allow them to be processed. The basic idea looks like this:
public class MyTask implements StreamTask, WindowableTask {

    private HashSet<MyMessage> waitingMessages = new HashSet<>();

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.add(parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        for (MyMessage message : waitingMessages) {
            if (message.getValidFromDateTime().isBeforeNow()) {
                // Do the processing and remove the message from the set
            }
        }
    }
}
This obviously has some downsides. I'd be losing my waiting messages in memory when I redeploy my task. So I'd like to know the best practice for delaying the processing of messages with Samza. Do I need to reemit the messages to the same topic again and again until I can finally process them? We're talking about delaying the processing for a few minutes up to 1-2 hours here.
It's important to keep in mind, when dealing with message queues, that they perform a very specific function in a system: they hold messages while the processor(s) are busy processing preceding messages. A properly functioning message queue is expected to deliver messages on demand, which implies that as soon as a message reaches the head of the queue, the next pull on the queue will yield the message.
Notice that delay is not a configurable part of the equation. Instead, delay is an output variable of a system with a queue. In fact, Little's Law offers some interesting insights into this.
So, in a system where a delay is necessary (for example, to join/wait for a parallel operation to complete), you should be looking at other methods. Typically a queryable database would make sense in this particular instance. If you find yourself keeping messages in a queue for a pre-set period of time, you're actually using the message queue as a database - a function it was not designed to provide. Not only is this risky, but it also has a high likelihood of hurting the performance of your message broker.
I think you could use Samza's key-value store to keep the state of your task instance instead of an in-memory Set.
It should look something like this:
public class MyTask implements StreamTask, WindowableTask, InitableTask {

    private KeyValueStore<String, MyMessage> waitingMessages;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) throws Exception {
        this.waitingMessages = (KeyValueStore<String, MyMessage>) context.getStore("messages-store");
    }

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
                        TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBefore(LocalDate.now())) {
            // Do the processing
        } else {
            waitingMessages.put(parsedMessage.getId(), parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        KeyValueIterator<String, MyMessage> all = waitingMessages.all();
        while (all.hasNext()) {
            MyMessage message = all.next().getValue();
            // Do the processing and remove the message from the store
        }
        all.close(); // release the iterator when done
    }
}
If you redeploy your task, Samza should recreate the state of the key-value store (Samza keeps the values in a special Kafka changelog topic associated with the key-value store). You of course need to provide some extra configuration for your store (in the above example, for messages-store); see the sketch at the end.
You could read about key-value store here (for the latest Samza version):
https://samza.apache.org/learn/documentation/0.14/container/state-management.html
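A minimal sketch of that store configuration in the job's properties file (the factory, serde names and changelog topic below are typical choices, not taken from the question):
stores.messages-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.messages-store.key.serde=string
stores.messages-store.msg.serde=json
stores.messages-store.changelog=kafka.messages-store-changelog

# The serde names above must be registered as well:
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory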