I am learning Kafka Streams and have written a simple app; a snippet is below:
MainApp:
Topology topology = new Topology();
topology.addSource("SOURCE", "source-topic");
topology.addProcessor("Processor1", () -> new Processor1(), "SOURCE");
topology.addProcessor("Processor2", () -> new Processor2(), "Processor1");
topology.addProcessor("Processor3", () -> new Processor3(), "Processor2");
topology.addSink("SINK", "sink-topic", "Processor3");
KafkaStreams streams = new KafkaStreams(topology, config);
streams.start();
Snippet of an individual stream processor:
public class Processor1 implements Processor<String, String> {
    private ProcessorContext context; // set in init(ProcessorContext), omitted here
    // Rest of code

    @Override
    public void process(String key, String value) {
        System.out.println("Inside Processor1#process() method");
        context.forward(key, value);
    }
}
I understand that we need to create a Topology and then, to start it, we invoke streams.start();
What I am not able to understand is how the process() method is invoked automatically, and who calls it?
The processor's process() method is invoked automatically by the ProcessorContextImpl class on each incoming message for the specific topology node.
For the topology you built: when a message arrives at the input topic, the SOURCE node consumes it and forwards (propagates) the message to its child node by internally calling the forward() method (you can debug or look at the code of the class ProcessorContextImpl). In your case, the SOURCE node forwards the key and value to the child node Processor1. After that, the process() method of class Processor1 is triggered. When the code reaches context.forward(), the message is forwarded to the next child node, Processor2. The message then propagates to Processor3 and the SINK node in the same way, and finally it is produced to the outbound topic. This pipeline executes on a single thread for a given message (and with the default config num.stream.threads = 1, all messages are processed on a single thread per application instance).
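If you want more than one such pipeline running in parallel, that is controlled purely by configuration. A minimal sketch (the application id and broker address below are placeholder values):
Properties config = new Properties();
// placeholder application id and broker address
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// run two stream threads; each thread drives its own set of tasks (source -> processors -> sink)
config.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);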
Related
I was analysing a problem of creating a generic consumer library that can be deployed in multiple microservices (all of them are Spring based). The requirement is to listen to around 15-20 topics. If we use the annotation-based Kafka listener, we need to add more code for each microservice. Is there any way we can create the consumers dynamically based on an XML file, where each consumer has these values injected:
topic
groupid
partition
filter (if any)
With annotations, the design is very rigid. The only way I can think of is to create message listeners after parsing the XML config, where each topic gets its own ConcurrentMessageListenerContainer.
Is there a better alternative approach available using Spring?
P.S.: I am a little new to Spring and Kafka. Please let me know if there is any confusion in explaining the requirements.
Thanks,
Rajasekhar
Maybe you can use topic patterns. Take a look at the consumer properties. E.g. the listener
@KafkaListener(topicPattern = "topic1|topic2")
will listen to topic1 and topic2.
If you need to create a listener dynamically, extra care must be taken, because you must shut it down yourself.
I would use an approach similar to Spring's KafkaListenerAnnotationBeanPostProcessor. This post processor is responsible for processing @KafkaListener annotations.
Here is a proposal of how it could work:
public class DynamicEndpointRegistrar {

    private BeanFactory beanFactory;
    private KafkaListenerContainerFactory<?> containerFactory;
    private KafkaListenerEndpointRegistry endpointRegistry;
    private MessageHandlerMethodFactory messageHandlerMethodFactory;

    public DynamicEndpointRegistrar(BeanFactory beanFactory, KafkaListenerContainerFactory<?> containerFactory,
            KafkaListenerEndpointRegistry endpointRegistry, MessageHandlerMethodFactory messageHandlerMethodFactory) {
        this.beanFactory = beanFactory;
        this.containerFactory = containerFactory;
        this.endpointRegistry = endpointRegistry;
        this.messageHandlerMethodFactory = messageHandlerMethodFactory;
    }

    public void registerMethodEndpoint(String endpointId, Object bean, Method method, Properties consumerProperties,
            String... topics) throws Exception {

        KafkaListenerEndpointRegistrar registrar = new KafkaListenerEndpointRegistrar();
        registrar.setBeanFactory(beanFactory);
        registrar.setContainerFactory(containerFactory);
        registrar.setEndpointRegistry(endpointRegistry);
        registrar.setMessageHandlerMethodFactory(messageHandlerMethodFactory);

        MethodKafkaListenerEndpoint<Integer, String> endpoint = new MethodKafkaListenerEndpoint<>();
        endpoint.setBeanFactory(beanFactory);
        endpoint.setMessageHandlerMethodFactory(messageHandlerMethodFactory);
        endpoint.setId(endpointId);
        endpoint.setGroupId(consumerProperties.getProperty(ConsumerConfig.GROUP_ID_CONFIG));
        endpoint.setBean(bean);
        endpoint.setMethod(method);
        endpoint.setConsumerProperties(consumerProperties);
        endpoint.setTopics(topics);

        registrar.registerEndpoint(endpoint);
        registrar.afterPropertiesSet();
    }
}
You should then be able to register a listener dynamically. E.g.
DynamicEndpointRegistrar dynamicEndpointRegistrar = ...;
MyConsumer myConsumer = ...; // create an instance of your consumer
Properties properties = ...; // consumer properties
// the method that should be invoked
// (the method that's normally annotated with KafkaListener)
Method method = MyConsumer.class.getDeclaredMethod("consume", String.class);
dynamicEndpointRegistrar.registerMethodEndpoint("endpointId", myConsumer, method, properties, "topic");
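If the topics, group ids, etc. come from an XML file as described in the question, you could loop over the parsed entries and register one endpoint per topic. A rough sketch (TopicConfig, parseXmlConfig and MyConsumer#consume are made-up names for illustration):
// hypothetical: TopicConfig and parseXmlConfig stand for your own XML parsing code
for (TopicConfig cfg : parseXmlConfig("consumers.xml")) {
    Properties consumerProps = new Properties();
    consumerProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, cfg.getGroupId());
    Method method = MyConsumer.class.getDeclaredMethod("consume", String.class);
    // use the topic name both as the endpoint id and as the subscribed topic
    dynamicEndpointRegistrar.registerMethodEndpoint(cfg.getTopic(), new MyConsumer(), method, consumerProps, cfg.getTopic());
}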
I have a microservice that performs some stateful processing. The application constructs a KStream from an input topic, does some stateful processing, then writes data into the output topic.
I will be running 3 of these applications in the same group. There are 3 parameters that I need to store so that, in the event a microservice goes down, the microservice that takes over can query the shared state store and continue where the crashed service left off.
I am thinking of pushing these 3 parameters into a state store and querying the data when the other microservice takes over. From my research, I have seen a lot of examples where people perform event counting using a state store, but that's not exactly what I want. Does anyone know an example of, or the right approach for, this problem?
So you want to do 2 things:
a. The service going down has to store the parameters:
If you want to do it in a straightforward way, then all you have to do is write a message to the topic associated with the state store (the one you are reading with a KTable). Use the Kafka Producer API or a KStream (it could be kTable.toStream()) to do it and that's it.
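For example, with the plain Producer API it could be as little as the sketch below (the topic name, key, payload and serializers are assumptions; adjust them to your parameter class):
// minimal sketch: publish the parameters to the topic backing the KTable/state store
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
    // "parametersTopicName", the key and the JSON payload are placeholders
    producer.send(new ProducerRecord<>("parametersTopicName", "processing-state",
            "{\"param1\":\"...\",\"param2\":\"...\",\"param3\":\"...\"}"));
}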
Otherwise you could create a state store manually:
// take these serde as just an example
Serde<String> keySerde = Serdes.String();
Serde<String> valueSerde = Serdes.String();
KeyValueBytesStoreSupplier storeSupplier = Stores.inMemoryKeyValueStore(stateStoreName);
streamsBuilder.addStateStore(Stores.keyValueStoreBuilder(storeSupplier, keySerde, valueSerde));
then use it in a transformer or processor to add items to it; you'll have to declare this in the transformer/processor:
// depending on the serde above you might have something else then String
private KeyValueStore<String, String> stateStore;
and initialize the stateStore variable:
@Override
public void init(ProcessorContext context) {
    stateStore = (KeyValueStore<String, String>) context.getStateStore(stateStoreName);
}
and later use the stateStore variable:
@Override
public KeyValue<String, String> transform(String key, String value) {
    // using stateStore among other actions you might take here
    String processedValue = value; // placeholder for whatever processing you apply
    stateStore.put(key, processedValue);
    return KeyValue.pair(key, processedValue);
}
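For completeness, the transformer also has to be attached to the topology together with the store name it uses; a rough sketch (ParamsTransformer, the topic names and the serdes are assumptions, not from the original code):
// hedged sketch: wire the transformer to the store declared via addStateStore(...)
KStream<String, String> input = streamsBuilder.stream("input-topic", Consumed.with(keySerde, valueSerde));
input.transform(() -> new ParamsTransformer(stateStoreName), stateStoreName)
     .to("output-topic", Produced.with(keySerde, valueSerde));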
b. Read the parameters in the service taking over:
You could do it with a plain Kafka consumer, but with Kafka Streams you first have to make the store available. The easiest way is to create a KTable, then get the queryable store name that is automatically created with the KTable, then actually get access to the store, and finally extract a record value from the store (i.e. a parameter value by its key).
// this example is a modified copy of KTable javadocs example
final StreamsBuilder streamsBuilder = new StreamsBuilder();
// By creating a KTable over the topic containing your parameters, a store is automatically created.
//
// The serde for your MyParametersClassType could be
// new org.springframework.kafka.support.serializer.JsonSerde(MyParametersClassType.class)
// though further configurations might be necessary here - e.g. setting the trusted packages for the ObjectMapper behind JsonSerde.
//
// If the parameter-value class is a String then you could use Serdes.String() instead of a MyParametersClassType serde.
final KTable<String, MyParametersClassType> paramsTable = streamsBuilder.table("parametersTopicName", Consumed.with(Serdes.String(), <<your MyParametersClassType serde>>));
...
// see the example from KafkaStreams javadocs for more KafkaStreams related details
final KafkaStreams streams = ...;
streams.start();
...
// get the queryable store name that is automatically created with the KTable
final String queryableStoreName = paramsTable.queryableStoreName();
// get access to the store
ReadOnlyKeyValueStore<String, ValueAndTimestamp<MyParametersClassType>> view =
        streams.store(queryableStoreName, QueryableStoreTypes.timestampedKeyValueStore());
// extract a record value from the store (the timestamped store returns a ValueAndTimestamp wrapper)
MyParametersClassType parameter = view.get(key).value();
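One practical note: while the Streams instance is still starting up or rebalancing, streams.store(...) can throw InvalidStateStoreException, so it is common to retry until the store becomes queryable. A minimal sketch:
// hedged sketch: retry until the state store becomes queryable
ReadOnlyKeyValueStore<String, ValueAndTimestamp<MyParametersClassType>> store = null;
while (store == null) {
    try {
        store = streams.store(queryableStoreName, QueryableStoreTypes.timestampedKeyValueStore());
    } catch (InvalidStateStoreException e) {
        try {
            Thread.sleep(100);   // store not ready yet (e.g. startup or rebalance in progress)
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
        }
    }
}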
I am trying to achieve concurrent processing of Kafka Topic-Partitions using Reactor Kafka with auto-acknowledgement. The documentation here makes it seem like this is possible:
http://projectreactor.io/docs/kafka/milestone/reference/#concurrent-ordered
The only difference between that and what I am attempting is that I am using auto-acknowledgement.
I have the following code (relevant method is receiveAuto):
public class KafkaFluxFactory<K, V> {

    private final Map<String, Object> properties;

    public KafkaFluxFactory(Map<String, Object> properties) {
        this.properties = properties;
    }

    public Flux<ConsumerRecord<K, V>> receiveAuto(Collection<String> topics, Scheduler scheduler) {
        return KafkaReceiver.create(ReceiverOptions.create(properties).subscription(topics))
                .receiveAutoAck()
                .flatMap(flux -> flux.groupBy(this::extractTopicPartition))
                .flatMap(topicPartitionFlux -> topicPartitionFlux.publishOn(scheduler));
    }

    private TopicPartition extractTopicPartition(ConsumerRecord<K, V> record) {
        return new TopicPartition(record.topic(), record.partition());
    }
}
When I use this to create a Flux of Consumer Records from Kafka with a parallel Scheduler (Schedulers.newParallel("debug", 10)), I see that they all end up getting processed on the same Thread.
Any thoughts on what I may be doing wrong?
After quite a bit of trial and error, plus some rethinking of what I want to accomplish, I realized I was trying to solve two problems in one bit of code.
The two things I need are:
In-order processing of Kafka Partitions
Ability to parallelize the processing of each partition
In trying to solve both with this piece of code, I was limiting downstream users' ability to configure the level of parallelization. I therefore changed the method to return a Flux of GroupedFluxes, which gives downstream users the right granularity for deciding what to parallelize:
public Flux<GroupedFlux<TopicPartition, ConsumerRecord<K, V>>> receiveAuto(Collection<String> topics) {
    return KafkaReceiver.create(createReceiverOptions(topics))
            .receiveAutoAck()
            .flatMap(flux -> flux.groupBy(this::extractTopicPartition));
}
Downstream, users are able to parallelize each emitted GroupedFlux using whatever Scheduler they wish:
public <V> void work(Flux<GroupedFlux<TopicPartition, V>> flux) {
    flux.doOnNext(groupPublisher -> groupPublisher
                .publishOn(Schedulers.elastic())
                .subscribe(this::doWork))
        .subscribe();
}
This has the desired behavior: each TopicPartition GroupedFlux is processed in order, and in parallel with the other GroupedFluxes.
I guess it executes sequentially, at least in your consumer. To do parallel consumption you should convert your Flux to a ParallelFlux:
public ParallelFlux<ConsumerRecord<K, V>> receiveAuto(Collection<String> topics, Scheduler scheduler) {
    return KafkaReceiver.create(ReceiverOptions.<K, V>create(properties).subscription(topics))
            .receiveAutoAck()
            .flatMap(flux -> flux)      // flatten the per-poll batches into a single Flux of records
            .parallel()
            .runOn(scheduler);          // distribute the records across parallel rails on the given scheduler
}
Then, in your consumer, if you want to consume in a parallel way you should use a method such as:
void subscribe(Consumer<? super T> onNext, Consumer<? super Throwable> onError, Runnable onComplete, Consumer<? super Subscription> onSubscribe)
or any other overloaded method with a Consumer<? super T> onNext argument.
If you just use the method below, you will consume the flux sequentially:
void subscribe(Subscriber<? super T> s)
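Putting it together, parallel consumption could look like the sketch below (topics and doWork are placeholders for your own topic list and record handling):
// each rail of the ParallelFlux invokes onNext independently
receiveAuto(topics, Schedulers.parallel())
        .subscribe(record -> doWork(record),                                 // onNext, runs on the parallel rails
                error -> System.err.println("Consuming failed: " + error));  // onError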
I'm working with Apache Kafka and I've been experimenting with the Kafka Streams functionality.
What I'm trying to achieve is very simple, at least in words, and it can be achieved easily with the regular plain Consumer/Producer approach:
Read from a dynamic list of topics
Do some processing on the message
Push the message to another topic whose name is computed based on the message content
Initially I thought I could create a custom Sink or inject some kind of endpoint resolver in order to programmatically define the topic name for each single message, although ultimately I couldn't find any way to do that.
So I dug into the code and found the ProducerInterceptor interface, which is (quoting from the JavaDoc):
A plugin interface that allows you to intercept (and possibly mutate)
the records received by the producer before they are published to the
Kafka cluster.
And its onSend method:
This is called from KafkaProducer.send(ProducerRecord) and
KafkaProducer.send(ProducerRecord, Callback) methods, before key and
value get serialized and partition is assigned (if partition is not
specified in ProducerRecord).
It seemed like the perfect solution for me, as I can effectively return a new ProducerRecord with the topic name I want.
However, apparently there's a bug (I've opened an issue on their JIRA: KAFKA-4691): that method is called after the key and value have already been serialized.
Bummer, as I don't think doing an additional deserialization at this point is acceptable.
My question to you, more experienced and knowledgeable users, is for your input, ideas, and any suggestions on what would be an efficient and elegant way of achieving this.
Thanks in advance for your help/comments/suggestions/ideas.
Below are some code snippets of what I've tried:
public static void main(String[] args) throws Exception {
    StreamsConfig streamingConfig = new StreamsConfig(getProperties());
    StringDeserializer stringDeserializer = new StringDeserializer();
    StringSerializer stringSerializer = new StringSerializer();
    MyObjectSerializer myObjectSerializer = new MyObjectSerializer();
    TopologyBuilder topologyBuilder = new TopologyBuilder();
    topologyBuilder.addSource("SOURCE", stringDeserializer, myObjectSerializer, Pattern.compile("input-.*"))
            .addProcessor("PROCESS", MyCustomProcessor::new, "SOURCE");
    System.out.println("Starting PurchaseProcessor Example");
    KafkaStreams streaming = new KafkaStreams(topologyBuilder, streamingConfig);
    streaming.start();
    System.out.println("Now started PurchaseProcessor Example");
}

private static Properties getProperties() {
    Properties props = new Properties();
    .....
    .....
    props.put(StreamsConfig.producerPrefix(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG), "com.test.kafka.streams.OutputTopicRouterInterceptor");
    return props;
}
OutputTopicRouterInterceptor onSend implementation:
@Override
public ProducerRecord<String, MyObject> onSend(ProducerRecord<String, MyObject> record) {
    MyObject obj = record.value();
    String topic = computeTopicName(obj);
    ProducerRecord<String, MyObject> newRecord = new ProducerRecord<String, MyObject>(topic, record.partition(), record.timestamp(), record.key(), obj);
    return newRecord;
}
I have a similar question as this post:
Consume message only once from Topic per listeners running in cluster
When I tried using a queue to publish messages and added an item listener in two different JVMs, I received each message in both of them. I want each message to be received only once in a clustered/distributed environment.
Here's my code snippet:
Publishing of the message:
getQueue().add("some sample message");
I have the same listener configured in two different JVMs which goes like this:
public HazelcastQueueListener() {
    HazelcastInstance instance = HazelcastClient.newHazelcastClient(HazelClientConfig.getClientConfig());
    IQueue<String> queue1 = instance.getQueue("SAMPLEQUEUE");
    queue1.addItemListener(this, false);
}

public static void main(String args[]) {
    HazelcastQueueListener listener = new HazelcastQueueListener();
}

@Override
public void itemAdded(ItemEvent<String> arg0) {
    // TODO Auto-generated method stub
    if (arg0 != null) {
        System.out.println("Item coming out of queue 1" + arg0);
    } else {
        System.out.println("null");
    }
}
You have to poll the queue, like a standard Java BlockingQueue, in order to consume an item only once:
String item = queue1.take();
AFAIK, Hazelcast doesn't support asynchronous consumption from a queue. The ItemListener doesn't consume the item; it only notifies that an item is available.
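A minimal polling sketch (reusing the client setup from the question): because take() removes the item from the queue, each message is consumed by exactly one of the clustered consumers.
HazelcastInstance instance = HazelcastClient.newHazelcastClient(HazelClientConfig.getClientConfig());
IQueue<String> queue = instance.getQueue("SAMPLEQUEUE");
try {
    while (true) {
        String item = queue.take();   // blocks until an item is available, then removes it
        System.out.println("Consumed: " + item);
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();   // stop polling when the thread is interrupted
}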