Does KStream filter consume every message? - apache-kafka

I have used Kafka in the past, but never the streams API. I am tasked with building a scalable service that accepts websocket connections and routes outbound messages from a central topic to the correct session based on user id.
This looks ridiculously simple using KStream<String, Object>. From one online tutorial:
builder.stream(inputTopic, Consumed.with(Serdes.String(), publicationSerde))
       .filter((name, publication) -> "George R. R. Martin".equals(publication.getName()))
       .to(outputTopic, Produced.with(Serdes.String(), publicationSerde));
But does the filter command consume every message from the topic and perform a filter in application space? Or does KStream<K, V> filter(Predicate<? super K,? super V> predicate) contain hooks into the inner workings of Kafka that allow it only to receive messages matching the correct key?
The wording of the KStream<K,V> javadoc seems to suggest the former: "consumed message by message."
If the only purpose of the filter is to consume every message of a topic and throw away those that are not relevant, I could do that by hand.

You are correct - messages need to be deserialized, then inspected against a predicate (in application space)
throw away those that are not relevant, I could do that by hand
Sure, you could, but Kafka Streams has useful methods for defining session windows. Plus, you wouldn't need to define a consumer and producer instance to forward to new topics.
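For comparison, here is a rough sketch of what doing it by hand with the plain Consumer/Producer API would involve; the topic names, String serdes, and the routing predicate are placeholders, not something from the question:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "manual-router");
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
     KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
    consumer.subscribe(List.of("input-topic"));
    while (true) {
        // Every record is fetched and deserialized; the "filter" is just an if statement client-side.
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
            if ("user-42".equals(record.key())) {   // placeholder routing predicate
                producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
            }
        }
    }
}

On top of this you would still own error handling, retries, and graceful shutdown, which is exactly the boilerplate the Streams DSL hides behind filter() and to().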

Related

Kafka Consumer and Producer

Can I have the consumer act as a producer (publisher) as well? I have a use case where a consumer (C1) polls a topic and pulls messages. After processing a message and performing a commit, it needs to notify another process to carry on the remaining work. Given this use case, is it a valid design for consumer C1 to publish a message to a different topic, i.e. for C1 to also act as a producer?
Yes, this is a valid use case. We have many production applications that do the same: they consume events from a source topic, perform data enrichment/transformation, and publish the output to another topic for further processing.
Again, the implementation pattern depends on which tech stack you are using, but if you are after a Spring Boot application, you can have a look at https://medium.com/geekculture/implementing-a-kafka-consumer-and-kafka-producer-with-spring-boot-60aca7ef7551
Totally valid scenario. For example, you can have a source connector or a producer that simply pushes raw data to a topic.
The receiver is loosely coupled to your publisher, so they cannot communicate with each other directly.
You then need C1 (a mediator) to consume messages from the source topic, transform the data, and publish the new data format to a different topic.
If you're using a JVM-based client, this is precisely the use case for Kafka Streams rather than the base Consumer/Producer API.
A Kafka Streams application must consume from an initial topic, and can then convert (map), filter, aggregate, split, etc. into other topics.
https://kafka.apache.org/documentation/streams/
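For a JVM application, the whole consume-transform-produce pipeline then collapses into a short topology. A minimal sketch, assuming String-serialized records; the topic names, application id, and the enrich() call are placeholders, not part of the question:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

StreamsBuilder builder = new StreamsBuilder();
builder.stream("source-topic", Consumed.with(Serdes.String(), Serdes.String()))
       .mapValues(value -> enrich(value))   // hypothetical enrichment/transformation step
       .to("enriched-topic", Produced.with(Serdes.String(), Serdes.String()));

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "c1-mediator");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// The Streams runtime manages the embedded consumer, producer, and offset commits.
new KafkaStreams(builder.build(), props).start();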

What happens to the consumer offset when an error occurs within a custom class in a KStream topology?

I'm aware that you can define a stream-processing Kafka application in the form of a topology that implicitly knows which records have gone through successfully, and can therefore commit the consumer offsets correctly, so that when the microservice is restarted it continues reading the input topic without missing messages.
But what happens when I introduce my own processing classes into the stream? For instance, perhaps I need to submit information from the input records to a web service with a long startup time. So I write my own processor class that accumulates, say, 1000 messages and then submits a batch request to the external service, like this:
KStream<String, Prediction> stream = new StreamsBuilder()
        .stream(inputTopic, Consumed.with(Serdes.String(), new MessageSerde()))
        // talk to web service
        .map((k, v) -> new KeyValue<>("", wrapper.consume(v.getPayload())))
        .flatMapValues((ValueMapper<List<Prediction>, Iterable<Prediction>>) value -> value);

// send downstream
stream.peek((k, v) -> metrics.countOutgoingMessage())
      .to(outputTopic, Produced.with(Serdes.String(), new PredictionSerde()));
Assume that the external service can issue zero, one, or more predictions of some kind for every input, and that my wrapper submits inputs in batches to increase throughput. It seems to me that KStream cannot possibly keep track of which input record corresponds to which output record, and therefore, no matter how it is implemented, it cannot guarantee that the correct consumer offset for the input topic is committed.
So in this paradigm, how can I give the library hints about which messages have been successfully processed? Or failing that, how can I get access to the consumer offset for the topic and perform commits explicitly so that no data loss can occur?
I think you might have a problem if you are using map here. Combining remote calls with a DSL operator is not recommended. You might want to look into the Processor API (see the docs). With ProcessorContext you can forward records or request commits explicitly, which could give you the flexibility you need.
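A rough sketch of that direction, assuming the org.apache.kafka.streams.processor.api types (Kafka Streams 2.7+); the Message and Prediction types come from the question's serdes, while PredictionClient, predictBatch, and the batch size are placeholders standing in for the question's wrapper:

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Buffers input records, submits one batch request to the external service,
// then forwards the results and explicitly requests a commit.
public class BatchingPredictionProcessor implements Processor<String, Message, String, Prediction> {

    private static final int BATCH_SIZE = 1000;   // placeholder batch size

    private final PredictionClient client;        // stand-in for the question's "wrapper"
    private final List<Record<String, Message>> buffer = new ArrayList<>();
    private ProcessorContext<String, Prediction> context;

    public BatchingPredictionProcessor(PredictionClient client) {
        this.client = client;
    }

    @Override
    public void init(ProcessorContext<String, Prediction> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, Message> record) {
        buffer.add(record);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private void flush() {
        List<Prediction> predictions = client.predictBatch(
                buffer.stream().map(r -> r.value().getPayload()).collect(Collectors.toList()));
        long ts = buffer.get(buffer.size() - 1).timestamp();
        for (Prediction p : predictions) {
            context.forward(new Record<>("", p, ts));
        }
        buffer.clear();
        // Ask Streams to commit as soon as possible, now that the batch has been submitted.
        context.commit();
    }

    @Override
    public void close() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }
}

You would wire it in with stream.process(() -> new BatchingPredictionProcessor(client)), which returns a KStream in Kafka 3.3+. One caveat: offsets of records still sitting in the in-memory buffer can be committed on the regular commit interval, so for a hard no-data-loss guarantee the buffer itself would need to live in a state store rather than on the heap.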

Consume all messages of a topic in all instances of a Streams app

In a Kafka Streams app, an instance only gets the messages of an input topic for the partitions that have been assigned to that instance. And since the group.id is derived from the application.id, which is identical for all instances, every instance sees only part of the topic.
This all makes perfect sense of course, and we make use of it for the high-throughput data topic, but we would also like to control the streams application by adding topic-wide "control messages" to the input topic. Since all instances need to get those messages, we would either have to send
one control message per partition (making it necessary for the sender to know about the partitioning scheme, something we would like to avoid), or
one control message per key (so every active partition would get at least one control message).
Because this is cumbersome for the sender, we are thinking about creating a new topic for control messages that the streams application consumes in addition to the data topic. But how can we make it so that every instance receives all messages from that control message topic?
According to https://stackoverflow.com/a/55236780/709537, the group id cannot be set for Kafka Streams.
One way to do this would be to create and use a KafkaConsumer in addition to the Kafka Streams topology, which would allow us to set the group id as we like. However, this sounds complex and dirty enough to make us wonder whether there isn't a more straightforward way that we are missing.
Any ideas?
You can use a global store which sources data from all the partitions.
From the documentation:
"Adds a global StateStore to the topology. The StateStore sources its data from all partitions of the provided input topic. There will be exactly one instance of this StateStore per Kafka Streams instance."
The syntax is as follows:
public StreamsBuilder addGlobalStore(StoreBuilder storeBuilder,
                                     String topic,
                                     Consumed consumed,
                                     ProcessorSupplier stateUpdateSupplier)
The last argument is the ProcessorSupplier, whose get() returns a Processor that will be executed for every new message. That Processor's process() method is invoked each time a new message arrives on the topic.
The global store exists once per Streams instance, so every instance gets all of the topic's data.
In the process(K key, V value), you can write your processing logic.
A global store can be in-memory or persistent. Its input topic effectively acts as its changelog (changelog logging must be disabled for global stores), so even if a streams instance's local state is deleted, the store can be rebuilt from that topic.
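A sketch of what that could look like for a control topic, assuming the processor.api variant of addGlobalStore (Kafka Streams 2.7+); the topic name, store name, and String serdes are placeholders:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();

StoreBuilder<KeyValueStore<String, String>> controlStore =
        Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("control-store"),
                Serdes.String(), Serdes.String())
              .withLoggingDisabled();   // required for global stores; the topic itself is the source of truth

builder.addGlobalStore(
        controlStore,
        "control-topic",
        Consumed.with(Serdes.String(), Serdes.String()),
        () -> new Processor<String, String, Void, Void>() {
            private KeyValueStore<String, String> store;

            @Override
            public void init(ProcessorContext<Void, Void> context) {
                store = context.getStateStore("control-store");
            }

            @Override
            public void process(Record<String, String> record) {
                // Every Streams instance runs this for every control message,
                // regardless of how the data topic's partitions are assigned.
                store.put(record.key(), record.value());
            }
        });

Your regular processors on the data topic can then look up the latest control values from "control-store" (read-only) on every instance.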

Kafka Consumer API vs Streams API for event filtering

Should I use the Kafka Consumer API or the Kafka Streams API for this use case? I have a topic with a number of consumer groups consuming from it. The topic contains one kind of event: a JSON message with a type field buried inside it. Some messages will be consumed by some consumer groups and not by others; one consumer group will probably not consume many messages at all.
My question is:
Should I use the Consumer API, and on each event read the type field and drop or process the event based on it?
Or should I filter using the Streams API's filter method with a predicate?
After I consume an event, the plan is to process it (a DB delete, update, or other action depending on the service); if there is a failure, I will produce to a separate queue and re-process it later.
Thank you.
This seems more a matter of opinion. Personally I would go with Streams/KSQL: likely less code for you to maintain. You can have another intermediary topic that contains the cleaned-up data, to which you can then attach a Connect sink, other consumers, or other Streams and KSQL processes. With Streams you can scale a single application across different machines, store state, have standby replicas, and more, all of which would be a pain to do yourself.
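For illustration, a Streams-based filter on such a type field might look like the following; the topic names, the "order.created" type value, and the use of Jackson to parse the JSON are assumptions, not details from the question:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

ObjectMapper mapper = new ObjectMapper();
StreamsBuilder builder = new StreamsBuilder();

builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
       .filter((key, json) -> {
           try {
               // Keep only the event types this service cares about.
               return "order.created".equals(mapper.readTree(json).path("type").asText());
           } catch (Exception e) {
               return false;   // drop unparsable messages (or route them to a dead-letter topic)
           }
       })
       .to("events-for-this-service", Produced.with(Serdes.String(), Serdes.String()));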

Implement filtering for Kafka messages

I have started using Kafka recently and am evaluating it for a few use cases.
If we want to provide the capability of filtering messages for consumers (subscribers) based on message content, what is the best approach?
Say a topic named "Trades" is exposed by a producer and contains details of different trades, such as market name, creation date, price, etc.
Some consumers are interested in trades for specific markets, others in trades after a certain date, and so on (content-based filtering).
As filtering is not possible on the broker side, what is the best approach for the cases below?
If the filtering criteria are specific to a consumer, should we use a ConsumerInterceptor (even though interceptors are suggested for logging purposes per the documentation)?
If the filtering criteria (content-based filtering) are common among consumers, what should the approach be? Listen to the topic, filter the messages locally, and write them to a new topic (using either an interceptor or Streams)?
If I understand your question correctly, you have one topic and different consumers that are interested in specific parts of it. At the same time, you do not own those consumers and want to avoid them reading the whole topic and doing the filtering themselves?
For this, the only way to go is to build a new application that reads the whole topic, does the filtering (or actually the splitting), and writes the data back into two (or more) different topics. The external consumers would then consume from those new topics and only receive the data they are interested in.
Using Kafka Streams for this purpose would be a very good way to go; the DSL should offer everything you need (see the sketch below).
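For example, a splitting topology along those lines might look like this; the market values, output topic names, tradeSerde, and the getMarket() accessor are placeholders, and split()/Branched require Kafka Streams 2.8+:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

builder.stream("Trades", Consumed.with(Serdes.String(), tradeSerde))
       .split()
       .branch((key, trade) -> "NYSE".equals(trade.getMarket()),
               Branched.withConsumer(ks -> ks.to("Trades-NYSE", Produced.with(Serdes.String(), tradeSerde))))
       .branch((key, trade) -> "LSE".equals(trade.getMarket()),
               Branched.withConsumer(ks -> ks.to("Trades-LSE", Produced.with(Serdes.String(), tradeSerde))))
       .defaultBranch(Branched.withConsumer(ks -> ks.to("Trades-other", Produced.with(Serdes.String(), tradeSerde))));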
As an alternative, you can write your own application using KafkaConsumer and KafkaProducer and do the filtering/splitting manually in your user code. This would not be much different from using Kafka Streams, as a Kafka Streams application does the exact same thing internally; with Streams, however, your effort to get it done would be considerably less.
I would not use interceptors for this. Even if it would work, it does not seem to be a good software design for your use case.
Create your own interceptor class that implements org.apache.kafka.clients.consumer.ConsumerInterceptor, put your logic in its onConsume method, and then set the interceptor.classes config for the consumer.
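If you do go down that road despite the caveats above, a minimal sketch could look like this; the String-typed records and the naive market check are assumptions:

import org.apache.kafka.clients.consumer.ConsumerInterceptor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Drops records the consumer is not interested in before they reach poll().
// Note: the broker still delivers every record, and offsets of dropped records are still committed.
public class MarketFilterInterceptor implements ConsumerInterceptor<String, String> {

    @Override
    public ConsumerRecords<String, String> onConsume(ConsumerRecords<String, String> records) {
        Map<TopicPartition, List<ConsumerRecord<String, String>>> filtered = new HashMap<>();
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<String, String>> kept = new ArrayList<>();
            for (ConsumerRecord<String, String> record : records.records(tp)) {
                if (record.value() != null && record.value().contains("\"market\":\"NYSE\"")) {   // placeholder content check
                    kept.add(record);
                }
            }
            filtered.put(tp, kept);
        }
        return new ConsumerRecords<>(filtered);
    }

    @Override
    public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) { }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}

It would be registered via props.put(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG, MarketFilterInterceptor.class.getName()). Since the interceptor only hides records from poll() rather than reducing what is read from the broker, this is part of why the answer above recommends splitting into dedicated topics instead.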