How to consume the latest records from the topic using Spring Cloud Stream? - apache-kafka

I have a Kafka Streams processor that consumes three topics and tries to merge (join) them on the key. After joining successfully, it does some aggregation and then pushes the results to the target topic.
After the application runs for the first time, it tries to consume all data from those topics. Two of those topics are used like lookup tables, which means I need to consume all of their data from the beginning. But the third topic is my main topic, and for that one I need to consume from the latest offset. My application, however, consumes all topics from the beginning. So I want to consume two topics from the start and one from the latest.
I'm using Spring Cloud Stream with the Kafka Streams binder. Here are my configs and some code snippets:
Application.yaml :
spring.cloud.stream.function.definition: processName;
spring.cloud.stream.kafka.streams.binder.functions.processName.applicationId: myappId
spring.cloud.stream.bindings.processName-in-0.destination: mainTopic
spring.cloud.stream.bindings.processName-in-0.consumer.group: mainTopic-cg
spring.cloud.stream.bindings.processName-in-0.consumer.startOffset: latest
spring.cloud.stream.bindings.processName-in-1.destination: secondTopic
spring.cloud.stream.bindings.processName-in-1.consumer.group: secondTopic-cg
spring.cloud.stream.bindings.processName-in-1.consumer.startOffset: earliest
spring.cloud.stream.bindings.processName-in-2.destination: thirdTopic
spring.cloud.stream.bindings.processName-in-2.consumer.group: thirdTopic-cg
spring.cloud.stream.bindings.processName-in-2.consumer.startOffset: earliest
spring.cloud.stream.bindings.processName-out-0.destination: otputTopics
spring.cloud.stream.kafka.streams.binder.replication-factor: 1
spring.cloud.stream.kafka.streams.binder.configuration.commit.interval.ms: 10000
spring.cloud.stream.kafka.streams.binder.configuration.state.dir: state-store
Streams processor:
public Function<KStream<String, MainTopic>,
        Function<KTable<String, SecondTopic>,
        Function<KTable<String, ThirdTopic>,
        KStream<String, OutputTopic>>>> processName() {
    return mainTopicKStream -> (
        secondTopicTable -> (
            thirdTopicKTable -> (
                aggregateOperations.AggregateByAmount(
                    joinOperations.JoinSecondThirdTopic(mainTopicKStream, secondTopicTable, thirdTopicKTable)
                        .filter((k, v) -> v.IsOk() != 4)
                        .groupByKey(Grouped.with(AppSerdes.String(), AppSerdes.OutputTopic())),
                    TimeWindows.of(Duration.ofMinutes(1)).advanceBy(Duration.ofMinutes(1))
                ).toStream()
            )
        ));
}

A couple of points. When you have a Kafka Streams application using the Spring Cloud Stream binder, you do not need to set group information on the bindings; your applicationId setup is sufficient. Therefore, I suggest removing those three group properties from your configuration.
Another thing is that any consumer-specific binding properties for the Kafka Streams binder need to be set under spring.cloud.stream.kafka.streams.bindings.<binding-name>.consumer.... This is mentioned in this section of the docs. Please change your startOffset configuration accordingly. The same section of the docs also explains the semantics of startOffset in the Kafka Streams binder: the property is honored only when you start the application for the first time, i.e. when there are no committed offsets. The default in that case is earliest, but you can override it to latest using the property. You can materialize the incoming KTables as state stores and thus have access to all the lookup data.
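For example, moving the startOffset settings under the Kafka Streams binder's binding namespace would look roughly like this (a sketch using the binding names from your configuration; the values stay as you had them):
spring.cloud.stream.kafka.streams.bindings.processName-in-0.consumer.startOffset: latest
spring.cloud.stream.kafka.streams.bindings.processName-in-1.consumer.startOffset: earliest
spring.cloud.stream.kafka.streams.bindings.processName-in-2.consumer.startOffset: earliest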

Related

Kafka Consumer and Producer

Can I have the consumer act as a producer (publisher) as well? I have a use case where a consumer (C1) polls a topic and pulls messages. After processing a message and performing a commit, it needs to notify another process to carry on the remaining work. Given this use case, is it a valid design for consumer C1 to publish a message to a different topic, i.e. for C1 to also act as a producer?
Yes, this is a valid use case. We have many production applications that do the same: they consume events from a source topic, perform data enrichment/transformation, and publish the output to another topic for further processing.
Again, the implementation pattern depends on which tech stack you are using, but if you are after a Spring Boot application, you can have a look at https://medium.com/geekculture/implementing-a-kafka-consumer-and-kafka-producer-with-spring-boot-60aca7ef7551
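A rough sketch of that pattern with Spring for Apache Kafka (topic names, group id, and the transformation step are placeholders, not taken from the linked article):
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class ConsumeAndForward {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public ConsumeAndForward(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // C1 consumes from the source topic...
    @KafkaListener(topics = "source-topic", groupId = "c1-group")
    public void onMessage(String message) {
        String enriched = message.toUpperCase(); // placeholder for the real processing
        // ...and acts as a producer for the downstream topic.
        kafkaTemplate.send("downstream-topic", enriched);
    }
}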
Totally valid scenario. For example, you can have a source connector or a producer which simply pushes raw data to a topic.
The receiver is loosely coupled to your publisher, so they cannot communicate with each other directly.
Then you need C1 (a mediator) to consume messages from the source topic, transform the data, and publish the new data format to a different topic.
If you're using a JVM-based client, this is precisely the use case for Kafka Streams rather than the base Consumer/Producer API.
A Kafka Streams application consumes from an input topic and can then map, filter, aggregate, split, etc. into other topics.
https://kafka.apache.org/documentation/streams/
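One possible shape of such a consume-transform-produce topology (a sketch; the topic names and the uppercase step are placeholders):
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class ForwardingApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "c1-forwarder");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from the source topic, transform each record, publish to another topic.
        builder.stream("source-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(value -> value.toUpperCase())   // placeholder for real processing
               .to("downstream-topic", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}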

How to determine that a topic has been read completely by a Kafka Streams application, from the very first offset to the last, from a Java application

I need some help with Kafka Streams. I have started a Kafka Streams application which is streaming one topic from the very first offset. The topic holds a huge amount of data, so I want to implement a mechanism in my application, using Kafka Streams, so that I can get notified when the topic has been read completely, up to the very last offset.
I have read the Kafka Streams 2.8.0 API and found the method allLocalStorePartitionLags, which returns a map of store names to another map of partitions containing the lag information for each partition. This method returns lag information for all store partitions (active or standby) local to this Streams instance. It is quite useful in the case above, when I have one node running the stream application (see the sketch just below).
But in my case the system is distributed: there are 3 application nodes and 10 topic partitions, which means each node has at least 3 partitions of the topic to read from.
I need help here. How can I implement this functionality so that I get notified when the topic has been read completely, from partition 0 to partition 9? Please note that I don't have the option to use a database here as of now.
Other approaches to achieve the goal are also welcome. Thank you.
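For a single node, the allLocalStorePartitionLags map mentioned above can be inspected roughly like this (a sketch; streams is the running KafkaStreams instance):
// org.apache.kafka.streams.LagInfo exposes offsetLag() per local store partition.
Map<String, Map<Integer, LagInfo>> lags = streams.allLocalStorePartitionLags();
long totalLocalLag = lags.values().stream()
        .flatMap(partitionLags -> partitionLags.values().stream())
        .mapToLong(LagInfo::offsetLag)
        .sum();
// Zero lag means this instance has caught up on the partitions it owns,
// but it says nothing about the partitions owned by the other two nodes.
boolean caughtUpLocally = totalLocalLag == 0;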
I was able to get the lag information from the AdminClient API. The code below retrieves the end offsets and the current (committed) offsets for each partition of the topics read by the given streams application, i.e. by applicationId.
AdminClient adminClient = AdminClient.create(kafkaProperties);
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult = adminClient.listConsumerGroupOffsets(applicationId);
// Current (committed) offsets of the streams application's consumer group.
Map<TopicPartition, OffsetAndMetadata> topicPartitionOffsetAndMetadataMap =
        listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata().get();
// All topic partitions the group has committed offsets for.
Set<TopicPartition> topicPartitions = topicPartitionOffsetAndMetadataMap.keySet();
// End (latest) offsets for each of those partitions.
ListOffsetsResult listOffsetsResult = adminClient.listOffsets(topicPartitions.stream()
        .collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())));
Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets = listOffsetsResult.all().get();
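From those two maps the remaining lag per partition can be derived; the topic has been read completely once every partition's lag is zero (a sketch building on the variables above):
// Compare committed offsets with end offsets across all partitions.
boolean topicFullyRead = topicPartitions.stream().allMatch(tp -> {
    long committed = topicPartitionOffsetAndMetadataMap.get(tp).offset();
    long endOffset = endOffsets.get(tp).offset();
    return endOffset - committed <= 0;
});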

Couchbase kafka source connector order of messages

My question is regarding the ordering of messages from Couchbase to a Kafka topic. This is not well documented in my opinion, hence this question.
Use case
The use case is to get the changes for every document and get the latest change for every document. Until the messages reach the Kafka topic it is real time, but the consumer of the Kafka topic can be a batch application as well. This means that I may end up receiving multiple events for the same document in a given batch.
My understanding
A Couchbase bucket has multiple vBuckets.
When a document is inserted into the bucket, it goes into one of the vBuckets (based on the hash of the document key). This means that if a document with the same key is updated, it goes to the same vBucket.
Couchbase streams the change events over DCP so that other applications, like the Kafka source connector, can consume them.
Couchbase guarantees the ordering of events per vBucket; cluster-wide ordering is not guaranteed. This means that if a document with the same key is updated multiple times, the ordering of those events is guaranteed.
Now, when the Kafka source connector reads the DCP events, it reads them in the same order in which they came into the DCP streams.
Ordering within a vBucket is guaranteed. So far so good.
Question
When the Kafka source connector publishes the messages to the Kafka topic, does it maintain the same ordering?
How does the source connector decide the Kafka partition for the messages (assuming there is more than one partition for the topic)?
Note
CouchBase version - 5.5.3
Kafka Source Connector - 3.4.5

Is there any way to use different auto.offset.reset strategies for different input topics in a Kafka Streams app?

The use case is: I have a Kafka Streams app that consumes from an input topic and outputs to an intermediate topic; then, in the same application, another topology consumes from this intermediate topic.
Whenever the application id is updated, both topics start consuming from earliest. I want to change the auto.offset.reset for the intermediate topic to latest while keeping it at earliest for the input topic.
Yes. You can set the reset strategy for each topic via:
// Processor API
topology.addSource(AutoOffsetReset offsetReset, String name, String... topics);
// DSL
builder.stream(String topic, Consumed.with(AutoOffsetReset offsetReset));
builder.table(String topic, Consumed.with(AutoOffsetReset offsetReset));
All those methods have overloads that allow you to set it.
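Concretely, for the scenario above it could look like this (a sketch with placeholder topic names and String serdes):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;

StreamsBuilder builder = new StreamsBuilder();
// Input topic: fall back to the earliest offset when there are no committed offsets.
builder.stream("input-topic",
        Consumed.with(Serdes.String(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));
// Intermediate topic: fall back to the latest offset instead.
builder.stream("intermediate-topic",
        Consumed.with(Serdes.String(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.LATEST));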

Dynamically create and change Kafka topics with Flink

I'm using Flink to read and write data from different Kafka topics.
Specifically, I'm using the FlinkKafkaConsumer and FlinkKafkaProducer.
I'd like to know if it is possible to change the Kafka topics I'm reading from and writing to 'on the fly' based on either logic within my program, or the contents of the records themselves.
For example, if a record with a new field is read, I'd like to create a new topic and start diverting records with that field to the new topic.
Thanks.
If your topics follow a generic naming pattern, for example "topic-n*", your Flink Kafka consumer can automatically read from "topic-n1", "topic-n2", and so on as they are added to Kafka.
Flink 1.5 (FlinkKafkaConsumer09) added support for dynamic partition discovery and topic discovery based on a regex. This means that the Flink Kafka consumer can pick up new Kafka partitions without needing to restart the job, while maintaining exactly-once guarantees.
Consumer constructor that accepts a subscriptionPattern: link.
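For example (a sketch using the legacy FlinkKafkaConsumer; the pattern, group id, and discovery interval are placeholders):
// Inside a main(String[] args) throws Exception.
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "topic-n-reader");
// Check for newly created topics/partitions matching the pattern every 30 seconds.
props.setProperty("flink.partition-discovery.interval-millis", "30000");

// Subscribe to every topic matching "topic-n<number>", including ones created later.
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(
        Pattern.compile("topic-n[0-9]+"),
        new SimpleStringSchema(),
        props);

env.addSource(consumer).print();
env.execute("dynamic-topic-reader");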
Thinking more about the requirement:
1st step - You start with one topic (for simplicity) and spawn more topics at runtime based on the data, directing the respective messages to those topics. This is entirely possible and does not require complicated code: use an admin API (e.g. ZkClient) to check whether the topic name exists; if it does not, create a model topic with the new name and start pushing messages into it through a new producer tied to that topic. You don't need to restart the job to produce messages to a specific topic.
Your initial consumer becomes a producer (for the new topics) plus a consumer (for the old topic).
2nd step - You want to consume messages from the new topics. One way could be to spawn a new job entirely. You can do this by creating a thread pool initially and supplying arguments to it.
Again, be careful with this: too much automation can overload the cluster in the case of a looping bug. Think about the possibility of too many topics being created after some time if the input data is not controlled or is simply dirty. There could be better architectural approaches, as mentioned in the comments above.
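On the producing side, the target topic can also be chosen per record with a KafkaSerializationSchema, which covers diverting records with a new field to a new topic (a sketch; the topic names and routing rule are placeholders, and it assumes the new topics already exist or are auto-created):
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");

// Route each record to a topic derived from its content.
KafkaSerializationSchema<String> routingSchema = (element, timestamp) -> {
    String topic = element.contains("newField") ? "topic-new-field" : "topic-default";
    return new ProducerRecord<>(topic, element.getBytes(StandardCharsets.UTF_8));
};

FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "topic-default",                        // fallback/default topic
        routingSchema,
        props,
        FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);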