How can Spring Cloud Stream Kafka consume messages in batches? - apache-kafka

With native Kafka you can call KafkaConsumer.poll() to fetch messages in batches. How do I configure this after integrating spring-cloud-starter-stream-kafka (3.0.10.RELEASE)? I didn't understand the official documentation.
I would appreciate an example, thank you.

See https://docs.spring.io/spring-cloud-stream/docs/3.0.10.RELEASE/reference/html/spring-cloud-stream.html#_batch_consumers
Set the batch-mode consumer binding property and use a Consumer that takes a List, for example:
@Bean
Consumer<List<Foo>> input() {
    return list -> {
        System.out.println(list);
    };
}
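For illustration, a matching application.yaml for the input() bean above might look like the following (the binding name input-in-0 follows the default <functionName>-in-0 convention; the topic, group, and max.poll.records values are placeholders, and the size of each delivered List is ultimately bounded by the underlying Kafka consumer settings such as max.poll.records):
spring.cloud.stream.bindings.input-in-0.destination: my-topic
spring.cloud.stream.bindings.input-in-0.group: my-group
spring.cloud.stream.bindings.input-in-0.consumer.batch-mode: true
spring.cloud.stream.kafka.bindings.input-in-0.consumer.configuration.max.poll.records: 500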

Related

How to consume the latest records from the topic using Spring Cloud Stream?

I have a Kafka Streams processor that consumes three topics and tries to merge (join) them on the key. After joining successfully, it does some aggregation and then pushes the results to the target topic.
After the application runs for the first time, it tries to consume all data from those topics. Two of those topics are used like lookup tables, which means I need to consume all of their data from the beginning. But one of those topics is my main topic, so I need to consume it from the latest offset. However, my application consumes all Kafka topics from the beginning. So I want to consume two topics from the start and one from the latest.
I'm using Spring Cloud Stream, Kafka Streams binder. Here are my configs and some code snippets:
Application.yaml :
spring.cloud.stream.function.definition: processName;
spring.cloud.stream.kafka.streams.binder.functions.processName.applicationId: myappId
spring.cloud.stream.bindings.processName-in-0.destination: mainTopic
spring.cloud.stream.bindings.processName-in-0.consumer.group: mainTopic-cg
spring.cloud.stream.bindings.processName-in-0.consumer.startOffset: latest
spring.cloud.stream.bindings.processName-in-1.destination: secondTopic
spring.cloud.stream.bindings.processName-in-1.consumer.group: secondTopic-cg
spring.cloud.stream.bindings.processName-in-1.consumer.startOffset: earliest
spring.cloud.stream.bindings.processName-in-2.destination: thirdTopic
spring.cloud.stream.bindings.processName-in-2.consumer.group: thirdTopic-cg
spring.cloud.stream.bindings.processName-in-2.consumer.startOffset: earliest
spring.cloud.stream.bindings.processName-out-0.destination: otputTopics
spring.cloud.stream.kafka.streams.binder.replication-factor: 1
spring.cloud.stream.kafka.streams.binder.configuration.commit.interval.ms: 10000
spring.cloud.stream.kafka.streams.binder.configuration.state.dir: state-store
Streams processor:
public Function<KStream<String, MainTopic>,
        Function<KTable<String, SecondTopic>,
        Function<KTable<String, ThirdTopic>,
        KStream<String, OutputTopic>>>> processName() {
    return mainTopicKStream -> (
        secondTopicTable -> (
            thirdTopicKTable -> (
                aggregateOperations.AggregateByAmount(
                    joinOperations.JoinSecondThirdTopic(mainTopicKStream, secondTopicTable, thirdTopicKTable)
                        .filter((k, v) -> v.IsOk() != 4)
                        .groupByKey(Grouped.with(AppSerdes.String(), AppSerdes.OutputTopic())),
                    TimeWindows.of(Duration.ofMinutes(1)).advanceBy(Duration.ofMinutes(1))
                ).toStream()
            )
        ));
}
A couple of points. When you have a Kafka Streams application using the Spring Cloud Stream binder, you do not need to set group information on the bindings; your applicationId setup is sufficient. Therefore, I suggest removing those 3 group properties from your configuration.
Another thing is that any consumer-specific binding properties, when using the Kafka Streams binder, need to be set under spring.cloud.stream.kafka.streams.bindings.<binding-name>.consumer.... This is mentioned in this section of the docs. Please change your startOffset configuration accordingly.
Also, look at the same section of the docs for an explanation of the semantics of startOffset in the Kafka Streams binder. Basically, the start offset property is honored only when you start the application for the first time; by default it is earliest when there are no committed offsets, but you can override it to latest using the property. You can materialize the incoming KTables as state stores and thus have access to all the lookup data.
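For example, based on the binding names in the question, the startOffset entries would move under the Kafka Streams binder namespace like this (the destinations and the rest of the configuration stay as they are):
spring.cloud.stream.kafka.streams.bindings.processName-in-0.consumer.startOffset: latest
spring.cloud.stream.kafka.streams.bindings.processName-in-1.consumer.startOffset: earliest
spring.cloud.stream.kafka.streams.bindings.processName-in-2.consumer.startOffset: earliest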

Spring Cloud 3.1 documentation on how to use KTable in a Spring Boot app

I'm struggling to find any documentation on how I can use Spring Cloud Stream to take a Kafka topic and put it into a KTable.
I have looked at documentation, for example here https://cloud.spring.io/spring-cloud-static/spring-cloud-stream-binder-kafka/3.0.0.RC1/reference/html/spring-cloud-stream-binder-kafka.html#_materializing_ktable_as_a_state_store, but nothing there is very concrete about how to do this in Spring Boot using annotations.
I was hoping I could just create a simple KTable from a KStream, where in my application.properties I have this:
spring.cloud.stream.bindings.process-in-0.destination: my-topic
Then in my configuration I was hoping I could do something like this:
@Bean
public Consumer<KStream<String, String>> process() {
    return input -> input.toTable(Materialized.as("my-store"));
}
Please advise on what I'm missing.
If all you want to do is consume data from a Kafka topic as a KTable, then you can do it as below.
@Bean
public Consumer<KTable<String, String>> process() {
    return input -> {
    };
}
If you want to materialize the table into a named store, then you can add this to the configuration.
spring.cloud.stream.kafka.streams.bindings.process-in-0.consumer.materializedAs: my-store
You could also do what you had in the question, i.e. receive it as a KStream and then convert to KTable. However, if that is all you need to do, you might rather receive it as KTable in the first place as I suggest here.
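If you then need to read from that named store, for example from a web endpoint, the Kafka Streams binder exposes an InteractiveQueryService bean for interactive queries. A rough sketch, assuming the my-store name from above (treat the wiring as illustrative rather than a complete application):
@Autowired
private InteractiveQueryService interactiveQueryService;
public String lookUp(String key) {
    // Retrieve the state store materialized as "my-store" and query it by key
    ReadOnlyKeyValueStore<String, String> store =
            interactiveQueryService.getQueryableStore("my-store", QueryableStoreTypes.keyValueStore());
    return store.get(key);
}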

How to make Spring Cloud Stream consumer in Webflux application?

I have a Webflux based microservice that has a simple reactive repository:
public interface NotificationRepository extends ReactiveMongoRepository<Notification, ObjectId> {
}
Now I would like to extend this microservice to consume event messages from Kafka. These messages/events will then be saved into the database.
For the Kafka listener I used Spring Cloud Stream. I created a simple Consumer and it works great - I'm able to consume the messages and save them into the database.
@Bean
public Consumer<KStream<String, Event>> documents(NotificationRepository repository) {
    return input ->
        input.foreach((key, value) -> {
            LOG.info("Received event, Key: {}, value: {}", key, value);
            repository.save(initNotification(value)).subscribe();
        });
}
But is this the correct way to connect a Spring Cloud Stream consumer to a reactive repository? It doesn't look like it when I have to call subscribe() at the end.
I read the Spring Cloud Stream documentation (for the 3.0.0 release) and it says
Native support for reactive programming - since v3.0.0 we no longer distribute spring-cloud-stream-reactive modules and instead relying on native reactive support provided by spring cloud function. For backward compatibility you can still bring spring-cloud-stream-reactive from previous versions.
and also in this presentation video they mention they have reactive programming support using Project Reactor. So I guess there is a way, I just don't know it. Can you show me how to do it right?
I apologize if this all sounds too stupid but I'm very new to Spring Cloud Stream and reactive programming and haven't found many articles describing this.
Just use Flux as the consumed type, something like this:
@Bean
public Consumer<Flux<Message<Event>>> documents(NotificationRepository repository) {
    return input ->
        input
            .map(message -> /* map the necessary value, e.g. */ message.getPayload().getEventValue())
            .concatMap(value -> repository.save(initNotification(value)))
            .subscribe();
}
If you use a Function with an empty return type (Function<Flux<Message<Event>>, Mono<Void>>) instead of a Consumer, the framework can subscribe automatically. With a Consumer you have to subscribe manually, because the framework has no reference to the stream. But in the Consumer case you are subscribing not to the repository but to the whole stream, which is OK.
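A minimal sketch of that Function variant, reusing the mapping from the Consumer above (getEventValue() and initNotification() are the asker's own helpers), might look like this; Flux.then() yields the Mono<Void> the signature requires and the framework subscribes to it:
@Bean
public Function<Flux<Message<Event>>, Mono<Void>> documents(NotificationRepository repository) {
    return input ->
        input
            .map(message -> message.getPayload().getEventValue())
            .concatMap(value -> repository.save(initNotification(value)))
            .then(); // the framework subscribes to the returned Mono<Void>
}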

Moving millions of documents from Mongo to Kafka

I have 10 million documents in MongoDB which I would like to push to Kafka as JSON, as they are, without any change.
Looking for the best approaches.
1. Use Mongo reactive streams
Flux.from(collection.find())
    .doOnError(e -> {
        e.printStackTrace();
    })
    .doOnComplete(() -> {
        System.out.println("Finished ");
    })
    .subscribe(doc -> {
        // Code to insert into Kafka
    });
2. Use Akka Streams
Is there any other connector available?
Also, do I need to do multithreading inside the subscribe method?
Any other better approach?
You can use Kafka Connect and a MongoDB source connector to replicate data from Mongo to Kafka. Using Kafka Connect is much more flexible, scalable, and simple.
An example configuration would be:
name=mongodb-source-connector
connector.class=org.apache.kafka.connect.mongodb.MongodbSourceConnector
tasks.max=1
uri=mongodb://127.0.0.1:27017
batch.size=100
schema.name=yourSchemaName
topic.prefix=aPrefix # optional
databases=mydb.test,mydb.test2

Apache Flink dynamic number of Sinks

I am using Apache Flink and the KafkaConsumer to read some values from a Kafka Topic.
I also have a stream obtained from reading a file.
Depending on the received values, I would like to write this stream to different Kafka topics.
Basically, I have a network with a leader linked to many children. For each child, the leader needs to write the stream it reads to a child-specific Kafka topic, so that the child can read it.
When a child is started, it registers itself in the Kafka topic that the leader reads from.
The problem is that I don't know a priori how many children I have.
For example, if I read 1 from the Kafka topic, I want to write the stream to just one Kafka topic named Topic1.
If I read 1-2, I want to write to two Kafka topics (Topic1 and Topic2).
I don't know if this is possible, because in order to write to a topic I am using the Kafka producer along with the addSink method, and to my understanding (and from my attempts) it seems that Flink requires the number of sinks to be known a priori.
But then, is there no way to obtain such behavior?
If I understood your problem well, I think you can solve it with a single sink, since you can choose the Kafka topic based on the record being processed. It also seems that one element from the source might be written to more than one topic, in which case you would need a FlatMapFunction to replicate each source record N times (one for each output topic). I would recommend outputting it as a pair (aka Tuple2) of (topic, record).
DataStream<Tuple2<String, MyValue>> stream = input.flatMap(new FlatMapFunction<MyValue, Tuple2<String, MyValue>>() {
    @Override
    public void flatMap(MyValue value, Collector<Tuple2<String, MyValue>> out) {
        for (String topic : topics) {
            out.collect(Tuple2.of(topic, value));
        }
    }
});
Then you can use the topic previously computed by creating the FlinkKafkaProducer with a KeyedSerializationSchema in which you implement getTargetTopic to return the first element of the pair.
stream.addSink(new FlinkKafkaProducer010<>(
    "default-topic",
    new KeyedSerializationSchema<Tuple2<String, MyValue>>() {
        @Override
        public String getTargetTopic(Tuple2<String, MyValue> element) {
            return element.f0;
        }
        ...
    },
    kafkaProperties)
);
KeyedSerializationSchema is now deprecated; instead you have to use KafkaSerializationSchema.
The same topic routing can be achieved by overriding its serialize method, for example:
@Override
public ProducerRecord<byte[], byte[]> serialize(Tuple2<String, MyValue> element, @Nullable Long timestamp) {
    // element.f0 is the target topic computed upstream, element.f1 the payload to serialize
    return new ProducerRecord<>(element.f0,
            element.f1.toString().getBytes(StandardCharsets.UTF_8));
}
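For completeness, plugging such a schema into the sink could look roughly like this; TopicRoutingSchema is a hypothetical class wrapping the serialize method above, and the constructor shown is the FlinkKafkaProducer variant taking a default topic, a KafkaSerializationSchema, producer properties, and a delivery semantic:
stream.addSink(new FlinkKafkaProducer<>(
        "default-topic",              // fallback topic; each record's topic comes from the schema
        new TopicRoutingSchema(),     // hypothetical KafkaSerializationSchema wrapping serialize() above
        kafkaProperties,
        FlinkKafkaProducer.Semantic.AT_LEAST_ONCE));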